“Ideal” learning of language and categories
Nick Chater, Department of Psychology, University of Warwick
Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam
OVERVIEW
I. Learning from experience: The problem
II. Learning to predict
III. Learning to identify
IV. A methodology for assessing learnability
V. Where next?
I. Learning from experience: The problem
Learning: How few assumptions will work?

Model fitting: assume a model M(x), optimize x. Easy, but needs prior knowledge.
No assumptions: learning is impossible ("no free lunch").
Can a more minimal model of learning still work?
Learning from +/- vs. + data

[Figure: Venn diagram of a guess against the target language/category, with their overlap marked. + data reveals an under-general guess; - data reveals an over-general guess.]
But how about learning from + data only? This is the situation in both categorization and language acquisition.
Learning from +ive data seems to raise in-principle problems. In categorization, it rules out:
Almost all learning experiments in psychology
Exemplar models, prototype models, NNs, SVMs…
Language acquisition: it is assumed that children have access only to positive evidence
Sometimes viewed as ruling out learning models entirely
The “logical” problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)
Must be solvable: A parallel with science
Science only has access to positive data
Yet science seems to be possible
So overgeneral theories must be eliminated, somehow; e.g., “anything goes” seems a bad theory
Theories must capture regularities, not just fit data
Absence as implicit negative evidence?
Thus overgeneral grammars may predict lots of missing sentences
And their absence is a systematic clue that the theory is probably wrong
This idea only seems convincing if it can be proved that convergence works well, statistically... So what do we need to assume?
Modest assumption: Computability constraint
Assume that the data is generated by:
Random factors
Computable factors
i.e., nothing uncomputable
“Monkeys typing into a programming language”
A modest assumption!
[Figure: chance (a coin-flip sequence …HHTTTHTTHTTHT…) plus a computable process (a grammar fragment, S → NP VP) generating the corpus “…The cat sat on the mat. The dog…”]
Learning by simplicity
Find an explanation of the “input” that is as simple as possible
An ‘explanation’ reconstructs the input
Simplicity is measured in code length
Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm
Mimicry theorem with Bayesian analysis, e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.)
Relation to Bayesian inference: widely used in statistics and machine learning
Consider “ideal” learning
Given the data, what is the shortest code?
How well does the shortest code work? For prediction; for identification
Ignore the question of search: this makes general results feasible
But search won’t go away…!
Fundamental question: when is learning data-limited, and when search-limited?
Three kinds of induction
Prediction: converge on correct predictions
Identification: identify generating category/distribution in the limit
Learning causal mechanisms?? Inferring counterfactuals, i.e., the effects of intervention (cf. Pearl: from probability to causes)
II. Learning to predict
Prediction by simplicity
Find the shortest ‘program/explanation’ for the current data
Predict using that program
Strictly, use a ‘weighted sum’ of explanations, weighted by brevity…
Equivalent to Bayes with (roughly) a 2^−K(x) prior, where K(x) is the length of the shortest program generating x
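To make that correspondence concrete, here is the standard identity behind the equivalence (a sketch; L(·) denotes code length in bits, notation assumed here rather than taken from the slides):

```latex
% Picking the shortest two-part code is (log-)equivalent to Bayesian MAP
% with prior P(H) \propto 2^{-L(H)} and likelihood P(D \mid H) = 2^{-L(D \mid H)}:
\arg\min_{H}\big[L(H) + L(D \mid H)\big]
  = \arg\max_{H} 2^{-L(H)}\, 2^{-L(D \mid H)}
  = \arg\max_{H} P(H)\, P(D \mid H)
```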
Summed error has a finite bound (Solomonoff, 1978):

∑_{j=1}^∞ s_j ≤ (log_e 2 / 2) · K

where s_j is the expected squared error on the jth prediction and K is the length of the shortest program for the generating distribution.
So prediction converges [faster than 1/(n log n), for corpus size n]
Inductive inference is possible!
No independence or stationarity assumptions; just computability of the generating mechanism
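The ideal mixture over all programs is incomputable, but the flavor of simplicity-weighted prediction can be shown with a finite stand-in. The hypothesis class, description lengths, and probabilities below are all hypothetical choices for illustration:

```python
# Minimal sketch of simplicity-weighted prediction over a finite,
# hand-enumerated hypothesis class (a toy stand-in for the incomputable
# universal mixture). Each hypothesis is a Bernoulli model with an
# assumed description length; its prior weight is 2**(-length),
# echoing the 2^-K(x) prior.

hypotheses = [
    # (assumed description length in bits, P(next bit == 1))
    (2, 0.5),   # "fair coin": short program
    (6, 0.9),   # biased coin: longer program
    (6, 0.1),
]

def predict_next(data):
    """Posterior-weighted probability that the next bit is 1."""
    weighted = []
    for length_bits, p1 in hypotheses:
        prior = 2.0 ** (-length_bits)
        likelihood = 1.0
        for bit in data:
            likelihood *= p1 if bit == 1 else (1.0 - p1)
        weighted.append((prior * likelihood, p1))
    total = sum(w for w, _ in weighted)
    return sum(w * p1 for w, p1 in weighted) / total

# As data from the biased coin accumulates, the mixture's prediction
# drifts toward 0.9 despite the fair coin's shorter description.
print(predict_next([1, 1, 1, 0, 1, 1, 1, 1]))
```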
Applications
Language
A. Grammaticality judgments
B. Language production
C. Form-meaning mappings
Categorization
Learning from positive examples
A: Grammaticality judgments
We want a grammar that doesn’t over- or under-generalize (much) w.r.t. the ‘true’ grammar, on sentences that are statistically likely to occur
NB. No guarantees for…
Colorless green ideas sleep furiously (Chomsky)
Bulldogs bulldogs bulldogs fight fight fight (Fodor)
Converging on a grammar
Fixing undergeneralization is easy (such grammars get ‘falsified’)
Overgeneralization is the hard problem
Need to use absence as evidence; but the language is infinite and any corpus finite, so almost all grammatical sentences are also absent
Logical problem of language acquisition; Baker’s paradox: the impossibility of ‘mere’ learning from positive evidence
Overgeneralization Theorem
Suppose the learner has probability e_j of erroneously guessing an ungrammatical jth word. Then:

∑_{j=1}^∞ e_j ≤ K · log_e 2

Intuitive explanation: overgeneralization implies assigning smaller-than-needed probabilities to grammatical sentences, and hence excessive code lengths.
B: Language production
Simplicity allows ‘mimicry’ of any computable statistical method of generating a corpus
For an arbitrary computable probability μ and the simplicity-based probability λ (Li & Vitányi, 1997):

λ(y|x) / μ(y|x) → 1
C: Learning form-meaning mappings
So far we have ignored semantics
Suppose the language input consists of form-meaning pairs (cf. Pinker)
Assume only that the form → meaning and meaning → form mappings are computable (they don’t have to be deterministic)…

A theorem. It follows that:
Total errors in mapping forms to (sets of) meanings (with probs), and
Total errors in mapping meanings to (sets of) forms (with probs)
…have a finite bound (and hence average errors per sentence tend to 0)
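Stated compactly (a sketch; the error symbols below are notation assumed here, not taken from the slides):

```latex
% e^{f\to m}_j, e^{m\to f}_j: error probabilities on the jth form-meaning pair.
% Finite total error in both directions implies vanishing average error:
\sum_{j=1}^{\infty}\big(e^{f\to m}_{j} + e^{m\to f}_{j}\big) < \infty
\;\;\Longrightarrow\;\;
\frac{1}{n}\sum_{j=1}^{n}\big(e^{f\to m}_{j} + e^{m\to f}_{j}\big)\to 0
```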
Categorization
Sample n items from category C (assume all items are equally likely)
Guess, by choosing the D that provides the shortest code for the data

General proof method:
1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn’t prefer it)
2. Undergeneralization: typical data from category C will have no code shorter than n log|C|
1. Fighting overgeneralization
D can’t be much bigger than C, or it’ll have a longer code length:

K(D) + n log|D| ≤ K(C) + n log|C|

As n → ∞, the constraint is that |D|/|C| ≤ 1 + O(1/n)
2. Fighting undergeneralization
But the guess must cover most of the correct category, or it’d provide a “suspiciously” short code for the data

Typicality: K(D|C) + n log|C∩D| ≥ n log|C|

As n → ∞, the constraint is that |C∩D|/|C| ≥ 1 − O(1/n)
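A toy numerical version of the overgeneralization constraint can make the trade-off vivid. The K(·) values and category sizes below are hypothetical stand-ins, not true Kolmogorov complexities:

```python
import math

def two_part_code(K_hyp, size_hyp, n):
    """K(H) + n*log2|H|: bits to name H, then code n uniform samples from it."""
    return K_hyp + n * math.log2(size_hyp)

n = 100
K_C, size_C = 50, 1000    # the "true" category C
K_D, size_D = 30, 4000    # overgeneral D: simpler to state, but bigger

print(two_part_code(K_C, size_C, n))  # ~1046.6 bits
print(two_part_code(K_D, size_D, n))  # ~1226.6 bits: the overgeneral D loses,
                                      # since 2 extra bits/item swamp its
                                      # 20-bit saving in K as n grows

# Undergeneral D fails differently: if |C ∩ D| is much smaller than |C|,
# typical samples from C fall outside D, which D cannot encode at all.
```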
Implication
|D| converges to near |C|
Accuracy bounded by O(1/n), with n samples
i.i.d. assumptions
The actual rate depends crucially on the structure of the category
Language: need lots of examples (but how many?)
Some categories may only need a few (one?) examples (Tenenbaum, Feldman)
III. Learning to identify
Hypothesis identification
Induction of the ‘true’ hypothesis, or category, or language
In philosophy of science, typically viewed as a hard problem…
Needs stronger assumptions than prediction
Identification in the limit: The problem
Assume endless data
Goal: specify an algorithm that, at each point, picks a hypothesis
And eventually locks in on the correct hypothesis, though it can never announce it, as there may always be an additional low-frequency item that’s yet to be encountered
Gold, Osherson et al. have studied this extensively
Sometimes viewed as showing identification is not possible (but really a mix of positive and negative results)
But i.i.d. and computability allow a general positive result
Algorithm
Algorithms have two parts: a program which specifies the set Pr, and sampling from Pr, using average code length H(Pr) per data point
Pick a specific set of data (which needs to be ‘long enough’); we won’t necessarily know what is long enough, which is an extra assumption
Specify an enumeration of programs for Pr, e.g., in order of length
Run them, dovetailing
Initialize with any Pr
Flip to the Pr that corresponds to the shortest program so far that has generated the data
Dovetailing

prog1: 1  2  4  7
prog2:    3  5  8
prog3:       6  9
prog4:          10

Run these in order, dovetailing, where each program gets a 2^(−length) share of the steps
This process runs for ever (some programs loop)
The shortest program so far is “pocketed”…
This will always finish on the “true” program
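A minimal runnable sketch of the dovetailing loop, under simplifying assumptions: the three “programs” are toy stand-ins that either loop, halt with the wrong output, or generate the observed data, and every program gets one step per round rather than a 2^(−length) share:

```python
# Toy dovetailed enumeration. Each candidate "program" reports, after t
# steps, either its output string or None (still running / looping forever).
# The lengths are assumed description lengths in bits, not real program sizes.

def prog_loop(t):   return None                        # loops forever
def prog_wrong(t):  return "HTHT" if t >= 2 else None  # halts, wrong output
def prog_true(t):   return "HHTT" if t >= 5 else None  # generates the data

programs = [(3, prog_loop), (5, prog_wrong), (7, prog_true)]
observed = "HHTT"

def dovetail(programs, observed, rounds=10):
    """Advance every program one step per round; pocket the shortest
    program whose output matches the observed data."""
    pocketed = None
    for t in range(1, rounds + 1):
        for length, prog in programs:
            if prog(t) == observed and (pocketed is None or length < pocketed[0]):
                pocketed = (length, prog.__name__)
    return pocketed

print(dovetail(programs, observed))   # -> (7, 'prog_true')
```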
Overwhelmingly likely to work... (as n → ∞, Prob(correct identification) → 1)
For a large enough stream of n typical data points, no alternative model does better
The expected code length of coding data generated by Pr, by Pr’ rather than Pr, wastes n·D(Pr’||Pr) bits
D(Pr’||Pr) > 0; so this swamps the initial code length, for large enough n
[Figure: initial code lengths K(Pr) and K(Pr’); by n = 8, Pr wins.]
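A toy break-even calculation, with hypothetical numbers: the rival Pr’ starts ahead on initial code length but wastes a fixed number of bits per data point, so Pr overtakes it at a predictable n:

```python
import math

K_Pr, K_Pr_rival = 20, 12   # assumed initial code lengths in bits
p, q = 0.5, 0.7             # Pr and the rival Pr' as Bernoulli parameters

# Per-data-point waste (bits) from coding Pr's output with the rival's code:
D = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

break_even = (K_Pr - K_Pr_rival) / D
print(D, break_even)   # Pr's total code is shorter once n exceeds ~64 points
```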
IV. A methodology for assessing learnability
Assessing learnability in cognition?
A constraint c is learnable if a code which:
1. “invests” l(c) bits to encode c can…
2. recoup its investment: save more than l(c) bits in encoding the data
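The same criterion as an inequality (a compact restatement; L(D) for the code length of the data without the constraint and L(D|c) with it are notation assumed here):

```latex
% c is learnable from data D when the coding saving repays the investment:
\underbrace{L(D) - L(D \mid c)}_{\text{bits saved by using } c}
\;>\;
\underbrace{l(c)}_{\text{bits to state } c}
```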
Nativism? c is acquired, but there is not enough data to recoup the investment (e.g., little/no relevant data)
Viability of empiricism? An ample supply of data to recoup l(c)
Cf. Tenenbaum, Feldman…
Language acquisition: Poverty of the stimulus, quantified
Consider a linguistic constraint (e.g., noun-verb agreement; subjacency; phonological constraints)
Cost: assessed by the length of its formulation (the length of the linguistic rules)
Saving: the reduction in the cost of coding the data (perceptual, linguistic)
Easy example: learning singular-plural

With the agreement constraint, the code covers only:
John loves tennis (x bits)
They love_ tennis (y bits)

Without it, each sentence needs an extra bit to pick out the verb form:
John loves tennis / *John love_ tennis (x+1 bits)
They love_ tennis / *They loves tennis (y+1 bits)

If the constraint applies to a proportion p of n sentences, the constraint saves pn bits.
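A worked toy version of that arithmetic, with hypothetical numbers for the rule length and the proportion of relevant sentences:

```python
# Toy investment/recoup calculation for the agreement constraint.
# Without the constraint, every agreement-relevant sentence costs 1 extra
# bit to specify the verb form; with it, that bit is predictable.
l_c = 200                  # assumed bits to state the agreement rule
p = 0.8                    # assumed proportion of sentences showing agreement
saving_per_sentence = 1.0  # bits recouped per relevant sentence

def net_saving(n):
    """Bits saved over n sentences, minus the investment l(c)."""
    return p * n * saving_per_sentence - l_c

print(net_saving(100))    # -120.0: too little data; constraint not learnable
print(net_saving(1000))   #  600.0: investment recouped; constraint learnable
```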
Visual structure: ample data?
Depth from stereo:
Invest: an algorithm for correspondence
Recoup: almost a whole image (that’s a lot!)
Perhaps stereo could be inferred from a single stereo image?
Object/texture models (Yuille):
Investment in building the model
But recouped in compression, over the “raw” image description
Presumably few images needed?
A harder linguistic case: Baker’s paradox (with Luca Onnis and Matthew Roberts)

Quasi-regular structures are ubiquitous in language, e.g., alternations:

It is likely that John will come / It is possible that John will come
John is likely to come / *John is possible to come
(Baker, 1979; see also Culicover)

Strong winds / High winds
Strong currents / *High currents

I love going to Italy! / I enjoy going to Italy!
I love to go to Italy! / *I enjoy to go to Italy!
Baker’s paradox (Baker, 1979)
Selectional restrictions: “holes” in the space of possible sentences allowed by a given grammar…
How does the learner avoid falling into the holes??
i.e., how does the learner distinguish genuine ‘holes’ from the infinite number of unheard grammatical constructions?
Our abstract theory tells us something
The theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote, and with no computational restrictions
But can this be scaled down? Can specific ‘alternation’ patterns be learned from the corpus the child actually hears?
Argument by information investment
To encode an exception, which appears to have probability x, requires log2(1/x) bits
But eliminating this probability mass x makes all other sentences 1/(1−x) times more likely, saving n·log2(1/(1−x)) bits
Does the saving outweigh the investment?
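A quick break-even computation (the exception probability x below is a hypothetical value):

```python
import math

x = 0.001                                      # assumed exception probability
invest = math.log2(1.0 / x)                    # ~10 bits to state the exception
save_per_sentence = math.log2(1.0 / (1.0 - x)) # ~0.0014 bits per other sentence

break_even = invest / save_per_sentence
print(invest, save_per_sentence, break_even)   # recouped after ~6900 sentences
```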
An example: Recovery from overgeneralisations

The rabbit hid / You hid the rabbit!
The rabbit disappeared / *You disappeared the rabbit!

The return on ‘investment’ over 5M words from the CHILDES database is easily sufficient
But this methodology can be applied much more widely (and aimed at fitting the time-course of U-shaped generalization, and at when overgeneralizations do or do not arise)
V. Where next?
Can we learn causal structure from observation?
What happens if we move the left-hand stick?
Liftability; breakability; edibility; what is attached to what; what is resting on what
Without this, perception is fairly useless as an input for action
The output of perception provides a description in terms of causality
Inferring causality from observation: The hard problem of induction
Formal question: suppose a modular computer program generates a stream of data of indefinite length…
Under what conditions can the modularity be recovered?
How might “interventions”/experiments help?
(Key technical idea: Kolmogorov sufficient statistic)
[Figure: a generative process producing sensory input.]
Fairly uncharted territory
If the data is generated by independent processes, then one model of the data will involve a recapitulation of those processes
But will there be other, alternative modular programs? Might they be shorter? Hopefully not!
A completely open field…