Priors and predictions in everyday cognition
Tom Griffiths
Cognitive and Linguistic Sciences
[Diagram: data → (mind/brain) → behavior]
What computational problem is the brain solving?
Does human behavior correspond to an optimal solution to that problem?
Inductive problems
• Inferring structure from data
• Perception
  – e.g. the structure of the 3D world from 2D visual data
• Cognition
  – e.g. relationships between variables from samples
[Diagram: data → hypotheses, e.g. a 2D line drawing consistent with either a cube or a shaded hexagon]
Reverend Thomas Bayes
Bayes’ theorem

p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′)

posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses
h: hypothesis
d: data
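To ground the notation, here is a toy computation (added here, not from the slides) of a posterior over the cube/shaded-hexagon hypotheses; all of the numbers are made up for illustration:

```python
# A toy illustration (not from the talk) of Bayes' theorem over a
# discrete hypothesis space. All probabilities below are invented.
def posterior(prior, likelihood, d):
    """p(h|d) = p(d|h)p(h) / sum over h' of p(d|h')p(h')."""
    joint = {h: likelihood[h][d] * prior[h] for h in prior}
    total = sum(joint.values())  # the sum over the hypothesis space
    return {h: p / total for h, p in joint.items()}

# Two hypotheses about an image, with assumed priors and likelihoods:
prior = {"cube": 0.7, "shaded hexagon": 0.3}
likelihood = {
    "cube":           {"2D line drawing": 0.5},  # p(d|h), assumed
    "shaded hexagon": {"2D line drawing": 0.2},
}
print(posterior(prior, likelihood, "2D line drawing"))
# {'cube': 0.854..., 'shaded hexagon': 0.146...}
```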
Bayes’ theorem

p(h|d) ∝ p(d|h) p(h)

h: hypothesis
d: data
Perception is optimal
Körding & Wolpert (2004)
Cognition is not
Do people use priors?
Standard answer: no (Tversky & Kahneman, 1974)

p(h|d) ∝ p(d|h) p(h)
This talk: yes
What are people’s priors?
Explaining inductive leaps
• How do people
  – infer causal relationships
  – identify the work of chance
  – predict the future
  – assess similarity and make generalizations
  – learn functions, languages, and concepts
  . . . from such limited data?
• What knowledge guides human inferences?
Prior knowledge matters when…
• …using a single datapoint
  – predicting the future
• …using secondhand data
  – effects of priors on cultural transmission
Outline
• …using a single datapoint
  – predicting the future
  – joint work with Josh Tenenbaum (MIT)
• …using secondhand data
  – effects of priors on cultural transmission
  – joint work with Mike Kalish (Louisiana)
• Conclusions
Predicting the future
How often is Google News updated?
t = time since the last update
t_total = time between updates
What should we guess for t_total given t?
Making predictions
• You encounter a phenomenon that has existed for t units of time. How long will it continue into the future? (i.e. what’s t_total?)
• We could replace “time” with any other variable that ranges from 0 to some unknown upper limit.
Everyday prediction problems
• You read about a movie that has made $60 million to date. How much money will it make in total?
• You see that something has been baking in the oven for 34 minutes. How long until it’s ready?
• You meet someone who is 78 years old. How long will they live?
• Your friend quotes to you from line 17 of his favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of the train station. How many cabs are in this city?
Bayesian inference

p(t_total|t) ∝ p(t|t_total) p(t_total)
(posterior probability ∝ likelihood × prior)

Assume t is a random sample from the interval (0 < t < t_total), so the likelihood is p(t|t_total) = 1/t_total.
Take the “uninformative” prior p(t_total) ∝ 1/t_total. Then:

p(t_total|t) ∝ 1/t_total × 1/t_total

What is the best guess for t_total? The posterior p(t_total|t) is maximized at t_total = t, which would predict that the phenomenon ends right now. Instead, compute t* such that p(t_total > t*|t) = 0.5.

This yields Gott’s Rule: p(t_total > t*|t) = 0.5 when t* = 2t, i.e. the best guess for t_total is 2t.
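The 0.5 step can be made explicit with a short derivation (added here; the slides state only the result):

```latex
% Derivation of Gott's Rule (added for clarity; not on the original slides).
% Likelihood: p(t | t_total) = 1/t_total for 0 < t < t_total.
% Prior: p(t_total) \propto 1/t_total.
\begin{align}
p(t_{\mathrm{total}} \mid t)
  &\propto \frac{1}{t_{\mathrm{total}}^{2}}, \qquad t_{\mathrm{total}} > t \\
p(t_{\mathrm{total}} \mid t)
  &= \frac{t}{t_{\mathrm{total}}^{2}}
  \quad\text{(normalized, since } \int_{t}^{\infty} t\,s^{-2}\, ds = 1\text{)} \\
P(t_{\mathrm{total}} > t^{*} \mid t)
  &= \int_{t^{*}}^{\infty} \frac{t}{s^{2}}\, ds = \frac{t}{t^{*}} \\
\frac{t}{t^{*}} = \frac{1}{2}
  &\;\Longrightarrow\; t^{*} = 2t
\end{align}
```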
Applying Gott’s rule
• t ≈ 4,000 years → t* ≈ 8,000 years
• t ≈ 130,000 years → t* ≈ 260,000 years
Predicting everyday events
• You meet someone who is 35 years old. How long will they live?
  – “70 years” seems reasonable
• Not so simple:
  – You meet someone who is 78 years old. How long will they live?
  – You meet someone who is 6 years old. How long will they live?
The effects of priors
• Different kinds of priors p(t_total) are appropriate in different domains:
  – Uninformative: p(t_total) ∝ 1/t_total
  – Power-law: e.g. wealth
  – Gaussian: e.g. height
Evaluating human predictions
• Different domains with different priors:
  – a movie has made $60 million [power-law]
  – your friend quotes from line 17 of a poem [power-law]
  – you meet a 78 year old man [Gaussian]
  – a movie has been running for 55 minutes [Gaussian]
  – a U.S. congressman has served for 11 years [Erlang]
• Prior distributions derived from actual data
• Use 5 values of t for each
• People predict t_total
[Figure: people’s median predictions compared against the parametric prior, the empirical prior, and Gott’s rule]
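A rough numerical sketch (mine, not the study’s code) of the model being evaluated: given one observed value t, predict t_total as the median of the posterior p(t_total|t) under each prior family. The specific parameter values below are illustrative assumptions, not the empirical priors used in the study:

```python
# A minimal sketch (not from the talk): posterior-median predictions of
# t_total from a single observation t, under different prior families.
import numpy as np
from scipy import stats

def posterior_median(t, prior_pdf, grid):
    """Median of p(t_total|t) ∝ (1/t_total) * prior(t_total) for t_total >= t."""
    post = np.where(grid >= t, prior_pdf(grid) / grid, 0.0)  # likelihood 1/t_total
    post /= np.trapz(post, grid)                             # normalize
    cdf = np.cumsum(post) * (grid[1] - grid[0])
    return grid[np.searchsorted(cdf, 0.5)]

grid = np.linspace(1, 1000, 100000)
priors = {  # parameter values are assumptions for illustration
    "power-law (e.g. movie grosses)": lambda x: x ** -1.5,
    "Gaussian (e.g. lifespans)":      lambda x: stats.norm.pdf(x, 75, 15),
    "Erlang (e.g. terms in office)":  lambda x: stats.gamma.pdf(x, a=2, scale=10),
}
for name, pdf in priors.items():
    print(name, [round(posterior_median(t, pdf, grid), 1) for t in (10, 50, 100)])
```

Under the Gaussian prior a 78-year-old is predicted to live only slightly longer, while under the power-law prior predictions scale multiplicatively with t, which is the qualitative pattern the experiments test.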
Nonparametric priors
You arrive at a friend’s house, and see that a cake has been in the oven for 34 minutes. How long will it be in the oven?
You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?

No direct experience
How long did the typical pharaoh reign in ancient Egypt?
…using a single datapoint
• People produce accurate predictions for the duration and extent of everyday events
• Strong prior knowledge
  – form of the prior (power-law or exponential)
  – distribution given that form (parameters)
  – non-parametric distribution when necessary
• Reveals a surprising correspondence between probabilities in the mind and in the world
Outline
• …using a single datapoint
  – predicting the future
  – joint work with Josh Tenenbaum (MIT)
• …using secondhand data
  – effects of priors on cultural transmission
  – joint work with Mike Kalish (Louisiana)
• Conclusions
Cultural transmission
• Most knowledge is based on secondhand data
• Some things can only be learned from others
  – cultural objects are transmitted across generations
• Cultural transmission provides an opportunity for priors to influence cultural objects
Iterated learning (Briscoe, 1998; Kirby, 2001)
• Each learner sees data, forms a hypothesis, and produces the data given to the next learner
• cf. the playground game “telephone”
Objects of iterated learning
• Languages
• Religious concepts
• Social norms
• Myths and legends
• Causal theories
Explaining linguistic universals
• Human languages are a subset of all logically possible communication schemes
  – universal properties are common to all languages (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
  – why do linguistic universals exist?
  – why are particular properties universal?
Explaining linguistic universals
• Traditional answer:
  – linguistic universals reflect innate constraints specific to a system for acquiring language
• Alternative answer:
  – iterated learning imposes an “information bottleneck”
  – universal properties survive this bottleneck (Briscoe, 1998; Kirby, 2001)
Analyzing iterated learning
What are the consequences of iterated learning?
• Simple algorithms, analytic results: Komarova, Niyogi, & Nowak (2002)
• Complex algorithms, simulations: Kirby (2001); Brighton (2002); Smith, Kirby, & Brighton (2003)
• Complex algorithms, analytic results: ?
Iterated Bayesian learning
[Diagram: each learner applies p(h|d) to infer a hypothesis from data, then p(d|h) to generate the data passed to the next learner]
Learners are rational Bayesian agents (covers a wide range of learning algorithms)
Markov chains
• Variable x(t+1) is independent of history given x(t)
• Converges to a stationary distribution under easily checked conditions
[Diagram: x → x → x → … with transition matrix P(x(t+1)|x(t))]
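As a minimal illustration (not from the talk) of convergence to a stationary distribution, a three-state chain with a made-up transition matrix:

```python
# A minimal sketch (not from the talk): iterate a 3-state Markov chain
# and watch the state distribution converge to the stationary distribution.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],   # transition matrix P(x(t+1)|x(t));
              [0.2, 0.6, 0.2],   # each row sums to 1
              [0.0, 0.3, 0.7]])

dist = np.array([1.0, 0.0, 0.0])  # start with all mass on state 0
for _ in range(100):
    dist = dist @ P               # one step of the chain

# Stationary distribution = left eigenvector of P with eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
stationary = np.real(vecs[:, np.argmax(np.real(vals))])
stationary /= stationary.sum()

print(dist)        # ≈ stationary, regardless of the starting state
print(stationary)
```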
Markov chain Monte Carlo
• A strategy for sampling from complex probability distributions
• Key idea: construct a Markov chain that converges to the target distribution
  – e.g. the Metropolis algorithm
  – e.g. Gibbs sampling
Gibbs sampling
For variables x = x_1, x_2, …, x_n:
Draw x_i(t+1) from P(x_i | x_-i), where
x_-i = x_1(t+1), x_2(t+1), …, x_{i-1}(t+1), x_{i+1}(t), …, x_n(t)
(a.k.a. the heat bath algorithm in statistical physics)
(Geman & Geman, 1984)
Gibbs sampling
[Figure: Gibbs sampling visualized as alternating coordinate-wise moves (MacKay, 2002)]
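A minimal sketch (not from the talk) of Gibbs sampling for a bivariate Gaussian, a case where every conditional P(x_i | x_-i) is available in closed form; the correlation value is an arbitrary choice:

```python
# A minimal sketch (not from the talk): Gibbs sampling from a bivariate
# Gaussian with unit variances and correlation rho, where each
# conditional P(x_i | x_-i) is itself Gaussian.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # assumed correlation of the target distribution
x1, x2 = 0.0, 0.0  # arbitrary starting state
samples = []

for t in range(10000):
    # Draw each variable from its conditional given the current others:
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))  # P(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))  # P(x2 | x1)
    samples.append((x1, x2))

samples = np.array(samples[1000:])     # discard burn-in
print(np.corrcoef(samples.T)[0, 1])    # ≈ rho
```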
Iterated Bayesian learning
• Defines a Markov chain on (h, d)
• This Markov chain is a Gibbs sampler for p(d, h) = p(d|h) p(h)
• Rate of convergence is geometric
  – the Gibbs sampler converges geometrically (Liu, Wong, & Kong, 1995)
Analytic results
• Iterated Bayesian learning converges to p(d, h) = p(d|h) p(h)
• Corollaries:
  – the distribution over hypotheses converges to p(h)
  – the distribution over data converges to p(d)
  – the proportion of a population of iterated learners with hypothesis h converges to p(h)
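To spell out the Gibbs-sampler correspondence (a brief addition; the slides state it without derivation), one generation of iterated learning is exactly one sweep of a two-block Gibbs sampler on p(d, h):

```latex
% One generation of iterated learning = one sweep of two-block Gibbs on p(d,h):
%   the learner samples a hypothesis from the posterior given the data,
%   then samples data from the likelihood given that hypothesis.
h^{(t)} \sim p(h \mid d^{(t-1)}) \qquad
d^{(t)} \sim p(d \mid h^{(t)})
% Both steps sample from exact conditionals of p(d,h) = p(d \mid h)\,p(h),
% so p(d,h) is the stationary distribution of the chain.
```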
Implications for linguistic universals
• Two questions:
  – why do linguistic universals exist?
  – why are particular properties universal?
• Different answers:
  – existence is explained through iterated learning
  – universal properties depend on the prior
• Focuses inquiry on the priors of the learners
  – cultural objects reflect the human mind
A method for discovering priors
Iterated learning converges to the prior…
…so we can evaluate priors by running iterated learning with human learners
Iterated function learning
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
[Diagram: data ((x, y) pairs) → hypotheses (functions)]
Function learning in the lab
[Figure: experimental display with a stimulus magnitude, a response slider, and feedback]
Examine iterated learning with different initial data
[Figure: functions learned at iterations 1–9 for different initial data]
…using secondhand data
• Iterated Bayesian learning converges to the prior
• Constrains explanations of linguistic universals
• Open questions in Bayesian language evolution
  – variation in priors
  – other selective pressures
• Provides a method for evaluating priors
  – concepts, causal relationships, languages, …
Outline
• …using a single datapoint
  – predicting the future
• …using secondhand data
  – effects of priors on cultural transmission
• Conclusions
Bayes’ theorem
p(h|d) ∝ p(d|h) p(h)
A unifying principle for explaining inductive inferences
Bayes’ theorem
behavior = f(data, knowledge)
[Diagram: data and knowledge combine, via Bayes’ theorem, to produce behavior]
Explaining inductive leaps
• How do people
  – infer causal relationships
  – identify the work of chance
  – predict the future
  – assess similarity and make generalizations
  – learn functions, languages, and concepts
  . . . from such limited data?
• What knowledge guides human inferences?
HHTHT    HHHHT
What’s the computational problem?
Not computing p(HHTHT|random), but computing p(random|HHTHT):
an inference about the structure of the world
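To make the contrast concrete (a toy example of mine, not from the slides): both sequences are equally probable under a fair coin, yet the posteriors differ once an alternative hypothesis enters; the biased-coin parameter and the uniform prior are illustrative assumptions:

```python
# A minimal sketch (not from the talk): same likelihood, different posteriors.
# Hypotheses: "random" (fair coin) vs. an assumed "biased" coin with P(H)=0.9.
# Prior: uniform over the two hypotheses (an illustrative assumption).
def likelihood(seq, p_heads):
    prob = 1.0
    for flip in seq:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob

prior = {"random": 0.5, "biased": 0.5}
p_heads = {"random": 0.5, "biased": 0.9}

for seq in ("HHTHT", "HHHHT"):
    joint = {h: likelihood(seq, p_heads[h]) * prior[h] for h in prior}
    total = sum(joint.values())
    print(seq, {h: round(v / total, 3) for h, v in joint.items()})

# p(HHTHT|random) = p(HHHHT|random) = (1/2)^5, yet
# p(random|HHTHT) ≈ 0.81 while p(random|HHHHT) ≈ 0.32.
```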
An example: Gaussians
• If we assume…
  – data, d, is a single real number, x
  – hypotheses, h, are means of a Gaussian, μ
  – prior, p(μ), is Gaussian(μ_0, σ_0²)
• …then p(x_{n+1}|x_n) is Gaussian(μ_n, σ_x² + σ_n²), where

μ_n = (x_n/σ_x² + μ_0/σ_0²) / (1/σ_x² + 1/σ_0²)
σ_n² = 1 / (1/σ_x² + 1/σ_0²)

With μ_0 = 0, σ_0² = 1, and initial data x_0 = 20, iterated learning results in rapid convergence to the prior.
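A minimal simulation (not from the slides) of this chain; the likelihood variance σ_x² = 1 is an assumption, since the slide leaves it unspecified:

```python
# A minimal sketch (not from the talk): iterated learning with Gaussian
# hypotheses. Each learner sees one x, infers the mean mu, then generates
# an x for the next learner. sigma_x_sq = 1 is an assumed likelihood variance.
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0_sq = 0.0, 1.0   # prior Gaussian(mu0, sigma0^2), as on the slide
sigma_x_sq = 1.0            # assumed; not specified on the slide
x = 20.0                    # initial data x0 = 20, as on the slide

for n in range(10):
    # Posterior over mu given x (conjugate Gaussian update from the slide):
    mu_n = (x / sigma_x_sq + mu0 / sigma0_sq) / (1 / sigma_x_sq + 1 / sigma0_sq)
    sigma_n_sq = 1 / (1 / sigma_x_sq + 1 / sigma0_sq)
    mu = rng.normal(mu_n, np.sqrt(sigma_n_sq))  # learner samples a hypothesis
    x = rng.normal(mu, np.sqrt(sigma_x_sq))     # generates data for the next learner
    print(n, round(x, 2))  # x falls from 20 toward draws from the prior predictive
```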
An example: Linear regression
• Assume
  – data, d, are pairs of real numbers (x, y)
  – hypotheses, h, are functions
• An example: linear regression
  – hypotheses have slope θ and pass through the origin
  – p(θ) is Gaussian(θ_0, σ_0²)
[Figure: sample functions; the value of y at x = 1 identifies the slope]
θ_0 = 1, σ_0² = 0.1, y_0 = −1
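And a matching sketch (again mine, not from the slides) for the regression case; the observation noise variance and the single design point x = 1 are assumptions for illustration:

```python
# A minimal sketch (not from the talk): iterated learning of a slope theta
# for y = theta * x, observed at x = 1. sigma_y_sq is an assumed noise level.
import numpy as np

rng = np.random.default_rng(0)
theta0, sigma0_sq = 1.0, 0.1   # prior Gaussian(theta0, sigma0^2), as on the slide
sigma_y_sq = 0.1               # assumed observation noise
x_obs = 1.0                    # each learner sees y at x = 1 (assumed design)
y = -1.0                       # initial data y0 = -1, as on the slide

for n in range(10):
    # Conjugate Gaussian posterior over theta given (x_obs, y):
    precision = x_obs**2 / sigma_y_sq + 1 / sigma0_sq
    mean = (x_obs * y / sigma_y_sq + theta0 / sigma0_sq) / precision
    theta = rng.normal(mean, np.sqrt(1 / precision))    # sample a hypothesis
    y = rng.normal(theta * x_obs, np.sqrt(sigma_y_sq))  # data for the next learner
    print(n, round(y, 2))  # y moves from -1 toward values consistent with the prior
```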