Priors and predictions in everyday cognition
Tom Griffiths
Cognitive and Linguistic Sciences
[Diagram: data → (mind/brain) → behavior]
What computational problem is the brain solving?
Does human behavior correspond to an optimal solution to that problem?
Inductive problems
• Inferring structure from data
• Perception
  – e.g. the structure of the 3D world from 2D visual data
• Cognition
  – e.g. relationships between variables from samples
[Diagram: data → hypotheses, e.g. a 2D line drawing consistent with either a cube or a shaded hexagon]
Reverend Thomas Bayes
Bayes’ theorem

p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′)

posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses
h: hypothesis
d: data
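To ground the notation, here is a toy computation (added here, not from the slides) of a posterior over the cube/shaded-hexagon hypotheses; all of the numbers are made up for illustration:

```python
# A toy illustration (not from the talk) of Bayes' theorem over a
# discrete hypothesis space. All probabilities below are invented.
def posterior(prior, likelihood, d):
    """p(h|d) = p(d|h)p(h) / sum over h' of p(d|h')p(h')."""
    joint = {h: likelihood[h][d] * prior[h] for h in prior}
    total = sum(joint.values())  # the sum over the hypothesis space
    return {h: p / total for h, p in joint.items()}

# Two hypotheses about an image, with assumed priors and likelihoods:
prior = {"cube": 0.7, "shaded hexagon": 0.3}
likelihood = {
    "cube":           {"2D line drawing": 0.5},  # p(d|h), assumed
    "shaded hexagon": {"2D line drawing": 0.2},
}
print(posterior(prior, likelihood, "2D line drawing"))
# {'cube': 0.854..., 'shaded hexagon': 0.146...}
```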
Bayes’ theorem

p(h|d) ∝ p(d|h) p(h)

h: hypothesis
d: data
Perception is optimal
Körding & Wolpert (2004)
Cognition is not
Do people use priors?
Standard answer: no (Tversky & Kahneman, 1974)

p(h|d) ∝ p(d|h) p(h)
This talk: yes
What are people’s priors?
Explaining inductive leaps
• How do people
  – infer causal relationships
  – identify the work of chance
  – predict the future
  – assess similarity and make generalizations
  – learn functions, languages, and concepts
  . . . from such limited data?
• What knowledge guides human inferences?
Prior knowledge matters when…
• …using a single datapoint
  – predicting the future
• …using secondhand data
  – effects of priors on cultural transmission
Outline
• …using a single datapoint
  – predicting the future
  – joint work with Josh Tenenbaum (MIT)
• …using secondhand data
  – effects of priors on cultural transmission
  – joint work with Mike Kalish (Louisiana)
• Conclusions
Predicting the future
How often is Google News updated?
t = time since the last update
t_total = time between updates
What should we guess for t_total given t?
Making predictions
• You encounter a phenomenon that has existed for t units of time. How long will it continue into the future? (i.e. what’s t_total?)
• We could replace “time” with any other variable that ranges from 0 to some unknown upper limit.
Everyday prediction problems
• You read about a movie that has made $60 million to date. How much money will it make in total?
• You see that something has been baking in the oven for 34 minutes. How long until it’s ready?
• You meet someone who is 78 years old. How long will they live?
• Your friend quotes to you from line 17 of his favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of the train station. How many cabs are in this city?
Bayesian inference

p(t_total|t) ∝ p(t|t_total) p(t_total)
(posterior probability ∝ likelihood × prior)

Assume t is a random sample from the interval (0 < t < t_total), so the likelihood is p(t|t_total) = 1/t_total.
Take the “uninformative” prior p(t_total) ∝ 1/t_total. Then:

p(t_total|t) ∝ 1/t_total × 1/t_total

What is the best guess for t_total? The posterior p(t_total|t) is maximized at t_total = t, which would predict that the phenomenon ends right now. Instead, compute t* such that p(t_total > t*|t) = 0.5.

This yields Gott’s Rule: p(t_total > t*|t) = 0.5 when t* = 2t, i.e. the best guess for t_total is 2t.
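The 0.5 step can be made explicit with a short derivation (added here; the slides state only the result):

```latex
% Derivation of Gott's Rule (added for clarity; not on the original slides).
% Likelihood: p(t | t_total) = 1/t_total for 0 < t < t_total.
% Prior: p(t_total) \propto 1/t_total.
\begin{align}
p(t_{\mathrm{total}} \mid t)
  &\propto \frac{1}{t_{\mathrm{total}}^{2}}, \qquad t_{\mathrm{total}} > t \\
p(t_{\mathrm{total}} \mid t)
  &= \frac{t}{t_{\mathrm{total}}^{2}}
  \quad\text{(normalized, since } \int_{t}^{\infty} t\,s^{-2}\, ds = 1\text{)} \\
P(t_{\mathrm{total}} > t^{*} \mid t)
  &= \int_{t^{*}}^{\infty} \frac{t}{s^{2}}\, ds = \frac{t}{t^{*}} \\
\frac{t}{t^{*}} = \frac{1}{2}
  &\;\Longrightarrow\; t^{*} = 2t
\end{align}
```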
Applying Gott’s rule
• t ≈ 4,000 years → t* ≈ 8,000 years
• t ≈ 130,000 years → t* ≈ 260,000 years
Predicting everyday events
• You meet someone who is 35 years old. How long will they live?
  – “70 years” seems reasonable
• Not so simple:
  – You meet someone who is 78 years old. How long will they live?
  – You meet someone who is 6 years old. How long will they live?
The effects of priors
• Different kinds of priors p(t_total) are appropriate in different domains:
  – Uninformative: p(t_total) ∝ 1/t_total
  – Power-law: e.g. wealth
  – Gaussian: e.g. height
Evaluating human predictions
• Different domains with different priors:
  – a movie has made $60 million [power-law]
  – your friend quotes from line 17 of a poem [power-law]
  – you meet a 78 year old man [Gaussian]
  – a movie has been running for 55 minutes [Gaussian]
  – a U.S. congressman has served for 11 years [Erlang]
• Prior distributions derived from actual data
• Use 5 values of t for each
• People predict t_total
[Figure: people’s median predictions compared against the parametric prior, the empirical prior, and Gott’s rule]
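A rough numerical sketch (mine, not the study’s code) of the model being evaluated: given one observed value t, predict t_total as the median of the posterior p(t_total|t) under each prior family. The specific parameter values below are illustrative assumptions, not the empirical priors used in the study:

```python
# A minimal sketch (not from the talk): posterior-median predictions of
# t_total from a single observation t, under different prior families.
import numpy as np
from scipy import stats

def posterior_median(t, prior_pdf, grid):
    """Median of p(t_total|t) ∝ (1/t_total) * prior(t_total) for t_total >= t."""
    post = np.where(grid >= t, prior_pdf(grid) / grid, 0.0)  # likelihood 1/t_total
    post /= np.trapz(post, grid)                             # normalize
    cdf = np.cumsum(post) * (grid[1] - grid[0])
    return grid[np.searchsorted(cdf, 0.5)]

grid = np.linspace(1, 1000, 100000)
priors = {  # parameter values are assumptions for illustration
    "power-law (e.g. movie grosses)": lambda x: x ** -1.5,
    "Gaussian (e.g. lifespans)":      lambda x: stats.norm.pdf(x, 75, 15),
    "Erlang (e.g. terms in office)":  lambda x: stats.gamma.pdf(x, a=2, scale=10),
}
for name, pdf in priors.items():
    print(name, [round(posterior_median(t, pdf, grid), 1) for t in (10, 50, 100)])
```

Under the Gaussian prior a 78-year-old is predicted to live only slightly longer, while under the power-law prior predictions scale multiplicatively with t, which is the qualitative pattern the experiments test.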
Nonparametric priors
You arrive at a friend’s house, and see that a cake has been in the oven for 34 minutes. How long will it be in the oven?
You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?

No direct experience
How long did the typical pharaoh reign in ancient Egypt?
…using a single datapoint
• People produce accurate predictions for the duration and extent of everyday events
• Strong prior knowledge
  – form of the prior (power-law or exponential)
  – distribution given that form (parameters)
  – non-parametric distribution when necessary
• Reveals a surprising correspondence between probabilities in the mind and in the world
Outline
• …using a single datapoint
  – predicting the future
  – joint work with Josh Tenenbaum (MIT)
• …using secondhand data
  – effects of priors on cultural transmission
  – joint work with Mike Kalish (Louisiana)
• Conclusions
Cultural transmission
• Most knowledge is based on secondhand data
• Some things can only be learned from others
  – cultural objects are transmitted across generations
• Cultural transmission provides an opportunity for priors to influence cultural objects
Iterated learning (Briscoe, 1998; Kirby, 2001)
• Each learner sees data, forms a hypothesis, and produces the data given to the next learner
• cf. the playground game “telephone”
Objects of iterated learning
• Languages
• Religious concepts
• Social norms
• Myths and legends
• Causal theories
Explaining linguistic universals
• Human languages are a subset of all logically possible communication schemes
  – universal properties are common to all languages (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
  – why do linguistic universals exist?
  – why are particular properties universal?
Explaining linguistic universals
• Traditional answer:
  – linguistic universals reflect innate constraints specific to a system for acquiring language
• Alternative answer:
  – iterated learning imposes an “information bottleneck”
  – universal properties survive this bottleneck (Briscoe, 1998; Kirby, 2001)
Analyzing iterated learning
What are the consequences of iterated learning?
• Simple algorithms, analytic results: Komarova, Niyogi, & Nowak (2002)
• Complex algorithms, simulations: Kirby (2001); Brighton (2002); Smith, Kirby, & Brighton (2003)
• Complex algorithms, analytic results: ?
Iterated Bayesian learning
[Diagram: each learner applies p(h|d) to infer a hypothesis from data, then p(d|h) to generate the data passed to the next learner]
Learners are rational Bayesian agents (covers a wide range of learning algorithms)
Markov chains
• Variable x(t+1) is independent of history given x(t)
• Converges to a stationary distribution under easily checked conditions
[Diagram: x → x → x → … with transition matrix P(x(t+1)|x(t))]
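As a minimal illustration (not from the talk) of convergence to a stationary distribution, a three-state chain with a made-up transition matrix:

```python
# A minimal sketch (not from the talk): iterate a 3-state Markov chain
# and watch the state distribution converge to the stationary distribution.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],   # transition matrix P(x(t+1)|x(t));
              [0.2, 0.6, 0.2],   # each row sums to 1
              [0.0, 0.3, 0.7]])

dist = np.array([1.0, 0.0, 0.0])  # start with all mass on state 0
for _ in range(100):
    dist = dist @ P               # one step of the chain

# Stationary distribution = left eigenvector of P with eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
stationary = np.real(vecs[:, np.argmax(np.real(vals))])
stationary /= stationary.sum()

print(dist)        # ≈ stationary, regardless of the starting state
print(stationary)
```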
Markov chain Monte Carlo
• A strategy for sampling from complex probability distributions
• Key idea: construct a Markov chain that converges to the target distribution
  – e.g. the Metropolis algorithm
  – e.g. Gibbs sampling
Gibbs sampling
For variables x = x_1, x_2, …, x_n:
Draw x_i(t+1) from P(x_i | x_-i), where
x_-i = x_1(t+1), x_2(t+1), …, x_{i-1}(t+1), x_{i+1}(t), …, x_n(t)
(a.k.a. the heat bath algorithm in statistical physics)
(Geman & Geman, 1984)
Gibbs sampling
[Figure: Gibbs sampling visualized as alternating coordinate-wise moves (MacKay, 2002)]
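A minimal sketch (not from the talk) of Gibbs sampling for a bivariate Gaussian, a case where every conditional P(x_i | x_-i) is available in closed form; the correlation value is an arbitrary choice:

```python
# A minimal sketch (not from the talk): Gibbs sampling from a bivariate
# Gaussian with unit variances and correlation rho, where each
# conditional P(x_i | x_-i) is itself Gaussian.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # assumed correlation of the target distribution
x1, x2 = 0.0, 0.0  # arbitrary starting state
samples = []

for t in range(10000):
    # Draw each variable from its conditional given the current others:
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))  # P(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))  # P(x2 | x1)
    samples.append((x1, x2))

samples = np.array(samples[1000:])     # discard burn-in
print(np.corrcoef(samples.T)[0, 1])    # ≈ rho
```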
Iterated Bayesian learning
• Defines a Markov chain on (h, d)
• This Markov chain is a Gibbs sampler for p(d, h) = p(d|h) p(h)
• Rate of convergence is geometric
  – the Gibbs sampler converges geometrically (Liu, Wong, & Kong, 1995)
Analytic results
• Iterated Bayesian learning converges to p(d, h) = p(d|h) p(h)
• Corollaries:
  – the distribution over hypotheses converges to p(h)
  – the distribution over data converges to p(d)
  – the proportion of a population of iterated learners with hypothesis h converges to p(h)
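To spell out the Gibbs-sampler correspondence (a brief addition; the slides state it without derivation), one generation of iterated learning is exactly one sweep of a two-block Gibbs sampler on p(d, h):

```latex
% One generation of iterated learning = one sweep of two-block Gibbs on p(d,h):
%   the learner samples a hypothesis from the posterior given the data,
%   then samples data from the likelihood given that hypothesis.
h^{(t)} \sim p(h \mid d^{(t-1)}) \qquad
d^{(t)} \sim p(d \mid h^{(t)})
% Both steps sample from exact conditionals of p(d,h) = p(d \mid h)\,p(h),
% so p(d,h) is the stationary distribution of the chain.
```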
Implications for linguistic universals
• Two questions:
  – why do linguistic universals exist?
  – why are particular properties universal?
• Different answers:
  – existence is explained through iterated learning
  – universal properties depend on the prior
• Focuses inquiry on the priors of the learners
  – cultural objects reflect the human mind
A method for discovering priors
Iterated learning converges to the prior…
…so we can evaluate priors by running iterated learning with human learners
Iterated function learning
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
[Diagram: data ((x, y) pairs) → hypotheses (functions)]
Function learning in the lab
[Figure: experimental display with a stimulus magnitude, a response slider, and feedback]
Examine iterated learning with different initial data
[Figure: functions learned at iterations 1–9 for different initial data]
…using secondhand data
• Iterated Bayesian learning converges to the prior
• Constrains explanations of linguistic universals
• Open questions in Bayesian language evolution
  – variation in priors
  – other selective pressures
• Provides a method for evaluating priors
  – concepts, causal relationships, languages, …
Outline
• …using a single datapoint
  – predicting the future
• …using secondhand data
  – effects of priors on cultural transmission
• Conclusions
Bayes’ theorem
p(h|d) ∝ p(d|h) p(h)
A unifying principle for explaining inductive inferences
Bayes’ theorem
behavior = f(data, knowledge)
[Diagram: data and knowledge combine, via Bayes’ theorem, to produce behavior]
Explaining inductive leaps
• How do people
  – infer causal relationships
  – identify the work of chance
  – predict the future
  – assess similarity and make generalizations
  – learn functions, languages, and concepts
  . . . from such limited data?
• What knowledge guides human inferences?
HHTHT    HHHHT
What’s the computational problem?
Not computing p(HHTHT|random), but computing p(random|HHTHT):
an inference about the structure of the world
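To make the contrast concrete (a toy example of mine, not from the slides): both sequences are equally probable under a fair coin, yet the posteriors differ once an alternative hypothesis enters; the biased-coin parameter and the uniform prior are illustrative assumptions:

```python
# A minimal sketch (not from the talk): same likelihood, different posteriors.
# Hypotheses: "random" (fair coin) vs. an assumed "biased" coin with P(H)=0.9.
# Prior: uniform over the two hypotheses (an illustrative assumption).
def likelihood(seq, p_heads):
    prob = 1.0
    for flip in seq:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob

prior = {"random": 0.5, "biased": 0.5}
p_heads = {"random": 0.5, "biased": 0.9}

for seq in ("HHTHT", "HHHHT"):
    joint = {h: likelihood(seq, p_heads[h]) * prior[h] for h in prior}
    total = sum(joint.values())
    print(seq, {h: round(v / total, 3) for h, v in joint.items()})

# p(HHTHT|random) = p(HHHHT|random) = (1/2)^5, yet
# p(random|HHTHT) ≈ 0.81 while p(random|HHHHT) ≈ 0.32.
```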
An example: Gaussians
• If we assume…
  – data, d, is a single real number, x
  – hypotheses, h, are means of a Gaussian, μ
  – prior, p(μ), is Gaussian(μ_0, σ_0²)
• …then p(x_{n+1}|x_n) is Gaussian(μ_n, σ_x² + σ_n²), where

μ_n = (x_n/σ_x² + μ_0/σ_0²) / (1/σ_x² + 1/σ_0²)
σ_n² = 1 / (1/σ_x² + 1/σ_0²)

With μ_0 = 0, σ_0² = 1, and initial data x_0 = 20, iterated learning results in rapid convergence to the prior.
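A minimal simulation (not from the slides) of this chain; the likelihood variance σ_x² = 1 is an assumption, since the slide leaves it unspecified:

```python
# A minimal sketch (not from the talk): iterated learning with Gaussian
# hypotheses. Each learner sees one x, infers the mean mu, then generates
# an x for the next learner. sigma_x_sq = 1 is an assumed likelihood variance.
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0_sq = 0.0, 1.0   # prior Gaussian(mu0, sigma0^2), as on the slide
sigma_x_sq = 1.0            # assumed; not specified on the slide
x = 20.0                    # initial data x0 = 20, as on the slide

for n in range(10):
    # Posterior over mu given x (conjugate Gaussian update from the slide):
    mu_n = (x / sigma_x_sq + mu0 / sigma0_sq) / (1 / sigma_x_sq + 1 / sigma0_sq)
    sigma_n_sq = 1 / (1 / sigma_x_sq + 1 / sigma0_sq)
    mu = rng.normal(mu_n, np.sqrt(sigma_n_sq))  # learner samples a hypothesis
    x = rng.normal(mu, np.sqrt(sigma_x_sq))     # generates data for the next learner
    print(n, round(x, 2))  # x falls from 20 toward draws from the prior predictive
```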
An example: Linear regression
• Assume
  – data, d, are pairs of real numbers (x, y)
  – hypotheses, h, are functions
• An example: linear regression
  – hypotheses have slope θ and pass through the origin
  – p(θ) is Gaussian(θ_0, σ_0²)
[Figure: sample functions; the value of y at x = 1 identifies the slope]
θ_0 = 1, σ_0² = 0.1, y_0 = −1
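And a matching sketch (again mine, not from the slides) for the regression case; the observation noise variance and the single design point x = 1 are assumptions for illustration:

```python
# A minimal sketch (not from the talk): iterated learning of a slope theta
# for y = theta * x, observed at x = 1. sigma_y_sq is an assumed noise level.
import numpy as np

rng = np.random.default_rng(0)
theta0, sigma0_sq = 1.0, 0.1   # prior Gaussian(theta0, sigma0^2), as on the slide
sigma_y_sq = 0.1               # assumed observation noise
x_obs = 1.0                    # each learner sees y at x = 1 (assumed design)
y = -1.0                       # initial data y0 = -1, as on the slide

for n in range(10):
    # Conjugate Gaussian posterior over theta given (x_obs, y):
    precision = x_obs**2 / sigma_y_sq + 1 / sigma0_sq
    mean = (x_obs * y / sigma_y_sq + theta0 / sigma0_sq) / precision
    theta = rng.normal(mean, np.sqrt(1 / precision))    # sample a hypothesis
    y = rng.normal(theta * x_obs, np.sqrt(sigma_y_sq))  # data for the next learner
    print(n, round(y, 2))  # y moves from -1 toward values consistent with the prior
```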