Page 1: Reinforcement Learning I: prediction and classical conditioning Peter Dayan Gatsby Computational Neuroscience Unit dayan@gatsby.ucl.ac.uk (thanks to Yael.

Reinforcement Learning I: prediction and classical conditioning

Peter Dayan

Gatsby Computational Neuroscience Unit

dayan@gatsby.ucl.ac.uk (thanks to Yael Niv)

Page 2

Global plan

• Reinforcement learning I:
– prediction
– classical conditioning
– dopamine

• Reinforcement learning II:
– dynamic programming; action selection
– Pavlovian misbehaviour
– vigor

• Chapter 9 of Theoretical Neuroscience

Page 3


Conditioning

• Ethology – optimality; appropriateness
• Psychology – classical/operant conditioning
• Computation – dynamic programming; Kalman filtering
• Algorithm – TD/delta rules; simple weights
• Neurobiology – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions

Page 4

= Conditioned Stimulus

= Unconditioned Stimulus

= Unconditioned Response (reflex); Conditioned Response (reflex)

Animals learn predictions

Ivan Pavlov

Page 5

Animals learn predictions

Ivan Pavlov

Acquisition; Extinction

[Figure: % CRs (0–100) across blocks of 10 trials (1–14): responding rises during acquisition, then falls during extinction]

very general across species, stimuli, behaviors

Page 6

But do they really?

temporal contiguity is not enough - need contingency

1. Rescorla’s control

P(food | light) > P(food | no light)

Page 7

But do they really?

contingency is not enough either… need surprise

2. Kamin’s blocking

Page 8

But do they really?

seems like stimuli compete for learning

3. Reynolds’ overshadowing

Page 9

Theories of prediction learning: Goals

• Explain how the CS acquires “value”
• When (under what conditions) does this happen?
• Basic phenomena: gradual learning and extinction curves
• More elaborate behavioral phenomena
• (Neural data)

P.S. Why are we looking at old-fashioned Pavlovian conditioning?

it is the perfect uncontaminated test case for examining prediction learning on its own

Page 10

error-driven learning: change in value is proportional to the difference between actual and predicted outcome

Assumptions:

1. learning is driven by error (formalizes notion of surprise)

2. summation of predictors is linear

A simple model - but very powerful!
– explains: gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more...

– predicted overexpectation

note: US as “special stimulus”

Rescorla & Wagner (1972)

$\Delta V_{CS_i} = \eta \left( r_{US} - \sum_j V_{CS_j} \right)$
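The summed-error term is what produces blocking; the following minimal sketch illustrates it (the learning rate, trial counts, and the stimulus labels "A"/"B" are illustrative assumptions, not values from the lecture):

```python
# Rescorla-Wagner: error is reward minus the LINEAR SUM of present predictors.
def rw_trial(V, stimuli, r, eta=0.1):
    delta = r - sum(V[s] for s in stimuli)  # shared prediction error
    for s in stimuli:
        V[s] += eta * delta                 # every present CS is updated

V = {"A": 0.0, "B": 0.0}
for _ in range(100):
    rw_trial(V, ["A"], 1.0)        # phase 1: A alone -> reward
for _ in range(100):
    rw_trial(V, ["A", "B"], 1.0)   # phase 2: compound A+B -> same reward

# A already predicts the reward, so the error in phase 2 is near zero and
# B acquires almost no value: Kamin's blocking.
print(V)
```

Dropping phase 1 turns the same code into an overshadowing demo: A and B then split the available value between them.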

Page 11

• how does this explain acquisition and extinction?

• what would V look like with 50% reinforcement? e.g. 1 1 0 1 0 0 1 1 1 0 0

– what would V be on average after learning?

– what would the error term be on average after learning?

Rescorla-Wagner learning

$V_{t+1} = V_t + \eta \left( r_t - V_t \right)$
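The 50% reinforcement question can be answered with a minimal simulation of this update (η = 0.1 and the trial count are illustrative assumptions): V fluctuates around the expected reward, while the error term averages roughly zero even though each individual error stays large.

```python
import random
random.seed(0)  # reproducible illustrative run

eta, V = 0.1, 0.0
errors = []
for t in range(2000):
    r = 1.0 if random.random() < 0.5 else 0.0  # 50% reinforcement schedule
    delta = r - V                              # prediction error
    errors.append(delta)
    V += eta * delta

print(round(V, 2), round(sum(errors[-1000:]) / 1000, 2))
```

On average V settles at the expected reward (0.5) and the average error is zero; individual errors remain about ±0.5, so the error is zero only in expectation.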

Page 12

how is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …?

Rescorla-Wagner learning

$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i} \, r_i$

$V_{t+1} = V_t + \eta \left( r_t - V_t \right) = (1-\eta) V_t + \eta \, r_t$

[Figure: weight given to the reward from i trials back, $\eta (1-\eta)^{i}$, plotted for the last 10 trials; the weights decay exponentially into the past]

recent rewards weigh more heavily

why is this sensible?

learning rate = forgetting rate!

the R-W rule estimates expected reward using a weighted average of past rewards
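The equivalence between the recursive update and the weighted average can be checked directly (η and the reward sequence below are arbitrary illustrative choices):

```python
# Sketch: the recursive update V <- (1-eta)V + eta*r equals the
# exponentially weighted sum V_t = eta * sum_i (1-eta)^(t-i) * r_i.
eta = 0.3
rewards = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

V = 0.0
for r in rewards:                       # recursive form
    V = (1 - eta) * V + eta * r

t = len(rewards)
V_sum = eta * sum((1 - eta) ** (t - i) * r
                  for i, r in enumerate(rewards, start=1))  # closed form

print(abs(V - V_sum) < 1e-12)  # prints True
```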

Page 13

Summary so far

Predictions are useful for behavior

Animals (and people) learn predictions (Pavlovian conditioning = prediction learning)

Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality

Marr:

computational problem – minimize the squared prediction error:
$E = \left( r_{US} - \sum_j V_{CS_j} \right)^2$

algorithm – gradient descent on E gives the delta rule:
$\Delta V_{CS_i} = \eta \left( r_{US} - \sum_j V_{CS_j} \right)$

Page 14

But: second order conditioning

[Figure: conditioned responding (15–50) as a function of the number of phase 2 pairings]

animals learn that a predictor of a predictor is also a predictor of reward!

not interested solely in predicting immediate reward


phase 1:

phase 2:

test: ? (what do you think will happen?)

what would Rescorla-Wagner learning predict here?
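For the question above, a quick simulation gives the answer (learning rate and trial counts are illustrative assumptions): because the R-W error involves only the actual US, pairing B with a pre-trained A in the absence of reward produces extinction-like learning, not second-order conditioning.

```python
# What R-W predicts for the second-order procedure: phase 1 trains A,
# phase 2 pairs B with A but with NO reward.
def rw_trial(V, stimuli, r, eta=0.1):
    delta = r - sum(V[s] for s in stimuli)  # error uses only the actual US
    for s in stimuli:
        V[s] += eta * delta

V = {"A": 0.0, "B": 0.0}
for _ in range(100):
    rw_trial(V, ["A"], 1.0)       # phase 1: A -> US
for _ in range(20):
    rw_trial(V, ["B", "A"], 0.0)  # phase 2: B -> A, no US

# V["B"] goes negative: R-W predicts inhibition here, not the second-order
# excitation animals actually show.
print(V)
```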

Page 15

let’s start over: this time from the top

Marr’s 3 levels:
• The problem: optimal prediction of future reward

• what’s the obvious prediction error?

• what’s the obvious problem with this?

$V_t = E\left[ \sum_{i=t}^{T} r_i \right]$

want to predict expected sum of future reward in a trial/episode

(N.B. here t indexes time within a trial)

the obvious error: $\delta_t = \sum_{i=t}^{T} r_i - V_t$   (compare R-W: $\delta = r - V_{CS}$)

Page 16

let’s start over: this time from the top

Marr’s 3 levels:
• The problem: optimal prediction of future reward

$V_t = E\left[ \sum_{i=t}^{T} r_i \right]$

want to predict expected sum of future reward in a trial/episode

$V_t = E\left[ r_t + r_{t+1} + r_{t+2} + \dots + r_T \right]$
$\quad = E\left[ r_t \right] + E\left[ r_{t+1} + r_{t+2} + \dots + r_T \right]$
$\quad = E\left[ r_t + V_{t+1} \right]$

Bellman eqn for policy evaluation
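The Bellman identity can be verified on a toy episode (the within-trial reward sequence is an illustrative assumption): computing each value as the sum of future rewards automatically satisfies the one-step recursion.

```python
# Minimal check of V_t = E[r_t + V_{t+1}] on a deterministic toy trial.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]   # reward arrives at step 3
T = len(rewards)

V = [0.0] * (T + 1)                   # V[T] = 0: nothing after the trial
for t in range(T - 1, -1, -1):        # backward recursion
    V[t] = rewards[t] + V[t + 1]      # sum of future rewards

print(V[:T])  # -> [1.0, 1.0, 1.0, 1.0, 0.0]
```

Every state before the reward carries the full future value, and each V[t] equals rewards[t] + V[t+1] by construction.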

Page 17

let’s start over: this time from the top

Marr’s 3 levels:
• The problem: optimal prediction of future reward
• The algorithm: temporal difference learning

$V_t = E\left[ r_t + V_{t+1} \right]$
$V_t \leftarrow (1-\eta) V_t + \eta \left( r_t + V_{t+1} \right)$
$V_t \leftarrow V_t + \eta \left( r_t + V_{t+1} - V_t \right)$

temporal difference prediction error: $\delta_t = r_t + V_{t+1} - V_t$

compare to R-W: $V_{t+1} = V_t + \eta \left( r_t - V_t \right)$
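A minimal TD(0) learner for a single recurring trial (η, trial length, and reward timing are illustrative assumptions of this sketch):

```python
# TD(0): learn within-trial values online from delta_t = r_t + V[t+1] - V[t].
eta, T = 0.1, 5
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]   # reward at step 3 of each trial
V = [0.0] * (T + 1)                   # V[T] = 0 after the trial ends

for trial in range(500):
    for t in range(T):
        delta = rewards[t] + V[t + 1] - V[t]   # TD prediction error
        V[t] += eta * delta

print([round(v, 2) for v in V[:T]])   # approaches [1.0, 1.0, 1.0, 1.0, 0.0]
```

Across repeated sweeps, value information propagates backward through the trial, which is how the model supports second-order conditioning.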

Page 18


prediction error

no prediction | prediction, reward | prediction, no reward

[Figure: TD error in the three conditions; traces of $V_t$ and the reward R]

TD error: $\delta_t = r_t + V_{t+1} - V_t$
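The three conditions in the figure can be reproduced with the same TD machinery (trial length, reward timing, and the zero pre-trial baseline are assumptions of this sketch):

```python
# After training, the TD error moves to the CS, vanishes at a predicted
# reward, and dips below zero when the predicted reward is omitted.
eta, T, r_time = 0.1, 5, 3
V = [0.0] * (T + 1)                   # V[T] = 0 after the trial ends

def run_trial(V, reward, learn=True):
    """One pass through the trial; returns the TD error at each step."""
    deltas = []
    for t in range(T):
        r = reward if t == r_time else 0.0
        d = r + V[t + 1] - V[t]
        deltas.append(d)
        if learn:
            V[t] += eta * d
    return deltas

for _ in range(1000):                        # training: reward always delivered
    run_trial(V, 1.0)

d_cs = V[0] - 0.0                            # error at CS onset, assuming a zero pre-trial baseline
d_reward = run_trial(V, 1.0, learn=False)    # predicted reward, delivered
d_omit = run_trial(V, 0.0, learn=False)      # predicted reward, omitted
print(round(d_cs, 2), round(d_reward[r_time], 2), round(d_omit[r_time], 2))
```

The positive error appears at the (unpredicted) CS, vanishes at the fully predicted reward, and becomes a negative dip when the predicted reward is omitted: the classic dopamine pattern.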

Page 19

Summary so far

Temporal difference learning versus Rescorla-Wagner

• derived from first principles about the future

• explains everything that R-W does, and more (e.g. 2nd order conditioning)

• a generalization of R-W to real time

Page 20

Back to Marr’s 3 levels

• The problem: optimal prediction of future reward• The algorithm: temporal difference learning• Neural implementation: does the brain use TD learning?

Page 21

Dopamine

Dorsal Striatum (Caudate, Putamen)

Ventral TegmentalArea

Substantia Nigra

Amygdala

Nucleus Accumbens(Ventral Striatum)

Prefrontal Cortex

Parkinson’s Disease: motor control + initiation?

Intracranial self-stimulation; Drug addiction; Natural rewards

Reward pathway? Learning?

Also involved in:
• Working memory
• Novel situations
• ADHD
• Schizophrenia
• …

Page 22

Role of dopamine: Many hypotheses

• Anhedonia hypothesis
• Prediction error (learning, action selection)
• Salience/attention
• Incentive salience
• Uncertainty
• Cost/benefit computation
• Energizing/motivating behavior

Page 23


dopamine and prediction error

no prediction | prediction, reward | prediction, no reward

[Figure: dopamine firing in the three conditions; traces of $V_t$ and the reward R]

TD error: $\delta_t = r_t + V_{t+1} - V_t$

Page 24

prediction error hypothesis of dopamine

Tobler et al, 2005

Fiorillo et al, 2003

The idea: Dopamine encodes a reward prediction error

Page 25

prediction error hypothesis of dopamine

model prediction error

measured firing rate

Bayer & Glimcher (2005)

at end of trial: $\delta_t = r_t - V_t$ (just like R-W)

$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i} \, r_i$

Page 26

Where does dopamine project to? Basal ganglia

Several large subcortical nuclei (unfortunate anatomical names follow structure rather than function, e.g. caudate + putamen + nucleus accumbens are all relatively similar pieces of striatum; but globus pallidus & substantia nigra each comprise two different things)

Page 27

Where does dopamine project to? Basal ganglia

inputs to BG are from all over the cortex (and topographically mapped)

Voorn et al, 2004

Page 28

Dopamine and plasticity

Prediction errors are for learning…

Indeed: cortico-striatal synapses show

dopamine-dependent plasticity

Wickens et al, 1996


Page 30

Corticostriatal synapses: 3 factor learning

[Diagram: stimulus representation $x_1 \dots x_N$ in cortex → adjustable cortico-striatal synapses with learned values $V_1 \dots V_N$ in striatum; VTA/SNc dopamine broadcasts the prediction error; reward R arrives via PPTN, habenula etc.]

but also amygdala; orbitofrontal cortex; ...

Page 31


[Figure: high-pain and low-pain predictive cues with transition probabilities 0.8 / 0.2; traces of Value and Prediction error]

punishment prediction error

TD error: $\delta_t = r_t + V_{t+1} - V_t$

Page 32


experimental sequence: A – B – HIGH, C – D – LOW, C – B – HIGH, A – B – HIGH, A – D – LOW, C – D – LOW, A – B – HIGH, A – B – HIGH, C – D – LOW, C – B – HIGH

TD model → Prediction error, compared with Brain responses (MR scanner)

Ben Seymour; John O’Doherty

punishment prediction error

Page 33


TD prediction error: ventral striatum

[fMRI slice at Z = −4; R indicates right]

punishment prediction error

Page 34


right anterior insula

dorsal raphe (5HT)?

punishment prediction

Page 35


generalization

Page 36


generalization

Page 37


aversion

Page 38


opponency

Page 39


Solomon & Corbit

Page 40

Summary of this part: prediction and RL

Prediction is important for action selection

• The problem: prediction of future reward

• The algorithm: temporal difference learning

• Neural implementation: dopamine dependent learning in BG

A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model

Precise (normative!) theory for generation of dopamine firing patterns

Explains anticipatory dopaminergic responding, second order conditioning

Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas

Page 41

Striatum and learned values

Striatal neurons show ramping activity that precedes a reward (and changes with learning!)

(Schultz)

start → food

(Daw)

Page 42

Phasic dopamine also responds to…

• Novel stimuli
• Especially salient (attention-grabbing) stimuli
• Aversive stimuli (??)

• Reinforcers and appetitive stimuli induce approach behavior and learning, but also have attention functions (elicit orienting response) and disrupt ongoing behaviour.

→ Perhaps DA reports salience of stimuli (to attract attention; switching) and not a prediction error? (Horvitz, Redgrave)

