Page 1: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Reinforcement learning

This is mostly taken from Dayan and Abbott, ch. 9

Reinforcement learning differs from supervised learning in that there is no all-knowing teacher; the reinforcement signal carries less information.

Central problem – temporal credit assignment.

Page 2: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Example: spatial learning is impaired by blockade of NMDA receptors (Morris, 1989).

(Figure: the Morris water maze, with the rat and the hidden platform.)

Page 3: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9
Page 4: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9
Page 5: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Solving this problem comprises two separate tasks:

1. Predicting reward

2. Choosing the correct action

or

1. Policy evaluation (critic)

2. Policy improvement (actor)

Page 6: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Classical vs. instrumental conditioning

Classical: think of Pavlov's dog.

In instrumental conditioning the animal is rewarded for “correct” actions, and not rewarded, or even punished, for incorrect ones.

In instrumental (operant) conditioning, what the animal does (the policy) matters.

Page 7: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Predicting reward – the Rescorla-Wagner rule

Notation:

u – stimulus
r – reward
v – expected reward
w – weight (filter)

The prediction is v = w u, and the weight is updated by

w → w + ε δ u,   with   δ = r − v

For more than one stimulus, u and w become vectors and v = w · u.
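As a minimal sketch (the learning rate, trial counts, and reward schedule below are assumed for illustration, not taken from the slides), the rule can be simulated in a few lines of MATLAB/Octave; it reproduces the acquisition, extinction, and random-reward conditions shown on the next slide:

% Rescorla-Wagner rule: v = w*u, delta = r - v, w -> w + epsilon*delta*u
epsilon = 0.1;       % learning rate (assumed value)
nTrials = 120;
u = 1;               % a single stimulus, present on every trial
w = 0;               % initial weight
wHist = zeros(1, nTrials);
for t = 1:nTrials
    if t <= 40
        r = 1;                   % acquisition: reward delivered on every trial
    elseif t <= 80
        r = 0;                   % extinction: reward withheld
    else
        r = double(rand < 0.5);  % random (partial) reward with probability 0.5
    end
    v = w * u;                   % predicted reward
    delta = r - v;               % prediction error
    w = w + epsilon * delta * u; % Rescorla-Wagner update
    wHist(t) = w;
end
plot(wHist); xlabel('trial'); ylabel('w');  % w rises toward 1, decays toward 0, then hovers near 0.5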

Page 8: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

(Figure: evolution of the weight w during learning (r = 1), extinction (r = 0), and random reward.)

Page 9: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Predicting future reward: Temporal Difference learning

In more realistic conditions, especially in operant conditioning, the actual reward may come some time after the signal that predicts it. What we care about is then not the immediate reward at this time point, but the total reward predicted given the choice made at this time. How can we estimate the total reward?

Total average future reward at time t:

⟨ Σ_{τ=0}^{T−t} r(t+τ) ⟩

Assume that we estimate this with a linear estimator:

v(t) = Σ_{τ=0}^{t} w(τ) u(t−τ)

Page 10: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Use the δ rule at time t:

w(τ) → w(τ) + ε δ(t) u(t−τ)

where δ is the difference between the actual future reward and the prediction of this reward:

δ(t) = Σ_{τ=0}^{T−t} r(t+τ) − v(t)

But we do not know the future rewards Σ_{τ} r(t+τ) at time t.

Instead we can approximate this sum by:

Σ_{τ=0}^{T−t} r(t+τ) ≈ r(t) + v(t+1)

Page 11: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Which gives us:

δ(t) = r(t) + v(t+1) − v(t)   (1)

The temporal difference learning rule then becomes:

w(τ) → w(τ) + ε δ(t) u(t−τ)   (2)
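A small sketch of rules (1) and (2) in MATLAB/Octave, assuming a single stimulus at time step 10 and a reward of 1 at time step 20 (the timings, trial length, and learning rate are illustrative choices, not from the slides). Over repeated trials the prediction error moves from the time of the reward back to the time of the stimulus, which is the behaviour of the dopamine signal discussed on the next slide:

% Rules (1) and (2) with a tapped-delay-line stimulus representation:
% v(t) = sum over tau of w(tau) u(t - tau)
T = 25;  epsilon = 0.2;  nTrials = 200;   % trial length, learning rate, repetitions (assumed)
tStim = 10;  tRew = 20;                   % stimulus and reward times (assumed)
u = zeros(1, T);  u(tStim) = 1;
r = zeros(1, T);  r(tRew) = 1;
w = zeros(1, T);                          % one weight per delay tau
for trial = 1:nTrials
    v = zeros(1, T);                      % prediction of total future reward
    for t = 1:T
        for tau = 0:t-1
            v(t) = v(t) + w(tau + 1) * u(t - tau);
        end
    end
    vNext = [v(2:end) 0];                 % v(t+1), with v(T+1) = 0
    delta = r + vNext - v;                % TD error, rule (1)
    for t = 1:T
        for tau = 0:t-1
            w(tau + 1) = w(tau + 1) + epsilon * delta(t) * u(t - tau);   % rule (2)
        end
    end
end
plot(delta); xlabel('t'); ylabel('\delta(t)');   % the peak in delta has moved from the reward to the stimulus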

Page 12: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Dopamine and predicted reward

Activity of VTA dopaminergic neurons in a monkey. A: top – before learning; bottom – after learning. B: after learning; top – with reward, bottom – no reward.

Page 13: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Generalization of TD(0)

1. u can be a vector u, so w is also a vector. This handles more complex, or multiple possible, stimuli.

2. A decay (discount) term γ. Here:

δ = r_a(u) + γ v(u′) − v(u)

where u is the current location and u′ is the location moved to after taking action a.

This has the effect of putting a stronger emphasis on rewards that take fewer steps to reach.
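For concreteness, one generalized update step might look like the sketch below (the one-hot state encoding, discount γ = 0.9, and learning rate are assumed for illustration):

gamma = 0.9;  epsilon = 0.1;              % discount and learning rate (assumed values)
u     = [1 0 0]';                         % current location, one-hot over three states (illustrative)
uNext = [0 1 0]';                         % location reached after taking action a
ra    = 0;                                % immediate reward of that action
w     = zeros(3, 1);                      % value weights, v(u) = w' * u
delta = ra + gamma * (w' * uNext) - (w' * u);   % discounted TD error
w     = w + epsilon * delta * u;                % only the current location's weight changes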

Page 14: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Until now – how to predict a reward. We still need to see how we make decisions about which path to take, or what policy to use.

Describe bee foraging example:

Each type of flower gives a different reward, drawn from a distribution: P(r_b) for blue flowers and P(r_y) for yellow flowers.

Page 15: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Learn “action values” m_b and m_y (the actor); these will determine which choice to make.

Assume r_b = 1, r_y = 2; what is the best choice we can make?

The average reward is:

⟨r⟩ = P[b] r_b + P[y] r_y

What will maximize this reward?

Page 16: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Learn “action values” m_b and m_y; these will determine which choice to make.

Use softmax:

P[b] = exp(β m_b) / (exp(β m_b) + exp(β m_y));   P[y] = exp(β m_y) / (exp(β m_b) + exp(β m_y))

This is a stochastic choice; β is a variability parameter. A good choice for the “action values” is to set them to the mean rewards:

m_b = ⟨r_b⟩;   m_y = ⟨r_y⟩

This is also called the “indirect actor” (???)

Page 17: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

How good is this choice?

Assume β = 1, r_b = 1, r_y = 2; what is ⟨r⟩?

With m_b = r_b and m_y = r_y:

P[b] = exp(β r_b) / (exp(β r_b) + exp(β r_y));   P[y] = exp(β r_y) / (exp(β r_b) + exp(β r_y))

>> rb=1; ry=2;
>> pb=exp(rb)/(exp(rb)+exp(ry))
pb = 0.2689
>> py=exp(ry)/(exp(rb)+exp(ry))
py = 0.7311
>> r_av=rb*pb+ry*py
r_av = 1.7311

Page 18: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

m_b = ⟨r_b⟩;   m_y = ⟨r_y⟩

This choice can be learned using a delta rule, applied to the chosen action x:

m_x → m_x + ε (r_x − m_x)

(Figure: simulated action values and choices for β = 1 and β = 50, with r_b = 1, r_y = 2 for t < 100 and r_b = 2, r_y = 1 for t > 100.)
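A sketch of this indirect actor in MATLAB/Octave, reproducing the setting of the figure (the learning rate ε = 0.1 is an assumed value); rerunning it with β = 50 shows the near-deterministic behaviour:

% Indirect actor: softmax choice between blue (b) and yellow (y) flowers,
% with action values learned by the delta rule m_x -> m_x + epsilon*(r_x - m_x)
epsilon = 0.1;  beta = 1;  nT = 200;
m = [0 0];                                % m(1) = m_b, m(2) = m_y
choices = zeros(1, nT);
for t = 1:nT
    if t <= 100, r = [1 2]; else, r = [2 1]; end   % rewards switch at t = 100, as in the figure
    p = exp(beta * m) / sum(exp(beta * m));        % softmax choice probabilities
    x = 1 + (rand > p(1));                         % sample an action: 1 = blue, 2 = yellow
    m(x) = m(x) + epsilon * (r(x) - m(x));         % delta rule on the chosen action only
    choices(t) = x;
end
% With beta = 1 both flowers keep being sampled and the switch at t = 100 is tracked quickly;
% with beta = 50 the choice is almost deterministic and adapts more slowly.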

Page 19: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Another option (the “direct actor” ???) is to set the action values so as to maximize the expected reward:

⟨r⟩ = P[b] r_b + P[y] r_y

This can be done by stochastic gradient ascent on ⟨r⟩. For example:

∂⟨r⟩/∂m_b = β P[b] ( (1 − P[b]) r_b − P[y] r_y )

So that generally, for the action value m_x given that action a was taken:

m_x → m_x + ε (δ_{ax} − P[x]) (r_a − r_0)

where δ_{ax} = 1 if x = a and 0 otherwise. A good choice for r_0 is the mean of r_x over all possible choices (see the D&A book, p. 344).
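A sketch of the direct actor for the two-flower problem (the learning rate, β, trial count, and the use of the true mean reward as the baseline r_0 are simplifying assumptions; in practice r_0 would itself be estimated, e.g. by a running average of received rewards):

% Direct actor: stochastic gradient ascent on <r>
epsilon = 0.1;  beta = 1;  nT = 500;      % learning rate and trial count (assumed)
rTrue = [1 2];                            % r_b = 1, r_y = 2
m = [0 0];                                % action values m_b, m_y
r0 = mean(rTrue);                         % baseline r_0 (here the true mean, for simplicity)
for t = 1:nT
    P = exp(beta * m) / sum(exp(beta * m));          % softmax policy
    a = 1 + (rand > P(1));                           % sampled action
    ra = rTrue(a);                                   % reward received
    for x = 1:2
        m(x) = m(x) + epsilon * ((x == a) - P(x)) * (ra - r0);   % update with Kronecker delta
    end
end
P   % should put most of the probability on the yellow flower (action 2)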

Page 20: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

The Maze task and sequential action choice

(In the maze, the rat starts at A and can move to B or to C; the two arms leading out of B contain rewards 5 and 0, and the two arms leading out of C contain rewards 2 and 0.)

Policy evaluation, for the initial random policy: apply the TD rule

δ(t) = r(t) + v(t+1) − v(t),   w(u) → w(u) + ε δ

which converges to

v(B) = ½ (5 + 0) = 2.5,   v(C) = ½ (2 + 0) = 1,   v(A) = ½ (v(B) + v(C)) = 1.75

(Figure: policy evaluation – the value estimates converging over trials.)

What would it be for an ideal policy?
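A sketch of policy evaluation on this maze under the random policy (the state encoding, learning rate, and trial count are assumed for illustration); the estimates should settle near v(A) = 1.75, v(B) = 2.5, v(C) = 1:

% TD policy evaluation on the maze under a random (50/50) policy
% States: 1 = A, 2 = B, 3 = C
epsilon = 0.05;  nTrials = 2000;          % learning rate and trial count (assumed)
v = zeros(1, 3);                          % value estimates for A, B, C
armRewards = {[5 0], [2 0]};              % rewards in the two arms leaving B, and leaving C
for trial = 1:nTrials
    % from A, move to B or C at random; no immediate reward
    s = 2 + (rand < 0.5);                 % 2 = B or 3 = C
    delta = 0 + v(s) - v(1);
    v(1) = v(1) + epsilon * delta;
    % from B or C, pick an arm at random and collect its reward; the trial then ends
    r = armRewards{s - 1}(1 + (rand < 0.5));
    delta = r + 0 - v(s);                 % value after leaving the maze is 0
    v(s) = v(s) + epsilon * delta;
end
v   % fluctuates around [1.75  2.5  1.0]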

Page 21: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Policy improvement

Using the direct actor, learn to improve the policy:

m_x → m_x + ε (δ_{ax} − P[x]) (r_a − r_0)

In this sequential setting, the role of r_a − r_0 is played by the temporal difference error δ = r_a(u) + v(u′) − v(u).

Note – policy improvement and policy evaluation are best carried out sequentially: evaluate – improve – evaluate – improve …

At A:

For a left turn:   δ = 0 + v(B) − v(A) = 0.75

For a right turn:  δ = 0 + v(C) − v(A) = −0.75

so the turn toward B is reinforced and the turn toward C is suppressed.
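A sketch combining the two pieces (critic and direct actor) on the maze; the state encoding, the assignment of rewards to particular arms, the learning rates, and β are all assumed for illustration:

% Actor-critic on the maze: interleaved policy evaluation (critic)
% and policy improvement (direct actor)
epsilon = 0.05;  beta = 1;  nTrials = 2000;    % learning rate, variability, trials (assumed)
v = zeros(1, 3);                 % critic: values of A, B, C
m = zeros(3, 2);                 % actor: m(state, action), action 1 = left, 2 = right
rewards = [0 0; 5 0; 2 0];       % immediate reward for each (state, action) (encoding assumed)
nextS   = [2 3; 0 0; 0 0];       % successor state for each (state, action); 0 = exit the maze
for trial = 1:nTrials
    s = 1;                                       % start at A
    while s ~= 0
        P = exp(beta * m(s, :)) / sum(exp(beta * m(s, :)));   % softmax over the two turns
        a = 1 + (rand > P(1));
        r = rewards(s, a);
        s2 = nextS(s, a);
        vNext = 0;  if s2 ~= 0, vNext = v(s2); end
        delta = r + vNext - v(s);                % TD error (critic)
        v(s) = v(s) + epsilon * delta;           % policy evaluation
        for x = 1:2                              % policy improvement (direct actor)
            m(s, x) = m(s, x) + epsilon * ((x == a) - P(x)) * delta;
        end
        s = s2;
    end
end
% The learned policy prefers B from A and the high-reward arm within B and C,
% and v(A) climbs toward the optimal value of 5 (up to residual exploration).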

Page 22: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

(Figure: the maze with the values learned under the random policy – v(A) = 1.75, v(B) = 2.5, v(C) = 1.)

Page 23: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9
Page 24: Reinforcement learning This is mostly taken from Dayan and Abbot ch. 9

Reinforcement learning - summary

