
Unconditioned stimulus (food) causes unconditioned response (saliva) Conditioned stimulus (bell)...

Date post: 13-Jan-2016
Upload: gavin-spencer

Unconditioned stimulus (food) causes unconditioned response (saliva)
Conditioned stimulus (bell) causes conditioned response (saliva)


Rescorla-Wagner Rule

• v = wu, where u is the stimulus (0 or 1), w is the weight, and v is the predicted response. Adapt w to minimize the quadratic error between the actual reward r and the prediction v.
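The update itself is not shown on the slide. A minimal sketch, assuming the usual delta-rule form w -> w + epsilon (r - v) u obtained by gradient descent on the quadratic error; the function name, the learning rate epsilon=0.1 and the acquisition/extinction schedule are illustrative choices, not taken from the slides:

```python
def rescorla_wagner(stimuli, rewards, epsilon=0.1, w=0.0):
    """Delta-rule updates of a single weight; returns the weight after each trial.

    epsilon (learning rate) and the initial weight are assumed values.
    """
    history = []
    for u, r in zip(stimuli, rewards):
        v = w * u                 # predicted response
        delta = r - v             # prediction error
        w += epsilon * delta * u  # gradient step on (r - v)^2 / 2
        history.append(w)
    return history


# 100 acquisition trials (stimulus paired with reward), then 100 extinction trials.
trials_u = [1] * 200
trials_r = [1] * 100 + [0] * 100
weights = rescorla_wagner(trials_u, trials_r)
```

Under these assumptions the weight rises towards 1 while the stimulus is rewarded and decays back towards 0 once the reward is withheld.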


The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
• Blocking: a learned association of s1 with r prevents learning of the association of s2 with r
• Inhibition: s2 reduces the prediction when combined with any predicting stimulus
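Both phenomena can be reproduced in a few lines. This is a sketch under the assumption that the multi-input rule is v = sum_i w_i u_i with w_i -> w_i + epsilon (r - v) u_i; the trial schedules, the trial counts and epsilon=0.1 are hypothetical:

```python
def rw_multi(trials, n_stimuli, epsilon=0.1):
    """Rescorla-Wagner rule for several stimuli; trials is a list of (u, r) pairs."""
    w = [0.0] * n_stimuli
    for u, r in trials:
        v = sum(wi * ui for wi, ui in zip(w, u))  # summed prediction
        delta = r - v                             # shared prediction error
        w = [wi + epsilon * delta * ui for wi, ui in zip(w, u)]
    return w


# Blocking: pre-train s1 -> r, then present the compound s1 + s2 -> r.
pretrain = [((1, 0), 1.0)] * 200
compound = [((1, 1), 1.0)] * 200
w_blocked = rw_multi(pretrain + compound, 2)
w_control = rw_multi(compound, 2)  # no pre-training: the weight is shared

# Inhibition: s1 alone is rewarded, s1 + s2 together are not.
inhibition = [((1, 0), 1.0), ((1, 1), 0.0)] * 500
w_inhib = rw_multi(inhibition, 2)
```

With pre-training, s1 already predicts the reward, so the error during compound trials is close to zero and w_blocked[1] stays near 0. In the inhibition schedule the weight of s2 becomes negative.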


Temporal difference learning

• Interpret v(t) as ‘total future expected reward’

• v(t) is predicted from the past


After learning, delta(t)=0 implies:
• v(t=0) is the sum of expected future reward
• v(t) constant, thus expected reward r(t)=0
• v(t) decreasing, thus positive expected reward


Explanation of fig. 9.2
• Since u(t)=delta(t,0) (a single stimulus pulse at t=0), Eq. 9.6 becomes: v(t)=w(t)
• Eq. 9.7 becomes: delta w(t) = \epsilon delta(t)
• Thus, delta v(t) = \epsilon (r(t)+v(t+1)-v(t))
• r(t)=delta(t,T) (a single reward at t=T)
• Step 1: only change is v(T)=v(T)+\epsilon
• Step 2: change v(T-1) and v(T)
• Etc.
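The step-by-step argument can be checked numerically. The sketch below assumes v(t)=w(t) as derived above, a reward of 1 at t=T, and a value of 0 beyond the end of the trial; the function name, epsilon=0.5, T=5 and the trial length are illustrative choices only:

```python
def td_trial(v, T, epsilon=0.5):
    """One trial of the TD rule with stimulus at t=0 and reward r(t)=1 at t=T only.

    Updates v in place and returns the list of prediction errors delta(t).
    """
    deltas = []
    for t in range(len(v)):
        r = 1.0 if t == T else 0.0
        v_next = v[t + 1] if t + 1 < len(v) else 0.0  # nothing is predicted after the trial
        delta = r + v_next - v[t]
        v[t] += epsilon * delta
        deltas.append(delta)
    return deltas


T = 5
v = [0.0] * (T + 3)

td_trial(v, T)
after_one = list(v)   # step 1: only v(T) has changed

td_trial(v, T)
after_two = list(v)   # step 2: v(T-1) and v(T) have changed

for _ in range(500):
    td_trial(v, T)    # after learning: v(t) = 1 for t <= T, 0 afterwards
```

After many trials the prediction has propagated back to the stimulus time, so v(t) equals the total future reward at every t between stimulus and reward.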


Dopamine

• The monkey releases one button and presses another after the stimulus to receive a reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to delta in the TD rule (fig. 9.2).


Dopamine

• Dopamine neurons encode the reward prediction error (delta). B: withholding the reward reduces neural firing, in agreement with the delta interpretation.


Static action choice

• Rewards result directly from actions

• Bees visit flowers whose color (blue, yellow) predicts reward (sugar).

– m are action values, which encode expected reward. Beta controls the balance between exploration and exploitation

– P are action probabilities
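The formula linking action values to probabilities is not preserved in the transcript. A common choice, assumed here, is the softmax P[a] proportional to exp(beta m[a]); the function name and the example values are hypothetical:

```python
import math


def softmax_policy(m, beta):
    """Action probabilities P[a] proportional to exp(beta * m[a]) (assumed form)."""
    top = max(m.values())  # subtract the maximum for numerical stability
    weights = {a: math.exp(beta * (value - top)) for a, value in m.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


action_values = {"blue": 1.0, "yellow": 2.0}
p_explore = softmax_policy(action_values, beta=0.0)   # uniform: pure exploration
p_exploit = softmax_policy(action_values, beta=50.0)  # nearly deterministic: exploitation
```

With beta=0 both flowers are chosen equally often regardless of m; as beta grows the bee almost always picks the flower with the larger action value.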


The indirect actor model

Learn the average nectar volumes for each flower and act accordingly.

Implemented by on-line learning. When the bee visits a blue flower, update the blue estimate m_b

and leave the yellow estimate m_y unchanged

Fig: r_b=1, r_y=2 for t=1:100, reversed for t=101:200. A: m_y, m_b. B-D: cumulative reward for low beta (B) and high beta (C, D).
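The experiment in the figure can be sketched as follows. The update equation is missing from the transcript, so the code assumes the delta rule m -> m + epsilon (r - m) for the visited flower and a softmax choice between the two colors; epsilon=0.1, the zero initial estimates, the seed and the function name are assumptions:

```python
import math
import random


def run_indirect_actor(beta, epsilon=0.1, n_visits=200, seed=0):
    """Simulate a bee that learns average nectar volumes and chooses by softmax."""
    rng = random.Random(seed)
    m = {"blue": 0.0, "yellow": 0.0}  # assumed initial estimates
    total = 0.0
    for t in range(1, n_visits + 1):
        # Rewards as in the figure: r_b=1, r_y=2 for t=1..100, then reversed.
        if t <= 100:
            reward = {"blue": 1.0, "yellow": 2.0}
        else:
            reward = {"blue": 2.0, "yellow": 1.0}
        # Two-action softmax written as a logistic function of the value difference.
        p_blue = 1.0 / (1.0 + math.exp(-beta * (m["blue"] - m["yellow"])))
        choice = "blue" if rng.random() < p_blue else "yellow"
        r = reward[choice]
        m[choice] += epsilon * (r - m[choice])  # the other estimate is left unchanged
        total += r
    return m, total


estimates, cumulated = run_indirect_actor(beta=1.0)
```

Calling the function with a low and a high beta allows comparing cumulative reward across settings; no particular outcome is claimed here, since it depends on the random draws.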


Bumble bees

• Risk aversion:

– Blue: r=2 for all flowers; yellow: r=6 for 1/3 of the flowers and r=0 for the rest. When the colors are switched at t=15, the bees adapt fast.

– A: average of 5 bees

– B: a subjective utility function with m(2) > 2/3 m(0) + 1/3 m(6) favours risk avoidance

– C: model prediction


Sequential action choice/Delayed reward

• Reward obtained after a sequence of actions

– Rat moves without backtracking. After the reward, it is removed from the maze and restarted

• Delayed reward problem:

– Choice at A has no direct reward


Sequential action choice/Delayed reward

• Policy iteration (see also Kaelbling 3.2.2):

• Loop:

– Policy evaluation: compute the value V_pi for policy pi. Run Bellman backups until convergence

– Policy improvement: Improve pi
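The loop above can be sketched on a small deterministic maze. The maze layout and rewards below (A leads to B or C; rewards 0 and 5 at B, 2 and 0 at C) are assumed for illustration and are not given in the transcript, as are the function names and the undiscounted return:

```python
# Hypothetical maze: (state, action) -> (next_state, reward); None ends the trial.
MAZE = {
    ("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
    ("B", "L"): (None, 0.0), ("B", "R"): (None, 5.0),
    ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0),
}
STATES, ACTIONS = ("A", "B", "C"), ("L", "R")


def backup(s, a, v):
    """One Bellman backup: immediate reward plus value of the next state."""
    nxt, r = MAZE[(s, a)]
    return r + (v[nxt] if nxt else 0.0)


def evaluate(policy, tol=1e-12):
    """Policy evaluation: repeat Bellman backups until the values stop changing."""
    v = {s: 0.0 for s in STATES}
    while True:
        change = 0.0
        for s in STATES:
            new = backup(s, policy[s], v)
            change = max(change, abs(new - v[s]))
            v[s] = new
        if change < tol:
            return v


def policy_iteration():
    policy = {s: "L" for s in STATES}  # arbitrary starting policy
    while True:
        v = evaluate(policy)
        # Policy improvement: act greedily with respect to the evaluated values.
        improved = {s: max(ACTIONS, key=lambda a: backup(s, a, v)) for s in STATES}
        if improved == policy:
            return policy, v
        policy = improved


best_policy, best_values = policy_iteration()
```

For this assumed maze the loop stops after a few rounds with the policy left at A, right at B, left at C.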


Sequential action choice/Delayed reward

• Actor Critic (see also Kaelbling 4.1):

• Loop:

– Critic: use TD to evaluate V(state) under the current policy

– Actor: improve policy p(state)


Policy evaluation

• Policy is random left/right at each turn.

• Implemented as TD (w=v):
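The TD update for this slide is not preserved. A minimal sketch, assuming v(state) -> v(state) + epsilon (r + v(next state) - v(state)) and a hypothetical maze in which A leads to B (left) or C (right), with rewards 0 and 5 at B and 2 and 0 at C; epsilon, the number of trials and the seed are arbitrary:

```python
import random

# Assumed rewards on leaving B or C; the choice at A gives no direct reward.
REWARD = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}


def td_evaluate_random_policy(n_trials=20000, epsilon=0.01, seed=0):
    """TD evaluation of the random left/right policy; one value per maze location."""
    rng = random.Random(seed)
    v = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(n_trials):
        # At A: no immediate reward, move to B (left) or C (right) at random.
        nxt = "B" if rng.random() < 0.5 else "C"
        v["A"] += epsilon * (0.0 + v[nxt] - v["A"])
        # At B or C: collect the reward; the trial ends, so the next value is 0.
        action = "L" if rng.random() < 0.5 else "R"
        v[nxt] += epsilon * (REWARD[(nxt, action)] + 0.0 - v[nxt])
    return v


v_random = td_evaluate_random_policy()
```

Under these assumptions the exact values for the random policy are v(B)=2.5, v(C)=1 and v(A)=1.75; the TD estimates fluctuate around them because the step size is fixed.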


Policy improvement

• Base action on expected future reward minus expected current reward

• Example: state A:

• Use epsilon greedy or softmax for exploration.
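The example for state A is missing from the transcript. One way to make it concrete, assuming a softmax actor whose action values are updated as m[a'] -> m[a'] + epsilon ((1 if a' is the chosen action else 0) - P[a']) delta, with delta = r + v(next state) - v(state) supplied by the critic. The critic values below are those of a random policy in a hypothetical maze (rewards 0 and 5 at B, 2 and 0 at C); epsilon=0.5 and all names are illustrative:

```python
import math


def softmax(m, beta=1.0):
    """Action probabilities proportional to exp(beta * m[a])."""
    top = max(m.values())
    z = {a: math.exp(beta * (x - top)) for a, x in m.items()}
    total = sum(z.values())
    return {a: x / total for a, x in z.items()}


def actor_update(m, action, delta, epsilon=0.5, beta=1.0):
    """Assumed actor rule: shift action values in the direction indicated by delta."""
    p = softmax(m, beta)
    return {
        a: m[a] + epsilon * ((1.0 if a == action else 0.0) - p[a]) * delta
        for a in m
    }


v = {"A": 1.75, "B": 2.5, "C": 1.0}  # assumed critic values under the random policy
delta_left = 0.0 + v["B"] - v["A"]   # going left from A is better than expected
delta_right = 0.0 + v["C"] - v["A"]  # going right from A is worse than expected

m_A = {"L": 0.0, "R": 0.0}
m_after_left = actor_update(m_A, "L", delta_left)
m_after_right = actor_update(m_A, "R", delta_right)
```

Either experience at A raises the preference for left over right: a positive delta after going left reinforces left, and a negative delta after going right weakens right.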


Policy improvement

• Policy improvement changes the policy, so the policy must be re-evaluated for convergence to be guaranteed

• Interleaving policy improvement (PI) and policy evaluation (PE) is called actor-critic

• Fig: actor-critic (AC) learning of the maze. NB: learning at C is slow.


Generalizations

• Discounted reward:

• TD rule changes to

• TD(lambda): apply the TD rule to update the value not only of the current state but also of recently visited states. TD(0)=TD, TD(1)=updating all past states.
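The equations on this slide are not preserved. The sketch below assumes the standard discounted TD error delta = r + gamma v(next) - v(current) and implements TD(lambda) with accumulating eligibility traces, which is one common formulation; the chain task, gamma=0.9, epsilon=0.5 and the function name are hypothetical:

```python
def td_lambda_episode(v, gamma, lam, epsilon, reward_at_end=1.0):
    """One pass through a chain 0 -> 1 -> ... -> n-1 -> end; reward on leaving the last state."""
    n = len(v)
    trace = [0.0] * n
    for s in range(n):
        last = s == n - 1
        r = reward_at_end if last else 0.0
        v_next = 0.0 if last else v[s + 1]
        delta = r + gamma * v_next - v[s]         # discounted TD error
        trace = [gamma * lam * e for e in trace]  # older states fade out
        trace[s] += 1.0                           # current state is fully eligible
        for i in range(n):
            v[i] += epsilon * delta * trace[i]    # update all recently visited states
    return v


first_td0 = td_lambda_episode([0.0] * 4, gamma=0.9, lam=0.0, epsilon=0.5)
first_td1 = td_lambda_episode([0.0] * 4, gamma=0.9, lam=1.0, epsilon=0.5)

converged = [0.0] * 4
for _ in range(500):
    td_lambda_episode(converged, gamma=0.9, lam=0.0, epsilon=0.5)
```

After a single episode, lambda=0 has changed only the last state, whereas lambda=1 has credited every state visited earlier, scaled by the discount. With repeated episodes the values approach gamma raised to the number of steps remaining before the reward.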


Water maze

• State dependent place cell activity (Foster Eq. 1). 8 actions

• Critic and Actor (Foster Eqs. 3-10)


Comparing rats and model

• Left: average performance of 12 rats, four trials per day.

• RL predicts initial learning well, but not the change to a new task.

