• Unconditioned stimulus (food) causes unconditioned response (saliva).
• Conditioned stimulus (bell) causes conditioned response (saliva).
Rescorla-Wagner Rule
• v = wu, with u the stimulus (0 or 1), w the weight, and v the prediction of the reward r. Adapt w to minimize the quadratic error <(r - v)^2>, giving \Delta w = \epsilon \delta u with \delta = r - v.
The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
• Blocking: a learned association s1 -> r prevents learning of the association s2 -> r.
• Inhibition: s2 reduces the prediction when combined with any predicting stimulus.
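A minimal sketch (not from the slides) of the Rescorla-Wagner delta rule for two stimuli, illustrating blocking; the learning rate epsilon = 0.1 and the trial counts are arbitrary choices.

import numpy as np

def rw_update(w, u, r, epsilon=0.1):
    # Rescorla-Wagner / delta rule: v = w.u, delta = r - v, w <- w + epsilon * delta * u
    v = w @ u
    delta = r - v
    return w + epsilon * delta * u

w = np.zeros(2)
# Phase 1: s1 alone is paired with reward; w[0] approaches 1
for _ in range(100):
    w = rw_update(w, np.array([1.0, 0.0]), r=1.0)
# Phase 2: s1 and s2 are paired with the same reward; delta is already ~0,
# so s2 acquires almost no weight (blocking)
for _ in range(100):
    w = rw_update(w, np.array([1.0, 1.0]), r=1.0)
print(w)  # roughly [1.0, ~0.0]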
Temporal difference learning
• Interpret v(t) as ‘total future expected reward’
• v(t) is predicted from the past
After learning, delta(t) = 0 implies:
• v(t=0) is the sum of expected future rewards.
• v(t) constant: expected reward r(t) = 0.
• v(t) decreasing: positive expected reward.
Explanation of fig. 9.2:
• Since u(t) = delta(t,0), Eq. 9.6 becomes v(t) = w(t), and Eq. 9.7 becomes \Delta w(t) = \epsilon \delta(t).
• Thus \Delta v(t) = \epsilon (r(t) + v(t+1) - v(t)), with reward r(t) = delta(t,T).
• Step 1: the only change is v(T) -> v(T) + \epsilon.
• Step 2: v(T-1) and v(T) change.
• Etc.: the prediction propagates backwards towards the time of the stimulus.
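A small sketch (not part of the slides) of this tabular TD learning for a trial with the stimulus at t = 0 and the reward only at t = T; epsilon, T and the number of trials are arbitrary choices.

import numpy as np

T = 10                # reward delivered at the final time step (arbitrary)
epsilon = 0.2         # learning rate (arbitrary)
v = np.zeros(T + 2)   # v(t); the entry at T+1 stays 0 (value after the trial ends)

for trial in range(200):
    for t in range(T + 1):
        r = 1.0 if t == T else 0.0     # r(t) = delta(t, T)
        delta = r + v[t + 1] - v[t]    # TD error delta(t)
        v[t] += epsilon * delta        # with u(t) = delta(t, 0), v(t) = w(t)

print(np.round(v[:T + 1], 2))  # after learning, v(t) ~ 1 for all t <= T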
Dopamine
• The monkey releases a button and presses another one after the stimulus to receive a reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to delta in the TD rule (fig. 9.2).
Dopamine
• Dopamine neurons encode the reward prediction error (delta). B: withholding the reward reduced neural firing, in agreement with the delta interpretation.
Static action choice
• Rewards result directly from actions
• Bees visit flowers whose color (blue, yellow) predicts the reward (sugar).
– m are the action values, encoding the expected reward; beta implements exploration.
– P are the action probabilities (softmax, see below).
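As a reminder, the softmax form of these action probabilities in the slide's notation (the exact equation is not shown here, so this is the standard form):

P_b = \frac{\exp(\beta m_b)}{\exp(\beta m_b) + \exp(\beta m_y)}, \qquad P_y = 1 - P_b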
The indirect actor model
Learn the average nectar volumes for each flower and act accordingly.
Implemented by on-line learning: when visiting a blue flower, update m_b -> m_b + \epsilon (r_b - m_b) and leave the yellow estimate m_y unchanged (and vice versa).
Fig: r_b = 1, r_y = 2 for t = 1:100 and reversed for t = 101:200. A: m_y, m_b; B-D: cumulated reward for low beta (B) and high beta (C, D).
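A minimal sketch of the indirect actor for the bee (assumptions: epsilon = 0.1, beta = 1, and the reward schedule from the figure; softmax as above):

import numpy as np

rng = np.random.default_rng(0)
epsilon, beta = 0.1, 1.0            # learning rate and exploration parameter (assumed)
m = {"blue": 0.0, "yellow": 0.0}    # action values = estimated nectar volumes

def reward(flower, t):
    # schedule from the figure: r_b = 1, r_y = 2 for t <= 100, reversed afterwards
    r = {"blue": 1.0, "yellow": 2.0} if t <= 100 else {"blue": 2.0, "yellow": 1.0}
    return r[flower]

for t in range(1, 201):
    # softmax action choice between the two flower colors
    p_blue = np.exp(beta * m["blue"]) / (np.exp(beta * m["blue"]) + np.exp(beta * m["yellow"]))
    flower = "blue" if rng.random() < p_blue else "yellow"
    # indirect actor update: only the visited flower's estimate changes
    m[flower] += epsilon * (reward(flower, t) - m[flower])

print(m)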
Bumble bees
• Risk aversion:
– Blue: r = 2 for all flowers; yellow: r = 6 for 1/3 of the flowers (0 otherwise). When the rewards are switched at t = 15, the bees adapt fast.
– A: average of 5 bees.
– B: a concave subjective utility function, m(2) > (2/3) m(0) + (1/3) m(6), favours risk avoidance.
– C: model prediction
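A small numerical check (not from the slides; the concave utility m(x) = sqrt(x) is an arbitrary illustration) of why a concave subjective utility favours the constant flower even though both colors have the same mean reward:

import math

m = lambda x: math.sqrt(x)                        # concave subjective utility (assumed shape)
utility_blue = m(2.0)                             # constant reward of 2
utility_yellow = (2/3) * m(0.0) + (1/3) * m(6.0)  # 6 with probability 1/3, 0 otherwise
print(utility_blue, utility_yellow)               # ~1.41 vs ~0.82: blue is preferred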
Sequential action choice/Delayed reward
• Reward is obtained after a sequence of actions.
– The rat moves through the maze without backtracking; after obtaining the reward it is removed from the maze and the trial restarts.
• Delayed reward problem:
– The choice at A has no direct reward.
Sequential action choice/Delayed reward
• Policy iteration (see also Kaelbling 3.2.2):
• Loop (sketched in code below):
– Policy evaluation: compute the value V_pi for policy pi; run Bellman backups until convergence.
– Policy improvement: improve pi, e.g. by acting greedily with respect to V_pi.
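A compact tabular policy-iteration sketch (the tiny two-state MDP below is made up purely for illustration, not the slides' maze):

P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},     # P[s][a] = list of (next_state, prob); made-up MDP
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},                   # R[s][a] = immediate reward; made-up numbers
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9
states, actions = [0, 1], [0, 1]

def q(s, a, V):
    # one Bellman backup for a single state-action pair
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])

pi = {s: 0 for s in states}                 # initial policy
while True:
    # policy evaluation: repeated Bellman backups under the fixed policy pi
    V = {s: 0.0 for s in states}
    for _ in range(1000):
        V = {s: q(s, pi[s], V) for s in states}
    # policy improvement: act greedily with respect to V_pi
    new_pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
    if new_pi == pi:
        break
    pi = new_pi

print(pi, V)   # here the optimal policy picks action 1 in both states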
Sequential action choice/Delayed reward
• Actor Critic (see also Kaelbling 4.1):
• Loop:
– Critic: use TD to evaluate V(state) under the current policy.
– Actor: improve the policy p(state).
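A rough actor-critic sketch for a small two-level maze (the maze structure and terminal rewards below are placeholders, not necessarily the slides' exact numbers; the actor update is a simplified "reinforce the chosen action by delta" rule):

import numpy as np

rng = np.random.default_rng(1)
# choose left/right at A, ending in B or C; then left/right there gives a terminal reward
rewards = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}  # placeholders
epsilon = 0.2                                   # learning rate (assumed)
V = {s: 0.0 for s in ["A", "B", "C"]}           # critic: state values
w = {s: np.zeros(2) for s in ["A", "B", "C"]}   # actor: preferences for (L, R)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for trial in range(2000):
    s = "A"
    while s is not None:
        a = rng.choice(2, p=softmax(w[s]))          # 0 = L, 1 = R
        if s == "A":
            s_next, r = ("B" if a == 0 else "C"), 0.0
        else:
            s_next, r = None, rewards[(s, "LR"[a])]
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + v_next - V[s]                   # TD error
        V[s] += epsilon * delta                     # critic: TD evaluation of the current policy
        w[s][a] += epsilon * delta                  # actor: strengthen actions with positive delta
        s = s_next

print({s: round(v, 2) for s, v in V.items()})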
Policy evaluation
• Policy is random left/right at each turn.
• Implemented as TD (with w = v): v(u) -> v(u) + \epsilon (r + v(u') - v(u)), where u' is the next location.
Policy improvement
• Base the action choice on the expected future reward minus the currently expected reward.
• Example: state A:
• Use epsilon-greedy or softmax for exploration (see the sketch below).
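A minimal epsilon-greedy selection sketch (epsilon = 0.1 is an arbitrary choice; a softmax alternative was sketched earlier):

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(action_values, eps=0.1):
    # with probability eps pick a random action, otherwise the currently best one
    if rng.random() < eps:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))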
Policy improvement
• Policy improvement changes the policy, so the policy must be re-evaluated before the next improvement for provable convergence.
• Interleaving policy evaluation (PE) and policy improvement (PI) is called actor-critic.
• Fig: actor-critic learning of the maze. NB: learning at C is slow.
Generalizations
• Discounted reward: v(t) = <\sum_{k \geq 0} \gamma^k r(t+k)>.
• The TD rule changes to \delta(t) = r(t) + \gamma v(t+1) - v(t).
• TD(lambda): apply the TD rule not only to the value of the current state but also to recently visited past states, weighted by an eligibility trace. TD(0) = plain TD; TD(1) updates all past states. (See the sketch below.)
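A sketch (not from the slides) of a TD(lambda) update with an accumulating eligibility trace; the number of states, gamma, lambda and epsilon are placeholders:

import numpy as np

n_states, gamma, lam, epsilon = 5, 0.9, 0.7, 0.1   # placeholder parameters
V = np.zeros(n_states)                             # state values
e = np.zeros(n_states)                             # eligibility traces

def td_lambda_step(s, r, s_next):
    # update all recently visited states in proportion to their eligibility
    global V, e
    e *= gamma * lam            # decay every trace
    e[s] += 1.0                 # bump the trace of the current state
    delta = r + gamma * V[s_next] - V[s]
    V += epsilon * delta * e    # lambda = 0 recovers plain TD; lambda = 1 updates all past states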
Water maze
• State-dependent place-cell activity (Foster et al., Eq. 1); 8 actions (movement directions).
• Critic and actor (Foster et al., Eqs. 3-10), sketched below.
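A rough sketch (my assumptions: Gaussian place fields on a unit arena and a linear critic; the actual equations are in Foster et al.) of how the critic's value estimate can be built from place-cell activities:

import numpy as np

rng = np.random.default_rng(2)
n_cells, sigma = 100, 0.1                         # number of place cells and field width (assumed)
centres = rng.uniform(0, 1, size=(n_cells, 2))    # place-field centres in a unit square (assumed)
w = np.zeros(n_cells)                             # critic weights

def place_activity(pos):
    # Gaussian place-cell activity for a 2-D position
    d2 = np.sum((centres - np.asarray(pos)) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def value(pos):
    # critic: value is a weighted sum of place-cell activities
    return w @ place_activity(pos)

def critic_update(pos, r, pos_next, gamma=0.95, epsilon=0.05):
    # TD update of the critic weights for one move pos -> pos_next with reward r
    global w
    delta = r + gamma * value(pos_next) - value(pos)
    w += epsilon * delta * place_activity(pos)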
Comparing rats and model
• Left: average performance of 12 rats, four trials per day.
• RL predicts the initial learning well, but not the change to a new task.