RL Lecture 7: Eligibility Traces
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
N-step TD Prediction
❐ Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction
❐ Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
❐ TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
  - Use V to estimate remaining return
❐ n-step TD:
  - 2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
  - n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
Learning with N-step Backups
❐ Backup (on-line or off-line): $\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$
❐ Error reduction property of n-step returns:
  $\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$
  (the maximum error using the n-step return is at most $\gamma^n$ times the maximum error using V)
❐ Using this, you can show that n-step methods converge
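As a concrete illustration, here is a minimal Python sketch of the n-step return and the corresponding tabular backup. The function names and the dict-based value table are assumptions for illustration, not part of the original slides.

```python
def n_step_return(rewards, v_end, gamma):
    """n-step return: R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n}).

    rewards: the n observed rewards r_{t+1}, ..., r_{t+n}
    v_end:   current estimate V(s_{t+n}); use 0.0 if the episode terminated within n steps
    """
    g = 0.0
    for r in reversed(rewards):          # accumulate discounted rewards back to front
        g = r + gamma * g
    return g + gamma ** len(rewards) * v_end

def n_step_backup(V, s_t, n_return, alpha):
    """Tabular backup: V(s_t) <- V(s_t) + alpha * [R_t^(n) - V(s_t)]."""
    V[s_t] += alpha * (n_return - V[s_t])
```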
Random Walk Examples
❐ How does 2-step TD work here?
❐ How about 3-step TD?
A Larger Example
❐ Task: 19-state random walk
❐ Do you think there is an optimal n (for everything)?
Averaging N-step Returns
❐ n-step methods were introduced to help with understanding TD(λ)
❐ Idea: back up an average of several returns
  - e.g., back up half of the 2-step return and half of the 4-step return:
  $R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
❐ Called a complex backup
  - Draw each component
  - Label with the weights for that component
  - The weighted combination still counts as one backup
Forward View of TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
  - weight the n-step backup by $\lambda^{n-1}$ (time since visitation)
  - λ-return: $R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
❐ Backup using λ-return: $\Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]$
λ-return Weighting Function
Relation to TD(0) and MC
❐ λ-return can be rewritten as:
  $R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
  (the sum weights the n-step returns until termination; the final term weights the complete return after termination)
❐ If λ = 1, you get MC:
  $R_t^{\lambda} = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$
❐ If λ = 0, you get TD(0):
  $R_t^{\lambda} = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$
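The episodic rewrite above translates directly into code. A small sketch (an assumed helper, not from the slides) that computes the λ-return from a list of n-step returns whose final entry is the complete return:

```python
def lambda_return(n_step_returns, lam):
    """Lambda-return for an episodic task:
    R_t^lam = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * R_t^(n) + lam^(T-t-1) * R_t

    n_step_returns: [R_t^(1), ..., R_t^(T-t)]; the last element is the complete
    (Monte Carlo) return R_t, all earlier ones are truncated, corrected returns.
    """
    *truncated, complete = n_step_returns
    weighted = sum((1 - lam) * lam ** (n - 1) * r
                   for n, r in enumerate(truncated, start=1))
    return weighted + lam ** len(truncated) * complete
```

With lam = 0 only R_t^(1) survives (the TD(0) target); with lam = 1 only the complete return survives (MC), matching the two special cases above.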
Forward View of TD(λ) II
❐ Look forward from each state to determine update from future states and rewards:
λ-return on the Random Walk
❐ Same 19-state random walk as before
❐ Why do you think intermediate values of λ are best?
Backward View of TD(λ)
❐ The forward view was for theory
❐ The backward view is for mechanism
❐ New variable called the eligibility trace: $e_t(s) \in \Re^+$
  - On each step, decay all traces by γλ and increment the trace for the current state by 1
  - Accumulating trace:
  $e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
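A Python sketch of the algorithm above with accumulating traces. The environment/policy interface (env.reset, env.step, policy) and the dict-based value table are assumptions for illustration only.

```python
def td_lambda_episode(env, policy, V, alpha, gamma, lam):
    """Run one episode of on-line tabular TD(lambda) with accumulating traces.

    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a.  V is a dict mapping state -> estimated value.
    """
    e = {s: 0.0 for s in V}              # e(s) = 0 for all s
    s = env.reset()
    done = False
    while not done:
        a = policy(s)                    # action given by pi for s
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]
        e[s] += 1.0                      # accumulating trace for the visited state
        for state in V:                  # for all s: update value, then decay trace
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam
        s = s_next
    return V
```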
Backward View
❐ Shout $\delta_t$ backwards over time:
  $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
❐ The strength of your voice decreases with temporal distance by γλ
Relation of Backwards View to MC & TD(0)
❐ Using update rule: $\Delta V_t(s) = \alpha \delta_t e_t(s)$
❐ As before, if you set λ to 0, you get TD(0)
❐ If you set λ to 1, you get MC, but in a better way
  - Can apply TD(1) to continuing tasks
  - Works incrementally and on-line (instead of waiting until the end of the episode)
Forward View = Backward View
❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
❐ The book shows:
  Backward updates: $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$
  Forward updates: $\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) I_{ss_t} = \sum_{t=0}^{T-1} \alpha I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$ (algebra shown in the book)
  Hence $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) I_{ss_t}$, where $I_{ss_t} = 1$ if $s = s_t$ and 0 otherwise.
❐ On-line updating with small α is similar
On-line versus Off-line on Random Walk
❐ Same 19-state random walk
❐ On-line performs better over a broader range of parameters
Control: Sarsa(λ)
❐ Save eligibility for state-action pairs instead of just states
$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
Sarsa(λ) Algorithm
Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    Until s is terminal
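A corresponding Python sketch of Sarsa(λ) with accumulating traces; again, the env/epsilon_greedy interface and the dict keyed by (state, action) are illustrative assumptions rather than anything prescribed by the slides.

```python
def sarsa_lambda_episode(env, epsilon_greedy, Q, alpha, gamma, lam):
    """Run one episode of tabular Sarsa(lambda) with accumulating traces.

    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done),
    epsilon_greedy(Q, s) -> a.  Q is a dict keyed by (state, action).
    """
    e = {sa: 0.0 for sa in Q}            # e(s, a) = 0 for all s, a
    s = env.reset()
    a = epsilon_greedy(Q, s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = None if done else epsilon_greedy(Q, s_next)
        q_next = 0.0 if done else Q[(s_next, a_next)]
        delta = r + gamma * q_next - Q[(s, a)]
        e[(s, a)] += 1.0                  # accumulating trace for the visited pair
        for sa in Q:                      # for all s, a: update Q, then decay trace
            Q[sa] += alpha * delta * e[sa]
            e[sa] *= gamma * lam
        s, a = s_next, a_next
    return Q
```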
Sarsa(λ) Gridworld Example
❐ With one trial, the agent has much more information about how to get to the goal
  - not necessarily the best way
❐ Can considerably accelerate learning
Three Approaches to Q(λ)
❐ How can we extend this to Q-learning?
❐ If you mark every state-action pair as eligible, you back up over a non-greedy policy
  - Watkins: zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice
  $e_t(s,a) = \begin{cases} 1 + \gamma\lambda e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$
  $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
  $\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$
Watkins’s Q(λ)
Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s′, b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′, a*) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            If a′ = a*, then e(s, a) ← γλe(s, a)
            else e(s, a) ← 0
        s ← s′; a ← a′
    Until s is terminal
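A per-step Python sketch of Watkins’s Q(λ), showing the trace being cut after an exploratory action. The dict-based representation and the helper signature are assumptions for illustration.

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    """One step of Watkins's Q(lambda).  Q and e are dicts keyed by (state, action);
    a_next is the action actually selected in s_next (e.g. by an epsilon-greedy policy)."""
    next_actions = [act for (st, act) in Q if st == s_next]
    q_max = 0.0 if done else max(Q[(s_next, act)] for act in next_actions)
    delta = r + gamma * q_max - Q[(s, a)]       # back up toward the greedy action a*
    e[(s, a)] += 1.0
    took_greedy = done or Q[(s_next, a_next)] == q_max
    for sa in Q:
        Q[sa] += alpha * delta * e[sa]
        # decay the trace if the chosen action was greedy, otherwise cut it to zero
        e[sa] = gamma * lam * e[sa] if took_greedy else 0.0
```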
Peng’s Q(λ)
❐ Disadvantage of Watkins’s method:
  - Early in learning, the eligibility trace will be “cut” (zeroed out) frequently, resulting in little advantage from traces
❐ Peng:
  - Back up the max action except at the end
  - Never cut traces
❐ Disadvantage:
  - Complicated to implement
Naïve Q(λ)
❐ Idea: is it really a problem to back up exploratory actions?
  - Never zero traces
  - Always back up the max at the current action (unlike Peng’s or Watkins’s)
❐ Is this truly naïve?
❐ Works well in preliminary empirical studies
What is the backup diagram?
Comparison Task
From McGovern and Sutton (1997). Towards a better Q(λ)
❐ Compared Watkins’s, Peng’s, and Naïve (called McGovern’s here) Q(λ) on several tasks
  - See McGovern and Sutton (1997), “Towards a Better Q(λ)”, for other tasks and results (stochastic tasks, continuing tasks, etc.)
❐ Deterministic gridworld with obstacles
  - 10x10 gridworld
  - 25 randomly generated obstacles
  - 30 runs
  - α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
Comparison Results
From McGovern and Sutton (1997). Towards a better Q(λ)
Convergence of the Q(λ)’s
❐ None of the methods are proven to converge.
  - Much extra credit if you can prove any of them.
❐ Watkins’s is thought to converge to Q*
❐ Peng’s is thought to converge to a mixture of Qπ and Q*
❐ Naïve - Q*?
Eligibility Traces for Actor-Critic Methods
❐ Critic: On-policy learning of Vπ. Use TD(λ) as described before.
❐ Actor: Needs eligibility traces for each state-action pair.
❐ We change the update equation from
  $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
  to
  $p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t e_t(s,a)$
❐ Can change the other actor-critic update from
  $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\delta_t [1 - \pi_t(s,a)] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
  to
  $p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t e_t(s,a)$
  where
  $e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$
Replacing Traces
❐ Using accumulating traces, frequently visited states can have eligibilities greater than 1
  - This can be a problem for convergence
❐ Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1
$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$
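In code, replacing traces change only one line relative to accumulating traces: the visited state’s trace is reset to 1 instead of incremented. A small sketch in the same illustrative dict-based form as above:

```python
def update_replacing_traces(e, s_t, gamma, lam):
    """Replacing-trace update: decay all traces by gamma*lambda, then set the
    visited state's trace to 1 (rather than adding 1 to it)."""
    for s in e:
        e[s] *= gamma * lam
    e[s_t] = 1.0
```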
Replacing Traces Example
❐ Same 19-state random walk task as before
❐ Replacing traces perform better than accumulating traces over more values of λ
Why Replacing Traces?
❐ Replacing traces can significantly speed learning
❐ They can make the system perform well for a broader set of parameters
❐ Accumulating traces can do poorly on certain types of tasks
Why is this task particularly onerous for accumulating traces?
More Replacing Traces
❐ Off-line replacing trace TD(1) is identical to first-visit MC
❐ Extension to action values:
  - When you revisit a state, what should you do with the traces for the other actions?
  - Singh and Sutton say to set them to zero:
  $e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$
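A sketch of the Singh and Sutton variant for action values, under the same illustrative dict representation: in the revisited state, the taken action’s trace is set to 1 and the other actions’ traces are zeroed.

```python
def update_replacing_traces_sa(e, s_t, a_t, gamma, lam):
    """Replacing traces for state-action pairs (Singh & Sutton variant)."""
    for (s, a) in e:
        if s == s_t:
            e[(s, a)] = 1.0 if a == a_t else 0.0   # taken action -> 1, others -> 0
        else:
            e[(s, a)] *= gamma * lam               # decay traces for other states
```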
Implementation Issues
❐ Could require much more computation
  - But most eligibility traces are VERY close to zero
❐ If you implement it in Matlab, backup is only one line of code and is very fast (Matlab is optimized for matrices)
Variable λ
❐ Can generalize to variable λ
  $e_t(s) = \begin{cases} \gamma\lambda_t e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
❐ Here λ is a function of time
  - Could define $\lambda_t = \lambda(s_t)$ or $\lambda_t = \lambda_\tau$
Conclusions
❐ Eligibility traces provide an efficient, incremental way to combine MC and TD
  - Include the advantages of MC (can deal with lack of the Markov property)
  - Include the advantages of TD (use the TD error, bootstrapping)
❐ Can significantly speed learning
❐ Do have a cost in computation
Something Here is Not Like the Other