Page 1

Reinforcement Learning

Function Approximation

Continuous state/action space, mean-square error, gradient temporal difference learning,
least-square temporal difference, least-squares policy iteration

Vien Ngo, Marc Toussaint

University of Stuttgart

Page 2

Outline

• Function Approximation

– Gradient Descent Methods.

– Least-Square Temporal Difference.

Page 3

Value Iteration in Continuous MDP

V(s) = sup_a [ r(s, a) + γ ∫ P(s′ | s, a) V(s′) ds′ ]
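As a concrete (if simplistic) illustration, the backup above can be approximated by sampling: the integral over next states becomes a Monte-Carlo average and the supremum a maximum over a finite set of candidate actions. The sketch below is not from the slides; the helper names (transition_sample, reward) are assumptions.

```python
import numpy as np

def approx_backup(s, V, actions, transition_sample, reward, gamma=0.95, n_samples=100):
    """One approximate value-iteration backup at state s:
    V(s) ~ max_a [ r(s, a) + gamma * E_{s' ~ P(.|s,a)} V(s') ],
    with the expectation over s' estimated from Monte-Carlo samples."""
    values = []
    for a in actions:
        next_states = [transition_sample(s, a) for _ in range(n_samples)]
        values.append(reward(s, a) + gamma * np.mean([V(sp) for sp in next_states]))
    return max(values)
```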

Page 4

Continuous state/actions in model-free RL

• All of this is fine in small finite state and action spaces: Q(s, a) is a |S| × |A| matrix of numbers, and π(a|s) is a |S| × |A| matrix of numbers.

• In the following: two examples for handling continuous states/actions
– use function approximation to estimate Q(s, a): gradient descent (TD with function approximation), LSPI;
– optimize a parameterized π(a|s) (policy search; next lecture).

Page 5

Value Function Approximation

(from Satinder Singh, RL: A tutorial at videolectures.net)

• Estimate of the value function:

V_t(s) = V(s, θ_t)

Page 6

Performance Measure

• Minimize the mean-squared error (MSE) over some distribution P of the states:

MSE(β_t) = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²

where V^π(s) is the true value function of the policy π.

• Set P to the stationary distribution of policy π in on-policy learning methods (e.g. SARSA).

Page 7

Value Function Approximation

• The estimated value function:

V(s, β_t) = β_t^⊤ φ(s)

where β ∈ ℝ^d is a vector of parameters and φ : S → ℝ^d maps states to d-dimensional feature vectors.

– Examples: polynomial, RBF, Fourier, and wavelet bases, tile coding (these suffer from the curse of dimensionality).

• Nonparametric methods: k-nearest neighbor, nonparametric kernel smoothing, spline smoothers, Gaussian process regression, ...
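For the parametric case, here is a minimal sketch (assumed names, not from the slides) of a linear value function V(s, β) = β^⊤ φ(s) with Gaussian RBF features on a one-dimensional state space:

```python
import numpy as np

def rbf_features(s, centers, width=0.5):
    """phi(s): Gaussian RBF activations of a scalar state s at fixed centers."""
    return np.exp(-((s - centers) ** 2) / (2.0 * width ** 2))

centers = np.linspace(0.0, 1.0, 10)   # d = 10 basis functions
beta = np.zeros_like(centers)         # parameter vector beta in R^d

def value(s, beta):
    """V(s, beta) = beta^T phi(s)."""
    return beta @ rbf_features(s, centers)
```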

Page 8

Value Function Approximation

(Example with 10^35 states, 10^5 binary features and parameters; Sutton, presentation at ICML 2009.)

Page 9

TD(λ) with Function Approximation

• The gradient at any point β_t:

∇MSE(β_t) = −2 Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇V(s, β_t)
          = −2 Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] φ(s)

• Applying stochastic approximation and bootstrapping, we can iteratively update the parameters (TD(0) with function approximation):

β_{t+1} = β_t + α_t [ r_t + γ V(s′, β_t) − V(s, β_t) ] φ(s)

• TD(λ) (with eligibility trace):

e_{t+1} = γλ e_t + φ(s)
β_{t+1} = β_t + α_t [ r_t + γ V(s′, β_t) − V(s, β_t) ] e_{t+1}
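A minimal sketch of these updates for a linear value function; the feature map phi and the per-transition interface are assumptions, not part of the slides. With lam = 0 this reduces to the TD(0) update.

```python
import numpy as np

def td_lambda_update(beta, e, s, r, s_next, phi,
                     alpha=0.05, gamma=0.99, lam=0.8, done=False):
    """One TD(lambda) update with linear function approximation.
    beta: parameter vector; e: eligibility trace (same shape as beta)."""
    v = beta @ phi(s)
    v_next = 0.0 if done else beta @ phi(s_next)
    delta = r + gamma * v_next - v          # TD error
    e = gamma * lam * e + phi(s)            # accumulate eligibility trace
    beta = beta + alpha * delta * e         # parameter update
    return beta, e
```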

Page 10

TD(λ) with Function Approximation (gradient-descent SARSA(λ))

Repeat (for each episode):

• e = 0

• initial state s = s_0

• Repeat (for each step of the episode):
  a_t = π(s)
  take a_t, observe r_t, s′
  e_{t+1} = γλ e_t + φ(s, a_t)
  β_{t+1} = β_t + α_t [ r_t + γ Q(s′, π(s′), β_t) − Q(s, a_t, β_t) ] e_{t+1}
  s ← s′

• until s is terminal.
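A sketch of this loop in code, assuming a linear Q-function, an ε-greedy policy, and a generic environment interface (env.reset, env.step); none of these specifics are prescribed by the slides.

```python
import numpy as np

def sarsa_lambda(env, phi_sa, n_features, actions,
                 episodes=200, alpha=0.05, gamma=0.99, lam=0.8, eps=0.1):
    """Gradient-descent SARSA(lambda) with linear Q(s, a) = beta^T phi(s, a).
    The environment interface (env.reset(), env.step(a) -> (s', r, done))
    and the epsilon-greedy policy are illustrative assumptions."""
    beta = np.zeros(n_features)

    def policy(s):
        # epsilon-greedy with respect to the current linear Q-estimate
        if np.random.rand() < eps:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: beta @ phi_sa(s, a))

    for _ in range(episodes):
        e = np.zeros(n_features)              # reset eligibility trace per episode
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            target = r + (0.0 if done else gamma * (beta @ phi_sa(s2, a2)))
            delta = target - beta @ phi_sa(s, a)      # TD error
            e = gamma * lam * e + phi_sa(s, a)        # accumulating trace
            beta = beta + alpha * delta * e           # parameter update
            s, a = s2, a2
    return beta
```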

Page 11

TD(λ) with Function Approximation

• Convergence proof: assumes the stochastic process s_t is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP (i.e. the on-policy distribution).

• The convergence property (Tsitsiklis & Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, 1997):

MSE(β_∞) ≤ (1 − γλ)/(1 − γ) · MSE(β*)

• Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)? In general, no: off-policy TD with linear function approximation can diverge, which motivates the gradient TD methods on the following slides.

Page 12

Gradient temporal difference learning

• GTD (gradient temporal difference learning)

• GTD2 (gradient temporal difference learning, version 2)

• TDC (temporal difference learning with corrections)

1. Sutton, Szepesvári, and Maei: A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. NIPS 2008.

2. Sutton, Maei, Precup, Bhatnagar, Silver, Szepesvári, Wiewiora: Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML 2009.

Page 13

Value function geometry

• Bellman operator:

T V = R + γ P V

(Figure: geometry of the value-function space relative to the subspace spanned by the feature vectors.)

RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error

Page 14

TD performance measure

• Error from the true value: ||V_β − V*||

• Error in the Bellman update (used in the previous section: gradient-descent methods): ||V_β − T V_β||

• Error in the Bellman update after projection: ||V_β − Π T V_β||

Page 15

TD performance measure

• GTD(0): the norm of the expected TD update,

NEU(β) = E[δφ]^⊤ E[δφ]

• GTD2 and TDC: the norm of the expected TD update, weighted by the inverse covariance matrix of the features,

MSPBE(β) = E[δφ]^⊤ E[φφ^⊤]^{-1} E[δφ]

(δ is the TD error. GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)
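For concreteness, here is a per-transition sketch of the TDC update from the Sutton et al. (ICML 2009) reference on Page 12; θ are the value parameters, w an auxiliary weight vector, and the function and parameter names are mine, not from the slides.

```python
import numpy as np

def tdc_update(theta, w, phi_s, phi_next, r, alpha=0.01, alpha_w=0.05, gamma=0.99):
    """One TDC update for linear value estimation.
    theta: value parameters; w: auxiliary weights used to correct the gradient."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi_s        # TD error
    theta = theta + alpha * (delta * phi_s - gamma * phi_next * (phi_s @ w))
    w = w + alpha_w * (delta - phi_s @ w) * phi_s                 # tracks E[phi phi^T]^-1 E[delta phi]
    return theta, w
```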

Page 16

• These gradient TD algorithms with linear function approximation are guaranteed to converge under both general on-policy and off-policy training.

• The computational complexity is only O(n) (n is the number of features).

• The curse of dimensionality is removed.

Page 17

LSPI: Least Squares Policy Iteration

• Gradient-descent methods are sensitive to the choice of learning rates and initial parameter values.

• An alternative: least-squares temporal difference (LSTD) methods, on which LSPI builds, in two variants:

– Bellman residual minimization

– Least Squares Fixed-Point Approximation

Page 18

Bellman residual minimization

• The Q-function of a given policy π satisfies, for any (s, a):

Q^π(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) Q^π(s′, π(s′))

• Given n data points D = {(s_i, a_i, r_i, s′_i)}_{i=1}^n, we require that this equation holds (approximately) for these n data points:

∀i : Q^π(s_i, a_i) = r_i + γ Q^π(s′_i, π(s′_i))

• Written in vector notation: Q = R + γ Q̄, with n-dimensional data vectors Q, R, Q̄.

• Written as an optimization: minimize the Bellman residual error

L(Q^π) = ||R + γ P Π Q^π − Q^π||²   (true residual)
       ≈ Σ_{i=1}^n [ Q^π(s_i, a_i) − r_i − γ Q^π(s′_i, π(s′_i)) ]² = ||Q − R − γ Q̄||²

Page 19

Bellman residual minimization

• The Bellman residual minimizing solution (of this overconstrained system, in the least-squares sense):

β^π = ( (Φ − γ P Π Φ)^⊤ (Φ − γ P Π Φ) )^{-1} (Φ − γ P Π Φ)^⊤ r

• The solution β^π is unique, since the columns of Φ (the basis functions) are linearly independent by definition. (See Lagoudakis & Parr, JMLR 2003, for details.)

Page 20

LSPI: Least Squares Fixed-Point Approximation

• Project T^π Q back onto span(Φ):

T̂^π(Q) = Φ (Φ^⊤ Φ)^{-1} Φ^⊤ (T^π Q)

• The approximate fixed point:

β^π = ( Φ^⊤ (Φ − γ P Π Φ) )^{-1} Φ^⊤ r
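To make the contrast between the two solutions on this and the previous slide concrete, here is a small model-based sketch. It assumes the feature matrix Φ, the policy-coupled transition matrix PΠ, and the reward vector r are available explicitly (in practice they are not; the sample-based LSTDQ on the next slides estimates the same quantities from data).

```python
import numpy as np

def brm_weights(Phi, P_pi, r, gamma=0.95):
    """Bellman residual minimization: least-squares solution of
    (Phi - gamma * P_pi @ Phi) beta ~ r."""
    A = Phi - gamma * (P_pi @ Phi)
    return np.linalg.solve(A.T @ A, A.T @ r)

def fixed_point_weights(Phi, P_pi, r, gamma=0.95):
    """Least-squares fixed-point approximation:
    beta = (Phi^T (Phi - gamma * P_pi @ Phi))^-1 Phi^T r."""
    return np.linalg.solve(Phi.T @ (Phi - gamma * (P_pi @ Phi)), Phi.T @ r)
```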

Page 21

LSPI: Comparison of the two views

• The Bellman residual minimizing method focuses on the magnitude of the change.

• The least-squares fixed-point approximation focuses on the direction of the change.

• The least-squares fixed-point approximation is (in theory) less stable and less predictable.

• Nevertheless, the least-squares fixed-point method is often preferable, because
– learning the Bellman residual minimizing approximation requires double (independent) samples of the next state, and
– experimentally, it often delivers superior policies.

(See Lagoudakis & Parr, JMLR 2003, for details.)

Page 22

LSPI: LSTDQ algorithm

A = Φ^⊤ (Φ − γ P Π Φ),   b = Φ^⊤ r,   β^π = A^{-1} b

Sample-based estimate from the data set D:

• A ← 0, b ← 0

• For each (s, a, r, s′) ∈ D:
  A ← A + φ(s, a) ( φ(s, a) − γ φ(s′, π(s′)) )^⊤
  b ← b + φ(s, a) r

• β ← A^{-1} b
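A sample-based sketch of LSTDQ as listed above; the small ridge term on A is my addition (not in the slides) to keep the matrix invertible for small data sets.

```python
import numpy as np

def lstdq(D, policy, phi_sa, n_features, gamma=0.95, reg=1e-6):
    """LSTDQ: accumulate A and b over the data set D of (s, a, r, s') tuples
    and return beta = A^-1 b for the linear Q-function Q(s, a) = beta^T phi(s, a)."""
    A = reg * np.eye(n_features)          # ridge term: assumption, keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next in D:
        phi = phi_sa(s, a)
        phi_next = phi_sa(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)
```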

Page 23

LSPI algorithm

Given D and an initial policy π′:

• repeat
  π ← π′
  π′ ← LSTDQ(π)   (π′ is the greedy policy with respect to Q(·, ·; β^π))
• until π′ ≈ π
• return π
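And a sketch of this outer loop, reusing the lstdq sketch above; stopping when the weight vector stops changing is an assumption standing in for "until the policy no longer changes".

```python
import numpy as np

def lspi(D, phi_sa, actions, n_features, gamma=0.95, max_iter=20, tol=1e-4):
    """LSPI: alternate policy evaluation (LSTDQ) and greedy policy improvement."""
    beta = np.zeros(n_features)
    for _ in range(max_iter):
        # greedy policy with respect to the current Q-estimate
        policy = lambda s, b=beta: max(actions, key=lambda a: b @ phi_sa(s, a))
        beta_new = lstdq(D, policy, phi_sa, n_features, gamma)
        if np.linalg.norm(beta_new - beta) < tol:   # weights (and hence policy) stopped changing
            return beta_new
        beta = beta_new
    return beta
```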

Page 24

LSPI: Riding a bike

(from Alma A. M. Rahat’s simulation)

• States: (θ, θ̇, ω, ω̇, ω̈, ψ), where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal.

• Actions: (τ, ν). τ ∈ {−2, 0, 2} is the torque applied to the handlebar, ν ∈ {−0.02, 0, 0.02} is the displacement of the rider.

• For each action a, the value function Q(s, a) uses 20 features:

(1, ω, ω̇, ω², ωω̇, θ, θ̇, θ², θ̇², θθ̇, ωθ, ωθ², ω²θ, ψ, ψ², ψθ, ψ̄, ψ̄², ψ̄θ)

where ψ̄ = sign(ψ)·π − ψ.

Page 25

LSPI: Riding a bike

(Figure from Lagoudakis & Parr, JMLR 2003.)

Page 26

LSPI: Riding a bike

• Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode for up to 20 steps using a purely random policy.

• Each successful ride must complete a distance of 2 kilometers.

• This experiment was repeated 100 times.

(from Lagoudakis & Parr, JMLR 2003)

Page 27

Feature Selection/Building Problems

• Feature selection.

• Online/incremental feature learning.

Wu and Givan (2005); Keller et al. (2006); Mahadevan et al. (2006); Parr et al. (2007); Kolter and Ng (2009); Boots and Gordon (2010); Mahadevan and Liu (2010); Sun et al. (2011); etc.
