  • RL 10: Algorithms for Large State Spaces: RL with Function Approximation

    Michael Herrmann

    University of Edinburgh, School of Informatics

    12/02/2016

  • Overview

    Algorithms for spatially continuous problems
    Basis functions
    Reformulation of algorithms in terms of gradients


  • Large State Spaces

    Grid-world algorithms: V(s) is a vector, Q(s, a) a matrix (“look-up table”).
    In large problems, in particular in continuous domains where discretisation is not obvious, the complexity is often beyond practical limits:

      storage space
      exploration time
      convergence time

    Generalisation and flexibility are low.
    Possible approaches:

      Hierarchical representations
      Vector quantisation for efficient state space representation
      Function approximation


  • Function approximation

    Use methods that are known to generalise well.
    Learning from examples: reconstruct the underlying function by supervised learning, e.g. with neural networks:

    given samples {xi, yi}, i = 1, …, M
    initialise weight vector/matrix θ
    realised function: f(·, θ), i.e. y = f(x, θ)
    error measure: E = (1/2) ∑_{i=1}^{M} ‖yi − f(xi, θ)‖²
    update weights, e.g. simply by θt = θt−1 − η ∂E/∂θ, i.e. for a single datum:

      ∆θt = η (yi − f(xi, θ)) ∇θ f(xi, θ)

    try to control generalisation properties
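    A minimal sketch of this supervised update in Python with NumPy; the quadratic feature map, learning rate and sample data are illustrative assumptions, not part of the slides, and any differentiable f(x, θ) could be used instead:

      import numpy as np

      def features(x):
          # illustrative (assumed) feature map; the model is f(x, theta) = theta . features(x)
          return np.array([1.0, x, x ** 2])

      def f(x, theta):
          return theta @ features(x)

      def sgd_fit(xs, ys, eta=0.05, epochs=200):
          theta = np.zeros(3)                              # initialise weight vector
          for _ in range(epochs):
              for x, y in zip(xs, ys):
                  err = y - f(x, theta)                    # yi - f(xi, theta)
                  theta = theta + eta * err * features(x)  # for this linear model, grad_theta f = features(x)
          return theta

      # given samples {xi, yi}: noisy observations of an unknown target function
      xs = np.linspace(-1.0, 1.0, 20)
      ys = np.sin(2 * xs) + 0.05 * np.random.randn(20)
      theta = sgd_fit(xs, ys)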


  • Function approximation in RL

    in RL: x encodes the state, f represents the value function or policy
    Problems:

    Distribution of samples depends on policy

      E = (1/2) ∑_{x∈S} d(x) ‖y(x) − f(x, θ)‖²

    where d is a distribution over states, ∑_{x∈S} d(x) = 1.
    While the state x is given, we do not have immediate access to the target if y is the value function.

    Possible solution (see slide 24):

      1 Agent produces samples
      2 Estimate value function on samples
      3 Use estimate to train the function approximator
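    A minimal sketch of this three-step procedure in Python (NumPy): step 1 is assumed to have produced a list of episodes, each a list of (state, reward) pairs; step 2 forms Monte-Carlo return estimates; step 3 fits a linear approximator by least squares. The data format and the choice of least squares are illustrative assumptions.

      import numpy as np

      def mc_value_estimates(episodes, gamma=0.99):
          # step 2: Monte-Carlo return for every visited state in the collected episodes
          states, targets = [], []
          for episode in episodes:                 # episode = [(state, reward), ...] (assumed format)
              G = 0.0
              for state, reward in reversed(episode):
                  G = reward + gamma * G
                  states.append(state)
                  targets.append(G)
          return states, np.array(targets)

      def train_approximator(phi, states, targets):
          # step 3: supervised least-squares fit of V_theta(x) = theta . phi(x) to the estimates
          Phi = np.array([phi(x) for x in states])
          theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
          return theta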

    Another possibility: Use δ error also in function approximation


  • Function approximation for TD(0)

    Use δ-error

    δt = rt+1 + γV̂ (xt+1, θ)− V̂ (xt , θ)

    for parameter update (learning rate η)

    ∆θ = ηδtζt

    with ζt = ∇θ V̂(xt, θ)

    Initialisation: θ = (0, …, 0)⊤

    Termination: max δ sufficiently small
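    A minimal sketch of one TD(0) update in Python, written for the linear case V̂(x, θ) = θ·ϕ(x) so that the gradient is simply ϕ(x); the feature map, learning rate and discount are assumptions supplied by the caller, and θ and ϕ(x) are NumPy arrays:

      def td0_update(theta, phi, x, r, x_next, gamma=0.99, eta=0.1):
          # delta_t = r_{t+1} + gamma V(x_{t+1}, theta) - V(x_t, theta)
          delta = r + gamma * theta @ phi(x_next) - theta @ phi(x)
          zeta = phi(x)                      # zeta_t = grad_theta V(x_t, theta) in the linear case
          return theta + eta * delta * zeta  # Delta theta = eta delta_t zeta_t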


  • Function approximation for TD(λ)

    Use δ-error

    δt,1 = rt+1 + γV̂ (xt+1, θ)− V̂ (xt , θ)

    for parameter update (learning rate η)

    ∆θ = ηδtζt

    including eligibility trace ζt = (ζt,1, …, ζt,n)⊤ with n ≫ (1 − λ)^−1,

      ζt = γλ ζt−1 + ∇θ V̂(xt, θ)

    δt = (δt,1, …, δt,n) is also a vector containing the past δ values.

    Initialisation: ζ0 = (0, …, 0)⊤, θ = (0, …, 0)⊤ (or small random)

    Termination: max δ sufficiently small


  • Remarks

    The “true value” rt+1 + γ V̂(xt+1, θ) in the error term contains the current approximation of the value function and is updated only in the next step.
    For λ = 0, the algorithm becomes a standard gradient descent (but with respect to an estimate), compare slide 4.
    For λ > 0 a trace over previous errors is assumed to improve the current error estimate.
    The λ > 0 case is very appropriate here, since we are assuming that the value function is smooth. Furthermore, the information from previous states does not enter the value function just for a specific state but typically (via the parameters) for some region in state space.
    The gradient w.r.t. θ must be calculated, which is possible in parametric function approximation: the parametrisation should be expressive and convenient.


  • Using basis functions as representation of features

    The value function can be represented e.g. in linear form

      Vθ(x) = θ⊤ϕ(x) = ∑_{i=1}^{N} θi ϕi(x)

    where x ∈ R^D is the state, θ ∈ R^N the parameter vector, and ϕ : R^D → R^N the feature vector, ϕ(x) = (ϕ1(x), …, ϕN(x))⊤

    Many choices for the basis functions ϕ are possible, e.g. RBFs, including the trivial case of the look-up table representation for θs = V(s) and

      ϕs(x) = 1 if round(x) = s, 0 otherwise
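    A minimal sketch of this linear representation in Python (NumPy); the rounding-based indicator features for a 1-D state and the table size are illustrative assumptions:

      import numpy as np

      N = 10                                       # number of table cells / features

      def phi(x):
          # indicator features: phi_s(x) = 1 if round(x) = s, else 0 (look-up table case)
          features = np.zeros(N)
          s = int(np.clip(round(x), 0, N - 1))
          features[s] = 1.0
          return features

      def V(x, theta):
          return theta @ phi(x)                    # V_theta(x) = theta^T phi(x)

      theta = np.zeros(N)                          # theta_s plays the role of V(s)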


  • Feature spaces

    Vθ(x) = θ⊤ϕ(x) is a linear (weighted) sum of non-linear functions
    Can be universal function approximators (RBF network)
    θ ∈ R^N

      parameter vector or weight vector
      carries the information about the current estimate of the value function

    ϕ : X → R^N

      ϕ(x) = (ϕ1(x), …, ϕN(x))⊤
      ϕi : X → R is a basis function
      ϕi(x): a feature of the state x
      Examples: polynomial, wavelets, RBF, …

    mathematically convenient: easily differentiable ⇒ gradient
    Other parametrisations may be more effective. Currently, deep networks are of interest in RL.


  • Radial basis functions

    For a function f : R^D → R choose parameters such that

      f(x) ≈ θ⊤ϕ(x)

    with ϕ : R^D → R^N, e.g.

      ϕi(x) = exp(−‖x − x^(i)‖² / (2σ²))

    with i = 1, …, N.

    Example: N = 2, θ = (1, 1)

    Determine θ ∈ R^N by ‖f − θ⊤ϕ‖ → min
    Solution: see e.g. http://en.wikipedia.org/wiki/Radial_basis_function_network#Training
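    A minimal sketch in Python (NumPy) of determining θ for an RBF representation by least squares; the centres x^(i), the width σ and the target function are illustrative assumptions:

      import numpy as np

      centres = np.linspace(0.0, 1.0, 8)           # the x^(i), i = 1, ..., N
      sigma = 0.15

      def phi(x):
          # phi_i(x) = exp(-|x - x^(i)|^2 / (2 sigma^2))
          return np.exp(-(x - centres) ** 2 / (2 * sigma ** 2))

      # sample a target function f and solve ||f - theta^T phi|| -> min by least squares
      xs = np.linspace(0.0, 1.0, 100)
      f_vals = np.sin(2 * np.pi * xs)
      Phi = np.array([phi(x) for x in xs])
      theta, *_ = np.linalg.lstsq(Phi, f_vals, rcond=None)
      approx = Phi @ theta                         # f(x) ≈ theta^T phi(x) at the sample points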


  • Using normalisation (Kernel Smoothing)

    Define ϕi(x) = φ(‖x − x^(i)‖) / ∑_{m=1}^{N} φ(‖x − x^(m)‖) for a covering∗ set of functions φ

      Vθ(x) = ∑_{i=1}^{N} θi φ(‖x − x^(i)‖) / ∑_{m=1}^{N} φ(‖x − x^(m)‖)

    More generally,

      Vθ(x) = ∑_{i=1}^{N} θi gi(x)

    satisfying the conditions gi(x) > 0 and ∑_{i=1}^{N} gi(x) = 1 ∀x

    Vθ is an “averager”, which mixes the values of θ differently at different points in space

    ∗ Denominator should never be zero: all x covered by some of the φi
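    A minimal sketch of the normalisation step in Python (NumPy), reusing Gaussian φ; centres and width are illustrative assumptions:

      import numpy as np

      centres = np.linspace(0.0, 1.0, 8)
      sigma = 0.15

      def g(x):
          # normalised (kernel-smoothing) features: g_i(x) > 0 and sum_i g_i(x) = 1
          raw = np.exp(-(x - centres) ** 2 / (2 * sigma ** 2))   # phi(|x - x^(i)|)
          return raw / raw.sum()                                 # divide by sum_m phi(|x - x^(m)|)

      def V(x, theta):
          return theta @ g(x)                      # an "averager" of the theta values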


  • Variants of look-up table implementations

    Binary features: ϕ(x) ∈ {0, 1}^N

      Vθ(x) = ∑_{i: ϕi(x)=1} θi

    Interesting case: only few components of ϕ are non-zero (sparse) and the relevant indexes can be computed efficiently.
    State aggregation: indicator function over a certain region in state space.
    Can easily implement hierarchical value functions.
    Tile coding: CMAC (Cerebellar Model Articulation Controller, Albus 1971) uses partially overlapping hyper-rectangles.
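    A minimal sketch of tile coding for a 1-D state in Python (NumPy); the number of tilings, tiles per tiling and offsets are illustrative assumptions (practical CMAC implementations typically add hashing for large spaces):

      import numpy as np

      NUM_TILINGS, TILES = 4, 8                    # partially overlapping partitions of [0, 1)

      def active_tiles(x):
          # indices of the few non-zero binary features phi_i(x) for a state x in [0, 1)
          idx = []
          for t in range(NUM_TILINGS):
              offset = t / (NUM_TILINGS * TILES)   # shift each tiling slightly
              tile = int((x + offset) * TILES) % TILES
              idx.append(t * TILES + tile)
          return idx

      def V(x, theta):
          return sum(theta[i] for i in active_tiles(x))   # sum over i with phi_i(x) = 1

      theta = np.zeros(NUM_TILINGS * TILES)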


  • Curse of dimensionality

    Tile-code spaces are usually huge ⇒ use only cells that are actually visited.
    Example: a robot with 6 DoF is characterised by 6 positions and 6 velocities, but e.g. cameras will produce high-dimensional state spaces ⇒ use projection methods (e.g. non-linear PCA or deep neural networks).
    Often there are not too many data points ⇒ use non-parametric methods.


  • TD(0) with linear function approximation (see above)

    Express changes of the value function as changes of parameters.
    Changes in parameters are usually small, so the δ rule

      δt+1 = rt+1 + γ V̂t(st+1) − V̂t(st)

      V̂t+1(st) := V̂t(st) + η δt+1  ⇔  ∆V̂t+1(st) = η δt+1

    becomes for Vθ(x) = θ⊤ϕ(x)

      ∆θ = η (∇θ Vθt(xt)) δt+1 = η ϕ(xt) δt+1

    We assume that the (finite) changes of the value function are linearly reflected in parameter changes and use the chain rule. Alternatively, use gradient descent on the fit function.


  • TD(λ) with linear function approximation (see above)

    Given initial values of θ in Vθ = θ⊤ϕ and of the eligibility traces z0 = (0, …, 0), and previous state xt and next state xt+1:

      δt+1 = rt+1 + γ Vθt(xt+1) − Vθt(xt)
      zt+1 = ∇θ Vθt(xt) + γλ zt
      θt+1 = θt + ηt δt+1 zt+1

    where ∇θ f(θ) = (∂f(θ)/∂θ1, …, ∂f(θ)/∂θN)⊤ is the gradient of f(θ).

    For Vθ = θ⊤ϕ we have simply ∇θ Vθ(x) = (ϕ1(x), …, ϕN(x)).

    Here, eligibility traces measure how much a parameter contributed to V now and, weighted by λ, in the past.


  • Algorithm: TD(λ) with function approximation

    x last state, y next state, r immediate reward, θ parameter vector, z vector of eligibility traces

      1  δ ← r + γ θ⊤ϕ[y] − θ⊤ϕ[x]
      2  z ← ϕ[x] + γλ z
      3  θ ← θ + α δ z
      4  return (θ, z)

    Note: assumes a linear approximation of V
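    A minimal sketch of this routine in Python; the argument names follow the slide, the feature map ϕ, step size and discount are assumed to be supplied by the caller, and θ, z, ϕ(x) are NumPy arrays:

      def td_lambda_lin_fapp(x, y, r, theta, z, phi, gamma=0.99, lam=0.9, alpha=0.05):
          # one TD(lambda) update for the linear value function V(x) = theta . phi(x)
          delta = r + gamma * theta @ phi(y) - theta @ phi(x)   # step 1
          z = phi(x) + gamma * lam * z                          # step 2: eligibility trace
          theta = theta + alpha * delta * z                     # step 3
          return theta, z                                       # step 4

      # usage per transition: theta, z = td_lambda_lin_fapp(x, y, r, theta, z, phi)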


  • Linear SARSA (see next slide) for the mountain car problem

    Matt Kretchmar, 1995


  • Linear SARSA(λ) (see Fig 9.8 in S&B 2nd Ed.)


  • Q-learning with function approximation

    Recall Qt+1(xt, at) = Qt(xt, at) + α (rt+1 + γ V(xt+1) − Qt(xt, at))

    Now

      δt+1 = rt+1 + γ V(xt+1) − Qt(xt, at)
      θt+1 = θt + αt δt+1(Qθt) ∇θ Qθt(xt, at)

    with Qθt = θ⊤ϕ, where ϕ : X × A → R^N is a basis function over the state-action space. V is given as the maximum of Q w.r.t. a.


  • Algorithm: Q-learning with function approximation

    x last state, a last action, y next state, r immediate reward, θ parameter vector

      1  δ ← r + γ max_{a′∈A} θ⊤ϕ[y, a′] − θ⊤ϕ[x, a]
      2  θ ← θ + α δ ϕ[x, a]
      3  return θ
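    A minimal sketch of this routine in Python; the action set, the state-action feature map ϕ and the step size are assumed to be supplied by the caller, and θ and ϕ(x, a) are NumPy arrays:

      def q_learning_lin_fapp(x, a, r, y, theta, phi, actions, gamma=0.99, alpha=0.05):
          # one Q-learning update for the linear action-value function Q(x, a) = theta . phi(x, a)
          q_next = max(theta @ phi(y, a_prime) for a_prime in actions)   # max_{a'} theta^T phi[y, a']
          delta = r + gamma * q_next - theta @ phi(x, a)                 # step 1
          theta = theta + alpha * delta * phi(x, a)                      # step 2
          return theta                                                   # step 3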


  • Convergence

    Widely used, but convergence can be shown only locally (local optima!).
    Even in the linear case, parameters may diverge (Bertsekas and Tsitsiklis, 1996) due to biased sampling or for non-linear approximations of V or Q.
    Almost sure convergence to a unique parameter vector was shown for linear approximation, an ergodic Markov process with a well-behaved stationary distribution, under the Robbins-Monro conditions and for linearly independent ϕ.
    If convergent, the best approximation of the true value function among all the linear approximations is found.


  • The choice of the function space

    In look-up table algorithms averaging happens within the cells of the table and is safe under the RM conditions.
    Here, however, approximation and estimation of the value function may interfere.
    Target function V and approximation Vθ: approximation error

      E = inf_θ ‖Vθ − V‖²

    Choosing sufficiently many features, the error on a finite number of values (e.g. in an episodic task) can be reduced to zero, but

      overfitting is possible for noisy rewards/states, and
      the trade-off between approximation error (model) and estimation error (values) needs to be considered.

    Use regularisation!
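    For instance, the regression step of a batch method can be regularised; a minimal ridge-regularised least-squares sketch in Python (NumPy), where the regularisation weight is an illustrative assumption:

      import numpy as np

      def ridge_fit(Phi, targets, reg=1e-2):
          # solve min_theta ||Phi theta - targets||^2 + reg ||theta||^2 in closed form
          n_features = Phi.shape[1]
          return np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_features), Phi.T @ targets)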


  • Fitted Q-learning: Algorithm

    Use all (recent) state-action pairs for the update ⇒ Monte Carlo

      1  S ← [ ]                                             // create empty list
      2  for t = 1 to T                                      // to present
      3    V̂ ← rt+1 + γ max_{a′∈A} predict((yt+1, a′), θ)    // estimate value
      4    S ← append(S, ({xt, at}, V̂))
      5  end for
      6  θ ← regress(S)                                      // maximise likelihood of model

    Notes: Prediction and regression should be matched. May diverge for an unsuitable regressor.
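    A minimal sketch of one fitted-Q pass in Python (NumPy); the predict/regress pair is realised here by a linear model with least-squares regression, and the collected transitions (x, a, r, y), action set and feature map ϕ are assumptions supplied by the caller:

      import numpy as np

      def fitted_q_pass(transitions, theta, phi, actions, gamma=0.99):
          # use all (recent) samples: build value targets, then refit theta by regression
          inputs, targets = [], []
          for x, a, r, y in transitions:
              v_hat = r + gamma * max(theta @ phi(y, a_prime) for a_prime in actions)  # "predict"
              inputs.append(phi(x, a))
              targets.append(v_hat)
          theta_new, *_ = np.linalg.lstsq(np.array(inputs), np.array(targets), rcond=None)  # "regress"
          return theta_new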


  • Summary

    Large state spaces require clever representations.
    Continuous state spaces require function approximation.
    Algorithms do not necessarily become more complex, but lose the property of global convergence.
    Choice of function space is an open problem (often not too difficult for practical problems).
    So far we considered only the representation of the value function. Should we approximate instead the policy? Or both?
    Next time: Compatible representations


  • Acknowledgements

    Some material was adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book.

    Today mainly based on sections 2.2 and 3.3.2 from C. Szepesvári’s (2010) Algorithms for reinforcement learning. Morgan & Claypool Publishers. (See also www.cs.ualberta.ca/system/files/tech_report/2009/TR09-13.pdf)


