Motivation POMDP Bayesian RL Experiments Conclusion
Bayesian Reinforcement Learning in Continuous POMDPs
Stéphane Ross1, Brahim Chaib-draa2 and Joelle Pineau1
1 School of Computer Science, McGill University, Canada
2 Department of Computer Science, Laval University, Canada

May 23rd, 2008
Bayesian RL in Continuous POMDP Stéphane Ross1 , Brahim Chaib-draa2 and Joelle Pineau1 1 / 17
Motivation
Robots have to make decisions under:
- Imperfect actuators
- Noisy sensors
- Poor/approximate model

How to maximize long-term rewards? [Rottmann]
Continuous POMDP
States: S ⊆ R^m
Actions: A ⊆ R^n
Observations: Z ⊆ R^p
Rewards: R(s, a) ∈ R

Gaussian model for the transition/observation functions:
s_t = g_T(s_{t−1}, a_{t−1}, X_t),  X_t ∼ N(µ_X, Σ_X)
z_t = g_O(s_t, a_{t−1}, Y_t),  Y_t ∼ N(µ_Y, Σ_Y)
Example
Simple robot navigation task:

(x′, y′)ᵀ = (x, y)ᵀ + v R(θ) (X1, X2)ᵀ,  where R(θ) = [cos θ  −sin θ; sin θ  cos θ]
(z_x, z_y)ᵀ = (x, y)ᵀ + (Y1, Y2)ᵀ

+1 reward when ||s − s_GOAL||_2 < d
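As a concrete sketch (not from the slides), the navigation dynamics above can be simulated in a few lines; the function names `nav_step` and `reward` and all parameter values are illustrative.

```python
import numpy as np

def nav_step(s, theta, v, mu_X, Sigma_X, mu_Y, Sigma_Y, rng):
    """One step of the navigation model: the position is displaced by
    v * R(theta) @ X with X ~ N(mu_X, Sigma_X), and the observation is
    the new position corrupted by Y ~ N(mu_Y, Sigma_Y)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X = rng.multivariate_normal(mu_X, Sigma_X)  # transition noise (X1, X2)
    Y = rng.multivariate_normal(mu_Y, Sigma_Y)  # observation noise (Y1, Y2)
    s_next = s + v * R @ X
    z = s_next + Y
    return s_next, z

def reward(s, s_goal, d):
    """+1 reward when within Euclidean distance d of the goal."""
    return 1.0 if np.linalg.norm(s - s_goal) < d else 0.0
```

With zero noise covariances the step is deterministic, which makes the geometry easy to check: at heading θ = π/2 a unit move along mu_X = (1, 0) displaces the robot along the y-axis.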
Problem
In practice: µ_X, Σ_X, µ_Y, Σ_Y are unknown.

Need to trade off between:
- Learning the model
- Identifying the state
- Gathering rewards
Bayesian Reinforcement Learning
(Diagram: the agent loop — from the current prior/posterior, choose an action, receive an observation, and compute the new posterior.)
Bayesian Reinforcement Learning
Planning problem representable as a new POMDP:
- States: (s, θ)
- Actions: a ∈ A
- Observations: z ∈ Z
- Rewards: R(s, θ, a) = R(s, a)

Joint transition-observation probabilities:
Pr(s′, θ′, z | s, θ, a) = Pr(s′, z | s, a, θ) I_θ(θ′)
Bayesian Reinforcement Learning
Belief State = Posterior
Belief update:
b_a^z(s′, θ) ∝ ∫_S b(s, θ) Pr(s′, z | s, a, θ) ds

Optimal policy by solving:
V*(b) = max_{a ∈ A} [ ∫_S R(s, a) Pr(s | b) ds + γ ∫_Z Pr(z | b, a) V*(b_a^z) dz ]
Belief Update
Bayesian learning of (µ, Σ):
- Normal-Wishart prior ⇒ Normal-Wishart posterior
- Parametrized by (n, µ̂, Σ̂)

Start with prior: (n_0, µ̂_0, Σ̂_0)

Posterior update (after observing X = x):
n′ = n + 1
µ̂′ = (n µ̂ + x) / (n + 1)
Σ̂′ = ((n − 1)/n) Σ̂ + (1/(n + 1)) (x − µ̂)(x − µ̂)ᵀ
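The update above translates directly into code. This is a minimal sketch; the function name `nw_update` is mine, not from the slides.

```python
import numpy as np

def nw_update(n, mu_hat, Sigma_hat, x):
    """Sequential update of the Normal-Wishart posterior parameters
    (n, mu_hat, Sigma_hat) after observing one sample x, following the
    update rule on this slide."""
    d = x - mu_hat                                  # residual w.r.t. current mean
    n_new = n + 1
    mu_new = (n * mu_hat + x) / (n + 1)             # running mean
    Sigma_new = (n - 1) / n * Sigma_hat + np.outer(d, d) / (n + 1)
    return n_new, mu_new, Sigma_new
```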
Belief Update
But X is not directly observable:
Pr(µ, Σ | z) ∝ ∫ Pr(µ, Σ | x) Pr(z | x) Pr(x) dx

Approximate the infinite mixture by a finite mixture.

Particle filter:
- Use particles of the form (s, φ, ψ)
- φ, ψ: Normal-Wishart posterior parameters for X, Y
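A minimal sketch of the reweight-and-resample step of such a particle filter, simplified by assuming each particle carries only a state and that the observation-noise parameters (µ_Y, Σ_Y) are fixed rather than drawn from each particle's ψ posterior:

```python
import numpy as np

def reweight_resample(states, weights, z, mu_Y, Sigma_Y, rng):
    """Reweight each particle by the Gaussian observation likelihood
    N(z; s + mu_Y, Sigma_Y), then resample back to equal weights."""
    k = len(z)
    inv = np.linalg.inv(Sigma_Y)
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma_Y))
    resid = z - states - mu_Y                       # one residual row per particle
    lik = norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', resid, inv, resid))
    w = weights * lik
    w = w / w.sum()                                 # normalize
    idx = rng.choice(len(states), size=len(states), p=w)
    return states[idx], np.full(len(states), 1.0 / len(states))
```

Particles whose predicted observation is far from z receive vanishing weight and tend to disappear at resampling; in the full algorithm each surviving particle's (φ, ψ) posteriors would also be updated.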
Particle Filter
Online Planning
Monte Carlo Online Planning (Receding Horizon Control):

(Diagram: search tree rooted at belief b0, expanded by actions a1, …, an and observations o1, …, on into successor beliefs b1, b2, b3, ….)
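The receding-horizon search in the tree above can be sketched as a depth-limited Monte Carlo recursion. The helper `simulate`, which samples a reward and a successor belief for a given action, is a hypothetical stand-in for the particle-filter belief update:

```python
def mc_value(b, actions, simulate, depth, n_samples, gamma, rng):
    """Estimate V(b) over a finite horizon: for each action, average sampled
    (reward, next-belief) transitions and recurse one level deeper; return
    the best action value found at this node."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        q = 0.0
        for _ in range(n_samples):
            r, b_next = simulate(b, a, rng)
            q += r + gamma * mc_value(b_next, actions, simulate,
                                      depth - 1, n_samples, gamma, rng)
        best = max(best, q / n_samples)
    return best
```

At the root the controller would execute the maximizing action, receive an observation, update the belief with the particle filter, and replan — the receding-horizon loop.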
Simple Robot Navigation Task

Average evolution of the return over time:

(Plot: average return (0–1) vs. training steps (0–250); curves for the prior model, the exact model, and learning.)
Simple Robot Navigation Task
Average accuracy of the model over time:

(Plot: WL1 vs. training steps (0–250).)
Model accuracy is measured as follows:
WL1(b) = Σ_{(s,φ,ψ)} b(s, φ, ψ) [ ||µ_φ − µ_X||_1 + ||Σ_φ − Σ_X||_1 + ||µ_ψ − µ_Y||_1 + ||Σ_ψ − Σ_Y||_1 ]
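For concreteness, the WL1 metric can be computed from the particle set as follows. This is a sketch under an assumed particle layout: each particle is flattened to its posterior-mean parameters (µ_φ, Σ_φ, µ_ψ, Σ_ψ).

```python
import numpy as np

def wl1(particles, weights, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """Belief-weighted L1 distance between each particle's posterior-mean
    parameters and the true model parameters: zero iff every particle with
    nonzero weight matches the true (mu_X, Sigma_X, mu_Y, Sigma_Y)."""
    total = 0.0
    for (mu_phi, Sigma_phi, mu_psi, Sigma_psi), w in zip(particles, weights):
        total += w * (np.abs(mu_phi - mu_X).sum()
                      + np.abs(Sigma_phi - Sigma_X).sum()
                      + np.abs(mu_psi - mu_Y).sum()
                      + np.abs(Sigma_psi - Sigma_Y).sum())
    return total
```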
Conclusion
- Presented a framework for optimal control under model and state uncertainty.
- Monte Carlo approximations for efficient tracking and planning.
- The framework can easily be extended to unknown rewards and mixture-of-Gaussians models.
Future Work
- What if g_T, g_O are unknown?
- What if (µ, Σ) change over time?
- More efficient planning algorithms.
- Apply to a real robot.
Thank you !
Questions?