Motivation POMDP Bayesian RL Experiments Conclusion
Bayesian Reinforcement Learning in Continuous POMDPs
Stéphane Ross1, Brahim Chaib-draa2 and Joelle Pineau1
1 School of Computer Science, McGill University, Canada
2 Department of Computer Science, Laval University, Canada

May 23rd, 2008
Bayesian RL in Continuous POMDP Stéphane Ross1 , Brahim Chaib-draa2 and Joelle Pineau1 1 / 17
Motivation
Robots have to make decisions under:
- Imperfect actuators
- Noisy sensors
- Poor/approximate model

How to maximize long-term rewards? [Rottmann]
Continuous POMDP
States: S ⊆ R^m
Actions: A ⊆ R^n
Observations: Z ⊆ R^p
Rewards: R(s, a) ∈ R

Gaussian model for the transition/observation functions:
s_t = g_T(s_{t−1}, a_{t−1}, X_t),  X_t ∼ N(µ_X, Σ_X)
z_t = g_O(s_t, a_{t−1}, Y_t),  Y_t ∼ N(µ_Y, Σ_Y)
Example
Simple robot navigation task:

(x′, y′)ᵀ = (x, y)ᵀ + v R(θ) (X1, X2)ᵀ,  where R(θ) = [cos θ  −sin θ; sin θ  cos θ]
(z_x, z_y)ᵀ = (x, y)ᵀ + (Y1, Y2)ᵀ

+1 reward when ||s − s_GOAL||_2 < d
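As a concrete sketch (not from the slides), the navigation dynamics above can be simulated in a few lines; the function names `nav_step` and `reward` and all parameter values are illustrative.

```python
import numpy as np

def nav_step(s, theta, v, mu_X, Sigma_X, mu_Y, Sigma_Y, rng):
    """One step of the navigation model: the position is displaced by
    v * R(theta) @ X with X ~ N(mu_X, Sigma_X), and the observation is
    the new position corrupted by Y ~ N(mu_Y, Sigma_Y)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X = rng.multivariate_normal(mu_X, Sigma_X)  # transition noise (X1, X2)
    Y = rng.multivariate_normal(mu_Y, Sigma_Y)  # observation noise (Y1, Y2)
    s_next = s + v * R @ X
    z = s_next + Y
    return s_next, z

def reward(s, s_goal, d):
    """+1 reward when within Euclidean distance d of the goal."""
    return 1.0 if np.linalg.norm(s - s_goal) < d else 0.0
```

With zero noise covariances the step is deterministic, which makes the geometry easy to check: at heading θ = π/2 a unit move along mu_X = (1, 0) displaces the robot along the y-axis.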
Problem
In practice: µ_X, Σ_X, µ_Y, Σ_Y are unknown.

Need to trade off between:
- Learning the model
- Identifying the state
- Gathering rewards
Bayesian Reinforcement Learning
(Diagram: the agent loop — from the current prior/posterior, choose an action, receive an observation, and compute the new posterior.)
Bayesian Reinforcement Learning
Planning problem representable as a new POMDP:
- States: (s, θ)
- Actions: a ∈ A
- Observations: z ∈ Z
- Rewards: R(s, θ, a) = R(s, a)

Joint transition-observation probabilities:
Pr(s′, θ′, z | s, θ, a) = Pr(s′, z | s, a, θ) I_θ(θ′)
Bayesian Reinforcement Learning
Belief State = Posterior
Belief update:
b_a^z(s′, θ) ∝ ∫_S b(s, θ) Pr(s′, z | s, a, θ) ds

Optimal policy by solving:
V*(b) = max_{a ∈ A} [ ∫_S R(s, a) Pr(s | b) ds + γ ∫_Z Pr(z | b, a) V*(b_a^z) dz ]
Belief Update
Bayesian learning of (µ, Σ):
- Normal-Wishart prior ⇒ Normal-Wishart posterior
- Parametrized by (n, µ̂, Σ̂)

Start with prior: (n_0, µ̂_0, Σ̂_0)

Posterior update (after observing X = x):
n′ = n + 1
µ̂′ = (n µ̂ + x) / (n + 1)
Σ̂′ = ((n − 1)/n) Σ̂ + (1/(n + 1)) (x − µ̂)(x − µ̂)ᵀ
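The update above translates directly into code. This is a minimal sketch; the function name `nw_update` is mine, not from the slides.

```python
import numpy as np

def nw_update(n, mu_hat, Sigma_hat, x):
    """Sequential update of the Normal-Wishart posterior parameters
    (n, mu_hat, Sigma_hat) after observing one sample x, following the
    update rule on this slide."""
    d = x - mu_hat                                  # residual w.r.t. current mean
    n_new = n + 1
    mu_new = (n * mu_hat + x) / (n + 1)             # running mean
    Sigma_new = (n - 1) / n * Sigma_hat + np.outer(d, d) / (n + 1)
    return n_new, mu_new, Sigma_new
```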
Belief Update
But X is not directly observable:
Pr(µ, Σ | z) ∝ ∫ Pr(µ, Σ | x) Pr(z | x) Pr(x) dx

Approximate the infinite mixture by a finite mixture.

Particle filter:
- Use particles of the form (s, φ, ψ)
- φ, ψ: Normal-Wishart posterior parameters for X, Y
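A minimal sketch of the reweight-and-resample step of such a particle filter, simplified by assuming each particle carries only a state and that the observation-noise parameters (µ_Y, Σ_Y) are fixed rather than drawn from each particle's ψ posterior:

```python
import numpy as np

def reweight_resample(states, weights, z, mu_Y, Sigma_Y, rng):
    """Reweight each particle by the Gaussian observation likelihood
    N(z; s + mu_Y, Sigma_Y), then resample back to equal weights."""
    k = len(z)
    inv = np.linalg.inv(Sigma_Y)
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma_Y))
    resid = z - states - mu_Y                       # one residual row per particle
    lik = norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', resid, inv, resid))
    w = weights * lik
    w = w / w.sum()                                 # normalize
    idx = rng.choice(len(states), size=len(states), p=w)
    return states[idx], np.full(len(states), 1.0 / len(states))
```

Particles whose predicted observation is far from z receive vanishing weight and tend to disappear at resampling; in the full algorithm each surviving particle's (φ, ψ) posteriors would also be updated.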
Particle Filter
Online Planning
Monte Carlo Online Planning (Receding Horizon Control):

(Diagram: search tree rooted at belief b0, expanded by actions a1, …, an and observations o1, …, on into successor beliefs b1, b2, b3, ….)
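The receding-horizon search in the tree above can be sketched as a depth-limited Monte Carlo recursion. The helper `simulate`, which samples a reward and a successor belief for a given action, is a hypothetical stand-in for the particle-filter belief update:

```python
def mc_value(b, actions, simulate, depth, n_samples, gamma, rng):
    """Estimate V(b) over a finite horizon: for each action, average sampled
    (reward, next-belief) transitions and recurse one level deeper; return
    the best action value found at this node."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        q = 0.0
        for _ in range(n_samples):
            r, b_next = simulate(b, a, rng)
            q += r + gamma * mc_value(b_next, actions, simulate,
                                      depth - 1, n_samples, gamma, rng)
        best = max(best, q / n_samples)
    return best
```

At the root the controller would execute the maximizing action, receive an observation, update the belief with the particle filter, and replan — the receding-horizon loop.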
Simple Robot Navigation Task

Average evolution of the return over time:

(Plot: average return (0–1) vs. training steps (0–250); curves for the prior model, the exact model, and learning.)
Simple Robot Navigation Task
Average accuracy of the model over time:

(Plot: WL1 vs. training steps (0–250).)
Model accuracy is measured as follows:
WL1(b) = Σ_{(s,φ,ψ)} b(s, φ, ψ) [ ||µ_φ − µ_X||_1 + ||Σ_φ − Σ_X||_1 + ||µ_ψ − µ_Y||_1 + ||Σ_ψ − Σ_Y||_1 ]
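For concreteness, the WL1 metric can be computed from the particle set as follows. This is a sketch under an assumed particle layout: each particle is flattened to its posterior-mean parameters (µ_φ, Σ_φ, µ_ψ, Σ_ψ).

```python
import numpy as np

def wl1(particles, weights, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """Belief-weighted L1 distance between each particle's posterior-mean
    parameters and the true model parameters: zero iff every particle with
    nonzero weight matches the true (mu_X, Sigma_X, mu_Y, Sigma_Y)."""
    total = 0.0
    for (mu_phi, Sigma_phi, mu_psi, Sigma_psi), w in zip(particles, weights):
        total += w * (np.abs(mu_phi - mu_X).sum()
                      + np.abs(Sigma_phi - Sigma_X).sum()
                      + np.abs(mu_psi - mu_Y).sum()
                      + np.abs(Sigma_psi - Sigma_Y).sum())
    return total
```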
Conclusion
- Presented a framework for optimal control under model and state uncertainty.
- Monte Carlo approximations for efficient tracking and planning.
- The framework can easily be extended to unknown rewards and mixture-of-Gaussians models.
Future Work
- What if g_T, g_O are unknown?
- What if (µ, Σ) change over time?
- More efficient planning algorithms.
- Apply to a real robot.
Thank you !
Questions?