Stochastic optimal control theory
Bert Kappen, SNN, Radboud University, Nijmegen, the Netherlands
July 5, 2008
Introduction
Optimal control theory: optimize the sum of a path cost and an end cost. The result is an optimal control sequence and an optimal trajectory.
Input: cost function. Output: optimal trajectory and controls.
Classical control theory: what control signal should I give to move a plant to a desired state?
Input: desired state trajectory. Output: optimal control trajectory.
Bert Kappen ICML, July 5 2008 1
Types of optimal control problems
Finite horizon (fixed horizon time)
[Figure: controlled trajectory through an environment, state x versus time t up to the horizon time tf.]
Dynamics and environment may depend explicitly on time. The optimal control depends explicitly on time.
Finite horizon (moving horizon)
[Figure: horizon window from t to tf, shifted forward in time.]
Dynamics and environment are static. The optimal control is time independent. This setting is similar to RL.
Other types of control problems:
- minimum time
- infinite horizon, average reward
- infinite horizon, absorbing states
In addition one should distinguish:
- discrete vs. continuous state
- discrete vs. continuous time
- observable vs. partially observable
Overview
Deterministic optimal control (Kappen, 30 min.)
- Introduction of the delayed reward problem in discrete time
- Dynamic programming solution and deterministic Bellman equations
- Solution in continuous time and states
- Example: mass on a spring
- Pontryagin maximum principle; notion of an optimal (particle) trajectory
- Again mass on a spring
Stochastic optimal control, discrete case (Toussaint, 40 min.)
- Stochastic Bellman equation (discrete state and time) and dynamic programming
- Reinforcement learning (exact solution, value iteration, policy improvement); actor-critic networks
- Markov decision problems and probabilistic inference
- Example: robotic motion control and planning
Stochastic optimal control, continuous case (Kappen, 40 min.)
- Stochastic differential equations
- Hamilton-Jacobi-Bellman equation (continuous state and time)
- LQ control, Riccati equation
- Example of LQ control
- Learning; partial observability: inference and control
- Certainty equivalence
- Path integral control; the role of noise and symmetry breaking; efficient approximate computation (MC, MF, BP, ...)
- Examples: double slit, delayed choice, n joint arm
Research issues (Toussaint, 30 min.)
- Learning
- Efficient methods to compute value functions/cost-to-go
- Control under partial observability (POMDPs)
Discrete time control
Consider the control of a discrete time deterministic dynamical system:
$$x_{t+1} = x_t + f(t, x_t, u_t), \qquad t = 0, 1, \ldots, T-1$$
$x_t$ describes the state and $u_t$ specifies the control or action at time $t$.
Given $x_{t=0} = x_0$ and $u_{0:T-1} = u_0, u_1, \ldots, u_{T-1}$, we can compute $x_{1:T}$.
Define a cost for each sequence of controls:
$$C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t)$$
The problem of optimal control is to find the sequence $u_{0:T-1}$ that minimizes $C(x_0, u_{0:T-1})$.
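As a minimal sketch of these definitions, the following Python function rolls out the dynamics for a given control sequence and accumulates the cost. The function name `rollout_cost` and the toy problem (f = u, R = u^2/2, phi = x^2/2) are illustrative assumptions, not part of the tutorial.

```python
def rollout_cost(x0, controls, f, R, phi):
    """Return (trajectory, cost) for x_{t+1} = x_t + f(t, x_t, u_t)."""
    xs = [x0]
    cost = 0.0
    for t, u in enumerate(controls):
        x = xs[-1]
        cost += R(t, x, u)          # path cost at step t
        xs.append(x + f(t, x, u))   # deterministic dynamics
    cost += phi(xs[-1])             # end cost evaluated at x_T
    return xs, cost

# Hypothetical toy problem: f = u, R = u^2 / 2, phi = x^2 / 2
xs, C = rollout_cost(
    x0=2.0,
    controls=[-1.0, -0.5],
    f=lambda t, x, u: u,
    R=lambda t, x, u: 0.5 * u * u,
    phi=lambda x: 0.5 * x * x,
)
```

Any optimization method can be wrapped around such a rollout, but the number of control sequences grows exponentially with T, which is what motivates dynamic programming.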
Dynamic programming
Find the minimal cost path from A to J.
C(J) = 0, C(H) = 3, C(I) = 4
C(F ) = min(6 + C(H), 3 + C(I))
The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:
$$J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)$$
which solves the optimal control problem from an intermediate time $t$ until the fixed end time $T$, for all intermediate states $x_t$.
Then,
$$J(T, x) = \phi(x)$$
$$J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})$$
One can recursively compute $J(t, x)$ from $J(t+1, x)$ for all $x$ in the following way:
$$J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)$$
$$= \min_{u_t} \left( R(t, x_t, u_t) + \min_{u_{t+1:T-1}} \left[ \phi(x_T) + \sum_{s=t+1}^{T-1} R(s, x_s, u_s) \right] \right)$$
$$= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_{t+1}) \right)$$
$$= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right)$$
This is called the Bellman equation.
It computes $u$ as a function of $x$ and $t$, for all intermediate $t$ and all $x$.
The algorithm to compute the optimal control $u^*_{0:T-1}$, the optimal trajectory $x^*_{1:T}$ and the optimal cost is given by
1. Initialization: $J(T, x) = \phi(x)$
2. Backwards: For $t = T-1, \ldots, 0$ and for all $x$ compute
$$u^*_t(x) = \arg\min_u \{ R(t, x, u) + J(t+1, x + f(t, x, u)) \}$$
$$J(t, x) = R(t, x, u^*_t) + J(t+1, x + f(t, x, u^*_t))$$
3. Forwards: For $t = 0, \ldots, T-1$ compute
$$x^*_{t+1} = x^*_t + f(t, x^*_t, u^*_t(x^*_t))$$
NB: the backward computation requires $u^*_t(x)$ for all $x$.
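The three steps above can be sketched in Python for a finite state set. The names `solve_dp` and `forward_pass`, the integer grid, the rule that moves leaving the grid are skipped, and the toy costs are all assumptions made for illustration.

```python
def solve_dp(states, controls, f, R, phi, T):
    """Backward dynamic programming on a finite state set."""
    valid = set(states)
    J = {T: {x: phi(x) for x in states}}    # 1. initialization J(T, x) = phi(x)
    policy = {}
    for t in range(T - 1, -1, -1):          # 2. backward pass over all x
        J[t], policy[t] = {}, {}
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls:
                x_next = x + f(t, x, u)
                if x_next not in valid:     # assumption: skip moves that leave the grid
                    continue
                cost = R(t, x, u) + J[t + 1][x_next]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[t][x], policy[t][x] = best_cost, best_u
    return J, policy

def forward_pass(x0, policy, f, T):
    """3. forward pass: follow u*_t(x) from x0."""
    xs, us = [x0], []
    for t in range(T):
        u = policy[t][xs[-1]]
        us.append(u)
        xs.append(xs[-1] + f(t, xs[-1], u))
    return xs, us

# Hypothetical toy problem: move on the integers -3..3, unit cost per move,
# end cost 2 x^2, horizon T = 3.
f = lambda t, x, u: u
J, policy = solve_dp(range(-3, 4), (-1, 0, 1), f,
                     R=lambda t, x, u: u * u, phi=lambda x: 2 * x * x, T=3)
xs, us = forward_pass(3, policy, f, T=3)
```

The backward pass stores a control for every state, not only for states on the optimal trajectory, which is exactly the point of the NB above.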
Continuous limit
Replace $t+1$ by $t+dt$ with $dt \to 0$.
$$x_{t+dt} = x_t + f(t, x_t, u_t)\,dt$$
$$C(x_0, u_{0\to T}) = \phi(x_T) + \int_0^T d\tau\, R(\tau, x(\tau), u(\tau))$$
Assume $J(t, x)$ is smooth.
$$J(t, x) = \min_u \left( R(t, x, u)\,dt + J(t+dt, x + f(t, x, u)\,dt) \right)$$
$$\approx \min_u \left( R(t, x, u)\,dt + J(t, x) + \partial_t J(t, x)\,dt + \partial_x J(t, x) f(t, x, u)\,dt \right)$$
$$-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(t, x, u)\,\partial_x J(t, x) \right)$$
with boundary condition $J(T, x) = \phi(x)$.
This is called the Hamilton-Jacobi-Bellman Equation.
Computes the anticipated potential J(t, x) from the future potential φ(x).
Example: Mass on a spring
The spring force $F_z = -z$ acts towards the rest position and the control force is $F_u = u$.
Newton's law:
$$F = -z + u = m\ddot z$$
with $m = 1$.
Control problem: Given initial position and velocity $z(0) = \dot z(0) = 0$ at time $t = 0$, find the control path $-1 < u(0 \to T) < 1$ such that $z(T)$ is maximal.
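Introduce $x_1 = z$, $x_2 = \dot z$, then
$$\dot x_1 = x_2$$
$$\dot x_2 = -x_1 + u$$
The end cost is $\phi(x) = -x_1$; the path cost is $R(x, u, t) = 0$.
The HJB equation takes the form:
$$-\partial_t J = \min_u \left( x_2 \frac{\partial J}{\partial x_1} - x_1 \frac{\partial J}{\partial x_2} + \frac{\partial J}{\partial x_2} u \right)$$
$$= x_2 \frac{\partial J}{\partial x_1} - x_1 \frac{\partial J}{\partial x_2} - \left| \frac{\partial J}{\partial x_2} \right|, \qquad u = -\mathrm{sign}\left( \frac{\partial J}{\partial x_2} \right)$$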
The solution is
$$J(t, x_1, x_2) = -\cos(t-T)\,x_1 + \sin(t-T)\,x_2 + \alpha(t)$$
$$u(t, x_1, x_2) = -\mathrm{sign}(\sin(t-T))$$
As an example consider $T = 2\pi$. Then the optimal control is
$$u = -1, \qquad 0 < t < \pi$$
$$u = 1, \qquad \pi < t < 2\pi$$
[Figure: $x_1$ and $x_2$ as functions of $t$.]
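The switching control can be checked numerically. The sketch below integrates the spring from rest for T = 2π, once with u = −sign(sin(t − T)) and once with a constant push u = 1 for comparison. The function name `simulate` and the step count are illustrative choices; each step uses the exact solution for a control that is constant over the step.

```python
import math

def simulate(control, T, n_steps=1000):
    """Integrate x1' = x2, x2' = -x1 + u from rest, with u held constant on each step."""
    dt = T / n_steps
    c, s = math.cos(dt), math.sin(dt)
    x1, x2 = 0.0, 0.0
    for k in range(n_steps):
        u = control((k + 0.5) * dt)    # sample the control at the step midpoint
        y = x1 - u                     # displacement from the shifted rest position
        # Exact solution of y'' = -y over one step (a rotation in phase space)
        x1, x2 = u + y * c + x2 * s, -y * s + x2 * c
    return x1, x2

T = 2.0 * math.pi
bang_bang = lambda t: -math.copysign(1.0, math.sin(t - T))
z_opt, v_opt = simulate(bang_bang, T)        # switching control from the HJB solution
z_const, _ = simulate(lambda t: 1.0, T)      # constant push, for comparison
```

With the switching control the mass first swings to z = −2 and then to z(T) = 4, whereas the constant push returns it to z(T) = 0 after one full period.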
Pontryagin minimum principle
The HJB equation is a PDE with a boundary condition at a future time. The PDE is solved using discretization of space and time.
The solution is an optimal cost-to-go for all $x$ and $t$. From this we compute the optimal trajectory and optimal control.
An alternative is a variational approach that directly finds the optimal trajectory and optimal control.
We can write the optimal control problem as a constrained optimization problem with independent variables $u(0 \to T)$ and $x(0 \to T)$
$$\min_{u(0\to T),\, x(0\to T)} \; \phi(x(T)) + \int_0^T dt\, R(t, x(t), u(t))$$
subject to the constraint
$$\dot x = f(t, x, u)$$
and boundary condition $x(0) = x_0$.
Introduce the Lagrange multiplier function $\lambda(t)$:
$$C = \phi(x(T)) + \int_0^T dt \left[ R(t, x(t), u(t)) - \lambda(t) \left( f(t, x(t), u(t)) - \dot x(t) \right) \right]$$
$$= \phi(x(T)) + \int_0^T dt \left[ -H(t, x(t), u(t), \lambda(t)) + \lambda(t)\dot x(t) \right]$$
$$-H(t, x, u, \lambda) = R(t, x, u) - \lambda f(t, x, u)$$
Derivation PMP
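The solution is found by extremizing $C$. This gives a necessary but not sufficient condition for a solution.
If we vary the action with respect to the trajectory $x$, the control $u$ and the Lagrange multiplier $\lambda$, we get:
$$\delta C = \phi_x(x(T))\,\delta x(T) + \int_0^T dt \left[ -H_x \delta x(t) - H_u \delta u(t) + (-H_\lambda + \dot x(t))\,\delta\lambda(t) + \lambda(t)\,\delta\dot x(t) \right]$$
$$= \left( \phi_x(x(T)) + \lambda(T) \right) \delta x(T) + \int_0^T dt \left[ (-H_x - \dot\lambda(t))\,\delta x(t) - H_u \delta u(t) + (-H_\lambda + \dot x(t))\,\delta\lambda(t) \right]$$
The second line follows from integrating $\lambda\,\delta\dot x$ by parts, using $\delta x(0) = 0$ because $x(0) = x_0$ is fixed.
For instance, $H_x = \partial H(t, x(t), u(t), \lambda(t)) / \partial x(t)$.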
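We can solve $H_u(t, x, u, \lambda) = 0$ for $u$ and denote the solution as
$$u^*(t, x, \lambda)$$
This assumes that $-H = R - \lambda f$ is convex in $u$, so that the stationary point minimizes $C$.
The remaining equations are
$$\dot x = H_\lambda(t, x, u^*(t, x, \lambda), \lambda)$$
$$\dot\lambda = -H_x(t, x, u^*(t, x, \lambda), \lambda)$$
with boundary conditions
$$x(0) = x_0, \qquad \lambda(T) = -\phi_x(x(T))$$
This is a mixed boundary value problem.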
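One common way to handle such a mixed boundary value problem is shooting: guess λ(0), integrate both equations forward, and adjust the guess until the condition at T holds. The sketch below does this for a hypothetical scalar problem that is not in the tutorial: dx/dt = u, R = u²/2, φ(x) = x²/2, so that H = −u²/2 + λu, u* = λ, dλ/dt = 0 and λ(T) = −x(T). The names `shoot` and `solve_bvp` are illustrative.

```python
def shoot(lam0, x0, T, n_steps=1000):
    """Integrate x' = H_lambda, lambda' = -H_x forward from (x0, lam0).

    Returns the terminal mismatch lambda(T) + phi_x(x(T)), which must vanish.
    """
    dt = T / n_steps
    x, lam = x0, lam0
    for _ in range(n_steps):
        u = lam              # u* solves H_u = -u + lambda = 0
        x += u * dt          # x' = H_lambda = u*
        lam += 0.0 * dt      # lambda' = -H_x = 0 for this toy problem
    return lam + x           # phi_x(x) = x

def solve_bvp(x0, T):
    """Secant iteration on the unknown initial costate lambda(0)."""
    a, b = 0.0, 1.0
    fa, fb = shoot(a, x0, T), shoot(b, x0, T)
    for _ in range(20):
        if abs(fb) < 1e-12 or fb == fa:
            break
        a, b, fa = b, b - fb * (b - a) / (fb - fa), fb
        fb = shoot(b, x0, T)
    return b

lam0 = solve_bvp(x0=2.0, T=1.0)   # optimal control is u* = lam0 for all t
```

For this toy problem the answer can be checked by hand: a constant control u minimizes (x0 + uT)²/2 + Tu²/2 at u = −x0/(1 + T). For nonlinear problems the mismatch is no longer linear in λ(0) and shooting can be sensitive to the initial guess.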
Again mass on a spring
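Problem
$$\dot x_1 = x_2, \qquad \dot x_2 = -x_1 + u$$
$$R(x, u, t) = 0, \qquad \phi(x) = -x_1$$
Hamiltonian
$$H(t, x, u, \lambda) = -R(t, x, u) + \lambda^T f(t, x, u) = \lambda_1 x_2 + \lambda_2(-x_1 + u)$$
Because $C$ contains $-H$, the cost is minimized by the admissible $u$ that minimizes $-H$:
$$u^* = \mathrm{sign}(\lambda_2), \qquad H^*(t, x, \lambda) = \lambda_1 x_2 - \lambda_2 x_1 + |\lambda_2|$$
The Hamilton equations
$$\dot x = \frac{\partial H^*}{\partial \lambda} \;\Rightarrow\; \dot x_1 = x_2, \quad \dot x_2 = -x_1 + \mathrm{sign}(\lambda_2)$$
$$\dot\lambda = -\frac{\partial H^*}{\partial x} \;\Rightarrow\; \dot\lambda_1 = \lambda_2, \quad \dot\lambda_2 = -\lambda_1$$
with $x(t=0) = x_0$ and $\lambda(t=T) = -\phi_x = (1, 0)^T$.
The costate equations give $\lambda_2(t) = -\sin(t-T)$, so $u^* = -\mathrm{sign}(\sin(t-T))$, in agreement with the HJB solution.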
Comments
The HJB method gives a sufficient (and often necessary) condition for optimality. Solving the PDE is expensive.
The PMP method provides a necessary condition for optimal control. This means that it provides candidate solutions for optimality.
The PMP method is computationally less complicated than the HJB method because it does not require discretization of the state space.
Optimal control in continuous space and time contains many complications related to the existence, uniqueness and smoothness of the solution, particularly in the absence of noise. In the presence of noise many of these intricacies disappear.
HJB generalizes to the stochastic case; PMP does not (at least not easily).
Stochastic differential equations
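Consider the random walk on the line:
$$x_{t+1} = x_t + \xi_t, \qquad \xi_t = \pm 1$$
with $x_0 = 0$. We can compute
$$x_t = \sum_{i=1}^{t} \xi_i$$
Since $x_t$ is a sum of independent random variables, it becomes Gaussian distributed for large $t$, with
$$\langle x_t \rangle = \sum_{i=1}^{t} \langle \xi_i \rangle = 0$$
$$\langle x_t^2 \rangle = \sum_{i,j=1}^{t} \langle \xi_i \xi_j \rangle = \sum_{i=1}^{t} \langle \xi_i^2 \rangle + \sum_{i,j=1,\, j \neq i}^{t} \langle \xi_i \xi_j \rangle = t$$
Note that the fluctuations are $\propto \sqrt{t}$.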
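The mean and variance can be verified without sampling by propagating the exact distribution of the walk. This is a small sketch assuming the two steps ±1 are equally likely (which is what a zero mean of ξ implies); the function name `walk_distribution` is illustrative.

```python
def walk_distribution(t):
    """Exact distribution of x_t for x_{t+1} = x_t + xi_t, xi_t = +/-1 with probability 1/2 each."""
    p = {0: 1.0}                           # x_0 = 0 with certainty
    for _ in range(t):
        q = {}
        for x, px in p.items():
            q[x + 1] = q.get(x + 1, 0.0) + 0.5 * px
            q[x - 1] = q.get(x - 1, 0.0) + 0.5 * px
        p = q
    return p

t = 20
p = walk_distribution(t)
mean = sum(x * px for x, px in p.items())
var = sum(x * x * px for x, px in p.items()) - mean ** 2
```

Here `mean` is 0 and `var` equals t, so the typical excursion after t steps is of order √t.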
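In the continuous time limit we define
$$dx_t = x_{t+dt} - x_t = d\xi$$
with $d\xi$ an infinitesimal mean zero Gaussian variable with $\langle d\xi^2 \rangle = \nu\,dt$.
Then
$$\frac{d}{dt}\langle x \rangle = 0 \;\Rightarrow\; \langle x \rangle(t) = 0$$
$$\frac{d}{dt}\langle x^2 \rangle = \nu \;\Rightarrow\; \langle x^2 \rangle(t) = \nu t$$
$$\rho(x, t | x_0, 0) = \frac{1}{\sqrt{2\pi\nu t}} \exp\left( -\frac{(x - x_0)^2}{2\nu t} \right)$$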
Stochastic optimal control
Consider a stochastic dynamical system
$$dx = f(t, x, u)\,dt + d\xi$$
with $d\xi$ Gaussian noise, $\langle d\xi_i\, d\xi_j \rangle = \nu_{ij}(t, x, u)\,dt$.
The cost becomes an expectation:
$$C(t, x, u(t \to T)) = \left\langle \phi(x(T)) + \int_t^T d\tau\, R(\tau, x(\tau), u(\tau)) \right\rangle$$
over all stochastic trajectories starting at $x$ with control path $u(t \to T)$.
Note that only $u(t)$, as part of $u(t \to T)$, is used at time $t$. We then move to $x + dx$ and repeat the optimization.
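We obtain the Bellman recursion
$$J(t, x_t) = \min_{u_t} \left( R(t, x_t, u_t)\,dt + \langle J(t+dt, x_{t+dt}) \rangle \right)$$
$$\langle J(t+dt, x_{t+dt}) \rangle = \int dx_{t+dt}\, \mathcal{N}(x_{t+dt} | x_t, \nu\,dt)\, J(t+dt, x_{t+dt})$$
$$= J(t, x_t) + dt\,\partial_t J(t, x_t) + \langle dx \rangle\, \partial_x J(t, x_t) + \frac{1}{2} \langle dx^2 \rangle\, \partial_x^2 J(t, x_t)$$
$$\langle dx \rangle = f(t, x, u)\,dt, \qquad \langle dx^2 \rangle = \nu(t, x, u)\,dt$$
Thus,
$$-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(t, x, u)\,\partial_x J(t, x) + \frac{1}{2} \nu(t, x, u)\,\partial_x^2 J(t, x) \right)$$
with boundary condition $J(T, x) = \phi(x)$.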
Linear Quadratic control
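The dynamics is linear
$$dx = [A(t)x + B(t)u + b(t)]\,dt + \sum_{j=1}^{m} \left( C_j(t)x + D_j(t)u + \sigma_j(t) \right) d\xi_j, \qquad \langle d\xi_j\, d\xi_{j'} \rangle = \delta_{jj'}\,dt$$
The cost function is quadratic
$$\phi(x) = \frac{1}{2} x^T G x$$
$$R(x, u, t) = \frac{1}{2} x^T Q(t) x + u^T S(t) x + \frac{1}{2} u^T R(t) u$$
In this case the optimal cost-to-go is quadratic in $x$:
$$J(t, x) = \frac{1}{2} x^T P(t) x + \alpha^T(t) x + \beta(t)$$
$$u(t) = -\Psi(t) x(t) - \psi(t)$$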
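Substitution in the HJB equation yields ODEs for $P$, $\alpha$, $\beta$:
$$-\dot P = PA + A^T P + \sum_{j=1}^{m} C_j^T P C_j + Q - \hat S^T \hat R^{-1} \hat S$$
$$-\dot\alpha = [A - B \hat R^{-1} \hat S]^T \alpha + \sum_{j=1}^{m} [C_j - D_j \hat R^{-1} \hat S]^T P \sigma_j + P b$$
$$\dot\beta = \frac{1}{2} \left| \sqrt{\hat R}\, \psi \right|^2 - \alpha^T b - \frac{1}{2} \sum_{j=1}^{m} \sigma_j^T P \sigma_j$$
$$\hat R = R + \sum_{j=1}^{m} D_j^T P D_j$$
$$\hat S = B^T P + S + \sum_{j=1}^{m} D_j^T P C_j$$
$$\Psi = \hat R^{-1} \hat S$$
$$\psi = \hat R^{-1} \left( B^T \alpha + \sum_{j=1}^{m} D_j^T P \sigma_j \right)$$
with $P(t_f) = G$ and $\alpha(t_f) = \beta(t_f) = 0$.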
Example
Find the optimal control for the dynamics
$$dx = (x + u)\,dt + d\xi, \qquad \langle d\xi^2 \rangle = \nu\,dt$$
with end cost $\phi(x) = 0$ and path cost $R(x, u) = \frac{1}{2}(x^2 + u^2)$.
The Riccati equations reduce to
$$-\dot P = 2P + 1 - P^2$$
$$-\dot\alpha = 0$$
$$\dot\beta = -\frac{1}{2} \nu P$$
with $P(T) = \alpha(T) = \beta(T) = 0$ and
$$u(x, t) = -P(t)\,x$$
[Figure: $P$ and $\beta$ as functions of $t$.]
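The two scalar ODEs of this example are easy to integrate backward from the end time. The following is a minimal Euler sketch; the horizon T = 10, ν = 1, the step count and the function name `riccati_backward` are illustrative assumptions.

```python
import math

def riccati_backward(T=10.0, nu=1.0, n_steps=10000):
    """Integrate -dP/dt = 2P + 1 - P^2 and dbeta/dt = -nu*P/2 backward from P(T) = beta(T) = 0."""
    dt = T / n_steps
    P, beta = 0.0, 0.0
    for _ in range(n_steps):
        dP_dt = -(2.0 * P + 1.0 - P * P)
        dbeta_dt = -0.5 * nu * P
        # Step from t to t - dt: value(t - dt) = value(t) - dt * d(value)/dt
        P -= dt * dP_dt
        beta -= dt * dbeta_dt
    return P, beta          # values at t = 0

P0, beta0 = riccati_backward()
P_stationary = 1.0 + math.sqrt(2.0)   # positive root of P^2 - 2P - 1 = 0
```

Far from the end time, P settles at the stationary value 1 + √2, so the feedback gain in u = −P(t)x becomes time independent, while β keeps growing with the remaining time. Only β depends on ν.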
Comments
Note that in the last example the optimal control is independent of $\nu$, i.e. optimal stochastic control equals optimal deterministic control.
This is true in general for 'non-multiplicative' noise ($C_j = D_j = 0$).
Learning
What happens if (part of) the state is not observed?
For instance,
- As a result of measurement error we do not know $x_t$ but only $p(x_t | y_{0:t})$
- We do not know the parameters of the dynamics
- We do not know the cost/rewards (RL case)
Learning in RL or receding horizon
Imagine eternity, and you are ordered to cook a very good meal for tomorrow.
You decide to spend all of today to learn the recipes.
When the next day arrives, you are ordered to cook a very good meal for tomorrow.
You decide to spend all of today to learn the recipes.
. . .
[Figure: V as a function of t.]
- The learning phase takes forever.
- Mix of exploration and optimizing (RL: actor-critic, E3, ...)
- Learning is not part of the control problem
Finite horizon learning
Imagine instead life as we know it. It is finite and we have only one life.
The aim is to maximize accumulated reward. This requires planning your learning!
- At $t = 0$, action is useless.
- At $t = T$, learning is useless.
[Figure: learning and action as a function of t.]
Problem of inference and control.
Inference and control
As an example, consider the problem
$$dx = \alpha u\,dt + d\xi$$
with $\alpha$ unobserved and $x$ observed. Path cost $R(x, u, t)$, end cost $\phi(x)$ and noise variance $\langle d\xi^2 \rangle = \nu\,dt$ are given.
The problem is that the future information that we receive about $\alpha$ depends on $u$. Each time step we observe $dx$ and $u$ and thus learn about $\alpha$.
$$p_{t+dt}(\alpha | dx, u) \propto p(dx | \alpha, u)\, p_t(\alpha)$$
The solution is to augment the state space with parameters $\theta_t$ (sufficient statistics) that describe $p_t(\alpha) = p(\alpha | \theta_t)$, with $\theta_0$ known. Then with $\alpha = \pm 1$, $p_t(\alpha = 1) = \sigma(\theta_t)$:
$$d\theta = \frac{u}{\nu}\,dx = \frac{u}{\nu}(\alpha u\,dt + d\xi)$$
NB: In the forward pass $d\theta = F(dx)$, thus $\theta$ is also observed.
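The claim that θ is a sufficient statistic can be checked by comparing a step-by-step Bayes update with the accumulated θ. The sketch below assumes σ(θ) = (1 + tanh θ)/2, i.e. p(α|θ) ∝ exp(αθ); the slides do not define σ, and this is the choice for which dθ = (u/ν)dx reproduces the posterior. The observation sequence and function names are made up for illustration.

```python
import math

def posterior_direct(prior_p1, steps, nu, dt):
    """Posterior p(alpha = 1) from Bayes' rule with Gaussian likelihoods p(dx | alpha, u)."""
    p1 = prior_p1
    for u, dx in steps:
        def lik(alpha):
            return math.exp(-(dx - alpha * u * dt) ** 2 / (2.0 * nu * dt))
        w_plus, w_minus = lik(+1) * p1, lik(-1) * (1.0 - p1)
        p1 = w_plus / (w_plus + w_minus)
    return p1

def posterior_theta(theta0, steps, nu):
    """Same posterior via the sufficient statistic: d theta = (u / nu) dx."""
    theta = theta0
    for u, dx in steps:
        theta += u * dx / nu
    return 0.5 * (1.0 + math.tanh(theta))   # assumed form of sigma(theta)

# Hypothetical observations (u, dx); theta0 = 0 corresponds to p(alpha = 1) = 1/2
steps = [(1.0, 0.3), (-0.5, -0.1), (2.0, 0.25)]
p_bayes = posterior_direct(0.5, steps, nu=0.5, dt=0.1)
p_theta = posterior_theta(0.0, steps, nu=0.5)
```

Both routes give the same posterior, and a step with u = 0 leaves θ unchanged: without control there is nothing to learn about α, which is why learning and control are coupled here.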
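With $z_t = (x_t, \theta_t)$ we obtain a standard HJB equation:
$$-\partial_t J(t, z)\,dt = \min_u \left( R(t, z, u)\,dt + \langle dz \rangle_z\, \partial_z J(t, z) + \frac{1}{2} \langle dz^2 \rangle_z\, \partial_z^2 J(t, z) \right)$$
with boundary condition $J(T, z) = \phi(x)$.
The expectation values are conditioned on $(x_t, \theta_t)$ and are averages over $p(\alpha | \theta_t)$ and the Gaussian noise, cf.
$$\langle dx \rangle_{x,\theta} = \langle \alpha u\,dt + d\xi \rangle_{x,\theta} = \bar\alpha u\,dt, \qquad \bar\alpha = \int d\alpha\, p(\alpha | \theta)\, \alpha$$
[Figure: $J(z, t+dt)$ with $\langle dz \rangle_z$ and $\langle dz^2 \rangle_z$ between $t$ and $t+dt$, up to $t_f$.]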
Certainty equivalence
An important special case of a partially observable control problem is the Kalman filter ($y$ observed, $x$ not observed).
$$dx = (x + u)\,dt + d\xi$$
$$y = x + \eta$$
The cost is quadratic in $x$ and $u$, for instance
$$C(x_t, t, u_{t:T}) = \left\langle \sum_{\tau=t}^{T} \frac{1}{2} (x_\tau^2 + u_\tau^2) \right\rangle$$
The optimal control is $u(x, t)$.
When $x_t$ is not observed, we can compute $p(x_t | y_{0:t})$ using Kalman filtering and the optimal control minimizes
$$C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, p(x_t | y_{0:t})\, C(x_t, t, u_{t:T})$$
Since $p(x_t | y_{0:t}) = \mathcal{N}(x_t | \mu_t, \sigma_t^2)$ is Gaussian, we have
$$C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, C(x_t, t, u_{t:T})\, \mathcal{N}(x_t | \mu_t, \sigma_t^2) = \sum_{\tau=t}^{T} \frac{1}{2} u_\tau^2 + \frac{1}{2} \sum_{\tau=t}^{T} \langle x_\tau^2 \rangle_{\mu_t, \sigma_t}$$
$$= \cdots = C(\mu_t, t, u_{t:T}) + \frac{1}{2} (T - t)\, \sigma_t^2$$
The first term is identical to the observed case with $x_t \to \mu_t$. The second term does not depend on $u$ and thus does not affect the optimal control.
The optimal control for the Kalman filter is identical to the observed case with $x_t$ replaced by $\mu_t$:
$$u_{KF}(y_{0:t}, t) = u(\mu_t, t)$$
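A one-step toy version makes the argument concrete. The sketch below takes a single Euler step of dx = (x + u)dt with the noise dropped, uses cost (x₁² + u²)/2, and averages over an uncertain initial state. As a labeled shortcut, the Gaussian is replaced by a two-point distribution μ ± σ with the same mean and variance, which gives the same expectation for a quadratic cost. All names and numbers are illustrative.

```python
def expected_cost(u, mu, sigma, dt=0.1):
    """Expected one-step cost when x0 has mean mu and standard deviation sigma."""
    total = 0.0
    for x0 in (mu - sigma, mu + sigma):     # two-point stand-in for N(mu, sigma^2)
        x1 = x0 + (x0 + u) * dt             # one Euler step of dx = (x + u) dt
        total += 0.5 * (0.5 * (x1 * x1 + u * u))
    return total

def argmin_u(mu, sigma, lo=-10.0, hi=10.0):
    """Ternary search for the minimizer of a convex function of u."""
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if expected_cost(m1, mu, sigma) < expected_cost(m2, mu, sigma):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

u_observed = argmin_u(mu=2.0, sigma=0.0)    # state known exactly
u_uncertain = argmin_u(mu=2.0, sigma=3.0)   # same mean, large uncertainty
cost_gap = expected_cost(u_observed, 2.0, 3.0) - expected_cost(u_observed, 2.0, 0.0)
```

The two minimizers coincide, and the uncertainty only adds a control-independent offset (`cost_gap`) to the expected cost.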
Summary
For infinite time problems, learning is a meta problem with a time scale unrelated to the horizon time.
A finite time partially observable (or adaptive) control problem is in general equivalent to an observable non-adaptive control problem in an extended state space.
The partially observable case is generally more complex than the observable case. For instance, an LQ problem with unknown parameters is not LQ.
For a Kalman filter with unobserved states and known parameters, the partial observability does not affect the optimal control law (certainty equivalence).
Learning can be done either in the maximum likelihood sense or in a full Bayesian way.
Path integral control
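Solving the PDE is hard. We consider a special case that can be 'solved'.
$$dx = (f(x, t) + u)\,dt + d\xi$$
$$R(x, u, t) = V(x, t) + \frac{1}{2} u^T R u$$
and $\phi(x)$ arbitrary.
The stochastic HJB equation becomes:
$$-\partial_t J = \min_u \left( \frac{1}{2} u^T R u + V + (f + u)^T \partial_x J + \frac{1}{2} \mathrm{Tr}(\nu\, \partial_x^2 J) \right)$$
$$= -\frac{1}{2} (\partial_x J)^T R^{-1} (\partial_x J) + V + f^T \partial_x J + \frac{1}{2} \mathrm{Tr}(\nu\, \partial_x^2 J)$$
$$u = -R^{-1} \partial_x J(x, t)$$
This equation must be solved backward in time with boundary condition $J(T, x) = \phi(x)$.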
Closed form solution
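If we further assume that $\nu = \lambda R^{-1}$ for some scalar $\lambda$, then we can solve for $J$ in closed form. The solution consists of the following steps:
Substitute $J = -\lambda \log \Psi$ in the HJB equation. The quadratic term $-\frac{1}{2}(\partial_x J)^T R^{-1} \partial_x J$ and the part of $\frac{1}{2}\mathrm{Tr}(\nu\,\partial_x^2 J)$ that is quadratic in $\partial_x \Psi$ then carry the coefficients $-\frac{1}{2}\lambda^2 R^{-1}$ and $+\frac{1}{2}\lambda\nu$. Because of the relation $\nu = \lambda R^{-1}$ these cancel, and only terms linear in $\Psi$ remain:
$$\partial_t \Psi = -H \Psi, \qquad H = -\frac{V}{\lambda} + f \partial_x + \frac{1}{2} \nu\, \partial_x^2$$
This equation must be solved backward in time with boundary condition $\Psi(x, T) = \exp(-\phi(x)/\lambda)$.
The linearity allows us to reverse the direction of computation, replacing it by a diffusion process, in the following way.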
Let $\rho(y, \tau | x, t)$ describe a diffusion process defined by the Fokker-Planck equation
$$\partial_\tau \rho = -\frac{V}{\lambda} \rho - \partial_y (f \rho) + \frac{1}{2} \nu\, \partial_y^2 \rho = H^\dagger \rho \qquad (1)$$
with $\rho(y, t | x, t) = \delta(y - x)$.
Define
$$A(x, t) = \int dy\, \rho(y, \tau | x, t)\, \Psi(y, \tau).$$
It is easy to see, using the equations of motion for $\Psi$ and $\rho$, that $A(x, t)$ is independent of $\tau$. Evaluating $A(x, t)$ for $\tau = t$ yields $A(x, t) = \Psi(x, t)$. Evaluating $A(x, t)$ for $\tau = T$ yields $A(x, t) = \int dy\, \rho(y, T | x, t)\, \Psi(y, T)$. Equating the two gives
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t) \exp(-\phi(y)/\lambda)$$
Forward sampling of the diffusion process
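The diffusion equation
$$\partial_\tau \rho = -\frac{V}{\lambda} \rho - \partial_y (f \rho) + \frac{1}{2} \nu\, \partial_y^2 \rho \qquad (2)$$
can be sampled as
$$dx = f(x, t)\,dt + d\xi$$
$$x = x + dx, \quad \text{with probability } 1 - V(x, t)\,dt/\lambda$$
$$x = \dagger, \quad \text{with probability } V(x, t)\,dt/\lambda$$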
We can estimate
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t) \exp(-\phi(y)/\lambda) \approx \frac{1}{N} \sum_{i \in \text{alive}} \exp(-\phi(x_i(T))/\lambda)$$
by computing $N$ trajectories $x_i(t \to T)$, $i = 1, \ldots, N$.
'Alive' denotes the subset of trajectories that do not get killed along the way by the $\dagger$ operation.
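The sampling scheme can be sketched in a few lines of Python for a scalar state. The function name `estimate_psi`, the Euler discretization and the test problem are assumptions: f = 0, constant V = 0.5, φ(x) = x²/2 and ν = λ = 1 are chosen only because Ψ is then known in closed form for comparison.

```python
import math
import random

def estimate_psi(x, t, T, f, V, phi, nu, lam, n_paths=5000, n_steps=20, seed=0):
    """Monte Carlo estimate of Psi(x, t) from uncontrolled, killed diffusion paths."""
    rng = random.Random(seed)
    dt = (T - t) / n_steps
    total = 0.0
    for _ in range(n_paths):
        xi, s, alive = x, t, True
        for _ in range(n_steps):
            if rng.random() < V(xi, s) * dt / lam:    # killed with probability V dt / lambda
                alive = False
                break
            xi += f(xi, s) * dt + rng.gauss(0.0, math.sqrt(nu * dt))
            s += dt
        if alive:                                     # only surviving paths contribute
            total += math.exp(-phi(xi) / lam)
    return total / n_paths                            # divide by N, not by the number alive

psi_mc = estimate_psi(x=1.0, t=0.0, T=1.0,
                      f=lambda x, s: 0.0, V=lambda x, s: 0.5,
                      phi=lambda x: 0.5 * x * x, nu=1.0, lam=1.0)
# Closed form for this test problem: survival factor times a Gaussian integral
psi_exact = math.exp(-0.5) * math.sqrt(0.5) * math.exp(-0.25)
J_mc = -1.0 * math.log(psi_mc)                        # J = -lambda log Psi
```

Naive sampling like this works when many paths end in regions of low end cost; otherwise most samples contribute almost nothing and importance sampling becomes necessary.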
The diffusion process
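The diffusion process can be written as a path integral (here and below $R = 1$ is taken, so that $\lambda = \nu$):
$$\rho(y, T | x, t) = \int [dx]_x^y \exp\left( -\frac{1}{\nu} S_{path}(x(t \to T)) \right)$$
$$S_{path}(x(t \to T)) = \int_t^T d\tau \left[ \frac{1}{2} \left( \dot x(\tau) - f(x(\tau), \tau) \right)^2 + V(x(\tau), \tau) \right]$$
[Figure: paths from $x$ at time $t$ to $y$ at time $t_f$.]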
The path integral formulation
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t) \exp\left( -\frac{\phi(y)}{\nu} \right) = \int [dx]_x \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)$$
$$S(x(t \to T)) = S_{path}(x(t \to T)) + \phi(x(T))$$
$\Psi$ is a partition sum and $J = -\nu \log \Psi$ can therefore be interpreted as a free energy. $S$ is the energy of a path and $\nu$ the temperature.
The corresponding probability distribution is
$$p(x(t \to T) | x, t) = \frac{1}{\Psi(x, t)} \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)$$
An example: double slit
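$$dx = u\,dt + d\xi$$
$$C = \left\langle \frac{1}{2} x(T)^2 + \int_0^T d\tau \left[ \frac{1}{2} u(\tau)^2 + V(x(\tau), \tau) \right] \right\rangle$$
$V(x, t = 1)$ implements a slit at an intermediate time $t = 1$.
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t)\, \Psi(y, T)$$
can be solved in closed form.
[Figure: double slit geometry in the $(t, x)$ plane, and $J$ as a function of $x$ for $t = 0$, $t = 0.99$, $t = 1.01$ and $t = 2$.]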
MC sampling on double slit
$$J(x, t) \approx -\nu \log \left( \frac{1}{N} \sum_{i=1}^{N} \exp(-\phi(x_T^i)/\nu) \right)$$
[Figure, three panels: (a) naive sample paths; (b) $J(x, t = 0)$ by naive sampling ($n = 100000$), MC versus exact; (c) $J(x, t = 0)$ by Laplace approximation and importance sampling ($n = 100$).]
The delayed choice
If we go further back in time we encounter a new phenomenon: a delayed choice of when to decide.
When the slit size goes to zero, $J$ is given by
$$J(x, T) = -\nu \log \int dy\, \rho(y | x)\, e^{-\phi(y)/\nu} = \frac{1}{T} \left( \frac{1}{2} x^2 - \nu T \log 2 \cosh \frac{x}{\nu T} \right)$$
where $T$ is the time to reach the slits. The expression between brackets is a typical free energy with temperature $\nu T$.
[Figure: $J(x, t)$ as a function of $x$ for $T = 2$, $T = 1$ and $T = 0.5$.]
Symmetry breaking at νT = 1 separates two qualitatively different behaviors.
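The transition at νT = 1 can be seen from the curvature of J at x = 0, which is (1/T)(1 − 1/(νT)). The sketch below evaluates the formula for J numerically and also the control u = −∂J/∂x, assuming unit control cost as in the double slit example. The function names are illustrative.

```python
import math

def J(x, T, nu):
    """J = (1/T) * (x^2 / 2 - nu*T*log(2*cosh(x / (nu*T))))."""
    return (0.5 * x * x - nu * T * math.log(2.0 * math.cosh(x / (nu * T)))) / T

def curvature_at_zero(T, nu, h=1e-3):
    """Central finite difference for the second derivative of J at x = 0."""
    return (J(h, T, nu) - 2.0 * J(0.0, T, nu) + J(-h, T, nu)) / (h * h)

def optimal_control(x, T, nu):
    """u = -dJ/dx = (tanh(x / (nu*T)) - x) / T, assuming unit control cost."""
    return (math.tanh(x / (nu * T)) - x) / T

far = curvature_at_zero(T=2.0, nu=1.0)     # nu*T > 1: single minimum at x = 0
near = curvature_at_zero(T=0.5, nu=1.0)    # nu*T < 1: x = 0 is a local maximum
u_far = optimal_control(0.1, T=2.0, nu=1.0)
u_near = optimal_control(0.1, T=0.5, nu=1.0)
```

For νT > 1 the control pushes a slightly displaced particle back towards x = 0 (the choice is delayed); for νT < 1 it pushes the particle further towards the nearest slit.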
[Figure: sample trajectories for the stochastic and the deterministic case.]
The timing of the decision, that is, when the automaton decides to go left or right, is the consequence of spontaneous symmetry breaking.
N joint arm in a plane
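[Figure: two arm configurations in the plane.]
$$d\theta_i = u_i\,dt + d\xi_i, \qquad i = 1, \ldots, n$$
$$C = \left\langle \phi(x_n(\vec\theta)) + \int d\tau\, \frac{1}{2} u(\tau)^2 \right\rangle$$
$$u_i = \frac{\langle \theta_i \rangle - \theta_i}{T - t}, \qquad i = 1, \ldots, n$$
$$\langle \cdots \rangle \propto \rho(\theta, T | \theta_0, t) \exp(-\phi(x_n(\vec\theta)))$$
$\langle \theta \rangle$ is computed from the uncontrolled penalized diffusion, using a variational approximation.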
Other path integral control issues
Multiple agents coordination
Relation Reinforcement learning and learning.
Realistic robotics applications; Non-differentiable cost functions (obstacles).
Summary
Deterministic control can be solved by
- HJB, a PDE
- PMP, two coupled ODEs with mixed boundary conditions
Stochastic control can be solved by
- HJB in general
- Riccati equation for sufficient statistics in the LQ case
- Path integral is LQ control with arbitrary cost and dynamics
- RL is a special case
Learning (PO states or parameter values)
- Decoupled from control in the RL case
- Joint inference and control is typically harder than control only
- For Kalman filters the PO is irrelevant (certainty equivalence)
The PI control problem provides a novel link to machine learning:
- statistical physics
- symmetry breaking, 'phase transitions'
- 'efficient' computation (MCMC, BP, MF, EP)
Further reading
See the ICML tutorial paper and the references therein, as well as other general references, at the tutorial web page
http://ml.cs.tu-berlin.de/~mtoussai/08-optimal-control/