8/3/2019 Dynamic Programming and Optimal Control Script
Dynamic Programming and Optimal Control
Script
Prof. Raffaello D'Andrea
Lecture notes
Dieter Baldinger Thomas Mantel Daniel Rohrer
HS 2010
Contents

1 Introduction
  1.1 Class Objective
  1.2 Key Ingredients
  1.3 Open Loop versus Closed Loop Control
  1.4 Discrete State and Finite State Problem
  1.5 The Basic Problem

2 Dynamic Programming Algorithm
  2.1 Principle of Optimality
  2.2 The DPA
  2.3 Chess Match Strategy Revisited
  2.4 Converting non-standard problems
    2.4.1 Time Lags
    2.4.2 Correlated Disturbances
    2.4.3 Forecasts
  2.5 Deterministic, Finite State Systems
    2.5.1 Convert DP to Shortest Path Problem
    2.5.2 DP algorithm
    2.5.3 Forward DP algorithm
  2.6 Converting Shortest Path to DP
  2.7 Viterbi Algorithm
  2.8 Shortest Path Algorithms
    2.8.1 Label Correcting Methods
    2.8.2 A* Algorithm
  2.9 Multi-Objective Problems
    2.9.1 Extended Principle of Optimality
  2.10 Infinite Horizon Problems
  2.11 Stochastic, Shortest Path Problems
    2.11.1 Main Result
    2.11.2 Sketch of Proof
    2.11.3 Proof of B
  2.12 Summary of previous lecture
  2.13 How do we solve Bellman's Equation?
    2.13.1 Method 1: Value Iteration (VI)
    2.13.2 Method 2: Policy Iteration (PI)
    2.13.3 Third Method: Linear Programming
    2.13.4 Analogies and Connections
  2.14 Discounted Problems

3 Continuous Time Optimal Control
  3.1 The Hamilton-Jacobi-Bellman (HJB) Equation
  3.2 Aside on Notation
    3.2.1 The Minimum Principle
  3.3 Extensions
    3.3.1 Fixed Terminal State
    3.3.2 Free initial state, with cost
  3.4 Linear Systems and Quadratic Costs
    3.4.1 Summary
  3.5 General Problem Formulation

Bibliography
Chapter 1
Introduction
1.1 Class Objective
The class objective is to make multiple decisions in stages so as to minimize a cost that captures undesirable outcomes.
1.2 Key Ingredients
1. Underlying discrete time system:
x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N − 1
k: discrete time index
xk: state
uk: control input, decision variable
wk: disturbance or noise, random parameters
N: time horizon
fk: function, captures system evolution
2. Additive cost function:

g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k)

where g_N(x_N) is the terminal cost and the sum is the accumulated stage cost. Each g_k is a given nonlinear function.
The cost is a function of the applied control. Because the w_k are random, we typically consider the expected cost:

E_{w_k} [ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) ]
Example 1: Inventory Control:
Keeping an item stocked in a warehouse. Too little, and you run out (bad). Too much, and you incur storage costs and misuse of capital (bad).
x_k: stock in the warehouse at the beginning of the kth time period
u_k: stock ordered and immediately delivered at the beginning of the kth time period
w_k: demand during the kth period, with some given probability distribution
Dynamics:
x_{k+1} = x_k + u_k − w_k

Excess demand is backlogged and corresponds to negative stock values.
Cost:

E [ R(x_N) + Σ_{k=0}^{N−1} ( r(x_k) + c·u_k ) ]

r(x_k): penalizes too much stock or negative stock
c·u_k: cost of the ordered items
R(x_N): terminal cost from items at the end that can't be sold, or demand that can't be met
Objective: The objective is to minimize the expected cost subject to u_k ≥ 0.
1.3 Open Loop versus Closed Loop Control
Open Loop: Come up with the control inputs u_0, ..., u_{N−1} before k = 0. In Open Loop the objective is to calculate {u_0, ..., u_{N−1}}.
Closed Loop: Wait until time k to make the decision. Assumes x_k is measurable. Closed Loop always gives performance at least as good, but is computationally much more expensive. In Closed Loop the objective is to calculate the optimal rule u_k = μ_k(x_k), where π = {μ_0, ..., μ_{N−1}} is a policy or control law.
Wednesday 22nd September, 2010 Discrete State and Finite State Problem
Example 2:
μ_k(x_k) = { s_k − x_k   if x_k < s_k
             0           otherwise
sk is some threshold.
1.4 Discrete State and Finite State Problem
When the state x_k takes on discrete values or the state space is finite, it is often convenient to express the dynamics in terms of transition probabilities:

P_ij(u, k) := Prob(x_{k+1} = j | x_k = i, u_k = u)
i: start state
j: possible future state
u: control input
k: time
This is equivalent to x_{k+1} = w_k, where w_k has the following distribution:

Prob(w_k = j | x_k = i, u_k = u) := P_ij(u, k)
Example 3: Optimizing Chess Playing Strategies:
Two-game chess match with an opponent; the objective is to come up with a strategy that maximizes the chance to win.

Each game can have 2 outcomes:

a) Win by one player: 1 point for the winner, 0 points for the loser.
b) Draw: 0.5 points for each player.

If the score is tied 1-1 at the end of 2 games, the match goes into sudden death mode until someone wins.

Decision variable for the player, two playing styles:

1) Timid play: draw with probability p_d, lose with probability 1 − p_d.
2) Bold play: win with probability p_w, lose with probability 1 − p_w.

Assume p_d > p_w as a necessary condition for the problem to make sense.
Problem: What playing style should be chosen? Since it doesn't make sense to play Timid when tied 1-1 at the end of 2 games, it is a 2-stage finite problem.
Transition Probability Graph: The graphs below show all possible outcomes.
(Transition probability graphs omitted: (a) Timid Play takes 0,0 to 1/2,1/2 with probability p_d and to 0,1 with probability 1 − p_d; (b) Bold Play takes 0,0 to 1,0 with probability p_w and to 0,1 with probability 1 − p_w.)

Figure 1.1: First Game
(Transition probability graphs omitted. From each first-game score (1,0), (1/2,1/2), (0,1): (a) Timid Play advances the score by 1/2-1/2 with probability p_d and by 0-1 with probability 1 − p_d, giving second-game scores among 2,0; 3/2,1/2; 1,1; 1/2,3/2; 0,2; (b) Bold Play advances it by 1-0 with probability p_w and by 0-1 with probability 1 − p_w.)

Figure 1.2: Second Game
Closed Loop Strategy: Play timid iff the player is ahead.
The probability of winning is:

p_d·p_w + p_w·((1 − p_d)·p_w + p_w·(1 − p_w)) = p_w²·(2 − p_w) + p_w·(1 − p_w)·p_d

For {p_w = 0.45, p_d = 0.9} and {p_w = 0.5, p_d = 1} the probabilities to win are 0.54 and 0.625, respectively.
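As a sanity check, the closed-loop win probability and the closed-form expression above can be evaluated numerically (a quick sketch; the function names are ours):

```python
def closed_loop_win_prob(pw: float, pd: float) -> float:
    """Bold in game 1; timid iff ahead; bold in sudden death."""
    # Win game 1 (prob pw): a timid draw (pd) wins the match outright,
    # otherwise (1 - pd) the match is tied and sudden death is won with pw.
    ahead = pd + (1.0 - pd) * pw
    # Lose game 1 (prob 1 - pw): must win game 2 bold (pw),
    # then win sudden death (pw).
    behind = pw * pw
    return pw * ahead + (1.0 - pw) * behind

def closed_form(pw: float, pd: float) -> float:
    """The equivalent closed-form expression from the text."""
    return pw ** 2 * (2 - pw) + pw * (1 - pw) * pd
```

Evaluating at {p_w = 0.45, p_d = 0.9} gives 0.536625 ≈ 0.54, and at {p_w = 0.5, p_d = 1} gives 0.625, matching the values quoted above.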
Open Loop Strategy Possibilities:
Wednesday 29th September, 2010 The Basic Problem
(Transition probability graph omitted: bold in game 1 takes 0,0 to 1,0 with probability p_w or to 0,1 with probability 1 − p_w; if ahead, timid play takes 1,0 to 3/2,1/2 (WIN) with probability p_d or to 1,1 with probability 1 − p_d; if behind, bold play takes 0,1 to 1,1 with probability p_w or to 0,2 (LOSE) with probability 1 − p_w; from 1,1, sudden-death bold play gives 2,1 (WIN) with probability p_w or 1,2 (LOSE) with probability 1 − p_w.)

Figure 1.3: Closed Loop Strategy
1) Timid in both games: p_d²·p_w
2) Bold in both games: p_w²·(3 − 2·p_w)
3) Bold in first, timid in second game: p_w·p_d + p_w·(1 − p_d)·p_w
4) Timid in first, bold in second game: p_w·p_d + p_w·(1 − p_d)·p_w

Clearly 1) is not the optimal OL strategy, because p_d²·p_w ≤ p_d·p_w ≤ p_w·p_d + p_w·(1 − p_d)·p_w. The best strategy yields:

p_w² + p_w·(1 − p_w)·max(2·p_w, p_d)

so the optimal OL strategy is 3) or 4) if p_d > 2·p_w, and 2) otherwise. It can be shown that if p_w ≤ 0.5, then the open-loop probability of winning is at most 0.5.
1.5 The Basic Problem
Summarize basic problem setup:
x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N − 1

x_k ∈ S_k (state space)
u_k ∈ C_k (control space)
w_k ∈ D_k (disturbance space)

u_k ∈ U(x_k) ⊆ C_k. The control is constrained not only as a function of time, but also of the current state.

w_k ∼ P(· | x_k, u_k). The noise distribution can depend on the current state and applied control.
Consider policies, or control laws,
π = {μ_0, μ_1, ..., μ_{N−1}}, where μ_k maps state x_k into controls u_k = μ_k(x_k), such that μ_k(x_k) ∈ U(x_k) for all x_k ∈ S_k. The set of all such π is called the set of Admissible Policies, denoted Π.

Given a policy π, the expected cost of starting at state x_0 is:

J_π(x_0) := E_{w_k} [ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) ]

Optimal Policy π*: J_{π*}(x_0) ≤ J_π(x_0) for all π ∈ Π.  Optimal Cost: J*(x_0) := J_{π*}(x_0).
Chapter 2
Dynamic Programming Algorithm
At the heart of the DP algorithm is the following very simple and intuitive idea.
2.1 Principle of Optimality
Let π* = {μ*_0, μ*_1, ..., μ*_{N−1}} be an optimal policy. Assume that in the process of using π*, a state x_i occurs at time i. Consider the subproblem whereby at time i we are at state x_i and we want to minimize

E_{w_k} [ g_N(x_N) + Σ_{k=i}^{N−1} g_k(x_k, μ_k(x_k), w_k) ].

Then the truncated policy {μ*_i, μ*_{i+1}, ..., μ*_{N−1}} is optimal for this subproblem.

The proof is simple: prove by contradiction. If the truncated policy were not optimal, you could find a different policy that gives a lower cost for the subproblem. Applying that policy to the original problem from time i would therefore give a lower cost, which contradicts the optimality of π*.
Example 4: Deterministic Scheduling Problem
Have 4 machines A, B, C, D, that are used to make something.
A must occur before B. C before D.
The solution is obtained by calculating the optimal cost for each node, beginning at the bottom of the tree. See figure 2.1.
(Decision tree omitted: from the initial condition I.C., the partial schedules A, C, AB, AC, CA, CD, ... branch out to the complete schedules ABCD, ACBD, ACDB, CABD, CADB, CDAB; arc costs and the optimal cost-to-go at each node are shown in the figure.)
Figure 2.1: Problem of example 4 with optimal cost for each node written above it (in circles).
2.2 The DPA
For every initial state x0, the optimal cost J(x0) is equal to J0(x0), given by
the last step of the following recursive algorithm, which proceeds backwards in time from N − 1 to 0:
Initialization: J_N(x_N) = g_N(x_N) for all x_N ∈ S_N.

Recursion:

J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]
where the expectation is taken with respect to P(· | x_k, u_k).
Furthermore, if u*_k = μ*_k(x_k) minimizes the recursion equation for each x_k and k, the policy π* = {μ*_0, ..., μ*_{N−1}} is optimal.
Comments
For each recursion step, we have to perform the optimization over all possible values x_k ∈ S_k, since we don't know a priori which states we will actually visit.

This pointwise optimization is what gives us μ*_k.
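The backward recursion above can be sketched in code. This is a minimal illustration, not the course's reference implementation; the problem instance (a small inventory problem with made-up capacity, costs, and demand distribution) is entirely hypothetical:

```python
def dp_solve(N, states, controls, f, g, gN, w_dist):
    """Backward DPA: returns cost-to-go tables J[k][x] and policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)                       # initialization: J_N = g_N
    for k in range(N - 1, -1, -1):            # backwards in time
        for x in states:
            best_u, best_c = None, float("inf")
            for u in controls(x):
                # expected stage cost plus cost-to-go of the successor state
                c = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                        for w, p in w_dist)
                if c < best_c:
                    best_u, best_c = u, c
            J[k][x], mu[k][x] = best_c, best_u
    return J, mu

# Illustrative inventory instance (all numbers are made up):
N = 3
states = [0, 1, 2]                            # stock on hand
controls = lambda x: range(3 - x)             # order up to capacity 2
w_dist = [(0, 0.1), (1, 0.7), (2, 0.2)]       # demand value and probability
f = lambda x, u, w: min(max(x + u - w, 0), 2)
g = lambda x, u, w: u + (x + u - w) ** 2      # order cost + shortage/holding
J, mu = dp_solve(N, states, controls, f, g, lambda x: 0.0, w_dist)
```

Note that the optimization is performed for every x ∈ S_k at every stage, exactly as the comment above describes, and the minimizing u at each (k, x) is recorded as the policy μ_k(x).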
Proof 1: (Read section 1.5 in [1] if you are mathematically inclined).
Wednesday 6th October, 2010 Chess Match Strategy Revisited
Denote π^k := {μ_k, μ_{k+1}, ..., μ_{N−1}} and

J*_k(x_k) = min_{π^k} E_{w_k,...,w_{N−1}} [ g_N(x_N) + Σ_{i=k}^{N−1} g_i(x_i, μ_i(x_i), w_i) ],

the optimal cost when, starting at time k, we find ourselves at state x_k. Finally, J*_N(x_N) = g_N(x_N).

We will show that J*_k = J_k generated by the DPA, which gives us the desired result when k = 0.
Induction: J*_N(x_N) = J_N(x_N), so the claim is true for k = N.
Assume it is true for k + 1: J*_{k+1}(x_{k+1}) = J_{k+1}(x_{k+1}) for all x_{k+1} ∈ S_{k+1}.
Then, since π^k = {μ_k, π^{k+1}}, we have

J*_k(x_k) = min_{(μ_k, π^{k+1})} E_{w_k,...,w_{N−1}} [ g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) + Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) ]

by the principle of optimality:

= min_{μ_k} E_{w_k} [ g_k(x_k, μ_k(x_k), w_k) + min_{π^{k+1}} E_{w_{k+1},...,w_{N−1}} [ g_N(x_N) + Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) ] ]

by the definition of J*_{k+1} and the update equation:

= min_{μ_k} E_{w_k} [ g_k(x_k, μ_k(x_k), w_k) + J*_{k+1}( f_k(x_k, μ_k(x_k), w_k) ) ]

by the induction hypothesis:

= min_{μ_k} E_{w_k} [ g_k(x_k, μ_k(x_k), w_k) + J_{k+1}( f_k(x_k, μ_k(x_k), w_k) ) ]

= min_{u_k ∈ U_k(x_k)} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

= J_k(x_k)
In other words: searching over a function μ_k is the same as solving for its value u_k = μ_k(x_k) pointwise, for each x_k.
J_k(x_k) is called the cost-to-go at state x_k. J_k(·) is called the cost-to-go function.
2.3 Chess Match Strategy Revisited
Recall
Timid Play: prob. tie = p_d, prob. loss = 1 − p_d
Bold Play: prob. win = p_w, prob. loss = 1 − p_w. 2-game match, plus tie breaker if necessary.
Objective: Find the policy which maximizes the probability of winning. We will solve with DP, replacing min by max. Assume p_d > p_w.
Define x_k = difference between our score and the opponent's score at the end of game k. Recall: 1 point for a win, 0 for a loss, and 0.5 for a tie.
Define Jk(xk) = probability of winning match at time k if state = xk.
Start Recursion
J_2(x_2) = { 1     if x_2 > 0
             p_w   if x_2 = 0  (sudden death, play bold)
             0     if x_2 < 0
Recursive Equation
J_k(x_k) = max[ p_d·J_{k+1}(x_k) + (1 − p_d)·J_{k+1}(x_k − 1)  (timid),
                p_w·J_{k+1}(x_k + 1) + (1 − p_w)·J_{k+1}(x_k − 1)  (bold) ]
Convince yourself that this is equivalent to the formal definitions:
J_k(x_k) = max_{u_k} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]
Note: There is only a terminal cost in this problem.
J_1(x_1) = max[ p_d·J_2(x_1) + (1 − p_d)·J_2(x_1 − 1),  p_w·J_2(x_1 + 1) + (1 − p_w)·J_2(x_1 − 1) ]
If x_1 = 1: max[ p_d + (1 − p_d)·p_w  (timid),  p_w + (1 − p_w)·p_w  (bold) ].
Which is bigger? Timid − Bold = (p_d − p_w)·(1 − p_w) > 0, so Timid is optimal, and J_1(1) = p_d + (1 − p_d)·p_w.
If x_1 = 0: max[ p_d·p_w + (1 − p_d)·0  (timid),  p_w + (1 − p_w)·0  (bold) ].
The optimum is p_w, so J_1(0) = p_w and Bold is the optimal strategy.
If x_1 = −1: max[ 0 (timid), p_w² (bold) ], so J_1(−1) = p_w², and the optimal strategy is Bold.
J_0(0) = max[ p_d·J_1(0) + (1 − p_d)·J_1(−1),  p_w·J_1(1) + (1 − p_w)·J_1(−1) ]
= max[ p_d·p_w + (1 − p_d)·p_w²,  p_w·(p_d + (1 − p_d)·p_w) + (1 − p_w)·p_w² ]
= max[ p_d·p_w + (1 − p_d)·p_w²,  p_d·p_w + (1 − p_d)·p_w² + (1 − p_w)·p_w² ]

J_0(0) = p_d·p_w + (1 − p_d)·p_w² + (1 − p_w)·p_w², and the optimal first-game strategy is Bold.
Optimal Strategy: If ahead, play Timid; otherwise play Bold.
2.4 Converting non-standard problems to Basic Problem
2.4.1 Time Lags
Assume the update equation is of the following form:

x_{k+1} = f_k(x_k, x_{k−1}, u_k, u_{k−1}, w_k)
Define y_k = x_{k−1} and s_k = u_{k−1}. Then

( x_{k+1}, y_{k+1}, s_{k+1} ) = ( f_k(x_k, y_k, u_k, s_k, w_k), x_k, u_k ) =: f̃_k(x̃_k, u_k, w_k)

Let x̃_k = (x_k, y_k, s_k), so that x̃_{k+1} = f̃_k(x̃_k, u_k, w_k).

The control is u_k = μ_k(x_k, u_{k−1}, x_{k−1}). This can be generalized to more than one time lag.
2.4.2 Correlated Disturbances
If the disturbances are not independent, they can often be modeled as the output of a system driven by independent disturbances ("Colored Noise").
Example 5:
w_k = C_k·y_{k+1}
y_{k+1} = A_k·y_k + ξ_k
A_k, C_k are given, and {ξ_k} is an independent sequence. As usual, x_{k+1} = f_k(x_k, u_k, w_k).
Augment the state with y_k:

( x_{k+1}, y_{k+1} ) = ( f_k(x_k, u_k, C_k·(A_k·y_k + ξ_k)),  A_k·y_k + ξ_k )

and u_k = μ_k(x_k, y_k), which is now in the standard form. In general, y_k cannot be measured and must be estimated.
2.4.3 Forecasts
When state information includes knowledge of probability distributions: at the beginning of each period k, we receive information about the probability distribution of w_{k+1}. In particular, assume w_{k+1} could have one of the probability distributions {Q_1, Q_2, ..., Q_m}, with a priori probabilities p_1, ..., p_m. At time k, we receive the forecast i that Q_i will be used to generate w_{k+1}. Model as follows: y_{k+1} = ξ_k, where ξ_k is a random variable taking value i with probability p_i. In particular, w_k has probability distribution Q_{y_k}.
Then we have

( x_{k+1}, y_{k+1} ) = ( f_k(x_k, u_k, w_k), ξ_k ).  New state: x̃_k = (x_k, y_k).
Since y_k is known at time k, we have a Basic Problem formulation. The new disturbance w̃_k = (w_k, ξ_k) depends on the current state, which is allowed. The DPA takes on the following form:
J_N(x_N, y_N) = g_N(x_N)

J_k(x_k, y_k) = min_{u_k} E_{w_k, ξ_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k), ξ_k ) | y_k ]

= min_{u_k} E_{w_k} [ g_k(x_k, u_k, w_k) + Σ_{i=1}^{m} p_i·J_{k+1}( f_k(x_k, u_k, w_k), i ) | y_k ]

where the conditional expectation simply means that w_k has probability distribution Q_{y_k}: for y_k ∈ {1, ..., m}, the expectation over w_k is taken with respect to the distribution Q_{y_k}.
Wednesday 13th October, 2010 Deterministic, Finite State Systems
2.5 Deterministic, Finite State Systems
Recall Basic Problem
x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, ..., N − 1
g_k(x_k, u_k, w_k): cost at stage k.
Consider Problems where
1. x_k ∈ S_k, where S_k is a finite set
2. No disturbances w_k

We assume, without loss of generality, that there is only one way to go from state i ∈ S_k to j ∈ S_{k+1} (if there is more than one way, pick the one with lowest cost at stage k).
2.5.1 Convert DP to Shortest Path Problem
(Figure omitted: a trellis with start node S, the states of stages 1 through N arranged in columns, and an artificial terminal node T.)
Figure 2.2: General Shortest Path Problem
a^k_{ij} = cost to go from state i ∈ S_k to state j ∈ S_{k+1} at time k. This is equal to ∞ if there is no way to go from i ∈ S_k to j ∈ S_{k+1}.

a^N_{iT} = terminal cost of state i ∈ S_N. In other words,

a^k_{ij} = g_k(i, u^{ij}_k), where j = f_k(i, u^{ij}_k)
a^N_{iT} = g_N(i)
2.5.2 DP algorithm
J_N(i) = a^N_{iT},  i ∈ S_N

J_k(i) = min_{j ∈ S_{k+1}} [ a^k_{ij} + J_{k+1}(j) ],  i ∈ S_k, k = 0, ..., N − 1

This solves the shortest path problem.
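The backward recursion over a staged graph is only a few lines. A minimal sketch, on a made-up three-stage instance (node names and costs are ours):

```python
def staged_shortest_path(a, aT):
    """Backward DP over a staged graph.

    a[k][i] maps successor j -> arc cost a^k_ij (absent key = no arc);
    aT[i] is the terminal cost a^N_iT of final-stage state i."""
    J = dict(aT)                              # J_N = terminal costs
    for k in range(len(a) - 1, -1, -1):       # backwards over stages
        J = {i: min(cost + J[j] for j, cost in a[k][i].items())
             for i in a[k]}
    return J                                  # J_0 over the stage-0 states

# Small hypothetical instance: S -> {A, B} -> {C, D} -> T
a = [{"S": {"A": 1, "B": 4}},
     {"A": {"C": 5, "D": 2}, "B": {"C": 1, "D": 7}}]
aT = {"C": 3, "D": 1}
J0 = staged_shortest_path(a, aT)   # J0["S"] is the shortest S-to-T length
```

Here J0["S"] = 4, realized by the path S → A → D → T with costs 1 + 2 + 1.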
2.5.3 Forward DP algorithm
By inspection, the problem is symmetric: the shortest path from S to T is the same as from T to S. This motivates the following algorithm, where J̃_k(j) is the optimal cost to arrive at state j:

J̃_N(j) = a^0_{Sj},  j ∈ S_1

J̃_k(j) = min_{i ∈ S_{N−k}} [ a^{N−k}_{ij} + J̃_{k+1}(i) ],  j ∈ S_{N−k+1}, k = 1, ..., N − 1

J̃_0(T) = min_{i ∈ S_N} [ a^N_{iT} + J̃_1(i) ]

and J̃_0(T) = J_0(S).
2.6 Converting Shortest Path to DP
(Figure omitted: a general graph with a start node and an end node; cycles are allowed.)
Figure 2.3: Another Path Problem, in which circles are allowed
As an example for a mental picture, one could imagine cities on a map.
Let {1, 2, ..., N, T} be the nodes of the graph, and a_{ij} the cost to move from node i to node j, with a_{ij} = ∞ if there is no edge. Here i and j denote nodes, as opposed to the previous section where they denoted states.

Assume that all cycles have non-negative cost. This isn't an issue if all edges have cost ≥ 0.

Note that with the above assumption, there is an optimal path with length ≤ N (it visits each node at most once).
Set up the problem so that we require exactly N moves, with degenerate moves allowed (a_{ii} = 0).
J_k(i) = optimal cost of getting from i to T in N − k moves

J_N(i) = a_{iT} (can be infinite, of course)

J_k(i) = min_j [ a_{ij} + J_{k+1}(j) ]

(the optimal (N − k)-move cost is a_{ij} plus the optimal (N − k − 1)-move cost from j).

Notice that degenerate moves are allowed (remove them in the end). Terminate the procedure if J_k(i) = J_{k+1}(i) for all i.
2.7 Viterbi Algorithm
This is a powerful combination of DP and Bayes' Rule for optimal estimation.

Given a Markov Chain with state transition probabilities p_{ij}:

p_{ij} = P(x_{k+1} = j | x_k = i),  1 ≤ i, j ≤ M

p(x_0) = initial probability of the starting state.

We can only indirectly observe the state via measurements:

r(z; i, j) = P(measurement = z | x_k = i, x_{k+1} = j)  for all k

where r is the likelihood function.
Objective: Given measurements Z_N = {z_1, ..., z_N}, construct the sequence X_N = {x_0, ..., x_N} that maximizes PR(X_N | Z_N) over all X_N: the most likely state sequence.

Recall PR(X_N, Z_N) = PR(X_N | Z_N)·PR(Z_N). For a given Z_N, maximizing PR(X_N, Z_N) over X_N gives the same result as maximizing PR(X_N | Z_N) over X_N.
PR(X_N, Z_N) = PR(x_0, ..., x_N, z_1, ..., z_N)
= PR(x_1, ..., x_N, z_1, ..., z_N | x_0)·PR(x_0)
= PR(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1)·PR(x_1, z_1 | x_0)·PR(x_0)
= PR(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1)·PR(z_1 | x_0, x_1)·PR(x_1 | x_0)·PR(x_0)
= PR(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1)·r(z_1; x_0, x_1)·p_{x_0,x_1}·PR(x_0)

One more step:

PR(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1)
= PR(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, z_2, x_2)·PR(x_2, z_2 | x_0, x_1, z_1)
= PR(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, z_2, x_2)·PR(z_2 | x_0, x_1, z_1, x_2)·PR(x_2 | x_0, x_1, z_1)
= PR(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, z_2, x_2)·r(z_2; x_1, x_2)·p_{x_1,x_2}

Keep going, and one gets:

PR(X_N, Z_N) = PR(x_0)·Π_{k=1}^{N} p_{x_{k−1},x_k}·r(z_k; x_{k−1}, x_k)
Assume that all quantities are > 0. If some are = 0, the algorithm can be modified.
Since the above is a strictly positive product (by the above assumptions), and the log function is monotonically increasing in its argument, maximizing PR(X_N, Z_N) is equivalent to

min_{X_N} [ −log(PR(x_0)) − Σ_{k=1}^{N} log( p_{x_{k−1},x_k}·r(z_k; x_{k−1}, x_k) ) ]

which is a shortest path problem with arc lengths −log( p_{ij}·r(z; i, j) ).
Forward DP: At time k, we can already calculate the cost to arrive at any state; we don't have to wait until the end to start solving the problem.
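A compact sketch of the resulting forward DP on −log arc lengths (the two-state instance and the 0.8/0.2 measurement model are invented for illustration):

```python
import math

def viterbi(p0, p, r, Z):
    """Most likely state sequence maximizing PR(X_N, Z_N), via forward DP
    on arc lengths -log(p_ij * r(z; i, j)); assumes all quantities > 0."""
    d = {i: -math.log(p0[i]) for i in p0}        # cost to arrive at x_0 = i
    parents = []
    for z in Z:
        nd, par = {}, {}
        for j in p0:
            # best predecessor of state j given measurement z
            best = min(p0, key=lambda i: d[i] - math.log(p[i][j] * r(z, i, j)))
            nd[j] = d[best] - math.log(p[best][j] * r(z, best, j))
            par[j] = best
        d, parents = nd, parents + [par]
    xN = min(d, key=d.get)                       # most likely final state
    X = [xN]
    for par in reversed(parents):                # backtrack through parents
        X.append(par[X[-1]])
    return list(reversed(X))

# Hypothetical two-state chain; measurement reports the new state w.p. 0.8:
p0 = {0: 0.6, 1: 0.4}
p = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
r = lambda z, i, j: 0.8 if z == j else 0.2
X = viterbi(p0, p, r, [0, 0, 1])
```

For this instance the returned sequence is [0, 0, 0, 1]: the chain most likely stayed in state 0 and switched only at the last, differing measurement.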
2.8 Shortest Path Algorithms
We look at alternatives to DP for problems that are finite and deterministic. Here, path length ≡ cost.
Wednesday 20th October, 2010 Shortest Path Algorithms
2.8.1 Label Correcting Methods
Assume a_{ij} ≥ 0. Arc length = cost to go from node i to node j ≥ 0.
(Diagram omitted: nodes are removed from the OPEN bin; for each child j of a removed node i, the tests d_i + a_{ij} < d_j and d_i + a_{ij} < d_T are applied, and on success d_j is set to d_i + a_{ij}.)
Figure 2.4: Diagram of the label correcting algorithm.
Let d_i be the length of the shortest path to i found so far.
Step 0: Place node S in the OPEN bin, set d_S = 0 and d_j = ∞ for all j ≠ S.

Step 1: Remove a node i from OPEN, and execute Step 2 for all children j of i.

Step 2: If d_i + a_{ij} < min(d_j, d_T), set d_j = d_i + a_{ij} and set i to be the parent of j. If j ≠ T, place j in OPEN if it is not already there.

Step 3: If OPEN is empty, done. If not, go back to Step 1.
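The four steps can be sketched as follows; removing the last-added node gives the depth-first flavor used in the example below (the small graph is ours):

```python
def label_correcting(arcs, S, T):
    """Label correcting method; arcs[i] maps child j -> nonnegative cost."""
    d = {S: 0.0}                       # Step 0: d_S = 0, others implicitly inf
    dT = float("inf")
    parent = {}
    OPEN = [S]
    while OPEN:                        # Step 3: loop until OPEN is empty
        i = OPEN.pop()                 # Step 1: last in, first out (depth-first)
        for j, aij in arcs.get(i, {}).items():
            dij = d[i] + aij
            # Step 2: improve label only if it can beat d_j and d_T
            if dij < min(d.get(j, float("inf")), dT):
                d[j], parent[j] = dij, i
                if j == T:
                    dT = dij
                elif j not in OPEN:
                    OPEN.append(j)
    path, node = [T], T                # reconstruct path via parents
    while node != S:
        node = parent[node]
        path.append(node)
    return dT, list(reversed(path))

# Hypothetical instance:
arcs = {"S": {"A": 1, "B": 4}, "A": {"B": 2, "T": 6}, "B": {"T": 1}}
cost, path = label_correcting(arcs, "S", "T")
```

On this instance the method settles on cost 4 along S → A → B → T, after first finding and then discarding the longer candidates.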
Example 6: Deterministic Scheduling Problem (revisited)
(Graph omitted: the scheduling graph of Example 4 with arc costs, an artificial terminal node T, and the precedence constraints A before B and C before D.)
Figure 2.5: Deterministic Scheduling Problem.
Iteration # | Remove | OPEN                   | d_T | OPTIMAL
0           | –      | S(0)                   | ∞   | –
1           | S      | A(5), C(3)             | ∞   | –
2           | C      | A(5), CA(7), CD(9)     | ∞   | –
3           | CD     | A(5), CA(7), CDA(12)   | ∞   | –
4           | CDA    | A(5), CA(7)            | 14  | CDAB
5           | CA     | A(5), CAB(9), CAD(11)  | 14  | CDAB
6           | CAD    | A(5), CAB(9)           | 14  | CDAB
7           | CAB    | A(5)                   | 10  | CABD
8           | A      | AB(7), AC(8)           | 10  | CABD
9           | AC     | AB(7)                  | 10  | CABD
10          | AB     | –                      | 10  | CABD

Done: optimal cost = 10, optimal path = CABD.
Different ways to remove items from OPEN give different, well known, algorithms.

Depth-First Search: Last in, first out. This is what we did in the example. Finds a feasible path quickly. Also good if you have limited memory.
Best-First Search: Remove the node with the best (smallest) label. Dijkstra's method. The remove step is more expensive, but it can give good performance.

Breadth-First Search: First in, first out. Bellman-Ford.
2.8.2 A* Algorithm

Workhorse for many AI applications, e.g. path planning.

Basic idea: Replace the test d_i + a_{ij} < d_T by d_i + a_{ij} + h_j < d_T, where h_j is a lower bound on the shortest distance from j to T. Indeed, if d_i + a_{ij} + h_j ≥ d_T, it is clear that a path going through j will not be optimal.
2.9 Multi-Objective Problems
Example 7: Motivation: we care about both time and fuel.
(Figure omitted: points in the time-fuel plane; the non-inferior points form the lower-left frontier, the rest are inferior.)
Figure 2.6: Possibilities in the time-fuel graph.
A vector x = (x_1, x_2, ..., x_M) ∈ S is non-inferior if there is no other y ∈ S such that y_l ≤ x_l, l = 1, ..., M, with strict inequality for at least one of these l's.

Given a problem with M cost functions f_1(x), ..., f_M(x): x ∈ X is a non-inferior solution if the vector (f_1(x), ..., f_M(x)) is a non-inferior vector of the set {(f_1(y), ..., f_M(y)) | y ∈ X}.

Reasonable goal: find all non-inferior solutions, then use another criterion to pick the one you actually want to use.
How this applies to deterministic, finite state DP problems (which are equivalent to shortest path problems):

x_{k+1} = f_k(x_k, u_k)  (dynamics)

g^l_N(x_N) + Σ_{k=0}^{N−1} g^l_k(x_k, u_k),  l = 1, ..., M  (M cost functions)
2.9.1 Extended Principle of Optimality
If {u_k, ..., u_{N−1}} is a non-inferior control sequence for the tail subproblem that starts at x_k, then {u_{k+1}, ..., u_{N−1}} is also non-inferior for the tail subproblem that starts at f_k(x_k, u_k). Simple proof: by contradiction.
Algorithm: First define what we will do the recursion over:

F_k(x_k): the set of M-tuples (vectors of size M) of cost-to-go at x_k which are non-inferior.

F_N(x_N) = {(g^1_N(x_N), ..., g^M_N(x_N))}. Only one element in the set for each x_N.

Given F_{k+1}(x_{k+1}) for all x_{k+1}, generate for each state x_k and control u_k the set of vectors (g^1_k(x_k, u_k) + c^1, ..., g^M_k(x_k, u_k) + c^M) such that (c^1, ..., c^M) ∈ F_{k+1}(f_k(x_k, u_k)).

These are all the possible costs that are consistent with F_{k+1}(x_{k+1}). Then, to obtain F_k(x_k), simply extract all non-inferior elements.
(Figure omitted: state x_k fans out to its successor states, each carrying its set F_{k+1}(·); the candidate cost vectors at x_k are formed from these sets.)
Figure 2.7: Possible sets for Fk+1.
When we calculate F_0(x_0), we will have all non-inferior solutions.
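The extraction step — keeping only the non-inferior elements of a set of cost vectors — can be written directly from the definition (a small sketch; the sample vectors are made up):

```python
def non_inferior(vectors):
    """Filter a list of M-tuples down to its non-inferior elements."""
    def dominates(y, x):
        # y is at least as good in every component and differs somewhere,
        # i.e. y_l <= x_l for all l with strict inequality for some l
        return all(a <= b for a, b in zip(y, x)) and y != x
    return [x for x in vectors if not any(dominates(y, x) for y in vectors)]

# Hypothetical (time, fuel) cost vectors:
front = non_inferior([(1, 5), (2, 2), (3, 1), (2, 6), (3, 3)])
```

Here (2, 6) is dominated by (1, 5) and (3, 3) by (2, 2), leaving the non-inferior frontier {(1, 5), (2, 2), (3, 1)}.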
Wednesday 27th October, 2010 Infinite Horizon Problems
2.10 Infinite Horizon Problems
Consider the time (or iteration) invariant case:
x_{k+1} = f(x_k, u_k, w_k),  x_k ∈ S, u_k ∈ U, w_k ∼ P(· | x_k, u_k)

J_π(x_0) = E [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k), w_k) ],  no terminal cost
Write down DP algorithm:
J_N(x_N) = 0

J_k(x_k) = min_{u_k ∈ U} E_{w_k} [ g(x_k, u_k, w_k) + J_{k+1}( f(x_k, u_k, w_k) ) ]  for all k
Question: What happens as N → ∞? Does the problem become easier?

Yes. Reason: we lose the notion of time. For a very large class of problems, we have the Bellman Equation:

J*(x) = min_u E_w [ g(x, u, w) + J*( f(x, u, w) ) ]  for all x ∈ S

The Bellman Equation involves solving for the optimal cost-to-go function J*(x) for all x ∈ S.

u = μ(x) gives the optimal policy (μ(·) is obtained from the solution to the Bellman Equation: for every x there is a minimizing u).
Efficient methods for solving Bellman Equation
Technical conditions on when this can be done.
2.11 Stochastic, Shortest Path Problems
x_{k+1} = w_k,  x_k ∈ S, a finite set
PR(w_k = j | x_k = i, u_k = u) = p_{ij}(u),  u_k ∈ U(x_k), a finite set
We have a finite number of states. The transition from one state to the next is dictated by p_{ij}(u): the probability that the next state is j given that the current state is i. u is the control input; we can control what these transition probabilities are, with a finite set of options u ∈ U(i). The problem data is time (or iteration) independent.
Cost: Given an initial state i and a policy π = {μ_0, μ_1, ...}:

J_π(i) = lim_{N→∞} E [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) | x_0 = i ].

Optimal cost from state i: J*(i).

Stationary policy: π = {μ, μ, ...}. Denote J_μ(i) as the resulting cost; {μ, μ, ...} is simply referred to as μ. μ is optimal if

J_μ(i) = J*(i) = min_π J_π(i)
Assumptions:

Existence of a cost-free termination state t:

p_tt(u) = 1 and g(t, u) = 0 for all u.

This is a sufficient condition to make the cost meaningful. Think of this as a destination state.

There exists an integer m such that for all admissible policies π:

ρ_π = max_{i=1,...,n} PR(x_m ≠ t | x_0 = i, π) < 1

This is a strong assumption, which is only required for the proofs.
2.11.1 Main Result
A) Given any initial conditions J_0(1), ..., J_0(n), the sequence

J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J_k(j) ]  for all i

converges to the optimal cost J*(i) for each i.
B) The optimal cost satisfies Bellman's Equation:

J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J*(j) ]  for all i

which has a unique solution.
2.11.2 Sketch of Proof
A0) First prove that the cost is bounded.

Recall: there exists m such that for all policies π,

ρ_π := max_i PR(x_m ≠ t | x_0 = i, π) < 1.

Since all problem data is finite,

ρ := max_π ρ_π < 1.

PR(x_{2m} ≠ t | x_0 = i, π) = PR(x_{2m} ≠ t | x_m ≠ t, x_0 = i, π)·PR(x_m ≠ t | x_0 = i, π) ≤ ρ².

Generally, PR(x_{km} ≠ t | x_0 = i, π) ≤ ρ^k. Furthermore, the cost incurred between the periods km and (k + 1)m − 1 is at most

ρ^k·m·max_{i,u} |g(i, u)| = ρ^k·M,  where M := m·max_{i,u} |g(i, u)|

so that

|J_π(i)| ≤ Σ_{k=0}^{∞} M·ρ^k = M/(1 − ρ),  finite.
A1)

J_π(x_0) = lim_{N→∞} E [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) ]
= E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ] + lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ]

By the previous bound, we know that

| lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ] | ≤ M·ρ^K/(1 − ρ)

As expected, we can make the tail as small as we want.
A2) Recall that we can view J_0 as a terminal cost function, with J_0(i) given. Bound its expected value:

|E(J_0(x_{mK}))| = | Σ_{i=1}^{n} PR(x_{mK} = i | x_0, π)·J_0(i) |
≤ ( Σ_{i=1}^{n} PR(x_{mK} = i | x_0, π) )·max_i |J_0(i)|
≤ ρ^K·max_i |J_0(i)|
A3) Sandwich:

E(J_0(x_{mK})) + E [ Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ] = E(J_0(x_{mK})) + J_π(x_0) − lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(x_k, μ_k(x_k)) ]

Recall that if a = b + c, then b − |c| ≤ a ≤ b + |c|, since −|c| ≤ c ≤ |c|. It follows that

−ρ^K·max_i |J_0(i)| − M·ρ^K/(1 − ρ) + J_π(x_0) ≤ E [ J_0(x_{mK}) + Σ_{k=0}^{mK−1} g(x_k, μ_k(x_k)) ] ≤ ρ^K·max_i |J_0(i)| + M·ρ^K/(1 − ρ) + J_π(x_0)
A4) We take the minimum over all policies; the middle term is exactly our DP recursion of part A after mK steps. Taking limits, we get

lim_{K→∞} J_{mK}(x_0) = J*(x_0)

Now we are almost done. Since

|J_{mK+1}(x_0) − J_{mK}(x_0)| ≤ ρ^K·M

we have

lim_{k→∞} J_k(x_0) = J*(x_0)
Wednesday 3rd November, 2010 Summary of previous lecture
Summary
A1: bound the tail over all policies
A2: bound contribution from initial condition J0(i), over all policies
A3: sandwich type of bounds, middle term is DP recursion
A4: optimized over all policies, took limit.
2.11.3 Proof of B

Prove that the optimal cost satisfies Bellman's equation. In Part A, we showed that the iteration

J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J_k(j) ]

converges: J_k(·) → J*(·); just take limits on both sides. To prove uniqueness, just use a solution of the Bellman equation as the initial condition of the DP iteration.
2.12 Summary of previous lecture
Dynamics
x_{k+1} = w_k
PR{w_k = j | x_k = i, u_k = u} = p_{ij}(u),  u ∈ U(i), a finite set
x_k ∈ S, S finite, S = {1, 2, ..., n, t}
p_tt(u) = 1 for all u ∈ U(t)
Cost: Given π = {μ_0, μ_1, ...}:

J_π(i) = lim_{N→∞} E [ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) | x_0 = i ],  i ∈ S

g(t, u) = 0 for all u ∈ U(t)

J*(i) = min_π J_π(i)  (optimal cost)

Note: J_π(t) = 0, so J*(t) = 0.
Result
A) Given any initial conditions J_0(1), ..., J_0(n), the sequence

J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J_k(j) ],  i ∈ S \ {t} = {1, ..., n}

converges to J*(i).
Note: there is a bit of a short-cut here; we can include the terminal state t, provided we pick J_0(t) = 0. This does not change the equations.
B)

J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J*(j) ],  i ∈ S \ {t}
This is Bellman's Equation. It also gives the optimal policy, which is in fact stationary.
2.13 How do we solve Bellman's Equation?
2.13.1 Method 1: Value iteration (VI)
Use the DP recursion of result A:

J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J_k(j) ],  i ∈ S \ {t}

until it converges. J_0(i) can be set to a guess; if the guess is good, it will speed up convergence. How do we know that we are close to converging? Exploit the problem structure to get bounds; see [1].
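The recursion is a few lines in code. A minimal sketch with dictionary-based problem data and a simple stopping test (the single-state example data are invented; probability mass not listed in P is assumed to go to the cost-free terminal state):

```python
def value_iteration(states, U, g, P, tol=1e-12):
    """Solve Bellman's equation by repeated application of the VI recursion.
    P[i][u] maps successor j -> p_ij(u); unlisted mass goes to the
    termination state t with J(t) = 0."""
    J = {i: 0.0 for i in states}
    while True:
        Jn = {i: min(g(i, u) + sum(p * J[j] for j, p in P[i][u].items())
                     for u in U(i))
              for i in states}
        if max(abs(Jn[i] - J[i]) for i in states) < tol:
            return Jn
        J = Jn

# Hypothetical single-state example: control "a" costs 2 and terminates
# with probability 0.5; control "b" costs 1 and terminates with prob. 0.1.
J = value_iteration([1], lambda i: ["a", "b"],
                    lambda i, u: 2.0 if u == "a" else 1.0,
                    {1: {"a": {1: 0.5}, "b": {1: 0.9}}})
```

For this instance the Bellman equation reads J(1) = min(2 + 0.5·J(1), 1 + 0.9·J(1)), whose solution is J*(1) = 4, attained by control "a".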
2.13.2 Method 2: Policy Iteration (PI)
Iterate over policies instead of values. We need the following result:

C) For any stationary policy μ, the costs J_μ(i) are the unique solutions of

J_μ(i) = g(i, μ(i)) + Σ_{j=1}^{n} p_{ij}(μ(i))·J_μ(j),  i ∈ S \ {t}
Furthermore, given any initial conditions J_0(i), the sequence

J_{k+1}(i) = g(i, μ(i)) + Σ_{j=1}^{n} p_{ij}(μ(i))·J_k(j)

converges to J_μ(i) for each i.

Proof: trivial. Consider the problem where the only allowable control at state i is μ(i), and apply parts A and B. This is a special case of the general theorem.
Algorithm for PI: From now on, i ∈ S \ {t} = {1, 2, ..., n}.

Stage 1: Given μ^k (the stationary policy at iteration k, not the policy at time k), solve for J_{μ^k}(i) from

J(i) = g(i, μ^k(i)) + Σ_{j=1}^{n} p_{ij}(μ^k(i))·J(j)  for all i

n equations, n unknowns (Result C).

Stage 2: Improve the policy:

μ^{k+1}(i) = arg min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u)·J_{μ^k}(j) ]  for all i

Iterate; quit when J_{μ^{k+1}}(i) = J_{μ^k}(i) for all i.
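The two stages can be sketched as follows. This is an illustration under our own conventions: Stage 1 evaluates the policy with the fixed-point iteration of result C rather than an explicit n-by-n linear solve (either works), and the single-state test instance is made up:

```python
def policy_iteration(states, U, g, P, tol=1e-12):
    """Policy iteration for a stochastic shortest path problem.
    P[i][u] maps successor j -> p_ij(u); unlisted mass terminates."""
    mu = {i: U(i)[0] for i in states}            # arbitrary initial policy
    while True:
        # Stage 1: policy evaluation (fixed-point iteration of result C)
        J = {i: 0.0 for i in states}
        while True:
            Jn = {i: g(i, mu[i]) +
                     sum(p * J[j] for j, p in P[i][mu[i]].items())
                  for i in states}
            if max(abs(Jn[i] - J[i]) for i in states) < tol:
                J = Jn
                break
            J = Jn
        # Stage 2: policy improvement
        new = {i: min(U(i), key=lambda u: g(i, u) +
                      sum(p * J[j] for j, p in P[i][u].items()))
               for i in states}
        if new == mu:                            # converged: Bellman holds
            return J, mu
        mu = new

# Same kind of hypothetical single-state data as before:
J, mu = policy_iteration([1], lambda i: ["a", "b"],
                         lambda i, u: 2.0 if u == "a" else 1.0,
                         {1: {"a": {1: 0.5}, "b": {1: 0.9}}})
```

On this instance PI settles on control "a" with J(1) = 4 after a single improvement check, since the initial policy is already optimal.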
Theorem: The above terminates after a finite number of steps, and converges to the optimal policy.
Proof: two steps.

1) We will first show that J_{μ^k}(i) ≥ J_{μ^{k+1}}(i) for all i, k.
2) We will show that what we converge to satisfies Bellman's Equation.

1) For fixed k, consider the following recursion in N:

J_{N+1}(i) = g(i, μ^{k+1}(i)) + Σ_{j=1}^{n} p_{ij}(μ^{k+1}(i))·J_N(j)  for all i

J_0(i) = J_{μ^k}(i)
By result C), J_N → J_{μ^{k+1}} as N → ∞.

J_0(i) = g(i, μ^k(i)) + Σ_j p_{ij}(μ^k(i))·J_0(j)
       ≥ g(i, μ^{k+1}(i)) + Σ_j p_{ij}(μ^{k+1}(i))·J_0(j) = J_1(i)

since μ^{k+1}(i) minimizes the right-hand side. Then

J_1(i) ≥ g(i, μ^{k+1}(i)) + Σ_j p_{ij}(μ^{k+1}(i))·J_1(j) = J_2(i)

since J_1(i) ≤ J_0(i). Keep going, and get

J_0(i) ≥ J_1(i) ≥ ... ≥ J_N(i) ≥ ...

Taking the limit:

J_{μ^k}(i) ≥ J_{μ^{k+1}}(i)  for all i

Since the number of stationary policies is finite, we will eventually have J_{μ^k}(i) = J_{μ^{k+1}}(i) for all i, for some finite k.
2) It follows from Stage 2 that

J_{\mu^{k+1}}(i) = J_{\mu^k}(i) = \min_{u \in U(i)} \Big[ g(i,u) + \sum_j p_{ij}(u)\, J_{\mu^k}(j) \Big]

when converged, but this is Bellman's Equation! We have therefore converged to the optimal policy.
Discussion

Complexity

Stage 1 Linear system of equations of size n: complexity O(n^3).

Stage 2 n minimizations over p choices (p different values of u that I can use): complexity O(p\, n^2).

Put together: O(n^2(n + p)) per iteration.

Worst case number of iterations: search over all p^n policies. But in practice, it converges very quickly.

Why does Policy Iteration converge so quickly relative to Value Iteration?
Wednesday 10th November, 2010 How do we solve Bellman's Equation?
Rewrite Value Iteration in the same two-stage form:

Stage 2 \mu^k(i) = \arg\min_{u \in U(i)} \Big[ g(i,u) + \sum_j p_{ij}(u)\, J_k(j) \Big]

Stage 1 J_{k+1}(i) = g(i, \mu^k(i)) + \sum_j p_{ij}(\mu^k(i))\, J_k(j)

and iterate.
2.13.3 Method 3: Linear Programming

Recall Bellman's Equation

J(i) = \min_{u \in U(i)} \Big[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J(j) \Big], \qquad i = 1, \ldots, n

and Value Iteration:

J_{k+1}(i) = \min_{u \in U(i)} \Big[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J_k(j) \Big], \qquad i = 1, \ldots, n

We showed that Value Iteration (V.I.) converges to the optimal cost to go J^* for all initial guesses J_0.
Assume we start V.I. with any J_0 that satisfies

J_0(i) \leq \min_{u \in U(i)} \Big[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J_0(j) \Big], \qquad i = 1, \ldots, n

It follows that J_1(i) \geq J_0(i) for all i. Then

J_1(i) \leq \min_{u \in U(i)} \Big[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J_1(j) \Big], \quad i = 1, \ldots, n \quad \Rightarrow \quad J_2(i) \geq J_1(i) \ \forall i

In general:

J_{k+1}(i) \geq J_k(i) \ \forall i, k, \qquad J_k \to J^*

\Rightarrow \quad J_0(i) \leq J^*(i) \ \forall i
Now let \tilde{J} solve the following problem:

\max_J \sum_i J(i) \quad \text{subject to} \quad J(i) \leq g(i,u) + \sum_j p_{ij}(u)\, J(j), \qquad \forall i, \ u \in U(i)

It is clear that \tilde{J}(i) \leq J^*(i) \ \forall i, by the previous analysis. Since J^* satisfies the constraints, it follows that \tilde{J} = J^* achieves the maximum. This is a Linear Program!
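The LP can be handed directly to a solver. A sketch using scipy.optimize.linprog (assuming SciPy is available; the toy problem data below are invented for illustration):

```python
from scipy.optimize import linprog

# Toy stochastic shortest path problem (hypothetical data):
# state 1: action "fast" costs 2 and terminates; action "slow" costs 0.5,
#          terminates w.p. 0.5 and stays in state 1 w.p. 0.5.
# state 2: action "go" costs 1 and moves to state 1.
# Variables: J = (J(1), J(2)).  Maximize sum_i J(i), i.e. minimize
# -sum_i J(i), subject to J(i) <= g(i,u) + sum_j p_ij(u) J(j) for all (i,u).
c = [-1.0, -1.0]
A_ub = [[1.0, 0.0],    # J1 <= 2                ("fast")
        [0.5, 0.0],    # J1 - 0.5 J1 <= 0.5     ("slow")
        [-1.0, 1.0]]   # J2 - J1 <= 1           ("go")
b_ub = [2.0, 0.5, 1.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub)
print(res.x)   # optimal cost to go for each state
```

The maximizing J hits the constraints corresponding to the optimal actions and recovers J^*.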
2.13.4 Analogies and Connections

Say I want to solve

J = G + PJ, \qquad J \in \mathbb{R}^n, \ G \in \mathbb{R}^n, \ P \in \mathbb{R}^{n \times n}

The direct way to solve it:

(I - P)J = G \quad \Rightarrow \quad J = (I - P)^{-1} G

This is exactly what we do in Stage 1 of policy iteration: solve for the cost associated with a specific policy.
Why is (I - P)^{-1} guaranteed to exist? For a given policy, let \bar{P} \in \mathbb{R}^{(n+1) \times (n+1)} be the probability matrix that captures our Markov Chain:

\bar{P} = \begin{pmatrix} P & p_t \\ 0 & 1 \end{pmatrix}

where p_{ij} is the probability that the next state is j given that the current state is i, and the column p_t collects the transition probabilities into the termination state t.

Facts

- \bar{P} is a right stochastic matrix: all rows sum up to 1, all elements \geq 0.
- Perron-Frobenius Theorem: the eigenvalues of \bar{P} have absolute value \leq 1, at least one eigenvalue = 1.
- Assumption on the terminal state: P^N \to 0 as N \to \infty. We will eventually reach the termination state!
Therefore
- the eigenvalues of P have absolute value < 1,
- (I - P)^{-1} exists.

Furthermore:

(I - P)^{-1} = I + P + P^2 + \ldots

Proof:

(I - P)(I + P + P^2 + \ldots) = I + (P + P^2 + \ldots) - (P + P^2 + \ldots) = I
Therefore one way to solve for J is as follows:

J_1 = G + PJ_0
J_2 = G + PJ_1 = G + PG + P^2 J_0
\vdots
J_N = (I + P + \ldots + P^{N-1})G + P^N J_0

J_N \to (I - P)^{-1} G \ \text{as} \ N \to \infty!
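This can be checked numerically. A small sketch with a hypothetical substochastic matrix P (the missing row mass is the probability of transitioning to the termination state):

```python
# Verify J_N = (I + P + ... + P^{N-1}) G + P^N J_0  ->  (I - P)^{-1} G
# for a hypothetical 2x2 substochastic matrix P and cost vector G.
P = [[0.5, 0.0],
     [1.0, 0.0]]
G = [0.5, 1.0]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Iterate J_{k+1} = G + P J_k from an arbitrary J_0.
J = [10.0, -3.0]
for _ in range(200):
    PJ = mat_vec(P, J)
    J = [G[i] + PJ[i] for i in range(2)]

# Direct solve of (I - P) J = G for this 2x2 case:
#   (1 - 0.5) J1 = 0.5   ->  J1 = 1
#   -J1 + J2 = 1         ->  J2 = 2
print(J)
```

The iterate forgets the arbitrary J_0 because P^N \to 0, exactly as the Neumann series argument predicts.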
Analogy

Value Iteration: one step of the update.
Policy Iteration: an infinite number of updates, i.e. solving the system of equations exactly.

What is truly amazing is that various combinations of policy iteration and value iteration all converge to the solution of Bellman's Equation.
Recall value iteration:

J_{k+1}(i) = \min_{u \in U(i)} \Big[ g(i,u) + \sum_j p_{ij}(u)\, J_k(j) \Big], \qquad i = 1, \ldots, n

In practice, you would implement it as follows, with a temporary copy \hat{J}:

\hat{J}(i) \leftarrow \min_{u \in U(i)} \Big[ g(i,u) + \sum_j p_{ij}(u)\, J(j) \Big], \qquad i = 1, \ldots, n

J(i) \leftarrow \hat{J}(i), \qquad i = 1, \ldots, n
You don't have to do this! You can also update in place:

J(i) \leftarrow \min_{u \in U(i)} \Big[ g(i,u) + \sum_j p_{ij}(u)\, J(j) \Big], \qquad i = 1, \ldots, n

This is a Gauss-Seidel update: a generic technique for solving iterative equations.

It gets even better: Asynchronous Policy Iteration allows

- any number of value updates in between policy updates,
- any number of states updated at each value update,
- any number of states updated at each policy update.

Under some mild assumptions, all converge to J^*.
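A sketch of the in-place (Gauss-Seidel) variant on a toy problem (invented data): the update overwrites J(i) immediately instead of keeping a temporary copy, and still converges to the same fixed point:

```python
# In-place (Gauss-Seidel) value update on a toy stochastic shortest
# path problem (hypothetical data, for illustration only).
transitions = {
    1: {"fast": (2.0, {"t": 1.0}),
        "slow": (0.5, {"t": 0.5, 1: 0.5})},
    2: {"go":   (1.0, {1: 1.0})},
}

def backup(i, J):
    return min(g + sum(p * J[j] for j, p in probs.items())
               for g, probs in transitions[i].values())

# Gauss-Seidel: overwrite J(i) immediately, no temporary copy.
J = {1: 0.0, 2: 0.0, "t": 0.0}
for _ in range(100):
    for i in (1, 2):
        J[i] = backup(i, J)

print(J[1], J[2])
```

Sweeping the states in place typically propagates information faster than the two-array update, since later states in the sweep already see the freshly updated values.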
2.14 Discounted Problems

J_\pi(i) = \lim_{N \to \infty} E\Big[ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k)) \,\Big|\, x_0 = i \Big], \qquad \alpha < 1, \ i \in \{1, \ldots, n\}

No explicit termination state is required. No assumption on the transition probabilities is required.

Bellman's Equation for this problem:

J^*(i) = \min_{u \in U(i)} \Big[ g(i,u) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, J^*(j) \Big], \qquad \forall i

How do we show this?

Define an associated problem with states \{1, 2, \ldots, n, t\}. From state i \neq t, when u is applied we incur cost g(i,u); the next state is j with probability \alpha\, p_{ij}(u), and t with probability 1 - \alpha (since \sum_j p_{ij}(u) = 1).

It is clear that since \alpha < 1, we have a non-zero probability of making it to state t, therefore our assumption on reaching the termination state is satisfied. Suppose we use the same policy in the discounted problem as in the auxiliary problem. Note that

\Pr\big(x_{k+1} = j \mid x_k = i, \ x_{k+1} \neq t, \ u\big) = \frac{\alpha\, p_{ij}(u)}{\alpha(p_{i1} + p_{i2} + \ldots + p_{in})} = \frac{\alpha\, p_{ij}(u)}{\alpha} = p_{ij}(u)
Wednesday 17th November, 2010 Discounted Problems
So as long as we have not reached the termination state, the state evolution is governed by the same probabilities. The expected cost of the kth stage of the associated problem is g(x_k, \mu_k(x_k)) times the probability that t has not been reached by stage k, which is \alpha^k; therefore we have \alpha^k g(x_k, \mu_k(x_k)).
Connections: for a given policy, we have

\bar{P} = \begin{pmatrix} \alpha P & (1-\alpha)\mathbf{1} \\ 0 & 1 \end{pmatrix}

where \mathbf{1} is the column of ones, and it is clear that (\alpha P)^N = \alpha^N P^N \to 0.
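Value iteration carries over to the discounted case with the extra factor of \alpha. A minimal sketch on an invented two-state example with \alpha = 0.9:

```python
# Value iteration for a toy discounted problem (hypothetical data):
# state 1 can "stay" (cost 1) or "move" to state 2 (cost 3);
# state 2 is free to stay in.  Discount factor alpha = 0.9.
alpha = 0.9
transitions = {
    1: {"stay": (1.0, {1: 1.0}), "move": (3.0, {2: 1.0})},
    2: {"stay": (0.0, {2: 1.0})},
}

J = {1: 0.0, 2: 0.0}
for _ in range(500):
    # J_{k+1}(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J_k(j) ]
    J = {i: min(g + alpha * sum(p * J[j] for j, p in probs.items())
                for g, probs in actions.values())
         for i, actions in transitions.items()}

# Bellman: J(1) = min(1 + 0.9 J(1), 3 + 0.9 J(2)) = min(10, 3) = 3
print(J[1], J[2])
```

Staying forever in state 1 would cost 1/(1 - 0.9) = 10, so paying 3 once to move is optimal, which the iteration finds.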
Chapter 3

Continuous Time Optimal Control
Consider the following system:

\dot{x}(t) = f(x(t), u(t)), \qquad 0 \leq t \leq T, \quad x(0) = x_0 \quad \text{(no noise!)}

- State x(t) \in \mathbb{R}^n
- Time t \in \mathbb{R}; T is the terminal time
- Control u(t) \in U \subset \mathbb{R}^m, with U the control constraint set

Assume:

- f is continuously differentiable with respect to x (a less stringent requirement: Lipschitz),
- f is continuous with respect to u,
- u(t) is piecewise continuous.

See Appendix A in [1] for details.

Assume existence and uniqueness of solutions.
Example 8:

\dot{x}(t) = x(t)^{1/3}, \qquad x(0) = 0

Solutions: x(t) = 0 \ \forall t, and x(t) = \big(\tfrac{2}{3} t\big)^{3/2}. Not unique!
Example 9:

\dot{x}(t) = x(t)^2, \qquad x(0) = 1

Solution: x(t) = \frac{1}{1-t}, finite escape time: x(1) = \infty. The solution does not exist on an interval that includes 1, e.g. [0, 2].

Objective Minimize

h(x(T)) + \int_0^T g(x(t), u(t))\, dt

where g and h are continuously differentiable with respect to x, and g is continuous with respect to u.

This is very similar to the discrete time problem (sums become integrals, x_{k+1} becomes \dot{x}) except for the technical assumptions.
3.1 The Hamilton Jacobi Bellman (HJB) Equation

The continuous time analog of the DP algorithm. We derive it informally by discretizing the problem and taking limits. Not a rigorous derivation, but it does capture the main ideas.

Divide the time horizon into N pieces, and define \delta = T/N,

x_k := x(k\delta), \quad u_k := u(k\delta), \qquad k = 0, 1, \ldots, N

Approximate the differential equation by

\frac{x_{k+1} - x_k}{\delta} = f(x_k, u_k), \qquad x_{k+1} = x_k + f(x_k, u_k)\, \delta

Approximate the cost function:

h(x_N) + \sum_{k=0}^{N-1} g(x_k, u_k)\, \delta

Define J^*(t, x) = the optimal cost to go at time t and state x for the continuous problem, and \tilde{J}^*(t, x) = the discrete approximation of the optimal cost to go. Apply the DP algorithm:

terminal condition: \tilde{J}^*(N\delta, x) = h(x)

recursion: \tilde{J}^*(k\delta, x) = \min_{u \in U} \Big[ g(x, u)\,\delta + \tilde{J}^*\big((k+1)\delta, \ x + f(x,u)\,\delta\big) \Big], \qquad k = 0, \ldots, N-1
Do a Taylor Expansion of \tilde{J}^*, since \delta \to 0:

\tilde{J}^*\big((k+1)\delta, \ x + f(x,u)\delta\big) = \tilde{J}^*(k\delta, x) + \frac{\partial \tilde{J}^*(k\delta, x)}{\partial t}\,\delta + \Big(\frac{\partial \tilde{J}^*(k\delta, x)}{\partial x}\Big)^T f(x,u)\,\delta + o(\delta), \qquad \lim_{\delta \to 0} \frac{o(\delta)}{\delta} = 0

(little-oh notation: o(\delta) collects the terms quadratic or higher in \delta).

Substitute back into the DP recursion and divide by \delta:

0 = \min_{u \in U} \Big[ g(x,u) + \frac{\partial \tilde{J}^*(k\delta, x)}{\partial t} + \Big(\frac{\partial \tilde{J}^*(k\delta, x)}{\partial x}\Big)^T f(x,u) + \frac{o(\delta)}{\delta} \Big]

Now let t = k\delta, and let \delta \to 0. Assuming \tilde{J}^* \to J^*, we have

0 = \min_{u \in U} \Big[ g(x,u) + \frac{\partial J^*(t,x)}{\partial t} + \Big(\frac{\partial J^*(t,x)}{\partial x}\Big)^T f(x,u) \Big], \qquad \forall x, t \qquad (3.1)

J^*(T, x) = h(x)

The HJB equation (3.1):

- is a Partial Differential Equation, very difficult to solve;
- the u = \mu(t,x) that minimizes the R.H.S. of HJB is an optimal policy.
Example 10: Consider the system

\dot{x}(t) = u(t), \qquad |u(t)| \leq 1

The cost is \frac{1}{2} x^2(T), only a terminal cost.

Intuitive solution:

u(t) = \mu(t,x) = -\mathrm{sgn}(x) = \begin{cases} -1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x < 0 \end{cases}

What is the cost to go associated with this policy?

V(t,x) = \frac{1}{2}\big(\max\{0, \ |x| - (T-t)\}\big)^2

Verify that this is indeed the cost to go associated with the policy outlined above.
For fixed t:
[Figure 3.1: V(t,x) as a function of x for fixed t; zero for |x| \leq T - t.]

[Figure 3.2: the first derivative \partial V(t,x)/\partial x as a function of x for fixed t.]
[Figure 3.3: \partial V(t,x)/\partial t as a function of t for fixed x; two cases, 1: |x| \leq T and 2: |x| > T.]

\frac{\partial V}{\partial x}(t,x) = \mathrm{sgn}(x)\, \max\{0, \ |x| - (T-t)\}
For fixed x, see Figure 3.3.

Does V(t,x) satisfy HJB?

First check: does it satisfy the boundary condition? V(T,x) = \frac{1}{2} x^2 = h(x). Yes.

Second check:

\min_{|u| \leq 1} \Big[ \frac{\partial V(t,x)}{\partial t} + \frac{\partial V(t,x)}{\partial x}\, u \Big] = \min_{|u| \leq 1} \big(1 + \mathrm{sgn}(x)\, u\big) \max\{0, \ |x| - (T-t)\} = 0

by choosing u = -\mathrm{sgn}(x).

So V(t,x) satisfies the HJB equation, and V(t,x) = J^*(t,x). Furthermore u = -\mathrm{sgn}(x) is an optimal solution. Not unique!

Note: Verifying that V(t,x) satisfies HJB is not trivial, even for this simple example. Imagine solving for it!

Another issue: the cost \frac{1}{2}x^2(T) will give the same optimal policy as the cost |x(T)|. So different costs give the same optimal policy, but some costs are nicer to work with than others.
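The verification can also be done by simulation: integrate \dot{x} = -\mathrm{sgn}(x) with forward Euler from a few starting points and compare the realized terminal cost \frac{1}{2}x^2(T) with V(t,x). A rough sketch (the step size and test points are arbitrary choices):

```python
def V(t, x, T):
    """Claimed cost to go for the policy u = -sgn(x)."""
    return 0.5 * max(0.0, abs(x) - (T - t)) ** 2

def simulated_cost(t0, x0, T, dt=1e-3):
    """Euler-simulate u = -sgn(x) from (t0, x0); return the cost x(T)^2 / 2."""
    x, t = x0, t0
    while t < T:
        u = -1.0 if x > 0 else (1.0 if x < 0 else 0.0)
        x += u * dt
        t += dt
    return 0.5 * x * x

T = 2.0
for t0, x0 in [(0.0, 3.0), (0.0, 1.0), (1.0, -2.0), (0.5, 0.25)]:
    assert abs(simulated_cost(t0, x0, T) - V(t0, x0, T)) < 1e-2
print("cost to go matches the simulation")
```

When |x| \leq T - t the state reaches zero before the horizon and the cost is zero; otherwise the residual |x| - (T - t) is what gets penalized, exactly as V encodes.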
3.2 Aside on Notation

Let F(t,x) be a continuously differentiable function. Then:

1. \frac{\partial F(t,x)}{\partial t}: partial derivative of F with respect to the first argument.

2. \frac{\partial F(t, x(t))}{\partial t} = \frac{\partial F(t,x)}{\partial t}\Big|_{x = x(t)}: shorthand notation.

3. \frac{d F(t, x(t))}{dt} = \frac{\partial F(t,x(t))}{\partial t} + \frac{\partial F(t,x(t))}{\partial x}\, \dot{x}(t): total derivative.

Example 11: For F(t,x) = t\,x:

\frac{\partial F(t,x)}{\partial t} = x, \qquad \frac{\partial F(t,x(t))}{\partial t} = x(t), \qquad \frac{d F(t,x(t))}{dt} = x(t) + t\, \dot{x}(t)

Lemma 3.2.1: Let F(t,x,u) be a continuously differentiable function and let U be a convex set. Assume that \mu^*(t,x) := \arg\min_{u \in U} F(t,x,u) is continuously differentiable.
Wednesday 24th November, 2010 Aside on Notation
Then:

1) \frac{\partial}{\partial t} \min_{u \in U} F(t,x,u) = \frac{\partial F(t,x,\mu^*(t,x))}{\partial t}, \qquad \forall t, x

2) \frac{\partial}{\partial x} \min_{u \in U} F(t,x,u) = \frac{\partial F(t,x,\mu^*(t,x))}{\partial x}, \qquad \forall t, x

Example 12: Let F(t,x,u) = (1+t)u^2 + ux + 1, with t \geq 0 and U the real line (no constraint on u). Then

\min_u F(t,x,u): \quad 2(1+t)u + x = 0, \quad u = -\frac{x}{2(1+t)}, \quad \mu^*(t,x) = -\frac{x}{2(1+t)},

\min_u F(t,x,u) = \frac{(1+t)x^2}{4(1+t)^2} - \frac{x^2}{2(1+t)} + 1 = -\frac{x^2}{4(1+t)} + 1

1) \frac{\partial}{\partial t} \min_u F(t,x,u) = \frac{x^2}{4(1+t)^2}, \qquad \frac{\partial F(t,x,\mu^*(t,x))}{\partial t} = u^2\Big|_{u=\mu^*(t,x)} = \frac{x^2}{4(1+t)^2}

2) \frac{\partial}{\partial x} \min_u F(t,x,u) = -\frac{x}{2(1+t)}, \qquad \frac{\partial F(t,x,\mu^*(t,x))}{\partial x} = u\Big|_{u=\mu^*(t,x)} = -\frac{x}{2(1+t)}
Proof 2 (Proof of Lemma 3.2.1 when u is unconstrained, U = \mathbb{R}^m): Let

G(t,x) = \min_{u \in U} F(t,x,u) = F(t, x, \mu^*(t,x))

Then

\frac{\partial G(t,x)}{\partial t} = \frac{\partial F(t,x,\mu^*(t,x))}{\partial t} + \underbrace{\frac{\partial F(t,x,\mu^*(t,x))}{\partial u}}_{=0 \ \text{because} \ \mu^*(t,x) \ \text{minimizes}} \cdot \frac{\partial \mu^*(t,x)}{\partial t}

The same argument works for \frac{\partial G(t,x)}{\partial x}.
3.2.1 The Minimum Principle

HJB gives us a lot of information: the optimal cost to go for all times and all possible states; it also gives the optimal feedback law u = \mu^*(t,x). What if we only cared about the optimal control trajectory for a specific initial condition x(0) = x_0? Can we exploit the fact that we are asking for much less to simplify the mathematical conditions?

Starting Point: HJB

0 = \min_{u \in U} \Big[ g(x,u) + \frac{\partial J^*(t,x)}{\partial t} + \Big(\frac{\partial J^*(t,x)}{\partial x}\Big)^T f(x,u) \Big], \quad \forall t, x; \qquad J^*(T,x) = h(x), \quad \forall x

Let \mu^*(t,x) be the corresponding optimal strategy (feedback law), and let

F(t,x,u) = g(x,u) + \frac{\partial J^*(t,x)}{\partial t} + \Big(\frac{\partial J^*(t,x)}{\partial x}\Big)^T f(x,u)

So the HJB equation gives us

G(t,x) = \min_{u \in U} F(t,x,u) = 0
Apply the Lemma:

1) \frac{\partial G(t,x)}{\partial t} = 0 = \frac{\partial^2 J^*(t,x)}{\partial t^2} + \Big(\frac{\partial^2 J^*(t,x)}{\partial x\, \partial t}\Big)^T f(x, \mu^*(t,x)), \qquad \forall t, x

2) \frac{\partial G(t,x)}{\partial x} = 0 = \frac{\partial g(x, \mu^*(t,x))}{\partial x} + \frac{\partial^2 J^*(t,x)}{\partial x\, \partial t} + \frac{\partial^2 J^*(t,x)}{\partial x^2}\, f(x, \mu^*(t,x)) + \Big(\frac{\partial f(x, \mu^*(t,x))}{\partial x}\Big)^T \frac{\partial J^*(t,x)}{\partial x}, \qquad \forall t, x

Consider a specific optimal trajectory:

u^*(t) = \mu^*(t, x^*(t)), \qquad \dot{x}^*(t) = f(x^*(t), u^*(t)), \qquad x^*(0) = x_0

Along it, the two equations above become total derivatives:

1) \quad 0 = \frac{d}{dt}\Big[\frac{\partial J^*(t, x^*(t))}{\partial t}\Big]

2) \quad 0 = \frac{\partial g(x^*(t), u^*(t))}{\partial x} + \frac{d}{dt}\Big[\frac{\partial J^*(t, x^*(t))}{\partial x}\Big] + \Big(\frac{\partial f(x^*(t), u^*(t))}{\partial x}\Big)^T \frac{\partial J^*(t, x^*(t))}{\partial x}
Define the co-states

p(t) = \frac{\partial J^*(t, x^*(t))}{\partial x}, \qquad p_0(t) = \frac{\partial J^*(t, x^*(t))}{\partial t}

1) \dot{p}_0(t) = 0 \ \Rightarrow \ p_0(t) = \text{constant for } 0 \leq t \leq T

2) \dot{p}(t) = -\Big(\frac{\partial f(x^*(t), u^*(t))}{\partial x}\Big)^T p(t) - \frac{\partial g(x^*(t), u^*(t))}{\partial x}, \qquad 0 \leq t \leq T

\frac{\partial J^*(T,x)}{\partial x} = \frac{\partial h(x)}{\partial x} \ \Rightarrow \ p(T) = \frac{\partial h(x^*(T))}{\partial x}

Put all of this together. Define the Hamiltonian

H(x, u, p) = g(x,u) + p^T f(x,u)

Let u^*(t) be an optimal control trajectory and x^*(t) the resulting state trajectory. Then

\dot{x}^*(t) = \frac{\partial H}{\partial p}\big(x^*(t), u^*(t), p(t)\big), \qquad x^*(0) = x_0

\dot{p}(t) = -\frac{\partial H}{\partial x}\big(x^*(t), u^*(t), p(t)\big), \qquad p(T) = \frac{\partial h(x^*(T))}{\partial x}

u^*(t) = \arg\min_{u \in U} H\big(x^*(t), u, p(t)\big)

H\big(x^*(t), u^*(t), p(t)\big) = \text{constant}, \qquad \forall t \in [0, T]

(H(\cdot) = constant comes from p_0(t) = constant.)

Some remarks:

- This is a set of 2n ODEs with split boundary conditions. Not trivial to solve.
- These are necessary conditions, but not sufficient. There can be multiple solutions, and not all of them may be optimal.
- If f(x,u) is linear, U is convex, and h and g are convex, then the conditions are necessary and sufficient.
Example 13 (Resource Allocation): Some robots are sent to Mars to build habitats for later exploration by humans.

x(t): number of reconfigurable robots, which can build either habitats or themselves. x(0) is given: the number of robots that arrive on Mars. y(t): number of habitats built.

\dot{x}(t) = u(t)\, x(t), \qquad x(0) = x_0

\dot{y}(t) = (1 - u(t))\, x(t), \qquad y(0) = 0

0 \leq u(t) \leq 1
Objective: given the terminal time T, find the control input u(t) that maximizes y(T), the number of habitats built. Note:

y(T) = \int_0^T (1 - u(t))\, x(t)\, dt

Solution

g(x,u) = (1-u)x, \qquad f(x,u) = ux

H(x,u,p) = (1-u)x + p\,u\,x

\dot{p}(t) = -\frac{\partial H(x^*(t), u^*(t), p(t))}{\partial x} = -\big(1 - u^*(t)\big) - p(t)\, u^*(t), \qquad p(T) = 0 \quad (h(x) \equiv 0)

u^*(t) = \arg\max_{0 \leq u \leq 1} \big[ x^*(t) + \big(p(t)\, x^*(t) - x^*(t)\big)\, u \big]

Since x^*(t) > 0, we get

u^* = 0 \ \text{if} \ p(t) < 1, \qquad u^* = 1 \ \text{if} \ p(t) > 1

Since p(T) = 0, for t close to T we will have u^*(t) = 0 and therefore \dot{p}(t) = -1, so p(t) = T - t.

Therefore at time t = T - 1, p(t) = 1, and that is where the switch occurs:

\dot{p}(t) = -p(t), \quad 0 \leq t \leq T-1, \quad p(T-1) = 1 \quad \Rightarrow \quad p(t) = e^{(T-1)-t}, \quad 0 \leq t \leq T-1

Conclusion:

u^*(t) = \begin{cases} 1 & 0 \leq t \leq T-1 \\ 0 & T-1 \leq t \leq T \end{cases}
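The switching solution can be sanity-checked by simulation. A sketch (T, x_0, and the step size are arbitrary choices) comparing the minimum-principle switch time T - 1 against a few alternatives:

```python
import math

def habitats(switch, T=3.0, x0=1.0, dt=1e-4):
    """Euler-simulate xdot = u x, ydot = (1 - u) x with u = 1 before
    `switch` and u = 0 after; return y(T), the habitats built."""
    x, y, t = x0, 0.0, 0.0
    while t < T:
        u = 1.0 if t < switch else 0.0
        x, y = x + u * x * dt, y + (1.0 - u) * x * dt
        t += dt
    return y

T = 3.0
best = habitats(switch=T - 1.0)   # the minimum-principle solution
print(best)                       # close to x0 * exp(T - 1)
for other in [0.0, 1.0, T / 2, T]:
    assert habitats(switch=other) <= best + 1e-6
```

Growing the robot population until T - 1 and then building for the final unit of time yields y(T) = x_0 e^{T-1}, and no other constant switch time does better.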
How to use this in practice:

1. If you can solve HJB, you get a feedback law u = \mu^*(t,x). Very convenient, just a controller: measure the state and apply the control input.

2. Solve for the optimal trajectory and use a feedback law (probably linear) to keep you on that trajectory.

3. Solve for the optimal trajectory online after measuring the state. Do this often.
Wednesday 1st December, 2010 Extensions
[Figure 3.4: Different approaches to find a solution. Diagram relating HJB and the Minimum Principle to optimal and non-optimal solutions; labels include "not too difficult", "easy to show (in book)", "viscosity solutions", "hard to show rigorously (calculus of variations)", and "local minima".]
3.3 Extensions

(We drop the x^*(t), J^* notation for simplicity.)

3.3.1 Fixed Terminal State

Consider the case where x(T) is given. Clearly there is no need for a terminal cost.

Recall the co-state p(t) = \frac{\partial J(t, x(t))}{\partial x}, so p(T) = \lim_{t \to T} \frac{\partial J(t, x(t))}{\partial x}; but we can't use h, the terminal cost, to constrain p(T). We don't need constraints on p: the boundary conditions on the state already provide 2n of them.

\dot{x}(t) = f(x(t), u(t)), \qquad x(0) = x_0, \ x(T) = x_T

\dot{p}(t) = -\frac{\partial H(x(t), u(t), p(t))}{\partial x}

2n ODEs, 2n boundary conditions.
Example 14:

\dot{x}(t) = u(t), \qquad x(0) = 0, \ x(1) = 1

g(x,u) = \frac{1}{2}(x^2 + u^2), \qquad \text{cost} = \frac{1}{2}\int_0^1 \big(x^2(t) + u^2(t)\big)\, dt
Hamiltonian: H(x,u,p) = \frac{1}{2}(x^2 + u^2) + p\,u

We get

\dot{x}(t) = u(t), \qquad \dot{p}(t) = -x(t), \qquad u(t) = \arg\min_u \Big[\frac{1}{2}\big(x^2(t) + u^2(t)\big) + p(t)\,u\Big]

therefore u(t) = -p(t), so \dot{x}(t) = -p(t), \ \dot{p}(t) = -x(t), \ \ddot{x}(t) = x(t)

x(t) = A\cosh(t) + B\sinh(t)

x(0) = 0 \Rightarrow A = 0, \qquad x(1) = 1 \Rightarrow B = \frac{1}{\sinh(1)}

x(t) = \frac{\sinh(t)}{\sinh(1)} = \frac{e^t - e^{-t}}{e^1 - e^{-1}}

Exercise: show that the Hamiltonian is constant along this trajectory.
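For the exercise, a quick numerical check: along x(t) = \sinh(t)/\sinh(1) we have u(t) = \dot{x}(t) = \cosh(t)/\sinh(1) and p(t) = -u(t), so H can be evaluated directly:

```python
import math

s1 = math.sinh(1.0)

def H(t):
    x = math.sinh(t) / s1   # optimal state trajectory
    u = math.cosh(t) / s1   # u = xdot
    p = -u                  # u = -p along the optimal trajectory
    return 0.5 * (x * x + u * u) + p * u

values = [H(t) for t in [0.0, 0.25, 0.5, 0.75, 1.0]]
print(values[0])            # = -1 / (2 sinh(1)^2), the same at every t
assert all(abs(v - values[0]) < 1e-12 for v in values)
```

Analytically H = \frac{1}{2}(x^2 - u^2) = (\sinh^2 t - \cosh^2 t)/(2\sinh^2 1) = -1/(2\sinh^2 1), constant as claimed.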
3.3.2 Free Initial State, with Cost

x(0) is not fixed, but there is an initial cost l(x(0)). One can show that the resulting condition is

p(0) = -\frac{\partial l(x(0))}{\partial x}

Example 15:

\dot{x}(t) = u(t), \qquad x(1) = 1, \ x(0) \ \text{free}

g(x,u) = \frac{1}{2}(x^2 + u^2), \qquad l(x) = 0 \ \text{(no initial cost, given)}

Apply the Minimum Principle, as before:

\dot{x}(t) = u(t), \qquad \ddot{x}(t) = x(t), \qquad \dot{p}(t) = -x(t), \qquad u(t) = -p(t), \qquad x(t) = A\cosh(t) + B\sinh(t)

\dot{x}(0) = u(0) = -p(0) = 0 \ \Rightarrow \ B = 0

x(t) = \frac{\cosh(t)}{\cosh(1)} = \frac{e^t + e^{-t}}{e^1 + e^{-1}}, \qquad x(0) \approx 0.65
Free Terminal Time Result: the Hamiltonian = 0 on the optimal trajectory. We gain an extra degree of freedom in choosing T; we lose a degree of freedom because H \equiv 0.

Time Varying System and Cost What happens if f = f(x,u,t), g = g(x,u,t)? Result: everything stays the same, except that the Hamiltonian is no longer constant along the trajectory. Hint: treat time as an extra state with \dot{t} = 1, and \dot{x} = f(x, u, t).
Singular Problems Motivate via an example, a tracking problem:

z(t) = 1 - t^2, \qquad 0 \leq t \leq 1

Minimize \frac{1}{2}\int_0^1 \big(x(t) - z(t)\big)^2\, dt subject to |\dot{x}(t)| \leq 1.

Apply the Minimum Principle:

\dot{x}(t) = u(t), \qquad |u(t)| \leq 1, \qquad x(0), x(1) \ \text{are free}

g(x,u,t) = \frac{1}{2}\big(x - z(t)\big)^2

H(x,u,p,t) = \frac{1}{2}\big(x - z(t)\big)^2 + p\,u

Co-state equation:

\dot{p}(t) = -\big(x(t) - z(t)\big), \qquad p(0) = 0, \ p(1) = 0

Optimal u:

u(t) = \arg\min_{|u| \leq 1} H\big(x(t), u, p(t), t\big)

u(t) = \begin{cases} -1 & \text{if } p(t) > 0 \\ 1 & \text{if } p(t) < 0 \\ ? & \text{if } p(t) = 0 \end{cases}

A problem is singular if the Hamiltonian is not a function of u over a non-trivial time interval. Try the following:

p(t) = 0 \ \text{for} \ 0 \leq t \leq \bar{T}, \qquad \bar{T} \ \text{to be determined}

Then

\dot{p}(t) = 0, \ 0 \leq t \leq \bar{T} \ \Rightarrow \ x(t) = z(t), \ 0 \leq t \leq \bar{T} \ \Rightarrow \ u(t) = \dot{z}(t) = -2t
One guess: pick \bar{T} = \frac{1}{2} (where |\dot{z}(t)| reaches 1).

This can't be the solution: for t > \frac{1}{2},

x(t) - z(t) > 0 \ \Rightarrow \ \dot{p} < 0 \ \Rightarrow \ p(1) < 0

and we can't satisfy the boundary condition p(1) = 0.

Explore this instead: switch before \bar{T} = \frac{1}{2}:

x(t) = z(t), \qquad 0 \leq t \leq \bar{T} < \tfrac{1}{2}

\dot{x}(t) = -1, \quad \bar{T} < t \leq 1 \ \Rightarrow \ x(t) = z(\bar{T}) - (t - \bar{T}) = 1 - \bar{T}^2 - t + \bar{T}

\dot{p}(t) = -\big(x(t) - z(t)\big) = -\big(1 - \bar{T}^2 - t + \bar{T} - 1 + t^2\big) = \bar{T}^2 - \bar{T} - t^2 + t, \qquad \bar{T} < t \leq 1

p(1) = \int_{\bar{T}}^{1} \big(\bar{T}^2 - \bar{T} - t^2 + t\big)\, dt = \bar{T}^2 - \bar{T} - \frac{1}{3} + \frac{1}{2} - \bar{T}^3 + \bar{T}^2 + \frac{\bar{T}^3}{3} - \frac{\bar{T}^2}{2} = 0

This simplifies to (multiply by 6):

0 = -4\bar{T}^3 + 9\bar{T}^2 - 6\bar{T} + 1 = -(\bar{T} - 1)(\bar{T} - 1)(4\bar{T} - 1) \ \Rightarrow \ \bar{T} = 1, \ \bar{T} = \frac{1}{4}

\bar{T} = \frac{1}{4} satisfies all the constraints and we are done!

One can easily verify that p(t) > 0 for \frac{1}{4} < t < 1, giving u(t) = -1 as required.
3.4 Linear Systems and Quadratic Costs

Look at the infinite horizon, LTI (linear time invariant) system:

x_{k+1} = A x_k + B u_k, \qquad k = 0, 1, \ldots

\text{cost} = \sum_{k=0}^{\infty} \big( x_k^T Q x_k + u_k^T R u_k \big), \qquad R > 0, \ Q \geq 0, \ R = R^T, \ Q = Q^T

Informally, the cost to go is time invariant: it only depends on the state and not on when we get there.

J(x) = \min_u \big[ x^T Q x + u^T R u + J(Ax + Bu) \big]
Wednesday 8th December, 2010 Linear Systems and Quadratic Costs
Conjecture that the optimal cost to go is quadratic in x: J(x) = x^T K x, where K = K^T, K \geq 0. Then

x^T K x = x^T Q x + x^T A^T K A x + \min_u \big[ u^T R u + u^T B^T K B u + x^T A^T K B u + u^T B^T K A x \big]

Since R > 0 and B^T K B \geq 0, we have R + B^T K B > 0. Setting the derivative with respect to u to zero:

2\big(R + B^T K B\big) u + 2 B^T K A x = 0 \quad \Rightarrow \quad u = -\big(R + B^T K B\big)^{-1} B^T K A x

Substitute back in: all terms are of the form x^T(\cdot)x. Therefore we must have

K = Q + A^T K A + A^T K B \big(R + B^T K B\big)^{-1}\big(R + B^T K B\big)\big(R + B^T K B\big)^{-1} B^T K A - 2 A^T K B \big(R + B^T K B\big)^{-1} B^T K A

K = A^T \Big( K - K B \big(R + B^T K B\big)^{-1} B^T K \Big) A + Q, \qquad K \geq 0
Summary

Optimal cost to go: J(x) = x^T K x

Optimal feedback strategy: u = Fx, \quad F = -\big(R + B^T K B\big)^{-1} B^T K A

Questions:

1. Can we always solve for K?
2. Is the closed loop system x_{k+1} = (A + BF) x_k stable?
Example 16:

x_{k+1} = 2 x_k + 0 \cdot u_k, \qquad \text{cost} = \sum_{k=0}^{\infty} x_k^2 + u_k^2

A = 2, \ B = 0, \ Q = 1, \ R = 1

Solve for K:

K = 4(K - 0) + 1 \quad \Rightarrow \quad -3K = 1 \quad \Rightarrow \quad K = -\frac{1}{3}
K does not satisfy the K \geq 0 constraint. There is no solution to this problem: the cost is infinite. The problem with this example is that (A, B) is not stabilizable.

Stabilizable: one can find a matrix F such that A + BF is stable, i.e. \rho(A + BF) < 1 (the eigenvalues of A + BF have magnitude < 1).

Example 17:

x_{k+1} = 0.5 x_k + 0 \cdot u_k, \qquad \text{cost} = \sum_{k=0}^{\infty} x_k^2 + u_k^2

A = 0.5, \ B = 0, \ Q = 1, \ R = 1

Solve for K:

K = 0.25 K + 1 \quad \Rightarrow \quad K = \frac{4}{3}

Cost to go = \frac{4}{3} x_k^2, \quad F = 0.
Example 18:

x_{k+1} = 2 x_k + u_k, \qquad \text{cost} = \sum_{k=0}^{\infty} x_k^2 + u_k^2

A = 2, \ B = 1, \ Q = 1, \ R = 1

(A, B) is stabilizable. Solve for K:

K = 4\Big(K - \frac{K^2}{1+K}\Big) + 1 \quad \Rightarrow \quad 0 = K^2 - 4K - 1 \quad \Rightarrow \quad K = \frac{4 \pm \sqrt{20}}{2} = 2 \pm \sqrt{5}

Pick K = 2 + \sqrt{5} \approx 4.236 and solve for F:

F = -(1 + K)^{-1} \cdot 2K = -\frac{2K}{1+K} \approx -1.618

A + BF = 2 - 1.618 = 0.382 is stable. Our optimizing strategy stabilizes the system, as expected.
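Example 18 can be reproduced by iterating the Riccati recursion backwards until it converges, a scalar sketch:

```python
import math

A, B, Q, R = 2.0, 1.0, 1.0, 1.0

# Iterate K <- Q + A (K - K B (R + B K B)^{-1} B K) A until convergence.
K = 0.0
for _ in range(200):
    K = Q + A * (K - K * B / (R + B * K * B) * B * K) * A

F = -(R + B * K * B) ** -1 * B * K * A
print(K, F, A + B * F)   # K = 2 + sqrt(5), F about -1.618, |A + BF| < 1
```

The iteration converges quickly here because the closed-loop pole 0.382 is well inside the unit circle.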
Example 19:

x_{k+1} = 2 x_k + u_k, \qquad \text{cost} = \sum_{k=0}^{\infty} u_k^2

A = 2, \ B = 1, \ Q = 0, \ R = 1

Solve for K:

K = 4\Big(K - \frac{K^2}{1+K}\Big) \quad \Rightarrow \quad K \in \{0, 3\}

K = 0 is clearly the optimal thing to do, but it leads to an unstable system. K = 3, however, while not optimal, leads to F = -(1+3)^{-1} \cdot 3 \cdot 2 = -1.5, \ A + BF = 0.5, stable.

Modify the cost to

\sum_{k=0}^{\infty} \big(u_k^2 + \epsilon\, x_k^2\big), \qquad \epsilon > 0, \ \epsilon \ll 1

A = 2, \ B = 1, \ Q = \epsilon, \ R = 1

Solve for K:

K(\epsilon) = \Big\{ 3 + \frac{4\epsilon}{3}, \ -\frac{\epsilon}{3} \Big\} \quad \text{(to first order in } \epsilon\text{)}
The K = 3 solution is the limiting case as we put an arbitrarily small cost on the state, which would otherwise diverge.

Example 20:

x_{k+1} = 0.5 x_k + u_k, \qquad \text{cost} = \sum_{k=0}^{\infty} u_k^2

A = 0.5, \ B = 1, \ Q = 0, \ R = 1

Solve for K:

K \in \{0, -0.75\}

Here K = 0 makes perfect sense: it gives the optimal strategy u = 0 and a stable closed loop system.
If Q = \epsilon:

K(\epsilon) \in \Big\{ \frac{4\epsilon}{3}, \ -0.75 - \frac{\epsilon}{3} \Big\} \quad \text{(to first order in } \epsilon\text{)}

K = 0 is the well behaved solution in the limit \epsilon \to 0.
Need the Concept of Detectability Let Q be decomposed as Q = C^T C (we can always do this). We need a detectability assumption: (A, C) is detectable if \exists L such that A + LC is stable. Detectability ensures C x_k \to 0 \Rightarrow x_k \to 0; note that x_k^T Q x_k \to 0 \Leftrightarrow C x_k \to 0.
3.4.1 Summary

Given

x_{k+1} = A x_k + B u_k, \qquad k = 0, 1, \ldots

\text{cost} = \sum_{k=0}^{\infty} x_k^T Q x_k + u_k^T R u_k, \qquad Q \geq 0, \ R > 0

with (A, B) stabilizable and (A, C) detectable, where C is any matrix that satisfies C^T C = Q. Then:

1. There is a unique solution to the D.A.R.E. (Discrete Algebraic Riccati Equation).
2. The optimal cost to go is J(x) = x^T K x.
3. The optimal feedback strategy is u = Fx.
4. The closed loop system is stable.
3.5 General Problem Formulation

Finite horizon, time varying, with disturbances:

x_{k+1} = A_k x_k + B_k u_k + w_k, \qquad k = 0, \ldots, N-1

E(w_k) = 0, \qquad E(w_k w_k^T) \ \text{finite}

Cost:

E\Big[ x_N^T Q_N x_N + \sum_{k=0}^{N-1} \big( x_k^T Q_k x_k + u_k^T R_k u_k \big) \Big]

Q_k = Q_k^T \geq 0 \ \text{(eigenvalues} \geq 0\text{)}, \qquad R_k = R_k^T > 0
Wednesday 15th December, 2010 General Problem Formulation
Apply DP to solve the problem:

J_N(x_N) = x_N^T Q_N x_N

J_k(x_k) = \min_{u_k} E\big[ x_k^T Q_k x_k + u_k^T R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) \big]

Let's do the first step of this recursion, or equivalently take N = 1:

J_0(x_0) = \min_{u_0} E\big[ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0 + w_0)^T Q_1 (A_0 x_0 + B_0 u_0 + w_0) \big]

Consider the last term:

E\big[ (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + 2(A_0 x_0 + B_0 u_0)^T Q_1 w_0 + w_0^T Q_1 w_0 \big] = (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E(w_0^T Q_1 w_0)

since E(w_0) = 0. Therefore

J_0(x_0) = \min_{u_0} \big[ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) \big] + E(w_0^T Q_1 w_0)

The strategy is the same as if there were no noise (although the noise does give a different cost). This is certainty equivalence (it works only for some problems, not all).

Solve for the minimizing u_0: differentiate and set to 0:

2 R_0 u_0 + 2 B_0^T Q_1 B_0 u_0 + 2 B_0^T Q_1 A_0 x_0 = 0

u_0 = -\big(R_0 + B_0^T Q_1 B_0\big)^{-1} B_0^T Q_1 A_0 x_0 =: F_0 x_0

The optimal feedback strategy is a linear function of the state.
Substitute back and solve for J_0(x_0):

J_0(x_0) = x_0^T Q_0 x_0 + u_0^T \big(R_0 + B_0^T Q_1 B_0\big) u_0 + x_0^T A_0^T Q_1 A_0 x_0 + 2\, x_0^T A_0^T Q_1 B_0 u_0 + E(w_0^T Q_1 w_0) = x_0^T K_0 x_0 + E(w_0^T Q_1 w_0)

K_0 = Q_0 + A_0^T \Big( K_1 - K_1 B_0 \big(R_0 + B_0^T K_1 B_0\big)^{-1} B_0^T K_1 \Big) A_0, \qquad K_1 = Q_1

The cost at k = 1 is quadratic in x_1; at k = 0 it is quadratic in x_0 plus a constant.

We can extend this to any horizon, giving the Discrete Riccati Equation (DRE):

K_k = Q_k + A_k^T \Big( K_{k+1} - K_{k+1} B_k \big(R_k + B_k^T K_{k+1} B_k\big)^{-1} B_k^T K_{k+1} \Big) A_k, \qquad K_N = Q_N

Feedback law:

u_k = F_k x_k, \qquad F_k = -\big(R_k + B_k^T K_{k+1} B_k\big)^{-1} B_k^T K_{k+1} A_k
Cost:

J_k(x_k) = x_k^T K_k x_k + \sum_{j=k}^{N-1} E\big(w_j^T K_{j+1} w_j\big)

No noise, time invariant, infinite horizon (N \to \infty): we recover the previous results and the DARE. In fact, the above iterative method is one way to solve the DARE: iterate backwards until it converges. ([1] has a proof of convergence; it is not trivial.)

Time invariant, infinite horizon with noise: the cost goes to infinity. Approach: divide the cost by N and let N \to \infty; the resulting average cost is E(w^T K w).
Example 21: Given the system

\ddot{z}(t) = u(t)

Objective Apply a force to move a mass from any starting point to z = 0, \dot{z} = 0. Implement on a computer that can only update information once per second.

1. Discretize the problem:

\dot{z}(t) = \dot{z}(0) + u(0)\, t, \qquad 0 \leq t < 1

z(t) = z(0) + \dot{z}(0)\, t + \tfrac{1}{2} u(0)\, t^2, \qquad 0 \leq t < 1

Let x_1(k) = z(k), \ x_2(k) = \dot{z}(k):

x(k+1) = A\, x(k) + B\, u(k), \qquad k = 0, 1, \ldots

A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.5 \\ 1 \end{pmatrix}

2. Cost = \sum_{k=0}^{\infty} \big( x_1^2(k) + u^2(k) \big). Therefore

Q = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad R = 1

3. Is the system stabilizable? Can we make A + BF stable for some F? Yes:

F = \begin{pmatrix} -1 & -1.5 \end{pmatrix} \ \text{makes both eigenvalues of} \ A + BF \ \text{equal to} \ 0.

4. Q can be decomposed as follows:

Q = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \end{pmatrix}
Therefore

C = \begin{pmatrix} 1 & 0 \end{pmatrix}, \qquad Q = C^T C

Is (A, C) detectable? Yes:

L = \begin{pmatrix} -2 \\ -1 \end{pmatrix} \ \text{makes both eigenvalues of} \ A + LC \ \text{equal to} \ 0.

5. Solve the DARE. Use the MATLAB command dare:

K = \begin{pmatrix} 2 & 1 \\ 1 & 1.5 \end{pmatrix}

Optimal feedback matrix:

F = \begin{pmatrix} -0.5 & -1.0 \end{pmatrix}

6. Physical interpretation: a spring and a damper, u = -0.5\, z - 1.0\, \dot{z}. The spring has coefficient 0.5, the damper has coefficient 1.0.
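The numbers in steps 5 and 6 can be reproduced without the MATLAB dare command by iterating the DRE of the previous section backwards until it converges. A small self-contained sketch with plain Python lists:

```python
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

A = [[1.0, 1.0], [0.0, 1.0]]
B = [[0.5], [1.0]]
Q = [[1.0, 0.0], [0.0, 0.0]]
R = 1.0   # scalar, since there is a single input

K = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    # s = R + B^T K B  (a scalar here)
    s = R + mat_mul(mat_mul(transpose(B), K), B)[0][0]
    KB = mat_mul(K, B)                     # 2x1
    BtK = mat_mul(transpose(B), K)         # 1x2
    # M = K - K B s^{-1} B^T K
    M = [[K[i][j] - KB[i][0] * BtK[0][j] / s for j in range(2)]
         for i in range(2)]
    AtMA = mat_mul(mat_mul(transpose(A), M), A)
    K = [[Q[i][j] + AtMA[i][j] for j in range(2)] for i in range(2)]

s = R + mat_mul(mat_mul(transpose(B), K), B)[0][0]
F = [[-v / s for v in mat_mul(mat_mul(transpose(B), K), A)[0]]]
print(K)   # converges to [[2, 1], [1, 1.5]]
print(F)   # converges to [[-0.5, -1.0]]
```

One can check by substitution that K = [[2, 1], [1, 1.5]] is a fixed point of the DRE with these matrices, and that the resulting F is [-0.5, -1.0].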
Bibliography
[1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Volume I. Athena Scientific, 3rd edition, 2005.