Planning Based on Markov Decision Processes
Dealing with Non-Determinism
Motivation
Until now, we've assumed that each action has only one possible outcome
In many situations this is unrealistic: actions may have more than one possible outcome
» Action failures, e.g., the gripper drops its load
» Exogenous events, e.g., road closed
[Figure: grasping block c, showing the intended outcome and an unintended outcome]
Would like to be able to plan in such situations
One approach: Markov Decision Processes
Stochastic Systems
Stochastic system: a triple Σ = (S, A, P)
» S = finite set of states
» A = finite set of actions
» Pa(s′ | s) = probability of going to s′ if we execute a in s
» ∑s′ ∈ S Pa(s′ | s) = 1
Several different possible action representations
» e.g., Bayes networks, probabilistic operators
We do not commit to any particular representation
» We will deal with the underlying semantics: explicit enumeration of each Pa(s′ | s)
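To make this concrete, here is a minimal Python sketch of the explicit-enumeration representation, using the robot example from the next slides. Only the probabilities for move(r1,l2,l3) and move(r1,l1,l4) appear later in the slides; treating every other applicable action as deterministic is an assumption.

```python
# Explicit enumeration of a stochastic system Sigma = (S, A, P).
# P[s][a] maps each successor state s' to P_a(s' | s); actions not
# listed under a state are taken to be inapplicable there.
# Assumption: only move(r1,l2,l3) and move(r1,l1,l4) are stochastic
# (their probabilities appear later in these slides); the other
# moves are modeled as deterministic.

P = {
    "s1": {"move(r1,l1,l2)": {"s2": 1.0},
           "move(r1,l1,l4)": {"s4": 0.5, "s1": 0.5},  # may fail, stay in s1
           "wait": {"s1": 1.0}},
    "s2": {"move(r1,l2,l3)": {"s3": 0.8, "s5": 0.2},  # may slip to s5
           "move(r1,l2,l1)": {"s1": 1.0},
           "wait": {"s2": 1.0}},
    "s3": {"move(r1,l3,l4)": {"s4": 1.0}, "wait": {"s3": 1.0}},
    "s4": {"wait": {"s4": 1.0}},
    "s5": {"move(r1,l5,l4)": {"s4": 1.0}, "wait": {"s5": 1.0}},
}

S = set(P)                                    # finite set of states
A = {a for acts in P.values() for a in acts}  # finite set of actions

# Each outgoing distribution must sum to 1: sum_{s'} P_a(s' | s) = 1
for s, acts in P.items():
    for a, dist in acts.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```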
Example
Robot r1 starts at location l1
» state s1 in the diagram
Objective is to get r1 to location l4
» state s4 in the diagram
Also present, but not shown: an action called wait
» It's applicable in every state
» It leaves the state unchanged
Example (continued)
No classical plan (a fixed sequence of actions) can be a solution, because no such sequence is guaranteed to reach s4:
π = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)⟩
» fails if move(r1,l2,l3) takes the robot to s5 instead of s3
π′ = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l5,l4)⟩
» fails if move(r1,l2,l3) takes the robot to s3, where move(r1,l5,l4) is not applicable
π″ = ⟨move(r1,l1,l4)⟩
» fails if move(r1,l1,l4) leaves the robot in s1
Policies
Policy: a function that maps states into actions
» Write it as a set of state-action pairs
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
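In the representation sketched earlier, a policy is simply a dictionary of state-action pairs, mirroring the set notation above:

```python
# Policies as dicts of state-action pairs (same pi1 and pi3 as above).
pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

pi3 = {"s1": "move(r1,l1,l4)", "s2": "move(r1,l2,l1)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "move(r1,l5,l4)"}
```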
Initial States
For every state s, there will be a probability P(s) that the system begins in the state s
We assume the system starts in a unique initial state s0
» P(s0) = 1
» P(s) = 0 for every s ≠ s0
In the example, P(s1) = 1, and P(s) = 0 for all other states
Histories
History: a sequence of system states
h = ⟨s0, s1, s2, s3, s4, … ⟩
Examples from the robot domain:
h0 = ⟨s1, s3, s1, s3, s1, … ⟩
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
h2 = ⟨s1, s2, s5, s5, s5, … ⟩
h3 = ⟨s1, s2, s5, s4, s4, … ⟩
h4 = ⟨s1, s4, s4, s4, s4, … ⟩
h5 = ⟨s1, s1, s4, s4, s4, … ⟩
h6 = ⟨s1, s1, s1, s4, s4, … ⟩
h7 = ⟨s1, s1, s1, s1, s1, … ⟩
Each policy induces a probability distribution over histories
» If h = ⟨s0, s1, … ⟩ then P(h | π) = P(s0) ∏i ≥ 0 Pπ(si)(si+1 | si)
» The book omits the P(s0) factor, since it assumes s0 is known
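A sketch of this induced distribution, evaluated over a finite prefix of a history (the names follow the formula above; p0 is the initial-state distribution P(s)):

```python
def history_prob(h, pi, P, p0):
    """P(h | pi) = P(s0) * prod_{i >= 0} P_{pi(s_i)}(s_{i+1} | s_i),
    evaluated over a finite prefix h of a history."""
    p = p0.get(h[0], 0.0)
    for s, s_next in zip(h, h[1:]):
        # Probability that executing pi's action in s leads to s_next;
        # unlisted successors have probability 0.
        p *= P[s][pi[s]].get(s_next, 0.0)
    return p

# With P and pi1 as sketched earlier:
#   history_prob(["s1", "s2", "s3", "s4"], pi1, P, {"s1": 1.0})  ->  0.8
```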
Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
(s4 is the goal state)
h1 = ⟨s1, s2, s3, s4, s4, … ⟩   P(h1 | π1) = 1 × 0.8 × 1 × … = 0.8
h2 = ⟨s1, s2, s5, s5, … ⟩   P(h2 | π1) = 1 × 0.2 × 1 × … = 0.2
P(h | π1) = 0 for all other h
Example (continued)
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h1 = ⟨s1, s2, s3, s4, s4, … ⟩   P(h1 | π2) = 1 × 0.8 × 1 × … = 0.8
h3 = ⟨s1, s2, s5, s4, s4, … ⟩   P(h3 | π2) = 1 × 0.2 × 1 × … = 0.2
P(h | π2) = 0 for all other h
Example (continued)
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h4 = ⟨s1, s4, s4, … ⟩   P(h4 | π3) = 0.5 × 1 × 1 × … = 0.5
h5 = ⟨s1, s1, s4, s4, … ⟩   P(h5 | π3) = 0.5 × 0.5 × 1 × … = 0.25
h6 = ⟨s1, s1, s1, s4, s4, … ⟩   P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
• • •
h7 = ⟨s1, s1, s1, s1, s1, … ⟩   P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × … = 0
Utility Functions
Numeric cost C(s,a) for each state s and action a
Numeric reward R(s) for each state s
Example:
» C(s,a) = 1 for each "horizontal" action
» C(s,a) = 100 for each "vertical" action
» C(s,wait) = 0 for s ≠ s5; C(s5,wait) = 100
» R as shown in the diagram: r = 100 at the goal s4 and r = –100 at s5
Utility function: generalization of a goal
» If h = ⟨s0, s1, … ⟩, then V(h | π) = ∑i ≥ 0 (R(si) – C(si, π(si)))
Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
V(h1 | π1) = (0 – 100) + (0 – 1) + (0 – 100) + 100 + 100 + … = ∞
h2 = ⟨s1, s2, s5, s5, … ⟩
V(h2 | π1) = (0 – 100) + (0 – 1) + (–100 – 0) + (–100 – 0) + … = –∞
Discounted Utility
We often need to use a discount factor γ, with 0 ≤ γ < 1
Discounted utility of a history:
» V(h | π) = ∑i ≥ 0 γ^i (R(si) – C(si, π(si)))
Makes rewards and costs accumulated at later stages count less than those accumulated at early stages
Ensures a bounded measure of utility for infinite histories
» guarantees convergence if 0 ≤ γ < 1
Expected utility of a policy: E(π) = ∑h P(h | π) V(h | π)
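A sketch of both quantities over finite history prefixes (with 0 ≤ γ < 1, the truncated tail's contribution can be made arbitrarily small). It assumes R is a dict keyed by state, C a dict keyed by (state, action) pairs, and reuses the history_prob sketch from earlier:

```python
def discounted_utility(h, pi, R, C, gamma):
    """V(h | pi) = sum_{i >= 0} gamma^i * (R(s_i) - C(s_i, pi(s_i))),
    over a finite prefix h of a history."""
    return sum(gamma**i * (R[s] - C[(s, pi[s])]) for i, s in enumerate(h))

def expected_utility(histories, pi, P, p0, R, C, gamma):
    """E(pi) = sum_h P(h | pi) V(h | pi), summed over an explicit list
    of (prefixes of) the histories with nonzero probability under pi."""
    return sum(history_prob(h, pi, P, p0) *
               discounted_utility(h, pi, R, C, gamma)
               for h in histories)
```

In the running example only h1 and h2 have nonzero probability under π1, so passing those two prefixes suffices.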
Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
γ = 0.9
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
V(h1 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (0 – 100) + 0.9^3 × 100 + 0.9^4 × 100 + … = 547.9
h2 = ⟨s1, s2, s5, s5, … ⟩
V(h2 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (–100) + 0.9^3 (–100) + … = –910.1
E(π1) = 0.8 × 547.9 + 0.2 × (–910.1) = 256.3
Planning as Optimization
From now onwards, we will study a special case: all rewards are 0
» E(π) is the expected cost of policy π
» the negative of what we had before
» This makes the equations slightly simpler
» Can easily generalize everything to the case of nonzero rewards
E(π) = ∑h P(h | π) C(h | π), where C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))
A solution is an optimal policy π*, i.e., E(π*) ≤ E(π) for every policy π
Bellman's Theorem
Let Qπ(s,a) be the expected cost if we start by executing action a in state s, and then use the policy π from then on:
» Qπ(s,a) = C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ(s′)
» Eπ(s′) is the expected cost of policy π starting at state s′
Let π* be an optimal policy
» At each state, π* chooses the action that produces the smallest expected cost from there onward (Bellman's equation):
» Eπ*(s) = mina ∈ A Qπ*(s,a)
Thus
Eπ*(s) = mina ∈ A {C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ*(s′)}
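In code, Qπ(s,a) is a one-step expected-cost lookahead. A sketch, using the P and C dictionary shapes from earlier, where E maps each state to Eπ(s):

```python
def q_value(s, a, E, P, C, gamma):
    """Q(s,a) = C(s,a) + gamma * sum_{s'} P_a(s' | s) * E(s')."""
    return C[(s, a)] + gamma * sum(p * E[s2] for s2, p in P[s][a].items())
```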
Policy Iteration
Policy iteration is a method to find an optimal policy
Start with an arbitrary initial policy π1
For i = 1, 2, …
» Compute Eπi(s) for every s by solving the system of equations
   Eπi(s) = C(s, πi(s)) + γ ∑s′ ∈ S Pπi(s)(s′ | s) Eπi(s′)
» For every s,
   πi+1(s) := argmina ∈ A {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπi(s′)}
» If πi+1 = πi then exit
Converges in a finite number of iterations
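A minimal sketch of this procedure, under the same assumptions as before (P[s][a] maps successors to probabilities, with P[s]'s keys being the actions applicable in s; C[(s, a)] is the cost); the evaluation step solves the linear system exactly with NumPy:

```python
import numpy as np

def policy_iteration(P, C, gamma, pi):
    """Policy iteration as in the pseudocode above; returns an optimal
    policy and its expected costs."""
    S = sorted(P)                        # fix an ordering of the states
    idx = {s: i for i, s in enumerate(S)}
    while True:
        # Policy evaluation: solve E = c_pi + gamma * P_pi E, i.e.
        # (I - gamma * P_pi) E = c_pi, for the current policy pi.
        M = np.eye(len(S))
        b = np.zeros(len(S))
        for s in S:
            b[idx[s]] = C[(s, pi[s])]
            for s2, p in P[s][pi[s]].items():
                M[idx[s], idx[s2]] -= gamma * p
        E = np.linalg.solve(M, b)
        # Policy improvement: in every state, pick the action that
        # minimizes the one-step lookahead cost.
        new_pi = {s: min(P[s], key=lambda a, s=s:
                         C[(s, a)] + gamma * sum(p * E[idx[s2]]
                                                 for s2, p in P[s][a].items()))
                  for s in S}
        if new_pi == pi:                 # no change: pi is optimal
            return pi, {s: E[idx[s]] for s in S}
        pi = new_pi
```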
Example
Run policy iteration starting from
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
[The intermediate expected-cost computations are shown on the diagram]
Example (continued)
An optimal policy (with γ = 0.9):
π2 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
Value Iteration
Value iteration is another way to find an optimal policy
Start with an arbitrary cost E0(s) for each s and an arbitrary ε > 0
For k = 1, 2, …
» for each s in S do
   • for each a in A do
      Q(s,a) := C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Ek–1(s′)
   • Ek(s) := mina ∈ A Q(s,a)
   • πk(s) := argmina ∈ A Q(s,a)
» If maxs ∈ S |Ek(s) – Ek–1(s)| < ε then exit
The greedy policy πk becomes optimal after a finite number of iterations, even though the values Ek generally converge only in the limit (hence the ε test)
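A matching sketch of value iteration, with the same P and C dictionary shapes as before:

```python
def value_iteration(P, C, gamma, eps):
    """Value iteration as in the pseudocode above; returns the greedy
    policy and the final cost estimates."""
    E = {s: 0.0 for s in P}            # arbitrary initial costs E_0
    while True:
        E_new, pi = {}, {}
        for s in P:
            # Q(s,a) for every action applicable in s
            Q = {a: C[(s, a)] + gamma * sum(p * E[s2]
                                            for s2, p in P[s][a].items())
                 for a in P[s]}
            pi[s] = min(Q, key=Q.get)  # argmin_a Q(s,a)
            E_new[s] = Q[pi[s]]        # min_a Q(s,a)
        if max(abs(E_new[s] - E[s]) for s in P) < eps:
            return pi, E_new
        E = E_new
```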
Example
Suppose we start with E0(s) = 0 for all s, and ε = 1
[The iteration is worked out on the diagram]
Discussion
Policy iteration computes an entire policy in each iteration, and computes values based on that policy
» More work per iteration, because it needs to solve a set of simultaneous equations
» Usually converges in a smaller number of iterations
Value iteration computes new values in each iteration, and chooses a policy based on those values
» In general, the values are not the values that one would get from the chosen policy or any other policy
» Less work per iteration, because it doesn't need to solve a set of equations
» Usually takes more iterations to converge