
Planning Based on Markov Decision Processes

Planning Based on Markov Decision Processes: Dealing with Non-Determinism
Transcript
Page 1:

Planning Based on Markov Decision Processes

Dealing with Non-Determinism


Page 2:

Motivation

Until now, we’ve assumed that each action has only one possible outcome
In many situations this is unrealistic: actions may have more than one possible outcome
Action failures
» e.g., gripper drops its load
Exogenous events
» e.g., road closed

(Figure: the “grasp block c” action on blocks a, b, c, with one intended outcome and one unintended outcome.)

Would like to be able to plan in such situations

One approach: Markov Decision Processes

Page 3:

Stochastic Systems

Stochastic system: a triple Σ = (S, A, P)

S = finite set of states
A = finite set of actions
Pa(s′ | s) = probability of going to s′ if we execute a in s
∑s′ ∈ S Pa(s′ | s) = 1

Several different possible action representations, e.g., Bayes networks, probabilistic operators

We do not commit to any particular representation; we will deal with the underlying semantics:


Explicit enumeration of each Pa (s′ | s)
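
To make the explicit-enumeration view concrete, the following is a minimal Python sketch (not part of the slides; the nested-dict layout and the reduced action set are illustrative assumptions, and only a few of the example's actions are enumerated):

    # Stochastic system Sigma = (S, A, P), with the transition function given
    # by explicit enumeration of each Pa(s' | s).
    # Layout and the subset of actions shown are illustrative assumptions.
    S = {"s1", "s2", "s3", "s4", "s5"}
    A = {"move(r1,l1,l2)", "move(r1,l2,l3)", "move(r1,l3,l4)", "wait"}

    # P[a][s][s_next] = Pa(s_next | s); the 0.8/0.2 split for move(r1,l1,l2)
    # follows the history probabilities used later in the slides.
    P = {
        "move(r1,l1,l2)": {"s1": {"s2": 0.8, "s5": 0.2}},
        "move(r1,l2,l3)": {"s2": {"s3": 1.0}},
        "move(r1,l3,l4)": {"s3": {"s4": 1.0}},
        "wait": {s: {s: 1.0} for s in S},  # wait is applicable everywhere and changes nothing
    }

    def check_distributions(P):
        """Check that the sum over s' of Pa(s' | s) equals 1 for every enumerated (a, s)."""
        for a, by_state in P.items():
            for s, dist in by_state.items():
                assert abs(sum(dist.values()) - 1.0) < 1e-9, (a, s)

    check_distributions(P)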

Page 4:

Example

Robot r1 starts at location l1
State s1 in the diagram

Objective is to get r1 to location l4
State s4 in the diagram

Also present, but not shown: an action called wait
It's applicable in every state
It leaves the state unchanged


Page 5:

Example

Robot r1 starts at location l1
State s1 in the diagram

Objective is to get r1 to location l4
State s4 in the diagram

No classical plan (sequence of actions) can be a solution
π = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)⟩


Page 6:

Example

Robot r1 starts at location l1
State s1 in the diagram

Objective is to get r1 to location l4
State s4 in the diagram

No classical plan (sequence of actions) can be a solution
π' = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l5,l4)⟩


Page 7:

Example

Robot r1 starts at location l1
State s1 in the diagram

Objective is to get r1 to location l4
State s4 in the diagram

No classical plan (sequence of actions) can be a solution
π'' = ⟨move(r1,l1,l4)⟩


Page 8:

Policies

π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}

π2 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, move(r1,l5,l4))}

π3 = {(s1, move(r1,l1,l4)),
(s2, move(r1,l2,l1)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, move(r1,l5,l4))}

Policy: a function that maps states into actions


Write it as a set of state-action pairs
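
Since a policy is just a set of state-action pairs, it maps directly onto a dictionary. A brief sketch in the same illustrative layout as the earlier one (action names as plain strings):

    # Policy pi1 from the slide, written as a state -> action mapping.
    pi1 = {
        "s1": "move(r1,l1,l2)",
        "s2": "move(r1,l2,l3)",
        "s3": "move(r1,l3,l4)",
        "s4": "wait",
        "s5": "wait",
    }
    # pi1["s2"] is the action the policy prescribes in state s2.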

Page 9:

Initial States

For every state s, there will be a probability P(s) that the system begins in the state s

We assume the system starts in a unique initial state s0

» P(s0) = 1
» P(s) = 0 for every other state s

In the example, P(s1) = 1, and P(s) = 0 for all other states


Page 10:

Histories

History: a sequence of system states

h = ⟨s0, s1, s2, s3, s4, … ⟩

h0 = ⟨s1, s3, s1, s3, s1, … ⟩
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
h2 = ⟨s1, s2, s5, s5, s5, … ⟩
h3 = ⟨s1, s2, s5, s4, s4, … ⟩
h4 = ⟨s1, s4, s4, s4, s4, … ⟩
h5 = ⟨s1, s1, s4, s4, s4, … ⟩
h6 = ⟨s1, s1, s1, s4, s4, … ⟩
h7 = ⟨s1, s1, s1, s1, s1, … ⟩

Each policy induces a probability distribution over histories
If h = ⟨s0, s1, … ⟩ then P(h | π) = P(s0) ∏i ≥ 0 Pπ(si)(si+1 | si)


The book omits the P(s0) factor, since it assumes that s0 is known
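
A small sketch of this product, evaluated over a finite prefix of a history; it assumes the dict layouts from the earlier sketches, with P0 mapping each state to its initial probability:

    def history_probability(prefix, policy, P, P0):
        """P(h | pi) = P(s0) * prod_i P_{pi(si)}(s_{i+1} | s_i), over a finite prefix of h."""
        prob = P0.get(prefix[0], 0.0)
        for s, s_next in zip(prefix, prefix[1:]):
            a = policy[s]
            prob *= P.get(a, {}).get(s, {}).get(s_next, 0.0)
        return prob

    # With P0 = {"s1": 1.0} and the P and pi1 dicts sketched earlier, the prefix
    # (s1, s2, s3, s4) has probability 1 * 0.8 * 1 * 1 = 0.8, matching P(h1 | pi1)
    # in the example on the next slide.
    P0 = {"s1": 1.0}
    print(history_probability(["s1", "s2", "s3", "s4"], pi1, P, P0))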

Page 11:

Example

π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}

(Goal state: s4 in the diagram.)

h1 = ⟨s1, s2, s3, s4, s4, … ⟩   P(h1 | π1) = 1 × 0.8 × 1 × … = 0.8
h2 = ⟨s1, s2, s5, s5, … ⟩   P(h2 | π1) = 1 × 0.2 × 1 × … = 0.2

P(h | π1) = 0 for all other h


Page 12:

Example (continued)

π2 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, move(r1,l5,l4))}

(Goal state: s4 in the diagram.)

h1 = ⟨s1, s2, s3, s4, s4, … ⟩   P(h1 | π2) = 1 × 0.8 × 1 × … = 0.8
h3 = ⟨s1, s2, s5, s4, s4, … ⟩   P(h3 | π2) = 1 × 0.2 × 1 × … = 0.2

P(h | π2) = 0 for all other h


Page 13:

Example (continued)

π3 = {(s1, move(r1,l1,l4)),
(s2, move(r1,l2,l1)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, move(r1,l5,l4))}

(Goal state: s4 in the diagram.)

h4 = ⟨s1, s4, s4, … ⟩   P(h4 | π3) = 0.5 × 1 × 1 × 1 × 1 × … = 0.5
h5 = ⟨s1, s1, s4, s4, … ⟩   P(h5 | π3) = 0.5 × 0.5 × 1 × 1 × 1 × … = 0.25
h6 = ⟨s1, s1, s1, s4, s4, … ⟩   P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × 1 × … = 0.125

• • •


h7 = ⟨s1, s1, s1, s1, s1, s1, … ⟩   P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0

Page 14:

Utility Functions

Numeric cost C(s,a) for each state s and action a
Numeric reward R(s) for each state s

Example:
C(s,a) = 1 for each “horizontal” action
C(s,a) = 100 for each “vertical” action
C(s,wait) = 0, for s ≠ s5
C(s5,wait) = 100
R as shown in the diagram (r = –100 marked on the map, Start at s1)

Utility function: generalization of a goal
If h = ⟨s0, s1, … ⟩, then V(h | π) = ∑i ≥ 0 (R(si) – C(si,π(si)))

Page 15:

Example

(Figure: the domain map, with Start at s1, and r = –100 and c = 0 marked on the diagram.)

π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}

h1 = ⟨s1, s2, s3, s4, s4, … ⟩
V(h1 | π1) = (0 – 100) + (0 – 1) + (0 – 100) + 100 + 100 + … = ∞

h2 = ⟨s1, s2, s5, s5, … ⟩
V(h2 | π1) = (0 – 100) + (0 – 1) + (–100 – 0) + (–100 – 0) + … = –∞

Page 16:

Discounted Utility

(Figure: the same domain map, with Start at s1 and r = –100 marked on the diagram.)

We often need to use a discount factor, γ
0 ≤ γ < 1

Discounted utility of a history:
V(h | π) = ∑i ≥ 0 γ^i (R(si) – C(si,π(si)))

Makes rewards and costs accumulated at later stages count less than those accumulated at early stages
Ensures a bounded measure of utilities for infinite histories
» Guarantees convergence if 0 ≤ γ < 1

Expected utility of a policy: E(π) = ∑h P(h | π) V(h | π)
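
A sketch of both quantities, evaluated over finite prefixes of histories (the discount makes the ignored tail contribution small); R and C are dicts of rewards and costs, and history_probability is the helper sketched earlier:

    def discounted_utility(prefix, policy, R, C, gamma):
        """V(h | pi) = sum over i of gamma^i * (R(si) - C(si, pi(si))), over a finite prefix."""
        return sum(gamma ** i * (R[s] - C[(s, policy[s])]) for i, s in enumerate(prefix))

    def expected_utility(prefixes, policy, R, C, gamma, P, P0):
        """E(pi) = sum over histories h of P(h | pi) * V(h | pi), over an enumerated set of prefixes."""
        return sum(history_probability(h, policy, P, P0)
                   * discounted_utility(h, policy, R, C, gamma)
                   for h in prefixes)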

Page 17:

Example

(Figure: the same domain map, with Start at s1, and r = –100 and c = 0 marked on the diagram.)

π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}

γ = 0.9

h1 = ⟨s1, s2, s3, s4, s4, … ⟩
V(h1 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (0 – 100) + 0.9^3 × 100 + 0.9^4 × 100 + … = 547.9

h2 = ⟨s1, s2, s5, s5, … ⟩
V(h2 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (–100) + 0.9^3 (–100) + … = –910.1

E(π1) = 0.8 × 547.9 + 0.2 × (–910.1) = 256.3

Page 18:

Planning as Optimization

From now onwards, we will study a special case:
all rewards are 0
E(π) is the expected cost of policy π
» the negative of what we had before
This makes the equations slightly simpler
Can easily generalize everything to the case of nonzero rewards

E(π) = ∑h P(h | π) C(h | π)
where C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))

A solution is an optimal policy π*, i.e.,


E(π*) ≤ E(π) for every π

Page 19:

Bellman’s Theorem

Let Qπ(s,a) be the expected cost in a state s if we start by using action a in state s, and then use the policy π from then on:
Qπ(s,a) = C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ(s′)
where Eπ(s′) is the expected cost of policy π in state s′

Let π* be an optimal policy
At each state, π* chooses the action that produces the smallest expected cost from there onward (Bellman’s equation):

» Eπ*(s) = mina ∈ A Qπ*(s,a)

Thus
Eπ*(s) = mina ∈ A {C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ*(s′)}
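
A direct transcription of Qπ(s,a) and the Bellman backup as Python functions, in the same illustrative dict layout as the earlier sketches; applicable(s), returning the actions applicable in s, is a hypothetical helper:

    def q_value(s, a, E, C, P, gamma):
        """Q(s, a) = C(s, a) + gamma * sum over s' of Pa(s' | s) * E(s')."""
        return C[(s, a)] + gamma * sum(p * E[sn] for sn, p in P[a].get(s, {}).items())

    def bellman_backup(s, E, C, P, gamma, applicable):
        """Bellman's equation: E*(s) = min over actions a applicable in s of Q(s, a)."""
        return min(q_value(s, a, E, C, P, gamma) for a in applicable(s))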


Page 20:

Policy Iteration

Policy iteration is a method to find an optimal policy
Start with an arbitrary initial policy π1
For i = 1, 2, …
Compute Eπi(s) for every s by solving the system of equations
» Eπi(s) = C(s, πi(s)) + γ ∑s′ ∈ S Pπi(s)(s′ | s) Eπi(s′)
For every s,
» πi+1(s) := argmina ∈ A {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπi(s′)}
If πi+1 = πi then exit

Converges in a finite number of iterations
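
A compact sketch of the procedure above, assuming the dict layouts and the q_value helper from the earlier sketches and a hypothetical applicable(s) giving the actions applicable in s; the evaluation step solves the linear system with numpy:

    import numpy as np

    def policy_iteration(S, P, C, gamma, pi0, applicable):
        """Evaluate the current policy exactly, improve it greedily, and stop
        when the policy no longer changes."""
        states = sorted(S)
        idx = {s: i for i, s in enumerate(states)}
        pi = dict(pi0)
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) E = C_pi for E
            n = len(states)
            P_pi = np.zeros((n, n))
            C_pi = np.zeros(n)
            for s in states:
                a = pi[s]
                C_pi[idx[s]] = C[(s, a)]
                for sn, p in P[a].get(s, {}).items():
                    P_pi[idx[s], idx[sn]] = p
            values = np.linalg.solve(np.eye(n) - gamma * P_pi, C_pi)
            E = {s: values[idx[s]] for s in states}
            # Policy improvement: pick the action minimizing Q(s, a) in every state
            new_pi = {s: min(applicable(s), key=lambda a: q_value(s, a, E, C, P, gamma))
                      for s in states}
            if new_pi == pi:
                return pi, E
            pi = new_pi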


Page 21:

Example

π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}


Page 22:

π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}


Page 23:

Example (Continued)

We had
π1 = {(s1, move(r1,l1,l2)),
(s2, move(r1,l2,l3)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, wait)}

An optimal policy:
π2 = {(s1, move(r1,l1,l4)),
(s2, move(r1,l2,l1)),
(s3, move(r1,l3,l4)),
(s4, wait),
(s5, move(r1,l5,l4))}

γ = 0.9

(Figure: the domain map, with Start at s1, and r = –100 and c = 0 marked on the diagram.)

Page 24:

Value Iteration

Value iteration is another way to find an optimal policy
Start with an arbitrary cost E0(s) for each s and an arbitrary ε > 0
For k = 1, 2, …
for each s in S do
» for each a in A do
• Q(s,a) := C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Ek–1(s′)
» Ek(s) = mina ∈ A Q(s,a)
» πk(s) = argmina ∈ A Q(s,a)
If maxs ∈ S |Ek(s) – Ek–1(s)| < ε then exit

The greedy policy πk converges to an optimal policy in a finite number of iterations, even if ε = 0 (the value estimates Ek converge only in the limit)
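
A sketch of the loop above in the same illustrative layout, reusing the q_value helper and a hypothetical applicable(s):

    def value_iteration(S, P, C, gamma, eps, applicable):
        """Repeat Bellman backups over all states until the largest change in E
        falls below eps, and return the greedy policy with its values."""
        E = {s: 0.0 for s in S}  # arbitrary initial costs E0(s)
        while True:
            new_E, pi = {}, {}
            for s in S:
                q = {a: q_value(s, a, E, C, P, gamma) for a in applicable(s)}
                pi[s] = min(q, key=q.get)  # pi_k(s) = argmin_a Q(s, a)
                new_E[s] = q[pi[s]]        # E_k(s) = min_a Q(s, a)
            if max(abs(new_E[s] - E[s]) for s in S) < eps:
                return pi, new_E
            E = new_E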


Page 25:

Example

Suppose we start with E0(s) = 0 for all s, and ε = 1

(Figure: the domain map, with Start at s1.)


Page 26:

Discussion

Policy iteration computes an entire policy in each iteration, and computes values based on that policy
More work per iteration, because it needs to solve a set of simultaneous equations
Usually converges in a smaller number of iterations

Value iteration computes new values in each iteration, and chooses a policy based on those values
In general, the values are not the values that one would get from the chosen policy or any other policy
Less work per iteration, because it doesn't need to solve a set of equations
Usually takes more iterations to converge


