Planning Based on Markov Decision Processes
Dealing with Non-Determinism
Motivation
Until now, we've assumed that each action has only one possible outcome
In many situations this is unrealistic: actions may have more than one possible outcome
» Action failures, e.g., the gripper drops its load
» Exogenous events, e.g., road closed
[Figure: grasping block c, showing the intended outcome and an unintended outcome]
Would like to be able to plan in such situations
One approach: Markov Decision Processes
Stochastic Systems
Stochastic system: a triple Σ = (S, A, P)
» S = finite set of states
» A = finite set of actions
» Pa(s′ | s) = probability of going to s′ if we execute a in s
» ∑s′ ∈ S Pa(s′ | s) = 1
Several different possible action representations
» e.g., Bayes networks, probabilistic operators
We do not commit to any particular representation
» We will deal with the underlying semantics: explicit enumeration of each Pa(s′ | s)
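To make this concrete, here is a minimal Python sketch of the explicit-enumeration representation, using the robot example from the next slides. Only the probabilities for move(r1,l2,l3) and move(r1,l1,l4) appear later in the slides; treating every other applicable action as deterministic is an assumption.

```python
# Explicit enumeration of a stochastic system Sigma = (S, A, P).
# P[s][a] maps each successor state s' to P_a(s' | s); actions not
# listed under a state are taken to be inapplicable there.
# Assumption: only move(r1,l2,l3) and move(r1,l1,l4) are stochastic
# (their probabilities appear later in these slides); the other
# moves are modeled as deterministic.

P = {
    "s1": {"move(r1,l1,l2)": {"s2": 1.0},
           "move(r1,l1,l4)": {"s4": 0.5, "s1": 0.5},  # may fail, stay in s1
           "wait": {"s1": 1.0}},
    "s2": {"move(r1,l2,l3)": {"s3": 0.8, "s5": 0.2},  # may slip to s5
           "move(r1,l2,l1)": {"s1": 1.0},
           "wait": {"s2": 1.0}},
    "s3": {"move(r1,l3,l4)": {"s4": 1.0}, "wait": {"s3": 1.0}},
    "s4": {"wait": {"s4": 1.0}},
    "s5": {"move(r1,l5,l4)": {"s4": 1.0}, "wait": {"s5": 1.0}},
}

S = set(P)                                    # finite set of states
A = {a for acts in P.values() for a in acts}  # finite set of actions

# Each outgoing distribution must sum to 1: sum_{s'} P_a(s' | s) = 1
for s, acts in P.items():
    for a, dist in acts.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```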
Example
Robot r1 starts at location l1
» state s1 in the diagram
Objective is to get r1 to location l4
» state s4 in the diagram
Also present, but not shown: an action called wait
» It's applicable in every state
» It leaves the state unchanged
Example (continued)
No classical plan (a fixed sequence of actions) can be a solution, because no such sequence is guaranteed to reach s4:
π = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)⟩
» fails if move(r1,l2,l3) takes the robot to s5 instead of s3
π′ = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l5,l4)⟩
» fails if move(r1,l2,l3) takes the robot to s3, where move(r1,l5,l4) is not applicable
π″ = ⟨move(r1,l1,l4)⟩
» fails if move(r1,l1,l4) leaves the robot in s1
Policies
Policy: a function that maps states into actions
» Write it as a set of state-action pairs
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
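In the representation sketched earlier, a policy is simply a dictionary of state-action pairs, mirroring the set notation above:

```python
# Policies as dicts of state-action pairs (same pi1 and pi3 as above).
pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

pi3 = {"s1": "move(r1,l1,l4)", "s2": "move(r1,l2,l1)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "move(r1,l5,l4)"}
```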
Initial States
For every state s, there will be a probability P(s) that the system begins in the state s
We assume the system starts in a unique initial state s0
» P(s0) = 1
» P(s) = 0 for every s ≠ s0
In the example, P(s1) = 1, and P(s) = 0 for all other states
Histories
History: a sequence of system states
h = ⟨s0, s1, s2, s3, s4, … ⟩
Examples from the robot domain:
h0 = ⟨s1, s3, s1, s3, s1, … ⟩
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
h2 = ⟨s1, s2, s5, s5, s5, … ⟩
h3 = ⟨s1, s2, s5, s4, s4, … ⟩
h4 = ⟨s1, s4, s4, s4, s4, … ⟩
h5 = ⟨s1, s1, s4, s4, s4, … ⟩
h6 = ⟨s1, s1, s1, s4, s4, … ⟩
h7 = ⟨s1, s1, s1, s1, s1, … ⟩
Each policy induces a probability distribution over histories
» If h = ⟨s0, s1, … ⟩ then P(h | π) = P(s0) ∏i ≥ 0 Pπ(si)(si+1 | si)
» The book omits the P(s0) factor, since it assumes s0 is known
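A sketch of this induced distribution, evaluated over a finite prefix of a history (the names follow the formula above; p0 is the initial-state distribution P(s)):

```python
def history_prob(h, pi, P, p0):
    """P(h | pi) = P(s0) * prod_{i >= 0} P_{pi(s_i)}(s_{i+1} | s_i),
    evaluated over a finite prefix h of a history."""
    p = p0.get(h[0], 0.0)
    for s, s_next in zip(h, h[1:]):
        # Probability that executing pi's action in s leads to s_next;
        # unlisted successors have probability 0.
        p *= P[s][pi[s]].get(s_next, 0.0)
    return p

# With P and pi1 as sketched earlier:
#   history_prob(["s1", "s2", "s3", "s4"], pi1, P, {"s1": 1.0})  ->  0.8
```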
Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
(s4 is the goal state)
h1 = ⟨s1, s2, s3, s4, s4, … ⟩   P(h1 | π1) = 1 × 0.8 × 1 × … = 0.8
h2 = ⟨s1, s2, s5, s5, … ⟩   P(h2 | π1) = 1 × 0.2 × 1 × … = 0.2
P(h | π1) = 0 for all other h
Example (continued)
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h1 = ⟨s1, s2, s3, s4, s4, … ⟩   P(h1 | π2) = 1 × 0.8 × 1 × … = 0.8
h3 = ⟨s1, s2, s5, s4, s4, … ⟩   P(h3 | π2) = 1 × 0.2 × 1 × … = 0.2
P(h | π2) = 0 for all other h
Example (continued)
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h4 = ⟨s1, s4, s4, … ⟩   P(h4 | π3) = 0.5 × 1 × 1 × … = 0.5
h5 = ⟨s1, s1, s4, s4, … ⟩   P(h5 | π3) = 0.5 × 0.5 × 1 × … = 0.25
h6 = ⟨s1, s1, s1, s4, s4, … ⟩   P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
• • •
h7 = ⟨s1, s1, s1, s1, s1, … ⟩   P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × … = 0
Utility Functions
Numeric cost C(s,a) for each state s and action a
Numeric reward R(s) for each state s
Example:
» C(s,a) = 1 for each "horizontal" action
» C(s,a) = 100 for each "vertical" action
» C(s,wait) = 0 for s ≠ s5; C(s5,wait) = 100
» R as shown in the diagram: r = 100 at the goal s4 and r = –100 at s5
Utility function: generalization of a goal
» If h = ⟨s0, s1, … ⟩, then V(h | π) = ∑i ≥ 0 (R(si) – C(si, π(si)))
Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
V(h1 | π1) = (0 – 100) + (0 – 1) + (0 – 100) + 100 + 100 + … = ∞
h2 = ⟨s1, s2, s5, s5, … ⟩
V(h2 | π1) = (0 – 100) + (0 – 1) + (–100 – 0) + (–100 – 0) + … = –∞
Discounted Utility
We often need to use a discount factor γ, with 0 ≤ γ < 1
Discounted utility of a history:
» V(h | π) = ∑i ≥ 0 γ^i (R(si) – C(si, π(si)))
Makes rewards and costs accumulated at later stages count less than those accumulated at early stages
Ensures a bounded measure of utility for infinite histories
» guarantees convergence if 0 ≤ γ < 1
Expected utility of a policy: E(π) = ∑h P(h | π) V(h | π)
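A sketch of both quantities over finite history prefixes (with 0 ≤ γ < 1, the truncated tail's contribution can be made arbitrarily small). It assumes R is a dict keyed by state, C a dict keyed by (state, action) pairs, and reuses the history_prob sketch from earlier:

```python
def discounted_utility(h, pi, R, C, gamma):
    """V(h | pi) = sum_{i >= 0} gamma^i * (R(s_i) - C(s_i, pi(s_i))),
    over a finite prefix h of a history."""
    return sum(gamma**i * (R[s] - C[(s, pi[s])]) for i, s in enumerate(h))

def expected_utility(histories, pi, P, p0, R, C, gamma):
    """E(pi) = sum_h P(h | pi) V(h | pi), summed over an explicit list
    of (prefixes of) the histories with nonzero probability under pi."""
    return sum(history_prob(h, pi, P, p0) *
               discounted_utility(h, pi, R, C, gamma)
               for h in histories)
```

In the running example only h1 and h2 have nonzero probability under π1, so passing those two prefixes suffices.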
Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
γ = 0.9
h1 = ⟨s1, s2, s3, s4, s4, … ⟩
V(h1 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (0 – 100) + 0.9^3 × 100 + 0.9^4 × 100 + … = 547.9
h2 = ⟨s1, s2, s5, s5, … ⟩
V(h2 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (–100) + 0.9^3 (–100) + … = –910.1
E(π1) = 0.8 × 547.9 + 0.2 × (–910.1) = 256.3
Planning as Optimization
From now onwards, we will study a special case: all rewards are 0
» E(π) is the expected cost of policy π
» the negative of what we had before
» This makes the equations slightly simpler
» Can easily generalize everything to the case of nonzero rewards
E(π) = ∑h P(h | π) C(h | π), where C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))
A solution is an optimal policy π*, i.e., E(π*) ≤ E(π) for every policy π
Bellman's Theorem
Let Qπ(s,a) be the expected cost if we start by executing action a in state s, and then use the policy π from then on:
» Qπ(s,a) = C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ(s′)
» Eπ(s′) is the expected cost of policy π starting at state s′
Let π* be an optimal policy
» At each state, π* chooses the action that produces the smallest expected cost from there onward (Bellman's equation):
» Eπ*(s) = mina ∈ A Qπ*(s,a)
Thus
Eπ*(s) = mina ∈ A {C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ*(s′)}
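In code, Qπ(s,a) is a one-step expected-cost lookahead. A sketch, using the P and C dictionary shapes from earlier, where E maps each state to Eπ(s):

```python
def q_value(s, a, E, P, C, gamma):
    """Q(s,a) = C(s,a) + gamma * sum_{s'} P_a(s' | s) * E(s')."""
    return C[(s, a)] + gamma * sum(p * E[s2] for s2, p in P[s][a].items())
```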
Policy Iteration
Policy iteration is a method to find an optimal policy
Start with an arbitrary initial policy π1
For i = 1, 2, …
» Compute Eπi(s) for every s by solving the system of equations
   Eπi(s) = C(s, πi(s)) + γ ∑s′ ∈ S Pπi(s)(s′ | s) Eπi(s′)
» For every s,
   πi+1(s) := argmina ∈ A {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπi(s′)}
» If πi+1 = πi then exit
Converges in a finite number of iterations
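A minimal sketch of this procedure, under the same assumptions as before (P[s][a] maps successors to probabilities, with P[s]'s keys being the actions applicable in s; C[(s, a)] is the cost); the evaluation step solves the linear system exactly with NumPy:

```python
import numpy as np

def policy_iteration(P, C, gamma, pi):
    """Policy iteration as in the pseudocode above; returns an optimal
    policy and its expected costs."""
    S = sorted(P)                        # fix an ordering of the states
    idx = {s: i for i, s in enumerate(S)}
    while True:
        # Policy evaluation: solve E = c_pi + gamma * P_pi E, i.e.
        # (I - gamma * P_pi) E = c_pi, for the current policy pi.
        M = np.eye(len(S))
        b = np.zeros(len(S))
        for s in S:
            b[idx[s]] = C[(s, pi[s])]
            for s2, p in P[s][pi[s]].items():
                M[idx[s], idx[s2]] -= gamma * p
        E = np.linalg.solve(M, b)
        # Policy improvement: in every state, pick the action that
        # minimizes the one-step lookahead cost.
        new_pi = {s: min(P[s], key=lambda a, s=s:
                         C[(s, a)] + gamma * sum(p * E[idx[s2]]
                                                 for s2, p in P[s][a].items()))
                  for s in S}
        if new_pi == pi:                 # no change: pi is optimal
            return pi, {s: E[idx[s]] for s in S}
        pi = new_pi
```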
Example
Run policy iteration starting from
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
[The intermediate expected-cost computations are shown on the diagram]
Example (continued)
An optimal policy (with γ = 0.9):
π2 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
Value Iteration
Value iteration is another way to find an optimal policy
Start with an arbitrary cost E0(s) for each s and an arbitrary ε > 0
For k = 1, 2, …
» for each s in S do
   • for each a in A do
      Q(s,a) := C(s,a) + γ ∑s′ ∈ S Pa(s′ | s) Ek–1(s′)
   • Ek(s) := mina ∈ A Q(s,a)
   • πk(s) := argmina ∈ A Q(s,a)
» If maxs ∈ S |Ek(s) – Ek–1(s)| < ε then exit
The greedy policy πk becomes optimal after a finite number of iterations, even though the values Ek generally converge only in the limit (hence the ε test)
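A matching sketch of value iteration, with the same P and C dictionary shapes as before:

```python
def value_iteration(P, C, gamma, eps):
    """Value iteration as in the pseudocode above; returns the greedy
    policy and the final cost estimates."""
    E = {s: 0.0 for s in P}            # arbitrary initial costs E_0
    while True:
        E_new, pi = {}, {}
        for s in P:
            # Q(s,a) for every action applicable in s
            Q = {a: C[(s, a)] + gamma * sum(p * E[s2]
                                            for s2, p in P[s][a].items())
                 for a in P[s]}
            pi[s] = min(Q, key=Q.get)  # argmin_a Q(s,a)
            E_new[s] = Q[pi[s]]        # min_a Q(s,a)
        if max(abs(E_new[s] - E[s]) for s in P) < eps:
            return pi, E_new
        E = E_new
```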
Example
Suppose we start with E0(s) = 0 for all s, and ε = 1
[The iteration is worked out on the diagram]
Discussion
Policy iteration computes an entire policy in each iteration, and computes values based on that policy
» More work per iteration, because it needs to solve a set of simultaneous equations
» Usually converges in a smaller number of iterations
Value iteration computes new values in each iteration, and chooses a policy based on those values
» In general, the values are not the values that one would get from the chosen policy or any other policy
» Less work per iteration, because it doesn't need to solve a set of equations
» Usually takes more iterations to converge