Page 1:

Temporal Markov Decision Problems — Formalization and Resolution

Emmanuel Rachelson

Doctoral school: Systèmes
Institution of enrollment: ISAE-SUPAERO

Host laboratory: ONERA-DCSD

March 23rd, 2009

Pages 2-7:

Motivation

Performing "as well as possible"

Uncertain outcomes

Uncertain durations

Time-dependent environment

Time-dependent goals and rewards [figure: time axis with dates t1 and t2]

Page 8:

Problem statement

We want to build a control policy which allows the agent to coordinate its durative actions

with the continuous evolution of its uncertain environment, in order to optimize its behaviour w.r.t. a given criterion.

Page 12:

Outline

1 Background

2 Time-dependent policies

3 Time and MDPs

4 Resolution of TMDPs

5 Illustration and results

6 Is that sufficient?

7 Simulation-based asynchronous Policy Iteration for temporal problems

8 Conclusion

Page 13:

Modeling background

Sequential decision under probabilistic uncertainty:

Markov Decision Process

Tuple ⟨S, A, p, r, T⟩
Markovian transition model p(s′|s, a)
Reward model r(s, a)
T is the set of timed decision epochs {0, 1, ..., H}

Infinite (unbounded) horizon: H → ∞

[Figure: timeline 0, 1, ..., n, n+1 starting in s0, with transitions p(s_{n+1}|s_n, a_n) and rewards r(s_n, a_n)]

Page 14:

Optimal policies for MDPs

Value of a sequence of actions

\[
\forall (a_n) \in A^{\mathbb{N}},\quad V^{(a_n)}(s) = \mathbb{E}\left[ \sum_{\delta=0}^{\infty} \gamma^{\delta}\, r(s_\delta, a_\delta) \right]
\]

Stationary, deterministic, Markovian policy

\[
D = \left\{ \pi : S \to A,\ s \mapsto \pi(s) = a \right\}
\]

Optimality equation

\[
V^*(s) = \max_{\pi \in D} V^{\pi}(s) = \max_{a \in A} \left\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s,a)\, V^*(s') \right\}
\]
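To make the optimality equation concrete, here is a minimal value-iteration sketch; the two-state, two-action MDP it solves is a made-up toy example, not one of the problems treated in the thesis.

```python
# Minimal value-iteration sketch for the discounted optimality equation above.
# The 2-state / 2-action MDP below is a made-up toy, not from the thesis.
import numpy as np

gamma = 0.95
# p[s, a, s'] : transition model, r[s, a] : reward model
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(p.shape[0])
for _ in range(1000):
    # Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
    Q = r + gamma * np.einsum("sap,p->sa", p, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy, pi(s) = argmax_a Q(s,a)
print(V, policy)
```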

Page 15:

What are we looking for?

Time-dependent policies

[Figure: actions scheduled along the time axis; in s1: a3, a7, a1; in s2: a2, a6, a1; in s3: a3, a2, a3]

Page 16:

What are we looking for?

[Figure: example time-dependent policy shown as regions in the (time, energy) plane; actions: Wait, Recharge, Take Picture, move_to_2, move_to_4, move_to_5]

Page 17:

Continuous durations in stochastic processes

MDPs: the set T contains integer-valued dates. → More flexible durations?

Semi-Markov Decision Process

Tuple ⟨S, A, p, f, r⟩
Duration model f(τ|s, a)
Transition model p(s′|s, a) or p(s′|s, a, τ)

[Figure: MDP timeline t0, t1, t2, t3, ..., tδ with fixed steps Δt = 1 vs. SMDP timeline with sojourn times drawn from f(τ|s, a)]
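For illustration, a tiny sketch of sampling one SMDP step; the exponential/uniform duration models and the toy transition rule are assumptions made up for the example, not the thesis's models.

```python
# Toy SMDP step: draw a sojourn time from f(tau|s,a), then a successor state
# from p(s'|s,a,tau). Both models below are illustrative assumptions.
import random

def smdp_step(s, a, t, rng=random):
    tau = rng.expovariate(1.0 if a == "fast" else 0.5)    # duration model f(tau|s,a)
    s_next = s if rng.random() < 0.2 else s + 1            # transition model p(s'|s,a,tau)
    return s_next, t + tau

random.seed(0)
print(smdp_step(0, "fast", 0.0))
```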

Page 18:

Time-dependent MDPs

Definition (TMDP, [Boyan and Littman, 2001])

Tuple ⟨S, A, M, L, R, K⟩
M: set of outcomes µ = (s′µ, Tµ, Pµ)
L(µ|s, t, a): probability of triggering outcome µ
R(µ, t, t′) = rµ,t(t) + rµ,τ(t′ − t) + rµ,t′(t′)

[Figure: from s1, action a1 triggers outcome µ1 with probability 0.2 (Pµ1, Tµ1 = REL) or outcome µ2 with probability 0.8 towards s2 (Pµ2, Tµ2 = ABS)]

Boyan, J. A. and Littman, M. L. (2001). Exact Solutions to Time Dependent MDPs. Advances in Neural Information Processing Systems, 13:1026–1032.

Page 19:

TMDP optimality equation

\[
\begin{aligned}
V(s,t) &= \sup_{t' \ge t} \left( \int_t^{t'} K(s,\theta)\, d\theta + \overline{V}(s,t') \right) \\
\overline{V}(s,t) &= \max_{a \in A} Q(s,t,a) \\
Q(s,t,a) &= \sum_{\mu \in M} L(\mu \mid s,t,a) \cdot U(\mu,t) \\
U(\mu,t) &= \begin{cases}
\displaystyle \int_{-\infty}^{\infty} P_\mu(t')\, \big[ R(\mu,t,t') + V(s'_\mu,t') \big]\, dt' & \text{if } T_\mu = \mathrm{ABS} \\[6pt]
\displaystyle \int_{-\infty}^{\infty} P_\mu(t'-t)\, \big[ R(\mu,t,t') + V(s'_\mu,t') \big]\, dt' & \text{if } T_\mu = \mathrm{REL}
\end{cases}
\end{aligned}
\]

[Figure: Qn(s, t, a1), Qn(s, t, a2), Qn(s, t, a3) plotted against t]
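A possible way to evaluate one such backup numerically, assuming discrete outcome-date distributions Pµ (the exact-resolution case discussed later) and a value function stored on a time grid; the data structures and the small example are hypothetical.

```python
# One TMDP backup evaluated on a discretised time axis.
# Assumptions (illustrative): discrete P_mu supports, V stored on a time grid.
import numpy as np

T_grid = np.linspace(0.0, 10.0, 101)                  # decision dates t

def U(mu, t, V_next, R):
    """U(mu,t) = E_{t'~P_mu}[ R(mu,t,t') + V(s'_mu,t') ] for a discrete P_mu."""
    if mu["type"] == "ABS":                            # absolute outcome dates t'
        dates, probs = mu["support"], mu["probs"]
    else:                                              # REL: durations shifted by t
        dates, probs = t + mu["support"], mu["probs"]
    v = np.interp(dates, T_grid, V_next)               # V(s'_mu, t') by interpolation
    return float(np.sum(probs * (R(mu, t, dates) + v)))

def Q(s, t, a, outcomes, L, V, R):
    """Q(s,t,a) = sum_mu L(mu|s,t,a) * U(mu,t)."""
    return sum(L(mu, s, t, a) * U(mu, t, V[mu["next_state"]], R)
               for mu in outcomes[(s, a)])

# hypothetical single-outcome example
V = {"s2": np.maximum(0.0, 5.0 - T_grid)}              # made-up value of s2 over time
outcomes = {("s1", "a1"): [{"type": "REL", "support": np.array([1.0, 2.0]),
                            "probs": np.array([0.5, 0.5]), "next_state": "s2"}]}
L = lambda mu, s, t, a: 1.0                            # single outcome triggered surely
R = lambda mu, t, dates: np.zeros_like(dates)          # no transition reward
print(Q("s1", 0.0, "a1", outcomes, L, V, R))
```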

Page 22:

An MDP with continuous observable time?

SMDPs: no explicit time-dependency

TMDPs: time-dependent, but

no explicit criterion
no theoretical guarantees
restrictions on the model

⇒ Can we provide a sound and more general framework for representing time in MDPs?

Page 23:

Including observable time in MDPs

Can an MDP represent its own process’ time as a state variable?

XMDP

Tuple ⟨Σ, A(X), p, r⟩
Σ: σ = (s, t) ∈ B(S × R)

A(X) compact set of parametric actions ai(x)

p(σ ′|σ ,a(x)) upper semi-continuous w.r.t. x

r(σ ,a(x)) positive, upper semi-continuous w.r.t. x

Steady time advance

∀(σ, a(x)) ∈ Σ × A(X), ∃α > 0 such that t′ < t + α ⇒ p(σ′|σ, a(x)) = 0

“tδ+1 ≥ tδ + α”

Page 24:

Theorem (XMDP optimality equation, [Rachelson et al., 2008a])

The optimal value function V ∗ is the unique solution of:

\[
\forall (s,t) \in S \times \mathbb{R},\quad V(s,t) = \sup_{a(x) \in A(X)} \left\{ r(s,t,a(x)) + \int_{t' \in \mathbb{R},\, s' \in S} \gamma^{\,t'-t}\, p(s',t' \mid s,t,a(x))\, V(s',t')\, ds'\, dt' \right\}
\]

Rachelson, E., Garcia, F., and Fabiani, P. (2008a). Extending the Bellman Equation for MDP to Continuous Actions and Continuous Time in the Discounted Case. In International Symposium on Artificial Intelligence and Mathematics.

Theorem (XMDP optimal policy)

Under the previous assumptions, there exists a deterministic, Markovian policy such that V π = V ∗.

Page 25:

TMDPs and XMDPs

Optimality equation and conditions

TMDP optimality equation ≡ XMDP equation with specific assumptions.

total reward criterion

t-deterministic and s-static, implicit wait action

interleaving of wait/action

no lump sum reward for wait action

assumptions on r, L, Pµ so that the optimal policy exists

assumptions on r, L, Pµ so that the system retains physical meaning

Page 26:

TMDPs and XMDPs

Optimality equation and conditions

TMDP optimality equation ≡ XMDP equation with specific assumptions.

XMDPs provide proven optimality conditions and equation.

But solving the general case of XMDPs is too complex.

→ In practice, we turn back to solving TMDPs

Page 27:

Solving TMDPs

[Figure: TMDP outcome diagram, as on Page 18]

Value iteration

Bellman backups for TMDPs can be performed exactly if:

L(µ|s, t,a) piecewise constant

R(µ, t, t ′) = rµ,t (t) + rµ,τ (t ′− t) + rµ,t ′(t ′)

rµ,t (t), rµ,τ (τ), rµ,t ′(t ′) piecewise linear

Pµ (t ′), Pµ (t ′− t) discrete distributions

Then V ∗(s, t) is piecewise linear.

Page 28:

Solving TMDPs


What about other, more expressive functions?

How does this theoretical result scale to practical resolution?

Pages 29-33:

Extending exact resolution

Piecewise polynomial models: L, Pµ, ri ∈ Pn.

Degree evolution

\[
P_\mu \in DP_A,\quad r_i, V_0 \in P_B,\quad L \in P_C
\;\Rightarrow\; d^\circ(V_n) = B + n\,(A + C + 1)
\]

Stability ⇔ A + C = −1.

Exact resolution conditions

Degree stability + exact analytical computations:

\[
P_\mu \in DP_{-1},\qquad r_i \in P_4,\qquad L \in P_0
\]

If B > 4: approximate root finding.

If A + C > 0: projection scheme of Vn on PB.
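One way to read the degree recursion above, as a sketch that assumes the usual degree rules for products and convolution integrals of piecewise polynomials:

\begin{align*}
d^\circ\big(U_n(\mu,\cdot)\big) &= A + d^\circ(V_n) + 1
  && \text{convolution of } P_\mu \in DP_A \text{ with } R + V_n \\
d^\circ\big(Q_{n+1}\big) &= C + A + d^\circ(V_n) + 1
  && \text{product with } L \in P_C \\
d^\circ(V_{n+1}) &= d^\circ(V_n) + (A + C + 1)
  && \text{max over } a \text{ preserves the degree} \\
\Rightarrow\; d^\circ(V_n) &= B + n\,(A + C + 1)
  && \text{since } d^\circ(V_0) = B .
\end{align*}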

Page 34:

And in practice?

Fact (Admitted)

The number of definition intervals in Vn grows with n and does not necessarily converge.

⇒ numerical problems occur before ‖Vn − Vn−1‖ < ε.

e.g. V calculation:

[Figure: Qn(s, t, a1), Qn(s, t, a2), Qn(s, t, a3) plotted against t]

Page 35:

And in practice?

Fact (Admitted)

The number of definition intervals in Vn grows with n and does not necessarily converge.

⇒ numerical problems occur before ‖Vn − Vn−1‖ < ε.

→ general case: approximate resolution by piecewise polynomial interval simplification for the value function.

Approximation: ↗ degree reduction, ↘ interval simplification

Pages 36-38:

TMDPpoly: Approximate Value Iteration on TMDPs

TMDPpoly polynomial approximation

pout = poly_approx(pin, [l, u], ε, B)
Two phases: incremental refinement and simplification.

[Figure: the interval I is refined into I1, I2, I3 wherever the max error exceeds ε (first attempt, second attempt), yielding pout from pin]

Properties

pout ∈ PB
‖pin − pout‖∞ ≤ ε
suboptimal number of intervals
good complexity compromise
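An illustrative reimplementation of a poly_approx-style projection, not the thesis code: intervals are split while the fit error exceeds ε (refinement), then adjacent pieces are greedily merged as long as the L∞ bound still holds (simplification).

```python
# Sketch of a poly_approx-style projection onto piecewise polynomials of degree <= B.
# Illustrative reimplementation with a made-up target function, not the thesis code.
import numpy as np

def fit_piece(f, lo, hi, B, n_pts=64):
    x = np.linspace(lo, hi, n_pts)
    coeffs = np.polyfit(x, f(x), B)
    err = np.max(np.abs(np.polyval(coeffs, x) - f(x)))
    return coeffs, err

def poly_approx(f, lo, hi, eps, B):
    # phase 1: incremental refinement (split until the error bound holds)
    pieces, done = [(lo, hi)], []
    while pieces:
        a, b = pieces.pop()
        coeffs, err = fit_piece(f, a, b, B)
        if err <= eps or (b - a) < 1e-6:
            done.append((a, b, coeffs))
        else:
            m = 0.5 * (a + b)
            pieces += [(a, m), (m, b)]
    done.sort()
    # phase 2: simplification (merge neighbours while the bound still holds)
    merged = [done[0]]
    for a, b, c in done[1:]:
        la, _, _ = merged[-1]
        coeffs, err = fit_piece(f, la, b, B)
        if err <= eps:
            merged[-1] = (la, b, coeffs)
        else:
            merged.append((a, b, c))
    return merged

# usage on a made-up value function
V = lambda t: np.maximum(0.0, 5.0 - t) * np.sin(t)
print(len(poly_approx(V, 0.0, 10.0, 0.05, B=3)), "intervals")
```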

Page 39:

TMDPpoly : Approximate Value Iteration on TMDPs

Prioritized Sweeping.

Leveraging the computational effort by ordering Bellman backups

Perform Bellman backups in states with the largest value function change.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning Journal, 13(1):103–105.

Pages 40-43:

TMDPpoly: Approximate Value Iteration on TMDPs

Adapting Prioritized Sweeping to TMDPs.

Pick the highest-priority state → s0
Bellman backup → V(s0, t) (update V̄(s0, t), update V(s0, t), poly_approx(V(s0, t)))
Update Q values → Q(s, t, a)
Update priorities → prio(s) = ‖Q − Qold‖∞

[Figure: the backup in s0 propagates to predecessor states s1, s2, s3 through (a10, µ10), (a20, µ20), (a30, µ30), updating prio(s1), prio(s2), prio(s3)]
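For reference, a generic prioritized-sweeping sketch on a small tabular MDP (the time dimension and the polynomial value representation are omitted here); the toy transition and reward arrays are invented.

```python
# Generic prioritized-sweeping sketch on a tabular MDP: back up the state whose
# value changed most, then raise the priority of its predecessors. Toy model only.
import heapq
import numpy as np

def prioritized_sweeping(p, r, gamma=0.95, theta=1e-6, max_backups=10_000):
    n_s, n_a = r.shape
    V = np.zeros(n_s)
    # predecessors of each state s2: states s with some action a s.t. p[s,a,s2] > 0
    preds = [{s for s in range(n_s) for a in range(n_a) if p[s, a, s2] > 0}
             for s2 in range(n_s)]
    heap = [(-1.0, s) for s in range(n_s)]           # max-heap via negated priorities
    heapq.heapify(heap)
    for _ in range(max_backups):
        if not heap:
            break
        neg_prio, s = heapq.heappop(heap)
        if -neg_prio < theta:
            break
        old = V[s]
        V[s] = np.max(r[s] + gamma * p[s] @ V)        # Bellman backup in s
        delta = abs(V[s] - old)
        for sp in preds[s]:                           # propagate priority upstream
            heapq.heappush(heap, (-delta * gamma, sp))
    return V

# made-up 2-state / 2-action example
p = np.array([[[0.8, 0.2], [0.0, 1.0]],
              [[1.0, 0.0], [0.3, 0.7]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
print(prioritized_sweeping(p, r))
```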

Page 44:

TMDPpoly

TMDPpoly in a nutshell

TMDPpoly:

Analytical polynomial calculations
L∞-bounded error projection
Prioritized Sweeping for TMDPs

Analytical operations: option for representing continuous quantities.

Approximation makes resolution possible.

Asynchronous VI makes it faster.

Page 45:

Illustration — UAV patrol problem

[Figure: patrol map with four starred targets and their time-dependent reward profiles: state (3, 8) with levels 0/2 over dates 25, 60, 70; state (5, 2) with levels 0/5 over 45, 50; state (9, 3) with levels 0/2 over 20, 50; state (9, 10) with levels 0/3 over 60, 70]

Pages 46-47:

— Compute V̄(s, t), V(s, t) and poly_approx(V(s, t))

— Compute U(µ, t), Q(s, a, t) and prio(s)

Page 48:

Mars Rover

[Figure: Mars rover map over waypoints 1-6 with edge traversal durations, a photo goal with availability window [ts; te], and two sample goals]

Page 49:

Mars rover policy

V and π in p = 3 when no goals have been completed yet.

Page 50:

Mars rover policy

π in p = 3 when no goals have been completed yet — 2D view.

[Figure: policy π shown as regions in the (time, energy) plane; actions: Wait, Recharge, Take Picture, move_to_2, move_to_4, move_to_5]

Pages 51-54:

Contributions

XMDP optimality conditions and equations.
Specific case of TMDPs.
Extending exact resolution of TMDPs.
TMDPpoly allows better resolution of generalized piecewise polynomial TMDPs (including the exact case).

Optimal value function and policy

Existence of optimality conditions and an optimality equation on V and π for continuous observable time, discrete event stochastic processes.

\[
V^* = L V^*, \qquad
\pi^* = \arg\max_{a(x) \in A(X)} \left\{ r(s,t,a(x)) + \int_{t' \in \mathbb{R},\, s' \in S} \gamma^{\,t'-t}\, p(s',t' \mid s,t,a(x))\, V^*(s',t')\, ds'\, dt' \right\}
\]

TMDP hypotheses

TMDPs are XMDPs with specific hypotheses and a total reward criterion.

Exact resolution conditions

Conditions for exact resolution of TMDPs can be slightly extended:

\[
P_\mu \in DP_A,\quad r_i \in P_B,\quad L \in P_C
\]
\[
P_\mu \in DP_{-1},\quad r_i \in P_4,\quad L \in P_0
\]

But practical resolution calls for approximation.

TMDPpoly in a nutshell

TMDPpoly:

Analytical polynomial calculations
L∞-bounded error projection
Prioritized Sweeping for TMDPs

Analytical operations: option for representing continuous quantities.
Approximation makes resolution possible.
Asynchronous VI makes it faster.

Page 55:

Is that sufficient?

“A well-cast problem is a half-solved problem.”

Initial example: obtaining the model is not trivial.

→ the “first half” (modeling) is not solved.

A natural model for continuous-time decision processes?


Pages 58-60:

Concurrent exogenous events

Explicit-event modeling: a natural description of the system's complexity.

Aggregating the contribution of concurrent temporal processes (internal, sunlight, weather, other agent, my action) ... all affecting the same state space S.

Pages 61-67:

GSMDPs

Generalized Semi-Markov Decision Process

Tuple ⟨S, E, A, p, f, r⟩
E: set of events.
A ⊂ E: subset of controllable events (actions).
f(ce|s, e): duration model of event e.
p(s′|s, e, ce): transition model of event e.

Glynn, P. (1989). A GSMP Formalism for Discrete Event Systems. Proc. of the IEEE, 77.

Younes, H. L. S. and Simmons, R. G. (2004). Solving Generalized Semi-Markov Decision Processes using Continuous Phase-Type Distributions. In AAAI Conference on Artificial Intelligence.

[Figure: in state s1 the enabled events Es1 = {e2, e4, e5, a} compete; e4 fires first and triggers the transition P(s′|s1, e4) to s2, where the enabled set becomes Es2 = {e2, e3, a} and action a fires with transition P(s′|s2, a)]
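A minimal sketch of the discrete-event simulation loop such a model induces: every enabled event keeps a clock drawn from its duration model, the earliest clock fires, and clocks are redrawn or discarded as the enabled set changes. The events, duration models and transition rule below are invented for illustration.

```python
# GSMDP-style discrete-event simulation sketch with made-up events and models.
import random

duration = {                       # f(c_e | s, e): illustrative, state-independent samplers
    "arrival":   lambda s: random.expovariate(1.0),
    "departure": lambda s: random.uniform(0.5, 1.5),
}
def enabled(s):                    # which events compete in state s
    return ["arrival"] if s == 0 else ["arrival", "departure"]
def transition(s, e):              # p(s' | s, e): deterministic toy dynamics
    return s + 1 if e == "arrival" else s - 1

def simulate(s, horizon=10.0, seed=0):
    random.seed(seed)
    t, clocks, history = 0.0, {}, []
    while t < horizon:
        for e in enabled(s):                         # draw clocks for newly enabled events
            clocks.setdefault(e, t + duration[e](s))
        e, te = min(((e, clocks[e]) for e in enabled(s)), key=lambda x: x[1])
        s, t = transition(s, e), te                  # the earliest clock fires
        clocks.pop(e)                                # a fired event's clock is redrawn later
        clocks = {k: v for k, v in clocks.items() if k in enabled(s)}  # drop disabled clocks
        history.append((t, e, s))
    return history

print(simulate(0))
```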

Page 68:

Modeling claim

A natural model for temporal processes

Observable time GSMDPs are a natural way of modeling stochastic, temporal decision processes.

Pages 69-71:

Properties

Markov property

The process defined by the natural state s of a GSMDP does not retain the Markov property.

No guarantee of an optimal π(s) policy.

Markovian state: (s, c) → often non-observable.

Working hypothesis

In time-dependent GSMDPs, the state (s, t) is a good approximation of the Markovian state variables (s, c).

Remark

Even though GSMDPs are non-Markov processes, they provide a straightforward way of building a simulator.

How can we search for a good policy?
→ Learning from interaction with a GSMDP simulator.

Page 72:

Learning from interaction with a simulator

[Figure: the agent sends an action a to the simulator, which returns s′, t′, r]

Planning: using a model {P(s′, t′|s, t, a), r(s, t, a)} to get good {V(s, t), π(s, t)}

Learning: using samples (s, t, a, r, s′, t′) to get good {V(s, t), π(s, t)}

Page 73:

Simulation-based Reinforcement Learning

3 main issues:

Exploration of the state space

Update of the value function

Improvement of the policy

How should we use our temporal process’ simulator to learn policies?

Pages 74-76:

Illustration

This approach is motivated by problems such as the "subway problem", with large, hybrid state spaces and many concurrent events, for which a global model is not available.

Our approach

Improve the policy in the situations which are likely to be encountered.
Evaluate the policy in the situations needed for improvement.

Exploiting info from episodes?

episode = observed or simulated trajectory through the state space.

[Figure: simulated episodes from s0 plotted in the (s, t) plane]

Pages 77-81:

Model-free, simulation-based local search

Input: initial state s0, t0; initial policy π0; process simulator.
Goal: improve on π0.

"simulator" → simulation-based
"local" → asynchronous
"incremental π improvement" → policy iteration

for temporal problems: iATPI

Page 82:

Asynchronous Dynamic Programming

Asynchronous Bellman backups

As long as every state is visited infinitely often for Bellman backups on V or π, the sequences of Vn and πn converge to V ∗ and π∗. → Asynchronous Policy Iteration.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

iATPI performs greedy exploration

Once an improving action a is found in (s, t), the next state (s′, t′) picked for Bellman backup is chosen by applying a. Observable time ⇒ this (s′, t′) is picked according to P(s′, t′|s, t, πn).

Pages 83-85:

Monte Carlo evaluations for temporal problems

Simulating π in (s, t):

\[
\big( (s_0,t_0), a_0, r_0, \ldots, (s_{l-1},t_{l-1}), a_{l-1}, r_{l-1}, (s_l,t_l) \big)
\quad \text{with} \quad (s_0,t_0) = (s,t),\; a_i = \pi(s_i,t_i),\; t_l \ge T
\]

[Figure: simulated episodes from s0 in the (s, t) plane]

\[
\mathrm{ValueSet} = \left\{ R(s_i,t_i) = \sum_{k=i}^{l-1} r_k \right\}
\]

Value function estimation

\[
V^{\pi}(s,t) = \mathbb{E}\big( R(s,t) \big), \qquad V^{\pi} \leftarrow \mathrm{regression}(\mathrm{ValueSet})
\]
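A sketch of this evaluation step with a hypothetical simulator and policy: roll out π from (s, t) until the temporal horizon T, compute the tail returns R(s_i, t_i), and hand the resulting ValueSet to a regressor (e.g. SVR or nearest neighbours, as listed later).

```python
# Monte-Carlo evaluation sketch: rollouts of pi until the horizon T and tail returns.
# The simulator and policy below are hypothetical stand-ins, not the subway model.
import numpy as np

def rollout(simulate_step, policy, s, t, T):
    """One episode ((s0,t0),a0,r0,...,(sl,tl)) following policy until t >= T."""
    traj = []
    while t < T:
        a = policy(s, t)
        s2, t2, r = simulate_step(s, t, a)
        traj.append(((s, t), a, r))
        s, t = s2, t2
    return traj

def value_samples(traj):
    """Tail returns R(s_i,t_i) = sum_{k>=i} r_k for every visited (s_i,t_i)."""
    samples, tail = [], 0.0
    for (s, t), a, r in reversed(traj):
        tail += r
        samples.append(((s, t), tail))
    return samples[::-1]

# toy simulator: one scalar state, roughly unit durations, reward = -s
rng = np.random.default_rng(0)
def simulate_step(s, t, a):
    return max(0.0, s + a + rng.normal(0, 0.1)), t + rng.uniform(0.5, 1.5), -s
policy = lambda s, t: -1.0 if s > 0 else 1.0
data = value_samples(rollout(simulate_step, policy, s=3.0, t=0.0, T=12.0))
print(len(data), data[0])        # ValueSet entries to be fed to a regressor
```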

Page 86:

In practice

Algorithm sketch

Given the current policy πn, the current process state (s, t), the current estimate V πn

Compute the best action a∗ with respect to V πn

Pick (s′, t ′) according to a∗

Until t ′ > T

Compute V πn+1 for the last(s) episode(s)

But . . .

Pages 87-88:

Avoiding the pitfall of partial exploration

The R(s, t) are not drawn i.i.d. (only independently). → the Monte-Carlo estimate of V π is a biased estimator.

This estimate of V π is only valid locally → local confidence in the estimate.

[Figure: episodes in the (s, t) plane; outside the explored region, P(s′, t′|s0, t0, a1) leads to samples for which Q(s0, a1) = ? cannot be evaluated]

Confidence function CV

Can we trust the estimated V π(s, t) as an approximation of V π in (s, t)?

CV : S × R → {⊤, ⊥}, (s, t) ↦ CV(s, t)

estimate of V π(s, t) → CV(s, t)
policy π(s, t) → Cπ(s, t)

Page 89:

iATPI

iATPI

iATPI:

Asynchronous policy iteration for greedy search
Time-dependency & Monte-Carlo sampling
Local policies and values via confidence functions

Asynchronous PI: local improvements / partial evaluation.

t-dependent Monte-Carlo sampling: loopless — finite — total criterion.

Confidence functions: alternative to heuristic-based approaches.

Page 90:

iATPI

Given the current policy πn, the current process state (s, t), the current estimate V πn

Compute the best action a∗ with respect to V πn

Use CV πn to check if V πn can be used
Sample more evaluation trajectories for πn if not
Refine V πn and CV πn

Pick (s′, t′) according to a∗

Until t′ > T

Compute V πn+1, CV πn+1, πn+1, Cπn+1 for the last(s) episode(s)
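A control-flow sketch of this loop with placeholder components (simulator step, greedy action selection, confidence test, extra sampling); none of the concrete functions below come from the thesis implementation.

```python
# Control-flow sketch of one iATPI policy-improvement pass.
# conf_V, sample_more, best_action, step and V_n are placeholders standing in for
# the confidence functions, simulator and regression-based value estimate.
import random

def iatpi_pass(s, t, T, actions, pi_n, V_n, conf_V, sample_more, best_action, step):
    pi_next = {}
    while t < T:
        if not conf_V(s, t):                 # C_V(s,t) false: estimate not trusted here
            sample_more(pi_n, s, t)          # simulate more evaluation episodes of pi_n
        a_star = best_action(s, t, actions, V_n)   # greedy improvement w.r.t. V^{pi_n}
        pi_next[(s, round(t, 1))] = a_star
        s, t = step(s, t, a_star)            # pick (s',t') by applying a* (greedy exploration)
    return pi_next                           # partial policy pi_{n+1}, defined where visited

# trivial placeholders so the sketch runs end to end
random.seed(0)
conf_V = lambda s, t: t > 1.0
sample_more = lambda pi, s, t: None
V_n = lambda s, t, a: -abs(s - a) - 0.1 * t
best_action = lambda s, t, actions, V: max(actions, key=lambda a: V(s, t, a))
step = lambda s, t, a: (s + random.choice([-1, 0, 1]), t + random.uniform(0.5, 1.0))
print(iatpi_pass(0, 0.0, 5.0, [-1, 0, 1], None, V_n, conf_V, sample_more, best_action, step))
```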

Page 91:

Output

A pile Πn = {(π0, Cπ0), (π1, Cπ1), ..., (πn, Cπn) | Cπ0(s, t) = ⊤} of partial policies.

39 / 45Temporal Markov Decision Problems — Formalization and Resolution

Page 92: Temporal Markov Decision Problems --- Formalization and ...emmanuel.rachelson.free.fr/extras/publis/Emmanuel... · Background Policies Time and MDPs TMDPpoly Illustration Is that

Background Policies Time and MDPs TMDPpoly Illustration Is that sufficient? iATPI Conclusion

Preliminary results with iATPI

Preliminary results on ATPI and the subway problem:

Subway problem

4 trains, 6 stations→ 22 hybrid state variables, 9 actions

episodes of 12 hours with around 2000 steps.


With proper initialization, naive ATPI finds good policies.


[Plot: initial state value (y-axis, −3500 to 1500) versus iteration number (x-axis, 0 to 14), using M-CSVR.]


Value functions, policies and confidence functions

How do we write V, C_V, π and C_π?

→ Statistical learning problem

We implemented and tried several options:

V: incremental, local regression problem (SVR, LWPR, nearest neighbours).

π: local classification problem (SVC, nearest neighbours).

C: incremental, local statistical sufficiency test (OC-SVM, central-limit theorem).
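A minimal sketch of one such combination using scikit-learn (an illustration of the kind of learners listed above, not the thesis implementation; the data here is random placeholder data): SVR for the value regression, SVC for the policy classification, and a one-class SVM as a crude confidence test over the visited (s, t) region.

import numpy as np
from sklearn.svm import SVR, SVC, OneClassSVM

# Hypothetical training data: rows are (state features..., t), collected from episodes.
X = np.random.rand(200, 3)               # e.g. two state features plus time
returns = np.random.rand(200)            # Monte-Carlo returns observed from each (s, t)
actions = np.random.randint(0, 3, 200)   # actions chosen by the improved policy

value_fn = SVR(kernel="rbf").fit(X, returns)     # V: regression on observed returns
policy = SVC(kernel="rbf").fit(X, actions)       # π: classification of best actions
support = OneClassSVM(nu=0.1).fit(X)             # C: "have we sampled near here?"

def confident(x):
    # Trust the learned V and π only inside the region covered by the samples.
    return support.predict(x.reshape(1, -1))[0] == 1

x_query = np.array([0.5, 0.2, 0.7])
if confident(x_query):
    print(policy.predict(x_query.reshape(1, -1)), value_fn.predict(x_query.reshape(1, -1)))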


Perspectives for iATPI

iATPI is ongoing work → no hasty conclusions.

Current work: extensive testing of the algorithm's full version.

Still lots of open questions:

How to avoid local maxima in value function space?

Test on a fully discrete and observable problem?

... and many ideas for improvement:

Use V_{n−k} functions as lower bounds on V_n

Utility functions for stopping sampling in episode.bestAction()


Contributions

Modeling framework for stochastic decision processes: GSMDPs + continuous time.

iATPI

Modeling claim

Describing concurrent, exogenous contributions to the system's dynamics separately.

Concurrent observable-time SMDPs affecting the same state space → observable-time GSMDPs.

Natural framework for describing temporal problems.


iATPI:
Asynchronous policy iteration
Time-dependency & Monte-Carlo sampling
Confidence functions

Asynchronous PI: local improvements / partial evaluation.

t-dependent Monte-Carlo sampling: loopless — finite — total criterion.

Confidence functions: alternative to heuristic-based approaches.


Summarizing the work done

Three ways of reading the thesis:

Modeling of temporal stochastic decision processes: implicit-event (extended TMDP) and explicit-event (observable-time GSMDP).

Theory: general framework of XMDPs, optimality conditions and equations.

Algorithms for time-dependent policy search: model-based asynchronous value iteration (TMDPpoly) and model-free local search for policy iteration (iATPI).


Thank you for your attention!


International Conferences

Rachelson, E., Teichteil, F., and Garcia, F. (2007a). Temporal coordination under uncertainty: initial results for the two agents case. In ICAPS Doctoral Consortium.

Rachelson, E., Garcia, F., and Fabiani, P. (2008a). Extending the Bellman Equation for MDP to Continuous Actions and Continuous Time in the Discounted Case. In International Symposium on Artificial Intelligence and Mathematics.

Rachelson, E., Quesnel, G., Garcia, F., and Fabiani, P. (2008b). A Simulation-based Approach for Solving Generalized Semi-Markov Decision Processes. In European Conference on Artificial Intelligence.

Rachelson, E., Quesnel, G., Garcia, F., and Fabiani, P. (2008c). Approximate Policy Iteration for Generalized Semi-Markov Decision Processes: an Improved Algorithm. In European Workshop on Reinforcement Learning.


French-speaking Conferences

Rachelson, E., Fabiani, P., Farges, J.-L., Teichteil, F., and Garcia, F. (2006). Une approche du traitement du temps dans le cadre MDP : trois méthodes de découpage de la droite temporelle. In Journées Françaises Planification Décision Apprentissage.

Rachelson, E., Teichteil, F., and Garcia, F. (2007b). XMDP : un modèle de planification temporelle dans l'incertain à actions paramétriques. In Journées Françaises Planification Décision Apprentissage.

Rachelson, E., Fabiani, P., and Garcia, F. (2008a). Un Algorithme Amélioré d'Itération de la Politique Approchée pour les Processus Décisionnels Semi-Markoviens Généralisés. In Journées Françaises Planification Décision Apprentissage.

Rachelson, E., Fabiani, P., Garcia, F., and Quesnel, G. (2008b). Une Approche basée sur la Simulation pour l'Optimisation des Processus Décisionnels Semi-Markoviens Généralisés (english version). In Conférence Francophone sur l'Apprentissage Automatique. Best student paper, awarded by AFIA.


Talks and presentations

ONERA DCSD, UR-CD, Toulouse (April 2006). Planification dans l'incertain — Introduire une variable temporelle continue.

INRA-BIA, Toulouse (May 25th, 2007). Planifier en fonction du temps dans le cadre MDP.

ONERA DCSD, UR-CD, Toulouse (February 3rd, 2008). Formalisation et résolution de problèmes de Markov temporels par couplage avec VLE. Coupled with "Multi-modélisation et simulation : la plate-forme VLE" by G. Quesnel.

Intelligent Systems Laboratory, Technical University of Crete (July 29th, 2008). Simulation-based Approximate Policy Iteration for Generalized Semi-Markov Decision Processes.


Teaching activities

Non-linear optimization: lecturing (2007, 2008), tutoring (2006) — ENAC

Probabilities and Harmonic analysis, introduction module: lecturing (2006) — SUPAERO

Reinforcement Learning and Dynamic Programming: tutoring (2008) — ISAE-SUPAERO

Stochastic Processes: tutoring (2007, 2008) — SUPAERO then ISAE-SUPAERO

Optimization and numeric computation: tutoring (2006, 2007, 2008) — SUPAERO then ISAE-SUPAERO

MatLab introduction: tutoring (2006, 2007) — SUPAERO

Harmonic analysis: tutoring (2006) — SUPAERO


Algorithmic perspectives

Model-based approaches:

Biasing PS in TMDPpoly to obtain better convergence speed.

Better algorithms (and implementation) for POLYTOOLS.

XMDPpoly?

Policy Iteration for XMDPs? TMDPs?

...

The iATPI perspective:

Discounted criteria?

Statistical learning for iATPI, sound algorithms and efficient implementations.

Avoiding local minima with iATPI.

...


Perspectives: models and foundations

Time and stochastic processes:

Foundations of time-explicit decision processes: lifting the mathematical assumptions in the XMDP model.

Relation between GSMDP and POMDP: defining a belief state from the (s, c) state.


Exploration vs exploitation?

How does iATPI compare to other methods concerning the exploration vs. exploitation trade-off?

Automated balancing through "optimism":

"Optimism in the face of uncertainty"
Rmax
Admissible heuristics

Encourages early exploration. Automatically balances the trade-off.

⇒ Very good for online learning.

iATPI suggests an "offline/online" alternative:

Abandon global exploration for incremental, episode-based exploration. Explore what we need locally for evaluation, use it for local improvement, then look outside.

No exploration encouragement or discouragement. Local-search idea.

⇒ Good for “cautious” search?


Other illustrations of GSMDPs

Should we open more lines?


Airplane taxiing management


Adding or removing trains?


Onboard planning for coordination


[Diagram: multi-agent coordination over a communication channel: the rover's declared most probable trajectory; the expected evolution of the fire over time; the consequence events of the UAV's declared most probable actions (ev5, ev1, ev6); the probability of successfully taking road 3 over time; the current state (x1 = 3, x2 = 3, x3 = 1, x4 = 2, x5 = 0, x6 = 8); and each agent's resulting time-dependent action policy in s1, s2, s3.]


Waiting or being idle?

[Timelines comparing an explicit wait action before a (from t1 up to T) with implicit waits (durations T1, T2) preceding actions a and a′ at times t1 and t2.]

Being idle → let the system change continuously.
Discrete-event process → stepwise changes in the system.

From the execution point of view:

Being idle → let the system change by itself ⇒ interest of the W function or of explicit-event representations (a∞).

But this is different from the TMDP's wait action.


DECTS

GSMPs = concurrent temporal stochastic processes
DEVS = generic description of discrete-event systems

[Diagram: a DEVS model M with input ports (p_in_0, v_in_0), ..., (p_in_n, v_in_n) forming the input set X_M, and output ports (p_out_0, v_out_0), ..., (p_out_m, v_out_m) forming the output set Y_M.]


Temporal decision process ≡ a DECTS with an input port for actions and an output port for observations (s′, r).

step(a) ≡ δ_ext(a, s_internal)
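A minimal Python sketch of such an interface (illustrative only; the class and method names are my own assumptions, not the thesis code): the environment exposes a single step(a), which plays the role of the external transition δ_ext applied to its hidden internal state.

import random

class SimpleDECTS:
    """A toy stand-in for a discrete-event temporal process: step(a) applies
    an external transition to the hidden internal state and returns the
    observation ((s', t'), r)."""

    def __init__(self, s0=0, t0=0.0):
        self.s, self.t = s0, t0

    def step(self, a):
        self.t += random.expovariate(1.0)   # stochastic sojourn time before the next event
        self.s += a                          # toy state change driven by the action
        reward = -1.0                        # e.g. a fixed cost per decision step
        return (self.s, self.t), reward

env = SimpleDECTS()
print(env.step(1))   # -> ((1, <elapsed time>), -1.0)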


An optimization process ≡ a sequence of operations involving experiments with a DECTS model.

[Diagram: a DECTS learner (executive model) coupled with recursive simulation models; it dynamically creates or clones DECTS models on the fly, links them to itself, and receives information from the linked models.]


A DECTS learner is an executive (high-level) discrete-event system, creating and controlling a set of DECTS experiments.

It has internal decision objects (policies, values, etc.)

Note: Actor-Critic vs. DECTS? Actor-Critic is the architecture of the DECTS learner's decision objects.


iATPI as a DECTS

[State diagram: the iATPI learner as a DECTS with phases idle, decide and choose; on begin/end-trial transitions it creates and initializes a "trial" DECTS or destroys it; on info and action transitions it sends the chosen action to "trial", destroys the previous "eval" DECTS, creates a new "eval" DECTS by cloning "trial", and sends the action to "eval".]


Database iATPI

H0 hypothesis

The asymptotic convergence of Q_n(s, a) towards a distribution N(Q(s, a), σ) is fast.

Theorem (PAC-bound guarantee)

Q_n(s, a) is an ε-estimate of Q(s, a) with probability p = erf( ε√n / (σ_Qn √2) ).

In practice

N_a: stop the rollouts in (s, a) whenever σ_Qn ≤ ε√n / (erf⁻¹(p) √2).

N_episodes: stop running episodes for the current policy when Q(s0, a*) has a σ_Qn lower than the bound.

Rollouts: early stopping if a state with σ_Mn ≤ ε / (erf⁻¹(p) √2) is encountered.
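A small sketch of such a stopping rule (my reading of the bound above, not the thesis code; it assumes scipy is available for the inverse error function): keep averaging rollout returns for (s, a) until the empirical standard deviation satisfies σ ≤ ε√n / (erf⁻¹(p)√2).

import math
import random
from statistics import mean, stdev
from scipy.special import erfinv   # inverse error function

def estimate_q(rollout, eps=1.0, p=0.95, min_samples=5, max_samples=10000):
    """Average rollout() returns until the PAC-style bound holds: the
    estimate is then an eps-estimate of Q with probability >= p."""
    samples = []
    threshold = erfinv(p) * math.sqrt(2)
    while len(samples) < max_samples:
        samples.append(rollout())
        n = len(samples)
        if n >= min_samples and stdev(samples) <= eps * math.sqrt(n) / threshold:
            break
    return mean(samples)

# Hypothetical usage: noisy rollout returns around a true Q value of 10.
print(estimate_q(lambda: random.gauss(10.0, 2.0), eps=0.5, p=0.95))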


Mars rover

V and π in p = 3 when no goals have been completed yet.


π in p = 3 when no goals have been completed yet — 2D view.

[Plot: the policy over the (time, energy) plane, with regions for Wait, Recharge, Take Picture, move_to_2, move_to_4 and move_to_5; time runs from 0 to 70, energy from 0 to 40.]


Analytical resolution of GSMDPs

[Younes and Simmons, 2004] → approximate all duration models f(τ | s, e) by chains of exponential distributions.

Phase-type distributions: introduce abstract states for the nodes of the phase-type distributions.

Memoryless exponential distributions turn the GSMDP into a CTMDP.

Resolution by uniformization.

Younes, H. L. S. and Simmons, R. G. (2004). Solving Generalized Semi-Markov Decision Processes using Continuous Phase-Type Distributions. In AAAI Conference on Artificial Intelligence.
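A tiny illustration of the idea (a sketch under my own assumptions, not Younes and Simmons' construction): a fixed duration can be approximated by an Erlang chain, i.e. a series of k memoryless exponential phases whose sum concentrates around the target duration as k grows.

import random

def erlang_sample(k, target_mean):
    """Sample a duration as a chain of k exponential phases whose total
    mean is target_mean; larger k approximates a deterministic duration
    more tightly while keeping each phase memoryless."""
    rate = k / target_mean                       # each phase has mean target_mean / k
    return sum(random.expovariate(rate) for _ in range(k))

samples = [erlang_sample(k=20, target_mean=5.0) for _ in range(10000)]
print(sum(samples) / len(samples))               # close to 5.0, with variance 5.0**2 / 20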


GSMDPs and POMDPs

Observations and hidden process

The natural state s of a GSMDP corresponds to observations on a hidden Markov process (s, c).

(s, c) ↔ hidden state
s ↔ observations

Working hypothesis

In time-dependent GSMDPs, the state (s, t) is a good approximation of the associated POMDP's belief state.

iATPI → simulation-based, asynchronous policy iteration for stochastic shortest path POMDPs.


Computing V from V̄

[Plots over t ∈ [0, 5]: the input value V̄(s, t′); f(t′) = V̄(s, t′) − k t′; g(t) = sup_{t′ ≥ t} f(t′); and the result V(s, t) = k t + g(t).]
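A short numerical sketch of this transform on a discretized time axis (my own illustration; the thesis handles the piecewise-polynomial functions analytically, not on a grid): the suffix supremum g is a reversed cumulative maximum.

import numpy as np

# Hypothetical discretization of the value function V̄(s, ·) over [0, 5].
t = np.linspace(0.0, 5.0, 501)
v_bar = np.sin(t) + 1.0          # placeholder for V̄(s, t')
k = 0.2                          # linear waiting-cost rate

f = v_bar - k * t                              # f(t') = V̄(s, t') - k t'
g = np.maximum.accumulate(f[::-1])[::-1]       # g(t) = sup_{t' >= t} f(t')
v_wait = k * t + g                             # V(s, t) = k t + g(t)

print(v_wait[:5])                              # value when waiting is allowed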


Asynchronous Policy Iteration

Asynchronous Bellman backups

As long as every state is visited infinitely often for Bellman backups on V or π, the sequences V_n and π_n converge to V* and π*.

Examples

Unordered V-backups (alternate π-backups): VI
Asynchronous V-backups (alternate π-backups): asynchronous VI, prioritized sweeping, RTDP, ...
Unordered, alternating 1 π-backup / m V-backups: (Modified) PI
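A toy illustration of the statement (mine, not from the slides): asynchronous value iteration on a small MDP, where states are backed up in an arbitrary order yet the values still converge to V*.

import random

# Toy 3-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.1)]},
    1: {0: [(0.5, 2, 1.0), (0.5, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 0, 0.5)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}

# Asynchronous backups: states are picked in random order, but every state
# is still backed up many times, so V converges to the optimal values.
for _ in range(5000):
    s = random.choice(list(P))
    V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])

print(V)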


iATPI

Main loop(π0 or V0, s0, t0, T, Nepisodes)

loop
  ValueSet.reset(), ActionSet.reset()
  for i = 1 to Nepisodes do
    σ.reset()
    episode.reset(s0, t0)
    while t < T do
      a = episode.bestAction()
      episode.activateEvent(a)
      ((s′, t′), r) ← episode.step()
      σ.add((s, t), a, r)
      t ← t′
    (ValueSet, ActionSet).merge(convert(σ))
  Vn, C_Vn, πn, C_πn ← train(ValueSet, ActionSet)


episode.bestAction()

for a ∈ As do
  Q(a) ← 0, n ← 0
  while not enough samples for Q(a) do
    n ← n + 1
    Q(a) ← Q(a) + (1/n) (episode.rollout(a) − Q(a))
return argmax_{a ∈ A} Q(a)
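The same procedure as a self-contained Python sketch (illustrative only; a fixed sample budget stands in for the "enough samples" test, and rollout is passed in as a callable).

def best_action(actions, rollout, n_samples=30):
    """Monte-Carlo action selection: estimate Q(a) for each action by an
    incremental mean of rollout returns, then act greedily."""
    q = {}
    for a in actions:
        q[a] = 0.0
        for n in range(1, n_samples + 1):
            q[a] += (rollout(a) - q[a]) / n      # incremental mean update
    return max(q, key=q.get)

import random
print(best_action([0, 1, 2], rollout=lambda a: random.gauss(a, 1.0)))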


episode.rollout(a)

rolloutEpisode(episode)
rolloutEpisode.activateEvent(a)
((s′, t′), r) ← rolloutEpisode.step()
if C_Vn−1(s′, t′) = ⊤ then
  return r + Vn−1(s′, t′)
else
  Q ← r, s ← s′, σr ← ∅
  while rollout unfinished do
    a = πn−1(s)
    rolloutEpisode.activateEvent(a)
    ((s′, t′), r) ← rolloutEpisode.step()
    Q ← Q + r
    σr.add((s, t), r)
  Vn−1, C_Vn−1 ← incTrain(convert(σr))
  return Q


Output

A stack Πn = {(π0, C_π0), (π1, C_π1), ..., (πn, C_πn) | C_π0(s, t) = ⊤} of partial policies.


Models map

[Models map: MP → SMP → GSMP and MDP → SMDP → GSMDP, where (a) adds continuous sojourn time, (b) adds concurrency and (c) adds action choice (turning each stochastic process into a decision process); (d) adding observable time leads from SMDPs to SMDP+, TMDP and XMDP (part II) and from GSMDPs to GSMDPs with observable time (part III).]
