Page 1:

Partially Observable Markov Decision Processes (POMDPs)

Rowan McAllister and Alexandre Navarro

MLG Reading Group

02 June 2016

1 / 52

Page 2:

Overview

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

2 / 52

Page 3:

Disclaimer

Many figures (copied and edited) are from here:

1. http://cs.brown.edu/research/ai/pomdp/tutorial/index.html

2. https://www.cs.cmu.edu/~ggordon/780-fall07/lectures/POMDP_lecture.pdf

3. other places on the internet.

3 / 52

Page 4:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

4 / 52

Page 5:

Motivation of POMDPs

(a) Autonomous cars in fog (b) Spoken dialogue systems

(c) Finance (d) Reinforcement learning

5 / 52

Page 6:

Markov Model Taxonomy (as a Zoubin-square)

                     Uncontrolled                  Controlled
Observed states      Markov Chain                  MDP (Markov Decision Process)
Unobserved states    HMM (Hidden Markov Model)     Partially Observable MDP (POMDP)

6 / 52

Page 7:

Partially Observable Markov Decision Process (POMDP) [Astrom 1965, Sondik 1971]

S, set of latent states s

A, set of actions a

T(s′|s, a), the transition probability function

R(s, a) ∈ [0, 1], the reward function

γ ∈ [0, 1], a discount factor

Z, set of observations z

O(z|s′, a), the observation probability function

7 / 52

Page 9:

POMDP as a Belief-MDP
Idea: Plan in (fully-observable) belief-space.

B, set of non-latent beliefs b

A, set of actions a

τ(b, a, z) ∝ O(z|s′, a) Σ_{s∈S} T(s′|s, a) b(s), the belief transition function

R(b, a) = Σ_{s∈S} R(s, a) b(s)

O(z|b, a) = Σ_{s′∈S} O(z|s′, a) Σ_{s∈S} T(s′|s, a) b(s)

Belief-MDP Bellman Equation:

V(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} O(z|b, a) V(τ(b, a, z)) ]

Qu: how can we do value iteration in POMDPs?

8 / 52
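To make these belief-MDP quantities concrete, here is a minimal NumPy sketch. The 2-state/2-action/2-observation POMDP below and all function names are made up for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical POMDP: 2 states, 2 actions, 2 observations (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # T[a, s, s'] = T(s'|s,a)
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # O[a, s', z] = O(z|s',a)
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = R(s,a)
              [0.0, 2.0]])

def belief_reward(b, a):
    """R(b,a) = sum_s R(s,a) b(s)."""
    return R[:, a] @ b

def obs_prob(b, a, z):
    """O(z|b,a) = sum_{s'} O(z|s',a) sum_s T(s'|s,a) b(s)."""
    return O[a, :, z] @ (T[a].T @ b)

def belief_update(b, a, z):
    """tau(b,a,z) ∝ O(z|s',a) * sum_s T(s'|s,a) b(s), renormalised."""
    unnorm = O[a, :, z] * (T[a].T @ b)
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])
print(belief_reward(b, 0), obs_prob(b, 0, 1), belief_update(b, 0, 1))
```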

Page 12:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

9 / 52

Page 13:

Background on Solving POMDPs

S = {s0, s1}

A = {a1, a2}

Z = {z1, z2, z3}

Belief Simplex for 2-State POMDP (figure; x-axis: b(s1))

|S| = 2 (value iteration if MDP = easy)
|B| = ∞ (value iteration if POMDP = hard)

10 / 52

Page 16:

Initial Solutions?

Solving the value from a single (e.g. the current) belief is intractable.

Worse is how to find the value for each belief b ∈ B ... there is an infinite number of beliefs.

- Discretise the belief simplex, e.g. |B_discrete| = 10

11 / 52

Page 17:

Initial Solutions?

Even solving one belief with a lookahead tree, a constant branching factor of |A| · |Z| means we scale exponentially with the horizon.

12 / 52

Page 18:

Background on Solving POMDPs

OK, let's abandon trees. What might the value function look like?

13 / 52

Page 19:

Background on Solving POMDPs

Sondik (1971): the value function has structure, specifically it is piecewise linear and convex (PWLC).

Lines are "alpha vectors": V(b) = max_i α_i^h · b

Proof: the value iteration backup operator preserves PWLC, so if the value is PWLC at h = 1, then it is PWLC for all horizons by induction.

14 / 52

Page 22:

Background on Solving POMDPs

The intersections partition the belief simplex (into regions with a common optimal action).

15 / 52

Page 23:

POMDP Value Iteration: Horizon 1

To prove PWLC by induction, start at V^{h=1}:

Q^{h=1}(b, a_i) = R(b, a_i) = R(s0, a_i) b(s0) + R(s1, a_i) b(s1)

Let R(s0, a1) = 2, R(s1, a2) = 3, and 0 otherwise.

Q(b, a1) = 2 · b(s0) + 0 · b(s1) = [2, 0] · b
Q(b, a2) = 0 · b(s0) + 3 · b(s1) = [0, 3] · b  →  intersect at b(s1) = 0.4

V^{h=1}(b) = max_{a∈A} Q(b, a) = max_i α_i^{h=1} · b

Qu: How many α-vectors and partitions will we have in general at h=1?

16 / 52
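A tiny sketch of this horizon-1 construction, using the slide's numbers (helper names are illustrative):

```python
import numpy as np

# Horizon-1 alpha vectors from the slide's rewards:
# R(s0,a1) = 2, R(s1,a2) = 3, 0 otherwise.
alpha_a1 = np.array([2.0, 0.0])   # Q(b,a1) = [2,0]·b
alpha_a2 = np.array([0.0, 3.0])   # Q(b,a2) = [0,3]·b

def V1(b_s1):
    """V^{h=1}(b) = max_i alpha_i · b, with b = [1 - b(s1), b(s1)]."""
    b = np.array([1.0 - b_s1, b_s1])
    return max(alpha_a1 @ b, alpha_a2 @ b)

# Intersection: 2(1 - x) = 3x  =>  x = 0.4, as on the slide.
x = 2.0 / 5.0
print(x, V1(0.0), V1(x), V1(1.0))   # 0.4, 2.0, 1.2, 3.0
```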

Page 29:

POMDP Value Iteration: Horizon 2

OK, our α-vectors define V^{h=1}; how do we compute V^{h=2}?
(And if V^{k+1} is PWLC given a PWLC V^k → the inductive proof is complete.)

Break down into 3 steps:

i) value of b given a and z

ii) value of b given a

iii) value of b

17 / 52

Page 31:

POMDP Value Iteration: Horizon 2, Step i: V(b) given a and z

V*(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} O(z|b, a) V*(τ(b, a, z)) ]

where the term O(z|b, a) V*(τ(b, a, z)) is denoted S(a, z).

18 / 52

Page 32:

POMDP Value Iteration: Horizon 2, Step ii: V(b) given a

V(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} O(z|b, a) V(τ(b, a, z)) ]

where the bracketed sum is Σ_z S(a, z).

19 / 52

Page 33:

POMDP Value Iteration: Horizon 2, Step ii: V(b) given a

Each z_i will have its own transformed alpha vectors.

The transformed α-vectors are added (weighted by their probabilities).

Qu: Ignoring R(b, a), how many α-vectors and partitions do we expect at h=2?

20 / 52

Page 35:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

21 / 52

Page 36:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

22 / 52

Page 37:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

23 / 52

Page 38:

POMDP Value Iteration: Horizon 2, Step ii: V(b) given a

Note: only 4 partitions < |Z||A| = 6. Why?

24 / 52

Page 40:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

Do the same thing for a2:

25 / 52

Page 41:

POMDP Value Iteration: Horizon 2, Step iii: V (b)

Combine to partition into optimal actions at h=2:

26 / 52

Page 42:

POMDP Value Iteration: Horizon 2, Step iii: V (b)

Prune any ‘everywhere-suboptimal’ α-vectors.

Post-pruning: 1-1 mapping between regions and α-vectors.

Remaining α-vectors / regions indicate different (long-term) 'a-z-a' strategies, not necessarily different (short-term) next-actions.

27 / 52

Page 44:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

28 / 52

Page 45:

Sondik/Monahan's Enumeration (1971/1982)

Brute force: compute all α-vectors, even suboptimal ones:

Complexity: At horizon h we have |A| · |Z| · |α(h−1)| projected vectors, |α(h)| = |A| · |α(h−1)|^{|Z|} cross-sum combinations / new vectors, and time |S|² |A| · |α(h−1)|^{|Z|}.

29 / 52
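A minimal sketch of one enumeration-style backup under the same hypothetical T/O/R array layout as the earlier belief-update sketch: project every α through every (a, z), cross-sum over observations, take the union over actions, and do no pruning. All names are illustrative:

```python
import numpy as np
from itertools import product

def enumeration_backup(Gamma, T, O, R, gamma):
    """One exact backup: returns all |A| * |Gamma|^|Z| candidate alpha-vectors."""
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    new_vectors = []
    for a in range(nA):
        # Projections: proj[z][i](s) = gamma * sum_{s'} T(s'|s,a) O(z|s',a) alpha_i(s')
        proj = [[gamma * T[a] @ (O[a, :, z] * alpha) for alpha in Gamma]
                for z in range(nZ)]
        # Cross-sum: one choice of old vector per observation z.
        for choice in product(range(len(Gamma)), repeat=nZ):
            vec = R[:, a] + sum(proj[z][i] for z, i in enumerate(choice))
            new_vectors.append(vec)
    return new_vectors

# Example usage with the earlier hypothetical arrays:
# Gamma0 = [R[:, a] for a in range(T.shape[0])]        # horizon-1 vectors
# Gamma1 = enumeration_backup(Gamma0, T, O, R, gamma=0.95)
```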

Page 47:

Zhang and Liu's Incremental Pruning (1996)

Like algorithm A, but with incremental pruning. Combine all S(a, z1) and S(a, z2) vectors:

[more sophisticated versions: Cassandra 1997]

30 / 52

Page 48:

Zhang and Liu's Incremental Pruning (1996)

Then prune suboptimal vectors:

31 / 52

Page 49:

Zhang and Liu’s Incremental Pruning (1996)

Using only the remaining vectors, keep going:

32 / 52
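The pruning step can be sketched cheaply by testing which vectors are maximal somewhere on a grid of test beliefs; exact pruning uses a linear program per vector, so this grid version is only an approximate stand-in (all names are illustrative):

```python
import numpy as np

def prune(vectors, test_beliefs):
    """Keep a vector only if it is the maximiser at some test belief.
    Exact pruning would solve an LP per vector; a belief grid is a cheap stand-in."""
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    keep = set()
    for b in test_beliefs:
        vals = [v @ b for v in vecs]
        keep.add(int(np.argmax(vals)))
    return [vecs[i] for i in sorted(keep)]

# 2-state example: beliefs parameterised by b(s1) on a grid.
grid = [np.array([1 - x, x]) for x in np.linspace(0, 1, 101)]
vectors = [np.array([2.0, 0.0]), np.array([0.0, 3.0]), np.array([0.5, 0.5])]
print(prune(vectors, grid))   # the [0.5, 0.5] vector is everywhere-suboptimal and dropped
```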

Page 50:

Cheng's Linear Support (1988)

Pick a belief-point from a stack, generate its α-vector, and check each region-vertex to see if it gives the same vector. If not, add that vertex to the stack.

Start: pick (say) the midpoint of the belief simplex and compute its α-vector:

Biggest V-error is at one of the untested (black) vertexes.

33 / 52

Page 54:

Cheng’s Linear Support (1988)

So we evaluate vertex b(s1 = 0), giving us a new vector:

Note, by evaluating vertexes, we target areas of high V-error.

34 / 52

Page 55:

Cheng’s Linear Support (1988)

35 / 52

Page 56:

Cheng’s Linear Support (1988)

36 / 52

Page 57:

Cheng’s Linear Support (1988)

Done. No more points left on stack.

37 / 52

Page 58:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

38 / 52

Page 59:

Online POMDPs, Lower Bounds

Plan online, from the current belief state.

Hauskrecht/Smith's Blind Policy (2000/2005): the same action is always chosen, so only |V| = |A|.

α_a(s) = R(s, a) + γ Σ_{s′∈S} T(s′|s, a) α′_a(s′)

39 / 52

Page 60:

Online POMDPs, Upper Bounds

Littman's V_MDP (1995):

V(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} T(s′|s, a) V(s′) ]

V_MDP(b) = Σ_{s∈S} V(s) b(s) = V · b    (|V| = 1)

Littman's Q_MDP (1995):

Q(s, a) = R(s, a) + γ Σ_{s′∈S} T(s′|s, a) V(s′)

Q_MDP(b, a) = Σ_{s∈S} Q(s, a) b(s) = Q_a · b    (|V| = |A|)

Hauskrecht's Fast Informed Bound (FIB) (2000), time |A||S|²|Z||V′|:

α_a(s) = R(s, a) + γ Σ_{z∈Z} max_{α′∈Γ′} Σ_{s′∈S} T(s′|s, a) O(z|s′, a) α′(s′)    (a max over α′, as opposed to the cross-sum ⊕)

V(b) ≤ V_FIB(b) ≤ V_QMDP(b) ≤ V_VMDP(b)    ∀b ∈ B

40 / 52
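A sketch of these upper bounds under the same hypothetical T/O/R layout as earlier: solve the underlying MDP, form the V_MDP / Q_MDP vectors, and iterate the FIB fixed point (all names are illustrative):

```python
import numpy as np

def mdp_value_iteration(T, R, gamma, iters=500):
    """V(s) = max_a [ R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s') ]."""
    nA, nS, _ = T.shape
    V = np.zeros(nS)
    for _ in range(iters):
        V = np.max([R[:, a] + gamma * T[a] @ V for a in range(nA)], axis=0)
    return V

def qmdp_vectors(T, R, gamma):
    """One alpha-vector per action: Q(s,a) = R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s')."""
    V = mdp_value_iteration(T, R, gamma)
    return np.stack([R[:, a] + gamma * T[a] @ V for a in range(T.shape[0])])

def fib_vectors(T, O, R, gamma, iters=500):
    """alpha_a(s) = R(s,a) + gamma * sum_z max_{a'} sum_{s'} T(s'|s,a) O(z|s',a) alpha_{a'}(s')."""
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    alpha = np.zeros((nA, nS))
    for _ in range(iters):
        new = np.empty_like(alpha)
        for a in range(nA):
            acc = R[:, a].astype(float)
            for z in range(nZ):
                # cands[a', s] = sum_{s'} T(s'|s,a) O(z|s',a) alpha_{a'}(s')
                cands = np.stack([T[a] @ (O[a, :, z] * alpha[ap]) for ap in range(nA)])
                acc = acc + gamma * cands.max(axis=0)   # best previous vector per state s
            new[a] = acc
        alpha = new
    return alpha

# Upper bounds at a belief b (each tighter than the one before):
#   V_MDP(b)  = mdp_value_iteration(T, R, g) @ b
#   V_QMDP(b) = max_a qmdp_vectors(T, R, g)[a] @ b
#   V_FIB(b)  = max_a fib_vectors(T, O, R, g)[a] @ b
```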

Page 61:

Comparison of value-function updates [Hauskrecht 2000]:

exact update:
V_{i+1}(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{z∈Z} max_{α∈Γ_i} Σ_{s′∈S} Σ_{s∈S} T(s′|s, a) O(z|s′, a) b(s) α(s′) ]

fast informed bound update:
V_{i+1}(b) = max_{a∈A} Σ_{s∈S} b(s) [ R(s, a) + γ Σ_{z∈Z} max_{α∈Γ_i} Σ_{s′∈S} T(s′|s, a) O(z|s′, a) α(s′) ]

QMDP approx. update:
V_{i+1}(b) = max_{a∈A} Σ_{s∈S} b(s) [ R(s, a) + γ Σ_{s′∈S} T(s′|s, a) max_{α∈Γ_i} α(s′) ]

MDP approx. update:
V_{i+1}(b) = Σ_{s∈S} b(s) max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} T(s′|s, a) max_{α∈Γ_i} α(s′) ]

UMDP update (observations ignored; a lower bound rather than an upper bound):
V_{i+1}(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ max_{α∈Γ_i} Σ_{s′∈S} Σ_{s∈S} T(s′|s, a) b(s) α(s′) ]

Each approximate update (FIB, QMDP, MDP) upper-bounds the exact one, in that order.

[Hauskrecht 2000] 41 / 52

Page 62:

Online POMDPs: Ways to Expand a Tree [Ross 2008]

(figure: a lookahead tree rooted at b0, branching on actions a1, a2 and observations z1, z2; each belief node carries a [lower, upper] bound interval, e.g. b0 has [14.4, 18.7]; action edges carry rewards and observation edges carry probabilities)

[Ross 2008]    time: (|A||Z|)^D |S|²

42 / 52

Page 63:

Online POMDPs: Ways to Expand a Tree [Ross 2008]

Branch-and-Bound Pruning: Backup upper and lower bounds from leaves to root. Prune nodes known to be suboptimal. (Paquet's Real-Time Belief Space Search (2005-6)).

MC: use a generative model to sample observations, for a sparse but deep tree. (McAllester (1999), Bertsekas (1999))

Heuristic Search: expand the leaf node of greatest heuristic value:

e(b_leaf, b0) ≈ γ^{d(b_leaf, b0)} Pr(b_leaf | b0, π) (U(b_leaf) − L(b_leaf))

(Satia 1973, Washington's BI-POMDP 1997, Ross's AEMS 2007, Smith's HSVI 2004)

(figure: the same lookahead tree as the previous slide)    time: (|A||Z|)^D |S|²

43 / 52
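A small sketch of the heuristic leaf choice above: score each fringe belief by its discounted reachability times its bound gap and expand the argmax. The leaf tuples and numbers below are made up for illustration:

```python
def best_leaf_to_expand(leaves, gamma):
    """leaves: list of (depth, prob_of_reaching_leaf, lower_bound, upper_bound) tuples.
    Score: e(b_leaf) ≈ gamma^depth * Pr(b_leaf | b0, policy) * (U(b_leaf) - L(b_leaf))."""
    def score(leaf):
        depth, prob, lower, upper = leaf
        return (gamma ** depth) * prob * (upper - lower)
    return max(range(len(leaves)), key=lambda i: score(leaves[i]))

# Illustrative fringe: expand the leaf with the largest weighted bound gap.
fringe = [(2, 0.42, 9.0, 15.0), (2, 0.28, 6.0, 14.0), (1, 0.50, 12.0, 18.7)]
print(best_leaf_to_expand(fringe, gamma=0.95))   # -> 2
```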

Page 64:

Point-Based Methods

1. Use a finite set of representative belief-points b_i ∈ B ⊂ ∆ (the belief simplex).

2. Backup belief-point values V(b_i) and gradients dV(b_i)/db_i = α_i.

3. To evaluate any other belief b ∈ ∆ \ B, use interpolation or the lower bound that the α-vectors provide.

44 / 52

Page 65:

Grid-Based Belief-Value Interpolation

- Lovejoy, 1991
- Brafman, 1997
- Hauskrecht, 2000
- Zhou, 2001
- Bonet, 2002

45 / 52

Page 66:

Pineau's Point Based Value Iteration (PBVI) (2003)

- Maintain one value and one gradient per belief-point.
- An 'anytime algorithm': trade off time with accuracy.
- When all b_i are fully backed-up, add new belief-points b′_i to B using stochastic forward simulation. Pros:
  - more-probable belief points added
  - no 'unreachable' belief points added
- Only include the b′_i furthest (L1 or L2 distance) from B.

(figure: α-vectors α0, α1, α2 over belief points b0, b1, b2, b3; V = {α0, α1, α2})

46 / 52

Page 67:

Pineau's Point Based Value Iteration (PBVI) (2003)

(figure: α-vectors α0, α1, α2 over belief points b0, b1, b2, b3; V = {α0, α1, α2})

For an exact V = HV′ backup, we first create |A||Z||V′| projections:

Γ^{a,r} ← α^{a,r}(s) = R(s, a)

Γ^{a,z} ← α^{a,z}_i(s) = γ Σ_{s′∈S} T(s′|s, a) O(z|s′, a) α′_i(s′),  ∀α′_i ∈ V′

Sondik's Full Enumeration:

Γ^a = Γ^{a,r} ⊕ Γ^{a,z1} ⊕ Γ^{a,z2} ⊕ ...,  V = ∪_{a∈A} Γ^a

Space: |V| = |A||V′|^{|Z|};  Time: |S|²|A||V′|^{|Z|}

PBVI:

Γ^{a,b} = Γ^{a,r} + Σ_{z∈Z} argmax_{α∈Γ^{a,z}} (α · b)

V ← argmax_{Γ^{a,b}, a∈A} (Γ^{a,b} · b),  ∀b ∈ B

Space: |V| = |B|;  Time: |S||A||V′||Z||B|

47 / 52
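A sketch of the PBVI backup just described, with the same hypothetical T/O/R layout as before: one projection per (a, z), keep the projection that is best at b, and return the best resulting vector at b (all names are illustrative):

```python
import numpy as np

def backup_point(b, Gamma, T, O, R, gamma):
    """PBVI backup at one belief b:
    best over a of [ Gamma^{a,r} + sum_z argmax_{alpha in Gamma^{a,z}} alpha·b ]."""
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(nA):
        vec = R[:, a].astype(float)                   # Gamma^{a,r}
        for z in range(nZ):
            # Gamma^{a,z}: project every old alpha, keep the one best at b.
            proj = [gamma * T[a] @ (O[a, :, z] * alpha) for alpha in Gamma]
            vec = vec + max(proj, key=lambda g: g @ b)
        if vec @ b > best_val:
            best_vec, best_val = vec, vec @ b
    return best_vec

def pbvi_backup_stage(B, Gamma, T, O, R, gamma):
    """One PBVI stage: one new alpha-vector per belief point in B."""
    return [backup_point(b, Gamma, T, O, R, gamma) for b in B]
```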

Page 68:

Spaan's Perseus/Random-PBVI (2005)

Backup a (random) subset B̃ ⊂ B to improve V(b_i) ∀b_i ∈ B.
Note: the backup operator H is linear in |V|, and |V| = |B̃| ≪ |B|.

(figure and text excerpted from Spaan & Vlassis)

Figure 1: Example of a Perseus backup stage in a two state POMDP. The belief space is depicted on the x-axis and the y-axis represents V(b). Solid lines are α^i_n vectors from the current stage n and dashed lines are α^i_{n−1} vectors from the previous stage. We operate on a B of 7 beliefs, indicated by the tick marks. The backup stage computing V_{n+1} from V_n proceeds as follows: (a) value function at stage n; (b) start computing V_{n+1} by sampling b6, add α = backup(b6) to V_{n+1} which improves the value of b6 and b7; (c) sample b3 from {b1, . . . , b5}, add backup(b3) to V_{n+1} which improves b1 through b5; and (d) the value of all b ∈ B has improved, the backup stage is finished.

3.2 Discussion

The key observation underlying the Perseus algorithm is that when a belief b is backed up, the resulting vector improves not only V(b) but often also the value of many other belief points in B. This results in value functions with a relatively small number of vectors (as compared to, e.g., Poon, 2001; Pineau et al., 2003). Experiments show indeed that the number of vectors grows modestly with the number of backup stages (|V_n| ≪ |B|). In practice this means that we can afford to use a much larger B than other point-based methods, which has a positive effect on the approximation accuracy as dictated by the bounds of Pineau et al. (2003). Furthermore, compared with other methods that build the set B based on various heuristics (Pineau et al., 2003; Smith & Simmons, 2004), our build-up of B is cheap as it only requires sampling random trajectories starting from b0. Moreover, duplicate entries in B will only affect the probability that a particular b will be sampled in the value update stages, but not the size of V_n.

48 / 52

Page 69:

Spaan's Perseus/Random-PBVI (2005)

(further excerpt from "Perseus: Randomized Point-based Value Iteration for POMDPs", Spaan & Vlassis)

... backup stage, given a value function V_n, we compute a value function V_{n+1} that improves the value of all b ∈ B, i.e., we build a value function V_{n+1} = H̃_Perseus V_n that upper bounds V_n over B (but not necessarily over ∆, which would require linear programming):

V_n(b) ≤ V_{n+1}(b), for all b ∈ B.    (16)

We first let the agent randomly explore the environment and collect a set B of reachable belief points, which remains fixed throughout the complete algorithm. We initialize the value function V_0 as a single vector with all its components equal to (1/(1−γ)) min_{s,a} r(s, a) (Zhang & Zhang, 2001). Starting with V_0, Perseus performs a number of backup stages until some convergence criterion is met. Each backup stage is defined as follows (where B̃ is an auxiliary set containing the non-improved points):

Perseus backup stage: V_{n+1} = H̃_Perseus V_n

1. Set V_{n+1} = ∅. Initialize B̃ to B.
2. Sample a belief point b uniformly at random from B̃ and compute α = backup(b).
3. If b · α ≥ V_n(b) then add α to V_{n+1}, otherwise add α′ = argmax_{{α^i_n}_i} b · α^i_n to V_{n+1}.
4. Compute B̃ = {b ∈ B : V_{n+1}(b) < V_n(b)}.
5. If B̃ = ∅ then stop, else go to 2.

Often, a small number of vectors will be sufficient to improve V_n(b) ∀b ∈ B, especially in the first steps of value iteration. The idea is to compute these vectors in a randomized greedy manner by sampling from B̃, an increasingly smaller subset of B. We keep track of the set of non-improved points B̃ consisting of those b ∈ B whose new value V_{n+1}(b) is still lower than V_n(b). At the start of each backup stage, V_{n+1} is set to ∅ which means B̃ is initialized to B, indicating that all b ∈ B still need to be improved in this backup stage. As long as B̃ is not empty, we sample a point b from B̃ and compute α = backup(b). If α improves the value of b (i.e., if b · α ≥ V_n(b) in step 3), we add α to V_{n+1} and update V_{n+1}(b) for all b ∈ B by computing their inner product with the new α. The hope is that α improves the value of many other points in B, and all these points are removed from B̃. As long as B̃ is not empty we sample belief points from it and add their α vectors.

To ensure termination of each backup stage we have to enforce that B̃ shrinks when adding vectors, i.e., that each α actually improves at least the value of the b that generated it. If not (i.e., b · α < V_n(b) in step 3), we ignore α and insert a copy of the maximizing vector of b from V_n in V_{n+1}. Point b is now considered improved and is removed from B̃ in step 4, together with any other belief points which had the same vector as maximizing one in V_n. This procedure ensures that B̃ shrinks and the backup stage will terminate. A pictorial example of a backup stage is presented in Fig. 1.

Perseus performs backup stages until some convergence criterion is met. For point-based methods several convergence criteria can be considered; one could for instance bound the difference between successive value function estimates max_{b∈B}(V_{n+1}(b) − V_n(b)). Another option would be to track the number of policy changes: the number of b ∈ B which had a different optimal action in V_n compared to V_{n+1} (Lovejoy, 1991).

49 / 52
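The backup stage above maps almost line-for-line onto code. A sketch, where backup_point is assumed to be any function mapping (belief, current α-vectors) to a new α-vector (e.g. a wrapper around the PBVI sketch earlier):

```python
import numpy as np

def perseus_backup_stage(B, V_n, backup_point, rng=None):
    """One Perseus backup stage: improve V(b) for all b in B with few backups.
    B: list of belief arrays; V_n: list of alpha-vectors (np arrays);
    backup_point(b, vectors) -> one new alpha-vector."""
    rng = np.random.default_rng() if rng is None else rng

    def value(b, vecs):
        return max(v @ b for v in vecs)

    V_next = []
    B_tilde = list(range(len(B)))                    # indices of not-yet-improved points
    while B_tilde:
        i = B_tilde[rng.integers(len(B_tilde))]      # step 2: sample b from B_tilde
        alpha = backup_point(B[i], V_n)
        if alpha @ B[i] >= value(B[i], V_n):         # step 3: keep alpha if it improves b...
            V_next.append(alpha)
        else:                                        # ...otherwise copy b's old maximiser
            V_next.append(max(V_n, key=lambda v: v @ B[i]))
        # step 4: only points whose value has not yet improved stay in B_tilde
        B_tilde = [j for j in B_tilde if value(B[j], V_next) < value(B[j], V_n)]
    return V_next
```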

Page 70:

Spaan's Perseus/Random-PBVI (2005)

(tables excerpted from "Perseus: Randomized Point-based Value Iteration for POMDPs")

(a) Results for Tiger-grid:
    Method     R        |π|     T
    HSVI       2.35     4860    10341
    Perseus    2.34     134     104
    PBUA       2.30     660     12116
    PBVI       2.25     470     3448
    BPI w/b    2.22     120     1000
    Grid       0.94     174     n.a.
    QMDP       0.23     n.a.    2.76

(b) Results for Hallway:
    Method     R        |π|     T
    PBVI       0.53     86      288
    PBUA       0.53     300     450
    HSVI       0.52     1341    10836
    Perseus    0.51     55      35
    BPI w/b    0.51     43      185
    QMDP       0.27     n.a.    1.34

(c) Results for Hallway2:
    Method     R        |π|     T
    Perseus    0.35     56      10
    HSVI       0.35     1571    10010
    PBUA       0.35     1840    27898
    PBVI       0.34     95      360
    BPI w/b    0.32     60      790
    QMDP       0.09     n.a.    2.23

(d) Results for Tag:
    Method     R        |π|     T
    Perseus    −6.17    280     1670
    HSVI       −6.37    1657    10113
    BPI w/b    −6.65    17      250
    BBSLS      ≈ −8.3   30      105
    BPI n/b    −9.18    940     59772
    PBVI       −9.18    1334    180880
    QMDP       −16.9    n.a.    16.1

Table 2: Experimental comparisons of Perseus with other algorithms. Perseus results are averaged over 10 runs. Each table lists the method, the average expected discounted reward R, the size of the solution |π| (value function or controller size), and the time T (in seconds) used to compute the solution. Sources: PBVI (Pineau et al., 2003), BPI no bias (Poupart & Boutilier, 2004), BPI with bias (Poupart, 2005), HSVI (Smith & Simmons, 2004), Grid (Brafman, 1997), PBUA (Poon, 2001), and BBSLS (Braziunas & Boutilier, 2004) (approximate, read from figure).

... on the reachable belief space by incorporating the initial belief, which dramatically increases its performance in solution size and computation time, but it does not reach the control quality of Perseus.

5.2 Continuous Action Spaces

We applied Perseus in two domains with continuous action spaces: an agent equipped with proximity sensors moving at a continuous heading and distance, and a navigation task involving a mobile robot with omnidirectional vision in a perceptually aliased office environment.

50 / 52

Page 71:

Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

- Some (Shani 2008, 2012) argue that Perseus's use of random belief-point selection is bad, being unguided.

- HSVI is similar to PBVI but instead guides belief-point selection using a tree and a heuristic (mentioned above). It descends the tree via action nodes of greatest upper bound, and observation nodes of greatest difference in upper and lower bounds (weighted by the observation probability).

- This guiding is akin to prioritised value iteration in MDPs.

51 / 52
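A sketch of the HSVI-style descent rule in the bullet above, assuming upper/lower bound functions U and L (e.g. FIB-style and blind-policy-style bounds) are supplied by the caller; all names are illustrative:

```python
def hsvi_choose_child(b, actions, observations, upper_Q, obs_prob, belief_update, U, L):
    """One step of HSVI's guided descent.
    upper_Q(b, a): upper bound on Q(b, a);  obs_prob(b, a, z): O(z|b, a);
    belief_update(b, a, z): tau(b, a, z);   U, L: upper/lower bounds on V(b)."""
    # Action node: pick the action with the greatest upper bound.
    a_star = max(actions, key=lambda a: upper_Q(b, a))

    # Observation node: pick the observation with the greatest
    # probability-weighted gap between the bounds at the child belief.
    def weighted_gap(z):
        child = belief_update(b, a_star, z)
        return obs_prob(b, a_star, z) * (U(child) - L(child))

    z_star = max(observations, key=weighted_gap)
    return a_star, z_star, belief_update(b, a_star, z_star)
```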

Page 72:

Advanced POMDP Topics

- Reinforcement Learning as a POMDP (Duff 2002, Poupart 2006)
- Learning and planning in POMDPs (Ross 2011)
- Continuous POMDPs (Porta 2006)

52 / 52

