Page 1:

Partially Observable Markov Decision Processes (POMDPs)

Rowan McAllister and Alexandre Navarro

MLG Reading Group

02 June 2016

1 / 52

Page 2:

Overview

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

2 / 52

Page 3:

Disclaimer

Many figures (copied and edited) are from here:

1. http://cs.brown.edu/research/ai/pomdp/tutorial/index.html

2. https://www.cs.cmu.edu/~ggordon/780-fall07/lectures/POMDP_lecture.pdf

3. other places on the internet.

3 / 52

Page 4:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

4 / 52

Page 5:

Motivation of POMDPs

(a) Autonomous cars in fog (b) Spoken dialogue systems

(c) Finance (d) Reinforcement learning

5 / 52

Page 6:

Markov Model Taxonomy (as a Zoubin-square)

                     Uncontrolled                  Controlled
Observed states      Markov Chain                  MDP (Markov Decision Process)
Unobserved states    HMM (Hidden Markov Model)     Partially Observable MDP (POMDP)

6 / 52

Page 7:

Partially Observable Markov Decision Process (POMDP) [Astrom 1965, Sondik 1971]

S, set of latent states s

A, set of actions a

T(s′|s, a), the transition probability function

R(s, a) ∈ [0, 1], the reward function

γ ∈ [0, 1], a discount factor

Z, set of observations z

O(z|s′, a), the observation probability function

7 / 52

Page 9:

POMDP as a Belief-MDP
Idea: Plan in (fully-observable) belief-space.

B, set of non-latent beliefs b

A, set of actions a

τ(b, a, z) ∝ O(z|s′, a) Σ_{s∈S} T(s′|s, a) b(s), the belief transition function

R(b, a) = Σ_{s∈S} R(s, a) b(s)

O(z|b, a) = Σ_{s′∈S} O(z|s′, a) Σ_{s∈S} T(s′|s, a) b(s)

Belief-MDP Bellman Equation:

V(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} O(z|b, a) V(τ(b, a, z)) ]

Qu: how can we do value iteration in POMDPs?

8 / 52
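To make these belief-MDP quantities concrete, here is a minimal NumPy sketch. The 2-state/2-action/2-observation POMDP below and all function names are made up for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical POMDP: 2 states, 2 actions, 2 observations (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # T[a, s, s'] = T(s'|s,a)
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # O[a, s', z] = O(z|s',a)
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = R(s,a)
              [0.0, 2.0]])

def belief_reward(b, a):
    """R(b,a) = sum_s R(s,a) b(s)."""
    return R[:, a] @ b

def obs_prob(b, a, z):
    """O(z|b,a) = sum_{s'} O(z|s',a) sum_s T(s'|s,a) b(s)."""
    return O[a, :, z] @ (T[a].T @ b)

def belief_update(b, a, z):
    """tau(b,a,z) ∝ O(z|s',a) * sum_s T(s'|s,a) b(s), renormalised."""
    unnorm = O[a, :, z] * (T[a].T @ b)
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])
print(belief_reward(b, 0), obs_prob(b, 0, 1), belief_update(b, 0, 1))
```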

Page 12:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

9 / 52

Page 13:

Background on Solving POMDPs

S = {s0, s1}

A = {a1, a2}

Z = {z1, z2, z3}

Belief Simplex for 2-State POMDP (figure; x-axis: b(s1))

|S| = 2 (value iteration if MDP = easy)
|B| = ∞ (value iteration if POMDP = hard)

10 / 52

Page 16:

Initial Solutions?

Solving the value from a single (e.g. the current) belief is intractable.

Worse is how to find the value for each belief b ∈ B ... there is an infinite number of beliefs.

- Discretise the belief simplex, e.g. |B_discrete| = 10

11 / 52

Page 17:

Initial Solutions?

Even solving one belief with a lookahead tree, a constant branching factor of |A| · |Z| means we scale exponentially with the horizon.

12 / 52

Page 18:

Background on Solving POMDPs

OK, let's abandon trees. What might the value function look like?

13 / 52

Page 19:

Background on Solving POMDPs

Sondik (1971): the value function has structure, specifically it is piecewise linear and convex (PWLC).

Lines are "alpha vectors": V(b) = max_i α_i^h · b

Proof: the value iteration backup operator preserves PWLC, so if the value is PWLC at h = 1, then it is PWLC for all horizons by induction.

14 / 52

Page 22:

Background on Solving POMDPs

The intersections partition the belief simplex (into regions with a common optimal action).

15 / 52

Page 23:

POMDP Value Iteration: Horizon 1

To prove PWLC by induction, start at V^{h=1}:

Q^{h=1}(b, a_i) = R(b, a_i) = R(s0, a_i) b(s0) + R(s1, a_i) b(s1)

Let R(s0, a1) = 2, R(s1, a2) = 3, and 0 otherwise.

Q(b, a1) = 2 · b(s0) + 0 · b(s1) = [2, 0] · b
Q(b, a2) = 0 · b(s0) + 3 · b(s1) = [0, 3] · b  →  intersect at b(s1) = 0.4

V^{h=1}(b) = max_{a∈A} Q(b, a) = max_i α_i^{h=1} · b

Qu: How many α-vectors and partitions will we have in general at h=1?

16 / 52
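A tiny sketch of this horizon-1 construction, using the slide's numbers (helper names are illustrative):

```python
import numpy as np

# Horizon-1 alpha vectors from the slide's rewards:
# R(s0,a1) = 2, R(s1,a2) = 3, 0 otherwise.
alpha_a1 = np.array([2.0, 0.0])   # Q(b,a1) = [2,0]·b
alpha_a2 = np.array([0.0, 3.0])   # Q(b,a2) = [0,3]·b

def V1(b_s1):
    """V^{h=1}(b) = max_i alpha_i · b, with b = [1 - b(s1), b(s1)]."""
    b = np.array([1.0 - b_s1, b_s1])
    return max(alpha_a1 @ b, alpha_a2 @ b)

# Intersection: 2(1 - x) = 3x  =>  x = 0.4, as on the slide.
x = 2.0 / 5.0
print(x, V1(0.0), V1(x), V1(1.0))   # 0.4, 2.0, 1.2, 3.0
```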

Page 29:

POMDP Value Iteration: Horizon 2

OK, our α-vectors define V^{h=1}; how do we compute V^{h=2}?
(And if V^{k+1} is PWLC given a PWLC V^k → the inductive proof is complete.)

Break down into 3 steps:

i) value of b given a and z

ii) value of b given a

iii) value of b

17 / 52

Page 31:

POMDP Value Iteration: Horizon 2, Step i: V(b) given a and z

V*(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} O(z|b, a) V*(τ(b, a, z)) ]

where the term O(z|b, a) V*(τ(b, a, z)) is denoted S(a, z).

18 / 52

Page 32:

POMDP Value Iteration: Horizon 2, Step ii: V(b) given a

V(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} O(z|b, a) V(τ(b, a, z)) ]

where the bracketed sum is Σ_z S(a, z).

19 / 52

Page 33:

POMDP Value Iteration: Horizon 2, Step ii: V(b) given a

Each z_i will have its own transformed alpha vectors.

The transformed α-vectors are added (weighted by their probabilities).

Qu: Ignoring R(b, a), how many α-vectors and partitions do we expect at h=2?

20 / 52

Page 35:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

21 / 52

Page 36:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

22 / 52

Page 37:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

23 / 52

Page 38:

POMDP Value Iteration: Horizon 2, Step ii: V(b) given a

Note: only 4 partitions < |Z||A| = 6. Why?

24 / 52

Page 40:

POMDP Value Iteration: Horizon 2, Step ii: V (b) given a

Do the same thing for a2:

25 / 52

Page 41:

POMDP Value Iteration: Horizon 2, Step iii: V (b)

Combine to partition into optimal actions at h=2:

26 / 52

Page 42:

POMDP Value Iteration: Horizon 2, Step iii: V (b)

Prune any ‘everywhere-suboptimal’ α-vectors.

Post-pruning: 1-1 mapping between regions and α-vectors.

Remaining α-vectors / regions indicate different (long-term) 'a-z-a' strategies, not necessarily different (short-term) next-actions.

27 / 52

Page 44:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

28 / 52

Page 45:

Sondik/Monahan's Enumeration (1971/1982)

Brute force: compute all α-vectors, even suboptimal ones:

Complexity: At horizon h we have |A| · |Z| · |α(h−1)| projected vectors, |α(h)| = |A| · |α(h−1)|^{|Z|} cross-sum combinations / new vectors, and time |S|² |A| · |α(h−1)|^{|Z|}.

29 / 52
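A minimal sketch of one enumeration-style backup under the same hypothetical T/O/R array layout as the earlier belief-update sketch: project every α through every (a, z), cross-sum over observations, take the union over actions, and do no pruning. All names are illustrative:

```python
import numpy as np
from itertools import product

def enumeration_backup(Gamma, T, O, R, gamma):
    """One exact backup: returns all |A| * |Gamma|^|Z| candidate alpha-vectors."""
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    new_vectors = []
    for a in range(nA):
        # Projections: proj[z][i](s) = gamma * sum_{s'} T(s'|s,a) O(z|s',a) alpha_i(s')
        proj = [[gamma * T[a] @ (O[a, :, z] * alpha) for alpha in Gamma]
                for z in range(nZ)]
        # Cross-sum: one choice of old vector per observation z.
        for choice in product(range(len(Gamma)), repeat=nZ):
            vec = R[:, a] + sum(proj[z][i] for z, i in enumerate(choice))
            new_vectors.append(vec)
    return new_vectors

# Example usage with the earlier hypothetical arrays:
# Gamma0 = [R[:, a] for a in range(T.shape[0])]        # horizon-1 vectors
# Gamma1 = enumeration_backup(Gamma0, T, O, R, gamma=0.95)
```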

Page 47:

Zhang and Liu's Incremental Pruning (1996)

Like algorithm A, but with incremental pruning. Combine all S(a, z1) and S(a, z2) vectors:

[more sophisticated versions: Cassandra 1997]

30 / 52

Page 48:

Zhang and Liu's Incremental Pruning (1996)

Then prune suboptimal vectors:

31 / 52

Page 49:

Zhang and Liu’s Incremental Pruning (1996)

Using only the remaining vectors, keep going:

32 / 52
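The pruning step can be sketched cheaply by testing which vectors are maximal somewhere on a grid of test beliefs; exact pruning uses a linear program per vector, so this grid version is only an approximate stand-in (all names are illustrative):

```python
import numpy as np

def prune(vectors, test_beliefs):
    """Keep a vector only if it is the maximiser at some test belief.
    Exact pruning would solve an LP per vector; a belief grid is a cheap stand-in."""
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    keep = set()
    for b in test_beliefs:
        vals = [v @ b for v in vecs]
        keep.add(int(np.argmax(vals)))
    return [vecs[i] for i in sorted(keep)]

# 2-state example: beliefs parameterised by b(s1) on a grid.
grid = [np.array([1 - x, x]) for x in np.linspace(0, 1, 101)]
vectors = [np.array([2.0, 0.0]), np.array([0.0, 3.0]), np.array([0.5, 0.5])]
print(prune(vectors, grid))   # the [0.5, 0.5] vector is everywhere-suboptimal and dropped
```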

Page 50:

Cheng's Linear Support (1988)

Pick a belief-point from a stack, generate its α-vector, and check each region-vertex to see if it gives the same vector. If not, add that vertex to the stack.

Start: pick (say) the midpoint of the belief simplex and compute its α-vector:

Biggest V-error is at one of the untested (black) vertexes.

33 / 52

Page 54:

Cheng’s Linear Support (1988)

So we evaluate vertex b(s1 = 0), giving us a new vector:

Note, by evaluating vertexes, we target areas of high V-error.

34 / 52

Page 55:

Cheng’s Linear Support (1988)

35 / 52

Page 56:

Cheng’s Linear Support (1988)

36 / 52

Page 57:

Cheng’s Linear Support (1988)

Done. No more points left on stack.

37 / 52

Page 58:

Talk Outline

What are POMDPs?

Background on Solving POMDPs

Algorithms to Solve Small POMDPs Exactly
- Sondik/Monahan's Enumeration (1971/1982)
- Zhang and Liu's Incremental Pruning (1996)
- Cheng's Linear Support (1988)

Algorithms to Solve Large POMDPs Approximately
- Very-Approximate Online POMDPs
- Pineau's Point Based Value Iteration (2003)
- Spaan's Perseus/Random-PBVI (2005)
- Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

38 / 52

Page 59:

Online POMDPs, Lower Bounds

Plan online, from the current belief state.

Hauskrecht/Smith's Blind Policy (2000/2005): the same action is always chosen, so only |V| = |A|.

α_a(s) = R(s, a) + γ Σ_{s′∈S} T(s′|s, a) α′_a(s′)

39 / 52

Page 60:

Online POMDPs, Upper Bounds

Littman's V_MDP (1995):

V(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} T(s′|s, a) V(s′) ]

V_MDP(b) = Σ_{s∈S} V(s) b(s) = V · b    (|V| = 1)

Littman's Q_MDP (1995):

Q(s, a) = R(s, a) + γ Σ_{s′∈S} T(s′|s, a) V(s′)

Q_MDP(b, a) = Σ_{s∈S} Q(s, a) b(s) = Q_a · b    (|V| = |A|)

Hauskrecht's Fast Informed Bound (FIB) (2000), time |A||S|²|Z||V′|:

α_a(s) = R(s, a) + γ Σ_{z∈Z} max_{α′∈Γ′} Σ_{s′∈S} T(s′|s, a) O(z|s′, a) α′(s′)    (a max over α′, as opposed to the cross-sum ⊕)

V(b) ≤ V_FIB(b) ≤ V_QMDP(b) ≤ V_VMDP(b)    ∀b ∈ B

40 / 52
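A sketch of these upper bounds under the same hypothetical T/O/R layout as earlier: solve the underlying MDP, form the V_MDP / Q_MDP vectors, and iterate the FIB fixed point (all names are illustrative):

```python
import numpy as np

def mdp_value_iteration(T, R, gamma, iters=500):
    """V(s) = max_a [ R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s') ]."""
    nA, nS, _ = T.shape
    V = np.zeros(nS)
    for _ in range(iters):
        V = np.max([R[:, a] + gamma * T[a] @ V for a in range(nA)], axis=0)
    return V

def qmdp_vectors(T, R, gamma):
    """One alpha-vector per action: Q(s,a) = R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s')."""
    V = mdp_value_iteration(T, R, gamma)
    return np.stack([R[:, a] + gamma * T[a] @ V for a in range(T.shape[0])])

def fib_vectors(T, O, R, gamma, iters=500):
    """alpha_a(s) = R(s,a) + gamma * sum_z max_{a'} sum_{s'} T(s'|s,a) O(z|s',a) alpha_{a'}(s')."""
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    alpha = np.zeros((nA, nS))
    for _ in range(iters):
        new = np.empty_like(alpha)
        for a in range(nA):
            acc = R[:, a].astype(float)
            for z in range(nZ):
                # cands[a', s] = sum_{s'} T(s'|s,a) O(z|s',a) alpha_{a'}(s')
                cands = np.stack([T[a] @ (O[a, :, z] * alpha[ap]) for ap in range(nA)])
                acc = acc + gamma * cands.max(axis=0)   # best previous vector per state s
            new[a] = acc
        alpha = new
    return alpha

# Upper bounds at a belief b (each tighter than the one before):
#   V_MDP(b)  = mdp_value_iteration(T, R, g) @ b
#   V_QMDP(b) = max_a qmdp_vectors(T, R, g)[a] @ b
#   V_FIB(b)  = max_a fib_vectors(T, O, R, g)[a] @ b
```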

Page 61:

Comparison of value-function updates [Hauskrecht 2000]:

exact update:
V_{i+1}(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{z∈Z} max_{α∈Γ_i} Σ_{s′∈S} Σ_{s∈S} T(s′|s, a) O(z|s′, a) b(s) α(s′) ]

fast informed bound update:
V_{i+1}(b) = max_{a∈A} Σ_{s∈S} b(s) [ R(s, a) + γ Σ_{z∈Z} max_{α∈Γ_i} Σ_{s′∈S} T(s′|s, a) O(z|s′, a) α(s′) ]

QMDP approx. update:
V_{i+1}(b) = max_{a∈A} Σ_{s∈S} b(s) [ R(s, a) + γ Σ_{s′∈S} T(s′|s, a) max_{α∈Γ_i} α(s′) ]

MDP approx. update:
V_{i+1}(b) = Σ_{s∈S} b(s) max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} T(s′|s, a) max_{α∈Γ_i} α(s′) ]

UMDP update (observations ignored; a lower bound rather than an upper bound):
V_{i+1}(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ max_{α∈Γ_i} Σ_{s′∈S} Σ_{s∈S} T(s′|s, a) b(s) α(s′) ]

Each approximate update (FIB, QMDP, MDP) upper-bounds the exact one, in that order.

[Hauskrecht 2000] 41 / 52

Page 62:

Online POMDPs: Ways to Expand a Tree [Ross 2008]

(figure: a lookahead tree rooted at b0, branching on actions a1, a2 and observations z1, z2; each belief node carries a [lower, upper] bound interval, e.g. b0 has [14.4, 18.7]; action edges carry rewards and observation edges carry probabilities)

[Ross 2008]    time: (|A||Z|)^D |S|²

42 / 52

Page 63:

Online POMDPs: Ways to Expand a Tree [Ross 2008]

Branch-and-Bound Pruning: Backup upper and lower bounds from leaves to root. Prune nodes known to be suboptimal. (Paquet's Real-Time Belief Space Search (2005-6)).

MC: use a generative model to sample observations, for a sparse but deep tree. (McAllester (1999), Bertsekas (1999))

Heuristic Search: expand the leaf node of greatest heuristic value:

e(b_leaf, b0) ≈ γ^{d(b_leaf, b0)} Pr(b_leaf | b0, π) (U(b_leaf) − L(b_leaf))

(Satia 1973, Washington's BI-POMDP 1997, Ross's AEMS 2007, Smith's HSVI 2004)

(figure: the same lookahead tree as the previous slide)    time: (|A||Z|)^D |S|²

43 / 52
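A small sketch of the heuristic leaf choice above: score each fringe belief by its discounted reachability times its bound gap and expand the argmax. The leaf tuples and numbers below are made up for illustration:

```python
def best_leaf_to_expand(leaves, gamma):
    """leaves: list of (depth, prob_of_reaching_leaf, lower_bound, upper_bound) tuples.
    Score: e(b_leaf) ≈ gamma^depth * Pr(b_leaf | b0, policy) * (U(b_leaf) - L(b_leaf))."""
    def score(leaf):
        depth, prob, lower, upper = leaf
        return (gamma ** depth) * prob * (upper - lower)
    return max(range(len(leaves)), key=lambda i: score(leaves[i]))

# Illustrative fringe: expand the leaf with the largest weighted bound gap.
fringe = [(2, 0.42, 9.0, 15.0), (2, 0.28, 6.0, 14.0), (1, 0.50, 12.0, 18.7)]
print(best_leaf_to_expand(fringe, gamma=0.95))   # -> 2
```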

Page 64:

Point-Based Methods

1. Use a finite set of representative belief-points b_i ∈ B ⊂ ∆ (the belief simplex).

2. Backup belief-point values V(b_i) and gradients dV(b_i)/db_i = α_i.

3. To evaluate any other belief b ∈ ∆ \ B, use interpolation or the lower bound that the α-vectors provide.

44 / 52

Page 65:

Grid-Based Belief-Value Interpolation

- Lovejoy, 1991
- Brafman, 1997
- Hauskrecht, 2000
- Zhou, 2001
- Bonet, 2002

45 / 52

Page 66:

Pineau's Point Based Value Iteration (PBVI) (2003)

- Maintain one value and one gradient per belief-point.
- An 'anytime algorithm': trade off time with accuracy.
- When all b_i are fully backed-up, add new belief-points b′_i to B using stochastic forward simulation. Pros:
  - more-probable belief points added
  - no 'unreachable' belief points added
- Only include the b′_i furthest (L1 or L2 distance) from B.

(figure: α-vectors α0, α1, α2 over belief points b0, b1, b2, b3; V = {α0, α1, α2})

46 / 52

Page 67:

Pineau's Point Based Value Iteration (PBVI) (2003)

(figure: α-vectors α0, α1, α2 over belief points b0, b1, b2, b3; V = {α0, α1, α2})

For an exact V = HV′ backup, we first create |A||Z||V′| projections:

Γ^{a,r} ← α^{a,r}(s) = R(s, a)

Γ^{a,z} ← α^{a,z}_i(s) = γ Σ_{s′∈S} T(s′|s, a) O(z|s′, a) α′_i(s′),  ∀α′_i ∈ V′

Sondik's Full Enumeration:

Γ^a = Γ^{a,r} ⊕ Γ^{a,z1} ⊕ Γ^{a,z2} ⊕ ...,  V = ∪_{a∈A} Γ^a

Space: |V| = |A||V′|^{|Z|};  Time: |S|²|A||V′|^{|Z|}

PBVI:

Γ^{a,b} = Γ^{a,r} + Σ_{z∈Z} argmax_{α∈Γ^{a,z}} (α · b)

V ← argmax_{Γ^{a,b}, a∈A} (Γ^{a,b} · b),  ∀b ∈ B

Space: |V| = |B|;  Time: |S||A||V′||Z||B|

47 / 52
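A sketch of the PBVI backup just described, with the same hypothetical T/O/R layout as before: one projection per (a, z), keep the projection that is best at b, and return the best resulting vector at b (all names are illustrative):

```python
import numpy as np

def backup_point(b, Gamma, T, O, R, gamma):
    """PBVI backup at one belief b:
    best over a of [ Gamma^{a,r} + sum_z argmax_{alpha in Gamma^{a,z}} alpha·b ]."""
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(nA):
        vec = R[:, a].astype(float)                   # Gamma^{a,r}
        for z in range(nZ):
            # Gamma^{a,z}: project every old alpha, keep the one best at b.
            proj = [gamma * T[a] @ (O[a, :, z] * alpha) for alpha in Gamma]
            vec = vec + max(proj, key=lambda g: g @ b)
        if vec @ b > best_val:
            best_vec, best_val = vec, vec @ b
    return best_vec

def pbvi_backup_stage(B, Gamma, T, O, R, gamma):
    """One PBVI stage: one new alpha-vector per belief point in B."""
    return [backup_point(b, Gamma, T, O, R, gamma) for b in B]
```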

Page 68:

Spaan's Perseus/Random-PBVI (2005)

Backup a (random) subset B̃ ⊂ B to improve V(b_i) ∀b_i ∈ B.
Note: the backup operator H is linear in |V|, and |V| = |B̃| ≪ |B|.

(figure and text excerpted from Spaan & Vlassis)

Figure 1: Example of a Perseus backup stage in a two state POMDP. The belief space is depicted on the x-axis and the y-axis represents V(b). Solid lines are α^i_n vectors from the current stage n and dashed lines are α^i_{n−1} vectors from the previous stage. We operate on a B of 7 beliefs, indicated by the tick marks. The backup stage computing V_{n+1} from V_n proceeds as follows: (a) value function at stage n; (b) start computing V_{n+1} by sampling b6, add α = backup(b6) to V_{n+1} which improves the value of b6 and b7; (c) sample b3 from {b1, . . . , b5}, add backup(b3) to V_{n+1} which improves b1 through b5; and (d) the value of all b ∈ B has improved, the backup stage is finished.

3.2 Discussion

The key observation underlying the Perseus algorithm is that when a belief b is backed up, the resulting vector improves not only V(b) but often also the value of many other belief points in B. This results in value functions with a relatively small number of vectors (as compared to, e.g., Poon, 2001; Pineau et al., 2003). Experiments show indeed that the number of vectors grows modestly with the number of backup stages (|V_n| ≪ |B|). In practice this means that we can afford to use a much larger B than other point-based methods, which has a positive effect on the approximation accuracy as dictated by the bounds of Pineau et al. (2003). Furthermore, compared with other methods that build the set B based on various heuristics (Pineau et al., 2003; Smith & Simmons, 2004), our build-up of B is cheap as it only requires sampling random trajectories starting from b0. Moreover, duplicate entries in B will only affect the probability that a particular b will be sampled in the value update stages, but not the size of V_n.

48 / 52

Page 69:

Spaan's Perseus/Random-PBVI (2005)

(further excerpt from "Perseus: Randomized Point-based Value Iteration for POMDPs", Spaan & Vlassis)

... backup stage, given a value function V_n, we compute a value function V_{n+1} that improves the value of all b ∈ B, i.e., we build a value function V_{n+1} = H̃_Perseus V_n that upper bounds V_n over B (but not necessarily over ∆, which would require linear programming):

V_n(b) ≤ V_{n+1}(b), for all b ∈ B.    (16)

We first let the agent randomly explore the environment and collect a set B of reachable belief points, which remains fixed throughout the complete algorithm. We initialize the value function V_0 as a single vector with all its components equal to (1/(1−γ)) min_{s,a} r(s, a) (Zhang & Zhang, 2001). Starting with V_0, Perseus performs a number of backup stages until some convergence criterion is met. Each backup stage is defined as follows (where B̃ is an auxiliary set containing the non-improved points):

Perseus backup stage: V_{n+1} = H̃_Perseus V_n

1. Set V_{n+1} = ∅. Initialize B̃ to B.
2. Sample a belief point b uniformly at random from B̃ and compute α = backup(b).
3. If b · α ≥ V_n(b) then add α to V_{n+1}, otherwise add α′ = argmax_{{α^i_n}_i} b · α^i_n to V_{n+1}.
4. Compute B̃ = {b ∈ B : V_{n+1}(b) < V_n(b)}.
5. If B̃ = ∅ then stop, else go to 2.

Often, a small number of vectors will be sufficient to improve V_n(b) ∀b ∈ B, especially in the first steps of value iteration. The idea is to compute these vectors in a randomized greedy manner by sampling from B̃, an increasingly smaller subset of B. We keep track of the set of non-improved points B̃ consisting of those b ∈ B whose new value V_{n+1}(b) is still lower than V_n(b). At the start of each backup stage, V_{n+1} is set to ∅ which means B̃ is initialized to B, indicating that all b ∈ B still need to be improved in this backup stage. As long as B̃ is not empty, we sample a point b from B̃ and compute α = backup(b). If α improves the value of b (i.e., if b · α ≥ V_n(b) in step 3), we add α to V_{n+1} and update V_{n+1}(b) for all b ∈ B by computing their inner product with the new α. The hope is that α improves the value of many other points in B, and all these points are removed from B̃. As long as B̃ is not empty we sample belief points from it and add their α vectors.

To ensure termination of each backup stage we have to enforce that B̃ shrinks when adding vectors, i.e., that each α actually improves at least the value of the b that generated it. If not (i.e., b · α < V_n(b) in step 3), we ignore α and insert a copy of the maximizing vector of b from V_n in V_{n+1}. Point b is now considered improved and is removed from B̃ in step 4, together with any other belief points which had the same vector as maximizing one in V_n. This procedure ensures that B̃ shrinks and the backup stage will terminate. A pictorial example of a backup stage is presented in Fig. 1.

Perseus performs backup stages until some convergence criterion is met. For point-based methods several convergence criteria can be considered; one could for instance bound the difference between successive value function estimates max_{b∈B}(V_{n+1}(b) − V_n(b)). Another option would be to track the number of policy changes: the number of b ∈ B which had a different optimal action in V_n compared to V_{n+1} (Lovejoy, 1991).

49 / 52
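The backup stage above maps almost line-for-line onto code. A sketch, where backup_point is assumed to be any function mapping (belief, current α-vectors) to a new α-vector (e.g. a wrapper around the PBVI sketch earlier):

```python
import numpy as np

def perseus_backup_stage(B, V_n, backup_point, rng=None):
    """One Perseus backup stage: improve V(b) for all b in B with few backups.
    B: list of belief arrays; V_n: list of alpha-vectors (np arrays);
    backup_point(b, vectors) -> one new alpha-vector."""
    rng = np.random.default_rng() if rng is None else rng

    def value(b, vecs):
        return max(v @ b for v in vecs)

    V_next = []
    B_tilde = list(range(len(B)))                    # indices of not-yet-improved points
    while B_tilde:
        i = B_tilde[rng.integers(len(B_tilde))]      # step 2: sample b from B_tilde
        alpha = backup_point(B[i], V_n)
        if alpha @ B[i] >= value(B[i], V_n):         # step 3: keep alpha if it improves b...
            V_next.append(alpha)
        else:                                        # ...otherwise copy b's old maximiser
            V_next.append(max(V_n, key=lambda v: v @ B[i]))
        # step 4: only points whose value has not yet improved stay in B_tilde
        B_tilde = [j for j in B_tilde if value(B[j], V_next) < value(B[j], V_n)]
    return V_next
```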

Page 70:

Spaan's Perseus/Random-PBVI (2005)

(tables excerpted from "Perseus: Randomized Point-based Value Iteration for POMDPs")

(a) Results for Tiger-grid:
    Method     R        |π|     T
    HSVI       2.35     4860    10341
    Perseus    2.34     134     104
    PBUA       2.30     660     12116
    PBVI       2.25     470     3448
    BPI w/b    2.22     120     1000
    Grid       0.94     174     n.a.
    QMDP       0.23     n.a.    2.76

(b) Results for Hallway:
    Method     R        |π|     T
    PBVI       0.53     86      288
    PBUA       0.53     300     450
    HSVI       0.52     1341    10836
    Perseus    0.51     55      35
    BPI w/b    0.51     43      185
    QMDP       0.27     n.a.    1.34

(c) Results for Hallway2:
    Method     R        |π|     T
    Perseus    0.35     56      10
    HSVI       0.35     1571    10010
    PBUA       0.35     1840    27898
    PBVI       0.34     95      360
    BPI w/b    0.32     60      790
    QMDP       0.09     n.a.    2.23

(d) Results for Tag:
    Method     R        |π|     T
    Perseus    −6.17    280     1670
    HSVI       −6.37    1657    10113
    BPI w/b    −6.65    17      250
    BBSLS      ≈ −8.3   30      105
    BPI n/b    −9.18    940     59772
    PBVI       −9.18    1334    180880
    QMDP       −16.9    n.a.    16.1

Table 2: Experimental comparisons of Perseus with other algorithms. Perseus results are averaged over 10 runs. Each table lists the method, the average expected discounted reward R, the size of the solution |π| (value function or controller size), and the time T (in seconds) used to compute the solution. Sources: PBVI (Pineau et al., 2003), BPI no bias (Poupart & Boutilier, 2004), BPI with bias (Poupart, 2005), HSVI (Smith & Simmons, 2004), Grid (Brafman, 1997), PBUA (Poon, 2001), and BBSLS (Braziunas & Boutilier, 2004) (approximate, read from figure).

... on the reachable belief space by incorporating the initial belief, which dramatically increases its performance in solution size and computation time, but it does not reach the control quality of Perseus.

5.2 Continuous Action Spaces

We applied Perseus in two domains with continuous action spaces: an agent equipped with proximity sensors moving at a continuous heading and distance, and a navigation task involving a mobile robot with omnidirectional vision in a perceptually aliased office environment.

50 / 52

Page 71:

Smith's Heuristic Search Value Iteration (HSVI) (2004-5)

- Some (Shani 2008, 2012) argue that Perseus's use of random belief-point selection is bad, being unguided.

- HSVI is similar to PBVI but instead guides belief-point selection using a tree and a heuristic (mentioned above). It descends the tree via action nodes of greatest upper bound, and observation nodes of greatest difference in upper and lower bounds (weighted by the observation probability).

- This guiding is akin to prioritised value iteration in MDPs.

51 / 52
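A sketch of the HSVI-style descent rule in the bullet above, assuming upper/lower bound functions U and L (e.g. FIB-style and blind-policy-style bounds) are supplied by the caller; all names are illustrative:

```python
def hsvi_choose_child(b, actions, observations, upper_Q, obs_prob, belief_update, U, L):
    """One step of HSVI's guided descent.
    upper_Q(b, a): upper bound on Q(b, a);  obs_prob(b, a, z): O(z|b, a);
    belief_update(b, a, z): tau(b, a, z);   U, L: upper/lower bounds on V(b)."""
    # Action node: pick the action with the greatest upper bound.
    a_star = max(actions, key=lambda a: upper_Q(b, a))

    # Observation node: pick the observation with the greatest
    # probability-weighted gap between the bounds at the child belief.
    def weighted_gap(z):
        child = belief_update(b, a_star, z)
        return obs_prob(b, a_star, z) * (U(child) - L(child))

    z_star = max(observations, key=weighted_gap)
    return a_star, z_star, belief_update(b, a_star, z_star)
```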

Page 72:

Advanced POMDP Topics

- Reinforcement Learning as a POMDP (Duff 2002, Poupart 2006)
- Learning and planning in POMDPs (Ross 2011)
- Continuous POMDPs (Porta 2006)

52 / 52

