Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 1 / 29
(Co)Algebraic Techniques for Markov Decision Processes
SYSMICS 2019
Frank Feys Helle Hvid Hansen Larry Moss
Delft University of Technology
Introduction
Joint work with Frank Feys (Delft) and Larry Moss (Indiana):
Long-Term Values in Markov Decision Processes, (Co)Algebraically. Proc. of Coalgebraic Methods in Computer Science (CMCS 2018)
Motivation:
• Classic theory of MDPs uses “low-level”, analytic methods.
• Some classic proofs have coinductive flavour. Make this precise.
• Develop compositional, (co)algebraic methods for MDPs.
Outline
1 Coalgebra basics
2 MDP Preliminaries
3 Part I: Long-Term Values from b-Corecursive Algebras
4 Part II: Policy Improvement (Co)Inductively
5 Conclusion
Systems as Coalgebras
Universal Coalgebra: a universal theory of systems [Rutten et al.]
• Framework for uniform study of systems and their properties
• Uniform defs and proofs, coinduction principle, ...
T-Coalgebras
Determ. automaton with output in B: $X \to B \times X^A$
Labelled transition system: $X \to \mathcal{P}(X)^A$
Kripke model: $X \to \mathcal{P}(X)^A \times \mathcal{P}(P_0)$
Mealy machine: $X \to (B \times X)^A$
Probabilistic automaton: $X \to \mathcal{P}(\mathcal{D}(X))^A$
Linear weighted automata: $X \to \mathbb{R} \times (\mathbb{R}^X)^A$
...
T-coalgebra: $X \to T(X)$
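T-Coalgebras and T-Algebras
Given a functor $T : \mathcal{C} \to \mathcal{C}$ on a category $\mathcal{C}$:
• A T-coalgebra is a morphism $c : X \to T(X)$.
• A T-coalgebra morphism from $(X, c)$ to $(Y, d)$ is a morphism $f : X \to Y$ such that $d \circ f = T(f) \circ c$.
• A T-algebra is a morphism $a : T(X) \to X$.
• A T-algebra morphism from $(X, a)$ to $(Y, b)$ is a morphism $f : X \to Y$ such that $f \circ a = b \circ T(f)$.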
Final Coalgebra
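Def. A T-coalgebra $(Z, \zeta)$ is final if for every T-coalgebra $\gamma : X \to T(X)$ there is a unique morphism $\mathit{beh} : X \to Z$ such that $\zeta \circ \mathit{beh} = T(\mathit{beh}) \circ \gamma$.
For a state $x \in X$, $\mathit{beh}(x)$ is the “behaviour of x”.
Examples:
T(X)                      Carrier of final T-coalgebra
$B \times X$              $B^\omega$, streams over B
$2 \times X^A$            $2^{A^*}$, languages over alphabet A
$(B \times X)^A$          causal stream functions $f : A^\omega \to B^\omega$
$\mathcal{P}_\omega(X)$   strongly extensional trees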
MDPs: Planning under Uncertainty
Markov decision processes (MDPs) are state-based models of sequential decision-making under uncertainty.
• The system/agent chooses actions and collects rewards, but does not have full control over transitions.
• The decision maker wants to find a policy/plan that maximizes future expected long-term rewards.
• Applications: maintenance schedules, production planning, finance, reinforcement learning, ...
• MDPs are one-player stochastic games.
MDP Example
A start-up company needs to decide to Advertise or Save money:
[Figure: MDP transition diagram. States: Poor & Unknown, Poor & Famous, Rich & Unknown, Rich & Famous; state rewards +0 and +10; actions A (Advertise) and S (Save); transition probabilities 1 and 1/2.]
Markov Decision Processes
Def. A (discrete, time-independent) Markov Decision Process (MDP) is a Set-coalgebra
$m = \langle u, t \rangle : S \to \mathbb{R} \times (\Delta S)^A$
where
• $S$ is a finite set of states,
• $A$ is a finite set of actions,
• $\Delta$ is the finite-support distributions functor,
• $t : S \to (\Delta S)^A$ is a probabilistic transition function,
• $u : S \to \mathbb{R}$ is an (immediate) reward function.
(Alternatively, $m : S \to (\mathbb{R} \times \Delta S)^A$, i.e. rewards are given on transitions.)
Def. A (deterministic, stationary) policy $\sigma$ is a map $\sigma : S \to A$.
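A minimal Python sketch of this definition, assuming a hypothetical two-state MDP. The state names, action names and all numbers are illustrative assumptions, not the start-up example. It represents the coalgebra map as a function returning a reward together with one distribution per action, and a policy as a plain map from states to actions.

```python
from typing import Dict, Tuple

State = str
Action = str
Dist = Dict[State, float]  # finite-support distribution: state -> probability

# Hypothetical toy MDP; names and numbers are illustrative assumptions.
REWARD: Dict[State, float] = {"low": 0.0, "high": 1.0}
TRANS: Dict[State, Dict[Action, Dist]] = {
    "low": {"wait": {"low": 1.0}, "invest": {"low": 0.5, "high": 0.5}},
    "high": {"wait": {"low": 0.5, "high": 0.5}, "invest": {"high": 1.0}},
}


def m(s: State) -> Tuple[float, Dict[Action, Dist]]:
    """Coalgebra map m = <u, t>: a state yields its reward and, per action, a distribution."""
    return REWARD[s], TRANS[s]


def is_distribution(d: Dist, tol: float = 1e-12) -> bool:
    """Check that d is non-negative and sums to 1."""
    return all(p >= 0 for p in d.values()) and abs(sum(d.values()) - 1.0) <= tol


# A deterministic, stationary policy is just a map S -> A.
sigma: Dict[State, Action] = {"low": "invest", "high": "invest"}
reward, by_action = m("low")
next_dist = by_action[sigma["low"]]  # distribution reached from "low" under sigma
```

Keeping distributions as dictionaries with only the non-zero entries mirrors the finite-support requirement on $\Delta$.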
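Markov Reward Process
Given $m = \langle u, t \rangle : S \to \mathbb{R} \times (\Delta S)^A$ and a policy $\sigma : S \to A$, we get a Markov reward process
$m_\sigma = \langle u, t_\sigma \rangle : S \to \mathbb{R} \times \Delta S$ where $t_\sigma(s) := t(s)(\sigma(s))$
$\leadsto\ m_\sigma^\sharp : \Delta S \to \mathbb{R} \times \Delta S$ by determinisation (cf. Bartels, Jacobs, Silva, Sokolova), using that $(\Delta, \delta, \mu)$ is a monad and $\mathrm{ev} : \Delta(\mathbb{R}) \to \mathbb{R}$ is an algebra for this monad.
[Diagram: $\delta : S \to \Delta S$ with $m_\sigma^\sharp \circ \delta = m_\sigma$; the unique map $! : \Delta S \to \mathbb{R}^\omega$ into the final coalgebra $\mathbb{R}^\omega \to \mathbb{R} \times \mathbb{R}^\omega$ makes the square with $\mathrm{id}_{\mathbb{R}} \times {!}$ commute.]
Concretely, $m_\sigma^\sharp = \langle u^\sharp, t_\sigma^\sharp \rangle$ where
$u^\sharp(\varphi) = \sum_{s \in S} u(s) \cdot \varphi(s)$
$t_\sigma^\sharp(\varphi)(s) = \sum_{s' \in S} \varphi(s') \cdot t_\sigma(s')(s)$
$m_\sigma^\sharp$ is an $(\mathbb{R} \times -)$-coalgebra.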
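The determinised maps can be sketched in Python as follows, assuming a hypothetical two-state Markov reward process (names and numbers are illustrative). Here `u_sharp` takes the expectation of the reward under a distribution, and `t_sharp` pushes a distribution one step forward: each state passes on its mass according to its own transition distribution.

```python
from typing import Dict

Dist = Dict[str, float]  # finite-support distribution: state -> probability

# Hypothetical Markov reward process m_sigma = <u, t_sigma>; numbers are assumptions.
u = {"low": 0.0, "high": 1.0}
t_sigma = {"low": {"low": 0.5, "high": 0.5}, "high": {"high": 1.0}}


def delta(s: str) -> Dist:
    """Unit of the distribution monad: the point distribution at s."""
    return {s: 1.0}


def u_sharp(phi: Dist) -> float:
    """Expected immediate reward under phi: sum over s of u(s) * phi(s)."""
    return sum(u[s] * p for s, p in phi.items())


def t_sharp(phi: Dist) -> Dist:
    """One-step push-forward: mass phi(s_prev) is spread according to t_sigma(s_prev)."""
    out: Dist = {}
    for s_prev, p in phi.items():
        for s_next, q in t_sigma[s_prev].items():
            out[s_next] = out.get(s_next, 0.0) + p * q
    return out
```

On point distributions the sketch agrees with the original process, which is the triangle $m_\sigma^\sharp \circ \delta = m_\sigma$.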
Expected Rewards via Trace Semantics
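[Diagram: $\delta : S \to \Delta S$, $m_\sigma = \langle u, t_\sigma \rangle : S \to \mathbb{R} \times \Delta S$, $m_\sigma^\sharp : \Delta S \to \mathbb{R} \times \Delta S$, the unique map $! : \Delta S \to \mathbb{R}^\omega$ into the final coalgebra $\mathbb{R}^\omega \to \mathbb{R} \times \mathbb{R}^\omega$, and $\mathrm{trc} = {!} \circ \delta : S \to \mathbb{R}^\omega$.]
Trace semantics via finality:
$\mathrm{trc}(s) = (r^\sigma_0(s), r^\sigma_1(s), r^\sigma_2(s), \ldots)$
where $r^\sigma_n(s)$ is the output of $m_\sigma^\sharp$ after $n$ steps starting from $\delta(s)$, i.e., the expected reward at time step $n$ starting from $s$.
Long-term expected value for σ in s depends on how you evaluate
$(r^\sigma_0(s), r^\sigma_1(s), r^\sigma_2(s), \ldots)$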
Different evaluation criteria exist...
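As an illustration of the trace map, the stream of expected rewards can be unfolded lazily. This is a sketch on a hypothetical two-state Markov reward process (names and numbers are assumptions); the generator `trc` starts from the point distribution at `s` and alternates between emitting the expected reward and stepping the distribution.

```python
from itertools import islice
from typing import Dict, Iterator

Dist = Dict[str, float]

# Hypothetical two-state Markov reward process; numbers are illustrative only.
u = {"low": 0.0, "high": 1.0}
t_sigma = {"low": {"low": 0.5, "high": 0.5}, "high": {"high": 1.0}}


def step(phi: Dist) -> Dist:
    """Push the distribution phi one transition forward."""
    out: Dist = {}
    for s_prev, p in phi.items():
        for s_next, q in t_sigma[s_prev].items():
            out[s_next] = out.get(s_next, 0.0) + p * q
    return out


def trc(s: str) -> Iterator[float]:
    """Lazily unfold (r_0(s), r_1(s), r_2(s), ...) starting from delta(s)."""
    phi: Dist = {s: 1.0}
    while True:
        yield sum(u[x] * p for x, p in phi.items())
        phi = step(phi)


first_four = list(islice(trc("low"), 4))  # [0.0, 0.5, 0.75, 0.875]
```

A generator is a natural fit here because the final coalgebra of $\mathbb{R} \times -$ consists of infinite streams, of which only a finite prefix is ever inspected.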
Long-Term Value via Discounted Sums
Let 0 ≤ γ < 1 be a discount factor.
Def. The long-term value of policy σ according to the discounted sum criterion is $V^\sigma : S \to \mathbb{R}$:
$V^\sigma(s) = \sum_{n=0}^{\infty} \gamma^n \cdot r^\sigma_n(s)$
Converges because reward map u : S → R is bounded.
We define:
• $\sigma \geq \tau$ if for all $s$, $V^\sigma(s) \geq V^\tau(s)$.
• σ is an optimal policy if σ ≥ τ for all τ .
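The convergence argument also gives a stopping rule: if $|r^\sigma_n(s)| \leq B$ for all $n$, the tail of the series from index $N$ on is at most $\gamma^N B / (1 - \gamma)$. A hedged sketch of a truncated evaluation based on that bound follows; the function name and its parameters are assumptions made for illustration.

```python
from itertools import repeat


def discounted_value(rewards, gamma, bound, eps=1e-9):
    """Approximate the discounted sum of r_0, r_1, ... to within eps.

    rewards: iterator over the expected rewards r_0, r_1, ...
    bound:   any B with |r_n| <= B for all n (exists because u is bounded)
    """
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total, weight = 0.0, 1.0  # weight holds gamma^n
    for r in rewards:
        # The tail from index n on is at most weight * bound / (1 - gamma).
        if weight * bound / (1 - gamma) <= eps:
            break
        total += weight * r
        weight *= gamma
    return total


# Constant reward 1 with gamma = 1/2: the exact value is 1 / (1 - 1/2) = 2.
approx = discounted_value(repeat(1.0), 0.5, bound=1.0)
```

The same function accepts any reward stream, for instance one produced by unfolding a Markov reward process step by step.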
Optimal Value
Def. The optimal value function $V^* : S \to \mathbb{R}$ of $m$ is defined as
$V^*(s) = \max_\sigma V^\sigma(s)$
Classical facts (wrt discounted sum criterion), cf. (Puterman, 2014):
• If $\sigma$ is optimal, then $V^\sigma = V^*$.
• Optimal policy always exists.
• Optimal policies need not be unique.
• Stationary (memoryless), deterministic policies suffice.
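$V^\sigma$ as Coalgebra-to-Algebra Morphism
• $V^\sigma : S \to \mathbb{R}$ satisfies for all $s \in S$:
$V^\sigma(s) = u(s) + \gamma \sum_{s' \in S} t_\sigma(s)(s') \cdot V^\sigma(s')$    (1)
i.e. $V^\sigma = u + \gamma t_\sigma V^\sigma$ (as linear system)
• $V^\sigma$ arises as a fixpoint of the linear operator $\Psi_\sigma : \mathbb{R}^S \to \mathbb{R}^S$ given by
$\Psi_\sigma(v) = u + \gamma t_\sigma v$
• Observation: we can re-express (1) as $V^\sigma$ being a coalgebra-to-algebra morphism:
$V^\sigma = \alpha_\gamma \circ (\mathbb{R} \times \mathrm{ev}) \circ (\mathbb{R} \times \Delta(V^\sigma)) \circ m_\sigma$
where $\alpha_\gamma : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is $\alpha_\gamma(x, y) = x + \gamma \cdot y$.
Let $H = \mathbb{R} \times \mathrm{Id}$.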
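A small numerical sketch of the fixpoint view, assuming a hypothetical two-state Markov reward process with $\gamma = 1/2$ (all names and numbers are illustrative). It approximates the fixpoint of $\Psi_\sigma$ by plain iteration instead of solving the linear system.

```python
# Hypothetical two-state Markov reward process and discount factor (assumptions).
u = {"low": 0.0, "high": 1.0}
t_sigma = {"low": {"low": 0.5, "high": 0.5}, "high": {"high": 1.0}}
GAMMA = 0.5


def psi(v):
    """Psi_sigma(v)(s) = u(s) + gamma * sum over s2 of t_sigma(s)(s2) * v(s2)."""
    return {s: u[s] + GAMMA * sum(p * v[s2] for s2, p in t_sigma[s].items())
            for s in u}


def fixpoint(op, v0, tol=1e-12, max_iter=10_000):
    """Iterate op from v0 until successive iterates differ by at most tol (sup norm)."""
    v = v0
    for _ in range(max_iter):
        w = op(v)
        if max(abs(w[s] - v[s]) for s in v) <= tol:
            return w
        v = w
    raise RuntimeError("no convergence within max_iter")


V_sigma = fixpoint(psi, {s: 0.0 for s in u})
```

For this toy process equation (1) can be solved by hand: $V^\sigma(\text{high}) = 2$ and $V^\sigma(\text{low}) = 2/3$, which the iteration reproduces up to the tolerance.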
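$V^\sigma$ via Universal Property?
• Recall that a corecursive algebra (for a functor $F$) is an $F$-algebra $\alpha : FA \to A$ such that for every coalgebra $f : X \to FX$ there is a unique $f^\dagger : X \to A$ with $f^\dagger = \alpha \circ F(f^\dagger) \circ f$.
• Question: Is $\alpha_\gamma \circ H(\mathrm{ev})$ a corecursive algebra for $H\Delta$?
• By (Capretta et al., 2004):
$\alpha_\gamma \circ H(\mathrm{ev})$ is a corecursive algebra for $H\Delta$ iff $\alpha_\gamma$ is a corecursive algebra for $H$.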
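Is $\alpha_\gamma$ a Corecursive Algebra?
• $\alpha_\gamma : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is corecursive for $H = \mathbb{R} \times \mathrm{Id}$ if: for all $f : X \to \mathbb{R} \times X$ there is a unique $f^\dagger : X \to \mathbb{R}$ such that $f^\dagger = \alpha_\gamma \circ (\mathbb{R} \times f^\dagger) \circ f$.
• Consider the H-coalgebra $f : X \to \mathbb{R} \times X$ given by
$f = x_0 \xrightarrow{a_0} x_1 \xrightarrow{a_1} x_2 \xrightarrow{a_2} \cdots$
• $f^\dagger$ is a solution iff $f^\dagger(x_n) = a_n + \gamma \cdot f^\dagger(x_{n+1})$, $n = 0, 1, 2, \ldots$
• This system has infinitely many solutions when $\gamma > 0$, even if $(a_n)_n$ is bounded. So, $\alpha_\gamma$ is not corecursive for $\gamma > 0$.
• However, if $(a_n)_n$ is bounded then this system has a unique bounded solution.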
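A worked version of the last two bullets, as a sketch under the assumptions that $0 < \gamma < 1$, that $(a_n)_n$ is bounded, and that the states $x_0, x_1, x_2, \ldots$ are pairwise distinct. The names $g$ and $f^\dagger_c$ are introduced here for illustration only. Every real constant $c$ gives a solution:

```latex
\begin{align*}
g(x_n) &:= \sum_{k \ge 0} \gamma^{k} a_{n+k}, \qquad
f^\dagger_c(x_n) := g(x_n) + c\,\gamma^{-n} \quad (c \in \mathbb{R}), \\
a_n + \gamma\, f^\dagger_c(x_{n+1})
  &= a_n + \sum_{k \ge 0} \gamma^{k+1} a_{n+1+k} + \gamma\, c\,\gamma^{-(n+1)}
   = g(x_n) + c\,\gamma^{-n} = f^\dagger_c(x_n).
\end{align*}
```

Conversely, the difference $d$ of two solutions satisfies $d(x_n) = \gamma \cdot d(x_{n+1})$, hence $d(x_n) = \gamma^{-n} d(x_0)$, so every solution has the form $f^\dagger_c$. Since $\gamma^{-n}$ is unbounded, $f^\dagger_c$ is bounded only for $c = 0$, which leaves the discounted sum $g$ as the unique bounded solution.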
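Bounded Corecursive Algebra (bca)
To get uniqueness ⇒ incorporate boundedness information.
• Def. A b-category $(\mathcal{C}, B)$ is a category $\mathcal{C}$ with a subclass $B \subseteq \mathrm{Mor}(\mathcal{C})$ of “bounded” morphisms s.t. for all composable $f, g$: $f \in B \Rightarrow f \circ g \in B$. (Also known as a sieve.)
Main example: $(\mathsf{Met}, B)$ where $\mathsf{Met}$ is metric spaces with all maps, and $B$ consists of all bounded maps.
• Def. Let $(\mathcal{C}, B)$ be a b-category and $F : \mathcal{C} \to \mathcal{C}$ an endofunctor. An $F$-algebra $\alpha : FA \to A$ is a b-corecursive algebra (bca) if for every $f : X \to FX$ in $B$ there is a unique $f^\dagger : X \to A$ in $B$ such that $f^\dagger = \alpha \circ F f^\dagger \circ f$.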
$V^\sigma$ from Universal Property of bca
We show $\alpha_\gamma \circ H(\mathrm{ev})$ is a bca for $H\Delta$:
1 Develop some theory of b-categories (b-functors, b-natural transformations, B-preservation properties, ...)
2 Prove a b-version of the (Capretta et al.) result.
Theorem: If ... (certain conditions), then we obtain a bca for $H\Delta$ from a bca for $H$.
3 Show that $\alpha_\gamma$ is a bca for $H$ (using Banach FPT).
4 Show that the conditions of the Theorem apply.
(Details are in the paper)
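Optimal Value $V^*$ from bca
[Diagram: $V^* : S \to \mathbb{R}$ is a coalgebra-to-algebra morphism from $m = \langle u, t \rangle : S \to \mathbb{R} \times (\Delta S)^A$ to the algebra $\alpha_\gamma \circ (\mathbb{R} \times \max_A \circ\, \mathrm{ev}^A) : \mathbb{R} \times (\Delta \mathbb{R})^A \to \mathbb{R}$, i.e. $V^* = \alpha_\gamma \circ (\mathbb{R} \times \max_A \circ\, \mathrm{ev}^A) \circ (\mathbb{R} \times (\Delta V^*)^A) \circ m$.]
We can show directly that we have a bca.
But the bca is not obtained via the Capretta-style Theorem.
Problem: $\max_A$ is not affine/linear.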
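Read pointwise, the diagram says $V^*(s) = u(s) + \gamma \cdot \max_{a \in A} \sum_{s'} t(s)(a)(s') \cdot V^*(s')$. The sketch below approximates this fixpoint by iterating the right-hand side (standard value iteration, used here as an illustration rather than as the construction from the talk) on a hypothetical two-state MDP whose names and numbers are assumptions.

```python
# Hypothetical toy MDP: rewards u, and t[s][a] is the distribution over next states.
u = {"low": 0.0, "high": 1.0}
t = {
    "low": {"wait": {"low": 1.0}, "invest": {"low": 0.5, "high": 0.5}},
    "high": {"wait": {"low": 0.5, "high": 0.5}, "invest": {"high": 1.0}},
}
GAMMA = 0.5


def expect(dist, v):
    """Expectation of v under dist (the map ev applied after Delta(v))."""
    return sum(p * v[s] for s, p in dist.items())


def bellman(v):
    """v |-> u + gamma * max over actions of the expected value of v."""
    return {s: u[s] + GAMMA * max(expect(d, v) for d in t[s].values()) for s in u}


def value_iteration(tol=1e-12, max_iter=10_000):
    """Iterate the operator from the zero function until it stabilises (sup norm)."""
    v = {s: 0.0 for s in u}
    for _ in range(max_iter):
        w = bellman(v)
        if max(abs(w[s] - v[s]) for s in u) <= tol:
            return w
        v = w
    raise RuntimeError("no convergence within max_iter")


V_star = value_iteration()
```

The only change from the fixed-policy case is the `max` over actions, which is exactly the non-affine ingredient named on the slide.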
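Policy Iteration
Notation: For $\sigma : S \to A$, let $\ell^\sigma : \Delta S \to \mathbb{R}$ be
$\ell^\sigma(\varphi) = \sum_{s \in S} \varphi(s) \cdot V^\sigma(s)$
($\varphi$-expectation of long-term value for $\sigma$).
Policy Iteration Algorithm:
1 Initialise $\sigma_0$ to any policy.
2 Compute $V^{\sigma_k}$ (e.g. by solving a system of linear equations).
3 Define $\sigma_{k+1}$ by
$\sigma_{k+1}(s) := \mathrm{argmax}_{a \in A} \{ \ell^{\sigma_k}(t_a(s)) \}$, where $t_a(s) := t(s)(a)$
4 If $\sigma_{k+1} = \sigma_k$ then stop, else go to step 2.
Termination: ✓ (since $A^S$ is finite). Correctness: follows if $\sigma_{k+1} \geq \sigma_k$.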
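The four steps can be sketched in Python as follows, on a hypothetical two-state MDP (names and numbers are assumptions). Two choices in the sketch are not from the slides: step 2 is approximated by fixpoint iteration rather than a linear solve, and step 3 keeps the current action unless another one is better by more than a small `slack`, so that ties and rounding noise cannot cause oscillation.

```python
# Hypothetical toy MDP (illustrative numbers, not the example from the talk).
u = {"low": 0.0, "high": 1.0}
t = {
    "low": {"wait": {"low": 1.0}, "invest": {"low": 0.5, "high": 0.5}},
    "high": {"wait": {"low": 0.5, "high": 0.5}, "invest": {"high": 1.0}},
}
GAMMA = 0.5


def ell(v, dist):
    """Expectation of the value function v under dist."""
    return sum(p * v[s] for s, p in dist.items())


def evaluate(policy, tol=1e-12):
    """Step 2: approximate the value of policy by iterating its fixpoint equation."""
    v = {s: 0.0 for s in u}
    while True:
        w = {s: u[s] + GAMMA * ell(v, t[s][policy[s]]) for s in u}
        if max(abs(w[s] - v[s]) for s in u) <= tol:
            return w
        v = w


def improve(policy, v, slack=1e-9):
    """Step 3: choose a maximising action, keeping the current one on (near) ties."""
    new = {}
    for s in u:
        best = policy[s]
        for a in t[s]:
            if ell(v, t[s][a]) > ell(v, t[s][best]) + slack:
                best = a
        new[s] = best
    return new


def policy_iteration(policy):
    """Steps 2 to 4, repeated until the policy no longer changes."""
    while True:
        v = evaluate(policy)
        nxt = improve(policy, v)
        if nxt == policy:  # step 4
            return policy, v
        policy = nxt


sigma, V = policy_iteration({"low": "wait", "high": "wait"})
```

Starting from the policy that always waits, this run switches both states to `invest` after one improvement step and then stops.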
Policy Improvement
By definition,
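for all $s \in S$, $\sigma_{k+1}(s) = \mathrm{argmax}_{a \in A} \{ \ell^{\sigma_k}(t_a(s)) \}$,
which implies
for all $s \in S$, $\ell^{\sigma_k}(t_{\sigma_{k+1}}(s)) \geq \ell^{\sigma_k}(t_{\sigma_k}(s))$,
i.e. $\ell^{\sigma_k} \circ t_{\sigma_{k+1}} \geq \ell^{\sigma_k} \circ t_{\sigma_k}$ (in pointwise order on $\mathbb{R}^S$).
Policy Improvement Lemma:
For all $\sigma, \tau$: $\ell^\sigma \circ t_\tau \geq \ell^\sigma \circ t_\sigma \Rightarrow V^\tau \geq V^\sigma$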
Contraction (Co)Induction
Def. Ordered metric space
An ordered (complete) metric space $(M, d, \leq)$ is a (complete) metric space $(M, d)$ together with a partial order $(M, \leq)$ such that for all $y \in M$, $\{z \mid z \leq y\}$ and $\{z \mid y \leq z\}$ are closed in the metric topology.
Example: $B(X, \mathbb{R})$ with the pointwise order and supremum metric.
Theorem: Contraction (Co)Induction
Let $M$ be a non-empty, ordered complete metric space. If $f : M \to M$ is both contractive and order-preserving, then the fixpoint $x^*$ of $f$ is:
(i) a least pre-fixpoint (if $f(x) \leq x$, then $x^* \leq x$), and
(ii) a greatest post-fixpoint (if $x \leq f(x)$, then $x \leq x^*$).
Cf. Metric Coinduction (Kozen & Ruozzi, 2009) and (Denardo, 1967).
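A short sketch of why (i) holds, combining the Banach fixpoint theorem with the closedness condition in the definition (the argument for (ii) is symmetric, using the closed set $\{z \mid x \leq z\}$):

```latex
\begin{align*}
f(x) \le x
  &\;\Longrightarrow\; f^{n}(x) \le x \ \text{for all } n
     && \text{($f$ order-preserving, induction on $n$)} \\
  &\;\Longrightarrow\; x^* = \lim_{n \to \infty} f^{n}(x) \in \{z \mid z \le x\}
     && \text{(Banach; the set is closed)} \\
  &\;\Longrightarrow\; x^* \le x .
\end{align*}
```

The induction step is $f^{n+1}(x) \leq f(x) \leq x$, obtained by applying $f$ to $f^{n}(x) \leq x$.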
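Proof of Policy Improvement
Policy Improvement Lemma:
For all $\sigma, \tau$: $\ell^\sigma \circ t_\tau \geq \ell^\sigma \circ t_\sigma \Rightarrow V^\tau \geq V^\sigma$
Proof:
Apply Contraction (Co)Induction to $\Psi_\sigma : \mathbb{R}^S \to \mathbb{R}^S$ (contractive and order-preserving ✓), where
$\Psi_\sigma(v) = u + \gamma t_\sigma v$, and $V^\sigma$ is its fixpoint.
We have:
$\ell^\sigma \circ t_\tau \geq \ell^\sigma \circ t_\sigma$
$\Rightarrow\ u + \gamma \cdot \ell^\sigma \circ t_\tau \geq u + \gamma \cdot \ell^\sigma \circ t_\sigma$
$\Leftrightarrow\ \Psi_\tau(V^\sigma) \geq \Psi_\sigma(V^\sigma) = V^\sigma$
$\Rightarrow\ V^\tau \geq V^\sigma$ (by contraction coinduction applied to $\Psi_\tau$, whose fixpoint is $V^\tau$)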
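The equivalence in the middle of the proof rests on unfolding $\ell^\sigma$. Spelled out for an arbitrary policy $\rho$ (a symbol introduced here only for this remark) and every $s \in S$:

```latex
\begin{align*}
\bigl(u + \gamma \cdot \ell^{\sigma} \circ t_{\rho}\bigr)(s)
  &= u(s) + \gamma \sum_{s' \in S} t_{\rho}(s)(s') \cdot V^{\sigma}(s') \\
  &= \bigl(u + \gamma\, t_{\rho} V^{\sigma}\bigr)(s)
   = \Psi_{\rho}(V^{\sigma})(s).
\end{align*}
```

Taking $\rho = \tau$ gives the left-hand side $\Psi_\tau(V^\sigma)$, and taking $\rho = \sigma$ gives $\Psi_\sigma(V^\sigma) = V^\sigma$, so $V^\sigma$ is a post-fixpoint of $\Psi_\tau$.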
Conclusion
Summary
• Value functions $V^\sigma$ and $V^*$ from b-corecursive algebras.
• Coinductive proof of policy improvement theorem.
• Still need to resort to Banach FPT to get fixpoints.
Future work
• Stochastic games: Existence of Nash Eq = Kakutani + Contraction (Co)Induction.
• Other types of equilibria (subgame perfect, Markov, ...)
• Make connections to
  • Coalgebraic infinite games (Abramsky & Winschel, 2017)
  • Open games (Hedges et al., 2018)
  • Semantics of equilibria (Pavlovic, 2009)
• Dualities for ordered metric spaces, fixpoints.
• Coalgebraic learning: reinforcement learning
Thanks!