Introduction to Markov Decision Processes and Dynamic Programming
Judith Butepage and Marcus Klasson
KTH, Royal Institute of Technology, Stockholm
[email protected], [email protected]
February 14, 2017
Overview
1 Introduction to Markov Decision Processes
  Formal Modelling of RL Tasks
  Value Functions
  Bellman and his equations
  Optimal Value Function
2 Dynamic Programming
  Policy Evaluation
  Policy Improvement
  Policy Iteration
  Value Iteration
The Agent-Environment Interaction
$S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of possible states
$A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
$R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ is a numerical reward
$\pi_t(a|s)$ is a policy, denoting the probability of choosing action $A_t = a$ in state $S_t = s$
The agent’s goal is to maximize the total amount of reward it receives over the long run.
Help us to maximize our rewards!
The states are the slides of this lecture. The actions are your reactions. We get more reward when you understand and when you ask questions.
So raise your hand and do not get lost in this mathematical jungle!
A Short Discourse Into Multi-Armed Bandits
The agent can choose between $k$ actions and receives a reward for each action. The expected reward for taking action $a$ at time $t$ is
$$q_*(a) = \mathbb{E}[R_t \mid A_t = a].$$
If the agent has chosen actions up to time $t$, the average received reward is
$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}(A_i = a)}{\sum_{i=1}^{t-1} \mathbb{1}(A_i = a)}.$$
Multi-Armed Bandits Example - Dragon Finder
We can choose the actions
$$\mathcal{A} = \{d_1, d_2, d_3\}$$
We have chosen actions and received rewards
$A_{1:t-1} = [d_1, d_2, d_1, d_3, d_2, d_3, d_3]$
$R_{1:t-1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]$
Multi-Armed Bandits Example - Dragon Finder
We have chosen actions and received rewards
$A_{1:t-1} = [d_1, d_2, d_1, d_3, d_2, d_3, d_3]$
$R_{1:t-1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]$
Then we have
$$Q_t(d_1) = \frac{2.6 + 3.4}{2} = 3$$
$$Q_t(d_2) = \frac{1.1 + 0.8}{2} = 0.95$$
$$Q_t(d_3) = \frac{6.1 + 4.6 + 5.2}{3} = 5.3$$
We can be greedy and exploit this function by choosing the action that gives us the highest expected reward. Or we can explore our action space and choose a random action with probability $\varepsilon$.
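As a concrete illustration, here is a minimal Python sketch of the sample-average estimate and ε-greedy action selection; the function names and the three-armed setup are our own illustrative choices, not part of the original slides.

```python
import random

def sample_average(actions, rewards, a):
    """Average reward received so far for action a (0 if a was never taken)."""
    taken = [r for a_i, r in zip(actions, rewards) if a_i == a]
    return sum(taken) / len(taken) if taken else 0.0

def epsilon_greedy(actions, rewards, action_space, epsilon=0.1):
    """With probability epsilon explore uniformly at random,
    otherwise exploit the action with the highest sample-average reward."""
    if random.random() < epsilon:
        return random.choice(action_space)
    return max(action_space, key=lambda a: sample_average(actions, rewards, a))

# The dragon-finder history from the slides:
history_a = ["d1", "d2", "d1", "d3", "d2", "d3", "d3"]
history_r = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]
print(epsilon_greedy(history_a, history_r, ["d1", "d2", "d3"], epsilon=0.1))
```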
Multi-Armed Bandits Example - Graph ε-greedy
[Figure: average reward (y-axis, 0 to 5) versus steps (x-axis, 0 to 1000) for ε = 0 (greedy), ε = 0.01, and ε = 0.1.]
Comparison of the greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1). Rewards are Normally distributed as $R_d \sim \mathcal{N}(\mu_d, \sigma_d)$, with $\mu = [3, 1, 5]$ and $\sigma = [0.5, 0.25, 1]$. Each run lasts $t = 1000$ steps and the curves are averaged over 1000 runs.
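Curves like these can be reproduced with a short simulation. The following Python sketch assumes the three-armed Gaussian bandit described in the caption; the reward parameters are taken from the slide, everything else is an illustrative implementation choice.

```python
import random

def run_bandit(epsilon, mus=(3, 1, 5), sigmas=(0.5, 0.25, 1.0), steps=1000):
    """One epsilon-greedy run on a Gaussian bandit; returns the reward sequence."""
    k = len(mus)
    q = [0.0] * k      # sample-average estimates
    n = [0] * k        # how often each arm was pulled
    rewards = []
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)
        else:
            a = max(range(k), key=lambda i: q[i])
        r = random.gauss(mus[a], sigmas[a])
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample average
        rewards.append(r)
    return rewards

# Average over many runs, as in the figure.
runs = 1000
for eps in (0.0, 0.01, 0.1):
    avg = sum(sum(run_bandit(eps)) / 1000 for _ in range(runs)) / runs
    print(f"epsilon = {eps}: mean reward per step approx. {avg:.2f}")
```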
Markov Decision Processes
A Markov Decision Process (MDP) is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, p(\cdot), \mathcal{R}, \gamma)$:
$\mathcal{S}$ is a finite set of possible states
$\mathcal{A}(S_t)$ is a finite set of actions available in state $S_t$
$p(s' \mid s, a)$ is the state-transition probability to state $s'$ from state $s$ when taking action $a$
$\mathcal{R}$ is a numerical reward
$\gamma$ is a discount factor, $0 \le \gamma \le 1$
A finite MDP has a finite number of states and actions.
The Valentine’s Dilemma
The final goal of the princess is to rescue her prince. However, there are obstacles on the way. Valentine's day is only ONCE a year, so she needs to be fast! For every step she gets a reward of -1, unless she meets a dragon and needs to fight it. Then the reward is -5.
Goals and Rewards
Goal: the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Reward signal: what we want to achieve, not how to achieve it.
Discounted Rewards
Episodic task: $T \in \mathbb{N}$, each episode ends in a terminal state. Continuing task: $T = \infty$.
Expected return: $G_t$ is some specific function of the reward sequence.
Episodic task: $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$
Continuing task: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
$0 \le \gamma \le 1$ is called the discount rate.
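A tiny Python sketch of the discounted return for a finite reward sequence; the reward values below are arbitrary examples, not taken from the slides.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards received after time t, discounted with gamma = 0.9.
print(discounted_return([-1, -1, -5, -1, 10], gamma=0.9))
```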
Unified Notation
Episodic task: $T \in \mathbb{N}$, each episode ends in a terminal state. Continuing task: $T = \infty$.
$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$
$T$ can be $\infty$, $0 \le \gamma \le 1$, but not both $T = \infty$ and $\gamma = 1$.
Myopic agent: $\gamma = 0$. Far-sighted agent: $\gamma \to 1$.
State Representations
[Figure: three example state representations, labelled Representation 1, Representation 2, Representation 3.]
A state can include sensory signals, abstract environmental information or even mental states. However, it should only contain information relevant for decision making.
The Valentine’s Dilemma - The Markov Property
Generally, the current response could depend on the entire past:
$$p(S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)$$
The Markov property assumes independence of the past given the present:
$$p(s', r \mid s, a) \doteq p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$
Markov Decision Processes
A Markov Decision Process is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, p(\cdot), \mathcal{R}, \gamma)$:
$\mathcal{S}$ is a finite set of possible states
$\mathcal{A}(S_t)$ is a finite set of actions available in state $S_t$
$p(s' \mid s, a)$ is the state-transition probability to state $s'$ from state $s$ when taking action $a$
$\mathcal{R}$ is a numerical reward
$\gamma$ is a discount factor, $0 \le \gamma \le 1$
Expected reward for a state–action pair:
$$r(s, a) \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
State-transition probabilities:
$$p(s' \mid s, a) \doteq p(S_{t+1} = s' \mid S_t = s, A_t = a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$
Expected reward for a state–action–next-state triple:
$$r(s, a, s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_{r \in \mathcal{R}} r \, p(s', r \mid s, a)}{p(s' \mid s, a)}$$
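As a sketch of how such a model might be stored and queried in code: the dictionary layout, names, and example numbers below are our own illustrative assumptions, not part of the slides.

```python
# p[(s, a)] is a list of (next_state, reward, probability) triples.
# The entries below are made-up numbers purely for illustration.
p = {
    ("Fighting", "Attacking"): [("Won", 10.0, 0.4), ("Smashed", -5.0, 0.6)],
    ("Fighting", "Sneaking"):  [("Won", 0.0, 0.9), ("Smashed", -5.0, 0.1)],
}

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * prob for _, r, prob in p[(s, a)])

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for nxt, _, prob in p[(s, a)] if nxt == s_next)

print(expected_reward("Fighting", "Attacking"))        # 10*0.4 + (-5)*0.6 = 1.0
print(transition_prob("Fighting", "Sneaking", "Won"))  # 0.9
```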
MDP Transition Graph - Encountering a Dragon
Figure: Transition graph and table. States: Sm = Smashed against the wall, Fi = Fighting, Wo = Won. Actions: A = Attacking, H = Hitting, S = Sneaking past the dragon. Edge labels: [p(s'|s,a), r(s,a,s')].
Value Functions
$\mathbb{E}_\pi[G_t]$ denotes the expectation of $G_t$ when following policy $\pi(a|s)$.
State–value function for policy $\pi$:
$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$
Action–value function for policy $\pi$:
$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]$$
Bellman Equation for State–Value Functions
Figure: Richard Ernest Bellman (August 26, 1920 - March 19, 1984)
$$\begin{aligned}
v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_t = s\right] \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma \, \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_{t+1} = s'\right] \right] \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right], \quad \forall s \in \mathcal{S}
\end{aligned}$$
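The last line can be turned directly into code. The sketch below computes one Bellman backup of $v_\pi$ for a single state, reusing the dictionary-of-transitions model from the earlier sketch; all names are illustrative assumptions.

```python
def bellman_backup(s, policy, p, v, gamma=0.9):
    """One application of the Bellman equation for v_pi at state s.
    policy[s] maps actions to probabilities; p[(s, a)] lists (s', r, prob) triples;
    v maps states to current value estimates (0 for unseen states)."""
    total = 0.0
    for a, pi_a in policy[s].items():
        for s_next, r, prob in p[(s, a)]:
            total += pi_a * prob * (r + gamma * v.get(s_next, 0.0))
    return total
```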
Bellman Equation for Action–Value functions
$$\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right] \\
&= \dots \\
&= \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma \sum_{a' \in \mathcal{A}} \pi(a'|s') \, q_\pi(s', a') \right], \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}
\end{aligned}$$
Backup Diagrams
(a) $$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right], \quad \forall s \in \mathcal{S}$$
(b) $$q_\pi(s, a) = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma \sum_{a' \in \mathcal{A}} \pi(a'|s') \, q_\pi(s', a') \right], \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$$
Optimal Value Function
We say that policy $\pi$ is better than $\pi'$, written $\pi \ge \pi'$, iff $v_\pi(s) \ge v_{\pi'}(s) \; \forall s \in \mathcal{S}$.
It is always the case that $\exists \pi : \pi \ge \pi' \; \forall \pi'$, where $\pi$ is the optimal policy $\pi_*$, and
$v_*(s) \doteq \max_\pi v_\pi(s), \; \forall s \in \mathcal{S}$ is the optimal state-value function,
$q_*(s, a) \doteq \max_\pi q_\pi(s, a), \; \forall s \in \mathcal{S}, \forall a \in \mathcal{A}(s)$ is the optimal action-value function.
Bellman Optimality Equation
$$\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_*(s, a) \\
&= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}_{\pi_*}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right] \\
&= \max_a \mathbb{E}_{\pi_*}\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a\right] \\
&= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]
\end{aligned}$$
Bellman Optimality Equation - Backup Diagrams
(a) $$v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$
(b) $$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right]$$
Introduction to Dynamic Programming
In general, Dynamic Programming techniques optimize subproblems of the main problem to reach a globally optimal solution. In the context of RL, Dynamic Programming is a collection of algorithms that can compute the optimal value function of a finite MDP given a perfect model of the environment.
Evaluating a Policy fi
We have a policy $\pi(a|s)$ and want to compute the value function $v_\pi(s), \; \forall s \in \mathcal{S}$. The Bellman equation can be solved directly:
$$v_\pi(s) \doteq \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
$$\begin{aligned}
v_\pi(s_1) &= c_{1,0} + c_{1,1} v_\pi(s_1) + c_{1,2} v_\pi(s_2) + c_{1,3} v_\pi(s_3) + \dots \\
v_\pi(s_2) &= c_{2,0} + c_{2,1} v_\pi(s_1) + c_{2,2} v_\pi(s_2) + c_{2,3} v_\pi(s_3) + \dots \\
v_\pi(s_3) &= c_{3,0} + c_{3,1} v_\pi(s_1) + c_{3,2} v_\pi(s_2) + c_{3,3} v_\pi(s_3) + \dots \\
v_\pi(s_4) &= \dots
\end{aligned}$$
If the MDP is not finite we are in trouble! A large number of states and actions also makes this approach infeasible: the computational complexity is $O(n^3)$, where $n$ is the number of states.
Policy Evaluation
Assume that the environment is a finite MDP. We can use an iterative approach:
$$v_{k+1}(s) \doteq \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$
This update uses an operation called a full backup, and the resulting algorithm is called Iterative Policy Evaluation. It converges to the fixed point $v_k = v_\pi$.
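A minimal Python sketch of iterative policy evaluation, using the same dictionary-based MDP model as in the earlier sketches; the function name, arguments, and threshold are illustrative assumptions.

```python
def policy_evaluation(states, policy, p, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation with full backups.
    states: iterable of states; policy[s]: {action: probability};
    p[(s, a)]: list of (next_state, reward, probability) triples."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (r + gamma * v.get(s_next, 0.0))
                for a, pi_a in policy[s].items()
                for s_next, r, prob in p[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```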
Iterative Policy Evaluation
Running Example
Shaded squares are terminal states. Actions that would take the agent off the grid leave it in the same state.
Running Example - "Random Policy"
$$v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$
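A Python sketch of this running example, assuming the standard 4x4 gridworld from Sutton and Barto (two shaded terminal corner states, reward -1 per step, equiprobable random policy); the slides do not spell out the exact grid, so the layout here is an assumption.

```python
# 4x4 gridworld: states 0..15, with 0 and 15 terminal. Reward -1 per step.
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, action):
    """Deterministic transition; moving off the grid leaves the state unchanged."""
    row, col = divmod(s, N)
    dr, dc = action
    nr, nc = row + dr, col + dc
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s

def evaluate_random_policy(gamma=1.0, sweeps=100):
    """Iterative policy evaluation of the equiprobable random policy."""
    v = [0.0] * (N * N)
    for _ in range(sweeps):
        new_v = list(v)
        for s in range(N * N):
            if s in TERMINAL:
                continue
            new_v[s] = sum(0.25 * (-1 + gamma * v[step(s, a)]) for a in ACTIONS)
        v = new_v
    return v

print([round(x, 1) for x in evaluate_random_policy()])
```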
Policy Improvement
We have a policy $\pi(a|s)$ but it is not optimal. How can we improve it?
Policy improvement theorem: if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, then the policy $\pi'$ must be as good as, or better than, $\pi$.
That is, it must obtain greater or equal expected return in all states: $v_{\pi'}(s) \ge v_\pi(s)$.
Proof of Policy Improvement Theorem
$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2})] \mid S_t = s] \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\
&= v_{\pi'}(s)
\end{aligned}$$
Greedy Policy
We have the state-value function $v_\pi(s), \; \forall s \in \mathcal{S}$, and greedily choose actions that maximize it.
$$\begin{aligned}
\pi'(s) &\doteq \arg\max_a q_\pi(s, a) \\
&= \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]
\end{aligned}$$
If the greedy policy $\pi'$ is as good as, but not better than, $\pi$, then $v_{\pi'} = v_\pi, \; \forall s \in \mathcal{S}$:
$$\begin{aligned}
v_{\pi'}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi'}(s') \right]
\end{aligned}$$
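A sketch of greedy policy improvement with respect to a value function, again using the illustrative dictionary-based model from the earlier sketches; `actions(s)` is assumed to return the actions available in state `s`.

```python
def greedy_policy(states, actions, p, v, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to v."""
    policy = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions(s):
            # One-step lookahead: q(s, a) under the current value estimate v.
            q = sum(prob * (r + gamma * v.get(s_next, 0.0))
                    for s_next, r, prob in p[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        policy[s] = best_a
    return policy
```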
Running Example
Policy Iteration
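As a concrete companion to the policy iteration scheme named in the title, here is a minimal Python sketch that alternates the `policy_evaluation` and `greedy_policy` routines from the earlier sketches; all names and the stopping rule are illustrative assumptions.

```python
def policy_iteration(states, actions, p, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    # Start from an arbitrary deterministic policy.
    policy = {s: next(iter(actions(s))) for s in states}
    while True:
        # Evaluate the current policy (wrap it as a stochastic policy).
        stochastic = {s: {policy[s]: 1.0} for s in states}
        v = policy_evaluation(states, stochastic, p, gamma)
        # Improve greedily; stop when nothing changes.
        new_policy = greedy_policy(states, actions, p, v, gamma)
        if new_policy == policy:
            return policy, v
        policy = new_policy
```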
Generalized Policy Iteration
Any method that interleaves the two processes of policy evaluation and policy improvement falls under the umbrella of generalized policy iteration. The two processes of policy evaluation and policy improvement can be seen as opposing forces that will agree on a single joint solution in the long run.
Value Iteration
Value iteration combines policy improvement and truncated policy evaluation steps.
$$\begin{aligned}
v_{k+1}(s) &\doteq \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right], \quad \forall s \in \mathcal{S}.
\end{aligned}$$
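A corresponding Python sketch of value iteration, with the same assumed model format as the earlier sketches and a simple convergence threshold; names and the threshold value are illustrative choices.

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    """Value iteration: combine a greedy max over actions with a one-step backup."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(prob * (r + gamma * v.get(s_next, 0.0))
                    for s_next, r, prob in p[(s, a)])
                for a in actions(s)
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v
```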
Value Iteration
Convergence and Termination
All methods presented up to here are only guaranteed to converge for $k \to \infty$.
However, often we get reasonable results by setting a convergence criterion such as $|V_{k+1}(s) - V_k(s)| < \theta$.
Dynamic programming methods scale polynomially in the number of states and actions. Therefore they are exponentially faster than any direct search in the policy space. On today's computers MDPs with millions of states can be solved with DP methods.
Limitations of MDPs
Take off the rose-tinted glasses: the real world is not a video game!
Limitations of MDPs
Circumvent the problem of high-dimensional state and action spaces by dividing your problem into subproblems.
Summary
In reinforcement learning we have an agent that interacts with its environment and receives rewards based on its decisions. The goal is to learn to choose actions that maximize the expected future reward.
states – states should contain all relevant information for making decisions
actions – an action brings you from state $s$ into state $s'$ according to $p(s'|s, a)$
rewards – an agent receives rewards for being in a state
policy – a policy is a stochastic rule for choosing actions as a function of states
Markov Decision Process – $(\mathcal{S}, \mathcal{A}, p(\cdot), \mathcal{R}, \gamma)$ + Markov property
value functions – $v_\pi(s)$ & $q_\pi(s, a)$ summarize the expected reward for following a policy $\pi$
policy evaluation – given policy $\pi(s)$ we iteratively compute $v_\pi(s) \; \forall s \in \mathcal{S}$
policy improvement – given $v_\pi(s)$, improve your policy $\pi(s)$, e.g. by being greedy
policy iteration – alternate between policy evaluation and policy improvement
value iteration – combine policy evaluation and policy improvement
Questions?
References
Sutton, Richard S. and Barto, Andrew G. (2016, in progress). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
Thanks!
Policy Evaluation - Proof Sketch
Assume we have converged at iteration $K$:
$$v_K(s) \doteq \mathbb{E}_\pi[R_{t+1} + \gamma v_{K-1}(S_{t+1}) \mid S_t = s]$$
$$v_K(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \underbrace{\mathbb{E}_\pi[v_{K-1}(s')]}_{\sum_a \pi(a|s') \sum_{s'', r} p(s'', r \mid s', a)\,[r + \gamma \mathbb{E}_\pi[v_{K-2}(s'')]]} \Big]$$
Since we follow $\pi$ in every step, we effectively approximate $v_\pi$.