Page 1: Week 2: Markov Decision Processes

Week 2: Markov Decision Processes

Bolei Zhou

The Chinese University of Hong Kong

September 12, 2021


Page 2: Week 2: Markov Decision Processes

Announcement

1 TA Tutorial and Office Hour
  1 Thursday 17:45 - 18:45 at MMW (Mong Man Wai Building) 715
  2 There will be a tutorial in the week after each assignment is due; otherwise it will be a Q&A session
  3 This week: a short tutorial on how to use Jupyter Notebook and how to use Blackboard to submit assignments


Page 3: Week 2: Markov Decision Processes

Plan

1 Last Week
  1 Course overview
  2 Key elements of an RL agent: value, policy, model
2 This Time: Decision Making in MDP
  1 Markov Chain → Markov Reward Process (MRP) → Markov Decision Process (MDP)
  2 Policy evaluation in MDP
  3 Control in MDP: policy iteration and value iteration
  4 Improving dynamic programming
3 Textbook of Sutton and Barto: Chapter 3 and Chapter 4


Page 4: Week 2: Markov Decision Processes

Markov Decision Process (MDP)

1 A Markov Decision Process can model many real-world problems; it formally describes the framework of reinforcement learning
2 Under an MDP, the environment is fully observable
  1 Optimal control primarily deals with continuous MDPs
  2 Partially observable problems can be converted into MDPs


Page 5: Week 2: Markov Decision Processes

Define the Markov Models

Markov Processes

Markov Reward Processes (MRPs)

Markov Decision Processes (MDPs)


Page 6: Week 2: Markov Decision Processes

Markov Property

1 The history of states: ht = {s1, s2, s3, ..., st}
2 State st is Markovian if and only if:

p(st+1 | st) = p(st+1 | ht)   (1)

p(st+1 | st, at) = p(st+1 | ht, at)   (2)

3 “The future is independent of the past given the present”


Page 7: Week 2: Markov Decision Processes

Markov Process/Markov Chain

1 State transition matrix P specifies p(st+1 = s' | st = s):

P = [ P(s1|s1)  P(s2|s1)  ...  P(sN|s1) ]
    [ P(s1|s2)  P(s2|s2)  ...  P(sN|s2) ]
    [   ...       ...     ...     ...   ]
    [ P(s1|sN)  P(s2|sN)  ...  P(sN|sN) ]


Page 8: Week 2: Markov Decision Processes

Example of MP

1 Sample episodes starting from s3:
  1 s3, s4, s5, s6, s6
  2 s3, s2, s3, s2, s1
  3 s3, s4, s4, s5, s5
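To make the sampling concrete, here is a minimal Python sketch of drawing episodes from a Markov chain given its transition matrix. The 3-state matrix below is a made-up placeholder (the lecture's 7-state chain is defined in a figure, so its probabilities are not reproduced here):

```python
import numpy as np

def sample_episode(P, start_state, length, rng):
    """Sample a state sequence from a Markov chain.

    P[i, j] is the probability of moving from state i to state j.
    """
    states = [start_state]
    for _ in range(length - 1):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

# Hypothetical 3-state chain; each row must sum to 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.4, 0.4],
              [0.0, 0.3, 0.7]])
print(sample_episode(P, start_state=0, length=5, rng=np.random.default_rng(0)))
```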


Page 9: Week 2: Markov Decision Processes

Markov Reward Process (MRP)

1 A Markov Reward Process is a Markov Chain + reward
2 Definition of a Markov Reward Process (MRP)
  1 S is a (finite) set of states (s ∈ S)
  2 P is the dynamics/transition model that specifies P(st+1 = s' | st = s)
  3 R is a reward function, R(st = s) = E[rt | st = s]
  4 Discount factor γ ∈ [0, 1]
3 If there is a finite number of states, R can be represented as a vector


Page 10: Week 2: Markov Decision Processes

Example of MRP

Reward: +5 in s1, +10 in s7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]


Page 11: Week 2: Markov Decision Processes

Return and Value function

1 Definition of Horizon
  1 Maximum number of time steps in each episode/trajectory
  2 Can be infinite; otherwise it is called a finite Markov (reward) process
  3 Per game: 100 moves for Go, 80 moves for chess

2 Definition of Return
  1 Discounted sum of rewards from time step t to the horizon:

    Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + γ^3 Rt+4 + ... + γ^(T−t−1) RT

3 Definition of the state value function Vt(s) for an MRP
  1 Expected return from time t in state s:

    Vt(s) = E[Gt | st = s]
          = E[Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... + γ^(T−t−1) RT | st = s]

  2 Present value of accumulated future rewards


Page 12: Week 2: Markov Decision Processes

Why Discount Factor γ

1 Avoid infinite returns in cyclic Markov processes

2 Uncertainty about the future

3 If the reward is financial, immediate rewards may earn more interest than delayed rewards

4 Animal/human behaviour shows a preference for immediate reward
5 It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate
  1 γ = 0: only care about the immediate reward
  2 γ = 1: future reward counts the same as the immediate reward


Page 13: Week 2: Markov Decision Processes

Example of MRP

1 Reward: +5 in s1, +10 in s7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]

2 Sample returns G for 4-step episodes with γ = 1/2:
  1 return for s4, s5, s6, s7: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 10 = 1.25
  2 return for s4, s3, s2, s1: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 5 = 0.625
  3 return for s4, s5, s6, s6: 0

3 How to compute the value function? For example, the value of state s4 is V(s4) = E[Gt | st = s4]
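As a quick sanity check on the arithmetic above, a few lines of Python reproduce the γ = 1/2 returns (the reward sequences are read directly off the listed episodes):

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([0, 0, 0, 10], gamma=0.5))  # episode s4, s5, s6, s7 -> 1.25
print(discounted_return([0, 0, 0, 5], gamma=0.5))   # episode s4, s3, s2, s1 -> 0.625
```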


Page 14: Week 2: Markov Decision Processes

Computing the Value of a Markov Reward Process

1 Value function: the expected return starting from state s

V(s) = E[Gt | st = s] = E[Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... | st = s]

2 The MRP value function satisfies the following Bellman equation:

V(s) = R(s) + γ ∑_{s'∈S} P(s'|s) V(s')

where R(s) is the immediate reward and the second term is the discounted sum of future rewards.

3 Practice: derive the Bellman equation for V(s)
  1 Hint: V(s) = E[Rt+1 + γ E[Rt+2 + γ Rt+3 + γ^2 Rt+4 + ...] | st = s]


Page 15: Week 2: Markov Decision Processes

Understanding Bellman Equation

1 The Bellman equation describes the recursive relation between the values of states:

V(s) = R(s) + γ ∑_{s'∈S} P(s'|s) V(s')


Page 16: Week 2: Markov Decision Processes

Matrix Form of Bellman Equation for MRP

Therefore, we can express V(s) in matrix form:

[ V(s1) ]   [ R(s1) ]       [ P(s1|s1)  P(s2|s1)  ...  P(sN|s1) ] [ V(s1) ]
[ V(s2) ] = [ R(s2) ] + γ   [ P(s1|s2)  P(s2|s2)  ...  P(sN|s2) ] [ V(s2) ]
[  ...  ]   [  ...  ]       [   ...       ...     ...     ...   ] [  ...  ]
[ V(sN) ]   [ R(sN) ]       [ P(s1|sN)  P(s2|sN)  ...  P(sN|sN) ] [ V(sN) ]

V = R + γPV

1 Analytic solution for the value of an MRP: V = (I − γP)^{-1} R
  1 The matrix inverse takes O(N^3) for N states
  2 Only feasible for small MRPs
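The analytic solution is a one-liner in NumPy (solving the linear system is preferable to forming the inverse explicitly). R below follows the running example; the transition matrix is a hypothetical placeholder, since the chain's true probabilities are only given in the slide's figure:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve V = R + gamma * P V, i.e. V = (I - gamma * P)^{-1} R."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

R = np.array([5., 0., 0., 0., 0., 0., 10.])   # reward vector from the example
P = np.full((7, 7), 1.0 / 7.0)                # hypothetical transition matrix (rows sum to 1)
print(mrp_value_analytic(P, R, gamma=0.5))
```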


Page 17: Week 2: Markov Decision Processes

Iterative Algorithm for Computing the Value of an MRP

1 Dynamic Programming

2 Monte-Carlo evaluation

3 Temporal-Difference learning


Page 18: Week 2: Markov Decision Processes

Monte Carlo Algorithm for Computing the Value of an MRP

Algorithm 1 Monte Carlo simulation to calculate MRP value function

1: i ← 0, Gt ← 0
2: while i ≠ N do
3:   generate an episode starting from state s and time t
4:   using the generated episode, calculate the return g = ∑_{k=t}^{H−1} γ^(k−t) r_k
5:   Gt ← Gt + g, i ← i + 1
6: end while
7: Vt(s) ← Gt / N

1 For example, to calculate V(s4) we can generate many trajectories and then take the average of their returns:
  1 return for s4, s5, s6, s7: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 10 = 1.25
  2 return for s4, s3, s2, s1: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 5 = 0.625
  3 return for s4, s5, s6, s6: 0
  4 more trajectories ...
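A minimal Python sketch of Algorithm 1, assuming the MRP is given as a transition matrix P and reward vector R (hypothetical placeholders below) and that episodes are truncated at a fixed horizon H:

```python
import numpy as np

def mc_value(P, R, gamma, s, n_episodes=10000, horizon=20, seed=0):
    """Monte Carlo estimate of V(s): average the returns of sampled episodes."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[state]
            discount *= gamma
            state = rng.choice(len(P), p=P[state])
        total += g
    return total / n_episodes

R = np.array([5., 0., 0., 0., 0., 0., 10.])   # reward vector from the example
P = np.full((7, 7), 1.0 / 7.0)                # hypothetical transition matrix
print(mc_value(P, R, gamma=0.5, s=3))         # estimate of V(s4)
```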


Page 19: Week 2: Markov Decision Processes

Iterative Algorithm for Computing the Value of an MRP

Algorithm 2 Iterative algorithm to calculate MRP value function

1: for all states s ∈ S: V'(s) ← 0, V(s) ← ∞
2: while ||V − V'|| > ε do
3:   V ← V'
4:   for all states s ∈ S: V'(s) = R(s) + γ ∑_{s'∈S} P(s'|s) V(s')
5: end while
6: return V'(s) for all s ∈ S
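The same value can be computed with the iterative backup of Algorithm 2. A short sketch, again assuming the MRP is encoded as NumPy arrays (placeholders as before); for the same inputs it matches the analytic solution above up to the tolerance eps:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, eps=1e-8):
    """Iterate V' <- R + gamma * P V until ||V - V'|| falls below eps."""
    v = np.zeros(len(R))
    while True:
        v_new = R + gamma * P @ v
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new

R = np.array([5., 0., 0., 0., 0., 0., 10.])
P = np.full((7, 7), 1.0 / 7.0)                # hypothetical transition matrix
print(mrp_value_iterative(P, R, gamma=0.5))
```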


Page 20: Week 2: Markov Decision Processes

Markov Decision Process (MDP)

1 A Markov Decision Process is a Markov Reward Process with decisions
2 Definition of MDP
  1 S is a finite set of states
  2 A is a finite set of actions
  3 P^a is the dynamics/transition model for each action: P(st+1 = s' | st = s, at = a)
  4 R is a reward function, R(st = s, at = a) = E[rt | st = s, at = a]
  5 Discount factor γ ∈ [0, 1]
3 An MDP is a tuple (S, A, P, R, γ)
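For the code sketches in the rest of this note, it helps to fix one concrete (and entirely optional) encoding of the tuple (S, A, P, R, γ): NumPy arrays indexed by state and action. The tiny two-state, two-action MDP below is purely illustrative:

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [1.0, 0.0]    # action 0 in state 0: stay put
P[0, 1] = [0.2, 0.8]    # action 1 in state 0: mostly move to state 1
P[1, 0] = [0.9, 0.1]
P[1, 1] = [0.0, 1.0]

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
```

The later sketches assume P, R, and gamma in exactly this layout.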


Page 21: Week 2: Markov Decision Processes

Policy in MDP

1 Policy specifies what action to take in each state

2 Given a state, it specifies a distribution over actions

3 Policy: π(a|s) = P(at = a|st = s)

4 Policies are stationary (time-independent), At ∼ π(a|s) for any t > 0


Page 22: Week 2: Markov Decision Processes

Policy in MDP

1 Given a MDP (S ,A,P,R, γ) and a policy π

2 The state and reward sequence S1, R2, S2, R3, ... is a Markov reward process (S, Pπ, Rπ, γ), where

Pπ(s'|s) = ∑_{a∈A} π(a|s) P(s'|s, a)

Rπ(s) = ∑_{a∈A} π(a|s) R(s, a)


Page 23: Week 2: Markov Decision Processes

Comparison of MP/MRP and MDP


Page 24: Week 2: Markov Decision Processes

Value function for MDP

1 The state-value function vπ(s) of an MDP is the expected return starting from state s and following policy π

vπ(s) = Eπ[Gt | st = s]   (3)

2 The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π

qπ(s, a) = Eπ[Gt | st = s, At = a]   (4)

3 We have the following relation between vπ(s) and qπ(s, a):

vπ(s) = ∑_{a∈A} π(a|s) qπ(s, a)   (5)


Page 25: Week 2: Markov Decision Processes

Bellman Expectation Equation

1 The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state,

vπ(s) = Eπ[Rt+1 + γvπ(st+1)|st = s] (6)

2 The action-value function can similarly be decomposed

qπ(s, a) = Eπ[Rt+1 + γqπ(st+1,At+1)|st = s,At = a] (7)


Page 26: Week 2: Markov Decision Processes

Bellman Expectation Equation for V π and Qπ

vπ(s) = ∑_{a∈A} π(a|s) qπ(s, a)   (8)

qπ(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vπ(s')   (9)

Thus

vπ(s) = ∑_{a∈A} π(a|s) ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vπ(s') )   (10)

qπ(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) ∑_{a'∈A} π(a'|s') qπ(s', a')   (11)


Page 27: Week 2: Markov Decision Processes

Backup Diagram for V π

vπ(s) = ∑_{a∈A} π(a|s) ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vπ(s') )   (12)


Page 28: Week 2: Markov Decision Processes

Backup Diagram for Qπ

qπ(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) ∑_{a'∈A} π(a'|s') qπ(s', a')   (13)


Page 29: Week 2: Markov Decision Processes

Policy Evaluation

1 Evaluate the value of state given a policy π: compute vπ(s)

2 Also called (value) prediction


Page 30: Week 2: Markov Decision Processes

Example: Navigate the boat

Figure: Markov Chain/MRP: Go with river stream

Figure: MDP: Navigate the boat


Page 31: Week 2: Markov Decision Processes

Example: Policy Evaluation

1 Two actions: Left and Right

2 For all actions, reward: +5 in s1, +10 in s7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]

3 Let's take a deterministic policy π(s) = Left with γ = 0 for any state s; then what is the value of the policy?
  1 Vπ = [5, 0, 0, 0, 0, 0, 10] since γ = 0


Page 32: Week 2: Markov Decision Processes

Example: Policy Evaluation

1 R = [5, 0, 0, 0, 0, 0, 10]

2 Practice 1: Deterministic policy π(s) = Left with γ = 0.5 for any state s; what are the state values under this policy?

3 Practice 2: Stochastic policy P(π(s) = Left) = 0.5 and P(π(s) = Right) = 0.5 with γ = 0.5 for any state s; what are the state values under this policy?

4 Iteration t:

vπ_t(s) = ∑_a P(π(s) = a) ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vπ_{t−1}(s') )


Page 33: Week 2: Markov Decision Processes

Decision Making in Markov Decision Process (MDP)

1 Prediction (evaluate a given policy):
  1 Input: MDP <S, A, P, R, γ> and policy π, or MRP <S, Pπ, Rπ, γ>
  2 Output: value function vπ
2 Control (search for the optimal policy):
  1 Input: MDP <S, A, P, R, γ>
  2 Output: optimal value function v* and optimal policy π*
3 Prediction and control in an MDP can both be solved by dynamic programming


Page 34: Week 2: Markov Decision Processes

Dynamic Programming

Dynamic Programming is a very general solution method for problems which have two properties:

1 Optimal substructure
  1 The principle of optimality applies
  2 The optimal solution can be decomposed into subproblems
2 Overlapping subproblems
  1 Subproblems recur many times
  2 Solutions can be cached and reused

Markov decision processes satisfy both properties

1 Bellman equation gives recursive decomposition

2 Value function stores and reuses solutions


Page 35: Week 2: Markov Decision Processes

Prediction: Policy evaluation on MDP

1 Objective: evaluate a given policy π for an MDP

2 Output: the value function vπ under the policy

3 Solution: iteration on the Bellman expectation backup
4 Algorithm: synchronous backup
  1 At each iteration t + 1, update vt+1(s) from vt(s') for all states s ∈ S, where s' is a successor state of s:

vt+1(s) = ∑_{a∈A} π(a|s) ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vt(s') )   (14)

5 Convergence: v1 → v2 → ... → vπ
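A minimal sketch of the synchronous backup (14), assuming the array encoding used earlier (P of shape [S, A, S'], R of shape [S, A]) and a stochastic policy pi of shape [S, A]; the function name and tolerance are illustrative choices:

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi, eps=1e-8):
    """Iterate the Bellman expectation backup (14) until convergence."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) v(s')
        v_new = np.sum(pi * q, axis=1)   # average over actions with pi(a|s)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new

# e.g. a uniform random policy over two actions:
# v = policy_evaluation(P, R, gamma, pi=np.full((P.shape[0], P.shape[1]), 0.5))
```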


Page 36: Week 2: Markov Decision Processes

Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy

vt+1(s) = ∑_{a∈A} π(a|s) ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vt(s') )   (15)

Or, in the form of the induced MRP <S, Pπ, Rπ, γ>:

vt+1(s) = Rπ(s) + γ ∑_{s'∈S} Pπ(s'|s) vt(s')   (16)


Page 37: Week 2: Markov Decision Processes

Evaluating a Random Policy in the Small Gridworld

Example 4.1 in the Sutton RL textbook.

1 Undiscounted episodic MDP (γ = 1)

2 Nonterminal states 1, ..., 14

3 Two terminal states (two shaded squares)

4 Action leading out of grid leaves state unchanged, P(7|7, right) = 1

5 Reward is −1 until the terminal state is reached

6 Transition is deterministic given the action, e.g., P(6|5, right) = 1

7 Uniform random policy π(l |.) = π(r |.) = π(u|.) = π(d |.) = 0.25


Page 38: Week 2: Markov Decision Processes

Evaluating a Random Policy in the Small Gridworld

1 Iteratively evaluate the random policy


Page 39: Week 2: Markov Decision Processes

A live demo on policy evaluation

vπ(s) = ∑_{a∈A} π(a|s) ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vπ(s') )   (17)

1 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html


Page 40: Week 2: Markov Decision Processes

Practice: Gridworld

Textbook Example 3.5: GridWorld


Page 41: Week 2: Markov Decision Processes

MDP Control

1 Compute the optimal policy

π*(s) = arg max_π vπ(s)   (18)

2 The optimal policy for an MDP in an infinite-horizon problem (the agent acts forever) is
  1 Deterministic
  2 Stationary (does not depend on the time step)
  3 Unique? Not necessarily; there may be state-actions with identical optimal values


Page 42: Week 2: Markov Decision Processes

Optimal Value Function

1 The optimal state-value function v*(s) is the maximum value function over all policies:

v*(s) = max_π vπ(s)

2 The optimal policy:

π*(s) = arg max_π vπ(s)

3 An MDP is "solved" when we know the optimal value

4 There exists a unique optimal value function, but there can be multiple optimal policies (e.g., two actions that have the same optimal value function)


Page 43: Week 2: Markov Decision Processes

Finding Optimal Policy

1 An optimal policy can be found by maximizing over q*(s, a):

π*(a|s) = 1 if a = arg max_{a∈A} q*(s, a), and 0 otherwise

2 There is always a deterministic optimal policy for any MDP

3 If we know q∗(s, a), we immediately have the optimal policy


Page 44: Week 2: Markov Decision Processes

Policy Search

1 One option is to search for the best policy by enumerating all policies

2 Number of deterministic policies is |A|^|S|

3 Other approaches such as policy iteration and value iteration are more efficient


Page 45: Week 2: Markov Decision Processes

Improving a Policy through Policy Iteration

1 Iterate through the two steps:
  1 Evaluate the policy π (computing v given the current π)
  2 Improve the policy by acting greedily with respect to vπ:

π′ = greedy(vπ) (19)


Page 46: Week 2: Markov Decision Processes

Policy Improvement

1 Compute the state-action value of a policy π:

q_{π_i}(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v_{π_i}(s')   (20)

2 Compute the new policy π_{i+1} for all s ∈ S following

π_{i+1}(s) = arg max_a q_{π_i}(s, a)   (21)
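Equations (20) and (21), alternated with policy evaluation, give the standard policy-iteration loop. A sketch under the same array encoding as before (an illustrative implementation, not the course's reference code); here the evaluation step solves the linear system exactly rather than iterating:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)           # deterministic policy pi_i
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v for the current policy.
        P_pi = P[np.arange(n_states), policy]        # shape [S, S']
        R_pi = R[np.arange(n_states), policy]        # shape [S]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: Eq. (20) then Eq. (21).
        q = R + gamma * P @ v                        # shape [S, A]
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v                         # policy is stable: done
        policy = new_policy
```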


Page 47: Week 2: Markov Decision Processes

Monotonic Improvement in Policy

1 Consider a deterministic policy a = π(s)
2 We improve the policy through

π'(s) = arg max_a qπ(s, a)

3 This improves the value from any state s over one step,

qπ(s, π'(s)) = max_{a∈A} qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)

4 It therefore improves the value function, vπ'(s) ≥ vπ(s):

vπ(s) ≤ qπ(s, π'(s)) = Eπ'[Rt+1 + γ vπ(St+1) | St = s]
      ≤ Eπ'[Rt+1 + γ qπ(St+1, π'(St+1)) | St = s]
      ≤ Eπ'[Rt+1 + γ Rt+2 + γ^2 qπ(St+2, π'(St+2)) | St = s]
      ≤ Eπ'[Rt+1 + γ Rt+2 + ... | St = s] = vπ'(s)


Page 48: Week 2: Markov Decision Processes

Monotonic Improvement in Policy

1 If improvements stop,

qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)

2 Thus the Bellman optimality equation has been satisfied:

vπ(s) = max_{a∈A} qπ(s, a)

3 Therefore vπ(s) = v∗(s) for all s ∈ S, so π is an optimal policy


Page 49: Week 2: Markov Decision Processes

Bellman Optimality Equation

1 The optimal value functions satisfy the Bellman optimality equations:

v*(s) = max_a q*(s, a)

q*(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v*(s')

thus

v*(s) = max_a ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v*(s') )

q*(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) max_{a'} q*(s', a')


Page 50: Week 2: Markov Decision Processes

Value Iteration: Turning the Bellman Optimality Equation into an Update Rule

1 Suppose we know the solution to the subproblems v*(s'), which is optimal.

2 Then the optimal v*(s) can be found by iterating the following Bellman optimality backup rule:

v(s) ← max_{a∈A} ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s') )

3 The idea of value iteration is to apply these updates iteratively


Page 51: Week 2: Markov Decision Processes

Algorithm of Value Iteration

1 Objective: find the optimal policy π

2 Solution: iteration on the Bellman optimality backup
3 Value Iteration algorithm:
  1 initialize k = 1 and v0(s) = 0 for all states s
  2 for k = 1 : H
    1 for each state s:

      qk+1(s, a) = R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vk(s')   (22)

      vk+1(s) = max_a qk+1(s, a)   (23)

    2 k ← k + 1

4 To retrieve the optimal policy after value iteration:

π(s) = arg max_a ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) vk+1(s') )   (24)
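A sketch of the value-iteration loop (22)-(24) in the same array encoding; stopping when v stops changing (rather than after a fixed horizon H) is an equally common variant:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Bellman optimality backups until v converges, then extract the policy."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v          # Eq. (22): q[s, a]
        v_new = np.max(q, axis=1)      # Eq. (23): v[s] = max_a q[s, a]
        if np.max(np.abs(v_new - v)) < eps:
            break
        v = v_new
    policy = np.argmax(R + gamma * P @ v_new, axis=1)   # Eq. (24): policy extraction
    return policy, v_new
```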


Page 52: Week 2: Markov Decision Processes

Example: Shortest Path

After the optimal values are reached, we run policy extraction to retrieve the optimal policy.


Page 53: Week 2: Markov Decision Processes

Difference between Policy Iteration and Value Iteration

1 Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.

2 Value iteration includes: finding the optimal value function + one policy extraction. The two are not repeated, because once the value function is optimal, the policy extracted from it is also optimal (i.e. converged).

3 Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).


Page 54: Week 2: Markov Decision Processes

Summary for Prediction and Control in MDP

Table: Dynamic Programming Algorithms

Problem    | Bellman Equation             | Algorithm
Prediction | Bellman Expectation Equation | Iterative Policy Evaluation
Control    | Bellman Expectation Equation | Policy Iteration
Control    | Bellman Optimality Equation  | Value Iteration


Page 55: Week 2: Markov Decision Processes

Demo of policy iteration and value iteration

1 Policy iteration: iteration of policy evaluation and policy improvement (update)

2 Value iteration

3 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html


Page 56: Week 2: Markov Decision Processes

Policy iteration and value iteration on FrozenLake

1 https://github.com/cuhkrlcourse/RLexample/tree/master/MDP


Page 57: Week 2: Markov Decision Processes

Improving Dynamic Programming

1 A major drawback of the DP methods is that they involve operations over the entire state set of the MDP, that is, they require sweeps of the state set.

2 If the state set is very large this becomes impractical: for example, the game of backgammon has over 10^20 states, so a single sweep would take thousands of years.

3 Asynchronous DP algorithms are in-place iterative DP methods that are not organized in terms of systematic sweeps of the state set.

4 The values of some states may be updated several times before the values of others are updated once.


Page 58: Week 2: Markov Decision Processes

Improving Dynamic Programming

Synchronous dynamic programming is usually slow. Three simple ideas extend DP to asynchronous dynamic programming:

1 In-place dynamic programming

2 Prioritized sweeping

3 Real-time dynamic programming


Page 59: Week 2: Markov Decision Processes

In-Place Dynamic Programming

1 Synchronous value iteration stores two copies of the value function:

for all s in S:
    v_new(s) ← max_{a∈A} ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v_old(s') )
v_old ← v_new

2 In-place value iteration stores only one copy of the value function:

for all s in S:
    v(s) ← max_{a∈A} ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s') )
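The only difference between the two variants is whether the backup reads from a separate old copy or from the array it is currently overwriting. A small sketch of the in-place version (same array encoding as the earlier snippets; the fixed sweep count is an illustrative simplification):

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, n_sweeps=1000):
    """Single-array value iteration: each backup immediately reuses the
    freshest values written earlier in the same sweep."""
    v = np.zeros(P.shape[0])
    for _ in range(n_sweeps):
        for s in range(P.shape[0]):
            v[s] = np.max(R[s] + gamma * P[s] @ v)   # in-place backup of state s
    return v
```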


Page 60: Week 2: Markov Decision Processes

Prioritized Sweeping

1 Use the magnitude of the Bellman error to guide state selection, e.g.

| max_{a∈A} ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s') ) − v(s) |

2 Back up the state with the largest remaining Bellman error

3 Update Bellman error of affected states after each backup

4 Can be implemented efficiently by maintaining a priority queue
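A compact and simplified sketch of the idea: keep states in a priority queue keyed by the magnitude of their Bellman error, always back up the state at the front, then refresh the priorities of its predecessors. Python's heapq is a min-heap, so priorities are stored negated, and stale heap entries are simply tolerated; this illustrates the mechanism rather than reproducing the textbook's full algorithm:

```python
import heapq
import numpy as np

def bellman_error(P, R, gamma, v, s):
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

def prioritized_sweeping(P, R, gamma, n_backups=1000, tol=1e-8):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    # predecessors[s'] = states that can reach s' under some action
    predecessors = [set(np.nonzero(P[:, :, sp].sum(axis=1))[0]) for sp in range(n_states)]
    heap = [(-bellman_error(P, R, gamma, v, s), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < tol:
            break                                    # largest remaining error is tiny
        v[s] = np.max(R[s] + gamma * P[s] @ v)       # Bellman optimality backup of s
        for pred in predecessors[s]:                 # re-prioritize affected states
            heapq.heappush(heap, (-bellman_error(P, R, gamma, v, pred), pred))
    return v
```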


Page 61: Week 2: Markov Decision Processes

Real-Time Dynamic Programming

1 To solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP

2 The agent's experience can be used to determine the states to which the DP algorithm applies its updates

3 We can apply updates to states as the agent visits them, focusing on the parts of the state set that are most relevant to the agent

4 After each time step (St, At), back up the state St:

v(St) ← max_{a∈A} ( R(St, a) + γ ∑_{s'∈S} P(s'|St, a) v(s') )


Page 62: Week 2: Markov Decision Processes

Sample Backups

1 Sample backups are the key design behind RL algorithms such as Q-learning and SARSA in the next lectures

2 They use sample rewards and sample transition pairs <S, A, R, S'>, rather than the reward function R and the transition dynamics P

3 Benefits:
  1 Model-free: no advance knowledge of the MDP is required
  2 Breaks the curse of dimensionality through sampling
  3 The cost of a backup is constant, independent of n = |S|


Page 63: Week 2: Markov Decision Processes

Approximate Dynamic Programming

1 Use a function approximator v(s, w)
2 Fitted value iteration repeats, at each iteration k:
  1 Sample states s from the state cache S and compute the targets

vk(s) = max_{a∈A} ( R(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s', wk) )

  2 Train the next value function v(s, w_{k+1}) using the targets <s, vk(s)>

3 This is the key idea behind Deep Q-Learning
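A minimal sketch of fitted value iteration with a linear approximator v(s, w) = φ(s)ᵀw fit by least squares. The feature matrix and the use of all states as the "cache" are illustrative assumptions; in Deep Q-Learning the linear fit is replaced by a neural network trained by gradient descent:

```python
import numpy as np

def fitted_value_iteration(P, R, gamma, features, n_iters=100):
    """features: [S, d] matrix whose rows are phi(s); returns the weights w."""
    w = np.zeros(features.shape[1])
    for _ in range(n_iters):
        v_approx = features @ w                              # v(s', w_k) for every state
        targets = np.max(R + gamma * P @ v_approx, axis=1)   # Bellman optimality targets v_k(s)
        # Fit w_{k+1} by least squares to the pairs <s, v_k(s)>.
        w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w

# With one-hot features (features = np.eye(n_states)) this reduces to ordinary value iteration.
```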


Page 64: Week 2: Markov Decision Processes

End

1 Summary: MDP, policy evaluation, policy iteration, and value iteration

2 Homework 0 is made available at https://github.com/cuhkrlcourse/ierg5350-assignment-2021

3 Homework 1 is made available at https://github.com/cuhkrlcourse/ierg5350-assignment-2021

4 TA session on using Jupyter Notebook and submitting assignments through the Notebook

5 Next Week: Model-free methods

6 Reading: Textbook Chapter 5 and Chapter 6


