R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Chapter 5: Monte Carlo Methods
Monte Carlo methods learn from complete sample returns
Only defined for episodic tasks
Monte Carlo methods learn directly from experience
On-line: no model necessary, and still attains optimality
Simulated: no need for a full model
Monte Carlo Policy Evaluation
Goal: learn V^π(s)
Given: some number of episodes under π which contain s
Idea: average returns observed after visits to s
Every-Visit MC: average returns for every time s is visited in an episode
First-visit MC: average returns only for first time s is visited in an episode
Both converge asymptotically
First-visit Monte Carlo Policy Evaluation
Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← empty list, for all s ∈ S

Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
        R ← return following the first occurrence of s
        Append R to Returns(s)
        V(s) ← average(Returns(s))
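As a sketch of how this boxed algorithm might look in code (not from the slides): `generate_episode(policy)` is an assumed helper that returns one episode as a list of (state, reward-that-followed) pairs, and `gamma` defaults to 1 for undiscounted episodic returns.

```python
from collections import defaultdict

def first_visit_mc(policy, generate_episode, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation (a sketch)."""
    returns = defaultdict(list)   # Returns(s): first-visit returns seen for s
    V = defaultdict(float)        # current value estimates

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        # Work backwards so G accumulates the return following each step.
        G = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[state] = G  # overwritten until the first visit remains
        for state, g in first_visit_return.items():
            returns[state].append(g)
            V[state] = sum(returns[state]) / len(returns[state])
    return V
```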
Blackjack example
Objective: have your card sum be greater than the dealer's without exceeding 21.
States (200 of them):
    current sum (12-21)
    dealer's showing card (ace-10)
    do I have a usable ace?
Reward: +1 for winning, 0 for a draw, -1 for losing
Actions: stick (stop receiving cards), hit (receive another card)
Policy: stick if my sum is 20 or 21, else hit
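For concreteness, here is a toy Python sketch of generating one episode under this fixed policy. It is an illustration under stated assumptions, not the book's code: it draws from an infinite deck and ignores naturals.

```python
import random

def draw():
    # Infinite deck: 1 = ace, 2-9, and 10 for ten/face cards.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (sum, usable_ace), counting one ace as 11 when that helps."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def blackjack_episode():
    """One episode under the fixed policy: stick on 20 or 21, else hit."""
    player, dealer = [draw(), draw()], [draw(), draw()]
    states = []
    while True:
        total, usable = hand_value(player)
        if total >= 12:                      # below 12, hitting is never a decision
            states.append((total, dealer[0], usable))
        if total >= 20:
            break                            # the fixed policy sticks
        player.append(draw())
        if hand_value(player)[0] > 21:
            return states, -1                # player busts
    while hand_value(dealer)[0] < 17:        # dealer's fixed rule: hit below 17
        dealer.append(draw())
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    if d > 21 or p > d:
        return states, +1
    return (states, 0) if p == d else (states, -1)
```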
Blackjack Value Functions
After many MC evaluations of visited states, the state-value function is well approximated.
The quantities Dynamic Programming requires are difficult to formulate here: for instance, with a given hand and a decision to stick, what is the expected return?
Backup Diagram for Monte Carlo
The entire episode is included, whereas DP uses only one-step transitions
Only one choice at each state (unlike DP which uses all possible transitions in one step)
Estimates for all states are independent so MC does not bootstrap (build on other estimates)
Time required to estimate one state does not depend on the total number of states
The Power of Monte Carlo
Example: Elastic Membrane (Dirichlet Problem)
How do we compute the shape of a membrane or bubble attached to a fixed frame?
Two Approaches
Relaxation: iterate on the grid and compute averages (like DP iterations)
Kakutani's algorithm, 1945: use many random walks and average the boundary-point values (like the MC approach)
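A minimal sketch of Kakutani's random-walk approach, assuming the caller supplies `is_boundary` and `boundary_value` to describe the frame:

```python
import random

def walk_estimate(start, boundary_value, is_boundary, n_walks=10000):
    """Estimate the membrane height at `start` by Kakutani's method:
    run random walks on the grid and average the boundary values
    at the points where the walks first exit."""
    total = 0.0
    for _ in range(n_walks):
        x, y = start
        while not is_boundary((x, y)):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_value((x, y))
    return total / n_walks
```

On a square frame clamped at height 1 along one edge and 0 elsewhere, for example, the average exit value approximates the membrane height at the starting point.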
Monte Carlo Estimation of Action Values (Q)
Monte Carlo is most useful when a model is not available: we want to learn Q*
Q^π(s, a): the average return starting from state s, taking action a, and thereafter following π
Converges asymptotically if every state-action pair is visited infinitely often
To ensure this, we must maintain exploration so that all state-action pairs continue to be visited
Exploring starts: every state-action pair has a nonzero probability of being the starting pair
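A minimal every-visit sketch of averaging returns per state-action pair; `generate_episode()` is an assumed helper that uses exploring starts and returns (state, action, reward) triples:

```python
from collections import defaultdict

def mc_q_estimation(generate_episode, num_episodes, gamma=1.0):
    """Every-visit MC estimation of Q(s, a) (a sketch)."""
    returns = defaultdict(list)
    Q = defaultdict(float)
    for _ in range(num_episodes):
        G = 0.0
        for state, action, reward in reversed(generate_episode()):
            G = gamma * G + reward   # return following (state, action)
            returns[(state, action)].append(G)
            Q[(state, action)] = (sum(returns[(state, action)])
                                  / len(returns[(state, action)]))
    return Q
```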
Monte Carlo Control (to approximate an optimal policy)
MC policy iteration: Policy evaluation by approximating Q using MC methods, followed by policy improvement
Policy improvement step: greedy with respect to Q (action-value) function – no model needed to construct greedy policy
Generalized Policy Iteration (GPI)
Convergence of MC Control
Policy improvement theorem tells us:
    Q^{π_k}(s, π_{k+1}(s)) = Q^{π_k}(s, argmax_a Q^{π_k}(s, a))
                           = max_a Q^{π_k}(s, a)
                           ≥ Q^{π_k}(s, π_k(s))
                           = V^{π_k}(s)

This assumes exploring starts and an infinite number of episodes for MC policy evaluation.
To solve the latter:
    update only to a given level of performance
    alternate between evaluation and improvement after each episode
Monte Carlo Exploring Starts
Fixed point is the optimal policy π*
A full proof of convergence is one of the fundamental open questions of RL
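A sketch of how the MC-ES loop might be coded, under the assumption that `random_start_episode(policy)` picks a random initial state-action pair and then follows `policy`, returning (state, action, reward) triples:

```python
from collections import defaultdict
import random

def mc_exploring_starts(actions, random_start_episode, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (a sketch)."""
    returns = defaultdict(list)
    Q = defaultdict(float)
    policy = defaultdict(lambda: random.choice(actions))  # arbitrary initial policy

    for _ in range(num_episodes):
        episode = random_start_episode(lambda s: policy[s])
        # Evaluation: first-visit returns for each (s, a) in the episode.
        G, first = 0.0, {}
        for s, a, r in reversed(episode):
            G = gamma * G + r
            first[(s, a)] = G                  # keeps the first-visit return
        for (s, a), g in first.items():
            returns[(s, a)].append(g)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            # Improvement: make the policy greedy w.r.t. Q at this state.
            policy[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return policy, Q
```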
Blackjack Example (continued)
Exploring starts: easy to enforce by generating initial state-action pairs randomly
Initial policy as described before (stick only at 20 or 21)
Initial action-value function equal to zero
On-policy Monte Carlo Control
On-policy: learn about or improve the policy currently being executed
How do we get rid of exploring starts?
Need soft policies: π(s, a) > 0 for all s and a
e.g. an ε-soft policy, with action-selection probabilities:

    ε/|A(s)| + 1 - ε   for the greedy action
    ε/|A(s)|           for each non-max action

Similar to GPI: move the policy towards the greedy policy (i.e., ε-greedy)
Converges to the best ε-soft policy
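The selection rule above, as a short sketch (a `Q` table keyed by (state, action) is an assumed representation):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick the greedy action with probability 1 - ε + ε/|A(s)|,
    and each other action with probability ε/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)                 # uniform over A(s)
    return max(actions, key=lambda a: Q[(state, a)])  # greedy choice
```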
On-policy MC Control
Learning about π while following π′
Off-policy Monte Carlo control
On-policy methods estimate the value of a policy while following it, so the policy that generates behavior and the policy being evaluated are identical. Off-policy methods assume separate behavior and estimation policies.
Behavior policy: generates behavior in the environment; may be randomized so that all actions are sampled
Estimation policy: the policy being learned about; may be deterministic (greedy)
Off-policy learning estimates one policy while following another
Averaging returns from the behavior policy requires that every action the estimation policy can select has nonzero probability under the behavior policy
May be slow to improve, since learning uses only the tail of each episode after the last nongreedy action
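As a concrete illustration, a minimal Python sketch of off-policy evaluation with weighted importance sampling (whose incremental form appears on the next slide). The helpers `generate_episode`, `pi(a, s)`, and `mu(a, s)` are assumptions, not from the slides; episodes follow the behavior policy mu, and mu(a, s) > 0 wherever pi(a, s) > 0.

```python
from collections import defaultdict

def off_policy_mc_eval(generate_episode, pi, mu, num_episodes, gamma=1.0):
    """Off-policy MC evaluation of pi from behavior mu,
    via weighted importance sampling (a sketch)."""
    V = defaultdict(float)
    W_sum = defaultdict(float)   # cumulative weights per state
    for _ in range(num_episodes):
        episode = generate_episode()   # (state, action, reward) triples under mu
        G, w = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            w *= pi(action, state) / mu(action, state)  # likelihood ratio
            if w == 0.0:
                break                  # pi would never take this action
            W_sum[state] += w
            V[state] += (w / W_sum[state]) * (G - V[state])
    return V
```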
Off-policy MC control
Incremental Implementation
MC can be implemented incrementally, which saves memory
Compute the weighted average of the returns:
Non-incremental (store all returns):

    V_n = ( Σ_{k=1}^{n} w_k R_k ) / ( Σ_{k=1}^{n} w_k )

Incremental equivalent:

    V_{n+1} = V_n + (w_{n+1} / W_{n+1}) (R_{n+1} - V_n)
    W_{n+1} = W_n + w_{n+1},  with V_0 = 0 and W_0 = 0
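The incremental update, as a tiny sketch (assumes each weight w > 0):

```python
class IncrementalWeightedAverage:
    """Maintain V_n = (Σ_k w_k R_k) / (Σ_k w_k) without storing returns."""

    def __init__(self):
        self.V = 0.0   # V_0 = 0
        self.W = 0.0   # W_0 = 0

    def update(self, R, w):
        # W_{n+1} = W_n + w_{n+1};  V_{n+1} = V_n + (w_{n+1}/W_{n+1})(R_{n+1} - V_n)
        self.W += w
        self.V += (w / self.W) * (R - self.V)
        return self.V
```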
Racetrack Exercise
States: grid squares, plus horizontal and vertical velocity
Actions: +1, -1, or 0 added to each velocity component
Velocity bounds: 0 < velocity < 5
Rewards: -1 on track, -5 off track
Only right turns allowed
Stochastic: 50% of the time the car moves 1 extra square up or right
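A hedged sketch of the stochastic step described above; the exact velocity bounds and off-track handling are left to the exercise, so treat the clipping here as one possible reading:

```python
import random

def step(position, velocity, accel):
    """One racetrack move: apply the chosen acceleration (each component
    in {-1, 0, +1}), clip velocity, then with probability 0.5 slide one
    extra square up or right (the slide's stochastic disturbance)."""
    vx = min(max(velocity[0] + accel[0], 0), 4)
    vy = min(max(velocity[1] + accel[1], 0), 4)
    x, y = position[0] + vx, position[1] + vy
    if random.random() < 0.5:
        dx, dy = random.choice([(1, 0), (0, 1)])   # extra square right or up
        x, y = x + dx, y + dy
    return (x, y), (vx, vy)
```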
Summary
MC has several advantages over DP:
    Can learn directly from interaction with the environment
    No need for full models
    No need to learn about ALL states
    Less harm from violations of the Markov property (later in the book)
MC methods provide an alternate policy evaluation process
One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)
No bootstrapping (as opposed to DP)