Dynamic Programming: MDP

http://bicmr.pku.edu.cn/~wenzw/bigdata2020.html

Acknowledgement: these slides are based on OpenAI Spinning Up and Prof. Shipra Agrawal's lecture notes



Outline

1 Introduction

2 Introduction to MDP

3 Examples

4 Bellman Equation

5 Iterative algorithms (discounted reward case): Value Iteration, Q-value iteration, Policy iteration

6 RL Algorithms


What Can RL Do?

RL methods have recently enjoyed a wide variety of successes. For example, RL has been used to teach computers to control robots in simulation.


What Can RL Do?

It has also famously been used to create breakthrough AIs for sophisticated strategy games, most notably Go and Dota, to teach computers to play Atari games from raw pixels, and to train simulated robots to follow human instructions.

Go: https://deepmind.com/research/alphago

Dota: https://blog.openai.com/openai-five

play Atari games: https://deepmind.com/research/dqn/

to follow human instructions: https://blog.openai.com/deep-reinforcement-learning-from-human-preferences


Main characters of RL: agent and environment

environment: the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

agent: perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called the return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.


Key characteristics

Let's look at a few things that make reinforcement learning different from other paradigms of optimization and from other machine learning methods such as supervised learning.

Lack of a "supervisor": one of the main characteristics of RL is that there is no supervisor and no labels telling us the best action to take, only reward signals that reinforce some actions more than others. For example, for a robot trying to walk, there is no supervisor telling the robot whether the actions it took were a good way to walk, but it does get signals in the form of the effects of its actions, moving forward or falling down, which it can use to guide its behavior.


Key characteristics

Delayed feedback: another major distinction is that the feedback is often delayed: the effect of an action may not be entirely visible instantaneously, but it may severely affect the reward signal many steps later. In the robot example, an aggressive leg movement may look good right now because it seems to make the robot go quickly towards the target, but a sequence of limb movements later you may realize that the aggressive movement made the robot fall. This makes it difficult to attribute credit and reinforce a good move whose effect may be seen only many steps and many moves later. This is also referred to as the "credit assignment problem".

Sequential decisions: time really matters in RL; the "sequence" in which you make your decisions (moves) determines the path you take and hence the final outcome.


Key characteristics

Actions affect observations: finally, you can see from these examples that the observations or feedback that an agent receives during the course of learning are not really independent; in fact they are a function of the agent's own actions, which the agent may choose based on its past observations. This is very unlike other paradigms such as supervised learning, where the training examples are often assumed to be independent of each other, or at least independent of the learning agent's actions.

These are some key characteristics of reinforcement learning which differentiate it from other branches of learning, and which also make it a powerful model of learning for a variety of application domains.


Examples

Automated vehicle control: imagine trying to fly an unmanned helicopter, learning to perform stunts with it. Again, no one will tell you whether a particular way of moving the helicopter was good or not; you may get signals in the form of how the helicopter looks in the air and whether it is crashing, and in the end you might get a reward, for example a high reward for a good stunt and a negative reward for crashing. Using these signals and rewards, a reinforcement learning agent would learn a sequence of maneuvers just by trial and error.


Examples

Learning to play games: some of the most famous successes of reinforcement learning have been in playing games. You might have heard about Gerald Tesauro's reinforcement learning agent defeating the world Backgammon champion, or DeepMind's AlphaGo defeating the world's best Go player Lee Sedol using reinforcement learning. A team at Google DeepMind built an RL system that can learn to play a suite of Atari games from scratch just by playing the games again and again, trying out different strategies and learning from its own mistakes and successes. This RL agent uses just the pixels of the game screen as the state and the score increase as the reward, and is not even aware of the rules of the game to begin with!


Examples

Medical treatment planning: a slightly different but important application is medical treatment planning. Here the problem is to learn a sequence of treatments for a patient based on the reactions to past treatments and the current state of the patient. Again, while you observe reward signals in the form of the immediate effect of a treatment on the patient's condition, the final reward is whether the patient could be cured or not, and that can only be observed at the end. The trials are very expensive in this case and need to be carefully performed to achieve the most efficient learning possible.


Examples

Chatbots: another popular application is chatbots. You might have heard of Microsoft's chatbots Tay and Zo, or intelligent personal assistants like Siri, Google Now, Cortana, and Alexa. All these agents try to hold a conversation with a human user. What are conversations? They are essentially a sequence of sentences exchanged between two people. A bot trying to hold a conversation may receive encouraging signals periodically if it is making good conversation, or negative signals, sometimes in the form of the human user leaving the conversation or getting annoyed. A reinforcement learning agent can use these feedback signals to learn how to make good conversation just by trial and error, and after many conversations you may have a chatbot which has learned the right thing to say at the right moment!


Terminology

states and observations,

action spaces,

policies,

trajectories,

different formulations of return,

the RL optimization problem,

and value functions.


Tetris

Height: 12, width: 7
Rotate and move the falling shape
Gravity related to current height
Score when eliminating an entire level
Game over when reaching the ceiling


DP Model of Tetris

State: the current board, the current falling tile, and predictions of future tiles
Termination state: when the tiles reach the ceiling, the game is over with no more future reward
Action: rotation and shift
System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly
Uncertainty: randomness in future tiles
Transition cost g: if a level is cleared by the current action, score 1; otherwise score 0
Objective: expectation of total score


Interesting facts about Tetris

First released in 1984 by Alexey Pajitnov from the Soviet Union

Has been proved to be NP-complete.

Game will be over with probability 1.

For a 12 × 7 board, the number of possible states ≈ 2^{12×7} ≈ 10^{25}

Highest score achieved by human ≈ 1 million

Highest score achieved by an algorithm ≈ 35 million (average performance)


Outline

1 Introduction

2 Introduction to MDP

3 Examples

4 Bellman Equation

5 Iterative algorithms (discounted reward case): Value Iteration, Q-value iteration, Policy iteration

6 RL Algorithms


States and Observations

state: s is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation o is a partial description of a state, which may omit information.

In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.

We often write that the action is conditioned on the state, when in practice the action is conditioned on the observation, because the agent does not have access to the state.


Action Spaces

Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like those where the agent controls a robot in the physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors.

This distinction has some quite profound consequences for methods in deep RL. Some families of algorithms can only be directly applied in one case, and would have to be substantially reworked for the other.


Markov decision processes

The defining property of MDPs is the Markov property, which says that the future is independent of the past given the current state. This essentially means that the state in this model captures all the information from the past that is relevant in determining the future states and rewards.

A Markov Decision Process (MDP) is specified by a tuple (S, s_1, A, P, R, H), where S is the set of states, s_1 is the starting state, and A is the set of actions. The process proceeds in discrete rounds t = 1, 2, . . . , H, starting in the initial state s_1. In every round t the agent observes the current state s_t ∈ S, takes an action a_t ∈ A, and observes feedback in the form of a reward signal r_{t+1} ∈ R. The agent then observes the transition to the next state s_{t+1} ∈ S.
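To make the tuple concrete, the following is a minimal sketch of how such a finite MDP might be stored and sampled in Python (the array layout P[s, a, s'] and R[s, a], the class name, and the discount attribute are illustrative assumptions, not part of the slides):

import numpy as np

class TabularMDP:
    # P[s, a, s'] = Pr(next state s' | state s, action a)
    # R[s, a]     = expected immediate reward for taking action a in state s
    def __init__(self, P, R, s1=0, gamma=0.9):
        assert np.allclose(P.sum(axis=2), 1.0)   # each (s, a) row is a distribution
        self.P, self.R, self.s1, self.gamma = P, R, s1, gamma
        self.nS, self.nA = R.shape

    def step(self, s, a, rng=np.random.default_rng()):
        # Sample one transition: returns (reward, next state)
        s_next = rng.choice(self.nS, p=self.P[s, a])
        return self.R[s, a], s_next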


Formal definition

The probability of transitioning to a particular state depends only on the current state and action, and not on any other aspect of the history. The matrix P ∈ [0, 1]^{S×A×S} specifies these probabilities. That is,

Pr(s_{t+1} = s' | history till time t) = Pr(s_{t+1} = s' | s_t = s, a_t = a) = P(s, a, s')

The reward distribution depends only on the current state and action, so the expected reward at time t is a function of the current state and action. A matrix R specifies these rewards:

E[r_{t+1} | history till time t] = E[r_{t+1} | s_t = s, a_t = a] = R(s, a)

Let R(s, a, s') be the expected (or deterministic) reward when action a is taken in state s and a transition to state s' is observed. Then we can obtain the same model as above by defining

R(s, a) = E[r_{t+1} | s_t = s, a_t = a] = E_{s'∼P(s,a)}[R(s, a, s')]


Policy

A policy specifies what action to take at any time step. A history-dependent policy at time t is a mapping from the history up to time t to an action. A Markovian policy is a mapping from the state space to actions, π: S → A. Due to the Markov property of the MDP, it suffices to consider Markovian policies (in the sense that for any history-dependent policy, the same performance can be achieved by a Markovian policy). Therefore, in this text, policy refers to a Markovian policy.

A deterministic policy π: S → A is a mapping from any given state to an action. A randomized policy π: S → ∆_A is a mapping from any given state to a distribution over actions. Following a policy π_t at time t means that if the current state is s_t = s, the agent takes action a_t = π_t(s) (or a_t ∼ π_t(s) for a randomized policy). Following a stationary policy π means that π_t = π for all rounds t = 1, 2, . . .


Policy

Any stationary policy π defines a Markov chain, or rather a 'Markov reward process' (MRP), that is, a Markov chain with a reward associated with every transition. The transition probabilities and reward of this MRP in state s are given by Pr(s'|s) = P^π_{s,s'} and E[r_t|s] = r^π_s, where P^π is an S × S matrix and r^π is an S-dimensional vector defined as:

P^π_{s,s'} = E_{a∼π(s)}[P(s, a, s')], ∀ s, s' ∈ S
r^π_s = E_{a∼π(s)}[R(s, a)]

The stationary distribution (if it exists) of this Markov chain when starting from state s_1 is also referred to as the stationary distribution of the policy π, denoted by d^π:

d^π(s) = lim_{t→∞} Pr(s_t = s | s_1, π)
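In code, collapsing an MDP and a fixed policy into this MRP is a small reduction. A possible sketch (reusing the assumed P[s, a, s'] and R[s, a] arrays; pi is either a vector of action indices or a matrix of action probabilities, both illustrative conventions):

import numpy as np

def policy_mrp(P, R, pi):
    # Returns (P_pi, r_pi) with P_pi[s, s'] and r_pi[s] as defined above.
    nS, nA = R.shape
    if pi.ndim == 1:                         # deterministic policy: pi[s] = action index
        P_pi = P[np.arange(nS), pi]
        r_pi = R[np.arange(nS), pi]
    else:                                    # randomized policy: pi[s, a] = probability
        P_pi = np.einsum("sa,sat->st", pi, P)
        r_pi = (pi * R).sum(axis=1)
    return P_pi, r_pi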


Goals, finite horizon MDP

The tradeoff between the immediate reward and the future rewards of sequential decisions, and the need for planning ahead, is captured by the goal of the Markov Decision Process. At a high level, the goal is to maximize some form of cumulative reward. Some popular forms are total reward, average reward, or discounted sum of rewards.

Finite horizon MDP: actions are taken for t = 1, . . . , H, where H is a finite horizon. The total (discounted) reward criterion is simply to maximize the expected total (discounted) reward in an episode of length H. (In the reinforcement learning context, when this goal is used, the MDP is often referred to as an episodic MDP.) For discount 0 ≤ γ ≤ 1, the goal is to maximize

E[ ∑_{t=1}^{H} γ^{t-1} r_t | s_1 ]


Infinite horizon MDP

Expected total discounted reward criterion: the most popular form of cumulative reward is the expected discounted sum of rewards. This is an asymptotic weighted sum of rewards, where with time the weights decrease by a factor of γ < 1. This essentially means that immediate returns are more valuable than those far in the future.

lim_{T→∞} E[ ∑_{t=1}^{T} γ^{t-1} r_t | s_1 ]


Infinite horizon MDP

Expected total reward criterion: here, the goal is to maximize

lim_{T→∞} E[ ∑_{t=1}^{T} r_t | s_1 ]

The limit may not always exist or be bounded. We are only interested in cases where the above exists and is finite. This requires restrictions on the reward and/or transition models. An interesting case is when there is an undesirable state after which the reward is 0, for example the end of a computer game; the goal would then be to maximize the time to reach this state. (A minimization version of this model has a cost associated with each state and the goal is to minimize the time to reach a winning state; this is called the shortest path problem.)


Infinite horizon MDP

Expected average reward criterion: maximize

lim_{T→∞} E[ (1/T) ∑_{t=1}^{T} r_t | s_1 ]

Intuitively, the performance in a few initial rounds does not matter here; what we are looking for is good asymptotic performance. This limit may not always exist. Assuming bounded rewards and finite state spaces, it exists under some further conditions on the policy used.


Advantages of Discounted sum of rewards

It is mathematically convenient, as it is always finite and avoids the complications due to infinite returns. Practically, depending on the application, immediate rewards may indeed be more valuable.
Further, uncertainty about the far future is often not well understood, so you may not want to give as much weight to what you think you might earn far ahead in the future. The discounted reward criterion can also be seen as a soft version of a finite horizon, as the contribution of rewards many time steps later is very small.
As you will see later, the discounted reward MDP has many desirable properties for iterative algorithm design and learning. For these reasons, practical approaches that actually execute the MDP for a finite horizon often use policies, algorithms, and insights from the infinite horizon discounted reward setting.


Gain of the MDP

Gain (roughly the 'expected value objective' or formal goal) of an MDP when starting in state s_1 is defined as (when the supremum exists):

Episodic MDP:

J(s_1) = sup_{π_t} E[ ∑_{t=1}^{H} γ^{t-1} r_t | s_1 ]

Infinite horizon expected total reward:

J(s_1) = sup_{π_t} lim_{T→∞} E[ ∑_{t=1}^{T} r_t | s_1 ]

Infinite horizon discounted sum of rewards:

J(s_1) = sup_{π_t} lim_{T→∞} E[ ∑_{t=1}^{T} γ^{t-1} r_t | s_1 ]


Gain of the MDP

Infinite horizon average reward:

J(s_1) = sup_{π_t} lim_{T→∞} E[ (1/T) ∑_{t=1}^{T} r_t | s_1 ]

Here, the expectation is taken with respect to the state transition and reward distributions, and the supremum is taken over all possible sequences of policies for the given MDP. It is also useful to define the gain J^π of a stationary policy π, which is the expected (total / total discounted / average) reward when policy π is used in all time steps. For example, for the infinite horizon average reward:

J^π(s_1) = lim_{T→∞} E[ (1/T) ∑_{t=1}^{T} r_t | s_1 ]

where a_t = π(s_t), t = 1, . . . , T.


Optimal policy

The optimal policy is defined as the one that maximizes the gain of the MDP.
Due to the structure of the MDP, it is not difficult to show that it is sufficient to consider Markovian policies. Henceforth, we consider only Markovian policies.
For infinite horizon MDPs with the average/discounted reward criteria, a further observation that comes in handy is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists. That is, there always exists a fixed policy such that taking the actions specified by that policy at all time steps maximizes the average/discounted/total reward.
The agent does not need to change policies with time. This insight reduces the question of finding the best sequential decision making strategy to the question of finding the best stationary policy.


Optimal policy

The results below assume finite state and action spaces and bounded rewards.

Theorem 1 (Puterman [1994], Theorem 6.2.7). For any infinite horizon discounted MDP, there always exists a deterministic stationary policy π that is optimal.

Theorem 2 (Puterman [1994], Theorem 7.1.9). For any infinite horizon expected total reward MDP, there always exists a deterministic stationary policy π that is optimal.

Theorem 3 (Puterman [1994], Theorem 8.1.2). For any infinite horizon average reward MDP, there always exists a stationary (possibly randomized) policy that is optimal.


Optimal policy

Therefore, for infinite horizon MDPs, the optimal gain is

J*(s) = max_{π: Markovian stationary} J^π(s)

(the limit exists for stationary policies [Puterman, Proposition 8.1.1]).
These results imply that the optimal solution space is simpler in the infinite horizon case, and they make the infinite horizon an attractive model even when the actual problem has a finite but long horizon.
Even when such a result on the optimality of stationary policies is not available, 'finding the best stationary policy' is often used as a convenient and more tractable alternative objective, instead of finding the optimal policy, which may not exist or may not be stationary in general.


Outline

1 Introduction

2 Introduction to MDP

3 Examples

4 Bellman Equation

5 Iterative algorithms (discounted reward case): Value Iteration, Q-value iteration, Policy iteration

6 RL Algorithms


Example 1

Let's formulate a simple MDP for a robot moving on a line.
Let's say there are only three actions available to this robot: walk, run, or stay.
Walking involves a slow limb movement, which allows the robot to move one step without falling.
Running involves an aggressive limb movement, which may allow the robot to move two steps forward, but there is a 20% chance of falling.
Once the robot falls, it cannot get up. The goal is to move forward quickly and as much as possible without falling.


Example 1

We can model this as an MDP. We define the state of the robot as a combination of its stance, whether it is standing upright or has fallen down, denoted here as S or F, and its location on the line, represented here as 0, 1, 2, 3, 4, 5, . . .. So, for example, the state (S, 1) means that the robot is upright at location 1, whereas the state (F, 2) means that the robot has fallen down at location 2.
The robot starts in a standing state at the beginning of the line, that is, at state (S, 0).
The action space consists of three actions: walk, run, and stay.
The state transition depends on the current state and action. Walking in a standing state always transfers the robot to a standing state one step ahead. So, by walking the robot can always move forward by one step.


Example 1

On the other hand, by taking the second action of running, an aggressive limb movement, a robot in a standing state may move by 2 steps at a time, but there is also a 20% chance of falling and transitioning to a fallen state. In the fallen state no action has any effect; the robot is stuck. The stay action keeps the robot in the current state.
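As a concrete illustration, one possible encoding of these dynamics is sketched below (the reward of "number of steps moved" anticipates the reward choice discussed on a later slide; capping the location at 5 and the function name are assumptions made for the sketch):

ACTIONS = ["walk", "run", "stay"]
MAX_LOC = 5   # assumed end of the line

def robot_transitions(state, action):
    # Returns a list of (probability, next_state, reward) triples.
    # A state is (stance, location) with stance in {"S", "F"}.
    stance, loc = state
    if stance == "F" or action == "stay":
        return [(1.0, state, 0)]                   # fallen or staying: nothing happens
    if action == "walk":
        nxt = min(loc + 1, MAX_LOC)
        return [(1.0, ("S", nxt), nxt - loc)]      # always one step forward
    if action == "run":
        nxt = min(loc + 2, MAX_LOC)
        return [(0.8, ("S", nxt), nxt - loc),      # aggressive move succeeds
                (0.2, ("F", loc), 0)]              # 20% chance of falling
    raise ValueError(action)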


Example 1

As is often the case in applications of MDPs, or more generally reinforcement learning, the rewards and goals are not exogenously given but are an important part of the application modeling process. Different settings lead to different interpretations of the problem. Let's say the reward is the number of steps the agent moves as a result of an action in the current state.
Now, if the goal is set to be the total reward (infinite horizon), then the agent should just walk: the aim is to move as many steps as possible, so moving quickly is not important, and the robot should not take the risk of falling by running. The total reward is infinite.
But if the goal is set as the discounted reward, then it is also important to move more steps initially and gather reward quickly before the discount term becomes very small, so it may be useful to run (depending on how small the discount factor γ is).


Example 1

One can also set the reward to be 0 for all states except the final destination, say (S, 5), where the reward is 1. In that case, the discounted sum of rewards would be simply γ^τ if (S, 5) is reached at time τ, so the agent will want to minimize τ, the time to reach the end without falling, and therefore may want to move aggressively at times.
Another important point to note from this example is that in an MDP an action has long-term consequences. For instance, it may seem locally beneficial to run in state (S, 0) because, even with some chance of falling, gaining 2 steps with 80% probability means that the expected immediate reward is 0.8 × 2 = 1.6, which is still more than the expected immediate reward of 1 step that can be earned by walking; but that greedy approach ignores all the reward you can earn in the future if you don't fall.


Example 1

Finding an optimal sequential decision making strategy under this model therefore involves a careful tradeoff between immediate and future rewards.
Here are some examples of possible policies. Suppose you decide that whenever the robot is in a standing position, you will make the robot walk and not run; this is a stationary Markovian (deterministic) policy.
A more complex policy could be: whenever the robot is standing 2 or more steps away from the target location (5), make it walk; otherwise make it run. This is another stationary Markovian deterministic policy, which is conservative in states farther from the target and aggressive in states closer to the target.


Example 1

You can get a randomized policy by making it walk, run, or stay with some probability. In general you can change policies over time; the policy does not need to be stationary. Maybe initially you decided that you would always walk in a standing state, but later, after realizing that you are moving very slowly, you changed your mind and started running in all states. In this case the agent is using different policies at different time steps, i.e., a nonstationary policy.


Example 2: Inventory control problem

Each month the manager of a warehouse determines the current inventory (stock on hand) of a single product. Based on this information, she decides whether or not to order additional stock from a supplier. In doing so, she is faced with a trade-off between holding costs and the lost sales or penalties associated with being unable to satisfy customer demand for the product. The objective is to maximize some measure of profit over the decision-making horizon.
Demand is a random variable with a probability distribution known to the manager.


Example 2: Inventory control problem

Let s_t denote the inventory on hand at the beginning of the t-th time period, a_t the number of units ordered by the inventory manager in that period, and D_t the random demand during this time period. We assume that the demand has a known, time-homogeneous probability distribution p_j = Pr(D_t = j), j = 0, 1, . . .. The inventory at decision epoch t + 1, referred to as s_{t+1}, is related to the inventory at decision epoch t, s_t, through the system equation

s_{t+1} = max{s_t + a_t − D_t, 0} ≡ [s_t + a_t − D_t]^+

That backlogging is not allowed implies the non-negativity of the inventory level. Denote by O(u) the cost of ordering u units in any time period. Assuming a fixed cost K for placing orders and a variable cost c(u) that increases with the quantity ordered, we have

O(u) = [K + c(u)] 1{u > 0}


Example 2: Inventory control problem

Timing of events in an inventory model:


Example 2: Inventory control problem

The cost of maintaining an inventory of u units for a time period is represented by a nondecreasing function h(u). Finally, if the demand is j units and sufficient inventory is available to meet it, the manager receives revenue with present value f(j). In this model, the reward depends on the state of the system at the subsequent decision epoch, that is,

r_t(s_t, a_t, s_{t+1}) = −O(a_t) − h(s_t + a_t) + f(s_t + a_t − s_{t+1})

The goal of an inventory policy could be to maximize the expected total reward over a finite horizon, or the discounted reward if the firm cares more about the near future.
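To tie the pieces together, here is a small simulation sketch of one period of this inventory MDP (the specific K, c, h, f, and demand distribution below are illustrative choices; the slides only assume their general shapes):

import numpy as np

rng = np.random.default_rng(0)

K = 4                          # fixed ordering cost (assumed value)
c = lambda u: 2 * u            # variable ordering cost (assumed)
h = lambda u: 1 * u            # nondecreasing holding cost (assumed)
f = lambda j: 8 * j            # revenue for selling j units (assumed)
order_cost = lambda u: (K + c(u)) if u > 0 else 0

def inventory_step(s, a, demand_probs):
    # One decision epoch: returns (reward, next inventory level).
    D = rng.choice(len(demand_probs), p=demand_probs)    # random demand
    s_next = max(s + a - D, 0)                           # system equation
    reward = -order_cost(a) - h(s + a) + f(s + a - s_next)
    return reward, s_next

# Example: 2 units on hand, order 3, demand uniform on {0, ..., 4}
print(inventory_step(2, 3, demand_probs=np.ones(5) / 5))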


Outline

1 Introduction

2 Introduction to MDP

3 Examples

4 Bellman Equation

5 Iterative algorithms (discounted reward case): Value Iteration, Q-value iteration, Policy iteration

6 RL Algorithms


Value Functions

It's often useful to know the value of a state, or of a state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.

The On-Policy Value Function V^π(s), which gives the expected return if you start in state s and always act according to policy π:

V^π(s) = lim_{T→∞} E[ ∑_{t=1}^{T} γ^{t-1} r_t | s_1 = s ]

The On-Policy Action-Value Function Q^π(s, a), which gives the expected return if you start in state s, take an arbitrary action a (which may not have come from the policy), and then forever after act according to policy π:

Q^π(s, a) = lim_{T→∞} E[ ∑_{t=1}^{T} γ^{t-1} r_t | s_1 = s, a_1 = a ]


Value Functions

The Optimal Value Function V*(s), which gives the expected return if you start in state s and always act according to the optimal policy in the environment:

V*(s) = max_π V^π(s)

The Optimal Action-Value Function Q*(s, a), which gives the expected return if you start in state s, take an arbitrary action a, and then forever after act according to the optimal policy in the environment:

Q*(s, a) = max_π Q^π(s, a)


Value Functions

When we talk about value functions, if we do not make reference to time-dependence, we mean the expected infinite-horizon discounted return. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time's up?

There are two key connections between the value function and the action-value function that come up pretty often:

V^π(s) = E_{a∼π}[Q^π(s, a)],

and

V*(s) = max_a Q*(s, a).

These relations follow pretty directly from the definitions just given: can you prove them?


The Optimal Q-Function and the Optimal Action

There is an important connection between the optimal action-value function Q*(s, a) and the action selected by the optimal policy. By definition, Q*(s, a) gives the expected return for starting in state s, taking (arbitrary) action a, and then acting according to the optimal policy forever after.

The optimal policy in s will select whichever action maximizes the expected return from starting in s. As a result, if we have Q*, we can directly obtain the optimal action, a*(s), via

a*(s) = arg max_a Q*(s, a).

Note: there may be multiple actions which maximize Q*(s, a), in which case all of them are optimal, and the optimal policy may randomly select any of them. But there is always an optimal policy which deterministically selects an action.


Bellman Equations

All four of the value functions obey special self-consistency equations called Bellman equations. The basic idea is: the value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

The Bellman equations for the on-policy value functions are

V^π(s) = E_{a∼π, s'∼P}[ R(s, a, s') + γ V^π(s') ],
Q^π(s, a) = E_{s'∼P}[ R(s, a, s') + γ E_{a'∼π}[ Q^π(s', a') ] ],

where s' ∼ P is shorthand for s' ∼ P(·|s, a), indicating that the next state s' is sampled from the environment's transition rules; a ∼ π is shorthand for a ∼ π(·|s); and a' ∼ π is shorthand for a' ∼ π(·|s').


Proof of Bellman equations

Proof. V^π = R^π + γ P^π V^π:

V^π(s) = E[ r_1 + γ r_2 + γ^2 r_3 + γ^3 r_4 + . . . | s_1 = s ]
       = E[ r_1 | s_1 = s ] + γ E[ E[ r_2 + γ r_3 + γ^2 r_4 + . . . | s_2 ] | s_1 = s ]

The first term here is simply the expected reward in state s when the action is given by π(s). The second term is γ times the value function at s_2 ∼ P(s, π(s), ·):

V^π(s) = E[ R(s, π(s), s_2) + γ V^π(s_2) | s_1 = s ]
       = R(s, π(s)) + γ ∑_{s_2 ∈ S} P(s, π(s), s_2) V^π(s_2)
       = R^π(s) + γ [P^π V^π](s)
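Since V^π = R^π + γ P^π V^π is a linear system, exact policy evaluation for a tabular MDP reduces to one linear solve. A minimal sketch (assuming the P_pi, r_pi arrays built from a policy as in the earlier sketch, and γ < 1):

import numpy as np

def evaluate_policy(P_pi, r_pi, gamma):
    # Solve V = r_pi + gamma * P_pi @ V, i.e. V = (I - gamma * P_pi)^{-1} r_pi
    nS = len(r_pi)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)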


Bellman Optimality Equations

The Bellman equations for the optimal value functions are

V*(s) = max_a E_{s'∼P}[ R(s, a) + γ V*(s') ],
Q*(s, a) = E_{s'∼P}[ R(s, a) + γ max_{a'} Q*(s', a') ].

The crucial difference between the Bellman equations for the on-policy value functions and the optimal value functions is the absence or presence of the max over actions. Its inclusion reflects the fact that whenever the agent gets to choose its action, in order to act optimally it has to pick whichever action leads to the highest value.

The term "Bellman backup" comes up quite frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward-plus-next-value.


Proof of Bellman optimality equations

Proof. For all s, from the theorem ensuring a stationary optimal policy:

V*(s) = max_π V^π(s) = max_π E_{a∼π(s), s'∼P(s,a)}[ R(s, a, s') + γ V^π(s') ]
      ≤ max_a R(s, a) + γ ∑_{s'} P(s, a, s') max_π V^π(s')
      = max_a R(s, a) + γ ∑_{s'} P(s, a, s') V*(s')

Now, if the above inequality is strict, then the value of state s can be improved by using a (possibly non-stationary) policy that uses the maximizing action arg max_a { R(s, a) + γ ∑_{s'} P(s, a, s') V*(s') } in the first step. This is a contradiction to the definition of V*(s). Therefore,

V*(s) = max_a R(s, a) + γ ∑_{s'} P(s, a, s') V*(s')


Bellman optimality equations

Technically, the above only shows that V* satisfies the Bellman equations. Theorem 6.2.2 (c) in Puterman [1994] shows that V* is in fact the unique solution of these equations. Therefore, satisfying these equations is sufficient to guarantee optimality, and it is not difficult to see that the deterministic (stationary) policy

π*(s) = arg max_a R(s, a) + γ ∑_{s'} P(s, a, s') V*(s')

is optimal (see Puterman [1994], Theorem 6.2.7, for a formal proof).


Linear programming

The fixed point of the above Bellman optimality equations can be found by formulating a linear program (for any positive weights w). It amounts to:

min_{v ∈ R^S} ∑_s w_s v_s
s.t. v_s ≥ R(s, a) + γ P(s, a)^T v, ∀ a, s

Proof. V* clearly satisfies the constraints of the above LP. Next, we show that v = V* minimizes the objective function. The constraints imply that

v_s ≥ R(s, π*(s)) + γ P(s, π*(s))^T v, ∀ s

(The above is written assuming π* is deterministic, which is in fact true in the infinite horizon discounted reward case.) Or,

(I − γ P^{π*}) v ≥ R^{π*}


Proof

Because γ < 1, (I − γ P^π)^{-1} exists for all π, and for any u ≥ 0,

(I − γ P^π)^{-1} u = ( I + γ P^π + γ^2 (P^π)^2 + · · · ) u ≥ u

Therefore, from the above,

(I − γ P^{π*})^{-1} ( (I − γ P^{π*}) v − R^{π*} ) ≥ 0

Or,

v ≥ (I − γ P^{π*})^{-1} R^{π*} = V*

Therefore, w^T v for w > 0 is minimized by v = V*.
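For a tabular MDP this LP is small enough to hand to an off-the-shelf solver. A sketch using scipy.optimize.linprog (the P[s, a, s'] and R[s, a] layout is the assumption carried over from the earlier sketches; uniform weights w are used by default):

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma, w=None):
    # min w^T v  s.t.  v_s >= R(s, a) + gamma * P(s, a)^T v  for all (s, a)
    nS, nA = R.shape
    w = np.ones(nS) if w is None else w
    A_ub = np.zeros((nS * nA, nS))
    b_ub = np.zeros(nS * nA)
    for s in range(nS):
        for a in range(nA):
            row = gamma * P[s, a].copy()   # gamma * P(s, a)^T v - v_s <= -R(s, a)
            row[s] -= 1.0
            A_ub[s * nA + a] = row
            b_ub[s * nA + a] = -R[s, a]
    res = linprog(c=w, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
    return res.x   # approximately V*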


Outline

1 Introduction

2 Introduction to MDP

3 Examples

4 Bellman Equation

5 Iterative algorithms (discounted reward case): Value Iteration, Q-value iteration, Policy iteration

6 RL Algorithms


Value Iteration

An indirect method that finds the optimal value function (the value vector v above), not an explicit policy.

Pseudocode
Start with an arbitrary initialization v^0. Specify ε > 0.

Repeat for k = 1, 2, . . . until ‖v^k − v^{k−1}‖_∞ ≤ ε(1 − γ)/(2γ):

  for every s ∈ S, improve the value vector as:

  v^k(s) = max_{a∈A} R(s, a) + γ ∑_{s'} P(s, a, s') v^{k−1}(s')   (1)

Compute the optimal policy as

  π(s) ∈ arg max_a R(s, a) + γ P(s, a)^T v^k   (2)
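A minimal numpy sketch of this loop, assuming the tabular P[s, a, s'] and R[s, a] arrays used in the earlier sketches (the greedy policy of (2) is returned alongside the value vector):

import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    nS, nA = R.shape
    v = np.zeros(nS)
    while True:
        Q = R + gamma * P @ v              # Q[s, a] = R(s, a) + gamma * sum_s' P(s, a, s') v(s')
        v_new = Q.max(axis=1)              # update (1)
        if np.max(np.abs(v_new - v)) <= eps * (1 - gamma) / (2 * gamma):
            return v_new, Q.argmax(axis=1) # value vector and greedy policy (2)
        v = v_new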


Bellman operator

It is useful to represent the iterative step (1) using an operator L : R^S → R^S:

LV(s) := max_{a∈A} R(s, a) + γ ∑_{s'} P(s, a, s') V(s')
L^π V(s) := E_{a∼π(s)}[ R(s, a) + γ ∑_{s'} P(s, a, s') V(s') ]   (3)

Then (1) is the same as

v^k = L v^{k−1}   (4)

For any policy π, if V^π denotes its value function, then by the Bellman equations:

V* = L V*,   V^π = L^π V^π   (5)


Bellman operator

Below is a useful 'contraction' property of the Bellman operator, which underlies the convergence properties of all DP-based iterative algorithms.

Lemma 6. The operators L(·) and L^π(·) defined by (3) are contraction mappings, i.e.,

‖Lv − Lu‖_∞ ≤ γ ‖v − u‖_∞
‖L^π v − L^π u‖_∞ ≤ γ ‖v − u‖_∞


Proof of contraction

Proof. First assume Lv(s) ≥ Lu(s). Let a*_s = arg max_{a∈A} R(s, a) + γ ∑_{s'} P(s, a, s') v(s'). Then

0 ≤ Lv(s) − Lu(s)
  ≤ R(s, a*_s) + γ ∑_{s'} P(s, a*_s, s') v(s') − R(s, a*_s) − γ ∑_{s'} P(s, a*_s, s') u(s')
  = γ P(s, a*_s)^T (v − u)
  ≤ γ ‖v − u‖_∞

Repeating a symmetric argument for the case Lu(s) ≥ Lv(s) gives the lemma statement. A similar proof holds for L^π.


Convergence

Theorem 7 (Theorem 6.3.3, Section 6.3.2 in Puterman [1994]). The convergence rate of the above algorithm is linear at rate γ. Specifically,

‖v^k − V*‖_∞ ≤ (γ^k / (1 − γ)) ‖v^1 − v^0‖

Further, let π_k be the policy given by (2) using v^k. Then,

‖V^{π_k} − V*‖_∞ ≤ (2γ^k / (1 − γ)) ‖v^1 − v^0‖


Proof of Convergence

Proof. By the Bellman equations, V* = LV*:

‖V* − v^k‖_∞ = ‖LV* − v^k‖_∞
             ≤ ‖LV* − Lv^k‖_∞ + ‖Lv^k − v^k‖_∞
             = ‖LV* − Lv^k‖_∞ + ‖Lv^k − Lv^{k−1}‖_∞
             ≤ γ ‖V* − v^k‖ + γ ‖v^k − v^{k−1}‖
             ≤ γ ‖V* − v^k‖ + γ^k ‖v^1 − v^0‖

‖V* − v^k‖_∞ ≤ (γ^k / (1 − γ)) ‖v^1 − v^0‖

Let π = π_k be the policy at the end of k iterations. Then V^π = L^π V^π by the Bellman equations. Further, by the definition of π = π_k,

L^π v^k(s) = max_a R(s, a) + γ ∑_{s'} P(s, a, s') v^k(s') = L v^k(s)


Proof of Convergence

Therefore,

‖V^π − v^k‖_∞ = ‖L^π V^π − v^k‖_∞
              ≤ ‖L^π V^π − L^π v^k‖_∞ + ‖L^π v^k − v^k‖_∞
              = ‖L^π V^π − L^π v^k‖_∞ + ‖Lv^k − Lv^{k−1}‖_∞
              ≤ γ ‖V^π − v^k‖ + γ ‖v^k − v^{k−1}‖

‖V^π − v^k‖_∞ ≤ (γ / (1 − γ)) ‖v^k − v^{k−1}‖ ≤ (γ^k / (1 − γ)) ‖v^1 − v^0‖

Adding the two results above:

‖V^π − V*‖_∞ ≤ (2γ^k / (1 − γ)) ‖v^1 − v^0‖


Convergence

In the average reward case, the algorithm is similar, but the Bellman operator used to update the values is now LV(s) = max_a r_{s,a} + P(s, a)^T V. Also, here v^k converges to v* + c e for some constant c. Therefore, the stopping condition used is instead sp(v^k − v^{k−1}) ≤ ε, where sp(v) := max_s v_s − min_s v_s. That is, the span seminorm is used instead of the L_∞ norm. Further, since there is no discount (γ = 1), a condition on the transition matrix is required to prove convergence. Let

γ := max_{s,s',a,a'} [ 1 − ∑_{j∈S} min{ P(s, a, j), P(s', a', j) } ]

Then linear convergence with rate γ is guaranteed if γ < 1. This condition ensures that the Bellman operator in this case is still a contraction (with respect to the span seminorm). For more details, refer to Section 8.5.2 in Puterman [1994].


Q-value iteration

Q*(s, a): the expected utility of taking action a in state s, and thereafter acting optimally. Then V*(s) = max_a Q*(s, a). Therefore, the Bellman equations can be written as

Q*(s, a) = R(s, a) + γ ∑_{s'} P(s, a, s') ( max_{a'} Q*(s', a') )

Based on the above, a Q-value iteration algorithm can be derived.

Pseudocode
Start with an arbitrary initialization Q^0 ∈ R^{S×A}.
In every iteration k, improve the Q-value vector as:

Q^k(s, a) = R(s, a) + γ E_{s'}[ max_{a'} Q^{k−1}(s', a') | s, a ], ∀ s, a

Stop if ‖Q^k − Q^{k−1}‖_∞ is small.
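A corresponding numpy sketch of the Q-value iteration loop (same assumed array layout as before):

import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8):
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    while True:
        Q_new = R + gamma * P @ Q.max(axis=1)    # R(s, a) + gamma * E_s'[max_a' Q(s', a')]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new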


Policy iteration

Start with an arbitrary initialization of the policy π_1. The k-th policy iteration has two steps:

Policy evaluation: find v^k by solving v^k = L^{π_k} v^k, i.e.,

v^k(s) = E_{a∼π_k(s)}[ R(s, a) + γ ∑_{s'} P(s, a, s') v^k(s') ], ∀ s

Policy improvement: find π_{k+1} such that L^{π_{k+1}} v^k = L v^k, i.e.,

π_{k+1}(s) = arg max_a R(s, a) + γ E_{s'}[ v^k(s') | s, a ], ∀ s
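A compact sketch of policy iteration with exact evaluation by a linear solve (again assuming the tabular P[s, a, s'] and R[s, a] arrays; termination uses the fact that the greedy policy stops changing):

import numpy as np

def policy_iteration(P, R, gamma):
    nS, nA = R.shape
    pi = np.zeros(nS, dtype=int)                     # arbitrary initial policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi @ v
        P_pi = P[np.arange(nS), pi]
        R_pi = R[np.arange(nS), pi]
        v = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to v
        pi_new = (R + gamma * P @ v).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new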


Outline

1 Introduction

2 Introduction to MDP

3 Examples

4 Bellman Equation

5 Iterative algorithms (discounted reward case): Value Iteration, Q-value iteration, Policy iteration

6 RL Algorithms


Platform

Gym: a toolkit for developing and comparing reinforcement learning algorithms. Supports Atari and MuJoCo.
Universe: measuring and training an AI across the world's supply of games, websites, and other applications.
DeepMind Lab: a fully 3D game-like platform tailored for agent-based AI research.
ViZDoom: allows developing AI bots that play Doom using only visual information.


Platform

rllab: mainly supports TRPO, VPG, CEM, NPG.
Baselines: supports TRPO, PPO, DQN, A2C, ... (available on GitHub).
Implement your algorithms through these packages.


Gym

A simple implementation of sampling a single path with gym
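The original slide shows this as a code listing that is not reproduced in the transcript; a minimal sketch of what such a rollout can look like with the classic gym API is given below (the environment name and step limit are illustrative; newer gym/gymnasium versions use different reset/step signatures):

import gym

def sample_path(env_name="CartPole-v1", max_steps=200):
    # Sample a single trajectory with a uniformly random policy.
    env = gym.make(env_name)
    obs = env.reset()
    path = []                                 # list of (obs, action, reward)
    for _ in range(max_steps):
        action = env.action_space.sample()    # random action
        next_obs, reward, done, info = env.step(action)
        path.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    env.close()
    return path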


Environments

MuJoCo: continuous tasks.
A physics engine for detailed, efficient rigid body simulations with contacts.
Swimmer, Hopper, Walker, Reacher, ...
Action distribution: Gaussian.


Environments

Atari 2600: discrete action space.
Action distribution: categorical.


A Taxonomy of RL Algorithms

