Markov Decision Processes & Reinforcement Learning
Megan Smith, Lehigh University, Fall 2006
Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications
Stochastic Process
Quick definition: a random process
Often viewed as a collection of indexed random variables
Useful to us: a set of states, with probabilities of being in those states, indexed over time
We'll deal with discrete stochastic processes
http://en.wikipedia.org/wiki/Image:AAMarkov.jpg
Stochastic Process Example
Classic: Random Walk
Start at state X_0 at time t_0. At each time t_i, take a step Z_i where P(Z_i = -1) = p and P(Z_i = 1) = 1 - p. The state at time t_i is X_i = X_0 + Z_1 + ... + Z_i.
http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
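Below is a minimal simulation of this random walk, just to make the indexed-state view concrete. The step count and probability p are arbitrary illustration values.

```python
import random

def random_walk(p, n_steps, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i, with P(Z_i = -1) = p and P(Z_i = 1) = 1 - p."""
    states = [x0]
    for _ in range(n_steps):
        z = -1 if random.random() < p else 1
        states.append(states[-1] + z)
    return states

# Example: a 10-step walk starting at X_0 = 0 with p = 0.5.
print(random_walk(p=0.5, n_steps=10))
```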
Markov Property
Also thought of as the memoryless property
A stochastic process is said to have the Markov property if the probability of state X_{n+1} taking any given value depends only upon state X_n
Very much depends on the description of the states
Markov Property Example
Checkers:
Current state: the current configuration of the board
Contains all information needed for the transition to the next state
Thus, each configuration can be said to have the Markov property
Markov Chain
Discrete-time stochastic process with the Markov property
Industry example: Google's PageRank algorithm
Probability distribution representing the likelihood of random linking ending up on a given page
http://en.wikipedia.org/wiki/PageRank
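The full PageRank algorithm adds a damping factor and handling for dangling pages, but the underlying idea, the stationary distribution of a Markov chain over pages, can be sketched with simple power iteration. The 3-page link matrix below is a made-up example, not real PageRank data.

```python
def stationary_distribution(P, n_iter=100):
    """Power iteration: repeatedly push a probability distribution through the
    row-stochastic transition matrix P until it settles."""
    n = len(P)
    dist = [1.0 / n] * n                              # start from a uniform distribution
    for _ in range(n_iter):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

# Hypothetical 3-page web: row i holds the probabilities of following a link out of page i.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
print(stationary_distribution(P))  # long-run probability of a random surfer ending up on each page
```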
Markov Decision Process (MDP)
Discrete-time stochastic control process
Extension of Markov chains
Differences:
Addition of actions (choice)
Addition of rewards (motivation)
If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
Tuple (S, A, P(., .), R(.))
S -> state space
A -> action space
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a)
R(s) = immediate reward at state s
Goal is to maximize some cumulative function of the rewards
Finite MDPs have finite state and action spaces
Simple MDP Example
Recycling MDP Robot
Can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge the battery
Has two energy levels: high and low
Searching runs down the battery, waiting does not, and a depleted battery has a very low reward
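As a concrete data structure, the recycling robot can be written down as the (S, A, P, R) tuple from the previous slide. The probabilities and rewards below are invented placeholders (the original example parameterizes them), and rewards are keyed by state-action pair for readability; this is only a sketch of the shape of a finite MDP.

```python
# Recycling robot MDP sketch. P[s][a] lists (next_state, probability); R[(s, a)] is the reward.
# All numeric values are illustrative placeholders, not the values from the original example.
states = ["high", "low"]
actions = {"high": ["search", "wait"],
           "low":  ["search", "wait", "recharge"]}

P = {
    "high": {"search":   [("high", 0.7), ("low", 0.3)],
             "wait":     [("high", 1.0)]},
    "low":  {"search":   [("low", 0.6), ("high", 0.4)],   # battery may run out; robot gets rescued and recharged
             "wait":     [("low", 1.0)],
             "recharge": [("high", 1.0)]},
}

R = {("high", "search"): 2.0, ("high", "wait"): 1.0,
     ("low", "search"): -1.0, ("low", "wait"): 1.0, ("low", "recharge"): 0.0}
```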
Transition Graph
(figure: transition graph for the recycling robot, with state nodes and action nodes)
Solution to an MDP = Policy π
Gives the action to take from a given state, regardless of history
Two arrays indexed by state:
V is the value function, namely the expected discounted sum of rewards from following the policy
π is an array of the actions to be taken in each state (the policy)
2 basic steps:
1. V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')
2. π(s) := argmax_a Σ_{s'} P_a(s, s') V(s')
Variants
Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping
All are built from the same 2 basic steps:
1. V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')   (value function update)
2. π(s) := argmax_a Σ_{s'} P_a(s, s') V(s')   (policy update)
Value Iteration
V(s) := R(s) + γ max_a Σ_{s'} P_a(s, s') V(s')

 k | V_k(PU) | V_k(PF) | V_k(RU) | V_k(RF)
---|---------|---------|---------|--------
 1 |   0     |   0     |  10     |  10
 2 |   0     |   4.5   |  14.5   |  19
 3 |   2.03  |   8.55  |  18.55  |  24.18
 4 |   4.76  |  11.79  |  19.26  |  29.23
 5 |   7.45  |  15.30  |  20.81  |  31.82
 6 |  10.23  |  17.67  |  22.72  |  33.68
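A minimal value iteration sketch for a generic finite MDP, repeating the update above. The MDP format (P[s][a] as a list of (next_state, probability) pairs, R[s] as the state reward) and the discount factor are assumptions for illustration; the transition probabilities behind the PU/PF/RU/RF table are not given here, so this does not reproduce that exact table.

```python
def value_iteration(states, actions, P, R, gamma=0.9, n_sweeps=6):
    """P[s][a]: list of (next_state, probability); R[s]: immediate reward in state s."""
    V = {s: 0.0 for s in states}
    for k in range(1, n_sweeps + 1):
        # One sweep of the Bellman optimality update over all states.
        V = {s: R[s] + gamma * max(sum(prob * V[s2] for s2, prob in P[s][a])
                                   for a in actions[s])
             for s in states}
        print(k, {s: round(v, 2) for s, v in V.items()})   # one row of a V_k table per sweep
    return V
```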
Why So Interesting?
If the transition probabilities are known, this becomes a straightforward computational problem; however,
if the transition probabilities are unknown, then this is a problem for reinforcement learning.
Typical Agent
In reinforcement learning (RL), the agent observes a state and takes an action.
Afterward, the agent receives a reward.
Mission: Optimize Reward
Rewards are calculated in the environment
Used to teach the agent how to reach a goal state
Must signal what we ultimately want achieved, not necessarily subgoals
May be discounted over time
In general, we seek to maximize the expected return
Value Functions
V^π is a value function (How good is it to be in this state?)
V^π is the unique solution to its Bellman equation
Expresses the relationship between a state and its successor states
Bellman equation (state-value function for policy π):
V^π(s) = R(s) + γ Σ_a π(s, a) Σ_{s'} P_a(s, s') V^π(s')
Another Value Function
Q^π defines the value of taking action a in state s under policy π
Expected return starting from s, taking action a, and thereafter following policy π
(figure: backup diagrams for (a) V^π and (b) Q^π)
Action-value function for policy π
Dynamic Programming
Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP
The classical view is not so useful in practice, since we rarely have a perfect model of the environment
Provides the foundation for other methods
Not practical for large problems
DP Continued
Use value functions to organize and structure the search for good policies.
Turn Bellman equations into update rules.
Iterative policy evaluation using full backups
Policy Improvement
When should we change the policy?
If we pick a new action a from state s and thereafter follow the current policy π, and the resulting value is at least V^π(s), then always picking a from state s is a better policy overall.
Results from the policy improvement theorem
Policy Iteration
Continue improving the policy π and recalculating V^π
A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
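A sketch of the policy iteration loop: repeatedly evaluate the current policy (here by iterative sweeps of the value update) and then improve it greedily, stopping when the policy no longer changes. It uses the same assumed MDP format as the earlier sketches (P[s][a] lists (next_state, probability) pairs, R[s] is the state reward).

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=50):
    policy = {s: actions[s][0] for s in states}            # arbitrary initial policy
    while True:
        # Policy evaluation: approximate V for the current policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s] + gamma * sum(prob * V[s2] for s2, prob in P[s][policy[s]])
                 for s in states}
        # Policy improvement: act greedily with respect to V.
        new_policy = {s: max(actions[s],
                             key=lambda a: sum(prob * V[s2] for s2, prob in P[s][a]))
                      for s in states}
        if new_policy == policy:                           # no change -> converged
            return policy, V
        policy = new_policy
```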
Remember Value Iteration?
Used to truncate policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its sweeps.
Monte Carlo Methods
Requires only episodic experience, on-line or simulated
Based on averaging sample returns
Value estimates and policies are only changed at the end of each episode, not on a step-by-step basis
Policy Evaluation
Compute average returns as the episode runs
Two methods: first-visit and every-visit
First-visit is the most widely studied
(figure: first-visit MC method)
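A minimal first-visit Monte Carlo policy evaluation sketch. It assumes each episode is recorded as a list of (state, reward) pairs generated by following π, where the reward is the one received after leaving that state; that episode format is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Average the return observed from the first visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return G_t from each time step, working backwards through the episode.
        G, G_at = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            G_at.append(G)
        G_at.reverse()
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:                  # first visit to this state only
                seen.add(state)
                returns[state].append(G_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}   # V(s) = mean sampled return
```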
Estimation of Action Values
State values are not enough; without a model, we need action values as well
Q^π(s, a): expected return when starting in state s, taking action a, and thereafter following policy π
Exploration vs. exploitation
Exploring starts
Example Monte Carlo Algorithm
(figure: first-visit Monte Carlo, assuming exploring starts)
Another MC Algorithm
(figure: on-line, first-visit, ε-greedy MC without exploring starts)
Temporal-Difference Learning
Central and novel to reinforcement learning
Combines Monte Carlo and DP methods
Can learn from experience without a model, like MC
Updates estimates based on other learned estimates (bootstraps), like DP
TD(0)
Simplest TD method
Uses a sample backup from a single successor state or state-action pair, instead of the full backup of DP methods
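The TD(0) update rule (Sutton and Barto) is V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]. Below is a minimal tabular sketch; the env object with reset()/step(action) methods and the policy function are hypothetical stand-ins, not a specific library API.

```python
from collections import defaultdict

def td0(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0) prediction: estimate V for a fixed policy from experience."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)      # hypothetical environment interface
            # Bootstrap: move V(state) toward the one-step sample backup.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V
```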
SARSA: On-policy Control
Quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
Continually estimate Q^π while changing π
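A minimal SARSA sketch built from that quintuple: the next action a_{t+1} is chosen by the same ε-greedy policy being learned (on-policy), and Q(s_t, a_t) is moved toward r_{t+1} + γ Q(s_{t+1}, a_{t+1}). The environment interface and parameter values are placeholders.

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)              # hypothetical environment interface
            a2 = eps_greedy(s2)                    # on-policy: next action from the same policy
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```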
Q-Learning: Off-policy Control
The learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed
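Q-learning differs from SARSA only in the backup target: it updates toward r + γ max_a Q(s', a), regardless of which action the behaviour policy actually takes next, which is what makes it off-policy. A minimal sketch under the same hypothetical environment interface:

```python
import random
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy exploration.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)              # hypothetical environment interface
            # Off-policy target: greedy value of the next state.
            best_next = max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```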
Case Study
Job-shop Scheduling
Temporal and resource constraints
Find constraint-satisfying schedules of short duration
In its general form, NP-complete
NASA Space Shuttle Payload Processing Problem (SSPPP)
Schedule the tasks required for installation and testing of shuttle cargo bay payloads
Typical: 2-6 shuttle missions, each requiring 34-164 tasks
Zhang and Dietterich (1995, 1996; Zhang, 1996)
First successful instance of RL applied in plan-space:
states = complete plans
actions = plan modifications
SSPPP continued
States were an entire schedule
Two types of actions:
REASSIGN-POOL operators: reassign a resource to a different pool
MOVE operators: move a task to the first earlier or later time with satisfied resource constraints
Small negative reward for each step
Resource dilation factor (RDF) formula for rewarding the final schedule's duration
Even More SSPPP
Used TD(λ) to learn the value function
Actions selected by a decreasing ε-greedy policy with one-step lookahead
Function approximation used multilayer neural networks
Training generally took 10,000 episodes
Each resulting network represented a different scheduling algorithm, not a schedule for a specific instance!
RL and CBR
Example: CBR used to store various policies, and RL used to learn and modify those policies
Ashwin Ram and Juan Carlos Santamaría, 1993: Autonomous Robotic Control
Job shop scheduling: RL used to repair schedules, CBR used to determine which repair to make
Similar methods can be used for IDSS
References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
Stochastic Processes, www.hanoivn.net
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Markov_decision_process
Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.