Markov Decision Processes & Reinforcement Learning
Megan Smith, Lehigh University, Fall 2006
Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications
Stochastic Process
Quick definition: a random process
Often viewed as a collection of indexed random variables
Useful to us: a set of states, with probabilities of being in those states, indexed over time
We'll deal with discrete stochastic processes
http://en.wikipedia.org/wiki/Image:AAMarkov.jpg
Stochastic Process Example
Classic: Random Walk
Start at state X_0 at time t_0. At each time t_i, take a step Z_i where P(Z_i = -1) = p and P(Z_i = 1) = 1 - p. The state at time t_i is X_i = X_0 + Z_1 + ... + Z_i.
http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
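Below is a minimal simulation of this random walk, just to make the indexed-state view concrete. The step count and probability p are arbitrary illustration values.

```python
import random

def random_walk(p, n_steps, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i, with P(Z_i = -1) = p and P(Z_i = 1) = 1 - p."""
    states = [x0]
    for _ in range(n_steps):
        z = -1 if random.random() < p else 1
        states.append(states[-1] + z)
    return states

# Example: a 10-step walk starting at X_0 = 0 with p = 0.5.
print(random_walk(p=0.5, n_steps=10))
```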
Markov Property
Also thought of as the memoryless property
A stochastic process is said to have the Markov property if the probability of state X_{n+1} taking any given value depends only upon state X_n
Very much depends on the description of the states
Markov Property Example
Checkers:
Current state: the current configuration of the board
Contains all information needed for the transition to the next state
Thus, each configuration can be said to have the Markov property
Markov Chain
Discrete-time stochastic process with the Markov property
Industry example: Google's PageRank algorithm
Probability distribution representing the likelihood of random linking ending up on a given page
http://en.wikipedia.org/wiki/PageRank
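The full PageRank algorithm adds a damping factor and handling for dangling pages, but the underlying idea, the stationary distribution of a Markov chain over pages, can be sketched with simple power iteration. The 3-page link matrix below is a made-up example, not real PageRank data.

```python
def stationary_distribution(P, n_iter=100):
    """Power iteration: repeatedly push a probability distribution through the
    row-stochastic transition matrix P until it settles."""
    n = len(P)
    dist = [1.0 / n] * n                              # start from a uniform distribution
    for _ in range(n_iter):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

# Hypothetical 3-page web: row i holds the probabilities of following a link out of page i.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
print(stationary_distribution(P))  # long-run probability of a random surfer ending up on each page
```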
Markov Decision Process (MDP)
Discrete-time stochastic control process
Extension of Markov chains
Differences:
Addition of actions (choice)
Addition of rewards (motivation)
If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
Tuple (S, A, P(., .), R(.))
S -> state space
A -> action space
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a)
R(s) = immediate reward at state s
Goal is to maximize some cumulative function of the rewards
Finite MDPs have finite state and action spaces
Simple MDP Example
Recycling MDP Robot
Can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge the battery
Has two energy levels: high and low
Searching runs down the battery, waiting does not, and a depleted battery has a very low reward
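As a concrete data structure, the recycling robot can be written down as the (S, A, P, R) tuple from the previous slide. The probabilities and rewards below are invented placeholders (the original example parameterizes them), and rewards are keyed by state-action pair for readability; this is only a sketch of the shape of a finite MDP.

```python
# Recycling robot MDP sketch. P[s][a] lists (next_state, probability); R[(s, a)] is the reward.
# All numeric values are illustrative placeholders, not the values from the original example.
states = ["high", "low"]
actions = {"high": ["search", "wait"],
           "low":  ["search", "wait", "recharge"]}

P = {
    "high": {"search":   [("high", 0.7), ("low", 0.3)],
             "wait":     [("high", 1.0)]},
    "low":  {"search":   [("low", 0.6), ("high", 0.4)],   # battery may run out; robot gets rescued and recharged
             "wait":     [("low", 1.0)],
             "recharge": [("high", 1.0)]},
}

R = {("high", "search"): 2.0, ("high", "wait"): 1.0,
     ("low", "search"): -1.0, ("low", "wait"): 1.0, ("low", "recharge"): 0.0}
```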
Transition Graph
(figure: transition graph for the recycling robot, with state nodes and action nodes)
Solution to an MDP = Policy π
Gives the action to take from a given state, regardless of history
Two arrays indexed by state:
V is the value function, namely the expected discounted sum of rewards from following the policy
π is an array of the actions to be taken in each state (the policy)
2 basic steps:
1. V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')
2. π(s) := argmax_a Σ_{s'} P_a(s, s') V(s')
Variants
Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping
All are built from the same 2 basic steps:
1. V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')   (value function update)
2. π(s) := argmax_a Σ_{s'} P_a(s, s') V(s')   (policy update)
Value Iteration
V(s) := R(s) + γ max_a Σ_{s'} P_a(s, s') V(s')

 k | V_k(PU) | V_k(PF) | V_k(RU) | V_k(RF)
---|---------|---------|---------|--------
 1 |   0     |   0     |  10     |  10
 2 |   0     |   4.5   |  14.5   |  19
 3 |   2.03  |   8.55  |  18.55  |  24.18
 4 |   4.76  |  11.79  |  19.26  |  29.23
 5 |   7.45  |  15.30  |  20.81  |  31.82
 6 |  10.23  |  17.67  |  22.72  |  33.68
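A minimal value iteration sketch for a generic finite MDP, repeating the update above. The MDP format (P[s][a] as a list of (next_state, probability) pairs, R[s] as the state reward) and the discount factor are assumptions for illustration; the transition probabilities behind the PU/PF/RU/RF table are not given here, so this does not reproduce that exact table.

```python
def value_iteration(states, actions, P, R, gamma=0.9, n_sweeps=6):
    """P[s][a]: list of (next_state, probability); R[s]: immediate reward in state s."""
    V = {s: 0.0 for s in states}
    for k in range(1, n_sweeps + 1):
        # One sweep of the Bellman optimality update over all states.
        V = {s: R[s] + gamma * max(sum(prob * V[s2] for s2, prob in P[s][a])
                                   for a in actions[s])
             for s in states}
        print(k, {s: round(v, 2) for s, v in V.items()})   # one row of a V_k table per sweep
    return V
```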
Why So Interesting?
If the transition probabilities are known, this becomes a straightforward computational problem; however,
if the transition probabilities are unknown, then this is a problem for reinforcement learning.
Typical Agent
In reinforcement learning (RL), the agent observes a state and takes an action.
Afterward, the agent receives a reward.
Mission: Optimize Reward
Rewards are calculated in the environment
Used to teach the agent how to reach a goal state
Must signal what we ultimately want achieved, not necessarily subgoals
May be discounted over time
In general, we seek to maximize the expected return
Value Functions
V^π is a value function (How good is it to be in this state?)
V^π is the unique solution to its Bellman equation
Expresses the relationship between a state and its successor states
Bellman equation (state-value function for policy π):
V^π(s) = R(s) + γ Σ_a π(s, a) Σ_{s'} P_a(s, s') V^π(s')
Another Value Function
Q^π defines the value of taking action a in state s under policy π
Expected return starting from s, taking action a, and thereafter following policy π
(figure: backup diagrams for (a) V^π and (b) Q^π)
Action-value function for policy π
Dynamic Programming
Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP
The classical view is not so useful in practice, since we rarely have a perfect model of the environment
Provides the foundation for other methods
Not practical for large problems
DP Continued
Use value functions to organize and structure the search for good policies.
Turn Bellman equations into update rules.
Iterative policy evaluation using full backups
Policy Improvement
When should we change the policy?
If we pick a new action a from state s and thereafter follow the current policy π, and the resulting value is at least V^π(s), then always picking a from state s is a better policy overall.
Results from the policy improvement theorem
Policy Iteration
Continue improving the policy π and recalculating V^π
A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
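A sketch of the policy iteration loop: repeatedly evaluate the current policy (here by iterative sweeps of the value update) and then improve it greedily, stopping when the policy no longer changes. It uses the same assumed MDP format as the earlier sketches (P[s][a] lists (next_state, probability) pairs, R[s] is the state reward).

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=50):
    policy = {s: actions[s][0] for s in states}            # arbitrary initial policy
    while True:
        # Policy evaluation: approximate V for the current policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s] + gamma * sum(prob * V[s2] for s2, prob in P[s][policy[s]])
                 for s in states}
        # Policy improvement: act greedily with respect to V.
        new_policy = {s: max(actions[s],
                             key=lambda a: sum(prob * V[s2] for s2, prob in P[s][a]))
                      for s in states}
        if new_policy == policy:                           # no change -> converged
            return policy, V
        policy = new_policy
```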
Remember Value Iteration?
Used to truncate policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its sweeps.
Monte Carlo Methods
Requires only episodic experience, on-line or simulated
Based on averaging sample returns
Value estimates and policies are only changed at the end of each episode, not on a step-by-step basis
Policy Evaluation
Compute average returns as the episode runs
Two methods: first-visit and every-visit
First-visit is the most widely studied
(figure: first-visit MC method)
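A minimal first-visit Monte Carlo policy evaluation sketch. It assumes each episode is recorded as a list of (state, reward) pairs generated by following π, where the reward is the one received after leaving that state; that episode format is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Average the return observed from the first visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return G_t from each time step, working backwards through the episode.
        G, G_at = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            G_at.append(G)
        G_at.reverse()
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:                  # first visit to this state only
                seen.add(state)
                returns[state].append(G_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}   # V(s) = mean sampled return
```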
Estimation of Action Values
State values are not enough; without a model, we need action values as well
Q^π(s, a): expected return when starting in state s, taking action a, and thereafter following policy π
Exploration vs. exploitation
Exploring starts
Example Monte Carlo Algorithm
(figure: first-visit Monte Carlo, assuming exploring starts)
Another MC Algorithm
(figure: on-line, first-visit, ε-greedy MC without exploring starts)
Temporal-Difference Learning
Central and novel to reinforcement learning
Combines Monte Carlo and DP methods
Can learn from experience without a model, like MC
Updates estimates based on other learned estimates (bootstraps), like DP
TD(0)
Simplest TD method
Uses a sample backup from a single successor state or state-action pair, instead of the full backup of DP methods
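The TD(0) update rule (Sutton and Barto) is V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]. Below is a minimal tabular sketch; the env object with reset()/step(action) methods and the policy function are hypothetical stand-ins, not a specific library API.

```python
from collections import defaultdict

def td0(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0) prediction: estimate V for a fixed policy from experience."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)      # hypothetical environment interface
            # Bootstrap: move V(state) toward the one-step sample backup.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V
```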
SARSA: On-policy Control
Quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
Continually estimate Q^π while changing π
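A minimal SARSA sketch built from that quintuple: the next action a_{t+1} is chosen by the same ε-greedy policy being learned (on-policy), and Q(s_t, a_t) is moved toward r_{t+1} + γ Q(s_{t+1}, a_{t+1}). The environment interface and parameter values are placeholders.

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)              # hypothetical environment interface
            a2 = eps_greedy(s2)                    # on-policy: next action from the same policy
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```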
Q-Learning: Off-policy Control
The learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed
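Q-learning differs from SARSA only in the backup target: it updates toward r + γ max_a Q(s', a), regardless of which action the behaviour policy actually takes next, which is what makes it off-policy. A minimal sketch under the same hypothetical environment interface:

```python
import random
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy exploration.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)              # hypothetical environment interface
            # Off-policy target: greedy value of the next state.
            best_next = max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```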
Case Study
Job-shop Scheduling
Temporal and resource constraints
Find constraint-satisfying schedules of short duration
In its general form, NP-complete
NASA Space Shuttle Payload Processing Problem (SSPPP)
Schedule the tasks required for installation and testing of shuttle cargo bay payloads
Typical: 2-6 shuttle missions, each requiring 34-164 tasks
Zhang and Dietterich (1995, 1996; Zhang, 1996)
First successful instance of RL applied in plan-space:
states = complete plans
actions = plan modifications
SSPPP continued
States were an entire schedule
Two types of actions:
REASSIGN-POOL operators: reassign a resource to a different pool
MOVE operators: move a task to the first earlier or later time with satisfied resource constraints
Small negative reward for each step
Resource dilation factor (RDF) formula for rewarding the final schedule's duration
Even More SSPPP
Used TD(λ) to learn the value function
Actions selected by a decreasing ε-greedy policy with one-step lookahead
Function approximation used multilayer neural networks
Training generally took 10,000 episodes
Each resulting network represented a different scheduling algorithm, not a schedule for a specific instance!
RL and CBR
Example: CBR used to store various policies, and RL used to learn and modify those policies
Ashwin Ram and Juan Carlos Santamaría, 1993: Autonomous Robotic Control
Job shop scheduling: RL used to repair schedules, CBR used to determine which repair to make
Similar methods can be used for IDSS
References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
Stochastic Processes, www.hanoivn.net
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Markov_decision_process
Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.