MDPs


Transcript
  • Slide 1/40

    Markov Decision Processes & Reinforcement Learning
    Megan Smith, Lehigh University, Fall 2006

  • Slide 2/40

    Outline: Stochastic Process, Markov Property, Markov Chain, Markov Decision Process, Reinforcement Learning, RL Techniques, Example Applications

  • Slide 3/40

    Stochastic Process. Quick definition: a random process. Often viewed as a collection of indexed random variables. Useful to us: a set of states, with probabilities of being in those states, indexed over time.

    We'll deal with discrete stochastic processes.

    http://en.wikipedia.org/wiki/Image:AAMarkov.jpg

  • Slide 4/40

    Stochastic Process Example. Classic: the random walk.

    Start at state X_0 at time t_0. At each time t_i, move a step Z_i, where P(Z_i = -1) = p and P(Z_i = +1) = 1 - p. At time t_i, the state is X_i = X_0 + Z_1 + ... + Z_i.

    http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
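    A minimal simulation sketch of this walk (the function name, horizon, and default p are illustrative choices, not from the slides):

```python
import random

def random_walk(steps, p=0.5, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z_i = -1) = p, P(Z_i = +1) = 1 - p."""
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1  # draw the step Z_i
        x += z
        path.append(x)
    return path

print(random_walk(10))  # one sample trajectory of the walk
```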

  • Slide 5/40

    Markov Property. Also thought of as the memoryless property.

    A stochastic process is said to have the Markov property if the probability of state X_{n+1} having any given value depends only upon state X_n.

    Very much depends on the description of states.

  • Slide 6/40

    Markov Property Example. Checkers:

    Current state: the current configuration of the board.

    It contains all information needed for the transition to the next state. Thus, each configuration can be said to have the Markov property.

  • Slide 7/40

    Markov Chain: a discrete-time stochastic process with the Markov property.

    Industry example: Google's PageRank algorithm, a probability distribution representing the likelihood that random linking ends up on a given page. http://en.wikipedia.org/wiki/PageRank
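    A sketch of PageRank as the stationary distribution of a Markov chain, computed by power iteration; the three-page link structure, damping factor, and iteration count below are illustrative assumptions, not from the slides:

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on the 'random surfer' Markov chain over pages."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n      # teleportation mass
        for page, outlinks in enumerate(links):
            share = rank[page] / (len(outlinks) if outlinks else n)
            targets = outlinks if outlinks else range(n)
            for target in targets:           # follow outgoing links
                new[target] += damping * share
        rank = new
    return rank

# Page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
print(pagerank([[1, 2], [2], [0]]))
```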

  • Slide 8/40

    Markov Decision Process (MDP)

    Discrete-time stochastic control process. An extension of Markov chains. Differences:

    Addition of actions (choice). Addition of rewards (motivation).

    If the actions are fixed, an MDP reduces to a Markov chain.

  • Slide 9/40

    Description of MDPs. Tuple (S, A, P(., .), R(.)):

    S -> state space; A -> action space;
    P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a); R(s) = immediate reward at state s.

    The goal is to maximize some cumulative function of the rewards. Finite MDPs have finite state and action spaces.
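    A minimal sketch of the tuple as a data structure; the two-state, two-action example and the discount factor are made-up illustrations, not taken from the slides:

```python
class MDP:
    def __init__(self, states, actions, P, R, gamma=0.9):
        self.states = states    # S: state space
        self.actions = actions  # A: action space
        self.P = P              # P[a][s][s2] = Pr(s_{t+1} = s2 | s_t = s, a_t = a)
        self.R = R              # R[s] = immediate reward at state s
        self.gamma = gamma      # discount factor (an assumption, not part of the tuple above)

states = ["s0", "s1"]
actions = ["a0", "a1"]
P = {
    "a0": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "a1": {"s0": {"s0": 0.4, "s1": 0.6}, "s1": {"s0": 0.7, "s1": 0.3}},
}
R = {"s0": 0.0, "s1": 1.0}
mdp = MDP(states, actions, P, R)
```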

  • Slide 10/40

    Simple MDP Example: the recycling-robot MDP.

    The robot can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery. It has two energy levels, high and low. Searching runs down the battery, waiting does not, and a depleted battery has a very low reward.

    news.bbc.co.uk

  • Slide 11/40

  • Slide 12/40

    Transition Graph (figure): state nodes and action nodes.

  • Slide 13/40

    Solution to an MDP = Policy

    A policy gives the action to take from a given state, regardless of history. It is represented by two arrays indexed by state:

    V is the value function, namely the expected discounted sum of rewards from following the policy.

    π is an array of the actions to be taken in each state (the policy).

    V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')

    Two basic steps: a value update and a policy update.

  • Slide 14/40

    Variants: Value Iteration, Policy Iteration, Modified Policy Iteration, Prioritized Sweeping.

    The two basic steps:

    (1) Value update: V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')

    (2) Policy update: π(s) := argmax_a Σ_{s'} P_a(s, s') V(s')
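    A sketch of these two steps as tabular policy iteration; the P[a][s][s2] / R[s] layout matches the hypothetical MDP sketch above and is an assumption, not the slides' notation:

```python
def policy_iteration(states, actions, P, R, gamma=0.9, sweeps=50, eval_sweeps=20):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}
    for _ in range(sweeps):
        # Step 1: policy evaluation, repeated value updates under the current policy
        for _ in range(eval_sweeps):
            V = {s: R[s] + gamma * sum(P[pi[s]][s][s2] * V[s2] for s2 in states)
                 for s in states}
        # Step 2: greedy policy improvement
        pi = {s: max(actions,
                     key=lambda a: sum(P[a][s][s2] * V[s2] for s2 in states))
              for s in states}
    return pi, V

# e.g. policy_iteration(states, actions, P, R) using the hypothetical MDP above
```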

  • Slide 15/40

    Value Iteration

    V(s) := R(s) + γ max_a Σ_{s'} P_a(s, s') V(s')

    k | V_k(PU) | V_k(PF) | V_k(RU) | V_k(RF)
    1 |   0     |   0     |  10     |  10
    2 |   0     |   4.5   |  14.5   |  19
    3 |   2.03  |   8.55  |  18.55  |  24.18
    4 |   4.76  |  11.79  |  19.26  |  29.23
    5 |   7.45  |  15.30  |  20.81  |  31.82
    6 |  10.23  |  17.67  |  22.72  |  33.68
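    A sketch of the value-iteration backup on the same hypothetical P[a][s][s2] / R[s] layout as above; it does not reproduce the PU/PF/RU/RF numbers, since that slide's transition model is not in the transcript:

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=6):
    """Repeatedly apply V(s) := R(s) + gamma * max_a sum_{s'} P_a(s, s') * V(s')."""
    V = {s: 0.0 for s in states}
    history = [dict(V)]
    for _ in range(iters):
        V = {s: R[s] + gamma * max(sum(P[a][s][s2] * V[s2] for s2 in states)
                                   for a in actions)
             for s in states}
        history.append(dict(V))  # history plays the role of the V_k rows in the table
    return V, history
```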

  • Slide 16/40

    Why So Interesting?

    If the transition probabilities are known, this becomes a straightforward computational problem. However, if the transition probabilities are unknown, then this is a problem for reinforcement learning.

  • Slide 17/40

    Typical Agent

    In reinforcement learning (RL), the agent observes a state and takes an action.

    Afterward, the agent receives a reward.

  • Slide 18/40

    Mission: Optimize Reward

    Rewards are calculated in the environment and used to teach the agent how to reach a goal state. They must signal what we ultimately want achieved, not necessarily subgoals.

    Rewards may be discounted over time. In general, we seek to maximize the expected return.

  • Slide 19/40

    Value Functions. V^π is a value function (how good is it to be in this state?).

    V^π is the unique solution to its Bellman equation, which expresses the relationship between the value of a state and the values of its successor states.

    Bellman equation (state-value function for policy π):
    V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

  • Slide 20/40

    Another Value Function. Q^π defines the value of taking action a in state s under policy π: the expected return starting from s, taking action a, and thereafter following policy π.

    Backup diagrams for (a) V^π and (b) Q^π.

    Action-value function for policy π.

  • Slide 21/40

    Dynamic Programming

    Classically, dynamic programming is a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP. The classical view is not so useful in practice, since we rarely have a perfect environment model, but it provides the foundation for other methods.

    Not practical for large problems.

  • Slide 22/40

    DP Continued. Use value functions to organize and structure the search for good policies. Turn Bellman equations into update rules. Iterative policy evaluation using full backups.

  • Slide 23/40

    Policy Improvement

    When should we change the policy? If we pick a new action a from state s and thereafter follow the current policy π, and the resulting value is at least V^π(s), then picking a from state s gives a better policy overall.

    This results from the policy improvement theorem.

  • Slide 24/40

    Policy Iteration. Continue improving the policy π and recalculating V^π.

    A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations.

  • Slide 25/40

    Remember Value Iteration?

    It is used to truncate policy iteration by combining one sweep of policy evaluation and one of policy improvement in each of its sweeps.

  • Slide 26/40

    Monte Carlo Methods

    Requires only episodic experience, on-line or simulated. Based on averaging sample returns.

    Value estimates and policies are only changed at the end of each episode, not on a step-by-step basis.

  • Slide 27/40

    Policy Evaluation. Compute average returns as the episode runs. Two methods: first-visit and every-visit.

    First-visit is the most widely studied.

    First-visit MC method.
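    A sketch of first-visit Monte Carlo policy evaluation: average the return observed after the first visit to each state in each episode. Each episode is assumed here to be a list of (state, reward) pairs already generated by following the policy; that representation is an assumption, not the slide's pseudocode:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    returns = defaultdict(list)            # returns observed after first visits
    for episode in episodes:
        G = 0.0
        G_from = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G  # return following time step t
            G_from[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:          # first visit to this state only
                seen.add(state)
                returns[state].append(G_from[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}  # value estimate per state
```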

  • Slide 28/40

    Estimation of Action Values

    State values are not enough: without a model, we need action values as well.

    Q^π(s, a) is the expected return when starting in state s, taking action a, and thereafter following policy π.

    Exploration vs. exploitation. Exploring starts.

  • Slide 29/40

    Example Monte Carlo Algorithm

    First-visit Monte Carlo, assuming exploring starts.

  • Slide 30/40

    Another MC Algorithm

    On-line, first-visit, ε-greedy MC without exploring starts.

  • Slide 31/40

    Temporal-Difference Learning

    Central and novel to reinforcement learning. Combines Monte Carlo and DP methods: it can learn from experience without a model, like MC, and it updates estimates based on other learned estimates (it bootstraps), like DP.

  • Slide 32/40

    TD(0)

    Simplest TD method

    Uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods.
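    The TD(0) update as a sketch: after observing a transition (s, r, s'), nudge V(s) toward the one-step sample backup r + γ V(s'). The step size alpha and the dict representation of V are illustrative choices:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # V(s) := V(s) + alpha * (r + gamma * V(s') - V(s))
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```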

  • Slide 33/40

    SARSA: On-policy Control

    Built on the quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}).

    Continually estimate Q^π while changing the policy π.
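    A sketch of the SARSA update built from that quintuple; Q is assumed to be a dict keyed by (state, action), and alpha and gamma are illustrative:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Q(s, a) := Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```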

  • Slide 34/40

    Q-Learning: Off-policy Control

    The learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed.
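    A sketch of the Q-learning update, which backs up the greedy next action regardless of the action the behaviour policy actually takes; the dict-keyed Q, the explicit action set, alpha, and gamma are illustrative assumptions:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(s, a) := Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```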

  • Slide 35/40

    Case Study: Job-shop Scheduling

    Temporal and resource constraints. Find constraint-satisfying schedules of short duration. In its general form, NP-complete.

  • Slide 36/40

    NASA Space Shuttle Payload Processing Problem (SSPPP)

    Schedule the tasks required for installation and testing of shuttle cargo bay payloads. Typical: 2-6 shuttle missions, each requiring 34-164 tasks.

    Zhang and Dietterich (1995, 1996; Zhang, 1996): the first successful instance of RL applied in plan-space, where states = complete plans and actions = plan modifications.

  • Slide 37/40

    SSPPP continued

    States were an entire schedule. Two types of actions:

    REASSIGN-POOL operators reassign a resource to a different pool.

    MOVE operators move a task to the first earlier or later time with satisfied resource constraints.

    Small negative reward for each step. A resource dilation factor (RDF) formula rewards the final schedule's duration.

  • Slide 38/40

    Even More SSPPP

    Used TD(λ) to learn the value function. Actions were selected by a decreasing ε-greedy policy with one-step lookahead.

    Function approximation used multilayer neural networks. Training generally took 10,000 episodes.

    Each resulting network represented a different scheduling algorithm, not a schedule for a specific instance!

  • Slide 39/40

    RL and CBR

    Example: CBR is used to store various policies, and RL is used to learn and modify those policies.

    Ashwin Ram and Juan Carlos Santamaría, 1993: autonomous robotic control.

    Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make. Similar methods can be used for IDSS.

  • Slide 40/40

    References

    Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
    Stochastic Processes, www.hanoivn.net
    http://en.wikipedia.org/wiki/PageRank
    http://en.wikipedia.org/wiki/Markov_decision_process
    Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.

