  • Decision Making in Robots and Autonomous Agents

    The Markov Decision Process (MDP) model
    Subramanian Ramamoorthy
    School of Informatics

    25 January, 2013

  • In the MAB Model
    We were in a single casino, and the only decision was which of a set of n arms to pull; except perhaps in the very last slides, there was exactly one state!

    We asked the following: What if there is more than one state? In such a state space, what is the effect of the distribution of payout changing based on how you pull arms? What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?


  • Decision Making: Agent-Environment Interface

  • Markov Decision Processes
    A model of the agent-environment system.
    Markov property = history doesn't matter, only the current state.
    If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
    state and action sets
    one-step dynamics defined by transition probabilities:

      P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }

    reward probabilities:

      R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
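As a concrete illustration of what "state and action sets plus one-step dynamics" amounts to, a finite MDP can be written down as plain data. A minimal sketch; the states, actions, probabilities, and rewards below are illustrative placeholders, not values from the slides:

```python
# A finite MDP as plain data. P maps (state, action) to a list of
# (transition probability, next state, reward) triples.
# States, actions, and numbers here are illustrative placeholders.
P = {
    ("high", "search"):  [(0.7, "high", 4.0), (0.3, "low", 4.0)],
    ("high", "wait"):    [(1.0, "high", 1.0)],
    ("low", "search"):   [(0.6, "low", 4.0), (0.4, "high", -3.0)],
    ("low", "wait"):     [(1.0, "low", 1.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# One-step dynamics must be proper distributions: for every (s, a),
# the outgoing probabilities sum to one.
for (s, a), outcomes in P.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```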


  • Recycling Robot: An Example Finite MDP
    At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.

    Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).

    Decisions are made on the basis of the current energy level: high, low.

    Reward = number of cans collected

  • Recycling Robot MDP

  • Enumerated In Tabular Form

    If you were given this much, what can you say about the behaviour (over time) of the system?
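One way to approach the question above is simply to simulate: fix a policy, roll the system forward, and look at the long-run average reward. A sketch, using placeholder transition probabilities and rewards rather than the actual entries of the table:

```python
import random

# Simulating the recycling robot under a fixed policy.
# The dynamics P and the rewards are illustrative placeholders.
P = {
    ("high", "search"):  [(0.7, "high", 4.0), (0.3, "low", 4.0)],
    ("low", "search"):   [(0.6, "low", 4.0), (0.4, "high", -3.0)],
    ("high", "wait"):    [(1.0, "high", 1.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
policy = {"high": "search", "low": "recharge"}  # a simple fixed rule

def step(state, action, rng):
    """Sample (next_state, reward) from the one-step dynamics."""
    outcomes = P[(state, action)]
    r, acc = rng.random(), 0.0
    for prob, s_next, reward in outcomes:
        acc += prob
        if r <= acc:
            return s_next, reward
    return outcomes[-1][1], outcomes[-1][2]

rng = random.Random(0)
state, total = "high", 0.0
for _ in range(1000):
    state, reward = step(state, policy[state], rng)
    total += reward
print(total / 1000)  # long-run average reward per step under this policy
```

Even this crude experiment shows the kind of long-run regularity the next slides formalise via Markov chains.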

  • A Very Brief Primer on Markov Chains and Decisions

    A model, as originally developed in Operations Research / Stochastic Control theory

  • Stochastic Processes
    A stochastic process is an indexed collection of random variables {X_t}, e.g., the collection of weekly demands for a product.
    One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, labelled by integers too.
    The process could be embedded, in that time points correspond to the occurrence of specific events (or time may be equi-spaced).
    The random variables may depend on one another, e.g., next week's demand may depend on this week's.

  • Markov Chains
    The stochastic process is said to have the Markovian property if

      P{X_{t+1} = j | X_0 = k_0, X_1 = k_1, ..., X_{t-1} = k_{t-1}, X_t = i} = P{X_{t+1} = j | X_t = i}

    The Markovian property means that the conditional probability of a future event, given any past events and the current state, is independent of past states and depends only on the present.
    The conditional probabilities P{X_{t+1} = j | X_t = i} are transition probabilities.

    These are stationary if time-invariant, written p_ij:

      p_ij = P{X_{t+1} = j | X_t = i} = P{X_1 = j | X_0 = i}

  • Markov Chains
    Looking forward in time, the n-step transition probabilities are

      p_ij^(n) = P{X_{t+n} = j | X_t = i}

    One can write a transition matrix, P = [p_ij].

    A stochastic process is a finite-state Markov chain if it has:
    a finite number of states
    the Markovian property
    stationary transition probabilities
    a set of initial probabilities P{X_0 = i} for all i


  • Markov Chains
    n-step transition probabilities can be obtained from 1-step transition probabilities recursively (Chapman-Kolmogorov):

      p_ij^(n) = Σ_k p_ik^(m) p_kj^(n-m)   for any 0 < m < n

    We can get this via the matrix too: the n-step transition matrix is the n-th matrix power, P^(n) = P^n.

    First Passage Time: the number of transitions to go from i to j for the first time.
    If i = j, this is the recurrence time.
    In general, this is itself a random variable.
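The matrix form of Chapman-Kolmogorov is easy to check numerically. A sketch with an illustrative 2-state chain (not one from the slides):

```python
import numpy as np

# n-step transition probabilities as a matrix power, P^(n) = P^n.
# The 2-state transition matrix is an illustrative placeholder.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

P8 = np.linalg.matrix_power(P, 8)  # 8-step transition probabilities
print(P8)

# Each row of P^8 is still a probability distribution over states.
assert np.allclose(P8.sum(axis=1), 1.0)
```

Notice that after 8 steps the two rows are nearly identical: where you started barely matters any more, which is exactly the long-run property discussed shortly.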


  • Markov Chains
    n-step recursive relationship for first passage time:

      f_ij^(n) = Σ_{k ≠ j} p_ik f_kj^(n-1),   with f_ij^(1) = p_ij

    For fixed i and j, the f_ij^(n) are nonnegative numbers such that

      Σ_{n=1}^∞ f_ij^(n) ≤ 1

    If Σ_{n=1}^∞ f_ii^(n) = 1, that state is a recurrent state; it is absorbing if f_ii^(1) = 1.


  • Markov Chains: Long-Run Properties
    Consider the 8-step transition matrix of the inventory example.

    Interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory.
    For an irreducible ergodic Markov chain, one has limiting probabilities

      π_j = lim_{n→∞} p_ij^(n),   with Σ_j π_j = 1 and π_j = Σ_i π_i p_ij

    The reciprocal gives you the recurrence time, m_jj = 1/π_j.
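The limiting probabilities can also be computed directly, by solving π = πP together with Σ_j π_j = 1. A sketch on an illustrative 2-state chain (not the inventory example):

```python
import numpy as np

# Stationary distribution of an irreducible ergodic chain, from the
# balance equations pi = pi P plus normalisation. Matrix is illustrative.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
n = P.shape[0]

# Replace one (redundant) balance equation with sum(pi) = 1.
A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
b = np.zeros(n)
b[-1] = 1.0
pi = np.linalg.solve(A, b)

print(pi)        # limiting probabilities pi_j
print(1.0 / pi)  # expected recurrence times m_jj = 1/pi_j
```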

  • Markov Decision Model
    Consider the following application: machine maintenance.
    A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily.
    Inspection declares the machine to be in one of four possible states:
    0: Good as new
    1: Operable, minor deterioration
    2: Operable, major deterioration
    3: Inoperable
    Let X_t denote this observed state.
    It evolves according to some law of motion, so it is a stochastic process.
    Furthermore, assume it is a finite-state Markov chain.


  • Markov Decision Model
    The transition matrix is based on the following:

    Once the machine goes inoperable, it stays there until repaired.
    With no repairs, it eventually reaches this state, which is absorbing!

    Repair is an action: a very simple maintenance policy, e.g., take the machine from state 3 back to state 0.

  • Markov Decision Model
    There are costs as the system evolves:
    State 0: cost 0
    State 1: cost 1000
    State 2: cost 3000
    The replacement cost, taking state 3 to 0, is 4000 (plus lost production of 2000), so cost = 6000.
    The modified transition probabilities are:


  • Markov Decision Model
    Simple question: What is the average cost of this maintenance policy?

    Compute the steady-state probabilities π_j, then the (long-run) expected average cost per day:

      E[C] = Σ_j π_j C_j


  • Markov Decision Model
    Consider a slightly more elaborate policy: when the machine is inoperable or needs major repairs, replace it.
    The transition matrix now changes a little bit.
    Permit one more thing: overhaul, which sends the machine back to the minor-deterioration state (1) for the next time step.
    This is not possible if the machine is truly inoperable, but it can go from major to minor.
    Key point about the system behaviour: it evolves according to
    the laws of motion, and
    the sequence of decisions made (actions from {1: none, 2: overhaul, 3: replace}).
    The stochastic process is now defined in terms of {X_t} and {D_t}.
    A policy, R, is a rule for making decisions.
    It could use all history, although a popular choice is (current-)state-based.


  • Markov Decision Model
    There is a space of potential policies, e.g., replace only when inoperable, or also overhaul/replace on major deterioration.

    Each policy defines a transition matrix, e.g., for Rb.
    Which policy is best? We need costs.

  • Markov Decision Model
    C_ik = expected cost incurred during the next transition if the system is in state i and decision k is made.

    The long-run average expected cost for each policy may be computed using

      E[C] = Σ_i π_i C_{i,R(i)}

    Rb is best.
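A sketch of this comparison. The transition matrices and costs below follow the standard OR-textbook version of the machine-maintenance example, so treat the exact numbers as assumptions rather than the slides' own:

```python
import numpy as np

def average_cost(P, c):
    """Long-run average cost: solve pi = pi P, sum(pi) = 1, return pi @ c."""
    n = P.shape[0]
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return pi @ c

# Policy Ra: replace only when inoperable (state 3).
P_a = np.array([[0, 7/8, 1/16, 1/16],
                [0, 3/4, 1/8,  1/8 ],
                [0, 0,   1/2,  1/2 ],
                [1, 0,   0,    0   ]])
c_a = np.array([0, 1000, 3000, 6000])

# Policy Rb: overhaul on major deterioration (assumed cost 4000, back to
# state 1), replace when inoperable.
P_b = np.array([[0, 7/8, 1/16, 1/16],
                [0, 3/4, 1/8,  1/8 ],
                [0, 1,   0,    0   ],
                [1, 0,   0,    0   ]])
c_b = np.array([0, 1000, 4000, 6000])

print(average_cost(P_a, c_a), average_cost(P_b, c_b))
```

Under these numbers the overhaul policy Rb comes out cheaper per day, matching the slide's conclusion.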


  • Markov Decision Processes

    Solution using Dynamic Programming
    (*some notation changes upcoming)

  • The RL Problem
    Main elements:
    States, s
    Actions, a
    State transition dynamics, often stochastic and unknown
    Reward (r) process, possibly stochastic

    Objective: a policy π_t(s, a), a probability distribution over actions given the current state.
    Assumption: the environment defines a finite-state MDP.

  • Back to Our Recycling Robot MDP

  • Given an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions?

    We want to maximize the criterion (the expected discounted return):

      R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},   0 ≤ γ ≤ 1

    So, what must one do?

  • The Shortest Path Problem

  • Finite-State Systems and Shortest Paths
    The state space s_k is a finite set for each k.
    An action a_k can get you from s_k to f_k(s_k, a_k) at a cost g_k(s_k, a_k).
    Length/Cost = sum of the lengths of the arcs.

    Solve this first:

      V_k(i) = min_j [ a_k^{ij} + V_{k+1}(j) ]
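The recursion above runs backward from the goal: the value of a node is the cheapest arc out of it plus the value of where that arc lands. A sketch on a small layered graph with illustrative arc costs:

```python
# Backward recursion V_k(i) = min_j [ cost_k(i, j) + V_{k+1}(j) ]
# on a tiny layered graph; the arc costs are illustrative.

# arcs[k][i][j] = cost of the arc from node i in stage k
#                 to node j in stage k+1
arcs = [
    {0: {0: 2.0, 1: 5.0}},          # stage 0: one start node
    {0: {0: 2.0, 1: 1.0},           # stage 1: two nodes
     1: {0: 4.0, 1: 3.0}},
    {0: {0: 3.0},                   # stage 2: two nodes, one goal node
     1: {0: 1.0}},
]

V = {0: 0.0}  # terminal stage: the goal node has value 0
for k in reversed(range(len(arcs))):
    V = {
        i: min(cost + V[j] for j, cost in succ.items())
        for i, succ in arcs[k].items()
    }

print(V[0])  # shortest path length from the start node
```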

  • Value Functions
    The value of a state is the expected return starting from that state; it depends on the agent's policy:

      V^π(s) = E_π { R_t | s_t = s } = E_π { Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

    The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:

      Q^π(s, a) = E_π { Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

  • Recursive Equation for Value
    The basic idea:

      R_t = r_{t+1} + γ R_{t+1}

    So:

      V^π(s) = E_π { r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

  • Optimality in MDPs: the Bellman Equation

      V*(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

  • Policy Evaluation
    How do we compute V(s) for an arbitrary policy π? (The prediction problem.)

    For a given MDP, this yields a system of simultaneous equations with as many unknowns as states (a BIG, |S|-dimensional linear system!):

      V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

    Solve iteratively, with a sequence of value functions:

      V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
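The iterative scheme above can be sketched directly. The tiny MDP below (dynamics, rewards, policy) is an illustrative placeholder, not the lecture's example:

```python
# Iterative policy evaluation: repeatedly apply the Bellman backup
# for a fixed policy until the values stop changing.
GAMMA = 0.9
# P[(s, a)] -> list of (probability, next_state, reward); placeholders.
P = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "go"):   [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
policy = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 1.0}}
states = ["A", "B"]

V = {s: 0.0 for s in states}
for _ in range(1000):  # enough sweeps to converge for gamma = 0.9
    V = {
        s: sum(
            pa * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            for a, pa in policy[s].items()
        )
        for s in states
    }

print(V)  # V^pi for this policy
```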


  • Policy Improvement
    Does it make sense to deviate from π(s) at any state (following the policy everywhere else)? Let us for now assume a deterministic π(s).
    Policy Improvement Theorem [Howard/Blackwell]: if Q^π(s, π'(s)) ≥ V^π(s) for all s, then V^π'(s) ≥ V^π(s) for all s.

  • Computing Better Policies
    Starting with an arbitrary policy, we'd like to approach truly optimal policies. So, we compute new policies using the greedy improvement step:

      π'(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

    Are we restricted to deterministic policies? No. With stochastic policies, the same argument goes through as long as π'(s, a) places its probability only on greedy actions.
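Alternating evaluation and greedy improvement gives policy iteration. A sketch on the same kind of tiny illustrative 2-state MDP used above (all names and numbers are placeholders):

```python
# Policy iteration: evaluate the current policy, then improve it
# greedily; stop when the policy is stable.
GAMMA = 0.9
P = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "go"):   [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay", "go"]}

def evaluate(policy, sweeps=1000):
    """Iterative policy evaluation for a deterministic policy."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, policy[s])])
             for s in states}
    return V

policy = {"A": "stay", "B": "go"}  # arbitrary initial policy
while True:
    V = evaluate(policy)
    improved = {
        s: max(actions[s],
               key=lambda a: sum(p * (r + GAMMA * V[s2])
                                 for p, s2, r in P[(s, a)]))
        for s in states
    }
    if improved == policy:  # greedy-stable: no state wants to deviate
        break
    policy = improved

print(policy)
```

When the greedy step changes nothing, the Policy Improvement Theorem tells us no single-state deviation helps, so the policy is optimal.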

  • Grid-World Example

  • Iterative Policy Evaluation in Grid World
    Note: The value function can be searched greedily to find long-term optimal actions.

