Markov Decision Processes & Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006
Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications
Stochastic Process
Quick definition: A Random Process
Often viewed as a collection of indexed random variables
Useful to us: Set of states with probabilities of being in those states indexed over time
We’ll deal with discrete stochastic processes
(Image: A. A. Markov, http://en.wikipedia.org/wiki/Image:AAMarkov.jpg)
Stochastic Process Example
Classic: Random Walk
Start at state X0 at time t0
At time ti, move a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p
At time ti, state Xi = X0 + Z1 +…+ Zi
(Image: random walk example, http://en.wikipedia.org/wiki/Image:Random_Walk_example.png)
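As a concrete illustration, here is a minimal Python sketch of this walk; the values of p, the number of steps, and the starting state are arbitrary choices for the example:

```python
import random

def random_walk(p=0.5, steps=10, x0=0):
    """Simulate Xi = X0 + Z1 + ... + Zi, where each step Zi is
    -1 with probability p and +1 with probability 1 - p."""
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1
        x += z
        path.append(x)
    return path

print(random_walk())   # e.g. [0, 1, 0, -1, ...]
```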
Markov Property
Also thought of as the “memoryless” property
A stochastic process is said to have the Markov property if the probability of state Xn+1 having any given value depends only upon state Xn
Whether a process has the Markov property depends very much on how its states are described
Markov Property Example
Checkers:
Current State: The current configuration of the board
Contains all information needed for transition to next state
Thus, each configuration can be said to have the Markov property
Markov Chain
A discrete-time stochastic process with the Markov property
Industry Example: Google's PageRank algorithm
A probability distribution representing the likelihood that random link-following ends up on a given page (http://en.wikipedia.org/wiki/PageRank)
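A minimal sketch of the underlying computation: repeatedly apply the chain's transition matrix to a distribution until it settles. The 3-page link graph below is invented for the example, and real PageRank additionally uses a damping factor:

```python
def stationary_distribution(P, iters=100):
    """Power iteration: start uniform, then repeatedly apply the
    row-stochastic transition matrix P until the distribution settles."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

# Hypothetical 3-page web: P[i][j] = probability of following a link i -> j.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
print(stationary_distribution(P))
```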
Markov Decision Process (MDP)
A discrete-time stochastic control process
Extension of Markov chains
Differences:
Addition of actions (choice)
Addition of rewards (motivation)
If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
Tuple (S, A, Pa(·,·), R(·))
S: state space
A: action space
Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
R(s) = immediate reward at state s
Goal is to maximize some cumulative function of the rewards
Finite MDPs have finite state and action spaces
Simple MDP Example
Recycling MDP Robot
Can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery
Has two energy levels – high and low
Searching runs down battery, waiting does not, and a depleted battery has a very low reward
Transition Probabilities
s = st s’ = st+1 a = at Pass’ Ra
ss’
high high search α Rsearch
high low search 1 - α Rsearch
low high search 1 - β -3
low low search β Rsearch
high high wait 1 Rwait
high low wait 0 Rwait
low high wait 0 Rwait
low low wait 1 Rwait
low high recharge 1 0
low low recharge 0 0
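The table transcribes directly into code. A minimal Python sketch follows; the slide leaves α, β, Rsearch, and Rwait symbolic, so the numbers below are placeholder assumptions (later sketches reuse these definitions):

```python
# (state, action) -> list of (next_state, probability, reward), per the table.
# ALPHA, BETA, R_SEARCH, R_WAIT are placeholders; the slide leaves them symbolic.
ALPHA, BETA = 0.9, 0.4
R_SEARCH, R_WAIT = 5.0, 1.0

MDP = {
    ('high', 'search'):   [('high', ALPHA, R_SEARCH), ('low', 1 - ALPHA, R_SEARCH)],
    ('low',  'search'):   [('high', 1 - BETA, -3.0), ('low', BETA, R_SEARCH)],
    ('high', 'wait'):     [('high', 1.0, R_WAIT)],   # zero-probability rows omitted
    ('low',  'wait'):     [('low', 1.0, R_WAIT)],
    ('low',  'recharge'): [('high', 1.0, 0.0)],
}

STATES = ['high', 'low']

def actions(s):
    """Actions available in state s."""
    return [a for (st, a) in MDP if st == s]
```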
Transition Graph
(Figure: transition graph with state nodes and action nodes)
Solution to an MDP = Policy π
Gives the action to take from a given state, regardless of history
Two arrays indexed by state:
V is the value function, namely the discounted sum of rewards on average from following a policy
π is an array of actions to be taken in each state (Policy)
Two basic steps, repeated until convergence:
1. π(s) := argmaxa ∑s' Pa(s, s') V(s')
2. V(s) := R(s) + γ ∑s' Pπ(s)(s, s') V(s')
Variants: Value Iteration, Policy Iteration, Modified Policy Iteration, Prioritized Sweeping
Value Iteration
V(s) := R(s) + γ maxa ∑s' Pa(s, s') V(s')
Value function estimates per iteration k:

k    Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1    0        0        10       10
2    0        4.5      14.5     19
3    2.03     8.55     18.55    24.18
4    4.76     11.79    19.26    29.23
5    7.45     15.30    20.81    31.82
6    10.23    17.67    22.72    33.68
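A minimal sketch of value iteration using this backup, run on the recycling-robot dictionaries sketched earlier (the PU/PF/RU/RF numbers above come from a different slide example, so this code is not expected to reproduce them; γ and θ are arbitrary choices):

```python
GAMMA = 0.9   # discount rate; an arbitrary choice for the sketch

def value_iteration(gamma=GAMMA, theta=1e-6):
    """Sweep all states, replacing V(s) with the best one-step backup,
    until the largest change in a sweep falls below theta."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = max(sum(p * (r + gamma * V[s2]) for (s2, p, r) in MDP[(s, a)])
                    for a in actions(s))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

print(value_iteration())
```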
Why So Interesting?
If the transition probabilities are known, this becomes a straightforward computational problem, however…
If the transition probabilities are unknown, then this is a problem for reinforcement learning.
Typical Agent
In reinforcement learning (RL), the agent observes a state and takes an action.
Afterward, the agent receives a reward.
Mission: Optimize Reward
Rewards are calculated in the environment
Used to teach the agent how to reach a goal state
Must signal what we ultimately want achieved, not necessarily subgoals
May be discounted over time
In general, seek to maximize the expected return
Value Functions
Vπ is a value function (How good is it to be in this state?)
Vπ is the unique solution to its Bellman Equation
Expresses relationship between a state and its successor states
Bellman Equation (state-value function for policy π):
Vπ(s) = ∑a π(s, a) ∑s' Pa(s, s') [Ra(s, s') + γ Vπ(s')]
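Because the Bellman equation is linear in Vπ, small MDPs can be solved for Vπ exactly. A sketch using the recycling-robot dictionaries from earlier; numpy is assumed to be available, and the fixed policy passed in is an arbitrary choice:

```python
import numpy as np   # assumed available for the linear solve

def exact_policy_value(pi, gamma=0.9):
    """Solve (I - gamma * P_pi) v = r_pi, the Bellman equation in matrix form."""
    idx = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    P, r = np.zeros((n, n)), np.zeros(n)
    for s in STATES:
        for (s2, p, rew) in MDP[(s, pi[s])]:
            P[idx[s], idx[s2]] += p
            r[idx[s]] += p * rew
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return dict(zip(STATES, v))

print(exact_policy_value({'high': 'search', 'low': 'recharge'}))
```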
Another Value Function
Qπ defines the value of taking action a in state s under policy π
Expected return starting from s, taking action a, and thereafter following policy π
Action-value function for policy π:
Qπ(s, a) = ∑s' Pa(s, s') [Ra(s, s') + γ Vπ(s')]
(Figure: backup diagrams for (a) Vπ and (b) Qπ)
Dynamic Programming
Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP
The classical view is not so useful in practice since we rarely have a perfect environment model
Provides foundation for other methods
Not practical for large problems
DP Continued…
Use value functions to organize and structure the search for good policies
Turn Bellman equations into update rules
Iterative policy evaluation using full backups
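A minimal sketch of iterative policy evaluation with full backups, again reusing the recycling-robot dictionaries; γ and the stopping threshold θ are arbitrary choices:

```python
def iterative_policy_evaluation(pi, gamma=0.9, theta=1e-6):
    """Repeated full backups of the Bellman equation for a fixed policy pi."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = sum(p * (r + gamma * V[s2]) for (s2, p, r) in MDP[(s, pi[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

print(iterative_policy_evaluation({'high': 'search', 'low': 'wait'}))
```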
Policy Improvement
When should we change the policy?
If we pick a new action a from state s and thereafter follow the current policy, and the resulting policy π' satisfies Vπ'(s) >= Vπ(s), then picking a from state s is a better policy overall.
Results from the policy improvement theorem
Policy Iteration
Continue improving the policy π and recalculating Vπ
A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
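A sketch of the full loop, building on iterative_policy_evaluation above; the loop stops when the greedy policy no longer changes:

```python
def greedy_policy(V, gamma=0.9):
    """Policy improvement: act greedily with respect to V."""
    return {s: max(actions(s),
                   key=lambda a: sum(p * (r + gamma * V[s2])
                                     for (s2, p, r) in MDP[(s, a)]))
            for s in STATES}

def policy_iteration(gamma=0.9):
    pi = {s: actions(s)[0] for s in STATES}          # arbitrary initial policy
    while True:
        V = iterative_policy_evaluation(pi, gamma)   # evaluation
        new_pi = greedy_policy(V, gamma)             # improvement
        if new_pi == pi:                             # stable -> done
            return pi, V
        pi = new_pi

print(policy_iteration())
```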
Remember Value Iteration?
It truncates policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each iteration.
Monte Carlo Methods
Requires only episodic experience – on-line or simulated
Based on averaging sample returns
Value estimates and policies only changed at the end of each episode, not on a step-by-step basis
Policy Evaluation
Compute average returns as the episode runs
Two methods: first-visit and every-visit
First-visit is the most widely studied
First-visit MC method
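A minimal sketch of the first-visit method just named: each episode is a list of (state, reward) pairs, and V(s) is the average of the returns that follow the first visit to s in each episode. The sample episode and γ are invented for the demonstration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return after the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g, gains = 0.0, []
        for (s, r) in reversed(episode):   # accumulate returns backward
            g = r + gamma * g
            gains.append((s, g))
        gains.reverse()
        seen = set()
        for (s, g) in gains:
            if s not in seen:              # count the first visit only
                seen.add(s)
                returns[s].append(g)
    return {s: sum(v) / len(v) for s, v in returns.items()}

# One hand-made episode; each reward follows the paired state.
print(first_visit_mc([[('A', 0.0), ('B', 1.0), ('A', 2.0)]]))
```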
Estimation of Action Values
State values are not enough without a model – we need action values as well
Qπ(s, a): expected return when starting in state s, taking action a, and thereafter following policy π
Exploration vs. Exploitation
Exploring starts: every episode begins at a randomly selected state-action pair, so all pairs keep being tried
Example Monte Carlo Algorithm
First-visit Monte Carlo assuming exploring starts
Another MC Algorithm
On-line, first-visit, ε-greedy MC without exploring starts
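The ε-greedy action selection such an algorithm relies on can be sketched as below; ε = 0.1 and the toy Q-table are arbitrary assumptions:

```python
import random

def epsilon_greedy(Q, state, action_list, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(action_list)
    return max(action_list, key=lambda a: Q.get((state, a), 0.0))

Q = {('s0', 'left'): 1.0, ('s0', 'right'): 2.0}
print(epsilon_greedy(Q, 's0', ['left', 'right']))   # usually 'right'
```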
Temporal-Difference Learning
Central and novel to reinforcement learning
Combines Monte Carlo and DP methods
Can learn from experience without a model – like MC
Updates estimates based on other learned estimates (bootstraps) – like DP
TD(0)
The simplest TD method
Uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods
Update rule: V(st) ← V(st) + α [rt+1 + γ V(st+1) - V(st)]
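A minimal sketch of the TD(0) update applied to one observed transition; α, γ, and the toy states are arbitrary choices:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {'A': 0.0, 'B': 0.5}
td0_update(V, 'A', 1.0, 'B')   # one observed transition: A -(r=1)-> B
print(V)                       # V('A') moves toward 1 + 0.9 * V('B')
```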
SARSA – On-policy Control
Built around the quintuple of events (st, at, rt+1, st+1, at+1)
Continually estimate Qπ while changing π
Update rule: Q(st, at) ← Q(st, at) + α [rt+1 + γ Q(st+1, at+1) - Q(st, at)]
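A sketch of the Sarsa update over one such quintuple; the step size, discount, and toy values are assumptions of the example:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

Q = {('s1', 'left'): 2.0}
sarsa_update(Q, 's0', 'right', 1.0, 's1', 'left')
print(Q)   # ('s0', 'right') moves toward 1 + 0.9 * Q(('s1', 'left'))
```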
Q-Learning – Off-policy Control
The learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed
Update rule: Q(st, at) ← Q(st, at) + α [rt+1 + γ maxa' Q(st+1, a') - Q(st, at)]
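A sketch of the corresponding Q-learning update; the max over next actions is what makes it off-policy. The toy values are again arbitrary:

```python
def q_learning_update(Q, s, a, r, s_next, action_list, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in action_list)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - Q.get((s, a), 0.0))

Q = {}
q_learning_update(Q, 's0', 'right', 1.0, 's1', ['left', 'right'])
print(Q)   # {('s0', 'right'): 0.1}
```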
Case Study
Job-Shop Scheduling
Temporal and resource constraints
Find constraint-satisfying schedules of short duration
In its general form, NP-complete
NASA Space Shuttle Payload Processing Problem (SSPPP)
Schedule the tasks required for installation and testing of shuttle cargo bay payloads
Typical: 2-6 shuttle missions, each requiring 34-164 tasks
Zhang and Dietterich (1995, 1996; Zhang, 1996)
First successful instance of RL applied in plan-space: states = complete plans, actions = plan modifications
SSPPP – continued…
States were an entire schedule
Two types of actions:
REASSIGN-POOL operators: reassign a resource to a different pool
MOVE operators: move a task to the first earlier or later time with satisfied resource constraints
Small negative reward for each step
Resource dilation factor (RDF) formula for rewarding the final schedule's duration
Even More SSPPP…
Used TD(λ) to learn the value function
Actions selected by a decreasing ε-greedy policy with one-step lookahead
Function approximation used multilayer neural networks
Training generally took 10,000 episodes
Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!
RL and CBR
Example: CBR used to store various policies, with RL used to learn and modify those policies (Ashwin Ram and Juan Carlos Santamaría, 1993: autonomous robotic control)
Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make
Similar methods can be used for IDSS
References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
Stochastic Processes, www.hanoivn.net
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Markov_decision_process
Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.