Lecture 2: Introduction to Markov Decision Processes
Reinforcement Learning (INF11010)
Pavlos Andreadis, January 19th 2018
Transcript
Page 1: Lecture 2: Introduction to Markov Decision Processes

Reinforcement Learning (INF11010)

Pavlos Andreadis, January 19th 2018

Page 2: Today’s Content

● (discrete-time) finite Markov Decision Processes (MDPs)
  – State space; Action space; Transition function; Reward function.
  – Policy; Value function.

● Markov property/assumption

● MDPs with set policy → Markov chain

● The Reinforcement Learning problem:
  – Maximise the accumulation of rewards across time

● Modelling a problem as an MDP (example)

Page 3: a Repair Scenario

● Output in 1000s of $:
  – Good:
  – No conveyor belt:
  – No production:

● Cost of repairs (regardless of condition) in 1000s of $:

● Probability of engine fault

● Probability of conveyor belt fault

Page 4: State & Action spaces

● ___ No problems
● ___ Conveyor belt fault
● ___ Engine fault

● ___ wait
● ___ repair


● the MDP model as a Dynamic Bayesian Network (i.e. a dynamic probabilistic directed acyclic graph):

● Markov property!

Page 5: Reward & Transition Functions

● The Transition function:

● The Reward function:
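● The formulas themselves are not reproduced in this transcript; in one standard notation (cf. Sutton and Barto), for a finite MDP they take the form

  P(s' | s, a) = Pr( S_{t+1} = s' | S_t = s, A_t = a )

  R(s, a, s') = E[ R_{t+1} | S_t = s, A_t = a, S_{t+1} = s' ]

  where the exact symbols used on the slide may differ.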

Page 6: Markov Property

● Environment response, generally:

● … with the Markov property:
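● The slide’s formulas are not reproduced here; a standard statement is that, in general, the environment’s response conditions on the whole history,

  Pr( S_{t+1} = s', R_{t+1} = r | S_t, A_t, R_t, S_{t-1}, A_{t-1}, …, R_1, S_0, A_0 )

  while under the Markov property it depends only on the current state and action:

  Pr( S_{t+1} = s', R_{t+1} = r | S_t, A_t )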

Page 7: Transition Graph

● the Transition Graph for our MDP model for the Repair Scenario (figure not reproduced; the edges are labelled with the actions “wait” and “repair”)

Page 8: Policy

● A policy π is a mapping from each state s and action a to the probability π(a | s) of taking action a when in state s

● For example:
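● (The slide’s own example is not reproduced here; purely as an illustration, a stochastic policy for the Repair Scenario could assign π(wait | No problems) = 0.9 and π(repair | No problems) = 0.1, while always repairing on an engine fault, π(repair | Engine fault) = 1.)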

Page 9: a Deterministic Policy

● “Wait till it breaks” policy:

● Stochastic/Transition matrix:

● Actions chosen under this policy (as labelled on the slide): wait, wait, repair

● A Markov Chain

Page 10: another Deterministic Policy

● “Repair” policy:

● Stochastic/Transition matrix:

● Actions chosen under this policy (as labelled on the slide): wait, repair, repair

● Another Markov Chain
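● To make the collapse from MDP to Markov chain concrete, here is a minimal Python sketch (not from the lecture; the probabilities below are invented for illustration, since the slide’s actual numbers are not reproduced in this transcript). Fixing a deterministic policy selects one transition distribution per state, which together form the transition matrix of a Markov chain:

    # Illustrative Repair Scenario MDP -- every probability below is invented.
    states = ["No problems", "Conveyor belt fault", "Engine fault"]
    actions = ["wait", "repair"]

    # P[(s, a)] maps each successor state s' to the probability of reaching it.
    P = {
        ("No problems", "wait"):           {"No problems": 0.80, "Conveyor belt fault": 0.15, "Engine fault": 0.05},
        ("Conveyor belt fault", "wait"):   {"Conveyor belt fault": 0.90, "Engine fault": 0.10},
        ("Engine fault", "wait"):          {"Engine fault": 1.00},
        ("No problems", "repair"):         {"No problems": 1.00},
        ("Conveyor belt fault", "repair"): {"No problems": 1.00},
        ("Engine fault", "repair"):        {"No problems": 1.00},
    }

    # The "Wait till it breaks" policy: repair only on an engine fault.
    policy = {"No problems": "wait", "Conveyor belt fault": "wait", "Engine fault": "repair"}

    # Fixing the policy leaves exactly one transition distribution per state:
    # the rows below are the transition matrix of the induced Markov chain.
    chain = [[P[(s, policy[s])].get(s2, 0.0) for s2 in states] for s in states]

    for s, row in zip(states, chain):
        print(f"{s:>20}: {row}")   # each row sums to 1 (a stochastic matrix)

  Swapping in the “Repair” policy instead simply selects different rows of P, giving the other Markov chain.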

Page 11: Returns (finite time)

● Return at time t = the reward accumulated starting from the next time step:

● T = a final time step

● Episodic tasks, i.e. there is a final time step

● Each episode ends in a terminal (absorbing) state

● Assuming we are at time t, our goal is to maximise the expected return at time t
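● In one standard notation (the slide’s formula is not reproduced here), the return for an episodic task is

  G_t = R_{t+1} + R_{t+2} + … + R_T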

Page 12: Returns (infinite time)

● Discounted Return at time t

● γ = discount rate (prevents a sum to infinity / weights reward across time)

● Continuing tasks, i.e. there is no final time step

● A single never-ending episode

● Assuming we are at time t, our goal is to maximise the expected discounted return at time t
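● In the same notation (not reproduced from the slide), the discounted return is

  G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1},   with 0 ≤ γ < 1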

Page 13: Returns (unified notation)

● Discounted Return at time t

● Continuing tasks by setting T = ∞

● In which case we can’t have both T = ∞ and γ = 1

● Define absorbing states as transitioning to themselves with a reward of 0
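● In the unified notation (a reconstruction; the slide’s formula is not reproduced here), episodic and continuing tasks share the single expression

  G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1},   allowing either T = ∞ or γ = 1, but not both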

Page 14: Value Function

● We can define the value of a state under policy π using the state-value function:

● … or the action-value (or Q-) function:
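● In standard notation (the slide’s definitions are not reproduced here), these are

  v_π(s) = E_π[ G_t | S_t = s ]

  q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]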

Page 15: Bellman Equation
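● The slide’s derivation is not reproduced in this transcript; the standard Bellman equation for the state-value function under a policy π (Sutton and Barto, Chapter 3) is

  v_π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v_π(s') ]

  which expresses the value of a state in terms of the immediate reward and the discounted values of its successor states.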

Page 16: Optimal Value Function
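● The slide’s equations are not reproduced in this transcript; in standard notation, the optimal state-value function and the corresponding Bellman optimality equation are

  v_*(s) = max_π v_π(s)

  v_*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v_*(s') ]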

Page 17: Markov Decision Processes

● A finite Markov Decision Process (MDP) is a tuple ⟨S, A, P, R, γ⟩

where:

● S is a finite set of states

● A is a finite set of actions

● P is a state transition probability function

● R is a reward function

● γ is a discount factor
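● As a programming-oriented restatement (not part of the lecture), the tuple ⟨S, A, P, R, γ⟩ can be held in a small container such as the following Python sketch; the field names are invented for illustration:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    State = str
    Action = str

    @dataclass
    class FiniteMDP:
        states: List[State]                                    # S: finite set of states
        actions: List[Action]                                  # A: finite set of actions
        transition: Dict[Tuple[State, Action, State], float]   # P(s' | s, a)
        reward: Dict[Tuple[State, Action, State], float]       # R(s, a, s')
        gamma: float                                            # γ: discount factor in [0, 1]

        def check(self) -> None:
            """Check that P(. | s, a) is a probability distribution for every (s, a)."""
            for s in self.states:
                for a in self.actions:
                    total = sum(self.transition.get((s, a, s2), 0.0) for s2 in self.states)
                    assert abs(total - 1.0) < 1e-9, f"P(. | {s}, {a}) sums to {total}, not 1"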

Page 18: Reading +

● Chapter 3 of Sutton and Barto (1st Edition) http://incompleteideas.net/book/ebook/the-book.html

● Please join Piazza for announcements and support: https://piazza.com/ed.ac.uk/spring2018/infr11010

● Exercise: pick a policy for the Repair Scenario, and write a procedure in Matlab that evaluates the Expected Return from . (feel free to use Piazza to ask for tips)

Optional:

Page 19: a Repair Scenario

