
ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 10: Temporal-Difference Learning

Dr. Itamar Arel

College of Engineering
Department of Electrical Engineering and Computer Science

The University of Tennessee
Fall 2015

September 29, 2015


Introduction to Temporal-Difference (TD) Learning & TD Prediction

If one had to identify one idea as central and novel to RL, it would undoubtedly be temporal-difference (TD) learning
A combination of ideas from DP and Monte Carlo

Learns without a model (like MC), bootstraps (like DP)

Both TD and Monte Carlo methods use experience to solve the prediction problem (a.k.a. policy evaluation)

We will focus on the prediction problem: evaluating V(s) for a given policy

A simple every-visit MC method may be expressed as

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where the target R_t is the actual return after time t

Let's call this constant-α MC
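
For illustration, here is a minimal Python sketch of this constant-α MC update; the function name and the (state, return) pair encoding are mine, not from the lecture notes:

    def constant_alpha_mc_update(V, visits, alpha=0.1):
        """Every-visit constant-alpha MC update, applied once an episode has finished.
        visits: list of (s_t, R_t) pairs, where R_t is the actual return observed after time t."""
        for s, R in visits:
            V[s] += alpha * (R - V[s])    # V(s_t) <- V(s_t) + alpha [R_t - V(s_t)]
        return V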


TD Prediction (cont.)

Recall that in MC we need to wait until the end of the episode to update the value estimates

The idea of TD is to do so at every time step

Simplest TD method, TD(0):

Essentially, we are updating one guess based on another

The idea is that we have a “moving target”

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

target: an estimate of the return


Simple Monte Carlo

[Backup diagram: Monte Carlo backs up V(s_t) over the entire remainder of the episode, all the way to the terminal state T]

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t


Simplest TD Method

[Backup diagram: TD(0) backs up V(s_t) from the immediate reward r_{t+1} and the estimated value of the next state s_{t+1}]

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]


Dynamic Programming

    V(s_t) ← E_π [ r_{t+1} + γ V(s_{t+1}) ]

[Backup diagram: DP backs up V(s_t) using the expected value over all possible next states s_{t+1} and rewards r_{t+1}, one full step of the model]


Tabular TD(0) for estimating V
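
The algorithm box on this slide did not survive extraction. Below is a hedged Python sketch of tabular TD(0) policy evaluation following the update rule above; the env.reset()/env.step() interface and the policy callable are assumptions of the sketch, not part of the lecture.

    from collections import defaultdict

    def tabular_td0(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        """Tabular TD(0) policy evaluation: estimate V for the given (fixed) policy."""
        V = defaultdict(float)                    # V(s), initialized arbitrarily (here to 0)
        for _ in range(num_episodes):
            s = env.reset()                       # initialize s
            done = False
            while not done:                       # repeat for each step of the episode
                a = policy(s)                     # action given by the policy for s
                s_next, r, done = env.step(a)     # take a, observe r and the next state
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
                s = s_next
        return V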


TD methods Bootstrap and Sample

Bootstrapping: the update involves an existing estimate (i.e., a guess from a guess)
Monte Carlo does not bootstrap

Dynamic Programming bootstraps

Temporal Difference bootstraps

Sampling: the update does not involve an expected value
Monte Carlo samples

Dynamic Programming does not sample

Temporal Difference samples


Example: Driving Home

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office                 0                        30                     30
reach car, raining             5                        35                     40
exit highway                  20                        15                     35
behind truck                  30                        10                     40
home street                   40                         3                     43
arrive home                   43                         0                     43

The elapsed times on each leg are the rewards; the return from each state is the actual time to go from that state


Example: Driving Home (cont.)

[Figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]

Value of each state is its expected time-to-go
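
The two panels of the figure are not reproduced here, but the changes they show can be recomputed from the table on the previous slide. A small sketch under that assumption (variable names are mine; the rewards are the elapsed times on each leg):

    # Predicted total travel time at each state (from the driving-home table)
    preds = [30, 40, 35, 40, 43, 43]          # leaving office ... arrive home
    actual_total = 43                         # the outcome actually observed

    # MC changes (alpha = 1): move every prediction all the way to the final outcome
    mc_changes = [actual_total - p for p in preds[:-1]]                    # [13, 3, 8, 3, 0]

    # TD changes (alpha = 1): move each prediction toward the next prediction
    td_changes = [preds[i + 1] - preds[i] for i in range(len(preds) - 1)]  # [10, -5, 5, 3, 0]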


Example: Driving Home (cont.)

Is it really necessary to wait until the end of the episode to start learning?
Monte Carlo says it is

TD learning argues that learning can occur on-line

Suppose, on another day, you again estimate when leaving your office that it will take 30 minutes to drive home, but then you get stuck in a massive traffic jam
Twenty-five minutes after leaving the office you are still bumper-to-bumper on the highway

You now estimate that it will take another 25 minutes to get home, for a total of 50 minutes

Must you wait until you get home before increasing your estimate for the initial state?

In TD you would be shifting your initial estimate from 30 minutes toward 50



Advantages of TD Learning

TD methods do not require a model of the environment, only experience

TD, but not MC, methods can be fully incremental
Agent learns a “guess from a guess”

Agent can learn before knowing the final outcome
Less memory

Reduced peak computation

Agent can learn without the final outcome
From incomplete sequences

Helps with applications that have very long episodes

Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Currently unknown; generally TD does better on stochastic tasks


Random Walk Example

In this example we empirically compare the prediction abilities of TD(0) and constant-α MC applied to the following small Markov process:

All episodes start in state C

Proceed one state, right or left, with equal probability

Termination: on the right with reward R = +1, on the left with R = 0

True values: V(C) = 1/2, V(A) = 1/6, V(B) = 2/6, V(D) = 4/6, V(E) = 5/6
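
A minimal sketch of this random walk and an online TD(0) learner for it (state indexing and helper names are mine):

    import random

    STATES = "ABCDE"                               # non-terminal states, left to right

    def td0_episode(V, alpha=0.1):
        """One walk starting in C, updating V online with TD(0) (undiscounted)."""
        i = 2                                      # index of C
        while True:
            j = i + random.choice([-1, 1])         # step right or left with equal probability
            if j < 0 or j > 4:                     # terminal: reward +1 on the right, 0 on the left
                r = 1.0 if j > 4 else 0.0
                V[STATES[i]] += alpha * (r - V[STATES[i]])
                return V
            V[STATES[i]] += alpha * (V[STATES[j]] - V[STATES[i]])   # r = 0 for all other transitions
            i = j

    V = {s: 0.5 for s in STATES}                   # intermediate initial value for every state
    for _ in range(1000):
        td0_episode(V)
    # For small enough alpha, V approaches the true values 1/6, 2/6, 3/6, 4/6, 5/6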


Random Walk Example (cont.)

[Figure: empirical results on the random walk; data averaged over 100 sequences of episodes]


Optimality of TD(0)

Suppose only a finite amount of experience is available, say 10 episodes or 100 time steps

Intuitively, we repeatedly present the experience until convergence is achieved

Updates are made after a batch of training data
Also called batch updating

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α

MC method also converges deterministically but to a different answer

To better understand the difference between MC and TD(0), we'll consider the batch random walk
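
Before that, here is a hedged sketch of batch updating itself, assuming episodes are stored as (s, r, s') transitions: increments are accumulated over the whole batch, applied at once, and the sweep repeats until V stops changing.

    def batch_td0(episodes, states, alpha=0.001, gamma=1.0, tol=1e-6):
        """Batch updating: repeatedly present a fixed batch of episodes until V converges.
        Each episode is a list of (s, r, s_next) transitions, with s_next = None at termination."""
        V = {s: 0.0 for s in states}
        while True:
            delta = {s: 0.0 for s in states}                   # increments accumulated over the batch
            for episode in episodes:
                for s, r, s_next in episode:
                    target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                    delta[s] += alpha * (target - V[s])
            for s in states:
                V[s] += delta[s]                               # apply all increments at once
            if max(abs(d) for d in delta.values()) < tol:      # stop once the batch no longer changes V
                return V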


Optimality of TD(0) (cont.)

After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence; the whole experiment was repeated 100 times

A key question is what would explain these two curves?


You are the Predictor

Suppose you observe the following 8 episodes

Q: What would you guess V(A) and V(B) to be?

1) A, 0, B, 0

2) B, 1

3) B, 1

4) B, 1

5) B, 1

6) B, 1

7) B, 1

8) B, 0


You are the Predictor (cont.)

V(A) = ¾ is the answer that batch TD(0) gives
The other reasonable answer is simply to say that V(A) = 0 (Why?)
This is the answer that batch MC gives

If the process is Markovian, we expect that the TD(0) answer will produce lower error on future data, even though the Monte Carlo answer is better on the existing data


TD(0) vs. MC

For MC, the prediction that best matches the training data is V(A) = 0
This minimizes the mean-square error on the training set

This is what a batch Monte Carlo method gets

If we consider the sequentiality of the problem, then we would set V(A) = 0.75

This is correct for the maximum-likelihood estimate of a Markov model generating the data

i.e., if we fit the best Markov model to the data, assume it is exactly correct, and then compute what it predicts

This is called the certainty-equivalence estimate

It is what TD(0) yields
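
To make the two answers concrete, here is a small sketch that computes both from the eight episodes above (the episode encoding is mine):

    # The eight observed episodes: (states visited, rewards received)
    episodes = [("AB", [0, 0])] + [("B", [1])] * 6 + [("B", [0])]

    # Batch MC answer for V(A): average the returns observed from A
    returns_from_A = [sum(rewards) for states, rewards in episodes if "A" in states]
    V_A_mc = sum(returns_from_A) / len(returns_from_A)    # = 0.0

    # Certainty-equivalence answer: fit the maximum-likelihood Markov model and solve it.
    # Every visit to A was followed by B with reward 0, so V(A) = V(B);
    # from B, 6 of the 8 episodes ended with reward 1, so V(B) = 6/8.
    V_B_ce = 6 / 8
    V_A_ce = 0 + V_B_ce                                    # = 0.75, the batch TD(0) answer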


Learning An Action-Value Function

We now consider the use of TD methods for the control problem

As with MC, we need to balance exploration and exploitation
Again, two schemes: on-policy and off-policy

We’ll start with on-policy, and learn an action-value function

After every transition from a non-terminal state s_t, do this:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0


SARSA: On-Policy TD(0) Learning

One can easily turn this into a control method by always updating the policy to be greedy with respect to the current estimate of Q(s,a)
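
A minimal Python sketch of the resulting SARSA control loop; the env interface and the ε-greedy helper are assumptions, not from the lecture notes:

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, eps=0.1):
        """Pick a greedy action w.p. 1 - eps, otherwise a random action."""
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
        Q = defaultdict(float)                             # Q(s, a), initialized to 0
        for _ in range(num_episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, actions, eps)
            done = False
            while not done:
                s2, r, done = env.step(a)                  # take a, observe r and s'
                a2 = epsilon_greedy(Q, s2, actions, eps)   # choose a' from s' using the same policy
                target = r if done else r + gamma * Q[(s2, a2)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])  # on-policy update toward r + gamma Q(s', a')
                s, a = s2, a2
        return Q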


Q-Learning: Off-Policy TD Control

One of the most important breakthroughs in RL was the development of Q-Learning - an off-policy TD control algorithm (1989)

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
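
A corresponding hedged sketch of the Q-learning loop, reusing the assumed env interface and ε-greedy helper from the SARSA sketch above:

    def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
        Q = defaultdict(float)                            # Q(s, a), initialized to 0
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(Q, s, actions, eps)    # behavior policy: epsilon-greedy w.r.t. Q
                s2, r, done = env.step(a)
                best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # off-policy: max over a'
                s = s2
        return Q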


Q-Learning: Off-Policy TD Control (cont.)

The learned action-value function, Q, directly approximates the optimal action-value function, Q*

Converges as long as all state-action pairs continue to be visited and their values updated

Why is it considered an off-policy control method?

How expensive is it to implement?
