CHAPTER 16: Reinforcement Learning
Page 1:

CHAPTER 16: Reinforcement Learning
Lecture Notes for E. Alpaydın, 2004, Introduction to Machine Learning © The MIT Press (V1.1)

Page 2:

Introduction

• Game-playing: a sequence of moves to win a game
• Robot in a maze: a sequence of actions to find a goal
• The agent has a state in an environment, takes an action, and sometimes receives a reward while the state changes
• Credit-assignment problem
• Learn a policy

Page 3:

Single State: K-armed Bandit

• Among K levers, choose the one that pays best
• $Q(a)$: value of action $a$; reward is $r_a$
• Set $Q(a) = r_a$
• Choose $a^*$ if $Q(a^*) = \max_a Q(a)$
• Rewards stochastic (keep a running estimate of the expected reward):
$$ Q_{t+1}(a) \leftarrow Q_t(a) + \eta\left[ r_{t+1}(a) - Q_t(a) \right] $$
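A minimal sketch of this incremental update on a simulated bandit (the payout means, the noise model, and the small amount of random exploration are illustrative assumptions, not part of the slide):

import random

def pull(lever, true_means=(1.0, 2.5, 0.5)):
    # Stochastic reward: the chosen lever's (made-up) true mean plus Gaussian noise.
    return random.gauss(true_means[lever], 1.0)

K = 3
eta = 0.1                      # learning rate
Q = [0.0] * K                  # estimated value of each lever
for t in range(1000):
    # Mostly greedy, with a little random exploration so no lever is starved.
    a = random.randrange(K) if random.random() < 0.1 else max(range(K), key=lambda i: Q[i])
    r = pull(a)
    Q[a] += eta * (r - Q[a])   # Q_{t+1}(a) = Q_t(a) + eta * (r_{t+1}(a) - Q_t(a))

print(Q)  # the estimates should approach the true means, and the best lever dominates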

Page 4:

Example: Q-Learning


Remarks:
• The learning rate determines how quickly $Q(s,a)$ changes based on "new evidence"
• The discount factor determines the importance of future rewards
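As a worked illustration of these two factors (the numbers are made up): take learning rate $\eta = 0.5$, discount $\gamma = 0.9$, current estimate $Q(s,a) = 1$, observed reward $r = 2$, and $\max_{a'} Q(s',a') = 3$. Then

$$ Q(s,a) \leftarrow Q(s,a) + \eta\left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] = 1 + 0.5\,(2 + 0.9 \cdot 3 - 1) = 2.85 $$

A larger $\eta$ moves the estimate further toward the new evidence; a larger $\gamma$ gives the next state's value more weight in the target.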

Page 5:

Example: The STU-World

[Figure: the STU-World, a ten-state example with a reward attached to each state (e.g., R=+5, R=+1, R=+9, R=+4) and stochastic transitions labelled by compass actions (n, s, e, w, ne, sw, nw) and their probabilities.]

Problem: What actions should an agent choose to maximize its rewards?

Remark: there are no terminal states.

Page 6:

Elements of RL (Markov Decision Processes)

• $s_t$: state of the agent at time $t$
• $a_t$: action taken at time $t$
• In $s_t$, action $a_t$ is taken, the clock ticks, reward $r_{t+1}$ is received, and the state changes to $s_{t+1}$
• Next-state probability: $P(s_{t+1} \mid s_t, a_t)$
• Reward probability: $p(r_{t+1} \mid s_t, a_t)$
• Initial state(s), goal state(s)
• Episode (trial): a sequence of actions from an initial state to a goal (Sutton and Barto, 1998; Kaelbling et al., 1996)
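A minimal sketch of how these elements can be held in code, for a hypothetical two-state MDP (all state names, actions, probabilities, and rewards below are illustrative, not from the lecture):

# P[s][a][s'] = P(s'|s,a), R[s][a] = E[r_{t+1} | s_t=s, a_t=a]
states  = ["s0", "s1"]
actions = ["stay", "move"]

P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

An episode is then a sampled sequence s0, a0, r1, s1, a1, r2, ... drawn from these distributions until a goal state is reached.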

Page 7:

Policy and Cumulative Reward

• Policy: $\pi : S \rightarrow A$, with $\pi(s_t) = a_t$
• Value of a policy: $V^{\pi}(s_t)$
• Finite horizon:
$$ V^{\pi}(s_t) = E\left[ r_{t+1} + r_{t+2} + \cdots + r_{t+T} \right] = E\left[ \sum_{i=1}^{T} r_{t+i} \right] $$
• Infinite horizon:
$$ V^{\pi}(s_t) = E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \right] = E\left[ \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i} \right] $$
where $0 \le \gamma < 1$ is the discount rate.
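A small sketch of the infinite-horizon value as a discounted sum, computed for one sampled episode (the reward sequence and the truncation after four steps are made up for illustration; $V^{\pi}(s_t)$ is the expectation of this quantity over episodes):

gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4} from one episode

# Discounted return: sum_i gamma^(i-1) * r_{t+i}
ret = sum(gamma**i * r for i, r in enumerate(rewards))
print(ret)  # 1 + 0.9*0 + 0.81*2 + 0.729*5 = 6.265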

Page 8:

• Value of the optimal policy: $V^{*}(s_t) = \max_{\pi} V^{\pi}(s_t), \ \forall s_t$
$$ V^{*}(s_t) = \max_{a_t} E\left[ \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i} \right] = \max_{a_t} E\left[ r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1} \right] = \max_{a_t} E\left[ r_{t+1} + \gamma V^{*}(s_{t+1}) \right] $$
$$ V^{*}(s_t) = \max_{a_t} \left( E\left[ r_{t+1} \right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right) $$
• In terms of action values, $V^{*}(s_t) = \max_{a_t} Q^{*}(s_t, a_t)$, which gives Bellman's equation:
$$ Q^{*}(s_t, a_t) = E\left[ r_{t+1} \right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) $$
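A minimal sketch of one such backup for a single state, written against the illustrative P and R dictionaries from the MDP sketch above (V here is just some current estimate of the value function):

def bellman_backup(s, P, R, V, gamma=0.9):
    # V(s) <- max_a ( E[r | s,a] + gamma * sum_{s'} P(s'|s,a) * V(s') )
    return max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
               for a in P[s])

# e.g. V = {"s0": 0.0, "s1": 0.0}; V["s0"] = bellman_backup("s0", P, R, V)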

Page 9:

Model-Based Learning

• The environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is known
• There is no need for exploration
• Can be solved using dynamic programming, e.g., the Bellman update
• Solve for the optimal value function:
$$ V^{*}(s_t) = \max_{a_t} \left( E\left[ r_{t+1} \right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right) $$
• Optimal policy:
$$ \pi^{*}(s_t) = \arg\max_{a_t} \left( E\left[ r_{t+1} \mid s_t, a_t \right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right) $$
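A sketch of reading the optimal policy off a known $V^{*}$ via this arg max, again using the illustrative P and R dictionaries from the MDP sketch above:

def greedy_policy(P, R, V, gamma=0.9):
    # pi*(s) = argmax_a ( E[r | s,a] + gamma * sum_{s'} P(s'|s,a) * V*(s') )
    return {s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            for s in P}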

Page 10:

Value Iteration

Goal: find the optimal policy.
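The slide's own pseudocode is not reproduced in this transcript; a minimal sketch of standard value iteration over the illustrative P and R dictionaries from the MDP sketch above would be:

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman update for state s
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:          # stop when no state's value changes by more than tol
            return V

The optimal policy is then obtained from the converged V by the arg max on the previous slide.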

Page 11:

Temporal Difference Learning

• The environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is not known; this is model-free learning
• Exploration is needed to sample from $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$
• Use the reward received in the next time step to update the value of the current state (action)
• The update is driven by the temporal difference between the value of the current state (action) and the reward plus the discounted value of the next state
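A sketch of one such model-free update of V, driven only by a sampled transition (s, r, s'); the learning rate eta and discount gamma are assumed constants:

def td_update(V, s, r, s_next, eta=0.1, gamma=0.9):
    # V(s) <- V(s) + eta * [ r + gamma * V(s') - V(s) ]
    V[s] += eta * (r + gamma * V[s_next] - V[s])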

Page 12:

Exploration Strategies

• ε-greedy: with probability ε, choose one action uniformly at random; with probability 1−ε, choose the best action
• Probabilistic (soft-max):
$$ P(a \mid s) = \frac{\exp Q(s,a)}{\sum_{b=1}^{A} \exp Q(s,b)} $$
• Move smoothly from exploration to exploitation
• Decrease ε over time
• Annealing (soft-max with temperature $T$):
$$ P(a \mid s) = \frac{\exp\left[ Q(s,a)/T \right]}{\sum_{b=1}^{A} \exp\left[ Q(s,b)/T \right]} $$
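A sketch of these strategies for a single state's action values (Q below is assumed to be a dict mapping actions to estimates; the names are illustrative):

import math, random

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise exploit the best action.
    if random.random() < epsilon:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def softmax_action(Q, T=1.0):
    # P(a|s) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T); large T -> more exploration.
    actions = list(Q)
    weights = [math.exp(Q[a] / T) for a in actions]
    return random.choices(actions, weights=weights)[0]

Annealing corresponds to decreasing T (or ε) over time, moving smoothly from exploration to exploitation.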

Page 13:

Nondeterministic Rewards and Actions

• When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep running averages (expected values) instead of making direct assignments
• Q-learning (Watkins and Dayan, 1992) uses the backup
$$ \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \eta\left[ r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right] $$
• Off-policy (Q-learning) vs. on-policy (Sarsa)
• Learning V (TD-learning: Sutton, 1988):
$$ V(s_t) \leftarrow V(s_t) + \eta\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] $$

Page 14:

Q-learninga’ is chosen based on maximum q value

Page 15:

Sarsaa’ is chosen based on policy
