CHAPTER 16:
Reinforcement Learning
Lecture Notes for E. Alpaydın, Introduction to Machine Learning © 2004 The MIT Press (V1.1)
Introduction
- Game playing: a sequence of moves to win a game.
- Robot in a maze: a sequence of actions to find a goal.
- The agent has a state in an environment; it takes an action, sometimes receives a reward, and the state changes.
- Credit assignment: which of the earlier actions deserve credit for a reward received later?
- Learn a policy.
Single State: K-armed Bandit
- Among K levers, choose the one that pays best.
- Q(a): value of action a; the reward is r_a.
- Set Q(a) = r_a.
- Choose a* if Q(a*) = max_a Q(a).
- Rewards stochastic (keep an expected reward), where η is the learning rate:

$$Q_{t+1}(a) \leftarrow Q_t(a) + \eta\,\left[r_{t+1}(a) - Q_t(a)\right]$$
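A minimal Python sketch of this update; the lever payout distributions, learning rate, and step count are illustrative assumptions, and the purely greedy choice is only for simplicity (exploration strategies come later in the chapter):

```python
import random

def run_bandit(true_means, eta=0.1, steps=1000):
    """Track a running estimate Q(a) of each lever's expected payout."""
    K = len(true_means)
    Q = [0.0] * K
    for _ in range(steps):
        a = max(range(K), key=lambda i: Q[i])   # greedy: a* = argmax_a Q(a)
        r = random.gauss(true_means[a], 1.0)    # stochastic reward for lever a
        Q[a] += eta * (r - Q[a])                # the delta rule above
    return Q

# Three hypothetical levers with hidden mean payouts
print(run_bandit([1.0, 2.5, 0.5]))
```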
Example: Q-Learning
Remarks:
- The learning rate η determines how quickly Q(s, a) changes based on "new evidence".
- The discount factor γ determines the importance of future rewards.
Example: The STU-World
[Figure: a ten-state world. Known rewards include state 1: R=+5, state 2: R=+1, state 3: R=+9, state 5: R=+4, state 10: R=+1 (the rest are unreadable in this copy). States are linked by moves n, s, e, w, ne, nw, sw and by stochastic actions x and y with transition probabilities such as x/0.9, x/0.1, y/0.1, y/0.4, y/0.5.]

Problem: What actions should an agent choose to maximize its rewards?

Remark: There are no terminal states.
Elements of RL (Markov Decision Processes)

- s_t: state of the agent at time t
- a_t: action taken at time t
- In s_t, action a_t is taken, the clock ticks, reward r_{t+1} is received, and the state changes to s_{t+1}.
- Next-state probability: P(s_{t+1} | s_t, a_t)
- Reward probability: p(r_{t+1} | s_t, a_t)
- Initial state(s), goal state(s)
- Episode (trial): a sequence of actions from an initial state to a goal state (Sutton and Barto, 1998; Kaelbling et al., 1996).
Policy and Cumulative Reward
- Policy: π : S → A, with a_t = π(s_t)
- Value of a policy: V^π(s_t)
- Finite horizon:

$$V^\pi(s_t) = E[r_{t+1} + r_{t+2} + \cdots + r_{t+T}] = E\left[\sum_{i=1}^{T} r_{t+i}\right]$$

- Infinite horizon:

$$V^\pi(s_t) = E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots] = E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]$$

where 0 ≤ γ < 1 is the discount rate.
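As an illustrative numeric check (my example, not from the slides): under γ = 0.9, a policy that collects a constant reward of 1 at every step forms a geometric series,

$$V^\pi(s_t) = \sum_{i=1}^{\infty} \gamma^{i-1} \cdot 1 = \frac{1}{1-\gamma} = 10,$$

which shows how discounting keeps the infinite-horizon sum finite.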
Bellman's Equation

The value of the optimal policy:

$$V^*(s_t) = \max_{\pi} V^\pi(s_t), \quad \forall s_t$$

$$V^*(s_t) = \max_{a_t} E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right] = \max_{a_t} E\left[r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1}\right] = \max_{a_t} E\left[r_{t+1} + \gamma V^*(s_{t+1})\right]$$

$$V^*(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$

In terms of state-action values:

$$Q^*(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$

$$V^*(s_t) = \max_{a_t} Q^*(s_t, a_t)$$
Model-Based Learning

- The environment model, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is known.
- There is no need for exploration.
- Can be solved using dynamic programming, e.g., the Bellman update.
- Solve for the optimal value function:

$$V^*(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$

- Optimal policy:

$$\pi^*(s_t) = \arg\max_{a_t} \left( E[r_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$
Value Iteration
Goal: Find the optimal policy.
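A minimal sketch of value iteration, assuming the model is given by hypothetical structures P[s][a] (a list of (probability, next_state) pairs) and R[s][a] (the expected reward E[r | s, a]):

```python
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Apply the Bellman update until V stops changing, then read off pi*."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # V(s) <- max_a ( R(s,a) + gamma * sum_s' P(s'|s,a) V(s') )
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # pi*(s) = argmax_a ( R(s,a) + gamma * sum_s' P(s'|s,a) V*(s') )
    pi = [max(range(len(P[s])),
              key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
          for s in range(len(P))]
    return V, pi

# Made-up 2-state, 2-action model: action 1 is better in both states
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 0)], [(1.0, 1)]]]
R = [[0.0, 1.0],
     [0.0, 2.0]]
print(value_iteration(P, R))   # V converges to about [19, 20]
```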
Temporal Difference Learning
- The environment, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is not known; this is model-free learning.
- Exploration is needed to sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t).
- Use the reward received in the next time step to update the value of the current state (or action).
- The update is driven by the temporal difference between the current value estimate and the reward plus the discounted value of the next state, as in the sketch below.
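A sketch of the resulting TD(0) update for state values; the env.reset()/env.step(a) interface and the policy function are hypothetical stand-ins for whatever simulator is available:

```python
from collections import defaultdict

def td0(env, policy, episodes=500, eta=0.1, gamma=0.9):
    """After each step, move V(s) toward the sampled target r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))
            target = r if done else r + gamma * V[s2]
            V[s] += eta * (target - V[s])   # temporal-difference update
            s = s2
    return V
```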
Exploration Strategies
- ε-greedy: with probability ε, choose one action uniformly at random; with probability 1 − ε, choose the best action.
- Probabilistic (softmax):

$$P(a \mid s) = \frac{\exp Q(s, a)}{\sum_{b=1}^{A} \exp Q(s, b)}$$

- Move smoothly from exploration to exploitation: decrease ε over time, or anneal a temperature T:

$$P(a \mid s) = \frac{\exp\left[Q(s, a)/T\right]}{\sum_{b=1}^{A} \exp\left[Q(s, b)/T\right]}$$
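Both strategies fit in a few lines of Python; here Q_s is a hypothetical list of the values Q(s, a) for the current state:

```python
import math
import random

def epsilon_greedy(Q_s, eps=0.1):
    """With probability eps explore uniformly at random; otherwise exploit."""
    if random.random() < eps:
        return random.randrange(len(Q_s))
    return max(range(len(Q_s)), key=lambda a: Q_s[a])

def softmax(Q_s, T=1.0):
    """Sample a with P(a|s) proportional to exp(Q(s,a)/T).
    Large T: nearly uniform (explore); T -> 0: nearly greedy (exploit)."""
    weights = [math.exp(q / T) for q in Q_s]
    return random.choices(range(len(Q_s)), weights=weights)[0]
```

Annealing then amounts to shrinking eps (or T) as learning progresses.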
Nondeterministic Rewards and Actions

- When next states and rewards are nondeterministic (there is an opponent, or randomness in the environment), we keep running averages (expected values) instead of making direct assignments.
- Q-learning (Watkins and Dayan, 1992) backs up the estimate Q̂:

$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \eta \left( r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right)$$

- Off-policy (Q-learning) vs. on-policy (Sarsa).
- Learning V (TD-learning; Sutton, 1988):

$$V(s_t) \leftarrow V(s_t) + \eta \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)$$
Q-learninga’ is chosen based on maximum q value
Sarsaa’ is chosen based on policy