CS325 Artificial Intelligence
Ch. 21 – Reinforcement Learning
Cengiz Günay, Emory Univ.
Spring 2013
Rats!

(Image: Fundooprofessor)

A rat is put in a cage with a lever. Each lever press sends a signal to the rat's brain, straight to the reward center.

The rat presses the lever continuously until . . . it dies, because it stops eating and drinking.
Dopamine Neurons Respond to Novelty

(Image: sciencemuseum.org.uk)

It turns out: novelty detection = the Temporal Difference rule in Reinforcement Learning (Sutton and Barto, 1981).

Schultz et al. (1997)
[Figure: the learning-agent architecture. A Critic compares percepts from the Sensors against a Performance standard and sends feedback to the Learning element; the Learning element exchanges knowledge with the Performance element, makes changes to it, and sets learning goals for a Problem generator; the Performance element drives the Actuators in the Environment.]
Entry/Exit Surveys

Exit survey: Planning Under Uncertainty
- Why can't we use a regular MDP for partially-observable situations?
- Give an example where you think MDPs would help you solve a problem in your daily life.

Entry survey: Reinforcement Learning (0.25 points of final grade)
- In a partially-observable scenario, can reinforcement be used to learn MDP rewards?
- How can we improve MDPs by using the plan-execute cycle?
Blindfolded MDPs: Enter Reinforcement Learning

     1   2   3   4
a                G
b        x
c    S

What if the agent does not know anything about:
- where the walls are
- where the goals/penalties are

Can we use the plan-execute cycle?
- Explore first
- Update the world state based on reward/reinforcement

⇒ Reinforcement Learning (see the Scholarpedia article)
Where Does Reinforcement Learning Fit?

Machine learning so far:
- Unsupervised learning: find regularities in the input data, x
- Supervised learning: find a mapping between input and output, f(x) → y
- Reinforcement learning: find a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a)

Which is it?
                                                               S   U   R
Speech recognition: connect sounds to transcripts              X
Star data: find groupings from spectral emissions                  X
Rat presses lever: gets reward based on certain conditions             X
Elevator controller: multiple elevators, minimize wait time            X
But Wasn't That What Markov Decision Processes Were?

Find the optimal policy to maximize reward:

$$\pi(s) = \operatorname*{argmax}_{\pi} \; E\left[ \sum_{t=0}^{\infty} \gamma^t R(s, \pi(s), s') \right],$$

with the reward at a state, R(s), or from an action, R(s, a, s').

By estimating utility values:

$$V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s'),$$

with transition probabilities P(s' | s, a).

Assumes we know R(s) and P(s' | s, a).
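To make the update concrete, here is a minimal value-iteration sketch in Python; the dictionary layouts for P and R, the discount value, and the iteration count are assumptions of this sketch, not something the slide specifies.

```python
# Minimal value-iteration sketch (model known). The state/action
# encoding and the shapes of P and R are illustrative assumptions.
GAMMA = 0.9  # discount factor

def value_iteration(states, actions, P, R, n_iters=100):
    """P[(s, a)] is a list of (prob, s_next) pairs; R[s] is the state reward."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {
            s: R[s] + GAMMA * max(
                sum(p * V[s2] for p, s2 in P[(s, a)]) for a in actions
            )
            for s in states
        }
    return V
```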
Blindfolded Agent Must Learn From Rewards

Don't know R(s) or P(s' | s, a). What to do?

Use Reinforcement Learning (RL) to explore and find the rewards.

Agent types:
                   knows    learns    uses
Utility agent      P        R → U     U
Q-learning (RL)    –        Q(s, a)   Q
Reflex             –        π(s)      π
Video: Backgammon and Choppers
How Much to Learn?

1. Passive RL: the simple case
   - Keep the policy π(s) fixed; learn the rest
   - Always do the same actions, and learn the utilities
   - Examples: a public-transit commute; learning a difficult game
2. Active RL
   - Learn the policy at the same time
   - Explore better by changing the policy
   - Example: driving your own car
RL in Practice: Temporal Difference (TD) Rule

Animals use a derivative. Remember value iteration:

$$V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s').$$

TD rule: use the temporal difference when going s → s':

$$V(s) \leftarrow V(s) + \alpha \left( R(s) + \gamma V(s') - V(s) \right),$$

where α is the learning rate and γ is the discount factor.

It's even simpler than before!
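A sketch of the TD rule as code, to underline the point: unlike value iteration, it touches only the two states of the observed transition. The dictionary layout and default parameters are my assumptions.

```python
def td_update(V, s, s_next, reward, alpha=0.1, gamma=1.0):
    """TD rule: V(s) <- V(s) + alpha * (R(s) + gamma*V(s') - V(s)).
    No model P(s'|s,a) and no max over actions -- just the observed step."""
    v = V.get(s, 0.0)
    V[s] = v + alpha * (reward + gamma * V.get(s_next, 0.0) - v)
```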
Passive RL: Simple Case

     1   2   3   4
a               +1
b        x      −1
c    S

Keep the same policy; that is, follow the same path and update the values V(s).

To mimic increasing confidence, reduce the learning rate with the number of visits N(s):

$$\alpha = \frac{1}{N(s) + 1},$$

like in simulated annealing. The TD rule becomes:

$$V(s) \leftarrow V(s) + \frac{1}{N(s) + 1} \left( R(s) + \gamma V(s') - V(s) \right).$$
Passive RL: Simple Case (2)

     1   2   3   4
a    →   →   →  +1
b    ↑   x      −1
c    S

$$V(s) \leftarrow V(s) + \Delta, \qquad \Delta = \frac{1}{N(s) + 1} \left( R(s) + \gamma V(s') - V(s) \right).$$

For simplicity, γ = 1.

Step       N   V(s)   Δ
a3 → a4    1   0      1/2
a2 → a3    2   0      1/6
a3 → a4    2   1/2    1/6

Convergence time?
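As a sanity check on the table, a few lines that replay both trials with γ = 1, zero reward in the non-terminal squares, and V(a4) pinned at +1; that framing of the +1, and the episode bookkeeping, are my reading of the example.

```python
V = {"a2": 0.0, "a3": 0.0, "a4": 1.0}  # terminal a4 pinned at +1
N = {"a2": 0, "a3": 0}                  # visit counts

def passive_td(s, s_next, reward=0.0, gamma=1.0):
    N[s] += 1
    alpha = 1.0 / (N[s] + 1)            # decaying learning rate
    delta = alpha * (reward + gamma * V[s_next] - V[s])
    V[s] += delta
    return delta

for s, s2 in [("a2", "a3"), ("a3", "a4"),   # trial 1
              ("a2", "a3"), ("a3", "a4")]:  # trial 2
    delta = passive_td(s, s2)
    print(f"{s} -> {s2}: N={N[s]}, delta={delta:.4f}")
# a2 -> a3: N=1, delta=0.0000  (zero update, so not shown in the table)
# a3 -> a4: N=1, delta=0.5000
# a2 -> a3: N=2, delta=0.1667
# a3 -> a4: N=2, delta=0.1667
```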
Passive RL: Problems?

[Figure: left, utility estimates for states (1,1), (1,3), (2,1), (3,3), (4,3) over 500 trials; right, RMS error in utility over the first 100 trials.]

- Limited by the constant policy?
- Do rarely visited states cause poor estimates?
Active RL: Example

Greedy algorithm: after updating V(s) and N(s), recalculate the policy π(s).

[Figure: left, RMS error and policy loss over 500 trials; right, the 4×3 grid with the +1 and −1 terminals, showing the suboptimal path the greedy agent settles on.]

The greedy algorithm cannot find the optimal policy ⇒ it needs more exploration.
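One plausible reading of "recalculate the policy" in code: after each trial, re-derive the greedy policy from the current estimates. This sketch assumes the agent also keeps an estimated transition model P_hat, a detail the slide does not spell out.

```python
def greedy_policy(states, actions, P_hat, V):
    """pi(s) = argmax_a sum_s' P_hat(s'|s,a) * V(s'), greedily, no exploration.
    P_hat[(s, a)] is a list of (prob, s_next) pairs learned from experience."""
    return {
        s: max(actions,
               key=lambda a: sum(p * V.get(s2, 0.0)
                                 for p, s2 in P_hat.get((s, a), [])))
        for s in states
    }
```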
How to Improve Active RL?

Sources of error:

Reason for error          sampling   policy
V too low                 T          T
V too high                T          F
does increasing N help?   T          F

Sampling noise can push V either way, and more visits wash it out; a bad policy only makes V too low, and no number of extra visits fixes it.

Exploration vs. Exploitation:
- We can't do without it
- We can't live with too much of it

Exploration: minimize it, use random moves? (sketched below)
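"Random moves" usually means ε-greedy selection: mostly exploit, occasionally explore. A minimal sketch; ε = 0.1 is an arbitrary choice, and the Q-dictionary here anticipates the action values introduced two slides later.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps take a random action (explore);
    otherwise take the best-looking one (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```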
Exploring Agent

     1   2   3   4
a   +1  +1  +1  +1
b   +1   x  +1  +1
c    S  +1  +1  +1

- Initialize all V(s) = +R (e.g., +1)
- Use the optimistic value until N(s) > e, the exploration threshold
- Then use the learned V(s)

Wait until confidence is built up.
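A sketch of that rule as an "optimistic" value lookup; the constant names and the threshold setting are assumptions of the sketch.

```python
R_PLUS = 1.0  # optimistic initial value, the +R on the slide
N_E = 5       # exploration threshold e (an assumed setting)

def exploring_value(V, N, s):
    """Report the optimistic +R until s has been visited more than N_E times;
    only then trust the learned estimate V(s)."""
    if N.get(s, 0) <= N_E:
        return R_PLUS            # pretend unexplored states are great
    return V.get(s, 0.0)         # then rely on experience
```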
Exploring Agent Does Much Better

[Figure: left, utility estimates for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3) over 100 trials; right, RMS error and policy loss over 100 trials.]
Q-Learning

Instead of V(s), use action values Q(s, a), where

$$V(s) = \max_a Q(s, a);$$

then value iteration becomes

$$Q(s, a) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a').$$

State of the art, but it also has problems with dimensionality.
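In practice, Q-learning replaces the model-based sum with a sampled TD update, which is what makes it model-free; a tabular sketch, with the dictionary layout and step sizes assumed:

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Sampled TD form: Q(s,a) += alpha*(R + gamma*max_a' Q(s',a') - Q(s,a)).
    The observed transition s -> s' stands in for P(s'|s,a)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (reward + gamma * best_next - q)
```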
Q-Learning in Real-World Problems

Translate the problem space into a feature space: $s = [f_1, \ldots, f_m]$
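With such a feature map, the Q-table is replaced by a weighted sum of features, which sidesteps the dimensionality problem; a minimal linear sketch under those assumptions (the feature vector and update are illustrative, not prescribed by the slide):

```python
def q_value(w, feats):
    """Linear approximation: Q(s, a) ~ sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def update_weights(w, feats, td_error, alpha=0.01):
    """Gradient-style step: w_i <- w_i + alpha * td_error * f_i."""
    return [wi + alpha * td_error * fi for wi, fi in zip(w, feats)]
```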