CS325 Artificial Intelligence
Ch. 21 – Reinforcement Learning

Cengiz Günay, Emory Univ.

Spring 2013

Rats!

Fundoo Professor

Rat put in a cage with lever. Each lever press sends a signal to the rat’s brain, to the reward center. Rat presses lever continuously until . . . it dies because it stops eating and drinking.

(Image slide; credit: Wikipedia.org)

Dopamine Neurons Respond to Novelty

sciencemuseum.org.uk

It turns out: Novelty detection = Temporal Difference rule in Reinforcement Learning (Sutton and Barto, 1981)

Schultz et al. (1997)

(Figure: the general learning-agent architecture. A Critic compares Sensor input against a Performance standard and gives feedback to the Learning element, which makes changes to the Performance element and sets learning goals for the Problem generator; the agent senses and acts on the Environment through Sensors and Actuators.)

Entry/Exit Surveys

Exit survey: Planning Under Uncertainty
Why can’t we use a regular MDP for partially-observable situations?
Give an example where you think MDPs would help you solve a problem in your daily life.

Entry survey: Reinforcement Learning (0.25 points of final grade)
In a partially-observable scenario, can reinforcement be used to learn MDP rewards?
How can we improve MDP by using the plan-execute cycle?

Blindfolded MDPs: Enter Reinforcement Learning

     1    2    3    4
a                   G
b         x
c    S

What if the agent does not know anything about:
where walls are
where goals/penalties are

Can we use the plan-execute cycle?

Explore first
Update world state based on reward/reinforcement

⇒ Reinforcement Learning (see Scholarpedia article)

Where Does Reinforcement Learning Fit?

Machine learning so far:
Unsupervised learning: find regularities in input data, x
Supervised learning: find mapping between input and output, f(x) → y
Reinforcement learning: find mapping between states and actions, s → a
(by finding optimal policy, π(s) → a)

Which is it? (S: supervised, U: unsupervised, R: reinforcement)

S   Speech recognition: connect sounds to transcripts
U   Star data: find groupings from spectral emissions
R   Rat presses lever: gets reward based on certain conditions
R   Elevator controller: multiple elevators, minimize wait time

But, Wasn’t That What Markov Decision Processes Were?

Find optimal policy to maximize reward:

π(s) = argmax_π E[ ∑_{t=0}^∞ γ^t R(s, π(s), s′) ],

with reward at a state, R(s), or from an action, R(s, a, s′).

By estimating utility values:

V(s) ← [ max_a γ ∑_{s′} P(s′|s, a) V(s′) ] + R(s),

with transition probabilities P(s′|s, a).

Assumes we know R(s) and P(s′|s, a).
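To make the recap concrete, here is a minimal value-iteration sketch over a small dictionary-based model; the container names (states, actions, P, R) are illustrative assumptions rather than code from the course.

    GAMMA = 0.9  # discount factor

    def value_iteration(states, actions, P, R, n_iters=100):
        """Repeatedly apply V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s').
        P[(s, a)] is a list of (probability, next_state) pairs and R[s] is the
        reward in state s; both are assumed to be known, as on this slide."""
        V = {s: 0.0 for s in states}
        for _ in range(n_iters):
            V = {
                s: R[s] + GAMMA * max(
                    sum(p * V[s2] for p, s2 in P[(s, a)]) for a in actions
                )
                for s in states
            }
        return V

Note that the sketch needs the full model, R(s) and P(s′|s, a), which is exactly what the blindfolded agent below does not have.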

Blindfolded Agent Must Learn From Rewards

Don’t know R(s) or P(s′|s, a). What to do?

Use Reinforcement Learning (RL) to explore and find rewards.

Agent types:

                    knows   learns    uses
    Utility agent   P       R → U     U
    Q-learning (RL)         Q(s, a)   Q
    Reflex agent            π(s)      π(s)

How Much to Learn?

1. Passive RL: Simple Case
   Keep policy π(s) fixed, learn the rest.
   Always do the same actions, and learn utilities.
   Examples: a public transit commute, learning a difficult game.

2. Active RL
   Learn the policy at the same time.
   Explore better by changing the policy.
   Example: driving your own car.

RL in Practice: Temporal Difference (TD) Rule

Animals use a derivative. Remember value iteration:

V(s) ← [ max_a γ ∑_{s′} P(s′|s, a) V(s′) ] + R(s).

TD rule: use the derivative when going s → s′:

V(s) ← V(s) + α (R(s) + γ V(s′) − V(s))

where:
α is the learning rate, and
γ is the discount factor.

It’s even simpler than before!
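A minimal sketch of this update, with the value estimates kept in a dict; the function name and arguments are illustrative assumptions.

    def td_update(V, s, s_next, reward, alpha, gamma):
        """One TD step for the observed transition s -> s_next:
        V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s))."""
        V[s] += alpha * (reward + gamma * V[s_next] - V[s])
        return V

Unlike the value-iteration recap above, no transition model P(s′|s, a) appears: the sampled successor s′ stands in for the expectation.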

Passive RL: Simple Case

     1    2    3    4
a                  +1
b         x        −1
c    S

Keep the same policy.
That is, follow the same path and update values, V(s).

To mimic increasing confidence, reduce the learning rate with the number of visits, N(s):

α = 1 / (N(s) + 1),

like in simulated annealing.

TD rule:

V(s) ← V(s) + (1 / (N(s) + 1)) (R(s) + γ V(s′) − V(s))

Passive RL: Simple Case (2)

     1    2    3    4
a    →    →    →   +1
b    ↑    x        −1
c    S

V(s) ← V(s) + ∆
∆ = (1 / (N(s) + 1)) (R(s) + γ V(s′) − V(s))

For simplicity, γ = 1.

               N   V(s)   ∆
    a3 → a4    1   0      1/2
    a2 → a3    2   0      1/6
    a3 → a4    2   1/2    1/6

Convergence time?
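As a quick check on the table, the sketch below replays those three updates with γ = 1, α = 1/(N + 1), reward 0 in the interior states, and the +1 terminal value held fixed at a4; the N values are taken straight from the table, and the printed deltas come out to 1/2, 1/6, 1/6.

    from fractions import Fraction

    V = {"a2": Fraction(0), "a3": Fraction(0), "a4": Fraction(1)}  # a4 holds the +1 goal
    steps = [("a3", "a4", 1), ("a2", "a3", 2), ("a3", "a4", 2)]    # (s, s', N)

    for s, s_next, n in steps:
        alpha = Fraction(1, n + 1)
        delta = alpha * (0 + V[s_next] - V[s])  # R(s) = 0 away from the goal
        V[s] += delta
        print(f"{s} -> {s_next}: delta = {delta}")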

Passive RL: Problems?

(Plots: utility estimates for states (1,1), (1,3), (2,1), (3,3), (4,3) over 500 trials, and RMS error in utility over 100 trials.)

Limited by the constant policy?
Fewer visited states cause poor estimates?

Active RL: Example

Greedy algorithm: after updating V(s) and N(s), recalculate the policy π(s).

(Plots: RMS error and policy loss over 500 trials, and the resulting greedy policy on the 4×3 grid with the +1 and −1 terminals.)

Greedy algorithm cannot find the optimal policy ⇒ needs more exploration

How to Improve Active RL?

Sources of error, by reason:

                          sampling   policy
    V too low             T          T
    V too high            T          F
    increasing N helps?   T          F

Exploration vs. Exploitation:
We can’t do without it.
We can’t live with too much of it.

Exploration: minimize it, use random moves?
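One standard way to add such random moves is an ε-greedy action choice (the slides do not commit to a particular scheme); a minimal sketch, assuming Q values keyed by (state, action):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Explore a random action with probability epsilon, otherwise
        exploit the currently best-looking action in state s."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))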

Exploring Agent

     1    2    3    4
a   +1   +1   +1   +1
b   +1    x   +1   +1
c    S   +1   +1   +1

Initialize all V(s) = +R (e.g., +1).
Explore until N(s) > e, the exploration threshold; then use the learned V(s).
Wait until confidence has been built up.
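A sketch of that rule: report the optimistic +R for a state until it has been visited more than the exploration threshold e, and only then trust the learned estimate. The constants and names below are illustrative assumptions.

    R_OPTIMISTIC = 1.0       # the +R used for unexplored states
    EXPLORE_THRESHOLD = 5    # the exploration threshold e (value assumed)

    def exploration_value(V, N, s):
        """Optimistic value until N(s) exceeds the threshold, then the learned V(s)."""
        if N.get(s, 0) <= EXPLORE_THRESHOLD:
            return R_OPTIMISTIC
        return V.get(s, 0.0)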

Exploring Agent Does Much Better

(Plots: utility estimates for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3) over 100 trials, and RMS error and policy loss over 100 trials.)

Q-Learning

Instead of V(s), use Q(s, a):

V(s) = max_a Q(s, a),

then the value iteration becomes

Q(s, a) = R(s) + γ ∑_{s′} P(s′|s, a) max_{a′} Q(s′, a′)

State of the art, but also has problems with dimensionality.
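In practice this Bellman form is applied as a sample-based TD update, so the unknown P(s′|s, a) never has to be written down; a minimal sketch with illustrative names:

    def q_update(Q, s, a, reward, s_next, actions, alpha, gamma):
        """Q-learning TD step:
        Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        td_error = reward + gamma * best_next - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
        return Q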

Q-Learning in Real World Problems

Translate the problem space to a feature space: s = [f1, . . . , fm]
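A common way to use such a feature vector (the slide stops at the representation) is a linear approximation Q(s, a) ≈ ∑_i w_i f_i(s, a), with the weights nudged by the same TD error; a minimal sketch under that assumption:

    def q_approx(weights, features):
        """Linear approximation: Q(s, a) ~ sum_i w_i * f_i(s, a)."""
        return sum(w * f for w, f in zip(weights, features))

    def update_weights(weights, features, td_error, alpha):
        """Move each weight along its feature in the direction of the TD error."""
        return [w + alpha * td_error * f for w, f in zip(weights, features)]

This keeps the state out of a huge table: only the m weights are learned, one per feature.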

