CS325 Artificial Intelligence
Ch. 21 – Reinforcement Learning
Cengiz Günay, Emory Univ.
Spring 2013
Rats!

(Image: Fundooprofessor)

A rat is put in a cage with a lever. Each lever press sends a signal to the rat's brain, straight to the reward center.

The rat presses the lever continuously until . . . it dies, because it stops eating and drinking.
Dopamine Neurons Respond to Novelty

(Image: sciencemuseum.org.uk)

It turns out: novelty detection = the Temporal Difference rule in Reinforcement Learning (Sutton and Barto, 1981).

Schultz et al. (1997)
[Figure: the learning-agent architecture. A Critic compares percepts from the Sensors against a Performance standard and sends feedback to the Learning element; the Learning element exchanges knowledge with the Performance element, makes changes to it, and sets learning goals for a Problem generator; the Performance element drives the Actuators in the Environment.]
Entry/Exit Surveys

Exit survey: Planning Under Uncertainty
- Why can't we use a regular MDP for partially-observable situations?
- Give an example where you think MDPs would help you solve a problem in your daily life.

Entry survey: Reinforcement Learning (0.25 points of final grade)
- In a partially-observable scenario, can reinforcement be used to learn MDP rewards?
- How can we improve MDPs by using the plan-execute cycle?
Blindfolded MDPs: Enter Reinforcement Learning

     1   2   3   4
a                G
b        x
c    S

What if the agent does not know anything about:
- where the walls are
- where the goals/penalties are

Can we use the plan-execute cycle?
- Explore first
- Update the world state based on reward/reinforcement

⇒ Reinforcement Learning (see the Scholarpedia article)
Where Does Reinforcement Learning Fit?

Machine learning so far:
- Unsupervised learning: find regularities in the input data, x
- Supervised learning: find a mapping between input and output, f(x) → y
- Reinforcement learning: find a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a)

Which is it?
                                                               S   U   R
Speech recognition: connect sounds to transcripts              X
Star data: find groupings from spectral emissions                  X
Rat presses lever: gets reward based on certain conditions             X
Elevator controller: multiple elevators, minimize wait time            X
But Wasn't That What Markov Decision Processes Were?

Find the optimal policy to maximize reward:

$$\pi(s) = \operatorname*{argmax}_{\pi} \; E\left[ \sum_{t=0}^{\infty} \gamma^t R(s, \pi(s), s') \right],$$

with the reward at a state, R(s), or from an action, R(s, a, s').

By estimating utility values:

$$V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s'),$$

with transition probabilities P(s' | s, a).

Assumes we know R(s) and P(s' | s, a).
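To make the update concrete, here is a minimal value-iteration sketch in Python; the dictionary layouts for P and R, the discount value, and the iteration count are assumptions of this sketch, not something the slide specifies.

```python
# Minimal value-iteration sketch (model known). The state/action
# encoding and the shapes of P and R are illustrative assumptions.
GAMMA = 0.9  # discount factor

def value_iteration(states, actions, P, R, n_iters=100):
    """P[(s, a)] is a list of (prob, s_next) pairs; R[s] is the state reward."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {
            s: R[s] + GAMMA * max(
                sum(p * V[s2] for p, s2 in P[(s, a)]) for a in actions
            )
            for s in states
        }
    return V
```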
Blindfolded Agent Must Learn From Rewards

Don't know R(s) or P(s' | s, a). What to do?

Use Reinforcement Learning (RL) to explore and find the rewards.

Agent types:
                   knows    learns    uses
Utility agent      P        R → U     U
Q-learning (RL)    –        Q(s, a)   Q
Reflex             –        π(s)      π
Video: Backgammon and Choppers
How Much to Learn?

1. Passive RL: the simple case
   - Keep the policy π(s) fixed; learn the rest
   - Always do the same actions, and learn the utilities
   - Examples: a public-transit commute; learning a difficult game
2. Active RL
   - Learn the policy at the same time
   - Explore better by changing the policy
   - Example: driving your own car
RL in Practice: Temporal Difference (TD) Rule

Animals use a derivative. Remember value iteration:

$$V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s').$$

TD rule: use the temporal difference when going s → s':

$$V(s) \leftarrow V(s) + \alpha \left( R(s) + \gamma V(s') - V(s) \right),$$

where α is the learning rate and γ is the discount factor.

It's even simpler than before!
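A sketch of the TD rule as code, to underline the point: unlike value iteration, it touches only the two states of the observed transition. The dictionary layout and default parameters are my assumptions.

```python
def td_update(V, s, s_next, reward, alpha=0.1, gamma=1.0):
    """TD rule: V(s) <- V(s) + alpha * (R(s) + gamma*V(s') - V(s)).
    No model P(s'|s,a) and no max over actions -- just the observed step."""
    v = V.get(s, 0.0)
    V[s] = v + alpha * (reward + gamma * V.get(s_next, 0.0) - v)
```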
Passive RL: Simple Case

     1   2   3   4
a               +1
b        x      −1
c    S

Keep the same policy; that is, follow the same path and update the values V(s).

To mimic increasing confidence, reduce the learning rate with the number of visits N(s):

$$\alpha = \frac{1}{N(s) + 1},$$

like in simulated annealing. The TD rule becomes:

$$V(s) \leftarrow V(s) + \frac{1}{N(s) + 1} \left( R(s) + \gamma V(s') - V(s) \right).$$
Passive RL: Simple Case (2)

     1   2   3   4
a    →   →   →  +1
b    ↑   x      −1
c    S

$$V(s) \leftarrow V(s) + \Delta, \qquad \Delta = \frac{1}{N(s) + 1} \left( R(s) + \gamma V(s') - V(s) \right).$$

For simplicity, γ = 1.

Step       N   V(s)   Δ
a3 → a4    1   0      1/2
a2 → a3    2   0      1/6
a3 → a4    2   1/2    1/6

Convergence time?
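As a sanity check on the table, a few lines that replay both trials with γ = 1, zero reward in the non-terminal squares, and V(a4) pinned at +1; that framing of the +1, and the episode bookkeeping, are my reading of the example.

```python
V = {"a2": 0.0, "a3": 0.0, "a4": 1.0}  # terminal a4 pinned at +1
N = {"a2": 0, "a3": 0}                  # visit counts

def passive_td(s, s_next, reward=0.0, gamma=1.0):
    N[s] += 1
    alpha = 1.0 / (N[s] + 1)            # decaying learning rate
    delta = alpha * (reward + gamma * V[s_next] - V[s])
    V[s] += delta
    return delta

for s, s2 in [("a2", "a3"), ("a3", "a4"),   # trial 1
              ("a2", "a3"), ("a3", "a4")]:  # trial 2
    delta = passive_td(s, s2)
    print(f"{s} -> {s2}: N={N[s]}, delta={delta:.4f}")
# a2 -> a3: N=1, delta=0.0000  (zero update, so not shown in the table)
# a3 -> a4: N=1, delta=0.5000
# a2 -> a3: N=2, delta=0.1667
# a3 -> a4: N=2, delta=0.1667
```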
Passive RL: Problems?

[Figure: left, utility estimates for states (1,1), (1,3), (2,1), (3,3), (4,3) over 500 trials; right, RMS error in utility over the first 100 trials.]

- Limited by the constant policy?
- Do rarely visited states cause poor estimates?
Active RL: Example

Greedy algorithm: after updating V(s) and N(s), recalculate the policy π(s).

[Figure: left, RMS error and policy loss over 500 trials; right, the 4×3 grid with the +1 and −1 terminals, showing the suboptimal path the greedy agent settles on.]

The greedy algorithm cannot find the optimal policy ⇒ it needs more exploration.
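One plausible reading of "recalculate the policy" in code: after each trial, re-derive the greedy policy from the current estimates. This sketch assumes the agent also keeps an estimated transition model P_hat, a detail the slide does not spell out.

```python
def greedy_policy(states, actions, P_hat, V):
    """pi(s) = argmax_a sum_s' P_hat(s'|s,a) * V(s'), greedily, no exploration.
    P_hat[(s, a)] is a list of (prob, s_next) pairs learned from experience."""
    return {
        s: max(actions,
               key=lambda a: sum(p * V.get(s2, 0.0)
                                 for p, s2 in P_hat.get((s, a), [])))
        for s in states
    }
```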
How to Improve Active RL?

Sources of error:

Reason for error          sampling   policy
V too low                 T          T
V too high                T          F
does increasing N help?   T          F

Sampling noise can push V either way, and more visits wash it out; a bad policy only makes V too low, and no number of extra visits fixes it.

Exploration vs. Exploitation:
- We can't do without it
- We can't live with too much of it

Exploration: minimize it, use random moves? (sketched below)
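"Random moves" usually means ε-greedy selection: mostly exploit, occasionally explore. A minimal sketch; ε = 0.1 is an arbitrary choice, and the Q-dictionary here anticipates the action values introduced two slides later.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps take a random action (explore);
    otherwise take the best-looking one (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```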
Exploring Agent

     1   2   3   4
a   +1  +1  +1  +1
b   +1   x  +1  +1
c    S  +1  +1  +1

- Initialize all V(s) = +R (e.g., +1)
- Use the optimistic value until N(s) > e, the exploration threshold
- Then use the learned V(s)

Wait until confidence is built up.
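A sketch of that rule as an "optimistic" value lookup; the constant names and the threshold setting are assumptions of the sketch.

```python
R_PLUS = 1.0  # optimistic initial value, the +R on the slide
N_E = 5       # exploration threshold e (an assumed setting)

def exploring_value(V, N, s):
    """Report the optimistic +R until s has been visited more than N_E times;
    only then trust the learned estimate V(s)."""
    if N.get(s, 0) <= N_E:
        return R_PLUS            # pretend unexplored states are great
    return V.get(s, 0.0)         # then rely on experience
```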
Exploring Agent Does Much Better

[Figure: left, utility estimates for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3) over 100 trials; right, RMS error and policy loss over 100 trials.]
Q-Learning

Instead of V(s), use action values Q(s, a), where

$$V(s) = \max_a Q(s, a);$$

then value iteration becomes

$$Q(s, a) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a').$$

State of the art, but it also has problems with dimensionality.
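In practice, Q-learning replaces the model-based sum with a sampled TD update, which is what makes it model-free; a tabular sketch, with the dictionary layout and step sizes assumed:

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Sampled TD form: Q(s,a) += alpha*(R + gamma*max_a' Q(s',a') - Q(s,a)).
    The observed transition s -> s' stands in for P(s'|s,a)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (reward + gamma * best_next - q)
```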
Q-Learning in Real-World Problems

Translate the problem space into a feature space: $s = [f_1, \ldots, f_m]$
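With such a feature map, the Q-table is replaced by a weighted sum of features, which sidesteps the dimensionality problem; a minimal linear sketch under those assumptions (the feature vector and update are illustrative, not prescribed by the slide):

```python
def q_value(w, feats):
    """Linear approximation: Q(s, a) ~ sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def update_weights(w, feats, td_error, alpha=0.01):
    """Gradient-style step: w_i <- w_i + alpha * td_error * f_i."""
    return [wi + alpha * td_error * fi for wi, fi in zip(w, feats)]
```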