Page 1: Reinforcement Learning

Reinforcement Learning

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

Based on slides by Dan Klein

Page 2: Reinforcement Learning

Reinforcement Learning

§ Basic idea:
  § Receive feedback in the form of rewards
  § Agent's utility is defined by the reward function
  § Must (learn to) act so as to maximize expected rewards
  § All learning is based on observed samples of outcomes!

[Diagram: the agent takes actions a in the environment; the environment returns the next state s and a reward r]
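To make this loop concrete, here is a minimal sketch in Python of one episode of agent–environment interaction; the env/agent objects and their reset/step/act/observe methods are hypothetical stand-ins for illustration, not part of these slides.

```python
# Minimal sketch of the RL feedback loop (hypothetical env/agent interface).
def run_episode(env, agent):
    s = env.reset()                        # initial state from the environment
    total_reward, done = 0.0, False
    while not done:
        a = agent.act(s)                   # agent chooses an action
        s_next, r, done = env.step(a)      # environment returns next state and reward
        agent.observe(s, a, s_next, r)     # all learning comes from observed samples
        total_reward += r
        s = s_next
    return total_reward
```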

Page 3: Reinforcement Learning

Reinforcement Learning

§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s,a,s')
  § A reward function R(s,a,s')

§ Still looking for a policy π(s)

§ New twist: don't know T or R
  § I.e., we don't know which states are good or what the actions do
  § Must actually try actions and states out to learn

Page 4: Offline (MDPs) vs. Online (RL)

Offline (MDPs) vs. Online (RL)

[Two panels: Offline Solution (solving a known MDP) vs. Online Learning (learning by acting)]

Page 5: Model-Based Learning

Model-Based Learning

Page 6: Model-Based Learning

Model-Based Learning

§ Model-based idea:
  § Learn an approximate model based on experiences
  § Solve for values as if the learned model were correct

§ Step 1: Learn empirical MDP model (sketched in code below)
  § Count outcomes s' for each s, a
  § Normalize to give an estimate of T̂(s,a,s')
  § Discover each R̂(s,a,s') when we experience (s, a, s')

§ Step 2: Solve the learned MDP
  § For example, use value iteration, as before
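A minimal sketch of Step 1 in Python, assuming experience arrives as (s, a, s', r) tuples; the function name and data layout are assumptions for illustration, not from the slides.

```python
from collections import defaultdict

def learn_empirical_mdp(samples):
    """Count outcomes s' for each (s, a), then normalize into T-hat; record R-hat."""
    counts = defaultdict(lambda: defaultdict(int))
    R_hat = {}
    for s, a, s2, r in samples:
        counts[(s, a)][s2] += 1           # count outcomes s' for each s, a
        R_hat[(s, a, s2)] = r             # discovered when (s, a, s') is experienced
    T_hat = {}
    for (s, a), outs in counts.items():
        n = sum(outs.values())
        for s2, c in outs.items():
            T_hat[(s, a, s2)] = c / n     # normalize counts into probabilities
    return T_hat, R_hat
```

On the four episodes in the example below, this yields T(B, east, C) = 1.00, T(C, east, D) = 0.75, and T(C, east, A) = 0.25, matching the learned model shown there.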

Page 7: Example: Model-Based Learning

Example: Model-Based Learning

Input Policy π         Assume: γ = 1

[Gridworld: A in the top row, B C D in the middle row, E in the bottom row]

Observed Episodes (Training):

Episode 1: B, east, C, -1   C, east, D, -1   D, exit, x, +10
Episode 2: B, east, C, -1   C, east, D, -1   D, exit, x, +10
Episode 3: E, north, C, -1  C, east, A, -1   A, exit, x, -10
Episode 4: E, north, C, -1  C, east, D, -1   D, exit, x, +10

Learned Model:

T(s,a,s'):
  T(B, east, C) = 1.00
  T(C, east, D) = 0.75
  T(C, east, A) = 0.25

R(s,a,s'):
  R(B, east, C) = -1
  R(C, east, D) = -1
  R(D, exit, x) = +10

Page 8: Model-Free Learning

Model-Free Learning

Page 9: Passive Reinforcement Learning

Passive Reinforcement Learning

Page 10: Passive Reinforcement Learning

Passive Reinforcement Learning

§ Simplified task: policy evaluation
  § Input: a fixed policy π(s)
  § You don't know the transitions T(s,a,s')
  § You don't know the rewards R(s,a,s')
  § Goal: learn the state values

§ In this case:
  § Learner is "along for the ride"
  § No choice about what actions to take
  § Just execute the policy and learn from experience
  § This is NOT offline planning! You actually take actions in the world.

Page 11: Direct Evaluation

Direct Evaluation

§ Goal: Compute values for each state under π

§ Idea: Average together observed sample values
  § Act according to π
  § Every time you visit a state, write down what the sum of discounted rewards turned out to be
  § Average those samples

§ This is called direct evaluation (a short sketch follows below)
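A minimal sketch of direct evaluation in Python, assuming each episode is recorded as a list of (state, reward) pairs in visit order; this follows the every-visit averaging described above, with names of my own choosing.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed sums of discounted rewards from each state visit."""
    returns = defaultdict(list)
    for episode in episodes:                       # episode: [(state, reward), ...]
        G = 0.0
        for state, reward in reversed(episode):    # walk backwards to accumulate returns
            G = reward + gamma * G                 # discounted return from this visit on
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

Run on the four training episodes of the next slide (e.g., Episode 1 as [('B', -1), ('C', -1), ('D', +10)]), it reproduces the output values shown there: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2.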

Page 12: Example: Direct Evaluation

Example: Direct Evaluation

Input Policy π         Assume: γ = 1

[Gridworld: A in the top row, B C D in the middle row, E in the bottom row]

Observed Episodes (Training):

Episode 1: B, east, C, -1   C, east, D, -1   D, exit, x, +10
Episode 2: B, east, C, -1   C, east, D, -1   D, exit, x, +10
Episode 3: E, north, C, -1  C, east, A, -1   A, exit, x, -10
Episode 4: E, north, C, -1  C, east, D, -1   D, exit, x, +10

Output Values:

V(A) = -10,  V(B) = +8,  V(C) = +4,  V(D) = +10,  V(E) = -2

Page 13: Problems with Direct Evaluation

Problems with Direct Evaluation

§ What's good about direct evaluation?
  § It's easy to understand
  § It doesn't require any knowledge of T, R
  § It eventually computes the correct average values, using just sample transitions

§ What's bad about it?
  § It wastes information about state connections
  § Each state must be learned separately
  § So, it takes a long time to learn

Output Values:

V(A) = -10,  V(B) = +8,  V(C) = +4,  V(D) = +10,  V(E) = -2

If B and E both go to C under this policy, how can their values be different?

Page 14: Why Not Use Policy Evaluation?

Why Not Use Policy Evaluation?

§ Simplified Bellman updates calculate V for a fixed policy:
  § Each round, replace V with a one-step-look-ahead layer over V:

      V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]

§ This approach fully exploited the connections between the states
  § Unfortunately, we need T and R to do it!

§ Key question: how can we do this update to V without knowing T and R?
  § In other words, how do we take a weighted average without knowing the weights?

[Diagram: one-step look-ahead tree from state s through action π(s) to successor states s']

Page 15: Sample-Based Policy Evaluation?

Sample-Based Policy Evaluation?

§ We want to improve our estimate of V by computing these averages:

      V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]

§ Idea: Take samples of outcomes s' (by doing the action!) and average:

      sample_1 = R(s, π(s), s'_1) + γ V_k^π(s'_1)
      sample_2 = R(s, π(s), s'_2) + γ V_k^π(s'_2)
      ...
      sample_n = R(s, π(s), s'_n) + γ V_k^π(s'_n)

      V_{k+1}^π(s) ← (1/n) Σ_i sample_i

[Diagram: from state s, action π(s) leads to sampled successors s'_1, s'_2, s'_3]

Almost! But we can't rewind time to get sample after sample from state s.

Page 16: Temporal Difference Learning

Temporal Difference Learning

Page 17: Temporal Difference Learning

Temporal Difference Learning

§ Big idea: learn from every experience!
  § Update V(s) each time we experience a transition (s, a, s', r)
  § Likely outcomes s' will contribute updates more often

§ Temporal difference learning of values
  § Policy still fixed, still doing evaluation!
  § Move values toward value of whatever successor occurs: running average

[Diagram: state s, action π(s), observed successor s']

Sample of V(s):   sample = R(s, π(s), s') + γ V^π(s')

Update to V(s):   V^π(s) ← (1 − α) V^π(s) + α · sample

Same update:      V^π(s) ← V^π(s) + α (sample − V^π(s))
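A minimal sketch of this update in Python, assuming V is a dict of current value estimates and the transition (s, r, s') came from following the fixed policy; the names are illustrative only.

```python
def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """Move V(s) toward the value of whatever successor actually occurred."""
    sample = r + gamma * V[s_next]      # sample of V(s) from this transition
    V[s] += alpha * (sample - V[s])     # the "same update", written as an error step
```

With γ = 1 and α = 1/2, feeding in the transition B, east, C, -2 from the example two slides ahead moves V(B) from 0 to -1.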

Page 18: Exponential Moving Average

Exponential Moving Average

§ Exponential moving average
  § The running interpolation update:

      x̄_n = (1 − α) · x̄_{n-1} + α · x_n

  § Makes recent samples more important:

      x̄_n = [ x_n + (1−α) x_{n-1} + (1−α)² x_{n-2} + … ] / [ 1 + (1−α) + (1−α)² + … ]

  § Forgets about the past (distant past values were wrong anyway)

§ Decreasing learning rate (alpha) can give converging averages
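A quick numeric illustration of the running interpolation update in Python; the sample values here are arbitrary.

```python
def exponential_moving_average(samples, alpha=0.5):
    """Blend each new sample in with weight alpha; older samples decay geometrically."""
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

# Recent samples matter more: the last sample alone contributes alpha of the estimate.
print(exponential_moving_average([0.0, 0.0, 10.0]))  # prints 5.0
```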

Page 19: Example: Temporal Difference Learning

Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2

States: [Gridworld: A in the top row, B C D in the middle row, E in the bottom row]

Observed Transitions:

Initial values:          V(A)=0,  V(B)=0,   V(C)=0,  V(D)=8,  V(E)=0

After B, east, C, -2:    V(A)=0,  V(B)=-1,  V(C)=0,  V(D)=8,  V(E)=0
  (V(B) ← (1 − 1/2)·0 + (1/2)·(−2 + 1·V(C)) = −1)

After C, east, D, -2:    V(A)=0,  V(B)=-1,  V(C)=3,  V(D)=8,  V(E)=0
  (V(C) ← (1 − 1/2)·0 + (1/2)·(−2 + 1·V(D)) = 3)

Page 20: Problems with TD Value Learning

Problems with TD Value Learning

§ TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

§ However, if we want to turn values into a (new) policy, we're sunk:

      π(s) = argmax_a Q(s, a)
      Q(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]

§ Idea: learn Q-values, not values
  § Makes action selection model-free too!

[Diagram: state s, action a, q-state (s, a), transition to successor s']

Page 21: Active Reinforcement Learning

Active Reinforcement Learning

Page 22: Active Reinforcement Learning

Active Reinforcement Learning

§ Full reinforcement learning: optimal policies (like value iteration)
  § You don't know the transitions T(s,a,s')
  § You don't know the rewards R(s,a,s')
  § You choose the actions now
  § Goal: learn the optimal policy / values

§ In this case:
  § Learner makes choices!
  § Fundamental tradeoff: exploration vs. exploitation
  § This is NOT offline planning! You actually take actions in the world and find out what happens…

Page 23: Detour: Q-Value Iteration

Detour: Q-Value Iteration

§ Value iteration: find successive (depth-limited) values
  § Start with V_0(s) = 0, which we know is right
  § Given V_k, calculate the depth k+1 values for all states:

      V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

§ But Q-values are more useful, so compute them instead
  § Start with Q_0(s, a) = 0, which we know is right
  § Given Q_k, calculate the depth k+1 q-values for all q-states:

      Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
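A minimal sketch of Q-value iteration in Python, assuming a known model with T[(s, a)] as a dict {s': prob} and R[(s, a, s')] as a scalar reward; this data layout is an assumption for illustration, not from the slides.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Iterate Q_{k+1}(s,a) = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * max_{a'} Q_k(s',a')).

    Assumed layout: T[(s, a)] is {s_next: prob}; R[(s, a, s_next)] is a scalar.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0 = 0, which we know is right
    for _ in range(iterations):
        # Build Q_{k+1} from Q_k; the comprehension reads the old Q throughout.
        Q = {
            (s, a): sum(
                p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)].items()
            )
            for s in states
            for a in actions
        }
    return Q
```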

Page 24: Q-Learning

Q-Learning

§ Q-Learning: sample-based Q-value iteration

§ Learn Q(s,a) values as you go
  § Receive a sample (s, a, s', r)
  § Consider your old estimate: Q(s, a)
  § Consider your new sample estimate:

      sample = r + γ max_{a'} Q(s', a')

  § Incorporate the new estimate into a running average:

      Q(s, a) ← (1 − α) Q(s, a) + α · sample
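A minimal sketch of one Q-learning update in Python, assuming Q is a dict keyed by (state, action) and actions(s') returns the actions available in s'; all names here are my own choices, not from the slides.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """Incorporate the sample (s, a, s', r) into a running average of Q(s, a)."""
    # New sample estimate: reward plus discounted value of the best next action
    # (0 if s' is terminal and has no actions).
    next_actions = actions(s_next)
    best_next = max(Q[(s_next, a2)] for a2 in next_actions) if next_actions else 0.0
    sample = r + gamma * best_next
    # Running average: blend the new sample with the old estimate.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```

Note that the max over next actions is taken regardless of which action the agent actually executes next; that detail is what makes the method off-policy, as the next slide notes.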

Page 25: Q-Learning Properties

Q-Learning Properties

§ Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!

§ This is called off-policy learning

§ Caveats:
  § You have to explore enough
  § You have to eventually make the learning rate small enough
  § … but not decrease it too quickly
  § Basically, in the limit, it doesn't matter how you select actions (!)

