
STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

SERIES EDITORS

Ralf Herbrich, Amazon Development Center, Berlin, Germany
Thore Graepel, Microsoft Research Ltd., Cambridge, UK

AIMS AND SCOPE

This series reflects the latest advances and applications in machine learning and pattern recognition through the publication of a broad range of reference works, textbooks, and handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence, robotics, computational/statistical learning theory, natural language processing, computer vision, game AI, game theory, neural networks, computational neuroscience, and other relevant topics, such as machine learning applied to bioinformatics or cognitive science, which might be proposed by potential contributors.

PUBLISHED TITLES

BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha

UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow

HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau

COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao

COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim

MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos

MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland

SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik

A FIRST COURSE IN MACHINE LEARNING
Simon Rogers and Mark Girolami

STATISTICAL REINFORCEMENT LEARNING: MODERN MACHINE LEARNING APPROACHES
Masashi Sugiyama

MULTI-LABEL DIMENSIONALITY REDUCTION
Liang Sun, Shuiwang Ji, and Jieping Ye

REGULARIZATION, OPTIMIZATION, KERNELS, AND SUPPORT VECTOR MACHINES
Johan A. K. Suykens, Marco Signoretto, and Andreas Argyriou

ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Masashi Sugiyama
University of Tokyo
Tokyo, Japan

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300

Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20150128

International Standard Book Number-13: 978-1-4398-5690-1 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

Contents

Foreword
Preface
Author

I   Introduction

1   Introduction to Reinforcement Learning
    1.1  Reinforcement Learning
    1.2  Mathematical Formulation
    1.3  Structure of the Book
         1.3.1  Model-Free Policy Iteration
         1.3.2  Model-Free Policy Search
         1.3.3  Model-Based Reinforcement Learning

II  Model-Free Policy Iteration

2   Policy Iteration with Value Function Approximation
    2.1  Value Functions
         2.1.1  State Value Functions
         2.1.2  State-Action Value Functions
    2.2  Least-Squares Policy Iteration
         2.2.1  Immediate-Reward Regression
         2.2.2  Algorithm
         2.2.3  Regularization
         2.2.4  Model Selection
    2.3  Remarks

3   Basis Design for Value Function Approximation
    3.1  Gaussian Kernels on Graphs
         3.1.1  MDP-Induced Graph
         3.1.2  Ordinary Gaussian Kernels
         3.1.3  Geodesic Gaussian Kernels
         3.1.4  Extension to Continuous State Spaces
    3.2  Illustration
         3.2.1  Setup
         3.2.2  Geodesic Gaussian Kernels
         3.2.3  Ordinary Gaussian Kernels
         3.2.4  Graph-Laplacian Eigenbases
         3.2.5  Diffusion Wavelets
    3.3  Numerical Examples
         3.3.1  Robot-Arm Control
         3.3.2  Robot-Agent Navigation
    3.4  Remarks

4   Sample Reuse in Policy Iteration
    4.1  Formulation
    4.2  Off-Policy Value Function Approximation
         4.2.1  Episodic Importance Weighting
         4.2.2  Per-Decision Importance Weighting
         4.2.3  Adaptive Per-Decision Importance Weighting
         4.2.4  Illustration
    4.3  Automatic Selection of Flattening Parameter
         4.3.1  Importance-Weighted Cross-Validation
         4.3.2  Illustration
    4.4  Sample-Reuse Policy Iteration
         4.4.1  Algorithm
         4.4.2  Illustration
    4.5  Numerical Examples
         4.5.1  Inverted Pendulum
         4.5.2  Mountain Car
    4.6  Remarks

5   Active Learning in Policy Iteration
    5.1  Efficient Exploration with Active Learning
         5.1.1  Problem Setup
         5.1.2  Decomposition of Generalization Error
         5.1.3  Estimation of Generalization Error
         5.1.4  Designing Sampling Policies
         5.1.5  Illustration
    5.2  Active Policy Iteration
         5.2.1  Sample-Reuse Policy Iteration with Active Learning
         5.2.2  Illustration
    5.3  Numerical Examples
    5.4  Remarks

6   Robust Policy Iteration
    6.1  Robustness and Reliability in Policy Iteration
         6.1.1  Robustness
         6.1.2  Reliability
    6.2  Least Absolute Policy Iteration
         6.2.1  Algorithm
         6.2.2  Illustration
         6.2.3  Properties
    6.3  Numerical Examples
    6.4  Possible Extensions
         6.4.1  Huber Loss
         6.4.2  Pinball Loss
         6.4.3  Deadzone-Linear Loss
         6.4.4  Chebyshev Approximation
         6.4.5  Conditional Value-At-Risk
    6.5  Remarks

III Model-Free Policy Search

7   Direct Policy Search by Gradient Ascent
    7.1  Formulation
    7.2  Gradient Approach
         7.2.1  Gradient Ascent
         7.2.2  Baseline Subtraction for Variance Reduction
         7.2.3  Variance Analysis of Gradient Estimators
    7.3  Natural Gradient Approach
         7.3.1  Natural Gradient Ascent
         7.3.2  Illustration
    7.4  Application in Computer Graphics: Artist Agent
         7.4.1  Sumie Painting
         7.4.2  Design of States, Actions, and Immediate Rewards
         7.4.3  Experimental Results
    7.5  Remarks

8   Direct Policy Search by Expectation-Maximization
    8.1  Expectation-Maximization Approach
    8.2  Sample Reuse
         8.2.1  Episodic Importance Weighting
         8.2.2  Per-Decision Importance Weight
         8.2.3  Adaptive Per-Decision Importance Weighting
         8.2.4  Automatic Selection of Flattening Parameter
         8.2.5  Reward-Weighted Regression with Sample Reuse
    8.3  Numerical Examples
    8.4  Remarks

9   Policy-Prior Search
    9.1  Formulation
    9.2  Policy Gradients with Parameter-Based Exploration
         9.2.1  Policy-Prior Gradient Ascent
         9.2.2  Baseline Subtraction for Variance Reduction
         9.2.3  Variance Analysis of Gradient Estimators
         9.2.4  Numerical Examples
    9.3  Sample Reuse in Policy-Prior Search
         9.3.1  Importance Weighting
         9.3.2  Variance Reduction by Baseline Subtraction
         9.3.3  Numerical Examples
    9.4  Remarks

IV  Model-Based Reinforcement Learning

10  Transition Model Estimation
    10.1  Conditional Density Estimation
          10.1.1  Regression-Based Approach
          10.1.2  ε-Neighbor Kernel Density Estimation
          10.1.3  Least-Squares Conditional Density Estimation
    10.2  Model-Based Reinforcement Learning
    10.3  Numerical Examples
          10.3.1  Continuous Chain Walk
          10.3.2  Humanoid Robot Control
    10.4  Remarks

11  Dimensionality Reduction for Transition Model Estimation
    11.1  Sufficient Dimensionality Reduction
    11.2  Squared-Loss Conditional Entropy
          11.2.1  Conditional Independence
          11.2.2  Dimensionality Reduction with SCE
          11.2.3  Relation to Squared-Loss Mutual Information
    11.3  Numerical Examples
          11.3.1  Artificial and Benchmark Datasets
          11.3.2  Humanoid Robot
    11.4  Remarks

References

Index

Foreword

How can agents learn from experience without an omniscient teacher explicitly telling them what to do? Reinforcement learning is the area within machine learning that investigates how an agent can learn an optimal behavior by correlating generic reward signals with its past actions. The discipline draws upon and connects key ideas from behavioral psychology, economics, control theory, operations research, and other disparate fields to model the learning process. In reinforcement learning, the environment is typically modeled as a Markov decision process that provides immediate reward and state information to the agent. However, the agent does not have access to the transition structure of the environment and needs to learn how to choose appropriate actions to maximize its overall reward over time.

This book by Prof. Masashi Sugiyama covers the range of reinforcement learning algorithms from a fresh, modern perspective. With a focus on the statistical properties of estimating parameters for reinforcement learning, the book relates a number of different approaches across the gamut of learning scenarios. The algorithms are divided into model-free approaches that do not explicitly model the dynamics of the environment, and model-based approaches that construct descriptive process models for the environment. Within each of these categories, there are policy iteration algorithms which estimate value functions, and policy search algorithms which directly manipulate policy parameters.

For each of these different reinforcement learning scenarios, the book meticulously lays out the associated optimization problems. A careful analysis is given for each of these cases, with an emphasis on understanding the statistical properties of the resulting estimators and learned parameters. Each chapter contains illustrative examples of applications of these algorithms, with quantitative comparisons between the different techniques. These examples are drawn from a variety of practical problems, including robot motion control and Asian brush painting.

In summary, the book provides a thought-provoking statistical treatment of reinforcement learning algorithms, reflecting the author's work and sustained research in this area. It is a contemporary and welcome addition to the rapidly growing machine learning literature. Both beginner students and experienced researchers will find it to be an important source for understanding the latest reinforcement learning techniques.

Daniel D. Lee
GRASP Laboratory
School of Engineering and Applied Science
University of Pennsylvania, Philadelphia, PA, USA

Preface

In the coming big data era, statistics and machine learning are becoming indispensable tools for data mining. Depending on the type of data analysis, machine learning methods are categorized into three groups:

• Supervised learning: Given input-output paired data, the objective of supervised learning is to analyze the input-output relation behind the data. Typical tasks of supervised learning include regression (predicting the real value), classification (predicting the category), and ranking (predicting the order). Supervised learning is the most common data analysis and has been extensively studied in the statistics community for a long time. A recent trend of supervised learning research in the machine learning community is to utilize side information in addition to the input-output paired data to further improve the prediction accuracy. For example, semi-supervised learning utilizes additional input-only data, transfer learning borrows data from other similar learning tasks, and multi-task learning solves multiple related learning tasks simultaneously.

• Unsupervised learning: Given input-only data, the objective of unsupervised learning is to find something useful in the data. Due to this ambiguous definition, unsupervised learning research tends to be more ad hoc than supervised learning. Nevertheless, unsupervised learning is regarded as one of the most important tools in data mining because of its automatic and inexpensive nature. Typical tasks of unsupervised learning include clustering (grouping the data based on their similarity), density estimation (estimating the probability distribution behind the data), anomaly detection (removing outliers from the data), data visualization (reducing the dimensionality of the data to 1–3 dimensions), and blind source separation (extracting the original source signals from their mixtures). Also, unsupervised learning methods are sometimes used as data pre-processing tools in supervised learning.

• Reinforcement learning: Supervised learning is a sound approach, but collecting input-output paired data is often too expensive. Unsupervised learning is inexpensive to perform, but it tends to be ad hoc. Reinforcement learning is placed between supervised learning and unsupervised learning: no explicit supervision (output data) is provided, but we still want to learn the input-output relation behind the data. Instead of output data, reinforcement learning utilizes rewards, which evaluate the validity of predicted outputs. Giving implicit supervision such as rewards is usually much easier and less costly than giving explicit supervision, and therefore reinforcement learning can be a vital approach in modern data analysis. Various supervised and unsupervised learning techniques are also utilized in the framework of reinforcement learning.

This book is devoted to introducing fundamental concepts and practical algorithms of statistical reinforcement learning from the modern machine learning viewpoint. Various illustrative examples, mainly in robotics, are also provided to help understand the intuition and usefulness of reinforcement learning techniques. Target readers are graduate-level students in computer science and applied statistics as well as researchers and engineers in related fields. Basic knowledge of probability and statistics, linear algebra, and elementary calculus is assumed.

Machine learning is a rapidly developing area of science, and the author hopes that this book helps the reader grasp various exciting topics in reinforcement learning and stimulates readers' interest in machine learning. Please visit our website at: http://www.ms.k.u-tokyo.ac.jp.

Masashi Sugiyama
University of Tokyo, Japan

Author

Masashi Sugiyama was born in Osaka, Japan, in 1974. He received Bachelor, Master, and Doctor of Engineering degrees in Computer Science from Tokyo Institute of Technology, Japan, in 1997, 1999, and 2001, respectively. In 2001, he was appointed Assistant Professor in the same institute, and he was promoted to Associate Professor in 2003. He moved to the University of Tokyo as Professor in 2014.

He received an Alexander von Humboldt Foundation Research Fellowship and researched at Fraunhofer Institute, Berlin, Germany, from 2003 to 2004. In 2006, he received a European Commission Program Erasmus Mundus Scholarship and researched at the University of Edinburgh, Scotland. He received the Faculty Award from IBM in 2007 for his contribution to machine learning under non-stationarity, the Nagao Special Researcher Award from the Information Processing Society of Japan in 2011, and the Young Scientists' Prize from the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology for his contribution to the density-ratio paradigm of machine learning.

His research interests include theories and algorithms of machine learning and data mining, and a wide range of applications such as signal processing, image processing, and robot control. He published Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012) and Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press, 2012).

The author thanks his collaborators, Hirotaka Hachiya, Sethu Vijayakumar, Jan Peters, Jun Morimoto, Zhao Tingting, Ning Xie, Voot Tangkaratt, Tetsuro Morimura, and Norikazu Sugimoto, for exciting and creative discussions. He acknowledges support from MEXT KAKENHI 17700142, 18300057, 20680007, 23120004, 23300069, 25700022, and 26280054, the Okawa Foundation, EU Erasmus Mundus Fellowship, AOARD, SCAT, the JST PRESTO program, and the FIRST program.

Part I

Introduction

Chapter 1

Introduction to Reinforcement Learning

Reinforcement learning is aimed at controlling a computer agent so that a target task is achieved in an unknown environment.

In this chapter, we first give an informal overview of reinforcement learning in Section 1.1. Then we provide a more formal formulation of reinforcement learning in Section 1.2. Finally, the book is summarized in Section 1.3.

1.1 Reinforcement Learning

A schematic of reinforcement learning is given in Figure 1.1. In an unknown environment (e.g., in a maze), a computer agent (e.g., a robot) takes an action (e.g., to walk) based on its own control policy. Then its state is updated (e.g., by moving forward) and evaluation of that action is given as a "reward" (e.g., praise, neutral, or scolding). Through such interaction with the environment, the agent is trained to achieve a certain task (e.g., getting out of the maze) without explicit guidance. A crucial advantage of reinforcement learning is its non-greedy nature. That is, the agent is trained not to improve performance in the short term (e.g., greedily approaching an exit of the maze), but to optimize the long-term achievement (e.g., successfully getting out of the maze).

FIGURE 1.1: Reinforcement learning. (The agent takes an action in the environment, which returns the next state and a reward.)

A reinforcement learning problem contains various technical components such as states, actions, transitions, rewards, policies, and values. Before going into mathematical details (which will be provided in Section 1.2), we intuitively explain these concepts through illustrative reinforcement learning problems here.

Let us consider a maze problem (Figure 1.2), where a robot agent is located in a maze and we want to guide it to the goal without explicit supervision about which direction to go. States are positions in the maze which the robot agent can visit. In the example illustrated in Figure 1.3, there are 21 states in the maze. Actions are possible directions along which the robot agent can move. In the example illustrated in Figure 1.4, there are 4 actions, which correspond to movements toward the north, south, east, and west directions. States and actions are fundamental elements that define a reinforcement learning problem.

Transitions specify how states are connected to each other through actions (Figure 1.5). Thus, knowing the transitions intuitively means knowing the map of the maze. Rewards specify the incomes/costs that the robot agent receives when making a transition from one state to another by a certain action. In the case of the maze example, the robot agent receives a positive reward when it reaches the goal. More specifically, a positive reward is provided when making a transition from state 12 to state 17 by action "east" or from state 18 to state 17 by action "north" (Figure 1.6). Thus, knowing the rewards intuitively means knowing the location of the goal state. To emphasize the fact that a reward is given to the robot agent right after taking an action and making a transition to the next state, it is also referred to as an immediate reward.

FIGURE 1.2: A maze problem. We want to guide the robot agent to the goal.

FIGURE 1.3: States are visitable positions in the maze (the 21 positions are numbered 1–21).

FIGURE 1.4: Actions are possible movements of the robot agent (north, south, east, and west).

FIGURE 1.5: Transitions specify connections between states via actions. Thus, knowing the transitions means knowing the map of the maze.

FIGURE 1.6: A positive reward is given when the robot agent reaches the goal. Thus, the reward specifies the goal location.

Under the above setup, the goal of reinforcement learning is to find the policy for controlling the robot agent that allows it to receive the maximum amount of rewards in the long run. Here, a policy specifies an action the robot agent takes at each state (Figure 1.7). Through a policy, a series of states and actions that the robot agent takes from a start state to an end state is specified. Such a series is called a trajectory (see Figure 1.7 again). The sum of immediate rewards along a trajectory is called the return. In practice, rewards that can be obtained in the distant future are often discounted because receiving rewards earlier is regarded as more preferable. In the maze task, such a discounting strategy urges the robot agent to reach the goal as quickly as possible.

FIGURE 1.7: A policy specifies an action the robot agent takes at each state. Thus, a policy also specifies a trajectory, which is a series of states and actions that the robot agent takes from a start state to an end state.

To find the optimal policy efficiently, it is useful to view the return as a function of the initial state. This is called the (state-)value. The values can be efficiently obtained via dynamic programming, which is a general method for solving a complex optimization problem by breaking it down into simpler subproblems recursively. With the hope that many subproblems are actually the same, dynamic programming solves such overlapped subproblems only once and reuses the solutions to reduce the computation costs.

In the maze problem, the value of a state can be computed from the values of neighboring states. For example, let us compute the value of state 7 (see Figure 1.5 again). From state 7, the robot agent can reach state 2, state 6, and state 8 by a single step. If the robot agent knows the values of these neighboring states, the best action the robot agent should take is to visit the neighboring state with the largest value, because this allows the robot agent to earn the largest amount of rewards in the long run. However, the values of neighboring states are unknown in practice and thus they should also be computed.

Now, we need to solve 3 subproblems of computing the values of state 2, state 6, and state 8. Then, in the same way, these subproblems are further decomposed as follows:

• The problem of computing the value of state 2 is decomposed into 3 subproblems of computing the values of state 1, state 3, and state 7.

• The problem of computing the value of state 6 is decomposed into 2 subproblems of computing the values of state 1 and state 7.

• The problem of computing the value of state 8 is decomposed into 3 subproblems of computing the values of state 3, state 7, and state 9.

Thus, by removing overlaps, the original problem of computing the value of state 7 has been decomposed into 6 unique subproblems: computing the values of state 1, state 2, state 3, state 6, state 8, and state 9.

FIGURE 1.8: Values of each state when reward +1 is given at the goal state and the reward is discounted at the rate of 0.9 according to the number of steps.

If we further continue this problem decomposition, we encounter the problem of computing the value of state 17, where the robot agent can receive reward +1. Then the values of state 12 and state 18 can be explicitly computed. Indeed, if a discounting factor (a multiplicative penalty for delayed rewards) is 0.9, the values of state 12 and state 18 are (0.9)^1 = 0.9. Then we can further know that the values of state 13 and state 19 are (0.9)^2 = 0.81. By repeating this procedure, we can compute the values of all states (as illustrated in Figure 1.8). Based on these values, we can know the optimal action the robot agent should take, i.e., an action that leads the robot agent to the neighboring state with the largest value.
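The value computation described above can be sketched in a few lines of Python. The snippet below is an assumed illustration only: the adjacency table is a hypothetical stand-in covering just the goal-side corridor of the maze, not the full layout of Figure 1.2. It iterates the rule "the value of a state is the best neighboring value discounted by 0.9, with value 1 at the goal," which reproduces the values 0.9 and 0.81 quoted above.

GAMMA = 0.9
GOAL = 17

# Hypothetical partial adjacency (state -> states reachable in one step);
# only the corridor 19 -> 18 -> 17 and 13 -> 12 -> 17 near the goal is modeled.
maze = {12: [17], 13: [12, 18], 17: [], 18: [13, 17], 19: [18]}

V = {s: 0.0 for s in maze}
V[GOAL] = 1.0                      # reward +1 is obtained at the goal state
for _ in range(100):               # repeat until the values stop changing
    for s in maze:
        if s != GOAL:
            # value = discounted value of the best reachable neighbor
            V[s] = GAMMA * max(V[t] for t in maze[s])

print(V)   # e.g., V[12] = V[18] = 0.9 and V[13] = V[19] = 0.81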

Note that, in real-world reinforcement learning tasks, transitions are often not deterministic but stochastic, because of some external disturbance; in the case of the above maze example, the floor may be slippery and thus the robot agent cannot move as perfectly as it desires. Also, stochastic policies, in which the mapping from a state to an action is not deterministic, are often employed in many reinforcement learning formulations. In these cases, the formulation becomes slightly more complicated, but essentially the same idea can still be used for solving the problem.

To further highlight the notable advantage of reinforcement learning that not the immediate rewards but the long-term accumulation of rewards is maximized, let us consider a mountain-car problem (Figure 1.9). There are two mountains, and a car is located in a valley between the mountains. The goal is to guide the car to the top of the right-hand hill. However, the engine of the car is not powerful enough to directly run up the right-hand hill and reach the goal. The optimal policy in this problem is to first climb the left-hand hill and then go down the slope to the right with full acceleration to get to the goal (Figure 1.10).

Suppose we define the immediate reward such that moving the car to the right gives a positive reward +1 and moving the car to the left gives a negative reward −1. Then, a greedy solution that maximizes the immediate reward moves the car to the right, which does not allow the car to get to the goal due to lack of engine power. On the other hand, reinforcement learning seeks a solution that maximizes the return, i.e., the discounted sum of immediate rewards that the agent can collect over the entire trajectory. This means that the reinforcement learning solution will first move the car to the left even though negative rewards are given for a while, to receive more positive rewards in the future. Thus, the notion of "prior investment" can be naturally incorporated in the reinforcement learning framework.

1.2 Mathematical Formulation

In this section, the reinforcement learning problem is mathematically formulated as the problem of controlling a computer agent under a Markov decision process.

We consider the problem of controlling a computer agent under a discrete-time Markov decision process (MDP). That is, at each discrete time step t, the agent observes a state s_t ∈ S, selects an action a_t ∈ A, makes a transition to s_{t+1} ∈ S, and receives an immediate reward

  r_t = r(s_t, a_t, s_{t+1}) ∈ R.

FIGURE 1.9: A mountain-car problem. We want to guide the car to the goal. However, the engine of the car is not powerful enough to directly run up the right-hand hill.

FIGURE 1.10: The optimal policy to reach the goal is to first climb the left-hand hill and then head for the right-hand hill with full acceleration.

S and A are called the state space and the action space, respectively. r(s, a, s′) is called the immediate reward function.

The initial position of the agent, s_1, is drawn from the initial probability distribution. If the state space S is discrete, the initial probability distribution is specified by the probability mass function P(s) such that

  0 ≤ P(s) ≤ 1, ∀s ∈ S,   Σ_{s∈S} P(s) = 1.

If the state space S is continuous, the initial probability distribution is specified by the probability density function p(s) such that

  p(s) ≥ 0, ∀s ∈ S,   ∫_{s∈S} p(s) ds = 1.

Because the probability mass function P(s) can be expressed as a probability density function p(s) by using the Dirac delta function¹ δ(s) as

  p(s) = Σ_{s′∈S} δ(s′ − s) P(s′),

we focus only on the continuous state space below.

The dynamics of the environment, which represent the transition probability from state s to state s′ when action a is taken, are characterized by the transition probability distribution with conditional probability density p(s′|s, a):

  p(s′|s, a) ≥ 0, ∀s, s′ ∈ S, ∀a ∈ A,   ∫_{s′∈S} p(s′|s, a) ds′ = 1, ∀s ∈ S, ∀a ∈ A.

The agent's decision is determined by a policy π. When we consider a deterministic policy, where the action to take at each state is uniquely determined, we regard the policy as a function of states:

  π(s) ∈ A, ∀s ∈ S.

Action a can be either discrete or continuous. On the other hand, when developing more sophisticated reinforcement learning algorithms, it is often more convenient to consider a stochastic policy, where an action to take at a state is probabilistically determined. Mathematically, a stochastic policy is a conditional probability density of taking action a at state s:

  π(a|s) ≥ 0, ∀s ∈ S, ∀a ∈ A,   ∫_{a∈A} π(a|s) da = 1, ∀s ∈ S.

By introducing stochasticity in action selection, we can more actively explore the entire state space. Note that when action a is discrete, the stochastic policy is expressed using Dirac's delta function, as in the case of the state densities.

A sequence of states and actions obtained by the procedure described in Figure 1.11 is called a trajectory.

¹ The Dirac delta function δ(·) allows us to obtain the value of a function f at a point τ via the convolution with f:

  ∫_{−∞}^{∞} f(s) δ(s − τ) ds = f(τ).

Dirac's delta function δ(·) can be expressed as the Gaussian density with standard deviation σ → 0:

  δ(a) = lim_{σ→0} (1/√(2πσ²)) exp(−a²/(2σ²)).

1. The initial state s_1 is chosen following the initial probability p(s).
2. For t = 1, ..., T,
   (a) The action a_t is chosen following the policy π(a_t|s_t).
   (b) The next state s_{t+1} is determined according to the transition probability p(s_{t+1}|s_t, a_t).

FIGURE 1.11: Generation of a trajectory sample.
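The procedure of Figure 1.11 translates directly into code. The following Python sketch is an assumed illustration on a toy MDP with made-up transition probabilities (not an example from the book); it draws one trajectory sample h = [s_1, a_1, ..., s_T, a_T, s_{T+1}].

import numpy as np

# Minimal sketch of the trajectory-generation procedure of Figure 1.11,
# on a toy 3-state / 2-action MDP (the numbers below are made up).
rng = np.random.default_rng(0)

n_states, n_actions, T = 3, 2, 10
p_init = np.array([1.0, 0.0, 0.0])                 # initial probability p(s)
P = rng.dirichlet(np.ones(n_states),               # transition p(s'|s,a)
                  size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # stochastic policy pi(a|s)

s = rng.choice(n_states, p=p_init)                 # step 1: draw s_1
trajectory = []
for t in range(T):                                 # step 2: for t = 1,...,T
    a = rng.choice(n_actions, p=pi[s])             # (a) draw a_t ~ pi(.|s_t)
    s_next = rng.choice(n_states, p=P[s, a])       # (b) draw s_{t+1} ~ p(.|s_t,a_t)
    trajectory.append((s, a))
    s = s_next
trajectory.append((s, None))                       # trailing state s_{T+1}
print(trajectory)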

When the number of steps, T, is finite or infinite, the situation is called the finite horizon or infinite horizon, respectively. Below, we focus on the finite-horizon case because the trajectory length is always finite in practice. We denote a trajectory by h (which stands for a "history"):

  h = [s_1, a_1, ..., s_T, a_T, s_{T+1}].

The discounted sum of immediate rewards along the trajectory h is called the return:

  R(h) = Σ_{t=1}^{T} γ^{t−1} r(s_t, a_t, s_{t+1}),

where γ ∈ [0, 1) is called the discount factor for future rewards. The goal of reinforcement learning is to learn the optimal policy π* that maximizes the expected return:

  π* = argmax_π E_{pπ(h)}[R(h)],

where E_{pπ(h)} denotes the expectation over trajectory h drawn from pπ(h), and pπ(h) denotes the probability density of observing trajectory h under policy π:

  pπ(h) = p(s_1) Π_{t=1}^{T} p(s_{t+1}|s_t, a_t) π(a_t|s_t).

"argmax" gives the maximizer of a function (Figure 1.12).
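To make these quantities concrete, the following Python sketch (assumed, with a made-up reward sequence standing in for one rollout) computes the return R(h) of an episode and approximates the expected return E_{pπ(h)}[R(h)] by averaging over sampled episodes.

import numpy as np

# Minimal sketch (not from the book): compute the return R(h) of one episode
# and a Monte Carlo estimate of the expected return E[R(h)] for a fixed policy.
rng = np.random.default_rng(1)
GAMMA = 0.95
T = 20

def rollout_rewards():
    # Hypothetical stand-in for one trajectory's immediate rewards
    # r(s_t, a_t, s_{t+1}); a real rollout would follow Figure 1.11.
    return rng.normal(loc=0.1, scale=1.0, size=T)

def discounted_return(rewards, gamma=GAMMA):
    # R(h) = sum_{t=1}^{T} gamma^(t-1) * r_t   (t starts at 1, so gamma^0 first)
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Expected return under the (implicit) policy, approximated by averaging
expected_return = np.mean([discounted_return(rollout_rewards()) for _ in range(1000)])
print(expected_return)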

FIGURE 1.12: "argmax" gives the maximizer of a function, while "max" gives the maximum value of a function.

For policy learning, various methods have been developed so far. These methods can be classified into model-based reinforcement learning and model-free reinforcement learning. The term "model" indicates a model of the transition probability p(s′|s, a). In the model-based reinforcement learning approach, the transition probability is learned in advance and the learned transition model is explicitly used for policy learning. On the other hand, in the model-free reinforcement learning approach, policies are learned without explicitly estimating the transition probability. If strong prior knowledge of the transition model is available, the model-based approach would be more favorable. On the other hand, learning the transition model without prior knowledge itself is a hard statistical estimation problem. Thus, if good prior knowledge of the transition model is not available, the model-free approach would be more promising.

1.3 Structure of the Book

In this section, we explain the structure of this book, which covers major reinforcement learning approaches.

1.3.1 Model-Free Policy Iteration

Policy iteration is a popular and well-studied approach to reinforcement learning. The key idea of policy iteration is to determine policies based on the value function.

Let us first introduce the state-action value function Qπ(s, a) ∈ R for policy π, which is defined as the expected return the agent will receive when taking action a at state s and following policy π thereafter:

  Qπ(s, a) = E_{pπ(h)}[R(h) | s_1 = s, a_1 = a],

where "|s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of R(h) given s_1 = s and a_1 = a.

Let Q*(s, a) be the optimal state-action value at state s for action a, defined as

  Q*(s, a) = max_π Qπ(s, a).

Based on the optimal state-action value function, the optimal action the agent should take at state s is deterministically given as the maximizer of Q*(s, a) with respect to a. Thus, the optimal policy π*(a|s) is given by

  π*(a|s) = δ(a − argmax_{a′} Q*(s, a′)),

where δ(·) denotes Dirac's delta function.

Because the optimal state-action value Q* is unknown in practice, the policy iteration algorithm alternately evaluates the value Qπ for the current policy π and updates the policy π based on the current value Qπ (Figure 1.13):

1. Initialize policy π(a|s).
2. Repeat the following two steps until the policy π(a|s) converges.
   (a) Policy evaluation: Compute the state-action value function Qπ(s, a) for the current policy π(a|s).
   (b) Policy improvement: Update the policy as
       π(a|s) ←− δ(a − argmax_{a′} Qπ(s, a′)).

FIGURE 1.13: Algorithm of policy iteration.

The performance of the above policy iteration algorithm depends on the quality of policy evaluation; i.e., how to learn the state-action value function from data is the key issue. Value function approximation corresponds to a regression problem in statistics and machine learning. Thus, various statistical machine learning techniques can be utilized for better value function approximation. Part II of this book addresses this issue, including least-squares estimation and model selection (Chapter 2), basis function design (Chapter 3), efficient sample reuse (Chapter 4), active learning (Chapter 5), and robust learning (Chapter 6).
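As a concrete, tabular illustration of the loop in Figure 1.13, the following Python sketch alternates policy evaluation and greedy policy improvement on a toy MDP with made-up dynamics. For simplicity, Qπ is evaluated exactly from the known model here, whereas the chapters of Part II estimate it from data.

import numpy as np

# Tabular illustration of the policy-iteration loop of Figure 1.13 on a toy MDP
# with made-up dynamics. For simplicity Q^pi is evaluated exactly from the known
# model; Part II of the book instead estimates Q^pi from data (model-free).
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p(s'|s,a)
R = rng.normal(size=(nS, nA))                   # expected immediate reward r(s,a)

policy = np.zeros(nS, dtype=int)                # deterministic policy: s -> a
for _ in range(50):
    # Policy evaluation: Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) Q(s', pi(s'))
    Q = np.zeros((nS, nA))
    for _ in range(200):
        Q = R + gamma * P @ Q[np.arange(nS), policy]
    # Policy improvement: greedy update pi(s) <- argmax_a Q(s,a)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print(policy)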

1.3.2 Model-Free Policy Search

One of the potential weaknesses of policy iteration is that policies are learned via value functions. Thus, improving the quality of value function approximation does not necessarily contribute to improving the quality of resulting policies. Furthermore, a small change in value functions can cause a big difference in policies, which is problematic in, e.g., robot control because such instability can damage the robot's physical system. Another weakness of policy iteration is that policy improvement, i.e., finding the maximizer of Qπ(s, a) with respect to a, is computationally expensive or difficult when the action space A is continuous.

Policy search, which directly learns policy functions without estimating value functions, can overcome the above limitations. The basic idea of policy search is to find the policy that maximizes the expected return:

  π* = argmax_π E_{pπ(h)}[R(h)].

In policy search, how to find a good policy function in a vast function space is the key issue to be addressed. Part III of this book focuses on policy search and introduces gradient-based methods and the expectation-maximization method in Chapter 7 and Chapter 8, respectively. However, a potential weakness of these direct policy search methods is their instability due to the stochasticity of policies. To overcome the instability problem, an alternative approach called policy-prior search, which learns the policy-prior distribution for deterministic policies, is introduced in Chapter 9. Efficient sample reuse in policy-prior search is also discussed there.
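As a preview of the gradient approach of Chapter 7, the following Python sketch (an assumed illustration, not the book's algorithm) maximizes the expected return by gradient ascent, using the likelihood-ratio (score-function) gradient estimator with a one-dimensional Gaussian policy and a made-up reward function.

import numpy as np

# Minimal sketch (assumed): gradient-based policy search. The gradient of
# E[R] with respect to the policy parameter theta is estimated with the
# score-function trick on a one-step Gaussian-policy problem.
rng = np.random.default_rng(0)
theta, sigma, lr = 0.0, 0.5, 0.05          # policy: a ~ N(theta, sigma^2)

def reward(a):
    return -(a - 2.0) ** 2                 # made-up reward, maximized at a = 2

for _ in range(500):
    actions = theta + sigma * rng.standard_normal(100)
    returns = reward(actions)
    # score function: d/dtheta log N(a; theta, sigma^2) = (a - theta) / sigma^2
    grad = np.mean(returns * (actions - theta) / sigma**2)
    theta += lr * grad                     # gradient ascent on E[R]
print(theta)                               # approaches 2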

1.3.3 Model-Based Reinforcement Learning

In the above model-free approaches, policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment, p(s′|s, a)). On the other hand, the model-based approach explicitly learns the environment in advance and uses the learned environment model for policy learning.

No additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is particularly useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging. Part IV of this book focuses on model-based reinforcement learning. In Chapter 10, a non-parametric transition model estimator that possesses the optimal convergence rate with high computational efficiency is introduced. However, even with the optimal convergence rate, estimating the transition model in high-dimensional state and action spaces is still challenging. In Chapter 11, a dimensionality reduction method that can be efficiently embedded into the transition model estimation procedure is introduced and its usefulness is demonstrated through experiments.

Part II

Model-Free Policy Iteration

In Part II, we introduce a reinforcement learning approach based on value functions called policy iteration.

The key issue in the policy iteration framework is how to accurately approximate the value function from a small number of data samples. In Chapter 2, a fundamental framework of value function approximation based on least squares is explained. In this least-squares formulation, how to design good basis functions is critical for better value function approximation. A practical basis design method based on manifold-based smoothing (Chapelle et al., 2006) is explained in Chapter 3.

In real-world reinforcement learning tasks, gathering data is often costly. In Chapter 4, we describe a method for efficiently reusing previously collected samples in the framework of covariate shift adaptation (Sugiyama & Kawanabe, 2012). In Chapter 5, we apply a statistical active learning technique (Sugiyama & Kawanabe, 2012) to optimizing data collection strategies for reducing the sampling cost.

Finally, in Chapter 6, an outlier-robust extension of the least-squares method based on robust regression (Huber, 1981) is introduced. Such a robust method is highly useful in handling noisy real-world data.

Chapter 2

Policy Iteration with Value Function Approximation

In this chapter, we introduce the framework of least-squares policy iteration. In Section 2.1, we first explain the framework of policy iteration, which iteratively executes the policy evaluation and policy improvement steps for finding better policies. Then, in Section 2.2, we show how value function approximation in the policy evaluation step can be formulated as a regression problem and introduce a least-squares algorithm called least-squares policy iteration (Lagoudakis & Parr, 2003). Finally, this chapter is concluded in Section 2.3.

2.1 Value Functions

A traditional way to learn the optimal policy is based on the value function. In this section, we introduce two types of value functions, the state value function and the state-action value function, and explain how they can be used for finding better policies.

2.1.1 State Value Functions

The state value function Vπ(s) ∈ R for policy π measures the "value" of state s, which is defined as the expected return the agent will receive when following policy π from state s:

  Vπ(s) = E_{pπ(h)}[R(h) | s_1 = s],

where "|s_1 = s" means that the initial state s_1 is fixed at s_1 = s. That is, the right-hand side of the above equation denotes the conditional expectation of return R(h) given s_1 = s.

By recursion, Vπ(s) can be expressed as

  Vπ(s) = E_{p(s′|s,a) π(a|s)}[r(s, a, s′) + γ Vπ(s′)],

where E_{p(s′|s,a) π(a|s)} denotes the conditional expectation over a and s′ drawn from p(s′|s, a) π(a|s) given s. This recursive expression is called the Bellman equation for state values. Vπ(s) may be obtained by repeating the following update from some initial estimate:

  Vπ(s) ←− E_{p(s′|s,a) π(a|s)}[r(s, a, s′) + γ Vπ(s′)].

The optimal state value at state s, V*(s), is defined as the maximizer of the state value Vπ(s) with respect to policy π:

  V*(s) = max_π Vπ(s).
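The Bellman update above can be implemented directly when the state and action spaces are small and the model is known. The following Python sketch (a toy tabular MDP with made-up numbers, for illustration only) repeats the update until V converges to Vπ.

import numpy as np

# Minimal sketch (toy tabular MDP, made-up numbers): iterative policy evaluation,
# i.e., repeating the Bellman update
#   V(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma * V(s')]
# from an initial estimate until it converges to V^pi.
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # p(s'|s,a)
r = rng.normal(size=(nS, nA, nS))                 # r(s,a,s')
pi = np.full((nS, nA), 1.0 / nA)                  # stochastic policy pi(a|s)

V = np.zeros(nS)
for _ in range(1000):
    # expectation over a ~ pi(.|s) and s' ~ p(.|s,a)
    V_new = np.einsum('sa,sat,sat->s', pi, P, r + gamma * V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)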

Based on the optimal state value V*(s), the optimal policy π*, which is deterministic, can be obtained as

  π*(a|s) = δ(a − a*(s)),

where δ(·) denotes Dirac's delta function and

  a*(s) = argmax_{a∈A} E_{p(s′|s,a)}[r(s, a, s′) + γ V*(s′)].

E_{p(s′|s,a)} denotes the conditional expectation over s′ drawn from p(s′|s, a) given s and a. This algorithm, first computing the optimal value function and then obtaining the optimal policy based on the optimal value function, is called value iteration.

A possible variation is to iteratively perform policy evaluation and improvement as

  Policy evaluation:   Vπ(s) ←− E_{p(s′|s,a) π(a|s)}[r(s, a, s′) + γ Vπ(s′)].
  Policy improvement:  π(a|s) ←− δ(a − aπ(s)),

where

  aπ(s) = argmax_{a∈A} E_{p(s′|s,a)}[r(s, a, s′) + γ Vπ(s′)].

These two steps may be iterated either for all states at once or in a state-by-state manner. This iterative algorithm is called policy iteration (based on state value functions).

2.1.2 State-Action Value Functions

In the above policy improvement step, the action to take is optimized based on the state value function Vπ(s). A more direct way to handle this action optimization is to consider the state-action value function Qπ(s, a) for policy π:

  Qπ(s, a) = E_{pπ(h)}[R(h) | s_1 = s, a_1 = a],

where "|s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of return R(h) given s_1 = s and a_1 = a.

Let r(s, a) be the expected immediate reward when action a is taken at state s:

  r(s, a) = E_{p(s′|s,a)}[r(s, a, s′)].

Then, in the same way as Vπ(s), Qπ(s, a) can be expressed by recursion as

  Qπ(s, a) = r(s, a) + γ E_{π(a′|s′) p(s′|s,a)}[Qπ(s′, a′)],     (2.1)

where E_{π(a′|s′) p(s′|s,a)} denotes the conditional expectation over s′ and a′ drawn from π(a′|s′) p(s′|s, a) given s and a. This recursive expression is called the Bellman equation for state-action values.

Based on the Bellman equation, the optimal policy may be obtained by iterating the following two steps:

  Policy evaluation:   Qπ(s, a) ←− r(s, a) + γ E_{π(a′|s′) p(s′|s,a)}[Qπ(s′, a′)].
  Policy improvement:  π(a|s) ←− δ(a − argmax_{a′∈A} Qπ(s, a′)).

In practice, it is sometimes preferable to use an explorative policy. For example, Gibbs policy improvement is given by

  π(a|s) ←− exp(Qπ(s, a)/τ) / ∫_A exp(Qπ(s, a′)/τ) da′,

where τ > 0 determines the degree of exploration. When the action space A is discrete, ε-greedy policy improvement is also used:

  π(a|s) ←− 1 − ε + ε/|A|   if a = argmax_{a′∈A} Qπ(s, a′),
             ε/|A|           otherwise,

where ε ∈ (0, 1] determines the randomness of the new policy.

The above policy improvement step based on Qπ(s, a) is essentially the same as the one based on Vπ(s) explained in Section 2.1.1. However, the policy improvement step based on Qπ(s, a) does not contain the expectation operator and thus policy improvement can be more directly carried out. For this reason, we focus on the above formulation, called policy iteration based on state-action value functions.
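For a discrete action set, both explorative improvement rules are short computations over a table of value estimates Q(s, a). The following Python sketch (an assumed setup with a made-up value table) computes the Gibbs and ε-greedy policies.

import numpy as np

# Minimal sketch (assumed discrete action set): the two explorative policy-
# improvement rules above, computed from a table Q[s, a] of value estimates.
def gibbs_policy(Q, tau=1.0):
    # pi(a|s) proportional to exp(Q(s,a)/tau); tau > 0 controls exploration
    logits = Q / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def epsilon_greedy_policy(Q, eps=0.1):
    # probability eps/|A| everywhere, plus 1 - eps extra on the greedy action
    nS, nA = Q.shape
    p = np.full((nS, nA), eps / nA)
    p[np.arange(nS), Q.argmax(axis=1)] += 1.0 - eps
    return p

Q = np.array([[1.0, 0.0], [0.2, 0.4]])               # made-up value table
print(gibbs_policy(Q, tau=0.5))
print(epsilon_greedy_policy(Q, eps=0.2))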

2.2 Least-Squares Policy Iteration

As explained in the previous section, the optimal policy function may be learned via the state-action value function Qπ(s, a). However, learning the state-action value function from data is a challenging task for continuous state s and action a.

Learning the state-action value function from data can actually be regarded as a regression problem in statistics and machine learning. In this section, we explain how the least-squares regression technique can be employed in value function approximation, which is called least-squares policy iteration (Lagoudakis & Parr, 2003).

2.2.1 Immediate-Reward Regression

Let us approximate the state-action value function Qπ(s, a) by the following linear-in-parameter model:

  Σ_{b=1}^{B} θ_b φ_b(s, a),

where {φ_b(s, a)}_{b=1}^{B} are basis functions, B denotes the number of basis functions, and {θ_b}_{b=1}^{B} are parameters. Specific designs of basis functions will be discussed in Chapter 3. Below, we use the following vector representation for compactly expressing the parameters and basis functions:

  θ⊤φ(s, a),

where ⊤ denotes the transpose and

  θ = (θ_1, ..., θ_B)⊤ ∈ R^B,
  φ(s, a) = (φ_1(s, a), ..., φ_B(s, a))⊤ ∈ R^B.

From the Bellman equation for state-action values (2.1), we can express the expected immediate reward r(s, a) as

  r(s, a) = Qπ(s, a) − γ E_{π(a′|s′) p(s′|s,a)}[Qπ(s′, a′)].

By substituting the value function model θ⊤φ(s, a) in the above equation, the expected immediate reward r(s, a) may be approximated as

  r(s, a) ≈ θ⊤φ(s, a) − γ E_{π(a′|s′) p(s′|s,a)}[θ⊤φ(s′, a′)].

Now let us define a new basis function vector ψ(s, a):

  ψ(s, a) = φ(s, a) − γ E_{π(a′|s′) p(s′|s,a)}[φ(s′, a′)].

Then the expected immediate reward r(s, a) may be approximated as

  r(s, a) ≈ θ⊤ψ(s, a).

FIGURE 2.1: Linear approximation of the state-action value function Qπ(s, a) as linear regression of the expected immediate reward r(s, a). (The model θ⊤ψ(s, a) is fitted to the observed immediate rewards r(s_t, a_t, s_{t+1}) at the sample points (s_t, a_t).)

As explained above, the linear approximation problem of the state-action value function Qπ(s, a) can be reformulated as the linear regression problem of the expected immediate reward r(s, a) (see Figure 2.1). The key trick was to push the recursive nature of the state-action value function Qπ(s, a) into the composite basis function ψ(s, a).
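The composite basis function ψ(s, a) is easy to form when the action set is discrete, since the expectation over a′ is a finite weighted sum. The following Python sketch (with a hypothetical feature map φ and made-up numbers) builds an empirical estimate of ψ at a single observed transition (s, a, s′); how the expectation over s′ is approximated from data is formalized in Section 2.2.2.

import numpy as np

# Minimal sketch (assumed setup): build the composite features
#   psi(s,a) = phi(s,a) - gamma * E_{a'~pi(.|s')}[phi(s',a')]
# from one observed transition (s, a, s'), for a discrete action set where the
# expectation over a' is a weighted sum. phi below is a made-up feature map.
gamma = 0.9
B = 6  # number of basis functions (3 state features + 3 action indicators)

def phi(s, a):
    # hypothetical basis functions phi_1..phi_B for a scalar state and a
    # discrete action a in {0, 1, 2}: polynomial state features + one-hot action
    v = np.zeros(B)
    v[:3] = [1.0, s, s ** 2]
    v[3 + a] = 1.0
    return v

def psi_hat(s, a, s_next, pi_next):
    # empirical psi: the expectation over s' is replaced by the single observed
    # destination s_next; pi_next[a'] = pi(a'|s_next) over the discrete actions
    expected_next_phi = sum(p * phi(s_next, a2) for a2, p in enumerate(pi_next))
    return phi(s, a) - gamma * expected_next_phi

print(psi_hat(s=0.5, a=1, s_next=0.7, pi_next=[0.2, 0.5, 0.3]))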

2.2.2 Algorithm

Now, we explain how the parameters θ are learned in the least-squares framework. That is, the model θ⊤ψ(s, a) is fitted to the expected immediate reward r(s, a) under the squared loss:

  min_θ E_{pπ(h)}[ (1/T) Σ_{t=1}^{T} (θ⊤ψ(s_t, a_t) − r(s_t, a_t))² ],

where h denotes the history sample following the current policy π:

  h = [s_1, a_1, ..., s_T, a_T, s_{T+1}].

For history samples H = {h_1, ..., h_N}, where

  h_n = [s_{1,n}, a_{1,n}, ..., s_{T,n}, a_{T,n}, s_{T+1,n}],

an empirical version of the above least-squares problem is given as

  min_θ (1/(NT)) Σ_{n=1}^{N} Σ_{t=1}^{T} (θ⊤ψ̂(s_{t,n}, a_{t,n}; H) − r(s_{t,n}, a_{t,n}, s_{t+1,n}))².

Here, ψ̂(s, a; H) is an empirical estimator of ψ(s, a) given by

  ψ̂(s, a; H) = φ(s, a) − (1/|H_{(s,a)}|) Σ_{s′ ∈ H_{(s,a)}} E_{π(a′|s′)}[γ φ(s′, a′)],

where H_{(s,a)} denotes a subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and Σ_{s′ ∈ H_{(s,a)}} denotes the summation over all destination states s′ in the set H_{(s,a)}.

Let Ψ̂ be the NT × B matrix and r be the NT-dimensional vector defined as

  Ψ̂_{N(t−1)+n, b} = ψ̂_b(s_{t,n}, a_{t,n}),
  r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Ψ̂ is sometimes called the design matrix. Then the above least-squares problem can be compactly expressed as

  min_θ (1/(NT)) ‖Ψ̂θ − r‖²,

where ‖·‖ denotes the ℓ2-norm. Because this is a quadratic function with respect to θ, its global minimizer θ̂ can be analytically obtained by setting its derivative to zero as

  θ̂ = (Ψ̂⊤Ψ̂)⁻¹ Ψ̂⊤r.     (2.2)

If B is too large and computing the inverse of Ψ̂⊤Ψ̂ is intractable, we may use a gradient descent method. That is, starting from some initial estimate θ, the solution is updated until convergence, as follows (see Figure 2.2):

  θ ←− θ − ε (Ψ̂⊤Ψ̂θ − Ψ̂⊤r),

PolicyIterationwithValueFunctionApproximation

23

⊤⊤whereb

Ψb

Ψθ−b

Ψrcorrespondstothegradientoftheobjectivefunction

kb

Ψθ−rk2andεisasmallpositiveconstantrepresentingthestepsizeof

gradientdescent.

A notable variation of the above least-squares method is to compute the solution by

θ̃ = (Φ⊤Ψ̂)^{−1} Φ⊤r,

where Φ is the NT × B matrix defined as

Φ_{N(t−1)+n, b} = φ_b(s_{t,n}, a_{t,n}).

This variation is called the least-squares fixed-point approximation (Lagoudakis & Parr, 2003) and is shown to handle the estimation error included in the basis function ψ̂ in a sound way (Bradtke & Barto, 1996). However, for simplicity, we focus on Eq. (2.2) below.
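For comparison, a sketch of the least-squares fixed-point variant, assuming `Phi` and `Psi_hat` are the matrices defined above:

```python
import numpy as np

def fixed_point_solution(Phi, Psi_hat, r):
    """Least-squares fixed-point approximation: (Phi^T Psi_hat)^{-1} Phi^T r."""
    return np.linalg.solve(Phi.T @ Psi_hat, Phi.T @ r)
```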

2.2.3 Regularization

Regression techniques in machine learning are generally formulated as minimization of a goodness-of-fit term and a regularization term. In the above least-squares framework, the goodness-of-fit of our model is measured by the squared loss. In the following chapters, we discuss how other loss functions can be utilized in the policy iteration framework, e.g., sample reuse in Chapter 4 and outlier-robust learning in Chapter 6. Here we focus on the regularization term and introduce practically useful regularization techniques.

The ℓ2-regularizer is the most standard regularizer in statistics and machine learning; the resulting method is also called ridge regression (Hoerl & Kennard, 1970):

min_θ (1/(NT)) ‖Ψ̂θ − r‖² + λ‖θ‖²,

where λ ≥ 0 is the regularization parameter. The role of the ℓ2-regularizer ‖θ‖² is to penalize the growth of the parameter vector θ to avoid overfitting to noisy samples. A practical advantage of the use of the ℓ2-regularizer is that the minimizer θ̂ can still be obtained analytically:

θ̂ = (Ψ̂⊤Ψ̂ + λI_B)^{−1} Ψ̂⊤r,

where I_B denotes the B × B identity matrix. Because of the addition of λI_B, the matrix to be inverted above has a better numerical condition, and thus the solution tends to be more stable than the solution obtained by plain least squares without regularization.
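A minimal sketch of the ℓ2-regularized solution, under the same assumptions about `Psi_hat` and `r` as before:

```python
import numpy as np

def ridge_solution(Psi_hat, r, lam):
    """L2-regularized least squares: (Psi^T Psi + lam * I)^{-1} Psi^T r."""
    B = Psi_hat.shape[1]
    return np.linalg.solve(Psi_hat.T @ Psi_hat + lam * np.eye(B), Psi_hat.T @ r)
```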

Note that the same solution as the above ℓ2-penalized least-squares problem can be obtained by solving the following ℓ2-constrained least-squares problem:

min_θ (1/(NT)) ‖Ψ̂θ − r‖²   subject to ‖θ‖² ≤ C,

where C is determined from λ. Note that the larger the value of λ is (i.e., the stronger the effect of regularization is), the smaller the value of C is (i.e., the smaller the feasible region is). The feasible region (i.e., the region where the constraint ‖θ‖² ≤ C is satisfied) is illustrated in Figure 2.3(a).

[FIGURE 2.3: Feasible regions (i.e., regions where the constraint is satisfied). The least-squares (LS) solution is the bottom of the elliptical hyperboloid, whereas the solution of constrained least squares (CLS) is located at the point where the hyperboloid touches the feasible region. (a) ℓ2-constraint; (b) ℓ1-constraint.]

Another popular choice of regularization in statistics and machine learning is the ℓ1-regularizer, which is also called the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):

min_θ (1/(NT)) ‖Ψ̂θ − r‖² + λ‖θ‖₁,

where ‖·‖₁ denotes the ℓ1-norm defined as the absolute sum of the elements:

‖θ‖₁ = Σ_{b=1}^{B} |θ_b|.

In the same way as the ℓ2-regularization case, the same solution as the above ℓ1-penalized least-squares problem can be obtained by solving the following constrained least-squares problem:

min_θ (1/(NT)) ‖Ψ̂θ − r‖²   subject to ‖θ‖₁ ≤ C,

where C is determined from λ. The feasible region is illustrated in Figure 2.3(b).

[FIGURE 2.4: Cross validation. The training set is divided into K subsets; each subset is held out in turn for validation while the remaining subsets are used for estimation.]

A notable property of ℓ1-regularization is that the solution tends to be sparse, i.e., many of the elements {θ_b}_{b=1}^{B} become exactly zero. The reason why the solution becomes sparse can be intuitively understood from Figure 2.3(b): the solution tends to be on one of the corners of the feasible region, where the solution is sparse. On the other hand, in the ℓ2-constraint case (see Figure 2.3(a) again), the solution is similar to the ℓ1-constraint case, but it is not generally on an axis and thus the solution is not sparse. Such a sparse solution has various computational advantages. For example, the solution for large-scale problems can be computed efficiently because all parameters do not have to be explicitly handled; see, e.g., Tomioka et al., 2011. Furthermore, the solutions for all different regularization parameters can be computed efficiently (Efron et al., 2004), and the output of the learned model can be computed efficiently.
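The ℓ1-penalized problem has no closed-form solution; one simple way to solve it is iterative soft-thresholding (ISTA), sketched below for illustration (the iteration count is an arbitrary choice, and a dedicated LASSO solver would normally be used instead).

```python
import numpy as np

def lasso_ista(Psi_hat, r, lam, n_iters=5000):
    """L1-regularized least squares (LASSO) for the objective
    (1/(N*T)) * ||Psi_hat @ theta - r||^2 + lam * ||theta||_1,
    solved by iterative soft-thresholding (ISTA)."""
    NT, B = Psi_hat.shape
    theta = np.zeros(B)
    # step size = 1 / Lipschitz constant of the smooth part's gradient
    L = 2.0 / NT * np.linalg.norm(Psi_hat, 2) ** 2
    step = 1.0 / L
    for _ in range(n_iters):
        grad = 2.0 / NT * Psi_hat.T @ (Psi_hat @ theta - r)
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return theta
```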

2.2.4 Model Selection

In regression, tuning parameters are often included in the algorithm, such as basis parameters and the regularization parameter. Such tuning parameters can be objectively and systematically optimized based on cross-validation (Wahba, 1990) as follows (see Figure 2.4).

First, the training dataset H is divided into K disjoint subsets of approximately the same size, {H_k}_{k=1}^{K}. Then the regression solution θ̂_k is obtained using H\H_k (i.e., all samples without H_k), and its squared error for the hold-out samples H_k is computed. This procedure is repeated for k = 1, …, K, and the model (such as the basis parameter and the regularization parameter) that minimizes the average error is chosen as the most suitable one.

One may think that the ordinary squared error could be used directly for model selection, instead of its cross-validation estimator. However, the ordinary squared error is heavily biased (or, in other words, over-fitted) since the same training samples are used twice, for learning parameters and for estimating the generalization error (i.e., the out-of-sample prediction error). On the other hand, the cross-validation estimator of the squared error is almost unbiased, where "almost" comes from the fact that the number of training samples is reduced due to data splitting in the cross-validation procedure.

In general, cross-validation is computationally expensive because the squared error needs to be estimated many times. For example, when performing 5-fold cross-validation for 10 model candidates, the learning procedure has to be repeated 5 × 10 = 50 times. However, this is often acceptable in practice because sensible model selection gives an accurate solution even with a small number of samples. Thus, in total, the computation time may not grow that much. Furthermore, cross-validation is suitable for parallel computing since error estimation for different models and different folds is independent of each other. For instance, when performing 5-fold cross-validation for 10 model candidates, the use of 50 computing units allows us to compute everything at once.
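A minimal sketch of this procedure for choosing the regularization parameter is given below; it splits individual samples at random for simplicity, whereas in the reinforcement learning setting one would more naturally split by episodes. The ridge solver and the candidate list are illustrative assumptions.

```python
import numpy as np

def cross_validation_score(Psi_hat, r, lam, K=5, seed=0):
    """K-fold cross-validation estimate of the squared error of the
    ridge solution for a candidate regularization parameter lam."""
    NT, B = Psi_hat.shape
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(NT), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(NT), test)
        A = Psi_hat[train].T @ Psi_hat[train] + lam * np.eye(B)
        theta = np.linalg.solve(A, Psi_hat[train].T @ r[train])
        errors.append(np.mean((Psi_hat[test] @ theta - r[test]) ** 2))
    return np.mean(errors)

# choose the candidate minimizing the CV score, e.g.:
# best_lam = min(candidates, key=lambda lam: cross_validation_score(Psi_hat, r, lam))
```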

2.3 Remarks

Reinforcement learning via regression of state-action value functions is a highly powerful and flexible approach, because we can utilize various regression techniques developed in statistics and machine learning such as least squares, regularization, and cross-validation.

In the following chapters, we introduce more sophisticated regression techniques such as manifold-based smoothing (Chapelle et al., 2006) in Chapter 3, covariate shift adaptation (Sugiyama & Kawanabe, 2012) in Chapter 4, active learning (Sugiyama & Kawanabe, 2012) in Chapter 5, and robust regression (Huber, 1981) in Chapter 6.

Chapter 3
Basis Design for Value Function Approximation

Least-squares policy iteration explained in Chapter 2 works well, given appropriate basis functions for value function approximation. Because of its smoothness, the Gaussian kernel is a popular and useful choice as a basis function. However, it does not allow for discontinuity, which is conceivable in many reinforcement learning tasks. In this chapter, we introduce an alternative basis function based on geodesic Gaussian kernels (GGKs), which exploit the non-linear manifold structure induced by the Markov decision processes (MDPs). The details of GGK are explained in Section 3.1, and its relation to other basis function designs is discussed in Section 3.2. Then, experimental performance is numerically evaluated in Section 3.3, and this chapter is concluded in Section 3.4.

3.1 Gaussian Kernels on Graphs

In least-squares policy iteration, the choice of basis functions {φ_b(s, a)}_{b=1}^{B} is an open design issue (see Chapter 2). Traditionally, Gaussian kernels have been a popular choice (Lagoudakis & Parr, 2003; Engel et al., 2005), but they cannot approximate discontinuous functions well. To cope with this problem, more sophisticated methods of constructing suitable basis functions have been proposed which effectively make use of the graph structure induced by MDPs (Mahadevan, 2005). In this section, we introduce an alternative way of constructing basis functions by incorporating the graph structure of the state space.

3.1.1 MDP-Induced Graph

Let G be a graph induced by an MDP, where the states S are nodes of the graph and the transitions with non-zero transition probabilities from one node to another are edges. The edges may have weights determined, e.g., based on the transition probabilities or the distance between nodes. The graph structure corresponding to an example grid world shown in Figure 3.1(a) is illustrated in Figure 3.1(c).

[FIGURE 3.1: An illustrative example of a reinforcement learning task of guiding an agent to a goal in the grid world. (a) Black areas are walls over which the agent cannot move, while the goal is represented in gray; arrows on the grids represent one of the optimal policies. (b) Optimal state value function (in log-scale). (c) Graph induced by the MDP and a random policy.]

In practice, such graph structure (including the connection weights) is estimated from samples of a finite length. We assume that the graph G is connected. Typically, the graph is sparse in reinforcement learning tasks, i.e.,

ℓ ≪ n(n − 1)/2,

where ℓ is the number of edges and n is the number of nodes.

3.1.2 Ordinary Gaussian Kernels

Ordinary Gaussian kernels (OGKs) on the Euclidean space are defined as

K(s, s′) = exp( −ED(s, s′)² / (2σ²) ),

where ED(s, s′) is the Euclidean distance between states s and s′; for example,

ED(s, s′) = ‖x − x′‖,

when the Cartesian positions of s and s′ in the state space are given by x and x′, respectively. σ² is the variance parameter of the Gaussian kernel.

The above Gaussian function is defined on the state space S, where s′ is treated as a center of the kernel. In order to employ the Gaussian kernel in least-squares policy iteration, it needs to be extended over the state-action space S × A. This is usually carried out by simply "copying" the Gaussian function over the action space (Lagoudakis & Parr, 2003; Mahadevan, 2005). More precisely, let the total number k of basis functions be mp, where m is the number of possible actions and p is the number of Gaussian centers. For the i-th action a(i) (∈ A) (i = 1, 2, …, m) and for the j-th Gaussian center c(j) (∈ S) (j = 1, 2, …, p), the (i + (j − 1)m)-th basis function is defined as

φ_{i+(j−1)m}(s, a) = I(a = a(i)) K(s, c(j)),   (3.1)

where I(·) is the indicator function:

I(a = a(i)) = 1 if a = a(i), and 0 otherwise.
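A minimal sketch of the basis functions of Eq. (3.1), with `centers`, `actions`, and `sigma` supplied by the user (the data layout is an illustrative assumption):

```python
import numpy as np

def ogk_basis(s, a, centers, actions, sigma):
    """Ordinary Gaussian kernel basis copied over the action space (Eq. (3.1)).
    Returns a vector of length m * p, with m = len(actions), p = len(centers)."""
    s = np.asarray(s, dtype=float)
    m, p = len(actions), len(centers)
    phi = np.zeros(m * p)
    for j, c in enumerate(centers):
        k = np.exp(-np.sum((s - np.asarray(c, dtype=float)) ** 2) / (2 * sigma ** 2))
        for i, ai in enumerate(actions):
            if a == ai:            # indicator I(a = a(i))
                phi[i + j * m] = k
    return phi
```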

3.1.3 Geodesic Gaussian Kernels

On graphs, a natural definition of the distance would be the shortest path. The Gaussian kernel based on the shortest path is given by

K(s, s′) = exp( −SP(s, s′)² / (2σ²) ),   (3.2)

where SP(s, s′) denotes the shortest path from state s to state s′. The shortest path on a graph can be interpreted as a discrete approximation to the geodesic distance on a non-linear manifold (Chung, 1997). For this reason, we call Eq. (3.2) a geodesic Gaussian kernel (GGK) (Sugiyama et al., 2008).

Shortest paths on graphs can be efficiently computed using the Dijkstra algorithm (Dijkstra, 1959). With its naive implementation, the computational complexity for computing the shortest paths from a single node to all other nodes is O(n²), where n is the number of nodes. If the Fibonacci heap is employed, the computational complexity can be reduced to O(n log n + ℓ) (Fredman & Tarjan, 1987), where ℓ is the number of edges. Since the graph in value function approximation problems is typically sparse (i.e., ℓ ≪ n²), using the Fibonacci heap provides significant computational gains. Furthermore, there exist various approximation algorithms which are computationally very efficient (see Goldberg & Harrelson, 2005 and references therein).

Analogously to OGKs, we need to extend GGKs to the state-action space to use them in least-squares policy iteration. A naive way is to just employ Eq. (3.1), but this can cause a shift in the Gaussian centers since the state usually changes when some action is taken. To incorporate this transition, the basis functions are defined as the expectation of the Gaussian functions after transition:

φ_{i+(j−1)m}(s, a) = I(a = a(i)) Σ_{s′∈S} P(s′|s, a) K(s′, c(j)).   (3.3)

This shifting scheme is shown to work very well when the transition is predominantly deterministic (Sugiyama et al., 2008).
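The following self-contained sketch illustrates GGK basis functions with the shifting scheme of Eq. (3.3) for a discrete state space. The graph is assumed to be given as an adjacency dictionary and the transition model as a dictionary keyed by (state, action) pairs; in practice the Dijkstra distances from each center would be precomputed once and cached.

```python
import heapq
import numpy as np

def dijkstra(adj, source):
    """Shortest-path distances from `source` on a weighted graph.
    adj: dict mapping node -> list of (neighbor, edge_weight)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, np.inf):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, np.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def ggk_basis(s, a, centers, actions, adj, transitions, sigma):
    """Geodesic Gaussian kernel basis with the shifting scheme of Eq. (3.3).
    transitions: dict mapping (s, a) -> list of (s_next, probability)."""
    m = len(actions)
    phi = np.zeros(m * len(centers))
    i = actions.index(a)                      # indicator I(a = a(i))
    for j, c in enumerate(centers):
        sp = dijkstra(adj, c)                 # geodesic distances from center c(j)
        k = sum(p * np.exp(-sp.get(s_next, np.inf) ** 2 / (2 * sigma ** 2))
                for s_next, p in transitions.get((s, a), []))
        phi[i + j * m] = k
    return phi
```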

3.1.4 Extension to Continuous State Spaces

So far, we have focused on discrete state spaces. However, the concept of GGKs can be naturally extended to continuous state spaces, which is explained here. First, the continuous state space is discretized, which gives a graph as a discrete approximation to the non-linear manifold structure of the continuous state space. Based on the graph, GGKs can be constructed in the same way as in the discrete case. Finally, the discrete GGKs are interpolated, e.g., using a linear method, to give continuous GGKs.

Although this procedure discretizes the continuous state space, it must be noted that the discretization is only for the purpose of obtaining the graph as a discrete approximation of the continuous non-linear manifold; the resulting basis functions themselves are continuously interpolated and hence the state space is still treated as continuous, as opposed to conventional discretization procedures.

3.2 Illustration

In this section, the characteristics of GGKs are discussed in comparison to existing basis functions.

3.2.1 Setup

Let us consider a toy reinforcement learning task of guiding an agent to a goal in a deterministic grid world (see Figure 3.1(a)). The agent can take 4 actions: up, down, left, and right. Note that actions which make the agent collide with the wall are disallowed. A positive immediate reward +1 is given if the agent reaches a goal state; otherwise it receives no immediate reward. The discount factor is set at γ = 0.9.

In this task, a state s corresponds to a two-dimensional Cartesian grid position x of the agent. For illustration purposes, let us display the state value function,

Vπ(s) : S → R,

which is the expected long-term discounted sum of rewards the agent receives when the agent takes actions following policy π from state s. From the definition, it can be confirmed that Vπ(s) is expressed in terms of Qπ(s, a) as

Vπ(s) = Qπ(s, π(s)).

The optimal state value function V*(s) (in log-scale) is illustrated in Figure 3.1(b). An MDP-induced graph structure estimated from 20 series of random walk samples¹ of length 500 is illustrated in Figure 3.1(c). Here, the edge weights in the graph are set at 1 (which is equivalent to the Euclidean distance between two nodes).

3.2.2 Geodesic Gaussian Kernels

An example of GGKs for this graph is depicted in Figure 3.2(a), where the variance of the kernel is set at a large value (σ² = 30) for illustration purposes. The graph shows that GGKs have a nice smooth surface along the maze, but not across the partition between the two rooms. Since GGKs have "centers," they are extremely useful for adaptively choosing a subset of bases, e.g., using a uniform allocation strategy, a sample-dependent allocation strategy, or a maze-dependent allocation strategy of the centers. This is a practical advantage over some non-ordered basis functions. Moreover, since GGKs are local by nature, the ill effects of local noise are constrained locally, which is another useful property in practice.

The approximated value functions obtained by 40 GGKs² are depicted in Figure 3.3(a), where one GGK center is put at the goal state and the remaining 39 centers are chosen randomly. For GGKs, kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic (see Section 3.1.3).

¹ More precisely, in each random walk, an initial state is chosen randomly. Then, an action is chosen randomly and a transition is made; this is repeated 500 times. This entire procedure is independently repeated 20 times to generate the training set.
² Note that the total number k of basis functions is 160 since each GGK is copied over the action space as per Eq. (3.3).

[FIGURE 3.2: Examples of basis functions. (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels; (c) graph-Laplacian eigenbases; (d) diffusion wavelets.]

[FIGURE 3.3: Approximated value functions in log-scale. The errors are computed with respect to the optimal value function illustrated in Figure 3.1(b). (a) Geodesic Gaussian kernels (MSE = 1.03×10^−2); (b) ordinary Gaussian kernels (MSE = 1.19×10^−2); (c) graph-Laplacian eigenbases (MSE = 4.73×10^−4); (d) diffusion wavelets (MSE = 5.00×10^−4).]

The proposed GGK-based method produces a nice smooth function along the maze while the discontinuity around the partition between the two rooms is sharply maintained (cf. Figure 3.1(b)). As a result, for this particular case, GGKs give the optimal policy (see Figure 3.4(a)).

As discussed in Section 3.1.3, the sparsity of the state transition matrix allows efficient and fast computation of shortest paths on the graph. Therefore, least-squares policy iteration with GGK-based bases is still computationally attractive.

3.2.3 Ordinary Gaussian Kernels

OGKs share some of the preferable properties of GGKs described above. However, as illustrated in Figure 3.2(b), the tail of OGKs extends beyond the partition between the two rooms. Therefore, OGKs tend to undesirably smooth out the discontinuity of the value function around the barrier wall (see Figure 3.3(b)). This causes an error in the policy around the partition (see x = 10, y = 2, 3, …, 9 of Figure 3.4(b)).

[FIGURE 3.4: Obtained policies. (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels; (c) graph-Laplacian eigenbases; (d) diffusion wavelets.]

3.2.4 Graph-Laplacian Eigenbases

Mahadevan (2005) proposed employing the smoothest vectors on graphs as bases in value function approximation. According to spectral graph theory (Chung, 1997), such smooth bases are given by the minor eigenvectors of the graph-Laplacian matrix, which are called graph-Laplacian eigenbases (GLEs). GLEs may be regarded as a natural extension of Fourier bases to graphs.

Examples of GLEs are illustrated in Figure 3.2(c), showing that they have a Fourier-like structure on the graph. It should be noted that GLEs are rather global in nature, implying that noise in a local region can potentially degrade the global quality of approximation. An advantage of GLEs is that they have a natural ordering of the basis functions according to the smoothness. This is practically very helpful in choosing a subset of basis functions. Figure 3.3(c) depicts the approximated value function in log-scale, where the top 40 smoothest GLEs out of 326 GLEs are used (note that the actual number of bases is 160 because of the duplication over the action space). It shows that GLEs globally give a very good approximation, although the small local fluctuation is significantly emphasized since the graph is in log-scale. Indeed, the mean squared error (MSE) between the approximated and optimal value functions described in the captions of Figure 3.3 shows that GLEs give a much smaller MSE than GGKs and OGKs. However, the obtained value function contains systematic local fluctuation and this results in an inappropriate policy (see Figure 3.4(c)).

MDP-induced graphs are typically sparse. In such cases, the resultant graph-Laplacian matrix is also sparse and GLEs can be obtained just by solving a sparse eigenvalue problem, which is computationally efficient. However, finding minor eigenvectors could be numerically unstable.
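For illustration, a minimal sketch of computing GLEs from an edge-weight matrix `W` (a dense implementation for small graphs; a sparse eigensolver would be used in practice):

```python
import numpy as np

def graph_laplacian_eigenbases(W, num_bases):
    """Smoothest basis vectors on a graph: eigenvectors of the graph
    Laplacian L = D - W associated with the smallest eigenvalues.
    W: symmetric (n x n) edge-weight matrix."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, :num_bases]          # columns = minor eigenvectors
```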

3.2.5 Diffusion Wavelets

Coifman and Maggioni (2006) proposed diffusion wavelets (DWs), which are a natural extension of wavelets to the graph. The construction is based on a symmetrized random walk on a graph. It is diffused on the graph up to a desired level, resulting in a multi-resolution structure. A detailed algorithm for constructing DWs and their mathematical properties are described in Coifman and Maggioni (2006).

When constructing DWs, the maximum nest level of the wavelets and the tolerance used in the construction algorithm need to be specified by users. The maximum nest level is set at 10 and the tolerance is set at 10^−10, as suggested by the authors. Examples of DWs are illustrated in Figure 3.2(d), showing a nice multi-resolution structure on the graph. DWs are over-complete bases, so one has to appropriately choose a subset of bases for better approximation. Figure 3.3(d) depicts the approximated value function obtained by DWs, where we chose the most global 40 DWs from 1626 over-complete DWs (note that the actual number of bases is 160 because of the duplication over the action space). The choice of the subset bases could possibly be enhanced using multiple heuristics. However, the current choice is reasonable since Figure 3.3(d) shows that DWs give a much smaller MSE than Gaussian kernels. Nevertheless, similarly to GLEs, the obtained value function contains a lot of small fluctuations (see Figure 3.3(d)) and this results in an erroneous policy (see Figure 3.4(d)).

Thanks to the multi-resolution structure, computation of diffusion wavelets can be carried out recursively. However, due to the over-completeness, it is still rather demanding in computation time. Furthermore, appropriately determining the tuning parameters as well as choosing an appropriate basis subset is not straightforward in practice.

3.3 Numerical Examples

As discussed in the previous section, GGKs bring a number of preferable properties for making value function approximation effective. In this section, the behavior of GGKs is illustrated numerically.

3.3.1 Robot-Arm Control

Here, a simulator of a two-joint robot arm (moving in a plane), illustrated in Figure 3.5(a), is employed. The task is to lead the end-effector ("hand") of the arm to an object while avoiding the obstacles. Possible actions are to increase or decrease the angle of each joint ("shoulder" and "elbow") by 5 degrees in the plane, simulating coarse stepper-motor joints. Thus, the state space S is the 2-dimensional discrete space consisting of the two joint angles, as illustrated in Figure 3.5(b). The black area in the middle corresponds to the obstacle in the joint-angle state space. The action space A involves 4 actions: increase or decrease one of the joint angles. A positive immediate reward +1 is given when the robot's end-effector touches the object; otherwise the robot receives no immediate reward. Note that actions which make the arm collide with obstacles are disallowed. The discount factor is set at γ = 0.9. In this environment, the robot can change the joint angle exactly by 5 degrees, and therefore the environment is deterministic. However, because of the obstacles, it is difficult to explicitly compute an inverse kinematic model. Furthermore, the obstacles introduce discontinuity in value functions. Therefore, this robot-arm control task is an interesting testbed for investigating the behavior of GGKs.

Training samples from 50 series of 1000 random arm movements are collected, where the start state is chosen randomly in each trial. The graph induced by the above MDP consists of 1605 nodes and uniform weights are assigned to the edges. Since there are 16 goal states in this environment (see Figure 3.5(b)), the first 16 Gaussian centers are put at the goals and the remaining centers are chosen randomly in the state space. For GGKs, kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic in this experiment.

Figure 3.6 illustrates the value functions approximated using GGKs and OGKs. The graphs show that GGKs give a nice smooth surface with the obstacle-induced discontinuity sharply preserved, while OGKs tend to smooth out the discontinuity. This makes a significant difference in avoiding the obstacle. From "A" to "B" in Figure 3.5(b), the GGK-based value function results in a trajectory that avoids the obstacle (see Figure 3.6(a)). On the other hand, the OGK-based value function yields a trajectory that tries to move the arm through the obstacle by following the gradient upward (see Figure 3.6(b)), causing the arm to get stuck behind the obstacle.

[FIGURE 3.5: A two-joint robot arm. (a) A schematic; (b) state space. In this experiment, GGKs are put at all the goal states and the remaining kernels are distributed uniformly over the maze; the shifting scheme is used in GGKs.]

Figure 3.7 summarizes the performance of GGKs and OGKs measured by the percentage of successful trials (i.e., the end-effector reaches the object) over 30 independent runs. More precisely, in each run, 50,000 training samples are collected using a different random seed, a policy is then computed by the GGK- or OGK-based least-squares policy iteration, and finally the obtained policy is tested. This graph shows that GGKs remarkably outperform OGKs since the arm can successfully avoid the obstacle. The performance of OGKs does not go beyond 0.6 even when the number of kernels is increased. This is caused by the tail effect of OGKs. As a result, the OGK-based policy cannot lead the end-effector to the object if it starts from the bottom left half of the state space.

[FIGURE 3.6: Approximated value functions with 10 kernels (the actual number of bases is 40 because of the duplication over the action space). (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels.]

[FIGURE 3.7: Fraction of successful trials as a function of the number of kernels (GGK(5), GGK(9), OGK(5), OGK(9)).]

When the number of kernels is increased, the performance of both GGKs and OGKs gets worse at around k = 20. This is caused by the kernel allocation strategy: the first 16 kernels are put at the goal states and the remaining kernel centers are chosen randomly. When k is less than or equal to 16, the approximated value function tends to have a unimodal profile since all kernels are put at the goal states. However, when k is larger than 16, this unimodality is broken and the surface of the approximated value function has slight fluctuations, causing an error in policies and degrading performance at around k = 20. This performance degradation tends to recover as the number of kernels is further increased.

Motion examples of the robot arm trained with GGK and OGK are illustrated in Figure 3.8 and Figure 3.9, respectively.

Overall, the above result shows that when GGKs are combined with the above-mentioned kernel-center allocation strategy, almost perfect policies can be obtained with a small number of kernels. Therefore, the GGK method is computationally highly advantageous.

3.3.2 Robot-Agent Navigation

The above simple robot-arm control simulation shows that GGKs are promising. Here, GGKs are applied to a more challenging task of mobile-robot navigation, which involves a high-dimensional and very large state space.

A Khepera robot, illustrated in Figure 3.10(a), is employed for the navigation task. The Khepera robot is equipped with 8 infrared sensors ("s1" to "s8" in the figure), each of which gives a measure of the distance from the surrounding obstacles. Each sensor produces a scalar value between 0 and 1023: the sensor obtains the maximum value 1023 if an obstacle is just in front of the sensor, and the value decreases as the obstacle gets farther, until it reaches the minimum value 0. Therefore, the state space S is 8-dimensional. The Khepera robot has two wheels and takes the following four actions: forward, left rotation, right rotation, and backward (i.e., the action space A contains four actions). The speed of the left and right wheels for each action is described in Figure 3.10(a) in brackets (the unit is pulses per 10 milliseconds). Note that the sensor values and the wheel speed are highly stochastic due to cross-talk, sensor noise, slip, etc. Furthermore, perceptual aliasing occurs due to the limited range and resolution of the sensors. Therefore, the state transition is also highly stochastic. The discount factor is set at γ = 0.9.

The goal of the navigation task is to make the Khepera robot explore the environment as much as possible. To this end, a positive reward +1 is given when the Khepera robot moves forward and a negative reward −2 is given when the Khepera robot collides with an obstacle. No reward is given to the left rotation, right rotation, and backward actions. This reward design encourages the Khepera robot to go forward without hitting obstacles, through which extensive exploration in the environment could be achieved.

Training samples are collected from 200 series of 100 random movements in a fixed environment with several obstacles (see Figure 3.11(a)). Then, a graph is constructed from the gathered samples by discretizing the continuous state space using a self-organizing map (SOM) (Kohonen, 1995). A SOM consists of neurons located on a regular grid. Each neuron corresponds to a cluster and neurons are connected to adjacent ones by a neighborhood relation. The SOM is similar to the k-means clustering algorithm, but it is different in that the topological structure of the entire map is taken into account. Thanks to this, the entire space tends to be covered by the SOM.

[FIGURE 3.8: A motion example of the robot arm trained with GGK (from left to right and top to bottom).]

[FIGURE 3.9: A motion example of the robot arm trained with OGK (from left to right and top to bottom).]

[FIGURE 3.10: Khepera robot. (a) A schematic; (b) state space projected onto a 2-dimensional subspace for visualization. In this experiment, GGKs are distributed uniformly over the maze without the shifting scheme.]

The number of nodes (states) in the graph is set at 696 (equivalent to the SOM map size of 24 × 29). This value is computed by the standard rule-of-thumb formula 5√n (Vesanto et al., 2000), where n is the number of samples. The connectivity of the graph is determined by the state transitions occurring in the samples. More specifically, if there is a state transition from one node to another in the samples, an edge is established between these two nodes and the edge weight is set according to the Euclidean distance between them.

Figure 3.10(b) illustrates an example of the obtained graph structure. For visualization purposes, the 8-dimensional state space is projected onto a 2-dimensional subspace spanned by

(−1 −1 0 0 1 1 0 0) and (0 0 1 1 0 0 −1 −1).

[FIGURE 3.11: Simulation environment. (a) Training; (b) test.]

Note that this projection is performed only for the purpose of visualization. All the computations are carried out using the entire 8-dimensional data. The i-th element in the above bases corresponds to the output of the i-th sensor (see Figure 3.10(a)). The projection onto this subspace roughly means that the horizontal axis corresponds to the distance to the left and right obstacles, while the vertical axis corresponds to the distance to the front and back obstacles. For clear visibility, only the edges whose weight is less than 250 are plotted. Representative local poses of the Khepera robot with respect to the obstacles are illustrated in Figure 3.10(b). This graph has a notable feature: the nodes around the region "B" in the figure are directly connected to the nodes at "A," but are very sparsely connected to the nodes at "C," "D," and "E." This implies that the geodesic distance from "B" to "C," "B" to "D," or "B" to "E" is typically larger than the Euclidean distance.

Since the transition from one state to another is highly stochastic in the current experiment, the GGK function is simply duplicated over the action space (see Eq. (3.1)). For obtaining continuous GGKs, GGK functions need to be interpolated (see Section 3.1.4). A simple linear interpolation method may be employed in general, but the current experiment has unique characteristics: at least one of the sensor values is always zero since the Khepera robot is never completely surrounded by obstacles. Therefore, samples are always on the surface of the 8-dimensional hypercube-shaped state space. On the other hand, the node centers determined by the SOM are not generally on the surface. This means that a sample is not generally included in the convex hull of its nearest nodes and the function value needs to be extrapolated. Here, the Euclidean distance between the sample and its nearest node is simply added when computing kernel values. More precisely, for a state s that is not generally located on a node center, the GGK-based basis function is defined as

φ_{i+(j−1)m}(s, a) = I(a = a(i)) exp( −(ED(s, s̃) + SP(s̃, c(j)))² / (2σ²) ),

where s̃ is the node closest to s in the Euclidean distance.
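A sketch of this extrapolation scheme is given below for illustration; the data layout (SOM node coordinates and precomputed shortest-path distances from each kernel center) is an assumption made for the example.

```python
import numpy as np

def ggk_basis_continuous(s, a, node_states, sp_from_centers, actions, sigma):
    """GGK basis for a continuous state s: Euclidean distance to the nearest
    graph node plus the geodesic distance from that node to each center,
    plugged into a Gaussian (the formula above).

    node_states:     (n, d) array of node coordinates from the SOM
    sp_from_centers: (p, n) array; entry [j, v] is the shortest-path distance
                     from the j-th center node to node v
    """
    s = np.asarray(s, dtype=float)
    dists = np.linalg.norm(node_states - s, axis=1)
    nearest = int(np.argmin(dists))          # node s~ closest to s
    ed = dists[nearest]                      # ED(s, s~)
    m, p = len(actions), sp_from_centers.shape[0]
    phi = np.zeros(m * p)
    i = actions.index(a)                     # indicator I(a = a(i))
    for j in range(p):
        d = ed + sp_from_centers[j, nearest]     # ED(s, s~) + SP(s~, c(j))
        phi[i + j * m] = np.exp(-d ** 2 / (2 * sigma ** 2))
    return phi
```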

Figure 3.12 illustrates an example of the actions selected at each node by the GGK-based and OGK-based policies. One hundred kernels are used and the width is set at 1000. The symbols ↑, ↓, ⊂, and ⊃ in the figure indicate the forward, backward, left rotation, and right rotation actions. This shows that there is a clear difference in the obtained policies at the state "C." The backward action is most likely to be taken by the OGK-based policy, while the left rotation and right rotation are most likely to be taken by the GGK-based policy. This causes a significant difference in the performance. To explain this, suppose that the Khepera robot is at the state "C," i.e., it faces a wall. The GGK-based policy guides the Khepera robot from "C" to "A" via "D" or "E" by taking the left and right rotation actions, and it can avoid the obstacle successfully. On the other hand, the OGK-based policy tries to plan a path from "C" to "A" via "B" by activating the backward action. As a result, the forward action is taken at "B." For this reason, the Khepera robot returns to "C" again and ends up moving back and forth between "C" and "B."

For the performance evaluation, a more complicated environment than the one used for gathering training samples (see Figure 3.11) is used. This means that how well the obtained policies can be generalized to an unknown environment is evaluated here. In this test environment, the Khepera robot runs from a fixed starting position (see Figure 3.11(b)) and takes 150 steps following the obtained policy, with the sum of rewards (+1 for the forward action) computed. If the Khepera robot collides with an obstacle before 150 steps, the evaluation is stopped. The mean test performance over 30 independent runs is depicted in Figure 3.13 as a function of the number of kernels. More precisely, in each run, a graph is constructed based on the training samples taken from the training environment and the specified number of kernels is put randomly on the graph. Then, a policy is learned by the GGK- or OGK-based least-squares policy iteration using the training samples. Note that the actual number of bases is four times more because of the extension of basis functions over the action space. The test performance is measured 5 times for each policy and the average is output. Figure 3.13 shows that GGKs significantly outperform OGKs, demonstrating that GGKs are promising even in the challenging setting with a high-dimensional large state space.

Figure 3.14 depicts the computation time of each method as a function of the number of kernels. This shows that the computation time monotonically increases as the number of kernels increases and that the GGK-based and OGK-based methods have comparable computation time. However, given that the GGK-based method works much better than the OGK-based method with a smaller number of kernels (see Figure 3.13), the GGK-based method could be regarded as a computationally efficient alternative to the standard OGK-based method.

Finally, the trained Khepera robot is applied to map building. Starting from an initial position (indicated by a square in Figure 3.15), the Khepera robot takes an action 2000 times following the learned policy.

[FIGURE 3.12: Examples of obtained policies. The symbols ↑, ↓, ⊂, and ⊃ indicate the forward, backward, left rotation, and right rotation actions. (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels.]

Eighty kernels with Gaussian width σ = 1000 are used for value function approximation. The results of GGKs and OGKs are depicted in Figure 3.15. The graphs show that the GGK result gives a broader profile of the environment, while the OGK result only reveals a local area around the initial position.

Motion examples of the Khepera robot trained with GGK and OGK are illustrated in Figure 3.16 and Figure 3.17, respectively.

[FIGURE 3.13: Average amount of exploration (averaged total rewards as a function of the number of kernels).]

[FIGURE 3.14: Computation time (in seconds, as a function of the number of kernels).]

[FIGURE 3.15: Results of map building (cf. Figure 3.11(b)). (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels.]

3.4 Remarks

The performance of least-squares policy iteration depends heavily on the choice of basis functions for value function approximation. In this chapter, the geodesic Gaussian kernel (GGK) was introduced and shown to possess several preferable properties such as smoothness along the graph and easy computability. It was also demonstrated that the policies obtained by GGKs are not as sensitive to the choice of the Gaussian kernel width, which is a useful property in practice. Also, the heuristic of putting Gaussian centers on goal states was shown to work well.

[FIGURE 3.16: A motion example of the Khepera robot trained with GGK (from left to right and top to bottom).]

[FIGURE 3.17: A motion example of the Khepera robot trained with OGK (from left to right and top to bottom).]

However, when the transition is highly stochastic (i.e., the transition probability has a wide support), the graph constructed based on the transition samples could be noisy. When an erroneous transition results in a short-cut over obstacles, the graph-based approach may not work well since the topology of the state space changes significantly.

Chapter 4
Sample Reuse in Policy Iteration

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques for compensating for the bias caused by the difference between the data-sampling policy and the target policy. In this chapter, we explain how importance sampling can be utilized to efficiently reuse previously collected data samples in policy iteration. After formulating the problem of off-policy value function approximation in Section 4.1, representative off-policy value function approximation techniques including adaptive importance sampling are reviewed in Section 4.2. Then, in Section 4.3, how the adaptivity of importance sampling can be optimally controlled is explained. In Section 4.4, off-policy value function approximation techniques are integrated in the framework of least-squares policy iteration for efficient sample reuse. Experimental results are shown in Section 4.5, and finally this chapter is concluded in Section 4.6.

4.1 Formulation

As explained in Section 2.2, least-squares policy iteration models the state-action value function Qπ(s, a) by a linear architecture,

θ⊤φ(s, a),

and learns the parameter θ so that the generalization error G is minimized:

G(θ) = E_{pπ(h)}[ (1/T) Σ_{t=1}^{T} ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ].   (4.1)

Here, E_{pπ(h)} denotes the expectation over the history

h = [s_1, a_1, …, s_T, a_T, s_{T+1}]

following the target policy π, and

ψ(s, a) = φ(s, a) − γ E_{π(a′|s′) p(s′|s,a)}[φ(s′, a′)].

When history samples following the target policy π are available, the situation is called on-policy reinforcement learning. In this case, just replacing the expectation contained in the generalization error G by sample averages gives a statistically consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of samples goes to infinity).

Here, we consider the situation called off-policy reinforcement learning, where the sampling policy π̃ for collecting data samples is generally different from the target policy π. Let us denote the history samples following π̃ by

H^{π̃} = {h^{π̃}_1, …, h^{π̃}_N},

where each episodic sample h^{π̃}_n is given as

h^{π̃}_n = [s^{π̃}_{1,n}, a^{π̃}_{1,n}, …, s^{π̃}_{T,n}, a^{π̃}_{T,n}, s^{π̃}_{T+1,n}].

Under the off-policy setup, naive learning by minimizing the sample-approximated generalization error Ĝ_NIW leads to an inconsistent estimator:

Ĝ_NIW(θ) = (1/(NT)) Σ_{n=1}^{N} Σ_{t=1}^{T} ( θ⊤ψ̂(s^{π̃}_{t,n}, a^{π̃}_{t,n}; H^{π̃}) − r(s^{π̃}_{t,n}, a^{π̃}_{t,n}, s^{π̃}_{t+1,n}) )²,

where

ψ̂(s, a; H) = φ(s, a) − (γ / |H_{(s,a)}|) Σ_{s′ ∈ H_{(s,a)}} E_{π(a′|s′)}[φ(s′, a′)].

H_{(s,a)} denotes a subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and Σ_{s′ ∈ H_{(s,a)}} denotes the summation over all destination states s′ in the set H_{(s,a)}. NIW stands for "no importance weight," which will be explained later.

This inconsistency problem can be avoided by gathering new samples following the target policy π, i.e., when the current policy is updated, new samples are gathered following the updated policy and the new samples are used for policy evaluation. However, when the data sampling cost is high, this is too expensive. It would be more cost efficient if previously gathered samples could be reused effectively.

4.2 Off-Policy Value Function Approximation

Importance sampling is a general technique for dealing with the off-policy situation. Suppose we have i.i.d. (independent and identically distributed) samples {x_n}_{n=1}^{N} from a strictly positive probability density function p̃(x). Using these samples, we would like to compute the expectation of a function g(x) over another probability density function p(x). A consistent approximation of the expectation is given by the importance-weighted average as

(1/N) Σ_{n=1}^{N} g(x_n) p(x_n)/p̃(x_n)  →  E_{p̃(x)}[ g(x) p(x)/p̃(x) ]   (as N → ∞)
  = ∫ g(x) (p(x)/p̃(x)) p̃(x) dx = ∫ g(x) p(x) dx = E_{p(x)}[g(x)].
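For illustration, the importance-weighted average can be sketched as follows, assuming the two densities `p` and `p_tilde` can be evaluated pointwise (the function names are illustrative):

```python
import numpy as np

def importance_weighted_average(x_samples, g, p, p_tilde):
    """Consistent estimate of E_{p(x)}[g(x)] from i.i.d. samples drawn from
    p_tilde(x), using the importance weights p(x) / p_tilde(x)."""
    weights = np.array([p(x) / p_tilde(x) for x in x_samples])
    values = np.array([g(x) for x in x_samples])
    return np.mean(weights * values)
```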

However, applying the importance sampling technique in off-policy reinforcement learning is not straightforward since our training samples of state s and action a are not i.i.d. due to the sequential nature of Markov decision processes (MDPs). In this section, representative importance-weighting techniques for MDPs are reviewed.

4.2.1

EpisodicImportanceWeighting

Based on the independence between episodes,

    p(h, h') = p(h)\,p(h') = p(s_1, a_1, \ldots, s_T, a_T, s_{T+1})\, p(s'_1, a'_1, \ldots, s'_T, a'_T, s'_{T+1}),

the generalization error G can be rewritten as

    G(\theta) = \mathbb{E}_{p^{\widetilde{\pi}}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \bigl( \theta^\top \psi(s_t,a_t) - r(s_t,a_t) \bigr)^2 w_T \right],

where w_T is the episodic importance weight (EIW):

    w_T = \frac{p^{\pi}(h)}{p^{\widetilde{\pi}}(h)}.

p^π(h) and p^π̃(h) are the probability densities of observing episodic data h under policy π and π̃:

    p^{\pi}(h) = p(s_1) \prod_{t=1}^{T} \pi(a_t|s_t)\, p(s_{t+1}|s_t,a_t),
    p^{\widetilde{\pi}}(h) = p(s_1) \prod_{t=1}^{T} \widetilde{\pi}(a_t|s_t)\, p(s_{t+1}|s_t,a_t).

Note that the importance weights can be computed without explicitly knowing p(s_1) and p(s_{t+1}|s_t,a_t), since they are canceled out:

    w_T = \frac{\prod_{t=1}^{T} \pi(a_t|s_t)}{\prod_{t=1}^{T} \widetilde{\pi}(a_t|s_t)}.


Using the training data H^π̃, we can construct a consistent estimator of G as

    \widehat{G}_{\mathrm{EIW}}(\theta) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \Bigl( \theta^\top \widehat{\psi}(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}; H^{\widetilde{\pi}}) - r(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}, s^{\widetilde{\pi}}_{t+1,n}) \Bigr)^2 \widehat{w}_{T,n},    (4.2)

where

    \widehat{w}_{T,n} = \frac{\prod_{t=1}^{T} \pi(a^{\widetilde{\pi}}_{t,n}|s^{\widetilde{\pi}}_{t,n})}{\prod_{t=1}^{T} \widetilde{\pi}(a^{\widetilde{\pi}}_{t,n}|s^{\widetilde{\pi}}_{t,n})}.

4.2.2 Per-Decision Importance Weighting

A crucial observation in EIW is that the error at the t-th step does not depend on the samples after the t-th step (Precup et al., 2000). Thus, the generalization error G can be rewritten as

    G(\theta) = \mathbb{E}_{p^{\widetilde{\pi}}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \bigl( \theta^\top \psi(s_t,a_t) - r(s_t,a_t) \bigr)^2 w_t \right],

where w_t is the per-decision importance weight (PIW):

    w_t = \frac{p(s_1) \prod_{t'=1}^{t} \pi(a_{t'}|s_{t'})\, p(s_{t'+1}|s_{t'},a_{t'})}{p(s_1) \prod_{t'=1}^{t} \widetilde{\pi}(a_{t'}|s_{t'})\, p(s_{t'+1}|s_{t'},a_{t'})} = \frac{\prod_{t'=1}^{t} \pi(a_{t'}|s_{t'})}{\prod_{t'=1}^{t} \widetilde{\pi}(a_{t'}|s_{t'})}.

Using the training data H^π̃, we can construct a consistent estimator as follows (cf. Eq. (4.2)):

    \widehat{G}_{\mathrm{PIW}}(\theta) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \Bigl( \theta^\top \widehat{\psi}(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}; H^{\widetilde{\pi}}) - r(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}, s^{\widetilde{\pi}}_{t+1,n}) \Bigr)^2 \widehat{w}_{t,n},

where

    \widehat{w}_{t,n} = \frac{\prod_{t'=1}^{t} \pi(a^{\widetilde{\pi}}_{t',n}|s^{\widetilde{\pi}}_{t',n})}{\prod_{t'=1}^{t} \widetilde{\pi}(a^{\widetilde{\pi}}_{t',n}|s^{\widetilde{\pi}}_{t',n})}.

ŵ_{t,n} only contains the relevant terms up to the t-th step, while ŵ_{T,n} includes all the terms until the end of the episode.
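As a minimal illustration of the difference between the episodic weight w_T and the per-decision weights w_t, the sketch below computes both from a single recorded episode; the two-action policies and the episode itself are made-up placeholders.

```python
import numpy as np

def importance_weights(states, actions, pi, pi_tilde):
    """Return (w_T, [w_1, ..., w_T]) for one episode.

    pi(a, s) and pi_tilde(a, s) must return the action-selection probabilities
    of the target and sampling policies, respectively."""
    ratios = np.array([pi(a, s) / pi_tilde(a, s) for s, a in zip(states, actions)])
    per_decision = np.cumprod(ratios)   # w_t = prod_{t' <= t} pi / pi_tilde
    episodic = per_decision[-1]         # w_T uses the whole episode
    return episodic, per_decision

# Hypothetical example: target policy prefers "R", sampling policy is uniform.
pi = lambda a, s: 0.8 if a == "R" else 0.2
pi_tilde = lambda a, s: 0.5
w_T, w_t = importance_weights([1, 2, 3, 4], ["R", "R", "L", "R"], pi, pi_tilde)
print(w_T, w_t)
```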

4.2.3 Adaptive Per-Decision Importance Weighting

The EIW and PIW estimators are guaranteed to be consistent. However, both are not efficient in the statistical sense (Shimodaira, 2000), i.e., they do not have the smallest admissible variance. For this reason, the PIW estimator can have large variance in finite sample cases and therefore learning with PIW tends to be unstable in practice.

To improve the stability, it is important to control the trade-off between consistency and efficiency (or similarly bias and variance) based on training data. Here, the flattening parameter ν (∈ [0, 1]) is introduced to control the trade-off by slightly "flattening" the importance weights (Shimodaira, 2000; Sugiyama et al., 2007):

    \widehat{G}_{\mathrm{AIW}}(\theta) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \Bigl( \theta^\top \widehat{\psi}(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}; H^{\widetilde{\pi}}) - r(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}, s^{\widetilde{\pi}}_{t+1,n}) \Bigr)^2 (\widehat{w}_{t,n})^{\nu},

where AIW stands for the adaptive per-decision importance weight. When ν = 0, AIW is reduced to NIW and therefore it has large bias but has relatively small variance. On the other hand, when ν = 1, AIW is reduced to PIW. Thus, it has small bias but has relatively large variance. In practice, an intermediate value of ν will yield the best performance.

Let Ψ̂ be the NT × B matrix, Ŵ be the NT × NT diagonal matrix, and r be the NT-dimensional vector defined as

    \widehat{\Psi}_{N(t-1)+n,\, b} = \widehat{\psi}_b(s_{t,n}, a_{t,n}),
    \widehat{W}_{N(t-1)+n,\, N(t-1)+n} = \widehat{w}_{t,n},
    r_{N(t-1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Then, Ĝ_AIW can be compactly expressed as

    \widehat{G}_{\mathrm{AIW}}(\theta) = \frac{1}{NT} (\widehat{\Psi}\theta - r)^\top \widehat{W}^{\nu} (\widehat{\Psi}\theta - r).

Because this is a convex quadratic function with respect to θ, its global minimizer θ̂_AIW can be analytically obtained by setting its derivative to zero as

    \widehat{\theta}_{\mathrm{AIW}} = (\widehat{\Psi}^\top \widehat{W}^{\nu} \widehat{\Psi})^{-1} \widehat{\Psi}^\top \widehat{W}^{\nu} r.

This means that the cost for computing θ̂_AIW is essentially the same as θ̂_NIW, which is given as follows (see Section 2.2.2):

    \widehat{\theta}_{\mathrm{NIW}} = (\widehat{\Psi}^\top \widehat{\Psi})^{-1} \widehat{\Psi}^\top r.
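The closed-form solutions above amount to a weighted least-squares fit, so a minimal sketch (assuming the arrays Psi, w, and r have already been built as defined above) is only a few lines:

```python
import numpy as np

def lspi_aiw_solution(Psi, w, r, nu):
    """Flattened weighted least squares:
    theta = (Psi^T W^nu Psi)^{-1} Psi^T W^nu r, with W = diag(w) the
    per-decision importance weights and nu in [0, 1] the flattening parameter."""
    w_nu = w ** nu
    A = Psi.T @ (w_nu[:, None] * Psi)   # Psi^T W^nu Psi
    b = Psi.T @ (w_nu * r)              # Psi^T W^nu r
    return np.linalg.solve(A, b)

# nu = 0 recovers the NIW solution, nu = 1 the PIW solution.
```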

4.2.4 Illustration

Here, the influence of the flattening parameter ν on the estimator θ̂_AIW is illustrated using the chain-walk MDP illustrated in Figure 4.1.

FIGURE 4.1: Ten-state chain-walk MDP.

The MDP consists of 10 states

    \mathcal{S} = \{ s^{(1)}, \ldots, s^{(10)} \}

and two actions

    \mathcal{A} = \{ a^{(1)}, a^{(2)} \} = \{ \text{"L"}, \text{"R"} \}.

The reward +1 is given when visiting s^(1) and s^(10). The transition probability p is indicated by the numbers attached to the arrows in the figure. For example,

    p(s^{(2)}|s^{(1)}, a = \text{"R"}) = 0.9  and  p(s^{(1)}|s^{(1)}, a = \text{"R"}) = 0.1

mean that the agent can successfully move to the right node with probability 0.9 (indicated by solid arrows in the figure) and the action fails with probability 0.1 (indicated by dashed arrows in the figure). Six Gaussian kernels with standard deviation σ = 10 are used as basis functions, and kernel centers are located at s^(1), s^(5), and s^(10). More specifically, the basis functions φ(s, a) = (φ_1(s, a), ..., φ_6(s, a)) are defined as

    \phi_{3(i-1)+j}(s, a) = I(a = a^{(i)}) \exp\!\left( -\frac{(s - c_j)^2}{2\sigma^2} \right),

for i = 1, 2 and j = 1, 2, 3, where

    c_1 = 1,\; c_2 = 5,\; c_3 = 10,

and

    I(x) = 1 if x is true, 0 if x is not true.

The experiments are repeated 50 times, where the sampling policy π̃(a|s) and the current policy π(a|s) are chosen randomly in each trial such that π̃ ≠ π. The discount factor is set at γ = 0.9. The model parameter θ̂_AIW is learned from the training samples H^π̃ and its generalization error is computed from the test samples H^π.
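As a concrete illustration of these basis functions, a small sketch is given below; encoding the two actions as indices 0 and 1 is an implementation choice, not something specified in the text.

```python
import numpy as np

CENTERS = np.array([1.0, 5.0, 10.0])   # c_1, c_2, c_3
SIGMA = 10.0                           # kernel standard deviation

def phi(s, a):
    """6-dimensional feature vector for the 10-state chain walk.
    a is 0 for action "L" and 1 for action "R"."""
    gauss = np.exp(-(s - CENTERS) ** 2 / (2 * SIGMA ** 2))  # 3 Gaussian kernels
    feat = np.zeros(6)
    feat[3 * a: 3 * a + 3] = gauss   # indicator I(a = a^(i)) selects a block
    return feat

print(phi(s=4, a=1))
```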

The left column of Figure 4.2 depicts the true generalization error G averaged over 50 trials as a function of the flattening parameter ν for N = 10, 30, and 50. Figure 4.2(a) shows that when the number of episodes is large (N = 50), the generalization error tends to decrease as the flattening parameter increases. This would be a natural result due to the consistency of θ̂_AIW when ν = 1. On the other hand, Figure 4.2(b) shows that when the number of episodes is not large (N = 30), ν = 1 performs rather poorly. This implies that the consistent estimator tends to be unstable when the number of episodes is not large enough; ν = 0.7 works the best in this case. Figure 4.2(c) shows the results when the number of episodes is further reduced (N = 10). This illustrates that the consistent estimator with ν = 1 is even worse than the ordinary estimator (ν = 0) because the bias is dominated by large variance. In this case, the best ν is even smaller and is achieved at ν = 0.4.

FIGURE 4.2: Left: True generalization error G averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem. The number of steps is fixed at T = 10. The trend of G differs depending on the number N of episodic samples. Right: Generalization error estimated by 5-fold importance-weighted cross-validation (IWCV) (Ĝ_IWCV) averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem. The number of steps is fixed at T = 10. IWCV nicely captures the trend of the true generalization error G. Panels (a), (b), and (c) correspond to N = 50, 30, and 10.

The above results show that AIW can outperform PIW, particularly when only a small number of training samples are available, provided that the flattening parameter ν is chosen appropriately.

4.3 Automatic Selection of Flattening Parameter

In this section, the problem of selecting the flattening parameter in AIW is addressed.

4.3.1 Importance-Weighted Cross-Validation

Generally, the best ν tends to be large (small) when the number of training samples is large (small). However, this general trend is not sufficient to fine-tune the flattening parameter since the best value of ν depends on training samples, policies, the model of value functions, etc. In this section, we discuss how model selection is performed to choose the best flattening parameter ν automatically from the training data and policies.

Ideally, the value of ν should be set so that the generalization error G is minimized, but the true G is not accessible in practice. To cope with this problem, we can use cross-validation (see Section 2.2.4) for estimating the generalization error G. However, in the off-policy scenario where the sampling policy π̃ and the target policy π are different, ordinary cross-validation gives a biased estimate of G. In the off-policy scenario, importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is more useful, where the cross-validation estimate of the generalization error is obtained with importance weighting.

More specifically, let us divide a training dataset H^π̃ containing N episodes into K subsets {H^π̃_k}_{k=1}^K of approximately the same size. For simplicity, we assume that N is divisible by K. Let θ̂^k_AIW be the parameter learned from H\H_k (i.e., all samples without H_k). Then, the generalization error is estimated with importance weighting as

    \widehat{G}_{\mathrm{IWCV}} = \frac{1}{K} \sum_{k=1}^{K} \widehat{G}^{k}_{\mathrm{IWCV}},

where

    \widehat{G}^{k}_{\mathrm{IWCV}} = \frac{K}{NT} \sum_{h \in H^{\widetilde{\pi}}_k} \sum_{t=1}^{T} \Bigl( (\widehat{\theta}^{k}_{\mathrm{AIW}})^\top \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}}_k) - r(s_t, a_t, s_{t+1}) \Bigr)^2 \widehat{w}_t.

The generalization error estimate Ĝ_IWCV is computed for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter ν) and the one that minimizes the estimated generalization error is chosen:

    \widehat{\nu}_{\mathrm{IWCV}} = \mathop{\mathrm{argmin}}_{\nu} \widehat{G}_{\mathrm{IWCV}}.

FIGURE 4.3: True generalization error G averaged over 50 trials obtained by NIW (ν = 0), PIW (ν = 1), and AIW+IWCV (ν is chosen by IWCV) in the 10-state chain-walk MDP.
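The following sketch outlines how ν̂_IWCV could be selected in practice. It assumes episodes are stored as lists of (s, a, r) transitions with matching per-decision weights, and that helper functions fit_aiw (returning θ̂_AIW from a set of episodes) and psi_hat (returning ψ̂(s, a)) are available from the policy-evaluation step, so it is a sketch of the procedure rather than a complete implementation.

```python
import numpy as np

def select_nu_by_iwcv(episodes, weights, fit_aiw, psi_hat,
                      candidates=np.linspace(0.0, 1.0, 11), K=5):
    """K-fold importance-weighted cross-validation over the flattening parameter.

    episodes[n] is a list of (s, a, r) transitions of episode n, and
    weights[n][t] is the per-decision importance weight w_t of that transition."""
    n_episodes = len(episodes)
    folds = np.array_split(np.arange(n_episodes), K)
    scores = []
    for nu in candidates:
        err = 0.0
        for fold in folds:
            train = [episodes[i] for i in range(n_episodes) if i not in fold]
            theta = fit_aiw(train, nu)          # learn from the other folds
            for i in fold:                      # importance-weighted held-out error
                for (s, a, r), w in zip(episodes[i], weights[i]):
                    err += w * (theta @ psi_hat(s, a) - r) ** 2
        scores.append(err)
    return candidates[int(np.argmin(scores))]
```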

4.3.2 Illustration

To illustrate how IWCV works, let us use the same numerical examples as Section 4.2.4. The right column of Figure 4.2 depicts the generalization error estimated by 5-fold IWCV averaged over 50 trials as a function of the flattening parameter ν. The graphs show that IWCV nicely captures the trend of the true generalization error for all three cases.

Figure 4.3 describes, as a function of the number N of episodes, the average true generalization error obtained by NIW (AIW with ν = 0), PIW (AIW with ν = 1), and AIW+IWCV (ν ∈ {0.0, 0.1, ..., 0.9, 1.0} is selected in each trial using 5-fold IWCV). This result shows that the improvement of the performance by NIW saturates when N ≥ 30, implying that the bias caused by NIW is not negligible. The performance of PIW is worse than NIW when N ≤ 20, which is caused by the large variance of PIW. On the other hand, AIW+IWCV consistently gives good performance for all N, illustrating the strong adaptation ability of AIW+IWCV.

4.4 Sample-Reuse Policy Iteration

In this section, AIW+IWCV is extended from single-step policy evaluation to full policy iteration. This method is called sample-reuse policy iteration (SRPI).

4.4.1 Algorithm

Let us denote the policy at the L-th iteration by π_L. In on-policy policy iteration, new data samples H^{π_L} are collected following the new policy π_L during the policy evaluation step. Thus, previously collected data samples H^{π_1}, ..., H^{π_{L-1}} are not used:

    \pi_1 \xrightarrow{E:\{H^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{H^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{H^{\pi_3}\}} \cdots \xrightarrow{I} \pi_L,

where "E: {H}" indicates the policy evaluation step using the data sample H and "I" indicates the policy improvement step. It would be more cost efficient if all previously collected data samples were reused in policy evaluation:

    \pi_1 \xrightarrow{E:\{H^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{H^{\pi_1},H^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{H^{\pi_1},H^{\pi_2},H^{\pi_3}\}} \cdots \xrightarrow{I} \pi_L.

Since the previous policies and the current policy are different in general, an off-policy scenario needs to be explicitly considered to reuse previously collected data samples. Here, we explain how AIW+IWCV can be used in this situation. For this purpose, the definition of Ĝ_AIW is extended so that multiple sampling policies π_1, ..., π_L are taken into account:

    \widehat{G}^{L}_{\mathrm{AIW}} = \frac{1}{LNT} \sum_{l=1}^{L} \sum_{n=1}^{N} \sum_{t=1}^{T} \Bigl( \theta^\top \widehat{\psi}(s^{\pi_l}_{t,n}, a^{\pi_l}_{t,n}; \{H^{\pi_l}\}_{l=1}^{L}) - r(s^{\pi_l}_{t,n}, a^{\pi_l}_{t,n}, s^{\pi_l}_{t+1,n}) \Bigr)^2 \left( \frac{\prod_{t'=1}^{t} \pi_L(a^{\pi_l}_{t',n} | s^{\pi_l}_{t',n})}{\prod_{t'=1}^{t} \pi_l(a^{\pi_l}_{t',n} | s^{\pi_l}_{t',n})} \right)^{\nu_L},    (4.3)

where Ĝ^L_AIW is the generalization error estimated at the L-th policy evaluation using AIW. The flattening parameter ν_L is chosen based on IWCV before performing policy evaluation.

FIGURE 4.4: The performance of policies learned in three scenarios: ν = 0, ν = 1, and SRPI (ν is chosen by IWCV) in the 10-state chain-walk problem. The performance is measured by the average return computed from test samples over 30 trials. The agent collects training sample H^{π_L} (N = 5 or 10 with T = 10) at every iteration and performs policy evaluation using all collected samples H^{π_1}, ..., H^{π_L}. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. Panels: (a) N = 5, (b) N = 10.

4.4.2 Illustration

Here, the behavior of SRPI is illustrated under the same experimental setup as Section 4.3.2. Let us consider three scenarios: ν is fixed at 0, ν is fixed at 1, and ν is chosen by IWCV (i.e., SRPI). The agent collects samples H^{π_L} in each policy iteration following the current policy π_L and computes θ̂^L_AIW from all collected samples H^{π_1}, ..., H^{π_L} using Eq. (4.3). Three Gaussian kernels are used as basis functions, where kernel centers are randomly selected from the state space S in each trial. The initial policy π_1 is chosen randomly and Gibbs policy improvement,

    \pi(a|s) \longleftarrow \frac{\exp(Q^{\pi}(s,a)/\tau)}{\sum_{a' \in \mathcal{A}} \exp(Q^{\pi}(s,a')/\tau)},    (4.4)

is performed with τ = 2L.
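A minimal sketch of the Gibbs (softmax) policy-improvement step (4.4), assuming the current value estimates are stored in an array Q indexed by state and action, is:

```python
import numpy as np

def gibbs_policy(Q, tau):
    """pi(a|s) proportional to exp(Q(s,a)/tau); rows index states, columns actions."""
    z = (Q - Q.max(axis=1, keepdims=True)) / tau   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Larger tau gives a more exploratory (more uniform) policy;
# smaller tau approaches the greedy policy.
```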

Figure 4.4 depicts the average return over 30 trials when N = 5 and 10 with a fixed number of steps (T = 10). The graphs show that SRPI provides stable and fast learning of policies, while the performance improvement of policies learned with ν = 0 saturates in early iterations. The method with ν = 1 can improve policies well, but its progress tends to be behind SRPI.

Figure 4.5 depicts the average value of the flattening parameter used in SRPI as a function of the total number of episodic samples. The graphs show that the value of the flattening parameter chosen by IWCV tends to rise in the beginning and go down later. At first sight, this does not agree with the general trend of preferring a low-variance estimator in early stages and preferring a low-bias estimator later. However, this result is still consistent with the general trend: when the return increases rapidly (the total number of episodic samples is up to 15 when N = 5 and 30 when N = 10 in Figure 4.5), the value of the flattening parameter increases (see Figure 4.4). After that, the return does not increase anymore (see Figure 4.4) since the policy iteration has already converged. Then, it is natural to prefer a small flattening parameter (Figure 4.5) since the sample selection bias becomes mild after convergence.

FIGURE 4.5: Flattening parameter values used by SRPI averaged over 30 trials as a function of the total number of episodic samples in the 10-state chain-walk problem. Panels: (a) N = 5, (b) N = 10.

These results show that SRPI can effectively reuse previously collected samples by appropriately tuning the flattening parameter according to the condition of data samples, policies, etc.

4.5 Numerical Examples

In this section, the performance of SRPI is numerically investigated in more complex tasks.

4.5.1 Inverted Pendulum

First, we consider the task of the swing-up inverted pendulum illustrated in Figure 4.6, which consists of a pole hinged at the top of a cart. The goal of the task is to swing the pole up by moving the cart. There are three actions: applying positive force +50 [kg·m/s²] to the cart to move right, negative force −50 to move left, and zero force to just coast. That is, the action space A is discrete and described by

    \mathcal{A} = \{ 50, -50, 0 \} \; [\mathrm{kg \cdot m/s^2}].

FIGURE 4.6: Illustration of the inverted pendulum task.

Note that the force itself is not strong enough to swing the pole up. Thus the cart needs to be moved back and forth several times to swing the pole up. The state space S is continuous and consists of the angle ϕ [rad] (∈ [0, 2π]) and the angular velocity ϕ̇ [rad/s] (∈ [−π, π]). Thus, a state s is described by a two-dimensional vector s = (ϕ, ϕ̇)ᵀ. The angle ϕ and angular velocity ϕ̇ are updated as follows:

    \phi_{t+1} = \phi_t + \dot{\phi}_{t+1} \Delta t,
    \dot{\phi}_{t+1} = \dot{\phi}_t + \frac{9.8 \sin(\phi_t) - \alpha w d (\dot{\phi}_t)^2 \sin(2\phi_t)/2 + \alpha \cos(\phi_t) a_t}{4d/3 - \alpha w d \cos^2(\phi_t)} \Delta t,

where α = 1/(W + w) and a_t (∈ A) is the action chosen at time t. The reward function r(s, a, s') is defined as

    r(s, a, s') = \cos(\phi_{s'}),

where ϕ_{s'} denotes the angle ϕ of state s'. The problem parameters are set as follows: the mass of the cart W is 8 [kg], the mass of the pole w is 2 [kg], the length of the pole d is 0.5 [m], and the simulation time step Δt is 0.1 [s].
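The update rule above can be transcribed directly into code; the following is a minimal sketch using the parameter values listed in the text (no clipping of the state or episode handling is included):

```python
import numpy as np

W, w, d, dt = 8.0, 2.0, 0.5, 0.1     # cart mass, pole mass, pole length, time step
alpha = 1.0 / (W + w)

def pendulum_step(phi, phi_dot, a):
    """One Euler step of the swing-up inverted pendulum dynamics."""
    num = (9.8 * np.sin(phi)
           - alpha * w * d * phi_dot ** 2 * np.sin(2 * phi) / 2
           + alpha * np.cos(phi) * a)
    den = 4 * d / 3 - alpha * w * d * np.cos(phi) ** 2
    phi_dot_next = phi_dot + (num / den) * dt
    phi_next = phi + phi_dot_next * dt
    reward = np.cos(phi_next)          # r(s, a, s') = cos(phi_{s'})
    return phi_next, phi_dot_next, reward
```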

Forty-eight Gaussian kernels with standard deviation σ = π are used as basis functions, and kernel centers are located over the following grid points:

    \{0, \tfrac{2}{3}\pi, \tfrac{4}{3}\pi, 2\pi\} \times \{-3\pi, -\pi, \pi, 3\pi\}.

That is, the basis functions φ(s, a) = (φ_1(s, a), ..., φ_48(s, a)) are set as

    \phi_{16(i-1)+j}(s, a) = I(a = a^{(i)}) \exp\!\left( -\frac{\|s - c_j\|^2}{2\sigma^2} \right),

for i = 1, 2, 3 and j = 1, ..., 16, where

    c_1 = (0, -3\pi)^\top,\; c_2 = (0, -\pi)^\top,\; \ldots,\; c_{16} = (2\pi, 3\pi)^\top.

The initial policy π_1(a|s) is chosen randomly, and the initial-state probability density p(s) is set to be uniform. The agent collects data samples H^{π_L} (N = 10 and T = 100) at each policy iteration following the current policy π_L. The discount factor is set at γ = 0.95 and the policy is updated by Gibbs policy improvement (4.4) with τ = L.

FIGURE 4.7: Results of SRPI in the inverted pendulum task. The agent collects training sample H^{π_L} (N = 10 and T = 100) in each iteration and policy evaluation is performed using all collected samples H^{π_1}, ..., H^{π_L}. (a) The performance of policies learned with ν = 0, ν = 1, and SRPI. The performance is measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values chosen by IWCV in SRPI over 20 trials.

Figure 4.7(a) describes the performance of learned policies. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration. On the other hand, the performance when the flattening parameter is fixed at ν = 0 or ν = 1 is not properly improved after the middle of iterations. The average flattening parameter values depicted in Figure 4.7(b) show that the flattening parameter tends to increase quickly in the beginning and then is kept at medium values. Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV and ν = 1 are illustrated in Figure 4.8 and Figure 4.9, respectively.

These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.

FIGURE 4.8: Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV (from left to right and top to bottom).

FIGURE 4.9: Motion examples of the inverted pendulum by SRPI with ν = 1 (from left to right and top to bottom).

4.5.2 Mountain Car

Next, we consider the mountain car task illustrated in Figure 4.10. The task consists of a car and two hills whose landscape is described by sin(3x).

FIGURE 4.10: Illustration of the mountain car task.

The top of the right hill is the goal to which we want to guide the car. There are three actions,

    \{ +0.2, -0.2, 0 \},

which are the values of the force applied to the car. Note that the force of the car is not strong enough to climb up the slope to reach the goal. The state space S is described by the horizontal position x [m] (∈ [−1.2, 0.5]) and the velocity ẋ [m/s] (∈ [−1.5, 1.5]): s = (x, ẋ)ᵀ. The position x and velocity ẋ are updated by

    x_{t+1} = x_t + \dot{x}_{t+1} \Delta t,
    \dot{x}_{t+1} = \dot{x}_t + \left( -9.8 w \cos(3x_t) + \frac{a_t}{w} - k \dot{x}_t \right) \Delta t,

where a_t (∈ A) is the action chosen at the time t. The reward function R(s, a, s') is defined as

    R(s, a, s') = 1 \text{ if } x_{s'} \ge 0.5, \quad -0.01 \text{ otherwise},

where x_{s'} denotes the horizontal position x of state s'. The problem parameters are set as follows: the mass of the car w is 0.2 [kg], the friction coefficient k is 0.3, and the simulation time step Δt is 0.1 [s].
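A corresponding sketch of the mountain-car update (again omitting clipping of the position and velocity to their ranges) is:

```python
import numpy as np

w, k, dt = 0.2, 0.3, 0.1      # car mass, friction coefficient, time step

def mountain_car_step(x, x_dot, a):
    """One Euler step of the mountain-car dynamics with the reward defined above."""
    x_dot_next = x_dot + (-9.8 * w * np.cos(3 * x) + a / w - k * x_dot) * dt
    x_next = x + x_dot_next * dt
    reward = 1.0 if x_next >= 0.5 else -0.01
    return x_next, x_dot_next, reward
```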

The same experimental setup as the swing-up inverted pendulum task in Section 4.5.1 is used, except that the number of Gaussian kernels is 36, the kernel standard deviation is set at σ = 1, and the kernel centers are allocated over the following grid points:

    \{-1.2, -0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}.

Figure 4.11(a) shows the performance of learned policies measured by the average return computed from the test samples. The graph shows similar tendencies to the swing-up inverted pendulum task for SRPI and ν = 1, while the method with ν = 0 performs relatively well this time. This implies that the bias in the previously collected samples does not affect the estimation of the value functions that strongly, because the function approximator is better suited to represent the value function for this problem. The average flattening parameter values (cf. Figure 4.11(b)) show that the flattening parameter decreases soon after the increase in the beginning, and then the smaller values tend to be chosen. This indicates that SRPI tends to use low-variance estimators in this task. Motion examples by SRPI with ν chosen by IWCV are illustrated in Figure 4.12.

FIGURE 4.11: Results of sample-reuse policy iteration in the mountain-car task. The agent collects training sample H^{π_L} (N = 10 and T = 100) at every iteration and policy evaluation is performed using all collected samples H^{π_1}, ..., H^{π_L}. (a) The performance is measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values used by SRPI over 20 trials.

These results show that SRPI can perform stable and fast learning by effectively reusing previously collected data.

FIGURE 4.12: Motion examples of the mountain car by SRPI with ν chosen by IWCV (from left to right and top to bottom).

4.6 Remarks

Instability has been one of the critical limitations of importance-sampling techniques, which often makes off-policy methods impractical. To overcome this weakness, an adaptive importance-sampling technique was introduced for controlling the trade-off between consistency and stability in value function approximation. Furthermore, importance-weighted cross-validation was introduced for automatically choosing the trade-off parameter.

The range of application of importance sampling is not limited to policy iteration. We will explain how importance sampling can be utilized for sample reuse in the policy search frameworks in Chapter 8 and Chapter 9.

Chapter 5

Active Learning in Policy Iteration

In Chapter 4, we considered the off-policy situation where a data-collecting policy and the target policy are different. In the framework of sample-reuse policy iteration, new samples are always chosen following the target policy. However, a clever choice of sampling policies can actually further improve the performance. The topic of choosing sampling policies is called active learning in statistics and machine learning. In this chapter, we address the problem of choosing sampling policies in sample-reuse policy iteration. In Section 5.1, we explain how a statistical active learning method can be employed for optimizing the sampling policy in value function approximation. In Section 5.2, we introduce active policy iteration, which incorporates the active learning idea into the framework of sample-reuse policy iteration. The effectiveness of active policy iteration is numerically investigated in Section 5.3, and finally this chapter is concluded in Section 5.4.

5.1 Efficient Exploration with Active Learning

The accuracy of estimated value functions depends on training samples collected following the sampling policy π̃(a|s). In this section, we explain how a statistical active learning method (Sugiyama, 2006) can be employed for value function approximation.

5.1.1 Problem Setup

Let us consider a situation where collecting state-action trajectory samples is easy and cheap, but gathering immediate reward samples is hard and expensive. For example, consider a robot-arm control task of hitting a ball with a bat and driving the ball as far away as possible (see Figure 5.6). Let us adopt the carry of the ball as the immediate reward. In this setting, obtaining state-action trajectory samples of the robot arm is easy and relatively cheap since we just need to control the robot arm and record its state-action trajectories over time. However, explicitly computing the carry of the ball from the state-action samples is hard due to friction and elasticity of links, air resistance, air currents, and so on. For this reason, in practice, we may have to put the robot in open space, let the robot really hit the ball, and measure the carry of the ball manually. Thus, gathering immediate reward samples is much more expensive than the state-action trajectory samples. In such a situation, immediate reward samples are too expensive to be used for designing the sampling policy. Only state-action trajectory samples may be used for designing sampling policies.

The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. However, since the generalization error is not accessible in practice, it needs to be estimated from samples for performing active learning. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples without using immediate reward samples. This means that standard generalization error estimation techniques such as cross-validation cannot be employed. Below, we explain how the generalization error can be estimated without the reward samples.

5.1.2 Decomposition of Generalization Error

The information we are allowed to use for estimating the generalization error is a set of roll-out samples without immediate rewards:

    H^{\widetilde{\pi}} = \{ h^{\widetilde{\pi}}_1, \ldots, h^{\widetilde{\pi}}_N \},

where each episodic sample h^{π̃}_n is given as

    h^{\widetilde{\pi}}_n = [ s^{\widetilde{\pi}}_{1,n}, a^{\widetilde{\pi}}_{1,n}, \ldots, s^{\widetilde{\pi}}_{T,n}, a^{\widetilde{\pi}}_{T,n}, s^{\widetilde{\pi}}_{T+1,n} ].

Let us define the deviation of an observed immediate reward r^{π̃}_{t,n} from its expectation r(s^{π̃}_{t,n}, a^{π̃}_{t,n}) as

    \epsilon^{\widetilde{\pi}}_{t,n} = r^{\widetilde{\pi}}_{t,n} - r(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}).

Note that ε^{π̃}_{t,n} could be regarded as additive noise in the context of least-squares function fitting. By definition, ε^{π̃}_{t,n} has mean zero and its variance generally depends on s^{π̃}_{t,n} and a^{π̃}_{t,n}, i.e., heteroscedastic noise (Bishop, 2006). However, since estimating the variance of ε^{π̃}_{t,n} without using reward samples is not generally possible, we ignore the dependence of the variance on s^{π̃}_{t,n} and a^{π̃}_{t,n}. Let us denote the input-independent common variance by σ².

We would like to estimate the generalization error,

    G(\widehat{\theta}) = \mathbb{E}_{p^{\pi}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \bigl( \widehat{\theta}^\top \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}}) - r(s_t, a_t) \bigr)^2 \right],

from H^{π̃}. Its expectation over "noise" can be decomposed as follows (Sugiyama, 2006):

    \mathbb{E}_{\epsilon^{\widetilde{\pi}}}\bigl[ G(\widehat{\theta}) \bigr] = \mathrm{Bias} + \mathrm{Variance} + \mathrm{ModelError},

where E_{ε^{π̃}} denotes the expectation over "noise" {ε^{π̃}_{t,n}}_{t=1,n=1}^{T,N}. "Bias," "Variance," and "ModelError" are the bias term, the variance term, and the model error term defined by

    \mathrm{Bias} = \mathbb{E}_{p^{\pi}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \Bigl\{ \bigl( \mathbb{E}_{\epsilon^{\widetilde{\pi}}}[\widehat{\theta}] - \theta^* \bigr)^\top \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}}) \Bigr\}^2 \right],
    \mathrm{Variance} = \mathbb{E}_{\epsilon^{\widetilde{\pi}}} \mathbb{E}_{p^{\pi}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \Bigl\{ \bigl( \widehat{\theta} - \mathbb{E}_{\epsilon^{\widetilde{\pi}}}[\widehat{\theta}] \bigr)^\top \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}}) \Bigr\}^2 \right],
    \mathrm{ModelError} = \mathbb{E}_{p^{\pi}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \bigl( \theta^{*\top} \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}}) - r(s_t, a_t) \bigr)^2 \right].

θ* denotes the optimal parameter in the model:

    \theta^* = \mathop{\mathrm{argmin}}_{\theta} \mathbb{E}_{p^{\pi}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \bigl( \theta^\top \psi(s_t, a_t) - r(s_t, a_t) \bigr)^2 \right].

Note that, for a linear estimator θ̂ such that

    \widehat{\theta} = \widehat{L} r,

where L̂ is some matrix and r is the NT-dimensional vector defined as

    r_{N(t-1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}),

the variance term can be expressed in a compact form as

    \mathrm{Variance} = \sigma^2 \, \mathrm{tr}( U \widehat{L} \widehat{L}^\top ),

where the matrix U is defined as

    U = \mathbb{E}_{p^{\pi}(h)}\!\left[ \frac{1}{T} \sum_{t=1}^{T} \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}}) \widehat{\psi}(s_t, a_t; H^{\widetilde{\pi}})^\top \right].    (5.1)

5.1.3 Estimation of Generalization Error

Since we are interested in finding a minimizer of the generalization error with respect to π̃, the model error, which is constant, can be safely ignored in generalization error estimation. On the other hand, the bias term includes the unknown optimal parameter θ*. Thus, it may not be possible to estimate the bias term without using reward samples. Similarly, it may not be possible to estimate the "noise" variance σ² included in the variance term without using reward samples.

It is known that the bias term is small enough to be neglected when the model is approximately correct (Sugiyama, 2006), i.e., θ*^⊤ψ̂(s, a) approximately agrees with the true function r(s, a). Then we have

    \mathbb{E}_{\epsilon^{\widetilde{\pi}}}\bigl[ G(\widehat{\theta}) \bigr] - \mathrm{ModelError} - \mathrm{Bias} \propto \mathrm{tr}( U \widehat{L} \widehat{L}^\top ),    (5.2)

which does not require immediate reward samples for its computation. Since E_{p^π(h)} included in U is not accessible (see Eq. (5.1)), U is replaced by its consistent estimator Û:

    \widehat{U} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \widehat{\psi}(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}; H^{\widetilde{\pi}}) \widehat{\psi}(s^{\widetilde{\pi}}_{t,n}, a^{\widetilde{\pi}}_{t,n}; H^{\widetilde{\pi}})^\top \widehat{w}_{t,n}.

Consequently, the following generalization error estimator is obtained:

    J = \mathrm{tr}( \widehat{U} \widehat{L} \widehat{L}^\top ),

which can be computed only from H^{π̃} and thus can be employed in the active learning scenarios. If it is possible to gather H^{π̃} multiple times, the above J may be computed multiple times and their average is used as a generalization error estimator.

Note that the values of the generalization error estimator J and the true generalization error G are not directly comparable since irrelevant additive and multiplicative constants are ignored (see Eq. (5.2)). However, this is no problem as long as the estimator J has a similar profile to the true error G as a function of sampling policy π̃ since the purpose of deriving a generalization error estimator in active learning is not to approximate the true generalization error itself, but to approximate the minimizer of the true generalization error with respect to sampling policy π̃.

5.1.4 Designing Sampling Policies

Based on the generalization error estimator derived above, a sampling policy is designed as follows (a compact sketch of steps 3-6 is given after this list):

1. Prepare K candidates of sampling policy: {π̃_k}_{k=1}^K.

2. Collect episodic samples without immediate rewards for each sampling-policy candidate: {H^{π̃_k}}_{k=1}^K.

3. Estimate U using all samples {H^{π̃_k}}_{k=1}^K:

    \widehat{U} = \frac{1}{KNT} \sum_{k=1}^{K} \sum_{n=1}^{N} \sum_{t=1}^{T} \widehat{\psi}(s^{\widetilde{\pi}_k}_{t,n}, a^{\widetilde{\pi}_k}_{t,n}; \{H^{\widetilde{\pi}_k}\}_{k=1}^{K}) \widehat{\psi}(s^{\widetilde{\pi}_k}_{t,n}, a^{\widetilde{\pi}_k}_{t,n}; \{H^{\widetilde{\pi}_k}\}_{k=1}^{K})^\top \widehat{w}^{\widetilde{\pi}_k}_{t,n},

where ŵ^{π̃_k}_{t,n} denotes the importance weight for the k-th sampling policy π̃_k:

    \widehat{w}^{\widetilde{\pi}_k}_{t,n} = \frac{\prod_{t'=1}^{t} \pi(a^{\widetilde{\pi}_k}_{t',n} | s^{\widetilde{\pi}_k}_{t',n})}{\prod_{t'=1}^{t} \widetilde{\pi}_k(a^{\widetilde{\pi}_k}_{t',n} | s^{\widetilde{\pi}_k}_{t',n})}.

4. Estimate the generalization error for each k:

    J_k = \mathrm{tr}\bigl( \widehat{U} \widehat{L}^{\widetilde{\pi}_k} (\widehat{L}^{\widetilde{\pi}_k})^\top \bigr),

where L̂^{π̃_k} is defined as

    \widehat{L}^{\widetilde{\pi}_k} = \bigl( (\widehat{\Psi}^{\widetilde{\pi}_k})^\top \widehat{W}^{\widetilde{\pi}_k} \widehat{\Psi}^{\widetilde{\pi}_k} \bigr)^{-1} (\widehat{\Psi}^{\widetilde{\pi}_k})^\top \widehat{W}^{\widetilde{\pi}_k}.

Ψ̂^{π̃_k} is the NT × B matrix and Ŵ^{π̃_k} is the NT × NT diagonal matrix defined as

    \widehat{\Psi}^{\widetilde{\pi}_k}_{N(t-1)+n,\, b} = \widehat{\psi}_b(s^{\widetilde{\pi}_k}_{t,n}, a^{\widetilde{\pi}_k}_{t,n}),
    \widehat{W}^{\widetilde{\pi}_k}_{N(t-1)+n,\, N(t-1)+n} = \widehat{w}^{\widetilde{\pi}_k}_{t,n}.

5. (If possible) repeat 2 to 4 several times and calculate the average for each k.

6. Determine the sampling policy as

    \widetilde{\pi}_{\mathrm{AL}} = \mathop{\mathrm{argmin}}_{k=1,\ldots,K} J_k.

7. Collect training samples with immediate rewards following π̃_AL.

8. Learn the value function by least-squares policy iteration using the collected samples.
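To make steps 3-6 concrete, the following sketch computes J_k for each candidate and returns the minimizer. It assumes that, for every candidate k, the matrix of ψ̂ features and the per-decision importance weights toward the evaluation policy have already been assembled as in Section 4.2.3; the variable names and the small ridge term are implementation choices (the experiments in Section 5.1.5 add 10^{-3} to the diagonal before inversion).

```python
import numpy as np

def choose_sampling_policy(Psi_list, w_list):
    """Psi_list[k]: (NT x B) matrix of psi_hat features for candidate k.
    w_list[k]:   (NT,) per-decision importance weights toward the evaluation policy."""
    # Step 3: estimate U from all candidates' samples (importance-weighted).
    B = Psi_list[0].shape[1]
    U = np.zeros((B, B))
    total = 0
    for Psi, w in zip(Psi_list, w_list):
        U += (Psi * w[:, None]).T @ Psi
        total += len(w)
    U /= total

    # Steps 4-6: J_k = tr(U L_k L_k^T) with L_k = (Psi^T W Psi)^{-1} Psi^T W.
    scores = []
    for Psi, w in zip(Psi_list, w_list):
        A = Psi.T @ (w[:, None] * Psi) + 1e-3 * np.eye(B)   # small ridge for stability
        L = np.linalg.solve(A, (w[:, None] * Psi).T)
        scores.append(np.trace(U @ L @ L.T))
    return int(np.argmin(scores)), scores
```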

5.1.5 Illustration

Here, the behavior of the active learning method is illustrated on a toy 10-state chain-walk environment shown in Figure 5.1. The MDP consists of 10 states,

    \mathcal{S} = \{ s^{(i)} \}_{i=1}^{10} = \{ 1, 2, \ldots, 10 \},

and 2 actions,

    \mathcal{A} = \{ a^{(i)} \}_{i=1}^{2} = \{ \text{"L"}, \text{"R"} \}.

FIGURE 5.1: Ten-state chain walk. Filled and unfilled arrows indicate the transitions when taking action "R" and "L," and solid and dashed lines indicate the successful and failed transitions.

The immediate reward function is defined as

    r(s, a, s') = f(s'),

where the profile of the function f(s') is illustrated in Figure 5.2.

FIGURE 5.2: Profile of the function f(s').

The transition probability p(s'|s, a) is indicated by the numbers attached to the arrows in Figure 5.1. For example, p(s^(2)|s^(1), a = "R") = 0.8 and p(s^(1)|s^(1), a = "R") = 0.2. Thus, the agent can successfully move to the intended direction with probability 0.8 (indicated by solid-filled arrows in the figure) and the action fails with probability 0.2 (indicated by dashed-filled arrows in the figure). The discount factor γ is set at 0.9. The following 12 Gaussian basis functions φ(s, a) are used:

    \phi_{2(i-1)+j}(s, a) = I(a = a^{(j)}) \exp\!\left( -\frac{(s - c_i)^2}{2\tau^2} \right) \quad \text{for } i = 1, \ldots, 5 \text{ and } j = 1, 2,
    \phi_{2(i-1)+j}(s, a) = I(a = a^{(j)}) \quad \text{for } i = 6 \text{ and } j = 1, 2,

where c_1 = 1, c_2 = 3, c_3 = 5, c_4 = 7, c_5 = 9, and τ = 1.5. I(a = a') denotes the indicator function:

    I(a = a') = 1 \text{ if } a = a', \quad 0 \text{ if } a \ne a'.

Sampling policies and evaluation policies are constructed as follows. First, a deterministic "base" policy π is prepared, for example, "LLLLLRRRRR," where the i-th letter denotes the action taken at s^(i). Let π^ε be the "ε-greedy" version of the base policy π, i.e., the intended action can be successfully chosen with probability 1 − ε/2 and the other action is chosen with probability ε/2. Experiments are performed for three different evaluation policies:

    π1: "RRRRRRRRRR,"
    π2: "RRLLLLLRRR,"
    π3: "LLLLLRRRRR,"

with ε = 0.1. For each evaluation policy π_i^{0.1} (i = 1, 2, 3), 10 candidates of the sampling policy {π̃_i^{(k)}}_{k=1}^{10} are prepared, where π̃_i^{(k)} = π_i^{k/10}. Note that π̃_i^{(1)} is equivalent to the evaluation policy π_i^{0.1}.

For each sampling policy, the active learning criterion J is computed 5 times and their average is taken. The numbers of episodes and steps are set at N = 10 and T = 10, respectively. The initial-state probability p(s) is set to be uniform. When the matrix inverse is computed, 10^{-3} is added to diagonal elements to avoid degeneracy. This experiment is repeated 100 times with different random seeds and the mean and standard deviation of the true generalization error and its estimate are evaluated.

The results are depicted in Figure 5.3 as functions of the index k of the sampling policies. The graphs show that the generalization error estimator overall captures the trend of the true generalization error well for all three cases.

Next, the values of the obtained generalization error G are evaluated when k is chosen so that J is minimized (active learning, AL), the evaluation policy (k = 1) is used for sampling (passive learning, PL), and k is chosen optimally so that the true generalization error is minimized (optimal, OPT). Figure 5.4 shows that the active learning method compares favorably with passive learning and performs well for reducing the generalization error.

5.2 Active Policy Iteration

In Section 5.1, the unknown generalization error was shown to be accurately estimated without using immediate reward samples in one-step policy evaluation. In this section, this one-step active learning idea is extended to the framework of sample-reuse policy iteration introduced in Chapter 4, which is called active policy iteration. Let us denote the evaluation policy at the L-th iteration by π_L.

FIGURE 5.3: The mean and standard deviation of the true generalization error G (left) and the estimated generalization error J (right) over 100 trials. Panels (a), (b), and (c) correspond to the evaluation policies π_1^{0.1}, π_2^{0.1}, and π_3^{0.1}, respectively.

5.2.1 Sample-Reuse Policy Iteration with Active Learning

In the original sample-reuse policy iteration, new data samples H^{π_l} are collected following the new target policy π_l for the next policy evaluation step:

    \pi_1 \xrightarrow{E:\{H^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{H^{\pi_1},H^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{H^{\pi_1},H^{\pi_2},H^{\pi_3}\}} \cdots \xrightarrow{I} \pi_{L+1},

FIGURE 5.4: The box-plots of the values of the obtained generalization error G over 100 trials when k is chosen so that J is minimized (active learning, AL), the evaluation policy (k = 1) is used for sampling (passive learning, PL), and k is chosen optimally so that the true generalization error is minimized (optimal, OPT). The box-plot notation indicates the 5% quantile, 25% quantile, 50% quantile (i.e., median), 75% quantile, and 95% quantile from bottom to top.

where "E: {H}" indicates policy evaluation using the data sample H and "I" denotes policy improvement. On the other hand, in active policy iteration, the optimized sampling policy π̃_l is used at each iteration:

    \pi_1 \xrightarrow{E:\{H^{\widetilde{\pi}_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{H^{\widetilde{\pi}_1},H^{\widetilde{\pi}_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{H^{\widetilde{\pi}_1},H^{\widetilde{\pi}_2},H^{\widetilde{\pi}_3}\}} \cdots \xrightarrow{I} \pi_{L+1}.

Note that, in active policy iteration, the previously collected samples are used not only for value function approximation, but also for active learning. Thus, active policy iteration makes full use of the samples.

5.2.2 Illustration

Here, the behavior of active policy iteration is illustrated using the same 10-state chain-walk problem as Section 5.1.5 (see Figure 5.1).

The initial evaluation policy π_1 is set as

    \pi_1(a|s) = 0.15\, p_u(a) + 0.85\, I\bigl( a = \mathop{\mathrm{argmax}}_{a'} \widehat{Q}_0(s, a') \bigr),

where p_u(a) denotes the probability mass function of the uniform distribution and

    \widehat{Q}_0(s, a) = \sum_{b=1}^{12} \phi_b(s, a).

Policies are updated in the l-th iteration using the ε-greedy rule with ε = 0.15/l. In the sampling-policy selection step of the l-th iteration, the following four sampling-policy candidates are prepared:

    \widetilde{\pi}^{(1)}_l = \pi_l^{0.15/l},\quad \widetilde{\pi}^{(2)}_l = \pi_l^{0.15/l+0.15},\quad \widetilde{\pi}^{(3)}_l = \pi_l^{0.15/l+0.5},\quad \widetilde{\pi}^{(4)}_l = \pi_l^{0.15/l+0.85},

where π_l denotes the policy obtained by greedy update using Q̂^{π_{l-1}}.

The number of iterations to learn the policy is set at 7, the number of steps is set at T = 10, and the number N of episodes is different in each iteration and defined as {N_1, ..., N_7}, where N_l (l = 1, ..., 7) denotes the number of episodes collected in the l-th iteration. In this experiment, two types of scheduling are compared: {5, 5, 3, 3, 3, 1, 1} and {3, 3, 3, 3, 3, 3, 3}, which are referred to as the "decreasing N" strategy and the "fixed N" strategy, respectively. The J-value calculation is repeated 5 times for active learning. The performance of the finally obtained policy π_8 is measured by the return for test samples {r^{π_8}_{t,n}}_{t,n=1}^{T,N} (50 episodes with 50 steps collected following π_8):

    \mathrm{Performance} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r^{\pi_8}_{t,n},

where the discount factor γ is set at 0.9.

The performance of passive learning (PL; the current policy is used as the sampling policy in each iteration) and active learning (AL; the best sampling policy is chosen from the policy candidates prepared in each iteration) is compared. The experiments are repeated 1000 times with different random seeds and the average performance of PL and AL is evaluated. The results are depicted in Figure 5.5, showing that AL works better than PL in both types of episode scheduling with statistical significance by the t-test at the significance level 1% (Henkel, 1976) for the error values obtained after the 7th iteration. Furthermore, the "decreasing N" strategy outperforms the "fixed N" strategy for both PL and AL, showing the usefulness of the "decreasing N" strategy.

FIGURE 5.5: The mean performance over 1000 trials in the 10-state chain-walk experiment. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For both the "decreasing N" and "fixed N" strategies, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% applied to the error values at the 7th iteration.

5.3 Numerical Examples

In this section, the performance of active policy iteration is evaluated using a ball-batting robot illustrated in Figure 5.6, which consists of two links and two joints. The goal of the ball-batting task is to control the robot arm so that it drives the ball as far away as possible. The state space S is continuous and consists of angles ϕ_1 [rad] (∈ [0, π/4]) and ϕ_2 [rad] (∈ [−π/4, π/4]) and angular velocities ϕ̇_1 [rad/s] and ϕ̇_2 [rad/s]. Thus, a state s (∈ S) is described by a 4-dimensional vector s = (ϕ_1, ϕ̇_1, ϕ_2, ϕ̇_2)ᵀ. The action space A is discrete and contains two elements:

    \mathcal{A} = \{ a^{(i)} \}_{i=1}^{2} = \{ (50, -35)^\top, (-50, 10)^\top \},

where the i-th element (i = 1, 2) of each vector corresponds to the torque [N·m] added to joint i.

FIGURE 5.6: A ball-batting robot.

The Open Dynamics Engine (http://ode.org/) is used for physical calculations including the update of the angles and angular velocities, and collision detection between the robot arm, ball, and pin. The simulation time step is set at 7.5 [ms] and the next state is observed after 10 time steps. The action chosen in the current state is taken for 10 time steps. To make the experiments realistic, noise is added to actions: if action (f_1, f_2)ᵀ is taken, the actual torques applied to the joints are f_1 + ε_1 and f_2 + ε_2, where ε_1 and ε_2 are drawn independently from the Gaussian distribution with mean 0 and variance 3.

The immediate reward is defined as the carry of the ball. This reward is given only when the robot arm collides with the ball for the first time at state s' after taking action a at current state s. For value function approximation, the following 110 basis functions are used:

    \phi_{2(i-1)+j}(s, a) = I(a = a^{(j)}) \exp\!\left( -\frac{\|s - c_i\|^2}{2\tau^2} \right) \quad \text{for } i = 1, \ldots, 54 \text{ and } j = 1, 2,
    \phi_{2(i-1)+j}(s, a) = I(a = a^{(j)}) \quad \text{for } i = 55 \text{ and } j = 1, 2,

where τ is set at 3π/2 and the Gaussian centers c_i (i = 1, ..., 54) are located on the regular grid

    \{0, \pi/4\} \times \{-\pi, 0, \pi\} \times \{-\pi/4, 0, \pi/4\} \times \{-\pi, 0, \pi\}.

For L = 7 and T = 10, the "decreasing N" strategy and the "fixed N" strategy are compared. The "decreasing N" strategy is defined by {10, 10, 7, 7, 7, 4, 4} and the "fixed N" strategy is defined by {7, 7, 7, 7, 7, 7, 7}. The initial state is always set at s = (π/4, 0, 0, 0)ᵀ, and J-calculations are repeated 5 times in the active learning method. The initial evaluation policy π_1 is set at the ε-greedy policy defined as

    \pi_1(a|s) = 0.15\, p_u(a) + 0.85\, I\bigl( a = \mathop{\mathrm{argmax}}_{a'} \widehat{Q}_0(s, a') \bigr),
    \widehat{Q}_0(s, a) = \sum_{b=1}^{110} \phi_b(s, a).

Policies are updated in the l-th iteration using the ε-greedy rule with ε = 0.15/l. Sampling-policy candidates are prepared in the same way as the chain-walk experiment in Section 5.2.2.

The discount factor γ is set at 1 and the performance of the learned policy π_8 is measured by the return for test samples {r^{π_8}_{t,n}}_{t,n=1}^{10,20} (20 episodes with 10 steps collected following π_8):

    \sum_{n=1}^{N} \sum_{t=1}^{T} r^{\pi_8}_{t,n}.

The experiment is repeated 500 times with different random seeds and the average performance of each learning method is evaluated. The results, depicted in Figure 5.7, show that active learning outperforms passive learning. For the "decreasing N" strategy, the performance difference is statistically significant by the t-test at the significance level 1% for the error values after the 7th iteration.

FIGURE 5.7: The mean performance over 500 trials in the ball-batting experiment. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For the "decreasing N" strategy, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% for the error values at the 7th iteration.

Motion examples of the ball-batting robot trained with active learning and passive learning are illustrated in Figure 5.8 and Figure 5.9, respectively.

5.4 Remarks

When we cannot afford to collect many training samples due to high sampling costs, it is crucial to choose the most informative samples for efficiently learning the value function. In this chapter, an active learning method for optimizing data sampling strategies was introduced in the framework of sample-reuse policy iteration, and the resulting active policy iteration was demonstrated to be promising.


FIGURE 5.8: A motion example of the ball-batting robot trained with active learning (from left to right and top to bottom).

FIGURE 5.9: A motion example of the ball-batting robot trained with passive learning (from left to right and top to bottom).

Chapter 6

Robust Policy Iteration

The framework of least-squares policy iteration (LSPI) introduced in Chapter 2 is useful, thanks to its computational efficiency and analytical tractability. However, due to the squared loss, it tends to be sensitive to outliers in observed rewards. In this chapter, we introduce an alternative policy iteration method that employs the absolute loss for enhancing robustness and reliability. In Section 6.1, the robustness and reliability brought by the use of the absolute loss are discussed. In Section 6.2, the policy iteration framework with the absolute loss, called least-absolute policy iteration (LAPI), is introduced. In Section 6.3, the usefulness of LAPI is illustrated through experiments. Variations of LAPI are considered in Section 6.4, and finally this chapter is concluded in Section 6.5.

6.1 Robustness and Reliability in Policy Iteration

The basic idea of LSPI is to fit a linear model to immediate rewards under the squared loss, while the absolute loss is used in this chapter (see Figure 6.1). This is just a replacement of loss functions, but this modification highly enhances robustness and reliability.

6.1.1 Robustness

In many robotics applications, immediate rewards are obtained through measurement such as distance sensors or computer vision. Due to intrinsic measurement noise or recognition error, the obtained rewards often deviate from the true value. In particular, the rewards occasionally contain outliers, which are significantly different from regular values.

Residual minimization under the squared loss amounts to obtaining the mean of samples {x_i}_{i=1}^m:

    \mathop{\mathrm{argmin}}_{c} \sum_{i=1}^{m} (x_i - c)^2 = \mathrm{mean}(\{x_i\}_{i=1}^{m}) = \frac{1}{m} \sum_{i=1}^{m} x_i.

If one of the values is an outlier having a very large or small value, the mean would be strongly affected by this outlier. This means that all the values {x_i}_{i=1}^m are responsible for the mean, and therefore even a single outlier observation can significantly damage the learned result.

FIGURE 6.1: The absolute and squared loss functions for reducing the temporal-difference error.

On the other hand, residual minimization under the absolute loss amounts to obtaining the median:

    \mathop{\mathrm{argmin}}_{c} \sum_{i=1}^{2n+1} |x_i - c| = \mathrm{median}(\{x_i\}_{i=1}^{2n+1}) = x_{n+1},

where x_1 ≤ x_2 ≤ ··· ≤ x_{2n+1}. The median is influenced not by the magnitude of the values {x_i}_{i=1}^{2n+1} but only by their order. Thus, as long as the order is kept unchanged, the median is not affected by outliers. In fact, the median is known to be the most robust estimator in light of breakdown-point analysis (Huber, 1981; Rousseeuw & Leroy, 1987).

Therefore, the use of the absolute loss would remedy the problem of robustness in policy iteration.
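The contrast between the two estimators is easy to see numerically; in the toy sketch below (an illustration only), a single outlier drags the mean far away while the median barely moves:

```python
import numpy as np

rewards = np.array([0.1, 0.2, 0.15, 0.12, 0.18])
print(np.mean(rewards), np.median(rewards))         # 0.15  0.15

rewards_with_outlier = np.append(rewards, 20000.0)  # one corrupted observation
print(np.mean(rewards_with_outlier))                # about 3333.5
print(np.median(rewards_with_outlier))              # 0.165
```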

6.1.2 Reliability

In practical robot-control tasks, we often want to attain a stable performance, rather than to achieve a "dream" performance with little chance of success. For example, in the acquisition of a humanoid gait, we may want the robot to walk forward in a stable manner with high probability of success, rather than to rush very fast in a chance level.

On the other hand, we do not want to be too conservative when training robots. If we are overly concerned with unrealistic failure, no practically useful control policy can be obtained. For example, any robots can be broken in principle if they are activated for a long time. However, if we fear this fact too much, we may end up in praising a control policy that does not move the robots at all, which is obviously nonsense.

Since the squared-loss solution is not robust against outliers, it is sensitive to rare events with either positive or negative very large immediate rewards. Consequently, the squared loss prefers an extraordinarily successful motion even if the success probability is very low. Similarly, it dislikes an unrealistic trouble even if such a terrible event may not happen in reality. On the other hand, the absolute-loss solution is not easily affected by such rare events due to its robustness. Therefore, the use of the absolute loss would produce a reliable control policy even in the presence of such extreme events.

6.2 Least Absolute Policy Iteration

In this section, a policy iteration method with the absolute loss is introduced.

6.2.1 Algorithm

Instead of the squared loss, a linear model is fitted to immediate rewards under the absolute loss as

  min_θ Σ_{t=1}^T | θ^⊤ \hat{ψ}(s_t, a_t) − r_t |.

This minimization problem looks cumbersome due to the absolute value operator, which is non-differentiable, but it can be reduced to the following linear program (Boyd & Vandenberghe, 2004):

  min_{θ, {b_t}_{t=1}^T} Σ_{t=1}^T b_t
  subject to  −b_t ≤ θ^⊤ \hat{ψ}(s_t, a_t) − r_t ≤ b_t,  t = 1, ..., T.

The number of constraints is T in the above linear program. When T is large, we may employ sophisticated optimization techniques such as column generation (Demiriz et al., 2002) for efficiently solving the linear programming problem. Alternatively, an approximate solution can be obtained by gradient descent or (quasi-)Newton methods if the absolute loss is approximated by a smooth loss (see, e.g., Section 6.4.1).

The policy iteration method based on the absolute loss is called least absolute policy iteration (LAPI).
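As a concrete illustration, the above linear program can be handed to any off-the-shelf LP solver. The following sketch uses scipy.optimize.linprog; the feature matrix Psi (whose t-th row is \hat{ψ}(s_t, a_t)^⊤) and the reward vector r are assumed to be given, and the function name is only for illustration:

import numpy as np
from scipy.optimize import linprog

def lapi_fit(Psi, r):
    """Least absolute fit: min_theta sum_t |theta^T psi_t - r_t| as a linear program."""
    T, B = Psi.shape
    # Decision variables x = [theta (B entries), b (T entries)]; objective = sum of b_t.
    c = np.concatenate([np.zeros(B), np.ones(T)])
    # Constraints:  Psi theta - b <= r   and   -Psi theta - b <= -r
    A_ub = np.block([[Psi, -np.eye(T)], [-Psi, -np.eye(T)]])
    b_ub = np.concatenate([r, -r])
    bounds = [(None, None)] * B + [(0, None)] * T
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:B]   # the estimated parameter theta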

6.2.2 Illustration

For illustration purposes, let us consider the 4-state MDP problem described in Figure 6.2. The agent is initially located at state s^(0), and the actions the agent is allowed to take are moving to the left or right state. If the left-movement action is chosen, the agent always receives a small positive reward +0.1 at s^(L). On the other hand, if the right-movement action is chosen, the agent receives a negative reward −1 with probability 0.9999 at s^(R1), or it receives a very large positive reward +20,000 with probability 0.0001 at s^(R2). The mean and median rewards for left movement are both +0.1, while the mean and median rewards for right movement are +1.0001 and −1, respectively.

FIGURE 6.2: Illustrative MDP problem.

If Q(s^(0), "Left") and Q(s^(0), "Right") are approximated by the least-squares method, it returns the mean rewards, i.e., +0.1 and +1.0001, respectively. Thus, the least-squares method prefers right movement, which is a "gambling" policy: the negative reward −1 is almost always obtained at s^(R1), but it is possible to obtain the very high reward +20,000 with a very small probability at s^(R2). On the other hand, if Q(s^(0), "Left") and Q(s^(0), "Right") are approximated by the least absolute method, it returns the median rewards, i.e., +0.1 and −1, respectively. Thus, the least absolute method prefers left movement, which is a stable policy: the agent can always receive the small positive reward +0.1 at s^(L).

If all the rewards in Figure 6.2 are negated, the value functions are also negated and a different interpretation can be obtained: the least-squares method is afraid of the risk of receiving the very large negative reward −20,000 at s^(R2) with a very low probability, and consequently it ends up with a very conservative policy in which the agent always receives the negative reward −0.1 at s^(L). On the other hand, the least absolute method tries to receive the positive reward +1 at s^(R1) without being afraid of visiting s^(R2) too much.

As illustrated above, the least absolute method tends to provide qualitatively different solutions from the least-squares method.
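The gap between the two criteria is easy to check numerically. The following sketch samples the right-movement reward according to the probabilities in Figure 6.2 and compares its mean and median:

import numpy as np

rng = np.random.default_rng(0)
# Right movement: -1 with probability 0.9999, +20000 with probability 0.0001
rewards = rng.choice([-1.0, 20000.0], size=1_000_000, p=[0.9999, 0.0001])
print(rewards.mean())      # close to +1.0001, the value preferred by least squares
print(np.median(rewards))  # -1, the value preferred by the least absolute method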

6.2.3 Properties

Here, properties of the least absolute method are investigated when the model \hat{Q}(s,a) is correctly specified, i.e., there exists a parameter θ* such that \hat{Q}(s,a) = Q(s,a) for all s and a.

Under the correct model assumption, when the number of samples T tends to infinity, the least absolute solution \hat{θ} would satisfy the following equation (Koenker, 2005):

  \hat{θ}^⊤ ψ(s,a) = M_{p(s'|s,a)}[ r(s,a,s') ]  for all s and a,    (6.1)

where M_{p(s'|s,a)} denotes the conditional median of s' over p(s'|s,a) given s and a. ψ(s,a) is defined by

  ψ(s,a) = φ(s,a) − γ E_{p(s'|s,a)} E_{π(a'|s')}[ φ(s',a') ],

where E_{p(s'|s,a)} denotes the conditional expectation of s' over p(s'|s,a) given s and a, and E_{π(a'|s')} denotes the conditional expectation of a' over π(a'|s') given s'.

From Eq. (6.1), we can obtain the following Bellman-like recursive expression:

  \hat{Q}(s,a) = M_{p(s'|s,a)}[ r(s,a,s') ] + γ E_{p(s'|s,a)} E_{π(a'|s')}[ \hat{Q}(s',a') ].    (6.2)

Note that in the case of the least-squares method, where

  \hat{θ}^⊤ ψ(s,a) = E_{p(s'|s,a)}[ r(s,a,s') ]

is satisfied in the limit under the correct model assumption, we have

  \hat{Q}(s,a) = E_{p(s'|s,a)}[ r(s,a,s') ] + γ E_{p(s'|s,a)} E_{π(a'|s')}[ \hat{Q}(s',a') ].    (6.3)

This is the ordinary Bellman equation, and thus Eq. (6.2) can be regarded as an extension of the Bellman equation to the absolute loss.

From the ordinary Bellman equation (6.3), we can recover the original definition of the state-action value function Q(s,a):

  Q^π(s,a) = E_{p^π(h)}[ Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}) | s_1 = s, a_1 = a ],

where E_{p^π(h)} denotes the expectation over trajectory h = [s_1, a_1, ..., s_T, a_T, s_{T+1}], and "s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s and a, respectively. In contrast, from the absolute-loss Bellman equation (6.2), we have

  Q'(s,a) = E_{p^π(h)}[ Σ_{t=1}^T γ^{t−1} M_{p(s_{t+1}|s_t,a_t)}[ r(s_t, a_t, s_{t+1}) ] | s_1 = s, a_1 = a ].

This is the value function that the least absolute method is trying to approximate, which is different from the ordinary value function. Since the discounted sum of median rewards, not the expected rewards, is maximized, the least absolute method is expected to be less sensitive to outliers than the least-squares method.

FIGURE 6.3: Illustration of the acrobot (bar, first joint, first link, second joint, second link, and end effector). The goal is to swing up the end effector by only controlling the second joint.

6.3 Numerical Examples

In this section, the behavior of LAPI is illustrated through experiments using the acrobot shown in Figure 6.3. The acrobot is an under-actuated system and consists of two links, two joints, and an end effector. The length of each link is 0.3 [m], and the diameter of each joint is 0.15 [m]. The diameter of the end effector is 0.10 [m], and the height of the horizontal bar is 1.2 [m]. The first joint connects the first link to the horizontal bar and is not controllable. The second joint connects the first link to the second link and is controllable. The end effector is attached to the tip of the second link. The control command (action) we can choose is to apply positive torque +50 [N·m], no torque 0 [N·m], or negative torque −50 [N·m] to the second joint. Note that the acrobot moves only within a plane orthogonal to the horizontal bar.

The goal is to acquire a control policy such that the end effector is swung up as high as possible. The state space consists of the angle θ_i [rad] and the angular velocity \dot{θ}_i [rad/s] of the first and second joints (i = 1, 2). The immediate reward is given according to the height y of the center of the end effector as

  r(s, a, s') = 10                                  if y > 1.75,
              = exp( −(y − 1.85)² / (2 (0.2)²) )     if 1.5 < y ≤ 1.75,
              = 0.001                                otherwise.

Note that 0.55 ≤ y ≤ 1.85 in the current setting.
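For reference, this reward function can be written down directly. The sketch below (Python; the height y is assumed to be given, and the function name is only for illustration) mirrors the three cases above:

import numpy as np

def acrobot_reward(y):
    """Immediate reward based on the height y of the center of the end effector."""
    if y > 1.75:
        return 10.0
    elif y > 1.5:
        return np.exp(-(y - 1.85)**2 / (2 * 0.2**2))
    else:
        return 0.001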

Here, suppose that the length of the links is unknown. Thus, the height y cannot be directly computed from state information. The height of the end effector is supposed to be estimated from an image taken by a camera: the end effector is detected in the image and then its vertical coordinate is computed. Due to recognition errors, the estimated height is highly noisy and could contain outliers.

In each policy iteration step, 20 episodic training samples of length 150 are gathered. The performance of the obtained policy is evaluated using 50 episodic test samples of length 300. Note that the test samples are not used for learning policies; they are used only for evaluating learned policies. The policies are updated in a soft-max manner:

  π(a|s) ← exp( Q(s,a)/η ) / Σ_{a'∈A} exp( Q(s,a')/η ),

where η = 10^{−l+1} with l being the iteration number. The discount factor is set at γ = 1, i.e., no discount. As basis functions for value function approximation, the Gaussian kernel with standard deviation π is used, where the Gaussian centers are located at

  (θ_1, θ_2, \dot{θ}_1, \dot{θ}_2) ∈ {−π, −π/2, 0, π/2, π} × {−π, 0, π} × {−π, 0, π} × {−π, 0, π}.

The above 135 (= 5 × 3 × 3 × 3) Gaussian kernels are defined for each of the three actions. Thus, 405 (= 135 × 3) kernels are used in total.
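A minimal sketch of this soft-max policy update is given below (Python; q_values is assumed to hold the approximated values \hat{Q}(s,a) of the three actions at a given state):

import numpy as np

def softmax_policy(q_values, l):
    """Soft-max action probabilities with temperature eta = 10^(-l+1) at iteration l."""
    eta = 10.0 ** (-l + 1)
    z = q_values / eta
    z = z - z.max()            # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: action-selection probabilities at some state in the 3rd iteration
probs = softmax_policy(np.array([0.2, 0.5, 0.1]), l=3)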

Let us consider two noise environments: one is the case where no noise is added to the rewards, and the other is the case where Laplacian noise with mean zero and standard deviation 2 is added to the rewards with probability 0.1. Note that the tail of the Laplacian density is heavier than that of the Gaussian density (see Figure 6.4), implying that a small number of outliers tend to be included in the Laplacian noise environment. An example of the noisy training samples is shown in Figure 6.5. For each noise environment, the experiment is repeated 50 times with different random seeds, and the averages of the sum of rewards obtained by LAPI and LSPI are summarized in Figure 6.6, where the best method in terms of the mean value and methods comparable to it according to the t-test (Henkel, 1976) at the significance level 5% are marked.

In the noiseless case (see Figure 6.6(a)), both LAPI and LSPI improve the performance over iterations in a comparable way. On the other hand, in the noisy case (see Figure 6.6(b)), the performance of LSPI is not improved much due to outliers, while LAPI still produces a good control policy.

FIGURE 6.4: Probability density functions of the Gaussian and Laplacian distributions.

FIGURE 6.5: Example of training samples with Laplacian noise. The horizontal axis is the height of the end effector. The solid line denotes the noiseless immediate reward, and the markers denote noisy training samples.

FIGURE 6.6: Average and standard deviation of the sum of rewards over 50 runs for the acrobot swing-up simulation: (a) no noise and (b) Laplacian noise. The best method in terms of the mean value and comparable methods according to the t-test at the significance level 5% are marked.

Figure 6.7 and Figure 6.8 depict motion examples of the acrobot learned by LSPI and LAPI in the Laplacian-noise environment. When LSPI is used (Figure 6.7), the second joint is swung hard in order to lift the end effector. However, the end effector tends to stay below the horizontal bar, and therefore only a small amount of reward can be obtained by LSPI. This would be due to the existence of outliers. On the other hand, when LAPI is used (Figure 6.8), the end effector goes beyond the bar, and therefore a large amount of reward can be obtained even in the presence of outliers.


FIGURE 6.7: A motion example of the acrobot learned by LSPI in the Laplacian-noise environment (from left to right and top to bottom).

FIGURE 6.8: A motion example of the acrobot learned by LAPI in the Laplacian-noise environment (from left to right and top to bottom).


6.4 Possible Extensions

In this section, possible variations of LAPI are considered.

6.4.1 Huber Loss

Use of the Huber loss corresponds to making a compromise between the squared and absolute loss functions (Huber, 1981):

  argmin_θ Σ_{t=1}^T ρ^{HB}_κ( θ^⊤ \hat{ψ}(s_t, a_t) − r_t ),

where κ (≥ 0) is a threshold parameter and ρ^{HB}_κ is the Huber loss defined as follows (see Figure 6.9):

  ρ^{HB}_κ(x) = (1/2) x²              if |x| ≤ κ,
              = κ|x| − (1/2) κ²       if |x| > κ.

The Huber loss converges to the absolute loss as κ tends to zero, and it converges to the squared loss as κ tends to infinity.

FIGURE 6.9: The Huber loss function (with κ = 1), the pinball loss function (with τ = 0.3), and the deadzone-linear loss function (with ε = 1).

The Huber loss function is rather intricate, but the solution can be obtained by solving the following convex quadratic program (Mangasarian & Musicant, 2000):

  min_{θ, {b_t, c_t}_{t=1}^T}  (1/2) Σ_{t=1}^T b_t² + κ Σ_{t=1}^T c_t
  subject to  −c_t ≤ θ^⊤ \hat{ψ}(s_t, a_t) − r_t − b_t ≤ c_t,  t = 1, ..., T.

Another way to obtain the solution is to use a gradient descent method, where the parameter θ is updated as follows until convergence:

  θ ← θ − ε Σ_{t=1}^T ∆ρ^{HB}_κ( θ^⊤ \hat{ψ}(s_t, a_t) − r_t ) \hat{ψ}(s_t, a_t).

Here, ε (> 0) is the learning rate and ∆ρ^{HB}_κ is the derivative of ρ^{HB}_κ, given by

  ∆ρ^{HB}_κ(x) = x       if |x| ≤ κ,
               = κ       if x > κ,
               = −κ      if x < −κ.

In practice, the following stochastic gradient method (Amari, 1967) would be more convenient: for a randomly chosen index t ∈ {1, ..., T} in each iteration, repeat the following update until convergence:

  θ ← θ − ε ∆ρ^{HB}_κ( θ^⊤ \hat{ψ}(s_t, a_t) − r_t ) \hat{ψ}(s_t, a_t).

The plain/stochastic gradient methods also come in handy when approximating the least absolute solution, since the Huber loss function with small κ can be regarded as a smooth approximation to the absolute loss.
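A minimal sketch of the stochastic gradient update with the Huber loss is given below (Python; Psi and r are the feature matrix and reward vector as before, and the step size, threshold, and number of passes are illustrative choices, not values from the text):

import numpy as np

def huber_derivative(x, kappa):
    # x if |x| <= kappa, kappa if x > kappa, -kappa if x < -kappa
    return np.clip(x, -kappa, kappa)

def huber_fit_sgd(Psi, r, kappa=0.1, eps=1e-3, n_passes=100, seed=0):
    rng = np.random.default_rng(seed)
    T, B = Psi.shape
    theta = np.zeros(B)
    for _ in range(n_passes):
        for t in rng.permutation(T):            # randomly chosen indices
            residual = Psi[t] @ theta - r[t]
            theta -= eps * huber_derivative(residual, kappa) * Psi[t]
    return theta

With a small kappa, the same routine serves as a smooth approximation to the least absolute solution, as noted above.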

6.4.2 Pinball Loss

The absolute loss induces the median, which corresponds to the 50-percentile point. A similar discussion is also possible for an arbitrary percentile 100τ (0 ≤ τ ≤ 1) based on the pinball loss (Koenker, 2005):

  min_θ Σ_{t=1}^T ρ^{PB}_τ( θ^⊤ \hat{ψ}(s_t, a_t) − r_t ),

where ρ^{PB}_τ(x) is the pinball loss defined by

  ρ^{PB}_τ(x) = 2τx            if x ≥ 0,
              = 2(τ − 1)x      if x < 0.

The profile of the pinball loss is depicted in Figure 6.9. When τ = 0.5, the pinball loss is reduced to the absolute loss.

The solution can be obtained by solving the following linear program:

  min_{θ, {b_t}_{t=1}^T} Σ_{t=1}^T b_t
  subject to  b_t / (2(τ − 1)) ≤ θ^⊤ \hat{ψ}(s_t, a_t) − r_t ≤ b_t / (2τ),  t = 1, ..., T.
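As a quick sanity check, minimizing the pinball loss over a single constant recovers the 100τ percentile of the samples. The sketch below verifies this numerically on a one-dimensional example (the grid search is only for illustration):

import numpy as np

def pinball_loss(x, tau):
    return np.where(x >= 0, 2 * tau * x, 2 * (tau - 1) * x)

rng = np.random.default_rng(0)
r = rng.normal(size=1000)
tau = 0.3
grid = np.linspace(-3, 3, 2001)
losses = [pinball_loss(r - c, tau).sum() for c in grid]
c_star = grid[int(np.argmin(losses))]
print(c_star, np.quantile(r, tau))   # the two values nearly coincide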

6.4.3 Deadzone-Linear Loss

Another variant of the absolute loss is the deadzone-linear loss (see Figure 6.9):

  min_θ Σ_{t=1}^T ρ^{DL}_ε( θ^⊤ \hat{ψ}(s_t, a_t) − r_t ),

where ρ^{DL}_ε(x) is the deadzone-linear loss defined by

  ρ^{DL}_ε(x) = 0             if |x| ≤ ε,
              = |x| − ε       if |x| > ε.

That is, if the magnitude of the error is less than ε, no error is assessed. This loss is also called the ε-insensitive loss and is used in support vector regression (Vapnik, 1998).

When ε = 0, the deadzone-linear loss is reduced to the absolute loss. Thus, the deadzone-linear loss and the absolute loss are related to each other. However, the effect of the deadzone-linear loss is completely opposite to that of the absolute loss when ε > 0: the influence of "good" samples (with small error) is deemphasized in the deadzone-linear loss, while the absolute loss tends to suppress the influence of "bad" samples (with large error) compared with the squared loss.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

  min_{θ, {b_t}_{t=1}^T} Σ_{t=1}^T b_t
  subject to  −b_t − ε ≤ θ^⊤ \hat{ψ}(s_t, a_t) − r_t ≤ b_t + ε,
              b_t ≥ 0,  t = 1, ..., T.

6.4.4 Chebyshev Approximation

The Chebyshev approximation minimizes the error for the "worst" sample:

  min_θ max_{t=1,...,T} | θ^⊤ \hat{ψ}(s_t, a_t) − r_t |.

This is also called the minimax approximation.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

  min_{θ, b} b
  subject to  −b ≤ θ^⊤ \hat{ψ}(s_t, a_t) − r_t ≤ b,  t = 1, ..., T.

6.4.5 Conditional Value-At-Risk

In the area of finance, the conditional value-at-risk (CVaR) is a popular risk measure (Rockafellar & Uryasev, 2002). The CVaR corresponds to the mean of the error for a set of "bad" samples (see Figure 6.10).

FIGURE 6.10: The conditional value-at-risk (CVaR).

More specifically, let us consider the distribution of the absolute error over all training samples {(s_t, a_t, r_t)}_{t=1}^T:

  Φ(α|θ) = P( (s_t, a_t, r_t) : | θ^⊤ \hat{ψ}(s_t, a_t) − r_t | ≤ α ).

For β ∈ [0, 1), let α_β(θ) be the 100β percentile of the absolute error distribution:

  α_β(θ) = min{ α | Φ(α|θ) ≥ β }.

Thus, only the fraction (1 − β) of the absolute errors | θ^⊤ \hat{ψ}(s_t, a_t) − r_t | exceeds the threshold α_β(θ). α_β(θ) is also referred to as the value-at-risk (VaR).

Let us consider the β-tail distribution of the absolute error:

  Φ_β(α|θ) = 0                             if α < α_β(θ),
           = (Φ(α|θ) − β) / (1 − β)        if α ≥ α_β(θ).

Let φ_β(θ) be the mean of the β-tail distribution of the absolute temporal-difference (TD) error:

  φ_β(θ) = E_{Φ_β}[ | θ^⊤ \hat{ψ}(s_t, a_t) − r_t | ],

where E_{Φ_β} denotes the expectation over the distribution Φ_β. φ_β(θ) is called the CVaR. By definition, the CVaR of the absolute error is reduced to the mean absolute error if β = 0, and it converges to the worst absolute error as β tends to 1. Thus, the CVaR smoothly bridges the least absolute and Chebyshev approximation methods. CVaR is also referred to as the expected shortfall.

The CVaR minimization problem in the current context is formulated as

  min_θ E_{Φ_β}[ | θ^⊤ \hat{ψ}(s_t, a_t) − r_t | ].

This optimization problem looks complicated, but the solution \hat{θ}_{CV} can be obtained by solving the following linear program (Rockafellar & Uryasev, 2002):

  min_{θ, α, {b_t, c_t}_{t=1}^T}  T(1 − β) α + Σ_{t=1}^T c_t
  subject to  −b_t ≤ θ^⊤ \hat{ψ}(s_t, a_t) − r_t ≤ b_t,
              c_t ≥ b_t − α,
              c_t ≥ 0,  t = 1, ..., T.

Note that if the definition of the absolute error is slightly changed, the CVaR minimization method amounts to minimizing the deadzone-linear loss (Takeda, 2007).
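For intuition, the CVaR of the absolute errors can also be estimated empirically by averaging the worst (1 − β) fraction of the absolute residuals, rather than by solving the linear program. A minimal sketch under this interpretation:

import numpy as np

def empirical_cvar(abs_errors, beta):
    """Mean of the worst (1 - beta) fraction of the absolute errors."""
    var = np.quantile(abs_errors, beta)       # empirical value-at-risk (VaR)
    tail = abs_errors[abs_errors >= var]
    return tail.mean()

# Example: empirical_cvar(np.abs(Psi @ theta - r), beta=0.9)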

6.5 Remarks

LSPI can be regarded as regression of immediate rewards under the squared loss. In this chapter, the absolute loss was used for regression instead, which contributes to enhancing robustness and reliability. The least absolute method is formulated as a linear program and can be solved efficiently by standard optimization software.

LSPI maximizes the state-action value function Q(s,a), which is the expectation of returns. Another way to address robustness and reliability is to maximize other quantities such as the median or a quantile of returns. Although Bellman-like simple recursive expressions are not available for quantiles of rewards, a Bellman-like recursive equation holds for the distribution of the discounted sum of rewards (Morimura et al., 2010a; Morimura et al., 2010b). Developing robust reinforcement learning algorithms along this line of research would be a promising future direction.

Part III

Model-Free Policy Search

In the policy iteration approach explained in Part II, the value function is first estimated and then the policy is determined based on the learned value function. Policy iteration was demonstrated to work well in many real-world applications, especially in problems with discrete states and actions (Tesauro, 1994; Williams & Young, 2007; Abe et al., 2010). Although policy iteration can also handle continuous states by function approximation (Lagoudakis & Parr, 2003), continuous actions are hard to deal with due to the difficulty of finding a maximizer of the value function with respect to actions. Moreover, since policies are indirectly determined via value function approximation, misspecification of value function models can lead to an inappropriate policy even in very simple problems (Weaver & Baxter, 1999; Baxter et al., 2001). Another limitation of policy iteration, especially in physical control tasks, is that control policies can vary drastically in each iteration. This causes severe instability in the physical system and thus is not favorable in practice.

Policy search is an alternative approach to reinforcement learning that can overcome the limitations of policy iteration (Williams, 1992; Dayan & Hinton, 1997; Kakade, 2002). In the policy search approach, policies are directly learned so that the return (i.e., the discounted sum of future rewards),

  Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}),

is maximized.

In Part III, we focus on the framework of policy search. First, direct policy search methods are introduced, which try to find the policy that achieves the maximum return via gradient ascent (Chapter 7) or expectation-maximization (Chapter 8). A potential weakness of the direct policy search approach is its instability due to the randomness of stochastic policies. To overcome the instability problem, an alternative approach called policy-prior search is introduced in Chapter 9.

Chapter 7

Direct Policy Search by Gradient Ascent

The direct policy search approach tries to find the policy that maximizes the expected return. In this chapter, we introduce gradient-based algorithms for direct policy search. After the problem formulation in Section 7.1, the gradient ascent algorithm is introduced in Section 7.2. Then, in Section 7.3, its extension using natural gradients is described. In Section 7.4, an application to computer graphics is shown. Finally, this chapter is concluded in Section 7.5.

7.1 Formulation

In this section, the problem of direct policy search is mathematically formulated.

Let us consider a Markov decision process specified by

  (S, A, p(s'|s,a), p(s), r, γ),

where S is a set of continuous states, A is a set of continuous actions, p(s'|s,a) is the transition probability density from current state s to next state s' when action a is taken, p(s) is the probability density of initial states, r(s,a,s') is an immediate reward for the transition from s to s' by taking action a, and 0 < γ ≤ 1 is the discount factor for future rewards.

Let π(a|s,θ) be a stochastic policy parameterized by θ, which represents the conditional probability density of taking action a in state s. Let h be a trajectory of length T:

  h = [s_1, a_1, ..., s_T, a_T, s_{T+1}].

The return (i.e., the discounted sum of future rewards) along h is defined as

  R(h) = Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}),

and the expected return for policy parameter θ is defined as

  J(θ) = E_{p(h|θ)}[ R(h) ] = ∫ p(h|θ) R(h) dh,

where E_{p(h|θ)} is the expectation over trajectory h drawn from p(h|θ), and p(h|θ) denotes the probability density of observing trajectory h under policy parameter θ:

  p(h|θ) = p(s_1) Π_{t=1}^T p(s_{t+1}|s_t, a_t) π(a_t|s_t, θ).

The goal of direct policy search is to find the optimal policy parameter θ* that maximizes the expected return J(θ):

  θ* = argmax_θ J(θ).

However, directly maximizing J(θ) is hard since J(θ) usually involves high non-linearity with respect to θ. Below, a gradient-based algorithm is introduced to find a local maximizer of J(θ). An alternative approach based on the expectation-maximization algorithm is provided in Chapter 8.

FIGURE 7.1: Gradient ascent for direct policy search.

7.2 Gradient Approach

In this section, a gradient ascent method for direct policy search is introduced (Figure 7.1).

7.2.1 Gradient Ascent

The simplest approach to finding a local maximizer of the expected return is gradient ascent (Williams, 1992):

  θ ← θ + ε ∇_θ J(θ),

where ε is a small positive constant and ∇_θ J(θ) denotes the gradient of the expected return J(θ) with respect to the policy parameter θ. The gradient ∇_θ J(θ) is given by

  ∇_θ J(θ) = ∫ ∇_θ p(h|θ) R(h) dh
           = ∫ p(h|θ) ∇_θ log p(h|θ) R(h) dh
           = ∫ p(h|θ) Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) R(h) dh,

where the so-called "log trick" is used:

  ∇_θ p(h|θ) = p(h|θ) ∇_θ log p(h|θ).

This expression means that the gradient ∇_θ J(θ) is given as the expectation over p(h|θ):

  ∇_θ J(θ) = E_{p(h|θ)}[ Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) R(h) ].

Since p(h|θ) is unknown, the expectation is approximated by the empirical average as

  \hat{∇}_θ J(θ) = (1/N) Σ_{n=1}^N Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ) R(h_n),

where

  h_n = [s_{1,n}, a_{1,n}, ..., s_{T,n}, a_{T,n}, s_{T+1,n}]

is an independent sample from p(h|θ). This algorithm is called REINFORCE (Williams, 1992), which is an acronym for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility."

A popular choice for the policy model π(a|s,θ) is the Gaussian policy model, where the policy parameter θ consists of mean vector µ and standard deviation σ:

  π(a|s, µ, σ) = (1 / (σ√(2π))) exp( −(a − µ^⊤ φ(s))² / (2σ²) ).    (7.1)

Here, φ(s) denotes the basis function. For this Gaussian policy model, the policy gradients are explicitly computed as

  ∇_µ log π(a|s, µ, σ) = ( (a − µ^⊤ φ(s)) / σ² ) φ(s),
  ∇_σ log π(a|s, µ, σ) = ( (a − µ^⊤ φ(s))² − σ² ) / σ³.
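A minimal sketch of the REINFORCE gradient estimate for this Gaussian policy is given below (Python; the trajectories are assumed to be given as an array Phi of basis vectors φ(s_{t,n}), an array A of actions a_{t,n}, and an array R of returns R(h_n)):

import numpy as np

def reinforce_gradient(Phi, A, R, mu, sigma):
    """Empirical policy gradient for the Gaussian policy (7.1).

    Phi: shape (N, T, B); A: shape (N, T); R: shape (N,); mu: shape (B,); sigma: scalar.
    """
    resid = A - np.einsum("ntb,b->nt", Phi, mu)            # a - mu^T phi(s) per step
    grad_mu_log = (resid / sigma**2)[:, :, None] * Phi     # per-step gradients w.r.t. mu
    grad_sigma_log = (resid**2 - sigma**2) / sigma**3      # per-step gradients w.r.t. sigma
    grad_mu = (grad_mu_log.sum(axis=1) * R[:, None]).mean(axis=0)
    grad_sigma = (grad_sigma_log.sum(axis=1) * R).mean()
    return grad_mu, grad_sigma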

As shown above, the gradient ascent algorithm for direct policy search is very simple to implement. Furthermore, the property that policy parameters are gradually updated in the gradient ascent algorithm is preferable when reinforcement learning is applied to the control of a vulnerable physical system such as a humanoid robot, because a sudden policy change can damage the system. However, the variance of policy gradients tends to be large in practice (Peters & Schaal, 2006; Sehnke et al., 2010), which can result in slow and unstable convergence.

7.2.2 Baseline Subtraction for Variance Reduction

Baseline subtraction is a useful technique to reduce the variance of gradient estimators. Technically, baseline subtraction can be viewed as the method of control variates (Fishman, 1996), which is an effective approach to reducing the variance of Monte Carlo integral estimators.

The basic idea of baseline subtraction is that an unbiased estimator \hat{η} is still unbiased if a zero-mean random variable m multiplied by a constant ξ is subtracted:

  \hat{η}_ξ = \hat{η} − ξ m.

The constant ξ, which is called a baseline, may be chosen so that the variance of \hat{η}_ξ is minimized. By baseline subtraction, a more stable estimator than the original \hat{η} can be obtained.

A policy gradient estimator with baseline ξ subtracted is given by

  \hat{∇}_θ J_ξ(θ) = \hat{∇}_θ J(θ) − ξ (1/N) Σ_{n=1}^N Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ)
                   = (1/N) Σ_{n=1}^N ( R(h_n) − ξ ) Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ),

where the expectation of ∇_θ log π(a|s,θ) is zero:

  E[ ∇_θ log π(a|s,θ) ] = ∫ π(a|s,θ) ∇_θ log π(a|s,θ) da
                        = ∫ ∇_θ π(a|s,θ) da
                        = ∇_θ ∫ π(a|s,θ) da = ∇_θ 1 = 0.

The optimal baseline is defined as the minimizer of the variance of the gradient estimator with respect to the baseline (Greensmith et al., 2004; Weaver & Tao, 2001):

  ξ* = argmin_ξ Var_{p(h|θ)}[ \hat{∇}_θ J_ξ(θ) ],

where Var_{p(h|θ)} denotes the trace of the covariance matrix:

  Var_{p(h|θ)}[ζ] = tr( E_{p(h|θ)}[ (ζ − E_{p(h|θ)}[ζ]) (ζ − E_{p(h|θ)}[ζ])^⊤ ] )
                  = E_{p(h|θ)}[ ‖ ζ − E_{p(h|θ)}[ζ] ‖² ].

It was shown in Peters and Schaal (2006) that the optimal baseline ξ* is given as

  ξ* = E_{p(h|θ)}[ R(h) ‖ Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) ‖² ] / E_{p(h|θ)}[ ‖ Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) ‖² ].

In practice, the expectations are approximated by sample averages.
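A minimal sketch of the sample-based optimal baseline is given below (Python; score_sums[n] is assumed to hold Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ) for trajectory n, and R[n] the return R(h_n)):

import numpy as np

def optimal_baseline(score_sums, R):
    """Sample estimate of xi* = E[R ||sum_t grad log pi||^2] / E[||sum_t grad log pi||^2]."""
    sq_norms = np.sum(score_sums**2, axis=1)
    return np.mean(R * sq_norms) / np.mean(sq_norms)

def gradient_with_baseline(score_sums, R):
    xi = optimal_baseline(score_sums, R)
    return np.mean((R - xi)[:, None] * score_sums, axis=0)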

7.2.3 Variance Analysis of Gradient Estimators

Here, the variance of gradient estimators is theoretically investigated for the Gaussian policy model (7.1) with φ(s) = s. See Zhao et al. (2012) for technical details.

In the theoretical analysis, subsets of the following assumptions are considered:

Assumption (A): r(s,a,s') ∈ [−β, β] for β > 0.

Assumption (B): r(s,a,s') ∈ [α, β] for 0 < α < β.

Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^T and {d_t}_{t=1}^T such that ‖s_t‖ ≥ c_t and ‖s_t‖ ≤ d_t hold with probability at least 1 − δ/(2N), respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A). Let

  ζ(T) = C_T α² − D_T β² / (2π),

where

  C_T = Σ_{t=1}^T c_t²  and  D_T = Σ_{t=1}^T d_t².

First, the variance of gradient estimators is analyzed.

Theorem 7.1  Under Assumptions (A) and (C), the following upper bound holds with probability at least 1 − δ/2:

  Var_{p(h|θ)}[ \hat{∇}_µ J(µ, σ) ] ≤ D_T β² (1 − γ^T)² / ( N σ² (1 − γ)² ).

Under Assumption (A), it holds that

  Var_{p(h|θ)}[ \hat{∇}_σ J(µ, σ) ] ≤ 2 T β² (1 − γ^T)² / ( N σ² (1 − γ)² ).

The above upper bounds are monotone increasing with respect to the trajectory length T.

For the variance of \hat{∇}_µ J(µ, σ), the following lower bound holds (its upper bound has not been derived yet):

Theorem 7.2  Under Assumptions (B) and (C), the following lower bound holds with probability at least 1 − δ:

  Var_{p(h|θ)}[ \hat{∇}_µ J(µ, σ) ] ≥ ( (1 − γ^T)² / ( N σ² (1 − γ)² ) ) ζ(T).

This lower bound is non-trivial if ζ(T) > 0, which can be fulfilled, e.g., if α and β satisfy 2π C_T α² > D_T β².

Next, the contribution of the optimal baseline is investigated. It was shown (Greensmith et al., 2004; Weaver & Tao, 2001) that the excess variance for an arbitrary baseline ξ is given by

  Var_{p(h|θ)}[ \hat{∇}_θ J_ξ(θ) ] − Var_{p(h|θ)}[ \hat{∇}_θ J_{ξ*}(θ) ]
    = ( (ξ − ξ*)² / N ) E_{p(h|θ)}[ ‖ Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) ‖² ].

Based on this expression, the following theorem can be obtained.

Theorem 7.3  Under Assumptions (B) and (C), the following bounds hold with probability at least 1 − δ:

  C_T α² (1 − γ^T)² / ( N σ² (1 − γ)² )
    ≤ Var_{p(h|θ)}[ \hat{∇}_µ J(µ, σ) ] − Var_{p(h|θ)}[ \hat{∇}_µ J_{ξ*}(µ, σ) ]
    ≤ β² (1 − γ^T)² D_T / ( N σ² (1 − γ)² ).

This theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by optimal baseline subtraction and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.

Finally, the variance of gradient estimators with the optimal baseline is investigated:

Theorem 7.4  Under Assumptions (B) and (C), it holds that

  Var_{p(h|θ)}[ \hat{∇}_µ J_{ξ*}(µ, σ) ] ≤ ( (1 − γ^T)² / ( N σ² (1 − γ)² ) ) ( β² D_T − α² C_T ),

where the inequality holds with probability at least 1 − δ.

This theorem shows that the upper bound of the variance of the gradient estimators with the optimal baseline is still monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, the variance of the gradient estimators can still be large even with the optimal baseline. In Chapter 9, another gradient approach will be introduced for overcoming this large-variance problem.

FIGURE 7.2: Ordinary and natural gradients: (a) ordinary gradients and (b) natural gradients. Ordinary gradients treat all dimensions equally, while natural gradients take the Riemannian structure into account.

7.3 Natural Gradient Approach

The gradient-based policy parameter update used in the REINFORCE algorithm is performed under the Euclidean metric. In this section, we show another useful choice of the metric for gradient-based policy search.

7.3.1 Natural Gradient Ascent

Use of the Euclidean metric implies that all dimensions of the policy parameter vector θ are treated equally (Figure 7.2(a)). However, since a policy parameter θ specifies a conditional probability density π(a|s,θ), use of the Euclidean metric in the parameter space does not necessarily mean that all dimensions are treated equally in the space of conditional probability densities. Thus, a small change in the policy parameter θ can cause a big change in the conditional probability density π(a|s,θ) (Kakade, 2002).

Figure 7.3 depicts Gaussian densities with means µ = −5, 0, 5 and standard deviations σ = 1, 2. It shows that if the standard deviation is doubled, the difference in means should also be doubled to maintain the same overlapping level. Thus, it is "natural" to compute the distance between two Gaussian densities parameterized with (µ, σ) and (µ + ∆µ, σ) not by ∆µ, but by ∆µ/σ.

FIGURE 7.3: Gaussian densities with different means and standard deviations. If the standard deviation is doubled (from the solid lines to the dashed lines), the difference in means should also be doubled to maintain the same overlapping level.

Gradients that treat all dimensions equally in the space of probability densities are called natural gradients (Amari, 1998; Amari & Nagaoka, 2000).

The ordinary gradient is defined as the steepest ascent direction under the Euclidean metric (Figure 7.2(a)):

  ∇_θ J(θ) = argmax_{∆θ} J(θ + ∆θ)  subject to  ∆θ^⊤ ∆θ ≤ ε,

where ε is a small positive number. On the other hand, the natural gradient is defined as the steepest ascent direction under the Riemannian metric (Figure 7.2(b)):

  \tilde{∇}_θ J(θ) = argmax_{∆θ} J(θ + ∆θ)  subject to  ∆θ^⊤ R_θ ∆θ ≤ ε,

where R_θ is the Riemannian metric, which is a positive definite matrix. The solution of the above optimization problem is given by

  \tilde{∇}_θ J(θ) = R_θ^{−1} ∇_θ J(θ).

Thus, the ordinary gradient ∇_θ J(θ) is modified by the inverse Riemannian metric R_θ^{−1} in the natural gradient.

A standard distance metric in the space of probability densities is the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence from density p to density q is defined as

  KL(p‖q) = ∫ p(θ) log ( p(θ) / q(θ) ) dθ.

KL(p‖q) is always non-negative and zero if and only if p = q. Thus, a smaller KL(p‖q) means that p and q are "closer." However, note that the KL divergence is not symmetric, i.e., KL(p‖q) ≠ KL(q‖p) in general.

For small ∆θ, the KL divergence from p(h|θ) to p(h|θ + ∆θ) can be approximated by

  ∆θ^⊤ F_θ ∆θ,

where F_θ is the Fisher information matrix:

  F_θ = E_{p(h|θ)}[ ∇_θ log p(h|θ) ∇_θ log p(h|θ)^⊤ ].

Thus, F_θ is the Riemannian metric induced by the KL divergence.

Then the update rule of the policy parameter θ based on the natural gradient is given by

  θ ← θ + ε \hat{F}_θ^{−1} \hat{∇}_θ J(θ),

where ε is a small positive constant and \hat{F}_θ is a sample approximation of F_θ:

  \hat{F}_θ = (1/N) Σ_{n=1}^N ∇_θ log p(h_n|θ) ∇_θ log p(h_n|θ)^⊤.

Under mild regularity conditions, the Fisher information matrix F_θ can be expressed as

  F_θ = −E_{p(h|θ)}[ ∇²_θ log p(h|θ) ],

where ∇²_θ log p(h|θ) denotes the Hessian matrix of log p(h|θ), i.e., the (b, b')-th element of ∇²_θ log p(h|θ) is given by ∂² log p(h|θ) / (∂θ_b ∂θ_{b'}). This means that the natural gradient takes the curvature into account, by which the convergence behavior at flat plateaus and steep ridges tends to be improved. On the other hand, a potential weakness of natural gradients is that computation of the inverse Riemannian metric tends to be numerically unstable (Deisenroth et al., 2013).
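A minimal sketch of one natural gradient update is given below (Python; score_sums[n] again holds the per-trajectory sum of ∇_θ log π, which coincides with ∇_θ log p(h_n|θ) because the transition model does not depend on θ; the small ridge term is an added assumption for numerical stability, not part of the text):

import numpy as np

def natural_gradient_step(theta, score_sums, R, eps=0.1, ridge=1e-6):
    """theta + eps * F^{-1} grad J, with F and grad J estimated from N trajectories."""
    grad = np.mean(R[:, None] * score_sums, axis=0)          # REINFORCE gradient estimate
    F = score_sums.T @ score_sums / score_sums.shape[0]      # sample Fisher information matrix
    F = F + ridge * np.eye(F.shape[0])                       # regularize before inversion
    return theta + eps * np.linalg.solve(F, grad)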

7.3.2 Illustration

Let us illustrate the difference between ordinary and natural gradients numerically.

Consider a one-dimensional real-valued state space S = R and a one-dimensional real-valued action space A = R. The transition dynamics is linear and deterministic as s' = s + a, and the reward function is quadratic as r = −0.5 s² − 0.05 a². The discount factor is set at γ = 0.95. The Gaussian policy model,

  π(a|s, µ, σ) = (1 / (σ√(2π))) exp( −(a − µs)² / (2σ²) ),

is employed, which contains the mean parameter µ and the standard deviation parameter σ. The optimal policy parameters in this setup are given by (µ*, σ*) ≈ (−0.912, 0).

FIGURE 7.4: Numerical illustrations of ordinary and natural gradients: (a) ordinary gradients and (b) natural gradients, plotted over the policy parameters (µ, σ).

Figure 7.4 shows a numerical comparison of ordinary and natural gradients for the Gaussian policy. The contour lines and the arrows indicate the expected return surface and the gradient directions, respectively. The graphs show that the ordinary gradients tend to strongly reduce the standard deviation parameter σ without really updating the mean parameter µ. This means that the stochasticity of the policy is lost quickly and thus the agent becomes less exploratory. Consequently, once σ gets close to zero, the solution is at a flat plateau along the direction of µ and thus policy updates in µ are very slow. On the other hand, the natural gradients reduce both the mean parameter µ and the standard deviation parameter σ in a balanced way. As a result, convergence is much faster than with the ordinary gradient method.

7.4 Application in Computer Graphics: Artist Agent

Oriental ink painting, which is also called sumie, is one of the most distinctive painting styles and has attracted artists around the world. Major challenges in sumie simulation are to abstract complex scene information and reproduce smooth and natural brush strokes. Reinforcement learning is useful for automatically generating such smooth and natural strokes (Xie et al., 2013). In this section, the REINFORCE algorithm explained in Section 7.2 is applied to sumie agent training.

7.4.1 Sumie Painting

Among various techniques of non-photorealistic rendering (Gooch & Gooch, 2001), stroke-based painterly rendering synthesizes an image from a source image in a desired painting style by placing discrete strokes (Hertzmann, 2003). Such an algorithm simulates the common practice of human painters who create paintings with brush strokes.

Western painting styles such as water-color, pastel, and oil painting overlay strokes onto multiple layers, while oriental ink painting uses a few expressive strokes produced by soft brush tufts to convey significant information about a target scene. The appearance of a stroke in oriental ink painting is therefore determined by the shape of the object to paint, the path and posture of the brush, and the distribution of pigments in the brush.

Drawing smooth and natural strokes in arbitrary shapes is challenging since the optimal brush trajectory and the posture of a brush footprint are different for each shape. Existing methods can efficiently map brush texture by deformation onto a user-given trajectory line or the shape of a target stroke (Hertzmann, 1998; Guo & Kunii, 2003). However, the geometrical process of morphing the entire texture of a brush stroke into the target shape leads to undesirable effects such as unnatural foldings and creased appearances at corners or curves.

Here, a soft-tuft brush is treated as a reinforcement learning agent, and the REINFORCE algorithm is used to automatically draw artistic strokes. More specifically, given any closed contour that represents the shape of a desired single stroke without overlap, the agent moves the brush on the canvas to fill the given shape from a start point to an end point with stable poses along a smooth continuous movement trajectory (see Figure 7.5).

In oriental ink painting, there are several different brush styles that characterize the paintings. Below, two representative styles called the upright brush style and the oblique brush style are considered (see Figure 7.6). In the upright brush style, the tip of the brush should be located on the medial axis of the expected stroke shape, and the bottom of the brush should be tangent to both sides of the boundary. On the other hand, in the oblique brush style, the tip of the brush should touch one side of the boundary and the bottom of the brush should be tangent to the other side of the boundary. The choice between the upright brush style and the oblique brush style is exclusive, and a user is asked to choose one of the styles in advance.

7.4.2 Design of States, Actions, and Immediate Rewards

Here, the specific design of states, actions, and immediate rewards tailored to the sumie agent is described.

FIGURE 7.5: Illustration of the brush agent and its path: (a) brush model, (b) footprints, and (c) basic stroke styles. (a) A stroke is generated by moving the brush with the following 3 actions: Action 1 is regulating the direction of the brush movement, Action 2 is pushing down/lifting up the brush, and Action 3 is rotating the brush handle. Only Action 1 is determined by reinforcement learning; Action 2 and Action 3 are determined based on Action 1. (b) The top symbol illustrates the brush agent, which consists of a tip Q and a circle with center C and radius r. The others illustrate footprints of a real brush with different ink quantities. (c) There are 6 basic stroke styles: full ink, dry ink, first-half hollow, hollow, middle hollow, and both-end hollow. Small footprints on the top of each stroke show the interpolation order.

FIGURE 7.6: Upright brush style (left) and oblique brush style (right).

7.4.2.1 States

The global measurement (i.e., the pose configuration of a footprint under the global Cartesian coordinate) and the local measurement (i.e., the pose and the locomotion information of the brush agent relative to the surrounding environment) are used as states. Here, only the local measurement is used to calculate a reward and a policy, by which the agent can learn a drawing policy that is generalizable to new shapes. Below, the local measurement is regarded as states and the global measurement is dealt with only implicitly.

The local state-space design consists of two components: a current surrounding shape and an upcoming shape. More specifically, the state vector s consists of the following six features:

  s = (ω, φ, d, κ_1, κ_2, l)^⊤.

Each feature is defined as follows (see Figure 7.7):

• ω ∈ (−π, π]: The angle of the velocity vector of the brush agent relative to the medial axis.

• φ ∈ (−π, π]: The heading direction of the brush agent relative to the medial axis.

• d ∈ [−2, 2]: The ratio of the offset distance δ from the center C of the brush agent to the nearest point P on the medial axis M over the radius r of the brush agent (|d| = δ/r). d takes a positive/negative value when the center of the brush agent is on the left-/right-hand side of the medial axis:

  – d takes the value 0 when the center of the brush agent is on the medial axis.

  – d takes a value in [−1, 1] when the brush agent is inside the boundaries.

  – The value of d is in [−2, −1) or in (1, 2] when the brush agent goes over the boundary on one side.

  Note that the center of the agent is restricted to lie within the shape. Therefore, the extreme values of d are ±2, attained when the center of the agent is on the boundary.

FIGURE 7.7: Illustration of the design of states. Left: The brush agent consists of a tip Q and a circle with center C and radius r. Right: The ratio d of the offset distance δ over the radius r. Footprint f_{t−1} is inside the drawing area, and the circle with center C_{t−1} and the tip Q_{t−1} touch the boundary on each side; in this case, δ_{t−1} ≤ r_{t−1} and d_{t−1} ∈ [0, 1]. On the other hand, f_t goes over the boundary, and then δ_t > r_t and d_t > 1. Note that d is restricted to be in [−2, 2], and P is the nearest point on the medial axis M to C.

• κ_1, κ_2 ∈ (−1, 1): κ_1 provides the current surrounding shape information at the point P_t, whereas κ_2 provides the upcoming shape information at the point P_{t+1}:

  κ_i = (2/π) arctan( 0.05 / r'_i ),

where r'_i is the radius of the curve. More specifically, the value takes 0/negative/positive when the shape is straight/left-curved/right-curved, and the larger its absolute value is, the tighter the curve is.

• l ∈ {0, 1}: A binary label that indicates whether the agent moves to a region covered by the previous footprints or not. l = 0 means that the agent moves to a region covered by the previous footprint; l = 1 means that it moves to an uncovered region.

7.4.2.2 Actions

To generate elegant brush strokes, the brush agent should move inside the given boundaries properly. Here, the following actions are considered to control the brush (see Figure 7.5(a)):

• Action 1: Movement of the brush on the canvas paper.

• Action 2: Scaling up/down of the footprint.

• Action 3: Rotation of the heading direction of the brush.

Since properly covering the whole desired region is the most important factor in terms of visual quality, the movement of the brush (Action 1) is regarded as the primary action. More specifically, Action 1 takes a value in (−π, π] that indicates the offset turning angle of the motion direction relative to the medial axis of the expected stroke shape. In practical applications, the agent should be able to deal with arbitrary strokes at various scales. To achieve stable performance at different scales, the velocity is adaptively changed as r/3, where r is the radius of the current footprint.

Action 1 is determined by the Gaussian policy function trained by the REINFORCE algorithm, and Action 2 and Action 3 are determined as follows:

• Oblique brush stroke style: The tip of the agent is set to touch one side of the boundary, and the bottom of the agent is set to be tangent to the other side of the boundary.

• Upright brush stroke style: The tip of the agent is chosen to travel along the medial axis of the shape.

If it is not possible to satisfy the above constraints by adjusting Action 2 and Action 3, the new footprint simply takes the same posture as the previous one.

7.4.2.3 Immediate Rewards

The immediate reward function measures the quality of the brush agent's movement after taking an action at each time step. The reward is designed to reflect the following two aspects:

• The distance between the center of the brush agent and the nearest point on the medial axis of the shape at the current time step: this detects whether the agent moves out of the region or travels backward from the correct direction.

• The change of the local configuration of the brush agent after executing an action: this detects whether the agent moves smoothly.

These two aspects are formalized by defining the reward function as follows:

  r(s_t, a_t, s_{t+1}) = 0    if f_t = f_{t+1} or l_{t+1} = 0,
                       = ( 2 + |κ_1(t)| + |κ_2(t)| ) / ( E^{(t)}_location + E^{(t)}_posture )    otherwise,

where f_t and f_{t+1} are the footprints at time steps t and t+1, respectively. This reward design implies that the immediate reward is zero when the brush is blocked by a boundary (f_t = f_{t+1}) or the brush is going backward to a region that has already been covered by previous footprints. κ_1(t) and κ_2(t) are the values of κ_1 and κ_2 at time step t; the term |κ_1(t)| + |κ_2(t)| adaptively increases the immediate reward depending on the curvatures κ_1(t) and κ_2(t) of the medial axis.

E^{(t)}_location measures the quality of the location of the brush agent with respect to the medial axis, defined by

  E^{(t)}_location = τ_1 |ω_t| + τ_2 ( |d_t| + 5 )    if d_t ∈ [−2, −1) ∪ (1, 2],
                   = τ_1 |ω_t| + τ_2 |d_t|            if d_t ∈ [−1, 1],

where d_t is the value of d at time step t. τ_1 and τ_2 are weight parameters, which are chosen depending on the brush style: τ_1 = τ_2 = 0.5 for the upright brush style, and τ_1 = 0.1 and τ_2 = 0.9 for the oblique brush style. Since d_t contains information about whether the agent goes over the boundary or not, as illustrated in Figure 7.7, the penalty +5 is added to E_location when the agent goes over the boundary of the shape.

E^{(t)}_posture measures the quality of the posture of the brush agent based on neighboring footprints, defined by

  E^{(t)}_posture = ∆ω_t / 3 + ∆φ_t / 3 + ∆d_t / 3,

where ∆ω_t, ∆φ_t, and ∆d_t are the changes in the angle ω of the velocity vector, the heading direction φ, and the ratio d of the offset distance, respectively. The notation ∆x_t denotes the normalized squared change between x_{t−1} and x_t, defined by

  ∆x_t = 1                                            if x_t = x_{t−1} = 0,
       = (x_t − x_{t−1})² / ( |x_t| + |x_{t−1}| )²     otherwise.
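A minimal sketch of this reward computation is given below (Python; the feature values at the current and previous time steps are assumed to be available, same_footprint indicates whether f_t = f_{t+1}, and the function names are only for illustration):

def normalized_change(x_now, x_prev):
    if x_now == 0.0 and x_prev == 0.0:
        return 1.0
    return (x_now - x_prev)**2 / (abs(x_now) + abs(x_prev))**2

def sumie_reward(same_footprint, l_next, omega, omega_prev, phi, phi_prev,
                 d, d_prev, kappa1, kappa2, tau1=0.5, tau2=0.5):
    if same_footprint or l_next == 0:
        return 0.0
    if abs(d) > 1:                                   # the footprint goes over the boundary
        e_location = tau1 * abs(omega) + tau2 * (abs(d) + 5)
    else:
        e_location = tau1 * abs(omega) + tau2 * abs(d)
    e_posture = (normalized_change(omega, omega_prev)
                 + normalized_change(phi, phi_prev)
                 + normalized_change(d, d_prev)) / 3.0
    return (2 + abs(kappa1) + abs(kappa2)) / (e_location + e_posture)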

7.4.2.4 Training and Test Sessions

A naive way to train an agent is to use an entire stroke shape as a training sample. However, this has several drawbacks, e.g., collecting many training samples is costly and generalization to new shapes is hard. To overcome these limitations, the agent is trained based on partial shapes, not the entire shapes (Figure 7.8(a)). This allows us to generate various partial shapes from a single entire shape, which significantly increases the number and variation of training samples. Another merit is that the generalization ability to new shapes can be enhanced, because even when the entire profile of a new shape is quite different from that of the training data, the new shape may contain similar partial shapes. Figure 7.8(c) illustrates 8 examples of the 80 digitized real single brush strokes that are commonly used in oriental ink painting. Boundaries are extracted as the shape information and are arranged in a queue for training (see Figure 7.8(b)).

FIGURE 7.8: Policy training scheme: (a) combination of shapes, (b) setup of policy training, and (c) training shapes. (a) Each entire shape is composed of one of the upper regions U_i, the common region Ω, and one of the lower regions L_j. (b) Boundaries are extracted as the shape information and are arranged in a queue for training. (c) Eight examples of the 80 digitized real single brush strokes that are commonly used in oriental ink painting are illustrated.

In the training session, the initial position of the first episode is chosen to be the start point of the medial axis, and the direction to move is chosen toward the goal point, as illustrated in Figure 7.8(b). In the first episode, the initial footprint is set at the start point of the shape. Then, in the following episodes, the initial footprint is set either at the last footprint of the previous episode or at the start point of the shape, depending on whether the agent moved well or was blocked by the boundary in the previous episode.

After learning a drawing policy, the brush agent applies the learned policy to covering given boundaries with smooth strokes. The location of the agent is initialized at the start point of a new shape. The agent then sequentially selects actions based on the learned policy and makes transitions until it reaches the goal point.

FIGURE 7.9: Average and standard deviation of returns obtained by the reinforcement learning (RL) method over 10 trials and the upper limit of the return value: (a) upright brush style and (b) oblique brush style.

7.4.3 Experimental Results

First, the performance of the reinforcement learning (RL) method is investigated. Policies are separately trained by the REINFORCE algorithm for the upright brush style and the oblique brush style using 80 single strokes as training data (see Figure 7.8(c)). The parameters of the initial policy are set at

  θ = (µ^⊤, σ)^⊤ = (0, 0, 0, 0, 0, 0, 2)^⊤,

where the first six elements correspond to the Gaussian mean and the last element is the Gaussian standard deviation. The agent collects N = 300 episodic samples with trajectory length T = 32. The discount factor is set at γ = 0.99.

The average and standard deviation of the return over the 300 training episodic samples across 10 trials are plotted in Figure 7.9. The graphs show that the average returns sharply increase at an early stage and approach the optimal values (i.e., receiving the maximum immediate reward, +1, for all steps).

Next, the performance of the RL method is compared with that of the dynamic programming (DP) method (Xie et al., 2011), which involves discretization of the continuous state space. In Figure 7.10, the experimental results obtained by DP with different numbers of footprint candidates in each step of the DP search are plotted together with the result obtained by RL. This shows that the execution time of the DP method increases significantly as the number of footprint candidates increases. In the DP method, the best return value 26.27 is achieved when the number of footprint candidates is set at 180. Although this maximum value is comparable to the return obtained by the RL method (26.44), RL is about 50 times faster than the DP method.

FIGURE 7.10: Average return and computation time for reinforcement learning (RL) and dynamic programming (DP) as functions of the number of footprint candidates: (a) average return and (b) computation time.

Figure 7.11 shows some exemplary strokes generated by RL (the top two rows) and DP (the bottom two rows). This shows that the agent trained by RL is able to draw nice strokes with stable poses after the 30th policy update iteration (see also Figure 7.9). On the other hand, as illustrated in Figure 7.11, the DP results for 5, 60, and 100 footprint candidates are unacceptably poor. Given that the DP method requires manual tuning of the number of footprint candidates at each step for each input shape, the RL method is demonstrated to be promising.

The RL method is further applied to more realistic shapes, illustrated in Figure 7.12. Although these shapes are not included in the training samples, the RL method can produce smooth and natural brush strokes for various unlearned shapes. More results are illustrated in Figure 7.13, showing that the RL method is promising for photo conversion into the sumie style.

7.5

Remarks

Page 356: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

Inthischapter,gradient-basedalgorithmsfordirectpolicysearchareintro-

duced.Thesegradient-basedmethodsaresuitableforcontrollingvulnerable

physicalsystemssuchashumanoidrobots,thankstothenatureofgradient

methodsthatparametersareupdatedgradually.Furthermore,directpolicy

searchcanhandlecontinuousactionsinastraightforwardway,whichisan

advantageoverpolicyiteration,explainedinPartII.


FIGURE 7.11: Examples of strokes generated by RL and DP. The top two rows show the RL results over policy update iterations (the 1st, 10th, 20th, 30th, and 40th iterations), while the bottom two rows show the DP results for different numbers of footprint candidates (5, 60, 100, 140, and 180 candidates). The line segment connects the center and the tip of a footprint, and the circle denotes the bottom circle of the footprint.

The gradient-based method was successfully applied to automatic sumi-e painting generation. Considering local measurements in state design was shown to be useful, which allowed a brush agent to learn a general drawing policy that is independent of a specific entire shape. Another important factor was to train the brush agent on partial shapes, not the entire shapes. This contributed highly to enhancing the generalization ability to new shapes, because even when a new shape is quite different from the training data as a whole, it often contains similar partial shapes. In this kind of real-world application, manually designing immediate reward functions is often time consuming and difficult. The use of inverse reinforcement learning (Abbeel & Ng, 2004) would be a promising approach for this purpose. In particular, in the context of sumi-e drawing, such data-driven design of reward functions will allow automatic learning of the style of a particular artist from his/her drawings.


FIGURE 7.12: Results on new shapes: (a) real photo, (b) user input boundaries, (c) trajectories estimated by RL, and (d) rendering results.

A practical weakness of the gradient-based approach is that the step size of gradient ascent is often difficult to choose. In Chapter 8, a step-size-free method of direct policy search based on the expectation-maximization algorithm will be introduced. Another critical problem of direct policy search is that policy update is rather unstable due to the stochasticity of policies. Although variance reduction by baseline subtraction can mitigate this problem to some extent, the instability problem is still critical in practice. The natural gradient method could be an alternative, but computing the inverse Riemannian metric tends to be unstable. In Chapter 9, another gradient approach that can address the instability problem will be introduced.


FIGURE 7.13: Photo conversion into the sumi-e style.

Chapter 8
Direct Policy Search by Expectation-Maximization

Gradient-based direct policy search methods introduced in Chapter 7 are useful particularly in controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method based on the expectation-maximization (EM) algorithm that does not contain the step size parameter. In Section 8.1, the main idea of the EM-based method is described, which is expected to converge faster because policies are more aggressively updated than the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve the stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Then its experimental performance is evaluated in Section 8.3 and this chapter is concluded in Section 8.4.

8.1 Expectation-Maximization Approach

The gradient-based optimization algorithms introduced in Section 7.2 gradually update policy parameters over iterations. Although this is advantageous when controlling a physical system, it requires many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.

The basic idea of EM-based policy search is to iteratively update the policy parameter θ by maximizing a lower bound of the expected return J(θ):

J(\theta) = \int p(h|\theta) R(h) \,\mathrm{d}h.

To derive a lower bound of J(θ), Jensen's inequality (Bishop, 2006) is utilized:

\int q(h) f(g(h)) \,\mathrm{d}h \ge f\!\left( \int q(h) g(h) \,\mathrm{d}h \right),

where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = −log t, Jensen's inequality yields

\int q(h) \log g(h) \,\mathrm{d}h \le \log \int q(h) g(h) \,\mathrm{d}h.   (8.1)

Assume that the return R(h) is non-negative. Let \tilde{\theta} be the current policy parameter during the optimization procedure, and q and g in Eq. (8.1) are set as

q(h) = \frac{p(h|\tilde{\theta}) R(h)}{J(\tilde{\theta})} \quad \text{and} \quad g(h) = \frac{p(h|\theta)}{p(h|\tilde{\theta})}.

Then the following lower bound holds for all θ:

\log \frac{J(\theta)}{J(\tilde{\theta})} = \log \int \frac{p(h|\theta) R(h)}{J(\tilde{\theta})} \,\mathrm{d}h
= \log \int \frac{p(h|\tilde{\theta}) R(h)}{J(\tilde{\theta})} \frac{p(h|\theta)}{p(h|\tilde{\theta})} \,\mathrm{d}h
\ge \int \frac{p(h|\tilde{\theta}) R(h)}{J(\tilde{\theta})} \log \frac{p(h|\theta)}{p(h|\tilde{\theta})} \,\mathrm{d}h.

This yields

\log J(\theta) \ge \log \tilde{J}(\theta),

where

\log \tilde{J}(\theta) = \int \frac{R(h) p(h|\tilde{\theta})}{J(\tilde{\theta})} \log \frac{p(h|\theta)}{p(h|\tilde{\theta})} \,\mathrm{d}h + \log J(\tilde{\theta}).

In the EM approach, the parameter θ is iteratively updated by maximizing the lower bound \tilde{J}(θ):

\hat{\theta} = \operatorname*{argmax}_{\theta} \tilde{J}(\theta).

Since \log \tilde{J}(\tilde{\theta}) = \log J(\tilde{\theta}), the lower bound \tilde{J} touches the target function J at the current solution \tilde{\theta}:

\tilde{J}(\tilde{\theta}) = J(\tilde{\theta}).

Thus, monotone non-decrease of the expected return is guaranteed:

J(\hat{\theta}) \ge J(\tilde{\theta}).

This update is iterated until convergence (see Figure 8.1).

FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter θ is updated iteratively by maximizing the lower bound \tilde{J}(θ), which touches the true expected return J(θ) at the current solution \tilde{\theta}.

Let us employ the Gaussian policy model defined as

\pi(a|s,\theta) = \pi(a|s,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right),

where θ = (µ⊤, σ)⊤ and φ(s) denotes the basis function.

The maximizer \hat{\theta} = (\hat{\mu}^\top, \hat{\sigma})^\top of the lower bound \tilde{J}(θ) can be analytically obtained as

\hat{\mu} = \left( \int p(h|\tilde{\theta}) R(h) \sum_{t=1}^T \phi(s_t)\phi(s_t)^\top \,\mathrm{d}h \right)^{-1} \int p(h|\tilde{\theta}) R(h) \sum_{t=1}^T a_t \phi(s_t) \,\mathrm{d}h
\approx \left( \sum_{n=1}^N R(h_n) \sum_{t=1}^T \phi(s_{t,n})\phi(s_{t,n})^\top \right)^{-1} \sum_{n=1}^N R(h_n) \sum_{t=1}^T a_{t,n} \phi(s_{t,n}),

\hat{\sigma}^2 = \left( \int p(h|\tilde{\theta}) R(h) \,\mathrm{d}h \right)^{-1} \int p(h|\tilde{\theta}) R(h) \frac{1}{T} \sum_{t=1}^T \left( a_t - \hat{\mu}^\top \phi(s_t) \right)^2 \mathrm{d}h
\approx \left( \sum_{n=1}^N R(h_n) \right)^{-1} \sum_{n=1}^N R(h_n) \frac{1}{T} \sum_{t=1}^T \left( a_{t,n} - \hat{\mu}^\top \phi(s_{t,n}) \right)^2,

where the expectation over h is approximated by the average over roll-out samples H = \{h_n\}_{n=1}^N from the current policy \tilde{\theta}:

h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].

Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
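The update above is just a reward-weighted least-squares fit, so it is straightforward to implement. The following is a minimal NumPy sketch of one RWR iteration, assuming the roll-out data are already arranged as arrays; the array names (Phi, A, R) and the small ridge term are illustrative additions, not part of the original formulation.

    import numpy as np

    def rwr_update(Phi, A, R, ridge=1e-8):
        """One reward-weighted regression (RWR) update of the Gaussian policy.

        Phi : (N, T, B) basis features phi(s_{t,n})
        A   : (N, T)    actions a_{t,n}
        R   : (N,)      non-negative returns R(h_n)
        Returns the new mean vector mu (B,) and standard deviation sigma.
        """
        N, T, B = Phi.shape
        # Reward-weighted feature second moment and feature-action correlation.
        M = np.einsum('n,ntb,ntc->bc', R, Phi, Phi) + ridge * np.eye(B)
        v = np.einsum('n,nt,ntb->b', R, A, Phi)
        mu = np.linalg.solve(M, v)
        # Reward-weighted residual variance, averaged over the T steps of each episode.
        resid = A - Phi @ mu                          # (N, T)
        sigma2 = (R * (resid ** 2).mean(axis=1)).sum() / R.sum()
        return mu, np.sqrt(sigma2)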

8.2 Sample Reuse

In practice, a large number of samples is needed to obtain a stable policy update estimator in the EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with the instability problem.

8.2.1 Episodic Importance Weighting

The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy πℓ in the ℓ-th policy update iteration:

H^{\pi_\ell} = \{ h^{\pi_\ell}_1, \ldots, h^{\pi_\ell}_N \},

where each trajectory sample h^{\pi_\ell}_n is given as

h^{\pi_\ell}_n = [ s^{\pi_\ell}_{1,n}, a^{\pi_\ell}_{1,n}, \ldots, s^{\pi_\ell}_{T,n}, a^{\pi_\ell}_{T,n}, s^{\pi_\ell}_{T+1,n} ].

We want to utilize all these samples to improve the current policy.

Suppose that we are currently at the L-th policy update iteration. If the policies \{\pi_\ell\}_{\ell=1}^L remain unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 gives a consistent estimator \hat{\theta}^{\mathrm{NIW}}_{L+1} = (\hat{\mu}^{\mathrm{NIW}\top}_{L+1}, \hat{\sigma}^{\mathrm{NIW}}_{L+1})^\top, where

\hat{\mu}^{\mathrm{NIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \sum_{t=1}^T \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \sum_{t=1}^T a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}),

(\hat{\sigma}^{\mathrm{NIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \right)^{-1} \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \frac{1}{T} \sum_{t=1}^T \left( a^{\pi_\ell}_{t,n} - \hat{\mu}^{\mathrm{NIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \right)^2.

The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, data samples \{H^{\pi_\ell}\}_{\ell=1}^L collected over iterations generally follow different probability distributions induced by different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.

In the same way as the discussion in Chapter 4, importance sampling can be used to cope with this problem. The basic idea of importance sampling is to weight the samples drawn from a different distribution to match the target distribution. More specifically, from i.i.d. (independent and identically distributed) samples \{h^{\pi_\ell}_n\}_{n=1}^N following p(h|\theta_\ell), the expectation of a function g(h) over another probability density function p(h|\theta_L) can be estimated in a consistent manner by the importance-weighted average:

\frac{1}{N} \sum_{n=1}^N g(h^{\pi_\ell}_n) \frac{p(h^{\pi_\ell}_n|\theta_L)}{p(h^{\pi_\ell}_n|\theta_\ell)}
\;\xrightarrow{N\to\infty}\; \mathbb{E}_{p(h|\theta_\ell)}\!\left[ g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} \right]
= \int g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} p(h|\theta_\ell) \,\mathrm{d}h
= \int g(h) p(h|\theta_L) \,\mathrm{d}h
= \mathbb{E}_{p(h|\theta_L)}[g(h)].

The ratio of two densities p(h|\theta_L)/p(h|\theta_\ell) is called the importance weight for trajectory h.

This importance sampling technique can be employed in RWR to obtain a consistent estimator \hat{\theta}^{\mathrm{EIW}}_{L+1} = (\hat{\mu}^{\mathrm{EIW}\top}_{L+1}, \hat{\sigma}^{\mathrm{EIW}}_{L+1})^\top, where

\hat{\mu}^{\mathrm{EIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^T \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^T a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}),

(\hat{\sigma}^{\mathrm{EIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \right)^{-1} \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \frac{1}{T} \sum_{t=1}^T \left( a^{\pi_\ell}_{t,n} - \hat{\mu}^{\mathrm{EIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \right)^2.

Here, w^{(L,\ell)}(h) denotes the importance weight defined by

w^{(L,\ell)}(h) = \frac{p(h|\theta_L)}{p(h|\theta_\ell)}.

The superscript "EIW" stands for "episodic importance weight."

p(h|\theta_L) and p(h|\theta_\ell) denote the probability densities of observing trajectory

h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]

under policy parameters \theta_L and \theta_\ell, which can be explicitly written as

p(h|\theta_L) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, \theta_L),
p(h|\theta_\ell) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, \theta_\ell).

The two probability densities p(h|\theta_L) and p(h|\theta_\ell) both contain the unknown probability densities p(s_1) and \{p(s_{t+1}|s_t,a_t)\}_{t=1}^T. However, since they cancel out in the importance weight, it can be computed without the knowledge of p(s) and p(s'|s,a) as

w^{(L,\ell)}(h) = \frac{\prod_{t=1}^T \pi(a_t|s_t, \theta_L)}{\prod_{t=1}^T \pi(a_t|s_t, \theta_\ell)}.

Although the importance-weighted estimator \hat{\theta}^{\mathrm{EIW}}_{L+1} is guaranteed to be consistent, it tends to have large variance (Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Therefore, the importance-weighted estimator tends to be unstable when the number of episodes N is rather small.
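Because the transition densities cancel, the episodic importance weight reduces to a ratio of action likelihoods accumulated over the trajectory. A minimal sketch for the Gaussian policy above, assuming NumPy arrays and working in log-space for numerical stability (the function names are illustrative):

    import numpy as np

    def gauss_log_pi(a, phi, mu, sigma):
        """Log-density of the Gaussian policy pi(a|s) = N(mu^T phi(s), sigma^2)."""
        mean = phi @ mu
        return -0.5 * np.log(2 * np.pi * sigma ** 2) - (a - mean) ** 2 / (2 * sigma ** 2)

    def episodic_importance_weight(A, Phi, theta_L, theta_ell):
        """w^{(L,ell)}(h): product over t of pi(a_t|s_t,theta_L) / pi(a_t|s_t,theta_ell).

        A: (T,) actions; Phi: (T, B) features; theta_* = (mu, sigma) tuples.
        """
        log_num = gauss_log_pi(A, Phi, *theta_L).sum()
        log_den = gauss_log_pi(A, Phi, *theta_ell).sum()
        return np.exp(log_num - log_den)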

8.2.2 Per-Decision Importance Weight

Since the reward at the t-th step does not depend on future state-action transitions after the t-th step, an episodic importance weight can be decomposed into stepwise importance weights (Precup et al., 2000). For instance, the expected return J(\theta_L) can be expressed as

J(\theta_L) = \int R(h) p(h|\theta_L) \,\mathrm{d}h
= \int \sum_{t=1}^T \gamma^{t-1} r(s_t,a_t,s_{t+1})\, w^{(L,\ell)}(h)\, p(h|\theta_\ell) \,\mathrm{d}h
= \int \sum_{t=1}^T \gamma^{t-1} r(s_t,a_t,s_{t+1})\, w^{(L,\ell)}_t(h)\, p(h|\theta_\ell) \,\mathrm{d}h,

where w^{(L,\ell)}_t(h) is the t-step importance weight, called the per-decision importance weight (PIW), defined as

w^{(L,\ell)}_t(h) = \frac{\prod_{t'=1}^t \pi(a_{t'}|s_{t'}, \theta_L)}{\prod_{t'=1}^t \pi(a_{t'}|s_{t'}, \theta_\ell)}.

Here, the PIW idea is applied to RWR and a more stable algorithm is developed. A slight complication is that the policy update formulas given in Section 8.2.1 contain double sums over T steps, e.g.,

R(h) \sum_{t'=1}^T \phi(s_{t'}) \phi(s_{t'})^\top = \sum_{t,t'=1}^T \gamma^{t-1} r(s_t,a_t,s_{t+1})\, \phi(s_{t'}) \phi(s_{t'})^\top.

In this case, the summand

\gamma^{t-1} r(s_t,a_t,s_{t+1})\, \phi(s_{t'}) \phi(s_{t'})^\top

does not depend on future state-action pairs after the max(t,t')-th step. Thus, the episodic importance weight for this summand can be simplified to the per-decision importance weight w^{(L,\ell)}_{\max(t,t')}. Consequently, the PIW-based policy update rules are given as

\hat{\mu}^{\mathrm{PIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n}\, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{-1}
\sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n}\, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n})\, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n),

(\hat{\sigma}^{\mathrm{PIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t=1}^T \gamma^{t-1} r_{t,n}\, w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{-1}
\sum_{\ell=1}^L \sum_{n=1}^N \frac{1}{T} \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \left( a^{\pi_\ell}_{t',n} - \hat{\mu}^{\mathrm{PIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \right)^2 w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n),

where

r_{t,n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

This PIW estimator \hat{\theta}^{\mathrm{PIW}}_{L+1} = (\hat{\mu}^{\mathrm{PIW}\top}_{L+1}, \hat{\sigma}^{\mathrm{PIW}}_{L+1})^\top is consistent and potentially more stable than the plain EIW estimator \hat{\theta}^{\mathrm{EIW}}_{L+1}.

8.2.3 Adaptive Per-Decision Importance Weighting

To more actively control the stability of the PIW estimator, the adaptive per-decision importance weight (AIW) is employed. More specifically, an importance weight w^{(L,\ell)}_{\max(t,t')}(h) is "flattened" by a flattening parameter ν ∈ [0, 1] as \left( w^{(L,\ell)}_{\max(t,t')}(h) \right)^{\nu}, i.e., the ν-th power of the per-decision importance weight. Then we have \hat{\theta}^{\mathrm{AIW}}_{L+1} = (\hat{\mu}^{\mathrm{AIW}\top}_{L+1}, \hat{\sigma}^{\mathrm{AIW}}_{L+1})^\top, where

\hat{\mu}^{\mathrm{AIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n}\, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right)^{-1}
\sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n}\, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu},

(\hat{\sigma}^{\mathrm{AIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t=1}^T \gamma^{t-1} r_{t,n} \left( w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{\nu} \right)^{-1}
\sum_{\ell=1}^L \sum_{n=1}^N \frac{1}{T} \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \left( a^{\pi_\ell}_{t',n} - \hat{\mu}^{\mathrm{AIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \right)^2 \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu}.

When ν = 0, AIW is reduced to NIW. Therefore, it is relatively stable, but not consistent. On the other hand, when ν = 1, AIW is reduced to PIW. Therefore, it is consistent, but rather unstable. In practice, an intermediate ν often produces a better estimator. Note that the value of the flattening parameter can be different in each iteration, i.e., ν may be replaced by ν_ℓ. However, for simplicity, a single common value ν is considered here.

8.2.4 Automatic Selection of Flattening Parameter

The flattening parameter allows us to control the trade-off between consistency and stability. Here, we show how the value of the flattening parameter can be optimally chosen using data samples.

The goal of policy search is to find the optimal policy that maximizes the expected return J(θ). Therefore, the optimal flattening parameter value ν*_L at the L-th iteration is given by

\nu^*_L = \operatorname*{argmax}_{\nu} J\!\left( \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) \right).

Directly obtaining ν*_L requires the computation of the expected return J(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)) for each candidate of ν. To this end, data samples following \pi(a|s; \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)) are needed for each ν, which is prohibitively expensive. To reuse samples generated by previous policies, a variation of cross-validation called importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is employed.

The basic idea of IWCV is to split the training dataset H^{\pi_{1:L}} = \{H^{\pi_\ell}\}_{\ell=1}^L into an "estimation part" and a "validation part." Then the policy parameter \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) is learned from the estimation part and its expected return J(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)) is approximated using the importance-weighted loss for the validation part. As pointed out in Section 8.2.1, importance weighting tends to be unstable when the number N of episodes is small. For this reason, per-decision importance weighting is used for cross-validation. Below, how IWCV is applied to the selection of the flattening parameter ν in the current context is explained in more detail.

Let us divide the training dataset H^{\pi_{1:L}} = \{H^{\pi_\ell}\}_{\ell=1}^L into K disjoint subsets \{H^{\pi_{1:L}}_k\}_{k=1}^K of the same size, where each H^{\pi_{1:L}}_k contains N/K episodic samples from every H^{\pi_\ell}. For simplicity, we assume that N is divisible by K, i.e., N/K is an integer. K = 5 will be used in the experiments later.

Let \hat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu) be the policy parameter learned from \{H^{\pi_{1:L}}_{k'}\}_{k' \ne k} (i.e., all data without H^{\pi_{1:L}}_k) by AIW estimation. The expected return of \hat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu) is estimated using the PIW estimator from H^{\pi_{1:L}}_k as

\hat{J}^{k}_{\mathrm{IWCV}}\!\left( \hat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu) \right) = \frac{1}{\eta} \sum_{h \in H^{\pi_{1:L}}_k} \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1})\, w^{(L,\ell)}_t(h),

where η is a normalization constant. An ordinary choice is η = LN/K, but a more stable variant given by

\eta = \sum_{h \in H^{\pi_{1:L}}_k} w^{(L,\ell)}_t(h)

is often preferred in practice (Precup et al., 2000).

The above procedure is repeated for all k = 1, …, K, and the average score,

\hat{J}_{\mathrm{IWCV}}\!\left( \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) \right) = \frac{1}{K} \sum_{k=1}^K \hat{J}^{k}_{\mathrm{IWCV}}\!\left( \hat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu) \right),

is computed. This is the K-fold IWCV estimator of J(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)), which was shown to be almost unbiased (Sugiyama et al., 2007).

This K-fold IWCV score is computed for each candidate value of the flattening parameter ν and the one that maximizes the IWCV score is chosen:

\hat{\nu}_{\mathrm{IWCV}} = \operatorname*{argmax}_{\nu} \hat{J}_{\mathrm{IWCV}}\!\left( \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) \right).

This IWCV scheme can also be used for choosing the basis functions φ(s) in the Gaussian policy model.

Note that when the importance weights w^{(L,\ell)}_{\max(t,t')} are all one (i.e., no importance weighting), the above IWCV procedure is reduced to the ordinary CV procedure. The use of IWCV is essential here since the target policy \pi(a|s, \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)) is usually different from the previous policies used for collecting the data samples H^{\pi_{1:L}}. Therefore, the expected return estimated using ordinary CV, \hat{J}_{\mathrm{CV}}(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)), would be heavily biased.
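The following sketch outlines the K-fold IWCV selection of ν. The AIW update and the per-decision weights are passed in as callables (aiw_update and pdw), which are hypothetical helpers, and for brevity the pooled episodes are split rather than splitting each H^{πℓ} separately as in the text; the stable normalizer described above is used.

    import numpy as np

    def select_nu_by_iwcv(episodes, aiw_update, pdw,
                          nu_candidates=(0.0, 0.25, 0.5, 0.75, 1.0),
                          K=5, gamma=0.99):
        """K-fold IWCV selection of the flattening parameter nu.

        episodes   : list of dicts, each with an array of rewards under key 'r'
        aiw_update : callable(list_of_episodes, nu) -> candidate policy
        pdw        : callable(episode, policy) -> per-decision weights w_t, shape (T,)
        """
        idx = np.arange(len(episodes))
        folds = np.array_split(idx, K)
        scores = []
        for nu in nu_candidates:
            fold_scores = []
            for k in range(K):
                train = [episodes[i] for i in idx if i not in folds[k]]
                valid = [episodes[i] for i in folds[k]]
                policy = aiw_update(train, nu)            # estimation part
                num = den = 0.0
                for ep in valid:                          # validation part (PIW-weighted)
                    w = pdw(ep, policy)
                    disc = gamma ** np.arange(len(ep['r']))
                    num += np.sum(disc * ep['r'] * w)
                    den += np.sum(w)                      # stable normalizer eta
                fold_scores.append(num / den)
            scores.append(np.mean(fold_scores))
        return nu_candidates[int(np.argmax(scores))]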

8.2.5 Reward-Weighted Regression with Sample Reuse

So far, we have introduced AIW to control the stability of the policy-parameter update and IWCV to automatically choose the flattening parameter based on the estimated expected return. The policy search algorithm that combines these two methods is called reward-weighted regression with sample reuse (RRR).

In each iteration (L = 1, 2, …) of RRR, episodic data samples H^{\pi_L} are collected following the current policy \pi(a|s, \theta^{\mathrm{AIW}}_L), the flattening parameter ν is chosen so as to maximize the expected return \hat{J}_{\mathrm{IWCV}}(\nu) estimated by IWCV using \{H^{\pi_\ell}\}_{\ell=1}^L, and then the policy parameter is updated to \theta^{\mathrm{AIW}}_{L+1} using \{H^{\pi_\ell}\}_{\ell=1}^L.
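Putting the pieces together, one RRR iteration interleaves on-policy data collection, IWCV-based selection of ν, and the AIW update. A schematic sketch, again with the data-collection and update routines passed in as assumed callables and reusing select_nu_by_iwcv from the previous snippet (this is not the book's exact implementation):

    def rrr(initial_policy, collect_episodes, aiw_update, pdw, n_iterations=10):
        """Schematic RRR loop; all callables are assumed helpers, not from the text."""
        history, policy = [], initial_policy
        for _ in range(n_iterations):
            history = history + collect_episodes(policy)      # new on-policy episodes H^{pi_L}
            nu = select_nu_by_iwcv(history, aiw_update, pdw)   # choose flattening parameter
            policy = aiw_update(history, nu)                   # update to theta^{AIW}_{L+1}
        return policy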

FIGURE 8.2: Ball balancing using a robot arm simulator. Two joints of the robot (the wrist and the elbow) are controlled to keep the ball in the middle of the tray.

8.3 Numerical Examples

The performance of RRR is experimentally evaluated on a ball-balancing task using a robot arm simulator (Schaal, 2009).

As illustrated in Figure 8.2, a 7-degree-of-freedom arm is mounted on the ceiling upside down, which is equipped with a circular tray of radius 0.24 [m] at the end effector. The goal is to control the joints of the robot so that the ball is brought to the middle of the tray. However, the difficulty is that the angle of the tray cannot be controlled directly, which is a typical restriction in real-world joint-motion planning based on feedback from the environment (e.g., the state of the ball).

To simplify the problem, only two joints are controlled here: the wrist angle α_roll and the elbow angle α_pitch. All the remaining joints are fixed. Control of the wrist and elbow angles would roughly correspond to changing the roll and pitch angles of the tray, but not directly.

Two separate control subsystems are designed here, each of which is in charge of controlling the roll and pitch angles. Each subsystem has its own policy parameter θ, state space S, and action space A. The state space S is continuous and consists of (x, ẋ), where x [m] is the position of the ball on the tray along each axis and ẋ [m/s] is the velocity of the ball. The action space A is continuous and corresponds to the target angle a [rad] of the joint. The reward function is defined as

r(s, a, s') = \exp\!\left( -\frac{5(x')^2 + (\dot{x}')^2 + a^2}{2(0.24/2)^2} \right),

where the number 0.24 in the denominator comes from the radius of the tray. Below, how the control system is designed is explained in more detail.
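A direct transcription of this reward function reads as follows; the argument names are illustrative, and x_next and xdot_next denote the ball position and velocity after the transition:

    import numpy as np

    def reward(x_next, xdot_next, a, tray_radius=0.24):
        """r(s, a, s') for one axis of the ball-balancing task."""
        return np.exp(-(5.0 * x_next ** 2 + xdot_next ** 2 + a ** 2)
                      / (2.0 * (tray_radius / 2.0) ** 2))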

FIGURE 8.3: The block diagram of the robot-arm control system for ball balancing. The control system has two feedback loops, i.e., joint-trajectory planning by RRR and trajectory tracking by a high-gain proportional-derivative (PD) controller.

As illustrated in Figure 8.3, the control system has two feedback loops for trajectory planning using an RRR controller and trajectory tracking using a high-gain proportional-derivative (PD) controller (Siciliano & Khatib, 2008). The RRR controller outputs the target joint angle obtained by the current policy at every 0.2 [s]. Nine Gaussian kernels are used as basis functions φ(s), with the kernel centers \{c_b\}_{b=1}^{9} located over the state space at

(x, \dot{x}) \in \{ (-0.2,-0.4), (-0.2,0), (-0.1,0.4), (0,-0.4), (0,0), (0,0.4), (0.1,-0.4), (0.2,0), (0.2,0.4) \}.

The Gaussian width is set at σ_basis = 0.1. Based on the discrete-time target angles obtained by RRR, the desired joint trajectory in the continuous time domain is linearly interpolated as

a_{t,u} = a_t + u \dot{a}_t,

where u is the time from the last output a_t of RRR at the t-th step. \dot{a}_t is the angular velocity computed by

\dot{a}_t = \frac{a_t - a_{t-1}}{0.2},

where a_0 is the initial angle of a joint. The angular velocity is assumed to be constant during the 0.2 [s] cycle of trajectory planning.

On the other hand, the PD controller converts desired joint trajectories to motor torques as

\tau_{t,u} = \mu_p * (a_{t,u} - \alpha_{t,u}) + \mu_d * (\dot{a}_t - \dot{\alpha}_{t,u}),

where τ is the 2-dimensional vector consisting of the torques applied to the wrist and elbow joints. a = (a_pitch, a_roll)⊤ and \dot{a} = (\dot{a}_pitch, \dot{a}_roll)⊤ are the 2-dimensional vectors consisting of the desired angles and velocities. α = (α_pitch, α_roll)⊤ and \dot{α} = (\dot{α}_pitch, \dot{α}_roll)⊤ are the 2-dimensional vectors consisting of the current joint angles and velocities. µ_p and µ_d are the 2-dimensional vectors consisting of the proportional and derivative gains. "∗" denotes the element-wise product. Since the control cycle of the robot arm is 0.002 [s], the PD controller is applied 100 times (i.e., u = 0.002, 0.004, …, 0.198, 0.2) in each RRR cycle.
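The planning/tracking loop described above can be sketched as follows, using the 0.2 [s] planning cycle and 0.002 [s] control cycle from the text. The simulator interface (read_state, apply_torque) and the gain vectors mu_p, mu_d are assumed placeholders, since their numerical values are not specified here.

    def track_target(a_t, a_prev, read_state, apply_torque, mu_p, mu_d,
                     plan_cycle=0.2, control_cycle=0.002):
        """Track one RRR target angle a_t (2-dim arrays: pitch, roll) with a PD controller.

        read_state() -> (alpha, alpha_dot) current joint angles/velocities;
        apply_torque(tau) sends torques to the arm; mu_p, mu_d are per-joint gains.
        """
        a_dot = (a_t - a_prev) / plan_cycle               # constant angular velocity
        steps = int(round(plan_cycle / control_cycle))    # 100 PD updates per planning step
        for k in range(1, steps + 1):
            u = k * control_cycle
            a_desired = a_t + u * a_dot                   # linearly interpolated desired angle
            alpha, alpha_dot = read_state()
            tau = mu_p * (a_desired - alpha) + mu_d * (a_dot - alpha_dot)
            apply_torque(tau)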

Figure 8.4 depicts a desired trajectory of the wrist joint generated by a random policy and an actual trajectory obtained using the high-gain PD controller described above. The graphs show that the desired trajectory is followed by the robot arm reasonably well.

The policy parameter θ_L is learned through the RRR iterations. The initial policy parameters θ_1 = (µ_1⊤, σ_1)⊤ are set manually as

\mu_1 = (-0.5, -0.5, 0, -0.5, 0, 0, 0, 0, 0)^\top \quad \text{and} \quad \sigma_1 = 0.1,

so that a wide range of states and actions can be safely explored in the first iteration. The initial position of the ball is randomly selected as x ∈ [−0.05, 0.05]. The dataset collected in each iteration consists of 10 episodes with 20 steps. The duration of an episode is 4 [s] and the sampling cycle by RRR is 0.2 [s]. Three scenarios are considered here:

• NIW: Sample reuse with ν = 0.

• PIW: Sample reuse with ν = 1.

• RRR: Sample reuse with ν chosen by IWCV from {0, 0.25, 0.5, 0.75, 1} in each iteration.

The discount factor is set at γ = 0.99. Figure 8.5 depicts the averaged expected return over 10 trials as a function of the number of policy update iterations. The expected return in each trial is computed from 20 test episodic samples that have not been used for training. The graph shows that RRR nicely improves the performance over iterations. On the other hand, the performance for ν = 0 is saturated after the 3rd iteration, and the performance for ν = 1 is improved in the beginning but suddenly goes down at the 5th iteration. The result for ν = 1 indicates that a large change in policies causes severe instability in sample reuse.

Figure 8.6 and Figure 8.7 depict examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, resulting ball movement x, and reward r for policies obtained by NIW (ν = 0) and RRR (ν is chosen by IWCV) after the 10th iteration. By the policy obtained by NIW, the ball goes through the middle of the tray, i.e., (x_roll, x_pitch) = (0, 0), and does not stop. On the other hand, the policy obtained by RRR successfully guides the ball to the middle of the tray along the roll axis, although the movement along the pitch axis looks similar to that by NIW. Motion examples by RRR with ν chosen by IWCV are illustrated in Figure 8.8.

FIGURE 8.4: An example of desired and actual trajectories of the wrist joint in the realistic ball-balancing task: (a) trajectory in angles and (b) trajectory in angular velocities. The target joint angle is determined by a random policy at every 0.2 [s], and then a linearly interpolated angle and constant velocity are tracked using the proportional-derivative (PD) controller in the cycle of 0.002 [s].

FIGURE 8.5: The performance of learned policies when ν = 0 (NIW), ν = 1 (PIW), and ν is chosen by IWCV (RRR) in ball balancing using a simulated robot-arm system. The performance is measured by the return averaged over 10 trials. The symbol “” indicates that the method is the best or comparable to the best one in terms of the expected return by the t-test at the significance level 5%, performed at each iteration. The error bars indicate 1/10 of a standard deviation.

FIGURE 8.6: Typical examples of trajectories of wrist angle α_roll, elbow angle α_pitch, resulting ball movement x, and reward r for policies obtained by NIW (ν = 0) at the 10th iteration in the ball-balancing task.

FIGURE 8.7: Typical examples of trajectories of wrist angle α_roll, elbow angle α_pitch, resulting ball movement x, and reward r for policies obtained by RRR (ν is chosen by IWCV) at the 10th iteration in the ball-balancing task.

FIGURE 8.8: Motion examples of ball balancing by RRR (from left to right and top to bottom).

8.4 Remarks

A direct policy search algorithm based on expectation-maximization (EM) iteratively maximizes the lower bound of the expected return. The EM-based approach does not include the step size parameter, which is an advantage over the gradient-based approach introduced in Chapter 7. A sample-reuse variant of the EM-based method was also provided, which contributes to improving the stability of the algorithm in small-sample scenarios.

In practice, however, the EM-based approach is still rather unstable even if it is combined with the sample-reuse technique. In Chapter 9, another policy search approach will be introduced to further improve the stability of policy updates.

Chapter 9
Policy-Prior Search

The direct policy search methods explained in Chapter 7 and Chapter 8 are useful in solving problems with continuous actions such as robot control. However, they tend to suffer from instability of policy update. In this chapter, we introduce an alternative policy search method called policy-prior search, which is adopted in the PGPE (policy gradients with parameter-based exploration) method (Sehnke et al., 2010). The basic idea is to use deterministic policies to remove excessive randomness and introduce useful stochasticity by considering a prior distribution for policy parameters.

After formulating the problem of policy-prior search in Section 9.1, a gradient-based algorithm is introduced in Section 9.2, including its improvement using baseline subtraction, theoretical analysis, and experimental evaluation. Then, in Section 9.3, a sample-reuse variant is described and its performance is theoretically analyzed and experimentally investigated using a humanoid robot. Finally, this chapter is concluded in Section 9.4.

9.1 Formulation

In this section, the policy search problem is formulated based on policy priors.

The basic idea is to use a deterministic policy and introduce stochasticity by drawing policy parameters from a prior distribution. More specifically, policy parameters are randomly determined following the prior distribution at the beginning of each trajectory, and thereafter action selection is deterministic (Figure 9.1). Note that transitions are generally stochastic, and thus trajectories are also stochastic even though the policy is deterministic. Thanks to this per-trajectory formulation, the variance of gradient estimators in policy-prior search does not increase with respect to the trajectory length, which allows us to overcome the critical drawback of direct policy search.

Policy-prior search uses a deterministic policy with typically a linear architecture:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),

where δ(·) is the Dirac delta function and φ(s) is the basis function.

FIGURE 9.1: Illustration of the stochastic policy and the deterministic policy with a prior under deterministic transition: (a) stochastic policy and (b) deterministic policy with prior. The number of possible trajectories is exponential with respect to the trajectory length when stochastic policies are used, while it does not grow when deterministic policies drawn from a prior distribution are used.

The policy parameter θ is drawn from a prior distribution p(θ|ρ) with hyper-parameter ρ.

The expected return in policy-prior search is defined in terms of the expectations over both trajectory h and policy parameter θ as a function of hyper-parameter ρ:

J(\rho) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[R(h)] = \iint p(h|\theta)\, p(\theta|\rho)\, R(h) \,\mathrm{d}h \,\mathrm{d}\theta,

where \mathbb{E}_{p(h|\theta)p(\theta|\rho)} denotes the expectation over trajectory h and policy parameter θ drawn from p(h|\theta)p(\theta|\rho). In policy-prior search, the hyper-parameter ρ is optimized so that the expected return J(ρ) is maximized. Thus, the optimal hyper-parameter ρ* is given by

\rho^* = \operatorname*{argmax}_{\rho} J(\rho).

9.2 Policy Gradients with Parameter-Based Exploration

In this section, a gradient-based algorithm for policy-prior search is given.

9.2.1 Policy-Prior Gradient Ascent

Here, a gradient method is used to find a local maximizer of the expected return J with respect to hyper-parameter ρ:

\rho \longleftarrow \rho + \varepsilon \nabla_\rho J(\rho),

where ε is a small positive constant and \nabla_\rho J(\rho) is the derivative of J with respect to ρ:

\nabla_\rho J(\rho) = \iint p(h|\theta)\, \nabla_\rho p(\theta|\rho)\, R(h) \,\mathrm{d}h\,\mathrm{d}\theta
= \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_\rho \log p(\theta|\rho)\, R(h) \,\mathrm{d}h\,\mathrm{d}\theta
= \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ \nabla_\rho \log p(\theta|\rho)\, R(h) \big],

where the logarithmic derivative,

\nabla_\rho \log p(\theta|\rho) = \frac{\nabla_\rho p(\theta|\rho)}{p(\theta|\rho)},

was used in the derivation. The expectations over h and θ are approximated by the empirical averages:

\widehat{\nabla}_\rho J(\rho) = \frac{1}{N} \sum_{n=1}^N \nabla_\rho \log p(\theta_n|\rho)\, R(h_n),   (9.1)

where each trajectory sample h_n is drawn independently from p(h|\theta_n) and parameter \theta_n is drawn from p(\theta|\rho). Thus, in policy-prior search, samples are pairs of θ and h:

H = \{ (\theta_1, h_1), \ldots, (\theta_N, h_N) \}.

As the prior distribution for policy parameter \theta = (\theta_1, \ldots, \theta_B)^\top, where B is the dimensionality of the basis vector φ(s), the independent Gaussian distribution is a standard choice. For this Gaussian prior, the hyper-parameter ρ consists of prior means \eta = (\eta_1, \ldots, \eta_B)^\top and prior standard deviations \tau = (\tau_1, \ldots, \tau_B)^\top:

p(\theta|\eta,\tau) = \prod_{b=1}^B \frac{1}{\tau_b \sqrt{2\pi}} \exp\!\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.2)

Then the derivatives of the log-prior \log p(\theta|\eta,\tau) with respect to \eta_b and \tau_b are given as

\nabla_{\eta_b} \log p(\theta|\eta,\tau) = \frac{\theta_b - \eta_b}{\tau_b^2},
\nabla_{\tau_b} \log p(\theta|\eta,\tau) = \frac{(\theta_b - \eta_b)^2 - \tau_b^2}{\tau_b^3}.

By substituting these derivatives into Eq. (9.1), the policy-prior gradients with respect to η and τ can be approximated.
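Estimating the policy-prior gradients (9.1) under the Gaussian prior (9.2) only requires the sampled parameters, their returns, and the current (η, τ). A minimal NumPy sketch with illustrative array names:

    import numpy as np

    def pgpe_gradients(Theta, R, eta, tau):
        """Empirical policy-prior gradients for the independent Gaussian prior.

        Theta : (N, B) sampled policy parameters theta_n ~ p(theta|eta, tau)
        R     : (N,)   returns R(h_n) of the corresponding roll-outs
        Returns (grad_eta, grad_tau), each of shape (B,).
        """
        diff = Theta - eta                                   # (N, B)
        grad_log_eta = diff / tau ** 2                       # d log p / d eta
        grad_log_tau = (diff ** 2 - tau ** 2) / tau ** 3     # d log p / d tau
        grad_eta = (R[:, None] * grad_log_eta).mean(axis=0)
        grad_tau = (R[:, None] * grad_log_tau).mean(axis=0)
        return grad_eta, grad_tau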

9.2.2 Baseline Subtraction for Variance Reduction

As explained in Section 7.2.2, subtraction of a baseline can reduce the variance of gradient estimators. Here, a baseline subtraction method for policy-prior search is described.

For a baseline ξ, a modified gradient estimator is given by

\widehat{\nabla}_\rho J_\xi(\rho) = \frac{1}{N} \sum_{n=1}^N (R(h_n) - \xi)\, \nabla_\rho \log p(\theta_n|\rho).

Let ξ* be the optimal baseline that minimizes the variance of the gradient:

\xi^* = \operatorname*{argmin}_{\xi} \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\rho J_\xi(\rho) \big],

where \mathrm{Var}_{p(h|\theta)p(\theta|\rho)} denotes the trace of the covariance matrix:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\zeta] = \mathrm{tr}\Big( \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta]) (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta])^\top \big] \Big)
= \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ \| \zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta] \|^2 \big].

It was shown in Zhao et al. (2012) that the optimal baseline for policy-prior search is given by

\xi^* = \frac{ \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ R(h) \| \nabla_\rho \log p(\theta|\rho) \|^2 \big] }{ \mathbb{E}_{p(\theta|\rho)}\big[ \| \nabla_\rho \log p(\theta|\rho) \|^2 \big] },

where \mathbb{E}_{p(\theta|\rho)} denotes the expectation over policy parameter θ drawn from p(θ|ρ). In practice, the expectations are approximated by the sample averages.
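The optimal baseline can be estimated from the same quantities by replacing the expectations with sample averages; a sketch continuing the previous example:

    import numpy as np

    def pgpe_gradients_with_baseline(Theta, R, eta, tau):
        """PGPE gradients with an empirical estimate of the optimal baseline xi*."""
        diff = Theta - eta
        score_eta = diff / tau ** 2
        score_tau = (diff ** 2 - tau ** 2) / tau ** 3
        score = np.hstack([score_eta, score_tau])            # full nabla_rho log p, (N, 2B)
        sq_norm = (score ** 2).sum(axis=1)                   # ||nabla_rho log p(theta_n|rho)||^2
        xi = (R * sq_norm).mean() / sq_norm.mean()           # empirical optimal baseline
        coeff = (R - xi)[:, None]
        grad_eta = (coeff * score_eta).mean(axis=0)
        grad_tau = (coeff * score_tau).mean(axis=0)
        return grad_eta, grad_tau, xi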

9.2.3 Variance Analysis of Gradient Estimators

Here the variance of gradient estimators is theoretically investigated for the independent Gaussian prior (9.2) with φ(s) = s. See Zhao et al. (2012) for technical details.

Below, subsets of the following assumptions are considered (which are the same as the ones used in Section 7.2.3):

Assumption (A): r(s,a,s') ∈ [−β, β] for β > 0.

Assumption (B): r(s,a,s') ∈ [α, β] for 0 < α < β.

Assumption (C): For δ > 0, there exist two series \{c_t\}_{t=1}^T and \{d_t\}_{t=1}^T such that \|s_t\| \ge c_t and \|s_t\| \le d_t hold with probability at least 1 - \frac{\delta}{2N}, respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A).

Let

G = \sum_{b=1}^B \tau_b^{-2}.

First, the variance of gradient estimators in policy-prior search is analyzed:

Theorem 9.1 Under Assumption (A), the following upper bounds hold:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J(\eta,\tau) \big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{\beta^2 G}{N(1-\gamma)^2},
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\tau J(\eta,\tau) \big] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{2\beta^2 G}{N(1-\gamma)^2}.

The second upper bounds are independent of the trajectory length T, while the upper bounds for direct policy search (Theorem 7.1 in Section 7.2.3) are monotone increasing with respect to the trajectory length T. Thus, gradient estimation in policy-prior search is expected to be more reliable than that in direct policy search when the trajectory length T is large.

The following theorem more explicitly compares the variance of gradient estimators in direct policy search and policy-prior search:

Theorem 9.2 In addition to Assumptions (B) and (C), assume that

\zeta(T) = C_T \alpha^2 - D_T \beta^2 / (2\pi)

is positive and monotone increasing with respect to T, where

C_T = \sum_{t=1}^T c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^T d_t^2.

If there exists T_0 such that

\zeta(T_0) \ge \beta^2 G \sigma^2,

then it holds that

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\mu J(\theta) \big] > \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J(\eta,\tau) \big]

for all T > T_0, with probability at least 1 − δ.

The above theorem means that policy-prior search is more favorable than direct policy search in terms of the variance of gradient estimators of the mean, if trajectory length T is large.

Next, the contribution of the optimal baseline to the variance of the gradient estimator with respect to mean parameter η is investigated. It was shown in Zhao et al. (2012) that the excess variance for a baseline ξ is given by

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\rho J_\xi(\rho) \big] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\rho J_{\xi^*}(\rho) \big]
= \frac{(\xi - \xi^*)^2}{N} \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ \| \nabla_\rho \log p(\theta|\rho) \|^2 \big].

Based on this expression, the following theorem holds.

Theorem 9.3 If r(s,a,s') ≥ α > 0, the following lower bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J(\eta,\tau) \big] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau) \big] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

Under Assumption (A), the following upper bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J(\eta,\tau) \big] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau) \big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

The above theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by subtracting the optimal baseline and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.

Finally, the variance of the gradient estimator with the optimal baseline is investigated:

Theorem 9.4 Under Assumptions (B) and (C), the following upper bound holds with probability at least 1 − δ:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[ \widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau) \big] \le \frac{(1-\gamma^T)^2 (\beta^2 - \alpha^2) G}{N(1-\gamma)^2} \le \frac{(\beta^2 - \alpha^2) G}{N(1-\gamma)^2}.

The second upper bound is independent of the trajectory length T, while Theorem 7.4 in Section 7.2.3 showed that the upper bound of the variance of gradient estimators with the optimal baseline in direct policy search is monotone increasing with respect to trajectory length T. Thus, when trajectory length T is large, policy-prior search is more favorable than direct policy search in terms of the variance of the gradient estimator with respect to the mean even when optimal baseline subtraction is applied.

Page 429: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

9.2.4

NumericalExamples

Here,theperformanceofthedirectpolicysearchandpolicy-priorsearch

algorithmsareexperimentallycompared.

9.2.4.1

Setup

LetthestatespaceSbeone-dimensionalandcontinuous,andtheinitial

stateisrandomlychosenfollowingthestandardnormaldistribution.Theac-

tionspaceAisalsosettobeone-dimensionalandcontinuous.Thetransition

dynamicsoftheenvironmentissetat

st+1=st+at+ε,

where ε ∼ N(0, 0.5²) is stochastic noise and N(µ, σ²) denotes the normal distribution with mean µ and variance σ². The immediate reward is defined as

r = \exp\!\left( -s^2/2 - a^2/2 \right) + 1,

which is bounded as 1 < r ≤ 2. The length of the trajectory is set at T = 10 or 50, the discount factor is set at γ = 0.9, and the number of episodic samples is set at N = 100.

TABLE 9.1: Variance and bias of estimated parameters.

(a) Trajectory length T = 10

  Method         Variance (µ,η)  Variance (σ,τ)  Bias (µ,η)  Bias (σ,τ)
  REINFORCE           13.257          26.917       -0.310      -1.510
  REINFORCE-OB         0.091           0.120        0.067       0.129
  PGPE                 0.971           1.686       -0.069       0.132
  PGPE-OB              0.037           0.069       -0.016       0.051

(b) Trajectory length T = 50

  Method         Variance (µ,η)  Variance (σ,τ)  Bias (µ,η)  Bias (σ,τ)
  REINFORCE          188.386         278.310       -1.813      -5.175
  REINFORCE-OB         0.545           0.900       -0.299      -0.201
  PGPE                 1.657           3.372       -0.105      -0.329
  PGPE-OB              0.085           0.182        0.048      -0.078
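This toy setup is easy to reproduce. A sketch of a single roll-out is given below; the function names are illustrative, and the reward is computed from the current state and action as the formula is written:

    import numpy as np

    def rollout(policy, T=10, rng=None):
        """One episode of the 1-D toy environment: s' = s + a + eps, eps ~ N(0, 0.5^2)."""
        rng = np.random.default_rng() if rng is None else rng
        s = rng.standard_normal()                        # initial state ~ N(0, 1)
        states, actions, rewards = [], [], []
        for _ in range(T):
            a = policy(s)
            r = np.exp(-s ** 2 / 2 - a ** 2 / 2) + 1.0   # bounded in (1, 2]
            states.append(s); actions.append(a); rewards.append(r)
            s = s + a + 0.5 * rng.standard_normal()      # stochastic transition
        return np.array(states), np.array(actions), np.array(rewards)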

9.2.4.2 Variance and Bias

First, the variance and the bias of gradient estimators of the following methods are investigated:

• REINFORCE: REINFORCE (gradient-based direct policy search) without a baseline (Williams, 1992).

• REINFORCE-OB: REINFORCE with optimal baseline subtraction (Peters & Schaal, 2006).

• PGPE: PGPE (gradient-based policy-prior search) without a baseline (Sehnke et al., 2010).

• PGPE-OB: PGPE with optimal baseline subtraction (Zhao et al., 2012).

Table 9.1 summarizes the variance of gradient estimators over 100 runs, showing that the variance of REINFORCE is overall larger than PGPE. A notable difference between REINFORCE and PGPE is that the variance of REINFORCE significantly grows as the trajectory length T increases, whereas that of PGPE is not influenced that much by T. This agrees well with the theoretical analyses given in Section 7.2.3 and Section 9.2.3. Optimal baseline subtraction (REINFORCE-OB and PGPE-OB) is shown to contribute highly to reducing the variance, especially when trajectory length T is large, which also agrees well with the theoretical analysis.

The bias of the gradient estimator of each method is also investigated. Here, gradients estimated with N = 1000 are regarded as true gradients, and the bias of gradient estimators is computed. The results are also included in Table 9.1, showing that introduction of baselines does not increase the bias; rather, it tends to reduce the bias.

9.2.4.3 Variance and Policy Hyper-Parameter Change through Entire Policy-Update Process

Next, the variance of gradient estimators is investigated when policy hyper-parameters are updated over iterations. If the deviation parameter σ takes a negative value during the policy-update process, it is set at 0.05. In this experiment, the variance is computed from 50 runs for T = 20 and N = 10, and policies are updated over 50 iterations. In order to evaluate the variance in a stable manner, the above experiments are repeated 20 times with random choice of the initial mean parameter µ from [−3.0, −0.1], and the average variance of gradient estimators is investigated with respect to the mean parameter µ over 20 trials. The results are plotted in Figure 9.2. Figure 9.2(a) compares the variance of REINFORCE with/without baselines, whereas Figure 9.2(b) compares the variance of PGPE with/without baselines. These graphs show that introduction of baselines contributes highly to the reduction of the variance over iterations.

Let us illustrate how parameters are updated by PGPE-OB over 50 iterations for N = 10 and T = 10. The initial mean parameter is set at η = −1.6, −0.8, or −0.1, and the initial deviation parameter is set at τ = 1. Figure 9.3 depicts the contour of the expected return and illustrates trajectories of parameter updates over iterations by PGPE-OB. In the graph, the maximum of the return surface is located at the middle bottom, and PGPE-OB leads the solutions to a maximum point rapidly.

9.2.4.4 Performance of Learned Policies

Finally, the return obtained by each method is evaluated. The trajectory length is fixed at T = 20, and the maximum number of policy-update iterations is set at 50. Average returns over 20 runs are investigated as functions of the number of episodic samples N. Figure 9.4(a) shows the results when the initial mean parameter µ is chosen randomly from [−1.6, −0.1], which tends to perform well. The graph shows that PGPE-OB performs the best, especially when N < 5; then REINFORCE-OB follows with a small margin.

141

6

REINFORCE

REINFORCE−OB

5

−scale4

Page 434: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

10

3

2

Varianceinlog

1

00

10

20

30

40

50

Iteration

(a)REINFORCEandREINFORCE-OB

4

PGPE

3.5

PGPE−OB

3

2.5

−scale

10

2

1.5

1

0.5

Varianceinlog

0

−0.50

10

20

30

40

Page 435: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

50

Iteration

(b)PGPEandPGPE-OB

FIGURE9.2:Meanandstandarderrorofthevarianceofgradientestimators

withrespecttothemeanparameterthroughpolicy-updateiterations.

1

17.00

τ

17.54

17.81

17.27

0.8

18.07

17.54

18.34

18.0717.81

18.61

0.6

18.88

19.14

18.34

0.4

18.61

19.41

18.88

19.68

19.14

0.2

19.41

19.68

Policy-priorstandarddeviation

0

Page 436: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

−1.6

−1.4

−1.2

−1

−0.8

−0.6

−0.4

−0.2

Policy-priormeanη

FIGURE9.3:Trajectoriesofpolicy-priorparameterupdatesbyPGPE.

142

StatisticalReinforcementLearning

16.5

16

15.5

Return

15

REINFORCE

14.5

REINFORCE−OB

PGPE

PGPE−OB

0

5

10

15

20

Iteration

(a)Goodinitialpolicy

16.5

Page 437: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

16

15.5

15

14.5

Return

14

13.5

REINFORCE

REINFORCE−OB

13

PGPE

PGPE−OB

12.50

5

10

15

20

Iteration

(b)Poorinitialpolicy

FIGURE9.4:Averageandstandarderrorofreturnsover20runsasfunctions

ofthenumberofepisodicsamplesN.

The plain PGPE also works reasonably well, although it is slightly unstable due to larger variance. The plain REINFORCE is highly unstable, which is caused by the huge variance of gradient estimators (see Figure 9.2 again). Figure 9.4(b) describes the results when the initial mean parameter µ is chosen randomly from [−3.0, −0.1], which tends to result in poorer performance. In this setup, the difference among the compared methods is more significant than the case with good initial policies, meaning that REINFORCE is sensitive to the choice of initial policies. Overall, the PGPE methods tend to outperform the REINFORCE methods, and among the PGPE methods, PGPE-OB works very well and converges quickly.

9.3 Sample Reuse in Policy-Prior Search

Although PGPE was shown to outperform REINFORCE, its behavior is still rather unstable if the number of data samples used for estimating the gradient is small. In this section, the sample-reuse idea is applied to PGPE. Technically, the original PGPE is categorized as an on-policy algorithm where data drawn from the current target policy is used to estimate policy-prior gradients. On the other hand, off-policy algorithms are more flexible in the sense that a data-collecting policy and the current target policy can be different. Here, PGPE is extended to the off-policy scenario using the importance-weighting technique.

9.3.1 Importance Weighting

Let us consider an off-policy scenario where a data-collecting policy and the current target policy are different in general. In the context of PGPE, two hyper-parameters are considered: ρ as the target policy to learn and ρ' as a policy for data collection. Let us denote the data samples collected with hyper-parameter ρ' by H':

H' = \{ (\theta'_n, h'_n) \}_{n=1}^{N'} \;\overset{\mathrm{i.i.d.}}{\sim}\; p(h|\theta)\, p(\theta|\rho').

If data H' is naively used to estimate policy-prior gradients by Eq. (9.1), we suffer an inconsistency problem:

\frac{1}{N'} \sum_{n=1}^{N'} \nabla_\rho \log p(\theta'_n|\rho)\, R(h'_n) \;\overset{N' \to \infty}{\nrightarrow}\; \nabla_\rho J(\rho),

where

\nabla_\rho J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_\rho \log p(\theta|\rho)\, R(h) \,\mathrm{d}h\,\mathrm{d}\theta

is the gradient of the expected return,

J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, R(h) \,\mathrm{d}h\,\mathrm{d}\theta,

with respect to the policy hyper-parameter ρ. Below, this naive method is referred to as non-importance-weighted PGPE (NIW-PGPE).

This inconsistency problem can be systematically resolved by importance weighting:

\widehat{\nabla}_\rho J_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n)\, \nabla_\rho \log p(\theta'_n|\rho)\, R(h'_n) \;\overset{N' \to \infty}{\longrightarrow}\; \nabla_\rho J(\rho),

where w(θ) = p(θ|ρ)/p(θ|ρ') is the importance weight. This extended method is called importance-weighted PGPE (IW-PGPE).
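Since w(θ') is a ratio of two Gaussian prior densities, the IW-PGPE estimator can be written compactly. A sketch with illustrative names; (eta, tau) is the target prior and (eta_data, tau_data) the data-collecting prior:

    import numpy as np

    def log_prior(Theta, eta, tau):
        """Log-density of the independent Gaussian prior, summed over dimensions."""
        return (-0.5 * np.log(2 * np.pi * tau ** 2)
                - (Theta - eta) ** 2 / (2 * tau ** 2)).sum(axis=1)

    def iw_pgpe_gradients(Theta, R, eta, tau, eta_data, tau_data):
        """IW-PGPE gradients: data sampled under (eta_data, tau_data), target (eta, tau)."""
        w = np.exp(log_prior(Theta, eta, tau) - log_prior(Theta, eta_data, tau_data))
        diff = Theta - eta
        grad_eta = (w * R)[:, None] * (diff / tau ** 2)
        grad_tau = (w * R)[:, None] * ((diff ** 2 - tau ** 2) / tau ** 3)
        return grad_eta.mean(axis=0), grad_tau.mean(axis=0)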

Below, the variance of gradient estimators in IW-PGPE is theoretically analyzed. See Zhao et al. (2013) for technical details. As described in Section 9.2.1, the deterministic linear policy model is used here:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),   (9.3)

where δ(·) is the Dirac delta function and φ(s) is the B-dimensional basis function. The policy parameter \theta = (\theta_1, \ldots, \theta_B)^\top is drawn from the independent Gaussian prior, where the policy hyper-parameter ρ consists of prior means \eta = (\eta_1, \ldots, \eta_B)^\top and prior standard deviations \tau = (\tau_1, \ldots, \tau_B)^\top:

p(\theta|\eta,\tau) = \prod_{b=1}^B \frac{1}{\tau_b \sqrt{2\pi}} \exp\!\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.4)

Let

G = \sum_{b=1}^B \tau_b^{-2},

and let \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')} denote the trace of the covariance matrix:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] = \mathrm{tr}\Big( \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\big[ (\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta]) (\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta])^\top \big] \Big)
= \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\big[ \| \zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] \|^2 \big],

where \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')} denotes the expectation over trajectory h' and policy parameter θ' drawn from p(h'|\theta')p(\theta'|\rho'). Then the following theorem holds:

Theorem 9.5 Assume that for all s, a, and s', there exists β > 0 such that r(s,a,s') ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau) \big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} w_{\max},
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau) \big] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} w_{\max}.

It is interesting to note that the upper bounds are the same as the ones for the plain PGPE (Theorem 9.1 in Section 9.2.3) except for the factor w_max. When w_max = 1, the bounds are reduced to those of the plain PGPE method. However, if the sampling distribution is significantly different from the target distribution, w_max can take a large value and thus IW-PGPE can produce a gradient estimator with large variance. Therefore, IW-PGPE may not be a reliable approach as it is.

Below, a variance reduction technique for IW-PGPE is introduced which leads to a practically useful algorithm.

9.3.2 Variance Reduction by Baseline Subtraction

Here, a baseline is introduced for IW-PGPE to reduce the variance of gradient estimators, in the same way as the plain PGPE explained in Section 9.2.2.

A policy-prior gradient estimator with a baseline ξ ∈ R is defined as

\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} (R(h'_n) - \xi)\, w(\theta'_n)\, \nabla_\rho \log p(\theta'_n|\rho).

Here, the baseline ξ is determined so that the variance is minimized. Let ξ* be the optimal baseline for IW-PGPE that minimizes the variance:

\xi^* = \operatorname*{argmin}_{\xi} \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho) \big].

Then the optimal baseline for IW-PGPE is given as follows (Zhao et al., 2013):

\xi^* = \frac{ \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\big[ R(h')\, w^2(\theta') \| \nabla_\rho \log p(\theta'|\rho) \|^2 \big] }{ \mathbb{E}_{p(\theta'|\rho')}\big[ w^2(\theta') \| \nabla_\rho \log p(\theta'|\rho) \|^2 \big] },

where \mathbb{E}_{p(\theta'|\rho')} denotes the expectation over policy parameter θ' drawn from p(θ'|ρ'). In practice, the expectations are approximated by the sample averages. The excess variance for a baseline ξ is given as

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho) \big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\rho J^{\xi^*}_{\mathrm{IW}}(\rho) \big]
= \frac{(\xi - \xi^*)^2}{N'} \mathbb{E}_{p(\theta'|\rho')}\big[ w^2(\theta') \| \nabla_\rho \log p(\theta'|\rho) \|^2 \big].

Next, contributions of the optimal baseline to variance reduction in IW-PGPE are analyzed for the deterministic linear policy model (9.3) and the independent Gaussian prior (9.4). See Zhao et al. (2013) for technical details.

Theorem 9.6 Assume that for all s, a, and s', there exists α > 0 such that r(s,a,s') ≥ α, and, for all θ, there exists w_min > 0 such that w(θ) ≥ w_min. Then, the following lower bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau) \big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \big] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} w_{\min},
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau) \big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \big] \ge \frac{2\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} w_{\min}.

Assume that for all s, a, and s', there exists β > 0 such that r(s,a,s') ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau) \big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} w_{\max},
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau) \big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \big] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} w_{\max}.

This theorem shows that the bounds of the variance reduction in IW-PGPE brought by the optimal baseline depend on the bounds of the importance weight, w_min and w_max: the larger the upper bound w_max is, the more optimal baseline subtraction can reduce the variance.

From Theorem 9.5 and Theorem 9.6, the following corollary can be immediately obtained:

Corollary 9.7 Assume that for all s, a, and s', there exists 0 < α < β such that r(s,a,s') ∈ [α, β], and, for all θ, there exists 0 < w_min < w_max < ∞ such that w_min ≤ w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \big] \le \frac{(1-\gamma^T)^2 G}{N'(1-\gamma)^2} (\beta^2 w_{\max} - \alpha^2 w_{\min}),
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[ \widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \big] \le \frac{2(1-\gamma^T)^2 G}{N'(1-\gamma)^2} (\beta^2 w_{\max} - \alpha^2 w_{\min}).

From Theorem 9.5 and this corollary, we can confirm that the upper bounds for the baseline-subtracted IW-PGPE are smaller than those for the plain IW-PGPE without baseline subtraction, because α² w_min > 0. In particular, if w_min is large, the upper bounds for the baseline-subtracted IW-PGPE can be much smaller than those for the plain IW-PGPE without baseline subtraction.

9.3.3 Numerical Examples

Here, we consider the controlling task of the humanoid robot CB-i (Cheng et al., 2007) shown in Figure 9.5(a). The goal is to lead the end effector of the right arm (right hand) to a target object. First, its simulated upper-body model, illustrated in Figure 9.5(b), is used to investigate the performance of the IW-PGPE-OB method. Then the IW-PGPE-OB method is applied to the real robot.

FIGURE 9.5: Humanoid robot CB-i and its upper-body model: (a) CB-i, (b) simulated upper-body model. The humanoid robot CB-i was developed by the JST-ICORP Computational Brain Project and ATR Computational Neuroscience Labs (Cheng et al., 2007).

9.3.3.1 Setup

The performance of the following 4 methods is compared:

• IW-REINFORCE-OB: Importance-weighted REINFORCE with the optimal baseline.
• NIW-PGPE-OB: Data-reuse PGPE-OB without importance weighting.
• PGPE-OB: Plain PGPE-OB without data reuse.
• IW-PGPE-OB: Importance-weighted PGPE with the optimal baseline.

The upper body of CB-i has 9 degrees of freedom: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch (Figure 9.5(b)). At each time step, the controller receives states from the system and sends out actions. The state space is 18-dimensional, which corresponds to the current angle and angular velocity of each joint. The action space is 9-dimensional, which corresponds to the target angle of each joint. Both states and actions are continuous.

Given the state and action in each time step, the physical control system calculates the torques at each joint by using a proportional-derivative (PD) controller as
$$
\tau_i = K_{p_i} (a_i - s_i) - K_{d_i} \dot{s}_i,
$$

where $s_i$, $\dot{s}_i$, and $a_i$ denote the current angle, the current angular velocity, and the target angle of the $i$-th joint, respectively. $K_{p_i}$ and $K_{d_i}$ denote the position and velocity gains for the $i$-th joint, respectively. These parameters are set at $K_{p_i} = 200$ and $K_{d_i} = 10$ for the elbow pitch joints, and $K_{p_i} = 2000$ and $K_{d_i} = 100$ for other joints.
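As a concrete illustration, the following minimal sketch computes the joint torques from the PD control law above. The gain values are those stated in the text; which joint indices correspond to the elbow pitch joints is an assumption for the sketch.

```python
import numpy as np

# Position and velocity gains for the 9 joints: elbow pitch joints use 200/10,
# all other joints use 2000/100. Indices 2 and 5 are assumed (hypothetically)
# to be the elbow pitch joints of the right and left arm.
Kp = np.full(9, 2000.0)
Kd = np.full(9, 100.0)
Kp[[2, 5]] = 200.0
Kd[[2, 5]] = 10.0

def pd_torques(angles, velocities, target_angles):
    """PD control law: tau_i = Kp_i * (a_i - s_i) - Kd_i * sdot_i."""
    return Kp * (target_angles - angles) - Kd * velocities
```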

The initial position of the robot is fixed at the standing-up-straight pose with the arms down. The immediate reward $r_t$ at the time step $t$ is defined as
$$
r_t = \exp(-10 d_t) - 0.0005 \min(c_t, 10{,}000),
$$
where $d_t$ is the distance between the right hand of the robot and the target object, and $c_t$ is the sum of control costs for each joint. The linear deterministic policy is used for the PGPE methods, and the Gaussian policy is used for IW-REINFORCE-OB. In both cases, the linear basis function $\phi(s) = s$ is used. For PGPE, the initial prior mean $\eta$ is randomly chosen from the standard normal distribution, and the initial prior standard deviation $\tau$ is set at 1.

To evaluate the usefulness of data reuse methods with a small number of samples, the agent collects only $N = 3$ on-policy samples with trajectory length $T = 100$ at each iteration. All previous data samples are reused to estimate the gradients in the data reuse methods, while only on-policy samples are used to estimate the gradients in the plain PGPE-OB method. The discount factor is set at $\gamma = 0.9$.

9.3.3.2 Simulation with 2 Degrees of Freedom

First, the performance on the reaching task with only 2 degrees of freedom is investigated. The body of the robot is fixed and only the right shoulder pitch and right elbow pitch are used. Figure 9.6 depicts the averaged expected return over 10 trials as a function of the number of iterations. The expected return at each trial is computed from 50 newly drawn test episodic data that are not used for policy learning. The graph shows that IW-PGPE-OB nicely improves the performance over iterations with only a small number of on-policy samples. The plain PGPE-OB method can also improve the performance over iterations, but slowly. NIW-PGPE-OB is not as good as IW-PGPE-OB, especially at the later iterations, because of the inconsistency of the NIW estimator.

The distance from the right hand to the object and the control costs along the trajectory are also investigated for three policies: the initial policy, the policy obtained at the 20th iteration by IW-PGPE-OB, and the policy obtained at the 50th iteration by IW-PGPE-OB. Figure 9.7(a) plots the distance to the target object as a function of the time step.

FIGURE 9.6: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with 2 degrees of freedom (right shoulder pitch and right elbow pitch). The compared methods are IW-PGPE-OB, NIW-PGPE-OB, PGPE-OB, and IW-REINFORCE-OB.

FIGURE 9.7: Distance and control costs of arm reaching with 2 degrees of freedom using the policy learned by IW-PGPE-OB: (a) distance, (b) control costs, each plotted over 100 time steps for the initial policy, the policy at the 20th iteration, and the policy at the 50th iteration.

FIGURE 9.8: Typical example of arm reaching with 2 degrees of freedom using the policy obtained by IW-PGPE-OB at the 50th iteration (from left to right and top to bottom).

This shows that the policy obtained at the 50th iteration decreases the distance rapidly compared with the initial policy and the policy obtained at the 20th iteration, which means that the robot can reach the object quickly by using the learned policy.

Figure 9.7(b) plots the control cost as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the control cost steadily until the reaching task is completed. This is because the robot mainly adjusts the shoulder pitch in the beginning, which consumes a larger amount of energy than the energy required for controlling the elbow pitch. Then, once the right hand gets closer to the target object, the robot starts adjusting the elbow pitch to reach the target object. The policy obtained at the 20th iteration actually consumes less control costs, but it cannot lead the arm to the target object.

Figure 9.8 illustrates a typical solution of the reaching task with 2 degrees of freedom by the policy obtained by IW-PGPE-OB at the 50th iteration. The images show that the right hand is successfully led to the target object within only 10 time steps.

9.3.3.3 Simulation with All 9 Degrees of Freedom

Finally, the same experiment is carried out using all 9 degrees of freedom. The position of the target object is more distant from the robot so that it cannot be reached by only using the right arm.

FIGURE 9.9: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with all 9 degrees of freedom. The compared methods are truncated IW-PGPE-OB, IW-PGPE-OB, NIW-PGPE-OB, and PGPE-OB.

Because all 9 joints are used, the dimensionality of the state space is much increased and this grows the values of importance weights exponentially. In order to mitigate the large values of importance weights, we decided not to reuse all previously collected samples, but only samples collected in the last 5 iterations. This allows us to keep the difference between the sampling distribution and the target distribution reasonably small, and thus the values of importance weights can be suppressed to some extent. Furthermore, following Wawrzynski (2009), we consider a version of IW-PGPE-OB, denoted as "truncated IW-PGPE-OB" below, where the importance weight is truncated as $w = \min(w, 2)$.

The results plotted in Figure 9.9 show that the performance of the truncated IW-PGPE-OB is the best. This implies that the truncation of importance weights is helpful when applying IW-PGPE-OB to high-dimensional problems.

Figure 9.10 illustrates a typical solution of the reaching task with all 9 degrees of freedom by the policy obtained by the truncated IW-PGPE-OB at the 400th iteration. The images show that the policy learned by our proposed method successfully leads the right hand to the target object, and the irrelevant parts are kept at the initial position for reducing the control costs.

9.3.3.4 Real Robot Control

Finally, the IW-PGPE-OB method is applied to the real CB-i robot shown in Figure 9.11 (Sugimoto et al., 2014).

The experimental setting is essentially the same as the above simulation studies with 9 joints, but policies are updated only every 5 trials and samples taken from the last 10 trials are reused for stabilization purposes.

FIGURE 9.10: Typical example of arm reaching with all 9 degrees of freedom using the policy obtained by the truncated IW-PGPE-OB at the 400th iteration (from left to right and top to bottom).

FIGURE 9.11: Reaching task by the real CB-i robot (Sugimoto et al., 2014).

Figure 9.12 plots the obtained rewards cumulated over policy update iterations, showing that rewards are steadily increased over iterations. Figure 9.13 exhibits the acquired reaching motion based on the policy obtained at the 120th iteration, showing that the end effector of the robot can successfully reach the target object.

FIGURE 9.12: Obtained reward cumulated over policy update iterations.

9.4 Remarks

When the trajectory length is large, direct policy search tends to produce gradient estimators with large variance, due to the randomness of stochastic policies. Policy-prior search can avoid this problem by using deterministic policies and introducing stochasticity by considering a prior distribution over policy parameters. Both theoretically and experimentally, advantages of policy-prior search over direct policy search were shown.

A sample reuse framework for policy-prior search was also introduced, which is highly useful in real-world reinforcement learning problems with high sampling costs. Following the same line as the sample reuse methods for policy iteration described in Chapter 4 and direct policy search introduced in Chapter 8, importance weighting plays an essential role in sample-reuse policy-prior search. When the dimensionality of the state-action space is high, however, importance weights tend to take extremely large values, which causes instability of the importance weighting methods. To mitigate this problem, truncation of the importance weights is useful in practice.

FIGURE 9.13: Typical example of arm reaching using the policy obtained by the IW-PGPE-OB method (from left to right and top to bottom).

Part IV

Model-Based Reinforcement Learning

The reinforcement learning methods explained in Part II and Part III are categorized into the model-free approach, meaning that policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent). On the other hand, in Part IV, we introduce an alternative approach called the model-based approach, which explicitly models the environment in advance and uses the learned environment model for policy learning.

In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging.

In Chapter 10, we introduce a non-parametric model estimator that possesses the optimal convergence rate with high computational efficiency, and demonstrate its usefulness through experiments. Then, in Chapter 11, we combine dimensionality reduction with model estimation to cope with high dimensionality of state and action spaces.

Chapter 10

Transition Model Estimation

In this chapter, we introduce transition probability estimation methods for model-based reinforcement learning (Wang & Dietterich, 2003; Deisenroth & Rasmussen, 2011). Among the methods described in Section 10.1, a non-parametric transition model estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) is shown to be the most promising approach (Tangkaratt et al., 2014a). Then in Section 10.2, we describe how the transition model estimator can be utilized in model-based reinforcement learning. In Section 10.3, experimental performance of a model-based policy-prior search method is evaluated. Finally, in Section 10.4, this chapter is concluded.

10.1 Conditional Density Estimation

In this section, the problem of approximating the transition probability $p(s'|s,a)$ from independent transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$ is addressed.

10.1.1 Regression-Based Approach

In the regression-based approach, the problem of transition probability estimation is formulated as a function approximation problem of predicting output $s'$ given input $s$ and $a$ under Gaussian noise:
$$
s' = f(s,a) + \epsilon,
$$
where $f$ is an unknown regression function to be learned, $\epsilon$ is an independent Gaussian noise vector with mean zero and covariance matrix $\sigma^2 I$, and $I$ denotes the identity matrix.

Let us approximate $f$ by the following linear-in-parameter model:
$$
f(s,a,\Gamma) = \Gamma^\top \phi(s,a),
$$
where $\Gamma$ is the $B \times \dim(s)$ parameter matrix and $\phi(s,a)$ is the $B$-dimensional basis vector. A typical choice of the basis vector is the Gaussian kernel, which is defined for $B = M$ as
$$
\phi_b(s,a) = \exp\left( - \frac{\|s - s_b\|^2 + (a - a_b)^2}{2\kappa^2} \right),
$$
and $\kappa > 0$ denotes the Gaussian kernel width. If $B$ is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for $s$ and $a$ may be used if necessary.

The parameter matrix $\Gamma$ is learned so that the regularized squared error is minimized:
$$
\widehat{\Gamma} = \operatorname*{argmin}_{\Gamma} \left[ \sum_{m=1}^{M} \big\| f(s_m, a_m, \Gamma) - s'_m \big\|^2 + \mathrm{tr}\big( \Gamma^\top R \Gamma \big) \right],
$$
where $R$ is the $B \times B$ positive semi-definite matrix called the regularization matrix. The solution $\widehat{\Gamma}$ is given analytically as
$$
\widehat{\Gamma} = (\Phi^\top \Phi + R)^{-1} \Phi^\top (s'_1, \ldots, s'_M)^\top,
$$
where $\Phi$ is the $M \times B$ design matrix defined as
$$
\Phi_{m,b} = \phi_b(s_m, a_m).
$$
We can confirm that the predicted output vector $\widehat{s}' = f(s,a,\widehat{\Gamma})$ actually follows the Gaussian distribution with mean
$$
(s'_1, \ldots, s'_M) \Phi (\Phi^\top \Phi + R)^{-1} \phi(s,a)
$$
and covariance matrix $\widehat{\delta}^2 I$, where
$$
\widehat{\delta}^2 = \sigma^2 \, \mathrm{tr}\big( (\Phi^\top \Phi + R)^{-2} \Phi^\top \Phi \big).
$$
The tuning parameters such as the Gaussian kernel width $\kappa$ and the regularization matrix $R$ can be determined either by cross-validation or evidence maximization if the above method is regarded as Gaussian process regression in the Bayesian framework (Rasmussen & Williams, 2006).

This is the regression-based estimator of the transition probability density $p(s'|s,a)$ for an arbitrary test input $s$ and $a$. Thus, by the use of kernel regression models, the regression function $f$ (which is the conditional mean of output $s'$) is approximated in a non-parametric way. However, the conditional distribution of output $s'$ itself is restricted to be Gaussian, which is highly restrictive in real-world reinforcement learning.
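A minimal sketch of this regression-based transition model is given below, assuming transition samples are stacked as arrays S (M x dim(s)), A (M,) with scalar actions as in the kernel above, and S_next (M x dim(s)); the kernel width and the ridge choice R = lam * I are hypothetical.

```python
import numpy as np

def gauss_basis(S, A, centers_S, centers_A, kappa):
    """phi_b(s,a) = exp(-(||s - s_b||^2 + (a - a_b)^2) / (2 kappa^2))."""
    d2 = ((S[:, None, :] - centers_S[None, :, :]) ** 2).sum(axis=2)
    d2 += (A[:, None] - centers_A[None, :]) ** 2
    return np.exp(-d2 / (2 * kappa ** 2))

def fit_regression_model(S, A, S_next, kappa=1.0, lam=0.1):
    """Regularized least squares: Gamma_hat = (Phi^T Phi + R)^{-1} Phi^T S_next."""
    Phi = gauss_basis(S, A, S, A, kappa)      # all samples used as Gaussian centers
    R = lam * np.eye(Phi.shape[1])            # a simple choice of regularization matrix
    return np.linalg.solve(Phi.T @ Phi + R, Phi.T @ S_next)

def predict_mean(s, a, S, A, Gamma, kappa=1.0):
    """Conditional mean of s' at a test input (s, a): Gamma_hat^T phi(s, a)."""
    phi = gauss_basis(s[None, :], np.atleast_1d(a), S, A, kappa)  # shape (1, B)
    return (Gamma.T @ phi.T).ravel()
```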

10.1.2 $\epsilon$-Neighbor Kernel Density Estimation

When the conditioning variables $(s,a)$ are discrete, the conditional density $p(s'|s,a)$ can be easily estimated by standard density estimators such as kernel density estimation (KDE) by only using samples $s'_i$ such that $(s_i, a_i)$ agrees with the target values $(s,a)$. $\epsilon$-neighbor KDE ($\epsilon$KDE) extends this idea to the continuous case such that $(s_i, a_i)$ are close to the target values $(s,a)$.

More specifically, $\epsilon$KDE with the Gaussian kernel is given by
$$
\widehat{p}(s'|s,a) = \frac{1}{|\mathcal{I}_{(s,a),\epsilon}|} \sum_{i \in \mathcal{I}_{(s,a),\epsilon}} N(s'; s'_i, \sigma^2 I),
$$
where $\mathcal{I}_{(s,a),\epsilon}$ is the set of sample indices such that $\|(s,a) - (s_i, a_i)\| \le \epsilon$ and $N(s'; s'_i, \sigma^2 I)$ denotes the Gaussian density with mean $s'_i$ and covariance matrix $\sigma^2 I$. The Gaussian width $\sigma$ and the distance threshold $\epsilon$ may be chosen by cross-validation.

$\epsilon$KDE is a useful non-parametric density estimator that is easy to implement. However, it is unreliable in high-dimensional problems due to the distance-based construction.
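The following sketch illustrates $\epsilon$KDE as described above; the sample arrays, scalar actions, and the particular values of $\epsilon$ and $\sigma$ are assumptions for the example.

```python
import numpy as np

def eps_kde(s, a, S, A, S_next, eps=0.5, sigma=0.3):
    """epsilon-neighbor KDE: p_hat(s'|s,a) as a mixture of Gaussians N(s'; s'_i, sigma^2 I)."""
    # Indices i with ||(s,a) - (s_i,a_i)|| <= eps
    dist = np.sqrt(((S - s) ** 2).sum(axis=1) + (A - a) ** 2)
    idx = np.where(dist <= eps)[0]

    def density(s_prime):
        d = s_prime.shape[0]
        norm = (2 * np.pi * sigma ** 2) ** (-d / 2)
        vals = norm * np.exp(-((S_next[idx] - s_prime) ** 2).sum(axis=1)
                             / (2 * sigma ** 2))
        return vals.mean() if len(idx) > 0 else 0.0

    return density
```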

10.1.3 Least-Squares Conditional Density Estimation

A non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) possesses various useful properties:

• It can directly handle multi-dimensional multi-modal inputs and outputs.
• It was proved to achieve the optimal convergence rate (Kanamori et al., 2012).
• It has high numerical stability (Kanamori et al., 2013).
• It is robust against outliers (Sugiyama et al., 2010).
• Its solution can be analytically and efficiently computed just by solving a system of linear equations (Kanamori et al., 2009).
• Generating samples from the learned transition model is straightforward.

Let us model the transition probability $p(s'|s,a)$ by the following linear-in-parameter model:
$$
\alpha^\top \phi(s,a,s'), \qquad (10.1)
$$
where $\alpha$ is the $B$-dimensional parameter vector and $\phi(s,a,s')$ is the $B$-dimensional basis function vector. A typical choice of the basis function is the Gaussian kernel, which is defined for $B = M$ as
$$
\phi_b(s,a,s') = \exp\left( - \frac{\|s - s_b\|^2 + (a - a_b)^2 + \|s' - s'_b\|^2}{2\kappa^2} \right),
$$
where $\kappa > 0$ denotes the Gaussian kernel width. If $B$ is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for $s$, $a$, and $s'$ may be used if necessary.

The parameter $\alpha$ is learned so that the following squared error is minimized:
$$
\begin{aligned}
J_0(\alpha) &= \frac{1}{2} \iiint \Big( \alpha^\top \phi(s,a,s') - p(s'|s,a) \Big)^2 p(s,a) \, \mathrm{d}s \, \mathrm{d}a \, \mathrm{d}s' \\
&= \frac{1}{2} \iiint \Big( \alpha^\top \phi(s,a,s') \Big)^2 p(s,a) \, \mathrm{d}s \, \mathrm{d}a \, \mathrm{d}s'
- \iiint \alpha^\top \phi(s,a,s') \, p(s,a,s') \, \mathrm{d}s \, \mathrm{d}a \, \mathrm{d}s' + C,
\end{aligned}
$$
where the identity $p(s'|s,a) = p(s,a,s')/p(s,a)$ is used in the second term and
$$
C = \frac{1}{2} \iiint p(s'|s,a) \, p(s,a,s') \, \mathrm{d}s \, \mathrm{d}a \, \mathrm{d}s'.
$$
Because $C$ is a constant independent of $\alpha$, only the first two terms will be considered from here on:
$$
J(\alpha) = J_0(\alpha) - C = \frac{1}{2} \alpha^\top U \alpha - \alpha^\top v,
$$
where $U$ is the $B \times B$ matrix and $v$ is the $B$-dimensional vector defined as
$$
U = \iint \Phi(s,a) \, p(s,a) \, \mathrm{d}s \, \mathrm{d}a, \qquad
v = \iiint \phi(s,a,s') \, p(s,a,s') \, \mathrm{d}s \, \mathrm{d}a \, \mathrm{d}s', \qquad
\Phi(s,a) = \int \phi(s,a,s') \, \phi(s,a,s')^\top \mathrm{d}s'.
$$
Note that, for the Gaussian model (10.1), the $(b,b')$-th element of matrix $\Phi(s,a)$ can be computed analytically as
$$
\Phi_{b,b'}(s,a) = (\sqrt{\pi}\kappa)^{\dim(s')} \exp\left( - \frac{\|s'_b - s'_{b'}\|^2}{4\kappa^2} \right)
\exp\left( - \frac{\|s - s_b\|^2 + \|s - s_{b'}\|^2 + (a - a_b)^2 + (a - a_{b'})^2}{2\kappa^2} \right).
$$
Because $U$ and $v$ included in $J(\alpha)$ contain the expectations over unknown densities $p(s,a)$ and $p(s,a,s')$, they are approximated by sample averages. Then we have
$$
\widehat{J}(\alpha) = \frac{1}{2} \alpha^\top \widehat{U} \alpha - \widehat{v}^\top \alpha,
$$
where
$$
\widehat{U} = \frac{1}{M} \sum_{m=1}^{M} \Phi(s_m, a_m)
\qquad \text{and} \qquad
\widehat{v} = \frac{1}{M} \sum_{m=1}^{M} \phi(s_m, a_m, s'_m).
$$
By adding an $\ell_2$-regularizer to $\widehat{J}(\alpha)$ to avoid overfitting, the LSCDE optimization criterion is given as
$$
\widetilde{\alpha} = \operatorname*{argmin}_{\alpha \in \mathbb{R}^M} \left[ \widehat{J}(\alpha) + \frac{\lambda}{2} \|\alpha\|^2 \right],
$$
where $\lambda \ge 0$ is the regularization parameter. The solution $\widetilde{\alpha}$ is given analytically as
$$
\widetilde{\alpha} = (\widehat{U} + \lambda I)^{-1} \widehat{v},
$$
where $I$ denotes the identity matrix. Because conditional probability densities are non-negative by definition, the solution $\widetilde{\alpha}$ is modified as
$$
\widehat{\alpha}_b = \max(0, \widetilde{\alpha}_b).
$$
Finally, the solution is normalized in the test phase. More specifically, given a test input point $(s,a)$, the final LSCDE solution is given as
$$
\widehat{p}(s'|s,a) = \frac{\widehat{\alpha}^\top \phi(s,a,s')}{\int \widehat{\alpha}^\top \phi(s,a,s'') \, \mathrm{d}s''},
$$
where, for the Gaussian model (10.1), the denominator can be analytically computed as
$$
\int \widehat{\alpha}^\top \phi(s,a,s'') \, \mathrm{d}s'' = (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{b=1}^{B} \widehat{\alpha}_b \exp\left( - \frac{\|s - s_b\|^2 + (a - a_b)^2}{2\kappa^2} \right).
$$
Model selection of the Gaussian width $\kappa$ and the regularization parameter $\lambda$ is possible by cross-validation (Sugiyama et al., 2010).
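To make the above procedure concrete, here is a minimal sketch of LSCDE with Gaussian kernels, computing $\widehat{U}$, $\widehat{v}$, and the analytic solution. Scalar actions, all samples as Gaussian centers, and a fixed $(\kappa, \lambda)$ are assumed; cross-validation is omitted.

```python
import numpy as np

def lscde_fit(S, A, S_next, kappa=1.0, lam=0.1):
    """LSCDE with all M samples used as Gaussian centers (B = M)."""
    M, ds = S_next.shape
    # Input-part and output-part squared distances between samples and centers
    D_in = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2) \
           + (A[:, None] - A[None, :]) ** 2                              # (M, B)
    D_out = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(axis=2)  # also equals
    # the center-center distances ||s'_b - s'_b'||^2 since centers are the samples.

    # v_hat = (1/M) sum_m phi(s_m, a_m, s'_m)
    v_hat = np.exp(-(D_in + D_out) / (2 * kappa ** 2)).mean(axis=0)

    # U_hat via the analytic expression for Phi_{b,b'}(s_m, a_m), averaged over m
    E = np.exp(-D_in / (2 * kappa ** 2))                                  # (M, B)
    U_hat = (np.sqrt(np.pi) * kappa) ** ds \
            * np.exp(-D_out / (4 * kappa ** 2)) * (E.T @ E) / M

    alpha = np.linalg.solve(U_hat + lam * np.eye(M), v_hat)
    return np.maximum(0.0, alpha)                                         # non-negativity

def lscde_predict(s, a, s_prime, S, A, S_next, alpha, kappa=1.0):
    """Normalized conditional density estimate p_hat(s'|s,a) at a test point."""
    ds = S_next.shape[1]
    k_in = np.exp(-(((S - s) ** 2).sum(axis=1) + (A - a) ** 2) / (2 * kappa ** 2))
    k_out = np.exp(-((S_next - s_prime) ** 2).sum(axis=1) / (2 * kappa ** 2))
    numer = alpha @ (k_in * k_out)
    denom = (np.sqrt(2 * np.pi) * kappa) ** ds * (alpha @ k_in)
    return numer / denom
```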

10.2 Model-Based Reinforcement Learning

Model-based reinforcement learning is simply carried out as follows (a sketch of this loop is given after the list).

1. Collect transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.

2. Obtain a transition model estimate $\widehat{p}(s'|s,a)$ from $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.

3. Run a model-free reinforcement learning method using trajectory samples $\{\widetilde{h}_t\}_{t=1}^{T}$ artificially generated from the estimated transition model $\widehat{p}(s'|s,a)$ and the current policy $\pi(a|s,\theta)$.
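The following sketch shows this overall loop under stated assumptions: `collect_real_transitions`, `fit_transition_model`, `sample_trajectory`, and `policy_gradient_update` are hypothetical stand-ins for the data collection routine, the model estimator (e.g., LSCDE above), a sampler that rolls out the current policy in the learned model, and any model-free policy update (e.g., PGPE).

```python
def model_based_rl(collect_real_transitions, fit_transition_model,
                   sample_trajectory, policy_gradient_update,
                   policy_params, n_iterations=100, n_artificial=1000):
    """Generic model-based RL loop following steps 1-3 above."""
    # Step 1: collect transition samples (s_m, a_m, s'_m) from the real system
    transitions = collect_real_transitions()

    # Step 2: estimate the transition model p_hat(s'|s,a)
    p_hat = fit_transition_model(transitions)

    # Step 3: run a model-free method on artificial trajectories from p_hat
    for _ in range(n_iterations):
        trajectories = [sample_trajectory(p_hat, policy_params)
                        for _ in range(n_artificial)]
        policy_params = policy_gradient_update(policy_params, trajectories)
    return policy_params
```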

Model-based reinforcement learning is particularly advantageous when the sampling cost is limited. More specifically, in model-free methods, we need to fix the sampling schedule in advance, for example, whether many samples are gathered in the beginning or only a small batch of samples is collected for a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, we need to just blindly design the sampling schedule in practice, which can cause significant performance degradation. On the other hand, model-based methods do not suffer from this problem, because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.

10.3 Numerical Examples

In this section, the experimental performance of the model-free and model-based versions of PGPE (policy gradients with parameter-based exploration) are evaluated:

M-PGPE(LSCDE): The model-based PGPE method with the transition model estimated by LSCDE.

M-PGPE(GP): The model-based PGPE method with the transition model estimated by Gaussian process (GP) regression.

IW-PGPE: The model-free PGPE method with sample reuse by importance weighting (the method introduced in Chapter 9).

10.3.1 Continuous Chain Walk

Let us first consider a simple continuous chain walk task, described in Figure 10.1.

FIGURE 10.1: Illustration of continuous chain walk.

10.3.1.1 Setup

Let
$$
s \in \mathcal{S} = [0, 10], \qquad a \in \mathcal{A} = [-5, 5], \qquad
r(s,a,s') = \begin{cases} 1 & (4 < s' < 6), \\ 0 & (\text{otherwise}). \end{cases}
$$
That is, the agent receives positive reward $+1$ at the center of the state space. The trajectory length is set at $T = 10$ and the discount factor is set at

$\gamma = 0.99$. The following linear-in-parameter policy model is used in both the M-PGPE and IW-PGPE methods:
$$
a = \sum_{i=1}^{6} \theta_i \exp\left( - \frac{(s - c_i)^2}{2} \right),
$$
where $(c_1, \ldots, c_6) = (0, 2, 4, 6, 8, 10)$. If an action determined by the above policy is out of the action space, it is pulled back to be confined in the domain.

As transition dynamics, the following two scenarios are considered:

Gaussian: The true transition dynamics is given by
$$
s_{t+1} = s_t + a_t + \varepsilon_t,
$$
where $\varepsilon_t$ is the Gaussian noise with mean 0 and standard deviation 0.3.

Bimodal: The true transition dynamics is given by
$$
s_{t+1} = s_t \pm a_t + \varepsilon_t,
$$
where $\varepsilon_t$ is the Gaussian noise with mean 0 and standard deviation 0.3, and the sign of $a_t$ is randomly chosen with probability 1/2.

If the next state is out of the state space, it is projected back to the domain. Below, the budget for data collection is assumed to be limited to $N = 20$ trajectory samples. A sketch of this environment and policy model is given below.
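The following minimal sketch implements the chain-walk dynamics (Gaussian and bimodal scenarios) and the Gaussian-bump policy model; the random number handling and clipping conventions are assumptions.

```python
import numpy as np

CENTERS = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])

def policy(s, theta):
    """Deterministic policy a = sum_i theta_i exp(-(s - c_i)^2 / 2), confined to A."""
    a = np.dot(theta, np.exp(-(s - CENTERS) ** 2 / 2.0))
    return np.clip(a, -5.0, 5.0)

def step(s, a, bimodal=False, rng=np.random):
    """One transition of the continuous chain walk."""
    eps = rng.normal(0.0, 0.3)
    if bimodal:
        a = a if rng.rand() < 0.5 else -a        # sign flipped with probability 1/2
    s_next = np.clip(s + a + eps, 0.0, 10.0)     # project back to S = [0, 10]
    reward = 1.0 if 4.0 < s_next < 6.0 else 0.0
    return s_next, reward
```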

10.3.1.2 Comparison of Model Estimators

When the transition model is learned in the M-PGPE methods, all $N = 20$ trajectory samples are gathered randomly in the beginning at once. More specifically, the initial state $s_1$ and the action $a_1$ are chosen from the uniform distributions over $\mathcal{S}$ and $\mathcal{A}$, respectively. Then the next state $s_2$ and the immediate reward $r_1$ are obtained. After that, the action $a_2$ is chosen from the uniform distribution over $\mathcal{A}$, and the next state $s_3$ and the immediate reward $r_2$ are obtained. This process is repeated until $r_T$ is obtained, by which a trajectory sample is obtained. This data generation process is repeated $N$ times to obtain $N$ trajectory samples.

Figure 10.2 and Figure 10.3 illustrate the true transition dynamics and their estimates obtained by LSCDE and GP in the Gaussian and bimodal cases, respectively.

FIGURE 10.2: Gaussian transition dynamics and its estimates by LSCDE and GP: (a) true transition, (b) transition estimated by LSCDE, (c) transition estimated by GP.

FIGURE 10.3: Bimodal transition dynamics and its estimates by LSCDE and GP: (a) true transition, (b) transition estimated by LSCDE, (c) transition estimated by GP.

Figure 10.2 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 10.3 shows that LSCDE can still successfully capture the entire profile of the true transition dynamics well even in the bimodal case, but GP fails to capture the bimodal structure.

Based on the estimated transition models, policies are learned by the M-PGPE method. More specifically, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are used for baseline estimation. Then policies are updated based on these artificial trajectory samples. This policy update step is repeated 100 times. For evaluating the return of a learned policy, 100 additional test trajectory samples are used which are not employed for policy learning. Figure 10.4 and Figure 10.5 depict the averages and standard errors of returns over 100 runs for the Gaussian and bimodal cases, respectively. The results show that, in the Gaussian case, the GP-based method performs very well and LSCDE also exhibits reasonable performance. In the bimodal case, on the other hand, GP performs poorly and LSCDE gives much better results than GP. This illustrates the high flexibility of LSCDE.

FIGURE 10.4: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for Gaussian transition.

FIGURE 10.5: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for bimodal transition.

FIGURE 10.6: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for Gaussian transition with different sampling schedules (e.g., 5x4 means gathering $k = 5$ trajectory samples 4 times).

FIGURE 10.7: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for bimodal transition with different sampling schedules (e.g., 5x4 means gathering $k = 5$ trajectory samples 4 times).

10.3.1.3 Comparison of Model-Based and Model-Free Methods

Next, the performance of the model-based and model-free PGPE methods are compared.

Under the fixed budget scenario, the schedule of collecting 20 trajectory samples needs to be determined for the IW-PGPE method. First, the influence of the choice of sampling schedules is illustrated. Figure 10.6 and Figure 10.7 show expected returns averaged over 100 runs under the sampling schedule that a batch of $k$ trajectory samples are gathered $20/k$ times for different values of $k$. Here, policy update is performed 100 times after observing each batch of $k$ trajectory samples, because this performed better than the usual scheme of updating the policy only once. Figure 10.6 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering $k = 20$ trajectory samples at once is shown to be the best choice in the Gaussian case. Figure 10.7 shows that gathering $k = 20$ trajectory samples at once is also the best choice in the bimodal case.

Although the best sampling schedule is not accessible in practice, the optimal sampling schedule is used for evaluating the performance of IW-PGPE. Figure 10.4 and Figure 10.5 show the averages and standard errors of returns obtained by IW-PGPE over 100 runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all trajectory samples are gathered at once in the beginning. The performance of IW-PGPE may be further improved if it is possible to gather more trajectory samples. However, this is prohibited under the fixed budget scenario. On the other hand, returns of M-PGPE keep increasing over iterations, because artificial trajectory samples can be kept generated without additional sampling costs. This illustrates a potential advantage of model-based reinforcement learning (RL) methods.

10.3.2 Humanoid Robot Control

Finally, the performance of M-PGPE is evaluated on a practical control problem of a simulated upper-body model of the humanoid robot CB-i (Cheng et al., 2007), which was also used in Section 9.3.3; see Figure 9.5 for the illustrations of CB-i and its simulator.

10.3.2.1 Setup

The simulator is based on the upper body of the CB-i humanoid robot, which has 9 joints for the shoulder pitch, shoulder roll, and elbow pitch of the right arm, the shoulder pitch, shoulder roll, and elbow pitch of the left arm, the waist yaw, the torso roll, and the torso pitch. The state vector is 18-dimensional and real-valued, which corresponds to the current angle in degrees and the current angular velocity for each joint. The action vector is 9-dimensional and real-valued, which corresponds to the target angle of each joint in degrees. The goal of the control problem is to lead the end effector of the right arm (right hand) to the target object. A noisy control system is simulated by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each element of the action vector, Gaussian noise with mean 0 and standard deviation 3 is added with probability 0.6, and Gaussian noise with mean $-5$ and standard deviation 3 is added with probability 0.4.

The initial posture of the robot is fixed to be standing up straight with arms down. The target object is located in front of and above the right hand, which is reachable by using the controllable joints. The reward function at each time step is defined as
$$
r_t = \exp(-10 d_t) - 0.000005 \min(c_t, 1{,}000{,}000),
$$
where $d_t$ is the distance between the right hand and target object at time step $t$, and $c_t$ is the sum of control costs for each joint. The deterministic policy model used in M-PGPE and IW-PGPE is defined as $a = \theta^\top \phi(s)$ with the basis function $\phi(s) = s$. The trajectory length is set at $T = 100$ and the discount factor is set at $\gamma = 0.9$.

10.3.2.2 Experiment with 2 Joints

First, we consider using only 2 joints among the 9 joints, i.e., only the right shoulder pitch and right elbow pitch are allowed to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionalities of state vector $s$ and action vector $a$ are 4 and 2, respectively.

We suppose that the budget for data collection is limited to $N = 50$ trajectory samples. For the M-PGPE methods, all trajectory samples are collected at first using the uniformly random initial states and policy. More specifically, the initial state is chosen from the uniform distribution over $\mathcal{S}$. At each time step, the action $a_i$ of the $i$-th joint is first drawn from the uniform distribution on $[s_i - 5, s_i + 5]$, where $s_i$ denotes the state for the $i$-th joint. In total, 5000 transition samples are collected for model estimation. Then, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are generated for baseline estimation in each iteration. The sampling schedule of the IW-PGPE method is chosen to collect $k = 5$ trajectory samples $50/k$ times, which performs well, as shown in Figure 10.8. The average and standard error of the return obtained by each method over 10 runs are plotted in Figure 10.9, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE.

Figure 10.10 illustrates an example of the reaching motion with 2 joints obtained by M-PGPE(LSCDE) at the 60th iteration. This shows that the learned policy successfully leads the right hand to the target object within only 13 steps in this noisy control system.

10.3.2.3 Experiment with 9 Joints

Finally, the performance of M-PGPE(LSCDE) and IW-PGPE is evaluated on the reaching task with all 9 joints.

The experimental setup is essentially the same as the 2-joint case, but the budget for gathering $N = 1000$ trajectory samples is given to this complex and high-dimensional task. The position of the target object is moved to far left, which is not reachable by using only 2 joints. Thus, the robot is required to move other joints to reach the object with the right hand. Five thousand randomly chosen transition samples are used as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE is set at gathering 1000 trajectory samples at once, which is the best sampling schedule according to Figure 10.11. The averages and standard errors of returns obtained by M-PGPE(LSCDE) and IW-PGPE over 30 runs are plotted in Figure 10.12, showing that M-PGPE(LSCDE) tends to outperform IW-PGPE.

Figure 10.13 exhibits a typical reaching motion with 9 joints obtained by M-PGPE(LSCDE) at the 1000th iteration. This shows that the right hand is led to the distant object successfully within 14 steps.

FIGURE 10.8: Averages and standard errors of returns obtained by IW-PGPE over 10 runs for the 2-joint humanoid robot simulator for different sampling schedules (e.g., 5x10 means gathering $k = 5$ trajectory samples 10 times).

FIGURE 10.9: Averages and standard errors of obtained returns over 10 runs for the 2-joint humanoid robot simulator. All methods use 50 trajectory samples for policy learning. In M-PGPE(LSCDE) and M-PGPE(GP), all 50 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 5 trajectory samples is gathered for 10 iterations, which was shown to be the best sampling scheduling (see Figure 10.8). Note that policy update is performed 100 times after observing each batch of trajectory samples, which we confirmed to perform well. The bottom horizontal axis is for the M-PGPE methods, while the top horizontal axis is for the IW-PGPE method.

FIGURE 10.10: Example of arm reaching with 2 joints using a policy obtained by M-PGPE(LSCDE) at the 60th iteration (from left to right and top to bottom).

FIGURE 10.11: Averages and standard errors of returns obtained by IW-PGPE over 30 runs for the 9-joint humanoid robot simulator for different sampling schedules (e.g., 100x10 means gathering $k = 100$ trajectory samples 10 times).

FIGURE 10.12: Averages and standard errors of obtained returns over 30 runs for the humanoid robot simulator with 9 joints. Both methods use 1000 trajectory samples for policy learning. In M-PGPE(LSCDE), all 1000 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 1000 trajectory samples is gathered at once, which was shown to be the best scheduling (see Figure 10.11). Note that policy update is performed 100 times after observing each batch of trajectory samples. The bottom horizontal axis is for the M-PGPE method, while the top horizontal axis is for the IW-PGPE method.

FIGURE 10.13: Example of arm reaching with 9 joints using a policy obtained by M-PGPE(LSCDE) at the 1000th iteration (from left to right and top to bottom).

10.4 Remarks

Model-based reinforcement learning is a promising approach, given that the transition model can be estimated accurately. However, estimating the high-dimensional conditional density is challenging. In this chapter, a non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) was introduced, and model-based PGPE with LSCDE was shown to work excellently in experiments.

Under the fixed sampling budget, the model-free approach requires us to design the sampling schedule appropriately in advance. However, this is practically very hard unless strong prior knowledge is available. On the other hand, model-based methods do not suffer from this problem, which is an excellent practical advantage over the model-free approach.

In robotics, the model-free approach seems to be preferred because accurately learning the transition dynamics of complex robots is challenging (Deisenroth et al., 2013). Furthermore, model-free methods can utilize prior knowledge in the form of policy demonstration (Kober & Peters, 2011). On the other hand, the model-based approach is advantageous in that no interaction with the real robot is required once the transition model has been learned, and the learned transition model can be utilized for further simulation.

Actually, the choice of model-free or model-based methods is not only an ongoing research topic in machine learning, but also a big debatable issue in neuroscience. Therefore, further discussion would be necessary to more deeply understand the pros and cons of the model-based and model-free approaches. Combining or switching the model-free and model-based approaches would also be an interesting direction to be further investigated.

Chapter 11

Dimensionality Reduction for Transition Model Estimation

Least-squares conditional density estimation (LSCDE), introduced in Chapter 10, is a practical transition model estimator. However, transition model estimation is still challenging when the dimensionality of state and action spaces is high. In this chapter, a dimensionality reduction method is introduced to LSCDE which finds a low-dimensional expression of the original state and action vector that is relevant to predicting the next state. After mathematically formulating the problem of dimensionality reduction in Section 11.1, a detailed description of the dimensionality reduction algorithm based on squared-loss conditional entropy is provided in Section 11.2. Then numerical examples are given in Section 11.3, and this chapter is concluded in Section 11.4.

11.1 Sufficient Dimensionality Reduction

Sufficient dimensionality reduction (Li, 1991; Cook & Ni, 2005) is a framework of dimensionality reduction in a supervised learning setting of analyzing an input-output relation; in our case, input is the state-action pair $(s,a)$ and output is the next state $s'$. Sufficient dimensionality reduction is aimed at finding a low-dimensional expression $z$ of input $(s,a)$ that contains "sufficient" information about output $s'$.

Let $z$ be a linear projection of input $(s,a)$. More specifically, using matrix $W$ such that $WW^\top = I$, where $I$ denotes the identity matrix, $z$ is given by
$$
z = W \begin{pmatrix} s \\ a \end{pmatrix}.
$$
The goal of sufficient dimensionality reduction is, from independent transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$, to find $W$ such that $s'$ and $(s,a)$ are conditionally independent given $z$. This conditional independence means that $z$ contains all information about $s'$ and is equivalently expressed as
$$
p(s'|s,a) = p(s'|z). \qquad (11.1)
$$
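As a small illustration, the following sketch builds a matrix $W$ with orthonormal rows (via QR decomposition, an assumption used here only for initialization) and maps a state-action pair to its low-dimensional expression $z$.

```python
import numpy as np

def random_projection(dim_z, dim_sa, rng=np.random):
    """Random W with orthonormal rows (W W^T = I), e.g., for initialization."""
    Q, _ = np.linalg.qr(rng.randn(dim_sa, dim_z))
    return Q.T                                    # shape (dim_z, dim_sa)

def project(W, s, a):
    """Low-dimensional expression z = W [s; a]."""
    return W @ np.concatenate([s, np.atleast_1d(a)])
```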

11.2 Squared-Loss Conditional Entropy

In this section, the dimensionality reduction method based on the squared-loss conditional entropy (SCE) is introduced.

11.2.1 Conditional Independence

SCE is defined and expressed as
$$
\begin{aligned}
\mathrm{SCE}(s'|z) &= - \frac{1}{2} \iint p(s'|z) \, p(s',z) \, \mathrm{d}z \, \mathrm{d}s' \\
&= - \frac{1}{2} \iint \big( p(s'|z) - 1 \big)^2 p(z) \, \mathrm{d}z \, \mathrm{d}s' - 1 + \frac{1}{2} \int \mathrm{d}s'.
\end{aligned}
$$
It was shown in Tangkaratt et al. (2015) that
$$
\mathrm{SCE}(s'|z) \ge \mathrm{SCE}(s'|s,a),
$$
and the equality holds if and only if Eq. (11.1) holds. Thus, sufficient dimensionality reduction can be performed by minimizing $\mathrm{SCE}(s'|z)$ with respect to $W$:
$$
W^* = \operatorname*{argmin}_{W \in \mathbb{G}} \mathrm{SCE}(s'|z).
$$
Here, $\mathbb{G}$ denotes the Grassmann manifold, which is the set of matrices $W$ such that $WW^\top = I$ without redundancy in terms of the span.

Since SCE contains the unknown densities $p(s'|z)$ and $p(s',z)$, it cannot be directly computed. Here, let us employ the LSCDE method introduced in Chapter 10 to obtain an estimator $\widehat{p}(s'|z)$ of the conditional density $p(s'|z)$. Then, by replacing the expectation over $p(s',z)$ with the sample average, SCE can be approximated as
$$
\widehat{\mathrm{SCE}}(s'|z) = - \frac{1}{2M} \sum_{m=1}^{M} \widehat{p}(s'_m|z_m) = - \frac{1}{2} \widetilde{\alpha}^\top \widehat{v},
$$
where
$$
z_m = W \begin{pmatrix} s_m \\ a_m \end{pmatrix}
\qquad \text{and} \qquad
\widehat{v} = \frac{1}{M} \sum_{m=1}^{M} \phi(z_m, s'_m).
$$
$\phi(z,s')$ is the basis function vector used in LSCDE given by
$$
\phi_b(z,s') = \exp\left( - \frac{\|z - z_b\|^2 + \|s' - s'_b\|^2}{2\kappa^2} \right),
$$
where $\kappa > 0$ denotes the Gaussian kernel width. $\widetilde{\alpha}$ is the LSCDE solution given by
$$
\widetilde{\alpha} = (\widehat{U} + \lambda I)^{-1} \widehat{v},
$$
where $\lambda \ge 0$ is the regularization parameter and
$$
\widehat{U}_{b,b'} = \frac{(\sqrt{\pi}\kappa)^{\dim(s')}}{M} \exp\left( - \frac{\|s'_b - s'_{b'}\|^2}{4\kappa^2} \right)
\sum_{m=1}^{M} \exp\left( - \frac{\|z_m - z_b\|^2 + \|z_m - z_{b'}\|^2}{2\kappa^2} \right).
$$
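The following sketch computes this SCE approximator for a given projection $W$, reusing the LSCDE quantities above; a fixed kernel width and regularization parameter are assumed.

```python
import numpy as np

def sce_estimate(W, S, A, S_next, kappa=1.0, lam=0.1):
    """Least-squares SCE approximator: SCE_hat = -0.5 * alpha_tilde^T v_hat."""
    M = S.shape[0]
    ds = S_next.shape[1]
    Z = (W @ np.hstack([S, A.reshape(M, -1)]).T).T                 # z_m = W [s_m; a_m]

    Dz = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)        # ||z_m - z_b||^2
    Dy = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(axis=2)

    v_hat = np.exp(-(Dz + Dy) / (2 * kappa ** 2)).mean(axis=0)     # (B,)

    E = np.exp(-Dz / (2 * kappa ** 2))                             # (M, B)
    U_hat = ((np.sqrt(np.pi) * kappa) ** ds / M) \
            * np.exp(-Dy / (4 * kappa ** 2)) * (E.T @ E)

    alpha = np.linalg.solve(U_hat + lam * np.eye(M), v_hat)
    return -0.5 * alpha @ v_hat
```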

11.2.2 Dimensionality Reduction with SCE

With the above SCE estimator, a practical formulation for sufficient dimensionality reduction is given by
$$
\widehat{W} = \operatorname*{argmax}_{W \in \mathbb{G}} S(W), \qquad \text{where} \quad S(W) = \widetilde{\alpha}^\top \widehat{v}.
$$
The gradient of $S(W)$ with respect to $W_{\ell,\ell'}$ is given by
$$
\frac{\partial S}{\partial W_{\ell,\ell'}} = - \widetilde{\alpha}^\top \frac{\partial \widehat{U}}{\partial W_{\ell,\ell'}} \widetilde{\alpha} + 2 \frac{\partial \widehat{v}^\top}{\partial W_{\ell,\ell'}} \widetilde{\alpha}.
$$
In the Euclidean space, the above gradient gives the steepest direction (see also Section 7.3.1). However, on the Grassmann manifold, the natural gradient (Amari, 1998) gives the steepest direction. The natural gradient at $W$ is the projection of the ordinary gradient to the tangent space of the Grassmann manifold. If the tangent space is equipped with the canonical metric $\langle W, W' \rangle = \frac{1}{2} \mathrm{tr}(W^\top W')$, the natural gradient at $W$ is given as follows (Edelman et al., 1998):
$$
\frac{\partial S}{\partial W} W_\perp^\top W_\perp,
$$
where $W_\perp$ is the matrix such that $[W^\top, W_\perp^\top]$ is an orthogonal matrix.

The geodesic from $W$ to the direction of the natural gradient over the Grassmann manifold can be expressed using $t \in \mathbb{R}$ as
$$
W_t = \begin{bmatrix} I & O \end{bmatrix}
\exp\left( - t \begin{bmatrix} O & \dfrac{\partial S}{\partial W} W_\perp^\top \\ - W_\perp \dfrac{\partial S}{\partial W}^\top & O \end{bmatrix} \right)
\begin{bmatrix} W \\ W_\perp \end{bmatrix},
$$
where "exp" for a matrix denotes the matrix exponential and $O$ denotes the zero matrix. Then line search along the geodesic in the natural gradient direction is performed by finding the maximizer from $\{ W_t \mid t \ge 0 \}$ (Edelman et al., 1998).

Once $W$ is updated by the natural gradient method, SCE is re-estimated for the new $W$ and natural gradient ascent is performed again. This entire procedure is repeated until $W$ converges, and the final solution is given by
$$
\widehat{p}(s'|z) = \frac{\widehat{\alpha}^\top \phi(z,s')}{\int \widehat{\alpha}^\top \phi(z,s'') \, \mathrm{d}s''},
$$
where $\widehat{\alpha}_b = \max(0, \widetilde{\alpha}_b)$, and the denominator can be analytically computed as
$$
\int \widehat{\alpha}^\top \phi(z,s'') \, \mathrm{d}s'' = (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{b=1}^{B} \widehat{\alpha}_b \exp\left( - \frac{\|z - z_b\|^2}{2\kappa^2} \right).
$$
When SCE is re-estimated, performing cross-validation for LSCDE in every step is computationally expensive. In practice, cross-validation may be performed only once every several gradient updates. Furthermore, to find a better local optimal solution, this gradient ascent procedure may be executed multiple times with randomly chosen initial solutions, and the one achieving the largest objective value is chosen.
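The geodesic line search along the natural gradient direction can be sketched as follows; `sce_objective` and `dS_dW` are hypothetical inputs (for example, the SCE-based objective above and its ordinary gradient obtained numerically), and a simple grid over $t$ is assumed for the line search.

```python
import numpy as np
from scipy.linalg import expm, null_space

def geodesic_step(W, dS_dW, sce_objective, t_grid=np.linspace(0.0, 1.0, 21)):
    """One natural-gradient line search along the Grassmann geodesic W_t."""
    d, D = W.shape                              # W has orthonormal rows: W W^T = I
    W_perp = null_space(W).T                    # rows complete W to an orthogonal matrix
    G = dS_dW @ W_perp.T                        # (dS/dW) W_perp^T
    B = np.block([[np.zeros((d, d)), G],
                  [-G.T, np.zeros((D - d, D - d))]])
    stacked = np.vstack([W, W_perp])            # [W; W_perp]

    best_W, best_val = W, sce_objective(W)      # t = 0 corresponds to the current W
    for t in t_grid[1:]:
        W_t = expm(-t * B)[:d, :] @ stacked     # W_t = [I O] exp(-tB) [W; W_perp]
        val = sce_objective(W_t)
        if val > best_val:
            best_W, best_val = W_t, val
    return best_W
```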

11.2.3 Relation to Squared-Loss Mutual Information

The above dimensionality reduction method minimizes SCE:
$$
\mathrm{SCE}(s'|z) = - \frac{1}{2} \iint \frac{p(z,s')^2}{p(z)} \, \mathrm{d}z \, \mathrm{d}s'.
$$
On the other hand, the dimensionality reduction method proposed in Suzuki and Sugiyama (2013) maximizes squared-loss mutual information (SMI):
$$
\mathrm{SMI}(z,s') = \frac{1}{2} \iint \frac{p(z,s')^2}{p(z) p(s')} \, \mathrm{d}z \, \mathrm{d}s'.
$$
Note that SMI can be approximated almost in the same way as SCE by the least-squares method (Suzuki & Sugiyama, 2013). The above equations show that the essential difference between SCE and SMI is whether $p(s')$ is included in the denominator of the density ratio, and SCE is reduced to the negative SMI if $p(s')$ is uniform. However, if $p(s')$ is not uniform, the density ratio function $\frac{p(z,s')}{p(z)p(s')}$ included in SMI may be more fluctuated than $\frac{p(z,s')}{p(z)}$ included in SCE. Since a smoother function can be more accurately estimated from a small number of samples in general (Vapnik, 1998), SCE-based dimensionality reduction is expected to work better than SMI-based dimensionality reduction.

11.3 Numerical Examples

In this section, experimental behavior of the SCE-based dimensionality reduction method is illustrated.

11.3.1 Artificial and Benchmark Datasets

The following dimensionality reduction schemes are compared:

• None: No dimensionality reduction is performed.
• SCE (Section 11.2): Dimensionality reduction is performed by minimizing the least-squares SCE approximator using natural gradients over the Grassmann manifold (Tangkaratt et al., 2015).
• SMI (Section 11.2.3): Dimensionality reduction is performed by maximizing the least-squares SMI approximator using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).
• True: The "true" subspace is used (only for artificial datasets).

After dimensionality reduction, the following conditional density estimators are run:

• LSCDE (Section 10.1.3): Least-squares conditional density estimation (Sugiyama et al., 2010).
• $\epsilon$KDE (Section 10.1.2): $\epsilon$-neighbor kernel density estimation, where $\epsilon$ is chosen by least-squares cross-validation.

First, the behavior of SCE-LSCDE is compared with the plain LSCDE with no dimensionality reduction. The datasets have 5-dimensional input $x = (x^{(1)}, \ldots, x^{(5)})^\top$ and 1-dimensional output $y$. Among the 5 dimensions of $x$, only the first dimension $x^{(1)}$ is relevant to predicting the output $y$ and the other 4 dimensions $x^{(2)}, \ldots, x^{(5)}$ are just standard Gaussian noise. Figure 11.1 plots the first dimension of input and output of the samples in the datasets and conditional density estimation results. The graphs show that the plain LSCDE does not perform well due to the irrelevant noise dimensions in input, while SCE-LSCDE gives much better estimates.

Next, artificial datasets with 5-dimensional input $x = (x^{(1)}, \ldots, x^{(5)})^\top$ and 1-dimensional output $y$ are used. Each element of $x$ follows the standard Gaussian distribution and $y$ is given by

(a) $y = x^{(1)} + (x^{(1)})^2 + (x^{(1)})^3 + \varepsilon$,

(b) $y = (x^{(1)})^2 + (x^{(2)})^2 + \varepsilon$,

FIGURE 11.1: Examples of conditional density estimation by plain LSCDE and SCE-LSCDE: (a) bone mineral density, (b) Old Faithful geyser. Samples are plotted together with the estimates by plain LSCDE and SCE-LSCDE.

where $\varepsilon$ is the Gaussian noise with mean zero and standard deviation 1/4.

The top row of Figure 11.2 shows the dimensionality reduction error between the true $W^*$ and its estimate $\widehat{W}$ for different sample size $n$, measured by
$$
\mathrm{Error}_{\mathrm{DR}} = \big\| \widehat{W}^\top \widehat{W} - W^{*\top} W^* \big\|_{\mathrm{Frobenius}},
$$
where $\| \cdot \|_{\mathrm{Frobenius}}$ denotes the Frobenius norm. The SMI-based and SCE-based dimensionality reduction methods both perform similarly for the dataset (a), while the SCE-based method clearly outperforms the SMI-based method for the dataset (b). The histograms of $\{y_i\}_{i=1}^{400}$ plotted in the 2nd row of Figure 11.2 show that the profile of the histogram (which is a sample approximation of $p(y)$) in the dataset (b) is much sharper than that in the dataset (a). As explained in Section 11.2.3, the density ratio function used in SMI contains $p(y)$ in the denominator. Therefore, it would be highly non-smooth and thus is hard to approximate. On the other hand, the density ratio function used in SCE does not contain $p(y)$. Therefore, it would be smoother than the one used in SMI and thus is easier to approximate.

The 3rd and 4th rows of Figure 11.2 plot the conditional density estimation error between the true $p(y|x)$ and its estimate $\widehat{p}(y|x)$, evaluated by the squared loss (without a constant):
$$
\mathrm{Error}_{\mathrm{CDE}} = \frac{1}{2n'} \sum_{i=1}^{n'} \int \widehat{p}(y|\widetilde{x}_i)^2 \, \mathrm{d}y - \frac{1}{n'} \sum_{i=1}^{n'} \widehat{p}(\widetilde{y}_i|\widetilde{x}_i),
$$
where $\{(\widetilde{x}_i, \widetilde{y}_i)\}_{i=1}^{n'}$ is a set of test samples that have not been used for conditional density estimation. We set $n' = 1000$. The graphs show that LSCDE overall outperforms $\epsilon$KDE for both datasets.

FIGURE 11.2: Top row: The mean and standard error of the dimensionality reduction error over 20 runs on the artificial datasets. 2nd row: Histograms of output $\{y_i\}_{i=1}^{400}$. 3rd and 4th rows: The mean and standard error of the conditional density estimation error over 20 runs.

For the dataset (a), SMI-LSCDE and SCE-LSCDE perform equally well, and are much better than plain LSCDE with no dimensionality reduction (LSCDE) and comparable to LSCDE with the true subspace (LSCDE*). For the dataset (b), SCE-LSCDE outperforms SMI-LSCDE and LSCDE and is comparable to LSCDE*.
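For reference, the two evaluation metrics used above can be sketched as follows; the conditional density estimator is assumed to be exposed as a callable `p_hat(y_values, x)` (a hypothetical interface), and the integral over $y$ is approximated numerically on a grid.

```python
import numpy as np

def error_dr(W_hat, W_star):
    """Dimensionality reduction error: ||W_hat^T W_hat - W*^T W*||_Frobenius."""
    return np.linalg.norm(W_hat.T @ W_hat - W_star.T @ W_star, ord='fro')

def error_cde(p_hat, X_test, y_test, y_grid):
    """Squared-loss conditional density estimation error (up to a constant).

    p_hat(y_values, x) returns the estimated density p_hat(y|x) at each y in
    y_values; the integral over y is approximated by the trapezoidal rule.
    """
    first = np.mean([np.trapz(p_hat(y_grid, x) ** 2, y_grid) for x in X_test]) / 2
    second = np.mean([p_hat(np.array([y]), x)[0] for x, y in zip(X_test, y_test)])
    return first - second
```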

Next, the UCI benchmark datasets (Bache & Lichman, 2013) are used for performance evaluation. $n$ samples are selected randomly from each dataset for conditional density estimation, and the rest of the samples are used to measure the conditional density estimation error. Since the dimensionality of $z$ is unknown for the benchmark datasets, it was determined by cross-validation. The results are summarized in Table 11.1, showing that SCE-LSCDE works well overall. Table 11.2 describes the dimensionalities selected by cross-validation, showing that both the SCE-based and SMI-based methods reduce the dimensionality significantly.

11.3.2

HumanoidRobot

Finally,SCE-LSCDEisappliedtotransitionestimationofahumanoid

robot.Weuseasimulatoroftheupper-bodypartofthehumanoidrobot

Page 528: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)

CB-i(Chengetal.,2007)(seeFigure9.5).

Therobothas9controllablejoints:shoulderpitch,shoulderroll,elbow

pitchoftherightarm,andshoulderpitch,shoulderroll,elbowpitchofthe

leftarm,waistyaw,torsoroll,andtorsopitchjoints.Postureoftherobotis

describedby18-dimensionalreal-valuedstatevectors,whichcorrespondsto

theangleandangularvelocityofeachjointinradianandradian-per-second,

respectively.Therobotiscontrolledbysendinganactioncommandatothe

system.Theactioncommandaisa9-dimensionalreal-valuedvector,which

correspondstothetargetangleofeachjoint.Whentherobotiscurrentlyat

statesandreceivesactiona,thephysicalcontrolsystemofthesimulator

calculatestheamountoftorquetobeappliedtoeachjoint(seeSection9.3.3

fordetails).

In the experiment, the action vector a is randomly chosen and a noisy control system is simulated by adding a bimodal Gaussian noise vector. More specifically, the action a_i of the i-th joint is first drawn from the uniform distribution on [s_i − 0.087, s_i + 0.087], where s_i denotes the state of the i-th joint. The drawn action is then contaminated by Gaussian noise with mean 0 and standard deviation 0.034 with probability 0.6, and by Gaussian noise with mean −0.087 and standard deviation 0.034 with probability 0.4. By repeatedly controlling the robot M times, transition samples {(s_m, a_m, s'_m)}_{m=1}^M are obtained. Our goal is to learn the system dynamics as a state transition probability p(s'|s, a) from these samples.
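As a concrete illustration of this data-collection scheme, the short Python sketch below draws a target-angle command uniformly around the current joint angles, adds the bimodal Gaussian noise per joint, and records transition samples. The step function simulate_step and the assumption that the joint angles occupy the first half of the state vector are hypothetical stand-ins; they are not part of the simulator described in this book.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_action(angles, half_width=0.087):
        # Target angle for each joint, uniform on [s_i - 0.087, s_i + 0.087].
        return rng.uniform(angles - half_width, angles + half_width)

    def add_bimodal_noise(a):
        # Per-joint contamination: N(0, 0.034^2) with probability 0.6,
        # N(-0.087, 0.034^2) with probability 0.4.
        means = np.where(rng.random(a.shape) < 0.6, 0.0, -0.087)
        return a + rng.normal(means, 0.034)

    def collect_transitions(simulate_step, s0, M):
        # Repeatedly control the robot M times and store (s, a, s') triples.
        samples, s = [], s0
        for _ in range(M):
            angles = s[: len(s) // 2]      # assumed state layout: angles, then velocities
            a = add_bimodal_noise(sample_action(angles))
            s_next = simulate_step(s, a)   # hypothetical simulator call
            samples.append((s, a, s_next))
            s = s_next
        return samples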

The following three scenarios are considered: using only 2 joints (right shoulder pitch and right elbow pitch), only 4 joints (in addition, right shoulder roll and waist yaw), and all 9 joints. These setups correspond to 6-dimensional input and 4-dimensional output in the 2-joint case, 12-dimensional input and 8-dimensional output in the 4-joint case, and 27-dimensional input and 18-dimensional output in the 9-joint case. Five hundred, 1000, and 1500 transition samples are generated for the 2-joint, 4-joint, and 9-joint cases, respectively.

TABLE 11.1: Mean and standard error of the conditional density estimation error over 10 runs for benchmark and robot transition datasets. The best method in terms of the mean error and comparable methods according to the two-sample paired t-test at the significance level 5% are specified in bold face (smaller is better).

TABLE 11.2: Mean and standard error of the chosen subspace dimensionality over 10 runs for benchmark and robot transition datasets.

                              SCE-based                  SMI-based
Dataset        (dx, dy)   LSCDE        εKDE         LSCDE        εKDE
Housing        (13, 1)    3.9 (0.74)   2.0 (0.79)   2.0 (0.39)   1.3 (0.15)
Auto MPG       (7, 1)     3.2 (0.66)   1.3 (0.15)   2.1 (0.67)   1.1 (0.10)
Servo          (4, 1)     1.9 (0.35)   2.4 (0.40)   2.2 (0.33)   1.6 (0.31)
Yacht          (6, 1)     1.0 (0.00)   1.0 (0.00)   1.0 (0.00)   1.0 (0.00)
Physicochem    (9, 1)     6.5 (0.58)   1.9 (0.28)   6.6 (0.58)   2.6 (0.86)
White Wine     (11, 1)    1.2 (0.13)   1.0 (0.00)   1.4 (0.31)   1.0 (0.00)
Red Wine       (11, 1)    1.0 (0.00)   1.3 (0.15)   1.2 (0.20)   1.0 (0.00)
Forest Fires   (12, 1)    1.2 (0.20)   4.9 (0.99)   1.4 (0.22)   6.8 (1.23)
Concrete       (8, 1)     1.0 (0.00)   1.0 (0.00)   1.2 (0.13)   1.0 (0.00)
Energy         (8, 2)     5.9 (0.10)   3.9 (0.80)   2.1 (0.10)   2.0 (0.30)
Stock          (7, 2)     3.2 (0.83)   2.1 (0.59)   2.1 (0.60)   2.7 (0.67)
2 Joints       (6, 4)     2.9 (0.31)   2.7 (0.21)   2.5 (0.31)   2.0 (0.00)
4 Joints       (12, 8)    5.2 (0.68)   6.2 (0.63)   5.4 (0.67)   4.6 (0.43)
9 Joints       (27, 18)   13.8 (1.28)  15.3 (0.94)  11.4 (0.75)  13.2 (1.02)

Then, randomly chosen n = 100, 200, and 500 samples are used for conditional density estimation, and the rest is used for evaluating the test error. The results are summarized in Table 11.1, showing that SCE-LSCDE performs well for all three cases. Table 11.2 describes the dimensionalities selected by cross-validation. This shows that the dimensionalities are much reduced, implying that the transition of the humanoid robot is highly redundant.

11.4 Remarks

Coping with high dimensionality of the state and action spaces is one of the most important challenges in model-based reinforcement learning. In this chapter, a dimensionality reduction method for conditional density estimation was introduced. The key idea was to use the squared-loss conditional entropy (SCE) for dimensionality reduction, which can be estimated by least-squares conditional density estimation. This allowed us to perform dimensionality reduction and conditional density estimation simultaneously in an integrated manner. In contrast, dimensionality reduction based on squared-loss mutual information (SMI) yields a two-step procedure of first reducing the dimensionality and then estimating the conditional density. SCE-based dimensionality reduction was shown to outperform the SMI-based method, particularly when the output follows a skewed distribution.
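To illustrate the integrated nature of this procedure, here is a minimal Python sketch under two assumptions that are not spelled out in this section: (i) fit_lscde(z, y) stands in for least-squares conditional density estimation and returns a model whose pdf(z, y) method evaluates the estimated conditional density at given pairs, and (ii) SCE for the projected input z = Wx is approximated by the negative half of the sample average of the estimated conditional density at the training points. The random search over projections is only for illustration; the chapter optimizes the projection matrix more carefully.

    import numpy as np

    def sce_estimate(x, y, W, fit_lscde):
        # Project the input, fit a conditional density model on (z, y), and
        # approximate SCE by -0.5 times the mean of p_hat(y_i | z_i).
        z = x @ W.T
        model = fit_lscde(z, y)
        return -0.5 * float(np.mean(model.pdf(z, y)))

    def search_projection(x, y, dim_z, fit_lscde, n_candidates=50, seed=0):
        # Pick, among random orthonormal projections, the one minimizing the SCE
        # estimate; dimensionality reduction and CDE thus share a single criterion.
        rng = np.random.default_rng(seed)
        best_W, best_sce = None, np.inf
        for _ in range(n_candidates):
            Q, _ = np.linalg.qr(rng.normal(size=(x.shape[1], dim_z)))
            W = Q.T                        # dim_z x dim_x, rows orthonormal
            sce = sce_estimate(x, y, W, fit_lscde)
            if sce < best_sce:
                best_W, best_sce = W, sce
        return best_W, best_sce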

References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 1–8).

Abe, N., Melville, P., Pendus, C., Reddy, C. K., Jensen, D. L., Thomas, V. P., Bennett, J. J., Anderson, G. F., Cooley, B. R., Kowalczyk, M., Domick, M., & Gardinier, T. (2010). Optimizing debt collections using constrained reinforcement learning. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 75–84).

Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299–307.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.

Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI, USA: Oxford University Press.

Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml/

Baxter, J., Bartlett, P., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351–381.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge, UK: Cambridge University Press.

Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA, USA: MIT Press.

Cheng, G., Hyon, S., Morimoto, J., Ude, A., Joshua, G. H., Colvin, G., Scroggin, W., & Stephen, C. J. (2007). CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21, 1097–1114.

Chung, F. R. K. (1997). Spectral graph theory. Providence, RI, USA: American Mathematical Society.

Coifman, R., & Maggioni, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21, 53–94.

Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100, 410–428.

Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9, 271–278.

Deisenroth, M. P., Neumann, G., & Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics, 2, 1–142.

Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of International Conference on Machine Learning (pp. 465–473).

Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46, 225–254.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Dijkstra, E. W. (1959). A note on two problems in connexion [sic] with graphs. Numerische Mathematik, 1, 269–271.

Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20, 303–353.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.

Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement learning with Gaussian processes. Proceedings of International Conference on Machine Learning (pp. 201–208).

Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. Berlin, Germany: Springer-Verlag.

Fredman, M. L., & Tarjan, R. E. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34, 569–615.

Goldberg, A. V., & Harrelson, C. (2005). Computing the shortest path: A* search meets graph theory. Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 156–165).

Gooch, B., & Gooch, A. (2001). Non-photorealistic rendering. Natick, MA, USA: A. K. Peters Ltd.

Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.

Guo, Q., & Kunii, T. L. (2003). "Nijimi" rendering algorithm for creating quality black ink paintings. Proceedings of Computer Graphics International (pp. 152–159).

Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publication.

Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. Proceedings of Annual Conference on Computer Graphics and Interactive Techniques (pp. 453–460).

Hertzmann, A. (2003). A survey of stroke based rendering. IEEE Computer Graphics and Applications, 23, 70–81.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.

Huber, P. J. (1981). Robust statistics. New York, NY, USA: Wiley.

Kakade, S. (2002). A natural policy gradient. Advances in Neural Information Processing Systems 14 (pp. 1531–1538).

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86, 335–367.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 90, 431–460.

Kober, J., & Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84, 171–203.

Koenker, R. (2005). Quantile regression. Cambridge, MA, USA: Cambridge University Press.

Kohonen, T. (1995). Self-organizing maps. Berlin, Germany: Springer.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–342.

Mahadevan, S. (2005). Proto-value functions: Developmental reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 553–560).

Mangasarian, O. L., & Musicant, D. R. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 950–955.

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010a). Nonparametric return distribution approximation for reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 799–806).

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010b). Parametric return density estimation for reinforcement learning. Conference on Uncertainty in Artificial Intelligence (pp. 368–375).

Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2219–2225).

Peters, J., & Schaal, S. (2007). Reinforcement learning by reward-weighted regression for operational space control. Proceedings of International Conference on Machine Learning (pp. 745–750). Corvallis, Oregon, USA.

Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of International Conference on Machine Learning (pp. 759–766).

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA, USA: MIT Press.

Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26, 1443–1471.

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York, NY, USA: Wiley.

Schaal, S. (2009). The SL simulation and real-time control software package (Technical Report). Computer Science and Neuroscience, University of Southern California.

Sehnke, F., Osendorfer, C., Rückstiess, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23, 551–559.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.

Siciliano, B., & Khatib, O. (Eds.). (2008). Springer handbook of robotics. Berlin, Germany: Springer-Verlag.

Sugimoto, N., Tangkaratt, V., Wensveen, T., Zhao, T., Sugiyama, M., & Morimoto, J. (2014). Efficient reuse of previous experiences in humanoid motor learning. Proceedings of IEEE-RAS International Conference on Humanoid Robots (pp. 554–559).

Sugiyama, M. (2006). Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7, 141–166.

Sugiyama, M., Hachiya, H., Towell, C., & Vijayakumar, S. (2008). Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25, 287–304.

Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.

Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D, 583–594.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press.

Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725–758.

Takeda, A. (2007). Support vector machine based on conditional value-at-risk minimization (Technical Report B-439). Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.

Tangkaratt, V., Mori, S., Zhao, T., Morimoto, J., & Sugiyama, M. (2014). Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural Networks, 57, 128–140.

Tangkaratt, V., Xie, N., & Sugiyama, M. (2015). Conditional density estimation with dimensionality reduction via squared-loss conditional entropy minimization. Neural Computation, 27, 228–254.

Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.

Tibshirani, R. (1996). Regression shrinkage and subset selection with the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12, 1537–1586.

Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.

Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM toolbox for Matlab 5 (Technical Report A57). Helsinki University of Technology.

Wahba, G. (1990). Spline models for observational data. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.

Wang, X., & Dietterich, T. G. (2003). Model-based policy gradient reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 776–783).

Wawrzynski, P. (2009). Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22, 1484–1497.

Weaver, L., & Baxter, J. (1999). Reinforcement learning from state and temporal differences (Technical Report). Department of Computer Science, Australian National University.

Weaver, L., & Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (pp. 538–545).

Williams, J. D., & Young, S. J. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Xie, N., Hachiya, H., & Sugiyama, M. (2013). Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. IEICE Transactions on Information and Systems, E95-D, 1134–1144.

Xie, N., Laga, H., Saito, S., & Nakajima, M. (2011). Contour-driven Sumi-e rendering of real photos. Computers & Graphics, 35, 122–134.

Zhao, T., Hachiya, H., Niu, G., & Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26, 118–129.

Zhao, T., Hachiya, H., Tangkaratt, V., Morimoto, J., & Sugiyama, M. (2013). Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation, 25, 1512–1547.
