Sample z ~ P(z) and trajectories, then use MAML to adapt the meta-baseline V_θ(s) for the policy gradient:
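A toy sketch of this meta-baseline adaptation, using a linear baseline and a first-order (Reptile-style) approximation in place of full MAML; the task distribution and all dimensions here are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim = 4
meta_theta = np.zeros(state_dim)          # meta-baseline V_theta(s), linear here

def adapt(theta, states, returns, inner_lr=0.1, steps=5):
    """Inner loop: adapt the meta-baseline to one sampled input z by
    gradient steps on the squared value-prediction error."""
    for _ in range(steps):
        grad = -2.0 * states.T @ (returns - states @ theta) / len(returns)
        theta = theta - inner_lr * grad
    return theta

# Outer loop (first-order Reptile-style approximation of MAML):
# each sampled "input" w_z induces its own return function w_z . s.
for _ in range(300):
    w_z = rng.normal(size=state_dim)
    states = rng.normal(size=(32, state_dim))
    returns = states @ w_z
    adapted = adapt(meta_theta.copy(), states, returns)
    meta_theta += 0.1 * (adapted - meta_theta)

# At meta-test time: adapt to a freshly sampled input z, then use the
# adapted V_{theta_z}(s) as the input-dependent baseline.
```

The point of the meta-learning view is that a few inner gradient steps specialize one shared baseline to each input realization, avoiding a separate value network per input.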
Input-Dependent Baselines
State-dependent baseline: b(s_t) = V(s_t), ∀ z_{t:∞}
Input-dependent baseline: b(s_t, z_{t:∞}) = V(s_t | z_{t:∞}), which depends on the entire future input sequence {z_t, z_{t+1}, …} during training
Input-dependent baselines are bias-free for policy gradients:
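The claim follows because, for a fixed input realization z, the baseline is constant with respect to the sampled action, so the usual score-function identity applies:

```latex
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t, z_{t:\infty})\big]
  = b(s_t, z_{t:\infty})\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
  = b(s_t, z_{t:\infty})\, \nabla_\theta 1 = 0,
```

and taking the outer expectation over z ~ P(z) and the state distribution preserves the zero, so the policy gradient estimate remains unbiased.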
Implementations of input-dependent baselines:
Experiments
Motivating Example
Input-Driven Processes
Variance Reduction for Reinforcement Learning in Input-Driven Environments
Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory
[Figure: load-balancing example — jobs of varying size arrive over time and a load balancer routes them to Server 1 or Server 2, shown under two different input (job-arrival) sequences.]
[Figure: experiment panels comparing input-dependent vs. state-dependent baselines with TRPO and A2C, and robustness panels for Robust Adversarial RL and Meta-Policy Optimization.]
Input-dependent baselines are applicable to many policy gradient methods, such as A2C, TRPO, and PPO, and they are complementary and orthogonal to robust adversarial RL methods such as RARL (Pinto et al., 2017) and meta-policy optimization such as MPO (Clavera et al., 2018).
[Figure: graphical models with states s_t, inputs z_t, and actions a_t — (a) standard MDP; (b) input-driven MDP; (c) input-driven POMDP.]
[Figure: load-balancing example — policy visualization, reward, and policy-variance panels.]
Sample z_i ~ {z_1, z_2, …, z_N}, then use the corresponding value network for the policy gradient:
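A minimal numpy sketch of this multi-value-network scheme, with a fixed pool of N training inputs and (illustrative) linear value functions standing in for neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
N, state_dim = 10, 4

# One independent linear value function V_{theta_i}(s) per training input z_i.
thetas = [np.zeros(state_dim) for _ in range(N)]

def value(i, s):
    """Baseline for states collected under input sequence z_i."""
    return thetas[i] @ s

def update(i, s, ret, lr=0.1):
    """SGD step regressing V_{theta_i}(s) toward the observed return under z_i."""
    thetas[i] += lr * (ret - value(i, s)) * s

# Toy training loop: the return depends on both the state and which input ran.
for _ in range(2000):
    i = rng.integers(N)                  # sample z_i from the fixed training pool
    s = rng.normal(size=state_dim)
    ret = (i + 1.0) * s.sum()            # toy input-dependent return
    update(i, s, ret)
```

Each network only ever sees one input sequence, so the scheme cannot generalize to unseen z, and its cost grows linearly with the number of training inputs — the drawback the poster flags as inefficient.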
Sample z ~ P(z), then use an LSTM to compute V_θ(s, z) for the policy gradient:
[Diagram: input z ~ P(z) → LSTM → baseline V_θ(s, z).]
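A sketch of the LSTM variant, with a hand-rolled LSTM cell summarizing the input sequence into an embedding that is concatenated with the state before a linear value head (all layer sizes and the single-matrix gate layout are illustrative choices, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, z_dim, hid = 4, 2, 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# LSTM cell parameters: one weight matrix covering all four gates.
W = rng.normal(0, 0.1, size=(4 * hid, z_dim + hid))
b = np.zeros(4 * hid)
w_head = rng.normal(0, 0.1, size=state_dim + hid)   # linear value head

def embed(z_seq):
    """Run the LSTM over the input sequence z_{t:T}; return the final hidden state."""
    h, c = np.zeros(hid), np.zeros(hid)
    for z in z_seq:
        gates = W @ np.concatenate([z, h]) + b
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def baseline(s, z_seq):
    """Input-dependent baseline V(s | z): value head on [state, input embedding]."""
    return w_head @ np.concatenate([s, embed(z_seq)])
```

Because the LSTM conditions on the realized input sequence, a single set of parameters can produce a different baseline for every z, unlike the one-network-per-input scheme.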
[Diagram: inputs z_1, z_2, …, z_N, each paired with its own value network V_{θ_1}(s), V_{θ_2}(s), …, V_{θ_N}(s).]
[Diagram: input z ~ P(z); MAML adapts the meta-baseline V_θ(s) into a per-input baseline V_{θ_z}(s).]
(Training a separate value network per input is not efficient.)
workload, network condition, wind, buoyancy, moving target
Environments with exogenous, stochastic input processes that affect the dynamics.
Since the reward is partially dictated by the input process, the state alone provides only limited information for estimating the average return. Policy gradient methods with standard state-dependent baselines therefore suffer from high variance.
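A toy numpy illustration of this variance gap (all quantities synthetic, not from the paper's experiments): when the return is mostly dictated by an exogenous input z, a state-only baseline leaves the input's variance in the advantage, while an input-conditioned baseline removes it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Synthetic quantities: an exogenous input z, a state feature s, and
# residual noise that no baseline can explain away.
z = rng.normal(size=n)              # input process (e.g. workload level)
s = rng.normal(size=n)              # state feature
noise = 0.5 * rng.normal(size=n)    # irreducible return noise
G = 5.0 * z + 1.0 * s + noise       # return, mostly dictated by the input

b_state = 1.0 * s                   # state-dependent baseline  E[G | s]
b_input = 5.0 * z + 1.0 * s         # input-dependent baseline  E[G | s, z]

var_state = np.var(G - b_state)     # ~ 25.25: the input's variance remains
var_input = np.var(G - b_input)     # ~ 0.25: only irreducible noise remains
```

The roughly 100x gap between the two residual variances is exactly the variance that a state-dependent baseline cannot remove in an input-driven environment.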
[Experiment legends: input-dependent baselines b(s, z) = V(s|z), implemented via MAML or 10 separate value networks, vs. the state-dependent baseline b(s) = V(s), ∀z, and a heuristic; each combined with TRPO, RARL, and MPO.]