
Learning Complex Neural Network Policies with Trajectory Optimization

Sergey Levine svlevine@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305 USA

Vladlen Koltun vladlen@adobe.com
Adobe Research, San Francisco, CA 94103 USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Direct policy search methods offer the promise of automatically learning controllers for complex, high-dimensional tasks. However, prior applications of policy search often required specialized, low-dimensional policy classes, limiting their generality. In this work, we introduce a policy search algorithm that can directly learn high-dimensional, general-purpose policies, represented by neural networks. We formulate the policy search problem as an optimization over trajectory distributions, alternating between optimizing the policy to match the trajectories, and optimizing the trajectories to match the policy and minimize expected cost. Our method can learn policies for complex tasks such as bipedal push recovery and walking on uneven terrain, while outperforming prior methods.

1. Introduction

Direct policy search offers the promise of automatically learning controllers for complex, high-dimensional tasks. It has seen applications in fields ranging from robotics (Peters & Schaal, 2008; Theodorou et al., 2010; Deisenroth et al., 2013; Kober et al., 2013) and autonomous flight (Ross et al., 2013) to energy generation (Kolter et al., 2012). However, existing policy search methods usually require the policy class to be chosen carefully, so that a good policy can be found without falling into poor local optima. Research into new, specialized policy classes is an active area that has provided substantial improvements on real-world systems (Ijspeert et al., 2003; Paraschos et al., 2013). This specialization is necessary because most model-free policy search methods can only feasibly be applied to policies with a few hundred parameters (Deisenroth et al., 2013). Such specialized policy classes are limited in the types of behaviors they can represent, and engineering new policy classes requires considerable effort.

In recent work, we introduced a new class of policy search algorithms that can learn much more complex policies by using model-based trajectory optimization to guide the policy search (Levine & Koltun, 2013a;b). By optimizing trajectories in tandem with the policy, guided policy search methods combine the flexibility of trajectory optimization with the generality of policy search. These methods can scale to highly complex policy classes and can be used to train general-purpose neural network controllers that do not require task-specific engineering. Furthermore, the training trajectories can be initialized with examples for learning from demonstration.

A key challenge in guided policy search is ensuring that the trajectories are useful for learning the policy, since not all trajectories can be realized by policies from a particular policy class. For example, a policy provided with partial observations cannot make decisions based on unobserved state variables. In this paper, we present a constrained guided policy search algorithm that gradually brings the trajectories into agreement with the policy, ensuring that the trajectories and policy match at convergence. This is accomplished by gradually enforcing a constraint between the trajectories and the policy using dual gradient descent, resulting in an algorithm that iterates between optimizing the trajectories to minimize cost and agree with the policy, optimizing the policy to agree with the trajectories, and updating the dual variables to improve constraint satisfaction.

By enforcing agreement between the policy and the trajectories, our algorithm can discover policies for highly complex behaviors. We evaluate our approach on a set of challenging locomotion tasks, including a push recovery task that requires the policy to combine multiple recovery strategies learned in parallel from multiple trajectories. Our approach successfully learned a policy that could not only perform multiple different recoveries, but could also correctly choose the best strategy under new conditions.


2. Preliminaries and Overview

Policy search is an optimization over the parameters θ of a policy π_θ(u_t|x_t), which is a distribution over actions u_t conditioned on states x_t, with respect to the expected value of a cost function ℓ(x_t,u_t), denoted E_{π_θ}[∑_{t=1}^T ℓ(x_t,u_t)]. The expectation is taken under the policy and the system dynamics p(x_{t+1}|x_t,u_t), and together they induce a distribution over trajectories. We will therefore abbreviate the expectation as E_{π_θ}[ℓ(τ)], where τ = (x_{1..T}, u_{1..T}) denotes a trajectory. In continuous spaces, this expectation cannot be computed exactly, since the number of states is infinite. Many policy search methods approximate this quantity, typically by sampling (Peters & Schaal, 2008).

Sampling-based policy search methods do not need or use the dynamics distribution p(x_{t+1}|x_t,u_t). However, in practice this advantage often also becomes a weakness: without the use of a system model, the policy search is forced to rely on "trial and error" exploration strategies. While this works well for simple problems, for example when either the state space or the dimensionality of θ is small, such model-free methods often do not scale to policies with more than a few hundred parameters (Deisenroth et al., 2013). Scaling policy search up to powerful, expressive, general-purpose policy classes, such as large neural networks, is extremely challenging with such methods.

When a dynamics model is available, trajectories can be optimized directly with respect to their actions, without a parametric policy.¹ Trajectory optimization is easier than general policy search, because in policy search the policy parameters couple the actions at different time steps. Our constrained guided policy search algorithm employs trajectory optimization to guide the policy search process, avoiding the need for "trial and error" random exploration. The algorithm alternates between optimizing a set of trajectories to minimize cost and match the current policy, and optimizing the policy to follow the actions in each trajectory. However, simply training a policy on individual trajectories usually fails to produce effective policies, since a small error at each time step can quickly compound and place the policy in costly, unexplored parts of the space (Ross et al., 2011).

¹ Action sequences can be viewed as open-loop policies, and linear feedback can be added to turn them into nonstationary closed-loop policies.

To avoid compounding errors, the policy must be trained on data sampled from a distribution over states. The ideal distribution is the one induced by the optimal policy, but it is unknown. The initial policy often has a broad state distribution that visits very costly states, where it is inefficient and unnecessary to determine the optimal actions. Instead, we train the policy on distributions over good trajectories, denoted q(τ), which are produced using trajectory optimization. By alternating policy and trajectory optimization, q(τ) and π_θ(τ) are gradually brought into agreement, so that the final policy is trained on its own state distribution.

Since q(τ) may not match π_θ(τ) before convergence, we make q(τ) as broad as possible, so that the policy learns stable feedbacks from a wide range of states, reducing the chance that compounding errors will place it into unexplored regions. To that end, we use the maximum entropy objective E_q[ℓ(τ)] − H(q(τ)), which has previously been proposed for control and reinforcement learning (Todorov, 2006; Ziebart, 2010; Kappen et al., 2012).

This objective minimizes cost and maximizes entropy, producing broad distributions over good trajectories. It corresponds to the KL-divergence D_KL(q(τ)‖ρ(τ)), where ρ(τ) ∝ exp(−ℓ(τ)), making q(τ) an I-projection of ρ(τ). In the absence of policy constraints, a Gaussian q(τ) can be approximately optimized by a variant of the iLQG algorithm (Todorov & Li, 2005), as described in previous work (Levine & Koltun, 2013b). In the next section, we derive a similar algorithm that also gradually enforces a constraint on the action conditionals q(u_t|x_t), to force them to match the policy π_θ(u_t|x_t) at convergence. As with iLQG, our trajectory optimization algorithm uses a local linearization of the dynamics and a quadratic expansion of the cost, which corresponds to using a Laplace approximation of ρ(τ). We solve the constrained problem by optimizing its Lagrangian with respect to q(τ) and θ, and iteratively updating the Lagrange multipliers by means of dual gradient descent (Boyd & Vandenberghe, 2004).
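To make the correspondence explicit, the KL-divergence can be expanded using ρ(τ) ∝ exp(−ℓ(τ)) with normalizing constant Z (a standard identity, included here as a one-step check):

$$D_{KL}(q(\tau)\,\|\,\rho(\tau)) = E_q\big[\log q(\tau) + \ell(\tau) + \log Z\big] = E_q[\ell(\tau)] - \mathcal{H}(q(\tau)) + \log Z,$$

so minimizing the KL-divergence over q is equivalent to minimizing the maximum entropy objective E_q[ℓ(τ)] − H(q(τ)), since log Z does not depend on q.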

3. Policy Search via Trajectory Optimization

We begin by reformulating the policy optimization task as the following constrained problem:

$$\begin{aligned}
\min_{\theta,\,q(\tau)}\ \ & D_{KL}(q(\tau)\,\|\,\rho(\tau)) \qquad (1)\\
\text{s.t.}\ \ & q(x_1) = p(x_1),\\
& q(x_{t+1}|x_t,u_t) = p(x_{t+1}|x_t,u_t),\\
& D_{KL}(q(x_t)\pi_\theta(u_t|x_t)\,\|\,q(x_t,u_t)) = 0.
\end{aligned}$$

The first two constraints ensure that the distribution q(τ) is consistent with the domain's initial state distribution and dynamics, and are enforced implicitly by our trajectory optimization algorithm. The last constraint ensures that the conditional action distributions q(u_t|x_t) match the policy distribution π_θ(u_t|x_t). Since the action distributions fully determine the state distribution, q(τ) and π_θ(τ) become identical when the constraints are satisfied, making the constrained optimization equivalent to optimizing the policy with respect to D_KL(π_θ(τ)‖ρ(τ)). This objective differs from the expected cost but, as discussed in the previous section, it provides for more reasonable handling of trajectory distributions prior to convergence.


In practice, a solution with very good expected cost can be obtained by increasing the magnitude of the cost over the course of the optimization. As this magnitude goes to infinity, the entropy term becomes irrelevant, though a good deterministic policy can usually be obtained by taking the mean of the stochastic policy optimized under even a moderate cost magnitude.

Our method approximately optimizes Equation 1 with dual gradient descent (Boyd & Vandenberghe, 2004) and local linearization, leading to an iterative algorithm that alternates between optimizing one or more Gaussian distributions q_i(τ) with dynamic programming, and optimizing the policy π_θ(u_t|x_t) to match q(u_t|x_t). A separate Gaussian q_i(τ) is used for each initial condition (for example when learning to control a bipedal walker that must respond to different lateral pushes), but because the trajectories are only coupled by the policy parameters θ we will drop the subscript i in our derivation and consider just a single q(τ).

We first write the Lagrangian of Equation 1, omitting the implicitly enforced initial state and dynamics constraints:

$$\mathcal{L}(\theta,q,\lambda) = D_{KL}(q(\tau)\,\|\,\rho(\tau)) + \sum_{t=1}^{T} \lambda_t D_{KL}(q(x_t)\pi_\theta(u_t|x_t)\,\|\,q(x_t,u_t)).$$

Dual gradient descent alternates between optimizing L with respect to q(τ) and θ, and updating the dual variables λ_t with subgradient descent, using a step size η:

$$\lambda_t \leftarrow \lambda_t + \eta\, D_{KL}(q(x_t)\pi_\theta(u_t|x_t)\,\|\,q(x_t,u_t)). \qquad (2)$$

The KL-divergence is estimated from samples, and the inner optimization over q(τ) and θ is performed in alternating fashion, first over each q(τ) and then over θ. Although neither the objective nor the constraints are in general convex, we found this approach to yield a good local optimum. The full method is summarized in Algorithm 1. We initialize each trajectory on line 1, either with unconstrained trajectory optimization or from example demonstrations. We then optimize the policy for K iterations of dual gradient descent. In each iteration, we optimize each trajectory distribution q_i(τ) on line 3, using a few iterations of the algorithm described in Section 3.1. This step can be parallelized over all trajectories. We then optimize the policy to match all of the distributions q_i(τ) on line 4, using a simple supervised learning procedure described in Section 3.2. Finally, we update the dual variables according to Equation 2.

Algorithm 1 Constrained guided policy search
1: Initialize the trajectories {q_1(τ), ..., q_M(τ)}
2: for iteration k = 1 to K do
3:   Optimize each q_i(τ) with respect to L(θ, q_i(τ), λ_i)
4:   Optimize θ with respect to ∑_{i=1}^M L(θ, q_i(τ), λ_i)
5:   Update dual variables λ using Equation 2
6: end for
7: return optimized policy parameters θ

3.1. Trajectory optimization

The trajectory optimization phase optimizes each q_i(τ) with respect to the Lagrangian L(θ, q_i(τ), λ_i). Since the trajectories can be optimized independently, we again drop the subscript i in this section. Similarly to iLQG, the optimization uses locally linearized dynamics, though the policy KL-divergence constraints necessitate a novel algorithm. One iteration of this trajectory optimization algorithm is summarized in Algorithm 2.
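For concreteness, the sketch below mirrors the structure of Algorithm 1 in Python. The callables optimize_trajectory, fit_policy, and constraint_kl are hypothetical placeholders for the trajectory optimization of Section 3.1, the supervised policy fit of Section 3.2, and a sample-based estimate of the constraint KL-divergence; this is an illustrative skeleton under those assumptions, not the authors' implementation.

```python
import numpy as np

def constrained_gps(init_trajs, policy, lam_init, eta, K,
                    optimize_trajectory, fit_policy, constraint_kl):
    """Outer dual gradient descent loop (sketch of Algorithm 1).

    init_trajs : list of initial trajectory distributions q_i(tau)
    policy     : policy object with trainable parameters theta
    lam_init   : array of initial dual variables, one lambda_t per time step
    eta        : dual step size
    K          : number of outer iterations
    """
    trajs = list(init_trajs)
    lams = [np.array(lam_init, dtype=float) for _ in trajs]   # one multiplier set per trajectory

    for _ in range(K):
        # 1) Optimize each trajectory distribution against the Lagrangian
        #    (can be parallelized across trajectories).
        trajs = [optimize_trajectory(q, policy, lam) for q, lam in zip(trajs, lams)]

        # 2) Optimize the policy to match all trajectory action conditionals
        #    (weighted supervised learning, Section 3.2).
        policy = fit_policy(policy, trajs, lams)

        # 3) Dual update: lambda_t <- lambda_t + eta * KL(q(x_t) pi(u_t|x_t) || q(x_t, u_t))
        for i, q in enumerate(trajs):
            lams[i] = lams[i] + eta * constraint_kl(q, policy)

    return policy
```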

Each q(τ) has a mean τ̄ = (x̄_{1..T}, ū_{1..T}), a conditional action distribution q(u_t|x_t) = N(ū_t + K_t x_t, A_t) at each time step, and dynamics q(x_{t+1}|x_t,u_t) = N(f_{xt} x_t + f_{ut} u_t, F_t), which corresponds to locally linearized Gaussian dynamics with mean f_t(x_t,u_t) and covariance F_t (subscripts in f_{xt} and f_{ut} denote derivatives). We will assume without loss of generality that all x̄_t and ū_t are initially zero, so that x_t and u_t denote perturbations from a nominal trajectory, which is updated at every iteration. Given this definition of q(τ), we can rewrite the objective as

$$\mathcal{L}(q) = \sum_{t=1}^{T} E_{q(x_t,u_t)}[\ell(x_t,u_t)] - \frac{1}{2}\log|A_t| + \lambda_t E_{q(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\,\|\,q(u_t|x_t))\big].$$

We can evaluate both expectations with the Laplace approximation, which models the policy π_θ(u_t|x_t) as a (locally) linear Gaussian with mean μ^π_t(x_t) and covariance Σ^π_t, and the cost with its first and second derivatives ℓ_{xut} and ℓ_{xu,xut} (subscripts again denote derivatives, in this case twice with respect to both the state and action):

$$\begin{aligned}
\mathcal{L}(q) \approx \sum_{t=1}^{T}\ & \frac{1}{2}\begin{bmatrix}x_t\\u_t\end{bmatrix}^T \ell_{xu,xut}\begin{bmatrix}x_t\\u_t\end{bmatrix} + \begin{bmatrix}x_t\\u_t\end{bmatrix}^T \ell_{xut} + \frac{1}{2}\mathrm{tr}\left(\Sigma_t \ell_{xu,xut}\right) - \frac{1}{2}\log|A_t| + \frac{\lambda_t}{2}\log|A_t| \\
& + \frac{\lambda_t}{2}\left(u_t - \mu^\pi_t(x_t)\right)^T A_t^{-1}\left(u_t - \mu^\pi_t(x_t)\right) + \frac{\lambda_t}{2}\mathrm{tr}\left(A_t^{-1}\Sigma^\pi_t\right) \\
& + \frac{\lambda_t}{2}\mathrm{tr}\left(S_t \left(K_t - \mu^\pi_{xt}(x_t)\right)^T A_t^{-1}\left(K_t - \mu^\pi_{xt}(x_t)\right)\right),
\end{aligned}$$

where constants are omitted, μ^π_{xt}(x_t) is the gradient of the policy mean at x_t, and Σ_t denotes the joint marginal covariance over states and actions in q(x_t,u_t). We use S_t to refer to the covariance of q(x_t) and A_t to refer to the conditional covariance of q(u_t|x_t).

Algorithm 2 Trajectory optimization iteration
1: Compute f_{xut}, μ^π_{xt}, ℓ_{xut}, ℓ_{xu,xut} around τ̄
2: for t = T to 1 do
3:   Compute K_t and k_t using Equations 3 and 4
4:   Compute L_{xt} and L_{x,xt} using Equation 5
5: end for
6: Initialize α ← 1
7: repeat
8:   Obtain new trajectory τ̄′ using u_t = αk_t + K_t x_t
9:   Decrease step size α
10: until L(τ̄′, K, A) < L(τ̄, K, A)
11: Compute f_{xut}, μ^π_{xt}, ℓ_{xut}, ℓ_{xu,xut} around τ̄′
12: repeat
13:   Compute S_t using current A_t and K_t
14:   for t = T to 1 do
15:     repeat
16:       Compute K_t using Equation 7
17:       Compute A_t by solving CARE in Equation 8
18:     until A_t and K_t converge (about 5 iterations)
19:     Compute L_{St} using Equation 6
20:   end for
21: until all S_t and A_t converge (about 2 iterations)
22: return new mean τ̄′ and covariance terms A_t, K_t

Each iteration of trajectory optimization first forms this approximate Lagrangian by computing the derivatives of the dynamics and cost function on line 1. A dynamic programming algorithm then computes the gradients and Hessians with respect to the mean state and action at each time step (keeping A_t fixed), as summarized on lines 2-5, allowing us to take a Newton-like step by multiplying the gradient by the inverse Hessian, analogously to iLQG (Todorov & Li, 2005). The action then becomes a linear function of the corresponding state, resulting in linear feedback terms K_t. After taking this Newton-like step, we perform a line search to ensure improvement on lines 7-10, and then update A_t at each time step on lines 12-21. To derive the gradients and Hessians, it will be convenient to define the following quantities, which incorporate information about future costs analogously to the Q-function:

$$Q_{xut} = \ell_{xut} + f_{xu}^T L_{xt+1}, \qquad Q_{xu,xut} = \ell_{xu,xut} + f_{xu}^T L_{x,xt+1} f_{xu},$$

where the double subscripts xu again denote derivatives with respect to (x_t, u_t)^T, and L_{xt+1} and L_{x,xt+1} are the gradient and Hessian of the objective with respect to x_{t+1}. As with iLQG, we assume locally linear dynamics and ignore the higher order dynamics derivatives. Proceeding recursively backwards through time, the first and second derivatives with respect to u_t are then given by

$$\begin{aligned}
L_{ut} &= Q_{u,ut} u_t + Q_{u,xt} x_t + Q_{ut} + \lambda_t A_t^{-1}\left(u_t - \mu^\pi_t(x_t)\right)\\
L_{u,ut} &= Q_{u,ut} + \lambda_t A_t^{-1},
\end{aligned}$$

where the application of the chain rule to include the effect of u_t on subsequent time steps is subsumed inside Q_{ut}, Q_{u,ut}, and Q_{u,xt}. Since we assume that x_t and u_t are both zero, we can solve for the optimal change to the action k_t:

$$k_t = -\left(Q_{u,ut} + \lambda_t A_t^{-1}\right)^{-1}\left(Q_{ut} - \lambda_t A_t^{-1}\mu^\pi_t(x_t)\right). \qquad (3)$$

The feedback K_t is the derivative of k_t with respect to x_t:

$$K_t = -\left(Q_{u,ut} + \lambda_t A_t^{-1}\right)^{-1}\left(Q_{u,xt} - \lambda_t A_t^{-1}\mu^\pi_{xt}(x_t)\right). \qquad (4)$$

To complete the dynamic programming step, we can now differentiate the objective with respect to x_t, treating u_t = k_t + K_t x_t as a function of x_t:

$$\begin{aligned}
L_{xt} &= Q_{x,xt} x_t + Q_{x,ut}(k_t + K_t x_t) + K_t^T Q_{u,xt} x_t + K_t^T Q_{u,ut}(K_t x_t + k_t) + Q_{xt} + K_t^T Q_{ut}\\
&\quad + \lambda_t\left(K_t - \mu^\pi_{xt}(x_t)\right)^T A_t^{-1}\left(K_t x_t + k_t - \mu^\pi_t(x_t)\right)\\
&= Q_{x,ut} k_t + K_t^T Q_{u,ut} k_t + Q_{xt} + K_t^T Q_{ut} + \lambda_t\left(K_t - \mu^\pi_{xt}(x_t)\right)^T A_t^{-1}\left(k_t - \mu^\pi_t(x_t)\right)\\
L_{x,xt} &= Q_{x,xt} + Q_{x,ut} K_t + K_t^T Q_{u,xt} + K_t^T Q_{u,ut} K_t + \lambda_t\left(K_t - \mu^\pi_{xt}(x_t)\right)^T A_t^{-1}\left(K_t - \mu^\pi_{xt}(x_t)\right), \qquad (5)
\end{aligned}$$

where the simplification again happens because x_t is zero.

Once we compute k_t and K_t at each time step, we perform a simulator rollout using the deterministic policy u_t = k_t + K_t x_t to obtain a new mean nominal trajectory. Since the dynamics may deviate from the previous linearization far from the previous trajectory, we perform a line search on k_t to ensure that the objective improves as a function of the new mean, as summarized on lines 7-10.
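A compact sketch of the first backward pass (Equations 3-5) is shown below, assuming the cost and dynamics derivatives, the linearized policy mean and gradient, the conditional covariances A_t, and the multipliers λ_t are supplied as NumPy arrays. The naming and array layout are hypothetical; this illustrates the recursion rather than reproducing the authors' code.

```python
import numpy as np

def backward_pass(l_x, l_u, l_xx, l_uu, l_ux, f_x, f_u, mu, mu_x, A, lam):
    """First dynamic programming pass: compute k_t, K_t (Eqs. 3-4) and propagate
    L_x, L_xx (Eq. 5), with x_t = u_t = 0 at the nominal trajectory.

    Shapes (T = horizon, n = state dim, m = action dim):
      l_x (T,n), l_u (T,m), l_xx (T,n,n), l_uu (T,m,m), l_ux (T,m,n),
      f_x (T,n,n), f_u (T,n,m), mu (T,m), mu_x (T,m,n), A (T,m,m), lam (T,)
    """
    T, n = l_x.shape
    m = l_u.shape[1]
    k = np.zeros((T, m))
    K = np.zeros((T, m, n))
    L_x, L_xx = np.zeros(n), np.zeros((n, n))     # derivatives w.r.t. x_{t+1}

    for t in reversed(range(T)):
        # Q-function terms (locally linear dynamics, higher-order terms ignored).
        Q_x = l_x[t] + f_x[t].T @ L_x
        Q_u = l_u[t] + f_u[t].T @ L_x
        Q_xx = l_xx[t] + f_x[t].T @ L_xx @ f_x[t]
        Q_uu = l_uu[t] + f_u[t].T @ L_xx @ f_u[t]
        Q_ux = l_ux[t] + f_u[t].T @ L_xx @ f_x[t]

        Ainv = np.linalg.inv(A[t])
        H = Q_uu + lam[t] * Ainv                                      # regularized action Hessian
        k[t] = -np.linalg.solve(H, Q_u - lam[t] * Ainv @ mu[t])       # Eq. 3
        K[t] = -np.linalg.solve(H, Q_ux - lam[t] * Ainv @ mu_x[t])    # Eq. 4

        # Eq. 5, evaluated at x_t = 0.
        dK = K[t] - mu_x[t]
        L_x = (Q_ux.T @ k[t] + K[t].T @ Q_uu @ k[t] + Q_x + K[t].T @ Q_u
               + lam[t] * dK.T @ Ainv @ (k[t] - mu[t]))
        L_xx = (Q_xx + Q_ux.T @ K[t] + K[t].T @ Q_ux + K[t].T @ Q_uu @ K[t]
                + lam[t] * dK.T @ Ainv @ dK)

    return k, K
```

In the full algorithm, this pass is followed by the line search on lines 7-10 of Algorithm 2 and by the covariance pass described next.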

Once the new mean is found, both the policy π_θ(u_t|x_t) and the dynamics are relinearized around the new nominal trajectory on line 11, and we perform a second dynamic programming pass to update the covariance A_t and feedback terms K_t. As before, we introduce a variable that incorporates gradient information from future time steps:

$$Q_{xu,xut} = \ell_{xu,xut} + 2 f_{xut}^T L_{St+1} f_{xut}.$$

This equation is obtained by using the chain rule to include the effect of the covariance Σ_t on future time steps using the covariance dynamics S_{t+1} = f_{xut} Σ_t f_{xut}^T + F_t, so that

$$\frac{1}{2}\mathrm{tr}\left(\Sigma_t \ell_{xu,xut}\right) + \frac{\partial S_{t+1}}{\partial \Sigma_t}\cdot L_{St+1} = \frac{1}{2}\mathrm{tr}\left(\Sigma_t Q_{xu,xut}\right).$$

This allows us to simply substitute Q_{xu,xut} for ℓ_{xu,xut} in the objective to include all the effect of the covariance on future time steps. To then derive the gradient of the objective, first with respect to A_t and K_t for optimization, and then with respect to S_t to complete the dynamic programming step, we first note that

$$\Sigma_t = \begin{bmatrix} S_t & S_t K_t^T \\ K_t S_t & K_t S_t K_t^T + A_t \end{bmatrix}.$$


This allows us to expand the term tr(Σ_t Q_{xu,xut}) to get

$$\mathrm{tr}\left(\Sigma_t Q_{xu,xut}\right) = \mathrm{tr}\left(S_t Q_{x,xt}\right) + \mathrm{tr}\left(S_t K_t^T Q_{u,xt}\right) + \mathrm{tr}\left(S_t Q_{x,ut} K_t\right) + \mathrm{tr}\left(A_t Q_{u,ut}\right) + \mathrm{tr}\left(K_t S_t K_t^T Q_{u,ut}\right).$$

Using this identity, we can obtain the derivatives of the objective with respect to K_t, A_t, and S_t:

$$\begin{aligned}
L_{At} &= \frac{1}{2}Q_{u,ut} + \frac{\lambda_t - 1}{2}A_t^{-1} - \frac{\lambda_t}{2}A_t^{-1} M A_t^{-1}\\
L_{Kt} &= Q_{u,ut} K_t S_t + Q_{u,xt} S_t + \lambda_t A_t^{-1} K_t S_t - \lambda_t A_t^{-1}\mu^\pi_{xt}(x_t) S_t\\
L_{St} &= \frac{1}{2}\Big[Q_{x,xt} + K_t^T Q_{u,xt} + Q_{x,ut} K_t + K_t^T Q_{u,ut} K_t + \lambda_t\left(K_t - \mu^\pi_{xt}(x_t)\right)^T A_t^{-1}\left(K_t - \mu^\pi_{xt}(x_t)\right)\Big], \qquad (6)
\end{aligned}$$

where M = Σ^π_t + (u_t − μ^π_t(x_t))(u_t − μ^π_t(x_t))^T + (K_t − μ^π_{xt}(x_t)) S_t (K_t − μ^π_{xt}(x_t))^T. The equation for L_{St} is simply half of L_{x,xt}, which indicates that Q_{xu,xut} is the same as during the first backward pass. Solving for K_t, we also obtain the same equation as before:

$$K_t = -\left(Q_{u,ut} + \lambda_t A_t^{-1}\right)^{-1}\left(Q_{u,xt} - \lambda_t A_t^{-1}\mu^\pi_{xt}(x_t)\right). \qquad (7)$$

To solve for A_t, we set the derivative to zero and multiply both sides on both the left and right by √2 A_t to get

$$A_t Q_{u,ut} A_t + (\lambda_t - 1) A_t - \lambda_t M = 0. \qquad (8)$$

The above equation is a continuous-time algebraic Riccati equation (CARE),² and can be solved in comparable time to an eigenvalue decomposition (Arnold & Laub, 1984). Our implementation uses the MATLAB CARE solver.

Since K_t depends on A_t, which itself depends on both K_t and S_t, we use the old values of each quantity, and repeat the solver for several iterations. On lines 15-18, we repeatedly solve for K_t and A_t at each time step, which usually converges in a few iterations. On lines 12-21, we also repeat the entire backward pass several times to update S_t based on the new A_t, which converges even faster. In practice, we found that two backward passes were sufficient, and a simple test on the maximum change in the elements of K_t and A_t can be used to determine convergence.
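The nested fixed-point structure of lines 12-21 of Algorithm 2 can be written schematically as below. The callables compute_state_covariances, compute_feedback (Equation 7), and solve_care_eq8 (Equation 8) are hypothetical placeholders, and the convergence tests and iteration counts follow the description above; this is a structural sketch, not the authors' implementation.

```python
import numpy as np

def covariance_pass(A, K, compute_state_covariances, compute_feedback, solve_care_eq8,
                    outer_iters=2, inner_iters=5, tol=1e-4):
    """Second dynamic programming pass: update A_t and K_t given the current
    trajectory, repeating until the covariances stabilize (Algorithm 2, lines 12-21)."""
    T = len(A)
    for _ in range(outer_iters):                       # about two passes suffice in practice
        S = compute_state_covariances(A, K)            # forward pass: S_t from current A_t, K_t
        max_change = 0.0
        for t in reversed(range(T)):
            for _ in range(inner_iters):               # alternate K_t and A_t until they converge
                K_new = compute_feedback(t, A[t], S[t])      # Equation 7
                A_new = solve_care_eq8(t, K_new, S[t])       # Equation 8 (CARE)
                change = max(np.max(np.abs(K_new - K[t])), np.max(np.abs(A_new - A[t])))
                K[t], A[t] = K_new, A_new
                if change < tol:
                    break
            max_change = max(max_change, change)
        if max_change < tol:
            break
    return A, K
```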

This derivation allows us to optimize q(τ) under a Laplace approximation. Although the Laplace approximation provides a reasonable objective, the linearized policy may not reflect the real structure of π_θ(u_t|x_t) in the entire region where q(x_t) is high. Since the policy is trained by sampling states from q(x_t), it optimizes a different objective. This can lead to divergence when the policy is highly nonlinear.

² Although such equations often come up in the context of optimal control, our use of algebraic Riccati equations is unrelated to the manner in which they are usually employed.

To solve this problem, we can estimate the policy terms with M random samples x_{ti} drawn from q(x_t), rather than by linearizing around the mean. This corresponds to Monte Carlo evaluation of the expectation of the KL-divergence under q(x_t), as opposed to the more crude Laplace approximation. The resulting optimization algorithm has a similar structure, with μ^π_t(x_t) and μ^π_{xt}(x_t) in the derivatives of the mean replaced by their averages over the samples. The gradients of the covariance terms become more complex, though simply substituting the sample averages of μ^π_t and μ^π_{xt} into the above algorithm works well in practice, and is somewhat faster. A full derivation of the true gradients is provided in the supplementary appendix for completeness.
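The sample averages used in place of the linearized policy terms can be computed as in the short sketch below (see also Appendix B). The policy mean and its Jacobian are passed in as callables, and the sampling follows the appendix's convention x_ti = x̄_t + L_t^T s_ti; the names and signatures are illustrative assumptions.

```python
import numpy as np

def sampled_policy_terms(x_bar, S_t, policy_mean, policy_mean_jac, M=20, seed=0):
    """Monte Carlo estimates of the policy mean and mean gradient under q(x_t).

    x_bar           : nominal state, shape (n,)
    S_t             : state covariance of q(x_t), shape (n, n)
    policy_mean     : callable x -> policy mean, shape (m,)
    policy_mean_jac : callable x -> policy mean Jacobian, shape (m, n)
    """
    rng = np.random.default_rng(seed)                 # fixing samples across iterations reduces variance
    L = np.linalg.cholesky(S_t)                       # S_t = L L^T (lower-triangular convention)
    s = rng.standard_normal((M, len(x_bar)))          # zero-mean spherical Gaussian samples
    x_samples = x_bar + s @ L.T                       # each row is x_bar + L @ s_i

    mu_avg = np.mean([policy_mean(x) for x in x_samples], axis=0)
    mu_x_avg = np.mean([policy_mean_jac(x) for x in x_samples], axis=0)
    return mu_avg, mu_x_avg
```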

3.2. Policy Optimization

Once q(τ) is fixed, the policy optimization becomes a weighted supervised learning problem. Training points x_{ti} are sampled from q(x_t) at each time step, and the policy is trained to be an I-projection onto the corresponding conditional Gaussian. For example, if the policy is conditionally Gaussian, with the mean μ^π(x_t) and covariance Σ^π(x_t) being any function of x_t, the policy objective is given by

$$\begin{aligned}
\mathcal{L}(\theta) &= \sum_{t=1}^{T} \lambda_t \sum_{i=1}^{N} D_{KL}\left(\pi_\theta(u_t|x_{ti})\,\|\,q(u_t|x_{ti})\right)\\
&= \sum_{t=1}^{T} \lambda_t \sum_{i=1}^{N} \frac{1}{2}\Big\{\mathrm{tr}\left(\Sigma^\pi_t(x_{ti}) A_t^{-1}\right) - \log\left|\Sigma^\pi(x_{ti})\right| + \left(K_t x_{ti} + k_t - \mu^\pi(x_{ti})\right)^T A_t^{-1}\left(K_t x_{ti} + k_t - \mu^\pi(x_{ti})\right)\Big\}.
\end{aligned}$$

This is a least-squares optimization on the policy mean, with targets K_t x_{ti} + k_t and weight matrix A_t^{-1}, and can be performed by standard algorithms such as stochastic gradient descent (SGD) or, as in our implementation, LBFGS.
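The mean-fitting part of this objective reduces to a weighted squared error that can be handed to any standard optimizer. The sketch below evaluates that loss for a generic differentiable policy mean; the function and argument names are hypothetical, and the covariance terms of the objective are omitted for brevity.

```python
import numpy as np

def policy_mean_loss(policy_mean, x_samples, K, k, A_inv, lam):
    """Weighted least-squares objective on the policy mean.

    policy_mean : callable x -> policy mean, shape (m,)
    x_samples   : training states, shape (T, N, n)
    K, k        : trajectory feedback terms, shapes (T, m, n) and (T, m)
    A_inv       : inverse conditional covariances A_t^{-1}, shape (T, m, m)
    lam         : dual variables lambda_t, shape (T,)
    """
    loss = 0.0
    T, N, _ = x_samples.shape
    for t in range(T):
        for i in range(N):
            x = x_samples[t, i]
            target = K[t] @ x + k[t]                  # target action K_t x_ti + k_t
            err = target - policy_mean(x)
            loss += 0.5 * lam[t] * err @ A_inv[t] @ err
    return loss
```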

As the constraints are satisfied, q(x_t) approaches π_θ(x_t), and q(u_t|x_t) approaches the exponential of the Q-function of π_θ(u_t|x_t) under the maximum entropy objective. Minimizing D_KL(π_θ(u_t|x_t)‖q(u_t|x_t)) therefore resembles policy iteration with an "optimistic" approximate Q-function. This is an advantage over the opposite KL-divergence (where q(τ) is an I-projection of π_θ(τ)) suggested in our prior work (2013b), which causes the policy to be risk-seeking by optimizing the expected exponential reward (negative cost). In the next section, we show that this new formulation outperforms the previous risk-seeking variant, and we discuss the distinction further in Section 5.

4. Experimental Evaluation

[Figure 1: Comparison to prior work on swimming and walking tasks (average cost vs. iteration for the walker and swimmer, with 5 and 10 hidden units, comparing the initial trajectory, constrained GPS, variational GPS, ISGPS, adapted ISGPS, cost-weighted regression, and DAGGER). Only constrained and variational GPS succeeded on every task. For the walker, costs significantly above the initial example indicate falling. Policies that fail and become unstable are off the scale, and are clamped to the maximum cost. Frequent vertical oscillations indicate a policy that oscillates between stable and unstable solutions.]

We evaluated our method on planar swimming and walking tasks, as well as walking on uneven terrain and recovery from strong lateral pushes. Each task was executed on a simulated robot with torque motors at the joints, using the MuJoCo physics simulator (Todorov et al., 2012). The policies were general-purpose neural networks that mapped joint angles directly to torques at each time step. The cost function for each task consisted of a sum of three terms:

$$\ell(x_t,u_t) = w_u\|u_t\|^2 + w_v\left(v_x - v_x^\star\right)^2 + w_h\left(p_y - p_y^\star\right)^2,$$

where v_x and v_x^⋆ are the current and desired horizontal velocities, p_y and p_y^⋆ are the current and desired heights of the root link, and w_u, w_v, and w_h determine the weight on each objective term. The swimmer and walker are shown in Figure 2, along with a schematic of the policy. The specific weights and a description of each robot are presented in the supplemental appendix. The initial trajectory for the swimmer was generated with unconstrained trajectory optimization using DDP, while the initial walking trajectory used a demonstration from a hand-crafted locomotion system (Yin et al., 2007), following our prior work (Levine & Koltun, 2013a). Initial push responses were generated by tracking a walking demonstration with DDP.
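As a concrete reading of this cost, the snippet below evaluates ℓ(x_t, u_t) for a single time step. Which entries of the state hold the horizontal velocity and root height depends on the robot model, so they are passed in directly here; the function name and signature are illustrative assumptions.

```python
import numpy as np

def step_cost(u, v_x, p_y, w_u, w_v, w_h, v_star, p_star):
    """Per-step cost: torque penalty + velocity tracking + root height tracking.

    u   : action (joint torques), shape (m,)
    v_x : current horizontal velocity of the root link
    p_y : current height of the root link
    """
    return (w_u * np.dot(u, u)
            + w_v * (v_x - v_star) ** 2
            + w_h * (p_y - p_star) ** 2)
```

With the swimmer weights given in Appendix A, for example, step_cost(u, v_x, p_y, 0.0001, 1.0, 0.0, 1.0, 0.0) reproduces the stated objective (the height term vanishes since w_h = 0).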

The policies consisted of neural networks with one hidden layer, with a soft rectifier a = log(1 + exp(z)) at the first layer and linear connections to the output layer. Gaussian noise with a learned diagonal covariance was added to the output to create a stochastic policy. When evaluating the cost of a policy, the noise was removed, yielding a deterministic controller. While this class of policies is very expressive, it poses a considerable challenge for policy search methods, due to its nonlinearity and high dimensionality.
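A minimal forward pass for this policy class might look as follows, with a softplus hidden layer, a linear output layer, and additive Gaussian noise with a learned diagonal covariance. The parameter shapes and the noise handling are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def policy_forward(x, W1, b1, W2, b2, log_std, deterministic=False, rng=None):
    """One-hidden-layer policy: softplus hidden units, linear output, Gaussian noise.

    x       : state, shape (n,)
    W1, b1  : hidden layer parameters, shapes (p, n) and (p,)
    W2, b2  : output layer parameters, shapes (m, p) and (m,)
    log_std : log of the per-dimension output noise standard deviation, shape (m,)
    """
    h = np.log1p(np.exp(W1 @ x + b1))      # soft rectifier a = log(1 + exp(z))
    mean = W2 @ h + b2                     # linear connections to the torque outputs
    if deterministic:                      # noise is removed when evaluating the policy's cost
        return mean
    rng = rng if rng is not None else np.random.default_rng()
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)
```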

As discussed in Section 3, the stochasticity of the policy depends on the cost magnitude. A low cost will produce broad trajectory distributions, which are good for learning, but will also produce a more stochastic policy, which might perform poorly. To speed up learning and still achieve a good final policy, we found it useful to gradually increase the cost by a factor of 10 over the first 50 iterations.
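The exact shape of this schedule is not specified in the text; one natural reading is a ramp that reaches a factor of 10 at iteration 50 and then stays there, for example:

```python
def cost_scale(iteration, ramp_iters=50, final_factor=10.0):
    """Gradually scale the cost magnitude over the first `ramp_iters` iterations.
    A geometric ramp is one plausible choice; a linear ramp would also fit the description."""
    return final_factor ** min(iteration / ramp_iters, 1.0)
```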

4.1. Simple Locomotion

Comparisons to previous methods on the swimming and walking tasks are presented in Figure 1, using policies with either 5 or 10 hidden units. Our constrained guided policy search method (constrained GPS) succeeded on all of the tasks. The only other method to succeed on all tasks was variational GPS, which also alternates policy and trajectory optimization (Levine & Koltun, 2013b). However, as we will show in the next sections, constrained GPS generalized better to more difficult domains, where the risk-seeking objective in variational GPS caused difficulties.

[Figure 2: The swimmer and bipedal walker, next to a diagram of the neural network controller (inputs x_1, ..., x_n, hidden units h_1, ..., h_p, outputs u_1, ..., u_k). Blue curves indicate root link trajectories of the learned locomotion policy.]

Importance-sampled guided policy search (ISGPS), which adds guiding samples from a trajectory distribution into the policy search via importance sampling (Levine & Koltun, 2013a), solved both swimming tasks but failed to learn a walking gait with 5 hidden units. Guiding samples are only useful if they can be reproduced by the policy. If the policy class is too restricted, for example by partial observability or limited expressiveness, or if the trajectory takes inconsistent actions in similar states, no policy can reproduce the guiding samples and ISGPS reverts to random exploration.

Adapted ISGPS optimizes the trajectory to match the policy, but still uses importance sampling, which can cause problems due to weight degeneracy and local optima. The adapted variant also could not find a 5 hidden unit walking gait. Both variants did converge faster than constrained GPS on the easier swimming tasks, since constrained GPS requires a few iterations to find reasonable Lagrange multipliers and bring the policy and trajectory into agreement.

We also compared to cost-weighted regression, which fits the policy to previous on-policy samples weighted by the exponential of their reward (negative cost) (Peters & Schaal, 2007; Kober & Peters, 2009). This approach is representative of a broad class of reinforcement learning methods, which use model-free random exploration and optimize the policy to increase the probability of low cost samples (Peters & Schaal, 2008; Theodorou et al., 2010). Since a dynamics model was available, we provided this method with 1000 samples at each iteration, but it was still only able to learn one swimming task, since the search space for the walker is far too large and complex to handle with model-free methods. While we also tested other RL algorithms, such as REINFORCE-style policy gradient, we found that they performed poorly in all of our experiments.

Lastly, we compared to DAGGER, an imitation learning method that mimics an oracle (Ross et al., 2011), which in our case was the initial trajectory distribution. At each iteration, DAGGER adds on-policy samples to its dataset and reoptimizes the policy to take the oracle action at each sample. Like ISGPS, DAGGER relies on being able to reproduce the oracle's behavior with some policy from the policy class, which may be impossible without adapting the trajectory. Furthermore, poor policies will deviate from the trajectory, and the oracle's linear feedback actions in these faraway states are highly suboptimal, violating DAGGER's assumptions. To mitigate this, we weighted the samples by their probability under the trajectory distribution, but DAGGER was still unable to learn a walking gait.

4.2. Uneven Terrain

In this section, we investigate the ability of each method to learn policies that generalize to new situations, which is a key advantage of parametric policies. Our first experiment is based on the uneven terrain task we proposed in prior work (Levine & Koltun, 2013a), where the walker traverses random uneven terrain with slopes of up to 10°. The walker was trained on one or three terrains, and tested on four other random terrains. The policies used 50 hidden units, for a total of 1206 parameters. We omitted the foot contact and forward kinematics features that we used in prior work (Levine, 2013), and instead trained policies that map the state vector directly to joint torques. This significantly increased the difficulty of the problem, to the point that prior methods could not discover a reliable policy.

[Figure 3: Uneven terrain results (average cost vs. iteration for 1 and 3 training terrains, comparing constrained GPS, variational GPS, ISGPS, and adapted ISGPS). Solid lines show performance on the training terrains, dotted lines show generalization to unseen terrains. Constrained GPS was able to generalize from both training conditions. One test terrain with a trace of our policy is shown.]

The results are shown in Figure 3, with methods that failed on both tasks omitted for clarity. The dotted lines indicate the performance of each method on the test terrains. Constrained GPS was able to learn successful policies on both training sets, and in both cases the policies generalized well to new terrains. Prior methods struggled with this task, and though they were able to traverse the single terrain by the last iteration, none could traverse all three training terrains, and none could generalize successfully.

Like the locomotion tasks in the previous section, the main challenge in this task was to learn an effective gait. However, one advantage of general-purpose policies like neural networks is that a single policy can in principle learn multiple strategies, and deploy them as needed in response to the state. This issue is explored in the next section.

4.3. Push Recovery

To explore the ability of each method to learn policies that choose different strategies based on the state, we trained walking policies that recover from strong lateral pushes, in the range of 250 to 500 Newtons, delivered for 100 ms. Responding to pushes of such magnitude requires not just disturbance rejection, but a well-timed recovery strategy, such as a protective step or even standing up after a fall. Because the push can occur at different points in the gait, multiple recovery strategies must be learned simultaneously. A strategy that succeeds in one pose may fail in another.

We trained policies with 50 hidden units on one or four pushes, and tested them on four different pushes, with initial states sampled from an example gait cycle. The results are shown in Figure 4, with dotted lines indicating performance on the test pushes. Constrained GPS learned a policy from four training pushes that generalized well to the test set, while no prior method could find a policy that succeeded even on the entire training set. Although several methods learned a successful policy from one training push, none of these policies could generalize, underscoring the importance of learning multiple recovery strategies.

[Figure 4: Push response results (average cost vs. iteration for 1 and 4 training pushes, comparing constrained GPS, variational GPS, ISGPS, and adapted ISGPS). Solid lines show training push results, dotted lines show generalization. Constrained GPS learned a successful, generalizable policy from four training pushes.]

Figure 5 shows a few of the recoveries learned by constrained and variational GPS. Note that even the weaker 250 Newton pushes are quite violent, and sufficient to lift the 24 kg walker off the ground completely. Nonetheless, constrained GPS was able to recover successfully. The supplementary video³ shows the policy responding to each of the training and test pushes, as well as a few additional pushes delivered in quick succession. The video illustrates some of the strategies learned by the policy, such as kicking against the ground to recover from a backward fall. The ability to learn such generalizable strategies is one of the key benefits of training expressive parametric controllers.

[Figure 5: A few push responses with policies learned by constrained GPS (CGPS) and variational GPS (VGPS) on four training pushes, shown for a 250 N training push and two 500 N test pushes. The horizontal spacing between the figures is expanded for clarity.]

5. Discussion

We presented a constrained guided policy search algorithm for learning complex, high-dimensional policies, using trajectory optimization with a policy agreement constraint. We showed how the constrained problem can be solved efficiently with dual gradient descent and dynamic programming, and presented a comparison of our method to prior policy search algorithms. Our approach was able to learn a push recovery behavior for a bipedal walker that captured generalizable recovery strategies, and outperformed prior methods in direct comparisons.

Constrained GPS builds on our recent work on using trajectory optimization for policy search with importance sampling (Levine & Koltun, 2013a) and variational inference (Levine & Koltun, 2013b). Our evaluation shows that the constrained approach outperforms the prior methods. Importance sampling can be vulnerable to degeneracies in the importance weights. While variational GPS addresses this issue, it does so by optimizing a risk-seeking objective: the expected exponential reward. Such an objective rewards policies that succeed only occasionally, which can lead to unreliable results. On the other hand, variational GPS is somewhat easier to implement, and importance sampled GPS is simpler and faster when adaptation of the trajectory is not required, such as on the simpler locomotion tasks.

³ http://graphics.stanford.edu/projects/cgpspaper/index.htm

Variational GPS makes the policy an M-projection onto the trajectory distribution, forcing it to visit all regions where this distribution is high, even if it means visiting costly regions where it is not. We argue that an I-projection objective is more appropriate, as it forces the policy to avoid costly regions where the trajectory distribution is low, even at the cost of missing some high-density areas. As shown in Section 2, this corresponds to the expected cost plus an entropy term. Previous work has also argued for this sort of objective, on the basis of probabilistic robustness (Ziebart, 2010) and computational benefits (Todorov, 2006). The related area of stochastic optimal control has developed model-free reinforcement learning algorithms under a similar objective (Theodorou et al., 2010; Rawlik et al., 2012).

Concurrently with our work, Mordatch and Todorov (2014) proposed another trajectory optimization algorithm for guiding policy search. Further research into trajectory optimization techniques best suited for policy search is a promising direction for future work.

While we assume a known model of the dynamics, prior work has proposed learning the dynamics from data (Deisenroth & Rasmussen, 2011; Ross & Bagnell, 2012; Deisenroth et al., 2013), and using our method with learned models could allow for wider applications in the future.

Our method also has several limitations that could be addressed in future work. Our use of trajectory optimization requires the policy to be Markovian, so any hidden state carried across time steps must itself be incorporated into the policy search as additional state information. Our method is also local in nature. While we can use multiple trajectories to optimize a single policy, highly stochastic policies that visit large swathes of the state space would not be approximated well by our method.

Another avenue to explore in future work is the application of our method to learning rich control policies at scale. Since we can learn a single policy that unifies a variety of trajectories in different environments and with different initial conditions, a sufficiently expressive policy could in principle learn complex sensorimotor associations from a large number of trajectories, all optimized in parallel in their respective environments. This could offer superior generalization and discovery of higher-level motor skills.

ACKNOWLEDGMENTS

We thank Emanuel Todorov, Tom Erez, and Yuval Tassa for the MuJoCo simulator, and the anonymous reviewers for their constructive feedback. Sergey Levine was supported by an NVIDIA Graduate Fellowship.


References

Arnold, W. F., III and Laub, A. J. Generalized eigenproblem algorithms and software for algebraic Riccati equations. Proceedings of the IEEE, 72(12), 1984.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Deisenroth, M. and Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.

Deisenroth, M., Neumann, G., and Peters, J. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2), 2013.

Ijspeert, A., Nakanishi, J., and Schaal, S. Learning attractor landscapes for learning motor primitives. In Advances in Neural Information Processing Systems (NIPS), 2003.

Kappen, H. J., Gomez, V., and Opper, M. Optimal control as a graphical model inference problem. Machine Learning, 87(2), 2012.

Kober, J. and Peters, J. Learning motor primitives for robotics. In International Conference on Robotics and Automation (ICRA), 2009.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. International Journal of Robotic Research, 32(11), 2013.

Kolter, J. Z., Jackowski, Z., and Tedrake, R. Design, analysis and learning control of a fully actuated micro wind turbine. In American Control Conference (ACC), 2012.

Levine, S. Exploring deep and recurrent architectures for optimal control. In NIPS 2013 Workshop on Deep Learning, 2013.

Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning (ICML), 2013a.

Levine, S. and Koltun, V. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), 2013b.

Mordatch, I. and Todorov, E. Combining the benefits of function approximation and trajectory optimization. In Robotics: Science and Systems (RSS), 2014.

Paraschos, A., Daniel, C., Peters, J., and Neumann, G. Probabilistic movement primitives. In Advances in Neural Information Processing Systems (NIPS), 2013.

Peters, J. and Schaal, S. Applying the episodic natural actor-critic architecture to motor primitive learning. In European Symposium on Artificial Neural Networks (ESANN), 2007.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 2008.

Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS), 2012.

Ross, S. and Bagnell, A. Agnostic system identification for model-based reinforcement learning. In International Conference on Machine Learning (ICML), 2012.

Ross, S., Gordon, G., and Bagnell, A. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Ross, S., Melik-Barkhudarov, N., Shankar, K. S., Wendel, A., Dey, D., Bagnell, J. A., and Hebert, M. Learning monocular reactive UAV control in cluttered natural environments. In International Conference on Robotics and Automation (ICRA), 2013.

Theodorou, E., Buchli, J., and Schaal, S. Reinforcement learning of motor skills in high dimensions: a path integral approach. In International Conference on Robotics and Automation (ICRA), 2010.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), 2006.

Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference (ACC), 2005.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

Yin, K., Loken, K., and van de Panne, M. SIMBICON: simple biped locomotion control. ACM Transactions on Graphics, 26(3), 2007.

Ziebart, B. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.


A. Simulation and Cost Details

The swimmer consisted of 3 links, with 10 state dimensions corresponding to joint angles, joint angular velocities, and the position, angle, velocity, and angular velocity of the head, with two action dimensions corresponding to the torques between the joints. The simulation applied drag on each link of the swimmer to roughly simulate a fluid, allowing it to propel itself. The simulation step was set to 0.05 s, and the reward weights were w_u = 0.0001, w_v = 1, and w_h = 0, with the desired velocity v_x^⋆ = 1 m/s.

The bipedal walker consisted of seven links: a torso and three links for each leg, for a total of 18 dimensions, including joint angles, the global position and orientation of the torso, and the corresponding velocities. The action space had six dimensions, corresponding to each of the joints. MuJoCo was used to simulate soft, differentiable contacts to allow gradient-based optimization to proceed even in the presence of contact forces. The simulation step was set to 0.01 s, and the reward weights for all walker tasks were w_u = 0.0001, w_v = 1, and w_h = 10, with desired velocity and height v_x^⋆ = 2.1 m/s and p_y^⋆ = 1.1 m.
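Collecting the stated task parameters in one place as a simple configuration dictionary (the key names are hypothetical; the values are the ones given above):

```python
TASK_CONFIG = {
    "swimmer": {
        "dt": 0.05,            # simulation step (s)
        "w_u": 0.0001, "w_v": 1.0, "w_h": 0.0,
        "v_star": 1.0,         # desired horizontal velocity (m/s)
    },
    "walker": {
        "dt": 0.01,
        "w_u": 0.0001, "w_v": 1.0, "w_h": 10.0,
        "v_star": 2.1,         # desired horizontal velocity (m/s)
        "p_star": 1.1,         # desired root link height (m)
    },
}
```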

B. Sample-Based Gradients

As discussed in Section 3.1, the Laplace approximation may not accurately capture the structure of π_θ(u_t|x_t) in the entire region where q(x_t) is large, and since the policy is trained by sampling states from q(x_t), policy optimization optimizes a different objective. This can lead to nonconvergence when the policy is highly nonlinear. To address this problem, we can approximate the policy terms in the objective with M random samples x_{ti}, drawn from q(x_t), rather than by using a linearization of the policy:

$$\mathcal{L}(q) \approx \sum_{t=1}^{T} \ell(\bar{x}_t,\bar{u}_t) + \frac{1}{2}\mathrm{tr}\left(\Sigma_t \ell_{xu,xut}\right) - \frac{1}{2}\log|A_t| + \frac{\lambda_t}{2M}\sum_{i=1}^{M}\left(u_{ti} - \mu^\pi_t(x_{ti})\right)^T A_t^{-1}\left(u_{ti} - \mu^\pi_t(x_{ti})\right) + \frac{\lambda_t}{2}\mathrm{tr}\left(A_t^{-1}\Sigma^\pi_t\right) + \frac{\lambda_t}{2}\log|A_t|,$$

where the actions are given by u_{ti} = K_t x_{ti} + ū_t. Note that the samples x_{ti} depend on x̄_t, according to x_{ti} = x̄_t + L_t^T s_{ti}, where s_{ti} is a sample from a zero-mean spherical Gaussian, and L_t is the upper triangular Cholesky decomposition of S_t.⁴ As before, we differentiate with respect to u_t, substituting Q_{u,ut} and Q_{ut} as needed:

$$\begin{aligned}
L_{ut} &= Q_{ut} + \lambda_t A_t^{-1}\bar{\mu}^\pi_t\\
L_{u,ut} &= Q_{u,ut} + \lambda_t A_t^{-1},
\end{aligned}$$

where μ̄^π_t = (1/M) ∑_{i=1}^M (u_{ti} − μ^π_t(x_{ti})) is the average difference between the linear feedback and the policy.

⁴ Keeping the same samples s_{ti} across iterations reduces variance and can greatly improve convergence.

This yields the following correction and feedback terms:

$$\begin{aligned}
k_t &= -\left(Q_{u,ut} + \lambda_t A_t^{-1}\right)^{-1}\left(Q_{ut} + \lambda_t A_t^{-1}\bar{\mu}^\pi_t\right)\\
K_t &= -\left(Q_{u,ut} + \lambda_t A_t^{-1}\right)^{-1}\left(Q_{u,xt} - \lambda_t A_t^{-1}\bar{\mu}^\pi_{xt}\right),
\end{aligned}$$

where μ̄^π_{xt} = (1/M) ∑_{i=1}^M μ^π_{xt}(x_{ti}) is the average policy gradient. So far, the change to the mean is identical to simply averaging the policy values and policy gradients over all the samples. In fact, a reasonable approximation can be obtained by doing just that, and substituting the sample averages directly into the equations in Section 3.1. This is the approximation we use in our experiments, as it is slightly faster and does not appear to significantly degrade the results. However, the true gradients with respect to A_t are different. Below, we differentiate the objective with respect to A_t and K_t as before, where K_t now denotes the feedback term in the covariance pass, which is no longer identical to the feedback computed in the first dynamic programming pass. The derivatives with respect to A_t and K_t are

$$\begin{aligned}
L_{At} &= \frac{1}{2}Q_{u,ut} + \frac{\lambda_t - 1}{2}A_t^{-1} - \frac{\lambda_t}{2}A_t^{-1} M A_t^{-1}\\
L_{Kt} &= Q_{u,ut} K_t S_t + Q_{u,xt} S_t + \lambda_t A_t^{-1}\left(K_t S_t - C_t\right),
\end{aligned}$$

where M = Σ^π_t + (1/M) ∑_{i=1}^M (u_{ti} − μ^π_t(x_{ti}))(u_{ti} − μ^π_t(x_{ti}))^T, S_t = (1/M) ∑_{i=1}^M x_{ti} x_{ti}^T, and C_t = (1/M) ∑_{i=1}^M μ^π_t(x_{ti}) x_{ti}^T, and where we simplified using the assumption that x̄_t and ū_t are zero. We again solve for A_t by solving the CARE in Equation 8. To solve for K_t, we rearrange the terms to get

$$K_t S_t S_t^{-1} + \frac{1}{\lambda_t} A_t Q_{u,ut} K_t = C_t S_t^{-1} - \frac{1}{\lambda_t} A_t Q_{u,xt}.$$

This equation is linear in K_t, but requires solving a sparse linear system with dimensionality equal to the number of entries in K_t, which increases the computational cost.
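One way to carry out that solve is to vectorize K_t, since the equation has the form K + B K = R with B = (1/λ_t) A_t Q_{u,ut} and R = C_t S_t^{-1} − (1/λ_t) A_t Q_{u,xt}. The dense sketch below illustrates the idea (a sparse solver would be used in practice); the helper name is hypothetical.

```python
import numpy as np

def solve_for_K(A_t, Q_uu, Q_ux, C_t, S_t, lam_t):
    """Solve K + B K = R for the covariance-pass feedback K_t, where
    B = (1/lam_t) A_t Q_uu and R = C_t S_t^{-1} - (1/lam_t) A_t Q_ux."""
    m, n = Q_ux.shape
    B = A_t @ Q_uu / lam_t                                   # (m, m)
    R = C_t @ np.linalg.inv(S_t) - A_t @ Q_ux / lam_t        # (m, n)
    # vec(K) + vec(B K) = (I_{mn} + I_n kron B) vec(K) = vec(R), column-major vec.
    lhs = np.eye(m * n) + np.kron(np.eye(n), B)
    K_vec = np.linalg.solve(lhs, R.flatten(order="F"))
    return K_vec.reshape((m, n), order="F")
```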

Differentiating with respect to S_t, we get:

$$L_{St} = \frac{1}{2}\Big[Q_{x,xt} + K_t^T Q_{u,xt} + Q_{x,ut} K_t + K_t^T Q_{u,ut} K_t + \mathrm{choldiff}(D_t) + \mathrm{choldiff}(D_t)^T\Big],$$

where

$$D_t = \frac{\lambda_t}{M}\sum_{i=1}^{M}\left(K_t - \mu^\pi_{xt}(x_{ti})\right)^T A_t^{-1}\left(u_{ti} - \mu^\pi_t(x_{ti})\right) s_{ti}^T,$$

and choldiff(...) indicates the differentiation of the Cholesky decomposition, for example using the method described by Giles in "An extended collection of matrix derivative results for forward and reverse mode algorithmic differentiation" (2008). While this will provide us with the correct gradient, choldiff(D_t) + choldiff(D_t)^T is not guaranteed to be positive definite. In this case, we found it useful to regularize by interpolating the gradient with the one obtained from the Laplace approximation.

