NIPS paper review 2014: A Differential Equation for Modeling Nesterov’s Accelerated Gradient...

Post on 22-Aug-2015

106 views 0 download

transcript

A Differential Equation for Modeling Nesterov’sAccelerated Gradient Method: Theory and Insights

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014

speaker: kv

MCLab, CITI Academia Sinicas

May 14, 2015

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 1 / 40

Overview

1 IntroductionSmooth Unconstrained OptimizationAccelerated SchemeOrdinary Differential Equation

2 Derivation of the ODESimple Properties

3 Equivalence between the ODE and Nesterov’s schemeAnalogous Convergence RateQuadratic f and Bessel functionEquivalence between the ODE and Nesterov’s Scheme

4 A family of generalized Nesterov’s schemesContinuous OptimizationComposite Optimization

5 New Restart Scheme6 Accelerating to linear convergence by restarting

Numerical examples7 Discussion

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 2 / 40

Introduction

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 3 / 40

Smooth Unconstrained Optimization

We wish to minimize a smooth convex function

minimize f (x)

where f : Rn → R has a Lipschitz continuous gradient

‖∇f (x)−∇f (y)‖2 < L‖x − y‖2

µ-strong convexity

f (x)− µ‖x‖2/2

In this paper, FL denotes the class of convex function f with L-Lipschitzcontinuous gradients defined on Rn; Sµ denotes the class of µ-stronglyconvex function f on Rn. We set Sµ,L = FL ∩ Sµ

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 4 / 40

Introduction: Accelerated Scheme

Nesterov’s Accelerated Gradient Scheme

xk = yk−1 − s∇f (yk−1) (1)

yk = xk +k − 1

k + 2(xk − xk−1) (2)

For fixed step size s = 1/L, where L is Lipschitz constant of ∇f , thisscheme exhibits the convergence rate

f (xk)− f ∗ ≤ O(L||x0 − x∗||2

k2)

This improvement relies on the introduction to momentum xk − xk−1 andthe particularly tuned coefficient (k − 1)/(k + 2) = 1− 3/(k + 2)

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 5 / 40

Accelerated Scheme: Oscillation Problem

In general, Nesterov’s scheme is not monotone in the objective function.(due to introduction to the momentum)

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 6 / 40

Introduction: Second order ODE

Derive a second order ordinary differential equation(ODE), which is theexact limit of Nesterov’s scheme by taking small step size

X +3

tX +∇f (X ) = 0 (3)

for t > 0, with initial condition X (0) = x0, ˙X (0) = 0; here x0 is thestarting point in Nesterov’s scheme. X denotes to velocity and X isacceleration.

Small t: large 3/t leads the ODE to be an over-damped system

Large t: As t increases, system behaves like under-damped system,oscillating with amplitude decreases to zero

Time parameter in this ODE is related to step size

t ∼ k√s

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 7 / 40

An Example: Trajectories

Minimize f = 0.02x21 + 0.005x22

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 8 / 40

An Example: Zoomed Trajectories

Minimize f = 0.02x21 + 0.005x22

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 9 / 40

An Example: Errors f − f ∗

Minimize f = 0.02x21 + 0.005x22

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 10 / 40

Derivation of the ODE

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 11 / 40

Derivation of the ODE

Assmue f ∈ FL for L > 0, combine two equations of (1) and applyingrescaling give

xk+1 − xk√s

=k − 1

k + 2

xk − xk−1√s

−√s∇f (yk) (4)

Introducfelis e the ansatz xk ∼ X (k√s) for smooth curve X (t) define for

t > 0. With these approximations, we get Taylor expansions:

(xk+1 − xk)/√s = X (t) +

1

2X (t)

√s + o(

√s)

(xk − xk−1)/√s = X (t)− 1

2X (t)

√s + o(

√s)

√s∇f (yk) =

√s∇f (X (t)) + o(

√s)

where the third equality we use yk − X (t) = o(1)

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 12 / 40

Derivation of the ODE

The formula (4) can be rewritten as

X (t) +1

2X (t)

√s + o(

√s)

= (1− 3√s

t){X (t)− 1

2X (t)

√s + o(

√s)} −

√s∇f (X (t)) + o(

√s)

By comparing the coefficients of√s, we obtain

X +3

tX +∇f (X ) = 0

Theorem (Well posed ODE, Existence and Uniqueness)

For any f ∈ F∞ := UL>0FL and any x0 ∈ Rn, the ODE (3) with initialconditions X (0) = x0, X (0) = 0 has an unique global solution X .

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 13 / 40

Simple Properties

Invariance

ODE is invariant under time change and invariant under transformation

Initial asymptotic

Assume sufficient smoothness of X , that limt→0 X exists. The Mean ValueTheorem guarantees the existence ζ ∈ (0, t) that

X (t)/t = (X − X (0))/t = X (ζ)

Hence the ODE deduces to X (t) + 3X (ζ) +∇f (X (t)) = 0 Taking t → 0(for small t), we have

X (t) = −∇f (x0)t2

8+ x0 + o(t2)

Consistent with the empirical observation the Nesterov’s scheme movesslowly in the beginning.

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 14 / 40

Connections and Interpretations

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 15 / 40

Analogous Convergence Rate

Now, we exhibit approximate equivalence between the ODE andNesterov’s scheme in terms of convergence rate.

Theorem (Discrete Nesterov Scheme 3.1)

For any f ∈ FL, the sequence {xk} in 1 with step size s ≤ 1/L obeys

f (xk)− f ∗ ≤ 2||x0 − x∗||2

s(k + 1)2

First result indicates the trajectory of ODE (3) closely resembles thesequence {xk} in terms of the convergence rate

Theorem (Continuous ODE Scheme 3.2)

For any f ∈ F∞, let X (t) be the unique global solution to (3) with initialconditions X (0) = x0, X (0) = 0, for any t > 0

f (X (t))− f ∗ ≤ 2||x0 − x∗||2

t2Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 16 / 40

Proof of Theorem 3.2

Consider energy functional defined as

ε(t) := t2(f (X (t))− f ∗) + 2||X +t

2X − x∗||2

whose time derivative is

ε(t) = 2t(f (X )− f ∗) + t2〈∇f , X 〉+ 4〈X +t

2X − x∗,

3

2X +

t

2X 〉

Substituting 3X/2 + tX/2 with −t∇f (X )/2

ε(t) = 2t(f (X )− f ∗) + 4〈X − x∗,−t∇f (X )/2〉= 2t(f (X )− f ∗)− 2t〈X − x∗,∇f (X )〉 ≤ 0

Hence the monotonicity of ε and non-negativity of 2||X + t2 X − x∗||2, the

gap obeys

f (X (t))− f ∗ ≤ ε(t)

t2≤ ε(0)

t2=

2||x0 − x∗||2

t2

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 17 / 40

Quadratic f and Bessel function

For quadratic f , we have

f (x) =1

2〈x ,Ax〉+ 〈b, x〉

where A ∈ Rn×n. Simple translation can absorb the liner term 〈b, x〉. Weassume A is positive semi-definite, admitting spectral decompositionA = QTΛQ, replace x with Qx , we assume f = 1/2〈x ,Λx〉. The ODEadmits

Xi +3

tXi + λiXi = 0

where i = 1, ..., n

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 18 / 40

Quadratic f and Bessel function

Introduce Yi (u) = uXi (u/√λi ), which satisfies

u2Yi + uYi + (u2 − 1)Yi = 0

This is the Bessel’s Differential Equation with order 1. Apply asymptoticexpansion, we obtain

Xi (t) =2xx0,i

t√λi

J1(t√λi )

For t is large, the Bessel function has asymptotic form

J1(t) =

√2

πt(cos(t − 3π/4) + O(1/t))

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 19 / 40

Quadratic f and Bessel function: Example

Minimizef = 0.02x21 + 0.005x22 , whose eigenvalues are λ1,2 = 0.02, 0.005

f (X )− f ∗ = f (X ) =n∑

i=1

2x20,it2

J1(t√λi )

2

Denote two major period T1,T2. We get T1 = π/√λ1 = 22.214 and

T2 = π/√λ2 = 44.423

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 20 / 40

Equivalence between the ODE and Nesterov’s scheme

We study the stable step size for numerically solving ODE. The finitedifference approximation of (3) by the forward Euler method

X (t + ∆t)− 2X (t) + X (t + ∆t)

∆t2+

3

t

X (t)− X (t −∆t)

∆t+∇f (X (t)) = 0

which is equivalent to

X (t + ∆t) = (2− 3∆t

t)X (t)−∆t2∇f (X (t))− (1− ∆t

t)X (t −∆t)

Assuming that f is sufficiently smooth, for small perturbation. Thecharacteristic equation of this finite difference scheme is approximately(identify k = t/∆t)

det{λ2 − (2−∆t2∇2f − 3∆t

t)λ+ 1− 3∆t

t} = 0 (5)

For numerical stability, all the roots of (5) should lie on unit circle.Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 21 / 40

A family of generalized Nesterov’s schemes

Exploit the power of ODE. We would be interested in studying the ODE(3) with the number of 3 appearing the coefficient of X/t replaced by ageneral constant r as

X +r

tX +∇f (X ) = 0, X (0) = x0, X (0) = 0 (6)

Using the argument similar to theorem 2.1, this ODE is guaranteed toassume a unique global solution for any f ∈ F∞

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 22 / 40

Generalized Nesterov’s Scheme: Continuous Optimization

Theorem (4.1)

Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F∞.Then X (t) obeys

f (X (t))− f ∗ ≤ (r − 1)2||x0 − x∗||2

2t2

and ∫ ∞0

t(f (X (t))− f ∗)dt ≤ (r − 1)2||x0 − x∗||2

2(r − 3)

Theorem (4.2)

For any f ∈ Sµ,L(Rn), the unique solution X to (6) with r ≥ 9/2 obeys

f (X (t))− f ∗ ≤ Cr5/2||x0 − x∗||2

t3õ

For any t > 0 and an universal constant C > 1/2Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 23 / 40

Generalized Nesterov’s Scheme: Continuous Optimization

For example, the solution to (6) with f (x) = ||x ||2/2 is

X (t) =2

r−12 Γ((r + 1)/2)J(r−1)/2(t)

t(r−1)/2

where J(r−1)/2(.) is the first kind of Bessel function of order (r − 1)/2. Forlarge t, this Bessel function obeysJ(r−1)/2(t) =

√2/(πt)(cos(t − rπ/4) + O(1/t). Hence

f (X (t))− f ∗ ≤ ||x0 − x∗||2/tr

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 24 / 40

Generalized Nesterov’s Scheme: Composite Optimization

Skip this part

minx∈Rn

f (x) = g(x) + h(x)

where g ∈ FL for some L > 0 and h is convex on Rn with possibleextended value ∞. Define proximal subgradient

Gs(x) :=x − arg minz{||z − (x − s∇g(x))||2/(2s) + h(z)}

s

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 25 / 40

New Restart Scheme

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 26 / 40

Restart Scheme: Previous Works

Restart: Erase the memory of previous iterations and resets themomentum back to zero

Function Scheme: we restart whenever

f (xk) > f (xk−1)

Gradient Scheme: we restart whenever

∇f (yk−1)T (xk − xk−1) > 0

Refer to: Adaptive Restart for Accelerated Gradient Schemes, BrendanODonoghue Emmanuel Cands, 2012

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 27 / 40

Restart Scheme: Previous Works

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 28 / 40

Restart Scheme: Previous Works

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 29 / 40

New Restart Scheme: Speed Restart

This work provides a new restarting strategy, called speed restartingscheme. The underlying motivation is to maintain relatively high velocityX along the trajectory.

Definition 5.1

For ODE (3) with X (0) = x0, X (0) = 0, let

T = T (f , x0) = sup{t > 0 : ∀u ∈ (0, t),||X (u)||2

du> 0}

be the speed restarting time.

In words, T is the first time the velocity ||X || decreases. Indeed, f (X (t))is the decreasing function before time T , for t < T ,

df (X (t))

dt=< ∇f (X ), X >= −3

t||X ||2 − 1

2

||X ||2

dt< 0

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 30 / 40

Accelerating to linear convergence by restarting

The speed restarting ODE is thus

X (t) +3

tsrX (t) +∇f (X (t)) = 0 (7)

where tsr is set to zero whenever < X , X >= 0. We have followingobservations

X sr (t) is continuous for t ≤ 0, with X sr (0) = x0

X sr (t) satisfied (3) for 0 < t < T1 := T (x0; f )

Recursively define Ti+1 = Ti (Xsr (

∑Tj ; f ) for i ≤ 1 and

X (t) = X sr (∑

Tj + t)

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 31 / 40

Accelerating to linear convergence by restarting

Lemma 5.2

There is a universal constant C > 0 such that

f (X (T ))− f (x∗) ≤ (1− Cµ

L)(f (x0)− f ∗)

Guarantee each restarting reduces the error by a constant factor

Lemma 5.3

There is a universal constant C such that

T ≤4exp( CL

µ )

5√L

An upper bound for T. It conforms that restartings are adequate

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 32 / 40

Accelerating to linear convergence by restarting

Applying Lemma 5.2, 5.3, we have

Theorem (5.1)

There exists positive constants c1 and c2, which only depend on thecondition number L/µ, such that for any f ∈ Sµ,L, we have

f (X sr (t))− f (x∗) ≤ c1L||x0 − x∗||2

2exp−c2t

√L

The theorem guarantees linear convergence of solution to (7). This is newresult in the literature.

where c1 = exp(Cµ/L) and c2 = 5Cµ4L e(−Cµ/L)

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 33 / 40

Numerical examples: algorithm of speed restarting

Below we present a discrete analog to restarted scheme.

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 34 / 40

Quadratic

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 35 / 40

Log-sum-exp

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 36 / 40

Matrix compleltion

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 37 / 40

Lasso in l1-constrainted form with large space design

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 38 / 40

References

W. Su, S. Boyd, E. Candes (2014)

A Differential Equation for Modeling Nesterovs Accelerated Gradient Method:Theory and Insights

NIPS 2014

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 39 / 40

thanks!

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG May 14, 2015 40 / 40