  • Improving the Optimized Gradient Method for Large-Scale Convex Optimization

    Donghwan Kim and Jeffrey A. Fessler

    EECS Department, University of Michigan

    SIAM Conference on Optimization

    May 24, 2017

  • Goal: Develop new accelerated first-order methods

    that are faster than Nesterov’s fast gradient method

    in the worst-case

    for minimizing smooth convex functions

    1 / 26

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 1 Smooth convex problem and fixed-step first-order methods
      Smooth convex problem
      Fixed-step first-order methods (FSFOM)
      Nesterov’s fast gradient method (FGM)
      Upper and lower achievable bounds of first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 1. Problem and method Smooth convex problem

    Smooth convex minimization problem

    Solve a smooth convex minimization problem:

        min_{x ∈ R^d} f(x),

    where the following conditions are assumed:

    f : R^d → R is a convex function of the type C_L^{1,1}(R^d), i.e., continuously
    differentiable with Lipschitz continuous gradient:

        ||∇f(x) − ∇f(y)|| ≤ L ||x − y||,  ∀ x, y ∈ R^d,

    where L > 0 is the Lipschitz constant. (f ∈ F_L(R^d).)

    The optimal set X*(f) := argmin_{x ∈ R^d} f(x) is nonempty.

    Large-scale (i.e., large d), so consider using a first-order method that has
    a computational cost that is mildly dependent on d.

    2 / 26
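  • A concrete instance of this setup (illustrative, not from the talk): a least-squares cost
    f(x) = (1/2)||Ax − b||^2 has Lipschitz constant L = λ_max(A^T A). A minimal NumPy check of
    the Lipschitz-gradient condition, assuming nothing beyond the definitions above:

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((100, 20))
        b = rng.standard_normal(100)

        grad = lambda x: A.T @ (A @ x - b)        # gradient of f(x) = 0.5*||Ax - b||^2
        L = np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of the gradient

        # Spot-check ||grad(x) - grad(y)|| <= L ||x - y|| on random pairs.
        for _ in range(5):
            x, y = rng.standard_normal(20), rng.standard_normal(20)
            assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9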


  • 1. Problem and method Fixed-step first-order methods (FSFOM)

    Fixed-step first-order methods (FSFOM)

    For n = 0, 1, . . .

        x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k)

    Update step uses a weighted sum of previous and current gradients.

    Step coefficients H = {h_{n,k}} are non-adaptive (pre-determined).
    Excludes conjugate gradient, Barzilai-Borwein, · · ·

    Equivalent computationally efficient form exists for some H.

    Gradient method (GM): rate O(1/n)

    Heavy-ball method [Polyak, 1964]

    Nesterov’s fast gradient method (FGM) [Nesterov, 1983]
    Efficient and achieves the optimal rate O(1/n^2) of class FSFOM.

    Optimized gradient method (OGM)
    has a worst-case cost function bound that is smaller than that of FGM.
    has a computationally efficient form similar to FGM.

    Optimized gradient method optimized over gradient (OGM-OG)
    has a worst-case gradient bound that is smaller than that of FGM.
    has a computationally efficient form similar to FGM.

    3 / 26
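  • A minimal NumPy sketch (not from the talk) of the generic FSFOM update above; the function
    name fsfom and the dense coefficient array H (with H[n, k] playing h_{n+1,k}) are
    illustrative choices, and storing every gradient mirrors the definition rather than an
    efficient implementation:

        import numpy as np

        def fsfom(grad, x0, L, H):
            """Generic fixed-step first-order method:
            x_{n+1} = x_n - (1/L) * sum_{k=0..n} H[n, k] * grad(x_k)."""
            x, grads = x0.astype(float), []
            for n in range(H.shape[0]):
                grads.append(grad(x))
                x = x - (1.0 / L) * sum(H[n, k] * grads[k] for k in range(n + 1))
            return x

        # Example: the plain gradient method (GM) is the FSFOM with h_{n+1,k} = 1 only for k = n.
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])   # f(x) = 0.5*x'Ax - b'x, L = 1
        x_gm = fsfom(lambda x: A @ x - b, np.zeros(2), 1.0, np.eye(50))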


  • 1. Problem and method Nesterov’s fast gradient method (FGM)

    Fast gradient method (FGM)

    FGM [Nesterov, Soviet Math. Dokl., 1983]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)  (momentum update)

    FGM is in class FSFOM with [Drori and Teboulle, Math. Prog., 2014]

        H_FGM : h_{n+1,k} = ((t_n − 1)/t_{n+1}) h_{n,k},           k = 0, . . . , n − 2,
                            ((t_n − 1)/t_{n+1}) (h_{n,n−1} − 1),   k = n − 1,
                            1 + (t_n − 1)/t_{n+1},                 k = n.

    4 / 26
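  • A direct NumPy transcription of the FGM iteration above (a sketch; the function and
    variable names are illustrative):

        import numpy as np

        def fgm(grad, x0, L, N):
            """Nesterov's fast gradient method, as written on the slide."""
            x = y = x0.astype(float)
            t = 1.0
            for _ in range(N):
                y_new = x - grad(x) / L                              # GM update
                t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))     # momentum factor
                x = y_new + ((t - 1.0) / t_new) * (y_new - y)        # momentum update
                y, t = y_new, t_new
            return y, x

        # Tiny usage example on a quadratic with L = 1:
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])
        y_N, x_N = fgm(lambda x: A @ x - b, np.zeros(2), L=1.0, N=100)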

  • 1. Problem and method Upper and lower bounds of first-order methods

    Upper and lower achievable bounds of first-order methods

    Theorem 1.1 [Nesterov, Soviet Math. Dokl., 1983]

    For n ≥ 1, the primary sequence {y_n} of FGM satisfies

        f(y_n) − f(x*) ≤ 2L ||x_0 − x*||^2 / (n + 1)^2.

    Theorem 1.2 [Nesterov, 2004]

    When the large-scale condition d ≥ 2n + 1 holds, for any first-order method
    (with fixed or dynamic step sizes) generating x_n after n iterations there exists
    a function f ∈ F_L(R^d) that satisfies the following lower bound:

        3L ||x_0 − x*||^2 / (32 (n + 1)^2) ≤ f(x_n) − f(x*).

    [Kim and Fessler, Math. Prog., 2016] and [Drori, J. Complexity, 2017] close the case!

    5 / 26


  • 1. Problem and method Upper and lower bounds of first-order methods

    Goal: Develop new accelerated first-order methods

    Fixed-step first-order methods (FSFOM)

    For n = 0, 1, . . .

        x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k)

    Sec. 2: Find best-performing {h_{n,k}} in terms of the cost function: OGM
    Sec. 3: Study adaptive restart of OGM (not in class FSFOM)
    Sec. 4: Find best-performing {h_{n,k}} in terms of the gradient: OGM-OG

    6 / 26

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function
      Drori and Teboulle’s worst-case analysis of FSFOM
      Drori and Teboulle’s numerically optimized FSFOM
      Analytically optimized FSFOM: OGM
      Numerical experiment

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 2. OGM Drori and Teboulle’s worst-case analysis

    Exact worst-case bound analysis of FSFOM

    For given [Drori and Teboulle, Math. Prog., 2014]
      step-size coefficients: H = {h_{n,k}},
      number of iterations: N,
      problem dimension: d,
      Lipschitz constant: L,
      maximum distance between an initial point and a solution: ||x_0 − x*|| ≤ R,

    the worst-case convergence bound of f(x_N) − f(x*) is found by solving:

        B_P(H, N, d, L, R) := max_{f ∈ F_L(R^d)}  max_{x_0,...,x_N ∈ R^d, x* ∈ X*(f)}  f(x_N) − f(x*)          (P)
            s.t.  x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k),  n = 0, . . . , N − 1,
                  ||x_0 − x*|| ≤ R.

    In other words,

        f(x_N) − f(x*) ≤ B_P(H, N, d, L, R) = L R^2 B_P(H, N, d, 1, 1).

    Impractical to solve (P) due to the functional constraint f ∈ F_L(R^d).

    7 / 26


  • 2. OGM Drori and Teboulle’s worst-case analysis

    Relaxed worst-case bound analysis of FSFOM

    [Drori and Teboulle, Math. Prog., 2014] replaces the constraint f ∈ F_L(R^d) by

        (1/(2L)) ||∇f(x_i) − ∇f(x_j)||^2 ≤ f(x_i) − f(x_j) − ⟨∇f(x_j), x_i − x_j⟩

    for i, j = 0, . . . , N, *, to relax problem (P) as:

        B_P1(H, N, d, L, R) := max_{x_0,...,x_N, x* ∈ R^d;  g_0,...,g_N ∈ R^d;  δ_0,...,δ_N ∈ R}  L ||x_0 − x*||^2 δ_N          (P1)
            s.t.  x_{i+1} = x_i − ∑_{k=0}^{i} h_{i+1,k} ||x_0 − x*|| g_k,  i = 0, . . . , N − 1,
                  (1/2) ||g_i − g_j||^2 ≤ δ_i − δ_j − ⟨g_j, x_i − x_j⟩ / ||x_0 − x*||,  i, j = 0, . . . , N, *,
                  ||x_0 − x*|| ≤ R,

    where g_i := (1/(L ||x_0 − x*||)) ∇f(x_i) and δ_i := (1/(L ||x_0 − x*||^2)) (f(x_i) − f(x*)).

    [Taylor et al., Math. Prog., 2017] showed B_P(·) = B_P1(·).

    8 / 26
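  • The inequality used above in place of f ∈ F_L(R^d) can be checked pairwise over the
    iterates and x*. A small helper sketch (illustrative naming; this is only the constraint
    check, not the semidefinite program behind (P1)):

        import numpy as np

        def satisfies_FL_inequalities(xs, gs, fs, L, tol=1e-9):
            """Check (1/2L)||g_i - g_j||^2 <= f_i - f_j - <g_j, x_i - x_j> for all pairs."""
            for i in range(len(xs)):
                for j in range(len(xs)):
                    lhs = np.dot(gs[i] - gs[j], gs[i] - gs[j]) / (2.0 * L)
                    rhs = fs[i] - fs[j] - np.dot(gs[j], xs[i] - xs[j])
                    if lhs > rhs + tol:
                        return False
            return True

        # Points sampled from a genuine f in F_L pass, e.g. f(x) = 0.5*||x||^2 with L = 1
        # (then the gradient at x is x itself, and the last point plays the role of x*):
        pts = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([0.0, 0.0])]
        ok = satisfies_FL_inequalities(pts, pts, [0.5 * p @ p for p in pts], L=1.0)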


  • 2. OGM Drori and Teboulle’s numerically optimized FSFOM

    Optimizing the step coefficients of FSFOM

    [Drori and Teboulle, Math. Prog., 2014] further relaxes the problem as

        f(x_N) − f(x*) ≤ B_P(H, N, d, L, R) = B_P1(H, N, d, L, R) ≤ · · · ≤ B_D(H, N, L, R) = L R^2 B_D(H, N, 1, 1).

    B_D(H, N, 1, 1) is a solution of a (dual) convex semidefinite optimization problem (D)
    that can be solved numerically. So for any given H, a relaxed upper bound of the
    worst-case of f(x_N) − f(x*) can be computed numerically using an SDP solver!

    Q. Best-performing H*?

    The exact choice H* := argmin_H B_P(H, N, d, 1, 1) is intractable to solve. Instead, a
    best-performing(?) H* can be designed by solving

        H* := argmin_H B_D(H, N, 1, 1),

    which [Drori and Teboulle, Math. Prog., 2014] solves using an SDP solver
    (with a tight convex relaxation).

    9 / 26


  • 2. OGM Analytically optimized FSFOM: OGM

    Optimized gradient method (OGM)

    The optimized step coefficients H* are [Kim and Fessler, Math. Prog., 2016]

        H* : h_{n+1,k} = ((θ_n − 1)/θ_{n+1}) h_{n,k},           k = 0, . . . , n − 2,
                         ((θ_n − 1)/θ_{n+1}) (h_{n,n−1} − 1),   k = n − 1,
                         1 + (2θ_n − 1)/θ_{n+1},                k = n.

    OGM [Kim and Fessler, Math. Prog., 2016]

    Initialize x_0 = y_0, θ_0 = 1
    For n = 0, 1, . . . , N − 1

        y_{n+1} = x_n − (1/L) ∇f(x_n)                                       (GM update)
        θ_{n+1} = (1/2) (1 + √(1 + 4θ_n^2)),   n = 0, 1, . . . , N − 2,
                  (1/2) (1 + √(1 + 8θ_n^2)),   n = N − 1                    (new momentum factor)
        x_{n+1} = y_{n+1} + ((θ_n − 1)/θ_{n+1}) (y_{n+1} − y_n)
                          + (θ_n/θ_{n+1}) (y_{n+1} − x_n)                   (new momentum update)

    Equivalently, the new momentum update can be written in the computationally efficient form

        x_{n+1} = [x_n − (1/L) (1 + θ_n/θ_{n+1}) ∇f(x_n)] + ((θ_n − 1)/θ_{n+1}) (y_{n+1} − y_n).

    10 / 26
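  • A NumPy sketch of the OGM iteration above (illustrative code; note the θ recursion switches
    from the factor 4 to the factor 8 at the final iteration, exactly as on the slide):

        import numpy as np

        def ogm(grad, x0, L, N):
            """Optimized gradient method (OGM) for N iterations."""
            x = y = x0.astype(float)
            theta = 1.0
            for n in range(N):
                y_new = x - grad(x) / L                                  # GM update
                c = 8.0 if n == N - 1 else 4.0                           # last iteration uses 8
                theta_new = 0.5 * (1.0 + np.sqrt(1.0 + c * theta * theta))
                x = (y_new
                     + ((theta - 1.0) / theta_new) * (y_new - y)         # FGM-like momentum
                     + (theta / theta_new) * (y_new - x))                # extra OGM momentum
                y, theta = y_new, theta_new
            return x

        # Usage on the same toy quadratic as before (L = 1):
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])
        x_N = ogm(lambda x: A @ x - b, np.zeros(2), L=1.0, N=100)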


  • 2. OGM Analytically optimized FSFOM: OGM

    Convergence bounds for OGM

    Theorem 2.1 [Kim and Fessler, Math. Prog., 2016]

    For a given N ≥ 1, the point x_N generated by OGM satisfies

        f(x_N) − f(x*) ≤ L ||x_0 − x*||^2 / (2θ_N^2) ≤ L ||x_0 − x*||^2 / ((N + 1)(N + 1 + √2)).

    This bound is half that of FGM: to achieve the same cost function value, OGM requires
    about 1/√2 times as many iterations as FGM.

    Theorem 2.2 [Drori, J. Complexity, 2017]

    When the large-scale condition “d ≥ N + 1” holds, for any first-order method (with fixed
    or dynamic step sizes) generating x_N after N iterations there exists a function
    f ∈ F_L(R^d) that satisfies the following lower bound:

        L ||x_0 − x*||^2 / (2θ_N^2) ≤ f(x_N) − f(x*).

    OGM achieves this lower bound exactly!

    11 / 26
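  • A quick numerical look (illustrative) at the factor-of-two claim: compute θ_N from the OGM
    recursion and compare the two analytical bounds with L = R = 1.

        import numpy as np

        def ogm_theta(N):
            """θ_N from the OGM recursion (factor 8 at the last step)."""
            theta = 1.0
            for n in range(N):
                c = 8.0 if n == N - 1 else 4.0
                theta = 0.5 * (1.0 + np.sqrt(1.0 + c * theta * theta))
            return theta

        for N in (10, 100, 1000):
            ogm_bound = 1.0 / (2.0 * ogm_theta(N) ** 2)   # L R^2 / (2 θ_N^2), L = R = 1
            fgm_bound = 2.0 / (N + 1) ** 2                # 2 L R^2 / (N + 1)^2
            print(N, fgm_bound / ogm_bound)               # ratio approaches 2 as N grows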


  • 2. OGM Numerical experiment

    Log-Sum-Exp problem

    Minimize a Log-Sum-Exp function:

        f(x) = η log ( ∑_{i=1}^{m} exp( (1/η) (a_i^T x − b_i) ) )

    This approaches max_{i=1,...,m} (a_i^T x − b_i) as η → 0.

    L = (1/η) λ_max(A^T A), where A = [a_1 · · · a_m]^T ∈ R^{m×d}.

    m = 100, d = 20 problem dimension, η = 1, N = 50 iterations

    Algorithms: GM, FGM, OGM

    12 / 26
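  • A sketch of this test problem in NumPy (illustrative data and names), with the gradient
    and the Lipschitz constant stated on the slide:

        import numpy as np

        rng = np.random.default_rng(0)
        m, d, eta = 100, 20, 1.0
        A = rng.standard_normal((m, d))            # rows are a_i^T
        b = rng.standard_normal(m)

        def f(x):
            z = (A @ x - b) / eta
            return eta * (z.max() + np.log(np.sum(np.exp(z - z.max()))))   # stable log-sum-exp

        def grad(x):
            z = (A @ x - b) / eta
            p = np.exp(z - z.max()); p /= p.sum()  # softmax weights
            return A.T @ p

        L = np.linalg.eigvalsh(A.T @ A).max() / eta   # L = (1/η) λ_max(A^T A)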

  • 2. OGM Numerical experiment

    Log-Sum-Exp: Cost function vs Iteration

    [Figure: cost function (f(y_n) − f(x*)) / f(x*) vs. iteration n over N = 50 iterations,
    comparing the convergence speed of GM, FGM, and OGM.]

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart
      FGM with adaptive restart
      OGM with adaptive restart
      Numerical experiment

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary


  • 3. OGM with adaptive restart FGM with adaptive restart

    FGM with adaptive restart [O’donoghue and Candes, FoCM, 2015]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        if the restart condition is satisfied, restart (set t_n = 1)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)  (momentum update)

    Restart condition:

        f(y_{n+1}) > f(y_n)   or   ⟨−∇f(x_n), y_{n+1} − y_n⟩ < 0

    Its practical acceleration is partially explained by a quadratic analysis.

    14 / 26


  • 3. OGM with adaptive restart OGM with adaptive restart

    OGM′ [Kim and Fessler, JOTA, 2017] with adaptive restart [Kim and Fessler, arXiv:1703.04641]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        if the restart condition is satisfied, restart (set t_n = 1)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)
                          + (t_n/t_{n+1}) (y_{n+1} − x_n)        (new momentum update)

    Restart condition:

        f(y_{n+1}) > f(y_n)   or   ⟨−∇f(x_n), y_{n+1} − y_n⟩ < 0

    We extended the quadratic analysis in [O’donoghue and Candes, FoCM, 2015]
    to OGM with restart. (details omitted)

    15 / 26
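  • A NumPy sketch of OGM′ with the function/gradient restart condition above (illustrative;
    a restart simply resets the momentum factor t to 1, as on the slide):

        import numpy as np

        def ogm_prime_restart(f, grad, x0, L, N):
            """OGM' with adaptive restart of the momentum factor."""
            x = y = x0.astype(float)
            t, fy = 1.0, f(x0)
            for _ in range(N):
                g = grad(x)
                y_new = x - g / L                                    # GM update
                fy_new = f(y_new)
                if fy_new > fy or np.dot(-g, y_new - y) < 0:         # restart condition
                    t = 1.0
                t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))     # momentum factor
                x = (y_new + ((t - 1.0) / t_new) * (y_new - y)
                           + (t / t_new) * (y_new - x))              # OGM' momentum update
                y, t, fy = y_new, t_new, fy_new
            return y

        # Usage on the toy quadratic (L = 1):
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])
        y_N = ogm_prime_restart(lambda x: 0.5 * x @ A @ x - b @ x,
                                lambda x: A @ x - b, np.zeros(2), 1.0, 200)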

  • 3. OGM with adaptive restart Numerical experiment

    Log-Sum-Exp problem

    Minimize a Log-Sum-Exp function:

        f(x) = η log ( ∑_{i=1}^{m} exp( (1/η) (a_i^T x − b_i) ) )

    This approaches max_{i=1,...,m} (a_i^T x − b_i) as η → 0.

    L = (1/η) λ_max(A^T A), where A = [a_1 · · · a_m]^T ∈ R^{m×d}.

    m = 100, d = 20 problem dimension, η = 1, N = 1000 iterations

    Algorithms: GM, FGM, OGM

    Restarting algorithms: “FGM-R”, “OGM-R”

    16 / 26

  • 3. OGM with adaptive restart Numerical experiment

    Log-Sum-Exp: Cost function vs Iteration

    [Figure: cost function (f(y_n) − f(x*)) / f(x*) vs. iteration n over N = 1000 iterations,
    comparing the convergence speed of GM, FGM, OGM and the restarting variants FGM-R, OGM-R.]

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)
      Decreasing the gradient is important
      Worst-case gradient bound analysis of FSFOM
      Optimized FSFOM in terms of gradient norm
      Numerical experiment

    5 Summary

  • 4. OGM-OG Decreasing the gradient is important

    Decreasing the gradient is important

    It is known that the dual gradient norm is related to primal feasibility.
    So, decreasing the gradient norm is also important in a dual approach.

    GM decreases the (dual) gradient norm ||∇f(x_N)|| with rate O(1/N).

    What is the rate of the gradient norm decrease of FGM?

    O(1/N^1.5)!

    Faster algorithm?

    Develop a new fast first-order method by optimizing FSFOM with
    respect to the gradient norm.

    18 / 26

  • 4. OGM-OG Worst-case gradient bound analysis of FSFOM

    Exact worst-case gradient bound analysis of FSFOM

    For given
      step-size coefficients: H = {h_{n,k}},
      number of iterations: N,
      problem dimension: d,
      Lipschitz constant: L,
      maximum distance between an initial point and a solution: ||x_0 − x*|| ≤ R,

    the worst-case convergence bound of the gradient is found by solving:

        B_P′(H, N, d, L, R) := max_{f ∈ F_L(R^d)}  max_{x_0,...,x_N ∈ R^d, x* ∈ X*(f)}  min_{n ∈ {0,...,N}} ||∇f(x_n)||^2          (P′)
            s.t.  x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k),  n = 0, . . . , N − 1,
                  ||x_0 − x*|| ≤ R.

    Similar to the relaxation from (P) to (D), we relaxed the problem as

        min_{n ∈ {0,...,N}} ||∇f(x_n)||^2 ≤ B_P′(H, N, d, L, R) ≤ B_D′(H, N, L, R).
        [Kim and Fessler, arXiv:1607.06764]

    19 / 26


  • 4. OGM-OG Worst-case gradient bound analysis of FSFOM

    New gradient bound for FGM

    FGM [Nesterov, Soviet Math. Dokl., 1983]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)  (momentum update)

    Theorem 4.1 [Kim and Fessler, arXiv:1607.06764]

    The sequence {x_n} generated by FGM satisfies

        min_{n ∈ {0,...,N}} ||∇f(x_n)|| ≤ L ||x_0 − x*|| / √(∑_{n=0}^{N} t_n^2) ≤ 2√3 L ||x_0 − x*|| / N^1.5.

    The final gradient norm ||∇f(x_N)|| of FGM decreases with rate O(1/N).

    20 / 26

  • 4. OGM-OG Optimized FSFOM in terms of gradient norm

    Optimized FSFOM in terms of gradient norm

    Optimize step coefficients in terms of the gradient norm as
    [Kim and Fessler, arXiv:1607.06764]:

        H*_OG := argmin_H B_D′(H, N, 1, 1).

    The FSFOM with H*_OG has an efficient formulation similar to OGM,
    named OGM-OG (OG for optimized over a gradient).

    21 / 26

  • 4. OGM-OG Optimized FSFOM in terms of gradient norm

    OGM-OG method

    OGM-OG [Kim and Fessler, arXiv:1607.06764]

    Initialize x_0 = y_0, ω_0 = Ω_0 = 1
    For n = 0, 1, . . . , N − 1

        y_{n+1} = x_n − (1/L) ∇f(x_n)                                        (GM update)
        ω_{n+1} = (1/2) (1 + √(1 + 4ω_n^2)),   n = 0, . . . , ⌊N/2⌋ − 2,
                  (N − n)/2,                   n = ⌊N/2⌋ − 1, . . . , N − 1
        Ω_{n+1} = ∑_{l=0}^{n+1} ω_l                                          (new momentum factor)
        x_{n+1} = y_{n+1} + ((Ω_n − ω_n) ω_{n+1} / (ω_n Ω_{n+1})) (y_{n+1} − y_n)
                          + ((2ω_n^2 − Ω_n) ω_{n+1} / (ω_n Ω_{n+1})) (y_{n+1} − x_n)   (new momentum update)

    This reduces to OGM′ when ω_{n+1} = (1/2) (1 + √(1 + 4ω_n^2)) for all n ≥ 0.

    22 / 26
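  • A NumPy sketch of the OGM-OG iteration above, reconstructed from the slide (illustrative;
    the phase switch follows the ⌊N/2⌋ rule, so ω grows FGM-style in the first half and then
    decreases linearly):

        import numpy as np

        def ogm_og(grad, x0, L, N):
            """OGM-OG: OGM-type method with momentum factors tuned for the gradient norm."""
            x = y = x0.astype(float)
            w, W = 1.0, 1.0                         # omega_0 = Omega_0 = 1
            for n in range(N):
                y_new = x - grad(x) / L                                      # GM update
                if n <= N // 2 - 2:
                    w_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * w * w))         # increasing phase
                else:
                    w_new = (N - n) / 2.0                                    # decreasing phase
                W_new = W + w_new                                            # Omega_{n+1}
                x = (y_new + ((W - w) * w_new / (w * W_new)) * (y_new - y)
                           + ((2.0 * w * w - W) * w_new / (w * W_new)) * (y_new - x))
                y, w, W = y_new, w_new, W_new
            return x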

  • 4. OGM-OG Optimized FSFOM in terms of gradient norm

    Gradient bound of OGM-OG

    Theorem 4.2 [Kim and Fessler, arXiv:1607.06764]

    The sequence {x_n} generated by OGM-OG satisfies

        min_{n ∈ {0,...,N}} ||∇f(x_n)|| ≤ L ||x_0 − x*|| / ∑_{n=0}^{N} (Ω_n^2 − ω_n^2) ≤ √6 L ||x_0 − x*|| / N^1.5.

    This is a √2-times smaller bound than FGM’s: to achieve the same gradient norm,
    OGM-OG requires about 1/∛2 times as many iterations as FGM.

    To the best of my knowledge, this is the best known worst-case analytical gradient
    norm bound for FSFOM. (However, it is not optimal, unlike OGM.)

    Other choices of ω_n, such as ω_n = (n + a)/a for any a > 2, also have the rate
    O(1/N^1.5) and, like FGM, do not require selecting N in advance.

    23 / 26


  • 4. OGM-OG Numerical experiment

    Log-Sum-Exp problem

    Minimize a Log-Sum-Exp function:

        f(x) = η log ( ∑_{i=1}^{m} exp( (1/η) (a_i^T x − b_i) ) )

    This approaches max_{i=1,...,m} (a_i^T x − b_i) as η → 0.

    L = (1/η) λ_max(A^T A), where A = [a_1 · · · a_m]^T ∈ R^{m×d}.

    m = 100, d = 20 problem dimension, η = 1, N = 50 iterations

    Algorithms: GM, FGM, OGM, “OGM-OG”

    24 / 26

  • 4. OGM-OG Numerical experiment

    Log-Sum-Exp: Gradient norm vs Iteration

    [Figure: gradient norm ||∇f(y_n)|| vs. iteration n over N = 50 iterations,
    comparing the convergence speed of GM, FGM, OGM, and OGM-OG.]

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 5. Summary

    Summary and Future work

    Optimized fixed-step first-order methods (FSFOM) over the cost function and gradient
    norm for smooth convex minimization, leading to OGM and OGM-OG respectively.

    Introduced adaptive restarting (of the momentum) for OGM to improve its rate of
    convergence when the current momentum is not helpful.

    Future work

    Develop FSFOM that is optimal for the gradient norm decrease.

    Extend OGM and OGM-OG to other function classes.

    (1) Taylor, Hendrickx, and Glineur, “Exact worst-case performance of first-order
        algorithms for composite convex optimization,” arXiv:1512.07516.

    (2) Kim and Fessler, “Another look at “Fast Iterative Shrinkage/Thresholding
        Algorithm (FISTA)”,” arXiv:1608.03861.

    Thank you for listening to my talk!

    26 / 26


  • References

    1. Y. Drori, “The exact information-based complexity of smooth convex minimization,”
       J. Complexity, 2017.

    2. Y. Drori and M. Teboulle, “Performance of first-order methods for smooth convex
       minimization: A novel approach,” Mathematical Programming, 2014.

    3. D. Kim and J. A. Fessler, “Another look at “Fast Iterative Shrinkage/Thresholding
       Algorithm (FISTA)”,” arXiv:1608.03861.

    4. D. Kim and J. A. Fessler, “Generalizing the optimized gradient methods for smooth
       convex minimization,” arXiv:1607.06764.

    5. D. Kim and J. A. Fessler, “On the convergence analysis of the optimized gradient
       method,” J. Optim. Theory Appl., 2017.

    6. D. Kim and J. A. Fessler, “Optimized first-order methods for smooth convex
       minimization,” Mathematical Programming, 2016.

    7. D. Kim and J. A. Fessler, “Adaptive restart of the optimized gradient method for
       convex optimization,” arXiv:1703.04641.

    8. Y. Nesterov, “A method for unconstrained convex minimization problem with the rate
       of convergence O(1/k^2),” Dokl. Akad. Nauk. USSR, 1983.

    9. Y. Nesterov, “Introductory lectures on convex optimization: A basic course,” Kluwer, 2004.

    10. B. O’Donoghue and E. Candes, “Adaptive restart for accelerated gradient schemes,”
        Foundations of Computational Mathematics, 2015.

    11. B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,”
        USSR Computational Mathematics and Mathematical Physics, 1964.

    12. A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Exact worst-case performance of
        first-order algorithms for composite convex optimization,” arXiv:1512.07516.

    13. A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Smooth strongly convex interpolation
        and exact worst-case performance of first-order methods,” Mathematical Programming, 2017.

  • Backup

  • Two worst-case functions of OGM

    Theorem B.1 [Kim and Fessler, JOTA, 2017]

    For the following two “worst-case” functions in F_L(R^d) with R = ||x_0 − x*||
    and x* = 0:

        f_1(x; N) = (L R / θ_N^2) ||x|| − L R^2 / (2θ_N^4),   if ||x|| ≥ R / θ_N^2,
                    (L/2) ||x||^2,                            otherwise,          (Huber)

        f_2(x) = (L/2) ||x||^2,                                                   (Quadratic)

    OGM exactly achieves its worst-case convergence bound after N iterations, i.e.,

        f_1(x_N; N) − f_1(x*; N) = f_2(x_N) − f_2(x*) = L ||x_0 − x*||^2 / (2θ_N^2).

    (figure of the two different worst-case behaviors is shown on the next page)
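  • A small numerical illustration (not from the slides) that OGM run on the quadratic
    worst-case function f_2(x) = (L/2)||x||^2 lands exactly on the bound L||x_0 − x*||^2/(2θ_N^2);
    the scalar loop below re-implements the OGM recursion for this 1-D instance:

        import numpy as np

        L, R, N = 1.0, 1.0, 5
        x = y = R                                  # scalar problem: x_0 = y_0 = R, x* = 0
        theta = 1.0
        for n in range(N):
            y_new = x - (L * x) / L                # GM update; gradient of f_2 is L*x
            c = 8.0 if n == N - 1 else 4.0
            theta_new = 0.5 * (1.0 + np.sqrt(1.0 + c * theta * theta))
            x = (y_new + ((theta - 1.0) / theta_new) * (y_new - y)
                       + (theta / theta_new) * (y_new - x))
            y, theta = y_new, theta_new

        gap = 0.5 * L * x * x                      # f_2(x_N) - f_2(x*)
        bound = L * R * R / (2.0 * theta * theta)  # L ||x_0 - x*||^2 / (2 θ_N^2)
        print(gap, bound)                          # equal up to floating-point rounding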

  • Two worst-case functions of OGM (cont’d)

    Example: Two worst-case behaviors of OGM for x_0 = 1, x* = 0, N = 5,
    and d = R = L = 1.

    [Figure: f_1(x; N) (Huber) and f_2(x) (Quadratic) plotted over x ∈ [−1, 1].]

    The worst-case function of the primary sequence {y_i} is of the type f_1(x; i),
    whereas that of the secondary sequence {x_i} is f_2(x). (not shown)
    Interestingly, the final iterate x_N of OGM compromises between the two extremely
    different worst-case behaviors, which is related to the choice of θ_N.

  • A sketch of quadratic analysis of OGM

    Minimize a smooth and strongly convex quadratic function:

        f(x) = (1/2) x^T Q x − p^T x,

    where Q ∈ R^{d×d} is a symmetric positive definite matrix, p ∈ R^d is a vector,
    and x* = Q^{−1} p is the optimum.

    OGM with constant step coefficients [Kim and Fessler, arXiv:1703.04641]

    Initialize x_0 = y_0
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                                    (GM update)
        x_{n+1} = y_{n+1} + β (y_{n+1} − y_n) + γ (y_{n+1} − x_n)        (new momentum update)

    Use the following relationship for the analysis (details omitted):

        [ y_{n+1} − x* ]   [ (1 + β)(I − (1/L)Q) − γ (1/L)Q    −β (I − (1/L)Q) ] [ y_n − x*     ]
        [ y_n − x*     ] = [ I                                  0              ] [ y_{n−1} − x* ]
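  • A quick numerical check (illustrative) of the block recursion above: run the constant-
    coefficient momentum update on a random quadratic and compare one step of the iteration
    against the 2x2-block system matrix (β and γ below are arbitrary constants):

        import numpy as np

        rng = np.random.default_rng(0)
        d, L = 5, 1.0
        M = rng.standard_normal((d, d))
        Q = M @ M.T / np.linalg.eigvalsh(M @ M.T).max()    # positive definite, eigenvalues in (0, L]
        p = rng.standard_normal(d)
        xstar = np.linalg.solve(Q, p)
        beta, gamma = 0.5, 0.25

        # Block system matrix acting on (y_n - x*, y_{n-1} - x*).
        I = np.eye(d)
        T = np.block([[(1 + beta) * (I - Q / L) - gamma * (Q / L), -beta * (I - Q / L)],
                      [I, np.zeros((d, d))]])

        # One momentum step starting from x_0 = y_0.
        x = y_prev = rng.standard_normal(d)                # x_0 = y_0
        y = x - (Q @ x - p) / L                            # y_1 (GM update)
        x = y + beta * (y - y_prev) + gamma * (y - x)      # x_1 (momentum update)
        y_next = x - (Q @ x - p) / L                       # y_2

        lhs = np.concatenate([y_next - xstar, y - xstar])
        rhs = T @ np.concatenate([y - xstar, y_prev - xstar])
        print(np.max(np.abs(lhs - rhs)))                   # ~ 1e-15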


