  • Improving the Optimized Gradient Method for Large-Scale Convex Optimization

    Donghwan Kim and Jeffrey A. Fessler

    EECS Department, University of Michigan

    SIAM Conference on Optimization

    May 24, 2017

  • Goal: Develop new accelerated first-order methods

    that are faster than Nesterov’s fast gradient method

    in the worst-case

    for minimizing smooth convex functions

    1 / 26

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 1 Smooth convex problem and fixed-step first-order methods
      Smooth convex problem
      Fixed-step first-order methods (FSFOM)
      Nesterov’s fast gradient method (FGM)
      Upper and lower achievable bounds of first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 1. Problem and method Smooth convex problem

    Smooth convex minimization problem

    Solve a smooth convex minimization problem:

        min_{x ∈ R^d} f(x),

    where the following conditions are assumed:

    f : R^d → R is a convex function of the type C_L^{1,1}(R^d), i.e., continuously
    differentiable with Lipschitz continuous gradient:

        ||∇f(x) − ∇f(y)|| ≤ L ||x − y||,  ∀ x, y ∈ R^d,

    where L > 0 is the Lipschitz constant. (f ∈ F_L(R^d).)

    The optimal set X*(f) := argmin_{x ∈ R^d} f(x) is nonempty.

    Large-scale (i.e., large d), so consider using a first-order method that has
    a computational cost that is mildly dependent on d.

    2 / 26
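  • A concrete instance of this setup (illustrative, not from the talk): a least-squares cost
    f(x) = (1/2)||Ax − b||^2 has Lipschitz constant L = λ_max(A^T A). A minimal NumPy check of
    the Lipschitz-gradient condition, assuming nothing beyond the definitions above:

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((100, 20))
        b = rng.standard_normal(100)

        grad = lambda x: A.T @ (A @ x - b)        # gradient of f(x) = 0.5*||Ax - b||^2
        L = np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of the gradient

        # Spot-check ||grad(x) - grad(y)|| <= L ||x - y|| on random pairs.
        for _ in range(5):
            x, y = rng.standard_normal(20), rng.standard_normal(20)
            assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9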


  • 1. Problem and method Fixed-step first-order methods (FSFOM)

    Fixed-step first-order methods (FSFOM)

    For n = 0, 1, . . .

        x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k)

    Update step uses a weighted sum of previous and current gradients.

    Step coefficients H = {h_{n,k}} are non-adaptive (pre-determined).
    Excludes conjugate gradient, Barzilai-Borwein, · · ·

    Equivalent computationally efficient form exists for some H.

    Gradient method (GM): rate O(1/n)

    Heavy-ball method [Polyak, 1964]

    Nesterov’s fast gradient method (FGM) [Nesterov, 1983]
    Efficient and achieves the optimal rate O(1/n^2) of class FSFOM.

    Optimized gradient method (OGM)
    has a worst-case cost function bound that is smaller than that of FGM.
    has a computationally efficient form similar to FGM.

    Optimized gradient method optimized over gradient (OGM-OG)
    has a worst-case gradient bound that is smaller than that of FGM.
    has a computationally efficient form similar to FGM.

    3 / 26
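  • A minimal NumPy sketch (not from the talk) of the generic FSFOM update above; the function
    name fsfom and the dense coefficient array H (with H[n, k] playing h_{n+1,k}) are
    illustrative choices, and storing every gradient mirrors the definition rather than an
    efficient implementation:

        import numpy as np

        def fsfom(grad, x0, L, H):
            """Generic fixed-step first-order method:
            x_{n+1} = x_n - (1/L) * sum_{k=0..n} H[n, k] * grad(x_k)."""
            x, grads = x0.astype(float), []
            for n in range(H.shape[0]):
                grads.append(grad(x))
                x = x - (1.0 / L) * sum(H[n, k] * grads[k] for k in range(n + 1))
            return x

        # Example: the plain gradient method (GM) is the FSFOM with h_{n+1,k} = 1 only for k = n.
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])   # f(x) = 0.5*x'Ax - b'x, L = 1
        x_gm = fsfom(lambda x: A @ x - b, np.zeros(2), 1.0, np.eye(50))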


  • 1. Problem and method Nesterov’s fast gradient method (FGM)

    Fast gradient method (FGM)

    FGM [Nesterov, Soviet Math. Dokl., 1983]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)  (momentum update)

    FGM is in class FSFOM with [Drori and Teboulle, Math. Prog., 2014]

        H_FGM : h_{n+1,k} = ((t_n − 1)/t_{n+1}) h_{n,k},           k = 0, . . . , n − 2,
                            ((t_n − 1)/t_{n+1}) (h_{n,n−1} − 1),   k = n − 1,
                            1 + (t_n − 1)/t_{n+1},                 k = n.

    4 / 26
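  • A direct NumPy transcription of the FGM iteration above (a sketch; the function and
    variable names are illustrative):

        import numpy as np

        def fgm(grad, x0, L, N):
            """Nesterov's fast gradient method, as written on the slide."""
            x = y = x0.astype(float)
            t = 1.0
            for _ in range(N):
                y_new = x - grad(x) / L                              # GM update
                t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))     # momentum factor
                x = y_new + ((t - 1.0) / t_new) * (y_new - y)        # momentum update
                y, t = y_new, t_new
            return y, x

        # Tiny usage example on a quadratic with L = 1:
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])
        y_N, x_N = fgm(lambda x: A @ x - b, np.zeros(2), L=1.0, N=100)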

  • 1. Problem and method Upper and lower bounds of first-order methods

    Upper and lower achievable bounds of first-order methods

    Theorem 1.1 [Nesterov, Soviet Math. Dokl., 1983]

    For n ≥ 1, the primary sequence {y_n} of FGM satisfies

        f(y_n) − f(x*) ≤ 2L ||x_0 − x*||^2 / (n + 1)^2.

    Theorem 1.2 [Nesterov, 2004]

    When the large-scale condition d ≥ 2n + 1 holds, for any first-order method
    (with fixed or dynamic step sizes) generating x_n after n iterations there exists
    a function f ∈ F_L(R^d) that satisfies the following lower bound:

        3L ||x_0 − x*||^2 / (32 (n + 1)^2) ≤ f(x_n) − f(x*).

    [Kim and Fessler, Math. Prog., 2016] and [Drori, J. Complexity, 2017] close the case!

    5 / 26


  • 1. Problem and method Upper and lower bounds of first-order methods

    Goal: Develop new accelerated first-order methods

    Fixed-step first-order methods (FSFOM)

    For n = 0, 1, . . .

        x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k)

    Sec. 2: Find best-performing {h_{n,k}} in terms of the cost function: OGM
    Sec. 3: Study adaptive restart of OGM (not in class FSFOM)
    Sec. 4: Find best-performing {h_{n,k}} in terms of the gradient: OGM-OG

    6 / 26

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function
      Drori and Teboulle’s worst-case analysis of FSFOM
      Drori and Teboulle’s numerically optimized FSFOM
      Analytically optimized FSFOM: OGM
      Numerical experiment

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 2. OGM Drori and Teboulle’s worst-case analysis

    Exact worst-case bound analysis of FSFOM

    For given [Drori and Teboulle, Math. Prog., 2014]
      step-size coefficients: H = {h_{n,k}},
      number of iterations: N,
      problem dimension: d,
      Lipschitz constant: L,
      maximum distance between an initial point and a solution: ||x_0 − x*|| ≤ R,

    the worst-case convergence bound of f(x_N) − f(x*) is found by solving:

        B_P(H, N, d, L, R) := max_{f ∈ F_L(R^d)}  max_{x_0,...,x_N ∈ R^d, x* ∈ X*(f)}  f(x_N) − f(x*)          (P)
            s.t.  x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k),  n = 0, . . . , N − 1,
                  ||x_0 − x*|| ≤ R.

    In other words,

        f(x_N) − f(x*) ≤ B_P(H, N, d, L, R) = L R^2 B_P(H, N, d, 1, 1).

    Impractical to solve (P) due to the functional constraint f ∈ F_L(R^d).

    7 / 26


  • 2. OGM Drori and Teboulle’s worst-case analysis

    Relaxed worst-case bound analysis of FSFOM

    [Drori and Teboulle, Math. Prog., 2014] replaces the constraint f ∈ F_L(R^d) by

        (1/(2L)) ||∇f(x_i) − ∇f(x_j)||^2 ≤ f(x_i) − f(x_j) − ⟨∇f(x_j), x_i − x_j⟩

    for i, j = 0, . . . , N, *, to relax problem (P) as:

        B_P1(H, N, d, L, R) := max_{x_0,...,x_N, x* ∈ R^d;  g_0,...,g_N ∈ R^d;  δ_0,...,δ_N ∈ R}  L ||x_0 − x*||^2 δ_N          (P1)
            s.t.  x_{i+1} = x_i − ∑_{k=0}^{i} h_{i+1,k} ||x_0 − x*|| g_k,  i = 0, . . . , N − 1,
                  (1/2) ||g_i − g_j||^2 ≤ δ_i − δ_j − ⟨g_j, x_i − x_j⟩ / ||x_0 − x*||,  i, j = 0, . . . , N, *,
                  ||x_0 − x*|| ≤ R,

    where g_i := (1/(L ||x_0 − x*||)) ∇f(x_i) and δ_i := (1/(L ||x_0 − x*||^2)) (f(x_i) − f(x*)).

    [Taylor et al., Math. Prog., 2017] showed B_P(·) = B_P1(·).

    8 / 26
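  • The inequality used above in place of f ∈ F_L(R^d) can be checked pairwise over the
    iterates and x*. A small helper sketch (illustrative naming; this is only the constraint
    check, not the semidefinite program behind (P1)):

        import numpy as np

        def satisfies_FL_inequalities(xs, gs, fs, L, tol=1e-9):
            """Check (1/2L)||g_i - g_j||^2 <= f_i - f_j - <g_j, x_i - x_j> for all pairs."""
            for i in range(len(xs)):
                for j in range(len(xs)):
                    lhs = np.dot(gs[i] - gs[j], gs[i] - gs[j]) / (2.0 * L)
                    rhs = fs[i] - fs[j] - np.dot(gs[j], xs[i] - xs[j])
                    if lhs > rhs + tol:
                        return False
            return True

        # Points sampled from a genuine f in F_L pass, e.g. f(x) = 0.5*||x||^2 with L = 1
        # (then the gradient at x is x itself, and the last point plays the role of x*):
        pts = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([0.0, 0.0])]
        ok = satisfies_FL_inequalities(pts, pts, [0.5 * p @ p for p in pts], L=1.0)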


  • 2. OGM Drori and Teboulle’s numerically optimized FSFOM

    Optimizing the step coefficients of FSFOM

    [Drori and Teboulle, Math. Prog., 2014] further relaxes the problem as

        f(x_N) − f(x*) ≤ B_P(H, N, d, L, R) = B_P1(H, N, d, L, R) ≤ · · · ≤ B_D(H, N, L, R) = L R^2 B_D(H, N, 1, 1).

    B_D(H, N, 1, 1) is a solution of a (dual) convex semidefinite optimization problem (D)
    that can be solved numerically. So for any given H, a relaxed upper bound of the
    worst-case of f(x_N) − f(x*) can be computed numerically using an SDP solver!

    Q. Best-performing H*?

    The exact choice H* := argmin_H B_P(H, N, d, 1, 1) is intractable to solve. Instead, a
    best-performing(?) H* can be designed by solving

        H* := argmin_H B_D(H, N, 1, 1),

    which [Drori and Teboulle, Math. Prog., 2014] solves using an SDP solver
    (with a tight convex relaxation).

    9 / 26


  • 2. OGM Analytically optimized FSFOM: OGM

    Optimized gradient method (OGM)

    The optimized step coefficients H* are [Kim and Fessler, Math. Prog., 2016]

        H* : h_{n+1,k} = ((θ_n − 1)/θ_{n+1}) h_{n,k},           k = 0, . . . , n − 2,
                         ((θ_n − 1)/θ_{n+1}) (h_{n,n−1} − 1),   k = n − 1,
                         1 + (2θ_n − 1)/θ_{n+1},                k = n.

    OGM [Kim and Fessler, Math. Prog., 2016]

    Initialize x_0 = y_0, θ_0 = 1
    For n = 0, 1, . . . , N − 1

        y_{n+1} = x_n − (1/L) ∇f(x_n)                                       (GM update)
        θ_{n+1} = (1/2) (1 + √(1 + 4θ_n^2)),   n = 0, 1, . . . , N − 2,
                  (1/2) (1 + √(1 + 8θ_n^2)),   n = N − 1                    (new momentum factor)
        x_{n+1} = y_{n+1} + ((θ_n − 1)/θ_{n+1}) (y_{n+1} − y_n)
                          + (θ_n/θ_{n+1}) (y_{n+1} − x_n)                   (new momentum update)

    Equivalently, the new momentum update can be written in the computationally efficient form

        x_{n+1} = [x_n − (1/L) (1 + θ_n/θ_{n+1}) ∇f(x_n)] + ((θ_n − 1)/θ_{n+1}) (y_{n+1} − y_n).

    10 / 26
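  • A NumPy sketch of the OGM iteration above (illustrative code; note the θ recursion switches
    from the factor 4 to the factor 8 at the final iteration, exactly as on the slide):

        import numpy as np

        def ogm(grad, x0, L, N):
            """Optimized gradient method (OGM) for N iterations."""
            x = y = x0.astype(float)
            theta = 1.0
            for n in range(N):
                y_new = x - grad(x) / L                                  # GM update
                c = 8.0 if n == N - 1 else 4.0                           # last iteration uses 8
                theta_new = 0.5 * (1.0 + np.sqrt(1.0 + c * theta * theta))
                x = (y_new
                     + ((theta - 1.0) / theta_new) * (y_new - y)         # FGM-like momentum
                     + (theta / theta_new) * (y_new - x))                # extra OGM momentum
                y, theta = y_new, theta_new
            return x

        # Usage on the same toy quadratic as before (L = 1):
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])
        x_N = ogm(lambda x: A @ x - b, np.zeros(2), L=1.0, N=100)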


  • 2. OGM Analytically optimized FSFOM: OGM

    Convergence bounds for OGM

    Theorem 2.1 [Kim and Fessler, Math. Prog., 2016]

    For a given N ≥ 1, the point x_N generated by OGM satisfies

        f(x_N) − f(x*) ≤ L ||x_0 − x*||^2 / (2θ_N^2) ≤ L ||x_0 − x*||^2 / ((N + 1)(N + 1 + √2)).

    This bound is half that of FGM: to achieve the same cost function value, OGM requires
    about 1/√2 times as many iterations as FGM.

    Theorem 2.2 [Drori, J. Complexity, 2017]

    When the large-scale condition “d ≥ N + 1” holds, for any first-order method (with fixed
    or dynamic step sizes) generating x_N after N iterations there exists a function
    f ∈ F_L(R^d) that satisfies the following lower bound:

        L ||x_0 − x*||^2 / (2θ_N^2) ≤ f(x_N) − f(x*).

    OGM achieves this lower bound exactly!

    11 / 26
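  • A quick numerical look (illustrative) at the factor-of-two claim: compute θ_N from the OGM
    recursion and compare the two analytical bounds with L = R = 1.

        import numpy as np

        def ogm_theta(N):
            """θ_N from the OGM recursion (factor 8 at the last step)."""
            theta = 1.0
            for n in range(N):
                c = 8.0 if n == N - 1 else 4.0
                theta = 0.5 * (1.0 + np.sqrt(1.0 + c * theta * theta))
            return theta

        for N in (10, 100, 1000):
            ogm_bound = 1.0 / (2.0 * ogm_theta(N) ** 2)   # L R^2 / (2 θ_N^2), L = R = 1
            fgm_bound = 2.0 / (N + 1) ** 2                # 2 L R^2 / (N + 1)^2
            print(N, fgm_bound / ogm_bound)               # ratio approaches 2 as N grows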


  • 2. OGM Numerical experiment

    Log-Sum-Exp problem

    Minimize a Log-Sum-Exp function:

        f(x) = η log ( ∑_{i=1}^{m} exp( (1/η) (a_i^T x − b_i) ) )

    This approaches max_{i=1,...,m} (a_i^T x − b_i) as η → 0.

    L = (1/η) λ_max(A^T A), where A = [a_1 · · · a_m]^T ∈ R^{m×d}.

    m = 100, d = 20 problem dimension, η = 1, N = 50 iterations

    Algorithms: GM, FGM, OGM

    12 / 26
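  • A sketch of this test problem in NumPy (illustrative data and names), with the gradient
    and the Lipschitz constant stated on the slide:

        import numpy as np

        rng = np.random.default_rng(0)
        m, d, eta = 100, 20, 1.0
        A = rng.standard_normal((m, d))            # rows are a_i^T
        b = rng.standard_normal(m)

        def f(x):
            z = (A @ x - b) / eta
            return eta * (z.max() + np.log(np.sum(np.exp(z - z.max()))))   # stable log-sum-exp

        def grad(x):
            z = (A @ x - b) / eta
            p = np.exp(z - z.max()); p /= p.sum()  # softmax weights
            return A.T @ p

        L = np.linalg.eigvalsh(A.T @ A).max() / eta   # L = (1/η) λ_max(A^T A)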

  • 2. OGM Numerical experiment

    Log-Sum-Exp: Cost function vs Iteration

    [Figure: cost function (f(y_n) − f(x*)) / f(x*) vs. iteration n over N = 50 iterations,
    comparing the convergence speed of GM, FGM, and OGM.]

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart
      FGM with adaptive restart
      OGM with adaptive restart
      Numerical experiment

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary


  • 3. OGM with adaptive restart FGM with adaptive restart

    FGM with adaptive restart [O’donoghue and Candes, FoCM, 2015]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        if the restart condition is satisfied, restart (set t_n = 1)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)  (momentum update)

    Restart condition:

        f(y_{n+1}) > f(y_n)   or   ⟨−∇f(x_n), y_{n+1} − y_n⟩ < 0

    Its practical acceleration is partially explained by a quadratic analysis.

    14 / 26


  • 3. OGM with adaptive restart OGM with adaptive restart

    OGM′ [Kim and Fessler, JOTA, 2017] with adaptive restart [Kim and Fessler, arXiv:1703.04641]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        if the restart condition is satisfied, restart (set t_n = 1)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)
                          + (t_n/t_{n+1}) (y_{n+1} − x_n)        (new momentum update)

    Restart condition:

        f(y_{n+1}) > f(y_n)   or   ⟨−∇f(x_n), y_{n+1} − y_n⟩ < 0

    We extended the quadratic analysis in [O’donoghue and Candes, FoCM, 2015]
    to OGM with restart. (details omitted)

    15 / 26
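  • A NumPy sketch of OGM′ with the function/gradient restart condition above (illustrative;
    a restart simply resets the momentum factor t to 1, as on the slide):

        import numpy as np

        def ogm_prime_restart(f, grad, x0, L, N):
            """OGM' with adaptive restart of the momentum factor."""
            x = y = x0.astype(float)
            t, fy = 1.0, f(x0)
            for _ in range(N):
                g = grad(x)
                y_new = x - g / L                                    # GM update
                fy_new = f(y_new)
                if fy_new > fy or np.dot(-g, y_new - y) < 0:         # restart condition
                    t = 1.0
                t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))     # momentum factor
                x = (y_new + ((t - 1.0) / t_new) * (y_new - y)
                           + (t / t_new) * (y_new - x))              # OGM' momentum update
                y, t, fy = y_new, t_new, fy_new
            return y

        # Usage on the toy quadratic (L = 1):
        A = np.diag([1.0, 0.1]); b = np.array([1.0, 1.0])
        y_N = ogm_prime_restart(lambda x: 0.5 * x @ A @ x - b @ x,
                                lambda x: A @ x - b, np.zeros(2), 1.0, 200)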

  • 3. OGM with adaptive restart Numerical experiment

    Log-Sum-Exp problem

    Minimize a Log-Sum-Exp function:

        f(x) = η log ( ∑_{i=1}^{m} exp( (1/η) (a_i^T x − b_i) ) )

    This approaches max_{i=1,...,m} (a_i^T x − b_i) as η → 0.

    L = (1/η) λ_max(A^T A), where A = [a_1 · · · a_m]^T ∈ R^{m×d}.

    m = 100, d = 20 problem dimension, η = 1, N = 1000 iterations

    Algorithms: GM, FGM, OGM

    Restarting algorithms: “FGM-R”, “OGM-R”

    16 / 26

  • 3. OGM with adaptive restart Numerical experiment

    Log-Sum-Exp: Cost function vs Iteration

    [Figure: cost function (f(y_n) − f(x*)) / f(x*) vs. iteration n over N = 1000 iterations,
    comparing the convergence speed of GM, FGM, OGM and the restarting variants FGM-R, OGM-R.]

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)
      Decreasing the gradient is important
      Worst-case gradient bound analysis of FSFOM
      Optimized FSFOM in terms of gradient norm
      Numerical experiment

    5 Summary

  • 4. OGM-OG Decreasing the gradient is important

    Decreasing the gradient is important

    It is known that the dual gradient norm is related to primal feasibility.
    So, decreasing the gradient norm is also important in a dual approach.

    GM decreases the (dual) gradient norm ||∇f(x_N)|| with rate O(1/N).

    What is the rate of the gradient norm decrease of FGM?

    O(1/N^1.5)!

    Faster algorithm?

    Develop a new fast first-order method by optimizing FSFOM with
    respect to the gradient norm.

    18 / 26

  • 4. OGM-OG Worst-case gradient bound analysis of FSFOM

    Exact worst-case gradient bound analysis of FSFOM

    For given
      step-size coefficients: H = {h_{n,k}},
      number of iterations: N,
      problem dimension: d,
      Lipschitz constant: L,
      maximum distance between an initial point and a solution: ||x_0 − x*|| ≤ R,

    the worst-case convergence bound of the gradient is found by solving:

        B_P′(H, N, d, L, R) := max_{f ∈ F_L(R^d)}  max_{x_0,...,x_N ∈ R^d, x* ∈ X*(f)}  min_{n ∈ {0,...,N}} ||∇f(x_n)||^2          (P′)
            s.t.  x_{n+1} = x_n − (1/L) ∑_{k=0}^{n} h_{n+1,k} ∇f(x_k),  n = 0, . . . , N − 1,
                  ||x_0 − x*|| ≤ R.

    Similar to the relaxation from (P) to (D), we relaxed the problem as

        min_{n ∈ {0,...,N}} ||∇f(x_n)||^2 ≤ B_P′(H, N, d, L, R) ≤ B_D′(H, N, L, R).
        [Kim and Fessler, arXiv:1607.06764]

    19 / 26


  • 4. OGM-OG Worst-case gradient bound analysis of FSFOM

    New gradient bound for FGM

    FGM [Nesterov, Soviet Math. Dokl., 1983]

    Initialize x_0 = y_0, t_0 = 1
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                            (GM update)
        t_{n+1} = (1/2) (1 + √(1 + 4t_n^2))                      (momentum factor)
        x_{n+1} = y_{n+1} + ((t_n − 1)/t_{n+1}) (y_{n+1} − y_n)  (momentum update)

    Theorem 4.1 [Kim and Fessler, arXiv:1607.06764]

    The sequence {x_n} generated by FGM satisfies

        min_{n ∈ {0,...,N}} ||∇f(x_n)|| ≤ L ||x_0 − x*|| / √(∑_{n=0}^{N} t_n^2) ≤ 2√3 L ||x_0 − x*|| / N^1.5.

    The final gradient norm ||∇f(x_N)|| of FGM decreases with rate O(1/N).

    20 / 26

  • 4. OGM-OG Optimized FSFOM in terms of gradient norm

    Optimized FSFOM in terms of gradient norm

    Optimize step coefficients in terms of the gradient norm as
    [Kim and Fessler, arXiv:1607.06764]:

        H*_OG := argmin_H B_D′(H, N, 1, 1).

    The FSFOM with H*_OG has an efficient formulation similar to OGM,
    named OGM-OG (OG for optimized over a gradient).

    21 / 26

  • 4. OGM-OG Optimized FSFOM in terms of gradient norm

    OGM-OG method

    OGM-OG [Kim and Fessler, arXiv:1607.06764]

    Initialize x_0 = y_0, ω_0 = Ω_0 = 1
    For n = 0, 1, . . . , N − 1

        y_{n+1} = x_n − (1/L) ∇f(x_n)                                        (GM update)
        ω_{n+1} = (1/2) (1 + √(1 + 4ω_n^2)),   n = 0, . . . , ⌊N/2⌋ − 2,
                  (N − n)/2,                   n = ⌊N/2⌋ − 1, . . . , N − 1
        Ω_{n+1} = ∑_{l=0}^{n+1} ω_l                                          (new momentum factor)
        x_{n+1} = y_{n+1} + ((Ω_n − ω_n) ω_{n+1} / (ω_n Ω_{n+1})) (y_{n+1} − y_n)
                          + ((2ω_n^2 − Ω_n) ω_{n+1} / (ω_n Ω_{n+1})) (y_{n+1} − x_n)   (new momentum update)

    This reduces to OGM′ when ω_{n+1} = (1/2) (1 + √(1 + 4ω_n^2)) for all n ≥ 0.

    22 / 26
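  • A NumPy sketch of the OGM-OG iteration above, reconstructed from the slide (illustrative;
    the phase switch follows the ⌊N/2⌋ rule, so ω grows FGM-style in the first half and then
    decreases linearly):

        import numpy as np

        def ogm_og(grad, x0, L, N):
            """OGM-OG: OGM-type method with momentum factors tuned for the gradient norm."""
            x = y = x0.astype(float)
            w, W = 1.0, 1.0                         # omega_0 = Omega_0 = 1
            for n in range(N):
                y_new = x - grad(x) / L                                      # GM update
                if n <= N // 2 - 2:
                    w_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * w * w))         # increasing phase
                else:
                    w_new = (N - n) / 2.0                                    # decreasing phase
                W_new = W + w_new                                            # Omega_{n+1}
                x = (y_new + ((W - w) * w_new / (w * W_new)) * (y_new - y)
                           + ((2.0 * w * w - W) * w_new / (w * W_new)) * (y_new - x))
                y, w, W = y_new, w_new, W_new
            return x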

  • 4. OGM-OG Optimized FSFOM in terms of gradient norm

    Gradient bound of OGM-OG

    Theorem 4.2 [Kim and Fessler, arXiv:1607.06764]

    The sequence {x_n} generated by OGM-OG satisfies

        min_{n ∈ {0,...,N}} ||∇f(x_n)|| ≤ L ||x_0 − x*|| / ∑_{n=0}^{N} (Ω_n^2 − ω_n^2) ≤ √6 L ||x_0 − x*|| / N^1.5.

    This is a √2-times smaller bound than FGM’s: to achieve the same gradient norm,
    OGM-OG requires about 1/∛2 times as many iterations as FGM.

    To the best of my knowledge, this is the best known worst-case analytical gradient
    norm bound for FSFOM. (However, it is not optimal, unlike OGM.)

    Other choices of ω_n, such as ω_n = (n + a)/a for any a > 2, also have the rate
    O(1/N^1.5) and, like FGM, do not require selecting N in advance.

    23 / 26


  • 4. OGM-OG Numerical experiment

    Log-Sum-Exp problem

    Minimize a Log-Sum-Exp function:

        f(x) = η log ( ∑_{i=1}^{m} exp( (1/η) (a_i^T x − b_i) ) )

    This approaches max_{i=1,...,m} (a_i^T x − b_i) as η → 0.

    L = (1/η) λ_max(A^T A), where A = [a_1 · · · a_m]^T ∈ R^{m×d}.

    m = 100, d = 20 problem dimension, η = 1, N = 50 iterations

    Algorithms: GM, FGM, OGM, “OGM-OG”

    24 / 26

  • 4. OGM-OG Numerical experiment

    Log-Sum-Exp: Gradient norm vs Iteration

    [Figure: gradient norm ||∇f(y_n)|| vs. iteration n over N = 50 iterations,
    comparing the convergence speed of GM, FGM, OGM, and OGM-OG.]

  • 1 Smooth convex problem and fixed-step first-order methods

    2 Optimized gradient method (OGM) optimized over cost function

    3 OGM with adaptive restart

    4 Optimized gradient method optimized over gradient (OGM-OG)

    5 Summary

  • 5. Summary

    Summary and Future work

    Optimized fixed-step first-order methods (FSFOM) over the cost function and gradient
    norm for smooth convex minimization, leading to OGM and OGM-OG respectively.

    Introduced adaptive restarting (of the momentum) for OGM to improve its rate of
    convergence when the current momentum is not helpful.

    Future work

    Develop FSFOM that is optimal for the gradient norm decrease.

    Extend OGM and OGM-OG to other function classes.

    (1) Taylor, Hendrickx, and Glineur, “Exact worst-case performance of first-order
        algorithms for composite convex optimization,” arXiv:1512.07516.

    (2) Kim and Fessler, “Another look at “Fast Iterative Shrinkage/Thresholding
        Algorithm (FISTA)”,” arXiv:1608.03861.

    Thank you for listening to my talk!

    26 / 26


  • References

    1. Y. Drori, “The exact information-based complexity of smooth convex minimization,”
       J. Complexity, 2017.

    2. Y. Drori and M. Teboulle, “Performance of first-order methods for smooth convex
       minimization: A novel approach,” Mathematical Programming, 2014.

    3. D. Kim and J. A. Fessler, “Another look at “Fast Iterative Shrinkage/Thresholding
       Algorithm (FISTA)”,” arXiv:1608.03861.

    4. D. Kim and J. A. Fessler, “Generalizing the optimized gradient methods for smooth
       convex minimization,” arXiv:1607.06764.

    5. D. Kim and J. A. Fessler, “On the convergence analysis of the optimized gradient
       method,” J. Optim. Theory Appl., 2017.

    6. D. Kim and J. A. Fessler, “Optimized first-order methods for smooth convex
       minimization,” Mathematical Programming, 2016.

    7. D. Kim and J. A. Fessler, “Adaptive restart of the optimized gradient method for
       convex optimization,” arXiv:1703.04641.

    8. Y. Nesterov, “A method for unconstrained convex minimization problem with the rate
       of convergence O(1/k^2),” Dokl. Akad. Nauk. USSR, 1983.

    9. Y. Nesterov, “Introductory lectures on convex optimization: A basic course,” Kluwer, 2004.

    10. B. O’Donoghue and E. Candes, “Adaptive restart for accelerated gradient schemes,”
        Foundations of Computational Mathematics, 2015.

    11. B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,”
        USSR Computational Mathematics and Mathematical Physics, 1964.

    12. A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Exact worst-case performance of
        first-order algorithms for composite convex optimization,” arXiv:1512.07516.

    13. A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Smooth strongly convex interpolation
        and exact worst-case performance of first-order methods,” Mathematical Programming, 2017.

  • Backup

  • Two worst-case functions of OGM

    Theorem B.1 [Kim and Fessler, JOTA, 2017]

    For the following two “worst-case” functions in F_L(R^d) with R = ||x_0 − x*||
    and x* = 0:

        f_1(x; N) = (L R / θ_N^2) ||x|| − L R^2 / (2θ_N^4),   if ||x|| ≥ R / θ_N^2,
                    (L/2) ||x||^2,                            otherwise,          (Huber)

        f_2(x) = (L/2) ||x||^2,                                                   (Quadratic)

    OGM exactly achieves its worst-case convergence bound after N iterations, i.e.,

        f_1(x_N; N) − f_1(x*; N) = f_2(x_N) − f_2(x*) = L ||x_0 − x*||^2 / (2θ_N^2).

    (figure of the two different worst-case behaviors is shown on the next page)
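  • A small numerical illustration (not from the slides) that OGM run on the quadratic
    worst-case function f_2(x) = (L/2)||x||^2 lands exactly on the bound L||x_0 − x*||^2/(2θ_N^2);
    the scalar loop below re-implements the OGM recursion for this 1-D instance:

        import numpy as np

        L, R, N = 1.0, 1.0, 5
        x = y = R                                  # scalar problem: x_0 = y_0 = R, x* = 0
        theta = 1.0
        for n in range(N):
            y_new = x - (L * x) / L                # GM update; gradient of f_2 is L*x
            c = 8.0 if n == N - 1 else 4.0
            theta_new = 0.5 * (1.0 + np.sqrt(1.0 + c * theta * theta))
            x = (y_new + ((theta - 1.0) / theta_new) * (y_new - y)
                       + (theta / theta_new) * (y_new - x))
            y, theta = y_new, theta_new

        gap = 0.5 * L * x * x                      # f_2(x_N) - f_2(x*)
        bound = L * R * R / (2.0 * theta * theta)  # L ||x_0 - x*||^2 / (2 θ_N^2)
        print(gap, bound)                          # equal up to floating-point rounding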

  • Two worst-case functions of OGM (cont’d)

    Example: Two worst-case behaviors of OGM for x_0 = 1, x* = 0, N = 5,
    and d = R = L = 1.

    [Figure: f_1(x; N) (Huber) and f_2(x) (Quadratic) plotted over x ∈ [−1, 1].]

    The worst-case function of the primary sequence {y_i} is of the type f_1(x; i),
    whereas that of the secondary sequence {x_i} is f_2(x). (not shown)
    Interestingly, the final iterate x_N of OGM compromises between the two extremely
    different worst-case behaviors, which is related to the choice of θ_N.

  • A sketch of quadratic analysis of OGM

    Minimize a smooth and strongly convex quadratic function:

        f(x) = (1/2) x^T Q x − p^T x,

    where Q ∈ R^{d×d} is a symmetric positive definite matrix, p ∈ R^d is a vector,
    and x* = Q^{−1} p is the optimum.

    OGM with constant step coefficients [Kim and Fessler, arXiv:1703.04641]

    Initialize x_0 = y_0
    For n = 0, 1, . . .

        y_{n+1} = x_n − (1/L) ∇f(x_n)                                    (GM update)
        x_{n+1} = y_{n+1} + β (y_{n+1} − y_n) + γ (y_{n+1} − x_n)        (new momentum update)

    Use the following relationship for the analysis (details omitted):

        [ y_{n+1} − x* ]   [ (1 + β)(I − (1/L)Q) − γ (1/L)Q    −β (I − (1/L)Q) ] [ y_n − x*     ]
        [ y_n − x*     ] = [ I                                  0              ] [ y_{n−1} − x* ]
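  • A quick numerical check (illustrative) of the block recursion above: run the constant-
    coefficient momentum update on a random quadratic and compare one step of the iteration
    against the 2x2-block system matrix (β and γ below are arbitrary constants):

        import numpy as np

        rng = np.random.default_rng(0)
        d, L = 5, 1.0
        M = rng.standard_normal((d, d))
        Q = M @ M.T / np.linalg.eigvalsh(M @ M.T).max()    # positive definite, eigenvalues in (0, L]
        p = rng.standard_normal(d)
        xstar = np.linalg.solve(Q, p)
        beta, gamma = 0.5, 0.25

        # Block system matrix acting on (y_n - x*, y_{n-1} - x*).
        I = np.eye(d)
        T = np.block([[(1 + beta) * (I - Q / L) - gamma * (Q / L), -beta * (I - Q / L)],
                      [I, np.zeros((d, d))]])

        # One momentum step starting from x_0 = y_0.
        x = y_prev = rng.standard_normal(d)                # x_0 = y_0
        y = x - (Q @ x - p) / L                            # y_1 (GM update)
        x = y + beta * (y - y_prev) + gamma * (y - x)      # x_1 (momentum update)
        y_next = x - (Q @ x - p) / L                       # y_2

        lhs = np.concatenate([y_next - xstar, y - xstar])
        rhs = T @ np.concatenate([y - xstar, y_prev - xstar])
        print(np.max(np.abs(lhs - rhs)))                   # ~ 1e-15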


