
A GENERALIZED ACCELERATED COMPOSITE GRADIENT METHOD: UNITING NESTEROV'S FAST GRADIENT METHOD

AND FISTA

MIHAI I. FLOREA∗ AND SERGIY A. VOROBYOV∗

Abstract. Numerous problems in signal processing, statistical inference, computer vision, and machine learning can be cast as large-scale convex optimization problems. Due to their size, many of these problems can only be addressed by first-order accelerated black-box methods. The most popular among these are the Fast Gradient Method (FGM) and the Fast Iterative Shrinkage Thresholding Algorithm (FISTA). FGM requires that the objective be finite and differentiable with known gradient Lipschitz constant. FISTA is applicable to the broader class of composite objectives and is equipped with a line-search procedure for estimating the Lipschitz constant. Nonetheless, FISTA cannot increase the step size and is unable to take advantage of strong convexity. FGM and FISTA are very similar in form. Despite this, they appear to have vastly differing convergence analyses. In this work we generalize the previously introduced augmented estimate sequence framework as well as the related notion of the gap sequence. We showcase the flexibility of our tools by constructing a Generalized Accelerated Composite Gradient Method that unites FGM and FISTA, along with their most popular variants. The Lyapunov property of the generalized gap sequence used in deriving our method implies that both FGM and FISTA are amenable to a Lyapunov analysis, common among optimization algorithms. We further showcase the flexibility of our tools by endowing our method with monotonically decreasing objective function values alongside a versatile line-search procedure. By simultaneously incorporating the strengths of FGM and FISTA, our method is able to surpass both in terms of robustness and usability. We support our findings with simulation results on an extensive benchmark of composite problems. Our experiments show that monotonicity has a stabilizing effect on convergence and challenge the notion present in the literature that, for strongly convex objectives, accelerated proximal schemes can be reduced to fixed momentum methods.

Key words. estimate sequence, Nesterov method, fast gradient method, FISTA, monotone, line-search, composite objective, large-scale optimization

AMS subject classifications. 90C06, 68Q25, 90C25

1. Introduction. Numerous large-scale convex optimization problems have recently emerged in a variety of fields, including signal and image processing, statistical inference, computer vision, and machine learning. Often, little is known about the actual structure of the objective function. Therefore, optimization algorithms used in solving such problems can only rely (e.g., by means of callback functions) on specific black-box methods, called oracle functions [1]. The term "large-scale" refers to the tractability of certain computational primitives (see also [2]). In the black-box setting, it means that the oracle functions of large-scale problems usually only include scalar functions and operations that resemble first-order derivatives.

Many large-scale applications were rendered practical to address with the advent of Nesterov's Fast Gradient Method (FGM) [3]. The canonical form of FGM, formulated in [4], requires that the objective be differentiable with Lipschitz gradient. Many optimization problems, particularly inverse problems in fields such as sparse signal processing, linear algebra, matrix and tensor completion, and digital imaging (see [5-9] and references therein), have a composite structure. In these composite problems, the objective F is the sum of a function f with Lipschitz gradient (Lipschitz constant Lf) and a simple but possibly non-differentiable regularizer Ψ. The regularizer Ψ embeds constraints by being infinite outside the feasible set. Often, Lf is not known in advance. The composite problem oracle functions are the scalar f(x)

∗Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland (mihai.florea@aalto.fi, sergiy.vorobyov@aalto.fi).

arXiv:1705.10266v2 [math.OC] 28 Oct 2019


and Ψ(x), as well as the gradient ∇f(x) and the proximal operator proxτΨ(x).

To address composite problems, Nesterov has devised an Accelerated Multistep Gradient Scheme (AMGS) [10]. This method updates a Lipschitz constant estimate (LCE) at every iteration using a subprocess commonly referred to in the literature as "line-search" [8, 11]. However, the generation of a new iterate (advancement phase of an iteration) and line-search are interdependent and cannot be executed in parallel. Moreover, AMGS utilizes only the gradient-type oracle functions ∇f(x) and proxτΨ(x). In many applications, including compressed sensing (e.g., LASSO [12]) and many classification tasks (e.g., l1-regularized logistic regression), the evaluation of ∇f(x) is more computationally expensive than f(x). An alternative to AMGS that uses f(x) calls in line-search has been proposed by Beck and Teboulle in the form of the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [11].¹ FISTA also benefits from having line-search decoupled from advancement. However, FISTA is unable to decrease the LCE at run-time and does not take into account strong convexity.

A strongly convex extension of FISTA, which we designate as FISTA Chambolle-Pock (FISTA-CP), has been recently introduced in [7], but without line-search. Two works [13, 14] propose a similar method that accommodates only feasible start, assumes µΨ = 0, and does not feature line-search nor monotonically decreasing objective function values (monotonicity). In the context of the composite problem class, this method is merely a restricted version of FISTA-CP. Consequently, we also treat it as FISTA-CP. Another study [15] proposes a variant of FISTA-CP equipped with a line-search procedure capable of decreasing the LCE (a fully adaptive line-search strategy). However, this strongly convex Accelerated Proximal Gradient (scAPG) method is guaranteed to converge only if f is strongly convex and if the starting point is feasible.

Apart from the above methods, which are applicable to a wide range of problems and feature convergence guarantees, the literature abounds with algorithms that either employ heuristics or are highly specialized. For instance, restart techniques have been developed to address strong convexity. However, the adaptive restart proposed in [16] lacks rigorous convergence guarantees for non-quadratic objectives and the periodic restart introduced in [3] requires prior knowledge of the Lipschitz constant. A very interesting study [17] proposes an over-relaxed variant of FISTA, a first-order method that appears to define a distinct category of algorithms. However, the study fails to argue whether the convergence guarantees are superior to those of FISTA. The method was tested only on instances of the LASSO problem [12], yet even in this restricted scenario the performance benefit is unclear. Numerous other methods successfully exploit additional problem structure, such as sparsity of optimal points (e.g., [18-21]); however, they cannot be applied to the entire composite problem class.

FISTA (when the objective is non-strongly convex), FISTA-CP, and scAPG (when f is strongly convex) are identical in form to certain variants of FGM. Therefore, FISTA, FISTA-CP, and scAPG can be viewed as extensions of FGM for composite objectives. However, whereas FGM was derived using the estimate sequence [4], FISTA, FISTA-CP (including the variant found in [13, 14]), and scAPG were proposed without derivation. No details have been provided on how the update rules of each method were constructed, apart from their obvious validation in the corresponding

¹FISTA has very similar functionality and convergence guarantees to those of the first variant of FGM introduced in [3]. For clarity, we denote the latter method also as FISTA, whereas we use FGM to designate the canonical form developed in [4].


convergence analysis. Consequently, new features of FGM cannot be directly incorporated into FISTA, FISTA-CP, and scAPG. For instance, Nesterov has proposed in [22] a line-search variant of FGM with an implicit derivation based on the estimate sequence, albeit only for non-strongly convex objectives. Had FISTA been derived in this way, a fully adaptive line-search variant of FISTA could simply be obtained from [22]. Instead, a sophisticated fully adaptive line-search extension was proposed in [23], with a technical derivation based on the mathematical constructs of [11]. On the other hand, through partial adoption of the estimate sequence, i.e., relating FISTA to the "constant step scheme I" variant of FGM in [4] and AMGS, we have arrived at a similar but simpler fully adaptive line-search scheme for FISTA [24], but again only in the non-strongly convex scenario.

In [25], we have introduced the augmented estimate sequence framework and used it to derive the Accelerated Composite Gradient Method (ACGM), which incorporates by design a fully adaptive line-search procedure. ACGM has the convergence guarantees of FGM, the best among primal first-order methods, while being as broadly applicable as AMGS. In addition, FISTA-CP and FISTA, along with the fully adaptive line-search extensions in [23] and [24], are particular cases of ACGM [25]. However, to accommodate infeasible start, we have imposed restrictions on the input parameters. Algorithms such as the "constant step scheme III" variant of FGM [4] and scAPG [15] are guaranteed to converge only when the starting point is feasible, and thus do not correspond to any instance of ACGM in [25].

1.1. Contributions.
• In this paper, by lifting the restrictions on the input parameters imposed by infeasible start, we generalize the augmented estimate sequence framework and derive a generalization of ACGM that encompasses FGM and FISTA, along with their variants.
• We further showcase the flexibility and power of the augmented estimate sequence framework by endowing ACGM with monotonicity alongside its adaptive line-search procedure. Monotonicity is a desirable property, particularly when dealing with proximal operators that lack a closed form expression or other kinds of inexact oracles [7, 26]. Even when dealing with exact oracles, monotonicity leads to a more stable and predictable convergence rate.
• Our generalized ACGM, with its additional features, including the fully adaptive estimation of the Lipschitz constant, is therefore superior to FGM and FISTA in terms of flexibility and usability.
• We further generalize the definition of the wall-clock time unit (previously introduced in [25]) to account for arbitrary oracle function costs. We analyze the per-iteration complexity of our method in this generalized context, taking advantage of parallelization when possible.
• We support our theoretical findings with simulation results on an extended benchmark of signal processing and machine learning problems.

1.2. Assumptions and notation. We consider composite optimization problems of the form

(1)    \min_{x ∈ Rn} F(x) := f(x) + Ψ(x),

where x is a vector of n optimization variables (bold type indicates a vector in Rn), and F is the objective function. The constituents of the objective F are the convex differentiable function f : Rn → R and the convex lower semicontinuous regularizer


function Ψ : Rn → R ∪ {+∞}. Function f has a Lipschitz gradient (Lipschitz constant Lf > 0) and a strong convexity parameter µf ≥ 0. Note that the existence of a strong convexity parameter does not imply the strong convexity property since any convex function has a strong convexity parameter of zero. Regularizer Ψ has a strong convexity parameter µΨ ≥ 0, entailing that objective F has a strong convexity parameter µ = µf + µΨ. Constraints are enforced by making Ψ infinite outside the feasible set, which is closed and convex.

Apart from the above properties, nothing is assumed known about functions f and Ψ, which can only be accessed in a black-box fashion [1] by querying oracle functions f(x), ∇f(x), Ψ(x), and proxτΨ(x), with arguments x ∈ Rn and τ > 0. The proximal operator proxτΨ(x) is given by

(2)    prox_{τΨ}(x) := \arg\min_{z ∈ Rn} \Big( Ψ(z) + \frac{1}{2τ}‖z − x‖_2^2 \Big),

for all x ∈ Rn and τ > 0, with ‖·‖_2 designating the standard Euclidean norm in Rn.

Central to our derivation are generalized parabolae, quadratic functions whose Hessians are multiples of the identity matrix. We refer to the strongly convex ones simply as parabolae. Any parabola ψ : Rn → R can be written in canonical form as

(3)    ψ(x) := ψ^* + \frac{γ}{2}‖x − v‖_2^2,  x ∈ Rn,

where γ > 0 denotes the curvature, v is the vertex, and ψ^* is the optimum value.

For conciseness, we introduce the generalized parabola expression Q_{λ,y}(x) for all x, y ∈ Rn and λ ≥ 0 as

(4)    Q_{λ,y}(x) := f(y) + ⟨∇f(y), x − y⟩ + \frac{λ}{2}‖x − y‖_2^2.

The proximal gradient operator T_L(y) [10] can be expressed succinctly using (4) as

(5)    T_L(y) := \arg\min_{x ∈ Rn} \big( Q_{L,y}(x) + Ψ(x) \big),  y ∈ Rn,

where L > 0 is a parameter corresponding to the inverse of the step size. Operator T_L(y) can be evaluated in terms of oracle functions as

(6)    T_L(y) = prox_{(1/L)Ψ}\Big( y − \frac{1}{L}∇f(y) \Big),  y ∈ Rn, L > 0.
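To make the oracle model concrete, the following Python/NumPy sketch evaluates (2) and (6) for the l1 regularizer used later in the benchmark. The helper names, the toy matrix A, and the chosen step size are illustrative assumptions, not part of the paper.

```python
import numpy as np

def prox_l1(x, tau):
    # Proximal operator (2) for the example regularizer Psi(x) = ||x||_1:
    # elementwise soft-thresholding with threshold tau.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient_step(y, L, grad_f, prox_psi):
    # Proximal gradient operator T_L(y) in (6): a prox step of size 1/L
    # applied to the gradient point y - (1/L) * grad_f(y).
    return prox_psi(y - grad_f(y) / L, 1.0 / L)

# Example: one evaluation of T_L for f(x) = 0.5 * ||A x - b||^2, Psi = lam * ||x||_1
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, -1.0])
lam = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_psi = lambda x, tau: prox_l1(x, lam * tau)
z = proximal_gradient_step(np.zeros(2), L=30.0, grad_f=grad_f, prox_psi=prox_psi)
```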

2. Generalizing ACGM. In our previous formulation of the augmented estimate sequence framework and subsequent derivation of the Accelerated Composite Gradient Method (ACGM) in [25], we have chosen to accommodate infeasible start by default. In the sequel, we argue why this assumption was necessary and provide a generalized framework, wherein this assumption is only a particular case.

2.1. FGM estimate sequence. We begin the generalization of the estimate sequence by noting that a rate of convergence can only be obtained on function values and not iterates. It may be argued that, when the objective function is strongly convex, many first-order schemes, including the non-accelerated fixed-point methods, guarantee the linear convergence of the iterates to the optimal point. However, when the problem is non-strongly convex, the optimization landscape may contain a high-dimensional subspace of very low curvature in the vicinity of the set of optimal points.


In this case, a convergence rate of iterates remains a difficult open problem [27]. For instance, Nesterov has provided in [4] an ill-conditioned quadratic problem where a convergence rate for iterates cannot be formulated for all first-order schemes of a certain structure.

Hence, we choose to measure convergence using the image space distance (ISD), which is the distance between the objective values at iterates and the optimal value. The decrease rate of an upper bound on the ISD gives the convergence guarantee (provable convergence rate). The generalization of the estimate sequence framework follows naturally from the formulation of such guarantees. Specifically, we interpret the image space distance upper bound (ISDUB), provided by Nesterov for FGM in [4], for all k ≥ 0 as

(7)    A_k(F(x_k) − F(x^*)) ≤ A_0(F(x_0) − F(x^*)) + \frac{γ_0}{2}‖x_0 − x^*‖_2^2.

Point x^* can be any optimal point. However, we consider it fixed throughout this work. The convergence guarantee is given by the sequence {A_k}_{k≥0} with A_0 ≥ 0. Each iteration may not be able to improve the objective function value, but at least it must improve the convergence guarantee, meaning that {A_k}_{k≥0} must be an increasing sequence. From A_0 ≥ 0 it follows that A_k > 0 for all k ≥ 1.

The right-hand side of (7) is a weighted sum between the initial ISD and the corresponding domain space term (DST), with weights given by A_0 and γ_0, respectively. In the derivation of FGM, the weights are constrained as A_0 > 0 and γ_0 ≥ A_0µ whereas in the original ACGM [25] we have enforced A_0 = 0 and γ_0 ≤ 1. Given that our current aim is to provide a generic framework, we impose no restrictions on the weights, apart from A_0 ≥ 0 and γ_0 > 0. The former restriction follows from the convexity of F while the latter is required by the estimate sequence, along with its augmented variant, as we shall demonstrate in the sequel.

The ISDUB expression can be rearranged to take the form

(8)    A_k F(x_k) ≤ H_k,

where

(9)    H_k := (A_k − A_0)F(x^*) + A_0 F(x_0) + \frac{γ_0}{2}‖x_0 − x^*‖_2^2

is the highest upper bound that can be placed on the weighted objective values A_k F(x_k) to satisfy (7).

The value of H_k depends on the optimal value F(x^*), which is an unknown quantity. The estimate sequence provides a computable, albeit more stringent, replacement for H_k. It is obtained as follows. The convexity of the objective implies the existence of a sequence {W_k}_{k≥1} of convex global lower bounds on F, written as

(10)    F(x) ≥ W_k(x),  x ∈ Rn, k ≥ 1.

For instance, when querying ∇f(x) at a test point y_k, we obtain a global lower bound on f as Q_{µf,y_k}(x) for all x ∈ Rn. Consequently, Q_{µf,y_k} + Ψ is a global lower bound of F. The lower bounds obtained from test points up to and including y_k can be combined to produce W_k. The drawback of incorporating Ψ in the lower bounds lies in the introduction of additional proxτΨ(x) calls, as in [10]. In Subsection 2.3 we will formulate more computationally efficient bounds. Notwithstanding, the framework we introduce here does not assume any structure on W_k apart from convexity.


By substituting the optimal value term F(x^*) in (9) with W_k(x^*), we obtain \underline{H}_k, a lower bound on H_k, given by

(11)    \underline{H}_k := (A_k − A_0)W_k(x^*) + A_0 F(x_0) + \frac{γ_0}{2}‖x^* − x_0‖_2^2,

for all k ≥ 0. This still depends on x^*. However, \underline{H}_k can be viewed as the value of an estimate function, taken at an optimal point x^*. The estimate functions ψ_k are defined as functional extensions of \underline{H}_k, namely

(12)    ψ_k(x) := (A_k − A_0)W_k(x) + A_0 F(x_0) + \frac{γ_0}{2}‖x − x_0‖_2^2,

for all x ∈ Rn and k ≥ 0. Note that the first estimate function ψ_0 does not contain a lower bound term. Therefore, it is not necessary to define W_0. The collection of estimate functions {ψ_k}_{k≥0} is referred to as the estimate sequence.

The estimate function optimum value, given by

(13)    ψ_k^* := \min_{x ∈ Rn} ψ_k(x),  k ≥ 0,

is guaranteed to be lower than \underline{H}_k, since

(14)    ψ_k^* = \min_{x ∈ Rn} ψ_k(x) ≤ ψ_k(x^*) = \underline{H}_k,  k ≥ 0.

As such, ψ_k^* provides the sought after computable replacement of H_k. Note that if the lower bounds W_k are linear, the estimate functions are generalized parabolae with the curvature given by γ_0. In this case, the existence of ψ_k^* is conditioned by γ_0 > 0, explaining the assumption made in (7).

Thus, it suffices to maintain the estimate sequence property, given by

(15)    A_k F(x_k) ≤ ψ_k^*,  k ≥ 0,

to satisfy the ISDUB expression (7). The proof follows from the above definitions as

(16)    A_k F(x_k) \overset{(15)}{≤} ψ_k^* \overset{(14)}{≤} ψ_k(x^*) = \underline{H}_k \overset{(10)}{≤} H_k,  k ≥ 0.

2.2. Generalizing the augmented estimate sequence. The interval between the maintained upper bound ψ_k^* and the highest allowable bound H_k contains \underline{H}_k. This allows us to produce a relaxation of the estimate sequence by forcibly closing the gap between \underline{H}_k and H_k. Namely, we define the augmented estimate functions as

(17)    ψ'_k(x) := ψ_k(x) + H_k − \underline{H}_k,  k ≥ 0,

with {ψ'_k(x)}_{k≥0} being the augmented estimate sequence. We expand definition (17) as

(18)    ψ'_k(x) = ψ_k(x) + (A_k − A_0)(F(x^*) − W_k(x^*)),  k ≥ 0.

The augmented estimate sequence property is formulated as

(19)    A_k F(x_k) ≤ ψ'^*_k,  k ≥ 0,

where the augmented estimate function optimum value ψ'^*_k is given by

(20)    ψ'^*_k := \min_{x ∈ Rn} ψ'_k(x),  k ≥ 0.


2.3. Quadratic lower bounds. As we have seen in Subsection 2.1, the structure of the estimate sequence and, subsequently, the augmented estimate sequence, depends on the global lower bounds W_k. These can be obtained by combining simple lower bounds, each produced at iteration k using test point y_k and denoted w_k. In the following, we propose simple lower bounds w_k that can be obtained in a computationally efficient manner, each requiring only one call to T_L(x).

Lemma 1. If at every iteration k ≥ 0, the auxiliary points and LCEs obey the descent condition, stated as

(21)    f(z_{k+1}) ≤ Q_{L_{k+1},y_{k+1}}(z_{k+1}),  k ≥ 0,

where

(22)    z_{k+1} := T_{L_{k+1}}(y_{k+1}),

then the objective is lower bounded for all x ∈ Rn as

(23)    F(x) ≥ w_{k+1}(x) := F(z_{k+1}) + \frac{L_{k+1} + µ_Ψ}{2}‖z_{k+1} − y_{k+1}‖_2^2 + (L_{k+1} + µ_Ψ)⟨y_{k+1} − z_{k+1}, x − y_{k+1}⟩ + \frac{µ}{2}‖x − y_{k+1}‖_2^2.

Proof. By expanding (22) we obtain

(24)    z_{k+1} = \arg\min_{x ∈ Rn} \Big( f(y_{k+1}) + ⟨∇f(y_{k+1}), x − y_{k+1}⟩ + \frac{L_{k+1}}{2}‖x − y_{k+1}‖_2^2 + Ψ(x) \Big).

This subproblem is unconstrained and has a strongly convex and subdifferentiable objective. The first-order optimality condition is therefore equivalent to

(25)    ξ = L_{k+1}(y_{k+1} − z_{k+1}) − ∇f(y_{k+1}),

where ξ is a specific subgradient of function Ψ at point z_{k+1}.

Moreover, from the definition of the strong convexity parameters of f and Ψ we have that

(26)    f(x) ≥ f(y_{k+1}) + ⟨∇f(y_{k+1}), x − y_{k+1}⟩ + \frac{µ_f}{2}‖x − y_{k+1}‖_2^2,
(27)    Ψ(x) ≥ Ψ(z_{k+1}) + ⟨ξ, x − z_{k+1}⟩ + \frac{µ_Ψ}{2}‖x − z_{k+1}‖_2^2.

By adding together (21), (26), and (27), canceling and rearranging terms we obtain

(28)    F(x) ≥ F(z_{k+1}) + ⟨∇f(y_{k+1}) + ξ, x − z_{k+1}⟩ + \frac{µ_f}{2}‖x − y_{k+1}‖_2^2 + \frac{µ_Ψ}{2}‖x − z_{k+1}‖_2^2 − \frac{L_{k+1}}{2}‖y_{k+1} − z_{k+1}‖_2^2.

Further applying (25) to (28) and rearranging terms gives the desired result.
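A Python sketch of the quantities used above, under the paper's oracle model: the quadratic model Q_{L,y} from (4), the descent condition (21), and the lower bound w_{k+1} from (23). Function names are illustrative and NumPy is assumed.

```python
import numpy as np

def Q(L, y, x, f_y, grad_f_y):
    # Generalized parabola Q_{L,y}(x) from (4), built from oracle values at y.
    return f_y + grad_f_y @ (x - y) + 0.5 * L * np.sum((x - y) ** 2)

def descent_condition_holds(L, y, z, f, grad_f_y):
    # Line-search stopping criterion (21): f(z) <= Q_{L,y}(z).
    return f(z) <= Q(L, y, z, f(y), grad_f_y)

def lower_bound_w(x, y, z, L, mu, mu_psi, F_z):
    # Simple lower bound w_{k+1}(x) on F from Lemma 1, eq. (23), valid whenever
    # the descent condition (21) holds for the triplet (y, z, L).
    r = y - z
    return (F_z
            + 0.5 * (L + mu_psi) * np.sum(r ** 2)
            + (L + mu_psi) * (r @ (x - y))
            + 0.5 * mu * np.sum((x - y) ** 2))
```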

Once we have obtained the simple lower bounds, we need to determine a means of combining them. We propose the convex combination given by

(29)    W_{k+1}(x) = \Big( \sum_{i=1}^{k+1} a_i \Big)^{-1} \Big( \sum_{i=1}^{k+1} a_i w_i(x) \Big),  k ≥ 0,


where

(30)    A_{k+1} = A_k + a_{k+1},  k ≥ 0.

We have assumed that {A_k}_{k≥0} is increasing, which implies that the weights are strictly positive, namely

(31)    a_{k+1} > 0,  k ≥ 0.

From definition (12), we have that the initial estimate function is a parabola, given by

(32)    ψ_0(x) = A_0 F(x_0) + \frac{γ_0}{2}‖x − x_0‖_2^2,  x ∈ Rn.

Applying (29) and (30) to (12) we get a simple recursion rule in the form of

(33)    ψ_{k+1}(x) = ψ_k(x) + a_{k+1}w_{k+1}(x),  x ∈ Rn, k ≥ 0.

Lemma 2. When the simple lower bounds are given by Lemma 1 and are combined as in (29), then the estimate functions and the augmented estimate functions at every iteration are parabolic, and thus can be written in canonical form for all x ∈ Rn and k ≥ 0 as

(34)    ψ_k(x) = ψ_k^* + \frac{γ_k}{2}‖x − v_k‖_2^2,
(35)    ψ'_k(x) = ψ'^*_k + \frac{γ_k}{2}‖x − v_k‖_2^2,

with ψ_k^* and ψ'^*_k given by (13) and (20), respectively, and

(36)    v_k := \arg\min_{x ∈ Rn} ψ_k(x).

Moreover, the curvatures and vertices obey the following recursion rules

(37)    γ_{k+1} = γ_k + a_{k+1}µ,
(38)    v_{k+1} = \frac{1}{γ_{k+1}} \big( γ_k v_k + a_{k+1}(L_{k+1} + µ_Ψ)z_{k+1} − a_{k+1}(L_{k+1} − µ_f)y_{k+1} \big).

Proof. The Hessian of ψ_0(x) is given by γ_0 I_n (where I_n is the identity matrix of size n) with γ_0 > 0, while the Hessian of w_{k+1}(x) is µI_n. From the recursion rule in (33), it follows by simple induction that all estimate functions are parabolae, and as such, can be written as in (34) with (13) and (36). The relation in (18) means that the augmented estimate functions differ from the estimate functions by a constant term. It follows that (35) with (20) holds.

Expanding the recursion rule in (33) using (34) and taking the gradient we obtain

(39)    γ_{k+1}(x − v_{k+1}) = γ_k(x − v_k) + a_{k+1}(L_{k+1} + µ_Ψ)(y_{k+1} − z_{k+1}) + a_{k+1}µ(x − y_{k+1}),  x ∈ Rn.

Matching the first-order coefficients and constant terms on both sides of (39) gives (37) and (38), respectively.

By iterating (37), we obtain a closed form expression of the estimate function curvature as

(40)    γ_k = γ_0 + \Big( \sum_{i=1}^{k} a_i \Big)µ = γ_0 − A_0µ + A_kµ,  k ≥ 0.
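The recursions (37) and (38), together with the closed form (40), involve only scalars and one vector combination, as the following Python sketch illustrates (names are illustrative, not from the paper).

```python
def update_estimate_parabola(gamma_k, v_k, a_next, L_next, mu, mu_f, mu_psi,
                             y_next, z_next):
    # Curvature and vertex recursions (37)-(38) for the estimate-function parabola.
    # Iterating the curvature recursion reproduces the closed form (40).
    gamma_next = gamma_k + a_next * mu
    v_next = (gamma_k * v_k
              + a_next * (L_next + mu_psi) * z_next
              - a_next * (L_next - mu_f) * y_next) / gamma_next
    return gamma_next, v_next
```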


2.4. Generalizing the gap sequence. We introduce the augmented estimate sequence gaps {Γ_k}_{k≥0} and the gap sequence {∆_k}_{k≥0}, respectively, as

(41)    Γ_k := A_k F(x_k) − ψ'^*_k,  k ≥ 0,
(42)    ∆_k := A_k(F(x_k) − F(x^*)) + \frac{γ_k}{2}‖v_k − x^*‖_2^2,  k ≥ 0.

Proposition 3. When the estimate functions are parabolic, as in Lemma 2, the augmented estimate sequence gaps can be expressed more succinctly as

(43) Γk = ∆k −∆0, k ≥ 0.

Proof. From (18), using (34) and (35) in Lemma 2, it follows that

(44) ψ′∗k = ψ∗k + (Ak −A0)(F (x∗)−Wk(x∗)).

Therefore, the gap Γk can be expressed as

Γ_k \overset{(44)}{=} A_k(F(x_k) − F(x^*)) + (A_k − A_0)W_k(x^*) + A_0 F(x^*) − ψ_k^*
    \overset{(34)}{=} A_k(F(x_k) − F(x^*)) + \frac{γ_k}{2}‖x^* − v_k‖_2^2 − ψ_k(x^*) + A_0 F(x^*) + (A_k − A_0)W_k(x^*)
    \overset{(12)}{=} A_k(F(x_k) − F(x^*)) + \frac{γ_k}{2}‖v_k − x^*‖_2^2 − A_0(F(x_0) − F(x^*)) − \frac{γ_0}{2}‖x_0 − x^*‖_2^2
    \overset{(42)}{=} ∆_k − ∆_0,  k ≥ 0.

Note that (18), (32), and Lemma 2 imply that ψ'^*_0 = ψ'_0(x_0) = ψ_0(x_0) = A_0 F(x_0), which, together with (41), guarantees that the initial augmented estimate sequence gap Γ_0 is zero, regardless of the structure of the lower bounds. Therefore, a sufficient condition for the preservation of the augmented estimate sequence property (19) as the algorithm progresses is that the augmented estimate sequence gaps do not increase. Proposition 3 states that when the lower bounds are generalized parabolae, the sufficient condition for (19) becomes

(45) ∆k+1 ≤ ∆k, k ≥ 0.

Therefore, (19) and the subsequent convergence of the algorithm are implied by the Lyapunov (non-increasing) property of the gap sequence. Such Lyapunov functions have been widely employed in the theoretical analysis of optimization schemes (e.g., [28-30]). Moreover, the simple structure and generic nature of the gap sequence further justifies the augmentation of the estimate sequence.

2.5. A design pattern for first-order accelerated methods. The augmented estimate sequence property places an upper bound on F(x_{k+1}), which must be satisfied before the generation of x_{k+1}. The black-box nature of the objective constitutes an obstacle in this respect and it is therefore advantageous to employ a surrogate upper bound on the objective. Numerous algorithms employ this technique, and are often referred to as majorization minimization (MM) algorithms (see, e.g., [31, 32] and references therein). New iterates are generated as

(46)    x_{k+1} = \arg\min_{x ∈ Rn} u_{k+1}(x),  k ≥ 0,


where uk+1 is a local upper bound on the objective at the new iterate, namely

(47) F (xk+1) ≤ uk+1(xk+1), k ≥ 0.

Combining (30), (33), and (46) yields the algorithm design pattern outlined in Pattern 1. Note that Nesterov's FGM and AMGS adhere to Pattern 1. This pattern will form the scaffolding of our generalized ACGM.

Pattern 1 takes as input the starting point x_0, the curvature γ_0, the initial guarantee A_0 and, if the Lipschitz constant is not known in advance, an initial LCE L_0 > 0. In line 7 of Pattern 1, the main iterate is given by the minimum of the local upper bound u_{k+1}. Alongside the main iterate, the algorithm maintains an estimate function ψ_{k+1}, obtained from the previous one by adding a simple global lower bound w_{k+1} weighted by a_{k+1} > 0 (line 8 of Pattern 1). Function w_{k+1} need not be given by Lemma 1, but we assume that it is uniquely determined by an auxiliary point y_{k+1}. The current LCE L_{k+1}, weight a_{k+1}, and auxiliary point y_{k+1} are computed using algorithm specific methods S, F_a, and F_y, respectively (lines 3, 4, and 5 of Pattern 1). These methods take as parameters the state of the algorithm, given by current values of the main iterate, LCE, weight, and estimate function.

Pattern 1 A design pattern for first-order accelerated methods

1: ψ_0(x) = A_0 F(x_0) + \frac{γ_0}{2}‖x − x_0‖_2^2
2: for k = 0, . . . , K − 1 do
3:   L_{k+1} = S(x_k, ψ_k, A_k, L_k)   ("line-search")
4:   a_{k+1} = F_a(ψ_k, A_k, L_{k+1})
5:   y_{k+1} = F_y(x_k, ψ_k, A_k, a_{k+1})
6:   A_{k+1} = A_k + a_{k+1}
7:   x_{k+1} = \arg\min_{x ∈ Rn} u_{k+1}(x)
8:   ψ_{k+1}(x) = ψ_k(x) + a_{k+1}w_{k+1}(x)
9: end for
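Pattern 1 can be mirrored almost line for line in code. Below is a Python skeleton in which the algorithm-specific methods S, F_a, F_y, the upper-bound minimization of line 7, and the estimate-function update of line 8 are supplied as callbacks; the placeholder estimate-function state and all names are illustrative assumptions.

```python
def accelerated_pattern(x0, A0, gamma0, L0, K, search, weight, auxiliary,
                        minimize_upper_bound, add_lower_bound):
    # Skeleton of Pattern 1. The five callbacks stand in for the algorithm-specific
    # choices: S (line-search), F_a (weight), F_y (auxiliary point), the upper-bound
    # minimization of line 7, and the estimate-function update of line 8.
    x, L, A = x0, L0, A0
    psi = ("parabola", gamma0, x0, A0)          # placeholder estimate-function state
    for k in range(K):
        L = search(x, psi, A, L)                # line 3: L_{k+1} = S(...)
        a = weight(psi, A, L)                   # line 4: a_{k+1} = F_a(...)
        y = auxiliary(x, psi, A, a)             # line 5: y_{k+1} = F_y(...)
        A = A + a                               # line 6
        x = minimize_upper_bound(x, y, L)       # line 7: x_{k+1} = argmin u_{k+1}
        psi = add_lower_bound(psi, a, y, L)     # line 8: psi_{k+1} = psi_k + a w_{k+1}
    return x
```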

2.6. Upper bounds. Interestingly, all of the above results do not rely on a specific form of the local upper bound u_{k+1}, as long as assumption (21) holds for all k ≥ 0. The enforced descent condition (21) can be equivalently expressed in terms of composite objective values (by adding Ψ(z_{k+1}) to both sides) as

(48)    F(z_{k+1}) ≤ Q_{L_{k+1},y_{k+1}}(z_{k+1}) + Ψ(z_{k+1}),  k ≥ 0.

We want our algorithm to converge as fast as possible while maintaining the monotonicity property, expressed as

(49)    F(x_{k+1}) ≤ F(x_k),  k ≥ 0.

Then, without further knowledge of the objective function, (48) and (49) suggest a simple expression of the upper bound for all x ∈ Rn and k ≥ 0 in the form of

(50)    u_{k+1}(x) = \min\{ Q_{L_{k+1},y_{k+1}}(x) + Ψ(x), \; F(x_k) + σ_{\{x_k\}}(x) \},

where σ_X is the indicator function [33] of set X, given by

(51)    σ_X(x) = 0 if x ∈ X,  +∞ otherwise.


2.7. Towards an algorithm. With the lower and upper bounds fully defined, we select functions S, F_a, and F_y so as to preserve the Lyapunov property of the gap sequence (45).

Theorem 4. If at iteration k ≥ 0 we have

    F(x_{k+1}) ≤ F(z_{k+1}) ≤ Q_{L_{k+1},y_{k+1}}(z_{k+1}) + Ψ(z_{k+1}),

then

    ∆_{k+1} + \mathcal{A}_{k+1} + \mathcal{B}_{k+1} ≤ ∆_k,

where subexpressions \mathcal{A}_{k+1}, \mathcal{B}_{k+1}, g_{k+1}, s_{k+1}, and Y_{k+1} are, respectively, defined as

    \mathcal{A}_{k+1} := \frac{1}{2}\Big( \frac{A_{k+1}}{L_{k+1} + µ_Ψ} − \frac{a_{k+1}^2}{γ_{k+1}} \Big)‖g_{k+1}‖_2^2,
    \mathcal{B}_{k+1} := \frac{1}{γ_{k+1}}\Big\langle g_{k+1} − \frac{µ}{2Y_{k+1}}s_{k+1}, \; s_{k+1} \Big\rangle,
    g_{k+1} := (L_{k+1} + µ_Ψ)(y_{k+1} − z_{k+1}),
    s_{k+1} := A_kγ_{k+1}x_k + a_{k+1}γ_kv_k − Y_{k+1}y_{k+1},
    Y_{k+1} := A_kγ_{k+1} + a_{k+1}γ_k.

Proof. We assume k ≥ 0 throughout this proof. The descent condition assumption implies that the lower bound property in Lemma 1 holds. We define

(52)    G_{k+1} = g_{k+1} − µy_{k+1} = (L_{k+1} − µ_f)y_{k+1} − (L_{k+1} + µ_Ψ)z_{k+1},

which simplifies the following subexpression of w_{k+1} in Lemma 1:

(53)    (L_{k+1} + µ_Ψ)⟨y_{k+1} − z_{k+1}, x − y_{k+1}⟩ + \frac{µ}{2}‖x − y_{k+1}‖_2^2 = ⟨G_{k+1}, x − y_{k+1}⟩ + \frac{µ}{2}‖x‖_2^2 − \frac{µ}{2}‖y_{k+1}‖_2^2.

The tightness of lower bound w_{k+1} in Lemma 1 is given by the residual R_{k+1} as

(54)    R_{k+1}(x) := F(x) − w_{k+1}(x),  x ∈ Rn.

From Lemma 1 it follows that

(55)    A_k R_{k+1}(x_k) + a_{k+1} R_{k+1}(x^*) ≥ 0.

By expanding terms in (55) using the identity in (53) we obtain

(56)    A_k(F(x_k) − F(x^*)) − A_{k+1}(F(z_{k+1}) − F(x^*)) ≥ \mathcal{C}_{k+1},

where

(57)    \mathcal{C}_{k+1} := \frac{A_{k+1}}{2(L_{k+1} + µ_Ψ)}‖g_{k+1}‖_2^2 + ⟨G_{k+1}, A_kx_k + a_{k+1}x^* − A_{k+1}y_{k+1}⟩ + \frac{A_kµ}{2}‖x_k‖_2^2 + \frac{a_{k+1}µ}{2}‖x^*‖_2^2 − \frac{A_{k+1}µ}{2}‖y_{k+1}‖_2^2.


The vertex update in (38) implies that

(58)    a_{k+1}G_{k+1} = γ_kv_k − γ_{k+1}v_{k+1},
(59)    a_{k+1}^2‖g_{k+1}‖_2^2 = ‖γ_kv_k − γ_{k+1}v_{k+1} + a_{k+1}µy_{k+1}‖_2^2.

Using (58) and (59) in (57) and rearranging terms yields

(60)    \mathcal{C}_{k+1} = \mathcal{A}_{k+1} + \mathcal{V}_{k+1} + \frac{1}{γ_{k+1}}⟨G_{k+1}, s_{k+1}⟩ + \frac{µ}{2γ_{k+1}}\mathcal{S}_{k+1},

with \mathcal{S}_{k+1} and \mathcal{V}_{k+1} defined as

(61)    \mathcal{S}_{k+1} := A_kγ_{k+1}‖x_k‖_2^2 + a_{k+1}γ_k‖v_k‖_2^2 − Y_{k+1}‖y_{k+1}‖_2^2,
(62)    \mathcal{V}_{k+1} := \frac{γ_{k+1}}{2}‖v_{k+1}‖_2^2 − \frac{γ_k}{2}‖v_k‖_2^2 + ⟨G_{k+1}, a_{k+1}x^*⟩ + \frac{a_{k+1}µ}{2}‖x^*‖_2^2.

By further applying (37) and (58), \mathcal{S}_{k+1} and \mathcal{V}_{k+1}, respectively, become

(63)    \mathcal{S}_{k+1} = \Big\langle \frac{1}{Y_{k+1}}s_{k+1} + 2y_{k+1}, \; s_{k+1} \Big\rangle + \frac{a_{k+1}A_kγ_kγ_{k+1}}{A_kγ_{k+1} + a_{k+1}γ_k}‖x_k − v_k‖_2^2,
(64)    \mathcal{V}_{k+1} = \frac{γ_{k+1}}{2}‖v_{k+1} − x^*‖_2^2 − \frac{γ_k}{2}‖v_k − x^*‖_2^2.

Putting together (52), (56), (60), (63), and (64) with F(x_{k+1}) ≤ F(z_{k+1}) and the fact that the second term in the right-hand side of (63) is always non-negative, we obtain the desired result. More details can be found in the proofs of Theorems 3, 4, and 5 in [34].

Theorem 4 provides a simple sufficient condition for the monotonicity of the gap sequence.

Proposition 5. The monotonic decrease of the gap sequence in (45) is guaranteed regardless of the algorithmic state if, for all k ≥ 0, the following hold:

(65)    y_{k+1} = \frac{1}{Y_{k+1}}(A_kγ_{k+1}x_k + a_{k+1}γ_kv_k),
(66)    a_{k+1} ≤ E(γ_k, A_k, L_{k+1}),

where

(67)    E(γ_k, A_k, L_{k+1}) := \frac{γ_k + A_kµ}{2(L_{k+1} − µ_f)}\Bigg( 1 + \sqrt{1 + \frac{4(L_{k+1} − µ_f)A_kγ_k}{(γ_k + A_kµ)^2}} \Bigg).

Proof. Herein, we consider all k ≥ 0. According to Theorem 4, an obvious sufficient condition for (45) is \mathcal{A}_{k+1} ≥ 0 combined with \mathcal{B}_{k+1} ≥ 0. Due to the black-box nature of the oracle functions and the assumption of arbitrary algorithmic state, we allow for g_{k+1} to be any vector in Rn. Quantity \mathcal{B}_{k+1} is a scalar product between two vectors, the one containing g_{k+1} being arbitrary. Therefore, a sufficient condition for \mathcal{B}_{k+1} ≥ 0 is that s_{k+1} = 0, which we can always guarantee by setting (65). Moreover, the value of ‖g_{k+1}‖_2^2 could be very large, but is always non-negative. Therefore, a sufficient condition for \mathcal{A}_{k+1} ≥ 0 is that

(68)    (L_{k+1} + µ_Ψ)a_{k+1}^2 ≤ A_{k+1}γ_{k+1}.


By expanding (68) using (30) and (37) and rearranging terms, we obtain that a_{k+1} needs to satisfy the inequality

(69)    (L_{k+1} − µ_f)a_{k+1}^2 − (γ_k + A_kµ)a_{k+1} − A_kγ_k ≤ 0.

Because L_{k+1} > µ_f, the solutions of (69) lie between the two roots of the corresponding equation, only one of which is positive and given by E(γ_k, A_k, L_{k+1}).

Proposition 5 allows us to select F_a and F_y so as to preserve the Lyapunov property in (45). For F_y, the obvious choice is (65). Out of the multitude of potential F_a, we choose the one that yields the largest convergence guarantees A_{k+1}, given by

(70) Fa(ψk, Ak, Lk+1) = E(γk, Ak, Lk+1), k ≥ 0,

which corresponds to equality in (66), namely

(71) (Lk+1 + µΨ)a2k+1 = Ak+1γk+1, k ≥ 0.

For determining the LCE, we select the backtracking line-search method S_A employed by AMGS [10] and the original ACGM [25]. The search parameters comprise the LCE increase rate r_u > 1 and the LCE decrease rate 0 < r_d < 1. The search terminates when the line-search stopping criterion (LSSC) in (21) is satisfied.
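The weight rule (70) and the line-search S_A translate directly into code. A minimal Python sketch under the paper's assumptions; the names weight_E, backtracking_lce, trial_step, and lssc_holds are illustrative helpers, not from the paper.

```python
import math

def weight_E(gamma_k, A_k, L_next, mu, mu_f):
    # Largest admissible weight a_{k+1} from (67), i.e. the positive root of (69).
    c = gamma_k + A_k * mu
    return c / (2.0 * (L_next - mu_f)) * (
        1.0 + math.sqrt(1.0 + 4.0 * (L_next - mu_f) * A_k * gamma_k / c ** 2))

def backtracking_lce(L_prev, r_u, r_d, trial_step, lssc_holds):
    # Line-search S_A: start from the decreased estimate r_d * L_prev and multiply
    # by r_u until the stopping criterion (21) holds. `trial_step` recomputes
    # (y, z) for a candidate LCE and `lssc_holds` evaluates (21); both are assumed
    # callbacks supplied by the caller.
    L = r_d * L_prev
    y, z = trial_step(L)
    while not lssc_holds(L, y, z):
        L = r_u * L
        y, z = trial_step(L)
    return L, y, z
```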

2.8. Putting it all together. We have thus determined a search strategy S_A, the initial estimate function ψ_0 in (32), the upper bounds u_{k+1} in (50), the lower bounds w_{k+1} in Lemma 1, function F_a in (70), and function F_y in (65). Substituting these expressions in the design model outlined in Pattern 1, we can write down a generalization of ACGM in estimate sequence form, as listed in Algorithm 2.

Non-monotone generalized ACGM can be obtained by enforcing x_{k+1} = z_{k+1} for all k ≥ 0, accomplished by replacing line 16 of Algorithm 2 with

(72) xk+1 := zk+1.

3. Complexity analysis.

3.1. Worst-case convergence guarantees. Algorithm 2 maintains the convergence guarantee in (7) explicitly at run-time as the state variable A_k. Moreover, if sufficient knowledge of the problem is available, it is possible to formulate a worst-case convergence guarantee before running the algorithm.

For our analysis, we will need to define a number of curvature-related quantities, namely the local inverse condition number q_{k+1} for all k ≥ 0, the worst-case LCE L_u, and the worst-case inverse condition number q_u, respectively, given by

(73)    q_{k+1} := \frac{µ}{L_{k+1} + µ_Ψ},
(74)    L_u := \max\{r_uL_f, r_dL_0\},
(75)    q_u := \frac{µ}{L_u + µ_Ψ}.

The worst-case convergence rate for generalized ACGM is stated in the following theorem.


Algorithm 2 Generalized monotone ACGM in estimate sequence form
ACGM(x_0, L_0, µ_f, µ_Ψ, A_0, γ_0, r_u, r_d, K)

1: v_0 = x_0, µ = µ_f + µ_Ψ
2: for k = 0, . . . , K − 1 do
3:   L̂_{k+1} := r_dL_k
4:   loop
5:     â_{k+1} := \frac{γ_k + A_kµ}{2(L̂_{k+1} − µ_f)}\Big(1 + \sqrt{1 + \frac{4(L̂_{k+1} − µ_f)A_kγ_k}{(γ_k + A_kµ)^2}}\Big)
6:     Â_{k+1} := A_k + â_{k+1}
7:     γ̂_{k+1} := γ_k + â_{k+1}µ
8:     y_{k+1} := \frac{1}{A_kγ̂_{k+1} + â_{k+1}γ_k}(A_kγ̂_{k+1}x_k + â_{k+1}γ_kv_k)
9:     z_{k+1} := prox_{(1/L̂_{k+1})Ψ}\big(y_{k+1} − \frac{1}{L̂_{k+1}}∇f(y_{k+1})\big)
10:    if f(z_{k+1}) ≤ Q_{L̂_{k+1},y_{k+1}}(z_{k+1}) then
11:      Break from loop
12:    else
13:      L̂_{k+1} := r_uL̂_{k+1}
14:    end if
15:  end loop
16:  x_{k+1} := \arg\min\{F(z_{k+1}), F(x_k)\}
17:  v_{k+1} := \frac{1}{γ̂_{k+1}}\big(γ_kv_k + â_{k+1}(L̂_{k+1} + µ_Ψ)z_{k+1} − â_{k+1}(L̂_{k+1} − µ_f)y_{k+1}\big)
18:  L_{k+1} := L̂_{k+1}, A_{k+1} := Â_{k+1}, γ_{k+1} := γ̂_{k+1}
19: end for
20: return x_K
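For concreteness, the following Python/NumPy sketch mirrors Algorithm 2 under the stated assumptions (the oracles f, ∇f, F, and the proximal operator are supplied as callables, and L̂_{k+1} > µ_f throughout). The function name and signature are illustrative, not a reference implementation.

```python
import math
import numpy as np

def acgm_estimate_sequence(x0, L0, mu_f, mu_psi, A0, gamma0, r_u, r_d, K,
                           f, grad_f, F, prox_psi):
    # Sketch of Algorithm 2 (generalized monotone ACGM, estimate sequence form).
    # Oracles: f(x), grad_f(x), F(x) = f(x) + Psi(x), and prox_psi(x, tau).
    x, v = x0.copy(), x0.copy()
    L, A, gamma = L0, A0, gamma0
    mu = mu_f + mu_psi
    for k in range(K):
        L_hat = r_d * L
        while True:
            c = gamma + A * mu
            a = c / (2.0 * (L_hat - mu_f)) * (
                1.0 + math.sqrt(1.0 + 4.0 * (L_hat - mu_f) * A * gamma / c ** 2))
            A_hat = A + a
            gamma_hat = gamma + a * mu
            y = (A * gamma_hat * x + a * gamma * v) / (A * gamma_hat + a * gamma)
            grad_y = grad_f(y)
            z = prox_psi(y - grad_y / L_hat, 1.0 / L_hat)
            # LSSC (21): f(z) <= Q_{L_hat, y}(z)
            if f(z) <= f(y) + grad_y @ (z - y) + 0.5 * L_hat * np.sum((z - y) ** 2):
                break
            L_hat = r_u * L_hat
        x_new = z if F(z) <= F(x) else x            # line 16: monotone update
        v = (gamma * v + a * (L_hat + mu_psi) * z
             - a * (L_hat - mu_f) * y) / gamma_hat  # line 17: vertex update (38)
        x, L, A, gamma = x_new, L_hat, A_hat, gamma_hat
        # A now stores the run-time convergence guarantee A_{k+1} appearing in (7)
    return x
```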

Theorem 6. If γ_0 ≥ A_0µ, the generalized ACGM algorithm generates a sequence {x_k}_{k≥1} that satisfies

    F(x_k) − F(x^*) ≤ \min\Big\{ \frac{4}{(k+1)^2}, \; (1 − \sqrt{q_u})^{k−1} \Big\}(L_u − µ_f)\bar{∆}_0,  k ≥ 1,

where

    \bar{∆}_0 := \frac{∆_0}{γ_0} = \frac{A_0}{γ_0}(F(x_0) − F(x^*)) + \frac{1}{2}‖x_0 − x^*‖_2^2.

Proof. The non-negativity of the weights (31) implies that γ_k ≥ γ_0 for all k ≥ 0. Combined with (71) and A_kµ ≥ 0, we have for all k ≥ 0 that

(76)    A_{k+1} = A_k + a_{k+1} ≥ A_k + \frac{γ_0}{2(L_{k+1} − µ_f)} + \sqrt{\frac{γ_0^2}{4(L_{k+1} − µ_f)^2} + \frac{A_kγ_0}{L_{k+1} − µ_f}}.

As we can see from Algorithm 2, scaling A_0 and γ_0 by a fixed factor does not alter the behavior of generalized ACGM. Additionally, γ_0 is guaranteed to be non-zero. To simplify calculations, we introduce the normalized convergence guarantees \bar{A}_k := A_k/γ_0 for all k ≥ 0.

Regardless of the outcome of individual line-search calls, the growth of the normalized accumulated weights obeys

(77)    \bar{A}_{k+1} ≥ \bar{A}_k + \frac{1}{2(L_u − µ_f)} + \sqrt{\frac{1}{4(L_u − µ_f)^2} + \frac{\bar{A}_k}{L_u − µ_f}},  k ≥ 0.

Taking into account that \bar{A}_0 ≥ 0, we obtain by induction that

(78)    \bar{A}_k ≥ \frac{(k+1)^2}{4(L_u − µ_f)},  k ≥ 1.

From assumption γ_0 ≥ A_0µ, (40) implies γ_k ≥ A_kµ for all k ≥ 0. Hence

(79)    \frac{a_{k+1}^2}{A_{k+1}^2} \overset{(71)}{=} \frac{γ_{k+1}}{(L_{k+1} + µ_Ψ)A_{k+1}} ≥ \frac{µ}{L_{k+1} + µ_Ψ} = q_{k+1} ≥ q_u,  k ≥ 0.

Since \bar{A}_0 ≥ 0, we have that \bar{A}_1 ≥ \frac{1}{L_u − µ_f}. By induction, it follows that

(80)    \bar{A}_k ≥ \frac{1}{L_u − µ_f}(1 − \sqrt{q_u})^{−(k−1)},  k ≥ 1.

Substituting (78) and (80) in (7) completes the proof.

Theorem 6 shows that generalized ACGM has the best convergence guarantees among its class of algorithms (see [25] for an in-depth discussion).

Note that the assumption γ_0 ≥ A_0µ always holds for non-strongly convex objectives and that ACGM is guaranteed to converge for strongly convex objectives also when γ_0 < A_0µ. However, in the latter case, it is more difficult to obtain simple lower bounds on the convergence guarantees. We leave such an endeavor to future research.

3.2. Wall-clock time units. So far, we have measured the theoretical performance of algorithms in terms of convergence guarantees (including the worst-case ones) indexed in iterations. This does not account for the complexity of individual iterations. In [25], we have introduced a new measure of complexity, the wall-clock time unit (WTU), to compare optimization algorithms more reliably. We thus distinguish between two types of convergence guarantees. One is the previously used iteration convergence guarantee, indexed in iterations; the other is a new computational convergence guarantee, indexed in WTU.

The WTU is a measure of running time in a shared memory parallel scenario. The computing environment consists of a small number of parallel processing units (PPU). Each PPU may be a parallel machine itself. The number of parallel units is considered sufficient to compute any number of independent oracle functions simultaneously. The shared-memory system does not impose constraints on parallelization, namely, it is uniform memory access (UMA) [35] and it is large enough to store the arguments and results of oracle calls for as long as they are needed.

In order to compare algorithms based on a unified benchmark, in [25] we have assumed that f(x) and ∇f(x) require 1 WTU each while all other operations are negligible and amount to 0 WTU. In this work, we generalize the analysis. We attribute finite non-negative costs t_f, t_g, t_Ψ, and t_p to f(x), ∇f(x), Ψ(x), and proxτΨ(x), respectively. However, since we are dealing with large-scale problems, we maintain the assumption that element-wise vector operations, including scalar-vector multiplications, vector additions, and inner products, have negligible complexity when compared to oracle functions and assign a cost of 0 WTU to each. Synchronization of PPUs also incurs no cost. Consequently, when computed in isolation, an objective function value call costs t_F = max{t_f, t_Ψ}, ascribable to separability, while a proximal gradient operation costs t_T = t_g + t_p, due to computational dependencies.


3.3. Per-iteration complexity. We measure this complexity in WTU on the shared memory system described in the previous subsection and consider a parallel implementation involving speculative execution [35].

The advancement phase of a generalized ACGM iteration consists of one proximal gradient step (line 9 of Algorithm 2). Hence, every iteration has a base cost of t_T = t_g + t_p. LSSC and the monotonicity condition (MC) in line 16 of Algorithm 2 can be evaluated in parallel with subsequent iterations. Both rely on the computation of f(z_{k+1}), which in the worst case requires ⌈t_f/t_T⌉ dedicated PPUs. In addition, MC may need up to ⌈t_Ψ/t_T⌉ PPUs.

Backtracks stall the algorithm in a way that cannot be alleviated by parallelization or intensity reduction. Therefore, it is desirable to make them a rare event. Assuming that the local curvature of f varies around a fixed value, this would mean that log(r_u) should be significantly larger than −log(r_d). With such a parameter choice, the algorithm can proceed from one iteration to another by speculating that backtracks do not occur at all. Let the current iteration be indexed by k. If the LSSC of iteration k fails, then the algorithm discards all the state information pertaining to all iterations made after k, reverts to iteration k, and performs the necessary computation to correct the error. We consider that a misprediction incurs a detection cost t_d and a correction cost t_c. LSSC requires the evaluation of f(z_{k+1}) and incurs a detection cost of t_d = t_f. A backtrack entails recomputing y_{k+1} and z_{k+1}, yielding an LSSC correction time of t_c = t_T.

Overshoots are assumed to occur even less often. Similarly, the algorithm proceeds speculating that MC always passes and defaults to (72). Hence, MC has t_d = t_F, due to its dependency on Ψ(z_{k+1}), but once the algorithmic state of iteration k has been restored, no additional oracle calls are needed, leading to t_c = 0. MC and LSSC can be fused into a single condition, giving rise to the scenarios outlined in Table 1. Note that if LSSC fails, MC is not evaluated.

Table 1
Algorithm stall time in WTU based on the outcome of LSSC and MC

              | MC passed     | MC failed
LSSC passed   | 0             | max{t_f, t_Ψ}
LSSC failed   | t_f + t_g + t_p | N/A

For non-monotone generalized ACGM, each backtrack adds t_f + t_T WTU to a base iteration cost of t_T. A comparison to other methods employing line-search is shown in Table 2.

Table 2
Per-iteration cost of FISTA, AMGS, and generalized ACGM in the non-monotone setting

                | FISTA     | AMGS       | ACGM
Base cost       | t_g + t_p | 2t_g + 2t_p | t_g + t_p
LSSC t_d        | t_f       | t_g        | t_f
LSSC t_c        | t_p       | t_g + t_p  | t_g + t_p
Backtrack cost  | t_f + t_p | 2t_g + t_p | t_f + t_g + t_p
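The entries of Table 2 are simple sums of the attributed oracle costs, so they can be tabulated programmatically. A small illustrative Python helper (the function name and dictionary layout are ours, not the paper's):

```python
def per_iteration_costs(t_f, t_g, t_p):
    # Per-iteration WTU costs from Table 2 (non-monotone setting), expressed in
    # terms of the assumed oracle costs t_f, t_g, t_p.
    return {
        "FISTA": {"base": t_g + t_p, "lssc_td": t_f, "lssc_tc": t_p,
                  "backtrack": t_f + t_p},
        "AMGS": {"base": 2 * t_g + 2 * t_p, "lssc_td": t_g, "lssc_tc": t_g + t_p,
                 "backtrack": 2 * t_g + t_p},
        "ACGM": {"base": t_g + t_p, "lssc_td": t_f, "lssc_tc": t_g + t_p,
                 "backtrack": t_f + t_g + t_p},
    }

# Example: the unified benchmark of [25], where t_f = t_g = 1 and t_p = 0, gives a
# base cost of 1 WTU for FISTA and ACGM and 2 WTU for AMGS.
costs = per_iteration_costs(1, 1, 0)
```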


4. Extrapolated form.

4.1. Monotonicity and extrapolation. In the original ACGM [25], the auxiliary point can be obtained from two successive main iterates through extrapolation. Interestingly, this property is preserved for any step size. We show in the following how monotonicity alters this property and bring generalized monotone ACGM to a form in which the auxiliary point is an extrapolation of state variables. We begin with the following observation.

Lemma 7. In Algorithm 2, the estimate sequence vertices can be obtained from other state variables through extrapolation as

    v_{k+1} = x_k + \frac{A_{k+1}}{a_{k+1}}(z_{k+1} − x_k),  k ≥ 0.

Proof. In this proof we consider all k ≥ 0. The vertex update in (38) can be rewritten using (71) as

(81)    a_{k+1}γ_kv_k \overset{(38)}{=} a_{k+1}γ_{k+1}v_{k+1} + (L_{k+1} + µ_Ψ)a_{k+1}^2(y_{k+1} − z_{k+1}) − a_{k+1}^2µy_{k+1} \overset{(71)}{=} a_{k+1}γ_{k+1}v_{k+1} + A_{k+1}γ_{k+1}(y_{k+1} − z_{k+1}) − a_{k+1}^2µy_{k+1}.

The auxiliary point update in (65) can also be written as

(82)    a_{k+1}γ_kv_k = (A_kγ_{k+1} + a_{k+1}γ_k)y_{k+1} − A_kγ_{k+1}x_k.

From (37) we have that

(83)    A_{k+1}γ_{k+1} − a_{k+1}^2µ = A_kγ_{k+1} + a_{k+1}γ_k.

Putting together (81), (82), and eliminating y_{k+1} using (83), we obtain

(84)    −A_kγ_{k+1}x_k = a_{k+1}γ_{k+1}v_{k+1} − A_{k+1}γ_{k+1}z_{k+1},

which, after dividing by a_{k+1}γ_{k+1} and using (30), is equivalent to the statement of the lemma.

Lemma 7 enables us to write down the corresponding expression for the auxiliary point.

Proposition 8. In Algorithm 2, the auxiliary points y_{k+1} obey the following extrapolation rule:

    y_{k+1} = x_k + β_k(z_k − x_{k−1}),  k ≥ 1,

where

(85)    β_k = \Big( \frac{A_k}{a_k} − 1_{\{z_k\}}(x_k) \Big)ω_k,
(86)    ω_k := \frac{a_{k+1}γ_k}{A_kγ_{k+1} + a_{k+1}γ_k},

and 1_X denotes the membership function of set X, namely

    1_X(x) = 1 if x ∈ X,  0 if x ∉ X.


Proof. The results herein apply to all k ≥ 1. Lemma 7 applied for v_k, combined with the auxiliary point update in (65), leads to

(87)    y_{k+1} = (1 − ω_k)x_k + \frac{A_k}{a_k}ω_kz_k + \Big( 1 − \frac{A_k}{a_k} \Big)ω_kx_{k−1}.

Depending on the outcome of the update in line 16 of Algorithm 2, we distinguish two situations.

If MC passes at iteration k − 1 (F(z_k) ≤ F(x_{k−1})), we set x_k = z_k, hence

(88)    y_{k+1} = x_k + \Big( \frac{A_k}{a_k} − 1 \Big)ω_k(z_k − x_{k−1}).

If the algorithm overshoots (F(z_k) > F(x_{k−1})) then, by monotonicity, we impose x_k = x_{k−1}, which leads to

(89)    y_{k+1} = x_k + \frac{A_k}{a_k}ω_k(z_k − x_{k−1}).

Combining (88) and (89) completes the proof.

Until this point we have assumed that the first iteration k = 0 does not use the auxiliary point extrapolation rule in Proposition 8. To write our generalized ACGM in a form similar to monotone FISTA (MFISTA [26]) and the monotone version of FISTA-CP [7], we define the vertex extrapolation factor in Lemma 7 as

(90)    t_k := \frac{A_k}{a_k},  k ≥ 1.

Proposition 9. For k ≥ 1, the vertex extrapolation factor can be determined using a recursion rule that does not depend on weights a_k and A_k, given by

(91)    t_{k+1}^2 + t_{k+1}(q_kt_k^2 − 1) − \frac{L_{k+1} + µ_Ψ}{L_k + µ_Ψ}t_k^2 = 0.

Subexpression ω_k and the auxiliary point extrapolation factor β_k can also be written for all k ≥ 1 as

(92)    ω_k = \frac{1 − q_{k+1}t_{k+1}}{(1 − q_{k+1})t_{k+1}},
(93)    β_k = (t_k − 1_{\{z_k\}}(x_k))ω_k.

Proof. In this proof we take all k ≥ 1. From (73) it follows that

(94)    q_k = \frac{L_{k+1} + µ_Ψ}{L_k + µ_Ψ}q_{k+1}.

Moreover,

(95)    \frac{γ_k}{γ_{k+1}} \overset{(37)}{=} \frac{A_{k+1}(γ_{k+1} − a_{k+1}µ)}{A_{k+1}γ_{k+1}} \overset{(71)}{=} 1 − \frac{A_{k+1}a_{k+1}µ}{(L_{k+1} + µ_Ψ)a_{k+1}^2} \overset{(90)}{=} 1 − q_{k+1}t_{k+1},
(96)    γ_kt_k^2 \overset{(90)}{=} \frac{(L_k + µ_Ψ)γ_kA_k^2}{(L_k + µ_Ψ)a_k^2} \overset{(71)}{=} (L_k + µ_Ψ)A_k,
(97)    γ_kt_k \overset{(90)}{=} \frac{γ_kA_k}{a_k} \overset{(71)}{=} (L_k + µ_Ψ)a_k.


Putting together (94), (95), (96), and (97) we obtain

γ_{k+1}\Big( t_{k+1}^2 + t_{k+1}(q_kt_k^2 − 1) − \frac{L_{k+1} + µ_Ψ}{L_k + µ_Ψ}t_k^2 \Big)
    \overset{(94)}{=} γ_{k+1}t_{k+1}^2 + \frac{L_{k+1} + µ_Ψ}{L_k + µ_Ψ}γ_{k+1}q_{k+1}t_{k+1}t_k^2 − γ_{k+1}t_{k+1} − γ_{k+1}\frac{L_{k+1} + µ_Ψ}{L_k + µ_Ψ}t_k^2
    \overset{(95)}{=} (L_{k+1} + µ_Ψ)A_{k+1} − \frac{L_{k+1} + µ_Ψ}{L_k + µ_Ψ}γ_kt_k^2 − (L_{k+1} + µ_Ψ)a_{k+1}
    = (L_{k+1} + µ_Ψ)A_{k+1} − (L_{k+1} + µ_Ψ)A_k − (L_{k+1} + µ_Ψ)a_{k+1} = 0,

where the last two equalities also use (96) and (97) applied at indices k + 1 and k.

Since γ_{k+1} > 0, (91) holds. Subexpression ω_k is similarly obtained as

ω_k \overset{(83)}{=} \frac{1 − \frac{a_{k+1}µ}{γ_{k+1}}}{\frac{A_{k+1}}{a_{k+1}} − \frac{a_{k+1}µ}{γ_{k+1}}} \overset{(71)}{=} \frac{1 − \frac{µA_{k+1}}{(L_{k+1} + µ_Ψ)a_{k+1}}}{\Big( 1 − \frac{µ}{L_{k+1} + µ_Ψ} \Big)\frac{A_{k+1}}{a_{k+1}}} \overset{(90)}{=} \frac{1 − q_{k+1}t_{k+1}}{(1 − q_{k+1})t_{k+1}}.

Substituting (90) in (85) gives (93).
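The recursions (91)-(93) are scalar updates and can be coded directly. A small Python sketch; the function names are ours, and the boolean `overshoot` encodes whether the monotonicity check rejected z_k (i.e., x_k ≠ z_k).

```python
import math

def next_t(t_k, q_k, L_k, L_next, mu_psi):
    # Positive root of the recursion (91) for the vertex extrapolation factor t_{k+1}.
    b = q_k * t_k ** 2 - 1.0
    c = (L_next + mu_psi) / (L_k + mu_psi) * t_k ** 2
    return 0.5 * (-b + math.sqrt(b ** 2 + 4.0 * c))

def omega(t_next, q_next):
    # Subexpression omega_k from (92), using the candidate t_{k+1} and q_{k+1}.
    return (1.0 - q_next * t_next) / ((1.0 - q_next) * t_next)

def beta(t_k, omega_k, overshoot):
    # Auxiliary extrapolation factor (93); the membership term is 1 when x_k = z_k.
    return (t_k - (0.0 if overshoot else 1.0)) * omega_k
```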

For simplicity, we wish to extend the update rules from Propositions 8 and 9 to the first iteration k = 0. The missing parameters follow naturally from this extension. First, t_0 can be obtained by setting k = 0 in (91) as

(98)    t_0 = \sqrt{\frac{t_1^2 − t_1}{\frac{L_1 + µ_Ψ}{L_0 + µ_Ψ} − t_1q_0}} \overset{(90)}{=} \sqrt{\frac{A_1 − a_1}{\frac{(L_1 + µ_Ψ)a_1^2}{(L_0 + µ_Ψ)A_1} − a_1q_0}} \overset{(71)}{=} \sqrt{\frac{(L_0 + µ_Ψ)A_0}{γ_0}}.

Next, we introduce a "phantom iteration" k = −1 with the main iterate as the only state parameter. We set x_{−1} := x_0 so that any value of β_0 will satisfy Proposition 8. For brevity, we choose to obtain β_0 from expression (93) with k = 0.

Thus, with the initialization in (98) and the recursion in (91), we have completely defined the vertex extrapolation factor sequence {t_k}_{k≥0}, and derived from it the auxiliary extrapolation factor expression in (93). Now, we no longer need to maintain the weight sequences {a_k}_{k≥1} and {A_k}_{k≥0}. We simplify our generalized ACGM further by noting that, to produce the auxiliary point, the extrapolation rule in Proposition 8 depends on three vector parameters. However, it is not necessary to store both z_k and x_{k−1} across iterations. To address applications where memory is limited, we only maintain the difference term d_k, given by

(99)    d_k = (t_k − 1_{\{z_k\}}(x_k))(z_k − x_{k−1}),  k ≥ 0.

The extrapolation rule in Proposition 8 becomes

(100) yk+1 = xk + ωkdk, k ≥ 0.

Note that subexpression ω_k contains only recent information whereas d_k needs only to access the state of the preceding iterations.

The above modifications yield a form of generalized ACGM based on extrapolation, which we list in Algorithm 3. To obtain a non-monotone algorithm, it suffices to replace line 16 of Algorithm 3 with (72).

We stress that while Algorithms 2 and 3 carry out different computations, they are mathematically equivalent with respect to the main iterate sequence {x_k}_{k≥0}. The oracle calls and their dependencies in Algorithm 3 are also identical to those in Algorithm 2. Therefore the per-iteration complexity is the same.


Algorithm 3 Generalized monotone ACGM in extrapolated form
ACGM(x_0, L_0, µ_f, µ_Ψ, A_0, γ_0, r_u, r_d, K)

1: x_{−1} = x_0, d_0 = 0
2: µ = µ_f + µ_Ψ, t_0 = \sqrt{\frac{(L_0 + µ_Ψ)A_0}{γ_0}}, q_0 = \frac{µ}{L_0 + µ_Ψ}
3: for k = 0, . . . , K − 1 do
4:   L̂_{k+1} := r_dL_k
5:   loop
6:     q̂_{k+1} := \frac{µ}{L̂_{k+1} + µ_Ψ}
7:     t̂_{k+1} := \frac{1}{2}\Big( 1 − q_kt_k^2 + \sqrt{(1 − q_kt_k^2)^2 + 4\frac{L̂_{k+1} + µ_Ψ}{L_k + µ_Ψ}t_k^2} \Big)
8:     y_{k+1} := x_k + \frac{1 − q̂_{k+1}t̂_{k+1}}{(1 − q̂_{k+1})t̂_{k+1}}d_k
9:     z_{k+1} := prox_{(1/L̂_{k+1})Ψ}\big(y_{k+1} − \frac{1}{L̂_{k+1}}∇f(y_{k+1})\big)
10:    if f(z_{k+1}) ≤ Q_{L̂_{k+1},y_{k+1}}(z_{k+1}) then
11:      Break from loop
12:    else
13:      L̂_{k+1} := r_uL̂_{k+1}
14:    end if
15:  end loop
16:  x_{k+1} := \arg\min\{F(z_{k+1}), F(x_k)\}
17:  d_{k+1} := (t̂_{k+1} − 1_{\{z_{k+1}\}}(x_{k+1}))(z_{k+1} − x_k)
18:  L_{k+1} := L̂_{k+1}, q_{k+1} := q̂_{k+1}, t_{k+1} := t̂_{k+1}
19: end for
20: return x_K
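A Python sketch of Algorithm 3 under the same oracle interface as the earlier sketch of Algorithm 2; variable names mirror the listing (the hatted quantities are the line-search candidates), and the code is illustrative rather than a reference implementation.

```python
import math
import numpy as np

def acgm_extrapolated(x0, L0, mu_f, mu_psi, A0, gamma0, r_u, r_d, K,
                      f, grad_f, F, prox_psi):
    # Sketch of Algorithm 3 (generalized monotone ACGM in extrapolated form).
    mu = mu_f + mu_psi
    x, d = x0.copy(), np.zeros_like(x0)
    L = L0
    t = math.sqrt((L0 + mu_psi) * A0 / gamma0)
    q = mu / (L0 + mu_psi)
    for k in range(K):
        L_hat = r_d * L
        while True:
            q_hat = mu / (L_hat + mu_psi)
            b = 1.0 - q * t ** 2
            t_hat = 0.5 * (b + math.sqrt(b ** 2
                                         + 4.0 * (L_hat + mu_psi) / (L + mu_psi) * t ** 2))
            y = x + (1.0 - q_hat * t_hat) / ((1.0 - q_hat) * t_hat) * d
            grad_y = grad_f(y)
            z = prox_psi(y - grad_y / L_hat, 1.0 / L_hat)
            # LSSC (21)
            if f(z) <= f(y) + grad_y @ (z - y) + 0.5 * L_hat * np.sum((z - y) ** 2):
                break
            L_hat = r_u * L_hat
        monotone_pass = F(z) <= F(x)
        x_new = z if monotone_pass else x
        d = (t_hat - (1.0 if monotone_pass else 0.0)) * (z - x)  # line 17, old x_k
        x, L, q, t = x_new, L_hat, q_hat, t_hat
    return x
```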

4.2. Retrieving the convergence guarantee. For Algorithm 2, the convergence guarantee in (7) is obtained directly from the state variable A_k. For Algorithm 3, we need the following result.

Lemma 10. The vertex extrapolation factor expression in (98) generalizes to arbitrary k ≥ 0 as

    t_k = \sqrt{\frac{(L_k + µ_Ψ)A_k}{γ_k}}.

Proof. For k = 0, (98) implies that Lemma 10 holds. For k ≥ 1 we have that

    t_{k+1} = \sqrt{\frac{(L_{k+1} + µ_Ψ)A_{k+1}^2}{(L_{k+1} + µ_Ψ)a_{k+1}^2}} \overset{(71)}{=} \sqrt{\frac{(L_{k+1} + µ_Ψ)A_{k+1}}{γ_{k+1}}}.

From Lemma 10, we distinguish two scenarios.

If γ_0 ≠ A_0µ, from (40), (73), and Lemma 10 it follows that the convergence guarantee can be derived directly from the state parameters without alterations to Algorithm 3 as

(101)    A_k = \frac{(γ_0 − A_0µ)t_k^2}{(L_k + µ_Ψ)(1 − q_kt_k^2)},  k ≥ 1.


4.3. Border-case. However, if γ_0 = A_0 µ, then (40), (73), and Lemma 10 imply that

(102) $t_k = \dfrac{1}{\sqrt{q_k}}, \quad k \geq 1.$

Therefore, the state parameters of Algorithm 3 no longer contain information on the convergence guarantee, but they can be brought to a simpler form. It follows from (92), (93), and (102) that the auxiliary point extrapolation factor is given by

(103) $\beta_k = \dfrac{\sqrt{L_k + \mu_\Psi} - 1_{z_k}(x_k)\sqrt{\mu}}{\sqrt{L_{k+1} + \mu_\Psi} + \sqrt{\mu}}, \quad k \geq 0.$

The sequence {t_k}_{k≥0} does not store any relevant information and can be left out. This means that the convergence guarantee A_k requires a dedicated update. A simple recursion rule follows from (30), (40), and (71) as

(104) $A_{k+1} = \dfrac{\sqrt{L_{k+1} + \mu_\Psi}}{\sqrt{L_{k+1} + \mu_\Psi} - \sqrt{\mu}} A_k, \quad k \geq 0.$

Due to scaling invariance, we can select any pair (A_0, γ_0) that is a positive multiple of (1, µ). For simplicity, we choose A_0 = 1 and γ_0 = µ.

To reduce computational intensity, we modify the subexpressions d_k and ω_k as

(105) $d_k = \left(\sqrt{L_k + \mu_\Psi} - 1_{z_k}(x_k)\sqrt{\mu}\right)(z_k - x_{k-1}), \quad k \geq 0,$

(106) $\omega_k = \dfrac{1}{\sqrt{L_{k+1} + \mu_\Psi} + \sqrt{\mu}}, \quad k \geq 0.$

The local inverse condition number sequence {q_k}_{k≥0} does not appear in updates (104), (105), and (106). Hence, it can also be abstracted away. The form taken by generalized ACGM in this border case, after simplifications, is listed in Algorithm 4.

A non-monotone variant can be obtained by replacing line 13 of Algorithm 4 with (72). The per-iteration complexity of Algorithm 4 matches that of Algorithms 2 and 3.

5. Simulation results.

5.1. Benchmark setup. We have tested the variants of generalized ACGM introduced in this work against the methods considered at the time of writing to be the state of the art on the problem class outlined in Subsection 1.2. The proposed methods included in the benchmark are non-monotone ACGM (denoted as plain ACGM), monotone ACGM (MACGM), and, for strongly convex problems, border-case non-monotone ACGM (BACGM) as well as border-case monotone ACGM (BMACGM). The state-of-the-art methods are FISTA-CP, monotone FISTA-CP (MFISTA-CP) [7], AMGS [10], and FISTA with backtracking line-search (FISTA-BT) [11].

We have selected as test cases five synthetic instances of composite problems in the areas of statistics, inverse problems, and machine learning. Three are non-strongly convex: the least absolute shrinkage and selection operator (LASSO) [12], non-negative least squares (NNLS), and l1-regularized logistic regression (L1LR). The other two are strongly convex: ridge regression (RR) and elastic net (EN) [36]. Table 3 lists the oracle functions of all of the above problems.


Algorithm 4 Border-case ACGM in extrapolated form
ACGM(x_0, L_0, µ_f, µ_Ψ, r_u, r_d, K)
1: x_{-1} = x_0, d_0 = 0, A_0 = 1, µ = µ_f + µ_Ψ
2: for k = 0, ..., K − 1 do
3:   L̂_{k+1} := r_d L_k
4:   loop
5:     ŷ_{k+1} := x_k + (1 / (sqrt(L̂_{k+1} + µ_Ψ) + sqrt(µ))) d_k
6:     z_{k+1} := prox_{(1/L̂_{k+1})Ψ}( ŷ_{k+1} − (1/L̂_{k+1}) ∇f(ŷ_{k+1}) )
7:     if f(z_{k+1}) ≤ Q_{L̂_{k+1}, ŷ_{k+1}}(z_{k+1}) then
8:       Break from loop
9:     else
10:      L̂_{k+1} := r_u L̂_{k+1}
11:    end if
12:  end loop
13:  x_{k+1} := arg min{F(z_{k+1}), F(x_k)}
14:  d_{k+1} := (sqrt(L̂_{k+1} + µ_Ψ) − 1_{z_{k+1}}(x_{k+1}) sqrt(µ))(z_{k+1} − x_k)
15:  L_{k+1} := L̂_{k+1}
16:  A_{k+1} := (sqrt(L_{k+1} + µ_Ψ) / (sqrt(L_{k+1} + µ_Ψ) − sqrt(µ))) A_k
17: end for
18: return x_K
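To illustrate the guarantee recursion of Algorithm 4 (line 16), consider the fixed-step case L_{k+1} = L_f for all k. The following is only a sketch of the implied geometric growth, not part of the algorithm itself; the function and variable names are ours.

```python
import math

def border_case_guarantee_growth(L_f, mu, mu_psi):
    """Per-iteration growth factor of A_k implied by (104) when L_{k+1} = L_f."""
    s = math.sqrt(L_f + mu_psi)
    return s / (s - math.sqrt(mu))

# Equivalently, the factor equals 1 / (1 - sqrt(q)) with q = mu / (L_f + mu_psi).
# For the inverse condition number q = 1/1001 used by the strongly convex test
# problems in Section 5, the guarantee grows by roughly 3.3% per iteration:
q = 1.0 / 1001.0
print(1.0 / (1.0 - math.sqrt(q)))   # approximately 1.0326
```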

Table 3
Oracle functions of the five test problems

| Problem | f(x) | Ψ(x) | ∇f(x) | prox_{τΨ}(x) |
| LASSO | (1/2)‖Ax − b‖_2^2 | λ_1‖x‖_1 | A^T(Ax − b) | T_{τλ_1}(x) |
| NNLS | (1/2)‖Ax − b‖_2^2 | σ_{R_+^n}(x) | A^T(Ax − b) | (x)_+ |
| L1LR | I(Ax) − y^T Ax | λ_1‖x‖_1 | A^T(L(Ax) − y) | T_{τλ_1}(x) |
| RR | (1/2)‖Ax − b‖_2^2 | (λ_2/2)‖x‖_2^2 | A^T(Ax − b) | x / (1 + τλ_2) |
| EN | (1/2)‖Ax − b‖_2^2 | λ_1‖x‖_1 + (λ_2/2)‖x‖_2^2 | A^T(Ax − b) | T_{τλ_1}(x) / (1 + τλ_2) |

Here, the sum softplus function I(x), the element-wise logistic function L(x), and the shrinkage operator T_τ(x) are, respectively, given for τ > 0 by

(107) $I(x) = \sum_{i=1}^{m} \log(1 + e^{x_i}),$

(108) $L(x)_i = \dfrac{1}{1 + e^{-x_i}}, \quad i \in \{1, ..., m\},$

(109) $T_\tau(x)_j = (|x_j| - \tau)_+ \operatorname{sgn}(x_j), \quad j \in \{1, ..., n\}.$
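As a sketch (assuming NumPy; the function names are ours), the operators in (107)-(109) and the EN proximal step from Table 3 can be implemented as:

```python
import numpy as np

def softplus_sum(x):                     # I(x) in (107)
    return np.sum(np.logaddexp(0.0, x))  # log(1 + e^{x_i}) in a stable form

def logistic(x):                         # L(x) in (108), element-wise
    return 1.0 / (1.0 + np.exp(-x))

def shrinkage(x, tau):                   # T_tau(x) in (109), element-wise
    return np.maximum(np.abs(x) - tau, 0.0) * np.sign(x)

def prox_en(x, tau, lam1, lam2):
    # proximal operator of tau * (lam1 * ||.||_1 + (lam2 / 2) * ||.||_2^2), cf. Table 3
    return shrinkage(x, tau * lam1) / (1.0 + tau * lam2)

def grad_l1lr_smooth(A, y, x):
    # gradient of the smooth part of L1LR, cf. Table 3
    return A.T @ (logistic(A @ x) - y)
```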

To attain the best convergence guarantees for AMGS, Nesterov suggests in [10] that all known global strong convexity be transferred to the simple function Ψ. When line-search is enabled, generalized ACGM also benefits slightly from this arrangement when r_u > 1 (Theorem 6). Without line-search, the convergence guarantees of generalized ACGM do not change as a result of strong convexity transfer, in either direction.


Thus, for fair comparison, we have incorporated the strongly convex quadratic regularization term in Ψ for the RR and EN problems. In the following, we describe in detail each of the five problem instances. All random variables are independent and identically distributed, unless stated otherwise.

LASSO. Real-valued matrix A is of size m = 500 by n = 500, with entries drawn from N(0, 1). Vector b ∈ R^m has entries sampled from N(0, 9). Regularization parameter λ_1 is 4. The starting point x_0 ∈ R^n has entries drawn from N(0, 1).

NNLS. Sparse m = 1000 × n = 10000 matrix A has approximately 10% of entries, at random locations, non-zero. The non-zero entries are drawn from N(0, 1), after which each column j ∈ {1, ..., n} is scaled independently to have an l2 norm of 1. Starting point x_0 has 10 entries at random locations, all equal to 4, and the remainder zero. Vector b is obtained from b = Ax_0 + z, where z is standard Gaussian noise.
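A sketch of how such an instance can be generated (assuming NumPy and SciPy; the seed and the sparse storage format are our own choices, not specified in the text):

```python
import numpy as np
import scipy.sparse as sp

def make_nnls_instance(m=1000, n=10000, density=0.10, seed=0):
    rng = np.random.default_rng(seed)
    # ~10% non-zero entries at random locations, drawn from N(0, 1)
    A = sp.random(m, n, density=density, format="csc",
                  data_rvs=rng.standard_normal, random_state=seed)
    # scale each column to unit l2 norm (empty columns are left untouched)
    norms = np.sqrt(np.asarray(A.power(2).sum(axis=0)).ravel())
    norms[norms == 0.0] = 1.0
    A = A @ sp.diags(1.0 / norms)
    # starting point with 10 entries equal to 4 at random locations
    x0 = np.zeros(n)
    x0[rng.choice(n, size=10, replace=False)] = 4.0
    b = A @ x0 + rng.standard_normal(m)    # standard Gaussian noise
    return A, b, x0
```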

L1LR. Matrix A has m = 200 × n = 1000 entries sampled from N(0, 1), x_0 has exactly 10 non-zero entries at random locations, each entry value drawn from N(0, 225), and λ_1 = 5. Labels y_i ∈ {0, 1}, i ∈ {1, ..., m}, are selected with probability P(Y_i = 1) = L(Ax)_i.

RR. Dimensions are m = 500 × n = 500. The entries of matrix A, vector b, and starting point x_0 are drawn from N(0, 1), N(0, 25), and N(0, 1), respectively. Regularizer λ_2 is given by 10^{-3}(σ_max(A))^2, where σ_max(A) is the largest singular value of A.

EN. Matrix A has m = 1000 × n = 500 entries sampled from N(0, 1). Starting point x_0 has 20 non-zero entries at random locations, each entry value drawn from N(0, 1). Regularization parameter λ_1 is obtained according to [37] as λ_1 = 1.5 sqrt(2 log(n)), and λ_2 is the same as in RR, λ_2 = 10^{-3}(σ_max(A))^2.

The Lipschitz constant L_f is given by (σ_max(A))^2 for all problems except for L1LR, where it is (1/4)(σ_max(A))^2. Strongly convex problems RR and EN have µ = µ_Ψ = λ_2 and inverse condition number q = µ/(L_f + µ_Ψ) = 1/1001.

To be able to benchmark against FISTA-CP and FISTA-BT, which lack fully adaptive line-search, we have set L_0 = L_f for all tested algorithms, thus giving FISTA-CP and FISTA-BT an advantage over the proposed methods. To highlight the differences between ACGM and BACGM, we ran ACGM and MACGM with parameters A_0 = 0 and γ_0 = 1.

Despite the problems differing in structure, the oracle functions have the same computational costs. We consider one matrix-vector multiplication to cost 1 WTU. Consequently, for all problems, we have t_f = 1 WTU, t_g = 2 WTU, and t_Ψ = t_p = 0 WTU.

The line-search parameters were selected according to the recommendation given in [5]. For AMGS and FISTA-BT we have r_u^{AMGS} = r_u^{FISTA} = 2.0 and r_d^{AMGS} = 0.9. The variants of generalized ACGM and AMGS are the only methods included in the benchmark that are equipped with fully adaptive line-search. We have decided to select r_d^{ACGM} to ensure that ACGM and AMGS have the same overhead. We formally define the line-search overhead of method M, denoted by Ω_M, as the average computational cost attributable to backtracks per WTU of advancement. Assuming that the LCEs hover around a fixed value (Subsection 3.3), we thus have that

(110) $\Omega_{\mathrm{AMGS}} = -\dfrac{(2 t_g + t_p) \log(r_d^{\mathrm{AMGS}})}{2 (t_g + t_p) \log(r_u^{\mathrm{AMGS}})},$

(111) $\Omega_{\mathrm{ACGM}} = -\dfrac{(t_f + t_g + t_p) \log(r_d^{\mathrm{ACGM}})}{(t_g + t_p) \log(r_u^{\mathrm{ACGM}})}.$


From (110) and (111) we have that r_d^{ACGM} = (r_d^{AMGS})^{2/3}, with no difference for border-case or monotone variants.
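The following sketch (variable names are ours) evaluates (110) and (111) under the cost model above and confirms that this choice equalizes the two overheads:

```python
import math

t_f, t_g, t_p = 1.0, 2.0, 0.0            # WTU costs of the oracle calls
r_u, r_d_amgs = 2.0, 0.9                 # line-search parameters of AMGS / FISTA-BT

def overhead_amgs(r_d):                  # (110)
    return -(2.0 * t_g + t_p) * math.log(r_d) / (2.0 * (t_g + t_p) * math.log(r_u))

def overhead_acgm(r_d):                  # (111)
    return -(t_f + t_g + t_p) * math.log(r_d) / ((t_g + t_p) * math.log(r_u))

r_d_acgm = r_d_amgs ** (2.0 / 3.0)
print(overhead_amgs(r_d_amgs), overhead_acgm(r_d_acgm))   # both approximately 0.152
```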

case or monotone variants.For measuring ISDs, we have computed beforehand an optimal point estimate x∗

for each problem instance. Each x∗ was obtained as the main iterate after runningMACGM for 5000 iterations with parameters A0 = 0, γ0 = 1, L0 = Lf , and aggressivesearch parameters rd = 0.9 and ru = 2.0.

5.2. Non-strongly convex problems. The convergence results for LASSO, NNLS, and L1LR are shown in Figure 1. The LCE variation during the first 200 WTU is shown in Figure 2. For NNLS, floating point precision was exhausted after 100 WTU and the LCE variation was only plotted up to this point (Figure 2(b)). In addition, the average LCEs are listed in Table 4.

Both variants of ACGM outperform the competing methods, in iterations and especially in WTU, on each of these problem instances. Even though for LASSO and NNLS the iteration convergence rate of AMGS is slightly better in the beginning (Figures 1(a) and 1(c)), AMGS lags behind afterwards and, when measured in terms of computational convergence rate, has the poorest performance among the methods tested (Figures 1(b), 1(d), and 1(f)). FISTA-BT produces the same iterates as FISTA-CP, as theoretically guaranteed in the non-strongly convex case for L_0 = L_f.

The overall superiority of ACGM and MACGM can be attributed to the effectiveness of line-search. Interestingly, ACGM manages to surpass FISTA-CP and MFISTA-CP even when the latter are supplied with the exact value of the global Lipschitz constant. This is because ACGM is able to accurately estimate the local curvature, which is often well below L_f. For the L1LR problem, where the smooth part f is not the square of a linear function, the local curvature is substantially lower than the global Lipschitz constant, with LCEs hovering around one fifth of L_f (Figure 2(c)). One would expect AMGS to be able to estimate local curvature as accurately as ACGM. This does not happen, due to AMGS's reliance on a "damped relaxation condition" [10] as the line-search stopping criterion. For LASSO and NNLS, the average LCE of AMGS is actually above L_f. ACGM has an average LCE that is roughly two thirds that of AMGS on these problems, whereas for L1LR its average is more than three times lower than that of AMGS. The difference between the LCE averages of ACGM and MACGM is negligible.

Indeed, monotonicity, as predicted, does not alter the overall iteration convergence rate and has a stabilizing effect. MACGM overshoots do have a negative but limited impact on the computational convergence rate. We have noticed in our simulations that overshoots occur less often for larger problems, such as the tested instance of NNLS.

5.3. Strongly convex problems. The convergence results for RR and EN are shown in Figures 3(a), 3(b), 3(c), and 3(d). The LCE variation is shown in Figure 4, with LCE averages listed in Table 5.

Strongly convex problems have a unique optimum point, and accelerated first-order schemes are guaranteed to find an accurate estimate of it in domain space (see [4] for a detailed analysis). Along with Theorem 6, it follows that

(112) $A_0 (F(x_0) - F(x^*)) + \dfrac{\gamma_0}{2} \|x_0 - x^*\|_2^2 \simeq \Delta_0.$

Thus, we can display accurate estimates of the ISDUBs in (7), of the form U_k := ∆_0 / A_k, k ≥ 1, for methods that maintain convergence guarantees at runtime. These are shown in Figures 3(e) and 3(f) as upper bounds indexed in WTU.
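For reference, a minimal sketch of how these upper bounds can be assembled from logged guarantees A_k (the names are ours):

```python
def isd_upper_bounds(A_log, F_x0, F_opt, dist0_sq, A0, gamma0):
    """U_k = Delta_0 / A_k, with Delta_0 estimated as in (112).
    A_log holds the guarantees A_k recorded at runtime, k >= 1."""
    delta0 = A0 * (F_x0 - F_opt) + 0.5 * gamma0 * dist0_sq
    return [delta0 / A_k for A_k in A_log]
```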


[Figure 1 appears here, with six panels plotting F(x_k) − F(x^*) for FISTA-BT, AMGS, FISTA-CP, MFISTA-CP, ACGM, and MACGM: (a) iteration convergence rates on LASSO; (b) computational convergence rates on LASSO; (c) iteration convergence rates on NNLS; (d) computational convergence rates on NNLS; (e) iteration convergence rates on L1LR; (f) computational convergence rates on L1LR.]

Fig. 1. Convergence results of FISTA with backtracking (FISTA-BT), AMGS, FISTA-CP, monotone FISTA-CP (MFISTA-CP), non-monotone ACGM and monotone ACGM (MACGM) on the LASSO, NNLS, and L1LR non-strongly convex problems. Dots mark iterations preceding overshoots. At these iterations, the convergence behavior changes.


[Figure 2 appears here, with three panels plotting the LCE against WTU for FISTA-BT, AMGS, ACGM, and MACGM: (a) LASSO; (b) NNLS; (c) L1LR.]

Fig. 2. Line-search method LCE variation on LASSO, NNLS, and L1LR.

Table 4
Average LCEs of line-search methods on LASSO, NNLS, and L1LR

| Problem | L_f | Iterations | FISTA-BT | AMGS | ACGM | MACGM |
| LASSO | 1981.98 | 2000 | 1981.98 | 2202.66 | 1385.85 | 1303.70 |
| NNLS | 17.17 | 50 | 17.17 | 19.86 | 14.35 | 13.54 |
| L1LR | 518.79 | 200 | 518.79 | 246.56 | 80.76 | 79.12 |

For the smooth RR problem, the effectiveness of each tested algorithm is roughly given by the rate of increase of the accumulated weights (Figures 3(a) and 3(c)). In iterations, AMGS converges the fastest. However, in terms of WTU usage, it is the least effective of the methods designed to deal with strongly convex objectives. The reasons are the high cost of its iterations, its low asymptotic rate compared to ACGM and FISTA-CP, and the stringency of its damped relaxation criterion, which results in higher LCEs (on average) than ACGM (Figure 4(a) and Table 5). The computational convergence rate of BACGM is the best, followed by ACGM and FISTA-CP. This does not, however, correspond to the upper bounds (Figure 3(e)). While BACGM produces the largest accumulated weights A_k, the high value of the ISD term in ∆_0 causes BACGM to have poorer upper bounds than ACGM, except for the first iterations. In fact, the effectiveness of BACGM on this problem is exceptional, partly due to the regularity of the composite gradients. This regularity also ensures the monotonicity of BACGM, ACGM, and FISTA-CP. FISTA-BT does not exhibit linear convergence on this problem and is even slower than AMGS after 500 WTU, despite its lower line-search overhead and advantageous parameter choice L_0 = L_f.


[Figure 3 appears here, with six panels: (a) iteration convergence rates on RR; (b) iteration convergence rates on EN; (c) computational convergence rates on RR; (d) computational convergence rates on EN; (e) computational convergence rates (R) and upper bounds (U) on RR; (f) computational convergence rates (R) and upper bounds (U) on EN. Panels (a)-(d) plot F(x_k) − F(x^*) for FISTA-BT, AMGS, FISTA-CP, MFISTA-CP, ACGM, MACGM, BACGM, and BMACGM; panels (e) and (f) compare the rates (R) and upper bounds (U) of AMGS, ACGM, MACGM, BACGM, and BMACGM.]

Fig. 3. Convergence results of FISTA with backtracking (FISTA-BT), AMGS, FISTA-CP, monotone FISTA-CP (MFISTA-CP), non-monotone ACGM, monotone ACGM (MACGM), border-case non-monotone ACGM (BACGM), and border-case monotone ACGM (BMACGM) on the RR and EN strongly-convex problems. Dots mark iterations preceding overshoots.


[Figure 4 appears here, with two panels plotting the LCE against WTU for FISTA-BT, AMGS, ACGM, MACGM, BACGM, and BMACGM: (a) RR; (b) EN.]

Fig. 4. Line-search method LCE variation on RR and EN.

Table 5
Average LCEs of line-search methods on RR and EN

| Problem | L_f | Iterations | FISTA-BT | AMGS | ACGM | MACGM | BACGM | BMACGM |
| RR | 1963.66 | 350 | 1963.66 | 2022.73 | 1473.88 | 1473.88 | 1471.16 | 1471.16 |
| EN | 2846.02 | 150 | 2846.02 | 3023.47 | 2056.68 | 2003.09 | 2093.56 | 1998.12 |

On the less regular EN problem, ACGM leads all other methods in terms of both iteration and computational convergence rates (Figures 3(b) and 3(d)). The advantage of ACGM, especially over BACGM, is accurately reflected in the upper bounds (Figure 3(f)). However, convergence is much faster than the upper bounds would imply. Even FISTA-BT has a competitive rate, due to the small number of iterations (150) needed for high accuracy results. The ineffectiveness of AMGS on this problem is mostly due to its high LCEs (Figure 4(b) and Table 5). The proposed ACGM and its variants show comparable average LCEs. Here as well, monotonicity has a stabilizing effect and does not have a significant impact on the computational convergence rate.

6. Discussion. Our simulation results suggest that enforcing monotonicity in ACGM is generally beneficial in large-scale applications. It leads to a more predictable convergence rate and, provided that overshoots occur in only a small fraction of iterations, monotonicity has a negligible impact on the computational convergence rate as well. Our experimental results indicate that the frequency of overshoots generally decreases with problem size.

From a theoretical standpoint, the proposed method can be viewed as a unification of FGM and FISTA in their most common forms. Specifically, the fixed-step variant (L_k = L_f for all k ≥ 0) of ACGM in extrapolated form (Algorithm 3) is equivalent to both the monotone and non-monotone variants of FISTA-CP with the theoretically optimal step size τ^{FISTA-CP} = 1/L_f. Moreover, when µ = 0, the original non-monotone fixed-step ACGM coincides with the original formulation of FISTA in [11]. Adding monotonicity yields MFISTA [26]. Also for µ = 0, but without the fixed-step restriction, the original non-monotone ACGM in estimate sequence form reduces to the robust FISTA-like algorithm in [24], whereas in extrapolated form it is a valuable simplification of the method introduced in [23]. The original backtracking FISTA can be obtained in the same way as the variants in [23, 24] by further removing the ratio L_{k+1}/L_k from the t_{k+1} update in line 7 of Algorithm 3 (noting that µ_Ψ = 0 for µ = 0). Since backtracking FISTA relies on the assumption that this ratio is never


less than 1, the convergence analysis remains unaffected.

When dealing with differentiable objectives, we can assume without loss of generality that Ψ(x) = 0 for all x ∈ R^n. In what follows, we consider generalized non-monotone fixed-step ACGM in estimate sequence form, unless stated otherwise. By substituting the local upper bound functions u_{k+1} at every iteration k ≥ 0 with any functions that produce iterates satisfying the descent condition, which means in this context that

$f(x_{k+1}) \leq f(y_{k+1}) - \dfrac{1}{2 L_{k+1}} \|\nabla f(y_{k+1})\|_2^2,$

where x_{k+1} is given by line 7 of Pattern 1, we obtain the "general scheme of optimal method" in [4]. Both the monotone and non-monotone variants adhere to this scheme. The correspondence between Nesterov's notation in [4] and ours is, for all k ≥ 0, given by:

$\lambda_k^{\mathrm{FGM}} = \dfrac{A_0^{\mathrm{ACGM}}}{A_k^{\mathrm{ACGM}}}, \quad \phi_k^{\mathrm{FGM}}(x) = \dfrac{1}{A_k^{\mathrm{ACGM}}} \psi_k^{\mathrm{ACGM}}(x), \quad x \in \mathbb{R}^n,$

$y_k^{\mathrm{FGM}} = y_{k+1}^{\mathrm{ACGM}}, \quad \alpha_k^{\mathrm{FGM}} = \dfrac{a_{k+1}^{\mathrm{ACGM}}}{A_{k+1}^{\mathrm{ACGM}}} = \dfrac{1}{t_{k+1}^{\mathrm{ACGM}}}, \quad \gamma_k^{\mathrm{FGM}} = \dfrac{\gamma_k^{\mathrm{ACGM}}}{A_k^{\mathrm{ACGM}}}.$

The remaining state parameters are identical. Note that FGM makes the assumption that A_0^{ACGM} > 0, which is incompatible with the original specification of ACGM in [25]. With the above assumption, the non-monotone variant (Algorithm 2) is in fact identical to the "constant step scheme I" in [4]. Similarly, the extrapolated form of fixed-step non-monotone ACGM (Algorithm 3) corresponds exactly to the "constant step scheme II" in [4], while fixed-step border-case non-monotone ACGM (Algorithm 4) is identical to the "constant step scheme III" in [4]. We note that for both the RR and EN problems, regardless of the actual performance of BACGM, the convergence guarantees of BACGM are poorer than those of ACGM with A_0 = 0. This discrepancy in guarantees is supported theoretically because, in the most common applications, the ISD term in ∆_0 is large compared to the DST. This extends to the fixed-step setup and challenges the notion found in the literature (e.g., [38]) that for strongly convex functions, FGM and FISTA-CP are momentum methods that take the form of the "constant step scheme III" in [4]. In fact, the border-case form may constitute the poorest choice of parameters A_0 and γ_0 in many applications. Indeed, the worst-case results in Theorem 6 favor A_0 = 0.

The FGM variant in [22] is a particular case of non-monotone ACGM with variable step size (Algorithm 2) when the objective is non-strongly convex (µ = 0) and the step size search parameters are set to r_u^{ACGM} = 2 and r_d^{ACGM} = 0.5. The notation correspondence is as follows:

$x_{k+1,i}^{\mathrm{FGM}} = x_{k+1}^{\mathrm{ACGM}}, \quad y_{k,i}^{\mathrm{FGM}} = y_{k+1}^{\mathrm{ACGM}}, \quad a_{k,i}^{\mathrm{FGM}} = a_{k+1}^{\mathrm{ACGM}}, \quad 2^i L_f = L_{k+1}^{\mathrm{ACGM}}.$

The remaining parameters are identical.

Thus, by relaxing the assumption that A_0 = 0, we have devised a generalized variant of ACGM that effectively encompasses FGM [4], with its recently introduced variant [22], as well as the original FISTA [11], including its adaptive step-size variants [23, 24], the monotone version MFISTA [26], and the strongly convex extensions FISTA-CP [7] and scAPG [15]. A summary of how the above first-order methods relate to generalized ACGM is given in Table 6.


Table 6
FGM and FISTA, along with their common variants, can be considered instances of generalized ACGM with certain restrictions applied.

| Algorithm | Smooth objective | µ = 0 | µ > 0 | A_0 = 0 | A_0 > 0 | Fixed step size | Non-monotone |
| FGM [4] | yes | no | no | no | yes | yes | yes |
| FGM [22] | yes | yes | no | unclear | unclear | no | yes |
| FISTA [11] | no | yes | no | yes | no | partial | yes |
| FISTA [23, 24] | no | yes | no | yes | no | no | yes |
| MFISTA [26] | no | yes | no | yes | no | yes | no |
| scAPG [15] | no | no | yes | no | yes | no | yes |
| FISTA-CP [7] | no | no | no | no | no | yes | no |

Due to its adaptivity, generalized ACGM is not limited to the composite problem framework in Subsection 1.2. It is also guaranteed to converge on problems where the gradient of f is not globally Lipschitz continuous. The constituent function f needs to have a Lipschitz continuous gradient only in the region explored by the algorithm.

In terms of usability, generalized ACGM does not require a priori knowledge of the Lipschitz constant L_f, or even a lower estimate of it. Thus, the proposed method can be utilized without any quantitative knowledge of the problem. Lack of information does not hinder generalized ACGM more than any other primal first-order method, while additional information, such as the values of the strong convexity parameters µ_f and µ_Ψ or an accurate estimate L_0 of the curvature around x_0, leads to a performance increase that is unsurpassed in its class.

REFERENCES

[1] A. Nemirovski and D.-B. Yudin, Problem complexity and method efficiency in optimization. John Wiley & Sons, New York, NY, USA, 1983.
[2] Y. Nesterov, "Subgradient methods for huge-scale optimization problems," Math. Program., Ser. A, vol. 146, no. 1-2, pp. 275–297, 2014.
[3] ——, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Dokl. Math., vol. 27, no. 2, pp. 372–376, 1983.
[4] ——, Introductory Lectures on Convex Optimization. Applied Optimization, vol. 87. Kluwer Academic Publishers, Boston, MA, USA, 2004.
[5] S. R. Becker, E. J. Candes, and M. C. Grant, "Templates for convex cone problems with applications to sparse signal recovery," Math. Program. Comput., vol. 3, no. 3, pp. 165–218, 2011.
[6] S. Bubeck et al., "Convex optimization: Algorithms and complexity," Found. Trends Mach. Learn., vol. 8, no. 3-4, pp. 231–357, 2015.
[7] A. Chambolle and T. Pock, "An introduction to continuous optimization for imaging," Acta Numer., vol. 25, pp. 161–319, 2016.
[8] N. Parikh, S. P. Boyd et al., "Proximal algorithms," Found. Trends Optim., vol. 1, no. 3, pp. 127–239, 2014.
[9] K. Slavakis, G. B. Giannakis, and G. Mateos, "Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge," IEEE Signal Process. Mag., vol. 31, no. 5, pp. 18–31, Sep. 2014.
[10] Y. Nesterov, "Gradient methods for minimizing composite functions," Universite catholique de Louvain, CORE Discussion Papers, 2007/76, Sep. 2007.
[11] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
[12] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B. Methodol., vol. 58, no. 1, pp. 267–288, 1996.


[13] J. Mairal, "Optimization with first-order surrogate functions," in ICML, 2014, Atlanta, Georgia, USA, pp. 73–81.
[14] H. Lin, J. Mairal, and Z. Harchaoui, "A universal catalyst for first-order optimization," in NIPS, Dec. 2015, Montreal, Canada, pp. 3384–3392.
[15] Q. Lin and L. Xiao, "An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization," in ICML, 2014, pp. 73–81.
[16] B. O'Donoghue and E. Candes, "Adaptive restart for accelerated gradient schemes," Found. Comput. Math., vol. 15, no. 3, pp. 715–732, 2015.
[17] M. Yamagishi and I. Yamada, "Over-relaxation of the fast iterative shrinkage-thresholding algorithm with variable stepsize," Inverse Problems, vol. 27, no. 10, p. 105008, Sep. 2011.
[18] W. W. Hager, D. T. Phan, and H. Zhang, "Gradient-based methods for sparse recovery," SIAM J. Imaging Sci., vol. 4, no. 1, pp. 146–165, 2011.
[19] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, "A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation," SIAM J. Sci. Comput., vol. 32, no. 4, pp. 1832–1857, 2010.
[20] Z. Wen, W. Yin, H. Zhang, and D. Goldfarb, "On the convergence of an active-set method for l1 minimization," Optim. Methods Software, vol. 27, no. 6, pp. 1127–1146, 2012.
[21] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2479–2493, 2009.
[22] Y. Nesterov and S. Stich, "Efficiency of accelerated coordinate descent method on structured optimization problems," Universite catholique de Louvain, CORE Discussion Papers, 2016/03, Feb. 2016.
[23] K. Scheinberg, D. Goldfarb, and X. Bai, "Fast first-order methods for composite convex optimization with backtracking," Found. Comput. Math., vol. 14, no. 3, pp. 389–417, 2014.
[24] M. I. Florea and S. A. Vorobyov, "A robust FISTA-like algorithm," in Proc. of IEEE Intern. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, New Orleans, USA, pp. 4521–4525.
[25] ——, "An accelerated composite gradient method for large-scale composite objective problems," IEEE Trans. Signal Process., vol. 67, no. 2, pp. 444–459, Jan. 2019.
[26] A. Beck and M. Teboulle, "Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems," IEEE Trans. Image Process., vol. 18, no. 11, pp. 2419–2434, 2009.
[27] A. Chambolle and C. Dossal, "On the convergence of the iterates of the Fast Iterative Shrinkage/Thresholding Algorithm," J. Optim. Theory Appl., vol. 166, no. 3, pp. 968–982, 2015.
[28] B. Polyak, Introduction to optimization (translated from Russian). Translations Series in Mathematics and Engineering, Optimization Software, New York, NY, USA, 1987.
[29] A. Brown and M. C. Bartholomew-Biggs, "Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations," J. Optim. Theory Appl., vol. 62, no. 2, pp. 211–224, 1989.
[30] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[31] J. M. Ortega and W. C. Rheinboldt, Iterative solution of nonlinear equations in several variables. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1970.
[32] K. Lange, MM optimization algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2016.
[33] R. T. Rockafellar, Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.
[34] M. I. Florea, "Constructing accelerated algorithms for large-scale optimization," Ph.D. dissertation, Aalto University, School of Electrical Engineering, Helsinki, Finland, Oct. 2018.
[35] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. Morgan Kaufmann Publishers, San Francisco, CA, USA, 2011.
[36] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Stat. Soc. Ser. B. Methodol., vol. 67, no. 2, pp. 301–320, 2005.
[37] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[38] W. Su, S. Boyd, and E. J. Candes, "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights," J. Mach. Learn. Res., vol. 17, pp. 1–43, 2016.

