
Distributed Majorization-Minimization for Laplacian Regularized Problems

Jonathan Tuck, Member, IEEE, David Hallac, Member, IEEE, and Stephen Boyd, Fellow, IEEE

Abstract—We consider the problem of minimizing a block separable convex function (possibly nondifferentiable, and including constraints) plus Laplacian regularization, a problem that arises in applications including model fitting, regularizing stratified models, and multi-period portfolio optimization. We develop a distributed majorization-minimization method for this general problem, and derive a complete, self-contained, general, and simple proof of convergence. Our method is able to scale to very large problems, and we illustrate our approach on two applications, demonstrating its scalability and accuracy.

Index Terms—Convex optimization, distributed optimization, graphical networks, Laplacian regularization.

I. INTRODUCTION

Many applications, ranging from multi-period portfolio optimization [1] to joint covariance estimation [2], can be modeled as convex optimization problems with two objective terms, one that is block separable and the other a Laplacian regularization term [3]. The block separable term can be nondifferentiable and may include constraints. The Laplacian regularization term is quadratic, and penalizes differences between individual variable components. These types of problems arise in several domains, including signal processing [4], machine learning [5], and statistical estimation or data fitting problems with an underlying graph prior [6], [7]. As such, there is a need for scalable algorithms to efficiently solve these problems.

In this paper we develop a distributed method for minimizing a block-separable convex objective with Laplacian regularization. Our method is iterative; in each iteration a convex problem is solved for each block, and the variables are then shared with each block’s neighbors in the graph associated with the Laplacian term. Our method is an instance of a standard and well known general method, majorization-minimization (MM) [8], which recovers a wide variety of existing methods depending on the choice of majorization [9]. The advantage of MM over such methods as the alternating direction method of multipliers [10] and subgradient-based methods is that, given an appropriate majorizer, which in general is not hard to find [9], MM can exploit structure (such as separability) in such a way that it can give better performance than these other methods [11]. In this paper, we derive a diagonal quadratic majorizer of the given Laplacian objective term, which has the benefit of separability. This separability allows the minimization step in our MM algorithm to be carried out in parallel on a block-by-block basis. We develop a completely self-contained proof of convergence of our method, which relies on no further assumption than the existence of a solution. Finally, we apply our method to two separate applications, multi-period portfolio optimization and joint covariance estimation, demonstrating the scalable performance of our algorithm.

Manuscript received September 27, 2018; revised November 16, 2018; accepted November 27, 2018. Recommended by Associate Editor Tengfei Liu. (Corresponding author: Jonathan Tuck.)

Citation: J. Tuck, D. Hallac, and S. Boyd, “Distributed majorization-minimization for Laplacian regularized problems,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 45−52, Jan. 2019.

J. Tuck, D. Hallac, and S. Boyd are with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA (e-mails: {jtuck1, hallac, boyd}@stanford.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JAS.2019.1911321

A. Related Work

There has been extensive research on graph Laplacians [12]−[14] and Laplacian regularization [15]−[17], and on developing solvers specifically for use in optimization over graphs [18]. In addition, there has been much research done on the MM algorithm [6], [8], [9], [17], [19], [20], including interpreting other well studied algorithms, such as the concave-convex procedure and the expectation-maximization algorithm [21], [22], as special cases of MM. We are not aware of any previous work that applies the MM algorithm to Laplacian regularization.

There has also been much work on the two specific application examples that we consider. Multi-period portfolio optimization is studied in, for example, [1], [23], [24], although scalability remains an issue in these studies. Our second application example arises in signal processing, specifically the joint estimation of inverse covariance matrices, which has been studied and applied in many different contexts, such as cell signaling [25], [26], statistical learning [27], radar signal processing [28], and medical imaging [29]. Again, scalability here is either not referenced or is still an ongoing issue in these fields.

B. Outline

In Section II we set up our notation, and describe the problem of Laplacian regularized minimization. In Section III we show how to construct a diagonal quadratic majorizer of a weighted Laplacian quadratic form. In Section IV we describe our distributed MM algorithm, and give a complete and self-contained proof of convergence. Finally, in Section V we present numerical results for two applications which demonstrate the effectiveness of our method.


II. LAPLACIAN REGULARIZED MINIMIZATION

We consider the problem of minimizing a convex function plus Laplacian regularization,

minimize F(x) = f(x) + L(x),    (1)

with variable x ∈ R^n. Here f : R^n → R ∪ {∞} is a proper closed convex function [30], [31], and L : R^n → R is the Laplacian regularizer (or Dirichlet energy [32]) L(x) = (1/2)x^T Lx, where L is a weighted Laplacian matrix, i.e., L = L^T, L_ij ≤ 0 for i ≠ j, and L1 = 0, where 1 is the vector with all entries one [15]. Associating with L the graph with vertices 1, . . . , n, edges indexed by pairs (i, j) with i < j and L_ij ≠ 0, and (nonnegative) edge weights w_ij = −2L_ij, the Laplacian regularizer can be expressed as

L(z) = ∑_{(i,j)∈E} w_ij (z_i − z_j)^2.
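As a concrete illustration, the following is a minimal NumPy sketch (not from the paper's code release) that assembles a weighted Laplacian from an edge list using the common L = D − W construction and evaluates the regularizer L(x) = (1/2)x^T Lx; the edge list, weights, and helper name are made up for the example, and the paper's exact weight scaling may differ by a constant factor.

import numpy as np

def weighted_laplacian(n, edges, weights):
    """Build L with L_ij = -w_ij for each edge (i, j) and L_ii = sum of incident weights."""
    L = np.zeros((n, n))
    for (i, j), w in zip(edges, weights):
        L[i, j] -= w
        L[j, i] -= w
        L[i, i] += w
        L[j, j] += w
    return L

n = 4
edges = [(0, 1), (1, 2), (2, 3)]          # a simple chain graph
weights = [1.0, 2.0, 0.5]                 # nonnegative edge weights
L = weighted_laplacian(n, edges, weights)

x = np.array([1.0, 0.0, -1.0, 2.0])
reg = 0.5 * x @ L @ x                     # Laplacian regularization penalizes differences
print(L @ np.ones(n))                     # rows sum to zero: L1 = 0
print(reg)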

We refer to the problem (1) as the Laplacian regularized minimization problem (LRMP). LRMPs are convex optimization problems, which can be solved by a variety of methods, depending on the specific form of f [33], [34]. We will let F^⋆ denote the optimal value of the LRMP. Convex constraints can be incorporated into LRMP, by defining f to take value +∞ when the constraints are violated. Note in particular that we specifically do not assume that the function f is finite, or differentiable (let alone with Lipschitz gradient), or even that its domain has affine dimension n. In this paper we will make only one additional analytical assumption about the LRMP (1): its sublevel sets are bounded. This assumption implies that the LRMP is solvable, i.e., there exists at least one optimal point x^⋆, and therefore that its optimal value F^⋆ is finite.

A point x is optimal for the LRMP (1) if and only if there exists g ∈ R^n such that [30], [31]

g ∈ ∂f(x),   g + ∇L(x) = g + Lx = 0,    (2)

where ∂f(x) is the subdifferential of f at x [30], [35]. For g ∈ ∂f(x), we refer to

r = g + Lx

as the optimality residual for the LRMP (1). Our goal is to compute an x (and g ∈ ∂f(x)) for which the residual r is small.

We are interested in the case where f is block separable. We partition the variable x as x = (x_1, . . . , x_p), with x_i ∈ R^{n_i}, n_1 + · · · + n_p = n, and assume f has the form

f(x) = ∑_{i=1}^p f_i(x_i),

where f_i : R^{n_i} → R ∪ {∞} are closed convex proper functions.

The main contribution of this paper is a scalable and distributed method for solving LRMP in which each of the functions f_i is handled separately. More specifically, we will see that each iteration of our algorithm requires the evaluation of a diagonally scaled proximal operator [36] associated with each block function f_i, which can be done in parallel.

III. DIAGONAL QUADRATIC MAJORIZATION OF THE LAPLACIAN

Recall that a function L̂ : R^n × R^n → R is a majorizer of L : R^n → R if for all x and z, L̂(z; z) = L(z), and L̂(x; z) ≥ L(x) [8], [9]. In other words, the difference L̂(x; z) − L(x) is nonnegative, and zero when x = z.

We now show how to construct a quadratic majorizer of the Laplacian regularizer L. This construction is known [9], but we give the proof for completeness. Suppose L̂ = L̂^T satisfies L̂ ⪰ L, i.e., L̂ − L is positive semidefinite. The function

L̂(x; z) = (1/2)z^T Lz + z^T L(x − z) + (1/2)(x − z)^T L̂(x − z),    (3)

which is quadratic in x, is a majorizer of L. To see this, we note that

L̂(x; z) − L(x) = (1/2)z^T Lz + z^T L(x − z) + (1/2)(x − z)^T L̂(x − z) − (1/2)x^T Lx
               = (1/2)(x − z)^T (L̂ − L)(x − z),

which is always nonnegative, and zero when x = z.

In fact, every quadratic majorizer of L arises from this construction, for some L̂ ⪰ L. To see this we note that the difference L̂(x; z) − L(x) is a quadratic function of x that is nonnegative and zero when x = z, so it must have the form (1/2)(x − z)^T P(x − z) for some P = P^T ⪰ 0. It follows that L̂ has the form (3), with L̂ = P + L ⪰ L.

We now give a simple scheme for choosing L̂ in the diagonal quadratic majorizer. Suppose L̂ is diagonal,

L̂ = diag(α) = diag(α_1, . . . , α_n),

where α ∈ R^n. A simple sufficient condition for L̂ ⪰ L is α_i ≥ 2L_ii, i = 1, . . . , n. This follows from standard results for Laplacians [37], but it is simple to show directly. We note that for any z ∈ R^n, we have

z^T (L̂ − L)z = ∑_{i=1}^n (α_i − L_ii) z_i^2 + ∑_{i=1}^n ∑_{j≠i} (−L_ij) z_i z_j
             ≥ ∑_{i=1}^n L_ii z_i^2 + ∑_{i=1}^n ∑_{j≠i} L_ij |z_i||z_j|
             = |z|^T L|z| ≥ 0,

where the absolute value is elementwise. On the second line we use the inequalities α_i − L_ii ≥ L_ii and, for j ≠ i, −L_ij z_i z_j ≥ L_ij |z_i||z_j|, which follows since L_ij ≤ 0.

In our algorithm described below, we will require that L̂ ≻ L, i.e., L̂ − L is positive definite. This can be accomplished by choosing

α_i > 2L_ii,   i = 1, . . . , n.    (4)

There are many other methods for selecting α, some of which have additional properties. For example, we can choose α = 2λ_max(L)1, where λ_max(L) denotes the maximum eigenvalue of L. With this choice we have L̂ = 2λ_max(L)I. This diagonal majorization has all diagonal entries equal, i.e., it is a multiple of the identity.

Another choice (that we will encounter later in Section V-B) takes L̂ to be a block diagonal matrix, conformal with the partition of x, with each block component a (possibly different) multiple of the identity,

L̂ = diag(α_1 I_{n_1}, . . . , α_p I_{n_p}),    (5)

where we can take

α_i > max_{j∈N_i} 2L_jj,   i = 1, . . . , p,

where N_i is the index range for block i.
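To make the choice of α concrete, here is a small NumPy sketch (an illustration under our own naming, not the authors' code) that builds the diagonal majorizer L̂ = diag(α) using the strict condition (4), as well as the multiple-of-identity choice L̂ = 2λ_max(L)I, and checks positive (semi)definiteness of L̂ − L numerically.

import numpy as np

def diagonal_majorizer(L, margin=1e-6):
    """Return diagonal Lhat = diag(alpha) with alpha_i > 2*L_ii, so Lhat - L is positive definite."""
    alpha = 2.0 * np.diag(L) + margin
    return np.diag(alpha)

def identity_majorizer(L):
    """Return Lhat = 2*lambda_max(L)*I, a (non-strict) diagonal majorizer."""
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * lam_max * np.eye(L.shape[0])

# small example Laplacian for a chain graph with unit weights
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])

for Lhat in (diagonal_majorizer(L), identity_majorizer(L)):
    eigs = np.linalg.eigvalsh(Lhat - L)
    print(eigs.min() >= -1e-9)   # Lhat - L is positive semidefinite (definite for the first choice)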

IV. DISTRIBUTED MAJORIZATION-MINIMIZATION ALGORITHM

The majorization-minimization (MM) algorithm is an iterative algorithm that at each step minimizes a majorizer of the original function at the current iterate [9]. Since L̂, as constructed in Section III using (4), majorizes L, it follows that F̂ = f + L̂ majorizes F = f + L. The MM algorithm for minimizing F is then

x^{k+1} = argmin_x ( f(x) + L̂(x; x^k) ),    (6)

where the superscripts k and k + 1 denote the iteration counter. Note that since the matrix L̂ is positive definite, L̂(x; x^k) is strictly convex in x, so the argmin is unique.

a) Stopping criterion: The optimality condition for the update (6) is the existence of g^{k+1} ∈ R^n with

g^{k+1} ∈ ∂f(x^{k+1}),   g^{k+1} + ∇L̂(x^{k+1}; x^k) = 0.    (7)

From L̂(x; z) − L(x) = (1/2)(x − z)^T (L̂ − L)(x − z), we have

∇L̂(x^{k+1}; x^k) − ∇L(x^{k+1}) = (L̂ − L)(x^{k+1} − x^k).

Substituting this into (7) we get

g^{k+1} + ∇L(x^{k+1}) = (L̂ − L)(x^k − x^{k+1}).    (8)

Thus we see that

r^{k+1} = (L̂ − L)(x^k − x^{k+1})

is the optimality residual for x^{k+1}, i.e., the right-hand side of (2). We will use ‖r^{k+1}‖_2 ≤ ε, where ε > 0 is a tolerance, as our stopping criterion. This guarantees that on exit, x^{k+1} satisfies the optimality condition (2) within ε.

b) Absolute and relative tolerance: When the algorithm is used to solve problems in which x^⋆ or L vary widely in size, the tolerance ε is typically chosen as a combination of an absolute error ε_abs and a relative error ε_rel, for example,

ε = ε_abs + ε_rel(‖L̂ − L‖_F + ‖x‖_2),

where ‖ · ‖_F denotes the Frobenius norm.
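For instance, the combined tolerance might be computed as in the following small NumPy snippet (an illustrative sketch with made-up values, not code from the paper):

import numpy as np

eps_abs, eps_rel = 1e-5, 1e-3
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])          # example Laplacian
Lhat = np.diag(2 * np.diag(L) + 1e-6)    # diagonal majorizer satisfying (4)
x = np.array([0.3, -0.2, 1.5])           # current iterate (stand-in values)

eps = eps_abs + eps_rel * (np.linalg.norm(Lhat - L, "fro") + np.linalg.norm(x))
print(eps)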

c) Distributed implementation: The update (6) can be broken down into two steps. The first step requires multiplying by L, and in the other step, we carry out p independent minimizations in parallel. We partition the Laplacian matrix L into blocks L_ij, i, j = 1, . . . , p, conformal with the partition x = (x_1, . . . , x_p). (In a few places above, we used L_ij to denote the i, j entry of L, whereas here we use it to denote the i, j submatrix. This slight abuse of notation should not cause any confusion since the index ranges, and the dimensions, make it clear whether the entry, or submatrix, is meant.) We then observe that our majorizer (3) has the form

L̂(x; z) = ∑_{i=1}^p L̂_i(x_i; z) + c,

where c does not depend on x, and

L̂_i(x_i; z) = (1/2)(x_i − z_i)^T L̂_ii(x_i − z_i) + h_i^T x_i,

where z_i refers to the ith subvector of z, and h_i is the ith subvector of Lz,

h_i = L_ii z_i + ∑_{j≠i} L_ij z_j.

It follows that

F̂(x; x^k) = ∑_{i=1}^p ( f_i(x_i) + L̂_i(x_i; x^k) ) + c

is block separable.

Algorithm 1 Distributed majorization-minimization.

given Laplacian matrix L, and initial starting point x^0 in the feasible set of the problem, with f(x^0) < ∞.
Form majorizer matrix. Form diagonal L̂ with L̂ ≻ L (using (4)).
for k = 1, 2, . . .
  1. Compute linear term. Compute h^k = Lx^k.
  2. Update in parallel. For i = 1, . . . , p, update each x_i (in parallel) as

     x_i^{k+1} = argmin_{x_i} ( f_i(x_i) + (1/2)(x_i − x_i^k)^T L̂_ii(x_i − x_i^k) + (h_i^k)^T x_i ).

     Compute residual r^{k+1} = (L̂ − L)(x^k − x^{k+1}).
  3. Test stopping criterion. Quit if k ≥ 2 and ‖r^{k+1}‖_2 ≤ ε.

Step 1 couples the subvectors x_i^k; step 2 (the subproblem updates) is carried out in parallel for each i. We observe that the updates in step 2 are (diagonally scaled) proximal operator evaluations, i.e., they involve minimizing f_i plus a norm squared term, with diagonal quadratic norm; see, e.g., [36]. Our algorithm can thus be considered as a distributed proximal-based method, a method which has been extensively researched in signal processing [38], [39]. We also mention that as the algorithm converges (discussed in detail below), x_i^{k+1} − x_i^k → 0, which implies that the quadratic terms (1/2)(x_i − x_i^k)^T L̂_ii(x_i − x_i^k) and their gradients in the update asymptotically vanish; roughly speaking, they “go away” as the algorithm converges. We will see below, however, that these quadratic terms are critical to convergence of the algorithm.
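The following is a minimal, self-contained Python sketch of Algorithm 1 (our own illustrative naming, not the authors' released implementation), specialized to the simple case f_i(x_i) = (1/2)‖x_i − a_i‖_2^2, for which each block update has a closed form; a real deployment would replace solve_block with a per-block convex solve and run the block updates in parallel.

import numpy as np

def solve_block(a_i, Lhat_ii, h_i, x_i_k):
    # argmin over x_i of (1/2)||x_i - a_i||^2 + (1/2)(x_i - x_i^k)^T Lhat_ii (x_i - x_i^k) + h_i^T x_i
    A = np.eye(len(a_i)) + Lhat_ii
    return np.linalg.solve(A, a_i + Lhat_ii @ x_i_k - h_i)

def mm_lrmp(L, blocks, a, eps=1e-8, max_iter=500):
    """Distributed MM for minimizing sum_i (1/2)||x_i - a_i||^2 + (1/2) x^T L x."""
    n = L.shape[0]
    Lhat = np.diag(2.0 * np.diag(L) + 1e-6)          # diagonal majorizer, condition (4)
    x = np.zeros(n)
    for k in range(max_iter):
        h = L @ x                                     # step 1: linear term
        x_new = x.copy()
        for idx in blocks:                            # step 2: block updates (parallelizable)
            x_new[idx] = solve_block(a[idx], Lhat[np.ix_(idx, idx)], h[idx], x[idx])
        r = (Lhat - L) @ (x - x_new)                  # optimality residual, see (8)
        x = x_new
        if k >= 1 and np.linalg.norm(r) <= eps:       # step 3: stopping criterion
            break
    return x

# small example: chain graph on 4 variables, two blocks of size 2
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
blocks = [np.array([0, 1]), np.array([2, 3])]
a = np.array([1.0, -1.0, 2.0, 0.0])
x_star = mm_lrmp(L, blocks, a)
print(x_star, np.allclose(np.linalg.solve(np.eye(4) + L, a), x_star, atol=1e-5))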

d) Warm start: Our algorithm supports warm starting by choosing the initial point x^0 as an estimate of the solution, for example, the solution of a closely related problem (e.g., a problem with slightly varying hyperparameters). Warm starting can decrease the number of iterations required to converge [40], [41]; we will see an example in Section V.

A. Convergence

There are many general convergence results for MM methods, but all of them require varying additional assumptions about the objective function [8], [9]. Additionally, the MM algorithmic framework does not admit a general convergence rate, as convergence rates depend heavily on the choice of majorizer [11], [42], [43]. In this section we give a complete self-contained proof of convergence for our algorithm, which requires no additional assumptions. We will show that F(x^k) − F^⋆ → 0 as k → ∞, and also that the stopping criterion eventually holds, i.e., (L̂ − L)(x^k − x^{k+1}) → 0.

We first observe a standard result that holds for all MM methods: the objective function is non-increasing. We have

F(x^{k+1}) ≤ F̂(x^{k+1}; x^k) ≤ F̂(x^k; x^k) = F(x^k),

where the first inequality holds since F̂ majorizes F, and the second since x^{k+1} minimizes F̂(x; x^k) over x. It follows that F(x^k) converges, and therefore F(x^k) − F(x^{k+1}) → 0. It also follows that the iterates x^k are bounded, since every iterate satisfies F(x^k) ≤ F(x^0), and we assume that the sublevel sets of F are bounded.

Since F is convex and g^{k+1} + ∇L(x^{k+1}) ∈ ∂F(x^{k+1}), we have (from the definition of subgradient)

F(x^k) ≥ F(x^{k+1}) + (g^{k+1} + ∇L(x^{k+1}))^T (x^k − x^{k+1}).

Using this and (8), we have

F(x^k) − F(x^{k+1}) ≥ (x^k − x^{k+1})^T (L̂ − L)(x^k − x^{k+1}).

Since F(x^k) − F(x^{k+1}) → 0 as k → ∞, and L̂ − L ≻ 0, we conclude that x^{k+1} − x^k → 0 as k → ∞. This implies that our stopping criterion will eventually hold.

Now we show that F(x^k) → F^⋆. Let x^⋆ be any optimal point. Then,

F^⋆ = F(x^⋆) ≥ F(x^{k+1}) + ((L̂ − L)x^k − L̂x^{k+1} + Lx^{k+1})^T (x^⋆ − x^{k+1})
             = F(x^{k+1}) + (x^k − x^{k+1})^T (L̂ − L)(x^⋆ − x^{k+1}).

So we have

F(x^{k+1}) − F^⋆ ≤ −(x^k − x^{k+1})^T (L̂ − L)(x^⋆ − x^{k+1}).

Since x^k − x^{k+1} → 0 as k → ∞, and x^{k+1} is bounded, the right-hand side converges to zero as k → ∞, and so we conclude F(x^{k+1}) − F^⋆ → 0 as k → ∞.

B. Variations

a) Arbitrary convex quadratic regularization: While our interest is in the case when L is Laplacian regularization, the algorithm (and convergence proof) work when L is any convex quadratic, i.e., L ⪰ 0, with the choice

α_i > ∑_{j=1}^n |L_ij|,   i = 1, . . . , n,

replacing the condition (4). In fact, the condition (4) is a special case of this condition, for a Laplacian matrix.
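A hedged NumPy illustration of this more general choice (our own helper name): α_i is set just above the ith absolute row sum of L, which reduces to α_i > 2L_ii when L is a Laplacian.

import numpy as np

def row_sum_alpha(L, margin=1e-6):
    """alpha_i > sum_j |L_ij|; valid for any symmetric L (here positive semidefinite)."""
    return np.sum(np.abs(L), axis=1) + margin

L = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])        # positive definite, but not a Laplacian (rows do not sum to 0)
alpha = row_sum_alpha(L)
print(np.linalg.eigvalsh(np.diag(alpha) - L).min() > 0)   # diag(alpha) - L is positive definite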

b) Nonconvex f: If the objective function in the LRMP is nonconvex, i.e., f is nonconvex, then the method proposed in this paper can be extended as a heuristic for solving (1) for nonconvex f. It is emphasized that the most the algorithm can guarantee is a local optimum, rather than a global optimum [33].

V. EXAMPLES

In this section we describe two applications of our distributed method for solving LRMP, and report numerical results demonstrating its convergence and performance. We run all numerical examples on a 32-core AMD machine with 64 hyperthreads, using the Pathos multiprocessing package to carry out computations in parallel [44]. Our code is available online at https://github.com/cvxgrp/mm_dist_lapl.

A. Multi-period Portfolio Optimization

We consider the problem of multi-period trading with quadratic transaction costs; see [1], [45] for more detail. We are to choose a portfolio of n holdings x_t ∈ R^n, for periods t = 1, . . . , T. We assume the nth holding is a riskless holding (i.e., cash). We choose the portfolios by solving the problem

minimize ∑_{t=1}^T ( f_t(x_t) + (1/2)(x_t − x_{t−1})^T D_t(x_t − x_{t−1}) ),    (9)

where f_t is the convex objective function (and constraints) for the portfolio in period t, and the D_t’s are diagonal positive definite matrices. The initial portfolio x_0 is given and constant; x_1, . . . , x_T are the variables. The quadratic term (1/2)(x_t − x_{t−1})^T D_t(x_t − x_{t−1}) is the transaction cost, i.e., the additional cost of trading to move from the previous portfolio x_{t−1} to the current one x_t. We will assume that there is no transaction cost associated with cash, i.e., (D_t)_{nn} = 0.

The objective function f_t typically includes negative expected return, one or more risk constraints or risk avoidance terms, shorting or borrow costs, and possibly other terms. It also can include constraints, such as the normalization 1^T x_t = 1 (in which case x_t are referred to as the portfolio weights), limits on the holdings or the leverage of the portfolio, or a specified final portfolio; see [1], [45] for more detail.

We can express the transaction cost as Laplacian regularization on x = (x_1, . . . , x_T) ∈ R^{Tn}, plus a quadratic term involving x_1,

∑_{t=1}^T (1/2)(x_t − x_{t−1})^T D_t(x_t − x_{t−1})
    = (1/2)x^T Lx + (1/2)x_1^T D_1 x_1 − (D_1 x_0)^T x_1 + (1/2)x_0^T D_1 x_0.

(Recall that the initial portfolio x_0 is given.) The Laplacian matrix L has block-tridiagonal form given by

L = [  D_2        −D_2          0        · · ·        0              0
      −D_2      D_2 + D_3      −D_3      · · ·        0              0
       0          −D_3       D_3 + D_4   · · ·        0              0
       ⋮            ⋮             ⋮         ⋱           ⋮              ⋮
       0            0             0       · · ·   D_{T−1} + D_T    −D_T
       0            0             0       · · ·      −D_T           D_T  ].

If we assume that the initial portfolio is cash, i.e., x_0 is zero except in the last component, then two of the three extra terms, (D_1 x_0)^T x_1 and (1/2)x_0^T D_1 x_0, both vanish. If we lump the extra terms that depend on x_1 into f_1, the multi-period portfolio optimization problem (9) has the LRMP form, with p = T and n_i = n. The total number of (scalar) variables is Tn. The graph associated with the Laplacian is the simple chain graph; roughly speaking, each portfolio x_t is linked to its predecessor x_{t−1} and its successor x_{t+1} by the transaction cost.

We can give a simple interpretation for the subproblem update in our method. The quadratic term of the subproblem update (which asymptotically goes away as we approach convergence) adds diagonal risk; the linear term h_t contributes an expected return to each asset. These additional risk and return terms come from both the preceding and the successor portfolios; they “encourage” the portfolios to move towards each other from one time period to the next, so as to reduce transaction cost. Each subproblem update minimizes negative risk-adjusted return, with the given return vector modified to encourage less trading.

1) Problem instance: We consider a problem with n = 1000 assets and T = 30 periods, so the total number of (scalar) variables is 30 000. The objective functions f_t include a negative expected return, a quadratic risk given by a factor (diagonal plus low rank) model with 50 factors [1], [46], and a linear shorting cost. We additionally impose the normalization constraint 1^T x_t = 1, so the portfolios x_t represent weights. The objective functions f_t have the form

f_t(x) = −μ_t^T x + γ x^T Σ_t x + s_t^T (x)_−,   t = 1, . . . , T − 1.    (10)

Here, γ > 0 is the risk aversion parameter, μ_t is the expected return, Σ_t is the return covariance, s_t is the (positive) shorting cost coefficient vector, and (x)_− = max{−x, 0}, which is taken elementwise. The covariance matrices Σ_t are diagonal plus a rank 50 (factor) term, with zero entries in the last row and column (which correspond to the cash asset). We choose all these coefficients and the diagonal transaction cost matrices D_t randomly, but with realistic values. In our problem instance, we choose all of these parameters independent of t, i.e., constant.

We take f_T to be the indicator function for the constraint x = e_n (i.e., f_T(x) = 0 if x = e_n, and ∞ otherwise), and the initial portfolio is all cash, x_0 = e_n. So in our multi-period portfolio optimization problem we are planning a sequence of portfolios that begin and end in cash.
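For reference, a small CVXPY sketch of a toy instance of problem (9)-(10) is given below (our own variable names and randomly generated data, at a much smaller scale than the paper's instance); this is a monolithic formulation of the kind used for the single-solver baseline below, not the distributed method.

import cvxpy as cp
import numpy as np

np.random.seed(0)
n, T, n_factors, gamma = 10, 5, 3, 1.0
mu = 0.01 * np.random.randn(n)                          # expected returns
F_load = 0.1 * np.random.randn(n, n_factors)            # factor loadings
F_load[-1, :] = 0.0                                     # cash asset has no factor exposure
d_risk = np.concatenate([np.full(n - 1, 0.01), [0.0]])  # idiosyncratic variances, zero for cash
s = np.full(n, 5e-4)                                    # shorting cost coefficients
d_tc = np.concatenate([np.full(n - 1, 1e-3), [0.0]])    # diagonal transaction costs, zero for cash
e_n = np.zeros(n); e_n[-1] = 1.0                        # all-cash portfolio

x = [cp.Variable(n) for _ in range(T)]
x_prev, obj, cons = e_n, 0, []
for t in range(T):
    if t < T - 1:
        risk = cp.sum_squares(F_load.T @ x[t]) + cp.sum(cp.multiply(d_risk, cp.square(x[t])))
        obj += -mu @ x[t] + gamma * risk + cp.sum(cp.multiply(s, cp.neg(x[t])))
        cons += [cp.sum(x[t]) == 1]
    else:
        cons += [x[t] == e_n]                           # final portfolio must be all cash
    obj += 0.5 * cp.sum(cp.multiply(d_tc, cp.square(x[t] - x_prev)))   # transaction cost
    x_prev = x[t]

prob = cp.Problem(cp.Minimize(obj), cons)
prob.solve()
print(prob.status, prob.value)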

We can see the interpretation of the subproblem updates in Section V-A by looking at the subproblem objective functions. Assuming we choose the diagonal elements of L̂ to be 3L_ii, we can rewrite the subproblem objective function (at time periods t = 2, . . . , T − 1 and iteration k) as

x_t^T ( γΣ_t + (3/2)(D_t + D_{t+1}) ) x_t − ( μ_t + D_t(2x_t^k − x_{t−1}^k) + D_{t+1}(2x_t^k − x_{t+1}^k) )^T x_t + c,

where c is some constant that does not depend on x_t. We see that a diagonal risk term is added, and the mean return μ_t is offset by terms that depend on the past, current, and future portfolios x_{t−1}^k, x_t^k, and x_{t+1}^k.

2) Numerical results: We first solve the problem instance using CVXPY [47] and solver OSQP [48], which is single-threaded. The solve time for this baseline method was 120 minutes.

We then solved the problem instance using our method. We initialized all portfolios as e_n, i.e., all cash, and used stopping criterion tolerance ε = 10^{−6}. Our algorithm took 8 iterations and 19 seconds to converge, and produced a solution that agreed very closely with the CVXPY/OSQP solution. Fig. 1 shows a plot of the residual norm ‖r^k‖_2 versus iteration k. This plot shows nearly linear convergence, with a reduction in residual norm by around a factor of 5 each iteration.

Fig. 1. Residual norm versus iteration for the multi-period portfolio optimization problem.

B. Laplacian Regularized Estimation

We consider estimation of parameters in a statistical model. We have a graph, with some data associated with each node; the goal is to fit a model to the data at each node, with Laplacian regularization used to make neighboring models similar.


The model parameter at node i is θ_i ∈ R^{n_i}. The vector of all node parameters is θ = (θ_1, . . . , θ_p) ∈ R^n, with n = n_1 + · · · + n_p. We choose θ by minimizing a local loss function and regularizer at each node, plus Laplacian regularization:

minimize ∑_{i=1}^p f_i(θ_i) + L(θ),

where f_i(θ_i) = ℓ_i(θ_i) + r_i(θ_i), where ℓ_i : R^{n_i} → R is the loss function (for example, the negative log-likelihood of θ_i) for the data at node i, and r_i : R^{n_i} → R is a regularizer on the parameter θ_i. Without the Laplacian term, the problem is separable, and corresponds to fitting each parameter separately by minimizing the local loss plus regularizer. The Laplacian term is an additional regularizer that encourages various entries in the parameter vectors to be close to each other.

a) Laplacian regularized covariance estimation: We now focus on a more specific case of this general problem, Laplacian regularized covariance estimation. At each node, we have some number of samples from a zero-mean Gaussian distribution on R^d, with covariance matrix Σ_i, assumed positive definite. We will estimate the natural parameters (as an exponential family), the inverse covariance matrices θ_i = Σ_i^{−1}. So here we take the node parameters θ_i to be symmetric positive definite d × d matrices, with n_i = d(d + 1)/2. (In the discussion of the general case above, θ_i is a vector in R^{n_i}; in the rest of this section, θ_i will denote a symmetric d × d matrix.)

The data samples at node i have empirical covariance S_i (which is not positive definite if there are fewer than d samples). The negative log-likelihood for node i is (up to a constant and a positive scale factor)

ℓ_i(θ_i) = Tr(S_i θ_i) − log det θ_i.

We use trace regularization on the parameter,

r_i(θ_i) = κ Tr(θ_i),

where κ > 0 is the local regularization hyperparameter. We note that we can minimize f_i(θ_i) = ℓ_i(θ_i) + r_i(θ_i) analytically; the minimizer is

θ_i = (S_i + κI)^{−1}.

(See, e.g., [27].)

The Laplacian regularization is used to encourage neighboring inverse covariance matrices in the given graph to be near each other. It has the specific form

L(θ_1, . . . , θ_p) = λ ∑_{(i,j)∈E} ‖θ_i − θ_j‖_F^2 = Tr(θ^T Lθ),

where the norm is the Frobenius norm, L is the associated weighted Laplacian matrix for the graph with vertices 1, . . . , p and edges E, and λ ≥ 0 is a hyperparameter that controls the amount of Laplacian regularization. When λ = 0, the estimation problem is separable, with analytical solution

θ_i = (S_i + κI)^{−1},   i = 1, . . . , p.

For λ → ∞, assuming the graph is connected, the estimation problem reduces to finding a single covariance matrix for all the data.

We choose the majorizer to be block diagonal with each block a multiple of the identity, as in (5). The update at each node in our algorithm can be expressed as minimizing over θ_i the function

Tr((S_i + H_i^k)θ_i) − log det θ_i + κ Tr(θ_i) + (α_i/2)‖θ_i − θ_i^k‖_F^2,

where

H^k = Lθ^k.

As an aside, we note that the optimal θ_i can be rewritten as

argmin_{θ_i} ( − log det θ_i + (α_i/2)‖θ_i − (θ_i^k − (1/α_i)(S_i + H_i^k + κI))‖_F^2 ),

i.e., the proximal operator of − log det with parameter 1/α_i.

This minimization can be carried out analytically. By taking the gradient of the subproblem objective function with respect to θ_i and equating to zero, we see that

S_i + H_i^k − θ_i^{−1} + κI + α_i(θ_i − θ_i^k) = 0,

or

θ_i^{−1} − α_iθ_i = S_i + H_i^k + κI − α_iθ_i^k.

This implies that θ_i and S_i + H_i^k + κI − α_iθ_i^k share the same eigenvectors [2], [26], [49]. Let Q_iΛ_iQ_i^T be the eigenvector decomposition of S_i + H_i^k + κI − α_iθ_i^k. We find that the eigenvalues of θ_i, v_{ij}, j = 1, . . . , d, are

v_{ij} = (1/(2α_i)) ( −(Λ_i)_{jj} + √((Λ_i)_{jj}^2 + 4α_i) ).

We have θ_i^{k+1} = Q_iV_iQ_i^T, where V_i = diag(v_{i1}, . . . , v_{id}). The computational cost per iteration is primarily in computing the eigenvector decomposition of S_i + H_i^k + κI − α_iθ_i^k, which has order d^3.
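This per-node update is easy to implement directly from the formulas above; here is a minimal NumPy sketch (our own function name, not the authors' code) that performs one block update given S_i, H_i^k, θ_i^k, κ, and α_i.

import numpy as np

def covariance_block_update(S_i, H_i, theta_i, kappa, alpha_i):
    """One node update: solve theta^{-1} - alpha_i*theta = S_i + H_i + kappa*I - alpha_i*theta_i."""
    d = S_i.shape[0]
    M = S_i + H_i + kappa * np.eye(d) - alpha_i * theta_i
    lam, Q = np.linalg.eigh(M)                      # M = Q diag(lam) Q^T
    v = (-lam + np.sqrt(lam**2 + 4.0 * alpha_i)) / (2.0 * alpha_i)   # eigenvalues of the new theta_i
    return (Q * v) @ Q.T                            # Q diag(v) Q^T

# quick sanity check on random data
np.random.seed(1)
d, kappa, alpha_i = 5, 0.1, 2.0
A = np.random.randn(d, 2 * d); S_i = A @ A.T / (2 * d)      # empirical covariance
H_i = np.zeros((d, d))                                       # e.g., no coupling (lambda = 0)
theta_i = np.eye(d)
theta_new = covariance_block_update(S_i, H_i, theta_i, kappa, alpha_i)
grad = S_i + H_i - np.linalg.inv(theta_new) + kappa * np.eye(d) + alpha_i * (theta_new - theta_i)
print(np.abs(grad).max() < 1e-8)                             # optimality condition holds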

1) Problem instance: The graph is a 15 × 15 grid, with 420 edges, so p = 225. The dimension of the data is d = 30, so each θ_i is a symmetric 30 × 30 matrix. The total number of (scalar) variables in our problem instance is 225 × 30(30 + 1)/2 = 104 625.

We generate the data for each node as follows. First, we choose the four corner covariance matrices randomly. The other 221 nodes are given covariance matrices using bilinear interpolation from the corner covariance matrices. We then generate 20 independent samples of data from each of the node distributions. (The samples are in R^{30}, so the empirical covariance matrices are singular.) In our problem instance we used hyperparameter values λ = 0.053 and κ = 0.08, which were chosen to give good estimation performance.

2) Numerical results: The problem instance is too large to reliably solve using CVXPY and the solver SCS [50], which stops after two hours with the status message that the computed solution may be inaccurate.

We solved the problem using our distributed method, with absolute tolerance ε_abs = 10^{−5} and relative tolerance ε_rel = 10^{−3}. The method took 54 iterations and 13 seconds to converge. Fig. 2 is a plot of the residual norm ‖r^k‖_F versus iteration k.


Fig. 2. Residual norm vs. iteration for the Laplacian regularized covariance estimation problem.

a) Regularization path via warm-start: To illustrate the advantage of warm-starting our algorithm, we compute the entire regularization path, i.e., the solutions of the problem for 100 values of λ, spaced logarithmically between 10^{−5} and 10^4.

Computing these 100 estimates by running the algorithm for each value of λ sequentially, without warm-start, requires 26 000 total iterations (an average of 260 iterations per choice of λ) and 81 minutes. Computing these 100 estimates by running the algorithm using warm-start, starting from λ = 10^{−5}, requires only 2000 total iterations (an average of 20 iterations per choice of λ) and 7.1 minutes. For the specific instance solved above, the algorithm converges in only 2.5 seconds and 10 iterations using warm-start, compared to 13 seconds and 54 iterations using cold-start.

Fig. 3. Root-mean-square error of the optimal estimates vs. λ.

While the point of this example is the algorithm that computes the estimates, we also explore the performance of the method. For each of the 100 values of λ we compute the root-mean-square error between our estimate of the inverse covariance and the true inverse covariance, which we know, since we generated them. Fig. 3 shows a plot of the root-mean-square error of our estimate versus the value of λ. This plot shows that the method works, i.e., produces better estimates of the inverse covariance matrices than handling them separately (small λ) or fitting one inverse covariance matrix for all nodes (large λ).

ACKNOWLEDGMENTS

The authors would like to thank Peter Stoica for his insights and comments on early drafts of this paper. We would also like to thank the Air Force Research Laboratory, and in particular Muralidhar Rangaswamy, for discussions of covariance estimation arising in radar signal processing.

REFERENCES

[1] S. Boyd, E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth, “Multi-period trading via convex optimization,” Foundations and Trends in Optimization, vol. 3, no. 1, pp. 1−76, Apr. 2017.

[2] D. Hallac, Y. Park, S. Boyd, and J. Leskovec, “Network inference via the time-varying graphical lasso,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 205−213, 2017.

[3] M. Yin, J. Gao, and Z. Lin, “Laplacian regularized low-rank representation and its applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 504−517, Mar. 2016.

[4] J. Pang and G. Cheung, “Graph Laplacian regularization for image denoising: Analysis in the continuous domain,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1770−1785, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TIP.2017.2651400

[5] D. Slepcev and M. Thorpe, “Analysis of p-Laplacian regularization in semi-supervised learning,” arXiv preprint, Oct. 2017.

[6] R. K. Ando and T. Zhang, “Learning on graph with Laplacian regularization,” Conference on Neural Information Processing Systems, 2006.

[7] S. Melacci and M. Belkin, “Laplacian support vector machines trained in the primal,” Journal of Machine Learning Research, vol. 12, pp. 1149−1184, Jul. 2011.

[8] K. Lange, MM Optimization Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2016.

[9] Y. Sun, P. Babu, and D. Palomar, “Majorization-minimization algorithms in signal processing, communications, and machine learning,” IEEE Transactions on Signal Processing, vol. 65, no. 3, pp. 794−816, 2017.

[10] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1−122, Jan. 2011.

[11] J. de Leeuw and K. Lange, “Sharp quadratic majorization in one dimension,” Computational Statistics & Data Analysis, vol. 53, no. 7, pp. 2471−2484, 2009. [Online]. Available: https://doi.org/10.1016/j.csda.2009.01.002

[12] C. Zhang, D. Florencio, and P. Chou, “Graph signal processing – a probabilistic framework,” Tech. Rep., Apr. 2015. [Online]. Available: https://www.microsoft.com/en-us/research/publication/graph-signal-processing-a-probabilistic-framework/

[13] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning Laplacian matrix in smooth graph signal representations,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160−6173, Dec. 2016. [Online]. Available: https://doi.org/10.1109/TSP.2016.2602809

[14] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under Laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825−841, Sept. 2017.

[15] C. Godsil and G. Royle, The Laplacian of a Graph. Springer, 2001.

[16] K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul, “Graph Laplacian regularization for large-scale semidefinite programming,” in Advances in Neural Information Processing Systems 19. MIT Press, 2007, pp. 1489−1496.

[17] M. Razaviyayn, M. Hong, and Z. Luo, “A unified convergence analysis of block successive minimization methods for nonsmooth optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126−1153, 2013.


[18] D. Hallac, C. Wong, S. Diamond, A. Sharang, R. Sosic, S. Boyd, and J. Leskovec, “SnapVX: A network-based convex optimization solver,” Journal of Machine Learning Research, vol. 18, 2017.

[19] D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” The American Statistician, vol. 58, no. 1, pp. 30−37, 2004. [Online]. Available: https://doi.org/10.1198/0003130042836

[20] M. W. Jacobson and J. A. Fessler, “An expanded theoretical treatment of iteration-dependent majorize-minimize algorithms,” IEEE Transactions on Image Processing, vol. 16, no. 10, pp. 2411−2422, Oct. 2007.

[21] A. L. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural Computation, vol. 15, no. 4, pp. 915−936, Apr. 2003.

[22] T. T. Wu and K. Lange, “The MM alternative to EM,” Statistical Science, vol. 25, no. 4, pp. 492−505, Nov. 2010.

[23] R. Almgren and N. Chriss, “Optimal execution of portfolio transactions,” Journal of Risk, pp. 5−39, 2000.

[24] J. Skaf and S. Boyd, “Multi-period portfolio optimization with constraints and transaction costs,” 2009.

[25] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432−441, Aug. 2008.

[26] P. Danaher, P. Wang, and D. Witten, “The joint graphical lasso for inverse covariance estimation across multiple classes,” Journal of the Royal Statistical Society, vol. 76, no. 2, pp. 373−397, 2014.

[27] O. Banerjee, L. El Ghaoui, and A. d’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, pp. 485−516, 2008.

[28] I. Soloveychik and A. Wiesel, “Joint estimation of inverse covariance matrices lying in an unknown subspace,” IEEE Transactions on Signal Processing, vol. 65, no. 9, pp. 2379−2388, 2017.

[29] E. Levitan and G. T. Herman, “A maximum a posteriori probability expectation maximization algorithm for image reconstruction in emission tomography,” IEEE Transactions on Medical Imaging, vol. 6, no. 3, pp. 185−192, Sept. 1987.

[30] R. T. Rockafellar, Convex Analysis, ser. Princeton Mathematical Series. Princeton University Press, 1970.

[31] J. M. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2000.

[32] L. C. Evans, Partial Differential Equations. Providence, RI: American Mathematical Society, 2010.

[33] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[34] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2006.

[35] F. H. Clarke, Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1990.

[36] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127−239, Jan. 2014.

[37] B. Bollobas, Modern Graph Theory, ser. Graduate Texts in Mathematics. Heidelberg: Springer, 1998.

[38] P. Combettes and J. Pesquet, Proximal Splitting Methods in Signal Processing. New York, NY: Springer New York, 2011, pp. 185−212. [Online]. Available: https://doi.org/10.1007/978-1-4419-9569-8_10

[39] M. Burger, A. Sawatzky, and G. Steidl, First Order Algorithms in Variational Image Processing. Cham: Springer International Publishing, 2016, pp. 345−407. [Online]. Available: https://doi.org/10.1007/978-3-319-41589-5_10

[40] E. Yildirim and S. Wright, “Warm-start strategies in interior-point methods for linear programming,” SIAM Journal on Optimization, vol. 12, no. 3, pp. 782−810, 2002. [Online]. Available: https://doi.org/10.1137/S1052623400369235

[41] Y. Wang and S. Boyd, “Fast model predictive control using online optimization,” IEEE Transactions on Control Systems Technology, vol. 18, no. 3, pp. 267−278, Mar. 2010.

[42] D. Kim, D. Pal, J.-B. Thibault, and J. A. Fessler, “Accelerating ordered subsets image reconstruction for X-ray CT using spatially non-uniform optimization transfer,” IEEE Transactions on Medical Imaging, vol. 32, no. 11, pp. 1965−1978, Nov. 2013.

[43] M. J. Muckley, D. C. Noll, and J. A. Fessler, “Fast parallel MR image reconstruction via B1-based, adaptive restart, iterative soft thresholding algorithms (BARISTA),” IEEE Transactions on Medical Imaging, vol. 34, no. 2, pp. 578−588, Feb. 2015.

[44] M. McKerns, “Pathos multiprocessing,” July 2017. [Online]. Available: https://pypi.python.org/pypi/pathos

[45] S. Boyd, M. T. Mueller, B. O’Donoghue, and Y. Wang, “Performance bounds and suboptimal policies for multi-period investment,” Foundations and Trends in Optimization, vol. 1, no. 1, pp. 1−72, Jan. 2014.

[46] G. Chamberlain and M. Rothschild, “Arbitrage, factor structure, and mean-variance analysis on large asset markets,” Econometrica, vol. 51, no. 5, pp. 1281−1304, 1983. [Online]. Available: http://www.jstor.org/stable/1912275

[47] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1−5, 2016.

[48] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd, “OSQP: An operator splitting solver for quadratic programs,” arXiv e-prints, Nov. 2017.

[49] D. M. Witten and R. Tibshirani, “Covariance-regularized regression and classification for high dimensional problems,” Journal of the Royal Statistical Society, vol. 71, no. 3, pp. 615−636, 2009.

[50] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd, “Conic optimization via operator splitting and homogeneous self-dual embedding,” Journal of Optimization Theory and Applications, vol. 169, no. 3, pp. 1042−1068, June 2016.

Jonathan Tuck is a Ph.D. candidate at Stanford University in the Electrical Engineering Department. He received his B.S. degree in electrical engineering from the Georgia Institute of Technology in 2016. His research interests include distributed convex optimization algorithms in machine learning, signal processing, and finance.

David Hallac is a Ph.D. candidate at Stanford University in the Electrical Engineering Department. He received his M.S. degree from Stanford in 2015 and his B.S. degree from the University of Pennsylvania in 2013. His research interests include distributed convex optimization and machine learning on time series data.

Stephen Boyd (F’99) is the Samsung Professor of Engineering, and Professor of Electrical Engineering in the Information Systems Laboratory at Stanford University, with courtesy appointments in Computer Science and Management Science and Engineering. He received the A.B. degree in mathematics from Harvard University in 1980, and the Ph.D. in electrical engineering and computer science from the University of California, Berkeley, in 1985, and then joined the faculty at Stanford. His current research focus is on convex optimization applications in control, signal processing, machine learning, finance, and circuit design.

