IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 63, NO. 22, NOVEMBER 15, 2015 6013
A Proximal Gradient Algorithm for Decentralized Composite Optimization
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin
Abstract—This paper proposes a decentralized algorithm for solving a consensus optimization problem defined in a static networked multi-agent system, where the local objective functions have the smooth+nonsmooth composite form. Examples of such problems include decentralized constrained quadratic programming and compressed sensing problems, as well as many regularization problems arising in inverse problems, signal processing, and machine learning, which have decentralized applications. This paper addresses the need for efficient decentralized algorithms that take advantage of proximal operations for the nonsmooth terms. We propose a proximal gradient exact first-order algorithm (PG-EXTRA) that utilizes the composite structure and has the best known convergence rate. It is a nontrivial extension to the recent algorithm EXTRA. At each iteration, each agent locally computes a gradient of the smooth part of its objective and a proximal map of the nonsmooth part, as well as exchanges information with its neighbors. The algorithm is "exact" in the sense that an exact consensus minimizer can be obtained with a fixed step size, whereas most previous methods must use diminishing step sizes. When the smooth part has Lipschitz gradients, PG-EXTRA has an ergodic convergence rate of O(1/k) in terms of the first-order optimality residual. When the smooth part vanishes, PG-EXTRA reduces to P-EXTRA, an algorithm without the gradients (so no "G" in the name), which has a slightly improved convergence rate of o(1/k) in a standard (non-ergodic) sense. Numerical experiments demonstrate the effectiveness of PG-EXTRA and validate our convergence results.

Index Terms—Composite objective, decentralized optimization, multi-agent network, nonsmooth, proximal, regularization.
I. INTRODUCTION
THIS paper considers a connected network of n agents that cooperatively solve the consensus optimization problem in the form

minimize_{x ∈ ℝ^p}  f̄(x) = (1/n) Σ_{i=1}^n f_i(x),  with f_i(x) = s_i(x) + r_i(x),   (1)
Manuscript received March 16, 2015; revised July 11, 2015; accepted July 13, 2015. Date of publication July 28, 2015; date of current version October 07, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sergios Theodoridis. The work of Q. Ling is supported in part by NSFC grants 61004137 and 61573331. The work of W. Yin is supported in part by NSF grant DMS-1317602. Part of this paper appears in the Fortieth International Conference on Acoustics, Speech, and Signal Processing, Brisbane, Australia, April 19–25, 2015 [1].
W. Shi, Q. Ling, and G. Wu are with the Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China (e-mail: [email protected]).
W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2015.2461520
where s_i and r_i are convex, s_i is differentiable, r_i is possibly nondifferentiable, and both are kept private by agent i. We say that the objective f_i = s_i + r_i has the smooth+nonsmooth composite structure. We develop an algorithm for all the agents in the network to obtain a consensual solution to problem (1). In the algorithm, each agent i locally computes the gradient of s_i and the so-called proximal map of r_i (see Section I-C for its definition) and performs one-hop communication with its neighbors. The iterations of the agents are synchronized.

The smooth+nonsmooth structure of the local objectives arises in a large number of signal processing, statistical inference, and machine learning problems. Specific examples include: (i) the geometric median problem, in which s_i vanishes and r_i is the ℓ2-norm [2], [3]; (ii) the compressed sensing problem, where s_i is the data-fidelity term, which is often differentiable, and r_i is a sparsity-promoting regularizer such as the ℓ1-norm [4], [5]; (iii) optimization problems with per-agent constraints, where s_i is a differentiable objective function of agent i and r_i is the indicator function of the constraint set of agent i, that is, r_i(x) = 0 if x satisfies the constraint and +∞ otherwise [6]–[8].
A. Background and Prior Art

Pioneered by the seminal work [9], [10] in the 1980s, decentralized optimization, control, and decision-making in networked multi-agent systems have attracted increasing interest in recent years due to the rapid development of communication and computation technologies [11]–[13]. Different from centralized processing, which requires a fusion center to collect data, decentralized approaches rely on information exchange among neighbors in the network and autonomous optimization by all the individual agents; they are robust to failure of critical relaying agents and scalable to the network size. These advantages lead to successful applications of decentralized optimization in robotic networks [14], [15], wireless sensor networks [4], [16], smart grids [17], [18], and distributed machine learning systems [19], [20], just to name a few. In these applications, problem (1) arises as a generic model.

The existing algorithms that solve problem (1) include primal-dual domain methods such as the decentralized alternating direction method of multipliers (DADMM) [16], [21] and primal domain methods including the distributed subgradient method (DSM) [22]. DADMM reformulates problem (1) in a form to which applying ADMM yields a decentralized algorithm. In this algorithm, each agent minimizes the sum of its local objective and a quadratic function that involves local variables from its neighbors. DADMM does not take advantage
of the smooth+nonsmooth structure. In DSM, each agent averages its local variable with those of its neighbors and moves along a negative subgradient direction of its local objective. DSM is computationally cheap but does not take advantage of the smooth+nonsmooth structure either. When the local objectives are Lipschitz differentiable, the recent exact first-order algorithm EXTRA [23] is much faster, yet it cannot handle nonsmooth terms.

The algorithms that consider smooth+nonsmooth objectives in the form of (1) include the following primal-domain methods: the (fast) distributed proximal gradient method (DPGM) [24] and the distributed iterative soft thresholding algorithm (DISTA) [25], [26]. Both DPGM and DISTA consist of a gradient step for the smooth part and a proximal step for the nonsmooth part. DPGM uses two loops, where the inner one is dedicated to consensus; the nonsmooth terms of the objective functions of all agents must be the same. DISTA is exclusively designed for ℓ1-minimization problems and has a similar restriction on the nonsmooth part. In addition, primal-dual type methods include [7], [27], which are based on DADMM. In this paper, we propose a simpler algorithm that does not explicitly use any dual variable. We establish convergence under weaker conditions and show that the residual of the first-order optimality condition reduces at the rate of O(1/k), where k is the iteration number.

When r_i ≡ 0, the proposed algorithm PG-EXTRA reduces to EXTRA [23]. Clearly, PG-EXTRA extends EXTRA to handle nonsmooth objective terms. This extension is not the same as the extension from the gradient method to the proximal-gradient method. As the reader will see, PG-EXTRA will have two interlaced sequences of iterates, whereas the proximal-gradient method just inherits the sequence of iterates in the gradient method.
B. Paper Organization and Contributions
Section II of this paper develops PG-EXTRA, which takes advantage of the smooth+nonsmooth structure of the objective functions. The details are given in Section II-A. The special cases of PG-EXTRA are discussed in Section II-B. In particular, it reduces to an algorithm P-EXTRA when all s_i ≡ 0 and the gradient (the "G") steps are no longer needed.

Section III establishes the convergence and derives the rates for PG-EXTRA and P-EXTRA. Under the Lipschitz assumption on ∇s_i, the iterates of PG-EXTRA converge to a solution and the first-order optimality condition asymptotically holds at an ergodic rate of O(1/k). The rate improves to non-ergodic o(1/k) for P-EXTRA.

The performance of PG-EXTRA and P-EXTRA is numerically evaluated in Section IV, on a decentralized geometric median problem (Section IV-A), a decentralized compressed sensing problem (Section IV-B), and a decentralized quadratic program (Section IV-C). Simulation results confirm our theoretical findings and validate the competitiveness of the proposed algorithms.

We have not yet found ways to further improve the convergence rates or to relax our algorithms to take stochastic or asynchronous steps with provable performance guarantees, though some numerical experiments with modified algorithms appeared to be successful.
C. Notation

Each agent i holds a local variable x_(i) ∈ ℝ^{1×p}, whose value at iteration k is denoted by x_(i)^k. We introduce an objective function that sums all the local terms:

f(x) ≜ Σ_{i=1}^n f_i(x_(i)),  where  x ≜ [x_(1); x_(2); …; x_(n)] ∈ ℝ^{n×p}.

The i-th row of x corresponds to agent i. We say that x is consensual if all of its rows are identical, i.e., x_(1) = x_(2) = … = x_(n). Similar to the definition of f, we define

s(x) ≜ Σ_{i=1}^n s_i(x_(i))  and  r(x) ≜ Σ_{i=1}^n r_i(x_(i)).

By definition, f = s + r. The gradient of s at x is given by

∇s(x) ≜ [∇s_1(x_(1)); ∇s_2(x_(2)); …; ∇s_n(x_(n))] ∈ ℝ^{n×p},

where, for each i, ∇s_i(x_(i)) is the gradient of s_i at x_(i). We let

∇̃r(x) ≜ [∇̃r_1(x_(1)); ∇̃r_2(x_(2)); …; ∇̃r_n(x_(n))] ∈ ℝ^{n×p}

denote a subgradient of r at x, where ∇̃r_i(x_(i)) is a subgradient of r_i at x_(i). The i-th rows of x, ∇s(x), and ∇̃r(x) belong to agent i.
In the proposed algorithm, each agent i needs to compute the proximal map of r_i, which is the solution of the subproblem

prox_{αr_i}(y) ≜ argmin_x  r_i(x) + (1/(2α)) ‖x − y‖²,

where y is a given point and α > 0 is a scalar. We assume that r_i is proximable, that is, the subproblem has a closed-form solution or can be solved at the complexity of O(p) or O(p log p). This is true when r_i is the ℓ1 norm, the composition of the ℓ1 norm and an orthogonal matrix, the ℓ2-norm, the indicator function of simple constraints, etc.

The Frobenius norm of a matrix is denoted as ‖·‖_F.
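As a quick illustration of the proximable property (our own sketch, not part of the paper), the snippet below checks the closed-form proximal map of r(x) = |x| — scalar soft-thresholding — against a brute-force evaluation of the defining subproblem on a grid:

```python
import numpy as np

def prox_numeric(r, y, alpha, grid):
    """Brute-force the subproblem argmin_x r(x) + (x - y)^2 / (2*alpha)
    over a 1-D grid (only to illustrate the definition)."""
    vals = [r(x) + (x - y) ** 2 / (2 * alpha) for x in grid]
    return grid[int(np.argmin(vals))]

def prox_abs(y, alpha):
    """Closed-form prox of r(x) = |x|: soft-thresholding, O(1) per entry."""
    return np.sign(y) * max(abs(y) - alpha, 0.0)

grid = np.linspace(-3.0, 3.0, 60001)  # grid spacing 1e-4
for y in (-1.7, 0.2, 2.5):
    assert abs(prox_numeric(abs, y, 0.5, grid) - prox_abs(y, 0.5)) < 1e-3
```

The closed form costs O(1) per entry, which is what makes "proximable" nonsmooth terms cheap inside the algorithm.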
Given a symmetric positive semidefinite matrix G, we define the G-norm ‖x‖_G ≜ √⟨x, Gx⟩. The largest singular value of a matrix A is denoted as σ_max(A). The largest and smallest eigenvalues of a symmetric matrix B are denoted as λ_max(B) and λ_min(B), respectively. The smallest nonzero eigenvalue of a symmetric positive semidefinite matrix G is denoted as λ̃_min(G); we have λ̃_min(G) ≥ λ_min(G) ≥ 0. Let null{A} denote the null space of A, and span{A} denote the subspace spanned by the columns of A.
II. ALGORITHM DEVELOPMENT
This section derives PG-EXTRA for problem (1) in Section II-A and discusses its special cases in Section II-B.
A. Proposed Algorithm: PG-EXTRA
PG-EXTRA starts from an arbitrary initial point x^0 ∈ ℝ^{n×p}, that is, each agent i holds an arbitrary point x_(i)^0. The next point x^1 is generated by a proximal gradient iteration

x^{1/2} = W x^0 − α ∇s(x^0),   (2a)
x^1 = argmin_x r(x) + (1/(2α)) ‖x − x^{1/2}‖²_F,   (2b)

where α > 0 is the step size and W = [w_ij] ∈ ℝ^{n×n} is the mixing matrix which we will discuss later. All the subsequent points x^{k+2}, k = 0, 1, …, are obtained through the following update:

x^{k+3/2} = W x^{k+1} + x^{k+1/2} − W̃ x^k − α [∇s(x^{k+1}) − ∇s(x^k)],   (3a)
x^{k+2} = argmin_x r(x) + (1/(2α)) ‖x − x^{k+3/2}‖²_F.   (3b)

In (3a), W̃ = [w̃_ij] ∈ ℝ^{n×n} is another mixing matrix, which we typically set as W̃ = (I + W)/2, though there are more general choices. With that typical choice, W̃ x^{k+1} can be easily computed from W x^{k+1}. PG-EXTRA is outlined in Algorithm 1, where the computation for all individual agents is presented.
Algorithm 1: PG-EXTRA

Set mixing matrices W = [w_ij] and W̃ = [w̃_ij]; choose step size α > 0;
1. All agents i pick arbitrary initial x_(i)^0 and do
  x_(i)^{1/2} = Σ_j w_ij x_(j)^0 − α ∇s_i(x_(i)^0);
  x_(i)^1 = argmin_x r_i(x) + (1/(2α)) ‖x − x_(i)^{1/2}‖²;
2. for k = 0, 1, …, all agents i do
  x_(i)^{k+3/2} = Σ_j w_ij x_(j)^{k+1} + x_(i)^{k+1/2} − Σ_j w̃_ij x_(j)^k − α [∇s_i(x_(i)^{k+1}) − ∇s_i(x_(i)^k)];
  x_(i)^{k+2} = argmin_x r_i(x) + (1/(2α)) ‖x − x_(i)^{k+3/2}‖²;
end for
Algorithm 2: P-EXTRA

Set mixing matrices W = [w_ij] and W̃ = [w̃_ij]; choose step size α > 0;
1. All agents i pick arbitrary initial x_(i)^0 and do
  x_(i)^{1/2} = Σ_j w_ij x_(j)^0;
  x_(i)^1 = argmin_x r_i(x) + (1/(2α)) ‖x − x_(i)^{1/2}‖²;
2. for k = 0, 1, …, all agents i do
  x_(i)^{k+3/2} = Σ_j w_ij x_(j)^{k+1} + x_(i)^{k+1/2} − Σ_j w̃_ij x_(j)^k;
  x_(i)^{k+2} = argmin_x r_i(x) + (1/(2α)) ‖x − x_(i)^{k+3/2}‖²;
end for
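For concreteness, here is a compact matrix-form sketch of Algorithm 1 (our own illustration; the function names and the test problem below are ours, not the paper's). Setting the gradient oracle to zero yields P-EXTRA (Algorithm 2), and an identity prox yields EXTRA:

```python
import numpy as np

def pg_extra(W, x0, grad_s, prox_r, alpha, iters=200):
    """Matrix-form sketch of PG-EXTRA.
    W: n-by-n mixing matrix; x0: n-by-p stacked initial point;
    grad_s(x): row-stacked gradients of the smooth parts s_i;
    prox_r(v, alpha): row-stacked proximal maps of the nonsmooth parts r_i."""
    W_tilde = (np.eye(W.shape[0]) + W) / 2              # typical choice
    x_prev = x0
    x_half = W @ x0 - alpha * grad_s(x0)                # (2a)
    x = prox_r(x_half, alpha)                           # (2b)
    for _ in range(iters):
        x_half = (W @ x + x_half - W_tilde @ x_prev
                  - alpha * (grad_s(x) - grad_s(x_prev)))   # (3a)
        x_prev, x = x, prox_r(x_half, alpha)                # (3b)
    return x

# Sanity check on a 3-agent path graph with Metropolis weights:
# s_i(x) = 0.5*(x - y_i)^2 and r_i = 0, so the consensus minimizer is mean(y).
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
y = np.array([[1.0], [2.0], [6.0]])
x = pg_extra(W, np.zeros((3, 1)), lambda x: x - y, lambda v, a: v, alpha=0.5)
# all rows approach the exact average despite the fixed step size
```

Note how both sequences x^k and x^{k+1/2} are carried across iterations, matching the "two interlaced sequences" remark in the introduction.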
We will require w_ij = w̃_ij = 0 if agents i and j are not neighbors and i ≠ j. Then, terms like Σ_j w_ij x_(j)^k and Σ_j w̃_ij x_(j)^k only involve x_(i)^k, as well as the x_(j)^k that are from the neighbors of agent i. All the other terms use only local information. Note that the algorithm maintains both of the sequences {x^k} and {x^{k+1/2}}. The latter sequence cannot be eliminated since x^{k+2} is generated from x^{k+3/2}, which explicitly depends on x^{k+1/2}. We impose the following assumptions on W and W̃.

Assumption 1 (Mixing Matrices): Consider a connected network G = {V, E} consisting of a set of agents V = {1, 2, …, n} and a set of undirected edges E. An unordered pair (i, j) ∈ E if agents i and j have a direct communication link. The mixing matrices W = [w_ij] ∈ ℝ^{n×n} and W̃ = [w̃_ij] ∈ ℝ^{n×n} satisfy:
1) (Decentralization property) If i ≠ j and (i, j) ∉ E, then w_ij = w̃_ij = 0.
2) (Symmetry property) W = Wᵀ, W̃ = W̃ᵀ.
3) (Null space property) null{W − W̃} = span{1}; null{I − W̃} ⊇ span{1}.
4) (Spectral property) W̃ ≻ 0 and (I + W)/2 ⪰ W̃.

The first two conditions together are standard (see [22], for example). The first condition alone ensures that communications occur only between neighboring agents. All four conditions together ensure that W satisfies λ_max(W) = 1 and that its other eigenvalues lie in (−1, 1). Typical choices of W can be found in [23], [28]. If a matrix W satisfies all the conditions, then W̃ = (I + W)/2 also satisfies the conditions.
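One common choice is the Metropolis constant edge weight rule, which is also used later in the experiments. The snippet below (our illustration; the example graph is ours) builds W for a given edge list and numerically checks the properties required above for W̃ = (I + W)/2:

```python
import numpy as np

def metropolis_weights(n, edges):
    """Metropolis constant edge weights: w_ij = 1/(1 + max(deg_i, deg_j))
    for neighbors, w_ii = 1 - sum_{j != i} w_ij, and zero otherwise."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)   # diagonal entries
    return W

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # a small connected graph
W = metropolis_weights(4, edges)
Wt = (np.eye(4) + W) / 2
assert np.allclose(W, W.T)                         # symmetry property
assert np.allclose(W.sum(axis=1), 1.0)             # span{1} in null{I - Wt}
eig = np.linalg.eigvalsh(W)
assert eig[-1] <= 1 + 1e-12 and eig[0] > -1        # other eigenvalues in (-1, 1)
assert np.linalg.eigvalsh(Wt)[0] > 0               # Wt positive definite
```

Because the Metropolis weights are diagonally loaded by construction, W̃ = (I + W)/2 is positive definite, as the spectral property demands.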
B. Special Cases: EXTRA and P-EXTRA
When the possibly-nondifferentiable terms r_i ≡ 0, we have x^1 = x^{1/2} in (2a) and (2b) and thus

x^1 = W x^0 − α ∇s(x^0).   (4)

In (3a) and (3b), we have x^{k+2} = x^{k+3/2} and thus

x^{k+2} = (I + W) x^{k+1} − W̃ x^k − α [∇s(x^{k+1}) − ∇s(x^k)].   (5)

The updates (4) and (5) are known as EXTRA, a recent algorithm for decentralized differentiable optimization [23].

When the differentiable terms s_i ≡ 0, PG-EXTRA reduces to P-EXTRA by removing all gradient computation, as given in Algorithm 2.
III. CONVERGENCE ANALYSIS
A. Preliminaries

Unless otherwise stated, the convergence results in this section are given under Assumptions 1–3.

Assumption 2 (Convex Objective Functions and the Smooth Parts Having Lipschitz Gradients): For all i ∈ {1, …, n}, the functions s_i and r_i are proper closed convex and satisfy

‖∇s_i(x) − ∇s_i(y)‖ ≤ L_i ‖x − y‖,  ∀ x, y,

where L_i ≥ 0 are constants.

To prove global convergence of PG-EXTRA, we require that the objective functions are convex and the smooth parts have Lipschitz gradients. The assumption of Lipschitz gradients holds in many decentralized detection and estimation applications; for example, when the smooth parts are least-squares terms. Even in some cases where the gradients are not Lipschitz over the whole space, this assumption remains to hold in a bounded region. Note that our sequence is provably bounded since its distance to the solution set is monotonic.

Following Assumption 2, s is proper closed convex and satisfies

‖∇s(x) − ∇s(y)‖_F ≤ L ‖x − y‖_F,  ∀ x, y ∈ ℝ^{n×p},

with constant L = max_i L_i.

Assumption 3 (Solution Existence): The set of solution(s) to problem (1) is nonempty.

We first give a lemma on the first-order optimality condition of problem (1).

Lemma 1 (First-Order Optimality Condition): Given mixing matrices W and W̃ and the economical-form singular value decomposition W̃ − W = V S Vᵀ, define U ≜ V S^{1/2} Vᵀ. Then, under Assumptions 1–3, the following two statements are equivalent:
• x* is consensual, that is, x*_(1) = x*_(2) = … = x*_(n), and every x*_(i) is optimal to problem (1);
• There exists q* = U p* for some p* ∈ ℝ^{n×p} and a subgradient ∇̃r(x*) such that

U q* + α [∇s(x*) + ∇̃r(x*)] = 0,   (6a)
U x* = 0.   (6b)

Proof: According to Assumption 1 and the definition of U, we have null{U} = null{W̃ − W} = span{1}. Hence x* is consensual if and only if (6b) holds.

Next, any row of the consensual x* is optimal if and only if 1ᵀ [∇s(x*) + ∇̃r(x*)] = 0. Since U is symmetric and U1 = 0, (6a) gives 1ᵀ [∇s(x*) + ∇̃r(x*)] = −(1/α) 1ᵀ U q* = 0. Conversely, if 1ᵀ [∇s(x*) + ∇̃r(x*)] = 0, then ∇s(x*) + ∇̃r(x*) ∈ span{U} follows from span{U} = (span{1})^⊥, and thus α [∇s(x*) + ∇̃r(x*)] = −U q* for some q* ∈ span{U}. Let p* satisfy U p* = q*. Then, q* = U p* and (6a) holds. ∎

Let x* and q* satisfy the optimality conditions (6a) and (6b).
Introduce an auxiliary sequence

q^k ≜ Σ_{t=0}^k U x^t.

The next lemma restates the updates of PG-EXTRA in terms of x^k, q^k, ∇s(x^k), and ∇̃r(x^{k+1}) for convergence analysis.

Lemma 2 (Recursive Relations of PG-EXTRA): In PG-EXTRA, the quadruple sequence {x^k, q^k, ∇s(x^k), ∇̃r(x^{k+1})} obeys

U q^k + α [∇s(x^k) + ∇̃r(x^{k+1})] = W̃ x^k − x^{k+1}   (7)

and

U (q^k − q*) + α [∇s(x^k) − ∇s(x*)] + α [∇̃r(x^{k+1}) − ∇̃r(x*)] = W̃ (x^k − x^{k+1}) − (I − W̃)(x^{k+1} − x*)   (8)

for any k ≥ 0.

Proof: Writing down the first-order optimality conditions of the subproblems (2b) and (3b) and eliminating the auxiliary sequence x^{k+1/2} via x^{k+1} = x^{k+1/2} − α ∇̃r(x^{k+1}), k ≥ 0, we have the following equivalent subgradient recursions with respect to x^k:

x^1 + α [∇̃r(x^1) + ∇s(x^0)] = W x^0,
x^{k+2} + α ∇̃r(x^{k+2}) = (I + W) x^{k+1} + α ∇̃r(x^{k+1}) − W̃ x^k − α [∇s(x^{k+1}) − ∇s(x^k)],  k ≥ 0.

Summing these subgradient recursions, we get

x^{k+1} + α [∇̃r(x^{k+1}) + ∇s(x^k)] = W̃ x^k − Σ_{t=0}^k (W̃ − W) x^t.   (9)

Using W̃ − W = U² and the decomposition Σ_{t=0}^k (W̃ − W) x^t = U q^k, (7) follows from (9) immediately. Since U x* = 0, W̃ x* = x*, and (6a), we have

U q* + α [∇s(x*) + ∇̃r(x*)] = W̃ x* − x*.   (10)

Subtracting (10) from (7) and adding W̃ x^{k+1} − W̃ x^{k+1} to the right-hand side, we obtain (8). ∎
The recursive relations of P-EXTRA are shown in the following corollary of Lemma 2.

Corollary 1 (Recursive Relations of P-EXTRA): In P-EXTRA, the sequence {x^k, q^k, ∇̃r(x^{k+1})} obeys

U q^k + α ∇̃r(x^{k+1}) = W̃ x^k − x^{k+1}   (11)

and

U (q^k − q*) + α [∇̃r(x^{k+1}) − ∇̃r(x*)] = W̃ (x^k − x^{k+1}) − (I − W̃)(x^{k+1} − x*)   (12)

for any k ≥ 0.

The convergence analysis of PG-EXTRA is based on the recursions (7) and (8), and that of P-EXTRA is based on (11) and (12). Define

z^k ≜ [q^k; x^k]  and  G ≜ [I, 0; 0, W̃].

For PG-EXTRA, we show that z^k converges to a solution z* = [q*; x*] and the successive iterative difference ‖z^k − z^{k+1}‖²_G converges to 0 at an ergodic O(1/k) rate (see Theorem 2); the same ergodic rates hold for the first-order optimality residuals, which are defined in Theorem 2. For the special case, P-EXTRA, z^k also converges to an optimal z*; the progress ‖z^k − z^{k+1}‖²_G and the first-order optimality residuals converge to 0 at improved non-ergodic o(1/k) rates (see Theorem 3).
B. Convergence and Convergence Rates of PG-EXTRA

1) Convergence of PG-EXTRA: We first give a theorem that shows the contractive property of PG-EXTRA. This theorem provides a sufficient condition for PG-EXTRA to converge to a solution. In addition, it prepares for analyzing the convergence rates of PG-EXTRA in Section III-B-2, and its limit case gives the contractive property of P-EXTRA (see Section III-C).

Theorem 1: Under Assumptions 1–3, if we set the step size α < 2λ_min(W̃)/L, then the sequence {z^k} generated by PG-EXTRA satisfies

‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G ≥ ζ ‖z^k − z^{k+1}‖²_G,   (13)

where ζ ≜ 1 − (αL)/(2λ_min(W̃)) > 0. Furthermore, z^k converges to an optimal z* = [q*; x*].
Proof: By Assumption 2, since s and r are convex and ∇s is Lipschitz continuous with constant L, we have

⟨x^k − x*, α ∇s(x^k) − α ∇s(x*)⟩ ≥ (α/L) ‖∇s(x^k) − ∇s(x*)‖²_F   (14)

and

⟨x^{k+1} − x*, α ∇̃r(x^{k+1}) − α ∇̃r(x*)⟩ ≥ 0.   (15)

Substituting (8) from Lemma 2 for the term α [∇s(x^k) − ∇s(x*)] + α [∇̃r(x^{k+1}) − ∇̃r(x*)], it follows from (14) and (15) that

(16)

For the terms on the right-hand side of (16), we have

(17)

(18)

and

(19)

Plugging (17)–(19) into (16), we have

(20)

Using the definitions of z^k, z*, and G, (20) is equivalent to

(21)

Applying the basic equality

2⟨a − b, a − c⟩_G = ‖a − b‖²_G + ‖a − c‖²_G − ‖b − c‖²_G   (22)

to (21), we have

(23)

By Assumption 1, in particular W̃ ≻ 0, we have from (23) that

‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G ≥ ζ ‖z^k − z^{k+1}‖²_G,   (24)

where ζ = 1 − (αL)/(2λ_min(W̃)). The last inequality holds since α < 2λ_min(W̃)/L.

It follows from (24) that for any solution z*, the sequence ‖z^k − z*‖_G is bounded and contractive. Therefore, {‖z^k − z*‖_G} is convergent as long as ζ > 0. The convergence of z^k to a solution z* follows from the standard analysis for contraction methods; see, for example, Theorem 3 in [29]. ∎

2) Ergodic Rates of PG-EXTRA: To establish the O(1/k) rate of convergence, we need the following proposition. Parts of it appeared in recent works [30]–[32].

Proposition 1: If a sequence {a^k} obeys: (1) a^k ≥ 0; and (2) Σ_{t=1}^∞ a^t < ∞, then we have: (i) lim_{k→∞} a^k = 0; (ii) (1/k) Σ_{t=1}^k a^t = O(1/k); (iii) min_{t≤k} a^t = o(1/k). If the sequence further obeys: (3) a^{k+1} ≤ a^k for all k, then, in addition, we have: (iv) a^k = o(1/k).

Proof: Part (i) is obvious. Let T^k ≜ Σ_{t=1}^k a^t. By the assumptions, T^k is uniformly bounded and obeys

(1/k) Σ_{t=1}^k a^t = T^k/k ≤ (Σ_{t=1}^∞ a^t)/k,

from which part (ii) follows. Since min_{t≤k} a^t is monotonically non-increasing, we have

(k/2) min_{t≤k} a^t ≤ Σ_{t=⌈k/2⌉}^k a^t.

This, with Σ_{t=⌈k/2⌉}^k a^t → 0 as k → ∞, gives us k · min_{t≤k} a^t → 0, or part (iii).

If {a^k} is further monotonically non-increasing, we have a^k ≤ min_{t≤k} a^t. Since min_{t≤k} a^t = o(1/k), we get part (iv). ∎

This proposition serves for the proof of Theorem 2, as well
as that of Theorem 3 appearing in Section III-C. We give the ergodic convergence rates of PG-EXTRA below.

Theorem 2: In the same setting of Theorem 1, the following rates hold for PG-EXTRA:
(i) Running-average successive difference:

(1/k) Σ_{t=1}^k ‖z^t − z^{t+1}‖²_G = O(1/k);

(ii) Running-best successive difference:

min_{t≤k} ‖z^t − z^{t+1}‖²_G = o(1/k);

(iii) Running-average optimality residuals:

(1/k) Σ_{t=1}^k ‖U q^t + α [∇s(x^t) + ∇̃r(x^{t+1})]‖²_F = O(1/k)  and  (1/k) Σ_{t=1}^k ‖U x^t‖²_F = O(1/k);

(iv) Running-best optimality residuals:

min_{t≤k} ‖U q^t + α [∇s(x^t) + ∇̃r(x^{t+1})]‖²_F = o(1/k)  and  min_{t≤k} ‖U x^t‖²_F = o(1/k).

Before proving the theorem, let us explain the rates. The first two rates, on the squared successive difference, are used to deduce the last two rates on the optimality residuals. Since our algorithm is not guaranteed to reduce the objective functions in a monotonic manner, we choose to establish our convergence rates in terms of optimality residuals, which show how quickly the residuals to the KKT system (6) reduce. Note that the rates are given on the standard squared quantities since they are summable and naturally appear in the convergence analysis. Notably, these rates match those on the squared successive difference and the optimality residual in the classical (centralized) gradient-descent method.
Proof: Parts (i) and (ii): Since ‖z^k − z^{k+1}‖²_G converges to 0 as k goes to ∞, we are able to sum (13) in Theorem 1 over k = 0 through ∞ and apply the telescopic cancellation, which yields

Σ_{k=0}^∞ ‖z^k − z^{k+1}‖²_G ≤ (1/ζ) ‖z^0 − z*‖²_G.   (25)

Then, the results follow from Proposition 1 immediately.

Parts (iii) and (iv): Using the basic inequality

‖A + B‖²_F ≤ 2 ‖A‖²_F + 2 ‖B‖²_F,

which holds for any matrices A and B of the same size, it follows that

(26)

Since W̃ − I and U are symmetric and null{U} = span{1} ⊆ null{W̃ − I}, there exists a bounded matrix B such that W̃ − I = B U. It follows from (26) that

(27)

As part (i) shows that ‖z^k − z^{k+1}‖²_G is summable, we have that ‖q^k − q^{k+1}‖²_F = ‖U x^{k+1}‖²_F and ‖x^k − x^{k+1}‖²_{W̃} are summable. From (27) and (25), we see that both optimality residual sequences ‖U q^k + α [∇s(x^k) + ∇̃r(x^{k+1})]‖²_F and ‖U x^k‖²_F are summable. Again, by Proposition 1, we have parts (iii) and (iv), in particular the o(1/k) rates of the running-best first-order optimality residuals. ∎

The monotonicity of ‖z^k − z^{k+1}‖²_G is an open question. If it holds, then o(1/k) convergence rates will apply to the sequence ‖z^k − z^{k+1}‖²_G itself.

Remark 1: The O(1/k) convergence rate implies that PG-EXTRA needs at most O(1/ε) iterations in order to reach an accuracy of ε (in terms of the first-order optimality residual). Assuming that the per-iteration computational cost of calculating the gradients and proximal maps is fixed, we obtain an overall computational complexity of O(1/ε).
C. Convergence Rates of P-EXTRA

Convergence of P-EXTRA follows from that of PG-EXTRA directly. Since P-EXTRA is a special case of PG-EXTRA and its updates are free of gradient steps, it enjoys slightly better convergence rates: non-ergodic o(1/k). Let us brief on our steps. First, as a special case of Theorem 1 obtained by letting L → 0, the sequence {‖z^k − z^{k+1}‖²_G} is summable. Second, the sequence {‖z^k − z^{k+1}‖²_G} of P-EXTRA is shown to be monotonic in Lemma 3. Based on these results, the non-ergodic o(1/k) convergence rates are then established for the successive difference and the first-order optimality residuals.

Lemma 3: Under the same assumptions of Theorem 1 except s_i ≡ 0, for any step size α > 0, the sequence {z^k} generated by P-EXTRA satisfies

‖z^{k+1} − z^{k+2}‖²_G ≤ ‖z^k − z^{k+1}‖²_G   (28)

for all k ≥ 0.

Proof: To simplify the description of the proof, define Δx^{k+1} ≜ x^k − x^{k+1}, Δq^{k+1} ≜ q^k − q^{k+1}, Δ∇̃r(x^{k+1}) ≜ ∇̃r(x^k) − ∇̃r(x^{k+1}), and Δz^{k+1} ≜ z^k − z^{k+1}.
By convexity of r in Assumption 2, we have

⟨Δx^{k+2}, Δ∇̃r(x^{k+2})⟩ ≥ 0.   (29)

Taking the difference of (11) at the k-th and (k + 1)-th iterations yields

(30)

Combining (29) and (30), it follows that

(31)

Using the definition of q^k, we have q^{k+1} − q^k = U x^{k+1}. Thus, we have

(32)

Substituting (32) into (31) yields

(33)

or equivalently

(34)

By applying the basic equality (22) to (34), we finally have

(35)

which implies (28) and completes the proof. ∎

The next theorem gives the o(1/k) convergence rates of P-EXTRA. Its proof is omitted as it is similar to that of Theorem 2. The only difference is that ‖z^k − z^{k+1}‖²_G has been shown monotonic in P-EXTRA. Invoking fact (iv) in Proposition 1, the rates are improved from ergodic O(1/k) to non-ergodic o(1/k).

Theorem 3: In the same setting of Lemma 3, the following rates hold for P-EXTRA:
(i) Successive difference:

‖z^k − z^{k+1}‖²_G = o(1/k);

(ii) First-order optimality residuals:

‖U q^k + α ∇̃r(x^{k+1})‖²_F = o(1/k)  and  ‖U x^k‖²_F = o(1/k).

Remark (Less Restriction on Step Size): We can see from Section III-C that P-EXTRA accepts a larger range of step sizes than PG-EXTRA; in particular, Lemma 3 holds for any α > 0.
IV. NUMERICAL EXPERIMENTS
In this section, we provide three numerical experiments, decentralized geometric median, decentralized compressed sensing, and decentralized quadratic programming, to demonstrate the effectiveness of the proposed algorithms. All the experiments are conducted over a randomly generated connected network, shown in Fig. 1.

In the numerical experiments, we use the relative error and the successive difference ‖x^k − x^{k+1}‖²_F as performance metrics; the former is a standard metric to assess the solution optimality and the latter evaluates the bounds of the rates proved in this paper. It is worth noting that all the decentralized algorithms numerically evaluated in this section have the same communication cost per iteration. Consequently,
Fig. 1. The underlying network for the experiments.
the amounts of information exchange over the network are proportional to their numbers of iterations.
A. Decentralized Geometric Median
Consider a decentralized geometric median problem. Each agent i holds a vector y_(i) ∈ ℝ^p, and all the agents collaboratively calculate the geometric median x ∈ ℝ^p of all y_(i). This task can be formulated as solving the following minimization problem:

minimize_x  f̄(x) = (1/n) Σ_{i=1}^n ‖x − y_(i)‖_2.

Computing decentralized geometric medians has interesting applications: (i) in [2], the multi-agent system locates a facility to minimize the cost of transportation in a decentralized manner; (ii) in cognitive robotics [33], a group of collaborative robots sets up a rally point such that the overall moving cost is minimal; (iii) in distributed robust Bayesian learning [34], the decentralized geometric median is also an important subproblem.

The above problem can further be generalized as the group least absolute deviations problem

minimize_x  (1/n) Σ_{i=1}^n ‖A_(i) x − y_(i)‖_2

(A_(i) is the measurement matrix on agent i), which can be considered as a variant of cooperative least squares estimation that is capable of detecting anomalous agents and keeping the system free of harmful effects caused by collapsed agents.

The geometric median problem is solved by P-EXTRA. The minimization subproblem

x_(i)^{k+2} = argmin_x ‖x − y_(i)‖_2 + (1/(2α)) ‖x − x_(i)^{k+3/2}‖²

in P-EXTRA has the explicit solution

x_(i)^{k+2} = y_(i) + max{0, 1 − α/‖u_(i)^k‖_2} · u_(i)^k,  where u_(i)^k ≜ x_(i)^{k+3/2} − y_(i).
Fig. 2. Relative error and successive difference of P-EXTRA, DSM, and DADMM in the decentralized geometric median problem.
We generate the data points y_(i) following the uniform distribution, and all compared algorithms start from the same initial points. We use the Metropolis constant edge weight matrix for W, set W̃ = (I + W)/2, and use a constant step size.

We compare P-EXTRA to DSM [22] and DADMM [21]. In DSM, at each iteration, each agent combines the local variables from its neighbors with the Metropolis constant edge weights and performs a step along the negative subgradient of its own objective with a diminishing step size. In DADMM, at each iteration, each agent updates its primal local copy by solving an optimization problem and then updates its local dual variable with simple vector operations. DADMM has a penalty coefficient as its parameter. We have hand-optimized this parameter and the step sizes for DADMM, DSM, and P-EXTRA.

The numerical results are illustrated in Fig. 2. They show that the relative errors of DADMM and P-EXTRA both drop to a low level within 100 iterations, while DSM still has a much larger relative error at 400 iterations. P-EXTRA is better than DSM because it utilizes the problem structure, which is ignored by DSM. In this case, both P-EXTRA and DADMM can be considered as proximal point algorithms and thus have similar convergence performance.
B. Decentralized Compressed Sensing
Consider a decentralized compressed sensing problem. Each agent i holds its own measurement equations, y_(i) = A_(i) x + e_(i), where y_(i) ∈ ℝ^{m_i} is a measurement vector, A_(i) ∈ ℝ^{m_i×p} is a sensing matrix, x ∈ ℝ^p is an unknown sparse signal, and e_(i) ∈ ℝ^{m_i} is an i.i.d. Gaussian noise vector. The goal is to estimate the sparse vector x. The number of total measurements Σ_i m_i is often less than the number of unknowns p, which fails the ordinary least squares. We instead solve an ℓ1-regularized least squares problem

minimize_x  f̄(x) = (1/n) Σ_{i=1}^n ( s_i(x) + r_i(x) ),

where

s_i(x) = (1/2) ‖A_(i) x − y_(i)‖²_2,  r_i(x) = λ_i ‖x‖_1,

and λ_i is the regularization parameter on agent i.

The decentralized compressed sensing problem is a special case of
the general distributed compressed sensing [35] in which the intra-signal correlation is consensus. This case appears specifically in cooperative spectrum sensing for cognitive radio networks. More of its applications can be found in [5] and the references therein.

Considering the smooth+nonsmooth structure, PG-EXTRA is applied here. The minimization subproblem

x_(i)^{k+2} = argmin_x λ_i ‖x‖_1 + (1/(2α)) ‖x − x_(i)^{k+3/2}‖²

in PG-EXTRA has the explicit solution

x_(i)^{k+2} = S_{αλ_i}(x_(i)^{k+3/2}),

which utilizes the soft-thresholding operator

S_τ(u) = sign(u) ⊙ max{|u| − τ, 0}

in an element-wise manner.
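For one agent, the two local computations that PG-EXTRA needs in this problem are then just a matrix-vector gradient and an element-wise shrinkage (our sketch; A_i, y_i, and lam stand for the agent's local data and regularization parameter):

```python
import numpy as np

def grad_s_i(x, A_i, y_i):
    """Gradient of s_i(x) = 0.5 * ||A_i @ x - y_i||^2."""
    return A_i.T @ (A_i @ x - y_i)

def soft_threshold(v, tau):
    """Prox of tau * ||x||_1, applied element-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# One prox step of PG-EXTRA: x_i^{k+2} = S_{alpha*lam}(x_i^{k+3/2})
alpha, lam = 0.1, 0.5
x_half = np.array([0.3, -0.01, 0.0, 1.2])   # stands in for x_i^{k+3/2}
x_next = soft_threshold(x_half, alpha * lam)  # threshold tau = 0.05
```

Entries smaller than αλ_i in magnitude are zeroed, which is how the iterates acquire sparsity.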
In this experiment, each agent holds its own measurements of the unknown signal x. Measurement matrices A_(i) and noise vectors e_(i) are randomly generated, with their elements following an i.i.d. Gaussian distribution, and have been normalized. The signal x is randomly generated and has a sparsity of 0.8 (containing 10 nonzero elements). All algorithms start from the same initial point. In PG-EXTRA, we use the Metropolis constant edge weight matrix for W, set W̃ = (I + W)/2, and use a constant step size.

We compare PG-EXTRA with the recent work DISTA [25], [26], which has two free parameters, including a temperature parameter. We have hand-optimized the parameters and show their effect in our experiment.

The numerical results are illustrated in Fig. 3. They show that the relative error of PG-EXTRA drops to a low level within 1000 iterations, while DISTA still has a much larger relative error when it is terminated at 4000 iterations. Both PG-EXTRA and DISTA utilize the smooth+nonsmooth structure of the problem, but PG-EXTRA achieves faster convergence.
C. Decentralized Quadratic Programming
We use decentralized quadratic programming as an example to show how PG-EXTRA solves a constrained optimization problem.

Fig. 3. Relative error and successive difference of PG-EXTRA and DISTA in the decentralized compressed sensing problem. The constant step size for PG-EXTRA is the one given in Theorem 1; the constant shown for DISTA is its parameter.

Each agent i has a local quadratic objective (1/2) xᵀ A_(i) x + b_(i)ᵀ x and a local linear constraint c_(i)ᵀ x ≤ d_(i), where the symmetric positive semidefinite matrix A_(i) ∈ ℝ^{p×p}, the vectors b_(i), c_(i) ∈ ℝ^p, and the scalar d_(i) are stored at agent i. The agents collaboratively minimize the average of the local objectives subject to all local constraints. The quadratic program is:

minimize_x  (1/n) Σ_{i=1}^n [ (1/2) xᵀ A_(i) x + b_(i)ᵀ x ]
subject to  c_(i)ᵀ x ≤ d_(i),  i = 1, …, n.
We recast it as

minimize_x  (1/n) Σ_{i=1}^n [ s_i(x) + r_i(x) ],   (36)

where s_i(x) = (1/2) xᵀ A_(i) x + b_(i)ᵀ x and

r_i(x) = 0 if c_(i)ᵀ x ≤ d_(i),  r_i(x) = +∞ otherwise,

is an indicator function. With this setting, (36) has the form of (1) and can be solved by PG-EXTRA. The minimization subproblem in PG-EXTRA
has an explicit solution. Indeed, for agent i, the solution is the projection of x_(i)^{k+3/2} onto the halfspace {x : c_(i)ᵀ x ≤ d_(i)}:

x_(i)^{k+2} = x_(i)^{k+3/2},  if c_(i)ᵀ x_(i)^{k+3/2} ≤ d_(i);
x_(i)^{k+2} = x_(i)^{k+3/2} − ((c_(i)ᵀ x_(i)^{k+3/2} − d_(i)) / ‖c_(i)‖²_2) c_(i),  otherwise.

Fig. 4. Relative error and successive difference of PG-EXTRA, DSPM1, and DSPM2 in the decentralized quadratic programming problem. The constant step size used for PG-EXTRA is the critical step size given in Theorem 1.
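This case analysis is the standard projection onto a halfspace; a small sketch (ours) for one agent's constraint data c_i, d_i:

```python
import numpy as np

def project_halfspace(v, c_i, d_i):
    """Prox of the indicator of {x : c_i^T x <= d_i}: leave feasible points
    unchanged; otherwise move along c_i onto the constraint boundary."""
    slack = c_i @ v - d_i
    if slack <= 0.0:
        return v.copy()
    return v - (slack / (c_i @ c_i)) * c_i
```

In PG-EXTRA for this problem, agent i simply sets x_(i)^{k+2} = project_halfspace(x_(i)^{k+3/2}, c_i, d_i) after its gradient-based mixing step.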
In this experiment, each A_(i) is generated by multiplying a random matrix, whose elements follow an i.i.d. Gaussian distribution, with its transpose. Each b_(i)'s elements are generated following an i.i.d. Gaussian distribution. We also randomly generate the constraint data c_(i) and d_(i), but guarantee that the feasible set is nonempty and that the optimal solution of problem (36) is different from that of the optimization problem with the same objective as (36) but without the constraints c_(i)ᵀ x ≤ d_(i). In this way we can make sure that at least one of the constraints is active. In PG-EXTRA, we use the Metropolis constant edge weight matrix for W, set W̃ = (I + W)/2, and use a constant step size. The numerical experiment results are illustrated in Fig. 4.

We compare PG-EXTRA with two distributed subgradient projection methods (DSPMs) [36] (denoted as DSPM1 and DSPM2 in Fig. 4). DSPM1 assumes that each agent knows all the constraints so that DSPM can be applied to solve (36); at each iteration, each agent takes a subgradient step and then projects the result onto the intersection of all constraint sets. The projection step employs the alternating projection method [8] and its computation cost is high. To address this issue, we modify DSPM1 to DSPM2, in which each agent projects only onto its own local constraint set. DSPM2 is likely to be convergent but has no theoretical guarantee. Both DSPM1 and DSPM2 use diminishing step sizes, and we hand-optimize the initial step sizes. It is shown in Fig. 4 that the relative error of PG-EXTRA
drops to a low level within 4000 iterations, while DSPM1 and DSPM2 still have much larger relative errors when they are terminated at 20000 iterations. PG-EXTRA is better than DSPM1 and DSPM2 because it utilizes the specific problem structure.
V. CONCLUSION

This paper attempts to solve a broad class of decentralized optimization problems with local objectives in the smooth+nonsmooth form by extending the recent method EXTRA, which integrates gradient descent with consensus averaging. We proposed PG-EXTRA, which inherits most properties of EXTRA and can take advantage of easy proximal operations on many nonsmooth functions. We proved its convergence and established its O(1/k) convergence rate. The preliminary numerical results demonstrate its competitiveness, especially over the subgradient and double-loop algorithms on the tested smooth+nonsmooth problems. It remains open to improve the rate with certain acceleration techniques, and to extend our method to asynchronous and stochastic settings.
REFERENCES

[1] W. Shi, Q. Ling, G. Wu, and W. Yin, "A proximal gradient algorithm for decentralized nondifferentiable optimization," in Proc. 40th IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2015, pp. 2964–2968.
[2] Foundations of Location Analysis, H. Eiselt and V. Marianov, Eds. New York, NY, USA: Springer, 2011.
[3] H. Lopuhaa and P. Rousseeuw, "Breakdown points of affine equivariant estimators of multivariate location and covariance matrices," Ann. Statist., vol. 19, no. 1, pp. 229–248, 1991.
[4] Q. Ling and Z. Tian, "Decentralized sparse signal recovery for compressive sleeping wireless sensor networks," IEEE Trans. Signal Process., vol. 58, no. 7, pp. 3816–3827, 2010.
[5] G. Mateos, J. Bazerque, and G. Giannakis, "Distributed sparse linear regression," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5262–5276, 2010.
[6] S. Lee and A. Nedic, "Distributed random projection algorithm for convex optimization," IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 221–229, 2013.
[7] T. Chang, M. Hong, and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Trans. Signal Process., vol. 63, no. 2, pp. 482–497, 2015.
[8] C. Pang, "Set intersection problems: Supporting hyperplanes and quadratic programming," Math. Program., vol. 149, no. 1–2, pp. 329–359, 2015.
[9] J. Tsitsiklis, "Problems in decentralized decision making and computation," Ph.D. dissertation, Electr. Eng. Comput. Sci. Dept., Massachusetts Inst. of Technol., Cambridge, MA, USA, 1984.
[10] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, 1986.
[11] B. Johansson, "On distributed optimization in networked systems," Ph.D. dissertation, School of Electrical Engineering, Royal Inst. of Technol., Stockholm, Sweden, 2008.
[12] Y. Cao, W. Yu, W. Ren, and G. Chen, "An overview of recent progress in the study of distributed multi-agent coordination," IEEE Trans. Ind. Informat., vol. 9, no. 1, pp. 427–438, 2013.
[13] A. Sayed, "Adaptation, learning, and optimization over networks," Found. Trends Mach. Learn., vol. 7, no. 4–5, pp. 311–801, 2014.
[14] F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks. Princeton, NJ, USA: Princeton Univ. Press, 2009.
[15] K. Zhou and S. Roumeliotis, "Multirobot active target tracking with combinations of relative observations," IEEE Trans. Robot., vol. 27, no. 4, pp. 678–695, 2010.
[16] I. Schizas, A. Ribeiro, and G. Giannakis, "Consensus in ad hoc WSNs with noisy links – Part I: Distributed estimation of deterministic signals," IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.
[17] V. Kekatos and G. Giannakis, "Distributed robust power system state estimation," IEEE Trans. Power Syst., vol. 28, no. 2, pp. 1617–1626, 2013.
[18] G. Giannakis, V. Kekatos, N. Gatsis, S. Kim, H. Zhu, and B. Wollenberg, "Monitoring and optimization for power grids: A signal processing perspective," IEEE Signal Process. Mag., vol. 30, no. 5, pp. 107–128, 2013.
[19] P. Forero, A. Cano, and G. Giannakis, "Consensus-based distributed support vector machines," J. Mach. Learn. Res., vol. 11, pp. 1663–1707, 2010.
[20] F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi, "Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties," IEEE Trans. Knowl. Data Eng., vol. 25, no. 11, pp. 2483–2493, 2013.
[21] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Trans. Signal Process., vol. 62, no. 7, pp. 1750–1761, 2014.
[22] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[23] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
[24] A. Chen, "Fast distributed first-order methods," M.S. thesis, Electr. Eng. Comput. Sci. Dept., Massachusetts Inst. of Technol., Cambridge, MA, USA, 2012.
[25] C. Ravazzi, S. Fosson, and E. Magli, "Distributed iterative thresholding for ℓ0/ℓ1-regularized linear inverse problems," IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 2081–2100, 2015.
[26] C. Ravazzi, S. M. Fosson, and E. Magli, "Distributed soft thresholding for sparse signal recovery," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2013, pp. 3429–3434.
[27] P. Bianchi, W. Hachem, and F. Iutzeler, "A stochastic primal-dual algorithm for distributed asynchronous composite optimization," in Proc. 2nd Global Conf. Signal Inf. Process. (GlobalSIP), 2014, pp. 732–736.
[28] S. Boyd, P. Diaconis, and L. Xiao, "Fastest mixing Markov chain on a graph," SIAM Rev., vol. 46, no. 4, pp. 667–689, 2004.
[29] B. He, "A new method for a class of linear variational inequalities," Math. Program., vol. 66, no. 1–3, pp. 137–144, 1994.
[30] W. Deng, M. Lai, Z. Peng, and W. Yin, "Parallel multi-block ADMM with o(1/k) convergence," UCLA CAM Rep. 13-64, 2013.
[31] D. Davis and W. Yin, "Convergence rate analysis of several splitting schemes," 2014, arXiv preprint arXiv:1406.4834.
[32] D. Davis and W. Yin, "Convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions," 2014, arXiv preprint arXiv:1407.5210.
[33] R. Ravichandran, G. Gordon, and S. Goldstein, "A scalable distributed algorithm for shape transformation in multi-robot systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2007, pp. 4188–4193.
[34] S. Minsker, S. Srivastava, L. Lin, and D. Dunson, "Scalable and robust Bayesian inference via the median posterior," in Proc. 31st Int. Conf. Mach. Learn. (ICML), 2014, pp. 1656–1664.
[35] D. Baron, M. Duarte, M. Wakin, S. Sarvotham, and R. Baraniuk, "Distributed compressive sensing," 2009, arXiv preprint arXiv:0901.3403.
[36] S. Ram, A. Nedic, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
Wei Shi received the B.E. degree in automation from University of Science and Technology of China in 2010. He is now pursuing his Ph.D. degree in control theory and control engineering in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.
Qing Ling received the B.E. degree in automation and the Ph.D. degree in control theory and control engineering from University of Science and Technology of China in 2001 and 2006, respectively. From 2006 to 2009, he was a Post-Doctoral Research Fellow in the Department of Electrical and Computer Engineering, Michigan Technological University. Since 2009, he has been an Associate Professor in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.
Gang Wu received the B.E. degree in automation and the M.S. degree in control theory and control engineering from University of Science and Technology of China in 1986 and 1989, respectively. Since 1991, he has been in the Department of Automation, University of Science and Technology of China, where he is now a Professor. His current research interests are advanced control and optimization of industrial processes.
Wotao Yin received the B.S. degree in mathematics and applied mathematics from Nanjing University in 2001, and the M.S. and Ph.D. degrees in operations research from Columbia University in 2003 and 2006, respectively. From 2006 to 2013, he was an Assistant Professor and then an Associate Professor in the Department of Computational and Applied Mathematics, Rice University. Since 2013, he has been a Professor in the Department of Mathematics, University of California, Los Angeles. His current research interest is large-scale decentralized/distributed optimization.