
1750 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 7, APRIL 1, 2014

On the Linear Convergence of the ADMM in Decentralized Consensus Optimization

Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin

Abstract—In decentralized consensus optimization, a connected network of agents collaboratively minimize the sum of their local objective functions over a common decision variable, with information exchange restricted to neighboring agents. To this end, one can first obtain a problem reformulation and then apply the alternating direction method of multipliers (ADMM). The method applies iterative computation at the individual agents and information exchange between the neighbors. This approach has been observed to converge quickly and deemed powerful. This paper establishes its linear convergence rate for the decentralized consensus optimization problem with strongly convex local objective functions. The theoretical convergence rate is explicitly given in terms of the network topology, the properties of the local objective functions, and the algorithm parameter. This result is not only a performance guarantee but also a guideline toward accelerating the ADMM convergence.

Index Terms—Decentralized consensus optimization, alternating direction method of multipliers (ADMM), linear convergence.

I. INTRODUCTION

RECENT advances in signal processing and control of networked multi-agent systems have led to much research interest in decentralized optimization [2]–[14]. Decentralized optimization problems arising in networked multi-agent systems include coordination of aircraft or vehicle networks [2]–[4], data processing of wireless sensor networks [5]–[10], spectrum sensing of cognitive radio networks [11], [12], state estimation and operation optimization of smart grids [13], [14], etc. In these scenarios, the data is collected and/or stored in a distributed manner; a fusion center is either disallowed or not economical. Consequently, any computing tasks must be

Manuscript received July 20, 2013; revised November 30, 2013; accepted January 20, 2014. Date of publication February 04, 2014; date of current version March 10, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Shuguang Cui. The work of W. Shi is supported by Chinese Scholarship Council grant 201306340046. The work of Q. Ling is supported by NSFC grant 61004137 and Chinese Scholarship Council grant 2011634506. The work of G. Wu is supported by MOF/MIIT/MOST grant BB2100100015. The work of W. Yin is supported by ARL and ARO grant W911NF-09-1-0383 and NSF grants DMS-0748839 and DMS-1317602. Part of this paper appeared in the Thirty-Eighth International Conference on Acoustics, Speech, and Signal Processing, Vancouver, Canada, May 26–31, 2013. (Corresponding author: Q. Ling.)

W. Shi, Q. Ling, K. Yuan, and G. Wu are with the Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China (e-mail: [email protected]).

W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2014.2304432

accomplished in a decentralized and collaborative manner by the agents. This approach can be powerful and efficient, as the computing tasks are distributed over all the agents and information exchange occurs only between the agents with direct communication links. There is no risk of central computation overload or network congestion.

In this paper, we focus on decentralized consensus optimization, an important class of decentralized optimization in which a network of n agents cooperatively solve

$$\min_{\tilde{x}} \; \sum_{i=1}^{n} f_i(\tilde{x}) \tag{1}$$

over a common optimization variable x̃ ∈ R^p, where f_i is the local objective function known by agent i. This formulation arises in averaging [4]–[6], learning [7], [8], and estimation [9]–[13] problems. Examples of f_i include least squares [4]–[6], regularized least squares [8], [10]–[12], as well as more general ones [7]. The values of x̃ can stand for the average temperature of a room [5], [6], frequency-domain occupancy of spectra [11], [12], states of a smart grid system [13], [14], and so on.

There exist several methods for decentralized consensus

optimization, including distributed subgradient descent algorithms [15]–[17], dual averaging methods [18], [19], and the alternating direction method of multipliers (ADMM) [8]–[10], [20], [21]. Among these algorithms, the ADMM demonstrates fast convergence in many applications, e.g., [8]–[10]. However, how fast it converges and what factors affect the rate are both unknown. This paper addresses these issues.

A. Our Contributions

Firstly, we establish the linear convergence rate of the ADMM that is applied to decentralized consensus optimization with strongly convex local objective functions. This theoretical result gives a performance guarantee for the ADMM and validates the observation in prior literature.

Secondly, we study how the network topology, the properties of local objective functions, and the algorithm parameter affect the convergence rate. The analysis provides guidelines for networking strategies, objective-function splitting strategies, and algorithm parameter settings to achieve faster convergence.

B. Related Work

Besides the ADMM, existing decentralized approaches for solving (1) include belief propagation [7], incremental optimization [22], subgradient descent [15]–[17], dual averaging [18], [19], etc. Belief propagation and incremental optimization require one to predefine a tree or loop structure in the network,

1053-587X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


SHI et al.: ON THE LINEAR CONVERGENCE OF THE ADMM IN DECENTRALIZED CONSENSUS OPTIMIZATION 1751

whereas the advantage of the ADMM, subgradient descent, and dual averaging is that they do not rely on any predefined structures. Subgradient descent and dual averaging work well for asynchronous networks but suffer from slow convergence. Indeed, for subgradient descent algorithms, [15] and [16] establish a sublinear convergence rate, in terms of the number of iterations k, to a neighborhood of the optimal solution when the local subgradients are bounded and the stepsize is fixed. Further assuming that the local objective functions are strongly convex, choosing a dynamic stepsize also leads to a sublinear rate [17]. Dual averaging methods using dynamic stepsizes likewise have sublinear rates, as proved in [18] and [19].

The decentralized ADMM approaches use synchronous steps

by all the agents but have much faster empirical convergence, as demonstrated in many applications [8]–[10]. However, existing convergence rate analysis of the ADMM is restricted to the classic, centralized computation. The centralized ADMM has a sublinear O(1/k) convergence rate for general convex optimization problems [23]. In [24] an ADMM with restricted stepsizes is proposed and proved to be linearly convergent for certain types of non-strongly convex objective functions. A recent paper [25] shows a linear convergence rate under a strongly convex assumption, and our paper extends the analysis tools therein to the decentralized regime. A notable work about convergence rate analysis is [20],

which proves the linear convergence rate of the ADMM applied to the average consensus problem, a special case of (1) in which f_i(x̃) = ‖x̃ − b_i‖², with b_i being a local measurement vector of agent i. Its analysis takes a state-transition equation approach, which is not applicable to the more general local objective functions considered in this paper.

C. Paper Organization and Notation

This paper is organized as follows. Section II reformulates the decentralized consensus optimization problem and develops an algorithm based on the ADMM. Section III analyzes the linear convergence rate of the ADMM and shows how to accelerate the convergence through tuning the algorithm parameter. Section IV provides extensive numerical experiments to validate the theoretical analysis in Section III. Section V concludes the paper.

In this paper we denote ‖a‖ as the Euclidean norm of a vector a and ⟨a, b⟩ as the inner product of two vectors a and b. Given a semidefinite matrix G with proper dimensions, the G-norm of a vector a is ‖a‖_G = √(aᵀGa). We let σ_max(·) be the operator that returns the largest singular value of a matrix and σ̃_min(·) be the one that returns the smallest nonzero singular value of a matrix.

We use two kinds of definitions of convergence, Q-linear convergence and R-linear convergence. We say that a sequence {u^k}, where the superscript k stands for the time index, Q-linearly converges to a point u* if there exists a number μ ∈ (0, 1) such that ‖u^{k+1} − u*‖ ≤ μ‖u^k − u*‖, with ‖·‖ being a vector norm. We say that a sequence {v^k} R-linearly converges to a point v* if ‖v^k − v*‖ ≤ C‖u^k − u*‖ for all k, where C > 0 and {u^k} Q-linearly converges to u*.
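To make the two definitions concrete, here is a small self-contained Python check on two toy sequences (the sequences and the rate 0.5 are illustrative choices, not from the paper): an oscillating sequence can fail the Q-linear ratio test at some steps while still being R-linearly dominated by a Q-linear sequence.

```python
# Toy Q-linearly convergent sequence: u^k = 0.5^k -> u* = 0.
u = [0.5 ** k for k in range(20)]
# Every ratio ||u^{k+1} - u*|| / ||u^k - u*|| equals 0.5 < 1: Q-linear.
q_ratios = [u[k + 1] / u[k] for k in range(19)]
assert all(abs(r - 0.5) < 1e-12 for r in q_ratios)

# Toy sequence v^k that oscillates: its one-step ratio sometimes exceeds 1,
# so it is not Q-linear, but it is bounded by C * u^k with C = 2: R-linear.
v = [(2.0 if k % 2 == 0 else 0.1) * 0.5 ** k for k in range(20)]
assert any(v[k + 1] / v[k] > 1 for k in range(19))      # fails the Q-test
assert all(v[k] <= 2 * u[k] + 1e-12 for k in range(20))  # passes the R-test
print("Q-linear and R-linear checks passed")
```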

II. THE ADMM FOR DECENTRALIZED CONSENSUS OPTIMIZATION

In this section, we first reformulate the decentralized consensus optimization problem (1) such that it can be solved by the ADMM (see Section II-A). Then we develop the decentralized ADMM approach and provide a simplified decentralized algorithm (see Section II-B).

A. Problem Formulation

Throughout the paper, we consider a network consisting of n agents bidirectionally connected by E edges (and thus 2E arcs). We can describe the network as a symmetric directed graph 𝒢_d = {𝒱, 𝒜} or an undirected graph 𝒢_u = {𝒱, ℰ}, where 𝒱 is the set of vertexes with cardinality |𝒱| = n, 𝒜 is the set of arcs with |𝒜| = 2E, and ℰ is the set of edges with |ℰ| = E. Algorithms that solve the decentralized consensus optimization problem (1) are developed based on this graph.

Generally speaking, the ADMM applies to the convex optimization problem in the form of

$$\min_{x,\, z} \; f(x) + g(z), \quad \text{s.t. } Ax + Bz = c, \tag{2}$$

where x and z are optimization variables, f and g are convex functions, and Ax + Bz = c is a linear constraint of x and z. The ADMM solves a sequence of subproblems involving x and z one at a time and iterates to converge as long as a saddle point exists.

To solve (1) with the ADMM in a decentralized manner, we reformulate it as

$$\min_{\{x_i\},\, \{z_{ij}\}} \; \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } x_i = z_{ij}, \; x_j = z_{ij}, \; \forall (i,j) \in \mathcal{A}. \tag{3}$$

Here x_i ∈ R^p is the local copy of the common optimization variable x̃ at agent i, and z_{ij} is an auxiliary variable imposing the consensus constraint on neighboring agents i and j. In the constraints, the {x_i} are separable when the {z_{ij}} are fixed, and vice versa. Therefore, (3) lends itself to decentralized computation in the ADMM framework. Apparently, (3) is equivalent to (1) when the network is connected.

Defining x ∈ R^{np} as a vector concatenating all x_i, z ∈ R^{2Ep} as a vector concatenating all z_{ij}, and f(x) = Σ_{i=1}^{n} f_i(x_i), (3) can be written in a matrix form as

$$\min_{x,\, z} \; f(x) + g(z), \quad \text{s.t. } Ax + Bz = 0, \tag{4}$$

where g(z) = 0, which fits the form of (2), and is amenable to the ADMM. Here A = [A_1; A_2], where A_1 and A_2 are both composed of 2E × n blocks of p × p matrices. If (i, j) ∈ 𝒜 and z_{ij} is the q-th block of z, then the (q, i)-th block of A_1 and the (q, j)-th block of A_2 are identity matrices I_p; otherwise the corresponding blocks are zero matrices 0_{p×p}. Also, we have B = [−I_{2Ep}; −I_{2Ep}], with I_{2Ep} being a 2Ep × 2Ep identity matrix.
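As an illustration of this block structure (a sketch under the notation above, not code from the paper), the following numpy snippet builds A_1, A_2, and B for a toy 3-agent path graph with p = 1 and verifies that Ax + Bz = 0 exactly encodes the neighbor-wise consensus constraints:

```python
import numpy as np

# Toy network: 3 agents, edges {1,2} and {2,3}  ->  E = 2, 2E = 4 arcs.
n, p = 3, 1
arcs = [(0, 1), (1, 0), (1, 2), (2, 1)]  # symmetric directed graph

A1 = np.zeros((len(arcs) * p, n * p))  # picks the tail x_i of arc (i, j)
A2 = np.zeros((len(arcs) * p, n * p))  # picks the head x_j of arc (i, j)
for q, (i, j) in enumerate(arcs):
    A1[q * p:(q + 1) * p, i * p:(i + 1) * p] = np.eye(p)
    A2[q * p:(q + 1) * p, j * p:(j + 1) * p] = np.eye(p)

A = np.vstack([A1, A2])
B = np.vstack([-np.eye(len(arcs) * p), -np.eye(len(arcs) * p)])

# A consensual x, with every z_ij set to the common value, satisfies Ax + Bz = 0.
x = np.full(n * p, 7.0)
z = np.full(len(arcs) * p, 7.0)
assert np.allclose(A @ x + B @ z, 0)

# A non-consensual x violates the constraint for this z.
x_bad = np.array([1.0, 2.0, 3.0])
assert not np.allclose(A @ x_bad + B @ z, 0)
print("constraint structure verified")
```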


B. Algorithm Development

Now we apply the ADMM to solve (4). The augmented Lagrangian of (4) is

$$L_a(x, z, \lambda) = f(x) + g(z) + \langle \lambda,\, Ax + Bz \rangle + \frac{c}{2}\,\|Ax + Bz\|^2,$$

where λ is the Lagrange multiplier and c is a positive algorithm parameter. At iteration k + 1, the ADMM firstly minimizes L_a(x, z^k, λ^k) to obtain x^{k+1}, secondly minimizes L_a(x^{k+1}, z, λ^k) to obtain z^{k+1}, and finally updates λ^{k+1} from λ^k, x^{k+1}, and z^{k+1}. The updates are

$$\begin{aligned}
&x\text{-update:}\quad \nabla f(x^{k+1}) + A^T\lambda^k + cA^T(Ax^{k+1} + Bz^k) = 0,\\
&z\text{-update:}\quad B^T\lambda^k + cB^T(Ax^{k+1} + Bz^{k+1}) = 0,\\
&\lambda\text{-update:}\quad \lambda^{k+1} = \lambda^k + c(Ax^{k+1} + Bz^{k+1}),
\end{aligned} \tag{5}$$

where ∇f(x^{k+1}) is the gradient of f at the point x^{k+1} if f is differentiable, or a subgradient if f is non-differentiable.

Next we show that if the initial values of z and λ are properly chosen, the ADMM updates in (5) can be simplified (see also the derivation in [8]). Multiplying the two sides of the λ-update by B^T and adding it to the z-update, we have B^Tλ^{k+1} = 0. Further, multiplying the two sides of the λ-update by A^T and combining it with the x-update, we have ∇f(x^{k+1}) + A^Tλ^{k+1} + cA^TB(z^k − z^{k+1}) = 0. Therefore (5) can be equivalently expressed as

$$\begin{aligned}
&\nabla f(x^{k+1}) + A^T\lambda^{k+1} + cA^TB(z^k - z^{k+1}) = 0,\\
&B^T\lambda^{k+1} = 0,\\
&\lambda^{k+1} = \lambda^k + c(Ax^{k+1} + Bz^{k+1}).
\end{aligned} \tag{6}$$

Letting λ = [β; γ] with β, γ ∈ R^{2Ep} and recalling B = [−I_{2Ep}; −I_{2Ep}], we know β^{k+1} = −γ^{k+1} from the second equation of (6). Therefore, the first equation in (6) reduces to ∇f(x^{k+1}) + M_−β^{k+1} + cM_+(z^{k+1} − z^k) = 0, where M_+ = A_1^T + A_2^T and M_− = A_1^T − A_2^T. The third equation in (6) splits to two equations β^{k+1} = β^k + c(A_1x^{k+1} − z^{k+1}) and γ^{k+1} = γ^k + c(A_2x^{k+1} − z^{k+1}). If we choose the initial value of γ as γ^0 = −β^0, such that β^k = −γ^k holds for all k, summing and subtracting these two equations result in z^{k+1} = ½M_+^T x^{k+1} and β^{k+1} = β^k + (c/2)M_−^T x^{k+1}, respectively. If we further choose the initial value of z as z^0 = ½M_+^T x^0, then z^k = ½M_+^T x^k holds for all k.

To summarize, with initialization γ^0 = −β^0 and z^0 = ½M_+^T x^0, (6) reduces to

$$\begin{aligned}
&\nabla f(x^{k+1}) + M_-\beta^{k+1} + cM_+(z^{k+1} - z^k) = 0,\\
&z^{k+1} = \tfrac{1}{2}M_+^T x^{k+1},\\
&\beta^{k+1} = \beta^k + \tfrac{c}{2}M_-^T x^{k+1}.
\end{aligned} \tag{7}$$

In Section III we will analyze the convergence rate of the ADMM updates (7). The analysis requires an extra initialization condition that β^0 lies in the column space of M_−^T (e.g., β^0 = 0) such that β^k also lies in the column space of M_−^T; the reason will be given in Section III.

Indeed, (7) also leads to a simple decentralized algorithm that involves only an x-update and a new multiplier update. To see this, substituting z^{k+1} = ½M_+^T x^{k+1} into the first two equations of (7) we have

$$\begin{aligned}
&\nabla f(x^{k+1}) + M_-\beta^{k+1} + \tfrac{c}{2}M_+M_+^T(x^{k+1} - x^k) = 0,\\
&\beta^{k+1} = \beta^k + \tfrac{c}{2}M_-^T x^{k+1},
\end{aligned} \tag{8}$$

which is irrelevant with z. Note that in the first equation of (8) the x-update relies on β^{k+1} other than β^k. Therefore, multiplying the second equation with M_− we have M_−β^{k+1} = M_−β^k + (c/2)M_−M_−^T x^{k+1}. Substituting it into the first equation of (8) we obtain the x-update where x^{k+1} is decided by x^k and β^k, i.e., ∇f(x^{k+1}) + M_−β^k + (c/2)M_−M_−^T x^{k+1} + (c/2)M_+M_+^T(x^{k+1} − x^k) = 0. Letting D be a block diagonal matrix with its i-th block being the degree d_i of agent i multiplying I_p and its other blocks being 0_{p×p}, and letting L_+ = ½M_+M_+^T and L_− = ½M_−M_−^T, we know 2D = L_+ + L_−. Defining a new multiplier α = M_−β, we obtain a simplified decentralized algorithm

$$\begin{aligned}
&x\text{-update:}\quad \nabla f(x^{k+1}) + \alpha^k + 2cDx^{k+1} - cL_+x^k = 0,\\
&\alpha\text{-update:}\quad \alpha^{k+1} = \alpha^k + cL_-x^{k+1}.
\end{aligned} \tag{9}$$

The introduced matrices M_+, M_−, L_+, L_−, and D are related to the underlying network topology. With regard to the undirected graph 𝒢_u, M_+ and M_− are the extended unoriented and oriented incidence matrices, respectively; L_+ and L_− are the extended signless and signed Laplacian matrices, respectively; and D is the extended degree matrix. By "extended", we mean replacing every 1 by I_p, every −1 by −I_p, and every 0 by 0_{p×p} in the original definitions of these matrices [26]–[29].

The updates in (9) are distributed to the agents. Note that x = [x_1; …; x_n], where x_i is the local solution of agent i, and α = [α_1; …; α_n], where α_i is the local Lagrange multiplier of agent i. Recalling the definitions of L_+, L_−, and D, (9) translates to the update of agent i given by

$$\begin{aligned}
&\nabla f_i(x_i^{k+1}) + \alpha_i^k + 2c\,d_i\,x_i^{k+1} - c\Big(d_i x_i^k + \sum_{j \in \mathcal{N}_i} x_j^k\Big) = 0,\\
&\alpha_i^{k+1} = \alpha_i^k + c\Big(d_i x_i^{k+1} - \sum_{j \in \mathcal{N}_i} x_j^{k+1}\Big),
\end{aligned} \tag{10}$$

where 𝒩_i denotes the set of neighbors of agent i and d_i = |𝒩_i|. The algorithm is fully decentralized since the updates of x_i and α_i only rely on local and neighboring information. The decentralized consensus optimization algorithm based on the ADMM is outlined in Table I.
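To make the per-agent updates concrete, here is a minimal numpy sketch for consensus least squares, f_i(x) = ½‖A_i x − b_i‖², on a toy 3-agent path graph. The problem data, penalty c = 1, and iteration count are illustrative assumptions; the updates follow the per-agent form described above, for which the x-update has a closed form in the least-squares case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: 3 agents on a path graph, f_i(x) = 0.5 * ||A_i x - b_i||^2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
p, c = 2, 1.0
x_true = rng.standard_normal(p)
A_loc = [rng.standard_normal((5, p)) for _ in range(3)]
b_loc = [A_loc[i] @ x_true + 0.01 * rng.standard_normal(5) for i in range(3)]

x = {i: np.zeros(p) for i in neighbors}
alpha = {i: np.zeros(p) for i in neighbors}

for _ in range(200):
    x_old = {i: x[i].copy() for i in neighbors}
    for i, Ni in neighbors.items():
        d = len(Ni)
        # x-update: solve grad f_i(x_i) + alpha_i + 2*c*d*x_i
        #           - c*(d*x_i_old + sum_j x_j_old) = 0 for x_i.
        H = A_loc[i].T @ A_loc[i] + 2 * c * d * np.eye(p)
        rhs = (A_loc[i].T @ b_loc[i] - alpha[i]
               + c * (d * x_old[i] + sum(x_old[j] for j in Ni)))
        x[i] = np.linalg.solve(H, rhs)
    for i, Ni in neighbors.items():
        # alpha-update uses the *new* iterates of agent i and its neighbors.
        alpha[i] += c * (len(Ni) * x[i] - sum(x[j] for j in Ni))

# All agents should (approximately) agree on the centralized LS solution.
A_all = np.vstack(A_loc); b_all = np.concatenate(b_loc)
x_star, *_ = np.linalg.lstsq(A_all, b_all, rcond=None)
assert max(np.linalg.norm(x[i] - x_star) for i in neighbors) < 1e-4
print("consensus reached")
```

Only x_j from direct neighbors appears in each agent's update, which is what makes the scheme decentralized.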

III. CONVERGENCE RATE ANALYSIS

This section first establishes the linear convergence rate of the ADMM in decentralized consensus optimization with strongly convex local objective functions (see Section III-A); the detailed proof of the main theoretical result is placed in the Appendix. We then discuss how to tune the parameter and accelerate the convergence (see Section III-B).


TABLE I
ALGORITHM 1: DECENTRALIZED CONSENSUS OPTIMIZATION BASED ON THE ADMM

A. Main Theoretical Result

Throughout this paper, we make the following assumption that the local objective functions are strongly convex and have Lipschitz continuous gradients; note that the latter implies differentiability.

Assumption 1: The local objective functions are strongly convex: for each agent i and given any x_a, x_b ∈ R^p, ⟨∇f_i(x_a) − ∇f_i(x_b), x_a − x_b⟩ ≥ m_{f_i}‖x_a − x_b‖² with m_{f_i} > 0. The gradients of the local objective functions are Lipschitz continuous: for each agent i and given any x_a, x_b ∈ R^p, ‖∇f_i(x_a) − ∇f_i(x_b)‖ ≤ M_{f_i}‖x_a − x_b‖ with M_{f_i} > 0.

Recall the definition f(x) = Σ_{i=1}^{n} f_i(x_i). Assumption 1 directly indicates that f is strongly convex (i.e., given any x_a, x_b ∈ R^{np}, ⟨∇f(x_a) − ∇f(x_b), x_a − x_b⟩ ≥ m_f‖x_a − x_b‖² with m_f = min_i m_{f_i}) and that the gradient of f is Lipschitz continuous (i.e., for any x_a, x_b ∈ R^{np}, ‖∇f(x_a) − ∇f(x_b)‖ ≤ M_f‖x_a − x_b‖ with M_f = max_i M_{f_i}).

Although the convergence of Algorithm 1 to the optimal solution of (4) can be shown based on the convergence property of the ADMM (see e.g., [21]), establishing its linear convergence is nontrivial. In [25] the linear convergence of the centralized ADMM is proved given that either g is strongly convex or B is full row-rank. However, the decentralized consensus optimization problem does not satisfy these conditions: the function g(z) = 0 is not strongly convex, and the matrix B is row-rank deficient.

Next we will analyze the convergence rate of the ADMM iteration (7). The analysis requires an extra initialization condition that β^0 lies in the column space of M_−^T such that β^k also lies in the column space of M_−^T, which is necessary in the analysis. Note that there is a unique optimal multiplier β* lying in the column space of M_−^T. To see so, taking k → ∞ in (7) yields the KKT conditions of (4)

$$\begin{aligned}
&\nabla f(x^*) + M_-\beta^* = 0,\\
&z^* = \tfrac{1}{2}M_+^T x^*,\\
&M_-^T x^* = 0,
\end{aligned} \tag{11}$$

where x* is the unique primal optimal solution, and the uniqueness follows from the strong convexity of f as well as the consensus constraints. Since the consensus constraints are feasible, at least one optimal multiplier exists such that the first equation of (11) holds. We show that its projection onto the column space of M_−^T, denoted by β*, is also an optimal multiplier. According to the property of projection, the projection residual lies in the null space of M_−, and hence the projection β* that lies in the column space of M_−^T also satisfies ∇f(x*) + M_−β* = 0. Next we show the uniqueness of such a β* by contradiction. Consider two different vectors β_a* and β_b* that both lie in the column space of M_−^T and satisfy the equation. Therefore, we have ∇f(x*) + M_−β_a* = 0 and ∇f(x*) + M_−β_b* = 0. Subtracting them yields M_−(β_a* − β_b*) = 0. Since β_a* − β_b* lies in the column space of M_−^T, we have ‖M_−(β_a* − β_b*)‖ ≥ σ̃_min(M_−)‖β_a* − β_b*‖, where σ̃_min(M_−) is the smallest nonzero singular value of M_−; we conclude that ‖β_a* − β_b*‖ = 0 and consequently β_a* = β_b*, which contradicts with the assumption of β_a* and β_b* being different. Hence, β* is the unique dual optimal solution that lies in the column space of M_−^T. Our main theoretical result considers the convergence of a

vector that concatenates the primal variable z and the dual variable β, which is common in the convergence rate analysis of the ADMM [23]–[25]. Let us introduce

$$u = \begin{bmatrix} z \\ \beta \end{bmatrix}, \quad u^* = \begin{bmatrix} z^* \\ \beta^* \end{bmatrix}, \quad G = \begin{bmatrix} cI_{2Ep} & 0 \\ 0 & \tfrac{1}{c}I_{2Ep} \end{bmatrix}. \tag{12}$$

We will show that u^k is Q-linearly convergent to its optimal u* with respect to the G-norm. Further, the Q-linear convergence of u^k to u* implies that x^k is R-linearly convergent to its optimal x*.

Theorem 1: Consider the ADMM iteration (7) that solves (4). The primal variables x and z have their unique optimal values x* and z*, respectively; the dual variable β has its unique optimal value β* that lies in the column space of M_−^T. Recall the definition of u and G in (12). If the local objective functions satisfy Assumption 1 and the dual variable β is initialized such that β^0 lies in the column space of M_−^T, then for any μ > 1, u^k is Q-linearly convergent to its optimal u* with respect to the G-norm:

$$\|u^{k+1} - u^*\|_G^2 \le \frac{1}{1+\delta}\,\|u^k - u^*\|_G^2, \tag{13}$$

where

$$\delta = \min\left\{ \frac{(\mu - 1)\,\tilde{\sigma}_{\min}^2(M_-)}{\mu\,\sigma_{\max}^2(M_+)},\; \frac{2\,m_f}{\tfrac{c}{2}\,\sigma_{\max}^2(M_+) + \tfrac{\mu M_f^2}{c\,\tilde{\sigma}_{\min}^2(M_-)}} \right\} > 0. \tag{14}$$

Further, x^k is R-linearly convergent to x* following from

$$\|x^{k+1} - x^*\|^2 \le \frac{1}{m_f}\,\|u^k - u^*\|_G^2. \tag{15}$$

Proof: See Appendix.


In Theorem 1, (13) shows that ‖u^{k+1} − u*‖_G is no greater than (1/√(1+δ))‖u^k − u*‖_G, and hence u^k converges to u* Q-linearly at a rate

$$\lambda = \frac{1}{\sqrt{1+\delta}} < 1.$$

A larger δ guarantees faster convergence. On the other hand, λ is a theoretical upper bound of the convergence rate, probably not tight. The Q-linear convergence of u^k to u* translates to the R-linear convergence of x^k to x* as shown in (15).

B. Accelerating the Convergence

From (14) we can find that the theoretical convergence rate (more precisely, its upper bound) is given in terms of the network topology, the properties of local objective functions, and the algorithm parameter. The value of δ is related with the free parameter μ > 1, the strong convexity constant m_f of f, the Lipschitz constant M_f of ∇f, and the algorithm parameter c.

Now we consider tuning the free parameter μ and the algorithm parameter c to maximize δ and thus accelerate the convergence (i.e., through minimizing λ, which is actually an upper bound). From the analysis we will see more clearly how the convergence rate is influenced by the network topology and the local objective functions. For convenience, we define the condition number of f as

$$\kappa_f = \frac{M_f}{m_f}.$$

Recall that m_f = min_i m_{f_i} and M_f = max_i M_{f_i}. Therefore, κ_f is an upper bound of the condition numbers of the local objective functions. We also define the condition number of the underlying graph 𝒢_d or 𝒢_u as

$$\kappa_{\mathcal{G}} = \frac{\sigma_{\max}(L_+)}{\tilde{\sigma}_{\min}(L_-)}.$$

With regard to the underlying graph, the minimum nonzero singular value of the extended signed Laplacian matrix L_−, denoted as σ̃_min(L_−), is known as its algebraic connectivity [26], [27]. The maximum singular value of the extended signless Laplacian matrix L_+, denoted as σ_max(L_+), has also drawn research interest recently [28], [29]. Both σ̃_min(L_−) and σ_max(L_+) are measures of network connectedness, but the former is weaker. Roughly speaking, larger σ̃_min(L_−) and σ_max(L_+) mean stronger connectedness, and a larger κ_𝒢 means weaker connectedness.

Keeping the definitions of κ_f and κ_𝒢 in mind, the following theorem shows how to choose the free parameter μ and the algorithm parameter c to maximize δ and accelerate the convergence.

Theorem 2: If the algorithm parameter c in (14) is chosen as

$$c = \frac{\sqrt{2\tilde{\mu}}\; M_f}{\sigma_{\max}(M_+)\,\tilde{\sigma}_{\min}(M_-)}, \tag{16}$$

where

$$\tilde{\mu} = \left( \frac{\sqrt{2\kappa_{\mathcal{G}}}/\kappa_f + \sqrt{2\kappa_{\mathcal{G}}/\kappa_f^2 + 4}}{2} \right)^{2}, \tag{17}$$

then

$$\delta = \frac{\tilde{\mu} - 1}{\tilde{\mu}\,\kappa_{\mathcal{G}}} \tag{18}$$

maximizes the value of δ in (14) and ensures that (15) holds.

Proof: Observing the two values inside the minimization operator in (14), we find that only the second term is relevant with c. It is easy to check that the value of c in (16), no matter how μ is chosen in place of μ̃, maximizes δ as

$$\delta = \min\left\{ \frac{\mu - 1}{\mu\,\kappa_{\mathcal{G}}},\; \sqrt{\frac{2}{\mu}}\,\frac{1}{\kappa_f\sqrt{\kappa_{\mathcal{G}}}} \right\}. \tag{19}$$

Inside the minimization operator in (19), the first and second terms are monotonically increasing and decreasing with regard to μ, respectively. To maximize δ, we choose a value of μ such that the two terms are equal. Simple calculations show that the value of μ̃ in (17), which is larger than 1, satisfies this condition. The resulting maximum value of δ is the one in (18).

The value of δ in (18) is monotonically decreasing with regard to κ_f and κ_𝒢. This conclusion suggests that a smaller condition number κ_f of the objective and a smaller condition number κ_𝒢 of the graph lead to faster convergence. On the other hand, if these condition numbers keep increasing, the convergence can go arbitrarily slow. In fact, the limit of δ in (18) is 0 as κ_f → ∞ or κ_𝒢 → ∞. Given δ in (18), the largest guaranteed contraction constant, we define the upper bound of the convergence rate as λ = 1/√(1+δ).
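The graph quantities entering the rate can be computed directly. Below is an illustrative numpy sketch (not code from the paper) for a toy 4-cycle with p = 1, where the extended matrices coincide with the ordinary signless Laplacian D + W and signed Laplacian D − W:

```python
import numpy as np

# Toy undirected graph: a 4-cycle. W is the adjacency matrix, D the degree matrix.
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))
L_plus = D + W   # signless Laplacian (p = 1, so no block extension is needed)
L_minus = D - W  # signed (ordinary) Laplacian

sigma_max = np.linalg.svd(L_plus, compute_uv=False).max()
sv_minus = np.linalg.svd(L_minus, compute_uv=False)
sigma_min_nz = min(s for s in sv_minus if s > 1e-9)  # algebraic connectivity
kappa_G = sigma_max / sigma_min_nz  # condition number of the graph

# For the 4-cycle: sigma_max = 4, algebraic connectivity = 2, kappa_G = 2.
assert np.isclose(sigma_max, 4.0) and np.isclose(sigma_min_nz, 2.0)
print("kappa_G =", kappa_G)
```

Denser graphs give a smaller κ_𝒢, which is consistent with the discussion above that stronger connectedness accelerates convergence.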

IV. NUMERICAL EXPERIMENTS

In this section, we provide extensive numerical experiments to validate our theoretical analysis. We introduce the experimental settings in Section IV-A and then study the influence of different factors on the convergence rate in Sections IV-B through IV-E.

A. Experimental Settings

We generate a network consisting of n agents and possessing at most n(n − 1)/2 edges. If the network is randomly generated, we define r, the connectivity ratio of the network, as its actual number of edges divided by n(n − 1)/2. Such a random network is generated with rn(n − 1)/2 edges that are uniformly randomly chosen, while ensuring the network is connected.
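The paper does not spell out its network generator; one plausible sketch matching the description (uniformly random edges, connectivity enforced, here via a random spanning tree) is:

```python
import random

def random_connected_network(n, r, seed=0):
    """Return ~r*n*(n-1)/2 uniformly random undirected edges forming a connected network."""
    rng = random.Random(seed)
    target = max(n - 1, round(r * n * (n - 1) / 2))
    nodes = list(range(n))
    rng.shuffle(nodes)
    edges = set()
    # Random spanning tree first, so connectivity is guaranteed.
    for k in range(1, n):
        j = rng.choice(nodes[:k])
        edges.add((min(nodes[k], j), max(nodes[k], j)))
    # Then add uniformly random extra edges until the target count is met.
    while len(edges) < target:
        i, j = rng.sample(range(n), 2)
        edges.add((min(i, j), max(i, j)))
    return edges

E = random_connected_network(10, 0.3)
assert len(E) == max(9, round(0.3 * 10 * 9 / 2))
```

Note this construction guarantees connectivity but is not exactly uniform over all connected graphs with the target edge count; it is only a stand-in for whatever generator the authors used.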


TABLE II
SUMMARY OF THE NUMERICAL EXPERIMENTS

We apply the ADMM to a decentralized consensus least squares problem

$$\min_{\tilde{x}} \; \sum_{i=1}^{n} \frac{1}{2}\|A_i\tilde{x} - b_i\|^2. \tag{20}$$

Here x̃ is the unknown signal to estimate and its true values follow a normal distribution; A_i is the linear measurement matrix of agent i, whose elements follow a normal distribution by default; and b_i is the measurement vector of agent i, whose elements are polluted by random noise following a normal distribution. In Section IV-D the elements of the matrices A_i need to be further manipulated to produce different condition numbers of the objective functions. We reformulate (20) into the form of (3) as

$$\min_{\{x_i\},\,\{z_{ij}\}} \; \sum_{i=1}^{n} \frac{1}{2}\|A_ix_i - b_i\|^2, \quad \text{s.t. } x_i = z_{ij}, \; x_j = z_{ij}, \; \forall (i,j) \in \mathcal{A}. \tag{21}$$

The solution to (20) is denoted by x*, in which the part of agent i is denoted by x_i*. The algorithm is stopped once the error drops below a preset accuracy or the number of iterations reaches 4000, whichever is earlier.

In the numerical experiments, we choose to record the primal error ‖x^k − x*‖ instead of ‖u^k − u*‖_G, as the latter incurs significant extra computation when the number of agents is large. But note that ‖x^k − x*‖ is not necessarily monotonic in k. Let the transient convergence rate be q^k = ‖x^k − x*‖/‖x^{k−1} − x*‖. As q^k fluctuates, we report the running geometric-average rate of convergence given by

$$\bar{q}^k = \left(\frac{\|x^k - x^*\|}{\|x^0 - x^*\|}\right)^{1/k}, \tag{22}$$

which follows from (13) and (15). While the initial errors and the constants in (13) and (15) influence q̄^k, their influence diminishes as k grows, and the steady state q̄^∞ is upper bounded by λ as q̄^k is. Throughout the numerical experiments, we report q̄^k and q̄^∞.

In the following subsections, we demonstrate how different factors influence the convergence rate. We firstly show the evidence of linear convergence, and along the way, the influence of the connectivity ratio r on the convergence rate (see Section IV-B). Secondly, we compare the practical convergence rate using the best theoretical algorithm parameter c* in (16) and that using the best hand-tuned parameter ĉ (see Section IV-C). Thirdly, we check the effect of κ_f, the condition number of the objective function (see Section IV-D). Finally, we show how κ_𝒢, the condition number of the network, as well as other network parameters, influence the convergence rate (see Section IV-E). The numerical experiments are summarized in Table II.

B. Linear Convergence

To illustrate the linear convergence of the ADMM for decentralized consensus optimization, we generate random networks consisting of n agents. The connectivity ratio of the networks, r, is set to different values. The ADMM parameter c is set as (16).

Fig. 1 depicts how the relative error varies with the iteration k. Obviously the convergence rates are linear for all r; a higher connectivity ratio leads to faster convergence. Fig. 2 plots q̄^k, which stabilizes within 10 iterations. From Figs. 1 and 2, one can observe that for such randomly generated networks, varying the connectivity ratio r within the range [0.08, 1] does not significantly change the convergence rate. The reason is that when r is larger than a certain threshold, its value makes little influence on κ_𝒢 (see Table III in Section IV-C). We will discuss more about this influence in Section IV-D.

As a comparison, we also demonstrate the convergence of the distributed gradient descent (DGD) method in Figs. 1 and 2. Using a diminishing stepsize [30], the DGD shows sublinear convergence that is slow even for a complete graph (i.e., r = 1).
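For contrast, here is a minimal DGD sketch on the same kind of toy consensus least-squares problem. The Metropolis mixing weights, the O(1/√k) stepsize schedule, and the problem data are illustrative assumptions; [30] specifies the actual scheme used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy consensus least-squares setup: 3 agents on a path graph.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
p = 2
x_true = rng.standard_normal(p)
A_loc = [rng.standard_normal((5, p)) for _ in range(3)]
b_loc = [A_loc[i] @ x_true for i in range(3)]  # noiseless, so optimum = x_true

# Symmetric doubly stochastic mixing weights (Metropolis rule).
Wm = np.zeros((3, 3))
for i, Ni in neighbors.items():
    for j in Ni:
        Wm[i, j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
    Wm[i, i] = 1.0 - Wm[i].sum()

x = np.zeros((3, p))
for k in range(1, 2001):
    grad = np.stack([A_loc[i].T @ (A_loc[i] @ x[i] - b_loc[i]) for i in range(3)])
    x = Wm @ x - (0.05 / np.sqrt(k)) * grad  # mix with neighbors, then descend

# DGD with a diminishing stepsize gets close, but only at a sublinear rate.
assert max(np.linalg.norm(x[i] - x_true) for i in range(3)) < 0.1
print("DGD near the solution, but slowly")
```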

C. Algorithm Parameter

Here we discuss the influence of the ADMM parameter on the convergence rate. The best theoretical value in (16), though optimizing the upper bound of the convergence rate, does not give the best practical performance. We vary , and plot the steady-state running geometric-average rates of convergence in Fig. 3. For each curve that corresponds to a unique , we mark

1756 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 7, APRIL 1, 2014

TABLE III
SETTINGS AND CONVERGENCE RATES CORRESPONDING TO FIGS. 1, 2, 4, AND 5

Fig. 1. Relative error versus iteration .

Fig. 2. Running geometric-average rate of convergence versus iteration .

the best theoretical value and the best practical value . Consistently, the values are larger than .

Now we set to , the hand-tuned optimal value, and plot in Fig. 4 as per Fig. 1 and in Fig. 5 as per Fig. 2. Compared to those using , the best theoretical value, in Figs. 1 and 2, the convergence improves significantly. The numerical quantities of Figs. 1, 2, 4, and 5 are given in Table III. It appears that is a stable overestimate of . Therefore, we recommend for nearly optimal convergence using some .

Fig. 6 illustrates the convergence corresponding to different values of . We randomly generate 4000 connected networks with agents whose connectivity ratios are

Fig. 3. Steady-state running geometric-average rate of convergence versus algorithm parameter .

Fig. 4. Relative error versus iteration .

uniformly distributed on . The random networks are divided into 20 groups according to their condition numbers . For each group of random networks, the values of are plotted with error bars and compared with the theoretical upper bound . For this dataset, appears to be a good overall choice. A smaller imposes a risk of slower convergence when is small.
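Hand-tuning as described above amounts to scanning a grid of parameter values and keeping the one whose error sequence has the smallest steady-state running geometric-average rate, (e_K / e_0)^(1/K). The sketch below assumes a user-supplied `run_admm(c)` returning the relative-error sequence for parameter `c`; it stands in for the decentralized ADMM recursion, which is not reproduced here.

```python
def best_practical_parameter(run_admm, c_grid, K=200):
    """Hand-tuning sketch: scan candidate ADMM parameters and keep the
    one with the smallest steady-state geometric-average rate
    (e_K / e_0) ** (1 / K) of its relative-error sequence."""
    best_c, best_rate = None, float("inf")
    for c in c_grid:
        e = run_admm(c)
        rate = (e[K] / e[0]) ** (1.0 / K)
        if rate < best_rate:
            best_c, best_rate = c, rate
    return best_c, best_rate
```

The same statistic, evaluated along the iterations rather than only at k = K, is what Figs. 2, 3, and 5 plot.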

D. Condition Number of the Objective Function

Now we study how , the condition number of the objective function, affects the convergence rate. We generate random


SHI et al.: ON THE LINEAR CONVERGENCE OF THE ADMM IN DECENTRALIZED CONSENSUS OPTIMIZATION 1757

Fig. 5. Running geometric-average rate of convergence versus iteration .

Fig. 6. Convergence performance obtained with for varying , where is analytically given in (16).

networks consisting of agents with different connectivity ratios . We set . To produce different , we first generate a linear measurement matrix with its elements following . Second, we apply singular value decompositions to , scale the singular values to the range , and rebuild .

Fig. 7 shows that the theoretical convergence rates are monotonically increasing as increases, which is consistent with Theorem 2. When the connectivity ratios are small, the trend of disobeys the theoretical analysis. This is because our upper bound on the convergence rate becomes loose when the network connectedness is poor. When the network is well-connected (say ), we can observe a positive correlation between and , which coincides with the theoretical analysis.

E. Network Topology

Last, we study how the network topology affects the convergence rate. Besides the condition number of the network , which is relevant, we also consider other network parameters including the network diameter, geometric average degree, as well

Fig. 7. Convergence performance versus the condition number of the objective function at different connectivity ratios .

Fig. 8. Convergence performance versus the condition number of the network obtained with networks of different topologies (random, line, cycle, star, and complete) and of different sizes .

as the imbalance of bipartite networks. In the numerical experiments, the local objective functions are generated as described in Section IV-A. The algorithm parameter is set as .

1) Condition Number of the Network: As it is difficult to precisely design , the condition number of the network, we run a large number of trials to sample . We randomly generate 4000 connected networks with agents each, 12000 networks in total. Their connectivity ratios are uniformly distributed on . In addition, we generate special networks with topologies of the line, cycle, star, complete, and grid types. The grid networks are generated in a 3D space (2×5×5, 5×5×8, and 5×10×10). Fig. 8 depicts the effect of on the convergence rate.

In Fig. 8, the dashed curve with error bars corresponds to the random networks, and the individual points correspond to the special networks. There is only one dashed curve in the plot since does not make significant differences. The networks of the line, cycle, complete, and grid topologies


Fig. 9. Convergence performance versus the condition number of the network obtained with networks of different topologies (random, line, cycle, star, complete, and grid) and of different sizes .

generate points in the plot that are nearly on the dashed curve, which indicates that is a good indicator of the convergence rate. In addition, the trends of , the steady-state running geometric-average rate of convergence, and , the theoretical rate of convergence, are consistent. The points corresponding to the three networks of the star topology are away from the dashed curve.

We observe that the convergence rate is closely related to , and less so to . To reach a target convergence rate, one therefore shall have a sufficiently small , which in turn depends on and , as well as other factors. To obtain a sufficiently small , typically needs to be large if is small, but not as large if is large. In other words, if one has a network with a large number of agents (say ), a small connectivity ratio (say ) will lead to a small and thus fast convergence.

With the same , the networks with the star topology have much faster convergence than random networks. We shall discuss this special topology at the end of this subsection.

2) Network Diameter: The network diameter is defined as the longest distance between any pair of agents in the network. In decentralized consensus optimization, is related to how many iterations it takes for information from one agent to reach all the other agents.

To discuss the effect of the network diameter on the convergence rate, we randomly generate 4000 connected networks with agents and connectivity ratios uniformly distributed on . We also generate the networks of the line, cycle, star, complete, and grid topologies. Most randomly generated networks possess small diameters. In this experiment, the numbers of those with and are 3141, 717, and 142, respectively. From Fig. 9, we conclude that in general a larger diameter tends to cause a worse condition number of the network and thus slower convergence, though this relationship is interfered with by other network properties.

3) Geometric Average Degree: Define and as the

largest and smallest degrees of the agents in the network, respectively. The geometric average degree reflects the agents' number of neighbors in a geometric average sense.
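Since the defining formula is elided in the extracted text, one natural reading of a "geometric average" built from the largest and smallest degrees is sqrt(d_max * d_min); the sketch below adopts that reading as an assumption. It attains its maximum on the complete graph (where d_max = d_min = n - 1) and its minimum on the line (d_max = 2, d_min = 1), consistent with the remark that follows.

```python
import math

def geometric_average_degree(adj):
    """sqrt(d_max * d_min), where d_max and d_min are the largest and
    smallest agent degrees in the network; adj maps each agent to its
    list of neighbors. The formula is an assumption, not the paper's
    verbatim definition."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    return math.sqrt(max(degrees) * min(degrees))
```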

Fig. 10. Convergence performance versus the condition number of the network and the network diameter obtained with networks of different topologies (random, line, cycle, star, complete, and grid) and of size .
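The network diameter used in these experiments can be computed with one breadth-first search per agent; a compact sketch, assuming the network is given as an adjacency map:

```python
from collections import deque

def diameter(adj):
    """Longest shortest-path distance between any pair of agents,
    computed via BFS from every node (assumes a connected network)."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)
```

For example, a line of n agents has diameter n - 1, while a star has diameter 2, matching the extremes discussed in this subsection.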

Fig. 11. Convergence performance versus the condition number of the network and the imbalance of bipartite networks obtained with networks of random and star topologies and of size .

Its value reaches the maximum if the topology is complete, and reaches the minimum when the topology is a line.

4) Imbalance of Bipartite Networks: Let denote the class of bipartite networks with agents in one group and agents in the other group. Agents within either group cannot directly communicate with each other. For a bipartite network consisting of agents, its imbalance is defined as , which can vary between 0 and .
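The random bipartite networks used next can be generated in the same spirit as before: build a spanning structure across the two groups to guarantee connectedness, then add random cross edges until a target connectivity ratio (here taken as the fraction of the n1 * n2 possible cross edges) is reached. The paper's own generator is not specified, so `random_connected_bipartite` is an illustrative sketch.

```python
import random

def random_connected_bipartite(n1, n2, ratio, seed=0):
    """Random connected bipartite network: agents 0..n1-1 form one
    group, n1..n1+n2-1 the other, and edges exist only across groups."""
    rng = random.Random(seed)
    group_a = list(range(n1))
    group_b = list(range(n1, n1 + n2))
    # Spanning construction: the first agent of each group reaches every
    # agent of the other group, so the network is connected.
    edges = {(group_a[0], b) for b in group_b}
    edges |= {(a, group_b[0]) for a in group_a[1:]}
    # Add random cross edges up to the target count.
    target = max(len(edges), round(ratio * n1 * n2))
    while len(edges) < target:
        edges.add((rng.choice(group_a), rng.choice(group_b)))
    return sorted(edges)
```

Setting n1 = 1 recovers the star topology, the most imbalanced member of the class.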


We randomly generate 1000 bipartite graphs of size , whose connectivity ratios are uniformly distributed on , for each of the cases . The star topology corresponds to a special bipartite network with . From Fig. 11, we find that for the same , the networks with larger have faster convergence. An extreme example is the network of the star topology. This observation suggests assigning a few "hot spots" to relay information for fast convergence, if is fixed in advance. However, this approach may cause robustness or scalability issues because the relaying agents are subject to an extensive communication burden. Hence there is a tradeoff between fast convergence and robustness or scalability in network design.

V. CONCLUSION

We apply the ADMM to a reformulation of a general decentralized consensus optimization problem. We show that if the objective function is strongly convex, the decentralized ADMM converges at a globally linear rate, which can be given explicitly. Several factors affect the convergence rate, including the topology-related properties of the network, the condition number of the objective function, and the algorithm parameter. Numerical experiments corroborate and supplement our theoretical findings. Our analysis sheds light on how to construct a network and tune the algorithm parameter for fast convergence.

APPENDIX

Proof: Consider the ADMM updates (7) and the KKT conditions (11). Subtracting the three equations in (11) from the corresponding equations in (7) yields

(23)

(24)

(25)

respectively.

To prove the Q-linear convergence of , we use as an intermediate. Based on Assumption 1, is strongly convex with a constant such that

(26)

Using (23), we can split the right-hand side of (26) into two terms

(27)

Substituting (24) and (25) into (27), we can eliminate the term and obtain

(28)

Recall the definitions of and in (12). The right-hand side of (28) can be written in the compact form . Using the equality , (28) is equivalent to

(29)

and consequently using (26)

(30)

Having (30) at hand, to prove (13) we only need to show

(31)

which is equivalent to

(32)

The idea of the proof is to show that and are upper bounded by two non-overlapping parts of the left-hand side of (32), respectively.

The upper bound of follows from (25), which shows . Hence we have

(33)

where is the largest singular value of . To find the upper bound of , we use two inequalities and ; the latter holds since has Lipschitz continuous gradients with a constant . Therefore, given the positive algorithm parameter and any , it holds that

(34)

Recall from (23) that is the summation of and . Hence we can apply


the basic inequality , which holds for any , to (34) and obtain

(35)
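The "basic inequality" invoked in this step is, in generic notation (the extracted text elides the paper's symbols, so $a$, $b$, and $\theta$ here are placeholders), the standard Young-type bound

```latex
\|a+b\|^2 \;\le\; (1+\theta)\,\|a\|^2 \;+\; \Bigl(1+\tfrac{1}{\theta}\Bigr)\|b\|^2,
\qquad \forall\,\theta > 0,
```

which follows from expanding $\|a+b\|^2 = \|a\|^2 + 2\langle a,b\rangle + \|b\|^2$ and applying Young's inequality $2\langle a,b\rangle \le \theta\|a\|^2 + \theta^{-1}\|b\|^2$.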

Since by assumption is initialized such that it lies in the column space of , we know that lies in the column space of too; see the ADMM updates (7). Because also lies in the column space of , , where is the smallest nonzero singular value of .

Therefore from (35) we can upper bound by

(36)

Combining (33) and (36), we prove (32). From (33) we have

(37)

From (36) we have

(38)

Summing up (37) and (38) yields

(39)

Apparently, in (14) satisfies

(40)

and consequently (32), which proves (13).

To prove the R-linear convergence of to , we observe that (30) implies , which proves (15).

REFERENCES

[1] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "Linearly convergent decentralized consensus optimization with the alternating direction method of multipliers," in Proc. 38th Int. Conf. Acoust., Speech, Signal Process., 2013, pp. 4613–4617.

[2] G. Inalhan, D. Stipanovic, and C. Tomlin, "Decentralized optimization, with application to multiple aircraft coordination," in Proc. 41st IEEE Conf. Dec. Control, 2002, pp. 1147–1155.

[3] W. Ren, R. Beard, and E. Atkins, "Information consensus in multi-vehicle cooperative control: Collective group behavior through local interaction," IEEE Control Syst. Mag., vol. 27, pp. 71–82, Apr. 2007.

[4] B. Johansson, "On Distributed Optimization in Networked Systems," Ph.D. dissertation, Electr. Eng. Dept., KTH, Stockholm, Sweden, 2008.

[5] L. Xiao, S. Boyd, and S. Kim, "Distributed average consensus with least-mean-square deviation," J. Parallel Distrib. Comput., vol. 67, pp. 33–46, 2007.

[6] A. Dimakis, S. Kar, J. Moura, M. Rabbat, and A. Scaglione, "Gossip algorithms for distributed signal processing," Proc. IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.

[7] J. Predd, S. Kulkarni, and H. Poor, "A collaborative training algorithm for distributed learning," IEEE Trans. Inf. Theory, vol. 55, no. 4, pp. 1856–1871, 2009.

[8] G. Mateos, J. Bazerque, and G. Giannakis, "Distributed sparse linear regression," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5262–5276, 2010.

[9] I. Schizas, A. Ribeiro, and G. Giannakis, "Consensus in ad hoc WSNs with noisy links—Part I: Distributed estimation of deterministic signals," IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.

[10] Q. Ling and Z. Tian, "Decentralized sparse signal recovery for compressive sleeping wireless sensor networks," IEEE Trans. Signal Process., vol. 58, no. 7, pp. 3816–3827, 2010.

[11] J. Bazerque and G. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1847–1862, 2010.

[12] J. Bazerque, G. Mateos, and G. Giannakis, "Group-Lasso on splines for spectrum cartography," IEEE Trans. Signal Process., vol. 59, no. 10, pp. 4648–4663, 2011.

[13] V. Kekatos and G. Giannakis, "Distributed robust power system state estimation," IEEE Trans. Power Syst., vol. 28, no. 2, pp. 1617–1626, 2013.

[14] L. Gan, U. Topcu, and S. Low, "Optimal decentralized protocol for electric vehicle charging," IEEE Trans. Power Syst., vol. 28, no. 2, pp. 940–951, 2013.

[15] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.

[16] S. Ram, A. Nedic, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, pp. 516–545, 2010.

[17] K. Tsianos and M. Rabbat, "Distributed strongly convex optimization," in Proc. 50th Ann. Allerton Conf. Commun., Control, Comput., 2012, pp. 593–600.

[18] J. Duchi, A. Agarwal, and M. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 592–606, 2012.

[19] K. Tsianos, S. Lawlor, and M. Rabbat, "Push-sum distributed dual averaging for convex optimization," in Proc. 51st IEEE Ann. Conf. Dec. Control, 2012, pp. 5453–5458.

[20] T. Erseghe, D. Zennaro, E. Dall'Anese, and L. Vangelista, "Fast consensus by the alternating direction multipliers method," IEEE Trans. Signal Process., vol. 59, no. 11, pp. 5523–5537, 2011.

[21] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Nashua, NH, USA: Athena Scientific, 1997.

[22] M. Rabbat and R. Nowak, "Quantized incremental algorithms for distributed optimization," IEEE J. Sel. Areas Commun., vol. 23, no. 4, pp. 798–808, 2005.

[23] B. He and X. Yuan, "On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method," SIAM J. Numer. Anal., vol. 50, no. 2, pp. 700–709, 2012.

[24] M. Hong and Z. Luo, "On the linear convergence of the alternating direction method of multipliers," 2013 [Online]. Available: http://arxiv.org/pdf/1208.3922v3.pdf

[25] W. Deng and W. Yin, "On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers," Rice Univ., Houston, TX, USA, Tech. Rep. TR12-14, 2012.

[26] F. Chung, Spectral Graph Theory, ser. CBMS Regional Conf. Series in Mathematics, no. 92. Providence, RI, USA: Amer. Math. Soc., 1996.

[27] M. Fiedler, "Algebraic connectivity of graphs," Czechoslovak Math. J., vol. 23, no. 98, pp. 298–305, 1973.

[28] D. Cvetkovic, P. Rowlinson, and S. Simic, "Signless Laplacians of finite graphs," Linear Algebra Appl., vol. 423, pp. 155–171, 2007.


[29] Y. Chen and L. Wang, "Sharp bounds for the largest eigenvalue of the signless Laplacian of a graph," Linear Algebra Appl., vol. 433, pp. 908–913, 2010.

[30] D. Jakovetic, J. Xavier, and J. M. Moura, "Convergence rate analysis of distributed gradient methods for smooth optimization," in Proc. 20th IEEE Telecommun. Forum (TELFOR), 2012, pp. 867–870.

Wei Shi received the B.E. degree in automation from the University of Science and Technology of China in 2010. He is currently working toward the Ph.D. degree in control theory and control engineering in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.

Qing Ling received the B.E. degree in automation and the Ph.D. degree in control theory and control engineering from the University of Science and Technology of China in 2001 and 2006, respectively. From 2006 to 2009, he was a Post-Doctoral Research Fellow in the Department of Electrical and Computer Engineering, Michigan Technological University. Since 2009, he has been an Associate Professor in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.

Kun Yuan received the B.E. degree in telecommunication engineering from Xidian University in 2011. He is currently working toward the M.S. degree in control theory and control engineering in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.

Gang Wu received the B.E. degree in automation and the M.S. degree in control theory and control engineering from the University of Science and Technology of China in 1986 and 1989, respectively. Since 1991, he has been with the Department of Automation, University of Science and Technology of China, where he is now a Professor. His current research interests are advanced control and optimization of industrial processes.

Wotao Yin received the B.S. degree in mathematics and applied mathematics from Nanjing University in 2001 and the M.S. and Ph.D. degrees in operations research from Columbia University in 2003 and 2006, respectively. From 2006 to 2013, he was an Assistant Professor and then an Associate Professor in the Department of Computational and Applied Mathematics, Rice University. Since 2013, he has been a Professor in the Department of Mathematics, University of California, Los Angeles, CA, USA. His current research interest is large-scale decentralized/distributed optimization.

