SIAM J. CONTROL OPTIM. © 2016 Society for Industrial and Applied Mathematics
Vol. 54, No. 5, pp. 2872–2892

ON CONVERGENCE RATE OF DISTRIBUTED STOCHASTIC GRADIENT ALGORITHM FOR CONVEX OPTIMIZATION WITH INEQUALITY CONSTRAINTS*

DEMING YUAN†, DANIEL W. C. HO‡, AND YIGUANG HONG§

*Received by the editors November 19, 2015; accepted for publication (in revised form) August 23, 2016; published electronically October 19, 2016. http://www.siam.org/journals/sicon/54-5/M104889.html
Funding: This work was supported in part by the National Natural Science Foundation of China under grants 61304042 and 61573344, in part by GRF grants from HKSAR under grants CityU 11204514 and CityU 11300415, and in part by the Natural Science Foundation of Jiangsu Province under grant BK20130856.
†College of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023, China ([email protected]).
‡Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong ([email protected]).
§Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China ([email protected]).

Abstract. In this paper, we consider an optimization problem where multiple agents cooperate to minimize the sum of their local individual objective functions subject to a global inequality constraint. We propose a class of distributed stochastic gradient algorithms that solve the problem using only local computation and communication. The implementation of the algorithms removes the need for performing intermediate projections. For strongly convex optimization, we employ a smoothed constraint incorporation technique to show that the algorithm converges at an expected rate of $O(\ln T/T)$ (where $T$ is the number of iterations) with bounded gradients. For non-strongly convex optimization, we use a reduction technique to establish an $O(1/\sqrt{T})$ convergence rate in expectation. Finally, a numerical example is provided to show the convergence of the proposed algorithms.

Key words. distributed convex optimization, constrained optimization algorithm, stochastic gradient, convergence rate

AMS subject classifications. 93A14, 90C25

DOI. 10.1137/15M1048896

1. Introduction. In recent years, the problem of minimizing a sum of local private objective functions that are distributed among a network of multiple interacting nodes has received much attention (see [3, 4, 6, 16, 18, 19, 20, 23] and references therein). Such a problem arises in a variety of real applications such as distributed estimation [2], source localization in sensor networks [24], and smart grid [15, 26]. Most of the existing works build on the average consensus algorithm to design fully distributed algorithms that find the solution of the problem. As uncertainties always exist in communication and environment, stochastic or randomized algorithms have also been discussed for distributed optimization [13, 10].

Following the distributed optimization results without constraints, constrained multiagent optimization problems are drawing more and more attention. In particular, the objective function is usually a sum of all the nodes' local objective functions, and the constraint is a convex and compact set available to all the nodes. The existing algorithms for solving this problem involve a Euclidean projection onto the constraint set at every iteration in order to ensure that the estimates stay within the feasible domain; they usually fall into two main categories. The first builds on projected gradient algorithms, mainly based on the assumption that the constraint set is simple, in the sense that the Euclidean projection step can be solved easily (see, e.g., [1, 5, 12, 13, 17]). To go beyond simple constraints, the works in [22, 25] proposed approximate projected gradient algorithms by allowing the projection step to be solved only approximately at each iteration. The second category of algorithms exploits the particular structure of the constraint set (say, one characterized by an inequality constraint) (see, e.g., [7, 14, 15, 21]). In general, these algorithms are primal-dual, and their design usually involves constructing a dual optimal set containing the dual optimal variable; nevertheless, this leads to solving a general convex optimization problem. In addition, many problems in constrained multiagent optimization remain to be addressed, including the effect of stochastic errors on distributed gradient algorithms.

Convergence rate analysis is also an important issue in distributed design. Even for consensus problems, how to estimate or improve the convergence rate was discussed in many publications (see [9, 11]). Certainly, it is also important to provide convergence rates in the study of distributed optimization. For example, a convergence rate of $O(\ln T/T)$ was obtained in [2] for unconstrained optimization, where the individual objective functions are strongly convex and have Lipschitz gradients; moreover, for constrained optimization, an $O(\ln T/\sqrt{T})$ rate was established for general nonsmooth convex functions by using a distributed dual averaging algorithm in [5]. In our previous work [21], a distributed algorithm was proposed to drive all the nodes to an optimal point at rate $O(1/T^{1/4})$ for constrained optimization. That algorithm exploited the particular structure of the constraint set, which is described by an inequality and contained in a ball of finite radius. The basic idea of the algorithm was to appropriately penalize the intermediate estimates when they are outside the domain, with the help of a regularized Lagrangian function.

The main objective of this paper is to study a constrained multiagent optimization problem by addressing the following questions: (i) Is there an algorithm that does not require a projection step at every iteration and needs only noisy samples of the gradients? (ii) Is it possible to derive the convergence rate under such circumstances? Although convergence was verified for a class of distributed gradient algorithms in [13], the convergence rate remains unknown for constrained optimization. To this end, we develop a class of distributed gradient algorithms that do not require intermediate projections. Different from the convergence rate results given in [21], we construct a penalty function that incorporates the inequality constraint into the strongly convex objective function without requiring intermediate projections, and we design an efficient algorithm by adopting the smoothing technique in [8]. Specifically, we can achieve an improved rate of $O(\ln T/T)$. Moreover, we also study the convergence rate for the non-strongly convex case with a reduction technique used in online optimization [27].

The technical contributions of the paper can be summarized as follows:
• We propose a class of new stochastic gradient algorithms for distributed optimization with a global constraint. In contrast to the existing algorithms that require projections at every iteration, the implementation of the proposed algorithm removes the need for intermediate projections; instead, only one projection at the last iteration is needed to obtain a feasible solution for the entire algorithm.
• We employ the smoothing technique to deal with the case when the objective functions are strongly convex. This is different from the existing works that handle the inequality constraint (see, e.g., [7, 14, 15, 21]), where the algorithms are in general primal-dual. In this paper, we incorporate the inequality constraint into the objective function and design a non-primal-dual algorithm; therefore, the proposed algorithms are relatively simple to implement. In addition, we use the smoothed constraint incorporation technique to provide the explicit convergence rate of the proposed algorithms, and we drive all the nodes to the optimal point at an expected rate of $O(\ln T/T)$.
• We study the effect of stochastic errors on the distributed gradient algorithm, where the errors are due to the fact that the nodes only have access to noisy samples of the gradients, for the case in which the objective functions are non-strongly convex. To solve the problem, we adopt the idea of a reduction technique from the online convex optimization community (see, e.g., [27]) to derive an expected rate of $O(1/\sqrt{T})$ for the proposed distributed gradient algorithm.

The remainder of the paper is organized as follows. In section 2, we give a detailed description of the constrained multiagent optimization problem and the related assumptions. In section 3, we propose a class of distributed gradient algorithms and provide convergence analysis results. In section 4, we illustrate the proposed algorithm with a distributed estimation problem. Finally, we conclude in section 5.

Notation and terminology. Let $\mathbb{R}^d$ be the $d$-dimensional vector space. We use $\langle\mathbf{x},\mathbf{y}\rangle$ to denote the standard inner product on $\mathbb{R}^d$ for any $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$. We write $\Pi_{\mathcal{X}}[\mathbf{x}]$ to denote the Euclidean projection of a vector $\mathbf{x}$ onto the set $\mathcal{X}$, i.e., $\Pi_{\mathcal{X}}[\mathbf{x}] = \arg\min_{\mathbf{y}\in\mathcal{X}}\|\mathbf{x}-\mathbf{y}\|_2$. We write $\max(a,b)$ to denote the maximum of two real numbers $a$ and $b$ and denote $[\lambda]_+ = \max(\lambda,0)$. We denote by $[N]$ the set of integers $\{1,\ldots,N\}$. We denote the $(i,j)$th element of a matrix $\mathbf{W}$ by $[\mathbf{W}]_{ij}$. For a convex (possibly nonsmooth) function $f$, its gradient (or subgradient) at a point $\mathbf{y}$ is denoted by $\nabla f(\mathbf{y})$, and the following inequality holds for every $\mathbf{x}$ in the domain of $f$:

$$f(\mathbf{x}) \ge f(\mathbf{y}) + \langle\nabla f(\mathbf{y}),\,\mathbf{x}-\mathbf{y}\rangle.$$

In addition, we say $f$ is $\sigma$-strongly convex over the convex set $\mathcal{X}$ if

$$f(\mathbf{x}) \ge f(\mathbf{y}) + \langle\nabla f(\mathbf{y}),\,\mathbf{x}-\mathbf{y}\rangle + \frac{\sigma}{2}\|\mathbf{x}-\mathbf{y}\|_2^2 \quad \forall\,\mathbf{x},\mathbf{y}\in\mathcal{X}.$$
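For concreteness, here is a standard worked instance of this definition (not from the paper itself): the quadratic $f(\mathbf{x}) = w\|\mathbf{x}-\boldsymbol{\nu}\|_2^2$ with $w > 0$, which reappears as the local objective in section 4, satisfies

$$w\|\mathbf{x}-\boldsymbol{\nu}\|_2^2 = w\|\mathbf{y}-\boldsymbol{\nu}\|_2^2 + \left\langle 2w(\mathbf{y}-\boldsymbol{\nu}),\,\mathbf{x}-\mathbf{y}\right\rangle + w\|\mathbf{x}-\mathbf{y}\|_2^2 \quad \forall\,\mathbf{x},\mathbf{y}\in\mathbb{R}^d,$$

so it is $\sigma$-strongly convex with $\sigma = 2w$ (here the defining inequality even holds with equality).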

2. Problem setting and assumptions. Consider a time-varying network with $N$ computational nodes (or agents), labeled by $V = \{1,2,\ldots,N\}$. The nodes' connectivity at time $t$ ($t = 1,2,\ldots$) is described by a directed graph $G(t) = (V,E(t))$, where $E(t)$ is the set of activated links at time $t$; the communication pattern among the nodes is captured by $\mathbf{W}(t)\in\mathbb{R}^{N\times N}$, whose elements are defined as follows:

(i) $[\mathbf{W}(t)]_{ij} > 0$ for any active link $(j,i)\in E(t)$, including node $i$ itself, that is, $[\mathbf{W}(t)]_{ii} > 0$;
(ii) $[\mathbf{W}(t)]_{ij} = 0$ for any link that is inactive at time $t$, as well as for any node $j$ that is not a neighbor of node $i$.

Each node is endowed with a local private convex cost function; all the nodes in the network collectively minimize the sum of the cost functions, subject to a global convex constraint. The problem is formally given by

$$\begin{array}{ll}\text{minimize} & f(\mathbf{x}) = \sum_{i=1}^N f^i(\mathbf{x})\\ \text{subject to} & \mathbf{x}\in\mathcal{X},\end{array} \qquad(2.1)$$

where $f^i:\mathbb{R}^d\to\mathbb{R}$ is the convex objective function of node $i$, known only by itself; $\mathcal{X}$ is a compact convex domain that is known to all the nodes, and it is characterized by an inequality constraint:

$$\mathcal{X} = \{\mathbf{x}\in\mathbb{R}^d : g(\mathbf{x})\le 0\} \subseteq \mathcal{B} = \{\mathbf{x}\in\mathbb{R}^d : \|\mathbf{x}\|_2\le R\},$$

where $g:\mathbb{R}^d\to\mathbb{R}$ is convex, and $\mathcal{X}$ is compact; specifically, it is contained in a Euclidean ball of radius $R$.

Suppose that each node $i$ only has access to noisy samples of its gradient. To be specific, for any point $\mathbf{x}\in\mathcal{X}$, node $i$ can only obtain the stochastic gradient $\widehat{\nabla}f^i(\mathbf{x})$ that is generated by

$$\widehat{\nabla}f^i(\mathbf{x}) = \nabla f^i(\mathbf{x}) + \mathbf{n}^i(\mathbf{x}), \qquad(2.2)$$

where $\mathbf{n}^i(\mathbf{x})\in\mathbb{R}^d$ is an independent random vector with zero mean and bounded variance, that is,

$$\mathbb{E}\left[\mathbf{n}^i(\mathbf{x})\right] = 0 \quad\text{and}\quad \mathbb{E}\left[\|\mathbf{n}^i(\mathbf{x})\|_2^2\right] \le L_n^2. \qquad(2.3)$$

The noise $\mathbf{n}^i(\mathbf{x})$ can represent the error due to inexact computation of the gradient, or roundoff error.
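As a minimal illustration of the oracle model (2.2)–(2.3), the sketch below implements a noisy gradient with Gaussian noise; the Gaussian choice and the helper names are our assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_gradient(grad_f_i, x, L_n):
    """Stochastic first-order oracle of (2.2): true gradient plus zero-mean
    noise. Per-coordinate std L_n/sqrt(d) gives E||n||_2^2 = L_n^2, which
    meets the variance bound (2.3) with equality."""
    d = x.shape[0]
    n = rng.normal(scale=L_n / np.sqrt(d), size=d)
    return grad_f_i(x) + n
```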

In addition, we make the following assumptions on problem (2.1) and the underlying network model.

Assumption 1. The graph $G(t) = (V,E(t))$ and the weight matrix $\mathbf{W}(t)$ satisfy ($t = 1,2,\ldots$):
(a) $\mathbf{W}(t)$ is doubly stochastic.
(b) For all $i\in[N]$, $[\mathbf{W}(t)]_{ii}\ge\nu$, and $[\mathbf{W}(t)]_{ij}\ge\nu$ if $(j,i)\in E(t)$, where $\nu$ is a positive scalar.
(c) The graph $\left(V,\,E(sB+1)\cup\cdots\cup E((s+1)B)\right)$ is strongly connected for all $s\ge 0$ and some positive integer $B$.

Assumption 2. There exists an $\mathbf{x}\in\mathcal{X}$ such that $g(\mathbf{x}) < 0$.

Assumption 3. For each function $f^i(\mathbf{x})$ ($i\in[N]$) and the function $g(\mathbf{x})$, we assume that for all $\mathbf{x}\in\mathcal{B}$,

$$\|\nabla f^i(\mathbf{x})\|_2 \le L_f \quad\text{and}\quad \|\nabla g(\mathbf{x})\|_2 \le L_g.$$

Assumption 4. Each function $f^i(\mathbf{x})$ ($i\in[N]$) is $\sigma$-strongly convex over the set $\mathcal{B}$.

Assumption 5. For the function $g(\mathbf{x})$, we assume that there exists a $\rho > 0$ such that $\min_{g(\mathbf{x})=0}\|\nabla g(\mathbf{x})\|_2 \ge \rho$.

Remark 1. Assumptions 1–3 are standard and widely used in the convergence analysis of consensus-based gradient algorithms that deal with inequality-constrained multiagent optimization (see, e.g., [7, 15]). Assumption 4 imposes a stronger condition on the functions $f^i$, which is standard in the literature on strongly convex optimization (see, e.g., [2, 17]). Assumption 5 is introduced to ensure that the dual optimal variable for the problem in (2.1) is bounded from above (see section 3.1). In fact, Assumption 5 is satisfied for many constraint sets, such as a polytope and a positive semidefinite cone.

Remark 2. Consider an example with constant edge weights to show how to ensure, in a distributed manner, that the weight matrix $\mathbf{W}(t)$ satisfies Assumptions 1(a) and 1(b). Specifically, define

$$[\mathbf{W}(t)]_{ij} = \begin{cases}\nu & \text{if } (j,i)\in E(t),\\ 1 - d_i(t)\,\nu & \text{if } i = j,\\ 0 & \text{otherwise},\end{cases} \qquad(2.4)$$

where $d_i(t)$ is the number of neighbors communicating with node $i$ at time $t$. It is easy to show that Assumptions 1(a) and 1(b) are satisfied for $0 < \nu \le \frac{1}{1+d_i(t)}$.
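For illustration, the following sketch builds the matrix of (2.4) for one time step, assuming bidirectional links so that the constant-weight matrix is symmetric and hence doubly stochastic (the function name and input format are ours, not the paper's):

```python
import numpy as np

def constant_edge_weights(neighbors, nu):
    """Weight matrix of (2.4): [W]_{ij} = nu on active links (j, i),
    [W]_{ii} = 1 - d_i(t) * nu, and 0 otherwise."""
    N = len(neighbors)
    W = np.zeros((N, N))
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            W[i, j] = nu
        W[i, i] = 1.0 - len(nbrs) * nu
    return W

# A 3-node ring with nu = 1/3, which satisfies 0 < nu <= 1/(1 + d_i(t))
# since every node has d_i(t) = 2 neighbors here.
W = constant_edge_weights([(1, 2), (0, 2), (0, 1)], nu=1/3)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
```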

3. The algorithms and convergence results. In this section, we provide a class of efficient distributed stochastic gradient (DSG) algorithms for solving problem (2.1) and characterize their explicit convergence rates. Building on the developed algorithm, we then use a reduction technique to derive the convergence rate for non-strongly convex optimization.

3.1. Distributed stochastic gradient algorithm. To deal with the inequality constraint, we can rewrite problem (2.1) as a min-max optimization problem:

$$\min_{\mathbf{x}\in\mathcal{X}}\,\max_{\lambda\ge 0}\ f(\mathbf{x}) + \lambda N g(\mathbf{x}).$$

This is guaranteed by Slater's condition (cf. Assumption 2). Denote the optimal solution of the above problem by $(\mathbf{x}^\star,\lambda^\star)$; it is easy to show that $\lambda^\star\in[0, L_f/\rho]$. Hence, problem (2.1) can be further written as

$$\min_{\mathbf{x}\in\mathcal{X}}\,\max_{0\le\lambda\le\theta}\ f(\mathbf{x}) + \lambda N g(\mathbf{x}) = \min_{\mathbf{x}\in\mathcal{X}}\,\sum_{i=1}^N\left(f^i(\mathbf{x}) + \theta[g(\mathbf{x})]_+\right),$$

where $\theta > L_f/\rho$. Note that it is possible, but complicated, to calculate the subgradient of $[g(\mathbf{x})]_+$. We adopt the technique, called the smoothed constraint incorporation technique, given in [8], by adding a smoothing term $\gamma N S(\lambda/\theta)$ to the objective function, where $S(\lambda) = -\lambda\ln\lambda - (1-\lambda)\ln(1-\lambda)$ (noting that this idea has been widely used in the conventional convex optimization community [28]). Then the associated problem becomes

$$\min_{\mathbf{x}\in\mathcal{X}}\,\max_{0\le\lambda\le\theta}\ f(\mathbf{x}) + \lambda N g(\mathbf{x}) + \gamma N S(\lambda/\theta) = \min_{\mathbf{x}\in\mathcal{X}}\,\sum_{i=1}^N\left(f^i(\mathbf{x}) + \gamma\ln\left(1+\exp\left(\theta g(\mathbf{x})/\gamma\right)\right)\right).$$

The smoothing term $S(\lambda)$ is introduced for the following purposes: (i) it works as a regularization term that prevents the dual variable $\lambda$ from being too large; (ii) it incorporates the inequality constraint into the objective function, so that we can solve the transformed problem instead; and (iii) it avoids the unnecessary complication due to the subgradient of $[g(\mathbf{x})]_+$. We now have a smoothed version of problem (2.1):

$$\min_{\mathbf{x}\in\mathcal{X}}\ \mathbf{F}(\mathbf{x}) = \sum_{i=1}^N\mathbf{F}^i(\mathbf{x}), \quad\text{where } \mathbf{F}^i(\mathbf{x}) = f^i(\mathbf{x}) + \gamma\ln\left(1+\exp\left(\theta g(\mathbf{x})/\gamma\right)\right). \qquad(3.1)$$

Note that each $\mathbf{F}^i$ is available to node $i$ only. We will propose a distributed algorithm (Algorithm 1) to solve this problem instead of problem (2.1). This intuition leads to the construction of the algorithm and also to the analysis of its convergence performance.
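Numerically, each node only ever needs the value and gradient of the smoothed constraint term in (3.1). A minimal sketch follows (our helper names; g and grad_g stand for the constraint function and a subgradient of it):

```python
import numpy as np

def smoothed_constraint(x, g, grad_g, theta, gamma):
    """Value and gradient of gamma * ln(1 + exp(theta*g(x)/gamma)), the
    smoothed incorporation of theta*[g(x)]_+ used in (3.1)."""
    z = theta * g(x) / gamma
    value = gamma * np.logaddexp(0.0, z)         # stable ln(1 + e^z)
    weight = np.exp(z - np.logaddexp(0.0, z))    # e^z / (1 + e^z), stable
    return value, weight * theta * grad_g(x)
```

The gradient weight here is exactly the logistic factor that appears in step 1 of Algorithm 1 below.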

Now we are in a position to propose the algorithm for solving the distributed strongly convex optimization problem, which is as follows.

Algorithm 1. Distributed stochastic gradient algorithm.

Input: $\hat{\mathbf{x}}_1^i = 0$; a step-size sequence $\{\beta_t\}_{t=1}^T$, $\gamma > 0$, and $\theta > L_f/\rho$.
Iteration ($t = 1,\ldots,T$):
1. $\mathbf{z}_t^i = \hat{\mathbf{x}}_t^i - \beta_t\left(\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i) + \mathbf{n}^i(\hat{\mathbf{x}}_t^i)\right) = \hat{\mathbf{x}}_t^i - \beta_t\left(\widehat{\nabla}f^i(\hat{\mathbf{x}}_t^i) + \frac{\exp(\theta g(\hat{\mathbf{x}}_t^i)/\gamma)}{1+\exp(\theta g(\hat{\mathbf{x}}_t^i)/\gamma)}\,\theta\nabla g(\hat{\mathbf{x}}_t^i)\right)$
2. $\mathbf{x}_{t+1}^i = \sum_{j=1}^N [\mathbf{W}(t)]_{ij}\,\mathbf{z}_t^j$
3. $\hat{\mathbf{x}}_{t+1}^i = \Pi_{\mathcal{B}}[\mathbf{x}_{t+1}^i] = \frac{R}{\max(\|\mathbf{x}_{t+1}^i\|_2,\,R)}\,\mathbf{x}_{t+1}^i$

Output: $\tilde{\mathbf{x}}_T^i = \Pi_{\mathcal{X}}[\bar{\mathbf{x}}_T^i]$, where $\bar{\mathbf{x}}_T^i = \frac{1}{T}\sum_{t=1}^T \hat{\mathbf{x}}_t^i$.

Remark 3. The essential idea of Algorithm 1 is to replace the projection step with the gradient computation of the inequality function that defines the domain $\mathcal{X}$. Instead of projecting the intermediate estimates onto the possibly complex convex domain $\mathcal{X}$, Algorithm 1 projects them onto the Euclidean ball $\mathcal{B}$ that contains this domain. Only at the last iteration is a single projection onto the domain $\mathcal{X}$ needed to obtain a feasible solution. As a result, the implementation of Algorithm 1 removes the need for intermediate projections.
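A compact sketch of Algorithm 1 under the assumptions above (the oracle arguments grad_f_noisy, g, grad_g and the consensus-matrix callback W_seq are hypothetical stand-ins for each node's local data; the final projection onto X is problem specific and left to the caller):

```python
import numpy as np

def dsg(grad_f_noisy, g, grad_g, W_seq, N, d, T, sigma, theta, gamma, R):
    """Sketch of Algorithm 1 (our calling convention, not the paper's code).
    grad_f_noisy(i, x) returns node i's noisy gradient (2.2); W_seq(t)
    returns the doubly stochastic matrix W(t)."""
    x_hat = np.zeros((N, d))              # \hat{x}^i_1 = 0
    x_bar = np.zeros((N, d))              # running averages \bar{x}^i_T
    for t in range(1, T + 1):
        beta_t = 1.0 / (sigma * t)        # step size used in Theorem 3.2
        z = np.empty((N, d))
        for i in range(N):
            s = theta * g(x_hat[i]) / gamma
            w = np.exp(s - np.logaddexp(0.0, s))   # e^s / (1 + e^s), step 1
            z[i] = x_hat[i] - beta_t * (grad_f_noisy(i, x_hat[i])
                                        + w * theta * grad_g(x_hat[i]))
        x = W_seq(t) @ z                  # consensus averaging (step 2)
        scale = R / np.maximum(np.linalg.norm(x, axis=1), R)
        x_hat = scale[:, None] * x        # projection onto the ball B (step 3)
        x_bar += x_hat / T
    return x_bar   # apply Pi_X once to x_bar[i] to obtain the feasible output
```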

We will seek to establish the convergence of the function value evaluated at $\tilde{\mathbf{x}}_T^i$ to the optimal value, for every $i\in[N]$, and also to characterize the convergence rate of the DSG algorithm.

Denote the average estimate at iteration $t$ by

$$\mathbf{y}_t := \frac{1}{N}\sum_{i=1}^N\hat{\mathbf{x}}_t^i. \qquad(3.2)$$

The next lemma shows that the average disagreement among all the nodes vanishes at an expected rate $O(\ln T/T)$.

Lemma 3.1. Let Assumptions 1 and 3 hold and the step-size sequence be $\beta_t = \frac{1}{\sigma t}$, $t = 1,\ldots,T$. For all $i\in[N]$ and any number of iterations $T\ge 3$, we have

$$\frac{1}{T}\sum_{t=1}^T\sum_{i=1}^N\mathbb{E}\left[\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2\right] \le C_1\frac{\ln T}{T},$$

where $C_1 = \frac{2N}{\sigma}\left(\frac{3N}{\eta^2(1-\varsigma)}+4\right)(L_f+L_n+\theta L_g)$ with $\eta = 1-\frac{\nu}{4N^2}$ and $\varsigma = \eta^{1/B}$.

Proof. See Appendix A for the proof.

It is time to present one of the main results of this section, which shows that the DSG algorithm also converges at an expected rate $O(\ln T/T)$.

Theorem 3.2. Under Assumptions 1, 2, 3, 4, and 5, let the step-size sequence be $\beta_t = \frac{1}{\sigma t}$, $t = 1,\ldots,T$, and denote $\mathbf{x}^\star = \arg\min_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x})$. If we set $\gamma = \frac{\ln T}{T}$, then, for all $j\in[N]$ and any number of iterations $T\ge 3$,

$$\mathbb{E}[f(\tilde{\mathbf{x}}_T^j)] - f(\mathbf{x}^\star) \le \left(1+\frac{L_f}{\theta\rho-L_f}\right)C_2\frac{\ln T}{T},$$

where $C_2 = \frac{2N}{\sigma}\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)\left((L_f+\theta L_g)^2+L_n^2\right) + N\ln 2$.

The proof of Theorem 3.2 is based on the following three lemmas. Lemma 3.3 provides a basic convergence estimate for the DSG algorithm, Lemma 3.4 provides the optimality condition related to the last projection, and Lemma 3.5 relates the function difference in the smoothed version to that in the original one.

Lemma 3.3. Let Assumptions 1, 3, and 4 hold. For all $j\in[N]$ and $t\ge 1$, we have

$$\mathbb{E}[\mathbf{F}(\hat{\mathbf{x}}_t^j)-\mathbf{F}(\mathbf{x})] \le \frac{1}{2\beta_t}\left((1-\sigma\beta_t)\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})]-\sum_{i=1}^N\mathbb{E}[r_{t+1}^i(\mathbf{x})]\right) + N\left((L_f+\theta L_g)^2+L_n^2\right)\beta_t + (L_f+\theta L_g)\sum_{i=1}^N\mathbb{E}[\|\hat{\mathbf{x}}_t^i-\hat{\mathbf{x}}_t^j\|_2], \qquad(3.3)$$

where $r_t^i(\mathbf{x}) = \|\hat{\mathbf{x}}_t^i-\mathbf{x}\|_2^2$ for all $\mathbf{x}\in\mathcal{B}$ and $t\ge 1$.

Proof. See Appendix B for the proof.

Lemma 3.4 (see [8]). Let $\tilde{\mathbf{x}} = \arg\min_{g(\mathbf{x})\le 0}\|\mathbf{x}-\bar{\mathbf{x}}\|_2^2$ with $g(\bar{\mathbf{x}}) > 0$; then there exists a positive scalar $\mu$ such that

$$g(\tilde{\mathbf{x}}) = 0 \quad\text{and}\quad \bar{\mathbf{x}}-\tilde{\mathbf{x}} = \mu\nabla g(\tilde{\mathbf{x}}).$$

Lemma 3.5. For all $\mathbf{x}\in\mathcal{B}$, we have

$$f(\mathbf{x}) - f(\mathbf{x}^\star) \le \mathbf{F}(\mathbf{x}) - \mathbf{F}(\mathbf{x}^\star) + N\gamma\ln 2 - N[\theta g(\mathbf{x})]_+.$$

Proof. See Appendix C for the proof.

Proof of Theorem 3.2. We divide the proof into two parts. The first part provides a bound on the function difference $\mathbb{E}[\mathbf{F}(\bar{\mathbf{x}}_T^j)-\mathbf{F}(\mathbf{x}^\star)]$. The second part derives the conclusion by using the relation between the function difference in the smoothed version and that in the original one, via Lemma 3.5.

(i) Summing the inequality (3.3) over $t = 1,\ldots,T$,

$$\sum_{t=1}^T\mathbb{E}[\mathbf{F}(\hat{\mathbf{x}}_t^j)-\mathbf{F}(\mathbf{x})] \le \underbrace{\sum_{t=1}^T\frac{1}{2\beta_t}\left((1-\sigma\beta_t)\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})]-\sum_{i=1}^N\mathbb{E}[r_{t+1}^i(\mathbf{x})]\right)}_{:=A_T} + \underbrace{N\left((L_f+\theta L_g)^2+L_n^2\right)\sum_{t=1}^T\beta_t}_{:=B_T} + \underbrace{(L_f+\theta L_g)\sum_{t=1}^T\sum_{i=1}^N\mathbb{E}[\|\hat{\mathbf{x}}_t^i-\hat{\mathbf{x}}_t^j\|_2]}_{:=C_T}. \qquad(3.4)$$

We bound the terms on the right-hand side of the preceding inequality one by one. Due to the specific choice of the step size $\beta_t = \frac{1}{\sigma t}$, the terms $A_T$ and $B_T$ can be bounded as follows:

$$\begin{aligned}
A_T &= \frac{1}{2\beta_1}\sum_{i=1}^N\mathbb{E}[r_1^i(\mathbf{x})] - \frac{\sigma}{2}\sum_{i=1}^N\mathbb{E}[r_1^i(\mathbf{x})] - \frac{1}{2\beta_T}\sum_{i=1}^N\mathbb{E}[r_{T+1}^i(\mathbf{x})] + \sum_{t=2}^T\left(\frac{1}{2\beta_t}-\frac{1}{2\beta_{t-1}}-\frac{\sigma}{2}\right)\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})]\\
&= -\frac{1}{2\beta_T}\sum_{i=1}^N\mathbb{E}[r_{T+1}^i(\mathbf{x})] \le 0
\end{aligned}$$

and

$$B_T = N\left((L_f+\theta L_g)^2+L_n^2\right)\sum_{t=1}^T\frac{1}{\sigma t} \le \frac{2N}{\sigma}\left((L_f+\theta L_g)^2+L_n^2\right)\ln T,$$

where we have used the inequality (A.8) (see Appendix A). For the term $C_T$, it follows from Lemma 3.1 that

$$C_T \le 2(L_f+\theta L_g)C_1\ln T \le \frac{4N}{\sigma}\left(\frac{3N}{\eta^2(1-\varsigma)}+4\right)(L_f+L_n+\theta L_g)^2\ln T \le \frac{8N}{\sigma}\left(\frac{3N}{\eta^2(1-\varsigma)}+4\right)\left((L_f+\theta L_g)^2+L_n^2\right)\ln T.$$

Combining the last three inequalities with (3.4) yields

$$\mathbb{E}[\mathbf{F}(\bar{\mathbf{x}}_T^j)-\mathbf{F}(\mathbf{x}^\star)] \le \frac{2N}{\sigma}\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)\left((L_f+\theta L_g)^2+L_n^2\right)\frac{\ln T}{T}, \qquad(3.5)$$

where we have used the convexity of the function $\mathbf{F}$, that is, $\mathbf{F}(\bar{\mathbf{x}}_T^j) \le \frac{1}{T}\sum_{t=1}^T\mathbf{F}(\hat{\mathbf{x}}_t^j)$.

(ii) We now derive the conclusion by resorting to Lemma 3.5 and the bound (3.5). We distinguish two cases. (a) When $g(\bar{\mathbf{x}}_T^j)\le 0$, we have $\tilde{\mathbf{x}}_T^j = \Pi_{\mathcal{X}}[\bar{\mathbf{x}}_T^j] = \bar{\mathbf{x}}_T^j$; hence, applying Lemma 3.5 at the point $\mathbf{x} = \bar{\mathbf{x}}_T^j$ (for which $[\theta g(\bar{\mathbf{x}}_T^j)]_+ = 0$), combining with the inequality (3.5), and noting that $\gamma = \frac{\ln T}{T}$, we arrive at

$$\mathbb{E}[f(\tilde{\mathbf{x}}_T^j)] - f(\mathbf{x}^\star) \le \left(\frac{2N}{\sigma}\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)\left((L_f+\theta L_g)^2+L_n^2\right)+N\ln 2\right)\frac{\ln T}{T} = C_2\frac{\ln T}{T}. \qquad(3.6)$$

(b) When $g(\bar{\mathbf{x}}_T^j) > 0$, we resort to Lemma 3.4 and Assumption 5:

$$g(\bar{\mathbf{x}}_T^j) = g(\bar{\mathbf{x}}_T^j) - g(\tilde{\mathbf{x}}_T^j) \ge \left\langle\nabla g(\tilde{\mathbf{x}}_T^j),\,\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\right\rangle = \|\nabla g(\tilde{\mathbf{x}}_T^j)\|_2\,\|\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\|_2 \ge \rho\,\|\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\|_2,$$

where the equality follows from the fact that $\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j$ points in the same direction as $\nabla g(\tilde{\mathbf{x}}_T^j)$ (cf. Lemma 3.4). Moreover, from Assumption 3, we have

$$f(\bar{\mathbf{x}}_T^j) \ge f(\tilde{\mathbf{x}}_T^j) + \left\langle\nabla f(\tilde{\mathbf{x}}_T^j),\,\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\right\rangle \ge f(\mathbf{x}^\star) - NL_f\,\|\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\|_2.$$

Combining the preceding two inequalities with (3.5), using Lemma 3.5, and taking expectations, we obtain

$$(N\theta\rho - NL_f)\,\mathbb{E}[\|\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\|_2] \le C_2\frac{\ln T}{T},$$

which yields (noting that $\theta > L_f/\rho$)

$$\mathbb{E}[\|\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\|_2] \le \frac{C_2}{N(\theta\rho-L_f)}\frac{\ln T}{T}. \qquad(3.7)$$

This implies that

$$\mathbb{E}[f(\tilde{\mathbf{x}}_T^j)] - f(\mathbf{x}^\star) \le \mathbb{E}[f(\bar{\mathbf{x}}_T^j)] - f(\mathbf{x}^\star) + NL_f\,\mathbb{E}[\|\bar{\mathbf{x}}_T^j-\tilde{\mathbf{x}}_T^j\|_2] \le C_2\frac{\ln T}{T} + NL_f\frac{C_2}{N(\theta\rho-L_f)}\frac{\ln T}{T} = \left(1+\frac{L_f}{\theta\rho-L_f}\right)C_2\frac{\ln T}{T},$$

where we have used Lemma 3.5 and the inequalities (3.5) and (3.7). The conclusion follows by combining the above inequality with (3.6). The proof is complete.

Remark 4. It is worth pointing out that the choice of the step size $\beta_t = \frac{1}{\sigma t}$ is not unique; in fact, the inequality $A_T\le 0$ still holds when $\beta_t = \frac{1}{\hat\sigma t}$ with $\hat\sigma \le \sigma$.

Remark 5. We would like to make some comparisons between the proposed algorithm and the existing ones. Based on the work [8], we adapt the algorithm in [8] to the distributed setting: we have designed a consensus mechanism that drives all the nodes' states to agree on the optimal value and have established the corresponding convergence results. In addition, we have used the idea of reductions to develop a distributed algorithm for non-strongly convex optimization (see Algorithm 2). The work [2] presents a distributed subgradient-push algorithm (with the same convergence rate) for strongly convex optimization, but the problem considered there is unconstrained and the objective functions must have Lipschitz gradients. The authors in [17] study the distributed strongly convex optimization problem in an online setting, where the proposed algorithm involves solving a projection problem at each iteration. In contrast, our proposed algorithm removes the need for intermediate projections and achieves the same convergence rate as that in [17]. Different from our previous work [21], where the convergence rate is $O(1/T^{1/4})$, here we show that if the objective functions are further assumed to be strongly convex, then the convergence rate can be improved to $O(\ln T/T)$ even when only noisy gradients of the objective functions are available.

Based on Theorem 3.2, we immediately have the following corollary, which gives the convergence rate for the noise-free version of Algorithm 1, where the stochastic gradient $\widehat{\nabla}f^i(\mathbf{x})$ becomes $\nabla f^i(\mathbf{x})$.

Corollary 3.6. Under the conditions of Theorem 3.2 and the assumption that all the nodes have access to the noise-free gradients, we have

$$f(\tilde{\mathbf{x}}_T^j) - f(\mathbf{x}^\star) \le \left(1+\frac{L_f}{\theta\rho-L_f}\right)C_2\frac{\ln T}{T},$$

where $C_2 = \frac{2N}{\sigma}\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)(L_f+\theta L_g)^2 + N\ln 2$ (i.e., the constant of Theorem 3.2 with $L_n = 0$).

3.2. Reduction to non-strongly convex optimization. The previous subsection dealt with strongly convex optimization. In this subsection, we use a reduction technique built on the developed algorithm (i.e., Algorithm 1) to derive the convergence rate for non-strongly convex optimization, that is, when Assumption 4 need not hold.

The reduction technique has been widely used in the study of online optimization (see [27]). Its basic idea is to add a controlled amount of strong convexity to each function $f^i$ at each iteration $t$ and then apply Algorithm 1 to optimize the new objective function. With this reduction idea, we propose the following algorithm.

Algorithm 2. Reduction to non-strongly convex functions.

Input: $f^i$, $T$, parameter sequence $\{\sigma_t\}_{t=1}^T$.
Set $f_t^i(\mathbf{x}) = f^i(\mathbf{x}) + \frac{\sigma_t}{2}\|\mathbf{x}\|_2^2$ for $t = 1,\ldots,T$.
Apply Algorithm 1 with parameters $f_t^i$, $\hat{\mathbf{x}}_1^i = 0$, $\{\beta_t\}_{t=1}^T$, $\gamma > 0$, $\theta > L_f/\rho$; return $\tilde{\mathbf{x}}_T^i$.
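In code, the reduction amounts to wrapping each node's gradient oracle; below is a sketch consistent with the dsg sketch above (parameter choices follow Theorem 3.7; the names are ours, not the paper's):

```python
import numpy as np

def make_reduced_oracle(grad_f_noisy):
    """Algorithm 2's reduction: at iteration t, node i works with
    f_t^i(x) = f^i(x) + (sigma_t/2)||x||_2^2, sigma_t = 1/sqrt(t), whose
    stochastic gradient is the original one plus sigma_t * x."""
    def grad(i, x, t):
        return grad_f_noisy(i, x) + x / np.sqrt(t)
    return grad
```

Algorithm 1's inner loop would then evaluate this oracle with the current iteration index and use the step size $\beta_t = 1/\sqrt{t}$ instead of $1/(\sigma t)$.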

The following theorem characterizes the convergence rate of Algorithm 2.

Theorem 3.7. Let Assumptions 1, 2, 3, and 5 hold. Let the step-size sequence and the parameter sequence be $\beta_t = \frac{1}{\sqrt{t}}$ and $\sigma_t = \frac{1}{\sqrt{t}}$, $t = 1,\ldots,T$, and denote $\mathbf{x}^\star = \arg\min_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x})$. If we set $\gamma = \frac{1}{\sqrt{T}}$, then, for all $j\in[N]$ and any number of iterations $T\ge 1$,

$$\mathbb{E}[f(\tilde{\mathbf{x}}_T^j)] - f(\mathbf{x}^\star) \le \left(1+\frac{L_f}{\theta\rho-L_f}\right)C_3\frac{1}{\sqrt{T}},$$

where $C_3 = 2N\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)\left((L_f+R+\theta L_g)^2+L_n^2\right) + R^2 + N\ln 2$.

Proof. First, we derive the disagreement among the nodes for Algorithm 2. With a slight abuse of notation, we write

$$\tilde{\nabla}_t^i = \nabla\mathbf{F}_t^i(\hat{\mathbf{x}}_t^i) + \mathbf{n}^i(\hat{\mathbf{x}}_t^i) = \nabla f_t^i(\hat{\mathbf{x}}_t^i) + \frac{\exp(\theta g(\hat{\mathbf{x}}_t^i)/\gamma)}{1+\exp(\theta g(\hat{\mathbf{x}}_t^i)/\gamma)}\,\theta\nabla g(\hat{\mathbf{x}}_t^i) + \mathbf{n}^i(\hat{\mathbf{x}}_t^i) = \nabla f^i(\hat{\mathbf{x}}_t^i) + \sigma_t\hat{\mathbf{x}}_t^i + \frac{\exp(\theta g(\hat{\mathbf{x}}_t^i)/\gamma)}{1+\exp(\theta g(\hat{\mathbf{x}}_t^i)/\gamma)}\,\theta\nabla g(\hat{\mathbf{x}}_t^i) + \mathbf{n}^i(\hat{\mathbf{x}}_t^i), \qquad(3.8)$$

where $\mathbf{F}_t^i(\mathbf{x}) = f_t^i(\mathbf{x}) + \gamma\ln\left(1+\exp\left(\theta g(\mathbf{x})/\gamma\right)\right)$. Hence we have the bound

$$\mathbb{E}\left[\|\tilde{\nabla}_t^i\|_2\right] \le L_f + \sigma_t\|\hat{\mathbf{x}}_t^i\|_2 + \theta L_g + L_n \le L_f + R + \theta L_g + L_n \qquad(3.9)$$

because of the inequality (A.6) and $\sigma_t\le 1$. On the other hand, we have the following bound for the step-size sequence $\{\beta_t\}_{t=1}^T$:

$$\sum_{t=1}^T\beta_t = 1 + \sum_{t=2}^T\frac{1}{\sqrt{t}} \le 1 + \int_1^T\frac{1}{\sqrt{u}}\,du = 2\sqrt{T}-1 \le 2\sqrt{T}. \qquad(3.10)$$

Building on the proof of Lemma 3.1 and using the preceding inequalities, we can easily get

$$\frac{1}{T}\sum_{t=1}^T\sum_{i=1}^N\mathbb{E}\left[\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2\right] \le \frac{C_4}{\sqrt{T}}, \qquad(3.11)$$

where $C_4 = 2N\left(\frac{3N}{\eta^2(1-\varsigma)}+4\right)(L_f+R+\theta L_g+L_n)$.

Second, based on the proof of Lemma 3.3, it follows that

$$\sum_{t=1}^T\mathbb{E}[\mathbf{F}_t(\hat{\mathbf{x}}_t^j)-\mathbf{F}_t(\mathbf{x})] \le \underbrace{\sum_{t=1}^T\frac{1}{2\beta_t}\left((1-\sigma_t\beta_t)\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})]-\sum_{i=1}^N\mathbb{E}[r_{t+1}^i(\mathbf{x})]\right)}_{:=A'_T} + N\left((L_f+R+\theta L_g)^2+L_n^2\right)\sum_{t=1}^T\beta_t + (L_f+R+\theta L_g)\sum_{t=1}^T\sum_{i=1}^N\mathbb{E}[\|\hat{\mathbf{x}}_t^i-\hat{\mathbf{x}}_t^j\|_2], \qquad(3.12)$$

where $\mathbf{F}_t(\mathbf{x}) = \sum_{i=1}^N\mathbf{F}_t^i(\mathbf{x})$. The last two terms can be easily bounded by using the estimates (3.10) and (3.11), respectively; we hence turn our attention to bounding the first term:

$$\begin{aligned}
A'_T &= \frac{1}{2\beta_1}\sum_{i=1}^N\mathbb{E}[r_1^i(\mathbf{x})] - \frac{\sigma_1}{2}\sum_{i=1}^N\mathbb{E}[r_1^i(\mathbf{x})] - \frac{1}{2\beta_T}\sum_{i=1}^N\mathbb{E}[r_{T+1}^i(\mathbf{x})] + \sum_{t=2}^T\left(\frac{1}{\beta_t}-\frac{1}{\beta_{t-1}}-\sigma_t\right)\frac{1}{2}\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})]\\
&\le \sum_{t=2}^T\left(\sqrt{t}-\sqrt{t-1}-\frac{1}{\sqrt{t}}\right)\frac{1}{2}\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})] \le 0,
\end{aligned}$$

where in the first inequality we have used the fact that $\beta_1 = \sigma_1 = 1$, and in the last inequality we have used $\sqrt{t}-\sqrt{t-1}-\frac{1}{\sqrt{t}} < 0$ for all $t\ge 2$. This yields

$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}[\mathbf{F}_t(\hat{\mathbf{x}}_t^j)-\mathbf{F}_t(\mathbf{x})] \le 2N\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)\left((L_f+R+\theta L_g)^2+L_n^2\right)\frac{1}{\sqrt{T}}. \qquad(3.13)$$

Moreover, since

$$\mathbf{F}_t(\hat{\mathbf{x}}_t^j) = \mathbf{F}(\hat{\mathbf{x}}_t^j) + \frac{\sigma_t}{2}\|\hat{\mathbf{x}}_t^j\|_2^2 \quad\text{and}\quad \mathbf{F}_t(\mathbf{x}) = \mathbf{F}(\mathbf{x}) + \frac{\sigma_t}{2}\|\mathbf{x}\|_2^2,$$

the left-hand side of (3.13) can be bounded from below as

$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}[\mathbf{F}_t(\hat{\mathbf{x}}_t^j)-\mathbf{F}_t(\mathbf{x})] \ge \frac{1}{T}\sum_{t=1}^T\mathbb{E}[\mathbf{F}(\hat{\mathbf{x}}_t^j)-\mathbf{F}(\mathbf{x})] - \frac{1}{T}\sum_{t=1}^T\frac{\sigma_t}{2}R^2 \ge \mathbb{E}[\mathbf{F}(\bar{\mathbf{x}}_T^j)-\mathbf{F}(\mathbf{x})] - \frac{R^2}{\sqrt{T}}, \qquad(3.14)$$

where the last step uses the convexity of $\mathbf{F}$ and $\sum_{t=1}^T\sigma_t \le 2\sqrt{T}$. Thus, the conclusion can be obtained by following an argument similar to that of the proof of Theorem 3.2.

Remark 6. Under appropriate conditions, our convergence rate is a factor of $1/T^{1/4}$ better than that attained in [21] (i.e., $O(1/T^{1/4})$). It is also worth noting that, for general nonsmooth convex functions, the best achievable rate for centralized subgradient methods is $O(1/\sqrt{T})$.

Similarly, we have the following corollary, which gives the convergence rate for the noise-free version of Algorithm 2.

Corollary 3.8. Under the conditions of Theorem 3.7 and the assumption that all the nodes have access to the noise-free gradients, we have

$$f(\tilde{\mathbf{x}}_T^j) - f(\mathbf{x}^\star) \le \left(1+\frac{L_f}{\theta\rho-L_f}\right)C_3\frac{1}{\sqrt{T}},$$

where $C_3 = 2N\left(\frac{12N}{\eta^2(1-\varsigma)}+17\right)(L_f+R+\theta L_g)^2 + R^2 + N\ln 2$ (i.e., the constant of Theorem 3.7 with $L_n = 0$).

4. Simulation results. In this section, we demonstrate some simulations of the DSG algorithm. To be specific, we consider a distributed estimation problem of the following form:

$$\begin{array}{ll}\text{minimize} & \sum_{i=1}^N w^i\|\mathbf{x}-\boldsymbol{\nu}^i\|_2^2\\ \text{subject to} & \mathbf{x}\in\mathcal{X},\end{array} \qquad(4.1)$$

where $w^i\in\mathbb{R}$ and $\boldsymbol{\nu}^i\in\mathbb{R}^d$ are deterministic parameters known only to node $i$. Note that this is a well-known problem in distributed estimation (see [2] and references therein).

We illustrate the proposed algorithm on a specific problem instance with $g(\mathbf{x}) = \|\mathbf{x}\|_1 - 1$ (i.e., $\mathcal{X} = \{\mathbf{x}\in\mathbb{R}^d : \|\mathbf{x}\|_1\le 1\}$), with $w^i$ and $\boldsymbol{\nu}^i$ generated from the interval $(1,2)$ and a unit-norm distribution, respectively. It is easy to see that $\mathcal{X}\subseteq\mathcal{B} = \{\mathbf{x}\in\mathbb{R}^d : \|\mathbf{x}\|_2\le 1\}$. We now give the specific choices of the parameters used in the simulations. First, note that each $f^i$ is $2w^i$-strongly convex; hence we set the step size as $\beta_t = \frac{1}{\sigma t}$, where $\sigma = \min_{i\in[N]} 2w^i$. Second, we set $\rho = 1$, based on the following argument: from Assumption 5 we can choose $\rho = \min_{g(\mathbf{x})=0}\|\nabla g(\mathbf{x})\|_2$, where $g(\mathbf{x}) = \|\mathbf{x}\|_1 - 1$; the minimum is achieved by setting $\mathbf{x} = \mathbf{e}_i$ and choosing the associated subgradient to be $\mathbf{e}_i$ itself, where $\mathbf{e}_i$ is a standard basis vector in $\mathbb{R}^d$. A reasonable estimate for $L_f$ is $L_f = \max_{i\in[N]} 2w^i(1+\|\boldsymbol{\nu}^i\|_2)$. Hence, we set $\theta = L_f/\rho = L_f$.

The noises $\mathbf{n}^i(\hat{\mathbf{x}}_t^i)$ are random variables generated independently and identically distributed from the normal distribution $\mathcal{N}(0,\sigma_n\mathbf{I}_{d\times d})$, where $\sigma_n$ is some positive constant that will be specified in what follows. In all cases, we use a network of size $N = 60$, estimate dimension $d = 2$, and initial estimates $\hat{\mathbf{x}}_1^i = 0$ for all $i$. Note that the simulation results are based on the average of 100 realizations.
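A sketch of this problem instance, which can be fed to the dsg sketch of section 3 (the sampling details and the symbol sigma_n for the noise level are our assumptions where the text leaves them unspecified):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, T = 60, 2, 100

w = rng.uniform(1.0, 2.0, size=N)                 # weights w^i in (1, 2)
nu = rng.normal(size=(N, d))
nu /= np.linalg.norm(nu, axis=1, keepdims=True)   # unit-norm nu^i

sigma = 2 * w.min()                               # each f^i is 2w^i-strongly convex
L_f = (2 * w * (1 + np.linalg.norm(nu, axis=1))).max()
rho = R = 1.0
theta = L_f / rho                                 # theta = L_f as in the text
gamma = np.log(T) / T                             # gamma of Theorem 3.2
sigma_n = 0.15                                    # noise parameter (treated here
                                                  # as the per-coordinate std)

def grad_f_noisy(i, x):
    """Noisy gradient of f^i(x) = w^i ||x - nu^i||_2^2, cf. (2.2)."""
    return 2 * w[i] * (x - nu[i]) + rng.normal(scale=sigma_n, size=d)

g = lambda x: np.abs(x).sum() - 1.0               # g(x) = ||x||_1 - 1
grad_g = lambda x: np.sign(x)                     # a subgradient of g
```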

Fig. 1. The random graph.

Fig. 2. Plot of the relative function error versus the number of iterations T for a random graph.

Next, let us investigate the convergence performance of the DSG algorithm. We first implement the proposed algorithm over a random graph, generated as in [9] (see Figure 1). Figure 2 plots the relative function error (on a log scale), $|f(\tilde{\mathbf{x}}_T^i)-f(\mathbf{x}^\star)|/|f(\mathbf{x}^\star)|$, versus the number of iterations $T$ for two randomly selected nodes, together with the corresponding noise-free cases for comparison. Moreover, Figure 2 also shows the sample mean and standard deviation of the relative function error for $T$ in multiples of 20, where the error bars show the mean plus and minus one standard deviation. The noise parameter is chosen as $\sigma_n = 0.15$. The same quantities are illustrated for a ring graph in Figure 3.

Fig. 3. Plot of the relative function error versus the number of iterations T for a ring graph.

Fig. 4. Plot of the disagreement versus the iteration t.

Figures 2 and 3 show that the estimates of the nodes converge in both cases, and that the error decays faster for the random graph than for the ring graph. This is consistent with intuition, because the random graph (shown in Figure 1) clearly has better connectivity than the ring graph. A similar phenomenon can be seen in Figure 4, which shows how the disagreement, $\sum_{i=1}^N\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2$, evolves over the iterations $t$. We can see from Figure 4 that the disagreement diminishes at a faster rate for the random graph than for the ring graph. Note that Figure 4 also shows the sample mean and standard deviation of the disagreement for $t$ in multiples of 100.

5. Conclusion. In this paper, we studied the constrained multiagent optimization problem. We proposed a class of DSG algorithms without intermediate projections. By virtue of the smoothed constraint incorporation technique, we established an $O(\ln T/T)$ convergence rate for strongly convex optimization. For the case when the individual objective functions are non-strongly convex, we established an optimal $O(1/\sqrt{T})$ convergence rate based on the idea of the reduction method. There are several interesting questions that remain to be explored. For instance, one possible future research direction is to apply the proposed algorithm to the distributed supervised learning problem. Also, it would be of interest to relax the assumption requiring all the objective functions to be strongly convex.

Appendix A. Proof of Lemma 3.1.

Lemma A.1 (see [16]). Under Assumption 1, for all $i,j$ and all $t\ge s$, we have

$$\left|[\mathbf{W}(t:s)]_{ij} - N^{-1}\right| \le \eta^{\left\lceil\frac{t-s+1}{B}\right\rceil-2} \le \eta^{-2}\varsigma^{t-s+1},$$

where $\mathbf{W}(t:s)$ is a transition matrix defined by $\mathbf{W}(t:s) = \mathbf{W}(t)\mathbf{W}(t-1)\cdots\mathbf{W}(s+1)\mathbf{W}(s)$ with $\mathbf{W}(t:t) = \mathbf{W}(t)$.

Proof of Lemma 3.1. To simplify the notation, we denote, for all $i$ and $t$,

$$\tilde{\nabla}_t^i := \nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i) + \mathbf{n}^i(\hat{\mathbf{x}}_t^i).$$

The average estimate $\mathbf{y}_t$ evolves as follows:

$$\mathbf{y}_t = \frac{1}{N}\sum_{i=1}^N\Pi_{\mathcal{B}}[\mathbf{x}_t^i] = \frac{1}{N}\sum_{i=1}^N\left(\mathbf{x}_t^i+\mathbf{p}_t^i\right) = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\mathbf{z}_{t-1}^j + \frac{1}{N}\sum_{i=1}^N\mathbf{p}_t^i = \frac{1}{N}\sum_{i=1}^N\mathbf{z}_{t-1}^i + \frac{1}{N}\sum_{i=1}^N\mathbf{p}_t^i, \qquad(A.1)$$

where $\mathbf{p}_t^i = \Pi_{\mathcal{B}}[\mathbf{x}_t^i] - \mathbf{x}_t^i$. The first equality follows from the fact that $\hat{\mathbf{x}}_t^i = \frac{R}{\max(\|\mathbf{x}_t^i\|_2,\,R)}\mathbf{x}_t^i = \Pi_{\mathcal{B}}[\mathbf{x}_t^i]$, and in the last equality we have used the relation

$$\sum_{i=1}^N\sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\boldsymbol{\varphi}^j = \sum_{j=1}^N\left(\sum_{i=1}^N[\mathbf{W}(t-1)]_{ij}\right)\boldsymbol{\varphi}^j = \sum_{i=1}^N\boldsymbol{\varphi}^i \qquad(A.2)$$

because $\mathbf{W}(t-1)$ is doubly stochastic. Substituting the update formula for $\mathbf{z}_{t-1}^i$ into (A.1), we have

$$\mathbf{y}_t = \frac{1}{N}\sum_{i=1}^N\left(\hat{\mathbf{x}}_{t-1}^i - \beta_{t-1}\tilde{\nabla}_{t-1}^i\right) + \frac{1}{N}\sum_{i=1}^N\mathbf{p}_t^i = \mathbf{y}_{t-1} - \beta_{t-1}\frac{1}{N}\sum_{i=1}^N\tilde{\nabla}_{t-1}^i + \frac{1}{N}\sum_{i=1}^N\mathbf{p}_t^i;$$

applying this equation recursively, we obtain

$$\mathbf{y}_t = \mathbf{y}_1 - \sum_{s=1}^{t-1}\beta_s\frac{1}{N}\sum_{i=1}^N\tilde{\nabla}_s^i + \sum_{s=1}^{t-1}\frac{1}{N}\sum_{i=1}^N\mathbf{p}_{s+1}^i. \qquad(A.3)$$

Similarly, the estimate of node $i$ evolves as follows:

$$\hat{\mathbf{x}}_t^i = \sum_{j=1}^N[\mathbf{W}(t-1:1)]_{ij}\hat{\mathbf{x}}_1^j - \sum_{s=1}^{t-1}\beta_s\sum_{j=1}^N[\mathbf{W}(t-1:s)]_{ij}\tilde{\nabla}_s^j + \sum_{s=1}^{t-2}\sum_{j=1}^N[\mathbf{W}(t-1:s+1)]_{ij}\mathbf{p}_{s+1}^j + \mathbf{p}_t^i. \qquad(A.4)$$

Combining (A.3) and (A.4), we arrive at

$$\begin{aligned}
\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2 &\le \left\|\sum_{j=1}^N[\mathbf{W}(t-1:1)]_{ij}\hat{\mathbf{x}}_1^j - \mathbf{y}_1\right\|_2 + \sum_{s=1}^{t-1}\beta_s\sum_{j=1}^N\left|[\mathbf{W}(t-1:s)]_{ij}-N^{-1}\right|\|\tilde{\nabla}_s^j\|_2\\
&\quad + \sum_{s=1}^{t-2}\sum_{j=1}^N\left|[\mathbf{W}(t-1:s+1)]_{ij}-N^{-1}\right|\|\mathbf{p}_{s+1}^j\|_2 + \|\mathbf{p}_t^i\|_2 + \left\|\frac{1}{N}\sum_{i=1}^N\mathbf{p}_t^i\right\|_2,
\end{aligned} \qquad(A.5)$$

where the first term on the right-hand side is zero, because $\hat{\mathbf{x}}_1^j = 0$ for all $j\in[N]$ and $\mathbf{y}_1 = 0$. We are left to bound the norm of the error due to the projection:

$$\|\mathbf{p}_t^i\|_2 = \left\|\Pi_{\mathcal{B}}[\mathbf{x}_t^i] - \sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\left(\hat{\mathbf{x}}_{t-1}^j-\beta_{t-1}\tilde{\nabla}_{t-1}^j\right)\right\|_2 \le \left\|\Pi_{\mathcal{B}}[\mathbf{x}_t^i] - \sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\hat{\mathbf{x}}_{t-1}^j\right\|_2 + \beta_{t-1}\sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\|\tilde{\nabla}_{t-1}^j\|_2.$$

Note that $\hat{\mathbf{x}}_t^i\in\mathcal{B}$ holds for all $i$ and $t\ge 1$. Hence, we can bound the first term by using the nonexpansiveness of the Euclidean projection:

$$\left\|\Pi_{\mathcal{B}}[\mathbf{x}_t^i] - \sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\hat{\mathbf{x}}_{t-1}^j\right\|_2 \le \left\|\mathbf{x}_t^i - \sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\hat{\mathbf{x}}_{t-1}^j\right\|_2 \le \beta_{t-1}\sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\|\tilde{\nabla}_{t-1}^j\|_2.$$

This yields the final bound on $\|\mathbf{p}_t^i\|_2$ in expectation, that is,

$$\mathbb{E}\left[\|\mathbf{p}_t^i\|_2\right] \le 2\beta_{t-1}\sum_{j=1}^N[\mathbf{W}(t-1)]_{ij}\,\mathbb{E}\left[\|\tilde{\nabla}_{t-1}^j\|_2\right] \le 2(L_f+L_n+\theta L_g)\beta_{t-1},$$

due to the double stochasticity of the weight matrix and the following bound on the stochastic gradient, which holds by Assumption 3 and the assumption (2.3) on the stochastic noise:

$$\mathbb{E}\left[\|\tilde{\nabla}_t^i\|_2\right] \le \|\nabla f^i(\hat{\mathbf{x}}_t^i)\|_2 + \mathbb{E}\left[\|\mathbf{n}^i(\hat{\mathbf{x}}_t^i)\|_2\right] + \theta\|\nabla g(\hat{\mathbf{x}}_t^i)\|_2 \le L_f + \sqrt{\mathbb{E}\left[\|\mathbf{n}^i(\hat{\mathbf{x}}_t^i)\|_2^2\right]} + \theta L_g \le L_f + L_n + \theta L_g, \qquad(A.6)$$

where we have used Jensen's inequality. Combining the above relation with (A.5) and using Lemma A.1, we have

$$\mathbb{E}\left[\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2\right] \le N\eta^{-2}(L_f+L_n+\theta L_g)\sum_{s=1}^{t-1}\beta_s\varsigma^{t-s} + 2N\eta^{-2}(L_f+L_n+\theta L_g)\sum_{s=1}^{t-2}\beta_s\varsigma^{t-s-1} + 4(L_f+L_n+\theta L_g)\beta_{t-1},$$

where we have used the convexity of the norm function, that is, $\left\|\frac{1}{N}\sum_{i=1}^N\mathbf{p}_t^i\right\|_2 \le \frac{1}{N}\sum_{i=1}^N\|\mathbf{p}_t^i\|_2$. By noting that $\varsigma < 1$, we can get a more compact bound:

$$\mathbb{E}\left[\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2\right] \le 3N\eta^{-2}(L_f+L_n+\theta L_g)\sum_{s=1}^{t-1}\beta_s\varsigma^{t-s-1} + 4(L_f+L_n+\theta L_g)\beta_{t-1}. \qquad(A.7)$$

Summing (A.7) over $t = 1,\ldots,T$ yields

$$\sum_{t=1}^T\mathbb{E}\left[\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2\right] = \sum_{t=2}^T\mathbb{E}\left[\|\hat{\mathbf{x}}_t^i-\mathbf{y}_t\|_2\right] \le 3N\eta^{-2}(L_f+L_n+\theta L_g)\underbrace{\sum_{t=2}^T\sum_{s=1}^{t-1}\beta_s\varsigma^{t-s-1}}_{\Delta_T} + 4(L_f+L_n+\theta L_g)\sum_{t=2}^T\beta_{t-1}.$$

The term $\Delta_T$ can be bounded as follows:

$$\Delta_T \le \sum_{t=1}^{T-1}\beta_t\left(\sum_{s=0}^{T-2}\varsigma^s\right) \le \sum_{t=1}^{T-1}\beta_t\left(\sum_{s=0}^{\infty}\varsigma^s\right) \le \frac{1}{1-\varsigma}\sum_{t=1}^{T-1}\beta_t.$$

A direct consequence of the step-size choice is that, for all $T\ge 3$,

$$\sum_{t=1}^T\beta_t = \frac{1}{\sigma}\left(1+\sum_{t=2}^T\frac{1}{t}\right) \le \frac{1}{\sigma}\left(1+\int_1^T\frac{1}{u}\,du\right) \le \frac{2}{\sigma}\ln T. \qquad(A.8)$$

This, combined with the last two inequalities, gives our final estimate.

Appendix B. Proof of Lemma 3.3. We follow the standard analysis to derive the general evolution of $r_t^i(\mathbf{x})$:

$$r_{t+1}^i(\mathbf{x}) = \|\hat{\mathbf{x}}_{t+1}^i-\mathbf{x}\|_2^2 = \|\Pi_{\mathcal{B}}[\mathbf{x}_{t+1}^i]-\mathbf{x}\|_2^2 \le \|\mathbf{x}_{t+1}^i-\mathbf{x}\|_2^2 \le \sum_{j=1}^N[\mathbf{W}(t)]_{ij}\|\mathbf{z}_t^j-\mathbf{x}\|_2^2, \qquad(B.1)$$

where the first inequality follows from the nonexpansiveness of the Euclidean projection and the last inequality from the convexity of the squared norm and the double stochasticity of $\mathbf{W}(t)$. For the term $\|\mathbf{z}_t^j-\mathbf{x}\|_2^2$, we have the following estimate, according to the update formula for $\mathbf{z}_t^j$:

$$\|\mathbf{z}_t^j-\mathbf{x}\|_2^2 = \|\hat{\mathbf{x}}_t^j-\beta_t\tilde{\nabla}_t^j-\mathbf{x}\|_2^2 = r_t^j(\mathbf{x}) + \beta_t^2\|\tilde{\nabla}_t^j\|_2^2 + 2\beta_t\left\langle\mathbf{x}-\hat{\mathbf{x}}_t^j,\,\tilde{\nabla}_t^j\right\rangle.$$

Substituting the preceding equality into (B.1) and then summing over all $i$ yields

$$\sum_{i=1}^N r_{t+1}^i(\mathbf{x}) \le \sum_{i=1}^N\sum_{j=1}^N[\mathbf{W}(t)]_{ij}\left(r_t^j(\mathbf{x})+\beta_t^2\|\tilde{\nabla}_t^j\|_2^2+2\beta_t\left\langle\mathbf{x}-\hat{\mathbf{x}}_t^j,\,\tilde{\nabla}_t^j\right\rangle\right) = \sum_{j=1}^N\left(r_t^j(\mathbf{x})+\beta_t^2\|\tilde{\nabla}_t^j\|_2^2+2\beta_t\left\langle\mathbf{x}-\hat{\mathbf{x}}_t^j,\,\tilde{\nabla}_t^j\right\rangle\right) \qquad(B.2)$$

with the same manipulation as in (A.2). The term $\|\tilde{\nabla}_t^i\|_2^2$ can be bounded as follows:

$$\|\tilde{\nabla}_t^i\|_2^2 = \|\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i)+\mathbf{n}^i(\hat{\mathbf{x}}_t^i)\|_2^2 \le 2\|\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i)\|_2^2 + 2\|\mathbf{n}^i(\hat{\mathbf{x}}_t^i)\|_2^2.$$

This leads to

$$\mathbb{E}\left[\|\tilde{\nabla}_t^i\|_2^2\right] \le 2(L_f+\theta L_g)^2 + 2L_n^2. \qquad(B.3)$$

Combining the inequalities (B.2) and (B.3) gives

$$\sum_{i=1}^N\mathbb{E}\left[\left\langle\hat{\mathbf{x}}_t^i-\mathbf{x},\,\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i)\right\rangle\right] = \sum_{i=1}^N\mathbb{E}\left[\left\langle\hat{\mathbf{x}}_t^i-\mathbf{x},\,\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i)+\mathbf{n}^i(\hat{\mathbf{x}}_t^i)\right\rangle\right] = \sum_{i=1}^N\mathbb{E}\left[\left\langle\hat{\mathbf{x}}_t^i-\mathbf{x},\,\tilde{\nabla}_t^i\right\rangle\right] \le \frac{1}{2\beta_t}\left(\sum_{i=1}^N\mathbb{E}[r_t^i(\mathbf{x})]-\sum_{i=1}^N\mathbb{E}[r_{t+1}^i(\mathbf{x})]\right) + N\left((L_f+\theta L_g)^2+L_n^2\right)\beta_t, \qquad(B.4)$$

where the first equality follows from $\mathbb{E}\left[\left\langle\hat{\mathbf{x}}_t^i-\mathbf{x},\,\mathbf{n}^i(\hat{\mathbf{x}}_t^i)\right\rangle\right] = 0$, which holds by the independence of $\hat{\mathbf{x}}_t^i$ and $\mathbf{n}^i(\hat{\mathbf{x}}_t^i)$ and the relation $\mathbb{E}[\mathbf{n}^i(\hat{\mathbf{x}}_t^i)] = 0$. It remains to provide a lower bound on the left-hand side of the preceding inequality, which is given as follows:

$$\begin{aligned}
\sum_{i=1}^N\left(\mathbf{F}^i(\hat{\mathbf{x}}_t^i)-\mathbf{F}^i(\mathbf{x})+\frac{\sigma}{2}\|\hat{\mathbf{x}}_t^i-\mathbf{x}\|_2^2\right) &= \sum_{i=1}^N\left(\mathbf{F}^i(\hat{\mathbf{x}}_t^j)-\mathbf{F}^i(\mathbf{x})+\mathbf{F}^i(\hat{\mathbf{x}}_t^i)-\mathbf{F}^i(\hat{\mathbf{x}}_t^j)+\frac{\sigma}{2}\|\hat{\mathbf{x}}_t^i-\mathbf{x}\|_2^2\right)\\
&\ge \mathbf{F}(\hat{\mathbf{x}}_t^j)-\mathbf{F}(\mathbf{x})+\frac{\sigma}{2}\sum_{i=1}^N r_t^i(\mathbf{x})+\sum_{i=1}^N\left\langle\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^j),\,\hat{\mathbf{x}}_t^i-\hat{\mathbf{x}}_t^j\right\rangle\\
&\ge \mathbf{F}(\hat{\mathbf{x}}_t^j)-\mathbf{F}(\mathbf{x})+\frac{\sigma}{2}\sum_{i=1}^N r_t^i(\mathbf{x})-(L_f+\theta L_g)\sum_{i=1}^N\|\hat{\mathbf{x}}_t^i-\hat{\mathbf{x}}_t^j\|_2
\end{aligned} \qquad(B.5)$$

because each $\mathbf{F}^i$ is $\sigma$-strongly convex, that is, $\mathbf{F}^i(\mathbf{x}) \ge \mathbf{F}^i(\hat{\mathbf{x}}_t^i) + \left\langle\mathbf{x}-\hat{\mathbf{x}}_t^i,\,\nabla\mathbf{F}^i(\hat{\mathbf{x}}_t^i)\right\rangle + \frac{\sigma}{2}\|\hat{\mathbf{x}}_t^i-\mathbf{x}\|_2^2$ (since each $f^i$ is $\sigma$-strongly convex by Assumption 4, and the smoothed constraint term is convex). This, combined with the estimate (B.4), gives the desired result.

Appendix C. Proof of Lemma 3.5. To simplify the notation, we denote $g_\gamma(\mathbf{x}) = \gamma\ln\left(1+\exp\left(\theta g(\mathbf{x})/\gamma\right)\right)$. It is easy to show that $g_\gamma(\mathbf{x}^\star) \le \gamma\ln 2$, due to the fact that $g(\mathbf{x}^\star)\le 0$. For any $\mathbf{x}\in\mathcal{B}$, we can show that

$$g_\gamma(\mathbf{x}) \ge [\theta g(\mathbf{x})]_+$$

based on the following argument: if $g(\mathbf{x})\le 0$, we have $g_\gamma(\mathbf{x}) > \gamma\ln 1 = 0 = [\theta g(\mathbf{x})]_+$; otherwise we have $g_\gamma(\mathbf{x}) = \gamma\ln\left(1+\exp\left(\theta g(\mathbf{x})/\gamma\right)\right) > \theta g(\mathbf{x}) = [\theta g(\mathbf{x})]_+$, according to the inequality $\ln(1+\exp(a)) > a$ for all $a\ge 0$. Hence, using the fact that $\mathbf{F}(\mathbf{x}) = f(\mathbf{x}) + N g_\gamma(\mathbf{x})$, it is easy to show that

$$f(\mathbf{x}) - f(\mathbf{x}^\star) = \mathbf{F}(\mathbf{x}) - \mathbf{F}(\mathbf{x}^\star) + N g_\gamma(\mathbf{x}^\star) - N g_\gamma(\mathbf{x}) \le \mathbf{F}(\mathbf{x}) - \mathbf{F}(\mathbf{x}^\star) + N\gamma\ln 2 - N[\theta g(\mathbf{x})]_+.$$

This gives the desired result.

Acknowledgment. The authors would like to thank the anonymous reviewers for their careful reading and constructive comments that improved this paper.


REFERENCES

[1] A. Nedic, A. Ozdaglar, and P. A. Parrilo, Constrained consensus and optimization in multi-agent networks, IEEE Trans. Automat. Control, 55 (2010), pp. 922–938.
[2] A. Nedic and A. Olshevsky, Stochastic Gradient-Push for Strongly Convex Functions on Time-Varying Directed Graphs, arXiv:1406.2075, 2014.
[3] B. Gharesifard and J. Cortes, Distributed continuous-time convex optimization on weight-balanced digraphs, IEEE Trans. Automat. Control, 59 (2014), pp. 781–786.
[4] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automat. Control, 54 (2009), pp. 48–61.
[5] J. Duchi, A. Agarwal, and M. Wainwright, Dual averaging for distributed optimization: Convergence analysis and network scaling, IEEE Trans. Automat. Control, 57 (2012), pp. 592–606.
[6] M. Zhong and C. G. Cassandras, Asynchronous distributed optimization with event-driven communication, IEEE Trans. Automat. Control, 55 (2010), pp. 2735–2750.
[7] M. Zhu and S. Martínez, On distributed convex optimization under inequality and equality constraints, IEEE Trans. Automat. Control, 57 (2012), pp. 151–164.
[8] M. Mahdavi, T. Yang, R. Jin, S. Zhu, and J. Yi, Stochastic gradient descent with only one projection, in Advances in Neural Information Processing Systems 25, 2012, pp. 503–511.
[9] L. Xiao and S. Boyd, Fast linear iterations for distributed averaging, Systems Control Lett., 53 (2004), pp. 65–78.
[10] I. Lobel and A. Ozdaglar, Distributed subgradient methods for convex optimization over random networks, IEEE Trans. Automat. Control, 56 (2011), pp. 1291–1306.
[11] A. Olshevsky and J. N. Tsitsiklis, Convergence speed in distributed consensus and averaging, SIAM J. Control Optim., 48 (2009), pp. 33–55.
[12] B. Johansson, T. Keviczky, M. Johansson, and K. H. Johansson, Subgradient methods and consensus algorithms for solving convex optimization problems, in Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 2008, pp. 4185–4190.
[13] S. S. Ram, A. Nedic, and V. V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization, J. Optim. Theory Appl., 147 (2010), pp. 516–545.
[14] D. Yuan, S. Xu, and H. Zhao, Distributed primal-dual subgradient method for multiagent optimization via consensus algorithms, IEEE Trans. Systems Man Cybernet. Part B, 41 (2011), pp. 1715–1724.
[15] T. H. Chang, A. Nedic, and A. Scaglione, Distributed constrained optimization by consensus-based primal-dual perturbation method, IEEE Trans. Automat. Control, 59 (2014), pp. 1524–1538.
[16] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, Distributed subgradient methods and quantization effects, in Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 2008, pp. 4177–4184.
[17] K. Tsianos and M. Rabbat, Distributed strongly convex optimization, in Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, 2012, pp. 593–600.
[18] P. Lin and W. Ren, Constrained consensus in unbalanced networks with communication delays, IEEE Trans. Automat. Control, 59 (2014), pp. 775–781.
[19] J. Chen and A. H. Sayed, Diffusion adaptation strategies for distributed optimization and learning over networks, IEEE Trans. Signal Process., 60 (2012), pp. 4289–4305.
[20] D. Jakovetic, J. Xavier, and J. M. F. Moura, Fast distributed gradient methods, IEEE Trans. Automat. Control, 59 (2014), pp. 1131–1146.
[21] D. Yuan, D. W. C. Ho, and S. Xu, Regularized primal-dual subgradient method for distributed constrained optimization, IEEE Trans. Cybernet., 46 (2016), pp. 2109–2118.
[22] Y. Lou, G. Shi, K. H. Johansson, and Y. Hong, Approximate projected consensus for convex intersection computation: Convergence analysis and critical error angle, IEEE Trans. Automat. Control, 59 (2014), pp. 1722–1736.
[23] W. Shi, Q. Ling, G. Wu, and W. Yin, EXTRA: An exact first-order algorithm for decentralized consensus optimization, SIAM J. Optim., 25 (2015), pp. 944–966.
[24] Y. Zhang, Y. Lou, Y. Hong, and L. Xie, Distributed projection-based algorithms for source localization in wireless sensor networks, IEEE Trans. Wireless Commun., 14 (2015), pp. 3131–3142.
[25] D. Yuan, D. W. C. Ho, and S. Xu, Inexact dual averaging method for distributed multi-agent optimization, Systems Control Lett., 71 (2014), pp. 23–30.
[26] P. Yi, Y. Hong, and F. Liu, Distributed gradient algorithm for constrained optimization with application to load sharing in power systems, Systems Control Lett., 83 (2015), pp. 45–52.
[27] E. Hazan, Introduction to Online Convex Optimization, http://ocobook.cs.princeton.edu/OCObook.pdf, 2015.
[28] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program., 103 (2005), pp. 127–152.

