Asynchronous Approximate Data-parallel Computation

Asim Kadav and Erik Kruus
NEC Labs America

Abstract

Emerging workloads, such as graph processing and machine learning, are approximate because of the scale of data involved and the stochastic nature of the underlying algorithms. These algorithms are often distributed over multiple machines using bulk-synchronous processing (BSP) or other synchronous processing paradigms such as map-reduce. However, data-parallel processing primitives such as repeated barrier and reduce operations introduce high synchronization overheads. Hence, many existing data-processing platforms use asynchrony and staleness to improve data-parallel job performance. Often, practitioners simply change the synchronous communication to asynchronous between the worker nodes in the cluster. This improves the throughput of data processing but results in poor accuracy of the final output, since different workers may progress at different speeds and process inconsistent intermediate outputs.

In this paper, we present ASAP, a model that provides asynchronous and approximate processing semantics for data-parallel computation. ASAP provides fine-grained worker synchronization using NOTIFY-ACK semantics that allows independent workers to run asynchronously. ASAP also provides stochastic reduce, which offers approximate but guaranteed convergence to the same result as an aggregated all-reduce. In our results, we show that ASAP reduces synchronization costs, provides 2-10X speedups in convergence and up to 10X savings in network costs for distributed machine learning applications, and offers strong convergence guarantees.

1. Introduction

Large-scale distributed data-parallel computation often provides two fundamental constructs to scale out local data processing: first, a merge or reduce operation that allows the workers to merge updates from all other workers; and second, a barrier or implicit wait operation that ensures that all workers synchronize and operate at similar speeds. For example, the bulk-synchronous model (BSP) is a general paradigm to model data-intensive distributed processing [62]. Here, each node, after processing a specific amount of data, synchronizes with the other nodes using barrier and reduce operations. The BSP model is widely used to implement many big data applications, frameworks and libraries, such as in the areas of graph processing and machine learning [1, 28, 33, 36, 45, 55]. Other synchronous paradigms such as map-reduce [26, 68], the parameter server [25, 43] and dataflow based systems [6, 39, 48] use similar constructs to synchronize outputs across multiple workers.

There is an emerging class of big data applications such as graph processing and machine learning (ML) that are approximate because of the stochastic nature of the underlying algorithms, which converge to the final solution in an iterative fashion. These iterative-convergent algorithms operate on large amounts of data, and unlike traditional TPC style workloads that are CPU bound [53], they incur significant network and synchronization costs by communicating large vectors between their workers. These applications can gain performance by reducing the synchronization costs in two ways. First, the workers can operate over stale intermediate outputs. The stochastic algorithms operate over input data in an iterative fashion to produce and communicate intermediate outputs with other workers. However, it is not imperative that the workers perform a reduce over all the intermediate outputs at every iteration. Second, the synchronization requirements between the workers may be relaxed, allowing partial, stale or overwritten outputs. This is possible because in some cases the iterative nature of data processing and the stochastic nature of the algorithms provide an opportunity to correct any errors introduced by staleness or incorrect synchronization.

There has been recent research that uses this property by processing stale outputs, or by removing all synchronization [56, 64]. However, naïvely converting the algorithms from synchronous to asynchronous can increase the throughput but may not improve the convergence speed. This is because an increase in data processing speed may not produce the final output with the same accuracy, and in some cases may even converge the underlying algorithm to an incorrect value [47].

Hence, to provide asynchronous and approximate semantics with reasonable correctness guarantees for iterative-convergent algorithms, we present Asynchronous and Approximate (ASAP) abstractions for data-parallel computation. To facilitate approximate processing, we describe stochastic reduce, a sparse reduce primitive that mitigates the communication and synchronization costs by performing the reduce operation with fewer workers. We construct the reduce operator by choosing workers based on sparse, directed expander graphs over the underlying communication nodes, which mitigates CPU and network costs during reduce for iterative-convergent algorithms. Furthermore, stochastic reduce convergence is directly proportional to the spectral gap of the underlying machine graph, which allows practitioners to adjust the network structure based on available network bandwidth.

To reduce synchronization costs, we propose a fine-grained NOTIFY-ACK based synchronization that provides a performance improvement over barrier based methods. NOTIFY-ACK allows independent worker threads (such as those in stochastic reduce) to run asynchronously instead of blocking on a coarse-grained global barrier at every iteration. Additionally, NOTIFY-ACK provides stronger consistency than simply using a barrier to implement synchronous data-parallel processing.

ASAP is not a programming model (like map-reduce [26]), nor is it limited to a set of useful mechanisms. It introduces semantics for approximate and asynchronous execution, which are often missing in the current flurry of distributed machine learning systems that use asynchrony and staleness to trade off input processing throughput against output accuracy. The contributions of this paper are as follows:

• We present ASAP, an asynchronous approximate computation model for large-scale data-parallel applications. We present an interdisciplinary approach where we combine algorithm design with the hardware properties of low-latency networks to design a distributed data-parallel system. We use stochastic reduce for approximate semantics with fine-grained synchronization based on NOTIFY-ACK to allow independent threads to run asynchronously.

• With stochastic reduce, we present a spectral gap measure that allows developers to reason about why some node communication graphs may converge faster than others and may be a better choice for connecting machines with stochastic reduce. This allows developers to compare the flurry of recent papers that propose different network topologies for gradient propagation, such as ring, tree, etc.

• We apply ASAP to build a distributed learning framework over RDMA. In our results, we show that our system can achieve strong consistency and provable convergence, and provides 2-10X speedups in convergence and up to 10X savings in network costs.

2. Background

Large-scale problems such as training image classification models, page-rank computation and matrix factorization operate on large amounts of data. As a result, many stochastic algorithms have been proposed that make these problems tractable for large data by iteratively approximating the solution over small batches of data. For example, to scale better, matrix factorization methods have moved from direct and exact factorization methods such as singular value decomposition to iterative and approximate factorization using gradient descent style algorithms [29]. Hence, many algorithms that discover relationships within data have been re-written in the form of distributed optimization problems that iterate over the input data and approximate the solution. In this paper, we focus on how to provide asynchronous and approximate semantics to distributed machine learning applications. These optimizations can be beneficial to end users when they run their distributed machine learning jobs across multiple cloud instances [38], or to the cloud providers themselves when training a model for vision or speech recognition [19, 34, 35].

Figure 1: The machine learning training process. Inputs x1, x2, ..., xn and expected outputs y1, y2, ..., yn feed a learning function with parameters w1, w2, ..., wn and a cost function producing outputs e1, e2, ..., en; the goal is to minimize the sum of Cost(e_i) for i = 1..n.

2.1. Distributed Machine Learning

Machine learning algorithms process data to build a training model that can generalize over new data. The training output model, or the parameter vector (represented by w), is computed to perform future predictions over new data. To train over large data, ML methods often use the Stochastic Gradient Descent (SGD) algorithm, which trains over a single example (or a batch of examples) at a time. The SGD algorithm processes data examples to compute the gradient of a loss function. The parameter vector is then updated based on this gradient value after processing each training data example. After a number of iterations, the parameter vector, or the model, converges towards acceptable error values over test data.
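As an illustration of this training loop, the following is a minimal Python sketch (the function and variable names are ours, not the paper's system); the loss gradient is a placeholder for whatever loss the application uses:

import numpy as np

def sgd(data, dim, loss_grad, lr=0.01, epochs=10):
    """Serial SGD: update the parameter vector w after every training example."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            w -= lr * loss_grad(w, x, y)     # step along the per-example loss gradient
    return w

# Example loss gradient: hinge loss (linear SVM) sub-gradient.
def hinge_grad(w, x, y):
    return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(w)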

Figure 1 shows the ML training process. To scale out the computation over multiple machines, the SGD algorithm can be distributed over a cluster using data parallelism, by splitting the input (x1, x2, ..., xn), or model parallelism, by splitting the model (w1, w2, ..., wn). The goal of parallelization is not just to improve throughput but also to maintain low error rates (e1, e2, ..., en). In data-parallel learning using BSP, the parallel model replicas train over different machines. After a fixed number of iterations, these machines synchronize the parameter models that have been trained over the partitioned data with one another using a reduce operation. For example, each machine may average all incoming models with its own model and proceed to train over more data. In the BSP model, a global barrier ensures that the models train and synchronize intermediate updates at the same speed. Hence, distributed data-parallel ML suffers from additional synchronization and communication costs over a single thread.
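To make the BSP reduce step concrete, the sketch below distributes the same loop with mpi4py and model averaging (an assumption for illustration; the paper's own system is C++ over MPI/RDMA). The sizes and local_shard are placeholders for this worker's data partition:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()
dim, num_iters = 1000, 50                     # illustrative sizes
local_shard = []                              # assumed: this worker's list of (x, y) examples

w = np.zeros(dim)                             # local model replica
for _ in range(num_iters):
    for x, y in local_shard:                  # local SGD pass (hinge-loss step as above)
        if y * np.dot(w, x) < 1.0:
            w += 0.01 * y * x
    w_sum = np.empty_like(w)
    comm.Allreduce(w, w_sum, op=MPI.SUM)      # implicit barrier + reduce over all replicas
    w = w_sum / n_workers                     # model averaging (the BSP reduce step)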

Figure 2 shows the interconnect speeds for single-computer buses, Ethernet and infiniBand. Modern buses can send data at higher throughputs with increasingly lower latencies. As a result, software costs for synchronization, such as in a reduce operation, have begun to contribute significantly towards overall job times in distributed machine learning and other iterative data-parallel tasks. The reduce operation requires communicating updates to all other machines when training using the BSP or the map-reduce model. However, since these algorithms are iterative-convergent and can tolerate errors in the synchronization step, there has been recent work on communicating stale intermediate parameter updates and exchanging parameters with little or no synchronization [11, 18, 56].

Figure 2: Decreasing latency and increasing local and network bandwidths (in GB/s, 1987-2017) over the years, for PC buses (MCA, VESA Local Bus, PCI, AGP 4x, PCI Express 1.0/2.0/3.0 x16 links, NVLink), Ethernet (Gigabit, 10G, 40G, 100G) and infiniBand (IB DDR/QDR/FDR/EDR 4x). The latency measurements are for 64B packets.

Past research has found that simply removing the barrier may speed up the throughput of the system [15, 64]. However, this may not always improve the convergence speed and may even converge the system to an incorrect final value [47]. Since the workers do not synchronize and communicate model parameters at different speeds, they process the data examples at a higher throughput. However, since the different workers train at different speeds, the global model skews towards the workers that are able to process and communicate their models faster. Similarly, using bounds for asynchrony may appear to work for some workloads, but determining these bounds can be empirically difficult, and for some datasets these bounds may perform no better than synchronous training. Furthermore, if a global single model is maintained and updated without locks (as in [56]), global convergence may only be possible if the parameter vector is sparse. Finally, maintaining a global single model in a distributed setting results in significant wasted communication, since many useful parameter updates are overwritten [15, 51].

The distributed parameter-server architecture limits network traffic by maintaining a central master [6, 25, 43]. Here, the server coordinates parameter consistency among all the worker machines by resetting the workers' models after every iteration and ensures global consensus on the final model. Hence, a single server communicates with a large number of workers, which may result in network congestion at the edges; this can be mitigated using a distributed parameter server [43]. However, the parameter server suffers from similar synchronization issues as BSP style systems: a synchronous server may spend a significant amount of time at the barrier, while an asynchronous server may reduce with only a few workers' models and produce inconsistent intermediate outputs, which can slow down convergence. Hence, the parameter server architecture can benefit from fine-grained synchronization mechanisms that have low overheads.

To provide asynchronous and approximate processing semantics with consistency and convergence guarantees, we introduce ASAP, which provides approximate processing by synchronizing each worker with a subset of workers at each iteration. Additionally, ASAP provides fine-grained synchronization that improves convergence behavior and reduces synchronization overheads over a barrier. We describe both these techniques next.

3. Stochastic reduce for approximate processing

In this section, we describe how data-parallel applications can use stochastic reduce to mitigate network and processing times for iterative ML algorithms. More importantly, we prove that the convergence speeds of these algorithms depend on the spectral-gap values of the underlying node communication graph of the cluster.

In distributed ML, parallel workers train on input data over model replicas. They synchronize with one another after a few iterations, perform a reduce over the intermediate model updates, and continue to train. This synchronization step is referred to as the all-reduce step. In the parameter server model, this synchronization is performed at a single or distributed master [25, 44]. To mitigate the reduce overheads, efficient all-reduce has been explored in the map-reduce context, where nodes perform partial aggregation in a tree style to reduce network costs [17, 12, 66]. However, these methods decrease the network and processing costs at the cost of increasing the latency of the reduce operation, proportional to the height of the tree.

The network of worker nodes, i.e. the node communication graph of the cluster, determines how rapidly the intermediate model parameters are propagated to all other machines, and also determines the associated network and processing costs. This network information diffusion, or connectedness, should be high for faster convergence but imposes higher network costs. For example, all-reduce and the parameter server represent different types of communication graphs that describe how the workers communicate the intermediate results, as shown in Figure 3. In the all-reduce architecture, all machines exchange parameters, while in a parameter server architecture a central node coordinates the parameter update. Intuitively, when the workers communicate with all machines at every reduce step, the network is densely connected and convergence is rapid, since all the machines get the latest intermediate updates. However, if the network of nodes is sparsely connected, convergence may be slow due to stale, indirect updates being exchanged between machines. At the same time, sparse connectivity brings savings in network and CPU costs (fewer updates to process at each node) that can result in an overall speedup in job completion times. Furthermore, if there is heterogeneity in communication bandwidths between the workers (or between the workers and the master, if a master is used), many workers may end up waiting. As an example, if one is training using GPUs over a cluster, GPUs within one machine can synchronize at far lower costs over the PCI bus than over the network. Hence, frequent reduce across interconnects with varying latency may introduce a large synchronization cost for all workers. We therefore propose using sparse, or stochastic, reduce based on sparse node graphs with strong connectivity properties.

The goal of stochastic reduce is to improve performance by reducing network and processing costs using sparse reduce graphs. Recent work has shown that every dense graph can be reduced to a sparse graph with fewer edges [9] and similar network information diffusion properties. This is a significant result, since it implies that stochastic reduce can be applied to save network costs for almost any network topology. Expander graphs, which are sparse graphs with strong connectivity properties, have been explored in the context of data centers and distributed communities to communicate data with low overheads [60, 61]. An expander graph has fixed out-degrees as the number of vertices increases while maintaining approximately the same connectivity between the vertices. Hence, using directed expander graphs for stochastic reduce provides approximately the same convergence as all-reduce while keeping network costs low as the number of nodes increases. Directed graphs can use the one-sided properties of RDMA networks that allow for fast, asynchronous communication. Furthermore, we use low-degree graphs for lower overall communication costs. Both these properties ensure better scalability for peer-to-peer communication fabrics like RDMA over Infiniband.

To measure the convergence of algorithms that use stochastic reduce, i.e. to compare the sparsity of the adjacency graph of communication, we calculate the spectral gap of the adjacency matrix of the network of workers. The spectral gap is the difference between the two largest singular values of the adjacency matrix normalized by the in-degree of every node. The spectral gap of a communication graph determines how rapidly a distributed, iterative sparse reduce converges when performing distributed optimization over a network of nodes represented by this graph. For faster convergence, this value should be as high as possible. Hence, densely-connected graphs have a high spectral gap value and converge faster, but can have high communication costs. Conversely, if the graph is disconnected, the spectral gap value is zero, and with partitioned data, the model may never converge to a correct value. Hence, there is a research opportunity in designing networks that have a high spectral gap but low communication costs by using low-degree nodes.
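As a concrete reference for this definition, the following NumPy sketch (our own helper, not the paper's code) computes the spectral gap of a worker communication graph from its adjacency matrix:

import numpy as np

def spectral_gap(adj):
    """adj[i, j] = 1 if node i receives updates from node j (self-loops included)."""
    A = np.asarray(adj, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)      # transition matrix: rows normalized by in-degree
    s = np.linalg.svd(P, compute_uv=False)    # singular values in descending order
    return s[0] - s[1]                        # sigma_1 - sigma_2, with sigma_1 = 1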

Past work using partial-reduce for optimization problems has explored fixed communication graphs such as butterfly or logarithmic sequences [11, 42]. These communication sequences provide fixed network costs but may not generalize to networks with complex topologies or networks with dissimilar bandwidth edges, such as distributed GPU environments. More importantly, ASAP introduces the ability to reason about convergence using the spectral gap of a network graph, so developers can reason about why some node graphs have stronger convergence properties. Finally, existing approaches use a global barrier after each iteration, incurring extra synchronization overheads that can be reduced using ASAP's fine-grained synchronization described in Section 4. In the next subsection, we prove that this convergence can be compared using the spectral gap of the node graphs.

3.1. Stochastic reduce convergence analysis

In this section, we analyze the conditions for convergence of stochastic reduce for any distributed optimization problem that achieves consensus by communicating information between nodes. We show that the rate of convergence for a set of nodes depends on the spectral gap of the transition matrix of the nodes. The transition matrix represents the probability of transition of gradient updates from one node to another and is the adjacency matrix scaled by the in-degree of each node. Mathematically, the optimization problem is defined on a connected directed network and solved by n nodes collectively:

\min_{\boldsymbol{x} \in \mathcal{X} \subseteq \mathbb{R}^d} \; \bar{f}(\boldsymbol{x}) := \sum_{i=1}^{n} f_i(\boldsymbol{x}). \qquad (3.1)

The feasible set X is a closed and convex set in R^d and is known by all nodes, whereas f_i : X → R is a convex function known by node i. We also assume that f_i is L-Lipschitz continuous over X with respect to the Euclidean norm ‖·‖. The network G = (N, E), with the node set N = [n] := {1, 2, ..., n} and the edge set E ⊆ N × N, specifies the topological structure of how information is spread amongst the nodes through local node interactions over time. Each node i can only send and retrieve information as defined by the node communication graph N(i) := { j | (j, i) ∈ E } and itself.

Our goal is to measure the overall convergence time (also known as the mixing time in the Markov chain literature) based on a specific network topology. We specifically use directed graphs, since undirected graphs, such as those used in parameter servers [25, 43] and exchange-based peer-to-peer protocols [65], require synchronization. Furthermore, directed-graph style communication improves performance in the case of RDMA networks using one-sided communication.

In this algorithm, each node i keeps a local estimate x^i and a model variable w^i to maintain an accumulated sub-gradient. At iteration t, to update w^i, each node collects the model update values of its neighbors, i.e. the gradient values denoted by ∇f_i(w^i_t), and forms a convex combination with equal weight of the received information. The learning rate is denoted by η_t. Hence, the updates received by each machine can be expressed as:

\boldsymbol{w}^i_{t+1/2} \leftarrow \boldsymbol{w}^i_t - \eta_t \nabla f_i(\boldsymbol{w}^i_t) \qquad (3.2)

\boldsymbol{w}^i_{t+1} \leftarrow \frac{1}{|\mathcal{N}^i_{\mathrm{in}}|} \sum_{j \in \mathcal{N}^i_{\mathrm{in}}} \boldsymbol{w}^j_{t+1/2}

In order to make the above algorithm converge correctly, we need a network over which each node has the same influence. To understand and quantify this requirement, we denote the adjacency matrix as A, i.e. A_ij = 1 if (j, i) ∈ E and 0 otherwise, and denote P as the transition matrix obtained by scaling the i-th row of A by the in-degree of node i, i.e. P = diag(d_in)^{-1} A, where d_in ∈ R^k and d_in(i) equals the in-degree of node i. For ease of illustration, we assume that the degree d = 1 and the initial w^i_0 = 0. We denote the sub-gradient and model updates over time as g_t = (g^1_t, g^2_t, ..., g^k_t) and w_t = (w^1_t, w^2_t, ..., w^k_t)^T. Then the updates can be expressed as:

\boldsymbol{w}_1 = -\eta_0 \boldsymbol{P} \boldsymbol{g}_0

\boldsymbol{w}_2 = -\eta_0 \boldsymbol{P}^2 \boldsymbol{g}_0 - \eta_1 \boldsymbol{P} \boldsymbol{g}_1

\vdots

\boldsymbol{w}_t = -\eta_0 \boldsymbol{P}^t \boldsymbol{g}_0 - \eta_1 \boldsymbol{P}^{t-1} \boldsymbol{g}_1 - \cdots - \eta_{t-1} \boldsymbol{P} \boldsymbol{g}_{t-1}
                 = -\sum_{k=0}^{t-1} \eta_k \boldsymbol{P}^{t-k} \boldsymbol{g}_k \qquad (3.3)

It can be easily verified that P^∞ := lim_{t→∞} P^t = 1π^T, where π is a probability distribution (known as the stationary distribution). Thus, π_i represents the influence that node i has in the network. Therefore, to treat each node fairly, π_i = 1/k is desired, which is equivalent to requiring that the row sums and column sums of the transition matrix P all equal one, i.e. P is a doubly stochastic matrix. Hence, in the context of our network setting, we need a network whose nodes have the same in-degree. Intuitively, this means that as long as all nodes are connected and each node has the same influence, i.e. sends its gradients to an equal number of neighbors, the overall problem will converge correctly, even though sparsely connected graphs may require a large number of iterations. Indeed, when f_i is convex, the convergence results have been established under this assumption [27, 49].
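As a small numerical check of this condition (a sketch under our own assumption of a 6-node directed ring with self-loops, so every node has in-degree 2), the transition matrix is doubly stochastic and P^t approaches the uniform matrix (1/k) 1 1^T:

import numpy as np

k = 6
A = np.eye(k)                                  # self-loops
for i in range(k):
    A[(i + 1) % k, i] = 1                      # node i+1 also receives from node i
P = A / A.sum(axis=1, keepdims=True)           # rows sum to 1 by construction

print(np.allclose(P.sum(axis=0), 1.0))         # columns also sum to 1: doubly stochastic
Pt = np.linalg.matrix_power(P, 200)
print(np.allclose(Pt, np.full((k, k), 1.0 / k), atol=1e-3))  # P^t -> (1/k) 1 1^T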

Besides being regular, the network N(i) should also be constructed in a way such that information can be effectively spread out over the entire network, which is closely related to the concept of spectral gap. The transition matrix P is defined as A/d, where A is the adjacency matrix (including self-loops) and d is the in-degree (including self-loops). We denote σ_1(P) ≥ σ_2(P) ≥ ··· ≥ σ_k(P) ≥ 0, where σ_i(P) is the i-th largest singular value of P; clearly, σ_1(P) = 1. The spectral gap of the network is σ_1(P) − σ_2(P) = 1 − σ_2(P), where σ_2(P) is the second largest singular value of P. From expression (3.3), we see that the speed of convergence depends on how fast P^t converges to (1/k) 1 1^T, and based on the Perron-Frobenius theory [57], we have

\left\| \boldsymbol{P}^t \boldsymbol{x} - \frac{1}{k} \boldsymbol{1} \right\|_2 \leq \sigma_2(\boldsymbol{P})^t,

for any x in the k-dimensional probability simplex. Therefore, a network with a large spectral gap, 1 − σ_2(P), is greatly desired. Hence, the spectral gap represents how rapidly the mixing time, or the variance between the models at different nodes, decreases. It should be as high as possible for a given network budget. A spectral gap of 1 indicates all nodes are inter-connected, providing the fastest convergence but possibly high network costs. A spectral gap of 0 indicates a disconnected or partitioned network graph. We now compare spectral gap values for commonly used distributed learning architectures and construct a low-degree, high-spectral-gap node communication graph.

Figure 3: (a) all-reduce, (b) parameter server, (c) expander graph, (d) chain graph. Figure (a) shows all-reduce, with Spectral Gap (SG) for 6/25 nodes SG-6: 1.00, SG-25: 1.00. Figure (b) shows the parameter server, SG-6: 0.75, SG-25: 0.68. Figure (c) shows an expander graph with SG-6: 0.38, SG-25: 0.2. Figure (d) shows a chain graph with SG-6: 0.1, SG-25: 0.002. To the best of our knowledge, our work is the first to quantify convergence rates of commonly used distributed learning node communication graphs.

3.2. Stochastic reduce using expander graphs

Figure 3 shows six nodes connected using four distributed ML training architectures: all-reduce, the parameter server (non-distributed), an expander graph with a fixed out-degree of two, and a chain-like architecture, along with their respective spectral-gap values for 6 and 25 nodes.

As expected, architectures that contain nodes with more edges (higher in-degree) have a higher spectral gap. Figure 3(a) shows all-reduce, where all machines communicate with one another and may incur significant network costs. Figure 3(b) shows the parameter server, which has a reasonably high spectral gap, but using a single master with a high fanout requires considerable network bandwidth and Paxos-style reliability for the master. Figure 3(c) shows a root expander graph with a fixed out-degree of two: in a network of N total nodes, each node i sends the intermediate output to its neighbor (i+1) (to ensure connectivity) and to the (i+√N)-th node. Such root expander graphs ensure that the updates are spread across the network as N scales, since the root increases with N. Finally, Figure 3(d) shows a chain-like graph, where the nodes are connected in a chain; the intermediate parameter updates from node i may spread to i+1 in a single time step but require N time steps to reach the last node in the cluster, and the graph has low spectral gap values. In the rest of the paper, we use the root sparse communication graph, as shown in Figure 3(c), with a fixed out-degree of two, to evaluate stochastic reduce.
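The topologies of Figure 3 can be generated and compared with the spectral_gap helper sketched in the previous section. This is a sketch under our own adjacency conventions (self-loops included, and the chain approximated by a ring), so the resulting values may differ slightly from the numbers quoted in the figure caption:

import numpy as np

def all_reduce_graph(n):
    return np.ones((n, n))                     # every node receives from every node

def ring_graph(n):
    A = np.eye(n)
    for i in range(n):
        A[(i + 1) % n, i] = 1                  # approximates the chain of Figure 3(d)
    return A

def root_expander_graph(n):
    A = np.eye(n)
    r = int(round(np.sqrt(n)))
    for i in range(n):
        A[(i + 1) % n, i] = 1                  # neighbor edge keeps the graph connected
        A[(i + r) % n, i] = 1                  # sqrt(N)-th edge diffuses updates quickly
    return A

for name, g in [("all-reduce", all_reduce_graph(25)),
                ("chain/ring", ring_graph(25)),
                ("root expander", root_expander_graph(25))]:
    print(name, round(spectral_gap(g), 3))     # spectral_gap from the earlier sketch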

Figure 4: This figure shows sparse synchronization semantics where workers W1, W2 and W3 synchronize with one another and W3, W4 and W5 synchronize with one another. With a barrier, all workers wait for every other worker and then proceed to the next iteration. (Panels: (a) barrier based synchronization; (b) NOTIFY-ACK based synchronization.)

We find that using sparse communication reduce graphs with an out-degree of just two often provides good convergence and speedup over all-reduce, i.e. spectral gap values high enough for convergence with reasonably low communication costs. Using sparse reduce graphs with stochastic reduce results in faster model training times because: first, the amount of network time is reduced; second, the synchronization time is reduced, since each machine communicates with fewer nodes; and finally, the CPU costs at each node are reduced, since each node processes fewer incoming model updates.

For stochastic reduce to be effective, the following properties are desirable. First, the generated node communication graph should have a high spectral gap. This ensures that the model updates from each machine are diffused across the network rapidly. Second, the node communication graph should have low communication costs; for example, each node in the graph should have a small out-degree. Finally, the graph should be easy to generate, for example from a sequence, to accommodate a variable number of machines or a possible re-configuration in case of a failure. These properties can be used to guide existing data-parallel optimizers [37] to reduce data shuffling costs by constructing sparse reduce graphs that accommodate available network costs and other constraints such as avoiding cross-rack reduce. We now discuss how to reduce the synchronization costs with fine-grained communication.

4. Fine-grained synchronization

Barrier based synchronization is an important and widely used operation for synchronizing parallel ML programs across multiple machines. After executing a barrier operation, a parallel worker waits until all the processes in the system have reached the barrier. Parallel computing libraries like MPI, as well as data-parallel frameworks such as BSP systems and some parameter servers, expose this primitive to developers [14, 33, 45, 43]. Furthermore, ML systems based on map-reduce use the stage barrier between the map and reduce tasks to synchronize intermediate outputs across machines [16, 30].

Figure 5: Fine-grained synchronization in ASAP. The solid lines show the NOTIFY operation and the dotted lines show the corresponding ACK. Workers only wait for intermediate outputs from dependent workers to perform a reduce. After the reduce, workers push more data out when they receive an ACK from the receivers signaling that the sent parameter update has been consumed.

Figure 4 shows a parallel training system on n processes. Each process trains on a subset of data, computes the intermediate model update, issues a send to other machines and then waits on the barrier primitive. When all processes arrive at the barrier, the workers perform a reduce operation over the incoming outputs and continue processing more input.

However, using the barrier as a synchronization point in the code suffers from several problems. First, the BSP protocol described above suffers from mixed-version issues, i.e. in the absence of additional synchronization or serialization on the receive side, a receiver may perform a reduce with partial or torn model updates (or skip them if a consistency check is enforced). This is because just using a barrier gives no information on whether the recipient has finished receiving and consuming the model update. Second, most barrier implementations synchronize with all other processes in the computation. In contrast, with stochastic reduce, finer-grained synchronization primitives are required that block on only the required subset of workers to avoid unnecessary synchronization costs. A global barrier operation is slow, and removing it can reduce synchronization costs, but it makes the workers process inconsistent data, which may slow down the overall time to achieve the final accuracy. Finally, using a barrier can cause network resource spikes if all the processes send their parameters at the same time.

Adding extra barriers before/after the push and reduce does not produce a strongly consistent BSP that incorporates model updates from all replicas, since the actual send operation may be asynchronous and there is no guarantee that the receivers have received these messages when they perform a reduce. Unless a blocking receive is added after every send, consistency is not guaranteed. However, this introduces a significant synchronization overhead.

Hence, to provide efficient coordination among parallel model replicas, we require the following three properties from any synchronization protocol. First, the synchronization should be fine-grained; coarse-grained synchronization such as a barrier imposes high overheads, as discussed above. Second, the synchronization mechanism should provide consistent intermediate outputs; strong consistency methods avoid torn reads and mixed-version parameter vectors, and improve performance [13, 67]. Finally, the synchronization should be efficient; excessive receive-side synchronization for every reduce and send operation can significantly increase blocking times.

In data-parallel systems with barrier based synchronization, there is often no additional explicit synchronization between the sender and receiver when an update arrives. Furthermore, any additional synchronization may reduce performance, especially when using low latency communication hardware such as RDMA that allows one-sided writes without interrupting the receive-side CPU [40]. In the absence of synchronization, a fast sender can overwrite the receive buffers, or the receiver may perform a reduce with fewer senders instead of consuming each worker's output, hurting convergence. This is especially problematic with RDMA memory, since only a finite amount of memory can be pinned on the device.

Naiad, a dataflow based data-parallel system, provides a notify mechanism to inform the receivers about incoming model updates [48]. This ensures that when a node performs a local reduce, it consumes the intermediate outputs from all machines. Hence, a per-receiver notification allows for finer-grained synchronization. However, simply using a notify is not enough, since a fast sender can overwrite the receive queue of the receiver, and a barrier or another form of additional synchronization is required to ensure that the parallel workers process incoming model parameters at the same speed.

To eliminate the barrier overheads for stochastic reduce and to provide strong consistency, we propose using a NOTIFY-ACK based synchronization mechanism that gives stricter guarantees than using a coarse-grained barrier. This can also improve convergence times in some cases since it facilitates using consistent data from dependent workers during the reduce step.

In ASAP, with NOTIFY-ACK, the parallel workers compute and send their model parameters with notifications to other workers. Each worker then waits to receive notifications from all of its senders, as defined by the node communication graph, as shown in Figure 5. The wait operation counts the NOTIFY events and invokes the reduce when a worker has received notifications from all of its senders as described by the node communication graph. Once all notifications have been received, the worker can perform a consistent reduce.

After performing a reduce, the worker sends ACKs, indicating that the intermediate output from the previous iteration has been consumed. Only when a sender receives this ACK for a previous send may it proceed to send the data for the next iteration. Unlike barrier based synchronization, where there is no guarantee that a receiver has consumed the intermediate outputs from all senders, waiting on ACKs from receivers ensures that a sender never floods the receive-side queue and avoids any mixed-version issues from overlapping intermediate outputs. Furthermore, fine-grained synchronization allows an efficient implementation of stochastic reduce, since each sender is only blocked by dependent workers and other workers may run asynchronously.
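The following is a minimal single-process simulation of this protocol (a sketch; the class and queue names are ours, and Python queues stand in for the per-sender receive buffers that the real system fills with one-sided RDMA writes):

import queue
import threading
import numpy as np

class Worker(threading.Thread):
    """Simulated NOTIFY-ACK worker: send + NOTIFY, wait for NOTIFYs, reduce, ACK."""

    def __init__(self, wid, senders, receivers, notify_q, ack_q, num_iters, dim):
        super().__init__()
        self.wid, self.senders, self.receivers = wid, senders, receivers
        self.notify_q, self.ack_q = notify_q, ack_q   # dicts keyed by edge (src, dst)
        self.num_iters = num_iters
        self.w = np.zeros(dim)

    def local_step(self):
        return self.w - 0.01 * np.random.randn(self.w.size)   # stand-in for an SGD step

    def run(self):
        for _ in range(self.num_iters):
            update = self.local_step()
            for r in self.receivers:                        # 1. send update + NOTIFY
                self.notify_q[(self.wid, r)].put(update)
            received = [self.notify_q[(s, self.wid)].get()  # 2. wait for NOTIFY from senders
                        for s in self.senders]
            self.w = np.mean(received + [update], axis=0)   # 3. consistent reduce
            for s in self.senders:                          # 4. ACK: sender's output consumed
                self.ack_q[(s, self.wid)].put(True)
            for r in self.receivers:                        # 5. wait for ACKs before next send
                self.ack_q[(self.wid, r)].get()

# Illustrative wiring for a 3-node ring W0 -> W1 -> W2 -> W0.
edges = [(0, 1), (1, 2), (2, 0)]
notify_q = {e: queue.Queue() for e in edges}
ack_q = {e: queue.Queue() for e in edges}
workers = [Worker(i,
                  senders=[s for s, d in edges if d == i],
                  receivers=[d for s, d in edges if s == i],
                  notify_q=notify_q, ack_q=ack_q, num_iters=5, dim=4)
           for i in range(3)]
for wk in workers:
    wk.start()
for wk in workers:
    wk.join()

Note that every worker sends all of its outputs before blocking on its inputs, which is exactly the ordering the paper requires to avoid deadlock on cyclic communication graphs.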

NOTIFY-ACK requires no additional receive-side synchronization, making it ideal for direct-memory access style protocols such as RDMA or GPU Direct [3]. However, NOTIFY-ACK requires ordering guarantees from the underlying implementation to ensure that a NOTIFY arrives after the actual data. Furthermore, in a NOTIFY-ACK based implementation, the framework should ensure that the workers send their intermediate updates and then wait on their reduce inputs, to avoid any deadlock from cyclic node communication graphs.

5. Implementation

We develop our second-generation distributed ML framework using the ASAP model and incorporate stochastic reduce and fine-grained synchronization. We implement distributed data-parallel model averaging over stochastic gradient descent (SGD). We implement our reference framework with stochastic reduce and NOTIFY-ACK support in C++ and provide Lua bindings to run our Lua based deep learning networks [20]. Compared to our first-generation system [8], we use zero copy for Lua tensors and use NOTIFY-ACK instead of BSP for sparse graphs. Furthermore, we use low-degree sparse graphs to connect machine nodes rather than fixed-degree fully-connected graphs.

For distributed communication, we use MPI and create model parameters in distributed shared memory. In our implementation, the parallel model replicas create a model vector in the shared memory and train on a portion of the dataset using the SGD algorithm. The model replicas process the partitioned dataset to compute and communicate the gradient and perform a reduce periodically.

We use the infiniBand transport, and each worker directly writes the intermediate model to its receivers without interrupting the receive-side CPU, using one-sided RDMA operations. To reduce synchronization overheads, each machine maintains a fixed-size per-sender receive queue to receive the model updates from other machines [55]. Hence, after computing the gradient, parallel replicas write their updates to this queue using one-sided RDMA writes. Each replica performs a reduce of its model and all the models in the queues following NOTIFY-ACK semantics. Hence, our system only performs RDMA writes, which have half the latency of RDMA reads. The queues and the shared memory communication between the model replicas are created based on a node communication graph provided as an input when launching a job. After the reduce operation, each machine sends out the model updates to other machines' queues as defined by the communication graph. Our system can perform reduce over any user-provided node communication graph, allowing us to evaluate stochastic reduce for different node communication graphs.
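A minimal sketch of the per-sender receive queues and the local reduce over them (our own Python stand-in for illustration; in the real system each slot is an RDMA-registered buffer filled by a one-sided write):

from collections import deque
import numpy as np

class ReceiveQueues:
    """One fixed-depth queue per sender. The NOTIFY-ACK protocol guarantees a sender
    never has more than `depth` unconsumed updates in flight, so a slot is never
    overwritten before it is reduced."""
    def __init__(self, senders, depth=4):
        self.q = {s: deque() for s in senders}
        self.depth = depth

    def rdma_write(self, sender, model):       # stand-in for a one-sided RDMA write
        assert len(self.q[sender]) < self.depth
        self.q[sender].append(np.copy(model))

    def reduce_with(self, local_model):
        """Average the local model with the oldest pending update from every sender."""
        pending = [self.q[s].popleft() for s in self.q]   # NOTIFY ensures these exist
        return np.mean(pending + [local_model], axis=0)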

Figure 6: Figure (a) compares the speedup of a root expander graph and an all-reduce graph for 4 workers over a single machine SGD. Each machine in the root expander graph transmits 56 MB of data while all-reduce transmits 84 MB of data to converge. Figure (b) shows the convergence of a root expander graph and an all-reduce graph for the splice-site dataset over 8 workers. Each machine in the root graph transmits 219 GB of data while all-reduce transmits 2.08 TB/machine to reach the desired accuracy. Figure (c) shows the convergence of a chain graph, a root expander graph and an all-reduce implementation for the webspam dataset over 25 workers.

Furthermore, we also implement synchronous, asynchronous and NOTIFY-ACK based synchronization. We implement synchronous (BSP) training by using the barrier primitive. We use low-level distributed wait and notify primitives to implement NOTIFY-ACK. We maintain ACK counts at each node and send all outputs before waiting for ACKs across iterations to avoid deadlocks. We use separate infiniBand queues for transmitting short messages (ACKs and other control messages) and large model updates (usually a fixed size for a specific dataset). For Ethernet based implementations, separate TCP flows can be used to reduce the latency of control messages [50].

Fault Tolerance: We provide a straightforward model for fault tolerance by simply check-pointing the trained model periodically to disk. Additionally, for synchronous methods, we implement a dataset-specific dynamic timeout which is a function of the time taken for the reduce operation.

6. Evaluation

We evaluate (i) What is the benefit of stochastic reduce? Do networks with a higher spectral gap exhibit better convergence? and (ii) What is the speedup of using fine-grained synchronization over a barrier? Is it consistent?

We evaluate the ASAP model for applications that are commonly used today, including text classification, spam classification, image classification and genome detection. Our baselines for evaluation are our BSP/synchronous and asynchronous all-reduce implementations, of the kind provided by many other distributed big data or ML systems [1, 6, 24, 28, 33, 36, 45, 55]. We use an efficient infiniBand implementation stack that improves performance for all methods. However, stochastic reduce and NOTIFY-ACK can be implemented and evaluated over any existing distributed learning platform such as GraphLab or TensorFlow.

We run our experiments on eight Intel Xeon 8-core, 2.2 GHz Ivy-Bridge processors with 64 GB DDR3 DRAM, all connected via Mellanox Connect-V3 56 Gbps infiniBand cards. Our network achieves a peak throughput of 40 Gbps after accounting for the bit-encoding overhead for reliable transmission. All machines load the input data from a shared NFS partition. We sometimes run multiple processes on each machine, especially for models with fewer than 1M parameters, where a single model replica is unable to saturate the network and CPU. The reported times do not account for the initial one-time cost of loading the datasets in memory. All times are reported in seconds.

We evaluate two ML methods: (a) SVM: we test ASAP on distributed SVM based on Bottou's SVM-SGD [10]. Each machine computes the model parameters and communicates them to other machines as described in the machine communication graph. We train SVM over the RCV1 dataset (document classification), the webspam dataset (web spam detection) and the splice-site dataset (genome classification) [4]; and (b) Convolutional Neural Networks (CNNs): we train CNNs for image classification over the CIFAR-10 dataset [5]. The dataset consists of 50K train and 10K test images, and the goal is to classify an input image into one of 10 classes. We use the VGG network, with 11 layers and 7.5M parameters, to train over 32x32 CIFAR-10 images [59]. We use OMP_PARALLEL_THREADS to parallelize the convolutional operations within a single machine.

6.1. Approximate processing benefits

Speedup with stochastic reduce. We measure the speedup of all applications under test as the time to reach a specific accuracy. We first evaluate a small dataset (RCV1, 700MB, document classification) for the SVM application. The goal here is to demonstrate that, for problems that fit in one machine, our data-parallel system outperforms a single thread for each workload [46]. Figure 6(a) shows the convergence speedup for 4 machines for the RCV1 dataset. We compare the performance of all-reduce against a root graph with a fixed out-degree of two, where each node sends the model updates to two nodes: its neighbor and the (i+√N)-th node. We find that the root expander graph converges marginally faster than the all-reduce primitive, owing to marginally lower network and CPU costs since the number of machines is small.

Figure 6(b) shows the convergence for the SVM application using the splice-site dataset on 8 machines with 8 processes. The splice-site training dataset is about 250GB and does not fit in memory on any one of our machines. This is one of the largest public datasets that cannot be sub-sampled and requires training over the entire dataset to converge correctly [7]. Figure 6(b) compares the two communication graphs: an all-reduce graph and a root expander graph with an out-degree of 2. We see that the expander graph converges faster and requires about 10X lower network bandwidth. Figure 6(c) shows the convergence for the SVM application on 25 processes using the webspam dataset, which consists of 250K examples. We compare three node communication graphs: a root expander graph (with a spectral gap of 0.2 for 25 nodes), a chain-like architecture where each machine maintains a model and communicates its updates in the form of a chain to one other machine, and the all-reduce implementation. The chain node graph architecture has lower network costs but very low spectral gap values (≈ 0.008). Hence, it converges slower than the root expander graph. Both sparse graphs provide a speedup over all-reduce, since the webspam dataset has a large dense parameter vector of about 11M float values. However, the chain graph requires more epochs to train over the dataset and sends 50433 MB/node to converge, while the root graph requires 44351 MB/node, even though the chain graph transmits less data per epoch. Hence, one should avoid sparse reduce graphs with very low spectral gap values and use expander graphs that provide good convergence for a given network bandwidth.

To summarize, we find that stochastic reduce provides significant speedup in convergence (by 2-10X) and reduces the network and CPU costs. However, if the node communication graph is sparse and has low spectral gap values (usually less than 0.01), the convergence can be slow. Hence, stochastic reduce provides a quantifiable measure, using the spectral gap values, to sparsify the node communication graphs. For models where the network capacity and CPU costs can match the model costs, using a network that supports the largest spectral gap is recommended.

6.2. Fine-grained synchronization benefits

We compare the performance of the NOTIFY-ACK, synchronous and asynchronous implementations. We implement the BSP algorithm using a barrier in the training loop, and all data-parallel models perform each iteration concurrently. However, simply using a barrier may not ensure consistency at the receive queue. For example, the models may invoke a barrier after sending the models and then perform a reduce. This does not guarantee that each machine receives all the intermediate outputs during the reduce. Hence, we perform a consistency check on the received intermediate outputs and adjust the number of intermediate models to compute the model average correctly. Each parameter update carries a unique version number in the header and footer, and we verify that the version values in the header and footer are identical before and after reading the data from the input buffers.

Figure 7: This bar graph shows the percentage of consistent reduce operations (x-axis: number of reducers; y-axis: % of reduce operations) with NOTIFY-ACK vs BSP vs ASYNC for 8 machines for the RCV1 dataset. NOTIFY-ACK provides the strongest consistency, allowing 100% of reduce operations to be performed with the other 7 nodes.

For asynchronous processing, we perform no synchronization between the workers once they finish loading data and start training. We check the incoming intermediate model updates for consistency as described above. For the NOTIFY-ACK implementation, we send intermediate models with notifications and wait for the ACKs before performing a reduce. Communication libraries over network interconnects often may not guarantee that the notifications arrive with the data, so we additionally check the incoming model updates and adjust the number of reducers to compute the model average correctly. We first perform a micro-benchmark to measure the consistency of reduce operations under synchronous (SYNC), asynchronous (ASYNC) and NOTIFY-ACK execution.
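The torn-read check described above can be sketched as follows (a hypothetical buffer layout, not the paper's exact format): each update buffer carries the same version stamp in its first and last slot, and a mismatch means the buffer was being overwritten while it was read:

import numpy as np

def write_update(buf, version, update):
    """Sender side: stamp the header, write the payload, then stamp the footer."""
    buf[0] = version
    buf[1:-1] = update
    buf[-1] = version

def read_update(buf):
    """Receiver side: accept the payload only if header and footer versions match."""
    v_head = buf[0]
    payload = np.copy(buf[1:-1])
    v_foot = buf[-1]
    if v_head != v_foot:
        return None                    # torn read: drop this update (or retry later)
    return int(v_head), payload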

Figure 7 shows, for a graph of 8 nodes where each machine has an in-degree of 7, the distribution of correct buffers reduced. NOTIFY-ACK reduces with all 7 inputs and is valid 100% of the time. BSP has substantial torn reads and performs a reduce with 5 or more workers only 77% of the time. ASYNC performs only 39% of its reduce operations with 5 or more workers. Hence, we find that NOTIFY-ACK provides the most consistent data during reduce and the fewest torn buffers, followed by BSP using a barrier, followed by ASYNC.

Figure 8 shows the convergence for the CIFAR-10 dataset for eight machines in an all-reduce setup. We calculate the time to reach 99% training accuracy on the VGG network, which corresponds to approximately 84% test accuracy. We train our network with a mini-batch size of 1, with no data augmentation, and with a network-wide learning rate schedule that decays every epoch. We find that NOTIFY-ACK provides superior convergence over BSP and ASYNC. Even with a dense communication graph, NOTIFY-ACK reduces barrier times and provides stronger consistency and faster convergence. Furthermore, we find that ASYNC initializes slowly and converges more slowly than the synchronous and NOTIFY-ACK methods.

Figure 8: This figure shows the convergence of NOTIFY-ACK, BSP, and ASYNC with eight machines using all-reduce for training CNNs over the CIFAR-10 dataset with a VGG network. Speedups are measured over a single-machine implementation.

We also measure the throughput (examples/second) of the synchronous (BSP), asynchronous, and NOTIFY-ACK methods. NOTIFY-ACK avoids coarse-grained synchronization and achieves an average throughput of 229.3 images/second over eight machines. This includes the time for forward and backward propagation, adjusting the weights, and communicating and averaging the intermediate updates. With BSP, we achieve 217.8 images/second. Finally, even though ASYNC achieves the highest throughput at 246 images/second, Figure 8 shows that its actual convergence is poor. Hence, to understand the benefits of approaches like relaxed consistency, one must consider the speedup toward a (good) final accuracy. ASYNC may converge fastest when the dataset is highly redundant; it can also provide a speedup when the model updates are sparse, since the reduce operation then becomes commutative and conflicting updates are reduced.

Figure 9 shows the convergence for the SVM application using the webspam dataset on 25 processes. We use the root expander graph described earlier. We find that fine-grained NOTIFY-ACK improves convergence and is about 3X faster than BSP for the webspam dataset. Furthermore, the asynchronous algorithm does not converge to the correct value even though it operates at a much higher throughput. NOTIFY-ACK performs well for two reasons. First, NOTIFY-ACK provides stronger consistency than BSP implemented using a barrier: in the absence of additional, expensive synchronization with each sender, the model replicas may reduce with fewer incoming model updates, whereas with NOTIFY-ACK each worker waits for the NOTIFYs before performing the reduce and sends out additional data only after receiving the ACK messages. Second, for approximate processing, i.e. when the communication graph is sparse, a barrier blocks all parallel workers, while with fine-grained NOTIFY-ACK communication independent workers run asynchronously with respect to one another.
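The per-worker control flow implied by this discussion can be sketched as follows. This is a simplified, single-loop illustration; the transport interface, message tags, and equal-weight averaging are assumptions for the sketch rather than the system's actual API.

```python
import numpy as np

def worker_loop(rank, model, train_one_epoch, transport, in_nbrs, out_nbrs, epochs):
    """Per-worker training loop with fine-grained NOTIFY-ACK synchronization.

    `transport` is assumed to expose non-blocking send(dst, tag, payload) and
    blocking recv(src, tag); train_one_epoch performs the local training pass.
    """
    for epoch in range(epochs):
        model = train_one_epoch(model)

        # Send the local model to each out-neighbor, followed by a NOTIFY.
        for dst in out_nbrs:
            transport.send(dst, "MODEL", model)
            transport.send(dst, "NOTIFY", epoch)

        # Reduce only after a NOTIFY (and its model) arrives from every
        # in-neighbor; ACK each sender so it may reuse its send buffer.
        incoming = [model]
        for src in in_nbrs:
            transport.recv(src, "NOTIFY")
            incoming.append(transport.recv(src, "MODEL"))
            transport.send(src, "ACK", epoch)
        model = np.mean(incoming, axis=0)

        # Do not send the next update until every out-neighbor has ACKed the
        # previous one; workers that share no edge proceed independently.
        for dst in out_nbrs:
            transport.recv(dst, "ACK")
    return model
```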

Figure 9: This figure shows the convergence of NOTIFY-ACK vs. BSP and ASYNC for the root expander graph over the webspam dataset. The asynchronous implementation does not converge (DNC) to the final value.


Figure 10: This figure compares the spectral gap values for the expander graph (degree 2), chain graph (degree 1), and MALT (degree 2). Higher spectral gap values indicate faster convergence.

To summarize, we find that NOTIFY-ACK eliminates torn buffers and provides faster convergence than existing consistency mechanisms for dense as well as sparse node graphs. We describe our results for the BSP/all-reduce model; however, these results can also be extended to bi-directional communication architectures such as the parameter server or the butterfly architecture.

Comparison of the expander graph with other architectures: Finally, we compare the convergence of expander graphs with a chain node communication graph and the MALT Halton graph [42] in Figure 10. We compute the spectral gaps for degree 2 for MALT and ASAP, as well as for a chain (or ring) graph. The spectral gap indicates how fast these node communication graphs converge as the number of nodes (on the x-axis) increases. We find that ASAP converges faster than MALT for most network graphs and faster than the chain for every network graph. The ASAP expander graph of N nodes sends its update to the next neighbor and to the node root(N) away. The MALT graph follows the Halton series to send its updates to another two nodes as computed by this series [2]. In some cases, this may partition the nodes into two groups, in which case the spectral gap falls to zero, resulting in no convergence.
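The spectral gap of a candidate node communication graph can be computed directly: build the mixing matrix in which each node averages itself and its out-neighbors with equal weight, and take one minus the second-largest singular value. The sketch below follows this recipe for the root expander and chain graphs; the equal-weight assumption is ours, so the absolute values may differ slightly from Figure 10. For a 25-node chain, this recipe gives a gap of about 0.008, consistent with the value quoted earlier.

```python
import numpy as np

def mixing_matrix(num_nodes, out_neighbors):
    """Column i spreads node i's update equally over itself and out_neighbors(i)."""
    W = np.zeros((num_nodes, num_nodes))
    for i in range(num_nodes):
        targets = [i] + list(out_neighbors(i, num_nodes))
        for j in targets:
            W[j, i] += 1.0 / len(targets)
    return W

def spectral_gap(W):
    """One minus the second-largest singular value of the mixing matrix."""
    s = np.linalg.svd(W, compute_uv=False)
    return 1.0 - s[1]

def root_expander(i, n):   # send to the next neighbor and the node root(n) away
    return [(i + 1) % n, (i + int(round(n ** 0.5))) % n]

def chain(i, n):           # send to the next node only
    return [(i + 1) % n]

for n in (9, 16, 25, 36, 49):
    print(n,
          spectral_gap(mixing_matrix(n, root_expander)),
          spectral_gap(mixing_matrix(n, chain)))
```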


7. Related Work

Batch systems: The original map-reduce uses a stage barrier, i.e., all mapper tasks synchronize intermediate outputs over a distributed file system [26]. This provides synchronization and fault tolerance for the intermediate job state but can hurt performance due to frequent disk I/O. Spark [68] implements the map-reduce model in memory using copy-on-write data structures (RDDs). RDDs provide fault tolerance and enforce determinism, preventing applications from running asynchronously. ASIP [32] adds asynchrony support in Spark by decoupling fault tolerance from coarse-grained synchronization. However, ASIP finds that asynchronous execution may lead to incorrect convergence, and presents the case for running asynchronous machine learning jobs using second-order methods, which provide stronger guarantees but can be extremely CPU intensive [54]. Finally, there are many existing general-purpose dataflow-based systems that use barriers or block until all inputs arrive [21, 39, 48, 52]. ASAP uses fine-grained synchronization with partial reduce to mitigate communication and synchronization overheads. We use application-level checkpoints for fault tolerance of intermediate state.

Approximate Processing: Recent work on partial aggregation uses an efficient tree-style all-reduce primitive to mitigate communication costs in batch systems by combining results at various levels, such as machine, rack, etc. [12, 66]. However, the reduce operation suffers additional latency proportional to the height of the tree. Furthermore, when partial aggregation is used with iterative-convergent algorithms, the workers wait for a significant aggregated latency, which can be undesirable. Other work on partial aggregation produces variable-accuracy intermediate outputs under different computational budgets [41, 58]. Other methods to reduce network costs include lossy compression [6] and KKT filters [43], which compute the benefit of updates before sending them over the network. These methods can be applied together with stochastic reduce, although, unlike stochastic reduce, they may incur additional CPU costs.

Asynchronous Processing: Past work has explored removing barriers in Hadoop to start reduce operations as soon as some of the mappers finish execution [31, 63]. HogWild [56] uses a single shared parameter vector and allows parallel threads to update model parameters without locks, thrashing one another's updates. However, HogWild may not converge to a correct final value if the parameter vector is dense and the updates from different threads overlap frequently. Project Adam [15], DogWild [51], and other systems that use HogWild in a distributed setting often cause wasteful communication, especially when communicating dense parameter updates. Several systems propose removing barriers, which may provide higher throughput but may lead to slower or incorrect convergence towards the final optimization goal. To overcome this problem, bounded staleness [22] provides asynchrony within bounds for the parameter server, i.e., the fast-running threads wait for the stragglers to catch up. However, determining these bounds empirically can be difficult, and in some cases they may be no more relaxed than synchronous execution. ASAP instead uses fine-grained synchrony, which reduces synchronization overhead while providing strong consistency.

ML frameworks: Parameter servers [25, 43] provide a master-worker style communication framework. Workers compute parameter updates and send them to one or more central servers; the parameter servers compute and update the global model and send it back to the workers, which continue to train on new data over the updated model. In contrast, all-reduce based systems may operate fully asynchronously since, unlike the parameter server, there is no consensus operation to exchange the gradients. ASAP reduces communication overheads in the all-reduce model by proposing partial reduce based on the information dispersal properties of the underlying node graph. TensorFlow runs a dataflow graph across a cluster and uses an asynchronous parameter server to train large models [6, 13]. For larger models, the large fan-out of the master can be a bottleneck, and the model parameters are aggregated at bandwidth hierarchies [13]. Using ASAP's stochastic reduce to improve the convergence behavior of such network architectures can reduce the wait times. The parameter server architecture has also been proposed over GPUs [23, 69], and the communication and synchronization costs in these systems can be reduced by using the ASAP model.

8. Conclusion

Practitioners often use approximation and asynchrony to accelerate job processing throughput (i.e., examples/second) in data-parallel frameworks. However, these optimizations may not achieve a speedup over BSP in reaching a high-accuracy output.

In this paper, we present stochastic reduce, which provides tunable approximation based on available network bandwidth. We also introduce NOTIFY-ACK, which provides fine-grained yet stronger consistency than BSP. In our results, we demonstrate that our model can achieve 2-10X speedups in convergence and up to 10X savings in network costs for distributed ML applications. Other optimization problems, such as graph processing, face similar trade-offs and can benefit from the ASAP model. A GitHub link to our data-parallel system, including the code to compute the spectral gap for different node topologies, will be provided in the final version of the paper.

Finally, there are other sources of synchrony in the system that can be relaxed. For example, we find that loading data into memory consumes a significant portion of job time. For training tasks with significant CPU time, such as processing images through deep networks, a separate worker that loads and sends data to the other workers over a low-latency network allows data loading to overlap with model training and can remove the initial data-load wait times.


References

[1] Apache Hama: Framework for Big Data Analytics. http://hama.apache.org/.
[2] Halton sequence. en.wikipedia.org/wiki/Halton_sequence.
[3] NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect.
[4] PASCAL Large Scale Learning Challenge. http://largescale.ml.tu-berlin.de, 2009.
[5] The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html, 2009.
[6] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[7] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. JMLR, 2014.
[8] Anonymous. Anonymized for review. In Unknown. XXX, XXX.
[9] Joshua Batson, Daniel A Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 56(8):87–94, 2013.
[10] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Springer COMPSTAT, pages 177–187, Paris, France, 2010.
[11] John Canny and Huasha Zhao. Butterfly mixing: Accelerating incremental-update algorithms on clusters. In SDM, pages 785–793, 2013.
[12] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. Scope: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265–1276, 2008.
[13] Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous sgd. ICLR Workshop, 2016.
[14] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[15] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In USENIX OSDI, 2014.
[16] Cheng Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. NIPS, 19:281, 2007.
[17] Lingkun Chu, Hong Tang, Tao Yang, and Kai Shen. Optimizing data aggregation for cluster-based internet services. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2003.
[18] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. Solving the straggler problem with bounded staleness. In USENIX HotOS, 2013.
[19] Clarifai. Clarifai Visual Search. https://www.clarifai.com/visual-search.
[20] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[21] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M Hellerstein, Khaled Elmeleegy, and Russell Sears. Mapreduce online. In NSDI, volume 10, page 20, 2010.
[22] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R Ganger, Phillip B Gibbons, et al. Exploiting bounded staleness to speed up big data analytics. In USENIX ATC, 2014.
[23] Henggang Cui, Hao Zhang, Gregory Ganger, Phillip Gibbons, and Eric Xing. Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems. ACM, 2016.
[24] Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho, and Eric P Xing. Petuum: A framework for iterative-convergent distributed ml. arXiv preprint arXiv:1312.7651, 2013.
[25] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
[26] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[27] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. Automatic Control, IEEE Transactions on, 57(3):592–606, 2012.
[28] Edgar Gabriel, Graham E Fagg, George Bosilca, Thara Angskun, Jack J Dongarra, Jeffrey M Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, et al. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 97–104. Springer, 2004.
[29] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In ACM KDD, pages 69–77, 2011.
[30] Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. Systemml: Declarative machine learning on mapreduce. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 231–242. IEEE, 2011.
[31] Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D Nguyen. Approxhadoop: Bringing approximations to mapreduce frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 383–397. ACM, 2015.
[32] Joseph E Gonzalez, Peter Bailis, Michael I Jordan, Michael J Franklin, Joseph M Hellerstein, Ali Ghodsi, and Ion Stoica. Asynchronous complex analytics in a distributed dataflow architecture. arXiv preprint arXiv:1510.07092, 2015.
[33] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In USENIX OSDI, 2012.
[34] Google Cloud Platform. Cloud Speech API. https://cloud.google.com/speech/.
[35] Google Cloud Platform. Cloud Vision API. https://cloud.google.com/vision/.
[36] Douglas Gregor and Andrew Lumsdaine. The parallel BGL: A generic library for distributed graph computations. Parallel Object-Oriented Scientific Computing (POOSC), 2:1–18, 2005.
[37] Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou. Spotting code optimizations in data-parallel pipelines through periscope. In USENIX OSDI, 2012.
[38] Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R Ganger, and Phillip B Gibbons. Proteus: agile ml elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems, pages 589–604. ACM, 2017.
[39] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM EuroSys, 2007.
[40] Anuj Kalia, Michael Kaminsky, and David G Andersen. Using RDMA efficiently for key-value services. In SIGCOMM, pages 295–306. ACM, 2014.
[41] Gautam Kumar, Ganesh Ananthanarayanan, Sylvia Ratnasamy, and Ion Stoica. Hold 'em or fold 'em? Aggregation queries under performance variations. 2016.
[42] Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. Malt: distributed data-parallelism for existing ml applications. In EuroSys. ACM, 2015.
[43] Mu Li, David Andersen, Alex Smola, Junwoo Park, Amr Ahmed, Vanja Josifovski, James Long, Eugene Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In USENIX OSDI, 2014.
[44] Mu Li, David G Andersen, and Alexander Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, 2013.
[45] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135–146. ACM, 2010.
[46] Frank McSherry, Michael Isard, and Derek G Murray. Scalability! But at what cost? In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.
[47] Frank McSherry. Progress in graph processing: Synchronous vs asynchronous graph processing. https://github.com/frankmcsherry/blog/blob/master/posts/2015-12-24.md.
[48] Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi. Naiad: A timely dataflow system. In ACM SOSP, 2013.
[49] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, pages 48–61, 2009.
[50] Edmund B Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. Flat datacenter storage. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 1–15, 2012.
[51] Cyprien Noel and Simon Osindero. Dogwild! Distributed hogwild for cpu & gpu. In NIPS Workshop on Distributed Machine Learning and Matrix Computations, 2014.
[52] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110. ACM, 2008.
[53] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun, and VMware ICSI. Making sense of performance in data analytics frameworks. In NSDI, 2015.
[54] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, pages 80–88, 2013.
[55] Russell Power and Jinyang Li. Piccolo: Building fast, distributed programs with partitioned tables. In USENIX OSDI, pages 293–306, 2010.
[56] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[57] Eugene Seneta. Non-negative Matrices and Markov Chains. Springer Science & Business Media, 2006.
[58] Ambuj Shatdal and Jeffrey F Naughton. Adaptive parallel aggregation algorithms. In ACM SIGMOD Record, volume 24, pages 104–114. ACM, 1995.
[59] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[60] Dinh Nguyen Tran, Bonan Min, Jinyang Li, and Lakshminarayanan Subramanian. Sybil-resilient online content voting. In NSDI, volume 9, pages 15–28, 2009.
[61] Asaf Valadarsky, Michael Dinitz, and Michael Schapira. Xpander: Unveiling the secrets of high-performance datacenters. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks, page 16. ACM, 2015.
[62] Leslie G Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[63] Abhishek Verma, Brian Cho, Nicolas Zea, Indranil Gupta, and Roy H Campbell. Breaking the mapreduce stage barrier. Cluster Computing, 16(1):191–206, 2013.
[64] Guozhang Wang, Wenlei Xie, Alan J Demers, and Johannes Gehrke. Asynchronous large-scale graph processing made easy. In CIDR, 2013.
[65] Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, and Peter Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 84–97. ACM, 2016.
[66] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In USENIX OSDI, 2008.
[67] Hyokun Yun, Hsiang-Fu Yu, Cho-Jui Hsieh, SVN Vishwanathan, and Inderjit Dhillon. NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. In ACM VLDB, 2014.
[68] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI, 2012.
[69] Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, and Eric Xing. Poseidon: A system architecture for efficient gpu-based deep learning on multiple machines. arXiv preprint arXiv:1512.06216, 2015.
