
Asynchronous Parallel Computing in Signal Processing and Machine Learning

Wotao Yin (UCLA Math)

joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU)

Optimization and Parsimonious Modeling – IMA, Jan 25, 2016

1 / 49

Do we need parallel computing?

2 / 49

Back in 1993

3 / 49

2006

4 / 49

2015

5 / 49

35 Years of CPU Trend

[Chart: trends in the number of CPUs, performance per core, and cores per CPU, 1995–2015.]

D. Henty. Emerging Architectures and Programming Models for Parallel Computing, 2012.

• In May 2004, Intel cancelled its Tejas project (single-core) and announced a new multi-core project.

6 / 49

Today: 4× AMD 16-core 3.5GHz CPUs (64 cores total)

7 / 49

Today: Tesla K80 GPU (2496 cores)

8 / 49

Today: Octa-Core Handsets

9 / 49

The free lunch is over

• before 2005: a single-threaded algorithm automatically got faster on each new CPU

• now, new algorithms must be developed for higher speed by

• exploiting problem structures

• taking advantage of dataset properties

• using all the cores available

10 / 49

How to use all the cores available?

11 / 49

Parallel computing

[Diagram: multiple agents work in parallel on one problem over time slots t1, t2, ..., tN.]

12 / 49

Parallel speedup

• definition: speedup = (serial time) / (parallel time), where time is in the wall-clock sense

• Amdahl's Law: N agents, no overhead, ρ = percentage of parallel computing

ideal speedup $= \dfrac{1}{\rho/N + (1-\rho)}$

[Plot: ideal speedup versus number of processors (10^0 to 10^8, log scale) for ρ = 25%, 50%, 90%, 95%.]

13 / 49

Parallel speedup

• ε := parallel overhead (startup, synchronization, collection)

• in the real world,

actual speedup $= \dfrac{1}{\rho/N + (1-\rho) + \varepsilon}$

[Plots: actual speedup versus number of processors for ρ = 25%, 50%, 90%, 95%; left panel with overhead ε = N, right panel with ε = log(N).]
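As a quick numerical illustration (not from the slides), here is a minimal C++ sketch that evaluates the two speedup formulas above; the overhead model ε ∼ 10⁻³·log(N) is an assumption chosen only for this example.

#include <cmath>
#include <cstdio>

// Ideal speedup from Amdahl's law: 1 / (rho/N + (1 - rho)).
double ideal_speedup(double rho, double N) {
    return 1.0 / (rho / N + (1.0 - rho));
}

// Actual speedup with a parallel-overhead term eps added to the denominator.
double actual_speedup(double rho, double N, double eps) {
    return 1.0 / (rho / N + (1.0 - rho) + eps);
}

int main() {
    const double rho = 0.95;                 // 95% of the work parallelizes
    for (double N : {1.0, 8.0, 64.0, 1e4}) {
        double eps = 1e-3 * std::log(N);     // assumed overhead model, eps ~ log(N)
        std::printf("N = %8.0f  ideal = %6.2f  actual = %6.2f\n",
                    N, ideal_speedup(rho, N), actual_speedup(rho, N, eps));
    }
    return 0;
}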

14 / 49

Sync-parallel versus async-parallel

[Diagram: Synchronous (wait for the slowest) — agents 1–3 sit idle between iterations; Asynchronous (non-stop, no wait) — agents 1–3 run continuously.]

15 / 49

Async-parallel coordinate updates

16 / 49

Fixed point iteration and its parallel version

• H = H1 × · · · × Hm

• original iteration: $x^{k+1} = Tx^k =: (I - \eta S)x^k$

• all agents do in parallel (sketched in code below):

agent 1: $x_1^{k+1} \leftarrow T_1(x^k) = x_1^k - \eta S_1(x^k)$
agent 2: $x_2^{k+1} \leftarrow T_2(x^k) = x_2^k - \eta S_2(x^k)$
$\quad\vdots$
agent m: $x_m^{k+1} \leftarrow T_m(x^k) = x_m^k - \eta S_m(x^k)$

• assumptions:
1. coordinate friendliness: cost of $S_i x \;\sim\; \tfrac{1}{m}\,$cost of $Sx$
2. synchronization after each iteration
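A minimal OpenMP sketch (not from the slides) of one such synchronous round; the block operator Si(x, i), returning the i-th coordinate of S(x), is a hypothetical user-supplied function.

#include <vector>

// Hypothetical interface: Si(x, i) returns the i-th coordinate of S(x).
double Si(const std::vector<double>& x, int i);

// One synchronous iteration x^{k+1} = x^k - eta * S(x^k), split across threads.
void sync_coordinate_update(std::vector<double>& x, double eta) {
    const int m = static_cast<int>(x.size());
    std::vector<double> x_new(m);
    #pragma omp parallel for              // each thread handles a block of coordinates
    for (int i = 0; i < m; ++i)
        x_new[i] = x[i] - eta * Si(x, i); // every agent reads the same old iterate x^k
    x = x_new;                            // the implicit barrier above makes this safe
}

The implicit barrier at the end of the parallel for is exactly the per-iteration synchronization in assumption 2.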

17 / 49

Comparison

Synchronous: a new iteration starts when all agents finish.

Asynchronous: a new iteration starts when any agent finishes.

[Timeline diagram: agents 1–3 over time steps t0, t1, ..., t10 under the two schemes.]

18 / 49

ARock¹: Async-parallel coordinate update

• H = H1 × · · · × Hm
• p agents, possibly p ≠ m

• each agent randomly picks i ∈ {1, . . . ,m} and updates just $x_i$ (a minimal sketch follows below):

$x_1^{k+1} \leftarrow x_1^k$
$\quad\vdots$
$x_i^{k+1} \leftarrow x_i^k - \eta_k S_i x^{k-d_k}$
$\quad\vdots$
$x_m^{k+1} \leftarrow x_m^k$

• $0 \le d_k \le \tau$, the maximum delay
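A minimal shared-memory sketch (not the authors' implementation) of this update rule; Si is again a hypothetical user-supplied block operator, and the unlocked copy of x plays the role of the possibly delayed iterate $x^{k-d_k}$.

#include <random>
#include <vector>
#include <omp.h>

// Hypothetical interface: Si(x_hat, i) returns the i-th coordinate of S(x_hat).
double Si(const std::vector<double>& x_hat, int i);

void arock(std::vector<double>& x, double eta, int max_itr) {
    const int m = static_cast<int>(x.size());
    #pragma omp parallel                                  // p agents; p need not equal m
    {
        std::mt19937 gen(12345 + omp_get_thread_num());   // per-agent random stream
        std::uniform_int_distribution<int> pick(0, m - 1);
        for (int itr = 0; itr < max_itr; ++itr) {
            std::vector<double> x_hat(x);                 // lock-free, possibly inconsistent read
            int i = pick(gen);
            double delta = -eta * Si(x_hat, i);
            #pragma omp atomic                            // update only the single entry x[i]
            x[i] += delta;
        }
    }
}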

¹ Peng-Xu-Yan-Y. '15

19 / 49

Two ways to model xk−dk

definitions: let $x^0, \dots, x^k, \dots$ be the states of x in memory

1. $x^{k-d_k}$ is consistent if $d_k$ is a scalar
2. $x^{k-d_k}$ is possibly inconsistent if $d_k$ is a vector, i.e., different components are delayed by different amounts

ARock allows both consistent and inconsistent read.

20 / 49

Memory lock illustration

consistent read: Agent 1 reads $[0, 0, 0, 0]^T = x^0$

inconsistent read: Agent 1 reads $[0, 0, 0, 2]^T \notin \{x^0, x^1, x^2\}$

21 / 49

History and recent literature

22 / 49

Brief history of async-parallel algorithms (mostly worst-case analysis)

• 1969 – a linear equation solver by Chazan and Miranker;

• 1978 – extended to the fixed-point problem by Baudet under the absolute-contraction² type of assumption.

• For the following 20–30 years, used mainly to solve linear, nonlinear, and differential equations, by many people

• 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. 2000 – Review by Frommer and Szyld.

• 1991 – gradient-projection iteration assuming a local linear error bound, by Tseng

• 2001 – domain decomposition assuming strong convexity by Tai & Tseng

² An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is absolute-contractive if $|T(x) - T(y)| \le P\,|x - y|$ component-wise, where $|x|$ denotes the vector with components $|x_i|$, $i = 1, \dots, n$, $P \in \mathbb{R}^{n \times n}_+$, and $\rho(P) < 1$.

23 / 49

Absolute-contraction

• Absolute-contractive operator $T : \mathbb{R}^n \to \mathbb{R}^n$: $|T(x) - T(y)| \le P\,|x - y|$ component-wise, where $|x|$ denotes the vector with components $|x_i|$, $i = 1, \dots, n$, $P \in \mathbb{R}^{n \times n}_+$, and $\rho(P) < 1$.

• Interpretation: a series of nested rectangular boxes for $x^{k+1} = Tx^k$

• Applications:
• diagonally dominant A for Ax = b
• diagonally dominant ∇²f for min_x f(x) (strong convexity alone is not enough)
• some network flow problems

24 / 49

Recent work (stochastic analysis)

• AsySCD for convex smooth and composite minimization by Liu et al.'14 and Liu-Wright'14. Async dual CD (regression problems) by Hsieh et al.'15

• Async randomized (splitting/distributed/incremental) methods: Wei-Ozdaglar'13, Iutzeler et al.'13, Zhang-Kwok'14, Hong'14, Chang et al.'15

• Async SGD: Hogwild!, Lian’15, etc.

• Async operator sample and CD: SMART Davis’15

25 / 49

Random coordinate selection

• select $x_i$ to update with probability $p_i$, where $\min_i p_i > 0$

• drawbacks:
• agents cannot cache data
• either global memory or communication is required
• pseudo-random number generation takes time

• benefits:
• often faster than the fixed cyclic order
• automatic load balance
• simplifies certain analysis

26 / 49

Convergence summary

27 / 49

Convergence guarantees

m is the number of coordinates, τ is the maximum delay, uniform selection $p_i \equiv \tfrac{1}{m}$

Theorem (almost sure convergence). Assume that T is nonexpansive and has a fixed point. Use step sizes $\eta_k \in \big[\epsilon, \tfrac{1}{2m^{-1/2}\tau + 1}\big)$, $\forall k$. Then, with probability one, $x^k \rightharpoonup x^* \in \mathrm{Fix}\,T$.

In addition, convergence rates can be derived.
Consequence: the step size is O(1) if τ ∼ √m. Assuming equally capable agents and equally costly updates, linear speedup is attained when using p = O(√m) agents; p can be larger if T is sparse.

28 / 49

Sketch of proof

• typical inequality:

$\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 - c\,\|Tx^k - x^k\|^2 + \text{harmful terms}(x^{k-1}, \dots, x^{k-\tau})$

• Descent inequality under a new metric:

$\mathbb{E}\big(\|\mathbf{x}^{k+1} - \mathbf{x}^*\|_M^2 \;\big|\; \mathcal{X}^k\big) \le \|\mathbf{x}^k - \mathbf{x}^*\|_M^2 - c\,\|Tx^k - x^k\|^2$

where
• $\mathcal{X}^k$ is the history up to iteration k
• $\mathbf{x}^k = (x^k, x^{k-1}, \dots, x^{k-\tau}) \in H^{\tau+1}$, $k \ge 0$
• any $\mathbf{x}^* = (x^*, x^*, \dots, x^*) \in \mathbf{X}^* \subseteq H^{\tau+1}$
• M is a positive definite matrix
• $c = c(\eta_k, m, \tau)$

29 / 49

• apply the Robbins-Siegmund theorem:

$\mathbb{E}(\alpha_{k+1} \mid \mathcal{F}_k) + v_k \le (1 + \xi_k)\,\alpha_k + \eta_k$

where all quantities are nonnegative, $\alpha_k$ is random, and $\xi_k, \eta_k$ are summable. Then $\alpha_k$ converges a.s.

• prove that weak cluster points are fixed points;

• assume H is separable and apply results [Combettes, Pesquet 2014].

30 / 49

Applications and numerical results

31 / 49

Linear equations (asynchronous Jacobi)

• require: invertible square matrix A with nonzero diagonal entries

• let D be the diagonal part of A; then

$Ax = b \;\iff\; \underbrace{(I - D^{-1}A)\,x + D^{-1}b}_{Tx} = x.$

• T is nonexpansive if $\|I - D^{-1}A\|_2 \le 1$, i.e., A is diagonally dominant

• $x^{k+1} = Tx^k$ recovers the Jacobi algorithm

32 / 49

Algorithm 1: ARock for linear equations

Input: shared variable $x \in \mathbb{R}^n$, K > 0; set global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
    sample i ∈ {1, . . . , m} uniformly at random;
    add $-\frac{\eta_k}{a_{ii}}\big(\sum_j a_{ij} x_j^k - b_i\big)$ to the shared variable $x_i$;
    update the global counter k ← k + 1;

33 / 49

Sample code

loadData(A, data_file_name);    // read the coefficient matrix A
loadData(b, label_file_name);   // read the right-hand side b

#pragma omp parallel num_threads(p) shared(A, b, x, para)
{   // A, b, x, and para are passed by reference
    Jacobi(A, b, x, para);      // or: ARock(A, b, x, para);
}

• p: the number of threads
• A, b, x: shared variables
• para: other parameters

34 / 49

Jacobi worker function

for (int itr = 0; itr < max_itr; itr++) {
    // compute the update for the assigned x[i]
    // ...
    #pragma omp barrier
    {
        // write x[i] in global memory
    }
    #pragma omp barrier
}

Jacobi needs the barrier directive for synchronization

35 / 49

ARock worker function

for (int itr = 0; itr < max_itr; itr++) {
    // pick i at random
    // compute the update for x[i]
    // ...
    // write x[i] in global memory
}

ARock has no synchronization barrier directive

36 / 49

Minimizing smooth functions

• require: convex and Lipschitz differentiable function f

• if ∇f is L-Lipschitz, then

$\underset{x}{\text{minimize}}\; f(x) \;\iff\; x = \underbrace{\Big(I - \tfrac{2}{L}\nabla f\Big)}_{T}\,x,$

where T is nonexpansive

• ARock will be efficient when $\nabla_{x_i} f(x)$ is easy to compute (a sketch follows below)
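A minimal sketch (not from the slides) of the serial fixed-point view above, assuming a hypothetical user-supplied grad() that returns ∇f(x); when ∇f is L-Lipschitz, any step size in (0, 2/L] keeps the map nonexpansive.

#include <cstddef>
#include <vector>

// Hypothetical interface: grad(x) returns the gradient of f at x.
std::vector<double> grad(const std::vector<double>& x);

// Fixed-point iteration x <- T(x) = x - (2/L) * grad f(x).
void gradient_fixed_point(std::vector<double>& x, double L, int max_itr) {
    const double eta = 2.0 / L;
    for (int k = 0; k < max_itr; ++k) {
        std::vector<double> g = grad(x);
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] -= eta * g[i];        // x^{k+1} = (I - (2/L) grad f)(x^k)
    }
}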

37 / 49

Minimizing composite functions

• require: convex smooth g(·) and convex (possibly nonsmooth) f(·)

• proximal map: $\mathrm{prox}_{\gamma f}(y) = \arg\min_x\; f(x) + \tfrac{1}{2\gamma}\|x - y\|^2$

$\underset{x}{\text{minimize}}\; f(x) + g(x) \;\iff\; x = \underbrace{\mathrm{prox}_{\gamma f} \circ (I - \gamma\nabla g)}_{T}\,x.$

• ARock will be fast if
• $\nabla_{x_i} g(x)$ is easy to compute
• f(·) is separable (e.g., ℓ₁ and ℓ₁,₂); see the sketch below
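A minimal sketch (not from the slides) of one forward-backward step for the special case f = λ‖·‖₁, whose prox is the componentwise soft-thresholding map; grad_g() is a hypothetical user-supplied gradient of the smooth part.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical interface: grad_g(x) returns the gradient of the smooth part g.
std::vector<double> grad_g(const std::vector<double>& x);

// prox of t*||.||_1 applied to one component (soft thresholding); separability
// of f is what lets ARock update a single coordinate at a time.
double shrink(double y, double t) {
    return std::copysign(std::max(std::abs(y) - t, 0.0), y);
}

// One step x <- prox_{gamma f}( x - gamma * grad g(x) ) with f = lambda*||.||_1.
void forward_backward_step(std::vector<double>& x, double gamma, double lambda) {
    std::vector<double> g = grad_g(x);
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = shrink(x[i] - gamma * g[i], gamma * lambda);
}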

38 / 49

Example: sparse logistic regression

• n features, N labeled samples

• each sample ai ∈ Rn has its label bi ∈ {1,−1}

• ℓ₁-regularized logistic regression:

$\underset{x\in\mathbb{R}^n}{\text{minimize}}\;\; \lambda\|x\|_1 + \frac{1}{N}\sum_{i=1}^{N}\log\big(1 + \exp(-b_i \cdot a_i^T x)\big), \qquad (1)$

• compare sync-parallel and ARock (async-parallel) on two datasets:

Name     N (# samples)   n (# features)   # nonzeros in {a_1, ..., a_N}
rcv1     20,242          47,236           1,498,952
news20   19,996          1,355,191        9,097,916
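For a coordinate update on problem (1), each agent only needs one partial derivative of the smooth part. Below is a minimal sketch (not from the slides, dense storage for simplicity) assuming the inner products $a_i^T x$ are cached in Ax, which is what makes the update coordinate friendly.

#include <cmath>
#include <vector>

// Coordinate j of the gradient of the smooth part
//   g(x) = (1/N) * sum_i log(1 + exp(-b_i * <a_i, x>)),
// using cached inner products Ax[i] = <a_i, x>.
double grad_j(const std::vector<std::vector<double> >& A,  // N x n samples
              const std::vector<double>& b,                // labels in {+1, -1}
              const std::vector<double>& Ax,               // cached a_i^T x
              int j) {
    const int N = static_cast<int>(b.size());
    double g = 0.0;
    for (int i = 0; i < N; ++i) {
        double s = 1.0 / (1.0 + std::exp(b[i] * Ax[i]));   // sigmoid(-b_i * a_i^T x)
        g += -b[i] * A[i][j] * s;
    }
    return g / N;
}

The full coordinate update then combines grad_j with the soft-thresholding prox of the ℓ₁ term, as sketched earlier.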

39 / 49

Speedup tests

• implemented in C++ with OpenMP
• a 32-core shared-memory machine

                 rcv1                           news20
#cores     Time (s)        Speedup         Time (s)        Speedup
           async   sync    async   sync    async   sync    async   sync
1          122.0   122.0   1.0     1.0     591.1   591.3   1.0     1.0
2           63.4   104.1   1.9     1.2     304.2   590.1   1.9     1.0
4           32.7    83.7   3.7     1.5     150.4   557.0   3.9     1.1
8           16.8    63.4   7.3     1.9      78.3   525.1   7.5     1.1
16           9.1    45.4   13.5    2.7      41.6   493.2   14.2    1.2
32           4.9    30.3   24.6    4.0      22.6   455.2   26.1    1.3

40 / 49

More applications

41 / 49

Minimizing composite functions

• require: both f and g are convex (possibly nonsmooth) functions

$\text{minimize}\; f(x) + g(x) \;\iff\; z = \underbrace{\mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}}_{T_{\mathrm{PRS}}}(z),$

then recover $x = \mathrm{prox}_{\gamma g}(z)$

• $T_{\mathrm{PRS}}$ is known as the Peaceman-Rachford splitting operator³

• also, the Douglas-Rachford splitting operator: $\tfrac{1}{2}I + \tfrac{1}{2}T_{\mathrm{PRS}}$

• ARock runs fast when
• $\mathrm{refl}_{\gamma f}$ is separable
• $(\mathrm{refl}_{\gamma g})_i$ is easy to compute/maintain

³ reflective proximal map: $\mathrm{refl}_{\gamma f} := 2\,\mathrm{prox}_{\gamma f} - I$. The maps $\mathrm{refl}_{\gamma f}$, $\mathrm{refl}_{\gamma g}$, and thus $\mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}$ are nonexpansive
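A minimal sketch (not from the slides) of one Peaceman-Rachford step built from generic proximal maps; Prox is a hypothetical callback type, and prox_f / prox_g are assumed to be supplied by the user.

#include <cstddef>
#include <functional>
#include <vector>

using Vec  = std::vector<double>;
using Prox = std::function<Vec(const Vec&, double)>;   // y -> prox_{gamma h}(y)

// Reflected proximal map: refl_{gamma h} = 2 * prox_{gamma h} - I.
Vec refl(const Prox& prox_h, const Vec& z, double gamma) {
    Vec p = prox_h(z, gamma);
    for (std::size_t i = 0; i < z.size(); ++i) p[i] = 2.0 * p[i] - z[i];
    return p;
}

// One Peaceman-Rachford step z <- refl_{gamma f}( refl_{gamma g}(z) );
// the solution estimate is recovered as x = prox_{gamma g}(z).
Vec prs_step(const Prox& prox_f, const Prox& prox_g, const Vec& z, double gamma) {
    return refl(prox_f, refl(prox_g, z, gamma), gamma);
}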

42 / 49

Parallel/distributed ADMM

• require: m convex functions fi

• consensus problem:

$\underset{x}{\text{minimize}}\;\; \sum_{i=1}^{m} f_i(x) + g(x)
\;\iff\;
\underset{x_i,\,y}{\text{minimize}}\;\; \sum_{i=1}^{m} f_i(x_i) + g(y)
\quad\text{subject to}\quad
\begin{bmatrix} I & & & \\ & I & & \\ & & \ddots & \\ & & & I \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}
- \begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix} y = 0$

• Douglas-Rachford-ARock applied to the dual problem ⇒ async-parallel ADMM:
• the m subproblems are solved in the async-parallel fashion
• y and $z_i$ (dual variables) are updated in global memory (no lock)

43 / 49

Decentralized computing

• n agents in a connected network G = (V,E) with bi-directional links E

• each agent i has a private function fi

• problem: find a consensus solution x∗ to

$\underset{x\in\mathbb{R}^p}{\text{minimize}}\;\; f(x) := \sum_{i=1}^{n} f_i(A_i x_i) \quad\text{subject to}\quad x_i = x_j,\ \forall i, j.$

• challenges: no center, only between-neighbor communication

• benefits: fault tolerance, no long-dist communication, privacy

44 / 49

Async-parallel decentralized ADMM

• a graph of connected agents: G = (V,E).

• decentralized consensus optimization problem:

$\underset{x_i\in\mathbb{R}^d,\,i\in V}{\text{minimize}}\;\; f(x) := \sum_{i\in V} f_i(x_i) \quad\text{subject to}\quad x_i = x_j,\ \forall (i, j) \in E$

• ADMM reformulation: constraints xi = yij , xj = yij , ∀(i, j) ∈ E

• apply
• ARock version 1: nodes asynchronously activate
• ARock version 2: edges asynchronously activate

no global clock, no central controller; each agent keeps $f_i$ private and talks to its neighbors

45 / 49

notation:

• N(i) all edges of agent i, N(i) = L(i) ∪R(i)

• L(i) neighbors j of agent i, j < i

• R(i) neighbors j of agent i, j > i

Algorithm 2: ARock for the decentralized consensus problem

Input: each agent i sets $x_i^0 \in \mathbb{R}^d$, dual variables $z_{e,i}^0$ for $e \in E(i)$, K > 0.
while k < K, any activated agent i do
    receive $z_{li,l}^k$ from neighbors $l \in L(i)$ and $z_{ir,r}^k$ from neighbors $r \in R(i)$;
    update local $x_i^k$, $z_{li,i}^{k+1}$ and $z_{ir,i}^{k+1}$ according to (2a)–(2c), respectively;
    send $z_{li,i}^{k+1}$ to neighbors $l \in L(i)$ and $z_{ir,i}^{k+1}$ to neighbors $r \in R(i)$.

$x_i^k \in \arg\min_{x_i}\; f_i(x_i) + \Big(\sum_{l\in L(i)} z_{li,l}^k + \sum_{r\in R(i)} z_{ir,r}^k\Big)^T x_i + \frac{\gamma}{2}\,|E(i)|\,\|x_i\|^2, \qquad (2a)$

$z_{ir,i}^{k+1} = z_{ir,i}^k - \eta_k\big((z_{ir,i}^k + z_{ir,r}^k)/2 + \gamma x_i^k\big), \quad \forall r \in R(i), \qquad (2b)$

$z_{li,i}^{k+1} = z_{li,i}^k - \eta_k\big((z_{li,i}^k + z_{li,l}^k)/2 + \gamma x_i^k\big), \quad \forall l \in L(i). \qquad (2c)$

46 / 49

Summary

47 / 49

Summary of async-parallel coordinate descent

benefits:

• eliminate idle time

• reduce communication / memory-access congestion

• random job selection: load balance

mathematics: analysis of disorderly (partial) updates

• asynchronous delay

• inconsistent read and write

48 / 49

Thank you!

Acknowledgements: NSF DMS
Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM 15-37.
Website: http://www.math.ucla.edu/~wotaoyin/arock

49 / 49