Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016

Posted on 09-Jan-2017


Transcript

CoCoA: A General Framework for Communication-Efficient Distributed Optimization

Virginia Smith ⋅ Simone Forte ⋅ Chenxin Ma ⋅ Martin Takáč ⋅ Martin Jaggi ⋅ Michael I. Jordan

Machine Learning with Large Datasets


image/music/video tagging
document categorization
item recommendation
click-through rate prediction
sequence tagging
protein structure prediction
sensor data prediction
spam classification
fraud detection

Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …

SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open problem: efficiently solving the objective when data is distributed

Distributed Optimization

"Always communicate": reduce $w \leftarrow w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees
✗ high communication

"Never communicate": average $w := \frac{1}{K} \sum_k w_k$
✔ low communication
✗ convergence not guaranteed
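A minimal sketch (my own illustration, not code from the talk) contrasting the two extremes on a synthetic least-squares problem; the partitioning scheme, step size, and round counts are assumptions chosen for illustration:

    import numpy as np

    def local_gradient(Ak, bk, w):
        # gradient of (1/2)||Ak w - bk||^2 on one worker's data partition
        return Ak.T @ (Ak @ w - bk)

    def always_communicate(parts, w, alpha=1e-3, rounds=200):
        # reduce every round: w = w - alpha * sum_k delta_w_k
        for _ in range(rounds):
            w = w - alpha * sum(local_gradient(Ak, bk, w) for Ak, bk in parts)
        return w

    def never_communicate(parts, w, alpha=1e-3, rounds=200):
        # each worker optimizes alone; average once at the end: w := (1/K) sum_k w_k
        finals = []
        for Ak, bk in parts:
            wk = w.copy()
            for _ in range(rounds):
                wk = wk - alpha * local_gradient(Ak, bk, wk)
            finals.append(wk)
        return sum(finals) / len(finals)

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((400, 10)), rng.standard_normal(400)
    parts = [(A[k::4], b[k::4]) for k in range(4)]   # K = 4 workers
    w0 = np.zeros(10)
    w_comm, w_avg = always_communicate(parts, w0), never_communicate(parts, w0)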

Mini-batch

reduce: $w \leftarrow w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees
✔ tunable communication
A natural middle ground.
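Continuing the same hypothetical least-squares setup, a sketch of the tunable middle ground: each round samples a batch $b$ and applies the averaged update $w \leftarrow w - \frac{\alpha}{|b|}\sum_{i\in b}\Delta w_i$. The batch size and step size below are illustrative assumptions.

    import numpy as np

    def minibatch_update(A, y, w, batch_size=32, alpha=0.5, rounds=300, seed=0):
        # each round: per-example updates for the sampled batch b are summed,
        # then the driver reduces w = w - (alpha/|b|) * sum_{i in b} delta_w_i
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        for _ in range(rounds):
            b = rng.choice(n, size=batch_size, replace=False)
            summed = A[b].T @ (A[b] @ w - y[b])   # sum of per-example gradients
            w = w - (alpha / batch_size) * summed
        return w

    rng = np.random.default_rng(1)
    A, y = rng.standard_normal((400, 10)), rng.standard_normal(400)
    w = minibatch_update(A, y, np.zeros(10))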

Mini-batch Limitations

1. ONE-OFF METHODS
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE

LARGE-SCALE OPTIMIZATION

CoCoA

ProxCoCoA+


Mini-batch Limitations

1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K << batch size

CoCoA-v1

1. Primal-Dual Framework

PRIMAL ≥ DUAL

Primal: $\min_{w \in \mathbb{R}^d} \; P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^\top x_i)$

Dual: $\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)$, where $A_i = \frac{1}{\lambda n} x_i$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per data point)
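As a concrete, hedged illustration of the duality-gap stopping criterion, here is the pair above specialized to squared loss, $\ell_i(z) = \tfrac{1}{2}(z - y_i)^2$ with conjugate $\ell_i^*(u) = \tfrac{1}{2}u^2 + u y_i$; the gap $P(w(\alpha)) - D(\alpha)$ upper-bounds the suboptimality and can be checked each round. This is my own example, not code from the talk.

    import numpy as np

    def primal(X, y, w, lam):
        return 0.5 * lam * (w @ w) + 0.5 * np.mean((X @ w - y) ** 2)

    def dual(X, y, alpha, lam):
        n = len(y)
        w = X.T @ alpha / (lam * n)              # w(alpha) = A alpha, with A_i = x_i / (lam * n)
        conj = 0.5 * alpha ** 2 - alpha * y      # l_i^*(-alpha_i) for squared loss
        return -0.5 * lam * (w @ w) - np.mean(conj)

    rng = np.random.default_rng(0)
    X, y, lam = rng.standard_normal((50, 5)), rng.standard_normal(50), 0.1
    alpha = np.zeros(50)
    w_of_alpha = X.T @ alpha / (lam * len(y))
    gap = primal(X, y, w_of_alpha, lam) - dual(X, y, alpha, lam)   # always >= 0; stop when small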

1. Primal-Dual Framework

Global objective: $\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)$, with $A_i = \frac{1}{\lambda n} x_i$

Local objective: $\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \; -\frac{1}{n}\sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) - \frac{1}{n} w^\top A \Delta\alpha_{[k]} - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n} A \Delta\alpha_{[k]}\Big\|^2$

Can solve the local objective using any internal optimization method.
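A rough sketch of the resulting outer loop, again for the squared-loss special case so the local solver can be a few closed-form dual coordinate steps. The choice of local solver, the number of local passes, and the conservative 1/K averaging are my assumptions; the framework allows any internal method here.

    import numpy as np

    def local_sdca_pass(Xk, yk, alpha_k, w, lam, n, passes=5, seed=0):
        # approximately solve the local dual subproblem on partition P_k
        rng = np.random.default_rng(seed)
        w_loc, dalpha = w.copy(), np.zeros_like(alpha_k)
        for _ in range(passes):
            for i in rng.permutation(len(yk)):
                xi = Xk[i]
                # closed-form maximization along coordinate i (squared loss)
                delta = (yk[i] - xi @ w_loc - (alpha_k[i] + dalpha[i])) / (1.0 + xi @ xi / (lam * n))
                dalpha[i] += delta
                w_loc += delta * xi / (lam * n)   # apply the update immediately ("fresh")
        return dalpha, w_loc - w                  # (delta_alpha_[k], delta_w_k)

    def cocoa(parts, lam, n, d, rounds=20):
        w = np.zeros(d)
        alphas = [np.zeros(len(yk)) for _, yk in parts]
        K = len(parts)
        for _ in range(rounds):
            results = [local_sdca_pass(Xk, yk, alphas[k], w, lam, n, seed=k)
                       for k, (Xk, yk) in enumerate(parts)]
            for k, (dalpha, _) in enumerate(results):
                alphas[k] += dalpha / K            # average dual updates too, keeping w = A*alpha
            w += sum(dw for _, dw in results) / K  # CoCoA-v1 style averaging over K
        return w

    rng = np.random.default_rng(2)
    X, y, lam = rng.standard_normal((400, 10)), rng.standard_normal(400), 0.1
    parts = [(X[k::4], y[k::4]) for k in range(4)]
    w = cocoa(parts, lam, n=400, d=10)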

2. Immediately Apply Updates

STALE:
    for i in b:
        Δw ← Δw − α ∇_i P(w)
    end
    w ← w + Δw

FRESH:
    for i in b:
        Δw ← −α ∇_i P(w)
        w ← w + Δw
    end
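The same contrast as runnable code, under the assumption of per-example updates on a smooth objective; grad_i is a hypothetical helper returning the stochastic gradient of P at w for example i.

    import numpy as np

    def stale_pass(w, batch, grad_i, alpha):
        # delta_w accumulates -alpha * grad_i(w) with w held fixed (stale)
        dw = np.zeros_like(w)
        for i in batch:
            dw -= alpha * grad_i(w, i)
        return w + dw

    def fresh_pass(w, batch, grad_i, alpha):
        # each per-example update is applied immediately (fresh)
        w = w.copy()
        for i in batch:
            w -= alpha * grad_i(w, i)
        return w

    # example: per-example least-squares gradient, grad_i(w) = (x_i^T w - y_i) x_i
    rng = np.random.default_rng(3)
    X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
    g = lambda w, i: (X[i] @ w - y[i]) * X[i]
    w0 = np.zeros(5)
    w_stale, w_fresh = stale_pass(w0, range(100), g, 0.05), fresh_pass(w0, range(100), g, 0.05)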

3. Average over K

reduce: $w \leftarrow w + \frac{1}{K} \sum_k \Delta w_k$

CoCoA-v1 Limitations

Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith, et al., ICML '15]

L1-regularized objectives not covered in the initial framework

LARGE-SCALE OPTIMIZATION

CoCoA

ProxCoCoA+


L1 Regularization

Encourages sparse solutions
Includes popular models: lasso regression, sparse logistic regression, elastic net-regularized problems
Beneficial to distribute by feature
Can we map this to the CoCoA setup?
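For reference, these are the standard textbook formulations of the listed models (one common parameterization of the elastic net), not equations taken from the slides:

    \min_{w \in \mathbb{R}^d} \; \tfrac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1 \quad \text{(lasso)}

    \min_{w \in \mathbb{R}^d} \; \tfrac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right) + \lambda \|w\|_1 \quad \text{(sparse logistic regression)}

    \min_{w \in \mathbb{R}^d} \; \tfrac{1}{2n}\|Xw - y\|_2^2 + \lambda\!\left(\eta\|w\|_1 + \tfrac{1-\eta}{2}\|w\|_2^2\right) \quad \text{(elastic net)}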

Solution: Solve Primal Directly

PRIMAL ≥ DUAL

Primal: $\min_{\alpha \in \mathbb{R}^n} \; f(A\alpha) + \sum_{i=1}^{n} g_i(\alpha_i)$

Dual: $\min_{w \in \mathbb{R}^d} \; f^*(w) + \sum_{i=1}^{n} g_i^*(-x_i^\top w)$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)
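A rough sketch (my own, not the ProxCoCoA+ code) of what solving the primal directly buys for the lasso, where $f(A\alpha) = \tfrac{1}{2}\|A\alpha - b\|^2$ and $g_i(\alpha_i) = \lambda|\alpha_i|$: the columns of A are partitioned across workers, each worker runs proximal coordinate descent on its own features, and only the change in the shared vector $v = A\alpha$ is communicated. The conservative 1/K averaging is an assumption made for safety.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def local_cd(A_cols, alpha_k, v, b, lam, passes=3):
        # proximal coordinate descent over this worker's feature block
        v_loc, dalpha = v.copy(), np.zeros_like(alpha_k)
        for _ in range(passes):
            for j in range(A_cols.shape[1]):
                xj = A_cols[:, j]
                aj_old = alpha_k[j] + dalpha[j]
                grad = xj @ (v_loc - b)
                aj_new = soft_threshold(aj_old - grad / (xj @ xj), lam / (xj @ xj))
                dalpha[j] += aj_new - aj_old
                v_loc += (aj_new - aj_old) * xj    # keep v = A*alpha fresh locally
        return dalpha, v_loc - v                    # (delta_alpha_[k], delta_v_k)

    def prox_cocoa_sketch(A, b, lam, K=4, rounds=30):
        n, d = A.shape
        blocks = np.array_split(np.arange(d), K)    # partition features across workers
        alpha, v = np.zeros(d), np.zeros(n)
        for _ in range(rounds):
            results = [local_cd(A[:, blk], alpha[blk], v, b, lam) for blk in blocks]
            for blk, (dalpha, _) in zip(blocks, results):
                alpha[blk] += dalpha / K            # conservative averaging over K
            v += sum(dv for _, dv in results) / K
        return alpha

    rng = np.random.default_rng(4)
    A, b = rng.standard_normal((200, 40)), rng.standard_normal(200)
    alpha = prox_cocoa_sketch(A, b, lam=0.1)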

50x speedup

CoCoA: A Framework for Distributed Optimization

flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models

1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]

CoCoA Framework

Impact & Adoption

gingsmith.github.io/cocoa/
Numerous talks & demos (ICML)
Open-source code & documentation
Industry & academic adoption

Thanks!

cs.berkeley.edu/~vsmith

github.com/gingsmith/proxcocoa

github.com/gingsmith/cocoa

Empirical Results: Datasets

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
url        2M             3M             4e-3%      4
kddb       19M            29M            1e-4%      4
epsilon    400K           2K             100%       8
webspam    350K           16M            0.02%      16

A First Approach: CoCoA+ with Smoothing

Issue: CoCoA+ requires strongly-convex regularizers
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465

✗ CoCoA+ with smoothing doesn't work
Additionally, CoCoA+ distributes by data point, not by feature

Better Solution: ProxCoCoA+

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465
ProxCoCoA+       0.6030

Convergence

Assumption (Local Θ-Approximation). For $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution $\Delta\alpha_{[k]}$ satisfying
$\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \;\le\; \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$

Theorem 1. Let $g_i$ have $L$-bounded support. Then an $\epsilon$-accurate solution is reached after
$T \;\ge\; O\!\Big(\frac{1}{1-\Theta}\Big(\frac{8L^2 n^2}{\tau\epsilon} + c\Big)\Big)$ rounds.

Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then an $\epsilon$-accurate solution is reached after
$T \;\ge\; \frac{1}{1-\Theta}\,\frac{\mu\tau + n}{\mu\tau}\,\log\frac{n}{\epsilon}$ rounds.