Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016

Posted on 09-Jan-2017


Transcript

CoCoA: A General Framework for Communication-Efficient Distributed Optimization

Virginia Smith ⋅ Simone Forte ⋅ Chenxin Ma ⋅ Martin Takáč ⋅ Martin Jaggi ⋅ Michael I. Jordan

Machine Learning with Large Datasets


image/music/video tagging
document categorization
item recommendation
click-through rate prediction
sequence tagging
protein structure prediction
sensor data prediction
spam classification
fraud detection

Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …

SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open problem: efficiently solving the objective when data is distributed

Distributed Optimization

"Always communicate": reduce $w \leftarrow w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees
✗ high communication

"Never communicate": average $w := \frac{1}{K} \sum_k w_k$
✔ low communication
✗ convergence not guaranteed
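A minimal sketch (my own illustration, not code from the talk) contrasting the two extremes on a synthetic least-squares problem; the partitioning scheme, step size, and round counts are assumptions chosen for illustration:

    import numpy as np

    def local_gradient(Ak, bk, w):
        # gradient of (1/2)||Ak w - bk||^2 on one worker's data partition
        return Ak.T @ (Ak @ w - bk)

    def always_communicate(parts, w, alpha=1e-3, rounds=200):
        # reduce every round: w = w - alpha * sum_k delta_w_k
        for _ in range(rounds):
            w = w - alpha * sum(local_gradient(Ak, bk, w) for Ak, bk in parts)
        return w

    def never_communicate(parts, w, alpha=1e-3, rounds=200):
        # each worker optimizes alone; average once at the end: w := (1/K) sum_k w_k
        finals = []
        for Ak, bk in parts:
            wk = w.copy()
            for _ in range(rounds):
                wk = wk - alpha * local_gradient(Ak, bk, wk)
            finals.append(wk)
        return sum(finals) / len(finals)

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((400, 10)), rng.standard_normal(400)
    parts = [(A[k::4], b[k::4]) for k in range(4)]   # K = 4 workers
    w0 = np.zeros(10)
    w_comm, w_avg = always_communicate(parts, w0), never_communicate(parts, w0)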

Mini-batch

reduce: $w \leftarrow w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees
✔ tunable communication
A natural middle ground.
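Continuing the same hypothetical least-squares setup, a sketch of the tunable middle ground: each round samples a batch $b$ and applies the averaged update $w \leftarrow w - \frac{\alpha}{|b|}\sum_{i\in b}\Delta w_i$. The batch size and step size below are illustrative assumptions.

    import numpy as np

    def minibatch_update(A, y, w, batch_size=32, alpha=0.5, rounds=300, seed=0):
        # each round: per-example updates for the sampled batch b are summed,
        # then the driver reduces w = w - (alpha/|b|) * sum_{i in b} delta_w_i
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        for _ in range(rounds):
            b = rng.choice(n, size=batch_size, replace=False)
            summed = A[b].T @ (A[b] @ w - y[b])   # sum of per-example gradients
            w = w - (alpha / batch_size) * summed
        return w

    rng = np.random.default_rng(1)
    A, y = rng.standard_normal((400, 10)), rng.standard_normal(400)
    w = minibatch_update(A, y, np.zeros(10))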

Mini-batch Limitations

1. ONE-OFF METHODS
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE

LARGE-SCALE OPTIMIZATION

CoCoA

ProxCoCoA+


Mini-batch Limitations

1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K << batch size

CoCoA-v1

1. Primal-Dual Framework

PRIMAL ≥ DUAL

Primal: $\min_{w \in \mathbb{R}^d} \; P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^\top x_i)$

Dual: $\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)$, where $A_i = \frac{1}{\lambda n} x_i$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per data point)
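As a concrete, hedged illustration of the duality-gap stopping criterion, here is the pair above specialized to squared loss, $\ell_i(z) = \tfrac{1}{2}(z - y_i)^2$ with conjugate $\ell_i^*(u) = \tfrac{1}{2}u^2 + u y_i$; the gap $P(w(\alpha)) - D(\alpha)$ upper-bounds the suboptimality and can be checked each round. This is my own example, not code from the talk.

    import numpy as np

    def primal(X, y, w, lam):
        return 0.5 * lam * (w @ w) + 0.5 * np.mean((X @ w - y) ** 2)

    def dual(X, y, alpha, lam):
        n = len(y)
        w = X.T @ alpha / (lam * n)              # w(alpha) = A alpha, with A_i = x_i / (lam * n)
        conj = 0.5 * alpha ** 2 - alpha * y      # l_i^*(-alpha_i) for squared loss
        return -0.5 * lam * (w @ w) - np.mean(conj)

    rng = np.random.default_rng(0)
    X, y, lam = rng.standard_normal((50, 5)), rng.standard_normal(50), 0.1
    alpha = np.zeros(50)
    w_of_alpha = X.T @ alpha / (lam * len(y))
    gap = primal(X, y, w_of_alpha, lam) - dual(X, y, alpha, lam)   # always >= 0; stop when small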

1. Primal-Dual Framework

Global objective: $\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)$, with $A_i = \frac{1}{\lambda n} x_i$

Local objective: $\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \; -\frac{1}{n}\sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) - \frac{1}{n} w^\top A \Delta\alpha_{[k]} - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n} A \Delta\alpha_{[k]}\Big\|^2$

Can solve the local objective using any internal optimization method.
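A rough sketch of the resulting outer loop, again for the squared-loss special case so the local solver can be a few closed-form dual coordinate steps. The choice of local solver, the number of local passes, and the conservative 1/K averaging are my assumptions; the framework allows any internal method here.

    import numpy as np

    def local_sdca_pass(Xk, yk, alpha_k, w, lam, n, passes=5, seed=0):
        # approximately solve the local dual subproblem on partition P_k
        rng = np.random.default_rng(seed)
        w_loc, dalpha = w.copy(), np.zeros_like(alpha_k)
        for _ in range(passes):
            for i in rng.permutation(len(yk)):
                xi = Xk[i]
                # closed-form maximization along coordinate i (squared loss)
                delta = (yk[i] - xi @ w_loc - (alpha_k[i] + dalpha[i])) / (1.0 + xi @ xi / (lam * n))
                dalpha[i] += delta
                w_loc += delta * xi / (lam * n)   # apply the update immediately ("fresh")
        return dalpha, w_loc - w                  # (delta_alpha_[k], delta_w_k)

    def cocoa(parts, lam, n, d, rounds=20):
        w = np.zeros(d)
        alphas = [np.zeros(len(yk)) for _, yk in parts]
        K = len(parts)
        for _ in range(rounds):
            results = [local_sdca_pass(Xk, yk, alphas[k], w, lam, n, seed=k)
                       for k, (Xk, yk) in enumerate(parts)]
            for k, (dalpha, _) in enumerate(results):
                alphas[k] += dalpha / K            # average dual updates too, keeping w = A*alpha
            w += sum(dw for _, dw in results) / K  # CoCoA-v1 style averaging over K
        return w

    rng = np.random.default_rng(2)
    X, y, lam = rng.standard_normal((400, 10)), rng.standard_normal(400), 0.1
    parts = [(X[k::4], y[k::4]) for k in range(4)]
    w = cocoa(parts, lam, n=400, d=10)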

2. Immediately Apply Updates

STALE:
    for i in b:
        Δw ← Δw − α ∇_i P(w)
    end
    w ← w + Δw

FRESH:
    for i in b:
        Δw ← −α ∇_i P(w)
        w ← w + Δw
    end
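The same contrast as runnable code, under the assumption of per-example updates on a smooth objective; grad_i is a hypothetical helper returning the stochastic gradient of P at w for example i.

    import numpy as np

    def stale_pass(w, batch, grad_i, alpha):
        # delta_w accumulates -alpha * grad_i(w) with w held fixed (stale)
        dw = np.zeros_like(w)
        for i in batch:
            dw -= alpha * grad_i(w, i)
        return w + dw

    def fresh_pass(w, batch, grad_i, alpha):
        # each per-example update is applied immediately (fresh)
        w = w.copy()
        for i in batch:
            w -= alpha * grad_i(w, i)
        return w

    # example: per-example least-squares gradient, grad_i(w) = (x_i^T w - y_i) x_i
    rng = np.random.default_rng(3)
    X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
    g = lambda w, i: (X[i] @ w - y[i]) * X[i]
    w0 = np.zeros(5)
    w_stale, w_fresh = stale_pass(w0, range(100), g, 0.05), fresh_pass(w0, range(100), g, 0.05)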

3. Average over K

reduce: $w \leftarrow w + \frac{1}{K} \sum_k \Delta w_k$

CoCoA-v1 Limitations

Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith, et al., ICML '15]

L1-regularized objectives not covered in the initial framework

LARGE-SCALE OPTIMIZATION

CoCoA

ProxCoCoA+


L1 Regularization

Encourages sparse solutions
Includes popular models: lasso regression, sparse logistic regression, elastic net-regularized problems
Beneficial to distribute by feature
Can we map this to the CoCoA setup?
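For reference, these are the standard textbook formulations of the listed models (one common parameterization of the elastic net), not equations taken from the slides:

    \min_{w \in \mathbb{R}^d} \; \tfrac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1 \quad \text{(lasso)}

    \min_{w \in \mathbb{R}^d} \; \tfrac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right) + \lambda \|w\|_1 \quad \text{(sparse logistic regression)}

    \min_{w \in \mathbb{R}^d} \; \tfrac{1}{2n}\|Xw - y\|_2^2 + \lambda\!\left(\eta\|w\|_1 + \tfrac{1-\eta}{2}\|w\|_2^2\right) \quad \text{(elastic net)}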

Solution: Solve Primal Directly

PRIMAL ≥ DUAL

Primal: $\min_{\alpha \in \mathbb{R}^n} \; f(A\alpha) + \sum_{i=1}^{n} g_i(\alpha_i)$

Dual: $\min_{w \in \mathbb{R}^d} \; f^*(w) + \sum_{i=1}^{n} g_i^*(-x_i^\top w)$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)
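A rough sketch (my own, not the ProxCoCoA+ code) of what solving the primal directly buys for the lasso, where $f(A\alpha) = \tfrac{1}{2}\|A\alpha - b\|^2$ and $g_i(\alpha_i) = \lambda|\alpha_i|$: the columns of A are partitioned across workers, each worker runs proximal coordinate descent on its own features, and only the change in the shared vector $v = A\alpha$ is communicated. The conservative 1/K averaging is an assumption made for safety.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def local_cd(A_cols, alpha_k, v, b, lam, passes=3):
        # proximal coordinate descent over this worker's feature block
        v_loc, dalpha = v.copy(), np.zeros_like(alpha_k)
        for _ in range(passes):
            for j in range(A_cols.shape[1]):
                xj = A_cols[:, j]
                aj_old = alpha_k[j] + dalpha[j]
                grad = xj @ (v_loc - b)
                aj_new = soft_threshold(aj_old - grad / (xj @ xj), lam / (xj @ xj))
                dalpha[j] += aj_new - aj_old
                v_loc += (aj_new - aj_old) * xj    # keep v = A*alpha fresh locally
        return dalpha, v_loc - v                    # (delta_alpha_[k], delta_v_k)

    def prox_cocoa_sketch(A, b, lam, K=4, rounds=30):
        n, d = A.shape
        blocks = np.array_split(np.arange(d), K)    # partition features across workers
        alpha, v = np.zeros(d), np.zeros(n)
        for _ in range(rounds):
            results = [local_cd(A[:, blk], alpha[blk], v, b, lam) for blk in blocks]
            for blk, (dalpha, _) in zip(blocks, results):
                alpha[blk] += dalpha / K            # conservative averaging over K
            v += sum(dv for _, dv in results) / K
        return alpha

    rng = np.random.default_rng(4)
    A, b = rng.standard_normal((200, 40)), rng.standard_normal(200)
    alpha = prox_cocoa_sketch(A, b, lam=0.1)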

50x speedup

CoCoA: A Framework for Distributed Optimization

flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models

1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]

CoCoA Framework

Impact & Adoption

gingsmith.github.io/cocoa/
Numerous talks & demos (ICML)
Open-source code & documentation
Industry & academic adoption

Thanks!

cs.berkeley.edu/~vsmith

github.com/gingsmith/proxcocoa

github.com/gingsmith/cocoa

Empirical Results: Datasets

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
url        2M             3M             4e-3%      4
kddb       19M            29M            1e-4%      4
epsilon    400K           2K             100%       8
webspam    350K           16M            0.02%      16

A First Approach: CoCoA+ with Smoothing

Issue: CoCoA+ requires strongly-convex regularizers
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465

✗ CoCoA+ with smoothing doesn't work
Additionally, CoCoA+ distributes by data point, not by feature

Better Solution: ProxCoCoA+

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465
ProxCoCoA+       0.6030

Convergence

Assumption (Local Θ-Approximation). For $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution $\Delta\alpha_{[k]}$ satisfying
$\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \;\le\; \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$

Theorem 1. Let $g_i$ have $L$-bounded support. Then an $\epsilon$-accurate solution is reached after
$T \;\ge\; O\!\Big(\frac{1}{1-\Theta}\Big(\frac{8L^2 n^2}{\tau\epsilon} + c\Big)\Big)$ rounds.

Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then an $\epsilon$-accurate solution is reached after
$T \;\ge\; \frac{1}{1-\Theta}\,\frac{\mu\tau + n}{\mu\tau}\,\log\frac{n}{\epsilon}$ rounds.