Scaling up Machine Learning
Transcript
Page 1: Scaling up Machine Learning

Scaling up Machine Learning

Alex Smola
Yahoo! Research, Santa Clara
alex.smola.org

Page 2: Scaling up Machine Learning

Thanks

Amr Ahmed, Joey Gonzalez, Yucheng Low, Qirong Ho, Ziad al Bawab, Sergiy Matyusevich, Shravan Narayanamurthy, Kilian Weinberger, John Langford, Vanja Josifovski, Quoc Le, Choon Hui Teo, Eric Xing, James Petterson, Jake Eisenstein, Shuang Hong Yang, Vishy Vishwanathan, Zhaohui Zheng, Markus Weimer, Alexandros Karatzoglou, Martin Zinkevich

Page 3: Scaling up Machine Learning

Why


Page 4: Scaling up Machine Learning

Data (>10B useful webpages)

• Webpages (content, graph)
• Clicks (ad, page, social)
• Users (OpenID, FB Connect)
• E-mails (Hotmail, Y!Mail, Gmail)
• Photos, movies (Flickr, YouTube, Vimeo, ...)
• Cookies / tracking info (see Ghostery)
• Installed apps (Android Market etc.)
• Location (Latitude, Loopt, Foursquare)
• User-generated content (Wikipedia & co.)
• Ads (display, text, DoubleClick, Yahoo)
• Comments (Disqus, Facebook)
• Reviews (Yelp, Y!Local)
• Third-party features (e.g. Experian)
• Social connections (LinkedIn, Facebook)
• Purchase decisions (Netflix, Amazon)
• Instant messages (YIM, Skype, Gtalk)
• Search terms (Google, Bing)
• Timestamps (everything)
• News articles (BBC, NYTimes, Y!News)
• Blog posts (Tumblr, Wordpress)
• Microblogs (Twitter, Jaiku, Meme)

Page 5: Scaling up Machine Learning

Data - Identity & Graph (same data list as above, highlighting the graph side): 100M-1B vertices

Page 6: Scaling up Machine Learning

Data - User generated content (same data list as above): >1B images, 40h of video uploaded per minute

Page 7: Scaling up Machine Learning

Data - Messages (same data list as above): >1B texts

Page 8: Scaling up Machine Learning

Data - User Tracking (same data list as above): >1B 'identities'

Page 9: Scaling up Machine Learning

Personalization

• 100-1000M users
  • Spam filtering
  • Personalized targeting & collaborative filtering
  • News recommendation
  • Advertising
• Large parameter space (25 parameters per user ≈ 100GB)
• Distributed storage (needed on every server)
• Distributed optimization
• Model synchronization

Page 10: Scaling up Machine Learning

(Implicit) labels:
• Ads
• Click feedback
• Emails
• Tags
• Editorial data is very expensive! Do not use!

No labels:
• Graphs
• Document collections
• Email/IM/Discussions
• Query stream

Page 11: Scaling up Machine Learning

Hardware

• Mostly commodity hardware
• Server
  • Multicore
  • Soft NUMA (e.g. 2-4 socket Xeons)
  • Plenty of disks
• Racks
  • Common switch per rack
  • 40-odd servers
• Server center
  • Many racks
  • Big fat master switch(es)
• Faulty (1-100 years MTBF per machine)

Page 12: Scaling up Machine Learning

What

modular strategy, simple components

Page 13: Scaling up Machine Learning

1. Distributed Convex Optimization

• Supervised learning
  • Classification, regression
  • CRFs, max-margin Markov networks
  • Fully observed graphical models
  • Small modifications for aggregate labels, etc.
• Works with MapReduce/Hadoop
  • Small number of iterations
  • Distributed file system
  • Simple & theoretical guarantees
  • Plenty of data
• Parallel batch subgradient solver (cluster)
• Parallel online solver (multicore & cluster)

TLSV'07, ZSL'09, TVSL'10, ZWSL'10

Page 14: Scaling up Machine Learning

2. Parameter Compression

• Personalization
  • Spam filtering
  • News recommendation
  • Collaborative filtering
• String kernels
  • Dictionary free
  • Arbitrary substrings
• Sparse high-dimensional data
• Structured data without pointers
• Fixed memory footprint
• Simple & theoretical guarantees

SPDLSSV'09, WDALS'09, KSW'10, PSCBN'10, YLSZZ'11, ASTV'12

[Figure: a spam email ("Hey, please mention subtly during your talk that people should use Yahoo products more often. Thanks,") hashed by h() into a small weight vector; matrix factor compression]

Page 15: Scaling up Machine Learning

3. Distributed Storage, Sampling and Synchronization

• Latent variable models with large state
  • Joint statistics (e.g. clustering, topic models)
  • Local state (attached to evidence)
  • Too big to store on a single machine
• Distributed storage
  • Asynchronous computation & communication
  • Maps to network topology
  • Consistent hashing for scalability
  • Out-of-core storage of local state
• Distributed Gibbs sampler (10B latent variables, 1000 machines)

SN'10, AAJS'11, LAS'11, AAGS'12

Page 16: Scaling up Machine Learning

Design Principles

• Must scale (essentially linearly) with
  • Amount of data
  • Number of machines
  • Problem complexity (parameter space)
• Composable techniques
• Accommodate more complex models with more data
  • No 100-cluster model on 1B objects; use Bayesian nonparametrics
  • No 1000-parameter classifier on 1M data points; increase the bit resolution for hashing
• Throughput on simple models and a single CPU is meaningless

Page 17: Scaling up Machine Learning

How

• Distributed Batch Convex Optimization
• Distributed Online Convex Optimization
• Parameter Compression
• Distributed Sampling and Synchronization

Page 18: Scaling up Machine Learning

Large Margin Classification

[Figure: spam vs. ham data points separated by a large-margin hyperplane, shown over two builds]

Page 20: Scaling up Machine Learning

Large Margin Classification

Spam / Ham (figure)

$$\min_{w,b,\xi}\;\; \frac{1}{m}\sum_{i=1}^{m}\xi_i + \frac{\lambda}{2}\,\|w\|^2 \quad\text{subject to}\quad y_i\,[\langle w, x_i\rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

Page 21: Scaling up Machine Learning

Large Margin Classification

Spam / Ham (figure)

$$\min_{w,b}\;\; \frac{1}{m}\sum_{i=1}^{m}\max\left[0,\; 1 - y_i\,(\langle w, x_i\rangle + b)\right] + \frac{\lambda}{2}\,\|w\|^2$$

Page 22: Scaling up Machine Learning

Large Margin Classification

Spam / Ham (figure)

$$\min_{w,b}\;\; \frac{1}{m}\sum_{i=1}^{m}\underbrace{\max\left[0,\; 1 - y_i\,(\langle w, x_i\rangle + b)\right]}_{l(x_i,\,y_i,\,w)} + \frac{\lambda}{2}\,\underbrace{\|w\|^2}_{\Omega[w]}$$

Page 23: Scaling up Machine Learning

Regularized Risk Functional

SVM, regression, sequence annotation, ranking and recommendation, image annotation, gene finding, face detection, density estimation, novelty detection

$$\min_w\;\; \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, w) + \lambda\,\Omega[w]$$

The loss is decomposable over the data; the regularizer is relatively simple: quadratic penalty (l2), sparsity penalty (l1), hyperkernels, group lasso.

Page 24: Scaling up Machine Learning

Regularized Risk Functional

$$\min_w\;\; \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, w) + \lambda\,\Omega[w]$$

Each worker holds one shard S of the data and computes the aggregate loss and subgradients

$$\left[\sum_{i\in S} l(x_i, y_i, w)\right], \qquad \left[\sum_{i\in S} \partial_w\, l(x_i, y_i, w)\right]$$

Page 25: Scaling up Machine Learning

Regularized Risk Functional

$$\min_w\;\; \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, w) + \lambda\,\Omega[w]$$

The master collects the aggregates from all data shards and solves the master problem.

Page 26: Scaling up Machine Learning

Regularized Risk Functional

$$\min_w\;\; \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, w) + \lambda\,\Omega[w]$$

The updated parameter w is broadcast back to the data shards, and the cycle repeats.
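
The three builds above describe one MapReduce-style iteration: every worker computes the aggregate loss and subgradient over its shard, the master combines them and takes a step, and the new w is broadcast back. A minimal single-process sketch, assuming a hinge loss and an l2 regularizer (the shard loop stands in for the map phase over cluster workers):

```python
import numpy as np

def shard_loss_grad(w, X, y):
    """Map step: aggregate hinge loss and subgradient over one data shard."""
    margins = 1 - y * (X @ w)
    active = margins > 0                       # points violating the margin
    loss = margins[active].sum()
    grad = -(y[active, None] * X[active]).sum(axis=0)
    return loss, grad

def distributed_step(w, shards, lam, eta):
    """Reduce step: sum the per-shard (loss, grad) pairs, then update w."""
    m = sum(len(y) for _, y in shards)
    loss, grad = 0.0, np.zeros_like(w)
    for X, y in shards:                        # on a cluster: one map task each
        l, g = shard_loss_grad(w, X, y)
        loss, grad = loss + l, grad + g
    loss = loss / m + 0.5 * lam * (w @ w)
    grad = grad / m + lam * w
    return w - eta * grad, loss

# toy run: random data split across 4 "workers"
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = np.sign(X @ rng.normal(size=10))
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(10)
for t in range(100):
    w, loss = distributed_step(w, shards, lam=0.01, eta=0.1)
```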

Page 27: Scaling up Machine Learning

Bundle Method Solver

[Figure: the empirical risk lower-bounded by linear cuts, plus the regularizer Ω[w]]

$$\min_w\; \max_i\,\left[\langle g_i, w\rangle + b_i\right] + \frac{\lambda}{2}\,\Omega[w]$$


Page 31: Scaling up Machine Learning

Bundle Method Solver

• Start at w_0
• Compute the first-order Taylor approximation (g_i, b_i) of the empirical risk
• Solve the optimization problem
• Repeat

$$\min_w\; \max_i\,\left[\langle g_i, w\rangle + b_i\right] + \frac{\lambda}{2}\,\Omega[w]$$
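
A minimal sketch of that loop, assuming Ω[w] = ||w||² so the master problem can be solved through its simplex-constrained dual (the SLSQP call and the stopping rule are illustrative choices, not the reference implementation):

```python
import numpy as np
from scipy.optimize import minimize

def bundle_solver(risk_and_grad, dim, lam, eps=1e-3, max_iter=50):
    """Cutting-plane method for min_w R_emp(w) + lam/2 ||w||^2.
    risk_and_grad(w) returns the empirical risk and one subgradient;
    each iteration adds the cut R_emp(w) >= <g_t, w> + b_t."""
    w, G, B = np.zeros(dim), [], []
    for t in range(max_iter):
        r, g = risk_and_grad(w)
        upper = r + 0.5 * lam * (w @ w)        # certificate: upper bound
        G.append(g); B.append(r - g @ w)
        Gm, Bv = np.array(G), np.array(B)
        # dual of the master problem: a QP over the simplex
        def neg_dual(a):
            v = Gm.T @ a
            return v @ v / (2 * lam) - Bv @ a
        res = minimize(neg_dual, np.ones(len(B)) / len(B), method="SLSQP",
                       bounds=[(0, 1)] * len(B),
                       constraints={"type": "eq", "fun": lambda a: a.sum() - 1})
        w = -Gm.T @ res.x / lam                # primal solution from the dual
        lower = -res.fun                       # certificate: lower bound
        if upper - lower < eps:                # stop once the gap is small
            break
    return w
```

The per-iteration upper and lower bounds are exactly the risk certificates described on the next slide.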

Page 32: Scaling up Machine Learning

Bundle Method Solver

$$\min_w\; \max_i\,\left[\langle g_i, w\rangle + b_i\right] + \frac{\lambda}{2}\,\Omega[w]$$

• Empirical risk certificates (at each iteration)
  • Upper bound on the risk via the first-order Taylor approximation
  • Lower bound on the risk after solving the optimization problem
• Convergence guarantees (worst case), in terms of a loss bound L, gradient bound G, and Hessian bound H
• Generic iteration bound:

$$\log\frac{\lambda L}{G^2} + \frac{8G^2}{\lambda\epsilon}$$

• For bounded Hessian:

$$\log\frac{\lambda L}{G^2} + \frac{4}{\lambda}\left[1 + H\,\log\frac{2}{\epsilon}\right]$$

Page 33: Scaling up Machine Learning

Bundle Method Solver


Page 34: Scaling up Machine Learning

Bundle Method Solver

• Alternatives
  • Use BFGS in the outer loop
  • Gradient with line search
  • Dual subgradient (Boyd et al.): theoretically elegant, but slow convergence due to dual gradient descent
  • FISTA (better for the l1 sparsity penalty)
• Problems with batch solvers
  • Requires ~50 passes through the dataset
  • Requires a smooth regularizer for fast convergence

Page 35: Scaling up Machine Learning

How

• Distributed Batch Convex Optimization
• Distributed Online Convex Optimization
• Parameter Compression
• Distributed Sampling and Synchronization

Page 36: Scaling up Machine Learning

Multicore


Page 37: Scaling up Machine Learning

Online Learning

• General template
  • Get an instance
  • Compute the instantaneous gradient
  • Update the parameter vector
• Problems
  • Sequential execution (single core)
  • CPU core speed is no longer increasing
  • Disk/network bandwidth: ~300GB/h
  • Does not scale to TBs of data

Page 38: Scaling up Machine Learning

Parallel Online Templates

• Data parallel
• Parameter parallel

[Figure: data parallel: a data source fans instances out to loss/gradient workers whose results meet in an updater; parameter parallel: each worker owns part n of the parameters for the same instance x]

Page 39: Scaling up Machine Learning

Delayed Updates

• Data parallel
  • n processors compute gradients
  • the delay between gradient computation and application is n - 1
• Parameter parallel
  • delay between the partial computation and the feedback from the joint loss
  • the delay is logarithmic in the number of processors

Page 40: Scaling up Machine Learning

• Optimization problem

$$\min_w\; \sum_i f_i(w)$$

• Algorithm

Input: scalar σ > 0 and delay τ
for t = τ + 1 to T + τ do
  Obtain f_t and incur loss f_t(w_t)
  Compute g_t := ∇f_t(w_t) and set η_t = 1/(σ(t − τ))
  Update w_{t+1} = w_t − η_t g_{t−τ}
end for
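
A toy sketch of this update rule (grad_fn and the least-squares losses are made up for illustration; tau = 0 recovers the serial online template):

```python
import numpy as np
from collections import deque

def delayed_sgd(grad_fn, dim, T, tau, sigma):
    """At step t, apply the gradient computed tau steps earlier,
    with step size eta_t = 1 / (sigma * (t - tau))."""
    w = np.zeros(dim)
    pending = deque()                    # gradients in flight, oldest first
    for t in range(1, T + 1):
        pending.append(grad_fn(w, t))    # compute g_t at the current w_t
        if len(pending) > tau:           # tau steps later, apply g_{t-tau}
            eta = 1.0 / (sigma * (t - tau))
            w -= eta * pending.popleft()
    return w

# toy losses f_t(w) = 0.5 * (x_t . w - x_t . w_true)^2
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
def grad_fn(w, t):
    x = rng.normal(size=5)
    return (x @ w - x @ w_true) * x

w = delayed_sgd(grad_fn, dim=5, T=5000, tau=10, sigma=0.1)
```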

Page 41: Scaling up Machine Learning

Adversarial Guarantees

• Linear function classes: the algorithm converges no worse than with serial execution, and the bound is tight up to a factor of 4:

$$\mathbb{E}[f_i(w)] \le 4RL\,\sqrt{\tau T}$$

• Strong convexity: each loss function is strongly convex with modulus λ; the constant offset depends on the degree of parallelism:

$$R[X] \le \lambda\tau R^2 + \left[\tfrac{1}{2} + \tau\right]\frac{L^2}{\lambda}\,\left(1 + \tau + \log T\right)$$

• The bounds are tight: an adversary can send the same instance τ times.

Page 42: Scaling up Machine Learning

Nonadversarial Guarantees

• Lipschitz continuous loss gradients: the rate no longer depends on the amount of parallelism:

$$\mathbb{E}\,[R[X]] \le \left[28.3\,R^2H + \tfrac{2}{3}\,RL + \tfrac{4}{3}\,R^2H\,\log T\right]\tau^2 + \tfrac{8}{3}\,RL\,\sqrt{T}$$

• Strong convexity and Lipschitz gradients: this only works when the objective function is very close to a parabola (upper and lower bounds):

$$\mathbb{E}\,[R[X]] \le O(\tau^2 + \log T)$$

Page 43: Scaling up Machine Learning

Convergence on TREC

[Figure: performance on TREC data: log2 error vs. thousands of iterations (0-100), for no delay and delays of 10, 100, and 1000]

Page 44: Scaling up Machine Learning

Convergence on Y! Data

[Figure: performance on real (Yahoo) data: log2 error vs. thousands of iterations (0-100), for no delay and delays of 10, 100, and 1000]

Page 45: Scaling up Machine Learning

Speedup on TREC

[Figure: percent speedup vs. number of threads (1-7) on TREC data]

Page 46: Scaling up Machine Learning

Cluster


Page 47: Scaling up Machine Learning

MapReduce variant

• Idiot-proof, simple algorithm
  • Perform stochastic gradient descent on each computer for a random subset of the data (drawn with replacement)
  • Average the parameters
• Benefits
  • No communication during optimization
  • Single-pass MapReduce
  • Latency is not a problem
  • Fault tolerant (we oversample anyway)
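
A single-machine sketch of this variant, with a loop standing in for the map phase (the hinge loss and the constant step size are illustrative choices):

```python
import numpy as np

def sgd_pass(X, y, lam, eta, seed):
    """Plain SGD for the l2-regularized hinge loss on one machine's sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.permutation(len(y)):
        margin = y[i] * (X[i] @ w)
        g = lam * w - (y[i] * X[i] if margin < 1 else 0)
        w -= eta * g
    return w

def parallel_sgd(X, y, k, lam=0.01, eta=0.1):
    """Map: each of k 'machines' runs SGD on a random subset drawn with
    replacement; reduce: average the resulting parameter vectors."""
    rng = np.random.default_rng(42)
    ws = []
    for m in range(k):
        idx = rng.integers(0, len(y), size=len(y) // k)   # with replacement
        ws.append(sgd_pass(X[idx], y[idx], lam, eta, seed=m))
    return np.mean(ws, axis=0)
```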

Page 48: Scaling up Machine Learning

Guarantees

• Requirements
  • Strongly convex loss
  • Lipschitz continuous gradient
• Theorem
  • Not dependent on the sample size
  • Regularization limits parallelization

$$\mathbb{E}_{w\sim D^{\eta}_{T,k}}[c(w)] - \min_w c(w) \le \frac{8\eta G^2}{\sqrt{k\lambda}}\,\sqrt{\|\partial c\|_L} + \frac{8\eta G^2\,\|\partial c\|_L}{k\lambda} + 2\eta G^2$$

• For the runtime

$$T = \frac{\ln k - \ln\eta - \ln\lambda}{2\eta\lambda}$$

Page 49: Scaling up Machine Learning

How

• Distributed Batch Convex Optimization
• Distributed Online Convex Optimization
• Parameter Compression
• Distributed Sampling and Synchronization

Page 50: Scaling up Machine Learning

Spam Classification

[Figure: one classifier per user]

Page 51: Scaling up Machine Learning

Spam Classification

[Figure: per-user classifiers; the same email ("donut?") may be labeled 1: spam!, 0: not-spam!, or 0: quality depending on the user, and users may be malicious, educated, misinformed, confused, or silent]

Page 53: Scaling up Machine Learning

Multitask Learning

[Figure: one classifier per user group (malicious, educated, misinformed, confused, silent) plus a global classifier]

Page 54: Scaling up Machine Learning

Collaborative Classification

• Primal representation

$$f(x, u) = \langle\phi(x), w\rangle + \langle\phi(x), w_u\rangle = \langle\phi(x)\otimes(1\oplus e_u),\, w\rangle$$

• Kernel representation: the multitask kernel (e.g. Pontil & Micchelli, Daumé), which usually does not scale well

$$k((x, u), (x', u')) = k(x, x')\,\left[1 + \delta_{u,u'}\right]$$

• Problem: the dimensionality is 10^13, i.e. 40TB of space

[Figure, over the next two builds: the email vector interacts with both w and w_user; equivalently, email ⊗ (1 + e_user) interacts with the stacked w + w_user]


Page 57: Scaling up Machine Learning

Hash Kernels


Page 58: Scaling up Machine Learning

Hash Kernels

[Figure, over two builds: an email ("Hey, please mention subtly during your talk that people should use Yahoo products more often. Thanks, Someone important") is tokenized against a dictionary into a sparse instance vector, together with a task/user id (=barney); a hash function h() maps both into a small sparse vector in R^m]

Page 60: Scaling up Machine Learning

Hash Kernels

[Figure: the instance ("Hey, please mention subtly during your talk that people should use Yahoo search more often. Thanks,") and the task/user (=barney) are hashed by h(): the token 'mention' lands in slot h('mention') with sign s(m), and its personalized copy in h('mention_barney') with sign s(m_b), signs drawn from {-1, 1}]

$$\phi(x_i) \in \mathbb{R}^{N(U+1)}$$

Similar to the count hash (Charikar, Chen, Farach-Colton, 2003)

Page 61: Scaling up Machine Learning

Approximate Orthogonality

[Figure: φ() maps into R^large, h() maps into R^small; the hashed subspaces are approximately orthogonal, so we can do multi-task learning]

Page 62: Scaling up Machine Learning

Guarantees

• For a random hash function the inner product vanishes with high probability:

$$\Pr\left\{\,|\langle w_v, h_u(x)\rangle| > \epsilon\,\right\} \le 2e^{-C\epsilon^2 m}$$

• We can use this for multitask learning: a direct sum in Hilbert space becomes a plain sum in hash space
• The hashed inner product is unbiased (proof: take the expectation over the random signs)
• The variance is O(1/n) (proof: brute-force expansion)
• Preserves sparsity
• No dictionary needed
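
A sketch of the hashed multitask feature map described above (md5/sha1 stand in for a fast hash such as murmur; the function names and token scheme are illustrative):

```python
import hashlib
import numpy as np

def h(token, m):
    """Bucket hash into {0, ..., m-1}."""
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % m

def s(token):
    """Sign hash into {-1, +1}; this is what makes the inner product unbiased."""
    return 1 - 2 * (int(hashlib.sha1(token.encode()).hexdigest(), 16) % 2)

def hashed_features(tokens, user, m):
    """Each token is inserted once globally and once keyed by the user, so the
    global w and the personalized w_user share one m-dimensional hashed space."""
    x = np.zeros(m)
    for tok in tokens:
        for key in (tok, tok + "_" + user):   # global copy + personalized copy
            x[h(key, m)] += s(key)
    return x

x = hashed_features("please mention yahoo products".split(), "barney", m=2**22)
```

No dictionary is ever built, and sparsity is preserved: only the slots hit by the email's tokens are touched.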

Page 63: Scaling up Machine Learning

Spam classification results

[Figure: spam misclassification rate relative to baseline vs. number of bits in the hash table (18, 20, 22, 24, 26), for global-hashed, personalized, and baseline classifiers; N=20M, U=400K]

Page 64: Scaling up Machine Learning

Lazy users ...

[Figure: histogram of labeled emails per user: number of users (log scale, 1 to 1,000,000) vs. number of labels (0 to ~523); most users label very few emails]

Page 65: Scaling up Machine Learning

Results by user group


[Figure: spam misclassification rate relative to baseline vs. bits in the hash table (18-26), broken down by labeled emails per user: (0), (1), (2-3), (4-7), (8-15), (16-31), (32-64), (64+), plus the baseline]

Page 68: Scaling up Machine Learning

Matrices


Page 69: Scaling up Machine Learning

Collaborative Filtering

• The Netflix / Amazon / del.icio.us problem
  • Many users, many products
  • Recommend products / news / friends
• Matrix factorization
  • A latent factor for each user and each movie
  • Compatibility via the factorization model

$$X \approx U^\top V \quad\text{hence}\quad X_{ij} \approx u_i^\top v_j$$

• Optimization via stochastic gradient descent
• The loss function depends on the problem (regression, preference, ranking, quantile, novelty)
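
A minimal sketch of fitting the factorization by stochastic gradient descent, here with the squared loss (the other objectives swap in a different gradient):

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=20, lam=0.05, eta=0.01, epochs=10):
    """Fit X_ij ~ <u_i, v_j> by SGD over observed (user, item, rating) triples."""
    rng = np.random.default_rng(0)
    U = 0.1 * rng.normal(size=(n_users, k))
    V = 0.1 * rng.normal(size=(n_items, k))
    for _ in range(epochs):
        for i, j, r in ratings:
            err = U[i] @ V[j] - r              # prediction error on one entry
            U[i] -= eta * (err * V[j] + lam * U[i])
            V[j] -= eta * (err * U[i] + lam * V[j])
    return U, V
```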

Page 70: Scaling up Machine Learning

Collaborative Filtering

• Big problem
  • We have millions of users and millions of products
  • Storage: for 100 factors this is 800TB of variables
  • We want a model that fits in RAM (<16GB)
• Hashing compression:

$$u_i = \sum_{j,k:\,h(j,k)=i} \sigma(j,k)\,U_{jk} \quad\text{and}\quad v_i = \sum_{j,k:\,h'(j,k)=i} \sigma'(j,k)\,V_{jk}$$

$$X_{ij} := \sum_k \sigma(k,i)\,\sigma'(k,j)\,u_{h(k,i)}\,v_{h'(k,j)}$$
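
A toy sketch of reading a factor back out of the hashed storage (Python's builtin hash stands in for a proper seeded hash; training scatters the SGD updates through the same indices):

```python
import numpy as np

m = 2**20                      # fixed memory footprint, shared by all factors
w_u = np.zeros(m)              # hashed storage for every u_i
w_v = np.zeros(m)              # hashed storage for every v_j

def h(c, i, salt):             # toy position hash h(c, i)
    return hash((salt, c, i)) % m

def s(c, i, salt):             # toy sign hash sigma(c, i) in {-1, +1}
    return 1 - 2 * (hash((salt, "s", c, i)) & 1)

def factor(i, k, w, salt):
    """u_i reconstructed componentwise: u_i[c] = sigma(c, i) * w[h(c, i)]."""
    return np.array([s(c, i, salt) * w[h(c, i, salt)] for c in range(k)])

def predict(i, j, k=100):
    """X_ij = sum_c sigma(c,i) sigma'(c,j) w_u[h(c,i)] w_v[h'(c,j)]."""
    return factor(i, k, w_u, "u") @ factor(j, k, w_v, "v")
```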

Page 71: Scaling up Machine Learning

Examples

[Figure: RMSE as the hashed parameter storage shrinks: Eachmovie (thousands of elements in U vs. thousands of elements in M, RMSE 1.20-1.32) and MovieLens (rows in U vs. rows in M, RMSE 1.02-1.16)]

Page 72: Scaling up Machine Learning

Beyond

• String kernels
  • Hash substrings
  • Insert wildcards for approximate matching
• Data structures
  • Ontologies (hash class labels)
  • Hierarchical factorization (hash context)
• Feistel hash to reduce the cache-miss penalty
• Better approximation guarantees in terms of risk
• Hashing does not satisfy the RIP property (it even breaks the Candès and Plan conditions)
• Dense function spaces (even Random Kitchen Sinks are too expensive)

Page 74: Scaling up Machine Learning

How

• Distributed Batch Convex Optimization
• Distributed Online Convex Optimization
• Parameter Compression
• Distributed Sampling and Synchronization

Page 75: Scaling up Machine Learning

Latent Variable Models

• We don't observe everything
  • Poor engineering
  • Too intrusive
  • Too expensive
  • Machine failure
  • No editors
  • Forgot to measure it
  • Impossible to observe directly

Page 76: Scaling up Machine Learning

Latent Variable Models

• We don't observe everything (as above)
• Local
  • Lots of evidence (data)
  • Lots of local state (parameters)
• Global
  • Large state (too large for a single machine)
  • Depends on local state
  • Partitioning is difficult (e.g. natural graphs)

Page 77: Scaling up Machine Learning

Latent Variable Models

[Figure: mixture-of-Gaussians clustering: data and cluster IDs (local); means, variances, and cluster weights (global)]

Page 78: Scaling up Machine Learning

Latent Variable Models

[Figure, over two builds: data and local state per machine with a shared global state; the same pattern covers vanilla LDA and user profiling]

Page 80: Scaling up Machine Learning

User profiling

[Figure: topic proportion per day (days 0-40) for two users, one trending over Baseball, Finance, Jobs, and Dating, the other over Baseball, Dating, Celebrity, and Health]

Representative topic words:
• Dating: women, men, dating, singles, personals, seeking, match
• Baseball: league, baseball, basketball, doubleheader, Bergesen, Griffey, bullpen, Greinke
• Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood
• Health: skin, body, fingers, cells, toes, wrinkle, layers
• Jobs: job, career, business, assistant, hiring, part-time, receptionist
• Finance: financial, Thomson, chart, real, stock, trading, currency

Page 82: Scaling up Machine Learning

User profiling

[Figure: the same per-day topic proportion plots]

• 500 million users
• 100+ topics
• Full activity logs
• 1000 machines

Page 84: Scaling up Machine Learning

Synchronization


Page 85: Scaling up Machine Learning

Variable Caching

[Figure, over two builds: each machine holds its data and local state plus a cached copy of the global state]

Page 87: Scaling up Machine Learning

Variable Caching

[Figure: global replicas arranged per rack and per cluster]

Page 88: Scaling up Machine Learning

Message Passing

• The child performs updates (sampling, variational)
• Synchronization
  • Start with common state
  • The child stores old and new state
  • The parent keeps the global state
  • Bandwidth limited
• Works for any abelian group (sum, log-sum, cyclic group)

Local to global:
δ ← x − x_old
x_old ← x
x_global ← x_global + δ

Global to local:
x ← x + (x_global − x_old)
x_old ← x_global
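
A minimal sketch of the two phases for an additive state (the class layout is illustrative; any abelian-group operation could replace the sums):

```python
class Child:
    """Holds a local working copy x and the snapshot x_old from the last sync."""
    def __init__(self, dim):
        self.x = [0.0] * dim
        self.x_old = list(self.x)

    def sync(self, parent):
        # local to global: push only the delta accumulated since the last sync
        delta = [a - b for a, b in zip(self.x, self.x_old)]
        self.x_old = list(self.x)              # x_old <- x
        parent.apply(delta)                    # x_global <- x_global + delta
        # global to local: fold in what the other children contributed
        g = parent.read()
        self.x = [xi + gi - oi for xi, gi, oi in zip(self.x, g, self.x_old)]
        self.x_old = g                         # x_old <- x_global

class Parent:
    """Keeps the global state; in the real system this lives on other machines."""
    def __init__(self, dim):
        self.x_global = [0.0] * dim
    def apply(self, delta):
        self.x_global = [a + d for a, d in zip(self.x_global, delta)]
    def read(self):
        return list(self.x_global)
```

Only the delta crosses the network, which is what keeps the scheme bandwidth limited rather than latency limited.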

Page 89: Scaling up Machine Learning

Consistent Hashing

• Dedicated servers for variables
  • Insufficient bandwidth (hotspots)
  • Insufficient memory
• Select the server via consistent hashing:

$$m(x) = \underset{m\in M}{\operatorname{argmin}}\; h(x, m)$$

Page 90: Scaling up Machine Learning

Consistent Hashing

$$m(x) = \underset{m\in M}{\operatorname{argmin}}\; h(x, m)$$

• Storage is O(1/k) per machine
• Communication is O(1) per machine
• Fast snapshots: O(1/k) per machine
• O(k) open connections per machine
• O(1/k) throughput per machine
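
A sketch of m(x) = argmin over machines of h(x, m): every client computes the same owner without any directory, and removing a machine only remaps the keys it owned (md5 stands in for a fast hash):

```python
import hashlib

def h(key, machine):
    """Joint hash of key and machine name."""
    return int(hashlib.md5(f"{key}|{machine}".encode()).hexdigest(), 16)

def owner(key, machines):
    """m(x) = argmin_{m in M} h(x, m)."""
    return min(machines, key=lambda m: h(key, m))

machines = [f"server{i:02d}" for i in range(16)]
print(owner("topic_42", machines))
```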

Page 91: Scaling up Machine Learning

Communication Shaping

• The data rate between machines is O(1/k)
• Machines operate asynchronously (no barrier)
• Solution
  • Schedule message pairs
  • Communicate with r machines simultaneously
  • Use a Luby-Rackoff PRNG for load balancing
• Efficiency guarantee

Page 92: Scaling up Machine Learning

Performance

• 8 million documents, 1000 topics, {100, 200, 400} machines, LDA
• Red: symmetric, latency-bound message passing
• Blue: asynchronous, bandwidth-bound message passing & message scheduling
• 10x faster synchronization time
• 10x faster snapshots
• Scheduling already improves things by 10% on 150 machines

Page 93: Scaling up Machine Learning

LDA - our Guinea Pig

https://github.com/shravanmn/Yahoo_LDA

Page 94: Scaling up Machine Learning

Latent Dirichlet Allocation

[Figure: LDA plate diagram: topic assignment z_ij and word w_ij for j = 1..m_i inside the document plate i = 1..m; per-document topic distribution θ_i with prior α; topic-word distributions ψ_l, l = 1..k, with prior β]

Page 95: Scaling up Machine Learning

Sequential Algorithm

• Collapsed Gibbs sampler (Griffiths & Steyvers 2005)
• For 1000 iterations do
  • For each document do
    • For each word in the document do
      • Resample the topic for the word
      • Update the local (document, topic) table
      • Update the global (word, topic) table
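
A compact single-machine sketch of this sampler (docs are lists of word ids; the dense count tables and hyperparameters are illustrative, and a production sampler exploits sparsity as the next slides discuss):

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=1000, seed=0):
    """Collapsed Gibbs sampling: resample each token's topic from
    p(t) ~ (n(t,d) + alpha) * (n(t,w) + beta) / (n(t) + V*beta)."""
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), K))     # local (document, topic) table
    nwt = np.zeros((V, K))             # global (word, topic) table
    nt = np.zeros(K)                   # topic totals
    z = [rng.integers(0, K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):     # fill the tables from the initial z
        for t, w in zip(z[d], doc):
            ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]            # take the token out of the tables
                ndt[d, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1
                p = (ndt[d] + alpha) * (nwt[w] + beta) / (nt + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][j] = t            # put it back under the new topic
                ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    return ndt, nwt
```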

Page 96: Scaling up Machine Learning

Sequential Algorithm

The same collapsed Gibbs loop, annotated: updating the global (word, topic) table after every single word is what kills parallelism.

Page 97: Scaling up Machine Learning

State of the art (UMass Mallet, UC Irvine, Google)

• For 1000 iterations do
  • For each document do
    • For each word in the document do
      • Resample the topic for the word
      • Update the local (document, topic) table
      • Update the CPU-local (word, topic) table
  • Update the global (word, topic) table

$$p(t \mid w_{ij}) \;\propto\; \frac{\beta_w\,\alpha_t}{n(t) + \bar\beta} \;+\; \frac{\beta_w\,n(t, d{=}i)}{n(t) + \bar\beta} \;+\; \frac{n(t, w{=}w_{ij})\,\left[n(t, d{=}i) + \alpha_t\right]}{n(t) + \bar\beta}$$

Annotations across the following builds: slow, changes rapidly, moderately fast (the three terms of the sampling distribution); table out of sync, blocking, network bound, memory inefficient (the update scheme).


Page 102: Scaling up Machine Learning

Distributed asynchronous sampler

• For 1000 iterations do (independently per computer)
  • For each thread/core do
    • For each document do
      • For each word in the document do
        • Resample the topic for the word
        • Update the local (document, topic) table
        • Generate a computer-local (word, topic) message
  • In parallel, update the local (word, topic) table
  • In parallel, update the global (word, topic) table

Page 103: Scaling up Machine Learning

Distributed asynchronous sampler

The same loop, annotated: continuous sync, barrier free, concurrent CPU/HDD/net, minimal view.

Page 104: Scaling up Machine Learning

Multicore Architecture

• Decouple multithreaded sampling and updating; (almost) avoids stalling for locks in the sampler
• Joint state table
  • Much less memory required
  • Samplers synchronized (10s vs. m/proc delay)
• Hyperparameter update via stochastic gradient descent
• No need to keep documents in memory (streaming OK)

[Figure: tokens and topics stream from file through multiple samplers; a combiner feeds the count updater and the diagnostics & optimization stage, and topics are written back out to file]
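
A thread-and-queue sketch of the decoupled design (sample_fn and apply_fn are placeholders for the actual resampling and table-update steps):

```python
import queue
import threading

def run_pipeline(docs, sample_fn, apply_fn, n_samplers=4):
    """Sampler threads stream over documents and emit count messages;
    a single updater thread folds them into the shared table, so the
    samplers (almost) never stall on table locks."""
    msgs = queue.Queue(maxsize=10000)
    def sampler(shard):
        for doc in shard:
            msgs.put(sample_fn(doc))          # emit a local count delta
    def updater():
        while True:
            delta = msgs.get()
            if delta is None:                 # sentinel: samplers are done
                return
            apply_fn(delta)                   # serialized table updates
    samplers = [threading.Thread(target=sampler, args=(docs[i::n_samplers],))
                for i in range(n_samplers)]
    upd = threading.Thread(target=updater)
    upd.start()
    for t in samplers: t.start()
    for t in samplers: t.join()
    msgs.put(None)
    upd.join()
```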

Page 105: Scaling up Machine Learning

Scalability

>8000 documents/s


Page 106: Scaling up Machine Learning

Outlook

• Convex optimization
• Parameter compression
• Distributed sampling
• Fast nonlinear function classes
• Data streams (sketches & statistics)
• Graphs, FAWN architectures, relational data, bandit-like settings, applications