Machine Learning: Think Big and Parallel - Day 2

Inderjit S. Dhillon
Dept of Computer Science, UT Austin

CS395T: Topics in Multicore Programming
Oct 3, 2013


Outline

Scikit-learn: Machine Learning in Python

Supervised Learning — day1

Regression: Least Squares, Lasso

Classification: kNN, SVM

Unsupervised Learning — day2

Clustering: k-means, Spectral Clustering

Dimensionality Reduction: PCA, Matrix Factorization for Recommender Systems


Clustering



Clustering: k-means Clustering


Clustering

Goal is to group “similar” instances together

Given data points x_i ∈ R^d, i = 1, 2, ..., N

But no labels – unsupervised learning

Useful for exploratory data analysis



Clustering

Need a measure of similarity (or distance) between two points x and y

Popular distance metrics:

Squared Euclidean distance: d(x, y) = ‖x − y‖₂²

Cosine similarity: d(x, y) = (xᵀy) / (‖x‖ ‖y‖)

Manhattan distance: d(x, y) = ‖x − y‖₁

Clustering results are crucially dependent on the distance metric
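
A quick NumPy illustration of the three measures above (not from the slides; the variable names are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

sq_euclidean = np.sum((x - y) ** 2)                               # ‖x − y‖₂²
cosine_sim   = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))    # (xᵀy)/(‖x‖‖y‖)
manhattan    = np.sum(np.abs(x - y))                              # ‖x − y‖₁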


k-means Clustering

Find k clusters that minimize the objective:

J = Σ_{i=1}^{k} Σ_{x ∈ C_i} ‖x − m_i‖₂²

C_i: the set of points in cluster i

m_i: the mean (center) of cluster i

Objective is non-convex and the problem is NP-hard in general

Note: for k = 1, J = Σ_x ‖x − m‖₂², whose solution is m* = (1/N) Σ_x x


k-means Algorithm (Batch)

Input: data points x ∈ R^d, number of clusters k
Output: cluster assignment C_i of the data points, i = 1, 2, ..., k
1: Randomly partition the data into k clusters
2: while not converged do
3:   Compute the mean of each cluster i: m_i = (1/n_i) Σ_{x ∈ C_i} x
4:   For each x, find its new cluster index: π(x) = arg min_{1 ≤ i ≤ k} ‖x − m_i‖₂²
5:   Update the clusters: C_i = { x | π(x) = i }
6: end while
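
A minimal NumPy sketch of the batch algorithm above; the convergence test (stop when assignments no longer change) and the handling of empty clusters are illustrative choices, not specified on the slide:

import numpy as np

def batch_kmeans(X, k, max_iter=100, seed=0):
    # Step 1: randomly partition the N points into k clusters
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])
    for _ in range(max_iter):
        # Step 3: mean of each cluster (reseed an empty cluster with a random point)
        means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else X[rng.integers(X.shape[0])] for i in range(k)])
        # Step 4: assign each point to its nearest mean (squared Euclidean distance)
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged: assignments fixed
            break
        labels = new_labels                      # Step 5: update the clusters
    return labels, means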


k-means Clustering


Convergence of k-means

Let the objective at the t-th iteration be J^(t) = Σ_{i=1}^{k} Σ_{x ∈ C_i^(t)} ‖x − m_i^(t)‖²

J^(t) = Σ_{i=1}^{k} Σ_{x ∈ C_i^(t)} ‖x − m_i^(t)‖²
      ≥ Σ_{i=1}^{k} Σ_{x ∈ C_i^(t)} ‖x − m_{π(x)}^(t)‖² = Σ_{i=1}^{k} Σ_{x ∈ C_i^(t+1)} ‖x − m_i^(t)‖²
      ≥ Σ_{i=1}^{k} Σ_{x ∈ C_i^(t+1)} ‖x − m_i^(t+1)‖² = J^(t+1)

(The first inequality holds because π(x) sends each point to its nearest current mean; the second because the mean of a cluster minimizes the sum of squared distances within it.)

Each step decreases the objective — guaranteed to converge

But not necessarily to the global minimum


k-means Algorithm (Online)

Input: data points x ∈ R^d, number of clusters k
Output: cluster assignment C_i of the data points, i = 1, 2, ..., k
1: Initialize the means m_i and counts n_i = 0, i = 1, 2, ..., k
2: while not converged do
3:   Pick a data point x and determine its cluster: π(x) = arg min_{1 ≤ i ≤ k} ‖x − m_i‖₂²
4:   Update the mean m_{π(x)}: n_{π(x)} ← n_{π(x)} + 1 and m_{π(x)} ← m_{π(x)} + (1/n_{π(x)}) (x − m_{π(x)})
5: end while
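
A corresponding NumPy sketch of one online update; seeding the means with k randomly chosen points is an assumption, the slide only says to initialize the means:

import numpy as np

def online_kmeans_step(x, means, counts):
    # Step 3: nearest mean under squared Euclidean distance
    i = int(((means - x) ** 2).sum(axis=1).argmin())
    # Step 4: increment the count and move that mean toward x
    counts[i] += 1
    means[i] += (x - means[i]) / counts[i]
    return i

# Example setup: means seeded with k random points, counts starting at zero
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
k = 3
means = X[rng.choice(len(X), k, replace=False)].copy()
counts = np.zeros(k)
for x in X:
    online_kmeans_step(x, means, counts)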


k-means with Bregman Divergences

Bregman divergences:

d_Φ(x, y) = Φ(x) − Φ(y) − 〈x − y, ∇Φ(y)〉,

where Φ is strictly convex and differentiable

Examples of d_Φ(x, y):

Squared Euclidean distance: ‖x − y‖₂²

KL-divergence: Σ_i x_i log(x_i / y_i)

Itakura-Saito distance: Σ_i ( x_i/y_i − log(x_i/y_i) − 1 )

For Bregman divergences, the arithmetic mean is the best predictor:

(1/N) Σ_{i=1}^{N} x_i = arg min_c Σ_{i=1}^{N} d_Φ(x_i, c)
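
A small NumPy check of this statement (not from the slides): the three divergences are coded directly from their formulas, and the total divergence at the arithmetic mean is compared against a slightly perturbed center:

import numpy as np

def sq_euclidean(x, y):
    return np.sum((x - y) ** 2)

def kl_divergence(x, y):
    return np.sum(x * np.log(x / y))

def itakura_saito(x, y):
    return np.sum(x / y - np.log(x / y) - 1)

rng = np.random.default_rng(0)
X = rng.random((100, 5)) + 0.1          # strictly positive data so KL / Itakura-Saito are defined
m = X.mean(axis=0)                      # arithmetic mean
for d in (sq_euclidean, kl_divergence, itakura_saito):
    total_at_mean = sum(d(x, m) for x in X)
    total_perturbed = sum(d(x, 1.05 * m) for x in X)
    assert total_at_mean <= total_perturbed   # the mean is the best single predictor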


Clustering: Spectral Clustering


Spectral Clustering

Given:

Number of clusters k

Graph G = (V, E):

Set of nodes: V = {1, ..., n}

Set of edges: E = { e_ij | i, j ∈ V }, the similarities between nodes

Weighted adjacency matrix W ∈ R^{n×n}:

W_ij = e_ij if there is an edge between nodes i and j, and 0 otherwise

W is symmetric if G is an undirected graph

Degree matrix: the diagonal matrix D with D_ii = Σ_{j=1}^{n} W_ij


Spectral Clustering

Goal:

Partition V into k disjoint clusters: V_1, ..., V_k

Within-cluster: large weights

Between-cluster: small weights

An ideal but trivial case: G has exactly k connected components


Graph Cut

Small cut between clusters

cut(A, B) = (1/2) Σ_{i ∈ A, j ∈ B} W_ij

Balance of cluster sizes |V_i|

Objective:

RatioCut(V_1, ..., V_k) = Σ_{i=1}^{k} cut(V_i, V \ V_i) / |V_i|

Goal: minimize RatioCut(V_1, ..., V_k)


Graph Laplacian

Laplacian: L = D −W

L: symmetric and positive semi-definite

Eigenvalues: 0 ≤ λ_1 ≤ λ_2 ≤ ··· ≤ λ_n

# of connected components in G = # of 0 eigenvalues of L

For all f ∈ R^n,

f^T L f = (1/2) Σ_{i,j=1}^{n} W_ij (f_i − f_j)²

Most importantly,

RatioCut(V_1, ..., V_k) = trace(F^T L F)

for a special F = [f_1, ..., f_k], where F_ij = 1/√|V_j| if i ∈ V_j, and 0 otherwise


Relaxation of Cut Minimization

In general, minimizing RatioCut is NP-hard! However, based on

RatioCut(V_1, ..., V_k) = trace(F^T L F),

we have the following relaxation:

Solve F* = arg min_{F ∈ R^{n×k}, F^T F = I} trace(F^T L F), whose columns are exactly the first k eigenvectors of L

Recover V_1, ..., V_k from F* by a distance-based clustering algorithm (e.g., k-means)
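
A compact sketch of this pipeline, using a dense similarity matrix W and scikit-learn's KMeans for the final clustering step (both illustrative choices):

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    # Unnormalized Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Columns of F* = first k eigenvectors of L (eigh returns ascending eigenvalues)
    _, eigvecs = np.linalg.eigh(L)
    F = eigvecs[:, :k]
    # Recover the clusters by running k-means on the rows of F*
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(F)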


Spectral Clustering vs. k-means

Clustering data points x_i ∈ R^d, i = 1, ..., N

First construct a kernel matrix, e.g., the Gaussian kernel:

W_ij = K(x_i, x_j) = e^{−‖x_i − x_j‖² / (2σ²)}

The k-means algorithm can only find linear decision boundaries

Spectral clustering allows us to find non-convex boundaries


Variants of Graph Laplacian

Normalized Laplacian:

L = I_n − D^{−1/2} W D^{−1/2}

NormalizedCut(V_1, ..., V_k) = Σ_{i=1}^{k} cut(V_i, V \ V_i) / vol(V_i), where vol(V_i) = Σ_{j ∈ V_i} D_jj

Signed Laplacian:

L = D − W, where D_ii = Σ_{j=1}^{n} |W_ij|

Handles "signed" similarity graphs with both positive and negative edge weights


Dimensionality Reduction



Dimensionality Reduction: Principal Component Analysis


Principal Component Analysis

N observations: x_i ∈ R^D, i = 1, ..., N

Goal:

Project the data onto a space of dimension M < D

Maximize the variance of the projected data

Example:


PCA: Projection to one dimensional space (M = 1)

Empirical mean and variance of xn:

x̄ = (1/N) Σ_{n=1}^{N} x_n

S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T

w: the direction of the space

‖w‖₂ = 1, as the length is not important

Proj_w(x_n) = w^T x_n, ∀ n = 1, ..., N

Proj_w(x̄) = w^T x̄

The variance of Proj_w(x_n):

(1/N) Σ_{n=1}^{N} (w^T x_n − w^T x̄)² ≡ w^T S w


PCA: Projection to one dimensional space (M = 1)

Goal: maximize the variance of the projected data Proj_w(x_n):

arg max_{w_1 : ‖w_1‖ = 1} w_1^T S w_1

Lagrangian: L(w_1, λ_1) = w_1^T S w_1 + λ_1 (1 − w_1^T w_1)

∇L(w_1, λ_1) = 0 implies that S w_1* = λ_1 w_1*

w_1* is the eigenvector of S corresponding to the largest eigenvalue λ_1*, also called the 1st principal component.

In general, the k-th principal component w_k* is the eigenvector of S corresponding to the k-th largest eigenvalue λ_k*.

Dimension reduction:

W = [w_1*, ..., w_M*]: formed by the M principal components.

Proj_W(x) = W^T x: the projected vector in the M-dimensional space.
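
A NumPy sketch of this procedure via the eigendecomposition of S; centering the data before projecting is a common convention assumed here:

import numpy as np

def pca(X, M):
    # Empirical mean and covariance (rows of X are the N observations)
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S = Xc.T @ Xc / X.shape[0]
    # Eigenpairs of S, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:M]]            # the M principal components w_1*, ..., w_M*
    return Xc @ W, W, eigvals[order]     # projected data, components, eigenvalue spectrum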


PCA: An Example

A set of digit images

The mean vector x̄ and the first 4 principal components:


PCA: An Example

Various M:

Eigenvalue Spectrum:


Dimensionality Reduction: Matrix Factorization


Matrix Factorization

Matrix Factorization

A motivating example: recommender systems

Problem Formulation

Latent Feature Space

Existing Methods


Recommender Systems


Matrix Factorization Approach: A ≈ WH^T



Matrix Factorization Approach

min_{W ∈ R^{m×k}, H ∈ R^{n×k}} Σ_{(i,j) ∈ Ω} (A_ij − w_i^T h_j)² + λ(‖W‖_F² + ‖H‖_F²),

Ω = { (i, j) | A_ij is observed }

Regularization terms to avoid over-fitting

Matrix factorization maps users/items to the latent feature space R^k:

the i-th user ⇒ the i-th row of W, w_i^T

the j-th item ⇒ the j-th row of H, h_j^T

w_i^T h_j: measures the interaction between the i-th user and the j-th item
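
The objective above is easy to compute directly; a small NumPy sketch with a boolean mask standing in for Ω (the dense-matrix representation is an illustrative assumption):

import numpy as np

def mf_objective(A, mask, W, H, lam):
    # Squared error over the observed entries Ω plus the regularization terms
    R = A - W @ H.T
    return (R[mask] ** 2).sum() + lam * ((W ** 2).sum() + (H ** 2).sum())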


Latent Feature Space



Other Factorizations

Nonnegative Matrix Factorization

min_{W,H} ‖A − WH^T‖_F² + λ‖W‖_F² + λ‖H‖_F²

Each entry is positive

A is either fully or partially observed

Goal: find the nonnegative latent factors


Existing Methods


ALS: Alternating Least Squares

Fix either H or W and optimize the other:

LS sub-problem: min_{w_i ∈ R^k} Σ_{j ∈ Ω_i} (A_ij − w_i^T h_j)² + λ‖w_i‖², which has a closed-form solution

An iteration: update W and H once

O(|Ω| k² + (m + n) k³) per iteration

(Figure: with H^T fixed, each row w_i^T of W is solved from the observed entries in the i-th row of A.)
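
A sketch of one ALS half-step, solving the closed-form sub-problems for all rows of W with H fixed; the dense A plus boolean mask representation of Ω is an illustrative assumption, not how a production solver would store the data:

import numpy as np

def als_update_W(A, mask, W, H, lam):
    k = H.shape[1]
    for i in range(W.shape[0]):
        idx = np.flatnonzero(mask[i])                 # Ω_i: observed items for user i
        if idx.size == 0:
            continue
        Hi = H[idx]                                   # |Ω_i| x k
        # Closed-form solution of the regularized least-squares sub-problem
        W[i] = np.linalg.solve(Hi.T @ Hi + lam * np.eye(k), Hi.T @ A[i, idx])
    return W

# One full ALS iteration alternates this with the symmetric update of H.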


SGM: Stochastic Gradient Method

SGM update: pick (i , j) ∈ Ω

R_ij ← A_ij − w_i^T h_j

w_i ← w_i − η(λ w_i − R_ij h_j)

h_j ← h_j − η(λ h_j − R_ij w_i)

(Figure: an SGM update for entry A_ij touches only the row w_i^T and the column h_j.)

An iteration: |Ω| updates

Time per iteration: O(|Ω| k), better than O(|Ω| k²) for ALS

Convergence is sensitive to the learning rate η.
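
A sketch of one SGM pass over the observed entries, following the update rules above; representing Ω as a list of (i, j, A_ij) triples is an assumption for illustration:

import numpy as np

def sgm_epoch(obs, W, H, lam, eta):
    # obs is a list of (i, j, A_ij) triples drawn from Ω; W and H are updated in place
    for i, j, a in obs:
        r = a - W[i] @ H[j]                 # residual R_ij
        wi = W[i].copy()                    # keep the old w_i for the h_j update
        W[i] -= eta * (lam * wi - r * H[j])
        H[j] -= eta * (lam * H[j] - r * wi)
    return W, H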


Coordinate Descent

Update a variable at a time:

w_it ← [ Σ_{j ∈ Ω_i} (A_ij − w_i^T h_j + w_it h_jt) h_jt ] / [ λ + Σ_{j ∈ Ω_i} h_jt² ]

The subproblem is just a single-variable quadratic problem

Ω_i = { j : (i, j) ∈ Ω }

The update can be done in O(|Ω_i|)

Update Sequence:

Item/user-wise update:

pick a user i or an item j

update the i-th row of W or the j-th column of H

Feature-wise update:

pick a feature index t ∈ {1, ..., k}

update the t-th columns of W and H alternately
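
A sketch of the single-variable update of w_it above (again using a dense A and a boolean mask for Ω as an illustrative assumption):

import numpy as np

def ccd_update(A, mask, W, H, i, t, lam):
    idx = np.flatnonzero(mask[i])                    # Ω_i
    if idx.size == 0:
        return W[i, t]
    resid = A[i, idx] - W[i] @ H[idx].T              # A_ij − w_i^T h_j over Ω_i
    num = ((resid + W[i, t] * H[idx, t]) * H[idx, t]).sum()
    den = lam + (H[idx, t] ** 2).sum()
    W[i, t] = num / den                              # closed-form minimizer of the 1-D quadratic
    return W[i, t]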


Thoughts on Parallelization


List of Methods in Scikit-learn

Regression:

Linear, Ridge, Lasso, Elastic Net, Bayesian Regression, Support Vector Regression, ...

Classification:

kNN, SVM, Perceptron, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, AdaBoost, ...

Clustering:

k-means, Spectral Clustering, Affinity Propagation, Mean-Shift, DBSCAN, Hierarchical Clustering, ...

Dimensionality Reduction:

(kernel/sparse) PCA, MF, NMF, Truncated SVD (LSA), Dictionary Learning, Factor Analysis, Independent Component Analysis, ...
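
A few of the listed methods as they are called in scikit-learn; the toy data and parameter values are illustrative, and defaults can differ across scikit-learn releases:

import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.decomposition import PCA, NMF

X = np.abs(np.random.default_rng(0).normal(size=(200, 10)))   # toy nonnegative data

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_sc = SpectralClustering(n_clusters=3, affinity="rbf", random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)
W_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500).fit_transform(X)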


Potential Projects

Goal: A fully parallelized version of Scikit-learn

Regression:

parallel solvers for Lasso/Ridge

Classification:

parallel solvers for SVM, Logistic Regression

Clustering:

parallel k-means

Dimensionality Reduction:

parallel MF/NMF for recommender systems


Example: Parallel Matrix Factorization for Recommender Systems


DSGD: Distributed SGM

(Figure: the observed entries of A are split into blocks over the workers P1, P2, P3; each sub-epoch runs SGM in parallel on blocks that share no rows or columns of A, so the workers touch disjoint parts of W and H, and the selected blocks rotate from one sub-epoch to the next.)



Parallel Coordinate Descent

Feature-wise Update: CCD++

Rank-one decomposition:

WH^T = [··· w_t ···][··· h_t ···]^T = Σ_{t=1}^{k} w_t h_t^T

CCD++: picks a latent feature t and updates (w_t, h_t) by solving

min_{u ∈ R^m, v ∈ R^n} Σ_{(i,j) ∈ Ω} (R_ij − u_i v_j)² + λ(‖u‖² + ‖v‖²),

where R_ij = A_ij − w_i^T h_j and then R_ij ← R_ij + w_ti h_tj, ∀(i, j) ∈ Ω

(u*, v*) is a rank-one approximation of R

Apply the CCD iteration T times to obtain (u*, v*)

CCD: item/user-wise update


Feature-wise Update: CCD++

When T = 2

Cycle through the k feature dimensions

O(2T/(T+1)) faster than CCD

(Figure: convergence comparison on the Netflix data with k = 40)


Problems of Different Scales

W, H, and R fit in the memory of a single computer:

Multi-core systems are an appropriate framework.

All cores share the same memory space.

The latest variables are always available to access.

W, H, or R exceeds the memory capacity of one computer:

Can still run on one computer, but this leads to disk swapping.

Distributed systems are appropriate.

Matrices are stored in the memory of the distributed system ⇒ only local data can be accessed fast.

Communication is required to access the latest variables.


Parallelization of CCD++

Key: to parallelize CCD to obtain (u∗, v∗).

Fact: each ui can be updated independently.

Partition u and v into p sub-vectors.

u ⇒ u^1, ..., u^r, ..., u^p

v ⇒ v^1, ..., v^r, ..., v^p

Run in parallel: the r-th core C_r:

computes (u*)^r and (v*)^r

updates w_t^r and h_t^r

See Yu et al. (2013) [4] for more details.

(Figure: u and v are partitioned into sub-vectors u^1, ..., u^p and v^1, ..., v^p, handled by cores C_1, ..., C_p over the corresponding blocks of R.)


CCD++ on Distributed Systems

W, H, and R are distributed over the memory of different computers.

(Figure: each computer C_r stores the r-th block row and the r-th block column of R, together with the corresponding blocks W^r and H^r.)


CCD++ on Distributed Systems

Distributed update: computer C_r:

obtains (u^r, v^r) using CCD:

computes u^r and broadcasts it

computes v^r and broadcasts it

updates (w_t^r, h_t^r) ← (u^r, v^r)



References

[1] R. Gemulla, P. J. Haas, E. Nijkamp, and Y. Sismanis. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. KDD, 2011.

[2] F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.

[3] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems. RecSys, 2013.

[4] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon. Parallel Matrix Factorization for Recommender Systems. KAIS, 2013.

