Machine Learning on Spark

Date post: 31-Dec-2015
Transcript
Page 1: Machine Learning on Spark

Machine Learning on Spark

Shivaram Venkataraman, UC Berkeley

Page 2: Machine Learning on Spark

Computer Science

Machine learning

Statistics

Page 3: Machine Learning on Spark

Machine learning

Spam filters

Recommendations

Click prediction

Search ranking

Page 4: Machine Learning on Spark

Machine learning techniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

Page 5: Machine Learning on Spark

Implementing Machine Learning

Machine learning algorithms are

- Complex, multi-stage

- Iterative

MapReduce/Hadoop is unsuitable

Need efficient primitives for data sharing

Page 6: Machine Learning on Spark

Machine Learning using Spark

Spark RDDs: efficient data sharing

In-memory caching accelerates performance

- Up to 20x faster than Hadoop

Easy-to-use, high-level programming interface

- Express complex algorithms in ~100 lines

Page 8: Machine Learning on Spark

K-Means Clustering using Spark

Focus: Implementation and Performance

Page 9: Machine Learning on Spark

Clustering

Grouping data according to similarity

E.g. archaeological dig

[Figure: scatter plot of points; axes: Distance East, Distance North]

Page 11: Machine Learning on Spark

K-Means Algorithm

Benefits

• Popular

• Fast

• Conceptually straightforward

Page 12: Machine Learning on Spark

K-Means: preliminaries

[Figure: scatter plot of data points; axes: Feature 1, Feature 2]

Data: collection of values

data = lines.map(line => parseVector(line))
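The slide calls `parseVector` without defining it. The following is a minimal sketch of what such a helper might look like, assuming each input line holds whitespace-separated numbers. The object name, the input format and the use of a plain Scala `Seq` in place of an RDD are all assumptions made for illustration.

```scala
// Hypothetical helper: the slides use parseVector but do not show its body.
// Assumption: every line is a non-empty list of whitespace-separated numbers.
object ParseVectorSketch {
  def parseVector(line: String): Vector[Double] =
    line.trim.split("\\s+").map(_.toDouble).toVector

  def main(args: Array[String]): Unit = {
    // A local Seq stands in for the RDD of text lines used on the slide.
    val lines = Seq("1.0 2.0", "3.5 4.5")
    val data = lines.map(line => parseVector(line))
    data.foreach(println)
  }
}
```

A blank or non-numeric line would make `toDouble` throw, so real input would need validation first.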

Page 13: Machine Learning on Spark

K-Means: preliminaries

Dissimilarity: Squared Euclidean distance

dist = p.squaredDist(q)
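Squared Euclidean distance is the sum of squared coordinate differences. A small sketch, written as a standalone function over plain Scala vectors rather than as the `squaredDist` method used on the slide (the function form and the object name are assumptions):

```scala
// Sketch of squared Euclidean distance between two points of equal dimension.
object SquaredDistSketch {
  def squaredDist(p: Vector[Double], q: Vector[Double]): Double = {
    require(p.length == q.length, "points must have the same dimension")
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  def main(args: Array[String]): Unit = {
    // (3 - 0)^2 + (4 - 0)^2 = 9 + 16
    println(squaredDist(Vector(0.0, 0.0), Vector(3.0, 4.0))) // prints 25.0
  }
}
```

Skipping the square root keeps the comparison cheap and does not change which center is closest.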

Page 14: Machine Learning on Spark

K-Means: preliminaries

K = number of clusters

Data assignments to clusters: S1, S2, ..., SK

Page 16: Machine Learning on Spark

K-Means Algorithm

• Initialize K cluster centers

• Repeat until convergence:

  - Assign each data point to the cluster with the closest center.

  - Assign each cluster center to be the mean of its cluster's data points.

Page 18: Machine Learning on Spark

K-Means Algorithm

• Initialize K cluster centers

centers = data.takeSample(false, K, seed)

• Repeat until convergence:

  - Assign each data point to the cluster with the closest center.

  - Assign each cluster center to be the mean of its cluster's data points.

Page 21: Machine Learning on Spark

K-Means Algorithm

• Initialize K cluster centers

centers = data.takeSample(false, K, seed)

• Repeat until convergence:

  - Assign each cluster center to be the mean of its cluster's data points.

closest = data.map(p => (closestPoint(p, centers), p))
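The assignment step relies on `closestPoint`, which the slides do not define. One plausible sketch is shown here, assuming it returns the index of the nearest center so that the index can serve as the grouping key. The return type, the object name and the local `squaredDist` helper are assumptions.

```scala
// Hypothetical closestPoint: returns the index of the center nearest to p.
object ClosestPointSketch {
  def squaredDist(p: Vector[Double], q: Vector[Double]): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  def closestPoint(p: Vector[Double], centers: IndexedSeq[Vector[Double]]): Int = {
    require(centers.nonEmpty, "need at least one center")
    // On ties, minBy keeps the first (lowest) index.
    centers.indices.minBy(i => squaredDist(p, centers(i)))
  }

  def main(args: Array[String]): Unit = {
    val centers = IndexedSeq(Vector(0.0, 0.0), Vector(10.0, 10.0))
    val data = Seq(Vector(1.0, 1.0), Vector(9.0, 8.0))
    // Mirrors the slide: pair every point with the key of its closest center.
    val closest = data.map(p => (closestPoint(p, centers), p))
    closest.foreach(println)
  }
}
```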

Page 24: Machine Learning on Spark

K-Means Algorithm

• Initialize K cluster centers

centers = data.takeSample(false, K, seed)

• Repeat until convergence:

closest = data.map(p => (closestPoint(p, centers), p))

pointsGroup = closest.groupByKey()

Page 25: Machine Learning on Spark

K-Means Algorithm

• Initialize K cluster centers

centers = data.takeSample(false, K, seed)

• Repeat until convergence:

closest = data.map(p => (closestPoint(p, centers), p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues(ps => average(ps))
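The update step groups points by cluster key and averages each group. Below is a local sketch using Scala's `groupBy` as a stand-in for the RDD `groupByKey`. The `average` helper is not defined on the slides, so its body here is an assumption, as is the object name.

```scala
// Sketch of the update step on local collections instead of RDDs.
object AverageSketch {
  // Hypothetical average: component-wise mean of a non-empty group of points.
  def average(ps: Iterable[Vector[Double]]): Vector[Double] = {
    require(ps.nonEmpty, "cannot average an empty cluster")
    val n = ps.size
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    // (clusterKey, point) pairs, as produced by the assignment step.
    val closest = Seq(
      (0, Vector(0.0, 0.0)), (0, Vector(2.0, 4.0)),
      (1, Vector(10.0, 10.0))
    )
    // Local equivalent of closest.groupByKey()
    val pointsGroup = closest.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Local equivalent of pointsGroup.mapValues(ps => average(ps))
    val newCenters = pointsGroup.map { case (k, ps) => k -> average(ps) }
    newCenters.toSeq.sortBy(_._1).foreach(println)
  }
}
```

On a cluster, grouping every point by key shuffles the whole dataset. Summing per key and dividing by the count is a common alternative, though the slides do not state which one the tutorial code uses.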

Page 28: Machine Learning on Spark

K-Means Algorithm

• Initialize K cluster centers

centers = data.takeSample(false, K, seed)

• Repeat until convergence:

closest = data.map(p => (closestPoint(p, centers), p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues(ps => average(ps))

while (dist(centers, newCenters) > ɛ)

Page 30: Machine Learning on Spark

K-Means Source

centers = data.takeSample(false, K, seed)

while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters.map(_)
}
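To make the loop above concrete, here is a single-machine sketch of the same structure on plain Scala collections. It is not the tutorial's Spark code. The helper bodies, the shuffle-based stand-in for `takeSample`, the handling of empty clusters and the choice of summed squared center movement as the distance `d` are all assumptions.

```scala
import scala.util.Random

// Local, non-distributed sketch of the K-Means loop from the slides.
object LocalKMeansSketch {
  type Point = Vector[Double]

  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  def closestPoint(p: Point, centers: IndexedSeq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  def average(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)

  def kMeans(data: Seq[Point], k: Int, epsilon: Double, seed: Long): IndexedSeq[Point] = {
    require(k > 0 && k <= data.size, "k must be between 1 and the number of points")
    // Stand-in for data.takeSample(false, K, seed): k points drawn without replacement.
    var centers: IndexedSeq[Point] = new Random(seed).shuffle(data).take(k).toIndexedSeq
    var d = Double.MaxValue
    while (d > epsilon) {
      // Assignment step: group points by the index of their closest center.
      val pointsGroup = data.groupBy(p => closestPoint(p, centers))
      // Update step: a center that attracted no points keeps its old position.
      val newCenters = centers.indices.map { i =>
        pointsGroup.get(i).map(average).getOrElse(centers(i))
      }
      // Total squared movement of the centers in this iteration.
      d = centers.zip(newCenters).map { case (a, b) => squaredDist(a, b) }.sum
      centers = newCenters
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Vector(0.0, 0.0), Vector(0.0, 1.0), Vector(10.0, 10.0), Vector(10.0, 11.0))
    kMeans(data, k = 2, epsilon = 1e-9, seed = 42L).foreach(println)
  }
}
```

In the Spark version, `data` would be a cached RDD, so each iteration re-reads the points from memory rather than from disk.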

Page 31: Machine Learning on Spark

Ease of use

Interactive shell:

Useful for featurization, pre-processing data

Lines of code for K-Means

- Spark: ~90 lines (part of the hands-on tutorial!)

- Hadoop/Mahout: 4 files, > 300 lines

Page 32: Machine Learning on Spark

Performance

Iteration time (s) by number of machines [Zaharia et al., NSDI'12]

K-Means

Machines        25     50     100
Hadoop          274    157    106
HadoopBinMem    197    121    87
Spark           143    61     33

Logistic Regression

Machines        25     50     100
Hadoop          184    111    76
HadoopBinMem    116    80     62
Spark           15     6      3
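The chart's data labels can be turned into speedup ratios with simple division. The sketch below assumes the labels pair with the series in legend order (Hadoop, HadoopBinMem, Spark) at 25, 50 and 100 machines. That pairing is inferred from the order of the values, and the object and function names are illustrative.

```scala
// Speedup of Spark over Hadoop, computed from the iteration times on the chart.
object SpeedupSketch {
  // Assumption: values are listed per series in the order 25, 50, 100 machines.
  val machines = Seq(25, 50, 100)
  val kMeansHadoop = Seq(274.0, 157.0, 106.0)
  val kMeansSpark = Seq(143.0, 61.0, 33.0)
  val logRegHadoop = Seq(184.0, 111.0, 76.0)
  val logRegSpark = Seq(15.0, 6.0, 3.0)

  // Ratio of baseline time to Spark time at each cluster size.
  def speedups(baseline: Seq[Double], spark: Seq[Double]): Seq[Double] =
    baseline.zip(spark).map { case (b, s) => b / s }

  def main(args: Array[String]): Unit = {
    val km = speedups(kMeansHadoop, kMeansSpark)
    val lr = speedups(logRegHadoop, logRegSpark)
    for ((m, (k, l)) <- machines.zip(km.zip(lr)))
      println(f"$m%3d machines: K-Means $k%.1f times, Logistic Regression $l%.1f times")
  }
}
```

Under that reading, the ratio at 100 machines is about 3.2 for K-Means (106 / 33) and about 25 for Logistic Regression (76 / 3).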

Page 33: Machine Learning on Spark

Conclusion

Spark: framework for cluster computing

Fast and easy machine learning programs

K-Means clustering using Spark: hands-on exercise this afternoon!

Examples and more: www.spark-project.org
