Machine Learning on Spark


Machine Learning on Spark

March 15, 2013, AMPCamp @ ECNU, Shanghai, China

Machine learning: at the intersection of Computer Science and Statistics.

Applications: spam filters, recommendations, click prediction, search ranking.

Machine learning techniques:

- Classification
- Regression
- Clustering
- Active learning
- Collaborative filtering

Implementing Machine Learning

Machine learning algorithms are:
- Complex, multi-stage
- Iterative

MapReduce/Hadoop is unsuitable; we need efficient primitives for data sharing.

Spark RDDs provide efficient data sharing:
- In-memory caching accelerates performance, up to 100x faster than Hadoop
- Easy-to-use high-level programming interface: express complex algorithms in ~100 lines

Machine Learning using Spark

Machine learning techniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

K-Means Clustering using Spark

Focus: Implementation and Performance

Clustering

Grouping data according to similarity

[Figure: scatter plot of points, axes "Distance East" vs. "Distance North", e.g. an archaeological dig]

K-Means Algorithm

Benefits:
- Popular
- Fast
- Conceptually straightforward

[Figure: scatter plot, axes "Distance East" vs. "Distance North", e.g. an archaeological dig]

K-Means: preliminaries

[Figure: scatter plot, axes "Feature 1" vs. "Feature 2"]

Data: Collection of values

data = lines.map(line => parseVector(line))
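The slides call a helper named parseVector without showing its body. Here is a minimal sketch of what it might look like, with a plain Scala Seq standing in for the `lines` RDD (the `map` call mirrors the one on the slide); the implementation is my assumption, not the tutorial's exact code.

```scala
object ParseVectorSketch {
  // Hypothetical parseVector: turns a whitespace-separated line of
  // numbers into a feature vector (Array[Double]).
  def parseVector(line: String): Array[Double] =
    line.trim.split("\\s+").map(_.toDouble)

  def main(args: Array[String]): Unit = {
    // Plain Scala Seq stands in for the `lines` RDD on the slide.
    val lines = Seq("1.0 2.0", "3.5 4.5")
    val data = lines.map(line => parseVector(line))
    data.foreach(v => println(v.mkString(" ")))
  }
}
```

Because RDDs expose the same `map` signature as Scala collections, the same lambda works unchanged on a real Spark RDD.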

K-Means: preliminaries

[Figure: scatter plot, axes "Feature 1" vs. "Feature 2"]

Dissimilarity: Squared Euclidean distance

dist = p.squaredDist(q)
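`squaredDist` comes from Spark's vector class; the quantity it computes is the squared Euclidean distance. A hedged plain-Scala equivalent:

```scala
object SquaredDistSketch {
  // Squared Euclidean distance: the sum of squared coordinate
  // differences between two points of equal dimension.
  def squaredDist(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, "points must have the same dimension")
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  def main(args: Array[String]): Unit =
    println(squaredDist(Array(0.0, 0.0), Array(3.0, 4.0))) // prints 25.0
}
```

Using the squared distance (rather than the distance itself) avoids a square root per comparison and picks the same closest center.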

K-Means: preliminaries

[Figure: scatter plot, axes "Feature 1" vs. "Feature 2"]

K = Number of clusters

Data assignments to clusters: S1, S2, ..., SK

K-Means Algorithm

[Figure: scatter plot, axes "Feature 1" vs. "Feature 2", animated across slides as assignments and centers update]

- Initialize K cluster centers
- Repeat until convergence:
  - Assign each data point to the cluster with the closest center.
  - Assign each cluster center to be the mean of its cluster's data points.

The slides build the corresponding Spark code step by step:

centers = data.takeSample(false, K, seed)

closest = data.map(p => (closestPoint(p, centers), p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues(ps => average(ps))

while (dist(centers, newCenters) > ɛ)

K-Means Source

centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters
}
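Putting the pieces together: below is a self-contained sketch of the same loop on plain Scala collections, with `map`, `groupBy`, and per-key mapping standing in for the RDD operations `map`, `groupByKey`, and `mapValues`. The helper bodies (closestPoint, average, and the convergence distance) are my assumptions, not the tutorial's exact code.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the center closest to point p.
  def closestPoint(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Coordinate-wise mean of a non-empty group of points.
  def average(ps: Iterable[Point]): Point = {
    val n = ps.size.toDouble
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

  def kmeans(data: Seq[Point], k: Int, epsilon: Double): Seq[Point] = {
    // Stand-in for data.takeSample(false, K, seed).
    var centers: Seq[Point] = data.take(k)
    var d = Double.MaxValue
    while (d > epsilon) {
      // data.map(p => (closestPoint(p, centers), p))
      val closest = data.map(p => (closestPoint(p, centers), p))
      // closest.groupByKey()
      val pointsGroup = closest.groupBy(_._1).map { case (i, ps) => (i, ps.map(_._2)) }
      // pointsGroup.mapValues(ps => average(ps))
      val newCenters = pointsGroup.map { case (i, ps) => (i, average(ps)) }
      // Total movement of the centers; convergence when below epsilon.
      d = newCenters.map { case (i, c) => squaredDist(centers(i), c) }.sum
      // Keep the old center for any cluster that received no points.
      centers = centers.indices.map(i => newCenters.getOrElse(i, centers(i)))
    }
    centers
  }
}
```

Because the collection operations line up one-to-one with the RDD ones, the loop body transfers to Spark essentially unchanged; on a cluster, each iteration becomes a distributed map and shuffle over the cached dataset.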

Ease of use

Interactive shell: useful for featurization and pre-processing of data.

Lines of code for K-Means:
- Spark: ~90 lines (part of the hands-on tutorial!)
- Hadoop/Mahout: 4 files, >300 lines

Performance [Zaharia et al., NSDI'12]

K-Means, iteration time (s) by number of machines:

  Machines        25     50     100
  Hadoop          274    157    106
  HadoopBinMem    197    121    87
  Spark           143    61     33

Logistic Regression, iteration time (s) by number of machines:

  Machines        25     50     100
  Hadoop          184    111    76
  HadoopBinMem    116    80     62
  Spark           15     6      3

Conclusion

- Spark: a framework for cluster computing
- Fast and easy machine learning programs
- K-means clustering using Spark: hands-on exercise this afternoon!
- Examples and more: www.spark-project.org