Page 1: Scalable K-Means++

Bahman Bahmani, Stanford University

Scalable K-Means++

Page 2: Scalable K-Means++

K-means Clustering


Fundamental problem in data analysis and machine learning

“By far the most popular clustering algorithm used in scientific and industrial applications” [Berkhin ’02]

Identified as one of the top 10 algorithms in data mining [Wu et al ’07]

Page 3: Scalable K-Means++

Problem Statement


A scalable algorithm for K-means clustering with theoretical guarantees and good practical performance

Page 4: Scalable K-Means++

K-means Clustering


Input: A set X = {x1, x2, …, xn} of n data points, and the number of clusters k

For a set C = {c1, c2, …, ck} of cluster “centers” define

φ_X(C) = Σ_{x∈X} d(x, C)²

where d(x, C) = distance from x to the closest center in C

Goal: find a set C of centers that minimizes the objective function φ_X(C)
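For concreteness, a minimal NumPy sketch of this objective (the function name kmeans_cost and the vectorized layout are my own, not from the slides):

    import numpy as np

    def kmeans_cost(X, C):
        # phi_X(C): sum over all points of the squared distance to the
        # closest center. X is an (n, d) array, C is a (k, d) array.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        return d2.min(axis=1).sum()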

Page 5: Scalable K-Means++

K-means Clustering: Example

[Figure: example point set, K = 4]

Page 6: Scalable K-Means++

Lloyd Algorithm


Start with k arbitrary centers {c1, c2, …, ck} (typically chosen uniformly at random from data points)

Perform an EM-style local search until convergence

Main advantages: Simplicity, scalability
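As an illustration, a minimal NumPy sketch of Lloyd's local search (the names, the convergence test, and the empty-cluster handling are my own, not the presenter's code):

    import numpy as np

    def lloyd(X, C, max_iters=100, tol=1e-6):
        # X: (n, d) data points, C: (k, d) initial centers.
        for _ in range(max_iters):
            # Assignment step: each point goes to its closest center.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Update step: each center moves to the mean of its points
            # (an empty cluster keeps its old center).
            new_C = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else C[j]
                              for j in range(len(C))])
            if np.linalg.norm(new_C - C) < tol:  # centers stopped moving
                break
            C = new_C
        return C, labels

Both steps can only decrease the objective, which is why the search converges, but only to a local optimum.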

Page 7: Scalable K-Means++

What’s wrong with Lloyd Algorithm?


Takes many iterations to converge

Very sensitive to initialization

Random initialization can easily get two centers in the same cluster

K-means gets stuck in a local optimum

Page 8: Scalable K-Means++

Lloyd Algorithm: Initialization

[Figure credited to David Arthur]

Page 9: Scalable K-Means++

Lloyd Algorithm: Initialization

[Figure credited to David Arthur]

Page 10: Scalable K-Means++

Lloyd Algorithm: Initialization

[Figure credited to David Arthur]

Page 11: Scalable K-Means++

Lloyd Algorithm: Initialization

[Figure credited to David Arthur]

Page 12: Scalable K-Means++

K-means++ [Arthur et al. ’07]


Spreads out the centers

Choose the first center, c1, uniformly at random from the data set

Repeat for 2 ≤ i ≤ k: choose ci to be a data point x0 sampled from the distribution

p(x0) = d(x0, C)² / φ_X(C), i.e., p(x0) ∝ d(x0, C)²

Theorem: O(log k)-approximation to the optimum, right after initialization
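A minimal sketch of this D² sampling in NumPy (the helper name kmeans_pp_init is mine; assumes X is an (n, d) array):

    import numpy as np

    def kmeans_pp_init(X, k, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        n = len(X)
        C = [X[rng.integers(n)]]  # first center: uniform over the data
        for _ in range(1, k):
            # Squared distance from each point to its closest current center.
            d2 = ((X[:, None, :] - np.asarray(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # Sample the next center with probability d(x0, C)^2 / phi_X(C).
            C.append(X[rng.choice(n, p=d2 / d2.sum())])
        return np.asarray(C)

Note the k sequential rounds: each new center requires a fresh pass over the data to update the distances, which is exactly the scalability problem discussed next.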

Page 13: Scalable K-Means++

K-means++ Initialization


Page 14: Scalable K-Means++

K-means++ Initialization


Page 15: Scalable K-Means++

K-means++ Initialization


Page 16: Scalable K-Means++

K-means++ Initialization


Page 17: Scalable K-Means++

K-means++ Initialization


Page 18: Scalable K-Means++

What’s wrong with K-means++?


Needs K passes over the data

In large-data applications, not only is the data massive, but K is typically large as well (e.g., easily 1000)

Does not scale!

Page 19: Scalable K-Means++

Intuition for a solution


K-means++ samples one point per iteration and updates its distribution

What if we oversample by sampling each point independently with a larger probability?

Intuitively equivalent to updating the distribution much less frequently: coarser sampling

Turns out to be sufficient: K-means||

Page 20: Scalable K-Means++

K-means|| Initialization

K = 4, oversampling factor l = 3

Page 21: Scalable K-Means++

K-means|| Initialization

K = 4, oversampling factor l = 3

Page 22: Scalable K-Means++

K-means|| Initialization

K = 4, oversampling factor l = 3

Page 23: Scalable K-Means++

K-means|| Initialization

K = 4, oversampling factor l = 3

Page 24: Scalable K-Means++

K-means|| Initialization

K = 4, oversampling factor l = 3

Cluster the intermediate centers

Page 25: Scalable K-Means++

K-means|| [Bahmani et al. ’12]


Choose l > 1 [think l = Θ(k)]

Initialize C to an arbitrary set of points

For R iterations do:
  Sample each point x in X independently with probability p_x = l · d²(x, C) / φ_X(C)
  Add all the sampled points to C

Cluster the (weighted) points in C to find the final k centers
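A minimal single-machine sketch of this initialization (the names and the default R=5 are my own; in practice each sampling round is one pass over a distributed X, e.g. in MapReduce):

    import numpy as np

    def kmeans_parallel_init(X, k, l, R=5, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        C = X[rng.integers(len(X))][None, :]  # arbitrary single starting center
        for _ in range(R):
            # Squared distance from each point to its closest current center.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # Oversample: keep each point independently with p_x = l * d^2(x, C) / phi_X(C).
            p = np.minimum(1.0, l * d2 / d2.sum())
            C = np.vstack([C, X[rng.random(len(X)) < p]])
        # Weight each intermediate center by the size of its cluster; these
        # ~R*l weighted points are then reclustered down to k (e.g. by K-means++).
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        return C, np.bincount(labels, minlength=len(C))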

Page 26: Scalable K-Means++

K-means||: Intuition


An interpolation between Lloyd and K-means++, controlled by the number of iterations R:

R = 0: Lloyd. No guarantees

R = k (with l = 1): simulates K-means++. Strong guarantee

Small R: K-means||. Can it possibly give any guarantees?

Page 27: Scalable K-Means++

Theorem


Theorem: If φ and φ’ are the costs of the clustering at the beginning and end of an iteration, and OPT is the cost of the optimum clustering, then:

E[φ’] ≤ O(OPT) + (k / (e·l)) · φ

Corollary: Let ψ = cost of the initial clustering. Since k/(e·l) < 1 for l = Θ(k), the cost contracts by a constant factor each iteration until it reaches O(OPT), so K-means|| produces a constant-factor approximation to OPT using only O(log(ψ/OPT)) iterations. Using K-means++ to cluster the intermediate centers, the overall approximation factor is O(log k).

Page 28: Scalable K-Means++

Experimental Results: Quality


K-means|| is much harder to confuse with noisy outliers than K-means++

Clustering cost (scaled down by 10^4):

             After initialization    After Lloyd convergence
Random       NA                      22,000
K-means++    430                     65
K-means||    16                      14

GAUSSMIXTURE: 10,000 points in 15 dimensions, K = 50

Page 29: Scalable K-Means++

Experimental Results: Convergence


K-means|| reduces number of Lloyd iterations even more than K-means++

Number of Lloyd iterations until convergence:

Random       167
K-means++    42
K-means||    28

SPAM: 4,601 points in 58 dimensions, K = 50

Page 30: Scalable K-Means++

Experimental Results


K-means|| needs a small number of intermediate centers

Better than K-means++ as soon as ~K centers are chosen

             Clustering cost          Intermediate    Time
             (scaled down by 10^10)   centers         (minutes)
Random       6.4 × 10^7               NA              489
Partition    1.9                      1.47 × 10^6     1022
K-means||    1.5                      3604            87

KDDCUP1999: 4.8M points in 42 dimensions, K = 1000

Page 31: Scalable K-Means++

Algorithmic Theme


Quickly decrease the size of the data in a distributed fashion…

… while maintaining the important features of the data

Solve the small instance on a single machine

Page 32: Scalable K-Means++

Thank You!


