Date post: 27-Dec-2015
Math 5364 Notes
Chapter 8: Cluster Analysis

Jesse Crawford

Department of Mathematics
Tarleton State University


Today's Topics

• Overview of Cluster Analysis

• K-means clustering


What is Cluster Analysis?

• Dividing objects into clusters
  • Distances within clusters are small
  • Distances between clusters are large

• Training data has no class labels!

• Cluster analysis is also called unsupervised classification


Cluster Centers

• Cluster centers: prototypes, centroids, medoids


Purposes of Cluster Analysis

• Understanding

• Biology: Divide organisms into different classes (kingdom, phylum, class, etc.)

• Business: Divide customers into clusters for marketing purposes

• Weather: Identify patterns in atmosphere and ocean


Purposes of Cluster Analysis

• Utility

• Replace data points with cluster centers for summarization/compression


K-Means Clustering

K-Means Algorithm

• Select K initial centroids
• Repeat the following:
  • Form K clusters (assign each point to closest centroid)
  • Recompute the centroid of each cluster
• Stop when centroids converge

• Forming the clusters requires a distance metric (example: Euclidean distance)
• Recomputing the centroid depends on the metric (example: centroid = mean for Euclidean distance)
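The K-means steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the course's implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the choice of random data points as initial centroids, and the toy data are all assumptions made for the example. It uses Euclidean distance, so the centroid is the mean.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means with Euclidean distance. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Select K initial centroids (assumption: K distinct data points at random).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Form K clusters: assign each point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Recompute the centroid of each cluster (the mean, for Euclidean distance).
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of three points.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Because the initial centroids are random, the numbering of the clusters (which group is cluster 0) can change with the seed even when the grouping itself does not.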

[Figure: six scatter-plot panels of a K-means run, titled Iteration 1 through Iteration 6, each with axes x (from -2 to 2) and y (from 0 to 3).]


Sums of Squares for K-Means

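Notation

• $K$ = number of clusters
• $N_k$ = number of points in the $k$th cluster
• $x_{ki}$ = $i$th point in the $k$th cluster
• $\bar{x}_k$ = mean of the $k$th cluster
• $\bar{x}$ = mean of all points

Total SS

$$\mathrm{SST} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x} \|^2$$

Within SS

$$\mathrm{SSW} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x}_k \|^2$$

Between SS

$$\mathrm{SSB} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| \bar{x}_k - \bar{x} \|^2$$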

Total SS = Within SS + Between SS

$$
\begin{aligned}
\mathrm{SST} &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x} \|^2 \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| (x_{ki} - \bar{x}_k) + (\bar{x}_k - \bar{x}) \|^2 \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| x_{ki} - \bar{x}_k \|^2
 + 2 \sum_{k=1}^{K} \sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k)' (\bar{x}_k - \bar{x})
 + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \| \bar{x}_k - \bar{x} \|^2 \\
&= \mathrm{SSW} + \mathrm{SSB}
\end{aligned}
$$

The cross term vanishes because $\sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k) = 0$ for every cluster $k$.

Total SS = Within SS + Between SS

Goal of $K$-means: Minimize Within SS

Equivalent goal: Maximize Between SS
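The decomposition Total SS = Within SS + Between SS can be checked numerically. The sketch below is illustrative only: the helper names and the two small clusters are made up for the example, and the clustering is supplied by hand rather than produced by K-means.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sums_of_squares(clusters):
    """clusters: list of K lists of points. Returns (SST, SSW, SSB)."""
    all_points = [p for cluster in clusters for p in cluster]
    grand_mean = mean_point(all_points)
    sst = sum(sq_dist(p, grand_mean) for p in all_points)
    ssw = 0.0
    ssb = 0.0
    for cluster in clusters:
        center = mean_point(cluster)
        ssw += sum(sq_dist(p, center) for p in cluster)
        # Each of the N_k points in the cluster contributes the same between term.
        ssb += len(cluster) * sq_dist(center, grand_mean)
    return sst, ssw, ssb


# Hypothetical clustering of five points into K = 2 clusters.
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(4.0, 4.0), (5.0, 5.0)],
]
sst, ssw, ssb = sums_of_squares(clusters)
print(sst, ssw + ssb)
```

Since SST depends only on the data and not on how the points are grouped, any regrouping that lowers SSW must raise SSB by the same amount.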


A Problem with K-Means

• Different initial centroids can result in different clusterings

• Some choices of initial centroids may lead only to local minima.

• Possible solution: Repeat with randomly chosen initial centroids.

• Let m = number of repetitions. [Formula in m and K not recoverable from the transcript.]


Today's Topics

• Cluster Evaluation

• Unsupervised Evaluation Measures
  • SSW
  • Silhouette Coefficient

• Supervised Evaluation Measures
  • Entropy
  • Purity

• Significance Tests


Unsupervised Evaluation Measures

• Do not use class labels

• SSW = Within Sum of Squares

• Silhouette Coefficient


Interpreting SSW

• $\mathrm{SSW} \to 0$ as $K$ increases

• $\mathrm{SSW} = 0$ when $K$ = # of points in data set

• Solution: Look for an "elbow" in the plot of SSW vs. $K$

• Optimal value: $K = 3$


Silhouette Coefficient

1. For the $i$th data object, calculate its average distance to all other objects in its cluster. Call this value $a_i$.

2. For the $i$th data object and any cluster not containing that object, calculate the object's average distance to all the objects in the given cluster.

3. The minimum value from Step 2 is called $b_i$.

4. For the $i$th object, the silhouette coefficient is

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$


Silhouette Coefficient

• $-1 \le s_i \le 1$

• The silhouette coefficient for the clustering is the average of the silhouette coefficients:

$$s = \frac{1}{n} \sum_{i=1}^{n} s_i$$

• Silhouette coefficients near 1 indicate strong clustering.
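The four steps and the final average translate directly into code. The following is a small sketch under stated assumptions: Euclidean distance is used, every cluster is assumed to contain at least two points (so that $a_i$ is defined), and the function name and toy data are invented for illustration.

```python
import math


def silhouette_coefficients(points, labels):
    """Return the list of s_i values.

    Assumes at least two clusters and at least two points per cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, p in enumerate(points):
        own = labels[i]
        # a_i: average distance to the other objects in the same cluster.
        others = [j for j in clusters[own] if j != i]
        a = sum(math.dist(p, points[j]) for j in others) / len(others)
        # b_i: smallest average distance to the objects of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != own
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
s = silhouette_coefficients(points, labels)
overall = sum(s) / len(s)  # silhouette coefficient for the clustering
print(s, overall)
```

For this toy data every point is much closer to its own cluster than to the other one, so each $s_i$ comes out close to 1.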


Distance Matrix for a Data Set

• Suppose $X$ is $n \times p$

• $X_i$ = $i$th row of $X$

• Distance matrix $D$ is $n \times n$, with

$$D_{ij} = \| X_i - X_j \|_2$$
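As a concrete illustration, the distance matrix can be built row by row. This is a minimal sketch that assumes Euclidean distance and stores $X$ as a list of rows; the function name and the three sample rows are made up for the example.

```python
import math


def distance_matrix(X):
    """X: list of n rows, each with p numbers. Returns D as an n x n list of lists."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])  # Euclidean distance between rows i and j
            D[i][j] = d
            D[j][i] = d  # the matrix is symmetric
    return D


# Hypothetical 3 x 2 data matrix.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(X)
for row in D:
    print(row)
```

The diagonal is zero and the matrix is symmetric, so only the entries above the diagonal need to be computed.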


Statistical Significance of the Silhouette Coefficient

• Generate 100 uniform data sets with the same data range and sample size as the original data.

• Calculate the silhouette coefficient for each uniform sample.

• Find the percentile rank of the silhouette coefficient for the original data among the randomly generated ones.

• If the percentile rank exceeds the chosen significance threshold, there is statistically significant evidence of clustering (we can reject the null hypothesis of no clustering).


Supervised Evaluation Measures

• Entropy $= -\sum_{i=1}^{c} p_i \log_2 p_i$

• Purity $= \max_i [p_i]$

• Any classification metric (precision, recall, etc.)

$$\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$\text{Entropy(Cluster 1)} = -\left[ \tfrac{50}{53} \log_2\!\left(\tfrac{50}{53}\right) + \tfrac{3}{53} \log_2\!\left(\tfrac{3}{53}\right) + 0 \log_2(0) \right] = 0.314$$

$$\text{Entropy(Cluster 2)} = -\left[ 0 \log_2(0) + \tfrac{47}{97} \log_2\!\left(\tfrac{47}{97}\right) + \tfrac{50}{97} \log_2\!\left(\tfrac{50}{97}\right) \right] = 0.999$$

$$\text{Weighted Entropy} = \tfrac{53}{150}(0.314) + \tfrac{97}{150}(0.999) = 0.757$$

$$\text{Purity} = \max_i [p_i]$$

$$\text{Purity(Cluster 1)} = \tfrac{50}{53} = 0.943$$

$$\text{Purity(Cluster 2)} = \tfrac{50}{97} = 0.515$$

$$\text{Weighted Purity} = \tfrac{53}{150}\left(\tfrac{50}{53}\right) + \tfrac{97}{150}\left(\tfrac{50}{97}\right) = \tfrac{2}{3}$$
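The entropy and purity figures above can be reproduced from the class counts inside each cluster. The sketch below takes the counts (50, 3, 0) and (0, 47, 50) from the worked example; the function names are illustrative, and a zero count is treated as contributing 0 to the entropy (the usual convention for $0 \log_2 0$).

```python
import math


def entropy(counts):
    """Entropy (base 2) of the class distribution within one cluster."""
    total = sum(counts)
    # Zero counts are skipped: by convention 0 * log2(0) contributes 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def purity(counts):
    """Fraction of the cluster that belongs to its most common class."""
    return max(counts) / sum(counts)


def weighted(measure, table):
    """Average of a per-cluster measure, weighted by cluster size."""
    n = sum(sum(row) for row in table)
    return sum(sum(row) / n * measure(row) for row in table)


# Rows are clusters, columns are class counts (from the worked example).
table = [[50, 3, 0], [0, 47, 50]]

for k, row in enumerate(table, start=1):
    print(f"Cluster {k}: entropy={entropy(row):.3f}, purity={purity(row):.3f}")
print(f"Weighted entropy={weighted(entropy, table):.3f}")
print(f"Weighted purity={weighted(purity, table):.3f}")
```

Lower entropy and higher purity both indicate clusters that line up well with the known classes; here Cluster 1 is nearly pure, while Cluster 2 is split almost evenly between two classes.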


Today's Topics

• Chi-squared Test for Cluster Evaluation

• DBSCAN


Chi-square Test for Independence

               Engineering   Science and Tech   Business   Other   Totals
In State                16                 14         13      13       56
Out of State            14                  6         10       8       38
Totals                  30                 20         23      21       94

How can we test the independence of these two variables?


Chi-square Test for Independence

Column 1 Column 2 Column 3 … Column c Totals

Row 1 O11 O12 O13 … O1c R1

Row 2 O21 O22 O23 … O2c R2

… … … … … … …

Row r Or1 Or2 Or3 … Orc Rr

Totals C1 C2 C3 … Cc N

$p_i = \Pr(\text{Row } i)$

$p_j = \Pr(\text{Column } j)$

$H_0$: Rows/columns independent

$H_0$: $p_{ij} = p_i \, p_j$

$$\hat{p}_i = \frac{R_i}{N}, \qquad \hat{p}_j = \frac{C_j}{N}$$

Under $H_0$:

$$E_{ij} = N p_{ij} = N p_i \, p_j \approx N \, \frac{R_i}{N} \, \frac{C_j}{N} = \frac{R_i C_j}{N}$$

Define

$$E_{ij} = \frac{R_i C_j}{N}$$

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Under $H_0$, $\chi^2$ has an approximate chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom.

(Assuming $E_{ij} \ge 5$, for all $i, j$)


Chi-square Test for Independence

Observed

               Engineering   Science and Tech   Business   Other   Totals
In State                16                 14         13      13       56
Out of State            14                  6         10       8       38
Totals                  30                 20         23      21       94

$$E_{11} = \frac{R_1 C_1}{N} = \frac{56 \cdot 30}{94} = 17.87$$

Expected

               Engineering   Science and Tech   Business   Other   Totals
In State             17.87              11.91      13.70   12.51       56
Out of State         12.13               8.09       9.30    8.49       38
Totals                  30                 20         23      21       94

$$\chi^2 = \frac{(16 - 17.87)^2}{17.87} + \frac{(14 - 11.91)^2}{11.91} + \cdots + \frac{(8 - 8.49)^2}{8.49} = 1.52$$

$p$-value $= 0.68$

Do not reject the null hypothesis that rows and columns are independent.
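The expected counts, the test statistic, and the p-value for this table can be reproduced with the standard library alone. This is a sketch, not a general-purpose routine: the function names are illustrative, and the p-value helper `chi2_sf_df3` uses the closed form of the chi-square upper tail that holds only for 3 degrees of freedom, which is what this 2 x 4 table has.

```python
import math


def chi_square_independence(observed):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = R_i * C_j / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, stat, df


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Observed counts: rows are In State / Out of State; columns are the four majors.
observed = [[16, 14, 13, 13], [14, 6, 10, 8]]
expected, stat, df = chi_square_independence(observed)
p_value = chi2_sf_df3(stat)  # valid here because df == 3
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.2f}")
```

For tables with other dimensions the degrees of freedom differ, and a general chi-square tail function (for example from a statistics library) would be needed in place of `chi2_sf_df3`.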


DBSCAN

• Clustering Algorithm

• Density Based Spatial Clustering of Applications with Noise


DBSCAN: Parameters and Types of Points

• Requires two parameters:
  • Eps (must be chosen)
  • MinPts (default value = 5)

• Three types of points:
  • Core points: points with at least MinPts neighbors within their Eps neighborhood
  • Border points: not core points, but within the Eps neighborhood of a core point
  • Noise points: neither core points nor border points

• Example parameters: Eps = 0.2, MinPts = 5

• Core point: its Eps neighborhood contains at least MinPts points

• Border point: its Eps neighborhood contains fewer than MinPts points, but contains a core point

• Noise point: its Eps neighborhood contains fewer than MinPts points and contains no core points


DBSCAN Algorithm

• Identify all core points, border points, and noise points.

• Two core points within Eps of each other are assigned to the same cluster.

• Each border point is assigned to one of the clusters of its associated core points.

• Noise points are not assigned to clusters. They are simply classified as noise.
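The four steps can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance, counts a point as a member of its own Eps neighborhood (conventions differ on this), assigns a border point to the first cluster that reaches it, and marks noise with the label -1. The function name and the toy data are made up for the example.

```python
import math


def dbscan(points, eps, min_pts=5):
    """Return (labels, kinds); label -1 means noise."""
    n = len(points)
    # Eps neighborhood of each point (assumption: a point is its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [-1] * n  # everything is unassigned (noise) until a cluster claims it
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from core point i.
        labels[i] = cluster
        stack = [i]  # the stack only ever holds core points
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster  # core or border point within Eps of a core point
                    if is_core[j]:
                        stack.append(j)  # core points within Eps join and keep expanding
        cluster += 1

    kinds = [
        "core" if is_core[i] else ("border" if labels[i] != -1 else "noise")
        for i in range(n)
    ]
    return labels, kinds


# Hypothetical data: a tight square, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.3, 0.5), (10.0, 10.0)]
labels, kinds = dbscan(points, eps=1.0, min_pts=4)
print(labels)
print(kinds)
```

The pairwise neighborhood search is quadratic in the number of points, which is fine for a small illustration but would normally be replaced by a spatial index for larger data sets.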



Today's Topics

• Agglomerative Hierarchical Clustering


Hierarchical Clustering

Taxonomy of Living Organisms

Dendrogram


Agglomerative Hierarchical Clustering



Distances Between Clusters

Single Link

$$d(C_1, C_2) = \min\{ d(x, y) : x \in C_1,\ y \in C_2 \}$$

Complete Link

$$d(C_1, C_2) = \max\{ d(x, y) : x \in C_1,\ y \in C_2 \}$$

Average

$$d(C_1, C_2) = \frac{1}{|C_1| \, |C_2|} \sum_{x \in C_1,\ y \in C_2} d(x, y)$$
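The three cluster distances differ only in how the pairwise distances $d(x, y)$ are summarized. A minimal sketch, assuming Euclidean distance for $d$ and using made-up function names and clusters:

```python
import math
from itertools import product


def single_link(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(math.dist(x, y) for x, y in product(c1, c2))


def complete_link(c1, c2):
    """Largest distance between a point of c1 and a point of c2."""
    return max(math.dist(x, y) for x, y in product(c1, c2))


def average_link(c1, c2):
    """Mean of all |c1| * |c2| pairwise distances."""
    return sum(math.dist(x, y) for x, y in product(c1, c2)) / (len(c1) * len(c2))


# Hypothetical clusters on a line.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2))
```

In agglomerative clustering, the pair of clusters with the smallest such distance is merged at each step, so the choice among these three definitions changes which clusters get merged and at what height.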


Agglomerative Hierarchical Clustering

Heights = 1.0, 1.4, 3.0, 3.6, 5.6, 8.1, 13.0, 20.3


Today's Topics

• Gaussian Mixture EM Clustering


Setting for Gaussian Mixture EM Clustering

Prior distribution for $Y$ (p.m.f. for $Y$):

$$P(Y = y) = f(y), \quad \text{for } y = 1, 2, \ldots, c$$

Joint conditional distribution of the $X_j$'s given $Y$:

$$P(X_1 = x_1, \ldots, X_p = x_p \mid Y = y) = f(x_1, \ldots, x_p \mid y)$$

Assume the conditional distribution of $X$ given $Y = y$ is $N(\mu_y, \Sigma_y)$:

$$f(x \mid y) = (2\pi)^{-p/2} \, |\Sigma_y|^{-1/2} \exp\{ -\tfrac{1}{2} (x - \mu_y)' \Sigma_y^{-1} (x - \mu_y) \}$$

Posterior distribution for $Y$:

$$P(Y = y \mid X = x) = \frac{f(y) \, f(x \mid y)}{\sum_{y'=1}^{c} f(y') \, f(x \mid y')}$$

Parameter for model:

$$\theta = (f(y), \mu_y, \Sigma_y), \quad y = 1, \ldots, c$$

$$Y = (Y_1, \ldots, Y_n)' \in \{1, \ldots, c\}^n$$

$$X = (X_{ij}) \in \mathbb{R}^{n \times p}$$

$$L(\theta; Y, X) = \prod_{i=1}^{n} f(Y_i) \, f(X_i \mid Y_i)$$

Want to maximize this.

Problem: Don't know the $Y$'s.

Expectation Maximization (EM) Algorithm

E Step

$$Q(\theta \mid \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}} [\log L(\theta; Y, X)]$$

M Step

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$
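To make the E and M steps concrete, here is a small sketch for the one-dimensional case ($p = 1$), where each $\Sigma_y$ reduces to a variance. It is an illustration under stated assumptions rather than the course's code: the M step uses the well-known closed-form updates for a Gaussian mixture (posterior-weighted proportions, means, and variances), the variance floor of `1e-6` is an added safeguard, and the function names, starting values, and data are invented for the example.

```python
import math


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x (the p = 1 case of f(x | y))."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gaussian_mixture(xs, mus, variances, priors, n_iter=50):
    """EM for a 1-D Gaussian mixture. Returns (mus, variances, priors, posteriors)."""
    mus, variances, priors = list(mus), list(variances), list(priors)
    c, n = len(mus), len(xs)
    post = []
    for _ in range(n_iter):
        # E step: posterior P(Y = y | X = x_i) under the current parameters.
        post = []
        for x in xs:
            joint = [priors[y] * normal_pdf(x, mus[y], variances[y]) for y in range(c)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M step: closed-form maximizers of Q for a Gaussian mixture.
        for y in range(c):
            weight = sum(post[i][y] for i in range(n))
            priors[y] = weight / n
            mus[y] = sum(post[i][y] * xs[i] for i in range(n)) / weight
            var = sum(post[i][y] * (xs[i] - mus[y]) ** 2 for i in range(n)) / weight
            variances[y] = max(var, 1e-6)  # guard against a collapsing component
    return mus, variances, priors, post


# Hypothetical data: five values near 1 and five values near 5.
xs = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
mus, variances, priors, post = em_gaussian_mixture(
    xs, mus=[0.0, 6.0], variances=[1.0, 1.0], priors=[0.5, 0.5]
)
# Cluster label = component with the largest posterior probability.
labels = [max(range(2), key=lambda y: p[y]) for p in post]
print(mus, priors, labels)
```

Unlike K-means, each point receives a posterior probability for every component, and a hard cluster label is obtained only at the end by taking the most probable component.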



Further Reading

• Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B. 39 (1): 1–38.

• Ledolter, J. (2013). Data Mining and Business Analytics with R.

