Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli
Page 1: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Clustering Tutorial

Elias Raftopoulos

HY539 29/3/06

Prof. Maria Papadopouli

Page 2: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Roadmap

Math Reminder
Principal Components Analysis
Clustering
ANOVA

Page 3: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Standard Deviation

Statistics – analyzing data sets in terms of the relationships between the individual points

Standard Deviation is a measure of the spread of the data

Calculation: the square root of the average squared deviation from the mean (roughly, the typical distance of a data point from the mean)
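In symbols (the sample formula, using the n–1 denominator that also appears in the worked covariance example later):

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}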

Page 4: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Variance

Another measure of the spread of the data in a data set. Calculation:

Var(X) = E((X – μ)^2)

Why have both variance and SD to calculate the spread of data?

Variance is claimed to be the original statistical measure of the spread of data. However, its units are squared, e.g. cm^2, which is unnatural for expressing heights or other measurements. Hence the SD, the square root of the variance, was born.

Page 5: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Covariance

Variance – measure of the deviation from the mean for points in one dimension e.g. heights

Covariance is a measure of how much two dimensions vary from their means with respect to each other.

Covariance is measured between 2 dimensions to see if there is a relationship between the 2 dimensions e.g. number of hours studied & marks obtained

The covariance between one dimension and itself is the variance
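In the same notation as above, the sample covariance between two dimensions X and Y is:

\mathrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})

so that cov(X, X) = Var(X).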

Page 6: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Covariance Matrix

Representing covariance between dimensions as a matrix, e.g. for 3 dimensions:

    | cov(x,x) cov(x,y) cov(x,z) |
C = | cov(y,x) cov(y,y) cov(y,z) |
    | cov(z,x) cov(z,y) cov(z,z) |

The diagonal holds the variances of x, y and z. Since cov(x,y) = cov(y,x), the matrix is symmetric about the diagonal. n-dimensional data results in an n x n covariance matrix.
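As a quick illustration, a minimal numpy sketch (numpy is not part of the slides, and the x, y, z samples below are made up for the example):

import numpy as np

# hypothetical 3-dimensional data: one array per dimension
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])
z = np.array([1.0, 0.4, 0.9, 1.1, 1.3])

# np.cov treats each row as one variable and returns the 3 x 3 covariance matrix
C = np.cov(np.vstack([x, y, z]))
print(C)           # symmetric matrix
print(np.diag(C))  # the variances of x, y and z on the diagonal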

Page 7: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Covariance

The exact value is not as important as its sign. A positive covariance indicates that both dimensions increase or decrease together, e.g. as the number of hours studied increases, the marks in that subject increase.

A negative value indicates that as one dimension increases the other decreases, e.g. active social life at RIT vs. performance in the CS department.

If the covariance is zero, the two dimensions are uncorrelated (they show no linear relationship), e.g. heights of students vs. the marks obtained in a subject.

Page 8: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Transformation matrices

Consider:

| 2 3 |   | 3 |   | 12 |       | 3 |
| 2 1 | x | 2 | = |  8 | = 4 x | 2 |

The square transformation matrix transforms (3,2) from its original location. Now if we were to take a multiple of (3,2):

| 3 |       | 6 |
| 2 | x 2 = | 4 |

| 2 3 |   | 6 |   | 24 |       | 6 |
| 2 1 | x | 4 | = | 16 | = 4 x | 4 |

Page 9: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Transformation matrices

Scale the vector (3,2) by a factor of 2 to get (6,4), then multiply by the square transformation matrix. We see the result is still a multiple of 4. WHY?

A vector consists of both length and direction. Scaling a vector only changes its length, not its direction. This is an important observation about matrix transformations, and it leads to the notions of eigenvectors and eigenvalues.

Irrespective of how much we scale (3,2) by, the solution is always a multiple of 4.
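A minimal numpy sketch of this observation (numpy is not part of the slides; the matrix and vector are the ones above):

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([3.0, 2.0])

print(A @ v)        # [12.  8.]  = 4 * (3, 2)
print(A @ (2 * v))  # [24. 16.]  = 4 * (6, 4); scaling v does not change the factor 4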

Page 10: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

eigenvalue problem

The eigenvalue problem is any problem having the following form:

A . v = λ . v

A: n x n matrix
v: n x 1 non-zero vector
λ: scalar

Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to that value is called an eigenvector of A.

Page 11: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

eigenvalue problem

| 2 3 |   | 3 |   | 12 |       | 3 |
| 2 1 | x | 2 | = |  8 | = 4 x | 2 |

A . v = λ . v

Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A.

Given a matrix A, how can we calculate the eigenvectors and eigenvalues of A?

Page 12: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Calculating eigenvectors & eigenvalues

Given A . v = λ . v

A . v - λ . I . v = 0

(A - λ . I ). v = 0

Finding the roots of the characteristic equation |A - λ . I| = 0 gives the eigenvalues, and for each of these eigenvalues there is a corresponding eigenvector

Example …

Page 13: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Calculating eigenvectors & eigenvalues

If A = |  0  1 |
       | -2 -3 |

then |A - λ . I| = | 0-λ    1  |
                   | -2   -3-λ | = λ^2 + 3λ + 2 = 0

This gives us 2 eigenvalues:

λ1 = -1 and λ2 = -2
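For completeness (this step is not on the slide): substituting each eigenvalue back into (A - λ . I) . v = 0 gives eigenvectors proportional to (1, -1) for λ1 = -1 and (1, -2) for λ2 = -2. The result can also be checked numerically with a small numpy sketch (numpy may return the eigenvalues in a different order and scales the eigenvectors to unit length):

import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # approximately [-1., -2.]
print(eigenvectors)  # columns are the corresponding unit-length eigenvectors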

Page 14: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Properties of eigenvectors and eigenvalues

Note that irrespective of how much we scale (3,2) by, the solution is always a multiple of 4.

Eigenvectors can only be found for square matrices and not every square matrix has eigenvectors.

Given an n x n matrix, we can find at most n linearly independent eigenvectors

Page 15: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Roadmap

Principal Components Analysis
Clustering
ANOVA

Page 16: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA

Principal components analysis (PCA) is a technique that can be used to simplify a dataset.

It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.

PCA can be used for reducing dimensionality by eliminating the later principal components.

Page 17: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA

By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset.

This is the principal component. PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and in finding patterns in data of high dimension.

Page 18: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 1

Subtract the mean from each of the data dimensions: all the x values have the mean of x subtracted, and all the y values have the mean of y subtracted from them. This produces a data set whose mean is zero.

Subtracting the mean makes variance and covariance calculation easier by simplifying their equations. The variance and co-variance values are not affected by the mean value.

Page 19: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 1

DATA:

x y

2.5 2.4

0.5 0.7

2.2 2.9

1.9 2.2

3.1 3.0

2.3 2.7

2 1.6

1 1.1

1.5 1.6

1.1 0.9

ZERO MEAN DATA:

x y

.69 .49

-1.31 -1.21

.39 .99

.09 .29

1.29 1.09

.49 .79

.19 -.31

-.81 -.81

-.31 -.31

-.71 -1.01

Page 20: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 1

Page 21: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 2

Calculate the covariance matrix

cov = .616555556 .615444444

.615444444 .716555556

Since the non-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together.

Page 22: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 3

Calculate the eigenvectors and eigenvalues of the covariance matrix

eigenvalues = .0490833989

1.28402771

eigenvectors = -.735178656 -.677873399

.677873399 -.735178656

Page 23: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 3

Eigenvectors are plotted as diagonal dotted lines on the plot.
Note they are perpendicular to each other.
Note one of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line, but are off to the side of the main line by some amount.

Page 24: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 4

Reduce dimensionality and form the feature vector. The eigenvector with the highest eigenvalue is the principal component of the data set.

In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.

Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance.

Page 25: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 4

Now, if you like, you can decide to ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much.

With n dimensions in your data, you calculate n eigenvectors and eigenvalues, choose only the first p eigenvectors, and the final data set has only p dimensions.

Page 26: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 4

Feature Vector

FeatureVector = (eig1 eig2 eig3 … eign)

We can either form a feature vector with both of the eigenvectors:

-.677873399 -.735178656

-.735178656 .677873399

or, we can choose to leave out the smaller, less significant component and only have a single column:

- .677873399

- .735178656

Page 27: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 5

Deriving the new data

FinalData = RowFeatureVector x RowZeroMeanData

RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top.

RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension.
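Putting steps 1–5 together, here is a minimal, hedged sketch in Python/numpy using the example data from the earlier slides (numpy and the variable names are my own; the signs of the eigenvectors may differ from the slides, which does not affect the result):

import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.vstack([x, y])                       # 2 x 10: rows are dimensions

# STEP 1: subtract the mean of each dimension
mean = data.mean(axis=1, keepdims=True)
zero_mean = data - mean

# STEP 2: covariance matrix (rows are treated as variables)
cov = np.cov(zero_mean)

# STEP 3: eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

# STEP 4: order the eigenvectors by eigenvalue, highest first
order = np.argsort(eigenvalues)[::-1]
row_feature_vector = eigenvectors[:, order].T  # eigenvectors as rows, most significant on top

# STEP 5: derive the new data
final_data = row_feature_vector @ zero_mean    # data expressed in terms of the eigenvectors
print(final_data.T)                            # one transformed data item per row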

Page 28: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 5

[Figure: the same step viewed as a matrix factorisation, R = U S V^T, in which the samples-by-variables data matrix is decomposed into a small number of significant factors plus noise.]

Page 29: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 5

FinalData is the final data set, with data items in columns, and dimensions along rows.

What will this give us? It will give us the original data solely in terms of the vectors we chose.

We have changed our data from being in terms of the axes x and y , and now they are in terms of our 2 eigenvectors.

Page 30: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 5

FinalData transpose: dimensions along columns

x y

-.827970186 -.175115307

1.77758033 .142857227

-.992197494 .384374989

-.274210416 .130417207

-1.67580142 -.209498461

-.912949103 .175282444

.0991094375 -.349824698

1.14457216 .0464172582

.438046137 .0177646297

1.22382056 -.162675287

Page 31: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

PCA process –STEP 5

Page 32: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Reconstruction of original Data

If we reduced the dimensionality, obviously, when reconstructing the data we would lose those dimensions we chose to discard. In our example let us assume that we considered only the x dimension…
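Continuing the hedged numpy sketch from STEP 5 above: if all eigenvectors are kept, inverting the transformation recovers the original data exactly; if only the first component is kept, we get the best approximation from that single dimension.

# row_feature_vector, final_data and mean come from the STEP 5 sketch above
# (the eigenvector matrix is orthogonal, so its transpose is its inverse)
row_original = row_feature_vector.T @ final_data + mean      # exact reconstruction, all components

p = 1                                                        # keep only the principal component
approx = row_feature_vector[:p].T @ final_data[:p] + mean    # lossy reconstruction from x only
print(approx.T)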

Page 33: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Reconstruction of original Data

x

-.827970186

1.77758033

-.992197494

-.274210416

-1.67580142

-.912949103

.0991094375

1.14457216

.438046137

1.22382056

Page 34: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Roadmap

Principal Components Analysis
Clustering
ANOVA

Page 35: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

What is Cluster Analysis?

Cluster: a collection of data objects that are similar to the other objects in the same cluster (intraclass similarity) and dissimilar to the objects in other clusters (interclass dissimilarity).

Cluster analysis: a statistical method for grouping a set of data objects into clusters.

A good clustering method produces high-quality clusters, with high intraclass similarity and low interclass similarity. Clustering is unsupervised classification. It can be used as a stand-alone tool or as a preprocessing step for other algorithms.

Page 36: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Group objects according to their similarity

Cluster: a set of objects that are similar to each other and separated from the other objects.

Example: green/red data points were generated from two different normal distributions

Page 37: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Clustering data

Experiments/samples are given as the row and column vectors of an expression data matrix

Clustering may be applied either to the objects or to the experiments (regarded as vectors in R^n or R^o, respectively).

[Figure: an o x n object expression data matrix, with the o objects as rows and the n experiments as columns.]

Page 38: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Pattern matrix / Proximity matrix

Pattern matrix (n x p): n = number of objects, p = number of attributes

x11  ...  x1f  ...  x1p
 ...
xi1  ...  xif  ...  xip
 ...
xn1  ...  xnf  ...  xnp

Proximity matrix (n x n): d(i,j) = difference/dissimilarity between objects i and j

0
d(2,1)  0
d(3,1)  d(3,2)  0
  :       :
d(n,1)  d(n,2)  ...  0

Page 39: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Proximity matrix

Clustering methods require that an index of proximity, or alikeness, or affinity or association be established between pairs of patterns

A proximity index is either a similarity or a dissimilarity

The crucial problem in identifying clusters in data is to specify what proximity is and how to measure it

Page 40: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Proximity indices

A proximity index between the ith and kth patterns is denoted d(i,k) and must satisfy the following three properties:

1. (a) for a dissimilarity: d(i,i) = 0, for all i
   (b) for a similarity: d(i,i) ≥ max d(i,k), for all i

2. d(i,k) = d(k,i), for all (i,k)

3. d(i,k) ≥ 0, for all (i,k)

Page 41: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Different proximity measures

For two points whose difference vector is (4, 2):

r = 2 (Euclidean distance): (4^2 + 2^2)^(1/2) = 4.472

r = 1 (Manhattan distance): 4 + 2 = 6

r → ∞ (“sup” distance): max{4, 2} = 4
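These three are special cases of the Minkowski distance of order r; a minimal Python sketch (the helper function name is my own):

import numpy as np

def minkowski(a, b, r):
    """Minkowski distance of order r between points a and b."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(r):
        return d.max()                   # "sup" distance
    return (d ** r).sum() ** (1.0 / r)

a, b = (4.0, 2.0), (0.0, 0.0)            # difference vector (4, 2), as in the example above
print(minkowski(a, b, 2))                # 4.472...  (Euclidean)
print(minkowski(a, b, 1))                # 6.0       (Manhattan)
print(minkowski(a, b, np.inf))           # 4.0       ("sup" distance)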

Page 42: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

K-Means Clustering

The meaning of ‘K-means’

Why it is called ‘K-means’ clustering: K points are used to represent the clustering result; each point corresponds to the centre (mean) of a cluster

Each point is assigned to the cluster with the closest center point

The number K must be specified.

Basic algorithm:

Page 43: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in the following steps:

1. Partition objects into k non-empty subsets
2. Arbitrarily choose k points as initial centers
3. Assign each object to the cluster with the nearest seed point (center)
4. Calculate the mean of each cluster and update its seed point
5. Go back to Step 3; stop when no new assignments are made

Page 44: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

The K-Means Clustering Method (cntd)

The basic step of k-means clustering is simple. Iterate until stable (i.e. no object changes group), as sketched below:

1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance
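A minimal, hedged sketch of this loop in Python/numpy (not the slides’ code; the initialisation simply takes the first k points as centers, and the sketch assumes no cluster ever becomes empty):

import numpy as np

def k_means(points, k, max_iter=100):
    """Basic k-means: points is an (n, d) array, k is the number of clusters."""
    centroids = points[:k].astype(float).copy()        # arbitrary initial centers
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # distance of every object to every centroid, then assign to the nearest one
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update each centroid to the mean of the objects assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):       # stable: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids

# example: labels, centers = k_means(np.random.rand(100, 2), k=2)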

Page 45: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

The K-Means Clustering Method (cntd)

Page 46: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

The K-Means Clustering Results Example

[Figure: step-by-step K-means example with K = 2 on a 10 x 10 scatter plot. The panels show: arbitrarily choosing K objects as the initial cluster centers, assigning each object to the most similar center, updating the cluster means, reassigning objects, and updating the means again until no object changes cluster.]

Page 47: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Weaknesses of the K-Means Method

Unable to handle noisy data and outliers: very large or very small values can skew the mean.

Not suitable for discovering clusters with non-convex shapes.

Page 48: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Hierarchical Clustering

Start with every data point in a separate cluster

Keep merging the most similar pairs of data points/clusters until we have one big cluster left

This is called a bottom-up or agglomerative method

Page 49: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Hierarchical Clustering (cont.)

This produces a binary tree or dendrogram

The final cluster is the root and each data item is a leaf

The height of the bars indicates how close the items are

Page 50: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Hierarchical Clustering Demo

Page 51: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Levels of Clustering

Page 52: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Linkage in Hierarchical Clustering

We already know about distance measures between data items, but what about between a data item and a cluster or between two clusters?

We just treat a data point as a cluster with a single item, so our only problem is to define a linkage method between clusters

As usual, there are lots of choices…

Page 53: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Average Linkage

Definition: each cluster ci is associated with a mean vector μi, which is the mean of all the data items in the cluster.

The distance between two clusters ci and cj is then just d(μi, μj).

This is somewhat non-standard – this method is usually referred to as centroid linkage and average linkage is defined as the average of all pairwise distances between points in the two clusters

Page 54: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Single Linkage

The minimum of all pairwise distances between points in the two clusters

Tends to produce long, “loose” clusters

Page 55: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Complete Linkage

The maximum of all pairwise distances between points in the two clusters

Tends to produce very tight clusters

Page 56: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Distances between clusters (summary)

Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters.

Complete linkage: largest distance between points
Average linkage: average distance between points
Single linkage: smallest distance between points
Centroid: distance between centroids

Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long stretched clusters.

Page 57: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

EXAMPLE

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
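As a hedged illustration (scipy is not part of the slides), the example proximity matrix above can be fed directly to scipy's agglomerative clustering; scipy expects the pairwise distances in condensed form:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D", "E"]
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

condensed = squareform(D)                  # condensed (upper-triangle) distance vector
Z = linkage(condensed, method="single")    # try "complete" or "average" as well
print(Z)                                   # each row: the two clusters merged and their distance
# dendrogram(Z, labels=labels)             # draws the tree if matplotlib is available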

Page 58: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

More on Hierarchical Clustering Methods

Major advantages: conceptually very simple and easy to implement; it is the most commonly used technique.

Major weaknesses of agglomerative clustering methods: they do not scale well (time complexity of at least O(n^2), where n is the total number of objects), and they can never undo what was done previously, giving a high likelihood of getting stuck in local minima.

Page 59: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

Roadmap

Principal Components Analysis
Clustering
ANOVA

Page 60: Clustering Tutorial Elias Raftopoulos HY539 29/3/06 Prof. Maria Papadopouli.

(M)ANOVA

The analysis of variance technique in one-way analysis of variance (ANOVA) takes a set of grouped data and determines whether the mean of a variable differs significantly between groups.

Often there are multiple variables and you are interested in determining whether the entire set of means is different from one group to the next. There is a multivariate version of analysis of variance that can address that problem (MANOVA).
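A minimal, hedged sketch of a one-way ANOVA in Python using scipy (the three groups below are made-up illustrative samples, not data from the slides):

from scipy import stats

# hypothetical measurements of one variable in three groups
group1 = [4.1, 3.9, 4.3, 4.0]
group2 = [4.8, 5.1, 4.9, 5.0]
group3 = [3.9, 4.2, 4.0, 4.1]

f_statistic, p_value = stats.f_oneway(group1, group2, group3)
print(f_statistic, p_value)   # a small p-value suggests the group means differ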

