
The Broad Institute of MIT and Harvard Clustering.

Transcript
Page 1: The Broad Institute of MIT and Harvard Clustering.

Clustering

Page 2: The Broad Institute of MIT and Harvard Clustering.

Clustering Preliminaries

• Log2 transformation

• Row centering and normalization

• Filtering

Page 3: The Broad Institute of MIT and Harvard Clustering.

Log2 Transformation

Advantages of log2 transformation:

• The noise becomes independent of the mean, and equal fold changes have the same meaning across the dynamic range of the values: we would like dist(100, 200) = dist(1000, 2000).
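A minimal numpy sketch of the transform; the pseudocount of 1 is an illustrative assumption, not a value from the slides:

    import numpy as np

    # Log2-transform an expression matrix (genes x samples).
    # A pseudocount of 1 (an assumption) avoids log2(0) on zero entries.
    expr = np.array([[100.0, 200.0],
                     [1000.0, 2000.0]])
    log_expr = np.log2(expr + 1.0)

    # A two-fold change now has (almost) the same size anywhere in the
    # dynamic range: dist(100, 200) ~ dist(1000, 2000) ~ 1.0 on the log2 scale.
    print(log_expr[:, 1] - log_expr[:, 0])  # ~[0.99, 1.00]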

Page 4: The Broad Institute of MIT and Harvard Clustering.

Row Centering & Normalization

Given a row x: y = x - mean(x) (centering), then z = y / stdev(y) (normalization).
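A minimal numpy sketch of this transformation, applied row-wise (population standard deviation assumed):

    import numpy as np

    # Row-center and row-normalize a genes x samples matrix:
    # y = x - mean(x), z = y / stdev(y), applied to each row x.
    def center_and_normalize(X):
        Y = X - X.mean(axis=1, keepdims=True)   # row centering
        Z = Y / Y.std(axis=1, keepdims=True)    # row normalization
        return Z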

Page 5: The Broad Institute of MIT and Harvard Clustering.

Filtering genes

• Filtering is very important for unsupervised analysis, since many noisy genes can completely mask the structure in the data (a filter sketch follows the diagram below).

• After finding a hypothesis one can identify marker genes in a larger dataset via supervised analysis.

[Diagram: all genes → variation filtering → clustering; then supervised analysis / marker selection on a larger dataset]
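As referenced above, a hedged sketch of one common variation filter, keeping the most variable genes; the cutoff of 1000 genes is an assumption, not a value from the slides:

    import numpy as np

    # Keep the n_genes rows (genes) with the highest variance across samples;
    # low-variance "noisy" genes are dropped before unsupervised analysis.
    def variation_filter(X, n_genes=1000):
        variances = X.var(axis=1)
        keep = np.argsort(variances)[-n_genes:]   # most variable genes
        return X[np.sort(keep), :]                # preserve original row order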

Page 6: The Broad Institute of MIT and Harvard Clustering.

Clustering/Class Discovery

• Aim: Partition data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are “more similar”.

• Challenge: Not well defined. No single objective function / evaluation criterion

• Example: how many clusters are there? 2 + noise? 3 + noise? 20? Hierarchical: 2-3 + noise?

• One has to choose:
– a similarity/distance measure
– a clustering method
– how to evaluate the resulting clusters

Page 7: The Broad Institute of MIT and Harvard Clustering.

Clustering in GenePattern

• Representative-based: find representatives/centroids
– K-means: KMeansClustering
– Self-Organizing Maps (SOM): SOMClustering

• Bottom-up (agglomerative): HierarchicalClustering hierarchically unites clusters
– single linkage
– complete linkage
– average linkage

• Clustering-like:
– NMFConsensus
– PCA (Principal Components Analysis)

There is no BEST method! For easy problems, most methods work. Each algorithm has its own assumptions, strengths, and weaknesses.

Page 8: The Broad Institute of MIT and Harvard Clustering.

K-means Clustering

• Aim: Partition the data points into K subsets and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.
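In symbols, with μ_k the centroid (mean) of cluster C_k, K-means minimizes the within-cluster sum of squares:

    \min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
    \qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i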

Page 9: The Broad Institute of MIT and Harvard Clustering.


K-means: Algorithm

• Initialize centroids at random positions

• Iterate:
– assign each data point to its closest centroid
– move each centroid to the center of its assigned points

• Stop when converged

• Guaranteed to reach a local minimum

[Figure: K-means iterations 0, 1, and 2 with K = 3; centroids move until the assignments stabilize]
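A minimal numpy sketch of the iteration described above (Lloyd's algorithm); the seeding scheme and convergence test are assumptions, and empty clusters are not handled:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids at randomly chosen data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each data point to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the center of its assigned points.
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # converged
                break
            centroids = new_centroids
        return labels, centroids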

Page 10: The Broad Institute of MIT and Harvard Clustering.

K-means: Summary

• The result depends on the initial centroid positions
• Fast algorithm: only needs to compute distances from data points to centroids
• Must preset the number of clusters
• Fails for non-spherical distributions

Page 11: The Broad Institute of MIT and Harvard Clustering.

Hierarchical Clustering

[Figure: five points (1-5) merged step by step into a dendrogram with leaf order 5, 2, 4, 1, 3; the height of each join is the distance between the joined clusters]

Page 12: The Broad Institute of MIT and Harvard Clustering.

[Figure: the same five points and dendrogram as above]

The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).

Hierarchical clustering must define the distance between a newly merged cluster and the other clusters:

• Single linkage: the distance between the closest pair.
• Complete linkage: the distance between the farthest pair.
• Average linkage: the average distance between all pairs, or the distance between the cluster centers.

A linkage sketch in code follows below.
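A minimal scipy sketch of agglomerative clustering with these linkage rules; the three-cluster cut and the random data are illustrative assumptions:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hierarchically unite clusters; method can be 'single', 'complete',
    # or 'average', matching the linkage definitions above.
    X = np.random.default_rng(0).normal(size=(20, 5))   # 20 points, 5 features
    Z = linkage(X, method='average', metric='euclidean')
    labels = fcluster(Z, t=3, criterion='maxclust')     # cut the tree into 3 clusters
    # scipy.cluster.hierarchy.dendrogram(Z) draws the tree (needs matplotlib).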

Page 13: The Broad Institute of MIT and Harvard Clustering.

Average Linkage

[Figure: average-linkage clustering of leukemia samples and genes]

Page 14: The Broad Institute of MIT and Harvard Clustering.

Single and Complete Linkage

[Figure: single-linkage (left) and complete-linkage (right) clustering of leukemia samples and genes]

Page 15: The Broad Institute of MIT and Harvard Clustering.

Similarity/Distance Measures

Decide which samples/genes should be clustered together.

– Euclidean: the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula
– Pearson correlation: a parametric measure of the strength of linear dependence between two variables
– Absolute Pearson correlation: the absolute value of the Pearson correlation
– Spearman rank correlation: a non-parametric measure of dependence between two variables
– Uncentered correlation: the same as Pearson, but assuming the mean is 0
– Absolute uncentered correlation: the absolute value of the uncentered correlation
– Kendall's tau: a non-parametric measure of the degree of correspondence between two rankings
– City-block/Manhattan: the distance traveled to get from one point to the other if a grid-like path is followed

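Most of these measures are available off the shelf; a short Python sketch (scipy assumed, toy vectors for illustration):

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock
    from scipy.stats import pearsonr, spearmanr, kendalltau

    x = np.array([1.0, 2.0, 3.0, 5.0])
    y = np.array([2.0, 4.0, 5.0, 9.0])

    print(euclidean(x, y))       # "ruler" distance (Pythagorean formula)
    print(cityblock(x, y))       # Manhattan / grid-path distance
    print(pearsonr(x, y)[0])     # Pearson correlation (linear dependence)
    print(spearmanr(x, y)[0])    # Spearman rank correlation (non-parametric)
    print(kendalltau(x, y)[0])   # Kendall's tau (agreement of two rankings)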

Page 16: The Broad Institute of MIT and Harvard Clustering.

Reasonable Distance Measure

[Figure: expression profiles of Genes 1-4 across Samples 1-5]

Genes: close in distance → correlated.

Samples: Genes 1 and 2 have similar profiles, so they contribute similarly to the distance between Sample 1 and Sample 5.

Euclidean distance works reasonably for both samples and genes when computed on row-centered and normalized data.
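One way to see why this works: on row-centered, row-normalized vectors, squared Euclidean distance is a monotone function of Pearson correlation, ||x - y||^2 = 2n(1 - r). A quick numpy check:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=50), rng.normal(size=50)

    z = lambda v: (v - v.mean()) / v.std()   # row centering + normalization
    x, y = z(a), z(b)

    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    d2 = np.sum((x - y) ** 2)                # squared Euclidean distance
    print(np.isclose(d2, 2 * len(x) * (1 - r)))   # True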

Page 17: The Broad Institute of MIT and Harvard Clustering.

Pitfalls in Clustering

• Elongated clusters

• Filament

• Clusters of different sizes

Page 18: The Broad Institute of MIT and Harvard Clustering.

Compact Separated Clusters

• All methods work

Adapted from E. Domany

Page 19: The Broad Institute of MIT and Harvard Clustering.

Elongated Clusters

Single linkage succeeds in partitioning; average linkage fails.

Page 20: The Broad Institute of MIT and Harvard Clustering.

Filament

• Single linkage not robust

Adapted from E. Domany

Page 21: The Broad Institute of MIT and Harvard Clustering.

Filament with Point Removed

• Single linkage not robust

Adapted from E. Domany

Page 22: The Broad Institute of MIT and Harvard Clustering.

Two-way Clustering

• Two independent cluster analyses, one on genes and one on samples, are used to reorder the data (two-way clustering); a reordering sketch follows below.
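A minimal scipy sketch of the reordering; average linkage for both dimensions is an assumption:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, leaves_list

    # Cluster genes (rows) and samples (columns) independently, then
    # reorder the matrix by the two dendrograms' leaf orders.
    def two_way_order(X):
        row_order = leaves_list(linkage(X, method='average'))
        col_order = leaves_list(linkage(X.T, method='average'))
        return X[np.ix_(row_order, col_order)]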

Page 23: The Broad Institute of MIT and Harvard Clustering.

Hierarchical Clustering: Summary

• Results depend on the distance update method:
– single linkage: elongated clusters
– complete linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters; we return to this point in cluster validation

Page 24: The Broad Institute of MIT and Harvard Clustering.

Clustering Protocol

Page 25: The Broad Institute of MIT and Harvard Clustering.

Validating Number of Clusters

How do we know how many real clusters exist in the dataset?

Page 26: The Broad Institute of MIT and Harvard Clustering.

Consensus Clustering

[Diagram: original dataset → generate "perturbed" datasets D1, D2, ..., Dn → apply the clustering algorithm to each Di → Clustering 1, Clustering 2, ..., Clustering n → compute the consensus matrix and a dendrogram based on it]

Consensus matrix (samples s1 ... sn by s1 ... sn): the proportion of times two samples are clustered together.
• 1: the two samples always cluster together
• 0: the two samples never cluster together

Page 27: The Broad Institute of MIT and Harvard Clustering.

Consensus Clustering

[Diagram: the same resampling pipeline as above; the consensus matrix, reordered according to the dendrogram, shows blocks C1, C2, C3 of consistently co-clustering samples along the diagonal]

Page 28: The Broad Institute of MIT and Harvard Clustering.

Validation

• Aim: Measure agreement between clustering results on "perturbed" versions of the data (consistency / robustness analysis).

• Method:
– iterate N times: generate a "perturbed" version of the original dataset (by subsampling, resampling with repeats, or adding noise) and cluster the perturbed dataset
– calculate the fraction of iterations in which pairs of samples belong to the same cluster
– optimize the number of clusters K by choosing the value of K that yields the most consistent results

A sketch of this procedure is given below.
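A hedged sketch of the procedure above; K-means as the base clusterer and an 80% subsample rate are assumptions (the GenePattern ConsensusClustering module shown next packages the same idea):

    import numpy as np
    from sklearn.cluster import KMeans

    # Build a consensus matrix: the proportion of subsampled runs in which
    # each pair of samples lands in the same cluster.
    def consensus_matrix(X, k, n_iter=100, frac=0.8, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        together = np.zeros((n, n))   # times a pair co-clustered
        sampled = np.zeros((n, n))    # times a pair was sampled together
        for _ in range(n_iter):
            idx = rng.choice(n, size=int(frac * n), replace=False)
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
            same = (labels[:, None] == labels[None, :]).astype(float)
            together[np.ix_(idx, idx)] += same
            sampled[np.ix_(idx, idx)] += 1.0
        return together / np.maximum(sampled, 1.0)

    # Choose K by computing consensus_matrix for several values of K and
    # picking the one whose matrix is closest to clean 0/1 blocks.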

Page 29: The Broad Institute of MIT and Harvard Clustering.

Consensus Clustering in GenePattern

Page 30: The Broad Institute of MIT and Harvard Clustering.

Clustering Cookbook

• Reduce the number of genes by variation filtering
– use stricter parameters than for comparative marker selection
• Choose a method for cluster discovery (e.g. hierarchical clustering)
• Select a number of clusters
– check the sensitivity of the clusters to the filtering and clustering parameters
– validate on independent data sets
– internally test the robustness of the clusters with consensus clustering

