Clustering
Clustering Preliminaries
• Log2 transformation
• Row centering and normalization
• Filtering
Log2 Transformation
• Log2 transformation makes the noise independent of the mean, so that equal fold-changes have the same meaning across the whole dynamic range of the values.
– We would like dist(100, 200) = dist(1000, 2000).
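A minimal sketch of this effect in code: the same 2-fold change gives very different raw distances at low and high intensity, but identical distances after log2 transformation.

```python
import numpy as np

# Raw expression values: a 2-fold change at low and at high intensity.
low = np.array([100.0, 200.0])
high = np.array([1000.0, 2000.0])

# On the raw scale the two distances differ by a factor of 10.
raw_low = abs(low[0] - low[1])      # 100.0
raw_high = abs(high[0] - high[1])   # 1000.0

# After log2 transformation both 2-fold changes give the same distance.
log_low = abs(np.log2(low[0]) - np.log2(low[1]))    # 1.0
log_high = abs(np.log2(high[0]) - np.log2(high[1]))  # 1.0
```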
Row Centering & Normalization
x  →  y = x - mean(x)  →  z = y / stdev(y)
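The two steps above can be sketched as a row-wise operation on a genes × samples matrix (a sketch, not the GenePattern implementation):

```python
import numpy as np

def center_and_normalize(x):
    """Row-wise: subtract the mean, then divide by the standard deviation."""
    y = x - x.mean(axis=1, keepdims=True)          # y = x - mean(x)
    z = y / y.std(axis=1, ddof=1, keepdims=True)   # z = y / stdev(y)
    return z

X = np.array([[2.0, 4.0, 6.0],
              [10.0, 20.0, 30.0]])
Z = center_and_normalize(X)
# Each row of Z now has mean 0 and (sample) standard deviation 1.
```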
Filtering genes
• Filtering is very important for unsupervised analysis, since many noisy genes can completely mask the structure in the data.
• After finding a hypothesis one can identify marker genes in a larger dataset via supervised analysis.
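One simple form of variation filtering is to keep only the most variable genes; the sketch below does this with a hypothetical `top_n` cutoff (the actual threshold choice is up to the analyst):

```python
import numpy as np

def filter_by_variation(X, top_n):
    """Keep the top_n rows (genes) with the largest variance.

    X: genes x samples matrix.
    Returns the filtered matrix and the indices of the kept rows.
    """
    variances = X.var(axis=1)
    keep = np.argsort(variances)[::-1][:top_n]  # most variable first
    return X[keep], keep

X = np.array([[1.0, 1.0, 1.0],      # flat gene: filtered out
              [0.0, 10.0, 20.0],    # highly variable gene
              [2.0, 3.0, 4.0]])     # mildly variable gene
X_filtered, kept = filter_by_variation(X, top_n=2)
```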
[Diagram: starting from all genes, the data can go either to unsupervised clustering or to supervised analysis / marker selection]
Clustering/Class Discovery
• Aim: Partition the data (e.g. genes or samples) into sub-groups (clusters), such that points in the same cluster are more similar to each other than to points in other clusters.
• Challenge: Clustering is not well defined; there is no single objective function or evaluation criterion.
• Example: How many clusters? 2 + noise, 3 + noise, 20, or (hierarchical) 23 + noise?
• One has to choose:
– a similarity/distance measure
– a clustering method
– how to evaluate the resulting clusters
Clustering in GenePattern
• Representative-based: find representatives/centroids
– K-means: KMeansClustering
– Self-Organizing Maps (SOM): SOMClustering
• Bottom-up (agglomerative): HierarchicalClustering hierarchically unites clusters
– single linkage analysis
– complete linkage analysis
– average linkage analysis
• Clustering-like:
– NMFConsensus
– PCA (Principal Components Analysis)
There is no single best method! For easy problems, most of them work. Each algorithm has its own assumptions, strengths, and weaknesses.
K-means Clustering
• Aim: Partition the data points into K subsets and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.
K-means: Algorithm
• Initialize centroids at random positions
• Iterate:
– Assign each data point to its closest centroid
– Move each centroid to the center of its assigned points
• Stop when converged
• Guaranteed to reach a local minimum
[Figure: K-means iterations 0, 1, and 2 with K = 3]
K-means: Summary
• The result depends on the initial centroid positions
• Fast algorithm: only needs to compute distances from data points to centroids
• The number of clusters K must be preset
• Fails for non-spherical distributions
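The algorithm described above (Lloyd's iteration) can be sketched in a few lines of numpy; this is a minimal illustration, not the KMeansClustering module:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster becomes empty).
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: a local minimum of the sum of squared distances
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, cents = kmeans(X, k=2)
```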
Hierarchical Clustering
[Figure: five points (1–5) merged step by step into a dendrogram; the vertical axis shows the distance between joined clusters, and the leaves end up in the order 5, 2, 4, 1, 3]

The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).
Hierarchical Clustering

We need to define the distance between the new cluster and the other clusters:
• Single linkage: distance between the closest pair.
• Complete linkage: distance between the farthest pair.
• Average linkage: average distance between all pairs, or the distance between the cluster centers.
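The three linkage rules differ only in how pairwise distances between two clusters are aggregated, which the following sketch makes explicit:

```python
import numpy as np

def linkage_distance(A, B, method="single"):
    """Distance between clusters A and B (arrays of points, one per row)."""
    # All pairwise Euclidean distances between points of A and points of B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if method == "single":     # closest pair
        return d.min()
    if method == "complete":   # farthest pair
        return d.max()
    if method == "average":    # average over all pairs
        return d.mean()
    raise ValueError(f"unknown method: {method}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
# single: 2.0, complete: 5.0, average: 3.5
```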
Average Linkage
Leukemia samples and genes
Single and Complete Linkage
Single-linkage Complete-linkage
Leukemia samples and genes
Similarity/Distance Measures
Decide: which samples/genes should be clustered together
– Euclidean: the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula
– Pearson correlation: a parametric measure of the strength of linear dependence between two variables
– Absolute Pearson correlation: the absolute value of the Pearson correlation
– Spearman rank correlation: a non-parametric measure of dependence between two variables
– Uncentered correlation: same as Pearson, but assumes the mean is 0
– Absolute uncentered correlation: the absolute value of the uncentered correlation
– Kendall's tau: a non-parametric measure of the degree of correspondence between two rankings
– City-block/Manhattan: the distance traveled between two points when a grid-like path is followed
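The choice of measure matters: two profiles can be far apart in Euclidean distance yet perfectly correlated. A small sketch of the first two measures above:

```python
import numpy as np

def euclidean(x, y):
    """Ordinary straight-line distance (Pythagorean formula)."""
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 when the profiles are perfectly correlated."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 5  # same shape of profile, different scale and offset
# euclidean(x, y) is large, but pearson_distance(x, y) is 0:
# the two profiles rise and fall together.
```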
Reasonable Distance Measure
[Figure: expression profiles of Genes 1–4 across samples, from Sample 1 to Sample 5]

Genes: profiles that are close are correlated. Samples: similar profiles give Gene 1 and Gene 2 a similar contribution to the distance between Sample 1 and Sample 5.

Euclidean distance is therefore a reasonable measure on both samples and genes when applied to row-centered and normalized data.
Pitfalls in Clustering
• Elongated clusters
• Filament
• Clusters of different sizes
Compact Separated Clusters
• All methods work
Adapted from E. Domany
Elongated Clusters
Single linkage succeeds in partitioning the data; average linkage fails.
Filament
• Single linkage not robust
Adapted from E. Domany
Filament with Point Removed
• Single linkage not robust
Adapted from E. Domany
Two-way Clustering
• Two independent cluster analyses, one on genes and one on samples, are used to reorder the data (two-way clustering):
Hierarchical Clustering
• The result depends on the distance-update method:
– Single linkage: elongated clusters
– Complete linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure for choosing the clusters; we return to this point in cluster validation
Summary
Clustering Protocol
Validating Number of Clusters
How do we know how many real clusters exist in the dataset?
Consensus Clustering

[Diagram: the original dataset is perturbed into datasets D1, D2, …, Dn; the clustering algorithm is applied to each Di, giving Clustering1, Clustering2, …, Clusteringn; a consensus matrix over the samples s1 … sn is computed, and a dendrogram is built based on that matrix]

Consensus matrix: counts the proportion of runs in which two samples are clustered together.
• 1: the two samples always cluster together
• 0: the two samples never cluster together
The Broad Institute of MIT and Harvard
Consensus Clustering
[Figure: the consensus matrix reordered according to the dendrogram; consistently co-clustered samples (s1, s3, …, si) form three blocks C1, C2, C3 along the diagonal]

Ordering the consensus matrix according to the dendrogram makes well-supported clusters visible as blocks along the diagonal.
Validation
• Aim: Measure agreement between clustering results on “perturbed” versions of the data.
• Method:
– Iterate N times:
• Generate a “perturbed” version of the original dataset by subsampling, resampling with repeats, or adding noise
• Cluster the perturbed dataset
– Calculate the fraction of iterations in which pairs of samples belong to the same cluster
– Optimize the number of clusters K by choosing the value of K that yields the most consistent results
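The validation loop above can be sketched in a few lines, assuming a caller-supplied `cluster_fn` (hypothetical) that returns one integer label per sample; here the perturbation is subsampling:

```python
import numpy as np

def consensus_matrix(X, cluster_fn, n_runs=50, subsample=0.8, seed=0):
    """Fraction of runs in which each pair of samples is clustered together.

    X: samples x features matrix; cluster_fn(X) -> integer label per row.
    A pair that always co-clusters approaches 1; one that never does, 0.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))   # runs in which the pair co-clustered
    counted = np.zeros((n, n))    # runs in which both samples were drawn
    for _ in range(n_runs):
        # Perturb the dataset by subsampling without replacement.
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        labels = cluster_fn(X[idx])
        same = labels[:, None] == labels[None, :]
        together[np.ix_(idx, idx)] += same
        counted[np.ix_(idx, idx)] += 1
    return np.divide(together, counted, out=np.zeros_like(together),
                     where=counted > 0)

# Toy example: a trivial "clusterer" that splits on the sign of feature 0.
X = np.array([[-1.0], [-1.1], [1.0], [1.2]])
cluster_fn = lambda Z: (Z[:, 0] > 0).astype(int)
M = consensus_matrix(X, cluster_fn)
```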
Consistency / Robustness Analysis
Consensus Clustering in GenePattern
Clustering Cookbook
• Reduce the number of genes by variation filtering
– Use stricter parameters than for comparative marker selection
• Choose a method for cluster discovery (e.g. hierarchical clustering)
• Select a number of clusters
– Check the sensitivity of the clusters to the filtering and clustering parameters
– Validate on independent data sets
– Internally test the robustness of the clusters with consensus clustering