Model-based clustering and data transformations of gene expression data
Walter L. Ruzzo
University of Washington
UW CSE Computational Biology Group

Overview
• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions

Toy 2-d Clustering Example
K-Means
Hierarchical Average Link
Model-Based (If You Want)

Model-based clustering
• Gaussian mixture model:
– Assume each cluster is generated by a multivariate normal distribution
– Cluster k has parameters:
    • Mean vector: µk
    • Covariance matrix: Σk

Variance & Covariance
• Variance: $\mathrm{var}(x) = E\big((x - \bar{x})^2\big)$
• Covariance: $\mathrm{cov}(x, y) = E\big((x - \bar{x})(y - \bar{y})\big)$
• Correlation: $\mathrm{cor}(x, y) = \dfrac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$
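
Gaussian Distributions
• Univariate:
\[ \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}(x - \bar{x})^2 / \sigma^2} \]
• Multivariate:
\[ \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(x - \bar{x})^T \Sigma^{-1} (x - \bar{x})} \]
where Σ is the variance/covariance matrix:
\[ \Sigma_{i,j} = E\big((x_i - \bar{x}_i)(x_j - \bar{x}_j)\big) \]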
Variance/Covariance
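The definitions on the two preceding slides can be checked numerically. The following is a minimal sketch in plain Python, not code from the talk: the function names and the toy vectors `x` and `y` are illustrative assumptions, expectations are taken as population averages (dividing by the number of points), and the multivariate density is written out only for the n = 2 case so that the matrix inverse stays explicit.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """cov(x, y) = E((x - mean_x)(y - mean_y)), as a population average."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)

def variance(xs):
    """var(x) is the covariance of x with itself."""
    return covariance(xs, xs)

def correlation(xs, ys):
    """cor(x, y) = cov(x, y) / (sigma_x * sigma_y)."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

def univariate_normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bivariate_normal_pdf(point, mu, sigma):
    """Multivariate normal density for n = 2; sigma is a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    dx = [point[0] - mu[0], point[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)

# Toy data (hypothetical values chosen for easy arithmetic).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

# Variance/covariance matrix of the pair (x, y).
sigma = [[variance(x), covariance(x, y)],
         [covariance(x, y), variance(y)]]
mu = [mean(x), mean(y)]
density_at_mean = bivariate_normal_pdf(mu, mu, sigma)
```

With a diagonal covariance matrix the bivariate density factors into the product of two univariate densities, which is a convenient sanity check on the formula.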

Covariance models (Banfield & Raftery 1993)
Σk = λk Dk Ak Dk^T, where λk controls volume, Dk orientation, and Ak shape
• Equal volume spherical model (EI), ~ k-means: Σk = λI
• Unequal volume spherical model (VI): Σk = λk I
• Diagonal model: Σk = λk Bk, where Bk is diagonal, |Bk| = 1
• EEE elliptical model: Σk = λ D A D^T
• Unconstrained model (VVV): Σk = λk Dk Ak Dk^T
Models further down the list are more flexible, but have more parameters.
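To make the flexibility versus parameter-count trade-off concrete, here is a small sketch that counts the free covariance parameters implied by each parameterization above. The function name and the dictionary keys are illustrative, and the counts cover the covariance matrices only (cluster means and mixing proportions would add further parameters). The example dimensions, 4 clusters in 24 dimensions, are borrowed from the ovary data described later in the talk.

```python
def covariance_parameter_count(model, n_clusters, n_dims):
    """Number of free covariance parameters for each model on the slide."""
    full = n_dims * (n_dims + 1) // 2  # entries of one symmetric matrix
    counts = {
        "EI": 1,                        # one shared lambda
        "VI": n_clusters,               # one lambda per cluster
        "diagonal": n_clusters * n_dims,  # lambda_k plus diagonal B_k with |B_k| = 1
        "EEE": full,                    # one shared full covariance matrix
        "VVV": n_clusters * full,       # one full covariance matrix per cluster
    }
    return counts[model]

for name in ["EI", "VI", "diagonal", "EEE", "VVV"]:
    print(name, covariance_parameter_count(name, n_clusters=4, n_dims=24))
```

For these dimensions the counts grow from 1 (EI) to 1200 (VVV), which is why an unconstrained model can be hard to fit on a few hundred observations.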

EM algorithm
• General approach to maximum likelihood
• Iterate between E and M steps:
– E step: compute the probability of each observation belonging to each cluster using the current parameter estimates
– M step: estimate model parameters using the current group membership probabilities
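The two steps above can be sketched for the simplest case, a one-dimensional Gaussian mixture. This is an illustration under stated assumptions rather than the MBC software used in the talk: the function names, the toy data, the starting values, the fixed iteration count, and the small variance floor are all choices made here for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gaussian_mixture(data, means, variances, weights, n_iter=50):
    """Run a fixed number of EM iterations for a 1-d Gaussian mixture."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E step: probability of each observation belonging to each cluster,
        # given the current parameter estimates.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate parameters from the membership probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate cluster
    return means, variances, weights, resp

# Toy data: two well-separated groups (hypothetical values).
data = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
fitted_means, fitted_vars, fitted_weights, memberships = em_gaussian_mixture(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

A practical implementation would stop when the log-likelihood stops improving instead of running a fixed number of iterations, and would handle multivariate data with one of the covariance models listed earlier.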

Advantages of model-based clustering
• Higher quality clusters
• Flexible models
• Model selection: a principled way to choose the right model and the right number of clusters
– Bayesian Information Criterion (BIC):
    • Approximate Bayes factor: posterior odds for one model against another model
    • Roughly: data likelihood, penalized for number of parameters
– A large BIC score indicates strong evidence for the corresponding model.

Definition of the BIC score
• The integrated likelihood p(D|Mk) is hard to evaluate, where D is the data and Mk is the model.
• BIC is an approximation to log p(D|Mk)
• υk: number of parameters to be estimated in model Mk
\[ 2 \log p(D \mid M_k) \approx 2 \log p(D \mid \hat{\theta}_k, M_k) - \upsilon_k \log(n) = BIC_k \]
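As a worked illustration of the BIC formula, the sketch below scores two candidate models. The function name is an assumption, n is taken to be the number of observations, and the log-likelihoods and parameter counts are made-up numbers chosen only to show the penalty at work; they are not results from the talk.

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """BIC_k = 2 log p(D | theta_hat_k, M_k) - upsilon_k log(n); larger is better."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

# Hypothetical fits to n = 235 observations.
bic_simple = bic_score(log_likelihood=-1000.0, n_params=10, n_obs=235)
bic_complex = bic_score(log_likelihood=-990.0, n_params=50, n_obs=235)

preferred = "simple" if bic_simple > bic_complex else "complex"
print(bic_simple, bic_complex, preferred)
```

Here the more complex model has the higher likelihood, but its 40 extra parameters cost more than the gain in fit, so the simpler model gets the larger BIC score.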

Overview
• Motivation
• Model-based clustering
• Validation
– Methodology
– Data Sets
– Results
• Summary and Conclusions

Validation Methodology
• Compare on data sets with external criteria (BIC scores do not require the external criteria)
• To compare clusters with external criterion:
– Adjusted Rand index (Hubert and Arabie 1985)
– Adjusted Rand index = 1: perfect agreement
– 2 random partitions have an expected index of 0
• Compare quality of clusters to those from:
– a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
– k-Means (EI).

Gene expression data sets
• Ovarian cancer data set (Michel Schummer, Institute of Systems Biology)
– Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
– 235 clones correspond to 4 genes
• Yeast cell cycle data (Cho et al 1998)
– 17 time points
– Subset of 384 genes associated with 5 phases of cell cycle

Synthetic data sets
Both based on ovary data
• Randomly resampled ovary data
– For each class, randomly sample the expression levels in each experiment, independently
– Near diagonal covariance matrix
• Gaussian mixture
– Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data
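The first construction can be sketched as follows. The slide does not say exactly how the sampling was done, so this assumes sampling with replacement from the observed values of one class, separately for each experiment; the function name and the toy matrix are hypothetical. Because each experiment is drawn independently, any correlation between experiments within a class is broken, which is what yields a near diagonal covariance matrix.

```python
import random

def resample_class(rows, n_samples, rng):
    """Build synthetic profiles for one class.

    rows: expression profiles of the class (one list per clone,
          one entry per experiment).
    Each entry of a synthetic profile is drawn independently from the
    values observed for that experiment within the class.
    """
    n_experiments = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_experiments)]
    return [[rng.choice(columns[j]) for j in range(n_experiments)]
            for _ in range(n_samples)]

# Toy class with 3 clones and 3 experiments (hypothetical values).
class_a = [[1.0, 10.0, 100.0],
           [2.0, 20.0, 200.0],
           [3.0, 30.0, 300.0]]
synthetic = resample_class(class_a, n_samples=5, rng=random.Random(0))
```

The Gaussian mixture construction would instead draw each class from a multivariate normal distribution with that class's sample mean vector and covariance matrix; it is not shown here.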

Results: randomly resampled ovary data
• Diagonal model achieves max BIC score (~expected)
• max BIC at 4 clusters (~expected)
• max adjusted Rand
• beats CAST
[Figure: BIC vs. number of clusters; models: EI, VI, diagonal, EEE]
[Figure: adjusted Rand vs. number of clusters; EI, VI, VVV, diagonal, CAST, EEE]

Results: square root ovary data
• Adjusted Rand: max at EEE 4 clusters (> CAST)
• BIC analysis:
– EEE and diagonal models local max at 4 clusters
– Global max VI at 8 clusters (8 ≈ split of 4).
[Figure: BIC vs. number of clusters; models: EI, VI, diagonal, EEE]
[Figure: adjusted Rand vs. number of clusters; EI, VI, VVV, diagonal, CAST, EEE]

Results: standardized yeast cell cycle data
• Adjusted Rand: EI slightly > CAST at 5 clusters.
• BIC: selects EEE at 5 clusters.
[Figure: adjusted Rand vs. number of clusters; EI, VI, VVV, diagonal, CAST, EEE]
[Figure: BIC vs. number of clusters; models: EI, VI, diagonal, EEE]

Overview
• Motivation
• Model-based clustering
• Validation
• Importance of Data Transformation
• Summary and Conclusions
log yeast cell cycle data
Standardized yeast cell cycle data
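The transformations named in the talk (log, square root, standardization) can be sketched per expression profile as below. The details are assumptions, since the slides do not spell them out: the natural log is used, and standardization is taken to mean shifting each gene's profile to mean 0 and scaling it to standard deviation 1 using the population standard deviation. The function names and the toy profile are illustrative.

```python
import math

def log_transform(profile):
    """Natural log of each expression level (values must be positive)."""
    return [math.log(v) for v in profile]

def sqrt_transform(profile):
    """Square root of each expression level (values must be non-negative)."""
    return [math.sqrt(v) for v in profile]

def standardize(profile):
    """Rescale one profile to mean 0 and standard deviation 1."""
    m = sum(profile) / len(profile)
    sd = math.sqrt(sum((v - m) ** 2 for v in profile) / len(profile))
    return [(v - m) / sd for v in profile]

# Toy profile for one gene across four experiments (hypothetical values).
profile = [1.0, 4.0, 9.0, 16.0]
logged = log_transform(profile)
rooted = sqrt_transform(profile)
standardized = standardize(profile)
```

Which transformation is appropriate depends on the data set; the talk's point is that the choice affects how close the data are to Gaussian and therefore how well the mixture model fits.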

Summary and Conclusions
• Synthetic data sets:
– With the correct model, model-based clustering is better than a leading heuristic clustering algorithm
– BIC selects the right model & right number of clusters
• Real expression data sets:
– Comparable adjusted Rand indices to CAST
– BIC gives a good hint as to the number of clusters
• Appropriate data transformations increase normality & cluster quality (See paper & web.)

Acknowledgements
• Ka Yee Yeung (1), Chris Fraley (2,4), Alejandro Murua (4), Adrian E. Raftery (2)
• Michèl Schummer (5) – the ovary data
• Jeremy Tantrum (2) – help with MBC software (diagonal model)
• Chris Saunders (3) – CRE & noise model
(1) Computer Science & Engineering
(2) Statistics
(3) Genome Sciences
(4) Insightful Corporation
(5) Institute of Systems Biology
More Info: http://www.cs.washington.edu/homes/ruzzo

Adjusted Rand Example

              c#1(4)  c#2(5)  c#3(7)  c#4(4)
class#1(2)      2       0       0       0
class#2(3)      0       0       0       3
class#3(5)      1       4       0       0
class#4(10)     1       1       7       1

\[ a = \binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{7}{2} = 31 \]
\[ b = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} + \binom{10}{2} - a = 59 - 31 = 28 \]
\[ c = \binom{4}{2} + \binom{5}{2} + \binom{7}{2} + \binom{4}{2} - a = 43 - 31 = 12 \]
\[ d = \binom{20}{2} - a - b - c = 119 \]
\[ \text{Rand: } R = \frac{a + d}{a + b + c + d} = 0.789 \]
\[ \text{Adjusted Rand} = \frac{R - E(R)}{1 - E(R)} = 0.469 \]
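The example can be reproduced with a few lines of Python. This is a sketch, not the code used for the talk: the function name is an assumption, and the adjusted index is computed with the closed form of Hubert and Arabie (1985), which is algebraically equivalent to (R − E(R)) / (1 − E(R)). It needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def rand_indices(table):
    """Rand and adjusted Rand index from a class-by-cluster contingency table."""
    n = sum(sum(row) for row in table)
    # a: pairs of objects that share both a class and a cluster.
    a = sum(comb(v, 2) for row in table for v in row)
    row_pairs = sum(comb(sum(row), 2) for row in table)        # same class
    col_pairs = sum(comb(sum(col), 2) for col in zip(*table))  # same cluster
    b = row_pairs - a
    c = col_pairs - a
    d = comb(n, 2) - a - b - c
    rand = (a + d) / (a + b + c + d)
    # Expected value of a when both partitions are random with fixed sizes.
    expected_a = row_pairs * col_pairs / comb(n, 2)
    adjusted = (a - expected_a) / (0.5 * (row_pairs + col_pairs) - expected_a)
    return rand, adjusted

# Contingency table from the slide: rows are classes, columns are clusters.
table = [
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [1, 4, 0, 0],
    [1, 1, 7, 1],
]
rand, adjusted = rand_indices(table)
print(round(rand, 3), round(adjusted, 3))  # 0.789 0.469
```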