Post on 21-Mar-2020

Model-based clustering and data transformations of gene expression data

Walter L. Ruzzo
University of Washington

UW CSE Computational Biology Group

2

Overview

• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions

3

Toy 2-d Clustering Example

?

4

K-Means

5

Hierarchical Average Link

6

Model-Based (If You Want)

7

Overview

• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions

8

Model-based clustering

• Gaussian mixture model:
– Assume each cluster is generated by a multivariate normal distribution
– Cluster k has parameters:
  • Mean vector: µk
  • Covariance matrix: Σk
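
As a rough illustration of what "generated by a multivariate normal distribution" means in practice, the sketch below evaluates the density of a single cluster with mean vector µk and covariance matrix Σk. The function name `gaussian_pdf` is hypothetical and not part of the talk or of any MBC software; it assumes NumPy is available.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of N(mean, cov) at point x (illustrative helper, not from the talk)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    n = mean.size
    diff = x - mean
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    normaliser = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(cov))
    return float(np.exp(-0.5 * mahalanobis_sq) / normaliser)

# A point at the cluster mean of a 2-d unit-covariance cluster.
density_at_mean = gaussian_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
```

In a mixture, each cluster contributes such a density weighted by its mixing proportion.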

9

Model-based clustering (continued)

[Figure: two Gaussian clusters, labelled with means µ1, µ2 and standard deviations σ1, σ2]

10

Variance & Covariance

• Variance: $\mathrm{var}(x) = E\big((x - \bar{x})^2\big)$

• Covariance: $\mathrm{cov}(x,y) = E\big((x - \bar{x})(y - \bar{y})\big)$

• Correlation: $\mathrm{cor}(x,y) = \dfrac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}$

11

Gaussian Distributions

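• Univariate: $\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}(x-\bar{x})^2/\sigma^2}$

• Multivariate: $\dfrac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(x-\bar{x})^T \Sigma^{-1} (x-\bar{x})}$

where Σ is the variance/covariance matrix:

$\Sigma_{i,j} = E\big((x_i - \bar{x}_i)(x_j - \bar{x}_j)\big)$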

12

Variance/Covariance
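
A small numerical check can make the variance/covariance matrix concrete. The sketch below builds Σ entry by entry from the expectation definition and derives the correlation matrix from it. The function names and the toy data are made up for illustration, and expectations are taken as plain averages over the rows (dividing by the number of observations, not by one less).

```python
import numpy as np

def covariance_matrix(X):
    """Sigma[i, j] = E((x_i - mean_i) * (x_j - mean_j)); rows of X are observations."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

def correlation_matrix(X):
    """cor(x_i, x_j) = cov(x_i, x_j) / (sigma_i * sigma_j)."""
    sigma = covariance_matrix(X)
    std = np.sqrt(np.diag(sigma))
    return sigma / np.outer(std, std)

# Toy data (hypothetical): 4 observations of 2 strongly correlated variables.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
sigma = covariance_matrix(X)
rho = correlation_matrix(X)
```

The diagonal of `sigma` holds the variances, and the off-diagonal entries hold the covariances.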

13

Covariance models (Banfield & Raftery 1993)

$\Sigma_k = \lambda_k D_k A_k D_k^T$, where $\lambda_k$ controls volume, $D_k$ orientation, and $A_k$ shape.

• Equal volume spherical model (EI), ~ k-means: $\Sigma_k = \lambda I$

15

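Covariance models (Banfield & Raftery 1993)

$\Sigma_k = \lambda_k D_k A_k D_k^T$, where $\lambda_k$ controls volume, $D_k$ orientation, and $A_k$ shape.

• Equal volume spherical model (EI), ~ k-means: $\Sigma_k = \lambda I$
• Unequal volume spherical (VI): $\Sigma_k = \lambda_k I$
• Diagonal model: $\Sigma_k = \lambda_k B_k$, where $B_k$ is diagonal, $|B_k| = 1$
• EEE elliptical model: $\Sigma_k = \lambda D A D^T$
• Unconstrained model (VVV): $\Sigma_k = \lambda_k D_k A_k D_k^T$

Going down the list: more flexible, but more parameters.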
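
To make the flexibility/parameter trade-off concrete, the sketch below counts free parameters for each covariance model, given G clusters in d dimensions. The counts follow from the constraints listed on the slide (for the diagonal model, one volume plus d − 1 free diagonal entries per cluster because |Bk| = 1). The function names are illustrative assumptions, not part of any MBC software.

```python
def covariance_param_count(model, n_clusters, dim):
    """Free covariance parameters for each model on the slide (illustrative)."""
    G, d = n_clusters, dim
    full = d * (d + 1) // 2  # entries of one symmetric d x d matrix
    counts = {
        "EI": 1,             # one shared lambda
        "VI": G,             # one lambda per cluster
        "diagonal": G * d,   # per cluster: lambda_k plus d - 1 free entries of B_k
        "EEE": full,         # one shared full covariance matrix
        "VVV": G * full,     # one full covariance matrix per cluster
    }
    return counts[model]

def total_param_count(model, n_clusters, dim):
    """Mixing proportions (G - 1) + mean vectors (G * d) + covariance parameters."""
    return (n_clusters - 1) + n_clusters * dim + covariance_param_count(
        model, n_clusters, dim
    )

# Example: 4 clusters in 24 dimensions.
for name in ["EI", "VI", "diagonal", "EEE", "VVV"]:
    print(name, total_param_count(name, 4, 24))
```

With 4 clusters in 24 dimensions the totals range from 100 (EI) to 1299 (VVV), which is why the unconstrained model needs far more data to fit reliably.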

16

EM algorithm

• General approach to maximum likelihood
• Iterate between E and M steps:
– E step: compute the probability of each observation belonging to each cluster using the current parameter estimates
– M step: estimate model parameters using the current group membership probabilities
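
The two steps above can be sketched for a Gaussian mixture as follows. This is a minimal illustration, not the MBC software used in the talk. It assumes an unconstrained (VVV-style) covariance per cluster, random initial means drawn from the data, and a small ridge term `reg` added to each covariance for numerical stability. All names are hypothetical.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of each row of X under N(mean, cov)."""
    d = X.shape[1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (X - mean).T)        # shape (d, n)
    mahalanobis_sq = np.sum(z ** 2, axis=0)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + mahalanobis_sq)

def em_gmm(X, n_clusters, n_iter=100, tol=1e-6, reg=1e-6, seed=0):
    """Fit a Gaussian mixture by EM; returns weights, means, covs, resp, loglik."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=n_clusters, replace=False)].copy()
    covs = np.array([np.cov(X, rowvar=False) + reg * np.eye(d)] * n_clusters)
    weights = np.full(n_clusters, 1.0 / n_clusters)
    prev = -np.inf
    for _ in range(n_iter):
        # E step: membership probability of each observation in each cluster.
        log_p = np.column_stack([
            np.log(weights[k]) + log_gaussian(X, means[k], covs[k])
            for k in range(n_clusters)
        ])
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        loglik = log_norm.sum()   # log-likelihood under the current parameters
        # M step: re-estimate parameters from the membership probabilities.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for k in range(n_clusters):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
        if loglik - prev < tol:
            break
        prev = loglik
    return weights, means, covs, resp, loglik
```

Hard cluster assignments, if wanted, can be read off as `resp.argmax(axis=1)`. EM only finds a local maximum, so results depend on the initialisation.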

17

Advantages of model-based clustering

• Higher quality clusters
• Flexible models
• Model selection: a principled way to choose the right model and the right number of clusters
– Bayesian Information Criterion (BIC):
  • Approximate Bayes factor: posterior odds for one model against another model
  • Roughly: data likelihood, penalized for number of parameters
– A large BIC score indicates strong evidence for the corresponding model.

18

Definition of the BIC score

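• The integrated likelihood $p(D \mid M_k)$ is hard to evaluate, where $D$ is the data and $M_k$ is the model.

• BIC is an approximation to $2 \log p(D \mid M_k)$

• $\upsilon_k$: number of parameters to be estimated in model $M_k$

$2 \log p(D \mid M_k) \approx 2 \log p(D \mid \hat{\theta}_k, M_k) - \upsilon_k \log(n) = \mathrm{BIC}_k$

Here $\hat{\theta}_k$ is the maximum-likelihood estimate of the parameters of $M_k$ and $n$ is the number of observations.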
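
The BIC formula is simple to apply once a model has been fitted: take twice the maximised log-likelihood and subtract the parameter count times log(n). The sketch below shows this and picks the candidate with the largest score. The function names and the example numbers are made up for illustration and are not results from the talk.

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """BIC_k = 2 * log p(D | theta_hat_k, M_k) - upsilon_k * log(n)."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

def select_model(candidates, n_obs):
    """candidates maps a label to (maximised log-likelihood, parameter count)."""
    scores = {
        label: bic_score(loglik, n_params, n_obs)
        for label, (loglik, n_params) in candidates.items()
    }
    best = max(scores, key=scores.get)  # larger BIC = stronger evidence
    return best, scores

# Hypothetical fits on 100 observations: B fits slightly better but uses
# three times as many parameters, so the penalty outweighs the gain.
best, scores = select_model({"A": (-100.0, 10), "B": (-95.0, 30)}, n_obs=100)
```

In this toy comparison `best` is "A".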

19

Overview

• Motivation
• Model-based clustering
• Validation
– Methodology
– Data Sets
– Results
• Summary and Conclusions

20

Validation Methodology

• Compare on data sets with external criteria (BIC scores do not require the external criteria)
• To compare clusters with an external criterion:
– Adjusted Rand index (Hubert and Arabie 1985)
– Adjusted Rand index = 1: perfect agreement
– Two random partitions have an expected index of 0
• Compare quality of clusters to those from:
– a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
– k-means (EI)

21

Gene expression data sets

• Ovarian cancer data set (Michel Schummer, Institute of Systems Biology)
– Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
– The 235 clones correspond to 4 genes

• Yeast cell cycle data (Cho et al. 1998)
– 17 time points
– Subset of 384 genes associated with 5 phases of the cell cycle

22

Synthetic data sets

Both based on ovary data

• Randomly resampled ovary data
– For each class, randomly sample the expression levels in each experiment, independently
– Near diagonal covariance matrix

• Gaussian mixture
– Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data
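
The two constructions can be sketched as below. This is an illustration only: the slide does not say whether resampling is with or without replacement, so sampling with replacement is an assumption here, and the function names are hypothetical. `X` has one row per clone and one column per experiment, and `labels` gives each row's class.

```python
import numpy as np

def resample_within_class(X, labels, rng):
    """For each class and each experiment (column), draw values independently
    from that class's observed values (with replacement: an assumption)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = np.empty_like(X)
    for c in np.unique(labels):
        rows = np.flatnonzero(labels == c)
        for j in range(X.shape[1]):
            out[rows, j] = rng.choice(X[rows, j], size=rows.size, replace=True)
    return out

def gaussian_mixture_like(X, labels, rng):
    """For each class, draw from a multivariate normal with that class's
    sample mean vector and sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = np.empty_like(X)
    for c in np.unique(labels):
        rows = np.flatnonzero(labels == c)
        mean = X[rows].mean(axis=0)
        cov = np.atleast_2d(np.cov(X[rows], rowvar=False))
        out[rows] = rng.multivariate_normal(mean, cov, size=rows.size)
    return out
```

Sampling each experiment independently breaks the dependence between experiments within a class, which is why the resampled data has a near diagonal covariance matrix.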

23

Results: randomly resampled ovary data

• Diagonal model achieves max BIC score (~expected)
• Max BIC at 4 clusters (~expected)
• Max adjusted Rand
• Beats CAST

[Figure: BIC vs. number of clusters (0 to 16) for the EI, VI, diagonal, and EEE models]

[Figure: adjusted Rand index vs. number of clusters (0 to 16) for EI, VI, VVV, diagonal, CAST, and EEE]

24

Results: square root ovary data

• Adjusted Rand: max at EEE, 4 clusters (> CAST)
• BIC analysis:
– EEE and diagonal models: local max at 4 clusters
– Global max: VI at 8 clusters (8 ≈ split of 4)

[Figure: BIC vs. number of clusters (0 to 16) for the EI, VI, diagonal, and EEE models]

[Figure: adjusted Rand index vs. number of clusters (0 to 16) for EI, VI, VVV, diagonal, CAST, and EEE]

25

Results: standardized yeast cell cycle data

• Adjusted Rand: EI slightly > CAST at 5 clusters
• BIC: selects EEE at 5 clusters

[Figure: adjusted Rand index vs. number of clusters (0 to 16) for EI, VI, VVV, diagonal, CAST, and EEE]

[Figure: BIC vs. number of clusters (0 to 16) for the EI, VI, diagonal, and EEE models]

28

Overview

• Motivation
• Model-based clustering
• Validation
• Importance of Data Transformation
• Summary and Conclusions

29

log yeast cell cycle data

30

Standardized yeast cell cycle data

31

Overview

• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions

32

Summary and Conclusions

• Synthetic data sets:
– With the correct model, model-based clustering is better than a leading heuristic clustering algorithm
– BIC selects the right model and the right number of clusters
• Real expression data sets:
– Comparable adjusted Rand indices to CAST
– BIC gives a good hint as to the number of clusters
• Appropriate data transformations increase normality and cluster quality (see paper & web)
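
The transformations named in the results (log, square root, standardization) are each one-liners. The sketch below is illustrative only: the log base is not stated in the slides (natural log is assumed), and "standardized" is assumed to mean each gene (row) is scaled to mean 0 and variance 1 across experiments. The function names are hypothetical.

```python
import numpy as np

def log_transform(X):
    """Natural log of each expression level (values must be positive)."""
    return np.log(np.asarray(X, dtype=float))

def sqrt_transform(X):
    """Square root of each expression level (values must be non-negative)."""
    return np.sqrt(np.asarray(X, dtype=float))

def standardize_rows(X):
    """Scale each row (gene) to mean 0 and variance 1 across columns."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / std

# Toy expression matrix (hypothetical): 2 genes, 4 experiments.
X = np.array([[1.0, 2.0, 4.0, 8.0], [3.0, 9.0, 27.0, 81.0]])
Z = standardize_rows(X)
```

Which transformation brings a given data set closest to normality is an empirical question.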

35

Acknowledgements

• Ka Yee Yeung (1), Chris Fraley (2,4), Alejandro Murua (4), Adrian E. Raftery (2)
• Michèl Schummer (5) – the ovary data
• Jeremy Tantrum (2) – help with MBC software (diagonal model)
• Chris Saunders (3) – CRE & noise model

(1) Computer Science & Engineering
(2) Statistics
(3) Genome Sciences
(4) Insightful Corporation
(5) Institute of Systems Biology

More Info: http://www.cs.washington.edu/homes/ruzzo

44

Adjusted Rand Example

Rows are classes, columns are clusters; sizes are given in parentheses (20 objects in total).

              c#1(4)   c#2(5)   c#3(7)   c#4(4)
class#1(2)       2        0        0        0
class#2(3)       0        0        0        3
class#3(5)       1        4        0        0
class#4(10)      1        1        7        1

$a = \binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{7}{2} = 31$

$b = \binom{4}{2} + \binom{5}{2} + \binom{7}{2} + \binom{4}{2} - a = 43 - 31 = 12$

$c = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} + \binom{10}{2} - a = 59 - 31 = 28$

$d = \binom{20}{2} - a - b - c = 119$

Rand: $R = \dfrac{a + d}{a + b + c + d} = 0.789$

Adjusted Rand $= \dfrac{R - E(R)}{1 - E(R)} = 0.469$
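
The worked example can be reproduced from the contingency table with a few lines of code. The sketch below uses the usual pair-counting form of the Hubert and Arabie index, which is algebraically the same as (R − E(R)) / (1 − E(R)). The function names are illustrative, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

def pair_counts(table):
    """Return (a, row_pairs, col_pairs, total_pairs) for a contingency table."""
    n = sum(sum(row) for row in table)
    a = sum(comb(v, 2) for row in table for v in row)        # together in both
    row_pairs = sum(comb(sum(row), 2) for row in table)      # together in class
    col_pairs = sum(comb(sum(col), 2) for col in zip(*table))  # together in cluster
    return a, row_pairs, col_pairs, comb(n, 2)

def rand_index(table):
    a, row_pairs, col_pairs, total = pair_counts(table)
    d = total - row_pairs - col_pairs + a   # pairs separated in both partitions
    return (a + d) / total

def adjusted_rand_index(table):
    a, row_pairs, col_pairs, total = pair_counts(table)
    expected = row_pairs * col_pairs / total
    max_index = 0.5 * (row_pairs + col_pairs)
    return (a - expected) / (max_index - expected)

# Contingency table from the slide: rows are classes, columns are clusters.
table = [
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [1, 4, 0, 0],
    [1, 1, 7, 1],
]
print(round(rand_index(table), 3))           # 0.789
print(round(adjusted_rand_index(table), 3))  # 0.469
```

The gap between 0.789 and 0.469 shows how much of the raw Rand index is agreement expected by chance.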