Date post: 21-Mar-2020
Model-based clustering and data transformations of gene expression data Walter L. Ruzzo University of Washington UW CSE Computational Biology Group
Page 1: Model-based clustering and data transformations of gene ... · Model-based clustering and data transformations of gene expression data ... Validation Methodology •Compare on data

Model-based clustering and data transformations of gene expression data

Walter L. Ruzzo
University of Washington

UW CSE Computational Biology Group


Overview

• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions


Toy 2-d Clustering Example

?


K-Means


Hierarchical Average Link


Model-Based (If You Want)


Overview
• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions


Model-based clustering
• Gaussian mixture model:
– Assume each cluster is generated by a multivariate normal distribution
– Cluster k has parameters:
  • Mean vector: µk
  • Covariance matrix: Σk

[Figure: two Gaussian clusters annotated with means µ1, µ2 and standard deviations σ1, σ2]


Variance & Covariance

• Variance: $\mathrm{var}(x) = E\big((x - \bar{x})^2\big)$

• Covariance: $\mathrm{cov}(x,y) = E\big((x - \bar{x})(y - \bar{y})\big)$

• Correlation: $\mathrm{cor}(x,y) = \dfrac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}$

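The three definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy and population (divide-by-n) estimates in place of the expectations; the function names are illustrative and not from the slides.

```python
import numpy as np

def covariance(x, y):
    """cov(x, y) = E((x - mean(x)) * (y - mean(y))), estimated by a sample average."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

def variance(x):
    """var(x) is the covariance of x with itself."""
    return covariance(x, x)

def correlation(x, y):
    """cor(x, y) = cov(x, y) / (sigma_x * sigma_y)."""
    return covariance(x, y) / np.sqrt(variance(x) * variance(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # y = 2x, so the correlation is 1
print(variance(x), covariance(x, y), correlation(x, y))
```

Because y is an exact linear function of x here, the correlation is 1 regardless of scale, while the covariance doubles with the slope.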

Gaussian Distributions

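• Univariate: $\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}(x-\bar{x})^2/\sigma^2}$

• Multivariate: $\dfrac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(x-\bar{x})^T \Sigma^{-1} (x-\bar{x})}$

where Σ is the variance/covariance matrix: $\Sigma_{i,j} = E\big((x_i - \bar{x}_i)(x_j - \bar{x}_j)\big)$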

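As an illustration of the multivariate density, here is a small sketch that evaluates its logarithm with NumPy. The function name `mvn_logpdf` is hypothetical; working in log space and using `slogdet` and `solve` instead of an explicit inverse is a numerical-stability choice, not something stated on the slide.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log of the multivariate normal density at x.

    log N(x) = -0.5 * (n*log(2*pi) + log|Sigma| + (x-mean)^T Sigma^{-1} (x-mean))
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    n = mean.size
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)           # log|Sigma|
    maha = diff @ np.linalg.solve(cov, diff)     # (x-mean)^T Sigma^{-1} (x-mean)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + maha)

# With n = 1 and Sigma = [[sigma^2]] this reduces to the univariate formula.
print(mvn_logpdf([0.0], [0.0], [[1.0]]))
print(mvn_logpdf([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```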

Variance/Covariance


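Covariance models (Banfield & Raftery 1993)

$\Sigma_k = \lambda_k D_k A_k D_k^T$, where $\lambda_k$ is the volume, $D_k$ the orientation, and $A_k$ the shape.

• Equal volume spherical model (EI), ~ k-means: $\Sigma_k = \lambda I$

• Unequal volume spherical model (VI): $\Sigma_k = \lambda_k I$

• Diagonal model: $\Sigma_k = \lambda_k B_k$, where $B_k$ is diagonal, $|B_k| = 1$

• EEE elliptical model: $\Sigma_k = \lambda D A D^T$

• Unconstrained model (VVV): $\Sigma_k = \lambda_k D_k A_k D_k^T$

Moving down the list: more flexible, but more parameters.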
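To make the volume/orientation/shape decomposition and the flexibility trade-off concrete, the sketch below builds one covariance matrix from its three factors and counts the free covariance parameters of each model. The function names and the dictionary keys are illustrative assumptions; the counts follow from the constraints listed above (for example, a diagonal $B_k$ with determinant 1 has d-1 free entries, plus one volume per cluster).

```python
import numpy as np

def compose_covariance(volume, orientation, shape):
    """Sigma = lambda * D * A * D^T, with D orthogonal and A diagonal with det(A) = 1."""
    D = np.asarray(orientation, dtype=float)
    A = np.diag(np.asarray(shape, dtype=float))
    return volume * D @ A @ D.T

def n_covariance_params(model, n_clusters, dim):
    """Number of free covariance parameters (means and mixing weights excluded)."""
    full = dim * (dim + 1) // 2            # entries of one symmetric matrix
    counts = {
        "EI": 1,                           # one shared lambda
        "VI": n_clusters,                  # one lambda per cluster
        "diagonal": n_clusters * dim,      # lambda_k plus d-1 shape entries per cluster
        "EEE": full,                       # one shared full matrix
        "VVV": n_clusters * full,          # one full matrix per cluster
    }
    return counts[model]

theta = np.pi / 4                          # rotate the ellipse by 45 degrees
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
sigma = compose_covariance(volume=3.0, orientation=D, shape=[2.0, 0.5])
print(sigma)
for model in ["EI", "VI", "diagonal", "EEE", "VVV"]:
    print(model, n_covariance_params(model, n_clusters=4, dim=24))
```

With 4 clusters in 24 dimensions the unconstrained model needs 1200 covariance parameters against 1 for EI, which is why the choice among these models matters on small expression data sets.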


EM algorithm
• General approach to maximum likelihood
• Iterate between E and M steps:
– E step: compute the probability of each observation belonging to each cluster using the current parameter estimates
– M step: estimate model parameters using the current group membership probabilities
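The two steps can be sketched for the simplest unequal-volume spherical case ($\Sigma_k = \lambda_k I$). This is an illustrative NumPy implementation under stated assumptions (random initial means, a fixed number of iterations, a small floor on the variances); it is not the MBC software used for the results in this talk.

```python
import numpy as np

def em_spherical_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians (Sigma_k = lambda_k * I)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]   # assumed init: k random points
    variances = np.full(k, X.var())                   # initial lambda_k
    weights = np.full(k, 1.0 / k)                     # mixing proportions

    for _ in range(n_iter):
        # E step: membership probabilities from the current parameters
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_p = (np.log(weights)
                 - 0.5 * d * np.log(2 * np.pi * variances)
                 - 0.5 * sq_dist / variances)
        log_p -= log_p.max(axis=1, keepdims=True)     # stabilise before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: parameters from the current membership probabilities
        nk = resp.sum(axis=0) + 1e-12                 # avoid division by zero
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = np.maximum((resp * sq_dist).sum(axis=0) / (d * nk), 1e-8)

    return weights, means, variances, resp

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
weights, means, variances, resp = em_spherical_gmm(X, k=2)
print(means)
```

Hard cluster labels, if needed, can be read off as `resp.argmax(axis=1)`. EM only finds a local maximum of the likelihood, so results can depend on the initialisation.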


Advantages of model-based clustering

• Higher quality clusters
• Flexible models
• Model selection: a principled way to choose the right model and the right # of clusters
– Bayesian Information Criterion (BIC):
  • Approximate Bayes factor: posterior odds for one model against another model
  • Roughly: data likelihood, penalized for number of parameters
– A large BIC score indicates strong evidence for the corresponding model.


Definition of the BIC score

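• The integrated likelihood p(D|Mk) is hard to evaluate, where D is the data and Mk is the model.

• BIC is an approximation to 2 log p(D|Mk):

$2 \log p(D \mid M_k) \approx 2 \log p(D \mid \hat{\theta}_k, M_k) - \upsilon_k \log(n) = \mathrm{BIC}_k$

• υk: number of parameters to be estimated in model Mk
• $\hat{\theta}_k$: maximum likelihood estimate of the parameters of Mk; n: number of observations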
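A direct transcription of the BIC formula, as a sketch. The function name and the numbers in the usage lines are made up for illustration; they are not results from the talk.

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """BIC_k = 2 * log p(D | theta_hat_k, M_k) - v_k * log(n); larger is better."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

# Hypothetical comparison: the richer model fits better but pays a larger penalty.
simple = bic_score(log_likelihood=-520.0, n_params=10, n_obs=235)
rich = bic_score(log_likelihood=-500.0, n_params=60, n_obs=235)
print(simple, rich)
```

In this made-up case the simpler model has the larger BIC, because the gain of 20 log-likelihood units does not offset 50 extra parameters at n = 235.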


Overview
• Motivation
• Model-based clustering
• Validation
– Methodology
– Data Sets
– Results
• Summary and Conclusions


Validation Methodology
• Compare on data sets with external criteria (BIC scores do not require the external criteria)
• To compare clusters with external criterion:
– Adjusted Rand index (Hubert and Arabie 1985)
– Adjusted Rand index = 1: perfect agreement
– 2 random partitions have an expected index of 0
• Compare quality of clusters to those from:
– a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
– k-Means (EI).


Gene expression data sets

• Ovarian cancer data set (Michel Schummer, Institute of Systems Biology)
– Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
– 235 clones correspond to 4 genes

• Yeast cell cycle data (Cho et al 1998)
– 17 time points
– Subset of 384 genes associated with 5 phases of cell cycle


Synthetic data sets
Both based on ovary data
• Randomly resampled ovary data
– For each class, randomly sample the expression levels in each experiment, independently
– Near diagonal covariance matrix
• Gaussian mixture
– Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data
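Both constructions can be sketched with NumPy as follows. The function names are hypothetical, the input here is random placeholder data rather than the ovary data, and sampling with replacement is an assumption, since the slide does not say how the resampling was done.

```python
import numpy as np

def resample_within_class(X, labels, rng):
    """Per class, resample each experiment (column) independently of the others.

    Breaking the link between columns leaves a near-diagonal within-class covariance.
    """
    out = np.empty_like(X, dtype=float)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        for j in range(X.shape[1]):
            out[idx, j] = rng.choice(X[idx, j], size=idx.size, replace=True)
    return out

def gaussian_mixture_like(X, labels, rng):
    """Per class, draw from a normal with that class's sample mean and covariance."""
    out = np.empty_like(X, dtype=float)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        mean = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False)
        out[idx] = rng.multivariate_normal(mean, cov, size=idx.size)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))               # placeholder: 40 "clones", 3 "experiments"
labels = np.repeat([0, 1], 20)             # placeholder class labels
resampled = resample_within_class(X, labels, rng)
mixture = gaussian_mixture_like(X, labels, rng)
print(resampled.shape, mixture.shape)
```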


Results: randomly resampled ovary data

• Diagonal model achieves max BIC score (~expected)
• max BIC at 4 clusters (~expected)
• max adjusted Rand
• beats CAST

[Figure: BIC vs. number of clusters for the EI, VI, diagonal, and EEE models]
[Figure: adjusted Rand index vs. number of clusters for EI, VI, VVV, diagonal, CAST, and EEE]


Results: square root ovary data

• Adjusted Rand: max at EEE, 4 clusters (> CAST)
• BIC analysis:
– EEE and diagonal models: local max at 4 clusters
– Global max: VI at 8 clusters (8 ≈ split of 4).

[Figure: BIC vs. number of clusters for the EI, VI, diagonal, and EEE models]
[Figure: adjusted Rand index vs. number of clusters for EI, VI, VVV, diagonal, CAST, and EEE]


Results: standardized yeast cell cycle data

• Adjusted Rand: EI slightly > CAST at 5 clusters.
• BIC: selects EEE at 5 clusters.

[Figure: adjusted Rand index vs. number of clusters for EI, VI, VVV, diagonal, CAST, and EEE]
[Figure: BIC vs. number of clusters for the EI, VI, diagonal, and EEE models]


Overview

• Motivation
• Model-based clustering
• Validation
• Importance of Data Transformation
• Summary and Conclusions


log yeast cell cycle data


Standardized yeast cell cycle data


Overview

• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions


Summary and Conclusions
• Synthetic data sets:
– With the correct model, model-based clustering is better than a leading heuristic clustering algorithm
– BIC selects the right model & right number of clusters
• Real expression data sets:
– Comparable adjusted Rand indices to CAST
– BIC gives a good hint as to the number of clusters
• Appropriate data transformations increase normality & cluster quality (See paper & web.)


Acknowledgements
• Ka Yee Yeung (1), Chris Fraley (2,4), Alejandro Murua (4), Adrian E. Raftery (2)
• Michèl Schummer (5) – the ovary data
• Jeremy Tantrum (2) – help with MBC software (diagonal model)
• Chris Saunders (3) – CRE & noise model

(1) Computer Science & Engineering
(2) Statistics
(3) Genome Sciences
(4) Insightful Corporation
(5) Institute of Systems Biology

More Info: http://www.cs.washington.edu/homes/ruzzo

UW CSE Computational Biology Group


Adjusted Rand Example

               c#1 (4)   c#2 (5)   c#3 (7)   c#4 (4)
class#1 (2)       2         0         0         0
class#2 (3)       0         0         0         3
class#3 (5)       1         4         0         0
class#4 (10)      1         1         7         1

$a = \binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{7}{2} = 31$

$b = \binom{4}{2} + \binom{5}{2} + \binom{7}{2} + \binom{4}{2} - a = 43 - 31 = 12$

$c = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} + \binom{10}{2} - a = 59 - 31 = 28$

$d = \binom{20}{2} - a - b - c = 119$

Rand: $R = \dfrac{a + d}{a + b + c + d} = 0.789$

Adjusted Rand $= \dfrac{R - E(R)}{1 - E(R)} = 0.469$
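The example can be reproduced from the contingency table with the Python standard library. This is a sketch: the function names are illustrative, and the adjusted index is computed in the pair-count form of Hubert and Arabie, which is algebraically the same as (R - E(R)) / (1 - E(R)).

```python
from math import comb

def pair_counts(table):
    """Pair counts (a, b, c, d) from a class-by-cluster contingency table."""
    n = sum(sum(row) for row in table)
    a = sum(comb(v, 2) for row in table for v in row)         # same class, same cluster
    same_class = sum(comb(sum(row), 2) for row in table)      # row sums
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))  # column sums
    b = same_cluster - a
    c = same_class - a
    d = comb(n, 2) - a - b - c
    return a, b, c, d

def rand_index(table):
    a, b, c, d = pair_counts(table)
    return (a + d) / (a + b + c + d)

def adjusted_rand_index(table):
    a, b, c, d = pair_counts(table)
    total = a + b + c + d
    expected = (a + b) * (a + c) / total      # expected value of a under random labelling
    maximum = ((a + b) + (a + c)) / 2         # largest value a could take
    return (a - expected) / (maximum - expected)

# Rows are classes 1-4, columns are clusters 1-4, as on the slide.
table = [[2, 0, 0, 0],
         [0, 0, 0, 3],
         [1, 4, 0, 0],
         [1, 1, 7, 1]]
print(pair_counts(table))                     # (31, 12, 28, 119)
print(round(rand_index(table), 3))            # 0.789
print(round(adjusted_rand_index(table), 3))   # 0.469
```

The plain Rand index looks high mainly because d, the pairs separated in both partitions, dominates; the adjustment removes the agreement expected by chance.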

