Model checks for complex hierarchical models

Alex Lewin and Sylvia Richardson

Imperial College

Centre for Biostatistics

Many complex models used in bioinformatics

Classification/clustering can be greatly affected by choice of distributions

Our approach: exploit the structure of the model to perform predictive checks

hierarchical models generally involve exchangeability assumptions

mixture models are partially exchangeable

Outline of Talk

Background and Aims

Mixture model for gene expression data

Model checks for mixture model

distribution for gene-specific variances

different mixture priors

Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Hierarchical mixture model for gene expression data

Data: paired log differences between 2 conditions

δg = differential effect for gene g

σg² = variance for each gene

ygr | δg, σg ~ N(δg, σg²)

δg | η ~ Σj wj hj(ηj)

σg² | μ,τ ~ f(μ,τ)

w ~ Dirichlet(1,…,1), various priors for δg, σg

g = gene, r = replicate, j = mixture component

[Diagram: model graph with nodes δg, ybarg, Sg, σg, (μ,τ), wj, ηj]
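To make the model above concrete, here is a minimal sketch that simulates one data set from it. The talk leaves hj and f open ("various priors"), so the choices below are assumptions made only for illustration: Uniform outer components with a point mass at zero for δg, and a Gamma distribution (shape μ, rate τ) for σg². The function name and all default values are hypothetical.

```python
import random

def simulate_dataset(n_genes=200, n_rep=6, mu=2.0, tau=4.0, eta=(3.0, 3.0), seed=1):
    """Draw one synthetic data set from an assumed instance of the model."""
    rng = random.Random(seed)
    # w ~ Dirichlet(1,1,1): normalise three independent Exp(1) draws.
    e = [rng.expovariate(1.0) for _ in range(3)]
    w = [x / sum(e) for x in e]
    genes = []
    for _ in range(n_genes):
        z = rng.choices([0, 1, 2], weights=w)[0]      # mixture allocation
        if z == 0:
            delta = -rng.uniform(0.0, eta[0])         # under-expressed
        elif z == 1:
            delta = 0.0                               # null: point mass at 0
        else:
            delta = rng.uniform(0.0, eta[1])          # over-expressed
        # Assumed f: Gamma with shape mu and rate tau (gammavariate takes a scale).
        sigma2 = rng.gammavariate(mu, 1.0 / tau)
        y = [rng.gauss(delta, sigma2 ** 0.5) for _ in range(n_rep)]
        ybar = sum(y) / n_rep
        s2 = sum((v - ybar) ** 2 for v in y) / (n_rep - 1)
        genes.append({"z": z, "delta": delta, "sigma2": sigma2,
                      "ybar": ybar, "s2": s2})
    return w, genes
```

Each simulated gene keeps its sample mean and sample variance, which are the summaries used in the predictive checks later in the talk.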


Many mixture models have been proposed for gene expression data

Set-up is similar to variable selection prior: point mass + alternative distribution

Particular choices for alternative:

Normal (Lönnstedt and Speed)

Uniform (Parmigiani et al)

many others …


Allow for asymmetry between over- and under-expressed genes: 3-component mixture model

δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)

6 knock-out and 5 wildtype mice

MAS5.0 processed data


Classify each gene into mixture components using posterior probabilities

Choice of mixture prior affects classification results

Mixture prior for δg                                   Est. w2 (fraction in null)

w1 Unif(-η-,0) + w2 δ(0) + w3 Unif(0,η+)               0.96

w1 Gam-(1.5,η-) + w2 δ(0) + w3 Gam+(1.5,η+)            0.68

w1 Gam-(1.5,η-) + w2 N(0,ε) + w3 Gam+(1.5,η+)          0.99
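As an illustration of classification by posterior probabilities, the sketch below estimates P(zg = j | data) as the fraction of MCMC iterations in which gene g is allocated to component j, then assigns each gene to its most probable component. The input layout (`z_draws[t][g]`) and the function names are assumptions, not part of the talk or of any particular software.

```python
from collections import Counter

def allocation_probabilities(z_draws, n_comp=3):
    """z_draws[t][g]: component label (0..n_comp-1) of gene g at iteration t."""
    n_iter = len(z_draws)
    n_genes = len(z_draws[0])
    probs = []
    for g in range(n_genes):
        counts = Counter(z_t[g] for z_t in z_draws)
        probs.append([counts[j] / n_iter for j in range(n_comp)])
    return probs

def classify(probs):
    """Assign each gene to the component with the largest estimated probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

# Toy allocations: 4 iterations, 3 genes.
z_draws = [[0, 1, 2], [0, 1, 1], [1, 1, 2], [0, 1, 2]]
probs = allocation_probabilities(z_draws)
labels = classify(probs)
```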


Model checks for mixture model

Predictive model checks

Predict new data from the model

Use posterior predictive distribution

Condition on hyperparameters (‘mixed predictive’ *): not very conservative

Get Bayesian p-value for each gene/marker/sample

Use all p-values together (100s or 1000s) to assess model fit

* Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003

[Figure: posterior distribution of Smpred compared with the observed Sgobs]

Checking distribution for gene variances

Bayesian p-value for gene g:

pg = Prob( Smpred > Sgobs | data )

All genes are exchangeable

histogram of p-values for all genes together

[Diagram: model graph with nodes δg, ybarg, Sgobs, σg and (μ,τ); posterior predictive Sgppred; mixed predictive Smpred, generated via σpred]

Predictive p-values for data simulated from the model

Histograms should be Uniform

Mixed predictive distribution much less conservative than posterior predictive

‘Mixed’ v. ‘posterior’ predictive

[Figure: histograms of p-values. Panels: using global distribution; using gene-specific distributions]
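The claim that p-values should look Uniform when the data really come from the model can be checked with a small simulation. The sketch below is a simplified stand-in, not the comparison shown on the slide: it assumes the hyperparameters μ and τ are known (no MCMC), takes f to be Gamma with shape μ and rate τ, and only illustrates the global (mixed predictive style) check. All names and values are hypothetical.

```python
import bisect
import random

def simulate_s2(n_genes, n_rep, mu, tau, rng):
    """Sample variances implied by sigma2 ~ Gam(mu, tau) and normal replicates."""
    m = (n_rep - 1) / 2.0
    out = []
    for _ in range(n_genes):
        sigma2 = rng.gammavariate(mu, 1.0 / tau)       # shape mu, rate tau
        out.append(rng.gammavariate(m, sigma2 / m))    # shape m, rate m / sigma2
    return out

def decile_counts(pvalues):
    """Histogram of p-values with 10 equal-width bins on [0, 1]."""
    counts = [0] * 10
    for p in pvalues:
        counts[min(int(p * 10), 9)] += 1
    return counts

rng = random.Random(42)
mu, tau, n_rep = 2.0, 4.0, 6
s2_obs = simulate_s2(2000, n_rep, mu, tau, rng)            # "observed" data
s2_pred = sorted(simulate_s2(2000, n_rep, mu, tau, rng))   # predictive draws

# p-value for each gene: fraction of predictive draws exceeding the observed value.
n_pred = len(s2_pred)
pvals = [(n_pred - bisect.bisect_right(s2_pred, s)) / n_pred for s in s2_obs]
counts = decile_counts(pvals)
```

With 2000 genes each bin should hold roughly 200 p-values; a clear departure from a flat histogram would point to a misfitting variance distribution.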

Checking different variance models

Model differential expression between 3 transgenic and 3 wildtype mice

σg² | μ,τ ~ Gam(μ,τ), μ fixed

σg² | μ,τ ~ Gam(μ,τ)

σg² | μ,τ ~ logNorm(μ,τ)

σg² = σ² for all genes

Implementation (MCMC)

pg = 0

for t = 1,…,niter {

    σtpred ~ f(μt,τt)

    Stmpred ~ Gam( m, m(σtpred)⁻² )

    pg ← pg + I[ Stmpred > Sgobs ]

}

pg ← pg / niter

Just two extra parameters predicted at each iteration

niter = no. MCMC iterations

m = (no. replicates – 1)/2

[Diagram: model graph with nodes δg, ybarg, Sgobs, σg and (μ,τ), plus the mixed predictive nodes σpred and Smpred]
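The loop on this slide could be sketched in Python as follows. This is a minimal illustration under stated assumptions: the MCMC draws of μ and τ are supplied as plain lists (hypothetical inputs, produced by whatever sampler fits the model), and f is taken to be Gamma with shape μ and rate τ, which is only one of the variance models considered in the talk.

```python
import random

def mixed_predictive_pvalues(s2_obs, mu_draws, tau_draws, n_rep, seed=0):
    """Estimate pg = Prob(Smpred > Sgobs | data) for every gene g.

    s2_obs    : observed sample variances, one per gene
    mu_draws  : hypothetical MCMC draws of mu, one per iteration
    tau_draws : hypothetical MCMC draws of tau, one per iteration
    """
    rng = random.Random(seed)
    m = (n_rep - 1) / 2.0
    counts = [0] * len(s2_obs)
    for mu_t, tau_t in zip(mu_draws, tau_draws):
        # Assumption: f(mu, tau) is Gamma with shape mu and rate tau.
        sigma2_pred = rng.gammavariate(mu_t, 1.0 / tau_t)
        # Gam(m, m / sigma2_pred) in shape/rate form; gammavariate takes a scale.
        s2_pred = rng.gammavariate(m, sigma2_pred / m)
        # The same two predicted quantities are compared against every gene.
        for g, s2 in enumerate(s2_obs):
            if s2_pred > s2:
                counts[g] += 1
    n_iter = len(mu_draws)
    return [c / n_iter for c in counts]
```

Because only σpred and Smpred are drawn per iteration, the extra cost is independent of the number of genes apart from the comparisons themselves.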







Checking mixture prior

δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)

OR

δg | η, zg = j ~ hj(ηj) j = 1,…,3

P(zg = j) = wj

Model checking: focus on separate mixture components

δg | η, zg = j ~ hj(ηj) j = 1,…,3

Think about MCMC iterations …

Mixture component is estimated from genes currently assigned to that component

Can only define p-value for given gene and mix. component when the gene is assigned to that component (i.e. condition on zg in p-value)

So check each component using only the genes currently assigned (i.e. condition on zg in histogram)

Issues for mixture model checking

[Diagram: model graph with nodes δg, ybarg, Sg, σg, (μ,τ), wj and ηj, plus the predicted nodes δjpred and ybargjmpred]

Predictive checks for mixture model

Bayesian p-value for gene g and mix. component j:

pgj = Prob( ybargjmpred > ybargobs | data, zg = j )

Genes assigned to the same mix. component are exchangeable

histogram of p-values for each mix. component separately

histogram for component j made only from genes with large P(zg = j)

Effectively we condition on a best classification
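Building the histogram for one component from only the confidently classified genes might look like the sketch below. The inputs are hypothetical: `pvalues[g][j]` holds pgj (or None when gene g was never allocated to component j) and `probs[g][j]` holds the estimated P(zg = j | data). The threshold of 0.5 follows the slides; the bin count is an arbitrary choice.

```python
def component_histogram(pvalues, probs, j, threshold=0.5, n_bins=10):
    """Bin the p-values of component j, using only genes with P(zg = j) > threshold."""
    counts = [0] * n_bins
    for p_g, prob_g in zip(pvalues, probs):
        if prob_g[j] > threshold and p_g[j] is not None:
            counts[min(int(p_g[j] * n_bins), n_bins - 1)] += 1
    return counts

# Toy example with 3 genes and 3 components.
pvalues = [[0.05, None, None], [0.95, 0.5, None], [None, None, 0.3]]
probs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.0, 1.0]]
strict = component_histogram(pvalues, probs, j=0)                  # P > 0.5
loose = component_histogram(pvalues, probs, j=0, threshold=0.0)    # P > 0
```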

Condition on classification to check separate components

[Figure: histograms of predictive p-values for data simulated from the model. Panels: all genes with P(zg = j) > 0; only genes with P(zg = j) > 0.5]

Checking different mixture distributions

w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)

Outer mix. components skewed too much away from zero

Null component too narrow


w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)

Outer components skewed opposite

Null still too narrow?


w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)

Better fit for all components

Implementation

[Diagram: model graph with nodes δg, ybarg, Sg, σg, (μ,τ), wj and ηj, plus the predicted nodes δjpred and ybargjmpred]

pgj = 0

for t = 1,…,niter {

    δjtpred ~ hj(ηjt), j = 1,…,3

    ybargtmpred ~ N( δjtpred , σg²/nrep ) for j = zgt

    pgj ← pgj + I[ ybargtmpred > ybargobs ] for j = zgt

}

pgj ← pgj / niter(zg=j)

Need ≈ngenes extra parameters at each iteration
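A Python sketch of this per-component loop is given below. It is an illustration only: the per-iteration quantities (allocations zgt, the predicted δjtpred already drawn from hj(ηjt), and the current σg² values) are passed in as nested lists, a hypothetical layout chosen so the code does not depend on any particular mixture prior or sampler.

```python
import random

def component_pvalues(ybar_obs, z_draws, delta_pred_draws, sigma2_draws,
                      n_rep, n_comp=3, seed=0):
    """Estimate pgj = Prob(ybar_pred > ybar_obs | data, zg = j).

    z_draws[t][g]          : component (0..n_comp-1) of gene g at iteration t
    delta_pred_draws[t][j] : predicted effect for component j at iteration t
    sigma2_draws[t][g]     : variance of gene g at iteration t
    Returns p[g][j], or None if gene g was never allocated to component j.
    """
    rng = random.Random(seed)
    n_genes = len(ybar_obs)
    exceed = [[0] * n_comp for _ in range(n_genes)]
    visits = [[0] * n_comp for _ in range(n_genes)]
    for z_t, delta_t, sigma2_t in zip(z_draws, delta_pred_draws, sigma2_draws):
        for g in range(n_genes):
            j = z_t[g]  # only the currently assigned component is updated
            sd = (sigma2_t[g] / n_rep) ** 0.5
            ybar_pred = rng.gauss(delta_t[j], sd)
            visits[g][j] += 1
            if ybar_pred > ybar_obs[g]:
                exceed[g][j] += 1
    # Divide by the number of iterations with zg = j, not by niter.
    return [[exceed[g][j] / visits[g][j] if visits[g][j] else None
             for j in range(n_comp)] for g in range(n_genes)]
```

One predicted mean is drawn per gene per iteration, which is where the roughly ngenes extra parameters come from.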

Summary of model checking procedure

1. Find part of model where individuals are assumed to be exchangeable (so information is shared)

2. Choose test statistic T (e.g. sample mean or variance)

3. Predict Tpred from distribution for exchangeable individuals (whole posterior for Tpred)

4. Compare observed Ti for each individual i to distribution of Tpred

5. For checking mixture components, condition on the best classification







Clustering and variable selection (Tadesse et al. 2005)

yi = vector of gene expression for each sample i = 1,…,n

Multivariate mixture model for clustering samples:

yi | zi = j ~ MVN(ζj, Λj), j = 1,…,J

P(zi = j) = wj

No. of mix. components (J) is estimated in the model

Aim to select genes which are informative for clustering the samples


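γ = vector of indices of selected variables

γ’ = vector of indices of variables not used to cluster samples

Likelihood conditional on allocation to mixture:

\text{Likelihood} \mid z \;\sim\; \prod_{j} \exp\Big( -\tfrac{1}{2} \sum_{i \in C_j} \big(y_i^{(\gamma)} - \mu_j^{(\gamma)}\big)^T \big(\Sigma_j^{(\gamma)}\big)^{-1} \big(y_i^{(\gamma)} - \mu_j^{(\gamma)}\big) \Big) \times \exp\Big( -\tfrac{1}{2} \sum_{i=1}^{n} \big(y_i^{(\gamma')} - \eta^{(\gamma')}\big)^T \big(\Omega^{(\gamma')}\big)^{-1} \big(y_i^{(\gamma')} - \eta^{(\gamma')}\big) \Big)

where Cj is the set of samples i allocated to mixture component j

Conjugate priors on multivariate means and covariance matrices

P(γg = 1) = φ

i = sample, g = gene, j = mix. component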

Model checking: want to check the distribution for each mixture component separately (conditional on J)

In addition, need to condition on a given variable selection

Clearly impossible computationally

[Diagram: model graph with nodes μj(γ), Σj(γ), yi, y(γ)jpred, wj, η(γ), Ω(γ), φ and J]

Computing predictive p-values

1) Run model with no prediction

2) Find the best configuration:

set of selected variables (γ)

no. mixture components J

allocation of samples to mixture components zi

3) Re-run model with (γ), J and zi fixed, and calculate predictive p-values:

pij = Prob( Tjpred > Tiobs | data, zi = j, J, (γ) )

where T = |y|² (for example)
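Step 3 could be sketched as below for the example statistic T = |y|². This is a rough illustration under simplifying assumptions that are not in the talk: the component covariance is treated as diagonal so that predictions can be drawn with the standard library alone, and the per-iteration draws of the component means and variances (restricted to the selected variables) are passed in as hypothetical nested lists.

```python
import random

def squared_norm(y):
    """Test statistic T = |y|^2."""
    return sum(v * v for v in y)

def cluster_pvalues(y_obs, z, mean_draws, var_draws, seed=0):
    """Estimate pij = Prob(Tjpred > Tiobs | data, zi = j, J, gamma) for j = zi.

    y_obs[i]         : expression vector of sample i over the selected genes
    z[i]             : fixed allocation of sample i (0..J-1)
    mean_draws[t][j] : component mean vector at iteration t
    var_draws[t][j]  : per-gene variances (diagonal covariance: an assumption)
    """
    rng = random.Random(seed)
    n_iter = len(mean_draws)
    n_comp = len(mean_draws[0])
    t_obs = [squared_norm(y) for y in y_obs]
    counts = [0] * len(y_obs)
    for mean_t, var_t in zip(mean_draws, var_draws):
        # One predicted statistic per mixture component per iteration.
        t_pred = []
        for j in range(n_comp):
            y_pred = [rng.gauss(m, v ** 0.5) for m, v in zip(mean_t[j], var_t[j])]
            t_pred.append(squared_norm(y_pred))
        # Each sample is compared with the prediction for its own component.
        for i, j in enumerate(z):
            if t_pred[j] > t_obs[i]:
                counts[i] += 1
    return [c / n_iter for c in counts]
```

A full covariance matrix would need a multivariate normal sampler (for example via a Cholesky factor), which is omitted here to keep the sketch short.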

Conclusions

Choice of model distributions can greatly influence results of clustering and classification

For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation

Should be possible to do this even for quite complex models (if you can fit the model, you can check it)

Acknowledgements

Collaborators on BBSRC Exploiting Genomics Grant

Natalia Bochkina, Clare Marshall

Peter Green

Meeting on model checking in Cambridge

David Spiegelhalter

Shaun Seaman

BBSRC Exploiting Genomics Grant

Paper and software at http://www.bgx.org.uk/
