Date posted: 28-Mar-2015
Model checks for complex hierarchical models
Alex Lewin and Sylvia Richardson
Imperial College
Centre for Biostatistics
Background and Aims
Many complex models used in bioinformatics
Classification/clustering can be greatly affected by choice of distributions
Our approach: exploit the structure of the model to perform predictive checks
  hierarchical models generally involve exchangeability assumptions
  mixture models are partially exchangeable
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model
  distribution for gene-specific variances
  different mixture priors
Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)
Hierarchical mixture model for gene expression data
Data: paired log differences between 2 conditions
g = gene, r = replicate, j = mixture component
ygr | δg, σg ~ N(δg, σg²)
δg | η ~ Σj wj hj(ηj)   (differential effect for gene g)
σg² | μ,τ ~ f(μ,τ)   (variance for each gene)
w ~ Dirichlet(1,…,1); various priors for δg, σg
[Figure: model graph with nodes δg; ybarg, Sg; σg; μ,τ; wj; ηj]
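To make the model structure concrete, here is a minimal Python sketch that simulates data of this form. It is an illustration only: the function name, the Uniform outer components, the point-mass null, the log-normal distribution for the gene variances and all numerical values are assumptions chosen for the example, not the settings used in the talk.

```python
import numpy as np

def simulate_expression(n_genes=1000, n_rep=4, w=(0.1, 0.8, 0.1),
                        eta_minus=2.0, eta_plus=2.0, seed=0):
    """Simulate paired log differences y[g, r] from a 3-component mixture.

    Assumed for illustration: Uniform outer components, a point mass at
    zero for the null, and log-normal gene-specific variances.
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(3, size=n_genes, p=w)        # mixture component of each gene
    delta = np.zeros(n_genes)                   # null genes keep delta_g = 0
    under, over = (z == 0), (z == 2)
    delta[under] = rng.uniform(-eta_minus, 0.0, under.sum())
    delta[over] = rng.uniform(0.0, eta_plus, over.sum())
    sigma2 = rng.lognormal(mean=-1.0, sigma=0.5, size=n_genes)  # sigma_g^2
    # y_gr | delta_g, sigma_g ~ N(delta_g, sigma_g^2)
    y = rng.normal(delta[:, None], np.sqrt(sigma2)[:, None], (n_genes, n_rep))
    return y, z, delta, sigma2

y, z, delta, sigma2 = simulate_expression()
ybar = y.mean(axis=1)          # ybar_g: mean log difference per gene
S2 = y.var(axis=1, ddof=1)     # S_g: sample variance per gene
```

The summaries `ybar` and `S2` play the role of ybarg and Sg, the quantities the predictive checks are applied to.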
Mixture model for gene expression data
Many mixture models have been proposed for gene expression data
Set-up is similar to variable selection prior: point mass + alternative distribution
Particular choices for alternative:
Normal (Lönnstedt and Speed)
Uniform (Parmigiani et al)
many others …
Mixture model for gene expression data
Allow for asymmetry in over- and under-expressed genes: 3-component mixture model
δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
6 knock-out and 5 wildtype mice
MAS5.0 processed data
Mixture model for gene expression data
Classify each gene into mixture components using posterior probabilities
Choice of mixture prior affects classification results
Mixture prior for δg | Est. w2 (proportion in null)
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+) | 0.96
w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+) | 0.68
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+) | 0.99
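As an illustration of the classification step, the sketch below assumes the fitted model provides a sampled allocation zg for every gene at each MCMC iteration. The array `z_draws` and the function name are hypothetical. Posterior probabilities are estimated as allocation frequencies, and each gene is assigned to its most probable component.

```python
import numpy as np

def classify_genes(z_draws, n_comp=3):
    """z_draws: (n_iter, n_genes) sampled labels in {0, ..., n_comp - 1}."""
    # Estimated P(z_g = j | data) = fraction of iterations with z_g = j
    post_prob = np.stack([(z_draws == j).mean(axis=0) for j in range(n_comp)],
                         axis=1)
    return post_prob, post_prob.argmax(axis=1)

# Tiny made-up example: 4 iterations, 3 genes
z_draws = np.array([[0, 1, 2],
                    [0, 1, 1],
                    [0, 1, 2],
                    [1, 1, 2]])
post_prob, labels = classify_genes(z_draws)
```

With real MCMC output the same `post_prob` matrix can be reused later to select the genes that enter each component's p-value histogram.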
Predictive model checks
Predict new data from the model
Use posterior predictive distribution
Condition on hyperparameters (‘mixed predictive’ * not very conservative)
Get Bayesian p-value for each gene/marker/sample
Use all p-values together (100’s or 1000’s) to assess model fit
* Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003
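One possible way to use all the p-values together is to compare their empirical distribution with Uniform(0,1), which is the reference shape for p-values computed under a correctly specified model. The sketch below is a hedged illustration: the function names are hypothetical, and the Kolmogorov-Smirnov distance is just one convenient numerical summary to accompany the histogram.

```python
import numpy as np

def uniform_ks_distance(p):
    """Largest gap between the empirical CDF of p and the Uniform(0,1) CDF."""
    p = np.sort(np.asarray(p, dtype=float))
    n = len(p)
    upper = np.arange(1, n + 1) / n - p    # ECDF just after each point minus p
    lower = p - np.arange(0, n) / n        # p minus ECDF just before each point
    return max(upper.max(), lower.max())

def pvalue_histogram(p, n_bins=10):
    """Relative frequencies of p-values in equal-width bins on [0, 1]."""
    counts, edges = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    return counts / len(p), edges

# Evenly spread p-values sit close to Uniform; p-values piled at 0 do not
p_even = (np.arange(4) + 0.5) / 4
p_piled = np.zeros(4)
d_even = uniform_ks_distance(p_even)
d_piled = uniform_ks_distance(p_piled)
freqs, edges = pvalue_histogram(p_even, n_bins=4)
```

A small distance and a flat histogram are consistent with good fit; a spike near 0 or 1 points to genes whose observed statistic is extreme relative to the predictive distribution.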
[Figure: posterior distribution of Smpred compared with Sgobs]
Checking distribution for gene variances
Bayesian p-value for gene g:
pg = Prob( Smpred > Sgobs | data )
All genes are exchangeable
histogram of p-values for all genes together
[Figure: model graph for gene g with nodes μ,τ; σg; ybarg, Sgobs; posterior predictive Sgppred; mixed predictive Smpred, generated via σpred]
‘Mixed’ v. ‘posterior’ predictive
Predictive p-values for data simulated from the model
Histograms should be Uniform
Mixed predictive distribution much less conservative than posterior predictive
[Figure panels: Using global distribution | Using gene-specific distributions]
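The difference in conservativeness can be reproduced in a toy setting. The sketch below is not the model from the talk: it assumes a conjugate setup in which the gene precisions 1/σg² follow a Gamma distribution with known hyperparameters `a` and `b`, so that the posterior for each gene is available in closed form and no MCMC is needed. All names and values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_draws, n_rep = 1000, 1000, 4
a, b = 2.0, 1.0                  # assumed known hyperparameters (shape, rate)
m = (n_rep - 1) / 2.0

# Simulate from the model: precision_g ~ Gam(a, b), S_g ~ Gam(m, m * precision_g)
# (numpy's gamma takes shape and SCALE, so scale = 1 / rate)
prec_true = rng.gamma(a, 1.0 / b, size=n_genes)
S_obs = rng.gamma(m, 1.0 / (m * prec_true))

# Posterior predictive: gene-specific precision from its conjugate posterior
# Gam(a + m, b + m * S_g), then a replicate variance for the same gene
prec_post = rng.gamma(a + m, 1.0 / (b + m * S_obs), size=(n_draws, n_genes))
S_post = rng.gamma(m, 1.0 / (m * prec_post))
p_post = (S_post > S_obs).mean(axis=0)

# Mixed predictive: a new precision from the global distribution,
# then a predicted variance, shared across all genes
prec_pred = rng.gamma(a, 1.0 / b, size=n_draws)
S_mixed = rng.gamma(m, 1.0 / (m * prec_pred))
p_mixed = (S_mixed[:, None] > S_obs[None, :]).mean(axis=0)

print("sd of posterior predictive p-values:", p_post.std())
print("sd of mixed predictive p-values:    ", p_mixed.std())
```

In this toy example the mixed predictive p-values spread over the whole unit interval, while the posterior predictive p-values are pulled towards the middle because each gene's own data are used both to fit and to check.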
Checking different variance models
Model differential expression between 3 transgenic and 3 wildtype mice
σg² | μ,τ ~ Gam(μ,τ), μ fixed
σg² | μ,τ ~ Gam(μ,τ)
σg² | μ,τ ~ logNorm(μ,τ)
σg² = σ² for all genes
Implementation (MCMC)
pg = 0
for t = 1,…,niter {
  (σpred^(t))² ~ f(μ^(t), τ^(t))
  Smpred^(t) ~ Gam( m, m (σpred^(t))^-2 )
  pg ← pg + I[ Smpred^(t) > Sgobs ]
}
pg ← pg / niter
Just two extra parameters predicted at each iteration
niter = no. MCMC iterations
m = (no. replicates – 1)/2
[Figure: model graph with nodes μ,τ; σg; ybarg, Sgobs; σpred; mixed predictive Smpred]
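The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: f is taken to be log-normal, the function name is hypothetical, and the arrays `mu_draws` and `tau_draws` are stand-ins for real MCMC output (here they are simply jittered copies of the values used to simulate the data).

```python
import numpy as np

def mixed_predictive_pvalues(S_obs, mu_draws, tau_draws, n_rep, rng):
    """p_g = fraction of iterations in which the predicted variance exceeds S_g.

    Assumes (for illustration) a log-normal f: log sigma^2 ~ N(mu, tau^2).
    """
    m = (n_rep - 1) / 2.0
    counts = np.zeros(len(S_obs))
    for mu_t, tau_t in zip(mu_draws, tau_draws):
        sigma2_pred = rng.lognormal(mu_t, tau_t)   # new variance from f(mu_t, tau_t)
        # Gam(m, rate = m / sigma2_pred); numpy's gamma takes shape and SCALE
        S_pred = rng.gamma(m, sigma2_pred / m)
        counts += S_pred > S_obs                   # one comparison per gene
    return counts / len(mu_draws)

# Demo on data simulated from the assumed model
rng = np.random.default_rng(2)
n_genes, n_rep, n_iter = 500, 4, 2000
mu_true, tau_true = -1.0, 0.5
m = (n_rep - 1) / 2.0
sigma2 = rng.lognormal(mu_true, tau_true, n_genes)
S_obs = rng.gamma(m, sigma2 / m)
mu_draws = rng.normal(mu_true, 0.02, n_iter)           # stand-in posterior draws
tau_draws = np.abs(rng.normal(tau_true, 0.02, n_iter)) # stand-in posterior draws
p = mixed_predictive_pvalues(S_obs, mu_draws, tau_draws, n_rep, rng)
```

Only two quantities are predicted per iteration, and the same predicted variance is compared against every gene, which is what keeps the cost independent of the number of genes.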
Checking mixture prior
δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)
OR
δg | η, zg = j ~ hj(ηj) j = 1,…,3
P(zg = j) = wj
Model checking: focus on separate mixture components
Issues for mixture model checking
δg | η, zg = j ~ hj(ηj), j = 1,…,3
Think about MCMC iterations …
Mixture component is estimated from genes currently assigned to that component
Can only define p-value for given gene and mix. component when the gene is assigned to that component (i.e. condition on zg in p-value)
So check each component using only the genes currently assigned (i.e. condition on zg in histogram)
Predictive checks for mixture model
Bayesian p-value for gene g and mix. component j:
pgj = Prob( ybargjmpred > ybargobs | data, zg=j )
[Figure: model graph with nodes μ,τ; σg; ηj; wj; δg; ybarg, Sg; δjpred; mixed predictive ybargjmpred]
Genes assigned to the same mix. component are exchangeable
histogram of p-values for each mix. component separately
histogram for component j made only from genes with large P(zg = j)
Effectively we condition on a best classification
Condition on classification to check separate components
All genes with P(zg = j) > 0
Only genes with P(zg = j) > 0.5
Predictive p-values for data simulated from the model
Checking different mixture distributions
w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)
Outer mix. components skewed too much away from zero
Null component too narrow
Checking different mixture distributions
w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)
Outer components skewed in the opposite direction
Null still too narrow?
Checking different mixture distributions
w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)
Better fit for all components
Implementation
pgj = 0
for t = 1,…,niter {
  δjpred^(t) ~ hj(ηj^(t)), j = 1,…,3
  ybargmpred^(t) ~ N( δjpred^(t), σg²/nrep ) for j = zg^(t)
  pgj ← pgj + I[ ybargmpred^(t) > ybargobs ] for j = zg^(t)
}
pgj ← pgj / niter(zg=j)
Need ≈ ngenes extra parameters at each iteration
[Figure: model graph with nodes μ,τ; σg; ηj; wj; δg; ybarg, Sg; δjpred; mixed predictive ybargjmpred]
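A hedged Python sketch of this loop is given below. It assumes the sampler's output is available as arrays: `z_draws` (allocations), `sigma2_draws` (gene variances) and `delta_pred_draws` (one predicted δ per component per iteration, drawn by the caller from hj). These names, and the Uniform components used in the demo, are assumptions for illustration. Here niter(zg=j) is read as the number of iterations in which gene g is allocated to component j.

```python
import numpy as np

def component_pvalues(ybar_obs, z_draws, sigma2_draws, delta_pred_draws,
                      n_rep, rng):
    """Return p[g, j] (NaN if gene g never visits j) and estimated P(z_g = j)."""
    n_iter, n_genes = z_draws.shape
    n_comp = delta_pred_draws.shape[1]
    exceed = np.zeros((n_genes, n_comp))
    visits = np.zeros((n_genes, n_comp))
    genes = np.arange(n_genes)
    for t in range(n_iter):
        j = z_draws[t]                         # current component of each gene
        mean = delta_pred_draws[t, j]          # predicted delta for that component
        ybar_pred = rng.normal(mean, np.sqrt(sigma2_draws[t] / n_rep))
        exceed[genes, j] += ybar_pred > ybar_obs
        visits[genes, j] += 1
    with np.errstate(invalid="ignore", divide="ignore"):
        return exceed / visits, visits / n_iter

# Demo with made-up values; allocations are held at their true values
rng = np.random.default_rng(3)
n_genes, n_rep, n_iter, eta = 300, 4, 500, 2.0
sign = np.array([-1.0, 0.0, 1.0])              # under, null, over
z_true = rng.choice(3, size=n_genes, p=[0.1, 0.8, 0.1])
delta = sign[z_true] * rng.uniform(0.0, eta, n_genes)
sigma2 = np.full(n_genes, 0.25)
ybar_obs = rng.normal(delta, np.sqrt(sigma2 / n_rep))
z_draws = np.tile(z_true, (n_iter, 1))                      # stand-in MCMC output
sigma2_draws = np.tile(sigma2, (n_iter, 1))                 # stand-in MCMC output
delta_pred_draws = sign * rng.uniform(0.0, eta, (n_iter, 3))
p, prob = component_pvalues(ybar_obs, z_draws, sigma2_draws,
                            delta_pred_draws, n_rep, rng)
null_genes = prob[:, 1] > 0.5                  # genes used for the null histogram
```

The mask `null_genes` mirrors the rule of building the histogram for component j only from genes with large P(zg = j).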
Summary of model checking procedure
1. Find part of model where individuals are assumed to be exchangeable (so information is shared)
2. Choose test statistic T (e.g. sample mean or variance)
3. Predict Tpred from distribution for exchangeable individuals (whole posterior for Tpred)
4. Compare observed Ti for each individual i to distribution of Tpred
5. For checking mixture components, condition on the best classification
Clustering and variable selection (Tadesse et al. 2005)
yi = vector of gene expression for each sample i = 1,…,n
Multivariate mixture model for clustering samples:
yi | zi = j ~ MVN(ζj, Λj), j = 1,…,J
P(zi = j) = wj
No. of mix. components (J) is estimated in the model
Aim to select genes which are informative for clustering the samples
Clustering and variable selection (Tadesse et al. 2005)
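γ = vector of indices of selected variables
γ’ = vector of indices of variables not used to cluster samples
i = sample, g = gene, j = mix. component
Likelihood conditional on allocation to mixture:
\[
\text{Likelihood} \mid z \;\sim\; \prod_{j} \exp\Big( -\tfrac{1}{2} \sum_{i \in C_j} \big(y_{i(\gamma)} - \mu_{j(\gamma)}\big)^{T} \, \Sigma_{j(\gamma)}^{-1} \, \big(y_{i(\gamma)} - \mu_{j(\gamma)}\big) \Big) \times \exp\Big( -\tfrac{1}{2} \sum_{i=1}^{n} \big(y_{i(\gamma')} - \eta_{(\gamma')}\big)^{T} \, \Omega_{(\gamma')}^{-1} \, \big(y_{i(\gamma')} - \eta_{(\gamma')}\big) \Big)
\]
Conjugate priors on multivariate means and covariance matrices
P(γg = 1) = φ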
Clustering and variable selection (Tadesse et al. 2005)
i = sample, g = gene, j = mix. component
Model checking: want to check the distribution for each mixture component separately (conditional on J)
In addition, need to condition on a given variable selection
Clearly impossible computationally
[Figure: model graph with nodes J; wj; φ; μj(γ), Σj(γ); η(γ), Ω(γ); yi; predicted y(γ)jpred]
Computing predictive p-values
1) Run model with no prediction
2) Find the best configuration:
  set of selected variables (γ)
  no. mixture components J
  allocation of samples to mixture components zi
3) Re-run model with (γ), J and zi fixed, calculating predictive p-values
pij = Prob( Tjpred > Tiobs | data, zi=j, J, (γ) )
where T = |y|² (for example)
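Step 3 could be sketched as below, assuming the re-run with fixed (γ), J and zi yields posterior draws of the component means and covariances for the selected variables. The arrays `mu_draws` and `cov_draws`, the function name and the demo values are all hypothetical, and in the demo the "draws" are simply copies of the values used to simulate the data.

```python
import numpy as np

def clustering_pvalues(y_sel, z, mu_draws, cov_draws, rng):
    """p_i for T = |y|^2, with allocation z and selected variables held fixed.

    y_sel: (n, p) data restricted to the selected variables
    mu_draws: (n_iter, J, p), cov_draws: (n_iter, J, p, p)
    """
    n_iter, n_comp, _ = mu_draws.shape
    T_obs = np.sum(y_sel ** 2, axis=1)                 # T_i = |y_i|^2
    exceed = np.zeros(len(y_sel))
    for t in range(n_iter):
        for j in range(n_comp):
            y_pred = rng.multivariate_normal(mu_draws[t, j], cov_draws[t, j])
            T_pred = np.sum(y_pred ** 2)               # predicted statistic for component j
            members = (z == j)                         # samples allocated to j
            exceed[members] += T_pred > T_obs[members]
    return exceed / n_iter

# Demo with made-up values: 2 components, 3 selected variables
rng = np.random.default_rng(4)
n, n_iter = 200, 400
mu_true = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
z = rng.choice(2, size=n)
y_sel = mu_true[z] + rng.normal(size=(n, 3))           # identity covariance
mu_draws = np.tile(mu_true, (n_iter, 1, 1))            # stand-in posterior draws
cov_draws = np.tile(np.eye(3), (n_iter, 2, 1, 1))      # stand-in posterior draws
p = clustering_pvalues(y_sel, z, mu_draws, cov_draws, rng)
```

As with the gene-level checks, the resulting p-values would be examined separately for each mixture component.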
Conclusions
Choice of model distributions can greatly influence results of clustering and classification
For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation
Should be possible to do this even for quite complex models (if you can fit the model, you can check it)
Acknowledgements
Collaborators on BBSRC Exploiting Genomics Grant
Natalia Bochkina, Clare Marshall
Peter Green
Meeting on model checking in Cambridge
David Spiegelhalter
Shaun Seaman
BBSRC Exploiting Genomics Grant
Paper and software at http://www.bgx.org.uk/