Bayesian mixture models for analysing gene expression data

Natalia Bochkina
In collaboration with Alex Lewin, Sylvia Richardson, BAIR Consortium
Imperial College London, UK
Introduction

We use a fully Bayesian approach to model the data, with MCMC for parameter estimation.
• Models all parameters simultaneously.
• Prior information can be included in the model.
• Variances are automatically adjusted, avoiding unstable estimates when the number of observations is small.
• Inference is based on the posterior distribution of all parameters.
• The mean of the posterior distribution is used as the estimate for each parameter.
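As a minimal sketch of the last two points (illustrative only, not the poster's model): with a conjugate prior the posterior is available in closed form, and its mean serves as the parameter estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conjugate example: normal mean with known variance and a
# vague N(mu0, tau0^2) prior; all names here are mine, not the poster's.
y = rng.normal(loc=2.0, scale=1.0, size=10)   # 10 replicate observations
mu0, tau0_sq, sigma_sq = 0.0, 100.0, 1.0

precision = 1.0 / tau0_sq + len(y) / sigma_sq
post_mean = (mu0 / tau0_sq + y.sum() / sigma_sq) / precision
post_var = 1.0 / precision                    # always below sigma_sq / len(y)
print(post_mean, post_var)
```

The posterior mean shrinks the sample mean slightly towards the prior mean; with a vague prior the shrinkage is negligible.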
[Figure: distributions of the expression index for gene g under conditions 1 and 2, and the distribution of the differential expression parameter]
Bayesian Model

Two conditions, with R1 and R2 replicates respectively. (Assume the data are background corrected, log-transformed and normalised.)

y_g1r ~ N(α_g − ½ d_g, σ²_g1), r = 1, …, R1
y_g2r ~ N(α_g + ½ d_g, σ²_g2), r = 1, …, R2
σ²_gk ~ IG(a_k, b_k), k = 1, 2

E(σ²_gk | s²_gk) = [(R_k − 1) s²_gk + 2 b_k] / (R_k − 1 + 2 a_k)

Here d_g is the mean difference (log fold change). Non-informative priors on α_g, a_k, b_k.

Prior distribution on d_g?
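The conditional expectation above can be read as a shrinkage formula. A sketch (the function name is mine) showing how small-sample variances are stabilised:

```python
def shrunk_variance(s2, R, a, b):
    """Posterior mean of sigma^2_gk given the sample variance s^2_gk:
    [(R - 1) * s2 + 2*b] / (R - 1 + 2*a), under the IG(a, b) prior."""
    return ((R - 1) * s2 + 2 * b) / (R - 1 + 2 * a)

# With few replicates a tiny sample variance is pulled towards the prior,
# avoiding unstable estimates; with many replicates the data dominate.
few = shrunk_variance(s2=0.001, R=3, a=1.5, b=0.05)
many = shrunk_variance(s2=0.001, R=100, a=1.5, b=0.05)
print(few, many)
```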
Modelling differential expression

Prior information / assumption: genes are either differentially expressed or not (of interest or not). This can be included in the model by writing the difference as a mixture:

d_g ~ (1 − p) δ0(d_g) + p H(d_g | θ_g)

How to choose H?

Advantages:
• Automatically selects a threshold, as opposed to specifying constants as in the non-informative prior model for differences.
• Interpretable: Bayesian classification can be used to select differentially expressed genes, declaring gene g differentially expressed if P{g in H1 | data} > P{g in H0 | data}.
• False discovery and non-discovery rates can be estimated (Newton et al., 2004).
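The classification rule P{g in H1 | data} > P{g in H0 | data} reduces to thresholding the posterior probability at ½. A sketch for a two-component prior with a normal alternative (all names and parameter values below are mine):

```python
import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_prob_h1(dbar, se, p, alt_sd):
    """P(g in H1 | data): dbar ~ N(0, se^2) under H0,
    dbar ~ N(0, se^2 + alt_sd^2) under H1, prior P(H1) = p."""
    lik0 = normal_pdf(dbar, se)
    lik1 = normal_pdf(dbar, np.sqrt(se ** 2 + alt_sd ** 2))
    return p * lik1 / ((1 - p) * lik0 + p * lik1)

# A difference large relative to its standard error is classified into H1.
print(posterior_prob_h1(dbar=2.0, se=0.3, p=0.2, alt_sd=1.0))  # > 0.5
print(posterior_prob_h1(dbar=0.1, se=0.3, p=0.2, alt_sd=1.0))  # < 0.5
```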
Considered mixture models

We consider several distributions for the non-zero part H of the mixture distribution for d_g, all in a fully Bayesian context: a double gamma, a Student t distribution, the conjugate model of Lonnstedt and Speed (2002), and a uniform distribution.

Gamma model: H is a double gamma distribution:
d_g ~ p0 δ0(d_g) + p1 Γ(d_g | 2, λ1) + p2 Γ(−d_g | 2, λ2)

T model: H is a Student t distribution:
d_g ~ (1 − p) δ0 + p T(ν, μ, τ)

LS model: H is normal with variance proportional to the variance of the data:
d_g ~ (1 − p) δ0 + p N(0, c σ*²_g), where σ*²_g = σ²_g1/R1 + σ²_g2/R2

Uniform model: H is a uniform distribution:
d_g ~ (1 − p) δ0 + p U(−m1, m2), where (−m1, m2) is a slightly widened range of the observed differences.

Priors on hyperparameters are either non-informative or weakly informative G(1, 1) for parameters supported on the positive half-line.
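For reference, the four alternative distributions can be sampled directly; the parameter values below are illustrative, not fitted values from the poster.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Double gamma: Gamma(2, lambda) magnitudes with a random sign.
sign = rng.choice([-1.0, 1.0], size=n)
gamma_draws = sign * rng.gamma(shape=2.0, scale=1.0, size=n)

# Student t with location mu and scale tau.
nu, mu, tau = 4.0, 0.0, 1.0
t_draws = mu + tau * rng.standard_t(df=nu, size=n)

# LS: normal with variance c * sigma*^2_g (both values made up here).
c, sigma_star_sq = 30.0, 0.02
ls_draws = rng.normal(scale=np.sqrt(c * sigma_star_sq), size=n)

# Uniform on a slightly widened range of observed differences.
u_draws = rng.uniform(-3.0, 3.0, size=n)
```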
Simulated data

We compare the performance of the four models on simulated data. For simplicity we consider a one-group model (equivalently, a paired two-group model) and simulate a data set with 1000 variables and 8 replicates. The variance hyperparameters a = 1.5, b = 0.05 are chosen close to Bayesian estimates obtained from a real data set.

[Figure: plot of the simulated data set (difference vs variance)]
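A sketch of one way to generate such a data set with the poster's variance hyperparameters (the distribution and proportion of non-zero differences are my choices):

```python
import numpy as np

rng = np.random.default_rng(42)
G, R = 1000, 8            # 1000 genes, 8 replicates (one-group model)
a, b = 1.5, 0.05          # variance hyperparameters from the poster
p = 0.2                   # proportion of differentially expressed genes (my choice)

# Gene variances sigma^2_g ~ IG(a, b): inverse of a Gamma(a, rate=b) draw.
sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=G)

# Mixture prior on the differences: point mass at 0 or a wide normal.
z = rng.random(G) < p
d = np.where(z, rng.normal(scale=1.0, size=G), 0.0)

# Observations y_gr ~ N(d_g, sigma^2_g).
y = rng.normal(loc=d[:, None], scale=np.sqrt(sigma2)[:, None], size=(G, R))
print(y.shape)
```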
Differences
• Gamma, T and LS models estimate differences well
• Uniform model shrinks values to zero
• Compared to empirical Bayes, posterior estimates in the fully Bayesian approach do not shrink large values of the differences
[Figure: mixture estimates (posterior means) vs true values for the four models]
Bayesian estimates of variance
• T and Gamma models have very similar variance estimates
• Uniform model produces similar estimates for small values and higher estimates for larger values compared with T and Gamma models
• The LS model shows more perturbation at both higher and lower values compared to the T and Gamma models.
Blue: variance estimate E(σ² | y) based on the Bayesian model with a non-informative prior on the differences.

The mixture estimate of the variance can be larger than the sample variance.

[Figure: E(σ² | y) vs sample variance for the Gamma, T, Uniform and LS models]
Classification

• T, LS and Gamma models perform similarly.
• The Uniform model has fewer false positives but also fewer true positives: the uniform prior is more conservative.

Diff. expressed genes (200):
Model    Alternative  Null
Gamma    179          21
Uniform  172          28
T        182          18
LS       182          18

Non-diff. expressed genes (800):
Model    Alternative  Null
Gamma    4            796
Uniform  1            799
T        4            796
LS       3            797

The genes wrongly classified by the mixture (both truly differentially expressed and truly not differentially expressed) lie on the borderline: classification errors reflect confusion between the size of the fold change and the biological variability.
Another simulation

Can we improve the estimation of within-condition biological variability? We simulate 2628 data points, with many points added on the borderline; classification errors are shown in red.
DAG for the mixture model

[Figure: directed acyclic graph with hyperparameters (a1, b1), (a2, b2) and p; variances σ²_g1, σ²_g2 with sample variances s²_g1, s²_g2; difference δ_g with allocation z_g; and the data summaries ½(y̅_g1 + y̅_g2) and y̅_g1 − y̅_g2, on a plate over g = 1:G]

The variance estimates are influenced by the mixture parameters. Can we use only partial information from the replicates to estimate the σ²_gs and feed it forward into the mixture?
Estimation

Estimation of all parameters combines information from biological replicates and between-condition contrasts.

• s²_gs = (1/R_s) Σ_r (y_gsr − y̅_gs)², s = 1, 2 (within-condition biological variability)
• y̅_gs = (1/R_s) Σ_r y_gsr (average expression over replicates)
• ½(y̅_g1 + y̅_g2) (average expression over conditions)
• ½(y̅_g1 − y̅_g2) (between-conditions contrast)
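These summary statistics computed directly in code (array names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.normal(size=(500, 4))   # condition 1: 500 genes, R1 = 4 replicates
y2 = rng.normal(size=(500, 4))   # condition 2: R2 = 4 replicates

ybar1, ybar2 = y1.mean(axis=1), y2.mean(axis=1)    # averages over replicates
s2_1 = ((y1 - ybar1[:, None]) ** 2).mean(axis=1)   # (1/R) sum of squared
s2_2 = ((y2 - ybar2[:, None]) ** 2).mean(axis=1)   # deviations, as on the poster
overall = 0.5 * (ybar1 + ybar2)                    # average over conditions
contrast = 0.5 * (ybar1 - ybar2)                   # between-conditions contrast
```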
Mixture, full vs partial (work in progress)

Classification is altered for 57 points when the feedback from the mixture is cut: 46 data points gain an improved classification, while 11 data points change to a new, incorrect classification.

Model      p0 (0.685)  p1 (0.157)  p2 (0.158)  FP  FN
Full       0.745       0.128       0.128       29  285
With cut   0.737       0.132       0.131       30  251

[Figure: differences between cut and no cut: variance, posterior probability, and sample st. dev. vs difference, highlighting differently classified genes (truly differentially expressed vs truly not)]
Microarray data

Genes classified differently by the full model and the model with the feedback cut follow a curve.

[Figure: posterior probability and pooled sample st. dev. vs sample difference, full model vs cut]

Since the variance is overestimated in the full mixture model compared to the mixture model with the cut, the number of false negatives is lower for the model with the cut than for the full model.
LS model: empirical vs fully Bayesian

We compare the Lonnstedt and Speed (LS) model,

d_g ~ (1 − p) δ0 + p N(0, c σ*²_g),

in the fully Bayesian (FB) and empirical Bayes (EB) frameworks.

Estimated parameters:
Model        c       a     b
EB, p=0.01   280.61  1.54  0.048
EB, p=0.2    34.30   1.54  0.048
Fully B      34.09   1.56  0.049

Classification:
• If the parameter p is specified correctly, the empirical and fully Bayesian models do not differ.
• If p is misspecified (e.g. a small p such as p = 0.01), the estimate of the parameter c changes, which leads to misclassification.

Method       False Positives  False Negatives
EB, p=0.01   0                44
EB, p=0.2    3                18
FB           3                18
Bayesian Estimate of FDR

• Step 1: Choose a gene-specific parameter (e.g. δ_g) or a gene statistic.
• Step 2: Model its prior distribution using a mixture model:
  -- one component models the unaffected genes (the null hypothesis), e.g. a point mass at 0 for δ_g;
  -- the other components model (flexibly) the alternative.
• Step 3: Calculate the posterior probability that any gene belongs to the unmodified component: p_g0 | data.
• Step 4: Evaluate the FDR (and FNR) for any list, assuming that all the gene classifications are independent (Broët et al., 2004):

Bayes FDR(list) | data = 1/card(list) Σ_{g in list} p_g0
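Step 4 in code: the Bayes FDR of a gene list is the average of the null posterior probabilities p_g0 over the list (helper names are mine):

```python
import numpy as np

def bayes_fdr(pg0, gene_list):
    """Bayes FDR(list) | data = (1 / card(list)) * sum_{g in list} p_g0."""
    pg0 = np.asarray(pg0)
    return pg0[gene_list].mean()

# Rank genes by 1 - p_g0 (most likely differentially expressed first)
# and take the top 3 as the list.
pg0 = np.array([0.01, 0.90, 0.05, 0.70, 0.02])
top = np.argsort(pg0)[:3]
print(bayes_fdr(pg0, top))   # mean of 0.01, 0.02, 0.05
```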
Multiple Testing Problem

• Gene lists can be built by computing a criterion separately for each gene and ranking.
• Thousands of genes are considered simultaneously. How do we assess the performance of such lists?

Statistical challenge: select interesting genes without including too many false positives in a gene list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. We want an evaluation of the expected false discovery rate (FDR).
By Bayes' rule, Post Prob(g in H1) = 1 − p_g0.

[Figure: FDR (black) and FNR (blue) as functions of 1 − p_g0; the observed and estimated FDR/FNR correspond well]
Summary

• Mixture models estimate the differences and hyperparameters well on simulated data.
• The variance is overestimated for some genes.
• The mixture model with the uniform alternative distribution is more conservative in classifying genes than the structured models.
• The Lonnstedt and Speed model performs better in the fully Bayesian framework because the parameter p is estimated from the data.
• Estimates of the false discovery and non-discovery rates are close to the true values.