Bayesian mixture models for analysing gene expression data

Natalia Bochkina
In collaboration with Alex Lewin, Sylvia Richardson, BAIR Consortium
Imperial College London, UK
Introduction

We use a fully Bayesian approach to model the data, with MCMC for parameter estimation.
• Models all parameters simultaneously.
• Prior information can be included in the model.
• Variances are automatically adjusted, avoiding unstable estimates when the number of observations is small.
• Inference is based on the posterior distribution of all parameters.
• The mean of the posterior distribution is used as the estimate for each parameter.
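As a minimal sketch of the last two points (illustrative only, not the poster's model): with a conjugate prior the posterior is available in closed form, and its mean serves as the parameter estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conjugate example: normal mean with known variance and a
# vague N(mu0, tau0^2) prior; all names here are mine, not the poster's.
y = rng.normal(loc=2.0, scale=1.0, size=10)   # 10 replicate observations
mu0, tau0_sq, sigma_sq = 0.0, 100.0, 1.0

precision = 1.0 / tau0_sq + len(y) / sigma_sq
post_mean = (mu0 / tau0_sq + y.sum() / sigma_sq) / precision
post_var = 1.0 / precision                    # always below sigma_sq / len(y)
print(post_mean, post_var)
```

The posterior mean shrinks the sample mean slightly towards the prior mean; with a vague prior the shrinkage is negligible.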
[Figure: distributions of the expression index for gene g under conditions 1 and 2, and the distribution of the differential expression parameter]
Bayesian Model

Two conditions, with R1 and R2 replicates respectively. (Assume the data are background corrected, log-transformed and normalised.)

y_g1r ~ N(α_g − ½ d_g, σ²_g1), r = 1, …, R1
y_g2r ~ N(α_g + ½ d_g, σ²_g2), r = 1, …, R2
σ²_gk ~ IG(a_k, b_k), k = 1, 2

E(σ²_gk | s²_gk) = [(R_k − 1) s²_gk + 2 b_k] / (R_k − 1 + 2 a_k)

Here d_g is the mean difference (log fold change). Non-informative priors on α_g, a_k, b_k.

Prior distribution on d_g?
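The conditional expectation above can be read as a shrinkage formula. A sketch (the function name is mine) showing how small-sample variances are stabilised:

```python
def shrunk_variance(s2, R, a, b):
    """Posterior mean of sigma^2_gk given the sample variance s^2_gk:
    [(R - 1) * s2 + 2*b] / (R - 1 + 2*a), under the IG(a, b) prior."""
    return ((R - 1) * s2 + 2 * b) / (R - 1 + 2 * a)

# With few replicates a tiny sample variance is pulled towards the prior,
# avoiding unstable estimates; with many replicates the data dominate.
few = shrunk_variance(s2=0.001, R=3, a=1.5, b=0.05)
many = shrunk_variance(s2=0.001, R=100, a=1.5, b=0.05)
print(few, many)
```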
Modelling differential expression

Prior information / assumption: genes are either differentially expressed or not (of interest or not). This can be included in the model by writing the difference as a mixture:

d_g ~ (1 − p) δ0(d_g) + p H(d_g | θ_g)

How to choose H?

Advantages:
• Automatically selects a threshold, as opposed to specifying constants as in the non-informative prior model for differences.
• Interpretable: Bayesian classification can be used to select differentially expressed genes, declaring gene g differentially expressed if P{g in H1 | data} > P{g in H0 | data}.
• False discovery and non-discovery rates can be estimated (Newton et al., 2004).
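The classification rule P{g in H1 | data} > P{g in H0 | data} reduces to thresholding the posterior probability at ½. A sketch for a two-component prior with a normal alternative (all names and parameter values below are mine):

```python
import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_prob_h1(dbar, se, p, alt_sd):
    """P(g in H1 | data): dbar ~ N(0, se^2) under H0,
    dbar ~ N(0, se^2 + alt_sd^2) under H1, prior P(H1) = p."""
    lik0 = normal_pdf(dbar, se)
    lik1 = normal_pdf(dbar, np.sqrt(se ** 2 + alt_sd ** 2))
    return p * lik1 / ((1 - p) * lik0 + p * lik1)

# A difference large relative to its standard error is classified into H1.
print(posterior_prob_h1(dbar=2.0, se=0.3, p=0.2, alt_sd=1.0))  # > 0.5
print(posterior_prob_h1(dbar=0.1, se=0.3, p=0.2, alt_sd=1.0))  # < 0.5
```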
Considered mixture models

We consider several distributions for the non-zero part H of the mixture distribution for d_g, all in a fully Bayesian context: a double gamma, a Student t distribution, the conjugate model of Lonnstedt and Speed (2002), and a uniform distribution.

Gamma model: H is a double gamma distribution:
d_g ~ p0 δ0(d_g) + p1 Γ(d_g | 2, λ1) + p2 Γ(−d_g | 2, λ2)

T model: H is a Student t distribution:
d_g ~ (1 − p) δ0 + p T(ν, μ, τ)

LS model: H is normal with variance proportional to the variance of the data:
d_g ~ (1 − p) δ0 + p N(0, c σ*²_g), where σ*²_g = σ²_g1/R1 + σ²_g2/R2

Uniform model: H is a uniform distribution:
d_g ~ (1 − p) δ0 + p U(−m1, m2), where (−m1, m2) is a slightly widened range of the observed differences.

Priors on hyperparameters are either non-informative or weakly informative G(1, 1) for parameters supported on the positive half-line.
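For reference, the four alternative distributions can be sampled directly; the parameter values below are illustrative, not fitted values from the poster.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Double gamma: Gamma(2, lambda) magnitudes with a random sign.
sign = rng.choice([-1.0, 1.0], size=n)
gamma_draws = sign * rng.gamma(shape=2.0, scale=1.0, size=n)

# Student t with location mu and scale tau.
nu, mu, tau = 4.0, 0.0, 1.0
t_draws = mu + tau * rng.standard_t(df=nu, size=n)

# LS: normal with variance c * sigma*^2_g (both values made up here).
c, sigma_star_sq = 30.0, 0.02
ls_draws = rng.normal(scale=np.sqrt(c * sigma_star_sq), size=n)

# Uniform on a slightly widened range of observed differences.
u_draws = rng.uniform(-3.0, 3.0, size=n)
```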
Simulated data

We compare the performance of the four models on simulated data. For simplicity we consider a one-group model (equivalently, a paired two-group model) and simulate a data set with 1000 variables and 8 replicates. The variance hyperparameters a = 1.5, b = 0.05 are chosen close to Bayesian estimates obtained from a real data set.

[Figure: plot of the simulated data set (difference vs variance)]
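A sketch of one way to generate such a data set with the poster's variance hyperparameters (the distribution and proportion of non-zero differences are my choices):

```python
import numpy as np

rng = np.random.default_rng(42)
G, R = 1000, 8            # 1000 genes, 8 replicates (one-group model)
a, b = 1.5, 0.05          # variance hyperparameters from the poster
p = 0.2                   # proportion of differentially expressed genes (my choice)

# Gene variances sigma^2_g ~ IG(a, b): inverse of a Gamma(a, rate=b) draw.
sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=G)

# Mixture prior on the differences: point mass at 0 or a wide normal.
z = rng.random(G) < p
d = np.where(z, rng.normal(scale=1.0, size=G), 0.0)

# Observations y_gr ~ N(d_g, sigma^2_g).
y = rng.normal(loc=d[:, None], scale=np.sqrt(sigma2)[:, None], size=(G, R))
print(y.shape)
```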
Differences
• Gamma, T and LS models estimate differences well
• Uniform model shrinks values to zero
• Compared to empirical Bayes, posterior estimates in the fully Bayesian approach do not shrink large values of the differences
[Figure: mixture estimates (posterior means) vs true values for the four models]
Bayesian estimates of variance
• T and Gamma models have very similar variance estimates
• Uniform model produces similar estimates for small values and higher estimates for larger values compared with T and Gamma models
• The LS model shows more perturbation at both higher and lower values compared to the T and Gamma models.
Blue: variance estimate E(σ² | y) based on the Bayesian model with a non-informative prior on the differences.

The mixture estimate of the variance can be larger than the sample variance.

[Figure: E(σ² | y) vs sample variance for the Gamma, T, Uniform and LS models]
Classification

• T, LS and Gamma models perform similarly.
• The Uniform model has fewer false positives but also fewer true positives: the uniform prior is more conservative.

Diff. expressed genes (200):
Model    Alternative  Null
Gamma    179          21
Uniform  172          28
T        182          18
LS       182          18

Non-diff. expressed genes (800):
Model    Alternative  Null
Gamma    4            796
Uniform  1            799
T        4            796
LS       3            797

The genes wrongly classified by the mixture (both truly differentially expressed and truly not differentially expressed) lie on the borderline: classification errors reflect confusion between the size of the fold change and the biological variability.
Another simulation

Can we improve the estimation of within-condition biological variability? We simulate 2628 data points, with many points added on the borderline; classification errors are shown in red.
DAG for the mixture model

[Figure: directed acyclic graph with hyperparameters (a1, b1), (a2, b2) and p; variances σ²_g1, σ²_g2 with sample variances s²_g1, s²_g2; difference δ_g with allocation z_g; and the data summaries ½(y̅_g1 + y̅_g2) and y̅_g1 − y̅_g2, on a plate over g = 1:G]

The variance estimates are influenced by the mixture parameters. Can we use only partial information from the replicates to estimate the σ²_gs and feed it forward into the mixture?
Estimation

Estimation of all parameters combines information from biological replicates and between-condition contrasts.

• s²_gs = (1/R_s) Σ_r (y_gsr − y̅_gs)², s = 1, 2 (within-condition biological variability)
• y̅_gs = (1/R_s) Σ_r y_gsr (average expression over replicates)
• ½(y̅_g1 + y̅_g2) (average expression over conditions)
• ½(y̅_g1 − y̅_g2) (between-conditions contrast)
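These summary statistics computed directly in code (array names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.normal(size=(500, 4))   # condition 1: 500 genes, R1 = 4 replicates
y2 = rng.normal(size=(500, 4))   # condition 2: R2 = 4 replicates

ybar1, ybar2 = y1.mean(axis=1), y2.mean(axis=1)    # averages over replicates
s2_1 = ((y1 - ybar1[:, None]) ** 2).mean(axis=1)   # (1/R) sum of squared
s2_2 = ((y2 - ybar2[:, None]) ** 2).mean(axis=1)   # deviations, as on the poster
overall = 0.5 * (ybar1 + ybar2)                    # average over conditions
contrast = 0.5 * (ybar1 - ybar2)                   # between-conditions contrast
```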
Mixture, full vs partial (work in progress)

Classification is altered for 57 points when the feedback from the mixture is cut: 46 data points gain an improved classification, while 11 data points change to a new, incorrect classification.

Model      p0 (0.685)  p1 (0.157)  p2 (0.158)  FP  FN
Full       0.745       0.128       0.128       29  285
With cut   0.737       0.132       0.131       30  251

[Figure: differences between cut and no cut: variance, posterior probability, and sample st. dev. vs difference, highlighting differently classified genes (truly differentially expressed vs truly not)]
Microarray data

Genes classified differently by the full model and the model with the feedback cut follow a curve.

[Figure: posterior probability and pooled sample st. dev. vs sample difference, full model vs cut]

Since the variance is overestimated in the full mixture model compared to the mixture model with the cut, the number of false negatives is lower for the model with the cut than for the full model.
LS model: empirical vs fully Bayesian

We compare the Lonnstedt and Speed (LS) model,

d_g ~ (1 − p) δ0 + p N(0, c σ*²_g),

in the fully Bayesian (FB) and empirical Bayes (EB) frameworks.

Estimated parameters:
Model        c       a     b
EB, p=0.01   280.61  1.54  0.048
EB, p=0.2    34.30   1.54  0.048
Fully B      34.09   1.56  0.049

Classification:
• If the parameter p is specified correctly, the empirical and fully Bayesian models do not differ.
• If p is misspecified (e.g. a small p such as p = 0.01), the estimate of the parameter c changes, which leads to misclassification.

Method       False Positives  False Negatives
EB, p=0.01   0                44
EB, p=0.2    3                18
FB           3                18
Bayesian Estimate of FDR

• Step 1: Choose a gene-specific parameter (e.g. δ_g) or a gene statistic.
• Step 2: Model its prior distribution using a mixture model:
  -- one component models the unaffected genes (the null hypothesis), e.g. a point mass at 0 for δ_g;
  -- the other components model (flexibly) the alternative.
• Step 3: Calculate the posterior probability that any gene belongs to the unmodified component: p_g0 | data.
• Step 4: Evaluate the FDR (and FNR) for any list, assuming that all the gene classifications are independent (Broët et al., 2004):

Bayes FDR(list) | data = 1/card(list) Σ_{g in list} p_g0
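Step 4 in code: the Bayes FDR of a gene list is the average of the null posterior probabilities p_g0 over the list (helper names are mine):

```python
import numpy as np

def bayes_fdr(pg0, gene_list):
    """Bayes FDR(list) | data = (1 / card(list)) * sum_{g in list} p_g0."""
    pg0 = np.asarray(pg0)
    return pg0[gene_list].mean()

# Rank genes by 1 - p_g0 (most likely differentially expressed first)
# and take the top 3 as the list.
pg0 = np.array([0.01, 0.90, 0.05, 0.70, 0.02])
top = np.argsort(pg0)[:3]
print(bayes_fdr(pg0, top))   # mean of 0.01, 0.02, 0.05
```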
Multiple Testing Problem

• Gene lists can be built by computing a criterion separately for each gene and ranking.
• Thousands of genes are considered simultaneously. How do we assess the performance of such lists?

Statistical challenge: select interesting genes without including too many false positives in a gene list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. We want an evaluation of the expected false discovery rate (FDR).
By Bayes' rule, Post Prob(g in H1) = 1 − p_g0.

[Figure: FDR (black) and FNR (blue) as functions of 1 − p_g0; the observed and estimated FDR/FNR correspond well]
Summary

• Mixture models estimate the differences and hyperparameters well on simulated data.
• The variance is overestimated for some genes.
• The mixture model with the uniform alternative distribution is more conservative in classifying genes than the structured models.
• The Lonnstedt and Speed model performs better in the fully Bayesian framework because the parameter p is estimated from the data.
• Estimates of the false discovery and non-discovery rates are close to the true values.