Sylvia Richardson, with Alex Lewin Department of Epidemiology and Public Health, Imperial College

Date post: 06-Jan-2016
Page 1: Bayesian modelling of differential gene expression data

Sylvia Richardson, with Alex Lewin
Department of Epidemiology and Public Health, Imperial College

In collaboration with Anne Mette Hein and Clare Marshall (Imperial), Philippe Broët (INSERM) and Peter Green (Bristol), Helen Causton and Tim Aitman (Hammersmith)

Page 2: Outline

• Introduction: what is gene expression data?
• Array effects (normalisation)
• Exchangeable model for the gene-specific variances and Bayesian model checks
• Differential expression
• Rank statistics
• Mixture models for differential expression

Page 3: What is gene expression data? (Introduction 1)

Protein-encoding genes are transcribed into messenger RNA (mRNA), and the mRNA is translated to make proteins, the building blocks of living cells.

DNA microarrays can be used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell type, under particular conditions.

The fundamental principle used to measure expression is hybridisation.

Page 4: The Principle of Hybridisation (Introduction 2)

[Figure, slide courtesy of Affymetrix. Panel labels: image of hybridised array (1.28 cm; approx. ½ million different complementary oligonucleotides); oligonucleotide element (20 µm; millions of copies of a specific oligonucleotide sequence); single-stranded, labelled RNA sample; hybridised spot; zoom image of hybridised array showing expressed and non-expressed genes.]

Page 5: Introduction 3

Hybridisation indicates that the corresponding gene is present because:
• The binding is highly specific
• The sequences represented on the array are designed to be unique in the genome

The expression levels of thousands of genes are measured on a single microarray, giving a gene expression profile.

There are different types of arrays: we consider the case of oligonucleotide arrays, where one extract per slide is hybridised.

Page 6: Variation and uncertainty (Introduction 4)

Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability.

Signal:
• Treatment
• Response to genetic and environmental conditions

Different components of noise:
• Biological heterogeneity
• Sample preparation
• Array manufacture
• Hybridisation process
• Imaging

Page 7: Variation and uncertainty

Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability.

Signal:
• Treatment
• Response to genetic and environmental conditions

• Biological heterogeneity (inherent)
• Sample preparation
• Array manufacture
• Hybridisation process
• Imaging

Good practice minimizes these sources of variation.

Page 8: Gene expression analysis is a multi-step process (Introduction 5)

• Low-level model (how is the measured expression related to the signal)
• Normalisation (to make samples comparable)
• Differential expression
• Clustering, partition model

We aim to integrate all the steps in a common statistical framework.

Page 9: Bayesian hierarchical model framework (Introduction 6)

• Ability to model various sources of variability, e.g. detailed modelling of experimental variability (within array, between arrays) and estimation of gene-specific variability
• Building all these features into a common model, so that uncertainty is propagated
• Ability to borrow / share information in appropriate ways to get better estimates

Page 10: Data Set and Biological question

Previous work (Tim Aitman, Anne Marie Glazier):
• Deficiency in gene Cd36 found to be associated with insulin resistance in SHR (spontaneously hypertensive rat)
• Good animal model to tease out genes implicated in this syndrome

Microarray study:
• 3 SHR compared with 3 transgenic rats
• 3 wildtype mice compared with 3 knockout mice
• Two tissues: fat and heart
• Affymetrix chips U34A-C and U74A-C (~12000 genes)

Page 11: I. Additive (log scale) model for expression (Model 1)

Notation:
• $y_{gr}$ = gene expression measurement (log scale) for gene $g$, $g = 1, \dots, N$, and replicate $r$, $r = 1, \dots, R$

Additive model:

$y_{gr} = \alpha_g + \beta_r(\alpha_g) + \epsilon_{gr}$

Here:
-- $\alpha_g$ is the expression level of the $g$th gene,
-- $\beta_r(\alpha_g)$ is the array effect (normalisation term), possibly dependent on $g$ through the expression level $\alpha_g$; a constraint is needed to ensure identifiability: $\sum_r \beta_r(\alpha_g) = 0$,
-- $\epsilon_{gr}$ is an error term with mean 0 and $\mathrm{Var}(\epsilon_{gr}) = \sigma_g^2$.
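To make the additive model concrete, the following is a minimal simulation sketch, not the authors' code. It assumes constant array effects (the simplest case of the array term), normal errors, and arbitrary illustrative values for the gene levels and variances; the function name `simulate_additive` is hypothetical.

```python
import numpy as np

def simulate_additive(n_genes=1000, n_reps=3, seed=0):
    """Simulate y_gr = alpha_g + beta_r + eps_gr (all numeric values are assumptions)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(7.0, 1.5, size=n_genes)        # gene expression levels (log scale)
    beta = rng.normal(0.0, 0.2, size=n_reps)          # constant array effects, for simplicity
    beta -= beta.mean()                               # identifiability: effects sum to zero over arrays
    sigma2 = rng.lognormal(-2.0, 0.5, size=n_genes)   # gene-specific error variances
    eps = rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(n_genes, n_reps))
    y = alpha[:, None] + beta[None, :] + eps
    return y, alpha, beta, sigma2

y, alpha, beta, sigma2 = simulate_additive()
# Because the array effects sum to zero, the row mean estimates the gene level.
alpha_hat = y.mean(axis=1)
```

The sum-to-zero constraint is what lets the gene level and the array effect be separated: without it, a constant could be moved freely between the two terms.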

Page 12: Exploratory analysis of array effect (Model 2)

Wildtype mouse fat data on 3 arrays.

• 6 equal-size groups of genes defined by mean expression level
• Estimate a constant array effect $\beta_{ri}$ for each group $i$, $i = 1{:}6$:

$y_{gr} = \alpha_g + \beta_{ri} + \epsilon_{gr}$

[Figure: estimated $\beta_{ri}$, $r = 1{:}3$, $i = 1{:}6$. Each array corresponds to a different colour.]
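The exploratory step can be sketched as follows. This is a simple moment-based illustration rather than the fitted model from the slides: genes are sorted by mean expression, split into six near-equal groups, and the array effect in each group is taken as the average residual from the gene mean. The function name and the simulated data are assumptions.

```python
import numpy as np

def groupwise_array_effects(y, n_groups=6):
    """y: (genes, arrays) log expression. Returns an (n_groups, arrays) array of effects."""
    gene_mean = y.mean(axis=1)
    order = np.argsort(gene_mean)
    groups = np.array_split(order, n_groups)      # near-equal-size groups by mean level
    resid = y - gene_mean[:, None]                # residual from the gene mean
    return np.vstack([resid[idx].mean(axis=0) for idx in groups])

# Illustrative data: array effects that vary linearly with expression level.
rng = np.random.default_rng(1)
alpha = rng.normal(7.0, 1.5, size=6000)
slopes = np.array([0.1, 0.0, -0.1])               # slopes sum to zero across arrays
y = (alpha[:, None] + slopes[None, :] * (alpha[:, None] - 7.0)
     + rng.normal(0.0, 0.1, size=(6000, 3)))
effects = groupwise_array_effects(y)
```

Each row of `effects` sums to zero by construction, and a trend across rows is the kind of pattern that motivates modelling the array effect as a function of expression level.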

Page 13: Flexible model of the array effect (Model 3)

The exploratory analysis suggests modelling the array effect as a (smooth) function of $\alpha_g$.

Piecewise polynomial with unknown break points:

$\beta_r(\alpha_g)$ = quadratic in $\alpha_g$ for $a_{r,k-1} \le \alpha_g \le a_{rk}$, with coefficients $(b_{rk}^{(1)}, b_{rk}^{(2)})$, $k = 1, \dots,$ # knots

• Locations of the break points are not fixed
• All coefficients $b_{rk}^{(j)}$ are given centred normal priors
• Constraint: $\sum_r \beta_r(\alpha_g) = 0$
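A piecewise quadratic array effect can be evaluated as in the sketch below. Several details are assumptions made only for illustration: each segment has its own intercept in addition to the linear and quadratic coefficients, no continuity is imposed at the break points, and the sum-to-zero constraint is enforced by centring across arrays after evaluation. The function name `array_effects` is hypothetical.

```python
import numpy as np

def array_effects(alpha, knots, coef):
    """Piecewise quadratic array effects.

    alpha: (G,) expression levels
    knots: (R, K-1) increasing break points for each array
    coef:  (R, K, 3) intercept, linear and quadratic coefficients per segment
    Returns an (R, G) array centred so that effects sum to zero over arrays.
    """
    alpha = np.asarray(alpha, dtype=float)
    n_arrays = coef.shape[0]
    beta = np.empty((n_arrays, alpha.size))
    for r in range(n_arrays):
        k = np.searchsorted(knots[r], alpha, side="right")   # segment index for each gene
        c = coef[r, k]                                        # (G, 3) coefficients in that segment
        beta[r] = c[:, 0] + c[:, 1] * alpha + c[:, 2] * alpha ** 2
    return beta - beta.mean(axis=0, keepdims=True)            # centring: an assumed way to impose the constraint

rng = np.random.default_rng(0)
alpha = rng.normal(7.0, 1.5, size=500)
knots = np.array([[6.0, 8.0]] * 3)                # two break points, three segments per array
coef = rng.normal(0.0, 0.1, size=(3, 3, 3))       # draws from a centred normal prior
beta = array_effects(alpha, knots, coef)
```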

Page 14: Model 4

[Figure: non-linear fit of array effect as a function of expression level $\alpha_g$.]

Page 15: Effect of normalisation on density (Model 5)

[Figure: densities before ($y_{gr}$) and after ($y_{gr} - \beta_r(\alpha_g)$) normalisation, for wildtype and knockout.]

Page 16: Hierarchical structure for gene variability in each condition (Model 6)

• 2nd level of the model: exchangeable hierarchical prior

$\sigma_g^2 \sim \text{lognormal}(\mu, \tau)$, $g = 1, \dots, N$

The hyperparameters $\mu$ and $\tau$ can be influential.

• 3rd level of the model

$\mu \sim N(c, d)$

$\tau \sim \text{lognormal}(e, f)$

where the constants $c$, $d$, $e$ and $f$ are chosen so that the third-level priors are vague.
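The two upper levels of this prior can be simulated as in the sketch below. The slides do not state whether the second arguments are variances or precisions, so this sketch assumes they are variances on the log scale; the constants are arbitrary illustrative values, not vague ones, and the function name is hypothetical.

```python
import numpy as np

def sample_gene_variances(n_genes, c=-2.0, d=1.0, e=-1.0, f=0.5, seed=0):
    """Draw (mu, tau, sigma2) from the hierarchical prior (parametrisation assumed)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(c, np.sqrt(d))                    # 3rd level: mu ~ N(c, d), d read as a variance
    tau = rng.lognormal(e, np.sqrt(f))                # 3rd level: tau ~ lognormal(e, f)
    sigma2 = rng.lognormal(mu, np.sqrt(tau), size=n_genes)   # 2nd level: exchangeable gene variances
    return mu, tau, sigma2

mu, tau, sigma2 = sample_gene_variances(5000)
```

All genes share the same $\mu$ and $\tau$, which is what allows information about variability to be pooled across genes.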

Page 17: Graphical representation of the 3-level model (Model 7)

[Figure: graph with nodes $\mu, \tau$; $\sigma_g^2$; $\alpha_g$; $a_r$, $b_r$; $\beta_r(\alpha_g)$; $y_{gr}$; plates over $r = 1{:}R$ and $g = 1{:}G$.]

Page 18: Smoothing of the gene specific variances (Model 8)

• Variances are estimated using information from all G x R measurements (~12000 x 3) rather than just 3
• Variances are stabilised and shrunk towards average variance

The implementation of the model uses WinBUGS.

Page 19: Bayesian Model Checking (Model 9)

• Check different possible assumptions on gene variances, e.g. equal variances, $\epsilon_{gr} \sim N(0, \sigma^2)$, or exchangeable variances?
• Predict sample variance $S_g^{2\,\mathrm{new}}$ (our checking function) from the model specification
• Compare predicted $S_g^{2\,\mathrm{new}}$ with observed $S_g^{2\,\mathrm{obs}}$: Bayesian p-value $\mathrm{Prob}(S_g^{2\,\mathrm{new}} > S_g^{2\,\mathrm{obs}})$
• Distribution of p-values is Uniform if model is ‘true’
• Easily implemented in MCMC algorithm
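The check can be sketched outside of an MCMC run as follows. The function takes draws of the gene variances as input and uses the fact that, under normal errors, a sample variance from R replicates is distributed as the true variance times a chi-square with R-1 degrees of freedom, divided by R-1. The variance draws used in the demonstration are stand-ins (the true simulated variances and a single pooled value), not posterior draws from the model, and all names are hypothetical.

```python
import numpy as np

def predictive_pvalues(y, sigma2_draws, seed=0):
    """Estimate Prob(S2_new > S2_obs) for each gene.

    y:            (G, R) data
    sigma2_draws: (T, G) draws of the gene variances under the model being checked
    """
    rng = np.random.default_rng(seed)
    n_reps = y.shape[1]
    s2_obs = y.var(axis=1, ddof=1)
    # Predicted sample variances: sigma2 * chi2(R-1) / (R-1) under normal errors.
    chi2 = rng.chisquare(n_reps - 1, size=sigma2_draws.shape)
    s2_new = sigma2_draws * chi2 / (n_reps - 1)
    return (s2_new > s2_obs[None, :]).mean(axis=0)

# Illustrative data with genuinely different gene variances.
rng = np.random.default_rng(2)
n_genes, n_reps, n_draws = 2000, 3, 200
sigma2 = rng.lognormal(-2.0, 1.0, size=n_genes)
y = rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(n_genes, n_reps))

exch_draws = np.tile(sigma2, (n_draws, 1))                           # stand-in: gene-specific variances
equal_draws = np.full((n_draws, n_genes), y.var(axis=1, ddof=1).mean())  # stand-in: one common variance

p_exch = predictive_pvalues(y, exch_draws)
p_equal = predictive_pvalues(y, equal_draws)

def extreme_fraction(p):
    """Share of p-values in the outer 10% of the unit interval."""
    return np.mean((p < 0.05) | (p > 0.95))
```

When the variance assumption matches the data, the p-values are roughly uniform; under the equal-variance stand-in, too many p-values fall near 0 or 1, which is the signature of a model with too little variability.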

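Page 20: Bayesian model checking (Model 10)

[Figure: graph of the 3-level model (nodes $\mu, \tau$; $\sigma_g^2$; $\alpha_g$; $a_r$, $b_r$; $\beta_r(\alpha_g)$; $y_{gr}$; plates over $r = 1{:}R$ and $g = 1{:}G$), extended with predictive nodes $\sigma_g^{2\,\mathrm{new}}$ and $y_{gr}^{\mathrm{new}}$.]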

Page 21: Bayesian predictive p-values

• Exchangeable variance model is supported by the data
• Equal variance model has too little variability for the data

Page 22: II. Analysing gene expression data

Gene expression data can be used in several types of analysis:
-- Comparison of gene expression under different experimental conditions
-- Classification of gene expression profiles
-- Exploration of patterns in gene expression matrices
-- Association of gene expression with factors, e.g. prognosis

[Figure: gene expression data matrix, genes by samples, with blocks Cond 1 ($y_{g1r}$) and Cond 2 ($y_{g2r}$).]

Page 23: Differential expression model (Differential 2)

The quantity of interest is the difference between conditions for each gene: $d_g$, $g = 1, \dots, N$

Joint model for the 2 conditions (1):

$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \epsilon_{g1r}$, $r = 1, \dots, R_1$

$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \epsilon_{g2r}$, $r = 1, \dots, R_2$

• $\alpha_g$ is now the overall gene effect over the conditions
• Same assumptions for the distribution of $\sigma_{gs}^2$ and the modelling of $\beta_{sr}(\alpha_g)$ as before, $s = 1, 2$
• All hyperparameters are indexed by the condition

Page 24: Possible statistics for differential expression (Differential 3)

• The differences $d_g$ (≈ log fold change) or the standardised differences

$d_g^* = d_g / (\sigma_{g1}^2 / R_1 + \sigma_{g2}^2 / R_2)^{1/2}$

• We obtain the joint distribution of all $\{d_g\}$ or $\{d_g^*\}$:
-- Process the output to have the distributions of the ranks $\{ r(d_g),\ g = 1, \dots, N \}$
-- Model the distribution of $d_g$ flexibly to allow a mixture of (small) subgroups of genes with ‘extreme’ $d_g$ (H1) and a (large) group of genes with $d_g$ around 0 (H0)
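Processing sampler output into rank distributions can be sketched as below. The input is assumed to be a matrix of posterior draws of the differences (one row per draw, one column per gene); here it is replaced by simulated stand-in draws. Rank 1 is assigned to the smallest difference, which is an assumption about the ranking direction, and the function name is hypothetical.

```python
import numpy as np

def rank_summaries(d_draws, k=100):
    """Summarise the ranks of d_g across draws.

    d_draws: (T, N) draws of d_g. Rank 1 is the smallest d_g within each draw.
    Returns 2.5%, 50% and 97.5% rank quantiles per gene, plus the probabilities
    of being ranked in the bottom k and in the top k.
    """
    ranks = d_draws.argsort(axis=1).argsort(axis=1) + 1   # rank of each gene within each draw
    lo, med, hi = np.percentile(ranks, [2.5, 50, 97.5], axis=0)
    n_genes = d_draws.shape[1]
    p_bottom = (ranks <= k).mean(axis=0)
    p_top = (ranks > n_genes - k).mean(axis=0)
    return lo, med, hi, p_bottom, p_top

# Stand-in draws: 500 genes, 300 draws scattered around fixed centres.
rng = np.random.default_rng(4)
centres = np.linspace(-2.0, 2.0, 500)
d_draws = centres[None, :] + rng.normal(0.0, 0.3, size=(300, 500))
lo, med, hi, p_bottom, p_top = rank_summaries(d_draws)
```

Because ranks are computed within each draw, the intervals reflect the joint uncertainty of all the differences rather than treating each gene separately.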

Page 25: Ranks of modelled log fold change $d_g$ (Differential 4)

[Figure: 2.5% - 97.5% rank intervals for each gene, for the 150 genes with lowest rank; 3 wildtype mice compared to 3 knockout mice.]

Even genes with median rank less than 100 can have large uncertainty.

Page 26: Posterior probabilities for each gene to be ranked in the bottom 100 or in the top 100 (Differential 5)

[Figure: two panels, "In the bottom 100" and "In the top 100", plotted against log fold change $d_g$.]

Page 27: Mixture modelling the distribution of $d_g$ (Mixture 1)

• Mixture model framework but with an ‘unknown’ number of components
• Fully Bayesian hierarchical framework (the number of components is a random variable)
• Based on Green (1995) and Richardson and Green (1997) papers
• MCMC algorithms
• Applied in an experimental context concerning bladder cancer (Broët, Richardson and Radvanyi; Journal of Computational Biology, 9, 671-683, 2002)

Page 28: Bayesian mixture model (Mixture 2)

Normal mixtures with an unknown number of states.

A gene can be in different states: down-regulated, …, unaffected, …, up-regulated

Bayesian estimation of posterior distribution:
• for number of states (besides the unaffected one)
• for the allocation of genes to the states

Genes are classified in the ‘extreme components’ or in the central one, based on their posterior probability, the central state corresponding to $d_g \approx 0$.

Page 29: Mixture model specification (Mixture 3)

$d_g \sim w_0 N(0, \lambda^2 \eta_0^2) + \sum_{j=1}^{k} w_j N(\mu_j, \eta_j^2)$

Prior setting:
• $\lambda^2 > 1$ expresses that the central component is expected to have a larger variance
• $\mu_j^+ > 0$, ordered, uniform on upper range
• $\mu_j^- < 0$, ordered, uniform on lower range
• $\{\eta_0^2, \eta_j^2\}$ exchangeable, Gamma distributed
• Weights $\{w_0, w_j,\ j = 1{:}k\}$ ~ Dirichlet, uniform or other alternatives
• $k$, unknown number of components
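For a fixed number of components, the mixture density can be evaluated as in this sketch. It treats the second argument of each normal as a variance, draws the weights from a uniform Dirichlet, and uses arbitrary illustrative values for the means and variances; the number of components is held fixed here, whereas in the model above it is itself unknown. Function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_pdf(d, w0, lam2, eta0_sq, w, mu, eta_sq):
    """Density of w0 N(0, lam2 * eta0_sq) + sum_j w[j] N(mu[j], eta_sq[j])."""
    d = np.asarray(d, dtype=float)
    central = w0 * normal_pdf(d, 0.0, np.sqrt(lam2 * eta0_sq))
    extremes = sum(wj * normal_pdf(d, mj, np.sqrt(ej))
                   for wj, mj, ej in zip(w, mu, eta_sq))
    return central + extremes

rng = np.random.default_rng(3)
k = 4                                              # fixed here for illustration only
weights = rng.dirichlet(np.ones(k + 1))            # uniform Dirichlet weights
grid = np.linspace(-5.0, 5.0, 20001)
dens = mixture_pdf(grid, weights[0], 2.0, 0.01, weights[1:],
                   mu=[-0.8, -0.5, 0.5, 0.8], eta_sq=[0.01] * k)
total = dens.sum() * (grid[1] - grid[0])           # numerical check that the density integrates to about 1
```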

Page 30: Biological Context (Mixture 4)

• Data from the Curie Institute/Research Section (F. Radvanyi team, CNRS/Curie)
• The cell line: T24 bladder cell lines
• The transfection: FGFR2 cDNA (fibroblast growth factor receptor 2)
• Comparison of 2 cell lines: unmodified and modified (transfected by FGFR2 cDNA)
• Material: nylon microarray, 4608 genes, from the same batch, 33P labelling, 4 replicates, same experimenter, same day

Aim: to study transcriptional changes induced by a defined DNA transfection in a cell line

Page 31: Comparison of gene expression in two bladder cancer cell lines (unmodified versus transfected)

[Figure: distribution of $d_g$ and QQ plot.]

$d_g$: differential expression for gene $g$ after taking into account array and cell line main effects. Negative values of $d_g$ correspond to up-regulated genes.

Page 32: Posterior distribution of the number of mixture components (Mixture 5)

Components left of central one:
P(0) = 0.00, P(1) = 0.11, P(2) = 0.73, P(3) = 0.14, P(4) = 0.01, P(5) = 0.00

Components right of central one:
P(0) = 0.00, P(1) = 0.46, P(2) = 0.49, P(3) = 0.04, P(4) = 0.052, P(5) = 0.00

Support for a model with 5 components, 2 on the left and 2 on the right of the central one.

Page 33

[Figure: estimate of the posterior probability for each gene to be in each of the 5 components.]

Page 34: Classification

• Classification based on maximum a posteriori rule:

Group     1       2       3      4      5
Nb        9       26      4540   29     4
Mean     -0.82   -0.51    0      0.51   0.86
SD        0.16    0.08    0.14   0.10   0.17
Weight    0.2%    0.8%    98%    0.8%   0.2%

This analysis highlights 9 up-regulated genes (expressed after transfection of the receptor) and 4 down-regulated ones, with an indication of 2 further subgroups showing some evidence of differential expression.
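The maximum a posteriori rule can be illustrated with the component means, standard deviations and weights from the table above. This is a plug-in sketch that treats those summary values as fixed parameters; the classification reported in the slides is based on posterior probabilities from the full MCMC output, so results need not agree exactly. The function name and the three example values of the difference are hypothetical.

```python
import numpy as np

# Component summaries taken from the classification table (groups 1 to 5).
means = np.array([-0.82, -0.51, 0.0, 0.51, 0.86])
sds = np.array([0.16, 0.08, 0.14, 0.10, 0.17])
weights = np.array([0.002, 0.008, 0.98, 0.008, 0.002])

def classify(d):
    """Return membership probabilities and the most probable group (1 to 5) for each d."""
    d = np.atleast_1d(np.asarray(d, dtype=float))[:, None]
    dens = weights * np.exp(-0.5 * ((d - means) / sds) ** 2) / (sds * np.sqrt(2.0 * np.pi))
    post = dens / dens.sum(axis=1, keepdims=True)   # normalise over the 5 components
    return post, post.argmax(axis=1) + 1

post, group = classify([-0.9, 0.0, 0.55])
# group is [1, 3, 4]: a strongly negative value goes to group 1, zero to the central group
```

Because the central component carries 98% of the weight, a value has to lie well away from zero before an extreme component becomes the most probable one.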

Page 35: Summary

• Model different sources of variability in a single model
• Borrow information from all genes to estimate gene-specific variances and the non-linear array effect
• Exploit the joint distribution of the differential expression measure through ranks or mixture models
• Further work is under way:
-- on modelling of the low-level Affy probe data
-- on more general clustering algorithms

