DESeq Paper Journal club

Background - DESeq

• Modelling the number of reads sequenced from a gene X

– Can use a Binomial B(n, p), where n = total number of reads and p = probability that a read comes from gene X

– Can approximate with a Poisson(np) as n large, p small

– Poisson model works ok for a gene’s variation between technical replicates

– However, Poisson underestimates variation between biological replicates

– edgeR and DESeq use a negative binomial instead (for gene i in sample j)

Equation (1): Kij ~ NB(mu_ij, sigma2_ij)

– Negative binomial has two parameters, mean mu and variance sigma2

– Number of replicates is usually too small to estimate both for a gene X
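The Binomial-to-Poisson step above can be checked numerically. The sketch below compares the two distributions for a library size and per-gene probability that are made-up illustration values, not figures from the paper.

```python
import math

def binom_pmf(k, n, p):
    """Binomial pmf, computed in log space so large n stays stable."""
    log_c = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_c + k * math.log(p) + (n - k) * math.log1p(-p))

def poisson_pmf(k, lam):
    """Poisson pmf with mean lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Hypothetical numbers: 10 million reads, gene X attracts 5 reads per million.
n, p = 10_000_000, 5e-6
lam = n * p  # expected count for gene X

# Largest pointwise gap between the two pmfs over a wide range of counts.
max_diff = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(201))
print(f"expected count: {lam:.0f}, largest pmf difference: {max_diff:.2e}")
```

With n large and p small the two pmfs are practically indistinguishable, which is why Poisson(np) is a reasonable model for technical replicates. Note that a Poisson has variance equal to its mean, so it has no room for extra biological variation.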

• edgeR

– Assumes sigma2 = mu + alpha*mu^2, where alpha is the same for all genes

– Just needs to estimate mu for a gene, then calculate sigma2 from that

• DESeq

– For each condition, fits a local regression of sigma2 versus mu

– Given mu for gene X, uses the local regression to estimate sigma2
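A minimal sketch of the edgeR-style variance assumption, and of how a negative binomial given by its mean and variance maps onto the usual (size, probability) form. The dispersion value alpha = 0.1 and the function names are assumptions chosen for illustration.

```python
import math

def edger_variance(mu, alpha):
    """Common-dispersion assumption: sigma2 = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2

def nb_pmf(k, mu, sigma2):
    """Negative binomial pmf parameterised by mean and variance (needs sigma2 > mu)."""
    p = mu / sigma2                # success probability
    r = mu ** 2 / (sigma2 - mu)    # size parameter; equals 1/alpha under the edgeR form
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

alpha = 0.1  # hypothetical dispersion, for illustration only
for mu in (10.0, 100.0, 1000.0):
    sigma2 = edger_variance(mu, alpha)
    # Last column: how many times larger the variance is than the Poisson variance (= mu).
    print(f"mu={mu:7.1f}  sigma2={sigma2:10.1f}  sigma2/mu={sigma2 / mu:6.1f}")
```

The ratio sigma2/mu grows with mu, so the gap between a Poisson model and the negative binomial is largest for highly expressed genes.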

Results & Discussion

• DESeq’s model - makes three assumptions

– Equation (2): mu_ij = qi,rho(j) * sj

mu_ij = expected count (no. reads) for gene i in sample j

qi,rho(j) = value proportional to the concentration of fragments from gene i in condition rho(j), the condition of sample j

sj = coverage (sampling depth) of library j

– Equation (3): sigma2_ij = mu_ij + sj^2 * vi,rho(j)

sigma2_ij = variance of no. reads for gene i in sample j

mu_ij = variance due to Poisson model (technical variation) = “shot noise”

sj^2 * vi,rho(j) = variance due to biological variation(?) = “raw variance”

– Equation (4): vi,rho(j) = vrho ( qi,rho(j) )

ie. vi,rho(j) is a function of qi,rho(j)

So we can make a regression of vi,rho(j) against qi,rho(j) for lots of genes (i)

Then estimate vi,rho(j) for gene X, based on qi,rho(j) and the regression line
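Equations (2) to (4) can be written as two small functions. In this sketch the raw-variance function is a made-up smooth curve standing in for the fitted regression; it is not the function estimated in the paper.

```python
def expected_count(q, s):
    """Equation (2): mu = q * s."""
    return q * s

def count_variance(q, s, raw_variance):
    """Equation (3), with Equation (4) supplying v as a function of q."""
    return expected_count(q, s) + s ** 2 * raw_variance(q)

def toy_raw_variance(q):
    # Hypothetical stand-in for the fitted function v_rho; not taken from the paper.
    return 0.05 * q ** 2

for q in (1.0, 10.0, 100.0):
    mu = expected_count(q, s=2.0)
    sigma2 = count_variance(q, s=2.0, raw_variance=toy_raw_variance)
    print(f"q={q:6.1f}  mu={mu:6.1f}  sigma2={sigma2:8.1f}  shot-noise share={mu / sigma2:.2f}")
```

With this toy curve, shot noise dominates the variance at low expression and the raw variance term dominates at high expression.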

• DESeq’s model – estimating parameters

– sj : coverage (sampling depth) of library j

The total number of reads in library j is not a good measure of depth.

Instead, take the median (over all genes) of the ratios of the observed count to the geometric mean of that gene’s counts across the m samples:

Equation (5): sj = median_over_i ( kij / [ Product_over_v kiv ]^(1/m) )

– qi,rho(j) = “expression strength” parameter for gene i in condition rho

Proportional to concentration of fragments from gene i in sample j.

Use the average of the depth-normalised counts from the samples j of condition rho:

Equation (6): qi,rho = 1/m_rho * Sum_over_j (kij / sj)
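A minimal sketch of the size-factor and expression-strength estimates (Equations 5 and 6), using the median of ratios to the per-gene geometric mean. The toy count matrix is made up, and skipping genes that contain a zero count is an assumption of this sketch.

```python
import math
import statistics

def size_factors(counts):
    """counts[i][j] = reads for gene i in sample j; returns one size factor per sample."""
    m = len(counts[0])
    # Genes with any zero count have geometric mean 0, so this sketch skips them.
    usable = [row for row in counts if all(k > 0 for k in row)]
    log_geo_means = [sum(math.log(k) for k in row) / m for row in usable]
    return [
        statistics.median(row[j] / math.exp(lg) for row, lg in zip(usable, log_geo_means))
        for j in range(m)
    ]

def expression_strength(counts, s, samples):
    """Equation (6): mean of depth-normalised counts over the samples of one condition."""
    return [sum(row[j] / s[j] for j in samples) / len(samples) for row in counts]

# Toy data: the second sample is sequenced exactly twice as deeply as the first.
counts = [[10, 20], [100, 200], [5, 10], [40, 80]]
s = size_factors(counts)
q = expression_strength(counts, s, samples=[0, 1])
print("size factors:", [round(x, 3) for x in s])
print("expression strengths:", [round(x, 3) for x in q])
```

Because the median is taken over genes, a handful of very highly expressed genes cannot dominate the depth estimate the way they can dominate the total read count.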

– vrho = function describing how vi,rho(j) depends on qi,rho(j)

Estimate the sample variance for each gene i, wi(rho) (Equation 7)

Fit a local regression line to wi(rho) versus qi(rho)

For a particular qi(rho) value, predict w=wi(rho) from the regression line

Also calculate zi(rho) for gene i (Equation 8)

Then use v = w – zi(rho) as an unbiased estimate of the variance vi,rho for gene i (Equation 9)
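The per-gene quantities used above can be sketched as follows. The slides do not spell out Equations (7) and (8); the formulas in the code are filled in from the published DESeq paper as best recalled and should be checked against it. The local regression step is deliberately left out, so the per-gene w is used directly here, which is a simplification.

```python
def raw_variance_terms(row, s, samples):
    """Return (q, w, z) for one gene in one condition.

    row[j] = count in sample j, s[j] = size factor of sample j,
    samples = indices of the samples belonging to the condition.
    """
    m = len(samples)
    norm = [row[j] / s[j] for j in samples]
    q = sum(norm) / m                               # Equation (6): mean normalised count
    w = sum((x - q) ** 2 for x in norm) / (m - 1)   # Equation (7), as recalled: sample variance
    z = q / m * sum(1.0 / s[j] for j in samples)    # Equation (8), as recalled: shot-noise part
    return q, w, z

# Hypothetical gene with two replicates and equal size factors.
q, w, z = raw_variance_terms([90, 110], s=[1.0, 1.0], samples=[0, 1])
v = w - z  # in DESeq, w would first be replaced by the regression prediction at q
print(f"q={q:.1f}  w={w:.1f}  z={z:.1f}  raw variance estimate v={v:.1f}")
```

With only two replicates the per-gene w is very noisy, which is the motivation for pooling information across genes through the regression.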

• DESeq’s model – testing for differential expression

– Null hypothesis: qiA = qiB

qiA = expression strength parameter for gene i in the samples of condition A

mA = number of samples for condition A

– Test statistic: total counts in each condition

Equation (10): KiA = counts in condition A = Sum_over_A ( Kij)

– P-value for test of null hypothesis

Under the null hypothesis, can compute prob(KiA = a, KiB = b) = p(a,b)

Equation (11): P-value for observed counts (kiA, kiB) =

[ Sum of probabilities p(a,b) where p(a,b) ≤ p(kiA,kiB), a+b = kiA+kiB ] / [ Sum of probabilities p(a,b) where a+b = kiA+kiB ]

– Computing p(a,b) values

p(a,b) = Prob(KiA = a) * Prob(KiB = b), assuming samples are independent

KiA is the sum of mA NB-distributed variables

We approximate its distribution by a NB(mu, sigma2) distribution

whose parameters mu and sigma2 are estimated using Equations 12, 13, 14
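A minimal sketch of the P-value in Equation (11). It assumes the negative binomial mean and variance for each condition are already known (the slides obtain them from Equations 12 to 14, which are not reproduced here), and the parameter values in the example are made up.

```python
import math

def nb_pmf(k, mu, sigma2):
    """Negative binomial pmf parameterised by mean and variance (needs sigma2 > mu)."""
    p = mu / sigma2
    r = mu ** 2 / (sigma2 - mu)
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log1p(-p))

def conditioned_pvalue(k_a, k_b, mu_a, sigma2_a, mu_b, sigma2_b):
    """Equation (11): sum p(a,b) over splits of the observed total that are
    no more probable than the observed split, divided by the sum over all splits."""
    total = k_a + k_b
    p_obs = nb_pmf(k_a, mu_a, sigma2_a) * nb_pmf(k_b, mu_b, sigma2_b)
    numerator = 0.0
    denominator = 0.0
    for a in range(total + 1):
        p_ab = nb_pmf(a, mu_a, sigma2_a) * nb_pmf(total - a, mu_b, sigma2_b)
        denominator += p_ab
        if p_ab <= p_obs * (1 + 1e-9):  # small tolerance guards against rounding on ties
            numerator += p_ab
    return numerator / denominator

# Hypothetical null parameters, identical in both conditions.
print(conditioned_pvalue(5, 5, 5.0, 10.0, 5.0, 10.0))   # balanced split
print(conditioned_pvalue(10, 0, 5.0, 10.0, 5.0, 10.0))  # all reads in condition A
```

Conditioning on the total a + b mirrors Fisher's exact test: only the way the total is split between the two conditions is tested.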

Applications

• Variance estimation

– Use RNA-seq data from fly embryos: ‘A’ and ‘B’ samples, 2 replicates each

Figure 1: estimated variances wi(rho) plotted against qi(rho) for fly sample A

Distance between orange and purple lines is noise due to biological sampling

Figure 1 line labels: regression, edgeR, “shot noise” (technical variation)

• Testing for differential expression

– Compared the 2 replicates for fly sample A

Figure 2: the empirical cumulative distribution functions of the P-values

The ECDF curve (blue line) should stay below the diagonal (gray line) if type I error is controlled

Type I error is controlled by edgeR & DESeq, but not by a Poisson-based test

edgeR has an excess of small P-values for low counts, but is more conservative for high counts

Figure 2 panels: DESeq, edgeR, Poisson; count levels: Low, High, All
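The ECDF check described for the replicate-versus-replicate comparison can be sketched as below. The two sets of P-values are synthetic (a uniform grid and a deliberately skewed version of it), not results from the paper, and the function names are illustrative.

```python
def ecdf(p_values, alpha):
    """Fraction of P-values at or below alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

def controls_type1_error(p_values, alphas=(0.01, 0.05, 0.1)):
    """True if the ECDF stays on or below the diagonal at each tested alpha."""
    return all(ecdf(p_values, a) <= a for a in alphas)

# Synthetic null P-values: an evenly spaced grid behaves like a well-calibrated test.
calibrated = [(i + 0.5) / 1000 for i in range(1000)]
# Squaring pushes P-values towards zero, mimicking an anti-conservative test.
anti_conservative = [p ** 2 for p in calibrated]

print("calibrated:", ecdf(calibrated, 0.05), controls_type1_error(calibrated))
print("anti-conservative:", ecdf(anti_conservative, 0.05), controls_type1_error(anti_conservative))
```

When two replicates of the same condition are compared, every gene is a true null, so any excess of small P-values above the diagonal is a false-positive excess.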

• Testing for differential expression

– Compared fly A & B samples

Figure 3: obtained fold changes and P-values

The ability to detect differential expression depends on overall counts

The strong shot noise (technical variation) for low counts causes the testing procedure to call only very high fold changes as significant

Red: significant p-value

• Comparison with edgeR

– Ran edgeR with 4 settings:

(i) “Common-dispersion” or “tagwise-dispersion” modes for estimating variance

(ii) Size factors estimated by DESeq, or total number of reads

Results were very similar for the 4 settings

edgeR’s single-value dispersion gives a variance estimate that is lower than DESeq’s for weakly expressed genes and higher for strongly expressed genes (Figure 1)


As a result, edgeR is anti-conservative for weakly expressed genes, but more conservative for strongly expressed genes

This biases the list of discoveries by edgeR

Figure 4 shows that weakly expressed genes seem to be over-represented in edgeR’s list

Few genes with high average level are called differentially expressed by edgeR

DESeq produced results which were more balanced over the dynamic range

Figure 4 panels: all fly data, DESeq hits, edgeR hits

• Working without replicates

– DESeq can work if there are no replicates in one or both conditions

If there are just replicates from one condition, fit regression line using that one

If there are no replicates, treat the samples as replicates to fit the regression

For neural cell data, variability between replicates ≈ variability between conditions

However, for fly data, variability between replicates << variability between conditions

• Variance-stabilising transformation (VST)

– Given a variance-mean regression, a VST transforms the values so the variance is independent of the mean (Equation 15)

This yields (transformed) count values whose variances are approximately the same throughout the dynamic range

This is useful for sample clustering, since clustering assumes all genes have roughly the same variance

Figure 5 shows clustering for neural cell samples, using VST-transformed data
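A minimal numerical sketch of a variance-stabilising transformation. It assumes the transformation has the form tau(kappa) = integral of dq / sqrt(w(q)), which is how Equation (15) is recalled from the paper, and it uses a made-up variance-mean curve w(q) = q + alpha*q^2 in place of the fitted regression. The lower integration limit and the closed-form check are choices of this sketch.

```python
import math

def vst(kappa, variance_of_mean, lower=1.0, steps=10_000):
    """Midpoint-rule integral of 1/sqrt(w(q)) for q from `lower` to `kappa`."""
    h = (kappa - lower) / steps
    return sum(h / math.sqrt(variance_of_mean(lower + (i + 0.5) * h)) for i in range(steps))

alpha = 0.1  # hypothetical dispersion, for illustration only

def w(q):
    """Toy variance-mean relationship standing in for the fitted regression."""
    return q + alpha * q ** 2

def closed_form(kappa):
    """Antiderivative of 1/sqrt(q + alpha*q^2), used only to check the numerics."""
    return 2 / math.sqrt(alpha) * math.asinh(math.sqrt(alpha * kappa))

for kappa in (10.0, 100.0, 1000.0):
    print(f"count={kappa:7.1f}  transformed={vst(kappa, w):.3f}")
```

The transformed values grow roughly like a square root for small counts and like a logarithm for large counts, which matches the idea of making the variance roughly constant across the dynamic range.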

• ChIP-Seq data

– Compared HapMap IDs GM12878 and GM12891

DESeq does not give false positives when comparing replicates for 1 individual

Using a Poisson-based model, you would get many false positives

Figure panels: DESeq and Poisson, each for the same individual and for different individuals

Summary

• A Poisson model underestimates the variance between biological samples; this leads to false positives in differential expression analyses

• A Negative Binomial distribution is much better

• This is especially true for highly expressed genes

• DESeq and edgeR use the Negative Binomial

• However, DESeq estimates the sequencing depth differently

• Also DESeq estimates the variance for a gene by assuming it has similar variance to genes of similar expression level

• DESeq and edgeR have similar sensitivity, but edgeR calls a greater number of weakly expressed genes as significant, and fewer highly expressed genes as significant

Date post:	22-May-2015
Category:	Education
Upload:	avrilcoghlan