Estimation of alternative splicing isoform frequencies
from RNA-Seq data
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky
Outline
• Introduction
• EM Algorithm
• Experimental results
• Conclusions and future work
Alternative Splicing
[Griffith and Marra 07]
RNA-Seq
[Figure: gene with exons A B C D E]
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Analyses: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE)
[Figure: exon combinations A B C, A C, D E]
Gene Expression Challenges
• Read ambiguity (multireads)
• What is the gene length?
[Figure: gene with exons A B C D E]
Previous approaches to GE
• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
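The gene-length convention above can be illustrated with a minimal Python sketch. It assumes half-open exon coordinates; the exon coordinates, the isoform names, and the helper `union_length` are hypothetical and chosen only for illustration.

```python
def union_length(intervals):
    """Number of bases covered by at least one half-open interval [start, end)."""
    total = 0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical gene with exons A, B, C and two isoforms (A B C and A C).
exons = {"A": (0, 100), "B": (200, 350), "C": (500, 600)}
isoforms = {"ABC": ["A", "B", "C"], "AC": ["A", "C"]}

# Gene length: exons that appear in at least one isoform, counted once.
gene_length = union_length(
    [exons[e] for parts in isoforms.values() for e in parts]
)

# Length of each individual isoform, for comparison.
isoform_lengths = {
    name: sum(exons[e][1] - exons[e][0] for e in parts)
    for name, parts in isoforms.items()
}
```

In this toy gene the union length is 350 bases while isoform A C is only 200 bases long, so reads that come only from A C would be normalized by a length that is too large. This is one way to read the underestimation noted by [Trapnell et al. 10].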
Read Ambiguity in IE
[Figure: exon structures A B C D E and A C]
Previous approaches to IE
• [Jiang & Wong 09]
– Poisson model + importance sampling, single reads
• [Richard et al. 10]
– EM Algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
– EM Algorithm, single reads
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution
Our contribution
• EM Algorithm for IE
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
Read-Isoform Compatibility
$w_{r,i} = \sum_{a} O_a \, Q_a \, F_a$
(a ranges over the alignments of read r to isoform i)
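As a rough illustration, the compatibility weight can be read as a sum, over all alignments of a read to an isoform, of the product of three per-alignment factors. The sketch below assumes that O is a strand factor, Q a base quality factor, and F a fragment length factor, matching the contribution list above. The Phred-based formula for Q (error probability 10^(-q/10), split evenly over the three other bases on a mismatch) is an assumption for illustration, not a formula taken from the slides, and all names are hypothetical.

```python
from dataclasses import dataclass
from math import prod


def phred_to_error(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def quality_factor(read, reference, quals):
    """Assumed form of Q: probability of the observed bases given the reference.

    A matching base contributes (1 - e); a mismatching base contributes e / 3.
    """
    factors = []
    for base, ref_base, q in zip(read, reference, quals):
        e = phred_to_error(q)
        factors.append(1 - e if base == ref_base else e / 3)
    return prod(factors)


@dataclass
class Alignment:
    O: float  # strand (orientation) factor, assumed to lie in [0, 1]
    Q: float  # base quality factor
    F: float  # fragment length factor


def compatibility(alignments):
    """Sum of O * Q * F over all alignments of one read to one isoform."""
    return sum(a.O * a.Q * a.F for a in alignments)


# Hypothetical read with two alignments to the same isoform.
w = compatibility([Alignment(O=1.0, Q=0.5, F=0.2), Alignment(O=0.5, Q=1.0, F=0.1)])
```

A read with no alignment to an isoform gets weight 0 under this reading, since the sum is empty.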
Fragment length distribution
• Paired reads
• Single reads
[Figure: fragment lengths implied by paired and single read alignments to isoforms A B C and A C, with fragment length distribution plots]
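A minimal sketch of how a fragment length distribution could separate isoforms is given below. It assumes a Normal fragment length model with the mean and standard deviation used in the simulation setup (250 and 25). It also assumes that a read pair fixes the fragment length on each isoform, while a single read only bounds it by the room left on the isoform. The function names and exon lengths are hypothetical.

```python
from statistics import NormalDist

# Assumed fragment length model (mean 250, std. dev. 25).
FRAGMENT_LENGTHS = NormalDist(mu=250, sigma=25)


def paired_factor(fragment_length):
    """Density of the fragment length implied by a read pair on one isoform."""
    return FRAGMENT_LENGTHS.pdf(fragment_length)


def single_factor(max_fragment_length):
    """Probability that a fragment is no longer than the room left on the isoform."""
    return FRAGMENT_LENGTHS.cdf(max_fragment_length)


# Hypothetical exon lengths for a gene with isoforms A B C and A C.
exon_length = {"A": 200, "B": 150, "C": 200}

# A pair whose first read starts 100 bases before the end of A and whose
# second read ends 150 bases into C spans 250 bases on A C.
# On A B C the same pair also spans all of exon B.
span_on_AC = 100 + 150
span_on_ABC = 100 + exon_length["B"] + 150

f_AC = paired_factor(span_on_AC)
f_ABC = paired_factor(span_on_ABC)
```

With these numbers the implied fragment length is 250 on A C and 400 on A B C, so the pair is far more plausible as a fragment of A C even though both reads map to exons shared by the two isoforms.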
IsoEM algorithm
E-step
M-step
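The E-step and M-step formulas did not survive on this slide, so the following is only a generic sketch of an EM iteration of this kind, not the IsoEM implementation. It assumes that the E-step splits each read among isoforms in proportion to current frequency times compatibility weight, and that the M-step normalizes expected read counts by plain isoform length. The function name `em_isoform_frequencies` and the toy data are hypothetical.

```python
def em_isoform_frequencies(read_weights, lengths, iterations=200):
    """Estimate isoform frequencies from read-isoform compatibility weights.

    read_weights: one dict per read, mapping isoform name to its weight.
    lengths: dict mapping isoform name to isoform length.
    """
    isoforms = list(lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iterations):
        # E-step: expected number of reads from each isoform.
        expected = {i: 0.0 for i in isoforms}
        for weights in read_weights:
            total = sum(freq[i] * w for i, w in weights.items())
            if total == 0:
                continue  # read is incompatible with every isoform
            for i, w in weights.items():
                expected[i] += freq[i] * w / total
        # M-step: length-normalized counts, rescaled to sum to 1.
        per_base = {i: expected[i] / lengths[i] for i in isoforms}
        norm = sum(per_base.values())
        if norm == 0:
            break  # no usable reads; keep the current estimate
        freq = {i: per_base[i] / norm for i in isoforms}
    return freq


# Toy data: 30 reads unique to X, 10 unique to Y, 40 compatible with both.
reads = [{"X": 1.0}] * 30 + [{"Y": 1.0}] * 10 + [{"X": 1.0, "Y": 1.0}] * 40
estimate = em_isoform_frequencies(reads, {"X": 1000, "Y": 1000})
```

On this toy input the ambiguous reads are gradually reassigned in the 3:1 ratio suggested by the unique reads, and the estimate approaches 0.75 for X and 0.25 for Y.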
Simulation setup
• Human genome UCSC known isoforms
• GNFAtlas2 gene expression levels
– Uniform/geometric expression of gene isoforms
• Normally distributed fragment lengths
– Mean 250, std. dev. 25
[Figure: number of genes (log scale) vs. number of isoforms per gene]
[Figure: number of isoforms vs. isoform length (log scale)]
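The two sampling choices in the simulation setup can be sketched as below. This is a hedged illustration only: the geometric ratio of 0.5 between consecutive isoforms, the rounding of fragment lengths to whole bases, and the function names are assumptions, not details given on the slide.

```python
import random


def isoform_fractions(k, model="geometric", ratio=0.5):
    """Relative expression of the k isoforms of one gene.

    "uniform" gives every isoform the same share; "geometric" makes each
    isoform `ratio` times as abundant as the previous one (assumed ratio).
    """
    if model == "uniform":
        raw = [1.0] * k
    elif model == "geometric":
        raw = [ratio ** j for j in range(k)]
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(raw)
    return [x / total for x in raw]


def sample_fragment_length(rng, mean=250.0, std=25.0):
    """Draw one fragment length from a Normal distribution, as a whole number of bases."""
    return max(1, round(rng.gauss(mean, std)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fragment_lengths = [sample_fragment_length(rng) for _ in range(10000)]
mean_length = sum(fragment_lengths) / len(fragment_lengths)
```

For a gene with three isoforms, the geometric model with ratio 0.5 gives shares of 4/7, 2/7, and 1/7.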
Accuracy measures
• Error Fraction (EFt)
– Percentage of isoforms (or genes) with relative error larger than given threshold t
• Median Percent Error (MPE)
– Threshold t for which EF is 50%
• r²
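The three accuracy measures can be sketched in a few lines of Python. The sketch assumes that all true frequencies are positive, that relative error is |estimate - truth| / truth, that MPE is reported as a percentage, and that r² is the squared Pearson correlation between estimated and true values. The function names are hypothetical.

```python
from math import sqrt
from statistics import median


def relative_errors(estimated, true):
    """Relative error of each estimate; assumes every true value is positive."""
    return [abs(e - t) / t for e, t in zip(estimated, true)]


def error_fraction(estimated, true, t):
    """EF_t: percentage of items whose relative error exceeds threshold t."""
    errors = relative_errors(estimated, true)
    return 100.0 * sum(err > t for err in errors) / len(errors)


def median_percent_error(estimated, true):
    """MPE: the threshold at which EF is 50%, i.e. the median relative error, in percent."""
    return 100.0 * median(relative_errors(estimated, true))


def r_squared(estimated, true):
    """Squared Pearson correlation between estimates and true values."""
    n = len(true)
    mean_e = sum(estimated) / n
    mean_t = sum(true) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, true))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return (cov / sqrt(var_e * var_t)) ** 2


# Hypothetical true and estimated frequencies.
true_values = [10, 20, 30, 40]
estimates = [11, 20, 24, 60]
ef_15 = error_fraction(estimates, true_values, 0.15)
mpe = median_percent_error(estimates, true_values)
```

In this example the relative errors are 0.1, 0, 0.2, and 0.5, so half of the items exceed a threshold of 0.15 and the MPE is 15%.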
Error Fraction Curves - Isoforms
• 30M single reads of length 25
[Figure: % of isoforms over threshold vs. relative error threshold (0 to 1) for Uniq, Rescue, UniqLN, Cufflinks, RSEM, IsoEM]
Error Fraction Curves - Genes
• 30M single reads of length 25
[Figure: % of genes over threshold vs. relative error threshold (0 to 1) for Uniq, Rescue, GeneEM, Cufflinks, RSEM, IsoEM]
MPE and EF15 by Gene Frequency
• 30M single reads of length 25
Read Length Effect
• Fixed sequencing throughput (750Mb)
[Figure: Median Percent Error vs. read length (25 to 95), paired vs. single reads]
[Figure: r² vs. read length (25 to 95), paired vs. single reads]
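The fixed-throughput comparison implies a simple trade-off between read length and read count, sketched below. The sketch assumes that 750Mb means 750 million sequenced bases and that both mates of a pair count toward the throughput; the function names are hypothetical.

```python
THROUGHPUT_BASES = 750_000_000  # assumed meaning of "750Mb"


def num_single_reads(read_length, throughput=THROUGHPUT_BASES):
    """Number of single reads of the given length that fit in the throughput."""
    return throughput // read_length


def num_read_pairs(read_length, throughput=THROUGHPUT_BASES):
    """Number of read pairs, assuming both mates count toward the throughput."""
    return throughput // (2 * read_length)


single_25 = num_single_reads(25)
single_75 = num_single_reads(75)
pairs_25 = num_read_pairs(25)
```

Under these assumptions, 25bp reads give 30 million single reads, which matches the "30M single reads of length 25" setting used on the error fraction slides, while 75bp reads give only 10 million.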
Effect of Pairs & Strand Information
• 1-60M 75bp reads
[Figure: r² vs. number of reads (0 to 60,000,000) for RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, CodingStrand-Single]
Validation on Human RNA-Seq Data
• ≈8 million 27bp reads from two cell lines [Sultan et al. 10]
• 47 AEEs measured by qPCR [Richard et al. 10]
[Figure: estimated AE fraction vs. qPCR AE fraction, one panel per method]
• POEM: R² = 0.543
• Cufflinks: R² = 0.472
• IsoEM: R² = 0.611
Validation on Drosophila RNA-Seq Data
• [McManus et al. 10]
• 26M, 42M, 31M, and 78M paired-end reads (37bp)
Allele Specific Expression in Parental Pool
[Figure: D.Mel. in Parental Pool vs. D.Mel. (log scale), R² = 0.892]
[Figure: D.Sec. in Parental Pool vs. D.Sec. (log scale), R² = 0.933]
Comparison to Pyrosequencing
[Figure: Log2(M/S) IsoEM vs. Log2(M/S) pyroseq; series: Hybrid (with linear fit), Parental Pool; reported fits: R² = 0.827, R² = 0.897]
Runtime scalability
[Figure: CPU seconds vs. million fragments for RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, CodingStrand-Single]
• Scalability experiments conducted on a Dell PowerEdge R900
– Four 6-core E7450 Xeon processors at 2.4GHz, 128GB of internal memory
Conclusions & Future Work
• Presented EM algorithm for estimating isoform/gene expression levels
– Integrates fragment length distribution, base qualities, pair and strand info
– Java implementation available at http://dna.engr.uconn.edu/software/IsoEM/
• Ongoing work
– Correction for library preparation and sequencing biases, e.g., random hexamer priming bias [Hansen et al. 10]
– Comparison of RNA-Seq with DGE
– Isoform discovery
– Reconstruction & frequency estimation for virus quasispecies
Acknowledgments
• NSF awards 0546457 & 0916948 to IM and 0916401 to AZ