Alex Zelikovsky
Department of Computer Science
Georgia State University
Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu
Viral Quasispecies Reconstruction Based on Unassembled Frequency Estimation
Outline
• Introduction
• ML Model
• EM Algorithm
• VSEM Algorithm
• Experimental Results
• Conclusions and future work
ISBRA 2011, Central South University, Changsha, China
454 Pyrosequencing
• Emulsion PCR
• Single nucleotide addition
— Natural nucleotides
— DNA polymerase pauses until the complementary nucleotide is dispensed
— Nucleotide incorporation triggers an enzymatic reaction that results in emission of light
ML Model
• Panel: bipartite graph
— RIGHT: strings
> unknown frequencies
— LEFT: reads
> observed frequencies
— EDGES: probability of the read to be emitted by the string
> weights are calculated based on the mapping of the reads to the strings
[Figure: bipartite panel connecting strings S1, S2, S3 to reads R1, R2, R3, R4]
ML estimates of string frequencies
• The probability that a read is sampled from string j is proportional to its frequency f(j)
• The ML estimate of f(j) is n(j)/(n(1) + ... + n(N))
— n(j): number of reads sampled from string j
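As a minimal sketch of the estimate above, assuming the string of origin is known for every read: the function name and the counts are illustrative and do not come from the talk.

```python
def ml_frequencies(read_counts):
    """Return f(j) = n(j) / (n(1) + ... + n(N)) for every string j."""
    total = sum(read_counts)
    if total == 0:
        raise ValueError("at least one read is required")
    return [n / total for n in read_counts]


# Hypothetical counts: 1, 2 and 1 reads sampled from strings 1, 2 and 3.
print(ml_frequencies([1, 2, 1]))  # [0.25, 0.5, 0.25]
```

In practice the origin of a read is ambiguous whenever it maps to several strings, which is why the counts n(j) have to be estimated rather than observed.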
EM algorithm
• E-step: Compute the expected number n(j) of reads that come from string j under the assumption that the string frequencies f(j) are correct
• M-step: For each string j, set the new value of f(j) equal to the fraction of reads originating from string j among all observed reads in the sample
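The two steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the panel is a matrix `h` where `h[s][r]` is the edge weight between string `s` and read `r`, `counts[r]` is the observed count of read `r`, and the function name, the uniform starting point, and the fixed iteration count are all choices made for this example.

```python
def em_frequencies(h, counts, iterations=100):
    """Estimate string frequencies from a weighted string/read panel by EM."""
    n_strings = len(h)
    f = [1.0 / n_strings] * n_strings  # assumed start: uniform frequencies
    for _ in range(iterations):
        expected = [0.0] * n_strings
        # E-step: split each read among the strings that can emit it,
        # in proportion to f[s] * h[s][r].
        for r, count in enumerate(counts):
            denom = sum(f[s] * h[s][r] for s in range(n_strings))
            if denom == 0:
                continue  # no string in the panel explains this read
            for s in range(n_strings):
                expected[s] += count * f[s] * h[s][r] / denom
        # M-step: new frequency = share of the assigned reads per string.
        assigned = sum(expected)
        if assigned == 0:
            break
        f = [x / assigned for x in expected]
    return f


# Hypothetical panel: S1 emits R1, S2 emits R2 and R3, S3 emits R4.
panel = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(em_frequencies(panel, [1, 1, 1, 1]))  # [0.25, 0.5, 0.25]
```

Reads that no string can emit are skipped in the E-step, so on an incomplete panel the frequencies are normalized over the explained reads only.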
ML Model Quality
• How well the maximum likelihood model explains the reads
• Measured by the deviation between expected and observed read frequencies
— expected read frequency:
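$$e_j = \sum_{i:\, h_{i,j} > 0} \frac{h_{i,j}}{\sum_{l} h_{i,l}} \, f_i^{ML}$$

$$D = \frac{\sum_{j} |o_j - e_j|}{|R|}$$

where h_{i,j} is the weight of the edge between string i and read j, o_j and e_j are the observed and expected frequencies of read j, and |R| is the number of reads.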
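A small sketch of these two quantities, assuming the expected frequency of read j is the sum over strings i of f(i) times the share h[i][j] / sum_l h[i][l], and that D is the sum of |o_j - e_j| divided by the number of reads. The function names and the panel matrix are illustrative; the observed and expected values in the usage lines are the ones shown for the incomplete panel in the worked example of this talk.

```python
def expected_read_frequencies(h, f):
    """e_j = sum over strings i of f[i] * h[i][j] / sum_l h[i][l]."""
    e = [0.0] * len(h[0])
    for i, row in enumerate(h):
        row_sum = sum(row)
        if row_sum == 0:
            continue  # a string without edges emits no reads
        for j, weight in enumerate(row):
            e[j] += f[i] * weight / row_sum
    return e


def deviation(o, e):
    """D = sum_j |o_j - e_j| / |R|, with |R| the number of reads."""
    return sum(abs(o_j - e_j) for o_j, e_j in zip(o, e)) / len(o)


# O and E columns of the incomplete panel in the talk's example.
observed = [0.25, 0.25, 0.25, 0.25]
expected = [0.32, 0.32, 0.16, 0.16]
print(round(deviation(observed, expected), 3))  # 0.08
```

The printed value matches the D=.08 reported on the example slide, which supports reading |R| as the number of reads in the panel.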
VSEM: Virtual String EM
[Flowchart]
• Input: (incomplete) panel + virtual string with 0-weights in virtual string
• EM: ML estimates of string frequencies
• Compute expected read frequencies
• Deviation between expected/observed read frequencies
• Stop condition
— no: update weights of reads in virtual string, then run EM again
— yes: Output: string frequencies, reads
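The loop in the flowchart can be sketched as below. This is a rough illustration under assumptions, not the authors' code: the slides do not state how the virtual-string weights are updated or what the stop condition is, so the update rule (add o_j - e_j to each weight and clip at zero) and the threshold `eps` are assumptions, as are all function names and the toy panel.

```python
def em(h, o, iterations=100):
    """EM estimate of string frequencies; h[s][r] are edge weights."""
    f = [1.0 / len(h)] * len(h)
    for _ in range(iterations):
        expected = [0.0] * len(h)
        for r, o_r in enumerate(o):
            denom = sum(f[s] * h[s][r] for s in range(len(h)))
            if denom > 0:  # reads no string can emit stay unassigned
                for s in range(len(h)):
                    expected[s] += o_r * f[s] * h[s][r] / denom
        assigned = sum(expected)
        if assigned == 0:
            break
        f = [x / assigned for x in expected]
    return f


def expected_reads(h, f):
    """Expected read frequencies implied by string frequencies f."""
    e = [0.0] * len(h[0])
    for s, row in enumerate(h):
        row_sum = sum(row)
        if row_sum > 0:
            for r, w in enumerate(row):
                e[r] += f[s] * w / row_sum
    return e


def vsem(panel, o, eps=1e-6, max_rounds=50):
    """Return (frequencies incl. virtual string, virtual weights, D)."""
    virtual = [0.0] * len(o)  # virtual string starts with 0-weights
    prev_d = None
    for _ in range(max_rounds):
        h = panel + [virtual]  # the virtual string is the last row
        f = em(h, o)
        e = expected_reads(h, f)
        d = sum(abs(a - b) for a, b in zip(o, e)) / len(o)
        # Assumed stop condition: D is tiny or no longer changes.
        if d <= eps or (prev_d is not None and abs(prev_d - d) <= eps):
            break
        prev_d = d
        # Assumed update rule: shift weight towards under-explained reads.
        virtual = [max(0.0, w + a - b) for w, a, b in zip(virtual, o, e)]
    return f, virtual, d


# Hypothetical incomplete panel: S1 emits R1, S2 emits R2 and R3,
# and the string that emits R4 is missing.
panel = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]
freqs, virtual_weights, d = vsem(panel, [0.25, 0.25, 0.25, 0.25])
print(freqs, virtual_weights, d)
```

On this toy panel the virtual string should end up with frequency 0.25 and a nonzero weight only on R4, that is, the total frequency of the missing strings and the reads they emitted.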
Example: 1st iteration
[Figure: full panel (strings S1, S2, S3, VS; reads R1-R4) and incomplete panel (strings S1, S2, VS; reads R1-R4)]
Full panel
  O (reads R1-R4): 0.25, 0.25, 0.25, 0.25
Incomplete panel
  O (reads R1-R4): 0.25, 0.25, 0.25, 0.25
Example: 1st iteration
[Figure: full panel (strings S1, S2, S3, VS; reads R1-R4) and incomplete panel (strings S1, S2, VS; reads R1-R4)]
Full panel
  O (reads R1-R4): 0.25, 0.25, 0.25, 0.25
  ML: 0.25 (S1), 0.5 (S2), 0.25 (S3)
Incomplete panel
  O (reads R1-R4): 0.25, 0.25, 0.25, 0.25
  ML: 0.33 (S1), 0.66 (S2)
Example: 1st iteration
[Figure: full panel (strings S1, S2, S3, VS; reads R1-R4) and incomplete panel (strings S1, S2, VS; reads R1-R4)]
Full panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .25, .25, .25, .25
  ML: .25 (S1), .5 (S2), .25 (S3)
Incomplete panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .32, .32, .16, .16
  ML: .33 (S1), .66 (S2)
Example: 1st iteration
[Figure: full panel (strings S1, S2, S3, VS; reads R1-R4) and incomplete panel (strings S1, S2, VS; reads R1-R4)]
Full panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .25, .25, .25, .25
  ML: .25 (S1), .5 (S2), .25 (S3)
  D=0
Incomplete panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .32, .32, .16, .16
  ML: .34 (S1), .66 (S2)
  D=.08
Example: 1st iteration
[Figure: full panel (strings S1, S2, S3, VS; reads R1-R4) and incomplete panel (strings S1, S2, VS; reads R1-R4)]
Full panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .25, .25, .25, .25
  ML: .25 (S1), .5 (S2), .25 (S3), 0 (VS)
  D=0
Incomplete panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .3, .3, .15, .15
  ML: .32 (S1), .65 (S2), .02 (VS)
  D=.075
Example: last iteration
[Figure: full panel (strings S1, S2, S3, VS; reads R1-R4) and incomplete panel (strings S1, S2, VS; reads R1-R4)]
Full panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .25, .25, .25, .25
  ML: .25 (S1), .5 (S2), .25 (S3), 0 (VS)
  D=0
Incomplete panel
  O (reads R1-R4): .25, .25, .25, .25
  E (reads R1-R4): .25, .25, .25, .25
  ML: .20 (S1), .6 (S2), .2 (VS)
  D=0
VSEM : Virtual String EM
• Decide if the panel is likely to be incomplete
• Estimate total frequency of missing strings
• Identify read spectrum emitted by missing strings
ViSpA
• ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads
— align reads
— build a read graph:
> V: reads
> E: overlap between reads
> each path: candidate sequence
— filter based on ML frequencies
ViSpA-VSEM
[Flowchart]
• Boxes: ViSpA Weighted assembler; Qsps Library; ViSpA ML estimator; VSEM Virtual String EM; Stopping condition (YES / NO)
• Labels: reads; assembled Qsps; removing duplicated & rare qsps; reads, weights; Viral Spectrum + Statistics
Simulation Setup and Accuracy Measures
• Real quasispecies sequence data from [von Hahn et al. 2006]
— 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus
— Error-free data was simulated by an in-house simulator
> population sizes: 10, 20, 30, and 40 sequences
> population distributions: geometric, skewed normal, uniform
• Accuracy measures
— Kullback-Leibler divergence
— Correlation between real and predicted frequencies
— Average prediction error
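The three accuracy measures can be sketched as follows. The slides give no formulas, so several choices here are assumptions: the divergence is taken from the real to the predicted distribution, the correlation is Pearson's, and the average prediction error is taken to be the mean absolute difference. The function names are illustrative.

```python
import math


def kl_divergence(real, predicted):
    """D(real || predicted) = sum_i real_i * log(real_i / predicted_i).

    Requires predicted_i > 0 wherever real_i > 0.
    """
    return sum(p * math.log(p / q) for p, q in zip(real, predicted) if p > 0)


def pearson_correlation(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_prediction_error(real, predicted):
    """Assumed definition: mean absolute difference of the frequencies."""
    return sum(abs(a - b) for a, b in zip(real, predicted)) / len(real)


# Hypothetical real and predicted quasispecies frequencies.
real = [0.5, 0.3, 0.2]
predicted = [0.45, 0.35, 0.2]
print(kl_divergence(real, predicted))
print(pearson_correlation(real, predicted))
print(average_prediction_error(real, predicted))
```

A perfect prediction gives a divergence of 0, a correlation of 1, and an error of 0, so the three measures agree on the ideal case and differ mainly in how they penalize errors on rare strings.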
Experimental Validation of VSEM
• Detection of panel incompleteness
— VSEM can detect 1% of missing strings
• Improving quasispecies frequencies
• Detection of reads emitted by missing string
— Correlation between predicted reads and reads emitted by missing strings >65%
EM vs VSEM
(r.l./n.r: read length / number of reads; r: correlation; err: average prediction error)

                      % of missing strings
                      <10%        10%-20%     20%-30%     30%-40%     40%-50%     >50%
Method      r.l./n.r  r     err   r     err   r     err   r     err   r     err   r     err
ViSpA       100/20K   90.2  4.5   91.0  6.8   75.4  5.1   68.6  1.6   40.8  2.3   39.8  10.4
ViSpA-VSEM  100/20K   91.6  2.3   92.8  4.4   76.5  4.1   70.5  1.4   54.2  2.0   50.8  7.4
ViSpA       300/20K   95.7  3.8   93.2  10.2  89.8  1.0   66.7  1.5   62.1  2.1   46.8  9.7
ViSpA-VSEM  300/20K   95.4  1.7   95.8  1.1   96.9  0.6   85.7  0.9   88.0  0.9   60.4  2.6
ViSpA       100/100K  95.2  4.5   93.9  9.1   84.8  1.4   74.2  1.8   74.5  2.3   73.4  9.9
ViSpA-VSEM  100/100K  97.8  2.6   95.6  3.0   86.3  1.3   79.8  1.7   79.0  2.1   74.2  8.8
ViSpA       300/100K  96.2  3.9   88.6  12.4  88.9  1.0   85.1  1.4   75.1  2.3   49.5  10.5
ViSpA-VSEM  300/100K  96.2  2.0   92.8  0.9   93.7  0.7   90.2  1.2   84.4  1.7   67.1  4.8
ViSpA vs ViSpA-VSEM

              ViSpA                                      ViSpA-VSEM
Distribution  PPV    Sensitivity  RE       r      err    PPV    Sensitivity  RE       r      err    Gain
Geometric     0.767  0.5          -0.0099  0.954  7.36   0.591  0.73         0.0276   0.909  2.91   2.3
Skewed        0.733  0.4          -0.0196  0.673  13.01  0.701  0.77         0.0085   0.967  2.5    4
Uniform       0.733  0.4          -0.0191  0.716  12.76  0.645  0.73         0.0108   0.976  2.34   3.7

• 100K reads from 10 QSPS
• average length 300
ViSpA vs ViSpA-VSEM

              ViSpA                                      ViSpA-VSEM
#mismatches   PPV    Sensitivity  RE       r      err    PPV    Sensitivity  RE       r      err    Gain
k = 0         0.5    0.5          0.0720   0.9860 9.98   0.546  0.6          0.0494   0.974  7.54   1
k = 2         0.6    0.6          0.0668   0.9860 9.16   0.636  0.7          0.0434   0.9680 6.67   1
k = 6         0.7    0.7          0.0577   0.9856 7.95   0.727  0.8          0.0369   0.946  6.20   1
k = 7         0.8    0.8          0.0525   0.9866 7.26   0.818  0.9          0.0335   0.948  5.65   1

• 100K reads from 10 QSPS
• average length 300
Conclusions & Future Work
• Apply VSEM to RNA-Seq data
• Assemble missing strings from the set of reads emitted by missing strings
• Handle chimeric strings present in the panel