Date post: 14-Dec-2015
Alex Zelikovsky
Department of Computer Science
Georgia State University

Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu

Viral Quasispecies Reconstruction Based on Unassembled Frequency Estimation

Outline

• Introduction
• ML Model
• EM Algorithm
• VSEM Algorithm
• Experimental Results
• Conclusions and future work

ISBRA 2011, Central South University, Changsha, China

454 Pyrosequencing

• Emulsion PCR
• Single nucleotide addition
  - Natural nucleotides
  - DNA polymerase pauses until the complementary nucleotide is dispensed
  - Nucleotide incorporation triggers an enzymatic reaction that results in emission of light

ML Model

• Panel: bipartite graph
  - RIGHT: strings
    > unknown frequencies
  - LEFT: reads
    > observed frequencies
  - EDGES: probability that the read is emitted by the string
    > weights are calculated from the mapping of the reads to the strings

[Figure: bipartite graph connecting strings S1, S2, S3 to reads R1, R2, R3, R4]

ML estimates of string frequencies

• The probability that a read is sampled from string j is proportional to its frequency f(j)
• The ML estimate of f(j) is n(j)/(n(1) + . . . + n(N))
  - n(j): number of reads sampled from string j

EM algorithm

• E-step: Compute the expected number n(j) of reads that come from string j, assuming that the current string frequencies f(j) are correct
• M-step: For each string j, set the new value of f(j) to the fraction of all observed reads in the sample that originate from string j
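The E-step and M-step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_frequencies`, the edge-weight matrix `h`, the per-string normalization of edge weights into emission probabilities, the uniform starting point, and the fixed iteration count are all choices made for this example.

```python
def em_frequencies(h, o, iterations=100):
    """Estimate string frequencies by EM (illustrative sketch).

    h[s][r] -- weight of the edge between string s and read r (0 if no edge)
    o[r]    -- observed frequency of read r (the o[r] sum to 1)
    """
    n_strings, n_reads = len(h), len(o)
    # Assumption: emission probabilities are edge weights normalized per string.
    p = []
    for row in h:
        total = sum(row)
        p.append([w / total if total > 0 else 0.0 for w in row])
    f = [1.0 / n_strings] * n_strings  # assumed uniform starting frequencies
    for _ in range(iterations):
        # E-step: expected share n[s] of reads coming from string s, given f.
        n = [0.0] * n_strings
        for r in range(n_reads):
            denom = sum(f[s] * p[s][r] for s in range(n_strings))
            if denom == 0:
                continue  # read is not explained by any string in the panel
            for s in range(n_strings):
                n[s] += o[r] * f[s] * p[s][r] / denom
        # M-step: new f(s) is the portion of explained reads attributed to s.
        total = sum(n)
        if total == 0:
            break
        f = [x / total for x in n]
    return f


# Hypothetical panel: S1 emits R1, S2 emits R2 and R3, S3 emits R4.
panel = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
observed = [0.25, 0.25, 0.25, 0.25]
print(em_frequencies(panel, observed))
```

In this hypothetical panel every read maps to exactly one string, so the estimate is simply the share of reads per string. When reads are shared between strings, the E-step splits each read in proportion to the current frequencies.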

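ML Model Quality

• How well does the maximum likelihood model explain the reads?
• Measured by the deviation between expected and observed read frequencies
  - expected read frequency of read j, where i ranges over strings, l over reads, and h_{i,j} is the weight of the edge between string i and read j:

    $$e_j = \sum_{i:\, h_{i,j} > 0} \frac{h_{i,j}}{\sum_{l} h_{i,l}} \, f_i^{ML}$$

  - deviation, where o_j is the observed frequency of read j and |R| is the number of reads:

    $$D = \frac{\sum_{j} |o_j - e_j|}{|R|}$$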
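The two quantities on this slide can be computed with a few lines of Python. The sketch reads the expected read frequency as e_j = sum over strings i of (h_{i,j} / sum_l h_{i,l}) * f_i^{ML} and the deviation as D = sum_j |o_j - e_j| / |R|. The function names and the list-of-lists layout of `h` are assumptions for illustration only.

```python
def expected_read_frequencies(h, f_ml):
    """e[j] = sum over strings i of (h[i][j] / sum_l h[i][l]) * f_ml[i]."""
    n_reads = len(h[0])
    e = [0.0] * n_reads
    for weights, freq in zip(h, f_ml):
        total = sum(weights)
        if total == 0:
            continue  # a string with no edges contributes nothing
        for j, w in enumerate(weights):
            e[j] += w / total * freq
    return e


def deviation(o, e):
    """D = sum_j |o[j] - e[j]| / |R|, with |R| taken as the number of reads."""
    return sum(abs(oj - ej) for oj, ej in zip(o, e)) / len(o)


# O and E columns as they appear on the example slides (incomplete panel).
observed = [0.25, 0.25, 0.25, 0.25]
expected = [0.32, 0.32, 0.16, 0.16]
print(round(deviation(observed, expected), 3))
```

With the O and E columns from the example slides, this gives the D=.08 reported there for the incomplete panel, which supports this reading of the formula.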

VSEM : Virtual String EM

[Flowchart. Input: (incomplete) panel + virtual string with 0-weights in the virtual string. Loop: EM → ML estimates of string frequencies → compute expected read frequencies → deviation between expected and observed read frequencies → update weights of reads in virtual string → stop condition (no: repeat EM; yes: output). Output: string frequencies, reads.]
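The loop in this flowchart can be sketched in Python as below. The slides do not spell out the weight-update rule or the stop condition, so both are assumptions here: the virtual string's weight on a read is raised by the amount the read is under-explained (o - e, clipped at 0), and the loop stops once D falls below a tolerance. The helper names and the per-string normalization of weights are likewise illustrative, not the authors' code.

```python
def normalize_rows(h):
    """Turn edge weights into per-string emission probabilities."""
    rows = []
    for row in h:
        total = sum(row)
        rows.append([w / total if total > 0 else 0.0 for w in row])
    return rows


def em(h, o, iterations=100):
    """Plain EM for string frequencies; unexplained reads are skipped."""
    p = normalize_rows(h)
    f = [1.0 / len(h)] * len(h)
    for _ in range(iterations):
        n = [0.0] * len(h)
        for r, o_r in enumerate(o):
            denom = sum(f[s] * p[s][r] for s in range(len(h)))
            if denom > 0:
                for s in range(len(h)):
                    n[s] += o_r * f[s] * p[s][r] / denom
        total = sum(n)
        if total == 0:
            break
        f = [x / total for x in n]
    return f


def vsem(h, o, max_rounds=50, tol=1e-6):
    """Run EM on the panel plus a virtual string until the deviation D is small."""
    virtual = [0.0] * len(o)  # the virtual string starts with 0-weights
    for _ in range(max_rounds):
        panel = h + [virtual]
        f = em(panel, o)
        p = normalize_rows(panel)
        # Expected read frequencies under the current ML estimates.
        e = [sum(f[s] * p[s][r] for s in range(len(panel))) for r in range(len(o))]
        d = sum(abs(o_r - e_r) for o_r, e_r in zip(o, e)) / len(o)
        if d < tol:  # assumed stop condition
            break
        # Assumed update rule: add weight on reads the panel under-explains.
        virtual = [max(0.0, v + (o_r - e_r)) for v, o_r, e_r in zip(virtual, o, e)]
    return f, virtual, d


# Hypothetical incomplete panel: S1 emits R1, S2 emits R2 and R3; R4 has no string.
f, virtual, d = vsem([[1, 0, 0, 0], [0, 1, 1, 0]], [0.25, 0.25, 0.25, 0.25])
print(f, virtual, d)
```

In this toy case the last entry of `f` is the frequency assigned to the virtual string, and `virtual` marks the read that the incomplete panel could not explain. The toy panel is not the graph drawn on the example slides, so the intermediate numbers differ from those slides.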

Example : 1st iteration

[Figure: two bipartite panels over reads R1-R4, each with a virtual string (VS). Full panel: strings S1, S2, S3. Incomplete panel: strings S1, S2.]

Full panel:        O (observed read frequencies) = 0.25 0.25 0.25 0.25
Incomplete panel:  O (observed read frequencies) = 0.25 0.25 0.25 0.25

Example : 1st iteration

[Figure: two bipartite panels over reads R1-R4, each with a virtual string (VS). Full panel: strings S1, S2, S3. Incomplete panel: strings S1, S2.]

Full panel:        O = 0.25 0.25 0.25 0.25   ML = 0.25 0.5 0.25
Incomplete panel:  O = 0.25 0.25 0.25 0.25   ML = 0.33 0.66

Example : 1st iteration

[Figure: two bipartite panels over reads R1-R4, each with a virtual string (VS). Full panel: strings S1, S2, S3. Incomplete panel: strings S1, S2.]

Full panel:        O = .25 .25 .25 .25   E = .25 .25 .25 .25   ML = .25 .5 .25
Incomplete panel:  O = .25 .25 .25 .25   E = .32 .32 .16 .16   ML = .33 .66

Example : 1st iteration

[Figure: two bipartite panels over reads R1-R4, each with a virtual string (VS). Full panel: strings S1, S2, S3. Incomplete panel: strings S1, S2.]

Full panel:        O = .25 .25 .25 .25   E = .25 .25 .25 .25   ML = .25 .5 .25   D = 0
Incomplete panel:  O = .25 .25 .25 .25   E = .32 .32 .16 .16   ML = .34 .66      D = .08
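As a check on the D values above, and assuming D is the mean absolute difference between observed (O) and expected (E) read frequencies over the four reads, the incomplete-panel value works out as:

$$D = \frac{|0.25-0.32| + |0.25-0.32| + |0.25-0.16| + |0.25-0.16|}{4} = \frac{0.07 + 0.07 + 0.09 + 0.09}{4} = \frac{0.32}{4} = 0.08$$

For the full panel, O and E agree on every read, so each term is zero and D = 0.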

Example : 1st iteration

[Figure: two bipartite panels over reads R1-R4, each with a virtual string (VS). Full panel: strings S1, S2, S3. Incomplete panel: strings S1, S2.]

Full panel:        O = .25 .25 .25 .25   E = .25 .25 .25 .25   ML = .25 .5 .25, VS 0   D = 0
Incomplete panel:  O = .25 .25 .25 .25   E = .3 .3 .15 .15     ML = .32 .65, VS .02    D = .075

Example : last iteration

[Figure: two bipartite panels over reads R1-R4, each with a virtual string (VS). Full panel: strings S1, S2, S3. Incomplete panel: strings S1, S2.]

Full panel:        O = .25 .25 .25 .25   E = .25 .25 .25 .25   ML = .25 .5 .25, VS 0   D = 0
Incomplete panel:  O = .25 .25 .25 .25   E = .25 .25 .25 .25   ML = .20 .6, VS .2      D = 0

VSEM : Virtual String EM

• Decide if the panel is likely to be incomplete
• Estimate total frequency of missing strings
• Identify read spectrum emitted by missing strings

ViSpA

• ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads
  - align reads
  - build a read graph:
    > V: reads
    > E: overlaps between reads
    > each path: candidate sequence
  - filter based on ML frequencies
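The read-graph idea (vertices are reads, edges are overlaps, each path is a candidate sequence) can be illustrated with a small Python sketch. This is not ViSpA's implementation. It assumes error-free reads given as hypothetical (start, sequence) pairs already aligned to a common reference, draws an edge only when the second read starts inside the first, extends past it, and agrees on the overlap, and enumerates every source-to-sink path, which is only practical for tiny inputs.

```python
def consistent_overlap(u, v):
    """True if read v starts inside read u, extends past it, and agrees on the overlap.

    Reads are (start, sequence) pairs aligned to a common reference.
    """
    (su, seq_u), (sv, seq_v) = u, v
    eu, ev = su + len(seq_u), sv + len(seq_v)
    if not (su < sv < eu < ev):
        return False
    return seq_u[sv - su:] == seq_v[:eu - sv]


def build_read_graph(reads):
    """V = read indices, E = consistent overlaps between reads."""
    return {a: [b for b in range(len(reads)) if consistent_overlap(reads[a], reads[b])]
            for a in range(len(reads))}


def candidate_sequences(reads, graph):
    """Spell the sequence along every path from a source read to a sink read."""
    targets = {b for successors in graph.values() for b in successors}
    sources = [a for a in graph if a not in targets]
    found = []

    def walk(node, assembled, end):
        if not graph[node]:  # sink: no read extends this one
            found.append(assembled)
            return
        for nxt in graph[node]:
            start, seq = reads[nxt]
            # Append only the part of the next read beyond the current end.
            walk(nxt, assembled + seq[end - start:], start + len(seq))

    for src in sources:
        start, seq = reads[src]
        walk(src, seq, start + len(seq))
    return found


# Hypothetical reads: two variants diverge after a shared prefix.
reads = [(0, "ACGT"), (2, "GTAA"), (2, "GTCC")]
graph = build_read_graph(reads)
print(sorted(candidate_sequences(reads, graph)))
```

Because every edge goes from an earlier start position to a later one, the graph is acyclic and the path enumeration terminates. The candidates would then be filtered by their ML frequencies, as the last bullet on the slide says.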

ViSpA-VSEM

[Flowchart. Boxes: ViSpA weighted assembler; VSEM (Virtual String EM); ViSpA ML estimator; stopping condition (YES/NO). Labels: reads; reads, weights; assembled Qsps; Qsps library; removing duplicated & rare qsps; viral spectrum + statistics.]

Simulation Setup and Accuracy Measures

• Real quasispecies sequence data from [von Hahn et al. 2006]
  - 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus
  - Error-free data was simulated by an in-house simulator
    > population sizes: 10, 20, 30, and 40 sequences
    > population distributions: geometric, skewed normal, uniform
• Accuracy measures
  - Kullback-Leibler divergence
  - Correlation between real and predicted frequencies
  - Average prediction error
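The three accuracy measures can be sketched in Python as follows. The slide names them without giving formulas, so the definitions here are assumptions: the Kullback-Leibler divergence is taken from the real distribution `p` to the predicted one `q`, the correlation is Pearson's r, and the average prediction error is the mean absolute difference between real and predicted frequencies.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def correlation(p, q):
    """Pearson correlation between real and predicted frequencies."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((pi - mp) * (qi - mq) for pi, qi in zip(p, q))
    var_p = sum((pi - mp) ** 2 for pi in p)
    var_q = sum((qi - mq) ** 2 for qi in q)
    return cov / math.sqrt(var_p * var_q)


def average_prediction_error(p, q):
    """Mean absolute difference between real and predicted frequencies (assumed definition)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / len(p)


# Hypothetical real and predicted frequencies for three quasispecies.
real = [0.5, 0.3, 0.2]
predicted = [0.4, 0.4, 0.2]
print(kl_divergence(real, predicted),
      correlation(real, predicted),
      average_prediction_error(real, predicted))
```

A perfect prediction gives a divergence of 0, a correlation of 1, and an error of 0.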

Experimental Validation of VSEM

• Detection of panel incompleteness
  - VSEM can detect 1% of missing strings
• Improving quasispecies frequencies
• Detection of reads emitted by missing strings
  - Correlation between predicted reads and reads emitted by missing strings >65%

EM vs VSEM

Columns are grouped by % of missing strings. r.l./n.r: read length / number of reads.

                       <10%        10%-20%     20%-30%     30%-40%     40%-50%     >50%
Method       r.l./n.r  r     err   r     err   r     err   r     err   r     err   r     err
ViSpA        100/20K   90.2  4.5   91.0  6.8   75.4  5.1   68.6  1.6   40.8  2.3   39.8  10.4
ViSpA-VSEM   100/20K   91.6  2.3   92.8  4.4   76.5  4.1   70.5  1.4   54.2  2.0   50.8  7.4
ViSpA        300/20K   95.7  3.8   93.2  10.2  89.8  1.0   66.7  1.5   62.1  2.1   46.8  9.7
ViSpA-VSEM   300/20K   95.4  1.7   95.8  1.1   96.9  0.6   85.7  0.9   88.0  0.9   60.4  2.6
ViSpA        100/100K  95.2  4.5   93.9  9.1   84.8  1.4   74.2  1.8   74.5  2.3   73.4  9.9
ViSpA-VSEM   100/100K  97.8  2.6   95.6  3.0   86.3  1.3   79.8  1.7   79.0  2.1   74.2  8.8
ViSpA        300/100K  96.2  3.9   88.6  12.4  88.9  1.0   85.1  1.4   75.1  2.3   49.5  10.5
ViSpA-VSEM   300/100K  96.2  2.0   92.8  0.9   93.7  0.7   90.2  1.2   84.4  1.7   67.1  4.8

ViSpA vs ViSpA-VSEM

Distribution  Method       PPV    Sensitivity  RE       r      err    Gain
Geometric     ViSpA        0.767  0.5          -0.0099  0.954  7.36
              ViSpA-VSEM   0.591  0.73         0.0276   0.909  2.91   2.3
Skewed        ViSpA        0.733  0.4          -0.0196  0.673  13.01
              ViSpA-VSEM   0.701  0.77         0.0085   0.967  2.5    4
Uniform       ViSpA        0.733  0.4          -0.0191  0.716  12.76
              ViSpA-VSEM   0.645  0.73         0.0108   0.976  2.34   3.7

• 100K reads from 10 QSPS
• average length 300

ViSpA vs ViSpA-VSEM

#mismatches   Method       PPV    Sensitivity  RE       r       err    Gain
k = 0         ViSpA        0.5    0.5          0.0720   0.9860  9.98
              ViSpA-VSEM   0.546  0.6          0.0494   0.974   7.54   1
k = 2         ViSpA        0.6    0.6          0.0668   0.9860  9.16
              ViSpA-VSEM   0.636  0.7          0.0434   0.9680  6.67   1
k = 6         ViSpA        0.7    0.7          0.0577   0.9856  7.95
              ViSpA-VSEM   0.727  0.8          0.0369   0.946   6.20   1
k = 7         ViSpA        0.8    0.8          0.0525   0.9866  7.26
              ViSpA-VSEM   0.818  0.9          0.0335   0.948   5.65   1

• 100K reads from 10 QSPS
• average length 300

Conclusions & Future Work

• Apply VSEM to RNA-Seq data
• Assemble missing strings from the set of reads emitted by missing strings
• Handle chimeric strings present in the panel

Acknowledgments

• NFS …

Thank you very much

