Base Calling Error Toleration in Reference Based Assembly

Post on 19-Jan-2017



Hadi Gharibi
Email: h_gharibi@ee.sharif.edu
Sharif University of Technology

Max Planck Institute for Molecular Genetics
May 2015

How Base Calling Error Can Be Tolerated in Next Generation Sequencing (NGS)


Importance

Challenges

Our Hypothesis

Our Approach

• Deal with Large Amount of Data
• Impact on Sequencing Data Analysis Time and Accuracy

Researchers have developed many base calling algorithms; however, they have not resolved the trade-off between accuracy and time complexity.

• Required Accuracy
• Sequencing Data Analysis Execution Time

Base Calling Error Is Compensated in Down-stream Sequencing Steps

• Massive Data
• Diverse Algorithms

Importance: Base Calling Translates Noisy Intensity Data Into Reads

© EMBO Conference, 2014 [1]
© Illumina Inc., 2011 [2]

Image Processing → Intensity → Base Calling → Read → Assembling → Genome

Challenge: Base Calling Errors Are Always Compared

© C. Ye, 2014 [3]

Figure: Per-cycle error rate of base callers on the PhiX174 test data. The more accurate callers are slower than the others. [3]

Fundamental Question:


How Much Accuracy Is Required?

Our Approach: Analytical Assumptions and Method


Assumptions

• Random Genome
• Single Variations
• Mismatches << Read Length
• Uniform Substitution Error
• Equally Likely Base Errors

Method
• Variant Calling for Re-sequencing
• Derive Variant Calling Errors

Analytical Results: Base Calling Error Is Tolerated by Mapping Mismatch


Figure: Variant Calling Error Vs. Base Calling Error

• Random Genome
• Mismatches = {2, 5, 7, 9}
• Genome Size ~ 4 Mbp
• Read Length = 30 bp
• Variation Rate = 0.01
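As a rough illustration of why a mismatch budget can absorb base calling errors, the sketch below assumes that each base of a read is miscalled independently with the same probability (the uniform substitution assumption) and computes the chance that a 30 bp read carries no more errors than the mapper's mismatch limit. The binomial model and the function name are assumptions made for illustration, not the derivation behind the figure; the sketch also ignores the mismatches consumed by true variants.

```python
from math import comb


def prob_read_within_budget(read_length: int, error_rate: float, max_mismatches: int) -> float:
    """P(at most max_mismatches base calling errors in one read), binomial model.

    Assumes independent, identically distributed substitution errors per base.
    """
    return sum(
        comb(read_length, k) * error_rate**k * (1 - error_rate) ** (read_length - k)
        for k in range(max_mismatches + 1)
    )


if __name__ == "__main__":
    read_length = 30  # read length from the slide
    for m in (2, 5, 7, 9):  # mismatch limits from the slide
        for e in (0.01, 0.05, 0.10):  # illustrative base calling error rates
            p = prob_read_within_budget(read_length, e, m)
            print(f"mismatches={m} error_rate={e:.2f} P(errors <= mismatches)={p:.4f}")
```

Under this model, raising the mismatch limit increases the share of reads whose errors stay within the budget, which is the intuition behind tolerating base calling error at the mapping step.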

Simulation Method and Setup


Method
• Generate Target Genome
• Simulate Reads [4]
• Add Base Calling Error
• Call Variants
• Calculate Variant Calling Error

Setup
© GemSIM, 2013 [4]
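The simulation steps on this slide can be sketched end to end as a toy experiment. The version below is a simplified stand-in under stated assumptions: it does not use GemSIM or a real mapper, reads keep their true start position (so mapping effects are not modeled), variants are called by a per-position majority vote, and "variant calling error" is taken to be the fraction of positions where the called base differs from the target genome. All function names and default parameters are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"


def random_genome(n, rng):
    """Generate a uniformly random genome of length n."""
    return "".join(rng.choice(BASES) for _ in range(n))


def substitute(base, rng):
    """Replace a base with one of the other three, all equally likely."""
    return rng.choice([b for b in BASES if b != base])


def mutate(seq, rate, rng):
    """Apply independent substitutions at the given per-base rate."""
    return "".join(substitute(b, rng) if rng.random() < rate else b for b in seq)


def simulate_reads(genome, read_len, coverage, rng):
    """Sample single-end reads uniformly and keep each true start position."""
    n_reads = coverage * len(genome) // read_len
    starts = (rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads))
    return [(s, genome[s:s + read_len]) for s in starts]


def call_consensus(reference, reads):
    """Majority vote per position; fall back to the reference where uncovered."""
    piles = [Counter() for _ in reference]
    for start, read in reads:
        for offset, base in enumerate(read):
            piles[start + offset][base] += 1
    return "".join(p.most_common(1)[0][0] if p else ref for p, ref in zip(piles, reference))


def variant_calling_error(target, called):
    """Fraction of positions where the called base differs from the target genome."""
    return sum(1 for t, c in zip(target, called) if t != c) / len(target)


def run_experiment(error_rate, genome_size=5000, read_len=30, coverage=20,
                   variation_rate=0.01, seed=0):
    rng = random.Random(seed)
    reference = random_genome(genome_size, rng)
    target = mutate(reference, variation_rate, rng)            # generate target genome
    reads = simulate_reads(target, read_len, coverage, rng)    # simulate reads
    noisy = [(s, mutate(r, error_rate, rng)) for s, r in reads]  # add base calling error
    called = call_consensus(reference, noisy)                  # call variants
    return variant_calling_error(target, called)               # calculate error


if __name__ == "__main__":
    for e in (0.0, 0.05, 0.2, 0.5):
        print(f"base calling error={e:.2f} variant calling error={run_experiment(e):.4f}")
```

The genome here is far smaller than the ~4 Mbp used in the slides so that the sketch runs in a moment; the sizes and coverage are placeholders to adjust.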

Simulation Results: Simulation Verifies Analysis Predictions


• E. coli Genome [5]
• Mismatches = {3, 4, 5}
• Genome Size ~ 4 Mbp
• Read Length = 30 bp
• Variation Rate ~ 0.01
• Single-end Shotgun Run
• Map with SOAP [6]

Figure: Variant Calling Error Vs. Base Calling Error

© NCBI, 2014 [5]
© G. BGI, 2008 [6]
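To make the role of the mismatch setting concrete, here is a minimal sketch of mismatch-tolerant read mapping: a read is accepted at a reference position if it differs from the reference window in at most a given number of substitutions. This is a naive linear scan written for illustration only; it is not how SOAP works internally, and the function names are hypothetical.

```python
def hamming(a: str, b: str, limit: int) -> int:
    """Count mismatches between equal-length strings, stopping once limit is exceeded."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def map_read(reference: str, read: str, max_mismatches: int) -> list[int]:
    """Return every start position where the read aligns with at most max_mismatches substitutions."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming(window, read, max_mismatches) <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGGATCGATTACA"
    read = reference[5:15]
    # Introduce one base calling error at offset 3 of the read.
    wrong = "C" if read[3] != "C" else "G"
    noisy_read = read[:3] + wrong + read[4:]
    print("exact mapping of noisy read:", map_read(reference, noisy_read, 0))
    print("mapping with 1 mismatch allowed:", map_read(reference, noisy_read, 1))
```

A read with a single miscalled base is lost under exact matching but recovered once one mismatch is allowed; a read that matches several positions (as in a repeat region) returns more than one hit.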

Simulation Results: Random Genome Obviates Repeat Region Effect


• Genome Sizes ~ 4 Mbp
• Mismatches = 3
• Read Length = 30 bp
• Variation Rate ~ 0.01
• Single-end Shotgun Run
• Map with SOAP [6]

Figure: Random Genome Vs. E. coli Genome

© G. BGI, 2008 [6]
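One way to quantify how much repeat content separates a real genome from a random one is to measure the fraction of read-length substrings (k-mers) that occur more than once. The sketch below is an assumed, simplified metric for illustration: it counts exact forward-strand repeats only, and the function name is hypothetical. It holds every k-mer in memory, so it suits small test sequences rather than a full 4 Mbp genome.

```python
import random
from collections import Counter


def repeated_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs more than once in the genome."""
    total = len(genome) - k + 1
    if total <= 0:
        return 0.0
    counts = Counter(genome[i:i + k] for i in range(total))
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


if __name__ == "__main__":
    rng = random.Random(0)
    random_seq = "".join(rng.choice("ACGT") for _ in range(20000))
    # Insert a copy of the first 500 bases to create an artificial repeat region.
    with_repeat = random_seq[:10000] + random_seq[:500] + random_seq[10000:]
    print("random sequence:", repeated_kmer_fraction(random_seq, 30))
    print("with inserted repeat:", repeated_kmer_fraction(with_repeat, 30))
```

Reads drawn from repeated k-mers cannot be placed uniquely by a mapper, which is one plausible reading of why repeat regions affect accuracy in the E. coli runs.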


Conclusion

Simulation Results
• Confirm the Hypothesis
• Genome Repeat Regions Impair Accuracy

Analytical Results
• Confirm the Hypothesis
• Higher Mismatches May Not Obey

Next Steps


Simulation Steps
• Genome Having More Repeat Regions
• Develop Mapper with Higher Mismatches

Analytical Steps
• Genome Structure
• Paired-end Shotgun Sequencing
• Erasure Base Calling Error
• Other Variant Types

References
[1] EMBO Conference, “Human Evolution in the Genomic Era: Origins, Populations, and Phenotypes,” 2014. [Online]. Available: events.embo.org/14-human-evo
[2] Illumina Inc., “Theory of Operation, HCS 1.4/RTA 1.12,” 2011.
[3] C. Ye, C. Hsiao, and H. Corrada Bravo, “BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution,” Bioinformatics, 30(9), 1214–1219, 2014.
[4] GemSIM, “GemSIM,” 2013. [Online]. Available: http://sourceforge.net/projects/gemsim
[5] NCBI, “Escherichia coli O157:H7 str. Sakai DNA, complete genome - Nucleotide - NCBI,” 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/nuccore/47118301?report=fasta
[6] G. BGI, “SOAP: Short Oligonucleotide Analysis Package,” 2008. [Online]. Available: http://soap.genomics.org.cn
[7] C. Ledergerber and C. Dessimoz, “Base-calling for next-generation sequencing platforms,” Briefings in Bioinformatics, 2011.


Acknowledgement

Thank You for Your Patience, Time and Attention.


Thank You Very Much