Date post: 29-Jan-2016
University of Connecticut

School of Engineering

Assembler (version): Reference

Abyss 1.5.1: Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., and Birol, I. (2009). Abyss: a parallel assembler for short read sequence data. Genome research, 19(6), 1117–1123.

Cabog 7.0: Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C., and Sutton, G. (2008). Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24), 2818–2824.

Mira 4.0.2: Barthelson, R., McFarlin, A. J., Rounsley, S. D., and Young, S. (2011). Plantagora: modeling whole genome sequencing and assembly of plant genomes. PLoS One, 6(12), e28436.

MaSuRCA 2.2.1: Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., and Yorke, J. A. (2013). The masurca genome assembler. Bioinformatics, 29(21), 2669–2677.

SGA 0.10.13: Simpson, J. T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22(3), 549–556.

SoapDenovo 2.04: Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al. (2012). Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1(1), 18.

SPAdes 3.0.0: Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., et al. (2012). Spades: a new genome assembly algorithm and its applications to single cell sequencing. Journal of Computational Biology, 19(5), 455–477.

Velvet 1.2.10: Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research, 18(5), 821–829.

Hierarchical Genome Assembly

Anas Al-Okaily and Ion Măndoiu

Department of Computer Science and Engineering, University of Connecticut

INTRODUCTION

Assembly quality delivered by current assemblers improves only marginally or gets worse for ultra-deep genome sequencing data.

Magoc et al. 2013; Lonardi et al. 2015

Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. Although high-quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, long-insert libraries are costly to generate. Recently, the GAGE-B study showed that remarkably good assembly quality can be obtained for bacterial genomes by running state-of-the-art assemblers on a single short-insert library with very high coverage. In this poster, we introduce and empirically evaluate a novel hierarchical genome assembly (HGA) methodology that takes further advantage of such very high coverage by independently assembling disjoint subsets of reads, combining the assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads.

EVALUATED ASSEMBLERS

THE CHALLENGE OF ULTRA-DEEP DATA ASSEMBLY

[Figure panels: HiSeq datasets (100bp); MiSeq datasets (250bp)]

IDENTIFIED GENE RESULTS

[Figure panels: HiSeq datasets (100bp); MiSeq datasets (250bp)]

BEST HGA PARAMETERS

Assemblies were evaluated using multiple metrics computed with QUAST, including:

• Number of contigs
• Number of known genes completely or partially covered by the contigs
• N50: the largest length L such that contigs of length at least L together cover at least 50% of the total length of the assembly
• NA50: computed like N50 after breaking misassembled contigs
• Genome fraction: percentage of genome bases aligned to at least one contig
• Duplication ratio: number of aligned contig bases divided by the number of reference bases aligned to at least one contig
• Number of global and local misassemblies
• Mismatches and indels per 100Kb
• Unaligned contig length
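The N50 and duplication ratio definitions above can be illustrated with a minimal Python sketch. The function names are hypothetical and this is not QUAST's code; it only restates the two definitions, assuming contig lengths and aligned base counts are already available.

```python
def n50(contig_lengths):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def duplication_ratio(aligned_contig_bases, covered_reference_bases):
    """Aligned contig bases divided by the number of reference bases
    that are aligned to at least one contig."""
    if covered_reference_bases <= 0:
        raise ValueError("no reference bases are covered")
    return aligned_contig_bases / covered_reference_bases
```

NA50 would follow by calling `n50` on the contig pieces that remain after splitting each contig at its misassembly breakpoints, which requires alignments to a reference and is not shown here.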

DATASETS AND ACCURACY METRICS

Empirical evaluation of this methodology for 8 leading assemblers using 7 GAGE-B bacterial datasets consisting of 100bp Illumina HiSeq and 250bp Illumina MiSeq reads shows that HGA leads to a significant improvement in assembly quality for all evaluated assemblers and all datasets. In ongoing work we are evaluating the HGA methodology on ultra-deep BAC sequencing data.

Availability: Version 1.0.0 of HGA, implemented in Python, is available at http://dna.engr.uconn.edu/software/HGA.

Acknowledgements: This work has been partially supported by the Agriculture and Food Research Initiative Competitive Grant No. 2011-67016-30331 from the USDA National Institute of Food and Agriculture.

References

• Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). Quast: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075.

• Lonardi, S., Mirebrahim, H., Wanamaker, S., Alpert, M., Ciardo, G., Duma, D., and Close, T. J. (2015). When less is more: "slicing" sequencing data improves read decoding accuracy and de novo assembly quality. Bioinformatics, advance access.

• Magoc, T., Pabinger, S., Canzar, S., Liu, X., Su, Q., Puiu, D., Tallon, L. J., and Salzberg, S. L. (2013). GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics, 29(14), 1718–1725.

CONCLUSIONS

ASSEMBLY FLOWS

CORRECTED N50 RESULTS

The proposed hierarchical assembly flows consist of the following steps:

1. Partitioning the reads into p disjoint parts, where p = 2, 4, or 8
2. Independent assembly of each part using one of the 8 evaluated assemblers, with kmer size from 21 to 101 in increments of 10
3. Either merging the resulting contigs or combining them using the Velvet assembler with kmer size 31 and expected coverage = p
4. Reassembling the merged/combined contigs along with the original reads using SPAdes, again with kmer size from 21 to 101 in increments of 10
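Step 1 and the parameter ranges in steps 2 and 4 can be sketched in Python as follows. This is an illustration, not the HGA implementation: the poster does not say how reads are assigned to parts, so round-robin assignment is an assumption, as are the names `partition_reads` and `hga_configurations`. For paired-end data each item in `reads` is assumed to be a read pair, so that mates stay in the same part. Pairing every assembly kmer with every reassembly kmer is also an assumption.

```python
from itertools import product

PART_COUNTS = (2, 4, 8)                 # values of p listed in step 1
KMER_SIZES = tuple(range(21, 102, 10))  # 21, 31, ..., 101 (steps 2 and 4)


def partition_reads(reads, p):
    """Split reads into p disjoint parts by round-robin assignment
    (an assumed scheme; any disjoint partition fits step 1)."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    parts = [[] for _ in range(p)]
    for index, read in enumerate(reads):
        parts[index % p].append(read)
    return parts


def hga_configurations():
    """Yield every (p, assembly_kmer, reassembly_kmer) triple."""
    return product(PART_COUNTS, KMER_SIZES, KMER_SIZES)
```

Each part returned by `partition_reads` would then be passed to the chosen assembler (step 2); the external assembler, Velvet, and SPAdes invocations are deliberately left out because their command lines are not given in the poster.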

For each assembler, reported HGA results are for the assembly with the largest (uncorrected) N50 over the tested values of p and kmer sizes.
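The selection rule can be sketched as below, assuming each tested run is recorded as a pair of a configuration (p, assembly kmer, reassembly kmer) and the list of contig lengths it produced. The names `select_best_run` and `n50` are hypothetical and the sketch is not taken from the HGA code.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L together
    cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def select_best_run(runs):
    """Return the (configuration, contig_lengths) pair with the largest
    uncorrected N50; ties go to the first such run in iteration order."""
    return max(runs, key=lambda run: n50(run[1]))
```

For example, `select_best_run([((2, 21, 31), [100, 50, 50]), ((4, 31, 51), [150, 30, 20])])` picks the second run, whose N50 is 150 rather than 100. The rule uses uncorrected N50, so it needs no reference genome.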

[Figure: Best kmer combinations (HiSeq). Axes: assembly kmer, reassembly kmer, count.]

[Figure: Best kmer combinations (MiSeq). Axes: assembly kmer, reassembly kmer, count.]
