My portfolio: Sequencing projects
Alla Lapidus, Ph.D
Associate Professor, Fox Chase Cancer Center
EDUCATION:
1980 M.S. in Physics (with honors) - Department of Theoretical and Experimental Physics, Moscow Physics-Engineering Institute (МИФИ), Moscow, Russia
1986 Ph.D. in Molecular Biology - Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia
PROFESSIONAL MEMBERSHIPS and SERVICE
2011 - Reviewer for Frontiers in Evolutionary and Genomic Microbiology
2008 - Reviewer for Nucleic Acids Research
2007 - Reviewer for PLoS Genetics
2006 - current - Organizing Committee Member – “Sequencing, Finishing and Analysis in the Future” SFAF meeting (http://www.lanl.gov/finishinginthefuture/)
1998 - American Society for Microbiology
1998 - Grant Reviewer (INTAS)
FIRST GENOMIC PROJECT: 1994 – European project - Bacillus subtilis, INRA, France
UNIVERSITY of CHICAGO: 1998 – Rhodobacter capsulatus genome
INTEGRATED GENOMICS, Inc: 2001 – Director of Sequencing Center, Chicago
JOINT GENOME INSTITUTE (LBNL): 2003 – Genome Finishing Group, Projects coordinator
Fox Chase Cancer Center, Cancer Genome Institute: 2010 - Director of Bioinformatics
May 30th-June 1, 2011. Santa Fe
Genome assembly and finishing group at JGI
Projects in the group
• Microbial projects – ~120 genomes assembled and finished
• Metagenomes – 3 completely finished members of different communities + approach development
• Fungi – genome assembly and partial improvement
• Single cell – one finished genome
• Bioinformatics – small tools needed for assembly improvement and visualization
• Quality control – needs and approaches
http://www.standardsingenomics.org/index.php/sigen/search/results
Major Sequencing Centers for Prokaryotic Finished Genomes in 2011:
1765 projects
Nikos C. Kyrpides, http://www.genomesonline.org/cgi-bin/index.cgi
Metagenomes
Metagenomic Assembly Challenges – Molecular Biology
• Low representation of individual species
• Requires high depth of coverage to find all species
– Low abundance species unassembled
• Extraction bias, etc.
Metagenomic Assembly Challenges - Software
• Algorithm bias
– De novo assembly assumes a normal, even distribution of sequence data
– Assumes all Kmers will have similar coverage numbers
• Memory and run times
– Illumina-sequenced metagenomes can generate > 60 GB of data
• Assembly software could require > 512 GB RAM
– Lower Kmer values generate more Kmers, increasing the memory requirement and computing cycles, which increases assembly time
• Also can improve assembly (low abundance)
• Individual genome finishing
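The link between Kmer size and workload can be illustrated with a small sketch. A read of length L yields L - k + 1 Kmers, so a smaller k means more Kmer occurrences to store and process. The reads and the `kmer_counts` helper below are invented for illustration; real assemblers use far more compact data structures, so this shows only the trend, not actual memory use.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across a list of reads."""
    counts = Counter()
    for read in reads:
        # A read of length L contributes L - k + 1 k-mers (none if L < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy reads, invented for illustration only.
reads = ["ACGTACGTGACG", "CGTGACGTTAGC", "GACGTTAGCATG"]

for k in (4, 6, 8):
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    print(f"k={k}: {total} k-mer occurrences, {len(counts)} distinct")
```

On these toy reads the number of Kmer occurrences falls as k grows, which is the effect the slide describes in reverse.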
What is needed for better performance
• Reduce size of data sets
• Implement read quality approach (trimming, filtering, binning)
• Choose the best assembler
• Select the “best” assembly
– Merging Velvet assemblies with minimus
– Algorithm for selection of best Kmer
• Process automation
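As a rough illustration of the read quality step (trimming and filtering), the sketch below trims low-quality bases from the 3' end of each read and drops reads that become too short. It assumes Phred+33 quality strings; the thresholds `min_q` and `min_len` and all function names are hypothetical choices for this example, not the settings of any JGI pipeline.

```python
def phred_scores(quality_string, offset=33):
    """Decode a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_string]

def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

def filter_reads(records, min_q=20, min_len=5):
    """Trim each (sequence, quality) pair and keep reads of at least min_len bases."""
    kept = []
    for seq, qual in records:
        trimmed_seq, trimmed_qual = trim_3prime(seq, qual, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_qual))
    return kept

# Toy records: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
records = [("ACGTACGT", "IIIIII##"), ("ACGT", "I###")]
print(filter_reads(records))
```

Trimming before assembly also reduces the size of the data set, which ties in with the first bullet above.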
METAGENOME SAMPLES 2011: 1927 samples
Nikos C. Kyrpides, http://www.genomesonline.org/cgi-bin/index.cgi
Simple Communities are Very Complex
High-resolution metagenomics targets specific functional types in complex microbial communities
M. Kalyuzhnaya, A. Lapidus, N. Ivanova, A. Copeland, A. McHardy, E. Szeto, A. Salamov, I. Grigoriev, D. Suciu, S. Levine, V.M. Markowitz, I. Rigoutsos, S. Tringe, D. Bruce, P. Richardson, M. Lidstrom & L. Chistoserdova
Nature Biotechnology 26, 1029-1034 (2008). Published online: 17 August 2008
Over a million species in the Kingdom Fungi have evolved over millions of years to occupy diverse ecological niches and have accumulated an enormous but yet undiscovered natural arsenal of potentially useful innovations. While the number of fungal genome sequencing projects continues to increase, the phylogenetic breadth of current sequencing targets is extremely limited. Exploration of phylogenetic and ecological diversity of fungi by genome sequencing is therefore a potentially rich source of valuable metabolic pathways and enzyme activities that will remain undiscovered and unexploited until a systematic survey of phylogenetically diverse genome sequences is undertaken.
Fungal projects
Fungal assembly challenges
- Small amount of gDNA => poor-quality libraries and too few libraries (different paired-end (PE) libraries are needed, and they are hard to make with new sequencing protocols)
- Large data sets
- Polyploidy
- Large variety of repeats (lengths, complexity)
Figure 2. Model of the evolution of the N. tetrasperma mat A mating-type chromosome. The order of rearrangement events is shown in A and begins with the ancestral mat A chromosome (1) which was collinear with mat a and the mating-type chromosome of N. crassa. The 1.2-Mb inversion occurred first and produced the orientation in 2. This event was followed relatively quickly by the 5.3-Mb inversion (3). The 68-kb inversion, shown as the line at the far right of B, occurred much later to produce the current arrangement of the mat A chromosome (B). The 1.2-Mb inversion (breakpoints shown in red) is flanked by unique 50-bp duplications (D) that would have been in an inverted orientation before the occurrence of the large inversion, consistent with rearrangement via staggered single-strand breaks. The 5.3-Mb inversion (breakpoints shown in blue) is flanked by Mariner transposable elements (M), consistent with rearrangement via ectopic recombination. Mariner remnants were not present in either of the homologous regions in the mat a chromosome. The overlapping nature of these two inversions explains the relocated genomic region. The 68-kb inversion is flanked by a microsatellite-containing, low-complexity sequence and may have occurred due to ectopic recombination between blocks of microhomology. MAT denotes the location of the mating-type locus while CEN shows the location of the centromere.
Massive Changes in Genome Architecture Accompany the Transition to Self-Fertility in the Filamentous Fungus Neurospora tetrasperma
(Genetics. 2011 September; 189(1): 55–69.)
Single-cell approach
•Single-cell genomics is a method for amplifying DNA from single bacterial cells using Multiple Displacement Amplification (MDA)
•Only 2% of microbes can be cultured.
•Discovery of novel enzymes, new antibiotics and more
•Cancer research and clinical diagnostics
Single-cell Process
Challenges with Single-cell
•Single-cell methodology is sensitive to reagent or processing contamination from multiple displacement amplification (MDA).
•MDA produces non-uniform read coverage, posing problems with current short read assemblers.
•MDA produces chimeric reads.
•De novo assembly of complete genome sequences.
Candidatus Sulcia muelleri DMIN
Sulcia cell isolation and sequence coverage, closure and polishing locations along the Sulcia DMIN single cell genome.
(A) Micromanipulation of the single Sulcia cell from the sharpshooter bacteriome metasample. (B) Sequence coverage including closure and polishing locations along the finished, circular Sulcia DMIN genome with circles corresponding to the following features, starting with the outermost circle: (1) Illumina sequence coverage ranging from 0–3276 (mean 303±386), (2) pyrosequence sequence coverage ranging from 0–231 (mean 42±39), (3) Sanger sequence coverage ranging from 0–30 (mean 10±7), (4) locations of captured (green) and uncaptured gaps (orange), (5) polishing locations corrected using Illumina (blue) and Sanger (purple) sequence, (6) GC content heat map (dark blue to light green = low to high values) and (7) GC skew.
Woyke T, Tighe D, Mavromatis K, Clum A, Copeland A, Schackwitz W, Lapidus A, et al. 2010 One Bacterial Cell, One Complete Genome. PLoS ONE 5(4): e10314.
Identification of MDA contaminants
Red = suspect contaminant
S. Trong, JGI
MDA coverage bias
[Figure: two overlay plots. First panel: NumOccKmer (0 to 90000) against #KmerFreq (0 to 500). Second panel: TotalOccs (log scale) against #KmerFreq.]
Single-Cell kmer distribution
Shotgun sequencing theoretical kmer distribution; current short read assemblers expect this uniformity
Isolate kmer distribution
Woyke T, et al. 2011 PLoS ONE 6(10): e26161.
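A Kmer frequency distribution of the kind plotted on this slide can be computed along the following lines. This is a minimal sketch: it assumes the plotted quantity is the number of distinct Kmers seen at each frequency, and the reads and the `kmer_spectrum` name are invented for illustration.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Map each k-mer frequency to the number of distinct k-mers seen that often."""
    kmer_counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    # Second-level count: how many distinct k-mers share each frequency.
    return Counter(kmer_counts.values())

# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "ACGTTTTT"]
spectrum = kmer_spectrum(reads, 4)
for freq in sorted(spectrum):
    print(f"frequency {freq}: {spectrum[freq]} distinct k-mers")
```

For evenly covered shotgun data the spectrum is expected to show a single peak near the mean coverage, whereas uneven MDA coverage spreads the Kmers over a wide range of frequencies.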
Normalizing read coverage improves assembly!
[Figure: two overlay plots. First panel: TotalOccs (log scale) against #KmerFreq. Second panel: NumOccKmer (0 to 90000) against #KmerFreq (0 to 500).]
Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, et al. 2011 Decontamination of MDA Reagents for Single Cell Whole Genome Amplification. PLoS ONE 6(10): e26161.
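One common way to even out read coverage is to discard reads whose Kmers have already been seen often enough. The sketch below follows that digital-normalization-style idea; it is an assumption made for illustration, not the procedure of the cited paper or of the JGI pipeline, and the `k` and `cutoff` values are arbitrary.

```python
from collections import Counter
from statistics import median

def normalize_reads(reads, k=4, cutoff=2):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k, nothing to estimate coverage from
        # The median k-mer count seen so far serves as a coverage estimate.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept

# Toy data: one heavily over-represented read plus one rare read.
reads = ["ACGTACGT"] * 5 + ["TTGCAATT"]
print(normalize_reads(reads))
```

In this toy case the over-represented read is capped at two copies while the rare read is retained, flattening the coverage profile that short read assemblers expect to be uniform.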
JGI’s Approach
• Develop pipeline to assemble single-cell genomes addressing contamination and read coverage problems and providing as much genome completeness as possible.
• Provide QC metrics to evaluate contaminants in the reads and assembly.
Allpaths-LG vs Velvet
• Allpaths-LG (APLG) uses less memory and requires much less data
• For microbes and fungi: fewer contigs, better N50 numbers and better annotation when compared to a reference using Allpaths than with Velvet
• APLG rarely has large misassemblies, whereas with Velvet you have to play with the minimum pair cutoff to make sure you don't get misassemblies
• For single cells both Allpaths and Velvet are run and the results are merged
• APLG is not used for metagenomes
• APLG works best with at least an overlapping standard library and a mate pair library between 3-8 kb (real or fake). If you give Allpaths a mate pair library over 10 kb without providing some smaller mate pair library it can get confused
• APLG doesn't accept variable length mate pair data
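The N50 numbers used in this comparison can be computed from a list of contig lengths as in the sketch below. N50 is the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the assembly size. The two sets of contig lengths are invented for illustration and do not come from any real Allpaths-LG or Velvet run.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths (bp) for two assemblies of the same genome.
assembly_a = [100, 80, 50, 30, 20]
assembly_b = [60, 50, 40, 40, 30, 30, 20, 10]

print("contigs:", len(assembly_a), "N50:", n50(assembly_a))
print("contigs:", len(assembly_b), "N50:", n50(assembly_b))
```

Fewer contigs and a higher N50 indicate a more contiguous assembly, but neither number detects misassemblies, so comparison against a reference remains necessary.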
See “Assembly and finishing” presentation for microbial assemblies and bioinformatics tools.
Microbial assemblies and finishing
Thank you!