My portfolio: Sequencing projects. Alla Lapidus, Ph.D Associate Professor, Fox Chase Cancer Center...

My portfolio: Sequencing projects

Alla Lapidus, Ph.D

Associate Professor, Fox Chase Cancer Center

EDUCATION:
1980 M.S. in Physics (with honors) - Department of Theoretical and Experimental Physics, Moscow Physics-Engineering Institute (МИФИ), Moscow, Russia
1986 Ph.D. in Molecular Biology - Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

PROFESSIONAL MEMBERSHIPS and SERVICE
2011 - Reviewer for Frontiers in Evolutionary and Genomic Microbiology
2008 - Reviewer for Nucleic Acids Research
2007 - Reviewer for PLoS Genetics
2006 - current - Organizing Committee Member, “Sequencing, Finishing and Analysis in the Future” (SFAF) meeting (http://www.lanl.gov/finishinginthefuture/)
1998 - American Society for Microbiology
1998 - Grant Reviewer (INTAS)

FIRST GENOMIC PROJECT: 1994 – European project - Bacillus subtilis, INRA, France

UNIVERSITY of CHICAGO: 1998 – Rhodobacter capsulatus genome

INTEGRATED GENOMICS, Inc: 2001 – Director of Sequencing Center, Chicago

JOINT GENOME INSTITUTE (LBNL): 2003 – Genome Finishing Group, Projects coordinator

Fox Chase Cancer Center, Cancer Genome Institute: 2010 - Director of Bioinformatics

May 30th-June 1, 2011. Santa Fe

Genome assembly and finishing group at JGI

Projects in the group

• Microbial projects – ~120 genomes assembled and finished
• Metagenomes – 3 completely finished members of different communities + approach development
• Fungi – genome assembly and partial improvement
• Single cell – one finished genome
• Bioinformatics – small tools needed for assembly improvement and visualization
• Quality control – needs and approaches

http://www.standardsingenomics.org/index.php/sigen/search/results


Major Sequencing Centers for Prokaryotic Finished Genomes in 2011:

1765 projects

Nikos C. Kyrpides, http://www.genomesonline.org/cgi-bin/index.cgi


Metagenomes

Metagenomic Assembly Challenges – Molecular Biology

• Low representation of individual species
• Requires high depth of coverage to find all species
  – Low abundance species unassembled
• Extraction bias, etc.


Metagenomic Assembly Challenges - Software

• Algorithm Bias
  – de novo assembly assumes normal, even distribution of sequence data
  – Assumes all Kmers will have similar coverage numbers
• Memory and Run Times
  – Illumina sequenced metagenomes can generate > 60 GB of data
    • Assembly software could require >512 GB RAM
  – Lower Kmer values generate more Kmers, increasing memory requirement and computing cycles, and so increasing assembly time
    • Also can improve assembly (low abundance)
• Individual genome finishing
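The point about lower Kmer values can be illustrated with a small count: a read of length L contains L - k + 1 k-mers, so a smaller k yields more k-mer instances per read. The sketch below is only an illustration on made-up toy reads; the function names and the reads are assumptions, and real assembler memory use also depends on how the k-mers are stored and on the graph that is built from them.

```python
def kmers(read, k):
    """Yield every substring of length k from a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_totals(reads, k):
    """Return (total k-mer instances, number of distinct k-mers) for the reads."""
    total = 0
    distinct = set()
    for read in reads:
        for kmer in kmers(read, k):
            total += 1
            distinct.add(kmer)
    return total, len(distinct)


# Hypothetical toy reads, not real sequencing data.
toy_reads = ["ACGTACGTGA", "CGTGATTACA"]

for k in (3, 5, 7):
    total, distinct = kmer_totals(toy_reads, k)
    print(f"k={k}: {total} k-mer instances, {distinct} distinct")
```

Each 10-base toy read contributes 10 - k + 1 instances, so the total falls as k grows.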


What is needed for better performance

• reduce size of data sets
• implement read quality approach (trimming, filtering, binning)
• choose the best assembler
• select the “best” assembly
  – Merging Velvet assemblies with minimus
  – Algorithm for selection of best Kmer
• process automation
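The read quality step (trimming and filtering) in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the pipeline used by the group: it assumes Phred+33 quality strings, trims only from the 3' end, and the thresholds (`min_q`, `min_len`) and function names are illustrative choices.

```python
def trim_read(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_q.

    Assumes qual is a Phred+33 encoded string of the same length as seq.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=30):
    """Trim each (seq, qual) record and keep those still at least min_len long."""
    kept = []
    for seq, qual in records:
        trimmed_seq, trimmed_qual = trim_read(seq, qual, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_qual))
    return kept


# Hypothetical record: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
example = [("ACGTAC", "IIII##")]
print(filter_reads(example, min_q=20, min_len=4))
```

Dropping reads that become too short after trimming also reduces the size of the data set, which is the first item on the list.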

METAGENOME SAMPLES

2011: 1927 samples

Nikos C. Kyrpides, http://www.genomesonline.org/cgi-bin/index.cgi

Simple Communities are Very Complex

High-resolution metagenomics targets specific functional types in complex microbial communities

M. Kalyuzhnaya, A. Lapidus, N. Ivanova, A. Copeland, A. McHardy, E. Szeto, A. Salamov, I. Grigoriev, D. Suciu, S. Levine, V.M. Markowitz, I. Rigoutsos, S. Tringe, D. Bruce, P. Richardson, M. Lidstrom & L. Chistoserdova

Nature Biotechnology 26, 1029 - 1034 (2008). Published online: 17 August 2008

Over a million species in the Kingdom Fungi have evolved over millions of years to occupy diverse ecological niches and have accumulated an enormous but yet undiscovered natural arsenal of potentially useful innovations. While the number of fungal genome sequencing projects continues to increase, the phylogenetic breadth of current sequencing targets is extremely limited. Exploration of phylogenetic and ecological diversity of fungi by genome sequencing is therefore a potentially rich source of valuable metabolic pathways and enzyme activities that will remain undiscovered and unexploited until a systematic survey of phylogenetically diverse genome sequences is undertaken.

Fungal projects

Fungal assembly challenges

-Small amount of gDNA => poor quality libraries and too few libraries (several different paired-end (PE) libraries are needed, and these are hard to make with new sequencing protocols)

-Large data sets

-Polyploidy

- Large variety of repeats (lengths, complexity)

Figure 2 Model of the evolution of the N. tetrasperma mat A mating-type chromosome. The order of rearrangement events is shown in A and begins with the ancestral mat A chromosome (1) which was collinear with mat a and the mating-type chromosome of N. crassa. The 1.2-Mb inversion occurred first and produced the orientation in 2. This event was followed relatively quickly by the 5.3-Mb inversion (3). The 68-kb inversion, shown as the line at the far right of B, occurred much later to produce the current arrangement of the mat A chromosome (B). The 1.2-Mb inversion (breakpoints shown in red) is flanked by unique 50-bp duplications (D) that would have been in an inverted orientation before the occurrence of the large inversion, consistent with rearrangement via staggered single-strand breaks. The 5.3-Mb inversion (breakpoints shown in blue) is flanked by Mariner transposable elements (M), consistent with rearrangement via ectopic recombination. Mariner remnants were not present in either of the homologous regions in the mat a chromosome. The overlapping nature of these two inversions explains the relocated genomic region. The 68-kb inversion is flanked by a microsatellite-containing, low-complexity sequence and may have occurred due to ectopic recombination between blocks of microhomology. MAT denotes the location of the mating-type locus while CEN shows the location of the centromere.

Massive Changes in Genome Architecture Accompany the Transition to Self-Fertility in the Filamentous Fungus Neurospora tetrasperma

(Genetics. 2011 September; 189(1): 55–69.)

Single-cell approach

•Single-cell genomics is a method for amplifying DNA from single bacterial cells using Multiple Displacement Amplification (MDA)

•Only 2% of microbes can be cultured.

•Discovery of novel enzymes, new antibiotics and more

•Cancer research and clinical diagnostics

Single-cell Process

Challenges with Single-cell

•Single-cell methodology is sensitive to reagent or processing contamination from multiple displacement amplification (MDA).

•MDA produces non-uniform read coverage, posing problems with current short read assemblers.

•MDA produces chimeric reads.

•De novo assembly of complete genome sequences.

Candidatus Sulcia muelleri DMIN

Sulcia cell isolation and sequence coverage, closure and polishing locations along the Sulcia DMIN single cell genome.

(A) Micromanipulation of the single Sulcia cell from the sharpshooter bacteriome metasample. (B) Sequence coverage including closure and polishing locations along the finished, circular Sulcia DMIN genome with circles corresponding to the following features, starting with the outermost circle: (1) Illumina sequence coverage ranging from 0–3276 (mean 303 ± 386), (2) pyrosequence sequence coverage ranging from 0–231 (mean 42 ± 39), (3) Sanger sequence coverage ranging from 0–30 (mean 10 ± 7), (4) locations of captured (green) and uncaptured gaps (orange), (5) polishing locations corrected using Illumina (blue) and Sanger (purple) sequence, (6) GC content heat map (dark blue to light green = low to high values) and (7) GC skew.

Woyke T, Tighe D, Mavromatis K, Clum A, Copeland A, Schackwitz W, Lapidus A, et al. 2010 One Bacterial Cell, One Complete Genome. PLoS ONE 5(4): e10314.

Identification of MDA contaminants

Red = suspect contaminant

S. Trong, JGI

MDA coverage bias

[Figure: two k-mer frequency overlay plots. x-axis: #KmerFreq; y-axes: NumOccKmer and TotalOccs.]

Single-Cell kmer distribution

Shotgun sequencing theoretical kmer distribution; current short read assemblers expect this uniformity

Isolate kmer distribution

Woyke T, et al. 2011 PLoS ONE 6(10): e26161.
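The k-mer distributions named above can be tabulated directly from reads. The following is a minimal sketch, assuming that the plots show, for each k-mer frequency, how many distinct k-mers occur that many times (this reading of the axis labels is an assumption). The function names and toy reads are illustrative only.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(reads, k):
    """Map each frequency to the number of distinct k-mers seen that often."""
    return Counter(kmer_counts(reads, k).values())


# Hypothetical toy reads; uneven coverage spreads the spectrum over
# many frequencies instead of one peak.
toy_reads = ["AAAA", "AAAT"]
for freq, n_kmers in sorted(kmer_spectrum(toy_reads, 3).items()):
    print(freq, n_kmers)
```

With even shotgun coverage most k-mers fall near one frequency, whereas MDA-amplified single-cell data would be expected to give a much broader spectrum.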

Normalizing read coverage improves assembly!

[Figure: k-mer frequency overlay plots. x-axis: #KmerFreq; y-axes: TotalOccs and NumOccKmer.]

Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, et al. 2011 Decontamination of MDA Reagents for Single Cell Whole Genome Amplification. PLoS ONE 6(10): e26161.
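The slides do not say how read coverage was normalized. One commonly described strategy is a digital-normalization-style filter: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. The sketch below illustrates that general idea only; the function name, `k` and `cutoff` are assumptions, not the JGI method.

```python
from collections import Counter
from statistics import median


def normalize_reads(reads, k=4, cutoff=3):
    """Keep a read only while its k-mers are still under-represented.

    A read is kept if the median count of its k-mers (counted over the
    reads kept so far) is below cutoff. Reads shorter than k are skipped.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Hypothetical data: one region sequenced ten times, another only once.
toy_reads = ["ACGTACGT"] * 10 + ["TTGACCAG"]
print(normalize_reads(toy_reads, k=4, cutoff=3))
```

The over-sampled read is capped at a few copies while the rare read is retained, which flattens the coverage that the assembler sees.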

JGI’s Approach

• Develop pipeline to assemble single-cell genomes addressing contamination and read coverage problems and providing as much genome completeness as possible.

• Provide QC metrics to evaluate contaminants in the reads and assembly.

• Allpaths-LG (APLG) uses less memory and requires much less data
• For microbes and fungi: fewer contigs, better N50 numbers and better annotation when compared to a reference with Allpaths than with Velvet.
• APLG rarely has large misassemblies, whereas with Velvet you have to play with the minimum pair cutoff to make sure you don't get misassemblies.
• For single cells both Allpaths and Velvet are run and the results are merged.
• APLG is not used for metagenomes
• APLG works best with at least an overlapping standard library and a mate pair library between 3-8kb (real or fake). If you give Allpaths a mate pair library over 10kb without providing some smaller mate pair library it can get confused.
• APLG doesn't accept variable length mate pair data.
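Contig count and N50 are the comparison metrics mentioned above. As a minimal sketch of how they are computed, N50 is taken here as the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The contig lengths and assembly names below are made up for illustration and are not results from Allpaths-LG or Velvet.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths):
    """Return basic assembly statistics as a dictionary."""
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


# Hypothetical contig lengths for two assemblies of the same genome.
assemblies = {
    "assembly_a": [100, 80, 60, 40, 20],
    "assembly_b": [150, 150],
}
for name, lengths in assemblies.items():
    print(name, summarize(lengths))
```

N50 alone does not detect misassemblies, so it is usually read together with contig count and a comparison to a reference where one exists.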

Allpaths-LG vs Velvet

See “Assembly and finishing” presentation for microbial assemblies and bioinformatics tools.

Microbial assemblies and finishing

Thank you!
