
UC Davis EVE161 Lecture 15 by @phylogenomics

Date post: 10-May-2015
Upload: jonathan-eisen
Description:
Slides for Lecture 15 in EVE 161 Course by Jonathan Eisen at UC Davis
Transcript
Page 1: UC Davis EVE161 Lecture 15 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

EVE 161: Microbial Phylogenomics

Lecture #15:

Era IV: Shotgun Metagenomics

UC Davis, Winter 2014
Instructor: Jonathan Eisen

Page 2: UC Davis EVE161 Lecture 15 by @phylogenomics

Where we are going and where we have been

• Previous lecture: 14: Era IV: Metagenomics

• Current lecture: 15: Era IV: Shotgun Metagenomics

• Next lecture: 16: Era IV: Function in Metagenomics

Page 3: UC Davis EVE161 Lecture 15 by @phylogenomics

Era IV: Genomes in the environment

Era IV: Shotgun Metagenomics

Page 4: UC Davis EVE161 Lecture 15 by @phylogenomics

Environmental Shotgun Sequencing

• ESS first applied to endosymbiont genomes

• Endosymbionts relatively clonal within one host and even within one species sometimes

• Buchnera genome sequenced with ESS

• Many others too

Page 5: UC Davis EVE161 Lecture 15 by @phylogenomics

Wolbachia Metagenomic Sequencing

[Figure labels: shotgun; sequence]

Page 6: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 7: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 8: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 9: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 10: UC Davis EVE161 Lecture 15 by @phylogenomics

Wu et al., 2004. Collaboration between Jonathan Eisen and Scott O’Neill (Yale, U. Queensland).

Wolbachia pipientis wMel

Page 11: UC Davis EVE161 Lecture 15 by @phylogenomics

Community structure and metabolism through reconstruction of microbial genomes from the environment

Gene W. Tyson¹, Jarrod Chapman³,⁴, Philip Hugenholtz¹, Eric E. Allen¹, Rachna J. Ram¹, Paul M. Richardson⁴, Victor V. Solovyev⁴, Edward M. Rubin⁴, Daniel S. Rokhsar³,⁴ & Jillian F. Banfield¹,²

¹Department of Environmental Science, Policy and Management, ²Department of Earth and Planetary Sciences, and ³Department of Physics, University of California, Berkeley, California 94720, USA
⁴Joint Genome Institute, Walnut Creek, California 94598, USA

Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level. The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme environment.

The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis¹–³. However, isolates have been the main source of sequence data, and only a small fraction of microorganisms have been cultivated⁴–⁶. Consequently, focus has shifted towards the analysis of uncultivated microorganisms via cloning of conserved genes⁵ and genome fragments directly from the environment⁷–⁹. To date, only a small fraction of genes have been recovered from individual environments, limiting the analysis of microbial communities as networks characterized by symbioses, competition and partitioning of community-essential roles. Comprehensive genomic data would resolve organism-specific pathways and provide insights into population structure, speciation and evolution. So far, sequencing of whole communities has not been practical because most communities comprise hundreds to thousands of species¹⁰.

Acid mine drainage (AMD) is a worldwide environmental problem that arises largely from microbial activity¹¹. Here, we focused on a low-complexity AMD microbial biofilm growing hundreds of feet underground within a pyrite (FeS₂) ore body¹²–¹⁵. This represents a self-contained biogeochemical system characterized by tight coupling between microbial iron oxidation and acidification due to pyrite dissolution¹¹,¹⁶,¹⁷. Random shotgun sequencing of DNA from entire microbial communities is one approach for the recovery of the gene complement of uncultivated organisms, and for determining the degree of variability within populations at the genome level. We used random shotgun sequencing of the biofilm to obtain the first reconstruction of multiple genomes directly from a natural sample. The results provide novel insights into community structure, and reveal the strategies that underpin microbial activity in this environment.

Initial characterization of the biofilm

Biofilms growing on the surface of flowing AMD in the five-way region of the Richmond mine at Iron Mountain, California¹², were sampled in March 2000. Screening using group-specific¹⁸ fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other members of the Thermoplasmatales). The genome of one of these archaea, Ferroplasma acidarmanus fer1, isolated from the Richmond mine, has been sequenced previously (http://www.jgi.doe.gov/JGI_microbial/html/ferroplasma/ferro_homepage.html).

A pink biofilm (Fig. 1a) typical of AMD communities was selected for detailed genomic characterization (see Supplementary Information). The biofilm was dominated by Leptospirillum species and contained F. acidarmanus at a relatively low abundance (Fig. 1b, c). This biofilm was growing in pH 0.83, 42 °C, 317 mM Fe, 14 mM Zn, 4 mM Cu and 2 mM As solution, and was collected from a surface area of approximately 0.05 m².

A 16S ribosomal RNA gene clone library was constructed from DNA extracted from the pink biofilm, and 384 clones were end-sequenced (see Supplementary Information). Results indicated the presence of three bacterial and three archaeal lineages. The most abundant clones are close relatives of L. ferriphilum¹⁹ and belong to Leptospirillum group II (ref. 13). Although 94% of the Leptospirillum group II clones were identical, 17 minor variants were detected with up to 1.2% 16S rRNA gene-sequence divergence from the dominant type. Tightly defined groups (up to 1% sequence divergence) related to Leptospirillum group III (ref. 13), Sulfobacillus, Ferroplasma (some identical to fer1), ‘A-plasma’¹⁵ and ‘G-plasma’¹⁵ were also detected. Leptospirillum group III, G-plasma and A-plasma have only recently been detected in culture-independent molecular surveys. FISH-based quantification (Fig. 1c; see also Supplementary Information) confirmed the dominance of Leptospirillum group II in the biofilm.

Community genome sequencing and assembly

In conventional shotgun sequencing projects of microbial isolates, all shotgun fragments are derived from clones of the same genome. When using the shotgun sequencing approach on genomes from an

NATURE | doi:10.1038/nature02340 | www.nature.com/nature | © 2004 Nature Publishing Group

Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter¹*, Karin Remington¹, John F. Heidelberg³, Aaron L. Halpern², Doug Rusch², Jonathan A. Eisen³, Dongying Wu³, Ian Paulsen³, Karen E. Nelson³, William Nelson³, Derrick E. Fouts³, Samuel Levy², Anthony H. Knap⁶, Michael W. Lomas⁶, Ken Nealson⁵, Owen White³, Jeremy Peterson³, Jeff Hoffman¹, Rachel Parsons⁶, Holly Baden-Tillson¹, Cynthia Pfannkoch¹, Yu-Hui Rogers⁴, Hamilton O. Smith¹

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).

Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 µm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.

Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial

¹The Institute for Biological Energy Alternatives, ²The Center for the Advancement of Genomics, 1901 Research Boulevard, Rockville, MD 20850, USA. ³The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. ⁴The J. Craig Venter Science Foundation Joint Technology Center, 5 Research Place, Rockville, MD 20850, USA. ⁵University of Southern California, 223 Science Hall, Los Angeles, CA 90089–0740, USA. ⁶Bermuda Biological Station for Research, Inc., 17 Biological Lane, St George GE 01, Bermuda.

*To whom correspondence should be addressed. E-mail: [email protected]

RESEARCH ARTICLE

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org 66
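The coverage argument in the Sargasso Sea excerpt can be made concrete with a short calculation. The sketch below checks the quoted read yield and then shows why a single Poisson depth threshold misfires on a mixed sample. The community composition (two 2 Mb genomes contributing 1% and 0.01% of the DNA) and the `looks_repetitive` cut-off are illustrative assumptions, not values or logic taken from the paper or from the Celera Assembler.

```python
import math

# Read count and mean read length quoted in the text (Weatherbird II samples).
WEATHERBIRD_READS = 1_660_000
MEAN_READ_LENGTH_BP = 818
TOTAL_BP = WEATHERBIRD_READS * MEAN_READ_LENGTH_BP  # roughly 1.36 Gbp


def expected_depth(total_bp: float, dna_fraction: float, genome_size_bp: float) -> float:
    """Mean depth of one genome that contributes dna_fraction of the sequenced DNA."""
    return total_bp * dna_fraction / genome_size_bp


def poisson_pmf(k: int, mean: float) -> float:
    """P(depth == k) when depth follows a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_tail(k: int, mean: float) -> float:
    """P(depth >= k) under the same Poisson model."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


def looks_repetitive(depth: int, assumed_mean: float, alpha: float = 1e-3) -> bool:
    """Hypothetical stand-in for a depth-based repeat test: flag a region whose
    depth would be very unlikely if it were unique sequence at assumed_mean."""
    return poisson_tail(depth, assumed_mean) < alpha


# Hypothetical community: both genomes are 2 Mb; abundances are made up.
abundant = expected_depth(TOTAL_BP, 0.01, 2_000_000)
rare = expected_depth(TOTAL_BP, 0.0001, 2_000_000)

print(f"total sequence: {TOTAL_BP / 1e9:.2f} Gbp")
print(f"abundant genome depth: {abundant:.1f}x, rare genome depth: {rare:.2f}x")
print("7x region flagged when the assumed mean is 1x:", looks_repetitive(7, 1.0))
```

If the mean depth is estimated across the whole mixture and comes out near 1x, unique regions of the abundant genome at about 7x are flagged as repeats, which is the paradox the excerpt describes.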

Page 12: UC Davis EVE161 Lecture 15 by @phylogenomics

Shotgun metagenomics

[Figure labels: shotgun; sequence]

Page 13: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 14: UC Davis EVE161 Lecture 15 by @phylogenomics

Acid Mine Drainage 2004

environmental sample, however, variation within each species population might complicate assembly. If intraspecies variation is dominated by limited local polymorphism or homologous recombination, it should be possible to define a composite genome for each species population. Conversely, if the genomic heterogeneity within a species is dominated by large rearrangements, deletions, or insertions, it may be impossible to define composite genomes for species populations from natural communities.

A small insert plasmid library (average insert size 3.2 kilobases (kb)) was constructed from the biofilm DNA for random shotgun sequencing (see Supplementary Information). A total of 76.2 million base pairs (bp) of DNA sequence was generated from 103,462 high-quality reads (averaging 737 bp per read). Analysis of raw shotgun data (Supplementary Figs S1–5) indicated the presence of both bacterial and archaeal genomes at sequence coverages of up to 10×, which would be sufficient to produce a high-quality assembly from a conventional microbial genome project²⁰,²¹. The shotgun data set was assembled with JAZZ, a whole-genome shotgun assembler²². Anticipating polymorphisms, we permitted alignment discrepancies beyond those expected from sequencing error if they were consistent with end-pairing constraints. Over 85% of the shotgun reads were assembled into scaffolds longer than 2 kb (a scaffold is a reconstructed genomic region that may contain gaps of a known size range). The combined length of the 1,183 scaffolds is 10.83 megabases (Mb). The assembly is internally self consistent, with 97.2% of end pairs from the same clone assembled with the appropriate orientation and separation, as expected for a low rate of mispairing error (tracking and chimaeric clones).

The first step in assignment of scaffolds to organism types was to separate the scaffolds by average G+C content. These were subsequently subdivided using read depth (coverage). Dinucleotide frequencies did not allow for further subdivision. Notably, separation of scaffolds into low G+C (<43.5%; Supplementary Fig. S3a) and high G+C (≥43.5%) content ‘bins’ was not significantly compromised by local heterogeneities in G+C content because the scaffolds were binned after assembly. As the scaffolds are typically tens of kilobases long, local fluctuations in G+C content are averaged over the length of each scaffold, allowing, in most cases (>99%), clear assignment to bins of high or low G+C content.
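The G+C binning step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the 43.5% cut-off comes from the text, while the function names and the synthetic sequences are assumptions made for the example.

```python
GC_THRESHOLD = 0.435  # 43.5 %, the cut-off quoted in the text


def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def gc_bin(seq: str) -> str:
    """Assign a whole scaffold to the 'high' or 'low' G+C bin."""
    return "high" if gc_fraction(seq) >= GC_THRESHOLD else "low"


# Synthetic scaffold: a 60 % G+C backbone with a 1 kb AT-only island in the middle.
backbone = "GGCAT" * 2000          # 10 kb at 60 % G+C
island = "AATTA" * 200             # 1 kb at 0 % G+C
scaffold = backbone + island + backbone

print(f"island alone:   {gc_fraction(island):.3f} -> {gc_bin(island)}")
print(f"whole scaffold: {gc_fraction(scaffold):.3f} -> {gc_bin(scaffold)}")
```

Binned on its own, the island would land in the low G+C bin; averaged over the 21 kb scaffold it barely moves the total, which is why binning after assembly is robust to local heterogeneity.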

The high G+C scaffolds at approximately 10× coverage (70 scaffolds up to 137 kb in length, totalling 2.23 Mb) were identified by the presence of a single 16S rRNA gene as belonging to the genome of a Leptospirillum group II species. The average G+C content (55.8%) is comparable to the G+C content (54.9–58%) of L. ferriphilum¹⁹. The total high G+C scaffold length is close to the estimated genome size of Leptospirillum ferrooxidans²³ (1.9 Mb). This suggests that essentially the entire Leptospirillum group II genome was recovered from the community DNA.

The low G+C scaffolds at approximately 10× coverage were assembled into 59 scaffolds of up to 138 kb in length, totalling 1.82 Mb. The single 16S rRNA gene identified in these scaffolds was 99% identical to that of the fer1 isolate; however, alignment of the scaffolds to the fer1 genome revealed an average of 22% divergence at the nucleotide level (Supplementary Fig. S6). The total scaffold length is close to the genome size of fer1 (1.9 Mb; Allen et al., unpublished data), and local gene order and content are highly conserved (Supplementary Fig. S7). Therefore, these 59 scaffolds represent a nearly complete genome of a previously unknown, uncultured Ferroplasma species distinct from fer1. We designate this as Ferroplasma type II. The dominance of this organism type was unexpected before the genomic analysis.

We assigned the roughly 3× coverage, high G+C scaffolds to Leptospirillum group III on the basis of rRNA markers (474 scaffolds up to 31 kb, totalling 2.66 Mb). Comparison of these scaffolds with those assigned to Leptospirillum group II indicates significant sequence divergence and only locally conserved gene order, confirming that the scaffolds belong to a relatively distant relative of Leptospirillum group II. A partial 16S rRNA gene sequence from Sulfobacillus thermosulfidooxidans was identified in the unassembled reads, suggesting very low coverage of this organism. If any Sulfobacillus scaffolds >2 kb were assembled, they would be grouped with the Leptospirillum group III scaffolds.

We compared the 3× coverage, low G+C scaffolds (580 scaffolds, 4.12 Mb) to the fer1 genome in order to assign them to organism types (Supplementary Fig. S6). Scaffolds with ≥96% nucleotide identity to fer1 were assigned to an environmental Ferroplasma type I genome (170 scaffolds up to 47 kb in length and comprising 1.48 Mb of sequence). The remaining low-coverage, low G+C scaffolds are tentatively assigned to G-plasma. The largest scaffold in this bin (62 kb) contains the G-plasma 16S rRNA gene. The 410 scaffolds assigned to G-plasma comprise 2.65 Mb of sequence. A partial 16S rRNA gene sequence from A-plasma was identified in the unassembled reads, suggesting low coverage of this organism. Any scaffolds from A-plasma >2 kb would be included in the G-plasma bin. Although eukaryotes are present in the AMD system, they were in low abundance in the biofilm studied. So far, no scaffolds from eukaryotes have been detected.
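The bin assignments in the preceding paragraphs can be tabulated and cross-checked against the assembly totals. The sketch below records the scaffold counts and lengths quoted in the text, sums them, and looks up candidate organisms for a given G+C class and approximate coverage. The `Bin` structure and the `candidates` helper are illustrative names, not part of the original analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bin:
    label: str
    gc: str           # "high" (>= 43.5 % G+C) or "low"
    coverage: int     # approximate fold coverage quoted in the text
    scaffolds: int
    length_mb: float


# Numbers as quoted in the text.
BINS = [
    Bin("Leptospirillum group II", "high", 10, 70, 2.23),
    Bin("Ferroplasma type II", "low", 10, 59, 1.82),
    Bin("Leptospirillum group III (may include Sulfobacillus)", "high", 3, 474, 2.66),
    Bin("Ferroplasma type I", "low", 3, 170, 1.48),
    Bin("G-plasma (may include A-plasma)", "low", 3, 410, 2.65),
]


def candidates(gc: str, coverage: int) -> list[str]:
    """Organism bins compatible with a scaffold's G+C class and coverage."""
    return [b.label for b in BINS if b.gc == gc and b.coverage == coverage]


total_scaffolds = sum(b.scaffolds for b in BINS)
total_mb = sum(b.length_mb for b in BINS)

print(f"scaffolds: {total_scaffolds}")       # text reports 1,183
print(f"total length: {total_mb:.2f} Mb")    # text reports 10.83 Mb
print(candidates("low", 3))
```

The scaffold counts add up exactly to the 1,183 reported for the assembly. The lengths sum to 10.84 Mb against the reported 10.83 Mb, a difference consistent with rounding of the per-bin figures. The low G+C, 3× class maps to two bins, which is why the text needs the additional ≥96% identity test against fer1 to separate them.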

As independent evidence that the Leptospirillum group II and Ferroplasma type II genomes are nearly complete, we located a full complement of transfer RNA synthetases in each genome data set. An almost complete set of these genes was also recovered from Leptospirillum group III. The G-plasma bin contains more than a full set of tRNA synthetases, consistent with inclusion of some A-plasma scaffolds. In addition, we established that the Leptospirillum group II, Leptospirillum group III, Ferroplasma type I, Ferroplasma type II and G-plasma bins contained only one set of rRNA genes.
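The completeness argument above amounts to counting an expected marker set within each bin. A rough sketch follows, assuming one synthetase per standard amino acid as the expected set. That expected list, the marker labels and the example hit lists are placeholders for illustration, since the text does not give the gene lists and real organisms do not all carry the same complement.

```python
from collections import Counter

AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]
# Placeholder expected set: one aminoacyl-tRNA synthetase label per amino acid.
EXPECTED = [f"{aa}RS" for aa in AMINO_ACIDS]


def marker_summary(expected: list[str], observed: list[str]) -> dict:
    """Summarise which expected markers a bin contains and which occur more than once."""
    counts = Counter(m for m in observed if m in expected)
    missing = [m for m in expected if counts[m] == 0]
    extra = {m: n - 1 for m, n in counts.items() if n > 1}
    return {
        "completeness": (len(expected) - len(missing)) / len(expected),
        "missing": missing,
        "extra_copies": extra,
    }


# Made-up hit lists: a clean bin, a partial bin, and a bin with duplicated markers.
clean_bin = list(EXPECTED)
partial_bin = EXPECTED[:18]
mixed_bin = list(EXPECTED) + ["AlaRS", "GlyRS", "GlyRS"]

for name, hits in [("clean", clean_bin), ("partial", partial_bin), ("mixed", mixed_bin)]:
    print(name, marker_summary(EXPECTED, hits))
```

A full set with no extra copies is what the text reports for Leptospirillum group II and Ferroplasma type II. A full set plus extra copies, as in the made-up `mixed_bin`, is the pattern the text interprets as scaffolds from a second organism sharing the bin.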

Figure 1 The pink biofilm. a, Photograph of the biofilm in the Richmond mine (hand included for scale). b, FISH image of a. Probes targeting bacteria (EUBmix; fluorescein isothiocyanate (green)) and archaea (ARC915; Cy5 (blue)) were used in combination with a probe targeting the Leptospirillum genus (LF655; Cy3 (red)). Overlap of red and green (yellow) indicates Leptospirillum cells and shows the dominance of Leptospirillum. c, Relative microbial abundances determined using quantitative FISH counts.

environmental sample, however, variation within each speciespopulation might complicate assembly. If intraspecies variation isdominated by limited local polymorphism or homologous recom-bination, it should be possible to define a composite genome foreach species population. Conversely, if the genomic heterogeneitywithin a species is dominated by large rearrangements, deletions, orinsertions, it may be impossible to define composite genomes forspecies populations from natural communities.A small insert plasmid library (average insert size 3.2 kilobases

(kb)) was constructed from the biofilm DNA for random shotgunsequencing (see Supplementary Information). A total of 76.2million base pairs (bp) of DNA sequence was generated from103,462 high-quality reads (averaging 737 bp per read). Analysisof raw shotgun data (Supplementary Figs S1–5) indicated thepresence of both bacterial and archaeal genomes at sequencecoverages of up to 10£, which would be sufficient to produce ahigh-quality assembly from a conventional microbial genomeproject20,21. The shotgun data set was assembled with JAZZ, awhole-genome shotgun assembler22. Anticipating polymorphisms,we permitted alignment discrepancies beyond those expected fromsequencing error if they were consistent with end-pairing con-straints. Over 85% of the shotgun reads were assembled intoscaffolds longer than 2 kb (a scaffold is a reconstructed genomicregion that may contain gaps of a known size range). The combinedlength of the 1,183 scaffolds is 10.83 megabases (Mb). The assemblyis internally self consistent, with 97.2% of end pairs from the sameclone assembled with the appropriate orientation and separation, asexpected for a low rate of mispairing error (tracking and chimaericclones).The first step in assignment of scaffolds to organism types was to

separate the scaffolds by average GþC content. These were sub-sequently subdivided using read depth (coverage). Dinucleotidefrequencies did not allow for further subdivision. Notably, separa-tion of scaffolds into low GþC (,43.5%; Supplementary Fig. S3a)and high GþC ($43.5%) content ‘bins’ was not significantlycompromised by local heterogeneities in GþC content becausethe scaffolds were binned after assembly. As the scaffolds aretypically tens of kilobases long, local fluctuations in GþC contentare averaged over the length of each scaffold, allowing, in most cases(.99%), clear assignment to bins of high or low GþC content.

The high G+C scaffolds at approximately 10× coverage (70 scaffolds up to 137 kb in length, totalling 2.23 Mb) were identified by the presence of a single 16S rRNA gene as belonging to the genome of a Leptospirillum group II species. The average G+C content (55.8%) is comparable to the G+C content (54.9–58%) of L. ferriphilum (ref. 19). The total high G+C scaffold length is close to the estimated genome size of Leptospirillum ferrooxidans (ref. 23; 1.9 Mb). This suggests that essentially the entire Leptospirillum group II genome was recovered from the community DNA.

The low G+C scaffolds at approximately 10× coverage were assembled into 59 scaffolds of up to 138 kb in length, totalling 1.82 Mb. The single 16S rRNA gene identified in these scaffolds was 99% identical to that of the fer1 isolate; however, alignment of the scaffolds to the fer1 genome revealed an average of 22% divergence at the nucleotide level (Supplementary Fig. S6). The total scaffold length is close to the genome size of fer1 (1.9 Mb; Allen et al., unpublished data), and local gene order and content are highly conserved (Supplementary Fig. S7). Therefore, these 59 scaffolds represent a nearly complete genome of a previously unknown, uncultured Ferroplasma species distinct from fer1. We designate this as Ferroplasma type II. The dominance of this organism type was unexpected before the genomic analysis.

We assigned the roughly 3× coverage, high G+C scaffolds to Leptospirillum group III on the basis of rRNA markers (474 scaffolds up to 31 kb, totalling 2.66 Mb). Comparison of these scaffolds with those assigned to Leptospirillum group II indicates significant sequence divergence and only locally conserved gene order, confirming that the scaffolds belong to a relatively distant relative of Leptospirillum group II. A partial 16S rRNA gene sequence from Sulfobacillus thermosulfidooxidans was identified in the unassembled reads, suggesting very low coverage of this organism. If any Sulfobacillus scaffolds >2 kb were assembled, they would be grouped with the Leptospirillum group III scaffolds.

We compared the 3× coverage, low G+C scaffolds (580 scaffolds, 4.12 Mb) to the fer1 genome in order to assign them to organism types (Supplementary Fig. S6). Scaffolds with ≥96% nucleotide identity to fer1 were assigned to an environmental Ferroplasma type I genome (170 scaffolds up to 47 kb in length and comprising 1.48 Mb of sequence). The remaining low-coverage, low G+C scaffolds are tentatively assigned to G-plasma. The largest scaffold in this bin (62 kb) contains the G-plasma 16S rRNA gene. The 410 scaffolds assigned to G-plasma comprise 2.65 Mb of sequence. A partial 16S rRNA gene sequence from A-plasma was identified in the unassembled reads, suggesting low coverage of this organism. Any scaffolds from A-plasma >2 kb would be included in the G-plasma bin. Although eukaryotes are present in the AMD system, they were in low abundance in the biofilm studied. So far, no scaffolds from eukaryotes have been detected.

As independent evidence that the Leptospirillum group II and Ferroplasma type II genomes are nearly complete, we located a full complement of transfer RNA synthetases in each genome data set. An almost complete set of these genes was also recovered from Leptospirillum group III. The G-plasma bin contains more than a full set of tRNA synthetases, consistent with inclusion of some A-plasma scaffolds. In addition, we established that the Leptospirillum group II, Leptospirillum group III, Ferroplasma type I, Ferroplasma type II and G-plasma bins contained only one set of rRNA genes.
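The completeness argument above (one synthetase per amino acid suggests a near-complete genome, extra copies suggest a mixed bin) might be sketched as a simple tally. The function name, the input format, and the toy annotation list are hypothetical. The sketch also assumes one synthetase per amino acid, which is a simplification: some organisms lack a dedicated synthetase for certain amino acids, so a missing entry alone does not prove incompleteness.

```python
from collections import Counter

# The 20 standard amino acids, used as labels for aminoacyl-tRNA synthetases.
AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]


def synthetase_report(annotated):
    """Return (missing, duplicated) synthetase specificities for one bin.

    `annotated` is a list of amino-acid labels, one per synthetase gene
    found in the bin (hypothetical input format).
    """
    counts = Counter(annotated)
    missing = [aa for aa in AMINO_ACIDS if counts[aa] == 0]
    duplicated = [aa for aa in AMINO_ACIDS if counts[aa] > 1]
    return missing, duplicated


# Toy bin: everything except Gln, plus a second Leu synthetase (made-up data).
toy_bin = [aa for aa in AMINO_ACIDS if aa != "Gln"] + ["Leu"]
missing, duplicated = synthetase_report(toy_bin)
print("missing:", missing)
print("duplicated:", duplicated)
```

In this framing, a bin with many duplicated entries would be a candidate for containing scaffolds from more than one organism, as the text suggests for the G-plasma bin.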

Figure 1 The pink biofilm. a, Photograph of the biofilm in the Richmond mine (hand included for scale). b, FISH image of a. Probes targeting bacteria (EUBmix; fluorescein isothiocyanate (green)) and archaea (ARC915; Cy5 (blue)) were used in combination with a probe targeting the Leptospirillum genus (LF655; Cy3 (red)). Overlap of red and green (yellow) indicates Leptospirillum cells and shows the dominance of Leptospirillum. c, Relative microbial abundances determined using quantitative FISH counts.


NATURE | doi:10.1038/nature02340 | www.nature.com/nature | © 2004 Nature Publishing Group

Page 15: UC Davis EVE161 Lecture 15 by @phylogenomics


Methods

• Plasmid library

• Shotgun sequence

• Assembled

• Binning: GC content, coverage

• Potential “nearly” complete genomes: Leptospirillum group II, Ferroplasma type II. Evidence for completeness: housekeeping genes

• Annotation, population analysis

Page 16: UC Davis EVE161 Lecture 15 by @phylogenomics


These results, and the agreement between the recovered and anticipated genome sizes, confirm that dividing scaffolds by G+C content, read depth and homology to the fer1 genome is a valid means of sorting most genomes from this community data set. Methods for the analysis of genome signatures currently under development may be required for binning genome data from more complex communities (refs 24, 25).

Population structure and speciation

The biofilm sample contained approximately 10^8 Leptospirillum group II cells. Thus, the roughly 29,000 reads assembled into the Leptospirillum group II genome probably all came from different individuals. Despite this, there were no nucleotide polymorphisms in the 16S or 23S rRNA genes, or in their intergenic region. The average nucleotide polymorphism rate in the Leptospirillum group II genome is 0.08%; approximately one-third of the polymorphisms cause changes at the protein-coding level. The low incidence of nucleotide polymorphisms indicates that we have sequenced essentially a single strain from the community. Homogeneity within the Leptospirillum group II genome may reflect strong recent environmental selection for this genome type or be the result of a founder effect.
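The claim that roughly 29,000 reads drawn from about 10^8 cells came almost entirely from different individuals can be checked with a short occupancy calculation. This is a sketch under a simplifying assumption that is not in the paper: each read is treated as an independent, uniformly random draw from the cell population. Paired reads from one clone obviously share a cell, so counting clones instead of reads would be the stricter version.

```python
import math


def expected_distinct_cells(n_cells: float, n_reads: int) -> float:
    """Expected number of distinct cells hit by n_reads uniform random draws.

    Uses N * (1 - (1 - 1/N)^n), written with log1p/expm1 for numerical stability.
    """
    return n_cells * -math.expm1(n_reads * math.log1p(-1.0 / n_cells))


def expected_shared_pairs(n_cells: float, n_reads: int) -> float:
    """Expected number of read pairs that happen to come from the same cell."""
    return n_reads * (n_reads - 1) / (2.0 * n_cells)


n_cells = 1e8      # approximate cell count quoted in the text
n_reads = 29_000   # approximate read count quoted in the text

distinct = expected_distinct_cells(n_cells, n_reads)
shared = expected_shared_pairs(n_cells, n_reads)
print(f"expected distinct cells sampled: {distinct:.1f}")
print(f"expected read pairs sharing a cell: {shared:.2f}")
```

Under this model only a handful of reads are expected to share a source cell, which is consistent with the text's statement that the reads "probably all" derive from different individuals.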

There are no nucleotide polymorphisms in the rRNA genes of Ferroplasma type II, despite an average polymorphism rate of about 2.2% (see Supplementary Information). Polymorphism-free regions typically a few hundred base pairs long and up to several kilobases in length occur one to tens of kilobases apart on the Ferroplasma type II scaffolds (see Supplementary Information). Although these homogenous regions may contain conserved genes, there is no consistency in the function of proteins encoded by them (for example, Fig. 2a; see also Supplementary Fig. S8).

In any given region, typically between one and three distinct patterns of nucleotide polymorphism were observed in the assembled Ferroplasma type II composite genome (Fig. 2b). By using sequence read end-pair information, these nucleotide polymorphism patterns could be connected across homogenous regions. We could identify points within individual reads and between end pairs where one distinct nucleotide polymorphism pattern transitioned into another (Fig. 2b; see also Supplementary Fig. S8). The most likely explanation for this, and for the larger homogeneous regions, is that the Ferroplasma type II strains have undergone homologous recombination. It is unlikely that the reads with pattern transitions represent variants that arose simply through accumulation of nucleotide polymorphisms, because this

Figure 2 Segment of the Ferroplasma type II composite genome. a, A 4.2-kb region showing annotated open reading frames (ORFs) (red), average read depth (blue line), and the number of nucleotide polymorphisms in the ‘green’ and ‘yellow’ relative to the ‘pink’ strain (green and yellow lines) averaged over 60-bp windows. Black dots indicate recombination sites. b, Alignment of individual reads (XYG) for a 96-bp region in a. Letters indicate nucleotide polymorphisms in the green and yellow strains relative to the pink strain. Note the recombinant sequence (XYG48207). c, Evolutionary distance tree inferred from the ancestral strain sequences in a.


Page 17: UC Davis EVE161 Lecture 15 by @phylogenomics


would require precise selection acting on small degrees of heterogeneity at virtually every locus.

Assembled reads in regions of up to 10 kb chosen at random on Ferroplasma type II scaffolds generally contained about 20 transition points, interpreted to be recombination boundaries (see Supplementary Fig. S9). If the characteristics of these regions are typical, then switching of nucleotide polymorphism patterns occurs on average about every 5 kb, and the mosaic genomes of currently existing strains were constructed via at least 400 recombination events. It is impossible to obtain a detailed representation of the strain genome structure or relative abundances of individual types because we cannot directly link polymorphism patterns over long genomic regions with small insert library data.

At the time of sampling, the Ferroplasma type II species population seemed to be dominated by strains with mosaic genomes constructed by recombination of three closely related but distinct genome types (pink, green and yellow in Fig. 2b and Supplementary Fig. S9) that we infer correspond to three ‘ancestral’ strains (Fig. 3). Other ancestral types may have existed, and may have given rise to variants that are too rare to be detected by our analysis. Combinatorial variants such as these have been observed previously in populations of enteric bacteria from disparate locations (ref. 26) but they have not been documented in archaea or environmental samples from a single location.

Phylogenetic analysis of genome fragments reconstructed for the three ancestral strains reveals that they derived from a relatively recent common ancestor (Fig. 2c). On the basis of the overall strong internal consistency within the nucleotide polymorphism patterns, we infer that the most recent evolution of the Ferroplasma type II population has been dominated by homologous recombination. If the population were to undergo further recombination (a likely scenario), three distinct ancestral strains would still be identifiable at the local scale. Therefore, it is not possible to determine how many episodes of recombination have occurred in the population.

Sequences reconstructed for the three ancestral types (Fig. 2b) were used to calculate relative polymorphism frequencies (Fig. 2a). Most of the proteins encoded by the sequences were slightly different at the amino acid level. In one region (Supplementary Fig. S9) the nucleotide polymorphism rates were 0.6% for the green compared with pink (56% non-synonymous) and 3% for yellow compared with pink (44% non-synonymous). This suggests that recombination involving fragments carrying slightly different protein-coding sequences yields a very large number of genomic combinatorial variants (Fig. 3) with subtly different metabolic characteristics. The existence of mosaic genome types has ecological and evolutionary significance because genome diversity due to extensive recombination would ensure availability of an optimized strain when the system is perturbed, conferring resilience to the species.

The frequency of recombination between organisms decreases exponentially as they become more divergent (ref. 27). Unlike the Ferroplasma type II strains, the Ferroplasma type I and II genomes have no anomalous regions of high sequence identity, indicating that they have not undergone recent recombination (at least on length scales smaller than the scaffold size). On the basis of the low recombination rate and the separation of these genomes from each other by assembly, we predict that Ferroplasma type I and II are physiologically distinct, and thus are separate species. Recombination and assembly may provide useful genome-based criteria to separate species from strains in cases where one or both organisms are uncultivated.

The combination of a 16S rRNA-based survey with comprehensive genomic sampling provides a snapshot of the population structure. We have not yet determined how stable the community structure is, or the factors responsible for the success of the observed strains. However, an important finding is the dominance of the biofilm by a handful of distinct genome types. A few organism types within a much larger pool of rare types is typical of lognormal abundance distributions in other natural communities (refs 28–30). This may be attributed to a small number of niches within the AMD system at any one time, possibly because the ecosystem is relatively geochemically simple (for example, the dominant electron donors and acceptors are iron, sulphur and oxygen, and temperature and fluid composition cycle within a relatively narrow range over annual timescales) (refs 12, 14).

Pathways for genetic exchange

Although there is evidence for genetic exchange between Ferroplasma type II strains, it is unclear how recombination is achieved. There is also some evidence of ancient gene transfer between the Sulfolobales and members of the Thermoplasmatales, and between bacteria and archaea (data not shown). Transformation by uptake of naked DNA is unlikely, as DNA is not expected to have a significant residence time in the acid solution. There is no evidence for conjugation genes in Ferroplasma type I or II, and there is only limited evidence for transduction (some possible phage genes and integrases). We compared the sequences of the probable prophage genes in order to test whether the host range of phage in the AMD system is large enough to provide a mechanism for lateral gene transfer. Identical reverse transcriptases (LambdaSa1) occur in the G-plasma and Ferroplasma type II genomes (in very different genomic contexts), suggesting that a single phage type has recently targeted both lineages. Similarly, identical retron-type reverse transcriptases with identical adjacent transposases occur in otherwise different genomic contexts within the Leptospirillum group II and III genomes, indicating that a broad host range phage targets both of these groups.

Metabolic analysis

We recovered near-complete gene inventories for the five dominant members of the biofilm community. The data for Leptospirillum group II are particularly notable, as no genome of a Nitrospira phylum member had been sequenced previously. Here we focus on the metabolic pathways recovered in Leptospirillum group II and Ferroplasma type II (Fig. 4; see also annotation files in the Supplementary Information), and on the new insights into the ecological roles of individual members that have changed our understanding of how the community functions.

The acidophilic biofilms are self-sustaining communities that grow in the deep subsurface and receive no significant

Figure 3 Schematic diagram illustrating a diversity of mosaic genome types within the Ferroplasma type II population that are inferred to have arisen by homologous recombination between three closely related ancestral genome types (pink, yellow and green).


Page 18: UC Davis EVE161 Lecture 15 by @phylogenomics


inputs of fixed carbon or nitrogen from external sources. As with Leptospirillum group I, both Leptospirillum group II and III have the genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system contain formate hydrogenlyase complexes. These, in combination with carbon monoxide dehydrogenase, may be used for carbon fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type

Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell cartoons are shown within a biofilm that is attached to the surface of an acid mine drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.


Page 19: UC Davis EVE161 Lecture 15 by @phylogenomics


Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).

Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 µm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.

Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial

1 The Institute for Biological Energy Alternatives, 2 The Center for the Advancement of Genomics, 1901 Research Boulevard, Rockville, MD 20850, USA. 3 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 4 The J. Craig Venter Science Foundation Joint Technology Center, 5 Research Place, Rockville, MD 20850, USA. 5 University of Southern California, 223 Science Hall, Los Angeles, CA 90089–0740, USA. 6 Bermuda Biological Station for Research, Inc., 17 Biological Lane, St George GE 01, Bermuda.

*To whom correspondence should be addressed. E-mail: [email protected]

RESEARCH ARTICLE

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org 66


http://www.sciencemag.org/content/304/5667/66

Page 20: UC Davis EVE161 Lecture 15 by @phylogenomics


Sargasso Sea

assembly to identify a set of large, deeply assembling nonrepetitive contigs. This was used to set the expected coverage in unique regions (to 23×) for a final run of the assembler. This allowed the deep contigs to be treated as unique sequence when they would otherwise be labeled as repetitive. We evaluated our final assembly results in a tiered fashion, looking at well-sampled genomic regions separately from those barely sampled at our current level of sequencing.

The 1.66 million sequences from the Weatherbird II samples (table S1; samples 1 to 4; stations 3, 11, and 13), were pooled and assembled to provide a single master assembly for comparative purposes. The assembly generated 64,398 scaffolds ranging in size from 826 bp to 2.1 Mbp, containing 256 Mbp of unique sequence and spanning 400 Mbp. After assembly, there remained 217,015 paired-end reads, or “mini-scaffolds,” spanning 820.7 Mbp as well as an additional 215,038 unassembled singleton reads covering 169.9 Mbp (table S2, column 1). The Sorcerer II samples provided almost no assembly, so we consider for these samples only the 153,458 mini-scaffolds, spanning 518.4 Mbp, and the remaining 18,692 singleton reads (table S2, column 2). In total, 1.045 Gbp of nonredundant sequence was generated. The lack of overlapping reads within the unassembled set indicates that lack of additional assembly was not due to algorithmic limitations but to the relatively limited depth of sequencing coverage given the level of diversity within the sample.
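Why shallow coverage alone can prevent assembly may be easier to see with the standard idealized Poisson model of shotgun sequencing (in the spirit of Lander and Waterman). The sketch below assumes reads start uniformly at random along a genome and ignores minimum-overlap requirements; the genome sizes and read counts in the example are hypothetical and are not values from the paper, apart from the 818-bp mean read length.

```python
import math


def coverage(n_reads: int, read_len_bp: float, genome_len_bp: float) -> float:
    """Average depth of coverage c = (number of reads * read length) / genome length."""
    return n_reads * read_len_bp / genome_len_bp


def fraction_covered(c: float) -> float:
    """Expected fraction of bases covered by at least one read under a Poisson model."""
    return 1.0 - math.exp(-c)


READ_LEN = 818  # mean read length quoted for the Weatherbird II data

# Hypothetical 2-Mbp genomes sampled at very different depths (made-up read counts).
for n_reads in (250, 2_500, 7_500, 25_000):
    c = coverage(n_reads, READ_LEN, 2_000_000)
    print(f"{n_reads:>6} reads  depth {c:5.2f}x  covered {fraction_covered(c):6.1%}")
```

At depths well below 1×, most of a genome is untouched and few reads overlap at all, which matches the text's point that the unassembled reads reflect limited sequencing depth rather than a failure of the assembly algorithm.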

The whole-genome shotgun (WGS) assembly has been deposited at DDBJ/EMBL/GenBank under the project accession AACY00000000, and all traces have been deposited in a corresponding TraceDB trace archive. The version described in this paper is the first version, AACY01000000. Unlike a conventional WGS entry, we have deposited not just contigs and scaffolds but the unassembled paired singletons and individual singletons in order to accurately reflect the diversity in the sample and allow searches across the entire sample within a single database.

Genomes and large assemblies. Our analysis first focused on the well-sampled genomes by characterizing scaffolds with at least 3× coverage depth. There were 333 scaffolds comprising 2226 contigs and spanning 30.9 Mbp that met this criterion (table S3), accounting for roughly 410,000 reads, or 25% of the pooled assembly data set. From this set of well-sampled material, we were able to cluster and classify assemblies by organism; from the rare species in our sample, we used sequence similarity based methods together with computational gene finding to obtain both qualitative and quantitative estimates of genomic and functional diversity within this particular marine environment.

We employed several criteria to sort the major assembly pieces into tentative organism “bins”; these include depth of coverage, oligonucleotide frequencies (7), and similarity to previously sequenced genomes (5). With these techniques, the majority of sequence assigned to the most abundant species (16.5 Mbp of the 30.9 Mb in the main scaffolds) could be separated based on several corroborating indicators. In particular, we identified a distinct group of scaffolds representing an abundant population clearly related to Burkholderia (fig. S2) and two groups of scaffolds representing two distinct strains closely related to the published Shewanella oneidensis genome (8) (fig. S3). There is a group of scaffolds assembling at over 6× coverage that appears to represent the genome of a SAR86 (table S3). Scaffold sets representing a conglomerate of Prochlorococcus strains (Fig. 2), as well as an uncultured marine archaeon, were also identified (table S3; Fig. 3). Additionally, 10 putative mega plasmids were found in the main scaffold set, covered at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in

Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.


Page 21: UC Davis EVE161 Lecture 15 by @phylogenomics


• Sampling Protocols. Sampling on the RV Weatherbird II was done as follows: Seawater (170 liters) from stations 11 and 13 was directly filtered through a 0.8 µm Supor membrane disc filter (Pall Life Sciences) followed in series by a 0.22 µm Supor membrane disc filter (Pall Life Sciences). The sample from station 3 was pumped into a 250 L carboy prior to being filtered through the impact filters. The length of time from collection of the sample until the end of the filtration step was approximately one hour. Filters were placed in 5 ml of sucrose lysis buffer (20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl, pH 9.0) and stored in liquid nitrogen on the Weatherbird, then placed at -80°C until DNA extractions were done. Alternatively, seawater (340 liters) was collected from 5 meters below the surface into a carboy, then filtered through a 0.8 µm Supor membrane disc filter (Pall Life Sciences), followed by concentration to 1 liter using a Pellicon tangential flow filtration (TFF) system (Millipore) with a 0.1 µm Durapore VVPP cartridge (Millipore); again the total time for the filtration and concentration was approximately one hour. Cells were pelleted at 10,000 rpm, 4°C for 30 minutes. The impact filters and the retentate from the TFF were then handled as described above. The carboys, tubing and filter systems were cleaned with a 10% hydrochloric acid wash prior to each leg of the sampling. Any of the sampling equipment (tubing, etc.) that could reasonably be soaked was soaked in an acid bath for at least 24 hours. Sampling carboys were filled with the acid wash and “soaked” for at least 24 hours as well. All acid-washed items were subsequently rinsed very liberally with Milli-Q water. A liberal Milli-Q water rinse was also conducted between samples on the same leg. All spigots from the carboys were covered with a ziploc bag until needed. Tubing was stored in clean ziploc bags until needed.

Page 22: UC Davis EVE161 Lecture 15 by @phylogenomics


Sample preparation. The impact filters were cut into quarters and placed in individual 50 ml conical tubes. TE buffer (5 ml, pH 8) containing 150 µg/ml lysozyme was added to each tube. The tubes were incubated at 37°C for 2 hours. SDS was added to 0.1% and the samples were then put through three freeze/thaw cycles. The lysate was then treated with Proteinase K (100 µg/ml) for one hour at 55°C followed by three aqueous phenol extractions and one extraction with phenol/chloroform. The supernatant was then precipitated with two volumes of 100% ethanol and the DNA pellet washed with 70% ethanol.

Page 23: UC Davis EVE161 Lecture 15 by @phylogenomics

DNA preparation. DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point agarose. After ligation to BstXI adapters (Invitrogen, catalog no. N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3'-CACA overhangs, were inserted into BstXI-linearized plasmid vector with 3'-TGTG overhangs. Fragments were cloned in a medium-copy pBR322 derivative.

Page 24: UC Davis EVE161 Lecture 15 by @phylogenomics

Sequence assembly. With default parameter settings, the highly covered genome sequences would have been treated as repetitive DNA by the Celera Assembler. Since the Celera Assembler constructs scaffolds only from a backbone of sequence heuristically classified as unique, these organisms would not have been eligible for scaffolding and would have been absent from the final assembly. However, by tuning the threshold parameter for classifying unique sequence, we were able to compensate for the apparent repetitiveness of these genomic regions, and scaffold them appropriately. This was accomplished by identifying the most deeply assembling, obviously non-repetitive contigs in an initial run of the assembler (in this case, the strong assemblies at 21-36x coverage which were identified as gene-rich Burkholderia-like and plasmid scaffolds), and using a value slightly below the calculated “A-statistic” (an empirical uniqueness measure within the Assembler) of these contigs as the threshold parameter in a subsequent run. This allows the deep contigs to be treated as unique sequence, when they would otherwise be labeled as repetitive. At the other end of the spectrum, rare organisms in the sample have been sampled by sequencing only to a shallow depth of coverage. Routine assembly would not have considered the small, fragment-overlap-based assemblies with shallow coverage as an eligible basis for scaffolding, due to a minimum length requirement of 1000 bp, which is typically in place for efficiency. Therefore, in the present use case, the organisms represented by these sequences would not have been ordered and oriented with mate-pairs without adjusting the default minimum length to compensate for the low anticipated coverage depth and assembly length. With this selection of parameters, more suitable to the environmental project at hand, we were able to adequately assemble both the dominant and rare species simultaneously.
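The threshold tuning described above can be illustrated with a minimal sketch. The slides do not give the formula for the A-statistic; the log-odds form used here (A = rho * F/G - k * ln 2, for a contig with k reads spanning rho bases, given a global arrival rate of F reads over G bases) is an assumption based on how the statistic is commonly described for the Celera Assembler. The function names, the placeholder default threshold, and all numbers are hypothetical.

```python
import math

def a_statistic(rho, k, reads_per_base):
    """Log-odds that a contig is unique rather than a collapsed two-copy repeat.

    rho            -- bases spanned by the contig's reads
    k              -- number of reads in the contig
    reads_per_base -- global read arrival rate F/G (assumed to be estimated elsewhere)
    """
    return rho * reads_per_base - k * math.log(2)

def pick_threshold(trusted_contigs, reads_per_base, margin=1.0):
    """Return a threshold slightly below the lowest score among trusted contigs."""
    scores = [a_statistic(rho, k, reads_per_base) for rho, k in trusted_contigs]
    return min(scores) - margin

# Hypothetical values: the global rate is dominated by the shallowly sampled majority.
RATE = 0.002                                # reads per base
DEEP = [(50_000, 1_500), (80_000, 2_300)]   # (rho, k) of deep, clearly non-repetitive contigs
SHALLOW = (20_000, 45)                      # (rho, k) of a shallowly covered contig
ASSUMED_DEFAULT = 0.0                       # placeholder, not the assembler's real default

tuned = pick_threshold(DEEP, RATE)
for rho, k in DEEP + [SHALLOW]:
    score = a_statistic(rho, k, RATE)
    print(f"rho={rho} k={k} A={score:.1f} "
          f"default={'unique' if score >= ASSUMED_DEFAULT else 'repeat'} "
          f"tuned={'unique' if score >= tuned else 'repeat'}")
```

In this toy setting the deep contigs carry far more reads than the global rate predicts, so their scores are strongly negative and a fixed threshold labels them as repeats; calibrating the threshold on contigs already judged non-repetitive lets them serve as scaffold backbone.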

Page 25: UC Davis EVE161 Lecture 15 by @phylogenomics

Methods

• Plasmid library

• Shotgun sequence

• Assembled

• No Major Binning

• Potential “nearly” complete genomes

• Annotation, population analysis, phylogenetic analysis

Page 26: UC Davis EVE161 Lecture 15 by @phylogenomics

assembly to identify a set of large, deeply assembling nonrepetitive contigs. This was used to set the expected coverage in unique regions (to 23×) for a final run of the assembler. This allowed the deep contigs to be treated as unique sequence when they would otherwise be labeled as repetitive. We evaluated our final assembly results in a tiered fashion, looking at well-sampled genomic regions separately from those barely sampled at our current level of sequencing.

The 1.66 million sequences from the Weatherbird II samples (table S1; samples 1 to 4; stations 3, 11, and 13) were pooled and assembled to provide a single master assembly for comparative purposes. The assembly generated 64,398 scaffolds ranging in size from 826 bp to 2.1 Mbp, containing 256 Mbp of unique sequence and spanning 400 Mbp. After assembly, there remained 217,015 paired-end reads, or “mini-scaffolds,” spanning 820.7 Mbp as well as an additional 215,038 unassembled singleton reads covering 169.9 Mbp (table S2, column 1). The Sorcerer II samples provided almost no assembly, so we consider for these samples only the 153,458 mini-scaffolds, spanning 518.4 Mbp, and the remaining 18,692 singleton reads (table S2, column 2). In total, 1.045 Gbp of nonredundant sequence was generated. The lack of overlapping reads within the unassembled set indicates that lack of additional assembly was not due to algorithmic limitations but to the relatively limited depth of sequencing coverage given the level of diversity within the sample.

The whole-genome shotgun (WGS) assembly has been deposited at DDBJ/EMBL/GenBank under the project accession AACY00000000, and all traces have been deposited in a corresponding TraceDB trace archive. The version described in this paper is the first version, AACY01000000. Unlike a conventional WGS entry, we have deposited not just contigs and scaffolds but the unassembled paired singletons and individual singletons in order to accurately reflect the diversity in the sample and allow searches across the entire sample within a single database.

Genomes and large assemblies. Our analysis first focused on the well-sampled genomes by characterizing scaffolds with at least 3× coverage depth. There were 333 scaffolds comprising 2226 contigs and spanning 30.9 Mbp that met this criterion (table S3), accounting for roughly 410,000 reads, or 25% of the pooled assembly data set. From this set of well-sampled material, we were able to cluster and classify assemblies by organism; from the rare species in our sample, we used sequence similarity based methods together with computational gene finding to obtain both qualitative and quantitative estimates of genomic and functional diversity within this particular marine environment.

We employed several criteria to sort the major assembly pieces into tentative organism “bins”; these include depth of coverage, oligonucleotide frequencies (7), and similarity to previously sequenced genomes (5). With these techniques, the majority of sequence assigned to the most abundant species (16.5 Mbp of the 30.9 Mbp in the main scaffolds) could be separated based on several corroborating indicators. In particular, we identified a distinct group of scaffolds representing an abundant population clearly related to Burkholderia (fig. S2) and two groups of scaffolds representing two distinct strains closely related to the published Shewanella oneidensis genome (8) (fig. S3). There is a group of scaffolds assembling at over 6× coverage that appears to represent the genome of a SAR86 (table S3). Scaffold sets representing a conglomerate of Prochlorococcus strains (Fig. 2), as well as an uncultured marine archaeon, were also identified (table S3; Fig. 3). Additionally, 10 putative mega plasmids were found in the main scaffold set, covered at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in

Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.

RESEARCH ARTICLE

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 67

http://www.sciencemag.org/content/304/5667/66

Page 27: UC Davis EVE161 Lecture 15 by @phylogenomics

Fig. 4). Other organisms were not so readily separated, presumably reflecting some combination of shorter assemblies with less “taxonomic signal,” less distinctive sequence, and greater divergence from previously sequenced genomes (9).

Discrete species versus a population continuum. The most deeply covered of the scaffolds (21 scaffolds with over 14× coverage and 9.35 Mb of sequence) contain just over 1 single nucleotide polymorphism (SNP) per 10,000 base pairs, strongly supporting the presence of discrete species within the sample. In the remaining main scaffolds (table S3), the SNP rate ranges from 0 to 26 per 1000 bp, with a length-weighted average of 3.6 per 1000 bp. We closely examined the multiple sequence alignments of the contigs with high SNP rates and were able to classify these into two fairly distinct classes: regions where several closely related haplotypes have been collapsed, increasing the depth of coverage accordingly (10), and regions that appear to be a relatively homogenous blend of discrepancies from the consensus without any apparent separation into haplotypes, such as the Prochlorococcus scaffold region (Fig. 5). Indeed, the Prochlorococcus scaffolds display considerable heterogeneity not only at the nucleotide sequence level (Fig. 5) but also at the genomic level, where multiple scaffolds align with the same region of the MED4 (11) genome but differ due to gene or genomic island insertion, deletion, or rearrangement events. This observation is consistent with previous findings (12). For instance, scaffolds 2221918 and 2223700 share gene synteny with each other and MED4 but differ by the insertion of 15 genes of probable phage origin, likely representing an integrated bacteriophage. These genomic differences are displayed graphically in Fig. 2, where it is evident that up to four conflicting scaffolds can align with the same region of the MED4 genome. More than 85% of the Prochlorococcus MED4 genome can be aligned with Sargasso Sea scaffolds greater than 10 kb; however, there appear to be a couple of regions of MED4 that are not represented in the 10-kb scaffolds (Fig. 2). The larger of these two regions (PMM1187 to PMM1277) consists primarily of a gene cluster coding for surface polysaccharide biosynthesis, which may represent a MED4-specific polysaccharide absent or highly diverged in our Sargasso Sea Prochlorococcus bacteria. The heterogeneity of the Prochlorococcus scaffolds suggests that the scaffolds are not derived from a single discrete strain, but instead probably represent a conglomerate assembled from a population of closely related Prochlorococcus biotypes.

The gene complement of the Sargasso. The heterogeneity of the Sargasso sequences complicates the identification of microbial genes. The typical approach for microbial annotation, model-based gene finding, relies entirely on training with a subset of manually

Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.

Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward direction are shown in the outer concentric circle; reverse coding genes are shown in the inner concentric circle. The genes have been given role category assignment and colored accordingly: amino acid biosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cell envelope, light green; cellular processes, red; central intermediary metabolism, brown; DNA metabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta; protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange; regulatory functions and signal transduction, olive; transcription, dark green; transport and binding proteins, blue-green; genes with no known homology to other proteins and genes with homology to genes with no known function, white; genes of unknown function, gray. Tick marks are placed on 10-kb intervals.
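The length-weighted average SNP rate quoted in the text on this slide (3.6 per 1000 bp) is the kind of figure computed in the minimal sketch below. The scaffold lengths and SNP counts are hypothetical values chosen for illustration, not data from table S3.

```python
def snp_rate_per_kb(snp_count, length_bp):
    """SNPs per 1000 bp for a single scaffold."""
    return 1000.0 * snp_count / length_bp

def length_weighted_mean_rate(scaffolds):
    """Average the per-scaffold rates, weighting each by scaffold length.

    This is equivalent to total SNPs divided by total length, per 1000 bp.
    """
    total_length = sum(length for _, length in scaffolds)
    weighted = sum(snp_rate_per_kb(snps, length) * length
                   for snps, length in scaffolds)
    return weighted / total_length

# Hypothetical (snp_count, length_bp) pairs with rates of 0, 2 and 26 per 1000 bp
scaffolds = [(0, 40_000), (120, 60_000), (520, 20_000)]
print(round(length_weighted_mean_rate(scaffolds), 2))
```

Weighting by length keeps a short, highly polymorphic scaffold from dominating the average, which an unweighted mean of the three rates would allow.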

http://www.sciencemag.org/content/304/5667/66

Page 28: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 29: UC Davis EVE161 Lecture 15 by @phylogenomics

identified and curated genes. With the vast majority of the Sargasso sequence in short (less than 10 kb), unassociated scaffolds and singletons from hundreds of different organisms, it is impractical to apply this approach. Instead, we developed an evidence-based gene finder (5). Briefly, evidence in the form of protein alignments to sequences in the bacterial portion of the nonredundant amino acid (nraa) data set (13) was used to determine the most likely coding frame. Likewise, approximate start and stop positions were determined from the bounding coordinates of the alignments and refined to identify specific start and stop codons. This approach identified 1,214,207 genes covering over 700 MB of the total data set. This represents approximately an order of magnitude more sequences than currently archived in the curated SwissProt database (14), which contains 137,885 sequence entries at the time of writing; roughly the same number of sequences as have been deposited into the uncurated REM-TrEMBL database (14) since its inception in 1996. After excluding all intervals covered by previously identified genes, additional hypothetical genes were identified on the basis of the presence of conserved open reading frames (5). A total of 69,901 novel genes belonging to 15,601 single link clusters were identified. The predicted genes were categorized

Fig. 5. Prochlorococcus-related scaffold 2223290 illustrates the assembly of a broad community of closely related organisms, distinctly nonpunctate in nature. The image represents (A) global structure of Scaffold 2223290 with respect to assembly and (B) a sample of the multiple sequence alignment. Blue segments, contigs; green segments, fragments; and yellow segments, stages of the assembly of fragments into the resulting contigs. The yellow bars indicate that fragments were initially assembled in several different pieces, which in places collapsed to form the final contig structure. The multiple sequence alignment for this region shows a homogenous blend of haplotypes, none with sufficient depth of coverage to provide a separate assembly.

Table 1. Gene count breakdown by TIGR role category. Gene set includes those found on assemblies from samples 1 to 4 and fragment reads from samples 5 to 7. A more detailed table, separating Weatherbird II samples from the Sorcerer II samples, is presented in the SOM (table S4). Note that there are 28,023 genes which were classified in more than one role category.

TIGR role category                                          Total genes
Amino acid biosynthesis                                     37,118
Biosynthesis of cofactors, prosthetic groups, and carriers  25,905
Cell envelope                                               27,883
Cellular processes                                          17,260
Central intermediary metabolism                             13,639
DNA metabolism                                              25,346
Energy metabolism                                           69,718
Fatty acid and phospholipid metabolism                      18,558
Mobile and extrachromosomal element functions               1,061
Protein fate                                                28,768
Protein synthesis                                           48,012
Purines, pyrimidines, nucleosides, and nucleotides          19,912
Regulatory functions                                        8,392
Signal transduction                                         4,817
Transcription                                               12,756
Transport and binding proteins                              49,185
Unknown function                                            38,067
Miscellaneous                                               1,864
Conserved hypothetical                                      794,061

Total number of roles assigned                              1,242,230
Total number of genes                                       1,214,207
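The evidence-based gene finding described in the text on this slide (a protein alignment fixes the coding frame and approximate bounds, which are then refined to specific start and stop codons) might look roughly like the sketch below. This is not the authors' gene finder: the function name, the choice of start codons, the rule of keeping the farthest upstream start, and the forward-strand-only handling are all assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons

def refine_gene(seq, aln_start, aln_end):
    """Extend an in-frame alignment interval to start and stop codons.

    seq                -- forward-strand DNA, uppercase
    aln_start, aln_end -- 0-based, end-exclusive alignment bounds; aln_start
                          is taken to lie on a codon boundary (sets the frame)
    Returns (gene_start, gene_end), or None if no downstream stop is found.
    If no start codon is found upstream, gene_start stays at aln_start.
    """
    # Trim the interval to a whole number of codons.
    aln_end = aln_start + 3 * ((aln_end - aln_start) // 3)

    # Walk downstream codon by codon until the first in-frame stop codon.
    gene_end = None
    for pos in range(aln_end, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOPS:
            gene_end = pos + 3  # include the stop codon
            break
    if gene_end is None:
        return None

    # Walk upstream until the previous in-frame stop, remembering the
    # farthest start codon seen along the way.
    gene_start = aln_start
    for pos in range(aln_start - 3, -1, -3):
        codon = seq[pos:pos + 3]
        if codon in STOPS:
            break
        if codon in STARTS:
            gene_start = pos
    return gene_start, gene_end

# Toy sequence: CC TAA ATG AAA CCC GGG TAA CC, with an alignment covering AAA CCC
toy = "CCTAAATGAAACCCGGGTAACC"
print(refine_gene(toy, 8, 14))
```

On short reads and mini-scaffolds many genes run off the end of the sequence, so a real implementation would also need to report partial genes rather than returning None.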

http://www.sciencemag.org/content/304/5667/66

Page 30: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 31: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 32: UC Davis EVE161 Lecture 15 by @phylogenomics

rRNA phylotyping from metagenomics

http://www.sciencemag.org/content/304/5667/66

Page 33: UC Davis EVE161 Lecture 15 by @phylogenomics

Shotgun Sequencing Allows Alternative Anchors (e.g., RecA)

http://www.sciencemag.org/content/304/5667/66

Page 34: UC Davis EVE161 Lecture 15 by @phylogenomics

using the curated TIGR role categories (5). A breakdown of predicted genes by category is given in Table 1.

The samples analyzed here represent only specific size fractions of the sampled environment, dictated by the pore size of the collection filters. By our selection of filter pore sizes, we deliberately focused this initial study on the identification and analysis of microbial organisms. However, we did examine the data for the presence of eukaryotic content as well. Although the bulk of known protists are 10 µm and larger, there are some known in the range of 1 to 1.5 µm in diameter [for example, Ostreococcus tauri (15) and the Bolidomonas species (16)], and such organisms could potentially work their way through a 0.80 µm prefilter. An initial screening for 18S ribosomal RNA (rRNA), a commonly used eukaryotic marker, identified 69 18S rRNA genes, with 63 of these on singletons and the remaining 6 on very small, low-coverage assemblies. These 18S rRNAs are similar to uncultured marine eukaryotes and are indicative of a eukaryotic presence but inconclusive on their own. Because bacterial DNA contains a much greater density of genes than eukaryotic DNA, the relative proportion of gene content can be used as another indicator to distinguish eukaryotic material in our sample. An inverse relation was observed between the pore size of the pre-filters and collection filters and the fraction of sequence coding for genes (table S5). This relation, together with the presence of 18S rRNA genes in the samples, is strong evidence that eukaryotic material was indeed captured.

Diversity and species richness. Most phylogenetic surveys of uncultured organisms have been based on studies of rRNA genes using polymerase chain reaction (PCR) with primers for highly conserved positions in those genes. More than 60,000 small subunit rRNA sequences from a wide diversity of prokaryotic taxa have been reported (17). However, PCR-based studies are inherently biased, because not all rRNA genes amplify with the same “universal” primers. Within our shotgun sequence data and assemblies, we identified 1164 distinct small subunit rRNA genes or fragments of genes in the Weatherbird II assemblies and another 248 within the Sorcerer II reads (5). Using a 97% sequence similarity cutoff to distinguish unique phylotypes, we identified 148 previously unknown phylotypes in our sample when compared against the RDP II database (17). With a 99% similarity cutoff, this number increases to 643. Though sequence similarity is not necessarily an accurate predictor of functional conservation and sequence divergence does not universally correlate with the biological notion of “species,” defining species (also known as phylotypes) by sequence similarity within the rRNA genes is the accepted standard in studies of uncultured microbes. All sampled rRNAs were then assigned to taxonomic groups using an automated rRNA classification program (5). Our samples are dominated by rRNA genes from Proteobacteria (primarily members of the α, β, and γ subgroups) with moderate contributions from Firmicutes (low-GC Gram positive), Cyanobacteria, and species in the CFB phyla (Cytophaga, Flavobacterium, and Bacteroides) (fig. S4A; Fig. 6). The patterns we see are similar in broad outline to those observed by rRNA PCR studies from the Sargasso Sea (18), but with some quantitative differences that reflect either biases in PCR studies or differences in the species found in our sample versus those in other studies.
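The effect of the similarity cutoff on phylotype counts (148 previously unknown phylotypes at 97%, 643 at 99%) can be seen in a toy example. The sketch below assumes sequences are already aligned to equal length and uses simple greedy clustering; both are simplifying assumptions, the helper names are hypothetical, and this is not the procedure used in the paper.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_phylotypes(seqs, cutoff):
    """Keep a sequence as a new representative unless it matches an
    existing representative at or above the cutoff."""
    reps = []
    for s in seqs:
        if not any(identity(s, r) >= cutoff for r in reps):
            reps.append(s)
    return reps

def novel_phylotypes(reps, reference, cutoff):
    """Representatives with no reference sequence at or above the cutoff."""
    return [r for r in reps
            if not any(identity(r, ref) >= cutoff for ref in reference)]

def mutate(seq, positions):
    """Toy-data helper: substitute the base at each listed position."""
    chars = list(seq)
    for i in positions:
        chars[i] = "A" if chars[i] != "A" else "C"
    return "".join(chars)

reference = "ACGT" * 25                               # toy 100-nt database entry
sample = [reference,
          mutate(reference, [10, 50]),                # 98% identical to reference
          mutate(reference, [5, 25, 45, 65, 85])]     # 95% identical to reference

for cutoff in (0.97, 0.99):
    reps = greedy_phylotypes(sample, cutoff)
    novel = novel_phylotypes(reps, [reference], cutoff)
    print(cutoff, len(reps), len(novel))
```

With the stricter cutoff the 98%-identical variant no longer merges with the reference, so both the number of phylotypes and the number counted as novel go up, mirroring the direction of the change reported in the text.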

An additional disadvantage associated with relying on rRNA for estimates of species diversity and abundance is the varying number of copies of rRNA genes between taxa (more than an order of magnitude among prokaryotes) (19). Therefore, we constructed phylogenetic trees (fig. S4, B to E) using other represented phylogenetic markers found in our data set [RecA/RadA, heat shock protein 70 (HSP70), elongation factor Tu (EF-Tu), and elongation factor G (EF-G)]. Each marker gene interval in our data set (with a minimum length of 75 amino acids) was assigned to a putative taxonomic group using the phylogenetic analysis described for rRNA. For example, our data set contains over 600 recA homologs from throughout the bacterial phylogeny, including representatives of Proteobacteria, low- and high-GC Gram positives, Cyanobacteria, green sulfur and green nonsulfur bacteria, and other groups. Assignment to phylogenetic groups shows a broad consensus among the different phylogenetic markers. For most taxa, the rRNA-based proportion is the highest or lowest in comparison to the other markers. We believe this is due to the large amount of variation in copy number of rRNA genes between species. For example, the rRNA-based estimate of the proportion of γ-Proteobacteria is the highest, while the estimate for cyanobacteria is the lowest, which is consistent with the reports that members of the γ-Proteobacteria frequently have more than five rRNA operon copies, whereas cyanobacteria frequently have fewer than three (19).
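The copy-number bias described above can be made concrete with a small worked example. The counts and copy numbers below are hypothetical, chosen only to respect the text's "more than five" and "fewer than three". The paper sidesteps the problem by using additional marker genes; the explicit correction shown here is an illustration of the bias, not the authors' method.

```python
def proportions(counts):
    """Normalize a mapping of taxon -> count so the values sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def copy_number_corrected(rrna_counts, copies_per_genome):
    """Convert rRNA gene counts to genome equivalents, then renormalize."""
    genomes = {taxon: n / copies_per_genome[taxon]
               for taxon, n in rrna_counts.items()}
    return proportions(genomes)

# Hypothetical rRNA gene counts and rRNA operon copy numbers per genome
rrna_counts = {"gamma-Proteobacteria": 600, "Cyanobacteria": 200}
copies = {"gamma-Proteobacteria": 6, "Cyanobacteria": 2}

raw = proportions(rrna_counts)
corrected = copy_number_corrected(rrna_counts, copies)
for taxon in rrna_counts:
    print(taxon, raw[taxon], corrected[taxon])
```

Here the raw rRNA counts suggest a 3:1 ratio, while the same data correspond to equal numbers of genomes once copy number is taken into account.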

Just as phylogenetic classification is strengthened by a more comprehensive marker set, so too is the estimation of species richness. In this analysis, we define “genomic” species as a clustering of assemblies or unassembled reads more than 94% identical on the nucleotide level. This cutoff, adjusted for the protein-coding marker genes, is roughly comparable to the 97% cutoff traditionally used for rRNA. Thus

Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. The relative contribution of organisms from different major phylogenetic groups (phylotypes) was measured using multiple phylogenetic markers that have been used previously in phylogenetic studies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). The relative proportion of different phylotypes for each sequence (weighted by the depth of coverage of the contigs from which those sequences came) is shown. The phylotype distribution was determined as follows: (i) Sequences in the Sargasso data set corresponding to each of these genes were identified using HMM and BLAST searches. (ii) Phylogenetic analysis was performed for each phylogenetic marker identified in the Sargasso data separately compared with all members of that gene family in all complete genome sequences (only complete genomes were used to control for the differential sampling of these markers in GenBank). (iii) The phylogenetic affinity of each sequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.

http://www.sciencemag.org/content/304/5667/66

Page 35: UC Davis EVE161 Lecture 15 by @phylogenomics

defined, the mean number of species at the point of deepest coverage was 451 [averaged over the six genes analyzed; range 341 to 569 (Table 2)]; this serves as the most conservative estimate of species richness.

Although counts of observed species in a sample are directly obtainable, the true number of distinct species within a sample is almost certainly greater than that which can be observed by finite sequence sampling. That is, given a diverse sample at any given level of sequencing, random sampling is likely to entirely miss some subset of the species at lowest abundance. We considered three approaches to estimating the true diversity: nonparametric methods for small sample corrections (20), parametric methods assuming a log-normal distribution of species abundance (21), and a novel method based on fitting the observed depth of coverage to a theoretical model of assembly progress for a sample corresponding to a mixture of organisms at different abundances. All three methods agree on a minimum of at least 300 species per sample, with more than 1000 species in the combined sample (5). We describe in detail a model based on assembly depth of coverage (5). Assuming standard random models for shotgun sequencing, sequencing of an environmental sample should result in depths of sequence coverage reflecting a mixture of Poisson distributions. We computed the empirical distribution of coverage depth at every position in the full set of assemblies (including single fragment contigs, but not counting gaps between contigs), and compared it with hand-constructed mixtures of Poisson distributions. The three depth of coverage–based models shown in Table 3 indicate that there are at least 1800 species in the combined sample, and that a minimum of 12-fold deeper sampling would be required to obtain 95% of the unique sequence. However, these are only lower bounds. The depth of coverage modeling is consistent with as much as 80% of the assembled sequence being contributed by organisms at very low individual abundance and thus would be compatible with total diversity orders of magnitude greater than the lower bound. The assembly coverage data also implies that more than 100 Mbp of genome (i.e., probably more than 50 species) is present at coverage high enough to permit assembly of a complete or nearly complete genome were we to sequence to 5- to 10-fold greater sampling depth.

Taking the well-known marine rRNA clade SAR11 as an example, one can readily get a more tangible view of the diversity within our sample. The SAR11 rRNA group accounts for 26% of all RNA clones that have been identified by culture-independent PCR amplification of seawater (22) and has been found in nearly every pelagic marine bacterioplankton community. However, there are very few cultured representatives of this clade (23), and little is known about its metabolic diversity. In total, 89 scaffolds and 291 singletons from our data set contain a SAR11 rRNA sequence. However, even with these nearly 400 representatives of SAR11 within our sample, assembly depth of coverage ranges only from 0.94- to 2.2-fold, and the largest scaffold is quite small at 21,000 bp. This indicates a much more diverse population of organisms than previously attributed to SAR11 based on rRNA PCR methods.

Variability in species abundance. A key advantage to the random sampling approach described here is that it allows assessment of the stoichiometry (that is, estimation of the relative abundance of the dominant organisms). For example, we found that more than half of all assemblies with more than 50 fragments were from organisms that were not at equal relative abundance in all of the samples collected, suggesting widespread “patchiness”

Table 2. Diversity of ubiquitous single copy protein coding phylogenetic markers. Protein column uses symbols that identify six proteins encoded by exactly one gene in virtually all known bacteria. Sequence ID specifies the GenBank identifier for corresponding E. coli sequence. Ortholog cutoff identifies BLASTx e-value chosen to identify orthologs when querying the E. coli sequence against the complete Sargasso Sea data set. Maximum fragment depth shows the number of reads satisfying the ortholog cutoff at the point along the query for which this value is maximal. Observed “species” shows the number of distinct clusters of reads from the maximum fragment depth column, after grouping reads whose containing assemblies had an overlap of at least 40 bp with ≥ 94% nucleotide identity (single-link clustering). Singleton “species” shows the number of distinct clusters from the observed “species” column that consist of a single read. Most abundant column shows the fraction of the maximum fragment depth that consists of single largest cluster.

Protein  Sequence ID   Ortholog cutoff  Max. fragment depth  Observed “species”  Singleton “species”  Most abundant (%)
AtpD     NTL01EC03653  1e-32            836                  456                 317                  6
GyrB     NTL01EC03620  1e-11            924                  569                 429                  4
Hsp70    NT01EC0015    1e-31            812                  515                 394                  4
RecA     NTL01EC02639  1e-21            592                  341                 244                  8
RpoB     NTL01EC03885  1e-41            669                  428                 331                  7
TufA     NTL01EC03262  1e-41            597                  397                 307                  3
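The last three columns of Table 2 come from single-link clustering of reads. A minimal sketch of that step using union-find is given below. It is an illustration, not the paper's pipeline: the function names and the toy overlaps are hypothetical, and pairwise overlaps are assumed to be precomputed as (read_i, read_j, overlap_bp, identity) tuples. The thresholds (at least 40 bp, at least 94% identity) are the ones stated in the caption.

```python
def single_link_clusters(n_reads, overlaps, min_overlap=40, min_identity=0.94):
    """Group reads so that any qualifying overlap joins two clusters."""
    parent = list(range(n_reads))

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, overlap_bp, identity in overlaps:
        if overlap_bp >= min_overlap and identity >= min_identity:
            parent[find(i)] = find(j)

    clusters = {}
    for read in range(n_reads):
        clusters.setdefault(find(read), []).append(read)
    return list(clusters.values())

def summarize(clusters):
    """Derive the Table 2 style summary columns from a clustering."""
    sizes = [len(c) for c in clusters]
    return {
        "observed_species": len(sizes),
        "singleton_species": sum(1 for s in sizes if s == 1),
        "most_abundant_pct": 100.0 * max(sizes) / sum(sizes),
    }

# Hypothetical toy input: 6 reads, 4 candidate overlaps (2 fail a threshold).
toy_overlaps = [(0, 1, 120, 0.97), (1, 2, 60, 0.95), (3, 4, 30, 0.99), (4, 5, 200, 0.90)]
toy_summary = summarize(single_link_clusters(6, toy_overlaps))

# Sanity check against the text: mean of the Observed "species" column.
table2_observed = [456, 569, 515, 341, 428, 397]
mean_observed = sum(table2_observed) / len(table2_observed)
```

In the toy input, reads 0 to 2 chain into one cluster even though reads 0 and 2 never overlap directly, which is the defining property of single-link clustering. The mean of the Observed “species” column is 451, matching the figure quoted in the text.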

Table 3. Diversity models based on depth of coverage. Each row corresponds to an abundance class of organisms. The first column in each model “fr(asm)” gives the fraction of the assembly consensus modeled due to organisms at an abundance giving “Depth” coverage depth (second column) in the sample. The third column [“E(s)”] gives the fraction of such a genome expected to be sampled. The fourth column (“Genomes”) gives the resulting estimated number of genomes in the abundance class.

Model 1                             | Model 2                             | Model 3
fr(asm)  Depth  E(s)      Genomes   | fr(asm)  Depth  E(s)      Genomes   | fr(asm)  Depth  E(s)      Genomes
0.0055   25     1.00E+00  2.5       | 0.0055   25     1.00E+00  2.5       | 0.0055   25     1.00E+00  2.5
0.005    21     1.00E+00  2.3       | 0.005    21     1.00E+00  2.3       | 0.005    21     1.00E+00  2.3
0.0035   13     1.00E+00  1.6       | 0.0035   13     1.00E+00  1.6       | 0.0035   13     1.00E+00  1.6
0.004    9      1.00E+00  1.8       | 0.004    9      1.00E+00  1.8       | 0.004    9      1.00E+00  1.8
0.008    7      9.99E-01  3.6       | 0.0088   7      9.99E-01  4         | 0.0088   7      9.99E-01  4
0.0047   6      9.98E-01  2.1       | 0.0047   6      9.98E-01  2.1       | 0.0047   6      9.98E-01  2.1
0.01     4      9.82E-01  4.6       | 0.029    2.4    9.09E-01  14.4      | 0.029    2.4    9.09E-01  14.4
0.0258   2.4    9.09E-01  12.8      | 0.096    2      8.65E-01  50        | 0.097    2      8.65E-01  50.5
0.07     2      8.65E-01  36.4      | 0.0235   1      6.32E-01  16.7      | 0.0225   1      6.32E-01  16
0.8635   0.25   2.21E-01  1,756.7   | 0.06     0.5    3.93E-01  68.6      | 0.06     0.5    3.93E-01  68.6
                                    | 0.76     0.09   8.61E-02  3,973.6   | 0.66     0.124  1.17E-01  2,546.7
                                    |                                     | 0.1      0.001  1.00E-03  45,022.5
Total                     1824.4    |                           4137.6    |                           47,733
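The E(s) column in Table 3 is consistent with the standard Poisson expectation for random shotgun sampling, in which a genome at depth d has an expected fraction 1 - exp(-d) of its bases covered at least once. The sketch below reproduces Model 1 under that assumption. Note that the scale factor of 450 is not stated in this excerpt: it was back-calculated here from the table rows (Genomes ≈ 450 × fr(asm) / E(s)) and presumably reflects total assembly length divided by an assumed genome size. Function names are hypothetical.

```python
import math

def expected_fraction_sampled(depth):
    """Poisson model: probability a base is covered by at least one read."""
    return 1.0 - math.exp(-depth)

# Assumption: inferred from the rows of Table 3, not given in the text.
SCALE = 450.0

def genomes_in_class(fr_asm, depth, scale=SCALE):
    """Estimated number of genomes in one abundance class."""
    return scale * fr_asm / expected_fraction_sampled(depth)

# (fr(asm), Depth) pairs for Model 1, copied from Table 3.
model_1 = [
    (0.0055, 25), (0.005, 21), (0.0035, 13), (0.004, 9), (0.008, 7),
    (0.0047, 6), (0.01, 4), (0.0258, 2.4), (0.07, 2), (0.8635, 0.25),
]
model_1_total = sum(genomes_in_class(fr, depth) for fr, depth in model_1)
```

Under these assumptions the low-abundance class (depth 0.25, where only about 22% of each genome is expected to be sampled) accounts for nearly all of the estimated genomes, and the summed estimate lands close to the 1824.4 total reported for Model 1.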

http://www.sciencemag.org/content/304/5667/66

Page 36: UC Davis EVE161 Lecture 15 by @phylogenomics

MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea Venter et al., revised

Figure S6. Accumulation curve for rpoB. Observed (black) OTU counts for rpoB (based on the fragment grouping summarized in Table 2), as well as the Chao1-corrected estimate of total species (red; see (3)). Points are mean values of 1000 shufflings of the observed data, while bars show 90% confidence intervals.
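The two curves in Figure S6 can be sketched as follows: a mean accumulation curve over random shufflings of the reads, and a Chao1 estimate of total richness. This is a sketch using the classic Chao1 formula, S_obs + F1²/(2·F2), with the usual fallback when no doubletons are present. The excerpt does not say which variant the authors used, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def chao1(otu_labels):
    """Classic Chao1 richness estimate from per-read OTU labels."""
    counts = Counter(otu_labels)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons
    if f2 == 0:
        # Bias-corrected form avoids division by zero.
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)

def accumulation_curve(otu_labels, n_shuffles=1000, seed=0):
    """Mean number of distinct OTUs seen after each read, over random orders."""
    rng = random.Random(seed)
    labels = list(otu_labels)
    sums = [0] * len(labels)
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        seen = set()
        for i, label in enumerate(labels):
            seen.add(label)
            sums[i] += len(seen)
    return [s / n_shuffles for s in sums]

# Hypothetical toy sample: 8 reads falling into 5 OTUs.
toy_reads = ["a", "a", "a", "b", "b", "c", "d", "e"]
toy_chao1 = chao1(toy_reads)
toy_curve = accumulation_curve(toy_reads, n_shuffles=200)
```

A curve that is still rising at the last read, together with a Chao1 value well above the observed count, is the pattern that indicates under-sampling.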

http://www.sciencemag.org/content/304/5667/66

Page 37: UC Davis EVE161 Lecture 15 by @phylogenomics

Figure S7. Each point in the figure corresponds to a scaffold from the assembly (restricted to scaffolds > 10kb). Scaffolds were placed in separate panels of the figure according to the most closely related organism as indicated by the BLAST searches described in the text. Within a panel, a scaffold is shown with x coordinate equal to its length, y coordinate equal to its estimated depth of coverage, and color determined by which of 6 k-mer composition clusters it was assigned to. Depth of coverage was estimated as the total base pairs in reads belonging to a given assembly piece divided by the length of the consensus sequence for the piece. K-mer composition clusters were determined by representing each scaffold as a vector of the frequencies of all possible 4-mers, considering both the forward and reverse strands of the sequence, and then applying the K-means clustering algorithm.
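The two per-scaffold quantities described in the Figure S7 caption can be sketched in a few lines: a 4-mer frequency vector counted over both strands, and depth of coverage as total read base pairs divided by consensus length. The function names are hypothetical, and the K-means step is left out; any standard K-means implementation could be applied to the resulting 256-dimensional vectors.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 4-mers

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def tetramer_frequencies(seq):
    """Frequencies of all 4-mers, counted on forward and reverse strands."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:  # skips windows containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]

def depth_of_coverage(read_lengths, consensus_length):
    """Total base pairs in the reads divided by the consensus length."""
    return sum(read_lengths) / consensus_length
```

Counting both strands makes the vector independent of which strand of the scaffold happened to be assembled, so two scaffolds from the same genome in opposite orientations get the same composition profile.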

http://www.sciencemag.org/content/304/5667/66

Page 38: UC Davis EVE161 Lecture 15 by @phylogenomics

Functional Diversity of Proteorhodopsins?

http://www.sciencemag.org/content/304/5667/66

Page 39: UC Davis EVE161 Lecture 15 by @phylogenomics

Figure S10. Scaffold 2217664, containing the gene encoding Proteorhodopsin. Genes are colored using color assignments described in Fig. 2, and contig boundaries are indicated with red vertical lines. In this scaffold, rhodopsin is associated with a DNA-directed RNA polymerase, sigma subunit (rpoD) originating in the CFB group.

http://www.sciencemag.org/content/304/5667/66

Page 40: UC Davis EVE161 Lecture 15 by @phylogenomics

A B C D E F G

T U V W X Y Z

Binning challenge

Page 41: UC Davis EVE161 Lecture 15 by @phylogenomics

A B C D E F G

T U V W X Y Z

Binning challenge

Best binning method: reference genomes

Page 42: UC Davis EVE161 Lecture 15 by @phylogenomics

Glassy Winged Sharpshooter

• Feeds on xylem sap

• Vector for Pierce’s Disease

• Potential bioterror agent

• Collaboration with Nancy Moran to sequence symbiont genomes

• Funded by NSF

• Published in PLOS Biology 2006

Page 43: UC Davis EVE161 Lecture 15 by @phylogenomics

Wu et al. 2006 PLoS Biology 4: e188.

Page 44: UC Davis EVE161 Lecture 15 by @phylogenomics

Sharpshooter Shotgun Sequencing

shotgun

Wu et al. 2006 PLoS Biology 4: e188. Collaboration with Nancy Moran’s lab

Page 45: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 46: UC Davis EVE161 Lecture 15 by @phylogenomics

A B C D E F G

T U V W X Y Z

Binning challenge

No reference genome? What do you do? Phylogeny ....

Page 47: UC Davis EVE161 Lecture 15 by @phylogenomics

CFB Phyla

Page 48: UC Davis EVE161 Lecture 15 by @phylogenomics

Wu et al. 2006 PLoS Biology 4: e188.

Baumannia makes amino acids

Sulcia makes vitamins and cofactors

Page 49: UC Davis EVE161 Lecture 15 by @phylogenomics

Page 50: UC Davis EVE161 Lecture 15 by @phylogenomics

the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis. DNA was isolated from the collected organisms, and genome shotgun sequencing methods were used to identify more than 1.2 million new genes, providing evidence for substantial microbial taxonomic diversity [19]. Several hundred new and diverse examples of the proteorhodopsin family of light-harvesting genes were identified, documenting their extensive abundance and pointing to a possible important role in energy metabolism under low-nutrient conditions. However, substantial sequence diversity resulted in only limited genome assembly. These results generated many additional questions: would the same organisms exist everywhere in the ocean, leading to improved assembly as sequence coverage increased; what was the global extent of gene and gene family diversity, and can we begin to exhaust it with a large but achievable amount of sequencing; how do regions of the ocean differ from one another; and how are different environmental pressures reflected in organisms and communities? In this paper we attempt to address these issues.

Results

Sampling and the Metagenomic Dataset

Microbial samples were collected as part of the Sorcerer II expedition between August 8, 2003, and May 22, 2004, by the S/V Sorcerer II, a 32-m sailing sloop modified for marine research. Most specimens were collected from surface water marine environments at approximately 320-km (200-mile) intervals. In all, 44 samples were obtained from 41 sites (Figure 1), covering a wide range of distinct surface marine environments as well as a few nonmarine aquatic samples for contrast (Table 1).

Several size fractions were isolated for every site (see Materials and Methods). Total DNA was extracted from one or more fractions, mostly from the 0.1–0.8-μm size range. This fraction is dominated by bacteria, whose compact genomes are particularly suitable for shotgun sequencing. Random-insert clone libraries were constructed. Depending on the uniqueness of each sampling site and initial estimates of the genetic diversity, between 44,000 and 420,000 clones per sample were end-sequenced to generate mated sequencing reads. In all, the combined dataset includes 6.25 Gbp of sequence data from 41 different locations. Many of the clone libraries were constructed with a small insert size (<2 kbp) to maximize cloning efficiency. As this often resulted in mated sequencing reads that overlapped one another, overlapping mated reads were combined, yielding a total of ~6.4 M contiguous sequences, totaling ~5.9 Gbp of nonredundant sequence. Taken together, this is the largest collection of metagenomic sequences to date, providing more than a 5-fold increase over the dataset produced from the Sargasso Sea pilot study [19] and more than a 90-fold increase over the other large marine metagenomic dataset [20].

Assembly

Assembling genomic data into larger contigs and scaffolds, especially metagenomic data, can be extremely valuable, as it places individual sequencing reads into a greater genomic context. A largely contiguous sequence links genes into operons, but also permits the investigation of larger biochemical and/or physiological pathways, and also connects otherwise-anonymous sequences with highly studied “taxo-

Figure 1. Sampling Sites

Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200 miles (320 km) intervals along the eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part of this study.
doi:10.1371/journal.pbio.0050077.g001

PLoS Biology | www.plosbiology.org | March 2007 | Volume 5 | Issue 3 | e77 | p. 0003

Sorcerer II GOS Expedition

Page 51: UC Davis EVE161 Lecture 15 by @phylogenomics

Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees

Dongying Wu1, Martin Wu1,4, Aaron Halpern2,3, Douglas B. Rusch2,3, Shibu Yooseph2,3, Marvin Frazier2,3, J. Craig Venter2,3, Jonathan A. Eisen1*

1 Department of Evolution and Ecology, Department of Medical Microbiology and Immunology, University of California Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 The J. Craig Venter Institute, Rockville, Maryland, United States of America, 3 The J. Craig Venter Institute, La Jolla, California, United States of America, 4 University of Virginia, Charlottesville, Virginia, United States of America

Abstract

Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species.

Methodology/Principal Findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.

Conclusions/Significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.

Citation: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011

Editor: Robert Fleischer, Smithsonian Institution National Zoological Park, United States of America

Received October 25, 2010; Accepted February 20, 2011; Published March 18, 2011

This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.

Funding: The development and main work on this project was supported by the National Science Foundation via an “Assembling the Tree of Life” grant (number 0228651) to Jonathan A. Eisen and Naomi Ward. The final work on this project was funded by the Gordon and Betty Moore Foundation (through grants 0000951 and 0001660). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

During the last 30 years, technological advances in nucleic acid sequencing have led to revolutionary changes in our perception of the evolutionary relationships among all species as visualized in the tree of life. The first revolution was spawned by the work of Carl Woese and colleagues who, through sequencing and phylogenetic analysis of fragments of rRNA molecules, demonstrated how the diverse kinds of known cellular organisms could be placed on a single tree of life [1,2,3]. Most significantly, their analyses revealed the existence of a third major branch on the tree; the Archaea (then referred to as Archaebacteria) took their place along with the Bacteria and the Eukaryota [2]. Several factors make rRNA genes exceptionally powerful for this purpose, the most important being perhaps that highly conserved, homologous rRNA genes are present in all cellular lineages. To this day, analyses of rRNA genes continue to clarify and extend our knowledge of the evolutionary relationships among all life forms [4,5].

For microbial organisms, this approach was restricted to the minority that could be grown in pure culture in the laboratory until Norm Pace and colleagues showed that one could sequence rRNAs directly from environmental samples [6,7]. Initially, the methodology was cumbersome. However, this changed with the development of the polymerase chain reaction (PCR) methodology [8]. PCR generates many copies of a target segment of DNA, which in turn facilitates cloning and sequencing of that segment. However, delineation of the segment to be amplified requires primers, i.e., short segments of DNA whose nucleotide sequence is complementary to the DNA flanking the target. Because rRNA genes contain regions that are very highly conserved, “universal primers” can be used for PCR amplification of those genes even in environmental samples [9,10]. Thus, in principle, one can use

PLoS ONE | www.plosone.org 1 March 2011 | Volume 6 | Issue 3 | e18011

Page 52: UC Davis EVE161 Lecture 15 by @phylogenomics

Figure 1. Phylogenetic tree of the RecA superfamily. All RecA sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial analysis (RecA-like SAR, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in the text, sequences from two Archaea that were released after our initial analysis group in the Unknown 2 subfamily.
doi:10.1371/journal.pone.0018011.g001

Page 53: UC Davis EVE161 Lecture 15 by @phylogenomics

Table 2. Cont.

Subfamily       RecA Accession  Accession of Linked Gene  Assembly ID    Neighboring Gene Description                         Taxonomy Assignment
Unknown2        1096686533379   1096686533473             1096627390330  aryl-alcohol dehydrogenases related oxidoreductases  Eukaryota
Unknown2        1096686533379   1096686533505             1096627390330  snRNP Sm-like protein Chain A                        Eukaryota
Unknown2        1096689280551   1096689280549             1096627650434  S-adenosylmethionine synthetase                      Bacteria
RecA-like SAR1  1096683378299   1096683378297             1096627289467  DNA polymerase III alpha subunit                     Bacteria
Unknown1        1096694953057   1096694953059             1096520459783  FKBP-type peptidyl-prolyl cis-trans isomerase        Archaea
Unknown1        1096665977449   1096665977451             1096627520210  single-stranded DNA binding protein                  Viruses/Phages
Unknown1        1096682182125   1096682182127             1096628394294  DNA polymerase I                                     Bacteria

Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits.
doi:10.1371/journal.pone.0018011.t002

Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily Unknown 2). This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel recA homolog from the Unknown 2 subfamily (cluster ID 9).
doi:10.1371/journal.pone.0018011.g002

Page 54: UC Davis EVE161 Lecture 15 by @phylogenomics

Methods

Identification of deeply-branching ss-rRNA sequences

A data set of 340 representative ss-rRNA sequences from all three domains was prepared. These sequences represented 134 eukaryotic, 186 bacterial, and 20 archaeal species. Alignments for these 340 sequences were extracted from the European Ribosomal RNA database [66] and then manually curated to remove columns with more than 90% gaps or with poor alignment quality. Sorcerer II Global Ocean Sampling Expedition (GOS) ss-rRNA sequences were identified by the PhylOTU pipeline [67]. Using MUSCLE [68,69], each GOS ss-rRNA sequence was

Figure 3. Phylogenetic tree of the RpoB superfamily. All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored panels.
doi:10.1371/journal.pone.0018011.g003
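The gap-filtering step mentioned in the Methods excerpt on this slide (removing alignment columns with more than 90% gaps) can be sketched as below. This is an illustration only: the function name and the toy alignment are hypothetical, the alignment is assumed to be a list of equal-length strings with "-" as the gap character, and the manual curation for poor alignment quality is not modelled.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.90):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment: list of equal-length aligned sequences using '-' for gaps.
    """
    n_seqs = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(1 for seq in alignment if seq[col] == "-") / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]

# Hypothetical toy alignment; columns 1 and 3 are 75% gaps.
toy_alignment = ["A-C-", "A-G-", "AT--", "A-CG"]
filtered_default = drop_gappy_columns(toy_alignment)        # 90% cutoff
filtered_strict = drop_gappy_columns(toy_alignment, 0.50)   # stricter cutoff
```

With the 90% cutoff from the text, the toy alignment is unchanged because no column exceeds 90% gaps; the stricter 50% cutoff is included only to show columns being dropped.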
