Phylogenetic and Phylogenomic Approaches to the Study of Microbial Communities
March 7, 2012
IOM Forum on Microbial Threats
Social Biology of Microbes
Jonathan A. Eisen
University of California, Davis
Wednesday, March 7, 12
Acknowledgements
• $$$: DOE, NSF, GBMF, Sloan, DARPA, DSMZ, DHS
• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk
Outline
• Introduction
• Phylotyping and phylogenetic ecology
• Functional prediction
• Selecting organisms
• Future needs
Phylogeny
• Phylogeny is a description of the evolutionary history of relationships among organisms (or their parts).
• This is frequently portrayed in a diagram called a phylogenetic tree.
• Phylogenies can be more complex than a bifurcating tree (e.g., lateral gene transfer, recombination, hybridization)
Four Models for Rooting TOL
from Lake et al. doi: 10.1098/rstb.2009.0035
Whatever the History: Trying to Incorporate it is Critical
Uses of Phylogeny in Genomics and Metagenomics
Example 1:
Phylotyping and Phylogenetic Ecology
rRNA Phylotyping
• Collect DNA from environment
• PCR amplify rRNA genes using broad (so-called universal) primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified” by placement on tree
Three Major Issues in Phylotyping
• Beyond Moore’s Law
• Metagenomics
• Short reads
rRNA Phylotyping in Sargasso Sea Metagenomic Data
Venter et al., Science 304: 66. 2004
RecA Phylotyping in Sargasso Data
Venter et al., Science 304: 66. 2004
[Figure: Sargasso Phylotypes. Bar chart of weighted % of clones (y-axis, 0 to 0.500) by major phylogenetic group (x-axis; labels recovered: Alphaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Firmicutes, Chlorobi, Chloroflexi, Fusobacteria, Euryarchaeota), with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
Solution: More Automation
• BLAST????
• Composition/word frequencies
• Automation of trees
AutoPhylotyping 1: Each Sequence is an Island
STAP
STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which “positional homology” cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.
First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus “nucleotide” for each column (X) also has R dimensions, with the score for each dimension (Xr) calculated as the average of the scores for that column in that dimension (average of Sr). Thus the score of the consensus nucleotide is a mathematical expression describing the average “nucleotide” in that column for that alignment.
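The consensus "nucleotide" described in the excerpt can be sketched roughly as follows. This is only an illustration, not the STAP code: it assumes R = 4 (A, C, G, T, no ambiguity codes), uses a match score of 1.9 and a mismatch score of 0 as stand-ins for the IUB matrix entries, and skips gap characters. The function names are hypothetical.

```python
# States spanning the R-dimensional space (assumption: R = 4, no ambiguity codes).
STATES = "ACGT"


def score_vector(nucleotide, match=1.9, mismatch=0.0):
    """Score of one nucleotide in each of the R dimensions (S_r).

    match/mismatch are stand-ins for the IUB matrix entries (assumption).
    """
    return [match if nucleotide == state else mismatch for state in STATES]


def consensus_vector(column):
    """Consensus 'nucleotide' X for one column: X_r is the mean of S_r.

    Gap characters are skipped (assumption; the excerpt does not say).
    """
    residues = [c for c in column.upper() if c in STATES]
    if not residues:
        return [0.0] * len(STATES)
    vectors = [score_vector(c) for c in residues]
    return [sum(v[r] for v in vectors) / len(vectors) for r in range(len(STATES))]


def columns(alignment):
    """Yield each column of an alignment given as equal-length strings."""
    for i in range(len(alignment[0])):
        yield "".join(seq[i] for seq in alignment)


# Toy alignment, for illustration only.
alignment = ["ACGT", "ACGA", "AC-T", "ATGT"]
consensus = [consensus_vector(col) for col in columns(alignment)]
```

A fully conserved column yields a consensus vector concentrated in one dimension, while a mixed column spreads its weight across dimensions. The excerpt goes on to score columns from these vectors, a step not shown here.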
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002
Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
ss-rRNA Taxonomy Pipeline
PLoS ONE | www.plosone.org | July 2008 | Volume 3 | Issue 7 | e2566
Wu et al. 2008 PLoS One
Each sequence analyzed separately
AMPHORA
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
WGT
AMPHORA
Guide tree
Comparison of the phylotyping performance by AMPHORA and MEGAN. The sensitivity and specificity of the phylotyping methods were measured across taxonomic ranks using simulated Sanger shotgun sequences of 31 genes from 100 representative bacterial genomes. The figure shows that AMPHORA significantly outperforms MEGAN in sensitivity without sacrificing specificity.
AutoPhylotyping 2: Most in the Family
Metagenomic Phylogenetic challenge
A single tree with everything
rRNA Phylotyping in Sargasso Sea Metagenomic Data
Venter et al., Science 304: 66. 2004
Combine all into one alignment
sequences obtained in this and previous (12, 13) studies fall within the marine roseobacters (see Fig. S1 in the supplemental material), a major clade of culturable marine heterotrophs (7), many of which play a role in sulfur cycles (e.g., see reference 8). One clade of six CGOF sequences is most closely related to NAC11-6 from a dimethylsulfoniopropionate-producing algal bloom (8), while CGOCA38 groups closely with NAC11-7 (from the same algal bloom study [8]) and an uncultivated marine bacterium, ZD0207, associated with dimethylsulfoniopropionate uptake (15). CGOAB33 is most similar to one (slope strain EI1*) of a group of thiosulfate-oxidizing bacteria from marine sediments and hydrothermal vents (14).
Members of the family Pseudomonadaceae comprised 23 to 69% of the gammaproteobacterial sequences in those samples
FIG. 1. Histogram showing percentages of composition (by taxon) for 16S rRNA gene libraries generated for this study, showing only taxa comprising at least 20% of sequences in at least one clone library.
TABLE 1. Summary of Gulf of Alaska samples from which 16S rRNA gene sequences were obtained
Column groups: Phyla, Classes, Orders, Families = Taxonomy (no. of members); Sing, Doub, Trip, Cluster = FastGroup result (no. of sequences) (b)
Sample | Origin | Seamount | Alvin dive no. | Dive depth (m) for sample collection | No. of sequences | Phyla | Classes | Orders | Families | Sing | Doub | Trip | Cluster
CGOA | Bamboo coral | Murray | 3805 | 1,100–1,950 (a) | 99 | 11 | 13 | 21 | 23 | 30 | 3 | 1 | 7
CGOD | Black coral | Murray | 3805 | 1,100–1,950 (a) | 47 | 1 | 2 | 2 | 2 | 0 | 0 | 0 | 2
NISA | Water column | Murray | 3798 | 760 | 236 | 7 | 5 | 20 | 22 | 24 | 2 | 1 | 5
BRRA | Rock biofilm | Murray | 3798 | 670–1,094 (a) | 58 | 6 | 7 | 10 | 13 | 27 | 5 | 1 | 1
CGOC | Bamboo coral | Chirikof | 3803 | 2,660–3,300 (a) | 57 | 6 | 8 | 10 | 10 | 12 | 5 | 1 | 2
CGOF | Bamboo coral | Warwick | 3808 | 758 | 343 | 13 | 15 | 22 | 28 | 46 | 13 | 3 | 12
CGOG | Bamboo coral | Warwick | 3806 | 634 | 45 | 5 | 8 | 10 | 11 | 27 | 5 | 2 | 0
(a) Depths covered during entire dive (specific sample collection depth not available).
(b) Results of sorting sequences using FastGroup. Sing, unique sequences; Doub, doubletons (two identical sequences); Trip, tripletons (three identical sequences); Cluster, cluster of more than three identical sequences.
APPLIED AND ENVIRONMENTAL MICROBIOLOGY, Feb. 2006, p. 1680–1683, Vol. 72, No. 2
0099-2240/06/$08.00+0 doi:10.1128/AEM.72.2.1680–1683.2006
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Characterization of Bacterial Communities Associated with Deep-Sea Corals on Gulf of Alaska Seamounts†
Kevin Penn (1), Dongying Wu (1), Jonathan A. Eisen (1,2), and Naomi Ward (1,3)*
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850 (1); Johns Hopkins University, Charles and 34th Streets, Baltimore, Maryland 21218 (2); and Center of Marine Biotechnology, 701 East Pratt Street, Baltimore, Maryland 21202 (3)
Received 22 June 2005/Accepted 8 November 2005
Although microbes associated with shallow-water corals have been reported, deepwater coral microbes are poorly characterized. A cultivation-independent analysis of Alaskan seamount octocoral microflora showed that Proteobacteria (classes Alphaproteobacteria and Gammaproteobacteria), Firmicutes, Bacteroidetes, and Acidobacteria dominate and vary in abundance. More sampling is needed to understand the basis and significance of this variation.
The most abundant corals on Gulf of Alaska seamounts are octocorals (9), which create a habitat structure for mobile fauna (4). Concerns about the benthic impacts of commercial fishing have renewed interest in habitat-forming deep-sea corals (4). Studies of shallow-water scleractinian corals (12) have revealed a diverse microflora and evidence of host-microbe interactions. Although studies of the deep-sea octocoral microflora are under way (10), there have been no published reports describing the microbial community composition.
Three Gulf of Alaska seamounts were visited during research cruise AT7-15/16 aboard the R/V Atlantis. The biological objectives of the cruise included sampling of deep-sea octocorals for studies of their dispersal and reproductive strategies, with a particular focus on the abundant bamboo corals (Isididae). We took advantage of available coral specimens to examine their associated microflora.
Coral, rock, and water column samples (Table 1) were collected from the Warwick, Murray, and Chirikof seamounts using the deep-submergence vehicle Alvin. Corals and rocks were harvested using the submersible’s manipulators and stored in a closed box during ascent to minimize physical disturbance by surface waters. The water adjacent to coral colonies was sampled using a Niskin bottle fired at depth. After submersible recovery, freshly extruded coral exopolysaccharide and scrapings of coral and rock surfaces were transferred to sterile cryovials. Water samples were prefiltered through 20-µm-pore-size Nitex, concentrated using a TFF apparatus (Millipore), and vacuum filtered (1.0-µm and 0.2-µm pore size). The 0.2-µm filter retentate was resuspended in sterile saline solution. Samples were frozen immediately at −70°C and shipped frozen for subsequent processing.
Genomic DNAs were extracted using Ultra Clean soil DNA kits (MoBio), and 16S rRNA genes were PCR amplified using primers 27F and 1525R (11) and PlatTaq PCR supermix (Invitrogen). Amplifications were performed with an initial denaturation of 2 min at 94°C, followed by 29 cycles of 30 s at 94°C, 30 s at 55°C, and 2 min at 72°C, with a final extension of 5 min at 72°C. PCR products were cloned using a TOPO TA cloning kit (Invitrogen), and primers M13F and M13R were used to sequence positions 9 to 1545 of the 16S rRNA gene.
BLASTN (1) was used to compare our query sequences with reference sequences from the RDP2 (3) database. Representative sequences from the BLASTN output were aligned with our query sequences, using an RDP2-provided profile alignment. Neighbor-joining trees were created using PHYLIP (6) and used to assign putative taxonomy down to the family level. Detailed phylogenetic trees were constructed using the relevant sequences from each clone library, two reference sequences most closely related to the query sequence, and additional reference sequences. Alignments were generated using the RDP2 profile alignment, and bootstrapped neighbor-joining trees were reconstructed using PHYLIP (6).
The clones sequenced comprised 19 phyla (see Table S1 in the supplemental material), dominated by Proteobacteria (classes Alphaproteobacteria and Gammaproteobacteria), Firmicutes, Bacteroidetes, and Acidobacteria (Fig. 1). The relative proportions of these groups varied widely across the five coral samples, as did the degree to which a given library was dominated by a single group (Fig. 1; see Table S1 in the supplemental material). At the subphylum level, families occurring in major proportions included Rhizobiaceae, Rhodobacteraceae, and Sphingomonadaceae (Alphaproteobacteria); Pseudomonadaceae, Alteromonadaceae, and Halomonadaceae (Gammaproteobacteria); Bacillaceae, Clostridiaceae, and Mycoplasmataceae (Firmicutes); and Flexibacteraceae and Flavobacteraceae (Bacteroidetes).
Members of the family Rhodobacteraceae and the family Pseudomonadaceae were selected for further analysis, based on their relative abundance and on the previous finding (13) that shallow-water corals contain significantly larger numbers of these bacteria than the surrounding water. Members of the family Rhodobacteraceae comprised 23 to 100% of the alphaproteobacterial sequences in those libraries containing at least 10% alphaproteobacteria. The majority of Rhodobacteraceae
* Corresponding author. Mailing address: The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850. Phone: (301) 795-7813. Fax: (301) 838-0208. E-mail: [email protected].
† Supplemental material for this article may be found at http://aem.asm.org/.
RecA Phylotyping in Sargasso Data
Venter et al., Science 304: 66. 2004
AutoPhylotyping 3: All in the Family
Metagenomic Phylogenetic challenge
A single tree with everything
alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a quality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain are included in the final alignment, and 2) masks highly gapped alignment columns (see Text S1).
We use this high quality alignment of metagenomic reads and reference sequences to construct a fully-resolved, phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g., RAxML [27], pplacer [28], etc.), PhylOTU currently employs FastTree as a default method due to its relatively high speed-to-performance ratio and its ability to construct accurate trees in the presence of highly-gapped data [29]. After construction of the phylogeny, lineages representing reference sequences are pruned from the tree. The resulting phylogeny of metagenomic reads is then used to compute a PD distance matrix in which the distance between a pair of reads is defined as the total tree path distance (i.e., branch length) separating the two reads [30]. This tree-based distance matrix is subsequently used to hierarchically cluster metagenomic reads via MOTHUR into OTUs in a fashion similar to traditional PID-based analysis [31]. As with PID clustering, the hierarchical algorithm can be tuned to produce finer or coarser clusters, corresponding to different taxonomic levels, by adjusting the clustering threshold and linkage method.
To evaluate the performance of PhylOTU, we employed statistical comparisons of distance matrices and clustering results for a variety of data sets. These investigations aimed 1) to compare PD versus PID clustering, 2) to explore overlap between PhylOTU clusters and recognized taxonomic designations, and 3) to quantify the accuracy of PhylOTU clusters from shotgun reads relative to those obtained from full-length sequences.
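The PD distance matrix and the threshold-based hierarchical clustering described in the excerpt can be sketched as below. This is a toy stand-in, not PhylOTU or MOTHUR: the tree, read names, branch lengths, and function names are all hypothetical, and the clustering loop is a plain agglomerative procedure written for readability.

```python
from itertools import combinations

# Hypothetical toy tree of reads (reference lineages already pruned):
# child -> (parent, branch length). "root" has no entry.
TREE = {
    "n1": ("root", 0.10), "n2": ("root", 0.30),
    "r1": ("n1", 0.01), "r2": ("n1", 0.02),
    "r3": ("n2", 0.02), "r4": ("n2", 0.01),
}
READS = ["r1", "r2", "r3", "r4"]


def path_to_root(node):
    """Map node and each of its ancestors to their distance from node."""
    dist, out = 0.0, {node: 0.0}
    while node in TREE:
        parent, length = TREE[node]
        dist += length
        out[parent] = dist
        node = parent
    return out


def tree_distance(a, b):
    """Total branch length on the tree path between leaves a and b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # The shortest route through a shared ancestor passes through the
    # most recent common ancestor.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def pd_matrix(reads):
    """Pairwise tree-path (PD) distances keyed by unordered read pair."""
    return {frozenset(pair): tree_distance(*pair) for pair in combinations(reads, 2)}


LINKAGE = {
    "nearest": min,
    "furthest": max,
    "average": lambda values: sum(values) / len(values),
}


def cluster(reads, dist, threshold, linkage="average"):
    """Merge the closest pair of clusters until none lies within threshold."""
    link = LINKAGE[linkage]

    def between(c1, c2):
        return link([dist[frozenset((a, b))] for a in c1 for b in c2])

    clusters = [[r] for r in reads]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]))
        if between(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return sorted(sorted(c) for c in clusters)


dist = pd_matrix(READS)
otus = cluster(READS, dist, threshold=0.05, linkage="average")
```

Raising `threshold` gives coarser OTUs and lowering it gives finer ones. Changing `linkage` changes how the distance between two clusters is summarized.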
PhylOTU Clusters Recapitulate PID Clusters
We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.
To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
Finding Metagenomic OTUs
PLoS Computational Biology | www.ploscompbiol.org 3 January 2011 | Volume 7 | Issue 1 | e1001061
PhylOTU - Sharpton et al. PLoS Comp. Bio 2011
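The true conjunction and true disjunction rates defined in the excerpt can be sketched directly from their verbal definitions. The sequence ids, cluster labels, and function name below are hypothetical, and the handling of an empty denominator (returning NaN) is an assumption.

```python
from itertools import combinations


def co_clustering_rates(first, second):
    """Return (true conjunction rate, true disjunction rate).

    first, second: dicts mapping the same sequence ids to cluster labels.
    """
    same_first = same_both = diff_first = diff_both = 0
    for a, b in combinations(sorted(first), 2):
        together_first = first[a] == first[b]
        together_second = second[a] == second[b]
        if together_first:
            same_first += 1
            same_both += together_second      # also together in second analysis
        else:
            diff_first += 1
            diff_both += not together_second  # also apart in second analysis
    conjunction = same_both / same_first if same_first else float("nan")
    disjunction = diff_both / diff_first if diff_first else float("nan")
    return conjunction, disjunction


# Toy cluster assignments, for illustration only.
pid_clusters = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
pd_clusters = {"s1": "X", "s2": "X", "s3": "Y", "s4": "Y"}
conjunction, disjunction = co_clustering_rates(pid_clusters, pd_clusters)
```

Both rates are asymmetric in the two analyses, so swapping the arguments can change the result.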
AutoPhylotyping 4: All in the Genome
Challenge
• Each gene poorly sampled in metagenomes
• Can we combine all into a single tree?
Kembel et al. The phylogenetic diversity of metagenomes. PLoS One 2011
AMPHORA ALL
Figure 3. Taxonomic diversity and standardized phylogenetic diversity versus depth in environmental samples along an oceanic depth gradient at the HOT ALOHA site.
cally defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13, 18, 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).
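The unweighted UniFrac definition in the excerpt (the fraction of branch length leading to descendants in either, but not both, communities) can be sketched on a toy tree as follows. The tree, leaf labels, and function names are hypothetical, and this is not the UniFrac software itself.

```python
# Hypothetical toy tree: child -> (parent, branch length). "root" has no entry.
TREE = {
    "n1": ("root", 1.0), "n2": ("root", 1.0),
    "a": ("n1", 1.0), "b": ("n1", 1.0),
    "c": ("n2", 2.0), "d": ("n2", 2.0),
}
# Community in which each leaf sequence was observed (toy data).
LEAF_COMMUNITY = {"a": "A", "b": "B", "c": "A", "d": "A"}


def communities_below(tree, leaf_community):
    """For each branch (named by its child node), collect the communities
    that have at least one descendant leaf below it."""
    below = {node: set() for node in tree}
    for leaf, community in leaf_community.items():
        node = leaf
        while node in tree:
            below[node].add(community)
            node = tree[node][0]
    return below


def unweighted_unifrac(tree, leaf_community):
    """Branch length unique to one community divided by total branch length."""
    below = communities_below(tree, leaf_community)
    unique = sum(length for node, (_, length) in tree.items()
                 if len(below[node]) == 1)
    total = sum(length for node, (_, length) in tree.items() if below[node])
    return unique / total


value = unweighted_unifrac(TREE, LEAF_COMMUNITY)
```

In this toy tree only branch `n1` has descendants from both communities, so 7 of the 8 units of branch length are unique to one community. Duplicate sequences add no branch length and therefore do not change the value, which is why the measure is qualitative.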
Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are applicable. However, weighted UniFrac has a major advantage over FST because it can be used to combine data in which different parts of the 16S rRNA were sequenced (e.g., when nonoverlapping sequences can be combined into a single tree using full-length sequences as guides). We use two different data sets to illustrate how analyses with quantitative and qualitative β diversity measures can lead to dramatically different conclusions about the main factors that structure microbial diversity. Specifically, qualitative measures that disregard relative abundance can better detect effects of different founding populations, such as the source of bacteria that first colonize the gut of newborn mice and the effects of factors that are restrictive for microbial growth such as temperature. In contrast, quantitative measures that account for the relative abundance of microbial lineages can reveal the effects of more transient factors such as nutrient availability.
MATERIALS AND METHODS
Weighted UniFrac. Weighted UniFrac is a new variant of the original un-weighted UniFrac measure that weights the branches of a phylogenetic treebased on the abundance of information (Fig. 1B). Weighted UniFrac is thus aquantitative measure of ! diversity that can detect changes in how many se-quences from each lineage are present, as well as detect changes in which taxaare present. This ability is important because the relative abundance of differentkinds of bacteria can be critical for describing community changes. In contrast,the original, unweighted UniFrac (Fig. 1A) is a qualitative ! diversity measurebecause duplicate sequences contribute no additional branch length to the tree(by definition, the branch length that separates a pair of duplicate sequences iszero, because no substitutions separate them).
The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:
$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$
Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.
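As a rough illustration of the raw value u defined above, the sum over branches can be written directly in Python. This is a sketch under assumptions: the tree is given as a list of (branch length, leaves below) pairs and each community as a dictionary of sequence counts; these encodings and all names are hypothetical, not taken from the paper.

```python
def raw_weighted_unifrac(branches, counts_a, counts_b):
    """Raw weighted UniFrac: sum over branches of b_i * |A_i/A_T - B_i/B_T|.

    branches: list of (length, leaves_below) pairs (hypothetical encoding).
    counts_a, counts_b: map each sequence name to the number of times it was observed.
    """
    total_a = sum(counts_a.values())  # A_T
    total_b = sum(counts_b.values())  # B_T
    u = 0.0
    for length, leaves in branches:
        a_i = sum(counts_a.get(leaf, 0) for leaf in leaves)  # A_i
        b_i = sum(counts_b.get(leaf, 0) for leaf in leaves)  # B_i
        u += length * abs(a_i / total_a - b_i / total_b)
    return u


# Toy tree with two sequences attached directly to the root.
star_branches = [(1.0, {"s1"}), (2.0, {"s2"})]
u_example = raw_weighted_unifrac(star_branches, {"s1": 3, "s2": 1}, {"s1": 1, "s2": 1})
```

In the toy case each branch differs by 0.25 in relative abundance, so the longer branch contributes twice as much to u as the shorter one.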
If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:
$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$
Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.
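The normalization by D can be checked numerically with a small Python sketch. It assumes the same hypothetical encoding as a list of (branch length, leaves below) pairs plus a dictionary of root distances per sequence; the function name and toy data are illustrative only.

```python
def normalized_weighted_unifrac(branches, root_distance, counts_a, counts_b):
    """Return u / D, where u is the raw weighted UniFrac value and D the scaling factor.

    branches: list of (length, leaves_below) pairs (hypothetical encoding).
    root_distance: maps each sequence name to its distance d_j from the root.
    counts_a, counts_b: map each sequence name to its observed count.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())

    def fraction(counts, total, leaves):
        # Share of a community's sequences that descend from a branch.
        return sum(counts.get(leaf, 0) for leaf in leaves) / total

    u = sum(
        length * abs(fraction(counts_a, total_a, leaves) - fraction(counts_b, total_b, leaves))
        for length, leaves in branches
    )
    d = sum(
        dist * (counts_a.get(seq, 0) / total_a + counts_b.get(seq, 0) / total_b)
        for seq, dist in root_distance.items()
    )
    return u / d


two_branches = [(1.0, {"s1"}), (2.0, {"s2"})]
two_distances = {"s1": 1.0, "s2": 2.0}
disjoint = normalized_weighted_unifrac(two_branches, two_distances, {"s1": 2}, {"s2": 5})
identical = normalized_weighted_unifrac(
    two_branches, two_distances, {"s1": 1, "s2": 1}, {"s1": 1, "s2": 1}
)
```

The two toy comparisons reproduce the stated endpoints: nonoverlapping communities give 1 and identical communities give 0.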
Clustering with normalized u values treats each sample equally instead of
TABLE 1. Measurements of diversity
Measure | Measurement of α diversity | Measurement of β diversity
Only presence/absence of taxa considered | Qualitative (species richness) | Qualitative
Additionally accounts for the no. of times that each taxon was observed | Quantitative (species richness and evenness) | Quantitative
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
VOL. 73, 2007 PHYLOGENETICALLY COMPARING MICROBIAL COMMUNITIES 1577
AutoPhylotyping 5: Novel lineages and decluttering
RecA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science 276:734-740
Archaea
Eukaryotes
Bacteria
Other lineages?
Lek Clustering
[Tree figure; node values: 0.33, 0.75, 0.75, 0.75, 0.75, 1, 1]
Cutoff of 0.5
RecA
[Tree figure; labeled groups: GOS 1, GOS 2, GOS 3, GOS 4, GOS 5]
RpoB Too
Side benefit: binning
Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Uses of Phylogeny in Genomics and Metagenomics
Example 2:
Functional Diversity and Functional Predictions
Predicting Function
• Key step in genome projects
• More accurate predictions help guide experimental and computational analyses
• Many diverse approaches
• All improved by “phylogenomic” type analyses that integrate both evolutionary reconstructions and an understanding of how new functions evolve
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Flowchart figure with two worked examples (EXAMPLE A, EXAMPLE B) using genes labeled 1 to 6 and 1A to 3B from Species 1, Species 2, and Species 3]
METHOD
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
Other panel labels: ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN); Duplication?; Ambiguous
Based on Eisen, 1998 Genome Res 8: 163-167.
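The last two steps of the method (overlaying known functions onto the gene tree and inferring the likely function of a gene of interest) can be illustrated with a deliberately simplified Python sketch. It is not the procedure from Eisen 1998: it assumes a rooted gene tree written as nested tuples, ignores branch lengths and explicit duplication inference, and uses a simple "smallest annotated clade" rule. The gene names follow the slide labels; the function labels are hypothetical.

```python
def leaves(tree):
    """Return the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def predict_function(tree, known, query):
    """Predict a function for `query` from the smallest clade holding annotated genes.

    Returns the shared function, "ambiguous" if annotated genes in that clade
    disagree, or None if no annotated gene is available.
    """
    if isinstance(tree, str):
        return None
    for child in tree:
        if query in leaves(child):
            # Look for an answer in the smaller clade first.
            result = predict_function(child, known, query)
            if result is not None:
                return result
            break
    else:
        raise ValueError("query gene is not in the tree")
    functions = {known[g] for g in leaves(tree) if g in known and g != query}
    if not functions:
        return None
    if len(functions) == 1:
        return functions.pop()
    return "ambiguous"


# Gene tree with two paralog groups (A and B); 2A and 2B are uncharacterized.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known_functions = {"1A": "function A", "3A": "function A",
                   "1B": "function B", "3B": "function B"}
prediction_2a = predict_function(gene_tree, known_functions, "2A")
prediction_2b = predict_function(gene_tree, known_functions, "2B")
```

Because the rule follows the tree rather than overall sequence similarity, each uncharacterized gene takes the function of its own paralog group, and a query whose nearest annotated relatives disagree is reported as ambiguous.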
[Tree figure: haloarchaeal species phylogeny with support values on branches; scale bar 0.01]
Legend (dataset genes): MA ammonia-lyase, MA mutase S subunit, MA mutase E subunit, PHA synthase, cellulase, CRISPRs, CAS
Color ranges: New Genomes
Species shown: Haloferax mediterranei, Haloferax mucosum, Haloferax volcanii, Haloferax sulfurifontis, Haloferax denitrificans, Halogeometricum borinquense, Haloquadratum walsbyi, Halorubrum lacusprofundi, Halalkalicoccus jeotgali, Natrialba magadii, Haloterrigena turkmenica, Halopiger xanaduensis, Haloarcula vallismortis, Haloarcula marismortui, Haloarcula sinaiiensis, Haloarcula californiae, Halomicrobium mukohataei, Halorhabdus utahensis, Natronomonas pharaonis, Halobacterium sp. NRC-1, Halobacterium salinarum R1
Haloarchaea TBPs
[Tree figure: TATA-binding protein (TBP) homologs from haloarchaeal genomes, with bootstrap values on branches]
Figure 8. Independent expansion of the TATA-binding protein family in two haloarchaeal genera. Phylogeny of TATA-binding protein (TBP) homologs identified by RAST with Bootstrap values shown. Colored branches represent duplication events (with the dark blue branch representing four duplications). Ancestral TBP (found in all genomes) is shown on the purple branch. Successive duplications are shown in darkening shades of green (Halobacterium) or blue (Haloferax). Lynch et al. in preparation
Massive Diversity of Proteorhodopsins
Venter et al., 2004
Metagenomics DARPA
Characterizing the niche-space distributions of components
[Figure with three aligned panels (a), (b), (c). Rows: GS-numbered sampling sites from the North American East Coast, Eastern Tropical Pacific, Polynesia Archipelagos, Galapagos Islands, Indian Ocean, Sargasso Sea, and Caribbean Sea. Panel (a) columns: Component 1 to Component 5. Panel (c) columns: Salinity, Sample Depth, Chlorophyll, Temperature, Insolation, Water Depth. Legends: General (High, Medium, Low, NA); Water depth (>4000m, 2000-4000m, 900-2000m, 100-200m, 20-100m, 0-20m)]
Figure 3: a) Niche-space distributions for our five components (H^T); b) the site-similarity matrix (H^T H); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.
Figure 3a shows the estimated niche-space distribution for each of the five components. Components 2 (Photosystem) and 4 (Unidentified) are broadly distributed; Components 1 (Signalling) and 5 (Unidentified) are largely restricted to a handful of sites; and component 3 shows an intermediate pattern. There is a great deal of overlap between niche-space distributions for different components.
Figure 3b shows the pattern of filtered similarity between sites. We see clear patterns of grouping, that do not emerge when we calculate functional distances without filtering, or using PCA rather than NMF filtering (Figure 3 in Text S1). As with the Pfams, we see clusters roughly associated with our components, but there is more overlapping than with the Pfam clusters (Figure 2b).
Figure 3c shows the distribution of environmental variables measured at each site. Inspection of Figure 3 reveals qualitative correspondence between environmental factors and clusters of similar sites in the similarity matrix. For example, the “North American East Coast” samples are divided into two groups, one in the top left and the other in the bottom right of the similarity matrix. Inspection of the environmental features suggests that the split in these samples could be mostly due to the differences in insolation and water depth.
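To make the matrices named in the Figure 3 caption concrete, the following Python sketch builds the per-site component profiles (H^T) and the site-similarity matrix (H^T H) from a small matrix H. The NMF factorization that would produce H is not shown, and the numbers in H are invented for illustration; only the matrix relationships come from the caption.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Multiply two list-of-lists matrices."""
    return [
        [sum(x * y for x, y in zip(row, column)) for column in zip(*right)]
        for row in left
    ]


# Hypothetical H: 2 components (rows) by 3 sites (columns).
H = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
]
site_profiles = transpose(H)                # H^T: one row of component weights per site
site_similarity = matmul(site_profiles, H)  # H^T H: sites by sites
```

Each entry of the similarity matrix is the dot product of two sites' component profiles, so sites dominated by the same components score high and the matrix is symmetric.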
We can also examine patterns of similarity between the components themselves, using niche-site distributions or functional profiles (see Figure 5 in Text S1). All 5
Uses of Phylogeny in Genomics and Metagenomics
Example 3:
Selecting Organisms for Study
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science 276:734-740
Archaea
Eukaryotes
Bacteria
[Tree figure of bacterial phyla. Labeled groups: Acidobacteria, Bacteroides, Fibrobacteres, Gemmimonas, Verrucomicrobia, Planctomycetes, Chloroflexi, Proteobacteria, Chlorobi, Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Chlamydia, Spirochaetes, Deinococcus-Thermus, Aquificae, Thermotogae, TM6, OS-K, Termite Group, OP8, Marine Group A, WS3, OP9, NKB19, OP3, OP10, TM7, OP1, OP11, Nitrospira, Synergistes, Deferribacteres, Thermodesulfobacteria, Chrysiogenetes, Thermomicrobia, Dictyoglomus, Coprothermobacter]
• At least 40 phyla of bacteria
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some studies in other phyla
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
As of 2002
Based on Hugenholtz, 2002
http://www.jgi.doe.gov/programs/GEBA/pilot.html
GEBA Pilot Project: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
GEBA Lesson 1
Phylogeny-driven genome selection (and phylogenetics) improves genome annotation
• Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary” based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
GEBA Lesson 2
Phylogeny-driven genome selection helps discover new genetic diversity
Protein Family Rarefaction Curves
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
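The three steps above can be sketched in Python once each genome has been reduced to its set of protein families. The clustering step itself (MCL on the slide) is not performed here; family assignments are taken as given. Averaging over random genome orderings is an assumption borrowed from standard rarefaction practice, and the genome and family names are made up.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct protein families seen after adding 1..N genomes.

    genome_families: maps each genome name to the set of protein families it contains.
    The curve is averaged over random genome orderings (an assumption of this sketch).
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0.0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for position, genome in enumerate(genomes):
            seen |= genome_families[genome]  # add this genome's families
            totals[position] += len(seen)
    return [total / n_permutations for total in totals]


example_genomes = {
    "genome_1": {"fam_a", "fam_b"},
    "genome_2": {"fam_b", "fam_c"},
    "genome_3": {"fam_a", "fam_b", "fam_c", "fam_d"},
}
curve = rarefaction_curve(example_genomes)
```

Plotting the returned list against 1..N gives the number of genomes versus number of protein families curve; a curve that keeps climbing indicates that additional genomes are still contributing new families.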
Wu et al. 2009 Nature 462, 1056-1060
Synapomorphies exist
Wu et al. 2009 Nature 462, 1056-1060
Families/PD not uniform
GEBA Lesson 3
Improves analysis of genome data from uncultured organisms
Sargasso Phylotypes
[Bar chart. y-axis: Weighted % of Clones (0 to 0.500). x-axis: Major Phylogenetic Group (labels include Alphaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Firmicutes, Chlorobi, Chloroflexi, Fusobacteria, Euryarchaeota). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA]
Shotgun Sequencing Allows Use of Other Markers
GEBA Project improves metagenomic analysis
Venter et al., Science 304: 66-74. 2004
But not a lot
Phylogeny and Metagenomics Future 1
Need to adapt genomic and metagenomic methods to make better use of data
iSEEM Project
More Markers
Phylogenetic group | Genome Number | Gene Number | Marker Candidates
Archaea | 62 | 145415 | 106
Actinobacteria | 63 | 267783 | 136
Alphaproteobacteria | 94 | 347287 | 121
Betaproteobacteria | 56 | 266362 | 311
Gammaproteobacteria | 126 | 483632 | 118
Deltaproteobacteria | 25 | 102115 | 206
Epsilonproteobacteria | 18 | 33416 | 455
Bacteroides | 25 | 71531 | 286
Chlamydiae | 13 | 13823 | 560
Chloroflexi | 10 | 33577 | 323
Cyanobacteria | 36 | 124080 | 590
Firmicutes | 106 | 312309 | 87
Spirochaetes | 18 | 38832 | 176
Thermi | 5 | 14160 | 974
Thermotogae | 9 | 17037 | 684
Phylogeny and Metagenomics Future 2
We have still only scratched the surface of microbial diversity
Phylogenetic Diversity: Genomes
From Wu et al. 2009 Nature 462, 1056-1060
Phylogenetic Diversity with GEBA
From Wu et al. 2009 Nature 462, 1056-1060
Phylogenetic Diversity: Isolates
From Wu et al. 2009 Nature 462, 1056-1060
Phylogenetic Diversity: All
From Wu et al. 2009 Nature 462, 1056-1060
Number of SAGs from Candidate Phyla
Site | OD1 | OP11 | OP3 | SAR406
Site A: Hydrothermal vent | 4 | 1 | - | -
Site B: Gold Mine | 6 | 13 | 2 | -
Site C: Tropical gyres (Mesopelagic) | - | - | - | 2
Site D: Tropical gyres (Photic zone) | 1 | - | - | -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
GEBA uncultured
Phylogenomics Future 3
Need Experiments from Across the Tree of Life too
• At least 40 phyla of bacteria
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
• Some studies in other phyla
As of 2002
Based on Hugenholtz, 2002
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
• Some studies in other phyla
• Same trend in Eukaryotes
As of 2002
Based on Hugenholtz, 2002
Tree based on Hugenholtz (2002) with some modifications.
Need experimental studies from across the tree too
Tree based on Hugenholtz (2002) with some modifications.
Adopt a Microbe
Acknowledgements
• $$$: DOE, NSF, GBMF, Sloan, DARPA, DSMZ, DHS
• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk