Supplemental information for: Abigail Manson McGuire1, Kyla Cochrane2, Allison D. Griggs1,
Brian J. Haas1, Thomas Abeel1,4, Qiandong Zeng1, Justin B. Nice5, Hanlon MacDonald5, Bruce W.
Birren1, Bryan W. Berger3,5, Emma Allen-Vercoe2, Ashlee M. Earl1*. Evolution of Invasion in a
Diverse Set of Fusobacterium.
SUPPLEMENTAL RESULTS
Adaptive radiation of fusobacterial lineages. To gain greater insight into how active
and passive invasion strategies may have evolved, we looked closely at the phylogenetic
relationship among the active and passive invader species (Figure 1). While our maximum
likelihood tree agreed with previous reports of Fusobacterium phylogeny, with Lineage 2
(containing Clades C and D) in the more basal position relative to Lineage 3 (containing Clade E)
(1, 2), there was very low (4%) bootstrap support for this node, indicating a very weak basis for
sequential ordering of that division. Bootstrap support for all other nodes in the tree was
excellent. When individual gene trees for each of the 498 core orthogroups were examined for
relatedness, only 37% were found to support the topology predicted by maximum likelihood
analysis, while 27% supported an alternate topology placing Lineage 3 in the more basal
position relative to Lineage 2, and 32% represented topologies where these clades diverged
from Lineage 1 (containing Clades A and B) via a common ancestor. These data strongly suggest
that the relative placement of Lineage 2 and Lineage 3 is poorly defined due to their near
simultaneous divergence, likely representing an adaptive radiation where the last common
ancestor diversified into three major lineages containing five clades, three clades acquiring (or
retaining) the potential to actively invade and the others lacking this capability.
ANI-based species definition. ANI values of 94-95% represent a commonly accepted
threshold for species designation (3, 4). Pairwise comparisons between members of different F.
nucleatum subspecies were lower than this threshold (89-93%), while values obtained between
members of the same subspecies were above this threshold (96-99%) (Figure S1d). These
results indicate that each of the four F. nucleatum subspecies in our analysis would be
considered a separate species, according to ANI-based species definition.
ANI values within F. periodonticum ranged from 92-98%, overlapping the ANI species
line of 94-95%. F. periodonticum 2_1_31 and F. periodonticum D10 would be contained within
a single species (pairwise ANI values of 98%), while F. periodonticum ATCC 33693 and F.
periodonticum 1_1_41FAA would each be placed in a separate species (pairwise ANI values of
92-94% to other strains; see Figure S1e). For all other species with more than one strain in our
analysis (F. necrophorum, F. gonidiaformans, and F. ulcerans), all pairwise intraspecies ANI
values were >95%, indicating that they represent a single species.
Species-specific orthogroups. Species-specific orthogroups, or groups of orthologous
genes, primarily contain genes annotated as encoding “hypothetical proteins”, with no
indication of protein function. Pfam, GO, KEGG, and other functional annotations provide some
clues to their function.
F. nucleatum. Orthogroups present in all 14 F. nucleatum strains, but no other strains in
our analysis, include those encoding two MORN2 domain containing proteins, two periplasmic
solute-binding proteins, several transporters, a polysaccharide deacetylase, and a protein
containing a NUDIX domain. NUDIX domains have previously been associated with infection in
other pathogenic bacteria (5). Many F. nucleatum-specific orthogroups include genes with no
functional annotation.
There are also numerous subspecies-specific orthogroups (see Table S3), that is,
orthogroups present in all sequenced members of a single F. nucleatum subspecies and absent from
all other sequenced F. nucleatum strains. Most of these subspecies-specific orthogroups contain genes
of unknown function. For F. nucleatum animalis, subspecies-specific orthogroups include those
encoding a YadA family adhesin, a RelB/Stbe addiction module toxin, several transporters, a
bacterial surface protein, and several proteins with likely roles in gene regulation. F. nucleatum
vincentii-specific orthogroups include a TonB-dependent receptor component, several
transporters, a YadA family adhesin, and an orthogroup with MORN2 domains. F. nucleatum
nucleatum-specific orthogroups include those encoding a MORN2 protein, and several
transporters. F. nucleatum polymorphum-specific orthogroups include those encoding two
MORN2 proteins, several transporters, regulators, and a colicin. Orthogroups missing in
individual subspecies also encode transporters, MORN2 proteins, autotransporter β-domain-containing
proteins (the β-domain is a component of RadD-family adhesins), and regulatory proteins,
as well as many proteins of unknown function.
F. periodonticum. Orthogroups present only in all four F. periodonticum strains include
3 MORN2-containing orthogroups, two orthogroups encoding extracellular solute-binding
proteins, several orthogroups encoding transporters, several encoding regulatory proteins, an
orthogroup encoding a TonB family protein, and an orthogroup encoding a tetratricopeptide-
repeat (TPR) containing protein. TPR containing proteins contain a set of repeats, which can
fold together to form a solenoid domain (6). Many F. periodonticum-specific orthogroups
include genes with no functional annotation.
F. necrophorum. Orthogroups present only in both F. necrophorum strains include three
orthogroups encoding either YadA family or haemagglutinin adhesins, an orthogroup encoding
a TonB family protein, two orthogroups encoding extracellular solute-binding proteins, two
orthogroups encoding haemolysin secretion/activation proteins, an orthogroup with genes
encoding a POTRA domain (POTRA domains are known to be involved in the production of
haemolysins (7)), an orthogroup encoding an Omptin domain (Omptins are known to be involved
in pathogenesis in other organisms (8)), and several orthogroups containing transporters.
Many F. necrophorum-specific orthogroups include genes with no functional annotation.
F. gonidiaformans. There are fewer F. gonidiaformans-specific orthogroups because
these are the two smallest genomes in our data set. Orthogroups present only in both F.
gonidiaformans strains include two groups encoding transporters, a group encoding a
regulator, and several encoding proteins of unknown function.
F. ulcerans. Orthogroups present only in both F. ulcerans strains include 17 orthogroups
encoding bacterial DNA-binding proteins with the Pfam domain PF00216, nine orthogroups
encoding regulatory proteins, two orthogroups encoding FadA adhesin domains, 15
orthogroups with EAL and GGDEF signaling domains, and four orthogroups encoding
extracellular solute-binding proteins.
F. varium and F. mortiferum. Because there is only one representative for each of these
two species in our dataset, all strain-specific orthogroups are included in the list in Table S3.
Most of these contain strain-specific genes with no functional annotation.
Species-specific gene family expansions. We identified expansions of KEGG Pathways,
Gene Ontology categories and Pfam protein domain families within individual species.
F. nucleatum. In this highly invasive species, we observed expansions of gene families
related primarily to membrane components, transporters, and pathogenesis (see Table S4).
Other expansions included several Pfam domains of unknown function (DUF1703, DUF1311,
DUF3601, and DUF1016), as well as a large expansion of MORN2 domains. In F. nucleatum, we
also observed an expansion of genes related to addiction-module toxin-antitoxin systems,
which are usually related to maintaining the stability of extrachromosomal elements, and can
be involved in pathogenesis. We have fully resolved plasmids for the manually finished F.
nucleatum animalis 7_1, F. nucleatum animalis 4_8, and F. nucleatum vincentii 3_1_27.
The most striking subspecies-specific characteristics in F. nucleatum polymorphum are
expansions in amino acid metabolism genes (Table S4). Categories that were highly expanded
include branched chain amino acid biosynthesis, leucine biosynthesis, lysine biosynthesis, and
C5-branched dibasic acid metabolism. We also observed expansions in vitamin B6 metabolism
and pantothenate biosynthesis in this subspecies. In F. nucleatum polymorphum ATCC 10953,
previous research showed that proteins involved in amino acid metabolism change
concentration in alkaline biofilms, along with a shift to more efficient metabolism (9). Other
subspecies-specific gene family expansions are related to transposase activity (see Table S4).
F. periodonticum. Similar to F. nucleatum polymorphum, we observed expansions of
amino acid metabolism genes. Phylogenetic profiles for genes related to amino acid
metabolism varied widely across the organisms in our data set. This was in agreement with
previous studies, which showed that Fusobacterium spp. strains within different niches of the
oral cavity utilize different sets of amino acids (10). Several categories of amino acid
metabolism genes were expanded or only present in F. periodonticum, as well as F. nucleatum
subsp. polymorphum, particularly branched-chain amino acid metabolism (including leucine and
threonine; see Table S4). Pantothenate and CoA biosynthesis also followed this same profile of
being expanded only in F. nucleatum polymorphum and F. periodonticum. We also saw a
striking expansion of the MORN2 Pfam domains. There were also many MORN2 proteins in the
active invaders F. nucleatum, F. ulcerans and F. varium, but the most striking expansion was
observed in F. periodonticum. Each F. periodonticum genome had approximately 50 proteins
containing MORN2 domains.
F. necrophorum. This species had expansions of genes containing adhesion-related
Pfam domains (“haemagglutinin”, “YadA-like C-terminal domain”, and “Hep Hag”). F.
necrophorum also had a significant expansion of TonB-dependent receptor genes. These
membrane proteins sense and transmit signals from outside the cell into the cytoplasm. F.
necrophorum had significantly reduced numbers of genes related to the membrane (GO terms
“intrinsic to membrane”, “integral to membrane”, and “plasma membrane”, as well as the Pfam domain “MORN2”).
F. gonidiaformans. This species was missing many functions relative to the other
organisms, since its two representative genome sequences were the smallest in our data set.
The categories most significantly reduced included those related to the membrane (“membrane
part”, “intrinsic/integral to the membrane”, “plasma membrane”, “outer membrane”, and
“MORN2”) and categories related to transport (“transporter activity”, “transport”, “active
transmembrane transporter activity”, “antiporter activity”, “transmembrane transporter
activity”, “substrate-specific transporter activity”, and “drug transport”). This large reduction in
membrane components was consistent with the fact that F. gonidiaformans is not believed to
be able to invade host cells independently. F. gonidiaformans has approximately one-third as
many genes as other fusobacteria belonging to the “membrane part” GO category.
F. ulcerans and F. varium. These species exhibited very striking expansions of DNA
binding proteins, including transcription factors. The most striking expanded category was
PF00216 (bacterial DNA-binding proteins; Table S4). There were 50-60 of these in each F.
ulcerans and F. varium genome, and fewer than five in other organisms. The exact function of
these domains is not clear, but they are annotated as being histone-like proteins involved in
wrapping and stabilizing DNA. Many categories of transcriptional regulators were highly
expanded in F. ulcerans and F. varium. These species had the largest genomes in our collection
(~3500 genes vs. 2000 in F. nucleatum). Since the number of regulatory proteins tends to go up
as the square of the genome size (11-13), an expansion of regulatory proteins was expected in
the genomes of these two species, just as we would expect fewer regulatory proteins in F.
necrophorum and F. gonidiaformans. Overall, the expansion of regulatory proteins in F. ulcerans
and F. varium (approximately three times as many as in F. nucleatum) is consistent with the
expansion expected from their larger genome sizes.
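As a rough check of this scaling argument, the short sketch below computes the fold-change in regulator count expected under a purely quadratic relationship, using the approximate gene counts quoted above (~3500 vs. 2000). The function name and the fixed exponent of 2 are illustrative assumptions, not part of the original analysis.

```python
def expected_regulator_fold_change(genes_large, genes_small, exponent=2.0):
    """Fold-change in regulator count expected if the number of regulators
    scales as (genome size) ** exponent. exponent=2 is the quadratic
    assumption taken from the text; real exponents may differ by clade."""
    return (genes_large / genes_small) ** exponent


# Approximate gene counts quoted in the text.
fold = expected_regulator_fold_change(3500, 2000)
print(f"Expected fold-change in regulators: {fold:.2f}")
```

Under these assumptions the expected increase is roughly three-fold, in line with the observed difference between F. ulcerans/F. varium and F. nucleatum.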
F. ulcerans and F. varium also contained expansions of the GGDEF and EAL domains,
which are related to signaling, as well as expansions of genes related to metal binding, and
numerous types of transporter domains. The most significantly expanded KEGG pathway was
ko00633 (nitrotoluene degradation). Because these are larger genomes, we saw more
significant gene expansions than gene reductions. However, we did observe that the histidine
metabolism pathway (ko00340) was completely absent in F. ulcerans and F. varium.
F. mortiferum. Because there is only one sequenced F. mortiferum genome, we do not
have additional strains for comparison. In F. mortiferum ATCC 9817, we observed modest
expansions of carbohydrate transporters and sugar phosphotransferase systems as compared
to other fusobacterial species, indicating adaptation to a specific metabolic environment.
Protein families expanded in active invaders cluster in the genome. Using the finished
F. nucleatum genomes, we performed a computational analysis to quantitate the proximity
between FadA, RadD, and MORN2 protein families, and members of the following categories
(see Materials and Methods): 1) all Pfam domains, 2) all gene ontology categories, 3) proteins
containing signal peptides as predicted by SignalP (14), and 4) proteins containing
transmembrane domains predicted by TMHMM (15). These Pfam and gene ontology categories
include groupings related to phage and other mobile elements. We observed that genes
encoding active-specific protein families associated with one another with statistical
significance (Table S6), suggesting that they physically cluster. The most significant clustering
occurred between groups of genes encoding MORN2 proteins (see Table S6 and Figure 3). F.
nucleatum genomes contained 5-9 separate clusters of MORN2 genes, with each cluster
containing 2-4 MORN2 genes, and F. periodonticum genomes contained 7-11 separate clusters
of MORN2 genes, with each cluster containing 4-5 MORN2 genes. There were also differences
in the gene content of MORN2 clusters between strains, even between strains of the same
subspecies, suggesting that these clusters are hotspots for variation.
In addition to physical clustering between different Pfam families expanded in active
invaders, we also observed that the expanded Pfam families localized with other adhesins and
virulence-associated genes. For instance, there were 17 instances where a FadA domain-
containing gene was located near an OmpA family protein. OmpA proteins are membrane-
embedded β-barrels, often involved in bacterial pathogenesis, adhesion, and invasion (16).
Genes encoding proteins with signal peptides, indicating an extracellular role, were highly
clustered with MORN2-, FadA-, and RadD-containing genes, as well as genes related to
transport, cell membrane, cell wall organization, and peptidoglycan biosynthesis. There was
also an association between chorismate mutase (CM)-domain containing proteins and proteins
containing MORN2 domains. In each of the seven finished genomes, all three CM-domains
were either found within the same gene as a MORN2 domain, or were located within two
adjacent genes in the same operon (FN0043, FN0044, and FN0045 in F. nucleatum subsp.
nucleatum ATCC 25586).
We also observed that proteins related to phage and transposition significantly
clustered with MORN2-containing proteins (Table S6). IS elements (see Materials and
Methods) were also 1.6 times more likely to be found within 2 Kb of a MORN2 region than
other genes in the genome. IS elements have previously been shown to associate with outer
membrane proteins in F. nucleatum (17): In F. nucleatum subsp. nucleatum 25586, IS elements
were observed to flank outer membrane proteins, including a RadD family member containing
an autotransporter β-domain. However, prophage predicted by the Phast tool (18) did not
overlap with any MORN2, FadA, or RadD regions, so prophage do not appear to promote
present-day evolution and movement of these regions. Additionally, active-specific genomic
regions did not appear to have been recently horizontally acquired from non-fusobacterial
sources. We searched for genomic islands using the Islandviewer software (19) (see Materials
and Methods), and found very few predictions in these genomes. Together, these data suggest
that MORN2 genes have been carried in Fusobacterium for a long period of time, and are being
actively reshuffled.
Active-specific features present in F. nucleatum from cancerous tumors. To further
validate our lists of active-specific orthogroups, as well as functional categories over-
represented in the active invaders, we analyzed six additional sequenced F. nucleatum genomes
isolated from cancerous tumors (see Materials and Methods). In these cancer strains, we
observed similar copy numbers for MORN2 (27-31 copies), FadA (3-6 copies), and RadD (5-11
copies) family genes as for the active invaders. In addition, all 44 of the active-specific
orthogroups were also present in each of the six cancer strains, supporting the idea that these
six F. nucleatum strains are similar to the other active invader strains in our dataset.
MORN2 domains in other organisms. The massive expansion of MORN2 domains
observed in the active invader clades is a feature highly specific to the Fusobacterium genus.
MORN2 domains are present in about 10% of the >5000 bacterial strains represented in the
Pfam database. However, most of these species contain only a small number of MORN2
domains: only 23 bacteria in the Pfam database contain >100 MORN2 domains, and 22 of these
23 are actively invading Fusobacterium. The only non-fusobacterial species in Pfam containing
over 100 MORN2 domains is Helicobacter bilis ATCC 43879.
The 65 genomes in the Pfam database containing 20-100 MORN2 domains (more than
the passively invading Fusobacterium spp., but fewer than most members of the active invader
clade) comprise a diverse set, including three Leptotrichiaceae (28-60 MORN2 domains); five
Bacteroidetes (Cytophagia, Flavobacteria, and Saprospira, with 26-51 MORN2 domains each);
16 Shewanella strains (each containing 26-29 MORN2s); two Vibrio species (29 and 34
MORN2s); three of the opportunistic pathogens Myroides odoratimimus (20-26 MORN2s) (20);
and two pneumonia-causing Parachlamydia acanthamoebae (21 MORN2s) (21). A variety of
additional organisms contain smaller numbers of MORN2 domains, including numerous Yersinia
pestis strains (9-18 MORN2s); several Shigella strains (9-10 MORN2s); numerous Salmonella
strains (9 MORN2s); numerous Campylobacter strains (6 MORN2s); numerous Neisseria strains
(4 MORN2s); numerous Chlamydia strains (4 MORN2s); several Pseudomonas strains (3
MORN2s); and a small number of E. coli strains (10 MORN2s), including an avian-pathogenic
strain causing extraintestinal colibacillosis. Across all organisms, MORN2 domains are often
found clustered, with multiple copies within a single gene. As for Fusobacterium, the role of
MORN2 domains in these organisms is unclear.
Chromosomal rearrangements in Fusobacterium. Perhaps because of evolution
through duplication in Fusobacterium genomes, genes containing MORN2 domains correlate
with plasticity in genomic architecture. Actively invading Fusobacterium genomes exhibit an
exceptional level of genomic rearrangement (Figure S3), whereas genomes of passive invaders
with fewer MORN2 genes have fewer rearrangements (p=9e-6), as measured by syntenic
fraction (Figure S3; see Methods). The rate observed within some Fusobacterium spp.,
including F. nucleatum, is even higher than the rates observed in H. pylori and Y. pestis, species
known to undergo chromosomal rearrangements at unusually high rates (22-24). Even the
level of rearrangements present within individual subspecies of F. nucleatum is quite high.
The association between MORN2 genes and large-scale synteny breaks, potentially
indicating a causal relationship, was confirmed (p=0.005; see example shown in Figure 3) by
comparing a metric for average syntenic conservation (see Materials and Methods) between all
MORN2-containing orthogroups (0.31) and all orthogroups (0.43) in our set of finished genomes,
in which synteny was less in question. Recombination between non-randomly
distributed, repetitive chromosomal sequences is a widely conserved mechanism to promote
genome diversity in prokaryotes, including H. pylori and M. tuberculosis (24). Therefore, it is
possible that the repetitive MORN2 regions could be driving this elevated rate of chromosomal
rearrangement in Fusobacterium, which could allow the bacteria to rapidly adapt to their varied
ecological niches and diverse environmental stresses within the host.
SUPPLEMENTAL MATERIALS AND METHODS
Sequencing. 21 new draft genomes were produced as part of the reference genome
collection for the Human Microbiome Project at the Broad Institute. Assembly statistics can be
seen in Table S1. Genomes were sequenced using 454. Sample preparation was performed as
described in Lennon et al. (25). We selected five strains for finishing (F. nucleatum subsp.
vincentii 3_1_27, F. nucleatum subsp. vincentii 3_1_36A2, F. nucleatum subsp. animalis 21_1A,
F. nucleatum subsp. animalis 4_8, and F. nucleatum subsp. animalis 7_1). Gaps within scaffolds
were spanned by PCR amplicons that were then Sanger sequenced using end primers or
internal walking primers. Gaps between scaffolds were addressed using a combinatorial PCR
approach where scaffold edge primers were used in amplification pairs with every other
scaffold edge. Reaction products were subjected to Sanger sequencing and reads were
incorporated as appropriate using Consed (26). Three of the finished genomes (F. nucleatum
subsp. vincentii 3_1_27, F. nucleatum subsp. animalis 4_8, and F. nucleatum subsp. animalis
7_1) have plasmids, and F. nucleatum subsp. vincentii 3_1_27 also had a second 373 Kb
chromosome. Assembly statistics for these finished genomes and plasmids can be found in
Table S1.
We also used the following six F. nucleatum strains isolated from cancerous tumors,
sequenced at the Broad Institute after our comparative analysis was initiated, to further
validate our results (NCBI accessions are indicated in parentheses): F. nucleatum CTI-1
(AXNZ01000000), F. nucleatum CTI-2 (AXNY01000000), F. nucleatum CTI-3 (AXNX01000000), F.
nucleatum CTI-5 (AXNW01000000), F. nucleatum CTI-6 (AXNV01000000), and F. nucleatum CTI-7
(AXNU01000000).
Genome annotation. To assure consistency and to reduce artifacts among the genomes
being analyzed, all genomes were re-annotated in a uniform manner using the Broad Institute’s
prokaryotic annotation pipeline. The protein-coding genes were predicted with Prodigal (27)
and filtered to remove genes with >=70% overlap to tRNAs or rRNAs. The tRNAs were
identified by tRNAscan-SE (28). The rRNA genes were predicted using RNAmmer (29). The
gene product names were assigned based on top BLAST hits against the SwissProt protein database
(>=70% identity and >=70% query coverage), and on protein family profile searches against the
TIGRfam HMMER equivalogs. Additional annotation analyses performed include PFAM (30),
TIGRfam (31), KEGG (32), COG (33), GO (34), EC (35), SignalP (14), and TMHMM (15).
Summaries of gene counts can be found in Table S1.
Orthogroup clustering. SYNERGY2 (36-38), available at
http://sourceforge.net/projects/synergytwo/, was used to identify orthogroups in our set of 27
genomes. Orthogroups contain orthologs, which are vertically inherited genes that likely have
the same function, and also possibly paralogs, which are duplicated genes that may have
different function.
Phylogenetic trees and renaming strains. Phylogenetic trees were generated by
applying RAxML (39) to a concatenated alignment of 498 single-copy core orthogroups
(excluding orthogroups with paralogs) across all 27 organisms (including the Leptotrichia
outgroup). Bootstrapping was performed using RAxML’s rapid bootstrapping algorithm. 16S
phylogenetic trees were generated using ClustalW (40) alignments and Phylip’s DNAdist and
Fitch algorithms (41).
Using these trees, we were able to assign organisms to appropriate species and
subspecies and rename them accordingly (see Table S1 and Figure S1a-b). In order to accurately
rename the subspecies within F. nucleatum, we added one additional, recently completed
genome to our analysis in order to clarify phylogenetic relationships (F. nucleatum subsp.
animalis ATCC 51191; this genome was not available when we began our analysis). We
constructed orthogroups using OrthoMCL (42), using the 14 F. nucleatum genomes in Table S1,
plus F. nucleatum subsp. animalis ATCC 51191, and constructed the phylogenetic tree in Figure
S1b.
Since we had no complete genome sequence for the fifth known subspecies of F. nucleatum
(F. nucleatum subsp. fusiforme), we also generated a 16S rRNA-based tree to verify this new
strain naming (see Figure S1c). We used the 16S sequence for a F. nucleatum subsp. fusiforme
strain, as well as 16S sequences for all of the other genomes in our analysis.
ANI and shared-gene analysis. SYNERGY2 orthogroups were used to determine shared
gene content in pairwise genome comparisons. For a genome pair (genome 1 and genome 2),
the total number of genes in genome 1 was determined and the number of genes in genome 1
shared with genome 2 (based on shared ortholog group membership) was determined. Percent
shared gene content was calculated by dividing the number of genome 1 genes shared with
genome 2 by the number of genes in genome 1. Nucleotide alignments of shared genes were
used to determine the numbers of identical and different nucleotide residues in shared genes.
Percent ANI was calculated by dividing the number of identical nucleotide residues in shared
genes by the total number of nucleotide residues.
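The two calculations described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names, the dictionary-based input format, and the decision to skip gapped alignment columns are all assumptions, since the text does not specify how gaps were handled.

```python
def percent_shared_genes(orthogroups_1, orthogroups_2):
    """orthogroups_N maps gene ID -> orthogroup ID for genome N.
    Returns the percentage of genome 1 genes whose orthogroup also occurs
    in genome 2. The denominator is the gene count of genome 1, so the
    measure is not symmetric between the two genomes."""
    groups_2 = set(orthogroups_2.values())
    shared = sum(1 for og in orthogroups_1.values() if og in groups_2)
    return 100.0 * shared / len(orthogroups_1)


def percent_ani(aligned_pairs):
    """aligned_pairs is a list of (seq1, seq2) nucleotide alignments of
    shared genes, each pair of equal length with '-' marking gaps.
    Assumption: columns containing a gap are skipped."""
    identical = 0
    total = 0
    for seq1, seq2 in aligned_pairs:
        for a, b in zip(seq1.upper(), seq2.upper()):
            if a == "-" or b == "-":
                continue
            total += 1
            if a == b:
                identical += 1
    return 100.0 * identical / total


# Hypothetical toy data, for illustration only.
genome_1 = {"g1": "OG1", "g2": "OG2", "g3": "OG3", "g4": "OG4"}
genome_2 = {"h1": "OG1", "h2": "OG2", "h3": "OG9"}
shared_pct = percent_shared_genes(genome_1, genome_2)
ani_pct = percent_ani([("ACGTACGT", "ACGTACGA"), ("AC-GT", "ACTGT")])
print(shared_pct, round(ani_pct, 2))
```

Because the shared-gene percentage is normalized by genome 1, swapping the two genomes generally gives a different value, which matches the directional definition in the text.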
Identification of fusobacterial clades. BAPS 6.0 (43, 44) was used to cluster the 26
fusobacterial genomes. We used a concatenated alignment of single copy core orthogroups
from SYNERGY2 as input, using the BAPS module for genetic mixture analysis of sequences or
linked loci. Using values of k ranging from 3 to 8, we identified an optimum clustering yielding 5
groupings, by examining the output log(ml) values.
Gene families. We selected the gene categories (PFAM, GO, and KEGG) most expanded
or reduced when comparing active and passive invaders by using Fisher’s test (Q<0.05). We
compared members of each species to members of all other species. We compared members
of each F. nucleatum subspecies to members of all other F. nucleatum subspecies. We also
compared a group consisting of active invaders (F. nucleatum, F. periodonticum, F. ulcerans, F.
varium) to the group of passive invaders (F. necrophorum and F. gonidiaformans). We filtered
the results, removing all gene categories where the smaller group contains greater than 150
members, as well as those gene groupings with less than a 20% difference in copy number
between the two sets.
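The statistical step above can be illustrated with a small standard-library sketch. The text states only "Fisher's test (Q<0.05)", so several details here are assumptions: that the test is a two-sided Fisher's exact test on a 2x2 table (genes in the category vs. not in the category, in one group vs. the other), and that Q-values come from a Benjamini-Hochberg correction. All names and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical counts: category genes / other genes in group A and group B.
p_example = fisher_exact_two_sided(8, 2, 1, 5)
q_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q < 0.05 for q in q_example]
print(round(p_example, 4), significant)
```

In practice one p-value would be computed per Pfam, GO, or KEGG category and the whole list corrected together before applying the Q<0.05 cutoff and the copy-number filters described above.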
Comparison of sequence distances for MORN2 domains. Protein alignments for 3570
individual MORN2 domains from Fusobacteriaceae and Leptotrichiaceae were extracted from
the Pfam database (30). This dataset includes 3422 domains from the Fusobacteriaceae and
148 domains from the Leptotrichiaceae (including Sebaldella, Ilyobacter, Leptotrichia, and
Streptobacillus). Sequence identities were computed between all pairwise combinations of
MORN2 domains, using the Pfam alignment. For each gene, an average sequence distance was
computed for all MORN2 domains within 1kb, 2kb, 5kb, and 10kb, and for all MORN2 domains
further away than 10kb in the same genome. Average sequence distances were also computed
for domains located within the same proteins. The distributions of values were compared using
the paired Wilcoxon rank-sum test in R.
Co-localization of proteins containing MORN2 domains with known virulence factors.
In order to examine the co-localization of MORN2, FadA, and RadD family proteins with other
known virulence factors, we performed a computational analysis of the proximity between
members of these three expanded Pfam families, and members of all other gene families. We
performed this analysis for all GO terms, Pfam domains, proteins containing transmembrane
domains predicted by TMHMM (15), and proteins with signal peptides predicted by SignalP
(14). For each gene category, we counted all pairs of genes within a certain distance (1, 2, 5,
10, or 20 genes away) from a MORN2, fadA, or radD gene. In order to calculate an over-
representation statistic, we repeated the same analysis using 1000 randomly chosen sets of
genes of the same size as our sets of MORN2, fadA, or radD genes. For each gene category, we
compared the number of observed instances co-localized with a MORN2, fadA, or radD gene, to
the number observed in our randomly chosen gene sets. We calculated a mean and standard
deviation for our randomly chosen gene sets, and calculated a z-score that relates the number
of nearby pairings found in the test set to the number found in the control set.
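The randomization procedure described above can be sketched as follows. This is an illustrative simplification, not the original implementation: genes are represented only by their index along a single linear replicon (no circular wrap-around and no contig boundaries), and the function names and toy coordinates are hypothetical.

```python
import random
import statistics


def count_nearby_pairs(target_idx, category_idx, window):
    """Count (target gene, category gene) pairs that lie at most `window`
    genes apart in the gene order; a gene is never paired with itself."""
    category = set(category_idx)
    count = 0
    for t in target_idx:
        for c in category:
            if c != t and abs(c - t) <= window:
                count += 1
    return count


def colocalization_z(n_genes, target_idx, category_idx, window,
                     n_random=1000, seed=0):
    """Compare the observed pair count with counts obtained from random
    gene sets of the same size as the target set, and return
    (observed, null mean, null standard deviation, z-score)."""
    observed = count_nearby_pairs(target_idx, category_idx, window)
    rng = random.Random(seed)
    null_counts = []
    for _ in range(n_random):
        random_set = rng.sample(range(n_genes), len(target_idx))
        null_counts.append(
            count_nearby_pairs(random_set, category_idx, window))
    mean = statistics.mean(null_counts)
    sd = statistics.stdev(null_counts)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, mean, sd, z


# Hypothetical gene indices: targets (e.g. MORN2 genes) and one category.
targets = [100, 101, 102, 300, 301]
category = [99, 103, 104, 299, 302, 450]
observed, null_mean, null_sd, z = colocalization_z(
    500, targets, category, window=2)
print(observed, round(null_mean, 2), round(z, 1))
```

The same call would be repeated for each window size (1, 2, 5, 10, or 20 genes) and for each gene category, with 1000 random sets per comparison as in the text.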
Mobilome identification. To predict prophage, we used the Phast (18) algorithm.
Regions with Phast score greater than 70 were considered as potential phage regions in our
analysis. To identify genomic island regions, we used the SigiHMM (45) and Islandpath-DIMOB
(46) algorithms implemented in Islandviewer (19). We did not use Islandviewer’s “IslandPick”
algorithm (47), because there were too few closely related genomes in the IslandPick database.
To locate IS elements, we used ISfinder (48), with an E-value cutoff of 1e-5.
Multiple alignments. In order to examine larger-scale rearrangements in our set of
finished genomes, multiple alignments were constructed using Mauve (49). We generated an
alignment using ProgressiveMauve for the seven finished F. nucleatum genomes. We first used
the Mauve Contig Mover to reorder and reorient the draft genome contigs (50). F. nucleatum
subsp. nucleatum 25586 was used as the reference for the Mauve Contig Mover.
Calculation of Syntenic Conservation. Synteny between genomes was quantified using
syntenic fraction, a metric described by Wapinski (37). For a pair of genes in the same
orthologous cluster, syntenic fraction is the percentage of their neighbors that are orthologous
to each other. A window of 5kb up and downstream of a gene was used to define its neighbors.
The syntenic fraction for a given orthogroup was calculated as the sum of the syntenic fraction
of all orthologous gene pairs in that orthogroup. Self-comparisons were not included in these
averages. An average was calculated for all orthogroups, and compared to the average for all
MORN2 orthogroups only. Rates of genomic rearrangement are highly dependent on
phylogenetic distance.
To compare rates between active and passive invaders, we examined syntenic
conservation between all pairwise combinations of F. gonidiaformans and F. necrophorum spp.
(average phylogenetic distance 0.19 ± 0.001), as well as between all pairwise combinations of F.
nucleatum and F. periodonticum spp. (which have a similar average phylogenetic distance of
0.14 ± 0.004). To compare overall syntenic values between pairs of genomes, we summed the
values for all orthogroups in these genomes. The distributions were compared by t-test to
obtain a p-value.
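To make the syntenic fraction concrete, the sketch below computes it for one pair of orthologous genes. Several choices are assumptions made for illustration, since the text does not spell them out: a neighbor is any other gene overlapping the 5 kb window on either side of the focal gene, and the fraction is taken as shared neighboring orthogroups divided by the union of neighboring orthogroups. The data layout and all names are hypothetical.

```python
WINDOW_BP = 5000  # 5 kb up- and downstream, as stated in the text


def neighbor_orthogroups(genome, gene_id, window=WINDOW_BP):
    """genome is a list of dicts with keys 'id', 'start', 'end', 'og'.
    Returns the orthogroup IDs of all other genes overlapping the window
    around the focal gene (assumption: any overlap counts)."""
    focal = next(g for g in genome if g["id"] == gene_id)
    lo, hi = focal["start"] - window, focal["end"] + window
    return {g["og"] for g in genome
            if g["id"] != gene_id and g["end"] >= lo and g["start"] <= hi}


def syntenic_fraction(genome_a, gene_a, genome_b, gene_b):
    """Fraction of neighboring orthogroups shared by two orthologous genes.
    Assumption: shared orthogroups / union of neighboring orthogroups."""
    na = neighbor_orthogroups(genome_a, gene_a)
    nb = neighbor_orthogroups(genome_b, gene_b)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical toy genomes; a3 and b2 are orthologs (both in OG3).
genome_a = [
    {"id": "a1", "start": 0, "end": 1000, "og": "OG1"},
    {"id": "a2", "start": 1500, "end": 2500, "og": "OG2"},
    {"id": "a3", "start": 3000, "end": 4000, "og": "OG3"},
    {"id": "a4", "start": 4500, "end": 5500, "og": "OG4"},
    {"id": "a5", "start": 20000, "end": 21000, "og": "OG5"},
]
genome_b = [
    {"id": "b1", "start": 0, "end": 1000, "og": "OG2"},
    {"id": "b2", "start": 1200, "end": 2200, "og": "OG3"},
    {"id": "b3", "start": 2500, "end": 3500, "og": "OG4"},
    {"id": "b4", "start": 4000, "end": 5000, "og": "OG7"},
]
fraction = syntenic_fraction(genome_a, "a3", genome_b, "b2")
print(fraction)
```

Per-orthogroup values would then be obtained by combining this quantity over all orthologous gene pairs in the orthogroup, excluding self-comparisons, as described above.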
SUPPLEMENTAL REFERENCES
1. Citron DM. 2002. Update on the taxonomy and clinical aspects of the genus Fusobacterium. Clin Infect Dis 35:S22-27.
2. Munoz R, Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, Rossello-Mora R. 2011. Release LTPs104 of the All-Species Living Tree. Systematic and applied microbiology 34:169-170.
3. Konstantinidis KT, Tiedje JM. 2005. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A 102:2567-2572.
4. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 57:81-91.
5. Luo Y, Liu Y, Sun D, Ojcius DM, Zhao J, Lin X, Wu D, Zhang R, Chen M, Li L, Yan J. 2011. InvA protein is a Nudix hydrolase required for infection by pathogenic Leptospira in cell lines and animals. J Biol Chem 286:36852-36863.
6. Blatch GL, Lassle M. 1999. The tetratricopeptide repeat: a structural motif mediating protein-protein interactions. Bioessays 21:932-939.
7. Sanchez-Pulido L, Devos D, Genevrois S, Vicente M, Valencia A. 2003. POTRA: a conserved domain in the FtsQ family and a class of beta-barrel outer membrane proteins. Trends Biochem Sci 28:523-526.
8. Hritonenko V, Stathopoulos C. 2007. Omptin proteins: an expanding family of outer membrane proteases in Gram-negative Enterobacteriaceae. Mol Membr Biol 24:395-406.
9. Chew J, Zilm PS, Fuss JM, Gully NJ. 2012. A proteomic investigation of Fusobacterium nucleatum alkaline-induced biofilms. BMC Microbiol 12:189.
10. Gharbia SE, Shah HN. 1991. Comparison of the amino acid uptake profile of reference and clinical isolates of Fusobacterium nucleatum subspecies. Oral Microbiol Immunol 6:264-269.
11. Cases I, de Lorenzo V, Ouzounis CA. 2003. Transcription regulation and environmental adaptation in bacteria. Trends Microbiol 11:248-253.
12. Molina N, van Nimwegen E. 2009. Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet 25:243-247.
13. van Nimwegen E. 2003. Scaling laws in the functional content of genomes. Trends Genet 19:479-484.
14. Petersen TN, Brunak S, von Heijne G, Nielsen H. 2011. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8:785-786.
15. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567-580.
16. Confer AW, Ayalew S. 2013. The OmpA family of proteins: roles in bacterial pathogenesis and immunity. Vet Microbiol 163:207-222.
17. Kapatral V, Anderson I, Ivanova N, Reznik G, Los T, Lykidis A, Bhattacharyya A, Bartman A, Gardner W, Grechkin G, Zhu L, Vasieva O, Chu L, Kogan Y, Chaga O, Goltsman E, Bernal A, Larsen N, D'Souza M, Walunas T, Pusch G, Haselkorn R, Fonstein M, Kyrpides N, Overbeek R. 2002. Genome sequence and analysis of the oral bacterium Fusobacterium nucleatum strain ATCC 25586. J Bacteriol 184:2005-2018.
18. Zhou Y, Liang Y, Lynch KH, Dennis JJ, Wishart DS. 2011. PHAST: a fast phage search tool. Nucleic Acids Res 39:W347-352.
19. Langille MG, Brinkman FS. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics 25:664-665.
20. Benedetti P, Rassu M, Pavan G, Sefton A, Pellizzer G. 2011. Septic shock, pneumonia, and soft tissue infection due to Myroides odoratimimus: report of a case and review of Myroides infections. Infection 39:161-165.
21. Greub G. 2009. Parachlamydia acanthamoebae, an emerging agent of pneumonia. Clin Microbiol Infect 15:18-28.
22. Darling AE, Miklos I, Ragan MA. 2008. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4:e1000128.
23. Lara-Ramirez EE, Segura-Cabrera A, Guo X, Yu G, Garcia-Perez CA, Rodriguez-Perez MA. 2011. New implications on genomic adaptation derived from the Helicobacter pylori genome comparison. PLoS One 6:e17300.
24. Aras RA, Kang J, Tschumi AI, Harasaki Y, Blaser MJ. 2003. Extensive repetitive DNA facilitates prokaryotic genome plasticity. Proc Natl Acad Sci U S A 100:13579-13584.
25. Lennon NJ, Lintner RE, Anderson S, Alvarez P, Barry A, Brockman W, Daza R, Erlich RL, Giannoukos G, Green L, Hollinger A, Hoover CA, Jaffe DB, Juhn F, McCarthy D, Perrin D, Ponchner K, Powers TL, Rizzolo K, Robbins D, Ryan E, Russ C, Sparrow T, Stalker J, Steelman S, Weiand M, Zimmer A, Henn MR, Nusbaum C, Nicol R. 2010. A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biol 11:R15.
26. Gordon D, Abajian C, Green P. 1998. Consed: a graphical tool for sequence finishing. Genome Res 8:195-202.
27. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.
28. Lowe TM, Eddy SR. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955-964.
29. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-3108.
30. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. 2008. The Pfam protein families database. Nucleic Acids Res 36:D281-288.
31. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O. 2001. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res 29:41-43.
32. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27:29-34.
33. Tatusov RL, Koonin EV, Lipman DJ. 1997. A genomic perspective on protein families. Science 278:631-637.
34. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21:3674-3676.
35. Tian W, Arakaki AK, Skolnick J. 2004. EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference. Nucleic Acids Res 32:6226-6239.
36. Griggs A, Wapinski I, Wortman J, Haas B. 2014. SYNERGY2: Accurate and scalable ortholog identification. In preparation.
37. Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics 23:i549-558.
38. Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54-61.
39. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690.
40. Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
41. Felsenstein J. 1989. PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5:164-166.
42. Li L, Stoeckert CJ, Jr., Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189.
43. Corander J, Marttinen P, Siren J, Tang J. 2008. Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics 9:539.
44. Corander J, Tang J. 2007. Bayesian analysis of population structure based on linked molecular information. Math Biosci 205:19-31.
45. Waack S, Keller O, Asper R, Brodag T, Damm C, Fricke WF, Surovcik K, Meinicke P, Merkl R. 2006. Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7:142.
46. Hsiao W, Wan I, Jones SJ, Brinkman FS. 2003. IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics 19:418-420.
47. Langille MG, Hsiao WW, Brinkman FS. 2008. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics 9:329.
48. Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. 2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 34:D32-36.
49. Darling AE, Mau B, Perna NT. 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5:e11147.
50. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. 2009. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 25:2071-2073.