Environmental Microbiology (2005) 7(12), 2011–2026
doi:10.1111/j.1462-2920.2005.00918.x

© 2005 Society for Applied Microbiology and Blackwell Publishing Ltd

Lateral gene transfer and phylogenetic assignment of environmental fosmid clones

Camilla L. Nesbø,1* Yan Boucher,2 Marlena Dlutek1 and W. Ford Doolittle1

1 Department of Biochemistry and Molecular Biology, Dalhousie University and Genome Atlantic, 5850 College Street, Halifax, Nova Scotia, Canada, B3H 1X5.
2 Department of Biological Sciences, Macquarie University, Sydney, NSW, Australia.

Received 29 April, 2005; accepted 1 August, 2005. *For correspondence. E-mail [email protected]; Tel. (+1) 902 494 2968; Fax (+1) 902 494 1355.

Summary

Metagenomic data, especially sequence data from large insert clones, are most useful when reasonable inferences about phylogenetic origins of inserts can be made. Often, clones that bear phylotypic markers (usually ribosomal RNA genes) are sought, but sometimes phylogenetic assignments have been based on the preponderance of BLAST hits obtained with predicted protein coding sequences (CDSs). Here we use a cloning method which greatly enriches for ribosomal RNA-bearing fosmid clones to ask two questions: (i) how reliably can we judge the phylogenetic origin of a clone (that is, its RNA phylotype) from the sequences of its CDSs? and (ii) how much lateral gene transfer (LGT) do we see, as assessed by CDSs of different phylogenetic origins on the same fosmid? We sequenced 12 rRNA-containing fosmid clones, obtained from libraries constructed using DNA isolated from Baltimore harbour sediments. Three of the clones are from bacterial candidate divisions for which no cultured representatives are available, and thus represent the first protein coding sequences from these major bacterial lineages. The amount of LGT was assessed by making phylogenetic trees of all the CDSs in the fosmid clones and comparing the phylogenetic position of the CDS to the rRNA phylotype. We find that the majority of CDSs in each fosmid, 57–96%, agree with their respective rRNA genes. However, we also find that a significant fraction of the CDSs in each fosmid, 7–44%, has been acquired by LGT. In several cases, we can infer co-transfer of functionally related genes, and generate hypotheses about mechanism and ecological significance of transfer.

Introduction

Metagenomics, or culture-independent genome analyses, is increasingly being used in microbial ecology studies (Riesenfeld et al., 2004). In one of the first metagenome studies, Rondon and colleagues (2000) isolated DNA from soil and cloned it into a BAC vector to construct a ‘soil metagenome’ library. They screened the library for expression of heterologous genes from the inserts and found antibacterial, lipase, amylase, nuclease and haemolytic activities. In another pioneering study, Beja and colleagues (2000) identified a novel type of rhodopsin, proteorhodopsin, on a genomic fragment from an uncultured γ-proteobacterium. This novel type of phototrophy in proteobacteria plays an important role in marine ecosystems (Beja et al., 2001; de la Torre et al., 2003), and DeLong and collaborators have shown how readily genomic, physiological and ecological data can be incorporated into a new interdisciplinary science, ‘environmental genomics’.

In metagenomic libraries, clones containing a phylogenetic anchor such as rRNA genes are particularly useful, as the identification of the cloned fragment’s original host is greatly facilitated (Riesenfeld et al., 2004). One significant problem associated with efficient screening of BAC and fosmid libraries for bacterial rDNA-containing clones is the presence of DNA from the host used in cloning (i.e. Escherichia coli DNA). This hinders detection with the commonly used universal bacterial 16S rRNA primers, and several different alternative screening procedures have been developed for identifying rRNA-containing clones. For instance, Suzuki and colleagues (2004) used a screening method based on length heterogeneity of the internal transcribed spacer (ITS) region, as well as the presence and location of tRNA-Ala, to identify rRNA gene-containing BAC clones. In some cases (de la Torre et al., 2003), phylogenetic origins of large-insert clones which lack phylogenetic anchors have been inferred from the preponderance of best BLAST hits to GenBank sequences of known phylogenetic origin.

Here we have identified rRNA clones by utilizing 23S rRNA intron-encoded homing endonucleases. Such endonucleases are encoded by group I introns that are

found in the 23S rRNA of eukaryotic chloroplasts and mitochondria (Cannone et al., 2002), as well as a few bacteria (Nesbø and Doolittle, 2003). They very specifically cleave conserved sequences in intron-free 23S rRNA genes. The recognition sites are usually located in the most conserved part of the host rRNA gene and, being 15–25 bp long, are highly specific to rRNA genes. The slow evolutionary rate of these sites, as well as the tolerance for minor sequence changes by homing endonucleases, means that rRNA genes from a wide range of taxa can usually be cut. Three such enzymes, belonging to the LAGLIDADG family, have been particularly well characterized: I-CeuI (from the chloroplast 23S gene of Chlamydomonas eugametos) (Marshall and Lemieux, 1992), I-CreI (from the chloroplast 23S gene of Chlamydomonas reinhardtii) (Chevalier et al., 2003) and I-DmoI (from the 23S gene of the crenarchaeon Desulphurococcus mobilis) (Aagaard et al., 1997). These enzymes have different cutting sites and specificities. The enzyme used here, I-CeuI, is commercially available from New England Biolabs (http://www.neb.com) and targets a 19-bp cut site at 23S rRNA position 1923 (relative to the E. coli 23S rRNA) that is conserved in most bacteria.
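The tolerance of homing endonucleases for minor sequence changes can be pictured as a mismatch-tolerant search for the recognition site. The following is only an illustrative sketch: the site string is a made-up 19-bp placeholder, not the actual I-CeuI recognition sequence, and the mismatch threshold is an arbitrary assumption rather than a measured property of the enzyme.

```python
def find_sites(sequence, site, max_mismatches=2):
    """Return 0-based start positions where `site` matches `sequence`
    with at most `max_mismatches` substitutions (gaps are not considered)."""
    sequence, site = sequence.upper(), site.upper()
    hits = []
    for start in range(len(sequence) - len(site) + 1):
        window = sequence[start:start + len(site)]
        mismatches = sum(1 for a, b in zip(window, site) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Hypothetical 19-bp placeholder (NOT the real I-CeuI recognition sequence).
PLACEHOLDER_SITE = "ACGTTGCAAGGCTTACGGA"

# A toy "23S fragment" carrying the placeholder site with one substitution.
fragment = "TTT" + "ACATTGCAAGGCTTACGGA" + "CCC"
print(find_sites(fragment, PLACEHOLDER_SITE, max_mismatches=1))
```

Raising `max_mismatches` in this toy model mimics an enzyme that still cuts slightly divergent sites, at the cost of more spurious matches.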

Here we present the sequences of 12 environmental fosmid clones: 10 that contain about 1000 bp of the 23S rRNA gene, one that contains both 23S rRNA and 16S rRNA genes, and one that contains 1079 bp of the 16S rRNA gene. The metagenomic libraries containing these clones were constructed using DNA isolated from anaerobic sediments from Baltimore harbour. Microbial communities from these sediments have been shown earlier to be capable of reductive dechlorination of PCBs (Holoman et al., 1998). The taxonomic position of each fosmid clone was assessed using its rRNA gene. The amount of lateral gene transfer (LGT) was assessed by making phylogenetic trees of all the predicted protein coding sequences (CDSs) in the fosmid clones and comparing the phylogenetic position of the CDS to that indicated by the rRNA gene.

Results and discussion

Two different types of fosmid libraries were made from the anaerobic sediment DNA. The first used the pCCFos vector from Epicenter (B1BF1) and the second a modification of that vector containing an I-CeuI site for specifically cloning DNA fragments containing 23S rRNA genes (B1BCF1, B1DCF1, B3CF5, B1DCF5). The B1BF1 library contained about 10 000 clones. The I-CeuI libraries were considerably smaller, and we identified only 49 clones with unique 23S rRNA end-sequences. However, assuming one to three bacterial rRNA-containing clones among every 100 clones (Suzuki et al., 2004), and considering that not all bacterial 23S rRNA are cut by I-CeuI, the number of clones obtained is close to expected values.

End-sequencing and subcloning analyses of ‘normal’ fosmids

In order to get information on the diversity of genomic fragments captured in the B1BF1 library, we obtained 576 end-sequences, resulting in 565 unique sequences with an average of 408 high-quality base pairs, corresponding to 232 kb of environmental DNA. Among the sequences we identified a 16S rRNA sequence in B1BF110d03, which was fully sequenced (see below), as well as one 23S rRNA-containing clone. We also attempted to identify 23S rRNA-containing clones by screening ten 96-well plates from the B1BF1 library using I-CeuI. However, of four clones that appeared to be cut by I-CeuI, only one proved to contain 23S rRNAs and was fully sequenced (B1BF1a01, see below).

The distribution of G+C content of the end-sequences, significant hits to proteins in GenBank (based on BLASTX results with e-values < 1e−10), as well as matches to proteins that have been assigned to COG categories are shown in supplemental Fig. S1A–C. As observed by Treusch and colleagues (2004), the distribution of the COG categories is similar to what is observed for single genomes of cultivated organisms, suggesting that this represents an average of the genomes in this habitat. Taken together, the large G+C content variation, as well as the wide functional and phylogenetic diversity of the sequences, suggests that we have sampled sequences originating from a large diversity of genomes.
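For readers who want to reproduce a G+C distribution of this kind on their own end-sequences, a minimal sketch is given below. It assumes plain nucleotide strings as input and ignores ambiguous bases such as N; the function name and the example reads are illustrative and not taken from the study.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C.
    Ambiguity codes such as N are ignored; returns 0.0 if none remain."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Hypothetical end-sequences, only to show the spread one might tabulate.
reads = ["ATATATGCAT", "GGCCGGCCAT", "GCGCNNAT"]
for read in reads:
    print(read, round(100 * gc_content(read), 1))
```

Binning these percentages across all reads gives the kind of histogram referred to in the text.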

End-sequencing of I-CeuI fosmid libraries

Four libraries were made using the pCC1FOSCeuI23S vector. High-quality end-sequences of 91 clones revealed 62 unique clones, of which at least 49 (79%) contained 1000 bp of 23S rRNA. Eight clones (12.9%) did not contain a 23S rRNA, and for five clones we could not obtain high-quality sequence from the end that should contain the 23S rRNA. The 23S rRNA fragments showed highest similarity in BLASTN searches to sequences from several different bacterial groups: α-proteobacteria (1), δ-proteobacteria (15), γ-proteobacteria (9), Firmicutes (8), Planctomycetes (6), Bacteroidetes (1), Actinobacteria (1), Spirochaetes (1), Chlamydiae/Verrucomicrobia (6). These sequence tags were not long enough to make well-supported phylogenetic trees (average sequence length = 431 bp); however, this gives a rough indication of the diversity captured by this method.

These results demonstrate the efficiency of using intron-encoded endonucleases to specifically clone rRNA-containing DNA fragments. Screening a ‘normal’ fosmid library for rRNA genes usually leads to about 1% of positive clones (Suzuki et al., 2004). Although our pCC1FOSCeuI23S libraries contained only 62 unique clones, at least 49 of them (79%) contained a 23S gene, which is the equivalent of screening 4900 clones from a ‘normal’ fosmid library. Also, the peripheral location of the rDNA on the DNA fragments greatly facilitates screening and sequencing. The method is also unlikely to have the same biases as polymerase chain reaction (PCR) screening for the recovery of rDNA-containing clones. For I-CeuI, recovery of positives is based on a single ∼20 bp DNA region (rather than two for PCR) and allows for a different type of degeneracy of this DNA region (drop in cutting efficiency for divergent sequences). Although the I-CeuI recognition sequence is specific to bacteria (except Actinobacteria), other homing endonucleases such as I-CreI and I-DmoI could be used to recover archaeal and actinobacterial DNA fragments.
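The enrichment figures quoted above follow from simple arithmetic, sketched here. The 1% background rate is the approximate value cited from Suzuki et al. (2004); the function name is illustrative.

```python
def equivalent_normal_screen(positives, normal_positive_rate):
    """Number of clones from an unenriched library one would expect to
    screen in order to find the same number of rRNA-positive clones."""
    return positives / normal_positive_rate


unique_clones = 62       # unique clones in the I-CeuI libraries
positives = 49           # clones confirmed to carry a 23S rRNA gene
normal_rate = 0.01       # about 1% positives in a 'normal' fosmid library

enriched_rate = positives / unique_clones
print(round(100 * enriched_rate))                                # 79
print(round(equivalent_normal_screen(positives, normal_rate)))   # 4900
```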

Phylogenetic analyses of rRNA genes demonstrate the recovery of protein-coding genes from a wide diversity of bacterial lineages

Twelve rRNA-containing fosmid clones were fully sequenced. The annotation of these clones is given in Tables S1–12 (see Supplementary material), and Fig. 1 gives an overview of the fosmid clones’ phylogenetic affiliation. Figure 2A shows the phylogenetic trees estimated from the 1000 bp 23S rRNA from the I-CeuI fosmids; for seven of the fosmid clones, this 23S tag could be used to assign the clone to a specific bacterial lineage. We have 23S rRNA-containing fragments from two δ-proteobacteria, two γ-proteobacteria, one ε-proteobacterium, one β-proteobacterium (from this we also have the 16S rRNA) and one taxon from the phylum Chloroflexi. Two fosmid clones contained 16S rRNA genes – B1BF110d03 and B1BF11a01 – and phylogenetic analyses placed these sequences within the Flavobacteriaceae and β-proteobacteria, respectively (Figs 2B and C).

For four fosmid clones – b1bcf11f04, b1dcf51c12, b3cf12f09 and b1dcf55a06 – the 23S rRNA tag did not cluster with any specific 23S lineage. For these clones we attempted to obtain the 16S rRNA sequence by using one specific 23S rRNA primer and a universal 16S primer. We successfully obtained four 16S-23S rRNA sequences that showed 98–99% identity to the 23S fragment in b11bcf11f04 (715 bp overlap). One of these amplicons was fully sequenced, and phylogenetic analyses showed that it belongs to the candidate division WS3 (Dojka et al., 1998) (Fig. 2B). Because b1dcf51c12 clusters significantly with b1bcf11f04 in the 23S rRNA tree (Fig. 2A), we also assigned this clone to the WS3 division. For b3cf12f09 we obtained two different 16S-23S rRNA clones that showed 100% and 99% identity to the 23S fragment of this clone (281 bp overlap), and phylogenetic analyses showed that this clone belongs to the candidate division OP8 (Hugenholtz et al., 1998) (Fig. 2B). The ITS regions of both the WS3 and the OP8 rRNA operons contained tRNA-Ile and tRNA-Ala. For b1dcf51a06, no 16S rRNA sequence could be obtained.

Most protein coding sequences are in agreement with the adjacent rRNA genes in phylogenetic analyses

Phylogenetic trees were obtained for all predicted CDSs of each fosmid clone sequenced. We compared the phylogenetic placement of each CDS to the phylogeny suggested by the rRNA. If the phylogeny of the CDS suggested that it belonged to another bacterial group, and this relationship was supported in bootstrap analyses, acquisition by LGT was inferred for the CDS. For the clone where no specific phylogenetic relationship could be inferred (b1dcf51a06), and for the fosmid clones where the rRNA showed that it originated from a bacterium with no cultivated representative, we classified as likely instances of LGT all CDSs that did cluster specifically (with bootstrap support) with another bacterial group. A summary of the phylogenetic analysis of the rRNA genes, as well as of all protein coding CDSs, is given in Table 1.
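The decision rule described in this paragraph can be written down as a small classifier. This is a sketch under stated assumptions, not the authors' pipeline: the `CdsTree` record, the lineage labels and the 70% bootstrap cut-off are all illustrative choices, since the text only requires that the relationship be "supported in bootstrap analyses".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CdsTree:
    name: str
    clusters_with: Optional[str]  # lineage the CDS groups with, or None
    bootstrap: float              # bootstrap support (%) for that grouping


def classify(cds, rrna_lineage, min_bootstrap=70.0):
    """Label one CDS relative to the rRNA phylotype of its fosmid.
    The 70% threshold is an assumption made for illustration only."""
    if cds.clusters_with is None or cds.bootstrap < min_bootstrap:
        return "unresolved"
    if cds.clusters_with == rrna_lineage:
        return "congruent"
    return "LGT"


# Hypothetical CDSs on a fosmid whose rRNA is betaproteobacterial.
trees = [
    CdsTree("ORF1", "Betaproteobacteria", 95.0),
    CdsTree("ORF2", "Archaea", 88.0),
    CdsTree("ORF3", "Firmicutes", 40.0),
    CdsTree("ORF4", None, 0.0),
]
for t in trees:
    print(t.name, classify(t, "Betaproteobacteria"))
```

For a fosmid from a candidate division, the same function applies if `rrna_lineage` is set to the candidate division name: any CDS that clusters with support inside another named group is then labelled as likely LGT, which mirrors the rule stated for those clones.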

The majority of the CDSs did agree with their respective rRNA phylogeny, and 57–96% (average 76.8%) of the CDSs that gave good alignments and robust phylogenies showed the same phylogenetic relationship as the rRNA genes. This was also true for the fosmid clones from bacterial lineages with no cultivated representative, as most CDSs from these clones did not cluster with any specific lineage or had no or only a few significant matches in GenBank (Fig. 1). However, for these clones the number of CDSs that robustly agree with the rRNA topology is problematic to calculate, as they may or may not fall into well-supported clades when more sequences from these phyla become available. The fosmid clones with the highest number of congruent trees are b1bf11a01, which originated from a β-proteobacterium very similar to Thiobacillus denitrificans, where 96% of the CDSs with robust phylogenies agree with the rRNA genes, and b1dcf13c08, which originated from an ε-proteobacterium, where 90% of the ‘treeable’ CDSs agree with the rRNA.
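These agreement levels bear on the first question posed in the Summary, namely whether a clone's origin can be judged from its CDSs. One simple way to picture a "preponderance" rule is a majority vote over per-CDS lineage calls. The sketch below is an assumption-laden illustration: the function name, the 50% threshold and the example counts are hypothetical and do not reproduce any procedure from the study.

```python
from collections import Counter


def assign_by_preponderance(cds_lineages, min_fraction=0.5):
    """Return the lineage supported by the largest share of informative
    CDSs, or None if no lineage reaches `min_fraction` of them.
    Entries of None stand for CDSs with no informative placement."""
    informative = [lin for lin in cds_lineages if lin is not None]
    if not informative:
        return None
    lineage, count = Counter(informative).most_common(1)[0]
    return lineage if count / len(informative) >= min_fraction else None


# Hypothetical fosmid: 8 CDSs agree with the rRNA phylotype, 2 look
# foreign, and 3 have no informative placement.
example = ["Betaproteobacteria"] * 8 + ["Archaea"] * 2 + [None] * 3
print(assign_by_preponderance(example))
```

With agreement in the 57–96% range, a rule of this kind would usually recover the rRNA phylotype, but it says nothing about clones in which most CDSs are uninformative.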

High levels of LGT detected in phylogenetic analyses

Phylogenetic analyses showed that 7–44% (average 17%) of the CDSs have been acquired by LGT from distantly related bacterial lineages (Fig. 1, Table 1). For many of the fosmid clones there were additional CDSs that probably also have been involved in LGT; these cases were not scored as LGT either because the CDS was too


Fig. 1. Overview of the sequenced fosmid clones. Yellow CDSs are suggested to have been acquired by LGT and blue CDSs have no significant match in GenBank. A, α-proteobacteria; B, β-proteobacteria; D, δ-proteobacteria; E, ε-proteobacteria; G, γ-proteobacteria; C, Cyanobacteria; CB, Chlorobi-Bacteroidetes; F, Firmicutes; P, proteobacteria; CH, Chloroflexi; TD, Thermus-Deinococcus group; ACT, Actinobacteria; PL, Planctomycetes; SPIR, Spirochaetes; THER, Thermotogales; AQ, Aquificales; FUSO, Fusobacteria; ARCH, Archaeal; EUK, Eukaryotes; ENV, environmental sequence; cluster robustly within a mixed clade in phylogenetic trees; –, no significant match in GenBank. Uppercase, supported by phylogenetic analysis. Lowercase, suggested by BLAST searches as there was no supported phylogeny. The low-quality region in b1dcf13c08 (position 1192–1342) is indicated by a black box. The orange shading indicates LGT-CDSs that are found in more than one fosmid.

Labels in the figure: ORFAN; a conjugative transposon obtained from a Bacteroides bacterium.

Clones shown (phylotype, clone): unknown, b1dcf51a06; Chloroflexi, b1dcf13f01; candidate division OP8, b3cf12f09; candidate division WS3, b1bcf11f4; candidate division WS3, b1bcf51c12; δ-proteobacteria, b1bcf11h03; δ-proteobacteria, b1bcf11d04; ε-proteobacteria, b1dcf13c08; γ-proteobacteria, b1dcf12d07; γ-proteobacteria, b1bcf11c04; β-proteobacteria, b1bf11a01; Flavobacteriaceae, b1bf110d03.


Fig. 2. rRNA phylogenies.
A. The minimum evolution tree estimated from LogDet distances of the 23S tag from the CeuI fosmids (984 positions in alignment). For the sequences from the fosmid clones, the Maximum Likelihood topology was similar (GTR + G + I), except that the δ-proteobacteria were paraphyletic, with the γ- and β-proteobacteria clustering within the δ-proteobacteria. Moreover, b1bcf11d04 fell at the bottom of this clade.
B. The minimum evolution tree estimated from LogDet distances of the 16S sequences (1243 positions in alignment). For the sequences from the fosmid clones, the Maximum Likelihood (GTR + G + I) topology was identical. However, there were several differences in the backbone of the tree, with, for instance, Geobacter clustering with Firmicutes. The trees in both A and B were rooted by the Thermotoga maritima sequence.
C. The minimum evolution tree estimated from LogDet distances of the closest matches of the 16S fragment in b1bf110d03 (1046 positions in alignment). The Maximum Likelihood (GTR + G + I) topology was identical.
For all three trees, numbers on branches refer to bootstrap values from the minimum evolution analysis (italic) and from the Maximum Likelihood analysis (plain text). If both bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 90%.


short to obtain reliable alignments, the CDS was found in a ‘mixed’ clade also containing genes from the same bacterial group, or the CDS was found outside its group but did not cluster with any specific lineage. For three of the clones, more than 30% of the CDSs have been acquired by LGT (Table 1); two of these are from candidate divisions and one is from a δ-proteobacterium. For all three of these fosmids there appears to have been a transfer of a large island of genes from a phylogenetically distant lineage. Specifically, we infer an α-proteobacterial island in b3cf12f09, a δ-proteobacterial island in b1dcf51c12 and an archaeal/β-proteobacterial island in b1bcf11d04 (Fig. 1). It should be noted that the proportions of foreign genes identified here might not represent the proportion of foreign genes in the respective genomes that we have sampled, but rather the amount of LGT to be expected when sequencing environmental fosmid clones. For instance, in some genomes LGT might be enriched in certain variable parts of the genome. Indeed, the distribution of proteins that match COG categories was significantly different (P = 1.3 e-13 in a χ2-test) to what we observed for the end-sequencing of ‘normal’ fosmid clones (supplemental Fig. S1), the main difference being proportionally more J, K, U, F and H category sequences in the full fosmid sequences and more L, P, R and S category sequences among the end-sequences. When comparing the distributions of different COG groups (i.e. informational, metabolism, etc.), however, the two data sets were significantly different only when including the poorly characterized categories (R, S). If such genes are more frequently transferred than the other categories, then we would be underestimating the level of LGT that would be expected when analysing metagenomic clones.
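A comparison of two COG category distributions of this kind is typically a chi-square test of homogeneity on a 2 × k table of counts. The sketch below computes only the Pearson statistic and its degrees of freedom in plain Python; the category counts are invented for illustration and are not the study's data, and converting the statistic to a P-value would need a chi-square distribution function from a statistics library.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for comparing
    two category-count dictionaries (a 2 x k contingency table)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one count")
    grand = total_a + total_b
    stat = 0.0
    used = 0
    for cat in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(cat, 0)
        b = counts_b.get(cat, 0)
        column = a + b
        if column == 0:
            continue  # category absent from both samples
        used += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, used - 1


# Hypothetical COG category counts (NOT the values from the study).
full_fosmids = {"J": 30, "K": 25, "L": 10, "R": 15}
end_sequences = {"J": 15, "K": 20, "L": 25, "R": 40}
stat, dof = chi_square_homogeneity(full_fosmids, end_sequences)
print(round(stat, 2), dof)
```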

Interestingly, in b1bcf11d04 the transfer vector for one of the acquired gene clusters could be identified. ORF6 encodes an acetyl transferase, and ORF8, ORF9 and ORF10 encode subunits for an acyl-CoA synthase – two α-subunits and one β-subunit. Phylogenetic analyses suggested all four CDSs have been acquired by LGT, likely from a β-proteobacterium. The β-proteobacteria have in turn likely acquired the acyl-CoA synthase genes from Archaea (Fig. 3). In support of the archaeal origin of these genes, the acyl-CoA synthase in b1bcf11d04 has similar domain organization to the acetyl-CoA synthase in Pyrococcus spp., with two subunits (Sanchez et al., 2000). Furthermore, these genes have been transferred multiple times and the transfers involved all three domains of life [Fig. 3 (Andersson et al., 2003)]. These transferred CDSs are preceded by one integrase gene (ORF3), a transposase gene (ORF4) and an integrase/transposase gene (ORF5, COG2801, Tra5, which contains an integrase core domain; Table S7), which probably were responsible for transferring this cluster into this genome. The α-proteobacterial island in the b3cf12f09 clone encodes a wide range of different functions and no typical mobile elements were identified. However, as this island extends to the 3′ end of the fosmid, mobile genes might be found further downstream. The first CDS of this island encodes a DnaJ-class chaperone (ORF29), which is truncated at the 5′ end. This pseudogene still shows 65% protein identity to a homologue in Magnetospirillum magnetotacticum (Table S3). Hence, this probably represents a very recent transfer (or rearrangement). Another possibility is that this fosmid might be a chimera. However, the G+C content of the CDSs in the α-proteobacterial island (59.5% G+C) is very similar to the rest of the fosmid (59.6% G+C, supplemental Table S3). Also, further upstream there is a proteobacterial transposase

Fig. 3. Maximum Likelihood phylogeny of acetyl-CoA synthetase (ADP-forming) homologues estimated using PMBML (459 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF8 and ORF10 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one sequence representative. The tree was arbitrarily rooted by Entamoeba histolytica. Numbers on branches refer to bootstrap support obtained from using PMBML in bold, PUZZLEBOOT in plain text and Neighbour-joining in italic. If all bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 80%.

LGT and phylogenetic assignment of metagenomic clones 2017

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd Environmental Microbiology 7 2011ndash2026

Table 1. Summary of phylogenetic analyses.

Columns: Fosmid | Phylogeny: 23S rRNA, 16S rRNA, CDSs | LGT (a): phylogenetic trees, phylogenetic distribution or genome context, total | ORFans (b)

b1dcf51a06
  23S rRNA: No clear affiliation with existing sequences.
  16S rRNA: Could not be amplified.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT (trees; distribution or genome context): ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria. ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only.
  LGT total: 4 CDSs (19% of the total) have been acquired by LGT.
  ORFans: 33 (38%)

b1dcf13f01
  rRNA: Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
  CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with 23S fragment. In addition, 6 CDSs which only hit Chloroflexus aurantiacus.
  LGT (trees; distribution or genome context): Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25). ORF2 has only significant homologues in Crocosphaera watsonii.
  LGT total: 3 CDSs (11% of total) have been acquired by LGT.
  ORFans: 14 (5%)

b3cf12f09
  23S rRNA: Candidate division OP8 bacterium.
  16S rRNA: Candidate division OP8 bacterium.
  CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
  LGT (trees; distribution or genome context): Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT. 8 of these have been acquired from an α-proteobacterium and are found linked. Three CDSs found linked to CDSs where phylogenetic analyses suggest LGT have also likely been acquired by LGT.
  LGT total: 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
  ORFans: 22 (9%)

b1bcf11f04
  23S rRNA: Candidate division WS3 bacterium.
  16S rRNA: Candidate division WS3 bacterium.
  CDSs: Most CDSs agree with the rRNA and does not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
  LGT (trees; distribution or genome context): Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group.
  LGT total: 2 CDSs (9% of total) have been acquired by LGT.
  ORFans: 26 (14%)

b1bcf51c12
  rRNA: Candidate division WS3 bacterium.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT (trees; distribution or genome context): ORF6–ORF11 are also found in b1bcf11h3 in same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes. ORF12 has no homologue in b1bcf11h3 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that also this CDS was transferred as part of with a δ-proteobacterial island.
  LGT total: 8 CDSs (44% of total) have been acquired by LGT. One large 'island' of δ-proteobacterial origin.
  ORFans: 22 (29%)

b1cf111h03
  23S rRNA: δ-Proteobacterium.
  16S rRNA: –
  CDSs: 8 of 13 CDSs (57%) that gives supported phylogenies agree with the fragment originating from a δ-proteobacterium.
  LGT (trees; distribution or genome context): Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria, ORF29 clusters with γ-proteobacteria and ORF30 is found at bottom of clade that contains α-proteobacteria and Actinobacteria.
  LGT total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of B1BCF11h03 to b1dcf51c12 as well to the Chlorobium lineage.
  ORFans: 6 (1%)

b1bcf11d04
  23S rRNA: δ-Proteobacterium.
  16S rRNA: –
  CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
  LGT (trees; distribution or genome context): Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes, the fusA homologue (ORF19), is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well. Three CDSs (ORF3–5) that encode an integrase and two transposases that precedes four of the LGT genes detected in the phylogenetic analysis. ORF7 also likely transferred with ORF3–ORF10. ORF20 and ORF21 have mainly homologues in Firmicutes and is the neighbour of ORF19 that has also been acquired from Firmicutes.
  LGT total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector (the integrase and transposase) for 8 of the transferred genes.
  ORFans: –

b1bcf13c08
  rRNA: ε-Proteobacterium, most closely related to Campylobacter jejuni.
  CDSs: 21 CDSs give supported phylogenies and of these 19 (90%) agree with rRNA.
  LGT (trees; distribution or genome context): ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria. ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
  LGT total: 3 CDSs (7% of total) have been acquired by LGT.
  ORFans: 10 (3%)

b3cf12d07
  rRNA: γ-Proteobacterium. Clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
  CDSs: Only 7 CDSs give supported phylogenies. Of these 4 (57%) agree with rRNA.
  LGT (trees; distribution or genome context): ORF7 cluster within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree. Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
  LGT total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 was not included in estimate due to limited evidence for the transfer of these.
  ORFans: 23 (23%)

b1bcf1c04
  23S rRNA: γ-Proteobacterium.
  16S rRNA: –
  CDSs: 14 CDSs give supported phylogenies and of these 13 (93%) agree with rRNA.
  LGT (trees; distribution or genome context): Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade while ORF30 cluster within β-proteobacteria. Three genes that show uncongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34 have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
  LGT total: 5 CDSs (13% of total) have been acquired by LGT.
  ORFans: 3 (1%)

b1bf11a01
  23S rRNA: β-Proteobacterium, closely related to Thiobacillus denitrificans (98% identity at 23S rRNA).
  16S rRNA: β-Proteobacterium, closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
  CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes.
  LGT (trees; distribution or genome context): One ORF30 (RsuA) cluster with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both fosmid and Thiobacillus denitrificans. ORF29 has no significant homologues in proteobacteria.
  LGT total: 4 CDSs (8% of total) have been acquired by LGT.
  ORFans: 3 (2%)

b1bf110d03
  23S rRNA: –
  16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii.
  CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with 16S fragment.
  LGT (trees; distribution or genome context): ORF5 and ORF10 have no close homologues in other Bacteroides and phylogenetic analysis suggests frequent transfer. ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs likely acquired from relative of Bacteroides thetaiotaomicron.
  LGT total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon not included as it has been transferred within the Bacteroides.
  ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed was counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). Number of CDSs acquired by LGT is shown in bold.
b. ORFans where classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parenthesis is given the proportion of protein coding DNA that has no match in GenBank.

(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA and has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possess it and has since been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophiles (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (for which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly related to the OmpA found in this gene cluster and was not included in the alignment.

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1; Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour, ORF34, clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that have also been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in the close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
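The inverted repeat reported above (two 20-bp arms at 80% identity) can be checked with a simple ungapped comparison between one arm and the reverse complement of the other. The following is a minimal sketch under stated assumptions: the function names and the toy sequences are hypothetical, and the text does not say how the identity was actually calculated.

```python
# Hypothetical helper names; the toy arms below are NOT the fosmid sequence.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def inverted_repeat_identity(left_arm: str, right_arm: str) -> float:
    """Percent identity between left_arm and the reverse complement of
    right_arm. Assumes equal-length, ungapped arms."""
    if len(left_arm) != len(right_arm) or not left_arm:
        raise ValueError("arms must be non-empty and of equal length")
    right_rc = reverse_complement(right_arm).upper()
    matches = sum(a == b for a, b in zip(left_arm.upper(), right_rc))
    return 100.0 * matches / len(left_arm)


if __name__ == "__main__":
    # Toy 20-bp arms with four mismatches between the left arm and the
    # reverse complement of the right arm.
    left = "ACGTTGCAAGGCTTAACCGG"
    right = reverse_complement("CCGTTTCAAGACTTATCCGG")
    print(inverted_repeat_identity(left, right))
```

With real data, the two arms would be sliced from the fosmid sequence at the reported coordinates, taking care that the second coordinate pair is written in reverse orientation.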

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent transfers within the γ-proteobacteria. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content than the rest of the fosmid clone, at 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.
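The G + C screen described in this section can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the sequences are toy strings, and the cut-off is interpreted here as a difference of 10 percentage points from the fosmid average, which is an assumption (the text does not state whether the 10% threshold is absolute or relative).

```python
def gc_percent(seq: str) -> float:
    """G + C content as a percentage of unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def flag_atypical_cds(fosmid_seq: str, cds_seqs: dict, threshold: float = 10.0) -> dict:
    """Return {name: (cds_gc, difference)} for every CDS whose G + C content
    differs from the whole-fosmid value by at least `threshold` points.
    The percentage-point threshold is an assumed reading of the text."""
    fosmid_gc = gc_percent(fosmid_seq)
    flagged = {}
    for name, seq in cds_seqs.items():
        cds_gc = gc_percent(seq)
        diff = cds_gc - fosmid_gc
        if abs(diff) >= threshold:
            flagged[name] = (round(cds_gc, 1), round(diff, 1))
    return flagged


if __name__ == "__main__":
    # Toy data: a 40% G + C "fosmid" with one G + C-rich and one typical CDS.
    fosmid = "AT" * 30 + "GC" * 20
    cds = {"orfA": "GCGCGCATAT", "orfB": "ATGCATGCAT"}
    print(flag_atypical_cds(fosmid, cds))
```

A flagged CDS is only a candidate: as the ORF20 example shows, atypical composition without phylogenetic support was not counted as LGT in this study.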

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, half of the b1dcf51c12 clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a function similar to that of the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive-type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene (the exonuclease domain) and that it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the REAL Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
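The adaptor design can be checked with a short script. The sketch below locates the blunt-cutting PmlI site within the adaptor sequence given above. The PmlI recognition sequence (CACGTG, cut between CAC and GTG) is taken from general restriction-enzyme knowledge rather than from this article, and the function names are illustrative only.

```python
# Adaptor sequence exactly as given in the text.
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"

# Assumption (not stated in the article): PmlI recognizes CACGTG
# and cuts between CAC and GTG, leaving blunt ends.
PMLI_SITE = "CACGTG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_sites(seq, site):
    """Return the 0-based start position of every occurrence of site in seq."""
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


pmli_hits = find_sites(ADAPTOR, PMLI_SITE)
print("adaptor length:", len(ADAPTOR))
print("PmlI site start positions (0-based):", pmli_hits)
print("PmlI site is palindromic:", reverse_complement(PMLI_SITE) == PMLI_SITE)
```

Under these assumptions, the PmlI site is the final six bases of the adaptor, downstream of the remaining 29 bases, which carry the I-CeuI site.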

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Labelled clade: PolC + DinG fusion proteins (same domain structure as b1bcf11f04 ORF17).

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to > 8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that had no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
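The CDS selection rules described above (drop CDSs shorter than 100 bp; of two overlapping CDSs, keep the one with significant GenBank homologues) can be sketched as a small filter. This is an illustration under stated assumptions, not the script used in the study. The `CDS` record, the `has_genbank_hit` flag and the tie-breaking rule are hypothetical, and any coordinate overlap is treated as an overlap.

```python
from dataclasses import dataclass

MIN_CDS_LENGTH_BP = 100  # CDSs shorter than this were eliminated


@dataclass
class CDS:
    """Hypothetical record for one predicted coding sequence."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    has_genbank_hit: bool

    @property
    def length(self):
        return self.end - self.start + 1


def overlaps(a, b):
    """True if the two CDSs share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select_cdss(predictions):
    """Apply the length filter, then resolve overlaps in favour of GenBank hits."""
    long_enough = [c for c in predictions if c.length >= MIN_CDS_LENGTH_BP]
    kept = []
    for cds in sorted(long_enough, key=lambda c: c.start):
        if kept and overlaps(kept[-1], cds):
            # Assumption: when neither or both have a hit, keep the upstream CDS.
            if cds.has_genbank_hit and not kept[-1].has_genbank_hit:
                kept[-1] = cds
            continue
        kept.append(cds)
    return kept


predictions = [
    CDS("orf1", 1, 90, True),        # too short, dropped
    CDS("orf2", 100, 700, False),    # overlaps orf3 and has no hit, dropped
    CDS("orf3", 650, 1500, True),
    CDS("orf4", 1600, 2000, False),  # no overlap, kept despite no hit
]
print([c.name for c in select_cdss(predictions)])
```

CDSs without a GenBank match that survive this filter correspond to the cases that were re-examined with ORF-finder and BLASTX in the text.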

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps.

Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a major bacterial group other than that of the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support > 50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
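The decision rule for scoring a CDS as acquired by LGT can be summarized in a few lines. The sketch below is a simplified reading of the criteria above. The function name, argument names and return labels are hypothetical, and the groups are assumed to be given at the level of major bacterial groups or phyla, so that within-group transfers are never scored.

```python
BOOTSTRAP_THRESHOLD = 50  # percent; support must exceed this value


def classify_cds(rrna_group, cds_group, bootstrap, mixed_clade=False):
    """Classify one CDS tree relative to the rRNA phylotype of its fosmid.

    rrna_group  -- major group assigned from the rRNA gene
    cds_group   -- major group the CDS clusters with, or None if no
                   specific group could be identified
    bootstrap   -- bootstrap support (percent) for that clustering
    mixed_clade -- True if the CDS falls robustly within a clade mixing
                   several groups other than rrna_group (assumed flag)
    """
    if bootstrap <= BOOTSTRAP_THRESHOLD:
        return "unresolved"
    if mixed_clade:
        # Scored as LGT, but donor and recipient cannot be identified.
        return "LGT (donor unknown)"
    if cds_group is None:
        return "unresolved"
    if cds_group == rrna_group:
        return "congruent"
    return "LGT"


print(classify_cds("gamma-proteobacteria", "alpha-proteobacteria", 82))
print(classify_cds("gamma-proteobacteria", "gamma-proteobacteria", 95))
print(classify_cds("gamma-proteobacteria", "Firmicutes", 45))
```

Comparing groups only at the phylum or proteobacterial-subdivision level makes the rule conservative: transfers between close relatives fall under "congruent" rather than "LGT".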

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.

Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP. Phylogenetic Analysis Using Parsimony (and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups.
B. Distribution of G + C content of the sequences.
C. Distribution of the COG category of the BLAST hits (exp < 10 e−10).
Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c08, b3cf12d07, b1bcf11c04, b1bf11a01 and b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com

found in the 23S rRNA of eukaryotic chloroplasts and mitochondria (Cannone et al., 2002) as well as a few bacteria (Nesbø and Doolittle, 2003). They very specifically cleave conserved sequences in intron-free 23S rRNA genes. The recognition sites are usually located in the most conserved part of the host rRNA gene and, being 15–25 bp long, are highly specific to rRNA genes. The slow evolutionary rate of these sites, as well as the tolerance for minor sequence changes by homing endonucleases, means that rRNA genes from a wide range of taxa can usually be cut. Three such enzymes belonging to the LAGLIDADG family have been particularly well characterized: I-CeuI (from the chloroplast 23S gene of Chlamydomonas eugametos) (Marshall and Lemieux, 1992), I-CreI (from the chloroplast 23S gene of Chlamydomonas reinhardtii) (Chevalier et al., 2003) and I-DmoI (from the 23S gene of the crenarchaeon Desulphurococcus mobilis) (Aagaard et al., 1997). These enzymes have different cutting sites and specificities. The enzyme used here, I-CeuI, is commercially available from New England Biolabs (http://www.neb.com) and targets a 19-bp cut site at 23S rRNA position 1923 (relative to the E. coli 23S rRNA) that is conserved in most bacteria.

Here we present the sequences of 12 environmental fosmid clones: 10 that contain about 1000 bp of the 23S rRNA gene, one that contains both 23S rRNA and 16S rRNA genes, and one that contains 1079 bp of the 16S rRNA gene. The metagenomic libraries containing these clones were constructed using DNA isolated from anaerobic sediments from Baltimore harbour. Microbial communities from these sediments have been shown earlier to be capable of reductive dechlorination of PCBs (Holoman et al., 1998). The taxonomic position of each fosmid clone was assessed using its rRNA gene. The amount of lateral gene transfer (LGT) was assessed by making phylogenetic trees of all the predicted protein coding sequences (CDSs) in the fosmid clones and comparing the phylogenetic position of the CDS to that indicated by the rRNA gene.

Results and discussion

Two different types of fosmid libraries were made from the anaerobic sediment DNA. The first used the pCC1Fos vector from Epicentre (B1BF1) and the second a modification of that vector containing an I-CeuI site for specifically cloning DNA fragments containing 23S rRNA genes (B1BCF1, B1DCF1, B3CF5, B1DCF5). The B1BF1 library contained about 10 000 clones. The I-CeuI libraries were considerably smaller, and we identified only 49 clones with unique 23S rRNA end-sequences. However, assuming one to three bacterial rRNA-containing clones among every 100 clones (Suzuki et al., 2004), and considering that not all bacterial 23S rRNA genes are cut by I-CeuI, the number of clones obtained is close to expected values.

End-sequencing and subcloning analyses of 'normal' fosmids

In order to get information on the diversity of genomic fragments captured in the B1BF1 library, we obtained 576 end-sequences, resulting in 565 unique sequences with an average of 408 high-quality base pairs, corresponding to 232 kb of environmental DNA. Among the sequences we identified a 16S rRNA sequence in B1BF110d03, which was fully sequenced (see below), as well as one 23S rRNA-containing clone. We also attempted to identify 23S rRNA-containing clones by screening 10 96-well plates from the B1BF1 library using I-CeuI. However, of four clones that appeared to be cut by I-CeuI, only one proved to contain 23S rRNA and was fully sequenced (B1BF11a01, see below).

The distribution of G + C content of the end-sequences, significant hits to proteins in GenBank (based on BLASTX results with e-values < 1 e−10), as well as matches to proteins that have been assigned to COG categories, are shown in supplemental Fig. S1A–C. As observed by Treusch and colleagues (2004), the distribution of the COG categories is similar to what is observed for single genomes of cultivated organisms, suggesting that this represents an average of the genomes in this habitat. Taken together, the large G + C content variation, as well as the wide functional and phylogenetic diversity of the sequences, suggests that we have sampled sequences originating from a large diversity of genomes.
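A G + C content distribution of the kind summarized here can be computed directly from the end-sequences. The sketch below is a minimal illustration, not the analysis behind Fig. S1B. The function names, the 5% bin width and the example reads are assumptions, and ambiguous bases such as N are ignored.

```python
from collections import Counter


def gc_counts(seq):
    """Return (number of G or C bases, number of unambiguous bases)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc, total


def gc_percent(seq):
    """G + C content as a percentage of the unambiguous bases."""
    gc, total = gc_counts(seq)
    return 100.0 * gc / total if total else 0.0


def gc_histogram(seqs, bin_width=5):
    """Count sequences per G + C bin; keys are the lower bin edges in percent."""
    hist = Counter()
    for seq in seqs:
        gc, total = gc_counts(seq)
        if total == 0:
            continue  # nothing but ambiguous bases
        percent_floor = (100 * gc) // total  # integer arithmetic avoids rounding drift
        hist[(percent_floor // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))


# Toy reads for illustration only; real end-sequences averaged about 408 bp.
reads = ["ATGCGC", "AATTAT", "GGCCNN", "ATATGC"]
print(gc_histogram(reads))
```

A broad spread across bins, rather than a single peak, is what would be expected if the reads derive from many genomes of differing base composition.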

End-sequencing of I-CeuI fosmid libraries

Four libraries were made using the pCC1FOSCeuI23S vector. High-quality end-sequences of 91 clones revealed 62 unique clones, of which at least 49 (79%) contained 1000 bp of 23S rRNA. Eight clones (12.9%) did not contain a 23S rRNA, and for five clones we could not obtain high-quality sequence from the end that should contain the 23S rRNA. The 23S rRNA fragments showed highest similarity in BLASTN searches to sequences from several different bacterial groups: α-proteobacteria (1), δ-proteobacteria (15), γ-proteobacteria (9), Firmicutes (8), Planctomycetes (6), Bacteroidetes (1), Actinobacteria (1), Spirochaetes (1) and Chlamydiae/Verrucomicrobia (6). These sequence tags were not long enough to make well-supported phylogenetic trees (average sequence length = 431 bp); however, this gives a rough indication of the diversity captured by this method.
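The percentages quoted for the I-CeuI libraries follow directly from the clone counts. The short check below reproduces them from the numbers in the text; the constant and function names are illustrative.

```python
# Counts reported in the text for the I-CeuI fosmid libraries.
UNIQUE_CLONES = 62
WITH_23S = 49
WITHOUT_23S = 8
NO_HIGH_QUALITY_END = 5


def percent(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


def summarize():
    """Return the share of unique clones falling in each category."""
    # The three categories should account for every unique clone.
    assert WITH_23S + WITHOUT_23S + NO_HIGH_QUALITY_END == UNIQUE_CLONES
    return {
        "with 23S rRNA": percent(WITH_23S, UNIQUE_CLONES),
        "without 23S rRNA": percent(WITHOUT_23S, UNIQUE_CLONES),
        "no high-quality end-sequence": percent(NO_HIGH_QUALITY_END, UNIQUE_CLONES),
    }


print(summarize())
```

Because five clones could not be scored, 79% is a lower bound on the fraction of rRNA-bearing clones, which is why the text says "at least 49".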

These results demonstrate the efficiency of using intron-encoded endonucleases to specifically clone rRNA-containing DNA fragments. Screening a 'normal' fosmid library for rRNA genes usually yields about 1% positive clones (Suzuki et al., 2004). Although our pCC1FOSCeuI23S libraries contained only 62 unique clones, at least 49 of them (79%) contained a 23S gene, which is the equivalent of screening 4900 clones from a 'normal' fosmid library. Also, the peripheral location of the rDNA on the DNA fragments greatly facilitates screening and sequencing. The method is also unlikely to have the same biases as polymerase chain reaction (PCR) screening for the recovery of rDNA-containing clones. For I-CeuI, recovery of positives is based on a single ∼20 bp DNA region (rather than two for PCR) and allows for a different type of degeneracy of this DNA region (drop in cutting efficiency for divergent sequences). Although the I-CeuI recognition sequence is specific to bacteria (except Actinobacteria), other homing endonucleases such as I-CreI and I-DmoI could be used to recover archaeal and actinobacterial DNA fragments.
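The comparison with a conventional library is simple arithmetic, sketched below. The function names are illustrative, and the 1% background rate is the approximate figure cited from Suzuki et al. (2004), not an independent estimate.

```python
def equivalent_screen_size(positives, background_percent):
    """Clones of a conventional library needed to find the same number of positives."""
    return positives * 100 / background_percent


def fold_enrichment(positives, clones_screened, background_percent):
    """Observed positive rate divided by the conventional-library rate."""
    observed_percent = 100 * positives / clones_screened
    return observed_percent / background_percent


# 49 rRNA-bearing clones among 62 unique clones, against a 1% background.
print(equivalent_screen_size(49, 1))
print(round(fold_enrichment(49, 62, 1), 1))
```

With a background of 3%, the upper end of the one-to-three-per-hundred range mentioned earlier, the same 49 positives would correspond to roughly 1600 conventional clones, so the size of the gain depends on the background rate assumed.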

Phylogenetic analyses of rRNA genes demonstrate the recovery of protein-coding genes from a wide diversity of bacterial lineages

Twelve rRNA-containing fosmid clones were fully sequenced. The annotation of these clones is given in Tables S1–12 (see Supplementary material), and Fig. 1 gives an overview of the fosmid clones' phylogenetic affiliation. Figure 2A shows the phylogenetic trees estimated from the 1000 bp 23S rRNA from the I-CeuI-fosmids; for seven of the fosmid clones, this 23S tag could be used to assign the clone to a specific bacterial lineage. We have 23S rRNA-containing fragments from two δ-proteobacteria, two γ-proteobacteria, one ε-proteobacterium, one β-proteobacterium (from this we also have the 16S rRNA) and one taxon from the phylum Chloroflexi. Two fosmid clones contained 16S rRNA genes – B1BF110d03 and B1BF11a01 – and phylogenetic analyses placed these sequences within the Flavobacteriaceae and β-proteobacteria respectively (Figs 2B and C).

For four fosmid clones – b1bcf11f04, b1dcf51c12, b3cf12f09 and b1dcf51a06 – the 23S rRNA tag did not cluster with any specific 23S lineage. For these clones we attempted to obtain the 16S rRNA sequence by using one specific 23S rRNA primer and a universal 16S primer. We successfully obtained four 16S-23S rRNA sequences that showed 98–99% identity to the 23S fragment in b1bcf11f04 (715 bp overlap). One of these amplicons was fully sequenced, and phylogenetic analyses showed that it belongs to the candidate division WS3 (Dojka et al., 1998) (Fig. 2B). Because b1dcf51c12 clusters significantly with b1bcf11f04 in the 23S rRNA tree (Fig. 2A), we also assigned this clone to the WS3 division. For b3cf12f09 we obtained two different 16S-23S rRNA clones that showed 100% and 99% identity to the 23S fragment of this clone (281 bp overlap), and phylogenetic analyses showed that this clone belongs to the candidate division OP8 (Hugenholtz et al., 1998) (Fig. 2B). The ITS regions of both the WS3 and the OP8 rRNA operons contained tRNA-Ile and tRNA-Ala. For b1dcf51a06 no 16S rRNA sequence could be obtained.

Most protein coding sequences are in agreement with the adjacent rRNA genes in phylogenetic analyses

Phylogenetic trees were obtained for all predicted CDSs of each fosmid clone sequenced. We compared the phylogenetic placement of each CDS to the phylogeny suggested by the rRNA. If the phylogeny of the CDS suggested that it belonged to another bacterial group, and this relationship was supported in bootstrap analyses, acquisition by LGT was inferred for the CDS. For the clone where no specific phylogenetic relationship could be inferred (b1dcf51a06), and for the fosmid clones where the rRNA showed that it originated from a bacterium with no cultivated representative, we classified as likely instances of LGT all CDSs that did cluster specifically (with bootstrap support) with another bacterial group. A summary of the phylogenetic analysis of the rRNA genes, as well as of all protein coding CDSs, is given in Table 1.

The majority of the CDSs did agree with their respective rRNA phylogeny, and 57–96% (average 76.8%) of the CDSs that gave good alignments and robust phylogenies showed the same phylogenetic relationship as the rRNA genes. This was also true for the fosmid clones from bacterial lineages with no cultivated representative, as most CDSs from these clones did not cluster with any specific lineage or had no or only a few significant matches in GenBank (Fig. 1). However, for these clones the number of CDSs that robustly agree with the rRNA topology is problematic to calculate, as they may or may not fall into well-supported clades when more sequences from these phyla become available. The fosmid clones with the highest number of congruent trees are b1bf11a01, which originated from a β-proteobacterium very similar to Thiobacillus denitrificans, where 96% of the CDSs with robust phylogenies agree with the rRNA genes, and b1dcf13c08, which originated from an ε-proteobacterium, where 90% of the 'treeable' CDSs agree with the rRNA.

High levels of LGT detected in phylogenetic analyses

Phylogenetic analyses showed that 7ndash44 (average17) of the CDSs have been acquired by LGT from dis-tantly related bacterial lineages (Fig 1 Table 1) For manyof the fosmid clones there were additional CDSs thatprobably also have been involved in LGT these caseswere not scored as LGT either because the CDS was too

2014

C L Nesboslash Y Boucher M Dlutek and W F Doolittle

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd

Environmental Microbiology

7

2011ndash2026

Fig

1

Ove

rvie

w o

f th

e se

quen

ced

fosm

id c

lone

s Y

ello

w C

DS

s ar

e su

gges

ted

to h

ave

been

acq

uire

d by

LG

T a

nd b

lue

CD

Ss

have

no

sign

ifica

nt m

atch

in G

enB

ank

A

α

-pro

teob

acte

ria B

β

-pr

oteo

bact

eria

D

δ

-pro

teob

acte

ria

E

ε

-pro

teob

acte

ria

G

γ

-pro

teob

acte

ria

C

Cya

noba

cter

ia

CB

C

hlor

obi-B

acte

roid

etes

F

Fir

mic

utes

P

pro

teob

acte

ria

CH

C

hlor

oflex

i T

D

The

rmus

-D

eino

cocc

us g

roup

A

CT

Act

inob

acte

ria

PL

Pla

ncto

myc

etes

S

PIR

S

piro

chae

tes

TH

ER

T

herm

otog

ales

A

Q

Aqu

ifeca

les

FU

SO

F

usob

acte

ria

AR

CH

A

rcha

eal

EU

K

Euk

aryo

tes

EN

Ven

viro

nmen

tal s

eque

nce

c

lust

er r

obus

tly w

ithin

a m

ixed

cla

de in

phy

loge

netic

tree

s ndash

no

sign

ifica

nt m

atch

in G

enB

ank

Upp

erca

se s

uppo

rted

by

phyl

ogen

etic

ana

lysi

s L

ower

case

sug

gest

edby

BLA

ST

sea

rche

s as

the

re w

as n

o su

ppor

ted

phyl

ogen

y T

he lo

w-q

ualit

y re

gion

in b

1dcf

13

c08

(pos

ition

119

2ndash13

42)

is in

dica

ted

by a

bla

ck b

ox T

he o

rang

e sh

adin

gs in

dica

tes

LGT-

CD

Ss

that

are

foun

d in

mor

e th

an o

ne fo

smid

ORFAN

A c

onju

gativ

e tr

ansp

oson

ob

tain

ed fr

om a

Bac

terio

ides

bac

teriu

m

unkn

own

b1dc

f51

a06

Chl

orof

exi

b1dc

f13

f01

Can

dida

te d

ivsi

on O

P8

b3cf

12

f09

Can

dida

te d

ivsi

on W

S3

b1bc

f11

f4

Can

dida

te d

ivsi

on W

S3

b1bc

f51

c12

d-pr

oteo

bact

eria

b1bc

f11

h03

d-pr

oteo

bact

eria

b1bc

f11

d04

e-pr

oteo

bact

era

b1dc

f13

c08

g-pr

oteo

bact

eria

b1dc

f12

d07

g-pr

oteo

bact

eria

b1bc

f11

c04

b-pr

oteo

bact

eria

b1bf

11

a01

Fla

voba

cter

iace

aeb1

bf1

10d

03

LGT and phylogenetic assignment of metagenomic clones

2015

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd

Environmental Microbiology

7

2011ndash2026

Fig 2

rRNA phylogeniesA The minimum evolution tree estimated from LogDet distances of the 23S-tag from the CeuI-fosmids (984 positions in alignment) For the sequences from the fosmid clones the Maximum Likelihood topology was similar (GTR

+

G

+

I) except that the

δ

-proteobacteria where paraphyletic with the

γ

- and

β

-proteobacteria clustering within the

δ

-proteobacteria Moreover b1bcf11d04 fell at the bottom of this cladeB The minimum evolution tree estimated from LogDet distances of the 16S sequences (1243 positions in alignment) For the sequences from the fosmid clones the Maximum Likelihood (GTR

+

G

+

I) topology was identical However there where several differences in the backbone of the tree with for instance Geobacter clustering with Firmicutes The trees in both A and B were rooted by the

Thermotoga maritima

sequence. C. The minimum evolution tree estimated from LogDet distances of the closest matches of the 16S fragment in b1bf110d03 (1046 positions in alignment). The Maximum Likelihood (GTR + G + I) topology was identical. For all three trees, numbers on branches refer to bootstrap values from the minimum evolution analysis (italic) and from the Maximum Likelihood analysis (plain text). If both bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 90%.


short to obtain reliable alignments, the CDS was found in a 'mixed' clade also containing genes from the same bacterial group, or the CDS was found outside its group but did not cluster with any specific lineage. For three of the clones, more than 30% of the CDSs have been acquired by LGT (Table 1); two of these are from candidate divisions and one is from a δ-proteobacterium. For all three of these fosmids there appears to have been a transfer of a large island of genes from a phylogenetically distant lineage. Specifically, we infer an α-proteobacterial island in b3cf12f09, a δ-proteobacterial island in b1dcf51c12 and an archaeal/β-proteobacterial island in b1bcf11d04 (Fig. 1). It should be noted that the proportions of foreign genes identified here might not represent the proportion of foreign genes in the respective genomes that we have sampled, but rather the amount of LGT to be expected when sequencing environmental fosmid clones. For instance, in some genomes LGT might be enriched in certain variable parts of the genome. Indeed, the distribution of proteins that match COG categories was significantly different (P = 1.3e-13 in a χ2-test) from what we observed for the end-sequencing of 'normal' fosmid clones (supplemental Fig. S1), the main difference being proportionally more J, K, U, F and H category sequences in the full fosmid sequences and more L, P, R and S category sequences among the end-sequences. When comparing the distributions of different COG groups (i.e. informational, metabolism, etc.), however, the two data sets were significantly different only when including the poorly characterized categories (R, S). If such genes are more frequently transferred than the other categories, then we would be underestimating the level of LGT that would be expected when analysing metagenomic clones.
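The comparison of COG category distributions described above can be illustrated with a Pearson χ2 test of homogeneity. The following is only a minimal sketch: the function name and the category counts are hypothetical and are not taken from the study, and the exact test set-up used by the authors is not specified in the text.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two
    sets of category counts (for example, COG categories in two data sets)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both data sets need at least one observation")
    grand_total = total_a + total_b

    statistic = 0.0
    used_categories = 0
    for category in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(category, 0)
        b = counts_b.get(category, 0)
        column_total = a + b
        if column_total == 0:
            continue  # category absent from both data sets
        used_categories += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    # Two rows, so degrees of freedom = number of categories - 1.
    return statistic, used_categories - 1


# Hypothetical counts per COG category (NOT the values from the paper).
full_fosmids = {"J": 30, "K": 25, "L": 10, "R": 15, "S": 12}
end_sequences = {"J": 18, "K": 15, "L": 22, "R": 30, "S": 25}

stat, dof = chi_square_homogeneity(full_fosmids, end_sequences)
print(f"chi-square = {stat:.2f} with {dof} degrees of freedom")
```

The statistic would then be compared with a χ2 distribution with the reported degrees of freedom to obtain a P-value; dropping the poorly characterized categories (R, S) from both dictionaries mimics the restricted comparison mentioned in the text.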

Interestingly, in b1bcf11d04 the transfer vector for one of the acquired gene clusters could be identified. ORF6 encodes an acetyl transferase gene, and ORF8, ORF9 and ORF10 encode subunits for an acyl-CoA synthase – two α-subunits and one β-subunit. Phylogenetic analyses suggested that all four CDSs have been acquired by LGT, likely from a β-proteobacterium. The β-proteobacteria have, in turn, likely acquired the acyl-CoA synthase genes from Archaea (Fig. 3). In support of the archaeal origin of these genes, the acyl-CoA synthase in b1bcf11d04 has a similar domain organization to the acetyl-CoA synthase in Pyrococcus spp., with two subunits (Sanchez et al., 2000). Furthermore, these genes have been transferred multiple times, and the transfers involved all three domains of life [Fig. 3 (Andersson et al., 2003)]. These transferred CDSs are preceded by one integrase gene (ORF3), a transposase gene (ORF4) and an integrase/transposase gene (ORF5; COG2801, Tra5, which contains an integrase core domain; Table S7), which probably were responsible for transferring this cluster into this genome.

The α-proteobacterial island in the b3cf12f09 clone encodes a wide range of different functions, and no typical mobile elements were identified. However, as this island extends to the 3′ end of the fosmid, mobile genes might be found further downstream. The first CDS of this island encodes a DnaJ-class chaperone (ORF29), which is truncated at the 5′ end. This pseudogene still shows 65% protein identity to a homologue in Magnetospirillum magnetotacticum (Table S3). Hence, this probably represents a very recent transfer (or rearrangement). Another possibility is that this fosmid might be a chimera. However, the G + C content of the CDSs in the α-proteobacterial island (59.5% G + C) is very similar to the rest of the fosmid (59.6% G + C; supplemental Table S3). Also, further upstream there is a proteobacterial transposase

Fig. 3. Maximum Likelihood phylogeny of acetyl-CoA synthetase (ADP-forming) homologues estimated using PMBML (459 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF8 and ORF10 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Entamoeba histolytica. Numbers on branches refer to bootstrap support obtained using PMBML (bold), PUZZLEBOOT (plain text) and Neighbour-joining (italic). If all bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 80%.


Table 1. Summary of phylogenetic analyses.

b1dcf51a06
Phylogeny, 23S rRNA: No clear affiliation with existing sequences.
Phylogeny, 16S rRNA: Could not be amplified.
Phylogeny, CDSs: Most CDSs have no or only a few significant matches in GenBank.
LGT(a), phylogenetic trees: ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues, and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria.
LGT(a), phylogenetic distribution or genome context: ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only.
LGT(a), total: 4 CDSs (19% of the total) have been acquired by LGT.
ORFans(b): 33 (38%)

b1dcf13f01
Phylogeny, 23S rRNA: Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
Phylogeny, CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with 23S fragment. In addition, 6 CDSs which only hit Chloroflexus aurantiacus.
LGT(a), phylogenetic trees: Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25).
LGT(a), phylogenetic distribution or genome context: ORF2 has only significant homologues in Crocosphaera watsonii.
LGT(a), total: 3 CDSs (11% of total) have been acquired by LGT.
ORFans(b): 14 (5%)

b3cf12f09
Phylogeny, 23S rRNA: Candidate division OP8 bacterium.
Phylogeny, 16S rRNA: Candidate division OP8 bacterium.
Phylogeny, CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
LGT(a), phylogenetic trees: Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT; 8 of these have been acquired from an α-proteobacterium and are found linked.
LGT(a), phylogenetic distribution or genome context: Three CDSs found linked to CDSs for which phylogenetic analyses suggest LGT have also likely been acquired by LGT.
LGT(a), total: 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
ORFans(b): 22 (9%)

b1bcf11f04
Phylogeny, 23S rRNA: Candidate division WS3 bacterium.
Phylogeny, 16S rRNA: Candidate division WS3 bacterium.
Phylogeny, CDSs: Most CDSs agree with the rRNA and do not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
LGT(a), phylogenetic trees: Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group.
LGT(a), total: 2 CDSs (9% of total) have been acquired by LGT.
ORFans(b): 26 (14%)

b1dcf51c12
Phylogeny, 23S rRNA: Candidate division WS3 bacterium.
Phylogeny, CDSs: Most CDSs have no or only a few significant matches in GenBank.
LGT(a), phylogenetic trees: ORF6–ORF11 are also found in b1bcf11h03 in the same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1dcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes.
LGT(a), phylogenetic distribution or genome context: ORF12 has no homologue in b1bcf11h03 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that this CDS was also transferred as part of a δ-proteobacterial island.
LGT(a), total: 8 CDSs (44% of total) have been acquired by LGT. One large 'island' of δ-proteobacterial origin.
ORFans(b): 22 (29%)

b1bcf11h03
Phylogeny, 23S rRNA: δ-Proteobacterium.
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 8 of 13 CDSs (57%) that give supported phylogenies agree with the fragment originating from a δ-proteobacterium.
LGT(a), phylogenetic trees: Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria, ORF29 clusters with γ-proteobacteria and ORF30 is found at the bottom of a clade that contains α-proteobacteria and Actinobacteria.
LGT(a), total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of b1bcf11h03 to b1dcf51c12 as well as to the Chlorobium lineage.
ORFans(b): 6 (1%)

b1bcf11d04
Phylogeny, 23S rRNA: δ-Proteobacterium.
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
LGT(a), phylogenetic trees: Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes – the fusA homologue (ORF19) – is also found in b1dcf51c12. This CDS has been transferred to other δ-proteobacteria as well.
LGT(a), phylogenetic distribution or genome context: Three CDSs (ORF3–5) that encode an integrase and two transposases precede four of the LGT genes detected in the phylogenetic analysis. ORF7 was also likely transferred with ORF3–ORF10. ORF20 and ORF21 mainly have homologues in Firmicutes and are the neighbours of ORF19, which has also been acquired from Firmicutes.
LGT(a), total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector – the integrase and transposase – for 8 of the transferred genes.
ORFans(b): –

b1dcf13c08
Phylogeny, 23S rRNA: ε-Proteobacterium, most closely related to Campylobacter jejuni.
Phylogeny, CDSs: 21 CDSs give supported phylogenies, and of these 19 (90%) agree with rRNA.
LGT(a), phylogenetic trees: ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria.
LGT(a), phylogenetic distribution or genome context: ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
LGT(a), total: 3 CDSs (7% of total) have been acquired by LGT.
ORFans(b): 10 (3%)

b3cf12d07
Phylogeny, 23S rRNA: γ-Proteobacterium. Clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
Phylogeny, CDSs: Only 7 CDSs give supported phylogenies. Of these, 4 (57%) agree with rRNA.
LGT(a), phylogenetic trees: ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree.
LGT(a), phylogenetic distribution or genome context: Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, had only divergent homologues in GenBank or had no significant homologues may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
LGT(a), total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
ORFans(b): 23 (23%)

b1bcf11c04
Phylogeny, 23S rRNA: γ-Proteobacterium.
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 14 CDSs give supported phylogenies, and of these 13 (93%) agree with rRNA.
LGT(a), phylogenetic trees: Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 clusters within β-proteobacteria.
LGT(a), phylogenetic distribution or genome context: Three genes that show incongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34, have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
LGT(a), total: 5 CDSs (13% of total) have been acquired by LGT.
ORFans(b): 3 (1%)

b1bf11a01
Phylogeny, 23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA).
Phylogeny, 16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
Phylogeny, CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have their best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes.
LGT(a), phylogenetic trees: One, ORF30 (RsuA), clusters with γ-proteobacteria and has no homologue in Thiobacillus denitrificans.
LGT(a), phylogenetic distribution or genome context: Two CDSs (ORF14 and ORF31) have been transferred to both the fosmid and Thiobacillus denitrificans. ORF29 has no significant homologues in proteobacteria.
LGT(a), total: 4 CDSs (8% of total) have been acquired by LGT.
ORFans(b): 3 (2%)

b1bf110d03
Phylogeny, 23S rRNA: –
Phylogeny, 16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii.
Phylogeny, CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with 16S fragment.
LGT(a), phylogenetic trees: ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer.
LGT(a), phylogenetic distribution or genome context: ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs likely acquired from a relative of Bacteroides thetaiotaomicron.
LGT(a), total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon is not included as it has been transferred within the Bacteroides.
ORFans(b): 10 (3%)

a. Only LGT events involving the CDSs from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). Number of CDSs acquired by LGT is shown in bold.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parentheses is given the proportion of protein coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf110d03, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon – because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced – and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possesses it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophilic (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.


related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c04, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from – highly polluted marine sediments – these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bcf11c04 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1; Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour – ORF34 – clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01 – ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
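The identity of an inverted repeat, such as the one reported to flank these genes, can be computed by comparing one arm with the reverse complement of the other. The sketch below assumes ungapped arms of equal length; the function names and the 20-bp sequence are hypothetical illustrations, not the actual fosmid sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def inverted_repeat_identity(left_arm, right_arm):
    """Percent identity between the left arm and the reverse complement
    of the right arm, compared position by position without gaps."""
    if len(left_arm) != len(right_arm) or not left_arm:
        raise ValueError("arms must be non-empty and of equal length")
    right_rc = reverse_complement(right_arm)
    matches = sum(1 for a, b in zip(left_arm.upper(), right_rc) if a == b)
    return 100.0 * matches / len(left_arm)


# Hypothetical 20-bp left arm (NOT the sequence from the fosmid).
left = "AAGCTTGGCCATTGACCGTA"

# Build a right arm that is a perfect inverted copy, then introduce
# four substitutions so that 16 of 20 positions still pair.
right = list(reverse_complement(left))
for position in (0, 5, 10, 15):
    right[position] = "A" if right[position] != "A" else "C"
right = "".join(right)

print(f"inverted repeat identity: {inverted_repeat_identity(left, right):.0f}%")
```

With real data, the two arms would be extracted from the fosmid sequence using the coordinates given in the text; a gapped alignment would be needed if the arms differed in length.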

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h03 has a G + C content of 47.5% compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6% compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.

LGT and phylogenetic assignment of metagenomic clones 2023

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd Environmental Microbiology 7 2011ndash2026

cluster with γ-proteobacteria and might therefore repre-sent recent within γ-proteobacteria transfers ORF40 inthe isin-proteobacterial clone b1dcf13c08 a short ORFanhas a G + C content of 222 compared with 347 forthe complete clone In addition ORF9 another ORFan inb1dcf13c08 has a marginally lower G + C content com-pared with the rest of the fosmid clone with 257 Simi-larly ORF26 in the Chloroflexi clone b1dcf13f01 has aG + C content of 478 G + C compared with 569 forthe complete fosmid clone
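The G + C screen described in this section can be illustrated with a minimal Python sketch. It assumes that "10% higher or lower" means a difference of at least 10 percentage points from the fosmid average; the function names are hypothetical, and the example values are the G + C contents quoted in the text.

```python
def gc_percent(seq: str) -> float:
    """Return the G + C content of a nucleotide sequence, in percent."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def deviates(cds_gc: float, fosmid_gc: float, threshold: float = 10.0) -> bool:
    """True if a CDS differs from the fosmid average by >= threshold points.

    Assumption: the 10% cut-off is read as percentage points.
    """
    return abs(cds_gc - fosmid_gc) >= threshold


# (CDS G + C %, fosmid G + C %) as quoted in the text
reported = {
    "b1bcf11h03 ORF20": (47.5, 36.6),
    "b1dcf13c08 ORF40": (22.2, 34.7),
    "b1dcf13c08 ORF9": (25.7, 34.7),   # described as marginally lower
    "b1dcf13f01 ORF26": (47.8, 56.9),  # described as similarly marginal
}

flagged = [name for name, (cds, fos) in reported.items() if deviates(cds, fos)]
print(flagged)
```

Under this reading, ORF20 and ORF40 exceed the cut-off, whereas ORF9 and ORF26 fall just below it, which matches their description as marginal cases.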

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, half of the b1dcf51c12 clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain and a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs – ORF15 and ORF16 – also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a function similar to that of the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene – the exonuclease domain – and that it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f04 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the REAL Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After gel purification, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
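As an illustration of how the adaptor provides both an I-CeuI site and a blunt-end site, the following Python sketch locates the two sites in the adaptor sequence given above. The recognition sequences used here are assumptions that are not stated in the text (they reflect general knowledge of these enzymes and should be checked against the supplier's documentation), and all names in the code are hypothetical.

```python
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"  # adaptor sequence from the text

# Assumed recognition sequences (not given in the text).
PMLI_SITE = "CACGTG"                        # PmlI, assumed blunt cut CAC^GTG
ICEUI_SITE = "TAACTATAACGGTCCTAAGGTAGCGAA"  # assumed I-CeuI site


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


# 0-based start positions of each site on the strand as written (-1 if absent)
iceui_start = ADAPTOR.find(ICEUI_SITE)
pmli_start = ADAPTOR.find(PMLI_SITE)

# A blunt cut in the middle of the 6 bp PmlI site splits the adaptor in two.
cut = pmli_start + len(PMLI_SITE) // 2
left, right = ADAPTOR[:cut], ADAPTOR[cut:]

print(f"I-CeuI site starts at {iceui_start}, PmlI site starts at {pmli_start}")
print(f"PmlI site is palindromic: {reverse_complement(PMLI_SITE) == PMLI_SITE}")
print(f"Fragments after blunt cut: {left} | {right}")
```

Under these assumptions the adaptor consists of two leading bases, the I-CeuI site and the PmlI site at its 3' end.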

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17 estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Figure label: PolC + DinG fusion proteins, same domain structure as b1bcf11f04 ORF17.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to > 8× coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3× coverage). For one low-quality region, positions 1192–1342 in b1dcf13c08, we were not able to obtain high-quality sequence. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that had no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
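The CDS selection rules described above (a minimum length of 100 bp, and a preference for the overlapping CDS that has GenBank homologues) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `CDS` record, the function names and the toy coordinates are hypothetical, any coordinate overlap is treated as an overlap, and the sketch does not reproduce the behaviour of Glimmer or ORF-finder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CDS:
    name: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    has_homologue: bool   # significant match in GenBank

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two coordinate ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def filter_cds(cds_list: List[CDS], min_len: int = 100) -> List[CDS]:
    """Drop short CDSs, then resolve overlaps in favour of CDSs with homologues."""
    kept = [c for c in cds_list if c.length >= min_len]
    dropped = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            # Only resolve pairs in which exactly one member has a homologue.
            if overlaps(a, b) and a.has_homologue != b.has_homologue:
                dropped.add(b.name if a.has_homologue else a.name)
    return [c for c in kept if c.name not in dropped]


# Toy input with hypothetical coordinates (not data from the study)
example = [
    CDS("orfA", 1, 90, True),       # shorter than 100 bp: removed
    CDS("orfB", 100, 400, True),
    CDS("orfC", 350, 600, False),   # overlaps orfB and has no homologue: removed
    CDS("orfD", 700, 1000, False),  # ORFan without overlap: kept
]
selected = [c.name for c in filter_cds(example)]
print(selected)
```

Pairs in which both or neither CDS has a homologue are left untouched in this sketch, because the text does not say how such cases were handled.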

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a major bacterial group other than that of the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support > 50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
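The decision rule used to score a CDS as laterally transferred can be summarised in a short Python sketch. It is a simplified reading of the procedure described above, not the authors' implementation: the function name, argument names and return labels are hypothetical, and group labels are assumed to be given at the level of major bacterial groups or phyla, so that within-group transfers are never scored.

```python
from typing import Optional


def classify_cds(
    rrna_group: str,
    cds_group: Optional[str],
    bootstrap: float,
    mixed_clade: bool = False,
    min_support: float = 50.0,
) -> str:
    """Classify one CDS from its tree placement relative to the rRNA phylotype.

    rrna_group  -- major bacterial group inferred from the rRNA gene
    cds_group   -- major group the CDS clusters with (None if mixed clade)
    bootstrap   -- bootstrap support (%) for that placement
    mixed_clade -- True if the CDS sits in a clade mixing several groups
    """
    if not mixed_clade and cds_group == rrna_group:
        return "congruent"
    if bootstrap <= min_support:
        # Conflicting placement without support is not counted as LGT.
        return "not scored as LGT"
    if mixed_clade:
        # Supported placement in a mixed clade: LGT, donor cannot be named.
        return "LGT, donor unknown"
    return "LGT"


# Hypothetical placements for illustration only
print(classify_cds("gamma-proteobacteria", "gamma-proteobacteria", 95))
print(classify_cds("gamma-proteobacteria", "alpha-proteobacteria", 80))
print(classify_cds("gamma-proteobacteria", "alpha-proteobacteria", 45))
print(classify_cds("gamma-proteobacteria", None, 70, mixed_clade=True))
```

Encoding the groups as coarse labels keeps the rule conservative, in line with the statement that only transfers between bacterial groups or phyla were counted.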

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP: Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp < 10 e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c08, b3cf12d07, b1bcf11c04, b1bf11a01 and b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


library for rRNA genes usually leads to about 1% positive clones (Suzuki et al., 2004). Although our pCC1FosCeuI23S libraries contained only 62 unique clones, at least 49 of them (79%) contained a 23S gene, which is the equivalent of screening 4900 clones from a 'normal' fosmid library. Also, the peripheral location of the rDNA on the DNA fragments greatly facilitates screening and sequencing. The approach is also unlikely to have the same biases as polymerase chain reaction (PCR) screening for the recovery of rDNA-containing clones. For I-CeuI, recovery of positives is based on a single ~20 bp DNA region (rather than two for PCR) and allows for a different type of degeneracy of this DNA region (a drop in cutting efficiency for divergent sequences). Although the I-CeuI recognition sequence is specific to bacteria (except Actinobacteria), other homing endonucleases such as I-CreI and I-DmoI could be used to recover archaeal and actinobacterial DNA fragments.
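The enrichment figures quoted above follow from simple arithmetic, shown here as a small Python check. The variable names are hypothetical, and the 1% background rate is the approximate value cited in the text for conventional libraries.

```python
unique_clones = 62        # unique clones in the I-CeuI libraries
positives = 49            # clones carrying a 23S rRNA gene (at least)
background_rate = 0.01    # about 1% positives in a conventional library

# Share of rRNA-bearing clones in the enriched library, in percent
percent_positive = round(100 * positives / unique_clones)

# Number of conventional clones one would screen to find as many positives
equivalent_clones = round(positives / background_rate)

# Fold enrichment relative to the conventional hit rate
fold_enrichment = (positives / unique_clones) / background_rate

print(f"{percent_positive}% positives")
print(f"equivalent to screening {equivalent_clones} conventional clones")
print(f"about {fold_enrichment:.0f}-fold enrichment")
```

The fold enrichment is not stated in the text; it is derived here from the same three numbers.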

Phylogenetic analyses of rRNA genes demonstrate the recovery of protein-coding genes from a wide diversity of bacterial lineages

Twelve rRNA-containing fosmid clones were fully sequenced. The annotation of these clones is given in Tables S1–12 (see Supplementary material), and Fig. 1 gives an overview of the fosmid clones' phylogenetic affiliation. Figure 2A shows the phylogenetic tree estimated from the 1000 bp 23S rRNA from the I-CeuI fosmids; for seven of the fosmid clones this 23S tag could be used to assign the clone to a specific bacterial lineage. We have 23S rRNA-containing fragments from two δ-proteobacteria, two γ-proteobacteria, one ε-proteobacterium, one β-proteobacterium (from this we also have the 16S rRNA) and one taxon from the phylum Chloroflexi. Two fosmid clones contained 16S rRNA genes – b1bf110d03 and b1bf11a01 – and phylogenetic analyses placed these sequences within the Flavobacteriaceae and β-proteobacteria respectively (Figs 2B and C).

For four fosmid clones – b1bcf11f04, b1dcf51c12, b3cf12f09 and b1dcf51a06 – the 23S rRNA tag did not cluster with any specific 23S lineage. For these clones we attempted to obtain the 16S rRNA sequence by using one specific 23S rRNA primer and a universal 16S primer. We successfully obtained four 16S-23S rRNA sequences that showed 98–99% identity to the 23S fragment in b1bcf11f04 (715 bp overlap). One of these amplicons was fully sequenced, and phylogenetic analyses showed that it belongs to the candidate division WS3 (Dojka et al., 1998) (Fig. 2B). Because b1dcf51c12 clusters significantly with b1bcf11f04 in the 23S rRNA tree (Fig. 2A), we also assigned this clone to the WS3 division. For b3cf12f09 we obtained two different 16S-23S rRNA clones that showed 100% and 99% identity to the 23S fragment of this clone (281 bp overlap), and phylogenetic analyses showed that this clone belongs to the candidate division OP8 (Hugenholtz et al., 1998) (Fig. 2B). The ITS regions of both the WS3 and the OP8 rRNA operons contained tRNA-Ile and tRNA-Ala. For b1dcf51a06 no 16S rRNA sequence could be obtained.

Most protein coding sequences are in agreement with the adjacent rRNA genes in phylogenetic analyses

Phylogenetic trees were obtained for all predicted CDSs of each fosmid clone sequenced. We compared the phylogenetic placement of each CDS to the phylogeny suggested by the rRNA. If the phylogeny of the CDS suggested that it belonged to another bacterial group, and this relationship was supported in bootstrap analyses, acquisition by LGT was inferred for the CDS. For the clone where no specific phylogenetic relationship could be inferred (b1dcf51a06), and for the fosmid clones where the rRNA showed that it originated from a bacterium with no cultivated representative, we classified as likely instances of LGT all CDSs that did cluster specifically (with bootstrap support) with another bacterial group. A summary of the phylogenetic analysis of the rRNA genes, as well as of all protein coding CDSs, is given in Table 1.

The majority of the CDSs did agree with their respective rRNA phylogeny, and 57–96% (average 76.8%) of the CDSs that gave good alignments and robust phylogenies showed the same phylogenetic relationship as the rRNA genes. This was also true for the fosmid clones from bacterial lineages with no cultivated representative, as most CDSs from these clones did not cluster with any specific lineage or had no or only a few significant matches in GenBank (Fig. 1). However, for these clones the number of CDSs that robustly agree with the rRNA topology is problematic to calculate, as they may or may not fall into well-supported clades when more sequences from these phyla become available. The fosmid clones with the highest number of congruent trees are b1bf11a01, which originated from a β-proteobacterium very similar to Thiobacillus denitrificans, where 96% of the CDSs with robust phylogenies agree with the rRNA genes, and b1dcf13c08, which originated from an ε-proteobacterium, where 90% of the 'treeable' CDSs agree with the rRNA.

High levels of LGT detected in phylogenetic analyses

Phylogenetic analyses showed that 7–44% (average 17%) of the CDSs have been acquired by LGT from distantly related bacterial lineages (Fig. 1, Table 1). For many of the fosmid clones there were additional CDSs that probably also have been involved in LGT; these cases were not scored as LGT either because the CDS was too


Fig. 1. Overview of the sequenced fosmid clones. Yellow CDSs are suggested to have been acquired by LGT, and blue CDSs have no significant match in GenBank. A, α-proteobacteria; B, β-proteobacteria; D, δ-proteobacteria; E, ε-proteobacteria; G, γ-proteobacteria; C, Cyanobacteria; CB, Chlorobi-Bacteroidetes; F, Firmicutes; P, proteobacteria; CH, Chloroflexi; TD, Thermus-Deinococcus group; ACT, Actinobacteria; PL, Planctomycetes; SPIR, Spirochaetes; THER, Thermotogales; AQ, Aquificales; FUSO, Fusobacteria; ARCH, Archaea; EUK, Eukaryotes; ENV, environmental sequence. Additional markers denote CDSs that cluster robustly within a mixed clade in phylogenetic trees and CDSs with no significant match in GenBank (–). Uppercase: supported by phylogenetic analysis. Lowercase: suggested by BLAST searches, as there was no supported phylogeny. The low-quality region in b1dcf13c08 (positions 1192–1342) is indicated by a black box. The orange shadings indicate LGT-CDSs that are found in more than one fosmid. Clones shown, with their assigned lineages: unknown, b1dcf51a06; Chloroflexi, b1dcf13f01; candidate division OP8, b3cf12f09; candidate division WS3, b1bcf11f04 and b1dcf51c12; δ-proteobacteria, b1bcf11h03 and b1bcf11d04; ε-proteobacteria, b1dcf13c08; γ-proteobacteria, b3cf12d07 and b1bcf11c04; β-proteobacteria, b1bf11a01; Flavobacteriaceae, b1bf110d03. Figure annotations: 'ORFan'; 'A conjugative transposon obtained from a Bacteroides bacterium'.


Fig. 2. rRNA phylogenies.

A. The minimum evolution tree estimated from LogDet distances of the 23S-tag from the CeuI-fosmids (984 positions in alignment). For the sequences from the fosmid clones, the Maximum Likelihood topology (GTR + G + I) was similar, except that the δ-proteobacteria were paraphyletic, with the γ- and β-proteobacteria clustering within the δ-proteobacteria. Moreover, b1bcf11d04 fell at the bottom of this clade.

B. The minimum evolution tree estimated from LogDet distances of the 16S sequences (1243 positions in alignment). For the sequences from the fosmid clones, the Maximum Likelihood (GTR + G + I) topology was identical. However, there were several differences in the backbone of the tree, with, for instance, Geobacter clustering with Firmicutes. The trees in both A and B were rooted by the Thermotoga maritima sequence.

C. The minimum evolution tree estimated from LogDet distances of the closest matches of the 16S fragment in b1bf110d03 (1046 positions in alignment). The Maximum Likelihood (GTR + G + I) topology was identical.

For all three trees, numbers on branches refer to bootstrap values from the minimum evolution analysis (italic) and from the Maximum Likelihood analysis (plain text). If both bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that the values were above 90%. Scale bars: 0.05 substitutions/site (A and B); 0.01 substitutions/site (C).


short to obtain reliable alignments, the CDS was found in a ‘mixed’ clade also containing genes from the same bacterial group, or the CDS was found outside its group but did not cluster with any specific lineage. For three of the clones, more than 30% of the CDSs have been acquired by LGT (Table 1); two of these are from candidate divisions and one is from a δ-proteobacterium. For all three of these fosmids, there appears to have been a transfer of a large island of genes from a phylogenetically distant lineage. Specifically, we infer an α-proteobacterial island in b3cf12f09, a δ-proteobacterial island in b1dcf51c12 and an archaeal/β-proteobacterial island in b1bcf11d04 (Fig. 1). It should be noted that the proportions of foreign genes identified here might not represent the proportion of foreign genes in the respective genomes that we have sampled, but rather the amount of LGT to be expected when sequencing environmental fosmid clones. For instance, in some genomes LGT might be enriched in certain variable parts of the genome. Indeed, the distribution of proteins that match COG categories was significantly different (P = 1.3e-13 in a χ²-test) from what we observed for the end-sequencing of ‘normal’ fosmid clones (supplemental Fig. S1), the main difference being proportionally more J, K, U, F and H category sequences in the full fosmid sequences and more L, P, R and S category sequences among the end-sequences. When comparing the distributions of different COG-groups (i.e. informational, metabolism, etc.), however, the two data sets were significantly different only when including the poorly characterized categories (R, S). If such genes are more frequently transferred than the other categories, then we would be underestimating the level of LGT that would be expected when analysing metagenomic clones.
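The comparison of COG category distributions described above is a test of homogeneity between two samples. A minimal sketch of how such a χ² statistic can be computed is given below; the category counts are invented for illustration (they are not the values behind supplemental Fig. S1), and the function name `chi_square_homogeneity` is hypothetical.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two samples
    classified into the same categories (a 2 x k contingency table)."""
    categories = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    used = 0
    for cat in categories:
        obs_a = counts_a.get(cat, 0)
        obs_b = counts_b.get(cat, 0)
        column_total = obs_a + obs_b
        if column_total == 0:
            continue  # an empty category carries no information
        used += 1
        # Expected counts if both samples shared one category distribution.
        exp_a = column_total * total_a / grand_total
        exp_b = column_total * total_b / grand_total
        statistic += (obs_a - exp_a) ** 2 / exp_a
        statistic += (obs_b - exp_b) ** 2 / exp_b
    return statistic, used - 1


# Hypothetical counts per COG category (assumption, for illustration only).
full_fosmids = {"J": 30, "K": 20, "L": 10, "R": 20, "S": 20}
end_sequences = {"J": 10, "K": 20, "L": 30, "R": 20, "S": 20}

stat, dof = chi_square_homogeneity(full_fosmids, end_sequences)
print(f"chi-square = {stat:.2f} with {dof} degrees of freedom")
```

Turning the statistic into a P-value requires the χ² distribution; if SciPy is available, `scipy.stats.chi2.sf(stat, dof)` gives the upper-tail probability.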

Interestingly, in b1bcf11d04 the transfer vector for one of the acquired gene clusters could be identified. ORF6 encodes an acetyl transferase, and ORF8, ORF9 and ORF10 encode subunits of an acyl-CoA synthase (two α-subunits and one β-subunit). Phylogenetic analyses suggested that all four CDSs have been acquired by LGT, likely from a β-proteobacterium. The β-proteobacteria have, in turn, likely acquired the acyl-CoA synthase genes from Archaea (Fig. 3). In support of the archaeal origin of these genes, the acyl-CoA synthase in b1bcf11d04 has a domain organization similar to that of the acetyl-CoA synthase in Pyrococcus spp., with two subunits (Sanchez et al., 2000). Furthermore, these genes have been transferred multiple times, and the transfers involved all three domains of life [Fig. 3; (Andersson et al., 2003)]. These transferred CDSs are preceded by one integrase gene (ORF3), a transposase gene (ORF4) and an integrase/transposase gene (ORF5; COG2801, Tra5, which contains an integrase core domain; Table S7), which probably were responsible for transferring this cluster into this genome.

The α-proteobacterial island in the b3cf12f09 clone encodes a wide range of different functions, and no typical mobile elements were identified. However, as this island extends to the 3′ end of the fosmid, mobile genes might be found further downstream. The first CDS of this island encodes a DnaJ-class chaperone (ORF29), which is truncated at the 5′ end. This pseudogene still shows 65% protein identity to a homologue in Magnetospirillum magnetotacticum (Table S3). Hence, this probably represents a very recent transfer (or rearrangement). Another possibility is that this fosmid might be a chimera. However, the G + C content of the CDSs in the α-proteobacterial island (59.5% G + C) is very similar to the rest of the fosmid (59.6% G + C; supplemental Table S3). Also, further upstream there is a proteobacterial transposase

Fig. 3. Maximum Likelihood phylogeny of acetyl-CoA synthetase (ADP-forming) homologues estimated using PMBML (459 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF8 and ORF10 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Entamoeba histolytica. Numbers on branches refer to bootstrap support obtained using PMBML (bold), PUZZLEBOOT (plain text) and Neighbour-joining (italic). If all bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 80%.


Table 1. Summary of phylogenetic analyses.

Columns: Fosmid; Phylogeny (23S rRNA, 16S rRNA, CDSs); LGT (a) (phylogenetic trees, phylogenetic distribution or genome context, total); ORFans (b).

Fosmid b1dcf51a06
  23S rRNA: No clear affiliation with existing sequences.
  16S rRNA: Could not be amplified.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria. ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only.
  LGT total: 4 CDSs (19% of the total) have been acquired by LGT.
  ORFans: 33 (38%)

Fosmid b1dcf13f01
  23S rRNA: Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
  CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with 23S fragment. In addition, 6 CDSs which only hit Chloroflexus aurantiacus.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25). ORF2 has only significant homologues in Crocosphaera watsonii.
  LGT total: 3 CDSs (11% of total) have been acquired by LGT.
  ORFans: 14 (5%)

Fosmid b3cf12f09
  23S rRNA: Candidate division OP8 bacterium.
  16S rRNA: Candidate division OP8 bacterium.
  CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT; 8 of these have been acquired from an α-proteobacterium and are found linked. Three CDSs found linked to CDSs where phylogenetic analyses suggest LGT have also likely been acquired by LGT.
  LGT total: 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
  ORFans: 22 (9%)

Fosmid b1bcf11f04
  23S rRNA: Candidate division WS3 bacterium.
  16S rRNA: Candidate division WS3 bacterium.
  CDSs: Most CDSs agree with the rRNA and do not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group.
  LGT total: 2 CDSs (9% of total) have been acquired by LGT.
  ORFans: 26 (14%)


Fosmid b1bcf51c12
  23S rRNA: Candidate division WS3 bacterium.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF6–ORF11 are also found in b1bcf11h3 in the same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes. ORF12 has no homologue in b1bcf11h3, but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that this CDS was also transferred as part of a δ-proteobacterial island.
  LGT total: 8 CDSs (44% of total) have been acquired by LGT. One large ‘island’ of δ-proteobacterial origin.
  ORFans: 22 (29%)

Fosmid b1bcf11h03
  23S rRNA: δ-Proteobacterium.
  16S rRNA: –
  CDSs: 8 of 13 CDSs (57%) that give supported phylogenies agree with the fragment originating from a δ-proteobacterium.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria, ORF29 clusters with γ-proteobacteria and ORF30 is found at the bottom of a clade that contains α-proteobacteria and Actinobacteria.
  LGT total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of b1bcf11h03 to b1dcf51c12 as well as to the Chlorobium lineage.
  ORFans: 6 (1%)


Fosmid b1bcf11d04
  23S rRNA: δ-Proteobacterium.
  16S rRNA: –
  CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes, the fusA homologue (ORF19), is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well. Three CDSs (ORF3–5) that encode an integrase and two transposases precede four of the LGT genes detected in the phylogenetic analysis. ORF7 also likely transferred with ORF3–ORF10. ORF20 and ORF21 have mainly homologues in Firmicutes and are the neighbours of ORF19, which has also been acquired from Firmicutes.
  LGT total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector (the integrase and transposase) for 8 of the transferred genes.
  ORFans: –

Fosmid b1bcf13c08
  23S rRNA: ε-Proteobacterium, most closely related to Campylobacter jejuni.
  CDSs: 21 CDSs give supported phylogenies, and of these 19 (90%) agree with rRNA.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria. ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
  LGT total: 3 CDSs (7% of total) have been acquired by LGT.
  ORFans: 10 (3%)

Fosmid b3cf12d07
  23S rRNA: γ-Proteobacterium. Clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
  CDSs: Only 7 CDSs give supported phylogenies. Of these, 4 (57%) agree with rRNA.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree. Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, and had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
  LGT total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
  ORFans: 23 (23%)


Fosmid b1bcf11c04
  23S rRNA: γ-Proteobacterium.
  16S rRNA: –
  CDSs: 14 CDSs give supported phylogenies, and of these 13 (93%) agree with rRNA.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 clusters within β-proteobacteria. Three genes that show incongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34 have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
  LGT total: 5 CDSs (13% of total) have been acquired by LGT.
  ORFans: 3 (1%)

Fosmid b1bf11a01
  23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA).
  16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
  CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): One, ORF30 (RsuA), clusters with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both the fosmid and Thiobacillus denitrificans. ORF29 has no significant homologues in proteobacteria.
  LGT total: 4 CDSs (8% of total) have been acquired by LGT.
  ORFans: 3 (2%)

Fosmid b1bf110d03
  23S rRNA: –
  16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii.
  CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with 16S fragment.
  LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer. ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs likely acquired from a relative of Bacteroides thetaiotaomicron.
  LGT total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon is not included, as it has been transferred within the Bacteroides.
  ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). The number of CDSs acquired by LGT is given under 'LGT total'.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parentheses is given the proportion of protein-coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.
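The G + C comparison used in the chimera argument can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' procedure: the sequences are toy strings, the function names are hypothetical, the reference value is the pooled G + C of all CDSs supplied, and a deviation of '10%' is read as 10 percentage points.

```python
def gc_percent(seq):
    """G + C content of a nucleotide sequence, as a percentage of A/C/G/T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def flag_gc_outliers(cds_seqs, threshold=10.0):
    """Return CDSs whose G + C differs from the pooled G + C of all CDSs
    by at least `threshold` percentage points (assumed interpretation)."""
    pooled = gc_percent("".join(cds_seqs.values()))
    flagged = {}
    for name, seq in cds_seqs.items():
        value = gc_percent(seq)
        if abs(value - pooled) >= threshold:
            flagged[name] = round(value, 1)
    return flagged


# Toy sequences (hypothetical; real CDSs are hundreds of bases long).
cds = {
    "ORF_a": "GCGCGCATGC",  # 80% G + C
    "ORF_b": "GCGCATATGC",  # 60% G + C
    "ORF_c": "ATATATATGC",  # 20% G + C
}
outliers = flag_gc_outliers(cds)
print(outliers)
```

An island whose pooled G + C matches the rest of the clone, as reported for b3cf12f09, would produce no flagged CDSs under this kind of screen, which is why compositional similarity argues against a chimera but cannot by itself rule out an old or compositionally similar transfer.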

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.
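The counting rule applied here (a transfer enters the LGT estimate only when it is supported and the inferred donor lies outside the major group of the clone) can be written down as a small sketch. The record type, field names and group labels below are hypothetical and chosen only to illustrate the rule; they are not taken from the supplementary tables.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CdsCall:
    name: str
    donor_group: Optional[str]  # inferred donor group; None if no transfer inferred
    supported: bool             # supported by phylogeny or clear distribution pattern


def lgt_estimate(calls: List[CdsCall], host_group: str) -> Tuple[List[str], float]:
    """Names of CDSs counted as LGT and their percentage of all CDSs.
    Transfers from within the host's own major group are not counted."""
    counted = [
        c.name
        for c in calls
        if c.donor_group is not None
        and c.supported
        and c.donor_group != host_group
    ]
    return counted, 100.0 * len(counted) / len(calls)


# Hypothetical calls for a clone assigned to group "X".
calls = [
    CdsCall("ORF_1", None, False),  # no transfer inferred
    CdsCall("ORF_2", "Y", True),    # supported transfer from another group
    CdsCall("ORF_3", "X", True),    # within-group transfer: excluded
    CdsCall("ORF_4", "Z", False),   # unsupported: excluded
]
counted, percent = lgt_estimate(calls, host_group="X")
print(counted, percent)
```

Under this rule a within-group acquisition such as the conjugative transposon is real LGT but does not raise the estimate, so the reported percentages are best read as lower bounds on transfer across major groups.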

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possesses it and then has been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophilic (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.


related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c04, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bcf11c04 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1; Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040; 36693–36674).
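The identity of an inverted repeat is measured by comparing one arm with the reverse complement of the other. The sketch below illustrates that calculation for two 20-nt arms, the length implied by the coordinates given above; the arm sequences are invented (the actual fosmid sequence is not reproduced here), and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def inverted_repeat_identity(left_arm, right_arm):
    """Identity between the left arm and the reverse complement of the
    right arm, both given as read on the forward strand."""
    return percent_identity(left_arm, reverse_complement(right_arm))


# Invented 20-nt arms: the right arm is the reverse complement of a copy
# of the left arm carrying four substitutions (16 of 20 positions match).
left_arm = "ACGTTGCAAGGCTTAACCGG"
diverged_copy = "TCGTTACAAGTCTTAACCGC"
right_arm = reverse_complement(diverged_copy)

identity = inverted_repeat_identity(left_arm, right_arm)
print(f"{identity:.0f}% identity")
```

This ungapped comparison is the simplest case; repeats of unequal length or with indels would need a pairwise alignment before identity is computed.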

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h03 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support, and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7%, compared with 52.5% for the rest of the fosmid. All these CDSs

Fig 5 Maximum Likelihood phylogeny of OmpA homologues esti-mated using PMBML (135 positions in alignment) The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against Gen-Bank and the 100 best matches where retrieved and aligned Groups of very similar sequences from the same species or sister species where trimmed down to one sequence representative We also removed three sequences from Chlamydiaceae as these sequences formed a long unstable branch in the tree as well as some sequences that where considerably shorter than the remaining alignment The tree was arbitrarily rooted by Agrobacterium tumefaciens Results from bootstrap analyses are indicated as in Fig 3

10

Agrobacterium tumefaciens Sinorhizobium meliloti

Brucella melitensis Mesorhizobium loti

Mesorhizobium sp BNC1 Helicobacter bizzozeronii

Bartonella henselae Rhodopseudomonas palustris Bradyrhizobium japonicum

Rhodobacter sphaeroidesSilicibacter sp TM1040

Rhodospirillum rubrum Caulobacter crescentus

Magnetospirillum gryphiswaldense Rickettsia typhi

Rickettsia sibirica Gluconobacter oxydans

Zymomonas mobilis Novosphingobium aromaticivorans

Novosphingobium aromaticivorans Magnetococcus sp MC-1

Myxococcus xanthusXanthomonas campestris

Desulfotalea psychrophila Wolinella succinogenes

Desulfotalea psychrophila Desulfovibrio vulgaris

Geobacter metallireducens Geobacter sulfurreducens

Geobacter metallireducens Geobacter sulfurreducens

Chlorobium tepidum b1bcf11h03ORF12

Bdellovibrio bacteriovorus b1dcf51c12ORF7

Psychrobacter sp 273-4 Acinetobacter sp ADP1

Microbulbifer degradans Pseudomonas syringae Pseudomonas aeruginosa

Rubrivivax gelatinosus Thiobacillus denitrificans Nitrosomonas europaea

Ralstonia solanacearum Ralstonia eutropha

Burkholderia fungorum Burkholderia cepacia

Burkholderia cepacia Burkholderia pseudomallei

Idiomarina loihiensisPhotobacterium profundum

Shewanella oneidensis Vibrio cholerae Vibrio vulnificus Vibrio parahaemolyticus

Haemophilus somnus Haemophilus influenzae

Pasteurella multocida Photorhabdus luminescens Yersinia pseudotuberculosis

Erwinia carotovora Salmonella enterica

Erwinia chrysanthemi

6155

79 61 83

7255

5467

71

52

65

5152

5474

82

52

73

528498 52

508992

8472 54

527383

698372

8783

77 92

52

LGT and phylogenetic assignment of metagenomic clones 2023

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd Environmental Microbiology 7 2011ndash2026

cluster with γ-proteobacteria and might therefore repre-sent recent within γ-proteobacteria transfers ORF40 inthe isin-proteobacterial clone b1dcf13c08 a short ORFanhas a G + C content of 222 compared with 347 forthe complete clone In addition ORF9 another ORFan inb1dcf13c08 has a marginally lower G + C content com-pared with the rest of the fosmid clone with 257 Simi-larly ORF26 in the Chloroflexi clone b1dcf13f01 has aG + C content of 478 G + C compared with 569 forthe complete fosmid clone
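The G + C screen described in this section can be sketched as follows. This is an illustration only, not the procedure actually used: we assume that the 10% criterion means a difference of at least 10 percentage points from the fosmid average, that CDS coordinates are 1-based and inclusive, and the function names `gc_percent` and `flag_gc_outliers` are hypothetical.

```python
def gc_percent(seq):
    """G + C content of a sequence in percent, ignoring ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


def flag_gc_outliers(fosmid_seq, cds_coords, threshold=10.0):
    """Return the fosmid G + C content and the CDSs that deviate from it.

    cds_coords maps a CDS name to (start, end), 1-based inclusive (assumed).
    A CDS is flagged when its G + C content differs from the fosmid average
    by at least `threshold` percentage points (our reading of the 10% rule).
    """
    fosmid_gc = gc_percent(fosmid_seq)
    outliers = {}
    for name, (start, end) in cds_coords.items():
        difference = gc_percent(fosmid_seq[start - 1:end]) - fosmid_gc
        if abs(difference) >= threshold:
            outliers[name] = round(difference, 1)
    return fosmid_gc, outliers
```

Under this reading, a CDS at 47.5% G + C on a fosmid averaging 36.6% would be flagged (difference of 10.9 points), whereas one at 25.7% on a fosmid averaging 34.7% (9.0 points) would not, which matches the "marginally lower" wording above.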

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from these lineages are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny the b1bcf11f04 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17 estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Clade label in the tree: PolC + DinG fusion proteins, same domain structure as b1bcf11f04 ORF17.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to > 8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
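The post-processing of the gene predictions described above (length cut-off, then resolution of overlapping CDSs in favour of the one with GenBank homologues) can be sketched as below. This is a simplified illustration, not the script we used: the `CDS` record, the `filter_cds` function and the treatment of any shared base as an overlap are assumptions, and ties (both or neither CDS having a GenBank match) are resolved here by keeping the upstream prediction, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class CDS:
    name: str
    start: int              # 1-based, inclusive (assumed convention)
    end: int                # 1-based, inclusive
    has_genbank_hit: bool   # True if a significant homologue was found


def overlaps(a, b):
    """True if the two CDSs share at least one base."""
    return a.start <= b.end and b.start <= a.end


def filter_cds(predictions, min_length=100):
    """Drop CDSs shorter than min_length bp and resolve overlapping pairs."""
    kept = []
    for cds in sorted(predictions, key=lambda c: c.start):
        if cds.end - cds.start + 1 < min_length:
            continue  # too short to be retained
        clash = next((k for k in kept if overlaps(k, cds)), None)
        if clash is None:
            kept.append(cds)
        elif cds.has_genbank_hit and not clash.has_genbank_hit:
            # Prefer the overlapping CDS that has a GenBank homologue.
            kept[kept.index(clash)] = cds
        # Otherwise the earlier (upstream) prediction is kept.
    return kept
```

Regions left with only unmatched CDSs would then be re-examined with ORF-finder and BLASTX, as described in the text; that manual step is not modelled here.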

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases.

Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with another major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support > 50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
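The decision rule for calling a CDS laterally transferred can be summarized as a small function. This is a hedged sketch of the criteria as stated in the text, not code used in the study: the function name `classify_cds`, the group labels and the representation of a clade as a set of group names are illustrative assumptions, and trees were in practice inspected individually.

```python
def classify_cds(rrna_group, clade_groups, bootstrap, min_support=50):
    """Classify a CDS from the supported clade it falls into.

    rrna_group   -- major bacterial group of the fosmid's rRNA phylotype
    clade_groups -- set of major groups represented in the clade that
                    contains the CDS (excluding the CDS itself)
    bootstrap    -- bootstrap support (%) for that clade
    Returns (call, donor); donor is None when it cannot be identified.
    """
    if bootstrap <= min_support:
        return ("not called", None)   # grouping not supported
    if rrna_group in clade_groups:
        return ("not called", None)   # agrees with rRNA, or within-group
    if len(clade_groups) == 1:
        (donor,) = clade_groups
        return ("LGT", donor)         # supported transfer from one group
    return ("LGT", None)              # mixed clade: donor unidentifiable
```

For example, a CDS on a β-proteobacterial fosmid that clusters with γ-proteobacteria at 73% bootstrap support would be returned as `("LGT", "gamma")` when the groups are labelled `"beta"` and `"gamma"`.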

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez-Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjögren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Béja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Béja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Béja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlömann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Müller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sánchez, L.B., Galperin, M.Y., and Müller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Béja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp. < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp. < 10 e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c08, b3cf12d07, b1bcf11c04, b1bf11a01 and b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com

Fig. 1. Overview of the sequenced fosmid clones. Yellow CDSs are suggested to have been acquired by LGT and blue CDSs have no significant match in GenBank. A, α-proteobacteria; B, β-proteobacteria; D, δ-proteobacteria; E, ε-proteobacteria; G, γ-proteobacteria; C, Cyanobacteria; CB, Chlorobi-Bacteroidetes; F, Firmicutes; P, proteobacteria; CH, Chloroflexi; TD, Thermus-Deinococcus group; ACT, Actinobacteria; PL, Planctomycetes; SPIR, Spirochaetes; THER, Thermotogales; AQ, Aquifecales; FUSO, Fusobacteria; ARCH, Archaeal; EUK, Eukaryotes; ENV, environmental sequence; [symbol not recoverable], cluster robustly within a mixed clade in phylogenetic trees; –, no significant match in GenBank. Uppercase: supported by phylogenetic analysis. Lowercase: suggested by BLAST searches, as there was no supported phylogeny. The low-quality region in b1dcf13c08 (position 1192–1342) is indicated by a black box. The orange shading indicates LGT-CDSs that are found in more than one fosmid. Labels within the figure: ORFAN; a conjugative transposon obtained from a Bacteroides bacterium. Panels: unknown, b1dcf51a06; Chloroflexi, b1dcf13f01; candidate division OP8, b3cf12f09; candidate division WS3, b1bcf11f04; candidate division WS3, b1dcf51c12; δ-proteobacteria, b1bcf11h03; δ-proteobacteria, b1bcf11d04; ε-proteobacteria, b1dcf13c08; γ-proteobacteria, b3cf12d07; γ-proteobacteria, b1bcf11c04; β-proteobacteria, b1bf11a01; Flavobacteriaceae, b1bf110d03.

Fig. 2. rRNA phylogenies.
A. The minimum evolution tree estimated from LogDet distances of the 23S-tag from the CeuI-fosmids (984 positions in alignment). For the sequences from the fosmid clones the Maximum Likelihood topology was similar (GTR + Γ + I), except that the δ-proteobacteria were paraphyletic, with the γ- and β-proteobacteria clustering within the δ-proteobacteria. Moreover, b1bcf11d04 fell at the bottom of this clade.
B. The minimum evolution tree estimated from LogDet distances of the 16S sequences (1243 positions in alignment). For the sequences from the fosmid clones the Maximum Likelihood (GTR + Γ + I) topology was identical. However, there were several differences in the backbone of the tree, with, for instance, Geobacter clustering with Firmicutes. The trees in both A and B were rooted by the Thermotoga maritima sequence.
C. The minimum evolution tree estimated from LogDet distances of the closest matches of the 16S fragment in b1bf110d03 (1046 positions in alignment). The Maximum Likelihood (GTR + Γ + I) topology was identical.
For all three trees, numbers on branches refer to bootstrap values from the minimum evolution analysis (italic) and from the Maximum Likelihood analysis (plain text). If both bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that both values were above 90%.

short to obtain reliable alignments, the CDS was found in a 'mixed' clade also containing genes from the same bacterial group, or the CDS was found outside its group but did not cluster with any specific lineage. For three of the clones more than 30% of the CDSs have been acquired by LGT (Table 1); two of these are from candidate divisions and one is from a δ-proteobacterium. For all three of these fosmids there appears to have been a transfer of a large island of genes from a phylogenetically distant lineage. Specifically, we infer an α-proteobacterial island in b3cf12f09, a δ-proteobacterial island in b1dcf51c12 and an archaeal/β-proteobacterial island in b1bcf11d04 (Fig. 1). It should be noted that the proportions of foreign genes identified here might not represent the proportion of foreign genes in the respective genomes that we have sampled, but rather the amount of LGT to be expected when sequencing environmental fosmid clones. For instance, in some genomes LGT might be enriched in certain variable parts of the genome. Indeed, the distribution of proteins that match COG categories was significantly different (P = 1.3 e-13 in a χ2-test) to what we observed for the end-sequencing of 'normal' fosmid clones (supplemental Fig. S1), the main difference being proportionally more J, K, U, F and H category sequences in the full fosmid sequences and more L, P, R and S category sequences among the end-sequences. When comparing the distributions of different COG-groups (i.e. informational, metabolism, etc.), however, the two data sets were significantly different only when including the poorly characterized categories (R, S). If such genes are more frequently transferred than the other categories, then we would be underestimating the level of LGT that would be expected when analysing metagenomic clones.
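The comparison of COG category distributions between the two data sets is a χ2-test of homogeneity on a 2 × k table of counts. The sketch below computes the Pearson statistic and its degrees of freedom in plain Python; it is an illustration of the test, not the software we used, the function name is hypothetical, and the counts in the usage note are invented placeholders rather than the data behind supplemental Fig. S1.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for two category-count dictionaries.

    counts_a, counts_b map a category label (e.g. a COG letter) to the
    number of sequences in that category for each data set.
    Returns (statistic, degrees_of_freedom) for the 2 x k table.
    """
    categories = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for category in categories:
        observed_a = counts_a.get(category, 0)
        observed_b = counts_b.get(category, 0)
        column_total = observed_a + observed_b
        for observed, row_total in ((observed_a, total_a), (observed_b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(categories) - 1
```

With placeholder counts `{"J": 10, "L": 30}` for one data set and `{"J": 30, "L": 10}` for the other, every expected count is 20 and the statistic is 20 with one degree of freedom. Converting the statistic to a P-value requires the χ2 survival function, for instance `scipy.stats.chi2.sf(statistic, df)` if SciPy is available.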

Interestingly, in b1bcf11d04 the transfer vector for one of the acquired gene clusters could be identified. ORF6 encodes an acetyl transferase, and ORF8, ORF9 and ORF10 encode subunits of an acyl-CoA synthase: two α-subunits and one β-subunit. Phylogenetic analyses suggested all four CDSs have been acquired by LGT, likely from a β-proteobacterium. The β-proteobacteria have in turn likely acquired the acyl-CoA synthase genes from Archaea (Fig. 3). In support of the archaeal origin of these genes, the acyl-CoA synthase in b1bcf11d04 has similar domain organization to the acetyl-CoA synthase in Pyrococcus spp., with two subunits (Sánchez et al., 2000). Furthermore, these genes have been transferred multiple times and the transfers involved all three domains of life [Fig. 3 (Andersson et al., 2003)]. These transferred CDSs are preceded by one integrase gene (ORF3), a transposase gene (ORF4) and an integrase/transposase gene (ORF5, COG2801, Tra5, which contains an integrase core domain; Table S7), which probably were responsible for transferring this cluster into this genome.

The α-proteobacterial island in the b3cf12f09 clone encodes a wide range of different functions and no typical mobile elements were identified. However, as this island extends to the 3′ end of the fosmid, mobile genes might be found further downstream. The first CDS of this island encodes a DnaJ-class chaperone (ORF29), which is truncated at the 5′ end. This pseudogene still shows 65% protein identity to a homologue in Magnetospirillum magnetotacticum (Table S3). Hence this probably represents a very recent transfer (or rearrangement). Another possibility is that this fosmid might be a chimera. However, the G + C content of the CDSs in the α-proteobacterial island (59.5% G + C) is very similar to the rest of the fosmid (59.6% G + C; supplemental Table S3). Also, further upstream there is a proteobacterial transposase

Fig. 3. Maximum Likelihood phylogeny of acetyl-CoA synthetase (ADP-forming) homologues estimated using PMBML (459 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF8 and ORF10 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Entamoeba histolytica. Numbers on branches refer to bootstrap support obtained using PMBML (in bold), PUZZLEBOOT (in plain text) and Neighbour-joining (in italic). If all bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 80%. [Tree not reproduced.]


Table 1. Summary of phylogenetic analyses.

Each entry gives, for one fosmid, the columns of the table: Phylogeny (23S rRNA, 16S rRNA, CDSs), LGT(a) (phylogenetic trees; phylogenetic distribution or genome context; total) and ORFans(b). Where only one rRNA entry is present it is listed as rRNA.

b1dcf51a06
  23S rRNA: No clear affiliation with existing sequences.
  16S rRNA: Could not be amplified.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT, phylogenetic trees: ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues, and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria.
  LGT, phylogenetic distribution or genome context: ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only.
  LGT, total: 4 CDSs (19% of the total) have been acquired by LGT.
  ORFans: 33 (38)

b1dcf13f01
  23S rRNA: Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
  CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with the 23S fragment. In addition, 6 CDSs only hit Chloroflexus aurantiacus.
  LGT, phylogenetic trees: Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25).
  LGT, phylogenetic distribution or genome context: ORF2 has significant homologues only in Crocosphaera watsonii.
  LGT, total: 3 CDSs (11% of total) have been acquired by LGT.
  ORFans: 14 (5)

b3cf12f09
  23S rRNA: Candidate division OP8 bacterium.
  16S rRNA: Candidate division OP8 bacterium.
  CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
  LGT, phylogenetic trees: Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT; 8 of these have been acquired from an α-proteobacterium and are found linked.
  LGT, phylogenetic distribution or genome context: Three CDSs found linked to CDSs where phylogenetic analyses suggest LGT have also likely been acquired by LGT.
  LGT, total: 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
  ORFans: 22 (9)

b1bcf11f04
  23S rRNA: Candidate division WS3 bacterium.
  16S rRNA: Candidate division WS3 bacterium.
  CDSs: Most CDSs agree with the rRNA and do not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
  LGT, phylogenetic trees: Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group.
  LGT, total: 2 CDSs (9% of total) have been acquired by LGT.
  ORFans: 26 (14)

b1bcf51c12
  rRNA: Candidate division WS3 bacterium.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT, phylogenetic trees: ORF6–ORF11 are also found in b1bcf11h3 in the same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes.
  LGT, phylogenetic distribution or genome context: ORF12 has no homologue in b1bcf11h3 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that this CDS was also transferred as part of a δ-proteobacterial island.
  LGT, total: 8 CDSs (44% of total) have been acquired by LGT. One large ‘island’ of δ-proteobacterial origin.
  ORFans: 22 (29)

b1cf111h03
  23S rRNA: δ-Proteobacterium.
  16S rRNA: –
  CDSs: 8 of 13 CDSs (57%) that give supported phylogenies agree with the fragment originating from a δ-proteobacterium.
  LGT, phylogenetic trees: Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria, ORF29 clusters with γ-proteobacteria and ORF30 is found at the bottom of a clade that contains α-proteobacteria and Actinobacteria.
  LGT, total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of B1BCF11h03 to b1dcf51c12 as well as to the Chlorobium lineage.
  ORFans: 6 (1)

b1bcf11d04
  23S rRNA: δ-Proteobacterium.
  16S rRNA: –
  CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
  LGT, phylogenetic trees: Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes, the fusA homologue (ORF19), is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well.
  LGT, phylogenetic distribution or genome context: Three CDSs (ORF3–5) that encode an integrase and two transposases precede four of the LGT genes detected in the phylogenetic analysis. ORF7 was also likely transferred with ORF3–ORF10. ORF20 and ORF21 have homologues mainly in Firmicutes and are neighbours of ORF19, which has also been acquired from Firmicutes.
  LGT, total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector (the integrase and transposase) for 8 of the transferred genes.
  ORFans: –

b1bcf13c08
  rRNA: ε-Proteobacterium, most closely related to Campylobacter jejuni.
  CDSs: 21 CDSs give supported phylogenies and of these 19 (90%) agree with rRNA.
  LGT, phylogenetic trees: ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria.
  LGT, phylogenetic distribution or genome context: ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
  LGT, total: 3 CDSs (7% of total) have been acquired by LGT.
  ORFans: 10 (3)

b3cf12d07
  23S rRNA: γ-Proteobacterium.
  16S rRNA: Clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
  CDSs: Only 7 CDSs give supported phylogenies. Of these, 4 (57%) agree with rRNA.
  LGT, phylogenetic trees: ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree.
  LGT, phylogenetic distribution or genome context: Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, and had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
  LGT, total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
  ORFans: 23 (23)

b1bcf1c04
  23S rRNA: γ-Proteobacterium.
  16S rRNA: –
  CDSs: 14 CDSs give supported phylogenies and of these 13 (93%) agree with rRNA.
  LGT, phylogenetic trees: Phylogenetic analyses show that two CDSs have been acquired by LGT: ORF3 is found in a mixed clade, while ORF30 clusters within β-proteobacteria.
  LGT, phylogenetic distribution or genome context: Three genes that show incongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34 have probably also been acquired by LGT: ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
  LGT, total: 5 CDSs (13% of total) have been acquired by LGT.
  ORFans: 3 (1)

b1bf11a01
  23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA).
  16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
  CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have their best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with the rRNA genes.
  LGT, phylogenetic trees: One, ORF30 (RsuA), clusters with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both the fosmid and Thiobacillus denitrificans.
  LGT, phylogenetic distribution or genome context: ORF29 has no significant homologues in proteobacteria.
  LGT, total: 4 CDSs (8% of total) have been acquired by LGT.
  ORFans: 3 (2)

b1bf110d03
  23S rRNA: –
  16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii.
  CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with the 16S fragment.
  LGT, phylogenetic trees: ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer.
  LGT, phylogenetic distribution or genome context: ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs was likely acquired from a relative of Bacteroides thetaiotaomicron.
  LGT, total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon is not included as it has been transferred within the Bacteroides.
  ORFans: 10 (3)

a. Only LGT events involving the CDSs from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). The number of CDSs acquired by LGT is given in the total entry.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parentheses is given the proportion of protein coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA and has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possesses it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophilic (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3. [Tree not reproduced.]


related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST, ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
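The identity reported for the flanking inverted repeat can be read as the fraction of matching positions between one repeat arm and the reverse complement of the other. A minimal sketch under that assumption follows. The 20-bp sequences are invented for illustration (the fosmid sequence is not reproduced here), and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def inverted_repeat_identity(left_arm, right_arm):
    """Percent identity between left_arm and the reverse complement of right_arm.

    Both arms must have the same length; gaps are not handled.
    """
    if len(left_arm) != len(right_arm):
        raise ValueError("repeat arms must have the same length")
    target = reverse_complement(right_arm).upper()
    matches = sum(a == b for a, b in zip(left_arm.upper(), target))
    return 100.0 * matches / len(left_arm)


# Invented 20-bp arm, plus a copy carrying four substitutions that is then
# placed in inverted orientation on the opposite strand.
left = "ACGTTGCAAGCTTAGGCATC"
diverged_copy = "ACATTGCCAGCTGAGGCTTC"
right = reverse_complement(diverged_copy)

print(inverted_repeat_identity(left, right))  # 16 of 20 positions match: 80.0
```

Two 20-bp arms, as given by the coordinates in the text, would reach 80% identity with 16 matching positions.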

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5% compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6% compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have G + C contents of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3. [Tree not reproduced.]


cluster with γ-proteobacteria and might therefore represent recent transfers within the γ-proteobacteria. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2% compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content compared with the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8% compared with 56.9% for the complete fosmid clone.
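The G + C screen described in this section can be sketched as follows. This is an illustration only: it assumes the 10% criterion is a difference of 10 percentage points from the fosmid average (which is consistent with ORF9 at 25.7% being called only marginally lower than 34.7%), and the function names are hypothetical.

```python
def gc_percent(sequence):
    """G + C content of a nucleotide sequence in percent, ignoring non-ACGT symbols."""
    seq = sequence.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


def flag_gc_outliers(cds_gc, fosmid_gc, threshold=10.0):
    """Return CDSs whose G + C content differs from the fosmid average by at
    least `threshold` percentage points, mapped to the signed difference."""
    return {name: round(gc - fosmid_gc, 1)
            for name, gc in cds_gc.items()
            if abs(gc - fosmid_gc) >= threshold}


# Values quoted in the text for clone b1dcf13c08 (fosmid average 34.7%).
b1dcf13c08 = {"ORF40": 22.2, "ORF9": 25.7}
print(flag_gc_outliers(b1dcf13c08, fosmid_gc=34.7))  # only ORF40 is flagged
```

With this reading, ORF40 (12.5 points below the clone average) passes the threshold, whereas ORF9 (9.0 points below) does not, matching how the two CDSs are described.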

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.
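The kind of comparison behind the t-test can be sketched with the ORFan column of Table 1. Several things here are assumptions rather than statements from the text: that the four uncultivated-lineage clones are b1dcf51a06, b3cf12f09, b1bcf11f04 and the second WS3 clone; that the first number in each ORFans entry is the quantity compared; that b1bcf11d04 (no ORFan entry) is left out; and that a pooled-variance Student's t-test was used. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / dof
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, dof


# First number of each ORFans entry in Table 1 (grouping is an assumption).
uncultivated = [33, 22, 26, 22]
cultivated = [14, 6, 10, 23, 3, 3, 10]

t_value, dof = pooled_t_statistic(uncultivated, cultivated)
print(f"t = {t_value:.2f} with {dof} degrees of freedom")
```

Under these assumptions the statistic is about 3.9 with 9 degrees of freedom, which standard t tables place at a two-sided P between 0.001 and 0.01. That is the same order of magnitude as the reported value, but the sketch is not a reconstruction of the exact test performed.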

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain and a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs – ORF15 and ORF16 – also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a function similar to that of the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene – the exonuclease domain – and that it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f04 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the REAL Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
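As an illustration of the vector design, the blunt-end site in the adaptor can be located programmatically. The sketch below searches the adaptor sequence given in the text for the PmlI recognition sequence (CACGTG, cut between CAC and GTG to leave blunt ends); the helper name `find_sites` is an assumption made for this example.

```python
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"  # adaptor sequence from the text
PMLI_SITE = "CACGTG"                             # PmlI cuts CAC^GTG (blunt ends)


def find_sites(seq: str, site: str) -> list:
    """Return the 0-based start position of every occurrence of site in seq."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # allow overlapping matches
    return positions


pmli_positions = find_sites(ADAPTOR, PMLI_SITE)
# The blunt cut falls three bases into each site, between CAC and GTG.
pmli_cut_points = [pos + 3 for pos in pmli_positions]
```

The adaptor contains a single PmlI site at its 3′ end, consistent with the description of one I-CeuI site followed by one blunt-end site.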

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. The PolC + DinG fusion proteins, which have the same domain structure as b1bcf11f04 ORF17, are labelled in the tree.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8× coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3× coverage). For one low-quality region, positions 1192–1342 in b1dcf13c08, we were not able to obtain high-quality sequence. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
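The two CDS selection rules stated above (a 100 bp length cut-off and a preference for the overlapping CDS that has a GenBank homologue) can be sketched as follows. This is an illustration under stated assumptions rather than the script that was used: the `CDS` record, the `filter_cds` function and the treatment of any shared base as an overlap are all hypothetical, and overlapping pairs in which both or neither CDS has a hit are left untouched because the text does not say how they were handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    name: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    has_hit: bool   # True if a significant GenBank homologue was found

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two CDSs share at least one base (assumed definition)."""
    return a.start <= b.end and b.start <= a.end


def filter_cds(predictions, min_len=100):
    """Drop short CDSs, then resolve overlaps in favour of the CDS with a hit."""
    kept = [c for c in predictions if c.length >= min_len]
    dropped = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            # Only act when exactly one member of the overlapping pair has a hit.
            if overlaps(a, b) and a.has_hit != b.has_hit:
                dropped.add(b if a.has_hit else a)
    return [c for c in kept if c not in dropped]
```

For example, given a 300 bp CDS with a GenBank hit that overlaps a hit-less CDS, only the former is retained, and any prediction shorter than 100 bp is removed regardless of hits.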

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were estimated for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a major bacterial group other than that of the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
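The decision rule for calling a CDS laterally transferred can be summarised in code. The sketch below is a simplified illustration, not the procedure that was run: the `TreeResult` record and `classify` function are hypothetical, the tree inspection itself is assumed to have been done beforehand, and the 'mixed clade' case described above is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeResult:
    cds: str
    rrna_group: str               # major group of the fosmid's rRNA phylotype
    sister_group: Optional[str]   # major group the CDS clusters with, if any
    bootstrap: float              # bootstrap support (%) for that grouping


def classify(result: TreeResult, min_support: float = 50.0) -> str:
    """Apply the >50% bootstrap rule for between-group transfers."""
    if result.sister_group is None:
        return "unresolved"
    if result.sister_group == result.rrna_group:
        # Within-group transfers were not counted as LGT.
        return "agrees with rRNA"
    if result.bootstrap > min_support:
        return "LGT"
    return "unsupported conflict"
```

A CDS from a δ-proteobacterial clone that groups with β-proteobacteria at 80% bootstrap support would be returned as "LGT", whereas the same grouping at 40% support would not.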

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. PNAS 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP: Phylogenetic Analysis Using Parsimony (and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp < 10 e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01 and b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


Fig. 2. rRNA phylogenies.
A. The minimum evolution tree estimated from LogDet distances of the 23S-tag from the CeuI-fosmids (984 positions in alignment). For the sequences from the fosmid clones, the Maximum Likelihood topology was similar (GTR + Γ + I), except that the δ-proteobacteria were paraphyletic, with the γ- and β-proteobacteria clustering within the δ-proteobacteria. Moreover, b1bcf11d04 fell at the bottom of this clade.
B. The minimum evolution tree estimated from LogDet distances of the 16S sequences (1243 positions in alignment). For the sequences from the fosmid clones, the Maximum Likelihood (GTR + Γ + I) topology was identical. However, there were several differences in the backbone of the tree, with, for instance, Geobacter clustering with Firmicutes. The trees in both A and B were rooted by the Thermotoga maritima sequence.
C. The minimum evolution tree estimated from LogDet distances of the closest matches of the 16S fragment in b1bf110d03 (1046 positions in alignment). The Maximum Likelihood (GTR + Γ + I) topology was identical.
For all three trees, numbers on branches refer to bootstrap values from the minimum evolution analysis (italic) and from the Maximum Likelihood analysis (plain text). If both bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that both values were above 90%.


short to obtain reliable alignments, the CDS was found in a 'mixed' clade also containing genes from the same bacterial group, or the CDS was found outside its group but did not cluster with any specific lineage. For three of the clones, more than 30% of the CDSs have been acquired by LGT (Table 1); two of these are from candidate divisions and one is from a δ-proteobacterium. For all three of these fosmids, there appears to have been a transfer of a large island of genes from a phylogenetically distant lineage. Specifically, we infer an α-proteobacterial island in b3cf12f09, a δ-proteobacterial island in b1dcf51c12 and an archaeal/β-proteobacterial island in b1bcf11d04 (Fig. 1). It should be noted that the proportions of foreign genes identified here might not represent the proportion of foreign genes in the respective genomes that we have sampled, but rather the amount of LGT to be expected when sequencing environmental fosmid clones. For instance, in some genomes LGT might be enriched in certain variable parts of the genome. Indeed, the distribution of proteins that match COG categories was significantly different (P = 1.3 e-13 in a χ2-test) from what we observed for the end-sequencing of 'normal' fosmid clones (supplemental Fig. S1), the main difference being proportionally more J, K, U, F and H category sequences in the full fosmid sequences and more L, P, R and S category sequences among the end-sequences. When comparing the distributions of different COG-groups (i.e. informational, metabolism, etc.), however, the two data sets were significantly different only when including the poorly characterized categories (R, S). If such genes are more frequently transferred than the other categories, then we would be underestimating the level of LGT that would be expected when analysing metagenomic clones.
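The χ2 comparison of COG category distributions between two data sets can be sketched as a test of homogeneity on a 2 × k table of counts. The code below is a minimal, standard-library illustration; the function name is an assumption, no counts from the study are used, and converting the statistic into a P value would additionally require the χ2 distribution (for example from a statistics package).

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Return (statistic, degrees of freedom) for a 2 x k table of counts.

    counts_a and counts_b hold the number of sequences per category in the
    two data sets, in the same category order.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both data sets must list the same categories")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    statistic = 0.0
    used_categories = 0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        if column_total == 0:
            continue  # a category absent from both sets carries no information
        used_categories += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, used_categories - 1
```

With one list of per-category counts for the full fosmid sequences and one for the end-sequences, the returned statistic and degrees of freedom are the inputs needed to look up the corresponding P value.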

Interestingly, in b1bcf11d04 the transfer vector for one of the acquired gene clusters could be identified. ORF6 encodes an acetyl transferase, and ORF8, ORF9 and ORF10 encode subunits of an acyl-CoA synthase – two α-subunits and one β-subunit. Phylogenetic analyses suggested that all four CDSs have been acquired by LGT, likely from a β-proteobacterium. The β-proteobacteria have in turn likely acquired the acyl-CoA synthase genes from Archaea (Fig. 3). In support of the archaeal origin of these genes, the acyl-CoA synthase in b1bcf11d04 has a domain organization similar to that of the acetyl-CoA synthase in Pyrococcus spp., with two subunits (Sanchez et al., 2000). Furthermore, these genes have been transferred multiple times, and the transfers involved all three domains of life [Fig. 3; (Andersson et al., 2003)]. These transferred CDSs are preceded by one integrase gene (ORF3), a transposase gene (ORF4) and an integrase/transposase gene (ORF5; COG2801, Tra5, which contains an integrase core domain, Table S7), which probably were responsible for transferring this cluster into this genome.

The α-proteobacterial island in the b3cf12f09 clone encodes a wide range of different functions, and no typical mobile elements were identified. However, as this island extends to the 3′ end of the fosmid, mobile genes might be found further downstream. The first CDS of this island encodes a DnaJ-class chaperone (ORF29), which is truncated at the 5′ end. This pseudogene still shows 65% protein identity to a homologue in Magnetospirillum magnetotacticum (Table S3). Hence, this probably represents a very recent transfer (or rearrangement). Another possibility is that this fosmid might be a chimera. However, the G + C content of the CDSs in the α-proteobacterial island (59.5% G + C) is very similar to that of the rest of the fosmid (59.6% G + C, supplemental Table S3). Also, further upstream there is a proteobacterial transposase

Fig. 3. Maximum Likelihood phylogeny of acetyl-CoA synthetase (ADP-forming) homologues, estimated using PMBML (459 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF8 and ORF10 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Entamoeba histolytica. Numbers on branches refer to bootstrap support obtained using PMBML (bold), PUZZLEBOOT (plain text) and Neighbour-joining (italic). If all bootstrap values were above 70%, this is indicated by a grey circle, while a black circle indicates that all three values were above 80%.


Table 1. Summary of phylogenetic analyses.

Columns: Fosmid; Phylogeny (23S rRNA, 16S rRNA, CDSs); LGT(a) (phylogenetic trees, phylogenetic distribution or genome context, total); ORFans(b).

b1dcf51a06
Phylogeny, 23S rRNA: No clear affiliation with existing sequences.
Phylogeny, 16S rRNA: Could not be amplified.
Phylogeny, CDSs: Most CDSs have no or only a few significant matches in GenBank.
LGT(a): ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues, and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria. ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only. 4 CDSs (19% of the total) have been acquired by LGT.
ORFans(b): 33 (38)

b1dcf13f01
Phylogeny, rRNA: Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
Phylogeny, CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with the 23S fragment. In addition, 6 CDSs only hit Chloroflexus aurantiacus.
LGT(a): Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25). ORF2 has significant homologues only in Crocosphaera watsonii. 3 CDSs (11% of total) have been acquired by LGT.
ORFans(b): 14 (5)

b3cf12f09
Phylogeny, 23S rRNA: Candidate division OP8 bacterium.
Phylogeny, 16S rRNA: Candidate division OP8 bacterium.
Phylogeny, CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
LGT(a): Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT; 8 of these have been acquired from an α-proteobacterium and are found linked. Three CDSs found linked to CDSs where phylogenetic analyses suggest LGT have also likely been acquired by LGT. 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
ORFans(b): 22 (9)

b1bcf11f04
Phylogeny, 23S rRNA: Candidate division WS3 bacterium.
Phylogeny, 16S rRNA: Candidate division WS3 bacterium.
Phylogeny, CDSs: Most CDSs agree with the rRNA and do not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
LGT(a): Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group. 2 CDSs (9% of total) have been acquired by LGT.
ORFans(b): 26 (14)


Table 1. cont.

Columns: Fosmid | Phylogeny (23S rRNA, 16S rRNA, CDSs) | LGT(a) (Phylogenetic trees, Phylogenetic distribution or genome context, Total) | ORFans(b)

Fosmid: b1bcf51c12
Phylogeny, rRNA: Candidate division WS3 bacterium
Phylogeny, CDSs: Most CDSs have no or only a few significant matches in GenBank
LGT (trees; distribution or genome context): ORF6–ORF11 are also found in b1bcf11h3 in same order and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes. ORF12 has no homologue in b1bcf11h3 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that also this CDS was transferred as part of/with a δ-proteobacterial island
LGT, total: 8 CDSs (44% of total) have been acquired by LGT. One large 'island' of δ-proteobacterial origin
ORFans: 22 (29%)

Fosmid: b1cf111h03
Phylogeny, 23S rRNA: δ-Proteobacterium
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 8 of 13 CDSs (57%) that gives supported phylogenies agree with the fragment originating from a δ-proteobacterium
LGT (trees; distribution or genome context): Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria. ORF29 clusters with γ-proteobacteria and ORF30 is found at bottom of clade that contains α-proteobacteria and Actinobacteria
LGT, total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of B1BCF11h03 to b1dcf51c12 as well to the Chlorobium lineage
ORFans: 6 (1%)


Table 1. cont.

Columns: Fosmid | Phylogeny (23S rRNA, 16S rRNA, CDSs) | LGT(a) (Phylogenetic trees, Phylogenetic distribution or genome context, Total) | ORFans(b)

Fosmid: b1bcf11d04
Phylogeny, 23S rRNA: δ-Proteobacterium
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment
LGT (trees; distribution or genome context): Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes, the fusA homologue (ORF19), is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well. Three CDSs (ORF3–5) that encode an integrase and two transposases that precedes four of the LGT genes detected in the phylogenetic analysis. ORF7 also likely transferred with ORF3–ORF10. ORF20 and ORF21 have mainly homologues in Firmicutes and is the neighbour of ORF19 that has also been acquired from Firmicutes
LGT, total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly this fosmid clone provides the transfer vector (the integrase and transposase) for 8 of the transferred genes
ORFans: –

Fosmid: b1bcf13c08
Phylogeny, rRNA: ε-Proteobacterium most closely related to Campylobacter jejuni
Phylogeny, CDSs: 21 CDSs give supported phylogenies and of these 19 (90%) agree with rRNA
LGT (trees; distribution or genome context): ORF4 clusters with Geobacter and Clostridium and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria. ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria
LGT, total: 3 CDSs (7% of total) have been acquired by LGT
ORFans: 10 (3%)

Fosmid: b3cf12d07
Phylogeny, 23S rRNA: γ-Proteobacterium
Phylogeny, 16S rRNA: Clusters within the γ-proteobacteria in LogDet distance trees but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree
Phylogeny, CDSs: Only 7 CDSs give supported phylogenies. Of these 4 (57%) agree with rRNA
LGT (trees; distribution or genome context): ORF7 cluster within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree. Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, had only divergent homologues in GenBank or no significant homologues may also have been acquired by LGT. In support of this ORF26 encodes a transposase
LGT, total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 was not included in estimate due to limited evidence for the transfer of these
ORFans: 23 (23%)


Table 1. cont.

Columns: Fosmid | Phylogeny (23S rRNA, 16S rRNA, CDSs) | LGT(a) (Phylogenetic trees, Phylogenetic distribution or genome context, Total) | ORFans(b)

Fosmid: b1bcf1c04
Phylogeny, 23S rRNA: γ-Proteobacterium
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 14 CDSs give supported phylogenies and of these 13 (93%) agree with rRNA
LGT (trees; distribution or genome context): Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade while ORF30 cluster within β-proteobacteria. Three genes that show uncongruent phylogenies but with low bootstrap support found close to ORF3 and ORF34 have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred
LGT, total: 5 CDSs (13% of total) have been acquired by LGT
ORFans: 3 (1%)

Fosmid: b1bf11a01
Phylogeny, 23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA)
Phylogeny, 16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA)
Phylogeny, CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes
LGT (trees; distribution or genome context): One, ORF30 (RsuA), cluster with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both fosmid and Thiobacillus denitrificans. ORF29 has no significant homologues in proteobacteria
LGT, total: 4 CDSs (8% of total) have been acquired by LGT
ORFans: 3 (2%)

Fosmid: b1bf110d03
Phylogeny, 23S rRNA: –
Phylogeny, 16S rRNA: A Flavobacteriaceae bacterium, among sequenced genomes most closely related to Cytophaga hutchinsonii
Phylogeny, CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with 16S fragment
LGT (trees; distribution or genome context): ORF5 and ORF10 have no close homologues in other Bacteroides and phylogenetic analysis suggests frequent transfer. ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs likely acquired from relative of Bacteroides thetaiotaomicron
LGT, total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon not included as it has been transferred within the Bacteroides
ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed was counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). Number of CDSs acquired by LGT is shown in bold.
b. ORFans where classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parenthesis is given the proportion of protein coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possess it and was then transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophiles (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12 the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

[Figure 4: phylogenetic tree including b1dcf51c12 ORF15 and b1bcf11d04 ORF19; taxon labels, scale bar and bootstrap values are not reproduced here.]


related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that have also been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
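The identity of an inverted repeat can be checked by comparing one arm with the reverse complement of the other. The sketch below is a minimal Python illustration under stated assumptions: the 20-nt sequence `left_arm` is made up for the example (the coordinates in the text each span 20 positions, so 80% identity would correspond to 16 matching positions), and the function names are hypothetical rather than taken from any tool used in the study.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def inverted_repeat_identity(left: str, right: str) -> float:
    """Percent identity between `left` and the reverse complement of `right`.

    Both arms must have the same length; no gaps are considered.
    """
    if len(left) != len(right):
        raise ValueError("repeat arms must have equal length")
    if not left:
        raise ValueError("empty sequence")
    rc_right = reverse_complement(right)
    matches = sum(a == b for a, b in zip(left.upper(), rc_right))
    return 100.0 * matches / len(left)


# Made-up 20-nt left arm (illustration only).
left_arm = "ACGTTGCAAGGCTTAGCCAT"

# A perfect inverted repeat partner is the reverse complement of the left arm.
perfect_right = reverse_complement(left_arm)

# Introduce four substitutions at the 3' end of the right arm.
_swap = str.maketrans("ACGT", "CATG")
imperfect_right = perfect_right[:-4] + perfect_right[-4:].translate(_swap)

print(inverted_repeat_identity(left_arm, perfect_right))    # prints 100.0
print(inverted_repeat_identity(left_arm, imperfect_right))  # prints 80.0
```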

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7%, compared with 52.5% for the rest of the fosmid. All these CDSs

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.

[Figure 5: phylogenetic tree including b1bcf11h03 ORF12 and b1dcf51c12 ORF7; taxon labels, scale bar and bootstrap values are not reproduced here.]


cluster with γ-proteobacteria and might therefore represent recent transfers within the γ-proteobacteria. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content compared with the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.
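The G + C screen described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: it assumes that "10% higher or lower" means a difference of more than 10 percentage points from the whole-clone value, and the sequence `toy_clone`, the CDS names and the coordinates are invented for the example.

```python
from typing import Dict, List, Tuple


def gc_content(seq: str) -> float:
    """Return the G + C content of a nucleotide sequence as a percentage."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_cds(
    clone_seq: str,
    cds_coords: Dict[str, Tuple[int, int]],
    threshold: float = 10.0,
) -> List[Tuple[str, float, float]]:
    """List CDSs whose G + C content differs from the clone by more than `threshold` points.

    `cds_coords` maps a CDS name to 0-based, end-exclusive (start, end) coordinates.
    Each result is (name, CDS G + C %, difference from the clone in points).
    """
    clone_gc = gc_content(clone_seq)
    flagged = []
    for name, (start, end) in cds_coords.items():
        cds_gc = gc_content(clone_seq[start:end])
        diff = cds_gc - clone_gc
        if abs(diff) > threshold:
            flagged.append((name, round(cds_gc, 1), round(diff, 1)))
    return flagged


# Toy clone (illustration only): 60 AT-rich positions followed by 40 GC-rich ones.
toy_clone = "AT" * 30 + "GC" * 20
toy_cds = {
    "cds_a": (0, 60),    # 0% G + C
    "cds_b": (60, 100),  # 100% G + C
    "cds_c": (30, 80),   # 40% G + C, same as the whole clone
}

print(gc_content(toy_clone))
print(flag_atypical_cds(toy_clone, toy_cds))
```

Comparing each CDS with the whole-clone average, rather than with a genome-wide value, mirrors the comparison reported above, where only the fosmid sequence is available.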

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from these lineages are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1, we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the REAL Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA−5S rRNA for our present study. The vector for cloning in the direction 23S rRNA−16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector pCC1FosCeuI23S was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8× coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3× coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an

PolC + DinG fusion proteinssame domain structure as b1bcf11f04ORF17

10

Actinobacillus pleuropneumoniae

Yersinia pestis

Vibrio cholerae

Photobacterium profundum

Idiomarina loihiensis

Methylococcus capsulatus

Xanthomonas oryzae

62

876175

Polaromonas sp JS666

Thiobacillus denitrificans

71

Burkholderia cepacia Bordetella parapertussis

74

Methylobacillus flagellatusAzoarcus sp EbN1

Desulfotalea psychrophila Magnetococcus sp MC-1 61

53Gloeobacter violaceus

Propionibacterium acnes Mycobacterium avium

Corynebacterium diphtheriae

Nocardia farcinica 62 92100

Shewanella oneidensis

Vibrio cholerae

Photobacterium profundum

83

Xanthomonas axonopodis

Neisseria meningitidisProteus vulgaris Microbulbifer degradansAzotobacter vinelandii

Leptospira interrogans

51

Rhodopirellula baltica

6463

Fusobacterium nucleatum

59Treponema denticola

558960

Parachlamydia sp UWE25

Geobacter sulfurreducens

Geobacter metallireducens

b1bcf11f04ORF17Chloroflexus aurantiacus

Moorella thermoacetica

Desulfitobacterium hafniense5353

80

5269

61

Exiguobacterium sp 255-15

Symbiobacterium thermophilum

Bacillus halodurans

Geobacillus kaustophilus

Bacillus cereus Oceanobacillus iheyensis

Listeria monocytogenes Pediococcus pentosaceus

Bacillus licheniformis

Bacillus subtilis

Fig 6 Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17 estimated using PMBML (517 positions in alignment) The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank and the 100 best matches where retrieved and aligned Groups of very similar sequences from the same species or sister species where trimmed down to one sequence representative The tree was arbi-trarily rooted by Actinobacillus pleuropneumo-niae Results from bootstrap analyses are indicated as in Fig 3

LGT and phylogenetic assignment of metagenomic clones 2025

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd Environmental Microbiology 7 2011ndash2026

alternative CDS was obtained using ORF-finder that did havea match in GenBank then that CDS was selected T-RNAswere identified with tRNAscan-SE (Lowe and Eddy 1997)The CDSs were annotated using BLASTP searches (Altschulet al 1997) of GenBank at httpwwwncbinlmnihgovBLAST and Pfam searches (Bateman et al 2004) at httpwwwsangeracukSoftwarePfamsearchshtml
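The CDS selection rules described above (the 100 bp length cutoff and the preference for overlapping CDSs with GenBank homologues) can be illustrated with a short sketch. This is not the run-glimmer2 script or the authors' pipeline; the `PredictedCDS` record, its fields and the `select_cdss` function are hypothetical names introduced only for illustration, and the handling of ties (both or neither CDS having a match) is an assumption, because the text does not specify it.

```python
from dataclasses import dataclass
from typing import List

MIN_CDS_LENGTH_BP = 100  # CDSs shorter than this are discarded, as in the text


@dataclass(frozen=True)
class PredictedCDS:
    """Hypothetical record for one gene prediction (not a Glimmer data type)."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    has_genbank_match: bool

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: PredictedCDS, b: PredictedCDS) -> bool:
    """True if the two coordinate ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select_cdss(predictions: List[PredictedCDS]) -> List[PredictedCDS]:
    """Apply the length cutoff, then resolve overlaps in favour of GenBank matches."""
    long_enough = [c for c in predictions if c.length >= MIN_CDS_LENGTH_BP]
    kept: List[PredictedCDS] = []
    for cds in sorted(long_enough, key=lambda c: c.start):
        if kept and overlaps(kept[-1], cds):
            previous = kept[-1]
            if cds.has_genbank_match and not previous.has_genbank_match:
                kept[-1] = cds      # prefer the CDS with a homologue
            elif cds.has_genbank_match == previous.has_genbank_match:
                kept.append(cds)    # assumption: no tie-break given, keep both
            # otherwise the previous CDS has the match and this one is dropped
        else:
            kept.append(cds)
    return kept
```

The sketch compares each CDS only with the most recently kept one, which is a simplification; a real annotation would also need the ORF-finder and BLASTX follow-up for CDSs without matches.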

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma distributed rates with four categories and invariable sites (GTR + Γ + Ι). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with another major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support > 50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi-group to γ-proteobacteria; no within-group transfers were included. For some of these trees the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
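The decision rule described above for calling a CDS laterally transferred can be written down compactly. The following is a minimal sketch under stated assumptions: group labels are plain strings at the level of major bacterial groups or phyla, a clade containing several groups is encoded with the hypothetical label "mixed", and the function name `classify_cds` and its return strings are illustrative only. It does not reproduce the tree building itself.

```python
from typing import Optional

BOOTSTRAP_CUTOFF = 50.0  # percent; support must exceed this value


def classify_cds(rrna_group: str, cds_group: Optional[str], bootstrap: float) -> str:
    """Classify one CDS from the group it clusters with and its bootstrap support.

    rrna_group: major bacterial group inferred from the fosmid's rRNA.
    cds_group:  major group the CDS clusters with, "mixed" for a clade with
                several groups, or None if no supported tree was obtained.
    """
    if cds_group is None or bootstrap <= BOOTSTRAP_CUTOFF:
        return "unresolved"
    if cds_group == "mixed":
        # Counted as LGT, but donor and recipient cannot be identified.
        return "LGT (donor unknown)"
    if cds_group == rrna_group:
        # Same major group: within-group transfers are not counted.
        return "agrees with rRNA"
    return "LGT"
```

For example, a CDS on a δ-proteobacterial fosmid that clusters with β-proteobacteria with 83% bootstrap support would be returned as "LGT", whereas the same grouping with 50% support would stay "unresolved".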

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits, exp < 10 e−10. Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


short to obtain reliable alignments, the CDS was found in a 'mixed' clade also containing genes from the same bacterial group, or the CDS was found outside its group but did not cluster with any specific lineage. For three of the clones, more than 30% of the CDSs have been acquired by LGT (Table 1); two of these are from candidate divisions and one is from a δ-proteobacterium. For all three of these fosmids there appears to have been a transfer of a large island of genes from a phylogenetically distant lineage. Specifically, we infer an α-proteobacterial island in b3cf12f09, a δ-proteobacterial island in b1dcf51c12 and an archaeal/β-proteobacterial island in b1bcf11d04 (Fig. 1). It should be noted that the proportions of foreign genes identified here might not represent the proportion of foreign genes in the respective genomes that we have sampled, but rather the amount of LGT to be expected when sequencing environmental fosmid clones. For instance, in some genomes LGT might be enriched in certain variable parts of the genome. Indeed, the distribution of proteins that match COG categories was significantly different (P = 1.3e−13 in a χ2-test) from what we observed for the end-sequencing of 'normal' fosmid clones (supplemental Fig. S1), the main difference being proportionally more J, K, U, F and H category sequences in the full fosmid sequences and more L, P, R and S category sequences among the end-sequences. When comparing the distributions of different COG-groups (i.e. informational, metabolism, etc.), however, the two data sets were significantly different only when including the poorly characterized categories (R, S). If such genes are more frequently transferred than the other categories, then we would be underestimating the level of LGT that would be expected when analysing metagenomic clones.
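The comparison of COG category distributions between two data sets is a χ2 test of homogeneity on a 2 × k table of counts. A minimal sketch of the statistic follows, written with the standard library only; the function name `chi_square_homogeneity` is hypothetical, the counts in the usage note are invented for illustration and are not the counts from this study, and converting the statistic to a P value would require a χ2 distribution function from a statistics package.

```python
from typing import Dict, Tuple


def chi_square_homogeneity(counts_a: Dict[str, int],
                           counts_b: Dict[str, int]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for two count vectors.

    Each dictionary maps a category (e.g. a COG letter) to the number of
    sequences observed in that category for one data set.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    # Ignore categories that are empty in both data sets.
    categories = [c for c in sorted(set(counts_a) | set(counts_b))
                  if counts_a.get(c, 0) + counts_b.get(c, 0) > 0]
    statistic = 0.0
    for category in categories:
        column_total = counts_a.get(category, 0) + counts_b.get(category, 0)
        for observed, row_total in ((counts_a.get(category, 0), total_a),
                                    (counts_b.get(category, 0), total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(categories) - 1  # (2 - 1) * (k - 1)
    return statistic, degrees_of_freedom
```

With the invented counts `{"J": 10, "K": 10}` and `{"J": 10, "K": 30}`, the function gives a statistic of 3.75 with one degree of freedom; two data sets with identical category proportions give a statistic of zero.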

Interestingly, in b1bcf11d04 the transfer vector for one of the acquired gene clusters could be identified. ORF6 encodes an acetyl transferase gene, and ORF8, ORF9 and ORF10 encode subunits for an acyl-CoA synthase – two α-subunits and one β-subunit. Phylogenetic analyses suggested that all four CDSs have been acquired by LGT, likely from a β-proteobacterium. The β-proteobacteria have in turn likely acquired the acyl-CoA synthase genes from Archaea (Fig. 3). In support of the archaeal origin of these genes, the acyl-CoA synthase in b1bcf11d04 has a similar domain organization to the acetyl-CoA synthase in Pyrococcus spp., with two subunits (Sanchez et al., 2000). Furthermore, these genes have been transferred multiple times, and the transfers involved all three domains of life [Fig. 3 (Andersson et al., 2003)]. These transferred CDSs are preceded by one integrase gene (ORF3), a transposase gene (ORF4) and an integrase/transposase gene (ORF5; COG2801, Tra5, which contains an integrase core domain, Table S7), which probably were responsible for transferring this cluster into this genome.

The α-proteobacterial island in the b3cf12f09 clone encodes a wide range of different functions, and no typical mobile elements were identified. However, as this island extends to the 3′ end of the fosmid, mobile genes might be found further downstream. The first CDS of this island encodes a DnaJ-class chaperone (ORF29), which is truncated at the 5′ end. This pseudogene still shows 65% protein identity to a homologue in Magnetospirillum magnetotacticum (Table S3). Hence, this probably represents a very recent transfer (or rearrangement). Another possibility is that this fosmid might be a chimera. However, the G + C content of the CDSs in the α-proteobacterial island (59.5% G + C) is very similar to the rest of the fosmid (59.6% G + C, supplemental Table S3). Also, further upstream there is a proteobacterial transposase

Fig. 3. Maximum Likelihood phylogeny of acetyl-CoA synthetase (ADP-forming) homologues, estimated using PMBML (459 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF8 and ORF10 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one sequence representative. The tree was arbitrarily rooted by Entamoeba histolytica. Numbers on branches refer to bootstrap support obtained using PMBML (in bold), PUZZLEBOOT (in plain text) and Neighbour-joining (in italic). If all bootstrap values were above 70, this is indicated by a grey circle, while a black circle indicates that all three values were above 80.


Table 1. Summary of phylogenetic analyses.

Columns: Fosmid; Phylogeny (23S rRNA; 16S rRNA; CDSs); LGT(a) (phylogenetic trees; phylogenetic distribution or genome context; total); ORFans(b). A dash (–) marks an empty cell.

b1dcf51a06
  rRNA (23S; 16S): No clear affiliation with existing sequences; could not be amplified.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT evidence (trees; distribution or genome context): ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria. ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only.
  LGT total: 4 CDSs (19% of the total) have been acquired by LGT.
  ORFans: 33 (38%)

b1dcf13f01
  rRNA (23S; 16S): Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
  CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with the 23S fragment. In addition, 6 CDSs only hit Chloroflexus aurantiacus.
  LGT evidence (trees; distribution or genome context): Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25). ORF2 has significant homologues only in Crocosphaera watsonii.
  LGT total: 3 CDSs (11% of total) have been acquired by LGT.
  ORFans: 14 (5%)

b3cf12f09
  rRNA (23S; 16S): Candidate division OP8 bacterium; Candidate division OP8 bacterium.
  CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
  LGT evidence (trees; distribution or genome context): Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT; 8 of these have been acquired from an α-proteobacterium and are found linked. Three CDSs found linked to CDSs where phylogenetic analyses suggest LGT have also likely been acquired by LGT.
  LGT total: 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
  ORFans: 22 (9%)

b1bcf11f04
  rRNA (23S; 16S): Candidate division WS3 bacterium; Candidate division WS3 bacterium.
  CDSs: Most CDSs agree with the rRNA and do not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
  LGT evidence (trees; distribution or genome context): Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group.
  LGT total: 2 CDSs (9% of total) have been acquired by LGT.
  ORFans: 26 (14%)

b1bcf51c12
  rRNA (23S; 16S): Candidate division WS3 bacterium.
  CDSs: Most CDSs have no or only a few significant matches in GenBank.
  LGT evidence (trees; distribution or genome context): ORF6–ORF11 are also found in b1bcf11h3 in the same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes. ORF12 has no homologue in b1bcf11h3 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that this CDS was also transferred as part of a δ-proteobacterial island.
  LGT total: 8 CDSs (44% of total) have been acquired by LGT. One large 'island' of δ-proteobacterial origin.
  ORFans: 22 (29%)

b1cf111h03
  rRNA (23S; 16S): δ-Proteobacterium; –
  CDSs: 8 of 13 CDSs (57%) that give supported phylogenies agree with the fragment originating from a δ-proteobacterium.
  LGT evidence (trees; distribution or genome context): Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria. ORF29 clusters with γ-proteobacteria, and ORF30 is found at the bottom of a clade that contains α-proteobacteria and Actinobacteria.
  LGT total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of B1BCF11h03 to b1dcf51c12 as well as to the Chlorobium lineage.
  ORFans: 6 (1%)

b1bcf11d04
  rRNA (23S; 16S): δ-Proteobacterium; –
  CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
  LGT evidence (trees; distribution or genome context): Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes – the fusA homologue (ORF19) – is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well. Three CDSs (ORF3–5), which encode an integrase and two transposases, precede four of the LGT genes detected in the phylogenetic analysis. ORF7 was also likely transferred with ORF3–ORF10. ORF20 and ORF21 have mainly homologues in Firmicutes and are the neighbours of ORF19, which has also been acquired from Firmicutes.
  LGT total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector – the integrase and transposase – for 8 of the transferred genes.
  ORFans: –

b1bcf13c08
  rRNA (23S; 16S): ε-Proteobacterium, most closely related to Campylobacter jejuni.
  CDSs: 21 CDSs give supported phylogenies, and of these 19 (90%) agree with rRNA.
  LGT evidence (trees; distribution or genome context): ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria. ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
  LGT total: 3 CDSs (7% of total) have been acquired by LGT.
  ORFans: 10 (3%)

b3cf12d07
  rRNA (23S; 16S): γ-Proteobacterium; clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
  CDSs: Only 7 CDSs give supported phylogenies. Of these, 4 (57%) agree with rRNA.
  LGT evidence (trees; distribution or genome context): ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree. Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, and had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
  LGT total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
  ORFans: 23 (23%)

b1bcf1c04
  rRNA (23S; 16S): γ-Proteobacterium; –
  CDSs: 14 CDSs give supported phylogenies, and of these 13 (93%) agree with rRNA.
  LGT evidence (trees; distribution or genome context): Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 clusters within β-proteobacteria. Three genes that show incongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34, have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
  LGT total: 5 CDSs (13% of total) have been acquired by LGT.
  ORFans: 3 (1%)

b1bf11a01
  rRNA (23S; 16S): β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA); β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
  CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have their best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with the rRNA genes.
  LGT evidence (trees; distribution or genome context): One, ORF30 (RsuA), clusters with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both the fosmid and Thiobacillus denitrificans. ORF29 has no significant homologues in proteobacteria.
  LGT total: 4 CDSs (8% of total) have been acquired by LGT.
  ORFans: 3 (2%)

b1bf110d03
  rRNA (23S; 16S): –; a Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii.
  CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with the 16S fragment.
  LGT evidence (trees; distribution or genome context): ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer. ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs was likely acquired from a relative of Bacteroides thetaiotaomicron.
  LGT total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon is not included, as it has been transferred within the Bacteroides.
  ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). The number of CDSs acquired by LGT is given under 'LGT total'.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parentheses is given the proportion of protein coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possess it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophiles (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly related to the OmpA found in this gene cluster and was not included in the alignment.

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1; Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that have also been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
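An inverted repeat of this kind can be checked with a simple ungapped comparison: one arm is compared with the reverse complement of the other, and the share of matching positions gives the percent identity. The sketch below is illustrative only. The function names and the 20-bp example arms are hypothetical placeholders, not fosmid sequence, and the authors do not state which tool they used.

```python
# Illustrative sketch: percent identity between the two arms of a putative
# inverted repeat. The sequences below are made-up placeholders.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def inverted_repeat_identity(left_arm: str, right_arm: str) -> float:
    """Ungapped percent identity between left_arm and the reverse
    complement of right_arm. Both arms must have the same length."""
    if len(left_arm) != len(right_arm) or not left_arm:
        raise ValueError("arms must be non-empty and of equal length")
    right_rc = reverse_complement(right_arm)
    matches = sum(a == b for a, b in zip(left_arm.upper(), right_rc))
    return 100.0 * matches / len(left_arm)


# Hypothetical 20-bp arms: the right arm is a perfect inverted copy of the
# left arm except at four positions, so 16 of 20 positions match.
left = "ACGTTGCAAGCTTAGGCCTA"
right = list(reverse_complement(left))
for i in (0, 5, 10, 15):
    right[i] = "A" if right[i] != "A" else "C"
identity = inverted_repeat_identity(left, "".join(right))
```

With these placeholder arms the function returns 80.0, and a perfect inverted repeat would give 100.0. Real repeats may need gapped alignment, which this sketch does not attempt.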

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7%, compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent transfers within the γ-proteobacteria. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content than the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.
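The G + C screen described in this section amounts to comparing each CDS with the average of its fosmid. A minimal sketch follows, assuming that the 10% criterion means a difference of more than 10 percentage points; this reading fits the values quoted in the text but is not stated explicitly by the authors. The function names and the toy sequences are hypothetical.

```python
# Illustrative sketch: flag CDSs whose G + C content deviates from the
# fosmid average. Toy sequences only; the threshold reading is an assumption.

def gc_percent(seq: str) -> float:
    """G + C content in percent, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


def flag_gc_outliers(fosmid_seq: str, cds_seqs: dict, threshold: float = 10.0) -> dict:
    """Return {CDS name: difference in percentage points} for every CDS
    whose G + C content differs from the fosmid by more than threshold."""
    fosmid_gc = gc_percent(fosmid_seq)
    flagged = {}
    for name, seq in cds_seqs.items():
        diff = gc_percent(seq) - fosmid_gc
        if abs(diff) > threshold:
            flagged[name] = round(diff, 1)
    return flagged


toy_fosmid = "AT" * 50 + "GC" * 50        # 50% G + C
toy_cds = {
    "ORF_a": "GC" * 30 + "AT" * 10,       # 75% G + C
    "ORF_b": "ATGC" * 10,                 # 50% G + C
    "ORF_c": "AT" * 35 + "GC" * 15,       # 30% G + C
}
flagged = flag_gc_outliers(toy_fosmid, toy_cds)
```

Here ORF_a and ORF_c are flagged and ORF_b is not. As the text notes for ORF20, a deviating G + C content alone was not treated as sufficient evidence of LGT.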

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1, we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
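As a quick sanity check on the adaptor design, the adaptor can be scanned for the two sites the cloning strategy depends on. The sketch below uses the adaptor sequence given above. The PmlI site (CACGTG, cut in the middle to leave blunt ends) is standard. The I-CeuI site string is the form commonly listed by enzyme suppliers and should be treated as an assumption here, as are all variable and function names.

```python
# Illustrative sketch: locate the I-CeuI and PmlI sites in the adaptor.

ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"   # sequence from the text
PMLI_SITE = "CACGTG"                               # PmlI cuts CAC^GTG (blunt)
ICEUI_SITE = "TAACTATAACGGTCCTAAGGTAGCGA"          # assumed supplier-style site


def find_sites(seq: str, site: str) -> list:
    """Return the 0-based start position of every occurrence of site in seq."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


iceui_hits = find_sites(ADAPTOR, ICEUI_SITE)
pmli_hits = find_sites(ADAPTOR, PMLI_SITE)

# 0-based index of the first base after the blunt PmlI cut (between CAC and GTG).
pmli_cut = pmli_hits[0] + 3 if pmli_hits else None
```

Under these assumptions each site occurs once: the I-CeuI site starts at position 2 and the PmlI site at position 29, at the 3′ end of the adaptor.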

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to > 8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17 estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Clade annotation in figure: PolC + DinG fusion proteins, same domain structure as b1bcf11f04 ORF17.
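The CDS selection rules in the sequencing paragraph above (discard calls shorter than 100 bp; of two overlapping calls, keep the one with significant GenBank homologues) can be sketched as a small filter. This is a simplified illustration, not the authors' pipeline. The class and function names are hypothetical, any overlap is treated as a conflict, and ties are resolved by keeping the upstream call, which the text does not specify.

```python
# Illustrative sketch of the CDS filtering rules; names and tie-breaking
# are assumptions, and homologue status is taken as given.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PredictedCDS:
    name: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    has_homologue: bool   # significant GenBank match, decided elsewhere

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: PredictedCDS, b: PredictedCDS) -> bool:
    """True if the two coordinate ranges share at least one base."""
    return a.start <= b.end and b.start <= a.end


def filter_predictions(predictions: List[PredictedCDS],
                       min_length: int = 100) -> List[PredictedCDS]:
    """Drop short calls, then resolve overlaps in favour of calls that
    have a homologue; otherwise the upstream call is kept."""
    kept: List[PredictedCDS] = []
    long_enough = [c for c in predictions if c.length >= min_length]
    for cds in sorted(long_enough, key=lambda c: c.start):
        clash = next((k for k in kept if overlaps(k, cds)), None)
        if clash is None:
            kept.append(cds)
        elif cds.has_homologue and not clash.has_homologue:
            kept[kept.index(clash)] = cds
    return kept


calls = [
    PredictedCDS("orf_a", 1, 90, True),       # shorter than 100 bp
    PredictedCDS("orf_b", 100, 400, False),   # overlaps orf_c, no homologue
    PredictedCDS("orf_c", 350, 700, True),    # overlaps orf_b, has homologue
    PredictedCDS("orf_d", 800, 1200, False),  # ORFan, no conflict
]
kept_names = [c.name for c in filter_predictions(calls)]
```

In this toy input only orf_c and orf_d survive. Genuine neighbouring genes often overlap by a few bases, so a production filter would need an overlap tolerance.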

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01, we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. Second, if the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with another major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support > 50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
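The decision rule for calling LGT from a tree can be written down compactly. The sketch below assumes that each CDS has already been assigned to the major bacterial group it clusters with, together with the bootstrap support for that grouping. The function name, the group labels, the return labels and the example support values are all hypothetical.

```python
# Illustrative sketch of the LGT decision rule; labels and example
# bootstrap values are placeholders, not results from the paper.
from typing import Optional


def classify_cds(rrna_group: str,
                 cds_group: Optional[str],
                 bootstrap: float,
                 min_support: float = 50.0) -> str:
    """Classify one CDS from the group its tree places it in.

    rrna_group  -- major group (phylum or proteobacterial class) of the rRNA
    cds_group   -- major group the CDS clusters with, or None if no tree
    bootstrap   -- bootstrap support (percent) for the CDS grouping
    """
    if cds_group is None:
        return "no phylogeny"
    if cds_group == rrna_group:
        # Includes within-group transfers, which were not counted as LGT.
        return "agrees with rRNA"
    if bootstrap > min_support:
        return "LGT"
    return "conflict without support"


examples = [
    classify_cds("beta-proteobacteria", "gamma-proteobacteria", 85.0),
    classify_cds("gamma-proteobacteria", "delta-proteobacteria", 40.0),
    classify_cds("Bacteroidetes/Chlorobi", "Bacteroidetes/Chlorobi", 95.0),
    classify_cds("candidate division WS3", None, 0.0),
]
```

Because the comparison is made at the level of major groups, a case such as the Flavobacteriaceae transposon acquired from the Bacteroidales falls under "agrees with rRNA" and is not counted, matching the exclusion of within-group transfers described above.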

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein J (2001) PHYLIP Phylogeny Inference PackageSeattle USA Department of Genetics University of Wash-ington

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. PNAS 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits with exp < 10 e−10. Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


Table 1. Summary of phylogenetic analyses.

Columns: Fosmid; Phylogeny (23S rRNA, 16S rRNA, CDSs); LGT(a) (phylogenetic trees, phylogenetic distribution or genome context, total); ORFans(b).

Fosmid: b1dcf51a06
23S rRNA: No clear affiliation with existing sequences.
16S rRNA: Could not be amplified.
CDSs: Most CDSs have no or only a few significant matches in GenBank.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF4 clusters with Leptospira interrogans within a mixed clade; however, L. interrogans has several paralogues, and this gene appears to have been frequently transferred and is likely to be a transfer. ORF20 clusters with Methanosarcina within δ-proteobacteria. ORF19 clusters with Geobacter but is mostly found in methanogens. ORF17 and ORF18 have homologues in methanogens only.
LGT total: 4 CDSs (19% of the total) have been acquired by LGT.
ORFans: 33 (38)

Fosmid: b1dcf13f01
rRNA: Clusters with Dehalococcoides ethenogenes. Chloroflexus aurantiacus 23S rRNA sequence of too poor quality to include in the tree.
CDSs: 7 of 10 CDSs (70%) with supported phylogenetic topologies agree with the 23S fragment. In addition, 6 CDSs only hit Chloroflexus aurantiacus.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Two CDSs have likely been acquired through LGT. One clusters with high support with Thermotoga maritima (ORF16) and one clusters within the eukaryotes (ORF25). ORF2 has significant homologues only in Crocosphaera watsonii.
LGT total: 3 CDSs (11% of total) have been acquired by LGT.
ORFans: 14 (5)

Fosmid: b3cf12f09
23S rRNA: Candidate division OP8 bacterium.
16S rRNA: Candidate division OP8 bacterium.
CDSs: Most CDSs agree with the rRNA genes and do not cluster within any specific bacterial group.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Phylogenetic analysis suggests that 10 CDSs have likely been acquired by LGT. 8 of these have been acquired from an α-proteobacterium and are found linked. Three CDSs found linked to CDSs where phylogenetic analyses suggest LGT have also likely been acquired by LGT.
LGT total: 13 CDSs (32% of total) have been acquired by LGT. ORF16 is a transposase of proteobacterial origin and shows lower GC content than the rest of the fosmid. Twelve of the transferred CDSs (ORF29–41) are linked and all appear to have been acquired from an α-proteobacterium.
ORFans: 22 (9)

Fosmid: b1bcf11f04
23S rRNA: Candidate division WS3 bacterium.
16S rRNA: Candidate division WS3 bacterium.
CDSs: Most CDSs agree with the rRNA and do not cluster with any specific bacterial lineage. Among these was the highly conserved DnaE gene.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Two CDSs (ORF14 and ORF15) cluster with sequences from the Chlorobi/Bacteroidetes group.
LGT total: 2 CDSs (9% of total) have been acquired by LGT.
ORFans: 26 (14)


Table 1. cont.

Fosmid: b1bcf51c12
rRNA: Candidate division WS3 bacterium.
CDSs: Most CDSs have no or only a few significant matches in GenBank.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF6–ORF11 are also found in b1bcf11h3 in the same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes. ORF12 has no homologue in b1bcf11h3 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that this CDS was also transferred as part of a δ-proteobacterial island.
LGT total: 8 CDSs (44% of total) have been acquired by LGT. One large 'island' of δ-proteobacterial origin.
ORFans: 22 (29)

Fosmid: b1cf111h03
23S rRNA: δ-Proteobacterium.
16S rRNA: –
CDSs: 8 of 13 CDSs (57%) that give supported phylogenies agree with the fragment originating from a δ-proteobacterium.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria, ORF29 clusters with γ-proteobacteria and ORF30 is found at the bottom of a clade that contains α-proteobacteria and Actinobacteria.
LGT total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of B1BCF11h03 to b1dcf51c12 as well as to the Chlorobium lineage.
ORFans: 6 (1)


Table 1. cont.

Fosmid: b1bcf11d04
23S rRNA: δ-Proteobacterium.
16S rRNA: –
CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes, the fusA homologue (ORF19), is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well. Three CDSs (ORF3–5) that encode an integrase and two transposases precede four of the LGT genes detected in the phylogenetic analysis. ORF7 was also likely transferred with ORF3–ORF10. ORF20 and ORF21 have homologues mainly in Firmicutes and are the neighbours of ORF19, which has also been acquired from Firmicutes.
LGT total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector (the integrase and transposase) for 8 of the transferred genes.
ORFans: –

Fosmid: b1bcf13c08
rRNA: ε-Proteobacterium, most closely related to Campylobacter jejuni.
CDSs: 21 CDSs give supported phylogenies, and of these 19 (90%) agree with rRNA.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria. ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
LGT total: 3 CDSs (7% of total) have been acquired by LGT.
ORFans: 10 (3)

Fosmid: b3cf12d07
23S rRNA: γ-Proteobacterium.
16S rRNA: Clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
CDSs: Only 7 CDSs give supported phylogenies. Of these, 4 (57%) agree with rRNA.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree. Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, and had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
LGT total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
ORFans: 23 (23)


Table 1. cont.

Fosmid: b1bcf1c04
23S rRNA: γ-Proteobacterium.
16S rRNA: –
CDSs: 14 CDSs give supported phylogenies, and of these 13 (93%) agree with rRNA.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 clusters within β-proteobacteria. Three genes that show incongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34 have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
LGT total: 5 CDSs (13% of total) have been acquired by LGT.
ORFans: 3 (1)

Fosmid: b1bf11a01
23S rRNA: β-Proteobacterium, closely related to Thiobacillus denitrificans (98% identity at 23S rRNA).
16S rRNA: β-Proteobacterium, closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have their best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with the rRNA genes.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): One, ORF30 (RsuA), clusters with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both the fosmid and Thiobacillus denitrificans. ORF29 has no significant homologues in proteobacteria.
LGT total: 4 CDSs (8% of total) have been acquired by LGT.
ORFans: 3 (2)

Fosmid: b1bf110d03
23S rRNA: –
16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes, most closely related to Cytophaga hutchinsonii.
CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with the 16S fragment.
LGT evidence (phylogenetic trees; phylogenetic distribution or genome context): ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer. ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs was likely acquired from a relative of Bacteroides thetaiotaomicron.
LGT total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon is not included, as it has been transferred within the Bacteroides.
ORFans: 10 (3)

a. Only LGT events involving the CDSs from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). The number of CDSs acquired by LGT is shown in bold.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. The proportion of protein coding DNA that has no match in GenBank is given in parentheses.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possess it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophiles (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly related to the OmpA found in this gene cluster and was not included in the alignment.

Fig. 4. Maximum likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST, ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that have also been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
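The identity of a flanking inverted repeat such as the one reported above can be checked with a few lines of Python. The sketch below is an illustration only and not the authors' procedure: it assumes the two arms are supplied as equal-length strings read 5′ to 3′ on the same strand and compares them without gaps. The function names and the toy sequences are hypothetical.

```python
# Minimal sketch: ungapped identity between two arms of an inverted repeat.
# Function names and the example sequences are invented for illustration.

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def inverted_repeat_identity(left_arm: str, right_arm: str) -> float:
    """Fraction of positions at which the left arm matches the reverse
    complement of the right arm. Both arms are read 5' to 3' on one strand."""
    if not left_arm or len(left_arm) != len(right_arm):
        raise ValueError("arms must be non-empty and of equal length")
    partner = reverse_complement(right_arm)
    matches = sum(a == b for a, b in zip(left_arm.upper(), partner))
    return matches / len(left_arm)


if __name__ == "__main__":
    left = "ACGTACGTACGTACGTACGT"  # toy 20-bp left arm
    # Toy right arm: reverse complement of the left arm with 4 substitutions.
    right = reverse_complement("CCGTAAGTACTTACGGACGT")
    print(f"identity: {inverted_repeat_identity(left, right):.0%}")
```

With 20-bp arms, as implied by the coordinates given in the text, 80% identity corresponds to 16 matching positions out of 20.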

Few laterally transferred CDSs identified by G + C content

Fig. 5. Maximum likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5% compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6% compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent within-γ-proteobacteria transfers. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2% compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content compared with the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8% compared with 56.9% for the complete fosmid clone.
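The G + C comparison described in this section can be sketched in Python as follows. This is a hedged illustration rather than the authors' pipeline: it assumes the threshold refers to a difference of 10 percentage points between a CDS and the whole fosmid, and the function names, the threshold parameter and the toy sequences are all hypothetical.

```python
# Minimal sketch: flag CDSs whose G + C content deviates from the fosmid average.
# Assumption: "10% higher or lower" is read as 10 percentage points.

def gc_percent(seq: str) -> float:
    """G + C content in percent, counting only unambiguous A/C/G/T bases."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def flag_gc_outliers(fosmid_seq: str, cds_seqs: dict, threshold: float = 10.0) -> dict:
    """Return {CDS name: G + C difference in percentage points} for every CDS
    that differs from the whole-fosmid G + C content by at least `threshold`."""
    reference = gc_percent(fosmid_seq)
    outliers = {}
    for name, seq in cds_seqs.items():
        delta = gc_percent(seq) - reference
        if abs(delta) >= threshold:
            outliers[name] = round(delta, 1)
    return outliers


if __name__ == "__main__":
    fosmid = "ATGC" * 25  # toy fosmid sequence, 50% G + C
    cdss = {
        "ORF_high": "GGGGCCCCAT",                      # 80% G + C
        "ORF_typical": "GC" * 5 + "AT" * 4 + "GA",     # 55% G + C
        "ORF_low": "AT" * 9 + "GC",                    # 10% G + C
    }
    print(flag_gc_outliers(fosmid, cdss))
```

A compositional screen of this kind only detects recent transfers from donors with a different base composition, which is consistent with the small number of CDSs it picks out compared with the phylogenetic analyses.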

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.
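A two-sample comparison of this kind can be sketched with the Python standard library. The text does not say which form of the t-test was used, so the sketch below assumes Welch's unequal-variance version and reports only the test statistic and the Welch–Satterthwaite degrees of freedom; converting these to a P-value requires a t-distribution table or a statistics package. The function name and the example proportions are invented placeholders, not values from this study.

```python
# Minimal sketch of Welch's t statistic for two independent samples.
# Assumption: Welch's variant; the article only says "a t-test".
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


if __name__ == "__main__":
    # Placeholder ORFan proportions per clone (invented, for illustration only).
    uncultivated = [0.30, 0.25, 0.28, 0.22]
    cultivated = [0.05, 0.10, 0.08, 0.12, 0.03]
    t_stat, dof = welch_t(uncultivated, cultivated)
    print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With only four clones in one group, any such test has little power to detect small differences, so a significant result points to a large difference between the groups.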

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1, we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene (the exonuclease domain), and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the maximum likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
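As an illustration of the blunt-site step, the following minimal sketch locates the PmlI recognition sequence in the adaptor given above. It assumes the standard PmlI site CACGTG, cut as CAC^GTG to leave blunt ends; the helper `find_sites` and the constant names are hypothetical and are not part of the published protocol.

```python
# Adaptor sequence exactly as given in the text.
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"

# Assumption: PmlI recognises CACGTG and cuts between CAC and GTG (blunt ends).
PMLI_SITE = "CACGTG"
PMLI_CUT_OFFSET = 3  # number of bases of the site left of the cut


def find_sites(sequence, site):
    """Return the 0-based start of every (possibly overlapping) occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


pmli_positions = find_sites(ADAPTOR, PMLI_SITE)
cut_points = [p + PMLI_CUT_OFFSET for p in pmli_positions]

for cut in cut_points:
    # Show the two top-strand fragments that a blunt cut would produce.
    print(ADAPTOR[:cut], "|", ADAPTOR[cut:])
```

Under these assumptions the site occurs once, at the 3′ end of the adaptor, which is consistent with the adaptor carrying the I-CeuI site first and the blunt-end site after it.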

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script, using the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. [Tree not reproduced. One clade is labelled 'PolC + DinG fusion proteins, same domain structure as b1bcf11f04 ORF17'.]


alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
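The CDS selection rules described above (discard CDSs shorter than 100 bp; of two overlapping CDSs, keep the one with significant GenBank homologues) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' script: the `CDS` record, the `has_genbank_hit` flag and the function names are hypothetical, any shared base is treated as an overlap, and overlapping pairs in which both or neither CDS has a hit are left untouched because the text does not say how they were resolved.

```python
from dataclasses import dataclass

MIN_CDS_LENGTH_BP = 100  # from the text: CDSs shorter than 100 bp were eliminated


@dataclass(frozen=True)
class CDS:
    name: str
    start: int             # 1-based inclusive coordinate on the fosmid
    end: int               # 1-based inclusive coordinate on the fosmid
    has_genbank_hit: bool  # hypothetical flag: significant GenBank homologue found

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """Assumption: any shared base counts as an overlap."""
    return a.start <= b.end and b.start <= a.end


def select_cdss(predicted):
    """Apply the length filter, then resolve overlaps in favour of GenBank hits."""
    candidates = [c for c in predicted if c.length >= MIN_CDS_LENGTH_BP]
    rejected = set()
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            # Only pairs where exactly one CDS has a hit are resolved here.
            if overlaps(a, b) and a.has_genbank_hit != b.has_genbank_hit:
                rejected.add(b.name if a.has_genbank_hit else a.name)
    return [c for c in candidates if c.name not in rejected]


# Made-up coordinates for illustration only.
predicted = [
    CDS("orfA", 1, 90, True),       # 90 bp: removed by the length filter
    CDS("orfB", 100, 400, False),   # overlaps orfC and has no hit: removed
    CDS("orfC", 350, 900, True),
    CDS("orfD", 1000, 1500, False), # no overlap: kept even without a hit
]
selected_names = [c.name for c in select_cdss(predicted)]
print(selected_names)
```

CDSs without any GenBank match that survive this step would then go on to the ORF-finder and BLASTX checks described in the text, which are not modelled here.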

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps.

Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with another major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
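The decision rule for calling a CDS laterally transferred can be summarised in a short sketch. It assumes that each CDS tree has already been reduced to the major bacterial group the CDS clusters with and the bootstrap support for that grouping; the `TreeResult` record, the `"mixed"` label and the returned strings are hypothetical names chosen for illustration, not output of PAUP* or PHYLIP.

```python
from dataclasses import dataclass

BOOTSTRAP_THRESHOLD = 50.0  # percent; the text requires bootstrap support >50%


@dataclass(frozen=True)
class TreeResult:
    cds: str
    rrna_group: str   # major bacterial group of the fosmid's rRNA phylotype
    cds_group: str    # major group the CDS clusters with, or "mixed" for a mixed clade
    bootstrap: float  # bootstrap support (%) for that grouping


def classify(result: TreeResult) -> str:
    """Classify one CDS following the between-group, bootstrap-supported rule."""
    if result.cds_group == result.rrna_group:
        # Within-group transfers were not counted as LGT.
        return "agrees with rRNA"
    if result.bootstrap <= BOOTSTRAP_THRESHOLD:
        return "unsupported"
    if result.cds_group == "mixed":
        # Mixed clades count as LGT, but donor and recipient cannot be identified.
        return "LGT (donor unknown)"
    return "LGT"


# Made-up values for illustration only.
examples = [
    TreeResult("cds1", "candidate division WS3", "delta-proteobacteria", 88.0),
    TreeResult("cds2", "candidate division WS3", "delta-proteobacteria", 40.0),
    TreeResult("cds3", "gamma-proteobacteria", "gamma-proteobacteria", 95.0),
    TreeResult("cds4", "gamma-proteobacteria", "mixed", 72.0),
]
calls = {r.cds: classify(r) for r in examples}
print(calls)
```

The strict inequality mirrors the ">50%" wording, so a grouping with exactly 50% support is treated as unsupported in this sketch.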

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez-Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp < 10 e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


Table 1. cont.

Columns: Fosmid | Phylogeny (23S rRNA; 16S rRNA; CDSs) | LGT a (Phylogenetic trees; Phylogenetic distribution or genome context; Total) | ORFans b

Fosmid: b1bcf51c12
Phylogeny, 23S rRNA: Candidate division WS3 bacterium.
Phylogeny, CDSs: Most CDSs have no or only a few significant matches in GenBank.
LGT, phylogenetic trees: ORF6–ORF11 are also found in b1bcf11h3 in the same order, and phylogenetic analysis supports that ORF7, ORF8 and ORF10 were transferred from a δ-proteobacterium to b1bcf51c12. ORF10 and ORF11 also cluster with δ-proteobacteria, however with no bootstrap support. ORF9 has only one match in GenBank. ORF15 (fusA) clusters with Chlorobium tepidum within Firmicutes.
LGT, phylogenetic distribution or genome context: ORF12 has no homologue in b1bcf11h3 but does cluster with δ-proteobacteria, however with no bootstrap support. It is likely that this CDS was also transferred as part of a δ-proteobacterial island.
LGT, total: 8 CDSs (44% of total) have been acquired by LGT. One large 'island' of δ-proteobacterial origin.
ORFans: 22 (29%)

Fosmid: b1cf111h03
Phylogeny, 23S rRNA: δ-Proteobacterium.
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 8 of 13 CDSs (57%) that give supported phylogenies agree with the fragment originating from a δ-proteobacterium.
LGT, phylogenetic trees: Six CDSs have likely been acquired by LGT. ORF8 clusters with Clostridium thermocellum and Treponema denticola. ORF18 is found separated from other proteobacteria in phylogenetic trees, clustering with Plasmodium spp. ORF23 is found in a mixed clade and appears to have been frequently transferred. ORF28 clusters with β-proteobacteria, ORF29 clusters with γ-proteobacteria, and ORF30 is found at the bottom of a clade that contains α-proteobacteria and Actinobacteria.
LGT, total: 6 CDSs (17% of total) have been acquired by LGT. ORF11–ORF16 have been transferred from an ancestor of B1BCF11h03 to b1dcf51c12, as well as to the Chlorobium lineage.
ORFans: 6 (1%)


Fosmid: b1bcf11d04
Phylogeny, 23S rRNA: δ-Proteobacterium.
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
LGT, phylogenetic trees: Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes – the fusA homologue (ORF19) – is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well.
LGT, phylogenetic distribution or genome context: Three CDSs (ORF3–5) that encode an integrase and two transposases precede four of the LGT genes detected in the phylogenetic analysis. ORF7 was also likely transferred with ORF3–ORF10. ORF20 and ORF21 have mainly homologues in Firmicutes and are the neighbours of ORF19, which has also been acquired from Firmicutes.
LGT, total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector – the integrase and transposase – for 8 of the transferred genes.
ORFans: –

Fosmid: b1bcf13c08
Phylogeny, 23S rRNA: ε-Proteobacterium, most closely related to Campylobacter jejuni.
Phylogeny, CDSs: 21 CDSs give supported phylogenies, and of these 19 (90%) agree with rRNA.
LGT, phylogenetic trees: ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria.
LGT, phylogenetic distribution or genome context: ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
LGT, total: 3 CDSs (7% of total) have been acquired by LGT.
ORFans: 10 (3%)

Fosmid: b3cf12d07
Phylogeny, 23S rRNA: γ-Proteobacterium. Clusters within the γ-proteobacteria in LogDet distance trees, but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
Phylogeny, CDSs: Only 7 CDSs give supported phylogenies. Of these, 4 (57%) agree with rRNA.
LGT, phylogenetic trees: ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree.
LGT, phylogenetic distribution or genome context: Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, and had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
LGT, total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
ORFans: 23 (23%)


Fosmid: b1bcf1c04
Phylogeny, 23S rRNA: γ-Proteobacterium.
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 14 CDSs give supported phylogenies, and of these 13 (93%) agree with rRNA.
LGT, phylogenetic trees: Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 clusters within β-proteobacteria.
LGT, phylogenetic distribution or genome context: Three genes that show incongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34, have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred.
LGT, total: 5 CDSs (13% of total) have been acquired by LGT.
ORFans: 3 (1%)

Fosmid: b1bf11a01
Phylogeny, 23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA).
Phylogeny, 16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA).
Phylogeny, CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have their best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes.
LGT, phylogenetic trees: One, ORF30 (RsuA), clusters with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both the fosmid and Thiobacillus denitrificans.
LGT, phylogenetic distribution or genome context: ORF29 has no significant homologues in proteobacteria.
LGT, total: 4 CDSs (8% of total) have been acquired by LGT.
ORFans: 3 (2%)

Fosmid: b1bf110d03
Phylogeny, 23S rRNA: –
Phylogeny, 16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii.
Phylogeny, CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with the 16S fragment.
LGT, phylogenetic trees: ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer.
LGT, phylogenetic distribution or genome context: ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs was likely acquired from a relative of Bacteroides thetaiotaomicron.
LGT, total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon is not included, as it has been transferred within the Bacteroides.
ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). Number of CDSs acquired by LGT is shown in bold.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parentheses is given the proportion of protein-coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon – because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced – and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possesses it and then has been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophilic (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly

Fig. 4. Maximum Likelihood phylogeny of fusA homologues, estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3. [Tree not reproduced.]


related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from – highly polluted marine sediments – these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1; Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour – ORF34 – clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01 – ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene – ORF30, an S4 domain protein – suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in the close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
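How an identity figure for a flanking inverted repeat can be obtained is sketched below in Python. This is only an illustration under stated assumptions: the arms are compared ungapped and at equal length, the function names are hypothetical, and the 20-bp sequences are invented for the example rather than taken from the fosmid. The second arm is reverse-complemented before comparison, which matches the descending coordinates given for it in the text.

```python
# Illustrative sketch: percent identity between the two arms of an inverted
# repeat. All names and sequences here are hypothetical, not from the paper.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def inverted_repeat_identity(left_arm: str, right_arm: str) -> float:
    """Percent identity between left_arm and the reverse complement of right_arm.

    Assumes an ungapped comparison of two arms of equal, non-zero length.
    """
    if not left_arm or len(left_arm) != len(right_arm):
        raise ValueError("arms must be non-empty and of equal length")
    rc = reverse_complement(right_arm).upper()
    matches = sum(a == b for a, b in zip(left_arm.upper(), rc))
    return 100.0 * matches / len(left_arm)


# Invented 20-bp left arm; the right arm starts as its perfect reverse complement.
left = "ACGTTGCAAGGCTTAACCGG"
perfect_right = reverse_complement(left)

# Introduce four substitutions into the right arm (positions chosen arbitrarily).
mutated_right = "".join(
    ("A" if base != "A" else "C") if i in {0, 5, 10, 15} else base
    for i, base in enumerate(perfect_right)
)

identity_perfect = inverted_repeat_identity(left, perfect_right)
identity_mutated = inverted_repeat_identity(left, mutated_right)
```

With 20-bp arms, four mismatches correspond to 16 of 20 matching positions. A real analysis would more likely use a local alignment that allows gaps.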

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5% compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6% compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.


cluster with γ-proteobacteria and might therefore represent recent within-γ-proteobacteria transfers. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2% compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content compared with the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8% compared with 56.9% for the complete fosmid clone.
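The G + C screen described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the sequences are toy strings, and the threshold is assumed to mean a difference of 10 percentage points from the G + C content of the whole fosmid (a relative 10% difference would need a different test).

```python
# Illustrative sketch: flag CDSs whose G + C content deviates from the fosmid
# average. Names, sequences and the percentage-point reading of the threshold
# are assumptions made for this example.

def gc_percent(seq: str) -> float:
    """G + C content in percent, counting only unambiguous A/C/G/T bases."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def flag_gc_outliers(fosmid_seq: str, cds_seqs: dict, threshold: float = 10.0) -> dict:
    """Return {cds_name: difference} for CDSs at least `threshold` points away
    from the fosmid-wide G + C content (positive means higher G + C)."""
    fosmid_gc = gc_percent(fosmid_seq)
    flagged = {}
    for name, seq in cds_seqs.items():
        difference = gc_percent(seq) - fosmid_gc
        if abs(difference) >= threshold:
            flagged[name] = round(difference, 1)
    return flagged


# Toy data: a 50% G + C "fosmid" and three invented CDSs.
toy_fosmid = "ACGT" * 25
toy_cdss = {
    "orf_high": "GCGCGCAT",  # 75% G + C
    "orf_same": "ACGTACGT",  # 50% G + C
    "orf_low": "ATATATGC",   # 25% G + C
}
outliers = flag_gc_outliers(toy_fosmid, toy_cdss)
```

A screen of this kind only detects recent transfers from donors with a different base composition, which is one reason it flags far fewer CDSs than the phylogenetic analyses.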

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs – ORF15 and ORF16 – also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene – the exonuclease domain – and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own


genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8× coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3× coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17 estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Tree label: PolC + DinG fusion proteins (same domain structure as b1bcf11f04 ORF17).


alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
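The CDS selection rules described above (discard CDSs shorter than 100 bp; of two overlapping CDSs, keep the one with significant GenBank homologues) can be sketched as follows. This is a simplified illustration under stated assumptions: the `CDS` record and its field names are hypothetical, any shared base counts as an overlap, and when both or neither overlapping CDS has a match the sketch keeps both, because the text does not say how such ties were resolved.

```python
# Illustrative sketch of the CDS filtering rules; data model and tie handling
# are assumptions, not taken from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    name: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    has_genbank_match: bool  # True if a significant GenBank homologue exists

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two CDSs share at least one base."""
    return a.start <= b.end and b.start <= a.end


def filter_cdss(cdss, min_length: int = 100):
    """Drop short CDSs, then resolve overlaps in favour of CDSs with a match."""
    kept = []
    long_enough = (c for c in cdss if c.length >= min_length)
    for cds in sorted(long_enough, key=lambda c: c.start):
        if kept and overlaps(kept[-1], cds):
            previous = kept[-1]
            if cds.has_genbank_match and not previous.has_genbank_match:
                kept[-1] = cds    # the new CDS replaces the unmatched one
            elif cds.has_genbank_match == previous.has_genbank_match:
                kept.append(cds)  # tie: rule not stated in the text, keep both
            # otherwise the previous CDS has the match and the new one is dropped
        else:
            kept.append(cds)
    return kept


# Invented predictions for illustration.
predictions = [
    CDS("short", 1, 90, True),
    CDS("a", 100, 400, False),
    CDS("b", 350, 700, True),
    CDS("c", 800, 1200, False),
]
selected = [c.name for c in filter_cdss(predictions)]
```

The follow-up steps in the text (ORF-finder and BLASTX checks for regions without matches) are manual curation steps and are not modelled here.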

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with another major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
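The decision rule for calling a CDS laterally transferred can be written down compactly. The Python sketch below is one reading of that rule, not the authors' code: the record type, field names and group labels are hypothetical, and the tree building itself (Neighbour-joining, minimum evolution, PMBML) is assumed to have been done elsewhere, so that only the resulting grouping and its bootstrap support are passed in.

```python
# Illustrative sketch of the LGT classification rule; names and labels are
# assumptions made for this example.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CdsTreeResult:
    cds: str
    clusters_with: Optional[str]  # major group the CDS groups with; None if unresolved
    bootstrap: float              # bootstrap support (%) for that grouping


def classify_cds(result: CdsTreeResult, rrna_group: str,
                 min_bootstrap: float = 50.0) -> str:
    """Return 'unsupported', 'agrees' or 'LGT' for a single CDS.

    A CDS counts as LGT only if a grouping with a major group other than the
    rRNA phylotype has bootstrap support strictly above `min_bootstrap`.
    """
    if result.clusters_with is None or result.bootstrap <= min_bootstrap:
        return "unsupported"
    if result.clusters_with == rrna_group:
        return "agrees"
    return "LGT"


def summarize(results, rrna_group: str) -> dict:
    """Count CDSs per category for one fosmid."""
    counts = {"unsupported": 0, "agrees": 0, "LGT": 0}
    for result in results:
        counts[classify_cds(result, rrna_group)] += 1
    return counts


# Invented results for a fosmid whose rRNA phylotype is gamma-proteobacterial.
toy_results = [
    CdsTreeResult("orf_a", "gamma-proteobacteria", 88.0),
    CdsTreeResult("orf_b", "beta-proteobacteria", 72.0),
    CdsTreeResult("orf_c", "delta-proteobacteria", 50.0),  # not above 50%
    CdsTreeResult("orf_d", None, 0.0),
]
toy_summary = summarize(toy_results, "gamma-proteobacteria")
```

A CDS that falls in a supported clade of mixed taxonomic composition could be given a label such as "mixed" for `clusters_with`; it would then be counted as LGT, in line with the text, although donor and recipient cannot be identified in that case.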

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp < 10 e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


Table columns: Fosmid; Phylogeny (23S rRNA, 16S rRNA, CDSs); LGT [a] (Phylogenetic trees, Phylogenetic distribution or genome context, Total); ORFans [b].

Fosmid: b1bcf11d04
Phylogeny, 23S rRNA: δ-Proteobacterium
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 12 of 18 CDSs (67%) with supported phylogenetic topologies agree with a δ-proteobacterial origin of the fragment.
LGT, phylogenetic trees: Six CDSs are suggested by phylogenetic analyses to have been acquired by LGT. One of these transferred genes – the fusA homologue (ORF19) – is also found in b1bcf5c12. This CDS has been transferred to other δ-proteobacteria as well.
LGT, phylogenetic distribution or genome context: Three CDSs (ORF3–5) that encode an integrase and two transposases precede four of the LGT genes detected in the phylogenetic analysis. ORF7 also likely transferred with ORF3–ORF10. ORF20 and ORF21 have mainly homologues in Firmicutes and are the neighbours of ORF19, which has also been acquired from Firmicutes.
LGT, total: 12 CDSs (31% of total) have been acquired by LGT. Interestingly, this fosmid clone provides the transfer vector – the integrase and transposase – for 8 of the transferred genes.
ORFans: –

Fosmid: b1bcf13c08
Phylogeny, rRNA: ε-Proteobacterium most closely related to Campylobacter jejuni
Phylogeny, CDSs: 21 CDSs give supported phylogenies and of these 19 (90%) agree with rRNA.
LGT, phylogenetic trees: ORF4 clusters with Geobacter and Clostridium, and ORF23 does not have homologues in ε-proteobacteria and clusters with γ- and β-proteobacteria.
LGT, phylogenetic distribution or genome context: ORF24 does not give a supported tree but has also probably been transferred from γ- or β-proteobacteria.
LGT, total: 3 CDSs (7% of total) have been acquired by LGT.
ORFans: 10 (3)

Fosmid: b3cf12d07
Phylogeny, 23S rRNA: γ-Proteobacterium
Phylogeny, 16S rRNA: Clusters within the γ-proteobacteria in LogDet distance trees but at the base of γ-proteobacteria and β-proteobacteria in the best maximum likelihood tree.
Phylogeny, CDSs: Only 7 CDSs give supported phylogenies. Of these 4 (57%) agree with rRNA.
LGT, phylogenetic trees: ORF7 clusters within β-proteobacteria. ORF15 has a patchy distribution and does not cluster with other proteobacteria in the phylogenetic tree.
LGT, phylogenetic distribution or genome context: Several additional CDSs (ORF16–ORF25) that did not produce well-resolved trees, had only divergent homologues in GenBank or no significant homologues, may also have been acquired by LGT. In support of this, ORF26 encodes a transposase.
LGT, total: 2 CDSs (8% of total) have been acquired by LGT. ORF16–ORF25 were not included in the estimate due to limited evidence for the transfer of these.
ORFans: 23 (23)


Table 1. cont.

Columns: Fosmid; Phylogeny (23S rRNA; 16S rRNA; CDSs); LGT (a) (Phylogenetic trees; Phylogenetic distribution or genome context; Total); ORFans (b)

Fosmid: b1bcf1c04
Phylogeny, 23S rRNA: γ-Proteobacterium
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 14 CDSs give supported phylogenies and of these 13 (93%) agree with rRNA
LGT, phylogenetic trees: Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 cluster within β-proteobacteria
LGT, phylogenetic distribution or genome context: Three genes that show uncongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34 have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria, and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred
LGT, total: 5 CDSs (13% of total) have been acquired by LGT
ORFans: 3 (1%)

Fosmid: b1bf11a01
Phylogeny, 23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA)
Phylogeny, 16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA)
Phylogeny, CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes
LGT, phylogenetic trees: One, ORF30 (RsuA), cluster with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both fosmid and Thiobacillus denitrificans
LGT, phylogenetic distribution or genome context: ORF29 has no significant homologues in proteobacteria
LGT, total: 4 CDSs (8% of total) have been acquired by LGT
ORFans: 3 (2%)

Fosmid: b1bf110d03
Phylogeny, 23S rRNA: –
Phylogeny, 16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii
Phylogeny, CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with 16S fragment
LGT, phylogenetic trees: ORF5 and ORF10 have no close homologues in other Bacteroides, and phylogenetic analysis suggests frequent transfer
LGT, phylogenetic distribution or genome context: ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs likely acquired from relative of Bacteroides thetaiotaomicron
LGT, total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon not included as it has been transferred within the Bacteroides
ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed were counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). Number of CDSs acquired by LGT is shown in bold.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parenthesis is given the proportion of protein coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possesses it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophiles (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly related to the OmpA found in this gene cluster and was not included in the alignment.

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST, ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from, highly polluted marine sediments, these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bcf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30, an S4 domain protein, suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
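As an illustration of how an inverted repeat such as the one flanking these genes can be scored, the following Python sketch compares one arm with the reverse complement of the other and reports percent identity. The function names and the toy sequences are hypothetical and are not taken from the fosmid; the second coordinate pair above is given in descending order, which is read here as meaning that this arm lies on the opposite strand.

```python
# Hypothetical helper names; toy sequences, not data from the fosmid.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def inverted_repeat_identity(left_arm: str, right_arm: str) -> float:
    """Percent identity between left_arm and the reverse complement of right_arm.

    Both arms are written 5'->3' on the top strand and must have the same
    length (ungapped comparison only, which is an assumption of this sketch).
    """
    if not left_arm or len(left_arm) != len(right_arm):
        raise ValueError("arms must be non-empty and of equal length")
    partner = reverse_complement(right_arm).upper()
    matches = sum(a == b for a, b in zip(left_arm.upper(), partner))
    return 100.0 * matches / len(left_arm)


# Toy 20-nt arm; the partner arm carries 4 substitutions.
left = "ACGTTGCAAGGCTTAGCCTA"
mutated = list(left)
for i in (0, 8, 14, 19):      # each of these positions is "A" in `left`
    mutated[i] = "T"
right = reverse_complement("".join(mutated))

print(inverted_repeat_identity(left, right))  # 80.0
```

A 20-nt arm with four mismatches gives 80% identity, the same order as the repeat reported above; a gapped alignment would be needed if the arms differed in length.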

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7%, compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent transfers within the γ-proteobacteria. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content than the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.
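The G + C screen described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the 10% cut-off is assumed here to mean a difference of 10 percentage points from the fosmid average (the text does not say whether the deviation is absolute or relative). The example values are the ones quoted above for clone b1dcf13c08, read as percentages (22.2% for ORF40 and 25.7% for ORF9, against 34.7% for the complete clone).

```python
def gc_percent(seq: str) -> float:
    """G + C content of a nucleotide sequence, in percent of A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def flag_gc_outliers(cds_gc: dict[str, float], fosmid_gc: float,
                     threshold: float = 10.0) -> list[str]:
    """Names of CDSs whose G + C content differs from the fosmid average
    by more than `threshold` percentage points (assumed interpretation)."""
    return [name for name, gc in cds_gc.items()
            if abs(gc - fosmid_gc) > threshold]


# Values quoted in the text for clone b1dcf13c08 (fosmid average 34.7%).
cds_values = {"ORF40": 22.2, "ORF9": 25.7}
print(flag_gc_outliers(cds_values, fosmid_gc=34.7))  # ['ORF40']
```

Under this reading ORF40 is flagged (12.5 points below the clone average) while ORF9 is not (9.0 points), which matches its description as only marginally lower.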

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1, we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a function similar to that of the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA−5S rRNA for the present study. The vector for cloning in the direction 23S rRNA−16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After the vector had been cleaned from the gel, it was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that the DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that have no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17 estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Label in the tree: PolC + DinG fusion proteins (same domain structure as b1bcf11f04 ORF17).
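The CDS selection rules described above (discard predictions shorter than 100 bp; of two overlapping predictions, keep the one with significant GenBank homologues) can be sketched as follows. This is a simplified illustration and not the script used in the study: the `CDS` record, the `filter_cds` function and the example coordinates are hypothetical, any shared base is treated as an overlap, and when neither or both overlapping predictions have a GenBank hit the upstream one is kept, a tie-break the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    name: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    has_genbank_hit: bool   # significant homologue found in GenBank

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two predictions share at least one base."""
    return a.start <= b.end and b.start <= a.end


def filter_cds(predictions: list[CDS], min_length: int = 100) -> list[CDS]:
    """Drop short predictions, then resolve overlaps in favour of GenBank hits."""
    long_enough = [c for c in predictions if c.length >= min_length]
    kept: list[CDS] = []
    for cds in sorted(long_enough, key=lambda c: c.start):
        if kept and overlaps(kept[-1], cds):
            # Replace the previous choice only if the new one is better supported.
            if cds.has_genbank_hit and not kept[-1].has_genbank_hit:
                kept[-1] = cds
            continue
        kept.append(cds)
    return kept


# Hypothetical predictions on one clone.
calls = [
    CDS("A", 1, 90, True),        # 90 bp: removed as too short
    CDS("B", 100, 400, False),    # overlaps C and has no GenBank hit
    CDS("C", 350, 700, True),     # kept in preference to B
    CDS("D", 800, 1000, False),   # no overlap: kept
]
print([c.name for c in filter_cds(calls)])  # ['C', 'D']
```

The manual follow-up described in the text (ORF-finder and BLASTX checks of regions without matches) is not modelled here.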

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01, we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a major bacterial group other than that of the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
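The decision rule for calling a CDS laterally transferred can be summarized in a short Python sketch. It is a simplification of the procedure described above, written under stated assumptions: the function name, the group labels and the returned strings are hypothetical, the tree-building steps are not modelled, and the input is assumed to be the set of major bacterial groups found in the clade containing the fosmid CDS together with the bootstrap support (in percent) for that clade.

```python
def classify_cds(rrna_group: str, clade_groups: set[str],
                 bootstrap: float, min_support: float = 50.0) -> str:
    """Classify one CDS from its position in a protein tree.

    rrna_group   -- major bacterial group assigned from the rRNA phylotype
    clade_groups -- major groups represented in the clade holding the CDS
    bootstrap    -- bootstrap support (%) for that clade
    """
    if not clade_groups:
        raise ValueError("clade_groups must not be empty")
    # Same major group as the rRNA: within-group transfers are not counted.
    if clade_groups == {rrna_group}:
        return "agrees with rRNA"
    # A conflicting grouping only counts if bootstrap support exceeds 50%.
    if bootstrap <= min_support:
        return "unsupported"
    # Mixed clade: transfer is inferred, but donor and recipient are unclear.
    if len(clade_groups) > 1:
        return "LGT, donor unknown"
    return "LGT"


print(classify_cds("gamma-proteobacteria", {"beta-proteobacteria"}, 85.0))
# LGT
print(classify_cds("gamma-proteobacteria",
                   {"beta-proteobacteria", "Firmicutes"}, 70.0))
# LGT, donor unknown
```

CDSs that fall in the "unsupported" class correspond to cases such as those listed in Table 1 where transfer was suspected from phylogenetic distribution or genome context rather than from a supported tree; that secondary evidence is outside the scope of this sketch.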

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits, exp < 10 e−10. Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


Table 1. cont.

Columns: Fosmid | Phylogeny (23S rRNA; 16S rRNA; CDSs) | LGT(a) (Phylogenetic trees; Phylogenetic distribution or genome context; Total) | ORFans(b)

Fosmid b1bcf1c04
Phylogeny, 23S rRNA: γ-Proteobacterium
Phylogeny, 16S rRNA: –
Phylogeny, CDSs: 14 CDSs give supported phylogenies and of these 13 (93%) agree with rRNA
LGT, phylogenetic trees: Phylogenetic analyses show that two CDSs have been acquired by LGT. ORF3 is found in a mixed clade, while ORF30 cluster within β-proteobacteria
LGT, phylogenetic distribution or genome context: Three genes that show uncongruent phylogenies, but with low bootstrap support, found close to ORF3 and ORF34, have probably also been acquired by LGT. ORF5 clusters with β-proteobacteria, ORF31 clusters with δ-proteobacteria and ORF32 (GST) clusters with a γ-proteobacterium but appears to have been frequently transferred
LGT, total: 5 CDSs (13% of total) have been acquired by LGT
ORFans: 3 (1%)

Fosmid b1bf11a01
Phylogeny, 23S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 23S rRNA)
Phylogeny, 16S rRNA: β-Proteobacterium closely related to Thiobacillus denitrificans (98% identity at 16S rRNA)
Phylogeny, CDSs: High degree of gene synteny compared with Thiobacillus denitrificans. 29 CDSs have best BLAST match in Thiobacillus denitrificans. 27 of 28 CDSs (96%) that give statistically supported phylogenies agree with rRNA genes
LGT, phylogenetic trees: One, ORF30 (RsuA), cluster with γ-proteobacteria and has no homologue in Thiobacillus denitrificans. Two CDSs (ORF14 and ORF31) have been transferred to both fosmid and Thiobacillus denitrificans
LGT, phylogenetic distribution or genome context: ORF29 has no significant homologues in proteobacteria
LGT, total: 4 CDSs (8% of total) have been acquired by LGT
ORFans: 3 (2%)

Fosmid b1bf110d03
Phylogeny, 23S rRNA: –
Phylogeny, 16S rRNA: A Flavobacteriaceae bacterium; among sequenced genomes most closely related to Cytophaga hutchinsonii
Phylogeny, CDSs: 16 of 18 (84%) CDSs with supported phylogenetic topologies agree with 16S fragment
LGT, phylogenetic trees: ORF5 and ORF10 have no close homologues in other Bacteroides and phylogenetic analysis suggests frequent transfer
LGT, phylogenetic distribution or genome context: ORF4 has no detectable homologues in other Bacteroides. A transposon with 8 CDSs likely acquired from relative of Bacteroides thetaiotaomicron
LGT, total: 3 CDSs (10% of total) have likely been acquired by LGT. The transposon not included as it has been transferred within the Bacteroides
ORFans: 10 (3%)

a. Only LGT events involving the CDS from the fosmid clone analysed was counted, and only when they were supported by phylogenetic analyses or clear phylogenetic distribution patterns (i.e. the gene is not present in its rRNA group but present in some other distinct bacterial group). Number of CDSs acquired by LGT is shown in bold.
b. ORFans were classified as CDSs with no significant match in GenBank. Matches to sequences in the environmental portion of GenBank were not considered. In parenthesis is given the proportion of protein coding DNA that has no match in GenBank.


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon – because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced – and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possesses it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophilic (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Another set of genes, identified in two of the fosmid clones, forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12 the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly related to the OmpA found in this gene cluster and was not included in the alignment.

Fig. 4. Maximum Likelihood phylogeny of fusA homologues estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from – highly polluted marine sediments – these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bcf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour – ORF34 – clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support, and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have a G + C content of 46.3% and 40.2%, respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent within-γ-proteobacteria transfers. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content compared with the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.
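The G + C screen described in this section can be sketched as follows. This is an illustrative sketch only: it assumes that the 10% criterion means a difference of at least 10 percentage points from the clone average (the reading most consistent with the values quoted in the text), and the function names `gc_percent` and `flag_atypical` are hypothetical, not taken from the original analysis.

```python
def gc_percent(seq: str) -> float:
    """Return the G + C content of a nucleotide sequence in percent.

    Only unambiguous A, C, G and T bases are counted in the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def flag_atypical(cds_gc: dict, clone_gc: float, threshold: float = 10.0) -> dict:
    """Return the CDSs whose G + C content differs from the clone average
    by at least `threshold` percentage points (assumed interpretation)."""
    return {
        name: gc
        for name, gc in cds_gc.items()
        if abs(gc - clone_gc) >= threshold
    }


if __name__ == "__main__":
    # Values quoted in the text for clone b1dcf13c08 (clone average 34.7%).
    cds_values = {"ORF40": 22.2, "ORF9": 25.7}
    print(flag_atypical(cds_values, clone_gc=34.7))
```

With the values quoted for b1dcf13c08, only ORF40 passes the threshold, while ORF9 falls just short of it, which matches its description as only marginally lower.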

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1; Table 1). These clones also contain several large CDSs with no significant matches in GenBank, or only partial matches to known proteins (Fig. 1; Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs – ORF15 and ORF16 – also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues, DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene – the exonuclease domain – and it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
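As a quick check of the adaptor design, the two sites it is meant to carry can be located by plain string search. The sketch below is an assumption-laden illustration: the I-CeuI recognition core used here (TAACTATAACGGTCCTAAGGTAGCGA) and the PmlI site (CACGTG, which gives a blunt cut) are taken from commonly listed enzyme recognition sequences rather than from this article, and `find_sites` is a hypothetical helper name.

```python
# Adaptor sequence as given in the text.
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"

# Assumed recognition sequences (not stated in the article itself).
I_CEUI_CORE = "TAACTATAACGGTCCTAAGGTAGCGA"
PMLI_SITE = "CACGTG"


def find_sites(seq: str, site: str) -> list:
    """Return the 0-based start positions of every occurrence of `site`
    in `seq`, allowing overlapping matches."""
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


if __name__ == "__main__":
    print("I-CeuI core at:", find_sites(ADAPTOR, I_CEUI_CORE))
    print("PmlI site at:", find_sites(ADAPTOR, PMLI_SITE))
```

Under these assumptions each site occurs exactly once in the adaptor, with the PmlI site directly downstream of the I-CeuI site, which is consistent with the sequential I-CeuI and PmlI digestion of the vector described above.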

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Clade label in the figure: PolC + DinG fusion proteins (same domain structure as b1bcf11f04 ORF17).

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to > 8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region we were not able to obtain high-quality sequence: position 1192–1342 in b1dcf13c08. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that had no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS that did have a match in GenBank was obtained using ORF-finder, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
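The CDS selection rules in this paragraph (a minimum length of 100 bp, and preference for the overlapping CDS that has GenBank homologues) can be sketched as follows. This is a minimal illustration, not the authors' script: the `CDS` record, the `select_cdss` function and the decision to treat any coordinate overlap as an overlap are assumptions made here, and pairs in which both or neither CDS has a homologue are left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    """Hypothetical record for a predicted coding sequence."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    has_genbank_homologue: bool

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two coordinate ranges share at least one base (assumed definition)."""
    return a.start <= b.end and b.start <= a.end


def select_cdss(candidates, min_length=100):
    """Apply the two filtering rules and return the surviving CDSs in input order."""
    # Rule 1: eliminate CDSs shorter than the minimum length.
    kept = [c for c in candidates if c.length >= min_length]
    # Rule 2: for an overlapping pair in which exactly one CDS has a
    # GenBank homologue, reject the one without a homologue.
    rejected = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if overlaps(a, b) and a.has_genbank_homologue != b.has_genbank_homologue:
                rejected.add(b.name if a.has_genbank_homologue else a.name)
    return [c for c in kept if c.name not in rejected]
```

The subsequent ORF-finder and BLASTX checks described in the text are manual curation steps and are not modelled here.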

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a major bacterial group other than that of the rRNA, a minimum evolution tree (with bootstrap analysis; 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a phylogenetic grouping different from that observed for the rRNA (with bootstrap support > 50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria, or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
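The decision rule used to call a CDS laterally transferred can be summarised in a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `classify_cds`, the string labels for major bacterial groups and the returned labels are hypothetical, and the tree building itself (Neighbour-joining, minimum evolution, PMBML) is assumed to have been done elsewhere. A clade that mixes several major groups is treated as LGT with an unidentified donor, as described in the text.

```python
def classify_cds(rrna_group, clade_groups, bootstrap_support, threshold=50.0):
    """Classify a CDS from the major groups found in the clade it falls in.

    rrna_group        -- major bacterial group of the fosmid's rRNA phylotype
    clade_groups      -- major groups represented in the clade containing the CDS
    bootstrap_support -- bootstrap support (per cent) for that clade
    Returns (label, donor), where donor is None when it cannot be identified.
    """
    groups = set(clade_groups)
    # Same major group as the rRNA: within-group transfers are not counted.
    if groups == {rrna_group}:
        return ("native", None)
    # A conflicting grouping only counts if bootstrap support exceeds the threshold.
    if bootstrap_support <= threshold:
        return ("unsupported", None)
    # Single foreign group: a putative donor lineage can be named.
    if len(groups) == 1:
        return ("LGT", next(iter(groups)))
    # Mixed clade: classified as LGT, but donor and recipients are unknown.
    return ("LGT", None)
```

For example, a CDS on a candidate division WS3 clone that falls within the δ-proteobacteria with 80% bootstrap support would be returned as `("LGT", "delta-proteobacteria")` under these assumptions.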

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp < 10e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com


(ORF16), showing that this lineage has indeed acquired proteobacterial genes. This CDS might have been part of the α-proteobacterial island upon transfer.

In the Flavobacteriaceae fosmid b1bf11d10, a large self-transmitting conjugative transposon was identified (Fig. 1). This transposon is inserted next to a tRNA and is similar in sequence and structure to the transposons found in Bacteroides thetaiotaomicron (Xu et al., 2003), Bacteroides fragilis (Kuwahara et al., 2004) and Porphyromonas gingivalis (Nelson et al., 2003). In the phylogenetic tree of the transposase gene (ORF21), the CDS from the fosmid falls into a cluster containing numerous B. thetaiotaomicron sequences, separated from the single Cytophaga hutchinsonii homologue detected among the 100 best BLAST hits. For the other CDSs that are clearly part of this transposon (ORF22–ORF27), we found no significant homologues in C. hutchinsonii, and the best (and in most cases the only) match was always to B. thetaiotaomicron and P. gingivalis genes, suggesting that this transposon has been acquired from the Bacteroidales lineage. It is likely that we have captured only part of this transposon, because many of the CDSs found in the transposons in B. thetaiotaomicron are not present in the fragment we have sequenced, and that the 3′ CDSs in this fosmid clone (ORF28–ORF30) were also transferred along with this transposon. Additional CDSs (possibly not involved in transposon function) were also present in the B. thetaiotaomicron transposons (Xu et al., 2003). We note that the acquisition of this transposon was not included in our LGT estimate, as it originated from the same major bacterial group as the fosmid clone.

Interestingly, one gene was found to have been transferred to two of the fosmids: the fusA paralogue in b1bcf11d04 and b1dcf51c12 (Figs 1 and 4). This protein appears to be a distant paralogue of fusA, and it has a very patchy phylogenetic distribution, suggesting that it originated in one of the lineages that possess it and has then been transferred to the other lineages. One characteristic common to the organisms encoding this protein is that they are all anaerobes or microaerophiles (Symbiobacterium thermophilum), and they are all found in environments similar to the one sampled here. Transferred genes are likely to give a selective advantage in the environment where the organisms harbouring them live, and an ecological function for this fusA paralogue should be sought.

Fig. 4. Maximum Likelihood phylogeny of fusA homologues, estimated using PMBML (661 positions in alignment). The sequences were obtained by blasting the b1bcf11d04 ORF19 and b1dcf51c12 ORF15 sequences against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Aquifex aeolicus. Results from bootstrap analyses are indicated as in Fig. 3.

Another set of genes identified in two of the fosmid clones forms a cluster encoding outer membrane proteins and proteins involved in biopolymer transport (OmpA, TolB, TonB, ExbD, TolQ). This cluster is found in both the candidate division WS3 clone b1dcf51c12 and the δ-proteobacterial clone b1bcf11h03 (Fig. 1). In this case, the gene cluster appears to have been transferred from a δ-proteobacterium to b1dcf51c12, while it might be native to b1bcf11h03 (Fig. 5). This gene cluster also appears to have been transferred to Chlorobium tepidum, as both b1dcf51c12 and C. tepidum cluster within the δ-proteobacteria for all these genes except TonB (from which we could not make a reliable alignment). Robust phylogenies were only obtained from OmpA and TolB. However, the conserved gene order in b1dcf51c12, C. tepidum, b1bcf11h03 and other δ-proteobacteria such as Geobacter suggests that this entire 4-kb fragment was transferred from a δ-proteobacterium to C. tepidum and b1dcf51c12, probably as two separate events. Moreover, for b1dcf51c12, the fusA paralogue discussed above may have been transferred as part of this gene cluster, as they are found close together in this clone. The second δ-proteobacterial fosmid clone, b1bcf11d04, also contains an OmpA homologue. However, this CDS is distantly related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from, highly polluted marine sediments, these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30, an S4 domain protein, suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).

Few laterally transferred CDSs identified by G + C content

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues, estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have G + C contents of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent within-γ-proteobacteria transfers. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content than the rest of the fosmid clone, with 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.
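The G + C screen described in this section amounts to comparing the G + C percentage of each CDS with that of its fosmid. A minimal sketch follows; the function names are hypothetical, and reading the threshold as an absolute difference of 10 percentage points (rather than a relative difference) is an assumption, since the text does not say which was used.

```python
def gc_percent(seq: str) -> float:
    """G + C content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def is_gc_outlier(cds_gc: float, fosmid_gc: float, threshold: float = 10.0) -> bool:
    """True if the CDS differs from the fosmid average by at least `threshold`
    percentage points in either direction (assumed interpretation)."""
    return abs(cds_gc - fosmid_gc) >= threshold
```

A CDS flagged this way is only a candidate for recent transfer; as the section notes, phylogenetic support was still required before a CDS was counted in the LGT estimate.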

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a function similar to that of the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and that it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA–5S rRNA for our present study. The vector for cloning in the direction 23S rRNA–16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After gel purification, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
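As a quick sanity check on the adaptor design, one can locate the blunt-end cloning site within the adaptor sequence given above. The sketch below is illustrative only: `find_sites` is a hypothetical helper, and it relies on the general fact that PmlI recognises the palindromic hexamer CACGTG. It does not attempt to verify the I-CeuI recognition sequence, which is assumed to lie in the bases upstream of the PmlI site.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Adaptor sequence as given in the text.
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"
# PmlI recognition sequence (a blunt cutter).
PMLI_SITE = "CACGTG"


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(seq: str, site: str):
    """Return 0-based start positions where `site` occurs on either strand,
    reported in forward-strand coordinates."""
    seq, site = seq.upper(), site.upper()
    patterns = {site, reverse_complement(site)}
    width = len(site)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] in patterns]
```

Calling `find_sites(ADAPTOR, PMLI_SITE)` reports a single site at the 3′ end of the adaptor, consistent with the vector being opened there to create the blunt end.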

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. Clade label in figure: PolC + DinG fusion proteins (same domain structure as b1bcf11f04 ORF17).

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8 times coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3 times coverage). For one low-quality region, position 1192–1342 in b1dcf13c08, we were not able to obtain high-quality sequence. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where the identified CDSs had no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by BLASTX searches. If an alternative CDS that did have a match in GenBank was obtained using ORF-finder, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
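The CDS selection rules in this paragraph (drop CDSs shorter than 100 bp; of two overlapping CDSs, prefer the one with GenBank homologues) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `CDS` record, the coordinates and the ORF names are hypothetical, and the sketch assumes that when both or neither of two overlapping CDSs have a match, both are kept, a case the text does not specify.

```python
from dataclasses import dataclass

MIN_CDS_LENGTH_BP = 100  # threshold stated in the text


@dataclass(frozen=True)
class CDS:
    name: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    has_genbank_match: bool  # outcome of a homology search, supplied by the caller

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two coordinate ranges share at least one base."""
    return a.start <= b.end and b.start <= a.end


def select_cds(predictions: list[CDS]) -> list[CDS]:
    # Step 1: eliminate CDSs shorter than 100 bp.
    kept = [c for c in predictions if c.length >= MIN_CDS_LENGTH_BP]
    # Step 2: in an overlapping pair, reject the member without a GenBank match.
    rejected = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if overlaps(a, b) and a.has_genbank_match != b.has_genbank_match:
                rejected.add(b.name if a.has_genbank_match else a.name)
    return [c for c in kept if c.name not in rejected]


# Hypothetical predictions for a single clone.
example_predictions = [
    CDS("orfA", 1, 90, False),       # too short, removed in step 1
    CDS("orfB", 100, 700, True),
    CDS("orfC", 650, 1000, False),   # overlaps orfB and has no match
    CDS("orfD", 1200, 1500, False),  # no overlap, kept despite having no match
]
selected = select_cds(example_predictions)
print([c.name for c in selected])
```

Regions left without a matching CDS would then be the ones re-examined with ORF-finder and BLASTX, as described above.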

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01, we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps.

Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a major bacterial group other than that of the rRNA, a minimum evolution tree (with bootstrap analysis; 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees, the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
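The decision rule for calling a CDS laterally transferred can be summarised in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' code: the group labels are plain strings at the level of major bacterial groups, the function name and the returned labels are hypothetical, and the bootstrap values in the example are invented for illustration (the text only says that ORF34 of b1bcf11c4 clusters robustly within the β-proteobacteria and that ORF35 clusters with δ-proteobacteria without bootstrap support).

```python
from typing import Optional

BOOTSTRAP_THRESHOLD = 50.0  # per cent; the rule in the text is support > 50


def classify_cds(rrna_group: str, cds_group: Optional[str], bootstrap: float) -> str:
    """Classify one CDS from the group it clusters with in its protein tree.

    rrna_group: major bacterial group of the fosmid's rRNA phylotype.
    cds_group:  major group the CDS clusters with, or None if no tree was built.
    bootstrap:  bootstrap support (per cent) for that clustering.
    """
    if cds_group is None:
        return "no phylogeny"
    if cds_group == rrna_group:
        # Within-group transfers are not counted as LGT.
        return "consistent with rRNA"
    if bootstrap > BOOTSTRAP_THRESHOLD:
        return "LGT"
    return "unresolved"


# Gamma-proteobacterial fosmid; the bootstrap values below are hypothetical.
orf34 = classify_cds("gamma-proteobacteria", "beta-proteobacteria", 90.0)
orf35 = classify_cds("gamma-proteobacteria", "delta-proteobacteria", 40.0)
print(orf34, orf35)
```

Cases in which the CDS falls within a clade of mixed composition are handled here only through the label passed in as `cds_group`; identifying donor and recipient is outside the scope of the sketch, as it is in the text.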

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).
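For readers who want to retrieve the individual records, the fosmid accession range can be expanded into a list. The helper below is a hypothetical convenience function, not part of any GenBank tool; it assumes that both ends of the range share the same letter prefix and digit width.

```python
import re


def expand_accession_range(first: str, last: str) -> list[str]:
    """Expand an inclusive accession range such as AJ937760 to AJ937771."""
    m_first = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m_last = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m_first or not m_last:
        raise ValueError("accessions must be letters followed by digits")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same prefix")
    if len(m_first.group(2)) != len(m_last.group(2)):
        raise ValueError("accessions must have the same number of digits")
    prefix, width = m_first.group(1), len(m_first.group(2))
    low, high = int(m_first.group(2)), int(m_last.group(2))
    if high < low:
        raise ValueError("range is reversed")
    return [f"{prefix}{n:0{width}d}" for n in range(low, high + 1)]


fosmid_accessions = expand_accession_range("AJ937760", "AJ937771")
print(len(fosmid_accessions), fosmid_accessions[0], fosmid_accessions[-1])
```

The range contains 12 accessions, which matches the 12 sequenced fosmid clones.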

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.

Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp. <10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits with exp. <10 e−10. Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com

related to the OmpA found in this gene cluster and was not included in the alignment.

We also identified some mobile genes that might be involved in biodegradation of pollutants by searching the Pfam database. In one of the γ-proteobacterial fosmids, b1bcf11c4, we identified a glutathione-S-transferase (GST; ORF36) gene that was flanked by an acetyltransferase gene (ORF35) and a transporter (ORF34). Eukaryotic GSTs are important in detoxifying metabolism. Well-characterized bacterial GSTs (such as dichloromethane dehalogenase and 1,2-dichloroepoxyethane epoxidase), on the other hand, are catabolic enzymes that play an essential role in growth on various difficult-to-degrade chemicals (Vuilleumier and Pagni, 2002). Considering the environment the fosmid originated from (highly polluted marine sediments), these CDSs would be good candidates for genes involved in biodegradation of a xenobiotic compound. The b1bf11c4 GST gene clusters with a γ-proteobacterium (Acinetobacter sp. ADP1, Accession no. YP_046221). However, as observed by Vuilleumier and Pagni (2002), the phylogeny suggests that this gene has been frequently transferred. In support of this CDS having been acquired by LGT, its neighbour ORF34 clusters robustly within the β-proteobacteria, while ORF35 clusters with δ-proteobacteria (although with no bootstrap support).

Another gene that might be involved in biodegradation of pollutants was identified among the CDSs that have been transferred into the β-proteobacterial fosmid b1bf11a01: ORF31, which encodes a dienelactone hydrolase. Dienelactone hydrolases play a crucial role in chlorocatechol degradation via the modified ortho cleavage pathway (Eulberg et al., 1998; Muller et al., 2004), suggesting that the bacterium from which this fragment originated might use chloroaromatic compounds as an energy source. However, it should be noted that this CDS is found in a cluster of CDSs from genome projects with no experimentally confirmed function. Again, this gene is flanked by other genes that also have been acquired by LGT. The phylogeny of the neighbouring gene, ORF30 (an S4 domain protein), suggests that it has been acquired from a γ-proteobacterium. The next gene upstream, ORF29, could not be used in phylogenetic analyses. However, this CDS has no match in its close relative T. denitrificans, and its best match was to a conserved membrane protein from Clostridium tetani (Table S11). Thus, it is likely that all these genes have been acquired by LGT. Notably, a short inverted repeat (80% identity) was found to flank these genes (34021–34040, 36693–36674).
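The identity of an inverted repeat such as the one reported here can be computed by comparing one arm with the reverse complement of the other. The sketch below assumes that a coordinate pair written in descending order (as in 36693–36674) denotes an arm read on the reverse strand; the toy sequence and its coordinates are hypothetical and were constructed to give two 10-bp arms with two mismatches.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_arm(seq: str, start: int, end: int) -> str:
    """Extract a 1-based inclusive range; start > end means the reverse strand."""
    if start <= end:
        return seq[start - 1:end]
    return reverse_complement(seq[end - 1:start])


def percent_identity(a: str, b: str) -> float:
    """Ungapped identity between two arms of equal length, in per cent."""
    if len(a) != len(b):
        raise ValueError("arms must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical toy sequence: flank, left arm, spacer, right arm, flank.
toy = "TTT" + "AAGCTTGACC" + "CCCCC" + "GATCAACCTT" + "TTT"
left_arm = extract_arm(toy, 4, 13)     # forward strand
right_arm = extract_arm(toy, 28, 19)   # reverse strand
identity = percent_identity(left_arm, right_arm)
print(left_arm, right_arm, identity)
```

In the fosmid itself both arms are 20 bp long, so 80% identity corresponds to 16 matching positions.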

Fig. 5. Maximum Likelihood phylogeny of OmpA homologues estimated using PMBML (135 positions in alignment). The sequences were obtained by blasting the b1dcf51c12 ORF7 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. We also removed three sequences from Chlamydiaceae, as these sequences formed a long unstable branch in the tree, as well as some sequences that were considerably shorter than the remaining alignment. The tree was arbitrarily rooted by Agrobacterium tumefaciens. Results from bootstrap analyses are indicated as in Fig. 3.

Few laterally transferred CDSs identified by G + C content

Differences in G + C content are commonly used as an indication of recent LGT (Lawrence and Ochman, 1997). We identified only eight CDSs that showed a G + C content 10% higher or lower than the average for the respective fosmid clone (see Tables S1–12). ORF20 in the δ-proteobacterial clone b1bcf11h3 has a G + C content of 47.5%, compared with 36.6% for the complete fosmid. This CDS clusters with Desulfovibrio vulgaris within a mixed clade with no bootstrap support and was not included in the LGT estimate for this fosmid. A very short ORFan (ORF1) in the candidate division OP8 clone b3cf12f09 has a G + C content of 43.6%, compared with 59.4% for the fosmid clone. In addition, the transposase (ORF16) and its neighbouring ORFan (ORF17) in the same clone have G + C contents of 46.3% and 40.2% respectively. ORF11, ORF13 and ORF14 in the γ-proteobacterial clone b3cf12d07 all show higher G + C content than the rest of the fosmid, with 66.4%, 65.7% and 64.7% compared with 52.5% for the rest of the fosmid. All these CDSs cluster with γ-proteobacteria and might therefore represent recent within-γ-proteobacteria transfers. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content than the rest of the fosmid clone, at 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.
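The G + C screen in this section amounts to comparing each CDS with the average of its fosmid. The sketch below is illustrative only: it assumes the 10% cut-off is meant in percentage points (which is consistent with ORF9 being described as only marginally lower), and the function names are hypothetical. The example values are the percentages reported in this section for ORF20 of b1bcf11h3 and for ORF40 and ORF9 of b1dcf13c08.

```python
GC_THRESHOLD_POINTS = 10.0  # assumed to be percentage points, not a relative change


def gc_percent(seq: str) -> float:
    """G + C content of a DNA sequence, as a percentage of its A/C/G/T bases."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


def deviates(cds_gc: float, fosmid_gc: float,
             threshold: float = GC_THRESHOLD_POINTS) -> bool:
    """True if the CDS differs from the fosmid average by at least the threshold."""
    return abs(cds_gc - fosmid_gc) >= threshold


# (CDS G + C %, fosmid G + C %) as reported in the text.
reported = {
    "b1bcf11h3 ORF20": (47.5, 36.6),
    "b1dcf13c08 ORF40": (22.2, 34.7),
    "b1dcf13c08 ORF9": (25.7, 34.7),
}
flags = {name: deviates(cds, fosmid) for name, (cds, fosmid) in reported.items()}
print(flags)
```

Under this reading ORF20 and ORF40 exceed the cut-off, while ORF9 falls just short of it.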

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than what was observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, for b1dcf51c12, half of the clone is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than what we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain, as well as a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and that it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny the b1bcf11f4 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.


de la Torre JR Christianson LM Beja O Suzuki MTKarl DM Heidelberg J amp DeLong EF (2003) Proteor-hodopsin genes are distributed among divergent marinebacterial taxa Proc Natl Acad Sci USA 100 12830ndash12835

Delcher AL Harmon D Kasif S White O and SalzbergSL (1999) Improved microbial gene identification withGLIMMER Nucleic Acids Res 27 4636ndash4641

Dojka MA Hugenholtz P Haack SK and Pace NR(1998) Microbial diversity in a hydrocarbon- and chlori-nated-solvent-contaminated aquifer undergoing intrinsicbioremediation Appl Environ Microbiol 64 3869ndash3877

Eulberg D Kourbatova EM Golovleva LA and Schlo-mann M (1998) Evolutionary relationship between chloro-catechol catabolic enzymes from Rhodococcus opacus1CP and their counterparts in proteobacteria sequencedivergence and functional convergence J Bacteriol 1801082ndash1094

2026 C L Nesboslash Y Boucher M Dlutek and W F Doolittle

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd Environmental Microbiology 7 2011ndash2026

Felsenstein J (2001) PHYLIP Phylogeny Inference PackageSeattle USA Department of Genetics University of Wash-ington

Holoman TR Elberson MA Cutter LA May HD andSowers KR (1998) Characterization of a defined 2356-tetrachlorobiphenyl-ortho-dechlorinating microbial com-munity by comparative sequence analysis of genes codingfor 16S rRNA Appl Environ Microbiol 64 3359ndash3367

Hugenholtz P Pitulle C Hershberger KL and Pace NR(1998) Novel division level bacterial diversity in a Yellow-stone hot spring J Bacteriol 180 366ndash376

Jones DT Taylor WR and Thornton JM (1992) Therapid generation of mutation data matrices from proteinsequences Comput Appl Biosci 8 275ndash282

Kuwahara T Yamashita A Hirakawa H Nakayama HToh H Okada N et al (2004) Genomic analysis ofBacteroides fragilis reveals extensive DNA inversions reg-ulating cell surface adaptation Proc Natl Acad Sci USA101 14919ndash14924

Lawrence JG and Ochman H (1997) Amelioration of bac-terial genomes rates of change and exchange J Mol Evol44 383ndash397

Lowe TM and Eddy SR (1997) tRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequence Nucleic Acids Res 25 955ndash964

Marshall P and Lemieux C (1992) The I-CeuI endonu-clease recognizes a sequence of 19 base pairs and pref-erentially cleaves the coding strand of the Chlamydomonasmoewusii chloroplast large subunit rRNA gene NucleicAcids Res 20 6401ndash6407

Muller TA Byrde SM Werlen C van der Meer JR andKohler HP (2004) Genetic analysis of phenoxyalkanoicacid degradation in Sphingomonas herbicidovorans MHAppl Environ Microbiol 70 6066ndash6075

Nelson KE Fleischmann RD DeBoy RT Paulsen ITFouts DE Eisen JA et al (2003) Complete genomesequence of the oral pathogenic Bacterium porphyromo-nas gingivalis strain W83 J Bacteriol 185 5591ndash5601

Nesboslash CL and Doolittle WF (2003) Active self-splicinggroup I introns in the 23S rRNA genes of hyperthermophilicbacteria derived from introns in eukaryotic organellesPNAS 100 10806ndash10811

Riesenfeld CS Schloss PD and Handelsman J (2004)Metagenomics genomic analysis of microbial communi-ties Annu Rev Genet 38 525ndash552

Rondon MR August PR Bettermann AD Brady SFGrossman TH Liles MR et al (2000) Cloning the soilmetagenome a strategy for accessing the genetic andfunctional diversity of uncultured microorganisms ApplEnviron Microbiol 66 2541ndash2547

Sanchez LB Galperin MY and Muller M (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giar-

dia lamblia belongs to the newly recognized superfamily ofacyl-CoA synthetases (Nucleoside diphosphate-forming)J Biol Chem 275 5794ndash5803

Suzuki MT Preston CM Beja O de la Torre JRSteward GF and DeLong EF (2004) Phylogeneticscreening of ribosomal RNA gene-containing clones inbacterial artificial chromosome (BAC) libraries from dif-ferent depths in Monterey Bay Microb Ecol 48 473ndash488

Swofford DL (2001) PAUP Phylogenetic Analysis UsingParsimony (and Other Methods) Sunderland MA USASinauer Associates

Treusch AH Kletzin A Raddatz G Ochsenreiter TQuaiser A Meurer G et al (2004) Characterization oflarge-insert DNA libraries from soil for environmentalgenomic studies of Archaea Environ Microbiol 6 970ndash980

Veerassamy S Smith A and Tillier ER (2003) A transi-tion probability model for amino acid substitutions fromblocks J Comput Biol 10 997ndash1010

Vuilleumier S and Pagni M (2002) The elusive roles ofbacterial glutathione S-transferases new lessons fromgenomes Appl Microbiol Biotechnol 58 138ndash146

Xu J Bjursell MK Himrod J Deng S Carmichael LKChiang HC et al (2003) A genomic view of thehumanndashBacteroides thetaiotaomicron symbiosis Science299 2074ndash2076

Zhao KH Deng MG Zheng M Zhou M Parbel AStorf M et al (2000) Novel activity of a phycobiliproteinlyase both the attachment of phycocyanobilin and theisomerization to phycoviolobilin are catalyzed by the pro-teins PecE and PecF encoded by the phycoerythrocyaninoperon FEBS Lett 469 9ndash13

Supplementary material

The following supplementary material is available for thisarticle onlineFigure S1 A Number of BLAST hits with exp lt10 eminus10 todifferent taxonomic groupsB Distribution of G + C content of the sequencesC Distribution of the COG category of the BLAST hits explt10 eminus10Black bars refer to end-sequences and grey bars refer to thesequenced fosmid clonesTables S1ndash12 Annotation of b1dcf51a06 b1dcf13f01b3cf12f09 b1bcf11f04 b1dcf51c12 b1bcf11h03b1bcf11d04 b1dcf13c8 b3cf12d07 b1bcf11c04b1bf11a01 b1bf110d03

This material is available as part of the online article fromhttpwwwblackwell-synergycom


cluster with γ-proteobacteria and might therefore represent recent within-γ-proteobacteria transfers. ORF40 in the ε-proteobacterial clone b1dcf13c08, a short ORFan, has a G + C content of 22.2%, compared with 34.7% for the complete clone. In addition, ORF9, another ORFan in b1dcf13c08, has a marginally lower G + C content than the rest of the fosmid clone, at 25.7%. Similarly, ORF26 in the Chloroflexi clone b1dcf13f01 has a G + C content of 47.8%, compared with 56.9% for the complete fosmid clone.
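The G + C comparison described above can be illustrated with a minimal sketch. The function names `gc_content` and `gc_deviation` are hypothetical and are not taken from the paper; the sketch simply assumes that G + C content is computed over unambiguous bases only, and reports the difference between a CDS and its host clone in percentage points.

```python
def gc_content(seq: str) -> float:
    """Return the G + C percentage of a nucleotide sequence.

    Ambiguous bases (anything other than A, C, G, T) are ignored,
    which is an assumption of this sketch.
    """
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_deviation(orf_seq: str, clone_seq: str) -> float:
    """G + C of the ORF minus G + C of the whole clone, in percentage points."""
    return gc_content(orf_seq) - gc_content(clone_seq)
```

A strongly negative deviation, such as the roughly 12 percentage points separating ORF40 from the rest of b1dcf13c08, is the kind of compositional signal discussed in the text; the sketch does not attempt to say how large a deviation is significant.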

The first protein coding sequences from uncultivated lineages

Four of the fosmids that we sequenced were from uncultivated lineages. These fosmid clones represent, to our knowledge, the first protein coding sequences obtained from these major bacterial lineages. In agreement with their rRNA phylotype, most of the CDSs with homologues in GenBank are found as independent lineages in phylogenetic trees (Fig. 1, Table 1). These clones also contain several large CDSs with no significant matches in GenBank, or only partial matches to known proteins (Fig. 1, Table 1). A t-test showed that both the proportion of ORFans (P = 0.002) and the proportion of coding bases (P = 0.02) with no match in GenBank (excluding the environmental part of GenBank) were significantly higher than those observed in fosmid clones from lineages that have cultivated representatives.

The two candidate division WS3 clones, b1bcf11f04 and b1dcf51c12, contain several large CDSs for which we can make no clear functional prediction or that have no match in GenBank. For instance, half of b1dcf51c12 is occupied by two CDSs that have no significant matches in GenBank (ORF4) or only a single match (ORF5). Also, none of these CDSs had significant matches to domains in Pfam. These CDSs might represent lineage-specific proteins, and homologues may be identified when more sequences from this lineage are available. The candidate division OP8 clone also contains a number of ORFans; however, in this fosmid the predicted proteins tend to be smaller than those we observed for the two WS3 clones.

The b1dcf51a06 clone encodes a large ORFan (ORF1) as well as several smaller ORFans (ORF5, ORF7–9, ORF14) and CDSs with only single hits in GenBank (ORF6, ORF11–13) (Fig. 1). For ORF1 we can make some functional prediction based on Pfam searches. This protein contains a nucleoside diphosphate kinase domain, a fibronectin type III domain and a PBS lyase HEAT-like repeat (three repeat units). The PBS lyase repeat is responsible for specifically attaching particular phycobilins to apophycobiliprotein subunits in the phycobilisomes (PBS), which are light-harvesting macromolecular complexes of cyanobacteria and red algae (Zhao et al., 2000). The phycobilins are open-chain tetrapyrrole chromophores, which function as the photosynthetic light-harvesting pigments. Interestingly, two other CDSs, ORF15 and ORF16, also contain several PBS repeats. It is possible that the proteins encoded by the PBS-containing CDSs in b1dcf51a06 have a similar function to the PBS lyase proteins in cyanobacteria, and that this fosmid clone originated from a photosynthetic organism.

Among the CDSs that do have matches in GenBank are potential phylogenetic markers. The candidate division WS3 clone b1bcf11f04 contains two CDSs with similarity to DNA polymerase III subunit A homologues: DnaE and the Gram-positive type PolC. In phylogenetic trees of both genes, the b1bcf11f04 homologue forms a separate lineage (Fig. 6). Conserved domain searches at NCBI showed that the PolC-like CDS shows similarity to only part of this gene, the exonuclease domain, and that it is fused to DinG, which encodes Rad3-related DNA helicases. Proteins with similar domain architecture are found in several other bacterial genomes, mostly Firmicutes, as well as S. thermophilum and Chloroflexus aurantiacus, suggesting that the candidate division WS3 might be specifically related to one of these lineages. In phylogenetic trees of the DinG domain of these proteins, the fusion proteins are all found in the same clade (Fig. 6). However, the monophyly of this clade was not supported by bootstrap analyses. In the Maximum Likelihood phylogeny, the b1bcf11f04 CDS clusters at the bottom of the clade with C. aurantiacus. No non-fusion proteins are found in this clade, suggesting a single origin of this domain organization.

Summary

Metagenomic approaches play an increasing and highly visible role in microbial ecology. The data sets they generate are complex, and coupling the information they provide concerning the metabolic potential of an environment to the organismal lineages that may be present there remains a challenge. Here we have shown the utility of rRNA-targeted cloning and phylogenetic analysis of CDSs in making such a coupling. We also show that LGT, even when not precluding provisional assignment to lineages (taxonomy), will likely complicate the history of any lineage (phylogenetics), making phylotype-ecotype inferences provisional. Environmental metagenomic data open a window into a rich world of genetic interactions, some of which might be partially reconstructed, as we have described here. The bioinformatic challenges associated with a complete metagenomic assessment of an environment as complex as Baltimore harbour sediment are daunting indeed. But progress in understanding our own genome, when only 20 years ago the notion of sequencing it was not widely supported, gives reason for confidence.

Experimental procedures

DNA was isolated from anaerobic sediments sampled from Baltimore harbour. The samples were a gift from Dr Joy Watts (Center of Marine Biotechnology, University of Maryland Biotechnology Institute) and were obtained as described in Holoman and colleagues (1998). DNA was extracted following the protocol in Rondon and colleagues (2000), except that instead of electroeluting the DNA after preparative pulsed-field gel electrophoresis, we cleaned it using the GELase kit from Epicentre.

The B1BF1 fosmid libraries were constructed using the CopyControl™ Fosmid Library Production Kit from Epicentre, following the manufacturer's protocol. Fosmid clones were miniprepped using either alkaline lysis with GeneMachine robotics (Genomic Solutions) or the R.E.A.L. Prep 96 Plasmid Kit (Qiagen). End-sequencing of miniprepped fosmid clones was performed using the DYEnamic™ ET Dye Terminator Kit (MegaBACE) and a MegaBACE™ 1000 (Amersham). Ten 96-well plates of prepped fosmids were screened using the I-CeuI homing endonuclease (NEB).

A fosmid vector containing an I-CeuI site and a blunt-end site was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos (Epicentre). In order to obtain as many CDSs as possible in our fosmid clones, we chose to clone in the direction 23S rRNA−5S rRNA for our present study. The vector for cloning in the direction 23S rRNA−16S rRNA was also constructed and is available from the authors (pCC1FosCeuI16S). The modified vector, pCC1FosCeuI23S, was prepared using the Large Construct Kit (Qiagen) and cut with I-CeuI overnight. After cleaning the vector from gel, the vector was cut with PmlI overnight to make a blunt site. The vector was then dephosphorylated using shrimp alkaline phosphatase (Amersham Biosciences), followed by phenol/chloroform extraction and ethanol precipitation. Ligation of DNA into pCC1FosCeuI23S was performed as described above, except that DNA was cut overnight with I-CeuI following the end-repair step in the CopyControl™ Fosmid Library Production Kit protocol.
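As a small illustration of the vector design, the sketch below locates the PmlI site within the adaptor sequence quoted above. It assumes the standard PmlI recognition sequence CACGTG (cut between CAC and GTG, leaving a blunt end), which is general knowledge about the enzyme rather than a detail given in this paper; the helper `find_sites` is a hypothetical name.

```python
# Adaptor sequence as given in the text.
ADAPTOR = "CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG"

# Assumption: PmlI recognition sequence (not stated in the paper itself).
PMLI_SITE = "CACGTG"


def find_sites(seq: str, site: str) -> list:
    """Return the 0-based start position of every occurrence of site in seq."""
    hits = []
    pos = seq.find(site)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(site, pos + 1)  # allow overlapping matches
    return hits


pmli_positions = find_sites(ADAPTOR, PMLI_SITE)
```

Under this assumption the blunt-end site sits at the 3' end of the adaptor, downstream of the I-CeuI site, which is consistent with cutting first with I-CeuI and then with PmlI to produce one I-CeuI end and one blunt end.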

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. The bracketed clade in the tree is labelled 'PolC + DinG fusion proteins, same domain structure as b1bcf11f04 ORF17'.

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8× coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3× coverage). For one low-quality region, positions 1192–1342 in b1dcf13c08, we were not able to obtain high-quality sequence. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that had no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
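The CDS selection rules above (discard predictions shorter than 100 bp, and prefer the overlapping CDS that has GenBank homologues) can be sketched as follows. This is an illustrative simplification, not the authors' pipeline: the `CDS` record, the `has_homologue` flag and the function `filter_cds` are hypothetical, any shared coordinate is treated as an overlap, and when both or neither of two overlapping CDSs have homologues the sketch keeps both because the text does not say how such cases were resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    name: str
    start: int           # 1-based, inclusive
    end: int             # 1-based, inclusive
    has_homologue: bool  # True if a significant GenBank match was found

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: CDS, b: CDS) -> bool:
    """True if the two CDSs share at least one position."""
    return a.start <= b.end and b.start <= a.end


def filter_cds(predictions, min_len=100):
    """Apply the length cut-off, then resolve overlaps in favour of homologues."""
    long_enough = [c for c in predictions if c.length >= min_len]
    dropped = set()
    for i, a in enumerate(long_enough):
        for b in long_enough[i + 1:]:
            # Only act when exactly one of the pair has a GenBank homologue.
            if overlaps(a, b) and a.has_homologue != b.has_homologue:
                dropped.add(b.name if a.has_homologue else a.name)
    return [c for c in long_enough if c.name not in dropped]
```

The later ORF-finder and BLASTX re-examination of regions without matches is a manual curation step and is not modelled here.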

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were performed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with another major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria, or from the Bacteroidetes/Chlorobi group to γ-proteobacteria; no within-group transfers were included. For some of these trees the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.

All sequences have been submitted to GenBank with accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).
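The decision rule for calling a CDS laterally acquired can be written as a short sketch. The function `classify_cds`, its arguments and its return labels are hypothetical; the only elements taken from the text are that the comparison is made at the level of major bacterial groups or phyla, that within-group transfers are not counted, and that bootstrap support must exceed 50%.

```python
from typing import Optional


def classify_cds(
    rrna_group: str,
    cds_group: Optional[str],
    bootstrap: float,
    threshold: float = 50.0,
) -> str:
    """Classify one CDS from the group its tree places it in.

    rrna_group: major group assigned from the rRNA phylotype.
    cds_group:  major group the CDS clusters with, or None if it clusters
                with no group in particular.
    bootstrap:  bootstrap support (%) for the CDS grouping.
    """
    if cds_group is None:
        return "unresolved"
    if cds_group == rrna_group:
        # Within-group transfers are not counted as LGT.
        return "consistent with rRNA"
    if bootstrap > threshold:
        return "LGT"
    return "unresolved"
```

Group labels must be chosen at the same taxonomic level for both arguments, for example "alpha-proteobacteria" versus "gamma-proteobacteria". CDSs falling in mixed clades, which the text also classifies as LGT without naming a donor, would need an additional label that this sketch leaves out.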

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.


Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (Nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits exp < 10 e−10. Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com

2024 C L Nesboslash Y Boucher M Dlutek and W F Doolittle

copy 2005 Society for Applied Microbiology and Blackwell Publishing Ltd Environmental Microbiology 7 2011ndash2026

genome when only 20 years ago the notion of sequenc-ing it was not widely supported gives reason forconfidence

Experimental procedures

DNA was isolated from anaerobic sediments sampled fromBaltimore harbour The samples were a gift from Dr Joy Watts(Center of Marine Biotechnology University of MarylandBiotechnology Institute) and were obtained as described inHoloman and colleagues (1998) DNA was extracted follow-ing the protocol in Rondon and colleagues (2000) except thatinstead of electroeluting the DNA after preparative pulsed-field gel electrophoresis we cleaned it using the GELase-kitfrom Epicentre

The B1BF1 fosmid libraries were constructed using theCopyControltrade Fosmid Library Production Kit from Epicentrefollowing the protocol of manufacturer Fosmid clones wereminipreped using either alkaline lysis with GeneMachinerobotics (Genomic Solutions) or the REAL Prep 96 Plas-mid Kit (Qiagen) End-sequencing of minipreped fosmidclones was performed using the DYEnamictrade ET Dye Termi-nator Kit (MegaBACE) and a MegaBACEtrade 1000 (Amer-sham) Ten 96-plates of preped fosmids were screened usingthe I-CeuI homing endonuclease (NEB)

A fosmid vector containing an I-CeuI site and a blunt-endsite was constructed by ligating the adaptor CGTAACTATAACGGTCCTAAGGTAGCGAACACGTG into pCC1Fos(Epicentre) In order to obtain as many CDSs as possible in

our fosmid clones we chose to clone in the direction 23SrRNAminus5S rRNA for our present study The vector for cloningin the direction 23S rRNAminus16S rRNA was also constructedand is available from the authors (pCC1FosCeuI16S) Themodified vector pCC1FosCeuI23S was prepared using theLarge Construct Kit (Qiagen) and cut with I-CeuI overnightAfter cleaning the vector from gel the vector was cut withPmlI overnight to make a blunt site The vector was thendephosphorylated using shrimp alkaline phosphatase(Amersham Biosciences) followed by phenolchloroformextraction and ethanol precipitation Ligation of DNA intopCC1FosCeuI23S was performed as described aboveexcept DNA was cut overnight with I-CeuI following the end-repair step in the CopyControltrade Fosmid Library ProductionKit protocol

Fig. 6. Maximum Likelihood phylogeny of the DinG domain of homologues of b1bcf11f04 ORF17, estimated using PMBML (517 positions in alignment). The sequences were obtained by blasting the b1bcf11f04 ORF17 sequence against GenBank, and the 100 best matches were retrieved and aligned. Groups of very similar sequences from the same species or sister species were trimmed down to one representative sequence. The tree was arbitrarily rooted by Actinobacillus pleuropneumoniae. Results from bootstrap analyses are indicated as in Fig. 3. [Tree image not reproduced; one clade is labelled "PolC + DinG fusion proteins, same domain structure as b1bcf11f04 ORF17".]

Subcloning of fosmid clones was performed using the TOPO® Shotgun Subcloning Kit (Invitrogen), and each fosmid was sequenced to >8× coverage. Low-quality regions and gaps were targeted by PCR (final 8.2–14.3× coverage). For one low-quality region, positions 1192–1342 in b1dcf13c08, we were not able to obtain high-quality sequence. The fosmid clones were assembled using Phred/Phrap. CDSs were identified using the run-glimmer2 script with the standard settings provided in this script (Delcher et al., 1999), and CDSs shorter than 100 bp were eliminated. If two overlapping CDSs were identified, we selected the one that had significant homologues in GenBank. In cases where CDSs were identified that had no match in GenBank, we analysed the region using ORF-finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and finally by doing BLASTX searches. If an alternative CDS was obtained using ORF-finder that did have a match in GenBank, then that CDS was selected. tRNAs were identified with tRNAscan-SE (Lowe and Eddy, 1997). The CDSs were annotated using BLASTP searches (Altschul et al., 1997) of GenBank at http://www.ncbi.nlm.nih.gov/BLAST and Pfam searches (Bateman et al., 2004) at http://www.sanger.ac.uk/Software/Pfam/search.shtml.
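The two CDS selection rules stated above (discard CDSs shorter than 100 bp; of two overlapping CDSs, keep the one with significant GenBank homologues) can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the `CDS` record, the `select_cds` function and the decision to keep both CDSs when the homologue evidence does not distinguish them are assumptions made for the example.

```python
from dataclasses import dataclass

MIN_CDS_LENGTH = 100  # bp; shorter CDSs are eliminated, as stated in the text


@dataclass(frozen=True)
class CDS:
    """Hypothetical record for one predicted coding sequence."""
    name: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    has_homologue: bool   # True if a significant GenBank match was found

    @property
    def length(self):
        return self.end - self.start + 1


def overlaps(a, b):
    """True if the two CDSs share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select_cds(predictions):
    """Apply the length filter, then resolve overlapping pairs.

    Assumption: when both or neither of two overlapping CDSs have a
    homologue, both are kept, because the text gives no rule for that case.
    """
    kept = [c for c in predictions if c.length >= MIN_CDS_LENGTH]
    rejected = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if overlaps(a, b) and a.has_homologue != b.has_homologue:
                rejected.add(b if a.has_homologue else a)
    return [c for c in kept if c not in rejected]


predictions = [
    CDS("orf1", 1, 90, True),        # too short, removed by the length filter
    CDS("orf2", 100, 500, True),     # overlaps orf3 and has a homologue
    CDS("orf3", 450, 900, False),    # overlaps orf2, no homologue, removed
    CDS("orf4", 1000, 1500, False),  # no overlap, kept despite having no match
]
selected = select_cds(predictions)
print([c.name for c in selected])
```

In the text, CDSs such as `orf4` that have no GenBank match are examined further with ORF-finder and BLASTX; that manual step is outside the scope of this sketch.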

Phylogenetic analyses of the 1000 bp 23S rRNA fragment and 16S rRNA genes were carried out in PAUP* (Swofford, 2001). Minimum evolution trees were constructed using LogDet distances, and Maximum Likelihood trees were constructed using a general time-reversible model with gamma-distributed rates with four categories and invariable sites (GTR + Γ + I). Ten random addition cycles of the sequences and tree bisection and reconnection (TBR) branch swapping were used in both cases. Homologues of the CDSs in GenBank were identified and retrieved using BLASTP searches at http://www.ncbi.nlm.nih.gov/BLAST. For b1dcf13f01 we also searched the draft genome of C. aurantiacus at http://genome.jgi-psf.org/microbial. Initially, up to 100 significant matches were retrieved and aligned. Clusters of very similar sequences from the same or sister taxa were trimmed down to one representative sequence. We also removed sequences that were considerably shorter than the rest of the alignment, as well as sequences that were difficult to align. The alignments were edited by deleting regions with many or large gaps. Phylogenetic analysis of protein sequences (CDSs) was carried out in two steps. First, simple Neighbour-joining trees with bootstrap analyses were constructed for all CDSs with significant matches in BLASTP searches. If the phylogeny of the CDS disagreed with the phylogeny of the rRNA, i.e. if the CDS clustered with a different major bacterial group than the rRNA, a minimum evolution tree (with bootstrap analysis, 100 replicates with global rearrangements) was estimated from Maximum Likelihood distances [JTT (Jones et al., 1992) + Γ, global rearrangements and 10 random addition replicates]. If the trees supported a different phylogenetic grouping than that observed from the rRNA (with bootstrap support >50%), the CDS was classified as being acquired by LGT. It should be noted that we only classified as LGT transfers between bacterial groups or phyla, e.g. from α-proteobacteria to γ-proteobacteria or from the Bacteroidetes/Chlorobi-group to γ-proteobacteria; no within-group transfers were included. For some of these trees the CDS from the fosmid was found within a clade containing representatives from several different bacterial groups, suggesting frequent transfers of the gene (see Table 1). In these cases we classified the CDS as acquired by LGT, but it should be noted that for such phylogenies it is not possible to identify the donor and recipients. For some LGT-CDSs we also constructed protein Maximum Likelihood phylogenies using PMBML (Veerassamy et al., 2003), a modified version of PROML within the PHYLIP package, version 3.6a2 (Felsenstein, 2001). For these analyses we used a JTT + Γ model, global rearrangements and 10 random addition replicates. In the Maximum Likelihood bootstrap analyses we did not use global rearrangements, and we only did one random addition of sequences per bootstrap replicate.
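The LGT decision rule described above can be written down as a small Python sketch. It assumes that each CDS tree has already been summarised by hand as the set of major bacterial groups in the clade containing the fosmid CDS, plus the bootstrap support for that clade. The `TreeResult` record, the `classify` function, the label strings and the example inputs are all hypothetical, and treating every multi-group clade as "donor unresolved" is a simplification of the authors' case-by-case judgement.

```python
from dataclasses import dataclass
from typing import FrozenSet

BOOTSTRAP_THRESHOLD = 50.0  # per cent; the text requires support above 50


@dataclass(frozen=True)
class TreeResult:
    """Hypothetical hand-made summary of one CDS phylogeny."""
    cds: str
    clade_groups: FrozenSet[str]  # major groups in the clade holding the CDS
    bootstrap: float              # bootstrap support for that clade, per cent


def classify(result, rrna_group):
    """Classify a CDS relative to the major group indicated by the rRNA."""
    if result.bootstrap <= BOOTSTRAP_THRESHOLD:
        return "no LGT call"  # grouping not sufficiently supported
    if result.clade_groups == {rrna_group}:
        return "no LGT call"  # CDS agrees with the rRNA; within-group transfers are not scored
    if len(result.clade_groups) > 1:
        return "LGT, donor unresolved"  # clade mixes several major groups
    return "LGT"  # CDS clusters with one different major group


examples = [
    TreeResult("cds_a", frozenset({"gamma-proteobacteria"}), 95.0),
    TreeResult("cds_b", frozenset({"alpha-proteobacteria"}), 72.0),
    TreeResult("cds_c", frozenset({"alpha-proteobacteria"}), 41.0),
    TreeResult("cds_d", frozenset({"Firmicutes", "Bacteroidetes/Chlorobi"}), 80.0),
]
calls = {r.cds: classify(r, "gamma-proteobacteria") for r in examples}
for name, call in calls.items():
    print(name, "->", call)
```

Keeping the threshold and the group comparison explicit makes it easy to see that a different grouping with support of 50 or below, such as `cds_c`, is never counted as a transfer.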

All sequences have been submitted to GenBank with Accession numbers AJ937675 and AJ937676 (rRNA operons) and AJ937760–AJ937771 (fosmid clones).

Acknowledgements

This work was supported by funds from the Canadian Institutes for Health Research (MOP 4467) and Genome Canada (Genome Atlantic). Sequencing was performed at the Genome Atlantic sequencing platform. We want to thank Dr Francisco E. Rodriguez Valera, Rebecca J. Case and Terence L. Marsh for invaluable discussions on the I-CeuI approach to obtaining rRNA-containing clones, environmental microbiology and LGT.

References

Aagaard, C., Awayez, M.J., and Garrett, R.A. (1997) Profile of the DNA recognition site of the archaeal homing endonuclease I-DmoI. Nucleic Acids Res 25: 1523–1530.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andersson, J.O., Sjogren, A.M., Davis, L.A., Embley, T.M., and Roger, A.J. (2003) Phylogenetic analyses of diplomonad genes reveal frequent lateral gene transfers affecting eukaryotes. Curr Biol 13: 94–104.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

Beja, O., Spudich, E.N., Spudich, J.L., Leclerc, M., and DeLong, E.F. (2001) Proteorhodopsin phototrophy in the ocean. Nature 411: 786–789.

Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y., et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs [WWW document]. URL http://www.rna.icmb.utexas.edu. BMC Bioinformatics 3: 2.

Chevalier, B., Turmel, M., Lemieux, C., Monnat, R.J., Jr, and Stoddard, B.L. (2003) Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI. J Mol Biol 329: 253–269.

de la Torre, J.R., Christianson, L.M., Beja, O., Suzuki, M.T., Karl, D.M., Heidelberg, J., and DeLong, E.F. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci USA 100: 12830–12835.

Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.

Dojka, M.A., Hugenholtz, P., Haack, S.K., and Pace, N.R. (1998) Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64: 3869–3877.

Eulberg, D., Kourbatova, E.M., Golovleva, L.A., and Schlomann, M. (1998) Evolutionary relationship between chlorocatechol catabolic enzymes from Rhodococcus opacus 1CP and their counterparts in proteobacteria: sequence divergence and functional convergence. J Bacteriol 180: 1082–1094.

Felsenstein, J. (2001) PHYLIP, Phylogeny Inference Package. Seattle, USA: Department of Genetics, University of Washington.

Holoman, T.R., Elberson, M.A., Cutter, L.A., May, H.D., and Sowers, K.R. (1998) Characterization of a defined 2,3,5,6-tetrachlorobiphenyl-ortho-dechlorinating microbial community by comparative sequence analysis of genes coding for 16S rRNA. Appl Environ Microbiol 64: 3359–3367.

Hugenholtz, P., Pitulle, C., Hershberger, K.L., and Pace, N.R. (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275–282.

Kuwahara, T., Yamashita, A., Hirakawa, H., Nakayama, H., Toh, H., Okada, N., et al. (2004) Genomic analysis of Bacteroides fragilis reveals extensive DNA inversions regulating cell surface adaptation. Proc Natl Acad Sci USA 101: 14919–14924.

Lawrence, J.G., and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44: 383–397.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

Marshall, P., and Lemieux, C. (1992) The I-CeuI endonuclease recognizes a sequence of 19 base pairs and preferentially cleaves the coding strand of the Chlamydomonas moewusii chloroplast large subunit rRNA gene. Nucleic Acids Res 20: 6401–6407.

Muller, T.A., Byrde, S.M., Werlen, C., van der Meer, J.R., and Kohler, H.P. (2004) Genetic analysis of phenoxyalkanoic acid degradation in Sphingomonas herbicidovorans MH. Appl Environ Microbiol 70: 6066–6075.

Nelson, K.E., Fleischmann, R.D., DeBoy, R.T., Paulsen, I.T., Fouts, D.E., Eisen, J.A., et al. (2003) Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J Bacteriol 185: 5591–5601.

Nesbø, C.L., and Doolittle, W.F. (2003) Active self-splicing group I introns in the 23S rRNA genes of hyperthermophilic bacteria, derived from introns in eukaryotic organelles. Proc Natl Acad Sci USA 100: 10806–10811.

Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., et al. (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 66: 2541–2547.

Sanchez, L.B., Galperin, M.Y., and Muller, M. (2000) Acetyl-CoA synthetase from the amitochondriate eukaryote Giardia lamblia belongs to the newly recognized superfamily of acyl-CoA synthetases (nucleoside diphosphate-forming). J Biol Chem 275: 5794–5803.

Suzuki, M.T., Preston, C.M., Beja, O., de la Torre, J.R., Steward, G.F., and DeLong, E.F. (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in bacterial artificial chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48: 473–488.

Swofford, D.L. (2001) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA, USA: Sinauer Associates.

Treusch, A.H., Kletzin, A., Raddatz, G., Ochsenreiter, T., Quaiser, A., Meurer, G., et al. (2004) Characterization of large-insert DNA libraries from soil for environmental genomic studies of Archaea. Environ Microbiol 6: 970–980.

Veerassamy, S., Smith, A., and Tillier, E.R. (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10: 997–1010.

Vuilleumier, S., and Pagni, M. (2002) The elusive roles of bacterial glutathione S-transferases: new lessons from genomes. Appl Microbiol Biotechnol 58: 138–146.

Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., et al. (2003) A genomic view of the human–Bacteroides thetaiotaomicron symbiosis. Science 299: 2074–2076.

Zhao, K.H., Deng, M.G., Zheng, M., Zhou, M., Parbel, A., Storf, M., et al. (2000) Novel activity of a phycobiliprotein lyase: both the attachment of phycocyanobilin and the isomerization to phycoviolobilin are catalyzed by the proteins PecE and PecF encoded by the phycoerythrocyanin operon. FEBS Lett 469: 9–13.

Supplementary material

The following supplementary material is available for this article online:

Figure S1. A. Number of BLAST hits with exp. < 10 e−10 to different taxonomic groups. B. Distribution of G + C content of the sequences. C. Distribution of the COG category of the BLAST hits (exp. < 10 e−10). Black bars refer to end-sequences and grey bars refer to the sequenced fosmid clones.

Tables S1–12. Annotation of b1dcf51a06, b1dcf13f01, b3cf12f09, b1bcf11f04, b1dcf51c12, b1bcf11h03, b1bcf11d04, b1dcf13c8, b3cf12d07, b1bcf11c04, b1bf11a01, b1bf110d03.

This material is available as part of the online article from http://www.blackwell-synergy.com
