
Howard Ochman, Department of Integrative Biology

University of Texas

Bioinformatic Insights into the Evolution of Bacterial Genomes

Bacterial Genomics (before 1995)

• Bacterial chromosomes are (typically) circular, with a single replication origin

• Bacterial genomes are tightly packed with coding genes & functional elements, and have little repetitive or non-coding DNA

• Bacterial genomes are typically small, ranging in size from 0.5-10 Mb

• Bacterial coding genes have no introns, are arranged in operons, and are assorted onto both strands

• Gene order is conserved among closely related bacteria

• Base composition varies among species (25-75% GC) and is similar among closely related taxa

• Base composition is relatively homogeneous over the entire chromosome

• Rates & patterns of mutations vary with gene location & transcriptional status

Haemophilus influenzae

Mycoplasma genitalium

The field of Bacterial Genomics is often considered to have begun in 1995

What was learned from the first full genome sequences?

1. Possible to assemble full genomes by whole-genome shotgun (WGS) sequencing with paired-end reads

2. Enabled resolution of complete gene inventories (& encoded functions)

1. Start to observe a consistent trend in genome size & gene number

Mycoplasma genitalium (580 kb; 564 ORFs)

Mycoplasma pneumoniae (816 kb; 736 ORFs)

2. Members of the same genus can differ in genome size & contents

What is the source of such differences in genome contents?

Loss of ancestral genes? Generation of new genes? Genes acquired by lateral gene transfer (LGT)?

Difficult to resolve without knowledge of an ancestral (or outgroup) genome

-- Genes with atypical features (e.g., G+C content) are considered to arise by LGT

-- Strain- & species-specific genes often have atypical base compositions

Sequenced genes present in Salmonella, but not in E. coli (circa 1990)

Gene       Potential function           Map     %G+C    CAI
cbi        Cobinamide synthesis         41      59.3    .233
fljA       Flagellar synthesis          56      40.9    .210
fljB       Flagellar synthesis          56      52.3    .216
inv/spa    Host recognition/invasion    59      45.5    .261
nanH       Sialidase                    20.5    40.9    .263
ORF        Unknown                      98      38.2    .296
pagC       Envelope protein             25      43.4    .274
phoN       Phosphatase                  96      46.5    .248
rfc        LPS synthesis                31      33.5    .175
sinR       Transcriptional control      7       39.8    .218
tctABCD    Tricarboxylate transport     57      55.0    .278
pgtE       Phosphoglycerate transport   –       45.3    .277

Salmonella genome is 52% G+C
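The comparison implied by the table can be sketched in a few lines of Python: subtract the genome-wide value (52% G+C) from each gene's %G+C and list the genes that deviate strongly. The %G+C values are transcribed from the table; the function names and the 5-percentage-point cutoff are illustrative assumptions, not a threshold given on the slides.

```python
# %G+C values transcribed from the table of Salmonella-specific genes.
GENE_GC = {
    "cbi": 59.3, "fljA": 40.9, "fljB": 52.3, "inv/spa": 45.5,
    "nanH": 40.9, "ORF": 38.2, "pagC": 43.4, "phoN": 46.5,
    "rfc": 33.5, "sinR": 39.8, "tctABCD": 55.0, "pgtE": 45.3,
}
GENOME_GC = 52.0  # Salmonella genome-wide %G+C, as stated on the slide


def gc_deviation(gene_gc, genome_gc=GENOME_GC):
    """Return each gene's %G+C minus the genome-wide %G+C."""
    return {gene: round(gc - genome_gc, 1) for gene, gc in gene_gc.items()}


def atypical_genes(gene_gc, genome_gc=GENOME_GC, threshold=5.0):
    """List genes deviating from the genome by more than `threshold` points.

    The default threshold is an assumption chosen for illustration only.
    """
    return sorted(
        gene for gene, gc in gene_gc.items() if abs(gc - genome_gc) > threshold
    )


if __name__ == "__main__":
    for gene, dev in sorted(gc_deviation(GENE_GC).items(), key=lambda kv: kv[1]):
        print(f"{gene:8s} {dev:+.1f}")
    print("atypical:", atypical_genes(GENE_GC))
```

With this cutoff, every gene in the table except fljB and tctABCD is flagged, and most of the flagged genes are A+T-rich relative to the genome.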

Differences in base composition among species are caused by mutational biases

Figure: Neisseria meningitidis, 52% G+C; axis: % G+C (from Tettelin et al. 2000. Science)

Inferring the Incidence of Lateral Gene Transfer (LGT) from Sequence Heterogeneity along Bacterial Chromosomes
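One common way to look for compositional heterogeneity along a chromosome is a sliding-window scan of %G+C. The sketch below is a minimal version of that idea, not the method used in the cited study; the window size, step, and cutoff are assumptions to be tuned for real data.

```python
def gc_percent(seq):
    """Percent G+C of a nucleotide string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=1000, step=500):
    """Return (start, %G+C) for each full window along the sequence."""
    return [
        (start, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def atypical_windows(seq, window=1000, step=500, cutoff=10.0):
    """Windows whose %G+C differs from the whole-sequence value by > cutoff.

    `cutoff` is in percentage points and is an illustrative assumption.
    """
    overall = gc_percent(seq)
    return [
        (start, gc)
        for start, gc in sliding_gc(seq, window, step)
        if abs(gc - overall) > cutoff
    ]


if __name__ == "__main__":
    # Synthetic toy sequence: a G+C-rich backbone with an A+T-rich insert.
    toy = "GC" * 50 + "AT" * 50 + "GC" * 50
    print(atypical_windows(toy, window=100, step=100, cutoff=50.0))
```

The scan treats the sequence as linear; for a circular chromosome, windows spanning the origin would need to wrap around.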

(Mostly) Free-living bacteria

(Mostly) Pathogens

A relationship between bacterial lifestyle and genome size, but with some exceptions (M. tuberculosis, 4.4 Mb; Aquifex, 1.6 Mb)

Remarkable relationship between genome size & gene number across >10-fold range in genome size in organisms representing eight phyla
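The tightness of this relationship can be checked with the three genomes for which this deck quotes both a size and an ORF count. The sketch below computes genes per kb and the mean number of base pairs per gene; the function names are illustrative.

```python
# (genome size in bp, number of ORFs), as quoted on the slides.
GENOMES = {
    "Mycoplasma genitalium": (580_000, 564),
    "Mycoplasma pneumoniae": (816_000, 736),
    "Carsonella ruddii": (159_662, 182),
}


def genes_per_kb(size_bp, n_orfs):
    """Gene density expressed as ORFs per kilobase of genome."""
    return n_orfs / (size_bp / 1000.0)


def mean_bp_per_gene(size_bp, n_orfs):
    """Average amount of genome (bp) per ORF, intergenic DNA included."""
    return size_bp / n_orfs


if __name__ == "__main__":
    for name, (size_bp, n_orfs) in GENOMES.items():
        print(
            f"{name:24s} {genes_per_kb(size_bp, n_orfs):.2f} genes/kb, "
            f"{mean_bp_per_gene(size_bp, n_orfs):.0f} bp/gene"
        )
```

All three genomes come out near one gene per kilobase, which is the slope of the diagonal relating genome size to gene number.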

Intracellular pathogens & obligate endosymbionts

Mycobacterium leprae

Then why haven’t pseudogenes accumulated in all of the other sequenced bacterial genomes?

Since mutations occur as an on-going process & pseudogenes are continually being generated, what about all those other (big free-living & small symbiont) genomes that fall right on the diagonal?

The genomes of other recent pathogens of humans possess high numbers of inactivated genes (e.g., Shigella has 400; Yersinia pestis has 150)

Shigella flexneri

Yersinia pestis

Mycobacterium leprae

Unlike most eukaryotes (in which noncoding DNA accumulates), bacteria show a strong mutational bias towards deletions of all sizes

When comparing pseudogenes to their functional counterparts, deletions outnumber insertions

In bacteria, non-functional regions are removed by a pervasive mutational bias towards deletions

This deletional bias was detected computationally and confirmed experimentally
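A computational comparison of this kind can be sketched as counting gap runs in a pairwise alignment of a pseudogene against its functional counterpart. The code below is a minimal illustration, assuming the functional copy represents the ancestral state, so that a gap in the pseudogene row is scored as a deletion and a gap in the functional row as an insertion. The function name and the toy alignment are hypothetical.

```python
def count_indels(functional_aln, pseudogene_aln):
    """Count indel events and indel lengths in an aligned sequence pair.

    Both strings must be rows of the same alignment, with '-' for gaps.
    Assumption: the functional gene approximates the ancestral sequence.
    """
    if len(functional_aln) != len(pseudogene_aln):
        raise ValueError("aligned sequences must have the same length")

    counts = {"deletions": 0, "insertions": 0, "deleted_bp": 0, "inserted_bp": 0}
    previous = None
    for f_base, p_base in zip(functional_aln, pseudogene_aln):
        if p_base == "-" and f_base != "-":
            state = "deletions"      # bases missing from the pseudogene
            counts["deleted_bp"] += 1
        elif f_base == "-" and p_base != "-":
            state = "insertions"     # extra bases in the pseudogene
            counts["inserted_bp"] += 1
        else:
            state = None
        # A new event starts whenever a gap run begins or switches rows.
        if state is not None and state != previous:
            counts[state] += 1
        previous = state
    return counts


if __name__ == "__main__":
    functional = "ATGGCTAAAGGT--CTGGAA"
    pseudogene = "ATG---AAAGGTTTCT-GAA"
    print(count_indels(functional, pseudogene))
```

Summing these counts over many pseudogene pairs gives the ratio of deletions to insertions, both by number of events and by number of base pairs.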

Lineages of bacteria with small genomes derive from ancestors with larger genomes

< 2 Mb

2–4 Mb

> 4 Mb

The progression towards a compact genome

1. (Ancestral) free-living (large-genomed) bacterium moves into a (nutrient-rich) host

2. Host association renders many genes superfluous/useless & dead genes accumulate

3. Deletional bias reduces genome size and removes inactivated genes

Sampling multiple strains within species reveals wide variation in genome size (concept of a “core” vs. “pan” genome)
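In set terms, the core genome is the intersection of the gene sets of all sampled strains and the pan genome is their union. The sketch below assumes genes have already been grouped into families (ortholog clustering is not shown); the strain names and family labels are hypothetical placeholders.

```python
def core_and_pan(strain_genes):
    """Return (core, pan) gene-family sets for a mapping strain -> families."""
    gene_sets = [set(families) for families in strain_genes.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # families present in every strain
    pan = set.union(*gene_sets)          # families present in any strain
    return core, pan


if __name__ == "__main__":
    # Hypothetical strains and gene families, for illustration only.
    strains = {
        "strain_A": {"fam1", "fam2", "fam3"},
        "strain_B": {"fam1", "fam2", "fam4"},
        "strain_C": {"fam1", "fam2", "fam5", "fam6"},
    }
    core, pan = core_and_pan(strains)
    print(f"core: {len(core)} families, pan: {len(pan)} families")
```

As more strains are sampled, the core can only shrink or stay the same, while the pan genome can only grow or stay the same.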

1. Very large sampling of several phyla at various taxonomic levels

2. Minimal size of bacterial genome is ~500 kb (pathogens & symbionts)

A new lower limit to the size of a cellular genome

The symbiont of a psyllid collected on a hackberry tree in Tucson, Arizona

Carsonella ruddii: the smallest bacterial genome

159,662 bp; 182 ORFs; 16.5% GC; 98% coding

The next-smallest cellular genome then known was 420 kb, in the aphid symbiont Buchnera

(Nakabachi et al., 2006)

…with the current record of 112 kb held by Nasuia deltocephalinicola

A thousand bacterial genomes sequenced

What has the sequencing of 1000s of genomes told us about factors controlling the size & composition of bacterial genomes?

• Bacterial genomes are typically small, ranging in size from 0.1–13 Mb

• Bacterial genomes are USUALLY tightly packed with coding genes, and USUALLY have little repetitive & non-coding sequences (e.g., pseudogenes). Therefore, in bacteria, genome size is tightly linked to gene number

• There is a pervasive mutational bias that removes non-functional regions

• Small bacterial genomes derive from lineages with large genomes

• Base composition varies among species (13-80% GC) and is relatively homogeneous over the entire chromosome

• Many bacterial genomes harbor substantial amounts of laterally acquired DNA

• Strains within a named bacterial species can vary greatly in genome size

But what are the major determinants of bacterial genome size & complexity?

The (New) Observations:

But assessing the level of drift affecting bacteria is problematic:

1. In Bacteria, it is very (very) difficult to estimate Ne

2. Methods based on levels of intraspecific polymorphisms are confounded by problems in defining bacterial species

Due to the link between genome size and gene density in bacteria, evolutionary forces that act on individual genes will affect genome size

No Benefit (decay)

Essential (preserved)

-- Fate of Bacterial Genes --

genes whose presence is affected by drift

The effects of genetic drift (changes in gene frequencies that take place strictly by chance) are more dramatic in species with small effective population sizes (Ne) & will shape the contents of genomes by affecting the fixation of deleterious mutations
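The dependence on Ne can be made concrete with the standard diffusion approximation for the fixation probability of a new mutation. This formula is not on the slides; the sketch assumes a haploid Wright-Fisher population in which Ne equals the census size N, with the mutation starting at frequency 1/N and having selection coefficient s (negative for deleterious).

```python
import math


def fixation_probability(s, n):
    """Approximate fixation probability of a new mutation (haploid, Ne = N).

    u = (1 - exp(-2s)) / (1 - exp(-2Ns)); the neutral limit is 1/N.
    """
    if s == 0:
        return 1.0 / n
    exponent = -2.0 * n * s
    if exponent > 700.0:
        # expm1 would overflow; the probability is effectively zero.
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def relative_to_neutral(s, n):
    """Fixation probability divided by the neutral expectation 1/N."""
    return fixation_probability(s, n) * n


if __name__ == "__main__":
    s = -1e-4  # a slightly deleterious mutation (illustrative value)
    for n in (1_000, 100_000, 1_000_000):
        print(f"N = {n:>9,d}: {relative_to_neutral(s, n):.3g} x neutral")
```

For the same slightly deleterious mutation, a population of 1,000 fixes it almost as readily as a neutral change, whereas a population of 1,000,000 essentially never does.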

First test if Bacterial Genome Size is Adaptive (i.e., shaped by natural selection) or Non-Adaptive (i.e., shaped by genetic drift)

Using Ka/Ks values as a proxy for level of drift

1. Because point mutations that cause amino acid replacements are often deleterious, the rate of nonsynonymous changes (Ka) is expected to be less than the rate of synonymous substitutions (Ks) in functional genes.

2. An increased level of drift, produced from reduced Ne (+ genome-wide relaxation of selection) will result in an increased incidence in slightly deleterious mutations and increase Ka/Ks genome-wide.
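To make the Ka/Ks proxy concrete, here is a deliberately simplified estimator in the spirit of Nei-Gojobori counting. It is a sketch under stated assumptions, not the procedure used for the analyses in this deck: the two coding sequences must be aligned in frame without gaps or ambiguous bases, codons differing at more than one position are skipped rather than averaged over mutational pathways, changes to stop codons count as nonsynonymous, and a Jukes-Cantor correction is applied. All function names are illustrative.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split an in-frame coding sequence into codons."""
    seq = seq.upper()
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def syn_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    synonymous = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                synonymous += CODON_TABLE[mutant] == CODON_TABLE[codon]
    return synonymous / 3.0


def count_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts over single-hit codons."""
    syn = nonsyn = 0
    for a, b in zip(codons(seq_a), codons(seq_b)):
        if sum(x != y for x, y in zip(a, b)) != 1:
            continue  # identical codon, or multi-hit codon (skipped here)
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def jukes_cantor(p):
    """Correct a proportion of differences for multiple substitutions."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(seq_a, seq_b):
    """Return (Ka, Ks, Ka/Ks); raises ZeroDivisionError if Ks is zero."""
    ca, cb = codons(seq_a), codons(seq_b)
    if len(ca) != len(cb):
        raise ValueError("sequences must have the same length")
    s_sites = (sum(map(syn_sites, ca)) + sum(map(syn_sites, cb))) / 2.0
    n_sites = 3.0 * len(ca) - s_sites
    syn, nonsyn = count_differences(seq_a, seq_b)
    ks = jukes_cantor(syn / s_sites)
    ka = jukes_cantor(nonsyn / n_sites)
    return ka, ks, ka / ks
```

For real genome pairs, an established implementation that handles multi-hit codons and transition/transversion bias would be the safer choice; the point here is only to show what the ratio measures.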

Examined genome-wide Ka/Ks in species pairs from eight phyla

1. Same significant relationship is observed when considering all genes shared by a genome-pair or only those genes common to the majority of genomes

2. Similarly significant relationship is observed when considering only pairs with low, or only those with intermediate, levels of sequence divergence

3. Those with high genome-wide Ka/Ks have a lifestyle that reduces Ne (obligate endosymbionts, vector-borne pathogens, extremophiles)

Genome size exhibits a strong negative correlation with the level of genetic drift

more efficient selection <--> less efficient selection

Gene density is related to the level of genetic drift

1. Genome-pairs with low levels of drift (i.e., more efficient selection) display a relatively narrow range of coding densities (usually 85-90%)

2. Most genome pairs displaying high levels of drift have coding densities that lie outside of the 85-90% range. This occurs due to pseudogene formation in recent pathogens (lowering coding density) and to tight gene-packing in the highly reduced genomes (increasing coding density)
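Coding density is the fraction of the genome covered by annotated genes. A minimal sketch of that calculation follows, assuming gene coordinates are 1-based inclusive (start, end) pairs on a linear coordinate system; overlapping genes are merged so that shared bases are counted once. The function names and example coordinates are hypothetical, and genes spanning the origin of a circular chromosome are not handled.

```python
def coding_density(genes, genome_size):
    """Percent of the genome covered by genes, with overlaps merged."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping gene: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / genome_size


def in_typical_range(density, low=85.0, high=90.0):
    """True if a coding density falls in the 85-90% range cited above."""
    return low <= density <= high


if __name__ == "__main__":
    # Hypothetical coordinates on a 1,000-bp toy genome.
    toy_genes = [(1, 300), (250, 600), (701, 900)]
    density = coding_density(toy_genes, 1000)
    print(f"{density:.1f}% coding, typical range: {in_typical_range(density)}")
```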

1a. Bacterial genome size is not caused by selection for replication efficiency

But don’t bacteria have small genomes so that they can replicate quickly?

Comparisons across species

Genetic Drift as the Major Determinant of Bacterial Genome Size & Complexity

Eukaryotes

Smaller Ne –>

2. The effect of drift on bacterial genomes is opposite to the pattern proposed by Lynch

Bacteria

The difference arises from the fact that bacterial genomes comprise sequences that are maintained by selection, or otherwise deleted

(Lynch & Conery 2003)

More drift More drift

Genetic Drift as the Major Determinant of Bacterial Genome Size & Complexity

1. What is the basis of the variation in bacterial genome size? ✓

2. What is the basis of the variation in genomic base composition?

Proposed that differences in base composition among species are caused by mutational biases (again, a non-adaptive process)

In bacteria, variation in genomic base composition has long been thought to be due to mutational biases, but this has not slowed the search for an adaptive basis for the observed differences

What might be a reason why some bacteria have G+C-rich genomes whereas others have A+T-rich genomes?

Thermal tolerance?? GC basepairs (with 3 H-bonds) are stronger than are AT basepairs, so high GC genomes would seem to be less prone to denaturing at higher temperatures

- - - - NOPE! - - - -


Wait, there are more.....

(Feil & Rocha. 2010. PLoS Genetics)

Why are there GC-rich bacterial genomes?

Given these mutational patterns, why are E. coli & Salmonella GC-rich?

Experimental analysis of mutations in E. coli and Salmonella (both of which are >50% G+C)

Gene-level selection on base composition: A simple experiment

An alternative plasmid with bacterial promoter reduced overall expression levels, but the correlation between the two expression systems remained high (r = 0.9) (fig. S4). A similar pattern of fluorescence variation was observed in fluorescence-activated cell sorting measurements (fig. S5). Because the encoded protein sequence was identical for all genes, we attributed fluorescence variation to differences in protein levels. This was confirmed by strong correlations between fluorescence and total GFP levels in Western blots (fig. S5) and Coomassie staining (r = 0.9, P < 10^-15).

To test the theory that E. coli translation rates and eventual protein levels depend on the concordance between codon usage and cellular tRNA abundances (10–12), we compared codon usage to fluorescence among the 154 synonymous GFP variants. Notably, neither of the two most common measures of codon bias, the CAI or the frequency of optimal codons (3), was significantly correlated with fluorescence levels (r = 0.14, P = 0.09, and r = 0.11, P = 0.16, respectively) (Fig. 2A). Moreover, some of the most highly expressed genes featured low CAI and vice versa.

Although codon adaptation near the 5′ terminus is considered particularly important for expression (12, 13), the CAI value of the first 42 bases in a GFP gene was not significantly correlated with the gene’s fluorescence intensity (r = 0.1, P = 0.2). Similarly, the number of rare codons (sites with CAI < 0.1) in a sequence was not significantly correlated with fluorescence (r = –0.02, P = 0.7), and neither was the number of pairs of consecutive rare codons (r = –0.14, P = 0.09). Although specific consecutive codon pairs have been proposed to influence translation (14, 15), the frequency of such rare pairs in a gene was not significantly correlated with its fluorescence (r = 0.07, P = 0.35) (8).

Statistical analyses of which nucleotide positions influenced gene expression (fig. S6) indicated the importance of local sequence patterns, as opposed to global codon bias. This pattern is consistent with studies of base content (16, 17), which suggest that mRNA structure may shape expression levels (18–21). Therefore, for each GFP construct, we computed the predicted minimum free energy associated with the secondary structure of its entire mRNA or specific regions of its mRNA. The folding energy of the entire mRNA was not significantly correlated with fluorescence (r = 0.16, P = 0.051), but the folding energy of the first third of the mRNA was strongly correlated: mRNAs with stronger structure produced lower fluorescence (r = 0.60, P < 10^-15). A moving window analysis identified a region, from nucleotide (nt) –4 to +37 relative to start, for which predicted folding energy explained 44% of the variation in fluorescence levels across the GFP library (r = 0.66, P < 10^-15) (Fig. 2B). The same folding energies explained 59% of fluorescence variation when constructs were expressed using a bacterial promoter (r = 0.77, P < 4 × 10^-16) (fig. S7). mRNA folding also correlated with fluorescence in a separate analysis of GFP constructs differing by single mutations (8).

The strong correlation between mRNA folding and fluorescence suggests the simple mechanistic explanation that tightly folded messages obstruct translation initiation and thereby reduce protein synthesis (22). Predicted mRNA structures for highly expressed GFPs characteristically contained many unpaired nucleotides near the start codon, whereas constructs expressed at low levels featured long hairpin loops (Fig. 2B and fig. S8), consistent with known obstructions to initiation (22). The region of strongest correlation between folding energy and expression did not overlap with the Shine-Dalgarno (SD) sequence, which suggested that SD occlusion by secondary structure (22, 23) did not play a major role in inhibiting expression, probably because our constructs contained no noncoding mutations. By contrast, the region of strongest effect overlapped

Fig. 1. Synthetic library of GFP genes with randomized codon usage. (A) Degenerate oligonucleotides were mixed and assembled by polymerase chain reaction. Fragments were then cloned, sequenced, and assembled into complete GFP genes. Red indicates third-codon positions. Degenerate symbols are as follows: D (A or G or T); H (A or C or T); N (A or C or G or T); R (A or G); and Y (C or T). (B) Example alignment illustrating sequence diversity among 15 synthetic genes. Shaded boxes indicate first and second codon positions, which are conserved across the library. (C and D) The distribution of GC3 and CAI among the 154 synthetic GFP genes (C) is representative of the diversity among the 4288 endogenous E. coli genes (D).

10 APRIL 2009 VOL 324 SCIENCE www.sciencemag.org 256


GFP constructs with synonymous mutations in 3rd positions

(Kudla G, et al. 2009. Science. 324: 255)

Selected clones of low (40-42%), medium (46-48%) and high (51-54%) G+C content

All selected genes had similar codon usage biases: CAI (E. coli) of 0.58-0.68
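Because the constructs differ only at synonymous third positions, their base composition can be summarized by overall %G+C and by GC3 (the %G+C of third codon positions). The sketch below computes both and assigns a clone to the low, medium, or high class using the ranges quoted above; the function names are illustrative, and clones falling between the ranges are left unclassified.

```python
# G+C classes (percent), as quoted on the slide.
GC_CLASSES = {"low": (40.0, 42.0), "medium": (46.0, 48.0), "high": (51.0, 54.0)}


def gc_content(seq):
    """Percent G+C of a nucleotide string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc3_content(cds):
    """Percent G+C at third codon positions of an in-frame coding sequence."""
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return gc_content(cds[2::3])


def gc_class(gc_percent):
    """Return 'low', 'medium', 'high', or None if outside all three ranges."""
    for name, (low, high) in GC_CLASSES.items():
        if low <= gc_percent <= high:
            return name
    return None


if __name__ == "__main__":
    # Two synonymous encodings of Ala-Ala that differ only at third positions.
    for cds in ("GCTGCA", "GCCGCG"):
        print(cds, f"GC = {gc_content(cds):.1f}%", f"GC3 = {gc3_content(cds):.1f}%")
```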

Due to the link between genome size and gene density in bacteria, forces that act on the base composition of individual genes will affect overall genomic base composition

Strains expressing GC-rich genes have higher growth rates!?! (timepoints are hours after induction with IPTG)

The effect is observed with either of two anonymous genes

Conclusions: Observations

• Bacterial genomes (and those of archaea & probably many eukarya) display a mutational bias towards deletions among small indels

• Bacterial genomes are usually packed with functional genes, but almost every genome contains some pseudogenes

• Bacteria with small genomes (pathogens, symbionts) derive from large-genomed ancestors; and during the transition to a host-associated lifestyle, functional redundancy and lower efficacy of selection cause accumulation of pseudogenes

• There is a strong negative association between bacterial genome size and the level of drift, such that species with small Ne have small genomes

Conclusions: Findings

• Variation in bacterial genome size, which has usually been attributed to selection for replication efficiency, is actually caused by non-adaptive processes

• Variation in the base composition of bacterial genomes, long thought to be determined by a strictly neutral mutational process, is now known to be caused by selection.
