Howard Ochman
Department of Integrative Biology
University of Texas
Bioinformatic Insights into the Evolution of Bacterial Genomes
Bacterial Genomics (before 1995)
• Bacterial chromosomes are (typically) circular, with a single replication origin
• Bacterial genomes are tightly packed with coding genes & functional elements, and have little repetitive or non-coding DNA
• Bacterial genomes are typically small, ranging in size from 0.5-10 Mb
• Bacterial coding genes have no introns, are arranged in operons, and are assorted onto both strands
• Gene order is conserved among closely related bacteria
• Base composition varies among species (25-75% GC) and is similar among closely related taxa
• Base composition is relatively homogeneous over the entire chromosome
• Rates & patterns of mutations vary with gene location & transcriptional status
Haemophilus influenzae
Mycoplasma genitalium
The field of Bacterial Genomics is often considered to have begun in
What was learned from the first full genome sequences?
1. Possible to assemble full genomes by WGS with paired-end reads
2. Enabled resolution of complete gene inventories (& encoded functions)
1. Start to observe a consistent trend in genome size & gene number
Mycoplasma genitalium (580 kb; 564 ORFs)
Mycoplasma pneumoniae (816 kb; 736 ORFs)
2. Members of the same genus can differ in genome size & contents
What is the source of such differences in genome contents?
Loss of ancestral genes? Generation of new genes?
Difficult to resolve without knowledge of ancestral (or outgroup) genome
Genes acquired by lateral gene transfer (LGT)
-- Genes with atypical features (e.g., G+C content) are considered to arise by LGT
-- Strain- & species-specific genes often have atypical base compositions
Sequenced genes present in Salmonella, but not in E. coli (circa 1990)
Gene       Potential Function           Map position   %G+C   CAI
cbi        Cobinamide synthesis         41             59.3   .233
fljA       Flagellar synthesis          56             40.9   .210
fljB       Flagellar synthesis          56             52.3   .216
inv/spa    Host recognition/invasion    59             45.5   .261
nanH       Sialidase                    20.5           40.9   .263
ORF        Unknown                      98             38.2   .296
pagC       Envelope protein             25             43.4   .274
phoN       Phosphatase                  96             46.5   .248
rfc        LPS synthesis                31             33.5   .175
sinR       Transcriptional control      7              39.8   .218
tctABCD    Tricarboxylate transport     57             55.0   .278
pgtE       Phosphoglycerate transport   –              45.3   .277
Salmonella genome is 52% G+C
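The comparison behind this table can be sketched in a few lines: take each gene's %G+C from the table and flag those that differ from the 52% genome average. The 5-point threshold and the function name `atypical_genes` are illustrative assumptions, not values or methods from the slides.

```python
# %G+C values copied from the table above; genome average from the slide.
SALMONELLA_GENOME_GC = 52.0

gene_gc = {
    "cbi": 59.3, "fljA": 40.9, "fljB": 52.3, "inv/spa": 45.5,
    "nanH": 40.9, "ORF": 38.2, "pagC": 43.4, "phoN": 46.5,
    "rfc": 33.5, "sinR": 39.8, "tctABCD": 55.0, "pgtE": 45.3,
}

def atypical_genes(gc_by_gene, genome_gc, threshold=5.0):
    """Return {gene: deviation} for genes whose %G+C differs from the
    genome average by more than `threshold` percentage points.
    The threshold is an arbitrary, illustrative cutoff."""
    return {
        gene: round(gc - genome_gc, 1)
        for gene, gc in gc_by_gene.items()
        if abs(gc - genome_gc) > threshold
    }

flagged = atypical_genes(gene_gc, SALMONELLA_GENOME_GC)
for gene, deviation in sorted(flagged.items(), key=lambda kv: kv[1]):
    print(f"{gene:8s} {deviation:+.1f}")
```

With this cutoff, only fljB and tctABCD fall within 5 points of the genome average; most of the Salmonella-specific genes are A+T-rich relative to the chromosome.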
Differences in base composition among species are caused by mutational biases
Neisseria meningitidis, 52% G+C (from Tettelin et al. 2000. Science)
Figure axis: % G+C
Inferring the Incidence of Lateral Gene Transfer (LGT) from Sequence Heterogeneity along Bacterial Chromosomes
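One simple way to look for sequence heterogeneity along a chromosome is a sliding-window %G+C scan. The sketch below is a minimal illustration under stated assumptions: the window size, step, deviation cutoff, and function names are hypothetical, and the input is a toy sequence rather than a real genome.

```python
def gc_percent(seq):
    """Percent G+C of a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

def sliding_gc(seq, window, step):
    """Return (start, %G+C) for every full window along the sequence."""
    return [
        (start, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

def atypical_windows(seq, window, step, max_dev):
    """Windows whose %G+C differs from the whole-sequence value by more
    than `max_dev` percentage points (an illustrative cutoff)."""
    overall = gc_percent(seq)
    return [(s, gc) for s, gc in sliding_gc(seq, window, step)
            if abs(gc - overall) > max_dev]

# Toy chromosome: a GC-rich backbone with one AT-rich insert.
toy = "GC" * 20 + "AT" * 10 + "GC" * 20
print(atypical_windows(toy, window=20, step=20, max_dev=30.0))
```

Real analyses typically combine base composition with other signals (for example codon usage), since composition alone can miss transfers between genomes of similar G+C content.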
(Mostly) Free-living bacteria
(Mostly) Pathogens
A relationship between bacterial lifestyle and genome size, but with some exceptions (M. tuberculosis, 4.4 Mb; Aquifex, 1.6 Mb)
Remarkable relationship between genome size & gene number across >10-fold range in genome size in organisms representing eight phyla
Intracellular pathogens & obligate endosymbionts
Mycobacterium leprae
Then why haven’t pseudogenes accumulated in all of the other sequenced bacterial genomes?
Since mutations occur as an on-going process & pseudogenes are continually being generated, what about all those other (big free-living & small symbiont) genomes that fall right on the diagonal?
The genomes of other recent pathogens of humans possess high numbers of inactivated genes (e.g., Shigella has 400; Yersinia pestis has 150)
Shigella flexneri
Yersinia pestis
Mycobacterium leprae
Unlike most eukaryotes (in which noncoding DNA accumulates), bacteria show a strong mutational bias towards deletions of all sizes
When comparing pseudogenes to their functional counterparts, deletions outnumber insertions
In bacteria, non-functional regions are removed by a pervasive mutational bias towards deletions
This deletional bias was detected computationally and confirmed experimentally
Lineages of bacteria with small genomes derive from ancestors with larger genomes
< 2 Mb
2–4 Mb
> 4 Mb
The progression towards a compact genome
1. (Ancestral) free-living (large-genomed) bacterium moves into a (nutrient-rich) host
2. Host association renders many genes superfluous/useless & dead genes accumulate
3. Deletional bias reduces genome size and removes inactivated genes
Sampling multiple strains within species reveals wide variation in genome size (concept of a “core” vs. “pan” genome)
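The core vs. pan distinction reduces to set operations: the core genome is the intersection of the strains' gene sets, and the pan genome is their union. The strain names and gene labels below are hypothetical placeholders chosen only to illustrate the idea.

```python
def core_and_pan(gene_sets):
    """Given {strain: set_of_genes}, return (core, pan):
    core = genes present in every strain, pan = genes present in any strain."""
    sets = list(gene_sets.values())
    if not sets:
        return set(), set()
    core = set.intersection(*sets)
    pan = set.union(*sets)
    return core, pan

# Hypothetical gene inventories for three strains of one species.
strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}

core, pan = core_and_pan(strains)
print(len(core), "core genes;", len(pan), "pan genes")
```

In practice the hard part is deciding which genes in different strains count as "the same" (ortholog clustering), which this sketch assumes has already been done.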
1. Very large sampling of several phyla at various taxonomic levels
2. Minimal size of bacterial genome is ~500 kb (pathogens & symbionts)
A new lower limit to the size of a cellular genome
The symbiont of a psyllid collected on a hackberry tree in Tucson, Arizona
Carsonella ruddii: the smallest bacterial genome
159,662 bp; 182 ORFs; 16.5% GC; 98% coding
The next-smallest cellular genome is 420 kb, in the aphid symbiont Buchnera
(Nakabachi et al., 2006)
…with the current record of 112 kb held by Nasuia deltocephalinicola
A thousand bacterial genomes sequenced
What has the sequencing of 1000s of genomes told us about factors controlling the size & composition of bacterial genomes?
• Bacterial genomes are typically small, ranging in size from 0.1–13 Mb
• Bacterial genomes are USUALLY tightly packed with coding genes, and USUALLY have little repetitive & non-coding sequence (e.g., pseudogenes). Therefore, in bacteria, genome size is tightly linked to gene number
• There is a pervasive mutational bias that removes non-functional regions
• Small bacterial genomes derive from lineages with large genomes
• Base composition varies among species (13-80% GC) and is relatively homogeneous over the entire chromosome
• Many bacterial genomes harbor substantial amounts of laterally acquired DNA
• Strains within a named bacterial species can vary greatly in genome size
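The link between genome size and gene number can be checked with simple arithmetic on the genome sizes and ORF counts quoted in these slides. The helper name `orfs_per_kb` is an illustrative assumption; the numbers are the ones given above for the two Mycoplasma genomes and Carsonella ruddii.

```python
# (genome size in kb, number of ORFs), as quoted in the slides.
genomes = {
    "Mycoplasma genitalium": (580.0, 564),
    "Mycoplasma pneumoniae": (816.0, 736),
    "Carsonella ruddii": (159.662, 182),
}

def orfs_per_kb(size_kb, n_orfs):
    """Gene density expressed as ORFs per kilobase of genome."""
    return n_orfs / size_kb

density = {name: round(orfs_per_kb(size, orfs), 2)
           for name, (size, orfs) in genomes.items()}

for name, value in density.items():
    print(f"{name:24s} {value:.2f} ORFs per kb")
```

All three genomes come out close to one ORF per kb, which is the pattern the "diagonal" in the genome size vs. gene number plot reflects.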
But what are the major determinants of bacterial genome size & complexity?
The (New) Observations:
But assessing the level of drift affecting bacteria is problematic:
1. In Bacteria, it is very (very) difficult to estimate Ne
2. Methods based on levels of intraspecific polymorphisms are confounded by problems in defining bacterial species
Due to the link between genome size and gene density in bacteria, evolutionary forces that act on individual genes will affect genome size
No Benefit (decay)
Essential (preserved)
-- Fate of Bacterial Genes --
genes whose presence is affected by drift
The effects of genetic drift (changes in gene frequencies that take place strictly by chance) are more dramatic in species with small effective population sizes (Ne) & will shape the contents of genomes by affecting the fixation of deleterious mutations
First test if Bacterial Genome Size is Adaptive (i.e., shaped by natural selection)
or Non-Adaptive (i.e., shaped by genetic drift)
Using Ka/Ks values as a proxy for level of drift
1. Because point mutations that cause amino acid replacements are often deleterious, the rate of nonsynonymous changes (Ka) is expected to be less than the rate of synonymous substitutions (Ks) in functional genes.
2. An increased level of drift, produced from reduced Ne (+ genome-wide relaxation of selection), will result in an increased incidence of slightly deleterious mutations and increase Ka/Ks genome-wide.
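The distinction between synonymous and nonsynonymous changes can be made concrete by comparing two aligned coding sequences codon by codon. The sketch below only counts differing codons in each class; it is a crude illustration, not a Ka/Ks estimator, because real methods also normalize by the number of synonymous and nonsynonymous sites and correct for multiple substitutions. It assumes ungapped, in-frame sequences containing only A, C, G, T; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def count_codon_changes(seq1, seq2):
    """Return (synonymous, nonsynonymous) counts of differing codons
    between two aligned, ungapped, in-frame coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: nonsynonymous change
    return syn, nonsyn

# Toy pair: GCT->GCC keeps Ala (synonymous); AAA->AGA changes Lys to Arg.
print(count_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```

For genome-wide Ka/Ks as used in these slides, an established implementation of a site-normalized method would be the appropriate tool.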
Examined genome-wide Ka/Ks in species pairs from eight phyla
1. Same significant relationship is observed when considering all genes shared by a genome-pair or only those genes common to the majority of genomes
2. Similarly significant relationship is observed when considering only pairs with low, or only those with intermediate, levels of sequence divergence
3. Those with high genome-wide Ka/Ks have a lifestyle that reduces Ne (obligate endosymbionts, vector-borne pathogens, extremophiles)
Genome size exhibits a strong negative correlation with the level of genetic drift
more efficient selection less efficient selection
Gene density is related to the level of genetic drift
1. Genome-pairs with low levels of drift (i.e., more efficient selection) display a relatively narrow range of coding densities (usually 85-90%)
2. Most genome pairs displaying high levels of drift have coding densities that lie outside of the 85-90% range. This occurs due to pseudogene formation in recent pathogens (lowering coding density) and to tight gene-packing in the highly reduced genomes (increasing coding density)
1a. Bacterial genome size is not caused by selection for replication efficiency
But don’t bacteria have small genomes so that they can replicate quickly?
Comparisons across species
Genetic Drift as the Major Determinant of Bacterial Genome Size & Complexity
2. The effect of drift on bacterial genomes is opposite to the pattern proposed by Lynch (Lynch & Conery 2003)
The difference arises from the fact that bacterial genomes comprise sequences that are maintained by selection, or otherwise deleted
Figure panels: Eukaryotes; Bacteria (axis: Smaller Ne –> More drift)
1. What is the basis of the variation in bacterial genome size? ✓
2. What is the basis of the variation in genomic base composition?
Proposed that differences in base composition among species are caused by mutational biases (again, a non-adaptive process)
In bacteria, variation in genomic base composition has long been thought to be due to mutational biases
but this has not slowed the search for an adaptive basis for the observed differences
What might be a reason why some bacteria have G+C-rich genomes whereas others have A+T-rich genomes?
Thermal tolerance?? GC basepairs (with 3 H-bonds) are stronger than are AT basepairs, so high GC genomes would seem to be less prone to denaturing at higher temperatures
- - - - NOPE! - - - -
Wait, there are more.....
(Feil & Rocha. 2010. PLoS Genetics)
Why are there GC-rich bacterial genomes?
Given these mutational patterns, why are E. coli & Salmonella GC-rich?
Experimental analysis of mutations in E. coli and Salmonella (both of which are >50% G+C)
Gene-level selection on base composition: A simple experiment
An alternative plasmid with bacterial promoter reduced overall expression levels, but the correlation between the two expression systems remained high (r = 0.9) (fig. S4). A similar pattern of fluorescence variation was observed in fluorescence-activated cell sorting measurements (fig. S5). Because the encoded protein sequence was identical for all genes, we attributed fluorescence variation to differences in protein levels. This was confirmed by strong correlations between fluorescence and total GFP levels in Western blots (fig. S5) and Coomassie staining (r = 0.9, P < 10^-15).
To test the theory that E. coli translation rates and eventual protein levels depend on the concordance between codon usage and cellular tRNA abundances (10–12), we compared codon usage to fluorescence among the 154 synonymous GFP variants. Notably, neither of the two most common measures of codon bias, the CAI or the frequency of optimal codons (3), was significantly correlated with fluorescence levels (r = 0.14, P = 0.09, and r = 0.11, P = 0.16, respectively) (Fig. 2A). Moreover, some of the most highly expressed genes featured low CAI and vice versa.
Although codon adaptation near the 5′ terminus is considered particularly important for expression (12, 13), the CAI value of the first 42 bases in a GFP gene was not significantly correlated with the gene’s fluorescence intensity (r = 0.1, P = 0.2). Similarly, the number of rare codons (sites with CAI < 0.1) in a sequence was not significantly correlated with fluorescence (r = –0.02, P = 0.7), and neither was the number of pairs of consecutive rare codons (r = –0.14, P = 0.09). Although specific consecutive codon pairs have been proposed to influence translation (14, 15), the frequency of such rare pairs in a gene was not significantly correlated with its fluorescence (r = 0.07, P = 0.35) (8).
Statistical analyses of which nucleotide positions influenced gene expression (fig. S6) indicated the importance of local sequence patterns, as opposed to global codon bias. This pattern is consistent with studies of base content (16, 17), which suggest that mRNA structure may shape expression levels (18–21). Therefore, for each GFP construct, we computed the predicted minimum free energy associated with the secondary structure of its entire mRNA or specific regions of its mRNA. The folding energy of the entire mRNA was not significantly correlated with fluorescence (r = 0.16, P = 0.051), but the folding energy of the first third of the mRNA was strongly correlated: mRNAs with stronger structure produced lower fluorescence (r = 0.60, P < 10^-15). A moving window analysis identified a region, from nucleotide (nt) –4 to +37 relative to start, for which predicted folding energy explained 44% of the variation in fluorescence levels across the GFP library (r = 0.66, P < 10^-15) (Fig. 2B). The same folding energies explained 59% of fluorescence variation when constructs were expressed using a bacterial promoter (r = 0.77, P < 4 × 10^-16) (fig. S7). mRNA folding also correlated with fluorescence in a separate analysis of GFP constructs differing by single mutations (8).
The strong correlation between mRNA folding and fluorescence suggests the simple mechanistic explanation that tightly folded messages obstruct translation initiation and thereby reduce protein synthesis (22). Predicted mRNA structures for highly expressed GFPs characteristically contained many unpaired nucleotides near the start codon, whereas constructs expressed at low levels featured long hairpin loops (Fig. 2B and fig. S8), consistent with known obstructions to initiation (22). The region of strongest correlation between folding energy and expression did not overlap with the Shine-Dalgarno (SD) sequence, which suggested that SD occlusion by secondary structure (22, 23) did not play a major role in inhibiting expression, probably because our constructs contained no noncoding mutations. By contrast, the region of strongest effect overlapped
Fig. 1. Synthetic library of GFP genes with randomized codon usage. (A) Degenerate oligonucleotides were mixed and assembled by polymerase chain reaction. Fragments were then cloned, sequenced, and assembled into complete GFP genes. Red indicates third-codon positions. Degenerate symbols are as follows: D (A or G or T); H (A or C or T); N (A or C or G or T); R (A or G); and Y (C or T). (B) Example alignment illustrating sequence diversity among 15 synthetic genes. Shaded boxes indicate first and second codon positions, which are conserved across the library. (C and D) The distribution of GC3 and CAI among the 154 synthetic GFP genes (C) is representative of the diversity among the 4288 endogenous E. coli genes (D).
Science, vol. 324, 10 April 2009, p. 256 (Reports), www.sciencemag.org
GFP constructs with synonymous mutations in 3rd positions
(Kudla G, et al. 2009. Science. 324: 255)
Selected clones of low (40-42%), medium (46-48%) and high (51-54%) G+C content
All selected genes had similar codon usage biases: CAI (E. coli) of 0.58-0.68
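Because the constructs differ only by synonymous changes at third codon positions, their overall G+C content is driven by GC3 (G+C at third positions). A minimal sketch of both measures follows; the function names and the two toy sequences are illustrative assumptions and are not taken from the Kudla et al. library.

```python
def gc_percent(seq):
    """Percent G+C over the whole sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

def gc3_percent(cds):
    """Percent G+C at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be a non-empty multiple of 3")
    third_positions = cds[2::3]  # every third base, starting at codon position 3
    return gc_percent(third_positions)

# Two toy sequences that both encode Met-Ala-Lys-Gly but differ at third positions.
at_rich_variant = "ATGGCTAAAGGT"
gc_rich_variant = "ATGGCCAAGGGC"

for name, seq in [("AT-rich", at_rich_variant), ("GC-rich", gc_rich_variant)]:
    print(f"{name}: GC3 = {gc3_percent(seq):.0f}%, overall G+C = {gc_percent(seq):.0f}%")
```

The same protein can therefore be encoded by genes spanning a range of G+C contents, which is what makes the low, medium, and high G+C clones comparable.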
Due to the link between genome size and gene density in bacteria, forces that act on the base composition of individual genes will affect overall genomic base composition
Strains expressing GC-rich genes have higher growth rates!?! (timepoints are hours after induction with IPTG)
The effect is observed with either of two anonymous genes
Conclusions: Observations
• Bacterial genomes (and those of archaea & probably many eukarya) display a mutational bias towards deletions among small indels
• Bacterial genomes are usually packed with functional genes, but almost every genome contains some pseudogenes
• Bacteria with small genomes (pathogens, symbionts) derive from large-genomed ancestors; and during the transition to a host-associated lifestyle, functional redundancy and lower efficacy of selection cause accumulation of pseudogenes
• There is a strong negative association between bacterial genome size and the level of drift, such that species with small Ne have small genomes
Conclusions: Findings
• Variation in bacterial genome size, which has usually been attributed to selection for replication efficiency, is actually caused by non-adaptive processes
• Variation in the base composition of bacterial genomes, long thought to be determined by a strictly neutral mutational process, is now known to be caused by selection.