Bioinformatics and Evolutionary Genomics
Genome Evolution (II) and Genomics Context for function prediction
(bacterial) Genome Evolution: added value of genomes
• How does gene content evolve?
• How does gene order evolve?
• How important are various evolutionary dynamics of genes on a genomic scale (e.g. gene fusion, gene loss, gene duplication): moving from anecdotes to trends
Functionally associated proteins leave evolutionary traces of their relation in genomes
Genomic context / in silico interaction prediction
Gene order evolution:
- Establish orthologous relations between pairs of genomes (e.g. Smith-Waterman (S-W) best bidirectional hit approach)
- Put them in a dotplot, color the relative direction of transcription (Green for the same relative direction, Red for the opposite direction)
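The best bidirectional hit step can be sketched as follows. This is a minimal illustration, not the pipeline used for the slides: it assumes the pairwise alignment scores (e.g. from Smith-Waterman searches) have already been computed, and the gene names, scores and the `dot_color` helper are hypothetical. Comparing the strand in genome A with the strand in genome B is one possible reading of "relative direction of transcription".

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(scores_ab, scores_ba):
    """Return gene pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(scores_ab)
    best_ba = best_hits(scores_ba)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def dot_color(strand_a, strand_b):
    """Green if the ortholog pair has the same strand in both genomes, else red."""
    return "green" if strand_a == strand_b else "red"


# Hypothetical alignment scores: genome A genes searched against genome B, and back.
scores_ab = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
scores_ba = {("b1", "a1"): 245, ("b2", "a3"): 205, ("b2", "a2"): 175}

orthologs = bidirectional_best_hits(scores_ab, scores_ba)
print(orthologs)  # [('a1', 'b1'), ('a3', 'b2')]
```

Gene a2 hits b2 best, but b2 prefers a3, so the pair (a2, b2) is not reported.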
Evolution of genome organization:
- In prokaryotes, genome inversions centered around the origin/terminus of replication are a major source of genome rearrangements.
- This suggests that both replication forks are in close contact -> comparative genome analysis provides support for a hypothesis about genome replication
“and a close proximity of the forks would increase the probability of reciprocal recombination or transposition between sequences at the two forks. That the forks are near each other is also consistent with the 'replication factory' model based on immunolocalization of components of the replication machinery in Bacillus subtilis” (Tillier and Collins, 2000. Nat. Genet.)
Gene order evolves rapidly
But …
Differential retention of divergent / convergent gene pairs suggests that conservation implies a functional association
Operons
Gene Order Evolution
Conserved gene order
• i.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene cluster
• Contributes many reliable predictions
NB1: predicting operons is not trivial; in fact conserved gene order or functional association is a major clue
NB2: using ‘only’ operons without requiring conservation results in much less reliable function prediction
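Requiring conservation can be illustrated by counting in how many genomes two orthologous groups are direct neighbours on the same strand. This is a rough sketch under stated assumptions: adjacency on the same strand stands in for "same gene cluster", a plain genome count (`min_genomes`, a hypothetical parameter) stands in for ‘sufficiently large’ evolutionary distance, and the toy genomes are invented.

```python
from collections import Counter


def neighbour_pairs(genome):
    """Unordered OG pairs that are adjacent and on the same strand in one genome."""
    pairs = set()
    for (og1, strand1), (og2, strand2) in zip(genome, genome[1:]):
        if strand1 == strand2 and og1 != og2:
            pairs.add(frozenset((og1, og2)))
    return pairs


def conserved_pairs(genomes, min_genomes=2):
    """Keep neighbour pairs that occur in at least `min_genomes` genomes."""
    counts = Counter()
    for genome in genomes.values():
        counts.update(neighbour_pairs(genome))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical genomes: genes in chromosomal order as (orthologous group, strand).
genomes = {
    "spA": [("OG1", "+"), ("OG2", "+"), ("OG3", "-"), ("OG4", "-")],
    "spB": [("OG2", "-"), ("OG1", "-"), ("OG5", "+"), ("OG3", "+")],
    "spC": [("OG1", "+"), ("OG2", "+"), ("OG4", "+"), ("OG3", "+")],
}

conserved = conserved_pairs(genomes)
for pair, n in sorted(conserved.items(), key=lambda item: -item[1]):
    print(sorted(pair), n)
```

Pairs seen in a single genome only (here OG3/OG5 and OG2/OG4) are dropped, which mirrors the point of NB2.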
Comparison to pathways: conservation implies a functional association
[Figure: number of COGs (log scale, 1 to 10000) and average metabolic distance (0 to 6) plotted against co-occurrences in operons (0 to 30)]
Conserved gene order: an example from metabolism of propionyl-CoA
[Figure labels: “query”, “target”]
Biochemical assays confirm the function of members of COG0346 as a DL-methylmalonyl-CoA racemase
Gene Fusion
• “Rare” (especially in prokaryotes): ~3000 linked COGs in STRING v6 (~180 genomes)
• But what about domain recombination?
Fusion
Gene fusion
• i.e. the orthologs of two genes in another organism are fused into one polypeptide
• A very reliable indicator for functional interaction; partly because it is a relatively infrequent evolutionary event
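The fusion criterion above can be sketched as a lookup over orthology assignments. This is a hedged illustration only: it assumes each gene has already been assigned to one or more orthologous groups (a fused gene maps to two), and the gene identifiers and group names are hypothetical.

```python
from itertools import combinations


def fusion_links(gene_to_ogs):
    """Return OG pairs that are covered by a single (fused) gene, with the evidence genes."""
    links = {}
    for gene, ogs in gene_to_ogs.items():
        for og_a, og_b in combinations(sorted(set(ogs)), 2):
            links.setdefault((og_a, og_b), []).append(gene)
    return links


# Hypothetical assignments: spX_0012 is one polypeptide covering both OG3 and OG7,
# while in species spY the two groups are encoded by separate genes.
gene_to_ogs = {
    "spX_0012": ["OG7", "OG3"],
    "spY_0450": ["OG3"],
    "spY_0451": ["OG7"],
    "spZ_0100": ["OG9"],
}

links = fusion_links(gene_to_ogs)
print(links)  # {('OG3', 'OG7'): ['spX_0012']}
```

The predicted functional link is between OG3 and OG7, and it applies to the unfused genes in spY as well.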
Gene fusion: an example
Gene Content Evolution
What about HGT?
Genome trees based on gene content:
[Figure: Escherichia coli and Haemophilus influenzae: shared genes and species-specific genes]
Genome trees based on gene content

Presence / absence matrix:
       OG1  OG2  OG3  OG4  …
sp1     1    1    0    1   …
sp2     0    1    0    0   …
sp3     0    0    1    1   …
…       …    …    …    …   …

Similarity matrix:
s (spA, spB) = (# shared OGs (spA, spB)) / (weighted average genome size (spA, spB))

s      sp1  sp2  sp3  sp4  …
sp1     1   0.2  0.4  0.2  …
sp2          1   0.9  0.1  …
sp3               1   0.3  …
sp4                    1   …
…       …    …    …    …   …

Distance matrix:
dist (spA, spB) = 1 - (# shared OGs (spA, spB)) / (weighted average genome size (spA, spB))

d
0
0.8  0
0.6  0.1  0
0.8  0.9  0.7  0

Neighbor joining
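The distance calculation can be sketched directly from a presence / absence matrix. This is an illustration under assumptions: the slide does not specify how the average genome size is weighted, so the `weight_small` parameter (weight given to the smaller genome, 0.5 being a plain mean) is hypothetical, and genome size is taken as the number of OGs present. The tree-building step (neighbor joining) is not shown.

```python
from itertools import combinations

# Presence / absence rows from the slide's example (columns OG1..OG4).
presence = {
    "sp1": (1, 1, 0, 1),
    "sp2": (0, 1, 0, 0),
    "sp3": (0, 0, 1, 1),
}


def gene_content_distance(profile_a, profile_b, weight_small=0.5):
    """dist = 1 - shared OGs / weighted average genome size."""
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    small, large = sorted((sum(profile_a), sum(profile_b)))
    # Assumed weighting scheme: a convex combination of the two genome sizes.
    weighted_size = weight_small * small + (1 - weight_small) * large
    if weighted_size == 0:
        raise ValueError("both genomes are empty")
    return 1 - shared / weighted_size


distances = {
    (sp_a, sp_b): gene_content_distance(presence[sp_a], presence[sp_b])
    for sp_a, sp_b in combinations(sorted(presence), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```

With these toy rows, sp2 and sp3 share no OGs and end up at the maximum distance of 1. The resulting pairwise distances are the input a neighbor-joining implementation would take.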
Genome trees based on gene content are remarkably similar to the consensus on the Tree of Life (ToL)
[Figure: gene content tree with branch support values (69 to 100) and a scale bar of 0.1. Taxa: C. pneumoniae, C. trachomatis, M. tuberculosis, M. pneumoniae, M. genitalium, B. subtilis, T. pallidum, T. maritima, B. burgdorferi, P. horikoshii, M. thermoautotrophicum, A. fulgidus, M. jannaschii, S. cerevisiae, C. elegans, A. aeolicus, E. coli, H. influenzae, R. prowazekii, H. pylori 26695, Synechocystis sp., H. pylori J99, A. pernix. Labelled groups: Proteobacteria, Eukarya, Euryarchaeota, Spirochaetales]
Reconstruction of Gene Content
[Figure legend: Deletion, Gain]
Ancestral Genome Reconstruction of LUCA: patchy gene distributions
Parsimony
• Attach a cost (g) to HGT / independent gain in terms of loss; find scenario with lowest cost
• At g = 1.5, 733 genes in LUCA
• At g = 2, 956 genes in LUCA
• Evolution is not parsimonious, minimal estimate?
• Why not use gene trees?
Nice results e.g. nucleotide biosynthesis
Another attempt to reconstruct the genome of LUCA
• over 1000 gene families, of which more than 90% are also functionally characterized.
• a fairly complex genome similar to those of free-living prokaryotes, with a variety of functional capabilities including metabolic transformation, information processing, membrane/transport proteins and complex regulation
Presence / absence of genes
Gene content co-evolution. (The easy case, few genomes.)
Genomes share genes for phenotypes they have in common
Differences between gene content reflect differences in phenotypic potentialities
Qualitative differential genome analysis:
- Find “pathogen-specific” proteins that can serve as drug targets
- Relate the differences between genomes to the differences in the phenotypes
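For two genomes, the differential analysis reduces to set operations on their gene (orthologous group) content. A minimal sketch, assuming orthology assignments are already available; the group names are hypothetical.

```python
def differential_genome_analysis(pathogen_genes, non_pathogen_genes):
    """Split the gene content of two genomes into shared and genome-specific sets."""
    pathogen_genes = set(pathogen_genes)
    non_pathogen_genes = set(non_pathogen_genes)
    return {
        "shared": pathogen_genes & non_pathogen_genes,
        "pathogen_specific": pathogen_genes - non_pathogen_genes,
        "non_pathogen_specific": non_pathogen_genes - pathogen_genes,
    }


# Hypothetical orthologous groups found in a pathogen and in a non-pathogenic relative.
pathogen = {"OG1", "OG2", "OG3", "OG7"}
non_pathogen = {"OG1", "OG2", "OG5"}

result = differential_genome_analysis(pathogen, non_pathogen)
print(sorted(result["pathogen_specific"]))  # ['OG3', 'OG7']
```

The pathogen-specific set is only a list of candidates; whether any of them is a usable drug target is a separate question.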
Three-way comparisons
Huynen et al., 1998, FEBS Lett
Convergence in functional classes of gene content in small intracellular bacterial parasites
Zomorodipour & Andersson FEBS Letters 1999
Although we can, qualitatively, interpret the variations in shared gene content in terms of the phenotypes of the species, quantitatively they depend on the relative phylogenetic positions of the species. The closer two species are, the larger the fraction of their genes they share.
Presence / absence of genes
L. innocua (non-pathogen)   L. monocytogenes (pathogen)
Occurrence of genes
L. innocua (non-pathogenic)   L. monocytogenes (pathogenic)
Genes involved in pathogenicity
Generalization: phylogenetic profiles / co-occurrence

          species 1  species 2  species 3  species 4  species 5  …
Gene 1:       1          0          1          1          0      1
Gene 2:       1          1          0          0          1      0
Gene 3:       0          1          0          0          1      0
....
Co-occurrence of genes across genomes
• i.e. two genes have the same presence / absence pattern over multiple genomes
• AKA phylogenetic profiles
• NB: complete genomes are needed to establish absence
• Correction for phylogenetic signal needed → events
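Comparing profiles can be sketched as counting mismatches between presence / absence vectors. The toy profiles are the Gene 1 to Gene 3 rows of the example matrix; the Hamming distance and the `max_mismatches` threshold are illustrative choices, and this sketch does not apply the correction for phylogenetic signal mentioned above.

```python
from itertools import combinations

# Presence (1) / absence (0) of each gene across six genomes.
profiles = {
    "Gene 1": (1, 0, 1, 1, 0, 1),
    "Gene 2": (1, 1, 0, 0, 1, 0),
    "Gene 3": (0, 1, 0, 0, 1, 0),
}


def profile_distance(profile_a, profile_b):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def co_occurring_pairs(profiles, max_mismatches=1):
    """Gene pairs whose profiles differ in at most `max_mismatches` genomes."""
    pairs = []
    for (gene_a, prof_a), (gene_b, prof_b) in combinations(sorted(profiles.items()), 2):
        distance = profile_distance(prof_a, prof_b)
        if distance <= max_mismatches:
            pairs.append((gene_a, gene_b, distance))
    return pairs


pairs = co_occurring_pairs(profiles)
print(pairs)  # [('Gene 2', 'Gene 3', 1)]
```

Gene 1 and Gene 3 differ in all six genomes, a perfectly complementary pattern that this simple distance treats as maximally dissimilar.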
Predicting function of a disease gene protein with unknown function, frataxin, using co-occurrence of genes across genomes
• Friedreich’s ataxia
• No (homolog with) known function
[Figure: presence / absence of genes across genomes. Genomes: A. aeolicus, Synechocystis, B. subtilis, M. genitalium, M. tuberculosis, D. radiodurans, R. prowazekii, C. crescentus, M. loti, N. meningitidis, X. fastidiosa, P. aeruginosa, Buchnera, V. cholerae, H. influenzae, P. multocida, E. coli, A. pernix, M. jannaschii, A. thaliana, S. cerevisiae, C. jejuni, C. albicans, S. pombe, H. sapiens, C. elegans, H. pylori, D. melan. Genes (bacterial / yeast names): cyaY / Yfh1, hscB / Jac1, hscA, ssq1, Nfu1, iscA / Isa1-2, fdx / Yah1, Arh1, RnaM, IscR, Hyp, iscS / Nfs1, iscU / Isu1-2, Atm1]
Frataxin has co-evolved with hscA and hscB, indicating that it plays a role in iron-sulfur cluster assembly
Iron-Sulfur (2Fe-2S) cluster in the Rieske protein
Prediction:
~Confirmation:
Functionally associated proteins leave evolutionary traces of their relation in genomes
Genomic context / in silico interaction prediction
Evolutionary rate
Chen & Dokholyan TiG 2006
Co-evolution: mirrortree
Pazos & Valencia PEDS 2001
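The core of the mirrortree idea, as commonly described, is to correlate the pairwise distance matrices of two protein families over the same set of species: similar trees give a high correlation. The sketch below is an assumption-laden illustration, not the published implementation: it takes precomputed, symmetric distance matrices with species in the same order, uses a plain Pearson correlation, and the matrices are invented.

```python
from math import sqrt


def upper_triangle(matrix):
    """Off-diagonal entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b):
    """Correlation between the inter-species distances of two protein families."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Hypothetical distance matrices over the same four species.
family_a = [[0, 0.1, 0.5, 0.6], [0.1, 0, 0.5, 0.6], [0.5, 0.5, 0, 0.2], [0.6, 0.6, 0.2, 0]]
family_b = [[0, 0.2, 1.0, 1.2], [0.2, 0, 1.0, 1.2], [1.0, 1.0, 0, 0.4], [1.2, 1.2, 0.4, 0]]
family_c = [[0, 0.6, 0.2, 0.1], [0.6, 0, 0.2, 0.1], [0.2, 0.2, 0, 0.5], [0.1, 0.1, 0.5, 0]]

score_ab = mirrortree_score(family_a, family_b)  # same tree shape, faster rate
score_ac = mirrortree_score(family_a, family_c)  # conflicting tree shape
print(round(score_ab, 3), round(score_ac, 3))
```

Because both families also share the underlying species phylogeny, raw correlations tend to be high for unrelated pairs too; handling that background signal is outside this sketch.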
[Figure: fraction of pairs on the same KEGG map (0 to 1) plotted against score (0 to 1) for Fusion, Gene Order and Co-occurrence]
Integrating genomic context scores into one single score (post-hoc)
• Compare each individual method against an independent benchmark (KEGG), and find “equivalency”
• Multiply the chances that two proteins are not interacting and subtract from 1; naive Bayesian, i.e. assuming independence
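The combination rule in the last bullet can be written as S = 1 - (1 - S_1)(1 - S_2)...(1 - S_n). A minimal sketch, assuming each method's score has already been calibrated against the benchmark so that it can be read as a probability of interaction; the example scores are hypothetical.

```python
from math import prod


def combined_score(scores):
    """Naive Bayesian combination: 1 minus the product of the 'not interacting' chances."""
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
    return 1.0 - prod(1.0 - score for score in scores)


# Hypothetical calibrated scores for one protein pair.
evidence = {"fusion": 0.6, "gene_order": 0.5, "co_occurrence": 0.2}

total = combined_score(evidence.values())
print(round(total, 2))  # 0.84
```

Each additional line of evidence can only raise the combined score, which is a direct consequence of the independence assumption.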