The impact of whole genome duplications: insights from Paramecium tetraurelia
Genome Annotation
• Ab initio gene predictions
• Comparative approach
• 90,000 ESTs
A compact Mac genome
• Protein-coding regions: 78% of the genome
• Short intergenic regions (average = 352 bp)
• Introns: short (average = 25 bp) but numerous: 80% of genes contain introns (average = 2.9 introns / gene)
39642 annotated genes
Gene content
[Figure: bar chart of the number of genes per species for E. cuniculi, S. cerevisiae, N. crassa, D. discoideum, T. brucei, T. pseudonana, P. falciparum, P. tetraurelia, C. intestinalis, D. melanogaster, C. elegans, X. tropicalis, T. nigroviridis, H. sapiens, M. musculus, A. thaliana and O. sativa. Y axis: number of genes (0 to 45,000).]
Not due to annotation artefacts (control with cDNA data, distribution of protein length, manual curation on chrom. 1, …)
Many genes belong to multigenic families
Computing Best Reciprocal Hits (BRH) within Paramecium proteins
39 642 proteins → SW comparisons + filtering → 13 085 pairs of proteins in BRH
BRH are found in large duplicated blocks (paralogons). Example: scaffolds 1 and 8
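The BRH step can be illustrated with a minimal Python sketch. It assumes the filtered SW comparisons are available as a dictionary of directional scores. The function name `best_reciprocal_hits`, the input layout and the toy scores are hypothetical and are not taken from the original pipeline.

```python
def best_reciprocal_hits(scores):
    """Return sorted pairs (a, b) where a is b's best hit and b is a's best hit.

    scores: dict mapping (query, subject) -> alignment score (assumed layout).
    """
    best = {}  # query -> (best subject, best score)
    for (query, subject), score in scores.items():
        if query == subject:
            continue  # ignore self-hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)

    pairs = set()
    for query, (subject, _) in best.items():
        # Reciprocity: the best hit of the subject must be the query itself.
        if subject in best and best[subject][0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)


# Toy data (invented): g1 and g2 are each other's best hit; g3 and g4 are not.
toy_scores = {
    ("g1", "g2"): 900, ("g2", "g1"): 900,
    ("g1", "g3"): 300, ("g3", "g1"): 300,
    ("g3", "g4"): 250, ("g4", "g3"): 250,
}
brh_pairs = best_reciprocal_hits(toy_scores)
print(brh_pairs)
```

Here g3 prefers g1 but g1 prefers g2, so only the g1/g2 pair is reciprocal.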
Building paralogons
• Using a sliding window of size w genes
• For each window:
– Select a paralogous region if at least p % of the w genes are BRH with the sequence
• Merge overlapping windows
• Add syntenic genes which do not have BRH
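The sliding-window steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the original implementation: genes are given in order along one scaffold, BRH partners are looked up in a dictionary, and a window is kept when at least a fraction p of its w genes have their BRH on the same target scaffold. All names (`paralogous_windows`, `merge_windows`, the toy gene IDs) are hypothetical.

```python
from collections import Counter


def paralogous_windows(genes, brh, scaffold_of, w=10, p=0.61):
    """Return (target_scaffold, start, end) for each window passing the threshold.

    genes: gene IDs in order along one scaffold
    brh: dict gene -> its best reciprocal hit
    scaffold_of: dict gene -> scaffold carrying that gene
    Windows are half-open index ranges [start, end) over `genes`.
    """
    hits = []
    for start in range(len(genes) - w + 1):
        window = genes[start:start + w]
        # Count, per target scaffold, the window genes whose BRH lies there.
        targets = Counter(scaffold_of[brh[g]] for g in window if g in brh)
        for target, n in targets.items():
            if n / w >= p:
                hits.append((target, start, start + w))
    return hits


def merge_windows(hits):
    """Merge overlapping windows that point to the same target scaffold."""
    merged = []
    for target, start, end in sorted(hits):
        if merged and merged[-1][0] == target and start < merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (target, prev[1], max(prev[2], end))
        else:
            merged.append((target, start, end))
    return merged


# Toy scaffold (invented): 12 genes, the first 8 have a BRH on scaffold "B".
genes = [f"a{i}" for i in range(12)]
brh = {f"a{i}": f"b{i}" for i in range(8)}
scaffold_of = {f"b{i}": "B" for i in range(8)}
regions = merge_windows(paralogous_windows(genes, brh, scaffold_of))
print(regions)
```

In this sketch the merged interval also covers the genes without BRH that lie inside it, which loosely mirrors the last step of the list; the real procedure may treat those genes differently.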
Whole genome duplication (WGD)
Settings:
w = 10, p = 61%
Coverage:
61.3 Mb (85%)
35 503 genes (90%)
Results:
24 052 genes in 2 copies (68%)
11 451 genes in 1 copy (32%)
51% of ancestral genes are still in 2 copies
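The percentages above can be rechecked with a few lines of Python. The only assumption is that each pair of retained duplicates counts as one ancestral (pre-duplication) gene; the variable names are illustrative.

```python
# Figures taken from the slide (recent WGD).
genes_in_two_copies = 24_052
genes_in_one_copy = 11_451

covered_genes = genes_in_two_copies + genes_in_one_copy

# Assumption: two present-day copies descend from one ancestral gene.
retained_pairs = genes_in_two_copies // 2
ancestral_genes = retained_pairs + genes_in_one_copy

share_of_present_genes_duplicated = genes_in_two_copies / covered_genes
share_of_ancestral_genes_retained = retained_pairs / ancestral_genes

print(covered_genes, ancestral_genes)
print(f"{share_of_present_genes_duplicated:.1%} of covered genes are in 2 copies")
print(f"{share_of_ancestral_genes_retained:.1%} of ancestral genes are still in 2 copies")
```

The two percentages differ because the first is counted over present-day genes and the second over ancestral genes.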
Progressive loss of gene duplicates
• ~1500 recent pseudogenes (recognizable)
• Length distribution of genic and intergenic sequences: relics of more ancient pseudogenes in intergenic regions
[Figure: length distributions of single-copy genes, intergenic regions encompassing a gene loss, and other intergenic regions. X axis: sequence length (bp). Y axis: frequency (%).]
BRH from supercontig 8
A number of BRH (>3000) remain outside of paralogons
Paralogous genes
Inferring ancestral blocks
Arbitrary order
Ancestral blocks
Building paralogons with 131 ancestral blocks
Intermediary WGD
Settings:
w = 10, p = 40%
Coverage:
31,129 genes (79%)
Content before WGD: 20,578 genes
7,996 genes in 2 copies (39%)
12,582 genes in 1 copy (61%)
Old WGD
Settings:
w = 20, p = 30%
Coverage:
18,792 genes (47%)
Content before WGD: 9,999 genes
1,530 genes in 2 copies (15%)
8,469 genes in 1 copy (85%)
Gene content at each WGD
19 552 genes → Old WGD (x 1.1) → 21 172 genes → Intermediary WGD (x 1.2) → 26 214 genes → Recent WGD (x 1.5) → 39 642 genes
Overall: x 2 (not x 8)
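As a quick check of the multipliers, the sketch below recomputes the fold increases from the four gene counts. It assumes the counts are read in chronological order (before the old WGD, then after each successive WGD); the variable names are illustrative.

```python
# Gene counts from the slide, assumed to be in chronological order.
gene_counts = [19_552, 21_172, 26_214, 39_642]
wgd_names = ["Old WGD", "Intermediary WGD", "Recent WGD"]

# Net fold increase across each WGD, after subsequent gene losses.
fold_increase = {
    name: after / before
    for name, before, after in zip(wgd_names, gene_counts, gene_counts[1:])
}
overall_fold = gene_counts[-1] / gene_counts[0]

# Without any gene loss, three doublings would multiply the gene count by 2**3.
fold_without_loss = 2 ** len(wgd_names)

for name, fold in fold_increase.items():
    print(f"{name}: x {fold:.2f}")
print(f"Overall: x {overall_fold:.2f} (x {fold_without_loss} without gene loss)")
```

The gap between the observed overall factor and the factor expected without loss reflects the progressive loss of duplicates after each WGD.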
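Protein sequence similarity between duplicates (ohnologs)
[Figure: protein sequence similarity between ohnologs. Panels: Old WGD, Intermediary WGD, Recent WGD.]
Distribution of the rate of synonymous substitution (dS) between ohnologs
[Figure: distribution of dS, computed with PAML. Panels: Old WGD, Intermediary WGD, Recent WGD. Annotations: saturation; recent gene conversion.]
Distribution of dN/dS
[Figure: distribution of dN/dS for the recent WGD. X axis: dN/dS. Y axis: frequency (%).]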
• => both ohnologs are under strong negative selective pressure
• Yet … the fate of most ohnologs is to be pseudogenized!
• => gene-silencing mutations can be tolerated …
• … but deleterious mutations affecting the coding sequence of one copy are counterselected (i.e. dominant effect of mutations, despite the presence of a duplicate)
• Once a gene has been silenced (e.g. by mutation of regulatory elements), mutations can accumulate in coding regions
Gene duplicates are evolutionarily unstable
[Diagram: Gene duplication → ... Time → Pseudogene, or Ancient paralogs (selective pressure to maintain 2 copies)]
Retention of gene duplicates
• Different (non-exclusive) models have been proposed for the retention of gene duplicates:
– Robustness against mutations
– Functional changes: neo- or sub-functionalization
– Dosage constraints
• Which genes are preferentially retained after a WGD?
• How does the pattern of gene retention vary with time?
– Compare the pattern of retention after a recent WGD and a more ancient WGD
– Paramecium: 3 successive WGDs!
Mutational robustness
• Under certain conditions (high mutation rate and very large population size) redundant genes may be maintained by selection acting against double null alleles (Force et al. 1999)
• Essential genes (e.g. ribosomal proteins) are retained more often than average
• … but most of them are present in more than 2 copies !
• … their high rate of retention may be due to other factors (see later)
Functional changes
[Diagram, Force et al. (1999), over time:
– Neofunctionalization (adaptation): Function F → Function F and Function F’
– Subfunctionalization (neutral evolution): Function F1F2 → Function F1 and Function F2]
Functional changes:
– changes in gene expression pattern
– changes in the encoded protein
Prediction of the subfunctionalization model
• A gene that has been preserved by subfunctionalization at a given WGD, is less likely to be retained in two copies at a subsequent WGD (Force et al. 1999)
[Diagram: fate of a gene with subfunctions F1F2 across WGD1 and WGD2, shown for two scenarios: the duplicates stay F1F2, or they split into F1 and F2]
Test of the subfunctionalization model (1)
• Apparent contradiction with the subfunctionalization model
• Due to variations in retention rate between different functional classes?
Intermediate WGD
Retained: 47% Retained: 57%
Retention at the recent WGD ?
N=7,996 N=12,582
Test of the subfunctionalization model (2)
• A gene that has been preserved at a given WGD, is less likely to be retained in two copies at a subsequent WGD
• Difference significant (p<5%), but not very strong
• Subfunctionalization is an unlikely evolutionary pathway in species with large population sizes (Lynch 2005)
Old WGD
Intermediate WGD
Retention at the recent WGD ?
N = 343 gene families
Retained: 67% Retained: 60%
Test of the neofunctionalization model
• Analysis of gene expression (work in progress)
• Analysis of the rate of protein evolution:
Outgroup (function F)
Ohnolog 1 (function F) Ohnolog 2 (function F’)
• Relative rate test (PAML); correction for multiple tests
• Frequency of ohnologs with asymmetric substitution rates:
– Recent WGD (N = 2297): 11%
– Intermediate WGD (N = 293): 16%
• More functional redundancy among recent duplicates
• Functional changes account for retention in the long term
Fate of neofunctionalized genes at subsequent WGD
Intermediate WGD
Slow copy: 66% retained Fast copy: 26% retained
Retention at the recent WGD ?
Neofunctionalized genes are more prone to pseudogenization at subsequent WGD
N = 62
Retention for dosage constraints (1): high expression level
• Genes that have to be expressed at very high level are often present in multiple copies (e.g. histones)
• The loss of one copy is counterselected because it cannot be compensated for by the upregulation of other copies
• => More retention among highly expressed genes
Retention rates
For each WGD, the retention rate for a given gene category is:
Ratio = (proportion of genes retained in duplicate in this category) / (proportion of all genes retained in duplicate)
Ratio = 1: no specific retention above the mean value for all genes
Ratio > 1: over-retained category
Ratio < 1: under-retained category
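A minimal sketch of this ratio in Python is given below. The function name `retention_ratio` and the counts in the example are invented for illustration; they are not data from the study.

```python
def retention_ratio(retained_in_category, genes_in_category,
                    retained_overall, genes_overall):
    """Proportion retained in duplicate in a category, relative to all genes."""
    category_rate = retained_in_category / genes_in_category
    overall_rate = retained_overall / genes_overall
    return category_rate / overall_rate


def classify(ratio, tolerance=1e-9):
    """Label a category as over-retained, under-retained or average."""
    if ratio > 1 + tolerance:
        return "over-retained"
    if ratio < 1 - tolerance:
        return "under-retained"
    return "no specific retention"


# Hypothetical counts: 80 of 100 genes retained in one category,
# against 5,000 of 10,000 genes retained overall.
example_ratio = retention_ratio(80, 100, 5_000, 10_000)
print(round(example_ratio, 2), classify(example_ratio))
```

Because the ratio is normalized by the overall retention rate of the same WGD, categories can be compared across WGDs whose overall retention differs.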
Expression versus Retention
Retention for dosage constraints (2): the balance hypothesis (Papp et al. 2003)
• The relative expression levels of proteins involved in the same functional network have to be controlled to ensure the proper stoichiometry of the network
• Initially, the loss of one copy is counterselected because it creates an imbalance within the network
• In the long term, gene losses may occur because they can be compensated for by the upregulation of other copies
Testing the balance hypothesis (1): genes involved in multi-protein complexes
• Protein complexes predicted by homology with yeast:
– MIPS database (curation from the literature)
– TAP / MS data (Gavin et al. Nature 2006)
Multi-protein complexes
Genes involved in the coding of protein complexes are initially over-retained
Additive effects of Expression and Inclusion in Complex
• Proteins involved in complexes are over-retained at the recent WGD
• Does this mean that complex stoichiometry tends to be conserved ?
Constraint of stoichiometry and fate of duplicates
                    Complexes with conserved stoichiometry    p-value
Recent WGD          265 (44%)                                 2.6 x 10^-2
                    74 (68%)                                  4.3 x 10^-4
Intermediary WGD    114 (20%)                                 1.5 x 10^-3
                    43 (43%)                                  2.4 x 10^-4
Old WGD             106 (24%)                                 1.2 x 10^-5
                    26 (43%)                                  2.5 x 10^-3
Legend: the two rows per WGD correspond to MIPS complexes and to complexes from Gavin et al. Nature 2006.
[Diagram: a complex formed by proteins A and B; number of copies of A versus number of copies of B]
Testing the balance hypothesis (2): genes involved in central metabolism
Retention of central metabolism gene duplicates
Genes involved in central metabolism are initially over-retained and then under-retained (less neofunctionalization?)
Dating genome duplications
• Phylogenetic analyses of orthologous genes in other ciliate species => date WGDs relative to speciation events
[Figure: phylogenetic tree with the positions of the old, intermediate and recent WGDs marked. Species shown: Tetrahymena thermophila, P. putrinum, P. bursaria, P. polycaryum, P. nephridiatum, P. duboscqui, P. multimicronucleatum, P. caudatum, P. tetraurelia, P. pentaurelia, P. primaurelia, P. sexaurelia, P. jenningsi, P. octaurelia, P. novaurelia, P. tredecaurelia, P. quadecaurelia. A clade is labeled "Paramecium aurelia complex".]
Aurelia complex: 15 sibling species (same kind of habitat, initially thought to correspond to a single species)
How does WGD relate to speciation?
[Diagram, with the kind permission of K. Wolfe: polyploid paramecia giving rise to two lineages, Ptetra and Pprim, followed by mating and meiosis]
Dobzhansky-Muller incompatibility by reciprocal gene loss
For 1 locus, 1/4 of the offspring is inviable.
For n loci, offspring viability is (3/4)^n
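The formula can be evaluated for a few values of n to see how quickly viability drops. This is a minimal sketch that takes the slide's rule as given (each locus with reciprocal gene loss independently makes 1/4 of the offspring inviable); the function name is illustrative.

```python
def offspring_viability(n_loci):
    """Expected fraction of viable offspring, (3/4)**n, for n independent loci."""
    return 0.75 ** n_loci


for n in (1, 2, 5, 10, 20):
    print(f"n = {n:2d} loci: viability = {offspring_viability(n):.4f}")
```

With only a few tens of reciprocally lost loci, almost no offspring are viable under this model, which is why reciprocal gene loss after a WGD can lead to reproductive isolation.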
Reproductive isolation
Conclusions (1)
• At least 3 WGDs in Paramecium (probably 4)
• WGDs are rare events … that occurred recurrently in the evolution of eukaryotes (fungi, animals, plants, ciliates …)
• Major impact on the evolution of the gene repertoire
Conclusions (2)
• Dosage constraints appear as an essential force shaping the gene repertoire after WGD
• Functional changes contribute to gene retention in the long term …
• … but the fate of the vast majority of genes is to get pseudogenized
Conclusions (3)
• Relationship between the number of genes and organism complexity
– The number of genes is driven by selection …
– … and contingency (time since the last WGD)
• WGDs may be responsible for (non-adaptive) explosive radiation of species (Dobzhansky-Muller incompatibility by reciprocal gene loss)
• CNRS-UPR2167 - CGM - Gif sur Yvette
– Jean Cohen
– Linda Sperling
• CNRS-UMR8541 - ENS - Paris
– Eric Meyer
– Mireille Bétermier
• CNRS-UMR8125 - IGR - Villejuif
– Philippe Dessen
• CNRS-UMR5558 - PBIL - Lyon
– Laurent Duret
– Vincent Daubin
• Genoscope - CNRS UMR 8030
– Jean-Marc Aury
– Olivier Jaillon
– Benjamin Noel
– Betina Porcel
– Vincent Schachter
– Patrick Wincker
– Jean Weissenbach