Bastien Boussau
LBBE, CNRS, Université de Lyon
Models of gene duplication, transfer and loss to study genome evolution
Collaborators
Lyon collaborators:
• Adrián Arellano Davín
• Gergely Szöllősi (Budapest)
• Vincent Daubin
• Eric Tannier
• Thomas Bigot
• Magali Semeria
• Manolo Gouy
• Laurent Duret
• Nicolas Lartillot
Austin/Illinois collaborators:
• Siavash Mirarab
• Md. Shamsuzzoha Bayzid
• Tandy Warnow
RevBayes collaborators:
• Sebastian Hoehna • Michael Landis • Tracy Heath • Fredrik Ronquist • Brian Moore • John Huelsenbeck • …
Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete lineage sorting
• Birds
4. Two vignettes
To study genome evolution:
1. One species tree
2. Thousands of gene trees
[Figure: species tree for species A, B, C, D along a time axis, with a discrete character (a, a, b, a) and a continuous character (0.1, 0.2, 0.2, 0.4) at the tips.]
Why our current pipeline can be improved
• Gene alignments: error-prone (genes are short); point estimates
• Gene trees: based on alignments; point estimates
• Species trees: based on gene trees
[Figure: species tree for species A, B, C, D along a time axis, annotated with the events D (duplication), DL (duplication and loss), LGT (lateral gene transfer) and ILS (incomplete lineage sorting).]
• DL: Boussau et al., Genome Research 2013
• DL+T: Szöllősi et al., PNAS 2013
• ILS: Mirarab et al., Science 2014
PHYLDOG
Joint reconstruction of the species tree, gene trees, and numbers of duplications and losses
• All gene families (thousands of alignments) → PHYLDOG → rooted species tree, numbers of duplications and losses, rooted gene trees
• Probabilistic models: sequence evolution, gene family evolution
[Figure: species tree for species A, B, C, D along a time axis, with numbers of duplications D1 to D6 and losses L1 to L6 on its branches.]
Boussau et al., Genome Research 2013
PHYLDOG: a model of gene duplication and loss
Assumptions
• Genes evolve along the species tree:
  • birth events: duplications (rate of duplication)
  • death events: losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
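The birth-death assumptions above can be illustrated with a small simulation. This is a minimal sketch, not PHYLDOG's implementation: the function name `simulate_copy_number` and its parameters are hypothetical, and it only tracks how many copies of one gene family survive along a single branch of the species tree, with each copy duplicating or being lost independently.

```python
import random

def simulate_copy_number(dup_rate, loss_rate, branch_length, n_start=1, seed=None):
    """Simulate the number of gene copies at the end of one species-tree branch.

    Each copy independently duplicates at rate dup_rate and is lost at
    rate loss_rate (illustrative birth-death process, Gillespie algorithm).
    """
    rng = random.Random(seed)
    n, t = n_start, 0.0
    while n > 0:
        total_rate = n * (dup_rate + loss_rate)
        if total_rate == 0:
            break  # no event can ever happen
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > branch_length:
            break  # the branch ends before the next event
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1  # duplication: one more copy
        else:
            n -= 1  # loss: one fewer copy
    return n

if __name__ == "__main__":
    counts = [simulate_copy_number(0.5, 0.5, 1.0, seed=s) for s in range(10)]
    print(counts)
```

Because copies are independent, the total event rate is simply the number of copies times the per-copy rate, which is what makes the simulation this short.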
Study of mammalian genome evolution
• Challenging but well-studied phylogeny
• 36 mammalian genomes available in Ensembl v. 57
• About 7000 gene families
• Correction for poorly sequenced genomes
PHYLDOG finds a good species tree
Sus scrofa
Felis catus
Ornithorhynchus anatinus
Oryctolagus cuniculus
Loxodonta africana
Mus musculus
Gorilla gorilla
Dipodomys ordii
Monodelphis domestica
Vicugna pacos
Macaca mulatta
Tupaia belangeri
Procavia capensis
Spermophilus tridecemlineatus
Pongo pygmaeus
Tursiops truncatus
Microcebus murinus
Callithrix jacchus
Equus caballus
Erinaceus europaeus
Tarsius syrichta
Choloepus hoffmanni
Ochotona princeps
Cavia porcellus
Pan troglodytes
Bos taurus
Rattus norvegicus
Homo sapiens
Otolemur garnettii
Dasypus novemcinctus
Echinops telfairi
Pteropus vampyrus
Macropus eugenii
Canis familiaris
Sorex araneus
Myotis lucifugus
Laurasiatheria
Afrotheria
Xenarthra
Marsupials
Primates
Glires
Quality of the gene trees
Comparison between:
• PhyML (used for the PhylomeDB and Homolens databases)
• TreeBeST (used for the Ensembl-Compara database)
• PHYLDOG
Two approaches:
• Looking at ancestral genome sizes
• Assessing how well one can recover ancestral syntenies using reconstructed gene trees (Bérard et al., Bioinformatics 2012)
[Figure: comparison of PHYLDOG, TreeBeST and PhyML on the mammalian species tree shown above (same 36 species and clade labels); repeated axis tick at 10 000.]
PHYLDOG: better trees for better ancestral genomes
An example gene family
[Figure: gene trees reconstructed for the same gene family by TreeBeST and by PHYLDOG (scale bars 0.1 and 0.3). Tips are genes from Ornithorhynchus anatinus, Mus musculus, Cavia porcellus, Oryctolagus cuniculus, Canis familiaris, Bos taurus, Homo sapiens, Pongo pygmaeus, Equus caballus, Callithrix jacchus, Monodelphis domestica and Spermophilus tridecemlineatus.]
Boussau et al., Genome Research 2013
Recent improvements to PHYLDOG
• Easier installation using CMake or a virtual machine
• Better algorithms for gene tree inference
• Better algorithm for starting species tree
• Faster computations using the Phylogenetic Likelihood Library (PLL, A. Stamatakis group)
• Python scripts to help run the program
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
Gene transfers and the quixotic pursuit of the TOL
Doolittle WF, Science 1999
“The monistic concept of a single universal tree appears […] increasingly obsolete. […] [It is] no longer the most scientifically productive position to hold […] [It] accounts for only a minority of observations from genomes.”
Bapteste, O’Malley, Beiko, Ereshefsky, Gogarten, Franklin-Hall, Lapointe, Dupré, Dagan, Boucher, Martin, Biology Direct 2009.
exODT: a model of gene duplication, transfer, and loss
Assumptions
• Genes evolve along the species tree:
  • birth events: duplications (rate of duplication), transfers (rate of receiving a gene)
  • death events: losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
• Transfers can go through unsampled/extinct species
Better gene trees, fewer transfers
[Figure: two panels comparing the usual approach with ALE+DTL. Panel 1: RF distance to real tree. Panel 2: transfer events per family.]
Szöllősi et al., Syst. Biol. 2013b
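The "RF distance to real tree" axis refers to the Robinson-Foulds distance, which counts the bipartitions found in one tree but not in the other. The sketch below is illustrative only: the nested-tuple tree representation and the function names are assumptions, and it requires both trees to have the same, uniquely labelled leaves (so it does not handle gene trees with several copies per species unless each copy has its own label).

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    clades = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(leaves_below(child) for child in node))
        clades.add(clade)
        return clade

    all_leaves = leaves_below(tree)
    reference = min(all_leaves)  # fixed leaf used to name each split consistently
    splits = set()
    for clade in clades:
        # Represent each split by the side that does not contain the reference leaf.
        side = clade if reference not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            splits.add(side)
    return splits

def rf_distance(tree_a, tree_b):
    """Unrooted Robinson-Foulds distance: size of the symmetric difference of splits."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))

if __name__ == "__main__":
    true_tree = (("A", "B"), ("C", "D"))
    inferred = (("A", "C"), ("B", "D"))
    print(rf_distance(true_tree, inferred))
```

Naming each split by the side without the reference leaf makes the distance independent of where the trees are rooted.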
Application to real data: Cyanobacteria and Fungi
Cyanobacteria (transfers are expected)
• > 2.4 billion years old
• 40 species
• 1,200 to 4,500 protein-coding genes
• 7,410 gene families
Fungi (Dikarya) (transfers should be less frequent)
• ~ 1 billion years old
• 28 species
• 5,200 to 10,000 protein-coding genes
• 11,387 gene families
Both cases:
• fixed species tree, gene trees inferred using the Duplication, Transfer and Loss model
Szöllősi et al., under review
Comparing transfer rates
• Cyanobacteria and Fungi differ in their age: we can compare normalized numbers of events, T/(T+D)
• The Cyanobacteria and Fungi data sets differ in their number of species: we can perform rarefaction studies
Szöllősi et al., under review
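The two normalizations above can be sketched as follows. This is a simplified illustration and not the analysis from the study: the function names and the per-species event counts are hypothetical, and a real rarefaction study would presumably re-run the inference on each species subset rather than re-sum precomputed counts.

```python
import random

def transfer_fraction(n_transfers, n_duplications):
    """Normalized number of transfers, T / (T + D); None if there are no events."""
    total = n_transfers + n_duplications
    return n_transfers / total if total else None

def rarefied_fraction(events_by_species, k, n_replicates=100, seed=0):
    """Average T / (T + D) over random subsets of k species.

    events_by_species maps a species name to a (transfers, duplications) pair.
    Simplifying assumption: events can be attributed to single species.
    """
    rng = random.Random(seed)
    species = sorted(events_by_species)
    fractions = []
    for _ in range(n_replicates):
        subset = rng.sample(species, k)
        t = sum(events_by_species[s][0] for s in subset)
        d = sum(events_by_species[s][1] for s in subset)
        fraction = transfer_fraction(t, d)
        if fraction is not None:
            fractions.append(fraction)
    return sum(fractions) / len(fractions) if fractions else None

if __name__ == "__main__":
    # Made-up counts, for illustration only.
    toy_counts = {"sp1": (8, 2), "sp2": (5, 5), "sp3": (9, 1), "sp4": (6, 4)}
    print(rarefied_fraction(toy_counts, k=2))
```

Subsampling both data sets to the same number of species makes the fractions comparable despite the different sizes of the two data sets.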
Using transfers to date clades
[Figure: species tree along a time axis; the relative order of its nodes is unknown (?).]
Because we can identify gene transfers, we have information for ordering the nodes of a species tree
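One way to see why transfers carry ordering information: a transfer needs the donor and recipient lineages to coexist, so the donor branch must begin before the recipient branch ends. The sketch below turns this into "older than" constraints and looks for a consistent ordering. It is an illustration under that assumption only; the function name and the branch representation as (parent, child) pairs are hypothetical, and it requires Python 3.9+ for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter

def order_nodes(tree_edges, transfers):
    """Order species-tree nodes from oldest to youngest.

    tree_edges: list of (parent, child) node names.
    transfers: list of (donor_branch, recipient_branch), each branch being a
        (parent, child) pair of the species tree.
    Returns a list of nodes, or None if the constraints contradict each other.
    """
    older_than = {}  # node -> set of nodes that must be older

    def add_constraint(older, younger):
        older_than.setdefault(younger, set()).add(older)
        older_than.setdefault(older, set())

    for parent, child in tree_edges:
        add_constraint(parent, child)  # ancestors precede descendants
    for (donor_parent, _), (_, recipient_child) in transfers:
        # The donor branch starts before the recipient branch ends.
        add_constraint(donor_parent, recipient_child)
    try:
        return list(TopologicalSorter(older_than).static_order())
    except CycleError:
        return None

if __name__ == "__main__":
    edges = [("root", "X"), ("root", "Y"), ("X", "A"), ("X", "B"), ("Y", "C"), ("Y", "D")]
    # A transfer from branch X->A into branch root->Y implies X is older than Y.
    print(order_nodes(edges, [(("X", "A"), ("root", "Y"))]))
```

A topological sort returns one ordering compatible with all constraints; with few transfers, several orderings may be equally compatible.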
Bayesian species tree inference accounting for DTL events
• STRALE:
  • A Bayesian probabilistic method that can interpret thousands of gene trees in terms of:
    • speciation events
    • duplication events (D)
    • transfer events (T)
    • loss events (L)
  • A method able to estimate the DTL rates
  • A method able to reconstruct the species tree
  • A method able to order the nodes of the species tree
Simulation to test the species tree reconstruction
• 20 species
• 200 gene families
[Figure: simulated and inferred species trees shown side by side, each with tips numbered 0 to 19 and its own horizontal axis (0.0 to 0.7 and 0.0 to 1.25).]
Conclusion on DTL models
• The use of DTL models shows that the number of gene transfers has so far been overestimated
• DTL models can be used to study genome evolution and in particular rates of gene transfer
• DTL models can be used to date the nodes of a species phylogeny
• DTL models should provide a powerful tool to infer an accurate account of the history of life
3. A fast approach to dealing with incomplete lineage sorting
• Birds
The multispecies coalescent
Rannala and Yang, Genetics 2003
• Divergence times in the species tree
• Divergence times in the gene trees
• Effective population sizes in the species tree
Faster alternatives to the multispecies coalescent use fixed gene trees
E.g.: MP-EST (Liu, Yu and Edwards, 2010)
• Input: fixed gene trees
• Output: species tree with branch lengths in coalescent units
Has been shown to be consistent, under one notable assumption: gene trees are correct.
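To give a feel for branch lengths in coalescent units, here is the standard three-species result under the multispecies coalescent. It is not taken from the slides, and the function names are illustrative. For a species tree ((A,B),C) with one gene copy per species and an internal branch of length t coalescent units, the gene tree matches the species tree with probability 1 - (2/3)·exp(-t), and each of the two discordant topologies has probability (1/3)·exp(-t).

```python
import math

def p_concordant(t):
    """Probability that a three-taxon gene tree matches the species tree.

    t is the internal branch length in coalescent units.
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t)

def p_each_discordant(t):
    """Probability of each of the two discordant gene tree topologies."""
    return (1.0 / 3.0) * math.exp(-t)

if __name__ == "__main__":
    for t in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0):
        print(f"t = {t:3.1f}  P(concordant) = {p_concordant(t):.3f}")
```

Short internal branches therefore give nearly equal frequencies for the three topologies, which is the situation where incomplete lineage sorting matters most.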
Errors in gene trees decrease the accuracy of estimated species trees
Mirarab, Bayzid and Warnow, Syst. Biol 2014
Statistical binning also improves the estimation of the gene tree distribution
Mirarab et al., Science 2014
Improving statistical binning: weighted statistical binning
Practice: weighted binning and unweighted binning have about the same accuracy.
Theory: weighted statistical binning can be shown to be consistent; unweighted statistical binning is not.
Mirarab et al., PLoS One, accepted
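The slides do not spell out the weighting, so the following is a minimal sketch under one assumption: each bin's supergene tree is counted once per gene in the bin before being passed to a summary method, instead of once per bin. The function name and the input dictionaries are hypothetical.

```python
def weighted_supergene_trees(bins, supergene_trees):
    """Replicate each bin's supergene tree once per gene in that bin.

    bins: dict mapping a bin identifier to the list of genes it contains.
    supergene_trees: dict mapping a bin identifier to the tree (for example
        a Newick string) estimated from the concatenated genes of that bin.
    """
    weighted = []
    for bin_id, genes in bins.items():
        weighted.extend([supergene_trees[bin_id]] * len(genes))
    return weighted

if __name__ == "__main__":
    bins = {"bin1": ["g1", "g2", "g3"], "bin2": ["g4"]}
    trees = {"bin1": "((A,B),(C,D));", "bin2": "((A,C),(B,D));"}
    # Unweighted binning would pass 2 trees; the weighted version passes 4.
    print(weighted_supergene_trees(bins, trees))
```

With this weighting, the number of input trees equals the number of genes, so large bins are not under-represented in the gene tree distribution.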
4. Two vignettes
RevBayes
• R-like language
• Model-based phylogenetics
• Many models of sequence evolution
• Models for dating
• Models for phylogeography
• Models for continuous traits
• Models for gene tree/species tree inference
• http://revbayes.net
• Sebastian Hoehna • Michael Landis • Tracy Heath • Fredrik Ronquist • Nicolas Lartillot • Brian Moore • John Huelsenbeck • …
Conclusions
• We develop methods for gene tree and species tree inference
• Improvement of gene trees and species trees in the presence of:
• duplications and losses,
• transfers,
• incomplete lineage sorting
• Parallel algorithms applicable to genome-scale data
• We study the evolution of life, ancient and recent
Thanks!