Statistical binning enables an accurate coalescent-based estimation of the avian tree
Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. Science (2014)
Avian whole genomes phylogenies [Jarvis, Mirarab, et al., Science, 2014]
48 representative birds
[Figure: species tree error plotted against the amount of data (i.e., number of genes); annotation: “Hope!”]
Gene tree discordance
[Figure: gene trees on Eagle, Owl, Falcon, Finch for gene 1, gene 2, ..., gene 999, gene 1000]
gene: recombination-free orthologous regions in genomes
[Figure: a gene tree shown beside the species tree on Eagle, Owl, Falcon, Finch]
Causes of gene tree discordance:
• Incomplete Lineage Sorting (ILS)
• Duplication and loss
• Horizontal Gene Transfer (HGT)
Incomplete lineage sorting:
• Modeled by the multi-species coalescent
• Highly probable for radiations (i.e., short branches), such as the bird radiation about 60 mya
• The species tree is identifiable from the gene tree distribution [Degnan and Salter, 2005]
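To make the link between short branches and discordance concrete, here is a minimal sketch of the standard three-taxon result under the multi-species coalescent: for a species tree ((A,B),C) whose internal branch has length t in coalescent units, each of the two discordant gene tree topologies has probability exp(-t)/3. The taxon names A, B, C and the function name are illustrative assumptions, not taken from the slides.

```python
import math

def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (assumed input).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Each of the two discordant topologies has probability exp(-t) / 3.
    p_discordant = math.exp(-t) / 3.0
    p_concordant = 1.0 - 2.0 * p_discordant
    return {
        "((A,B),C)": p_concordant,
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }

# Shorter internal branches give more discordance.
for t in (0.1, 0.5, 1.0, 3.0):
    probs = three_taxon_gene_tree_probs(t)
    print(f"t = {t}: P(gene tree matches species tree) = {probs['((A,B),C)']:.3f}")
```

For three taxa the matching topology is always the most probable one whenever t > 0, which is the simplest case of recovering the species tree from the gene tree distribution.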
Species tree estimation from phylogenomic data (approach 1: concatenation)
[Figure: the alignments of gene 1, gene 2, ..., gene 999, gene 1000 are concatenated into a single alignment and analyzed with ML, giving one tree on Eagle, Owl, Falcon, Finch with a branch support of 81%]
- Statistically inconsistent and can be positively misleading [Roch and Steel, Theor. Pop. Biol., 2014]
- Mixed accuracy in simulations [Kubatko and Degnan, Systematic Biology, 2007] [Mirarab, et al., Systematic Biology, 2014]
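The concatenation step itself is simple to sketch. The example below joins per-gene alignments into one supermatrix and records partition boundaries. The sequences are the first two toy alignments from the slide; assigning them to Eagle, Owl, Falcon, Finch in that order is an assumption, and the function name is hypothetical.

```python
def concatenate(gene_alignments):
    """Concatenate per-gene alignments (dict: taxon -> aligned sequence).

    Returns the supermatrix and a list of (start, end) column ranges,
    one per gene. Taxa missing from a gene are padded with gaps.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 0
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene must have equal length")
        n = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, "-" * n))
        partitions.append((start, start + n))
        start += n
    return {taxon: "".join(parts) for taxon, parts in pieces.items()}, partitions

# Toy alignments from the slide; the taxon assignment is assumed.
gene1 = {"Eagle": "ACTGCACACCG", "Owl": "ACTGC-CCCCG",
         "Falcon": "AATGC-CCCCG", "Finch": "-CTGCACACGG"}
gene2 = {"Eagle": "CTGAGCATCG", "Owl": "CTGAGC-TCG",
         "Falcon": "ATGAGC-TC-", "Finch": "CTGA-CAC-G"}

supermatrix, partitions = concatenate([gene1, gene2])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print(partitions)
```

An ML analysis would then be run on the supermatrix; that step depends on external software and is not shown here.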
Species tree estimation from phylogenomic data (approach 2: summary methods)
[Figure: a gene tree on Eagle, Owl, Falcon, Finch is estimated separately from each alignment (gene 1, gene 2, ..., gene 999, gene 1000), and a summary method combines the gene trees into one species tree with a branch support of 78%]
Can be statistically consistent (given true gene trees)
• MP-EST (maximum pseudo-likelihood) [Liu, Yu, Edwards, BMC Evol. Bio., 2010]
• BUCKy-pop., NJst, STAR, ASTRAL, …
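Summary methods work from gene tree topologies rather than from sequences. As a rough illustration only (this is not the MP-EST algorithm), the sketch below counts rooted triplet topologies across a set of gene trees, the kind of summary statistic that triplet-based methods such as MP-EST build on. The nested-tuple tree encoding, the function names, and the three example gene trees are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def triplet_counts(gene_trees, taxa):
    """Count, for every taxon triple, which pair is grouped together in each gene tree."""
    counts = Counter()
    for tree in gene_trees:
        _, tree_clades = clades(tree)
        for triple in combinations(sorted(taxa), 3):
            members = set(triple)
            for clade in tree_clades:
                pair = clade & members
                if len(pair) == 2:  # a clade holding exactly two of the three taxa
                    outgroup = (members - pair).pop()
                    counts[(frozenset(pair), outgroup)] += 1
                    break
    return counts

# Hypothetical rooted gene trees on the four birds used in the slides.
gene_trees = [
    (("Eagle", "Owl"), ("Falcon", "Finch")),
    ((("Eagle", "Owl"), "Falcon"), "Finch"),
    ((("Eagle", "Falcon"), "Owl"), "Finch"),
]
counts = triplet_counts(gene_trees, ["Eagle", "Owl", "Falcon", "Finch"])
for (pair, outgroup), n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), "|", outgroup, n)
```

With estimated rather than true gene trees, these counts inherit the estimation error, which is the problem the rest of the talk addresses.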
Gene trees on the avian dataset
14,000 “genes”: 8,000 exons, 2,500 introns, and 3,500 Ultra-Conserved Elements
[Figure: histogram of branch bootstrap support (x-axis, 0% to 100%), a measure of confidence in estimated gene tree branches, against percentage of branches (y-axis, 0 to 20%), with the median and mean marked]
14,000 noisy gene trees
[Figure: (A) avian and (B) metazoan trees from binned MP-EST and unbinned MP-EST, with branch support values and a legend entry “Conflict with other lines of strong evidence”]
Simulation studies
Error metric: percentage of branches in true tree that are missing from the estimated tree
[Figure: simulation pipeline: true (model) species tree → true gene trees → sequence data → estimated gene trees → estimated species tree; example trees on Finch, Falcon, Owl, Eagle, Pigeon]
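The error metric above (the false negative, or FN, rate) can be sketched by comparing the internal branches of the two trees as bipartitions of the taxon set. The nested-tuple encoding, the function names, and the two example topologies below are assumptions for illustration; only the taxon names come from the slide.

```python
def leaf_sets(tree):
    """Return (leaf set, list of clades) for a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without a fixed anchor taxon."""
    all_leaves, found = leaf_sets(tree)
    anchor = min(all_leaves)
    splits = set()
    for clade in found:
        side = clade if anchor not in clade else all_leaves - clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits

def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches that are missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)

# Hypothetical topologies on the five taxa named in the slide.
true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated_tree = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
print(f"FN rate: {fn_rate(true_tree, estimated_tree):.0%}")
```

Storing each bipartition as the side that excludes one fixed taxon makes the comparison independent of where the nested tuples happen to be rooted.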
Avian-like simulations (1000 genes) [Mirarab, et al., Science, 2014]
[Figure: species tree topological error (FN, y-axis, 5% to 20%) of MP-EST against gene sequence length (x-axis: 1,500, 1,000, 500, 250); shorter sequences mean more gene tree error]
MP-EST: a statistically consistent summary method
Gene tree error matters
[Ané, et al, MBE, 2007] [Patel, et al, MBE, 2013] [Gatesy, Springer, MPE, 2014] [Mirarab, et al., Systematic Biology, 2014]
[Figure: the same plot of species tree topological error (FN) against gene sequence length, now showing both MP-EST and concatenation (ML)]
Statistical binning: idea
• Concatenation has good accuracy with low levels of ILS
• Some pairs of genes are concordant (at least in topology)
• Concatenate “combinable” sets of genes into “supergenes” to increase the phylogenetic signal
• How can combinable genes be found, given that gene tree estimation is hard?
[Figure: a spectrum from summary methods (all “genes” independent) to concatenation (all “genes” put together), with binning in between]
Statistical tests of combinability
[Figure: two estimated gene trees, g1 and g2, on taxa A to G with bootstrap support on each branch (g1: 40%, 70%, 85%, 20%; g2: 65%, 25%, 90%, 70%). Branches with support below 50% are collapsed, and the two collapsed trees are compatible.]
• Restrict genes to parts that have a minimum support
• Test combinability based on the supported parts of gene trees
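The two steps above can be sketched as follows: drop branches below the support threshold, then check that every remaining branch of one tree is compatible with every remaining branch of the other. Two bipartitions are compatible when at least one of the four pairwise intersections of their sides is empty. The support values are those shown on the slide, but which branch carries which value, the topologies, and the function names are assumptions.

```python
def compatible(split_a, split_b, taxa):
    """Two bipartitions are compatible if some pair of their sides is disjoint."""
    a, b = frozenset(split_a), frozenset(split_b)
    sides_a = (a, taxa - a)
    sides_b = (b, taxa - b)
    return any(not (x & y) for x in sides_a for y in sides_b)

def combinable(tree1, tree2, taxa, threshold=50):
    """Compare only the branches whose support meets the threshold.

    Each tree is a list of (one side of the bipartition, support in percent).
    """
    taxa = frozenset(taxa)
    kept1 = [split for split, support in tree1 if support >= threshold]
    kept2 = [split for split, support in tree2 if support >= threshold]
    return all(compatible(a, b, taxa) for a in kept1 for b in kept2)

taxa = "ABCDEFG"
# Hypothetical assignment of the slide's support values to branches.
g1 = [({"A", "B"}, 70), ({"A", "B", "C"}, 40), ({"F", "G"}, 85), ({"E", "F", "G"}, 20)]
g2 = [({"A", "B"}, 65), ({"A", "B", "D"}, 25), ({"F", "G"}, 90), ({"E", "F", "G"}, 70)]

print(combinable(g1, g2, taxa, threshold=50))  # low-support conflict is ignored
print(combinable(g1, g2, taxa, threshold=0))   # every branch is compared
```

In this toy case the only conflict sits on branches with 40% and 25% support, so the two genes count as combinable at a 50% threshold but not when all branches are trusted.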
Incompatibility graph
RESEARCH ARTICLE SUMMARY, AVIAN GENOMICS (Science, 12 December 2014, Vol. 346, Issue 6215, p. 1337, sciencemag.org)
Statistical binning enables an accurate coalescent-based estimation of the avian tree
Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*

INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.

RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees.

RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.

CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.

[Figure: The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees. Panel titles: Traditional pipeline (unbinned); Statistical binning pipeline; Statistical binning technique. Labels: Sequence data; Gene alignments; Estimated gene trees; Incompatibility graph; Binned supergene alignments; Supergene trees; Species tree.]

The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: [email protected]
Cite this article as S. Mirarab et al., Science 346, 1250463 (2014). DOI: 10.1126/science.1250463
Read the full article at http://dx.doi.org/10.1126/science.1250463
[Figure: incompatibility graph; each node is a gene tree and each edge marks an incompatibility between two gene trees]
• Find independent sets: sets with no edges between any pair of nodes
• Genes in each “bin” are all pairwise compatible
• Using as few bins as possible is minimum vertex coloring (NP-hard)
• Brélaz heuristic (DSATUR)
• Modified the heuristic to produce balanced bins where possible
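A minimal sketch of the binning step, assuming the incompatibility graph is given as a list of incompatible gene pairs: the Brélaz (DSATUR) heuristic repeatedly colors the uncolored node whose neighbors already use the most distinct colors, and each color class becomes a bin. This is the plain heuristic only; the modification that balances bin sizes is not reproduced here, and the function name and toy graph are hypothetical.

```python
def dsatur_bins(n_genes, incompatible_pairs):
    """Greedy DSATUR coloring of the incompatibility graph; returns a list of bins."""
    neighbors = {g: set() for g in range(n_genes)}
    for a, b in incompatible_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    color = {}

    def priority(g):
        # Saturation: number of distinct colors among already-colored neighbors.
        saturation = len({color[n] for n in neighbors[g] if n in color})
        return (saturation, len(neighbors[g]))  # ties broken by degree

    while len(color) < n_genes:
        g = max((x for x in neighbors if x not in color), key=priority)
        used = {color[n] for n in neighbors[g] if n in color}
        c = 0
        while c in used:  # smallest color not used by a neighbor
            c += 1
        color[g] = c

    bins = {}
    for g, c in color.items():
        bins.setdefault(c, []).append(g)
    return [sorted(members) for _, members in sorted(bins.items())]

# Toy graph: genes 0, 1, 2 are mutually incompatible; genes 3 and 4 are incompatible.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
bins = dsatur_bins(5, edges)
print(bins)
```

Because every edge joins nodes of different colors, all genes within a bin are pairwise compatible, which is the property the supergene alignments rely on.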
Statistical binning: overview
Original version: unweighted [Mirarab, et al., Science, 2014]
New version: weighted [Bayzid, Mirarab, Warnow, arXiv, 2015]
[Figure: pipeline: gene sequence data (g1, g2, g3, ..., gk) → estimated initial gene trees → incompatibility graph (using a support threshold) → binned supergene alignments (partitioned) → supergene trees (weighted) → species tree (MP-EST)]
Avian-like simulation results
48 avian-like species, 1000 genes
[Figure: species tree topological error (FN, y-axis, 5% to 20%) against gene sequence length (x-axis: 1,500, 1,000, 500, 250; shorter sequences mean more gene tree error) for MP-EST, binned MP-EST, and CA-ML]
Binning also improves other measures of accuracy
• More accurate gene tree distributions
• Better species tree bootstrap support (i.e., fewer highly supported false positives)
• More accurate species tree branch lengths
Binning on the avian dataset
The binned tree was highly supported and was largely congruent with concatenation
and binning reduces this noise, which suggests that the overall impact of binning is beneficial. These results are also consistent with the observation that coalescent-based summary methods can be robust to recombination (49).

Our study explored gene tree–estimation error arising from insufficient phylogenetic signal in the gene sequences; however, gene tree–estimation error can also come from poorly estimated alignments (50) or errors introduced during the tree inference (51, 52). Because our studies focused on insufficient phylogenetic signal, we have no evidence that binning could reduce phylogenetic error due to alignment error or misspecification for the sequence evolution model. Consequently, appropriate care should be devoted to obtaining good alignments and choosing an adequate model of sequence evolution to reconstruct both gene and supergene trees.

In our simulation, we only allowed ILS as a source of discord between true gene trees and true species trees; hence, these model conditions favor MP-EST (which is based on the same model used for simulations) over concatenation (which assumes no ILS is present). Given this, the fact that unbinned MP-EST is less accurate than concatenation in many conditions is noteworthy. Future studies based on model conditions in which other sources of gene tree discord (e.g., duplication and loss, incorrect orthology assessments, recombination, introgression, horizontal gene transfer, and hybridization) are included would enable a better understanding of the relative accuracy of concatenation and coalescent-based species tree estimation and the impact of using binning under those conditions.
Fig. 5. Results on the (A) avian and (B) metazoan biological data sets using binned and unbinned MP-EST. Branches without designation represent 100% support.
RESEARCH ARTICLE
Whole-genome analyses resolve early branches in the tree of life of modern birds
Erich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7
Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11
Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6
Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14
Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2
Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19
Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20
Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22
David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28
Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31
Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33
Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35
Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6
Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6
Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42
Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46
Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4
Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4
Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49
Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52
Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56
Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54
Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63
Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67
Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71
Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†
To better determine the history of modern birds, we performed a genome-scale phylogeneticanalysis of 48 species representing all orders of Neoaves using phylogenomic methodscreated to handle genome-scale data. We recovered a highly resolved tree that confirmspreviously controversial sister or close relationships. We identified the first divergence inNeoaves, two groups we named Passerea and Columbea, representing independent lineagesof diverse and convergently evolved land and water bird species. Among Passerea, we inferthe common ancestor of core landbirds to have been an apex predator and confirm independentgains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging tosister clades. Even with whole genomes, some of the earliest branches in Neoaves provedchallenging to resolve, which was best explained by massive protein-coding sequenceconvergence and high levels of incomplete lineage sorting that occurred during a rapidradiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.
The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).
Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,
1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE
A FLOCK OF GENOMES
Downloaded from www.sciencemag.org on December 11, 2014
[Jarvis, Mirarab, et al., Science, 2014]
Summary
• Low phylogenetic signal per gene prevented accurate coalescent-based analyses of the avian dataset
• Statistical binning groups sets of genes based on statistical measures of combinability
• Statistical binning improves accuracy compared to both unbinned summary methods and concatenation
• Statistical binning enabled a coalescent-based analysis of the avian dataset; results were largely congruent with concatenation
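The grouping step in the summary above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: each gene is assumed to be summarized by the set of bipartitions in its tree that pass some support threshold, two genes are treated as combinable when none of those bipartitions conflict, and bins are formed with a simple greedy heuristic that tries to keep bin sizes balanced. The names `compatible`, `genes_conflict`, `bin_genes`, and the toy data are hypothetical.

```python
from itertools import combinations

def compatible(b1, b2, taxa):
    """Two bipartitions (each given as one side, a frozenset of taxa) are
    compatible if at least one of the four side intersections is empty."""
    sides1 = (b1, taxa - b1)
    sides2 = (b2, taxa - b2)
    return any(not (x & y) for x in sides1 for y in sides2)

def genes_conflict(g1, g2, taxa):
    """True if any well-supported bipartition of one gene conflicts with one of the other."""
    return any(not compatible(b1, b2, taxa) for b1 in g1 for b2 in g2)

def bin_genes(supported, taxa):
    """supported: dict mapping gene name -> set of well-supported bipartitions.
    Returns a list of bins (lists of gene names) with no conflicting pair in a bin."""
    genes = list(supported)
    conflicts = {g: set() for g in genes}
    for g, h in combinations(genes, 2):
        if genes_conflict(supported[g], supported[h], taxa):
            conflicts[g].add(h)
            conflicts[h].add(g)
    bins = []
    # Greedy heuristic (an assumption): place the most-conflicted genes first,
    # always into the smallest bin that has no conflict with the gene.
    for g in sorted(genes, key=lambda x: -len(conflicts[x])):
        candidates = [b for b in bins if not (conflicts[g] & set(b))]
        if candidates:
            min(candidates, key=len).append(g)
        else:
            bins.append([g])
    return bins

# Toy data: g1 and g2 support AB|CD, g3 supports AC|BD, g4 has no supported branch.
taxa = frozenset("ABCD")
supported = {
    "g1": {frozenset("AB")},
    "g2": {frozenset("AB")},
    "g3": {frozenset("AC")},
    "g4": set(),
}
bins = bin_genes(supported, taxa)
print(bins)
```

In this toy case g3 cannot share a bin with g1 or g2, while g4 carries no well-supported signal and can be placed in any bin. Each bin would then be concatenated into a supergene alignment before gene tree estimation.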
17
More generally …
• Genome-scale data provides a wealth of information
• Yet, reconstruction of species phylogenies remains challenging
  • Limited data per gene
  • Scalability to many species: ASTRAL-II (ISMB 2015)
  • Impact of model violations, missing data, etc.
  • Multiple sources of gene tree discordance
• Many interesting statistical and computational questions and a need for method development
18
Acknowledgments
Jim Leebens-Mack (UGA)
Norman Wickett (U Chicago)
Gane Wong (U of Alberta)
Keshav Pingali
S.M. Bayzid
Nam Nguyen (now at UIUC)
Tandy Warnow
Théo Zimmermann
Bastien Boussau (Université Lyon)
Erich Jarvis (Duke, HHMI)
Tom Gilbert (U Copenhagen)
Guojie Zhang (BGI, China)
Ed Braun (U Florida)
……
HHMI International Student Fellowship
[Figure: average branch bootstrap support (0% to 100%) versus sequence length (log scale, 100 to 100,000)]
Lack of phylogenetic signal
1. Limited sequence length for each gene
2. Insufficient variation in each gene
20
Increasing the number of genes
21
[Mirarab, et al., Science, 2014]
Incomplete Lineage Sorting (ILS)
• A population level process related to inheritance and maintenance of alleles
• Omnipresent; most likely for short times between speciation events and/or large population size
• We have statistical models of ILS (multi-species coalescent)
• The species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees [Degnan and Salter, Int. J. Org. Evolution, 2005]
22
[Figure: tracing alleles through generations]
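For the simplest case of three species, the multi-species coalescent gives a closed form for the gene tree distribution. With an internal branch of length T in coalescent units, the gene tree that matches the species tree has probability 1 - (2/3)e^(-T), and each of the two discordant topologies has probability (1/3)e^(-T). The sketch below evaluates this standard result. The function name, the topology labels, and the example branch lengths are illustrative choices, not taken from the slides.

```python
import math

def triplet_probabilities(t):
    """Gene tree topology probabilities for the species tree ((A,B),C)
    whose internal branch has length t in coalescent units."""
    discordant = math.exp(-t) / 3.0
    return {
        "AB|C": 1.0 - 2.0 * discordant,  # matches the species tree
        "AC|B": discordant,
        "BC|A": discordant,
    }

# Short internal branches push all three topologies toward 1/3.
for t in (0.1, 1.0, 5.0):
    probs = triplet_probabilities(t)
    print(t, {name: round(p, 3) for name, p in probs.items()})
```

As T shrinks, for example when speciation events are close together or populations are large, the three probabilities approach 1/3 each, which is the high-discordance regime described above.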
Avian-like simulation results
• Avian-like simulation; 1000 genes, 48 taxa, high levels of ILS
23
More information per gene
[Mirarab, et al., Science, 2014]
[Figure panels: MP-EST; branch length accuracy. Axis annotation: more information per gene]
Gene tree distribution error
• We can quantify gene tree distribution error using triplet frequency:
• We can compare triplet frequencies obtained from true gene trees and from the estimated gene trees (for all triplets of taxa)
24
Triplet tree (leaf order)   True distribution   Estimated distribution
A B C                       70%                 65%
B A C                       15%                 25%
C A B                       15%                 15%
Compare the true distribution with the estimated distribution.
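The comparison described on this slide can be sketched as follows. This is a minimal illustration under several assumptions: rooted gene trees are written as nested tuples of leaf names, the topology of a triplet is recorded as the pair of taxa that group together, and the two frequency tables are compared with the total variation distance averaged over triplets. The slides do not state which distance was used, so that choice, the function names, and the toy trees are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Set of leaf labels of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def triplet_topology(tree, trio):
    """Return the two taxa of `trio` that group together in `tree`,
    or None if the triplet is absent or unresolved."""
    if not trio <= leaves(tree):
        return None
    node = tree
    while not isinstance(node, str):
        counts = [len(leaves(child) & trio) for child in node]
        if max(counts) == 3:
            node = node[counts.index(3)]  # all three taxa sit below one child
        elif max(counts) == 2:
            return frozenset(leaves(node[counts.index(2)]) & trio)
        else:
            return None  # the three taxa split three ways at this node
    return None

def triplet_frequencies(trees, taxa):
    """For every triplet of taxa, the relative frequency of each resolved topology."""
    freqs = {}
    for trio in combinations(sorted(taxa), 3):
        trio_set = frozenset(trio)
        counts = Counter()
        for tree in trees:
            pair = triplet_topology(tree, trio_set)
            if pair is not None:
                counts[pair] += 1
        total = sum(counts.values())
        freqs[trio_set] = {p: n / total for p, n in counts.items()} if total else {}
    return freqs

def distribution_error(true_freqs, est_freqs):
    """Mean total variation distance between the two tables (an assumed metric)."""
    dists = []
    for trio, true_dist in true_freqs.items():
        est_dist = est_freqs.get(trio, {})
        pairs = set(true_dist) | set(est_dist)
        dists.append(0.5 * sum(abs(true_dist.get(p, 0.0) - est_dist.get(p, 0.0))
                               for p in pairs))
    return sum(dists) / len(dists)

# Toy data: 20 true and 20 estimated gene trees on three taxa.
ab, ac, bc = (("A", "B"), "C"), (("A", "C"), "B"), (("B", "C"), "A")
true_trees = [ab] * 14 + [ac] * 3 + [bc] * 3   # 70% / 15% / 15%
est_trees = [ab] * 12 + [ac] * 5 + [bc] * 3    # 60% / 25% / 15%
error = distribution_error(triplet_frequencies(true_trees, "ABC"),
                           triplet_frequencies(est_trees, "ABC"))
print(round(error, 3))
```

With more taxa, the same functions loop over all triplets, so a single number summarizes how far the estimated gene trees drift from the true gene tree distribution.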
Binning improves gene tree distribution
Supergene trees represent the true gene tree distribution much better than the estimated gene trees without binning.
[Figure: empirical cumulative distribution. Axis annotation: more information per gene]