Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes
Siavash Mirarab (1,2), Tandy Warnow (3)
(1) University of Texas at Austin, (2) University of California San Diego, (3) University of Illinois at Urbana-Champaign
Phylogenomics
[Figure: a species tree on Orangutan, Gorilla, Chimpanzee, and Human, with sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000.]
“Gene” here refers to a portion of the genome (not a functional gene).
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome.
Gene tree discordance
[Figure: the species tree (Orangutan, Gorilla, Chimp, Human) shown next to gene trees for gene 1 through gene 1000.]
Causes of gene tree discordance include:
• Incomplete Lineage Sorting (ILS)
• Duplication and loss
• Horizontal Gene Transfer (HGT)
Incomplete Lineage Sorting (ILS)
• A random process related to having multiple versions of each gene in a population
• Omnipresent; most likely for short branches or large population sizes
• We have statistical models of ILS (multi-species coalescent)
• The species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene trees [Degnan and Salter, Int. J. Org. Evolution, 2005]
[Figure: tracing alleles through generations]
Traditional approach: concatenation
[Figure: multi-gene phylogeny reconstruction. Sequencing of samples and bioinformatic processing yield sequences for gene 1, gene 2, ..., gene 1000. Step 1: a multiple sequence alignment (MSA) for each gene. Step 2: species tree reconstruction, using either Approach 1 (concatenation: the gene alignments are joined into a supermatrix, followed by phylogeny inference) or Approach 2 (summary methods: gene tree estimation for each gene, followed by a summary method). Image credit: CC0 Public Domain, http://pixabay.com/]
Work referenced on the figure:
• PASTA [RECOMB 2014] [J. Comp. Bio. (JCB) 2015]
• ASTRAL-I and ASTRAL-II [Bioinformatics, 2014, 2015]; Plant phylogenomics (1KP) [PNAS, 2014]
• Statistical Binning (SB, WSB) [Science, 2014] [PLoS ONE, 2015]; Avian phylogenomics [Science, 2014]
• Statistically inconsistent and can even be positively misleading (proved for unpartitioned maximum likelihood) [Roch and Steel, Theo. Pop. Gen., 2014]
• Mixed accuracy in simulations [Kubatko and Degnan, Systematic Biology, 2007] [Mirarab, et al., Systematic Biology, 2014]
[Figure: schematic plot of error versus amount of data]
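To make the concatenation step concrete, here is a minimal sketch of joining per-gene alignments into a supermatrix. It assumes each gene alignment is a dictionary from taxon name to an aligned sequence, and that a taxon missing from a gene is padded with gap characters; the function name `concatenate` and the toy sequences are hypothetical and are not taken from any particular tool.

```python
def concatenate(gene_alignments):
    """Join per-gene alignments into one supermatrix.

    Each alignment maps taxon name -> aligned sequence. A taxon that is
    absent from a gene receives a run of gaps of that gene's width.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    columns = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            columns[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in columns.items()}


# Toy input: two short gene alignments with partially overlapping taxa.
genes = [
    {"Human": "ACTG", "Chimp": "ACTG", "Gorilla": "AC-G"},
    {"Human": "CTGA", "Chimp": "CTGA", "Orangutan": "CTGG"},
]
matrix = concatenate(genes)
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting supermatrix would then be passed to a single phylogeny inference run, which is the step whose statistical behavior under ILS the bullets above describe.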
Scalable ILS-based summary methods
• Summary methods can be statistically consistent
• STAR, STELLS, BUCKy-pop, MP-EST, NJst, ASTRAL [ECCB 2014], …
• The statistical consistency of summary methods assumes error-free gene trees, that is, it holds given true gene trees
Properties of quartet trees in presence of ILS
• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]
• For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006]
1. Break up each input gene tree into trees on 4 taxa (quartet trees)
2. Find all \binom{n}{4} dominant quartet topologies
3. Combine dominant quartet trees
• Alternative: weight all 3\binom{n}{4} quartet topologies by their frequency and find the optimal tree
[Figure: the three possible quartet topologies on Orangutan, Gorilla, Chimp, and Human, with example frequencies of 30%, 30%, and 40% among the gene trees; the 40% topology is the dominant one.]
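As a rough numerical illustration of the 4-species property, the sketch below uses the standard multi-species coalescent result for an unrooted four-taxon species tree: with internal branch length t in coalescent units, the topology matching the species tree has probability 1 - (2/3)e^{-t}, and each of the two alternatives has probability (1/3)e^{-t}. The function names are hypothetical and chosen only for this example.

```python
import math


def quartet_probabilities(t):
    """Probabilities of the three unrooted quartet topologies under the
    multi-species coalescent, for internal branch length t (coalescent units).
    The first entry is the topology that matches the species tree."""
    if t < 0:
        raise ValueError("branch length must be non-negative")
    alternative = math.exp(-t) / 3.0
    return (1.0 - 2.0 * alternative, alternative, alternative)


def branch_length_from_dominant(p_dominant):
    """Invert the formula: the branch length implied by a dominant frequency."""
    if not (1.0 / 3.0 <= p_dominant < 1.0):
        raise ValueError("dominant frequency must lie in [1/3, 1)")
    return -math.log(1.5 * (1.0 - p_dominant))


# A 40% / 30% / 30% split corresponds to a short internal branch.
t = branch_length_from_dominant(0.40)
print(round(t, 3), quartet_probabilities(t))
```

For any t > 0 the matching topology is strictly more probable than either alternative, which is why the dominant quartet recovers the species tree when only four species are involved.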
Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]
• Optimization Problem (suspected NP-Hard): find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

Score(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|

where Q(T) is the set of quartet trees induced by T, t is a gene tree, and \mathcal{T} is the set of all input gene trees
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
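The score can be computed by brute force on small inputs, which may help make the definition concrete. The sketch below is an illustration under stated assumptions and is not the ASTRAL implementation: trees are written as nested Python tuples of taxon names (rooted arbitrarily), all trees are assumed to share the same taxon set, and the helper names `clades`, `quartets`, and `quartet_score` are hypothetical.

```python
from itertools import combinations


def clades(tree):
    """Leaf sets below every internal node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def quartets(tree):
    """Set Q(tree) of resolved quartet topologies induced by the tree.

    A quartet ab|cd is induced when some clade contains exactly two of the
    four taxa; unresolved quartets are simply not added."""
    all_clades = clades(tree)
    taxa = max(all_clades, key=len)
    induced = set()
    for four in combinations(sorted(taxa), 4):
        chosen = frozenset(four)
        for clade in all_clades:
            inside = chosen & clade
            if len(inside) == 2:
                induced.add(frozenset([inside, chosen - inside]))
                break
    return induced


def quartet_score(species_tree, gene_trees):
    """Score(T): sum over gene trees t of |Q(T) intersect Q(t)|."""
    q_species = quartets(species_tree)
    return sum(len(q_species & quartets(g)) for g in gene_trees)


species = ((("Human", "Chimp"), "Gorilla"), "Orangutan")
discordant = ((("Human", "Gorilla"), "Chimp"), "Orangutan")
print(quartet_score(species, [species, species, discordant]))
```

This enumeration touches every one of the \binom{n}{4} quartets per tree, so it is only practical for a handful of taxa; it is meant to clarify the objective, not to scale.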
ASTRAL-I [Mirarab, et al., ECCB, 2014]
• ASTRAL solves the problem exactly using dynamic programming:
• Exponential running time (feasible for <18 species)
• Introduced a constrained version of the problem
• Draws the set of branches in the species tree from a given set X = {all bipartitions in all gene trees}
• Motivation: given a large number of gene trees, each species tree branch appears in at least one gene tree
• Theorem: the constrained version remains statistically consistent
• Running time: O(n^2 k |X|^2) for n species and k genes
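A small sketch of how the constraint set X could be collected from the input gene trees follows. It assumes trees are nested Python tuples of taxon names over a shared taxon set and keeps only non-trivial bipartitions; the names `bipartitions` and `constraint_set` are hypothetical and this is not the ASTRAL code.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a nested-tuple tree.

    Each bipartition is a frozenset holding its two sides (frozensets of taxa),
    so the same split found from either side compares equal."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    taxa = walk(tree)
    return {
        frozenset([clade, taxa - clade])
        for clade in found
        if 2 <= len(clade) <= len(taxa) - 2
    }


def constraint_set(gene_trees):
    """X: the union of the bipartitions seen in any input gene tree."""
    X = set()
    for gene_tree in gene_trees:
        X |= bipartitions(gene_tree)
    return X


g1 = ((("A", "B"), "C"), ("D", "E"))
g2 = ((("A", "C"), "B"), ("D", "E"))
X = constraint_set([g1, g2])
print(len(X))
```

Because |X| enters the running time quadratically, the number of distinct bipartitions among the gene trees, rather than the number of possible bipartitions, is what drives the cost of the constrained search.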
ASTRAL-I on biological datasets
• 1KP: 103 plant species, 400-800 genes
• Yang, et al. 96 Caryophyllales species, 1122 genes
• Dentinger, et al. 39 mushroom species, 208 genes
• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes
• Laumer, et al. 40 flatworm species, 516 genes
• Grover, et al. 8 cotton species, 52 genes
• Hosner, Braun, and Kimball. 28 quail species, 11 genes
• Simmons and Gatesy. 47 angiosperm species, 310 genes
[Slide image: first page of N. J. Wickett, S. Mirarab, et al., "Phylotranscriptomic analysis of the origin and early diversification of land plants," PNAS, 2014, doi:10.1073/pnas.1323926111. The abstract reports transcriptome data for 92 streptophyte taxa analyzed along with 11 published plant genome sequences, using up to 852 nuclear genes and 1,701,170 aligned sites, with 69 analyses spanning supermatrix, supertree, and coalescent-based approaches.]
Future datasets
• 1200 plants with ~400 genes (1KP consortium)
• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)
• 200 avian species with whole genomes (with Genome 10K, international)
• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)
• 140 insects with 1400 genes (with U. Illinois at Urbana-Champaign)
Shortcomings of ASTRAL-I
• Even the constrained version was too slow for more than about 200 species and hundreds of genes
• The constraint set X did not include true species tree branches for some challenging datasets, resulting in low accuracy in some cases
• Input gene trees could not have polytomies
ASTRAL-II
1. Faster calculation of the score function inside DP
• O(nk) instead of O(n^2 k) for n species and k genes
• Post-order traversal of input trees instead of set operations
2. Add extra bipartitions to the set X using heuristic approaches
• Resolving consensus trees by subsampling taxa
• Using quartet-based distances to find likely branches
3. Ability to take as input gene trees with polytomies
Simulation study
• Variable parameters:
• Number of species: 10 – 1000
• Number of genes: 50 – 1000
• Amount of ILS: low, medium, high
• Deep versus recent speciation
• 11 model conditions (50 replicates each) with heterogeneous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
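The FN rate defined above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: trees are nested Python tuples of taxon names on the same taxon set, branches are compared as non-trivial bipartitions, and the two five-taxon trees are hypothetical examples rather than trees from the study. The function names are also hypothetical.

```python
def bipartitions(tree):
    """Non-trivial bipartitions (internal branches) of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    taxa = walk(tree)
    return {
        frozenset([clade, taxa - clade])
        for clade in found
        if 2 <= len(clade) <= len(taxa) - 2
    }


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches that are missing from the estimate."""
    true_branches = bipartitions(true_tree)
    if not true_branches:
        return 0.0
    missing = true_branches - bipartitions(estimated_tree)
    return len(missing) / len(true_branches)


true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
print(fn_rate(true_tree, estimated))
```

When both trees are fully resolved on the same taxa, the FN rate equals the normalized Robinson-Foulds (RF) distance, which is why some figure axes in this deck are labeled RF and others FN.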
[Figure: simulation pipeline, from the true (model) species tree to true gene trees, sequence data, estimated gene trees, and the estimated species tree; the example taxa are Finch, Falcon, Owl, Eagle, and Pigeon.]
ASTRAL-II
[…] look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are \binom{u'}{2} quartet trees that put that pair together, where u' is the number of leaves outside the node u. This will examine each pair of nodes in each of the k input trees exactly once and would therefore require O(n^2 k) computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.
Given the similarity matrix, we calculate a UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.
Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating a UPGMA tree on the subsampled similarity matrix. Finally, for the first two greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.
Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.
3.3 Multi-furcating input gene trees
Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start by a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are \binom{d}{3} ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree, and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of \binom{d}{3} tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.

Fig. 1. Characteristics of the simulation. (a) True gene tree discordance: density of the RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I, shown by speciation rate (1e-06, 1e-07) and tree height (10M, 2M, 500K). Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) Gene tree estimation error: density of the RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter- and intra-replicate gene tree error distributions.
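The treatment of a polytomy as a collection of tripartitions can be illustrated with a short sketch. It assumes a multi-furcating node is described by the list of taxon sets on its d sides; the function names are hypothetical, and the code only enumerates and counts, it does not reproduce the ASTRAL-II dynamic programming.

```python
from itertools import combinations
from math import comb


def tripartitions(sides):
    """All C(d, 3) ways to pick three sides of a degree-d node.

    Each result is a tripartition of a subset of the taxa."""
    return [
        tuple(frozenset(side) for side in three)
        for three in combinations(sides, 3)
    ]


def resolved_quartets(tripartition):
    """Number of resolved quartets defined by one tripartition: two taxa
    from one side and one taxon from each of the remaining two sides."""
    a, b, c = (len(side) for side in tripartition)
    return comb(a, 2) * b * c + a * comb(b, 2) * c + a * b * comb(c, 2)


# A hypothetical polytomy of degree 4 with sides of sizes 2, 1, 1, 1.
polytomy = [{"a1", "a2"}, {"b"}, {"c"}, {"d"}]
parts = tripartitions(polytomy)
print(len(parts), [resolved_quartets(p) for p in parts])
```

The number of tripartitions grows as \binom{d}{3}, which is the cubic dependence on the node degree mentioned in the running time remark.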
Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.
Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.
4 EXPERIMENTAL SETUP
Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.
We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We […]
ASTRAL-I versus ASTRAL-II
200 species, deep ILS
[Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II + true st shows the case where the true species tree is added to the search space; this is included to approximate an ideal (e.g. exact) solution to the quartet problem. Panels: low, medium, and high ILS (tree lengths 10M, 2M, 500K) by speciation rate (1e-06, 1e-07); x-axis: number of genes (50, 200, 1000); y-axes: species tree topological error (FN) and running time (hours); methods: ASTRAL-I, ASTRAL-II, ASTRAL-II + true st.]
Tree accuracy when varying the number of species
1000 genes, “medium” levels of recent ILS
[Figure: species tree topological error (FN, axis from 4% to 16%) versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]
Running time when varying the number of species
1000 genes, “medium” levels of recent ILS
[Figure: running time in hours (axis from 0 to 20) versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]
Tree accuracy when varying the level of ILS
18
200 species, recent ILS
[Figure: species tree topological error (FN, 0% to 30%) versus tree length (10M, 2M, 500K; tree length controls the amount of ILS, with shorter trees giving more ILS: L, M, H), in panels for 1000, 200 and 50 genes, comparing ASTRAL-II, NJst and CA-ML.]
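Why shorter trees give more ILS can be illustrated with the standard three-taxon result under the multi-species coalescent: a gene tree disagrees with the species tree with probability (2/3)·exp(−T), where T is the internal branch length in coalescent units. The sketch below assumes diploid populations (T = generations / (2·Ne)). The branch lengths and the population size are hypothetical values chosen for illustration, not the simulation settings behind the slide.

```python
import math


def discordance_probability(generations, effective_population_size):
    """P(gene tree differs from the species tree) for three taxa under the MSC."""
    # Internal branch length in coalescent units, assuming diploid organisms.
    t = generations / (2.0 * effective_population_size)
    return (2.0 / 3.0) * math.exp(-t)


NE = 100_000  # hypothetical effective population size
for branch_generations in (1_000_000, 200_000, 50_000):  # hypothetical branch lengths
    p = discordance_probability(branch_generations, NE)
    print(f"{branch_generations:>9} generations: P(discordance) = {p:.3f}")
```

As the branch shrinks, the probability approaches 2/3, the point at which all three gene tree topologies are equally likely.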
Impact of gene tree error (using true gene trees)
19
• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to do better than CA-ML on the replicates with low error
[Figure: species tree topological error (0.0 to 0.4) by number of genes (50, 200, 1000), in panels for tree lengths 10M (low ILS), 2M (medium ILS) and 500K (high ILS) and rates 1e-06 and 1e-07, comparing ASTRAL-II, ASTRAL-II (true gt) and CA-ML.]
Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and varying tree shapes and number of genes.
Insights on biological data
20
• Main question: the placement of Amborella at the base of angiosperms
• Xi et al. (2014) used a collection of 310 genes sampled from 46 species.
• Conflicting results:
• Concatenation puts Amborella at the base (H1)
• MP-EST puts Amborella + water lilies at the base (H2)
• Xi et al. conclude that ILS is the cause
• ASTRAL, like many other recent studies (e.g., 1KP), recovers H1
• ILS is not necessarily the cause
more accurate than ASTRAL under lower levels of ILS is related to estimation error in the input provided to ASTRAL.
In our ASTRAL and NJst analyses, gene tree error had a positive correlation with species tree error (Supplementary Fig. S7), with correlation coefficients that were similar for ASTRAL and NJst. The error of CA-ML also correlated with gene tree error (obviously the relationship is indirect; for example, short alignments impact both CA-ML and gene tree error), but the correlation was weaker than the correlation observed for coalescent-based methods (Supplementary Fig. S8). Interestingly, the correlation between gene tree estimation error and species tree error was typically higher with fewer genes.
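The text does not say which correlation coefficient was used. A minimal sketch, assuming a Pearson coefficient computed over per-replicate error pairs; the numbers below are made-up placeholders, not values from the study:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder per-replicate errors (hypothetical, for illustration only).
gene_tree_error = [0.20, 0.28, 0.35, 0.41, 0.50]
species_tree_error = [0.03, 0.04, 0.06, 0.07, 0.11]
print(round(pearson(gene_tree_error, species_tree_error), 3))
```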
To further investigate the impact of the gene tree error, we divided replicates of each model condition into three categories: average gene tree estimation error below 0.25 is low, between 0.25 and 0.4 is medium and above 0.4 is high. We plotted the species tree accuracy within each of these categories (see Fig. 5 for one model condition, but also see Supplementary Figs S9 and S10 for other model conditions). The relative performance of ASTRAL and NJst is typically unchanged across various categories of gene tree error, but increasing gene tree error tends to increase the magnitude of the difference between ASTRAL and NJst. Furthermore, MP-EST seemed to be more sensitive to gene tree error than either NJst or ASTRAL (Supplementary Fig. S10).
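The three-way split described above can be sketched as follows. The thresholds 0.25 and 0.4 come from the text; assigning the boundary values themselves to the medium category is an assumption, and the function names and input layout are hypothetical:

```python
from collections import defaultdict
from statistics import mean


def error_category(avg_gene_tree_error):
    """Map a replicate's average gene tree estimation error to low/medium/high."""
    if avg_gene_tree_error < 0.25:
        return "low"
    if avg_gene_tree_error <= 0.4:  # assumption: boundary values count as medium
        return "medium"
    return "high"


def mean_species_error_by_category(replicates):
    """replicates: iterable of (avg gene tree error, species tree error) pairs."""
    groups = defaultdict(list)
    for gene_tree_error, species_tree_error in replicates:
        groups[error_category(gene_tree_error)].append(species_tree_error)
    # Only categories that actually contain replicates appear in the result.
    return {category: mean(values) for category, values in groups.items()}
```

Applying the same function to the results of each method gives per-category summaries that can be compared side by side, as in Fig. 5.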
The relative performance of ASTRAL and CA-ML depended on gene tree error. For those model conditions where CA-ML was generally more accurate than ASTRAL (e.g. 2M/1e-07), ASTRAL tended to outperform CA-ML on the replicates with low gene tree estimation error (Fig. 5). Consistent with this observation, we noted that ASTRAL was impacted by gene tree error more than CA-ML (Supplementary Fig. S9).
5.5 RQ5: collapsing low support branches
ASTRAL-II can handle inputs with polytomies. Although we have not done bootstrapping to get reliable measures of support, we do get local SH-like branch support from FastTree-II. We collapsed low support branches (10%, 33% and 50%) and ran ASTRAL on the resulting unresolved gene trees. We measured the impact of contracting low support branches on the RF rate: the median delta RF (error before collapsing minus error after collapsing) is typically zero (Supplementary Fig. S11), never above zero but in a few cases below zero (signifying that accuracy was improved in those few cases). However, these differences are not statistically significant (P = 0.36). Since this analysis was performed using SH-like branch support values instead of bootstrap support values (or other ways of estimating support values), further studies are needed.
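Contracting low support branches turns a resolved gene tree into one with polytomies. A minimal sketch of that step, assuming a hypothetical in-memory encoding in which a leaf is a string and an internal node is a `(support, children)` pair with support on a 0 to 1 scale; this is not the tooling used in the study:

```python
def _collapse(node, threshold, is_root):
    """Return the list of nodes that should replace `node` in its parent."""
    if isinstance(node, str):
        return [node]
    support, children = node
    new_children = []
    for child in children:
        new_children.extend(_collapse(child, threshold, is_root=False))
    if not is_root and support < threshold:
        # Contract the branch above this node: its children join the parent.
        return new_children
    return [(support, new_children)]


def collapse_low_support(tree, threshold):
    """Contract every non-root internal branch whose support is below `threshold`."""
    return _collapse(tree, threshold, is_root=True)[0]


# Hypothetical gene tree; the root carries no support value.
gene_tree = (None, [(0.95, ["A", "B"]), (0.05, ["C", (0.80, ["D", "E"])])])
print(collapse_low_support(gene_tree, 0.10))
```

With a threshold of 0.10, the branch with support 0.05 is contracted, so `C` and the `(D, E)` clade attach directly to the root.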
6 Biological results
The evolution of angiosperms, and the placement of Amborella trichopoda Baill., is one of the challenging questions in land plant evolution. One hypothesis recovered in some recent molecular studies (e.g. Drew et al., 2014; Qiu et al., 2000; Wickett et al., 2014; Zhang et al., 2012) is that A. trichopoda Baill. is sister to the rest of angiosperms, followed by water lilies (i.e. Nymphaeales). In particular, a recent analysis of 104 plant species based on entire transcriptomes recovered this relationship both with concatenation and ASTRAL-I, using various perturbations of the dataset (Wickett et al., 2014). A competing hypothesis is that Amborella is sister to water lilies, and this whole group is sister to other angiosperms (Drew et al., 2014; Goremykin et al., 2013). Xi et al. (2014) examined this question using a collection of 310 genes sampled from 42 angiosperms and 4 outgroups. They observed that CA-ML produced the first hypothesis and MP-EST produced the second hypothesis, and they argued that these differences are due to the fact that CA-ML does not model ILS, whereas MP-EST does.
We obtained alignments for these 310 genes from Xi et al. (2014) and estimated gene trees using RAxML under the GTR+Γ model with 200 replicates of bootstrapping and 10 rounds of ML (RAxML was used because running time was not an issue on this relatively small dataset). We ran MP-EST and ASTRAL and obtained two different trees (Fig. 6). Reproducing the results of Xi et al. (2014), MP-EST recovered the sister relationship of Amborella and Nymphaeales with 100% support. However, ASTRAL, just like
[Figure: species tree error (FN, 0.0 to 0.4) for the 2M/1e-07 model condition with 50, 200 and 1000 genes, split into low, medium and high gene tree estimation error, comparing ASTRAL-II, NJst and CA-ML.]
Fig. 5. Comparison of species tree accuracy with 200 taxa, divided into three categories of gene tree estimation error. Boxes show number of genes
[Figure: two species trees on the angiosperm dataset with branch support values, in panels A and B labeled Astral-II and MP-EST; the placements of Amborella and Nuphar are highlighted.]
Fig. 6. Comparison of species trees computed on the angiosperm dataset of Xi et al. (2014). MP-EST and ASTRAL-II differ in the placement of Amborella; the concatenation tree agrees with ASTRAL-II
ASTRAL-II i51
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on June 19, 2015
Summary
21
• Genome-scale data provides a wealth of information for resolving long-standing phylogenetic questions
• ASTRAL-II improves on ASTRAL-I in terms of both accuracy and running time
• ASTRAL-II can handle datasets with 1000 genes from 1000 taxa in a day of single-CPU running time
• ASTRAL dominates other summary methods; however, concatenation is better when gene trees have high error
• In the future, we need to further explore the impact of model violations, recombination, missing data, and multiple sources of gene tree discordance (e.g., HGT)
Acknowledgments
Jim Leebens-Mack (UGA)
Norman Wickett (U Chicago)
Gane Wong (U of Alberta)
Keshav Pingali
S.M. Bayzid, Nam Nguyen (now at UIUC)
Tandy Warnow
Théo Zimmermann, Bastien Boussau (Université Lyon)
Erich Jarvis (Duke, HHMI)
Tom Gilbert (U Copenhagen)
Guojie Zhang (BGI, China)
Ed Braun (U Florida)
……
HHMI international student fellowship