Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes

Siavash Mirarab (1,2), Tandy Warnow (3)

(1) University of Texas at Austin, (2) University of California San Diego, (3) University of Illinois at Urbana-Champaign

Phylogenomics

[Figure: a species tree on Orangutan, Gorilla, Chimpanzee, and Human, shown with aligned sequences for gene 1, gene 2, ..., gene 999, gene 1000.]

“gene” here refers to a portion of the genome (not a functional gene)

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Gene tree discordance

[Figure: the species tree on Orangutan, Gorilla, Chimpanzee, and Human, shown next to the gene trees for gene 1 and gene 1000, whose topologies differ.]

Causes of gene tree discordance include:

• Incomplete Lineage Sorting (ILS)

• Duplication and loss

• Horizontal Gene Transfer (HGT)

Incomplete Lineage Sorting (ILS)

• A random process related to having multiple versions of each gene in a population

[Figure: tracing alleles through generations]

• Omnipresent; most likely for short branches or large population sizes

• We have statistical models of ILS (multi-species coalescent)

• The species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene trees [Degnan and Salter, Int. J. Org. Evolution, 2005]

Multi-gene phylogeny reconstruction

[Figure: pipeline from samples through sequencing and bioinformatic processing to genes 1, 2, ..., 1000. Step 1: a multiple sequence alignment (MSA) for each gene. Step 2: species tree reconstruction on Orangutan, Gorilla, Chimpanzee, and Human, by one of two approaches. Approach 1, concatenation: phylogeny inference on the supermatrix of all gene alignments. Approach 2, summary methods: gene tree estimation for each gene, followed by a summary method. Image credit: CC0 Public Domain, http://pixabay.com/]

Methods annotated on the pipeline:

• PASTA [RECOMB 2014] [J. Comp. Bio. (JCB) 2015]

• ASTRAL-I and ASTRAL-II [Bioinformatics, 2014, 2015]; plant phylogenomics (1KP) [PNAS, 2014]

• Statistical Binning (SB, WSB) [Science, 2014] [PLoS ONE, 2015]; avian phylogenomics [Science, 2014]

Traditional approach: concatenation

• Statistically inconsistent and can even be positively misleading (proved for unpartitioned maximum likelihood) [Roch and Steel, Theo. Pop. Gen., 2014]

• Mixed accuracy in simulations [Kubatko and Degnan, Systematic Biology, 2007] [Mirarab, et al., Systematic Biology, 2014]

[Figure: sketch of error versus amount of data]

Scalable ILS-based summary methods

• Summary methods can be statistically consistent given true (error-free) gene trees

• STAR, STELLS, BUCKy-pop, MP-EST, NJst, ASTRAL [ECCB 2014], …

[Figure: the summary-method pipeline (per-gene MSA, gene tree estimation, summary method), with a sketch of error versus amount of data]

Properties of quartet trees in presence of ILS

• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]

[Figure: the three possible quartet topologies on Orangutan, Gorilla, Chimpanzee, and Human, with example frequencies p1 = 30%, p2 = 30%, p3 = 40%; the most frequent topology is marked as dominant.]
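The dominance property for 4 species can be illustrated numerically. The sketch below assumes the standard multi-species coalescent result for an unrooted four-taxon species tree whose internal branch has length t in coalescent units: the matching quartet topology has probability 1 - (2/3)e^(-t), and each of the two alternatives has probability (1/3)e^(-t). The function name and the sample branch lengths are illustrative, not taken from the slides.

```python
import math

def quartet_probabilities(t):
    """Return (p_match, p_alt, p_alt) for an internal branch of length t
    (coalescent units) under the multi-species coalescent."""
    p_alt = math.exp(-t) / 3.0
    return 1.0 - 2.0 * p_alt, p_alt, p_alt

# Short branches give frequencies close to 1/3 each; longer branches make
# the species-tree topology clearly dominant.
for t in (0.1, 0.5, 2.0):
    p_match, p_alt, _ = quartet_probabilities(t)
    print(f"t={t}: matching={p_match:.3f}, each alternative={p_alt:.3f}")
```

Under this formula, the 40% / 30% / 30% split used as the example on the slide corresponds to t = ln(10/9), roughly 0.105 coalescent units.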


• For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006]

1. Break up each input gene tree into trees on 4 taxa (quartet trees)

2. Find all \binom{n}{4} dominant quartet topologies

3. Combine dominant quartet trees

• Alternative: weight all 3\binom{n}{4} quartet topologies by their frequency and find the optimal tree

[Figure: the three possible quartet topologies for one choice of four species, with frequencies p1 = 30%, p2 = 30%, p3 = 40% and the dominant topology marked; repeated for each of the \binom{n}{4} choices.]

Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]

• Optimization Problem (suspected NP-Hard): find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|

where Q(T) is the set of quartet trees induced by T, t is a gene tree, and \mathcal{T} is the set of all input gene trees

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
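The score can be made concrete with a brute-force sketch. This is not how ASTRAL computes it (ASTRAL uses dynamic programming); it only enumerates all four-taxon subsets, so it is practical for small trees. Trees are assumed to be nested tuples of taxon names, and the names `clades`, `quartets`, and `quartet_score` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set of tree, list of leaf sets of all its subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found + [child_leaves]
    return leaves, found

def quartets(tree):
    """Set of resolved quartet trees induced by a tree; each quartet ab|cd
    is stored as a frozenset of its two pairs."""
    leaves, sides = clades(tree)
    result = set()
    for four in combinations(sorted(leaves), 4):
        four = frozenset(four)
        for side in sides:
            pair = side & four
            if len(pair) == 2:  # a branch separates this pair from the other two
                result.add(frozenset([pair, four - pair]))
                break
    return result

def quartet_score(species_tree, gene_trees):
    """Score(T): number of induced quartet trees shared with the gene trees."""
    q_species = quartets(species_tree)
    return sum(len(q_species & quartets(g)) for g in gene_trees)

candidate = (("A", "B"), ("C", ("D", "E")))
gene_trees = [(("A", "B"), ("C", ("D", "E"))), (("A", "C"), ("B", ("D", "E")))]
print(quartet_score(candidate, gene_trees))
```

In this toy input the first gene tree shares all 5 quartets with the candidate and the second shares 3, so the printed score is 8.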

ASTRAL-I [Mirarab, et al., ECCB, 2014]

• ASTRAL solves the problem exactly using dynamic programming:

• Exponential running time (feasible for <18 species)

• Introduced a constrained version of the problem

• Draws the set of branches in the species tree from a given set X = {all bipartitions in all gene trees}

• Motivation: given a large number of gene trees, each species tree branch appears in at least one gene tree

• Theorem: the constrained version remains statistically consistent

• Running time: O(n^2 k |X|^2) for n species and k genes
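The constraint set X can be sketched as follows. This is a minimal illustration, assuming gene trees are given as nested tuples of taxon names over a shared taxon set; the helper names are hypothetical and this is not ASTRAL's own implementation.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial bipartitions (internal branches) of one tree."""
    taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits that separate zero or one taxon from the rest.
        if 1 < len(below) < len(taxa) - 1:
            found.add(frozenset([below, taxa - below]))
        return below

    visit(tree)
    return found

def constraint_set(gene_trees):
    """X = union of the bipartitions of all input gene trees."""
    x = set()
    for gene_tree in gene_trees:
        x |= bipartitions(gene_tree)
    return x

gene_trees = [(("A", "B"), ("C", ("D", "E"))), (("A", "C"), ("B", ("D", "E")))]
for split in constraint_set(gene_trees):
    print(" | ".join(",".join(sorted(side)) for side in split))
```

Because |X| enters the running time quadratically, the size of this set largely determines how expensive the constrained search is.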

ASTRAL-I on biological datasets

• 1KP: 103 plant species, 400-800 genes

• Yang, et al. 96 Caryophyllales species, 1122 genes

• Dentinger, et al. 39 mushroom species, 208 genes

• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes

• Laumer, et al. 40 flatworm species, 516 genes

• Grover, et al. 8 cotton species, 52 genes

• Hosner, Braun, and Kimball. 28 quail species, 11 genes

• Simmons and Gatesy. 47 angiosperm species, 310 genes

[Slide image: first page of the 1KP pilot paper]

Phylotranscriptomic analysis of the origin and early diversification of land plants

Norman J. Wickett, Siavash Mirarab, Nam Nguyen, Tandy Warnow, ..., Gane Ka-Shu Wong, and James Leebens-Mack. PNAS, 2014. www.pnas.org/cgi/doi/10.1073/pnas.1323926111

Reconstructing the origin and evolution of land plants and their algal relatives is a fundamental problem in plant phylogenetics, and is essential for understanding how critical adaptations arose, including the embryo, vascular tissue, seeds, and flowers. Despite advances in molecular systematics, some hypotheses of relationships remain weakly resolved. Inferring deep phylogenies with bouts of rapid diversification can be problematic; however, genome-scale data should significantly increase the number of informative characters for analyses. Recent phylogenomic reconstructions focused on the major divergences of plants have resulted in promising but inconsistent results. One limitation is sparse taxon sampling, likely resulting from the difficulty and cost of data generation. To address this limitation, transcriptome data for 92 streptophyte taxa were generated and analyzed along with 11 published plant genome sequences. Phylogenetic reconstructions were conducted using up to 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyses were performed to test the robustness of phylogenetic inferences to permutations of the data matrix or to phylogenetic method, including supermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned analyses, and amino acid versus DNA alignments. Among other results, we find robust support for a sister-group relationship between land plants and one group of streptophyte green algae, the Zygnematophyceae. Strong and robust support for a clade comprising liverworts and mosses is inconsistent with a widely accepted view of early land plant evolution, and suggests that phylogenetic hypotheses used to understand the evolution of fundamental plant traits should be reevaluated.

land plants | Streptophyta | phylogeny | phylogenomics | transcriptome

Future datasets

• 1200 plants with ~400 genes (1KP consortium)

• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)

• 200 avian species with whole genomes (with Genome 10K, international)

• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)

• 140 insects with 1400 genes (with U. Illinois at Urbana-Champaign)

Shortcomings of ASTRAL-I

• Even the constrained version was too slow for more than about 200 species and hundreds of genes

• The constraint set X did not include true species tree branches for some challenging datasets, resulting in low accuracy in some cases

• Input gene trees could not have polytomies

ASTRAL-II

1. Faster calculation of the score function inside DP

• O(nk) instead of O(n^2 k) for n species and k genes

• Post-order traversal of input trees instead of set operations

2. Add extra bipartitions to the set X using heuristic approaches

• Resolving consensus trees by subsampling taxa

• Using quartet-based distances to find likely branches

3. Ability to take as input gene trees with polytomies

Simulation study

• Variable parameters:

• Number of species: 10 – 1000

• Number of genes: 50 – 1000

• Amount of ILS: low, medium, high

• Deep versus recent speciation

• 11 model conditions (50 replicates each) with heterogeneous gene tree error

• Compare to NJst, MP-EST, concatenation (CA-ML)

• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
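The FN rate defined above can be sketched directly from bipartitions. The code assumes both trees are nested tuples over the same taxa and that the true tree has at least one internal branch; the function names are hypothetical.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial bipartitions (internal branches) of a tree."""
    taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        if 1 < len(below) < len(taxa) - 1:
            found.add(frozenset([below, taxa - below]))
        return below

    visit(tree)
    return found

def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches missing from the estimated tree."""
    true_branches = bipartitions(true_tree)
    missing = true_branches - bipartitions(estimated_tree)
    return len(missing) / len(true_branches)

true_tree = (("A", "B"), ("C", ("D", "E")))
estimated = (("A", "C"), ("B", ("D", "E")))
print(f"FN rate: {fn_rate(true_tree, estimated):.0%}")
```

Here the estimated tree recovers the D,E branch but misses the A,B branch, so one of the two true branches is missing.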

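[Figure: simulation pipeline. A true (model) species tree generates true gene trees, which generate sequence data; gene trees are estimated from the sequence data, and ASTRAL-II estimates the species tree from the estimated gene trees. Example taxa: Finch, Falcon, Owl, Eagle, Pigeon.]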

[Excerpt from the ASTRAL-II manuscript shown on the slide]

… look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are \binom{u'}{2} quartet trees that put that pair together, where u' is the number of leaves outside the node u. This will examine each pair of nodes in each of the input k nodes exactly once and would therefore require O(n^2 k) computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.

Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.

Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating an UPGMA tree on the subsampled similarity matrix. Finally, for the two first greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.

Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.

Fig. 1. Characteristics of the simulation. (a) True gene tree discordance: RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I, shown as densities by speciation rate (1e-06, 1e-07) and tree height (10M, 2M, 500K). Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) Gene tree estimation error: RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter and intra-replicate gene tree error distributions.

3.3 Multi-furcating input gene trees

Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start by a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are \binom{d}{3} ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree, and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of \binom{d}{3} tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.
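The treatment of a polytomy as a collection of tripartitions can be sketched as below. The input format (one taxon set per side of the multi-furcating node) and the function name are assumptions made for illustration; ASTRAL-II's internal data structures are not described in this excerpt.

```python
from itertools import combinations
from math import comb

def polytomy_tripartitions(sides):
    """Enumerate every way of choosing three sides of a multi-furcating node.

    `sides` holds one taxon set per branch around the node; each choice of
    three sides is a tripartition of a subset of the taxa.
    """
    return [tuple(frozenset(side) for side in three)
            for three in combinations(sides, 3)]

# A node of degree d = 4 yields C(4, 3) = 4 tripartitions.
sides = [{"A", "B"}, {"C"}, {"D", "E"}, {"F"}]
tripartitions = polytomy_tripartitions(sides)
print(len(tripartitions), comb(len(sides), 3))
for part in tripartitions:
    print([sorted(side) for side in part])
```

The number of tripartitions grows cubically with the degree d, which is the source of the cubic cost per multi-furcating node noted in the text.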

Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.

Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.

4 EXPERIMENTAL SETUP

Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.

We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We …

ASTRAL-I versus ASTRAL-II

200 species, deep ILS

[Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes and number of genes. Species tree accuracy (top; species tree topological error, FN) and running times (bottom; hours) are shown for ASTRAL-I, ASTRAL-II, and ASTRAL-II + true st, with 50, 200, or 1000 genes, tree lengths of 10M, 2M, and 500K (low, medium, and high ILS), and rates of 1e-06 and 1e-07. ASTRAL-II + true st shows the case where the true species tree is added to the search space; this is included to approximate an ideal (e.g. exact) solution to the quartet problem.]

Tree accuracy when varying the number of species

1000 genes, “medium” levels of recent ILS

[Figure: species tree topological error (FN; axis from 4% to 16%) versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]

Running time when varying the number of species

[Figure: running time (hours; axis from 0 to 20) versus number of species for ASTRAL-II, NJst, and MP-EST.]

Tree accuracy when varying the level of ILS

200 species, recent ILS

[Figure: species tree topological error (FN; axis from 0% to 30%) versus tree length (10M, 2M, 500K; tree length controls the amount of ILS, giving low, medium, and high ILS) for ASTRAL-II, NJst, and CA-ML, in panels for 1000, 200, and 50 genes.]

Impact of gene tree error (using true gene trees)

[Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and varying tree shapes and number of genes. Species tree topological error (FN) is shown for ASTRAL-II, ASTRAL-II (true gt), and CA-ML, with 50, 200, or 1000 genes, tree lengths of 10M, 2M, and 500K, and rates of 1e-06 and 1e-07.]

• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error

Insights on biological data

• Main question: The placement of Amborella at the base of angiosperms

• Xi et al. (2014) used a collection of 310 genes sampled from 46 species.

• Conflicting results:

• Concatenation puts Amborella at the base (H1)

• MP-EST puts Amborella + water lilies at the base (H2)

• Xi et al. conclude ILS is the cause

• ASTRAL, like many other recent studies (e.g., 1KP), recovers H1

• ILS is not necessarily the cause

[Excerpt from the ASTRAL-II paper shown on the slide]

… more accurate than ASTRAL under lower levels of ILS is related to estimation error in the input provided to ASTRAL.

In our ASTRAL and NJst analyses, gene tree error had a positive correlation with species tree error (Supplementary Fig. S7), with correlation coefficients that were similar for ASTRAL and NJst. The error of CA-ML also correlated with gene tree error (obviously the relationship is indirect; for example, short alignments impact both CA-ML and gene tree error), but the correlation was weaker than the correlation observed for coalescent-based methods (Supplementary Fig. S8). Interestingly, the correlation between gene tree estimation error and species tree error was typically higher with fewer genes.

To further investigate the impact of the gene tree error, we divided replicates of each model condition into three categories: average gene tree estimation error below 0.25 is low, between 0.25 and 0.4 is medium and above 0.4 is high. We plotted the species tree accuracy within each of these categories (see Fig. 5 for one model condition, but also see Supplementary Figs S9 and S10 for other model conditions). The relative performance of ASTRAL and NJst is typically unchanged across various categories of gene tree error, but increasing gene tree error tends to increase the magnitude of the difference between ASTRAL and NJst. Furthermore, MP-EST seemed to be more sensitive to gene tree error than either NJst or ASTRAL (Supplementary Fig. S10).

The relative performance of ASTRAL and CA-ML depended on gene tree error. For those model conditions where CA-ML was generally more accurate than ASTRAL (e.g. 2M/1e-07), ASTRAL tended to outperform CA-ML on the replicates with low gene tree estimation error (Fig. 5). Consistent with this observation, we noted that ASTRAL was impacted by gene tree error more than CA-ML (Supplementary Fig. S9).
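The three-way split of replicates by gene tree estimation error can be written as a small helper. The thresholds (0.25 and 0.4) come from the text; how values exactly on a threshold are assigned is not stated there, so the boundary handling below is an assumption, as are the function name and the sample values.

```python
def error_category(mean_gene_tree_error):
    """Bin a replicate by its average gene tree estimation error (RF, 0 to 1).

    Assumption: a value of exactly 0.25 or 0.4 is counted as "medium".
    """
    if mean_gene_tree_error < 0.25:
        return "low"
    if mean_gene_tree_error <= 0.4:
        return "medium"
    return "high"

# Hypothetical per-replicate averages, for illustration only.
for error in (0.12, 0.31, 0.55):
    print(error, error_category(error))
```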

5.5 RQ5: collapsing low support branches

ASTRAL-II can handle inputs with polytomies. Although we have not done bootstrapping to get reliable measures of support, we do get local SH-like branch support from FastTree-II. We collapsed low support branches (10%, 33% and 50%) and ran ASTRAL on the resulting unresolved gene trees. We measured the impact of contracting low support branches on the RF rate: the median delta RF (error before collapsing minus error after collapsing) is typically zero (Supplementary Fig. S11), never above zero but in a few cases below zero (signifying that accuracy was improved in those few cases). However, these differences are not statistically significant (P = 0.36). Since this analysis was performed using SH-like branch support values instead of bootstrap support values (or other ways of estimating support values), further studies are needed.
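Collapsing low support branches can be sketched as a post-order traversal that splices the children of a poorly supported node into its parent. The tree encoding used here, an internal node as a `(support, children)` pair and a leaf as a string, is an assumption for illustration, as are the support values; the paper's own tooling is not shown in this excerpt.

```python
def collapse(node, threshold):
    """Contract every internal branch whose support is below `threshold`.

    The support stored at a node refers to the branch above it, so the
    root's own value is never used.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            new_children.extend(child[1])  # splice grandchildren into this node
        else:
            new_children.append(child)
    return (support, new_children)

# Hypothetical gene tree with supports on a 0 to 1 scale.
gene_tree = (1.0, [(0.05, ["A", "B"]), (0.9, ["C", (0.2, ["D", "E"])])])
for threshold in (0.10, 0.33, 0.50):
    print(threshold, collapse(gene_tree, threshold))
```

At the 10% threshold only the A,B branch is contracted; at 33% and 50% the D,E branch is contracted as well, leaving polytomies that a method accepting multi-furcating input can consume.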

6 Biological results

The evolution of angiosperms, and the placement of Amborella trichopoda Baill., is one of the challenging questions in land plant evolution. One hypothesis recovered in some recent molecular studies (e.g. Drew et al. 2014; Qiu et al. 2000; Wickett et al. 2014; Zhang et al. 2012) is that A. trichopoda Baill. is sister to the rest of angiosperms, followed by water lilies (i.e. Nymphaeales). In particular, a recent analysis of 104 plant species based on entire transcriptomes recovered this relationship both with concatenation and ASTRAL-I, using various perturbations of the dataset (Wickett et al., 2014). A competing hypothesis is that Amborella is sister to water lilies, and this whole group is sister to other angiosperms (Drew et al., 2014; Goremykin et al., 2013). Xi et al. (2014) examined this question using a collection of 310 genes sampled from 42 angiosperms and 4 outgroups. They observed that CA-ML produced the first hypothesis and MP-EST produced the second hypothesis, and they argued that these differences are due to the fact that CA-ML does not model ILS, whereas MP-EST does.

We obtained alignments for these 310 genes from Xi et al. (2014) and estimated gene trees using RAxML under GTR+Γ model with 200 replicates of bootstrapping and 10 rounds of ML (RAxML was used because running time was not an issue on this relatively small dataset). We ran MP-EST and ASTRAL and obtained two different trees (Fig. 6). Reproducing Xi et al. (2014) results, MP-EST recovered the sister relationship of Amborella and Nymphaeales with 100% support. However, ASTRAL, just like …

Fig. 5. Comparison of species tree accuracy with 200 taxa, divided into three categories of gene tree estimation error (low, medium, high). Species tree error (FN) is shown for ASTRAL-II, NJst, and CA-ML under the 2M/1e-07 model condition. Boxes show number of genes (50, 200, 1000).

Fig. 6. Comparison of species trees computed on the angiosperm dataset of Xi et al. (2014): (A) ASTRAL-II and (B) MP-EST. MP-EST and ASTRAL-II differ in the placement of Amborella; the concatenation tree agrees with ASTRAL-II.

Summary

• Genome-scale data provides a wealth of information for resolving long-standing phylogenetic questions

• ASTRAL-II improves on ASTRAL-I in terms of both accuracy and running time

• ASTRAL-II can handle datasets with 1000 genes from 1000 taxa in a day of single-CPU running time

• ASTRAL dominates other summary methods; however, concatenation is better when gene trees have high error

• In future, we need to further explore the impact of model violations, recombination, missing data, and multiple sources of gene tree discordance (e.g., HGT)

Acknowledgments

Jim Leebens-Mack (UGA)

Norman Wickett (U Chicago)

Gane Wong (U of Alberta)

Keshav Pingali

S.M. Bayzid, Nam Nguyen (now at UIUC)

Tandy Warnow

Théo Zimmermann, Bastien Boussau (Université Lyon)

Erich Jarvis (Duke, HHMI)

Tom Gilbert (U Copenhagen)

Guojie Zhang (BGI, China)

Ed Braun (U Florida)

…

HHMI international student fellowship
