Using networks to explore, quantify, and summarize phylogenetic tree space


Jeremy M. Brown¹, Guifang Zhou², Wen Huang², Jeremy Ash¹, Melissa Marchand², Kyle Gallivan², and Jim Wilgenbusch³

¹ Louisiana State University, Dept. of Biological Sciences

² Florida State University, Dept. of Mathematics

³ Florida State University, Dept. of Scientific Computing


The Team

Jeremy Brown

Jeremy Ash

Jim Wilgenbusch

Guifang Zhou

Wen Huang

Melissa Marchand

Kyle Gallivan

Overview

• Motivation

• Our network approaches

• Some applications

• Initial results

• Software

Tree Sets

Motivation Our Approaches Applications Initial Results Software

Summarizing Tree Sets

• Consensus trees

• MASTs (maximum agreement subtrees)

• Clustering

• NLDR (nonlinear dimensionality reduction)

BIOINFORMATICS Vol. 18 Suppl. 1 2002, Pages S285–S293

Statistically based postprocessing of phylogenetic analysis by clustering

Cara Stockham 1, Li-San Wang 2,∗ and Tandy Warnow 2

1 Texas Institute for Computational and Applied Mathematics, University of Texas, ACES 6.412, Austin, TX 78712, USA and 2 Department of Computer Sciences, University of Texas, Austin, TX 78712, USA

Received on January 24, 2002; revised and accepted on March 29, 2002

ABSTRACT

Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus as postprocessing methods can be unsatisfactory due to their inherent limitations.

Results: In this paper we present an alternative approach by using clustering algorithms on the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in the information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.

Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP (Swofford, 2001) and the Daniel Huson’s tree library on Intel Pentium workstations running Debian Linux.

Contact: E-mail: lisan@cs.utexas.edu

Supplementary Information: http://www.cs.utexas.edu/users/lisan/ismb02/

Keywords: consensus methods; clustering; phylogenetics; information theory; maximum parsimony.

INTRODUCTION

Phylogenetic analysis can be divided into three stages. In the first stage, a researcher collects data (such as DNA sequences) for each of the different taxa (genes, species, etc.) under study. In the second phase, she applies a tree reconstruction method to the data. Many tree reconstruction methods produce more than one candidate tree for the input dataset. For example, the maximum parsimony (Swofford et al., 1996) method returns those binary trees with the lowest parsimony score. (The parsimony score of a tree is the minimum tree length, i.e., the sum of distances between two endpoints across all edges, obtained by any way of labeling the internal nodes.) Very often the number of trees can be in the hundreds or thousands. In the last phase, a consensus tree of the candidate trees is computed so as to resolve the conflict, summarize the information, and reduce the overwhelming number of possible solutions to the evolutionary history.

∗ To whom correspondence should be addressed.

Many consensus tree methods are available, but a common feature to all of them is that they produce one tree. There are several shortcomings of this approach including loss of information and sensitivity to outliers.

In this paper we present a different approach to postprocessing. The set of candidate trees is divided into several subsets using clustering methods. Each cluster is then characterized by its own consensus tree. We pose several theoretical optimization problems for these kinds of outputs, and present some initial progress on these problems; these are presented in the section on Clustering Criteria. The bulk of our paper is focused on an empirical study, which is presented in the Experiments section. We conclude our study and propose additional research problems in the Conclusions section.

BACKGROUND

Phylogenetic trees

A leaf-labeled tree topology can be decomposed into a set of bipartitions in the following manner. Each edge, when deleted from the tree, induces a bipartition of the leaves; thus, we can identify each edge with its induced bipartition. Let $t_1$ and $t_2$ be two trees on the same leaf set, and let $E(t_1)$ and $E(t_2)$ denote their sets of internal edges. The quantity $|E(t_1) \,\triangle\, E(t_2)| = |(E(t_1) - E(t_2)) \cup (E(t_2) - E(t_1))|$ is called the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the two

© Oxford University Press 2002
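As a rough illustration of the Robinson-Foulds distance defined in the excerpt above, the following sketch stores each internal edge as an unordered pair of leaf sets and counts the symmetric difference between two trees. The helper names (`bipartition`, `rf_distance`) and the five-taxon example trees are assumptions made for illustration; they are not taken from the paper or from any particular software package.

```python
from typing import FrozenSet, Iterable, Set

# A bipartition is an unordered pair of leaf sets, so "OA | BCD" and
# "BCD | OA" compare equal.
Bipartition = FrozenSet[FrozenSet[str]]


def bipartition(side: Iterable[str], taxa: Iterable[str]) -> Bipartition:
    """Return the split {side | remaining taxa} as an unordered pair."""
    side_set = frozenset(side)
    return frozenset({side_set, frozenset(taxa) - side_set})


def rf_distance(edges1: Set[Bipartition], edges2: Set[Bipartition]) -> int:
    """Size of the symmetric difference of two internal-edge sets."""
    return len(edges1 ^ edges2)


TAXA = "OABCD"  # hypothetical taxon labels, one character each

# Tree 1: ((O,A),B,(C,D))   Tree 2: ((O,B),A,(C,D))
tree1 = {bipartition("OA", TAXA), bipartition("CD", TAXA)}
tree2 = {bipartition("OB", TAXA), bipartition("CD", TAXA)}

print(rf_distance(tree1, tree2))  # the trees differ in one edge each
```

Representing each split as a frozenset of its two sides avoids having to pick a canonical side for every bipartition.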

Report multiple consensus trees, while attempting to minimize the amount of information lost from the full distribution.

• Clustering

• Dimensionality Reduction

478 SYSTEMATIC BIOLOGY VOL. 54

FIGURE 7. Comparison of trees generated from two different Bayesian MCMC analyses of the same data set. Both Bayesian analyses were conducted under the same conditions (see text). Each color represents the trees from a different analysis. (a) Analysis based on unweighted Robinson-Foulds distances; (b) analysis based on weighted Robinson-Foulds distances.

be a useful means for describing the Bayesian MCMC approach in classes and workshops on phylogenetics (see http://lewis.eeb.uconn.edu/lewishome/software.html for another useful MCMC instruction tool).

FIGURE 8. Progress in a Bayesian MCMC analysis. The progress in the search can be visualized in the Tree Set Visualization program, as a demonstration of how an MCMC analysis functions. In the visualization, the progress of the chain through tree-space moves from regions of low optimality scores (blue) to regions of high optimality scores (red).

Another common problem in phylogenetics is the discovery of several distinct “tree islands” of equally optimal or near-optimal phylogenetic solutions for a given dataset (Maddison, 1991). In analyzing a particular data set, one might discover that there are a large number of solutions that fit the data equally well. A consensus of these trees may show little or no resolution. However, an unresolved consensus tree does not necessarily indicate that all potential solutions fit the data equally well. Separate summaries of each of the tree islands is likely to show a much higher degree of resolution, and the separate tree islands may represent alternative phylogenetic solutions for the data set. Tree Set Visualization can be used to identify and analyze these tree islands, as shown in Figure 9.

Potential Limitations of MDS for Visualizing Tree-Space

Multidimensional scaling based on RF distances is clearly not the only way (and is not necessarily even the best way) to visualize and represent tree-space. We have found this approach to the problem to be useful for exploring large sets of phylogenetic trees, but we also recognize that the approach has some limitations. For instance, any reduction of high-dimensional space into two dimensions necessarily will result in some distortions. As an example of distortion, consider the MDS visualization of trees shown in Figure 10. In this case, a reference tree is shown in blue, and a series of trees that differ from the reference tree by one bipartition each (RF = 2) are shown in red. All of the trees are equally distant from the reference tree in tree-space, and in multiple dimensions would form a “multidimensional sphere” around the


Networks of Trees

Tree-to-Tree Affinities (Similarities)
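One way to picture a network of trees is to turn a matrix of tree-to-tree distances (for example RF distances) into affinities that serve as edge weights. The sketch below shows two common transformations, a reciprocal and an exponential one. The function name `to_affinity`, the `eps` parameter, and the example distances are assumptions for illustration only, and the slides do not state which transformation the authors use.

```python
import math
from typing import List

Matrix = List[List[float]]


def to_affinity(dist: Matrix, kind: str = "reciprocal", eps: float = 1.0) -> Matrix:
    """Convert a symmetric distance matrix into an affinity matrix.

    kind="reciprocal":  affinity = 1 / (eps + distance)
    kind="exponential": affinity = exp(-distance)
    Both map distance 0 to the largest affinity (with eps = 1 that is 1.0).
    """
    if kind == "reciprocal":
        transform = lambda d: 1.0 / (eps + d)
    elif kind == "exponential":
        transform = lambda d: math.exp(-d)
    else:
        raise ValueError(f"unknown affinity kind: {kind!r}")
    return [[transform(d) for d in row] for row in dist]


# Hypothetical RF distances among three trees.
rf = [
    [0, 2, 4],
    [2, 0, 2],
    [4, 2, 0],
]

affinity = to_affinity(rf)
for row in affinity:
    print([round(a, 3) for a in row])
```

Closer trees receive larger edge weights under either transformation, which is the property a tree network needs for grouping similar trees.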


Networks of Bipartitions

[Figure: network whose nodes are bipartitions of taxa O, A, B, C, D: OA | BCD, OB | ACD, OC | ABD, OD | ABC, AB | OCD, AC | OBD, AD | OBC, BC | OAD, BD | OAC, CD | OAB]

Bipartition Covariances

[Figure: Cov(Orange, Blue) for two bipartitions, computed for two samplings of trees (Sampling 1 and Sampling 2). Panel labels: 1,000; = 0.25; = 0.00]
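A minimal sketch of how a covariance between two bipartitions could be computed, assuming each bipartition is treated as a 0/1 presence indicator across the sampled trees and the population covariance is used. The function name and the 1,000-tree example (two topologies, 500 trees each) are hypothetical and are not taken from the authors' software.

```python
from typing import List, Set


def bipartition_covariance(trees: List[Set[str]], a: str, b: str) -> float:
    """Covariance of the presence indicators of bipartitions a and b.

    Each tree is given as the set of bipartition labels it contains.
    Cov = P(a and b) - P(a) * P(b), estimated over all trees.
    """
    n = len(trees)
    p_a = sum(a in t for t in trees) / n
    p_b = sum(b in t for t in trees) / n
    p_ab = sum(a in t and b in t for t in trees) / n
    return p_ab - p_a * p_b


# Hypothetical sample: two topologies, each in half of 1,000 trees.
topology_1 = {"OA|BCD", "CD|OAB"}
topology_2 = {"OB|ACD", "AD|OBC"}
sample = [topology_1] * 500 + [topology_2] * 500

# Bipartitions from the same topology always co-occur.
print(bipartition_covariance(sample, "OA|BCD", "CD|OAB"))
# Bipartitions from different topologies never co-occur.
print(bipartition_covariance(sample, "OA|BCD", "OB|ACD"))
```

Under these assumptions, two equally frequent topologies give covariances of +0.25 within a topology and -0.25 between topologies, while a bipartition present in every tree has zero covariance with all others.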

Uniform Distribution of Topologies

[Figure: network of bipartitions of taxa O, A, B, C, D; edge legend: + Cov, - Cov]

Two Equally Frequent Topologies

[Figure: network of bipartitions of taxa O, A, B, C, D; edge legend: + Cov, - Cov]

Network Visualizations

[Figure: network plot with one axis per gene (Gene 1, Gene 2, Gene 3)]

• Nodes ordered on each axis by frequency of topology

• Edges connect identical topologies

Assessing Model Fit

Using parametric bootstrapping or posterior prediction, we can compare network structures between observed and simulated datasets.
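The comparison described above can be sketched as a tail probability: compute some summary statistic of the network for the empirical dataset and for each simulated dataset, then ask how often the simulated value is at least as extreme. Which network statistic to use is left open here, and the function name and all numbers below are hypothetical placeholders, not results from the talk.

```python
from typing import Sequence


def upper_tail_probability(observed: float, simulated: Sequence[float]) -> float:
    """Fraction of simulated statistics at least as large as the observed one."""
    if not simulated:
        raise ValueError("need at least one simulated statistic")
    return sum(s >= observed for s in simulated) / len(simulated)


# Hypothetical values of a network statistic (for example, the number of
# communities detected) for one empirical and ten simulated datasets.
observed_stat = 4
simulated_stats = [1, 2, 2, 1, 3, 2, 1, 2, 4, 2]

p_value = upper_tail_probability(observed_stat, simulated_stats)
print(p_value)
```

A small tail probability would suggest that the assumed model rarely produces network structure like that seen in the empirical data.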

[Figure: two bipartition networks for taxa O, A, B, C, D, labeled Empirical and Simulated; edge legend: + Cov, - Cov]

Detecting Distinct Phylogenetic Signals

Initial Results


Two Equally Frequent Topologies

[Figure: two panels showing networks of bipartitions of taxa O, A, B, C, D; edge legend: + Cov, - Cov]

[Figure: Completely distinct signals in two genes. Labels: Gene 1, Gene 2, Community 1, Community 2]

[Figure: Partially overlapping signal. Labels: Gene 1, Gene 2, Community 1, Community 2]
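To give a feel for how communities of bipartitions could separate distinct signals, the sketch below links bipartitions whose presence indicators covary positively and reports the connected components. This is a deliberately simplified stand-in: connected components are not the same as a modularity-based community detection method, and the function name, threshold, and example trees are all assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set


def positive_covariance_communities(
    trees: List[Set[str]], threshold: float = 0.0
) -> List[List[str]]:
    """Group bipartitions connected by covariance above `threshold`."""
    n = len(trees)
    labels = sorted(set().union(*trees))
    freq = {x: sum(x in t for t in trees) / n for x in labels}

    # Union-find over bipartition labels.
    parent: Dict[str, str] = {x: x for x in labels}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(labels, 2):
        joint = sum(a in t and b in t for t in trees) / n
        if joint - freq[a] * freq[b] > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample: two conflicting topologies, 500 trees each.
trees = [{"OA|BCD", "CD|OAB"}] * 500 + [{"OB|ACD", "AD|OBC"}] * 500
communities = positive_covariance_communities(trees)
print(communities)
```

In this toy case the bipartitions of each topology end up in their own group, and a bipartition shared by every tree would remain a singleton because its covariance with everything else is zero.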

Proof of Principle

Topologies used for simulating two halves of an alignment.

[Workflow diagram. Labels: Bootstrap, Combine, Simulate, Summarize, ???]

[Figure: Majority-Rule Consensus Tree of 16 taxa, taxon_1 to taxon_16]

[Figure: bipartition covariance network. Legend: Community A, Community B, Free Nodes; Negative Covariances, Positive Covariances. Node labels shown: 14, 15, 17, 18]

Networks Detect Strong Conflict

[Figure: three trees of taxon_1 to taxon_16, labeled 36% of Trees, 37% of Trees, and 27% of Trees]

TreeScaper

Wen Huang. Tuesday morning iEvoBio Lightning Talk.

Web Interface (future): www.treescaperonline.org

TreeScaper Online

Input → Create Networks → Community Detection → Report Network Stats → Visualizations


Acknowledgements

• Computing support from FSU’s Research Computing Center and HPC@LSU

• Financial support from NSF (DBI 1262571)