Post on 17-Dec-2014
Using networks to explore, quantify, and summarize phylogenetic tree space
Jeremy M. Brown¹, Guifang Zhou², Wen Huang², Jeremy Ash¹, Melissa Marchand², Kyle Gallivan², and Jim Wilgenbusch³
¹ Louisiana State University, Dept. of Biological Sciences
² Florida State University, Dept. of Mathematics
³ Florida State University, Dept. of Scientific Computing
The Team
Jeremy Brown
Jeremy Ash
Jim Wilgenbusch
Guifang Zhou
Wen Huang
Melissa Marchand
Kyle Gallivan
Overview
• Motivation
• Our network approaches
• Some applications
• Initial results
• Software
Tree Sets
Motivation Our Approaches Applications Initial Results Software
Summarizing Tree Sets
• Consensus trees
• MASTs (maximum agreement subtrees)
• Clustering
• NLDR (nonlinear dimensionality reduction)
[Figure: example tree with branch labels 0.9 and 0.7]
Summarizing Tree Sets
• Consensus trees
• Agreement subtrees
• Clustering
• NLDR
[Figure: example tree with branch label 0.9 and a "Prune" annotation]
Summarizing Tree Sets
• Consensus trees
• Agreement subtrees
• Clustering
• NLDR
BIOINFORMATICS Vol. 18 Suppl. 1 2002, Pages S285–S293
Statistically based postprocessing of phylogenetic analysis by clustering
Cara Stockham 1, Li-San Wang 2,∗ and Tandy Warnow 2
1 Texas Institute for Computational and Applied Mathematics, University of Texas, ACES 6.412, Austin, TX 78712, USA and 2 Department of Computer Sciences, University of Texas, Austin, TX 78712, USA
∗ To whom correspondence should be addressed.
Received on January 24, 2002; revised and accepted on March 29, 2002
© Oxford University Press 2002
ABSTRACT
Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus as postprocessing methods can be unsatisfactory due to their inherent limitations.
Results: In this paper we present an alternative approach by using clustering algorithms on the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in the information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.
Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP (Swofford, 2001) and the Daniel Huson’s tree library on Intel Pentium workstations running Debian Linux.
Contact: E-mail: lisan@cs.utexas.edu
Supplementary Information: http://www.cs.utexas.edu/users/lisan/ismb02/
Keywords: consensus methods; clustering; phylogenetics; information theory; maximum parsimony.
INTRODUCTION
Phylogenetic analysis can be divided into three stages. In the first stage, a researcher collects data (such as DNA sequences) for each of the different taxa (genes, species, etc.) under study. In the second phase, she applies a tree reconstruction method to the data. Many tree reconstruction methods produce more than one candidate tree for the input dataset. For example, the maximum parsimony (Swofford et al., 1996) method returns those binary trees with the lowest parsimony score. (The parsimony score of a tree is the minimum tree length, i.e., the sum of distances between two endpoints across all edges, obtained by any way of labeling the internal nodes.) Very often the number of trees can be in the hundreds or thousands. In the last phase, a consensus tree of the candidate trees is computed so as to resolve the conflict, summarize the information, and reduce the overwhelming number of possible solutions to the evolutionary history.
Many consensus tree methods are available, but a common feature to all of them is that they produce one tree. There are several shortcomings of this approach including loss of information and sensitivity to outliers.
In this paper we present a different approach to postprocessing. The set of candidate trees is divided into several subsets using clustering methods. Each cluster is then characterized by its own consensus tree. We pose several theoretical optimization problems for these kinds of outputs, and present some initial progress on these problems; these are presented in the section on Clustering Criteria. The bulk of our paper is focused on an empirical study, which is presented in the Experiments section. We conclude our study and propose additional research problems in the Conclusions section.
BACKGROUND
Phylogenetic trees
A leaf-labeled tree topology can be decomposed into a set of bipartitions in the following manner. Each edge, when deleted from the tree, induces a bipartition of the leaves; thus, we can identify each edge with its induced bipartition. Let t1 and t2 be two trees on the same leaf set, and let E(t1) and E(t2) denote their sets of internal edges. The quantity $|E(t_1) \triangle E(t_2)| = |(E(t_1) - E(t_2)) \cup (E(t_2) - E(t_1))|$ is called the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the two
Report multiple consensus trees, while attempting to minimize the amount
of information lost from the full distribution.
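The Robinson-Foulds distance defined in the excerpt above can be sketched in a few lines. This is a minimal illustration, not code from the talk or from the cited paper: it assumes each tree is already given as a set of bipartitions, and the helper names `split` and `rf_distance` are hypothetical.

```python
def split(side, taxa):
    """Return a canonical form of a bipartition.

    A bipartition is stored as the side that does NOT contain the
    alphabetically first taxon, so both ways of writing the same
    bipartition map to the same frozenset.
    """
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def rf_distance(edges_1, edges_2):
    """Size of the symmetric difference of two internal-edge sets."""
    return len(edges_1 ^ edges_2)


taxa = "ABCDE"
# Hypothetical example trees, written as their internal bipartitions.
t1 = {split("AB", taxa), split("ABC", taxa)}  # ((A,B),C,(D,E))
t2 = {split("AC", taxa), split("ABC", taxa)}  # ((A,C),B,(D,E))
t3 = {split("AD", taxa), split("ADE", taxa)}  # ((A,D),E,(B,C))

print(rf_distance(t1, t2))  # trees differing in one bipartition each
print(rf_distance(t1, t3))  # trees sharing no internal bipartition
```

Storing bipartitions in a canonical form is the main design choice here: without it, AB | CDE and CDE | AB would be counted as different edges.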
Summarizing Tree Sets
• Consensus trees
• Agreement subtrees
• Clustering
• Dimensionality Reduction
478 SYSTEMATIC BIOLOGY VOL. 54
FIGURE 7. Comparison of trees generated from two different Bayesian MCMC analyses of the same data set. Both Bayesian analyses were conducted under the same conditions (see text). Each color represents the trees from a different analysis. (a) Analysis based on unweighted Robinson-Foulds distances; (b) analysis based on weighted Robinson-Foulds distances.
be a useful means for describing the Bayesian MCMC approach in classes and workshops on phylogenetics (see http://lewis.eeb.uconn.edu/lewishome/software.html for another useful MCMC instruction tool).
FIGURE 8. Progress in a Bayesian MCMC analysis. The progress in the search can be visualized in the Tree Set Visualization program, as a demonstration of how an MCMC analysis functions. In the visualization, the progress of the chain through tree-space moves from regions of low optimality scores (blue) to regions of high optimality scores (red).
Another common problem in phylogenetics is the discovery of several distinct “tree islands” of equally optimal or near-optimal phylogenetic solutions for a given dataset (Maddison, 1991). In analyzing a particular data set, one might discover that there are a large number of solutions that fit the data equally well. A consensus of these trees may show little or no resolution. However, an unresolved consensus tree does not necessarily indicate that all potential solutions fit the data equally well. Separate summaries of each of the tree islands is likely to show a much higher degree of resolution, and the separate tree islands may represent alternative phylogenetic solutions for the data set. Tree Set Visualization can be used to identify and analyze these tree islands, as shown in Figure 9.
Potential Limitations of MDS for Visualizing Tree-Space
Multidimensional scaling based on RF distances is clearly not the only way (and is not necessarily even the best way) to visualize and represent tree-space. We have found this approach to the problem to be useful for exploring large sets of phylogenetic trees, but we also recognize that the approach has some limitations. For instance, any reduction of high-dimensional space into two dimensions necessarily will result in some distortions. As an example of distortion, consider the MDS visualization of trees shown in Figure 10. In this case, a reference tree is shown in blue, and a series of trees that differ from the reference tree by one bipartition each (RF = 2) are shown in red. All of the trees are equally distant from the reference tree in tree-space, and in multiple dimensions would form a “multidimensional sphere” around the
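The distortion that the excerpt above attributes to two-dimensional MDS can be made concrete with a small sketch. This uses classical (metric) MDS in NumPy, which is an assumption: the excerpt does not say which MDS variant the Tree Set Visualization program uses, and the distance matrix below is invented for illustration.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Embed points in n_dims dimensions from a distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the n_dims largest eigenvalues (eigh returns ascending order).
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical RF distances among four trees forming two tight pairs.
rf = np.array([
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
], dtype=float)

coords = classical_mds(rf)
embedded = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
distortion = np.abs(embedded - rf).max()
print(coords)
print("largest absolute error in embedded distances:", distortion)
```

These four points need three dimensions to be embedded exactly, so the two-dimensional picture cannot reproduce every distance, which is the kind of distortion the excerpt warns about.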
Networks of Trees
Tree-to-Tree Affinities (Similarities)
[Figure: network of trees with a High-to-Low affinity scale]
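One way to picture a network of trees is to turn pairwise tree distances into affinities and keep the stronger connections as edges. The sketch below is only an illustration: the slides do not state how affinities are computed, so the reciprocal transform `1 / (1 + d)`, the threshold, and the function names are all assumptions.

```python
import numpy as np


def affinity_from_distance(dist):
    """Map distances to affinities in (0, 1]; larger means more similar.

    The reciprocal transform is an assumed choice, not the talk's method.
    """
    d = np.asarray(dist, dtype=float)
    aff = 1.0 / (1.0 + d)
    np.fill_diagonal(aff, 0.0)  # no self-loops in the network
    return aff


def edges_above(aff, threshold):
    """List (i, j, affinity) for tree pairs at or above the threshold."""
    n = aff.shape[0]
    return [
        (i, j, float(aff[i, j]))
        for i in range(n)
        for j in range(i + 1, n)
        if aff[i, j] >= threshold
    ]


# Hypothetical RF distances among four trees.
rf = np.array([
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
], dtype=float)

affinity = affinity_from_distance(rf)
strong_edges = edges_above(affinity, threshold=0.3)
print(strong_edges)
```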
Networks of Bipartitions
[Figure: network whose nodes are the ten bipartitions of five taxa (O, A, B, C, D): OA | BCD, OB | ACD, OC | ABD, OD | ABC, AB | OCD, AC | OBD, AD | OBC, BC | OAD, BD | OAC, CD | OAB]
Bipartition Covariances
[Figure: two samplings of topologies on taxa A-H, labeled Sampling 1 and Sampling 2, with two bipartitions highlighted in orange and blue; each highlighted bipartition is labeled with frequency 0.51]
• Sampling with two topologies (frequencies 0.51 and 0.49): Cov(Orange, Blue)_1,000 = 0.25
• Sampling with four topologies (frequencies 0.26, 0.25, 0.25, and 0.24): Cov(Orange, Blue)_1,000 = 0.00
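The covariance values on this slide can be reproduced from presence/absence indicators. The sketch below is a hedged reconstruction: it assumes 1,000 sampled trees per sampling, and it assumes which topologies carry the orange and blue bipartitions (the slide's figure is not recoverable), chosen so that each bipartition has frequency 0.51.

```python
import numpy as np


def bipartition_covariance(has_x, has_y):
    """Covariance of two 0/1 indicators: E[XY] - E[X] E[Y]."""
    x = np.asarray(has_x, dtype=float)
    y = np.asarray(has_y, dtype=float)
    return float(np.mean(x * y) - np.mean(x) * np.mean(y))


# Assumed sampling with two topologies (510 and 490 trees):
# both bipartitions occur together in the more frequent topology.
orange_a = [1] * 510 + [0] * 490
blue_a = [1] * 510 + [0] * 490

# Assumed sampling with four topologies (260, 250, 250, 240 trees):
# orange occurs in the first two, blue in the first and third.
orange_b = [1] * 260 + [1] * 250 + [0] * 250 + [0] * 240
blue_b = [1] * 260 + [0] * 250 + [1] * 250 + [0] * 240

cov_a = bipartition_covariance(orange_a, blue_a)
cov_b = bipartition_covariance(orange_b, blue_b)
print(round(cov_a, 2), round(cov_b, 2))
```

Both samplings give each bipartition the same marginal frequency, so frequencies alone cannot tell them apart; the covariance is what separates them.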
Networks of Bipartitions
Uniform Distribution of Topologies
[Figure: network of the ten bipartitions of taxa O, A, B, C, D; the legend distinguishes positive (+ Cov) and negative (- Cov) covariances]
Networks of Bipartitions
Two Equally Frequent Topologies
[Figure: network of the ten bipartitions of taxa O, A, B, C, D; the legend distinguishes positive (+ Cov) and negative (- Cov) covariances]
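For five taxa, the signed bipartition network under the two scenarios named on these slides can be enumerated directly. This is a sketch under stated assumptions: trees are treated as fully resolved and unrooted, each bipartition is identified by its two-taxon side, and the particular pair of topologies used for the second scenario is an arbitrary choice.

```python
from itertools import combinations

TAXA = "OABCD"
# Each nontrivial bipartition of five taxa is identified by its two-taxon side.
SPLITS = [frozenset(pair) for pair in combinations(TAXA, 2)]
# Two bipartitions fit on the same tree when their two-taxon sides are disjoint,
# and every resolved five-taxon topology is exactly one such compatible pair.
TOPOLOGIES = [
    frozenset({s, t}) for s, t in combinations(SPLITS, 2) if not (s & t)
]


def covariance_network(weights):
    """Return {(split, split): covariance} for all nonzero covariances.

    `weights` maps each topology (a frozenset of two splits) to its
    probability in the tree set.
    """
    freq = {
        s: sum(w for topo, w in weights.items() if s in topo) for s in SPLITS
    }
    edges = {}
    for s, t in combinations(SPLITS, 2):
        joint = weights.get(frozenset({s, t}), 0.0)
        cov = joint - freq[s] * freq[t]
        if abs(cov) > 1e-12:
            edges[(s, t)] = cov
    return edges


uniform = {topo: 1.0 / len(TOPOLOGIES) for topo in TOPOLOGIES}
two_topologies = {
    frozenset({frozenset("AB"), frozenset("CD")}): 0.5,
    frozenset({frozenset("AC"), frozenset("BD")}): 0.5,
}

for name, weights in [("uniform", uniform), ("two topologies", two_topologies)]:
    edges = covariance_network(weights)
    positive = sum(cov > 0 for cov in edges.values())
    negative = sum(cov < 0 for cov in edges.values())
    print(name, "positive edges:", positive, "negative edges:", negative)
```

In this sketch, compatible bipartitions covary positively and conflicting ones negatively, and bipartitions that never appear in the tree set drop out of the network.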
Network Visualizations
[Figure: plot with one axis per gene (Gene 1, Gene 2, Gene 3)]
• Nodes ordered on each axis by frequency of topology
• Edges connect identical topologies
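The two rules stated for this visualization (order nodes by topology frequency on each gene's axis, connect identical topologies across axes) can be sketched as plain data preparation. The topology labels, counts, and function names below are hypothetical, and no plotting is attempted.

```python
from collections import Counter


def axis_positions(topologies):
    """Rank each distinct topology on one axis, most frequent first."""
    counts = Counter(topologies)
    ranked = [topo for topo, _ in counts.most_common()]
    return {topo: rank for rank, topo in enumerate(ranked)}


def shared_topology_edges(axes):
    """Connect nodes on different axes that carry the same topology."""
    names = list(axes)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            for topo in axes[first].keys() & axes[second].keys():
                edges.append(
                    ((first, axes[first][topo]), (second, axes[second][topo]))
                )
    return sorted(edges)


# Hypothetical sampled topologies for three genes.
samples = {
    "Gene 1": ["T1"] * 5 + ["T2"] * 3 + ["T3"],
    "Gene 2": ["T2"] * 6 + ["T4"] * 2,
    "Gene 3": ["T4"] * 4 + ["T1"] * 3,
}
axes = {gene: axis_positions(trees) for gene, trees in samples.items()}
edges = shared_topology_edges(axes)
print(axes)
print(edges)
```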
Assessing Model Fit
Using parametric bootstrapping or posterior prediction, we can compare network structures between observed and simulated datasets.
[Figure: two networks of the ten bipartitions of taxa O, A, B, C, D, labeled Empirical and Simulated; each legend distinguishes positive (+ Cov) and negative (- Cov) covariances]
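A comparison between observed and simulated network structures is often reduced to a single summary statistic and a tail probability. The sketch below shows that generic calculation only; the choice of statistic (here, a hypothetical count of negative-covariance edges) and all numbers are assumptions, not results from the talk.

```python
def upper_tail_p(observed, simulated):
    """Fraction of simulated statistics at least as large as the observed one."""
    simulated = list(simulated)
    return sum(value >= observed for value in simulated) / len(simulated)


# Hypothetical statistic: number of negative-covariance edges in the
# bipartition network of each simulated dataset.
simulated_stats = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]

p_extreme = upper_tail_p(9, simulated_stats)
p_typical = upper_tail_p(6, simulated_stats)
print(p_extreme, p_typical)
```

A small tail probability would suggest that the empirical network is unusual relative to networks from data simulated under the model.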
Detecting Distinct Phylogenetic Signals
Tree-to-Tree Affinities (Similarities)
[Figure: network of trees with a High-to-Low affinity scale]
Detecting Distinct Phylogenetic Signals
Two Equally Frequent Topologies
[Figure: network of the ten bipartitions of taxa O, A, B, C, D; the legend distinguishes positive (+ Cov) and negative (- Cov) covariances]
Network Visualizations
Completely distinct signals in two genes
[Figure: plot with axes for Gene 1 and Gene 2; topologies grouped into Community 1 and Community 2]
Network Visualizations
Partially overlapping signal
[Figure: plot with axes for Gene 1 and Gene 2; topologies grouped into Community 1 and Community 2]
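As a rough illustration of how communities might separate distinct signals, the sketch below groups bipartitions that are linked by positive covariance. This is a deliberately simple stand-in (connected components over positive edges), not the community-detection method used in the talk, and the covariance values are hypothetical.

```python
def positive_components(nodes, covariances):
    """Group nodes connected through positive-covariance edges."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the group representative, compressing paths.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), cov in covariances.items():
        if cov > 0:
            parent[find(a)] = find(b)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=sorted)


nodes = ["AB|OCD", "CD|OAB", "AC|OBD", "BD|OAC", "OA|BCD"]
# Hypothetical covariances for two equally frequent, conflicting topologies.
covariances = {
    ("AB|OCD", "CD|OAB"): 0.25,
    ("AC|OBD", "BD|OAC"): 0.25,
    ("AB|OCD", "AC|OBD"): -0.25,
    ("AB|OCD", "BD|OAC"): -0.25,
    ("CD|OAB", "AC|OBD"): -0.25,
    ("CD|OAB", "BD|OAC"): -0.25,
}

communities = positive_components(nodes, covariances)
print(communities)
```

Nodes with no positive edge end up in singleton groups, which is one simple way to think of nodes that belong to no community.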
Proof of Principle
Topologies used for simulating two halves of an alignment.
Proof of Principle
• Simulate
• Combine
• Bootstrap
• Summarize (???)
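The combine and bootstrap steps named on this slide can be sketched with a toy alignment. Everything here is illustrative: the sequences are invented, the function names are hypothetical, and the sketch uses the standard nonparametric bootstrap (resampling alignment columns with replacement), which the slide does not spell out.

```python
import random


def combine(first_half, second_half):
    """Concatenate two alignments that share the same taxa."""
    return {taxon: first_half[taxon] + second_half[taxon] for taxon in first_half}


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping columns intact."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(sequence[c] for c in columns)
        for taxon, sequence in alignment.items()
    }


# Hypothetical alignment halves, standing in for data simulated on two topologies.
half_1 = {"taxon_1": "ACGT", "taxon_2": "ACGA"}
half_2 = {"taxon_1": "TTGC", "taxon_2": "TAGC"}

combined = combine(half_1, half_2)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate = bootstrap_alignment(combined, rng)
print(combined)
print(replicate)
```

Each replicate would then be passed to a tree-inference program, and the resulting trees summarized.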
Majority-Rule Consensus Tree
[Figure: majority-rule consensus tree of 16 taxa (taxon_1 to taxon_16), scale bar 0.09; clade supports are 1 except for six clades with supports between 0.61 and 0.74]
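The support values on a majority-rule consensus tree are the fractions of trees containing each bipartition, keeping only those above one half. A minimal sketch, assuming each tree is given as a set of bipartition labels; the tree set below is hypothetical and the function name is illustrative.

```python
from collections import Counter


def majority_rule(trees, threshold=0.5):
    """Return {bipartition: frequency} for bipartitions above the threshold."""
    counts = Counter(split for tree in trees for split in tree)
    n_trees = len(trees)
    return {
        split: count / n_trees
        for split, count in counts.items()
        if count / n_trees > threshold
    }


# Hypothetical set of ten five-taxon trees, each written as its bipartitions.
trees = (
    [{"AB|CDE", "DE|ABC"}] * 6
    + [{"AC|BDE", "DE|ABC"}] * 3
    + [{"AD|BCE", "BC|ADE"}] * 1
)

consensus = majority_rule(trees)
print(consensus)
```

Bipartitions at or below the threshold are dropped, which is how conflicting signals can disappear from a single consensus summary.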
Proof of Principle
[Figure: network of 18 numbered nodes grouped into Community A, Community B, and Free Nodes; edges are marked as Positive Covariances or Negative Covariances]
Networks Detect Strong Conflict
[Figure: three trees of 16 taxa (taxon_1 to taxon_16) with clade supports and numbered nodes, labeled 36% of Trees, 37% of Trees, and 27% of Trees]
TreeScaper
Wen Huang. Tuesday morning iEvoBio Lightning Talk.
Web Interface (future): www.treescaperonline.org
TreeScaper Online
Input → Create Networks → Community Detection → Report Network Stats → Visualizations
[Figure: example outputs: a bipartition covariance network, a two-dimensional scatter plot, and a three-gene plot (Gene 1, Gene 2, Gene 3)]
Acknowledgements
• Computing support from FSU’s Research Computing Center and HPC@LSU
• Financial support from NSF (DBI 1262571)