New Approaches for Inferring the Tree of Life
Tandy Warnow
Associate Professor
Department of Computer Sciences
Graduate Program in Ecology, Evolution, and Behavior
Co-Director
The Center for Computational Biology and Bioinformatics
The University of Texas at Austin
Packard Proposal 1996
I observed that DNA and RNA sequences are low in phylogenetic signal, as currently analyzed, and
I proposed to seek out and model new sources of significant phylogenetic signal, and then develop efficient algorithms to extract that signal, so that the inference of evolutionary history could be made with greater accuracy.
What I did instead
• Developed methods for use with biomolecular sequences that recover the true tree with high probability from polynomial length sequences.
• (Last two years): Developed methods for reconstructing phylogenies from gene order and content within whole genomes.
• (Last year): Started looking at inferring non-tree models of evolution.
DNA Sequence Evolution
[Figure: a tree showing the sequence AAGACTT evolving through the intermediate sequences TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, and AGCACTT into the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT. Time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today.]
Major Phylogenetic Reconstruction Methods
• Polynomial-time distance-based methods (neighbor joining, perhaps the most popular)
• NP-hard sequence-based methods that can take years on real datasets
  – Maximum Parsimony
  – Maximum Likelihood
• Heated debates over the relative performance of these methods
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: example tree comparison with the FN and FP edges marked; 50% error rate]
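The FN and FP rates can be made concrete with a small sketch. It assumes each tree is given as a list of its internal edges, with every edge written as the set of leaves on one side of it (a bipartition); the function names and the five-leaf toy trees are illustrative only and are not taken from the slides.

```python
def normalize(split, leaves):
    """Represent a bipartition by the side that excludes a fixed reference leaf."""
    side = frozenset(split)
    reference = min(leaves)  # arbitrary but fixed choice of reference leaf
    return frozenset(leaves) - side if reference in side else side


def error_rates(true_splits, inferred_splits, leaves):
    """Return (FN rate, FP rate) of an inferred tree relative to the true tree."""
    true_set = {normalize(s, leaves) for s in true_splits}
    inferred_set = {normalize(s, leaves) for s in inferred_splits}
    # FN: edges of the true tree that are missing from the inferred tree.
    fn = len(true_set - inferred_set) / len(true_set) if true_set else 0.0
    # FP: edges of the inferred tree that are not in the true tree.
    fp = len(inferred_set - true_set) / len(inferred_set) if inferred_set else 0.0
    return fn, fp


# Toy example (hypothetical trees on five leaves).
leaves = {"A", "B", "C", "D", "E"}
true_tree = [{"A", "B"}, {"D", "E"}]      # internal edges of ((A,B),C,(D,E))
inferred_tree = [{"A", "B"}, {"C", "D"}]  # internal edges of ((A,B),E,(C,D))

fn_rate, fp_rate = error_rates(true_tree, inferred_tree, leaves)
print(f"FN rate = {fn_rate:.0%}, FP rate = {fp_rate:.0%}")
```

In this toy case one of the two true edges is missed and one of the two inferred edges is wrong, so both rates come out at one half.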
Absolute fast convergence vs. exponential convergence
• DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods.
[Diagram: exponentially converging method → DCM → SQS → absolute fast converging method]
• We modify the second phase to improve the empirical performance, replacing SQS with ML (maximum likelihood) or MP (maximum parsimony).
DCM-Boosting [Warnow et al. 2001]
DCMNJ+ML vs. other methods on a fixed model tree
• 500-taxon rbcL tree
• K2P+ model (=2, =1)
• Avg. branch length = 0.278
• Relative performance is typical in our studies
Comparison of methods on random trees as a function of number of taxa
• K2P+ model (=2, =1)
• Avg. branch length = 0.05
• Seq. length = 1000
Summary
• These are the first polynomial time methods that improve upon NJ (with respect to topological accuracy) and are never worse than NJ.
• The advantage obtained with DCMNJ+MP and DCMNJ+ML increases with number of taxa.
• In practice these new methods are slower than NJ (minutes vs. seconds), but still much faster than MP and ML (which can take days).
• Conjecture: DCMNJ+ML is AFC.
II. Whole-Genome Phylogeny
[Figure: tree with leaves A, B, C, D, E, F and internal nodes W, X, Y, Z]
Genomes As Signed Permutations
1 –5 3 4 –2 –6 or
6 2 –4 –3 5 –1 etc.
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
Inversion: 1 2 3 –8 –7 –6 –5 –4 9 10
Transposition: 1 2 3 9 4 5 6 7 8 10
Inverted Transposition: 1 2 3 9 –8 –7 –6 –5 –4 10
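The three rearrangement operations can be sketched on a signed permutation stored as a Python list. This is a minimal illustration that treats the genome as linear rather than circular; the function names and index conventions (half-open slices `i:j`, insertion point `k`) are assumptions made for the example.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it sits just before original index k (k >= j)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Transpose the segment genome[i:j] to position k and invert it on the way."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10

# Each call acts on the segment 4..8 (indices 3 to 7).
print(inversion(genome, 3, 8))                  # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(transposition(genome, 3, 8, 9))           # [1, 2, 3, 9, 4, 5, 6, 7, 8, 10]
print(inverted_transposition(genome, 3, 8, 9))  # [1, 2, 3, 9, -8, -7, -6, -5, -4, 10]
```

The three printed lists match the three example genomes on the slide.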
Genome Rearrangement Has A Huge State Space
• DNA sequences: 4 states per site
• Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site
• Circular genomes (1 site)
  – with 37 genes: $2.56 \times 10^{52}$ states
  – with 120 genes: $3.70 \times 10^{232}$ states
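As a quick check of the state-space sizes, the count of signed circular genomes on n genes, 2^(n-1) (n-1)!, can be evaluated directly with exact integer arithmetic. The function name is an illustrative choice.

```python
from math import factorial


def signed_circular_genome_count(n):
    """Number of signed circular genomes on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * factorial(n - 1)


for n in (37, 120):
    count = signed_circular_genome_count(n)
    # Python integers are exact; the format spec only rounds for display.
    print(f"n = {n}: {count:.2e} states")
```

For n = 37 and n = 120 this should reproduce the two figures quoted for circular genomes.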
Our Approaches
• Statistically-based genomic distance estimators so that NJ analyses are more accurate, recovering 90% of the edges even for datasets close to saturation.
• Improved bounds for tree length.
• GRAPPA: high performance implementation for the maximum parsimony problems for rearranged genomes, achieving up to 200,000-fold speedup.
Accuracy of Neighbor Joining Using Distance Estimators
• 120 genes
• Inversion-only evolution (other models of evolution show the same relative performance)
• 10, 20, 40, 80, and 160 genomes
Consensus of 216 MP Trees for the Campanulaceae dataset
Strict consensus of 216 trees; 6 out of 10 internal edges recovered.
Trachelium
Campanula
Adenophora
Symphandra
Legousia
Asyneuma
Triodanus
Wahlenbergia
Merciera
Codonopsis
Cyananthus
Platycodon
Tobacco
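A strict consensus keeps only the internal edges that appear in every input tree. The sketch below assumes each tree is already given as a set of bipartitions, each written as the side that does not contain a fixed reference leaf (here the hypothetical leaf F); the three six-leaf toy trees are invented for illustration and are not the Campanulaceae trees.

```python
def strict_consensus(trees):
    """Return the bipartitions (internal edges) shared by every tree."""
    tree_sets = [set(tree) for tree in trees]
    if not tree_sets:
        return set()
    return set.intersection(*tree_sets)


# Toy trees on leaves A..F; each split is the side that excludes leaf F.
t1 = {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")}
t2 = {frozenset("AB"), frozenset("ABD"), frozenset("ABCD")}
t3 = {frozenset("AC"), frozenset("ABC"), frozenset("ABCD")}

consensus = strict_consensus([t1, t2, t3])
print(f"{len(consensus)} out of {len(t1)} internal edges recovered")
```

Only the split separating {A, B, C, D} from {E, F} is present in all three toy trees, so the consensus retains a single internal edge.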
Future Work
• New focus on Rare Genomic Changes
  – New data
  – New models
  – New methods
• New techniques for large-scale analyses
  – Divide-and-conquer methods
  – Non-tree models
  – Visualization of large trees and large sets of trees
Acknowledgements
• Funding: The David and Lucile Packard Foundation, The National Science Foundation, and Paul Angello
• Collaborators:
  – Robert Jansen (U. Texas)
  – Bernard Moret, David Bader, Mi-Yan (U. New Mexico)
  – Daniel Huson (Celera)
  – Katherine St. John (CUNY)
  – Linda Raubeson (Central Washington U.)
  – Luay Nakhleh, Usman Roshan, Jerry Sun, Li-San Wang, Stacia Wyman (Phylolab, U. Texas)
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/