Phylogeny Estimation: Why It Is "Hard", and
How to Design Methods with Good Performance
Tandy Warnow
Department of Computer Sciences
University of Texas at Austin
The real title:
Phylogeny Estimation: Why it is “Hard”
but not how to design methods with good performance - talk to me separately about this, no time in
this lecture!
This talk
• Intro to phylogenetic estimation (using some terms to be defined later: polynomial time and NP-hard)
• Computational problems and what it means to solve them exactly
• Computational problems, and what it means to “solve them” heuristically
Phylogeny (evolutionary tree)
[Figure: evolutionary tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]
Evolutionary History
From AToL website
Helps us
– predict gene function
– develop drugs and vaccines
– understand disease spread
– understand human origins
Tree of Life
Phylogenetics: estimating evolutionary histories
DNA Sequence Evolution
[Figure: a tree showing a DNA sequence evolving by substitutions from AAGACTT (3 million years ago), through intermediate sequences (2 million and 1 million years ago), to the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT. Only the present-day sequences are observed.]
Phylogeny Problem
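[Figure: the five observed sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and AGGGCAT, labeled with the letters U through Y, and the unknown tree on the leaves U, V, W, X, Y that must be estimated from them]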
How can we infer evolution?
Two types of phylogenetic reconstruction methods
1. Heuristics for hard optimization problems (Maximum Parsimony and Maximum Likelihood)
[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]
2. Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.
Maximum Parsimony
• Input: Set S of n aligned sequences of length k
• Output:
  – A phylogenetic tree T leaf-labeled by sequences in S
  – additional sequences of length k labeling the internal nodes of T
  such that the total number of changes is minimized
Maximum parsimony (example)
• Input: Four sequences
  – ACT
  – ACA
  – GTT
  – GTA
• Question: which of the three trees has the best MP score?
Maximum Parsimony
[Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, and GTA]
Maximum Parsimony
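[Figure: the three trees, each with labeled internal nodes and the number of changes on each edge]
• Tree grouping ACT with GTT and ACA with GTA: internal nodes GTT and GTA, edge costs 2, 1, 2. MP score = 5
• Tree grouping ACT with GTA and ACA with GTT: internal nodes ACT and ACA, edge costs 3, 1, 3. Length 7 with the labels shown; labeling both internal nodes ACT brings this tree down to its MP score of 6
• Tree grouping ACT with ACA and GTT with GTA: internal nodes ACA and GTA, edge costs 1, 2, 1. MP score = 4
Optimal MP tree: the third tree, with MP score 4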
Maximum Parsimony: computational complexity
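[Figure: the optimal tree, grouping ACT with ACA and GTT with GTA, with internal nodes ACA and GTA and edge costs 1, 2, 1. MP score = 4]
But how do we find the best tree?
Optimal labeling of a given tree can be computed in linear time, O(nk)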
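The linear-time labeling of a fixed tree can be sketched with Fitch's algorithm, a standard method for this task. The slides do not name an algorithm, so that choice, the nested-tuple tree encoding, and all function names below are assumptions made for illustration. Each tree is rooted on its middle edge, which does not change the parsimony score.

```python
def fitch_up(tree, site):
    """Bottom-up pass for one site: (candidate states, changes, left, right)."""
    if isinstance(tree, str):                  # leaf: its observed character
        return ({tree[site]}, 0, None, None)
    left, right = fitch_up(tree[0], site), fitch_up(tree[1], site)
    common = left[0] & right[0]
    if common:                                 # children can agree: no new change
        return (common, left[1] + right[1], left, right)
    return (left[0] | right[0], left[1] + right[1] + 1, left, right)


def fitch_down(node, parent_state=None):
    """Top-down pass: reuse the parent's state when allowed, else pick one."""
    states, _, left, right = node
    state = parent_state if parent_state in states else min(states)
    if left is None:
        return state
    return (state, fitch_down(left, state), fitch_down(right, state))


def merge_sites(per_site):
    """Join the per-site labelings of one tree into full sequences."""
    if isinstance(per_site[0], str):
        return "".join(per_site)
    return ("".join(p[0] for p in per_site),
            merge_sites([p[1] for p in per_site]),
            merge_sites([p[2] for p in per_site]))


def optimal_labeling(tree, k):
    """Return (MP score, labeled tree) for leaf sequences of length k."""
    passes = [fitch_up(tree, site) for site in range(k)]
    score = sum(p[1] for p in passes)
    return score, merge_sites([fitch_down(p) for p in passes])


# The three trees from the example, written as (left pair, right pair).
trees = {
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
}
scores = {name: optimal_labeling(tree, 3)[0] for name, tree in trees.items()}
best_score, best_labeled = optimal_labeling(trees["ACT,ACA | GTT,GTA"], 3)
print(scores)
print(best_score, best_labeled)
```

Each site is visited once per node, so the work is proportional to nk for a fixed alphabet. On the example, the sketch reports scores 5, 6, and 4. The tree drawn with edge costs 3, 1, 3 has length 7 under the labels shown on the slide, but an optimal labeling of that topology costs 6. The ranking of the three trees is unchanged.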
Exhaustive Search
For every tree on the set of sequences, do:
• Score the tree (compute optimal sequences for each internal node, and record the score)
• Keep track of the tree with the best score
How expensive is this?
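A minimal sketch of exhaustive search, assuming trees are built by attaching one leaf at a time to every edge and scored by counting changes site by site (Fitch's rule, which is an assumption here since the slides do not name a scoring routine). The function names are hypothetical.

```python
def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                         # the edge above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def all_unrooted_trees(leaves):
    """All unrooted binary trees on at least two leaves, rooted at the first leaf."""
    first, second, *others = leaves
    trees = [second]
    for leaf in others:
        trees = [t for tree in trees for t in insertions(tree, leaf)]
    return [(first, t) for t in trees]


def site_cost(tree, site):
    """Fitch's rule for one site: (candidate states, number of changes)."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    (ls, lc), (rs, rc) = site_cost(tree[0], site), site_cost(tree[1], site)
    if ls & rs:
        return ls & rs, lc + rc
    return ls | rs, lc + rc + 1


def mp_score(tree, k):
    return sum(site_cost(tree, site)[1] for site in range(k))


def exhaustive_mp(seqs):
    """Score every tree and keep the best one."""
    k = len(seqs[0])
    scored = ((mp_score(t, k), t) for t in all_unrooted_trees(seqs))
    return min(scored, key=lambda pair: pair[0])


best_score, best_tree = exhaustive_mp(["ACT", "ACA", "GTT", "GTA"])
print(best_score, best_tree)
print(len(all_unrooted_trees(list("abcdef"))), "trees on 6 leaves")
```

With four sequences there are only three trees to score, so this finishes instantly. The cost is in the number of trees, which grows far too quickly for this approach to scale.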
Don’t try “exhaustive search”
• Number of (unrooted) binary trees on n leaves is (2n-5)!! = (2n-5) x (2n-7) x (2n-9) x … x 3
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, finding the best tree by exhaustive search would take roughly 6 x 10^2846 millennia
#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
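The growth of (2n-5)!! is easy to check directly, since Python integers have arbitrary precision. The sketch below evaluates the product exactly and reports the order of magnitude for the larger cases. The function name and the 365-day year are assumptions for illustration.

```python
def num_unrooted_trees(n):
    """(2n-5)!! = 3 x 5 x ... x (2n-5): unrooted binary trees on n >= 3 leaves."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_trees(n))

for n in (20, 100, 1000):
    digits = len(str(num_unrooted_trees(n)))
    print(n, f"about 10^{digits - 1}")

# Time to score every tree on 1000 leaves at 1000 trees per second.
SECONDS_PER_MILLENNIUM = 1000 * 365 * 24 * 3600
millennia = num_unrooted_trees(1000) // (1000 * SECONDS_PER_MILLENNIUM)
print("millennia needed: a number with", len(str(millennia)), "digits")
```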
Maximum Parsimony: computational complexity
Finding the optimal MP tree is NP-hard
Optimal labeling of a given tree can be computed in linear time, O(nk)
NP-hard(ness)
• What does this mean?
• What are the consequences for a problem being NP-hard?
• What kind of methods are used to “solve” NP-hard problems?
• How should you interpret the output of a software program, when the problem is NP-hard?
“Real” problem: your brother’s birthday party
• Your brother is turning 10 and you need to arrange his birthday party
• He wants all his friends to come
• But some of them hate each other
Your objective: have as few parties as you can, but invite everyone to at least one party (while not having people who hate each other at the same party)
Your brother’s party
• Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben
• Sally and Alice hate each other, also Henry and Sally, Henry and Tommy, Alice and Jimmy, Ben and Sally, and Ben and Henry.
Graph representation of your brother’s friends
Graph has vertices and edges
• Vertices = your brother’s friends
• Edges between vertices indicate they hate each other
Coloring vertices to assign friends to parties
• Given graph G with vertices and edges
• Assign colors to the vertices so that no edge connects vertices of the same color, using a minimum number of colors
• Vertices = your brother’s friends
• Edges between vertices indicate they hate each other
• Colors = parties
Assigning friends to parties: graph coloring!
• Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben
• We can’t do this with two parties. Why?
• What about three?
Your brother’s parties
Solution: three parties!
• Sally, Tommy, and Jimmy
• Henry and Alice
• Ben
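The three-party solution can be checked mechanically. The sketch below encodes the "hate" pairs as edges, confirms that no party contains two people who hate each other, and looks for three friends who all hate one another, which is what rules out two parties. The variable and function names are illustrative only.

```python
from itertools import combinations

friends = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
hates = {frozenset(pair) for pair in [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]}
parties = [{"Sally", "Tommy", "Jimmy"}, {"Henry", "Alice"}, {"Ben"}]


def is_peaceful(party):
    """True if no two guests at this party hate each other."""
    return not any(frozenset(pair) in hates for pair in combinations(party, 2))


everyone_invited = set().union(*parties) == set(friends)
all_peaceful = all(is_peaceful(party) for party in parties)

# Three mutual enemies must go to three different parties.
triangles = [trio for trio in combinations(sorted(friends), 3)
             if all(frozenset(pair) in hates for pair in combinations(trio, 2))]

print(everyone_invited, all_peaceful, triangles)
```

Ben, Henry, and Sally form such a trio, so at least three parties are needed, and the solution above shows that three are enough.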
What is the minimum number of colors that a graph needs?
Remember: no edge between vertices of the same color!
A graph that needs 3 colors
2-colored graph
A computational problem
• 2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.
Can we 2-color this graph?
Can we 2-color these graphs?
Solving 2-colorability
• 2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.
• Greedy Algorithm. Start with one vertex and make it red, then make all its neighbors blue, then make their uncolored neighbors red, and keep going (starting again from any uncolored vertex if the graph has several pieces). If you succeed in coloring the graph without giving two adjacent vertices the same color, the graph can be 2-colored; otherwise it cannot.
• Running time: O(n+m), where n is the number of vertices and m is the number of edges.
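The greedy procedure can be written as a breadth-first search. This is a minimal sketch, assuming vertices are numbered 0 to n-1 and edges are given as pairs; the function name is hypothetical.

```python
from collections import deque


def two_color(n, edges):
    """Return a list of colors if the graph is 2-colorable, else None."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    color = [None] * n
    for start in range(n):                 # one search per connected piece
        if color[start] is not None:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:       # forced choice: opposite of u
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]: # conflict: no 2-coloring exists
                    return None
    return color


square = [(0, 1), (1, 2), (2, 3), (3, 0)]
triangle = [(0, 1), (1, 2), (2, 0)]
print(two_color(4, square))
print(two_color(3, triangle))
```

Every vertex enters the queue once and every edge is examined twice, which gives the O(n+m) running time. The method is exact because each color after the first is forced.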
What about this?
• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.
A 3-colored graph
Can you 3-color these graphs?
How about this graph?
Testing 3-colorability
• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.
The “greedy algorithm” will work correctly in some, but not all cases.
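One way to see a greedy rule fail is sketched below. It assumes "greedy" means first-fit coloring (give each vertex, in some fixed order, the smallest color not used by its already-colored neighbors), which is one natural reading and not necessarily the rule intended on the slide. The example graph and all names are chosen for illustration.

```python
def first_fit(order, adj):
    """Color vertices in the given order with the smallest available color."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(order)) if c not in used)
    return color


# Graph on a0..a3 and b0..b3 where a_i is adjacent to b_j whenever i != j.
# It is 2-colorable (all a's one color, all b's the other), hence 3-colorable.
adj = {}
for i in range(4):
    adj[f"a{i}"] = [f"b{j}" for j in range(4) if j != i]
    adj[f"b{i}"] = [f"a{j}" for j in range(4) if j != i]

good_order = [f"a{i}" for i in range(4)] + [f"b{i}" for i in range(4)]
bad_order = [v for i in range(4) for v in (f"a{i}", f"b{i}")]

colors_good = len(set(first_fit(good_order, adj).values()))
colors_bad = len(set(first_fit(bad_order, adj).values()))
print(colors_good, colors_bad)
```

With the good order the rule uses two colors. With the bad order it uses four, so a test that answers "not 3-colorable" whenever greedy needs a fourth color would be wrong on this graph.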
Exhaustive search for 3-colorability
• Look at all possible vertex colorings
• See if any is “legal” (no edge between vertices of the same color)
Problem: there are 3^n vertex colorings of a graph on n vertices
Question to students: how many vertex colorings are there for a graph with 10 vertices? 20 vertices? 100 vertices?
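Exhaustive search for 3-colorability is short to write, which makes its cost easy to overlook. A minimal sketch, assuming vertices are numbered 0 to n-1 and edges are pairs; the function name is hypothetical.

```python
from itertools import product

COLORS = ("red", "blue", "green")


def three_color(n, edges):
    """Try all 3^n colorings; return the first legal one, or None."""
    for coloring in product(COLORS, repeat=n):
        if all(coloring[u] != coloring[v] for u, v in edges):
            return coloring
    return None


triangle = [(0, 1), (1, 2), (2, 0)]
four_clique = [(u, v) for u in range(4) for v in range(u + 1, 4)]
print(three_color(3, triangle))
print(three_color(4, four_clique))

# Number of colorings the loop may have to inspect.
for n in (10, 20, 100):
    print(n, 3 ** n)
```

The loop is correct on every graph, but when no legal coloring exists it inspects all 3^n candidates before giving up.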
What about this?
• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.
• This problem is NP-hard.
• What does this mean?
• Some decision problems can be solved in polynomial time:
  – Can graph G be 2-colored?
  – Does graph G have a 3-clique (three vertices that are all adjacent)?
• Some problems seem not to be solvable in polynomial time:
  – Can graph G be 3-colored?
  – What is the size of the largest clique in the graph G? (As a decision problem: does G have a clique of size k?)
P vs. NP, continued
• The “big” question in theoretical computer science is:
  – Is it possible to solve an NP-hard problem such as 3-colorability in polynomial time?
• If the answer is “yes”, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.
Minimum coloring
• Since 3-colorability is NP-hard, finding the minimum number of colors for a graph is NP-hard.
• That means the problem will be very hard on some graphs, even if others can be easy.
• So if your brother has a lot of friends, arranging the minimum number of parties could take you a very very very very very long time.
• So forget solving this problem exactly!
Solving NP-hard optimization problems (like min coloring)
Options:
– Solve the problem exactly (but use lots of time on some inputs)
– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)
Phylogeny estimation is NP-hard, so
• Most methods that are used for maximum parsimony (or maximum likelihood) are heuristics that are not guaranteed to solve the problems exactly.
• Even the best methods can take a very long time (months or more) on some inputs, without being guaranteed to solve their problems well.
• You do not know how poor the solution is.
Hill-climbing for phylogeny estimation
Start with some tree and score it
Repeat
    Change the tree slightly, and see if the new tree has a better score
until no neighbor of your best tree has a better score (i.e., stop at a local optimum)
Return the best tree you found
Exploring “tree space”
Two problems:
1. Getting “stuck” in local optima
2. Taking too long to get to good solutions
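The loop above can be sketched generically. To keep the example short, it runs on a toy one-dimensional cost landscape, not on real tree space, and it always moves to the best improving neighbor. The landscape, the neighbor rule, and the function names are all assumptions for illustration.

```python
def hill_climb(start, cost, neighbors):
    """Move to the best improving neighbor until none improves."""
    best = start
    while True:
        improving = [x for x in neighbors(best) if cost(x) < cost(best)]
        if not improving:
            return best                    # local optimum reached
        best = min(improving, key=cost)


# Toy landscape: position i has cost COSTS[i]; neighbors are i-1 and i+1.
COSTS = [5, 3, 4, 6, 2, 1, 0, 3]


def cost(i):
    return COSTS[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(COSTS)]


stuck = hill_climb(0, cost, neighbors)
lucky = hill_climb(3, cost, neighbors)
print(stuck, cost(stuck))
print(lucky, cost(lucky))
```

Starting at position 0, the search stops at position 1 with cost 3, a local optimum. Starting at position 3, it reaches the global optimum at position 6 with cost 0. The search itself gives no indication of which case it is in, and phylogenetic heuristics face the same problem in tree space.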
“Solving” NP-hard phylogenetic estimation problems
[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]
Problems with current techniques for MP
Performance of TNT with time
[Plot: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), against hours of analysis (x-axis, 0 to 24)]
Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
Observations
• The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.
• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
• Apparent convergence can be misleading.
If a problem is NP-hard
• Some inputs you can solve correctly and quickly, using simple algorithms.
• Some inputs you can solve correctly but it will take a long time.
• Some algorithms will give incorrect answers on some inputs.
• You may not know if your answer is correct or not.
Lessons
• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.
• Therefore we still need better heuristics.
• Biologists should be cautious in believing that the trees found are actually “optimal”.
Reconstructing the “Tree” of Life
Handling large datasets: millions of species
The “Tree of Life” is not really a tree: reticulate evolution
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/
Acknowledgements
• Funding: NSF and the David and Lucile Packard Foundation
• Collaborators and students: Bernard Moret, Luay Nakhleh, Usman Roshan, and Tiffani Williams
General comments for NP-hard optimization problems
• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.
• You may not know when you have an optimal solution, if you use a heuristic.
• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?