
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Date post: 09-Feb-2016
Page 1: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Tandy Warnow
Department of Computer Sciences
University of Texas at Austin

Page 2: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Reconstructing the “Tree” of Life

• Handling large datasets: millions of species

• The “Tree of Life” is not really a tree: reticulate evolution

Page 3: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Evolution informs about everything in biology

• Big genome sequencing projects just produce data -- so what?

• Evolutionary history relates all organisms and genes, and helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans

Page 4: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Page 5: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Challenges in estimating phylogenies

• Computational: almost all “good” approaches for estimating phylogenies involve solving NP-hard problems

• Statistical

• Data

Page 6: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Major methods for phylogeny reconstruction

• Biology: heuristics for NP-hard optimization problems

• Linguistics: an exact algorithm for an NP-hard optimization problem

Page 7: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Outline for the rest of the talk

• NP-hard and polynomial time problems

• Phylogeny reconstruction in biology: the challenge is to develop better heuristics for NP-hard problems

• Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly

Page 8: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

A polynomial-time problem

• 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

• Greedy Algorithm. Start with one vertex and make it red, then make all its neighbors blue, and keep going, alternating colors (restarting from an uncolored vertex if the graph is disconnected). If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored; otherwise it cannot.

• Running time: O(n+m), where n is the number of vertices and m is the number of edges.
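The greedy procedure can be sketched as a breadth-first search. This is a minimal illustration, not code from the talk: the function name `two_color`, the 0/1 encoding of red/blue, and the vertex numbering 0..n-1 are assumptions made for the example.

```python
from collections import deque

def two_color(n, edges):
    """Return a list of colors (0 = red, 1 = blue), or None if the graph
    on vertices 0..n-1 cannot be 2-colored."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    color = [None] * n
    for start in range(n):              # restart once per connected component
        if color[start] is not None:
            continue
        color[start] = 0                # make the first vertex red
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]   # neighbors get the other color
                    queue.append(v)
                elif color[v] == color[u]:
                    return None               # an edge joins two equal colors
    return color

square = two_color(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
triangle = two_color(3, [(0, 1), (1, 2), (2, 0)])
```

Each vertex is enqueued once and each edge is examined twice, which is where the O(n+m) bound comes from.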


Page 11: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

Page 12: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

A brute-force solution seems to require O(3^n) time, where n is the number of vertices.
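A brute-force search of this kind might look like the following sketch. The function name and the encoding of the three colors as 0, 1, 2 are assumptions for illustration; the loop simply tries every one of the 3^n assignments.

```python
from itertools import product

def three_color_brute_force(n, edges):
    """Try all 3**n colorings of vertices 0..n-1; return the first proper
    one as a tuple, or None if no proper 3-coloring exists."""
    for coloring in product(range(3), repeat=n):
        if all(coloring[u] != coloring[v] for u, v in edges):
            return coloring
    return None

triangle_coloring = three_color_brute_force(3, [(0, 1), (1, 2), (2, 0)])

# The complete graph on four vertices needs four colors.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_coloring = three_color_brute_force(4, k4_edges)
```

Even at a billion colorings per second, n = 50 would already be far out of reach, which is the point of the slide.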

Page 13: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

• Some decision problems can be solved in polynomial time:
– Can graph G be 2-colored?
– Does graph G have an Eulerian tour?

• Some decision problems seem not to be solvable in polynomial time:
– Can graph G be 3-colored?
– Does graph G have a Hamiltonian cycle?

Page 14: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

• This problem is provably NP-hard. What does this mean?

Page 15: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

P vs. NP, continued

• The “big” question in theoretical computer science is:
– Is it possible to solve an NP-hard problem in polynomial time?

• If the answer is “yes”, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Page 16: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Coping with NP-hard problems

Since NP-hard problems may not be solvable in polynomial time, the options are:
– Solve the problem exactly (but use lots of time on some inputs)
– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

Page 17: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

General comments for NP-hard optimization problems

• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.

• You may not know when you have an optimal solution, if you use a heuristic.

• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?

Page 18: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

DNA Sequence Evolution

[Figure: a tree of DNA sequences evolving over time, with a time axis running from -3 mil yrs to today. Sequences shown, oldest first:
-3 mil yrs: AAGACTT
-2 mil yrs: TGGACTT, AAGGCCT
-1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT
today: TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT]

Page 19: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Molecular Systematics

U = TAGCCCA
V = TAGACTT
W = TGCACAA
X = TGCGCTT
Y = AGGGCAT

[Figure: a tree with leaves U, V, W, X, Y reconstructed from the five sequences]

Page 20: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Maximum Parsimony

• Given a set S of sequences of the same length over the nucleotide alphabet {A,C,T,G}, find a tree leaf-labelled by S, with other DNA sequences (of the same length) labelling the internal nodes, so as to minimize the “length” of the tree (the sum of the Hamming distances on the edges).

• NP-hard!
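To make the objective concrete, the sketch below computes the “length” of one fixed, fully labelled tree. It does not solve Maximum Parsimony, which would have to search over all trees and all internal labellings. The tree shape, the node names `a`, `b`, `c`, `d`, `x`, `y`, and the choice of sequences are hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def tree_length(edges, label):
    """Sum of Hamming distances over the edges of a fully labelled tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)

# Hypothetical tree: leaves a, b hang off internal node x; c, d off y.
label = {
    "a": "AAGACTT", "b": "AAGGCCT",
    "c": "TGGACTT", "d": "TAGACTT",
    "x": "AAGACTT", "y": "TGGACTT",
}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
length = tree_length(edges, label)
```

Here the edge costs are 0, 2, 2, 0 and 1, so this particular labelled tree has length 5.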

Page 21: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Solving NP-hard problems exactly is … unlikely

• Number of (unrooted) binary trees on n leaves is (2n-5)!!

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       4.5 x 10^190
1000      2.7 x 10^2900
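The count (2n-5)!! is the product 3 x 5 x ... x (2n-5), and is easy to evaluate exactly with arbitrary-precision integers. The helper name below is an assumption for this sketch; it reproduces the small entries of the table, and the large entries, which the slide quotes only approximately, can be checked by calling it with n = 20, 100 or 1000.

```python
def num_unrooted_binary_trees(n):
    """Exact value of (2n-5)!! = 3 * 5 * ... * (2n-5) for n >= 3 leaves."""
    if n < 3:
        raise ValueError("n must be at least 3")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd numbers 3, 5, ..., 2n-5
        count *= k
    return count

small_counts = {n: num_unrooted_binary_trees(n) for n in range(4, 11)}
digits_for_1000 = len(str(num_unrooted_binary_trees(1000)))
```

Multiplying the count by the per-tree analysis time gives the total time for an exhaustive search, so the same function can be used to redo the running-time estimate under other assumptions.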

Page 22: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Research: we try to develop better heuristics

[Figure: Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. x-axis: hours (0 to 24). y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2). Two curves: current best techniques, and DCM boosted version of best techniques.]

Page 23: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Summary (so far)

• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.

• The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.

Page 24: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Page 25: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Phylogenies of Languages

• Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear)

• The result can be modelled as a rooted tree

• The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!

Page 26: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

“Homoplasy-Free” Evolution (perfect phylogenies)

[Figure: two example trees, one labelled YES (homoplasy-free) and one labelled NO]

Page 27: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Historical Linguistic Data

• A character is a function that maps a set of languages, L, to a set of states.

• Three kinds of characters:
– Phonological (sound changes)
– Lexical (meanings based on a wordlist)
– Morphological (grammatical features)

Page 28: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Cognate Classes

• Two words w1 and w2 are in the same cognate class if they evolved from the same word through sound changes.

• French “champ” and Italian “campo” are both descendants of Latin “campus”; thus the two words belong to the same cognate class.

• Spanish “mucho” and English “much” are not in the same cognate class.

Page 29: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

The Ringe-Warnow Model of Language Evolution

• The nodes of the tree which contain elements of the same cognate class should form a rooted connected subgraph of the true tree

• The model is known as Character Compatibility, or Perfect Phylogeny.

Page 30: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Perfect Phylogeny

• A phylogeny T for a set S of taxa is a perfect phylogeny if each state of each character occupies a subtree (no character has back-mutations or parallel evolution)


Page 31: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Perfect phylogenies, cont.

• A=(0,0), B=(0,1), C=(1,3), D=(1,2) has a perfect phylogeny!

• A=(0,0), B=(0,1), C=(1,0), D=(1,1) does not have a perfect phylogeny!

Page 32: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

A perfect phylogeny

• A = 0 0
• B = 0 1
• C = 1 3
• D = 1 2

[Figure: a tree on the nodes A, B, C, D]

Page 33: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

A perfect phylogeny

• A = 0 0
• B = 0 1
• C = 1 3
• D = 1 2
• E = 0 3
• F = 1 3

[Figure: a tree on the nodes A, B, C, D, E, F]
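The defining condition, that each state of each character occupies a connected subtree, can be checked directly once every node of a tree carries a label. The sketch below assumes a particular tree shape that the transcript does not state: E and F are taken to be internal nodes, with A and B attached to E, C and D attached to F, and an edge between E and F. The function name is also an assumption.

```python
from collections import deque

def is_perfect_phylogeny(edges, label):
    """Check that, for every character, the nodes sharing a state induce a
    connected subgraph of the tree. `label` maps node -> tuple of states."""
    adj = {v: set() for v in label}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    num_chars = len(next(iter(label.values())))
    for c in range(num_chars):
        for state in {lab[c] for lab in label.values()}:
            nodes = {v for v, lab in label.items() if lab[c] == state}
            start = next(iter(nodes))
            seen = {start}
            queue = deque([start])
            while queue:                       # BFS restricted to `nodes`
                u = queue.popleft()
                for w in adj[u]:
                    if w in nodes and w not in seen:
                        seen.add(w)
                        queue.append(w)
            if seen != nodes:                  # state appears in two pieces
                return False
    return True

# Assumed tree shape: A, B attached to E; C, D attached to F; E joined to F.
edges = [("A", "E"), ("B", "E"), ("E", "F"), ("C", "F"), ("D", "F")]
good = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2),
        "E": (0, 3), "F": (1, 3)}
# Same shape with the four taxa (0,0), (0,1), (1,0), (1,1) and a hypothetical
# internal labelling: state 1 of the second character is split between B and D.
bad = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1),
       "E": (0, 0), "F": (1, 0)}
results = (is_perfect_phylogeny(edges, good), is_perfect_phylogeny(edges, bad))
```

This only verifies a given labelled tree; deciding whether any suitable tree and internal labelling exist is the harder problem discussed next.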

Page 34: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

The Perfect Phylogeny Problem

• Given a set S of taxa (species, languages, etc.) determine if a perfect phylogeny T exists for S.

• The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).

Page 35: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Triangulated Graphs

• A graph is triangulated if it has no induced simple cycles of size four or more (equivalently, every cycle of size four or more has a chord).
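One simple way to test whether a graph is triangulated (chordal) uses a standard fact: a graph is triangulated exactly when its vertices can be deleted one at a time, each time removing a vertex whose remaining neighbors are pairwise adjacent. The sketch below is a plain, unoptimized version of that elimination test, not the fastest known method; the function name is an assumption.

```python
def is_triangulated(vertices, edges):
    """Repeatedly delete a vertex whose remaining neighbors form a clique.
    The graph is triangulated iff every vertex can be deleted this way."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    remaining = set(vertices)
    while remaining:
        simplicial = None
        for v in remaining:
            nbrs = adj[v] & remaining
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                simplicial = v
                break
        if simplicial is None:
            return False        # every remaining vertex lies on a chordless cycle
        remaining.remove(simplicial)
    return True

four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
plain = is_triangulated(range(4), four_cycle)
with_chord = is_triangulated(range(4), four_cycle + [(0, 2)])
```

The 4-cycle fails the test, while adding the chord (0, 2) splits it into two triangles and makes it pass.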

Page 36: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Triangulating Colored Graphs: An Example

A graph that can be c-triangulated


Page 38: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Triangulating Colored Graphs: An Example

A graph that cannot be c-triangulated

Page 39: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Triangulating Colored Graphs (TCG)

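Triangulating Colored Graphs: given a vertex-colored graph G, determine if G can be c-triangulated, that is, whether edges can be added to G to make it triangulated without ever joining two vertices of the same color.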

Page 40: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

The PP and TCG Problems

• Buneman’s Theorem: A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated.

• The PP and TCG problems are polynomially equivalent and NP-hard.


Page 41: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

A no-instance of Perfect Phylogeny

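• A = 0 0
• B = 0 1
• C = 1 0
• D = 1 1

[Figure: the partition intersection graph, with vertices 0 and 1 for the first character and vertices 0 and 1 for the second character]

An input to perfect phylogeny (left) of four sequences described by two characters, and its partition intersection graph. Note that the partition intersection graph is 2-colored.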
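For two characters the test suggested by Buneman's Theorem becomes very simple. The partition intersection graph has one vertex per (character, state) pair and an edge whenever some taxon exhibits both states; with only two colors, no edge can be added inside a cycle without joining two vertices of the same color, so the graph can be c-triangulated exactly when it has no cycle at all. The sketch below builds the graph and checks acyclicity with a union-find structure; the function names and the data layout are assumptions.

```python
def partition_intersection_graph(taxa):
    """taxa maps name -> tuple of states. Vertices are (character, state)
    pairs; two vertices are adjacent if some taxon has both states."""
    vertices, edges = set(), set()
    for states in taxa.values():
        nodes = list(enumerate(states))
        vertices.update(nodes)
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                edges.add((nodes[i], nodes[j]))   # set removes repeated edges
    return vertices, edges

def is_acyclic(vertices, edges):
    """Union-find cycle test: an edge inside one component closes a cycle."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]         # path halving
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

no_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
yes_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2)}
no_result = is_acyclic(*partition_intersection_graph(no_instance))
yes_result = is_acyclic(*partition_intersection_graph(yes_instance))
```

For the no-instance the four edges form a 4-cycle on the vertices (0,0), (1,0), (0,1), (1,1), so the test fails. With more than two characters acyclicity is no longer the right criterion, and the general c-triangulation question has to be answered instead.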

Page 42: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Solving the PP Problem Using Buneman’s Theorem

“Yes” Instance of PP:

      c1   c2   c3
s1    3    2    1
s2    1    2    2
s3    1    1    3
s4    2    1    1


Page 44: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Some special cases are easy

• Binary character perfect phylogeny solvable in linear time

• r-state characters solvable in polynomial time for each r (combinatorial algorithm)

• Two character perfect phylogeny solvable in polynomial time (produces 2-colored graph)

• k-character perfect phylogeny solvable in polynomial time for each k (produces k-colored graphs -- connections to Robertson-Seymour graph minor theory)
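For the binary case there is a well-known pairwise criterion: a set of binary characters has a perfect phylogeny exactly when no pair of characters exhibits all four combinations 00, 01, 10 and 11 among the taxa (often called the four-gamete test). The sketch below applies that criterion directly. It is a quadratic pairwise check, not the linear-time algorithm the slide refers to, and the function name and matrix layout are assumptions.

```python
from itertools import combinations

def binary_characters_compatible(matrix):
    """matrix is a list of rows, one per taxon, each a sequence of 0/1
    character states. Returns True if every pair of characters shows at
    most three of the four combinations 00, 01, 10, 11."""
    if not matrix:
        return True
    num_chars = len(matrix[0])
    for i, j in combinations(range(num_chars), 2):
        seen = {(row[i], row[j]) for row in matrix}
        if len(seen) == 4:
            return False
    return True

four_taxa = [(0, 0), (0, 1), (1, 0), (1, 1)]
three_chars = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0)]
checks = (binary_characters_compatible(four_taxa),
          binary_characters_compatible(three_chars))
```

The first matrix is the no-instance from the earlier slide, read as two binary characters; the second is a made-up matrix whose three characters are pairwise compatible.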


Page 45: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Constructing trees in historical linguistics

• Maximum Compatibility: given the input matrix for the set S of languages described by the set C of characters, find a tree T leaf-labelled by S on which a maximum number of the characters in C are compatible (i.e., evolve without homoplasy).

• NP-hard.

Page 46: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

The Indo-European (IE) Dataset

• 24 languages

• 22 phonological characters, 15 morphological characters, and 333 lexical characters.

• Total number of working characters is 390 (multiple character coding, and parallel development)

• A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow)

• T is compatible with all but 16 characters

• Resolves most of the significant controversies in Indo-European evolution; shows however that Germanic is a problem (not treelike)

Page 47: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Phylogenetic Tree of the IE Dataset (Ringe, Warnow, and Taylor)

Page 48: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Explaining remaining incompatibilities

• We modelled the remaining incompatibilities as undetected borrowing between languages.

• This leads to the mathematical model of “perfect phylogenetic networks”

Page 49: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Modelling borrowing: Networks and Trees within Networks

Page 50: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

“Perfect Phylogenetic Network” for IE

Nakhleh et al., Language 2005

Page 51: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Summary

• NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions

• Many real problems have beautiful and natural combinatorial and graph-theoretic formulations

Page 52: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Acknowledgements

• NSF and the David and Lucile Packard Foundation (funding)

• Collaborators Bernard Moret (UNM CS), Donald Ringe (Penn Linguistics)

• Students: Usman Roshan and Luay Nakhleh

Page 53: Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Phylolab, U. Texas

Please visit us at http://www.cs.utexas.edu/users/phylo/

