Graphs and Graph Theory in Computational Biology


Dan Gusfield

Miami University, May 15, 2008

(four hour tutorial)

Expanded from:

September 2009: I will add to this lecture as new material is developed.

The goal of these lectures

To show examples of non-trivial Graph Theory (theorems) that arise in computational biology problems. There are many applications of graphs as means of displaying or organizing biological relationships, and algorithms that analyze those graphs, but many fewer examples of real graph theory in biology. These lectures are not intended to be all-inclusive.

Some examples of graphs in biology

• Taken from the web - see the citations for details. Many other examples of graphs more complex than trees in biology.


From Max Delbrueck Center, Berlin



From http://www-personal.umich.edu/~mejn/networks/

Yeast protein interactions



Protein-Protein Interactions



Protein-Protein Interaction Modelling, Dr. Peter Uetz, Institut fur Toxikologie und Genetik, Forschungszentrum Karlsruhe

http://www.nytimes.com/interactive/2008/05/05/science/20080506_DISEASE.html

NY Times May 5, 2008 The Diseasome

Graphs and Graph Theory

1. Numerous uses of graphs and networks to represent biological phenomena at many conceptual levels. Maybe several 1000s of papers using graph representations, particularly trees, but little graph theory.

2. A respectable number of papers that develop new non-trivial graph theory for problems in biology. 100s of papers, maybe 1000.

3. A handful of papers exploiting or extending non-trivial classic graph theory for problems in biology. Perhaps a few hundred.

Introduction and Conclusion

Very diverse biological applications and very diverse graph theory. So no single grand reason for graphs and no single graph topic in biology.

Lots of opportunity for graph theorists and graph algorithmists to develop or apply graph theory to biological problems. Even more opportunity for combinatorial optimization.

What I will do in this tutorial

• Emphasis on points 2 and 3, i.e., examples of the development of new non-trivial graph theory, and of the exploitation of classic graph theory. And (my apologies) I will mostly emphasize topics I have been involved with.

Still,

• There are some hot biological areas today where graphs arise, and some graph topics that recur commonly, and I should point those out even if I will not talk in detail on those topics.

The digression

• Hot biology: Network biology -- biological phenomena that are represented by networks -- gene regulatory networks and protein interaction networks, just to name two. These form the core of Systems biology. Other relationships in biology are represented by graphs and networks. Ex. diseasome.

• Recurring graph problems: graph problems in clustering data (ex. finding cliques or variants of cliques); variants of graph isomorphism in network motif or molecular pathway problems; need for more random graph theory for significance testing.

Clique Problems

Clique problems are recurrent in clustering applications, but true cliques are computationally hard to find. Suggested research for graph theorists and algorithmists: computationally tractable, biologically meaningful alternatives to cliques. As examples: maximum density subgraphs; extreme sets in a graph.

Subgraph density

• Given a graph G, and a subset S of its nodes, let G(S) be the subgraph of G induced by S, i.e, G(S) has node set S and edge set E(S) consisting of all edges in G both of whose ends are in S.

• A Maximum Density subgraph of G is induced by the set of nodes S which maximizes |E(S)|/|S|.

• The maximum density subgraph can be found in polynomial time. It has the flavor of a maximum clique, but has different properties.
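To make the definition concrete, here is a minimal sketch that finds a maximum density subgraph by trying every node subset. It is an exhaustive illustration only, not the polynomial-time method referred to above; the function name and the small example graph are assumptions made for illustration.

```python
from itertools import combinations

def max_density_subgraph(nodes, edges):
    """Return (S, density) maximizing |E(S)|/|S| by exhaustive search.

    Exponential in the number of nodes: a definition check, not the
    polynomial-time algorithm mentioned in the text.
    """
    best_set, best_density = None, -1.0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # E(S): edges of G with both ends in S
            induced = sum(1 for u, v in edges if u in chosen and v in chosen)
            density = induced / size
            if density > best_density:
                best_set, best_density = chosen, density
    return best_set, best_density

# Hypothetical example: a 4-clique {a, b, c, d} with a pendant node e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
best_set, best_density = max_density_subgraph(nodes, edges)
print(sorted(best_set), best_density)
```

In this example the densest subgraph happens to be the clique, but in general a maximum density subgraph need not be a clique.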

Extreme Sets

In an edge-weighted undirected graph G with node set V, a subset S of nodes of G is called an extreme set if for every nonempty proper subset S’ of S, the total weight of the edges crossing from S’ to V-S’ is larger than the total weight of the edges crossing from S to V-S.

All the extreme sets in a graph can be found in polynomial time.
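The definition can be checked directly on a small graph. The sketch below tests one candidate set by brute force, reading "subset" as "nonempty proper subset" (an assumption about the intended definition); it is exponential and is not the polynomial-time method mentioned above. The function names and the example weights are hypothetical.

```python
from itertools import combinations

def cut_weight(weights, subset):
    """Total weight of edges with exactly one end in `subset`."""
    subset = set(subset)
    return sum(w for (u, v), w in weights.items()
               if (u in subset) != (v in subset))

def is_extreme(weights, candidate):
    """True if every nonempty proper subset of `candidate` has a strictly larger cut."""
    candidate = list(candidate)
    target = cut_weight(weights, candidate)
    for size in range(1, len(candidate)):
        for part in combinations(candidate, size):
            if cut_weight(weights, part) <= target:
                return False
    return True

# Hypothetical example: a heavy triangle a, b, c attached to d by a light edge.
weights = {("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 3, ("c", "d"): 1}
print(is_extreme(weights, {"a", "b", "c"}))  # tightly knit, weakly attached
print(is_extreme(weights, {"a", "b"}))       # {a} alone has cut 6, not larger than 6
```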

Also

There is also a great need for more sophisticated application of random graph theory in the study of biological networks. This is needed in order to establish null models to use in assessing the statistical significance of subgraphs, paths, patterns and motifs that are found in biological networks.

We need to be able to distinguish observed patterns and subgraphs from those that occur with a high probability in a random graph, under a biologically appropriate model of randomness (an open field).

End of digression

Start of the main tutorial: Examples of Graph Theory in Bioinformatics and Computational Biology

Outline

• Three smaller examples: Euler paths and sequencing; Tanglegrams and co-evolution; Network Design and Multiple Alignment.

• Haplotyping by Perfect Phylogeny: Graph Realization.

• Phylogenetic Networks: Incompatibility Graph; Galled-Trees; Recombination Networks; The Decomposition Theorem and sufficient conditions.

• Multi-state Perfect Phylogeny and Chordal Graphs.

To start: Three small examples

1. Euler paths in sequencing and sequence assembly.

2. Tanglegrams and planarity testing in the study of co-evolution.

3. Application of Tree-Design approximations in multiple sequence alignment. Interplay between trees and strings.

Topic I: Eulerian paths in sequencing problems

The general situation is that we have a molecule S (DNA, say) whose sequence is unknown, but we know all the k-mers that occur in S, for some fixed k.

Given those k-mers, we want to determine S, if possible, or determine whatever is possible to determine about S. Note that k is not related to the alphabet size.

A very useful approach to problems of this type is to build an Eulerian digraph, based on the (k-1)-mers.

Euler graph for general k

For general k, there is one node for each (k-1)-mer contained in an observed k-mer. Then there is a directed edge from the node for (k-1)-mer A to the node for (k-1)-mer B if the (k-2)-suffix of A matches the (k-2)-prefix of B and overlapping A and B forms an observed k-mer.

Example: k = 5 and we observe the 5-mer XXYZW. Then there will be a node for XXYZ and a node for XYZW and a directed edge from the first node to the second node. Those two nodes and the directed edge between them represent the 5-mer XXYZW. In some applications, there will be one such edge for each observation of that 5-mer.

[Figure: the Euler graph derived from the sequence ACACGCAACTTAAA.]

If a triple is observed more than once, there should be one directed edge for each observation of the triple.

Ex. k = 3. The graph will have one node for each of the 2-mers in the observed 3-mers. Then there is a directed edge from the node for the 2-mer XY to the node for the 2-mer YZ, for any X, Z such that XYZ is an observed 3-mer.

The point: Every Eulerian path in the graph specifies a sequence whose k-mers match the given data, and conversely every sequence whose k-mers match the data specifies an Eulerian path in the graph. So the set of Eulerian paths specifies the set of candidate sequences for the unknown original sequence.

Algorithms exist for efficiently finding Eulerian paths, for counting their number, for determining uniqueness etc., so we can use this representation to study the set of candidate sequences.
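A minimal sketch of the construction, using the slide's sequence ACACGCAACTTAAA with k = 3: one edge per observed k-mer, then one Eulerian path found with Hierholzer's algorithm (a standard choice made here, not necessarily the one used in the lecture). All function names are illustrative. The path found spells some candidate sequence with the same k-mers, which need not be the original sequence.

```python
from collections import Counter, defaultdict

def kmers(sequence, k):
    """All k-mers of the sequence, with multiplicity, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_graph(observed_kmers):
    """One directed edge per observed k-mer: (k-1)-prefix -> (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in observed_kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = Counter(v for vs in graph.values() for v in vs)
    start = next(iter(graph))
    for node in graph:
        if out_deg[node] - in_deg[node] == 1:  # the path must start here
            start = node
            break
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Sequence spelled by a walk over (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

original = "ACACGCAACTTAAA"
observed = kmers(original, 3)
candidate = spell(eulerian_path(build_graph(observed)))
print(candidate)
```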

Compare this approach to earlier efforts to represent the set of candidates by a graph with a Hamilton path: each node represents an observed k-mer, not a (k-1)-mer.

In general there may be many Eulerian paths in the graph, and we want some additional criteria to distinguish the goodness of one Eulerian path compared to another.

Different biological considerations translate into having a value for each subpath of length two. Then the value of an Eulerian path P with n edges is the sum of the n-1 values of the n-1 length-two subpaths in P.

The problem is to find an Eulerian path with maximum value.

We have some reasonable approximations for that, but a simpler case can be solved optimally in polynomial time.

Making finer distinctions in Euler paths

The case of a binary alphabet, but arbitrary k

Since the alphabet size is two, each node in the graph has at most two incoming edges and two outgoing edges. Assume exactly two each.

[Figure: Ex. k = 4. A node with two incoming and two outgoing edges; node labels shown: 011, 110, 001, 101.]

At any node, there are two possible ways for an Euler path to pass through the node.

[Figure: Ex. k = 4, the turning pair at the node.]

[Figure: Ex. k = 4, the crossing pair at the node.]

So in terms of subpaths of length two, we have two choices at each node.

Restating the optimal Euler path problem

We are given an Eulerian graph where the in and out degrees are at most two at each node, and at each node there is a given value for the turning pair, and a value for the crossing pair. Then choose the turning or the crossing pairs at the nodes to maximize the total value of the choices, subject to the requirement that the choices create an Euler path in the graph.

Main Result

• The problem can be solved in polynomial time.

• The set of choices that give Euler paths has a matroidal structure, which allows a matroid-greedy algorithm to find the optimal Euler path.

• A more direct algorithm based on Minimum Spanning Trees also solves the problem.

The Matroid Structure

• At every node v, the edge pair (crossing or turning) which has the lowest value is called the low pair, and the other pair is the high pair. The difference in values is called the loss at v.

• A subset S of nodes is called independent if there is an Euler path in the graph where at every node in S, the low pair is chosen.

• As defined, the family of independent sets forms a matroid, and so we can find, by a greedy algorithm, an independent set which minimizes the loss - and this gives the optimal Euler path.

Topic II: Tanglegrams

• A Tanglegram is a pair of trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line.

• A straight line connects each leaf in one tree to the leaf with the same label in the other tree.

• The number of crossing lines is a measure of the similarity of the trees.
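As a small illustration, the sketch below counts crossing lines for one fixed drawing, i.e. assuming the left-to-right leaf order of each tree has already been chosen. Choosing the drawings of the two trees that minimize the count is the harder problem and is not attempted here. The function name and the leaf orders are hypothetical.

```python
from itertools import combinations

def count_crossings(top_order, bottom_order):
    """Number of crossing pairs among the lines joining equal labels.

    Two lines cross exactly when their leaves appear in opposite relative
    order on the two parallel lines, so this counts inversions.
    """
    position = {leaf: i for i, leaf in enumerate(bottom_order)}
    ranks = [position[leaf] for leaf in top_order]
    return sum(1 for i, j in combinations(range(len(ranks)), 2)
               if ranks[i] > ranks[j])

# Hypothetical leaf orders read off two drawn trees.
print(count_crossings(["a", "b", "c", "d"], ["b", "a", "d", "c"]))
print(count_crossings(["a", "b", "c", "d"], ["a", "b", "c", "d"]))
```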









Topic III: Multiple Sequence Alignment

Interplay between sequences and trees. Exploitation of network design approximation.

This topic was discussed on the board - slides are needed.

Intro to Hours 2 and 3: Two “Post-HGP” Topics

Two topics in Population Genomics

• SNP Haplotyping in populations

• Reconstructing a history of recombination

These topics in Population Genomics illustrate current challenges in biology, and illustrate the use of graph theory, combinatorial algorithms and discrete mathematics in biology.

What is population genomics?

• The Human genome “sequence” is done.

• Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility).

• Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.

SNP Data

• A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

• SNP maps have been compiled with a density of about 1 site per 1000.

• SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Haplotype Map Project: HAPMAP

• NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population.

• Association mapping: HAPMAP used to try to associate genetic-influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations.

• The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.

Topic IV: Perfect Phylogeny Haplotyping via Graph

Realization

Genotypes and Haplotypes

Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by SNPs).

Two haplotypes per individual:

0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0

Merge the haplotypes

Genotype for the individual:

2 1 2 1 0 0 1 2 0
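The merge operation can be written in a few lines. This sketch reproduces the example above: a site where the two haplotypes agree keeps that value, and a site where they differ becomes 2. The function name is illustrative.

```python
def merge(haplotype_1, haplotype_2):
    """Genotype of an individual carrying the two haplotypes."""
    return [a if a == b else 2 for a, b in zip(haplotype_1, haplotype_2)]

h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Merging loses information: a genotype with t sites of value 2 (t at least 1) is explained by 2^(t-1) different haplotype pairs, which is why the haplotyping problem needs a genetic model.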

Haplotyping Problem

• Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.

• Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.

The Perfect Phylogeny Model for SNP sequences

[Figure: a perfect phylogeny on sites 1 to 5. The ancestral sequence 00000 is at the root, the site mutations 1, 2, 3, 4, 5 label the edges, and the extant sequences are at the leaves.]

The tree derives the set M:

10100
10000
01011
01010
00010

Only one mutation per site allowed.

When can a set of sequences be derived on a perfect phylogeny?

Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test.
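The test is easy to state in code. The sketch below checks every pair of columns and reports the pairs that contain all four combinations; it is a direct quadratic-in-sites check for illustration, not the O(nm) method cited in this lecture. The function name is an assumption; the matrix is the set M derived by the tree shown earlier.

```python
from itertools import combinations

def incompatible_pairs(rows):
    """1-based site pairs whose two columns contain all of 00, 01, 10, 11."""
    n_sites = len(rows[0])
    bad = []
    for p, q in combinations(range(n_sites), 2):
        gametes = {(row[p], row[q]) for row in rows}
        if len(gametes) == 4:
            bad.append((p + 1, q + 1))
    return bad

M = ["10100", "10000", "01011", "01010", "00010"]
print(incompatible_pairs(M))              # [] : M passes the test
print(incompatible_pairs(M + ["10101"]))  # adding 10101 creates failing pairs
```

Adding the sequence 10101 makes site 5 fail the test against other sites, including the pair 4, 5.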

So, in the case of binary characters, if each pair of columns allows a tree, then the entire set of columns allows a tree.

For M of dimension n by m, the existence of a perfect phylogeny for M can be tested in O(nm) time and a tree built in that time, if there is one. Gusfield, Networks 91

We will use the classic theorem in two more modern and more genetic applications.

The Perfect Phylogeny Model

We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.

In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.

Justification: haplotype blocks; rare recombination; a base problem whose solution can be modified to incorporate more biological complexity.

Perfect Phylogeny Haplotype (PPH)

Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix S:

    sites
    1 2
a   2 2
b   0 2
c   1 0

A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2.

The PPH Problem

Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix S:

    1 2
a   2 2
b   0 2
c   1 0

Explaining haplotypes:

    1 2
a   1 0
a   0 1
b   0 0
b   0 1
c   1 0
c   1 0

[Figure: the perfect phylogeny for these haplotypes. The root is 00; mutations 1 and 2 are on separate edges from the root; leaves c, c, a carry haplotype 10; leaves a, b carry haplotype 01; the remaining leaf b carries 00.]

The Alternative Explanation

Genotype matrix S:

    1 2
a   2 2
b   0 2
c   1 0

Alternative haplotypes:

    1 2
a   1 1
a   0 0
b   0 0
b   0 1
c   1 0
c   1 0

No tree possible for this explanation.
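For an input this small, the PPH problem can be solved by trying every explanation. The sketch below enumerates the haplotype pairs for each genotype and keeps the combinations that pass the 4-gamete test once the all-0 root is added as an extra row. It is exponential and purely illustrative; it is not any of the efficient methods discussed in this lecture, and all names are assumptions.

```python
from itertools import combinations, product

def expansions(genotype):
    """All unordered haplotype pairs whose merge is the genotype."""
    twos = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    # Fix the first 2-site to 0 in the first haplotype to skip mirror images.
    for bits in product((0, 1), repeat=max(len(twos) - 1, 0)):
        assign = ((0,) + bits) if twos else ()
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(twos, assign):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def fits_perfect_phylogeny(haplotypes):
    """4-gamete test on the haplotypes plus the all-0 root sequence."""
    rows = list(haplotypes) + [tuple([0] * len(haplotypes[0]))]
    for p, q in combinations(range(len(rows[0])), 2):
        if len({(r[p], r[q]) for r in rows}) == 4:
            return False
    return True

def pph_solutions(genotypes):
    solutions = []
    for choice in product(*(expansions(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if fits_perfect_phylogeny(haplotypes):
            solutions.append(choice)
    return solutions

S = [(2, 2), (0, 2), (1, 0)]  # rows a, b, c of the genotype matrix
solutions = pph_solutions(S)
for choice in solutions:
    print(choice)
```

On S this finds exactly one explanation, the one with a = (0 1, 1 0); the alternative with a = (1 1, 0 0) is rejected because sites 1 and 2 then show all four pairs.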

Efficient Solutions to the PPH problem - n genotypes, m sites

• Reduction to a graph realization problem (GPPH) - build on the Bixby-Wagner or Fujishige solution to graph realization. O(nm alpha(nm)) time. Gusfield, Recomb 02

• Reduction to graph realization - build on Tutte’s graph realization method. O(nm^2) time. Chung, Gusfield 03

• Direct, from scratch combinatorial approach - O(nm^2) time. Bafna, Gusfield et al JCB 03

• Berkeley (EHK) approach - specialize the Tutte solution to the PPH problem - O(nm^2) time.

• Linear-time solutions - Recomb 2005, Ding, Filkov, Gusfield and a different linear time solution.

The Reduction Approach

This is the original polynomial time method. Conceptually simplest at a high level (but not at the implementation level) and most extendable to other problems; nearly linear-time but not linear-time.

The case of the 1’s

1) For any row i in S, the set of 1 entries in row i specifies the exact set of mutations on the path from the root to the least common ancestor of the two leaves labeled i, in every perfect phylogeny for S.

2) The order of those 1 entries on the path is also the same in every perfect phylogeny for S, and is easy to determine by “leaf counting”.

Leaf Counting

S:

        1 2 3 4 5 6 7
a       1 0 1 0 0 0 0
b       0 1 0 1 0 0 0
c       1 2 0 0 2 0 2
d       2 2 0 0 0 2 0

Count   5 4 2 2 1 1 1

In any column c, count two for each 1, and count one for each 2. The total is the number of leaves below mutation c, in every perfect phylogeny for S. So if we know the set of mutations on a path from the root, we know their order as well.
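The counting rule is short enough to check by machine. This sketch recomputes the Count row for the matrix S above and then orders the 1 entries of a row by decreasing count, which is the order in which those mutations appear on the path from the root. Function names are illustrative.

```python
def leaf_counts(S):
    """Per column: two for each 1 and one for each 2."""
    weight = {0: 0, 1: 2, 2: 1}
    return [sum(weight[row[c]] for row in S) for c in range(len(S[0]))]

def order_of_ones(S, i):
    """1-based sites with a 1 in row i, from the root downward."""
    counts = leaf_counts(S)
    sites = [c for c, value in enumerate(S[i]) if value == 1]
    return [c + 1 for c in sorted(sites, key=lambda c: -counts[c])]

S = [
    [1, 0, 1, 0, 0, 0, 0],  # a
    [0, 1, 0, 1, 0, 0, 0],  # b
    [1, 2, 0, 0, 2, 0, 2],  # c
    [2, 2, 0, 0, 0, 2, 0],  # d
]
print(leaf_counts(S))       # [5, 4, 2, 2, 1, 1, 1]
print(order_of_ones(S, 0))  # row a: mutation 1 above mutation 3
```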

Simple Conclusions

[Figure: subtree for row i data. Sites 1 to 7; row i: 0 1 0 1 2 2 2. The path from the root carries mutations 2, 4, 5 in that order.]

The order is known for the red mutations together with the leftmost blue(?) mutation.

But what to do with the remaining blue entries (2’s) in a row?

More Simple Tools

3) For any row i in S, and any column c, if S(i,c) is 2, then in every perfect phylogeny for S, the path between the two leaves labeled i, must contain the edge with mutation c.

Further, every mutation c on the path between the two i leaves must be from such a column c.

From Row Data to Tree Constraints

[Figure: sites 1 to 7; row i: 0 1 0 1 2 2 2. A path from the root carries mutations 2, 4, 5, with the two leaves labeled i below it.]

Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4, but we don’t know where to put 6 and 7.

The Graph Theoretic Problem

Given a genotype matrix S with n sites, and a red-blue subgraph for each row i, create a directed tree T where each integer from 1 to n labels exactly one edge, so that each subgraph is contained in T.

Powerful Tool: Tree and Graph Realization

• Let Rn be the integers 1 to n, and let P be an unordered subset of Rn. P is called a path set.

• A tree T with n edges, where each is labeled with a unique integer of Rn, realizes P if there is a contiguous path in T labeled with the integers of P and no others.

• Given a family P1, P2, P3…Pk of path sets, tree T realizes the family if it realizes each Pi.

• The graph realization problem generalizes the consecutive ones problem, where T is a path.

• More generally, each set specifies a fundamental cycle in the unknown graph.

Tree Realization Example

[Figure: realizing tree T, with edges labeled 1 to 8.]

P1: 1, 5, 8
P2: 2, 4
P3: 1, 2, 5, 6
P4: 3, 6, 8
P5: 1, 5, 6, 7

More generally, think of each path set as specifying a fundamental cycle containing the edges in the specified path.
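Checking that a given tree realizes a family of path sets is simple, even though finding such a tree is not. The sketch below performs only the check. The tree used here is one hand-built tree that happens to realize P1 to P5; it is an assumption for illustration and is not claimed to be the tree drawn on the slide. The node names are likewise hypothetical.

```python
from collections import defaultdict

def realizes(tree, path_set):
    """True if the labeled edges in path_set form one contiguous path in tree.

    `tree` maps each edge label to its two end nodes. In a tree, a connected
    set of edges in which no node has degree above two is a path.
    """
    degree = defaultdict(int)
    adjacency = defaultdict(list)
    for label in path_set:
        u, v = tree[label]
        degree[u] += 1
        degree[v] += 1
        adjacency[u].append(v)
        adjacency[v].append(u)
    if any(d > 2 for d in degree.values()):
        return False
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:  # connectivity check over the chosen edges only
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)

# Hypothetical realizing tree: edge label -> (end node, end node).
tree = {6: ("v0", "v1"), 5: ("v1", "v2"), 1: ("v2", "v3"), 8: ("v1", "w"),
        3: ("w", "x"), 2: ("v3", "y"), 4: ("y", "z"), 7: ("v0", "u")}
family = [{1, 5, 8}, {2, 4}, {1, 2, 5, 6}, {3, 6, 8}, {1, 5, 6, 7}]
print([realizes(tree, p) for p in family])
```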

Graph Realization

Polynomial time (almost linear-time) algorithms exist for the graph realization problem, given the family of fundamental cycles the unknown graph should contain – Whitney, Tutte, Cunningham, Edmonds, Bixby, Wagner, Gavril, Tamari, Fujishige, Lofgren 1930’s - 1980’s

Most of the literature on this problem is in the context of determining if a binary matroid is graphic.

The algorithms are not simple; none implemented before 2002.

Reducing PPH to graph realization

We solve any instance of the PPH problem by creating appropriate path sets, so that a solution to the resulting graph realization problem leads to a solution to the PPH problem instance.

The key issue: How to encode the needed subgraph for each row, and glue them together at the root.

From Row Data to Tree Constraints

[Figure: sites 1 to 7; row i: 0 1 0 1 2 2 2. A path from the root carries mutations 2, 4, 5, with the two leaves labeled i below it.]

Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4.

Encoding a Red-Blue directed path

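[Figure: the directed path with edges 2, 4, 5, and the same path with the glue edge U attached before edge 2. The path is forced in T.]

P1: U, 2
P2: U, 2, 4
P3: 2, 4
P4: 2, 4, 5
P5: 4, 5

U is a glue edge used to glue together the directed paths from the different rows.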

Now add a path set for the blues in row i.

[Figure: sites 1 to 7; row i: 0 1 0 1 2 2 2, with the path 2, 4, 5 from the root and the two leaves labeled i.]

P: 5, 6, 7

That’s the Reduction

The resulting path-sets encode everything that is known about row i in the input.

The family of path-sets are input to the graph-realization problem, and every solution to that graph-realization problem specifies a solution to the PPH problem, and conversely.

Whitney (1933?) characterized the set of all solutions to graph realization (based on the three-connected components of a graph) and Tarjan et al showed how to find these in linear time.

An implicit representation of all solutions

Whitney (1930) proved that a graph realization problem has a unique solution if and only if the graph is three-connected. That is, at least three nodes must be removed in order to disconnect the graph (assuming it is connected).

Whitney (1931) proved that if the solution is not unique, then there is a semi-unique decomposition of the graph into three-connected components, so that the graph realizations are in one-one correspondence with all the ways that these components can be ``twisted” relative to each other. So the number of solutions is 2^(number of three-connected components - 1).

Tree Realization Example

[Figure: realizing tree T, edges labeled 1 to 8, with edges added to create a fundamental cycle for each path.]

P1: 1, 5, 8
P2: 2, 4
P3: 1, 2, 5, 6
P4: 3, 6, 8
P5: 1, 5, 6, 7

Topic V: Phylogenetic Networks with Recombination

When can a set of sequences be derived on a perfect phylogeny?

Classic NASC: Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test.

Incompatible Sites

A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.

A richer model

[Figure: the perfect phylogeny from before (root 00000, mutations 1 to 5 on the edges), with one sequence added to M.]

M (sites 12345):

10100
10000
01011
01010
00010
10101 (added)

Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible.

Real sequence histories often involve recombination.

Sequence Recombination

[Figure: single crossover recombination. P = 10100 and S = 01011 recombine at recombination point 5 to give 10101.]

The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). This is a recombination of P and S at recombination point 5.
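Single-crossover recombination as described here is a one-line operation. The sketch reproduces the example with P = 10100, S = 01011 and recombination point 5; the function name is illustrative.

```python
def recombine(prefix_parent, suffix_parent, point):
    """Sites before `point` (1-based) come from P; sites from `point` on come from S."""
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

print(recombine("10100", "01011", 5))  # 10101
```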

Network with Recombination: ARG

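[Figure: the previous tree extended with a recombination node that joins P = 10100 and S = 01011 at point 5 to produce the new sequence 10101.]

The previous tree with one recombination event now derives all the sequences in M: 10100, 10000, 01011, 01010, 00010, 10101 (new).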

A Min ARG for Kreitman’s data



ARG created by SHRUB

An illustration of why we are interested in recombination: Association Mapping of Complex Diseases Using ARGs

Association Mapping

• A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs.
– Disease mutations: unobserved.

• A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.

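[Figure: an ARG on SNP sites 1 to 5 with root 00000, mutations on the edges and two recombination nodes (parents marked P and S). Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101. Labels shown: "0 1 0 0 1", "Disease mutation site", "SNPs".]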

Very Simplistic Mapping the Unobserved Mutation of Mendelian Diseases with ARGs

Diseased

Assumption (for now): A sequence is diseased iff it carries the single disease mutation

Where is the disease mutation?

[Figure: sites 1 to 5 with rows for d, e and f; undetermined positions are marked "?".]

What part of 01100 do d, e, f inherit?

The single disease mutation occurs between sites 2 and 3!

Mapping Disease Gene with Inferred ARGs

• “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005

• But we do not know the true ARG!

• Goal: infer ARGs from SNP data for association mapping
– Not easy and often approximation (e.g. Zollner and Pritchard)
– Improved results to do the inference Y. Wu (RECOMB 2007)

Results on Reconstructing the Evolution of SNP Sequences

• Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem

• Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping …

• Part III: Varied Biological Applications

• Part IV: Extension to Gene Conversion

• Part V: The Minimum Mosaic Model of Recombination

This talk will discuss topics in Part I.

Problem: If not a tree, then what?

If the set of sequences M cannot be derived on a perfect phylogeny (true tree) how much deviation from a tree is required?

We want a network for M that uses a small number of recombinations, and we want the resulting network to be as ``tree-like” as possible.

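[Figure: a galled-tree with mutations 1 to 5 on its edges and two recombination nodes (labeled 2 and 4, parents marked p and s). Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]

A tree-like network for the same sequences generated by the prior network.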

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a ``recombination cycle”.

Galled-Trees

• A phylogenetic network where no recombination cycles share an edge is called a galled tree.

• A cycle in a galled-tree is called a gall.

• Question: if M cannot be generated on a true tree, can it be generated on a galled-tree?

Results about galled-trees

• Theorem: Efficient (provably polynomial-time) algorithm to determine whether or not any sequence set M can be derived on a galled-tree.

• Theorem: A galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used over all possible phylogenetic-networks.

• Theorem: If M can be derived on a galled tree, then the Galled-Tree is ``nearly unique”. This is important for biological conclusions derived from the galled-tree.

Papers from 2003-2007.

Elaboration on Near Uniqueness

Theorem: The number of arrangements (permutations) of the sites on any gall is at most three, and this happens only if the gall has two sites.

If the gall has more than two sites, then the number of arrangements is at most two.

If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall) then the arrangement is forced and unique.

Theorem: All other features of the galled-trees for M are invariant.

A whiff of the ideas behind the results

Incompatible Sites

A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.

M:

    1 2 3 4 5
a   0 0 0 1 0
b   1 0 0 1 0
c   0 0 1 0 0
d   1 0 1 0 0
e   0 1 1 0 0
f   0 1 1 0 1
g   0 0 1 0 1

Incompatibility Graph G(M)

[Figure: G(M), drawn on the nodes 1, 3, 4 and 2, 5.]

Two nodes are connected iff the pair of sites are incompatible, i.e., fail the 4-gamete test.

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

The connected components of G(M) are very informative

• Theorem: The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.

• Theorem: When M can be derived on a galled-tree, all the incompatible sites in a gall must come from a single connected component C, and that gall must contain all the sites from C. Compatible sites need not be inside any blob.

• In a galled-tree the number of recombinations is exactly the number of non-trivial connected components in G(M), and hence is minimum over all possible phylogenetic networks for M.
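The incompatibility graph and its components can be computed directly from M. The sketch below builds the edges with the 4-gamete test and groups the sites with a small union-find (an implementation choice made here for brevity); the number of non-trivial components it reports is the lower bound from the first theorem. Function names are illustrative; M is the seven-sequence example used in this lecture.

```python
from itertools import combinations

def incompatibility_edges(rows):
    """1-based site pairs that fail the 4-gamete test."""
    n_sites = len(rows[0])
    return [(p + 1, q + 1) for p, q in combinations(range(n_sites), 2)
            if len({(r[p], r[q]) for r in rows}) == 4]

def nontrivial_components(n_sites, edges):
    """Connected components of G(M) that contain at least two sites."""
    parent = list(range(n_sites + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, q in edges:
        parent[find(p)] = find(q)
    groups = {}
    for site in range(1, n_sites + 1):
        groups.setdefault(find(site), []).append(site)
    return sorted(g for g in groups.values() if len(g) > 1)

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]  # a..g
edges = incompatibility_edges(M)
components = nontrivial_components(5, edges)
print(edges)       # [(1, 3), (1, 4), (2, 5)]
print(components)  # [[1, 3, 4], [2, 5]]
```

Two non-trivial components means at least two recombinations are needed in any network for this M.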

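[Figure: the galled-tree for M shown earlier (leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101, with recombination nodes labeled 2 and 4), next to the incompatibility graph on nodes 1, 3, 4 and 2, 5.]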

A Graph Theoretic Necessary Condition for a Galled-Tree

If M can be generated on a galled-tree, then the incompatibility graph must be a bipartite bi-convex graph. Other structural properties of the conflict graph can be deduced and exploited.

Galled-Tree Haplotyping

Problem: Given genotype matrix G, if there is no PPH solution for G, is there a haplotyping H for G such that H can be derived on a Galled-Tree?

A different Necessary Condition for a one-gall tree

1. There exists a set of sequences S such that for every pair of incompatible sites p,q, a single p,q state-pair appears in all sequences in S, and does not appear in any sequence outside S.

2. There must be a number x such that p < x < q, for each incompatible pair p,q.

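Example

H:

a   0 0 0 1 0 0
b   1 0 0 1 0 0
c   0 0 1 0 0 0
d   1 0 1 0 0 0
e   1 1 0 0 0 1
f   0 1 1 0 0 0
g   0 0 1 0 1 0

[Figure: a one-gall tree for H with mutations 1 to 6 on its edges and one recombination node (parents marked p and s). Leaf labels as extracted: a: 000100, b: 100100, c: 001000, d: 101000, f: 011000, g: 001010, e: 1010001.]

S = {e,d}, the sequences below the recombination node.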

Surprising Result - Yun Song

The necessary condition is also sufficient. Yun S. Song in TCBB 2006

Coming full circle - back to genotypes

When can a set of genotypes be explained by a set of haplotypes derived on a galled-tree, rather than on a perfect phylogeny?

The Song NASC can be translated into an ILP, using the part of the MinIncompat ILP that identifies which site pairs are incompatible.

For the one gall problem, the ILP formulation solves very efficiently (200 rows x 40 sites in seconds to minutes). So far, the 2-gall case does not solve well (ongoing work).

(Dan Brown, Gusfield 2006).

Change of Scope: Minimizing Recombinations in unconstrained networks

• Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts.

• We can solve this problem in poly-time for the special case of Galled-Trees.

• The minimization problem is NP-hard in general.

Minimization is an NP-hard Problem

What we have done:

1. Solve small data-sets optimally with exponential-time methods or with algorithms that work well in practice;

2. Efficiently compute lower and upper bounds on the number of needed recombinations.

3. Apply these methods to address specific biological and bio-tech questions.

The Decomposition Theorem

Since the minimization problem is NP-hard, we want to break up a problem into subproblems that can be solved separately and combined.

M:

    1 2 3 4 5
a   0 0 0 1 0
b   1 0 0 1 0
c   0 0 1 0 0
d   1 0 1 0 0
e   0 1 1 0 0
f   0 1 1 0 1
g   0 0 1 0 1

Incompatibility Graph G(M)

[Figure: G(M), drawn on the nodes 1, 3, 4 and 2, 5.]

Two nodes are connected iff the pair of sites are incompatible, i.e., fail the 4-gamete test.

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

The connected components of G(M) are very informative

For example we have the Theorem:

The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a ``recombination cycle”.

A maximal set of intersecting cycles forms a Blob

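[Figure: the earlier ARG with root 00000, mutations 1 to 5 on the edges and two recombination nodes (parents marked P and S); its intersecting recombination cycles form one blob.]

If directions on the edges are removed, a blob is a bi-connected component of the network.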

Blobbed Trees

• Contracting each blob in a network results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. Simple, but key insight.

• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.

The blobs are the non-tree-like parts of the network.

Ugly tangled network inside the blob.

Every network is a tree of blobs.

A network where every blob is a single cycle is a Galled-Tree.

A Simple Observation

In any network N for M, all sites from the same connected component of G(M) must appear together in a single blob in N.

The Decomposition Theorem

Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.

Example: Network for input M with one blob

[Figure: a network for M with a single blob, rooted at 00000, deriving the sequences a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]

[Figure: the incompatibility graph G(M) on nodes 1 to 5, beside the fully-decomposed network for M deriving the same sequences a through g.]

Moreover, the backbone tree is invariant over all the fully-decomposed networks for M, and can be determined in polynomial time.

So, we can find a network for M by solving the recombination minimization problem for each connected component of G(M) separately, and then connect those subnetworks in an invariant way.

Algorithmically

• Finding the tree part of the blobbed-tree is easy.

• Determining the sequences labeling the exterior nodes on any blob is easy.

• Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.

• It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

• That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.

However …

While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks.

That is, sometimes it pays to put sites from different connected components together on the same blob.

But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks.

The deepest result:

Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N.

JCB December 2007

Sufficient Conditions

Corollary

A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).

[Figure: a network over six sites, rooted at 000000, whose node labels include the sequence 100010.]

Sequences in M are in black. Sequence 100010 is not in M.

G(M) has two components. Each requires two recombinations, but this combined network needs only three.

G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.

A Practical Sufficient Condition

If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.

Another Practical Sufficient Condition

If M can be derived on a network N where the number of recombinations equals the (poly-computable) Haplotype Lower Bound, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks.

Topic VI: Perfect Phylogeny Extension to non-binary characters

We detail the case of three allowed states per character.

What is a Perfect Phylogeny for non-binary characters?

• Input consists of n sequences M with m sites (characters) each, where each site can take one of k states.

• In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k.

• T has n leaves, one for each sequence in M, labeled by that sequence.

• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C form a connected subtree of T. It follows that the subtrees for any C are node-disjoint.

Example: A perfect phylogeny for input M

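M (n = 5, m = 3, k = 3; rows are taxa 1-5, columns are characters A, B, C):

     A B C
  1  3 2 1
  2  2 3 2
  3  3 2 3
  4  1 1 3
  5  1 2 3

[Figure: a perfect phylogeny T for M. Its leaves are labeled by the five rows of M, and its interior nodes are labeled (3,2,3) and (1,2,3).]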

Example

[Figure: the same table M (n = 5, m = 3, k = 3) and perfect phylogeny, with the subtree for state 2 of character B highlighted.]

Perfect Phylogeny Problem

Given M, is there a Perfect Phylogeny for M?

Chordal Graphs

Basic Definition: A graph G is called Chordal if every cycle of length four or more contains a chord.

More useful result: A graph G is chordal if and only if every minimal vertex separator in G is a clique.

Chordal graphs have a large number of applications, based more on the separator result than on the basic definition. For example, a chordal graph on n nodes can have at most n maximal cliques and n-1 minimal vertex separators.
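A minimal sketch of a chordality test follows. It does not use either definition above directly. Instead it relies on the classic fact that a graph is chordal exactly when its vertices can be deleted one at a time, each being simplicial (its neighbours pairwise adjacent) at the moment of deletion. The helper names and the two toy graphs are assumptions made for illustration.

```python
def is_chordal(adj):
    """Test chordality by repeatedly deleting a simplicial vertex.

    adj maps each vertex to the set of its neighbours.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while adj:
        for v, nbrs in adj.items():
            # v is simplicial if every two of its neighbours are adjacent
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                break
        else:
            return False  # no simplicial vertex: a chordless cycle remains
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return True

def graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

square = graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
square_with_chord = graph(
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
)
print(is_chordal(square))             # a chordless 4-cycle
print(is_chordal(square_with_chord))  # the same cycle with chord a-c
```

This quadratic-or-worse loop is chosen for readability; linear-time tests exist in the chordal graph literature.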

Another Classic Chordal Graph Theorem

A graph G is chordal if and only if it is the intersection graph of a set S of subtrees of a tree T. Each node of G is a member of S.

[Figure: a chordal graph G on nodes a, b, c, d, e, f, g, beside a tree T whose nodes are labeled {b,c}, {b,c,d}, {c,d,e,g}, {a,e,g}, {a,e} and {e,f,g}.]

Relation to Perfect Phylogeny

In a perfect phylogeny T for a table E, for any character C and any state X of character C, the sub-forest of T induced by the nodes labeled (C,X) forms a single, connected subtree of T.

So, there is a natural set of subtrees of T induced by E.

Chordal Completion Approach to Perfect Phylogeny

[Figure: Table E, with rows 1-5 equal to (3,2,1), (2,3,2), (3,2,3), (1,1,3), (1,2,3) over characters A, B, C, beside the graph G(E), whose nodes are arranged in columns A, B, C with one node per state 1, 2, 3.]

Graph G(E) has one node for each character-state pair in E, and an edge between two nodes if and only if there is a row in E with both those character-state pairs.

Each row of table E induces a clique in G(E).

Classic Theorem

There is a perfect phylogeny for table E if and only if edges can be added to graph G(E) to make it a chordal, K-partite graph. If there is such a chordal graph, denote it by G’(E).

Note that if table E has K columns, then G(E) is a K-partite graph.

Theorem (Buneman 196?)
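The construction of G(E) can be sketched in a few lines of Python. The table is the five-row example over characters A, B, C; the function name `partition_intersection_graph` is an illustrative assumption. The last lines check the K-partite property by looking for edges between two states of the same character.

```python
from itertools import combinations

# Table E from the example: rows are taxa 1-5, columns are characters A, B, C.
E = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]
CHARACTERS = "ABC"

def partition_intersection_graph(table, characters):
    """Nodes are (character, state) pairs; each row contributes a clique."""
    nodes, edges = set(), set()
    for row in table:
        pairs = list(zip(characters, row))
        nodes.update(pairs)
        for u, v in combinations(pairs, 2):
            edges.add(frozenset((u, v)))
    return nodes, edges

nodes, edges = partition_intersection_graph(E, CHARACTERS)

# K-partite check: no edge joins two states of the same character.
same_character_edges = [e for e in edges if len({c for c, _ in e}) == 1]
print(len(nodes), len(edges), same_character_edges)
```

For this table the graph has nine nodes and twelve edges, and no edge joins two nodes of the same character.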

Deeper Result: If G’(E) exists

• Let C(E) be the graph derived from graph G’(E) as follows: create a node in C(E) for each maximal clique in G’(E), and create an edge (u,v) in C(E) iff the cliques for u and v in G’(E) share a node. Weight edge (u,v) by the number of shared nodes. Note that C(E) can be created from G’(E) in polynomial time.

• Any Maximum Spanning Tree T in C(E) is a perfect phylogeny for E. Actually, T can be found more directly in linear time from G’(E).

Perfect Phylogeny Results

The perfect phylogeny problem was open for about 20 years, but solved by Dress, Steel, Warnow and Kannan, Agarwalla and Fernandez-Baca.

For any fixed bound on the number of states per character, the Perfect Phylogeny Problem can be solved in polynomial time.

However, if the number of states per character is not bounded, then the problem is NP-Complete.

Also, for any fixed number of characters, the problem can be solved in polynomial time.

Dress-Steel solution for 3-state Perfect phylogeny given complete data (1991)

• Recode each site M(i) of M as three binary sites M’(i,1), M’(i,2), M’(i,3), each indicating the taxa that have state 1, 2, or 3.

• Theorem (DS) There is a 3-state perfect phylogeny for M, if and only if there is a binary-character perfect phylogeny for some subset of M’ consisting of exactly two of the columns M’(i,1), M’(i,2), M’(i,3), for each column i of M.

Example

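M (rows are taxa 1-5, columns are characters A, B, C):

     A B C
  1  3 2 1
  2  2 3 2
  3  3 2 3
  4  1 1 3
  5  1 2 3

M’ (one binary column per character-state pair):

     A,1 A,2 A,3 B,1 B,2 B,3 C,1 C,2 C,3
  1   0   0   1   0   1   0   1   0   0
  2   0   1   0   0   0   1   0   1   0
  3   0   0   1   0   1   0   0   0   1
  4   1   0   0   1   0   0   0   0   1
  5   1   0   0   0   1   0   0   0   1

[Figure: a compatible subset of the columns of M’ is highlighted.]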

Solved in Poly-Time by 2-SAT

As stated, the problem still seems like it would take exponential time to solve, but in fact it is easy to code the problem as a 2-SAT problem (Y. Wu) and hence it is solvable in polynomial time. The Dress-Steel paper gave an independent poly-time solution.
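The recoding and the Dress-Steel condition can be sketched as follows. This is a brute-force illustration that tries all 3^m ways of keeping two columns per site, not the 2-SAT encoding mentioned above. It assumes the standard fact that a set of binary columns has a (root-unknown) perfect phylogeny exactly when every pair passes the four-gamete test. Function names are illustrative.

```python
from itertools import combinations, product

# The 3-state example table M (rows are taxa 1-5, columns are characters A, B, C).
M = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]

def recode(M, states=(1, 2, 3)):
    """Map (site index, state) to the binary column M'(i, state)."""
    m = len(M[0])
    return {(i, s): tuple(int(row[i] == s) for row in M)
            for i in range(m) for s in states}

def compatible(x, y):
    """Two binary columns pass the four-gamete test."""
    return len(set(zip(x, y))) < 4

def dress_steel_choice(M, states=(1, 2, 3)):
    """Keep two of the three columns per site in every possible way and
    return the first kept set that is pairwise compatible, else None."""
    cols = recode(M, states)
    per_site = [list(combinations(states, 2)) for _ in range(len(M[0]))]
    for choice in product(*per_site):
        kept = [(i, s) for i, pair in enumerate(choice) for s in pair]
        if all(compatible(cols[a], cols[b]) for a, b in combinations(kept, 2)):
            return kept
    return None

kept = dress_steel_choice(M)
print(kept)
```

A non-None result certifies, via the theorem, that M has a 3-state perfect phylogeny; the exponential search is only reasonable for a handful of sites.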

Multi-State Perfect Phylogeny with Missing and Removable Data: Solutions via Chordal Graph Theory

Dan Gusfield

Recomb09, May 2009

The Perfect Phylogeny Model for binary sequences

[Figure: a tree with the ancestral sequence 00000 (sites 1 to 5) at the root, site mutations on the edges, and the extant sequences at the leaves.]

The tree derives the set M:
10100
10000
01011
01010
00010

Only one mutation per site allowed (infinite sites).

Beyond Binary; beyond SNPs

The binary perfect phylogeny model has been widely used in population genetics (four-gametes), phylogenetics (compatibility); and many problems and methods build on the model (haplotyping, networks with recombination).

But, non-binary, non-SNP data is becoming more important in population genomics: CNVs, full DNA sequence, micro-sats; and other applications in phylogenetics.

A 3-state perfect phylogeny

[Figure: the table M (n = 5 taxa, m = 3 sites, k = 3 states) with rows (3,2,1), (2,3,2), (3,2,3), (1,1,3), (1,2,3) over characters A, B, C, beside a 3-state perfect phylogeny for M with interior nodes (3,2,3) and (1,2,3).]

A formal definition of a k-state perfect phylogeny

• Input consists of n sequences M with m sites (characters) each, where each site can take one of k > 2 states (values).

• T has n leaves, one for each sequence X in M, labeled by X.

• Each node of T is labeled with an m-length sequence (not necessarily from M) where each site has a value from 1 to k.

• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C form a connected subtree of T. This is the convexity requirement.

This reflects the infinite alleles model more than the infinite sites model used in binary perfect phylogeny.

[Figure: the same table M (n = 5 taxa, m = 3 sites, k = 3 states) and perfect phylogeny, with the subtree for state 2 of character B highlighted.]

An alternative view of convexity

Arbitrarily root T at some node, and direct all the edges away from the root.

Then, any character C can mutate into a state s on at most one edge, but there are no edges where C mutates into its root state.

This view makes a k-state perfect phylogeny a more natural generalization of a binary perfect phylogeny.

[Figure: the same table M (n = 5, m = 3, k = 3) and perfect phylogeny, with a root node marked and the edges on which character B mutates labeled B.]

K-state Perfect Phylogeny Problems

Existence Problem: Given M and k, is there a k-state Perfect Phylogeny for M?

Missing Data Problem: For a given k, if there are cells in M without values, can values less than or equal to k be imputed so that the resulting matrix M’ has a k-state perfect phylogeny?

Handling missing data extends the utility of the perfect-phylogeny model.

Removable data

Given data that does not have a k-state perfect phylogeny, what is the minimum number of characters to remove so that the remaining data does have a k-state perfect phylogeny?

The paper gives a solution for k = 3, but there is no time to discuss it in this talk. Here we discuss the missing data problem for arbitrary k.

Status of the Existence Problem

Poly-time algorithm for 3 states, Dress-Steel (1993)

Poly-time algorithm for 3 or 4 states, Kannan-Warnow (1994)

Poly-time algorithm for any fixed number of states: polynomial in n and m, but exponential in k, Agarwalla and Fernandez-Baca (1994)

Speed up of the AFB method by Kannan-Warnow (1997)

When k is not fixed, the existence problem is NP-hard

The missing data challenge

The general AFB,KW algorithms that solve the existence problem are not easily adapted to handle the missing data problem. They seem to extend only by brute-force enumeration of imputed values.

So, we need another approach to the missing data problem.

Status of Missing Data problem

NP-complete even for k = 2; effective, practical approaches for k = 2. (GFB in COCOON 2007; Satya, Mukherjee, TCBB 2008)

Polynomial-time methods for a “directed” variant of k = 2.

No literature on the missing data problem for k > 2.

New work here: specialized ILP methods for k = 3, 4, 5 and a general solution for any fixed k. In this talk I will only discuss the general solution.

New approach to existence and missing data problems

Based on an old theorem and newer techniques.

Old theorem: Buneman’s theorem relating Perfect-Phylogeny to chordal graphs. (thirty-five years old)

Newer techniques and theorems: Minimal triangulations of a non-chordal graph to make it chordal. The literature on minimal triangulations is current and ongoing.

Definition: Chordal Graphs

A graph G is called Chordal if every cycle of length four or more contains a chord. Chordal graphs are also called triangulated graphs.

G

Buneman’s Approach to Perfect Phylogeny (1974)

[Figure: the input table M (n by m), with rows (3,2,1), (2,3,2), (3,2,3), (1,1,3), (1,2,3) over characters C1, C2, C3, beside its graph G(M), whose nodes are arranged in columns C1, C2, C3 with one node per state 1, 2, 3.]

Partition-Intersection Graph G(M) has one node for each character-state pair in M, and an edge between two nodes if and only if there is a row in M with both those character-state pairs.

Each row of table M induces a clique in G(M).

G(M) is the superposition of n cliques, one for each row of M.

Definitions

If M has m characters, then G(M) is an m-partite graph. The nodes associated with a single character (class in the partition) are given a distinct color.

An edge (u,v) not in G(M) is called legal if u and v do not have the same color.

Two nodes with the same color are called a mono-chromatic pair.

Buneman’s Theorem

There is a perfect phylogeny for M if and only if legal edges can be added to graph G(M) to make it chordal.

If there is such a chordal graph, denote it G’(M).

Theorem (Buneman 1974)

G’(M) is called a legal triangulation of G(M).

From Chordal Graph to Perfect Phylogeny

Fact: Given a legal triangulation G’(M), a Perfect Phylogeny for M can be constructed in linear time.

The algorithms are based on “perfect elimination orders” and “clique trees”, classic objects in the chordal graph literature.

Example

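M (rows are taxa A-D, columns are characters 1-3):

     1 2 3
  A  0 0 2
  B  0 1 0
  C  1 1 1
  D  1 2 2

[Figure: G(M) on the nodes (1,0), (1,1), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2), with the cliques induced by rows A, B, C and D labeled.]

Each node represents a character-state pair.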

A legal triangulation

[Figure: G’(M), a legal triangulation of G(M) for the same table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2), with the cliques for rows A, B, C, D and two further regions labeled X and Y.]
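Buneman's theorem can be checked by hand on the four-row table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2). The sketch below builds G(M), confirms it is not chordal, and then searches for single legal edges whose addition makes it chordal. The chordality test deletes simplicial vertices one at a time; all function names are illustrative assumptions.

```python
from itertools import combinations

# Table M: taxa A-D over characters 1-3.
ROWS = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]

def pi_graph(rows):
    """Partition intersection graph; nodes are (character, state) pairs."""
    adj = {}
    for row in rows:
        pairs = list(enumerate(row, start=1))
        for u in pairs:
            adj.setdefault(u, set())
        for u, v in combinations(pairs, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_chordal(adj):
    """Chordal iff vertices can be deleted one by one, each simplicial."""
    adj = {v: set(n) for v, n in adj.items()}
    while adj:
        for v, nbrs in adj.items():
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                break
        else:
            return False
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return True

def legal_single_edge_triangulations(adj):
    """Legal non-edges (endpoints of different characters) whose addition
    alone makes the graph chordal."""
    found = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u] or u[0] == v[0]:
            continue  # already an edge, or an illegal same-colour pair
        trial = {x: set(n) for x, n in adj.items()}
        trial[u].add(v)
        trial[v].add(u)
        if is_chordal(trial):
            found.append((u, v))
    return found

G = pi_graph(ROWS)
found = legal_single_edge_triangulations(G)
print(is_chordal(G), found)
```

In this example the only chordless cycle is (1,0), (2,1), (1,1), (3,2); its chord between (1,0) and (1,1) is illegal, so the search finds the single legal edge joining (2,1) and (3,2).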

Yields a Perfect Phylogeny

[Figure: the perfect phylogeny T for the same table M, with nodes A: 002, B: 010, C: 111, D: 122 and two further nodes 012 and 112, labeled X and Y.]

One node in T for each maximal clique in G’(M).

(Fact: every “clique-tree” of the chordal graph G’(M) is a perfect phylogeny for M.)
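One way to obtain a clique tree is as a maximum-weight spanning tree of the clique graph, weighting each pair of maximal cliques by the number of nodes they share. The sketch below applies this to the six maximal cliques of the example, written as the sequences 002, 010, 111, 122, 012, 112 that label the tree nodes, and then checks convexity. The clique list is hard-coded from the example, and the helper names are assumptions.

```python
from itertools import combinations

# Maximal cliques of G'(M) as sequences: position c holds the state of character c.
cliques = ["002", "010", "111", "122", "012", "112"]

def shared(u, v):
    """Number of character-state pairs two cliques have in common."""
    return sum(a == b for a, b in zip(u, v))

def maximum_spanning_tree(nodes, weight):
    """Kruskal's algorithm, heaviest edges first, over pairs that share a node."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for u, v in sorted(combinations(nodes, 2), key=lambda e: -weight(*e)):
        if weight(u, v) > 0 and find(u) != find(v):
            parent[find(u)] = find(v)
            tree.append((u, v))
    return tree

def is_convex(nodes, tree):
    """Every character-state pair must label a connected set of tree nodes."""
    for c in range(len(nodes[0])):
        for s in {n[c] for n in nodes}:
            group = {n for n in nodes if n[c] == s}
            start = next(iter(group))
            reached, stack = {start}, [start]
            while stack:
                x = stack.pop()
                for u, v in tree:
                    for a, b in ((u, v), (v, u)):
                        if a == x and b in group and b not in reached:
                            reached.add(b)
                            stack.append(b)
            if reached != group:
                return False
    return True

tree = maximum_spanning_tree(cliques, shared)
print(sorted(tuple(sorted(e)) for e in tree))
print(is_convex(cliques, tree))
```

Here the five pairs that share two nodes already form a spanning tree, so the maximum spanning tree is unique and joins 002 and 010 to 012, 111 and 122 to 112, and 012 to 112.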

What about Missing Data?

If M is missing data, build the partition intersection graph G(M) using the known data in M. Buneman’s theorem still holds:

Theorem: There is a perfect phylogeny for some imputation of missing data in M, if and only if there is a legal triangulation of G(M).

The legal triangulation gives a perfect phylogeny T for M with some imputed data, and then the imputed M’ can be obtained from T.

Example

[Figure: the table M with rows A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2 over characters 1-3, and its graph G(M).]

Example

[Figure: the same table with one missing entry, B: 0 ? 0, and the graph G(M) for it.]

The Key Problem

So the key problem, in both the Existence and the Missing Data problems, is how to find a legal triangulation, if there is one.

But, there is a robust and still expanding literature on efficient algorithms to find a minimal triangulation of a non-chordal graph.

Some triangulation problems are NP-hard (Tree-width, Minimizing the number of added edges).

The PI graph is conceptually perfect for modeling missing data.

Minimal triangulation

A triangulation of a non-chordal graph G is minimal if no proper subset of the added edges is a triangulation of G.

Clearly, if there is a legal triangulation G’(M) of G(M), then there is one that is a minimal triangulation. A minimal triangulation is good enough for us.

So we can take advantage of the minimal triangulation technology, and the contemporary literature. The minimal vertex separators are the key objects.

Minimal vertex separators

A set of nodes S whose removal separates vertices u and v is called a u,v separator. S is a minimal u,v separator if no proper subset of S is a u,v separator.

S is a “minimal separator” if it is a minimal u,v separator for some vertex pair u,v.

Minimal separator S crosses minimal separator S’ if S separates some pair of nodes in S’.

Crossing is a symmetric relation for minimal separators.

Example

[Figure: G(M) with the separators S and S’ marked.]

S = {(2,1), (3,2)} and S’ = {(1,0), (1,1)} are crossing minimal separators.

Example

[Figure: G(M) with the separators S and S’ marked.]

S = {(2,1), (1,1)} and S’ = {(1,0), (3,2)} are non-crossing minimal separators.

A lucky break for us: A complete characterization of the minimal triangulations of G was derived in 1997

Definition: Completing a minimal separator S means adding all the missing edges between pairs of nodes in S to make S a clique.

Capstone Theorem on Minimal Triangulations

Parra, Scheffler (1997): Every minimal triangulation of G is obtained by completing each minimal separator in a maximal set of pairwise non-crossing minimal separators of G.

Conversely, completing every minimal separator in a maximal set of pairwise non-crossing minimal separators yields a minimal triangulation of G.

Example:

[Figure: the table M with rows A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2 over characters 1-3, and its graph G(M).]

There are 6 minimal separators.

There are two maximal sets of 5 pairwise non-crossing minimal separators.
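These counts can be reproduced by brute force on the eight-node graph G(M) of the example table (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2). The sketch uses the standard characterization that S is a minimal separator exactly when removing S leaves at least two components whose neighbourhood is all of S. It enumerates every subset, so it is only meant for tiny graphs; the function names are assumptions.

```python
from itertools import combinations

ROWS = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # taxa A-D, characters 1-3

def pi_graph(rows):
    """Partition intersection graph; nodes are (character, state) pairs."""
    adj = {}
    for row in rows:
        pairs = list(enumerate(row, start=1))
        for u in pairs:
            adj.setdefault(u, set())
        for u, v in combinations(pairs, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def components(adj, removed):
    """Connected components after deleting the nodes in removed."""
    left = set(adj) - set(removed)
    comps = []
    while left:
        start = left.pop()
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in (adj[v] & left) - comp:
                comp.add(w)
                stack.append(w)
        left -= comp
        comps.append(comp)
    return comps

def minimal_separators(adj):
    """All S such that G - S has at least two full components."""
    seps, nodes = [], sorted(adj)
    for k in range(1, len(nodes) - 1):
        for S in map(frozenset, combinations(nodes, k)):
            full = [c for c in components(adj, S)
                    if set().union(*(adj[v] for v in c)) - c == S]
            if len(full) >= 2:
                seps.append(S)
    return seps

def crosses(adj, S, T):
    """S crosses T if nodes of T lie in different components of G - S."""
    return sum(1 for c in components(adj, S) if c & T) >= 2

def maximal_noncrossing_sets(adj, seps):
    ok = [set(c) for k in range(1, len(seps) + 1)
          for c in combinations(seps, k)
          if not any(crosses(adj, s, t) for s, t in combinations(c, 2))]
    return [c for c in ok if not any(c < d for d in ok)]

G = pi_graph(ROWS)
seps = minimal_separators(G)
S = frozenset({(2, 1), (3, 2)})
S_prime = frozenset({(1, 0), (1, 1)})
families = maximal_noncrossing_sets(G, seps)
print(len(seps), crosses(G, S, S_prime), [len(f) for f in families])
```

The only crossing pair in this graph is S = {(2,1), (3,2)} with S’ = {(1,0), (1,1)}, which is why each maximal non-crossing set keeps exactly one of the two.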

A minimal (illegal) triangulation obeying the P,S Theorem

[Figure: G’(M), a minimal but illegal triangulation of G(M) for the table M with rows A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2.]

A legal minimal triangulation

[Figure: G’(M), a legal minimal triangulation of the same graph G(M).]

Back to Perfect Phylogeny

A minimal separator S in the partition intersection graph G(M) is called legal if it does not use an edge between two nodes of the same color, and is called illegal if it does.

P,S Theorem can be used to prove the Main New Results

Theorem 1: There is a perfect phylogeny for M, even if M is missing data, if and only if there is a set Q of pairwise non-crossing, legal, minimal separators in G(M) that separate every mono-chromatic pair of nodes in G(M).

The legal minimal triangulation, obeying Theorem 1

[Figure: G’(M), the legal minimal triangulation of G(M), with the cliques for rows A, B, C and D labeled.]

From G’(M), we get a perfect phylogeny for M.

Corollaries to Theorem 1

Cor 1: If there is a mono-chromatic pair of nodes in G(M) that is not separated by any legal minimal separator, then M has no perfect phylogeny.

Cor 2: If G(M) has no illegal minimal separators, then M has a perfect phylogeny.

Cor 3: If every mono-chromatic pair of nodes is separated by some legal minimal separator, and no legal minimal separators cross, then M has a perfect phylogeny.

Recipe to solve the missing data problem with Theorem 1

Given M, find all legal minimal separators in G(M); for each legal minimal separator, determine which mono-chromatic pairs of nodes it separates, and which legal minimal separators it crosses.

Determine if any of the Corollaries hold. If so, either there is no perfect phylogeny (Cor. 1) or a set Q needed in Theorem 1 can be found greedily.

If no Cor. holds, set up and solve a (straightforward) integer linear program to find a set Q of pairwise non-crossing legal minimal separators that separate every mono-chromatic pair of nodes in G(M).

If the ILP is feasible, greedily extend Q to be a maximal set of pairwise non-crossing legal minimal separators, and use Q to get a legal triangulation G’(M) of G(M).

From G’(M), construct a perfect phylogeny T, and from T impute values for the missing entries.
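The first two steps of the recipe can be sketched by brute force for tiny inputs. The code below finds the minimal separators of G(M) by subset enumeration, keeps the legal ones (no two nodes of the same character), and reports which corollary, if any, decides the instance. Missing entries are written as None, the ILP step is omitted, and every function name is an assumption made for illustration.

```python
from itertools import combinations

def pi_graph(rows):
    """Partition intersection graph from the known entries (None = missing)."""
    adj = {}
    for row in rows:
        pairs = [(c, s) for c, s in enumerate(row, start=1) if s is not None]
        for u in pairs:
            adj.setdefault(u, set())
        for u, v in combinations(pairs, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def components(adj, removed):
    left = set(adj) - set(removed)
    comps = []
    while left:
        start = left.pop()
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in (adj[v] & left) - comp:
                comp.add(w)
                stack.append(w)
        left -= comp
        comps.append(comp)
    return comps

def minimal_separators(adj):
    """S is minimal iff G - S has two components whose neighbourhood is S."""
    seps, nodes = [], sorted(adj)
    for k in range(1, len(nodes) - 1):
        for S in map(frozenset, combinations(nodes, k)):
            full = [c for c in components(adj, S)
                    if set().union(*(adj[v] for v in c)) - c == S]
            if len(full) >= 2:
                seps.append(S)
    return seps

def is_legal(S):
    chars = [c for c, _ in S]
    return len(chars) == len(set(chars))

def separates(adj, S, u, v):
    if u in S or v in S:
        return False
    return not any(u in c and v in c for c in components(adj, S))

def crosses(adj, S, T):
    return sum(1 for c in components(adj, S) if c & T) >= 2

def check_corollaries(rows):
    adj = pi_graph(rows)
    seps = minimal_separators(adj)
    legal = [S for S in seps if is_legal(S)]
    mono = [(u, v) for u, v in combinations(sorted(adj), 2) if u[0] == v[0]]
    if not all(any(separates(adj, S, u, v) for S in legal) for u, v in mono):
        return "no perfect phylogeny (Cor 1)"
    if len(legal) == len(seps):
        return "perfect phylogeny exists (Cor 2)"
    if not any(crosses(adj, S, T) for S, T in combinations(legal, 2)):
        return "perfect phylogeny exists (Cor 3)"
    return "undecided: an ILP is needed"

complete = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]
missing = [(0, 0, 2), (0, None, 0), (1, 1, 1), (1, 2, 2)]
print(check_corollaries(complete))
print(check_corollaries(missing))
```

On the complete example table the one illegal separator {(1,0), (1,1)} rules out Cor 2, but the five legal separators do not cross, so Cor 3 applies. With the entry of taxon B for character 2 missing, G(M) has no illegal minimal separators and Cor 2 applies.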

Conceptually nice, but

Does it work in practice?

It works surprisingly (shockingly) well

Simulations with data from program ms, characteristic of many current applications in phylogenetics and population genetics, but not genomic scale or tree-of-life scale.

Surprising empirical results

The minimal separators are found quickly by existing algorithms from 1999: cubic time per minimal separator, but we have methods (not in the paper) to speed this up.

When there is no missing data, all the legal minimal separators can be found in O(nm^2) worst-case time, for any fixed k.

The observed number of minimal separators is small. There are few crossing pairs of legal minimal separators.

Until a large percentage of the data is missing, most problems are solved by the Corollaries, without the need for an ILP.

The ILPs solve quickly in practice: all have solved in 0.00 CPLEX-reported seconds (CPLEX 11 on a 2.5 GHz machine).

Most solve in the CPLEX pre-processor.

When an ILP is needed, it has been tiny. For the existence problem, the size of the ILP is polynomially bounded.

So

Although the chordal graph approach may at first seem impractical, it works on a large range of data, with sizes and degrees of missing data that are typical of current phylogenetic problems.

More structure

The empirical results suggest the existence of more combinatorial structure in the perfect-phylogeny problem. And more has been recently found.

(F. Lam) When k = 3, a necessary and sufficient condition (NASC) for the existence of a 3-state perfect phylogeny is:

Every mono-chromatic pair of nodes in G(M) is separated by some legal minimal separator. (Compare to Theorem 1).

This does not extend to k = 4.

All software to replicate these results will be available on my UC Davis website, shortly.
