Graphs and Graph Theory in Computational Biology

Page 1: Graphs and Graph Theory in Computational Biology

Graphs and Graph Theory in Computational Biology

Dan Gusfield

Expanded from: Miami University, May 15, 2008 (four hour tutorial)

September 2009: I will add to this lecture as new material is developed.

Page 2: Graphs and Graph Theory in Computational Biology

The goal of these lectures

To show examples of non-trivial graph theory (theorems) that arise in computational biology problems. There are many applications of graphs as a means of displaying or organizing biological relationships, and of algorithms that analyze those graphs, but far fewer examples of real graph theory in biology. These lectures are not intended to be all-inclusive.

Page 3: Graphs and Graph Theory in Computational Biology

Some examples of graphs in biology

• Taken from the web; see the citations for details. There are many other examples in biology of graphs more complex than trees.

Page 4: Graphs and Graph Theory in Computational Biology

From Max Delbrueck Center, Berlin

Page 5: Graphs and Graph Theory in Computational Biology

Yeast protein interactions

From http://www-personal.umich.edu/~mejn/networks/

Page 6: Graphs and Graph Theory in Computational Biology

Protein-Protein Interactions

Page 7: Graphs and Graph Theory in Computational Biology

Protein-Protein Interaction Modelling. Dr. Peter Uetz, Institut für Toxikologie und Genetik, Forschungszentrum Karlsruhe

Page 8: Graphs and Graph Theory in Computational Biology

http://www.nytimes.com/interactive/2008/05/05/science/20080506_DISEASE.html

The Diseasome, NY Times, May 5, 2008

Page 9: Graphs and Graph Theory in Computational Biology

Graphs and Graph Theory

1. Numerous uses of graphs and networks to represent biological phenomena at many conceptual levels. Maybe several 1000s of papers using graph representations, particularly trees, but little graph theory.

2. A respectable number of papers that develop new non-trivial graph theory for problems in biology. 100s of papers, maybe 1000.

3. A handful of papers exploiting or extending non-trivial classic graph theory for problems in biology. Perhaps a few hundred.

Page 10: Graphs and Graph Theory in Computational Biology

Introduction and Conclusion

Very diverse biological applications and very diverse graph theory. So no single grand reason for graphs and no single graph topic in biology.

Lots of opportunity for graph theorists and graph algorithmists to develop or apply graph theory to biological problems. Even more opportunity for combinatorial optimization.

Page 11: Graphs and Graph Theory in Computational Biology

What I will do in this tutorial

• Emphasis on points 2 and 3, i.e., examples of the development of new non-trivial graph theory, and of the exploitation of classic graph theory. And (my apologies) I will mostly emphasize topics I have been involved with.

Still,

• There are some hot biological areas today where graphs arise, and some graph topics that recur commonly, and I should point those out even if I will not talk in detail on those topics.

Page 12: Graphs and Graph Theory in Computational Biology

The digression

• Hot biology: Network biology, i.e., biological phenomena that are represented by networks: gene regulatory networks and protein interaction networks, just to name two. These form the core of systems biology. Other relationships in biology are also represented by graphs and networks, e.g., the diseasome.

• Recurring graph problems: graph problems in clustering data (e.g., finding cliques or variants of cliques); variants of graph isomorphism in network motif or molecular pathway problems; the need for more random graph theory for significance testing.

Page 13: Graphs and Graph Theory in Computational Biology

Clique Problems

Clique problems recur in clustering applications, but large cliques are computationally hard to find (finding a maximum clique is NP-hard). Suggested research for graph theorists and algorithmists: computationally tractable, biologically meaningful alternatives to cliques. Examples: maximum density subgraphs; extreme sets in a graph.

Page 14: Graphs and Graph Theory in Computational Biology

Subgraph density

• Given a graph G and a subset S of its nodes, let G(S) be the subgraph of G induced by S, i.e., G(S) has node set S and edge set E(S) consisting of all edges in G both of whose ends are in S.

• A maximum density subgraph of G is the subgraph induced by a set of nodes S that maximizes |E(S)|/|S|.

• A maximum density subgraph can be found in polynomial time. It has the flavor of a maximum clique, but has different properties.
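To make the density definition concrete, here is a minimal Python sketch. It is not the polynomial-time method mentioned above: it simply tries every node subset, so it is exponential and only suitable for tiny graphs. The example graph (a 4-clique with one pendant node) and all function names are hypothetical.

```python
from itertools import combinations

def density(edges, S):
    """Return |E(S)| / |S| for the subgraph induced by node set S."""
    S = set(S)
    induced = [(u, v) for u, v in edges if u in S and v in S]
    return len(induced) / len(S)

def max_density_subgraph_bruteforce(nodes, edges):
    """Try every non-empty node subset (exponential; illustration only)."""
    best, best_density = None, -1.0
    for size in range(1, len(nodes) + 1):
        for S in combinations(nodes, size):
            d = density(edges, S)
            if d > best_density:
                best, best_density = set(S), d
    return best, best_density

# Hypothetical example: a 4-clique {a, b, c, d} plus a pendant node e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
best, best_density = max_density_subgraph_bruteforce(nodes, edges)
print(best, best_density)  # the 4-clique, with density 6/4 = 1.5
```

In this example the whole graph has density 7/5 = 1.4, so the densest subgraph is the clique itself; in general the densest subgraph need not be a clique.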

Page 15: Graphs and Graph Theory in Computational Biology

Extreme Sets

In an edge-weighted undirected graph G with node set V, a subset S of nodes is called an extreme set if, for every non-empty proper subset S’ of S, the total weight of the edges crossing from S’ to V-S’ is larger than the total weight of the edges crossing from S to V-S.

All the extreme sets in a graph can be found in polynomial time.
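The extreme-set condition can be checked directly from the definition. The sketch below is a brute-force check over all non-empty proper subsets of S (exponential in |S|), not the polynomial-time enumeration mentioned above. The weighted graph and the function names are made up for illustration.

```python
from itertools import combinations

def cut_weight(weights, S):
    """Total weight of edges with exactly one end in S."""
    S = set(S)
    return sum(w for (u, v), w in weights.items() if (u in S) != (v in S))

def is_extreme(weights, S):
    """True if every non-empty proper subset of S has a strictly larger cut."""
    S = list(S)
    target = cut_weight(weights, S)
    for size in range(1, len(S)):
        for sub in combinations(S, size):
            if cut_weight(weights, sub) <= target:
                return False
    return True

# Hypothetical graph: a heavy triangle a-b-c attached to d by a light edge.
weights = {("a", "b"): 3, ("a", "c"): 3, ("b", "c"): 3, ("c", "d"): 1}
print(is_extreme(weights, {"a", "b", "c"}))  # True: the cut of S is only 1
print(is_extreme(weights, {"c", "d"}))       # False: {d} alone has cut 1 <= 6
```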

Page 16: Graphs and Graph Theory in Computational Biology

Also

There is a great need for more sophisticated application of random graph theory in the study of biological networks. This is needed in order to establish null models to use in assessing the statistical significance of subgraphs, paths, patterns and motifs that are found in biological networks.

We need to be able to distinguish observed patterns and subgraphs from those that occur with a high probability in a random graph, under a biologically appropriate model of randomness (an open field).

Page 17: Graphs and Graph Theory in Computational Biology

End of digression

Start of the main tutorial: Examples of Graph Theory in Bioinformatics and Computational Biology

Page 18: Graphs and Graph Theory in Computational Biology

Outline

• Three smaller examples: Euler paths and sequencing; Tanglegrams and co-evolution; Network Design and Multiple Alignment.

• Haplotyping by Perfect Phylogeny: Graph Realization.

• Phylogenetic Networks: Incompatibility Graph; Galled-Trees; Recombination Networks; The Decomposition Theorem and sufficient conditions.

• Multi-state Perfect Phylogeny and Chordal Graphs.

Page 19: Graphs and Graph Theory in Computational Biology

To start: Three small examples

1. Euler paths in sequencing and sequence assembly.

2. Tanglegrams and planarity testing in the study of co-evolution.

3. Application of Tree-Design approximations in multiple sequence alignment. Interplay between trees and strings.

Page 20: Graphs and Graph Theory in Computational Biology

Topic I: Eulerian paths in sequencing problems

The general situation is that we have a molecule S (DNA, say) whose sequence is unknown, but we know all the k-mers that occur in S, for some fixed k.

Given those k-mers, we want to determine S, if possible, or determine whatever is possible to determine about S. Note that k is not related to the alphabet size.

A very useful approach to problems of this type is to build an Eulerian digraph, based on the (k-1)-mers.

Page 21: Graphs and Graph Theory in Computational Biology

Euler graph for general k

For general k, there is one node for each (k-1)-mer contained in an observed k-mer. There is a directed edge from the node for (k-1)-mer A to the node for (k-1)-mer B if the (k-2)-suffix of A matches the (k-2)-prefix of B and overlapping A and B forms an observed k-mer.

Example: k = 5 and we observe the 5-mer XXYZW. Then there will be a node for XXYZ and a node for XYZW, and a directed edge from the first node to the second node. Those two nodes and the directed edge between them represent the 5-mer XXYZW. In some applications, there will be one such edge for each observation of that 5-mer.

Page 22: Graphs and Graph Theory in Computational Biology

The Euler graph derived from the sequence ACACGCAACTTAAA

Ex. k = 3. The graph will have one node for each of the 2-mers in the observed 3-mers. Then there is a directed edge from the node for the 2-mer XY to the node for the 2-mer YZ, for each observed 3-mer XYZ. If a triple is observed more than once, there should be one directed edge for each observation of the triple.
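The construction for k = 3 can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the graph is stored as a plain adjacency list with one entry per observed 3-mer, so repeated 3-mers would give parallel edges.

```python
from collections import defaultdict

def kmers(seq, k):
    """All length-k substrings of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def euler_graph(observed_kmers):
    """Node = (k-1)-mer; one directed edge per observed k-mer."""
    graph = defaultdict(list)
    for kmer in observed_kmers:
        graph[kmer[:-1]].append(kmer[1:])   # prefix node -> suffix node
    return graph

seq = "ACACGCAACTTAAA"
graph = euler_graph(kmers(seq, 3))
for node in sorted(graph):
    print(node, "->", graph[node])
```

For this sequence there are 12 observed 3-mers, hence 12 edges; for example, the node AC has edges to CA, CG and CT.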

Page 23: Graphs and Graph Theory in Computational Biology

The point: Every Eulerian path in the graph specifies a sequence whose k-mers match the given data, and conversely every sequence whose k-mers match the data specifies an Eulerian path in the graph. So the set of Eulerian paths specifies the set of candidate sequences for the unknown original sequence.

Algorithms exist for efficiently finding Eulerian paths, for counting their number, for determining uniqueness, etc., so we can use this representation to study the set of candidate sequences.

Compare this approach to earlier efforts to represent the set of candidates by a graph with a Hamilton path: each node represents an observed k-mer, not a (k-1)-mer.
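As an illustration of the correspondence between Eulerian paths and candidate sequences, the following sketch finds one Eulerian path with Hierholzer's algorithm (a standard method, chosen here for illustration; the slides do not name a specific algorithm) and spells out the sequence it encodes. It assumes an Eulerian path exists, which is always the case when the k-mers come from a single sequence. All names are hypothetical.

```python
from collections import defaultdict

def kmers(seq, k):
    """All length-k substrings of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def eulerian_path(observed_kmers):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    graph = defaultdict(list)
    balance = defaultdict(int)          # out-degree minus in-degree
    for kmer in observed_kmers:
        a, b = kmer[:-1], kmer[1:]
        graph[a].append(b)
        balance[a] += 1
        balance[b] -= 1
    # Start where out-degree exceeds in-degree; otherwise any node (a circuit).
    starts = [v for v in sorted(balance) if balance[v] == 1]
    start = starts[0] if starts else sorted(balance)[0]
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: node is final in path
    return path[::-1]

def spell(path):
    """Sequence spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

seq = "ACACGCAACTTAAA"
candidate = spell(eulerian_path(kmers(seq, 3)))
print(candidate, sorted(kmers(candidate, 3)) == sorted(kmers(seq, 3)))
```

The candidate returned need not equal the original sequence; it is only guaranteed to have the same multiset of 3-mers, which is exactly the ambiguity the slide describes.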

Page 24: Graphs and Graph Theory in Computational Biology

Making finer distinctions in Euler paths

In general there may be many Eulerian paths in the graph, and we want some additional criteria to distinguish the goodness of one Eulerian path compared to another.

Different biological considerations translate into having a value for each subpath of length two. Then the value of an Eulerian path P with n edges is the sum of the n-1 values of the n-1 length-two subpaths in P.

The problem is to find an Eulerian path with maximum value.

We have some reasonable approximations for that, but a simpler case can be solved optimally in polynomial time.
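The value of a path, as defined on this slide, is just a sum over consecutive pairs of edges. A minimal sketch, assuming the path is given as its node sequence and the values are stored in a dictionary keyed by pairs of edges (both representations, and the numbers in the example, are assumptions made for illustration; parallel edges are not distinguished here):

```python
def path_value(node_path, pair_value):
    """Sum the values of the n-1 length-two subpaths of a path with n edges."""
    edges = list(zip(node_path, node_path[1:]))
    return sum(pair_value.get((e1, e2), 0)
               for e1, e2 in zip(edges, edges[1:]))

# Hypothetical values for two consecutive edge pairs.
pair_value = {
    (("AC", "CA"), ("CA", "AA")): 2,
    (("CA", "AA"), ("AA", "AA")): 5,
}
print(path_value(["AC", "CA", "AA", "AA"], pair_value))  # 2 + 5 = 7
```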

Page 25: Graphs and Graph Theory in Computational Biology

The case of a binary alphabet, but arbitrary k

Since the alphabet size is two, each node in the graph has at most two incoming edges and two outgoing edges. Assume exactly two each.

Ex. k = 4 (figure; node labels 011, 110, 110, 001, 101)

Page 26: Graphs and Graph Theory in Computational Biology

The case of a binary alphabet, but arbitrary k

At any node, there are two possible ways for an Euler path to pass through the node.

Ex. k = 4, turning (figure; node labels 011, 110, 110, 001, 101)

Page 27: Graphs and Graph Theory in Computational Biology

The case of a binary alphabet, but arbitrary k

At any node, there are two possible ways for an Euler path to pass through the node.

Ex. k = 4, crossing (figure; node labels 011, 110, 110, 001, 101)

So in terms of subpaths of length two, we have two choices at each node.

Page 28: Graphs and Graph Theory in Computational Biology

Restating the optimal Euler path problem

We are given an Eulerian graph where the in and out degrees are at most two at each node, and at each node there is a given value for the turning pair, and a value for the crossing pair. Then choose the turning or the crossing pairs at the nodes to maximize the total value of the choices, subject to the requirement that the choices create an Euler path in the graph.

Page 29: Graphs and Graph Theory in Computational Biology

Main Result

• The problem can be solved in polynomial time.

• The set of choices that give Euler paths has a matroidal structure, which allows a matroid-greedy algorithm to find the optimal Euler path.

• A more direct algorithm based on Minimum Spanning Trees also solves the problem.

Page 30: Graphs and Graph Theory in Computational Biology

The Matroid Structure

• At every node v, the edge pair (crossing or turning) which has the lowest value is called the low pair, and the other pair is the high pair. The difference in values is called the loss at v.

• A subset S of nodes is called independent if there is an Euler path in the graph where at every node in S, the low pair is chosen.

• As defined, the family of independent sets forms a matroid, and so we can find, by a greedy algorithm, an independent set which minimizes the loss - and this gives the optimal Euler path.

Page 31: Graphs and Graph Theory in Computational Biology

Topic II: Tanglegrams

• A Tanglegram is a pair of trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line.

• A straight line connects each leaf in one tree to the leaf with the same label in the other tree.

• The number of crossing lines is a measure of the similarity of the trees.
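For one fixed drawing of the two trees, the number of crossing lines depends only on the two left-to-right leaf orders: two connecting lines cross exactly when their leaves appear in opposite relative order on the two parallel lines. The sketch below counts such pairs; it does not address the harder problem of choosing the drawings that minimize the count. The leaf orders are hypothetical.

```python
def count_crossings(order_top, order_bottom):
    """Number of pairs of leaf-to-leaf lines that cross in a fixed drawing."""
    position = {leaf: i for i, leaf in enumerate(order_bottom)}
    seq = [position[leaf] for leaf in order_top]
    # A crossing is an inversion: a pair appearing in opposite order.
    return sum(1
               for i in range(len(seq))
               for j in range(i + 1, len(seq))
               if seq[i] > seq[j])

# Hypothetical leaf orders for the two trees.
print(count_crossings(["a", "b", "c", "d"], ["b", "a", "d", "c"]))  # 2
print(count_crossings(["a", "b", "c", "d"], ["a", "b", "c", "d"]))  # 0
```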

Page 32: Graphs and Graph Theory in Computational Biology


Page 33: Graphs and Graph Theory in Computational Biology


Page 34: Graphs and Graph Theory in Computational Biology


Page 35: Graphs and Graph Theory in Computational Biology


Page 36: Graphs and Graph Theory in Computational Biology

Topic III: Multiple Sequence Alignment

Interplay between sequences and trees. Exploitation of network design approximation.

This topic was discussed on the board; slides are needed.

Page 37: Graphs and Graph Theory in Computational Biology

Intro to Hours 2 and 3: Two “Post-HGP” Topics

Two topics in Population Genomics

• SNP Haplotyping in populations

• Reconstructing a history of recombination

These topics in Population Genomics illustrate current challenges in biology, and illustrate the use of graph theory, combinatorial algorithms and discrete mathematics in biology.

Page 38: Graphs and Graph Theory in Computational Biology

What is population genomics?

• The Human genome “sequence” is done.

• Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility).

• Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.

Page 39: Graphs and Graph Theory in Computational Biology

SNP Data

• A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

• SNP maps have been compiled with a density of about 1 site per 1000.

• SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Page 40: Graphs and Graph Theory in Computational Biology

Haplotype Map Project: HAPMAP

• NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population.

• Association mapping: HAPMAP is used to try to associate genetically influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations.

• The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.

Page 41: Graphs and Graph Theory in Computational Biology

Topic IV: Perfect Phylogeny Haplotyping via Graph Realization

Page 42: Graphs and Graph Theory in Computational Biology

Genotypes and Haplotypes

Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states), denoted by 0 and 1 (motivated by SNPs).

Two haplotypes per individual:

0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0

Merge the haplotypes to get the genotype for the individual:

2 1 2 1 0 0 1 2 0
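The merge operation on this slide can be written out directly: a site where the two haplotypes agree keeps that value, and a site where they differ becomes 2. A minimal sketch (the function names are hypothetical):

```python
def merge(h1, h2):
    """Genotype of an individual carrying haplotypes h1 and h2."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]

def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) explains the genotype."""
    return merge(h1, h2) == list(genotype)

h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Going the other way is ambiguous: a genotype with j sites equal to 2 is explained by 2^(j-1) different unordered haplotype pairs (for j >= 1), which is why haplotyping needs an additional model.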

Page 43: Graphs and Graph Theory in Computational Biology

Haplotyping Problem

• Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.

• Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.

Page 44: Graphs and Graph Theory in Computational Biology

The Perfect Phylogeny Model for SNP sequences

(Figure: a tree with the ancestral sequence 00000 at the root, site mutations 1 to 5 on the edges, and the extant sequences at the leaves.)

Sites: 1 2 3 4 5

The tree derives the set M:

10100
10000
01011
01010
00010

Only one mutation per site allowed.

Page 45: Graphs and Graph Theory in Computational Biology

When can a set of sequences be derived on a perfect phylogeny?

Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test.
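The 4-gamete test is easy to state in code. The sketch below checks every pair of columns and reports the pairs that contain all four combinations; the function name is hypothetical, sites are numbered from 1 as on the slides, and the matrix is the set M of five sequences derived by the tree shown earlier.

```python
from itertools import combinations

def incompatible_pairs(M):
    """Pairs of sites (1-based) whose columns contain 00, 01, 10 and 11."""
    num_sites = len(M[0])
    bad = []
    for i, j in combinations(range(num_sites), 2):
        gametes = {(row[i], row[j]) for row in M}
        if len(gametes) == 4:        # all four pairs occur
            bad.append((i + 1, j + 1))
    return bad

M = ["10100", "10000", "01011", "01010", "00010"]
print(incompatible_pairs(M))              # [] : a perfect phylogeny exists
print(incompatible_pairs(M + ["10101"]))  # non-empty once 10101 is added
```

This is the simple pairwise check, quadratic in the number of sites; it is not the O(nm) method cited on the following slide.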

Page 46: Graphs and Graph Theory in Computational Biology

So, in the case of binary characters, if each pair of columns allows a tree, then the entire set of columns allows a tree.

For M of dimension n by m, the existence of a perfect phylogeny for M can be tested in O(nm) time, and a tree built in that time, if there is one (Gusfield, Networks 91).

We will use the classic theorem in two more modern and more genetic applications.

Page 47: Graphs and Graph Theory in Computational Biology

The Perfect Phylogeny Model

We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.

In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.

Justification: haplotype blocks; rare recombination; a base problem whose solution can be modified to incorporate more biological complexity.

Page 48: Graphs and Graph Theory in Computational Biology

Perfect Phylogeny Haplotype (PPH)

Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix S:

     sites
     1  2
a    2  2
b    0  2
c    1  0

A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2.

Page 49: Graphs and Graph Theory in Computational Biology

The PPH Problem

Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotypes:

     1  2
a    2  2
b    0  2
c    1  0

Explaining haplotypes:

     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0

Page 50: Graphs and Graph Theory in Computational Biology

The Haplotype Phylogeny Problem

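Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotypes:

     1  2
a    2  2
b    0  2
c    1  0

Explaining haplotypes:

     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0

(Figure: the perfect phylogeny for these haplotypes. The root is 00; the edge labeled 1 leads to leaves c, c, a with haplotype 10; the edge labeled 2 leads to leaves a, b with haplotype 01; the remaining leaf b has haplotype 00.)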

Page 51: Graphs and Graph Theory in Computational Biology

The Alternative Explanation

Genotypes:

     1  2
a    2  2
b    0  2
c    1  0

Alternative explaining haplotypes:

     1  2
a    1  1
a    0  0
b    0  0
b    0  1
c    1  0
c    1  0

No tree possible for this explanation.
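The two explanations of this small example can be checked mechanically. The sketch below enumerates every way of explaining each genotype and keeps the combinations that fit a perfect phylogeny with all-0 root. It is brute force (exponential in the number of 2's), not one of the efficient PPH algorithms listed next, and the function names are hypothetical. With the root fixed at all-0, the pairwise condition becomes: no two columns may contain all three of 0,1 and 1,0 and 1,1 (this is the 4-gamete test with the all-0 root sequence added to the matrix).

```python
from itertools import combinations, product

def explanations(genotype):
    """All unordered haplotype pairs whose merge is the genotype."""
    twos = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    # Fix the first ambiguous site to 0 in h1 so each pair is listed once.
    for bits in product([0, 1], repeat=max(len(twos) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        choice = ((0,) + bits) if twos else ()
        for site, b in zip(twos, choice):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def fits_perfect_phylogeny(haplotypes):
    """Pairwise test for a perfect phylogeny with all-0 root."""
    num_sites = len(haplotypes[0])
    for i, j in combinations(range(num_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

def solve_pph(genotypes):
    """Every choice of explaining pairs that fits a perfect phylogeny."""
    solutions = []
    for combo in product(*(explanations(g) for g in genotypes)):
        haplotypes = [h for pair in combo for h in pair]
        if fits_perfect_phylogeny(haplotypes):
            solutions.append(combo)
    return solutions

S = [(2, 2), (0, 2), (1, 0)]   # genotypes a, b, c from the slides
solutions = solve_pph(S)
print(solutions)
```

For this input only one combination survives: genotype a must be explained by 0 1 and 1 0, as the slides state; the alternative 1 1 and 0 0 fails the test.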

Page 52: Graphs and Graph Theory in Computational Biology

Efficient Solutions to the PPH problem - n genotypes, m sites

• Reduction to a graph realization problem (GPPH) - build on Bixby-Wagner or Fujishige solution to graph realization; O(nm alpha(nm)) time. Gusfield, Recomb 02

• Reduction to graph realization - build on Tutte’s graph realization method; O(nm^2) time. Chung, Gusfield 03

• Direct, from scratch combinatorial approach - O(nm^2). Bafna, Gusfield et al JCB 03

• Berkeley (EHK) approach - specialize the Tutte solution to the PPH problem - O(nm^2) time.

• Linear-time solutions - Recomb 2005, Ding, Filkov, Gusfield and a different linear time solution.

Page 53: Graphs and Graph Theory in Computational Biology

The Reduction Approach

This is the original polynomial time method. Conceptually simplest at a high level (but not at the implementation level) and most extendable to other problems; nearly linear-time but not linear-time.

Page 54: Graphs and Graph Theory in Computational Biology

The case of the 1’s

1) For any row i in S, the set of 1 entries in row i specifies the exact set of mutations on the path from the root to the least common ancestor of the two leaves labeled i, in every perfect phylogeny for S.

2) The order of those 1 entries on the path is also the same in every perfect phylogeny for S, and is easy to determine by “leaf counting”.

Page 55: Graphs and Graph Theory in Computational Biology

Leaf Counting

S:

       1  2  3  4  5  6  7
a      1  0  1  0  0  0  0
b      0  1  0  1  0  0  0
c      1  2  0  0  2  0  2
d      2  2  0  0  0  2  0

Count  5  4  2  2  1  1  1

In any column c, count two for each 1, and count one for each 2. The total is the number of leaves below mutation c, in every perfect phylogeny for S. So if we know the set of mutations on a path from the root, we know their order as well.
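The leaf-counting rule can be reproduced on the matrix S from this slide. The sketch below computes the count for each column and then orders a given set of mutations by decreasing count, which is the order they must take on a path from the root. Function names are hypothetical.

```python
def leaf_counts(rows):
    """Per column: two for each 1 entry, one for each 2 entry."""
    num_sites = len(rows[0])
    return [sum(2 if r[c] == 1 else 1 if r[c] == 2 else 0 for r in rows)
            for c in range(num_sites)]

def order_on_path(sites, counts):
    """Order mutations (1-based sites) from the root down: larger count first."""
    return sorted(sites, key=lambda s: -counts[s - 1])

S = {
    "a": [1, 0, 1, 0, 0, 0, 0],
    "b": [0, 1, 0, 1, 0, 0, 0],
    "c": [1, 2, 0, 0, 2, 0, 2],
    "d": [2, 2, 0, 0, 0, 2, 0],
}
counts = leaf_counts(list(S.values()))
print(counts)                         # [5, 4, 2, 2, 1, 1, 1]
print(order_on_path([3, 1], counts))  # [1, 3]: mutation 1 is above mutation 3
```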

Page 56: Graphs and Graph Theory in Computational Biology

Simple Conclusions

sites: 1 2 3 4 5 6 7
i:     0 1 0 1 2 2 2

Subtree for row i data (figure): from the root, the path carries mutations 2, then 4, then 5.

The order is known for the red mutations together with the leftmost blue(?) mutation.

Page 57: Graphs and Graph Theory in Computational Biology

But what to do with the remaining blue entries (2’s) in a row?

Page 58: Graphs and Graph Theory in Computational Biology

More Simple Tools

3) For any row i in S, and any column c, if S(i,c) is 2, then in every perfect phylogeny for S, the path between the two leaves labeled i must contain the edge with mutation c.

Further, every mutation c on the path between the two i leaves must be from such a column c.

Page 59: Graphs and Graph Theory in Computational Biology

From Row Data to Tree Constraints

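sites: 1 2 3 4 5 6 7
i:     0 1 0 1 2 2 2

Subtree for row i data (figure): from the root, the path carries mutations 2, then 4, then 5, above the two leaves labeled i.

Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4, but we don’t know where to put 6 and 7.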

Page 60: Graphs and Graph Theory in Computational Biology

The Graph Theoretic Problem

Given a genotype matrix S with n sites, and a red-blue subgraph for each row i, create a directed tree T where each integer from 1 to n labels exactly one edge, so that each subgraph is contained in T.

Page 61: Graphs and Graph Theory in Computational Biology

Powerful Tool: Tree and Graph Realization

• Let Rn be the integers 1 to n, and let P be an unordered subset of Rn. P is called a path set.

• A tree T with n edges, where each is labeled with a unique integer of Rn, realizes P if there is a contiguous path in T labeled with the integers of P and no others.

• Given a family P1, P2, P3…Pk of path sets, tree T realizes the family if it realizes each Pi.

• The graph realization problem generalizes the consecutive ones problem, where T is a path.

• More generally, each set specifies a fundamental cycle in the unknown graph.

Page 62: Graphs and Graph Theory in Computational Biology

Tree Realization Example

Path sets:

P1: 1, 5, 8
P2: 2, 4
P3: 1, 2, 5, 6
P4: 3, 6, 8
P5: 1, 5, 6, 7

Realizing Tree T (figure; edges labeled 1 to 8)

More generally, think of each path set as specifying a fundamental cycle containing the edges in the specified path.
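Checking that a given tree realizes a family of path sets is simple, even though finding such a tree is not. The sketch below tests whether the edges named in a path set form one contiguous path in a tree. The tree used here is an assumption: the topology of the slide's figure is not recoverable from the transcript, so this is just one tree, worked out by hand, that is consistent with P1 to P5. Node numbers 0 to 8 and the function name are hypothetical.

```python
from collections import defaultdict

def realizes(tree_edges, path_set):
    """True if the labeled edges in path_set form one simple path in the tree.

    tree_edges maps an edge label to its two end nodes. In a tree, a set of
    edges is a path exactly when it is connected and no node has degree > 2.
    """
    adjacency = defaultdict(list)
    for label in path_set:
        u, v = tree_edges[label]
        adjacency[u].append(v)
        adjacency[v].append(u)
    if any(len(nbrs) > 2 for nbrs in adjacency.values()):
        return False
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:                       # depth-first search over chosen edges
        x = stack.pop()
        for y in adjacency[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adjacency)

# One tree consistent with the path sets (assumed, not the slide's drawing).
tree = {2: (0, 1), 1: (1, 2), 5: (2, 3), 6: (3, 4),
        7: (4, 5), 8: (3, 6), 3: (4, 7), 4: (0, 8)}
family = [[1, 5, 8], [2, 4], [1, 2, 5, 6], [3, 6, 8], [1, 5, 6, 7]]
print(all(realizes(tree, P) for P in family))  # True
print(realizes(tree, [5, 6, 8]))               # False: three edges meet at a node
```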

Page 63: Graphs and Graph Theory in Computational Biology

Graph Realization

Polynomial time (almost linear-time) algorithms exist for the graph realization problem, given the family of fundamental cycles the unknown graph should contain: Whitney, Tutte, Cunningham, Edmonds, Bixby, Wagner, Gavril, Tamari, Fujishige, Lofgren, 1930’s - 1980’s.

Most of the literature on this problem is in the context of determining if a binary matroid is graphic.

The algorithms are not simple; none implemented before 2002.

Page 64: Graphs and Graph Theory in Computational Biology

Reducing PPH to graph realization

We solve any instance of the PPH problem by creating appropriate path sets, so that a solution to the resulting graph realization problem leads to a solution to the PPH problem instance.

The key issue: How to encode the needed subgraph for each row, and glue them together at the root.

Page 65: Graphs and Graph Theory in Computational Biology

From Row Data to Tree Constraints

sites: 1 2 3 4 5 6 7
i:     0 1 0 1 2 2 2

Subtree for row i data (figure): from the root, the path carries mutations 2, then 4, then 5, above the two leaves labeled i.

Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4.

Page 66: Graphs and Graph Theory in Computational Biology

Encoding a Red-Blue directed path

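Directed path to encode (figure): glue edge U, then edges 2, 4, 5.

P1: U, 2
P2: U, 2, 4
P3: 2, 4
P4: 2, 4, 5
P5: 4, 5

With these path sets, the path U, 2, 4, 5 is forced in T.

U is a glue edge used to glue together the directed paths from the different rows.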
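The five path sets in this example follow a visible pattern: every two consecutive edges, and every three consecutive edges, of the sequence U, 2, 4, 5. The sketch below generates path sets by that pattern. Treat the general rule as an inference from this single example rather than as the stated encoding; the function name and the default glue label are hypothetical.

```python
def encode_directed_path(ordered_edges, glue="U"):
    """Path sets for consecutive pairs and triples of glue + ordered edges."""
    seq = [glue] + list(ordered_edges)
    path_sets = []
    for i in range(len(seq) - 1):
        path_sets.append(seq[i:i + 2])        # two consecutive edges
        if i + 2 < len(seq):
            path_sets.append(seq[i:i + 3])    # three consecutive edges
    return path_sets

for P in encode_directed_path([2, 4, 5]):
    print(P)
```

For the input 2, 4, 5 this reproduces P1 to P5 in the order listed on the slide.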

Page 67: Graphs and Graph Theory in Computational Biology

Now add a path set for the blues in row i.

sites: 1 2 3 4 5 6 7
i:     0 1 0 1 2 2 2

P: 5, 6, 7

Page 68: Graphs and Graph Theory in Computational Biology

That’s the Reduction

The resulting path-sets encode everything that is known about row i in the input.

The family of path-sets is input to the graph-realization problem, and every solution to that graph-realization problem specifies a solution to the PPH problem, and conversely.

Whitney (1933?) characterized the set of all solutions to graph realization (based on the three-connected components of a graph) and Tarjan et al showed how to find these in linear time.

Page 69: Graphs and Graph Theory in Computational Biology

An implicit representation of all solutions

Whitney (1930) proved that a graph realization problem has a unique solution if and only if the graph is three-connected. That is, at least three nodes must be removed in order to disconnect the graph (assuming it is connected).

Whitney (1931) proved that if the solution is not unique, then there is a semi-unique decomposition of the graph into three-connected components, so that the graph realizations are in one-one correspondence with all the ways that these components can be “twisted” relative to each other. So the number of solutions is 2^(number of three-connected components - 1).

Page 70: Graphs and Graph Theory in Computational Biology

Tree Realization Example

Path sets:

P1: 1, 5, 8
P2: 2, 4
P3: 1, 2, 5, 6
P4: 3, 6, 8
P5: 1, 5, 6, 7

Realizing Tree T with edges added to create a fundamental cycle for each path (figure; tree edges labeled 1 to 8)

Page 71: Graphs and Graph Theory in Computational Biology

Topic V: Phylogenetic Networks with Recombination

Page 72: Graphs and Graph Theory in Computational Biology

When can a set of sequences be derived on a perfect phylogeny?

Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test.

Page 73: Graphs and Graph Theory in Computational Biology

Incompatible Sites

A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.

Page 74: Graphs and Graph Theory in Computational Biology

A richer model

(Figure: the earlier perfect phylogeny, with ancestral sequence 00000 and site mutations 1 to 5 on the edges.)

M, with sites 1 2 3 4 5:

10100
10000
01011
01010
00010
10101 (added)

Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible.

Real sequence histories often involve recombination.

Page 75: Graphs and Graph Theory in Computational Biology

Sequence Recombination

Single crossover recombination (figure): P = 10100 and S = 01011 recombine at recombination point 5 to give 10101.

The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).

A recombination of P and S at recombination point 5.
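Single-crossover recombination as described here is a prefix of one sequence joined to a suffix of the other. A minimal sketch, with sites numbered from 1 as on the slide (the function name is hypothetical):

```python
def recombine(prefix_parent, suffix_parent, point):
    """Sites before `point` come from P; sites from `point` on come from S."""
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

print(recombine("10100", "01011", 5))  # 10101
```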

Page 76: Graphs and Graph Theory in Computational Biology

Network with Recombination: ARG

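(Figure: the previous tree extended with one recombination node, where P = 10100 and S = 01011 recombine at point 5 to produce 10101.)

M, with sites 1 2 3 4 5:

10100
10000
01011
01010
00010
10101 (new)

The previous tree with one recombination event now derives all the sequences.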

Page 77: Graphs and Graph Theory in Computational Biology

A Min ARG for Kreitman’s data

ARG created by SHRUB

Page 78: Graphs and Graph Theory in Computational Biology

An illustration of why we are interested in recombination: Association Mapping of Complex Diseases Using ARGs

Page 79: Graphs and Graph Theory in Computational Biology

Association Mapping

• A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs.
  - Disease mutations: unobserved.

• A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.

(Figure: the SNP haplotype 0 1 0 0 1, with the unobserved disease mutation site marked among the SNPs.)

Page 80: Graphs and Graph Theory in Computational Biology

Very Simplistic Mapping the Unobserved Mutation of Mendelian Diseases with ARGs

(Figure: an ARG with root 00000, mutations at sites 1 to 5 on its edges, two recombination nodes with parents marked P and S, and the leaf sequences a to g; the diseased sequences are marked.)

a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101

Assumption (for now): A sequence is diseased iff it carries the single disease mutation.

Where is the disease mutation?

What part of 01100 do d, e, f inherit?

The single disease mutation occurs between sites 2 and 3!

Page 81: Graphs and Graph Theory in Computational Biology

Mapping Disease Gene with Inferred ARGs

• “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005

• But we do not know the true ARG!

• Goal: infer ARGs from SNP data for association mapping
  - Not easy, and often approximated (e.g. Zollner and Pritchard)
  - Improved results to do the inference: Y. Wu (RECOMB 2007)

Page 82: Graphs and Graph Theory in Computational Biology

Results on Reconstructing the Evolution of SNP Sequences

• Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem

• Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping …

• Part III: Varied Biological Applications

• Part IV: Extension to Gene Conversion

• Part V: The Minimum Mosaic Model of Recombination

This talk will discuss topics in Part I.

Page 83: Graphs and Graph Theory in Computational Biology

Problem: If not a tree, then what?

If the set of sequences M cannot be derived on a perfect phylogeny (true tree), how much deviation from a tree is required?

We want a network for M that uses a small number of recombinations, and we want the resulting network to be as “tree-like” as possible.

Page 84: Graphs and Graph Theory in Computational Biology

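A tree-like network for the same sequences generated by the prior network.

(Figure: a network with mutations at sites 1 to 5 on its edges and two recombination nodes, labeled 2 and 4, each with parents marked p and s.)

a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101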

Page 85: Graphs and Graph Theory in Computational Biology

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a “recombination cycle”.

Page 86: Graphs and Graph Theory in Computational Biology

Galled-Trees

• A phylogenetic network where no recombination cycles share an edge is called a galled tree.

• A cycle in a galled-tree is called a gall.

• Question: if M cannot be generated on a true tree, can it be generated on a galled-tree?

Page 87: Graphs and Graph Theory in Computational Biology
Page 88: Graphs and Graph Theory in Computational Biology

Results about galled-trees

• Theorem: There is an efficient (provably polynomial-time) algorithm to determine whether or not any sequence set M can be derived on a galled-tree.

• Theorem: A galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used over all possible phylogenetic-networks.

• Theorem: If M can be derived on a galled-tree, then the galled-tree is “nearly unique”. This is important for biological conclusions derived from the galled-tree.

Papers from 2003-2007.

Page 89: Graphs and Graph Theory in Computational Biology

Elaboration on Near Uniqueness

Theorem: The number of arrangements (permutations) of the sites on any gall is at most three, and this happens only if the gall has two sites.

If the gall has more than two sites, then the number of arrangements is at most two.

If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall), then the arrangement is forced and unique.

Theorem: All other features of the galled-trees for M are invariant.

Page 90: Graphs and Graph Theory in Computational Biology

A whiff of the ideas behind the results

Page 91: Graphs and Graph Theory in Computational Biology

Incompatible Sites

A pair of sites (columns) of M that fails the 4-gamete test, that is, a pair in which all four combinations 00, 01, 10, 11 appear, is said to be incompatible.

A site that is not in such a pair is compatible.
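The 4-gamete test is easy to state in code. The following is a minimal sketch, not code from the talk: the function names are illustrative, and the rows are the sequences a to g of the matrix M used in the running example.

```python
from itertools import combinations

# Rows a..g of the running example M; columns are sites 1..5.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]

def fails_four_gamete_test(rows, i, j):
    """True if columns i and j (0-based) contain all of 00, 01, 10, 11."""
    gametes = {(row[i], row[j]) for row in rows}
    return len(gametes) == 4

def incompatible_pairs(rows):
    """Incompatible site pairs, reported with 1-based site numbers."""
    m = len(rows[0])
    return [(i + 1, j + 1) for i, j in combinations(range(m), 2)
            if fails_four_gamete_test(rows, i, j)]

print(incompatible_pairs(M))  # [(1, 3), (1, 4), (2, 5)]
```

For this M, sites 1, 3, and 4 are tied together by incompatibilities, as are sites 2 and 5; no site is compatible with all others.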

Page 92: Graphs and Graph Theory in Computational Biology

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

M
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

Incompatibility Graph G(M): one node per site. Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.

[Figure: G(M) on nodes 1 to 5.]

Page 93: Graphs and Graph Theory in Computational Biology

The connected components of G(M) are very informative

• Theorem: The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.

• Theorem: When M can be derived on a galled-tree, all the incompatible sites in a gall must come from a single connected component C, and that gall must contain all the sites from C. Compatible sites need not be inside any blob.

• In a galled-tree the number of recombinations is exactly the number of non-trivial connected components in G(M), and hence is minimum over all possible phylogenetic networks for M.
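The lower bound in the first theorem can be computed directly. The sketch below, with illustrative function names that are not from the talk, builds G(M) for the example M and counts the non-trivial connected components.

```python
from itertools import combinations

# Rows a..g of the running example M; columns are sites 1..5.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]

def incompatibility_graph(rows):
    """Adjacency sets over sites 1..m; an edge means the pair fails the 4-gamete test."""
    m = len(rows[0])
    adj = {s: set() for s in range(1, m + 1)}
    for i, j in combinations(range(m), 2):
        if len({(r[i], r[j]) for r in rows}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj

def connected_components(adj):
    """Connected components by depth-first search, each as a sorted list."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

components = connected_components(incompatibility_graph(M))
nontrivial = [c for c in components if len(c) > 1]
print(components)        # [[1, 3, 4], [2, 5]]
print(len(nontrivial))   # 2
```

So any network for this M needs at least two recombinations, and a galled-tree for M, if one exists, uses exactly two.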

Page 94: Graphs and Graph Theory in Computational Biology

[Figure: the network for sequences a to g (a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101), shown beside its Incompatibility Graph on nodes 1 to 5.]

Page 95: Graphs and Graph Theory in Computational Biology

A Graph Theoretic Necessary Condition for a Galled-Tree

If M can be generated on a galled-tree, then the incompatibility graph must be a bipartite bi-convex graph. Other structural properties of the conflict graph can be deduced and exploited.

Page 96: Graphs and Graph Theory in Computational Biology

Galled-Tree Haplotyping

Problem: Given genotype matrix G, if there is no PPH solution for G, is there a haplotyping H for G such that H can be derived on a Galled-Tree?

Page 97: Graphs and Graph Theory in Computational Biology

A different Necessary Condition for a one-gall tree

1. There exists a set of sequences S such that for every pair of incompatible sites p,q, a single p,q state-pair appears in all sequences in S, and does not appear in any sequence outside S.

2. There must be a number x such that p < x < q, for each incompatible pair p,q.

Page 98: Graphs and Graph Theory in Computational Biology

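Example

H
   1 2 3 4 5 6
a  0 0 0 1 0 0
b  1 0 0 1 0 0
c  0 0 1 0 0 0
d  1 0 1 0 0 0
e  1 1 0 0 0 1
f  0 1 1 0 0 0
g  0 0 1 0 1 0

[Figure: a one-gall network for H, with edges labeled by sites 1 to 6 and a recombination node with p and s incoming edges. Leaf labels shown in the figure: a: 000100, b: 100100, c: 001000, d: 101000, e:1010001, f: 011000, g: 001010.]

S = {e,d}, the sequences below the recombination node.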

Page 99: Graphs and Graph Theory in Computational Biology

Surprising Result - Yun Song

The necessary condition is also sufficient.

Yun S. Song in TCBB 2006

Page 100: Graphs and Graph Theory in Computational Biology

Coming full circle - back to genotypes

When can a set of genotypes be explained by a set of haplotypes derived on a galled-tree, rather than on a perfect phylogeny?

The Song NASC (necessary and sufficient condition) can be translated into an ILP, using the part of the MinIncompat ILP that identifies which site pairs are incompatible.

Page 101: Graphs and Graph Theory in Computational Biology

For the one gall problem, the ILP formulation solves very efficiently (200 rows x 40 sites in seconds to minutes). So far, the 2-gall case does not solve well (ongoing work).

(Dan Brown, Gusfield 2006).

Page 102: Graphs and Graph Theory in Computational Biology

Change of Scope: Minimizing Recombinations in unconstrained networks

• Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts.

• We can solve this problem in poly-time for the special case of Galled-Trees.

• The minimization problem is NP-hard in general.

Page 103: Graphs and Graph Theory in Computational Biology

Minimization is an NP-hard Problem

What we have done:

1. Solve small data-sets optimally with exponential-time methods or with algorithms that work well in practice;

2. Efficiently compute lower and upper bounds on the number of needed recombinations.

3. Apply these methods to address specific biological and bio-tech questions.

Page 104: Graphs and Graph Theory in Computational Biology

The Decomposition Theorem

Since the minimization problem is NP-hard, we want to break up a problem into subproblems that can be solved separately and combined.

Page 105: Graphs and Graph Theory in Computational Biology

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

M
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

Incompatibility Graph G(M): one node per site. Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.

[Figure: G(M) on nodes 1 to 5.]

Page 106: Graphs and Graph Theory in Computational Biology

The connected components of G(M) are very informative

For example we have the Theorem:

The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.

Page 107: Graphs and Graph Theory in Computational Biology

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a “recombination cycle”.

Page 108: Graphs and Graph Theory in Computational Biology

A maximal set of intersecting cycles forms a Blob

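[Figure: a phylogenetic network rooted at 00000, with site labels on the edges and recombination nodes whose incoming edges are marked P and S; node labels include 10010, 01100, 00101, 01101, 00100, and 00010.]

If directions on the edges are removed, a blob is a bi-connected component of the network.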

Page 109: Graphs and Graph Theory in Computational Biology

Blobbed Trees

• Contracting each blob in a network results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. Simple, but key insight.

• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.

The blobs are the non-tree-like parts of the network.

Page 110: Graphs and Graph Theory in Computational Biology

Ugly tangled network inside the blob.

Every network is a tree of blobs.

A network where every blob is a single cycle is a Galled-Tree.

Page 111: Graphs and Graph Theory in Computational Biology

A Simple Observation

In any network N for M, all sites from the same connected component of G(M) must appear together in a single blob in N.

Page 112: Graphs and Graph Theory in Computational Biology

The Decomposition Theorem

Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.

Page 113: Graphs and Graph Theory in Computational Biology

Example: Network for input M with one blob

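[Figure: a network for M rooted at 00000 with a single blob. Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101; internal node labels include 10010, 01100, 00101, 01101, 00100, and 00010.]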

Page 114: Graphs and Graph Theory in Computational Biology

[Figure: the fully-decomposed network for M (leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101), shown beside the Incompatibility Graph on nodes 1 to 5.]

Page 115: Graphs and Graph Theory in Computational Biology

Moreover, the backbone tree is invariant over all the fully-decomposed networks for M, and can be determined in polynomial-time.

So, we can find a network for M by solving the recombination minimization problem for each connected component of G(M) separately, and then connect those subnetworks in an invariant way.

Page 116: Graphs and Graph Theory in Computational Biology

Algorithmically

• Finding the tree part of the blobbed-tree is easy.

• Determining the sequences labeling the exterior nodes on any blob is easy.

• Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.

• It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

• That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.

Page 117: Graphs and Graph Theory in Computational Biology

However …

While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks.

That is, sometimes it pays to put sites from different connected components together on the same blob.

Page 118: Graphs and Graph Theory in Computational Biology

Sufficient Conditions

But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks.

The deepest result:

Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N.

JCB December 2007

Page 119: Graphs and Graph Theory in Computational Biology

Corollary

A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).

Page 120: Graphs and Graph Theory in Computational Biology

[Figure: a network rooted at 000000 that places sites from both components of G(M) in one blob; edges carry site labels 1 to 6, and the recombination nodes have incoming edges marked p and s.]

G(M) has two components. Each requires two recombinations, but this combined network needs only three.

Sequences in M are in black. Sequence 100010 is not in M.

G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.

Page 121: Graphs and Graph Theory in Computational Biology

A Practical Sufficient Condition

If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.

Page 122: Graphs and Graph Theory in Computational Biology

Another Practical Sufficient Condition

If M can be derived on a network N where the number of recombinations equals the (poly-computable) Haplotype Lower Bound, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks.

Page 123: Graphs and Graph Theory in Computational Biology

Topic VI: Perfect Phylogeny Extension to non-binary characters

We detail the case of three allowed states per character.

Page 124: Graphs and Graph Theory in Computational Biology

What is a Perfect Phylogeny for non-binary characters?

• Input consists of n sequences M with m sites (characters) each, where each site can take one of k states.

• In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k.

• T has n leaves, one for each sequence in M, labeled by that sequence.

• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C form a connected subtree of T. It follows that the subtrees for any C are node-disjoint.

Page 125: Graphs and Graph Theory in Computational Biology

Example: A perfect phylogeny for input M

M (n = 5, m = 3, k = 3)

   A B C
1  3 2 1
2  2 3 2
3  3 2 3
4  1 1 3
5  1 2 3

[Figure: a perfect phylogeny for M with node labels (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3), (3,2,3).]

Page 126: Graphs and Graph Theory in Computational Biology

Example

[Figure: the same table M (n = 5, m = 3, k = 3) and perfect phylogeny, with the tree for State 2 of Character B highlighted.]

Page 127: Graphs and Graph Theory in Computational Biology

Perfect Phylogeny Problem

Given M, is there a Perfect Phylogeny for M?

Page 128: Graphs and Graph Theory in Computational Biology

Chordal Graphs

Basic Definition: A graph G is called Chordal if every cycle of length four or more contains a chord.

More useful result: A graph G is chordal if and only if every minimal vertex separator in G is a clique.

Chordal graphs have a large number of applications, more based on the separator result than on the basic definition. For example, a chordal graph on n nodes can have at most n maximal cliques and n-1 minimal vertex separators.
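For small graphs, chordality can be tested with very little code. The sketch below does not use the cycle definition or the separator characterization above. It relies instead on another standard fact: a graph is chordal exactly when its vertices can be removed one at a time, each being simplicial (its neighbours form a clique) at the moment it is removed. The function names are illustrative, and this simple version is far from the linear-time methods in the literature.

```python
from itertools import combinations

def is_chordal(adj):
    """adj maps each node to the set of its neighbours (undirected graph)."""
    g = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while g:
        # Look for a vertex whose neighbours are pairwise adjacent.
        simplicial = next(
            (v for v, ns in g.items()
             if all(b in g[a] for a, b in combinations(ns, 2))),
            None,
        )
        if simplicial is None:
            return False  # no simplicial vertex: a chordless cycle remains
        for n in g[simplicial]:
            g[n].discard(simplicial)
        del g[simplicial]
    return True

def from_edges(edges):
    """Build adjacency sets from a list of undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

square = from_edges([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
square_with_chord = from_edges(
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")])
print(is_chordal(square), is_chordal(square_with_chord))  # False True
```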

Page 129: Graphs and Graph Theory in Computational Biology

Another Classic Chordal Graph Theorem

A graph G is chordal if and only if it is the intersection graph of a set S of subtrees of a tree T. Each node of G is a member of S.

[Figure: a tree T whose nodes are labeled {b,c}, {b,c,d}, {c,d,e,g}, {a,e}, {e,f,g}, {a,e,g}, and the corresponding chordal graph G on nodes a to g.]

Page 130: Graphs and Graph Theory in Computational Biology

Relation to Perfect Phylogeny

In a perfect phylogeny T for a table E, for any character C and any state X of character C, the sub-forest of T induced by the nodes labeled (C,X) forms a single, connected subtree of T.

So, there is a natural set of subtrees of T induced by E.

Page 131: Graphs and Graph Theory in Computational Biology

Chordal Completion Approach to Perfect Phylogeny

Table E

   A B C
1  3 2 1
2  2 3 2
3  3 2 3
4  1 1 3
5  1 2 3

Graph G(E) has one node for each character-state pair in E, and an edge between two nodes if and only if there is a row in E with both those character-state pairs.

Each row of table E induces a clique in G(E).

[Figure: G(E), with nodes arranged in columns A, B, C and rows for states 1, 2, 3.]

Page 132: Graphs and Graph Theory in Computational Biology

Classic Theorem

There is a perfect phylogeny for table E if and only if edges can be added to graph G(E) to make it a chordal, K-partite graph. If there is such a chordal graph, denote it by G’(E).

Note that if table E has K columns, then G(E) is a K-partite graph.

Theorem (Buneman 196?)

Page 133: Graphs and Graph Theory in Computational Biology

Deeper Result: If G’(E) exists

• Let C(E) be the graph derived from graph G’(E) as follows: create a node in C(E) for each maximal clique in G’(E), and create an edge (u,v) in C(E) iff the cliques for u and v in G’(E) share a node. Weight edge (u,v) by the number of shared nodes. Note that C(E) can be created from G’(E) in polynomial time.

• Any Maximum Spanning Tree T in C(E) is a perfect phylogeny for E. Actually, T can be found more directly in linear time from G’(E).

Page 134: Graphs and Graph Theory in Computational Biology

Perfect Phylogeny Results

The perfect phylogeny problem was open for about 20 years, but solved by Dress, Steel, Warnow and Kannan, Agarwalla and Fernandez-Baca.

For any fixed bound on the number of states per character, the Perfect Phylogeny Problem can be solved in polynomial time.

However, if the number of states per character is not bounded, then the problem is NP-Complete.

Also, for any fixed number of characters, the problem can be solved in polynomial time.

Page 135: Graphs and Graph Theory in Computational Biology

Dress-Steel solution for 3-state Perfect phylogeny given complete data (1991)

• Recode each site M(i) of M as three binary sites M’(i,1), M’(i,2), M’(i,3), each indicating the taxa that have state 1, 2, or 3.

• Theorem (DS): There is a 3-state perfect phylogeny for M if and only if there is a binary-character perfect phylogeny for some subset of M’ consisting of exactly two of the columns M’(i,1), M’(i,2), M’(i,3), for each column i of M.
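The recoding and the theorem as stated above can be tried on the running 5-taxon example. This is a brute-force sketch under stated assumptions, not the Dress-Steel algorithm or the 2-SAT encoding: it tries all 3^m ways of dropping one recoded column per site and tests the kept binary columns pairwise with the 4-gamete test. The function names are illustrative.

```python
from itertools import combinations, product

# The 5-taxon, 3-character table M of the running example (states 1..3).
M = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]

def recode(table, k=3):
    """Map (site, state) to the binary column marking the taxa with that state."""
    m = len(table[0])
    return {(i, s): tuple(int(row[i] == s) for row in table)
            for i in range(m) for s in range(1, k + 1)}

def compatible(x, y):
    """Two binary columns are compatible unless all four gametes occur."""
    return len(set(zip(x, y))) < 4

def dress_steel_choice(table):
    """Return a pairwise-compatible choice of two recoded columns per site, or None."""
    cols = recode(table)
    m = len(table[0])
    for dropped in product((1, 2, 3), repeat=m):
        kept = [(i, s) for i in range(m) for s in (1, 2, 3) if s != dropped[i]]
        if all(compatible(cols[a], cols[b]) for a, b in combinations(kept, 2)):
            return kept
    return None

print(recode(M)[(0, 3)])               # column A,3: (1, 0, 1, 0, 0)
print(dress_steel_choice(M) is not None)
```

Sites are 0-based in the code, so `(0, 3)` is the column for Character A, State 3. The exhaustive search is only meant to make the statement concrete for tiny inputs.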

Page 136: Graphs and Graph Theory in Computational Biology

Example

M
   A B C
1  3 2 1
2  2 3 2
3  3 2 3
4  1 1 3
5  1 2 3

M’
   A,1 A,2 A,3 B,1 B,2 B,3 C,1 C,2 C,3
1   0   0   1   0   1   0   1   0   0
2   0   1   0   0   0   1   0   1   0
3   0   0   1   0   1   0   0   0   1
4   1   0   0   1   0   0   0   0   1
5   1   0   0   0   1   0   0   0   1

[Figure: a compatible subset of the columns of M’ is marked.]

Page 137: Graphs and Graph Theory in Computational Biology

Solved in Poly-Time by 2-SAT

As stated, the problem still seems like it would take exponential time to solve, but in fact it is easy to code the problem as a 2-SAT problem (Y. Wu) and hence is solvable in polynomial time. The Dress-Steel paper gave an independent poly-time solution.

Page 138: Graphs and Graph Theory in Computational Biology

Multi-State Perfect Phylogeny with Missing and Removable Data: Solutions via Chordal Graph Theory

Dan Gusfield

Recomb09, May 2009

Page 139: Graphs and Graph Theory in Computational Biology

The Perfect Phylogeny Model for binary sequences

[Figure: a tree with the ancestral sequence 00000 at the root, site mutations 1 to 5 on the edges, and the extant sequences at the leaves.]

The tree derives the set M:
10100
10000
01011
01010
00010

Only one mutation per site allowed (infinite sites)

Page 140: Graphs and Graph Theory in Computational Biology

Beyond Binary; beyond SNPs

The binary perfect phylogeny model has been widely used in population genetics (four-gametes), phylogenetics (compatibility); and many problems and methods build on the model (haplotyping, networks with recombination).

But, non-binary, non-SNP data is becoming more important in population genomics: CNVs, full DNA sequence, micro-sats; and other applications in phylogenetics.

Page 141: Graphs and Graph Theory in Computational Biology

A 3-state perfect phylogeny

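M: n = 5 (number of taxa), m = 3 (number of sites), k = 3 (number of states)

Rows 1 to 5 of M over characters A, B, C: 3 2 1; 2 3 2; 3 2 3; 1 1 3; 1 2 3

[Figure: a 3-state perfect phylogeny for M with node labels (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3), (3,2,3).]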

Page 142: Graphs and Graph Theory in Computational Biology

A formal definition of a k-state perfect phylogeny

• Input consists of n sequences M with m sites (characters) each, where each site can take one of k > 2 states (values).

• T has n leaves, one for each sequence X in M, labeled by X.

• Each node of T is labeled with an m-length sequence (not necessarily from M) where each site has a value from 1 to k.

• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C form a connected subtree of T. This is the convexity requirement.

This more reflects the infinite alleles model rather than the infinite sites model in binary perfect phylogeny.

Page 143: Graphs and Graph Theory in Computational Biology

[Figure: the same table M and 3-state perfect phylogeny, with the subtree for State 2 of Character B highlighted.]

Page 144: Graphs and Graph Theory in Computational Biology

An alternative view of convexity

Arbitrarily root T at some node, and direct all the edges away from the root.

Then, any character C can mutate into a state s on at most one edge, but there are no edges where C mutates into its root state.

This view makes a k-state perfect phylogeny a more natural generalization of a binary perfect phylogeny.

Page 145: Graphs and Graph Theory in Computational Biology

[Figure: the same table M and 3-state perfect phylogeny, with one node marked Root and two edges marked B.]

Page 146: Graphs and Graph Theory in Computational Biology

K-state Perfect Phylogeny Problems

Existence Problem: Given M and k, is there a k-state Perfect Phylogeny for M?

Missing Data Problem: For a given k, if there are cells in M without values, can values less than or equal to k be imputed so that the resulting matrix M’ has a k-state perfect phylogeny?

Handling missing data extends the utility of the perfect-phylogeny model.

Page 147: Graphs and Graph Theory in Computational Biology

Removable data

Given data that does not have a k-state perfect phylogeny, what is the minimum number of characters to remove so that the remaining data does have a k-state perfect phylogeny?

The paper gives a solution for k = 3, but there is no time to discuss it in this talk. Here we discuss the missing data problem for arbitrary k.

Page 148: Graphs and Graph Theory in Computational Biology

Status of the Existence Problem

Poly-time algorithm for 3 states, Dress-Steel (1993)

Poly-time algorithm for 3 or 4 states, Kannan-Warnow (1994)

Poly-time algorithm for any fixed number of states: polynomial in n and m, but exponential in k, Agarwalla and Fernandez-Baca (1994)

Speed up of the AFB method by Kannan-Warnow (1997)

When k is not fixed, the existence problem is NP-hard

Page 149: Graphs and Graph Theory in Computational Biology

The missing data challenge

The general AFB,KW algorithms that solve the existence problem are not easily adapted to handle the missing data problem. They seem to extend only by brute-force enumeration of imputed values.

So, we need another approach to the missing data problem.

Page 150: Graphs and Graph Theory in Computational Biology

Status of Missing Data problem

NP-complete even for k = 2; effective, practical approaches for k = 2. (GFB in COCOON 2007; Satya, Mukherjee, TCBB 2008)

Polynomial-time methods for a “directed” variant of k = 2.

No literature on the missing data problem for k > 2.

New work here: specialized ILP methods for k = 3, 4, 5 and a general solution for any fixed k. In this talk I will only discuss the general solution.

Page 151: Graphs and Graph Theory in Computational Biology

New approach to existence and missing data problems

Based on an old theorem and newer techniques.

Old theorem: Buneman’s theorem relating Perfect-Phylogeny to chordal graphs. (thirty-five years old)

Newer techniques and theorems: Minimal triangulations of a non-chordal graph to make it chordal. The literature on minimal triangulations is current and ongoing.

Page 152: Graphs and Graph Theory in Computational Biology

Definition: Chordal Graphs

A graph G is called Chordal if every cycle of length four or more contains a chord. Chordal graphs are also called triangulated graphs.

G

Page 153: Graphs and Graph Theory in Computational Biology

Buneman’s Approach to Perfect Phylogeny (1974)

Input M, n by m (columns C1, C2, C3): 3 2 1; 2 3 2; 3 2 3; 1 1 3; 1 2 3

Partition-Intersection Graph G(M) has one node for each character-state pair in M, and an edge between two nodes if and only if there is a row in M with both those character-state pairs.

Each row of table M induces a clique in G(M).

G(M) is the superposition of n cliques, one for each row of M.

[Figure: G(M), with nodes arranged in columns C1, C2, C3 and rows for states 1, 2, 3.]

Page 154: Graphs and Graph Theory in Computational Biology

Definitions

If M has m characters, then G(M) is an m-partite graph. The nodes associated with a single character (class in the partition) are given a distinct color.

An edge (u,v) not in G(M) is called legal if u and v do not have the same color.

Two nodes with the same color are called a mono-chromatic pair.

Page 155: Graphs and Graph Theory in Computational Biology

Buneman’s Theorem

There is a perfect phylogeny for M if and only if legal edges can be added to graph G(M) to make it chordal.

If there is such a chordal graph, denote it G’(M).

Theorem (Buneman 1974)

G’(M) is called a legal triangulation of G(M).

Page 156: Graphs and Graph Theory in Computational Biology

From Chordal Graph to Perfect Phylogeny

Fact: Given a legal triangulation G’(M), a Perfect Phylogeny for M can be constructed in linear time.

The algorithms are based on “perfect elimination orders” and “clique trees”, classic objects in the chordal graph literature.

Page 157: Graphs and Graph Theory in Computational Biology

Example

M
   1 2 3
A  0 0 2
B  0 1 0
C  1 1 1
D  1 2 2

[Figure: G(M) on the nodes (1,0), (1,1), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2), with the cliques induced by rows A, B, C, D marked. Each node represents a character-state pair.]
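The partition-intersection graph of this example can be built mechanically from the table. The following is a small sketch with an illustrative function name; nodes are written as (character, state) pairs, as in the figure.

```python
from itertools import combinations

# The 4-taxon example table; characters are numbered 1..3, states are 0..2.
M = {"A": (0, 0, 2), "B": (0, 1, 0), "C": (1, 1, 1), "D": (1, 2, 2)}

def partition_intersection_graph(table):
    """Nodes are (character, state) pairs; each row contributes a clique."""
    adj = {}
    for row in table.values():
        nodes = [(c + 1, s) for c, s in enumerate(row)]
        for v in nodes:
            adj.setdefault(v, set())
        for u, v in combinations(nodes, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

G = partition_intersection_graph(M)
num_edges = sum(len(ns) for ns in G.values()) // 2
print(len(G), num_edges)  # 8 12
```

Each of the four rows contributes a triangle, and in this example no two rows share an edge, which is why there are 12 edges on 8 nodes.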

Page 158: Graphs and Graph Theory in Computational Biology

A legal triangulation

[Figure: table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2) and a legal triangulation G’(M) of G(M), with regions marked A, B, C, D, X, Y.]

Page 159: Graphs and Graph Theory in Computational Biology

Yields a Perfect Phylogeny

[Figure: table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2) and the perfect phylogeny T, with leaves A: 002, B: 010, C: 111, D: 122 and internal nodes X: 012, Y: 112.]

One node in T for each maximal clique in G’(M).

(Fact: every “clique-tree” of the chordal graph G’(M) is a perfect phylogeny for M.)
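The convexity requirement can be checked directly on a labeled tree. In the sketch below the node labels are the ones shown on the slide, but the edge list is an assumption: the topology is not recoverable from the transcript, so one tree consistent with those labels is used. The function name is illustrative.

```python
# Node labels as shown on the slide; the edge list is an assumed topology.
labels = {"A": (0, 0, 2), "B": (0, 1, 0), "C": (1, 1, 1), "D": (1, 2, 2),
          "X": (0, 1, 2), "Y": (1, 1, 2)}
edges = [("A", "X"), ("B", "X"), ("X", "Y"), ("Y", "C"), ("Y", "D")]

def is_convex(labels, edges):
    """True if, for every character-state pair, the nodes carrying it are connected."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m = len(next(iter(labels.values())))
    for c in range(m):
        for s in {lab[c] for lab in labels.values()}:
            members = {v for v, lab in labels.items() if lab[c] == s}
            start = next(iter(members))
            reached, stack = {start}, [start]
            while stack:
                v = stack.pop()
                for w in (adj[v] & members) - reached:
                    reached.add(w)
                    stack.append(w)
            if reached != members:
                return False
    return True

print(is_convex(labels, edges))  # True
```

Swapping the attachment points of leaves A and D breaks convexity for character 1, which the same function detects.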

Page 160: Graphs and Graph Theory in Computational Biology

What about Missing Data?

If M is missing data, build the partition intersection graph G(M) using the known data in M. Buneman’s theorem still holds:

Theorem: There is a perfect phylogeny for some imputation of missing data in M, if and only if there is a legal triangulation of G(M).

The legal triangulation gives a perfect phylogeny T for M with some imputed data, and then the imputed M’ can be obtained from T.

Page 161: Graphs and Graph Theory in Computational Biology

Example

[Figure: table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2) and its graph G(M).]

Page 162: Graphs and Graph Theory in Computational Biology

Example

[Figure: table M with a missing entry (A: 0 0 2, B: 0 ? 0, C: 1 1 1, D: 1 2 2) and the graph G(M) built from the known data.]
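The effect of the missing entry on G(M) can be seen in a few lines. This sketch assumes a missing cell is encoded as `None` and simply skipped when the row's clique is formed; the chordality test is a simple simplicial-elimination check rather than a linear-time method, and all names are illustrative.

```python
from itertools import combinations

complete_M = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]      # rows A, B, C, D
missing_M = [(0, 0, 2), (0, None, 0), (1, 1, 1), (1, 2, 2)]    # B's site 2 unknown

def pi_graph(rows):
    """Partition-intersection graph built from the known cells only."""
    adj = {}
    for row in rows:
        nodes = [(c + 1, s) for c, s in enumerate(row) if s is not None]
        for v in nodes:
            adj.setdefault(v, set())
        for u, v in combinations(nodes, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_chordal(adj):
    """Chordal iff simplicial vertices can be removed until nothing is left."""
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        v = next((v for v, ns in g.items()
                  if all(b in g[a] for a, b in combinations(ns, 2))), None)
        if v is None:
            return False
        for n in g[v]:
            g[n].discard(v)
        del g[v]
    return True

print(is_chordal(pi_graph(complete_M)))  # False
print(is_chordal(pi_graph(missing_M)))   # True
```

With the complete data, G(M) contains a chordless 4-cycle and must be triangulated. With B's entry missing, the graph built from the known data is already chordal, so no edges need to be added.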

Page 163: Graphs and Graph Theory in Computational Biology

The Key Problem

So the key problem, in both the Existence and the Missing Data problems, is how to find a legal triangulation, if there is one.

But, there is a robust and still expanding literature on efficient algorithms to find a minimal triangulation of a non-chordal graph.

Some triangulation problems are NP-hard (Tree-width, Minimizing the number of added edges).

The PI graph is conceptually perfect for modeling missing data.

Page 164: Graphs and Graph Theory in Computational Biology

Minimal triangulation

A triangulation of a non-chordal graph G is minimal if no proper subset of the added edges is a triangulation of G.

Clearly, if there is a legal triangulation G’(M) of G(M), then there is one that is a minimal triangulation. A minimal triangulation is good enough for us.

So we can take advantage of the minimal triangulation technology, and the contemporary literature. The minimal vertex separators are the key objects.

Page 165: Graphs and Graph Theory in Computational Biology

Minimal vertex separators

A set of nodes S whose removal separates vertices u and v is called a u,v separator. S is a minimal u,v separator if no proper subset of S is a u,v separator.

S is a “minimal separator” if it is a minimal u,v separator for some vertex pair u,v.

Minimal separator S crosses minimal separator S’ if S separates some pair of nodes in S’.

Crossing is a symmetric relation for minimal separators.
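The crossing relation can be tested by deleting one separator and checking whether the nodes of the other end up in different components. The sketch below uses the 4-taxon example table from this talk; the function names are illustrative, and the code assumes its arguments are already known to be minimal separators.

```python
from itertools import combinations

M = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # rows A, B, C, D

def pi_graph(rows):
    """Partition-intersection graph on (character, state) nodes."""
    adj = {}
    for row in rows:
        nodes = [(c + 1, s) for c, s in enumerate(row)]
        for v in nodes:
            adj.setdefault(v, set())
        for u, v in combinations(nodes, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def crosses(adj, s, s_prime):
    """True if removing s leaves two nodes of s_prime in different components."""
    comp_of = {}
    for start in adj:
        if start in s or start in comp_of:
            continue
        comp_of[start] = start
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in s and w not in comp_of:
                    comp_of[w] = start
                    stack.append(w)
    remaining = [v for v in s_prime if v not in s]
    return len({comp_of[v] for v in remaining}) > 1

G = pi_graph(M)
print(crosses(G, {(2, 1), (3, 2)}, {(1, 0), (1, 1)}))  # True
print(crosses(G, {(2, 1), (1, 1)}, {(1, 0), (3, 2)}))  # False
```

Calling `crosses` with the arguments swapped gives the same answers for these pairs, in line with the symmetry noted above.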

Page 166: Graphs and Graph Theory in Computational Biology

Example

[Figure: G(M) with the separators S and S’ marked.]

S = {(2,1), (3,2)} and S’ = {(1,0), (1,1)} are crossing minimal separators.

Page 167: Graphs and Graph Theory in Computational Biology

Example

[Figure: G(M) with the separators S and S’ marked.]

S = {(2,1), (1,1)} and S’ = {(1,0), (3,2)} are non-crossing minimal separators.

Page 168: Graphs and Graph Theory in Computational Biology

A lucky break for us: A complete characterization of the minimal triangulations of G was derived in 1997

Definition: Completing a minimal separator S means adding all the missing edges between pairs of nodes in S to make S a clique.

Page 169: Graphs and Graph Theory in Computational Biology

Capstone Theorem on Minimal Triangulations

Parra, Scheffler (1997): Every minimal triangulation of G is obtained by completing each minimal separator in a maximal set of pairwise non-crossing minimal separators of G.

Conversely, completing every minimal separator in a maximal set of pairwise non-crossing minimal separators yields a minimal triangulation of G.
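On the 4-taxon example, the two ways of completing a separator can be compared in code. This is a sketch under assumptions: it completes a single separator rather than searching for a maximal non-crossing set, which is enough here because the other minimal separators of this G(M) are already edges. All function names are illustrative.

```python
from itertools import combinations

M = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # rows A, B, C, D

def pi_graph(rows):
    """Partition-intersection graph on (character, state) nodes."""
    adj = {}
    for row in rows:
        nodes = [(c + 1, s) for c, s in enumerate(row)]
        for v in nodes:
            adj.setdefault(v, set())
        for u, v in combinations(nodes, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_chordal(adj):
    """Chordal iff simplicial vertices can be removed until nothing is left."""
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        v = next((v for v, ns in g.items()
                  if all(b in g[a] for a, b in combinations(ns, 2))), None)
        if v is None:
            return False
        for n in g[v]:
            g[n].discard(v)
        del g[v]
    return True

def complete(adj, separator):
    """Copy of adj with the separator made a clique, plus the list of added edges."""
    g = {v: set(ns) for v, ns in adj.items()}
    added = []
    for u, v in combinations(sorted(separator), 2):
        if v not in g[u]:
            g[u].add(v)
            g[v].add(u)
            added.append((u, v))
    return g, added

def is_legal(added):
    """Legal if no added edge joins two states of the same character."""
    return all(u[0] != v[0] for u, v in added)

G = pi_graph(M)
legal_graph, legal_added = complete(G, {(2, 1), (3, 2)})
illegal_graph, illegal_added = complete(G, {(1, 0), (1, 1)})
print(is_chordal(G))                                         # False
print(is_chordal(legal_graph), is_legal(legal_added))        # True True
print(is_chordal(illegal_graph), is_legal(illegal_added))    # True False
```

Both completions triangulate G(M) by adding a single edge, but only the first avoids joining two states of the same character.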

Page 170: Graphs and Graph Theory in Computational Biology

Example:

[Figure: table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2) and its graph G(M).]

There are 6 minimal separators.

There are two maximal sets of 5 pairwise non-crossing minimal separators.

Page 171: Graphs and Graph Theory in Computational Biology

A minimal (illegal) triangulation obeying the P,S Theorem

[Figure: table M (A: 0 0 2, B: 0 1 0, C: 1 1 1, D: 1 2 2) and the triangulated graph G’(M).]

Page 172: Graphs and Graph Theory in Computational Biology

A legal minimal triangulation

[Figure: the triangulated graph G’(M).]

Page 173: Graphs and Graph Theory in Computational Biology

Back to Perfect Phylogeny

A minimal separator S in the partition intersection graph G(M) is called legal if completing it adds no edge between two nodes of the same color, and is called illegal if it does.

The P,S Theorem can be used to prove the Main New Results.

Theorem 1: There is a perfect phylogeny for M, even if M is missing data, if and only if there is a set Q of pairwise non-crossing, legal, minimal separators in G(M) that separate every mono-chromatic pair of nodes in G(M).

Page 174: Graphs and Graph Theory in Computational Biology

The legal minimal triangulation, obeying Theorem 1

[Figure: the triangulated graph G’(M).]

From G’(M), we get a perfect phylogeny for M.

Page 175: Graphs and Graph Theory in Computational Biology

Corollaries to Theorem 1

Cor 1: If there is a mono-chromatic pair of nodes in G(M) that is not separated by any legal minimal separator, then M has no perfect phylogeny.

Cor 2: If G(M) has no illegal minimal separators, then M has a perfect phylogeny.

Cor 3: If every mono-chromatic pair of nodes is separated by some legal minimal separator, and no legal minimal separators cross, then M has a perfect phylogeny.

Page 176: Graphs and Graph Theory in Computational Biology

Recipe to solve the missing data problem with Theorem 1

Given M, find all legal minimal separators in G(M); for each legal minimal separator, determine which mono-chromatic pairs of nodes it separates, and which legal minimal separators it crosses.

Determine if any of the Corollaries hold. If so, either there is no perfect phylogeny (Cor. 1) or a set Q needed in Theorem 1 can be found greedily.

If no Cor. holds, set up and solve a (straightforward) integer linear program to find a set Q of pairwise non-crossing legal minimal separators that separate every mono-chromatic pair of nodes in G(M).

Page 177: Graphs and Graph Theory in Computational Biology

If the ILP is feasible, greedily extend Q to be a maximal set of pairwise non-crossing legal minimal separators, and use Q to get a legal triangulation G’(M) of G(M).

From G’(M), construct a perfect phylogeny T, and from T impute values for the missing entries.

Page 178: Graphs and Graph Theory in Computational Biology

Conceptually nice, but

Does it work in practice?

Page 179: Graphs and Graph Theory in Computational Biology

It works surprisingly (shockingly) well

Simulations with data from program ms, characteristic of many current applications in phylogenetics and population genetics, but not genomic scale or tree-of-life scale.

Page 180: Graphs and Graph Theory in Computational Biology

Surprising empirical results

The minimal separators are found quickly by existing algorithms from 1999: cubic-time per minimal separator, but we have methods (not in the paper) to speed this up.

When there is no missing data, all the legal minimal separators can be found in O(nm^2) worst-case time, for any fixed k.

The observed number of minimal separators is small. There are few crossing pairs of legal minimal separators.

Until a large percentage of the data is missing, most problems are solved by the Corollaries, without the need for an ILP.

Page 181: Graphs and Graph Theory in Computational Biology

The ILPs solve quickly in practice: all have solved in 0.00 CPLEX-reported seconds (CPLEX 11 on a 2.5 GHz machine).

Most solve in the CPLEX pre-processor.

When an ILP is needed, it has been tiny. For the existence problem, the size of the ILP is polynomially bounded.

Page 182: Graphs and Graph Theory in Computational Biology
Page 183: Graphs and Graph Theory in Computational Biology

So, although the chordal graph approach may at first seem impractical, it works on a large range of data, with sizes and degrees of missing data that are typical of current phylogenetic problems.

Page 184: Graphs and Graph Theory in Computational Biology

More structure

The empirical results suggest the existence of more combinatorial structure in the perfect-phylogeny problem. And more has been recently found.

(F. Lam) When k = 3, a NASC for the existence of a 3-state perfect-phylogeny is:

Every mono-chromatic pair of nodes in G(M) is separated by some legal minimal separator. (Compare to Theorem 1).

This does not extend to k = 4.

Page 185: Graphs and Graph Theory in Computational Biology

All software to replicate these results will be available on my UC Davis website, shortly.

