Linear Programming for Phylogenetic Reconstruction Based on Gene Rearrangements

Jijun Tang (jtang@cse.sc.edu)

Department of Computer Science and Engineering
University of South Carolina

Acknowledgment

• Joint work with Bernard Moret (University of New Mexico).

• Supported by the National Science Foundation and the University of South Carolina.

Overview

• Introduction to gene-order data

• GRAPPA and the computational challenge

• Linear programming setup

• Experimental design

• Experimental results

• Conclusions


What Is a Phylogeny?

• The evolutionary history of a group of organisms

• Usually takes the form of a tree:
  • Modern organisms are placed at the leaves
  • Edges denote evolutionary relationships

Example

[Figure: example phylogeny (image not preserved in this transcript).]

Gene-Order Data

• A chromosome can be represented by an ordering of signed genes
  • Linear or circular
  • The sign of a gene represents its orientation

• The gene order can be rearranged by evolutionary events such as:
  • Inversion, transposition, and inverted transposition
  • Deletion and insertion

Gene-Order Rearrangements

[Figure: a circular genome with genes 1 to 8, shown before and after three rearrangements of the segment 2 3 4: an inversion, a transposition, and an inverted transposition. An inverted segment appears as −4 −3 −2.]
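The three operations can be sketched on a signed gene order stored as a Python list. This is a minimal illustration, not code from GRAPPA: the function names and the index conventions (half-open slices, insertion position `k` counted in the genome with the segment removed) are assumptions, and the circular genome is written linearly starting at gene 1.

```python
def inversion(genome, i, j):
    """Reverse genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Cut out genome[i:j] and reinsert it at index k of the remainder."""
    segment = genome[i:j]
    rest = genome[:i] + genome[j:]
    return rest[:k] + segment + rest[k:]


def inverted_transposition(genome, i, j, k):
    """Like a transposition, but the segment is reinserted inverted."""
    segment = [-g for g in reversed(genome[i:j])]
    rest = genome[:i] + genome[j:]
    return rest[:k] + segment + rest[k:]


genome = [1, 2, 3, 4, 5, 6, 7, 8]
print(inversion(genome, 1, 4))                  # [1, -4, -3, -2, 5, 6, 7, 8]
print(transposition(genome, 1, 4, 5))           # [1, 5, 6, 7, 8, 2, 3, 4]
print(inverted_transposition(genome, 1, 4, 5))  # [1, 5, 6, 7, 8, -4, -3, -2]
```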

Reconstruction Methods

• Distance-based methods: neighbor-joining and its variants

• Bayesian method: Badger

• Maximum parsimony based on encoding: MPBE, MPME

• Direct optimization methods: BPAnalysis, GRAPPA, MGR

Direct Optimization Methods

• Goal: reconstruct the phylogeny with the minimum number of rearrangement events

• Computationally hard even for only three genomes
  • The median problem for three genomes is NP-hard under general distance definitions
  • Find the content of the median genome that minimizes the sum of the distances from the median to the three genomes
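To make the median problem concrete, here is a brute-force sketch for very small genomes. It assumes the breakpoint distance on signed linear gene orders (one possible distance; the slides do not fix one), and the function names are hypothetical. Enumerating all n!·2^n candidates is only feasible for tiny n, which is the point: the exhaustive approach does not scale.

```python
from itertools import permutations, product


def adjacencies(genome):
    """Adjacencies of a signed linear genome, framed by 0 and n + 1."""
    framed = [0] + list(genome) + [len(genome) + 1]
    pairs = set()
    for a, b in zip(framed, framed[1:]):
        pairs.add((a, b))
        pairs.add((-b, -a))  # the same adjacency read from the other strand
    return pairs


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    adj2 = adjacencies(g2)
    framed = [0] + list(g1) + [len(g1) + 1]
    return sum((a, b) not in adj2 for a, b in zip(framed, framed[1:]))


def brute_force_median(g1, g2, g3):
    """Try every signed permutation; return (best score, one best median)."""
    n = len(g1)
    best = None
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            candidate = [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(candidate, g) for g in (g1, g2, g3))
            if best is None or score < best[0]:
                best = (score, candidate)
    return best


g1, g2, g3 = [1, 2, 3, 4], [1, -3, -2, 4], [1, 2, -4, -3]
print(breakpoint_distance(g1, g2))     # 2
print(brute_force_median(g1, g2, g3))  # (4, [1, 2, 3, 4])
```

In this toy instance the pairwise distances are 2, 2, and 4, and the best median score, 4, equals half of their sum.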

Reconstruction Example

[Figure: a reconstruction example showing signed gene orders over genes 1 to 12, annotated with labels such as (1,3), (9), (6,9), and (4,7).]

GRAPPA

• Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms

• Started as an effort to reimplement the BPAnalysis of Sankoff and Blanchette

• Used algorithmic techniques to improve the speed
  • A tightened lower bound to discard bad trees before scoring them
  • Profiling, cache awareness, etc.

Algorithm Outline

• Consider each tree topology in turn

• For each tree:
  • Test the lower bound; if it exceeds the best score so far, continue to the next tree
  • Initialize the internal nodes by some means
  • Compute medians of three iteratively until no change occurs

• Return the lowest-scoring tree
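The control flow of this outline can be sketched as a pruned search. This is only a skeleton under stated assumptions: `lower_bound` and `score_tree` are hypothetical callables standing in for GRAPPA's bound and its iterated-median scoring, and the toy numbers below are made up for illustration.

```python
def search_trees(trees, lower_bound, score_tree):
    """Return (best_score, best_trees) over the candidate topologies."""
    best_score = float("inf")
    best_trees = []
    for tree in trees:
        # Prune: a tree whose lower bound exceeds the best score cannot win.
        if lower_bound(tree) > best_score:
            continue
        score = score_tree(tree)  # the expensive step (iterated medians)
        if score < best_score:
            best_score, best_trees = score, [tree]
        elif score == best_score:
            best_trees.append(tree)
    return best_score, best_trees


# Toy stand-ins: three named "trees" with made-up bounds and scores.
bounds = {"T1": 10, "T2": 14, "T3": 9}
scores = {"T1": 12, "T2": 15, "T3": 11}
scored = []


def toy_score(tree):
    scored.append(tree)  # record which trees were actually scored
    return scores[tree]


best_score, best_trees = search_trees(["T1", "T2", "T3"], bounds.get, toy_score)
print(best_score, best_trees, scored)  # 11 ['T3'] ['T1', 'T3']
```

In the toy run, T2 is never scored because its bound (14) already exceeds the best score found so far (12).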

Scoring a Tree

[Figure: a tree with leaves 1 to 5 and internal nodes A, B, and C, shown over five animation steps.]

Computational Challenge

• Scoring a tree is very expensive

• When the genomes are distant, a median may take days or months to be solved

• The median problems must be solved iteratively

• Can we find the tree score without solving the median problems?

Linear Programming Approach

• Goal: minimize the tree length

• What do we know?
  • The pairwise distance matrix
  • A given tree topology

• Approach:
  • Find useful constraints
  • Use linear programming to minimize the tree length

Median Problem

[Figure: three genomes 1, 2, and 3 with pairwise distances $d_{12}$, $d_{13}$, and $d_{23}$, each joined to a median genome 0 by an edge of length $d_{01}$, $d_{02}$, or $d_{03}$.]

$$d_{01} + d_{02} + d_{03} \ge \frac{d_{12} + d_{23} + d_{13}}{2}$$

In more than 98% of cases we have

$$d_{01} + d_{02} + d_{03} = \frac{d_{12} + d_{23} + d_{13}}{2}$$
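The relation between the median score and the half-sum of the pairwise distances follows from the triangle inequality, assuming the chosen distance is a metric. A short derivation, using the symbols of this slide:

```latex
\begin{align*}
d_{12} &\le d_{01} + d_{02}, \\
d_{13} &\le d_{01} + d_{03}, \\
d_{23} &\le d_{02} + d_{03}, \\
\text{so}\quad d_{12} + d_{13} + d_{23} &\le 2\,(d_{01} + d_{02} + d_{03}).
\end{align*}
```

The half-sum $(d_{12} + d_{13} + d_{23})/2$ is therefore a lower bound on the median score. For example, with $d_{12} = 4$, $d_{13} = 6$, and $d_{23} = 8$, no median can score below 9.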

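Constraint on Internal Node

[Figure: an internal node M joined to its three neighbors A, B, and C by edges of length $d_k$, $d_{k+1}$, and $d_{k+2}$; the pairwise distances among the neighbors are $d_{A,B}$, $d_{A,C}$, and $d_{B,C}$.]

$$\forall M:\quad d_k + d_{k+1} + d_{k+2} = \frac{d_{A,B} + d_{A,C} + d_{B,C}}{2}$$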

Equations

[Figure: a tree with leaves 1, ..., N and internal nodes N+1, ..., 2N−2. Edge lengths are labeled $d_1, d_2, d_3, \dots, d_{2N-5}, d_{2N-4}, d_{2N-3}$; distances between nodes are labeled $d_{1,2}$, $d_{1,N+2}$, $d_{2,N+2}$, ..., $d_{2N-3,N-1}$, $d_{2N-3,N}$, $d_{N-1,N}$.]

$$d_1 + d_2 + d_3 = \frac{d_{1,2} + d_{2,N+2} + d_{1,N+2}}{2}$$

$$\cdots$$

$$d_{2N-5} + d_{2N-4} + d_{2N-3} = \frac{d_{2N-3,N-1} + d_{N-1,N} + d_{2N-3,N}}{2}$$
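One way to see where the N − 2 equations come from is to walk over a tree topology and emit one equation per internal node. The sketch below is an illustration under assumptions: the adjacency-dictionary representation and the function name are hypothetical, and an edge is written as the node pair it joins rather than with the slide's $d_k$ numbering.

```python
from itertools import combinations


def median_equations(adjacency):
    """One equation per internal (degree-3) node M with neighbors A, B, C:
    d(M,A) + d(M,B) + d(M,C) = (d(A,B) + d(A,C) + d(B,C)) / 2.
    Each equation is returned as (edge_list, pair_list)."""
    equations = []
    for node, neighbors in sorted(adjacency.items()):
        if len(neighbors) != 3:
            continue  # leaves have degree 1 and contribute no equation
        edges = [tuple(sorted((node, nb))) for nb in sorted(neighbors)]
        pairs = list(combinations(sorted(neighbors), 2))
        equations.append((edges, pairs))
    return equations


# Unrooted tree on N = 4 leaves (1..4) with internal nodes 5 and 6.
tree = {1: [5], 2: [5], 3: [6], 4: [6], 5: [1, 2, 6], 6: [3, 4, 5]}
for edges, pairs in median_equations(tree):
    lhs = " + ".join(f"d{a},{b}" for a, b in edges)
    rhs = " + ".join(f"d{a},{b}" for a, b in pairs)
    print(f"{lhs} = ({rhs}) / 2")
# d1,5 + d2,5 + d5,6 = (d1,2 + d1,6 + d2,6) / 2
# d3,6 + d4,6 + d5,6 = (d3,4 + d3,5 + d4,5) / 2
```

With N = 4 the sketch yields N − 2 = 2 equations. Distances that involve an internal node, such as d1,6, are not in the input matrix and become extra unknowns.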

Problems

[Figure: the labeled tree on leaves 1, ..., N and internal nodes N+1, ..., 2N−2.]

• There are ≈ 5N variables, but only N − 2 equations

• There are many (and redundant) triangle inequalities

Inequality Equations

• We want to pick a minimum number of inequalities that covers all the variables

• We know only the distance matrix and the tree topology

• Choice: for each pair of genomes, find the two shortest paths from one to the other, and build one inequality for each path

Inequality Equations

[Figure: the labeled tree on leaves 1, ..., N and internal nodes N+1, ..., 2N−2, with edge lengths $d_1, \dots, d_{2N-3}$.]

$$d_{1,2} \le d_1 + d_3$$

$$d_{N-1,N} \le d_{2N-4} + d_{2N-3}$$

$$\cdots$$

$$d_{1,N-1} \le d_{1,N+2} + \cdots + d_{2N-3,N-1}$$

$$d_{1,N-1} \le d_{1,N+2} + \cdots + d_{2N-5} + d_{2N-4}$$
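A simplified version of a path inequality can be generated directly from the topology: the known distance between two leaves is at most the sum of the edge lengths along the tree path that joins them. The sketch below covers only this tree-edge path. It does not reproduce the two-shortest-paths selection described on the previous slide, and the representation and function names are assumptions.

```python
def tree_path(adjacency, start, goal):
    """Return the unique path of nodes from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nb in adjacency[node]:
            if nb not in path:
                stack.append((nb, path + [nb]))
    raise ValueError("nodes are not connected")


def path_inequality(adjacency, i, j):
    """Edges whose lengths must sum to at least d(i, j)."""
    path = tree_path(adjacency, i, j)
    return [tuple(sorted(edge)) for edge in zip(path, path[1:])]


# Unrooted tree on leaves 1..4 with internal nodes 5 and 6.
tree = {1: [5], 2: [5], 3: [6], 4: [6], 5: [1, 2, 6], 6: [3, 4, 5]}
print(path_inequality(tree, 1, 2))  # [(1, 5), (2, 5)]
print(path_inequality(tree, 1, 4))  # [(1, 5), (5, 6), (4, 6)]
```

The second line reads as: d(1,4) ≤ d(1,5) + d(5,6) + d(4,6).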

Sum-up

• Examine every tree

• For each tree (with N genomes):
  • Minimize the sum of the 2N − 3 edge lengths
  • ≈ 5N variables in total
  • N − 2 equations and fewer than 2N(N − 1) inequalities

• These numbers are relatively small if N < 20

• Use lp_solve to find the length of the tree

• Return tree(s) with the minimum length

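For the smallest case, three genomes joined to one internal node, the whole model fits in a few lines. The sketch below writes it in lp_solve's LP text format. The variable names `e1` to `e3`, the constraint names, and the input distances are assumptions for illustration; the real model for N genomes has many more variables and constraints.

```python
def star_tree_lp(d12, d13, d23):
    """LP text for three genomes joined to one internal node by e1, e2, e3."""
    half_sum = (d12 + d13 + d23) / 2
    lines = [
        "/* minimize the tree length */",
        "min: e1 + e2 + e3;",
        f"m1: e1 + e2 + e3 = {half_sum};",  # equation for the internal node
        f"t12: e1 + e2 >= {d12};",          # triangle inequalities
        f"t13: e1 + e3 >= {d13};",
        f"t23: e2 + e3 >= {d23};",
    ]
    return "\n".join(lines)


print(star_tree_lp(4, 6, 8))
```

With only three genomes the equation already fixes the tree length at 9.0, so the solver has nothing to choose. The objective starts to matter on larger trees, where the distances involving internal nodes are unknowns.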

Experimental Design

• Real datasets: limited samples

• Simulation:
  • Generate a tree (the true tree) from different topologies: uniform, birth-death, ...
  • Assign edge lengths based on the expected evolutionary rate
  • Assign gene content to each genome based on the edge length
  • Use GRAPPA to find a tree (the inferred tree)
  • Compare the inferred tree with the true tree to determine the accuracy

Topological Accuracy

• False positive: an edge is in the inferred tree, but not in the true tree

• False negative: an edge is in the true tree, but not in the inferred tree

Goal: minimize FP and FN
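Counting false positives and false negatives comes down to comparing the sets of internal edges, where each edge is identified by the bipartition of the leaves it induces. A minimal sketch, assuming each tree is given as a list of leaf sets (one side of each internal edge); the function names are hypothetical.

```python
def canonical(side, leaves):
    """Represent a bipartition by the side without the smallest leaf."""
    side = frozenset(side)
    return side if min(leaves) not in side else frozenset(leaves) - side


def fp_fn(true_splits, inferred_splits, leaves):
    """Return (number of false positives, number of false negatives)."""
    true_set = {canonical(s, leaves) for s in true_splits}
    inferred_set = {canonical(s, leaves) for s in inferred_splits}
    false_positives = inferred_set - true_set  # in inferred tree only
    false_negatives = true_set - inferred_set  # in true tree only
    return len(false_positives), len(false_negatives)


leaves = {1, 2, 3, 4, 5}
true_tree = [{1, 2}, {4, 5}]      # internal edges of ((1,2),3,(4,5))
inferred_tree = [{1, 2}, {3, 4}]  # internal edges of ((1,2),5,(3,4))
print(fp_fn(true_tree, inferred_tree, leaves))  # (1, 1)
```

Dividing each count by the number of internal edges (N − 3 for an unrooted binary tree on N leaves) gives a rate.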

Simulation Details

• Number of genomes (N): 12

• Number of genes (n): 200, 500, and 1000

• Expected number of events on each edge: 0.05n to 0.15n

• Tree topologies: uniform and birth-death

• Datasets for each combination: 10

FN (500 genes, BD tree)

[Plot: FN rate (n = 500) versus r, with r ranging from 24 to 72 and the rate axis from 0 to 20; curves for NJ and LP.]

FP (500 genes, BD tree)

[Plot: FP rate (n = 500) versus r, with r ranging from 24 to 72 and the rate axis from 0 to 20; curves for NJ and LP.]

FN (1000 genes, uniform tree)

[Plot: FN rate (n = 1000) versus r, with r ranging from 48 to 144 and the rate axis from 0 to 25; curves for NJ and LP.]

FP (1000 genes, uniform tree)

[Plot: FP rate (n = 1000) versus r, with r ranging from 48 to 144 and the rate axis from 0 to 25; curves for NJ and LP.]

Conclusion

• Linear programming gives us a new and accurate method for difficult datasets

• Can be applied to any distance

• Has the potential to be used for large and complex genomes

• Can be extended to solve the median problems