Post on 21-Dec-2015

A Large Version of the Small Parsimony Problem

Optimally reconstruct ancestral sequences given

- unrooted phylogeny (hence ‘small’ parsimony problem)
- multiple alignment
- affine gap cost function

Jakob Fredslund* (jakobf@birc.dk), Jotun Hein**, Tejs Scharling*

* Bioinformatics Research Center, Aarhus University, Denmark
** Department of Statistics, University of Oxford, United Kingdom

2

Overview

• Introduction

• Examples

• Gap graph construction

• Theory

• Results

• Conclusions

3

Small Parsimony, No Gaps

Algorithm due to Fitch-Hartigan-Sankoff: calculate N(A, C, G, T) in each node (the minimal cost of the subtree rooted at this node with nucleotide X in the root) going up, then backtrack going down.
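The up/down passes described above can be sketched for a single alignment column as follows. This is a minimal illustration, not the authors' code: it assumes a rooted binary or multifurcating tree given as a dict, a unit substitution cost, and the hypothetical names `fhs_column` and `subst_cost`. For a symmetric cost the root of an unrooted tree can be placed on any edge without changing the score.

```python
from math import inf

NUCS = "acgt"

def subst_cost(x, y):
    # Assumed unit cost per substitution; the slides do not fix this value.
    return 0 if x == y else 1

def fhs_column(tree, root, leaf_state):
    """tree: internal node -> list of children; leaf_state: leaf -> nucleotide.
    Returns (minimal cost, node -> assigned nucleotide)."""
    table = {}

    def up(node):
        # table[node][x] = minimal cost of the subtree at node with x in the root
        kids = tree.get(node, [])
        if not kids:
            table[node] = {x: (0 if x == leaf_state[node] else inf) for x in NUCS}
            return
        for k in kids:
            up(k)
        table[node] = {
            x: sum(min(subst_cost(x, y) + table[k][y] for y in NUCS) for k in kids)
            for x in NUCS
        }

    assignment = {}

    def down(node, state):
        # Backtrack: each child takes the state that realised the parent's minimum.
        assignment[node] = state
        for k in tree.get(node, []):
            best = min(NUCS, key=lambda y: subst_cost(state, y) + table[k][y])
            down(k, best)

    up(root)
    root_state = min(NUCS, key=lambda x: table[root][x])
    down(root, root_state)
    return table[root][root_state], assignment
```

Running it once per gap-free column gives the nucleotide assignment for the whole alignment.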

4

Small Parsimony, Large Version

1: ac-a---gattc
2: acgac---atcc
3: gc-----gagcc
4: -agacttgt---
5: aagtcttagt-c

g(k) = 12 + 2*k

(note: alignment is given)
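As a small illustration of the affine cost on this slide, the sketch below prices a single indel of length k with g(k) = 12 + 2*k and lists the gap-run lengths in one alignment row. The helper names `g` and `gap_runs` are hypothetical. Summing g over the runs of every row would overcount, because one indel can explain gaps in several rows at once.

```python
from itertools import groupby

def g(k, a=12, b=2):
    """Affine gap cost: opening cost a plus b per gapped position."""
    return a + b * k

def gap_runs(row):
    """Lengths of the maximal runs of '-' in one alignment row."""
    return [len(list(run)) for ch, run in groupby(row) if ch == "-"]

row1 = "ac-a---gattc"
print(gap_runs(row1))                   # run lengths in sequence 1
print([g(k) for k in gap_runs(row1)])   # cost of each run if priced as its own indel
```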

5

Two Steps

1) Find optimal set of indels to explain gaps

2) Assign nucleotides optimally (FHS)

So: focus on indels

6

Tracing Evolution

What events could explain this alignment?

cagtta

gcag--a

-cagtta

-cag--a

-ctg--a

7

Tracing Evolution

cagtta

cagtta

8

Tracing Evolution

cagtta caga

cagtta

cag--a

cagtta

9

Tracing Evolution

cagtta caga

ctga

cagtta

cag--a

ctg--a

caga

10

Tracing Evolution

cagtta caga

ctga

gcag--a

cagtta

cag--a

ctg--a

-cagtta

-cag--a

-ctg--a

gcaga

11

Indels Affect Full Subtrees

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

All sequences in right subtree have gaps in blue indel’s position

12

Indels Affect Full Subtrees

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

All sequences in left subtree have gaps in green indel’s position

13

Direction of Evolution?

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

deletion of tt

14

Direction of Evolution?

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

insertion of tt

15

Direction of Evolution?

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

Since we don’t know the direction, we refer to insertions/deletions as indels. And remember: an indel creates gaps in a full subtree.

16

Explaining Gaps With Indels

g(k) = a + bk

(Anonymous nucleotides denoted by n)

17

Explaining Gaps With Indels

g(k) = a + bk

Cost: 2*(a+2b)

18

Explaining Gaps With Indels

g(k) = a + bk

Cost: 2*(a+2b)

Cost: 3*(a+b)
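Read with g(k) = a + bk, the two expressions are the cost of two indels of length 2 and of three indels of length 1. A short worked comparison, which uses only the formula on the slide:

```latex
\begin{align*}
2\,g(2) &= 2(a + 2b) = 2a + 4b,\\
3\,g(1) &= 3(a + b) = 3a + 3b,\\
3\,g(1) - 2\,g(2) &= a - b .
\end{align*}
```

So the explanation with two longer indels is cheaper exactly when a > b. With the earlier example values a = 12, b = 2 the costs are 32 and 42.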

19

Larger Example

N8, N9, N10, N11, N12, N13 : ???.. Complex problem! (not aware of any upper time bound)

20

Gap Graph Construction

Represent in a concise way all gaps and how they are connected: in a graph.

21

Gap Intervals

1. Find gap intervals.

22

Gap Intervals

1. Find gap intervals.

No optimal indel ‘stops’ in the middle of a gap interval: it is cheaper to extend the indel making the first gap than to open a new one (by the triangle inequality).
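For the affine cost g(k) = a + bk with a ≥ 0, the inequality referred to above can be written out directly. Here k_1 and k_2 are the lengths of two adjacent pieces of one gap, symbols introduced only for this illustration:

```latex
\begin{align*}
g(k_1 + k_2) &= a + b\,(k_1 + k_2) \\
             &\le 2a + b\,(k_1 + k_2) \\
             &= g(k_1) + g(k_2).
\end{align*}
```

Splitting one indel into two therefore costs an extra opening penalty a and never pays off.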

23

Gap Graph Vertices

2. Create minimal tree coverings:

For each gap interval, find minimal number of subtrees with gaps in all leaves, covering all gaps
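A minimal sketch of the covering step for one gap interval, under simplifying assumptions: the tree is rooted at an arbitrary node and given as a dict, and the function name `minimal_covering` is hypothetical. The full treatment of an unrooted tree, where a subtree is one side of an edge, is not covered here.

```python
def minimal_covering(tree, root, gapped_leaves):
    """tree: internal node -> list of children.
    Returns the roots of the maximal subtrees whose leaves are all gapped."""
    cover = []

    def all_gapped(node):
        # True if every leaf below node has a gap in this interval.
        kids = tree.get(node, [])
        if not kids:
            return node in gapped_leaves
        full = [all_gapped(k) for k in kids]
        if all(full):
            return True
        # node itself is not fully gapped, so its fully gapped children are maximal
        cover.extend(k for k, f in zip(kids, full) if f)
        return False

    if all_gapped(root):
        cover.append(root)
    return cover
```

Maximal all-gap subtrees are disjoint, and any all-gap subtree lies inside one of them, so in the rooted setting they form a covering with the fewest subtrees.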

25

Gap Graph Vertices

Each vertex represents:

a) subtree with gaps in all leaves

b) region of alignment

32

Gap Graph Connections

3. Create connection between vertices v and w if they represent neighboring gaps.

33

Gap Graph Connections

3. Create connection between vertices v and w if they represent neighboring gaps.

v → w : all v’s gaps continue in w

35

Gap Graph Connections

3. Create connection between vertices v and w if they represent neighboring gaps.

v → w : all v’s gaps continue in w

(a special-case connection exists; see paper)
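One possible reading of the connection rule, sketched under explicit assumptions: a vertex is a set of leaves plus a column interval, two vertices are neighbours when their intervals are directly adjacent, and v → w is added when every leaf gapped in v is also gapped in w. The `Vertex` type and `connections` function are hypothetical, and the special-case connection from the paper is not modelled.

```python
from typing import FrozenSet, NamedTuple

class Vertex(NamedTuple):
    leaves: FrozenSet[str]  # leaves of the all-gap subtree
    start: int              # first alignment column, inclusive
    end: int                # last alignment column, inclusive

def adjacent(v, w):
    """True if the two column intervals touch without overlapping."""
    return w.start == v.end + 1 or v.start == w.end + 1

def connections(vertices):
    """Directed edges (v, w) where all of v's gaps continue in neighbour w."""
    return [
        (v, w)
        for v in vertices
        for w in vertices
        if v is not w and adjacent(v, w) and v.leaves <= w.leaves
    ]
```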

37

Interpreting a Gap Graph Vertex

A vertex is a potential indel: one indel could have created all gaps in the subtree.

Either one indel created all gaps in the subtree (vertex confirmed), ..

38

Interpreting a Gap Graph Vertex

.. or the vertex is decomposed into several indels (further ‘down’ in the tree).

Goal: confirm or decompose vertices with respect to the gap cost function.

43

Theory Needed Here..

44

We Need Optimality Proof

A gap graph may be huge, thus representing an enormous number of potential indels. We need to show two things:

P1: that all optimal indels are represented in the gap graph;

P2: how to ‘resolve the graph’ to determine the set of optimal indels.

P1 proved directly in paper (Theorem 1).

45

Resolving the Gap Graph

In order to determine the optimal set of indels, we need to reduce the potentially huge graph while keeping the optimal solution!

Theorem 2 and a set of following lemmas serve this purpose by identifying certain local graph configurations that can be reduced.

Preprocess the gap graph (perform local reductions) by applying the lemmas.

46

Preprocessing Earlier Example

Iteratively apply lemmas to reduce the graph..

50

Solving Earlier Example

After preprocessing: resolve remaining graph by checking all combinations

decompose

51

Solving Earlier Example

Placing indels in the tree:

52

After Local Preprocessing

• In longer examples there will be many undecided vertices (purple) after preprocessing.

• Find possible decompositions for each vertex and check all combinations in each chain – number of combinations exponential in chain length
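The exhaustive step for one chain can be sketched as a plain enumeration. Everything here is an assumption for illustration: each undecided vertex contributes a list of options, and `chain_cost` is a caller-supplied function that scores a complete combination. The slides do not give the real cost evaluation, so the toy cost in the usage lines simply sums g(k) over indel lengths.

```python
from itertools import product
from math import prod

def count_combinations(options_per_vertex):
    """Number of combinations to check: the product of the option counts."""
    return prod(len(opts) for opts in options_per_vertex)

def resolve_chain(options_per_vertex, chain_cost):
    """Try every combination of per-vertex options and return the cheapest."""
    return min(product(*options_per_vertex), key=chain_cost)

# Toy usage: each option is a tuple of indel lengths, priced with g(k) = 12 + 2k.
def toy_cost(combo):
    return sum(12 + 2 * k for choice in combo for k in choice)

options = [[(2,), (1, 1)], [(3,), (1, 2)]]
best = resolve_chain(options, toy_cost)
```

With two options per vertex, a chain of n undecided vertices needs 2^n evaluations, which is the exponential growth noted above.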

53

Execution Times..?

Worst-case: exponential.

Average times for random alignments with 60% gaps:

54

60% gaps is a lot..

55

Real Genome Analysis

B.ES.89.S61K15, B.FR.83.HXB2, B.GA.88.OYI, B.GB.83.CAM1, B.NL.86.3202A21, B.TW.94.TWCYS, B.US.86.AD87, B.US.84.NY5CG, and B.US.83.SF2

Nine HIV-1 subtypes from the Los Alamos HIV database (tree constructed with Quicktree).

Length: 9868. Running Time: 1 sec

56

Conclusions

• Concise way of representing alignment gaps

• Theoretically sound framework with proven optimality

• Graph reductions lead to fast resolution