
Dynamic Programming and Biological Sequence Comparison

Date post: 03-Jan-2016
Upload: remedios-carrillo
Page 1: Dynamic Programming and Biological Sequence Comparison

Dynamic Programming and Biological Sequence Comparison

Part I

Page 2: Dynamic Programming and Biological Sequence Comparison

\course\eleg667-01-f\Topic-2a.ppt 2

Topic II – Biological Sequence Alignment and Database Search

Part I (Topic-2a): Dynamic programming and Sequence comparison

Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment

Part III (Topic-2c): Multiple sequence alignment

Page 3: Dynamic Programming and Biological Sequence Comparison


Outline

Concept of alignment

Two algorithm design techniques

Dynamic Programming: examples

Applying DP to sequence comparison

The database search problem

Heuristic algorithms for database search

Page 4: Dynamic Programming and Biological Sequence Comparison


Alignment

The two sequences must have the same length (after possibly inserting spaces into either or both of them)

No space in one sequence can be aligned with a space in the other

Spaces can be inserted at the beginning or end of the sequences
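
The three rules above can be checked mechanically. The following is a minimal sketch, assuming that a space is written as "-" and that an alignment is given as two gapped rows; the function name `is_valid_alignment` is illustrative and does not come from the slides.

```python
def is_valid_alignment(row_x, row_y, x, y, space="-"):
    """Check the alignment rules for two gapped rows built from x and y."""
    # Rule 1: after inserting spaces, both rows have the same length.
    if len(row_x) != len(row_y):
        return False
    # Rule 2: a space is never aligned with a space.
    if any(a == space and b == space for a, b in zip(row_x, row_y)):
        return False
    # Removing the spaces must give back the original sequences.
    # Spaces at the beginning or end are allowed, so nothing else is checked.
    return row_x.replace(space, "") == x and row_y.replace(space, "") == y


print(is_valid_alignment("ATA-AGT", "ATGCAGT", "ATAAGT", "ATGCAGT"))    # True
print(is_valid_alignment("ATA--AGT", "ATG-CAGT", "ATAAGT", "ATGCAGT"))  # False
```

The second call fails because the fourth column pairs a space with a space.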

Page 5: Dynamic Programming and Biological Sequence Comparison


Biological Sequence Alignment and Database Search

1. We have two sequences over the same alphabet, both about the same length (tens of thousands of characters), and the sequences are almost equal. The average frequency of differences is low, say, one per hundred characters. We want to find the places where the differences occur.

2. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to a suffix of the other.

Page 6: Dynamic Programming and Biological Sequence Comparison


Biological Sequence Alignment and Database Search (cont’d)

3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity.

4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar.

5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others.

Page 7: Dynamic Programming and Biological Sequence Comparison


Two Related Algorithm Design Techniques

Breaking Problems Down:

Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem.

Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved.

Page 8: Dynamic Programming and Biological Sequence Comparison


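Divide and Conquer – Example 1: Quick Sort

[Figure: Quick Sort applied to the list 9 1 25 4 15. Partitioning around the pivot 9 turns it into 4 1 | 9 | 25 15. The sublist 4 1 becomes 1 4 and the sublist 25 15 becomes 15 25, giving the sorted list 1 4 9 15 25.]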
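
The divide, recurse and combine steps of Quick Sort can be sketched as follows. This is a minimal, non-in-place version that assumes the first element is used as the pivot; the slide's own partitioning scheme may order the sublists differently before they are sorted.

```python
def quick_sort(values):
    """Sort a list with the divide-and-conquer scheme of Quick Sort."""
    if len(values) <= 1:
        return list(values)                       # smallest instance: already sorted
    pivot, rest = values[0], values[1:]
    smaller = [v for v in rest if v < pivot]      # divide around the pivot
    larger = [v for v in rest if v >= pivot]
    # Solve both subinstances recursively and combine the partial solutions.
    return quick_sort(smaller) + [pivot] + quick_sort(larger)


print(quick_sort([9, 1, 25, 4, 15]))  # [1, 4, 9, 15, 25]
```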

Page 9: Dynamic Programming and Biological Sequence Comparison


Divide and Conquer – Example 2

The Fibonacci numbers

$F_1 = 1$, $F_2 = 1$

$F_n = F_{n-1} + F_{n-2}$

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

int Fib(int n) {
    if (n <= 2) return 1;          /* base cases: F_1 = F_2 = 1 */
    return Fib(n-1) + Fib(n-2);    /* two recursive subproblems */
}

Page 10: Dynamic Programming and Biological Sequence Comparison


Divide and Conquer – Example 2

$F_1 = 1$, $F_2 = 1$

$F_n = F_{n-1} + F_{n-2}$

[Figure: recursion tree for F(7). F(7) calls F(6) and F(5), and each call expands in the same way down to the leaves F(2) and F(1). F(5) appears twice, F(4) three times and F(3) five times.]

n     1  2  3  4  5  6   7   8   9  10  11  …
F_n   1  1  2  3  5  8  13  21  34  55  89  …

$F_n / F_{n-1} \approx 1.6$, so $F_n \approx 1.6^n$ for $n \gg 1$

$T(n) \approx$ #internal_nodes = #leaves − 1, but #leaves = $F_n$

$T(n) = O(1.6^n)$: Exponential Time!
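
The node count above can be checked by instrumenting the naive recursion. The sketch below assumes the base cases $F_1 = F_2 = 1$; the helper name `fib_calls` is illustrative. Every leaf returns 1, so the tree has $F_n$ leaves and $F_n - 1$ internal nodes, that is, $2F_n - 1$ calls in total.

```python
def fib_calls(n):
    """Return (F_n, number of calls made by the naive recursion)."""
    if n <= 2:
        return 1, 1                      # a leaf: one call, value 1
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    return a + b, calls_a + calls_b + 1  # +1 for this internal node


for n in (7, 10, 20):
    value, calls = fib_calls(n)
    print(n, value, calls)
# 7 13 25
# 10 55 109
# 20 6765 13529
```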

Page 11: Dynamic Programming and Biological Sequence Comparison


How to Compute the Fib Function Using the Dynamic Programming Method?

Page 12: Dynamic Programming and Biological Sequence Comparison


Dynamic Programming – Example 1

int Fib(int n) {
    int tab[n + 2];                /* indices 1..n are used; +2 keeps tab[2] valid when n = 1 */
    tab[1] = 1; tab[2] = 1;        /* smallest subproblems */
    for (int j = 3; j <= n; j++)
        tab[j] = tab[j-1] + tab[j-2];
    return tab[n];
}

Start by solving the smallest problems

Use the partial solutions to solve bigger and bigger problems

Extra memory to store intermediate values

tab: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

Linear Time! T(n) = O(n)

Space-Time Tradeoff
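
The tabulated computation can be sketched in Python as below, assuming n >= 1. The second function is an addition that is not on the slide: since each entry depends only on the previous two, the table can be replaced by two variables, which keeps the linear time but uses constant extra space.

```python
def fib_table(n):
    """Bottom-up Fibonacci with an explicit table (O(n) time, O(n) space)."""
    tab = [0] * (max(n, 2) + 1)      # index 0 is unused
    tab[1] = tab[2] = 1              # smallest subproblems
    for j in range(3, n + 1):
        tab[j] = tab[j - 1] + tab[j - 2]
    return tab[n]


def fib_two_vars(n):
    """Same recurrence, keeping only the last two values (O(1) extra space)."""
    prev, curr = 1, 1
    for _ in range(3, n + 1):
        prev, curr = curr, prev + curr
    return curr


print(fib_table(11), fib_two_vars(11))  # 89 89
```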

Page 13: Dynamic Programming and Biological Sequence Comparison


Sequence Comparison

Molecular sequence data are at the heart of Computational Biology

DNA sequences
RNA sequences
Protein sequences

We can think of these sequences as strings of letters

DNA: alphabet of 4 letters (A, T, C, G)
RNA: alphabet of 4 letters (A, U, C, G)
Protein: alphabet of 20 letters

code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine

Page 14: Dynamic Programming and Biological Sequence Comparison


Sequence Comparison – (Cont.)

Why compare sequences?

Find similar genes/proteins: this allows us to predict function & structure

Locate common subsequences in genes/proteins: identify common recurrent patterns

Locate sequences that might overlap: this helps in sequence assembly

Page 15: Dynamic Programming and Biological Sequence Comparison


Sequence Comparison – (Cont.)

To compare the sequences we need to quantify the similarity

matches = 1
mismatches = 0

Sequence X = A T A A G T
Sequence Y = A T G C A G T
Score        1 1 0 0 0 0 0

Total = 2

Page 16: Dynamic Programming and Biological Sequence Comparison


Sequence Comparison – (Cont.)

Taking positions of the letters into account

matches = 1
mismatches = 0

Sequence Y = A T G C A G T
Sequence X =   A T A A G T
Score        0 0 0 0 1 1 1

Total = 3

Page 17: Dynamic Programming and Biological Sequence Comparison


Sequence Comparison – (Cont.)

How to take possible mutations into account?

matches = 1
mismatches = 0
gap = -1

Sequence Y =  A  T  G  C  A  G  T
Sequence X =  A  T  A  -  A  G  T
Score         1  1  0 -1  1  1  1

Total = 4
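
The column-by-column scoring used on the last three slides can be sketched as a small function. The name `score_alignment` and the use of "-" for a space are assumptions made for illustration; the default values are the ones given on the slide (matches = 1, mismatches = 0, gap = -1).

```python
def score_alignment(row_x, row_y, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two equal-length gapped rows."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap          # a letter aligned with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("ATA-AGT", "ATGCAGT"))         # 4
print(score_alignment("ATAAGT-", "ATGCAGT", gap=0))  # 2 (left-aligned, spaces not penalized)
print(score_alignment("-ATAAGT", "ATGCAGT", gap=0))  # 3 (right-aligned, spaces not penalized)
```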

Page 18: Dynamic Programming and Biological Sequence Comparison


Applying DP to Sequence Comparison

Sequence X = GA
Sequence Y = AG

[Figure: tree enumerating all alignments of X and Y, built up one column at a time, with the score of every partial alignment written beneath it. The scores of the alignments range from -4 to 0.]

choose the best score, i.e. max(-2, 0, -2)
choose the best score, i.e. max(-3, 0, -1)
choose the best score, i.e. max(-1, 0, -3)
choose the best score, i.e. max(-1, 0, -1)

total score = 0

$T(n,n) = O(k^n)$: Exponential Time!
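
The exhaustive enumeration can be sketched as a plain recursion that tries all three choices for the first column and keeps the best total. This is an illustrative sketch, assuming the scoring matches = 1, mismatches = 0, gap = -1; the function name is not from the slides. Without memoization the same suffix pairs are solved over and over, which is where the exponential running time comes from.

```python
def best_score_brute_force(x, y, match=1, mismatch=0, gap=-1):
    """Best global alignment score found by trying every alignment."""
    if not x and not y:
        return 0
    options = []
    if x and y:   # first column aligns x[0] with y[0]
        column = match if x[0] == y[0] else mismatch
        options.append(column + best_score_brute_force(x[1:], y[1:], match, mismatch, gap))
    if x:         # first column aligns x[0] with a space
        options.append(gap + best_score_brute_force(x[1:], y, match, mismatch, gap))
    if y:         # first column aligns a space with y[0]
        options.append(gap + best_score_brute_force(x, y[1:], match, mismatch, gap))
    return max(options)


print(best_score_brute_force("GA", "AG"))  # 0
```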

Page 19: Dynamic Programming and Biological Sequence Comparison


Applying DP to Sequence Comparison

Sequence X = GA
Sequence Y = AG

[Figure: the same enumeration tree of alignments, now shown next to the DP matrix. Each matrix cell stores the best score over all partial alignments that end at that cell.]

          G   A
      0  -1  -2
  A  -1   0   0
  G  -2   0   0

$T(n,n) = O(n^2)$: Polynomial Time!

Page 20: Dynamic Programming and Biological Sequence Comparison


Questions

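Question: when the DP comparison ends, how many possible distinct paths have been explored in total for this example?

Answer: Let us count. Total = 13

Similarity matrix:

          G   A
      0  -1  -2
  A  -1   0   0
  G  -2   0   0

Cell numbering:

  1  2  4
  3  5  7
  6  8  9

Question: from 1 to 9 how many paths?

[Figure: tree of all paths from cell 1 to cell 9. From each cell a path may continue to the cell on its right, the cell below it, or the cell diagonally below and to the right. The tree has 13 leaves, all labelled 9.]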
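
The count of 13 can be reproduced with a small DP of its own: the number of paths into a cell is the sum of the path counts of the three cells it can be reached from. A minimal sketch, where `count_paths(m, n)` is an illustrative name and m, n are the lengths of the two sequences:

```python
def count_paths(m, n):
    """Number of monotone paths from the top-left to the bottom-right cell
    of an (m+1) x (n+1) matrix using right, down and diagonal steps."""
    paths = [[1] * (n + 1) for _ in range(m + 1)]   # one path along each border
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paths[i][j] = paths[i - 1][j] + paths[i - 1][j - 1] + paths[i][j - 1]
    return paths[m][n]


print(count_paths(2, 2))  # 13
```

Each such path corresponds to one alignment, so this is also the number of alignments the exhaustive enumeration has to score.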

Page 21: Dynamic Programming and Biological Sequence Comparison


DP algorithm for Sequence Comparison

m = length(X)
n = length(Y)
int S[0..m, 0..n]
for i = 0 to m do S[i,0] = i · g
for j = 0 to n do S[0,j] = j · g
for i = 1 to m do
    for j = 1 to n do
        S[i,j] = max( S[i-1,j] + g,
                      S[i-1,j-1] + sb[i,j],
                      S[i,j-1] + g )
return S[m,n]

g - score of aligning a letter with a space (gap)

sb[i,j] - Substitution Matrix: the score of aligning X[i] with Y[j]

      A  T  C  G
  A   1  0  0  0
  T   0  1  0  0
  C   0  0  1  0
  G   0  0  0  1

Start by solving the smallest problems

Use the partial solutions to solve bigger and bigger problems

Extra memory to store intermediate values
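
A runnable version of this algorithm might look like the sketch below. It assumes the identity substitution scores (1 for equal letters, 0 otherwise) and g = -1 as defaults; the function name and parameter names are illustrative. Python strings are 0-indexed, so X[i] in the pseudocode is `x[i - 1]` here.

```python
def global_similarity_matrix(x, y, match=1, mismatch=0, g=-1):
    """Fill the (m+1) x (n+1) similarity matrix for global alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * g                      # prefix of x against spaces only
    for j in range(n + 1):
        S[0][j] = j * g                      # prefix of y against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + g,
                          S[i - 1][j - 1] + sb,
                          S[i][j - 1] + g)
    return S


S = global_similarity_matrix("ATAAGT", "ATGCAGT")
print(S[-1][-1])  # 4
```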

Page 22: Dynamic Programming and Biological Sequence Comparison


The Substitution Matrix

For DNA we usually use identity matrices:

      A  T  C  G
  A   1  0  0  0
  T   0  1  0  0
  C   0  0  1  0
  G   0  0  0  1

For proteins, more sensitive matrices, derived empirically, are used:

    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
A   2   0  -2   0   0  -4   1  -1  -1  -1  -2  -1   0   1   0  -2   1   1   0  -6  -3   0
B   0   2  -4   3   2  -5   0   1  -2   1  -3  -2   2  -1   1  -1   0   0  -2  -5  -3   2
C  -2  -4  12  -5  -5  -4  -3  -3  -2  -5  -6  -5  -4  -3  -5  -4   0  -2  -2  -8   0  -5
D   0   3  -5   4   3  -6   1   1  -2   0  -4  -3   2  -1   2  -1   0   0  -2  -7  -4   3
E   0   2  -5   3   4  -5   0   1  -2   0  -3  -2   1  -1   2  -1   0   0  -2  -7  -4   3
F  -4  -5  -4  -6  -5   9  -5  -2   1  -5   2   0  -4  -5  -5  -4  -3  -3  -1   0   7  -5
G   1   0  -3   1   0  -5   5  -2  -3  -2  -4  -3   0  -1  -1  -3   1   0  -1  -7  -5  -1
H  -1   1  -3   1   1  -2  -2   6  -2   0  -2  -2   2   0   3   2  -1  -1  -2  -3   0   2
I  -1  -2  -2  -2  -2   1  -3  -2   5  -2   2   2  -2  -2  -2  -2  -1   0   4  -5  -1  -2
K  -1   1  -5   0   0  -5  -2   0  -2   5  -3   0   1  -1   1   3   0   0  -2  -3  -4   0
L  -2  -3  -6  -4  -3   2  -4  -2   2  -3   6   4  -3  -3  -2  -3  -3  -2   2  -2  -1  -3
M  -1  -2  -5  -3  -2   0  -3  -2   2   0   4   6  -2  -2  -1   0  -2  -1   2  -4  -2  -2
N   0   2  -4   2   1  -4   0   2  -2   1  -3  -2   2  -1   1   0   1   0  -2  -4  -2   1
P   1  -1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6   0   0   1   0  -1  -6  -5   0
Q   0   1  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4   1  -1  -1  -2  -5  -4   3
R  -2  -1  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6   0  -1  -2   2  -4   0
S   1   0   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2   1  -1  -2  -3   0
T   1   0  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3   0  -5  -3  -1
V   0  -2  -2  -2  -2  -1  -1  -2   4  -2   2   2  -2  -1  -2  -2  -1   0   4  -6  -2  -2
W  -6  -5  -8  -7  -7   0  -7  -3  -5  -3  -2  -4  -4  -6  -5   2  -2  -5  -6  17   0  -6
Y  -3  -3   0  -4  -4   7  -5   0  -1  -4  -1  -2  -2  -5  -4  -4  -3  -3  -2   0  10  -4
Z   0   2  -5   3   3  -5  -1   2  -2   0  -3  -2   1   0   3   0   0  -1  -2  -6  -4   3
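
In code, a substitution matrix is often just a lookup table keyed by a pair of letters. The sketch below is illustrative only: `identity_matrix` and `lookup` are hypothetical helper names, and the protein dictionary holds a four-entry excerpt of the table on this slide rather than the full matrix.

```python
def identity_matrix(alphabet, match=1, mismatch=0):
    """Build an identity substitution matrix as a dict keyed by letter pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def lookup(matrix, a, b):
    """Look up a pair in either order, since the matrices are symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


dna = identity_matrix("ATCG")
print(lookup(dna, "A", "A"), lookup(dna, "A", "G"))  # 1 0

# Excerpt of the protein matrix shown on this slide (four entries only).
protein_excerpt = {("A", "A"): 2, ("C", "C"): 12, ("W", "W"): 17, ("F", "Y"): 7}
print(lookup(protein_excerpt, "Y", "F"))  # 7
```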

Page 23: Dynamic Programming and Biological Sequence Comparison


Sequence Comparison revisited

Similarity Matrix (rows: Sequence X = A T A A G T, columns: Sequence Y = A T G C A G T)

        -   A   T   G   C   A   G   T
  -     0  -1  -2  -3  -4  -5  -6  -7
  A    -1   1   0  -1  -2  -3  -4  -5
  T    -2   0   2   1   0  -1  -2  -3
  A    -3  -1   1   2   1   1   0  -1
  A    -4  -2   0   1   2   2   1   0
  G    -5  -3  -1   1   1   2   3   2
  T    -6  -4  -2   0   1   1   2   4

m = length(X)
n = length(Y)
int S[0..m, 0..n]
for i = 0 to m do S[i,0] = i · g
for j = 0 to n do S[0,j] = j · g
for i = 1 to m do
    for j = 1 to n do
        S[i,j] = max( S[i-1,j] + g,
                      S[i-1,j-1] + sb[i,j],
                      S[i,j-1] + g )
return S[m,n]

Filling the first row, with g = -1:

S[1,1] = max( -1 + (-1),  0 + (+1), -1 + (-1) ) =  1
S[1,2] = max( -2 + (-1), -1 + ( 0),  1 + (-1) ) =  0
S[1,3] = max( -3 + (-1), -2 + ( 0),  0 + (-1) ) = -1
S[1,4] = max( -4 + (-1), -3 + ( 0), -1 + (-1) ) = -2
S[1,5] = max( -5 + (-1), -4 + (+1), -2 + (-1) ) = -3
S[1,6] = max( -6 + (-1), -5 + ( 0), -3 + (-1) ) = -4
S[1,7] = max( -7 + (-1), -6 + ( 0), -4 + (-1) ) = -5

Page 24: Dynamic Programming and Biological Sequence Comparison


What To Do Next?

Answer: Finding alignments

But, How?

Page 25: Dynamic Programming and Biological Sequence Comparison


Finding the Alignment(s)

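Similarity Matrix (rows: Sequence X = A T A A G T, columns: Sequence Y = A T G C A G T)

        -   A   T   G   C   A   G   T
  -     0  -1  -2  -3  -4  -5  -6  -7
  A    -1   1   0  -1  -2  -3  -4  -5
  T    -2   0   2   1   0  -1  -2  -3
  A    -3  -1   1   2   1   1   0  -1
  A    -4  -2   0   1   2   2   1   0
  G    -5  -3  -1   1   1   2   3   2
  T    -6  -4  -2   0   1   1   2   4

Trace back from S[6,7] to S[0,0], at each cell following the term(s) that attain the maximum. In each partial alignment the top row is Y and the bottom row is X.

S[6,7] = 4 = max( 2 + (-1), 3 + (+1), 2 + (-1) ): diagonal
    T
    T

S[5,6] = 3 = max( 1 + (-1), 2 + (+1), 2 + (-1) ): diagonal
    G T
    G T

S[4,5] = 2 = max( 1 + (-1), 1 + (+1), 2 + (-1) ): diagonal
    A G T
    A G T

S[3,4] = 1 = max( 0 + (-1), 1 + ( 0), 2 + (-1) ): tie between diagonal and left
    C A G T        C A G T
    A A G T        - A G T

After the diagonal step: S[2,3] = 1 = max( -1 + (-1), 0 + ( 0), 2 + (-1) ): left
    G C A G T
    - A A G T

After the left step: S[3,3] = 2 = max( 1 + (-1), 2 + ( 0), 1 + (-1) ): diagonal
    G C A G T
    A - A G T

Both branches reach S[2,2] = 2 = max( 0 + (-1), 1 + (+1), 0 + (-1) ): diagonal
    T G C A G T    T G C A G T
    T - A A G T    T A - A G T

S[1,1] = 1 = max( -1 + (-1), 0 + (+1), -1 + (-1) ): diagonal

Global Alignments

    A T G C A G T
    A T A - A G T

    A T G C A G T
    A T - A A G T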
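
The traceback can be sketched in code as follows. This is a minimal version that reports only one optimal alignment: when several terms attain the maximum it prefers the diagonal, then the step from above, then the step from the left. That tie-breaking order is an assumption of this sketch, not something the slides prescribe, and the function name is illustrative.

```python
def global_align(x, y, match=1, mismatch=0, g=-1):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    def sb(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * g
    for j in range(n + 1):
        S[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j] + g,
                          S[i - 1][j - 1] + sb(x[i - 1], y[j - 1]),
                          S[i][j - 1] + g)

    # Walk back from S[m][n] to S[0][0], collecting columns in reverse order.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sb(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + g:
            row_x.append(x[i - 1]); row_y.append("-")   # space inserted in y
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])   # space inserted in x
            j -= 1
    return S[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


print(global_align("ATAAGT", "ATGCAGT"))  # (4, 'AT-AAGT', 'ATGCAGT')
```

Reporting every optimal alignment would require following all tied branches instead of the first one that matches.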

Page 26: Dynamic Programming and Biological Sequence Comparison


How to Break a Tie?

Should one report all?

Or, report only one?

Page 27: Dynamic Programming and Biological Sequence Comparison


Advantage of DP Alignment Algorithms

Build up the solution by determining all similarities between arbitrary prefixes of the two sequences

Start with the shorter prefixes and use previously computed results to solve for larger prefixes

Page 28: Dynamic Programming and Biological Sequence Comparison


The Complexity of the DP Alignment Algorithm?

Construction of the similarity matrix: O(m • n)

Finding an optimal alignment by tracing back through the completed matrix: O(m + n)

Page 29: Dynamic Programming and Biological Sequence Comparison


Global versus Local Alignments

A global alignment attempts to match all of one sequence against all of another

LGPSTKQFGKGSSSRIWDN
|      ||||   |  |
LNQIERSFGKGAIMRLGDA

A local alignment attempts to match subsequences (contiguous substrings) of the two sequences

-------FGKG--------
       ||||
-------FGKG--------

Page 30: Dynamic Programming and Biological Sequence Comparison


How to Compute Local Alignment?

Page 31: Dynamic Programming and Biological Sequence Comparison


Applying DP to Local Alignment

Similarity Matrix Computation:

$$a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + sb(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}$$

a[i,0] = 0 ; for i = 0…m

a[0,j] = 0 ; for j = 0…n

[Figure: similarity matrix whose first row and first column are filled with zeros.]

If the best alignment up to some point has a negative score, it’s better to start a new one, rather than extend the old one.

Don’t penalize gaps on left and right ends!
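
The recurrence and its zero boundary can be sketched as below. The scoring values used as defaults (match = 1, mismatch = -1, g = -2) and the two example sequences are assumptions chosen for illustration; the function name is also illustrative.

```python
def local_similarity_matrix(x, y, match=1, mismatch=-1, g=-2):
    """Fill the local-alignment matrix a[i][j]; no entry drops below 0."""
    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # a[i][0] = a[0][j] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + sb,
                          a[i - 1][j] + g,
                          0)                    # start a new alignment here
    return a


a = local_similarity_matrix("GATAGG", "TGATGGAGGT")
print(max(max(row) for row in a))  # 3
```

The best local alignment score is the largest entry anywhere in the matrix, not the bottom-right entry as in the global case.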

Page 32: Dynamic Programming and Biological Sequence Comparison


Criteria for Finding a Local Alignment

Find the entries with the maximum value in the similarity matrix

For each such entry, construct a local alignment

See next example

We may also be interested in near-optimal alignments

Page 33: Dynamic Programming and Biological Sequence Comparison


Applying DP to Local Alignment

Similarity Matrix Computation:

$$a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + sb(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}$$

Similarity Matrix (rows: A T A A G T, columns: A T G C A G T)

        -   A   T   G   C   A   G   T
  -     0   0   0   0   0   0   0   0
  A     0   1   0   0   0   1   0   0
  T     0   0   2   1   0   0   1   1
  A     0   1   1   2   1   1   0   1
  A     0   1   1   1   2   2   1   0
  G     0   0   1   2   1   2   3   2
  T     0   0   1   1   2   1   2   4

    A T G C A G T
    A T - A A G T

    A T G C A G T
    A T A - A G T

    A T G C
    A A G T

Page 34: Dynamic Programming and Biological Sequence Comparison


Local Alignment using DP

sb(i,j):

      A   T   C   G
  A   1  -1  -1  -1
  T  -1   1  -1  -1
  C  -1  -1   1  -1
  G  -1  -1  -1   1

g = -2

$$a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + sb(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}$$

Similarity Matrix (rows: G A T A G G, columns: T G A T G G A G G T)

        -   T   G   A   T   G   G   A   G   G   T
  -     0   0   0   0   0   0   0   0   0   0   0
  G     0   0   1   0   0   1   1   0   1   1   0
  A     0   0   0   2   0   0   0   2   0   0   0
  T     0   1   0   0   3   1   0   0   1   0   1
  A     0   0   0   1   1   2   0   1   0   0   0
  G     0   0   1   0   0   2   3   1   2   1   0
  G     0   0   1   0   0   1   3   2   2   3   1

a[1,1] = max( 0 + (-2), 0 + (-1), 0 + (-2), 0 ) = 0
a[1,2] = max( 0 + (-2), 0 + (+1), 0 + (-2), 0 ) = 1

Local alignments, one for each entry with the maximum value 3:

    T G A T G G A G G T
      G A T

    T G A T G G A G G T
      G A T A G

    T G A T - G G A G G T
      G A T A G G

    T G A T G G A G G T
                A G G
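
The procedure of this example (fill the matrix, find every entry with the maximum value, trace back from each until a 0 is reached) can be sketched as below. The function name and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions of this sketch; only one alignment is reported per maximum entry.

```python
def local_alignments(x, y, match=1, mismatch=-1, g=-2):
    """Return (best score, [(aligned_x, aligned_y), ...]) with one local
    alignment for every matrix entry that attains the best score."""
    def sb(a_, b_):
        return match if a_ == b_ else mismatch

    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + sb(x[i - 1], y[j - 1]),
                          a[i - 1][j] + g,
                          0)

    def trace(i, j):
        row_x, row_y = [], []
        while a[i][j] > 0:                       # a 0 marks the start of the alignment
            if a[i][j] == a[i - 1][j - 1] + sb(x[i - 1], y[j - 1]):
                row_x.append(x[i - 1]); row_y.append(y[j - 1])
                i, j = i - 1, j - 1
            elif a[i][j] == a[i - 1][j] + g:
                row_x.append(x[i - 1]); row_y.append("-")
                i -= 1
            else:
                row_x.append("-"); row_y.append(y[j - 1])
                j -= 1
        return "".join(reversed(row_x)), "".join(reversed(row_y))

    best = max(max(row) for row in a)
    if best == 0:
        return 0, []                             # no positive-scoring local alignment
    ends = [(i, j) for i in range(1, m + 1) for j in range(1, n + 1) if a[i][j] == best]
    return best, [trace(i, j) for i, j in ends]


best, found = local_alignments("GATAGG", "TGATGGAGGT")
print(best)   # 3
print(found)  # [('GAT', 'GAT'), ('GATAG', 'GATGG'), ('GATAGG', 'GAT-GG'), ('AGG', 'AGG')]
```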

Page 35: Dynamic Programming and Biological Sequence Comparison


How to Break a Tie?

Should one report all?

Or, report only one?

Page 36: Dynamic Programming and Biological Sequence Comparison


Extension to the Basic DP Method

Improving space complexity

Introduce general gap functions

That is, a run of consecutive spaces is more likely than the same number of isolated spaces

Affine gap functions: w(k) = h + gk, where k is the length of the gap, h is the cost of opening a gap and g is the cost of each space in it
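
The effect of an affine gap function can be seen by scoring the gaps of a single gapped row. The sketch below is illustrative: it treats h and g as negative score contributions (h = -2 and g = -1 are assumed values, not taken from the slides) and charges w(k) = h + gk once for every maximal run of k spaces.

```python
import re


def affine_gap_score(row, h=-2, g=-1):
    """Total gap contribution of one gapped row under w(k) = h + g*k."""
    return sum(h + g * len(run.group()) for run in re.finditer(r"-+", row))


print(affine_gap_score("AT---AGT"))   # -5: one gap of length 3
print(affine_gap_score("A-T-A-GT"))   # -9: three gaps of length 1
print(affine_gap_score("AT---AGT", h=0), affine_gap_score("A-T-A-GT", h=0))  # -3 -3
```

With h = 0 the function reduces to the linear gap score used earlier, which cannot tell one long gap from several short ones.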

