BIC I, Week 4 lectures

Post on 17-Jan-2016

1

BIC I, Week 4 lectures

Rhys Price Jones and Anne Haake

Rochester Institute of Technology

rpjavp@rit.edu, arh@it.rit.edu

2

Overview of the need for Dynamic Programming

• Consider Fibonacci
• The obvious algorithm is elegant, easily derived from the definition, and clearly correct.

(define fib
  (lambda (n)
    (if (<= n 1)
        1
        (+ (fib (- n 2)) (fib (- n 1))))))

• But it’s hopelessly inefficient
• Why?
• Because it makes repeated recursive calls with the same argument
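To get a feel for how much work is repeated, here is a minimal sketch that counts calls. Only the body of fib comes from the slide; the names call-count, counting-fib and calls-for are made up for illustration.

```scheme
;; Hypothetical helper names; only the recursion itself is from the slide.
(define call-count 0)

(define (counting-fib n)
  (set! call-count (+ call-count 1))      ; record every call, repeated or not
  (if (<= n 1)
      1
      (+ (counting-fib (- n 2))
         (counting-fib (- n 1)))))

;; Number of calls made while computing (fib n).
(define (calls-for n)
  (set! call-count 0)
  (counting-fib n)
  call-count)
```

With this definition the call count works out to 2·(fib n) − 1, so computing (fib 20) = 10946 takes 21891 calls, although only 21 distinct arguments ever occur.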

3

The Traditional Solution

• Change the order in which the computations are performed

• Change the logic of the program
  – So that it works “bottom up” instead of “top down”
  – Fill an array with calculated values starting with (fib 0), then (fib 1) then (fib 2), etc.
• You can do it manually, as in fib.ss
• That is dynamic programming!
• The main problem is that it requires thought and programming and hence may introduce error.
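The file fib.ss is not reproduced in this transcript, so the following is only a sketch of what a bottom-up version might look like. The name fib-bottom-up is hypothetical, and the code assumes n is a non-negative integer.

```scheme
;; A sketch, not the lecture's fib.ss: fill a table from the bottom up.
(define (fib-bottom-up n)
  ;; Every slot starts at 1, so (fib 0) and (fib 1) are already in place.
  (let ((table (make-vector (+ n 2) 1)))
    (do ((i 2 (+ i 1)))
        ((> i n) (vector-ref table n))
      ;; Each entry needs only the two entries computed just before it.
      (vector-set! table i
                   (+ (vector-ref table (- i 2))
                      (vector-ref table (- i 1)))))))
```

Each value is computed exactly once, so the running time is linear in n. The price is that the order of evaluation now has to be worked out by hand.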

4

It’s not just Fibonacci

• Many programs “write themselves” from the specification of the problem.

• When that happens, we are extremely pleased

• Sadly, the resulting program is often inefficient

• But dynamic programming is a technique to make it efficient again.

5

Memo-izing

• Redefine the function calling mechanism so that:
  – We first check to see if we’ve made that calculation before
  – If no, go ahead and compute it but store the result in a hash table
  – If yes, look up the previously computed value in the hash table
• Do it once
• Inefficient code becomes efficient automatically with no re-programming

memolambda.ss

memofib.ss
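The contents of memolambda.ss and memofib.ss are not shown here, so this is a guess at the general shape rather than the lecture's code. To stay portable across Scheme dialects the sketch keeps its cache in an association list instead of a hash table; a real implementation would likely use the dialect's hash table for constant-time lookup. The helper memoize is an assumption.

```scheme
;; Wrap f so that each distinct argument list is computed only once.
;; Assumption: an association list stands in for the hash table.
(define (memoize f)
  (let ((cache '()))
    (lambda args
      (let ((hit (assoc args cache)))        ; have we seen these arguments?
        (if hit
            (cdr hit)                        ; yes: reuse the stored value
            (let ((value (apply f args)))    ; no: compute, then remember
              (set! cache (cons (cons args value) cache))
              value))))))

;; A hypothetical memolambda: written like lambda, but memoized.
(define-syntax memolambda
  (syntax-rules ()
    ((_ formals body ...)
     (memoize (lambda formals body ...)))))

;; The slide's fib with lambda replaced by memolambda.
(define memofib
  (memolambda (n)
    (if (<= n 1)
        1
        (+ (memofib (- n 2)) (memofib (- n 1))))))
```

The recursive calls go through the global name memofib, which is bound to the memoized procedure, so every subproblem is looked up before it is recomputed.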

6

Another Example

• Pascal’s triangle
• Each entry is the sum of its parents
  – $C_{n,k} = C_{n-1,k-1} + C_{n-1,k}$
  – $C_{n,0} = C_{n,n} = 1$
• Leading to a program: badcomb.ss
• Runs really slowly
• Replace lambda by memolambda: goodcomb.ss
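Neither badcomb.ss nor goodcomb.ss appears in this transcript. The sketch below shows the same contrast under the assumption that the edges of the triangle (k = 0 and k = n) are 1 and that 0 <= k <= n. The names slow-choose and memo-choose are hypothetical, and an association list again stands in for a hash table.

```scheme
;; Straight from the recurrence: correct but exponentially slow.
(define (slow-choose n k)
  (if (or (= k 0) (= k n))
      1
      (+ (slow-choose (- n 1) (- k 1))
         (slow-choose (- n 1) k))))

;; Same recurrence, but each (n . k) pair is computed only once.
(define memo-choose
  (let ((cache '()))
    (define (choose n k)
      (let ((hit (assoc (cons n k) cache)))
        (cond (hit (cdr hit))
              ((or (= k 0) (= k n)) 1)
              (else
               (let ((value (+ (choose (- n 1) (- k 1))
                               (choose (- n 1) k))))
                 (set! cache (cons (cons (cons n k) value) cache))
                 value)))))
    choose))
```

The memoized version only ever evaluates the entries of the triangle that lie above the requested one, which is a few hundred for (memo-choose 30 15), whereas the plain version makes hundreds of millions of calls for the same answer.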

7

Review of Pattern Matching

• Does CGGA appear within the sequence ATCGCGTAACGGAGATAGGCTTA ?

• More generally, where does pattern p (length n) appear within text t (length m)?

• Boyer-Moore or Knuth-Morris-Pratt gives O(m+n) search

• If p is going to change a lot and t stay the same, suffix tree can be built in O(m), each search is then O(n)

• If p is stable and there are lots of different t, virtual machine can be built in O(n) and then each search is O(m)
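For comparison with the fast methods listed above, here is a minimal sketch of the naive search, which tries every starting position in turn. The name naive-search is hypothetical; positions are counted from 0.

```scheme
;; Return the first index at which pattern occurs in text, or #f if none.
(define (naive-search pattern text)
  (let ((n (string-length pattern))
        (m (string-length text)))
    (let loop ((start 0))
      (cond ((> (+ start n) m) #f)                       ; ran off the end
            ((string=? pattern
                       (substring text start (+ start n)))
             start)                                      ; found it
            (else (loop (+ start 1)))))))                ; slide along by one
```

On the slide's example, (naive-search "CGGA" "ATCGCGTAACGGAGATAGGCTTA") finds the pattern at index 9, that is, starting at the tenth character. The worst case is O(mn), but on short patterns it is often good enough.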

8

Build a Virtual Machine

• CGGA

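[Figure: five-state machine for CGGA. Forward transitions are labelled C, G, G, A. The start state loops on A, G, T. From each later state, C leads back to the state reached after one C, and the other non-matching letters (A, T or G, T) lead back to the start. The accepting state loops on A, C, G, T.]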

ATCGCGTAACGGAGATAGGCTTA
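A sketch of how such a machine could be built and run. This is an illustration under stated assumptions, not the lecture's construction: the transition table is filled by brute force (trying every prefix length), which is simple but slower than an O(n) construction, and the accepting state is made absorbing. All names are hypothetical.

```scheme
(define alphabet "ACGT")

;; Position of nucleotide c in `alphabet` (assumes c is one of A, C, G, T).
(define (char-index c)
  (let loop ((i 0))
    (if (char=? c (string-ref alphabet i))
        i
        (loop (+ i 1)))))

;; Is `short` a suffix of `long`?
(define (suffix? short long)
  (let ((a (string-length short))
        (b (string-length long)))
    (and (<= a b)
         (string=? short (substring long (- b a) b)))))

;; State reached from state q on character c: the length of the longest
;; pattern prefix that is a suffix of what has just been read.
;; Once the whole pattern has been seen, the machine stays put.
(define (next-state pattern q c)
  (let ((n (string-length pattern)))
    (if (= q n)
        n
        (let ((seen (string-append (substring pattern 0 q) (string c))))
          (let loop ((k (+ q 1)))
            (if (suffix? (substring pattern 0 k) seen)
                k
                (loop (- k 1))))))))

;; Transition table: one row per state, one column per nucleotide.
(define (build-machine pattern)
  (let* ((n (string-length pattern))
         (width (string-length alphabet))
         (table (make-vector (+ n 1) #f)))
    (do ((q 0 (+ q 1)))
        ((> q n) table)
      (let ((row (make-vector width 0)))
        (do ((a 0 (+ a 1)))
            ((= a width))
          (vector-set! row a (next-state pattern q (string-ref alphabet a))))
        (vector-set! table q row)))))

;; States visited while reading text, one per character read.
(define (run-machine table text)
  (let loop ((i 0) (state 0) (trace '()))
    (if (= i (string-length text))
        (reverse trace)
        (let ((next (vector-ref (vector-ref table state)
                                (char-index (string-ref text i)))))
          (loop (+ i 1) next (cons next trace))))))
```

Running (run-machine (build-machine "CGGA") "ATCGCGTAACGGAGATAGGCTTA") visits states 0 0 1 2 1 2 0 0 0 1 2 3 4 and then stays in state 4: the machine accepts on the thirteenth character, after one table lookup per character of text.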

9

First Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

10

Second Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

11

Third Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

12

Fourth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

13

Fifth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

14

Sixth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

15

Seventh Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

16

Eighth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

17

Ninth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

18

Tenth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

19

Eleventh Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

20

Twelfth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

21

Thirteenth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

22

Fourteenth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

23

Fifteenth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

24

Sixteenth Step

• CGGA

[Figure: the CGGA state machine from slide 8, shown at this step of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

25

17th – 23rd Steps

• CGGA

[Figure: the CGGA state machine from slide 8, shown over the remaining steps of its run over the text below]

ATCGCGTAACGGAGATAGGCTTA

26

Pattern Matching – Conclusion

• Exact pattern matching is easy
• Often the naive algorithm is good enough
• Fast algorithms are readily available
• Sadly, not much use for biological tasks

27

Why not?

• What’s the difference?
• Mutation
• Insertion/deletion gaps
• We need an inexact way to compare two (or more) biological sequences

28

Pattern Matching vs. Sequence Alignment

• In the CS world, we talk of comparing strings, or matching patterns of characters within strings

• For biological applications, we talk of comparing sequences, or aligning sequences of nucleotides (or amino acids) to each other

29

Evolutionary Relatedness

• Consider ACCGT and CACGT
• How likely is it that they are “related”?
• Possible alignments:

  ACCGT    AC-CGT
  XX|||    -|-|||
  CACGT    -CACGT

• Which is better?

30

It Depends

  ACCGT    AC-CGT
  XX|||    -|-|||
  CACGT    -CACGT

• Scoring 2 for a match, -2 for a mismatch, and -1 for a gap: the first alignment scores 2, the second scores 6

• Scoring 2 for a match, 0 for a mismatch, and -2 for a gap: the first scores 6, the second scores 4

• And we haven’t even begun to consider experimental evidence that might cause us to rank some mutations better than others!
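The arithmetic on this slide can be checked mechanically. The sketch below scores an alignment that has already been written out as two equal-length strings, with - marking a gap. The name score-alignment and its parameter names are hypothetical.

```scheme
;; Score two aligned strings column by column.
;; Assumes both strings have the same length and use #\- for a gap.
(define (score-alignment top bottom match-score mismatch-score gap-score)
  (let loop ((i 0) (total 0))
    (if (= i (string-length top))
        total
        (let ((a (string-ref top i))
              (b (string-ref bottom i)))
          (loop (+ i 1)
                (+ total
                   (cond ((or (char=? a #\-) (char=? b #\-)) gap-score)
                         ((char=? a b) match-score)
                         (else mismatch-score))))))))
```

With scores 2, -2, -1, (score-alignment "ACCGT" "CACGT" 2 -2 -1) gives 2 and (score-alignment "AC-CGT" "-CACGT" 2 -2 -1) gives 6. Switching to 2, 0, -2 gives 6 and 4, so which alignment is better really does depend on the scheme.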

31

Distance measure

• Score 0 for a match
• 1 for a mismatch or gap
• Low score best!

  ACCGT    AC-CGT
  XX|||    -|-|||
  CACGT    -CACGT

• Now it’s 2 versus 2

32

Global alignment

• For two sequences, ACCACC and ACACA

      -  A  C  C  A  C  C
   -
   A
   C
   A
   C
   A

• Use the scoring scheme to fill in the table, starting with first row and first column

33

First entries

• Using the distance measure

      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1
   C  2
   A  3
   C  4
   A  5

• Each nucleotide<->gap costs 1 point

34

Extending inwards

• Extending the distance measure

      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1
   A  3  2
   C  4  3
   A  5  4

• Extending from North or West costs 1 point, from NW costs 0 (match) or 1 (mismatch)

• Pick cheapest of the three

35

More extension

      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1
   C  4  3  2  1
   A  5  4  3  2

• $m_{i,j} = \min(m_{i,j-1} + g,\; m_{i-1,j} + g,\; m_{i-1,j-1} + c_{ij})$

• where $c_{ij}$ = 0 for a match, 1 for a mismatch, and g = 1 is the cost of a gap

36

Getting there...

      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1  1  2  3
   C  4  3  2  1  2
   A  5  4  3  2  1

• $m_{i,j} = \min(m_{i,j-1} + 1,\; m_{i-1,j} + 1,\; m_{i-1,j-1} + c_{ij})$

• where $c_{ij}$ = 0 for a match, 1 for a mismatch

37

Almost done...

      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1  1  2  3
   C  4  3  2  1  2  1  2
   A  5  4  3  2  1  2

• $m_{i,j} = \min(m_{i,j-1} + 1,\; m_{i-1,j} + 1,\; m_{i-1,j-1} + c_{ij})$

• where $c_{ij}$ = 0 for a match, 1 for a mismatch

38

Finally, we can get a Global alignment

• One of the least-cost routes

      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1  1  2  3
   C  4  3  2  1  2  1  2
   A  5  4  3  2  1  2  2

• Can you see how this path leads to the alignment

  ACCACC
  AC-ACA
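The table-filling rule from the preceding slides can be sketched as follows. This is not the lecture's globalig.ss, which is not reproduced here: the names cell-cost, distance-table and alignment-distance are hypothetical, and the costs (1 per gap, 0 or 1 from the NW) are the distance measure used on the slides. The backtracking that recovers the alignment itself is left out.

```scheme
;; Cost of cell (i, j), given the rows above it and the part of its own
;; row that has already been filled in.
(define (cell-cost table row across down i j)
  (cond ((= i 0) j)                                  ; first row: j gaps
        ((= j 0) i)                                  ; first column: i gaps
        (else
         (let ((north (vector-ref table (- i 1))))
           (min (+ (vector-ref row (- j 1)) 1)       ; from the West: gap
                (+ (vector-ref north j) 1)           ; from the North: gap
                (+ (vector-ref north (- j 1))        ; from the NW
                   (if (char=? (string-ref across (- j 1))
                               (string-ref down (- i 1)))
                       0                             ; match
                       1)))))))                      ; mismatch

;; Fill the whole table, row by row. `across` labels the columns and
;; `down` labels the rows, as on the slides.
(define (distance-table across down)
  (let* ((cols (+ (string-length across) 1))
         (rows (+ (string-length down) 1))
         (table (make-vector rows #f)))
    (do ((i 0 (+ i 1)))
        ((= i rows) table)
      (let ((row (make-vector cols 0)))
        (vector-set! table i row)
        (do ((j 0 (+ j 1)))
            ((= j cols))
          (vector-set! row j (cell-cost table row across down i j)))))))

;; The global alignment distance is the bottom-right entry.
(define (alignment-distance across down)
  (let ((table (distance-table across down)))
    (vector-ref (vector-ref table (string-length down))
                (string-length across))))
```

For the slides' example, (alignment-distance "ACCACC" "ACACA") is 2, and the last row of (distance-table "ACCACC" "ACACA") is 5 4 3 2 1 2 2, matching the completed table.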

39

Global alignment program

• Distance measure: globalig.txt
• Runnable program: globalig.ss
• Dynamic Programming version: globaligm.ss

40

Global vs Local Alignment

• Global alignment seeks the best alignment between the two complete sequences. A global alignment between GATCCACCA and GTAACACA might be

  G-ATCCACCA
  |-|X|-||-|
  GTAAC-AC-A

• A local alignment is the best alignment between subsequences. A local alignment between GATCCACCA and GTAACACA might be

   gATCCACca
    |X|-||
  gtAAC-ACa

• Best local alignment depends on scoring scheme

41

Local Alignment

• For this demo, we will use a different measure
  – 2 for a match
  – -1 for a mismatch, -2 for a gap
  – Find best match within
    G C T C T G C G A A T A G
    C G T T G A G A T A C T C

42

The solution

      -  G  C  T  C  T  G  C  G  A  A  T  A  G
   -  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   C  0  0  2  0  2  0  0  2  0  0  0  0  0  0
   G  0  2  0  1  0  1  2  0  4  2  0  0  0  2
   T  0  0  1  2  0  2  0  1  2  3  1  2  0  0
   T  0  0  0  3  1  2  1  0  0  1  2  3  1  0
   G  0  2  0  1  2  0  4  2  2  0  0  1  2  3
   A  0  0  1  0  0  1  2  3  1  4  2  0  3  1
   G  0  2  0  0  0  0  3  1  5  3  3  1  1  5
   A  0  0  1  0  0  0  1  2  3  7  5  3  3  3
   T  0  0  0  3  1  2  0  0  1  5  6  7  5  3
   A  0  0  0  1  2  0  1  0  0  3  7  5  9  7
   C  0  0  2  0  3  1  0  3  1  1  5  6  7  8
   T  0  0  0  4  2  5  3  1  2  0  3  7  5  6
   C  0  0  2  2  6  4  4  5  3  1  1  5  6  4

• The resulting local alignment:

  G C T C T G C G A A T A G
          | | x | | X | |
    C G T T G A G A - T A C T C
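A sketch of how the table above can be computed. It is not the lecture's localig.ss; the name best-local-score is hypothetical, and the scores (2 for a match, -1 for a mismatch, -2 for a gap) are the ones given for this demo. The only change from the global rule is that a cell is never allowed to drop below 0, and the answer is the largest cell anywhere rather than the bottom-right one. Only two rows are kept at a time, so this version reports the score but cannot backtrack.

```scheme
;; Best local alignment score between two strings.
(define (best-local-score across down)
  (let ((cols (+ (string-length across) 1))
        (rows (+ (string-length down) 1)))
    ;; `north` is the previous row; `best` is the largest cell seen so far.
    (let loop ((i 1) (north (make-vector cols 0)) (best 0))
      (if (= i rows)
          best
          (let ((row (make-vector cols 0))
                (row-best best))
            (do ((j 1 (+ j 1)))
                ((= j cols))
              (let ((cell (max 0                                ; start afresh
                               (- (vector-ref row (- j 1)) 2)   ; gap, from West
                               (- (vector-ref north j) 2)       ; gap, from North
                               (+ (vector-ref north (- j 1))    ; from NW
                                  (if (char=? (string-ref across (- j 1))
                                              (string-ref down (- i 1)))
                                      2
                                      -1)))))
                (vector-set! row j cell)
                (set! row-best (max row-best cell))))
            (loop (+ i 1) row row-best))))))
```

For the two sequences of the demo, (best-local-score "GCTCTGCGAATAG" "CGTTGAGATACTC") is 9, the largest entry in the table, which is where the alignment shown above ends.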

43

The Program

• Has dynamic programming to make it fast!
• This is basically Smith-Waterman
• Work has been done on different scoring schemes, gap penalties, etc.
• Runs in time O(mn)

localig.ss

44

Exercises

• that we will attempt in class:
  – amend global alignment program to do the “backtracking” needed for the alignment
• that will be homework
  – amend local alignment program to do the “backtracking” needed for the alignment