Dynamic Programming for Pairwise Alignment 2

Post on 12-Jan-2016


Dynamic Programming for Pairwise Alignment 2

Dr Alexei Drummond

Department of Computer Science

alexei@cs.auckland.ac.nz

Semester 2, 2006


Review

Dynamic programming algorithm for global alignment (Needleman & Wunsch)

Given sequences:

$X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$

$F(i,j)$ = score of best alignment between $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$


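Principle of Optimality

Optimal alignment of $x_1, x_2, x_3, \ldots, x_i$ with $y_1, y_2, y_3, \ldots, y_j$, with score $F(i,j)$, looks like

$x_1, x_2, x_3, \ldots, x_{i-1}$ aligned with $y_1, y_2, y_3, \ldots, y_{j-1}$, followed by $x_i$ aligned with $y_j$: score $F(i-1, j-1) + s(x_i, y_j)$

or

$x_1, x_2, x_3, \ldots, x_i$ aligned with $y_1, y_2, y_3, \ldots, y_{j-1}$, followed by $y_j$ aligned with a gap: score $F(i, j-1) - d$

or

$x_1, x_2, x_3, \ldots, x_{i-1}$ aligned with $y_1, y_2, y_3, \ldots, y_j$, followed by $x_i$ aligned with a gap: score $F(i-1, j) - d$

so

\[
F(i,j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d \\
F(i-1, j) - d
\end{cases}
\]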


Basis

$x_1, x_2, x_3, \ldots, x_i$ aligned entirely with gaps:

$F(i,0) = F(i-1,0) + s(x_i, -)$

$y_1, y_2, y_3, \ldots, y_j$ aligned entirely with gaps:

$F(0,j) = F(0,j-1) + s(-, y_j)$

$F(0,0) = 0$
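The recurrence and basis can be sketched in Python as follows. This is a minimal sketch rather than the lecture's own code: the function name `nw_fill`, the simple match/mismatch score standing in for s(a, b), and the choice s(x_i, −) = s(−, y_j) = −d (a linear gap penalty) are assumptions made for illustration.

```python
def nw_fill(x, y, match=1, mismatch=-1, d=2):
    """Fill the (m+1) x (n+1) global-alignment matrix F.

    Assumptions (not from the slides): s(a, b) is a plain
    match/mismatch score and every gap symbol costs d.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # basis: x_1..x_i aligned with gaps
        F[i][0] = F[i - 1][0] - d
    for j in range(1, n + 1):      # basis: y_1..y_j aligned with gaps
        F[0][j] = F[0][j - 1] - d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # x_i aligned with y_j
                          F[i][j - 1] - d,      # y_j aligned with a gap
                          F[i - 1][j] - d)      # x_i aligned with a gap
    return F
```

The optimal global score is then read from `F[len(x)][len(y)]`.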


Filling up table

[Figure: F matrix with rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y; one cell is labelled "Optimal alignment score".]


Constructing alignment

[Figure: F matrix with rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y; one cell is labelled "Optimal alignment score".]


Example

F matrix (rows: X = P A W H E A E; columns: Y = H E A G A W G H E E)

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Optimal alignment score: $F(m,n) = 1$

Alignment

Y: H E A G A W G H E - E
X: - - P - A W - H E A E
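Constructing the alignment means walking back from cell (m, n) to (0, 0), at each step choosing a predecessor that produced the current score. The sketch below illustrates this under stated assumptions: the name `nw_align` and the match/mismatch scores are hypothetical, and a simple scoring scheme is used instead of the substitution matrix behind the example table, so it does not reproduce those numbers.

```python
def nw_align(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with traceback; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)

    def score(a, b):
        # Assumed stand-in for the substitution score s(a, b)
        return match if a == b else mismatch

    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)

    # Traceback: follow any predecessor that explains F[i][j]
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))
```

When several predecessors tie, this sketch prefers the diagonal move; other tie-breaking rules give different, equally optimal alignments.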


Time and space

F matrix: $(m+1) \times (n+1)$ table entries ⇒ $\Theta(mn)$ space

Each entry computed in constant time ⇒ $\Theta(mn)$ time


Smith & Waterman algorithm

Computes local alignment, i.e. looks for the best alignment of subsequences of X and Y, ignoring scores of regions on either side.

[Figure: sequences X and Y with the best subsequence alignment marked.]


Recurrences

\[
F(i,j) = \max \begin{cases}
0 \\
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d \\
F(i-1, j) - d
\end{cases}
\]

Basis: $F(i,0) = F(0,j) = 0$
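A sketch of the local-alignment recurrence in Python follows. The function name `smith_waterman` and the match/mismatch scores are assumptions for illustration; the traceback starts at the highest-scoring cell and stops at the first cell whose value is 0.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment; returns (best_score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)

    def score(a, b):
        # Assumed stand-in for the substitution score s(a, b)
        return match if a == b else mismatch

    F = [[0] * (n + 1) for _ in range(m + 1)]  # basis: F(i,0) = F(0,j) = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the best cell until a zero is reached
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```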


Example

F         H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26

Alignment

Y: A W G H E
X: A W - H E


Repeated (local) matches

Long sequences: interested in all local alignments with significant score, i.e. score > threshold T,

e.g. copies of a repeated domain or motif in a protein.

X = sequence containing motif

Y = target sequence

Method is asymmetric

[Figure: target sequence Y with the regions matching parts of X marked.]


Principle of Optimality

Given sequences $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$

Define $F(i,j)$ $(i \ge 1)$ = best sum of match scores in $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$, assuming $y_j$ is in a matched region and match ends in $x_i$ or $y_j$


Ends of matches

$F(0,0) = 0$

$F(0,j)$ = best sum of completed match scores to $(y_1, y_2, \ldots, y_j)$, assuming that $y_j$ is not in a matched region

\[
F(0,j) = \max \begin{cases}
F(0, j-1) \\
F(i, j-1) - T, \quad i = 1, \ldots, m
\end{cases}
\]

Row 0 therefore marks unmatched regions and ends of matches in Y.


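General recurrence

\[
F(i,j) = \max \begin{cases}
F(0, j) & \text{start of new match} \\
F(i-1, j-1) + s(x_i, y_j) & \text{extension of previous match} \\
F(i, j-1) - d & \text{extension of previous match} \\
F(i-1, j) - d & \text{extension of previous match}
\end{cases}
\]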
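The row-0 rule and the general recurrence can be combined column by column, as in the sketch below. It is an illustration under assumptions, not the lecture's code: the name `repeated_match_total`, the match/mismatch scores, and the use of one extra column (index n + 1) to hold the final total are choices made here.

```python
def repeated_match_total(x, y, T, match=2, mismatch=-1, d=2):
    """Best total of (match score - T) over non-overlapping matches of x in y."""
    m, n = len(x), len(y)
    # Column n + 1 is an extra cell that collects the final total in row 0
    F = [[0] * (n + 2) for _ in range(m + 1)]
    for j in range(1, n + 2):
        # Row 0: stay unmatched, or close a match that ended in column j - 1
        closed = max((F[i][j - 1] - T for i in range(1, m + 1)),
                     default=F[0][j - 1])
        F[0][j] = max(F[0][j - 1], closed)
        if j > n:
            break  # the extra column has no cells below row 0
        for i in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[0][j],              # start of new match
                          F[i - 1][j - 1] + s,  # extension of previous match
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    return F[0][n + 1], F
```

For example, with `T=3` the motif `"ACG"` occurs twice in `"ACGTTACG"`; each copy scores 6, so the total is (6 − 3) + (6 − 3) = 6.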


Filling up table

[Figure: F matrix with rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y; one cell is labelled "Optimal sum of alignment scores".]


Example

F         H    E    A    G    A    W    G    H    E    E
     0    0    0    0    1    1    1    1    1    3    9
P    0    0    0    0    1    1    1    1    1    3    9
A    0    0    0    5    1    6    1    1    1    3    9
W    0    0    0    0    2    1   21   13    5    3    9
H    0   10    2    0    1    1   13   19   23   15    9
E    0    2   16    8    1    1    5   11   19   29   21
A    0    0    8   21   13    6    1    5   11   21   28
E    0    0    6   13   18   12    4    1    5   17   27

Extra cell for final total score: 9

Alignment

Y: H E A G A W G H E E
X: H E A . A W - H E .


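Overlap matches

[Figure: four diagrams showing sequences X and Y in different overlapping arrangements.]

Don’t penalize overhanging ends, i.e. set $F(i,0) = F(0,j) = 0$

Otherwise

\[
F(i,j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d \\
F(i-1, j) - d
\end{cases}
\]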
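A sketch of the overlap variant follows. The zero boundary comes from the slide; reading the result as the maximum over the last row and last column, so that trailing overhangs are also free, is an assumption added here, as are the name `overlap_score` and the match/mismatch scores.

```python
def overlap_score(x, y, match=1, mismatch=-1, d=2):
    """Best overlap-alignment score with unpenalized overhanging ends."""
    m, n = len(x), len(y)
    # Zero first row and column: leading overhangs cost nothing
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # Assumption: the alignment may end anywhere on the bottom or right border
    bottom = max(F[m])
    right = max(row[n] for row in F)
    return max(bottom, right)
```

For instance, the suffix `GATT` of `"AAAAGATT"` overlaps the prefix `GATT` of `"GATTCCCC"`, giving a score of 4 with these parameters.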


Example

F         H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A    0   -2   -2    4   -1    3   -4   -4   -4   -3   -2
W    0   -3   -5   -4    1   -4   18   10    2    6   -6
H    0   10    2    6   -6   -1   10   16   20   12    4
E    0    2   16    8    0    7    2    8   16   26   18
A    0   -2    8   21   13    5    3    2    8   18   25
E    0    0    4   13   18   12    4    4    2   14   24

Alignment

Y: G A W G H E E
X: P A W - H E A


Affine gap penalties

• Affine score: $\gamma(g) = -d - (g-1)e$, where $d$ is the gap-open penalty and $e$ is the gap-extension penalty

• Different penalties associated with extending alignment with gap symbol:

Y = C C T W P
X = C S T W -

different from

Y = C C T W P
X = C S T - -


General recurrence

\[
F(i,j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) & \text{extend by matching } x_i \text{ and } y_j \\
F(k, j) + \gamma(i-k), \quad k = 0, 1, \ldots, i-1 & \text{extend by matching suffix of X to gap of length } i-k \\
F(i, k) + \gamma(j-k), \quad k = 0, 1, \ldots, j-1 & \text{extend by matching suffix of Y to gap of length } j-k
\end{cases}
\qquad (i, j > 0)
\]

Problem: Procedure runs in worst-case time $\Theta(n^3)$
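The cubic cost comes from the inner scans over k at every cell. The sketch below makes that visible; it assumes an affine `gamma` with illustrative values d = 3 and e = 1, a match/mismatch score, and boundary cells (i = 0 or j = 0) filled by the gap terms alone, none of which are specified on the slide.

```python
def gamma(g, d=3, e=1):
    """Affine gap score for a gap of length g >= 1: -d - (g - 1) * e."""
    return -d - (g - 1) * e

def general_gap_score(x, y, match=1, mismatch=-1, gap=gamma):
    """Global alignment score with an arbitrary gap function (cubic time)."""
    m, n = len(x), len(y)
    F = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(F[i - 1][j - 1] + s)
            # x_{k+1}..x_i aligned with a gap of length i - k
            candidates.extend(F[k][j] + gap(i - k) for k in range(i))
            # y_{k+1}..y_j aligned with a gap of length j - k
            candidates.extend(F[i][k] + gap(j - k) for k in range(j))
            F[i][j] = max(candidates)
    return F[m][n]
```

With these values, aligning `"AAAA"` against `"AA"` scores 2 matches plus one gap of length 2, i.e. 2 + γ(2) = 2 − 4 = −2.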


$\Theta(n^2)$ version

Extra variables

$M(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $x_i$ is aligned with $y_j$

$I_x(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $x_i$ is aligned with a gap

$I_y(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $y_j$ is aligned with a gap


Recurrences

\[
I_x(i,j) = \max \begin{cases}
M(i-1, j) - d & x_i \text{ aligned to start of gap} \\
I_x(i-1, j) - e & x_i \text{ aligned to continuation of gap}
\end{cases}
\qquad (i, j > 0)
\]

\[
I_y(i,j) = \max \begin{cases}
M(i, j-1) - d & y_j \text{ aligned to start of gap} \\
I_y(i, j-1) - e & y_j \text{ aligned to continuation of gap}
\end{cases}
\qquad (i, j > 0)
\]

\[
M(i,j) = \max \begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
I_x(i-1, j-1) + s(x_i, y_j) \\
I_y(i-1, j-1) + s(x_i, y_j)
\end{cases}
\qquad (i, j > 0)
\]

Procedure runs in worst-case time $\Theta(n^2)$
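The three recurrences translate directly into three tables filled in the same double loop. The sketch below is illustrative only: the name `affine_global_score`, the match/mismatch scores, and in particular the boundary values (which the slide does not give) are assumptions. Here row 0 and column 0 allow only a single leading gap.

```python
def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with affine gaps using tables M, Ix, Iy."""
    NEG = float("-inf")
    m, n = len(x), len(y)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions (not stated on the slide)
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -d - (i - 1) * e   # x_1..x_i aligned with one long gap
    for j in range(1, n + 1):
        Iy[0][j] = -d - (j - 1) * e   # y_1..y_j aligned with one long gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1],
                          Ix[i - 1][j - 1],
                          Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] - d,    # x_i starts a gap
                           Ix[i - 1][j] - e)   # x_i continues a gap
            Iy[i][j] = max(M[i][j - 1] - d,    # y_j starts a gap
                           Iy[i][j - 1] - e)   # y_j continues a gap
    return max(M[m][n], Ix[m][n], Iy[m][n])
```

Each cell needs a constant number of lookups, which is where the quadratic running time comes from.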


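Linear space alignment

[Figure: F matrix with rows 0, 1, 2, ..., m indexed by X and columns 0, 1, 2, ..., n indexed by Y, with row $\lfloor m/2 \rfloor$ marked.]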


Linear space algorithm

From top: $F_{top}(j)$

From bottom: $F_{bot}(j)$

Sum: $F_{top}(j) + F_{bot}(j)$

Find $k \in \{0, 1, \ldots, n\}$ such that $F_{top}(k) + F_{bot}(k)$ is maximized

$k$ is on path of optimal alignment
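The split step can be sketched with two score-only passes that each keep a single row. In this sketch, F_top(j) is taken to be the score of aligning the top half of X with y_1..y_j, and F_bot(j) the score of aligning the bottom half of X with y_{j+1}..y_n, computed on the reversed sequences; these readings, the names `last_row` and `split_column`, and the scoring parameters are assumptions.

```python
def last_row(x, y, match=1, mismatch=-1, d=2):
    """Last row of the global-alignment matrix, using O(len(y)) space."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, cur[j - 1] - d, prev[j] - d)
        prev = cur
    return prev

def split_column(x, y):
    """Column k where an optimal path crosses row floor(m / 2), and its score."""
    mid = len(x) // 2
    top = last_row(x[:mid], y)                      # F_top(j)
    bot = last_row(x[mid:][::-1], y[::-1])[::-1]    # F_bot(j)
    totals = [t + b for t, b in zip(top, bot)]
    k = max(range(len(totals)), key=totals.__getitem__)
    return k, totals[k]
```

Recursing on the two sub-problems (top half of X with y_1..y_k, bottom half with the rest) would recover the full alignment while never holding more than a couple of rows in memory.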


Linear space alignment: Hirschberg’s insight

[Figure: F matrix from (0, 0) to (m, n), with row $\lfloor m/2 \rfloor$ and column $k$ marked.]


Software for pairwise alignment

Pure D.P. runs in $\Theta(mn)$ time

Example

100 million residues in database

Search sequence of length 10,000

# F matrix cells to be calculated: $10^{12}$

Computer speed: 10 million cells a second

Total time: 100,000 seconds = 28 hours (approx.)
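The arithmetic behind this estimate can be checked in a few lines of Python; the variable names are only illustrative.

```python
database_residues = 100_000_000   # 100 million residues in database
query_length = 10_000             # search sequence length
cells_per_second = 10_000_000     # 10 million cells a second

cells = database_residues * query_length   # F matrix cells to calculate
seconds = cells / cells_per_second
hours = seconds / 3600

print(f"{cells:.0e} cells, {seconds:,.0f} seconds, {hours:.1f} hours")
```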


Heuristic methods

FASTA (Pearson & Lipman, 1988)

Words in X and Y (length ktup)

[Figure: lookup table in which each word (e.g. cgtta) points to a list of pairs ..., (i, j), ... giving its position in X and its position in Y.]

• sort matches on j - i

• extend best matches (ungapped)

• join neighbouring matches by inserting gaps

• realign best matches by dynamic programming
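The first step, finding word matches and sorting them on j − i, can be sketched as below. This is not FASTA's actual implementation: the function names are hypothetical, positions are 0-based, and only the word lookup and diagonal grouping are shown, not the extension, joining, or realignment steps.

```python
from collections import Counter, defaultdict

def word_matches(x, y, ktup):
    """All (i, j) with x[i:i+ktup] == y[j:j+ktup], sorted on the diagonal j - i."""
    index = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        index[x[i:i + ktup]].append(i)      # word -> positions in X
    matches = [(i, j)
               for j in range(len(y) - ktup + 1)
               for i in index.get(y[j:j + ktup], [])]
    return sorted(matches, key=lambda ij: (ij[1] - ij[0], ij[0]))

def diagonal_counts(matches):
    """Number of word matches on each diagonal j - i."""
    return Counter(j - i for i, j in matches)
```

Diagonals with many word matches are the candidates that the later steps would extend.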


Sensitivity

Tradeoff

High values of ktup: faster search, but may miss significant matches

Low values of ktup: catches more matches, but slower

ktup = 1 for sensitivity close to dynamic programming

Available from

http://www.fasta.bioch.virginia.edu/


Example

Input sequences

>short1.seq Length: 100 August 1, 2003 11:09 Type: N Check: 5940
atgaaattaacagcaatagctaaagcaacattagcattaggaatattaacaacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaagt

>short2.seq Length: 100 August 1, 2003 10:43 Type: N Check: 1744
atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagcgactggggttataacatcaacggctcaaactgtaaatgcgagcgaacatg

/seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @

Output matches

resetting to DNA matrix
resetting to DNA matrix
LALIGN finds the best local alignments between two sequences
version 2.1u03 April 2000
Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381

resetting to DNA matrix
alignments < E( 0.05):score: 51 (50 max)
Comparison of:
(A) @ short1.seq Length: 100 August 1, 2003 11:09 Type - 100 nt
(B) @ short2.seq Length: 100 August 1, 2003 10:43 Typ - 100 nt
using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05

71.4% identity in 91 nt overlap (1-91:1-91); score: 221 E(10000): 3.7e-12

10 20 30 40 50 60 70
short1 ATGAAATTAACAGCAATAGCTAAAGCAACATTAGCATTAGGAATATTAACAACAGGTGTGATGACAGCAGAAAGT
::::: : :::::::: :: ::::: : ::::: :: : :: ::: : :: :: :: :: ::: :: :
short2 ATGAAGATGACAGCAATTGCGAAAGCCAGTTTAGCTCTAAGTATTTTAGCGACTGGGGTTATAACATCAACGGCT
10 20 30 40 50 60 70

80 90
short1 CAAACTGTAAACGCGA
::::::::::: ::::
short2 CAAACTGTAAATGCGA
80 90

----------


Example

More matches

64.1% identity in 39 nt overlap (17-54:32-69); score: 53 E(10000): 3.7e+02

20 30 40 50
short1 TAGCTAAAGCAACATTAGC-ATTAGGAATATTAACAACA
::::: : : ::::: : : :: ::: :::: ::
short2 TAGCTCTAAGTATTTTAGCGACTGGGGTTAT-AACATCA
40 50 60

----------

73.9% identity in 23 nt overlap (60-77:6-28); score: 53 E(10000): 3.7e+02

60 70
short1 GATGACAGCA-----GAAAGTCA
:::::::::: ::::: ::
short2 GATGACAGCAATTGCGAAAGCCA
10 20

----------


BLAST

Developed by Altschul et al. (1990)

Preprocesses query sequence

Makes list of “neighbourhood words” with match > T

Tries to extend “seed” matches (ungapped) in database sequences

GAPPED-BLAST looks for gapped alignments


Genetics Computer Group package

GCG at University of Wisconsin

Commercial package (http://www.gcg.com/)

* assemble * backtranslate * bestfit * blast * breakup * chopup * circles * codonfrequency * codonpreference * coilscan * compare * composition * compresstext * comptable * consensus * correspond * corrupt * dataset * detab * distances * diverge * domes * dotplot * extractpeptide * fasta * fasta_parsable_output * fetch * figure * findpatterns * fingerprint * fitconsensus * foldrna * framealign * frames * framesearch * fromembl * fromfasta * fromgenbank * fromig * frompir * fromstaden * gap * gapshow * gcgtoblast * gelassemble * geldisassemble

* gelenter * gelintroduction * gelmerge * gelstart * gelview * getseq * growtree * helicalwheel * hthscan * isoelectric * lineup * listfile * lookup * lprint * map * mapplot * mapsort * mfold * moment * motifs * mountains * name * names * netblast * nooverlap * olddistances * onecase * overlap * paupdisplay * paupsearch * pepdata * pepplot * peptidemap * peptidesort * peptidestructure * pileup * plasmidmap * plotfold * plotsimilarity * plotstructure * plottest * pretty * prime * profileanalysis * profilegap * profilemake

* profilescan * profilesearch * profilesegments * publish * red * reformat * repeat * replace * reverse * sample * seg * segments * seqed * seqlab * setkeys * setplot * shiftover * shuffle * simplify * spew * spscan * squiggles * statplot * stemloop * stringsearch * symbol * terminator * testcode * tfasta * tofasta * toig * topir * tostaden * translate * whats_new_9.0 * whats_new_9.1 * window * wordsearch * xnu


GAP

GAP (“Global Alignment Program” ?)

Needleman & Wunsch algorithm

Input in GCG format

Use GETSEQ

!!NA_SEQUENCE 1.0 GETSEQ from gcg, August 14, 19103 12:19.

Length: 389 August 14, 19103 12:19 Type: N Check: 9580 ..

1 AAATGATAAA CTATTTTACT TTATGTCTAA GGTCTTTCAT AATATGAAAT

51 AGAATGTAGA TATTGCAACA ATAGCATTTT TGGAGACAGC TACCTCCTTT

101 ACCAGGAATA ATCTTTGCAT GTCACATTTA GAGATAAAGC TCAAAATGCA

151 AATCCTTCCC CTGAGAGTGG GAAAGCATTA ACAAATGAGA GTGGGAAAAG

201 CATTAACAAA GCATTAACAC AGGTCTTTAC ATATTCAAAA TATTAAACTA

251 ATGCTAGGAT TATAGACTTG ATTTTAAGAC ATGGTAGTTA ATAGAAAAGT

301 TCTAGATTGA AAACAATTTT GCAAAAATAT ACATTTGGTA TATGTGTATA

351 TATGTATGTG GTATATATAT ATCNACTAGG GAAAATATA


Example

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

GAP of what sequence 1 ? Hs#S374655.gcg

Begin (* 1 *) ? End (* 389 *) ? Reverse (* No *) ?

to what sequence 2 (* Hs#S374655.gcg *) ? Hs#S1117589.gcg

Begin (* 1 *) ? End (* 323 *) ? Reverse (* No *) ?

What is the gap creation penalty (* 50 *) ?

What is the gap extension penalty (* 3 *) ?

What should I call the paired output display file (* Hs#S374655.pair *) ?

Aligning ................-. Aligning ................-..

Gaps: 0 Quality: 3080 Quality Ratio: 9.536 % Similarity: 95.356 Length: 389


Display file

GAP of: Hs#S374655.gcg check: 9580 from: 1 to: 389

GETSEQ from gcg, August 14, 19103 12:19.

to: Hs#S1117589.gcg check: 8814 from: 1 to: 323

GETSEQ from gcg, August 14, 19103 12:20.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/nwsgapdna.cmp CompCheck: 8760

Gap Weight: 50 Average Match: 10.000 Length Weight: 3 Average Mismatch: 0.000

Quality: 3080 Length: 389 Ratio: 9.536 Gaps: 0 Percent Similarity: 95.356 Percent Identity: 95.356

Match display thresholds for the alignment(s): | = IDENTITY : = 5 . = 1

Hs#S374655.gcg x Hs#S1117589.gcg August 18, 19103 17:59 ..

. . . . .
1 AAATGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 50
|||||||||||||||||||||||||||||||||||||||||||||||
1 ...TGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 47
. . . . .
51 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 100
||||||||||||||||||||||||||||||||||||||||||||||||||
48 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 97
. . . . .
101 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAAATGCA 150
|||||||||||||||||||||||||||||||||||||||||||| |||||
98 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAGATGCA 147
. . . . .
151 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 200
||||||||||||||||||||||||||||||||||||||||||||||||||
148 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 197
. . . . .
201 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 250
||||||||||||||||||||||||||||||||||||||||||||||||||
198 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 247
. . . . .
251 ATGCTAGGATTATAGACTTGATTTTAAGACATGGTAGTTAATAGAAAAGT 300
||||||||||||||||||||||||||| |||||||| |||||
248 ATGCTAGGATTATAGACTTGATTTTAAACATGGGTAGTTATAGAAAAAGG 297
. . . . .
301 TCTAGATTGAAAACAATTTTGCAAAAATATACATTTGGTATATGTGTATA 350
|||||||||||||||| ||| |||
298 TCTAGATTGAAAACAAATTTTGCAAA........................ 323
. .


Bestfit

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>bestfit

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

BESTFIT of what sequence 1 ? short1.gcg

Begin (* 1 *) ? End (* 100 *) ? Reverse (* No *) ?

to what sequence 2 (* short1.gcg *) ? short2.gcg

Begin (* 1 *) ? End (* 100 *) ? Reverse (* No *) ?

What is the gap creation penalty (* 50 *) ?

What is the gap extension penalty (* 3 *) ?

What should I call the paired output display file (* short1.pair *) ?

Aligning ....-. Aligning ....-.

Gaps: 0 Quality: 416 Quality Ratio: 4.571 % Similarity: 71.429 Length: 91

Smith & Waterman algorithm

Local alignment

Same interface as GAP


Bestfit display file

BESTFIT of: short1.gcg check: 2998 from: 1 to: 100

GETSEQ from gcg, August 18, 19103 15:25.

to: short2.gcg check: 6455 from: 1 to: 100

GETSEQ from gcg, August 18, 19103 15:26.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/swgapdna.cmp CompCheck: 2335

Gap Weight: 50 Average Match: 10.000 Length Weight: 3 Average Mismatch: -9.000

Quality: 416 Length: 91 Ratio: 4.571 Gaps: 0 Percent Similarity: 71.429 Percent Identity: 71.429

Match display thresholds for the alignment(s): | = IDENTITY : = 5 . = 1

short1.gcg x short2.gcg August 18, 19103 15:27 ..

. . . . .
1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50
||||| | |||||||| || ||||| | ||||| || | || ||| |
1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50
. . . .
51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcga 91
|| || || || ||| || |||||||||||| ||||
51 gactggggttataacatcaacggctcaaactgtaaatgcga 91


Wordsearch

Algorithm similar to algorithm of Wilbur and Lipman (1983).

Compares one sequence (the query) to any group of sequences.

Comparisons can be viewed as set of dot-plots.

Search finds registers of comparison (diagonals) that have the largest number of short perfect matches (words).

Best segment of similarity along each diagonal viewed with program SEGMENTS.


Wordsearch example

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>wordsearch

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

WORDSEARCH with what query sequence ? short1.gcg

Begin (* 1 *) ? End (* 100 *) ?

Search for query in what sequence(s) (* GenEMBL:* *) ? short2.gcg

What word size (* 6 *) ?

List how many best diagonals (* 50 *) ? 4

Integrate how many adjacent diagonals (* 3 *) ?

What should I call the output file (* short1.word *) ?

1 short2.gcg Len: 100

6-mers found: 168 Diagonals with words: 6 Total diagonals: 398 Sequences searched: 1 CPU time: 00.03

Output file: /usr/users/gcg/359Stuff/short1.word


Short1.word contents

!!SEQUENCE_LIST 1.0 (Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to:100

GETSEQ from gcg, August 18, 19103 15:25.

TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47

Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398 Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

Sequence Strd Diag Score Width Documentation ..

/short2.gcg + 0 20 3 GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg + -54 10 3 GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg - -47 7 3 GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg + -69 7 3 GETSEQ from gcg, August 18, 19103 15:26.


Run SEGMENTS

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>segments

Segments aligns and displays the segments of similarity found by WordSearch.

(BestFit) SEGMENTS from what WORDSEARCH file ? short1.word

What should I call the output file (* short1.pairs *) ?

Aligning ....-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:500 / Length: 98
Aligning ..-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:100 / Length: 10
Aligning ..-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:112 / Length: 32
Aligning .-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 96 / Length: 16


Short1.pairs contents

(BestFit) SEGMENTS from: short1.word August 18, 19103 15:48

(Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:25.
TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47
Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398
Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

AvMatch: 3.84 AvMisMatch: -6.00 GapWeight: 50 LengthWeight: 3 ..

Match display thresholds for the alignment(s): | = IDENTITY : = 3 . = 1

short1.gcg check: 2998 from: 1 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 500 Ratio:5.102 Score:20 Width:3 Limits: +/-4

. . . . .
1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50
||||| | |||||||| || ||||| | ||||| || | || ||| |
1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50
. . . .
51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaa 98
|| || || || ||| || |||||||||||| |||| | | |
51 gactggggttataacatcaacggctcaaactgtaaatgcgagcgaaca 98


Short1.pairs (continued)

short1.gcg check: 2998 from: 54 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 100 Ratio:10.000 Score:10 Width:3 Limits: +/-4

.
60 gatgacagca 69
||||||||||
6 gatgacagca 15

short1.gcg check: 2998 from: 47 to: 100 /Reverse
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 112 Ratio:3.500 Score:7 Width:3 Limits: +/-4

. . .
40 ctaatgctaatgttgctttagctattgctgtt 9
| | ||| || | ||||||| | | ||
14 caattgcgaaagccagtttagctctaagtatt 45

short1.gcg check: 2998 from: 69 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 96 Ratio:6.000 Score:7 Width:3 Limits: +/-4

.
79 actgtaaacgcgaaag 94
|| | || |||||||
10 acagcaattgcgaaag 25


EMBOSS

What is EMBOSS?
The European Molecular Biology Open Software Suite
EMBOSS is a package of high-quality FREE Open Source software for sequence analysis.

Applications in EMBOSS
The EMBOSS programs and their documentation.

User Documentation
Tutorial, Command syntax, Sequences and Databases, Reference

Jemboss and other Interfaces
Many groups are creating graphical interfaces to EMBOSS
Jemboss is our supported interface

Downloading the software
You can download, install and run the software on most UNIX computers
It is known to work on: Irix, AIX (4.3.3 and 5.1), Red Hat, SuSe, Debian, HPUX11/IA64, MacOSX, Mandrake, NetBSD, Slackware, Solaris, Tru64 Unix (Full support soon. Loan machine being arranged)
It is reported to work on: FreeBSD, OSF, SuSE-PPC

LATEST NEWS: Release 2.7.1 available as of 3rd June 2003


Suite of programs

getorf HGMP Finds and extracts open reading frames (ORFs)
helixturnhelix HGMP Finds nucleic acid binding domains.
hmoment HGMP Hydrophobic moment calculation
iep HGMP Calculates the isoelectric point of a protein
infoalign HGMP Information on a multiple sequence alignment
infoseq HGMP Displays some simple information about sequences
isochore Sanger Plots isochores in large DNA sequences
jembossctl HGMP Jemboss Authentication Control
lindna Norway Draws linear maps of DNA constructs
listor HGMP Writes a list file of the logical OR of two sets of sequences
marscan HGMP Finds MAR/SAR sites in nucleic sequences
maskfeat HGMP Mask off features of a sequence
maskseq HGMP Mask off regions of a sequence.
matcher Sanger Local alignment of two sequences
megamerger HGMP Merge two large overlapping nucleic acid sequences
merger HGMP Merge two overlapping sequences
msbar HGMP Mutate sequence beyond all recognition
mwcontam HGMP Shows molwts that match across a set of files
mwfilter HGMP Filter noisy molwts from mass spec output
needle HGMP Needleman-Wunsch global alignment.