+ All Categories
Home > Documents > 2 Pairwise alignment - Algorithms in Bioinformatics · 2 Pairwise alignment ... Bioinformatics I,...

2 Pairwise alignment - Algorithms in Bioinformatics · 2 Pairwise alignment ... Bioinformatics I,...

Date post: 06-Jul-2018
Category:
Upload: lammien
View: 225 times
Download: 0 times
Share this document with a friend
15
4 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 2 Pairwise alignment Outline of the chapter: 1. Global, local, and overlap alignment of two sequences using dynamic programming 2. Affine gap penalty 3. Algorithms that reduce the time and memory complexity of the dynamic programming ap- proaches Comparison of biological sequences: one of the most important operations in computational biology Biology’s paradigm: High sequence similarity implies similar structure and/or function : Similarity of sequences: measure of how similar sequences are Alignment of two sequences: place one sequences above the other one such that similar or identical characters are in the same column and non-identical/non-similar characters are either placed in the same column as a mismatch or opposite a gap in the other sequence. Definition 2.0.1 (Alignment) Given two sequences X and Y over an alphabet Σ. An alignment A of X and Y is obtained by inserting dashes (‘-’) so that both resulting strings X = x 1 ...x l and Y = y 1 ...y l of equal length can be written one above the other in such a way that each character in the one string is opposite a unique character in the other string. Note: Usually, we require that no two dashes are aligned in this way. Example: X= Y E - S T E R D A Y Y= - E A S T E R S - - 2.1 Sequence similarity In order to evaluate an alignment we need a scoring system. The score of two aligned characters in biology is mostly a similarity score. Given two sequences X = x 1 ...x n and Y = y 1 ...y m over an alphabet Σ. A similarity score matrix s ∪ {-} × Σ ∪ {-} → R assigns a similarity score to each pair of characters in Σ ∪ {-}. Let A be an alignment of X and Y . The score S (A) of A is defined as S (A)= l i=1 s(x i ,y i ). Example: for Σ ∪ {-} = {A, B, L, -} we use the similarity score matrix: s A B L - A 3 1 -1 -2 B 2 0 -3 L 1 2 - 0
Transcript

4 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

2 Pairwise alignment

Outline of the chapter:

1. Global, local, and overlap alignment of two sequences using dynamic programming

2. A!ne gap penalty

3. Algorithms that reduce the time and memory complexity of the dynamic programming ap-proaches

Comparison of biological sequences: one of the most important operations in computational biology

Biology’s paradigm:

High sequence similarity implies similar structure and/or function :

Similarity of sequences: measure of how similar sequences are

Alignment of two sequences: place one sequences above the other one such that similar or identicalcharacters are in the same column and non-identical/non-similar characters are either placed in thesame column as a mismatch or opposite a gap in the other sequence.

Definition 2.0.1 (Alignment) Given two sequences X and Y over an alphabet ". An alignmentA of X and Y is obtained by inserting dashes (‘-’) so that both resulting strings X ! = x!1 . . . x!l andY ! = y!1 . . . y!l of equal length can be written one above the other in such a way that each character inthe one string is opposite a unique character in the other string.

Note: Usually, we require that no two dashes are aligned in this way.

Example:

X= Y E - S T E R D A YY= - E A S T E R S - -

2.1 Sequence similarity

In order to evaluate an alignment we need a scoring system. The score of two aligned characters inbiology is mostly a similarity score.

Given two sequences X = x1 . . . xn and Y = y1 . . . ym over an alphabet ". A similarity score matrixs : " ! {"}# " ! {"}$ R assigns a similarity score to each pair of characters in " ! {"}.Let A be an alignment of X and Y .

The score S(A) of A is defined as

S(A) =l!

i=1

s(x!i, y!i).

Example: for " ! {"} = {A, B, L,"} we use the similarity score matrix:

s A B L "A 3 1 "1 "2B 2 0 "3L 1 2" 0

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 5

The alignment A then has the score:

X’ = B L A - B L AY’ = A L A B B L -

1 +1 +3 -3 +2 +1 -2 = 3 = S(A)

2.2 The scoring model

The algorithms that compute an alignment critically depend on the choice of the parameters forsubstitutions, deletions and insertions. Generally no existing scoring model can be applied to allsituations. Here the underlying question and/or application always needs to be considered. Generallypairwise alignments are conducted when

• Evolutionary relationships between the sequences are reconstructed. Here scoring matrices basedon mutation rates are usually applied.

• Protein domains are compared. Then the scoring matrices should be based on composition ofdomains and their substitution frequency.

To be able to score an alignment, we need to determine score terms for each aligned residue pair.

Definition 2.2.1 A substitution matrix S over an alphabet " = {a1, . . . , a!} has !#! entries, whereeach entry (i, j) assigns a score for a substitution of the letter ai by the letter aj in an alignment.

General idea:

• consider non-gapped alignments

• compute relative frequencies of the letters f(a) as well as substitution frequencies f(a, b) in arepresentative data set.

• Compare match model against null/random model:

Score(a, b) =f(a, b)

f(a)f(b)

Consider a non-gapped alignment

X = x1x2 . . . xn

Y = y1y2 . . . yn

Null hypothesis: the two sequences are unrelated (not homologous). The alignment is then randomwith a probability described by the model R: each letter a occurs independently with some probabilitypa, and hence the probability of the two sequences is the product:

P (X, Y | R) = P (X | R)P (Y | R) ="

i

pxi

"

i

pyi .

match model M : the two sequences are related (homologous). In the aligned pairs of residues occurwith a joint probability pab, which is the probability that a and b have each evolved from some unknownoriginal residue c as their common ancestor. Thus, the probability for the whole alignment is:

P (X, Y | M) ="

i

pxiyi .

6 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

Note that the two values P (X, Y | R) and P (X, Y | M) are likelihoods.

The ratio of the two gives a measure of the relative likelihood that the sequences are related (modelM) as opposed to being unrelated (model R). This ratio is called odds ratio:

P (X, Y | M)P (X, Y | R)

=#

i pxiyi#i pxi

#i pyi

="

i

pxiyi

pxipyi

To obtain an additive scoring scheme, we take the logarithm (base 2 is usually chosen) to get thelog-odds ratio:

S = log(P (X,Y | M)P (X, Y | R)

) = log("

i

pxiyi

pxipyi

) =!

i

s(xi, yi),

withs(a, b) := log

$pab

papb

%.

We thus obtain a matrix S = s(a, b) that determines a score for each aligned residue pair, known asa score or substitution matrix.

For amino-acid alignments, commonly used matrices are the PAM and BLOSUM matrices.

2.3 BLOCKS and BLOSUM matrices

BLOSUM matrices: derived from the database BLOCKS1 (http://blocks.fhcrc.org/).

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regionsof proteins.

BLOSUM (=BLOcks SUbstitution Matrix) family scoring matrices of all block of the database areevaluated columnwise. For each possible pair of amino acids the frequency f(ai, aj) of common pairs(ai, aj) in all columns is determined.

Di#erent levels of the BLOSUM matrix can be created by di#erentially weighting the degree of simi-larity between sequences.

Common BLOSUM matrices:

Standard values are BLOSUM50 up to BLOSUM80, with the commonly used BLOSUM62 matrix.

BLOSUM62 matrix: calculated from protein blocks such that sequences that are more than 62%identical, then their common contribution has weight 1.

1S Heniko! and JG Heniko!, PNAS 89:10915, 1992

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 7

Note that lower BLOSUMx values correspond to longer evolutionary time, and are applicable for moredistantly related sequences. [fragile] BLOSUM62 is scaled so that its values are in half-bits, ie. thelog-odds were multiplied by 2/ log2 and then rounded to the nearest integer value.

A R N D C Q E G H I L K M F P S T W Y VA 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

2.4 Optimal alignment

Given a similarity score matrix. The similarity of two sequences X and Y is the value of any alignmentA of X and Y that maximizes the alignment value. Such an alignment is called optimal.

The goal is to use similarity-based alignments to uncover homology, while avoiding homoplasy2.

Gaps are undesirable and thus penalized. The standard cost associated with a gap of length g is giveneither by a linear score

"(g) = "gd

or an a!ne score"(g) = "d" (g " 1)e,

where d is the gap open penalty and e is the gap extension penalty.

Usually, e < d, with the result that less isolated gaps are produced.

2.5 Alignment algorithms

Given a scoring scheme, we need to have an algorithm that computes the highest-scoring alignmentof two sequences.

Alignment algorithms based on dynamic programming are guaranteed to find the optimal scoringalignment.

DP algorithm

• for global alignment,

• for local alignment, and

• for overlap alignment.2Homoplasy=mutations that appear in parallel or convergently in two di!erent lineages

8 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

2.5.1 Global alignment: Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm3: is a dynamic program that solves the problem of obtaining thebest global alignment of two sequences X and Y . The recursion for linear gap penalty d > 0 is definedas:

Initialization: F (i, 0) := "i · d for all i = 0, 1, 2, . . . , n, F (0, j) := "j · d for all j = 0, 1, 2, . . . ,m

F (i, j) := max

&'

(

F (i" 1, j " 1) + s(xi, yj)F (i" 1, j)" dF (i, j " 1)" d

The score of the optimal global alignment is # := F (n, m)

2.5.2 Traceback

To obtain the actual alignment corresponding to the optimal score, save (in an independent matrix)which of the three terms in the recursion was maximal and used for F (i, j). Then from the finalF (n, m) the alignment can be achieved by backtracking or traceback of the entries in the secondmatrix.

2.5.3 Example of a global alignment matrix

Needleman-Wunsch matrix of the sequences CTAATC and TAATG, scoring values s(a, a) = 1, s(a, b) = "1and a linear gap cost of d = 2:

F 0 C T A A T C

0 0T

A

A

T

G

Score: ; Alignment:

2.5.4 Example of a traceback matrix

Traceback matrix for example above:F 0 C T A A T C0 0 -2 -4 -6 -8 -10 -12T -2 -1 -1 -3 -5 -7 -9A -4 -3 -2 0 -2 -4 -6A -6 -5 -4 -1 1 -1 -3T -8 -7 -4 -3 -1 2 0G -10 -9 -6 -5 -3 0 1

T 0 1 2 3 4 5 60 0,0 0,0 1,0 2,0 3,0 4,0 5,01 0,0 0,0 1,0 2,1 3,1 4,1 5,12 0,1 1,1 1,1 2,1 3,1 4,2 5,23 0,2 1,3 1,2 2,2 3,2 4,3 5,34 0,3 1,4 1,3 3,4 4,3 4,3 5,35 0,4 1,5 1,4 2,4 4,4 5,4 5,4

3Saul Needleman and Christian Wunsch (1970), improved by Peter Sellers (1974).

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 9

2.5.5 Complexity

Complexity of the Needleman-Wunsch algorithm:

Space: We need to store (n + 1)# (m + 1) numbers.

Time: Each number takes a constant number of calculations to compute: three sums and a max.

Space / Memory Complexity: O(nm)

Time Complexity: O(nm)

2.6 Local alignment

Global alignment: align two homologous sequences from start-to-end

Local alignment: for two sequences find the best match between substrings of both.

Compute local alignment if the score of an alignment between two substrings will be larger than thescore of a global alignment between the full lengths strings.

Definition 2.6.1 Let X = x1 . . . xn and Y = y1 . . . ym be two sequences over an alphabet ". Let $ bea score function for an alignment. A local alignment of X and Y is a global alignment of substringsX ! = xi1 . . . xi2 and Y ! = yj1 . . . yj2. An alignment A = (X !, Y !) of substrings X ! and Y ! is an optimallocal alignment of X and Y with respect to $ if

$(A) = max{sim(X !, Y !)|X ! is a substring of X, Y ! is a substring of Y }

2.6.1 Smith-Waterman algorithm

The Smith-Waterman local alignment algorithm

• published in Smith, T. and Waterman, M. Identification of common molecular subsequences. J.Mol. Biol. 147:195-197, 1981

The Smith-Waterman local alignment algorithm:

Initialization: F (i, 0) := 0 for all i = 0, 1, 2, . . . , n and F (0, j) = 0 and j = 0, 1, 2, . . . ,m.

F (i, j) = max

&))'

))(

0,F (i" 1, j " 1) + s(xi, yj),F (i" 1, j)" d,F (i, j " 1)" d.

The value F (i, j) = 0: start a new alignment at (i, j).

The cell with the highest score: arg max F (i, j).

2.6.2 Smith-Waterman algorithm: Traceback

Traceback start: arg maxF (i, j)

Traceback end: cell with score 0

Requirement: !

a,b"!

pa · pb · s(a, b) < 0,

where pa and pb are the probabilities for the seeing the symbol a or b respectively, at any given position.

10 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

2.6.3 Example

Smith-Waterman matrix of the sequences CTAATC and TAATG, with s(a, a) = 1, s(a, b) = "1 and gapscore d = 2:

F 0 C T A A T C

0T

A

A

T

G

Score: ; Alignment =

Smith-Waterman matrix of the sequences CTAATC and TAATG, with s(a, a) = 1, s(a, b) = "1 and gapscore d = 2:

F 0 C T A A T C0 0 0 0 0 0 0 0T 0 0 1 0 0 1 0A 0 0 0 2 1 0 0A 0 0 0 1 3 1 0T 0 0 1 0 1 4 2G 0 0 0 0 0 2 3

Score: 4 Alignment = T A A TT A A T

2.6.4 Complexity of the algorithm

Same as the Needleman-Wunsch algorithm: O(nm)

To do: find at least 2 examples of bioinformatics questions which can be solved using the SW algorithm

2.7 Overlap alignments

If we are given di#erent fragments of genomic DNA that we would like to put together, then we needan alignment method that does not penalize overhanging ends:

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 11

x

y

xy

x

y

x

y

Alignment should be allowed to start anywhere on the top or left boundary of the matrix, and shouldbe allowed to end anywhere on the bottom or right boundary.

For former, use initialization of SW algorithm: F (i, 0) = 0 and F (0, j) = 0 for i = 0, 1, 2, . . . , n andj = 1, 2, . . . ,m.

For alignment, use recursion of NW algorithm:

To allow the latter, start the traceback at the best scoring cell contained in the bottommost row orrightmost column, i.e. start at

arg max{F (i, j) | i = n or j = m}.

2.7.1 Example of an overlap alignment

X = ACATATT and Y = TTTTAC. Let s(a, a) = 1, s(a, b) = "1, for the match, mismatch score,respectively, and gap score = 2:

0 A C A T A T T0

T

T

T

T

A

C

2.8 Alignments with more complex gap models

Linear gap score not ideal for biological sequences

More appropriate scheme: expensive to open a gap, but once open, less expensive to extend a gap

Let "(k) be the score function for gaps of length k.

12 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

Init: F (0, 0) = 0, F (i, 0) = "(i) and F (0, j) = "(j)

F (i, j) := max

&'

(

F (i" 1, j " 1) + s(xi, yj),max1#k#j{F (i, j " k) + "(k)},max1#l#i{F (i" l, j) + "(l)}.

Problem: requires O(n3) time, because for each cell we need to inspect i + j + 1 predecessors.

2.8.1 A!ne gap scores

A!ne gap score"(k) = "d" (k " 1)e,

with d the gap-open score and e the gap-extension score.

Modification of Needleman-Wunsch algorithm for global alignment so as to incorporate a!ne gap costsis due to Osamu Gotoh (1982)4:

Instead of using one matrix F (i, j) use three matrices M , Ix and Iy:

Case 1 Case 2 Case 3xi aligns to yj : xi aligns to a gap: yj aligns to a gap:

I G A xi

L G V yj

A I G A xi

A G V yj -S G A xi -S L G V yj

1. M(i, j) is the best score up to (i, j), given that xi is aligned to yj ,

2. Ix(i, j) is the best score up to (i, j), given that xi is aligned to a gap, and

3. Iy(i, j) is the best score up to (i, j), given that yj is aligned to a gap.

2.8.2 Recursion for global alignment with a!ne gap costs

Initialization:

M(0, 0) = 0, Ix(0, 0) = Iy(0, 0) = "%Ix(i, 0) = "d" (i" 1)e, M(i, 0) = Iy(i, 0) = "%, for i = 1, . . . , n, andIy(0, j) = "d" (j " 1)e, M(0, j) = Ix(0, j) = "%, for j = 1, . . . ,m.

Recursion:Ix(i, j) = max

*M(i" 1, j)" d,Ix(i" 1, j)" e;

Iy(i, j) = max*

M(i, j " 1)" d,Iy(i, j " 1)" e;

M(i, j) = max

&'

(

M(i" 1, j " 1) + s(xi, yj),Ix(i" 1, j " 1) + s(xi, yj),Iy(i" 1, j " 1) + s(xi, yj).

4Gotoh, O. An improved algorithm for matching biological sequences. J. Mol. Biol. 162:705-708, 1982.

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 13

2.8.3 Example of a global alignment with a!ne gap costs

X = TTAGAT and Y = TTG, s(a, a) = 1, s(a, b) = "1, "(g) = "d " (g " 1)e with d = 4 and e = 1 forthe match, mismatch, gap-open and gap-extension scores, respectively:

0 T T A G T

0MIx

Iy

TMIx

Iy

TMIx

Iy

GMIx

Iy

X = TTAGAT and Y = TTG, s(a, a) = 1, s(a, b) = "1, "(g) = "d " (g " 1)e with d = 4 and e = 1 forthe match, mismatch, gap-open and gap-extension scores, respectively:

0 T T A G T

M 0 "% "% "% "% "%0 Ix "% "4 "5 "6 "7 "8

Iy "% "% "% "% "% "%M "% +1 "3 "6 "7 "6

T Ix "% "% "3 "4 "5 "6Iy "4 "% "% "% "% "%M "% "3 +2 "4 "5 "4

T Ix "% "% "7 "2 "3 "4Iy "5 "3 "7 "10 "11 "10M "% "6 "4 +1 "1 "4

G Ix "% "% "10 "8 "3 "4Iy "6 "4 "2 "8 "9 "8

2.8.4 Simplifying the a!ne-gap algorithm

Use only two matrices, M and I:

M corresponds to an alignment of two symbols,I corresponds to an insertion in one of the two sequences.

M(i, j) = max*

M(i" 1, j " 1) + s(xi, yj),I(i" 1, j " 1) + s(xi, yj);

I(i, j) = max

&))'

))(

M(i" 1, j)" d,I(i" 1, j)" e,M(i, j " 1)" d,I(i, j " 1)" e.

Exercise: how does one have to initialize the matrices?

Identical to the original algorithm if min(s(a, b)) > "2e.

14 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

2.9 Alignment in linear space

Can we compute a best alignment between two sequencesX = (x1, x2, . . . , xn) and Y = (y1, y2, . . . , ym) using only linear space?

The best score of an alignment is easy to compute in linear space, as F (i, j) is computed locally fromthe values in the previous and current column only.

However, to obtain an actual alignment in linear space, we need to replace the traceback matrix.

We will discuss this for the case of global alignments.

Idea: Divide-and-conquer! The idea was developed and published by Hirschberg (1975)5: Considerthe middle column u = &n

2 ' of the F matrix.

n/2

Idea: Divide-and-conquer! Consider the middle column u = &n2 ' of the F matrix. If we knew at

which cell (v, u) the best scoring alignment passes through the uth column, then we could split thedynamic-programming problem into two parts:

• align from (0, 0) to (v, u), and then

• align from (v, u) to (m, n).5Hirschberg, D. S. (1975) A linear space algorithm for computing longest common subsequences, Commun. Assoc.

Comput. Mach., 18, 341-343.

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 15

Y \ X 0 1 · · · u · · · n01...v (v, u)...m

Concatenate the two solutions to obtain the final result.

Compute length of optimal path from (0, 0) up to (j, &n2 '):

Compute length of optimal path from (m, n) up to (j, &n2 '):

How to determine v, the row in which a best path crosses the uth column? The key point of thisalgorithm is, that we can find this row without actually knowing the optimal full path through thewhole matrix. Define l(j) to be the score of the path from (0, 0) to (m, n) that passes through thevertex (j, &n

2 '). The cell (j, &n2 ') splits the l(j)-path into two subpaths. The first runs from (0, 0) to

(j, &n2 '), the second runs from (j, &n

2 ') to (m, n). Then l(j) is equal to the sum of the scores of bothsubpaths. Clearly the score of the first subpath is equal to F (j, &n

2 '). The score of the second path isequal to FR(j, &n

2 ') of the path from (m, n) to (j, &n2 ') in the reverse matrix.

Then clearly v = arg max0#j#m l(j).

Note that since the l(j) values are determined through the F (j, i) values, computing all l(j) valuestakes O(m) space.

Once we have determined (v, u), we recurse, as indicated here:

16 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

u’’ u’ u

v’’v’

v

m

n

We obtain the actual alignment as a sequence of pairs (v1, 1), (v2, 2), . . . , (vn, n).

What is the time complexity? We first look at 1#nm cells, then at 12nm cells, then at 1

4nm cells etc.As

+ni=0

12i < 2, this algorithm is only twice as slow as the quadratic-space one!

2.10 Alignment in linear time: Banded global alignment

For simplicity, we consider DNA sequences, assume n = m and use a linear gap score d.

Idea: Instead of computing the whole matrix F , use only a band of cells along the main diagonal:0

0

n

n

y

x

kk

Let 2k denote the height of the band. Obviously, the time complexity of the banded algorithm willbe O(kn).

Questions: Will this algorithm produce an optimal global alignment? What should k be set to?

2.10.1 What does k represent?

Alignment with k = 0: NO GAPS

Alignment with k = 1: allow maximally k = 1 consecutive gap coming from the diagonal, and/or2k = 2 consecutive gaps in sequence y coming from the band’s border of sequence x (and vice versa).

...

2.10.2 The KBand algorithm

Input: two sequences X and Y of equal length n, integer kOutput: best score #k of global alignmentInitialization: Set F (i, 0) := "i · d for all i = 0, 1, 2, . . . , k.Set F (0, j) := "j · d for all j = 1, 2, . . . , k.

for i = 1 to n dofor h = "k to k do

j := i + hif 1 ( j ( n then

F (i, j) := F (i" 1, j " 1) + s(xi, yj)if insideBand(i" 1, j, k) then

F (i, j) := max{F (i, j), F (i" 1, j)" d}if insideBand(i, j " 1, k) then

Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009 17

F (i, j) := max{F (i, j), F (i, j " 1)" d}return F (n, n)

To test whether (i, j) is inside the band, we use:insideBand(i, j, k) := ("k ( i" j ( k).

2.10.3 Searching for high-identity alignments

We can use the KBand algorithm as a fast method for finding high-identity alignments:

If we know that the two input sequences are highly similar and we have a bound b on the number ofgaps that will occur in the best alignment, then the KBand algorithm with k = b/2 will compute anoptimal alignment.

For example, in forensics, one must sometimes determine whether a sample of human mtDNA obtainedfrom a victim matches a sample obtained from a relative (or from a hair brush etc). If two suchsequences di#er by more than a couple of base-pairs or gaps, then they are not considered a match.

2.10.4 Optimal alignments using KBand

Given two sequences X and Y of the same length n. Let M be the match score and d the gap penalty.

Question: Let #k be the best score obtained using the KBand algorithm for a given k. When is #k

equal to the optimal global alignment score #?

Lemma If #k )M(n" k " 1)" 2(k + 1)d, then #k = #.

Proof If there exists an optimal alignment with score # that does not leave the band, then clearly#k = #. Else, all optimal alignments leave the band somewhere. This requires insertion of at leastk +1 gaps in each sequence, and allows only at most n" k" 1 matches, giving the desired bound. !

2.10.5 Optimal alignment using repeated KBand

The following algorithm computes an optimal alignment by repeated application of the KBand algo-rithm, with larger and larger k:

Input: two sequences X and Y of the same length nOutput: an optimal global alignment of x and y

Initialize k := 1repeat

compute #k using KBandif #k )M(n" k " 1)" 2(k + 1)d then

return #k

k := 2kend

As usual, we omit details of the traceback.

18 Bioinformatics I, WS’09-10, J. Fischer (script by D. Huson) November 26, 2009

2.10.6 Analysis of time complexity

The algorithm terminates when:

#k )M(n" k " 1)" 2(k + 1)d *

#k "Mn + M + 2d ) "(M + 2d)k *

"#k + Mn" (M + 2d) ( (M + 2d)k *Mn" #k

M + 2d" 1 ( k

At this point, the total complexity is:

n + 2n + 4n + · · · + kn ( 2kn.

So far, this doesn’t look better than nn. To bound the total complexity, we need a bound on k.

When the algorithm stops for k, we must have:

k

2<

Mn" #k/2

M + 2d" 1.

There are two cases: If #k/2 = #k = #, then

k < 2(Mn" #

M + 2d" 1).

Otherwise, # k2

< #k = #. Then any optimal alignment must have more than k2 spaces, and thus

# (M(n" k

2" 1) + 2(

k

2+ 1)d + k ( 2(

Mn" #

M + 2d" 1).

As M + 2d is a constant, it follows that k is bounded by O($), with $ = Mn"#, and thus the totalbound is O($n). !In consequence, the more similar the sequences, the faster the KBand algorithm will run!


Recommended