Multiple Sequence Alignment
Algorithms in Computational Biology, Spring 2006
Most of the slides were created by Dan Geiger and Ydo Wexler and edited by Itai Sharon; others were created by Itai Sharon
Multiple Sequence Alignment
S1=AGGTC
S2=GTTCG
S3=TGAAC
Possible alignment:
  S1' = AGGT-C-
  S2' = --GTTCG
  S3' = T-GAAC-
[Figure: a second possible alignment, given by the columns GTT, CCA, AG-, GTG, T-A, --A, -GC]
Motivation
Construction of phylogenetic trees
  Requires that the sites being compared are homologous
Extraction of conserved regions in proteins
Construction of profiles characteristic for a protein family
Repetitive sequences in DNA
Multiple Sequence Alignment (MSA)
Definition: Given strings s1, s2, …, sk, an MSA algorithm maps them to strings s’1, s’2, …, s’k that may contain gaps, where:
  |s’1| = |s’2| = … = |s’k|
  The removal of gaps from s’i leaves si
Note: It is usually convenient to represent an MSA as a matrix with k rows and |s’i| columns
  No column of this matrix may consist solely of gaps
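The conditions in the definition can be checked mechanically. The following is a minimal sketch; the function name `is_valid_msa` and the choice of `-` as the gap symbol are assumptions made for illustration.

```python
def is_valid_msa(originals, aligned, gap="-"):
    """Check the conditions of the MSA definition for a candidate alignment."""
    if not aligned or len(originals) != len(aligned):
        return False
    width = len(aligned[0])
    # All gapped strings must have the same length.
    if any(len(row) != width for row in aligned):
        return False
    # Removing the gaps from s'_i must give back s_i.
    if any(row.replace(gap, "") != s for s, row in zip(originals, aligned)):
        return False
    # No column may consist solely of gaps.
    return all(any(row[c] != gap for row in aligned) for c in range(width))


seqs = ["AGGTC", "GTTCG", "TGAAC"]
print(is_valid_msa(seqs, ["AGGT-C-", "--GTTCG", "T-GAAC-"]))  # True
```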
Assigning scores to an MSA
We will consider additive functions only
Points to consider regarding a scoring function:
  It should not depend on the order of its arguments
  It should reward the presence of many equal or strongly related residues and penalize unrelated residues and spaces
In pairwise alignment the score is simply the sum of similarity scores of corresponding letters
What is the “best” way to measure the similarity of k>2 letters?
Sum of Pairs (SP)
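The sum-of-pairs score of an MSA is the sum of the scores of all pairwise alignments induced by it; for k sequences there are $\binom{k}{2}$ such pairs
Example:
Using the cost function δ(x, x) = 0 and δ(x, y) = 1 for x ≠ y, where x and y may be letters or the gap symbol, this alignment has an SP value of 4 + 6 + 2 = 12:
- c - a d b -
a - b c d a d
a c - c d b -
The three terms are the costs of rows 2 and 3, rows 1 and 2, and rows 1 and 3, respectively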
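The SP value of the example can be recomputed with a few lines of code. This is a sketch under the assumption that the gap symbol is treated like any other character by the cost function; the names `delta` and `sp_score` are illustrative.

```python
from itertools import combinations


def delta(x, y):
    """Unit cost: 0 for identical symbols (including two gaps), 1 otherwise."""
    return 0 if x == y else 1


def sp_score(rows, cost=delta):
    """Sum the column-wise cost over every pair of rows of the alignment."""
    total = 0
    for a, b in combinations(rows, 2):
        total += sum(cost(x, y) for x, y in zip(a, b))
    return total


msa = ["-c-adb-", "a-bcdad", "ac-cdb-"]
print(sp_score(msa))  # 12
```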
Sum of Pairs
SP tends to overcount mutations. For instance:
  Assume that a column consists of (A, A, A, T) and that score(x, x) = 1 and score(x, y) = -1 for x ≠ y
  The score for the column will be 3·score(A, A) + 3·score(A, T) = 3 - 3 = 0
  Yet the column could be explained by a single mutation, A → T, in one of the four sequences
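The column score in this example can be reproduced directly. The sketch below assumes the same +1/-1 scoring; `column_score` is an illustrative name.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=-1):
    """SP score of a single alignment column under a match/mismatch scoring."""
    return sum(match if x == y else mismatch for x, y in combinations(column, 2))


print(column_score("AAAT"))  # 0
```

A fully conserved column of the same height, such as AAAA, scores 6 under this scoring, so one substitution costs the column its entire score.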
How to Perform MSA?
Multidimensional dynamic programming
Tree alignments
Star alignments
Progressive alignment
Multidimensional DP Alignment
Given k strings of length n, there is a natural generalization of the DP algorithm
Instead of a 2-dimensional table, we now have a k-dimensional table to fill
For each cell V(i), i = (i1, …, ik), compute an optimal multiple alignment for the k prefix sequences s1(1..i1), …, sk(1..ik)
The adjacent cells are all cells V(i - b), where b ∈ {0,1}^k and b ≠ 0. Each cell depends on 2^k - 1 adjacent cells
Use the SP-score for computing the score
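The recurrence can be sketched in a few lines for small inputs. This version minimizes an SP cost with unit costs (an assumption; the slides do not fix the scoring here) and returns only the optimal value, not the alignment. The name `msa_sp_cost` is illustrative.

```python
from itertools import combinations, product


def column_cost(column):
    """SP cost of one column: 1 for each pair of differing symbols."""
    return sum(x != y for x, y in combinations(column, 2))


def msa_sp_cost(seqs, gap="-"):
    """Optimal SP cost of aligning seqs, by filling a k-dimensional table."""
    k = len(seqs)
    origin = (0,) * k
    # The 2^k - 1 non-zero vectors b in {0,1}^k.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    V = {origin: 0}
    # Lexicographic order guarantees that every cell i - b is filled before i.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        best = None
        for b in moves:
            prev = tuple(i - d for i, d in zip(cell, b))
            if min(prev) < 0:
                continue
            # Sequences with b_j = 1 contribute a letter, the others a gap.
            column = [seqs[j][cell[j] - 1] if b[j] else gap for j in range(k)]
            candidate = V[prev] + column_cost(column)
            if best is None or candidate < best:
                best = candidate
        V[cell] = best
    return V[tuple(len(s) for s in seqs)]


print(msa_sp_cost(["kitten", "sitting"]))  # 3
```

For k = 2 the table is the familiar edit-distance matrix, which is why the two-sequence call returns the edit distance.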
Multidimensional DP Alignment
What’s the price?
  Number of cells to fill: O(n^k)
  Number of dependencies of each cell: 2^k - 1
  Time to compute the SP-score: O(k^2)
  Complexity: O(k^2 · 2^k · n^k)
In fact, the optimal SP-alignment problem was shown to be NP-complete!
Well, these sequences need to be aligned… what can we do?
Time Saving Heuristics – Relevance Tests
Idea: Avoid computing score(i) for irrelevant cells
Compute a lower bound L on the score of the optimal alignment
  Any efficient approximation algorithm can be used
For each cell V(i), compute an upper bound U on the score of the best alignment that goes through it
Ignore the cell if U < L
Time Saving Heuristics – Relevance Tests
How do we compute the upper bound U for cell V(i)?
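For cell i = (…, iu, …, iv, …) do the following:
  For each two indices 1 ≤ u < v ≤ k, compute the optimal score of a pairwise alignment of su and sv that goes via the projection (iu, iv) of cell i
  Compute
  $$U = \sum_{1 \le u < v \le k} \mathrm{score}\big(s_u(1..i_u..n),\ s_v(1..i_v..n)\big)$$
  where each term denotes that optimal pairwise score
Claim: U is an upper bound on the score of the best MSA that goes through cell i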
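One way to obtain the pairwise terms of U is to combine a prefix table with a suffix table for every pair of sequences: the best pairwise alignment through (iu, iv) scores the best prefix alignment up to that point plus the best suffix alignment from it. The sketch below assumes a similarity scoring of +1 for a match and -1 for a mismatch or a space; these values and all function names are illustrative.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed similarity scores


def prefix_table(a, b):
    """F[i][j] = best score of aligning a[:i] with b[:j]."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        F[i][0] = i * GAP
    for j in range(1, len(b) + 1):
        F[0][j] = j * GAP
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + GAP, F[i][j - 1] + GAP)
    return F


def suffix_table(a, b):
    """B[i][j] = best score of aligning a[i:] with b[j:], via reversed strings."""
    R = prefix_table(a[::-1], b[::-1])
    return [[R[len(a) - i][len(b) - j] for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]


def make_upper_bound(seqs):
    """Precompute all pairwise tables and return a function cell -> U."""
    tables = {
        (u, v): (prefix_table(seqs[u], seqs[v]), suffix_table(seqs[u], seqs[v]))
        for u, v in combinations(range(len(seqs)), 2)
    }

    def upper_bound(cell):
        return sum(F[cell[u]][cell[v]] + B[cell[u]][cell[v]]
                   for (u, v), (F, B) in tables.items())

    return upper_bound
```

Precomputing the tables once means each cell costs only O(k^2) table lookups; the O(n^2) work per pair is paid up front rather than per cell.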
Time Saving Heuristics – Relevance Tests
How do we compute the optimal route?
  Recall the space-efficient algorithm for pairwise alignment
Can we go over all cells and determine whether they are relevant or not?
  No. Start with (0, …, 0) and add relevant entries to the list until reaching (n1, …, nk)
What is the new time complexity?
  For each potential cell we’ve added O(k^2 n^2) operations
  Depending on the quality of L, we’ve (hopefully) eliminated many cells
Tree Alignments – Structure
Input
  A set of k sequences S = {s1, s2, …, sk}
  The topology of a tree T whose leaves are the members of S
Algorithm
  Find an assignment of sequences to the interior nodes of the tree that optimizes the overall score
  For each edge e = (vi, vj) of T, its weight w(e) is the pairwise alignment score of the sequences at vi and vj
  The overall score is defined by
  $$\mathrm{score}(T) = \sum_{e \in T} w(e)$$
Tree Alignments – an Example
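Suppose that we’re given the following tree, with leaves CAT, GT, CTG, CG and interior nodes CT and CG:
  CAT - CT (weight 1)
  GT - CT (weight 1)
  CTG - CG (weight 1)
  CG - CG (weight 2)
  CT - CG, the edge between the two interior nodes (weight 1)
Given that score(x, x) = 1, score(x, y) = 0 and score(x, -) = -1, the overall score of the alignment is
score(T) = 2 + 3 + 1 = 6
(2 for the two edges into CT, 3 for the two leaf edges into CG, and 1 for the interior edge)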
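The total can be recomputed by scoring each edge with a pairwise alignment. The sketch assumes one topology consistent with the stated total: CAT and GT hang off an interior node labelled CT, CTG and CG hang off an interior node labelled CG, and the two interior nodes are joined by an edge. The node keys and function names are illustrative.

```python
def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score of a and b, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def tree_score(labels, edges):
    """score(T): the sum of pairwise alignment scores over the edges of T."""
    return sum(pair_score(labels[u], labels[v]) for u, v in edges)


labels = {"leaf1": "CAT", "leaf2": "GT", "leaf3": "CTG", "leaf4": "CG",
          "inner1": "CT", "inner2": "CG"}
edges = [("leaf1", "inner1"), ("leaf2", "inner1"), ("inner1", "inner2"),
         ("leaf3", "inner2"), ("leaf4", "inner2")]
print(tree_score(labels, edges))  # 6
```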
Tree Alignments – Notes
The MSA can be recovered from the alignments on the different edges
Overall score of the alignment is not SP
The tree alignment problem is NP-hard
  There exists an algorithm that finds an optimal alignment in time exponential in the number of sequences
Tree alignment algorithms are applicable only when a tree topology is known
Star Alignments – Structure
Choose a sequence s* that will serve as the center of the star
  How to choose: try all sequences, choose the one whose total distance from all the rest is the smallest, etc.
Add the other sequences by aligning them to s*
  Add gaps to already aligned sequences when necessary
  Never remove a gap (“Once a gap, always a gap”)
[Figure: a star whose nodes are the sequences s1, …, s6]
The Center Star Method
Publication: Gusfield, 1993
Assumption: The cost function δ is a distance function that satisfies:
  δ(x, y) = δ(y, x) ≥ 0
  δ(x, x) = 0
  δ(x, z) + δ(z, y) ≥ δ(x, y)
Algorithm
  Runs in polynomial time
  The score of the resulting alignment is less than twice the score of the optimal alignment
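Whether a given cost function meets these assumptions can be checked by brute force over a small alphabet. This is a sketch; `is_distance`, `unit_cost`, and the alphabet (four nucleotides plus the gap symbol) are assumptions for illustration.

```python
from itertools import product


def is_distance(alphabet, delta):
    """Check symmetry, non-negativity, zero self-cost and the triangle inequality."""
    for x, y, z in product(alphabet, repeat=3):
        if delta(x, y) != delta(y, x) or delta(x, y) < 0:
            return False
        if delta(x, x) != 0:
            return False
        if delta(x, z) + delta(z, y) < delta(x, y):
            return False
    return True


def unit_cost(x, y):
    """Cost 0 for identical symbols and 1 otherwise."""
    return 0 if x == y else 1


print(is_distance("ACGT-", unit_cost))  # True
```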
The Center Star Method – Definitions
Definitions
  M: the alignment produced by the algorithm
  M*: the best alignment, namely the one that gets the lowest score
  d(i, j), d*(i, j): the distance induced by M (respectively M*) on (si, sj)
  DP(si, sj): the minimum pairwise alignment score
  v(M): the score of alignment M:
  $$v(M) = \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d(i, j)$$
Note that it is always true that d(i, j) ≥ DP(si, sj)
The Center Star Method
Input
  A set of k sequences S = {s1, …, sk}
Algorithm
  Find the center
  $$s^* = \arg\min_{s \in S} \sum_{s_j \in S} DP(s, s_j)$$
  Suppose s* = s1
  for i = 2 to k do:
    Suppose s1, …, si-1 are already aligned as s’1, …, s’i-1
    Align si against s’1 by running the DP algorithm to produce the alignment (s”1, s’i)
    Adjust s’2, …, s’i-1 to s”1 by adding gaps to those columns where gaps were added to get s”1 from s’1
    Replace s’1 by s”1, add s’i
  end for
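The algorithm can be sketched as follows. Note the variation: instead of running the DP against the gapped center s’1, this sketch aligns each si against the ungapped center and then merges the gap columns, which keeps every gap already placed. Unit costs, the gap symbol `-`, and the names `align_pair` and `center_star` are assumptions for illustration.

```python
def align_pair(a, b, gap="-"):
    """Global alignment under unit costs; returns (cost, gapped a, gapped b)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def center_star(seqs, gap="-"):
    """Return (index of the center, aligned rows in input order)."""
    k = len(seqs)
    cost = [[align_pair(a, b)[0] for b in seqs] for a in seqs]
    c = min(range(k), key=lambda i: sum(cost[i]))  # the center s*
    center, others = seqs[c], {}
    for i in range(k):
        if i == c:
            continue
        _, ca, sa = align_pair(seqs[c], seqs[i])
        merged = {j: [] for j in others}
        new_center, new_row, p, q = [], [], 0, 0
        while p < len(center) or q < len(ca):
            if p < len(center) and center[p] == gap:
                # Gap already in the MSA: the new sequence gets a gap here.
                for j in others:
                    merged[j].append(others[j][p])
                new_center.append(gap); new_row.append(gap); p += 1
            elif q < len(ca) and ca[q] == gap:
                # Gap needed by the new pairwise alignment: pad the old rows.
                for j in others:
                    merged[j].append(gap)
                new_center.append(gap); new_row.append(sa[q]); q += 1
            else:
                # Same center letter in both: the columns coincide.
                for j in others:
                    merged[j].append(others[j][p])
                new_center.append(center[p]); new_row.append(sa[q]); p += 1; q += 1
        center = "".join(new_center)
        others = {j: "".join(row) for j, row in merged.items()}
        others[i] = "".join(new_row)
    return c, [center if j == c else others[j] for j in range(k)]
```

Because gaps are only ever added, the alignment induced between the center and each other sequence stays optimal, which is the property d(1, i) = DP(s1, si) used in the error analysis.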
The Center Star Method – Time Analysis
Choosing s*
  Running the DP algorithm $\binom{k}{2}$ times: O(k^2 n^2)
Adding s2, …, sk to the MSA
  In step i the length of s’* is at most i·n
  Aligning s’* with si takes O(i·n^2) time
  Performing k-1 such alignments takes O(k^2 n^2) time:
  $$\sum_{i=1}^{k-1} O(i n \cdot n) = O\Big(n^2 \sum_{i=1}^{k-1} i\Big) = O(k^2 n^2)$$
Overall time complexity: O(k^2 n^2)
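The Center Star Method – Error Analysis
Upper bound on the score of M:
$$\begin{aligned}
v(M) &= \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d(i, j) \\
&\le \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} \big[ d(1, i) + d(1, j) \big] && \text{(triangle inequality)} \\
&= 2(k-1) \sum_{i=2}^{k} d(1, i) \\
&= 2(k-1) \sum_{i=2}^{k} DP(s_1, s_i) && \text{(since } d(1, i) = DP(1, i)\text{)}
\end{aligned}$$
Lower bound on the score of M*:
$$\begin{aligned}
v(M^*) &= \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d^*(i, j) \\
&\ge \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} DP(s_i, s_j) \\
&\ge \sum_{i=1}^{k} \sum_{j=2}^{k} DP(s_1, s_j) && \text{(definition of } s_1\text{)} \\
&= k \sum_{j=2}^{k} DP(s_1, s_j)
\end{aligned}$$
Therefore:
$$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$$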
Progressive Alignments
Idea
  Successively align pairs of sequences using pairwise alignment algorithms
General structure
  Choose two sequences and align them using a pairwise alignment algorithm
  Choose another sequence and align it to the current alignment
  Repeat the previous stage as long as there are sequences left
Progressive Alignments
Differences between algorithms
  Choosing the next sequence
  Progression involves aligning sequences vs. alignments only, or also alignments vs. alignments
  Scoring methods
Progressive alignment algorithms
  Clustal W
  T-Coffee
CLUSTAL W
Publication: Thompson et al., 1994
The algorithm consists of three stages:
  Distance matrix construction, by pairwise alignment of each pair of sequences
  Guide tree construction from the distance matrix
  Progressive alignment of the sequences according to the branches in the guide tree
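The first two stages can be illustrated with a toy sketch. It is not Clustal W's actual procedure: distances here are plain edit distances, and the guide tree is built with average-linkage clustering as a simple stand-in, so both choices and all names below are assumptions.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, keeping one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Stage 1: pairwise distances between all sequences."""
    return [[edit_distance(a, b) for b in seqs] for a in seqs]


def guide_tree(seqs):
    """Stage 2 (stand-in): repeatedly join the two closest clusters."""
    D = distance_matrix(seqs)
    clusters = {i: (i, [i]) for i in range(len(seqs))}  # id -> (subtree, members)
    next_id = len(seqs)

    def average(m1, m2):
        return sum(D[a][b] for a in m1 for b in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        ids = sorted(clusters)
        x, y = min(((p, q) for n, p in enumerate(ids) for q in ids[n + 1:]),
                   key=lambda pq: average(clusters[pq[0]][1], clusters[pq[1]][1]))
        (t1, m1), (t2, m2) = clusters.pop(x), clusters.pop(y)
        clusters[next_id] = ((t1, t2), m1 + m2)
        next_id += 1
    return next(iter(clusters.values()))[0]


print(guide_tree(["AAAA", "AAAT", "GGGG", "GGGC"]))  # ((0, 1), (2, 3))
```

The nested tuples give the order in which a progressive aligner would merge sequences and sub-alignments: the closest pairs first, then the resulting groups.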
More on ClustalW – next week…