Post on 11-Feb-2016
Multiple Sequence Alignment (I)
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Oct. 4, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Outline
• Motivation
• Scoring of multiple sequence alignments
• Algorithms
  – Dynamic programming
  – Progressive alignment (next class)
Why Multiple Alignments?
• Characterize protein families: identify shared regions of homology in a multiple sequence alignment
• Determination of the consensus sequence of several aligned sequences.
• Help predict the secondary and tertiary structures of new sequences
• Help predict the function of new sequences
• Preliminary step in molecular evolution analysis using phylogenetic trees.
Example of Multiple Alignment
Multiple sequence alignment of 7 neuroglobins using ClustalX (slide from Craig A. Struble)
4 Basic Questions in Multiple Alignment
Input sequences:
$X_1 = x_{11}, \ldots, x_{1m_1}$
$X_2 = x_{21}, \ldots, x_{2m_2}$
…
$X_N = x_{N1}, \ldots, x_{Nm_N}$
Possible alignments of all $X_i$'s: $A = \{a_1, \ldots, a_k\}$
Model: scoring function $s: A \to \mathbb{R}$
Find the best alignment(s):
$$a^* = \arg\max_{a} \; s(a(X_1, X_2, \ldots, X_N))$$
Score of the best alignment, e.g., $S(a^*) = 21$
Q1: How should we define s?
Q2: How should we define A?
Q3: How can we find a* quickly?
Q4: Is the alignment biologically meaningful?
Defining Multi-Sequence Alignment
• We may generalize our definition of pairwise sequence alignment
• Alignment of 2 sequences is represented as a 2-row matrix
• In a similar way, we represent alignment of 3 sequences as a 3-row matrix
    A T _ G C G _
    A _ C G T _ A
    A T C A C _ A
• A column must have at least one nucleotide
• Question: How many possible global alignments are there for 3 sequences each of length 2?
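One way to answer the counting question is to note that every alignment column consumes one symbol from a non-empty subset of the sequences, so alignments correspond to lattice paths from the origin to the point given by the sequence lengths. The sketch below counts such paths recursively; the function name `count_alignments` is made up for this illustration and is not part of the lecture.

```python
from functools import lru_cache
from itertools import product


def count_alignments(lengths):
    """Count global alignments of sequences with the given lengths.

    Each column advances a non-empty subset of the sequences by one
    symbol, so we count lattice paths whose steps are non-zero 0/1 vectors.
    """
    steps = [s for s in product((0, 1), repeat=len(lengths)) if any(s)]

    @lru_cache(maxsize=None)
    def paths(point):
        if not any(point):
            return 1  # only the empty alignment reaches the origin
        total = 0
        for step in steps:
            prev = tuple(p - d for p, d in zip(point, step))
            if min(prev) >= 0:  # stay inside the grid
                total += paths(prev)
        return total

    return paths(tuple(lengths))


if __name__ == "__main__":
    print(count_alignments((2, 2)))     # two sequences of length 2
    print(count_alignments((2, 2, 2)))  # three sequences of length 2
```

Under this definition (all-gap columns excluded), two sequences of length 2 have 13 alignments and three sequences of length 2 have 409.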
How do we score a multiple alignment?
Scoring a Multiple Alignment
• Ideally, it should be based on evolutionary models
• In practice,
  – We often assume columns are independent:
    $$S(m) = G + \sum_i S(m_i)$$
    where $m_i$ is column $i$ of the alignment $m$ and $G$ is the gap score
  – Use “Sum of Pairs” (SP) scores:
    $$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
    where $m_i^k$ is the symbol in row $k$ of column $i$
    E.g., $s(-,-) = 0$, $s(a,-) = s(-,a) = -d$, $s(a,b) = \mathrm{BLOSUM}(a,b)$
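The sum-of-pairs score can be sketched in a few lines. This is only an illustration under stated assumptions: a simple match/mismatch score stands in for a BLOSUM matrix, the gap penalty `d = 2` is an arbitrary choice, and the separate gap term G is taken to be 0. The function names are hypothetical.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, d=2):
    """Pairwise score s(a, b); match/mismatch replaces BLOSUM (assumption)."""
    if a == GAP and b == GAP:
        return 0          # s(-,-) = 0
    if a == GAP or b == GAP:
        return -d         # s(a,-) = s(-,a) = -d
    return match if a == b else mismatch


def sp_column_score(column):
    """Sum of pair_score over all unordered pairs of rows in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows):
    """Sum of column scores, assuming columns are independent and G = 0."""
    return sum(sp_column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    print(sp_column_score("AAA"))                   # three matching pairs
    print(sp_alignment_score(["AT", "AT", "A-"]))   # second column has a gap
```

With these illustrative parameters, the column A, A, A scores 3 and the column T, T, - scores 1 - 2 - 2 = -3, so the two-column alignment above scores 0.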
Minimum Entropy Scoring
$$S(m_i) = -\sum_a p_{ia} \log p_{ia}, \qquad p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}}$$
where $c_{ia}$ is the count of symbol $a$ in column $i$
Intuition:
• A perfectly aligned column has one single symbol (least uncertainty)
• A poorly aligned column has many distinct symbols (high uncertainty)
This is related to the HMM formulation of the alignment problem, which we will cover later.
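Entropy: Example
Best case: the column A, A, A, A has entropy = 0
Worst case: the column C, G, T, A has entropy = $-\sum \frac{1}{4} \log_2 \frac{1}{4} = -4 \left( \frac{1}{4} \times (-2) \right) = 2$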
Entropy of an Alignment: Example
A A A
A C C
A C G
A C T
column entropy: -( pA log pA + pC log pC + pG log pG + pT log pT ), with log base 2 and 0*log(0) taken as 0
• Column 1 = -[1*log(1) + 0*log(0) + 0*log(0) + 0*log(0)] = 0
• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log(0) + 0*log(0)] = -[(1/4)*(-2) + (3/4)*(-0.415)] = +0.811
• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2
• Alignment Entropy = 0 + 0.811 + 2 = +2.811
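The entropy calculation for the four-row alignment A A A / A C C / A C G / A C T can be checked with a short script. This is a minimal sketch assuming base-2 logarithms and the convention that symbols absent from a column contribute nothing; the function names are chosen for this illustration.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Entropy -sum_a p_a * log2(p_a) over the symbols present in a column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Symbols with zero count never appear in `counts`, which matches
    # the convention 0 * log(0) = 0.
    return -sum((c / total) * log2(c / total) for c in counts.values())


def alignment_entropy(rows):
    """Sum of column entropies, treating columns as independent."""
    return sum(column_entropy(col) for col in zip(*rows))


if __name__ == "__main__":
    rows = ["AAA", "ACC", "ACG", "ACT"]
    for i, col in enumerate(zip(*rows), start=1):
        print(f"Column {i}: {column_entropy(col):.3f}")
    print(f"Alignment entropy: {alignment_entropy(rows):.3f}")
```

A lower total means the columns are more conserved, which is why this score is minimized rather than maximized.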
How can we find a multiple alignment quickly?
Can we generalize the dynamic programming algorithm used for pairwise alignment?
Alignments = Paths in…
• Align 3 sequences: ATGC, AATC, ATGC
Alignment Paths
• Each row gets a coordinate that counts the letters of that sequence consumed so far; a gap leaves the coordinate unchanged
x coordinate: 0 1 1 2 3 4
                A -- T G C
y coordinate: 0 1 2 3 3 4
                A A T -- C
z coordinate: 0 0 1 2 3 4
                -- A T G C
• Resulting path in (x,y,z) space:
(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
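The mapping from an alignment to its path can be written out directly. The sketch below assumes the rows are given in the order of the sequences ATGC, AATC, ATGC (that is, `A-TGC`, `AAT-C`, `-ATGC`) and that `-` marks a gap; `alignment_to_path` is a hypothetical helper name.

```python
def alignment_to_path(rows, gap="-"):
    """Convert alignment rows into the lattice path they trace.

    The coordinate for each row counts how many of that sequence's
    letters have been consumed; a gap leaves the coordinate unchanged.
    """
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for axis, symbol in enumerate(column):
            if symbol != gap:
                position[axis] += 1
        path.append(tuple(position))
    return path


if __name__ == "__main__":
    rows = ["A-TGC", "AAT-C", "-ATGC"]
    print(alignment_to_path(rows))
```

For these rows the function returns (0,0,0), (1,1,0), (1,2,1), (2,3,2), (3,3,3), (4,4,4), matching the path on the slide.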
2-D vs 3-D Alignment Grid
[Figure: the 2-D edit graph for sequences V and W, next to its 3-D counterpart]
Architecture of 3-D Alignment Grid
In 3-D, 7 edges in each unit cube
In 2-D, 3 edges in each unit square
A Cell of 3-D Alignment Grid
[Figure: a unit cube with vertices (i-1,j-1,k-1), (i,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i,j,k-1), (i,j-1,k), (i-1,j,k), and (i,j,k)]
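Multiple Alignment: Dynamic Programming
$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$
• δ(x, y, z) is an entry in the 3-D scoring matrix and can be computed using sum of pairs or entropy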
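The recurrence can be sketched as a direct triple loop over the 3-D grid. This is an illustration under stated assumptions, not the lecture's implementation: δ is computed as a sum-of-pairs score with an illustrative match/mismatch scheme (match 1, mismatch -1, gap penalty d = 2) in place of BLOSUM, and only the optimal score is returned, without traceback.

```python
from itertools import combinations, product

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, d=2):
    """Illustrative pairwise score; a real aligner would use e.g. BLOSUM."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -d
    return match if a == b else mismatch


def delta(x, y, z):
    """Sum-of-pairs score of one 3-symbol column."""
    return sum(pair_score(a, b) for a, b in combinations((x, y, z), 2))


def align3_score(v, w, u):
    """Optimal score of a global alignment of three sequences."""
    n, m, p = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 incoming edges of a cell: every non-zero 0/1 offset.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    # Lexicographic order guarantees predecessors are filled in first.
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # predecessor lies outside the grid
            x = v[i - 1] if di else GAP
            y = w[j - 1] if dj else GAP
            z = u[k - 1] if dk else GAP
            candidate = s[pi][pj][pk] + delta(x, y, z)
            if candidate > s[i][j][k]:
                s[i][j][k] = candidate
    return s[n][m][p]


if __name__ == "__main__":
    print(align3_score("ATGC", "AATC", "ATGC"))
```

Each cell examines at most 7 predecessors, which is where the factor of 7 in the running-time analysis comes from.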
Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is 7n^3; O(n^3)
• For k sequences, building a k-dimensional edit graph has run time (2^k - 1)(n^k); O(2^k n^k)
• Conclusion: the dynamic programming approach for aligning two sequences is easily extended to k sequences, but it is impractical because its running time is exponential in k
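To get a feel for how quickly this grows, the count (2^k - 1) n^k can be tabulated for a fixed sequence length. The snippet below is only a back-of-the-envelope sketch; the function name and the choice n = 100 are illustrative.

```python
def dp_edge_count(n, k):
    """Number of edge evaluations, (2^k - 1) * n^k, for k sequences of length n."""
    return (2**k - 1) * n**k


if __name__ == "__main__":
    n = 100
    for k in range(2, 7):
        print(f"k = {k}: {dp_edge_count(n, k):,} edge evaluations")
```

For n = 100, going from k = 3 to k = 6 raises the count from 7,000,000 to 63,000,000,000,000.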
In the next class, we will cover more efficient algorithms: progressive alignment.
What You Should Know
• How to score a multi-sequence alignment
• How the dynamic programming algorithm works
• Computational complexity of dynamic programming algorithms