1
Advanced Topics in Bioinformatics, Weizmann Institute of Science, spring 2003
Lecture 2, 12/3/2003:
• Introduction to sequence alignment
• The Needleman-Wunsch algorithm for global sequence alignment: description and properties
• Local alignment: the Smith-Waterman algorithm
2
Computational sequence analysis
The major goal of computational sequence analysis is to predict the function and structure of genes and proteins from their sequence.
This is made possible since organisms evolve by mutation, duplication and selection of their genes.
Thus, sequence similarity often indicates functional and structural similarity.
3
Sequence alignment
5’ ATCAGAGTC 3’
5’ TTCAGTC 3’
ATC ≠ CTA
AG ≠ GA
etc.
4
Sequence alignment
ATCAGAGTC
TTCAGTC
We wish to identify which regions of the two sequences are most similar to each other. The sequences are shifted relative to one another and gaps are introduced, so as to cover all possible alignments. The shifts and gaps provide the steps by which one sequence can be converted into the other.
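[Figure: the two sequences shown at several relative shifts, with matching positions marked "+"; in the last panel a gap is introduced and the gapped positions are marked "^":]
ATCAGAGTC
TTCA--GTC
 +++^^+++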
5
Sequence alignment: dot-plot
  A T C A G A G T C
T . • . . . . . • .
T . • . . . . . • .
C . . • . . . . . •
A • . . • . • . . .
G . . . . • . • . .
T . • . . . . . • .
C . . • . . . . . •
(• marks a pair of identical residues)
ATCAGAGTC
TTCA--GTC
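A dot-plot of this kind can be produced by marking every pair of identical residues. The following is a minimal Python sketch; the function name `dot_plot` and the "." filler for non-matching cells are illustrative choices, not part of the lecture.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_b.

    Each row starts with the seq_b residue, followed by one cell per
    residue of seq_a: '•' where the two residues are identical, '.' otherwise.
    """
    rows = []
    for b in seq_b:
        cells = ["•" if a == b else "." for a in seq_a]
        rows.append(b + " " + " ".join(cells))
    return rows


def print_dot_plot(seq_a, seq_b):
    """Print the dot-plot with seq_a along the top and seq_b down the side."""
    print("  " + " ".join(seq_a))
    for row in dot_plot(seq_a, seq_b):
        print(row)
```

Calling `print_dot_plot("ATCAGAGTC", "TTCAGTC")` prints the grid for the two sequences used in class; diagonal runs of dots correspond to ungapped matching segments.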
6
Sequence alignment: scoring
Substitution matrix - the similarity value between each pair of residues
    A  C  G  T
A   2  0  0  0
C   0  2  0  0
G   0  0  2  0
T   0  0  0  2
Gap penalty - the cost of introducing gaps
Gap penalty: -2
ATCAGAGTC
TTCA--GTC
•+++^^+++ : 0+2+2+2-2-2+2+2+2 = 8
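The score of a given alignment is simply the sum of its column scores. A minimal Python sketch follows, assuming the two-valued substitution matrix of this slide (one score for a match, one for a mismatch) and a fixed penalty per gapped residue; the function name and parameter names are illustrative.

```python
def score_alignment(top, bottom, match=2, mismatch=0, gap=-2):
    """Sum the column scores of an alignment given as two equal-length strings.

    '-' denotes a gapped position; every gapped residue costs `gap`.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # residue aligned with a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total
```

For the alignment on this slide, `score_alignment("ATCAGAGTC", "TTCA--GTC")` gives 8.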
7
A T C A G A G T C
T 0 2 0 0 0 0 0 2 0
T 0 2 0 0 0 0 0 2 0
C 0 0 2 0 0 0 0 0 2
A 2 0 0 2 0 2 0 0 0
G 0 0 0 0 2 0 2 0 0
T 0 2 0 0 0 0 0 2 0
C 0 0 2 0 0 0 0 0 2
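Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
Initialization
Position 3,2 (ATC aligned with TT) can be reached in three ways:
from [T2T1], adding [ab]:
ATC
-TT
from [C3T1], adding [-b]:
ATC-
--TT
from [T2T2], adding [a-]:
ATC
TT-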
8
A T C A G A G T C
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
T -2 0 2 0 0 0 0 0 2 0
T -4 0 2 0 0 0 0 0 2 0
C -6 0 0 2 0 0 0 0 0 2
A -8 2 0 0 2 0 2 0 0 0
G -10 0 0 0 0 2 0 2 0 0
T -12 0 2 0 0 0 0 0 2 0
C -14 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
Initialization
Directionality of score calculation: [ab], [a-], [-b]
9
A T C A G A G T C
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
T -2 0 0 -2 -4 -6 -8 -10 -12 -14
T -4 -2 2 0 -2 -4 -6 -8 -8 -10
C -6 0 0 2 0 0 0 0 0 2
A -8 2 0 0 2 0 2 0 0 0
G -10 0 0 0 0 2 0 2 0 0
T -12 0 2 0 0 0 0 0 2 0
C -14 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
10
A T C A G A G T C
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
T -2 0 0 -2 -4 -6 -8 -10 -12 -14
T -4 -2 2 0 -2 -4 -6 -8 -8 -10
C -6 -4 0 2 0 0 0 0 0 2
A -8 2 0 0 2 0 2 0 0 0
G -10 0 0 0 0 2 0 2 0 0
T -12 0 2 0 0 0 0 0 2 0
C -14 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
11
A T C A G A G T C
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
T -2 0 0 -2 -4 -6 -8 -10 -12 -14
T -4 -2 2 0 -2 -4 -6 -8 -8 -10
C -6 -4 0 2 0 0 0 0 0 2
A -8 2 0 0 2 0 2 0 0 0
G -10 0 0 0 0 2 0 2 0 0
T -12 0 2 0 0 0 0 0 2 0
C -14 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
12
A T C A G A G T C
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
T -2 0 0 -2 -4 -6 -8 -10 -12 -14
T -4 -2 2 0 -2 -4 -6 -8 -8 -10
C -6 -4 0 4 0 0 0 0 0 2
A -8 2 0 0 2 0 2 0 0 0
G -10 0 0 0 0 2 0 2 0 0
T -12 0 2 0 0 0 0 0 2 0
C -14 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
13
Sequence alignment: Needleman-Wunsch algorithm
σ[ab] : score of aligning a pair of residues a and b
σ[a-] : score of aligning residue a with a gap (gap penalty: -q)
S : score matrix
S(i,j) : optimal score of aligning the residues at positions 1 to i of one sequence with the residues at positions 1 to j of the other sequence
14
Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ⇐ 0
for j ⇐ 1 to N do
    S(0,j) ⇐ S(0,j-1) + σ[-bj]
for i ⇐ 1 to M do
{   S(i,0) ⇐ S(i-1,0) + σ[ai-]
    for j ⇐ 1 to N do
        S(i,j) ⇐ max (S(i-1, j-1) + σ[aibj],
                      S(i-1, j) + σ[ai-],
                      S(i, j-1) + σ[-bj])
}
Pearson & Miller, Meth Enz 210:575, ‘92
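The pseudocode translates almost line by line into Python. The sketch below assumes a two-valued substitution matrix (match/mismatch) and a fixed penalty per gapped residue, as in the class example; the function name `needleman_wunsch` and its parameter names are illustrative.

```python
def needleman_wunsch(a, b, match=2, mismatch=0, gap=-2):
    """Fill the global-alignment score matrix S for sequences a and b.

    S[i][j] is the optimal score of aligning a[:i] with b[:j].
    Rows correspond to residues of a, columns to residues of b.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # First row: b[:j] aligned entirely against gaps.
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        # First column: a[:i] aligned entirely against gaps.
        S[i][0] = S[i - 1][0] + gap
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # [ab]
                          S[i - 1][j] + gap,       # [a-]
                          S[i][j - 1] + gap)       # [-b]
    return S
```

With `S = needleman_wunsch("TTCAGTC", "ATCAGAGTC")`, the bottom-right cell `S[7][9]` holds the optimal global score, 8 for the class example.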
15
Sequence alignment: Needleman-Wunsch global alignment
The algorithm finds the optimal score(s); more steps are needed to find the corresponding alignment(s). This is a time-saving property in database searches and other applications.
Only a single pass through the alignment matrix is needed to compute the score.
16
A T C A G A G T C
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
T -2 0 0 -2 -4 -6 -8 -10 -12 -14
T -4 -2 2 0 -2 -4 -6 -8 -8 -10
C -6 -4 0 4 2 0 -2 -4 -6 -6
A -8 -4 -2 2 6 4 2 0 -2 -4
G -10 -6 -4 0 4 8 6 4 2 0
T -12 -8 -4 -2 2 6 8 6 6 4
C -14 -10 -6 -2 0 4 6 8 6 8
Needleman-Wunsch global alignment: the traceback
Substitution matrix: 2 for a match, 0 for a mismatch. Gap penalty: -2.
ATCAGAGTC
 ||--||||
TTC--AGTC
Score: 2 x 6 - 2 x 2 = 8
ATCAGAGTC
 ||||--||
TTCAG--TC
Score: 2 x 6 - 2 x 2 = 8
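The traceback can be sketched in Python as a walk from the bottom-right cell back to the origin, at each step choosing a neighbouring cell that could have produced the current score. This sketch returns only one of the co-optimal alignments (it tries the diagonal step first, which is an arbitrary choice); the function name `global_align` and the parameter names are illustrative.

```python
def global_align(a, b, match=2, mismatch=0, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)

    def sub(x, y):
        return match if x == y else mismatch

    # Fill the score matrix (same recurrence as the Needleman-Wunsch pseudocode).
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back from S[m][n] to S[0][0], collecting columns in reverse.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])   # [ab]
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")        # [a-]
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])        # [-b]
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))
```

To enumerate all optimal alignments rather than one, every neighbour that satisfies its equality test would have to be followed, not just the first.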
17
Sequence alignment: Needleman-Wunsch global alignment
The algorithm calculates the score(s) of optimal global sequence alignments, penalizes end gaps, and penalizes each residue in a gap equally.
(Alignments are written as top sequence / bottom sequence.)
ATCAGAGTC / --TTCAGTC has a lower score than CAGAGTC / TTCAGTC
ATCACAGTC / T-C--AGTC has the same score as ATCACAGTC / T---CAGTC
ATCACAGTC / T---CAGTC has a lower score than ACACAGTC / T--CAGTC
18
Sequence alignment: Needleman-Wunsch global alignment
In order to score a gap with a penalty q that is independent of the gap length, i.e. so that
ACACAGTC / T--CAGTC
ATCACAGTC / T---CAGTC
AGCTTTCACAGTC / T-------CAGTC
all have the same score, the algorithm we presented is modified to extend alignments in more than the three ways we considered.
19
A T C A G A G T C
T 0 2 0 0 0 0 0 2 0
T 0 2 0 0 0 0 0 2 0
C 0 0 2 0 0 0 0 0 2
A 2 0 0 2 0 2 0 0 0
G 0 0 0 0 2 0 2 0 0
T 0 2 0 0 0 0 0 2 0
C 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
[Figure labels: extension steps [ab], [a-] and [-b]]
20
Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ⇐ 0
for j ⇐ 1 to N do
    S(0,j) ⇐ -q
for i ⇐ 1 to M do
{   S(i,0) ⇐ -q
    for j ⇐ 1 to N do
        S(i,j) ⇐ max (S(i-1, j-1) + σ[aibj],
                      max {S(0, j)...S(i-1, j)} - q,
                      max {S(i, 0)...S(i, j-1)} - q)
}
Pearson & Miller, Meth Enz 210:575, ‘92
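A direct Python transcription of this length-independent gap variant is sketched below. It assumes a two-valued substitution matrix and takes q as a positive number; the function name `nw_constant_gap` is illustrative. The maxima over the column and the row are recomputed for every cell, which keeps the code close to the pseudocode at the cost of extra running time.

```python
def nw_constant_gap(a, b, match=2, mismatch=0, q=2):
    """Optimal global score when every gap costs q, whatever its length."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # A leading gap of any length costs a single penalty.
    for j in range(1, n + 1):
        S[0][j] = -q
    for i in range(1, m + 1):
        S[i][0] = -q
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best_in_column = max(S[k][j] for k in range(i))   # S(0,j)...S(i-1,j)
            best_in_row = max(S[i][k] for k in range(j))      # S(i,0)...S(i,j-1)
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          best_in_column - q,
                          best_in_row - q)
    return S[m][n]
```

For example, `nw_constant_gap("ACGT", "AT")` gives 2 (A aligned with A, one gap spanning CG, T aligned with T), whereas penalizing each gapped residue separately would give 0 for the same alignment.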
21
Sequence alignment: Needleman-Wunsch global alignment, caveats
Every algorithm is limited by the model it is built upon.
For example, the NW dynamic programming algorithm guarantees us optimal global alignments with the parameters we supply (substitution matrix, gap penalty and gap scoring model).
However:
• Different parameters can give different alignments.
• The correct alignment might not be the optimal one.
• The correct alignment might correspond to only part of the global alignment.
22
More details, sources and things to do for next class
Source: Pearson WR & Miller W, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology, 210:575-601 (1992).
Assignment: Calculate NW alignments with a constant gap penalty, observing the effect of different gap penalties and match/mismatch scores. In all cases use substitution matrices that have only two types of scores: a value for an exact match and a lower value for mismatches. Try the nucleotide sequences used in class and the following amino acid sequences: “ACDGSMF” & “AMDFR”.
23
Local sequence alignments
Local sequence alignments are necessary for cases of:
• Modular organization of genes and proteins (exons, domains, etc.)
• Repeats
• Sequences that diverged so that similarity was retained, or can be detected, only in some sub-regions
24
Modular organization of genes
[Figure labels: gene A, gene B, gene C; gene W, gene X, gene Y, gene Z]
25
Modular protein organization (adapted from Henikoff et al, Science 278:609, ‘97)
[Figure: domain diagrams of the TLK and TEK receptor tyrosine-kinases; labelled domains: IG, Kringle, Protein-kinase, EGF, FN3]
26
Modular protein organization
1KAP secreted calcium-binding alkaline-protease
Calcium-binding repeats
Protease domain
27
Local sequence alignment
28
Local sequence alignment
For local sequence alignment we wish to find which regions (sub-sequences) in the compared pair of sequences give the best alignment scores with the parameters we supply (substitution matrix, gap penalty and gap scoring model).
The aligned regions may be anywhere along the sequences. More than one region might be aligned with a score above the threshold.
29
Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ⇐ 0
for j ⇐ 1 to N do
    S(0,j) ⇐ -q
for i ⇐ 1 to M do
{   S(i,0) ⇐ -q
    for j ⇐ 1 to N do
        S(i,j) ⇐ max (S(i-1, j-1) + σ[aibj],
                      max {S(0, j)...S(i-1, j)} - q,
                      max {S(i, 0)...S(i, j-1)} - q)
}
Extension steps: [ab], [a-], [-b]
30
Local sequence alignment: Smith-Waterman algorithm
σ[ab] : score of aligning a pair of residues a and b
-q : gap penalty
S’(i,j) : optimal score of an alignment ending at residues i,j
best : highest score in the score matrix (S’)
31
Local sequence alignment: Smith-Waterman algorithm
best ⇐ 0
for j ⇐ 1 to N do
    S’(0,j) ⇐ 0
for i ⇐ 1 to M do
{   S’(i,0) ⇐ 0
    for j ⇐ 1 to N do
    {   S’(i,j) ⇐ max (S’(i-1, j-1) + σ[aibj],
                       max {S’(0, j)...S’(i-1, j)} - q,
                       max {S’(i, 0)...S’(i, j-1)} - q,
                       0)
        best ⇐ max (S’(i, j), best)
    }
}
Pearson & Miller, Meth Enz 210:575, ‘92
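A Python sketch of this pseudocode follows. It assumes the unitary substitution matrix (1 for a match, -1 for a mismatch) and a length-independent gap penalty q given as a positive number; the function name `smith_waterman` and the choice of which sequence indexes the rows are illustrative.

```python
def smith_waterman(a, b, match=1, mismatch=-1, q=2):
    """Fill the local-alignment matrix S' and return (S', best score).

    S'[i][j] is the best score of a local alignment ending at a[i-1], b[j-1];
    it never drops below 0, so an alignment can start anywhere.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best_in_column = max(S[k][j] for k in range(i))   # S'(0,j)...S'(i-1,j)
            best_in_row = max(S[i][k] for k in range(j))      # S'(i,0)...S'(i,j-1)
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          best_in_column - q,
                          best_in_row - q,
                          0)
            best = max(best, S[i][j])
    return S, best
```

With `S, best = smith_waterman("GTCAGTCA", "ATCAGAGTC")` the best score is 4, and the cells holding that score mark where optimal local alignments end.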
32
A T C A G A G T C
0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 1 0 0
T 0 0 1 0 0 0 0 0 2 0
C 0 0 0 2 0 0 0 0 0 3
A 0 1 0 0 3 1 1 1 1 1
G 0 0 0 0 1 4 2 2 2 2
T 0 0 1 0 1 2 3 1 3 1
C 0 0 0 2 1 2 1 2 1 4
A 0 1 0 0 3 2 3 1 1 2
Local sequence alignment: Smith-Waterman algorithm. Finding the optimal alignment
Substitution matrix:
    A  C  G  T
A   1 -1 -1 -1
C  -1  1 -1 -1
G  -1 -1  1 -1
T  -1 -1 -1  1
Gap penalty: -2
The optimal local alignment is:
TCAGAGTC
TCAG--TC
++++^^++ : 1+1+1+1-2+1+1 = 4
33
A T C A G A G T C
0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 1 0 0
T 0 0 1 0 0 0 0 0 2 0
C 0 0 0 2 0 0 0 0 0 3
A 0 1 0 0 3 1 1 1 1 1
G 0 0 0 0 1 4 2 2 2 2
T 0 0 1 0 1 2 3 1 3 1
C 0 0 0 2 1 2 1 2 1 4
A 0 1 0 0 3 2 3 1 1 2
Local sequence alignment: Smith-Waterman algorithm. Finding the optimal alignment
Substitution matrix: 1 for a match, -1 for a mismatch. Gap penalty: -2. Score threshold: 3.
34
A T C A G A G T C
0 0 0 0 0 0 0 0 0 0
G 0 -1 -1 -1 -1 1 -1 1 -1 -1
T 0 -1 0 -1 -1 -1 -1 -1 1 -1
C 0 -1 -1 0 -1 -1 -1 -1 -1 1
A 0 1 -1 -1 0 -1 1 -1 -1 -1
G 0 -1 -1 -1 -1 0 -1 1 -1 -1
T 0 -1 1 -1 -1 -1 -1 -1 0 -1
C 0 -1 -1 1 -1 -1 -1 -1 -1 0
A 0 1 -1 -1 1 -1 1 -1 -1 -1
Local sequence alignment: Smith-Waterman algorithm. Finding the optimal alignment
Substitution matrix: 1 for a match, -1 for a mismatch. Gap penalty: -2.
Remove the scores of the current optimal alignment and then recalculate the matrix to find the next best alignment(s).
ATCAGAGTC
GTCAG--TCA
35
A T C A G A G T C
0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 1 0 0
T 0 0 0 0 0 0 0 0 2 0
C 0 0 0 0 0 0 0 0 0 3
A 0 1 0 0 0 0 1 0 0 0
G 0 0 0 0 0 0 0 2 0 0
T 0 0 1 0 0 0 0 0 0 0
C 0 0 0 2 0 0 0 0 0 0
A 0 1 0 0 3 1 1 1 1 1
Local sequence alignment: Smith-Waterman algorithm. Finding the sub-optimal alignment
Substitution matrix: 1 for a match, -1 for a mismatch. Gap penalty: -2. Score threshold: 3.
TCA
TCA
+++ : 1+1+1 = 3
36
Local sequence alignment: Smith-Waterman algorithm
In order for the algorithm to identify local alignments, the score for aligning unrelated sequence segments should typically be negative. Otherwise, true optimal local alignments will be extended beyond their correct ends, or will have lower scores than longer alignments between unrelated regions.
Alignment scores are determined by the substitution matrix, the gap penalties and the gap scoring model.
37
Alignment scoring schemes: gap models
Gap scoring by a constant relation to the gap length (g is the number of gapped residues): σ ⇐ -q g
ATCACA
T---CA    σ ⇐ -3q
Gap scoring by a constant penalty, independent of the gap length: σ ⇐ -q
ATCACA
T---CA    σ ⇐ -q
Affine gap scoring (gap opening penalty [d] and gap extension penalty [e]): σ ⇐ -(d + e (g-1))
ATCACA
T---CA    σ ⇐ -(d + 2e)
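The three gap models differ only in how the score of a single gap depends on its length g. A minimal Python sketch is given below; the function name, the model labels and the default values of q, d and e are illustrative assumptions, not values from the lecture.

```python
def gap_score(g, model, q=2, d=5, e=1):
    """Score of one gap spanning g residues under the given gap model.

    model: 'linear'   -> -q * g             (each gapped residue penalized)
           'constant' -> -q                 (independent of gap length)
           'affine'   -> -(d + e * (g - 1)) (opening d, extension e)
    """
    if g < 1:
        raise ValueError("a gap spans at least one residue")
    if model == "linear":
        return -q * g
    if model == "constant":
        return -q
    if model == "affine":
        return -(d + e * (g - 1))
    raise ValueError("unknown gap model: " + model)
```

For the three-residue gap in ATCACA / T---CA, these hypothetical defaults give -6 (linear), -2 (constant) and -7 (affine).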
38
Local sequence alignment: Smith-Waterman algorithm
If alignment scores of unrelated sequences are mainly or solely determined by the substitution scores, then such alignments will have negative scores if the expected substitution score is negative:
$\sum_{i,j} p_i \, p_j \, s_{ij} < 0$
i & j - residues
$p_i$ - frequency of residue i
$s_{ij}$ - score of aligning residues i and j
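The condition can be checked numerically for any substitution matrix and residue frequencies. The Python sketch below assumes uniform nucleotide frequencies (0.25 each) purely for illustration; the function name and the way the matrix is passed (as a scoring function) are also illustrative.

```python
def expected_substitution_score(freqs, score):
    """Sum of p_i * p_j * s_ij over all ordered residue pairs (i, j)."""
    return sum(freqs[i] * freqs[j] * score(i, j)
               for i in freqs for j in freqs)


# Assumed for illustration: all four nucleotides equally frequent.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Unitary matrix: 1 for a match, -1 for a mismatch.
unitary = expected_substitution_score(uniform, lambda i, j: 1 if i == j else -1)

# Matrix used in the global-alignment example: 2 for a match, 0 for a mismatch.
two_zero = expected_substitution_score(uniform, lambda i, j: 2 if i == j else 0)
```

Under this assumption the unitary matrix gives -0.5, which satisfies the condition, while the 2/0 matrix gives +0.5 and so would not be suitable for local alignment.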
39
Local sequence alignment: Smith-Waterman algorithm
We can easily identify substitution matrices that will not give positive scores to random alignments. However, we have no analytical way of finding which gap scores will satisfy the requirement that random alignment scores be less than or equal to zero, and so produce local sequence alignments.
Nevertheless, certain sets of scoring schemes (substitution matrix and gap scores) were found to give satisfactory local alignments.
40
More details, sources and things to do for next lecture
Sources:
Pearson & Miller, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology, 210:575-601 (1992).
Altschul, "Amino acid substitution matrices from an information theoretic perspective." J Mol Biol 219:555-565 (1991).
Henikoff, "Scores for sequence searches and alignments." Curr Opin Struct Biol 6:353-360 (1996).
Assignment: Read the source articles for this lecture. They have more details on the material we covered and introduce topics for the next lectures. Calculate S’ for the sequences presented in class, using the unitary matrix (1 for match, -1 for mismatch) and the constant gap penalty model with gap penalties of -1, -2 or -4 (q = 1, 2 or 4).
41
More details, sources and things to do for next lecture
For those who are not acquainted with information theory or want to be certain they know the basics of it:
An information theory primer for molecular biologists: http://www.lecb.ncifcrf.gov/~toms/paper/primer
42
Next lecture, 12/12/2001:
• Substitution Matrices: amino-acids features and empirical matrices
• BLAST and FASTA: algorithms and statistics; assumptions and associated artifacts