Post on 20-Dec-2015
EE3J2 Data MiningSlide 1
EE3J2 Data Mining
Lecture 12: Sequence Analysis (2)
Martin Russell
EE3J2 Data MiningSlide 2
Objectives
Revise dynamic programming
Examples
EE3J2 Data MiningSlide 3
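Alignment path
[Figure: alignment grid with the sequence A C X C C D along the horizontal axis and the sequence A B C D along the vertical axis; d(C,X) labels the local distance at the point where C is aligned with X.]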
EE3J2 Data MiningSlide 4
Accumulated Distance
The accumulated distance along the path p is the sum of distances along its length
Large accumulated distance = poor matches between symbols = poor path
Small accumulated distance = good matches between symbols = good path
The path with the smallest accumulated distance is called the optimal path
Computed using Dynamic Programming
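As a small illustration of the definition above, the following Python sketch sums the local distances along a given path. The function name `path_distance`, the representation of a path as a list of (m, n) index pairs, and the 0/1 local distance (0 for matching symbols, 1 otherwise) are assumptions made for this example, not something stated on the slide.

```python
def path_distance(seq1, seq2, path):
    """Sum the local distances d(seq1[m], seq2[n]) over the points (m, n) of a path.

    Assumes a 0/1 local distance: 0 if the two symbols match, 1 otherwise.
    """
    return sum(0 if seq1[m] == seq2[n] else 1 for m, n in path)


# Two candidate paths aligning AABCD (index m) with ABCCD (index n).
diagonal = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
bent = [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3), (4, 4)]
print(path_distance("AABCD", "ABCCD", diagonal))  # two mismatched pairs on this path
print(path_distance("AABCD", "ABCCD", bent))      # every pair on this path matches
```

The second path has the smaller accumulated distance, so it is the better of the two.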
EE3J2 Data MiningSlide 5
Dynamic Programming
[Figure: alignment grid with A C X C C D along the horizontal axis and A B C D along the vertical axis, annotated:]
Accumulated distance to this point…
…is minimum of accumulated distances to possible previous points
Plus local, incremental cost
EE3J2 Data MiningSlide 6
Formally…
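\[
ad(m,n) = \min
\begin{cases}
ad(m-1,n) + K_{DEL} + d(m,n) \\
ad(m-1,n-1) + d(m,n) \\
ad(m,n-1) + K_{INS} + d(m,n)
\end{cases}
\]
ad(m,n): accumulated distance up to the point (m,n)
K_DEL: deletion penalty
K_INS: insertion penalty
d(m,n): ‘local’ distance between mth symbol in sequence 1 and nth symbol in sequence 2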
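The recurrence can be sketched in Python as follows. This is a minimal illustration, not the course program: the names `local_distance` and `accumulated_distance` are hypothetical, indices are 0-based, the local distance is assumed to be 0 for a match and 1 for a mismatch, and the path is assumed to start at the top-left point.

```python
def local_distance(a, b):
    """Assumed local distance: 0 for matching symbols, 1 otherwise."""
    return 0 if a == b else 1


def accumulated_distance(seq1, seq2, k_del=0, k_ins=0):
    """Forward accumulated distance matrix; seq1 indexes rows (m), seq2 columns (n)."""
    rows, cols = len(seq1), len(seq2)
    ad = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            candidates = []
            if m > 0:
                candidates.append(ad[m - 1][n] + k_del)   # vertical step
            if m > 0 and n > 0:
                candidates.append(ad[m - 1][n - 1])       # diagonal step
            if n > 0:
                candidates.append(ad[m][n - 1] + k_ins)   # horizontal step
            best_previous = min(candidates) if candidates else 0
            ad[m][n] = best_previous + local_distance(seq1[m], seq2[n])
    return ad


for row in accumulated_distance("AABCD", "ABCCD", k_del=2, k_ins=2):
    print(row)
```

The bottom-right entry of the returned matrix is the accumulated distance of the optimal path through the whole grid.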
EE3J2 Data MiningSlide 7
Example application: sequence retrieval
……
AAGDTDTDTDD
AABBCBDAAAAAAA
BABABABBCCDF
GGGGDDGDGDGDGDTDTD
DGDGDGDGD
AABCDTAABCDTAABCDTAAB
CDCDCDTGGG
GGAACDTGGGGGAAA
…….
…….
Corpus of sequential data
‘query’ sequence Q
…BBCCDDDGDGDGDCDTCDTTDCCC…
Dynamic programming distance calculation: calculate ad(S,Q) for each sequence S in the corpus, then retrieve the sequence closest to the query:
\[
\hat{S} = \arg\min_{S} \, ad(S,Q)
\]
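A minimal Python sketch of this retrieval step is given below. The function names, the 0/1 local distance, the default penalties of 1, and the tiny corpus are all assumptions made for illustration; sequences are assumed to be non-empty.

```python
def accumulated_distance(s, q, k_del=1, k_ins=1):
    """ad(S, Q): accumulated distance of the optimal path aligning s with q."""
    prev_row = None
    for a in s:
        row = []
        for n, b in enumerate(q):
            d = 0 if a == b else 1                        # assumed 0/1 local distance
            options = []
            if prev_row is not None:
                options.append(prev_row[n] + k_del)       # vertical step
                if n > 0:
                    options.append(prev_row[n - 1])       # diagonal step
            if n > 0:
                options.append(row[n - 1] + k_ins)        # horizontal step
            row.append(d + (min(options) if options else 0))
        prev_row = row
    return prev_row[-1]


def retrieve(corpus, query, k_del=1, k_ins=1):
    """Return the corpus sequence with the smallest accumulated distance to the query."""
    return min(corpus, key=lambda s: accumulated_distance(s, query, k_del, k_ins))


corpus = ["AABCD", "XYZ", "ABCCD"]   # hypothetical corpus
print(retrieve(corpus, "ABCCD"))
```

Non-zero penalties are used here because with both penalties set to 0 quite different sequences can tie at distance 0.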
EE3J2 Data MiningSlide 8
Example: Edit Distance
S1 = AABCD, K_DEL = 0
S2 = ABCCD, K_INS = 0
Distance matrix (rows: S1, columns: S2)
    A  B  C  C  D
A   0  1  1  1  1
A   0  1  1  1  1
B   1  0  1  1  1
C   1  1  0  0  1
D   1  1  1  1  0
Accumulated distance matrix
    A  B  C  C  D
A   0  1  2  3  4
A   0  1  2  3  4
B   1  0  1  2  3
C   2  1  0  0  1
D   3  2  1  1  0
Forward path matrix
    A  B  C  C  D
A   \  _  _  _  _
A   |  _  _  _  _
B   |  \  _  _  _
C   |  |  \  _  _
D   |  |  |  |  \
Optimal alignment
AABCCD
AABCCD
EE3J2 Data MiningSlide 9
Example 2: Edit Distance
S1 = AABCD, K_DEL = 2
S2 = ABCCD, K_INS = 2
Distance matrix (rows: S1, columns: S2)
    A  B  C  C  D
A   0  1  1  1  1
A   0  1  1  1  1
B   1  0  1  1  1
C   1  1  0  0  1
D   1  1  1  1  0
Accumulated distance matrix
    A   B   C   C   D
A   0   3   6   9  12
A   2   1   4   7  10
B   5   2   2   5   8
C   8   5   2   2   5
D  11   8   5   3   2
Forward path matrix
    A  B  C  C  D
A   \  _  _  _  _
A   |  \  _  _  _
B   |  \  \  _  _
C   |  |  \  \  _
D   |  |  |  \  \
Optimal alignment
AABCD
ABCCD
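The path matrix and trace-back used in these examples can be sketched in Python as below. This is an illustrative sketch rather than the course program: the function names are hypothetical, the local distance is assumed to be 0/1, and ties are broken in the order vertical, horizontal, diagonal, an assumption chosen only because it reproduces the path matrices shown in the two examples.

```python
def forward_dp(seq1, seq2, k_del=0, k_ins=0):
    """Return (accumulated distance matrix, forward path matrix)."""
    rows, cols = len(seq1), len(seq2)
    ad = [[0] * cols for _ in range(rows)]
    path = [["\\"] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            d = 0 if seq1[m] == seq2[n] else 1
            options = []  # (cost, symbol), listed in the assumed tie-break order
            if m > 0:
                options.append((ad[m - 1][n] + k_del, "|"))   # from the point above
            if n > 0:
                options.append((ad[m][n - 1] + k_ins, "_"))   # from the point to the left
            if m > 0 and n > 0:
                options.append((ad[m - 1][n - 1], "\\"))      # from the diagonal point
            cost, symbol = min(options, key=lambda o: o[0]) if options else (0, "\\")
            ad[m][n] = cost + d
            path[m][n] = symbol
    return ad, path


def traceback(path):
    """Follow the path matrix from the bottom-right point back to the top-left point."""
    m, n = len(path) - 1, len(path[0]) - 1
    points = [(m, n)]
    while (m, n) != (0, 0):
        symbol = path[m][n]
        if symbol in "|\\":
            m -= 1
        if symbol in "_\\":
            n -= 1
        points.append((m, n))
    return points[::-1]


def alignment(seq1, seq2, points):
    """Spell out the symbols of each sequence visited along the path."""
    return ("".join(seq1[m] for m, _ in points),
            "".join(seq2[n] for _, n in points))


ad, path = forward_dp("AABCD", "ABCCD", k_del=0, k_ins=0)
for row in path:
    print(" ".join(row))
print(*alignment("AABCD", "ABCCD", traceback(path)), sep="\n")
```

With both penalties set to 0 this prints the alignment AABCCD over AABCCD: the first A of S2 and the C of S1 are each visited twice along the optimal path.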
EE3J2 Data MiningSlide 10
edit-dist.c
New C program on course website
Computes the edit distance between two sequences
Prints out:
– Distance matrix
– Forward accumulated distance matrix
– Forward path matrix
– Optimal path
– Optimal alignment
EE3J2 Data MiningSlide 11
edit-dist.c
Format: edit-dist seq1 seq2 <Kdel> <Kins>
seq1 and seq2 are the sequences
<Kdel> and <Kins> are optional and default to 0
EE3J2 Data MiningSlide 12
Matching partial sequences
In some applications the interest is in whether one sequence matches a subsequence of another sequence
Example: Bioinformatics
– Look for examples of a simple DNA sequence within a more complex sequence
– Infer evolutionary relationship between two organisms
EE3J2 Data MiningSlide 13
Partial alignment
A simple, intuitive solution is to allow Dynamic Programming to:
– Start at any point in the first row
– End at any point in the final row
Then proceed as before
Unfortunately this has limitations…
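The simple solution described above can be sketched in Python as follows. The function name `partial_match`, the 0/1 local distance, the default penalties of 1, and the idea of recording each path's start column alongside its accumulated distance are assumptions made for this illustration. The pattern runs down the rows and the longer sequence runs along the columns.

```python
def partial_match(pattern, sequence, k_del=1, k_ins=1):
    """DP with a free start in the first row and a free end in the final row.

    Returns (distance, start_column, end_column) of the best-scoring end point.
    """
    rows, cols = len(pattern), len(sequence)
    ad = [[0] * cols for _ in range(rows)]
    start = [[0] * cols for _ in range(rows)]   # start column of the best path into each point
    for m in range(rows):
        for n in range(cols):
            d = 0 if pattern[m] == sequence[n] else 1
            if m == 0:
                # Free start: a path may begin at any point in the first row.
                ad[m][n], start[m][n] = d, n
                continue
            options = [(ad[m - 1][n] + k_del, start[m - 1][n])]          # vertical
            if n > 0:
                options.append((ad[m][n - 1] + k_ins, start[m][n - 1]))  # horizontal
                options.append((ad[m - 1][n - 1], start[m - 1][n - 1]))  # diagonal
            cost, origin = min(options, key=lambda o: o[0])
            ad[m][n], start[m][n] = cost + d, origin
    # Free end: pick the best-scoring point in the final row.
    end = min(range(cols), key=lambda n: ad[-1][n])
    return ad[-1][end], start[-1][end], end


distance, first, last = partial_match("ABC", "XXABCXX")
print(distance, "XXABCXX"[first:last + 1])
```

Only the single best-scoring end point is reported here, so other good matches elsewhere in the sequence go unreported.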
EE3J2 Data MiningSlide 14
Finding matching sub-sequences
[Figure: alignment grid annotated with "Start DP from here", "Best scoring end point" and "Lower cost path".]
EE3J2 Data MiningSlide 15
Backwards Pass DP
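Forward pass:
\[
ad(m,n) = \min
\begin{cases}
ad(m-1,n) + K_{DEL} + d(m,n) \\
ad(m-1,n-1) + d(m,n) \\
ad(m,n-1) + K_{INS} + d(m,n)
\end{cases}
\]
Backward pass (ad_b denotes the backward accumulated distance):
\[
ad_b(m,n) = \min
\begin{cases}
ad_b(m+1,n) + K_{DEL} + d(m,n) \\
ad_b(m+1,n+1) + d(m,n) \\
ad_b(m,n+1) + K_{INS} + d(m,n)
\end{cases}
\]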
EE3J2 Data MiningSlide 16
Backwards Pass DP
Starts in the bottom row and works right-to-left and bottom-to-top
Otherwise, the backward accumulated distance matrix and backward path matrix calculations are analogous to forward-pass DP
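A backward pass along these lines might look like the Python sketch below. The function name, the 0/1 local distance and the `free_start` flag are assumptions: with `free_start=False` the pass is anchored at the bottom-right point, and with `free_start=True` a path may begin at any point in the bottom row.

```python
def backward_accumulated_distance(seq1, seq2, k_del=0, k_ins=0, free_start=False):
    """Backward accumulated distance matrix; seq1 indexes rows, seq2 columns."""
    rows, cols = len(seq1), len(seq2)
    ad_b = [[0] * cols for _ in range(rows)]
    for m in range(rows - 1, -1, -1):          # bottom-to-top
        for n in range(cols - 1, -1, -1):      # right-to-left
            d = 0 if seq1[m] == seq2[n] else 1
            options = []
            if m < rows - 1:
                options.append(ad_b[m + 1][n] + k_del)        # from the point below
                if n < cols - 1:
                    options.append(ad_b[m + 1][n + 1])        # from the diagonal point
            if n < cols - 1 and not (free_start and m == rows - 1):
                options.append(ad_b[m][n + 1] + k_ins)        # from the point to the right
            ad_b[m][n] = d + (min(options) if options else 0)
    return ad_b


for row in backward_accumulated_distance("AABCD", "ABCCD", k_del=2, k_ins=2):
    print(row)
```

In the anchored case the top-left entry equals the bottom-right entry of the forward accumulated distance matrix, since both measure the same optimal path.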
EE3J2 Data MiningSlide 17
Forward-backward DP
Suppose that we have done a complete forward DP and a complete backward DP
We will have two path matrices:
– Forward path matrix
– Backward path matrix
For any point in the bottom row, we can trace back through the forward path matrix and recover a path ending in the top row
For any point in the top row, we can trace back through the backward path matrix and recover a path ending in the bottom row
EE3J2 Data MiningSlide 18
Matching sub-sequences
Choose a point in the bottom row. Trace back through the forward path matrix
Identify the start of the path. Then trace back through the backward path matrix
Are the paths the same? If so, then we have a matching subsequence
EE3J2 Data MiningSlide 19
Matching subsequences
If a path occurs as a consequence of tracing-back through the forward path matrix and tracing-back through the backward path matrix, then the corresponding section of the horizontal sequence is called a matching subsequence
The matching subsequences are those which achieve a good match with the vertical pattern
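The forward-backward check can be sketched in Python as follows. Everything here is an illustrative assumption rather than the lecture's own code: the function names, the 0/1 local distance, the default penalties of 1, the tie-breaking order, and the choice to let the forward pass start anywhere in the top row and the backward pass start anywhere in the bottom row. Each match is reported as (distance, start column, end column).

```python
def dp_pass(pattern, sequence, k_del, k_ins, backward=False):
    """One DP pass with a free start row; returns (distances, predecessor matrix)."""
    rows, cols = len(pattern), len(sequence)
    step = 1 if backward else -1                    # offset from a point to its predecessor
    row_order = range(rows - 1, -1, -1) if backward else range(rows)
    col_order = range(cols - 1, -1, -1) if backward else range(cols)
    ad = [[0] * cols for _ in range(rows)]
    prev = [[None] * cols for _ in range(rows)]     # None marks the start of a path
    for m in row_order:
        for n in col_order:
            d = 0 if pattern[m] == sequence[n] else 1
            pm, pn = m + step, n + step
            options = []
            if 0 <= pm < rows:                      # not in the free start row
                options.append((ad[pm][n] + k_del, (pm, n)))
                if 0 <= pn < cols:
                    options.append((ad[pm][pn], (pm, pn)))
                    options.append((ad[m][pn] + k_ins, (m, pn)))
            cost = 0
            if options:
                cost, prev[m][n] = min(options, key=lambda o: o[0])
            ad[m][n] = cost + d
    return ad, prev


def trace(prev, point):
    """Follow predecessors from a point until the start of its path."""
    path = [point]
    while prev[point[0]][point[1]] is not None:
        point = prev[point[0]][point[1]]
        path.append(point)
    return path


def matching_subsequences(pattern, sequence, k_del=1, k_ins=1):
    """Return (distance, start, end) for paths found by both trace-backs."""
    ad_f, prev_f = dp_pass(pattern, sequence, k_del, k_ins)
    _, prev_b = dp_pass(pattern, sequence, k_del, k_ins, backward=True)
    last = len(pattern) - 1
    matches = []
    for n in range(len(sequence)):
        forward_path = trace(prev_f, (last, n))[::-1]    # ordered top row to bottom row
        backward_path = trace(prev_b, forward_path[0])   # ordered top row to bottom row
        if forward_path == backward_path:
            matches.append((ad_f[last][n], forward_path[0][1], n))
    return matches


distance, start, end = min(matching_subsequences("ABC", "XXABCXX"))
print(distance, "XXABCXX"[start:end + 1])
```

In this sketch the agreement test can also flag paths with a large accumulated distance, so the example ranks the candidates by distance and keeps the smallest.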
EE3J2 Data MiningSlide 20
Matching subsequences
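[Figure: alignment grid with the pattern A B B C on the vertical axis and the sequence X Z A B C C Y Z on the horizontal axis; a section of the horizontal sequence is marked "matching subsequence".]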
We say that this subsequence most closely resembles the original sequence ABBC
EE3J2 Data MiningSlide 21
Summary
Revision of Dynamic Programming
Examples: Edit distance
Motivation for interest in optimal subsequences
Forward and backward dynamic programming
Matching subsequences, subsequences which most closely resemble a given sequence