Sequence Local Alignment using Directed Acyclic Word Graph
Do Huy Hoang
SEQUENCE ALIGNMENT
Sequence Similarity
• Alignment: arrange DNA/protein sequences to show their similarity
• A gap symbol (conventionally “-”) denotes an insertion/deletion event
Other variations
• Edit distance
• Longest common substring
• Affine gap scoring
• Using a scoring matrix (BLOSUM, PAM)
Alignment score computation
• Needleman–Wunsch – Dynamic programming
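To make the dynamic-programming idea concrete, here is a minimal sketch of the Needleman–Wunsch global alignment score. The function name and the default scores (match = 1, mismatch = -1, linear gap = -1) are illustrative assumptions, not values from these slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap penalty.

    The default scores are illustrative assumptions.
    Only two rows of the DP table are kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # a[i-1] against a gap
                           cur[j - 1] + gap)) # b[j-1] against a gap
        prev = cur
    return prev[-1]
```

For example, `needleman_wunsch_score("ACGT", "AGT")` aligns three matching letters and one gap.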
Other variations
Name           Problem                      Worst time    Average time          Memory
Four Russians  Edit distance 1,0            M*N/log(N)    <not good>            MN
Ukkonen        Global edit (linear cost)    ND            N+D^2                 D^2
Waterman       Local alignment              MN            MN                    MN
Tree tree      Local alignment              M^2 N^2       <close to M^2 N^2>
BWTSW          Meaningful local alignment   M N^2         M N^0.68
Local alignment
• Local alignment: find the best alignment between two substrings, one taken from each sequence
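A minimal Smith–Waterman sketch illustrates the definition: it fills the local alignment table, remembers the best cell, and traces back to recover the two substrings. The function name and the default scores (linear gap penalty) are assumptions chosen for brevity.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, substring of a, substring of b).

    Linear gap penalty; the default scores are illustrative assumptions.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the best cell until the score drops to zero.
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:bi], b[j:bj]
```

The table has (n+1)(m+1) cells, which is the MN time and memory listed for Waterman in the comparison table.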
BWTSW
• BWTSW
  – Motivation
    • Scoring: 75% similarity
    • In the local alignment table, most entries are zero
    • Meaningful alignment
  – Suffix tree
  – Meaningful alignment
  – Meaningful alignment with gap
  – How good is it?
Meaningful alignment (1)
• Sequence similarity sometimes implies functional similarity.
• Biologists are usually NOT interested in sequences with less than 70% similarity.
• BLAST score
  – Match = 1
  – Mismatch = -3
  – Open gap = -5
  – Extending gap = -2
Meaningful alignment (2)
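• BLAST score
  – Match = 1
  – Mismatch = -3
  – Open gap = -5
  – Extending gap = -2
• More than 75% of the aligned positions must match for the score to be nonzero: x matches and y mismatches score x - 3y, which is positive only when x > 3y.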
Meaningful alignment (3)
• BLAST score
  – Match = 1
  – Mismatch = -3
  – Open gap = -5
  – Extending gap = -2
• How many nonzero entries are there in the local alignment DP table?
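One way to explore this question empirically is to fill the table with the BLAST scores and count the positive cells. The sketch below uses Gotoh-style affine-gap recurrences; it assumes that a gap of length k scores gap_open + k * gap_extend, and the function names are hypothetical.

```python
NEG_INF = float("-inf")

def local_dp_table(text, pattern, match=1, mismatch=-3,
                   gap_open=-5, gap_extend=-2):
    """Local alignment table H with affine gaps.

    Assumption: a gap of length k scores gap_open + k * gap_extend.
    E holds alignments ending with a gap in text, F with a gap in pattern.
    """
    n, m = len(text), len(pattern)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap from H, or extend an existing one.
            E[i][j] = max(H[i][j - 1] + gap_open + gap_extend,
                          E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open + gap_extend,
                          F[i - 1][j] + gap_extend)
            s = match if text[i - 1] == pattern[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
    return H

def count_positive(H):
    """Number of table cells with a strictly positive score."""
    return sum(1 for row in H for value in row if value > 0)
```

With a mismatch penalty three times the match reward, a cell stays positive only when it ends a highly similar pair of substrings, so on unrelated sequences most cells are expected to be zero.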
How to improve?
• Idea:
  – Do not store zero-score entries
  – Use a suffix tree to prune early
BWTSW details
• FM index for suffix tree representation
• Prune zero entries
• Store the DP vector using a linked list
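The pruning idea can be sketched without an FM index. The code below walks the distinct substrings of the text as an implicit suffix trie (each node is a list of start positions plus a depth), keeps for every node a DP row that stores positive scores only, and abandons a branch as soon as its row is empty. This is a simplified illustration, not the BWTSW implementation: it uses a linear gap penalty (an assumption), a dict instead of a linked list, and hypothetical names.

```python
def best_meaningful_score(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score found by pruned trie traversal.

    row[j] is the best score of aligning the current text substring X
    with some substring of pattern ending at column j, kept only if > 0.
    Assumption: linear gap penalty instead of affine gaps.
    """
    m = len(pattern)
    best = 0
    root_row = {j: 0 for j in range(m + 1)}  # empty X may start anywhere
    stack = [(list(range(len(text))), 0, root_row)]
    while stack:
        starts, depth, row = stack.pop()
        # Children of this trie node: occurrences grouped by next character.
        children = {}
        for s in starts:
            if s + depth < len(text):
                children.setdefault(text[s + depth], []).append(s)
        for c, occurrences in children.items():
            new_row = {}
            # Scans every column for clarity; a tuned version would
            # visit only the neighbours of stored entries.
            for j in range(1, m + 1):
                candidates = []
                if j - 1 in row:
                    s = match if c == pattern[j - 1] else mismatch
                    candidates.append(row[j - 1] + s)
                if j in row:
                    candidates.append(row[j] + gap)
                if j - 1 in new_row:
                    candidates.append(new_row[j - 1] + gap)
                score = max(candidates, default=0)
                if score > 0:  # zero and negative entries are not stored
                    new_row[j] = score
            if new_row:  # an empty row prunes the whole subtree
                best = max(best, max(new_row.values()))
                stack.append((occurrences, depth + 1, new_row))
    return best
```

Because an optimal local alignment can always be chosen so that every prefix of it has a positive score, dropping non-positive entries does not lose the best score.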
Analysis
• Text length = N
• Pattern length = M
• Alphabet size = σ
Average running time (1)
• Let F(L) be the number of strings S2 of length L with Score(S1,S2) > 0 against a fixed string S1 of length L
  – F(L) = Sizeof{S2 : Len(S1)=Len(S2)=L, Score(S1,S2)>0}
  – F(L) counts the strings with at least 75% identity to S1.
• F(L) = sum(i=0..L/4, Binomial(L,i) * (σ-1)^i), where σ is the alphabet size
• F(L) ≤ k1 * k2^L
• F(log(N)) ≤ k3 * N^0.68
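As a sanity check on the counting argument, the sketch below assumes that F(L) counts, for a fixed S1, the strings S2 that differ from S1 in at most L/4 positions (rounded down, no gaps) over an alphabet of size sigma, so that term i of the sum is Binomial(L, i) * (sigma - 1)^i. The names `f_formula`, `f_brute_force` and `sigma` are hypothetical.

```python
from itertools import product
from math import comb

def f_formula(L, sigma=4):
    """Closed-form count: choose i mismatch positions, then one of
    the sigma - 1 other letters at each of them."""
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))

def f_brute_force(s1, alphabet="ACGT"):
    """Enumerate every string of len(s1) over the alphabet and count
    those within L // 4 mismatches of s1 (exponential, small L only)."""
    L = len(s1)
    return sum(
        1
        for s2 in product(alphabet, repeat=L)
        if sum(x != y for x, y in zip(s1, s2)) <= L // 4
    )
```

For L = 4 over DNA both functions give 13 out of 4^4 = 256 strings, which shows how small the surviving fraction is even for short strings.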
Average running time (2)
• Given S1, Pr(Score(S1,S2) > 0 | S1) = F(L)/σ^L, where σ is the alphabet size
• For M < log(N)
  – The number of entries is
  – O(M * F(M)) < O(log(N) * F(log(N)))
• For M > log(N)
  – O(M * N * F(M) / L)
• On average
  – Time = O(M * F(log(N))) = O(M * N^0.68)
DAWG
Possible improvement of BWTSW
• Worst-case running time O(N^2 M)
  – When M = N
  – O(M N^0.68 + M^3) when the pattern is a substring of the text
• What about ST vs. ST?
• What we used in BWTSW is a suffix trie (not a suffix tree).
  – #Prove it#
• A suffix trie has O(N^2) nodes
• A DAWG is a similar structure with O(N) nodes
DAWG (1)
DAWG (2)
• DAWG: Directed Acyclic Word Graph
• A DAWG is an acyclic automaton that recognizes all the substrings of the given string.
DAWG (3)
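• Example: DAWG of “abcbc”
[Figure: DAWG of “abcbc”. Besides the initial state, the states are labelled with the substrings they represent: {a}, {b}, {ab}, {bc, c}, {abc}, {abcb, bcb, cb}, {abcbc, bcbc, cbc}. Edges are labelled a, b or c.]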
DAWG (4)
• End-set view
[Figure: the DAWG of “abcbc” with each state labelled by its end-set: {0,1,2,3,4,5}, {1}, {2}, {2,4}, {3}, {3,5}, {4}, {5}; shown beside the substring-labelled DAWG from the previous slide.]
Trivial DAWG construction
• Using End-set class
[Figure: end-set view of the DAWG of “abcbc” (end-sets {0,1,2,3,4,5}, {1}, {2}, {2,4}, {3}, {3,5}, {4}, {5}) beside the substring-labelled DAWG.]
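The end-set construction can be sketched directly: compute the set of end positions of every substring, merge substrings with equal end-sets into one state, and add an edge from the state of x to the state of xc. This naive version takes cubic time and is meant only to illustrate the definition; the function names are hypothetical.

```python
from collections import defaultdict

def build_dawg_by_endsets(w):
    """Naive DAWG: one state per distinct end-set (1-based end positions)."""
    n = len(w)
    end_sets = defaultdict(set)
    end_sets[""] = set(range(n + 1))  # the empty string ends everywhere
    for i in range(n):
        for j in range(i + 1, n + 1):
            end_sets[w[i:j]].add(j)
    # Substrings with identical end-sets collapse into the same state.
    state_of = {x: frozenset(ends) for x, ends in end_sets.items()}
    states = set(state_of.values())
    # Edge: state(x) --c--> state(xc) for every non-empty substring xc.
    edges = {}
    for x, q in state_of.items():
        if x:
            edges[(state_of[x[:-1]], x[-1])] = q
    root = state_of[""]
    return states, edges, root

def recognizes(edges, root, s):
    """Follow the edges from the root; s is a substring iff no edge is missing."""
    q = root
    for c in s:
        q = edges.get((q, c))
        if q is None:
            return False
    return True
```

For “abcbc” this yields the eight end-sets of the example, with “bc” and “c” sharing the state {3, 5}.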
DAWG properties
• For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
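The size bounds can be checked on small inputs with the well-known online suffix-automaton construction, which builds the DAWG in linear time for a fixed alphabet. This is a generic sketch with hypothetical names, not code from the presentation.

```python
def suffix_automaton(w):
    """Online DAWG construction; returns the transition map of each state."""
    nxt, link, length = [{}], [-1], [0]  # state 0 is the initial state
    last = 0
    for c in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        # Add the new edge along the suffix links until c is already present.
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter substrings.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt

def dawg_size(w):
    """Return (number of states, number of edges) of the DAWG of w."""
    nxt = suffix_automaton(w)
    return len(nxt), sum(len(out) for out in nxt)
```

For w = “abcbc”, `dawg_size` reports 8 states and 9 edges, within the bounds 2|w|-1 = 9 and 3|w|-4 = 11.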
D(w) and ST(w^R)
• There is a map between nodes in the DAWG and the implicit ST(w^R)
  – Example: w=abcbc, w^R=cbcba
• Store DAWG using ST, which uses only o(N) bits
[Figure: suffix tree ST(w^R) for w^R = “cbcba”, shown beside the DAWG of w = “abcbc”.]
D(w) and ST(w^R) (2)
• List all incoming edges of node q in D(w) using ST(w^R)
Local Alignment using DAWG
• Basis
• Induction
Extensions
• Meaningful alignment using DAWG
  – Prune the nodes whose score is less than zero
• Shortest-path-style pruning
• Caching log(N) nodes: the worst-case running time is M*N*log(N); the average case is the same for M << N.