Improved string matching with k mismatches (The Kangaroo Method)
Z. Galil, R. Giancarlo
SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54
Original: Moshe Lewenstein
Modified by: Hsing-Yen Ann
Date: Nov. 26, 2004
Exact String Matching
Input: T = t1 … tn
       P = p1 … pm
Output: All locations i of T where P appears
Example:
P = A B C A A B
T = A B A B C A A B C A A B C A A B A A …
Answer: {3, 7, 11, …}
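The exact-matching problem can be checked on the visible part of the example text with a few lines of Python. This is only a sketch of the problem statement, not of any algorithm from the slides: the function name `exact_matches` is hypothetical, it relies on Python's built-in `str.find`, and it reports 1-indexed locations to match the slide.

```python
def exact_matches(text: str, pattern: str) -> list[int]:
    """1-indexed start locations of every (possibly overlapping) occurrence."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start + 1)                  # convert to the slide's 1-indexed locations
        start = text.find(pattern, start + 1)   # resume one position later to allow overlaps
    return hits


# Visible prefix of T from the example (the slide's text continues after it).
print(exact_matches("ABABCAABCAABCAABAA", "ABCAAB"))  # [3, 7, 11]
```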
Approximate String Matching
Idea: Find all text locations where distance from pattern is sufficiently small.
distance metric: HAMMING DISTANCE
Let S = s1s2…sm
R = r1r2…rm
Ham(S,R) = the number of locations j where sj ≠ rj
Example: S = ABCABC R = ABBAAC
Ham(S,R) = 2
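A minimal sketch of the Hamming distance as defined above, assuming two strings of equal length; the function name `hamming` is chosen here for illustration.

```python
def hamming(s: str, r: str) -> int:
    """Number of positions j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, r))


print(hamming("ABCABC", "ABBAAC"))  # 2, the value given in the example
```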
String Matching with Mismatches
Input: T = t1 … tn
       P = p1 … pm
Output: For each location i of T, Ham(P, ti ti+1 … ti+m-1)
Example:
P = A B B A A C
T = A B C A A B C A C …
Writing Ti for the length-m substring of T that starts at location i:
Ham(P,T1) = 2
Ham(P,T2) = 4
Ham(P,T3) = 6
Ham(P,T4) = 2
Output: 2, 4, 6, 2, …
String Matching with k Mismatches
Input: T = t1 … tn, P = p1 … pm
Output: Every location i of T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k
Example: k = 2
P = A B B A A C
T = A B C A A B C A C …
Distances: 2, 4, 6, 2, …
Within k?  Y, N, N, Y, …
Naïve Algorithm (for the counting-mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and Ti
Running Time: O(nm), where n = |T|, m = |P|
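The naïve algorithm can be sketched directly: slide the pattern over the text and count mismatches at every alignment. The function names are hypothetical, and locations are reported 1-indexed to match the examples; the text used is the visible prefix from the mismatch example.

```python
def mismatch_counts(text: str, pattern: str) -> list[int]:
    """Hamming distance between the pattern and every length-m window of the text: O(nm)."""
    n, m = len(text), len(pattern)
    return [
        sum(text[i + j] != pattern[j] for j in range(m))
        for i in range(n - m + 1)
    ]


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """1-indexed locations whose window is within Hamming distance k of the pattern."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))      # [2, 4, 6, 2]
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```

With k = 2 the first and fourth alignments are accepted, which is the Y, N, N, Y pattern of the k-mismatches example.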
The Kangaroo Method (for k-mismatches)
Landau – Vishkin 1986
Galil – Giancarlo 1986
Trie
• A tree representing a set of strings.
[Figure: trie of the string set below]
{ aeef ad bbfe bbfg c }
Trie (Cont)
• Assume no string is a prefix of another
[Figure: the same trie]
Each string corresponds to a leaf.
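A minimal sketch of a trie for the string set from the slide, using nested dictionaries (one dictionary per node, keyed by the next character). The representation and the names `build_trie` and `count_leaves` are assumptions made for illustration. Because no string in the set is a prefix of another, the number of leaves equals the number of strings.

```python
def build_trie(words):
    """Insert each word character by character; a node is a dict of child nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node with no children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


words = ["aeef", "ad", "bbfe", "bbfg", "c"]
trie = build_trie(words)
print(sorted(trie))        # ['a', 'b', 'c']: the first characters of the set
print(count_leaves(trie))  # 5: one leaf per string
```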
Compressed Trie
• Compress unary nodes, label edges by strings
[Figure: the trie above next to its compressed form, whose edges are labeled a, bbf, c, eef, d, e, g]
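Compressing unary nodes can be sketched on the same string set. The nested-dictionary representation and both function names are assumptions; the sketch also relies on the set being prefix-free, since it keeps no end-of-string marker.

```python
def build_trie(words):
    """Plain trie: one dict per node, keyed by the next character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one edge labeled by a string."""
    out = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:                  # unary node: absorb its only edge
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        out[label] = compress(child)
    return out


compressed = compress(build_trie(["aeef", "ad", "bbfe", "bbfg", "c"]))
print(compressed)
# {'a': {'eef': {}, 'd': {}}, 'bbf': {'e': {}, 'g': {}}, 'c': {}}
```

The resulting edge labels a, bbf, c, eef, d, e, g are the ones that appear in the compressed-trie figure.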
Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s
To make the set of suffixes prefix-free, add a special character, say $, at the end of s
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:
{ $ b$ ab$ bab$ abab$ }
[Figure: suffix tree of abab$, with edges labeled ab, b, ab$ and $]
Suffix Tree properties
- Succinct in space: O(n).
- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).
[Figure: suffix tree of abab$ with leaves numbered 1–5]
Exact string matching
s = abab$
[Figure: suffix tree of abab$ with leaves numbered 1–5]
Given a pattern P = ab we traverse the tree according to the pattern.
The leaves below the point we reach correspond to the locations of appearance! Here: 1 and 3.
Prepare Tree: O(n) time
Find matches: O(m + occ) time, where occ = # of matches
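The traversal idea can be sketched with an uncompressed suffix trie, which is much simpler to build than a real suffix tree but takes O(n^2) time and space instead of O(n). This is a stand-in under stated assumptions: each node stores the 1-indexed start positions of the suffixes passing through it, which replaces collecting the leaves below the node, and all names are hypothetical.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie of s; O(n^2), unlike the O(n) suffix tree."""
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)    # suffix number, 1-indexed
    return root


def occurrences(root, pattern):
    """Walk down the trie along the pattern; the node reached knows every match."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]


trie = build_suffix_trie("abab$")
print(occurrences(trie, "ab"))  # [1, 3]
```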
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.
s = abbaab$
[Figure: suffix tree of abbaab$ with leaves numbered 1–7; the suffixes aab$ and abbaab$ are highlighted]
Example: the suffixes aab$ and abbaab$ share the prefix a, which is the string spelled on the path from the root to the LCA of their leaves.
LCA/LCP properties
[Figure: suffix tree of abbaab$ with leaves numbered 1–7]
Preprocessing time: O(n)
Query Time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993
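Constant-time LCP queries can be sketched without implementing LCA on a suffix tree. The sketch below swaps in a different, commonly used technique: a suffix array, an LCP array built with Kasai's algorithm, and a sparse table for range minima. It is not the Harel-Tarjan or Schieber-Vishkin construction, and the suffix array is built by plain sorting, so preprocessing here is slower than the O(n) stated on the slide. Positions are 0-indexed and all names are hypothetical.

```python
def suffix_array(s):
    """Start positions of the suffixes in sorted order (simple, not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] = LCP length of suffixes sa[r-1] and sa[r]."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return rank, lcp


class LcpOracle:
    def __init__(self, s):
        self.n = len(s)
        self.rank, lcp = lcp_array(s, suffix_array(s))
        # Sparse table: table[j][r] = min(lcp[r .. r + 2**j - 1]).
        self.table = [lcp]
        j = 1
        while (1 << j) <= self.n:
            prev, half = self.table[-1], 1 << (j - 1)
            self.table.append(
                [min(prev[r], prev[r + half]) for r in range(self.n - (1 << j) + 1)]
            )
            j += 1

    def lcp(self, i, j):
        """LCP length of the suffixes starting at positions i and j: two table lookups."""
        if i == j:
            return self.n - i
        lo, hi = sorted((self.rank[i], self.rank[j]))
        lo += 1                                  # minimum over lcp[lo .. hi]
        span = (hi - lo + 1).bit_length() - 1
        row = self.table[span]
        return min(row[lo], row[hi - (1 << span) + 1])


oracle = LcpOracle("abbaab$")
print(oracle.lcp(3, 0))  # 1: the suffixes aab$ and abbaab$ share only "a"
```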
The Kangaroo Method (for k-mismatches)
- Create suffix tree for: s = P#T$
- Check P at each location i of T by kangarooing
Example:
P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …
(Pj is the suffix of P starting at offset j; Ti+j is the suffix of T starting at location i+j.)
- Find LCP(s, P0, Ti): length 4. Kangarooing distance = LCP(s, P0, Ti) + 1 = 5
- Find LCP(s, P5, Ti+5): length 2. Kangarooing distance = LCP(s, P5, Ti+5) + 1 = 3
- Find LCP(s, P8, Ti+8): length 3, which reaches the end of P
Next iteration: i = i + 1
The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T: O(n+m) time
- LCA preprocessing: O(n+m) time
Check P at a given text location:
- Kangaroo jump from mismatch to mismatch: O(1) per jump, at most k+1 jumps, so O(k) time
Overall time for this approach: O(nk)
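Putting the pieces together, the whole method can be sketched over the concatenated string s = P#T$. The LCP query is passed in as a function so that an O(1) oracle can be plugged in; the default used here is a naive scan, which does not give the O(nk) bound and is only a placeholder. The names `naive_lcp` and `kangaroo_search` are hypothetical, and the sketch assumes that # and $ occur in neither P nor T.

```python
def naive_lcp(s, a, b):
    """LCP length of s[a:] and s[b:] by direct comparison (placeholder for O(1) LCA)."""
    h = 0
    while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
        h += 1
    return h


def kangaroo_search(pattern, text, k, lcp=naive_lcp):
    """1-indexed locations i of text where the pattern matches with at most k mismatches."""
    if any(c in pattern + text for c in "#$"):
        raise ValueError("separator characters must not occur in the input")
    m, n = len(pattern), len(text)
    s = pattern + "#" + text + "$"
    hits = []
    for i in range(n - m + 1):
        j, mismatches = 0, 0
        while j < m and mismatches <= k:
            # P's suffix at offset j starts at s[j]; T's at offset i+j starts at s[m+1+i+j].
            # The '#' after P stops every run at the end of the pattern.
            j += lcp(s, j, m + 1 + i + j)
            if j < m:
                mismatches += 1
                j += 1
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(kangaroo_search("ABBAAC", "ABCAABCAC", 2))           # [1, 4]
print(kangaroo_search("ABCAAB", "ABABCAABCAABCAABAA", 0))  # [3, 7, 11]
```

The inner loop stops as soon as k+1 mismatches have been seen, so each location costs at most k+1 LCP queries.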
2004/11/22 Hsing-Yen Ann
Faster Algorithms for Four Different Cases
- Large alphabet: at least 2k different symbols in pattern P. O(n)
- Small alphabet: at most 2√k different symbols in pattern P. O(n √k log m)
- General alphabets, many frequent symbols: at least √k frequent symbols. O(n √k log m)
- General alphabets, few frequent symbols: fewer than √k frequent symbols. O(n √k log m)