Post on 13-Jan-2016
String Matching
15-211 Fundamental Data Structures and Algorithms
April 22, 2003
Announcements
• Quiz #4 is available after class today!
  • Available until Wednesday midnight
• Homework 6 is out!
  • Due on Thursday, May 1, 11:59pm
  • Tournament will run on May 7; details to come…
• Final exam on May 8, 8:30am-11:30am, UC McConomy
  • Review on May 4, details TBA
String Matching
Why String Matching?
• Finding patterns in documents formed using a large alphabet
  • Word processing: search/modify/replace
  • Web searching: search/display
• Applications in molecular biology
  • Biological molecules can often be approximated as sequences of amino acids
  • Very large volumes of data: doubles every 18 months
  • Need efficient string matching algorithms
• Applications in systems and software design
• Main data form used to exchange information: TEXT
  • So text pattern matching is very important
• Big Question: given a string T of length n and a pattern P of length m (m <= n), how do we find any or all occurrences of pattern P in T?
String Matching
• Text string T[0..N-1]: T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]: P = “abacab”
• Think of a naïve algorithm to find the pattern P in T
• How much work is needed to determine that? Can we do better?
• Better string matching algorithms
  • Use finite automata
  • Use combinatorial properties
String Matching
• Let T and P be strings built over a finite alphabet Σ with |Σ| = σ
• Text string T[0..N-1]: T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]: P = “abacab”
• Where is the first instance of P in T? T[10..15] = P[0..5]
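The "any or all occurrences" question can be tried out directly with the standard library before looking at hand-written matchers. The following is a minimal sketch, assuming Java's built-in String.indexOf(String, int) as the search primitive; the class name AllOccurrences and the method findAll are hypothetical names introduced only for this illustration, and treating an empty pattern as having no matches is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class AllOccurrences {
    /** Returns every index at which pattern occurs in text, in increasing order. */
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) return hits;   // assumption: empty pattern reports nothing
        int from = 0;
        while (true) {
            int at = text.indexOf(pattern, from);
            if (at < 0) break;                // no further occurrence
            hits.add(at);
            from = at + 1;                    // step by one so overlapping matches are found
        }
        return hits;
    }

    public static void main(String[] args) {
        // The example from the slide: the only occurrence starts at T[10].
        System.out.println(findAll("abacaabaccabacabaabb", "abacab")); // prints [10]
    }
}
```

Advancing by one position after each hit, rather than by the pattern length, is what lets overlapping occurrences such as "aa" in "aaaa" all be reported.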
String Matching
[Figure: the text abacaabaccabacabaabb with the pattern abacab aligned beneath it at each shift from 0 to 10; the match is at shift 10.]
• The brute force algorithm
• 22+6=28 comparisons.
• The brute force algorithm requires O(nm) operations
Brute Force, v.1
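static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    // Try every alignment i of P against T.
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        // Extend the match while the characters agree.
        while (j < m && T[i+j] == P[j]) j++;
        if (j == m) return i;   // all of P matched at position i
    }
    return -1;                  // no occurrence
}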
Brute Force, v.2 (one loop)
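static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int i = 0;   // position in T
    int j = 0;   // position in P
    // A while loop (rather than do-while) also handles an empty T or P.
    while (j < m && i < n) {
        if (T[i] == P[j]) { i++; j++; }
        else { i = i - j + 1; j = 0; }   // back up i to the next alignment
    }
    if (j == m) return i - m;
    else return -1;
}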
String Matching
• Text string T[0..N-1]: T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]: P = “abacab”
• Where is the first instance of P in T? T[10..15] = P[0..5]
• In general, how many comparisons T[i] = P[j]? are needed to do the search?
  • Worst case: O(NM)
A bad case
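Text: 00000000000000001
Pattern: 00001
[Figure: the pattern is tried at each of the first 12 alignments; each one matches 0000 and fails on the fifth character. The 13th alignment matches 00001.]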
• 60+5 = 65 comparisons are needed
• How many of them could be avoided?
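The comparison counts quoted for brute force can be reproduced by instrumenting the matcher. This is a minimal sketch, assuming the two-loop v.1 matcher and counting every evaluation of T[i+j] == P[j] up to the first match; the class name BruteForceCounter and the method countComparisons are hypothetical names used only here.

```java
final class BruteForceCounter {
    /** Counts character comparisons made by brute force until the first match (or until it gives up). */
    static int countComparisons(String text, String pattern) {
        char[] t = text.toCharArray();
        char[] p = pattern.toCharArray();
        int count = 0;
        for (int i = 0; i <= t.length - p.length; i++) {
            int j = 0;
            while (j < p.length) {
                count++;                       // one comparison of T[i+j] with P[j]
                if (t[i + j] != p[j]) break;   // mismatch: give up on this alignment
                j++;
            }
            if (j == p.length) return count;   // first match found
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countComparisons("00000000000000001", "00001"));     // prints 65
        System.out.println(countComparisons("abacaabaccabacabaabb", "abacab")); // prints 28
    }
}
```

In the bad case, 12 alignments each cost 5 comparisons before the 13th succeeds, which is where 60+5 comes from.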
Typical text matching
This is a sample sentence
- - - s- - - s- - - - s- - - - - - - sente
• 20+5=25 comparisons are needed
(The match is near the same point in the target string as the previous example.)
• In practice, 0 ≤ j ≤ 2
String Matching
• Brute force worst case O(MN)
  • Expensive for long patterns in repetitive text
• How to improve on this?
• Intuition:
  • Don’t look at the text more than once.
  • Remember what is learned from previous matches
Motivation with FSM
• Consider the alphabet {a,b,c} and the FSM given below
• What is a language accepted by this FSM?
• What can we learn from this FSM?
[Figure: FSM with states Start, 1, 2, 3, 4 and End; the transitions are labeled a, b, c and b/c.]
Clever string matching
• 1970: Cook published an abstract result about machine models
  • Match in O(N+M) vs. O(MN)?!
• Knuth and Pratt studied it and refined it into a simple algorithm.
• Morris, annoyed at a design problem in implementing a text editor, discovered the same algorithm.
  • How to avoid decrementing i?
• KMP published together in 1977.
String Matching
• Meanwhile …
• Boyer and Moore discovered another algorithm that is even faster (for some uses) in the average case.
• Gosper independently discovered the same algorithm.
• Boyer and Moore published in 1977.
String Matching
• In 1980, Karp and Rabin discovered a simpler algorithm.
  • Uses hashing idea: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
Knuth Morris Pratt
The KMP idea
• Take advantage of what we already know during the match process.
• Suppose P = 1000000
• Suppose P[0..5] matches T[10..15]
  • Suppose P[6] ≠ T[16]
• Suppose we know that P[0] ≠ any of T[11..15]
• And the next possible match is P[0] ? T[16]
KMP example
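• Match fails: T[i] ≠ P[j], with i = 6, j = 6
• Next match attempt: i = 6, j = 0
Text: 10000010000000000
Pattern: 1000000
[Figure: the first alignment matches 100000 and fails on the seventh character; the next alignment starts at i = 6 and matches 1000000.]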
Brute Force vs. KMP
• A worst-case example
Text: 0000000000000000000000000001
Pattern: 00000000000001
Brute force: 196 + 14 = 210 comparisons
[Figure: each of the first 14 alignments matches 13 zeros and fails on the last character; the 15th alignment matches.]
KMP: 28 + 14 = 42 comparisons
[Figure: the first attempt matches 13 zeros and fails; each of the following 13 attempts compares only two characters (one match, one failure); the final attempt matches 01.]
Brute Force vs. KMP
Text: abcdeabcdeabcedfghijkl
Pattern: bcedfg
Brute force: 21 comparisons
  - bc- - - - - bc- - - - - bcedfg
KMP: 19 comparisons
  - bc- - - - bc- - - - bcedfg
5 preparation comparisons
KMP – The Big Idea
• Retain information from prior attempts.
• Compute in advance how far to jump in P when a match fails.
  • Suppose the match fails at P[j] ≠ T[i+j].
  • Then we know P[0..j-1] = T[i..i+j-1].
• We must next try P[0] ? T[i+1].
  • But we know T[i+1] = P[1]
  • There is another way to compare: P[1] ? P[0]
• If so, increment j by 1. No need to look at T.
  • What if P[1] = P[0] and P[2] = P[1]?
• Then increment j by 2. Again, no need to look at T.
• In general, we can determine how far to jump without any knowledge of T!
Implementing KMP
• Never decrement i, ever.
  • Comparing T[i] with P[j].
• Compute a table f of how far to jump j forward when a match fails.
  • The next match will compare T[i] with P[f[j-1]]
• Do this by matching P against itself in all positions.
Building the Table for f
• P = 1010011
• Find self-overlaps

Prefix   Overlap  j  f
1        .        1  0
10       .        2  0
101      1        3  1
1010     10       4  2
10100    .        5  0
101001   1        6  1
1010011  1        7  1
What f means
• If f is zero, there is no self-match. This is good news:
  • Set j=0
  • Do not change i.
• The next match is T[i] ? P[0]
• f non-zero implies there is a self-match. This is bad news:
  • E.g., f=2 means P[0..1] = P[j-2..j-1]
• Hence must start new comparison at j-2, since we know T[i-2..i-1] = P[0..1]
• In general:
  • Set j=f[j-1]
  • Do not change i.
• The next match is T[i] ? P[f[j-1]]
Favorable conditions
• P = 1234567
• Find self-overlaps

Prefix   Overlap  j  f
1        .        1  0
12       .        2  0
123      .        3  0
1234     .        4  0
12345    .        5  0
123456   .        6  0
1234567  .        7  0
Mixed conditions
• P = 1231234
• Find self-overlaps

Prefix   Overlap  j  f
1        .        1  0
12       .        2  0
123      .        3  0
1231     1        4  1
12312    12       5  2
123123   123      6  3
1231234  .        7  0
Poor conditions
• P = 1111110
• Find self-overlaps

Prefix   Overlap  j  f
1        .        1  0
11       1        2  1
111      11       3  2
1111     111      4  3
11111    1111     5  4
111111   11111    6  5
1111110  .        7  0
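The f column in these self-overlap tables can be checked straight from its definition: for each prefix of length j, f is the length of the longest proper prefix of that prefix that is also its suffix. The sketch below computes it the slow, obvious way (cubic time in the worst case), which is useful only for checking the tables by hand; the class name OverlapTable and the method overlaps are hypothetical names for this illustration, and the result is stored 0-based, so entry j-1 holds the f value listed for j.

```java
import java.util.Arrays;

final class OverlapTable {
    /** result[j-1] = length of the longest proper prefix of P[0..j-1] that is also its suffix. */
    static int[] overlaps(String p) {
        int m = p.length();
        int[] f = new int[m];                  // entries default to 0 (no self-overlap)
        for (int j = 1; j <= m; j++) {
            // Try the longest candidate overlap first and stop at the first hit.
            for (int k = j - 1; k > 0; k--) {
                // Compare prefix p[0..k-1] with suffix p[j-k..j-1].
                if (p.regionMatches(0, p, j - k, k)) {
                    f[j - 1] = k;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(overlaps("1010011"))); // prints [0, 0, 1, 2, 0, 1, 1]
        System.out.println(Arrays.toString(overlaps("1111110"))); // prints [0, 1, 2, 3, 4, 5, 0]
    }
}
```

The KMP pre-process computes the same values in linear time by reusing earlier entries instead of re-comparing each candidate overlap from scratch.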
KMP matcher
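static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    if (m == 0) return 0;        // empty pattern matches at position 0
    int[] f = computeF(P);       // failure table, built from P alone
    int i = 0;                   // position in T; never decremented
    int j = 0;                   // position in P
    while (i < n) {
        if (P[j] == T[i]) {
            if (j == m-1) return i-m+1;   // matched all of P
            i++; j++;
        } else if (j > 0) {
            j = f[j-1];          // use f to determine the next value for j
        } else {
            i++;
        }
    }
    return -1;
}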
KMP pre-process
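static int[] computeF(char[] P) {
    int m = P.length;
    int[] f = new int[m];
    if (m == 0) return f;        // nothing to compute for an empty pattern
    f[0] = 0;
    int i = 1;                   // position being filled in
    int j = 0;                   // length of the current self-overlap
    while (i < m) {
        if (P[j] == P[i]) {
            f[i] = j+1;
            i++; j++;
        } else if (j > 0) {
            j = f[j-1];          // use previous values of f
        } else {
            f[i] = 0;
            i++;
        }
    }
    return f;
}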
KMP Performance
• At each iteration, one of three cases:
  • T[i] = P[j]: i increases
  • T[i] ≠ P[j] and j > 0: i-j increases
  • T[i] ≠ P[j] and j = 0: i increases and i-j increases
• Hence, maximum of 2N iterations.
• Constructing f[] needs 2M iterations.
• Thus worst case performance is O(N+M).
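The 2N bound can be checked empirically by counting comparisons in the KMP matcher. This is a sketch under stated assumptions: it follows the matcher and pre-process given on the slides, assumes a non-empty pattern, and counts each evaluation of P[j] == T[i] during the search (table construction is not counted). The class name KmpCounter and the method matchCounting are hypothetical names for this illustration.

```java
import java.util.Arrays;

final class KmpCounter {
    /** Failure table: f[i] = longest proper prefix of P[0..i] that is also a suffix of it. */
    static int[] computeF(char[] p) {
        int m = p.length;
        int[] f = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (p[j] == p[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];
            else { f[i] = 0; i++; }
        }
        return f;
    }

    /** Returns {index of first match or -1, number of text comparisons}. Assumes a non-empty pattern. */
    static int[] matchCounting(String text, String pattern) {
        char[] t = text.toCharArray();
        char[] p = pattern.toCharArray();
        int n = t.length, m = p.length;
        int[] f = computeF(p);
        int comparisons = 0;
        int i = 0, j = 0;
        while (i < n) {
            comparisons++;                         // one comparison of P[j] with T[i]
            if (p[j] == t[i]) {
                if (j == m - 1) return new int[]{i - m + 1, comparisons};
                i++; j++;
            } else if (j > 0) {
                j = f[j - 1];                      // i stays put; only j falls back
            } else {
                i++;
            }
        }
        return new int[]{-1, comparisons};
    }

    public static void main(String[] args) {
        // The worst-case example: 27 zeros followed by 1, pattern 13 zeros followed by 1.
        System.out.println(Arrays.toString(
            matchCounting("0000000000000000000000000001", "00000000000001"))); // prints [14, 42]
    }
}
```

For this input n = 28, so 42 comparisons sits comfortably under 2N = 56, compared with 210 for brute force.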
KMP Summary
• performs the comparisons from left to right;
• preprocessing phase in O(m) space and time complexity;
• searching phase in O(n+m) time complexity (independent of the alphabet size);
• performs at most 2n-1 text character comparisons during the scan of the text.
Boyer Moore
Brute Force vs. B-M
Text: abcdeabcdeabcedfghijkl
Pattern: bcedfg
Brute force: 15 + 6 = 21 comparisons
  - bc- - - - - bc- - - - - bcedfg
B-M: 2 + 6 = 8 comparisons
  - - g f d e c b
Boyer Moore
• Perhaps the most efficient algorithm for general pattern matching
• Ideas
  • Scan pattern from right to left (and target from left to right)
  • Allows for bigger jumps on early failures
  • Could use a table similar to KMP.
  • But follow a better idea: use information about T as well as P in deciding what to do next.
Brute Force vs. B-M
Text: This string is textual
Pattern: textual
Brute force: 16 + 7 = 23 comparisons
  - - - - - - t- - - - - - - - - textual
B-M: 3 + 7 = 10 comparisons
  - - - l a u t x e t
Brute Force vs. B-M
Text: This is a sample sentence
Pattern: foobar
Brute force: 25 comparisons
  - - - - - - - - - - - - - - - - -
B-M: 5 comparisons
  - - - - -
Boyer Moore
• Ideas
  • Scan pattern from right to left (and target from left to right)
  • Allows for bigger jumps on early failures
  • Could use a table similar to KMP.
  • But follow a better idea: use information about T as well as P in deciding what to do next.
  • If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.
Boyer Moore matcher
static int[] buildLast(char[] P) {
    // last[c] = index of the last occurrence of c in P, or -1 if c is not in P.
    // The table covers 7-bit ASCII only (character codes 0..127).
    int[] last = new int[128];
    // Default: the mismatch char is nowhere in the pattern.
    // last says "jump the distance".
    for (int i = 0; i < 128; i++) last[i] = -1;
    // The mismatch is a pattern char.
    // last says "jump to align pattern with last instance of this char".
    for (int j = 0; j < P.length; j++) last[P[j]] = j;
    return last;
}
Boyer Moore matcher
static int match(char[] T, char[] P) {
    int[] last = buildLast(P);
    int n = T.length;
    int m = P.length;
    if (m == 0) return 0;       // empty pattern matches at position 0
    int i = m-1;                // position in T
    int j = m-1;                // position in P; scan right to left
    if (i > n-1) return -1;     // pattern longer than text
    do {
        if (P[j] == T[i]) {
            if (j == 0) return i;
            else { i--; j--; }
        } else {
            // use last to determine the next value for i
            i = i + m - Math.min(j, 1 + last[T[i]]);
            j = m - 1;
        }
    } while (i <= n-1);
    return -1;
}
KMP vs. B-M
Text: 1234561234356
Pattern: 7777777
KMP: 13 comparisons
  - - - - - - - - - - - - -
B-M: 1 comparison
  -
KMP vs. B-M
Text: This is a string
Pattern: ring
KMP: 16 comparisons
  - - - - - - - - - - - - ring
B-M: 7 comparisons
  - - - g n i r
KMP vs. B-M
Text: This is a string
Pattern: tring
KMP: 16 comparisons
  - - - - - - - - - - - tring
B-M: 8 comparisons
  - - - g n i r t
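The B-M comparison counts in these examples can be reproduced by instrumenting the matcher. The sketch below assumes the simplified Boyer-Moore variant from the slides (right-to-left scan plus the last-occurrence table only, with no good-suffix rule), a non-empty pattern, and 7-bit ASCII input; the class name BoyerMooreCounter and the method matchCounting are hypothetical names for this illustration.

```java
import java.util.Arrays;

final class BoyerMooreCounter {
    /** Returns {index of first match or -1, number of character comparisons}. */
    static int[] matchCounting(String text, String pattern) {
        char[] t = text.toCharArray();
        char[] p = pattern.toCharArray();
        int n = t.length, m = p.length;       // assumption: m >= 1, all chars < 128

        int[] last = new int[128];            // last occurrence of each char in P, or -1
        Arrays.fill(last, -1);
        for (int k = 0; k < m; k++) last[p[k]] = k;

        int comparisons = 0;
        int i = m - 1, j = m - 1;             // start at the right end of the pattern
        while (i <= n - 1) {
            comparisons++;                    // one comparison of P[j] with T[i]
            if (p[j] == t[i]) {
                if (j == 0) return new int[]{i, comparisons};
                i--; j--;                     // keep scanning right to left
            } else {
                // Shift so the mismatched text char lines up with its last
                // occurrence in P, or jump past it if it does not occur at all.
                i = i + m - Math.min(j, 1 + last[t[i]]);
                j = m - 1;
            }
        }
        return new int[]{-1, comparisons};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(matchCounting("This is a string", "ring")));  // prints [12, 7]
        System.out.println(Arrays.toString(matchCounting("1234561234356", "7777777"))); // prints [-1, 1]
    }
}
```

The second call shows the best case: a single comparison against a character absent from the pattern is enough to jump past the whole text.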
Matching Summary
Boyer-Moore Summary
• performs the comparisons from right to left;
• preprocessing phase in O(m+σ) time and space complexity, where σ is the alphabet size;
• searching phase in O(mn) time complexity;
• 3n text character comparisons in the worst case when searching for a non-periodic pattern;
• O(n / m) best performance.
Knuth-Morris-Pratt Summary
• For text, similar performance to brute force
  • Can be slower, due to precomputation
• Works well for self-repetitive patterns in self-repetitive text
• Never decrements i.
  • Matching an input stream …
• Intuition: derives from thinking about a matching FSM.
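Because KMP never decrements i, the text does not have to be held in memory: each character can be read once and then forgotten. The sketch below illustrates this with java.io.Reader as the input stream; the class name StreamKmp, the method indexOf, and the choice to wrap IOException in UncheckedIOException are all assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class StreamKmp {
    /** Failure table: f[i] = longest proper prefix of P[0..i] that is also a suffix of it. */
    static int[] computeF(char[] p) {
        int[] f = new int[p.length];
        int i = 1, j = 0;
        while (i < p.length) {
            if (p[j] == p[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];
            else { f[i] = 0; i++; }
        }
        return f;
    }

    /** Returns the offset of the first occurrence of pattern in the stream, or -1. */
    static long indexOf(Reader in, String pattern) {
        char[] p = pattern.toCharArray();
        int m = p.length;
        if (m == 0) return 0;                      // empty pattern matches at offset 0
        int[] f = computeF(p);
        long i = 0;                                // characters consumed so far
        int j = 0;                                 // pattern characters currently matched
        try {
            int c;
            while ((c = in.read()) != -1) {        // each text character is read exactly once
                while (j > 0 && p[j] != (char) c) j = f[j - 1];   // fall back within P only
                if (p[j] == (char) c) j++;
                i++;
                if (j == m) return i - m;          // match ends at the character just read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf(new StringReader("abacaabaccabacabaabb"), "abacab")); // prints 10
    }
}
```

Brute force cannot be written this way without buffering, since a failed alignment sends it back to a text position it has already passed.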
Karp and Rabin
• In 1980, Karp and Rabin discovered a simpler algorithm.
  • Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
  • Compute the hashes in a cumulative way, so each T[i] needs to be seen only once.
  • Average case time is O(M+N).
  • Worst case is unlikely (all collisions) at O(MN).
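The cumulative hashing idea can be sketched with a polynomial rolling hash: when the window slides by one position, the outgoing character's contribution is subtracted and the incoming character is appended, so each update costs O(1). The radix 256 and the prime modulus 1,000,000,007 below are illustrative assumptions, not values from the lecture, and the class name KarpRabin and method match are hypothetical. Equal hashes are confirmed with a direct comparison, which is where the O(MN) worst case comes from.

```java
final class KarpRabin {
    private static final long BASE = 256;            // assumed radix
    private static final long MOD = 1_000_000_007L;  // assumed prime modulus

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int match(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1;                               // BASE^(m-1) mod MOD
        for (int k = 1; k < m; k++) high = (high * BASE) % MOD;

        long hp = 0, ht = 0;                         // hash of P and of the current window of T
        for (int k = 0; k < m; k++) {
            hp = (hp * BASE + pattern.charAt(k)) % MOD;
            ht = (ht * BASE + text.charAt(k)) % MOD;
        }

        for (int i = 0; ; i++) {
            // Hashes can collide, so verify character by character before reporting.
            if (hp == ht && text.regionMatches(i, pattern, 0, m)) return i;
            if (i + m >= n) return -1;               // no window left to slide to
            // Roll the hash: drop text[i], then append text[i+m].
            ht = (ht - text.charAt(i) * high % MOD + MOD) % MOD;
            ht = (ht * BASE + text.charAt(i + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb", "abacab")); // prints 10
    }
}
```

With a reasonably large modulus, false hash matches are rare, so the verification step seldom runs on non-matching windows and the expected running time stays linear.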
Next
• Go to recitation Wednesday
  • Discuss more about string matching algorithms
  • Very important!
• On Thursday, we will discuss Union Find
  • Many many applications
  • Read chapter 24
• Work on Homework 6
End