String Matching 15-211 Fundamental Data Structures and Algorithms April 22, 2003.

String Matching

15-211 Fundamental Data Structures and Algorithms

April 22, 2003

Announcements

• Quiz #4 is available after class today, until Wednesday midnight.

• Homework 6 is out!
  • Due on Thursday, May 1, 11:59pm
  • Tournament will run on May 7; details to come…

• Final exam on May 8, 8:30am-11:30am, UC McConomy
  • Review on May 4, details TBA

String Matching

Why String Matching?

• Finding patterns in documents formed using a large alphabet
  • Word processing – search/modify/replace
  • Web searching – search/display

• Applications in Molecular Biology
  • Biological molecules can often be approximated as sequences of amino acids
  • Very large volumes of data – doubles every 18 months
  • Need efficient string matching algorithms

• Applications in systems and software design

• Main data form used to exchange information: TEXT
  • So text pattern matching is very important

• Big Question: given a string T of length n and a pattern P of length m (m <= n), how do we find any or all occurrences of pattern P in T?

String Matching

• Text string T[0..N-1]: T = “abacaabaccabacabaabb”

• Pattern string P[0..M-1]: P = “abacab”

• Think of a naïve algorithm to find the pattern P in T

• How much work is needed to determine that? Can we do better?

• Better String Matching Algorithms
  • Use finite automata
  • Use combinatorial properties

String Matching

• Let T and P be strings built over a finite alphabet Σ with |Σ| = σ

• Where is the first instance of P in T? T[10..15] = P[0..5]

String Matching

abacaabaccabacabaabb
abacab
 abacab
  abacab
   abacab
    abacab
     abacab
      abacab
       abacab
        abacab
         abacab
          abacab

• The brute force algorithm

• 22+6=28 comparisons.

• Brute Force Algorithm requires O(nm) operations

Brute Force, v.1

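static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    for (int i=0; i<=n-m; i++) {            // try each alignment of P in T
        int j = 0;
        while (j<m && T[i+j]==P[j]) j++;    // extend the match as far as it goes
        if (j==m) return i;                 // all m characters matched
    }
    return -1;                              // no occurrence
}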

Brute Force, v.2 (one loop)

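static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int i = 0;                              // position in T
    int j = 0;                              // position in P
    // A while loop (rather than do-while) avoids reading T[0] or P[0]
    // when either array is empty.
    while (j<m && i<n) {
        if (T[i]==P[j]) { i++; j++; }
        else { i=i-j+1; j=0; }              // back up i and restart the pattern
    }
    if (j==m) return i-m;
    else return -1;
}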

String Matching

• Where is the first instance of P in T? T[10..15] = P[0..5]

• In general, how many comparisons T[i] = P[j] ? are needed to do the search?
  • Worst case: O(NM)

A bad case

00000000000000001
0000-
 0000-
  0000-
   0000-
    0000-
     0000-
      0000-
       0000-
        0000-
         0000-
          0000-
           0000-
            00001

• 60+5 = 65 comparisons are needed

• How many of them could be avoided?

Typical text matching

This is a sample sentence

- - - s- - - s- - - - s- - - - - - - sente

• 20+5=25 comparisons are needed

(The match is near the same point in the target string as the previous example.)

• In practice, 0 ≤ j ≤ 2
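The comparison counts quoted on the brute-force slides can be checked with a small instrumented version of the v.1 matcher. This is only a sketch: the class name `BruteForceCount`, the method `matchCounting`, and the choice of "sente" as the pattern for the sample sentence (read off the figure row) are assumptions made for illustration.

```java
// Sketch: brute-force matcher in the style of "Brute Force, v.1", instrumented
// to count character comparisons. All names here are illustrative.
class BruteForceCount {
    /** Returns {index of first match or -1, number of T[i+j] vs P[j] comparisons}. */
    static int[] matchCounting(String text, String pattern) {
        char[] T = text.toCharArray();
        char[] P = pattern.toCharArray();
        int n = T.length, m = P.length;
        int comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;                  // one comparison of T[i+j] with P[j]
                if (T[i + j] != P[j]) break;
                j++;
            }
            if (j == m) return new int[] { i, comparisons };
        }
        return new int[] { -1, comparisons };
    }

    public static void main(String[] args) {
        int[] a = matchCounting("abacaabaccabacabaabb", "abacab");
        System.out.println(a[0] + " " + a[1]);  // 10 28
        int[] b = matchCounting("00000000000000001", "00001");
        System.out.println(b[0] + " " + b[1]);  // 12 65
        int[] c = matchCounting("This is a sample sentence", "sente");
        System.out.println(c[0] + " " + c[1]);  // 17 25
    }
}
```

In the sample-sentence run, no failed attempt gets past j = 1, which is the behaviour the last bullet describes for typical text.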

String Matching

• Brute force worst case O(MN)
  • Expensive for long patterns in repetitive text

• How to improve on this?

• Intuition:
  • Don’t look at the text more than once.
  • Remember what is learned from previous matches

Motivation with FSM

• Consider the alphabet {a,b,c} and the FSM given below

• What is a language accepted by this FSM?

• What can we learn from this FSM?

[Figure: FSM diagram; node labels Start, 1, 2, 3, 4, End; edge labels a, a, b, c]

Clever string matching

• 1970. Cook published an abstract result about machine models
  • Match in O(N+M) vs. O(MN)?!

• Knuth and Pratt studied it and refined it into a simple algorithm.

• Morris, annoyed at a design problem in implementing a text editor, discovered the same algorithm.
  • How to avoid decrementing i?

• KMP published together in 1977.

String Matching

• Meanwhile …

• Boyer and Moore discovered another algorithm that is even faster (for some uses) in the average case.

• Gosper independently discovered the same algorithm.

• Boyer and Moore published in 1977.

String Matching

• In 1980, Karp and Rabin discovered a simpler algorithm.
  • Uses hashing idea: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.

Knuth Morris Pratt

The KMP idea

• Take advantage of what we already know during the match process.

• Suppose P = 1000000

• Suppose P[0..5] matches T[10..15]
  • Suppose P[6] ≠ T[16]

• Suppose we know that P[0] ≠ any of T[11..15]

• And the next possible match is P[0] ? T[16]

KMP example

• Match fails: T[i] ≠ P[j] at i = 6, j = 6

• Next match attempt: i = 6, j = 0

10000010000000000
100000-
      1000000

Brute Force vs. KMP

• A worse case example

T = 0000000000000000000000000001
P = 00000000000001

• Brute force: 196 + 14 = 210 comparisons
  [Figure: P tried at each of 15 shifts; each of the first 14 attempts matches 13 zeros and then fails (0000000000000-); the last attempt matches 00000000000001]

• KMP: 28 + 14 = 42 comparisons
  [Figure: 0000000000000- followed by repeated 0- steps and a final 01]

Brute Force vs. KMP

T = abcdeabcdeabcedfghijkl
P = bcedfg

• Brute force: 21 comparisons
  - bc- - - - - bc- - - - - bcedfg

• KMP: 19 comparisons, plus 5 preparation comparisons
  - bc- - - - bc- - - - bcedfg

KMP – The Big Idea

• Retain information from prior attempts.

• Compute in advance how far to jump in P when a match fails.
  • Suppose the match fails at P[j] ≠ T[i+j].
  • Then we know P[0 .. j-1] = T[i .. i+j-1].

• We must next try P[0] ? T[i+1].
  • But we know T[i+1] = P[1]
  • So there is another way to compare: P[1] ? P[0]

• If so, increment j by 1. No need to look at T.
  • What if P[1]=P[0] and P[2]=P[1]?

• Then increment j by 2. Again, no need to look at T.

• In general, we can determine how far to jump without any knowledge of T!

Implementing KMP

• Never decrement i, ever.
  • Comparing T[i] with P[j].

• Compute a table f of how far to jump j forward when a match fails.
  • The next match will compare T[i] with P[f[j-1]]

• Do this by matching P against itself in all positions.

Building the Table for f

• P = 1010011

• Find self-overlaps

Prefix    Overlap  j  f
1         .        1  0
10        .        2  0
101       1        3  1
1010      10       4  2
10100     .        5  0
101001    1        6  1
1010011   1        7  1

What f means

• If f is zero, there is no self-match. This is good news:
  • Set j=0
  • Do not change i.

• The next match is T[i] ? P[0]

• f non-zero implies there is a self-match. This is bad news:
  • E.g., f=2 means P[0..1] = P[j-2..j-1]

• Hence must start new comparison at j-2, since we know T[i-2..i-1] = P[0..1]

• In general:
  • Set j=f[j-1]
  • Do not change i.

• The next match is T[i] ? P[f[j-1]]

Favorable conditions

• P = 1234567

• Find self-overlaps

Prefix    Overlap  j  f
1         .        1  0
12        .        2  0
123       .        3  0
1234      .        4  0
12345     .        5  0
123456    .        6  0
1234567   .        7  0

Mixed conditions

• P = 1231234

• Find self-overlaps

Prefix    Overlap  j  f
1         .        1  0
12        .        2  0
123       .        3  0
1231      1        4  1
12312     12       5  2
123123    123      6  3
1231234   .        7  0

Poor conditions

• P = 1111110

• Find self-overlaps

Prefix    Overlap  j  f
1         .        1  0
11        1        2  1
111       11       3  2
1111      111      4  3
11111     1111     5  4
111111    11111    6  5
1111110   .        7  0
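The four overlap tables above can be reproduced straight from the definition: for each prefix, find the longest proper prefix that is also a suffix. The sketch below does this by direct string comparison, without the faster table construction; the class `OverlapTable` and method `overlaps` are illustrative names, not part of the lecture code. The slides number prefixes by length j = 1..m, while the array here is 0-based, so the slide entry for prefix length j is `f[j-1]`.

```java
import java.util.Arrays;

// Sketch: self-overlap table computed directly from its definition.
class OverlapTable {
    /** f[k] = length of the longest proper prefix of P[0..k] that is also its suffix. */
    static int[] overlaps(String P) {
        int m = P.length();
        int[] f = new int[m];
        for (int k = 0; k < m; k++) {
            String prefix = P.substring(0, k + 1);
            // Try the longest candidate first; len <= k keeps the overlap proper.
            for (int len = k; len > 0; len--) {
                String suffix = prefix.substring(prefix.length() - len);
                if (prefix.startsWith(suffix)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(overlaps("1010011"))); // [0, 0, 1, 2, 0, 1, 1]
        System.out.println(Arrays.toString(overlaps("1234567"))); // [0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(overlaps("1231234"))); // [0, 0, 0, 1, 2, 3, 0]
        System.out.println(Arrays.toString(overlaps("1111110"))); // [0, 1, 2, 3, 4, 5, 0]
    }
}
```

This version is easy to check by hand but far slower than linear time on long patterns, so it is suited to verifying tables rather than to use inside a matcher.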

KMP matcher

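static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    if (m == 0) return 0;               // guard: an empty pattern matches at 0
    int[] f = computeF(P);              // self-overlap table for P
    int i = 0;                          // position in T; never decremented
    int j = 0;                          // position in P
    while (i<n) {
        if (P[j]==T[i]) {
            if (j==m-1) return i-m+1;   // whole pattern matched
            i++; j++;
        }
        else if (j>0) j=f[j-1];         // use f to determine next value for j
        else i++;
    }
    return -1;
}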

KMP pre-process

static int[] computeF(char[] P) {
    int m = P.length;
    int[] f = new int[m];
    f[0] = 0;
    int i = 1;                          // end of the prefix P[0..i] being examined
    int j = 0;                          // length of the current self-overlap
    while (i<m) {
        if (P[j]==P[i]) {
            f[i] = j+1;
            i++; j++;
        }
        else if (j>0) j=f[j-1];         // use previous values of f
        else { f[i] = 0; i++; }
    }
    return f;
}

KMP Performance

• At each iteration, one of three cases:
  • T[i] = P[j]: i increases
  • T[i] ≠ P[j] and j>0: i-j increases
  • T[i] ≠ P[j] and j=0: i increases and i-j increases

• Hence, maximum of 2N iterations.

• Constructing f[] needs 2M iterations.

• Thus worst case performance is O(N+M).
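As a rough check of these bounds, the matcher and table builder can be instrumented to count comparisons. The sketch below follows the lecture's `match` and `computeF` logic; the class name `KmpCount`, the two counter fields, and the `zeros` helper are assumptions added for illustration.

```java
// Sketch: KMP with comparison counters (names and counters are illustrative).
class KmpCount {
    static int prepComparisons;   // P[j] vs P[i] comparisons while building f
    static int scanComparisons;   // P[j] vs T[i] comparisons while scanning T

    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            prepComparisons++;
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];
            else { f[i] = 0; i++; }
        }
        return f;
    }

    static int match(String text, String pattern) {
        char[] T = text.toCharArray(), P = pattern.toCharArray();
        prepComparisons = 0;
        scanComparisons = 0;
        int n = T.length, m = P.length;
        if (m == 0) return 0;
        int[] f = computeF(P);
        int i = 0, j = 0;
        while (i < n) {
            scanComparisons++;
            if (P[j] == T[i]) {
                if (j == m - 1) return i - m + 1;
                i++; j++;
            }
            else if (j > 0) j = f[j - 1];
            else i++;
        }
        return -1;
    }

    /** Helper: a string of k '0' characters. */
    static String zeros(int k) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < k; c++) sb.append('0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Worse-case slide: 27 zeros then 1, pattern 13 zeros then 1.
        int at = match(zeros(27) + "1", zeros(13) + "1");
        System.out.println(at + " " + scanComparisons);                        // 14 42
        at = match("abcdeabcdeabcedfghijkl", "bcedfg");
        System.out.println(at + " " + scanComparisons + " " + prepComparisons); // 11 19 5
    }
}
```

Both runs stay under 2N scan comparisons and 2M preparation comparisons, in line with the iteration bounds stated above.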

KMP Summary

• performs the comparisons from left to right;

• preprocessing phase in O(m) space and time complexity;

• searching phase in O(n+m) time complexity (independent from the alphabet size);

• performs at most 2n-1 text character comparisons during the scan of the text;

Boyer Moore

Brute Force vs. B-M

• Brute force: 15 + 6 = 21 comparisons
  - bc- - - - - bc- - - - - bcedfg

• B-M (pattern compared right to left):
  - - g f d e c b

Boyer Moore

• Perhaps the most efficient algorithm for general pattern matching

• Ideas
  • Scan pattern from right to left (and target from left to right)
  • Allows for bigger jumps on early failures
  • Could use a table similar to KMP.
  • But follow a better idea: use information about T as well as P in deciding what to do next.

Brute Force vs. B-M

T = This string is textual
P = textual

• Brute force:
  - - - - - - t- - - - - - - - - textual

• B-M (pattern compared right to left):
  - - - l a u t x e t

Brute Force vs. B-M

• Pattern: foobar

• Brute force: 25 comparisons
  - - - - - - - - - - - - - - - - -

• B-M: 5 comparisons
  - - - - -

Boyer Moore

• Ideas
  • Scan pattern from right to left (and target from left to right)
  • Allows for bigger jumps on early failures
  • Could use a table similar to KMP.
  • But follow a better idea: use information about T as well as P in deciding what to do next.
  • If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.

Boyer Moore matcher

// Assumes every character of P is 7-bit ASCII (code < 128).
static int[] buildLast(char[] P) {
    int[] last = new int[128];

    // Default: the mismatched char is nowhere in the pattern.
    // last says "jump the distance".
    for (int i=0; i<128; i++) last[i] = -1;

    // The mismatched char is a pattern char. last says
    // "jump to align pattern with last instance of this char".
    for (int j=0; j<P.length; j++) last[P[j]] = j;

    return last;
}

Boyer Moore matcher

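static int match(char[] T, char[] P) {
    int[] last = buildLast(P);
    int n = T.length;
    int m = P.length;
    if (m == 0) return 0;               // guard: an empty pattern matches at 0
    int i = m-1;                        // position in T
    int j = m-1;                        // position in P, scanned right to left
    if (i > n-1) return -1;             // pattern is longer than the text
    do {
        if (P[j]==T[i]) {
            if (j==0) return i;         // whole pattern matched
            else { i--; j--; }
        }
        else {
            // Use last to determine next value for i.
            // T[i] must be < 128, as buildLast assumes.
            i = i + m - Math.min(j, 1 + last[T[i]]);
            j = m - 1;
        }
    } while (i <= n-1);
    return -1;
}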

KMP vs. B-M

T = 1234561234356
P = 7777777

• KMP: 13 comparisons
  - - - - - - - - - - - - -

• B-M: 1 comparison

KMP vs. B-M

T = This is a string
P = ring

• KMP: 16 comparisons
  - - - - - - - - - - - - ring

• B-M: 7 comparisons
  - - - g n i r

KMP vs. B-M

T = This is a string
P = tring

• KMP: 16 comparisons
  - - - - - - - - - - - tring

• B-M: 8 comparisons
  - - - g n i r t
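The B-M counts in these three comparisons can be reproduced by adding a counter to the matcher. The sketch below uses the same `last` table and jump rule as the lecture code; the class name `BoyerMooreCount` and the `comparisons` field are illustrative, and like `buildLast` it assumes all characters are 7-bit ASCII.

```java
import java.util.Arrays;

// Sketch: the lecture's simplified Boyer-Moore matcher with a comparison counter.
// Assumes every character of the text and pattern has a code below 128.
class BoyerMooreCount {
    static int comparisons;   // P[j] vs T[i] comparisons in the last call to match

    static int[] buildLast(char[] P) {
        int[] last = new int[128];
        Arrays.fill(last, -1);                       // char not in the pattern
        for (int j = 0; j < P.length; j++) last[P[j]] = j;
        return last;
    }

    static int match(String text, String pattern) {
        char[] T = text.toCharArray(), P = pattern.toCharArray();
        comparisons = 0;
        int n = T.length, m = P.length;
        if (m == 0) return 0;
        if (m > n) return -1;
        int[] last = buildLast(P);
        int i = m - 1, j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (P[j] == T[i]) {
                if (j == 0) return i;                // whole pattern matched
                i--; j--;
            } else {
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(match("1234561234356", "7777777") + " " + comparisons); // -1 1
        System.out.println(match("This is a string", "ring") + " " + comparisons); // 12 7
        System.out.println(match("This is a string", "tring") + " " + comparisons); // 11 8
    }
}
```

In the first run the single mismatched character is not in the pattern, so the jump moves past the end of the text after one comparison.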

Matching Summary

Boyer-Moore Summary

• performs the comparisons from right to left;

• preprocessing phase in O(m + σ) time and space complexity, where σ is the alphabet size;

• searching phase in O(mn) time complexity;

• 3n text character comparisons in the worst case when searching for a non periodic pattern;

• O(n / m) best performance.

Knuth-Morris-Pratt Summary

• For text, similar performance to brute force
  • Can be slower, due to precomputation

• Works well for self-repetitive patterns in self-repetitive text

• Never decrements i.
  • Matching an input stream …

• Intuition: derives from thinking about a matching FSM.
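One way to make the FSM intuition concrete is to build a deterministic automaton whose state is the number of pattern characters matched so far, then feed it the text one character at a time. The sketch below uses a well-known textbook table construction rather than anything shown in the lecture; the class `MatchingDfa`, the method names, and the 128-symbol ASCII alphabet are assumptions for illustration.

```java
// Sketch: pattern matching with an explicit DFA over an assumed 128-symbol alphabet.
// State j means "the last j text characters equal P[0..j-1]"; state m accepts.
class MatchingDfa {
    static final int R = 128;   // assumed alphabet size (7-bit ASCII)

    static int[][] buildDfa(char[] P) {
        int m = P.length;
        int[][] dfa = new int[R][m];
        dfa[P[0]][0] = 1;
        // x is the state the automaton would be in after reading P[1..j-1].
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][x];  // mismatch: behave like state x
            dfa[P[j]][j] = j + 1;                               // match: advance
            x = dfa[P[j]][x];
        }
        return dfa;
    }

    static int match(String text, String pattern) {
        char[] T = text.toCharArray(), P = pattern.toCharArray();
        int n = T.length, m = P.length;
        if (m == 0) return 0;
        int[][] dfa = buildDfa(P);
        int i = 0, j = 0;
        while (i < n && j < m) {
            j = dfa[T[i]][j];   // exactly one table lookup per text character
            i++;
        }
        return (j == m) ? i - m : -1;
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb", "abacab")); // 10
        System.out.println(match("10000010000000000", "1000000"));   // 6
    }
}
```

Like KMP, the scan never moves backwards in the text, which is what makes this style of matcher usable on an input stream. The cost is a table with one row per alphabet symbol, compared with the single array f.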

Karp and Rabin

• In 1980, Karp and Rabin discovered a simpler algorithm.
  • Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
  • Compute the hashes in a cumulative way, so each T[i] needs to be seen only once.
  • Average case time is O(M+N).
  • Worst case is unlikely (all collisions) at O(MN).
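A minimal sketch of the cumulative (rolling) hash idea follows. The polynomial hash, the constants `BASE` and `MOD`, and the class name `RabinKarpSketch` are choices made here for illustration; the lecture does not specify a hash function. Each hash hit is confirmed with a direct comparison, so collisions cost time but never produce a wrong answer.

```java
// Sketch: Karp-Rabin matching with a polynomial rolling hash.
// BASE and MOD are illustrative choices, not values from the lecture.
class RabinKarpSketch {
    static final long BASE = 256;
    static final long MOD = 1_000_000_007L;

    static int match(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1;                               // BASE^(m-1) mod MOD
        for (int k = 1; k < m; k++) high = (high * BASE) % MOD;

        long hp = 0, ht = 0;                         // hashes of P and of T[0..m-1]
        for (int k = 0; k < m; k++) {
            hp = (hp * BASE + pattern.charAt(k)) % MOD;
            ht = (ht * BASE + text.charAt(k)) % MOD;
        }

        for (int i = 0; ; i++) {
            // Equal hashes may still be a collision, so verify the characters.
            if (hp == ht && text.regionMatches(i, pattern, 0, m)) return i;
            if (i + m >= n) return -1;
            // Roll the window: drop T[i], append T[i+m]. Each T[i] enters once and leaves once.
            ht = (ht + MOD - (text.charAt(i) * high) % MOD) % MOD;
            ht = (ht * BASE + text.charAt(i + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb", "abacab")); // 10
        System.out.println(match("This is a string", "tring"));      // 11
    }
}
```

The O(MN) worst case corresponds to every window colliding with the pattern's hash, so that the verification step runs at every position.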

• Go to recitation Wednesday
  • Discuss more about string matching algorithms
  • Very important!

• On Thursday, we will discuss Union Find
  • Many, many applications
  • Read chapter 24

• Work on Homework 6
