COMP170 Tutorial 13: Pattern Matching

Date post: 15-Jan-2016
Upload: anise
Transcript
Page 1: COMP170 Tutorial 13: Pattern Matching

COMP170 Tutorial 13: Pattern Matching

[Figure: text T = "a b a c a a b" with pattern P = "a b a c a b" shown at two alignments; the character comparisons are numbered 1 to 4.]

Page 2: COMP170 Tutorial 13: Pattern Matching

Overview

1. What is Pattern Matching?

2. The Naive Algorithm

3. The Boyer-Moore Algorithm

4. The Rabin-Karp Algorithm

5. Questions

Page 3: COMP170 Tutorial 13: Pattern Matching

1. What is Pattern Matching?

Definition:
– given a text string T and a pattern string P, find the pattern inside the text
– T: "the rain in spain stays mainly on the plain"
– P: "n th"

Applications:
– text editors, search engines (e.g. Google), image analysis
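As a quick illustration of the definition above, the slide's own T and P can be fed to Python's built-in `str.find`, which returns the index of the first occurrence or -1 if there is none. This is only a sketch of the problem statement, not one of the algorithms discussed later.

```python
# Text and pattern taken from the slide.
T = "the rain in spain stays mainly on the plain"
P = "n th"

# str.find returns the lowest index where P starts inside T, or -1.
position = T.find(P)
print(position)                        # index where the pattern starts
print(T[position:position + len(P)])   # the matched fragment of T
```

The match is the "n th" that spans the words "on the".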

Page 4: COMP170 Tutorial 13: Pattern Matching

String Concepts

Assume S is a string of size m.

A substring S[i .. j] of S is the string fragment between indexes i and j, inclusive.

A prefix of S is a substring S[0 .. i].

A suffix of S is a substring S[i .. m-1].

– i is any index between 0 and m-1

Page 5: COMP170 Tutorial 13: Pattern Matching

Examples

S: a n d r e w (indexes 0 to 5)

Substring S[1..3] == "ndr"

All possible prefixes of S:
– "andrew", "andre", "andr", "and", "an", "a"

All possible suffixes of S:
– "andrew", "ndrew", "drew", "rew", "ew", "w"
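The substring, prefix and suffix definitions can be checked with a few lines of Python. The helper name `substring` is made up for this sketch; it only translates the slide's inclusive S[i .. j] notation into Python's half-open slices.

```python
S = "andrew"
m = len(S)

def substring(s, i, j):
    """Return s[i .. j] with both ends inclusive, as in the slide notation."""
    return s[i:j + 1]

# Prefixes are S[0 .. i] and suffixes are S[i .. m-1], for every index i.
prefixes = [substring(S, 0, i) for i in range(m - 1, -1, -1)]
suffixes = [substring(S, i, m - 1) for i in range(m)]

print(substring(S, 1, 3))
print(prefixes)
print(suffixes)
```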

Page 6: COMP170 Tutorial 13: Pattern Matching

2. The Naive Algorithm

Check each position in the text T to see if the pattern P starts in that position

[Figure: T = "andrew" and P = "rew". P is tried at the first position of T, then at the second position, and so on.]

P moves 1 char at a time through T.

Page 7: COMP170 Tutorial 13: Pattern Matching

Algorithm and Analysis

Brute force (n is the length of T, m is the length of P)

Naive-Search(T,P)
  for s ← 0 to n – m
    j ← 0
    // check if T[s..s+m–1] = P[0..m–1]
    while T[s+j] = P[j] do
      j ← j + 1
      if j = m return s
  return –1
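The pseudocode above translates almost line by line into Python. This is a minimal sketch; the function name `naive_search` is chosen for illustration, and the `j < m` bound is written into the loop condition instead of returning from inside the loop.

```python
def naive_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for s in range(n - m + 1):          # every possible start position
        j = 0
        # check whether text[s .. s+m-1] equals pattern[0 .. m-1]
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            return s
    return -1

print(naive_search("andrew", "rew"))
print(naive_search("andrew", "xyz"))
```

For T = "andrew" and P = "rew" the pattern is tried at positions 0, 1, 2 and finally matches at position 3.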

Page 8: COMP170 Tutorial 13: Pattern Matching

The brute force algorithm is fast when the alphabet of the text is large
– e.g. A..Z, a..z, 1..9, etc.

It is slower when the alphabet is small
– e.g. 0, 1 (as in binary files, image files, etc.)

Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"

Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
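To make the difference between the two examples concrete, the brute force search can be instrumented to count character comparisons. This is a sketch under the assumption that one comparison means one test of a text character against a pattern character; the name `count_comparisons` is hypothetical.

```python
def count_comparisons(text, pattern):
    """Brute force search returning (match position or -1, comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1            # one text/pattern character test
            if text[s + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return s, comparisons
    return -1, comparisons

worst_T, worst_P = "aaaaaaaaaaaaaaaaaaaaaaaaaah", "aaah"
avg_T, avg_P = "a string searching example is standard", "store"

print(count_comparisons(worst_T, worst_P))
print(count_comparisons(avg_T, avg_P))
```

In the worst-case example every start position costs all m comparisons, so the total is m(n – m + 1). In the second example most start positions fail on the first character, so the count stays close to n.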

Page 9: COMP170 Tutorial 13: Pattern Matching

Reverse naive algorithm

Why not search from the end of P?
– Boyer and Moore

Reverse-Naive-Search(T,P)
  for s ← 0 to n – m
    j ← m – 1 // start from the end
    // check if T[s..s+m–1] = P[0..m–1]
    while T[s+j] = P[j] do
      j ← j – 1
      if j < 0 return s
  return –1

Running time is exactly the same as that of the naive algorithm…

Page 10: COMP170 Tutorial 13: Pattern Matching

3. The Boyer-Moore Algorithm

The Boyer-Moore pattern matching algorithm is based on two techniques.

1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end

Page 11: COMP170 Tutorial 13: Pattern Matching

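2. The character-jump technique
– when a mismatch occurs at T[i], i.e. T[i] ≠ P[m-1]
– the character P[m-1] in the pattern is not the same as T[i]

There are 2 possible cases.

[Figure: the text T has the character x at position i; the last character of P, b, is aligned under it and does not match.]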

Page 12: COMP170 Tutorial 13: Pattern Matching

Case 1

If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i].

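[Figure: before the shift, the last character b of P sits under T[i] = x, and P contains an x further to the left. After the shift, that x in P is aligned under T[i].]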

Page 13: COMP170 Tutorial 13: Pattern Matching

Case 2

If the character T[i] does not appear in P, then shift P to align P[0] with T[i+1].

[Figure: T[i] = x and there is no x in P. After the shift, P[0] is aligned under T[i+1]; the new comparison position in T is marked inew.]

Page 14: COMP170 Tutorial 13: Pattern Matching

Case 3

If T[i] = P[m-1] and the match is incomplete, align T[i] with the last occurrence of T[i] in P before position m-1.

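[Figure: T[i] = a matches the last character of P, but the character to its left does not match. After the shift, an earlier a in P is aligned under T[i]; the new comparison position in T is marked inew.]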

Page 15: COMP170 Tutorial 13: Pattern Matching

Boyer-Moore Example (1)

T: a pattern matching algorithm
P: rithm

[Figure: P is shown at seven successive alignments under T; the character comparisons are numbered 1 to 11.]

Page 16: COMP170 Tutorial 13: Pattern Matching

Boyer-Moore algorithm

To implement, we need to find out for each character c in the alphabet, the amount of shift needed if P[m-1] aligns with the character c in the input text and they don’t match.

This takes O(m + A) time, where A is the number of possible characters. Afterwards, matching P with substrings in T is very fast in practice.

Example: Suppose the alphabet is {a, b, c} and the pattern is ababbb. Then,
– shift[c] = 6
– shift[a] = 3
– shift[b] = 1
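The shift table and the search loop that uses it can be sketched as follows. This assumes the simplified rule described on the slides, where the shift depends only on the text character aligned with P[m-1]; that rule is commonly known as Horspool's variant of Boyer-Moore. The names `build_shift_table` and `shift_table_search` are chosen for this example.

```python
def build_shift_table(pattern, alphabet):
    """Shift for each character c when c is the text character under P[m-1]."""
    m = len(pattern)
    shift = {c: m for c in alphabet}    # c not in P: jump past the whole pattern
    for k in range(m - 1):              # P[m-1] itself is deliberately skipped
        shift[pattern[k]] = m - 1 - k   # later (rightmost) occurrences overwrite
    return shift

def shift_table_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    shift = build_shift_table(pattern, set(text) | set(pattern))
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and text[s + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            return s
        s += shift[text[s + m - 1]]     # jump according to the last window char
    return -1

print(build_shift_table("ababbb", "abc"))
print(shift_table_search("a pattern matching algorithm", "rithm"))
```

Building the table takes one pass over the alphabet and one pass over the pattern, which is the O(m + A) preprocessing cost stated above.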

Page 17: COMP170 Tutorial 13: Pattern Matching

Analysis

Boyer-Moore worst case running time is O(nm + A)

But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.

– e.g. good for English text, poor for binary

Boyer-Moore is significantly faster than brute force for searching English text.

Page 18: COMP170 Tutorial 13: Pattern Matching

Fingerprint idea

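Assume:
– We can compute a fingerprint f(P) of P in O(m) time.
– If f(P) ≠ f(T[s .. s+m–1]), then P ≠ T[s .. s+m–1]
– We can compare fingerprints in O(1)
– We can compute f’ = f(T[s+1 .. s+m]) from f(T[s .. s+m–1]) in O(1)

[Figure: a window of length m over T with fingerprint f, and the next window, one character to the right, with fingerprint f’.]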

Page 19: COMP170 Tutorial 13: Pattern Matching

Algorithm with Fingerprints

Let the alphabet be {0,1,2,3,4,5,6,7,8,9}

Let the fingerprint be just a decimal number, i.e., f("1045") = 1*10^3 + 0*10^2 + 4*10^1 + 5 = 1045

Fingerprint-Search(T,P)
  fp ← compute f(P)
  f ← compute f(T[0..m–1])
  for s ← 0 to n – m do
    if fp = f return s
    if s < n – m then
      f ← (f – T[s]*10^(m-1))*10 + T[s+m]
  return –1

[Figure: the fingerprint f of the current window becomes the new f by dropping the leading digit T[s] and appending T[s+m].]

Running time O(m+n)

Where is the catch?
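A direct Python rendering of Fingerprint-Search is sketched below, assuming the text and pattern are strings of decimal digits. Python integers have arbitrary precision, so the code runs for any m, but each arithmetic step on an m-digit number is then no longer a constant-time operation. The function name is made up for this example.

```python
def fingerprint_search(text, pattern):
    """Search a digit string using whole-number fingerprints."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    fp = int(pattern)                   # f(P)
    f = int(text[:m])                   # f(T[0 .. m-1])
    high = 10 ** (m - 1)                # weight of the leading digit
    for s in range(n - m + 1):
        if fp == f:
            return s
        if s < n - m:                   # no next window after the last shift
            # drop the leading digit T[s], shift left, append T[s+m]
            f = (f - int(text[s]) * high) * 10 + int(text[s + m])
    return -1

print(fingerprint_search("2531978", "1978"))
```

On T = "2531978" the window fingerprints are 2531, 5319, 3197 and 1978, so the pattern "1978" is found at shift 3.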

Page 20: COMP170 Tutorial 13: Pattern Matching

Using a Hash Function

Problem:
– we cannot assume we can do arithmetic with m-digit-long numbers in O(1) time

Solution: Use a hash function h = f mod q
– For example, if q = 7, h("52") = 52 mod 7 = 3
– h(S1) ≠ h(S2) ⇒ S1 ≠ S2
– But h(S1) = h(S2) does not imply S1 = S2!
– For example, if q = 7, h("73") = 3, but "73" ≠ "52"

Basic "mod q" arithmetic:
– (a+b) mod q = (a mod q + b mod q) mod q
– (a*b) mod q = ((a mod q)*(b mod q)) mod q

Page 21: COMP170 Tutorial 13: Pattern Matching

Preprocessing and Stepping

Preprocessing:
– fp = (P[m-1] + 10*(P[m-2] + 10*(P[m-3] + … + 10*(P[1] + 10*P[0])…))) mod q
– In the same way compute ft from T[0..m-1]
– Example: P = "2531", q = 7, what is fp?

Stepping:
– ft = ((ft – T[s]*(10^(m-1) mod q))*10 + T[s+m]) mod q
– 10^(m-1) mod q can be computed once in the preprocessing
– Example: Let T[…] = "5319", q = 7, what is the corresponding ft?

[Figure: the fingerprint ft of the current window becomes the new ft by dropping T[s] and appending T[s+m].]
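The two example questions can be worked through in a few lines of Python. This sketch assumes the window slides from "2531" to "5319", i.e. the digit 2 leaves and the digit 9 enters; the helper names `fingerprint_mod` and `step` are made up for illustration.

```python
def fingerprint_mod(digits, q):
    """Horner's rule: evaluate the digit string modulo q."""
    h = 0
    for ch in digits:
        h = (10 * h + int(ch)) % q
    return h

def step(ft, outgoing, incoming, c, q):
    """Slide the window one position: drop `outgoing`, append `incoming`."""
    # Python's % always returns a value in 0..q-1, even when the left side
    # is negative; in C-like languages add a multiple of q before taking %.
    return ((ft - int(outgoing) * c) * 10 + int(incoming)) % q

q, m = 7, 4
c = pow(10, m - 1, q)                   # 10^(m-1) mod q, computed once
fp = fingerprint_mod("2531", q)
ft = step(fp, "2", "9", c, q)           # window 2531 -> 5319

print(c, fp, ft)
```

Here 10^3 mod 7 = 6 and fp = 2531 mod 7 = 4. Stepping gives ((4 – 2*6)*10 + 9) mod 7 = (–71) mod 7 = 6, which agrees with computing 5319 mod 7 directly.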

Page 22: COMP170 Tutorial 13: Pattern Matching

Rabin-Karp Algorithm

Rabin-Karp-Search(T,P)
  q ← a prime larger than m
  c ← 10^(m-1) mod q // run a loop multiplying by 10 mod q
  fp ← 0; ft ← 0
  for i ← 0 to m-1 // preprocessing
    fp ← (10*fp + P[i]) mod q
    ft ← (10*ft + T[i]) mod q
  for s ← 0 to n – m // matching
    if fp = ft then // run a loop to compare strings
      if P[0..m-1] = T[s..s+m-1] return s
    if s < n – m then
      ft ← ((ft – T[s]*c)*10 + T[s+m]) mod q
  return –1

How many character comparisons are done if T = "2531978" and P = "1978"?
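A runnable sketch of the pseudocode above, again assuming decimal-digit strings. The prime q is passed in as a parameter here rather than chosen by the function; the default of 101 is an arbitrary choice for illustration.

```python
def rabin_karp_search(text, pattern, q=101):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    c = pow(10, m - 1, q)               # 10^(m-1) mod q
    fp = ft = 0
    for i in range(m):                  # preprocessing
        fp = (10 * fp + int(pattern[i])) % q
        ft = (10 * ft + int(text[i])) % q
    for s in range(n - m + 1):          # matching
        # equal fingerprints are only a hint: confirm with a string comparison
        if fp == ft and text[s:s + m] == pattern:
            return s
        if s < n - m:                   # roll the fingerprint to the next window
            ft = ((ft - int(text[s]) * c) * 10 + int(text[s + m])) % q
    return -1

print(rabin_karp_search("2531978", "1978", q=7))
```

With q = 7 both "2531" and "1978" have fingerprint 4, so the first window is a spurious hit that the string comparison rejects before the real match is found at shift 3.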

Page 23: COMP170 Tutorial 13: Pattern Matching

Analysis

If q is a prime, the hash function distributes m-digit strings evenly among the q values
– Thus, only every q-th value of shift s will result in matching fingerprints (which will require comparing strings with O(m) comparisons)

Expected running time (if q > m):
– Outer loop: O(n-m)
– All inner loops: ((n-m)/q)*m = O(n-m)
– Total time: O(n-m)

Worst-case running time: O((n-m+1)m)

Page 24: COMP170 Tutorial 13: Pattern Matching

Rabin-Karp in Practice

If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm).

Choosing prime q > m can be done with randomized algorithms in O(m), or q can be fixed to be the largest prime so that 10*q fits in a computer word.

Rabin-Karp is simple and can be easily extended to two-dimensional pattern matching.

Page 25: COMP170 Tutorial 13: Pattern Matching

Question 1

What is the worst case complexity of the Naïve algorithm? Find an example of the worst case.

What is the worst case complexity of the BM algorithm? Find an example of the worst case.

Page 26: COMP170 Tutorial 13: Pattern Matching

Question 2

Illustrate how BM works for the following pattern matching problem.

T: abacaabadcabacabaabb
P: abacab

Page 27: COMP170 Tutorial 13: Pattern Matching

Answer to question 1

Example of a worst case for the Naïve algorithm:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"

Time complexity O(mn)

Page 28: COMP170 Tutorial 13: Pattern Matching

BM Worst Case Example

T: "aaaaa…a"
P: "baaaaa"

Complexity
– O(mn+A)

[Figure: P = "baaaaa" is shown at four successive alignments under T = "aaaaaaaaa"; all six characters are compared at each alignment, giving the comparisons numbered 1 to 24.]
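The count of 24 comparisons in this worst case can be reproduced with a small instrumented search. The sketch assumes the last-character shift rule from the earlier slides (Horspool's variant), a nine-character text as in the figure, and a hypothetical function name `count_bm_comparisons`.

```python
def count_bm_comparisons(text, pattern):
    """Right-to-left search with last-character shifts; returns (pos, comparisons)."""
    n, m = len(text), len(pattern)
    shift = {}
    for k in range(m - 1):              # rightmost occurrence before P[m-1] wins
        shift[pattern[k]] = m - 1 - k
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1            # one text/pattern character test
            if text[s + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return s, comparisons
        s += shift.get(text[s + m - 1], m)   # characters not in P shift by m
    return -1, comparisons

print(count_bm_comparisons("a" * 9, "baaaaa"))
```

Every alignment matches five a's from the right and fails only on the leading b, and shift[a] is 1, so the pattern advances one position at a time: 4 alignments of 6 comparisons each.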

Page 29: COMP170 Tutorial 13: Pattern Matching

Answer to question 2

T: a b a c a a b a d c a b a c a b a a b b
P: a b a c a b

[Figure: P is shown at successive alignments under T, with numbered character comparisons.]
