Speeding up on two string matching algorithms

Advisor: Prof. R. C. T. Lee
Speaker: Kuei-hao Chen

CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W.
Algorithmica, Vol. 12, 1994, pp. 247-267

Problem Definition

• Input: A text T and a pattern P.
• Output: All occurrences of P in T.

Rule 1: The Suffix to Prefix Rule

• For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

Basic Ideas

• Open a window W of size |P| in the text T.
• Find the longest suffix of W that is also a prefix of the pattern P.

Case 1: The longest such suffix is W itself, so W matches P.

Case 2: A shorter suffix of W equals a prefix of P. We shift the window so that this prefix of P is aligned with that suffix of W.

Case 3: If there is no such suffix, we move W by length |P|.

[Figures: the window W of length |P| in T with the pattern P aligned beneath it, one diagram per case.]
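The three cases can be sketched in a few lines of Python. This is only a naive illustration of the shifting rule, not the RF algorithm itself: the helper names `longest_suffix_prefix` and `find_all`, and the `limit` parameter, are hypothetical, and the suffix is found by brute force instead of a suffix tree or suffix automaton.

```python
def longest_suffix_prefix(window, pattern, limit=None):
    """Length of the longest suffix of window that equals a prefix of pattern.

    limit (hypothetical parameter) caps the length that is considered.
    """
    top = len(pattern) if limit is None else limit
    for k in range(min(top, len(window)), 0, -1):
        if window.endswith(pattern[:k]):
            return k
    return 0


def find_all(text, pattern):
    """Report every start position of pattern in text by sliding a window."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    hits, pos = [], 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        k = longest_suffix_prefix(window, pattern)
        if k == m:
            # Case 1: the whole window matches; continue with the longest
            # proper suffix so that overlapping occurrences are not skipped.
            hits.append(pos)
            k = longest_suffix_prefix(window, pattern, limit=m - 1)
        # Case 2 (k > 0) aligns the prefix with the suffix; Case 3 (k == 0)
        # moves the window by the full pattern length.
        pos += m - k
    return hits


print(find_all("abababa", "aba"))
```

The shift `m - k` is safe because any occurrence starting inside the current window would make a suffix of the window equal to a prefix of the pattern, and `k` is the longest such suffix.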

Preprocessing phase

• T = GCATCGCAGAGAGTATACAGTACG
• P = GCAGAGAG
• L(S): the set of all prefixes of the pattern.

$L(S) = \{GCAGAGAG, GCAGAGA, GCAGAG, GCAGA, GCAG, GCA, GC, G\}$

We construct the suffix automaton of P.

[Figure: Suffix Automaton, with states 0 to 8 and transitions labeled G, C and A.]

Preprocessing: Construct a suffix tree of $\bar{P}$, the reversal of P.

Example: P = GCAGAGAG, $\bar{P}$ = GAGAGACG

Suffixes of $\bar{P}$: GAGAGACG, AGAGACG, GAGACG, AGACG, GACG, ACG, CG, G

[Figure: Suffix tree for $\bar{P}$ = GAGAGACG.]
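As a small illustration of why the reversed pattern is used, the following sketch lists the suffixes of the reversal and checks that each one, read backwards, is a prefix of P. The function name `reversed_suffixes` is hypothetical, and the list stands in for the suffix tree, which stores the same strings in compact form.

```python
def reversed_suffixes(pattern):
    """All suffixes of the reversed pattern, longest first."""
    rev = pattern[::-1]
    return [rev[i:] for i in range(len(rev))]


pattern = "GCAGAGAG"
for suffix in reversed_suffixes(pattern):
    # Reading a suffix of the reversal backwards gives a prefix of P.
    assert pattern.startswith(suffix[::-1])
    print(suffix)
```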

• Example 1

W = GCATCGCA, P = GCAGAGAG

We want to find the longest suffix of W which is equal to a prefix of P.

Reversing both strings gives $\bar{W}$ = ACGCTACG and $\bar{P}$ = GAGAGACG.

We find that ACG (a prefix of $\bar{W}$, that is, a suffix of W read backwards) is a suffix of $\bar{P}$ (a prefix of P read backwards).

Thus GCA, the reversal of ACG, is the longest suffix of W which is equal to a prefix of P.

[Figure: Suffix tree for $\bar{P}$.]

• Example 2

W = GTATACAG, P = GCAGAGAG

$\bar{W}$ = GACATATG, $\bar{P}$ = GAGAGACG

We find that GAC is the longest prefix of $\bar{W}$ (thus, read backwards, the longest suffix of W) which is equal to a substring of $\bar{P}$.

But GAC is not a suffix of $\bar{P}$, and GACA is not a suffix of $\bar{P}$ either.

[Figure: Suffix tree for $\bar{P}$.]

$\bar{W}$ = GACATATG, $\bar{P}$ = GAGAGACG

Luckily, a prefix of GACG, namely G, is also a suffix of $\bar{P}$.

G can be found by finding the lowest common ancestor of G and GACG.

Thus G is the longest prefix of $\bar{W}$ (suffix of W) which is equal to a suffix of $\bar{P}$ (prefix of P).

[Figure: the part of the suffix tree for $\bar{P}$ that contains G and GACG.]

Here $\bar{W}$ and $\bar{P}$ denote the reversals of W and P.

Let X be the longest prefix of $\bar{W}$ (suffix of W) which is equal to a substring of $\bar{P}$, but not a suffix of $\bar{P}$.

Let Y be the longest prefix of X (a suffix of W) which is equal to a suffix of $\bar{P}$ (prefix of P).

Then Y is the longest suffix of W equal to a prefix of P.

[Figure: W and P with X and Y marked, both in the original and in the reversed strings.]

Z is a suffix of $\bar{P}$ which can be found in the suffix tree of $\bar{P}$.

Y may not exist.

If it exists, it must be in the suffix tree of $\bar{P}$ and must have been found before X is found, because Y is a prefix of X.

[Figure: $\bar{W}$ and $\bar{P}$ with X, Y and Z marked.]
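The scan for X and Y can be sketched as follows. This is a hedged illustration: the function name `scan_window` is hypothetical, and plain Python sets of substrings and suffixes stand in for the suffix tree of the reversed pattern, so the preprocessing here is quadratic rather than O(m). Here X is taken simply as the longest prefix of the reversed window that occurs in the reversed pattern.

```python
def scan_window(window, pattern):
    """Return (|X|, |Y|) for one window.

    X: longest prefix of the reversed window that is a substring of the
       reversed pattern.
    Y: longest prefix of X that is a suffix of the reversed pattern,
       i.e. the longest suffix of the window that is a prefix of the pattern.
    """
    rev_p = pattern[::-1]
    rev_w = window[::-1]
    m = len(rev_p)
    substrings = {rev_p[i:j] for i in range(m) for j in range(i + 1, m + 1)}
    suffixes = {rev_p[i:] for i in range(m)}

    x_len = y_len = 0
    for k in range(1, len(rev_w) + 1):
        prefix = rev_w[:k]
        if prefix not in substrings:
            break                 # X cannot be extended any further
        x_len = k
        if prefix in suffixes:
            y_len = k             # Y is recorded on the way to X
    return x_len, y_len


print(scan_window("GCATCGCA", "GCAGAGAG"))  # Example 1: X = Y = ACG
print(scan_window("GTATACAG", "GCAGAGAG"))  # Example 2: X = GAC, Y = G
```

Because Y is a prefix of X, it is always met before the scan stops, which is why a single pass over the window is enough.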

• Preprocessing phase: the worst-case time complexity is O(m).

• Searching phase: the worst-case time complexity is O(mn).

• But it needs only $O\left(\frac{n \log_r m}{m}\right)$ time in the average case, where r is the size of the alphabet, as shown in this paper.

For the average case analysis of the RF algorithm, assume that the text is a random sequence over an alphabet of size r, and that m is large enough for

$\log_r m \le \frac{3m}{8}$

to hold.

This assumption is reasonable.

Let m = 16, r = 4. Then

$\log_r m = \log_4 16 = 2 \le 6 = \frac{3 \cdot 16}{8} = \frac{3m}{8}.$
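The assumption is easy to check numerically. The sketch below is only an illustration; the function name `assumption_holds` is hypothetical.

```python
import math


def assumption_holds(m, r):
    """True when log_r(m) <= 3m/8, the condition used in the analysis."""
    return math.log(m, r) <= 3 * m / 8


# The slide's example: m = 16, r = 4.
print(math.log(16, 4), 3 * 16 / 8, assumption_holds(16, 4))

# Very short patterns can violate the condition, e.g. m = 2 over a binary alphabet.
print(assumption_holds(2, 2))
```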

Theorem. The expected time of the RF algorithm is $O\left(\frac{n \log_r m}{m}\right)$.

Proof.

Note that r > 1, and $\log_r m \le \frac{3m}{8}$.

For a pattern of length m, there are no more than m substrings of any given length. Thus, there are at most m substrings of length $2\log_r m$.

[Figure: P, of length m, with a substring of length $2\log_r m$ marked.]

Let $L_i$ be the length of the shift in the ith attempt of the RF algorithm, and let $X_i$ and $Y_i$ be the X and the Y in the ith attempt respectively.

Let $S_i$ be the length of the longest prefix of $\bar{W}$ which appears in $\bar{P}$ in the ith attempt. That is, $S_i = |X_i|$.

Let $A_i = |Y_i|$, so that $S_i \ge A_i$ because $Y_i$ is a prefix of $X_i$.

In the first attempt of the RF algorithm,

$\Pr(S_1 > 2\log_r m) \le m \cdot \frac{1}{r^{2\log_r m}} = \frac{m}{m^2} = \frac{1}{m},$

$\Pr(S_1 \le 2\log_r m) \ge 1 - \frac{1}{m}.$

$L_1 = m - A_1$
$\ge m - S_1$ (by $S_1 \ge A_1$)
$\ge m - 2\log_r m$ (by $S_1 \le 2\log_r m$)
$\ge m - 2 \cdot \frac{3m}{8}$ (by $\log_r m \le \frac{3m}{8}$)
$= \frac{m}{4}.$

Let us call the ith shift long if and only if $L_i \ge \frac{m}{4}$, and short otherwise. (It implies that $L_i$ is long if $S_i \le 2\log_r m$.)

When at least $2\log_r m$ new symbols are read in the current attempt, with probability at least $1 - \frac{1}{m}$ at most $2\log_r m$ characters of the suffix of the window can match a substring of P, which causes a long shift.

[Figure: a segment of length $2\log_r m$ in T compared against substrings of length $2\log_r m$ of P.]
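This probability bound can be explored with a small Monte Carlo experiment. The sketch is an illustration under stated assumptions, not part of the paper: the pattern and the window are both drawn uniformly at random, the function names are hypothetical, and the alphabet is limited to at most 26 lowercase letters.

```python
import math
import random


def longest_suffix_in_pattern(window, pattern):
    """Length of the longest suffix of window that occurs somewhere in pattern."""
    best = 0
    for k in range(1, len(window) + 1):
        if window[-k:] in pattern:
            best = k
        else:
            break  # a longer suffix cannot occur if this one does not
    return best


def fraction_short_match(m=16, r=4, trials=2000, seed=0):
    """Estimate Pr(S <= 2 log_r m) for random patterns and windows of length m."""
    rng = random.Random(seed)
    alphabet = [chr(ord("a") + i) for i in range(r)]  # assumes r <= 26
    threshold = 2 * math.log(m, r) + 1e-9             # tolerance for float rounding
    good = 0
    for _ in range(trials):
        pattern = "".join(rng.choice(alphabet) for _ in range(m))
        window = "".join(rng.choice(alphabet) for _ in range(m))
        if longest_suffix_in_pattern(window, pattern) <= threshold:
            good += 1
    return good / trials


print(fraction_short_match())  # compare with the bound 1 - 1/m = 0.9375 for m = 16
```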

We divide all attempts into phases. Each phase ends on the first long shift. In other words, there is exactly one long shift in each phase.

[Figure: the text T divided into Phase 1 to Phase 5; each phase consists of short shifts followed by one long shift.]
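The grouping into phases can be sketched directly from the definition, taking a shift as long when its length is at least m/4. The function name `split_into_phases` and the sample shift lengths are made up for illustration.

```python
def split_into_phases(shifts, m):
    """Group shift lengths into phases; each phase ends at its first long shift."""
    phases, current = [], []
    for shift in shifts:
        current.append(shift)
        if shift >= m / 4:        # long shift closes the phase
            phases.append(current)
            current = []
    if current:                   # trailing short shifts at the end of the text
        phases.append(current)
    return phases


# Hypothetical shift lengths for a pattern of length m = 16 (long means >= 4).
print(split_into_phases([8, 1, 2, 5, 3, 7], 16))
```

Every completed phase advances the window by at least m/4 positions, which is the fact used to bound the number of phases.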

There are two main ideas in the paper:

(1) The number of all phases is $O\left(\frac{n}{m}\right)$.

(2) We calculate the expected number of comparisons of each phase. The expected number of comparisons of each phase is $O(\log_r m)$.

We shall discuss the above two ideas in the next slides.

The number of all phases is $O\left(\frac{n}{m}\right)$.

We know that the length of a long shift is $\ge \frac{m}{4}$.

Then $|\text{phase}| \ge |\text{long shift}| \ge \frac{m}{4}$.

The number of all phases is at most $\frac{n}{m/4} = \frac{4n}{m} = O\left(\frac{n}{m}\right)$.

Claim 1: Assume that $L_i$ and $L_{i+1}$ are both short. Then

$\Pr\left(L_{i+2} \ge \frac{m}{4}\right) \ge \frac{1}{2}.$

That is, with probability at least $\frac{1}{2}$, $L_{i+2}$ is the end of a phase.

Proof. Suppose $L_i < \frac{m}{4}$ and $L_{i+1} < \frac{m}{4}$. Then the pattern is of the form $v(wv)^k sz$, where $k \ge 3$ and $w, z \ne \varepsilon$.

Next, we calculate the expected number of comparisons of each phase.

Note that $Y_i$ denotes the longest suffix of the window $W_i$ which is equal to a prefix of the pattern, where $W_i$ is the window of the text of length m in the ith attempt. Let $B_i$ be the new symbols to read in the ith attempt.

Note that the pattern is of the form $v(wv)^k sz$. Then $Y_i = v(wv)^k$, $|B_i| = |sz|$, $L_i = |sz|$.

[Figure: T and P in attempts i, i+1 and i+2, showing the windows $W_i$ and $W_{i+1}$, the matched parts $Y_i$ and $Y_{i+1}$, the new symbols $B_i$ and $B_{i+1}$, and the shifts $L_i$ and $L_{i+1}$.]

Let $|B_{i+1}|$ be $L_{i+1} = |vw|$, because there exists an overlap between $Y_i$ and $Y_{i+1}$, and

$Y_{i+1}B_{i+1} = (vw)^k v\,sz = v(wv)^k sz.$

[Figure: T and P in attempts i, i+1 and i+2, showing $W_i$, $W_{i+1}$, $Y_i$, $Y_{i+1}$, $B_i$, $B_{i+1}$, $L_i$ and $L_{i+1}$.]

Example: T = bbcabcabcabcabcadc, P = cabcabcabcadd, w = ab, v = c, s = a, z = dd.

Then $P = v(wv)^k sz = c(abc)^3 a\,dd$ and $Y_i = v(wv)^k = c(abc)^3$.

We can also write

$P = (vw)^3 v\,sz = (cab)^3 c\,a\,dd = v(wv)^3 sz = c(abc)^3 a\,dd,$

so that $L_{i+1} = |vw|$.

When P shifts by $L_{i+1}$, the overlap of $Y_i$ and $Y_{i+1}$ is $(vw)^3 v = (cab)^3 c$.

[Figure: P aligned against T in attempts i and i+1, with the shifts $L_i$ and $L_{i+1}$ and the overlap marked.]

Without loss of generality we can assume that $w_1v_1$ is a minimal period of $v(wv)^k$.

If there exists a word $w'v'$ such that $|w'v'| \le |w_1v_1|$, then $w_1v_1 = w'v'$ because $w_1v_1$ is a minimal period of $v(wv)^k$.

Hence,

$|w_1v_1| \le |vw| = L_{i+1}.$

Example: P = abcabcabcabcabcabcabbc, w = cabc, v = ab, s = b, z = c.

$P = v(wv)^k sz = ab(cabcab)^3\,b\,c$

$P = v_1(w_1v_1)^6 sz = ab(cab)^6\,b\,c$, with $w_1 = c$ and $v_1 = ab$.

$w_1v_1$ is a minimal period of P.

We can also assume (possibly changing wv and k) that $w_1v_1$ and sz do not have a common prefix. We may therefore obtain a new fragment $s_1z_1$ such that

$|s_1z_1| \le |sz| = L_i.$

A suffix of the read part of the text is of the form $(w_1v_1)^k s_1$, and we have at least $C = \min(L_{i+1}, L_i)$ new symbols to read in the (i+2)th attempt.

Let e be a random word of length C, taken from the part of the text still to be read, such that $|e| = C \ge |z_1|$.

Note that $|B_i| = L_i$ and $|B_{i+1}| = L_{i+1}$.

If $|B_i| > |B_{i+1}|$, then $C = |B_{i+1}| \ge |z_1|$; otherwise, $|B_i| \ge |z_1|$ and $s_1 = \varepsilon$.

[Figures: attempts i+1 and i+2 for the case $|B_i| > |B_{i+1}|$ and for the case $|B_i| \le |B_{i+1}|$, showing $Y_i$, $B_i$, $Y_{i+1}$, $B_{i+1}$, $s_1$ and $z_1$.]

We give an example when $|B_i| > |B_{i+1}|$.

T = bbbaaaaaaaacda, P = aaaaaaaabc, w = a, v = a, s = a, z = bc.

$P = v(wv)^3 sz = a(aa)^3 a\,bc$

$P = a(a)^7\,bc$, where a is a minimal period.

$P = (a)^8\,bc$, where a and bc do not have a common prefix.

$w_1 = a$, $v_1 = \varepsilon$, $s_1 = \varepsilon$, $z_1 = bc$.

[Figure: P aligned against T in attempts i, i+1 and i+2, with the shifts $L_i$ and $L_{i+1}$ marked.]

We give another example when $|B_i| \le |B_{i+1}|$.

T = bbcabcabcabcabcadc, P = cabcabcabcadd, w = ab, v = c, s = a, z = dd.

$P = v(wv)^3 sz = c(abc)^3 a\,dd$, where abc is a minimal period.

$P = ca(bca)^3\,dd$, where bca and dd do not have a common prefix.

$w_1 = b$, $v_1 = ca$, $s_1 = \varepsilon$, $z_1 = dd$.

[Figure: P aligned against T in attempts i, i+1 and i+2, with the shifts $L_i$ and $L_{i+1}$ marked.]

It is easy to see that if $w_1v_1s_1e$ is a substring of $v_1(w_1v_1)^k s_1z_1$, then e must be equal either to $\mathrm{pref}(z_1)$ if $s_1 \ne \varepsilon$, or to $\mathrm{pref}(w_1v_1z_1)$ otherwise.

[Figure: T and P in attempts i, i+1 and i+2; the text read so far ends with $w_1v_1s_1$ followed by e, and the pattern contains $v_1(w_1v_1)^l s_1$.]

In other words, by the above condition, if $s_1 \ne \varepsilon$, $w_1v_1s_1e$ can only appear at the end of P.

Therefore, $e = \mathrm{pref}(z_1)$.

[Figure: $w_1v_1w_1v_1w_1s_1\,e$ in T aligned with the end of $P = v_1w_1v_1w_1 \cdots v_1w_1s_1z_1$.]

Otherwise, $w_1v_1s_1e$ may appear at any position of P. Therefore, $e = \mathrm{pref}(w_1v_1z_1)$.

[Figure: $w_1v_1w_1v_1w_1\,e$ in T aligned with $P = v_1w_1v_1w_1 \cdots v_1w_1z_1$.]

The probability that reading the new symbols e leads to a long substring of the pattern (longer than $L_i + L_{i+1}$, which is less than $\frac{2m}{4}$) is no greater than

$\frac{|e|}{r^{|e|}} \le \frac{1}{2}.$

Note that $|w_1v_1| \le L_{i+1}$ and $|s_1z_1| \le L_i$.

[Figure: T and P in attempts i, i+1 and i+2, showing $Y_i$, $B_i$, $B_{i+1}$, $w_1v_1s_1\,e$ and $v_1(w_1v_1)^l s_1$.]

Therefore,

$L_{i+2} \ge m - S_{i+2} \ge m - \frac{2m}{4} \ge \frac{m}{4},$

$\Pr\left(L_{i+2} \ge \frac{m}{4}\right) \ge \frac{1}{2}.$

By Claim 1, when the (k-1)th and (k-2)th shifts are both short, the kth shift is long with probability $\ge \frac{1}{2}$.

It implies that the kth shift of the phase is short with probability $\le \frac{1}{2}$ for $k \ge 3$.


Let F be the random variable which is the number of short shifts in the phase.

What can we say about the probability distribution of F?

By Claim 1, we know that when the (k-2)th and (k-1)th shifts are both short, $\Pr(k\text{th shift is short}) \le \frac{1}{2}$.

$\Pr(F = 0) \ge 1 - \frac{1}{m},$

$\Pr(F = 1) \le \frac{1}{m},$

$\Pr(F = 2) \le \frac{1}{m},$

$\Pr(F = 3) \le \frac{1}{2} \cdot \frac{1}{m},$

$\Pr(F = k) \le \left(\frac{1}{2}\right)^{k-2} \frac{1}{m}, \text{ for } k \ge 3.$

Let G be the random variable which is the number of comparisons of the phase, and let L be the number of comparisons of the long shift of the phase. Then

$G \le L + Fm.$

The problem is how to find L.

For the number of comparisons of a long shift of the phase, we know

$\Pr(S_i \le 2\log_r m) \ge 1 - \frac{1}{m}$ and $\Pr(S_i > 2\log_r m) \le \frac{1}{m}.$

Note that $S_i$ is the length of the substring of the pattern that is matched in $W_i$.

$L \le 2\log_r m \left(1 - \frac{1}{m}\right) + m \cdot \frac{1}{m} \le 2\log_r m + 1.$

Hence,

$G \le L + Fm \le 2\log_r m + 1 + Fm.$

For the expected number of comparisons of each phase, we have

$E(G) \le 2\log_r m + 1 + m\,E(F)$

$= 2\log_r m + 1 + m \sum_{k \ge 0} k \Pr(F = k)$

$\le 2\log_r m + 1 + m \left( 0 \cdot \Pr(F = 0) + 1 \cdot \frac{1}{m} + 2 \cdot \frac{1}{m} + \sum_{k \ge 3} k \left(\frac{1}{2}\right)^{k-2} \frac{1}{m} \right)$

$= 2\log_r m + 4 + \sum_{k \ge 3} \frac{k}{2^{k-2}}$

$= O(\log_r m).$

According to the above discussion, we know that there are $O\left(\frac{n}{m}\right)$ phases in the algorithm and that the expected number of comparisons of each phase is $O(\log_r m)$.

Therefore, the expected time of the RF algorithm is $O\left(\frac{n \log_r m}{m}\right)$.
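The constant hidden in the per-phase bound can be checked numerically. The sketch below assumes the bounds $\Pr(F=1) \le 1/m$, $\Pr(F=2) \le 1/m$, $\Pr(F=k) \le (1/2)^{k-2}/m$ for $k \ge 3$, and $G \le 2\log_r m + 1 + Fm$; the function names and the truncation point `terms` are illustrative choices.

```python
import math


def series_tail(terms=200):
    """Partial sum of k / 2^(k-2) for k >= 3; the infinite series sums to 4."""
    return sum(k / 2 ** (k - 2) for k in range(3, terms))


def expected_comparisons_bound(m, r, terms=200):
    """Upper bound on E(G): 2 log_r m + 1 from the long shift, plus m * E(F)."""
    short_shift_part = 1 + 2 + series_tail(terms)   # m * (1/m + 2/m + tail/m)
    return 2 * math.log(m, r) + 1 + short_shift_part


print(series_tail())
print(expected_comparisons_bound(16, 4))
```

Under these assumptions the short shifts contribute only a constant, so the bound is $2\log_r m + 8$, and multiplying by the $4n/m$ bound on the number of phases gives the $O\left(\frac{n \log_r m}{m}\right)$ total.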

In this paper, they use X to analyze the average case of the RF algorithm. Note that X is the longest suffix of W which is equal to a substring of P.

In fact, the main idea of the RF algorithm is to find Y, not X. Therefore, we may re-analyze the expected length of $Y_i$.

Note that the shift is $L_i = m - |Y_i| = m - A_i$. If $A_i$ is small, $L_i$ is large. We expect $A_i$ to be very small.

Given a window $W_i$ of T in the ith attempt and a pattern P, the expected length of the longest suffix of $W_i$ equal to a prefix of P is

$A_i = 1 \cdot \frac{1}{r} + 2 \cdot \frac{1}{r^2} + \cdots + (m-1) \frac{1}{r^{m-1}} + m \frac{1}{r^m}$ …..(1)

$rA_i = 1 + 2 \cdot \frac{1}{r} + 3 \cdot \frac{1}{r^2} + \cdots + (m-1) \frac{1}{r^{m-2}} + m \frac{1}{r^{m-1}}$ …..(2)

(2) - (1):

$rA_i - A_i = 1 + \frac{1}{r} + \frac{1}{r^2} + \cdots + \frac{1}{r^{m-1}} - \frac{m}{r^m}$

$(r-1)A_i = \frac{1 - \left(\frac{1}{r}\right)^m}{1 - \frac{1}{r}} - \frac{m}{r^m}.$

We can deduce that

$(r-1)A_i \le \frac{1}{1 - \frac{1}{r}},$

$A_i \le \frac{r}{(r-1)^2}.$
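The closed-form bound can be compared with the finite sum directly. This is a small numerical check, assuming the sum $\sum_{k=1}^{m} k / r^k$ for the expected length; the function names are hypothetical.

```python
def expected_match_length(r, m):
    """Finite sum 1/r + 2/r^2 + ... + m/r^m."""
    return sum(k / r ** k for k in range(1, m + 1))


def upper_bound(r):
    """The bound r / (r - 1)^2, which is also the limit of the sum as m grows."""
    return r / (r - 1) ** 2


for r, m in [(4, 30), (5, 50), (10, 30), (7, 100)]:
    print(r, m, expected_match_length(r, m), upper_bound(r))
```

For the pattern lengths used in the experiments the sum is already indistinguishable from the bound, which is why the tables can use $r/(r-1)^2$ as the expected value.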

In the first experiment, we randomly generate some texts and patterns using Knuth's random generating function.

Data source: Random

Columns: text length; pattern length; alphabet size r; total number of comparisons (matched); number of windows; expected number of comparisons per window, r/(r-1)^2; average number of comparisons per window.

Text      Pattern  r   Comparisons  Windows  Expected  Average
1000      30       4   17           33       0.4444    0.515152
10000     30       4   151          338      0.4444    0.446746
100000    30       4   1316         3377     0.4444    0.389695
1000000   30       4   13074        33769    0.4444    0.38716
1000      50       5   6            20       0.3125    0.3
10000     50       5   64           201      0.3125    0.318408
100000    50       5   645          2012     0.3125    0.320577
1000000   50       5   6045         20120    0.3125    0.300447
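The last two columns of the table can be recomputed from the others. The sketch below uses four rows copied from the table; the function name `per_window_stats` is hypothetical.

```python
# (text length, pattern length, r, total comparisons, number of windows)
ROWS = [
    (1000, 30, 4, 17, 33),
    (10000, 30, 4, 151, 338),
    (1000, 50, 5, 6, 20),
    (10000, 50, 5, 64, 201),
]


def per_window_stats(rows):
    """Return (expected, average) comparisons per window for each row."""
    stats = []
    for _text_len, _pattern_len, r, comparisons, windows in rows:
        expected = r / (r - 1) ** 2          # bound on the expected match length
        average = comparisons / windows      # measured comparisons per window
        stats.append((round(expected, 4), round(average, 6)))
    return stats


for row, (expected, average) in zip(ROWS, per_window_stats(ROWS)):
    print(row, expected, average)
```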

Data source: Random

Columns: text length; pattern length; alphabet size r; total number of comparisons (matched); number of windows; expected number of comparisons per window, r/(r-1)^2; average number of comparisons per window.

Text      Pattern  r   Comparisons  Windows  Expected  Average
1000      30       10  5            33       0.1234    0.151515
10000     30       10  30           334      0.1234    0.08982
100000    30       10  346          3344     0.1234    0.103469
1000000   30       10  3930         33463    0.1234    0.117443
1000      100      7   4            10       0.1944    0.4
10000     100      7   14           100      0.1944    0.14
100000    100      7   195          1001     0.1944    0.194805
1000000   100      7   1865         10018    0.1944    0.186165

In the second experiment, we take news reports from the CNN site as T and randomly pick a word as P.

Data source: CNN news

Columns: text length; pattern length; alphabet size r; total number of comparisons (matched); number of windows; expected number of comparisons per window, r/(r-1)^2; average number of comparisons per window.

Text      Pattern  r   Comparisons  Windows  Expected  Average
3715      7        35  32           535      0.0302    0.0598
2222      14       40  2            158      0.0262    0.0126

In the third experiment, we take three fragments from human chromosomes as T. The pattern is taken from a part of T.

Columns: data source; text length; pattern length; alphabet size r; total number of comparisons (matched); number of windows; expected number of comparisons per window, r/(r-1)^2; average number of comparisons per window.

Data source                         Text      Pattern  r   Comparisons  Windows  Expected  Average
Human Chromosome 21 NT_011512.10    1627105   70       4   8942         23372    0.4444    0.3826
Human Chromosome 22 NT_011515.11    3437231   70       4   24648        49455    0.4444    0.4984
Human Chromosome X NT_033330.7      754004    70       4   5029         10843    0.4444    0.4638

Distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern.

Data source: Random

Columns: length of T; length of P; alphabet size r; number of windows for each length 0, 1, ..., 10, 30, 70.

T         P    r    0      1     2     3    4    5   6   7  8  9  10  30  70
1000      30   4    23     4     5     1    0    0   0   0  0  0  0   0   0
10000     30   4    240    60    27    7    4    0   0   0  0  0  0   0   0
100000    30   4    2429   661   225   48   9    5   0   0  0  0  0   0   0
1000000   30   4    24650  6200  2147  587  128  41  11  4  1  0  0   0   0
1000      50   5    16     3     0     1    0    0   0   0  0  0  0   0   0
10000     50   5    150    40    9     2    0    0   0   0  0  0  0   0   0
100000    50   5    1508   395   88    15   1    5   0   0  0  0  0   0   0
1000000   50   5    15314  3799  812   163  27   5   0   0  0  0  0   0   0
1000      30   10   28     5     0     0    0    0   0   0  0  0  0   0   0
10000     30   10   305    28    1     0    0    0   0   0  0  0  0   0   0
100000    30   10   3027   289   27    1    0    0   0   0  0  0  0   0   0
1000000   30   10   29959  3119  353   27   6    0   0   0  0  0  0   0   0
1000      100  7    7      2     1     0    0    0   0   0  0  0  0   0   0
10000     100  7    86     14    0     0    0    0   0   0  0  0  0   0   0
100000    100  7    841    131   23    6    0    0   0   0  0  0  0   0   0

Distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern.

Columns: data source; length of T; length of P; alphabet size r; number of windows for each length 0, 1, ..., 10, 30, 70.

Data source                         T         P   r   0      1      2     3    4    5    6   7  8  9  10  30  70
CNN news                            3715      7   35  503    32     0     0    0    0    0   0  0  0  0   0   0
CNN news                            2222      14  40  156    2      0     0    0    0    0   0  0  0  0   0   0
Human Chromosome 21 NT_011512.10    1627105   70  4   16265  5823   967   213  68   16   13  3  2  1  0   0   1
Human Chromosome 22 NT_011515.11    3437231   70  4   32269  11843  3990  891  295  111  45  6  2  1  1   0   1
Human Chromosome X NT_033330.7      754004    70  4   7177   2722   701   164  54   17   7   0  0  0  0   0   1

We calculate the distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern in the above experiments. We find that almost all $A_i$ are smaller than 5.

Therefore, we conclude that the probability of finding a large $A_i$ is very small.
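The claim can be quantified from the distribution tables. The sketch below uses two rows copied from the tables (random text of length 1000000 with r = 4, and Human Chromosome 21); the function name `fraction_below` is hypothetical.

```python
LENGTHS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 70]

# Number of windows observed for each length in LENGTHS.
RANDOM_R4 = [24650, 6200, 2147, 587, 128, 41, 11, 4, 1, 0, 0, 0, 0]
CHROMOSOME_21 = [16265, 5823, 967, 213, 68, 16, 13, 3, 2, 1, 0, 0, 1]


def fraction_below(counts, limit=5):
    """Fraction of windows whose matched length A_i is smaller than limit."""
    small = sum(c for length, c in zip(LENGTHS, counts) if length < limit)
    return small / sum(counts)


print(fraction_below(RANDOM_R4))
print(fraction_below(CHROMOSOME_21))
```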
