
A Fast String Searching Algorithm (Boyer-Moore Original)

  • 8/14/2019 A Fast String Searching Algorithm (Boyer-Moore Original)


    Programming Techniques
    G. Manacher, S.L. Graham, Editors

    A Fast String Searching Algorithm

    Robert S. Boyer
    Stanford Research Institute

    J Strother Moore
    Xerox Palo Alto Research Center

    An algorithm is presented that searches for the location, "i," of the first occurrence of a character string, "pat," in another string, "string." During the search operation, the characters of pat are matched starting with the last character of pat. The information gained by starting the match at the end of the pattern often allows the algorithm to proceed in large jumps through the text being searched. Thus the algorithm has the unusual property that, in most cases, not all of the first i characters of string are inspected. The number of characters actually inspected (on the average) decreases as a function of the length of pat. For a random English pattern of length 5, the algorithm will typically inspect i/4 characters of string before finding a match at i. Furthermore, the algorithm has been implemented so that (on the average) fewer than i + patlen machine instructions are executed. These conclusions are supported with empirical evidence and a theoretical analysis of the average behavior of the algorithm. The worst case behavior of the algorithm is linear in i + patlen, assuming the availability of array space for tables linear in patlen plus the size of the alphabet.

    Key Words and Phrases: bibliographic search, computational complexity, information retrieval, linear time bound, pattern matching, text editing

    CR Categories: 3.74, 4.40, 5.25

    Copyright 1977, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

    Authors' present addresses: R.S. Boyer, Computer Science Laboratory, Stanford Research Institute, Menlo Park, CA 94025. This work was partially supported by ONR Contract N00014-75-C-0816; J S. Moore was in the Computer Science Laboratory, Xerox Palo Alto Research Center, Palo Alto, CA 94304, when this work was done. His current address is Computer Science Laboratory, SRI International, Menlo Park, CA 94025.

    1. Introduction

    Suppose that pat is a string of length patlen and we wish to find the position i of the leftmost character in the first occurrence of pat in some string string:

        pat:    AT-THAT
        string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...

    The obvious search algorithm considers each character position of string and determines whether the successive patlen characters of string starting at that position match the successive patlen characters of pat. Knuth, Morris, and Pratt [4] have observed that this algorithm is quadratic. That is, in the worst case, the number of comparisons is on the order of i * patlen.¹
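For orientation, the "obvious" search described above might be sketched in Python as follows. This is an illustrative sketch rather than code from the paper: the function name `naive_search` is hypothetical, and the string literal is only the visible fragment of the example text.

```python
def naive_search(pat: str, string: str):
    """Return the 1-based position of the first occurrence of pat, or False."""
    patlen, stringlen = len(pat), len(string)
    # Consider each candidate starting position in turn (0-based internally).
    for start in range(stringlen - patlen + 1):
        # Compare the patlen characters starting here with pat.
        if string[start:start + patlen] == pat:
            return start + 1  # positions are counted from 1, as in the text
    return False


text = "WHICH-FINALLY-HALTS.--AT-THAT-POINT"
print(naive_search("AT-THAT", text))  # 23
```

Each candidate position may cost up to patlen comparisons, which is where the i * patlen worst case comes from.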

    Knuth, Morris, and Pratt have described a linear search algorithm which preprocesses pat in time linear in patlen and then searches string in time linear in i + patlen. In particular, their algorithm inspects each of the first i + patlen - 1 characters of string precisely once.

    We now present a search algorithm which is usually "sublinear": It may not inspect each of the first i + patlen - 1 characters of string. By "usually sublinear" we mean that the expected value of the number of inspected characters in string is c * (i + patlen), where c < 1 and gets smaller as patlen increases. There are patterns and strings for which worse behavior is exhibited. However, Knuth, in [5], has shown that the algorithm is linear even in the worst case.

    The actual number of characters inspected depends on statistical properties of the characters in pat and string. However, since the number of characters inspected on the average decreases as patlen increases, our algorithm actually speeds up on longer patterns.

    Furthermore, the algorithm is sublinear in another sense: It has been implemented so that on the average it requires the execution of fewer than i + patlen machine instructions per search.

    The organization of this paper is as follows: In the next two sections we give an informal description of the algorithm and show an example of how it works. We then define the algorithm precisely and discuss its efficient implementation. After this discussion we present the results of a thorough test of a particular machine code implementation of our algorithm. We compare these results to similar results for the Knuth, Morris, and Pratt algorithm and the simple search algorithm. Following this empirical evidence is a theoretical analysis which accurately predicts the performance measured. Next we describe some situations in which it may not be advantageous to use our algorithm. We conclude with a discussion of the history of our algorithm.

    ¹ The quadratic nature of this algorithm appears when initial substrings of pat occur often in string. Because this is a relatively rare phenomenon in string searches over English text, this simple algorithm is practically linear in i + patlen and therefore acceptable for most applications.

    Communications of the ACM, October 1977, Volume 20, Number 10


    2. Informal Description

    The basic idea behind the algorithm is that more information is gained by matching the pattern from the right than from the left. Imagine that pat is placed on top of the left-hand end of string so that the first characters of the two strings are aligned. Consider what we learn if we fetch the patlenth character, char, of string. This is the character which is aligned with the last character of pat.

    Observation 1. If char is known not to occur in pat, then we know we need not consider the possibility of an occurrence of pat starting at string positions 1, 2, ... or patlen: Such an occurrence would require that char be a character of pat.

    Observation 2. More generally, if the last (rightmost) occurrence of char in pat is delta1 characters from the right end of pat, then we know we can slide pat down delta1 positions without checking for matches. The reason is that if we were to move pat by less than delta1, the occurrence of char in string would be aligned with some character it could not possibly match: Such a match would require an occurrence of char in pat to the right of the rightmost.

    Therefore unless char matches the last character of pat we can move past delta1 characters of string without looking at the characters skipped; delta1 is a function of the character char obtained from string. If char does not occur in pat, delta1 is patlen. If char does occur in pat, delta1 is the difference between patlen and the position of the rightmost occurrence of char in pat.
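The definition of delta1 just given can be sketched as a small lookup table. The sketch below is illustrative only: it assumes a Python dictionary in place of an array indexed by character, and the names `make_delta1` and `delta1` are hypothetical.

```python
def make_delta1(pat: str) -> dict:
    """Map each character of pat to patlen minus its rightmost 1-based position."""
    patlen = len(pat)
    table = {}
    for j, ch in enumerate(pat, start=1):
        # A later (more rightward) occurrence overwrites an earlier one.
        table[ch] = patlen - j
    return table


def delta1(table: dict, patlen: int, char: str) -> int:
    """Look up delta1(char); characters that do not occur in pat get patlen."""
    return table.get(char, patlen)


tbl = make_delta1("AT-THAT")
print(delta1(tbl, 7, "F"), delta1(tbl, 7, "-"), delta1(tbl, 7, "T"))  # 7 4 0
```

For the pattern AT-THAT, a character such as "F" that is absent from the pattern allows a jump of the full pattern length, while the hyphen allows a jump of 4.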

    Now suppose that char matches the last character of pat. Then we must determine whether the previous character in string matches the second from the last character in pat. If so, we continue backing up until we have matched all of pat (and thus have succeeded in finding a match), or else we come to a mismatch at some new char after matching the last m characters of pat.

    In this latter case, we wish to shift pat down to consider the next plausible juxtaposition. Of course, we would like to shift it as far down as possible.

    Observation 3(a). We can use the same reasoning described above (based on the mismatched character char and delta1) to slide pat down k so as to align the two known occurrences of char. Then we will want to inspect the character of string aligned with the last character of pat. Thus we will actually shift our attention down string by k + m. The distance k we should slide pat depends on where char occurs in pat. If the rightmost occurrence of char in pat is to the right of the mismatched character (i.e. within that part of pat we have already passed) we would have to move pat backwards to align the two known occurrences of char. We would not want to do this. In this case we say that delta1 is "worthless" and slide pat forward by k = 1 (which is always sound). This shifts our attention down string by 1 + m. If the rightmost occurrence of char in pat is to the left of the mismatch, we can slide forward by k = delta1(char) - m to align the two occurrences of char. This shifts our attention down string by delta1(char) - m + m = delta1(char).
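The case analysis in Observation 3(a) reduces to a few lines of arithmetic. The following sketch is illustrative only; the function name `pointer_shift_3a` and its parameters are hypothetical, with `delta1_char` standing for delta1(char) and `m` for the number of characters already matched.

```python
def pointer_shift_3a(delta1_char: int, m: int) -> int:
    """Distance to move the pointer after a mismatch, using delta1 alone."""
    k = delta1_char - m  # slide that would align the two occurrences of char
    if k < 1:
        k = 1            # delta1 is "worthless": fall back to sliding pat by 1
    return k + m         # pat slides by k; the pointer also recovers the m matched characters


# Mismatch on a character absent from a 7-character pattern after 1 match:
print(pointer_shift_3a(7, 1))  # 7 (pat itself slides by 6)
```

When delta1(char) exceeds m the result is simply delta1(char); otherwise it is 1 + m.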

    However, it is possible that we can do better than this.

    Observation 3(b). We know that the next m characters of string match the final m characters of pat. Let this substring of pat be subpat. We also know that this occurrence of subpat in string is preceded by a character (char) which is different from the character preceding the terminal occurrence of subpat in pat. Roughly speaking, we can generalize the kind of reasoning used above and slide pat down by some amount so that the discovered occurrence of subpat in string is aligned with the rightmost occurrence of subpat in pat which is not preceded by the character preceding its terminal occurrence in pat. We call such a reoccurrence of subpat in pat a "plausible reoccurrence." The reason we said "roughly speaking" above is that we must allow for the rightmost plausible reoccurrence of subpat to "fall off" the left end of pat. This is made precise later.

    Therefore, according to Observation 3(b), if we have matched the last m characters of pat before finding a mismatch, we can move pat down by k characters, where k is based on the position in pat of the rightmost plausible reoccurrence of the terminal substring of pat having m characters. After sliding down by k, we want to inspect the character of string aligned with the last character of pat. Thus we actually shift our attention down string by k + m characters. We call this distance delta2, and we define delta2 as a function of the position j in pat at which the mismatch occurred. k is just the distance between the terminal occurrence of subpat and its rightmost plausible reoccurrence and is always greater than or equal to 1. m is just patlen - j.

    In the case where we have matched the final m characters of pat before failing, we clearly wish to shift our attention down string by 1 + m or delta1(char) or delta2(j), according to whichever allows the largest shift. From the definition of delta2 as k + m where k is always greater than or equal to 1, it is clear that delta2 is at least as large as 1 + m. Therefore we can shift our attention down string by the maximum of just the two deltas. This rule also applies when m = 0 (i.e. when we have not yet matched any characters of pat), because in that case j = patlen and delta2(j) ≥ 1.

    3. Example

    In the following example we use an "↑" under string to indicate the current char. When this "pointer" is pushed to the right, imagine that it drags the right end of pat with it (i.e. imagine pat has a hook on its right end). When the pointer is moved to the left, keep pat fixed with respect to string.


    pat:        AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                      ↑

    Since "F" is known not to occur in pat, we can appeal to Observation 1 and move the pointer (and thus pat) down by 7:

    pat:               AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                             ↑

    Appealing to Observation 2, we can move the pointer down 4 to align the two hyphens:

    pat:                   AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                 ↑

    Now char matches its opposite in pat. Therefore we step left by one:

    pat:                   AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                ↑

    Appealing to Observation 3(a), we can move the pointer to the right by 7 positions because "L" does not occur in pat.² Note that this only moves pat to the right by 6.

    pat:                         AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                       ↑

    Again char matches the last character of pat. Stepping to the left we see that the previous character in string also matches its opposite in pat. Stepping to the left a second time produces:

    pat:                         AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                     ↑

    Noting that we have a mismatch, we appeal to Observation 3(b). The delta2 move is best since it allows us to push the pointer to the right by 7 so as to align the discovered substring "AT" with the beginning of pat.³

    pat:                              AT-THAT
    string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                            ↑

    This time we discover that each character of pat matches the corresponding character in string so we have found the pattern. Note that we made only 14 references to string. Seven of these were required to confirm the final match. The other seven allowed us to move past the first 22 characters of string.

    ² Note that delta2 would allow us to move the pointer to the right only 4 positions in order to align the discovered substring "T" in string with its second from last occurrence at the beginning of the word "THAT" in pat.

    ³ The delta1 move only allows the pointer to be pushed to the right by 4 to align the hyphens.

    4. The Algorithm

    We now specify the algorithm. The notation pat(j) refers to the jth character in pat (counting from 1 on the left).

    We assume the existence of two tables, delta1 and delta2. The first has as many entries as there are characters in the alphabet. The entry for some character char will be denoted by delta1(char). The second table has as many entries as there are character positions in the pattern. The jth entry will be denoted by delta2(j). Both tables contain non-negative integers.

    The tables are initialized by preprocessing pat, and their entries correspond to the values delta1 and delta2 referred to earlier. We will specify their precise contents after it is clear how they are to be used.

    Our search algorithm may be specified as follows:

          stringlen ← length of string.
          i ← patlen.
    top:  if i > stringlen then return false.
          j ← patlen.
    loop: if j = 0 then return i + 1.
          if string(i) = pat(j)
            then
              j ← j - 1.
              i ← i - 1.
              goto loop.
            close;
          i ← i + max(delta1(string(i)), delta2(j)).
          goto top.

    If the above algorithm returns false, then pat does not occur in string. If the algorithm returns a number, then it is the position of the left end of the first occurrence of pat in string.

    The delta1 table has an entry for each character char in the alphabet. The definition of delta1 is:

        delta1(char) = If char does not occur in pat, then patlen; else patlen - j, where j is the maximum integer such that pat(j) = char.

    The delta2 table has one entry for each of the integers from 1 to patlen. Roughly speaking, delta2(j) is (a) the distance we can slide pat down so as to align the discovered occurrence (in string) of the last patlen - j characters of pat with its rightmost plausible reoccurrence, plus (b) the additional distance we must slide the "pointer" down so as to restart the process at the right end of pat. To define delta2 precisely we must define the rightmost plausible reoccurrence of a terminal substring of pat. To this end let us make the following conventions: Let $ be a character that does not occur in pat and let us say that if i is less than 1 then pat(i) is $. Let us also say that two sequences of characters [c_1 ... c_n] and [d_1 ... d_n] "unify" if for all i from 1 to n either c_i = d_i or c_i = $ or d_i = $.

    Finally, we define the position of the rightmost plausible reoccurrence of the terminal substring which starts at position j + 1, rpr(j), for j from 1 to patlen, to be the greatest k less than or equal to patlen such that [pat(j + 1) ... pat(patlen)] and [pat(k) ... pat(k + patlen - j - 1)] unify and either k ≤ 1 or pat(k - 1) ≠ pat(j).⁴ (That is, the position of the rightmost plausible reoccurrence of the substring subpat, which starts at j + 1, is the rightmost place where subpat occurs in pat and is not preceded by the character pat(j) which precedes its terminal occurrence, with suitable allowances for either the reoccurrence or the preceding character to fall beyond the left end of pat. Note that rpr(j) may be negative because of these allowances.)
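The definition of rpr(j) can be transcribed almost literally as a brute-force search over k. The sketch below is illustrative and makes two assumptions that are not stated in the text: the literal character "$" is used as the sentinel (so pat itself must not contain "$"), and a candidate reoccurrence is required to fit inside pat on the right, which means the search starts at k = min(patlen, j + 1). The function name `rpr` follows the text; the helpers `at` and `unify` are hypothetical.

```python
def rpr(pat: str, j: int) -> int:
    """Rightmost plausible reoccurrence of pat(j+1 .. patlen), by direct search."""
    patlen = len(pat)

    def at(i: int) -> str:
        # pat(i) with 1-based indexing; positions left of the pattern read as "$".
        return pat[i - 1] if i >= 1 else "$"

    def unify(k: int) -> bool:
        # Compare [pat(j+1) .. pat(patlen)] with [pat(k) .. pat(k+patlen-j-1)].
        for offset in range(patlen - j):
            c, d = at(j + 1 + offset), at(k + offset)
            if c != d and c != "$" and d != "$":
                return False
        return True

    # Assumption: candidates may not overhang the right end of pat.
    k = min(patlen, j + 1)
    # At k = j + 1 - patlen the candidate lies wholly left of pat, so the loop stops.
    while not (unify(k) and (k <= 1 or at(k - 1) != at(j))):
        k -= 1
    return k


print([rpr("ABCXXXABC", j) for j in range(1, 10)])
```

This direct search is quadratic or worse in patlen and is meant only to make the definition concrete, not to suggest how the table should be built in practice.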

    Thus the distance we must slide pat to align the discovered substring which starts at j + 1 with its rightmost plausible reoccurrence is j + 1 - rpr(j). The distance we must move to get back to the end of pat is just patlen - j. delta2(j) is just the sum of these two. Thus we define delta2 as follows:

        delta2(j) = patlen + 1 - rpr(j).

    To make this definition clear, consider the following two examples:

        j:         1  2  3  4  5  6  7  8  9
        pat:       A  B  C  X  X  X  A  B  C
        delta2(j): 14 13 12 11 10 9  11 10 1

        j:         1  2  3  4  5  6  7  8  9
        pat:       A  B  Y  X  C  D  E  Y  X
        delta2(j): 17 16 15 14 13 12 7  10 1

    5. Implementation Considerations
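As a point of reference for the discussion that follows, the search loop of Section 4 and both tables can be put together in a short Python sketch. It is illustrative only and is not the machine code implementation measured in the paper: delta2 is built by brute force from the rpr definition (assuming candidates may not overhang the right end of pat), `None` stands in for the "$" character, and all function names are hypothetical. The search also counts references to string so that the example of Section 3 can be checked.

```python
def make_delta1(pat):
    """delta1 as a dict; characters not in pat are handled by the caller."""
    patlen = len(pat)
    return {ch: patlen - j for j, ch in enumerate(pat, start=1)}


def make_delta2(pat):
    """delta2(j) = patlen + 1 - rpr(j), with rpr found by direct search."""
    patlen = len(pat)

    def at(i):
        return pat[i - 1] if i >= 1 else None  # None plays the role of "$"

    def plausible(j, k):
        for offset in range(patlen - j):
            c, d = pat[j + offset], at(k + offset)  # pat[j + offset] is pat(j+1+offset)
            if d is not None and c != d:
                return False
        return k <= 1 or at(k - 1) != pat[j - 1]

    table = [0] * (patlen + 1)  # index 0 unused, so table[j] is delta2(j)
    for j in range(1, patlen + 1):
        k = min(patlen, j + 1)
        while not plausible(j, k):
            k -= 1
        table[j] = patlen + 1 - k
    return table


def search(pat, string):
    """Return (position, references); position is 1-based, or False if absent."""
    patlen, stringlen = len(pat), len(string)
    d1, d2 = make_delta1(pat), make_delta2(pat)
    references = 0
    i = patlen
    while i <= stringlen:
        j = patlen
        while j > 0:
            references += 1  # one fetch of string(i)
            if string[i - 1] != pat[j - 1]:
                break
            j -= 1
            i -= 1
        if j == 0:
            return i + 1, references
        i += max(d1.get(string[i - 1], patlen), d2[j])
    return False, references


print(search("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # (23, 14)
```

On the visible fragment of the example text this reproduces the 14 references counted in Section 3, and `make_delta2` reproduces the two delta2 tables given at the end of Section 4.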

    The most frequently executed part of the algorithm is the code that embodies Observations 1 and 2. The following version of our algorithm is equivalent to the original version provided that delta0 is a table containing the same entries as delta1 except that delta0(pat(patlen)) is set to an integer large which is greater than stringlen + patlen (while delta1(pat(patlen)) is always 0).

          stringlen ← length of string.
          i ← patlen.
          if i > stringlen then return false.
    fast: i ← i + delta0(string(i)).
          if i [...]


    [...] length patlen, for each patlen from 1 to 14. We then used our algorithm to search for each of the test patterns in its source string, starting each search in a random position somewhere in the first half of the source string. All of the characters for both the patterns and the strings were in primary memory (rather than a secondary storage medium such as a disk).

    We measured the cost of each search in two ways: the number of references made to string and the total number of machine instructions that actually got executed (ignoring the preprocessing to set up the two tables).

    By dividing the number of references to string by the number of characters i - 1 passed before the pattern was found (or string was exhausted), we obtained the number of references to string per character passed. This measure is independent of the particular implementation of the algorithm. By dividing the number of instructions executed by i - 1, we obtained the average number of instructions spent on each character passed. This measure depends upon the implementation, but we feel that it is meaningful since the implementation is a straightforward encoding of the algorithm as described in the last section.

    We then averaged these measures across all 300 samples for each pattern length.

    Because the performance of the algorithm depends upon the statistical properties of pat and string (and hence upon the properties of the source string from which the test patterns were obtained), we performed this experiment for three different kinds of source strings, each of length 10,000. The first source string consisted of a random sequence of 0's and 1's. The second source string was a piece of English text obtained from an online manual. The third source string was a random sequence of characters from a 100-character alphabet.

    In Figure 1 the average number of references to string per character in string passed is plotted against the pattern length for each of three source strings. Note that the number of references to string per character passed is less than 1. For example, for an English pattern of length 5, the algorithm typically inspects 0.24 characters for every character passed. That is, for every reference to string the algorithm passes about 4 characters, or, equivalently, the algorithm inspects only about a quarter of the characters it passes when searching for a pattern of length 5 in an English text string. Furthermore, the number of references per character drops as the patterns get longer. This evidence supports the conclusion that the algorithm is "sublinear" in the number of references to string.

    For comparison, it should be noted that the Knuth, Morris, and Pratt algorithm references string precisely 1 time per character passed. The simple search algorithm references string about 1.1 times per character passed (determined empirically with the English sample above).

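    Fig. 1. Empirical cost: average number of references to string per character passed, plotted against length of pattern.

    Fig. 2. Empirical cost: average number of instructions executed per character passed, plotted against length of pattern. The values 3.56, 0.473, and 0.266 appear as labels in the figure.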

    In Figure 2 the average number of instructions executed per character passed is plotted against the pattern length. The most obvious feature to note is that the search speeds up as the patterns get longer. That is, the total number of instructions executed in order to pass over a character decreases as the length of the pattern increases.

    Figure 2 also exhibits a second interesting feature of our implementation of the algorithm: For sufficiently large alphabets and sufficiently long patterns the algorithm executes fewer than 1 instruction per character passed. For example, in the English sample, less than 1 instruction per character is executed for patterns of length 5 or more. Thus this implementation is "sublinear" in the sense that it executes fewer than i + patlen instructions before finding the pattern at i. This means that no algorithm which references each character it passes could possibly be faster than ours in these cases (assuming it takes at least one instruction to reference each character).

    The best alternative algorithm for finding a single substring is that of Knuth, Morris, and Pratt. If that algorithm is implemented in the extraordinarily efficient way described in [4, pp. 11-12] and [2, Item 179]⁵ then the cost of looking at a character can be expected to be at least 3 - p instructions, where p is the probability that a character just fetched from string is equal to a given character of pat. Hence a horizontal line at 3 - p instructions/character represents the best (and, practically, the worst) the Knuth, Morris, and Pratt algorithm can achieve.

    The simple string searching algorithm (when coded with a 3-instruction fast loop⁶) executes about 3.3 instructions per character (determined empirically on the English sample above).

    As noted, the preprocessing time for our algorithm (and for Knuth, Morris, and Pratt) has been ignored. The cost of this preprocessing can be made linear in patlen (this is discussed further in the next section) and is trivial compared to a reasonably long search. We made no attempt to code this preprocessing efficiently. However, the average cost (in our implementation) ranges from 160 instructions (for strings of length 1) to about 500 instructions (for strings of length 14). It should be explained that our code uses a block transfer instruction to clear the 128-word delta1 table at the beginning of the preprocessing, and we have counted this single instruction as though it were 128 instructions. This accounts for the unexpectedly large instruction count for preprocessing a one-character pattern.

    7. Theoretical Analysis

    The preprocessing for delta1 requires an array the size of the alphabet. Our implementation first initializes all entries of this array to patlen and then sets up delta1 in a linear scan through the pattern. Thus our preprocessing for delta1 is linear in patlen plus the size of the alphabet.

    Footnote 5. This implementation automatically compiles pat into a machine code program which implicitly has the skip table built in and which is executed to perform the search itself. In [2] they compile code which uses the PDP-10 capability of fetching a character and incrementing a byte address in one instruction. This compiled code executes at least two or three instructions per character fetched from string, depending on the outcome of a comparison of the character to one from pat.

    Footnote 6. This loop avoids checking whether string is exhausted by assuming that the first character of pat occurs at the end of string. This can be arranged ahead of time. The loop actually uses the same three instruction codes used by the above-referenced implementation of the Knuth, Morris, and Pratt algorithm.
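    The delta1 preprocessing described in this section can be sketched as follows. This is a minimal illustration, not the authors' PDP-10 code. It assumes the definition of delta1 from the earlier part of the paper (patlen for characters absent from pat, otherwise patlen - j for the rightmost position j of the character), a 128-entry table as mentioned above, and characters whose codes fit in that table. The function name `build_delta1` and its parameters are hypothetical.

```python
def build_delta1(pat: str, alphabet_size: int = 128) -> list[int]:
    """Build the delta1 table in time linear in patlen plus alphabet size."""
    patlen = len(pat)
    # Step 1: initialize every entry to patlen (character absent from pat).
    delta1 = [patlen] * alphabet_size
    # Step 2: one left-to-right scan; later (more rightward) occurrences
    # overwrite earlier ones, so each entry ends as patlen - j for the
    # rightmost 1-based position j of that character.
    for j, ch in enumerate(pat, start=1):
        delta1[ord(ch)] = patlen - j
    return delta1


table = build_delta1("AT-THAT")
print(table[ord("T")], table[ord("A")], table[ord("X")])
```

    The two steps correspond to the two terms of the cost: the initialization is linear in the alphabet size and the scan is linear in patlen.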

    At a slight loss of efficiency in the search speed one could eliminate the initialization of the delta1 array by storing with each entry a key indicating the number of times the algorithm has previously been called. This approach still requires initializing the array the first time the algorithm is used.

    To implement our algorithm for extremely large alphabets, one might implement the delta1 table as a hash array. In the worst case, accessing delta1 during the search itself could require order patlen instructions, significantly impairing the speed of the algorithm. Hence the algorithm as it stands almost certainly does not run in time linear in i + patlen for infinite alphabets.
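    The two variations just described can be sketched as follows. Both are illustrations under stated assumptions rather than the paper's implementation: the class `StampedDelta1` and the function `sparse_delta1` are hypothetical names, the call counter plays the role of the "key" stored with each entry, and a Python dict stands in for the hash array.

```python
class StampedDelta1:
    """delta1 table that is initialized once and never cleared again."""

    def __init__(self, alphabet_size: int = 128):
        self.values = [0] * alphabet_size   # one-time initialization
        self.stamps = [0] * alphabet_size   # call number that wrote each entry
        self.call = 0
        self.patlen = 0

    def prepare(self, pat: str) -> None:
        self.call += 1                      # entries from older calls go stale
        self.patlen = len(pat)
        for j, ch in enumerate(pat, start=1):
            self.values[ord(ch)] = self.patlen - j
            self.stamps[ord(ch)] = self.call

    def lookup(self, ch: str) -> int:
        c = ord(ch)
        # The extra comparison is the slight loss of search speed.
        return self.values[c] if self.stamps[c] == self.call else self.patlen


def sparse_delta1(pat: str) -> dict[str, int]:
    """Hash-based delta1: only characters of pat are stored."""
    patlen = len(pat)
    # Later occurrences overwrite earlier ones, so the rightmost wins.
    return {ch: patlen - j for j, ch in enumerate(pat, start=1)}


table = StampedDelta1()
table.prepare("AT-THAT")
print(table.lookup("A"), table.lookup("X"))
print(sparse_delta1("AT-THAT").get("X", len("AT-THAT")))
```

    With the sparse table, a lookup for an absent character falls back to patlen through the default argument of `get`. How expensive a hash lookup is in the worst case depends on the hashing scheme, which is the concern raised in the text.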

    Knuth, in analyzing the algorithm, has shown that it still runs in linear time when delta1 is omitted, and this result holds for infinite alphabets. Doing this, however, will drastically degrade the performance of the algorithm on the average. In [5] Knuth exhibits an algorithm for setting up delta2 in time linear in patlen.

    From the preceding empirical evidence, the reader can conclude that the algorithm is quite good in the average case. However, the question of its behavior in the worst case is nontrivial. Knuth has recently shed some light on this question. In [5] he proves that the execution of the algorithm (after preprocessing) is linear in i + patlen, assuming the availability of array space linear in patlen plus the size of the alphabet. In particular, he shows that in order to discover that pat does not occur in the first i characters of string, at most 6 * i characters from string are matched with characters in pat. He goes on to say that the constant 6 is probably much too large, and invites the reader to improve the theorem. His proof reveals that the linearity of the algorithm is entirely due to delta2.

    We now analyze the average behavior of the algorithm by presenting a probabilistic model of its performance. As will become clear, the results of this analysis will support the empirical conclusions that the algorithm is usually "sublinear" both in the number of references to string and the number of instructions executed (for our implementation).

    The analysis below is based on the following simplifying assumption: Each character of pat and string is an independent random variable. The probability that a character from pat or string is equal to a given character of the alphabet is p.

    Imagine that we have just moved pat down string to a new position and that this position does not yield a match. We want to know the expected value of the ratio between the cost of discovering the mismatch and the distance we get to slide pat down upon finding the mismatch. If we define the cost to be the total number of references made to string before discovering the mismatch, we can obtain the expected value of the average number of references to string per character passed. If we define the cost to be the total number of machine instructions executed in discovering the mismatch, we can obtain the expected value of the number of instructions executed per character passed.

    Communications of the ACM, October 1977, Volume 20, Number 10

    In the following we say "only the last m characters of pat match" to mean "the last m characters of pat match the corresponding m characters in string but the (m + 1)-th character from the right end of pat fails to match the corresponding character in string."

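    The expected value of the ratio of cost to characters passed is given by:

    $$\frac{\sum_{m=0}^{\mathit{patlen}-1} \mathit{cost}(m) \cdot \mathit{prob}(m)}{\sum_{m=0}^{\mathit{patlen}-1} \mathit{prob}(m) \cdot \left( \sum_{k=1}^{\mathit{patlen}} \mathit{skip}(m,k) \cdot k \right)}$$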

    where cost(m) is the cost associated with discovering that only the last m characters of pat match; prob(m) is the probability that only the last m characters of pat match; and skip(m, k) is the probability that, supposing only the last m characters of pat match, we will get to slide pat down by k.

    Under our assumptions, the probability that only the last m characters of pat match is:

    $$\mathit{prob}(m) = \frac{p^{m}(1 - p)}{1 - p^{\mathit{patlen}}}$$

    (The denominator is due to the assumption that a mismatch exists.)
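    As a check on the role of the denominator (a worked step added here for illustration, not part of the original text), note that the numerators form a finite geometric series:

    $$\sum_{m=0}^{\mathit{patlen}-1} p^{m}(1 - p) = 1 - p^{\mathit{patlen}}, \qquad\text{so}\qquad \sum_{m=0}^{\mathit{patlen}-1} \mathit{prob}(m) = 1 .$$

    The quantity p^patlen is the probability that all patlen characters match. Dividing by 1 - p^patlen therefore conditions on a mismatch occurring somewhere in the current alignment.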

    The probability that we will get to slide pat down by k is determined by analyzing how i is incremented. However, note that even though we increment i by the maximum max of the two deltas, this will actually only slide pat down by max - m, since the increment of i also includes the m necessary to shift our attention back to the end of pat. Thus when we analyze the contributions of the two deltas we speak of the amount by which they allow us to slide pat down, rather than the amount by which we increment i. Finally, recall that if the mismatched character char occurs in the already matched final m characters of pat, then delta1 is worthless and we always slide by delta2. The probability that delta1 is worthless is just (1 - (1 - p)^m). Let us call this probdelta1worthless(m).

    The conditions under which delta1 will naturally let us slide forward by k can be broken down into four cases as follows: (a) delta1 will let us slide down by 1 if char is the (m + 2)-th character from the righthand end of pat (or else there are no more characters in pat) and char does not occur to the right of that position (which has probability (1 - p)^m * (if m + 1 = patlen then 1 else p)). (b) delta1 allows us to slide down k, where 1 < k < patlen - m, provided the rightmost occurrence of char in pat is m + k characters from the right end of pat (which has probability p * (1 - p)^(k+m-1)). (c) When patlen - m > 1, delta1 allows us to slide past patlen - m characters if char does not occur in pat at all (which has probability (1 - p)^(patlen-1) given that we know char is not the (m + 1)-th character from the right end of pat). Finally, (d) delta1 never allows a slide longer than patlen - m (since the maximum value of delta1 is patlen).

    Thus we can define the probability probdelta1(m, k) that when only the last m characters of pat match, delta1 will allow us to move down by k as follows (the cases are tried in order):

    $$\mathit{probdelta1}(m,k) = \begin{cases} (1-p)^{m} \cdot (\text{if } m + 1 = \mathit{patlen} \text{ then } 1 \text{ else } p) & \text{if } k = 1 \\ p \cdot (1-p)^{k+m-1} & \text{else if } 1 < k < \mathit{patlen} - m \\ (1-p)^{\mathit{patlen}-1} & \text{else if } k = \mathit{patlen} - m \\ 0 & \text{else (i.e. } k > \mathit{patlen} - m\text{)} \end{cases}$$
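    A short consistency check on these four cases may be helpful (this worked step is added for illustration and is not in the original). For m + 1 < patlen, summing over all k and substituting j = k + m - 1 gives

    $$\sum_{k=1}^{\mathit{patlen}-m} \mathit{probdelta1}(m,k) = p \sum_{j=m}^{\mathit{patlen}-2} (1-p)^{j} + (1-p)^{\mathit{patlen}-1} = (1-p)^{m} = 1 - \mathit{probdelta1worthless}(m) .$$

    For m + 1 = patlen only the k = 1 term is nonzero, and it equals (1 - p)^m directly. In both situations the probability mass not covered by the four cases is exactly the probability that delta1 is worthless.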

    (It should be noted that we will not put these formulas into closed form, but will simply evaluate them to verify the validity of our empirical evidence.)

    We now perform a similar analysis for delta2; delta2 lets us slide down by k if (a) doing so sets up an alignment of the discovered occurrence of the last m characters of pat in string with a plausible reoccurrence of those m characters elsewhere in pat, and (b) no smaller move will set up such an alignment. The probability probpr(m, k) that the terminal substring of pat of length m has a plausible reoccurrence k characters to the left of its first character is:

    $$\mathit{probpr}(m,k) = \begin{cases} (1-p) \cdot p^{m} & \text{if } m + k < \mathit{patlen} \\ p^{\mathit{patlen}-k} & \text{else} \end{cases}$$

    Of course, k is just the distance delta2 lets us slide provided there is no earlier reoccurrence. We can therefore define the probability probdelta2(m, k) that, when only the last m characters of pat match, delta2 will allow us to move down by k recursively as follows:

    $$\mathit{probdelta2}(m,k) = \mathit{probpr}(m,k) \left( 1 - \sum_{n=1}^{k-1} \mathit{probdelta2}(m,n) \right)$$

    We slide down by the maximum allowed by the two deltas (taking adequate account of the possibility that delta1 is worthless). If the values of the deltas were independent, the probability that we would actually slide down by k would just be the sum of the products of the probabilities that one of the deltas allows a move of k while the other allows a move of less than or equal to k.

    However, the two moves are not entirely independent. In particular, consider the possibility that delta1 is worthless. Then the char just fetched occurs in the last m characters of pat and does not match the (m + 1)-th. But if delta2 gives a slide of 1 it means that sliding these m characters to the left by 1 produces a match. This implies that all of the last m characters of pat are equal to the character m + 1 from the right. But this character is known not to be char. Thus char cannot occur in the last m characters of pat, violating the hypothesis that delta1 was worthless. Therefore if delta1 is worthless, the probability that delta2 specifies a skip of 1 is 0 and the probability that it specifies one of the larger skips is correspondingly increased.

    This interaction between the two deltas is also felt (to a lesser extent) for the next m possible delta2's, but we ignore these (and in so doing accept that our analysis may predict slightly worse results than might be expected since we allow some short delta2 moves when longer ones would actually occur).

    The probability that delta2 will allow us to slide down by k when only the last m characters of pat match, assuming that delta1 is worthless, is:

    $$\mathit{probdelta2}'(m,k) = \begin{cases} 0 & \text{if } k = 1 \\ \mathit{probpr}(m,k) \left( 1 - \sum_{n=2}^{k-1} \mathit{probdelta2}'(m,n) \right) & \text{else} \end{cases}$$
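    Both recursive definitions can be unrolled into products, which makes their meaning easier to see (this derivation is added for illustration and does not appear in the original). Writing S_k for the sum of probdelta2(m, n) over n = 1, ..., k, the recursion gives 1 - S_k = (1 - S_{k-1})(1 - probpr(m, k)), and therefore

    $$\mathit{probdelta2}(m,k) = \mathit{probpr}(m,k) \prod_{n=1}^{k-1} \bigl( 1 - \mathit{probpr}(m,n) \bigr), \qquad \mathit{probdelta2}'(m,k) = \mathit{probpr}(m,k) \prod_{n=2}^{k-1} \bigl( 1 - \mathit{probpr}(m,n) \bigr) \quad (k \ge 2) .$$

    In words: a slide of exactly k requires a plausible reoccurrence at distance k and none at any smaller distance. The primed version simply drops the factor for distance 1, which is ruled out when delta1 is worthless.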

    Finally, we can define skip(m, k), the probability that we will slide down by k if only the last m characters of pat match:

    $$\mathit{skip}(m,k) = \begin{cases} \mathit{probdelta1}(m,1) \cdot \mathit{probdelta2}(m,1) & \text{if } k = 1 \\[4pt] \mathit{probdelta1worthless}(m) \cdot \mathit{probdelta2}'(m,k) \\ \quad + \sum_{n=1}^{k-1} \mathit{probdelta1}(m,k) \cdot \mathit{probdelta2}(m,n) \\ \quad + \sum_{n=1}^{k-1} \mathit{probdelta1}(m,n) \cdot \mathit{probdelta2}(m,k) \\ \quad + \mathit{probdelta1}(m,k) \cdot \mathit{probdelta2}(m,k) & \text{else} \end{cases}$$

    Now let us consider the two alternative cost functions. In order to analyze the number of references to string per character passed over, cost(m) should just be m + 1, the number of references necessary to confirm that only the last m characters of pat match.

    In order to analyze the number of instructions executed per character passed over, cost(m) should be the total number of instructions executed in discovering that only the last m characters of pat match. By inspection of our PDP-10 code:

    cost(m) = if m = 0 then 3 else 12 + 6m.

    We have computed the expected value of the ratio of cost per character skipped by using the above formulas (and both definitions of cost). We did so for pattern lengths running from 1 to 14 (as in our empirical evidence) and for the values of p appropriate for the three source strings used: For a random binary string p is 0.5, for an arbitrary English string it is (approximately) 0.09, and for a random string over a 100-character alphabet it is 0.01. The value of p for English was determined using a standard frequency count for the alphabetic characters [3] and empirically determining the frequency of space, carriage return, and line feed to be 0.23, 0.03, and 0.03, respectively (footnote 7).
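    The evaluation described above can be reproduced with a short program. The following is a sketch that transcribes the formulas of this section directly. It assumes that m runs from 0 to patlen - 1 and k from 1 to patlen in the ratio, and that the sum in the delta1-worthless variant starts at n = 2. The function and parameter names are hypothetical, and the code is not the authors' program.

```python
from functools import lru_cache


def expected_ratio(patlen, p, cost):
    """Expected cost per character passed under the independence model."""

    def prob(m):
        return p**m * (1 - p) / (1 - p**patlen)

    def worthless(m):  # probdelta1worthless(m)
        return 1 - (1 - p)**m

    def probdelta1(m, k):
        if k == 1:
            return (1 - p)**m * (1 if m + 1 == patlen else p)
        if 1 < k < patlen - m:
            return p * (1 - p)**(k + m - 1)
        if k == patlen - m:
            return (1 - p)**(patlen - 1)
        return 0.0

    def probpr(m, k):
        return (1 - p) * p**m if m + k < patlen else p**(patlen - k)

    @lru_cache(maxsize=None)
    def probdelta2(m, k):
        return probpr(m, k) * (1 - sum(probdelta2(m, n) for n in range(1, k)))

    @lru_cache(maxsize=None)
    def probdelta2_prime(m, k):  # delta2 given that delta1 is worthless
        if k == 1:
            return 0.0
        earlier = sum(probdelta2_prime(m, n) for n in range(2, k))
        return probpr(m, k) * (1 - earlier)

    def skip(m, k):
        if k == 1:
            return probdelta1(m, 1) * probdelta2(m, 1)
        return (worthless(m) * probdelta2_prime(m, k)
                + sum(probdelta1(m, k) * probdelta2(m, n) for n in range(1, k))
                + sum(probdelta1(m, n) * probdelta2(m, k) for n in range(1, k))
                + probdelta1(m, k) * probdelta2(m, k))

    numerator = sum(cost(m) * prob(m) for m in range(patlen))
    denominator = sum(
        prob(m) * sum(skip(m, k) * k for k in range(1, patlen + 1))
        for m in range(patlen))
    return numerator / denominator


def reference_cost(m):      # references to string
    return m + 1


def instruction_cost(m):    # PDP-10 instruction count given in the text
    return 3 if m == 0 else 12 + 6 * m


for patlen in (1, 5, 14):
    print(patlen, expected_ratio(patlen, 0.09, reference_cost))
```

    For patlen = 1 the model reduces to exactly one reference (and three instructions) per character passed, which is a convenient sanity check. For longer patterns with p = 0.09 the reference ratio falls well below 1, in line with the "sublinear" behavior discussed in the text.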

    In Figure 3 we have plotted the theoretical ratio of references to string per character passed over against the pattern length. The most important fact to observe in Figure 3 is that the algorithm can be expected to make fewer than i + patlen references to string before finding the pattern at location i. For example, for English text strings of length 5 or greater, the algorithm may be expected to make less than (i + 5)/4 references to string. The comparable figure for the Knuth, Morris, and Pratt algorithm is of course precisely i. The figure for the intuitive search algorithm is always greater than or equal to i.

    [Fig. 3. Theoretical references to string per character passed, plotted against LENGTH OF PATTERN.]

    [Fig. 4. Theoretical instructions executed per character passed, plotted against LENGTH OF PATTERN.]

    Footnote 7. We have determined empirically that the algorithm's performance on truly random strings where p = 0.09 is virtually identical to its performance on English strings. In particular, the reference count and instruction count curves generated by such random strings are almost coincidental with the English curves in Figures 1 and 2.

    The reason the number of references per character passed decreases more slowly as patlen increases is that for longer patterns the probability is higher that the character just fetched occurs somewhere in the pattern, and therefore the distance the pattern can be moved forward is shortened.

    In Figure 4 we have plotted the theoretical ratio of the number of instructions executed per character passed versus the pattern length. Again we find that our implementation of the algorithm can be expected (for sufficiently large alphabets) to execute fewer than i + patlen instructions before finding the pattern at location i. That is, our implementation is usually "sublinear" even in the number of instructions executed. The comparable figure for the Knuth, Morris, and Pratt algorithm is at best (3 - p) * (i + patlen - 1) (footnote 8). For the simple search algorithm the expected value of the number of instructions executed per character passed is (approximately) 3.28 (for p = 0.09).

    It is difficult to fully appreciate the role played by delta2. For example, if the alphabet is large and patterns are short, then computing and trying to use delta2 probably does not pay off much (because the chances are high that a given character in string does not occur anywhere in pat and one will almost always stay in the fast loop ignoring delta2) (footnote 9). Conversely, delta2 becomes very important when the alphabet is small and the patterns are long (for now execution will frequently leave the fast loop; delta1 will in general be small because many of the characters in the alphabet will occur in pat and only the terminal substring observations could cause large shifts). Despite the fact that it is difficult to appreciate the role of delta2, it should be noted that the linearity result for the worst case behavior of the algorithm is due entirely to the presence of delta2.
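    To make the delta2-free variant concrete, here is a minimal sketch of a search that uses delta1 alone, with the safeguard from footnote 9 (take the larger of delta1 and patlen - j + 1 so that a worthless delta1 still moves pat forward by one). It assumes 0-based Python indexing, so the footnote's patlen - j + 1 becomes patlen - j, and it uses a dict in place of the paper's array. The name `search_delta1_only` is hypothetical, and the separate fast and slow loops of the authors' implementation are not modeled.

```python
def search_delta1_only(pat: str, string: str) -> int:
    """Return the 0-based index of the first occurrence of pat, or -1."""
    patlen = len(pat)
    if patlen == 0:
        return 0
    # delta1 as a sparse table: rightmost occurrence wins, absent -> patlen.
    delta1 = {ch: patlen - j for j, ch in enumerate(pat, start=1)}

    i = patlen - 1                      # string index aligned with end of pat
    while i < len(string):
        j = patlen - 1                  # compare right to left
        while j >= 0 and string[i] == pat[j]:
            i -= 1
            j -= 1
        if j < 0:
            return i + 1                # every character of pat matched
        # patlen - j moves pat forward by exactly one position; it is the
        # fallback when the mismatched character occurs in the matched suffix.
        i += max(delta1.get(string[i], patlen), patlen - j)
    return -1


print(search_delta1_only("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))
```

    Because every shift is at least one pattern position, the loop always terminates. What is lost without delta2 is the worst case guarantee discussed above, not correctness.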

    Comparing the empirical evidence (Figures 1 and 2) with the theoretical evidence (Figures 3 and 4, respectively), we note that the model is completely accurate for English and the 100-character alphabet. The model predicts much better behavior than we actually experience in the binary case. Our only explanation is that since delta2 predominates in the binary alphabet and sets up alignments of the pattern and the string, the algorithm backs up over longer terminal substrings of the pattern before finding mismatches. Our analysis ignores this phenomenon.

    However, in summary, the theoretical analysis supports the conclusion that on the average the algorithm is sublinear in the number of references to string and, for sufficiently large alphabets and patterns, sublinear in the number of instructions executed (in our implementation).

    8. Caveat Programmer

    It should be observed that the preceding analysis has assumed that string is entirely in primary memory and that we can obtain the ith character in it in one instruction after computing its byte address. However, if string is actually on secondary storage, then the characters in it must be read in (footnote 10). This transfer will entail some time delay equivalent to the execution of, say, w instructions per character brought in, and (because of the nature of computer I/O) all of the first i + patlen - 1 characters will eventually be brought in whether we actually reference all of them or not. (A representative figure for w for paged transfers from a fast disk is 5 instructions/character.) Thus there may be a hidden cost of w instructions per character passed over.

    According to the statistics presented above one might expect our algorithm to be approximately three times faster than the Knuth, Morris, and Pratt algorithm (for, say, English strings of length 6) since that algorithm executes about three instructions to our one. However, if the CPU is idle for the w instructions necessary to read each character, the actual ratios are closer to w + 3 instructions than to w + 1 instructions. Thus for paged disk transfers our algorithm can only be expected to be roughly 4/3 faster (i.e. 5 + 3 instructions to 5 + 1 instructions) if we assume that we are idle during I/O. Thus for large values of w the difference between the various algorithms diminishes if the CPU is idle during I/O.
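    The arithmetic behind this estimate can be written out as follows (a worked restatement of the figures above, using the round numbers of three instructions per character for Knuth, Morris, and Pratt and one for our algorithm):

    $$\text{speedup}(w) = \frac{w + 3}{w + 1}, \qquad \text{speedup}(0) = 3, \qquad \text{speedup}(5) = \frac{8}{6} = \frac{4}{3}, \qquad \lim_{w \to \infty} \text{speedup}(w) = 1 .$$

    With no I/O delay the ratio is the factor of three seen in memory. At w = 5 it has already dropped to 4/3, and it approaches 1 as the transfer cost dominates.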

    Of course, in general, programmers (or operating systems) try to avoid the situation in which the CPU is idle while awaiting an I/O transfer by overlapping I/O with some other computation. In this situation, the chances are that our algorithm will be I/O bound (we will search a page faster than it can be brought in), and indeed so will that of Knuth, Morris, and Pratt if w > 3. Our algorithm will require that fewer CPU cycles be devoted to the search itself so that if there are other jobs to perform, there will still be an overall advantage in using the algorithm.

    Footnote 8. Although the Knuth, Morris, and Pratt algorithm will fetch each of the first i + patlen - 1 characters of string precisely once, sometimes a character is involved in several tests against characters in pat. The number of such tests (each involving three instructions) is bounded by log_φ(patlen), where φ is the golden ratio.

    Footnote 9. However, if the algorithm is implemented without delta2, recall that, in exiting the slow loop, one must now take the max of delta1 and patlen - j + 1 to allow for the possibility that delta1 is worthless.

    Footnote 10. We have implemented a version of our algorithm for searching through disk files. It is available as the subroutine FFILEPOS in the latest release of INTERLISP-10. This function uses the TENEX page mapping capability to identify one file page at a time with a buffer area in virtual memory. In addition to being faster than reading the page by conventional methods, this means the operating system's memory management takes care of references to pages which happen to still be in memory, etc. The algorithm is as much as 50 times faster than the standard INTERLISP-10 FILEPOS function (depending on the length of the pattern).

    There are several situations in which it may not be advisable to use our algorithm. If the expected penetration i at which the pattern is found is small, the preprocessing time is significant and one might therefore consider using the obvious intuitive algorithm.

    As previously noted, our algorithm can be most efficiently implemented on a byte-addressable machine. On a machine that does not allow byte addresses to be incremented and decremented directly, two possible sources of inefficiency must be addressed: The algorithm typically skips through string in steps larger than 1, and the algorithm may back up through string. Unless these processes are coded efficiently, it is probably not worthwhile to use our algorithm.

    Furthermore, it should be noted that because the algorithm can back up through string, it is possible to cross a page boundary more than once. We have not found this to be a serious source of inefficiency. However, it does require a certain amount of code to handle the necessary buffering (if page I/O is being handled directly as in our FFILEPOS). One beauty of the Knuth, Morris, and Pratt algorithm is that it avoids this problem altogether.

    A final situation in which it is unadvisable to use our algorithm is if the string matching problem to be solved is actually more complicated than merely finding the first occurrence of a single substring. For example, if the problem is to find the first of several possible substrings or to identify a location in string defined by a regular expression, it is much more advantageous to use an algorithm such as that of Aho and Corasick [1].

    It may of course be possible to design an algorithm that searches for multiple patterns or instances of regular expressions by using the idea of starting the match at the right end of the pattern. However, we have not designed such an algorithm.

    9. Historical Remarks

    Our earliest formulation of the algorithm involved only delta1 and implemented Observations 1, 2, and 3(a). We were aware that we could do something along the lines of delta2 and Observation 3(b), but did not precisely formulate it. Instead, in April 1974, we coded the delta1 version of the algorithm in Interlisp, merely to test its speed. We considered coding the algorithm in PDP-10 assembly language but abandoned the idea as impractical because of the cost of incrementing byte pointers by arbitrary amounts.

    We have since learned that R. W. Gosper, of Stanford University, simultaneously and independently discovered the delta1 version of the algorithm (private communication).

    In April 1975, we started thinking about the implementation again and discovered a way to increment byte pointers by indexing through a table. We then formulated a version of delta2 and coded the algorithm more or less as it is presented here. This original definition of delta2 differed from the current one in the following respect: If only the last m characters of pat (call this substring subpat) were matched, delta2 specified a slide to the second from the rightmost occurrence of subpat in pat (allowing this occurrence to "fall off" the left end of pat) but without any special consideration of the character preceding this occurrence.

    The average behavior of that version of the algorithm was virtually indistinguishable from that presented in this paper for large alphabets, but was somewhat worse for small alphabets. However, its worst case behavior was quadratic (i.e. required on the order of i * patlen comparisons). For example, consider searching for a pattern of the form CA(BA)^r in a string of the form ((XX)^r(AA)(BA)^r)* (e.g. r = 2, pat = "CABABA," and string = "XXXXAABABAXXXXAABABA..."). The original definition of delta2 allowed only a slide of 2 if the last "BA" of pat was matched before the next "A" failed to match. Of course in this situation this only sets up another mismatch at the same character in string, but the algorithm had to reinspect the previously inspected characters to discover it. The total number of references to string in passing i characters in this situation was (r + 1) * (r + 2) * i/(4r + 2), where r = (patlen - 2)/2. Thus the number of references was on the order of i * patlen.
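    The family of inputs in this example is easy to generate, which may help in seeing why the count grows with patlen. The sketch below only constructs the pattern and string for a given r and evaluates the reference count formula quoted above. It does not reimplement the original delta2, and the function names are hypothetical.

```python
def worst_case_instance(r: int, repeats: int) -> tuple[str, str]:
    """Build pat = CA(BA)^r and string = ((XX)^r (AA) (BA)^r)^repeats."""
    pat = "CA" + "BA" * r
    block = "XX" * r + "AA" + "BA" * r
    return pat, block * repeats


def references_original_delta2(r: int, i: int) -> float:
    """Reference count quoted in the text for passing i characters."""
    return (r + 1) * (r + 2) * i / (4 * r + 2)


pat, string = worst_case_instance(2, 2)
print(pat, string)
for r in (2, 8, 32):
    patlen = 2 * r + 2
    print(patlen, references_original_delta2(r, 1000))
```

    For large r the factor (r + 1)(r + 2)/(4r + 2) grows like r/4, that is roughly patlen/8, so the number of references per character passed is proportional to patlen, as the text states.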

    However, on the average the algorithm was blindingly fast. To our surprise, it was several times faster than the string searching algorithm in the Tenex TECO text editor. This algorithm is reputed to be quite an efficient implementation of the simple search algorithm because it searches for the first character of pat one full word at a time (rather than one byte at a time).

    In the summer of 1975, we wrote a brief paper on the algorithm and distributed it on request.

    In December 1975, Ben Kuipers of the M.I.T. Artificial Intelligence Laboratory read the paper and brought to our attention the improvement to delta2 concerning the character preceding the terminal substring and its reoccurrence (private communication). Almost simultaneously, Donald Knuth of Stanford University suggested the same improvement and observed that the improved algorithm could certainly make no more than order (i + patlen) * log(patlen) references to string (private communication).

    We mentioned this improvement in the next revision of the paper and suggested an additional improvement, namely the replacement of both delta1 and delta2 by a single two-dimensional table. Given the mismatched char from string and the position j in pat at which the mismatch occurred, this table indicated the distance to the last occurrence (if any) of the substring [char, pat(j + 1), ..., pat(patlen)] in pat. The revised paper concluded with the question of whether this improvement or a similar one produced an algorithm which was at worst linear and on the average "sublinear."

    In January 1976, Knuth [5] proved that the simpler improvement in fact produces linear behavior, even in the worst case. We therefore revised the paper again and gave delta2 its current definition.

    In April 1976, R.W. Floyd of Stanford University discovered a serious statistical fallacy in the first version of our formula giving the expected value of the ratio of cost to characters passed. He provided us (private communication) with the current version of this formula.

    Thomas Standish, of the University of California at Irvine, has suggested (private communication) that the implementation of the algorithm can be improved by fetching larger bytes in the fast loop (i.e. bytes containing several characters) and using a hash array to encode the extended delta1 table. Provided the difficulties at the boundaries of the pattern are handled efficiently, this could improve the behavior of the algorithm enormously since it exponentially increases the effective size of the alphabet and reduces the frequency of common characters.
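The single two-dimensional table proposed in the earlier revision, indexed by the mismatched char and the position j, can be sketched as follows. This is a hedged illustration only: the names `combined_shift` and `build_table` are hypothetical, entries are expressed as slides of the pattern rather than as distances added to the string pointer, and treating positions beyond the left end of pat as wildcards is an assumption made here by analogy with delta2.

```python
def combined_shift(pat, char, j):
    """Smallest slide s >= 1 that brings an occurrence of
    [char, pat[j+1], ..., pat[-1]] in pat under the text just examined.

    Positions that fall off the left end of pat are treated as
    wildcards (an assumption of this sketch), so s == len(pat)
    always qualifies.
    """
    m = len(pat)
    for s in range(1, m + 1):
        if j - s >= 0 and pat[j - s] != char:
            continue                     # char would not line up
        if all(k - s < 0 or pat[k - s] == pat[k]
               for k in range(j + 1, m)):
            return s
    return m                             # reached only for an empty pat


def build_table(pat, alphabet):
    """Precompute the slide for every (char, position) pair."""
    return {(c, j): combined_shift(pat, c, j)
            for c in alphabet
            for j in range(len(pat))}


table = build_table("CABABA", "ABCX")
print(table[("A", 4)], table[("C", 2)], table[("X", 5)])
```

For pat = "CABABA", a mismatch at 0-based position 4 against the character "A" yields a slide of the full pattern length, because "AA" occurs nowhere in pat. The table has one entry per (character, position) pair, so its size grows with the alphabet size times patlen, whereas delta1 and delta2 together need space linear in patlen plus the alphabet size.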

    Acknowledgments. We would like to thank B. Kuipers, of the M.I.T. Artificial Intelligence Laboratory, for his suggestion concerning delta2 and D. Knuth, of Stanford University, for his analysis of the improved algorithm. We are grateful to the anonymous reviewer for Communications who suggested the inclusion of evidence comparing our algorithm with that of Knuth, Morris, and Pratt, and for the warnings contained in Section 8. B. Mont-Reynaud, of the Stanford Research Institute, and L. Guibas, of Xerox Palo Alto Research Center, proofread drafts of this paper and suggested several clarifications. We would also like to thank E. Taft and E. Fiala of Xerox Palo Alto Research Center for their advice regarding machine coding the algorithm.

    Received June 1975; revised April 1976

    References
    1. Aho, A.V., and Corasick, M.J. Fast pattern matching: An aid to bibliographic search. Comm. ACM 18, 6 (June, 1975), 333-340.
    2. Beeler, M., Gosper, R.W., and Schroeppel, R. Hakmem. Memo No. 239, M.I.T. Artificial Intelligence Lab., M.I.T., Cambridge, Mass., Feb. 29, 1972.
    3. Dewey, G. Relativ Frequency of English Speech Sounds. Harvard U. Press, Cambridge, Mass., 1923, p. 185.
    4. Knuth, D.E., Morris, J.H., and Pratt, V.R. Fast pattern matching in strings. TR CS-74-440, Stanford U., Stanford, Calif., 1974.
    5. Knuth, D.E., Morris, J.H., and Pratt, V.R. Fast pattern matching in strings. (to appear in SIAM J. Comput.).


    Communications of the ACM, October 1977, Volume 20, Number 10

