Algorithms Exact pattern matching · Knuth-Morris-Pratt •Make sure that no comparisons...

Date post: 08-Sep-2020
Page 1: Algorithms Exact pattern matching · Knuth-Morris-Pratt •Make sure that no comparisons “wasted” •After such a mismatch we already know exactly the values of green area in

30.11.17

Text Algorithms

Jaak Vilo, 2016 fall

MTAT.03.190 Text Algorithms, Jaak Vilo

Topics

• Exact matching of one pattern (string)
• Exact matching of multiple patterns
• Suffix trie and tree indexes
  – Applications
• Suffix arrays
• Inverted index
• Approximate matching

Algorithms

One-pattern
• Brute force
• Knuth-Morris-Pratt
• Karp-Rabin
• Shift-OR, Shift-AND
• Boyer-Moore
• Factor searches
• Regular expressions (?)
• Weight matrices (?)

Multi-pattern
• Aho-Corasick
• Commentz-Walter

Indexing
• Trie (and suffix trie)
• Suffix tree

Exact pattern matching

• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ: alphabet, |Σ| = c
• Does S contain P?
  – Does S = S' P S" for some strings S' and S"?
  – Usually m << n, and n can be (very) large

Find occurrences in text

[Figure: pattern P aligned under text S]

Animations

• http://www-igm.univ-mlv.fr/~lecroq/string/
• Exact String Matching Algorithms, animation in Java
• Christian Charras, Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, France
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net


Brute force: BAB in text?

A B A C A B A B B A B B B A
B A B

Brute force

[Figure: pattern P aligned at position i of text S; P[j] is compared with S[i+j-1]]

Identify the first mismatch!

Question:

§ Problems of this method?
§ Ideas to improve the search?

Brute force

Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i, where P occurs in S

for (i=1; i<=n-m+1; i++) {
    for (j=1; j<=m; j++)
        if (S[i+j-1] != P[j]) break;
    if (j>m) print i;
}

attempt 1:
gcatcgcagagagtatacagtacg
GCAg....

attempt 2:
gcatcgcagagagtatacagtacg
 g.......

attempt 3:
gcatcgcagagagtatacagtacg
  g.......

attempt 4:
gcatcgcagagagtatacagtacg
   g.......

attempt 5:
gcatcgcagagagtatacagtacg
    g.......

attempt 6:
gcatcgcagagagtatacagtacg
     GCAGAGAG

attempt 7:
gcatcgcagagagtatacagtacg
      g.......

Brute force or Naive Search

function NaiveSearch(string s[1..n], string sub[1..m])
    for i from 1 to n-m+1
        for j from 1 to m
            if s[i+j-1] ≠ sub[j]
                jump to next iteration of outer loop
        return i
    return not found

C code

#include <string.h>

/* Count the occurrences of pat in text; n = strlen(text). */
int bf_2( char* pat, char* text, int n )
{
    int m, i, j;
    int count = 0;
    m = strlen(pat);

    for ( i = 0; i + m <= n; i++ ) {
        /* advance j while pattern and text agree */
        for ( j = 0; j < m && pat[j] == text[i+j]; j++ )
            ;
        if ( j == m )
            count++;
    }
    return count;
}

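C code

#include <string.h>

/* Count the occurrences of pat in text, testing every start position with strncmp. */
int bf_1( char* pat, char* text )
{
    int m;
    int count = 0;
    char *tp;

    m = strlen(pat);
    tp = text;

    for ( ; *tp; tp++ ) {
        if ( strncmp( pat, tp, m ) == 0 )
            count++;
    }
    return count;
}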


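Main problem of Naive

• For the next possible location of P, the same positions of S are checked again

[Figure: P aligned at position i of S with a mismatch at S[i+j-1]; after shifting P by one, the same positions of S are compared again]

Goals

• Make sure only a constant number of comparisons/operations is made for each position in S
  – Move (only) from left to right in S
  – How?
  – After a test of S[i] <> P[j], what do we know now?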

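Knuth-Morris-Pratt

• Make sure that no comparisons are "wasted"
• After such a mismatch we already know exactly the values of the green area in S!

D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.

[Figure: mismatch x ≠ y between text S and pattern P; the already matched part of S is shown in green]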

Knuth-Morris-Pratt

• Make sure that no comparisons are "wasted"
• P: longest suffix of any prefix that is also a prefix of the pattern
• Example: ABCABD

[Figure: prefixes x and y of the pattern, with a shorter prefix p followed by z; example pattern ABCABD]

Automaton for ABCABD

[Figure: states 1..7 in a chain with transitions A, B, C, A, B, D; state 1 loops on NOT A]

State:      1 2 3 4 5 6 7
Fail links: 0 1 1 1 2 3 1
Pattern:    A B C A B D


KMP matching

Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)

i=1; j=1;
initfail(P)                     // Prepare fail links
repeat
    if j==0 or S[i] == P[j]
        then i++ , j++          // advance in text and in pattern
        else j = fail[j]        // use fail link
until j>m or i>n
if j>m then report match at i-m

Initialization of fail links

Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P

i=1, j=0 , fail[1]= 0
repeat
    if j==0 or P[i] == P[j]
        then i++ , j++ , fail[i] = j
        else j = fail[j]
until i>=m
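
The two KMP procedures can be sketched in C as follows. This is only an illustration under a few assumptions that are not on the slides: the function names `kmp_initfail` and `kmp_search` are made up, the 1-based indices of the pseudocode are kept (hence the `- 1` when touching the C strings), the `repeat ... until` loops are written as `while` loops so that a one-character pattern is handled, and the result is returned as a 0-based position (or -1) instead of being reported.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill fail[1..m] as in KMP_Initfail (1-based; fail needs m+1 slots). */
static void kmp_initfail(const char *p, int m, int *fail)
{
    int i = 1, j = 0;

    fail[1] = 0;
    while (i < m) {
        if (j == 0 || p[i - 1] == p[j - 1]) {
            i++; j++;
            fail[i] = j;
        } else {
            j = fail[j];
        }
    }
}

/* Return the 0-based start of the first occurrence of p in s, or -1. */
int kmp_search(const char *s, const char *p)
{
    int n = (int)strlen(s), m = (int)strlen(p);
    int i = 1, j = 1, result = -1;
    int *fail;

    if (m == 0)
        return 0;                        /* empty pattern: assumed to match at 0 */
    fail = (int *)malloc((size_t)(m + 1) * sizeof *fail);
    assert(fail != NULL);
    kmp_initfail(p, m, fail);

    while (j <= m && i <= n) {
        if (j == 0 || s[i - 1] == p[j - 1]) {
            i++; j++;                    /* advance in text and in pattern */
        } else {
            j = fail[j];                 /* use fail link; i never moves back */
        }
    }
    if (j > m)
        result = i - m - 1;              /* the pseudocode reports 1-based i-m */
    free(fail);
    return result;
}
```

For the pattern ABCABD the sketch produces the fail values 0 1 1 1 2 3 for positions 1..6, which agrees with the fail links listed for the automaton.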

Initialization of fail links (example: ABCABD)

[Figure: step-by-step trace of KMP_Initfail on ABCABD with pointers i and j; the fail array grows 0, then 0 1, then 0 1 1 1, then 0 1 1 1 2]

Time complexity of KMP matching?

Analysis of time complexity

• At every cycle either i and j increase by 1
• Or j decreases (j = fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
  – A: not more than the number of increases of i
• Amortised analysis: O(n), preprocessing O(m)

Karp-Rabin

• Compare in O(1) a hash of P and S[i..i+m-1]
• Goal: O(n)
• f( h(T[i..i+m-1]) -> h(T[i+1..i+m]) ) = O(1)

R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.

[Figure: h(T[i..i+m-1]) of the text window i..(i+m-1) is compared with h(P) of the pattern 1..m]


Karp-Rabin

[Figure: the window moves one position to the right; h(T[i+1..i+m]) is compared with h(P)]

Hash

• "Remove" the effect of T[i] and "introduce" the effect of T[i+m], in O(1)
• Use base |Σ| arithmetic and treat characters as numbers
• In case of a hash match, check all m positions
• Hash collisions => worst case O(nm)

Let's use numbers

• T = 57125677
• P = 125 (and for simplicity, h = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• H(T[3]) = ( H(T[2]) - ord(T[2])*10^{m-1} )*10 + T[3+m-1]

Hash

• c: size of alphabet
• HS_i = H( S[i..i+m-1] )
• H( S[i+1..i+m] ) = ( HS_i - ord(S[i]) * c^{m-1} ) * c + ord(S[i+m])
• Modulo arithmetic, to fit the value in a word!
• hash( w[0..m-1] ) = ( w[0]*2^{m-1} + w[1]*2^{m-2} + ··· + w[m-1]*2^0 ) mod q

Karp-Rabin

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

c = 20                      /* Size of the alphabet, say nr. of amino acids */
q = 33554393                /* q is a prime */
cm = c^{m-1} mod q
hp = 0 ; hs = 0
for i = 1 .. m do hp = ( hp*c + ord(p[i]) ) mod q      // H(P)
for i = 1 .. m do hs = ( hs*c + ord(s[i]) ) mod q      // H(S[1..m])
if hp == hs and P == S[1..m] report match at position 1
for i = 2 .. n-m+1
    hs = ( (hs - ord(s[i-1])*cm) * c + ord(s[i+m-1]) ) mod q
    if hp == hs and P == S[i..i+m-1]
        report match at position i
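
A hedged C sketch of the rolling-hash search above. The prime q is the one from the pseudocode; everything else is an assumption made for illustration: the name `kr_count`, the base c = 256 (all byte values instead of 20 amino acids), 0-based indexing, and returning a count instead of reporting positions. Before the subtraction q is added, so the intermediate value never becomes negative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KR_BASE 256u         /* c: alphabet size (assumption: any byte value) */
#define KR_MOD  33554393u    /* q: the prime used in the pseudocode */

/* Count the (possibly overlapping) occurrences of p in s. */
int kr_count(const char *s, const char *p)
{
    size_t n, m, i;
    uint64_t cm = 1, hp = 0, hs = 0, out;
    int count = 0;

    assert(s != NULL && p != NULL);
    n = strlen(s);
    m = strlen(p);
    if (m == 0 || m > n)
        return 0;

    for (i = 1; i < m; i++)                       /* cm = c^(m-1) mod q */
        cm = (cm * KR_BASE) % KR_MOD;
    for (i = 0; i < m; i++) {
        hp = (hp * KR_BASE + (unsigned char)p[i]) % KR_MOD;   /* H(P) */
        hs = (hs * KR_BASE + (unsigned char)s[i]) % KR_MOD;   /* H(S[0..m-1]) */
    }
    for (i = 0; ; i++) {
        /* on a hash match, check all m positions to rule out a collision */
        if (hp == hs && memcmp(p, s + i, m) == 0)
            count++;
        if (i + m >= n)
            break;
        out = ((unsigned char)s[i] * cm) % KR_MOD;            /* effect of s[i] */
        hs = (hs + KR_MOD - out) % KR_MOD;                    /* remove it */
        hs = (hs * KR_BASE + (unsigned char)s[i + m]) % KR_MOD; /* introduce s[i+m] */
    }
    return count;
}
```

With the numbers example, `kr_count("57125677", "125")` finds the single occurrence of 125.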


More ways to ensure O(n)? Shift-AND / Shift-OR

• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10 [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243]

Bit-operations

• Maintain a set of all prefixes that have so far had a perfect match
• On the next character in text, update all previous pointers to a new set
• Bit vector: for every possible character
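
The idea can be sketched in C for the Shift-AND variant. The function name `shift_and_search`, the restriction to patterns of at most 64 characters (one machine word), and the 0-based return value are assumptions of this sketch, not part of the slides. Bit j of the state is 1 exactly when the prefix p[0..j] matches the text ending at the current character.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the 0-based start of the first occurrence of p in s, or -1.
   Requires 1 <= strlen(p) <= 64 so that the state fits in one word. */
long shift_and_search(const char *s, const char *p)
{
    uint64_t mask[256] = {0};    /* mask[c]: bit j is set iff p[j] == c */
    uint64_t state = 0, accept;
    size_t m = strlen(p), i;

    assert(m >= 1 && m <= 64);
    for (i = 0; i < m; i++)
        mask[(unsigned char)p[i]] |= (uint64_t)1 << i;
    accept = (uint64_t)1 << (m - 1);     /* bit of the full pattern */

    for (i = 0; s[i] != '\0'; i++) {
        /* shift by 1, introduce 1, bitwise AND with the vector of s[i] */
        state = ((state << 1) | 1) & mask[(unsigned char)s[i]];
        if (state & accept)
            return (long)(i + 1 - m);
    }
    return -1;
}
```

Each text character costs a constant number of word operations, so the search is O(n) as long as the pattern fits in a word; Shift-OR uses the complemented vectors and saves the "introduce 1" step.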

Matching in linear time (shift-OR)

Pattern: ABCB
Text: ABCABCBEA

[Figure: shift-OR trace over the text; at each text character T[j] the state vector is shifted and combined with the bit vector bv[ T[j] ]]

State: which prefixes match? Shift-AND; shift-OR

Move to next (shift-AND): shift by 1, introduce 1, bitwise AND

[Figure: the state bit vector is shifted by one position, a 1 is introduced, and the result is AND-ed (&) with Pattern[S[i]] to give (=) the new state]


Track positions of prefix matches

  0 1 0 1 0 1    state
  1 0 1 0 1 1    shift left <<, introduce 1
& 1 0 0 0 1 1    mask on char T[i]
= 1 0 0 0 1 1    bitwise AND

Vectors for every char in Σ

• P = aste

One bit vector per character (columns); rows are the pattern positions a, s, t, e:

    a s t e b c d .. z
a   1 0 0 0 0 ...
s   0 1 0 0 0 ...
t   0 0 1 0 0 ...
e   0 0 0 1 0 ...

• T = lasteaed

State after each text character (one column per character, filled in step by step on the slides):

    l a s t e a e d
a   0 1 0 0 0 1
s   0 0 1 0 0 0
t   0 0 0 1 0 0
e   0 0 0 0 1 0


http://www-igm.univ-mlv.fr/~lecroq/string/node6.html

[A]11010101

Summary

Algorithm            Worst case   Ave. case             Preprocess
Brute force          O(mn)        O(n*(1+1/|Σ|+..))
Knuth-Morris-Pratt   O(n)         O(n)                  O(m)
Rabin-Karp           O(mn)        O(n)                  O(m)
Boyer-Moore                       O(n/m)?
BM Horspool
Factor search
Shift-OR             O(n)         O(n)                  O(m|Σ|)

• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]

• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf


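Find occurrences in text

• Have we missed anything?

[Figure: pattern P aligned under text S]

Find occurrences in text

• What have we learned if we test for a potential match from the end?

[Figure: pattern P aligned under text S, compared from the right end; example string ABCDEBBCDE, mismatching characters A and B]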

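Bad character heuristics: maximal shift on S[i]

[Figure: text S and pattern P; the text character X = S[i] is aligned with the first X in the pattern (from the end)]

delta1( S[i] ) =
    m            if the pattern does not contain S[i]
    patlen - j   otherwise, for the max j such that P[j] == S[i]   (patlen = m)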

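/* Uses globals declared elsewhere: p (pattern), m (its length),
   occ[] (last occurrence table) and alphabetsize. */
void bmInitocc() {
    int a, j;      /* int, not char: a char index can overflow or be negative */

    for (a = 0; a < alphabetsize; a++)
        occ[a] = -1;                 /* character does not occur in p */

    for (j = 0; j < m; j++) {
        a = (unsigned char) p[j];
        occ[a] = j;                  /* rightmost position of p[j] so far */
    }
}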


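Good suffix heuristics

[Figure: the matched suffix µ of P aligned with S. Case 1: µ occurs again further left in P. Case 2: a suffix µ' of µ is also a prefix of P]

delta2( S[i] ): minimal shift so that the matched region is fully covered, or so that a suffix of the match is also a prefix of P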

Boyer-Moore algorithm

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

preprocess_BM()             // delta1 and delta2
i = m
while i <= n
    for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ;
    if j==0 report match at position i-m+1
    i = i + max( delta1[ S[i] ], delta2[ j ] )

• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm

• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf

• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/

Simplifications of BM

• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear
• The algorithm can be made faster while also simplifying the code.
• It is useful to use the last character heuristics (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).

Algorithm BMH (Boyer-Moore-Horspool)

• R. N. Horspool: Practical Fast Searching in Strings. Software - Practice and Experience, 10(6):501-506, 1980

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

for a in Σ do delta[a] = m
for j = 1..m-1 do delta[P[j]] = m-j
i = m
while i <= n
    if S[i] == P[m]
        j = m-1
        while ( j>0 and P[j]==S[i-m+j] ) j = j-1
        if j==0 report match at i-m+1
    i = i + delta[ S[i] ]
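
A minimal C sketch of BMH, assuming a byte alphabet and 0-based strings. The name `bmh_count` and the choice to count occurrences rather than report them are illustrative assumptions. The variable i marks the text position aligned with the last pattern character, as in the pseudocode.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Count the (possibly overlapping) occurrences of p in s. */
int bmh_count(const char *s, const char *p)
{
    size_t n = strlen(s), m = strlen(p), i, j;
    size_t delta[UCHAR_MAX + 1];
    int count = 0;

    assert(m >= 1);
    for (i = 0; i <= UCHAR_MAX; i++)
        delta[i] = m;                     /* character not in P: shift by m */
    for (j = 0; j + 1 < m; j++)           /* all positions except the last */
        delta[(unsigned char)p[j]] = m - 1 - j;

    for (i = m - 1; i < n; i += delta[(unsigned char)s[i]]) {
        /* compare right to left; p[j-1] is aligned with s[i + j - m] */
        j = m;
        while (j > 0 && p[j - 1] == s[i + j - m])
            j--;
        if (j == 0)
            count++;
    }
    return count;
}
```

The shift depends only on the text character under the last pattern position, so a single table replaces delta1 and delta2 of the full Boyer-Moore algorithm.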

String matching: Horspool algorithm

• How is the comparison made? From right to left: suffix search
• Which is the next position of the window? It depends on where the last letter of the text window, say 'a', appears in the pattern
• A preprocessing step is therefore needed to determine the length of the shift

[Figure: pattern window over the text; the window is shifted so that the text letter 'a' lines up with the nearest 'a' in the pattern]


Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)

• Use delta in a tight loop
• If match (delta==0) then check and apply original delta d

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

1.  for a in Σ do delta[a] = m
2.  for j=1..m-1 do delta[P[j]] = m-j
3.  d = delta[ P[ m ] ];                 // memorize d on P[m]
4.  delta[ P[ m ] ] = 0;                 // ensure delta on match of last char is 0
5.  for ( i=m ; i<= n ; i = i+d )
6.      repeat                           // skip loop
7.          t=delta[ S[i] ] ; i = i + t
8.      until t==0
9.      for( j=m-1 ; j> 0 and P[j]==S[i-m+j] ; j = j-1 ) ;
10.     if j==0 and i<=n report match at i-m+1

BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P (in order for the algorithm to finish correctly: at least one occurrence!). The test i<=n in step 10 keeps the padded copy itself from being reported.
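
The skip loop and the padding can be sketched in C as follows. This is an illustration under assumptions: the name `bmhhs_count` is made up, the text is copied into a buffer so that a copy of P can be appended as a sentinel (a real implementation would more likely reserve the extra room in the input buffer), and occurrences are counted rather than reported.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Count the (possibly overlapping) occurrences of p in s. */
int bmhhs_count(const char *s, const char *p)
{
    size_t n = strlen(s), m = strlen(p), i, j, t, d;
    size_t delta[UCHAR_MAX + 1];
    char *buf;
    int count = 0;

    assert(m >= 1);
    buf = (char *)malloc(n + m);          /* text padded by a copy of P */
    assert(buf != NULL);
    memcpy(buf, s, n);
    memcpy(buf + n, p, m);

    for (i = 0; i <= UCHAR_MAX; i++)
        delta[i] = m;
    for (j = 0; j + 1 < m; j++)
        delta[(unsigned char)p[j]] = m - 1 - j;
    d = delta[(unsigned char)p[m - 1]];   /* memorize d on the last char */
    delta[(unsigned char)p[m - 1]] = 0;   /* delta 0 marks "last char matches" */

    for (i = m - 1; i < n; i += d) {
        do {                              /* skip loop: no bounds test needed, */
            t = delta[(unsigned char)buf[i]];   /* the sentinel stops it */
            i += t;
        } while (t != 0);
        if (i >= n)
            break;                        /* stopped inside the padding */
        for (j = m - 1; j > 0 && p[j - 1] == buf[i + j - m]; j--)
            ;
        if (j == 0)
            count++;
    }
    free(buf);
    return count;
}
```

The inner loop contains one table lookup and one addition per step; the end-of-text test is made only after the loop has stopped on a possible match.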

• Daniel M. Sunday: A very fast substring search algorithm [PDF]. Communications of the ACM, August 1990, Volume 33, Issue 8

• Loop unrolling:
  – Avoid too many loops (each loop requires tests) by just repeating code within the loop.
  – Line 7 in previous algorithm can be replaced by:

7. i += delta[ S[i] ];
   i += delta[ S[i] ];
   i += (t = delta[ S[i] ]);

Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm

• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)

Factor based approach

• Optimal average-case algorithms
  – Assuming independent characters, same probability
• Factor: a substring of a pattern
  – Any substring
  – (how many?)


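Factor searches

Do not compare characters, but find the longest match to any subregion of the pattern.

[Figure: text S and pattern P; the window is read backwards until the factor u is extended by a character X that makes it no longer a factor of P]

Examples

• Backward DAWG Matching (BDM)
  – Crochemore et al. 1994
• Backward Nondeterministic DAWG Matching (BNDM)
  – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM)
  – Allauzen, Crochemore, Raffinot 2001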

Backward DAWG Matching (BDM)

Do not compare characters, but find the longest match to any subregion of the pattern.

Suffix automaton recognises all factors (and suffixes) in O(n)

BNDM: simulate using bit parallelism

Bits show where the factors have occurred so far

BNDM matches an NDA

NDA on the suffixes of 'announce'

Deterministic version of the same: Backward Factor Oracle


BNDM: Backward Non-Deterministic DAWG Matching
BOM: Backward Oracle Matching

String matching of one pattern

CTACTACTACGTCTATACTGATCGTAGCTACTACGGTATGACTAA

[Figure: three ways to search the text: prefix search, suffix search, factor search]

Multiple patterns

[Figure: a set of patterns {P} searched in text S]

Why?

• Multiple patterns
• Highlight multiple different search words on the page
• Virus detection: filter for virus signatures
• Spam filters
• Scanner in compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to genome!
• …

Algorithms

• Aho-Corasick (search for multiple words)
  – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
  – Generalization of Boyer-Moore & AC
• Wu and Manber
  – improvement over C-W
• Additional methods, tricks and techniques

Aho-Corasick (AC)

• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975)
• ABSTRACT: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.


• Generalization of KMP for many patterns
• Text S like before.
• Set of patterns P = {P1, .., Pk}
• Total length |P| = m = Σ_{i=1..k} m_i
• Problem: find all occurrences of any of the Pi ∈ P from S

Idea

1. Create an automaton from all patterns
2. Match the automaton

• Use the PATRICIA trie for creating the main structure of the automaton

PATRICIA trie

• D. R. Morrison, "PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.

• Word trie: a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced "tree".
• To test for a word p, only O(|p|) time is used no matter how many words are in the dictionary...

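Trie for P = {he, she, his, hers}

[Figure: the trie under construction; after "he" it has states 0, 1, 2 (edges h, e), after "she" also states 3, 4, 5 (edges s, h, e)]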


Trie for P = {he, she, his, hers}

0 -h-> 1 -e-> 2 -r-> 8 -s-> 9
       1 -i-> 6 -s-> 7
0 -s-> 3 -h-> 4 -e-> 5

How to search for words like he, sheila, hi. Do these occur in the trie?

Aho-Corasick

1. Create an automaton MP for a set of strings P.
2. Finite state machine: read a character from text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] - read a character c from text and go from a state j to state goto[j,c].
4. If there are no goto[j,c] links on character c from state j, use fail[j].
5. Report the output. Report all words that have been found in state j.

AC Automaton (vs KMP)

[Figure: the trie for {he, she, his, hers} with states 0..9 and fail links; state 0 loops on NOT { h, s }]

goto[1,i] = 6
fail[7] = 3, fail[8] = 0, fail[5] = 2

Output table
state   output[j]
2       he
5       she, he
7       his
9       hers

AC matching

Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last position)

state = 0
for i = 1..n do
    while (goto[state,S[i]] == ∅) and (fail[state] != state) do
        state = fail[state]
    state = goto[state,S[i]]
    if (output[state] not empty)
        then report matches output[state] at position i

Algorithm Aho-Corasick preprocessing I (TRIE)

Input: P = {P1, ..., Pk}
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.

procedure enter(a1, ..., am)        /* Pi = a1, ..., am */
begin
    s = 0; j = 1;
    while j ≤ m and goto[s,aj] ≠ ∅ do    // follow existing path
        s = goto[s,aj];
        j = j+1;
    for p = j to m do               // add new path (states)
        news = news+1;
        goto[s,ap] = news;
        s = news;
    output[s] = a1, ..., am
end

begin
    news = 0
    for i = 1 to k do enter( Pi )
    for a ∈ Σ do
        if goto[0,a] = ∅ then goto[0,a] = 0;
end


Preprocessing II for AC (FAIL)

queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue( queue, goto[0,a] )
        fail[ goto[0,a] ] = 0
while queue ≠ ∅
    r = take( queue )
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[ r, a ]
            enqueue( queue, s )             // breadth first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[ fail[s] ]
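
Both preprocessing phases and the matching loop fit into a short C sketch. The simplifications are assumptions made only to keep the example small: patterns are lowercase a..z, at most 32 patterns and 64 states are supported, output sets are stored as bit masks (bit i stands for pattern i), -1 plays the role of ∅, and the names `ac_build` and `ac_search` are hypothetical. A full table is used for goto, the first of the node representations discussed in these slides.

```c
#include <assert.h>
#include <string.h>

#define AC_MAXS  64              /* max number of states (sketch limit) */
#define AC_SIGMA 26              /* alphabet: lowercase a..z */

static int ac_goto[AC_MAXS][AC_SIGMA];
static int ac_fail[AC_MAXS];
static unsigned ac_out[AC_MAXS]; /* bit i set: pattern i ends in this state */
static int ac_states;

/* Preprocessing I (trie) and II (fail links by breadth-first search). */
void ac_build(const char *pats[], int k)
{
    int queue[AC_MAXS], head = 0, tail = 0, i, a, r, s, state;
    const char *p;

    assert(k <= 32);
    memset(ac_goto, -1, sizeof ac_goto);       /* -1 means "no transition" */
    memset(ac_fail, 0, sizeof ac_fail);
    memset(ac_out, 0, sizeof ac_out);
    ac_states = 1;                              /* state 0 is the root */

    for (i = 0; i < k; i++) {                   /* enter(Pi) */
        s = 0;
        for (p = pats[i]; *p; p++) {
            a = *p - 'a';
            assert(a >= 0 && a < AC_SIGMA);
            if (ac_goto[s][a] == -1) {
                assert(ac_states < AC_MAXS);
                ac_goto[s][a] = ac_states++;    /* add a new state */
            }
            s = ac_goto[s][a];
        }
        ac_out[s] |= 1u << i;
    }
    for (a = 0; a < AC_SIGMA; a++) {
        if (ac_goto[0][a] == -1)
            ac_goto[0][a] = 0;                  /* root loops on other chars */
        else
            queue[tail++] = ac_goto[0][a];      /* depth 1: fail stays 0 */
    }
    while (head < tail) {
        r = queue[head++];
        for (a = 0; a < AC_SIGMA; a++) {
            s = ac_goto[r][a];
            if (s == -1)
                continue;
            queue[tail++] = s;
            state = ac_fail[r];
            while (ac_goto[state][a] == -1)
                state = ac_fail[state];
            ac_fail[s] = ac_goto[state][a];
            ac_out[s] |= ac_out[ac_fail[s]];    /* inherit outputs */
        }
    }
}

/* Scan the text once; return the bit mask of all patterns that occur. */
unsigned ac_search(const char *text)
{
    unsigned found = 0;
    int state = 0, a;

    for (; *text; text++) {
        a = *text - 'a';
        if (a < 0 || a >= AC_SIGMA) {           /* char outside the alphabet */
            state = 0;
            continue;
        }
        while (ac_goto[state][a] == -1)
            state = ac_fail[state];
        state = ac_goto[state][a];
        found |= ac_out[state];
    }
    return found;
}
```

When the patterns are entered in the order he, she, his, hers, the sketch numbers the states as in the automaton figure, and it reproduces goto[1,i] = 6, fail[7] = 3, fail[8] = 0 and fail[5] = 2.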

Correctness

• Let string t "point" from the initial state to state j.
• Must show that fail[j] points to the longest suffix that is also a prefix of some word in P.
• Look at the article...

AC matching time complexity

• Theorem: For matching the MP on text S, |S| = n, less than 2n transitions within M are made.
• Proof: Compare to KMP.
  – There are at most n goto steps.
  – There cannot be more than n fail steps.
  – In total, there can be less than 2n transitions in M.

Individual node (goto)

• Full table
• List
• Binary search tree (?)
• Some other index?

AC thoughts

• Scales for many strings simultaneously.
• For very many patterns, search time (of grep) improves (??)
  – See Wu-Manber article
• When k grows, then more fail[] transitions are made (why?)
• But always less than n.
• If all goto[j,a] are indexed in an array, then the size is |MP|*|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing transition functions.
• Then, O(n log(min(k,c))).

Advanced AC

• Precalculate the next state transition correctly for every possible character in alphabet
• Can be good for short patterns


Problems of AC?

• Need to rebuild on adding/removing patterns
• Details of branching on each node (?)

Commentz-Walter

• Generalization of Boyer-Moore for multiple sequence search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes In Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.fh-albsig.de/win/personen/professoren.php?RID=36
• "You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17,2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB)"

C-W description

• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case.

Commentz-Walter [CW79]

• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.

Idea of C-W

• Build a backward trie of all keywords
• Match from the end until mismatch...
• Determine the shift based on the combination of heuristics

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

1. Build the trie of the inverted patterns
2. lmin = 4
3. Table of shifts:
   A  1
   C  4 (lmin)
   G  2
   T  1
4. Start the search

[Figure: trie of the inverted patterns]

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)


Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

The text: ACATGCTATGTGACA…

[Figure: seven animation steps of the search window moving over the text, using the trie of the inverted patterns and the shift table A 1, C 4 (lmin), G 2, T 1]

Short shifts!

What are the possible limitations for C-W?

• Many patterns, small alphabet: minimal skips
• What can be done differently?

Wu-Manber

• Wu S., and U. Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.

Key idea

• Main problem with Boyer-Moore and many patterns is that the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at characters from the text one by one, we consider them in blocks of size B.
• A good value of B is in the order of log_c 2M; in practice, we use either B=2 or B=3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.

Horspool to Wu-Manber

How can we increase the length of the shifts?

With a shift table of l-mers, for the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol:
   A  1
   C  4 (lmin)
   G  2
   T  1

2 symbols:
   AA 1
   AT 1
   GT 1
   TA 2
   TG 2
   all other pairs (AC, AG, CA, CC, CG, …): 3 (LMIN-L+1)

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
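
Both shift tables can be reproduced with a small C sketch. It follows the Horspool-style construction visible in the tables (only the last lmin characters of each pattern are used, and the block that ends at the window end is skipped), which is an inference from the slide values rather than the full SHIFT/HASH machinery of the Wu-Manber report. The names `dna_code`, `block_index` and `build_shift`, and the restriction to the alphabet A, C, G, T, are assumptions of the sketch.

```c
#include <assert.h>
#include <string.h>

/* Map A, C, G, T to 0..3; other characters are not supported here. */
static int dna_code(char ch)
{
    switch (ch) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Index of the block of B characters starting at s, read as a base-4 number. */
static int block_index(const char *s, int B)
{
    int idx = 0, b, c;

    for (b = 0; b < B; b++) {
        c = dna_code(s[b]);
        assert(c >= 0);
        idx = idx * 4 + c;
    }
    return idx;
}

/* Fill shift[0 .. 4^B - 1] for k patterns and block size B; returns lmin. */
int build_shift(const char *pats[], int k, int B, int *shift)
{
    int lmin = (int)strlen(pats[0]), size = 1, i, j, idx, s;
    const char *w;

    for (i = 1; i < k; i++)
        if ((int)strlen(pats[i]) < lmin)
            lmin = (int)strlen(pats[i]);
    assert(B >= 1 && B <= lmin);

    for (i = 0; i < B; i++)
        size *= 4;
    for (i = 0; i < size; i++)
        shift[i] = lmin - B + 1;                 /* default: block not in any pattern */

    for (i = 0; i < k; i++) {
        w = pats[i] + strlen(pats[i]) - lmin;    /* last lmin characters */
        for (j = 0; j + B < lmin; j++) {         /* skip the block at the very end */
            idx = block_index(w + j, B);
            s = lmin - B - j;                    /* distance of the block from the end */
            if (s < shift[idx])
                shift[idx] = s;
        }
    }
    return lmin;
}
```

For the four patterns on the slide, B = 1 gives A 1, C 4, G 2, T 1 and B = 2 gives AA 1, AT 1, GT 1, TA 2, TG 2 with 3 for every other pair: the default shift gets shorter by one, but far fewer blocks have a short shift.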


Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG in the text: ACATGCTATGTGACATAATA

[Figure: trie of the inverted patterns and the search window moving over the text]

Shift table: AA 1, AT 1, GT 1, TA 2, TG 2

Experimental length: log_{|Σ|}( 2*lmin*r )

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Backward Oracle

• Set Backward Oracle: SBDM, SBOM
• Pages 68-72

String matching of many patterns

[Figure: maps of the fastest algorithm (Wu-Manber or SBOM) for 5, 10, 100 and 1000 patterns; horizontal axis Lmin (5 to 45), vertical axis | S| (2, 4, 8)]

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

[Figures: results for 5 strings, 10 strings, 100 strings and 1000 strings]

Factor Oracle

Factor Oracle: safe shift

Factor Oracle: shift to match prefix of P2?

Factor oracle

Construction of factor Oracle

Factor oracle

• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes In Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=61811641#
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps

So far

• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?


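Multiple Shift-AND

• P = {P1, P2, P3, P4}. Generalize Shift-AND
• Bits = P1 P2 P3 P4
• Start = bit mask with four 1 bits
• Match = bit mask with four 1 bits

[Figure: the four patterns concatenated into one bit vector, with the Start and Match masks drawn below it]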

