Text Algorithms
Jaak Vilo, 2016 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
30.11.17
Topics
• Exact matching of one pattern (string)
• Exact matching of multiple patterns
• Suffix trie and tree indexes
  – Applications
• Suffix arrays
• Inverted index
• Approximate matching
Algorithms
One-pattern
• Brute force
• Knuth-Morris-Pratt
• Karp-Rabin
• Shift-OR, Shift-AND
• Boyer-Moore
• Factor searches
• Regular expressions (?)
• Weight matrices (?)
Multi-pattern
• Aho-Corasick
• Commentz-Walter
Indexing
• Trie (and suffix trie)
• Suffix tree
Exact pattern matching
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ – alphabet, |Σ| = c
• Does S contain P?
  – Does S = S' P S'' for some strings S' and S''?
  – Usually m << n, and n can be (very) large
Find occurrences in text
[Figure: pattern P aligned under text S]
Animations
• http://www-igm.univ-mlv.fr/~lecroq/string/
• Exact String Matching Algorithms, animation in Java
• Christian Charras, Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, France
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net
Brute force: BAB in text?
Text: A B A C A B A B B A B B B A
Pattern: B A B
Brute force
[Figure: P aligned with S at position i; P[j] is compared with S[i+j-1]]
Identify the first mismatch!
Question:
• Problems of this method?
• Ideas to improve the search?
Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S

for (i = 1; i <= n-m+1; i++) {
    for (j = 1; j <= m; j++)
        if (S[i+j-1] != P[j]) break;
    if (j > m) print i;
}
attempt 1: gcatcgcagagagtatacagtacg
           GCAg....
attempt 2: gcatcgcagagagtatacagtacg
            g.......
attempt 3: gcatcgcagagagtatacagtacg
             g.......
attempt 4: gcatcgcagagagtatacagtacg
              g.......
attempt 5: gcatcgcagagagtatacagtacg
               g.......
attempt 6: gcatcgcagagagtatacagtacg
                GCAGAGAG
attempt 7: gcatcgcagagagtatacagtacg
                 g.......
Brute force or Naive Search
function NaiveSearch(string s[1..n], string sub[1..m])
    for i from 1 to n-m+1
        for j from 1 to m
            if s[i+j-1] ≠ sub[j]
                jump to next iteration of outer loop
        return i
    return not found
C code
#include <string.h>

/* Count occurrences of pat in text; n is the text length. */
int bf_2(char *pat, char *text, int n)
{
    int m, i, j;
    int count = 0;
    m = strlen(pat);
    for (i = 0; i + m <= n; i++) {
        /* advance j while the characters agree */
        for (j = 0; j < m && pat[j] == text[i+j]; j++)
            ;
        if (j == m)
            count++;
    }
    return count;
}
C code
/* Count occurrences of pat in a NUL-terminated text (needs <string.h>). */
int bf_1(char *pat, char *text)
{
    int m;
    int count = 0;
    char *tp;
    m = strlen(pat);
    tp = text;
    for (; *tp; tp++) {
        if (strncmp(pat, tp, m) == 0)
            count++;
    }
    return count;
}
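Main problem of Naive
• For the next possible location of P, the same positions of S are checked again
[Figure: P shifted by one position under S; the compared region overlaps the previous one]
Goals
• Make sure only a constant number of comparisons/operations is made for each position in S
  – Move (only) from left to right in S
  – How?
  – After a test S[i] ≠ P[j], what do we do now?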
Knuth-Morris-Pratt
• Make sure that no comparisons are "wasted"
• After such a mismatch we already know exactly the values of the green area in S!
• For P: the longest suffix of any prefix that is also a prefix of the pattern
• Example: ABCABD
[Figure: on a mismatch x ≠ y, the pattern is shifted so that one of its prefixes lines up with the already matched suffix]
D. Knuth, J. Morris, V. Pratt: Fast pattern matching in strings. SIAM Journal on Computing 6:323-350, 1977.
Automaton for ABCABD
[Figure: states 1..7 in a row with forward transitions A, B, C, A, B, D; state 1 loops on "NOT A"]
State:      1 2 3 4 5 6 7
Pattern:    A B C A B D
Fail links: 0 1 1 1 2 3 1
KMP matching
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if it exists)
i=1; j=1;
initfail(P) // Prepare fail links
repeat if j==0 or S[i] == P[j]
then i++ , j++ // advance in text and in pattern
else j = fail[j] // use fail link
until j>m or i>n
if j>m then report match at i-m
Initialization of fail links
Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P
i=1, j=0 , fail[1]= 0
repeat
if j==0 or P[i] == P[j]
then i++ , j++ , fail[i] = j
else j = fail[j]
until i>=m
Initialization of fail links: example
Pattern: A B C A B D
[Animation: i and j move along the pattern while fail[] is filled in]
Fail: 0
Fail: 0 1
Fail: 0 1 1 1
Fail: 0 1 1 1 2
Time complexity of KMP matching?
Analysis of time complexity
• At every cycle either i and j both increase by 1
• Or j decreases (j = fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
  – A: not more often than i increases
• Amortised analysis: O(n), preprocessing O(m)
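The KMP pseudocode above can be sketched in C roughly as follows. This is a minimal sketch, not the course's reference implementation: it uses 0-based indexing, so every fail value is one less than on the slides (-1 plays the role of 0), and the names kmp_init_fail and kmp_find are made up for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Fill fail[0..m] for pattern p of length m (0-based variant of KMP_Initfail).
   fail[i] is the pattern index to try next after a mismatch at index i;
   -1 means "advance in the text". */
void kmp_init_fail(const char *p, size_t m, long *fail)
{
    long i = 0, j = -1;
    fail[0] = -1;
    while (i < (long)m) {
        if (j == -1 || p[i] == p[j]) {
            i++; j++;
            fail[i] = j;
        } else {
            j = fail[j];
        }
    }
}

/* Return the 0-based position of the first occurrence of p in s,
   or -1 if there is none (or if memory allocation fails). */
long kmp_find(const char *p, const char *s)
{
    size_t m = strlen(p), n = strlen(s);
    long i = 0, j = 0, pos = -1;
    long *fail;

    if (m == 0) return 0;               /* empty pattern matches at 0 */
    fail = malloc((m + 1) * sizeof *fail);
    if (fail == NULL) return -1;
    kmp_init_fail(p, m, fail);

    while (i < (long)n) {
        if (j == -1 || s[i] == p[j]) {  /* advance in text and pattern */
            i++; j++;
            if (j == (long)m) { pos = i - j; break; }
        } else {
            j = fail[j];                /* use fail link; i never moves back */
        }
    }
    free(fail);
    return pos;
}
```

For the slide pattern ABCABD the table comes out as -1 0 0 0 1 2 0, which is the slide's 0 1 1 1 2 3 1 shifted by one. The text index i only ever increases, which is the property the amortised analysis relies on.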
Karp-Rabin
• Compare in O(1) a hash of P and a hash of S[i..i+m-1]
• Goal: O(n)
• The update h(T[i..i+m-1]) -> h(T[i+1..i+m]) must take O(1)
[Figure: h(P) of pattern positions 1..m is compared with h(T[i..i+m-1]); the window then slides to give h(T[i+1..i+m])]
R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.
Hash
• "Remove" the effect of T[i] and "introduce" the effect of T[i+m] in O(1)
• Use base-|Σ| arithmetic and treat characters as numbers
• In case of a hash match, check all m positions
• Hash collisions => worst case O(nm)
Let's use numbers
• T = 57125677
• P = 125 (and for simplicity, h(P) = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• H(T[3]) = (H(T[2]) - ord(T[2])*10^{m-1})*10 + ord(T[3+m-1]) = (712 - 7*100)*10 + 5 = 125
Hash
• c – size of the alphabet
• HS_i = H(S[i..i+m-1])
• H(S[i+1..i+m]) = (HS_i - ord(S[i])*c^{m-1})*c + ord(S[i+m])
• Modulo arithmetic, to fit the value in a word!
• hash(w[0..m-1]) = (w[0]*2^{m-1} + w[1]*2^{m-2} + ··· + w[m-1]*2^0) mod q
Karp-Rabin
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
c = 20                      /* size of the alphabet, say nr. of amino acids */
q = 33554393                /* q is a prime */
cm = c^{m-1} mod q
hp = 0 ; hs = 0
for i = 1 .. m do hp = ( hp*c + ord(P[i]) ) mod q    // H(P)
for i = 1 .. m do hs = ( hs*c + ord(S[i]) ) mod q    // H(S[1..m])
if hp == hs and P == S[1..m]
    report match at position 1
for i = 2 .. n-m+1
    hs = ( (hs - ord(S[i-1])*cm) * c + ord(S[i+m-1]) ) mod q    // keep the result non-negative
    if hp == hs and P == S[i..i+m-1]
        report match at position i
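As an illustration of the rolling hash, here is a minimal C sketch of the Karp-Rabin search. It makes a few assumptions that are not on the slides: the base is c = 256 (raw bytes) rather than 20, the function name kr_count is hypothetical, and q is added before subtracting so that the unsigned arithmetic never goes negative. The prime q is the one given above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count occurrences of p in s with a rolling hash (base 256, modulo q). */
size_t kr_count(const char *p, const char *s)
{
    const uint64_t c = 256, q = 33554393;   /* q: the prime from the slide */
    size_t m = strlen(p), n = strlen(s), count = 0, i, k;
    uint64_t cm = 1, hp = 0, hs = 0, out;

    if (m == 0 || m > n) return 0;

    for (k = 1; k < m; k++)                 /* cm = c^(m-1) mod q */
        cm = (cm * c) % q;
    for (k = 0; k < m; k++) {               /* H(P) and H(S[0..m-1]) */
        hp = (hp * c + (unsigned char)p[k]) % q;
        hs = (hs * c + (unsigned char)s[k]) % q;
    }
    for (i = 0; ; i++) {
        /* equal hashes are only a hint: verify all m characters */
        if (hp == hs && memcmp(p, s + i, m) == 0)
            count++;
        if (i + m >= n) break;
        /* remove the effect of s[i], introduce s[i+m], in O(1) */
        out = ((unsigned char)s[i] * cm) % q;
        hs = ((hs + q - out) * c + (unsigned char)s[i + m]) % q;
    }
    return count;
}
```

The memcmp call is the "check all m positions" step; with many hash collisions it is what drives the worst case to O(nm).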
More ways to ensure O(n)? Shift-AND / Shift-OR
• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10 [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243]
Bit-operations
• Maintain a set of all prefixes that have so far had a perfect match
• On the next character in the text, update all previous pointers to a new set
• Bit vector: one for every possible character
Matching in linear time (Shift-OR)
Pattern: ABCB
Text: ABCABCBEA
[Figure: the state bit vector is shifted and OR-ed with the mask bv[ T[j] ] of the current text character; in Shift-OR a 0 bit marks a prefix that still matches]
State: which prefixes match? (Shift-AND; Shift-OR)
Move to next (Shift-AND): shift by 1, introduce a 1, then bitwise AND with the mask Pattern[S[i]]
Track positions of prefix matches
0 1 0 1 0 1    current state
1 0 1 0 1 1    after shift left << (a 1 is introduced)
1 0 0 0 1 1    mask on char T[i]
1 0 0 0 1 1    result of bitwise AND
Vectors for every char in Σ
• P = aste
      a s t e b c d .. z
  a   1 0 0 0 0 ...
  s   0 1 0 0 0 ...
  t   0 0 1 0 0 ...
  e   0 0 0 1 0 ...
• T = lasteaed (one column per text character read, one row per pattern position)
      l a s t e a e d
  a   0 1 0 0 0 1
  s   0 0 1 0 0 0
  t   0 0 0 1 0 0
  e   0 0 0 0 1 0
• The 1 in the last row, after reading l a s t e, reports an occurrence of P ending at that position
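The bit-vector update can be sketched in C as below. This is an illustrative sketch under stated assumptions: the pattern must fit into one 64-bit word (length 1..64), bit j of the state stands for "the prefix of length j+1 matches", and the name shift_and_find is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift-AND: return the 0-based start of the first occurrence of p in t,
   or -1 if there is none or the pattern length is not in 1..64. */
long shift_and_find(const char *p, const char *t)
{
    size_t m = strlen(p), i, j;
    uint64_t mask[256] = {0};   /* one bit vector for every possible character */
    uint64_t state = 0, accept;

    if (m == 0 || m > 64) return -1;

    for (j = 0; j < m; j++)     /* bit j of mask[ch] is set iff p[j] == ch */
        mask[(unsigned char)p[j]] |= (uint64_t)1 << j;
    accept = (uint64_t)1 << (m - 1);

    for (i = 0; t[i] != '\0'; i++) {
        /* shift, introduce a 1, bitwise AND with the mask of t[i] */
        state = ((state << 1) | 1) & mask[(unsigned char)t[i]];
        if (state & accept)     /* the whole pattern is a matching prefix */
            return (long)(i + 1 - m);
    }
    return -1;
}
```

With P = aste and T = lasteaed the state after each character follows the columns of the table above, and the accept bit is set after the fifth character, giving start position 1 (0-based).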
http://www-igm.univ-mlv.fr/~lecroq/string/node6.html
[A]11010101
Summary
Algorithm            Worst case   Ave. case             Preprocess
Brute force          O(mn)        O(n*(1+1/|Σ|+..))
Knuth-Morris-Pratt   O(n)         O(n)                  O(m)
Rabin-Karp           O(mn)        O(n)                  O(m)
Boyer-Moore          O(n/m)?
BM Horspool
Factor search
Shift-OR             O(n)         O(n)                  O(m|Σ|)
• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
Find occurrences in text
• Have we missed anything?
Find occurrences in text
• What have we learned if we test for a potential match from the end?
[Figure: P aligned under S and compared from the right end; example string ABCDEBBCDE]
Bad character heuristics: maximal shift on S[i]
[Figure: the text character S[i] = X is lined up with the first X in the pattern, counted from the end]
delta1( S[i] ) =
  – m, if the pattern does not contain S[i]
  – patlen - j, for the maximal j such that P[j] == S[i]
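/* Record the rightmost position of every character in the pattern.
   Uses the globals p (pattern), m (its length), occ[] and alphabetsize. */
void bmInitocc() {
    int a, j;                       /* int, not char: a char counter can overflow before reaching alphabetsize */
    for (a = 0; a < alphabetsize; a++)
        occ[a] = -1;                /* -1: character does not occur in the pattern */
    for (j = 0; j < m; j++) {
        a = (unsigned char)p[j];    /* avoid a negative array index */
        occ[a] = j;
    }
}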
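Good suffix heuristics
[Figure: the matched suffix µ of P under S; P is shifted so that (1.) another occurrence of µ in P, or (2.) a prefix µ' of P that is a suffix of µ, lines up with the matched region]
delta2( j ) – minimal shift so that the matched region is fully covered, or so that a suffix of the match is also a prefix of P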
Boyer-Moore algorithm
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
preprocess_BM() // delta1 and delta2
i=m
while i <= n
for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ;
if j==0 report match at position i-m+1
i = i+ max( delta1[ S[i] ], delta2[ j ] )
• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/
Simplifications of BM
• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear
• The algorithm can be made faster while the code becomes simpler.
• It is useful to use the last character heuristics (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).
Algorithm BMH (Boyer-Moore-Horspool)
• R. N. Horspool: Practical Fast Searching in Strings. Software - Practice and Experience, 10(6):501-506, 1980
Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S
for a in Σ do delta[a] = m
for j = 1..m-1 do delta[P[j]] = m-j
i = m
while i <= n
    if S[i] == P[m]
        j = m-1
        while ( j > 0 and P[j] == S[i-m+j] ) j = j-1
        if j == 0 report match at i-m+1
    i = i + delta[ S[i] ]
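A C sketch of the Horspool search follows, assuming 0-based indexing and a 256-entry byte alphabet; the function name bmh_count is hypothetical. The window is compared from right to left and the shift is taken from the text character under the last pattern position, as in the pseudocode.

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: count occurrences of p in s. */
size_t bmh_count(const char *p, const char *s)
{
    size_t m = strlen(p), n = strlen(s), count = 0;
    size_t delta[256];
    size_t a, i, j;

    if (m == 0 || m > n) return 0;

    for (a = 0; a < 256; a++)           /* default shift: the whole pattern */
        delta[a] = m;
    for (j = 0; j + 1 < m; j++)         /* all pattern positions except the last */
        delta[(unsigned char)p[j]] = m - 1 - j;

    i = m - 1;                          /* text index under the last pattern char */
    while (i < n) {
        j = m;                          /* compare right to left */
        while (j > 0 && p[j - 1] == s[i + j - m])
            j--;
        if (j == 0)
            count++;                    /* occurrence starts at i - m + 1 */
        i += delta[(unsigned char)s[i]];    /* every shift is at least 1 */
    }
    return count;
}
```

Because the last pattern character is left out when delta is filled, each shift is at least 1, so the loop always terminates; overlapping occurrences (for example aa in aaaa) are still found.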
String matching: Horspool algorithm
• How is the comparison made? From right to left: suffix search
• Which is the next position of the window?
[Figure: text window with the pattern aligned below it; the last letter of the window is 'a']
The shift depends on where the last letter of the text window, say 'a', appears in the pattern:
[Figure: the pattern is shifted until its rightmost 'a' (other than the last character) lines up with the 'a' in the text]
A preprocessing step is therefore needed to determine the length of the shift.
Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)
• Use delta in a tight loop
• If match (delta == 0) then check and apply the original delta d
Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j = 1..m-1 do delta[P[j]] = m-j
3. d = delta[ P[m] ]         // memorize d on P[m]
4. delta[ P[m] ] = 0         // ensure delta on match of last char is 0
5. for ( i = m ; i <= n ; i = i+d )
6.     repeat                // skip loop
7.         t = delta[ S[i] ] ; i = i + t
8.     until t == 0
9.     for ( j = m-1 ; j > 0 and P[j] == S[i-m+j] ; j = j-1 ) ;
10.    if j == 0 and i <= n report match at i-m+1
BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P (in order for the algorithm to finish correctly: the skip loop needs at least one occurrence!).
• Daniel M. Sunday: A very fast substring search algorithm [PDF]. Communications of the ACM, August 1990, Volume 33, Issue 8
• Loop unrolling:
• Avoid too many loops (each loop requires tests) by just repeating code within the loop.
• Line 7 in the previous algorithm can be replaced by:
7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]);
Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm
• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)
Factor based approach
• Optimal average-case algorithms
  – Assuming independent characters, same probability
• Factor – a substring of a pattern
  – Any substring
  – (how many?)
Factor searches
Do not compare characters, but find the longest match to any subregion of the pattern.
[Figure: pattern P under text S; u is the factor of P matched so far, X the text character in front of it]
Examples
• Backward DAWG Matching (BDM) – Crochemore et al. 1994
• Backward Nondeterministic DAWG Matching (BNDM) – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM) – Allauzen, Crochemore, Raffinot 2001
Backward DAWG Matching (BDM)
Do not compare characters, but find the longest match to any subregion of the pattern.
Suffix automaton recognises all factors (and suffixes) in O(n)
BNDM – simulate using bit parallelism
Bits show where the factors have occurred so far
BNDM matches an NDA (nondeterministic automaton)
[Figure: NDA on the suffixes of 'announce']
Deterministic version of the same: Backward Factor Oracle
BNDM – Backward Non-Deterministic DAWG Matching
BOM – Backward Oracle Matching
String matching of one pattern
CTACTACTACGTCTATACTGATCGTAGCTACTACGGTATGACTAA
[Figure: three ways to scan the search window, numbered 1. to 3. in the figure]
• Factor search
• Prefix search
• Suffix search
Multiple patterns
[Figure: a set of patterns {P} searched in text S]
Why?
• Multiple patterns
• Highlight multiple different search words on the page
• Virus detection – filter for virus signatures
• Spam filters
• Scanner in a compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to the genome!
• …
Algorithms
• Aho-Corasick (search for multiple words)
  – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
  – Generalization of Boyer-Moore & AC
• Wu and Manber
  – Improvement over C-W
• Additional methods, tricks and techniques
Aho-Corasick (AC)
• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975)
• ACM: DOI, PDF
• ABSTRACT: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
References:
• Generalization of KMP for many patterns
• Text S like before.
• Set of patterns P = {P1, .., Pk}
• Total length |P| = m = Σ_{i=1..k} m_i
• Problem: find all occurrences of any of the Pi ∈ P in S
Idea
1. Create an automaton from all patterns
2. Match the automaton
• Use the PATRICIA trie for creating the main structure of the automaton
PATRICIA trie
• D. R. Morrison, "PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.
• ACM: DOI, PDF
• Word trie – a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced "tree".
• To test for a word p, only O(|p|) time is used, no matter how many words are in the dictionary...
Trie for P = {he, she, his, hers}
[Figure: the trie is built word by word. States are numbered 0..9, with 0 the root:
  0 -h-> 1 -e-> 2 -r-> 8 -s-> 9
  1 -i-> 6 -s-> 7
  0 -s-> 3 -h-> 4 -e-> 5]
How to search for words like he, sheila, hi? Do these occur in the trie?
Aho-Corasick
1. Create an automaton MP for a set of strings P.
2. Finite state machine: read a character from the text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] – read a character c from the text and go from state j to state goto[j,c].
4. If there is no goto[j,c] link on character c from state j, use fail[j].
5. Report the output. Report all words that have been found in state j.
AC automaton (vs KMP)
[Figure: the trie for {he, she, his, hers} with states 0..9; state 0 loops on NOT { h, s }]
goto[1,i] = 6
fail[7] = 3, fail[8] = 0, fail[5] = 2
Output table
state   output[j]
2       he
5       she, he
7       his
9       hers
AC matching
Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last position)
state = 0
for i = 1..n do
    while (goto[state, S[i]] == ∅) and (fail[state] != state) do
        state = fail[state]
    state = goto[state, S[i]]
    if (output[state] not empty)
        then report matches output[state] at position i
Algorithm Aho-Corasick preprocessing I (TRIE)
Input: P = {P1, ..., Pk}
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.

procedure enter(a1, ..., am)    /* Pi = a1, ..., am */
begin
    s = 0; j = 1;
    while j <= m and goto[s,aj] ≠ ∅ do    // follow existing path
        s = goto[s,aj];
        j = j+1;
    for p = j to m do                     // add new path (states)
        news = news+1;
        goto[s,ap] = news;
        s = news;
    output[s] = {a1, ..., am}
end

begin
    news = 0
    for i = 1 to k do enter( Pi )
    for a ∈ Σ do
        if goto[0,a] = ∅ then goto[0,a] = 0;
end
Preprocessing II for AC (FAIL)
queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue( queue, goto[0,a] )
        fail[ goto[0,a] ] = 0
while queue ≠ ∅
    r = take( queue )
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[r,a]
            enqueue( queue, s )            // breadth first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[ fail[s] ]
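The two preprocessing phases and the matching loop can be put together in a compact C sketch. Everything about its shape is an assumption made for illustration: the alphabet is restricted to lowercase a..z, the tables are fixed-size static arrays (AC_MAXS states, at most 32 patterns so that output[] fits in a bit mask), and the names ac_build and ac_search are invented here.

```c
#include <string.h>

#define AC_MAXS  256   /* hypothetical limit on the number of states */
#define AC_SIGMA 26    /* assumed alphabet: lowercase a..z */

static int ac_goto[AC_MAXS][AC_SIGMA];  /* -1 stands for the empty set */
static int ac_fail[AC_MAXS];
static unsigned ac_out[AC_MAXS];        /* bit p set: pattern p ends in this state */
static int ac_states;

/* Build goto, fail and output for k <= 32 lowercase patterns. */
void ac_build(const char *pats[], int k)
{
    int queue[AC_MAXS], head = 0, tail = 0, p, a;

    memset(ac_goto, -1, sizeof ac_goto);
    memset(ac_fail, 0, sizeof ac_fail);
    memset(ac_out, 0, sizeof ac_out);
    ac_states = 1;                       /* state 0 is the root */

    for (p = 0; p < k; p++) {            /* phase I: enter() for every pattern */
        const char *c;
        int s = 0;
        for (c = pats[p]; *c; c++) {
            a = *c - 'a';
            if (ac_goto[s][a] < 0)
                ac_goto[s][a] = ac_states++;
            s = ac_goto[s][a];
        }
        ac_out[s] |= 1u << p;
    }
    for (a = 0; a < AC_SIGMA; a++) {     /* root: loop on missing characters */
        if (ac_goto[0][a] < 0)
            ac_goto[0][a] = 0;
        else
            queue[tail++] = ac_goto[0][a];   /* depth-1 states, fail = 0 */
    }
    while (head < tail) {                /* phase II: fail links by BFS */
        int r = queue[head++];
        for (a = 0; a < AC_SIGMA; a++) {
            int s = ac_goto[r][a], st;
            if (s < 0) continue;
            queue[tail++] = s;
            st = ac_fail[r];
            while (ac_goto[st][a] < 0)
                st = ac_fail[st];
            ac_fail[s] = ac_goto[st][a];
            ac_out[s] |= ac_out[ac_fail[s]];
        }
    }
}

/* Scan text; count[p] receives the number of occurrences of pattern p.
   Returns the total number of reported occurrences. */
int ac_search(const char *text, int k, int count[])
{
    const char *c;
    int state = 0, total = 0, p;

    for (p = 0; p < k; p++) count[p] = 0;
    for (c = text; *c; c++) {
        int a = *c - 'a';
        if (a < 0 || a >= AC_SIGMA) { state = 0; continue; }  /* outside the alphabet */
        while (ac_goto[state][a] < 0)
            state = ac_fail[state];
        state = ac_goto[state][a];
        for (p = 0; p < k; p++)
            if (ac_out[state] & (1u << p)) { count[p]++; total++; }
    }
    return total;
}
```

Entering the patterns in the order he, she, his, hers yields the same state numbers as the trie for P = {he, she, his, hers}, so the fail links can be compared directly with fail[7] = 3, fail[8] = 0 and fail[5] = 2. The full-table goto used here is the |MP|*|Σ| representation; lists or trees would trade that space for search time.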
Correctness
• Let string t "point" from the initial state to state j.
• Must show that fail[j] points to the state of the longest proper suffix of t that is also a prefix of some word in P.
• Look at the article...
AC matching time complexity
• Theorem: For matching MP on text S, |S| = n, fewer than 2n transitions within M are made.
• Proof: Compare to KMP.
• There are at most n goto steps.
• There cannot be more than n fail steps.
• In total there can be fewer than 2n transitions in M.
Individual node (goto)
• Full table
• List
• Binary search tree (?)
• Some other index?
AC thoughts
• Scales for many strings simultaneously.
• For very many patterns – search time (of grep) improves (??)
  – See Wu-Manber article
• When k grows, then more fail[] transitions are made (why?)
• But always less than n.
• If all goto[j,a] are indexed in an array, then the size is |MP|*|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing transition functions.
• Then, O(n log(min(k,c))).
Advanced AC
• Precalculate the next state transition correctly for every possible character in the alphabet
• Can be good for short patterns
Problems of AC?
• Need to rebuild on adding/removing patterns
• Details of branching on each node (?)
Commentz-Walter
• Generalization of Boyer-Moore for multiple sequence search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.fh-albsig.de/win/personen/professoren.php?RID=36
• You can download here my algorithm String Matching Fast On The Average (PDF, ~17.2 MB) or here String Matching Fast On The Average (extended abstract) (PDF, ~3 MB)
C-W description
• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case.
Commentz-Walter [CW79]
• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.
Idea of C-W
• Build a backward trie of all keywords
• Match from the end until mismatch...
• Determine the shift based on the combination of heuristics
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns
[Figure: trie of the reversed patterns]
2. lmin = 4
3. Table of shifts
   A  1
   C  4 (lmin)
   G  2
   T  1
4. Start the search
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Horspool for many patterns: search in the text ACATGCTATGTGACA…
[Animation: the window is moved along the text using the shift table A 1, C 4 (lmin), G 2, T 1]
Short shifts!
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
What are the possible limitations for C-W?
• Many patterns, small alphabet – minimal skips
• What can be done differently?
Wu-Manber
• Wu S., and U. Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.
Key idea
• The main problem with Boyer-Moore and many patterns is that the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at characters from the text one by one, we consider them in blocks of size B.
• A good value of B is of the order of log_c(2M), where M is the total size of all patterns and c the size of the alphabet; in practice, we use either B = 2 or B = 3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.
Horspool to Wu-Manber
How can we increase the length of the shifts?
With a shift table of l-mers for the patterns ATGTATG, TATG, ATAAT, ATGTG

1 symbol:
   A  1
   C  4 (lmin)
   G  2
   T  1

2 symbols:
   AA 1
   AC 3 (LMIN-L+1)
   AG 3
   AT 1
   CA 3
   CC 3
   CG 3
   …
   Entries that differ from the default 3: AA 1, AT 1, GT 1, TA 2, TG 2
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
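The 2-symbol shift table can be reproduced with a short C sketch. It rests on an assumption inferred from the table values rather than stated on the slides: each pattern contributes only its last lmin characters, and the block that ends exactly at the window end is left out (Horspool style), so every shift is at least 1. The names wm_block and wm_build_shift are hypothetical, and blocks are indexed by two raw bytes, hence the 65536-entry table.

```c
#include <stddef.h>
#include <string.h>

#define WM_B 2   /* block size B */

/* Index of the 2-character block starting at s. */
static size_t wm_block(const char *s)
{
    return ((size_t)(unsigned char)s[0] << 8) | (unsigned char)s[1];
}

/* Fill shift[0..65535] for k >= 1 patterns, each of length >= WM_B. */
void wm_build_shift(const char *pats[], int k, size_t *shift)
{
    size_t lmin = strlen(pats[0]), b, q;
    int p;

    for (p = 1; p < k; p++) {                 /* lmin: length of the shortest pattern */
        size_t l = strlen(pats[p]);
        if (l < lmin) lmin = l;
    }
    for (b = 0; b < 65536; b++)               /* block occurs in no pattern window */
        shift[b] = lmin - WM_B + 1;

    for (p = 0; p < k; p++) {
        const char *w = pats[p] + strlen(pats[p]) - lmin;   /* last lmin characters */
        for (q = WM_B; q < lmin; q++) {       /* block ends at 1-based position q */
            size_t idx = wm_block(w + q - WM_B);
            size_t d = lmin - q;              /* distance from block end to window end */
            if (d < shift[idx]) shift[idx] = d;
        }
    }
}
```

For the patterns ATGTATG, TATG, ATAAT, ATGTG this gives AA 1, AT 1, GT 1, TA 2, TG 2 and the default 3 for every other block, matching the table. A caller would typically keep the table in static or heap storage, since 65536 entries are large for a stack frame.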
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACATAATA
[Figure: trie of the reversed patterns and the 2-symbol shift table AA 1, AT 1, GT 1, TA 2, TG 2]
Experimental length: log_{|Σ|}(2*lmin*r)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Backward Oracle
• Set Backward Oracle: SBDM, SBOM
• Pages 68-72
String matching of many patterns
[Figure: panels for 5, 10, 100 and 1000 patterns. Horizontal axis: Lmin (5 to 45); vertical axis: |Σ| (2, 4, 8). The panels mark regions labelled Wu-Manber and SBOM.]
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
[Figures: results for 5 strings, 10 strings, 100 strings, 1000 strings]
Factor Oracle
Factor Oracle: safe shift
Factor Oracle: shift to match prefix of P2?
Factor oracle
Construction of factor oracle
Factor oracle
• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes in Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=61811641#
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps
So far
• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?
Multiple Shift-AND
• P = {P1, P2, P3, P4}. Generalize Shift-AND
• Bits = P1 P2 P3 P4
• Start = 1 1 1 1 (one bit per pattern)
• Match = 1 1 1 1 (one bit per pattern)