Intrusion Detection and Malware Analysis
String matching algorithms
Pavel Laskov
Wilhelm Schickard Institute for Computer Science
Exact string matching
Problem (Exact string matching)
Given a string P called a pattern and a longer string T called text, find all occurrences, if any, of pattern P in text T.
Example
Let P = aba and T = bbabaxababay; then P occurs in T starting at positions 3, 7 and 9 (note that occurrences of P in T may overlap).
Remark (Substrings vs. subsequences)
By convention, contiguous patterns are called strings (substrings), whereas patterns that keep the left-to-right order but need not be contiguous are called sequences (subsequences).
Inexact matching and alignment
Problem (Inexact string matching)
Given a string P and a text T, find all substrings S of T that contain at most k “errors” with respect to P.
Example
Let P = aba and T = bbabaxababay; then P matches T with at most one character substitution at positions 1, 3, 5, 7 and 9.
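The example above can be checked with a minimal sketch that counts substitutions at every alignment. It assumes that "errors" means character substitutions only (no insertions or deletions); the function name `hamming_matches` is illustrative and not part of the lecture. Positions are reported 1-based to match the slides.

```python
def hamming_matches(pattern, text, k):
    """Return 1-based start positions where pattern matches text
    with at most k character substitutions."""
    n, m = len(pattern), len(text)
    positions = []
    for start in range(m - n + 1):
        mismatches = 0
        for j in range(n):
            if pattern[j] != text[start + j]:
                mismatches += 1
                if mismatches > k:
                    break  # too many errors at this alignment
        if mismatches <= k:
            positions.append(start + 1)
    return positions


print(hamming_matches("aba", "bbabaxababay", 1))  # [1, 3, 5, 7, 9]
print(hamming_matches("aba", "bbabaxababay", 0))  # [3, 7, 9]
```

With k = 0 the sketch degenerates to exact matching and returns the positions of the exact-matching example.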
Problem (Sequence alignment)
An alignment of two strings S1 and S2 is obtained by inserting spaces either into or at the ends of S1 and S2 so that both strings have equal length and every character or space of one string is opposite exactly one character or space of the other.
Example
q a c _ d b d
q a w x _ b _
A whirlwind tour of string matching
Today:
- Naive string matching
- Fundamental preprocessing
- Knuth-Morris-Pratt
- Set matching
Other problems/algorithms:
- Regular expression matching
- Rabin-Karp fingerprints
- Suffix trees/arrays
- Inexact matching
- Sequence alignment
Naive string matching
- Align the left end of P with the left end of T and compare characters of P and T until a mismatch is found or P is exhausted.
- Shift P by one character and restart from the left of P.
- Continue until the right end of P shifts past the right end of T.
- Running time: O(mn), with n = |P| and m = |T|.
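The steps above can be sketched directly in Python. This is a minimal illustration, not an optimized implementation; the name `naive_search` is illustrative, and positions are reported 1-based to match the slides.

```python
def naive_search(pattern, text):
    """Return 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    occurrences = []
    for shift in range(m - n + 1):          # slide P one character at a time
        j = 0
        while j < n and text[shift + j] == pattern[j]:
            j += 1                          # compare left to right
        if j == n:                          # P exhausted: occurrence found
            occurrences.append(shift + 1)
    return occurrences


print(naive_search("aba", "bbabaxababay"))        # [3, 7, 9]
print(naive_search("abxyabxz", "xabxyabxyabxz"))  # [6]
```

Each of the at most m alignments costs at most n comparisons, which gives the O(mn) bound.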
Speeding up the naive method
(* marks a mismatch, ^ marks a matching comparison)
Naive algorithm:
T: xabxyabxyabxz
P: abxyabxz
   *
    abxyabxz
    ^^^^^^^*
     abxyabxz
     *
      abxyabxz
      *
       abxyabxz
       *
        abxyabxz
        ^^^^^^^^
Larger shifts:
T: xabxyabxyabxz
P: abxyabxz
   *
    abxyabxz
    ^^^^^^^*
        abxyabxz
        ^^^^^^^^
Saving partial matches:
T: xabxyabxyabxz
P: abxyabxz
   *
    abxyabxz
    ^^^^^^^*
        abxyabxz
           ^^^^^
Preprocessing
- General idea: spend a modest amount of time learning about the internal structure of P or T in order to save time during the search.
- Different preprocessing techniques were employed in the various original string matching algorithms.
- The similarity of these techniques can be expressed in terms of a fundamental preprocessing that is independent of any particular search algorithm.
Fundamental preprocessing
Definition
Given a string S and a position i > 1, let Z_i(S) be the length of the longest substring of S that starts at i and matches a prefix of S. This substring is called a Z-box.
[Figure: Z-boxes in S]
Definition
For any index i, r_i is the right-most end of the Z-boxes beginning at or before i; l_i is the left end of the corresponding Z-box.
Linear time fundamental preprocessing
- Goal: compute Z_i for each successive position i, starting from i = 2.
- All Z-values are kept, together with the most recent pair r, l.
- Initialization: compute Z_2 explicitly by comparing S[2..|S|] with S[1..|S|] character by character.
- Recursion: given Z_2, ..., Z_{k−1}, r_{k−1} and l_{k−1}, compute Z_k, r_k and l_k.
Recursive step of the Z-algorithm
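Key insight: if k lies inside the current Z-box (k ≤ r_{k−1}), the character at position k also appears at position k′ = k − l_{k−1} + 1; the same holds for the entire substrings β = S[k..r_{k−1}] and S[k′..Z_{l_{k−1}}]. If k > r_{k−1}, Z_k is computed by explicit comparison with the prefix of S.
[Figure: S with the prefix α = S[1..Z_{l_{k−1}}] and its copy S[l_{k−1}..r_{k−1}]; β is the tail of each copy, starting at k′ and at k.]
If Z_{k′} < |β|, then Z_k = Z_{k′}, and r, l remain unchanged.
[Figure: a prefix γ of S, shorter than β, occurs at k′ and therefore also at k.]
If Z_{k′} ≥ |β|, then the entire β is a prefix of S. Keep scanning from position r_{k−1} + 1 until a mismatch occurs, then set l to k and r to the position of the character before the mismatch.
[Figure: β occurs at the start of S, at k′ and at k; the character following position r_{k−1} has not been examined yet.]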
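The recursive step can be sketched as follows. This is a minimal illustration under stated assumptions: it uses 0-based Python indices (so `z[i]` corresponds to Z_{i+1} on the slides), keeps the current Z-box as the half-open interval `s[l:r]`, and the name `z_values` is illustrative.

```python
def z_values(s):
    """Return z, where z[i] is the length of the longest substring of s
    starting at 0-based index i that matches a prefix of s (z[0] is left 0)."""
    n = len(s)
    z = [0] * n
    l = r = 0                       # current Z-box is s[l:r]; empty while r == 0
    for k in range(1, n):
        if k < r:                   # k lies inside the current Z-box
            k_prime = k - l         # mirror position inside the prefix
            beta = r - k            # length of the remainder of the box
            if z[k_prime] < beta:   # first case: copy, r and l unchanged
                z[k] = z[k_prime]
                continue
            z[k] = beta             # second case: beta is a prefix, keep scanning
        # Explicit comparison, starting after the part already known to match.
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:            # a Z-box reaching further right was found
            l, r = k, k + z[k]
    return z


print(z_values("aabcaabxaaz"))  # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

The copy case does constant work, and the scanning loop only examines characters at or beyond `r`, which is what makes the overall running time linear.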
Running time of the Z-algorithm
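Theorem
All values Z_i(S) can be computed in O(|S|) time.
Proof.
- Each iteration that copies Z_k = Z_{k′} without scanning needs only constant time.
- Each iteration that scans ends at its first mismatch, and every matching comparison moves r one position to the right. Since r never decreases and never passes the end of S, there are at most |S| matches and at most |S| mismatches in total.
Hence the overall amount of work is bounded by O(|S|).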
Knuth-Morris-Pratt algorithm
General idea:
- Keep scanning the pattern P and the text T left to right until a mismatch is found.
- Shift P to the right so that a prefix of P lines up with the matching suffix of the part of P that precedes the mismatch position.
[Figure: P aligned against T before the shift and after the shift.]
- Continue scanning from the mismatching position.
Shifts in KMP
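- Let the shift pointer sp_i be the length of the longest proper suffix of P[1..i] that is also a prefix of P (0 if no such suffix exists).
- In KMP, if a mismatch is found at position i + 1 of P, P can be shifted by i − sp_i positions to the right.
- Link to fundamental preprocessing: let sp′_i be the length of the longest Z-box of P that ends exactly at position i (0 if there is none). After initializing all sp′ to 0, loop backwards over j = n, ..., 2 and set sp′_{j+Z_j−1} = Z_j for every j with Z_j > 0.
- The values sp_i then follow from sp_n = sp′_n and sp_i = max(sp_{i+1} − 1, sp′_i) for i = n − 1, ..., 2.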
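The backward loop over the Z-values can be sketched as below. To keep the block short and self-contained, it uses a quadratic helper `naive_z` instead of the linear-time Z-algorithm; both function names are illustrative. The returned lists are indexed by the 1-based position i used on the slides, with index 0 holding 0.

```python
def naive_z(s):
    """Quadratic-time Z-values (0-based); a stand-in for the linear algorithm."""
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def shift_pointers(pattern):
    """Return (sp_strong, sp), both indexed by 1-based position i."""
    n = len(pattern)
    z = naive_z(pattern)
    sp_strong = [0] * (n + 1)
    # Loop backwards over j (1-based); smaller j is written last, so the
    # longest Z-box ending at a given position wins.
    for j in range(n, 1, -1):
        zj = z[j - 1]
        if zj > 0:
            sp_strong[j + zj - 1] = zj
    # Derive sp_i = max(sp_{i+1} - 1, sp'_i), starting from sp_n = sp'_n.
    sp = sp_strong[:]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_strong[i])
    return sp_strong, sp


sp_strong, sp = shift_pointers("abxyabxz")
print(sp_strong)  # [0, 0, 0, 0, 0, 0, 0, 3, 0]
print(sp)         # [0, 0, 0, 0, 0, 1, 2, 3, 0]
```

For P = abxyabxz a mismatch at position 8 gives i = 7 and a shift of 7 − 3 = 4, the same shift as in the "Larger shifts" example.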
Correctness of the shift rule
Theorem
For any alignment of P and T, if characters 1 through i of P match their counterparts in T but character i + 1 mismatches T(k), then P can be shifted by i − sp_i positions to the right without passing any occurrence of P in T.
Proof.
Let β denote the prefix of P of length sp_i. By the definition of sp_i, β matches its counterpart in T after the shift. Suppose there exists a missed occurrence of P in T that begins earlier than the shifted P, i.e. it overlaps the matched part of T in a string αβ with α nonempty. Then αβ is a proper suffix of P[1..i] that is also a prefix of P. However, the longest such suffix has length |β| = sp_i < |αβ|, which is a contradiction.
Complete KMP pseudocode
procedure KNUTH-MORRIS-PRATT(T, P)            ▷ |P| = n; |T| = m
    Preprocess P to find F(k) = sp_{k−1} + 1 for all k from 1 to n + 1 (taking sp_0 = 0).
    c ← 1                                     ▷ Text running index
    p ← 1                                     ▷ Pattern running index
    while c + (n − p) ≤ m do
        while p ≤ n and P(p) = T(c) do        ▷ Scan until mismatch
            p ← p + 1
            c ← c + 1
        end while
        if p = n + 1 then                     ▷ End of P
            Report an occurrence of P at position c − n in T
        end if
        if p = 1 then                         ▷ Mismatch at position 1 of P
            c ← c + 1
        end if
        p ← F(p)                              ▷ Shift P by (p − 1) − sp_{p−1}
    end while
end procedure
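The pseudocode translates almost line by line into Python. The sketch below keeps the 1-based running indices c and p of the slides and subtracts 1 only when indexing the Python strings. As an assumption made for brevity, the shift pointers are computed with the usual prefix-function loop rather than from Z-values; the names `prefix_lengths` and `kmp_search` are illustrative.

```python
def prefix_lengths(pattern):
    """sp[i] for i = 0..n: length of the longest proper suffix of the first
    i characters of pattern that is also a prefix of pattern (sp[0] = 0)."""
    n = len(pattern)
    sp = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and pattern[k] != pattern[i - 1]:
            k = sp[k]                       # fall back to a shorter border
        if pattern[k] == pattern[i - 1]:
            k += 1
        sp[i] = k
    return sp


def kmp_search(text, pattern):
    """Return 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    sp = prefix_lengths(pattern)
    occurrences = []
    c = 1                                   # text running index (1-based)
    p = 1                                   # pattern running index (1-based)
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1                          # scan until mismatch
            c += 1
        if p == n + 1:                      # end of P reached
            occurrences.append(c - n)
        if p == 1:                          # mismatch at position 1 of P
            c += 1
        p = sp[p - 1] + 1                   # F(p): shift P, keep the overlap
    return occurrences


print(kmp_search("xabxyabxyabxz", "abxyabxz"))  # [6]
print(kmp_search("bbabaxababay", "aba"))        # [3, 7, 9]
```

The text index c never moves backwards, which is the reason the search phase runs in time linear in |T|.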
Exact pattern set matching
Problem
Given a set of patterns P = {P_1, ..., P_z}, find all occurrences of any pattern from P in text T.
Possible solutions:
- Run a standard single-pattern matching algorithm (e.g. KMP) z times: O(z(m + n)).
- Build a suffix tree for T and scan each pattern in P against it: O(m + zn).
- Build a keyword tree for P and run the Aho-Corasick algorithm: O(m + n + k), where k is the number of matches.
Keyword tree (trie)
A keyword tree K corresponding to a set of patterns P is a treesatisfying the following conditions:
- Each edge is labeled with exactly one character.
- Any two edges out of the same node have different labels.
- Every pattern in P maps to some node v in K such that this pattern is spelled out by the edge labels on the path from the root to v.
- Every leaf in K corresponds to some pattern in P.
Keyword tree example
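Keyword tree for P = {“error”, “potato”, “pottery”, “other”, “otter”}; a number in brackets marks the node at which the pattern with that index ends:
root
|- p - o - t - a - t - o  [2]
|          \- t - e - r - y  [3]
|- o - t - h - e - r  [4]
|      \- t - e - r  [5]
\- e - r - r - o - r  [1]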
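A keyword tree for this pattern set can be built with a few lines of Python. The sketch below is one possible representation, assumed here for illustration: nodes are integers, `children[v]` maps a character to a child node, and `pattern_id[v]` holds the 1-based pattern number for marked nodes. The name `build_keyword_tree` is illustrative.

```python
def build_keyword_tree(patterns):
    """Return (children, pattern_id) for the keyword tree of patterns.
    Node 0 is the root."""
    children = [{}]          # children[v]: character -> child node
    pattern_id = [None]      # pattern_id[v]: number of the pattern ending at v
    for number, pattern in enumerate(patterns, start=1):
        node = 0
        for ch in pattern:
            if ch not in children[node]:    # create the missing edge
                children.append({})
                pattern_id.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        pattern_id[node] = number           # mark the end of the pattern
    return children, pattern_id


patterns = ["error", "potato", "pottery", "other", "otter"]
children, pattern_id = build_keyword_tree(patterns)
print(len(children))         # 24 nodes including the root
print(sorted(children[0]))   # ['e', 'o', 'p']
```

Storing edges in a dictionary per node guarantees that two edges out of the same node never carry the same label.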
Naive set matching using keyword tree
- Follow the unique path in K that matches a substring of T starting from a fixed position l until either a marked node or a mismatch is encountered.
- Move to the next position l and repeat until T is exhausted.
- Running time: O(mn).
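A minimal sketch of this naive procedure follows. It assumes a nested-dictionary trie in which the key `None` stores the pattern number of a marked node; the names `build_trie` and `naive_set_search` are illustrative. Unlike the description above, the walk continues past a marked node, so that a pattern that is a prefix of another pattern is also reported.

```python
def build_trie(patterns):
    """Nested-dict keyword tree; the key None holds the pattern number."""
    root = {}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = number
    return root


def naive_set_search(text, patterns):
    """Return (1-based start position, pattern number) for every occurrence."""
    root = build_trie(patterns)
    hits = []
    for start in range(len(text)):          # fixed starting position l
        node = root
        pos = start
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]          # follow the unique matching path
            pos += 1
            if None in node:                # marked node reached
                hits.append((start + 1, node[None]))
    return hits


patterns = ["error", "potato", "pottery", "other", "otter"]
print(naive_set_search("pottery", patterns))  # [(1, 3), (2, 5)]
```

Every starting position restarts at the root, which is the source of the O(mn) running time that failure links remove.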
Failure links
Definition
For any node v in a keyword tree, let
- L(v) denote the label sequence on the path from the root to v,
- lp(v) be the length of the longest proper suffix of L(v) that is a prefix of some pattern in P, and
- n_v be the unique node in K labeled with the suffix of L(v) of length lp(v).
Definition
A failure link is a pair of nodes (v, n_v).
Failure link example
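Keyword tree for P = {“error”, “potato”, “pottery”, “other”, “otter”} augmented with failure links. Nodes are named by their path labels L(v); every node not listed links to the root:
po → o, pot → ot, pott → ott, potte → otte, potter → otter, potato → o
othe → e, other → er, otte → e, otter → er
erro → o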
Aho-Corasick algorithm
procedure AHO-CORASICK-SEARCH(T, P, K)        ▷ |T| = m; assumes no pattern is a proper substring of another
    c ← 1                                     ▷ Text running index
    l ← 1                                     ▷ Starting position of current match
    w ← root of K                             ▷ Running keyword tree node pointer
    repeat
        while c ≤ m and there is an edge (w, w′) labeled with T(c) do
            if w′ is marked with i then
                Report an occurrence of P_i at position l in T
            end if
            w ← w′
            c ← c + 1
        end while
        if w = root then
            c ← c + 1                         ▷ No pattern starts with T(c); skip it
            l ← c
        else
            l ← c − lp(w)                     ▷ Jump to the next search start
            w ← n_w                           ▷ Follow a failure link
        end if
    until c > m
end procedure
Computing failure links
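function FAILURE_LINK(v, K)                   ▷ Call in breadth-first order, so that n_w is known for all shallower nodes w
    v′ ← parent of v
    if v′ = root then
        n_v ← root                            ▷ Nodes at depth 1 link to the root
        return
    end if
    x ← character on the edge (v′, v)
    w ← n_{v′}                                ▷ Begin with the failure link of v's parent
    while there is no edge out of w labeled with x, and w ≠ root do
        w ← n_w                               ▷ Follow failure links until an edge labeled x is found
    end while
    if there is an edge (w, w′) out of w labeled with x then
        n_v ← w′
    else
        n_v ← root
    end if
end function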
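Keyword tree construction, failure links and search can be combined into one short sketch. It deviates from the pseudocode in one labeled respect: each node inherits the pattern numbers of the node its failure link points to, so that a pattern occurring inside another pattern (here "otter" inside "pottery") is still reported. The function names and the list-based node representation are assumptions made for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree with failure links; node 0 is the root."""
    children = [{}]      # children[v]: character -> child node
    fail = [0]           # fail[v]: failure link n_v
    outputs = [[]]       # outputs[v]: numbers of patterns ending at v
    for number, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                fail.append(0)
                outputs.append([])
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        outputs[v].append(number)
    # Breadth-first traversal: depth-1 nodes keep the root as failure link.
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for ch, v in children[parent].items():
            w = fail[parent]                 # start at the parent's failure link
            while w != 0 and ch not in children[w]:
                w = fail[w]                  # follow failure links
            fail[v] = children[w].get(ch, 0)
            outputs[v] = outputs[v] + outputs[fail[v]]   # inherit matches
            queue.append(v)
    return children, fail, outputs


def aho_corasick_search(text, patterns):
    """Return (1-based start position, pattern number) for every occurrence."""
    children, fail, outputs = build_automaton(patterns)
    hits = []
    w = 0
    for c, ch in enumerate(text, start=1):
        while w != 0 and ch not in children[w]:
            w = fail[w]                      # mismatch: follow a failure link
        w = children[w].get(ch, 0)
        for number in outputs[w]:
            hits.append((c - len(patterns[number - 1]) + 1, number))
    return hits


patterns = ["error", "potato", "pottery", "other", "otter"]
print(aho_corasick_search("a potato and an otter", patterns))  # [(3, 2), (17, 5)]
print(sorted(aho_corasick_search("pottery", patterns)))        # [(1, 3), (2, 5)]
```

Hits are produced in order of their end positions, which is why the second call sorts the result before printing.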
Recommended reading
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.