
Intrusion Detection and Malware Analysis
String matching algorithms

Pavel Laskov
Wilhelm Schickard Institute for Computer Science

Exact string matching

Problem (Exact string matching)
Given a string P, called the pattern, and a longer string T, called the text, find all occurrences, if any, of P in T.

Example
Let P = aba and T = bbabaxababay. Then P occurs in T starting at positions 3, 7 and 9 (note that occurrences of P may overlap).

Remark (Substrings vs. subsequences)
By convention, contiguous patterns are called strings (substrings), whereas non-contiguous patterns that preserve the left-to-right order are called sequences (subsequences).

Inexact matching and alignment

Problem (Inexact string matching)
Given a string P and a text T, find all substrings S of T that contain at most k "errors" with respect to P.

Example
Let P = aba and T = bbabaxababay. Allowing at most one character substitution, inexact matches of P in T start at positions 1, 3, 5, 7 and 9.
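The example can be checked with a brute-force sketch that counts substitutions at every alignment. The function name `substitution_matches` is hypothetical, and "errors" are assumed to mean substitutions only (no insertions or deletions).

```python
def substitution_matches(pattern, text, k):
    """Return 1-based start positions where pattern matches text
    with at most k character substitutions (brute force)."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        window = text[start:start + n]
        # Count positions where the pattern and the text window differ.
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= k:
            hits.append(start + 1)
    return hits


print(substitution_matches("aba", "bbabaxababay", 0))  # [3, 7, 9]
print(substitution_matches("aba", "bbabaxababay", 1))  # [1, 3, 5, 7, 9]
```

With k = 0 the sketch reduces to exact matching and reproduces the positions of the exact-matching example.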

Problem (Sequence alignment)
An alignment of two strings S1 and S2 is obtained by inserting spaces either into or at the ends of S1 and S2 so that the resulting strings have equal length and every character or space of one string is placed opposite exactly one character or space of the other.

Example

    q a c _ d b d
    q a w x _ b _

A whirlwind tour of string matching

Today:
- Naive string matching
- Fundamental preprocessing
- Knuth-Morris-Pratt
- Set matching

Other problems/algorithms:
- Regular expression matching
- Rabin-Karp fingerprints
- Suffix trees/arrays
- Inexact matching
- Sequence alignment

Naive string matching

- Align the left end of P with the left end of T and compare the characters of P and T left to right until a mismatch is found or P is exhausted.
- Shift P by one character to the right and restart the comparison from the left end of P.
- Continue until the right end of P shifts past the right end of T.
- Running time: O(mn), where n = |P| and m = |T|.
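The steps above can be sketched as follows. The function name `naive_search` and the comparison counter are illustrative additions, not part of the slides; positions are reported 1-based to match the examples.

```python
def naive_search(pattern, text):
    """Naive exact matching.

    Returns (occurrences, comparisons): the 1-based start positions of
    pattern in text and the number of character comparisons performed.
    """
    n, m = len(pattern), len(text)
    occurrences = []
    comparisons = 0
    for start in range(m - n + 1):      # shift P by one character each time
        j = 0
        while j < n:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break                   # mismatch: give up on this alignment
            j += 1
        if j == n:                      # P exhausted: occurrence found
            occurrences.append(start + 1)
    return occurrences, comparisons


print(naive_search("aba", "bbabaxababay")[0])       # [3, 7, 9]
print(naive_search("abxyabxz", "xabxyabxyabxz"))    # ([6], 20)
```

The second call uses the pattern and text of the following slide; the 20 comparisons are the work that larger shifts and saved partial matches try to reduce.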

Speeding up the naive method

Naive algorithm:

    T: xabxyabxyabxz
    P: abxyabxz
       *
        abxyabxz
        ^^^^^^^*
         abxyabxz
         *
          abxyabxz
          *
           abxyabxz
           *
            abxyabxz
            ^^^^^^^^

Larger shifts:

    T: xabxyabxyabxz
    P: abxyabxz
       *
        abxyabxz
        ^^^^^^^*
            abxyabxz
            ^^^^^^^^

Saving partial matches:

    T: xabxyabxyabxz
    P: abxyabxz
       *
        abxyabxz
        ^^^^^^^*
            abxyabxz
               ^^^^^

(^ marks a matching comparison, * a mismatch.)



Preprocessing

- General idea: spend a modest amount of time learning about the internal structure of P or T in order to save time during the search.
- Different preprocessing techniques were employed in the various original string matching algorithms.
- The similarity of these techniques can be expressed in terms of a fundamental preprocessing that is independent of the search algorithm.

Fundamental preprocessing

Definition
Given a string S and a position i > 1, let Z_i(S) be the length of the longest substring of S that starts at i and matches a prefix of S. If Z_i(S) > 0, this substring is called the Z-box at i.

[Figure: a string S with its Z-boxes marked]

Definition
For any index i > 1, r_i is the right-most end of all Z-boxes beginning at or before i, and l_i is the left end of the Z-box that ends at r_i.
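To make the definition concrete, the following sketch computes the Z-values directly from the definition in quadratic time. The function name `z_values_naive` and the example string are illustrative choices, not taken from the slides.

```python
def z_values_naive(s):
    """Return {i: Z_i(s)} for 1-based positions i = 2..len(s),
    computed directly from the definition in O(|s|^2) time."""
    n = len(s)
    z = {}
    for i in range(2, n + 1):
        length = 0
        # Compare s[i..] with the prefix of s, character by character.
        while i - 1 + length < n and s[length] == s[i - 1 + length]:
            length += 1
        z[i] = length
    return z


print(z_values_naive("aabcaabxaaz"))
# {2: 1, 3: 0, 4: 0, 5: 3, 6: 1, 7: 0, 8: 0, 9: 2, 10: 1, 11: 0}
```

For this string the Z-box at position 5 is "aab" (positions 5 to 7), so r_5 = r_6 = r_7 = 7 and l_5 = l_6 = l_7 = 5.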


Linear time fundamental preprocessing

- Goal: compute Z_i for each successive position i, starting from i = 2.
- All Z-values are kept, together with the most recent pair r, l.
- Initialization: compute Z_2 explicitly by comparing S[2..|S|] with S[1..|S|] character by character until a mismatch occurs.
- Recursion: given Z_2, ..., Z_{k-1}, r_{k-1} and l_{k-1}, compute Z_k, r_k and l_k.

Recursive step of the Z-algorithm

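Write r = r_{k-1} and l = l_{k-1}. If k > r, position k lies in no known Z-box, and Z_k is computed by explicit comparison with the prefix of S; if Z_k > 0, set l to k and r to k + Z_k - 1. Otherwise k lies inside the Z-box S[l..r].

Key insight: the Z-box α = S[l..r] equals the prefix S[1..Z_l], so the character at position k also appears at position k′ = k - l + 1. The same holds for the entire substring β = S[k..r], which equals S[k′..Z_l].

[Figure: S with the prefix α and the Z-box α = S[l..r]; β occupies k′..Z_l in the prefix and k..r in the Z-box]

- If Z_{k′} < |β|, then Z_k = Z_{k′}, and r, l remain unchanged.

[Figure: the string γ of length Z_{k′} occurs at the start of S, at k′ and at k, and ends inside β]

- If Z_{k′} ≥ |β|, then the entire β is a prefix of S. Keep comparing from position r + 1 (against position |β| + 1 of S) until a mismatch occurs; set Z_k to the matched length, l to k, and r to the position before the mismatch.

[Figure: β is a prefix of S; the character following β at position r + 1 is not yet known]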

Running time of the Z-algorithm

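Theorem
All values Z_i(S) can be computed in O(|S|) time.

Proof.
Every iteration k that copies Z_k from Z_{k′} needs only constant work. Every other iteration performs explicit comparisons and ends at the first mismatch. Each matching comparison moves r one position to the right, and r never moves beyond the end of S. Hence there are at most |S| matches and at most |S| mismatches in total, and the overall amount of work is bounded by O(|S|).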
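A minimal sketch of the linear-time Z-algorithm follows. It uses 0-based indices and a half-open box `s[l:r]`, which are conventions of this sketch rather than of the slides; `z[i]` holds the Z-value of the 1-based position i + 1, and `z[0]` is left at 0.

```python
def z_algorithm(s):
    """Linear-time Z-values; z[i] is the Z-value at 0-based position i."""
    n = len(s)
    z = [0] * n
    l = r = 0                     # s[l:r] is the Z-box with the right-most end so far
    for k in range(1, n):
        beta = r - k              # |beta|; not positive if k lies outside the box
        if beta > 0 and z[k - l] < beta:
            z[k] = z[k - l]       # copy case: Z_k = Z_k', box unchanged
            continue
        length = max(beta, 0)     # beta is already known to match a prefix
        while k + length < n and s[length] == s[k + length]:
            length += 1           # explicit comparisons past the old box
        z[k] = length
        if k + length > r:
            l, r = k, k + length  # new right-most Z-box starts at k
    return z


print(z_algorithm("aabcaabxaaz"))  # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

The `while` loop is the only place where characters are compared, and every successful comparison there pushes `r` to the right, which is the counting argument used in the proof.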

Knuth-Morris-Pratt algorithm

General idea:

- Keep scanning the pattern P and the text T left to right until a mismatch is found.
- Shift P so that its prefix overlaps with the matching suffix of the part of P before the mismatch position.
- Continue scanning from the mismatching position.

[Figure: P aligned against T before the shift and after the shift]

Shifts in KMP

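- Let the shift pointer sp_i be the length of the longest proper suffix of P[1..i] that is also a prefix of P (0 if no such suffix exists).
- In KMP, if a mismatch is found at position i + 1 of P, P can be shifted by i - sp_i positions to the right.
- A Z-box of P that starts at j and ends at i = j + Z_j - 1 is a suffix of P[1..i] of length Z_j that matches a prefix of P.
- Computing the shift pointers: after initializing all sp's to 0, loop backwards over j = n, ..., 2 and set sp_{j+Z_j-1} = Z_j. Looping backwards lets the longest Z-box ending at i (the one with the smallest j) win.
- The values obtained this way are the stronger variant that Gusfield denotes sp′_i: they additionally guarantee P(i + 1) ≠ P(sp′_i + 1), and the shift rule remains correct with them. The plain values follow from sp_n = sp′_n and sp_i = max(sp_{i+1} - 1, sp′_i) for i = n - 1, ..., 2.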
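The computation of the shift pointers from Z-values can be sketched as below. The helper names `z_values`, `strong_shift_pointers` and `shift_pointers` are hypothetical; the lists are indexed 1-based with an unused entry 0, and the relation between the two variants is the one from Gusfield's book, restated here as an assumption of the sketch.

```python
def z_values(s):
    """Linear-time Z-values, 0-based: z[i] belongs to 1-based position i + 1."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for k in range(1, n):
        if k < r:
            z[k] = min(z[k - l], r - k)
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z


def strong_shift_pointers(p):
    """Backward loop over j: set sp[j + Z_j - 1] = Z_j (1-based indices)."""
    n = len(p)
    z = z_values(p)
    sp = [0] * (n + 1)                # sp[0] = 0 by convention
    for j in range(n, 1, -1):         # j = n, n-1, ..., 2
        zj = z[j - 1]
        if zj > 0:
            sp[j + zj - 1] = zj       # smaller j overwrites: longest box wins
    return sp


def shift_pointers(p):
    """Plain sp_i: longest proper suffix of P[1..i] that is a prefix of P."""
    sp = strong_shift_pointers(p)
    for i in range(len(p) - 1, 0, -1):
        sp[i] = max(sp[i + 1] - 1, sp[i])
    return sp


print(strong_shift_pointers("abxyabxz"))  # [0, 0, 0, 0, 0, 0, 0, 3, 0]
print(shift_pointers("abxyabxz"))         # [0, 0, 0, 0, 0, 1, 2, 3, 0]
```

For P = abxyabxz a mismatch at position 8 (i = 7) gives a shift of 7 - 3 = 4, which is the larger shift shown in the earlier diagram.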

Correctness of the shift rule

Theorem
For any alignment of P and T, if characters 1 through i of P match their counterparts in T but character i + 1 mismatches T(k), then P can be shifted by i - sp_i positions to the right without passing any occurrence of P in T.

Proof.
Let β denote the prefix of P of length sp_i. After the shift, β is aligned with the part of T that was matched by the suffix of P[1..i] of the same length, so by definition of sp_i it matches its counterpart in T. Suppose there exists a missed occurrence of P in T that begins earlier than the shifted P. Its overlap with the matched region is then a string αβ with nonempty α, so αβ is a proper suffix of P[1..i] (before the shift) and at the same time a prefix of the missed occurrence of P. However, by definition of sp_i the longest such suffix has length |β|, which is a contradiction.

Complete KMP pseudocode

    procedure KNUTH-MORRIS-PRATT(T, P)                ▷ |P| = n; |T| = m
        Preprocess P to find F(k) = sp_{k-1} + 1 for all k from 1 to n + 1, with sp_0 = 0.
        c ← 1                                         ▷ Text running index
        p ← 1                                         ▷ Pattern running index
        while c + (n - p) ≤ m do
            while p ≤ n and P(p) = T(c) do            ▷ Scan until mismatch
                p ← p + 1
                c ← c + 1
            end while
            if p = n + 1 then                         ▷ End of P
                Report an occurrence of P at the position c - n in T
            end if
            if p = 1 then                             ▷ Mismatch at position 1 of P
                c ← c + 1
            end if
            p ← F(p)                                  ▷ Shift P by (p - 1) - sp_{p-1}
        end while
    end procedure
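The pseudocode translates almost line by line into the following sketch. The names `kmp_search` and `z_values` are hypothetical; the shift pointers are computed from Z-values with the backward loop described earlier, and the variables `c` and `p` stay 1-based as in the pseudocode.

```python
def z_values(s):
    """Linear-time Z-values, 0-based: z[i] belongs to 1-based position i + 1."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for k in range(1, n):
        if k < r:
            z[k] = min(z[k - l], r - k)
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z


def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    z = z_values(pattern)
    sp = [0] * (n + 1)                       # sp[0] = 0 by convention
    for j in range(n, 1, -1):
        if z[j - 1] > 0:
            sp[j + z[j - 1] - 1] = z[j - 1]
    # F[k] = sp[k - 1] + 1 for k = 1..n+1; F[0] is a placeholder.
    F = [0] + [sp[k - 1] + 1 for k in range(1, n + 2)]
    occurrences = []
    c = p = 1                                # text and pattern running indices
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1                           # scan until mismatch
            c += 1
        if p == n + 1:                       # end of P reached
            occurrences.append(c - n)
        if p == 1:                           # mismatch at position 1 of P
            c += 1
        p = F[p]                             # shift P, keep the matched prefix
    return occurrences


print(kmp_search("bbabaxababay", "aba"))        # [3, 7, 9]
print(kmp_search("xabxyabxyabxz", "abxyabxz"))  # [6]
```

The text index `c` never moves backwards, which is where the linear running time in the length of T comes from.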

Exact pattern set matching

Problem
Given a set of patterns P = {P_1, ..., P_z}, find all occurrences of any pattern from P in the text T.

Possible solutions:
- Run a standard single-pattern matching algorithm (e.g. KMP) z times: O(z(m + n)).
- Build a suffix tree for T and scan each pattern in P against it: O(m + zn + k).
- Build a keyword tree for P and run the Aho-Corasick algorithm: O(m + n + k), where m = |T|, n is the total length of all patterns in P, and k is the number of matches.

Keyword tree (trie)

A keyword tree K corresponding to a set of patterns P is a tree satisfying the following conditions:

- Each edge is labeled with exactly one character.
- Any two edges out of the same node have different labels.
- Every pattern in P maps to some node v in K such that the pattern is spelled out by the edge labels on the path from the root to v.
- Every leaf in K corresponds to some pattern in P.

Keyword tree example

Keyword tree for P = {“error”,“potato”,“pottery”,“other”,“otter”}:

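    root
     |
     +-- e - r - r - o - r (1)
     |
     +-- p - o - t -+- a - t - o (2)
     |              |
     |              +- t - e - r - y (3)
     |
     +-- o - t -+- h - e - r (4)
                |
                +- t - e - r (5)

A number in parentheses marks the node at which the pattern with that number ends.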
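The keyword tree for the five example patterns can be built with a few lines of code. The nested-dictionary node layout and the names `build_keyword_tree` and `count_nodes` are assumptions of this sketch, not something the slides prescribe.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree (trie).

    Each node is a dict {"children": {char: node}, "pattern": number or None};
    "pattern" holds the 1-based number of the pattern that ends at the node.
    """
    root = {"children": {}, "pattern": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            # Reuse the existing edge labeled ch, or create a new child.
            node = node["children"].setdefault(
                ch, {"children": {}, "pattern": None})
        node["pattern"] = number
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at node (including node)."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())


patterns = ["error", "potato", "pottery", "other", "otter"]
tree = build_keyword_tree(patterns)
print(sorted(tree["children"]))  # ['e', 'o', 'p']
print(count_nodes(tree))         # 24
```

The 24 nodes are the root plus 23 characters: the patterns have 28 characters in total, of which "pot" and "ot" (5 characters) are shared prefixes.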

Naive set matching using keyword tree

- Follow the unique path in K that matches a substring of T starting from a fixed position l, reporting a pattern whenever a marked node is reached, until a mismatch is encountered or the path ends at a leaf.
- Move to the next position l + 1 and repeat until T is exhausted.
- Running time: O(mn).
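A minimal sketch of this naive search, restarting from the root of the keyword tree at every text position. The function names and the nested-dictionary trie are hypothetical choices; matches are reported as pairs (1-based start position, pattern number).

```python
def build_keyword_tree(patterns):
    """Trie of nested dicts; "pattern" is the 1-based pattern number or None."""
    root = {"children": {}, "pattern": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node["children"].setdefault(
                ch, {"children": {}, "pattern": None})
        node["pattern"] = number
    return root


def naive_set_matching(text, patterns):
    """Return (start, pattern_number) pairs, restarting at every position."""
    root = build_keyword_tree(patterns)
    found = []
    for start in range(len(text)):
        node = root
        pos = start
        # Walk down the tree as long as the text keeps matching an edge.
        while pos < len(text) and text[pos] in node["children"]:
            node = node["children"][text[pos]]
            pos += 1
            if node["pattern"] is not None:      # marked node reached
                found.append((start + 1, node["pattern"]))
    return found


patterns = ["error", "potato", "pottery", "other", "otter"]
print(naive_set_matching("pottery other", patterns))
# [(1, 3), (2, 5), (9, 4)]
```

Note that "otter" is found inside "pottery" only because the search restarts at position 2; avoiding these restarts is the purpose of failure links.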

Failure links

Definition
For any node v in a keyword tree, let
- L(v) denote the string of edge labels on the path from the root to v,
- lp(v) be the length of the longest proper suffix of L(v) that is a prefix of some pattern in P, and
- n_v be the unique node in K labeled with the suffix of L(v) of length lp(v).

Definition
A failure link is the ordered pair of nodes (v, n_v).

Failure link example

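Keyword tree for P = {“error”, “potato”, “pottery”, “other”, “otter”} augmented with failure links:

    root
     |
     +-- e - r - r - o - r (1)
     |
     +-- p - o - t -+- a - t - o (2)
     |              |
     |              +- t - e - r - y (3)
     |
     +-- o - t -+- h - e - r (4)
                |
                +- t - e - r (5)

Failure links that do not lead to the root, written as L(v) → L(n_v):

    po → o, pot → ot, pott → ott, potte → otte, potter → otter, potato → o
    othe → e, other → er, otte → e, otter → er
    erro → o

All remaining nodes have a failure link to the root.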

Aho-Corasick algorithm

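    procedure AHO-CORASICK-SEARCH(T, P, K)            ▷ |T| = m
        c ← 1                                         ▷ Text running index
        l ← 1                                         ▷ Starting position of current match
        w ← root of K                                 ▷ Running keyword tree node pointer
        repeat
            while c ≤ m and there is an edge (w, w′) labeled with T(c) do
                if w′ is marked with i then
                    Report an occurrence of P_i at the position l in T
                end if
                w ← w′
                c ← c + 1
            end while
            if w = root then                          ▷ Mismatch at the root: skip T(c)
                c ← c + 1
            end if
            l ← c - lp(w)                             ▷ Jump to the next search start
            w ← n_w                                   ▷ Follow a failure link
        until c > m
    end procedure

Here lp(root) = 0 and n_root = root. As written, the search finds every occurrence only if no pattern in P is a proper substring of another pattern; otherwise patterns that end at nodes reachable through failure links must be reported as well.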

Computing failure links

    function FAILURE_LINK(v, K)                       ▷ Call for the nodes of K in breadth-first order
        v′ ← parent of v
        if v′ = root then                             ▷ Nodes at depth 1 link to the root
            n_v ← root
            return
        end if
        x ← character on the edge (v′, v)
        w ← n_{v′}                                    ▷ Begin with the failure link of v's parent
        while there is no edge out of w labeled with x, and w ≠ root do
            w ← n_w                                   ▷ Follow failure links until an edge labeled x is found
        end while
        if there is an edge (w, w′) out of w labeled with x then
            n_v ← w′
        else
            n_v ← root
        end if
    end function
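Failure link construction and search can be combined into one runnable sketch. It is not a literal transcription of the pseudocode: the search loop is driven by the text index, and each node additionally inherits the patterns of its failure link target so that patterns contained in other patterns (such as "otter" in "pottery") are reported. The names `build_automaton` and `aho_corasick` and the list-based node layout are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree with failure links; node 0 is the root."""
    children = [{}]            # children[v]: edge label -> child node id
    marks = [[]]               # marks[v]: (pattern number, length) ending at v
    for number, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                marks.append([])
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        marks[v].append((number, len(pattern)))
    fail = [0] * len(children)              # depth-1 nodes keep the root
    queue = deque(children[0].values())     # breadth-first order
    while queue:
        parent = queue.popleft()
        for ch, v in children[parent].items():
            w = fail[parent]                # start at the parent's failure link
            while w != 0 and ch not in children[w]:
                w = fail[w]                 # follow failure links
            fail[v] = children[w].get(ch, 0)
            marks[v] = marks[v] + marks[fail[v]]   # inherit suffix patterns
            queue.append(v)
    return children, fail, marks


def aho_corasick(text, patterns):
    """Return (1-based start position, pattern number) for every match."""
    children, fail, marks = build_automaton(patterns)
    found = []
    w = 0
    for c, ch in enumerate(text, start=1):
        while w != 0 and ch not in children[w]:
            w = fail[w]                     # mismatch: follow a failure link
        w = children[w].get(ch, 0)
        for number, length in marks[w]:
            found.append((c - length + 1, number))
    return found


patterns = ["error", "potato", "pottery", "other", "otter"]
print(sorted(aho_corasick("pottery other", patterns)))
# [(1, 3), (2, 5), (9, 4)]
```

Each text character is read once and the node pointer only climbs failure links it previously descended past, so the search phase runs in time linear in |T| plus the number of reported matches.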

Recommended reading

D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
