String matching with preprocessing of text and pattern

Moni Naor
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120, USA

Abstract
We consider the well known string matching problem where a text and pattern are
given and the problem is to determine if the pattern appears as a substring in the text.
The setting we investigate is where the pattern and the text are preprocessed separately
at two different sites. At some point in time the two sites wish to determine if the
pattern appears in the text (this is the on-line stage). We provide preprocessing and on-line
algorithms such that the preprocessing algorithm runs in linear time and requires
linear storage, and the on-line complexity is logarithmic in the text. We also give an
application of the algorithm to parallel data compression, and show how to implement
the Lempel-Ziv algorithm in logarithmic time with a linear number of processors.
1 Introduction
The string matching problem is: given a text S and a pattern P over some alphabet, does
P appear as a substring of S? This is one of the most studied problems in computer science
and a large body of literature is devoted to it. (See e.g. the survey by Aho [Ah] or the
proceedings edited by Apostolico and Galil [AG].)

Most pattern matching algorithms fall into one of two categories: either the text is first
preprocessed and then, given the pattern, the work is simple and proportional to the length
of the pattern (for instance [We, Mc]); or, the pattern is first preprocessed and then, given
the text, it is simple to determine if the text contains the pattern as a substring (for instance
[KMP]). In this work we combine the two approaches and show what can be done when both
pattern and text are preprocessed.
Suppose that a system consists of two sites and one site is given a text and another is
given a pattern. They have time to preprocess their inputs and store the results (this is the
preprocessing stage). At some later time they wish to determine whether the pattern appears
in the text (the on-line stage). What we are aiming for is a procedure for the preprocessing
which is not more expensive than the traditional algorithms in terms of time and memory
requirements in the preprocessing stage, but whose on-line complexity is considerably smaller
than the length of the text or the pattern.
This problem is applicable in several scenarios. It fits naturally in a distributed environ-
ment where the text and the pattern reside in two different locations. Another scenario is
where there are many texts and many patterns, and we wish to know for some pairs of (text,
pattern) whether the pattern appears in the text. We need not know in advance which pat-
tern should be searched for in which text. (Searching for all patterns in all texts might yield
too much information, i.e the length of the output might be larger than the complexity we
are aiming for .) Using the methods described in this paper, after the initial preprocessing
we can easily determine for any given pair of pattern and text whether the pattern appears
in the text. Another application is outlined in Section 3~ where we show how to use the
solution of this problem in order to implement in parallel with a linear number of processors
the data compression method of Lempel-Ziv [ZL1].
Compare the scenario considered in this paper with the communication complexity one:
in a communication complexity setting only the number of bits exchanged between the parties
is counted, and the complexity of their internal computation is completely disregarded.
Nevertheless, impossibility results for our model can be derived from the communication
complexity model. In particular, we can deduce that our algorithms must be probabilistic
if we insist that the on-line complexity be smaller than the length of the pattern: even for the
simpler string equality problem, i.e. the case where the pattern and the text are of the
same length, the deterministic communication complexity is linear [Yao].
In this paper we provide an algorithm where the preprocessing complexity is linear and
the on-line complexity is logarithmic in the length of the text. The probability of error can
be made inverse polynomially small within those time bounds. We assume that the system
has a globally known short random string. The inputs and the shared random string are
independent of each other. The adversary who chooses the pattern and text does not know
the random string.
The two main tools applied in the algorithm are Suffix trees and the Karp-Rabin string
matching algorithm. We briefly summarize them:
A probabilistic string matching algorithm has been suggested by Karp and Rabin [KR].
In their algorithm, a random hash function h is chosen, for instance the function h(x) =
x mod q for a random prime q. h is computed on the pattern, and then it is computed on a
sliding window of length m on the string. Whenever the function is evaluated for a segment
in the text it is compared with the value computed for the pattern. If the evaluations of h
are equal, then the segment and the pattern are tested to see whether they match.
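As an illustrative sketch of the Karp-Rabin scheme (the variable names and the fixed modulus are ours; the actual algorithm draws the prime q at random, and the paper works over the alphabet {0, 1}):

```python
def karp_rabin(text, pattern, q=1000003):
    """Return the first index where pattern occurs in text, or -1.
    q stands in for the random prime; it is fixed here purely for
    illustration."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    base = 256                     # byte alphabet; the paper uses {0, 1}
    h_pat = h_win = 0
    for k in range(m):
        h_pat = (h_pat * base + ord(pattern[k])) % q
        h_win = (h_win * base + ord(text[k])) % q
    top = pow(base, m - 1, q)      # weight of the character leaving the window
    for i in range(n - m + 1):
        # compare hashes first; an explicit comparison rules out false positives
        if h_win == h_pat and text[i:i + m] == pattern:
            return i
        if i + m < n:
            h_win = ((h_win - ord(text[i]) * top) * base + ord(text[i + m])) % q
    return -1
```

The explicit comparison after a hash match is what makes the sequential algorithm error-free; the on-line setting of this paper cannot afford it, which is why the error analysis below is needed.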
Suffix trees [Mc] are a trie representation of all suffixes of a string. The advantage they
offer is that searching for a pattern can be done in time proportional to the length of the pattern.
The main idea of the algorithm is to combine the two tools by searching the suffix tree
quickly via the Karp-Rabin string matching algorithm.
Note that if the length of the pattern m is known at the text site in the preprocessing
stage then the problem is easy: the text preprocessor computes and stores for every segment
of length m the value of h. The on-line algorithm simply searches for h(P) in the stored
values. However, if m is not known, or known only approximately, then the preprocessing
requires n^2 or n·m time and memory.
Notation: The input at one site is a text S = S[1], S[2], ... S[n] and at the other site
a pattern P = P[1], P[2], ... P[m]. S[i] and P[j] belong to some alphabet Σ. The pattern
matching problem is to find (if it exists) an index i such that P = S[i], S[i + 1], ..., S[i + m - 1].
For simplicity we assume that the alphabet Σ = {0, 1}. The case where the alphabet is larger
(but not unbounded) is treated similarly, without any performance degradation.
We can regard a binary string as an integer by defining s to be sum_{i=1}^{n} S[i] * 2^{n-i}.
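This convention can be written directly in code (a small helper of ours, not part of the paper; the first character is the most significant bit):

```python
def string_value(s):
    """Value of a binary string under the paper's convention:
    sum over 1-based i of S[i] * 2^(n - i), i.e. the usual base-2
    reading with the first character most significant."""
    n = len(s)
    return sum(int(bit) * 2 ** (n - i) for i, bit in enumerate(s, start=1))
```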
2 The algorithm
We first give a general outline of the algorithm for the problem. We then identify an important
subproblem, substring equality, and in Section 2.2 give a probabilistic algorithm for it.
2.1 General
A suffix tree for a text S is a trie of all the suffixes of S. Since the total length of all suffixes
might be O(n^2), nodes with outdegree one are compressed and a pointer to the relevant place
in the text is added. The total storage requirement is thus linear in the length of the text.
Every edge in the suffix tree corresponds to a substring of S. A pointer to the text and the
length of the corresponding substring is stored for every edge.
Algorithms that construct a suffix tree in linear time and space are known: [We], [Mc].
Further details on suffix trees appear in [CS].
Suffix trees are traditionally used for the pattern matching problem as follows: The suffix
tree of the text is constructed. Every substring of the text corresponds to a path in the suffix
tree in a natural way. To test if a pattern appears in the text, the path corresponding to
the pattern is traversed, until either there is no edge in the tree corresponding to the next
character in the pattern (in which case the pattern is not a substring), or there are no
characters left in the pattern (in which case the pattern appears in the text).
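The traversal just described can be sketched as follows (our own minimal illustration using an uncompressed trie for clarity; a real suffix tree compresses outdegree-one paths to keep storage linear):

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie (nested dicts).  Quadratic storage,
    for illustration only; a suffix tree compresses outdegree-one
    paths so that storage stays linear in the text."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def is_substring(trie, pattern):
    """Walk the path the pattern defines from the root: running out
    of edges means the pattern is not a substring; running out of
    pattern characters means it is."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True
```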
For any node in a suffix tree there is the substring associated with the path from the root
to it, which is the concatenation of the substrings of the edges along the path from the root.
For our purposes we require that the length of this substring be stored in the node, and that
a pointer to the text referring to the endpoint of an occurrence of this substring in the text
be stored as well. This information can easily be added in the suffix tree construction
stage.
The general approach of the algorithm is as follows: try to find the path from the root
that the pattern defines; however, instead of finding it by expanding node by node, try to
expand by longer paths in each iteration.
In a rooted tree a node v is called a separator if its removal from the tree splits the tree
so that no connected component contains more than n/2 of the nodes. It is a well known fact,
which has been extensively applied, that every tree contains a separator (see [Jo] or [Me]).
For any tree T we define a complete separator decomposition tree U on the same set of
nodes: U is a rooted tree whose root v is a separator of T; v's children (in U) are the roots
of the recursively defined separator decomposition trees for the connected components of
T resulting from the removal of v. Thus the number of children of v in U is exactly its
outdegree in T plus one.
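A sketch of this construction (our own recursive formulation; it is not linear time, which is what Proposition 2 below requires, but it shows the definition concretely):

```python
def centroid(adj, nodes):
    """Find a separator of the subtree of the tree `adj` induced by
    `nodes`: removing it leaves no component with more than half of
    the nodes (such a node always exists in a tree)."""
    total = len(nodes)
    start = next(iter(nodes))
    parent, order, stack = {start: None}, [], [start]
    while stack:                     # DFS restricted to `nodes`
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u in nodes and u != parent[v]:
                parent[u] = v
                stack.append(u)
    size = {}
    for v in reversed(order):        # subtree sizes, leaves first
        size[v] = 1 + sum(size[u] for u in adj[v]
                          if u in nodes and parent.get(u) == v)
    for v in order:
        parts = [size[u] for u in adj[v]
                 if u in nodes and parent.get(u) == v]
        parts.append(total - size[v])          # the component "above" v
        if max(parts) <= total // 2:
            return v

def decompose(adj, nodes=None):
    """Complete separator decomposition tree as (separator, children)."""
    if nodes is None:
        nodes = set(adj)
    c = centroid(adj, nodes)
    rest, children = nodes - {c}, []
    while rest:                      # flood-fill each component of T - c
        comp, stack = set(), [next(iter(rest))]
        while stack:
            v = stack.pop()
            if v in rest:
                rest.discard(v)
                comp.add(v)
                stack.extend(u for u in adj[v] if u in rest)
        children.append(decompose(adj, comp))
    return (c, children)
```

For a path on seven nodes the root separator is the middle node, and each half decomposes recursively, giving the logarithmic height of Proposition 1.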
Proposition 1 For any tree T with n nodes, the height of the separator decomposition tree
U is O(log n).
Also, by a modification of the recursive method for finding separators, we can show:
Proposition 2 A complete separator decomposition tree can be constructed in linear time.
Any path in T starting from the root can be mapped to a path starting from the root in
U in a natural way: start from the root of U and at each node choose the child that is the
decomposition tree of the connected component containing the end point of the path in T.
Our general algorithm is to find in U the path corresponding to the one in T defined by
P. From the proposition we know that this path is of length at most log n. The algorithm
traverses the path in U by expanding it node by node starting from the root.
Let P be the pattern and let w be the end point of the path it determines in T. If P
is not a substring of S then we define the endpoint w to be the node corresponding to the
longest prefix of P which is a substring of S. Our goal is to find w.
Thus, the general structure of the algorithm is as follows:
1. Let v be the root of U.
2. While w is not known:
• decide in which of the connected components represented by v's children w lies.
• assign v to that child.
In order to implement this scheme throughout the execution of the algorithm the following
objects are maintained:
• v, the current node in U. v is a separator for some connected component C of T.
• u, the topmost node of C (in T).
• k, a pointer to S that corresponds to the suffix starting at v.
• j, the length (in characters) of the path from u to v.
• i, an index to P such that P[1], ..., P[i - 1] determines the path to u.
u is initialized to be the root of T and v is initialized as the root of U. i and k are initialized
to 1 and j is initialized to 0. Note that C does not necessarily contain all the descendants
of u. v is a descendant of u in T (or u itself) and u is a descendant of v in U. Given u and v,
computing j and k is easy, since we have assumed that the suffix tree stores in the nodes the
length of the path from the root and a pointer to the text to an endpoint of this substring.
In order to find w we must find a way to decide quickly which of the connected components
separated by v contains w. This can be done in two stages:
1. Determine if w is a descendant of v. If it is not, then the connected component is the
one "above" v. v is assigned its child corresponding to that connected component, u
remains the same, i is not changed, and k and j are updated from the information on
the new v in T.

2. If w is a descendant of v and w ≠ v, then determine which of v's children in T is an
ancestor of w. v is assigned the corresponding child in U and u is assigned the old v.
i becomes i + j, and k and j are updated from the information provided from the tree
T on the new v.
Step 1 is essentially asking whether S[k - j] ... S[k] = P[i] ... P[i + j]. In order to execute
Step 2, first find the child of v in T that corresponds to the symbol P[i + j + 1]. Then,
determine if the substring corresponding to the edge between v and its child is equal to the
corresponding substring of P starting at P[i + j + 1].
This leads us to the formulation of the substring equality problem:
• Input to the preprocessing phase: text S and pattern P.
• On-line query input: indices k, i and length j. Question: Is the substring S[k], S[k +
1], ... S[k + j] equal to the substring P[i], P[i + 1], ... P[i + j]?
A solution to this problem yields a solution to our problem, since both steps 1 and 2
are based on answering this type of query. The next subsection is devoted to solving the
substring equality problem.
To summarize, the preprocessing of the text S consists of:
• Construct a suffix tree T for S
• Construct a complete decomposition tree U for T
• Other work for quickly answering substring equality queries, as described in the next
subsection.
The preprocessing stage for the pattern consists only of the preprocessing required for the
substring equality problem.
Since the general algorithm consists of O(log n) iterations, the number of calls to the
substring equality problem is also logarithmic. Thus, the time required to answer the on-line
query is proportional to log n times the time required to answer the substring equality query.
Furthermore, if the probability of error in answering a substring equality query is at most
f, then the probability of making an error in the algorithm is at most log n · f.
2.2 Substring equality
In this section we show how to preprocess the string and the pattern so that we can answer
quickly on-line whether two substrings are equal or not. The method is based on the
Karp-Rabin [KR] string matching algorithm.
For 1 <= i <= j <= n let s_{i,j} = sum_{k=i}^{j} 2^{j-k} * S[k], and for 1 <= i <= j <= m let
p_{i,j} = sum_{k=i}^{j} 2^{j-k} * P[k]. By definition s_{1,j} = s_{1,i-1} * 2^{j-i+1} + s_{i,j}.

Consider the hash function h(x) = x mod q for q a prime. We assume that q is sufficiently
small so that computation mod q can be done in constant time. We know that for 1 <= i <=
j <= n

    s_{1,j} = s_{1,i-1} * 2^{j-i+1} + s_{i,j} mod q

and therefore

    h(s_{i,j}) = h(s_{1,j}) - h(s_{1,i-1}) * 2^{j-i+1} mod q.

Suppose that for every 1 <= i <= n, h(s_{1,i}) has been computed and stored, and that for all
1 <= k <= n, 2^k mod q has been computed and stored. From the above we can conclude that
h(s_{i,j}) can be computed in constant time for all 1 <= i <= j <= n.

It is known ([KR]) that if q is chosen as a random prime from the set of primes smaller
than M, then for any x < 2^n, the probability that x = 0 mod q is at most n/pi(M), where pi(M)
denotes the number of primes smaller than M. pi(M) behaves asymptotically as M/log M. Thus,
by choosing M to be n^2 log n we get that if q is chosen as a random prime smaller than M,
then the probability of error is at most 1/(n log n).
Therefore we assume that the shared random string is a randomly chosen prime q. The
preprocessing performed on the text S (in addition to the preprocessing described in the
previous subsection) is:

• For all 1 <= i <= n, compute and store h(s_{1,i}).

• For all 1 <= k <= n, compute and store 2^k mod q.

The preprocessing performed on the pattern P is similar:

• For all 1 <= i <= m, compute and store h(p_{1,i}).

• For all 1 <= k <= m, compute and store 2^k mod q.

Given this preprocessing it can easily be determined for any two substrings of the text and
pattern whether they are equal: to test if P[i], P[i + 1], ... P[i + j] = S[k], S[k + 1], ... S[k + j],
compute h(p_{i,i+j}) and h(s_{k,k+j}) and compare them for equality.
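The whole substring-equality scheme fits in a few lines. The sketch below uses a fixed q for illustration (the algorithm itself draws a random prime q smaller than n^2 log n); indices are 1-based as in the text, and the function names are ours:

```python
def prefix_hashes(s, q):
    """h[i] = value of s[1..i] mod q (1-based, as h(s_{1,i}) in the
    text), plus the table of powers 2^k mod q."""
    n = len(s)
    h = [0] * (n + 1)
    pw = [1] * (n + 1)
    for i, bit in enumerate(s, start=1):
        h[i] = (h[i - 1] * 2 + int(bit)) % q
        pw[i] = (pw[i - 1] * 2) % q
    return h, pw

def substr_hash(h, pw, i, j, q):
    """Hash of s[i..j] (1-based, inclusive):
    h(s_{i,j}) = h(s_{1,j}) - h(s_{1,i-1}) * 2^(j-i+1) mod q."""
    return (h[j] - h[i - 1] * pw[j - i + 1]) % q

def substrings_equal(S, P, k, i, j, q=1000003):
    """Probabilistic test of S[k..k+j] == P[i..i+j]; preprocessing is
    shown inline here, but in the paper's setting each side computes
    its own tables once, in advance."""
    hs, pws = prefix_hashes(S, q)
    hp, pwp = prefix_hashes(P, q)
    return substr_hash(hs, pws, k, k + j, q) == substr_hash(hp, pwp, i, i + j, q)
```

Each query performs two table lookups, one multiplication and one subtraction mod q, i.e. constant time, matching the claim above.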
During the execution of the algorithm, substrings of P and S are compared log n times.
The probability of error in any comparison is at most 1/(n log n). Thus the probability of any
error occurring in the execution of the algorithm is at most 1/n.
Theorem 1 The string matching algorithm resulting from using the method for substring
equality described in this section has linear time and space complexity for the preprocessing
stage and logarithmic complexity for answering queries. The probability of error is at most
1/n and can be made inverse polynomially small.
3 An application to parallel data compression
In this section we outline an application to data compression on a parallel machine; specifi-
cally, it is shown how to implement one of Lempel and Ziv's universal data compression
algorithms [LZ1], [ZL1]. For a general survey on data compression see [St].
The data compression problems we consider are of the form: given a string S, the task
is to find a compressed representation of S. (All of S is known at the beginning of the
execution.) Similarly, given the compressed representation, the task is to uncompress it.
The goal is to find an efficient parallel algorithm for performing these tasks. The number of
processors should be linear and the time polylogarithmic.
Lempel and Ziv [ZL1] suggested an algorithm which is perfect in the information theoretic
sense of [LZ1]. To the best of our knowledge, no parallel implementation of an algorithm
which is perfect in this sense was known.
The Lempel-Ziv method partitions the string into words such that no word is a prefix of
another one. The (sequential) algorithm proceeds in stages, each one generating one such
word. At every stage the longest prefix of the remaining unread part of the input string which
appears as a substring of the input that has already been scanned is found. I.e., if the string
has been read up to index i, then the maximum j such that S[i + 1] ... S[i + j] appears as a
substring of S[1] ... S[i + j - 1] is found. The new word is now S[i + 1] ... S[i + j + 1]. A triple
that consists of a pointer to the word that S[i + 1] ... S[i + j] is a prefix of, the length j + 1
and the next character S[i + j + 1] is created for the compressed representation.
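A sequential, quadratic-time reference implementation of this parsing rule (our own sketch, using 0-based indices and naive substring search; the point of the rest of this section is to compute the same parse in parallel):

```python
def lz_parse(s):
    """Quadratic-time reference parse (0-based): at position i, find
    the largest j such that s[i:i+j] occurs in the already-scanned
    part; emit a triple (earlier position, length j + 1, next
    character).  The final word may have no next character if the
    string ends first."""
    words, i, n = [], 0, len(s)
    while i < n:
        j, src = 0, -1
        while i + j < n:
            # can the word be extended to length j + 1?  The occurrence
            # must lie entirely inside the already-scanned prefix s[:i+j].
            pos = s.find(s[i:i + j + 1], 0, i + j)
            if pos == -1:
                break
            src, j = pos, j + 1
        nxt = s[i + j] if i + j < n else ""
        words.append((src, j + 1, nxt))
        i += j + 1
    return words
```

On "010010" this produces the words 0 | 1 | 00 | 10, the first two as fresh characters and the last two as (pointer, length, next-character) triples.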
Rodeh, Pratt and Even [RPE] suggested using suffix trees for an efficient sequential
implementation of the algorithm. We follow them and use our sequential on-line algorithm
to provide an efficient parallel implementation. Apostolico, Iliopoulos, Landau, Schieber and
Vishkin [Ap] have provided an efficient parallel algorithm for constructing suffix trees: a
CRCW PRAM with n processors can construct a suffix tree in logarithmic time.
The general structure of our algorithm is:
1. For all 1 <= i <= n find the longest substring starting at i that appears prior to i.
2. Using pointer doubling, find the locations for which a triple should be generated.
The two key observations for using suffix trees to implement step 1 efficiently are
• The algorithm of the previous section can be used to solve the maximum prefix pattern
matching problem: given a text S and a pattern P find the longest prefix of P that
appears as a substring of S. The on-line processing time is still log n.
• The same suffix tree can be used for all prefixes of the string: store at each node the
first index of the string where the substring that corresponds to the node appears.
Now it can easily be determined whether the subtree rooted at the node is part of the
suffix tree of a given prefix.
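The second observation can be sketched on the naive trie (our own illustration; a real implementation would annotate the compressed suffix tree during construction). A node at depth d whose substring first starts at index f lies in the suffix tree of the prefix S[1..t] exactly when f + d - 1 <= t, so one annotated tree serves all prefixes:

```python
def annotate_first_occurrence(text):
    """Naive suffix trie where each node stores the earliest 1-based
    start index of its substring.  A node at depth d with earliest
    start f belongs to the suffix tree of the prefix S[1..t] exactly
    when f + d - 1 <= t."""
    root = {"first": 1, "children": {}}
    for i in range(len(text)):           # insert the suffix starting at i
        node = root
        for c in text[i:]:
            child = node["children"].setdefault(c, {"first": i + 1, "children": {}})
            child["first"] = min(child["first"], i + 1)
            node = child
    return root
```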
Thus, given the string S, the preprocessing described in the previous section is performed (all
the algorithms used in the preprocessing are parallelizable with a linear number of processors).
Then, for each 1 <= i <= n a processor is allocated. It searches for the largest j such that
S[i], S[i + 1], ... S[i + j] appears in S[1], ... S[i + j - 1], and denotes it by j_i. From the above,
this can be performed in logarithmic (sequential) time. After all the j_i's have been found,
those indices for which a triple should be generated (the breakpoints) should be computed.
If every i "points" to i + j_i, then by pointer doubling after log n steps the breakpoints are
discovered.
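The breakpoint computation, shown here as the equivalent sequential walk (our sketch; the parallel algorithm marks the same positions by pointer doubling in O(log n) rounds rather than walking the chain):

```python
def breakpoints(j):
    """j[i] is the longest-match length at position i (0-based), so
    the word starting at i has length j[i] + 1 and i "points" to
    i + j[i] + 1.  The walk below visits exactly the breakpoints."""
    marks, i, n = [], 0, len(j)
    while i < n:
        marks.append(i)
        i += j[i] + 1
    return marks
```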
The uncompressing procedure can be done as follows: first, using parallel prefix, for every
position k in the output string the triple to which it belongs is computed. A pointer from
k to the position corresponding to k in a previous triple is added, or a pointer to k itself in
case it is at the end of a word. Now via pointer doubling after log n steps the value of S[k]
can be found.
The string matching algorithm is probabilistic and there is a small chance of error. To
make the data compression algorithm error-free (Las Vegas) we can add a step that checks
the success of the above algorithm by uncompressing and comparing the result with the
original string. (The uncompressing algorithm is deterministic.)
We thus have
Theorem 2 The Lempel-Ziv data compression algorithm [ZL1] can be implemented on a
CRCW PRAM using a linear number of processors and logarithmic expected time.
4 Open problems
We showed how to implement in parallel the first Lempel-Ziv algorithm. Can their second
algorithm [ZL2] be implemented in parallel? We do not see even an inefficient way of doing
it. Interestingly, implementing this algorithm sequentially is simpler than implementing the
first one.
A different question is whether the randomization can be moved to the on-line stage, so
that we do not have to assume that the two sites share a random string. Specifically, can the
substring equality problem be solved by a deterministic preprocessing stage and a randomized
on-line stage?
Acknowledgments
It is a pleasure to thank Amos Fiat for his help in writing the paper, Dafna Sheinvald for
advice concerning data compression, Alex Schaffer for discussions leading to this paper and
Gadi Landau for very helpful remarks concerning the paper.
References
[Ah] A. V. Aho, Algorithms for finding patterns in strings, in The Handbook of Theo-
retical Computer Science, edited by J. van Leeuwen, Elsevier, Amsterdam, 1990.
[AHU] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The design and analysis of com-
puter algorithms, Addison-Wesley, Reading, MA, 1974.
[AG] A. Apostolico and Z. Galil, Combinatorial algorithms on words, Springer-Verlag,
1985.
[Ap] A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber and U. Vishkin, Parallel
construction of a suffix tree with applications, Algorithmica 3, 1988, pp. 347-365.
[CS] Chen and Seiferas, Efficient and elegant subword-tree construction, in Combinatorial
Algorithms on Words, Springer-Verlag, 1985, pp. 97-110.
[Jo] C. Jordan, Sur les assemblages de lignes, J. Reine und Ang. Math. 70, 1869, pp.
185-190.
[KR] R. M. Karp and M. O. Rabin, Efficient randomized pattern-matching algorithms, IBM
Journal of R&D, 31, 1987, 249-260.
[KMP] D. E. Knuth, J. H. Morris and V. R. Pratt, Fast pattern matching in strings, SIAM
Journal on Computing 6, 1977, pp. 323-349.
[LZ1] A. Lempel and J. Ziv, On the complexity of finite sequences, IEEE Transactions on
Information Theory 22, 1976, pp. 75-81.
[Mc] E.M. McCreight, A space economical suffix tree construction algorithm, Journal of
the ACM 23, 1976, 262-272.
[Me] N. Megiddo, Applying parallel computation algorithms in the design of serial algo-
rithms, Journal of the ACM 30, 1983, pp. 852-865.
[RPE] M. Rodeh, V. R. Pratt and S. Even, Linear algorithms for compression via string
matching, Journal of the ACM 28, 1980, pp. 16-24.
[St] J. A. Storer, Data compression: methods and theory, Computer Science Press,
Rockville, 1988.
[We] P. Weiner, Linear pattern matching algorithms, Proc. 14th Annual Symp. on Switching
and Automata Theory, pp. 1-11.
[Yao] A. C. Yao, Some complexity questions related to distributive computing, Proc. 11th
ACM Symposium on Theory of Computing, 1979, pp. 209-213.
[ZL1] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE
Transactions on Information Theory 23, 1977, pp. 337-343.
[ZL2] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding,
IEEE Transactions on Information Theory 24, 1978, pp. 530-536.