String matching with preprocessing of text and pattern

Moni Naor
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120, USA

Abstract
We consider the well known string matching problem where a text and pattern are
given and the problem is to determine if the pattern appears as a substring in the text.
The setting we investigate is where the pattern and the text are preprocessed separately
at two different sites. At some point in time the two sites wish to determine if the
pattern appears in the text (this is the on-line stage). We provide preprocessing and on-line
algorithms such that the preprocessing algorithm runs in linear time and requires
linear storage, and the on-line complexity is logarithmic in the text. We also give an
application of the algorithm to parallel data compression, and show how to implement
the Lempel-Ziv algorithm in logarithmic time with a linear number of processors.
1 Introduction
The string matching problem is: given a text S and a pattern P over some alphabet, does
P appear as a substring of S? This is one of the most studied problems in computer science
and a large body of literature is devoted to it. (See e.g. the survey by Aho [Ah] or the
proceedings edited by Apostolico and Galil [AG].)

Most pattern matching algorithms fall into one of two categories: either the text is first
preprocessed and then, given the pattern, the work is simple and proportional to the length
of the pattern (for instance [We, Mc]); or, the pattern is first preprocessed and then, given
the text, it is simple to determine if the text contains the pattern as a substring (for instance
[KMP]). In this work we combine the two approaches and show what can be done when both
pattern and text are preprocessed.
Suppose that a system consists of two sites and one site is given a text and another is
given a pattern. They have time to preprocess their inputs and store the results (this is the
preprocessing stage). At some later time they wish to determine whether the pattern appears
in the text (the on-line stage). What we are aiming for is a procedure for the preprocessing
which is not more expensive than the traditional algorithms in terms of time and memory
requirements in the preprocessing stage, but whose on-line complexity is considerably smaller
than the length of the text or the pattern.
This problem is applicable in several scenarios. It fits naturally in a distributed environ-
ment where the text and the pattern reside in two different locations. Another scenario is
where there are many texts and many patterns, and we wish to know for some pairs of (text,
pattern) whether the pattern appears in the text. We need not know in advance which pat-
tern should be searched for in which text. (Searching for all patterns in all texts might yield
too much information, i.e the length of the output might be larger than the complexity we
are aiming for .) Using the methods described in this paper, after the initial preprocessing
we can easily determine for any given pair of pattern and text whether the pattern appears
in the text. Another application is outlined in Section 3~ where we show how to use the
solution of this problem in order to implement in parallel with a linear number of processors
the data compression method of Lempel-Ziv [ZL1].
Compare the scenario considered in this paper with the communication complexity one:
in a communication complexity setting only the number of bits exchanged between the parties
is counted, and the complexity of their internal computation is completely disregarded.
Nevertheless, impossibility results for our model can be derived from the communication
complexity model. In particular, we can deduce that our algorithms must be probabilistic
if we insist that the on-line complexity be smaller than the length of the pattern: even for the
simpler string equality problem, i.e. the case where the pattern and the text are of the
same length, the deterministic communication complexity is linear [Yao].
In this paper we provide an algorithm where the preprocessing complexity is linear and
the on-line complexity is logarithmic in the length of the text. The probability of error can
be made inverse polynomially small within those time bounds. We assume that the system
has a globally known short random string. The inputs and the shared random string are
independent of each other. The adversary who chooses the pattern and text does not know
the random string.
The two main tools applied in the algorithm are Suffix trees and the Karp-Rabin string
matching algorithm. We briefly summarize them:
A probabilistic string matching algorithm has been suggested by Karp and Rabin [KR].
In their algorithm, a random hash function h is chosen, for instance the function h(x) =
x mod q for a random prime q. h is computed on the pattern, and then it is computed on a
sliding window of length m on the string. Whenever the function is evaluated for a segment
in the text it is compared with the value computed for the pattern. If the evaluations of h
are equal, then the segment and the pattern are tested to see whether they match.
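As an illustrative sketch of the Karp-Rabin scheme (the variable names and the fixed modulus are ours; the actual algorithm draws the prime q at random, and the paper works over the alphabet {0, 1}):

```python
def karp_rabin(text, pattern, q=1000003):
    """Return the first index where pattern occurs in text, or -1.
    q stands in for the random prime; it is fixed here purely for
    illustration."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    base = 256                     # byte alphabet; the paper uses {0, 1}
    h_pat = h_win = 0
    for k in range(m):
        h_pat = (h_pat * base + ord(pattern[k])) % q
        h_win = (h_win * base + ord(text[k])) % q
    top = pow(base, m - 1, q)      # weight of the character leaving the window
    for i in range(n - m + 1):
        # compare hashes first; an explicit comparison rules out false positives
        if h_win == h_pat and text[i:i + m] == pattern:
            return i
        if i + m < n:
            h_win = ((h_win - ord(text[i]) * top) * base + ord(text[i + m])) % q
    return -1
```

The explicit comparison after a hash match is what makes the sequential algorithm error-free; the on-line setting of this paper cannot afford it, which is why the error analysis below is needed.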
Suffix trees [Mc] are a trie representation of all suffixes of a string. The advantage they
offer is that searching for a pattern can be done in time proportional to the length of the pattern.
The main idea of the algorithm is to combine the two tools by searching the suffix tree
quickly via the Karp-Rabin string matching algorithm.
Note that if the length of the pattern m is known at the text site in the preprocessing
stage then the problem is easy: the text preprocessor computes and stores for every segment
of length m the value of h. The on-line algorithm simply searches for h(P) in the stored
values. However, if m is not known, or known only approximately, then the preprocessing
requires n^2 or n·m time and memory.
Notation: The input at one site is a text S = S[1], S[2], ... S[n] and at the other site
a pattern P = P[1], P[2], ... P[m]. S[i] and P[j] belong to some alphabet Σ. The pattern
matching problem is to find (if it exists) an index i such that P = S[i], S[i + 1], ..., S[i + m - 1].
For simplicity we assume that the alphabet Σ = {0, 1}. The case where the alphabet is larger
(but not unbounded) is treated similarly, without any performance degradation.
We can regard a binary string as an integer by defining s to be sum_{i=1}^{n} S[i] * 2^{n-i}.
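This convention can be written directly in code (a small helper of ours, not part of the paper; the first character is the most significant bit):

```python
def string_value(s):
    """Value of a binary string under the paper's convention:
    sum over 1-based i of S[i] * 2^(n - i), i.e. the usual base-2
    reading with the first character most significant."""
    n = len(s)
    return sum(int(bit) * 2 ** (n - i) for i, bit in enumerate(s, start=1))
```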
2 The algorithm
We first give a general outline of the algorithm for the problem. We then identify an important
subproblem, substring equality, and in Section 2.2 give a probabilistic algorithm for it.
2.1 General
A suffix tree for a text S is a trie of all the suffixes of S. Since the total length of all suffixes
might be O(n^2), nodes with outdegree one are compressed and a pointer to the relevant place
in the text is added. The total storage requirement is thus linear in the length of the text.
Every edge in the suffix tree corresponds to a substring of S. A pointer to the text and the
length of the corresponding substring is stored for every edge.
Algorithms that construct a suffix tree in linear time and space are known: [We], [Mc].
Further details on suffix trees appear in [CS].
Suffix trees are traditionally used for the pattern matching problem as follows: The suffix
tree of the text is constructed. Every substring of the text corresponds to a path in the suffix
tree in a natural way. To test if a pattern appears in the text, the path corresponding to
the pattern is traversed, until either there is no edge in the tree corresponding to the next
character in the pattern (in which case the pattern is not a substring), or there are no
characters left in the pattern (in which case the pattern appears in the text).
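The traversal just described can be sketched as follows (our own minimal illustration using an uncompressed trie for clarity; a real suffix tree compresses outdegree-one paths to keep storage linear):

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie (nested dicts).  Quadratic storage,
    for illustration only; a suffix tree compresses outdegree-one
    paths so that storage stays linear in the text."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def is_substring(trie, pattern):
    """Walk the path the pattern defines from the root: running out
    of edges means the pattern is not a substring; running out of
    pattern characters means it is."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True
```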
For any node in a suffix tree there is the substring associated with the path from the root
to it, which is the concatenation of the substrings of the edges along the path from the root.
For our purposes we require that the length of this substring be stored in the node, and that
a pointer to the text referring to the endpoint of an occurrence of this substring in the text
be stored as well. This information can easily be added in the suffix tree construction
stage.
The general approach of the algorithm is as follows: try to find the path from the root
that the pattern defines; however, instead of finding it by expanding node by node, try to
expand by longer paths in each iteration.
In a rooted tree a node v is called a separator if its removal from the tree splits the tree
so that no connected component contains more than n/2 of the nodes. It is a well known fact,
which has been extensively applied, that every tree contains a separator (see [Jo] or [Me]).
For any tree T we define a complete separator decomposition tree U on the same set of
nodes: U is a rooted tree whose root v is a separator of T; v's children (in U) are the roots
of the recursively defined separator decomposition trees for the connected components of
T resulting from the removal of v. Thus the number of children of v in U is exactly its
outdegree in T plus one.
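A sketch of this construction (our own recursive formulation; it is not linear time, which is what Proposition 2 below requires, but it shows the definition concretely):

```python
def centroid(adj, nodes):
    """Find a separator of the subtree of the tree `adj` induced by
    `nodes`: removing it leaves no component with more than half of
    the nodes (such a node always exists in a tree)."""
    total = len(nodes)
    start = next(iter(nodes))
    parent, order, stack = {start: None}, [], [start]
    while stack:                     # DFS restricted to `nodes`
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u in nodes and u != parent[v]:
                parent[u] = v
                stack.append(u)
    size = {}
    for v in reversed(order):        # subtree sizes, leaves first
        size[v] = 1 + sum(size[u] for u in adj[v]
                          if u in nodes and parent.get(u) == v)
    for v in order:
        parts = [size[u] for u in adj[v]
                 if u in nodes and parent.get(u) == v]
        parts.append(total - size[v])          # the component "above" v
        if max(parts) <= total // 2:
            return v

def decompose(adj, nodes=None):
    """Complete separator decomposition tree as (separator, children)."""
    if nodes is None:
        nodes = set(adj)
    c = centroid(adj, nodes)
    rest, children = nodes - {c}, []
    while rest:                      # flood-fill each component of T - c
        comp, stack = set(), [next(iter(rest))]
        while stack:
            v = stack.pop()
            if v in rest:
                rest.discard(v)
                comp.add(v)
                stack.extend(u for u in adj[v] if u in rest)
        children.append(decompose(adj, comp))
    return (c, children)
```

For a path on seven nodes the root separator is the middle node, and each half decomposes recursively, giving the logarithmic height of Proposition 1.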
Proposition 1 For any tree T with n nodes, the height of the separator decomposition tree
U is O(log n).
Also, by a modification of the recursive method for finding separators, we can show:
Proposition 2 A complete separator decomposition tree can be constructed in linear time.
Any path in T starting from the root can be mapped to a path starting from the root in
U in a natural way: start from the root of U and at each node choose the child that is the
decomposition tree of the connected component containing the end point of the path in T.
Our general algorithm is to find in U the path corresponding to the one in T defined by
P. From the proposition we know that this path is of length at most log n. The algorithm
traverses the path in U by expanding it node by node starting from the root.
Let P be the pattern and let w be the end point of the path it determines in T. If P
is not a substring of S then we define the endpoint w to be the node corresponding to the
longest prefix of P which is a substring of S. Our goal is to find w.
Thus, the general structure of the algorithm is as follows:
1. Let v be the root of U.
2. While w is not known:
• decide in which of the connected components represented by v's children w lies.
• assign v to that child.
In order to implement this scheme throughout the execution of the algorithm the following
objects are maintained:
• v, the current node in U. v is a separator for some connected component C of T.
• u, the topmost node of C (in T).
• k, a pointer to S that corresponds to the suffix starting at v.
• j, the length (in characters) of the path from u to v.
• i, an index to P such that P[1], ..., P[i - 1] determines the path to u.
u is initialized to be the root of T and v is initialized as the root of U. i and k are initialized
to 1 and j is initialized to 0. Note that C does not necessarily contain all the descendants
of u. v is a descendant of u in T (or u itself) and u is a descendant of v in U. Given u and v,
computing j and k is easy, since we have assumed that the suffix tree stores in the nodes the
length of the path from the root and a pointer to the text to an endpoint of this substring.
In order to find w we must find a way to decide quickly which of the connected components
separated by v contains w. This can be done in two stages:
1. Determine if w is a descendant of v. If it is not, then the connected component is the
one "above" v. v is assigned its child corresponding to that connected component, u
remains the same, i is not changed, and k and j are updated from the information on
the new v in T.

2. If w is a descendant of v and w ≠ v, then determine which of v's children in T is an
ancestor of w. v is assigned the corresponding child in U and u is assigned the old v.
i becomes i + j, and k and j are updated from the information provided from the tree
T on the new v.
Step 1 is essentially asking whether S[k - j] ... S[k] = P[i] ... P[i + j]. In order to execute
Step 2, first find the child of v in T that corresponds to the symbol P[i + j + 1]. Then,
determine if the substring corresponding to the edge between v and its child is equal to the
corresponding substring of P starting at P[i + j + 1].
This leads us to the formulation of the substring equality problem:
• Input to the preprocessing phase: text S and pattern P.
• On-line query input: indices k, i and length j. Question: Is the substring S[k], S[k +
1], ... S[k + j] equal to the substring P[i], P[i + 1], ... P[i + j]?
A solution to this problem yields a solution to our problem, since both steps 1 and 2
are based on answering this type of query. The next subsection is devoted to solving the
substring equality problem.
To summarize, the preprocessing of the text S consists of:
• Construct a suffix tree T for S
• Construct a complete decomposition tree U for T
• Other work for quickly answering substring equality queries, as described in the next
subsection.
The preprocessing stage for the pattern consists only of the preprocessing required for the
substring equality problem.
Since the general algorithm consists of O(log n) iterations, the number of calls to the
substring equality problem is also logarithmic. Thus, the time required to answer the on-line
query is proportional to log n times the time required to answer the substring equality query.
Furthermore, if the probability of error in answering a substring equality query is at most
f, then the probability of making an error in the algorithm is at most log n · f.
2.2 Substring equality
In this section we show how to preprocess the string and the pattern so that we can answer
quickly on-line whether two substrings are equal or not. The method is based on the
Karp-Rabin [KR] string matching algorithm.
For 1 <= i <= j <= n let s_{i,j} = sum_{k=i}^{j} 2^{j-k} * S[k], and for 1 <= i <= j <= m let
p_{i,j} = sum_{k=i}^{j} 2^{j-k} * P[k]. By definition s_{1,j} = s_{1,i-1} * 2^{j-i+1} + s_{i,j}.

Consider the hash function h(x) = x mod q for q a prime. We assume that q is sufficiently
small so that computation mod q can be done in constant time. We know that for 1 <= i <=
j <= n

    s_{1,j} = s_{1,i-1} * 2^{j-i+1} + s_{i,j} mod q

and therefore

    h(s_{i,j}) = h(s_{1,j}) - h(s_{1,i-1}) * 2^{j-i+1} mod q.

Suppose that for every 1 <= i <= n, h(s_{1,i}) has been computed and stored, and that for all
1 <= k <= n, 2^k mod q has been computed and stored. From the above we can conclude that
h(s_{i,j}) can be computed in constant time for all 1 <= i <= j <= n.

It is known ([KR]) that if q is chosen as a random prime from the set of primes smaller
than M, then for any x < 2^n, the probability that x = 0 mod q is at most n/pi(M), where pi(M)
denotes the number of primes smaller than M. pi(M) behaves asymptotically as M/log M. Thus,
by choosing M to be n^2 log n we get that if q is chosen as a random prime smaller than M,
then the probability of error is at most 1/(n log n).
Therefore we assume that the shared random string is a randomly chosen prime q. The
preprocessing performed on the text S (in addition to the preprocessing described in the
previous subsection) is:

• For all 1 <= i <= n, compute and store h(s_{1,i}).

• For all 1 <= k <= n, compute and store 2^k mod q.

The preprocessing performed on the pattern P is similar:

• For all 1 <= i <= m, compute and store h(p_{1,i}).

• For all 1 <= k <= m, compute and store 2^k mod q.

Given this preprocessing it can easily be determined for any two substrings of the text and
pattern whether they are equal: to test if P[i], P[i + 1], ... P[i + j] = S[k], S[k + 1], ... S[k + j],
compute h(p_{i,i+j}) and h(s_{k,k+j}) and compare them for equality.
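The whole substring-equality scheme fits in a few lines. The sketch below uses a fixed q for illustration (the algorithm itself draws a random prime q smaller than n^2 log n); indices are 1-based as in the text, and the function names are ours:

```python
def prefix_hashes(s, q):
    """h[i] = value of s[1..i] mod q (1-based, as h(s_{1,i}) in the
    text), plus the table of powers 2^k mod q."""
    n = len(s)
    h = [0] * (n + 1)
    pw = [1] * (n + 1)
    for i, bit in enumerate(s, start=1):
        h[i] = (h[i - 1] * 2 + int(bit)) % q
        pw[i] = (pw[i - 1] * 2) % q
    return h, pw

def substr_hash(h, pw, i, j, q):
    """Hash of s[i..j] (1-based, inclusive):
    h(s_{i,j}) = h(s_{1,j}) - h(s_{1,i-1}) * 2^(j-i+1) mod q."""
    return (h[j] - h[i - 1] * pw[j - i + 1]) % q

def substrings_equal(S, P, k, i, j, q=1000003):
    """Probabilistic test of S[k..k+j] == P[i..i+j]; preprocessing is
    shown inline here, but in the paper's setting each side computes
    its own tables once, in advance."""
    hs, pws = prefix_hashes(S, q)
    hp, pwp = prefix_hashes(P, q)
    return substr_hash(hs, pws, k, k + j, q) == substr_hash(hp, pwp, i, i + j, q)
```

Each query performs two table lookups, one multiplication and one subtraction mod q, i.e. constant time, matching the claim above.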
During the execution of the algorithm, substrings of P and S are compared log n times.
The probability of error in any comparison is at most 1/(n log n). Thus the probability of any
error occurring in the execution of the algorithm is at most 1/n.
Theorem 1 The string matching algorithm resulting from using the method for substring
equality described in this section has linear time and space complexity for the preprocessing
stage and logarithmic complexity for answering queries. The probability of error is at most
1/n and can be made inverse polynomially small.
3 An application to parallel data compression
In this section we outline an application to data compression on a parallel machine; specifi-
cally, it is shown how to implement one of Lempel and Ziv's universal data compression
algorithms [LZ1], [ZL1]. For a general survey on data compression see [St].
The data compression problems we consider are of the form: given a string S, the task
is to find a compressed representation of S. (All of S is known at the beginning of the
execution.) Similarly, given the compressed representation, the task is to uncompress it.
The goal is to find an efficient parallel algorithm for performing these tasks. The number of
processors should be linear and the time polylogarithmic.
Lempel and Ziv [ZL1] suggested an algorithm which is perfect in the information theoretic
sense of [LZ1]. To the best of our knowledge, no parallel implementation of an algorithm
which is perfect in this sense was known.
The Lempel-Ziv method partitions the string into words such that no word is a prefix of
another one. The (sequential) algorithm proceeds in stages, each one generating one such
word. At every stage the longest prefix of the remaining unread part of the input string which
appears as a substring of the input that has already been scanned is found. I.e., if the string
has been read up to index i, then the maximum j such that S[i + 1] ... S[i + j] appears as a
substring of S[1] ... S[i + j - 1] is found. The new word is now S[i + 1] ... S[i + j + 1]. A triple
that consists of a pointer to the word that S[i + 1] ... S[i + j] is a prefix of, the length j + 1
and the next character S[i + j + 1] is created for the compressed representation.
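A sequential, quadratic-time reference implementation of this parsing rule (our own sketch, using 0-based indices and naive substring search; the point of the rest of this section is to compute the same parse in parallel):

```python
def lz_parse(s):
    """Quadratic-time reference parse (0-based): at position i, find
    the largest j such that s[i:i+j] occurs in the already-scanned
    part; emit a triple (earlier position, length j + 1, next
    character).  The final word may have no next character if the
    string ends first."""
    words, i, n = [], 0, len(s)
    while i < n:
        j, src = 0, -1
        while i + j < n:
            # can the word be extended to length j + 1?  The occurrence
            # must lie entirely inside the already-scanned prefix s[:i+j].
            pos = s.find(s[i:i + j + 1], 0, i + j)
            if pos == -1:
                break
            src, j = pos, j + 1
        nxt = s[i + j] if i + j < n else ""
        words.append((src, j + 1, nxt))
        i += j + 1
    return words
```

On "010010" this produces the words 0 | 1 | 00 | 10, the first two as fresh characters and the last two as (pointer, length, next-character) triples.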
Rodeh, Pratt and Even [RPE] suggested using suffix trees for an efficient sequential
implementation of the algorithm. We follow them and use our sequential on-line algorithm
to provide an efficient parallel implementation. Apostolico, Iliopoulos, Landau, Schieber and
Vishkin [Ap] have provided an efficient parallel algorithm for constructing suffix trees: a
CRCW PRAM with n processors can construct a suffix tree in logarithmic time.
The general structure of our algorithm is:
1. For all 1 <= i <= n find the longest substring starting at i that appears prior to i.
2. Using pointer doubling, find the locations for which a triple should be generated.
The two key observations for using suffix trees to implement step 1 efficiently are
• The algorithm of the previous section can be used to solve the maximum prefix pattern
matching problem: given a text S and a pattern P find the longest prefix of P that
appears as a substring of S. The on-line processing time is still log n.
• The same suffix tree can be used for all prefixes of the string: store at each node the
first index of the string where the substring that corresponds to the node appears.
Now it can easily be determined whether the subtree rooted at the node is part of the
suffix tree of a given prefix.
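The second observation can be sketched on the naive trie (our own illustration; a real implementation would annotate the compressed suffix tree during construction). A node at depth d whose substring first starts at index f lies in the suffix tree of the prefix S[1..t] exactly when f + d - 1 <= t, so one annotated tree serves all prefixes:

```python
def annotate_first_occurrence(text):
    """Naive suffix trie where each node stores the earliest 1-based
    start index of its substring.  A node at depth d with earliest
    start f belongs to the suffix tree of the prefix S[1..t] exactly
    when f + d - 1 <= t."""
    root = {"first": 1, "children": {}}
    for i in range(len(text)):           # insert the suffix starting at i
        node = root
        for c in text[i:]:
            child = node["children"].setdefault(c, {"first": i + 1, "children": {}})
            child["first"] = min(child["first"], i + 1)
            node = child
    return root
```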
Thus, given the string S, the preprocessing described in the previous section is performed (all
the algorithms used in the preprocessing are parallelizable with a linear number of processors).
Then, for each 1 <= i <= n a processor is allocated. It searches for the largest j such that
S[i], S[i + 1], ... S[i + j] appears in S[1], ... S[i + j - 1], and denotes it by j_i. From the above,
this can be performed in logarithmic (sequential) time. After all the j_i's have been found,
those indices for which a triple should be generated (the breakpoints) should be computed.
If every i "points" to i + j_i, then by pointer doubling after log n steps the breakpoints are
discovered.
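The breakpoint computation, shown here as the equivalent sequential walk (our sketch; the parallel algorithm marks the same positions by pointer doubling in O(log n) rounds rather than walking the chain):

```python
def breakpoints(j):
    """j[i] is the longest-match length at position i (0-based), so
    the word starting at i has length j[i] + 1 and i "points" to
    i + j[i] + 1.  The walk below visits exactly the breakpoints."""
    marks, i, n = [], 0, len(j)
    while i < n:
        marks.append(i)
        i += j[i] + 1
    return marks
```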
The uncompressing procedure can be done as follows: first, using parallel prefix, for every
position k in the output string the triple to which it belongs is computed. A pointer from
k to the position corresponding to k in a previous triple is added, or a pointer to k itself in
case it is at the end of a word. Now via pointer doubling after log n steps the value of S[k]
can be found.
The string matching algorithm is probabilistic and there is a small chance of error. To
make the data compression algorithm error-free (Las Vegas) we can add a step that checks
the success of the above algorithm by uncompressing and comparing the result with the
original string. (The uncompressing algorithm is deterministic.)
We thus have
Theorem 2 The Lempel-Ziv data compression algorithm [ZL1] can be implemented on a
CRCW PRAM using a linear number of processors and logarithmic expected time.
4 Open problems
We showed how to implement in parallel the first Lempel-Ziv algorithm. Can their second
algorithm [ZL2] be implemented in parallel? We do not see even an inefficient way of doing
it. Interestingly, implementing this algorithm sequentially is simpler than implementing the
first one.
A different question is whether the randomization can be moved to the on-line stage, so
that we do not have to assume that the two sites share a random string. Specifically, can the
substring equality problem be solved by a deterministic preprocessing stage and a randomized
on-line stage?
Acknowledgments
It is a pleasure to thank Amos Fiat for his help in writing the paper, Dafna Sheinvald for
advice concerning data compression, Alex Schaffer for discussions leading to this paper and
Gadi Landau for very helpful remarks concerning the paper.
References
[Ah] A. V. Aho, Algorithms for finding patterns in strings, in The Handbook of Theo-
retical Computer Science, edited by J. van Leeuwen, Elsevier, Amsterdam, 1990.
[AHU] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The design and analysis of com-
puter algorithms, Addison-Wesley, Reading, MA, 1974.
[AG] A. Apostolico and Z. Galil, Combinatorial algorithms on words, Springer-Verlag,
1985.
[Ap] A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber and U. Vishkin, Parallel
construction of a suffix tree with applications, Algorithmica 3, 1988, pp. 347-365.
[CS] Chen and Seiferas, Efficient and elegant subword-tree construction, in Combinatorial
Algorithms on Words, Springer-Verlag, 1985, pp. 97-110.
[Jo] C. Jordan, Sur les assemblages de lignes, J. Reine und Ang. Math. 70, 1869, pp.
185-190.
[KR] R. M. Karp and M. O. Rabin, Efficient randomized pattern-matching algorithms, IBM
Journal of R&D, 31, 1987, 249-260.
[KMP] D. E. Knuth, J. H. Morris and V. R. Pratt, Fast pattern matching in strings, SIAM
Journal on Computing 6, 1977, pp. 323-349.
[LZ1] A. Lempel and J. Ziv, On the complexity of finite sequences, IEEE Transactions on
Information Theory 22, 1976, pp. 75-81.
[Mc] E.M. McCreight, A space economical suffix tree construction algorithm, Journal of
the ACM 23, 1976, 262-272.
[Me] N. Megiddo, Applying parallel computation algorithms in the design of serial algo-
rithms, Journal of the ACM 30, 1983, pp. 852-865.
[RPE] M. Rodeh, V. R. Pratt and S. Even, Linear algorithms for compression via string
matching, Journal of the ACM 28, 1980, pp. 16-24.
[St] J. A. Storer, Data compression: methods and theory, Computer Science Press,
Rockville, 1988.
[We] P. Weiner, Linear pattern matching algorithms, Proc. 14th Annual Symp. on Switching
and Automata Theory, pp. 1-11.
[Yao] A. C. Yao, Some complexity questions related to distributive computing, Proc. 11th
ACM Symposium on Theory of Computing, 1979, pp. 209-213.
[ZL1] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE
Transactions on Information Theory 23, 1977, pp. 337-343.
[ZL2] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding,
IEEE Transactions on Information Theory 24, 1978, pp. 530-536.