Algorithmica (1999) 24: 177–194 © 1999 Springer-Verlag New York Inc.

String Search in Coarse-Grained Parallel Computers^1

P. Ferragina^2 and F. Luccio^2

Abstract. Given a text string T[1, n], the multistring search problem consists of determining which of k pattern strings {X_1[1, m], X_2[1, m], ..., X_k[1, m]}, provided on-line, occur in T. We study this problem in the BSP model [27], and then extend our analysis to other coarse-grained parallel computational models. We refer to the relevant and difficult case of long patterns, that is m ≥ p, where p is the number of available processors, and provide an optimal result with respect to both computation and communication times, attaining a constant number of supersteps. We then study single string search (i.e., k = 1), and show how the multistring search algorithm can be employed to speed up the process and balance the communication cost. The proposed solution takes a constant number of supersteps, and achieves optimal communication time if the string to be searched is longer than p^2. Our results are based on the distribution of a proper data structure among the p processors, to reduce and balance the communication cost. We also indicate how short patterns can be efficiently dealt with, through a completely different algorithmic approach.

Key Words. String matching, Distributed data structures, BSP model, Parallel computing, Computational complexity.

1. Introduction. The availability of very large text databases makes the design of efficient methods for string processing more and more critical [2]. Important applications on digital libraries [15], biological and textual databases [26], make use of very large text collections requiring specialized nontrivial search operations. Parallelism offers a strong hope for meeting such a challenge. A large amount of research has been recently directed to designing data structures and algorithms for string processing that exploit the parallelism inherent in the external storage devices (multidisks) [4], [8], [12], [13]. A natural and related question is how computing with p processors can speed up the search operations. Several techniques for fast PRAM string search have been proposed (e.g., see [1], [3], [16], [19], and [25]). As is well known, however, in this model of computation the number of processors is treated as an unbounded resource, and the cost of communication is not taken into account. In this paper we study the worst-case complexity of some string searching problems in a coarse-grained parallel environment, by developing simple distributed data structures that reduce and balance the communication cost.

The multistring search problem considered here can be regarded as the problem of performing a number of string search processes on a given set of text strings. For simplicity, we can restrict our attention to just one text string that is obtained by concatenating all available texts, separated by a special character that does not occur elsewhere. In formal terms, a multistring search on a text T[1, n] consists of determining, for a set of k pattern strings {X_1[1, m], X_2[1, m], ..., X_k[1, m]} provided on-line, which X_j's

^1 This research was supported in part by NATO Project No. CRG971467. Part of this work has been presented as a preliminary conference version in [13].
^2 Dipartimento di Informatica, Università di Pisa, Pisa, Italy. {ferragin,luccio}@di.unipi.it.

Received June 1, 1997; revised March 10, 1998. Communicated by F. Dehne.


occur in T. Since the text T is given in advance, the goal is to preprocess it properly, in such a way that the k queried patterns can be searched in T with a complexity that depends on their total length km, and is as independent as possible of the text length n. This problem naturally generalizes classical on-line string matching [24] to a set of pattern strings. Although for simplicity we refer to a boolean query that establishes "if a pattern occurs" and not "where it occurs," our approach can be immediately extended to the more general "reporting" case. Furthermore, our assumption that all the pattern strings have the same length m can be easily relaxed.

To study the communication and computation performance of our solutions we first adopt the Bulk Synchronous Parallel (BSP) model [27], and then consider some of its recent extensions. A BSP(p, g, L) computer consists of p processors, each provided with a local memory (of size O(n/p)) and communicating through a network of bandwidth g and latency L. The computation of this machine is organized as a sequence of supersteps. In a superstep, the processors operate independently, performing local computation and generating a number of point-to-point messages. At the end of the superstep, the messages are delivered to their destinations and a global barrier synchronization is enforced. If each processor performs at most w local operations and sends/receives at most h messages (this is denoted as an h-relation), the superstep requires w + gh + L time, where max{w, L} is the computation time and max{gh, L} is the communication time. Some optimality criteria proposed in [17] characterize the possible speed-up on real machines. In this paper we say that a BSP algorithm is c-optimal if its speed-up is close to p/c for both the communication and computation time.
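In concrete terms, the cost accounting just recalled can be written down directly. The small helper below (function name ours, not from the paper) evaluates the two max terms for a single superstep of a BSP(p, g, L) machine:

```python
# Illustrative helper (name ours): per-superstep cost under the BSP accounting
# recalled above. A superstep with at most w local operations per processor and
# an h-relation has computation time max{w, L} and communication time max{gh, L}.

def superstep_times(w, h, g, L):
    computation = max(w, L)
    communication = max(g * h, L)
    return computation, communication

# Example (made-up parameters): w = 1000 local operations, a 10-relation,
# bandwidth g = 4, latency L = 50.
print(superstep_times(1000, 10, 4, 50))  # (1000, 50): latency dominates communication
```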

Some variations of the BSP model have been subsequently introduced (e.g., see [5], [10], [9], and [20]) to encourage the use of spatial locality, and to measure I/O cost in addition to communication cost. These concepts can be combined in several ways, to depict different families of physical architectures. We consider here the models BSP∗ of [5] and EM-BSP∗ of [9], as paradigms respectively based on spatial locality and I/O complexity. In BSP∗, blockwise communication is accounted for by a new parameter b which measures the packet size that yields the best throughput. In EM-BSP∗ each processor has a memory of size M, and D disk drives that can be accessed concurrently. If B denotes the disk block size, then a processor can transfer D × B data from disks to local memory in a single I/O operation, at a cost G. It is assumed that M ≥ DB. As a limit EM-BSP∗ can contain one processor, thereby resembling the one-processor version of the Parallel Disk Model introduced in [28]. In this case parallelism occurs at disk (I/O) level.

In this paper we study the multistring search problem in the BSP(p, g, L) model, assuming that the patterns are long, that is m ≥ p, and provide an optimal result with respect to both the computation and communication times (the number of supersteps is constant and thus also optimal). Note that the condition m ≥ p is fairly natural in practice. A typical case is the indexing of WEB servers, which usually consist of few powerful commodity workstations connected by a high-speed local network. We then consider the problem of single string search (i.e., k = 1), and show how to exploit the previous algorithms for multisearch in order to speed up the process and balance the communication cost. In this way we achieve optimal communication time if the string to be searched is sufficiently long (i.e., longer than p^2), using a constant (optimal) number of supersteps. The study conducted on BSP is then extended to BSP∗, and to EM-BSP∗ with one processor, through simulation between models. This also leads to optimal algorithms if proper conditions are met. In the concluding section we also briefly indicate how short patterns (i.e., m < p) can be efficiently dealt with, through a different algorithmic approach based on a known transformation of the multistring search problem into multisearch on integers.

Our results are based on a simple method to distribute a string searching data structure among the p processors, in order to reduce and balance the communication cost. Another feature of our approach is that the distribution of the data structure does not depend on the number of searched items (k, the number of strings), as occurs in previous works [5], [6].

2. Preliminaries. For a string T = T[1, n], T[1, i] is a prefix, T[j, n] is a suffix, and T[i, j] is a substring of T (for 1 ≤ i ≤ j ≤ n). Given a pattern string X[1, m], we say that there is an occurrence of X at position i in T if X = T[i, i + m − 1]; or, equivalently, if X is a prefix of the i-th suffix T[i, n].

Given the text string T[1, n], we denote by SUF(T) = {T_1, T_2, ..., T_n} the lexicographically ordered set of all text suffixes, where T_i = T[j_i, n] for 1 ≤ j_i ≤ n, T_i ≤_L T_{i+1} for all i = 1, 2, ..., n − 1, and ≤_L is the lexicographic order. Searching for the occurrences of a pattern X[1, m] in the text T amounts to retrieving all the suffixes of T that have the pattern X as a prefix. The ordered set SUF(T) exploits an interesting property found by Manber and Myers [23], namely, the suffixes having prefix X occupy a contiguous part of SUF(T). In addition, the leftmost string of this contiguous part can be identified by using the fact that it is adjacent to the lexicographic position of X in the ordered set SUF(T). Consequently, searching for X in T mainly consists of retrieving its lexicographic position in SUF(T). In fact, since we are only interested in establishing if X occurs in T, we can check this by establishing if X is a prefix of the string that follows its lexicographic position in SUF(T).

As an example, we set T = abababaabbabc and consider the lexicographically ordered set of all text suffixes (indicated by means of their starting positions) SUF(T) = {7, 5, 3, 1, 8, 11, 6, 4, 2, 10, 9, 12, 13}. If we have X = aba, then its lexicographic position in SUF(T) is between T[7, 13] = aabbabc and T[5, 13] = abaabbabc. Notice that X is a prefix of T[5, 13] and in fact X occurs in T; moreover, we can conclude that X occurs three times in T since X is a prefix of two other text suffixes adjacent in SUF(T) to T[5, 13], namely, T[3, 13] and T[1, 13]. If we instead have X = abbc, then its position is between T[8, 13] = abbabc and T[11, 13] = abc, so that X is not a prefix of T[11, 13] and in fact X does not occur in T.
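The example above can be checked mechanically. The short sketch below (our code, not the paper's) sorts the suffixes of T and then uses the Manber–Myers contiguity property: the boolean query reduces to a binary search for the lexicographic position of X, followed by one prefix test against the suffix that follows that position.

```python
from bisect import bisect_left

T = "abababaabbabc"
# Lexicographically ordered suffixes, given by their 1-indexed starting positions.
suf = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
print(suf)  # [7, 5, 3, 1, 8, 11, 6, 4, 2, 10, 9, 12, 13]

def occurs(x):
    # x occurs in T iff x is a prefix of the suffix that follows
    # x's lexicographic position in the sorted set of suffixes.
    strings = [T[i - 1:] for i in suf]
    pos = bisect_left(strings, x)
    return pos < len(strings) and strings[pos].startswith(x)

print(occurs("aba"))   # True
print(occurs("abbc"))  # False
```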

To perform fast string searches it is then paramount to use a data structure that efficiently retrieves the lexicographic position of a string in the ordered set SUF(T). An example of such a structure is the Blind Trie [12], which is actually a generalization of the Patricia Tree [22] since strings are allowed to be built over an arbitrary alphabet. Letting S be an ordered set of strings, the blind trie BT_S built on S can be defined in three steps: (1) build a compacted trie^3 on the strings of S; (2) label each node by the length of the substring stored in it, and replace each substring labeling an arc by its first character only, called the branching character; (3) replace all the strings in the leaves with pointers to them. An illustrative example is given in Figure 1.

^3 A compacted trie storing a set of distinct strings in its leaves is defined as follows: Each arc is labeled by a substring and its sibling arcs are ordered according to their first characters, which are distinct. There are no nodes having only one child except (possibly) the root, and each node stores the string obtained by concatenating the substrings that label the arcs in the path that connects the root to the node.

Fig. 1. An example of a blind trie built on the suffixes of the text string T = abababaabbabc. The numbers inside the squares denote the starting positions of the suffixes pointed to by those leaves. The labeling substrings of the engrossed arcs have been substituted with their first characters.

The resulting data structure loses some information with respect to the compacted trie because all the characters in each arc label are deleted, except for the branching character. Nonetheless lexicographic searches can be performed in S, because the blind trie maintains the power of the compacted trie but none of its drawbacks. Indeed, the search for the lexicographic position of a string X in S, called a blind search operation [12], is implemented by means of the following two phases:

• Trace a downward path in BT_S to locate a leaf l (which does not necessarily identify X's position in S). Indeed, start from the root of the blind trie, and only compare the characters of X with the branching characters found in the traversed arcs, until a leaf l is reached or no further branching is possible (in this case, choose l to be a leaf descending from the last traversed node).
• Compare the string pointed to by l with X in order to determine their common prefix.

A useful property holds (see [12]), namely, the leaf l stores one of the strings in S that share the longest common prefix with X. The length of this common prefix, say lcp, is used in two ways: first to determine the shallowest ancestor of l whose label is an integer equal to or greater than lcp; and then to find the position of X in S by using its mismatching character, X[lcp + 1], and by choosing a proper leaf descending from that ancestor node.
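The structure and the two phases above can be sketched in a few lines. The code below (class and function names ours; a boolean-query-only sketch, assuming the strings are distinct, as the suffixes in SUF(T) are) relies on the property just recalled: since the reached leaf shares the longest common prefix with X among all strings in S, X is a prefix of some string in S exactly when that common prefix has length |X|.

```python
class BlindNode:
    def __init__(self, length):
        self.length = length   # step (2): length of the substring stored at the node
        self.children = {}     # branching character -> child node
        self.leaf = None       # step (3): index into the sorted string set (leaves only)

def lcp_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def build(strings, lo, hi):
    """Blind trie over the sorted, distinct strings[lo:hi]."""
    if hi - lo == 1:
        node = BlindNode(len(strings[lo]))
        node.leaf = lo
        return node
    l = lcp_len(strings[lo], strings[hi - 1])   # LCP of the whole (sorted) group
    node = BlindNode(l)
    i = lo
    while i < hi:                               # partition by branching character
        c = strings[i][l:l + 1]                 # "" if the string ends here
        j = i
        while j < hi and strings[j][l:l + 1] == c:
            j += 1
        node.children[c] = build(strings, i, j)
        i = j
    return node

def blind_search_occurs(root, strings, x):
    # Phase 1: compare x only against branching characters.
    node = root
    while node.leaf is None and node.length < len(x):
        c = x[node.length]
        if c not in node.children:
            break                               # no further branching is possible
        node = node.children[c]
    while node.leaf is None:                    # choose a descending leaf
        node = next(iter(node.children.values()))
    # Phase 2: one full comparison with the string pointed to by the leaf.
    return lcp_len(x, strings[node.leaf]) == len(x)

T = "abababaabbabc"
S = sorted(T[i:] for i in range(len(T)))        # the ordered suffixes of T
root = build(S, 0, len(S))
print(blind_search_occurs(root, S, "aba"))      # True
print(blind_search_occurs(root, S, "abbc"))     # False
```

Note that, as in the paper, only one full string comparison is performed per query; all other decisions look only at branching characters and node lengths.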

The properties of the blind trie are fully described in [12]. We only recall here the main features that can be inferred from the discussion above, and that will be useful in the design of our distributed data structures. BT_S requires linear space in the number of strings in S; therefore the space usage is independent of their total length. Furthermore, BT_S allows us to find the lexicographic position of X in S by comparing X with just one of the strings in S. Finally, in the blind search operation, except when comparing X with the string pointed to by l, we deploy only the information available in BT_S.

From the discussion above it immediately follows that we can use BT_{SUF(T)} to search for the lexicographic position of an arbitrary string X in SUF(T). Therefore, BT_{SUF(T)} is the main tool to check if X occurs in T.

3. Searching Long Strings in BSP. In this section we study the multistring search problem in the BSP(p, g, L) model, in the case that the strings X_1[1, m], X_2[1, m], ..., X_k[1, m] to be searched in the text T have length m ≥ p. We also assume that k ≥ p, so that each processor is in charge of performing at least one search process. Two main problems arise when implementing a multisearch operation on long strings: (1) the text characters have to be properly distributed, to reduce the congestion that inevitably arises when comparing the searched strings with some text suffixes; (2) the blind trie BT_{SUF(T)} built on the whole set SUF(T) cannot be kept in the local memory of each processor, because this would require O(n) local space. We must then distribute BT_{SUF(T)} among the p processors, in such a way that the amount of information exchanged during the searching process is as small as possible.

THE DATA STRUCTURE. We distribute the characters of T[1, n] in a circular fashion among the p processors. Namely, we map T[1] to the first processor, T[2] to the second processor, ..., T[p] to the p-th processor and, again, T[p + 1] to the first processor, T[p + 2] to the second processor, and so on. In addition, we select from SUF(T) a subset of Θ(n/p) text suffixes S = {T_1, T_{p+1}, T_{2p+1}, T_{3p+1}, ...} and we build a blind trie BT_S on this set (see [7] for a similar idea). Recalling that BT_S contains one character per arc, one integer per internal node, and one pointer per leaf, we have that BT_S requires O(n/p) space. Therefore, a copy of it can be stored in the local memory of each processor. We note that the suffixes in S implicitly induce a partition of SUF(T) into O(n/p) contiguous subsequences of text suffixes, namely, S^j = {T_{jp+1}, T_{jp+2}, ..., T_{(j+1)p}}, for j = 0, 1, ..., ⌊n/p⌋. We build a blind trie BT^j on each sequence S^j, remarking that BT^j occupies O(p) space because it is built on p text suffixes. We distribute each blind trie BT^j among the p processors by storing a constant number of its nodes/arcs per processor. In conclusion, each processor will have a "constant-size piece" of each blind trie BT^j, a "full" copy of the blind trie BT_S, and n/p characters of the text (at distance p one from the other).
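The distribution just described can be simulated sequentially on the paper's running example (a sketch of what ends up in each local memory, not actual BSP code; variable names are ours):

```python
T = "abababaabbabc"
p = 3

# Circular distribution of the text: processor i holds T[i+1], T[i+1+p], ...
# (positions are 1-indexed, as in the paper).
chars = {i: [(j + 1, T[j]) for j in range(i, len(T), p)] for i in range(p)}

# Lexicographically ordered suffixes, by 1-indexed starting position.
suf = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

# Sampled set S = {T_1, T_{p+1}, T_{2p+1}, ...}: every p-th suffix of SUF(T).
sample = suf[::p]
print(sample)                 # [7, 1, 6, 10, 13]

# The samples induce a partition of SUF(T) into blocks S^j of p suffixes each.
blocks = [suf[j * p:(j + 1) * p] for j in range((len(suf) + p - 1) // p)]
print(blocks[0], blocks[3])   # [7, 5, 3] [10, 9, 12]
```

The printed values match the sets S, S^0, and S^3 reported for Figure 2 below.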

An illustrative example is given in Figure 2. There the BSP model consists of p = 3 processors, so that the starting positions of the sampled suffixes are S = {7, 1, 6, 10, 13}. The notation BT^j_i is used to denote the piece of the blind trie BT^j stored in the local memory M_i of processor P_i. Notice that each M_i stores a copy of BT_S, some characters of T (the text positions are specified below them), and the pieces of the blind tries BT^0_i and BT^3_i (the other pieces BT^1_i, BT^2_i are not shown). It is S^0 = {7, 5, 3} and S^3 = {10, 9, 12}.


Fig. 2. Refer to Figure 1 for the text T and its whole blind trie BT_{SUF(T)}. The figure above shows the content of each local memory M_i after the building stage.

BUILDING STAGE. An important task is the construction of the ordered set SUF(T), which in turn is used to build the blind tries constituting the data structure above. This task can be executed by easily adapting to the BSP model one of the algorithms presented in [4] for the two-level memory model (see [23] for an efficient RAM algorithm). The algorithm extends the naming technique of Karp et al. [21] to sort the suffixes of the string T efficiently, by imposing that the names (integers) assigned to T's substrings reflect their lexicographic order [23]. Namely, for any two substrings of T, say s_1 and s_2, if s_1 is lexicographically smaller than s_2, then the name assigned to s_1 is smaller than the name assigned to s_2. This way, the lexicographic comparison between any two substrings can be done by just looking at their names. Consequently, the building stage determines SUF(T) in O(log n) phases, each one in charge of sorting substrings of T of increasing power-of-two length (see also [4] and [23]). The q-th phase sorts the substrings of T of length 2^q by exploiting the fact that each of those substrings can be seen as the concatenation of a prefix and a suffix of length 2^{q−1}. Indeed, since these two substrings have been inductively named (according to the property above), all the substrings of length 2^q are sorted by comparing in constant time the names of their prefixes and suffixes of length 2^{q−1}. As a consequence, the q-th phase is reduced to sorting a set of O(n) pairs of integers. After that, the invariant on the naming process is preserved, by naming each length-2^q substring with its rank in the resulting sorted sequence.


In our setting, we can use the algorithm in [18] to perform the sorting step efficiently at each phase on the O(n) pairs of integers. Since we have O(log n) phases, the whole construction process takes O((n log^2 n)/p) computation time and O((gn log^2 n)/(p log(n/p))) communication time.
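A sequential sketch of the naming (prefix-doubling) technique just described, in our own code rather than the BSP formulation: phase q sorts the length-2^q substrings by the pair of names of their two halves, then renames each substring by its rank in the sorted sequence.

```python
def suffix_sort(t):
    """Order the suffixes of t by prefix doubling: O(log n) naming phases."""
    n = len(t)
    names = [ord(c) for c in t]          # phase 0: names of length-1 substrings
    k = 1                                # current half-length (2^{q-1})
    sa = list(range(n))
    while True:
        # Name of the length-2k substring at i = (name of its prefix half,
        # name of its suffix half); -1 stands for "runs past the end".
        def pair(i):
            return (names[i], names[i + k] if i + k < n else -1)
        sa.sort(key=pair)
        new_names = [0] * n              # rename by rank, keeping ties equal
        for r in range(1, n):
            new_names[sa[r]] = new_names[sa[r - 1]] + (pair(sa[r]) != pair(sa[r - 1]))
        names = new_names
        if names[sa[-1]] == n - 1:       # all names distinct: order is final
            break
        k *= 2
    return sa                            # 0-indexed starting positions

T = "abababaabbabc"
print([i + 1 for i in suffix_sort(T)])   # [7, 5, 3, 1, 8, 11, 6, 4, 2, 10, 9, 12, 13]
```

The output reproduces the set SUF(T) of the running example in Section 2.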

We now consider the k patterns, X_1, X_2, ..., X_k, to be searched in T. They are distributed evenly among the p processors: X_1, ..., X_{k/p} on the first processor, X_{k/p+1}, ..., X_{2k/p} on the second processor, and so on. Our searching procedure consists of two main phases. In the first phase, each processor works independently of the others and searches for the lexicographic position of its k/p patterns in the (sampled string) set S by means of its own copy of BT_S. In the second phase, each processor continues the search of each one of its patterns, say X_j, in the corresponding blind trie BT^{pos(j)}, where pos(j) is the lexicographic position of X_j in S. Two key observations are in order: (i) in the first phase, we need a communication step in which we collect the characters of the text suffixes that must be lexicographically compared with the X_j's (see the first phase of the blind search in Section 2); (ii) at the beginning of the second phase, we need a communication step to collect the pieces of BT^{pos(j)} in order to search locally for X_j, since these pieces are evenly distributed among the p processors. Due to the way the text T and the blind tries BT^j have been distributed among the processors, it will turn out that the communication is well balanced, and its cost is optimal for a wide range of parameter values.

Before proceeding with the technical details of the searching procedure, we present the main ideas underlying our approach with the running example depicted in Figures 3–5. The strings X_1 = aba, X_2 = aac, and X_3 = bba are the patterns to be searched in the text T = abababaabbabc (here k = p = m = 3). The content of the local memories of the three processors before the multistring search process starts is illustrated in Figure 2. The first phase of the searching process aims at determining the lexicographic position of each X_i in S, and this is done in two stages. In the former stage, each processor P_i determines the starting position s_i of a suffix in S sharing the longest common prefix with X_i, by using BT_S. Figure 3 shows the downward path (with engrossed arcs) followed by each P_i in its local copy of BT_S, and the selected suffixes s_1 = 1, s_2 = 7, and s_3 = 6.

In the latter stage of the first phase, each processor P_i broadcasts s_i to all the other processors and receives back the characters of T[s_i, s_i + 2] (in Figure 3, the arrows and their labels describe the actual characters sent by the processors). These characters are stored in the local memories and are exploited by each processor P_i, in combination with the copy of BT_S, to determine locally the lexicographic position of X_i in S. Figure 4 shows the result of this second stage by indicating, in each local memory M_i, the substring T[s_i, s_i + 2] collected by P_i and the lexicographic position of X_i in S (with an upward arrow). Namely, pos(1) = 0, pos(2) = 0, and pos(3) = 3 denote the subsets S^{pos(i)} where the search of X_i must continue in the second phase of the multistring search process.

At the beginning of the second phase, each processor P_i broadcasts the value pos(i) to all the other processors and asks for the pieces BT^{pos(i)}_j of the blind trie BT^{pos(i)} stored in the local memories M_j, with j ≠ i (in Figure 4, the arrows and their labels describe the actual pieces of the blind tries sent by the processors). After P_i has collected in its local memory M_i the pieces of BT^{pos(i)}, it reconstructs the blind trie (see Figure 5) and searches for the lexicographic position of X_i in S^{pos(i)} as done in the first phase (the determined position is indicated in Figure 5 by an arrow). Notice that the position of X_i in S^{pos(i)} is indeed its lexicographic position in the whole set SUF(T). Consequently, each processor P_i compares X_i with its adjacent string in SUF(T) (this requires another communication step) and thus establishes if X_i occurs in T (see Section 2). In the running example of Figure 5, since X_1 and X_3 are prefixes of T[5, 13] and T[9, 13], respectively, they occur in T; while X_2 is not a prefix of T[5, 13], so that X_2 does not occur in T.

Fig. 3. The first phase of the multistring search algorithm: first stage.

The searching procedure executed by a generic processor is described formally below. For simplicity, we present only the behavior of the first processor:

• Superstep 1. The processor performs the first phase of the blind search procedure on its own copy of BT_S (see Section 2), thus retrieving the leaves l_1, l_2, ..., l_{k/p} that correspond to strings in S sharing the longest common prefix with X_1, X_2, ..., X_{k/p}, respectively. This search does not require communication but only O(km/p) computation time. Then the processor broadcasts the tuple (l_1, l_2, ..., l_{k/p}) to all the other processors in order to inform them which text suffixes have to be compared with X_1, X_2, ..., X_{k/p}. This needs a (k/p)-relation.


Fig. 4. The first phase of the multistring search algorithm: second stage.

• Superstep 2. The processor collects the requests l_1, ..., l_k sent to it by all the other processors during the previous Superstep 1. For each requested text suffix, the processor retrieves its first m/p characters stored in the local memory and sends them back to the requesting processor. This way, the pieces of a text suffix retrieved by all the processors form the prefix of length m of this text suffix. This step needs O(km/p) computation time and a (km/p)-relation.
• Superstep 3. The processor receives the prefixes of length m of the text suffixes associated with l_1, l_2, ..., l_{k/p} (this is a (km/p)-relation), compares them with the strings X_1, X_2, ..., X_{k/p}, and finds their lexicographic positions pos(1), pos(2), ..., pos(k/p) in S. This implements the second phase of the blind search operation (see Section 2) and takes O(km/p) time. Then the processor asks for the pieces of the blind tries BT^{pos(1)}, BT^{pos(2)}, ..., BT^{pos(k/p)} stored in the local memory of all the other processors, thus issuing k/p requests. This needs a (k/p)-relation.
• Superstep 4. The processor collects the k requests sent by all the other processors during Superstep 3 (i.e., k/p requests are sent from each processor to all the other ones) and sends out the blind trie pieces requested and contained in its local memory. This step needs a k-relation because each blind-trie piece has constant size.


Fig. 5. The second phase of the multistring search process.

• Supersteps 5–7. The processor rebuilds into its local memory the blind tries BT^{pos(1)}, BT^{pos(2)}, ..., BT^{pos(k/p)} and searches for the strings X_1, X_2, ..., X_{k/p} in them, similarly as done in Supersteps 1–3 above. This takes O(km/p) computation time and a (km/p)-relation. The new lexicographic positions pos() are the final ones in the whole set SUF(T).
• Supersteps 8–9. By means of a further lexicographic comparison between each pattern X_j and the text suffix T_{pos(j)}, we can establish if X_j occurs in T, for j = 1, 2, ..., k/p. This requires a (km/p)-relation (similarly as done before).
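The two search phases above can be simulated sequentially on the running example; in the sketch below (our code), plain sorted-list binary searches stand in for the blind-trie machinery, and everything runs on one machine rather than across supersteps.

```python
from bisect import bisect_left

T = "abababaabbabc"
p = 3
suf = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])  # SUF(T), 1-indexed starts
sample_strs = [T[i - 1:] for i in suf[::p]]                  # the sampled set S

def pos_in_sample(x):
    # Phase 1: the lexicographic position of x in S selects the block S^pos.
    return max(bisect_left(sample_strs, x) - 1, 0)

def occurs(x):
    j = pos_in_sample(x)
    # Phase 2: continue the search inside block S^j (plus the next sample,
    # which bounds the block on the right).
    block = [T[i - 1:] for i in suf[j * p:(j + 1) * p + 1]]
    k = bisect_left(block, x)
    return k < len(block) and block[k].startswith(x)

print(pos_in_sample("aba"), pos_in_sample("aac"), pos_in_sample("bba"))  # 0 0 3
print(occurs("aba"), occurs("aac"), occurs("bba"))                       # True False True
```

The printed positions and answers match the running example: pos(1) = 0, pos(2) = 0, pos(3) = 3, with X_1 and X_3 occurring in T and X_2 not.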

The procedure globally takes O((km/p) + k) computation time and O(kg + (km/p)g) communication time. If k = p (i.e., we search one pattern per processor), the algorithm needs O(m) computation time, O(mg) communication time, and a constant number of supersteps (recall that m ≥ p = k). Consequently, the algorithm is c-optimal for a proper small constant c, because O(km) = O(pm) is the (optimal) time needed in the sequential case. In the case that k > p (i.e., we search more than one pattern per processor), the algorithm needs O(km/p) computation time, O(kmg/p) communication time, and a constant number of supersteps. The algorithm is still c-optimal because O(km) is the (optimal) time needed in the sequential case. Note that we have assumed we work with a constant-sized alphabet. We can therefore state the following result:


THEOREM 3.1. Given a text string T[1, n], whose characters are drawn from a constant-sized alphabet, we can search in T for k ≥ p strings, each of length m ≥ p, in O(mk/p) computation time, O(mkg/p) communication time, and a constant number of supersteps.

In the case of an unbounded alphabet, it is easy to show that Superstep 1 takes O((km/p) log(n/p)) computation time, and Supersteps 5–7 take O((km/p) log p) computation time. Globally, the procedure takes O((km/p) log n + k) computation time and O(kg + (km/p)g) communication time. In this case, only the communication time is c-optimal.

THEOREM 3.2. Given a text string T[1, n], whose characters are drawn from an unbounded alphabet, we can search in T for k ≥ p strings, each of length m ≥ p, in O((mk/p) log n) computation time, O(mkg/p) communication time, and a constant number of supersteps.

4. Searching One Long String in BSP. We now study the problem of searching for one pattern string X in the text T[1, n]. Various CRCW PRAM parallel solutions are known (e.g., see [3], [16], and [19]), but none that accounts for communication cost. We devise here a simple approach that works with few processors and balances such a cost.

THE DATA STRUCTURE. We denote by Sp the lexicographically ordered set of all text substrings of length p (|Sp| ≤ n). Notice that Sp can be built from SUF(T) by truncating each text suffix to its length-p prefix and by removing duplicates. We associate with each length-p substring a name (integer) given by its rank in Sp. Consequently, if two text substrings of length p are equal, they get the same name; conversely, if they are different, then they get names that reflect their lexicographic order (i.e., the smaller the name, the lexicographically smaller the corresponding substring). For the sake of presentation, we append to the end of the text T p special symbols $ which do not occur elsewhere, so that T now has length n + p. From this new text T, we derive p compressed texts T1, T2, . . . , Tp of length n/p each, which are constructed as follows. We first define Tj to be the longest substring of T that starts at position j and has length a multiple of p, namely, Tj = T[j, j + hp − 1] where h = ⌈n/p⌉. Then we partition Tj into h contiguous substrings of length p each (i.e., Tj = T[j, j + p − 1] T[j + p, j + 2p − 1] · · · T[j + (h − 1)p, j + hp − 1]), and finally we replace each of them by its name (assigned before). Consequently, we derive the new compressed text Tj = name(T[j, j + p − 1]) name(T[j + p, j + 2p − 1]) · · · name(T[j + (h − 1)p, j + hp − 1]), for each j = 1, 2, . . . , p. Notice that each compressed text Tj now has length O(n/p) and is defined over an alphabet of size |Sp| ≤ n.⁴ We store the text Tj in the local memory of the jth processor and we build a suffix tree [24], [11] on it, thus requiring

4 We remark that in this paper we are dealing with a comparison-based BSP model in which pointers, integers, and characters are atomic objects, thus occupying O(1) space. If we consider, instead, a bit-model, then each name occupies O(min{log n, p}) space, and this would make the total occupied space O(min{n log n, np}). In the case of a constant-sized alphabet, the algorithm would not be space optimal by an O(log n) factor.


O(n/p) local space. Finally, we build on Sp the data structure described in Section 3 in order to support efficient multistring search operations on many long strings (i.e., strings longer than p).
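The naming and compression scheme just described can be sketched sequentially as follows. This is only an illustrative helper (the function name and the choice of making $ sort after every ordinary character are our assumptions): in the actual algorithm Sp is obtained from SUF(T) and each Tj resides on a separate processor.

```python
def build_compressed_texts(T, p):
    """Return (name, texts): name ranks every length-p substring of T$...$
    lexicographically; texts[j-1] is the compressed text Tj, i.e., the
    sequence of names of the length-p blocks starting at position j."""
    n = len(T)
    Tpad = T + "$" * p                   # append p sentinel symbols
    # all length-p substrings of the padded text (there are n+1 of them);
    # '$' is remapped so that it sorts after ordinary characters
    subs = {Tpad[i:i + p] for i in range(n + 1)}
    order = sorted(subs, key=lambda s: s.replace("$", "\x7f"))
    name = {s: r for r, s in enumerate(order)}   # names start from 0
    h = -(-n // p)                       # h = ceil(n/p)
    texts = []
    for j in range(p):                   # Tj starts at text position j+1
        blocks = [Tpad[j + q * p : j + q * p + p] for q in range(h)]
        texts.append([name[b] for b in blocks])
    return name, texts
```

Run on the illustrative text T = abababaabbabbabbbc with p = 3 (used in the example below), it yields name(aba) = 1, name(abb) = 2, and T1 = 1 4 0 4 4 7.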

BUILDING STAGE. We refer the reader to our considerations in Section 3. In the present case, however, the sorted set Sp can be built by repeating a sorting step only O(log p) times, since we want to sort substrings of length p [23]. Hence, the construction phase globally takes O((n log n log p)/p) computation time and O((gn log n log p)/(p log(n/p))) communication time. The construction of the suffix tree for each text Tj can be done independently on each processor, once the compressed text Tj has been stored in its local memory. This requires O((n/p) log n) computation time and no communication cost (the alphabet is unbounded⁵).

We now assume that X[1, x] is the long string to be searched in T[1, n], where without loss of generality x is assumed to be a multiple of p. Our idea is to transform the single string search problem on X[1, x] into a multistring search problem on a set of x/p (long) strings, each of length p, in order to use our previous multisearch algorithm for long strings (with k = x/p and m = p). For this purpose, two properties of the compressed texts Tj are important and easy to prove.

FACT 4.1. If X[1, x] occurs in T, then each length-p substring of X belongs to Sp.

Let X′[1, x/p] be the compressed version of X obtained by replacing some of its length-p substrings with their names: we set X′ = name(X[1, p]) name(X[p + 1, 2p]) · · · name(X[x − p + 1, x]). We can easily prove:

FACT 4.2. If the pattern X occurs in T at position i, then X′ is well defined and it occurs in the compressed text T_{i mod p} at position i div p (where div denotes the integral quotient of the division). Specifically, we have that X′ = T_{i mod p}[i div p, i div p + x/p − 1].

The idea underlying our searching algorithm is the following. First the compressed version of the pattern X (i.e., the string X′) is built by searching for its length-p substrings X[1, p], X[p + 1, 2p], . . . , X[x − p + 1, x] in the ordered set Sp. This can be done by means of the multistring search algorithm of Section 3 because those pattern substrings are long. Then each substring is replaced by its name, and the compressed string X′ is simultaneously and locally searched in all the compressed texts Tj, for j = 1, 2, . . . , p, by means of their suffix trees. If X′ occurs in some text Tj, then X occurs in T (by Fact 4.2).

Before proceeding with the detailed description of the algorithm, we consider the following illustrative example. Let T = abababaabbabbabbbc be the text to be searched and assume that p = 3. We sort all of T's substrings of length 3 and label them with

5 We do not need to use the optimal result of Farach [11], because we can employ the suboptimal algorithm of McCreight [24], since the other steps of the algorithm take at least Ω((n log n)/p) computation time.


names that reflect their lexicographic order (starting from 0). Then we create the three compressed texts T1 = 140447, T2 = 412228, and T3 = 135569, which are in turn stored in the local memories of the three processors (one text per processor), and build a suffix tree on each of them. Now a given pattern X = abaabbabb is divided into p = 3 pieces of length x/p = 3, each mapped to a distinct processor. Indeed aba is mapped to P1, abb to P2, and abb to P3. Then each processor searches its pattern substring in S3, using the multistring search algorithm described in Section 3, and thus determines its name: name(aba) = 1, name(abb) = 2. Finally, the compressed string X′ = 122 is reconstructed by each processor via a global broadcast, and it is searched in the local suffix trees. Processor P2 finds an occurrence of X′ in T2 and thus declares that X occurs in T (by Fact 4.2). Notice that if the string to be searched were Y = abaabbbbb, then the processors would compress it to Y′ = 126 (since name(bbb) = 6), and they would discover that Y′ does not occur in any compressed text Ti; in fact, Y does not occur in T.

We now give the details of this algorithm. For simplicity, we assume that the alphabet has constant size, even if our considerations easily extend to the more general case of an unbounded alphabet. The pattern X[1, x] is distributed evenly among the p processors. Namely, if x < p^2 (i.e., 1 ≤ x/p < p), we distribute the x/p length-p substrings of X to the first x/p processors. Otherwise, if x ≥ p^2, we map Θ(x/p^2) contiguous length-p substrings of X per processor: X[1, ℓ] is stored on the first processor, X[ℓ + 1, 2ℓ] is stored on the second processor, . . ., where ℓ = p⌈x/p^2⌉ (at most p/2 processors are empty).

In order to simplify the discussion and without loss of generality, we assume in the rest of this section that x is a multiple of p^2. Our algorithm is as follows:

• Each processor partitions the substring of X stored in its local memory into x/p^2 contiguous length-p substrings. This takes O(x/p) time and no communication cost.

• Each processor searches in Sp for the length-p substrings stored in its local memory, by using the multisearch algorithm described in Section 3 (with k = x/p, m = p). If one of those substrings does not occur in Sp, then X does not occur in T and the searching process is stopped (Fact 4.1). Otherwise, a name is associated with each substring. In order to compute the cost of this step, remember that our multisearch algorithm takes O(km/p + p + k + m) computation time and O(g(km/p) + pg + kg + mg) communication time for a general number k of searched long strings of length m ≥ p (Section 3). Therefore, by substituting k = x/p and m = p, this step takes O(x/p + p) computation time and O(xg/p + pg) communication time.

• The jth processor shrinks the substring X[(j − 1)(x/p) + 1, j(x/p)] stored in its local memory, by replacing the length-p substrings X[(j − 1)(x/p) + 1, (j − 1)(x/p) + p], X[(j − 1)(x/p) + p + 1, (j − 1)(x/p) + 2p], . . . , X[j(x/p) − p + 1, j(x/p)] with their names, which were computed in the previous step. Consequently, the jth processor builds the compressed substring X′[(j − 1)(x/p^2) + 1, j(x/p^2)]. This takes O(x/p) computation time and no communication cost.

• Each processor broadcasts the (compressed) substring of X′ stored in its local memory to all the other processors, so that each of them now has a copy of the whole string X′[1, x/p]. This step takes O(x/p) computation time and O(xg/p) communication time.


• The jth processor searches for X′ in the compressed text Tj stored in its local memory, by means of the corresponding suffix tree. We conclude that X occurs in T if X′ occurs in one of the compressed texts T1, T2, . . . , Tp (by Fact 4.2). This step takes O((x/p) log n) computation time because |X′| = x/p and the compressed texts are built on an alphabet of size O(n) (thus the branching from each suffix-tree node takes O(log n) worst-case time [24]).
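Putting the steps above together, the whole search can be sketched sequentially as follows. Plain list scans stand in for the distributed parts (the multisearch on Sp and the per-processor suffix-tree searches); the function name and the convention that $ sorts after every text character are our assumptions.

```python
def search_long_pattern(T, X, p):
    """Return True iff X (with |X| a multiple of p) occurs in T,
    via the compressed texts and the compressed pattern X'."""
    n = len(T)
    Tpad = T + "$" * p
    key = lambda s: s.replace("$", "\x7f")      # '$' assumed to sort last
    subs = {Tpad[i:i + p] for i in range(n + 1)}
    name = {s: r for r, s in enumerate(sorted(subs, key=key))}
    # Steps 1-2: split X into length-p pieces and look them up in Sp
    pieces = [X[i:i + p] for i in range(0, len(X), p)]
    if any(pc not in name for pc in pieces):    # Fact 4.1: early rejection
        return False
    # Step 3: build the compressed pattern X'
    Xc = [name[pc] for pc in pieces]
    # Steps 4-5: search X' in every compressed text Tj (Fact 4.2)
    h = -(-n // p)                              # h = ceil(n/p)
    for j in range(p):
        Tj = [name[Tpad[j + q * p : j + q * p + p]] for q in range(h)]
        if any(Tj[i:i + len(Xc)] == Xc for i in range(h - len(Xc) + 1)):
            return True
    return False
```

On the illustrative example, the pattern abaabbabb is found (its compressed form occurs in the second compressed text), while abaabbbbb is correctly rejected.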

The correctness of this approach follows from the definition of T1, T2, . . . , Tp and from the naming process performed on the length-p substrings of X (see Facts 4.1 and 4.2). As far as the complexity is concerned, the algorithm takes O(p + (x/p) log n) computation time and O(xg/p + pg) communication time. We have therefore proved the following result:

THEOREM 4.3. Given a text string T[1, n], whose characters are drawn from a constant-sized alphabet, we can search for a string X[1, x] longer than p in O(p + (x/p) log n) computation time, O(pg + xg/p) communication time, and a constant number of supersteps.

We remark that the communication time is O(xg/p) when x > p^2; in that case this time is c-optimal. Also, the total number of supersteps is optimal because it is a constant. In the case of an unbounded alphabet, we have to add a term O(p log n) to the computation time. When x < p, we could operate as in the PRAM model [3] by assigning names to all power-of-two length substrings of T and then labeling X according to these names.
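For completeness, the power-of-two naming technique alluded to here (due to Karp, Miller, and Rosenberg [21]) can be sketched as follows; this is a sequential illustration only, and the function name is ours. Each level r assigns a name to every length-2^r substring of T, computed by ranking pairs of adjacent names from level r − 1.

```python
def kmr_names(T):
    """level[r][i] = name (rank) of T[i : i + 2**r]; equal substrings of a
    given power-of-two length receive equal names."""
    n = len(T)
    alpha = {c: k for k, c in enumerate(sorted(set(T)))}
    cur = [alpha[c] for c in T]          # names of length-1 substrings
    level = [cur]
    width = 1
    while 2 * width <= n:
        # a length-2w substring is a pair of adjacent length-w names
        pairs = [(cur[i], cur[i + width]) for i in range(n - 2 * width + 1)]
        rank = {pr: k for k, pr in enumerate(sorted(set(pairs)))}
        cur = [rank[pr] for pr in pairs]
        level.append(cur)
        width *= 2
    return level
```

For instance, on T = abaab the two occurrences of ab (at positions 1 and 4) receive the same level-1 name, while the occurrence of aa receives a different one.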

5. Extension to BSP∗ and EM-BSP∗. To extend our results to other coarse-grained parallel models, we refer to the simulation algorithms presented in [9]. In fact, we limit our discussion to multistring search in BSP∗, and in EM-BSP∗ with a single processor, for a constant-sized alphabet. Other search problems, and other coarse-grained computational models, can be analyzed similarly. Two results of [9] are relevant to our discussion, dealing respectively with the simulation of a BSP algorithm in BSP∗ and of a BSP∗ algorithm in EM-BSP∗. We report them here, without requiring, for the moment, that the simulated algorithm be c-optimal.

LEMMA 5.1 (Lemma 1 of [9]). A v-processor BSP algorithm with communication time gα + λL, computation time β + λL, and context size µ can be simulated on a p-processor BSP∗ in communication time O(g(vα/(pb)) + λL), computation time (v/p)β + O(vα/p + λL), and context size O((v/p)µ), for b ≤ (v/p)^ξ, a suitable constant ξ > 0, and v ≥ p^{1+ε} for constant ε > 0.

LEMMA 5.2 (Theorem 1 of [9]). A v-processor BSP∗ algorithm with communication time g(α/b) + λL, computation time β + λL, and local memory µ can be simulated on a single-processor EM-BSP∗ with computation time O(vβ) and I/O time O(Gλlvµ/(DB)) with probability 1 − exp(−Ω(l log l log(M/B))), for suitable l ≥ 1, β = ω(λµ), M = Θ(hµ), v ≥ hD log(M/B), b ≥ B, and arbitrary integer h.


Note that the context size appearing in Lemma 5.1 essentially refers to the size of the data structures maintained by the processors. In the light of these lemmas, our multistring search algorithm of Section 3 can be efficiently simulated, since the required number of supersteps, hence λ, is a constant. We have:

THEOREM 5.3. Given a text T[1, n], whose characters are drawn from a constant-sized alphabet, there exists an algorithm for BSP∗(p, g, b, L) that searches in T for k long strings, each of length m, requiring O(mk/p + L) computation time, O(mkg/(pb) + L) communication time, a constant number of supersteps, and context size O((n + km)/p), for m, k ≥ p^{1+ε} and b ≤ p^χ (for constants ε, χ > 0).

PROOF. We consider the BSP algorithm of Section 3, leading to the result stated in Theorem 3.1. Apply the simulation technique of [9], referred to in Lemma 5.1, by renaming v, p as p̄, p, respectively. We have a number of supersteps λ = O(1), a communication factor α = mk/p̄, a computation factor β = mk/p̄, and a context size µ = (n + km)/p̄. Substitute all these values into Lemma 5.1 to get the bounds and the conditions stated above. Indeed, the conditions p̄ ≥ p^{1+ε}, b ≤ (p̄/p)^ξ, and m, k ≥ p̄ can be rephrased by setting p̄ = p^{1+ε}, thus getting b ≤ p^χ (where χ = εξ) and m, k ≥ p^{1+ε}.
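Spelled out, writing p̄ for the number of processors of the simulated BSP algorithm (the v of Lemma 5.1) and using α = β = mk/p̄, µ = (n + km)/p̄, λ = O(1), the substitution gives:

```latex
\begin{align*}
\text{communication:}\quad
  & O\Big(g\,\frac{\bar p\,\alpha}{p\,b}+\lambda L\Big)
    = O\Big(g\,\frac{\bar p\,(mk/\bar p)}{p\,b}+L\Big)
    = O\Big(\frac{mkg}{pb}+L\Big),\\
\text{computation:}\quad
  & \frac{\bar p}{p}\,\beta + O\Big(\frac{\bar p\,\alpha}{p}+\lambda L\Big)
    = \frac{\bar p}{p}\cdot\frac{mk}{\bar p} + O\Big(\frac{mk}{p}+L\Big)
    = O\Big(\frac{mk}{p}+L\Big),\\
\text{context size:}\quad
  & O\Big(\frac{\bar p}{p}\,\mu\Big)
    = O\Big(\frac{\bar p}{p}\cdot\frac{n+km}{\bar p}\Big)
    = O\Big(\frac{n+km}{p}\Big).
\end{align*}
```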

Notice that we still get a constant number of supersteps, hence a c-optimal algorithm, according to our criteria of optimality stated in Section 1. For the EM-BSP∗ model with a single processor we have:

THEOREM 5.4. Given a text T[1, n], whose characters are drawn from a constant-sized alphabet, there exists an algorithm for a single-processor EM-BSP∗(p′, G, M, B, D) that searches in T for k long strings, each of length m. This algorithm requires O(mk) computation time and O(Gl(n + mk)/(DB)) I/O time with probability 1 − exp(−Ω(l log l log(M/B))).

PROOF. From Theorem 5.3 and Lemma 5.2, by straightforward arithmetic calculations. In the simulation p plays the role of v, and p′ plays the role of p. Also, µ = (n + km)/p, α = β = mk/p, and λ = O(1).

The result of Theorem 5.4 is subject to some conditions on the parameters, deriving from the simulation algorithm of [9]. We must have: mk = ω(n); m, k ≥ p^{1+ε}; B ≤ b ≤ p^χ; M = Θ((n + km)/p); p = Ω(D log(M/B)), for suitable constants χ, ε, l > 0. In particular, mk = ω(n) implies that the total length of the pattern strings exceeds the length of the text to be searched through, as has already been imposed in other works (e.g., see [6]). For example, the condition is naturally met in batch processes, where many queries are accumulated before a search is started. Further comments on the size of data are made in the next section.

The result stated in Lemma 5.2 is extended in [9] to the simulation of a BSP∗ algorithmon a multiprocessor EM-BSP∗. By this extension, we could perform multistring searches


on a multiprocessor EM-BSP∗. This is now a matter of exercise. The conditions to keepunder control, however, become more complex.

6. Concluding Remarks. Designing parallel string search algorithms is not easy whencommunication and I/O costs must be taken into account. The solution proposed in thispaper refers to long strings, and crucially relies on the distribution of nontrivial datastructures among the processors to balance the communication cost. I/O efficiency isthen obtained by simulation.

Two major remarks are still in order. The first is related to searching k short strings in a text T, that is, strings of length m < p. An efficient BSP solution to this problem has been given in [14], showing that the difficulties arising with the distribution of data structures can be avoided using a standard "naming" technique for the strings. In fact, this reduces the problem to a search for a set of small integers [5], [6]. Without giving any detail, we mention that the BSP algorithm of [14] allows us to solve the problem in O((km/p) log n) computation time and O((km/p)g) communication time. This result can be extended to other models of parallel computation, as done above for long strings. As a general comment, the problem for short strings is not as challenging as the one treated in this paper.

Our second remark has to do with the family of problems that benefit from a simulation from BSP to EM-BSP∗. To attain efficiency with this approach, the computation time must exceed the time to load the "context" of the problem into each processor memory, so that the latter time is amortized. As pointed out in [9], this typically occurs with computationally intensive problems, or with problems that do not require very large data structures. A further aspect, however, must be underlined. Problems like our multistring search work on large inputs, and thus need large data structures, no matter how well designed these may be. The main difference with other problems having attractive parallel solutions, such as large problems in computational geometry, is that in the present case the whole data structure need not be examined during the operations. Therefore, a simulation that forces loading the context at each superstep may be scarcely efficient. In this respect, note the term (n + mk)/(DB) appearing in the I/O time of Theorem 5.4, which reflects the size of the input spread over the disks. A direct deterministic implementation of multistring search on EM-BSP∗ and other I/O-sensitive models is currently under investigation. It would not be surprising to improve the bounds of Theorem 5.4, which are nonetheless important as reference values.

Finally, it would be nice to have some experiments or simulations against a naive text partitioning and searching approach, in order to establish when the proposed algorithm pays off. A further challenging problem is to study the dynamic version of multistring search in coarse-grained parallel computers. Here the set of indexed texts can be changed on-line, by inserting or deleting individual texts [12]. As discussed above, the implications on I/O cost tend to favor the design of an ad hoc algorithm for EM-BSP∗, rather than proceeding by simulation.

Acknowledgments. We are grateful to the referees for their deep and constructivecomments.


References

[1] A. Amir and M. Farach. Adaptive dictionary matching. In Proc. of the 32nd IEEE Symposium on Foundations of Computer Science, pp. 760–766, 1991.

[2] A. Apostolico. The myriad virtues of subword trees. In A. Apostolico and Z. Galil, editors, Combinatorial Algorithms on Words. NATO ASI Series F: Computer and System Sciences, Springer-Verlag, New York, 1985, pp. 85–96.

[3] A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.

[4] L. Arge, P. Ferragina, R. Grossi and J. S. Vitter. On sorting strings in external memory. In Proc. of the 29th ACM Symposium on the Theory of Computing, pp. 540–548, 1997.

[5] A. Bäumker, W. Dittrich and F. Meyer auf der Heide. Truly efficient parallel algorithms: c-optimal multisearch for an extension of the BSP model. In Proc. of the 3rd European Symposium on Algorithms, pp. 17–30. Lecture Notes in Computer Science 979, Springer-Verlag, Berlin, 1995.

[6] A. Bäumker, W. Dittrich and A. Pietracaprina. The deterministic complexity of parallel multisearch. In Proc. of the 5th Scandinavian Workshop on Algorithm Theory, pp. 404–415. Lecture Notes in Computer Science 1097, Springer-Verlag, Berlin, 1996.

[7] R. Baeza-Yates, E. F. Barbosa and N. Ziviani. Hierarchies of indices for text searching. Information Systems, 21(6):497–514, 1996.

[8] D. R. Clark and J. I. Munro. Efficient suffix trees on secondary storage. In Proc. of the 7th ACM–SIAM Symposium on Discrete Algorithms, pp. 383–391, 1996.

[9] F. Dehne, W. Dittrich and D. Hutchinson. Efficient external memory algorithms by simulating coarse-grained parallel algorithms. In Proc. of the 9th ACM Symposium on Parallel Algorithms and Architectures, pp. 106–115, 1997.

[10] F. Dehne, A. Fabri and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. International Journal on Computational Geometry, 6(3):379–400, 1996.

[11] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. of the 38th IEEE Symposium on Foundations of Computer Science, pp. 137–143, 1997.

[12] P. Ferragina and R. Grossi. A fully-dynamic data structure for external substring search. In Proc. of the 27th ACM Symposium on the Theory of Computing, pp. 693–702, 1995. Full version available at http://www.di.unipi.it/∼ferragin.

[13] P. Ferragina and R. Grossi. Fast string searching in secondary storage: theoretical developments and experimental results. In Proc. of the 7th ACM–SIAM Symposium on Discrete Algorithms, pp. 373–382, 1996. Full version available at http://www.di.unipi.it/∼ferragin.

[14] P. Ferragina and F. Luccio. Multi-string search in BSP. In Proc. of Compression and Complexity of Sequences 1997, pp. 240–252. IEEE Computer Society, Los Alamitos, CA, 1998.

[15] A. E. Fox et al. Special issue on "Digital Libraries." Communications of the ACM, 38, 1995.

[16] Z. Galil. A constant-time optimal parallel string-matching algorithm. Journal of the ACM, 42(4):908–918, 1995.

[17] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22(2):251–267, 1994.

[18] T. Goodrich. Communication-efficient parallel sorting. In Proc. of the 28th ACM Symposium on the Theory of Computing, pp. 247–256, 1996.

[19] J. JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA, 1992.

[20] J. JaJa and K. W. Ryu. The block distributed memory model. IEEE Transactions on Parallel and Distributed Systems, 7(8):830–840, 1996.

[21] R. Karp, R. Miller and A. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In Proc. of the 4th ACM Symposium on the Theory of Computing, pp. 125–136, 1972.

[22] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, Vol. 3. Addison-Wesley, Reading, MA, 1969.

[23] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.

[24] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976.


[25] S. Muthukrishnan and K. Palem. Highly efficient dictionary matching in parallel. In Proc. of the 5th ACM Symposium on Parallel Algorithms and Architectures, pp. 69–78, 1993.

[26] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[27] L. G. Valiant. A bridging model for parallel computing. Communications of the ACM, 33:103–111, 1990.

[28] J. S. Vitter and E. A. Shriver. Algorithms for parallel memory, I: Two-level memories. Algorithmica, 12:110–147, 1994.

