
The Complexity of Parallel Multisearch on Coarse-Grained Machines


Algorithmica (1999) 24: 209–242. © 1999 Springer-Verlag New York Inc.

The Complexity of Parallel Multisearch on Coarse-Grained Machines¹

A. Bäumker,² W. Dittrich,² and A. Pietracaprina³

Abstract. Given m ordered segments that form a partition of some universe (e.g., a two-dimensional strip), the multisearch problem consists of determining, for a set of n query points in the universe, the segments they belong to. We present the first nontrivial parallel deterministic scheme for performing multisearch on a distributed-memory machine when m = ω(n). The scheme is designed on the BSP* model of parallel computation, a variant of Valiant's BSP which rewards blockwise communication, and relies on a suitable redundant representation of the segments. The time needed to answer the queries is analyzed as a function of the redundancy and of the BSP* parameters. We show that optimal performance can be obtained using logarithmic redundancy. We also prove a lower bound on the communication requirements of any deterministic multisearch scheme realized on a distributed-memory machine. The lower bound exhibits a tradeoff between the redundancy used to represent the segments and the performance of the scheme.

Key Words. Parallel algorithms, Multisearch, Coarse-grained machines, BSP model, BSP* model.

1. Introduction. Multisearch is a fundamental problem that arises in several application fields. In general terms it can be regarded as the problem of performing a number of search processes on a given data structure, such as a set of ordered items, a balanced tree, or a graph. For concreteness, we adopt the definition given in [1] for this problem. Let a set {σ₁, σ₂, . . . , σₘ} of m segments be given, forming a partition of a given universe (e.g., a strip in two dimensions). The segments are ordered in the sense that there exists an elementary operation to compare a point q in the universe with a segment σᵢ, which establishes whether q ∈ σᵢ, q ∈ ⋃_{j=1}^{i−1} σⱼ, or q ∈ ⋃_{j=i+1}^{m} σⱼ. The multisearch problem consists of determining, for a set of n arbitrary points in the universe (queries), the segments they belong to.
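In the one-dimensional case, where each segment is an interval delimited by a right endpoint, the comparison primitive is an ordering test and the problem reduces to n independent binary searches. The following sketch (our own illustration; the function and variable names are not from the paper) makes the definition concrete:

```python
import bisect

def multisearch(right_endpoints, queries):
    """Answer each query with the 1-based index of the segment it
    belongs to; segment i covers the interval ending at
    right_endpoints[i-1]. One binary search per query."""
    return [bisect.bisect_left(right_endpoints, q) + 1 for q in queries]

# segments: (-inf, 10], (10, 20], (20, 30]
print(multisearch([10, 20, 30], [15, 10, 25]))  # [2, 1, 3]
```
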

Multisearch has many important applications in fields such as computational geometry, vision, and pattern recognition, among others, and it naturally arises in the parallel implementation of several data structures (see [2] for an extensive list of applications).

¹ This research was supported in part by the ESPRIT III Basic Research Programme of the EC under Contract No. 9072 (project GEPPCOM), by the CNR Project CNR96.02538.CT07, and by DFG-Sonderforschungsbereich 376 "Massive Parallelität: Algorithmen, Entwurfsmethoden, Anwendungen." Part of the work was done while the third author was visiting the Heinz Nixdorf Institute, Paderborn, Germany. A preliminary version of this work appeared in Proc. of the 5th Scandinavian Workshop on Algorithm Theory (SWAT 96), LNCS 1097, pages 404–415, 1996.
² Department of Mathematics and Computer Science and Heinz Nixdorf Institute, University of Paderborn, Paderborn, Germany. {abk,dittrich}@uni-paderborn.de.
³ Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy. [email protected].

Received June 1, 1997; revised March 10, 1998. Communicated by F. Dehne.


In the sequential setting, a straightforward binary search yields a simple and optimal solution to the multisearch problem, which requires Θ(n log m) time. In parallel, while such a strategy can be easily implemented on a CREW PRAM, attaining optimal speed-up, the development of efficient parallel algorithms for multisearch that do not rely on the concurrent-read capability is challenging. Indeed, the problem appears quite complicated already for the EREW PRAM because of the congestion arising when several queries need to be compared with the same sequence of segments. The scenario is even worse on more realistic machines where the memory is partitioned into modules accessible through a network, since both the modules and the network become bottlenecks for accessing the shared data.

The aim of this paper is to study the complexity of the multisearch problem when m is larger than n, by developing upper and lower bounds on the time required to solve it on a coarse-grained parallel machine. The upper bounds are designed on the BSP* model, a variant of Valiant's BSP (Bulk Synchronous Parallel) model of parallel computing [3], while the lower bounds apply to any distributed-memory machine.

1.1. BSP* Model. The BSP model regards a parallel computer as a set of processors, each equipped with a large local memory, which communicate by exchanging messages through a router. Furthermore, the parallel computer is provided with a mechanism for synchronizing the processors in barrier style. The computation is organized as a sequence of supersteps. A superstep consists of a computation phase and a communication phase followed by a barrier synchronization. In the computation phase each processor independently performs operations on data that reside in its local memory at the start of the superstep, and generates messages for other processors. In the communication phase the messages are delivered to their destinations through the router.

Many routers of real parallel machines support the exchange of large messages and achieve much higher throughput for large messages than for small ones. This is largely due to the high overhead incurred by a processor in the transmission/reception of a message. As a consequence, in order to achieve high performance, parallel algorithms should be designed so that the data exchanged by the processors are combined into large messages. This is what we call blockwise communication. In order to encourage the use of blockwise communication, the BSP was extended in [1] by introducing an additional parameter that accounts for message transmission/reception overhead (another model where such overhead is taken into account is LogP [4]). More precisely, in the new model, called BSP*, the parallel machine is characterized by the following parameters:

p: number of processors (as in BSP);
L: minimum time between successive synchronization operations, i.e., minimum time for a superstep (as in BSP);
B: extra cost per message transmission/reception;
g: reciprocal of the throughput of the router in terms of words delivered per time unit.

(In BSP, B and g are combined into a single bandwidth parameter g′ = gB.)

A BSP* machine with the above parameters will be denoted by BSP*(p, g, L, B). Consider a superstep where in the computation phase the processors perform a maximum of w local operations and where in the communication phase the router realizes an (h, s)-relation, that is, a communication pattern in which each processor sends/receives


at most s words of data combined into at most h messages. The BSP* model charges time w + g(s + hB) + L for this superstep. Now consider a computation that consists of T supersteps where in the ith superstep an (hᵢ, sᵢ)-relation is realized and a maximum of wᵢ local operations are performed by each processor. Let S = Σ_{i=1}^{T} sᵢ, H = Σ_{i=1}^{T} hᵢ, and W = Σ_{i=1}^{T} wᵢ. Then the total runtime for the computation is

W + g(S + BH) + TL.

Clearly, the design of efficient BSP* algorithms aims at finding values for W, S, H, and T, as functions of the machine parameters, which minimize the above formula. In particular, in order to achieve high performance when the communication costs are predominant, it is crucial to hide the overhead of message transmission/reception (accounted for by parameter B) by combining the data exchanged by the processors into large messages, that is, messages of size at least B. Indeed, if all messages exchanged are of size Ω(B), then the above formula becomes W + gS + TL, while if the same computation were analyzed under the BSP cost model, its running time would be W + gBS + TL. In other words, B represents the minimum message size needed to fully exploit the bandwidth of the router. In this sense, the BSP* model rewards blockwise communication.
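The cost model above can be transcribed directly into a small calculator (our own sketch, not part of the paper), which also makes the effect of blockwise communication easy to check numerically:

```python
def bsp_star_time(supersteps, g, L, B):
    """Total BSP* running time W + g*(S + B*H) + T*L for a computation
    given as a list of (w, h, s) triples, one triple per superstep."""
    W = sum(w for w, _, _ in supersteps)
    H = sum(h for _, h, _ in supersteps)
    S = sum(s for _, _, s in supersteps)
    T = len(supersteps)
    return W + g * (S + B * H) + T * L

# one superstep: w = 10 local ops, s = 8 words packed into h = 2 messages
print(bsp_star_time([(10, 2, 8)], g=1, L=5, B=4))  # 31
```

When every message has size at least B, we have h ≤ s/B, so g(s + hB) ≤ 2gs and the formula degenerates to the W + gS + TL form noted above.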

1.2. Previous Work. We say that a parallel algorithm for a problem A, which runs on p processors, is optimal if it is Θ(p) times faster than the best sequential algorithm for A.

A number of parallel algorithms for the multisearch problem are known in the literature. The following results apply to problems where the number of segments m is small, namely m = O(n). In [5] an optimal randomized algorithm for the EREW PRAM is presented. The algorithm runs in time O(log n), with high probability, using n processors, and can be adapted to run on the butterfly as well. Deterministic algorithms are given in [2] and [6], which require time O(log n (log log n)³) on an n-node hypercube, and time O(√n) on a √n × √n mesh, respectively. On the coarse-grained multiprocessor (a model similar to BSP), it is shown in [7] that multisearch can be performed in optimal O((n/p) log n) time if p ≤ √n processors are used. In [8] a randomized BSP algorithm for the multisearch problem is given that runs in optimal O((n/p) log n) time and, for p = O(n^{1−ε}), with ε > 0 constant, requires O(1) communication rounds.

The case of large m is harder, especially in a distributed-memory setting, due to the difficulty of providing efficient parallel access to the large number of shared data. An optimal EREW PRAM algorithm, running in O(log m) time on n processors, is given in [9]. The algorithm, however, exploits a linear ordering of the queries, which is not possible when working with a multidimensional universe. In [10] Ranade presents a randomized algorithm that achieves optimal speed-up on a p-processor butterfly for p log p queries on a number of segments polynomial in p. Recently, Bäumker et al. [1] studied the multisearch problem on the BSP*, obtaining 1-optimal algorithms, that is, algorithms that run in time (1 + o(1)) n log m / p.⁴ In particular, they present a 1-optimal deterministic algorithm for the case m ≤ n, with n = Ω(p log² n), and a 1-optimal randomized algorithm for the case n < m < 2ⁿ, with n = Ω(p log³ n). Both results hold for wide ranges of the BSP* parameters (e.g., if n ≥ p^{1+ε} for an arbitrary

⁴ The base of all logarithms in the paper, when not explicitly indicated, is 2.


constant ε > 0, then B, L ≤ (n/p)^η, for some small constant η, and g = o(log n) are sufficient).

For the case m = ω(n) no efficient deterministic parallel algorithms that solve the multisearch problem are known in the literature, except for the straightforward CREW PRAM algorithm.

1.3. New Results. We conveniently regard a parallel solution to the multisearch problem as consisting of two parts: a mapping scheme, which specifies how the m segments are organized in the processors' local memories; and a searching protocol, which is employed to answer an arbitrary set of n queries. We call the combination of a mapping scheme and a searching protocol for m segments and n queries an (n, m)-multisearch scheme.

In this paper we present an (n, m)-multisearch scheme for the BSP*(p, g, L, B), which is specifically designed for problems where the number of segments is larger than the number of queries. The scheme relies on a redundant representation of the segments, where each segment is replicated into r copies distributed among the processors' memories. The value of r is called the redundancy of the scheme. The scheme features a suitable searching protocol which exploits the redundancy in order to control memory contention. More precisely, the protocol works in a number of rounds, essentially performing, for each query, a binary search on the segments. In each round the queries are compared with (possibly distinct) segments and the protocol carefully selects the actual copies of the segments to be used for the comparisons, with the objective of minimizing the number of selected copies that reside in a single memory module. We analyze the performance of the protocol as a function of the relevant parameters and show that, for certain ranges of parameters, optimality can be achieved in the worst case. Furthermore, we show that larger ranges of optimality can be obtained by introducing randomization in the copy selection mechanism used by the searching protocol. The performance of our scheme is summarized in the following theorem.

THEOREM 1. Let ε₁ and ε₂ be arbitrary constants, with ε₁ > 0 and ε₂ > 1. For every n, m, r such that

• n ≥ p max{p^{ε₁}, B^{ε₂}, L log p},
• n ≤ m ≤ 2^p,
• c log_p m log(log_p m) ≤ r ≤ log m, for some constant c,

there exists an (n, m)-multisearch scheme with redundancy r for the BSP*(p, g, L, B), which uses O(mr/p) space per processor and answers any set of n queries in time

O( (n log m / p) ( gr/log n + ⌈g/log n⌉ (m/n)^{2/r} ) ),

in the worst case. If n = Ω(p²), the running time becomes

O( (n log m / p) ( r/log n + ⌈g/log n⌉ (m/n)^{2/r} ) ),


in the worst case. By using a randomized searching protocol the running time becomes

O( (n log m / p) ⌈g/log n⌉ ( m / (p^{1/ε₂} n^{1−1/ε₂}) )^{2/r} ),

with high probability, for every n ≥ p max{p^{ε₁}, B^{ε₂}, L log p log m / log n} and r = o(min{p, (n/p)^{1/ε₂}} / log p).

A number of comments on the above results are in order. We first observe that the asymptotic running times are expressed as products of the form (n log m / p) · f(n, m, r, g), where the function f(n, m, r, g) represents the slowdown with respect to the optimal n log m / p performance. Note that parameters B and L do not explicitly appear in the slowdown function. Indeed, the general strategy as well as the basic primitives employed by the searching protocol are designed in such a way as to exploit blockwise communication. Hence, when n is sufficiently large, the contributions of these parameters to the running time are hidden by the local computation time and the overall communication volume (accounted for by the g term).

The redundancy should be regarded as a design parameter employed to improve time performance at the expense of an increase in the memory requirements. We consider more closely the influence of the redundancy on the slowdown. Broadly speaking, the term which is linear in r accounts for the complexity of manipulating multiple copies of the segments and, therefore, increases with the redundancy. Such a term disappears from the probabilistic performance since, as will be seen, the randomized protocol manipulates only one copy per segment at a time. On the other hand, the term which is exponentially decreasing in r accounts for the contention at the memory modules. Intuitively, if few copies per segment are available the searching protocol has less flexibility in choosing the copies to compare the queries to and, therefore, a higher contention at the modules is to be expected. Thus, the choice of the redundancy constitutes a tradeoff between the cost of storing and manipulating the copies of the segments and the penalty caused by memory contention. As a particular case, we observe that by choosing r = Θ(log m) we have (m/n)^{2/r} ≤ (m/n^{1−1/ε₂})^{2/r} = O(1). Hence, if m is polynomial in n, the slowdown becomes proportional to g (worst case with n = o(p²)) or proportional to ⌈g/log n⌉ (worst case with n = Ω(p²), or with high probability). In these cases, optimality can be attained when g is small enough, as stated in the following corollary.
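The tradeoff can be made tangible numerically. The sketch below (our own illustration, not from the paper) evaluates the worst-case slowdown expression for n = Ω(p²), i.e., r/log n + ⌈g/log n⌉ (m/n)^{2/r}, for a few values of r:

```python
import math

def slowdown(r, n, m, g):
    """Slowdown of Theorem 1 (worst case, n = Omega(p^2)): a
    copy-management term linear in r plus a memory-contention term
    decreasing exponentially in r."""
    return r / math.log2(n) + math.ceil(g / math.log2(n)) * (m / n) ** (2 / r)

n, m, g = 2**20, 2**40, 8
for r in (2, 10, 40, 80):
    print(r, slowdown(r, n, m, g))
```

For these parameters the minimum sits near r = Θ(log m), where (m/n)^{2/r} = O(1), matching the choice made in Corollary 1.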

COROLLARY 1. If, in addition to the hypotheses of Theorem 1, we have that m is polynomial in n and either g = O(1), or g = O(log n) and n = Ω(p²), then, by fixing r = Θ(log n), the worst-case performance of the multisearch scheme given in the theorem becomes optimal. If we employ the randomized searching protocol, then optimality is achieved, with high probability, when g = O(log n) and r = Θ(log n).

The above results provide the first nontrivial worst-case upper bounds for parallel multisearch on a distributed-memory machine in the case m = ω(n). We also remark that the probabilistic performance of our scheme is obtained by introducing randomization in the searching protocol while maintaining a fixed mapping of the segments to the processors. This is in contrast to the scheme in [1], where a randomized mapping and


a deterministic searching protocol were employed. The advantage of our solution is that when the searching protocol performs poorly for a set of queries, it is sufficient to run the protocol again, while in the multisearch scheme of [1] it was necessary to change the mapping scheme, which is expensive.

A natural question is whether the redundancy used in our scheme is truly needed to achieve efficient performance. We tackle the question by proving a lower bound on the amount of communication that any (n, m)-multisearch scheme, designed for a p-processor distributed-memory machine, uses to answer n queries. The lower bound is expressed as a function of m, n, p, and the redundancy r, and it holds under the following conditions: (1) initially, each processor is in charge of n/p queries; (2) the answer to a query is based exclusively on the outcomes of individual comparisons between the query and the segments; and (3) a segment or a query can be communicated using a constant number of words. Also, the lower bound assumes that a fixed mapping scheme is employed, and relies on a more general notion of redundancy, defined as the average number of copies per segment initially available in the processors' memories. Hence, it applies to both our deterministic and randomized upper bounds.

THEOREM 2. Let 1 ≤ r ≤ min{p/24, log(m/(6n))/12}. For any (n, m)-multisearch scheme which uses redundancy r and runs on a p-processor distributed-memory machine, there is a set of n queries which, in order to be answered, require that

T_comm = Ω( min{ n/r, (n/p) (m/n)^{1/(6r)} } )

words be communicated from/to some processor.

The lower bound implies that when m = O(2^{o(√p)}), then Ω(log(m/n)/log log m) redundancy is needed to reduce the communication requirements to O(n log m / p). Note that the amount of redundancy used by our scheme to achieve optimality, when m is polynomial in n, as claimed in Corollary 1, is only a factor O(log log m) away from the above value.

It must be remarked that the multisearch problem, as defined at the beginning of this section, represents a special case of the more general n-way search problem defined in [11], where n search processes (queries) must be performed on a leveled directed acyclic graph of m nodes of constant outdegree. In particular, each search process entails the visit of a set of nodes forming a path in the graph. We believe that the techniques presented in our paper can be applied to the n-way search problem as well, yielding the same upper and lower bounds.

The rest of the paper is organized as follows. Section 2 presents the BSP* implementation of a number of primitives which will be used in the searching protocol (the details of the implementation of the sorting primitive are given in an appendix to the paper). In Section 3 we give an outline of the multisearch scheme. Section 4 describes a redundant mapping scheme together with the procedures that the searching protocol will employ to select the copies of the segments. The deterministic and randomized searching protocols are presented in Sections 5 and 6, respectively, where Theorem 1 and Corollary 1 are proved. Finally, the lower bound is proved in Section 7.


2. BSP* Primitives. In this section we discuss the BSP* implementation of some basic primitives that are used in our multisearch algorithm. Let the BSP* processors be denoted by P₀, P₁, . . . , P_{p−1}.

ℓ-BROADCAST. Consider a vector of size ℓ stored in processor P₀. The task is to deliver a copy of the vector to each processor.

ℓ-PREFIX. Let a₀, . . . , a_{p−1} be p integer vectors of size ℓ, with vector aᵢ stored in processor Pᵢ, for 0 ≤ i < p. The ith prefix sum is defined as sᵢ := Σ_{j=0}^{i} aⱼ, where the summations are done componentwise. The task is to calculate sᵢ in Pᵢ, for each i.

By using the algorithms given in [1] and organizing the computation as a k-ary tree rather than a binary tree, we have:

LEMMA 1. For every integer k, with 2 ≤ k ≤ p − 1, both the ℓ-broadcast and ℓ-prefix problems can be solved on a BSP*(p, g, L, B) in time W + g(S + BH) + LT with

W = O(k(ℓ + log_k p)),
S = O(k(ℓ + log_k p)),
H = O(k log_k p),
T = O(log_k p).

n-SCAN. Consider an array A of size n that consists of n′ ≤ n subarrays. Each array element is labeled with the number of its subarray. Let A be partitioned into p blocks of size n/p such that the ith block is held by the ith processor. The task is to determine, for each element in A: (1) the index of the first and the last processor that hold elements from the same subarray; (2) the element's position within its subarray; and (3) the size of the element's subarray. This task can be performed by means of the 1-broadcast and 1-prefix primitives from above. The following lemma can be easily concluded.
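A sequential reference for the n-scan task (our own sketch; the BSP* version obtains the same quantities with 1-broadcast and 1-prefix):

```python
def n_scan(labels, p):
    """For each element of an array of size n, split into p blocks of
    n/p elements each, return (first_proc, last_proc, pos_in_subarray,
    subarray_size) for the subarray given by the element's label."""
    blk = len(labels) // p
    first, last, size = {}, {}, {}
    for i, lab in enumerate(labels):
        proc = i // blk
        first.setdefault(lab, proc)   # first processor holding this subarray
        last[lab] = proc              # last processor holding this subarray
        size[lab] = size.get(lab, 0) + 1
    out, pos = [], {}
    for lab in labels:
        pos[lab] = pos.get(lab, 0) + 1
        out.append((first[lab], last[lab], pos[lab], size[lab]))
    return out

print(n_scan([0, 0, 0, 1, 1, 2, 2, 2], p=4)[0])  # (0, 1, 1, 3)
```
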

LEMMA 2. For every integer k, with 2 ≤ k ≤ p − 1, the n-scan problem can be solved on a BSP*(p, g, L, B) in time W + g(S + BH) + LT with

W = O((n/p) + k log_k p),
S = O(k log_k p),
H = O(k log_k p),
T = O(log_k p).

n-SORTING. Let n integer keys from the range [0, . . . , d − 1] be distributed among the p processors such that each processor holds n/p keys. The task is to redistribute the keys so that processor Pᵢ holds the n/p keys whose rank in the sorted sequence is between i(n/p) + 1 and (i + 1)(n/p), for 0 ≤ i < p. In the Appendix we present and analyze an algorithm for this problem. The running time of the algorithm is given in the following lemma.


LEMMA 3. Fix arbitrary constants c₁, c₂ ≥ 1 and ε > 0. If n = Ω(p^{1+ε}) and d = O((n/p)^{c₂}), then the n-sorting problem can be solved on a BSP*(p, g, L, B) in time W + g(S + BH) + TL with

W = O(n/p),
S = O(n/p),
H = O((n/p)^{1/c₁}),
T = O(1).

s-RELATION. An s-relation is a routing problem involving at most sp one-word messages, where each processor is the source and destination of at most s messages. The s-relation is the fundamental communication primitive provided by BSP. In BSP*, an s-relation can be performed in time g(s + Bs). However, the next result shows that better performance can be obtained for large s.

LEMMA 4. Fix arbitrary constants ε > 0 and c ≥ 1. Then, for s = Ω(p^ε), any s-relation can be routed on a BSP*(p, g, L, B) in time W + g(S + BH) + TL with

W = O(s),
S = O(s),
H = O(s^{1/c}),
T = O(1).

PROOF. The algorithm works as follows. First the processors globally sort the messages by destination, employing the sorting algorithm mentioned before. (We assume that, prior to sorting, each processor that holds s′ < s one-word messages generates s − s′ dummy messages of value p, so that after the sorting the nondummy messages are stored in a compact fashion by the first processors.) By applying Lemma 3 with c₁ = c and c₂ = ⌈log_{n/p} p⌉, the sorting can be done within the stated time bound. Then each processor combines its messages into O(s^{1/c}) packets as follows. If it has more than s^{1−1/c} messages for a destination, then it combines these messages into one packet with the same destination. If it has fewer than s^{1−1/c} messages for a destination, then it combines them with messages for consecutive destinations into a packet of size (at most) 2s^{1−1/c}, whose destination is the smallest destination of the messages it contains. This task is executed locally at each processor in O(s) time. Then the newly created packets are sent to their destinations. Clearly, the routing time yields S = O(s) and H = O(s^{1/c}). Finally, a processor that has received a packet containing messages for multiple destinations broadcasts the packet to all specified destinations (this is an instance of an O(s^{1−1/c})-broadcast), and each destination gathers its messages from it. By applying Lemma 1 with k = min{p^{ε/c}, p − 1} we have that the broadcast is done within the stated time bound, and thus Lemma 4 follows.

3. Overview of the Multisearch Scheme. Let d = n/p. In our multisearch scheme the m segments are organized in the form of a complete d-ary search tree T with m/(d − 1)


nodes, where each node contains an ordered set of d − 1 segments. Thus, T has height Θ(log_d m). It follows that for each query q ∈ U there is a unique search path from the root to the node that contains the segment q belongs to. The search paths are not known in advance, but have to be constructed step by step. If a query q has a node v on its search path, then the tree node to be visited next (if any) can be determined by performing a binary search on the set of segments stored in v.
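The per-node step is easy to picture in code. A minimal sequential sketch (our own naming; the paper distributes the tree nodes over the BSP* processors, which this fragment ignores):

```python
import bisect

def next_child(splitters, q):
    """One search round: a binary search of query q against the d-1
    ordered splitter values of the current node yields the index
    (0..d-1) of the child to visit on the next level."""
    return bisect.bisect_left(splitters, q)

def tree_height(m, d):
    """Number of levels of a complete d-ary tree, d-1 segments per
    node, needed to hold m segments: Theta(log_d m)."""
    h, nodes, capacity = 0, 1, 0
    while capacity < m:
        capacity += nodes * (d - 1)
        nodes *= d
        h += 1
    return h

print(next_child([10, 20, 30], 25))  # 2
print(tree_height(4096, 4))          # 7
```
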

From a high-level point of view, our solution for the multisearch problem is very simple. The nodes of T are distributed among the BSP* processors in a redundant fashion, as will be explained later. The search algorithm proceeds in rounds. In the ℓth round all queries reach level ℓ of the tree and each query is brought together with the node that it visits on level ℓ of the search tree. Then, for each query, the (ℓ + 1)st node on its search path is determined. When trying to bring together each query with the node it visits on the current level, two kinds of congestion, which we call node congestion and processor congestion, occur. Specifically, the node congestion of a node v is the number of queries that have v on their search path, while the processor congestion of a processor P is the number of queries that meet the nodes they want to visit at P.

The node congestion depends on the set of queries and may vary a lot for different nodes. For example, the node congestion of the root is always n, while other nodes may be visited by only one query. The multisearch algorithm copes with this problem by employing different strategies for nodes with high node congestion and for nodes with low node congestion. If a node v has high congestion, then a copy of it will be broadcast to the processors that hold the queries that want to visit v, while if v has low congestion, then the queries that want to visit v are moved to a processor storing a copy of v.

Processor congestion becomes a problem only if the tree is much larger than the number of processors. Suppose, for example, that |T| > np and that the tree nodes were partitioned among the processors according to some fixed scheme. Then an adversary could select n queries that have n different nodes in their search paths, all residing on the same processor P. This situation is fatal for the performance because, in order to answer the queries, processor P would have to send Θ(n) nodes or receive Θ(n) queries. In any case, Ω(n) words of data would have to be communicated from or to a single processor. To circumvent this problem one could map the tree nodes to the processors at random in order to make high processor congestion unlikely. This approach was followed in [1]. Here we employ a deterministic mapping scheme that uses redundancy in order to reduce processor congestion. Specifically, for each node there are r copies available in the local memories of distinct processors. In any round of the multisearch algorithm, copies of the tree nodes that need to be accessed to make the queries advance on their search paths are carefully selected so that processor congestion is low. (The idea of reducing congestion by exploiting a redundant representation of the data is a key ingredient of deterministic shared-memory implementations; e.g., see [12] and [13].) The mapping of the copies of the tree nodes to the processors and the copy selection mechanism are presented in the next section.

4. The Redundant Mapping Scheme. Let T be the d-ary search tree containing the m segments. We use V to denote the set of nodes of T and P to denote the set of BSP* processors. We have |V| = m/(d − 1) and |P| = p. We define M = m/(d − 1).


4.1. Memory Map. Each v ∈ V is replicated into r copies assigned to different processors. The assignment is done according to a memory map Γ: V → 2^P, where Γ(v) returns the set of r processors storing the copies of v. Thus, the triplet (V, P, Γ) defines a bipartite graph G with node sets V and P and edges {(v, P) : v ∈ V and P ∈ Γ(v)}. We choose Γ such that the resulting bipartite graph G is r-regular, that is, every vertex in V has degree r and every vertex in P has degree Mr/p. Further, Γ is chosen such that G has a certain expansion, as explained below. For a subset S ⊂ V and a subset E of edges of G, we let Γ_E(S) denote the set of processors in Γ(S) that are incident with edges in E. The following terminology was introduced in [12].

DEFINITION 1. For S ⊆ V, a k-bundle for S is a set E of k|S| edges of G such that, for each v ∈ S, E contains exactly k edges that are incident on v. If k = 1, we call E a target set for S. E has congestion c if, for every processor P ∈ P, P is incident with at most c edges of E.

For convenience, we assume that r is even.

DEFINITION 2. The graph G has (γ, δ)-expansion, for some γ > 0 and 0 < δ < 1, if for any subset S ⊆ V, with |S| ≤ p/r, and any r/2-bundle E for S,

|Γ_E(S)| > γr|S|^(1−δ).

The existence of graphs with suitable expansion is established by the following lemma, whose proof can be found in Lemma 4 of [13].

LEMMA 5. Let Γ: V → 2^P be a random function such that, for each v ∈ V, Γ(v) is a random subset of P of size r. Then there is a constant c = Θ(1) such that, for every r > c log_p M log(log_p M), the graph G has (γ, δ)-expansion, with δ = 2 log_p(M/p)/r and γ = Θ(1), with high probability.

Based on the above lemma, throughout the paper we assume that the assignment of tree nodes to the BSP* processors is done according to a memory map function Γ such that the graph G defined by Γ is r-regular and has (γ, δ)-expansion, with γ = Θ(1) and δ = 2 log_p(M/p)/r. As a consequence of the regularity of the graph, each processor stores O(Mr/p) nodes, that is, O(mr/p) segments. Therefore, the storage required of each processor is a factor r larger than the natural Ω(m/p) bound. However, as will be shown in Section 7, the use of redundancy, hence the consequent increase in space, is necessary to attain efficient performance.
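The random construction behind Lemma 5 is easy to sketch. The snippet below (an illustrative sketch of ours, not the paper's code; the parameter values M, p, r are arbitrary) draws, for each of the M tree nodes, a random r-subset of the p processors and checks that the resulting per-processor load stays close to the Mr/p average that an r-regular map would give exactly.

```python
import random

def random_memory_map(M, p, r, seed=0):
    """For each of M tree nodes, draw Gamma(v): a random r-subset of p processors."""
    rng = random.Random(seed)
    return [rng.sample(range(p), r) for _ in range(M)]

M, p, r = 4096, 64, 8
gamma = random_memory_map(M, p, r)

# Load of a processor = number of node copies it stores.
load = [0] * p
for copies in gamma:
    for P in copies:
        load[P] += 1

avg = M * r // p  # an r-regular map would give exactly this load everywhere
print(min(load), max(load), avg)
```

Note that Lemma 5 only guarantees the (γ, δ)-expansion with high probability; the rest of the section assumes a fixed map Γ for which it holds.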

4.2. Selection of Target Sets. Let S ⊆ V be a set of tree nodes to be visited by the queries at some point during the course of the searching protocol. For each node in S one copy, which the queries will visit, needs to be selected. In order to minimize processor congestion it is crucial to select copies that are evenly distributed among the processors, which corresponds to constructing a target set of low congestion for S. We now present three procedures for selecting target sets. The first procedure only works for |S| ≤ p/r, the second one can cope with larger sets S, and the third one deals with sets S where a suitable generalization of the notion of congestion is employed. The procedures are first described at a high level, while the next subsection discusses their BSP* implementation. We use the following notation: for a set of edges F and for V′ ⊆ V (resp., Q ⊆ P), F(V′) (resp., F(Q)) denotes the set of edges in F that are incident to V′ (resp., Q).

Let S be a subset of V and let E be a set of edges incident on nodes of S and containing an r/2-bundle for S. A target set T for S of congestion at most k = (2/γ)|S|^δ can be constructed from E using Procedure 1 below.

Procedure 1

1. R := S; F := E; T := ∅; k := (2/γ)|S|^δ;
2. while R ≠ ∅ do
   (a) Identify the set Q ⊆ P of processors that are incident with at least k edges from F.
   (b) Set F′ := F − F(Q).
   (c) Let R′ be the set of nodes from R that have an incident edge in F′. For each v ∈ R′ select an arbitrary incident edge from F′ and add the selected edge to the set T.
   (d) F := F − F(R′); R := R − R′.
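Procedure 1 admits a compact sequential sketch. In the Python below (illustrative code of ours; the threshold k is passed in directly instead of being computed as (2/γ)|S|^δ, and an explicit guard replaces the termination argument of Lemma 6), an edge is a (node, processor) pair and each pass discards the edges incident on processors that already carry at least k candidate edges.

```python
def procedure1(S, edges, k):
    """Select one edge per node of S, avoiding processors that see >= k edges.
    edges: dict node -> list of processors (the bundle E)."""
    R = set(S)
    F = {v: list(edges[v]) for v in S}
    target = {}
    while R:
        # (a) congested processors: incident with at least k remaining edges
        cnt = {}
        for v in R:
            for P in F[v]:
                cnt[P] = cnt.get(P, 0) + 1
        Q = {P for P, c in cnt.items() if c >= k}
        # (b)-(c) every node with an edge avoiding Q selects one such edge
        done = set()
        for v in sorted(R):
            free = [P for P in F[v] if P not in Q]
            if free:
                target[v] = min(free)  # deterministic choice
                done.add(v)
        if not done:  # guard; under the expansion of G this cannot happen
            raise RuntimeError("k too small for this graph")
        # (d) remove the satisfied nodes together with their edges
        for v in done:
            del F[v]
        R -= done
    return target

# Processor 0 sees four edges, exceeds k = 2 and is avoided; each node
# falls back to its private processor.
print(procedure1([0, 1, 2, 3], {0: [0, 1], 1: [0, 2], 2: [0, 3], 3: [0, 4]}, 2))
```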

LEMMA 6. If |S| ≤ p/r, Procedure 1 terminates and produces a target set T ⊆ E for S of congestion at most (2/γ)|S|^δ. Moreover, for every i > 0, at the beginning of the i-th iteration of Step 2, |R| ≤ |S|/2^(i−1).

PROOF. Let R_i and F_i denote the sets R and F, respectively, at the beginning of the i-th iteration of the while loop, and let Q_i denote the set Q identified in Substep 2(a) of that iteration. We refer to the processors in Q_i as "congested processors." We first show |R_i| ≤ |S|/2^(i−1), for i ≥ 1, which ensures termination of the procedure. The proof is by induction over i. The inequality trivially holds for i = 1, which provides the base of the induction. Now we assume that |R_i| ≤ |S|/2^(i−1) for some i ≥ 1, and show that |R_(i+1)| ≤ |R_i|/2. For a contradiction, we assume that there is a set R̄_i ⊆ R_i, with |R̄_i| > |R_i|/2, such that all edges in F_i(R̄_i) are incident only on congested processors. Since F_i(R̄_i) contains an r/2-bundle for R̄_i, by the expansion property of G there are more than γr(|R̄_i|)^(1−δ) congested processors, which account for at least kγr(|R̄_i|)^(1−δ) edges in F_i. Therefore the following holds:

|F_i| ≥ kγr|R̄_i|^(1−δ) > 2|S|^δ r(|R_i|/2)^(1−δ) ≥ |R_i|r.

This is a contradiction because there are at most |R_i|r edges in F_i. As for the correctness of the procedure, it can be immediately verified that, at the end of the procedure, T is indeed a target set for S, and T ⊆ E. The bound on the congestion of T follows by noting that the edges added in each iteration contribute a congestion of at most k to any processor in P, and that edges added in distinct iterations cannot be incident on the same processor since, for j > i, Γ_{F_j}(R_j) ⊆ Q_i and, hence, (Γ_{F_i}(R_i) − Q_i) ∩ (Γ_{F_j}(R_j) − Q_j) = ∅.

When r = 2 log(M/p) we have δ = 1/log p (in this case we say that G has high expansion), and the lemma shows that for a set S with |S| ≤ p/r Procedure 1 determines a target set of congestion O(1), which is asymptotically optimal.

Next, we show how to construct target sets of low congestion for subsets of V of larger size. Consider a set S ⊆ V of arbitrary size, and let E be the set of edges of G incident on nodes of S. A target set T for S can be constructed from E using the following procedure. Let τ = ⌈|S|/(p/r)⌉.

Procedure 2

1. Set T := ∅; F := E; k := (|S|/p)(p/r)^δ(1/γ).
2. Partition S into S(1), S(2), ..., S(τ), with |S(i)| ≤ p/r.
3. For i := 1 to τ do
   (a) Let S′(i) be the set of vertices from S(i) that are incident with at least r/2 edges in F.
   (b) Run Procedure 1 to select a target set T_i ⊆ F(S′(i)) for S′(i).
   (c) T := T ∪ T_i.
   (d) Identify the set Q of processors that are incident with more than k edges in T.
   (e) F := E − E(Q).
4. Let S̄ = S − ∪_{i=1..τ} S′(i).
5. Run Procedure 1 to select a target set T̄ ⊆ E(S̄) for S̄.
6. T := T ∪ T̄.

LEMMA 7. Procedure 2 produces a target set T for S of congestion at most

(|S|/p + 4)(p/r)^δ/γ.

PROOF. In order to prove correctness, we must show that |S̄| ≤ p/r, since Procedure 1 works on sets of size at most p/r. Let Q̄ denote the set Q at the end of Step 3 and note that each processor in Q̄ is incident with more than k edges of ∪_{i=1..τ} T_i. Note also that |∪_{i=1..τ} T_i| ≤ |S|. By construction, every node in S̄ must be adjacent to at least r/2 processors of Q̄. Suppose that |S̄| > p/r and consider a subset S′ ⊆ S̄ of size p/r. Let E′ be an r/2-bundle for S′ with E′ ⊆ E(Q̄). By the expansion of G we conclude that

|Q̄| ≥ |Γ_{E′}(S′)| > γr(p/r)^(1−δ).

Since each processor in Q̄ is incident with more than k edges of ∪_{i=1..τ} T_i, we conclude that there are more than

k|Q̄| > (|S|/p)((p/r)^δ/γ) γr(p/r)^(1−δ) ≥ |S|

edges in ∪_{i=1..τ} T_i, which is a contradiction. Therefore, we must have |S̄| ≤ p/r and Procedure 1 can be applied to find the target set T̄. In order to establish the upper bound on the congestion of the final target set T constructed by Procedure 2, we note that, by Lemma 6, each T_i has congestion at most (2/γ)(p/r)^δ and the combined congestion of the T_i's cannot build up to more than k + (2/γ)(p/r)^δ. Since the congestion of T̄ contributes at most an additive (2/γ)(p/r)^δ term to the congestion of T, we conclude that the overall congestion is at most

k + 4(p/r)^δ/γ = (|S|/p + 4)(p/r)^δ/γ.

Notice that when r = 2 log(M/p) and δ = 1/log p, the target set determined by Procedure 2 has congestion O(|S|/p), which is asymptotically optimal.

Finally, we consider a weighted set S ⊆ V, where an integral positive weight w(v) is attached to every v ∈ S. All edges incident on v get the same weight as v. Let T be a target set for S and, for P ∈ P, define W^T_P to be the sum of the weights of the edges in T(P). We now define the weighted congestion of T as max{W^T_P : P ∈ P}. Note that this is a generalization of the notion of congestion given in Definition 1. The following algorithm constructs a target set of low weighted congestion for S. Let t = max{w(v) : v ∈ S} and W = Σ_{v∈S} w(v). Also, define Ψ_i = {v ∈ S : 2^i ≤ w(v) < 2^(i+1)}, for 0 ≤ i ≤ ⌊log t⌋.

Procedure 3

1. Run Procedure 2 on each Ψ_i, in parallel, to produce a target set T_i.
2. T := ∪_{i=0..⌊log t⌋} T_i.
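The weight bucketing used by Procedure 3 is straightforward to illustrate. The sketch below (our code; the per-bucket call to Procedure 2 is left out) builds the families Ψ_i and checks the inequality W ≥ Σ_i 2^i |Ψ_i| on which the proof of Lemma 8 relies.

```python
import math

def weight_buckets(weights):
    """Partition requests into Psi_i = {v : 2^i <= w(v) < 2^(i+1)}, 0 <= i <= floor(log t)."""
    t = max(weights.values())
    buckets = [[] for _ in range(int(math.log2(t)) + 1)]
    for v, w in sorted(weights.items()):
        buckets[w.bit_length() - 1].append(v)  # i = floor(log2 w(v))
    return buckets

weights = {'a': 1, 'b': 3, 'c': 4, 'd': 7, 'e': 16}
psi = weight_buckets(weights)

W = sum(weights.values())
lower = sum((2 ** i) * len(b) for i, b in enumerate(psi))
print(psi, W, lower)  # W >= sum_i 2^i * |Psi_i|
```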

LEMMA 8. The set T computed by Procedure 3 is a target set for S of weighted congestion at most

(2W/p + 16t)(p/r)^δ/γ.

PROOF. By Lemma 7, the (unweighted) congestion of each T_i is at most

(|Ψ_i|/p + 4)(p/r)^δ/γ.

Each edge in T_i contributes a weight of at most 2^(i+1) to its endpoint in P, therefore the weighted congestion of T is bounded from above by

Σ_{i=0..⌊log t⌋} 2^(i+1) (|Ψ_i|/p + 4)(p/r)^δ/γ ≤ (2W/p + 16t)(p/r)^δ/γ,

since W ≥ Σ_{i=0..⌊log t⌋} 2^i |Ψ_i|.

Notice that when r = 2 log(M/p) and δ = O(1/log p), the target set determined by Procedure 3 has weighted congestion O(W/p + t), which is asymptotically optimal.

4.3. BSP* Implementation of Target Set Selection. We now describe implementations of Procedures 1–3, presented in the previous subsection, on a BSP*(p, g, L, B). In fact, we consider a slightly more general scenario where such procedures are run in parallel on several input sets.

We consider Procedure 1 first and let S_1, S_2, ..., S_z be z ≤ p disjoint subsets of V of size p/r each. Each S_i will be referred to as class i. There is an access-request associated with each element in S_1 ∪ ... ∪ S_z. Each request is issued by a processor, which is called the origin of the request. We assume that every processor issues the same number (z/r) of access-requests. For 1 ≤ i ≤ z, let E_i be an r/2-bundle for S_i. We want to determine a target set of congestion at most k = (2/γ)(p/r)^δ for every S_i using only edges in E_i. Each edge in ∪_{i=1..z} E_i is represented by a packet consisting of five fields: request, dest, origin, class, and flag. Let v ∈ S_i and (v, P) ∈ E_i, and let P′ ∈ P be the origin of the request for v. Then (v, P) is represented by a packet held by P′, which has request = v, dest = P, origin = P′, class = i, flag = 0. Let n = Σ_{i=1..z} |E_i| and assume that each processor initially holds n/p packets. At the end of the procedure we require that only those packets corresponding to edges in the target sets are left, with each packet residing in its origin. Procedure 1 from the previous subsection can be performed in parallel for every class S_i as follows.

BSP* Implementation of Procedure 1

1. Repeat the following substeps until no packets with flag = 0 are left:
   (a) Sort all packets with flag = 0 by (class, dest);
   (b) Set flag = 1 in those packets whose (class, dest) pair occurs in less than k packets;
   (c) Undo the sorting of Step 1(a);
   (d) For each v ∈ ∪_{i=1..z} S_i such that there is at least one packet with request = v and flag = 1, select one such packet and delete all other packets with request = v;
   (e) Balance all packets with flag = 0 among the processors;
2. Send all packets with flag = 1 to the processors specified in their origin field.

It is easy to see that the above sequence of steps correctly implements Procedure 1 and determines a target set of congestion at most (2/γ)(p/r)^δ for every S_i. Moreover, upon completion, the surviving packets of each class identify the target set for the class and reside, as required, in their origins.

LEMMA 9. For arbitrary constants ε > 0 and c ≥ 1 the following holds. If n = Ω(p^(1+ε)), then Procedure 1 can be implemented on a BSP*(p, g, L, B) in time W + g(S + BH) + LT, where

W = O(n/p),
S = O(n/p),
H = O((n/p)^(1/c)),
T = O(log p).

PROOF. Consider the first iteration of Step 1. It is easy to see that: the sorting of Substep 1(a) is an instance of n-sorting on keys representable by integers in the range [0, ..., p^2 − 1]; Substep 1(b) requires an execution of n-scan; in Substep 1(c) the packets are moved back to the processors on which they resided before the sorting, thus an (n/p)-relation is executed; Substep 1(d) can be executed by running an instance of n-scan followed by O(n/p) local computation steps; Substep 1(e) can be executed by running an instance of n-scan followed by an (n/p)-relation. By Lemmas 2–4 all of these substeps can be executed within the stated time bounds. Since by Lemma 6 the number of packets in each iteration decreases geometrically and at most O(log p) iterations are performed, we conclude that Step 1 is executed within the stated time bounds. The fact that the packets decrease geometrically in each iteration also implies that at the beginning of Step 2 each processor has O(n/p) packets with flag = 1. Hence, Step 2 is executed as an instance of an (n/p)-relation and its complexity is dominated by that of Step 1.

We now turn our attention to Procedure 2 and consider, as input, z ≤ p subsets of V of size p each, namely S_1, S_2, ..., S_z. For 1 ≤ i ≤ z, S_i is referred to as class i. For each element in ∪_{i=1..z} S_i there is an access-request issued by a processor, which is called the origin of the request. We assume that the requests are sorted by classes and that each processor issues z requests. We want to run Procedure 2 to determine a target set of congestion at most k = (5/γ)(p/r)^δ for every S_i. Each processor holds z counters, one for each class. Initially, the counters are set to zero. At the end of the procedure, for each access-request associated to a node v, the respective edge of the target set is stored in the origin of v. Procedure 2 can be executed in parallel for each S_i as follows:

BSP* Implementation of Procedure 2

1. Split each class S_i into r groups of size p/r. Let S_i(j) denote the j-th group of S_i and let Φ_j = ∪_{i=1..z} S_i(j). The splitting is done such that each Φ_j is evenly distributed among the processors.
2. For 1 ≤ j ≤ r do:
   (a) For each v ∈ Φ_j, create r packets, one for each element in Γ(v), consisting of the five fields request, dest, origin, class, and flag. The request field is set to v, dest is set to the corresponding P ∈ Γ(v), origin is set to the processor that issued the request, class is set to i if v belongs to S_i, and flag is set to 0.
   (b) Sort all packets by the ordered pair (dest, class).
       Comment: After the sorting, all packets with the same (dest, class) pair are consecutive. Call such packets "companions" and the first packet in the sequence of companions their "leader."
   (c) Each processor issues, for each of the leaders it holds, a count-request that contains the class value of the leader. If the leader's dest value is P, the count-request is sent to processor P. P replies "yes" to the count-request if its internal counter for the corresponding class is at most k; otherwise it replies "no." If a leader gets the answer "no," flag is set to 1 in all companions.
       Comment: For each class, the packets with flag = 0 identify those edges incident on processors that have not yet reached the congestion threshold k.
   (d) Undo the sorting of Step 2(b).
   (e) Let S′_i(j) denote the set of access-requests, among those in S_i(j), that have at least r/2 packets with flag = 0, and let Φ′_j = ∪_{i=1..z} S′_i(j). For each v ∈ Φ′_j delete all packets with flag = 1 (at most r/2); for each v ∈ Φ_j − Φ′_j delete all packets.
   (f) Balance the access-requests in Φ_j − Φ′_j (leftover requests) evenly among the processors.
   (g) Apply in parallel Procedure 1 to find a target set for each S′_i(j), 1 ≤ i ≤ z, using the edges corresponding to the packets with flag = 0. At the end, for each v ∈ Φ′_j the origin processor for v saves the target set edge selected for v in some internal register.
   (h) Sort the packets corresponding to the target sets according to the ordered pair (dest, class). For each group of companions the size of the group is determined. Each processor issues, for each leader it holds, an increase-request that contains the number of its companions and its class value. If the leader's dest value is P, then the increase-request is sent to processor P. A processor that receives an increase-request increases the counter of the corresponding class by the number specified in the request.
   (i) Delete all packets.
3. Let S̄ = ∪_{j=1..r} (Φ_j − Φ′_j) and let S̄_i = S̄ ∩ S_i, for 1 ≤ i ≤ z. So, S̄ is the set of access-requests that have not been answered yet.
4. For each u ∈ S̄ create r packets corresponding to its incident edges.
5. Apply in parallel Procedure 1 to find a target set for each S̄_i, 1 ≤ i ≤ z.

It is easy to see that the above sequence of steps implements Procedure 2 and determines a target set of congestion at most (5/γ)(p/r)^δ for every S_i, 1 ≤ i ≤ z.

LEMMA 10. For arbitrary constants ε > 0 and c ≥ 1 the following holds. If n = zp = Ω(p^(1+ε)), then Procedure 2 can be implemented on a BSP*(p, g, L, B) in time W + g(S + BH) + LT, where

W = O(rn/p),
S = O(rn/p),
H = O(r(n/p)^(1/c)),
T = O(r log p).

PROOF. Step 1 can be implemented through n-sorting, n-scan, and the routing of an (n/p)-relation. Moreover, the sorting is done on keys in the range [1, ..., r], where r is at most p. By Lemmas 2–4 the execution of this step requires W = O(n/p), S = O(n/p), H = O((n/p)^(1/c)), and T = O(1). Step 2 is performed r times, and all iterations have the same worst-case complexity. Therefore, it is sufficient to analyze the j-th iteration and multiply its running time by r. Substep 2(a) requires O(n/p) local computation. After such a step each processor holds O(n/p) packets. It is easy to see that Substeps 2(b)–(f) can be implemented through a constant number of n-sortings, n-scans, and routings of (n/p)-relations, with the sortings done on keys in the range [0, ..., p^2 − 1]. Hence, the execution of these substeps requires W = O(n/p), S = O(n/p), H = O((n/p)^(1/c)), and T = O(1). By Lemma 9, Substep 2(g) requires W = O(n/p), S = O(n/p), H = O((n/p)^(1/c)), and T = O(log p). It is easy to see that Substeps 2(h) and (i) can be implemented through a constant number of sortings, scans, and routings, and their running times are dominated by those of previous substeps. Thus, we conclude that the execution of each iteration of Step 2 requires W = O(n/p), S = O(n/p), H = O((n/p)^(1/c)), and T = O(log p). Hence, the complexity of the step is within the stated time bounds.

Note that the balancing in Substep 2(f) can be done in such a way that at the beginning of Step 3 each processor holds O(⌈|S̄|/p⌉) access-requests. We know that each S̄_i contains at most p/r requests, hence each processor holds O(n/(pr)) access-requests. Then we can easily conclude that Steps 3–4 can be executed within the stated time bounds. The complexity of Step 5 is given by Lemma 9 and is dominated by that of the previous steps.

We are now ready to describe the implementation of Procedure 3. As input we have a set S ⊆ V of access-requests. Let N denote the number of access-requests in S. Each access-request is issued by a processor referred to as the origin of the request. The access-requests in S are weighted. Let W be the total weight of the requests, let t be the maximum weight of a request in S, and let each processor issue at most N/p access-requests. We want to apply Procedure 3 to determine a target set T for S of weighted congestion at most (2W/p + 16t)(p/r)^δ/γ, so that at the end each edge of the target set is stored by the origin of the corresponding request.

BSP* Implementation of Procedure 3

1. Partition S into families Ψ_0, Ψ_1, ..., Ψ_⌊log t⌋, where Ψ_i consists of all requests of weight at least 2^i and less than 2^(i+1), for 0 ≤ i ≤ ⌊log t⌋. Then sort the access-requests by family.
2. Split each family into classes as follows: a family of size at most p forms a class on its own, while a family of size x > p is subdivided into ⌈x/p⌉ classes of size p each (except for the last one, which may have size smaller than p).
3. If N ≥ p^2, then the classes are evenly distributed among the processors and for each class a target set is generated sequentially by employing Procedure 2.
4. If N < p^2, then run Procedure 2 in order to determine a target set for each class of access-requests.
5. Return the edges of the target sets to the origins of the corresponding access-requests.
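Step 2 above is a plain chunking of each family into classes of size at most p; a minimal sketch (our code):

```python
def split_into_classes(family, p):
    """Cut a family into ceil(len/p) classes of size p (the last possibly
    smaller); a family of size <= p stays a single class."""
    return [family[i:i + p] for i in range(0, len(family), p)]

print(split_into_classes(list(range(7)), p=3))  # three classes: sizes 3, 3, 1
```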

It is easy to see that the above sequence of steps correctly implements Procedure 3 and determines a target set of the desired weighted congestion for S.

LEMMA 11. For arbitrary constants ε > 0 and c ≥ 1 the following holds. If N = Ω(p^(1+ε)) and W is polynomial in N, then Procedure 3 can be implemented on a BSP*(p, g, L, B) in time W + g(S + BH) + LT, where

W = O(r(N/p)),
S = O(r(N/p)),
H = O(r(N/p)^(1/c)),
T = O(r log p).

If ε ≥ 1, we have

W = O(r(N/p)),
S = O(N/p),
H = O((N/p)^(1/c)),
T = O(1).

PROOF. It can easily be seen that Steps 1, 2, and 5 can be executed by means of a constant number of N-sortings, N-scans, and routings of (N/p)-relations. Moreover, the sortings are on positive integer keys with maximum value polynomial in N/p. By Lemmas 2–4, these steps can be executed with W = S = O(N/p), H = O((N/p)^(1/c)), and T = O(1). Now, it is easy to see that Θ(N/p) classes are formed in Step 2. If N ≥ p^2 (i.e., ε ≥ 1), then we have Ω(p) classes and the target set for each class is determined sequentially. Since Procedure 2 can be executed on a single processor in time O(rp), and since each processor is in charge of O(N/p^2) classes, we get W = O(r(N/p)). If instead N < p^2, we have fewer than p classes, and the target set for each class can be determined by employing the implementation of Procedure 2 presented before, whose complexity is given in Lemma 10. In either case the lemma follows.

5. The Deterministic Searching Protocol. The protocol we present in this section resembles the one given in [1]. The crucial difference is that our protocol copes with processor congestion by exploiting the properties of the redundant mapping scheme and the target set selection methods presented in the previous section, whereas the protocol described in [1] relies on the random distribution of the segments among the processors. For completeness, we give a full description of our protocol rather than highlighting the differences with the one in [1].

Consider a set of m segments and a set of n queries issued by the processors of a BSP*(p, g, L, B), where each processor issues n/p queries. The segments are distributed among the BSP* processors according to the mapping scheme presented in Section 4.1. Specifically, for d = n/p, the segments are organized in a complete d-ary search tree T with M = m/(d − 1) nodes, where each node contains an ordered subset of d − 1 segments. The tree has h = Θ(log m/log d) levels, numbered from 0 to h − 1. Each tree node is replicated into r copies which are stored by distinct processors as prescribed by a memory map function Γ such that the bipartite graph G defined by Γ is r-regular and has (γ, δ)-expansion, with γ = Θ(1) and δ = 2 log_p(M/p)/r. Hence, each processor stores O(Mr/p) copies of tree nodes, that is, O(mr/p) copies of segments.
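The organization of the segments into a complete d-ary search tree can be sketched as follows (our illustrative code, restricted to the perfect-tree case where the number of splitter keys is d^h − 1; the heap-style numbering, in which child j of node v is v·d + j + 1, realizes a level-by-level, left-to-right numbering). Each node stores d − 1 splitters, and a query descends by binary search at each node.

```python
from bisect import bisect_right

def build_tree(keys, d, v=0, nodes=None):
    """Pack d^h - 1 sorted splitter keys into a complete d-ary search tree."""
    if nodes is None:
        nodes = {}
    block = (len(keys) + 1) // d  # size of each child range plus its splitter
    nodes[v] = [keys[(i + 1) * block - 1] for i in range(d - 1)]
    if block > 1:
        for i in range(d):
            build_tree(keys[i * block:(i + 1) * block - 1], d, v * d + i + 1, nodes)
    return nodes

def search(nodes, d, q):
    """Descend from the root; return the leaf reached and the local segment index."""
    v = 0
    while True:
        j = bisect_right(nodes[v], q)  # which of the d subranges q falls in
        child = v * d + j + 1
        if child not in nodes:
            return v, j
        v = child

d = 4
keys = list(range(10, 160, 10))  # 15 = d^2 - 1 splitters -> 16 segments
tree = build_tree(keys, d)       # root 0 holds [40, 80, 120]; leaves 1..4
print(search(tree, d, 47))       # leaf 2, local segment 0: the interval (40, 50)
```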

We assume that the distribution of tree nodes to the processors is done prior to the execution of the searching protocol, so its cost will not be accounted for in the protocol's running time. In this respect, our multisearch scheme is designed for applications where several batches of queries must be answered for a fixed set of segments, so that the setup cost can be hidden by the cost of the repeated executions of the searching protocol. We also assume that each processor has an internal representation of the memory map Γ associated with G which allows it to determine the location of a copy of any node v of T in constant time. This latter assumption will be discussed at the end of the section.


The protocol consists of h rounds. In round ℓ, 0 ≤ ℓ < h, each query is brought together with a copy of the node on its search path (if any) which is at level ℓ of the tree. Then the query is executed, that is, a binary search is performed on the d − 1 segments associated with the node. If the segment the query belongs to is found, then the query is answered and will not play any further role in the protocol; otherwise the node at level ℓ + 1 on the query's search path is determined. We have to deal with the problem of node congestion, which arises if some nodes are visited by many queries, and the problem of processor congestion, which arises if many queries are brought together with nodes residing in the same processor. The problem of processor congestion will be solved by applying the target set selection procedure (Procedure 3), whereas the problem of node congestion will be solved as described below.

DEFINITION 3. Let v be a node of T. The set of queries that have v in their search path forms the job at node v, denoted by J(v). If v is at level ℓ in T, then J(v) is referred to as a job at level ℓ. J(v) is regarded as a small job if |J(v)| < d, and as a large job otherwise.

In order to bring together the queries with the nodes they have to visit, we pursue different strategies, one for small jobs and one for large jobs. It is convenient to enforce that at the beginning of each round the queries are evenly distributed among the processors, with O(n/p) queries per processor, and queries from the same job are packed in consecutive processors. For each job, the first processor holding its queries is called the leader and the others are called companions. For each small job, a copy of the corresponding tree node is selected and the job is sent to the processor storing such copy. For each large job, the job leader selects and fetches a copy of the corresponding tree node and broadcasts it to its companions. The need for distinct and somewhat opposite strategies for dealing with small and large jobs can be intuitively justified as follows. A tree node has size Θ(d) (it contains d − 1 segments) and d needs to be fairly large to make the tree shallow, thus reducing the number of rounds in the protocol, which are expensive since each round involves a target set selection. If for each small job a copy of the corresponding tree node were sent to the (at most two) processors holding queries of the job, it may happen that a processor that holds n/p queries of distinct size-1 jobs receives Θ(dn/p) items overall, which is too costly for our purposes. On the other hand, moving a large job to the processor storing a copy of the corresponding tree node may cause severe processor congestion if the job size is large.

More formally, let the nodes of the tree be numbered level by level from the root to the leaves, and left to right within each level. The multisearch algorithm enforces the following invariants at the beginning of each round ℓ, for 0 ≤ ℓ ≤ h − 1.

1. Each query is labeled with the node it visits on level ℓ, and with the type of job ("large" or "small") it belonged to on level ℓ − 1.

2. Queries are sorted by the node they visited on level ℓ − 1, and each processor holds O(n/p) such queries. This implies that queries belonging to the same job are contiguous, even though they may span several processors.

If we assume that there is only one job (a large job) on level −1 and that each query is initially labeled with the root of the tree, then the invariants clearly hold at the beginning of round 0.

Searching Protocol

For ℓ = 0 to h − 1 do:

1. Sort all queries by the node they visit on level ℓ, in such a way that after the sorting each processor holds O(n/p) such queries.
   Comment: Queries that visit the same node on level ℓ also visited the same node on the previous level. Thus, due to Invariant 2, we can sort queries that visited different nodes on the previous level separately from each other. This results in a segmented sorting problem with keys from a range of size d.
2. Identify the small jobs on level ℓ whose parent jobs on level ℓ − 1 were large, and change their label from "large" to "small."
   Comment: At this point, the labels attached to the queries denote their type of job at level ℓ. Thus Invariant 1 for the next iteration is enforced. Note that, as a consequence of Step 1 and the fact that the size of a small job is less than d = n/p, each small job is stored by at most two processors.
3. Compact each small job stored by two contiguous processors into the first of the two.
   Comment: After this step we still have O(n/p) queries per processor and Invariant 2 is enforced.
4. Deal with large jobs:
   (a) For each large job, identify the processors (leader and companions) holding the job.
   (b) For each large job, the leader issues an access-request for the corresponding tree node. The request is assigned weight d. Then Procedure 3 is applied to select a target set for all access-requests with weighted congestion O((n/p)(p/r)^δ).
       Comment: As a consequence of Step 1, and since large jobs have size at least d, each processor issues at most O(n/(pd)) = O(1) access-requests, hence the combined weight of all requests is at most n. Therefore, Procedure 3 is applied with W = n and t = d.
   (c) Each leader fetches the selected copy and broadcasts it to its companions.
5. Deal with small jobs:
   (a) For each small job, the processor that holds the job issues an access-request for the corresponding node. The request is assigned weight equal to the size of the job. Then Procedure 3 is applied to select a target set for all access-requests with weighted congestion O((n/p)(p/r)^δ).
       Comment: Note that each processor issues at most O(n/p) access-requests of total weight O(n/p). Furthermore, the maximum weight of a request is d and the combined weight of all requests is at most n. Therefore, Procedure 3 is applied with W = n and t = d.
   (b) Send each small job to the processor that holds the selected copy of the corresponding node.
6. Execute the queries: For each query, the processor that holds the query executes a binary search on the d − 1 segments contained in the node that the query visits on level ℓ. As a result, either the segment the query belongs to is found, and the query is answered, or the node on level ℓ + 1 which the query will visit next is determined. In the latter case, the query is labeled with that node.
7. Return each small job to the processor it resided on before Step 5(b).

It is easy to see that the above protocol is correct and that the invariants are maintained at each iteration. The running time is established in the following theorem.

THEOREM 3. For arbitrary constants ε > 0 and c ≥ 1 the following holds. If the m segments are organized according to the redundant mapping scheme presented in Section 4, then, for n = Ω(p^(1+ε)), the above searching protocol can be implemented on a BSP*(p, g, L, B) in worst-case time W + g(S + BH) + TL, where

W = O((n log m/p)(r/log(n/p) + (p/r)^δ)),
S = O((n log m/p) · (r + (p/r)^δ)/log(n/p)),
H = O((n/p)^(1/c) · r log m/log(n/p)),
T = O(r log p log m/log(n/p)).

If ε ≥ 1, then W is as above, while

S = O((n log m/p) · (p/r)^δ/log(n/p)),
H = O((n/p)^(1/c) · log m/log(n/p)),
T = O(log m/log(n/p)).

Moreover, in all cases Θ(mr/p) space per processor is used.

Before proving the theorem we show how to derive from it the worst-case performance claimed in Theorem 1 stated in the Introduction. First notice that since δ = 2 log_p(M/p)/r, M = m/(d − 1), and d = n/p, we have

(p/r)^δ = O((m/n)^(2/r)).

Let n ≥ p · max{p^(ε1), B^(ε2), L log p} for some arbitrary constants ε1 > 0 and ε2 > 1, hence n = Ω(p^(1+ε)), for some constant ε > 0. We have log(n/p) = Θ(log n) and L log p ≤ n/p. By fixing

c = ε2/(ε2 − 1) > 1

in Theorem 3 we also have B(n/p)^(1/c) = O(n/p). Simple calculations show that the worst-case running time of the searching protocol becomes as stated in Theorem 1.
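The first bound can be checked numerically (our sketch, with arbitrary sample parameter values): since δ = 2 log_p(M/p)/r we have p^δ = (M/p)^(2/r) exactly, and M/p ≤ 2m/n for d ≥ 2, so (p/r)^δ ≤ (2m/n)^(2/r).

```python
import math

p = 64
n = p * 256            # n/p = d = 256 queries per processor
m = 2 ** 20            # segments
d = n // p
M = m / (d - 1)        # tree nodes
r = 8                  # redundancy

delta = 2 * math.log(M / p, p) / r
lhs = (p / r) ** delta
print(lhs, (2 * m / n) ** (2 / r))  # lhs stays below the (2m/n)^(2/r) bound
```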

PROOF OF THEOREM 3. The running time of each of the h = O(log m/log d) rounds of the algorithm is determined as follows.

Step 1. Based on the comment following Step 1, this step can be executed by means of n-sorting on integer keys in the range [0, ..., d − 1]. By Lemma 3 it requires W = S = O(n/p), H = O((n/p)^(1/c)), and T = O(1).

Step 2. This step can be executed through an n-scan plus O(n/p) local computation. By applying Lemma 2, its complexity is as that of Step 1.

Step 3. In this step each processor sends/receives at most one message of size O(n/p). Hence, this step has the same complexity as Step 1.

Step 4(a). As Step 2.

Step 4(b). We apply Lemma 11 for this step with N = n. Note that we are overestimating the number of requests; however, this will not increase the estimate of the overall running time. If ε < 1, this step requires W = S = O(rn/p), H = O(r(n/p)^(1/c)), and T = O(r log p). Instead, if ε ≥ 1, this step requires W = O(rn/p), S = O(n/p), H = O((n/p)^(1/c)), and T = O(1).

Step 4(c). Fetching nodes involves the routing of fetch-requests to the processors that hold the copies selected in Step 4(b) and returning the requested nodes. Sending the fetch-requests involves the routing of an O((p/r)^δ)-relation, and returning the nodes involves the routing of an O((n/p)(p/r)^δ)-relation. By Lemma 4 this step can be executed with W = S = O((n/p)(p/r)^δ), H = O((n/p)^(1/c)), and T = O(1). The broadcast following the fetching involves vectors of size d. By applying Lemma 1, it is easy to see that its cost is dominated by the fetching.

Step 5(a). We apply Lemma 11 with N = n. If ε < 1, this step requires W = S = O(rn/p), H = O(r(n/p)^(1/c)), and T = O(r log p). Instead, if ε ≥ 1, this step requires W = O(rn/p), S = O(n/p), H = O((n/p)^(1/c)), and T = O(1).

Step 5(b). This step requires the routing of an (n/p)(p/r)^δ-relation. By Lemma 4 we get W = S = O((n/p)(p/r)^δ), H = O((n/p)^(1/c)), and T = O(1).

Step 6. The binary search in this step needs local computation time O((n/p)(p/r)^δ log d).

Step 7. As Step 5(b).

The theorem follows by summing up the contributions of all the steps and multiplying by the number of rounds.

As we mentioned at the beginning of this section, the analysis assumes that a processor can compute the location of a copy of any tree node in constant time. The assumption is reasonable if a simple and explicit representation for the expander graph G underlying the mapping scheme is available. However, although a number of explicit constructions of bipartite expanders are known in the literature for certain ranges of parameters (e.g., see [14]), the development of general construction techniques still remains an open problem, and, often, one can only rely on the existence of such graphs and on the fact that a random graph exhibits the desired expansion with high probability (e.g., see Lemma 5).

Suppose that a graph generated at random is used for the mapping scheme. In order to comply with the assumption that constant time is sufficient to compute the location of a copy of any node, one can store a complete look-up table in each processor. Since d = n/p, the table has size Θ(mr/d). If d = Ω(p) (i.e., n = Ω(p²)), the space required to store the table at a processor is not larger than the space needed to store the copies of the tree nodes assigned to the processor.

When d = o(p), we can still represent the graph using O(mr/p) space per processor as long as m = O(2^{d/p^γ}), for an arbitrary constant γ > 0. We partition the M = Θ(m/d) entries of the look-up table for G (each entry of size r) into M′ = Mr/d pages of d/r entries each, so that a page has size d. Each page is then replicated into r′ = 2 log(M′/p) = O(log m) copies assigned to distinct processors. The assignment of the copies of the pages to the processors is done according to an r′-regular bipartite graph G′ with (Θ(1), α)-expansion, where α = 2 log_p(M′/p)/r′ = 1/log p. Note that in this fashion each processor receives Θ(M′r′/p) pages, which require O(mr/p) storage, since r′ = O(log m), that is, r′ = O(d/p^γ).
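The paging scheme can be sketched as follows. The random placement below stands in for the expander G′ (which a random graph matches only with high probability), and all parameter values are assumptions of this illustration.

```python
import math
import random

def replicate_pages(M, r, d, p, seed=0):
    """Sketch of the page-replication scheme above: the M*r words of the
    look-up table form M' = M*r/d pages of size d, and each page gets
    r' = 2*log(M'/p) copies on distinct processors (random placement
    here, standing in for the expander G')."""
    rng = random.Random(seed)
    M1 = (M * r) // d                                # M' pages
    r1 = max(1, math.ceil(2 * math.log2(M1 / p)))    # r' copies per page
    return [rng.sample(range(p), r1) for _ in range(M1)]

# Illustrative parameters (not from the paper).
p, r, d = 16, 4, 64
M = 4096
placement = replicate_pages(M, r, d, p)
M1 = (M * r) // d
assert len(placement) == M1
# every page's copies sit on distinct processors
assert all(len(set(copies)) == len(copies) for copies in placement)
```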

Now, consider a round of the searching protocol. In Steps 4(b) and 5(a) the processors need access to O(n) entries of the look-up table, with each processor accessing O(n/p) entries. Note that, due to Invariant 2 and the ordering of the tree nodes, the entries to be accessed by the processors are sorted by their index in the table. As a consequence, entries relative to the same page are contiguous, thus forming a "job." By labeling these new jobs according to their size (i.e., distinguishing between small and large jobs) and by employing the same strategy as we do for the queries, we can bring together each access request for a table entry with a copy of the page containing the entry, and thus retrieve the entry by means of binary search in the page. It is not difficult to see that the time needed for this yields W = S = O(r′n/p), H = O(r′(n/p)^{1/c}), and T = O(r′ log p). If we add these contributions over all rounds and substitute log m for r′, the overall cost of the searching protocol becomes W + g(S + BH) + TL, where

    W = O( (n log m / p) · ( log m/log(n/p) + (p/r)^δ ) ),
    S = O( (n log m / p) · ( log m + (p/r)^δ ) / log(n/p) ),
    H = O( (n/p)^{1/c} · log² m / log(n/p) ),
    T = O( log p · log² m / log(n/p) ).

Thus, when n ≥ p · max{p^{ε₁}, B^{ε₂}, L log p} for some arbitrary constants ε₁ > 0 and ε₂ > 1, the protocol's running time becomes

    O( (n log m / p) · ( g log m/log n + ⌈g/log n⌉ · (m/n)^{2/r} ) ).


Consequently, optimal O(n log m/p) time can be achieved when m is polynomial in n and g = O(1), by choosing r = O(log n), as claimed in Corollary 1.

The above argument implicitly assumes that the graph G′ is efficiently represented, allowing each processor to determine the location of a copy of any page in constant time. This requires a look-up table for G′ stored in each processor. Note that since r′ = O(d/p^γ), such a table has size

    M′r′ ≤ Mr·r′/d = O( Mr/p^γ ),

which is a factor at least p^γ smaller than the size of the table for G. If the look-up table for G′ is still larger than O(mr/p), we can apply the above strategy recursively. After a constant number of recursive steps, which do not affect the overall running time, we are reduced to a graph sufficiently small so that its table fits in one processor.

6. Improving the Performance by Randomization. Observe that the redundancy r is a factor in the running time of the target set selection procedure, namely Procedure 3, which is used in Steps 4(b) and 5(a) of the searching protocol. In some cases, this yields a factor r in the protocol's running time. In this section we show how the factor r can be saved by replacing Procedure 3 with a randomized target selection procedure. It is important to remark that randomization is introduced only in the searching protocol and not in the mapping scheme. This is in contrast to the approach followed in [1], where a randomized mapping and a deterministic searching were applied. The advantage of our solution is that when the searching protocol performs poorly for a set of queries, it is sufficient to run the protocol again, while in the multisearch scheme of [1] it was necessary to change the mapping scheme, which is expensive.

6.1. Randomized Target Set Selection. The problem with the implementation of Procedure 3 (see Section 4.3) is that r packets are generated for each access-request. So, given N access-requests, Procedure 3 has to deal with up to Nr packets, thus introducing the factor r in the running time. The randomized strategy presented in this subsection performs O(log r) rounds and generates, for each access-request, only one packet per round. Moreover, the number of access-requests decreases geometrically in each round, thus yielding a factor-r improvement in the running time.

Suppose that the M tree nodes (set V) are distributed among the p processors (set P) by a memory map function Γ, so that the graph G defined by Γ has (γ, δ)-expansion, with γ = Θ(1) and δ = 2 log_p(M/p)/r. For S ⊆ V let E be the set of edges in G incident on S. Let the nodes in S be weighted such that the total weight is W and the maximum weight is t. We assume that t = O(W/p), r = o(p/log p), and |S| = Ω(p^{1+ε}), for an arbitrary constant ε > 0. For any constant c > 0 a target set T of S with congestion O(((W/p) + t)(p/r)^δ + tr log p) can be constructed from E with probability at least 1 − log_{3/2} r/p^c, by using the following procedure.

Procedure 4

1. R := S; F := E; T := ∅;
2. For i = 1 to log_{3/2} r do
   (a) For each v ∈ R randomly choose an edge from F(v). Let F′ be the set of edges randomly chosen in this step.
   (b) Identify the set Q of processors that are incident with edges in F′ with a total weight of at least C_i := 3(2/3)^i ( ((W/p) + t)(p/r)^δ (1/γ) + tr(c + 1) log p ).
   (c) T := T ∪ (F′ − F′(Q)). Let R′ be the set of nodes in R that are incident with an edge in F′ − F′(Q). Set F := F − F(R′); R := R − R′.
3. Let W′ be the weight of the nodes in R after Step 2. Find a target set T′ for R with congestion O(((W′/p) + t)(p/r)^δ) by using Procedure 3 from Section 4.2.
4. T := T ∪ T′.
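The loop of Procedure 4 can be sketched sequentially. In the sketch below, the fallback that assigns each surviving node its first copy stands in for the call to Procedure 3 in Step 3, and the instance is a toy random one; both are assumptions of this illustration.

```python
import math
import random

def randomized_target_selection(edges, weights, p, C, seed=0):
    """Sequential sketch of Procedure 4: edges[v] lists the r processors
    holding copies of node v, and C(i) is the congestion threshold of
    round i. A naive fallback replaces the final call to Procedure 3."""
    rng = random.Random(seed)
    r = len(edges[0])
    alive = set(range(len(edges)))
    target = {}
    for i in range(1, math.ceil(math.log(r, 1.5)) + 1):
        choice = {v: rng.choice(edges[v]) for v in alive}       # Step 2(a)
        load = {}
        for v, q in choice.items():
            load[q] = load.get(q, 0) + weights[v]
        heavy = {q for q, w in load.items() if w >= C(i)}       # Step 2(b)
        for v, q in list(choice.items()):                       # Step 2(c)
            if q not in heavy:
                target[v] = q
                alive.discard(v)
    for v in alive:                 # fallback in place of Procedure 3
        target[v] = edges[v][0]
    return target

# Toy instance: 200 unit-weight nodes, r = 8 copies each, 16 processors.
rng = random.Random(1)
edges = [rng.sample(range(16), 8) for _ in range(200)]
weights = [1] * 200
target = randomized_target_selection(edges, weights, 16, lambda i: 40 * (2 / 3) ** i)
assert len(target) == 200
assert all(target[v] in edges[v] for v in range(200))
```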

Let R_i denote the set R at the end of the i-th iteration of the loop in Step 2, and let w(R_i) denote the sum of the weights of the nodes in R_i. Set R_0 = S and w(R_0) = W.

LEMMA 12. Procedure 4 produces a target set of congestion

    O( ((W/p) + t)(p/r)^δ + tr log p ).

Moreover, for 1 ≤ i ≤ log_{3/2} r, we have

    w(R_i) ≤ W(2/3)^i + (p/r)t,
    |R_i| ≤ |S|(2/3)^i + (p/r),

with probability at least (1 − 1/p^c)^i, where c ≥ 1 is an arbitrary positive constant.

Note that the lemma guarantees that after Step 2 there are only |S|/r + p/r nodes of S for which no incident edge has been added to the target set, with probability at least 1 − log_{3/2} r/p^c.

PROOF. The bound on the congestion can be easily proved by observing that the amount of congestion contributed by each iteration of Step 2 decreases geometrically. To prove the bounds on w(R_i) and |R_i| we need the following well-known tail estimate.

FACT 1 [15]. Let X_1, . . . , X_k be independent random variables with X_i ∈ [0, ℓ] and μ ≥ E( Σ_{i=1}^k X_i ). Then, for any η > 0,

    Prob( Σ_{i=1}^k X_i ≥ (1 + η)μ ) ≤ ( e^η / (1 + η)^{1+η} )^{μ/ℓ}.
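A quick Monte Carlo sanity check of Fact 1 for Bernoulli variables (so ℓ = 1 and the bound reads (e^η/(1+η)^{1+η})^μ); the parameters below are illustrative assumptions.

```python
import math
import random

k, q = 200, 0.1        # 200 indicator variables, each 1 with probability 0.1
mu = k * q             # mu >= E(sum X_i) = 20
eta = 1.0
bound = (math.e ** eta / (1 + eta) ** (1 + eta)) ** mu   # here (e/4)^20

rng = random.Random(0)
trials = 5000
hits = sum(
    sum(rng.random() < q for _ in range(k)) >= (1 + eta) * mu
    for _ in range(trials)
)
assert bound < 1
assert hits / trials <= bound + 0.01   # empirical tail lies under the bound
```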

We use induction on the number of iterations i. For i = 0 the lemma clearly holds. So assume that the induction hypothesis holds for some i, and let F_i be the set F at the end of the i-th iteration of the loop. Let also Q_i be the subset of processors such that each processor in Q_i is incident with edges in F_i of total weight larger than k = C_i r(1/3).

First we show that there is a subset R̄_i of R_i of size at least |R_i| − (p/r) such that each node in R̄_i has at least r/2 edges pointing into processors from P − Q_i, that is, at least half of the edges incident on a node of R̄_i point into lightly congested processors. For a contradiction, assume that |R̄_i| < |R_i| − (p/r), which implies that more than p/r nodes in R_i have more than r/2 edges pointing into processors of Q_i (i.e., into highly congested processors). By the expansion property of G, and for i ≤ log_{3/2} r, we conclude that the weight of edges in F_i is more than

    γ r(p/r)^{1−δ} k = γ r(p/r)^{1−δ} C_i r(1/3)
                     ≥ γ r(p/r)^{1−δ} (2/3)^i ( ((W/p) + t)(p/r)^δ (1/γ) ) r
                     = (2/3)^i (W + pt) r
                     ≥ ( (2/3)^i W + (pt/r) ) r.

This is a contradiction, since, by the induction hypothesis, the cumulative weight of the edges in F_i is at most r·w(R_i) ≤ r( (2/3)^i W + (pt/r) ). Hence, we know that there is a subset R̄_i of R_i of size at least |R_i| − (p/r) such that each node in this set has at least r/2 edges into processors from P − Q_i.

Next we show that no processor in P − Q_i will be included in Q in Step 2(b) of the (i + 1)st iteration (call this "event A"), with high probability. Let P be a processor in P − Q_i. Let n be the number of edges from F_i incident on P and let w_j be the weight associated with the j-th such edge. Hence, Σ_{j=1}^n w_j ≤ k = C_i r(1/3). Note that each of the edges in F_i pointing into P is chosen with probability 1/r in Step 2(a). So, let X_1, . . . , X_n be independent random variables with X_j = w_j if the j-th edge is chosen in Step 2(a), and X_j = 0 otherwise. So we have E( Σ_{j=1}^n X_j ) ≤ k/r. By Fact 1 we conclude that

    Prob( Σ_{j=1}^n X_j ≥ C_i ) = Prob( Σ_{j=1}^n X_j ≥ 3k/r ) ≤ (e²/27)^{k/(rt)} ≤ (1/3)^{(c+1) log p}.

So, as long as (c + 1) log p ≥ 3, processor P will belong to the set Q in Step 2(b) with probability at most 1/(3p^{c+1}). Hence, the probability that any processor in P − Q_i belongs to Q is at most 1/(3p^c), that is,

    Prob(A) ≥ 1 − 1/(3p^c).

Note that if event A occurs, each node in R̄_i will not belong to R_{i+1} with probability at least 1/2 (we say in this case that the node dies). We now show that, given A, the bounds on w(R_{i+1}) and |R_{i+1}| hold with high probability.

We start with w(R_{i+1}). We know that w(R̄_i) ≥ w(R_i) − (pt/r) and that, by induction, w(R_i) ≤ W(2/3)^i + (pt/r). An easy argument can be made to show that for our purposes the most unfavorable case is when w(R̄_i) = W(2/3)^i. Let w_1, . . . , w_{|R̄_i|} be the weights of the nodes in R̄_i, and let X_1, . . . , X_{|R̄_i|} be independent random variables with X_j = 0 if the j-th node in R̄_i dies, and X_j = w_j otherwise. So, Σ_{j=1}^{|R̄_i|} X_j represents the weight from R̄_i that contributes to w(R_{i+1}). Since each X_j is 0 with probability at least 1/2, we have that E( Σ_{j=1}^{|R̄_i|} X_j ) ≤ W(2/3)^i / 2. Hence, by Fact 1 we get

    Prob( Σ_{j=1}^{|R̄_i|} X_j ≥ (2/3)^{i+1} W ) = Prob( Σ_{j=1}^{|R̄_i|} X_j ≥ (4/3)(2/3)^i W/2 ) ≤ ( e^{1/3} / (4/3)^{4/3} )^{W/(2rt)},

which is smaller than 1/(3p^c) since we have t = O(W/p) and r = o(p/log p). So at most W(2/3)^{i+1} weight from R̄_i contributes to w(R_{i+1}). Additionally, at most pt/r weight from R_i − R̄_i contributes to w(R_{i+1}). Hence, we conclude that w(R_{i+1}) ≤ (2/3)^{i+1} W + (pt/r), with probability at least 1 − 1/(3p^c), which proves one part of the induction step.

Finally we prove the bound on |R_{i+1}|. Let X_1, . . . , X_{|R̄_i|} be independent random variables with X_j = 0 if the j-th node in R̄_i dies, and X_j = 1 otherwise. So, Σ_{j=1}^{|R̄_i|} X_j counts the nodes from R̄_i that will belong to R_{i+1}. Since, given A, each X_j is 1 with probability at most 1/2, we have that E( Σ_{j=1}^{|R̄_i|} X_j ) ≤ |S|(2/3)^i / 2. Hence, by Fact 1 we get

    Prob( Σ_{j=1}^{|R̄_i|} X_j ≥ (2/3)^{i+1} |S| ) = Prob( Σ_{j=1}^{|R̄_i|} X_j ≥ (4/3)(2/3)^i |S|/2 ) ≤ ( e^{1/3} / (4/3)^{4/3} )^{|S|/(2r)},

which is smaller than 1/(3p^c) since we have |S| = Ω(p^{1+ε}) and r = o(p/log p). So at most |S|(2/3)^{i+1} nodes in R̄_i will belong to R_{i+1}, with probability at least 1 − 1/(3p^c). Additionally, at most p/r nodes from R_i − R̄_i will belong to R_{i+1}. Hence, we conclude that |R_{i+1}| ≤ (2/3)^{i+1}|S| + (p/r), with probability at least 1 − 1/(3p^c). By combining the probability that event A occurs with the probability that, given A, the bounds on w(R_{i+1}) and |R_{i+1}| hold, we conclude that if the claimed bounds hold up to iteration i with probability at least (1 − 1/p^c)^i, then they hold up to iteration i + 1 with probability at least (1 − 1/p^c)^i (1 − 1/p^c) = (1 − 1/p^c)^{i+1}.

We are now ready to describe the implementation of Procedure 4 on the BSP* model. As input we have a set S ⊆ V of access-requests. Let N denote the number of access-requests in S. Each access-request is stored by the processor that issues the request. As before, this processor is called the origin of the request. The access-requests in S are weighted. Let W be the total weight of the requests and let t be the maximum weight of a request in S. Initially, each processor holds at most N/p access-requests. At the end of the procedure, for each node v ∈ S, the respective edge of the target set is stored in the origin of v. Initially, all requests are marked "alive."

BSP* Implementation of Procedure 4

1. For i := 1 to log_{3/2} r do
   (a) For every access-request v ∈ S which is "alive," the processor P where the request resides creates a packet consisting of the five fields request, dest, weight, origin, and flag, with request = v, origin = P, flag = 0, and dest set to a randomly chosen processor in Γ(v).
   (b) Sort the packets by the dest value.
       Comment: After the sorting, all packets with the same dest value are consecutive. Call such packets "companions" and the first packet in the sequence the "leader."
   (c) Determine for each group of companions the sum of the weight values of the packets in the group. If this number is smaller than C_i, set flag to 1 in all companions.
   (d) Undo the sorting of Step 1(b). After that, the packets reside on the same processor as the corresponding access-request. If a packet has flag = 1, then the corresponding access-request stores the processor specified in the dest field of the packet and the access-request is marked "dead," while all other access-requests remain "alive."
   (e) Balance the access-requests which are "alive" evenly among the processors.
2. Apply Procedure 3 to determine a target set for the access-requests which are still "alive" after Step 1.

LEMMA 13. For arbitrary constants ε > 0 and c, c′ ≥ 1 the following holds. If N = Ω(p^{1+ε}), then Procedure 4 can be implemented on a BSP*(p, g, L, B) in time W + g(S + BH) + LT, where

    W = O(N/p),
    S = O(N/p),
    H = O((N/p)^{1/c}),
    T = O(r log p),

with probability at least 1 − log_{3/2} r/p^{c′}.

PROOF. First we analyze the first iteration of the loop in Step 1. The generation of the packets in Step 1(a) requires local computation time O(N/p) per processor. By Lemma 3, Step 1(b) requires W and S = O(N/p), H = O((N/p)^{1/c}), and T = O(1). Step 1(c) essentially requires an N-scan, which, by Lemma 2, takes W and S = O(N/p), H = O((N/p)^{1/c}), and T = O(1) if we choose k = p^{ε/c}. Step 1(d) requires the routing of an (N/p)-relation. Step 1(e) requires an N-scan and the routing of an (N/p)-relation. Thus, the running time of these steps is not larger than that of the previous steps.

Since the number of live access-requests decreases geometrically according to Lemma 12, Step 1 requires in total W, S = O(N/p), H = O((N/p)^{1/c}), and T = O(log r). Furthermore, after Step 1 we have only O(N/r) access-requests "alive." Then, by Lemma 11, Step 2 requires W and S = O(N/p), H = O((N/p)^{1/c}), and T = O(r log p). By summing up the complexities of Steps 1 and 2 the stated bounds on the running time follow. The probability bound follows by applying Lemma 12.

6.2. Randomized Searching Protocol. In order to answer a set of n queries, we can run the searching protocol presented in Section 5, where in Steps 4(b) and 5(a) we replace the deterministic procedure for target set selection (Procedure 3) with the randomized procedure (Procedure 4) described above. For efficiency reasons, it is convenient to modify slightly the mapping scheme of Section 4.1 as follows. As before, the segments are organized in a complete d-ary search tree T with M = m/(d − 1) nodes, where each node contains an ordered subset of d − 1 segments, and where the h = Θ(log_d m) levels are numbered from 0 to h − 1. However, we now fix d = (n/p)^{1/c}, for some constant c > 1 which will be chosen later (before we had d = n/p). Each tree node is replicated into r copies which are stored by distinct processors as prescribed by a memory map function Γ such that the bipartite graph G defined by Γ is r-regular and has (γ, δ)-expansion, with γ = Θ(1) and δ = 2 log_p(M/p)/r. Based on this mapping scheme, the randomized version of the searching protocol exhibits the following performance.

THEOREM 4. For arbitrary constants ε > 0 and c′ ≥ 1 the following holds. Let m segments be organized according to the above mapping scheme. Then, if r = o(min{(n/p)^{1−1/c}, p}/log p) and n = Ω(p^{1+ε}), the Searching Protocol of Section 5 can be implemented on the BSP*(p, g, L, B) in time W + g(S + BH) + TL, where

    W = O( (n log m / p) · (p/r)^δ ),
    S = O( (n log m / p) · (p/r)^δ / log(n/p) ),
    H = O( (n/p)^{1/c} · log m · (p/r)^δ / log(n/p) ),
    T = O( r log p · log m / log(n/p) ),

with probability at least 1 − log_d m/p^{c′}.

PROOF. The same proof as for Theorem 3 can be applied, relying on Lemma 13 rather than Lemma 11 for Steps 4(b) and 5(a). Note, however, that the congestion of the target set produced by Procedure 4 is larger than the congestion produced by Procedure 3 by an additive term dr(c + 1) log p. Since d = (n/p)^{1/c} and r = o((n/p)^{1−1/c}/log p), such an additive term becomes o(n/p), which is negligible. Since we also assumed r = o(p/log p), the probability bound follows by Lemma 13 and by observing that Procedure 4 is applied twice in each of the h = O(log_d m) rounds of the searching protocol.

Let n ≥ p · max{p^{ε₁}, B^{ε₂}, L log p · log m/log n}, for some constants ε₁ > 0 and ε₂ > 1, and fix c = ε₂/(ε₂ − 1) > 1. Thus, 1/c = 1 − 1/ε₂. Since δ = 2 log_p(M/p)/r, M = m/(d − 1), and d = (n/p)^{1/c}, we have that

    (p/r)^δ = O( ( m/(p^{1−1/c} n^{1/c}) )^{2/r} ) = O( ( m/(p^{1/ε₂} n^{1−1/ε₂}) )^{2/r} ).

Simple calculations show that the probabilistic performance as stated in Theorem 1 is attained.


7. Lower Bound. In this section we present a lower bound on the amount of communication required for performing multisearch on any p-processor distributed-memory machine where the only available storage is provided by the processors' local memories. Let S be a set of m ordered segments and consider a multisearch scheme designed to answer any set of n ≤ m/3 queries on S. Without loss of generality, we assume that for each segment σ ∈ S a certain number r_σ of copies are initially available at distinct processors. We also assume that the protocol employed by the scheme to answer a set of n queries satisfies the following conditions: (1) initially, each processor is in charge of n/p queries; (2) the answer to a query is based exclusively on the outcomes of individual comparisons between the query and the segments; and (3) a segment or a query can be communicated using a constant number of words.

The lower bound does not account for the complexity of the initial distribution of the copies of the segments among the processors; however, it does account for any movement or replication of the copies made during the execution of the algorithm. We define r = Σ_{σ∈S} r_σ/m as the redundancy of the scheme, that is, the average number of copies per segment initially available. (This notion of redundancy is more general than the one used in previous sections, where it was assumed that exactly r copies per segment were available.) We partition S into m/3 intervals I_1, I_2, . . . , I_{m/3}, where each I_j consists of three consecutive segments in the given ordering. Let σ_j denote the central segment of I_j, 1 ≤ j ≤ m/3. Consider a set of n indices 1 ≤ j_1 < j_2 < · · · < j_n ≤ m/3 and a set of n query points q_1, q_2, . . . , q_n, where q_k belongs to σ_{j_k}. It is easy to see that, in order to answer q_k correctly, q_k has to be compared with at least one segment in I_{j_k}. In other words, in order to answer all queries correctly, n comparisons between distinct queries and distinct segments are needed.

LEMMA 14. There exists a set of n distinct intervals among I_1, I_2, . . . , I_{m/3} such that all of the copies of their segments are stored in the local memories of

    p′ ≤ max{ 12r, 2p(6n/m)^{1/(6r)} }

processors.

PROOF. (The proof is similar to that of Lemma 2 in [13].) We say that an interval I_j is sparse if there is a total of at most 6r copies of its segments stored initially at the processors. Note that there are m′ ≥ m/6 sparse intervals, otherwise the nonsparse ones would account for more than mr copies of segments, which is impossible. Let p′ be the maximum value for which no set of p′ processors stores all the copies of the segments of n sparse intervals. If p′ < 12r we are done. Suppose p′ ≥ 12r. Consider a matrix with C(p, p′) rows, indexed by all subsets of p′ processors, and m′ columns, indexed by the sparse intervals. Entry (i, j) of this matrix is 1 if the j-th sparse interval has all the copies of its segments stored by the processors of the i-th subset, and it is 0 otherwise. Each row accounts for at most n − 1 ones. Each column accounts for at least C(p − 6r, p′ − 6r) ones. Thus,

    C(p, p′)(n − 1) ≥ m′ C(p − 6r, p′ − 6r),

which implies p′ ≤ 2p(6n/m)^{1/(6r)}.
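The final implication can be checked numerically on small (assumed) values: whenever p′ exceeds 2p(6n/m)^{1/(6r)} while p′ ≥ 12r, the counting inequality fails, so the maximum p′ satisfying it obeys the claimed bound.

```python
from math import comb

# Illustrative (assumed) values; C(a, b) below is the binomial coefficient.
p, r, n = 64, 1, 16
m = 6 * 4096
m1 = m // 6                                  # m' >= m/6 sparse intervals
bound = 2 * p * (6 * n / m) ** (1 / (6 * r))

# For every p' above the bound, C(p,p')*(n-1) < m' * C(p-6r, p'-6r).
for p1 in range(max(12 * r, int(bound) + 1), p + 1):
    assert comb(p, p1) * (n - 1) < m1 * comb(p - 6 * r, p1 - 6 * r)
```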


Let T_comm denote the maximum number of words sent or received by any processor in order to answer n queries. Assume that 1 ≤ r ≤ min{p/24, log(m/(6n))/12}. Consider a set of n query points belonging to the middle segments of n distinct intervals chosen according to Lemma 14. As argued before, each query has to be compared with at least one segment in the corresponding interval. By the lemma, the copies of the segments in the chosen intervals are stored by a set V′ of p′ ≤ max{12r, 2p(6n/m)^{1/(6r)}} processors. Note that when r is in the stated range, p′ ≤ p/2 and, therefore, at least (p − p′)n/p = Θ(n) queries are initially assigned to processors outside V′. In order to answer these queries, at least Θ(n) words have to be exchanged between the processors in V′ and the processors outside V′, therefore

    T_comm = Ω(n/p′) = Ω( min{ n/r, (n/p)(m/n)^{1/(6r)} } ).

This proves the result of Theorem 2 stated in the Introduction.

Acknowledgment. The authors would like to thank the anonymous referee for several helpful comments and suggestions which improved the presentation of the paper.

Appendix. Here we show how to solve the n-sorting problem within the time bounds claimed in Lemma 3. We consider n integer keys from the range [0, . . . , d − 1] distributed among the processors such that each processor holds n/p keys. The task is to redistribute the keys such that processor P_i holds the n/p keys whose rank in the sorted sequence is between i(n/p) + 1 and (i + 1)(n/p), for 0 ≤ i < p. The major challenge in executing this task is that of exploiting blockwise communication, which requires the packing of data exchanged in each superstep into few messages. We first present algorithm SmallRangeSort, which is a parallel version of counting sort and achieves efficient performance when d is small. By employing this algorithm as a subroutine and applying the radix-sort strategy, we then show how to attain the result stated in Lemma 3.

Algorithm SmallRangeSort. Assume d = O((n/p)^{1/c}), for some arbitrary constant c > 1. The algorithm performs the following steps.

1. Determine in parallel the rank of each key in the sorted sequence and its target processor, i.e., the processor that will receive the key at the end of the sorting. The ranks are determined so as to make the sorting stable.
2. Each processor locally sorts its keys by rank.
3. Define a fraction to be a set of keys with the same value, the same target processor, and residing in the same processor. Call a small fraction a fraction with at most (n/p)^{1−1/c} keys, and a large fraction a fraction with more than (n/p)^{1−1/c} keys.
   Comment: Note that since the ranks guarantee a stable sorting, each processor holds at most two fractions for each key value, for a total of 2d fractions. Moreover, fractions with the same key value are already sorted by target processor.
4. Send all large fractions to the corresponding target processors.
5. Sort in parallel the small fractions by key value in a stable way, so that at the end each processor holds at most Θ((n/p)^{1/c}) small fractions.
   Comment: Since the sorting is stable, at the end of this step all keys relative to the same target processor P reside in a group of consecutive processors, denoted by Gr(P). Moreover, a processor holds O(n/p) keys. Note that a processor belongs to at most Θ((n/p)^{1/c}) groups, but a group can span up to O((n/p)^{1−1/c}) processors.
6. Within each group Gr(P) pack the keys with target processor P into the first processor of the group.
   Comment: Since a processor belongs to at most Θ((n/p)^{1/c}) groups, at the end of this step each processor holds keys for at most Θ((n/p)^{1/c}) target processors. Moreover, the keys destined to a target processor reside in one processor.
7. Each processor combines the keys it holds for the same target processor into one message, and sends the resulting messages to their destinations.
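The net effect of Steps 1–7 can be modeled sequentially: compute stable global ranks by counting sort (Step 1) and hand processor i the keys of ranks i(n/p) + 1, . . . , (i + 1)(n/p) (Steps 2–7 realize this routing in O(1) supersteps). The model below ignores the BSP* cost structure; names and values are illustrative assumptions.

```python
def small_range_sort(keyed, p):
    """Sequential model of the redistribution SmallRangeSort achieves:
    keyed[i] holds the n/p keys of processor i; afterwards processor i
    holds the n/p keys of ranks i*(n/p)..(i+1)*(n/p)-1, sorted stably."""
    n = sum(len(bucket) for bucket in keyed)
    npp = n // p
    flat = [k for bucket in keyed for k in bucket]
    d = max(flat) + 1
    # stable counting sort over the small range [0, d), as in Step 1
    count = [0] * d
    for k in flat:
        count[k] += 1
    start = [0] * d
    for v in range(1, d):
        start[v] = start[v - 1] + count[v - 1]
    out = [0] * n
    for k in flat:               # equal keys keep their input order
        out[start[k]] = k
        start[k] += 1
    return [out[i * npp:(i + 1) * npp] for i in range(p)]

buckets = small_range_sort([[3, 1, 2], [0, 3, 1], [2, 2, 0]], p=3)
assert buckets == [[0, 0, 1], [1, 2, 2], [2, 3, 3]]
```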

It is easy to see that the algorithm is correct and provides stable sorting. Its running time is analyzed in the following claim.

CLAIM 1. If n = Ω(p^{1+ε}), for some constant ε > 0, and d = O((n/p)^{1/c}), for a constant c > 1, Algorithm SmallRangeSort can be implemented on a BSP*(p, g, L, B) in time W + g(S + BH) + TL with

    W = O(n/p),
    S = O(n/p),
    H = O((n/p)^{1/c}),
    T = O(1).

PROOF. Step 1 can be executed by means of the d-prefix and d-broadcast primitives, in addition to O(d + n/p) = O(n/p) local computation. Since d = O((n/p)^{1/c}), by applying Lemma 1 with k = min{p − 1, (n/p)^{1/c}, (n/p)^{1−1/c}}, the step requires W = S = O(n/p), H = O((n/p)^{1/c}), and T = O(1). Steps 2 and 3 require O(d + n/p) = O(n/p) local computation. In Step 4, each processor sends/receives at most (n/p)^{1/c} large fractions, for a total of n/p keys. Hence this step requires S = O(n/p), H = O((n/p)^{1/c}), and T = O(1). The stable sorting of the small fractions by key value done in Step 5 can be realized by first computing the rank and destination processor of each small fraction (through d-prefix, d-broadcast, and O(n/p) local computation), and then sending each small fraction to its destination. Note that the routing involved is such that each processor sends/receives at most Θ((n/p)^{1/c}) small fractions, for a total of O(n/p) keys. Thus, Step 5 has the same complexity as the previous ones. Step 6 can be implemented as follows. First, an O(n)-scan is executed to determine the rank of each processor within each group. Then ⌈c⌉ = O(1) iterations are executed, where in the i-th iteration keys for the same target processor P are packed into the first (n/p)^{1−i/c} processors of Gr(P), for 1 ≤ i ≤ ⌈c⌉. Thus, after the last iteration all such keys are in the first processor of the group, as required. More precisely, in iteration i, within each group Gr(P) only processors of rank j (if any) are active, for every (n/p)^{1−i/c} ≤ j ≤ (n/p)^{1−(i−1)/c}. A processor of rank j sends its keys with target processor P (combined into a single message) to the processor with rank ⌈j/(n/p)^{1/c}⌉ in the group Gr(P). In each iteration a processor sends at most one message or receives at most (n/p)^{1/c} messages. Moreover, over all ⌈c⌉ iterations a processor sends/receives at most O(n/p) keys. Hence, Step 6 requires S = O(n/p), H = O(c(n/p)^{1/c}) = O((n/p)^{1/c}), and T = O(c) = O(1). Based on the comment that follows Step 6, we have that in Step 7 each processor sends at most O(n/p) keys to at most O((n/p)^{1/c}) distinct target processors, and each processor receives O(n/p) keys from at most one processor. Thus, this step requires S = O(n/p), H = O((n/p)^{1/c}), and T = O(1). The claim follows by combining the complexities of all steps.

We are now ready to prove the result of Lemma 3. Suppose we want to sort n integer keys from the range [0, . . . , d − 1] distributed among the processors such that each processor holds n/p keys. Let d = O((n/p)^{c₂}), for some constant c₂ > 0. We fix d′ = d^{1/c′}, for some constant c′ > 0, and regard each key as a number in base d′ consisting of ⌊log_{d′} d⌋ + 1 digits in the range [0, . . . , d′ − 1]. The digits are numbered from 0 to ⌊log_{d′} d⌋, starting from the least significant. We can sort the keys in ⌊log_{d′} d⌋ + 1 = O(1) iterations, where in the i-th iteration Algorithm SmallRangeSort is used to sort the input keys by their i-th digit, in a stable way. Lemma 3 follows immediately by choosing c′ = c₁c₂ and by applying Claim 1 with c = c₁.
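The radix-sort wrap-up can be sketched as follows, with a plain stable bucket pass per digit standing in for the parallel SmallRangeSort (an assumption of this illustration).

```python
def radix_sort(keys, d, c_prime):
    """Sketch of the radix-sort strategy above: keys in [0, d) are viewed
    as numbers in base d' ~ d**(1/c_prime) and sorted least-significant
    digit first, with a stable bucket pass per digit in place of
    SmallRangeSort."""
    d1 = max(2, round(d ** (1 / c_prime)))        # base d'
    digits = 1
    while d1 ** digits < d:
        digits += 1                               # floor(log_{d'} d) + 1 digits
    for i in range(digits):                       # least significant first
        buckets = [[] for _ in range(d1)]
        for k in keys:
            buckets[(k // d1 ** i) % d1].append(k)   # stable per digit
        keys = [k for b in buckets for k in b]
    return keys

data = [170, 45, 75, 90, 2, 802, 24, 66]
assert radix_sort(data, d=1024, c_prime=2) == sorted(data)
```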

References

[1] A. Bäumker, W. Dittrich, and F. Meyer auf der Heide. Truly efficient parallel algorithms: 1-optimal multisearch for an extension of the BSP model. Theoretical Computer Science, 203(2):175–203, 1998.

[2] M. J. Atallah, F. Dehne, R. Miller, A. Rau-Chaplin, and J. J. Tsay. Multisearch techniques: parallel data structures on a mesh-connected computer. Journal of Parallel and Distributed Computing, 20(1):1–13, 1994.

[3] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.

[4] D. E. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. E. Schauser, R. Subramonian, and T. V. Eicken. LogP: a practical model of parallel computation. Communications of the ACM, 39(11):78–85, November 1996.

[5] J. H. Reif and S. Sen. Randomized algorithms for binary search and load balancing on fixed connection networks with geometric applications. SIAM Journal on Computing, 23(3):633–651, June 1994.

[6] M. J. Atallah and A. Fabri. On the multisearch problem for hypercubes. Computational Geometry: Theory and Applications, 5:293–302, 1996.

[7] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. of the 9th ACM Conference on Computational Geometry, pages 298–307, 1993.

[8] M. T. Goodrich. Randomized fully-scalable BSP techniques for multi-searching and convex hull construction. In Proc. of the 8th ACM–SIAM Symposium on Discrete Algorithms, pages 767–776, New Orleans, Louisiana, January 1997.

[9] W. Paul, U. Vishkin, and H. Wagener. Parallel dictionaries on 2-3 trees. In Proc. of the 10th International Colloquium on Automata, Languages and Programming, pages 597–609, 1983.

[10] A. G. Ranade. Maintaining dynamic ordered sets on processor networks. In Proc. of the 4th ACM Symposium on Parallel Algorithms and Architectures, pages 127–137, 1992.

[11] F. Dehne and A. Rau-Chaplin. Implementing data structures on a hypercube multiprocessor, and applications in parallel computational geometry. Journal of Parallel and Distributed Computing, 8:367–375, 1990.

[12] K. Herley and G. Bilardi. Deterministic simulations of PRAMs on bounded-degree networks. SIAM Journal on Computing, 23(2):276–292, April 1994.

[13] A. Pietracaprina and G. Pucci. The complexity of deterministic PRAM simulation on distributed memory machines. Theory of Computing Systems, 30(3):231–247, May/June 1997.

[14] A. Pietracaprina and F. P. Preparata. Practical constructive schemes for deterministic shared-memory access. Theory of Computing Systems, 30(1):3–37, Jan./Feb. 1997.

[15] W. Hoeffding. Probability inequalities for sums of bounded random variables. American Statistical Association Journal, 58(301):13–30, 1963.

