
A Pictorial Description of Cole’s Parallel Merge Sort

Torben Hagerup

Institut für Informatik, Universität Augsburg, 86135 Augsburg, Germany, [email protected]

Abstract. A largely pictorial description is given of a variant of an ingenious parallel sorting algorithm due to Richard Cole. The new description strives to achieve greater simplicity by exploiting symmetries that were not explicit in the original exposition and that can be conveyed nicely with pictures. Not paying attention to constant factors allows an additional slight simplification of the algorithm.

1 Introduction

In 1988 Richard Cole published two sorting algorithms for the parallel random-access machine or PRAM, a model of computation that comprises consecutively numbered processors with lock-step access to a shared memory [2]. One algorithm is for the concurrent-read exclusive-write or CREW variant of the PRAM, while the other algorithm works on the more restrictive exclusive-read exclusive-write or EREW PRAM. Neither PRAM variant allows writing to the same memory cell in the same step by several processors. The CREW PRAM allows reading from the same memory cell in the same step by several processors, while the EREW PRAM does not.

Both algorithms sort n items using n processors, O(log n) time and O(n) space, which is optimal, up to a constant factor, as concerns the running time, the time-processor product, and the space. The existence of PRAM algorithms with these characteristics was already implied earlier by the so-called AKS network of Ajtai, Komlós and Szemerédi [1] and its descendants, but PRAM algorithms derived in this manner are deemed impractical due to their complexity and their large constant factors.

Both of Cole’s algorithms are based on the natural paradigm of merging in a binary tree. If the merging at each level of the tree is completed before the merging at the level above it starts, the total sorting time will be Ω(log n log log n) on the CREW PRAM and Ω((log n)²) on the EREW PRAM. In order to reduce the time to O(log n), Cole developed clever schemes for pipelining the merges. In general, the merging at a level of the tree starts before the merging at the level below it has completed, in a sense using small samples of the full set of items as a “scaffolding” that allows items arriving later to be put in place more speedily.

In the case of the algorithm for the CREW PRAM, working out the details of the idea expressed in the previous paragraph leads to a complete algorithm in a relatively straightforward manner. On the EREW PRAM, however, where simultaneous reading is not allowed, things become more involved. For reasons elucidated by Cole, it is necessary to send scaffolding information not only up, but also down the tree. The result is a rather intricate pattern of interacting data streams that move any which way through the tree. Although Cole’s exposition is admirable, there are many facts to be kept in mind simultaneously and many somewhat tedious details to verify.

S. Albers, H. Alt, and S. Näher (Eds.): Festschrift Mehlhorn, LNCS 5760, pp. 143–157, 2009. © Springer-Verlag Berlin Heidelberg 2009

This work aims at a description of Cole’s sorting algorithm for the EREW PRAM that is simpler and easier to verify. One starting point is the realization that although the merge tree is clearly a rooted tree, in that the information ascends from the leaves to the root, much is to be gained in simplicity from ignoring this fact to the extent possible and considering the tree as a free (i.e., unrooted) tree. The nodes in the tree can be made to treat all of their incident edges in a uniform way. In fact, it is natural to associate computational steps not with nodes, but with edges, and to let all edges execute the same procedure in each of a number of identical stages. This lends a pleasing symmetry to the algorithm that is particularly useful when it is presented pictorially: a central part of the algorithm can be viewed as a game about drawing arrows according to certain simple rules, and one immediately notices facts whose verification at the textual level requires a certain effort and is probably less reliable.

2 Preliminaries

Consider the task of sorting elements of a universe U according to a total order < on U. The word item will be used to denote an element of U. Let −∞ and ∞ be symbolic quantities such that −∞ < x < ∞ for all items x.

For every integer k ≥ 1, if a set A consists of the items x1, . . . , xm and x1 < · · · < xm, a k-interval of A is a set of the form {x ∈ U | xi ≤ x < x_{i+k}}, where 0 ≤ i ≤ m + 1 − k, x0 = −∞, and x_{m+1} = ∞. When A and B are finite sets of items, we will say that A is a 9-cover of B if no 1-interval of A contains more than 9 elements of B and, more generally, that A is dense in B if, for every integer k ≥ 1, no k-interval of A contains more than 3k + 6 elements of B.
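These covering conditions can be checked directly on small examples. The following sketch (an illustration of the definitions; the function names are ours, not the paper’s) tests them for sorted lists of integers:

```python
import bisect
import math

def is_dense(A, B):
    """True iff, for every k >= 1, no k-interval of A contains
    more than 3k + 6 elements of B (i.e., A is dense in B)."""
    ext = [-math.inf] + sorted(A) + [math.inf]   # sentinels x0 and x_{m+1}
    B = sorted(B)
    m = len(A)
    for k in range(1, m + 2):
        for i in range(0, m + 2 - k):            # k-interval [ext[i], ext[i+k])
            lo, hi = ext[i], ext[i + k]
            if bisect.bisect_left(B, hi) - bisect.bisect_left(B, lo) > 3 * k + 6:
                return False
    return True

def is_9_cover(A, B):
    """True iff no 1-interval of A contains more than 9 elements of B."""
    ext = [-math.inf] + sorted(A) + [math.inf]
    B = sorted(B)
    return all(bisect.bisect_left(B, ext[i + 1]) - bisect.bisect_left(B, ext[i]) <= 9
               for i in range(len(ext) - 1))
```

Taking k = 1 in the density condition shows that a dense set is in particular a 9-cover.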

The rank of an item x in a finite set A of items is the number |{y ∈ A : y ≤ x}| of items in A smaller than or equal to x. For every integer c ≥ 1, let the c-sample of a finite set A of items be the subset of those items in A whose rank in A is a multiple of c. Define a regular sample to be either a 1-sample (i.e., a copy) or a 3-sample.
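For concreteness, rank and c-sample can be sketched as follows (again an illustration with our own function names):

```python
import bisect

def rank(x, A):
    """Rank of x in sorted list A: the number of items y in A with y <= x."""
    return bisect.bisect_right(A, x)

def c_sample(A, c):
    """Subset of sorted A consisting of the items whose rank in A
    (i.e., 1-based position) is a multiple of c."""
    return [A[i] for i in range(len(A)) if (i + 1) % c == 0]
```

A 1-sample is a copy of A, and a 3-sample keeps every third item.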

We shall need the following technical result, essentially due to Cole.

Lemma 1. Let A, B, A′ and B′ be finite sets of items such that A and B as well as A′ and B′ are disjoint, A is dense in A′, and B is dense in B′. Then the 3-sample of A ∪ B is dense in the 3-sample of A′ ∪ B′.

Proof. Let S and S′ be the 3-samples of A ∪ B and of A′ ∪ B′, respectively. Let I be a k-interval of S for some integer k ≥ 1 and take kA = |A ∩ I| and kB = |B ∩ I|. Since S is the 3-sample of A ∪ B, kA + kB ≤ 3k. If k̄A and k̄B are the numbers of intervals of A and of B, respectively, intersected by I, then k̄A ≤ kA + 1 and k̄B ≤ kB + 1. Because A and B are dense in A′ and B′, respectively, |(A′ ∪ B′) ∩ I| ≤ 3(k̄A + 2) + 3(k̄B + 2) ≤ 3(kA + kB + 6) ≤ 3(3k + 6). But then |S′ ∩ I| ≤ ⌈3(3k + 6)/3⌉ = 3k + 6.
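Lemma 1 can also be exercised numerically. The sketch below (an illustration under our own conventions) uses the fact that a 3-sample of a set is always dense in that set, which supplies the hypotheses, and then checks the conclusion on random disjoint sets:

```python
import bisect
import math
import random

def three_sample(X):
    X = sorted(X)
    return [X[i] for i in range(len(X)) if (i + 1) % 3 == 0]

def dense(A, B):
    """Every k-interval of A contains at most 3k + 6 elements of B."""
    ext = [-math.inf] + sorted(A) + [math.inf]
    B = sorted(B)
    m = len(A)
    return all(
        bisect.bisect_left(B, ext[i + k]) - bisect.bisect_left(B, ext[i]) <= 3 * k + 6
        for k in range(1, m + 2) for i in range(m + 2 - k))

random.seed(1)
for _ in range(200):
    pool = random.sample(range(10000), 60)
    A1, B1 = sorted(pool[:30]), sorted(pool[30:])   # A' and B', disjoint
    A, B = three_sample(A1), three_sample(B1)       # dense in A' and B'
    assert dense(A, A1) and dense(B, B1)            # hypotheses of Lemma 1
    assert dense(three_sample(A + B), three_sample(A1 + B1))  # conclusion
```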

Without this being repeated on every occasion, the following lemmas assume every set of items manipulated by an algorithm to be stored compactly in a sorted array. Moreover, when it is stated that a task can be carried out in constant time with a certain number of processors, every processor assigned to the computation is supposed to know beforehand the rank of its own number in the set of all numbers of processors assigned to the computation and the size and starting address of every array that holds part of the input or is to receive part of the output. The space needed in addition to that taken up by the input and output is constant per processor.

When A and B are sets of items, the ranking of A in B is a function that maps every item in A to its rank in B. With A represented in a sorted array as described above, the ranking of A in B is represented in an array with the same index set as that of A. We say that A is ranked in B if the ranking of A in B is available. The cross-ranking of A and B consists of the ranking of A in B and the ranking of B in A, and we will say that A and B are cross-ranked or that A is cross-ranked with B if the cross-ranking of A and B is available. As observed by Cole, a shorthand for denoting rankings is convenient, especially in a pictorial representation. The ranking of A in B and the cross-ranking of A and B will be denoted by A → B and A ↔ B, respectively. When A is a 9-cover of B, we may express this additional fact by writing the ranking of A in B as A ↠ B.

The four lemmas below deal with simple ranking problems. They are illustrated in Fig. 1 and will be referred to using the short names indicated in parentheses. In Fig. 1, the meaning of the implication arrow ⇒ is that, given the rankings to the left of the arrow, the rankings to its right can be computed in constant time with as many processors as the total size of the sets on which rankings are computed. Technically, when a ranking A → B is to be produced and either A or B is empty, we assume that no computation is required (so that zero processors suffice).

Lemma 2 (subset rule). Let A, B and C be sets of items such that A and B are disjoint and assume that the rankings of A in B and of A ∪ B in C are available. Then the ranking of A in C can be computed in constant time with |A| processors.

Proof. For each x ∈ A, add the ranks of x in A (trivially available) and in B (available by assumption) to obtain the rank of x in A ∪ B. Then look up the rank of x in C and store it in an output array.
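Written out sequentially, the rank arithmetic of this proof looks as follows (a sketch with hypothetical names; the paper’s version is a single parallel step with one processor per item of A):

```python
def subset_rule(rank_A_in_B, rank_AuB_in_C):
    """Ranking of A in C from the ranking of A in B (disjoint from A) and
    the ranking of A ∪ B in C.  Rankings are 0-indexed lists: the i-th
    entry is the rank of the (i+1)-th smallest item of the ranked set."""
    out = []
    for i, rb in enumerate(rank_A_in_B, start=1):
        q = i + rb                        # rank of the i-th item of A in A ∪ B
        out.append(rank_AuB_in_C[q - 1])  # look up its rank in C
    return out

# Example: A = {2, 5}, B = {3}, C = {1, 4, 6}
assert subset_rule([0, 1], [1, 1, 2]) == [1, 2]
```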

Lemma 3 (union rule). Let A, B and C be pairwise disjoint sets of items, every two of which are cross-ranked. Then the cross-ranking of A ∪ B and C can be computed in constant time with |A| + |B| + |C| processors.


Fig. 1. Simple rules for deriving rankings from other rankings. (a) The subset rule. (b) The union rule. (c) The sample rule; S is a regular sample of B. (d) The cross rule; Si is a 9-cover of Ai, for i = 1, 2.


Proof. For each x ∈ A, obtain the rank q of x in A ∪ B as in the previous proof and copy the rank of x in C to position q of an output array; proceed analogously for each x ∈ B. This computes the ranking of A ∪ B in C. To obtain the rank in A ∪ B of each x ∈ C, add the ranks of x in A and B.
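The same arithmetic, written out sequentially for the union rule (again a sketch with our own conventions, one list per ranking):

```python
def union_rule(rAB, rBA, rAC, rBC, rCA, rCB):
    """Cross-ranking of A ∪ B and C from the pairwise cross-rankings of the
    disjoint sets A, B and C.  rXY[i] is the rank of the (i+1)-th item of X in Y."""
    n = len(rAB) + len(rBA)
    r_union_in_C = [0] * n
    for i, rb in enumerate(rAB):       # scatter A's ranks in C
        r_union_in_C[i + rb] = rAC[i]  # i + rb = rank of A's item in A ∪ B, minus 1
    for i, ra in enumerate(rBA):       # scatter B's ranks in C
        r_union_in_C[i + ra] = rBC[i]
    # rank of each item of C in A ∪ B: sum of its ranks in A and in B
    r_C_in_union = [a + b for a, b in zip(rCA, rCB)]
    return r_union_in_C, r_C_in_union

# Example: A = {1, 6}, B = {4}, C = {2, 5}, so A ∪ B = {1, 4, 6}
assert union_rule([0, 1], [1], [0, 2], [1], [1, 1], [0, 1]) == ([0, 1, 2], [1, 2])
```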

Lemma 4 (sample rule). Let A and B be finite disjoint sets of items and let S be a regular sample of B. If A is ranked in B, then A can be ranked in S in constant time with |A| processors. If B is ranked in A, then S can be ranked in A in constant time with |S| processors.

Proof. S is a c-sample of B for a c ∈ {1, 3} that can easily be determined (except if B is empty) by testing whether |S| = |B|. If an item in A has rank q in B, its rank in S is ⌊q/c⌋. If an item in S has rank q in S, its rank in A is that of the item in B whose rank in B is cq.
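In code, the two directions of the sample rule amount to the following index arithmetic (a sketch; the names are ours):

```python
def rank_in_sample(q, c):
    """Rank in the c-sample S of B of an item whose rank in B is q."""
    return q // c          # floor(q / c)

def rank_of_sample_item_in_A(q, c, rank_B_in_A):
    """Rank in A of the item of S whose rank in S is q: it coincides with
    the rank in A of the item of B whose rank in B is c*q."""
    return rank_B_in_A[c * q - 1]      # 1-based rank c*q -> index c*q - 1

# Example: B = {2, 4, 6, 8, 10, 12}, c = 3, so S = {6, 12}; A = {5, 7}
assert rank_in_sample(4, 3) == 1                                 # item 8, rank in S
assert rank_of_sample_item_in_A(1, 3, [0, 0, 1, 2, 2, 2]) == 1   # rank of 6 in A
```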

Lemma 5 (cross rule). Let S1, S2, A1 and A2 be sets of items such that S1 and S2 as well as A1 and A2 are disjoint, each of S1 and S2 is ranked in each of A1 and A2, S1 and S2 are cross-ranked, and Si is a 9-cover of Ai, for i = 1, 2. Then, in constant time and with |S1| + |S2| processors, we can cross-rank A1 and A2, form S1 ∪ S2 and A1 ∪ A2, and rank S1 ∪ S2 in A1 ∪ A2.

Proof. If we associate a processor with each item in S1 and S2, each such processor can obtain the rank in S1 ∪ S2 of its associated item x by adding the ranks of x in S1 and S2. In constant time, we can therefore form S1 ∪ S2 and associate with each item x ∈ S1 ∪ S2 a processor that knows the ranks of x in S1 and S2 as well as whether x came from S1 or S2. Suppose that x ≠ max(S1 ∪ S2), so that x has a successor x′ in S1 ∪ S2. By pretending to be associated with the successor, if any, of x in its original set (S1 or S2), the processor associated with x can easily discover whether x′ came from the same set as x. From this it can deduce the ranks of x′ in S1 and S2 and then, for i = 1, 2, look up the rank qi of x in Ai and the rank q′i of x′ in Ai. It proceeds to read the elements in Ai of ranks qi, . . . , q′i − 1, for i = 1, 2, to merge the corresponding sequences, each of which contains at most 10 items, and to place the resulting sequence in an output array starting in the (q1 + q2)th position. This computes A1 ∪ A2, except for the at most 18 smallest and the at most 20 largest items, which are easily handled by the processors associated with the smallest and largest items in S1 ∪ S2, and the cross-ranking of A1 and A2 can be obtained as a by-product. Finally, the rank in A1 ∪ A2 of each item in S1 ∪ S2 is found as the sum of its ranks in A1 and A2.

3 High-Level Description

Suppose that the task at hand is to sort n ≥ 2 pairwise distinct items x1, . . . , xn. Let T be an undirected free tree whose internal nodes are all of degree 3 and with exactly n + 1 leaves r, v1, . . . , vn. Replace each undirected edge {u, v} in T by the two directed edges (u, v) and (v, u) and let G = (V, E) be the resulting directed graph. With L = {r, v1, . . . , vn}, let E1 = {(u, v) ∈ E | u ∈ L} and E3 = E \ E1 be the sets of edges in E out of leaves and out of inner nodes, respectively. For each e = (u, v) ∈ E, we denote by ē the reverse edge (v, u). Moreover, for each edge e = (v, w) ∈ E, an immediate predecessor of e is an edge in E of the form (u, v) with u ≠ w. For each e ∈ E, let L(e) = {xi | e lies on a simple path in G from vi to r}. We will call an edge e ∈ E upward if |L(e)| > 0, and downward otherwise. The height of an upward edge e ∈ E is the length of a longest simple path in G whose last edge is e. This terminology corresponds to imagining r placed as a root at the top of T and defining the height of an upward edge as one more than the usual height of its lower endpoint. In this view, for each upward edge e = (u, v), L(e) = {xi | vi is a descendant of u}.

The algorithm to be described works in 2d stages, numbered 1, . . . , 2d, where d is the diameter of T. Before and after every stage, the algorithm stores for each e ∈ E three sets of items, A′(e), S(e) and S′(e), each of which is represented in a sorted array. All of these sets are initially empty. At a high level of abstraction, each of the 2d stages processes each edge e ∈ E by executing the following steps:

1. If |A′(e)| = |L(e)| > 0, then set c := 1; otherwise set c := 3.

2. If e ∈ E1, then set A′(e) := L(e). If e ∈ E3, compute A′(e) := S′(e1) ∪ S′(e2), where e1 and e2 are the two immediate predecessors of e.

3. Let S′(e) be the c-sample of A′(e).

In each stage, informally, each edge e ∈ E3 fetches samples from its immediate predecessors, forms their union and provides its own sample of the union. If e is upward and had collected all items in L(e) already in the previous stage, it passes them all on; otherwise its sample is a 3-sample.

If the execution of A′(e) := S′(e1) ∪ S′(e2) is thought of as moving a copy of each item in S′(e1) or S′(e2) across e, then the edges across which copies of a particular item xi are moved span a subgraph of G without length-2 cycles, and therefore an outtree. It follows that whenever the algorithm forms the union of two sets of items, the two sets are disjoint.

Let us say that an edge e ∈ E is complete in a stage if the relation |A′(e)| = |L(e)| > 0 holds at the beginning of that stage. By induction on h, one can show that an upward edge of height h is complete in a stage t if and only if t ≥ 2h. For the basis, an upward edge e of height 1 sets A′(e) := L(e) in stage 1 and has A′(e) = L(e) forever after. Assume now that h ≥ 2 and that the claim holds for all upward edges of height at most h − 1 and consider an upward edge e of height h with immediate predecessors e1 and e2. The relations S′(e1) = L(e1) and S′(e2) = L(e2) hold at the beginning of stage t if and only if t ≥ 2(h − 1) + 1, by induction, and therefore e is complete in stage t if and only if t ≥ 2h, as desired. Since the edge er entering r is of height at most d, it follows that L(er) = {x1, . . . , xn} can be obtained in sorted form as A′(er) at the end (or, in fact, at the beginning) of the last stage. Thus the algorithm computes the desired result.
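The high-level steps 1–3 can be checked by a sequential simulation. The sketch below is an illustration only, not the PRAM implementation: it builds one particular admissible tree T (a caterpillar of internal nodes, our own choice), assumes the items pairwise distinct, runs 2d synchronous stages with double buffering, and reads the sorted sequence off the edge entering r. All function names are ours.

```python
from collections import deque

def build_tree(n):
    """One admissible tree T for n >= 2: internal nodes u1..u_{n-1} of
    degree 3 in a chain, with leaves r and v1..vn (a caterpillar)."""
    adj = {('v', i): [] for i in range(1, n + 1)}
    adj.update({('u', i): [] for i in range(1, n)})
    adj['r'] = []
    def link(a, b):
        adj[a].append(b); adj[b].append(a)
    link(('v', 1), ('u', 1)); link(('v', 2), ('u', 1))
    for i in range(2, n):
        link(('u', i - 1), ('u', i)); link(('v', i + 1), ('u', i))
    link(('u', n - 1), 'r')
    return adj

def cole_highlevel_sort(items):
    """Sequential simulation of the high-level stages (items distinct)."""
    n = len(items)
    adj = build_tree(n)
    E = [(u, v) for u in adj for v in adj[u]]     # both directions of each edge

    def side(u, v):
        """Nodes reachable from u without crossing the edge {u, v}."""
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and (x, y) != (u, v):
                    seen.add(y); stack.append(y)
        return seen

    # L(e): items at leaves vi on the far side of e = (u, v) from r
    L = {}
    for (u, v) in E:
        comp = side(u, v)
        L[(u, v)] = ([] if 'r' in comp else
                     sorted(items[w[1] - 1] for w in comp if w[0] == 'v'))

    def bfs(s):                                   # distances, for the diameter
        dist = {s: 0}; q = deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1; q.append(y)
        return dist
    d0 = bfs('r')
    far = max(d0, key=d0.get)
    diam = max(bfs(far).values())                 # diameter d of T

    A = {e: [] for e in E}                        # A'(e), initially empty
    S = {e: [] for e in E}                        # S'(e), initially empty
    for _ in range(2 * diam):                     # stages 1..2d, synchronous
        newA, newS = {}, {}
        for (u, v) in E:
            c = 1 if len(A[(u, v)]) == len(L[(u, v)]) > 0 else 3   # step 1
            if len(adj[u]) == 1:                  # step 2, edge out of a leaf
                newA[(u, v)] = list(L[(u, v)])
            else:                                 # union of predecessors' samples
                merged = [x for w in adj[u] if w != v for x in S[(w, u)]]
                newA[(u, v)] = sorted(merged)
            newS[(u, v)] = [x for i, x in enumerate(newA[(u, v)])
                            if (i + 1) % c == 0]  # step 3: the c-sample
        A, S = newA, newS
    return A[(adj['r'][0], 'r')]                  # A'(e_r) at the edge entering r
```

Running this on a few inputs confirms that A′(er) holds the items in sorted order after 2d stages.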

If an upward edge e ∈ E is complete for the first time in a stage t, the set S′(e) is the 3-sample of L(e) in stage t − 1 and is L(e) itself in stage t and in every later stage. Let us therefore call an upward edge of height h active in a stage t exactly if t ≤ 2h. A downward edge e is defined to be active exactly when ē is. Let E∗ be the set of active edges. The following lemma is instrumental in bounding the resource requirements of the algorithm.

Lemma 6. At the beginning and end of every stage, Σ_{e∈E∗} |A′(e)| ≤ 10n.

Proof. Let us call the reciprocal of the number c computed as part of the processing of an edge e ∈ E in a particular stage the sampling density of e in that stage. Imagine each item not as a discrete entity, but as a commodity that can be present in arbitrary amounts. Moreover, imagine that the c-sample computed in step 3 of the algorithm does not contain selected items, but rather includes 1/c of the amount of each item present in A′(e). Since a c-sample of a set A never includes more than |A|/c items, the total amount of items present in a set manipulated by the algorithm according to this fictitious accounting is an upper bound on the number of items present in the set in the actual execution.

Fix an item xi. A positive amount of xi can be present in A′(e) for an active edge e only if G contains a simple path p that starts at vi and has e as its last edge, and then the amount of xi in A′(e) at the end of a stage t is upper-bounded by the product of the sampling densities in stage t of the edges on p other than e. All edges preceding the first active edge e′ on p must be upward, and therefore common to all relevant paths p. Moreover, all edges on p after e′ have sampling density 1/3, and the same is true of e′ unless e′ is upward. Therefore the amount of xi present in A′(e), summed over all active edges e, is at most 1 + 3·Σ_{j=0}^{∞} (2/3)^j = 10 (see Fig. 2). The lemma follows by summation over all n items xi.
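The closing bound is a geometric series; a quick numerical check (illustration only):

```python
# 1 + 3 * sum_{j>=0} (2/3)^j = 1 + 3 * 3 = 10; truncating at j = 500 suffices
total = 1 + 3 * sum((2 / 3) ** j for j in range(500))
assert abs(total - 10) < 1e-9
```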

4 The Execution of a Stage

The detailed description of a single stage is where pictures will be most useful. Nodes in T are drawn as polygons, two such polygons sharing a corner exactly if the two corresponding nodes are adjacent. More specifically, nodes in T of degree 3 and degree 1 are drawn as triangles and as thirds of triangles, respectively, and Fig. 3(a) shows conventions that will be used throughout for drawing the sets stored by the algorithm between stages. Note, in particular, that the sets associated with an edge e = (u, v) are shown inside the polygon representing the node v that e enters.

As mentioned in the introduction, an efficient execution of the algorithm hinges on the availability of suitable “scaffolding”. Before and after every stage, the algorithm stores the following scaffolding information:

A. For each e ∈ E3, the cross-ranking S(e1) ↔ S(e2) of S(e1) and S(e2), where e1 and e2 are the two immediate predecessors of e.

B. For each e ∈ E, the ranking S(e) ↠ S′(e) of S(e) in S′(e).

C. For each e ∈ E, the cross-ranking S′(ē) ↔ A′(e) of S′(ē) and A′(e).


Fig. 2. The tree of paths p from vi to edges e with positive amounts of xi in A′(e). A thick edge has sampling density 1. If it is also black, it is inactive. Each edge e is labeled with the maximum possible amount of xi present in A′(e).

Fig. 3. The pictorial representation of sets manipulated by the algorithm for each edge e ∈ E. (a): Between stages. (b): During the processing of e. The sets S(e), S′(e) and S′′(e) are indicated only in figures for which they are of relevance.


Fig. 4. An example tree T with the rankings available at the beginning of each stage.

Fig. 4 shows an example tree T with the rankings A–C available at the beginning of each stage. The following color coding of the rankings will be used throughout: A: red; B: green; C: blue. We will consider the availability of these rankings as invariants with the same names A–C. As anticipated in the shorthand above, invariant B includes the fact that for each e ∈ E, S(e) is a 9-cover of S′(e).

Two additional invariants that hold before and after every stage are formulated below. The first of these is illustrated in Fig. 5, while the other invariant is implicit already in the drawing conventions of Fig. 3.

D. For each e ∈ E3, A′(e) = S(e1) ∪ S(e2), where e1 and e2 are the immediate predecessors of e.

E. For each e ∈ E, S′(e) is a regular sample of A′(e).

Before the first stage, invariants A–E are trivially satisfied, since all relevant sets are empty.

At a more detailed level, the processing of each edge e ∈ E in each stage is refined as follows:

1. If |A′(e)| = |L(e)| > 0, then set c := 1; otherwise set c := 3.

2. If e ∈ E1, then set A′′(e) := L(e) and rank A′(e) in A′′(e). Otherwise, with e1 and e2 taken to be the two immediate predecessors of e, cross-rank S′(e1) and S′(e2), set A′′(e) := S′(e1) ∪ S′(e2) and rank A′(e) in A′′(e).

3. Let S′′(e) be the c-sample of A′′(e).


Fig. 5. Invariant D: The union of the two yellow sets in the same triangle is the third yellow set.

4. Rank S′(e) in S′′(e).

5. Cross-rank S′′(ē) and A′′(e).

6. Set A′(e) := A′′(e), S(e) := S′(e) and S′(e) := S′′(e).

Steps 1–6 above are easily seen to have the same net effect on A′(e) and S′(e) as steps 1–3 of the high-level description. The sets A′′(e) and S′′(e) can be thought of as “the new values” of A′(e) and S′(e), respectively, just as S(e) is the value of S′(e) from the previous stage. Fig. 3(b) shows the conventions used for drawing the sets associated with an edge e during the processing of e. One may imagine new sets “sprouting” in the corners of triangles.

Invariants D and E hold at the end of every stage, as an immediate consequence of the computation carried out in that stage. Therefore they always hold outside of step 6. The following lemma proves that the “cover part” of invariant B also holds outside of step 6.

Lemma 7. At the end of every stage, S(e) is dense in S′(e) for every e ∈ E.

Proof. By induction on the stage number t. The claim is trivial for e ∈ E1 and for t = 1. For e ∈ E3 and t ≥ 2, consider the situation just before the execution of step 6 in stage t. Invariants D and E show that, with e1 and e2 taken to be the two immediate predecessors of e, S′(e) is a regular sample of S(e1) ∪ S(e2), whereas S′′(e) is a regular sample of S′(e1) ∪ S′(e2). More precisely, if e is not complete in stage t, S′(e) and S′′(e) are the 3-samples of S(e1) ∪ S(e2) and of S′(e1) ∪ S′(e2), respectively. By induction, S(ei) is dense in S′(ei), for i = 1, 2, so Lemma 1 shows that S′(e) is indeed dense in S′′(e). And if e is complete in stage t, S′(e) is a regular sample of L(e) = S′′(e) and therefore clearly dense in S′′(e).

Steps 1, 3 and 6 are trivial. The execution of the other steps is described below. An alternative, essentially stand-alone description is provided by Figs. 6–9 and their captions.


Fig. 6. The execution of steps 2 and 4 for an edge e ∈ E3. The solid arrows are available at the start of the stage. Invariant D, applied to the pink sets, the rightmost solid red arrows, the blue arrow and the subset rule allow the drawing of the downward-pointing dashed black arrow. The other dashed black arrow follows by symmetry. The cross rule now allows the drawing of the dashed red arrows and, by the sample rule and invariant D, applied to the orange sets and to the yellow sets, the dashed green arrow.

2. The necessary computation is trivial if e ∈ E1, so consider the case e ∈ E3. In the following two sentences, various invariants are applied to ē1 rather than to e. By invariant C, A′(ē1) is ranked in S′(e1). But by invariant D, A′(ē1) = S(e2) ∪ S(ē), and S(e2) and S(ē) are cross-ranked by invariant A, so the subset rule allows us to rank S(e2) in S′(e1). By symmetry, we can rank S(e1) in S′(e2). Moreover, by invariants A and B, we have the rankings of S(ei) in S′(ei), for i = 1, 2, as well as the cross-ranking of S(e1) and S(e2). The cross rule now implies that we can cross-rank S′(e1) and S′(e2), merge the two sets to obtain A′′(e) = S′(e1) ∪ S′(e2), and rank A′(e) = S(e1) ∪ S(e2) (invariant D) in A′′(e) (see Fig. 6).

4. By invariant E, S′(e) is a regular sample of A′(e). Since S′′(e) is clearly a regular sample of A′′(e) and A′(e) was ranked in A′′(e) in step 2, it suffices to appeal to both parts of the sample rule.

5. Our first goal will be to cross-rank S′(ē) with A′′(e) and, by both parts of the sample rule, with S′′(e). Assume first that e ∈ E1. There is nothing to do in stage 1, since S′(ē) is empty. In every later stage, we have A′′(e) = A′(e), so the desired ranking is available, according to invariant C. Assume now that e ∈ E3. As is easy to see by symmetry, step 2 cross-ranked every two of S′(ē), S′(e1) and S′(e2), where e1 and e2 are the two immediate predecessors of e. Therefore, by the union rule, we can cross-rank S′(ē) and A′′(e) = S′(e1) ∪ S′(e2) (see Fig. 7).

Fig. 7. The first part of the execution of step 5 for an edge e ∈ E3. The red arrows were computed in step 2 (Fig. 6). Invariant D, applied to the yellow sets, the red arrows and the union rule allow the drawing of the leftmost pair of dashed orange arrows. The other dashed orange arrows follow by symmetry.

By symmetry, we also have the rank of S′(e) in S′′(ē). From step 4, we have the ranks of S′(e) in S′′(e) and, by symmetry, of S′(ē) in S′′(ē). By invariant E, S′(e) is a regular sample of A′(e), so invariant C and both parts of the sample rule show that we can cross-rank S′(e) and S′(ē). Now, by the cross rule and invariant B, applied at the end of the stage, we can rank S′′(e) in S′′(ē) (see Fig. 8).

Since S′′(e) is a regular sample of A′′(e), it is a 9-cover of A′′(e), and we can trivially rank S′′(e) in A′′(e). At this point, we have ranked each of S′′(e) and S′(ē) in each of S′′(ē) and A′′(e), and we have the cross-ranking of S′′(e) and S′(ē). Therefore, by the cross rule, we can obtain the cross-ranking of S′′(ē) and A′′(e) (see Fig. 9).

Fig. 8. The execution of step 5 continued. The orange arrows derive with the sample rule from those drawn in the previous figure. The black arrows follow in the same way from the blue arrows, whose presence is guaranteed by invariant C. The green arrows were drawn in step 4 (Fig. 6). The cross rule now allows the drawing of the magenta arrows.

Fig. 9. The third and final part of the execution of step 5. The magenta arrow was drawn in the previous figure. The orange arrows derive with the sample rule from those drawn in Fig. 7. The green arrow was copied from the previous figure, and the black arrow is trivial. The cross rule now allows the drawing of the blue arrows.

The rankings required by invariants A, B and C for the next stage are computed in steps 2, 4 and 5, respectively. Therefore invariants A–E hold at the beginning and end of every stage.

5 Detailed Implementation

This section spells out the remaining nitty-gritty details of the algorithm. Recall the following standard method of allocating consecutively numbered resource units such as processors or memory cells to jobs J1, . . . , Jm: If Ji needs ai resource units, for i = 1, . . . , m, compute the prefix sums s0, . . . , sm, where si = a1 + · · · + ai for i = 0, . . . , m (in particular, s0 = 0), and assign to Ji the resource units numbered b + si−1, . . . , b + si − 1, for i = 1, . . . , m, where b is the number of the first available resource unit. The prefix sums s0, . . . , sm can be computed in O(log m) time with O(m) processors by means of a balanced binary tree with a1, . . . , am as its leaves: In a bottom-up sweep over the tree, each node learns the sum of the leaves in the maximal subtree below it, and in a subsequent top-down sweep, it learns the sum of the leaves strictly to the left of that subtree.
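A sequential simulation of this two-sweep scheme might look as follows (illustrative only; the function name `allocate` and the half-open output ranges are my conventions, and the real algorithm executes each sweep in O(log m) parallel time rather than in a loop):

```python
def allocate(sizes, b=0):
    """Assign consecutively numbered resource units to jobs with the given
    sizes, numbering units from b. Returns half-open ranges (start, end),
    i.e. job i receives units start, ..., end - 1."""
    m = len(sizes)
    n = 1
    while n < m:                          # pad the leaf level to a power of two
        n *= 2
    # tree[1] is the root; tree[2*v] and tree[2*v+1] are the children of v,
    # and the leaves occupy positions n, ..., 2*n - 1.
    subtree = [0] * (2 * n)               # sum of the leaves below each node
    subtree[n:n + m] = sizes
    for v in range(n - 1, 0, -1):         # bottom-up sweep
        subtree[v] = subtree[2 * v] + subtree[2 * v + 1]
    left = [0] * (2 * n)                  # sum of the leaves strictly to the left
    for v in range(2, 2 * n):             # top-down sweep
        parent = v // 2
        # A right child (odd v) additionally sees its left sibling's subtree.
        left[v] = left[parent] + (subtree[2 * parent] if v % 2 else 0)
    # Leaf i now knows s_{i-1} = left[n + i], so job i gets the units
    # b + s_{i-1}, ..., b + s_i - 1.
    return [(b + left[n + i], b + left[n + i] + sizes[i]) for i in range(m)]
```

For example, `allocate([3, 1, 2])` assigns units 0–2, 3 and 4–5 to the three jobs.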

For each e ∈ E, let us say that the sets A′(e), S(e), S′(e), A′′(e) and S′′(e) are in the custody of e. For each e ∈ E, the total size of the sets in the custody of e just before the execution of step 6 in a stage t in which e is active is bounded by a constant times the size of A′(e) at the beginning or end of one of the stages t − 1 and t. Indeed, A′′(e) is just A′(e) at the end of stage t, S′(e) and S′′(e) are subsets of A′(e) and A′′(e), respectively (invariant E), and S(e) is empty or equals S′(e) at the beginning of the previous stage. Therefore, by Lemma 6, the total size of the sets in the custody of active edges, as well as of the rankings computed for these sets, is O(n) at all times.

Sets in the custody of inactive upward edges never again change, and sets in the custody of inactive downward edges cannot influence the output of the algorithm. Therefore it is not necessary to associate processors with inactive edges. It is not necessary to allocate space for sets in the custody of inactive edges either, except when such sets are read during the processing of an active edge. This can happen only in steps 2 and 5 of the processing of an active edge e with immediate predecessors e1 and e2, where S′(e1) and S′(e2) are read. To cope with this exception, when an edge e becomes inactive, the custody of S′(e) is transferred to those active edges of which e is an immediate predecessor, each of which stores a copy of S′(e) together with any rankings computed for S′(e). The total space requirements remain O(n).

By associating a processor with each edge in E and carrying out a “dry run” of the sorting algorithm in which sets of items are replaced by their sizes, merging of (disjoint) sets is replaced by addition of their sizes, etc., it is possible, in O(d) time, to compute for each e ∈ E and for t = 1, . . . , 2d the total space needed for the sets in the custody of e in stage t. (To prevent this computation from needing Θ(dn) space, it is preceded by an even more rudimentary computation that records for each edge e only when a set in the custody of e becomes nonempty for the first time and when e becomes inactive, so that space proportional to the number of intervening stages can be allocated to e.) Now, for t = 1, . . . , 2d, the allocation of space to edges in stage t can be planned by computing prefix sums in the manner described in the beginning of the section. Since |E| = O(n), the 2d independent prefix-sums computations can be carried out in a pipelined fashion in O(d + log n) total time with O(n) processors.

If we allocate one processor per memory cell ever used by the algorithm and intersperse these memory cells with information about the sizes and starting addresses of relevant arrays, it is clear from Lemmas 2–5 that each stage can be executed in constant time. The available processors can also effectuate any necessary custody transfers, as discussed above, and copy sets that are to survive from one stage to the next between their old and new locations in memory. This takes place between stages and needs constant time per stage.

So far, the algorithm uses O(n) processors, O(d + log n) time and O(n) space. By letting each physical processor simulate a constant number of virtual processors, we can reduce the processor count to exactly n, and a proper choice of the tree T ensures that d = O(log n). This reproves Cole’s original result: The algorithm sorts n items using n processors, O(log n) time and O(n) space.

6 Comparison with Cole’s Description

Cole’s sets UP(v), SUP(v) and OLDSUP(v) correspond to what is here called A′(e), S′(e) and S(e), respectively, where e is the edge from v to its parent. Similarly, DOWN(v), SDOWN(v) and OLDSDOWN(v) correspond to A′(ē), S′(ē) and S(ē). Cole’s assumptions (a) and (c) correspond to our invariant A, for e and for ē, (b) and (d) correspond to B, and (f) and (g) correspond to C, while assumption (e) is not used here.

In Cole’s variant of the algorithm, a node uses the sampling densities 1/4, . . . , 1/4, 1/2, 1 over the stages in which it is active. We may express this by saying that the algorithm adheres to the sampling regime (4, 2) (with an implicit 1 at the end). The algorithm in fact works correctly with any sampling regime of the form (z0, . . . , zl), where z0, . . . , zl are integers with z1, . . . , zl > 1 and z0 > 2 (the latter condition ensures that ∑j≥0 (2/z0)^j < ∞; cf. the proof of Lemma 6). Cole proposes an even more general alternative, namely to use sampling densities 1/2 and 1/4 at alternate levels of the tree. Here the sampling regime (3) was chosen as the simplest possibility.
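For completeness, the convergence claim behind the condition z0 > 2 is just the standard geometric series (a routine computation, not spelled out in the text):

```latex
\sum_{j=0}^{\infty}\left(\frac{2}{z_0}\right)^{j}
  \;=\; \frac{1}{1-2/z_0}
  \;=\; \frac{z_0}{z_0-2}
  \qquad (z_0 > 2),
```

which is finite exactly when 2/z0 < 1, i.e., when z0 > 2.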

References

1. Ajtai, M., Komlós, J., Szemerédi, E.: An O(n log n) sorting network. In: 15th Annual ACM Symposium on Theory of Computing (STOC 1983), pp. 1–9 (1983)

2. Cole, R.: Parallel merge sort. SIAM J. Comput. 17, 770–785 (1988)

