
Algorithmica (1998) 21: 2–20. Algorithmica © 1998 Springer-Verlag New York Inc.

Fast Generation of Random Permutations Via Networks Simulation¹

A. Czumaj,² P. Kanarek,³ M. Kutyłowski,² and K. Loryś⁴

Abstract. We consider the problem of generating random permutations with uniform distribution. That is, we require that for an arbitrary permutation π of n elements, with probability 1/n! the machine halts with the ith output cell containing π(i), for 1 ≤ i ≤ n. We study this problem on two models of parallel computation: the CREW PRAM and the EREW PRAM.

The main result of the paper is an algorithm for generating random permutations that runs in O(log log n) time and uses O(n^{1+o(1)}) processors on the CREW PRAM. This is the first o(log n)-time CREW PRAM algorithm for this problem.

On the EREW PRAM we present a simple algorithm that generates a random permutation in time O(log n) using n processors and O(n) space. This algorithm outperforms each of the previously known algorithms for the exclusive-write PRAMs.

The common and novel feature of both our algorithms is first to design a suitable random switching network generating a permutation and then to simulate this network on the PRAM model in a fast way.

Key Words. Parallel algorithms, Random permutation, Uniform distribution, Switching networks, Matching, PRAM, CREW, EREW.

1. Introduction. Generating permutations is a fundamental problem already studied in theoretical and applied computer science for a few decades. One approach (historically first) assumed generation of all permutations of n elements (see [8] and [17], and the references therein). Evidently, the problem is intractable even for very moderate values of n. The next approach, also extensively investigated, is generating permutations at random. This task can be formally described as follows:

DEFINITION 1. A machine M generates a permutation π of n elements if, when halted, the output memory cells Z_1, ..., Z_n store, respectively, π(1), ..., π(n). We say that M generates (uniformly) random permutations of n elements if for every permutation π of n elements M generates π with probability 1/n!.

¹ This research was partially supported by KBN Grant 8 S503 002 07, EU ESPRIT Long Term Research Project 20244 (ALCOM-IT), DFG-Sonderforschungsbereich 376 "Massive Parallelität," and DFG Leibniz Grant Me872/6-1. A preliminary version appeared in Proceedings of the 4th Annual European Symposium on Algorithms (ESA '96), volume 1136 of Lecture Notes in Computer Science, pages 246–260, Springer-Verlag, Berlin, 1996.
² Heinz Nixdorf Institute and Department of Mathematics and Computer Science, University of Paderborn, D-33095 Paderborn, Germany. {artur,mirekk}@uni-paderborn.de.
³ Institute of Computer Science, University of Wrocław, Przesmyckiego 20, PL-51-151 Wrocław, Poland. [email protected].
⁴ Department of Computer Science, University of Trier, D-54286 Trier, Germany, and Institute of Computer Science, University of Wrocław, Przesmyckiego 20, PL-51-151 Wrocław, Poland. [email protected].

Received November 1996; revised March 1997. Communicated by J. Díaz and M. J. Serna.


Throughout this paper we consider only the problem of generating permutations with uniform distribution, and we omit the word "uniform" while talking about generating random permutations.

The problem of generating a random permutation has recently received much attention. The reasons are manifold. One of them is the growing interest in randomized algorithms (see, e.g., [12] and [13]). Generating random permutations is a basic building block for a large number of randomized sequential and parallel algorithms. For many deterministic algorithms that work efficiently on average, malicious input data can be found on which the algorithm performs poorly. Permuting the input randomly may transform such a difficult input into a good one (at least on average) and make it tolerable. Another field in which random permutations are important is cryptography. Random objects such as permutations are components of a large number of cryptographic algorithms and protocols. A truly random, fast, and cheap source of such objects would be crucial for avoiding security gaps, and is often assumed in the analysis of cryptographic protocols. Unfortunately, creating such sources is a challenging problem, and currently used techniques generate only pseudorandom objects (see, e.g., [14]). One of many reasons why random permutations are difficult to generate is that the running time involved is unacceptable for practical applications. We believe that parallel techniques might be important in this context.

When we consider generation of a random permutation, we have to assume that our algorithm uses some random resources; thereby the algorithm must be randomized. However, there are many ways of using randomness, and a sentence such as "algorithm A generates a random permutation in time T" may have many different meanings. We are interested in algorithms where all bounds are guaranteed in a strong way:

DEFINITION 2. We say that a machine M generating a random permutation of n elements in time T with p processors and m memory cells is strong randomized if:

• M always halts after T steps,
• M may use the given p processors and m cells,
• each permutation has an equal chance of 1/n! to become the output of M.

In the literature there are many algorithms for generating random permutations that do not satisfy these conditions. In this case we say that the algorithm is weak randomized. This happens if, for instance, an algorithm ensures the claimed time bound to hold only in the expected case, or to hold with high probability. For example, a randomized algorithm may generate a permutation uniformly at random and halt within T steps provided that a certain (randomized) event E takes place, where Pr(E) > 1 − o(1) and E is independent of the permutation found.

There are many reasons why strong randomized algorithms for permutation generation are superior to weak randomized ones. For instance, in the case of a weak algorithm it may be difficult to check whether the mentioned event E has really occurred. In this case we cannot guarantee that the algorithm gives proper answers. When we consider a weak randomized algorithm running with failure probability f(n), the permutation chosen by such an algorithm is uniform with probability 1 − f(n); here f(n) is usually a very small function of the form f(n) = n^{−c} or f(n) = 2^{−εn}, for constants c > 1 and 0 < ε < 1. Observe that in this situation the probability of a single permutation may differ from the ideal 1/n! even by f(n). (Note that f(n) is usually an extremely big value compared with 1/n!.) An approach of this kind may be acceptable as a component in other randomized algorithms, where such precision is fully acceptable because it adds only the additive term f(n) to the failure probability. However, there are some critical applications (e.g., in cryptography) where such a deviation from the uniform distribution might be dangerous.

In this paper we study the permutation generation problem on the Parallel Random Access Machine (PRAM) model. We focus on PRAMs with exclusive-read exclusive-write (EREW) and concurrent-read exclusive-write (CREW) access mode to the shared memory. In order to make our model sound we assume that a cell of an n-processor PRAM may store O(log n) bits.

1.1. Related Work. Previous parallel algorithms for generating random permutations have been based on three basic techniques. The first one, called dart throwing, consists of two steps. First, the input elements are mapped at random into an array of size O(n). Their order in the array gives an implicit random permutation. Then the elements are compressed into an array of size n. This technique is especially efficient when used on the CRCW PRAM model, where conflicts during writing and reading are allowed. There has been a sequence of papers using this technique [10], [11], [15], culminating in the papers by Hagerup [7] and Matias and Vishkin [9], who designed weak randomized algorithms that generate random permutations in O(log* n) time on the O(n/log* n)-processor CRCW PRAM, with high probability.

The second technique is based on integer sorting and it also leads to weak randomized algorithms. Each element chooses a key uniformly at random from the set {1, ..., n}. Then the elements are sorted according to the keys' values, which define the relative order of the elements with different keys. Finally, the elements with the same key are randomly permuted using a sequential algorithm. This technique was first used by Reif [16], who applied his integer sorting algorithm to generate random permutations in time O(log n) on the O(n/log n)-processor CRCW PRAM, with high probability. Hagerup [7] applied this idea to the EREW PRAM model and showed that, disregarding a small failure probability, generation of random permutations is reducible to integer sorting. In particular, combining with known integer sorting algorithms, this yields a weak randomized algorithm that runs in O(log n) time on the n-processor EREW PRAM, or an algorithm running in time O((log n)^{5/3} (log log n)^{1/3}) on the EREW PRAM with O(n/log n) processors. Both these results assume that space of size O(n) is used.

The third technique is to implement in parallel a basic sequential algorithm, which we call SHUFFLE, due to Durstenfeld [6] (see also page 139 of [8] and [17]):

SHUFFLE:

    for i := 1 to n do π(i) := i
    for i := 1 to n do
        κ := random element from {i, ..., n}
        exchange the values of π(i) and π(κ)
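The loop above can be sketched directly in Python (a minimal sequential sketch, 0-indexed; `random.randrange` stands in for the paper's abstract random source, and the function name is our own):

```python
import random

def shuffle_permutation(n, rng=random):
    """Durstenfeld's SHUFFLE: returns the array pi with pi[i] holding pi(i+1)."""
    pi = list(range(1, n + 1))       # pi(i) := i
    for i in range(n):
        k = rng.randrange(i, n)      # kappa uniform over positions i, ..., n-1
        pi[i], pi[k] = pi[k], pi[i]  # exchange pi(i) and pi(kappa)
    return pi
```

Each of the n! permutations arises from exactly one sequence of κ draws (n choices, then n−1, and so on), which is why the output distribution is uniform.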

It is well known that this algorithm returns permutations according to the uniform distribution. Anderson [1] used this algorithm in a parallel setting and showed that it can be run efficiently on parallel machines with a small number of processors communicating through a common bus such as Ethernet. He showed that the main loop of SHUFFLE can be divided into pieces, each piece executed by a different processor. The ordering in which the resulting permutations are composed affects the permutation generated, but not its probability distribution in a serious way, even if the delivery times of messages sent through the bus are determined on-line by an adversary controlling the bus. Hagerup [7] later implemented SHUFFLE to run in O(log n) time with n processors and Θ(n²) space on the EREW PRAM. He was able to reduce the space used, sacrificing however the running time and/or turning from the EREW to the CREW PRAM model. He presented two algorithms that use O(n^{1+ε}) space for arbitrary fixed ε > 0, one running in O(log² n) time on the O(n/log n)-processor EREW PRAM, and another running in time O(log n log log log n) on the O(n/log log log n)-processor CREW PRAM.

Observe that SHUFFLE has one very important advantage over the first two techniques: it leads to strong randomized algorithms. Thus up to now, the fastest strong randomized O(log n)-time algorithm uses n processors and Ω(n²) space on the EREW PRAM. We are not aware of any better strong randomized algorithm in the literature, even on the CRCW PRAM!

1.2. New Results. We present two algorithms for generating random permutations, one running on the EREW PRAM and the other on the CREW PRAM. Our first result is an efficient implementation of SHUFFLE.

THEOREM 1. There is a strong randomized EREW PRAM algorithm that generates permutations of n elements uniformly at random in time O(log n) with n processors and O(n) space.

This algorithm is a simple but efficient implementation of SHUFFLE on the EREW PRAM. It uses the minimum number of O(log n!) random bits required to define the output permutation. Though the algorithm is simple, it improves upon all previously known algorithms for generating random permutations on the CREW and the EREW PRAMs. Compared with former results, we either reduce the space used, or make the algorithm strong randomized and remove concurrent reads, while the other parameters are not worsened.

Because, even with the use of randomization and an unbounded number of processors, any CREW PRAM requires Ω(log n) time to compute the OR of n bits [4], [5], any nontrivial problem that can be solved on this model in time o(log n) is of special interest. There are extremely few such algorithms; perhaps the most significant so far is the algorithm for merging two ordered sequences of n elements [2]. Our second and main result is the first permutation generation algorithm for the CREW PRAM that runs in sublogarithmic time.

THEOREM 2. There is a strong randomized CREW PRAM algorithm that generates permutations of n elements uniformly at random in time O(log log n) using O(n^{1+1/c^{log log n}}) processors, for arbitrary positive constant c.

The main message of this result is that the generation of random permutations may follow a strategy other than the previously known techniques, and may lead to possibly more efficient algorithms.

Our CREW algorithm uses a hypergeometric random number generator. It would be desirable to achieve the same running time using only unbiased coins. However, it is easy to see that then the probability of generating an arbitrary permutation would be of the form i/2^j, for some i, j ∈ N. So it cannot be 1/n!, and therefore each strong randomized algorithm has to use something more than a uniform random bit generator.

While designing our algorithms we introduce a novel technique for generating permutations that we call networks simulation. We study certain suitably defined layered networks whose main feature is that each level of the network is designed locally and independently of the other levels. The final permutation is defined by the paths from the nodes on the first level to the nodes on the terminal level.

1.3. Basic Techniques. It is equivalent to construct a random permutation on n elements or to construct a random perfect matching between two sets {a_1, ..., a_n} and {b_1, ..., b_n} of n elements each. Simply, if μ is such a perfect matching, then we define π_μ(i) = j if μ(a_i) = b_j. Therefore throughout the rest of the paper we may talk about perfect matchings instead of permutations.

Our way to obtain a random perfect matching will be through constructing special layered networks, later called matching networks. Such a network consists of several levels, each containing n nodes. The directed links of the network form perfect matchings between consecutive levels of the network. Any matching network defines a perfect matching μ between the nodes on the first and the last levels: for a node C on the first level, μ(C) is the unique node Z on the last level such that there is a path between C and Z. Of course, the matchings between the levels must be chosen carefully in order to finally obtain each permutation with the same probability, and must be simple enough to be easily constructible.

Once we have constructed the network, determining the perfect matching μ can be realized by the pointer jumping technique: after step i, every node R in level j stores a pointer to the single node S on level min{j + 2^i, last level} such that there is a path between R and S. At step i + 1 the processor attached to R reads the pointer stored in S and copies it to R. The new pointer of R points to a node S′ on level min{j + 2^{i+1}, last level} such that there is a path between R and S′.

Obviously, the pointer jumping technique can be performed on the EREW PRAM in time logarithmic in the number of levels. It can be implemented so that the time–processor product equals the number of nodes in the network.
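The doubling behind pointer jumping can be sketched as a sequential simulation of the synchronous rounds (a sketch only; the function name and dictionary representation are our own, and a real PRAM would perform each round with one processor per node):

```python
def pointer_jump(nxt):
    """Pointer jumping: nxt maps each node to its successor (terminal nodes
    map to themselves).  Each synchronous round replaces every pointer by the
    pointer of the node it points to, doubling the distance covered, so after
    O(log L) rounds every node points to the end of its path (L = longest path)."""
    ptr = dict(nxt)
    while True:
        # one parallel round: all reads happen before all writes
        new = {v: ptr[ptr[v]] for v in ptr}
        if new == ptr:
            return ptr
        ptr = new
```

On a chain of L nodes the loop stabilizes after about log₂ L doubling rounds, which is the source of the logarithmic time bound quoted above.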

2. EREW Algorithm. In this section we prove Theorem 1. The algorithm SHUFFLE can be implemented by constructing a so-called shuffle network with n levels. The matching between levels i and i + 1 contains a single switch (i, k_i), where k_i, i ≤ k_i ≤ n, is chosen uniformly at random and corresponds to the number κ chosen during the ith iteration of the loop of SHUFFLE. The switch (i, k_i) connects node i of level i with node k_i of level i + 1, and node k_i of level i with node i of level i + 1. For j ≠ i, k_i, node j of level i is connected with node j of level i + 1. For examples see Figure 1.
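The level-by-level structure can be simulated naively by tracing every path through all n levels (a 0-indexed sketch with names of our own; this sequential tracing replaces the paper's fast parallel simulation and only illustrates what the network computes):

```python
from itertools import product

def shuffle_network_permutation(ks):
    """Trace all paths of a shuffle network: between levels i and i+1 only
    the switch (i, ks[i]) crosses, every other row goes straight.
    Returns pos with pos[r] = terminal node of the path starting at node r."""
    n = len(ks)
    pos = list(range(n))
    for i, k in enumerate(ks):        # ks[i] is drawn from {i, ..., n-1}
        for r in range(n):
            if pos[r] == i:
                pos[r] = k
            elif pos[r] == k:
                pos[r] = i
    return pos
```

Enumerating all switch choices for n = 3 yields 3 · 2 · 1 = 6 choice sequences and 6 distinct permutations, illustrating why the network generates each permutation with probability 1/n!.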


Fig. 1. A shuffle network. Edges corresponding to the switches are distinguished by the shaded fields.

To find the permutation defined by a shuffle network we may apply pointer jumping, but it would require roughly n² processors and space in order to run in O(log n) time. We show how to use far fewer processors and less space.

Let the nodes with index i of all levels be called row i of the shuffle network. Let v_i denote the path starting at node i at level 1 (see Figure 2). Our goal is to find the final node of each path v_i. Note that each path v_i contains three types of edges. The first two types correspond to edges of switches. An edge of a switch may climb or fall: we say that v_i falls at level j if at level j path v_i goes through a switch (j, k_j) and reaches node j at level j + 1. (This includes the case when k_j = j and the path goes horizontally.) Note that every path falls exactly once. We say that v_i climbs at level j if at level j path v_i goes through a switch (j, k_j) and reaches node k_j at level j + 1 with k_j > j. The remaining edges of the path form a number of horizontal subpaths. More precisely, v_i has a horizontal subpath between levels j + 1 and l if v_i climbs or falls at level l and either j = 0 and i = l, or v_i climbs at level j with k_j = l. Note that the number of horizontal subpaths is O(n). Indeed, each horizontal subpath starts either at the first level or at the node of a switch. There are n switches and each gives rise to at most two horizontal subpaths. Our construction is based on contracting horizontal subpaths to single edges.

In order to find the final nodes of the paths, we generate a directed graph G consisting of O(n) nodes and O(n) edges (see Figure 3). There are n nodes that correspond to the starting positions at level 1, and we denote them by ⟨i, 0, i⟩, for 1 ≤ i ≤ n. The remaining nodes correspond to the switches: two nodes per switch. The two nodes corresponding to a switch (i, k_i), 1 ≤ i ≤ n, are denoted by ⟨i, i, k_i⟩ and ⟨k_i, i, k_i⟩ (the first coordinate


Fig. 2. Path v_1 in a shuffle network.

Fig. 3. Edges in graph G corresponding to the paths v_1, v_2, v_6, v_7.


is the row number, the next two coordinates correspond to the switch). There are also n "output" nodes out(i), for 1 ≤ i ≤ n.

There are three kinds of edges in G. Some edges correspond to the edges in the shuffle network where the paths climb. Thus for each i, 1 ≤ i ≤ n, if k_i ≠ i, then there is a directed edge from ⟨i, i, k_i⟩ to ⟨k_i, i, k_i⟩. There are O(n) edges corresponding to the horizontal subpaths. We take a directed edge from ⟨l, j, l⟩ to ⟨l, l, k_l⟩ if a path arriving at row l through switch (j, l) leaves row l through switch (l, k_l), where k_l > l (that is, the path climbs further). An important point is that we may find these edges by lexicographically sorting the triplets ⟨s, i, j⟩. The key observation is that if ⟨j, j, k⟩ is the immediate successor of ⟨j, s, j⟩ after lexicographic sorting, then there is a horizontal subpath between levels s + 1 and j inside row j. Using the sorting algorithm due to Cole [3], sorting the triplets can be performed in O(log n) time on an n-processor EREW PRAM with O(n) memory cells.

There are also n edges corresponding to the paths starting with falling edges and leading to the output nodes. We find them in the following way: if after sorting, the triplet ⟨s, i, s⟩ is followed by ⟨s, j, s⟩ for some j ≤ s, then G contains an edge from ⟨s, i, s⟩ to out(j).

It follows from the construction that the edges in G determine the paths corresponding to the paths v_i: if the last node of a path starting in ⟨s, 0, s⟩ is out(j), then j is the final node of v_s. What distinguishes the graph G from the shuffle network is that G has only O(n) nodes. So performing pointer jumping on G requires only O(n) processors. Also, since each edge of G corresponds to a subpath of a path in the shuffle network, each path in G has length at most n, and pointer jumping takes O(log n) time.

Observe that our construction reduces permutation generation to stable integer sorting, or, alternatively, to sorting distinct integers drawn from the set {1, ..., n³}. Apart from integer sorting, all other operations can be performed in O(log n) time on the O(n/log n)-processor EREW PRAM with O(n) space. Thus our construction extends a similar reduction for weak randomized permutation generation of Hagerup [7].

3. CREW Algorithm. This section contains a proof of Theorem 2. In our construction we use the following probability distribution.

DEFINITION 3. We say that a random variable X ∈ {0, 1, ..., l} has hypergeometric probability distribution Hp(l, m) if

    Pr(X = i) = \binom{m}{i} \binom{m}{l-i} / \binom{2m}{l}.

The intuition is that we consider a set A of 2m elements consisting of two parts A_1, A_2, of m elements each. We choose l elements of A uniformly at random. Then X denotes the number of elements taken from A_1.
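The urn experiment gives a direct, if inefficient, sampler, and the probability mass function can be transcribed verbatim (a sketch with names of our own; `random.sample` stands in for the paper's unit-time hypergeometric generator):

```python
import random
from math import comb

def sample_hp(l, m, rng=random):
    """Draw X ~ Hp(l, m) via the urn experiment: pick l of the 2m elements
    of A = A1 ∪ A2 (elements 0..m-1 form A1) and count those from A1."""
    return sum(1 for x in rng.sample(range(2 * m), l) if x < m)

def hp_pmf(i, l, m):
    """Pr(X = i) = C(m, i) * C(m, l-i) / C(2m, l)."""
    return comb(m, i) * comb(m, l - i) / comb(2 * m, l)
```

By the Vandermonde identity the probabilities sum to 1 over i = 0, ..., l.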

Our CREW PRAM algorithm requires that each processor can randomly choose an integer from the set {0, 1, ..., m}, m ≤ n, with hypergeometric distribution Hp(l, m), l ≤ m, in a single PRAM step.


Fig. 4. The matching defined by a stable splitter.

3.1. Outline of the Algorithm. In the following we assume that n is a power of 2, which significantly simplifies the notation and does not influence the generality of the results obtained. The algorithm that we present yields a random perfect matching between two sets of n elements by constructing a matching network as described in Section 1.3. The main component of the network is a splitter (see Figure 4):

DEFINITION 4.

1. A perfect matching between sequences of nodes A and B is called stable if for every i ≤ |A| the ith node of A is matched with the ith node of B.
2. Let l be an even integer. An l-splitter between sequences of nodes P = P_0, ..., P_{l−1} and R = R_0, ..., R_{l−1} is a network which, for some increasing sequence of integers i_0, ..., i_{l/2−1}, 0 ≤ i_0, i_{l/2−1} < l,
   • defines a stable perfect matching between nodes P_{i_0}, ..., P_{i_{l/2−1}} and nodes R_0, ..., R_{l/2−1},
   • defines a stable perfect matching between the sequences of the remaining nodes of P and R_{l/2}, ..., R_{l−1}.

The nodes P_{i_0}, ..., P_{i_{l/2−1}} ∈ P are later called the chosen nodes of the splitter. The nodes in P are called input nodes and the nodes in R are called output nodes.

In the next subsection we show how to construct a random splitter. "Random" means in this context that, for a given l and a sequence P′ of l/2 input nodes, the probability of constructing the l-splitter with the chosen nodes P′ equals \binom{l}{l/2}^{-1}. Provided that we can build random splitters, the construction of the network defining a random perfect matching may be described as follows:


Algorithm 1. Recursive construction of a random matching network between sequences of nodes P = P_0, ..., P_{m−1} and R = R_0, ..., R_{m−1} (see Figure 5)

    if m = 1 then
        connect P_0 to R_0
    otherwise
        let P′ = P′_0, ..., P′_{m−1} be an additional sequence of nodes.
        (1) Splitting phase: Choose uniformly at random an m-splitter with
            the input nodes in P and the output nodes in P′.
        (2) Recursive call: Construct random matching networks for input
            nodes P′_0, ..., P′_{m/2−1} and output nodes R_0, ..., R_{m/2−1}
            and, independently, for input nodes P′_{m/2}, ..., P′_{m−1} and
            output nodes R_{m/2}, ..., R_{m−1}.
        (3) Output the composition of the networks constructed in (1) and (2).
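Collapsed to its effect on the final matching, Algorithm 1 can be sketched sequentially (names are our own; choosing the m-splitter uniformly amounts to choosing which m/2 inputs are "chosen", here with `random.sample`, and the recursion replaces the parallel construction):

```python
import random

def random_matching(inputs, rng=random):
    """Algorithm 1, sequential sketch: result[l] is the input node matched
    with output node R_l."""
    m = len(inputs)
    if m == 1:
        return list(inputs)
    chosen = set(rng.sample(range(m), m // 2))           # uniform splitter choice
    top = [inputs[i] for i in range(m) if i in chosen]   # routed to P'_0..P'_{m/2-1}
    bottom = [inputs[i] for i in range(m) if i not in chosen]
    return random_matching(top, rng) + random_matching(bottom, rng)
```

The recursion depth is log n, mirroring the log n stages of recursion discussed below.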

The above construction may be executed so that the splitting phase and the recursive call are performed in parallel. This rule may be applied to all recursive calls, so finally the PRAM generates, independently and in parallel, a number of splitters. Therefore the time of construction does not exceed the time needed to construct the largest splitter. Moreover, there are log n stages of recursion, so in order to show that we do not require too many processors it suffices to show that a splitter can be constructed with few processors.

PROPOSITION 3.1. Algorithm 1 returns each perfect matching with the same probability.

Fig. 5. Structure of the matching network constructed by Algorithm 1.


PROOF. We prove by induction on n that the probability that Algorithm 1 constructs a given perfect matching μ of n elements equals 1/n!. For n = 1 it is obvious. So let n > 1 be a power of 2. In order to obtain μ the following three events must take place. First, during the splitting phase we have to choose P_{i_0}, ..., P_{i_{n/2−1}} as exactly those nodes that are to be matched with R_0, ..., R_{n/2−1}, and take a splitter that connects P_{i_0}, ..., P_{i_{n/2−1}} with P′_0, ..., P′_{n/2−1}. Since each splitter is chosen with the same probability, the probability of this event equals \binom{n}{n/2}^{-1}. Second, at Phase 2 we have to match the nodes P′_0, ..., P′_{n/2−1} with R_0, ..., R_{n/2−1} in a unique way in order to get μ. By the induction hypothesis this happens with probability 1/(n/2)!. Third, at Phase 2 we have to match the nodes P′_{n/2}, ..., P′_{n−1} with R_{n/2}, ..., R_{n−1} in a unique way, and this happens with probability 1/(n/2)!. Hence we obtain matching μ with probability

    \binom{n}{n/2}^{-1} · 1/(n/2)! · 1/(n/2)! = 1/n!.
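The closing identity can be checked numerically, as a sanity check of the arithmetic rather than a part of the proof:

```python
from math import comb, factorial

# C(n, n/2) * (n/2)! * (n/2)! = n!  for even n, so the displayed product is 1/n!.
for n in (2, 4, 8, 16, 32):
    assert comb(n, n // 2) * factorial(n // 2) ** 2 == factorial(n)
```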

3.2. Construction of a Random Splitter. For each i, 0 ≤ i ≤ log n, let the sets P_{i,j}, 0 ≤ j < 2^i, partition the set of input nodes P = {P_0, P_1, ..., P_{n−1}} into 2^i consecutive intervals of length n/2^i. That is, we put P_{i,j} = {P_{j·n/2^i}, ..., P_{(j+1)·n/2^i−1}}. According to this definition each set P_{log n, j} consists of the single input P_j.

The idea of the construction of a splitter is that, in order to choose n/2 input nodes, we determine instead how many chosen elements are inside each set P_{i,j}. For this purpose we use a tree T with the set of vertices {T_{i,j} | 0 ≤ i ≤ log n and 0 ≤ j < 2^i}. We adopt the convention that T_{i,j} is a parent of T_{i+1,2j} and T_{i+1,2j+1}, so T is a binary tree of depth log n. Each vertex T_{i,j} ∈ T corresponds to the set P_{i,j} of input nodes. Thus the root of the tree, T_{0,0}, corresponds to the set of all input nodes. Further, for τ ∈ T the children of τ correspond to the "halves" of the set of their parent. Finally, each leaf of T corresponds to a single input node.

We use a labeling cnt of the vertices of T such that cnt(T_{i,j}) equals the number of chosen elements in P_{i,j}. Therefore the function cnt has to satisfy the following conditions:

• cnt(T_{0,0}) = n/2,
• cnt(T_{i,j}) = cnt(T_{i+1,2j}) + cnt(T_{i+1,2j+1}),
• 0 ≤ cnt(T_{i,j}) ≤ |P_{i,j}|,

for 0 ≤ i < log n and 0 ≤ j < 2^i. If these properties hold, then ⟨T, cnt⟩ is called a distribution tree. We also say that a node P_j is chosen if and only if cnt(T_{log n, j}) = 1. For an example, see Figure 6.
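A distribution tree can be sketched top-down by splitting each count between the two children with the urn experiment behind Hp(l, m) (a sketch under assumptions: the names are our own, the sequential hypergeometric sampler stands in for the paper's unit-time generator, and the paper's actual parallel construction follows in later subsections):

```python
import random

def distribution_tree(n, rng=random):
    """Build cnt[(i, j)] for n a power of two: the root gets n/2, and each
    count c over an interval of size 2*half is split by drawing how many of
    the c chosen elements land in the left half, distributed as Hp(c, half)."""
    cnt = {(0, 0): n // 2}
    levels = n.bit_length() - 1                  # log2 n
    for i in range(levels):
        half = n >> (i + 1)                      # |P_{i+1, 2j}| = |P_{i+1, 2j+1}|
        for j in range(1 << i):
            c = cnt[(i, j)]
            left = sum(1 for x in rng.sample(range(2 * half), c) if x < half)
            cnt[(i + 1, 2 * j)] = left
            cnt[(i + 1, 2 * j + 1)] = c - left
    return cnt
```

The three conditions on cnt hold by construction: the root count is n/2, children sum to their parent, and each count is bounded by the interval size.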

How to construct a distribution tree uniformly at random without losing efficiency is discussed in the next subsections. For the moment we assume that we are given some distribution tree ⟨T, cnt⟩ and show how to construct the corresponding splitter.

The set of chosen elements (and hence the distribution tree) defines a unique splitting perfect matching μ. Namely, μ(P_i) = R_l, where

    l = |{j | j < i and P_j is chosen}|              if P_i is chosen,
    l = n/2 + |{j | j < i and P_j is not chosen}|    if P_i is not chosen.

Obviously, μ has all the properties demanded of the matching defined by a splitter.
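The piecewise formula transcribes directly (a sketch with a name of our own; chosen inputs fill R_0, ..., R_{n/2−1} in order, the rest fill the bottom half in order):

```python
def splitting_matching(chosen, n):
    """mu from the displayed formula: mu[i] = l such that mu(P_i) = R_l."""
    mu = {}
    for i in range(n):
        if i in chosen:
            mu[i] = sum(1 for j in chosen if j < i)
        else:
            mu[i] = n // 2 + sum(1 for j in range(i) if j not in chosen)
    return mu
```

Because the counts on each side are ranks within their own group, the two halves of μ are stable, as Definition 4 requires.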


Fig. 6. A distribution tree.

Hence it remains only to indicate how we generate a splitter realizing μ. This network can be described recursively as follows:

Algorithm 2. Recursive description of a splitter from distribution tree ⟨T, cnt⟩
if n = 1 then
    connect the output node with the input node
otherwise
    let {P′_0, ..., P′_{n−1}} be an additional set of nodes divided into equal halves S′ = {P′_0, ..., P′_{n/2−1}} and S′′ = {P′_{n/2}, ..., P′_{n−1}}
    (1) Recursive call (see Figure 7). Independently do in parallel:
        • apply recursively the algorithm to S′, the first n/2 input nodes and the subtree of the distribution tree with the root in T_{1,0},
        • apply recursively the algorithm to S′′, the last n/2 input nodes and the subtree of the distribution tree with the root in T_{1,1}.
    (2) Distribution phase (see Figure 8):
        • connect the first cnt(T_{1,0}) nodes of S′ with the first cnt(T_{1,0}) output nodes,
        • connect the first cnt(T_{1,1}) nodes of S′′ with the next cnt(T_{1,1}) output nodes,
        • connect the remaining nodes of S′ with the next n/2 − cnt(T_{1,0}) output nodes,
        • connect the remaining nodes in S′′ with the remaining output nodes.
    (3) Output the composition of the networks constructed in (1) and (2).
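To see the induction at work, Algorithm 2 can be simulated sequentially. The sketch below is our own encoding (`splitter(chosen)` returns, for each input index, the output index it reaches) and assumes n is a power of two; it mirrors the recursive call and the distribution phase:

```python
def splitter(chosen):
    """Simulate Algorithm 2: out[i] = output index reached by input i.
    chosen[i] is True iff input P_i is chosen; len(chosen) is a power of two."""
    n = len(chosen)
    if n == 1:
        return [0]
    half = n // 2
    left = splitter(chosen[:half])    # positions reached inside S'
    right = splitter(chosen[half:])   # positions reached inside S''
    c0, c1 = sum(chosen[:half]), sum(chosen[half:])  # cnt(T_10), cnt(T_11)
    # Distribution phase: where each node of S' and S'' is wired among outputs.
    dist = [0] * n
    for p in range(half):  # node p of S'
        dist[p] = p if p < c0 else c0 + c1 + (p - c0)
    for p in range(half):  # node p of S'' (global index half + p)
        dist[half + p] = c0 + p if p < c1 else c0 + c1 + (half - c0) + (p - c1)
    return [dist[left[i]] if i < half else dist[half + right[i - half]]
            for i in range(n)]

print(splitter([True, False, True, False]))  # [0, 2, 1, 3]
```

For every pattern with exactly n/2 chosen inputs this recursion reproduces the matching µ defined earlier, which is exactly the induction proved below.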

We prove by induction on n that the network constructed above defines the matching µ.


Fig. 7. The recursive structure of a splitter.

For n = 1 it is obvious, so we assume that n > 1. By the construction, for i ≤ cnt(T_{1,0}) the i-th chosen input node is connected with the i-th node of S′ and therefore with the i-th output node. For i > cnt(T_{1,0}), the i-th chosen node is the (i − cnt(T_{1,0}))-th chosen node in the second half of the set of input nodes. Hence it is connected with the node i − cnt(T_{1,0}) of S′′. So this input node is finally connected with the output node with the index (i − cnt(T_{1,0})) + cnt(T_{1,0}) = i. We conclude that the chosen nodes are matched with

Fig. 8. Matching constructed by the distribution phase.


the output nodes indicated by µ. Similarly we may check that the nonchosen nodes are connected to the output nodes according to µ, too.

In order to generate the splitter described we perform the distribution phase and the recursive call in parallel. Because the connections generated by both parts of the algorithm depend only on the distribution tree ⟨T, cnt⟩, which is given in advance, the construction can be done by a CREW PRAM in constant time.

3.3. Random Choice of a Distribution Tree. As we have seen, in order to choose a splitter uniformly at random it suffices to choose a distribution tree uniformly at random. A straightforward way would be to construct the tree top-down as follows.

Algorithm 3. Naive method for constructing a random distribution tree
cnt(T_{0,0}) ← n/2
for i = 0 to log n − 1 do
    for 0 ≤ j < 2^i do in parallel
        choose cnt(T_{i+1,2j}) as a random number according to the probability distribution Hp(cnt(T_{i,j}), n/2^{i+1})
        cnt(T_{i+1,2j+1}) ← cnt(T_{i,j}) − cnt(T_{i+1,2j})
endfor
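A sequential stand-in for Algorithm 3 in Python (the paper runs the inner loop in parallel and assumes a ready-made hypergeometric generator; here the draw from Hp is simulated with `random.sample`, and the dictionary encoding of cnt is ours):

```python
import random

def naive_distribution_tree(n):
    """Sequential stand-in for Algorithm 3: split each count cnt(T_{i,j})
    hypergeometrically between the two children, level by level.
    n must be a power of two."""
    logn = n.bit_length() - 1
    cnt = {(0, 0): n // 2}
    for i in range(logn):
        size, half = n >> i, n >> (i + 1)   # |P_{i,j}| and half of it
        for j in range(1 << i):
            l = cnt[(i, j)]
            # Draw from Hp(l, n/2^{i+1}): scatter the l chosen positions
            # uniformly over the block and count those in its first half.
            k = sum(1 for p in random.sample(range(size), l) if p < half)
            cnt[(i + 1, 2 * j)] = k
            cnt[(i + 1, 2 * j + 1)] = l - k
    return cnt

cnt = naive_distribution_tree(8)
chosen = [j for j in range(8) if cnt[(3, j)] == 1]
```

Each leaf value cnt(T_{log n, j}) is 0 or 1, and exactly n/2 leaves end up chosen.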

Let us have a closer look at the above algorithm. Given a value cnt(T_{i,j}) we choose the values cnt(T_{i+1,2j}) and cnt(T_{i+1,2j+1}) so that cnt(T_{i+1,2j}) + cnt(T_{i+1,2j+1}) = cnt(T_{i,j}), as demanded for the function cnt. Given that we have already set cnt(T_{i,j}) = l, we know that we choose l nodes in P_{i,j}. Thus we should choose cnt(T_{i+1,2j}) = k, cnt(T_{i+1,2j+1}) = l − k (that is, k chosen elements from the first half and l − k from the second half) with probability

    C(n/2^{i+1}, k) · C(n/2^{i+1}, l − k) / C(n/2^i, l),

where C(m, k) denotes the binomial coefficient. That is, we have to choose according to the hypergeometric probability distribution Hp(l, n/2^{i+1}).
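Where no hypergeometric generator is at hand, the distribution can be simulated directly from its definition. This sequential sketch (our own helper, not the parallel generator the paper assumes) draws from Hp(l, m):

```python
import random

def hp(l, m):
    """Sample from Hp(l, m): place l chosen items uniformly among 2m
    positions and return how many land in the first m. This realizes the
    probability C(m, k) * C(m, l - k) / C(2m, l) stated above, but only as
    a sequential O(l) stand-in."""
    return sum(1 for p in random.sample(range(2 * m), l) if p < m)
```

The edge cases are forced: hp(0, m) is always 0, and hp(2m, m) is always m, since all positions are then chosen.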

PROPOSITION 3.2. When Algorithm 3 starts with cnt(T_{0,0}) = l, each subset of l nodes is chosen by the distribution tree equiprobably.

PROOF. The proof is by induction on n. For n = 1 this is obviously true. We assume that the claim holds for n/2; we check it for n. Consider a set X of l out of n nodes such that |X ∩ P_{1,0}| = k. In order to choose X it is necessary to decide upon cnt(T_{1,0}) = k, cnt(T_{1,1}) = l − k. This happens with probability

    C(n/2, k) · C(n/2, l − k) · C(n, l)^{−1}.

Function cnt is defined for the descendants of T_{1,0} so that, by the induction hypothesis, each subset of P_{1,0} of k elements is chosen with the same probability C(n/2, k)^{−1}. Similarly, each subset of P_{1,1} of l − k elements is chosen with probability C(n/2, l − k)^{−1}. The processes of constructing the subtrees with the roots T_{1,0} and T_{1,1} are independent, so the probability that we obtain X equals

    [C(n/2, k) · C(n/2, l − k) / C(n, l)] · C(n/2, k)^{−1} · C(n/2, l − k)^{−1} = 1 / C(n, l),

as required.

Notice that Algorithm 3 runs in O(log n) time and therefore cannot be used as a subroutine of the algorithm mentioned in Theorem 2. The reason for this running time is that in order to generate cnt(T_{i+1,2j}) and cnt(T_{i+1,2j+1}) we have to wait until cnt(T_{i,j}) is fixed. In the following subsection we show how to circumvent this difficulty.

3.4. Fast Parallel Generation of a Distribution Tree. In order to speed up the naive algorithm we apply the following trick. We substitute each T_{i,j} by a set {T_{i,j,a} | 0 ≤ a ≤ n/2^i}. By T_{i,j,a} we understand a copy of T_{i,j} which assumes that cnt(T_{i,j}) = a. In other words, T_{i,j,a} presumes that a nodes are to be chosen from P_{i,j}. Let T′ = {T_{0,0,n/2}} ∪ {T_{i,j,a} | 1 ≤ i ≤ log n, 0 ≤ j < 2^i, 0 ≤ a ≤ n/2^i}. The following algorithm lets each T_{i,j,a} choose at random its children T_{i+1,2j,a′} and T_{i+1,2j+1,a′′} so that a = a′ + a′′.

Algorithm 4. Generating Step
for each vertex τ = T_{i,j,a} ∈ T′ where i < log n do in parallel
    choose a number r at random according to the probability distribution Hp(a, n/2^{i+1})
    Childleft(τ) ← T_{i+1,2j,r}
    Childright(τ) ← T_{i+1,2j+1,a−r}

Since |T′| = 1 + 2·(n/2) + 4·(n/4) + ··· = n log n + 1, Generating Step can be performed in constant time with n log n processors.

Generating Step constructs a graph G with the set of vertices T′ and the edges (τ, Childleft(τ)) and (τ, Childright(τ)). It is easy to see that the successors of T_{0,0,n/2} in G form a complete binary tree of depth log n (see Figure 9). We call this tree T′ and define cnt for each vertex T_{i,j,a} of this tree to be a. Then ⟨T′, cnt⟩ is a well-defined distribution tree generated according to the probability distribution defined in the previous subsection.

In order to use the tree T′ in our construction of a splitter in Algorithm 2 we still lack one important thing. The distribution tree is given as a pointer structure and we do not know how to find the vertices of T′ without tracing T′ from the root. On the other hand, for Algorithm 2 we need to find the labels of the nodes!

The next algorithm gathers information about all vertices of T′ in the root of T′ (and, for technical reasons, at some vertices of T′ as long as it is needed). For this purpose we use the standard doubling technique. For each vertex τ of G we collect information about the subtree InfoTree(τ) rooted at τ and containing the successors of τ in G. Initially, each InfoTree(τ) stores the names of the children of τ. Using the doubling technique it is possible to replace the leaves of InfoTree(τ) by the subtrees found up to this moment by the vertices corresponding to the leaves of InfoTree(τ):


Algorithm 5. Gathering Step
for each vertex τ = T_{i,j,a} ∈ T′ do in parallel
    if i = log n then
        set InfoTree(τ) to be the tree with only one vertex τ, which is both the root and the leaf
    if 0 ≤ i < log n then
        set InfoTree(τ) to be the tree with root τ and the leaves Childleft(τ), Childright(τ)
repeat log log n times
    for each vertex τ ∈ T′ do in parallel
        if l_1, l_2, ..., l_r are the leaves of InfoTree(τ),
        then for 1 ≤ j ≤ r replace the leaf l_j in InfoTree(τ) by InfoTree(l_j)
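The doubling in the repeat loop is the standard pointer-doubling trick: after k rounds every vertex knows its descendants up to distance 2^k, so about log(depth) rounds suffice. A generic sketch on an arbitrary rooted tree, with a plain set desc(v) standing in for InfoTree(τ) (encoding and names are ours):

```python
import math

def gather_descendants(children, depth):
    """Doubling sketch of Algorithm 5's repeat loop: in each round every
    vertex replaces each known descendant u by u together with u's own
    known descendants, doubling the covered distance."""
    desc = {v: set(cs) for v, cs in children.items()}
    for _ in range(max(1, math.ceil(math.log2(depth)))):
        desc = {v: s | set().union(*(desc[u] for u in s))
                for v, s in desc.items()}
    return desc

# Complete binary tree of depth 3, vertices 1..15 in heap order.
tree = {v: ([2 * v, 2 * v + 1] if 2 * v + 1 <= 15 else []) for v in range(1, 16)}
full = gather_descendants(tree, 3)
print(sorted(full[1]) == list(range(2, 16)))  # True: root knows every descendant
```

Here depth 3 needs only ceil(log2 3) = 2 rounds, mirroring the O(log log n) bound for the depth-(log n) tree in the paper.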

PROPOSITION 3.3. Gathering Step collects the whole tree T′ in InfoTree(T_{0,0,n/2}) in O(log log n) time using n² processors.

PROOF. The repeat loop of the algorithm is executed in constant time: simply, for each leaf and each value copied there is a separate processor. Therefore the running time is O(log log n). The number of processors used for the k-th iteration does not exceed the total size of all InfoTrees immediately after iteration k. Notice that for each T_{i,j,a} ∈ T′ the InfoTree finally consists of 2·2^{log n − i} − 1 vertices. Thus the total number

Fig. 9. The edges chosen by the nodes of T′. The bold edges correspond to the distribution tree generated by T′.


of processors used at any iteration does not exceed

    ∑_{T_{i,j,a} ∈ T′} (2·2^{log n − i} − 1) = ∑_{i=0}^{log n} 2^i · (n/2^i + 1) · (2·2^{log n − i} − 1) = (2n − 1)² + n·(log n + 1).

By applying Generating Step and then Gathering Step we would get an algorithm which generates a distribution tree in time O(log log n) using n² processors. In the following we show how to combine Gathering Step with Algorithm 3 in order to reduce the number of processors without losing the execution speed.

The main idea behind our construction is to grow the InfoTrees only until some depth h is reached. If we stop growing the InfoTrees at the moment when they have depth h ≪ log n, then, because the size of each tree is small, we use far fewer processors than during Gathering Step. Once this is done we determine the distribution tree ⟨T′, cnt⟩ sequentially top-down: We start with the root, then collect information about the nodes of the first h levels (using the information contained in the InfoTree of the root), then about the nodes of the next h levels (using the InfoTrees of the nodes already informed at level h + 1), and so on.

Algorithm 6. Fast parallel method for generating and identification of a distribution tree; parameter h (0 < h < log n) is used to tune the execution time and the number of processors used:
(1) execute Generating Step
(2) for each τ ∈ T′ do in parallel
        perform log h times the loop of Gathering Step, so that InfoTree(τ) will contain a tree of depth h pointing to all successors of τ at distance at most h
(3) inform T_{0,0,n/2} that it belongs to T′
(4) for i = 1, h + 1, 2h + 1, ..., log n do
        for each τ, a vertex of T′ of depth i, do in parallel
            if τ knows that it belongs to T′ then
                inform all successors of τ pointed to by the vertices of InfoTree(τ) that they belong to T′
(5) for each i, j, a:
        if vertex T_{i,j,a} is marked as chosen (and thereby is in T′), then set cnt(T_{i,j}) = a
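Putting the phases together, Algorithm 6 can be simulated sequentially; the sketch below (our own dictionary encoding; the real algorithm performs each phase in parallel) generates the child pointers, gathers successors only up to distance about h, and then marks the distribution tree top-down in blocks of h levels. The doubling overshoots distance h slightly, which is harmless here since every successor of a tree vertex lies on the tree:

```python
import math
import random

def identify_distribution_tree(n, h):
    """Sequential simulation of Algorithm 6.  n is a power of two,
    0 < h < log n."""
    logn = n.bit_length() - 1
    # (1) Generating Step: every copy T_{i,j,a} fixes its children at once.
    child = {}
    for i in range(logn):
        size, half = n >> i, n >> (i + 1)
        for j in range(1 << i):
            for a in range(size + 1):
                # Hp(a, n/2^{i+1}) draw, simulated with random.sample.
                r = sum(1 for p in random.sample(range(size), a) if p < half)
                child[(i, j, a)] = ((i + 1, 2 * j, r), (i + 1, 2 * j + 1, a - r))
    # (2) Truncated Gathering Step: enough doubling rounds so that each copy
    # knows its successors up to distance at least h.
    reach = {v: set(cs) for v, cs in child.items()}
    for _ in range(max(1, math.ceil(math.log2(max(2, h))))):
        reach = {v: set().union(*({u} | reach.get(u, set()) for u in s))
                 for v, s in reach.items()}
    # (3) + (4) Blockwise top-down marking, h levels at a time.
    marked = {(0, 0, n // 2)}
    for i in range(0, logn, h):
        for v in [w for w in marked if w[0] == i]:
            marked |= reach.get(v, set())
    # (5) Read off cnt from the marked copies.
    return {(i, j): a for (i, j, a) in marked}
```

Each marked copy at a checkpoint level informs all its successors within distance at least h, so every level of the distribution tree gets marked, and exactly one copy per (i, j) survives.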

Now we analyze the resources necessary to execute Algorithm 6.

PROPOSITION 3.4. For every integer h, 1 ≤ h ≤ log n, Algorithm 6 runs in time O(log h + (log n)/h) and uses O(2^h · n log n) processors.

PROOF. We consider each phase of Algorithm 6 separately:

Phase 1: Generating Step takes constant time with n log n processors.


Phase 2: Each execution of the loop of Gathering Step is performed in constant time, so together the log h executions of the loop take O(log h) time. As during this step each τ ∈ T′ collects only information about its successors at distance ≤ h, the InfoTrees have depth at most h and thereby at most 2·2^h − 1 nodes each. So the total number of processors necessary to perform this step is bounded by

    ∑_{T_{i,j,a} ∈ T′} (2·2^h − 1) = ∑_{i=0}^{log n} 2^i · (n/2^i + 1) · (2·2^h − 1) ≤ 3·2^h·n log n.

Phase 4: The loop is executed (log n)/h times and its body is performed in one parallel step. The number of processors used does not exceed the number of vertices in T′, that is, O(n log n).

So finally, the total running time of Algorithm 6 is bounded by

    O(log h + (log n)/h).

The number of processors used is bounded by

    max{n log n, 3·2^h·n log n} = 3·2^h·n log n.

Concluding, we get the following result:

COROLLARY 3.5. For an arbitrary positive constant c, the distribution tree can be generated in time O(c log log n) using n^{1 + 1/(c log log n)} processors.

PROOF. Plug h = log n/((c + 1) log log n) into the bound from Proposition 3.4. Then Algorithm 6 runs in time O(log log n) with n · 2^{log n/((c+1) log log n)} · log n processors. Since for sufficiently large n, n · 2^{log n/((c+1) log log n)} · log n = n^{1 + 1/((c+1) log log n)} · log n ≤ n^{1 + 1/(c log log n)}, the bound follows.

3.5. Properties of the Constructions. It is easy to see that the splitter constructed using Algorithm 2 has depth log n. Hence, applying it in Algorithm 1, we get a random matching network N with O(log² n) levels and O(n log² n) nodes. By Corollary 3.5, generating N takes O(c log log n) time and uses n^{1 + 1/(c log log n)} processors for an arbitrary positive constant c. Performing pointer jumping on N takes O(log(log² n)) = O(log log n) steps and uses n log² n processors. Thereby the algorithm designed fulfills the properties stated in Theorem 2.
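The pointer jumping over the levels of N amounts to composing the per-level matchings with a doubling scheme: adjacent levels are composed pairwise, so L levels collapse in O(log L) rounds. An illustrative sequential sketch, with each level encoded as a permutation list (our own encoding):

```python
def evaluate_network(levels):
    """Compose the per-level permutations of a matching network by repeated
    pairwise composition: O(log L) rounds for L levels, a sequential
    stand-in for the pointer jumping performed on N."""
    while len(levels) > 1:
        nxt = [[b[x] for x in a]  # apply level a first, then level b
               for a, b in zip(levels[::2], levels[1::2])]
        if len(levels) % 2:       # odd level count: carry the last level over
            nxt.append(levels[-1])
        levels = nxt
    return levels[0]

# Two levels that undo each other compose to the identity.
print(evaluate_network([[1, 2, 0], [2, 0, 1]]))  # [0, 1, 2]
```

Each round halves the number of levels, so the O(log² n) levels of N are absorbed in O(log log n) rounds, matching the bound quoted above.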

4. Conclusions. As we already mentioned, our CREW PRAM algorithm uses a hypergeometric random number generator. It would be interesting to find an o(log n)-time strong randomized algorithm based on a simple number generator, or to provide an elegant parallel generator for the hypergeometric distribution.


Our EREW algorithm can be significantly accelerated if we remove the assumption that each permutation has to be generated with the same probability. Simply, in one parallel step one can set the switches of the Beneš network randomly and then determine the permutation defined by the network through pointer jumping in O(log log n) steps. However, the resulting probability distribution is far from being uniform. A challenging problem is to establish a lower bound for the running time of uniform permutation generation on the strong randomized EREW PRAM.

References

[1] R. Anderson, Parallel algorithms for generating random permutations on a shared memory machine, in Proc. 2nd Annual ACM Symposium on Parallel Algorithms and Architectures (ACM Press, New York, 1990), pp. 95–102.
[2] A. Borodin and J. E. Hopcroft, Routing, merging, and sorting on parallel models of computation, J. Comput. System Sci. 30 (1985), 130–145.
[3] R. Cole, Parallel merge sort, SIAM J. Comput. 17 (1988), 770–785.
[4] S. Cook, C. Dwork, and R. Reischuk, Upper and lower bounds for parallel random access machines without simultaneous writes, SIAM J. Comput. 15 (1986), 87–97.
[5] M. Dietzfelbinger, M. Kutyłowski, and R. Reischuk, Exact lower time bounds for computing boolean functions on CREW PRAMs, J. Comput. System Sci. 48 (1994), 231–254.
[6] R. Durstenfeld, Random permutation (Algorithm 235), Comm. ACM 7 (1964), 420.
[7] T. Hagerup, Fast parallel generation of random permutations, in Proc. 18th Annual International Colloquium on Automata, Languages and Programming, ICALP '91 (Springer-Verlag, LNCS 510, Heidelberg, 1991), pp. 405–416.
[8] D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, volume 2, 2nd edition, Addison-Wesley, Reading, Massachusetts, 1981.
[9] Y. Matias and U. Vishkin, Converting high probability into nearly-constant time, with applications to parallel hashing, in Proc. 23rd Annual ACM Symposium on Theory of Computing (ACM Press, New York, 1991), pp. 307–316.
[10] G. L. Miller and J. H. Reif, Parallel tree contraction, in Proc. 26th Symposium on Foundations of Computer Science (IEEE, Los Alamitos, 1985), pp. 478–489.
[11] G. L. Miller and J. H. Reif, Parallel tree contraction, Part 1: Fundamentals, Adv. in Comput. Res. 5 (1989), 47–72.
[12] R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, Cambridge, 1995.
[13] K. Mulmuley, Computational Geometry: An Introduction Through Randomized Algorithms, Prentice-Hall, Englewood Cliffs, New Jersey, 1994.
[14] J. Pieprzyk and B. Sadeghiyan, Design of Hashing Algorithms, Springer-Verlag, Berlin, 1987.
[15] S. Rajasekaran and J. Reif, Optimal and sublogarithmic time randomized parallel sorting algorithms, SIAM J. Comput. 19 (1989), 594–607.
[16] J. Reif, An optimal parallel algorithm for integer sorting, in Proc. 26th Symposium on Foundations of Computer Science (IEEE, Los Alamitos, 1985), pp. 490–503.
[17] R. Sedgewick, Permutation generation methods, ACM Comput. Surv. 9 (1977), 138–164.

