

OPTIMAL ALL-TO-ALL BROADCAST SCHEMES IN DISTRIBUTED COMPUTING SYSTEMS

Ming-Syan Chen, Philip S. Yu and Kun-Lung Wu
IBM Thomas J. Watson Research Center
P.O. Box 704
Yorktown Heights, NY 10598

Abstract

Broadcast, referring to a process of information dissemination in a distributed system whereby a message originating from a certain node is sent to all other nodes in the system, is a very important issue in distributed computing. All-to-all broadcast means the process by which every node broadcasts its certain piece of information to all other nodes. In this paper, we first develop the optimal all-to-all broadcast scheme for the case of one-port communication, which means that each node can only send out one message in one communication step, and then, extend our results to the case of multi-port communication, i.e., k-port communication, meaning that each node can send out k messages in one communication step. We prove that the proposed schemes are optimal for the model considered in the sense that they not only require the minimal number of communication steps, but also incur the minimal number of messages.

Index Terms: Distributed computing systems, all-to-all broadcast, NODUP, partitioning trees, minimal complete sets, multi-port communication.

1 Introduction

The availability of inexpensive, high-performance microprocessors has made it attractive to link together many powerful and autonomous computers to build a distributed computing system for better availability and cost performance [7] [17] [19] [21]. In such a system, instead of using a shared memory and a global clock, all the synchronization and communication between the processing nodes is done via message passing [1] [22]. Since data are distributed, not shared, special schemes are generally required to perform various distributed computations. One such scheme is broadcast [6] [8] [9] [11] [16] [25], which refers to a process of information dissemination in a distributed system whereby a message originating from a certain node is sent to all other nodes in the system. An example of a one-to-all broadcast scheme can be found in Figure 1, where {1} denotes the message broadcast by the originator N1 and the broadcast is completed in 3 steps while incurring 4 messages. In addition to one-to-all broadcast, all-to-all broadcast, where every node, instead of a certain node as in one-to-all broadcast, has a piece of information to be shared with others, is also very important in numerous applications in distributed computing. Applications of all-to-all broadcast include decentralized consensus protocols [3], extrema finding, coordination of distributed checkpoints [14], acquisition of a new global state [5], and the broadcast of personalized information [12]. Similar to one-to-all broadcast, all-to-all broadcast schemes are characterized by the number of message steps required and the total number of messages incurred to complete the broadcast [11] [12] [23] [26].

Several studies have been conducted to minimize the number of message steps (or time) of one-to-all and all-to-all broadcast schemes for various communication networks/distributed environments. A recent survey can be found in [11]¹. Also, to reduce the number of messages, some broadcast schemes and consensus protocols were proposed in [14] [26] and shown to be efficient in terms of message complexity. To reduce the overhead of the scheme without compromising its efficiency, we naturally would like to complete the broadcast in the minimal number of steps while incurring as few messages as possible. However, to date, despite its importance, the problem of determining the minimal number of messages required for all-to-all broadcast in the minimal number of steps has not been solved. Consequently, we derive in this study optimal all-to-all broadcast schemes for a distributed system that complete the broadcast with not only the minimal number of steps but also the minimal number of messages.

¹ All-to-all broadcast is the same as the gossiping problem in [11] except that a two-way transmission, such as a phone conversation, is assumed for the latter.

To facilitate our presentation, we start by considering the case of one-port communication, which means that each node can only send out one message in one communication step. We then show our results for the case of multi-port communication [12], i.e., k-port communication, meaning that each node can simultaneously send out k messages in one communication step. Specifically,
we consider completely connected systems with synchronous communication. Such a model was employed in other related work [11]. (A detailed description of the model can be found in Section 2.) For ease of exposition, we use the identification (id) of each node to denote the information that this node wants to broadcast to every other node². Also, as in [11] [20] [24], to reduce the cost of transmission, the schemes investigated here are those without duplicate information, i.e., every message conveys only new information to its receiver. This sort of scheme is termed "NODUP" (for no duplication) in [11]. All-to-all broadcast is completed once each node has received the id's of all other nodes.

[Figure 1 graphic: panels (a) Step 1, (b) Step 2, (c) Step 3 over nodes N1 to N5; every message carries {1}.]
Figure 1: A one-to-all broadcast scheme in a system of 5 nodes (3 steps and 4 messages).

² Depending on the application, the real content of an id can be very general, such as personalized information or a database, the numbers to be sorted, a vector describing the local state, or a "yes" or "no" vote of the commit protocol in a distributed transaction, to name a few.

³ A message $\{a_1, a_2, \ldots, a_j\}$ should be viewed as $f(a_1, a_2, \ldots, a_j)$, where $f$ is an application-dependent function.

The problem studied in this paper can be best understood by considering the case of all-to-all broadcast among 5 nodes with one-port communication in Figure 2. The information collected thus far after each step is shown in the bracket next to each node. An arrow pointing from node Ni to node Nj indicates that Ni is sending what it knows thus far to Nj. For example, in step 1 in Figure 2, nodes N1 and N2 simultaneously send their own id's to node N4. Thus, after step 1, node N4 will have the information {1,2,4}. Note that one message might consist of more than one id³, and each message takes exactly one communication step. We use black nodes to denote the ones that have gathered all the information from all other nodes, and white nodes to denote those
that still have incomplete information. As shown in Figure 2, all-to-all broadcast is completed after 3 steps while incurring 12 messages. For illustrative purposes, another all-to-all broadcast for a system of 5 nodes is shown in Figure 3, where 4 steps and 8 messages are required, showing a trade-off between the number of messages and that of communication steps.

[Figure 2 graphic: panels (a) Step 1, (b) Step 2, (c) Step 3 over nodes N1 to N5. After step 1, N1 holds {1,4}, N2 {2}, N3 {3,5}, N4 {1,2,4} and N5 {3,5}; after step 2, N1 holds {1,3,4,5}, N2 {2}, and the remaining nodes hold {1,2,3,4,5}.]
Figure 2: An all-to-all broadcast scheme in a system of 5 nodes (3 steps and 12 messages).

To develop the optimal all-to-all broadcast scheme for one-port communication, we shall first introduce the concept of a balanced binary partitioning tree of a positive number to exploit the nature of NODUP schemes. In light of the balanced binary partitioning tree, we devise an addressing scheme, based on the topology of a hypercube, for the nodes in the distributed system. Using the addressing scheme and the concept of minimal complete sets to be introduced later, the optimal all-to-all broadcast scheme can be systematically executed according to the balanced binary partitioning tree, and completed by a system of $p$ nodes in the minimal number of steps, i.e., $\lceil \log_2 p \rceil$ steps, while incurring $np + p - 2^n$ messages, which is proved to be the minimal number of messages required for all-to-all broadcast in $n$ steps, where $n = \lceil \log_2 p \rceil$. Moreover, in light of the topology of generalized hypercubes [2], we extend our results to the case of multi-port communication. It is proved in Theorem 3 that the proposed scheme not only requires the minimal number of steps, i.e., $\lceil \log_{k+1} p \rceil$ steps, but also incurs the minimal number of messages required to complete k-port all-to-all broadcast in $\lceil \log_{k+1} p \rceil$ steps. To the best of our knowledge, there is no prior work on determining the minimal number of messages required for all-to-all broadcast in the minimal number of steps for a distributed system of an arbitrary number of nodes, for either one-port or
k-port communication. This feature distinguishes our work from others.

[Figure 3 graphic: panels (a) Step 1, (b) Step 2, (c) Step 3, (d) Step 4 over nodes N1 to N5. N1 collects {1,2,3,4,5} in step 1; N2 becomes complete in step 2, N3 in step 3, and N4 and N5 in step 4.]
Figure 3: Another all-to-all broadcast in a system of 5 nodes (4 steps and 8 messages).

This paper is organized as follows. Preliminaries are given in Section 2. We develop optimal all-to-all broadcast schemes for 1-port and k-port communication in Sections 3 and 4, respectively. This paper concludes with Section 5.

2 Preliminaries

We shall use the hypercube topology to facilitate the presentation of our broadcast schemes. However, as will become clear later, this does not mean that the number of nodes in the system has to be equal to that of a hypercube. An n-dimensional hypercube, denoted by $Q_n$, can be defined as follows.

Definition 1: A $Q_n$ is defined recursively as follows [10].
(i) $Q_0$ is a trivial graph with one node, and
(ii) $Q_n = K_2 \times Q_{n-1}$, where $K_2$ is the complete graph with two nodes.

A $Q_n$ contains $2^n$ nodes. Let $\Sigma$ be the ternary symbol set {0, 1, *}, where * is a don't-care symbol. Every subcube in a $Q_n$ can then be uniquely represented by a string of symbols in $\Sigma$. Such a string of ternary symbols is called the address of the corresponding subcube. The rightmost coordinate of the address of a subcube will be referred to as dimension 1, and the second to the
rightmost coordinate as dimension 2, and so on. For a distributed system of $p$ nodes, we shall address the $p$ nodes with n-bit strings of ternary symbols, where $n = \lceil \log_2 p \rceil$. Also, we use $\bar{b}_i$ to denote the inverse of a bit $b_i$, so that $\bar{1} = 0$, $\bar{0} = 1$ and $\bar{*} = *$. A node $b_n \ldots b_i \ldots b_1$ is called the i-th neighbor of node $b_n \ldots \bar{b}_i \ldots b_1$, and vice versa.

We use the identification (id) of each node to denote the information that this node wants to broadcast to every other node. Also, the information at each node means the set of id's that the node has collected thus far, and the content of the message of a transmission is the information of the sender at the time of transmission. One message might contain many id's. The terms communication step and message step are used interchangeably. For example, in step 2 of Figure 2, the message from N3 to N1 is {3,5}, and the information at N1 after step 2 is {1,3,4,5}. An all-to-all broadcast is said to be completed if all nodes in the system have received all id's in the system. The system model we consider can be summarized as follows.

Model M
1. The system is completely connected with synchronous communication.
2. Every message sent in the system takes one communication step.
3. Only NODUP (no duplication) schemes, where each message conveys only new information to its receiver, are considered.
4. k-port communication means that each node is capable of sending k messages out to any k receivers in one step. (There is no restriction on the number of messages each node can receive in one step.)

Definition 2: An all-to-all broadcast scheme is called optimal if, under the above model M, the following two conditions are satisfied.
1. It completes the broadcast in the minimal number of steps.
2. It incurs the minimal number of messages required to complete the broadcast in the minimal number of steps.

Note that the assumption that each message takes one communication step can be justified by the technique of virtual cut-through for communication [13], which can be incorporated into multicomputer systems such as the iPSC/2 [4]. In addition, as will be proved in Section 4, under a NODUP scheme for a system of $p$ nodes, the total number of id's sent in all messages during the entire scheme is $p(p-1)$, which is the minimal number of id's that need to be sent for an all-to-all broadcast. This is the reason we study NODUP schemes. Consequently, the objective of this paper is to develop optimal all-to-all broadcast schemes under model M.
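As a small illustration of the addressing conventions introduced at the start of this section, the Python sketch below numbers dimensions from the right and inverts one symbol of a ternary address to obtain the i-th neighbor. The function names `invert` and `ith_neighbor` are illustrative and do not appear in the paper.

```python
def invert(symbol):
    """Invert one ternary symbol: 0 <-> 1, while the don't-care '*' is unchanged."""
    return {"0": "1", "1": "0", "*": "*"}[symbol]


def ith_neighbor(address, i):
    """Return the address obtained by inverting dimension i of `address`.

    Dimension 1 is the rightmost symbol, dimension 2 the second from the right.
    """
    if not 1 <= i <= len(address):
        raise ValueError("dimension out of range")
    pos = len(address) - i  # string index of dimension i
    return address[:pos] + invert(address[pos]) + address[pos + 1:]


print(ith_neighbor("010", 1))  # 011
print(ith_neighbor("010", 3))  # 110
```

Inverting the same dimension twice returns the original address, which matches the "and vice versa" in the definition of the i-th neighbor.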

3 Optimal All-To-All Broadcast for 1-Port Communication

In this section, we develop the optimal all-to-all broadcast scheme for one-port communication. As will be proved by Theorem 2 later, for a system of $p$ nodes with one-port communication, the minimal number of messages required for all-to-all broadcast in $n$ steps is $np + p - 2^n$, where $n = \lceil \log_2 p \rceil$. To facilitate our presentation, we shall propose an addressing scheme for nodes in the system. In light of the addressing scheme, the proposed broadcast algorithm can be systematically executed and shown to incur the minimal number of messages in the minimal number of steps. The scheme developed in this section is extended to the case of k-port communication in Section 4. To describe the addressing scheme, it is necessary to define a partitioning tree and a balanced binary partitioning tree of a positive number as follows.

Definition 3: A partitioning tree of a positive number $p$ is a tree where the root node is labeled with $p$, all leaf nodes are labeled with ones, and the number labeled in each non-leaf node (or internal node) is the sum of those labeled in its child nodes.

Definition 4: A balanced binary partitioning tree of a positive number $p$ is a binary partitioning tree constructed as follows.
1. Label the root node with $p$.
2. For each node with a label $k \ge 2$, generate the left and right children of this node and label them with $\lceil k/2 \rceil$ and $\lfloor k/2 \rfloor$, respectively.

For example, the balanced binary partitioning tree with $p = 6$ is given in Figure 4. Clearly, there are $n + 1$ levels in the balanced binary partitioning tree of a number $p$, where $n = \lceil \log_2 p \rceil$. For convenience, the level of the root is called level 0.
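Definition 4 can be turned into a short construction. The following Python sketch builds the tree level by level; the function names and the list-of-levels representation are assumptions made here for brevity, not notation from the paper.

```python
def balanced_partition_levels(p):
    """Return the labels of the balanced binary partitioning tree of p, level by level."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    levels = [[p]]  # level 0 holds the root
    # Keep splitting while some node in the last level has a label of at least 2.
    while any(label >= 2 for label in levels[-1]):
        next_level = []
        for label in levels[-1]:
            if label >= 2:
                next_level.append((label + 1) // 2)  # left child: ceil(label / 2)
                next_level.append(label // 2)        # right child: floor(label / 2)
        levels.append(next_level)
    return levels


def ceil_log2(p):
    """ceil(log2 p) for a positive integer p, computed without floating point."""
    return (p - 1).bit_length()


for depth, labels in enumerate(balanced_partition_levels(6)):
    print(f"level {depth}: {labels}")
```

For p = 6 the sketch yields the labels 6, then 3 and 3, then 2, 1, 2, 1, then four ones, that is n + 1 = 4 levels as stated above.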
Using the balanced binary partitioning tree, the nodes in the system can be addressed as follows.

Addressing scheme A1
Step 1: For a system of $p$ nodes, obtain the balanced binary partitioning tree of $p$.
Step 2: For every internal node, code the edge to its left child with a bit "0" and that to its right child with a bit "1".
Step 3: Determine the address of each leaf node by the coded bits in the edges on the path from the root to that node.
Step 4: Append a bit "*" to each leaf node in level $n - 1$.
Step 5: Assign arbitrarily the $p$ nodes in the system with the addresses of the $p$ leaf nodes in the balanced binary partitioning tree.
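The five steps of scheme A1 can be sketched as a recursive walk over the balanced binary partitioning tree. This is a minimal sketch, not the authors' implementation: `a1_addresses` is a hypothetical name, and Step 5 (the arbitrary assignment of addresses to physical nodes) is left to the caller.

```python
def a1_addresses(p):
    """Return the leaf addresses that scheme A1 produces for a system of p nodes."""
    n = (p - 1).bit_length()  # n = ceil(log2 p)
    addresses = []

    def visit(label, prefix):
        if label == 1:
            # Step 4: a leaf in level n-1 is one symbol short and gets a trailing '*'.
            addresses.append(prefix + "*" * (n - len(prefix)))
            return
        # Step 2: the left edge is coded 0 and the right edge 1.
        visit((label + 1) // 2, prefix + "0")  # left child: ceil(label / 2)
        visit(label // 2, prefix + "1")        # right child: floor(label / 2)

    visit(p, "")
    return addresses


print(a1_addresses(6))  # ['000', '001', '01*', '100', '101', '11*']
```

The six addresses printed for p = 6 are the ones shown in Figure 4.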

[Figure 4 graphic: tree with root 6 (level 0), two nodes labeled 3 (level 1), nodes labeled 2, 1, 2, 1 (level 2) and four leaves labeled 1 (level 3); each left edge is coded 0 and each right edge 1. Leaf addresses: 000, 001, 100, 101 in level 3 and 01*, 11* in level 2.]
Figure 4: The balanced binary partitioning tree and its addressing scheme when p = 6.

An example of the above addressing scheme can be found in Figure 4. Note that while the leaf nodes in level $n = \lceil \log_2 p \rceil$ of the balanced binary partitioning tree of a number $p$ are addressed as hypercube nodes, i.e., $Q_0$'s, the leaf nodes in level $n - 1$ are addressed as 1-dimensional subcubes, i.e., $Q_1$'s. A $Q_3$ whose subcubes are used to address the 6 nodes in Figure 4 is given in Figure 5. In light of the addressing scheme, optimal all-to-all broadcast can be described by algorithm G below, where the primitive send(M, $b_n b_{n-1} \ldots b_1$) means sending the message M to the node $b_n b_{n-1} \ldots b_1$, receive(RM) means receiving the messages RM, and M ∪ RM denotes the union of the messages M and RM.

Algorithm G:
/* Let $p$ be the number of nodes in the system and $n = \lceil \log_2 p \rceil$ */
Address each node according to scheme A1.
/* Node $b_n b_{n-1} \ldots b_1$ does the following. */
M := {$b_n b_{n-1} \ldots b_1$};
for j = n to 1 step −1 do
begin
    if $b_j \ne *$ then send(M, $b_n \ldots \bar{b}_j \ldots \hat{b}_1$);
    /* where $\hat{b}_1 = 0$ if $b_1 = *$; otherwise $\hat{b}_1 = b_1$. */
    receive(RM);
    M := M ∪ RM;
end
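Algorithm G can be exercised with a small synchronous simulation. The Python sketch below is one reading of the algorithm under model M, with two assumptions labeled in the code: a target bit string that falls inside a subcube address x* is delivered to node x*, and all messages of a step are delivered only after every sender has composed its message. The names `a1_addresses` and `simulate_g` are hypothetical.

```python
def a1_addresses(p):
    """Leaf addresses of the balanced binary partitioning tree of p (scheme A1)."""
    n = (p - 1).bit_length()  # n = ceil(log2 p)
    addresses = []

    def visit(label, prefix):
        if label == 1:
            # A leaf in level n-1 is one symbol short and is padded with '*'.
            addresses.append(prefix + "*" * (n - len(prefix)))
            return
        visit((label + 1) // 2, prefix + "0")  # left child: ceil(label / 2)
        visit(label // 2, prefix + "1")        # right child: floor(label / 2)

    visit(p, "")
    return addresses


def simulate_g(p):
    """Run algorithm G and return (messages sent per step, total ids sent)."""
    n = (p - 1).bit_length()
    info = {address: {address} for address in a1_addresses(p)}
    messages_per_step = []
    ids_sent = 0
    for j in range(n, 0, -1):      # dimensions n, n-1, ..., 1
        pos = n - j                # string index of symbol b_j
        outbox = []
        for sender, known in info.items():
            if sender[pos] == "*":
                continue           # b_j = *: this node sends nothing in this step
            bits = list(sender.replace("*", "0"))  # a trailing * is read as 0
            bits[pos] = "1" if bits[pos] == "0" else "0"
            target = "".join(bits)
            # Assumption: a target inside subcube x* is delivered to node x*.
            receiver = target if target in info else target[:-1] + "*"
            outbox.append((receiver, frozenset(known)))
        # Assumption: synchronous step, so delivery happens after all sends.
        for receiver, message in outbox:
            assert not (message & info[receiver]), "NODUP violated"
            info[receiver] |= message
            ids_sent += len(message)
        messages_per_step.append(len(outbox))
    assert all(len(known) == p for known in info.values()), "broadcast incomplete"
    return messages_per_step, ids_sent


print(simulate_g(6))  # ([6, 6, 4], 30)
```

The two assertions inside `simulate_g` check the NODUP property on every delivery and the completion of the broadcast after n steps.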

[Figure 5 graphic: a $Q_3$ with nodes 000, 001, 010, 011, 100, 101, 110 and 111, in which the subcubes 01* and 11* each address a single node.]
Figure 5: Illustrating the addressing of 6 nodes by a $Q_3$.

To show the operations of algorithm G, an example for a system of $8 = 2^3$ nodes is given in Figure 6, where the broadcast scheme can be described in light of the topology of a $Q_3$. It can be seen that nodes exchange messages via dimension 3 first, then dimension 2 and dimension 1. For illustrative purposes, the information collected by node 001 thus far is shown in the bracket next to 001. All nodes have received all id's (marked black) after 3 steps.

To show the operations of G for a system whose number of nodes is not a power of two, consider a system of 6 nodes. Under the addressing scheme shown in Figure 4 and the operations of algorithm G, the 3 steps of the message passing are shown in Figure 7. It can be verified that the broadcast is completed in 3 steps and the total number of messages sent is 6 + 6 + 4 = 16. Note that a node with an address of the form $b_n b_{n-1} \ldots b_2 *$, such as node 01* in Figure 7, determines its message receiver by setting * to 0 and inverting the appropriate bit as described in algorithm G, so that each node sends out one message at a time⁴. Define a minimal complete set of nodes as a minimal set of nodes that collectively holds all the information. For example, after step 1 in Figure 7, the node set {000,001,01*} is a minimal complete set since the nodes in the set have enough information to complete the broadcast, whereas {000,01*} is not, nor is {000,001,01*,100}, since the latter is not a minimal set.

⁴ Note that the purpose of setting * to 0 is mainly to provide a systematic procedure. It can be verified that algorithm G is also valid if a node such as $b_n b_{n-1} \ldots b_2 *$ determines its message receiver by setting * to 1 and inverting the appropriate bit accordingly.

As will become clear later, the number labeled in each internal node in level $i$ of the balanced binary partitioning tree denotes not only the number of nodes in the corresponding
minimal complete set, but also the number of messages sent by the nodes in that minimal complete set in step $i + 1$ of algorithm G. Using the concept of minimal complete sets, we obtain the following lemma for algorithm G.

[Figure 6 graphic: three panels (Step 1, Step 2, Step 3) on a $Q_3$ with nodes 000 to 111; node 001 holds {001,101} after step 1 and {001,101,011,111} after step 2.]
Figure 6: Optimal all-to-all broadcast for a system of 8 nodes.

[Figure 7 graphic: three panels over nodes 000, 001, 01*, 100, 101 and 11*. Step 1: 6 messages; Step 2: 6 messages; Step 3: 4 messages. The information sets shown after step 1 are {000,100}, {001,101} and {01*,11*}.]
Figure 7: Optimal all-to-all broadcast for a system of 6 nodes.

Lemma 1: After step $i$ of algorithm G, there are $2^i$ minimal complete sets, where $2^i$ is the number of nodes in level $i$ of the balanced binary partitioning tree for $0 \le i \le n - 1$. Specifically, they are formed by nodes corresponding to the subcubes with the addresses $b_n b_{n-1} \ldots b_{n-i+1} * \ldots *$, where $b_j \in \{0, 1\}$ for $n - i + 1 \le j \le n$.

Proof: It can be seen that when $i = 0$, i.e., before the operations in algorithm G begin, the minimal complete set is the set that contains all nodes in the system. In the execution of step 1, each node, say $b_n b_{n-1} \ldots b_1$, sends a message to its n-th neighbor, $\bar{b}_n b_{n-1} \ldots b_1$, in the $Q_n$. Thus, nodes in the two $Q_{n-1}$'s, 1**...* and 0**...*, form two minimal complete sets, respectively, since after step 1, node $b_n b_{n-1} \ldots b_1$ already contains the information that both $b_n b_{n-1} \ldots b_1$ and $\bar{b}_n b_{n-1} \ldots b_1$ had in step 0. It can be verified from the addressing scheme of the balanced binary partitioning tree that the two subcubes, 1**...* and 0**...*, are associated with the two child nodes of the root. Then, from the fact that nodes exchange messages via dimension $n + 1 - i$ in step $i$, it follows that a minimal complete set associated with a $Q_{n-i+1}$ in level $i - 1$ is partitioned into two minimal complete sets associated with two $Q_{n-i}$'s in level $i$, thus proving this lemma by induction. Q.E.D.

The fact that each node in level $i$ of the balanced binary partitioning tree is associated with a minimal complete set after step $i$ of algorithm G can be seen in Figure 4 and Figure 7.
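The subcube grouping in Lemma 1 can be illustrated by grouping addresses on their leftmost i symbols. The sketch below hard-codes the six addresses of Figure 4; `minimal_complete_sets` is a hypothetical helper that only reproduces the grouping, and does not by itself prove that each group holds all the information.

```python
FIGURE_4_ADDRESSES = ["000", "001", "01*", "100", "101", "11*"]


def minimal_complete_sets(addresses, i):
    """Group addresses by their leftmost i symbols, i.e., by b_n ... b_(n-i+1)."""
    groups = {}
    for address in addresses:
        groups.setdefault(address[:i], []).append(address)
    return groups


for step in range(3):
    groups = minimal_complete_sets(FIGURE_4_ADDRESSES, step)
    print(step, {prefix: len(members) for prefix, members in groups.items()})
```

For each i from 0 to 2 the number of groups is 2^i, and the group sizes match the labels in level i of the tree in Figure 4 (6, then 3 and 3, then 2, 1, 2, 1).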
Then, we have the following theorem.

Theorem 1: For a system of $p$ nodes with one-port communication, algorithm G completes an all-to-all broadcast in $n$ steps by incurring $np + p - 2^n$ messages, where $n = \lceil \log_2 p \rceil$.

Proof: From Lemma 1, we know that after $\lceil \log_2 p \rceil$ steps every node forms a minimal complete set by itself, meaning that all-to-all broadcast is complete. To determine the total number of messages incurred, consider the balanced binary partitioning tree of the number $p$. First, from the proof of Lemma 1, it can be seen that the number labeled in each node of the tree is the number of processing nodes⁵ in the corresponding minimal complete set.

⁵ Processing nodes mean the computing nodes in the distributed system, and should not be confused with the nodes in a partitioning tree.

Next, we claim that for a minimal complete set of $k$ nodes to be partitioned in one step into two minimal complete sets of the cardinalities $k_1$ and $k_2$, where $k = k_1 + k_2$, each of the $k$ nodes has to send out in that step a message containing the information it has collected thus far. Note that in a minimal complete set, the information of each node is not contained in that of any other node, meaning that if a node, say Ni, does not send out a message in that step, Ni should be included in both minimal complete sets, leading to a
contradiction to $k_1 + k_2 = k$, thus proving this claim. It follows that the total number of messages sent in algorithm G is the sum of the numbers labeled in the internal nodes in the corresponding partitioning tree.

Since a number $k$ is partitioned into $\lceil k/2 \rceil$ and $\lfloor k/2 \rfloor$ in the partitioning tree, it can be seen that all the leaf nodes are in either level $n - 1$ or level $n$. Then, we know that the numbers labeled in the nodes in level $n - 1$ must be either 2's (labeled in non-leaf nodes) or 1's (labeled in leaf nodes), and the sum of those numbers is $p$, meaning that the number of 2's in level $n - 1$ is $p - 2^{n-1}$, since the number of nodes in level $n - 1$ is $2^{n-1}$. From this fact and the fact that the sum of the numbers labeled in the nodes in each level $i$, for $0 \le i \le n - 2$, is $p$, it follows that the sum of the numbers labeled in the internal nodes in the balanced binary partitioning tree of $p$ is $p(n-1) + 2(p - 2^{n-1}) = np + p - 2^n$. Q.E.D.

It can be verified that the number of messages for the example in Figure 6 is $24 = 3 \cdot 8 + 8 - 8$, and that for the example in Figure 7 is $16 = 3 \cdot 6 + 6 - 8$, agreeing with Theorem 1. Note that it takes at least $\lceil \log_2 p \rceil$ steps for one-to-all broadcast in a system of $p$ nodes. This fact leads to the following proposition.

Proposition 1: In a system of $p$ nodes with one-port communication, the minimal number of steps required for all-to-all broadcast is $\lceil \log_2 p \rceil$.

Hence, from the above proposition and Theorem 1, we have the following corollary.

Corollary 1.1: In a system of $p$ nodes with one-port communication, algorithm G requires the minimal number of steps, $\lceil \log_2 p \rceil$, to complete an all-to-all broadcast.

It can be verified that in a system of $p$ nodes, for one minimal complete set to be partitioned into two in one step, the total number of id's sent in all messages incurred in that step is $p$. Then, as stated in Corollary 1.2 below, the total number of id's sent in all messages incurred by algorithm G is $p(p-1)$, since there are $p - 1$ internal nodes in a balanced binary partitioning tree. This agrees with the fact that algorithm G is a NODUP scheme.

Corollary 1.2: In a system of $p$ nodes with one-port communication, the total number of id's sent in all messages incurred by algorithm G is $p(p-1)$.

Next, we have the following theorem, which states that algorithm G is optimal in terms of the number of messages required for all-to-all broadcast in the minimal number of steps.

Theorem 2: For a system of $p$ nodes with one-port communication, the minimal number of messages required for all-to-all NODUP broadcast in $n$ steps is $np + p - 2^n$, where $n = \lceil \log_2 p \rceil$.
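Before the proof, the expression $np + p - 2^n$ can be checked numerically for small $p$. The sketch below compares the closed form with the sum of the internal labels of the balanced binary partitioning tree (the count used in the proof of Theorem 1) and with the internal-label sum of a Huffman tree built from $p$ ones (the construction used in the proof of Theorem 2). All function names are illustrative, and the check is a sanity test rather than a substitute for the proofs.

```python
import heapq


def closed_form(p):
    """np + p - 2^n with n = ceil(log2 p)."""
    n = (p - 1).bit_length()
    return n * p + p - 2 ** n


def balanced_internal_sum(k):
    """Sum of the internal labels of the balanced binary partitioning tree of k."""
    if k < 2:
        return 0  # a leaf contributes nothing
    return k + balanced_internal_sum((k + 1) // 2) + balanced_internal_sum(k // 2)


def huffman_internal_sum(p):
    """Sum of the internal labels of a Huffman tree built from p ones."""
    heap = [1] * p
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Repeatedly merge the two smallest numbers and record their sum.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


for p in (5, 6, 8):
    print(p, closed_form(p), balanced_internal_sum(p), huffman_internal_sum(p))
```

For p = 5, 6 and 8 all three quantities agree, giving 12, 16 and 24 messages, the counts reported for Figures 2, 7 and 6, respectively.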

Proof: From the facts that the schemes are without duplicate information and that every node sends all the id's it has thus far to its receiver, it follows that optimal all-to-all broadcast schemes can be described by the generation of minimal complete sets resulting from the process of broadcast. Such a generation of minimal complete sets can be denoted by a binary partitioning tree. As pointed out in the proof of Theorem 1, the number labeled in each internal node of the partitioning tree is the number of nodes in the corresponding minimal complete set. Also, the number of nodes in a minimal complete set, say $h$, is the number of messages to be sent in the next step so that the broadcast can be completed by the nodes within the set in $\lceil \log_2 h \rceil$ steps. It in turn follows that the sum of the numbers labeled in the internal nodes of the partitioning tree is the total number of messages required for the all-to-all broadcast. Then, the problem of determining the minimal number of messages required in an all-to-all broadcast scheme can be transformed to the one of determining the corresponding binary partitioning tree for which the sum of the numbers labeled in the internal nodes is minimal. Note that such a binary tree can be constructed by the Huffman algorithm [18], which starts with $p$ ones, and then repeatedly adds the two smallest numbers together and uses their sum to replace the two numbers. The binary tree produced by the Huffman algorithm is called the Huffman tree. An example of the Huffman tree for 6 ones is given in Figure 8. It has been proved that the sum of the numbers labeled in the internal nodes of the Huffman tree is minimal among all the binary trees with the same set of leaf nodes. Also, all the leaf nodes in a Huffman tree, labeled by ones, must be in either level $\lceil \log_2 p \rceil - 1$ or level $\lceil \log_2 p \rceil$ [18], implying that the sum of the numbers labeled in internal nodes is $np + p - 2^n$, where $n = \lceil \log_2 p \rceil$. This theorem follows. Q.E.D.

Note that the formula in Theorem 2 agrees with the lower bound of message complexity, $O(p \log_2 p)$, derived in [26] where, however, the minimal number of messages required was not determined. Theorem 1 and Theorem 2 lead to the following corollary.

Corollary 2.1: For a system of $p$ nodes with one-port communication, algorithm G requires the minimal number of messages, $np + p - 2^n$, to complete an all-to-all broadcast in $n$ steps, where $n = \lceil \log_2 p \rceil$.

4 Optimal All-To-All Broadcast for k-Port Communication

As presented in algorithm G, the balanced binary partitioning tree can be used to describe optimal all-to-all broadcast for one-port communication. In fact, our scheme, based on the partitioning tree and the generation of minimal complete sets, can be extended to the case of k-port communication, meaning that each node is capable of sending k messages at a time. As can be seen below, the extension to k-port communication can be described in light of the generalized n-dimensional m-ary hypercube [2], where $m$ is chosen to be $k + 1$. Using the product operation in Definition 1, a generalized n-dimensional m-ary hypercube, denoted by $H_n^m$, can be defined as follows.

Definition 5: An n-dimensional m-ary hypercube $H_n^m$ is defined recursively as follows.
(i) $H_0^m$ is a trivial graph with one node, and
(ii) $H_n^m = K_m \times H_{n-1}^m$, where $K_m$ is the complete graph with $m$ nodes.

[Figure 8 graphic: Huffman tree over six leaves labeled 1, with internal labels 6, 4, 2, 2 and 2, drawn on levels 0 to 3.]
Figure 8: The Huffman tree constructed by 6 1's.

An example of $H_2^3$ can be found in Figure 9, where the edges in dimension 2 and dimension 1 are drawn in Figure 9a and Figure 9b, respectively. It can be verified that Definition 1 is a special case of Definition 5 when $m = 2$, and in fact $Q_n = H_n^2$. In light of the generalized hypercubes, the scheme in Section 3 can be extended to the case of multi-port communication by modifying the partitioning tree and the addressing scheme accordingly. For example, for the case of 2-port communication in a system of 9 nodes, which corresponds to the case of $H_2^3$, we have the 3-ary partitioning tree shown in Figure 10, where Step 2 of the addressing scheme A1 is modified as below.

Step 2′: For every internal node, code the edge to its left child with a bit "0", that to its center child with a bit "1", and that to its right child with a bit "2".

From the same reasoning as in the proof of Lemma 1, it follows that all-to-all broadcast for 2-port communication can be developed from the above partitioning tree in such a way that each internal node in level $i$ of the tree is taken as a minimal complete set generated after step $i$, and the minimal complete set associated with an internal node is partitioned into those associated with its child nodes in one step. For example, for a system of 9 nodes with 2-port communication, the operations of an all-to-all broadcast are shown in Figure 9. It can be verified by Figure 9 and Figure 10 that after Step 1, all 9 nodes in the system (addressed by ** in Figure 10) are partitioned into three minimal
complete sets, formed by nodes in 0*, 1* and 2*, respectively. Clearly, the above scheme based on the generation of minimal complete sets in the corresponding partitioning tree can be generalized to the case of k-port communication. An algorithm to build the corresponding optimal partitioning tree will be presented later.

[Figure 9 graphic: two panels, (a) step 1 and (b) step 2, each over the nine nodes 00, 01, 02, 10, 11, 12, 20, 21 and 22.]
Figure 9: The all-to-all broadcast scheme for 2-port communication when p = 9.

[Figure 10 graphic: 3-ary tree with root 9 addressed ** (level 0), three nodes labeled 3 addressed 0*, 1* and 2* (level 1), and nine leaves labeled 1 (level 2); the edges to the children are coded 0, 1 and 2.]
Figure 10: The partitioning tree and addressing scheme for 2-port communication when p = 9.

Let $N(p, k)$ be the number of message steps required by our scheme for all-to-all broadcast in a system of $p$ nodes with k-port communication. It can be observed that the recursion $N(p, k) = 1 + N(\lceil p/(k+1) \rceil, k)$ holds, where $N(a, b) = 1$ if $2 \le a \le b + 1$ (a set of at most $b + 1$ nodes completes the broadcast in one step with b-port communication), leading to $N(p, k) = \lceil \log_{k+1} p \rceil$. Note that it takes at least $\lceil \log_{k+1} p \rceil$ steps for one-to-all broadcast in a system of $p$ nodes with k-port communication. This fact and the existence of our scheme lead to the following proposition, which was also proved by a different approach in [15], where no attempt was made to minimize the number of messages.

Proposition 2: In a system of $p$ nodes with k-port communication, the minimal number of steps required for all-to-all broadcast is $\lceil \log_{k+1} p \rceil$.

It can be seen that the height of the corresponding partitioning tree for an all-to-all broadcast determines the number of message steps required. Thus, to complete the broadcast in the minimal number of steps, the partitioning tree must have the minimal height. Note that for a system of $p$ nodes with k-port communication, there can be different partitioning trees with the same minimal height, i.e., $\lceil \log_{k+1} p \rceil$, whereas the corresponding numbers of messages incurred may differ from one to another. Recall that the total number of messages sent in algorithm G is the sum of the numbers labeled in the internal nodes of the balanced binary partitioning tree. Call the number of child nodes of an internal node $z$ the degree of $z$, denoted by $d_s(z)$, and denote the number labeled in $z$ by $w(z)$.
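The step count $N(p, k)$ discussed above can be checked against $\lceil \log_{k+1} p \rceil$ with a few lines of Python. The sketch assumes the base case $N(a, k) = 1$ for $2 \le a \le k + 1$, on the grounds that a set of at most $k + 1$ nodes can finish in one step when every node sends to the other (at most $k$) nodes; the function names are illustrative.

```python
def steps_by_recursion(p, k):
    """N(p, k) = 1 + N(ceil(p / (k + 1)), k), with N(a, k) = 1 for 2 <= a <= k + 1."""
    if p < 2 or k < 1:
        raise ValueError("need p >= 2 nodes and k >= 1 ports")
    if p <= k + 1:
        return 1  # assumed base case: one step suffices for at most k + 1 nodes
    return 1 + steps_by_recursion(-(-p // (k + 1)), k)  # -(-a // b) is ceil(a / b)


def ceil_log(p, base):
    """ceil(log_base p) for integers p >= 1 and base >= 2, without floating point."""
    n, reach = 0, 1
    while reach < p:
        reach *= base
        n += 1
    return n


for p, k in [(6, 1), (9, 2), (34, 3)]:
    print(p, k, steps_by_recursion(p, k), ceil_log(p, k + 1))
```

Integer arithmetic is used on purpose: a floating-point logarithm can round incorrectly when $p$ is an exact power of $k + 1$.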
Then, we have the lemma below, which follows from the fact that for a minimal complete set to be partitioned into $r$ minimal complete sets in one step, each node in the original minimal complete set has to send out $r - 1$ messages in that step.

Lemma 2: The number of messages incurred in an all-to-all broadcast scheme is $\sum_{z \in V_I} w(z)\,(d_s(z) - 1)$, where $V_I$ is the set of internal nodes in the corresponding partitioning tree.

For example, the number of messages required for the broadcast corresponding to Figure 4 is $6 \cdot 1 + 3 \cdot 1 + 3 \cdot 1 + 2 \cdot 1 + 2 \cdot 1 = 16$, which agrees with Figure 7, and the message number required for the broadcast corresponding to Figure 10 is $9 \cdot 2 + 3 \cdot 2 \cdot 3 = 36$, agreeing with Figure 9. Figure 11 shows two partitioning trees for $p = 34$ and $k = 3$, which have the same height but will incur different numbers of messages. From Lemma 2, it can be verified that the number of messages associated with the tree in Figure 11a is 241 while that with the tree in Figure 11b is 232. It can also be seen that in Figure 11a, while the two subtrees under the two 9's incur different numbers of messages ($9 \cdot 2 + 3 \cdot 2 \cdot 3 \ne 9 \cdot 3 + 2 \cdot 3 + 3 \cdot 2$), the two subtrees under the two 8's, in spite of their different structures, do incur the same number of messages ($8 \cdot 3 + 2 \cdot 4 = 8 + 4 \cdot 3 \cdot 2$). Then, the problem of determining the minimal number of messages required in an all-to-all broadcast scheme
can be transformed to the one of determining the corresponding partitioning tree for which the sum given by Lemma 2 is minimal. The following theorem solves this problem.

Theorem 3: For a system of p nodes with k-port communication, the minimal number of messages required for an all-to-all NODUP broadcast in $n = \lceil \log_{k+1} p \rceil$ steps is

$$M(p, k) = (d-2)\, n_1 p + (d-1)\,[\, n_2 p + p - (d-1)^{n_1} d^{n_2} \,],$$

where $n_1 + n_2 = n = \lceil \log_{k+1} p \rceil$, and d is the smallest positive integer such that $p \le (d-1)^{n_1} d^{n_2}$ and $p > (d-1)^{n_1+1} d^{n_2-1}$.

Proof: As in the case of one-port communication, the optimal scheme can be described by a partitioning tree since there is no duplicate information in each transmission. Also, we learn from Proposition 2 that such an optimal tree is of height $\lceil \log_{k+1} p \rceil$. It is easy to see from Lemma 2 that in an optimal partitioning tree, each subtree, say under node z, is the optimal partitioning tree for a system of $w(z)$ nodes. We shall prove that an optimal tree, denoted by T(p,k), possesses the following two properties: (A) All leaf nodes in T(p,k) are in the same lowest level, i.e., level $\lceil \log_{k+1} p \rceil$, except in the case that T(p,k) is a binary tree, and (B) The set of distinct degrees of all internal nodes in T(p,k), denoted by $D_S$, contains either a single number or two consecutive numbers. Property (B) means that for any two internal nodes in T(p,k), their degrees must either be the same or differ by one. From the above two properties, M(p,k) in Theorem 3 can then be derived by applying some algebraic operations.

We shall prove Property (A) by showing that if there is a leaf node y not in the lowest level of an optimal tree, then the tree must be binary and all the internal nodes in the same level as y are labeled with 2's, agreeing with Theorem 2. The fact that, except for the binary trees, all leaf nodes of an optimal tree must be in the same lowest level thus follows. Call two nodes siblings of each other if they have the same parent node.
Suppose there is a leaf node y which is not in the lowest level. Then, we know that for T(p,k) to be optimal, the number labeled in any sibling of y cannot be greater than 2, since if y has a non-leaf sibling, say $y_A$, with $w(y_A) > 2$, then the message number can be reduced by moving one unit from $y_A$ to y, i.e., $w(y_A)$ is reduced by one and y becomes an internal node with $w(y) = 2$. For illustrative purposes, example subtrees are given in Figure 12, where the message number associated with the subtree in (b) is less than that in (a). Denote the parent node of y as x. It can be observed that the degree of x must be two, since if $d_s(x)$ is greater than two, then from the fact that the degrees of all siblings of y are at most two, the message number can be reduced by rearranging the subtree under x in such a way that $d_s(x)$ becomes 2. From $d_s(x) = 2$, it in turn follows that all the internal nodes which are in the same level and under the same grandparent as y must be labeled with 2's, since if, among them, there is an internal node $y_B$ with $w(y_B) > 2$, then the message number can be reduced by moving one unit from $y_B$ to y. By the same reasoning, it can be seen that the degrees of all the internal nodes under the grandparent of y must be 2, showing by induction that the tree is binary. Property (A) thus follows.

[Figure 11: The partitioning trees when p = 34 and k = 3. (a) One partitioning tree, whose root 34 has children labeled 8, 9, 8 and 9. (b) The optimal partitioning tree, whose root 34 has children labeled 11, 12 and 11.]

[Figure 12: Example subtrees, (a) and (b), to illustrate Property (A).]

To prove Property (B), we shall investigate the degrees of internal nodes in an optimal tree T(p,k) in a bottom-up manner. Recall that the root of the tree is in level 0. Without loss of generality, we start by investigating a set of nodes in level $\lceil \log_{k+1} p \rceil - 1$ which are under the same parent node, say x. If there are any two nodes $z_1$ and $z_2$ under x with $w(z_1) \ge w(z_2) + 2$, it can be verified that such a tree is not optimal since the message number can be reduced by moving one unit from $z_1$ to $z_2$. Note that $(w(z_1)-1)(w(z_1)-2) + (w(z_2)+1)\,w(z_2) < w(z_1)(w(z_1)-1) + w(z_2)(w(z_2)-1)$ for $w(z_1) \ge w(z_2) + 2$, where $w(z_1) = d_s(z_1)$ and $w(z_2) = d_s(z_2)$ since $z_1$ and $z_2$ are parent nodes of leaf nodes. Then, let d and $d - 1$ be the two possible degrees for any internal node under x. We claim that $d_s(x) \in \{d, d-1\}$. First, if $d_s(x) \ge d + 1$, then we can select a child node of x, say z, with $d_s(z) = d$, and detach z from x by distributing the d child nodes of z, one-to-one, among d siblings of z, so that each of the d selected siblings of z adopts one child from z. Clearly, since $d_s(x) \ge d + 1$, z must have at least d siblings. The numbers labeled in the nodes affected are modified accordingly. It can be verified by Lemma 2 that such a movement will reduce the message number, contradicting the optimality of T(p,k) and implying that $d_s(x) \le d$. Next, if $d_s(x) \le d - 2$, then we can take one child node out of each of the $d_s(x)$ existing child nodes of x, attach them under a new node, and arrange that node as a new child node of x. The degree $d_s(x)$ is thus increased by one. It can be verified that such a movement will also reduce the message number

determined by Lemma 2, and the claim that $d_s(x) \in \{d-1, d\}$ thus follows. Let $D_s(x)$ be the set of distinct degrees for all internal nodes under x. We have thus proved that $D_s(x) \subseteq \{d-1, d\}$. Note that the technique used above to form and decompose a subtree can also be applied to the internal nodes which are under the same grandparent. Similarly, we can obtain that for each sibling of x, say h, $D_s(h) \subseteq \{d-1, d\}$, and then $D_s(v) \subseteq \{d-1, d\}$ where v is the parent node of x, proving that $D_S \subseteq \{d-1, d\}$ by induction. Property (B) thus follows.

From Properties (A) and (B), it can be seen that to generate p leaf nodes in level $\lceil \log_{k+1} p \rceil$ from the number p and minimize the message number, d has to be the smallest positive integer such that $p \le (d-1)^{n_1} d^{n_2}$ and $p > (d-1)^{n_1+1} d^{n_2-1}$, where $n_1 + n_2 = n = \lceil \log_{k+1} p \rceil$. From Lemma 2, it follows that an optimal partitioning tree can be constructed by first having $n_1$ levels of internal nodes with degree $d - 1$ (i.e., from level 0 to level $n_1 - 1$), and then $n_2 - 1$ levels of internal nodes with degree d (i.e., from level $n_1$ to level $n - 2$), followed by, in level $n - 1$, $(d-1)^{n_1} d^{n_2} - p$ internal nodes with degree $d - 1$ and $p - (d-1)^{n_1+1} d^{n_2-1}$ internal nodes with degree d. Then, we get

$$M(p,k) = n_1 p (d-2) + (n_2 - 1) p (d-1) + [(d-1)^{n_1} d^{n_2} - p](d-1)(d-2) + [p - (d-1)^{n_1+1} d^{n_2-1}]\, d (d-1) = (d-2)\, n_1 p + (d-1)[\, n_2 p + p - (d-1)^{n_1} d^{n_2} \,],$$

thus proving Theorem 3.
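The last simplification can be spelled out as follows. This is only a sketch of the algebra; the shorthand A and B for the two products is introduced here for readability and does not appear in the original derivation.

```latex
\begin{align*}
A &= (d-1)^{n_1} d^{n_2}, \qquad B = (d-1)^{n_1+1} d^{n_2-1}, \qquad \text{so that } dB = (d-1)A. \\[4pt]
(A - p)(d-1)(d-2) + (p - B)\,d\,(d-1)
  &= (d-1)\bigl[(d-2)A - dB + 2p\bigr] \\
  &= (d-1)\bigl[(d-2)A - (d-1)A + 2p\bigr] \\
  &= (d-1)(2p - A). \\[4pt]
M(p,k) &= n_1 p (d-2) + (n_2 - 1)\,p\,(d-1) + (d-1)(2p - A) \\
  &= (d-2)\,n_1 p + (d-1)\bigl[n_2 p + p - (d-1)^{n_1} d^{n_2}\bigr].
\end{align*}
```

The first two terms of M(p,k) use the fact that, when all leaves lie in the lowest level, the labels w(z) of the internal nodes in any one level sum to p.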
Q.E.D.

Therefore, to determine the minimal number of messages required for an all-to-all broadcast in a system of p nodes with k-port communication, the corresponding optimal partitioning tree can be obtained as follows, where $n_1 + n_2 = n = \lceil \log_{k+1} p \rceil$, and d is the smallest positive integer such that $p \le (d-1)^{n_1} d^{n_2}$ and $p > (d-1)^{n_1+1} d^{n_2-1}$.

Algorithm to build the optimal partitioning tree

Step 1: Build a tree from level 0 to level $n_1 - 1$ in such a way that each node is an internal node and has degree $d - 1$.
Step 2: In the next $n_2 - 1$ levels (i.e., from level $n_1$ to level $n - 2$), let each internal node have degree d.
Step 3: In the last level of internal nodes (i.e., level $n - 1$), let $(d-1)^{n_1} d^{n_2} - p$ internal nodes have degree $d - 1$, and $p - (d-1)^{n_1+1} d^{n_2-1}$ internal nodes have degree d.
Step 4: In level n, attach leaf nodes to the internal nodes in level $n - 1$, according to the degree of each internal node in that level.
Step 5: Label each leaf node with one, and determine the number labeled in each internal node in the tree bottom up, such that the number labeled in each node is the sum of those labeled in its child nodes.

For example, consider all-to-all broadcast in a system of 34 nodes with 3-port communication. Then, we have p = 34, k = 3 and $n_1 + n_2 = n = 3$, leading to d = 4, $n_1 = 2$ and $n_2 = 1$. We

can obtain the optimal partitioning tree in Figure 11b by the algorithm above. It follows from Theorem 3 that M(34,3) = 232, meaning that the number of messages required by the partitioning tree in Figure 11b is in fact the minimal one needed to complete the broadcast in 3 steps. Note that from an optimal partitioning tree obtained by Theorem 3, we can determine the address of each leaf node in the tree in light of Hdn. As in Section 3, the addresses of leaf nodes are then mapped one-to-one onto the nodes in the system. Using this addressing scheme, each node in the system can determine its message receivers in each communication step in such a way that all-to-all broadcast follows the generation of minimal complete sets in the partitioning tree.

It is interesting to see that to achieve the minimal number of messages in the minimal number of steps, it is not always necessary to use the maximal number of communication ports allowed. This is the very reason that d determined in Theorem 3 is not necessarily equal to k+1. For the example of p = 20 and k = 3, the minimal number of steps is $\lceil \log_4 20 \rceil = 3$. However, from Theorem 3 we have d = 3, meaning that to minimize the message number in 3 communication steps, each node uses at most 2 communication ports in every step during the execution of the optimal scheme. Also, it can be verified that Theorem 2 is in fact a special case of Theorem 3. For the case of one-port communication, we have d = 2 and then $M(p, 1) = n_2 p + p - 2^{n_2}$ where $n_2 = \lceil \log_2 p \rceil$, agreeing with Theorem 2. To the best of our knowledge, the minimal number of messages derived in Theorem 3, together with its special case in Theorem 2, was previously unknown, and is first determined in this study.

It can be seen that the all-to-all broadcast scheme with k-port communication introduced in this section is a NODUP scheme.
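As a numerical companion to Theorem 3, Lemma 2 and the tree-building algorithm, the sketch below computes the parameters (n, d, n1, n2), evaluates the closed form M(p,k), builds a tree following Steps 1 to 5, and sums its messages with Lemma 2. It is an illustration under stated assumptions rather than the authors' implementation: all function names and the nested-list tree encoding (a leaf is the integer 1) are hypothetical, d is found as the smallest integer with d^n >= p (one reading of the two inequalities in the theorem), and the tree of Figure 11a is reconstructed from the arithmetic quoted in the text, so the order of its children is a guess.

```python
from typing import List, Tuple, Union

Tree = Union[int, List["Tree"]]  # a leaf is the integer 1; an internal node lists its children


def min_steps(p: int, k: int) -> int:
    """Smallest n with (k + 1) ** n >= p, i.e. ceil(log_{k+1} p)."""
    n, reach = 0, 1
    while reach < p:
        reach *= k + 1
        n += 1
    return n


def theorem3_parameters(p: int, k: int) -> Tuple[int, int, int, int]:
    """Return (n, d, n1, n2) satisfying the two inequalities of Theorem 3."""
    if p < 2 or k < 1:
        raise ValueError("this sketch assumes p >= 2 and k >= 1")
    n = min_steps(p, k)
    d = 2
    while d ** n < p:  # smallest d for which some split n1 + n2 = n can reach p leaves
        d += 1
    for n2 in range(1, n + 1):  # the smallest such n2 also gives p > (d-1)^(n1+1) * d^(n2-1)
        n1 = n - n2
        if (d - 1) ** n1 * d ** n2 >= p:
            return n, d, n1, n2
    raise AssertionError("unreachable: n2 = n always satisfies the bound")


def closed_form_messages(p: int, k: int) -> int:
    """M(p, k) as stated in Theorem 3."""
    n, d, n1, n2 = theorem3_parameters(p, k)
    return (d - 2) * n1 * p + (d - 1) * (n2 * p + p - (d - 1) ** n1 * d ** n2)


def build_optimal_tree(p: int, k: int) -> Tree:
    """Follow Steps 1-5 of the algorithm, assembling the tree from the bottom up."""
    n, d, n1, n2 = theorem3_parameters(p, k)
    a = (d - 1) ** n1 * d ** n2
    b = (d - 1) ** (n1 + 1) * d ** (n2 - 1)
    # Steps 3 and 4: internal nodes of level n-1 with their leaves already attached.
    level: list = [[1] * (d - 1) for _ in range(a - p)] + [[1] * d for _ in range(p - b)]
    # Steps 2 and 1: group nodes under parents of degree d, then of degree d-1.
    for lev in range(n - 2, -1, -1):
        deg = d if lev >= n1 else d - 1
        level = [level[i:i + deg] for i in range(0, len(level), deg)]
    return level[0]


def weight(tree: Tree) -> int:
    """w(z): Step 5 labels each node with the number of leaves below it."""
    return 1 if isinstance(tree, int) else sum(weight(c) for c in tree)


def lemma2_messages(tree: Tree) -> int:
    """Sum of w(z) * (d_s(z) - 1) over all internal nodes z (Lemma 2)."""
    if isinstance(tree, int):
        return 0
    return weight(tree) * (len(tree) - 1) + sum(lemma2_messages(c) for c in tree)


# Figure 11a, reconstructed from the sums quoted in the text (child order is assumed).
fig_11a: Tree = [
    [[1, 1]] * 4,                          # 8 -> 2, 2, 2, 2
    [[1, 1, 1]] * 3,                       # 9 -> 3, 3, 3
    [[1] * 4, [1] * 4],                    # 8 -> 4, 4
    [[1, 1], [1, 1], [1, 1], [1, 1, 1]],   # 9 -> 2, 2, 2, 3
]

assert theorem3_parameters(34, 3) == (3, 4, 2, 1)
assert closed_form_messages(34, 3) == lemma2_messages(build_optimal_tree(34, 3)) == 232
assert lemma2_messages(fig_11a) == 241
```

For p = 34 and k = 3 the sketch groups the level-2 nodes left to right, so the children of its root are labeled 10, 12 and 12 rather than 11, 12 and 11 as in Figure 11b; the message count is the same because every internal level still sums to p. None of this changes the NODUP character of the scheme noted above: the sketch counts messages only, not the id's they carry.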
Specifically, we have the following corollary.

Corollary 3.1: For a system of p nodes with k-port communication, the total number of id's carried by all messages incurred in our all-to-all broadcast scheme is $p(p-1)$, which is the minimum required for all-to-all broadcast schemes.

Proof: Note that for a partitioning tree of p, $\sum_{z \in V_I} d_s(z) = |V_I| + p - 1$, where $V_I$ is the set of internal nodes and $d_s(z)$ is the degree of node z in the tree. Then, we have $\sum_{z \in V_I} (d_s(z) - 1) = p - 1$. Also, for a minimal complete set associated with node z to be partitioned into $d_s(z)$ minimal complete sets in one step, $p(d_s(z) - 1)$ id's have to be sent in all messages incurred in that step. The fact that the total number of id's sent in all messages incurred in our all-to-all broadcast scheme is $p(p-1)$ thus follows. It can be seen that in any all-to-all broadcast scheme, every node needs to receive $p - 1$ id's in a system of p nodes, proving this corollary. Q.E.D.

5 Remarks

It is worth mentioning that, similarly to other optimization problems whose solutions closely depend on the models assumed, the optimal schemes derived in this paper are results from the model M

described in Section 2. Specifically, the numbers of messages in the proposed schemes are proved minimal among all NODUP schemes. Clearly, without being restricted to NODUP schemes, at the cost of having more id's transmitted in the broadcast, one may further reduce the number of messages required. It is noted that we exclude neither the possibility of two-way transmission between two nodes nor the capability of each node to participate in both sending and receiving messages in one communication step, thus distinguishing our work in this paper from the one in [23]. Also, we do not assume that nodes in the system will be faulty or maliciously send wrong messages to others. Analyzing and improving the fault tolerance of these schemes is an important, but not fully explored, issue. In addition, the schemes proposed in this paper are developed under the assumption that the system is completely connected and every message takes one communication step. While their variations could provide fair performance for hypercube multicomputers, these schemes are not designed for all system interconnections. Certainly, assumptions that the system has a predetermined topology for its interconnection and that every message may take more than one communication step are reasonable for some computing environments, and will lead to very different solutions. Last but not least, there is no restriction on the number of messages each node can receive in one step in our model. Imposing a constraint on the number of messages one node can receive in one step is an interesting direction and will be a matter of our future study.

6 Conclusion

In this paper, we developed optimal all-to-all broadcast schemes for a distributed processing system of an arbitrary number of nodes. The emphasis was on how to complete the broadcast in the system with not only the minimal number of message steps but also the minimal number of messages.
The optimal all-to-all broadcast scheme for the case of one-port communication was first developed. The concept of the partitioning tree of a positive number was introduced to address the nodes in the system. Under this addressing scheme, optimal all-to-all broadcast can be systematically executed based on the generation of minimal complete sets, and completed in $\lceil \log_2 p \rceil$ steps for a distributed system of p nodes. It was proved that the number of messages incurred by the proposed scheme, $np + p - 2^n$, is the minimal number of messages required for all-to-all NODUP broadcast with one-port communication in n steps, where $n = \lceil \log_2 p \rceil$. Moreover, we extended our results to the case of k-port communication. The minimal number of messages required to complete all-to-all NODUP broadcast in the minimal number of steps, i.e., $\lceil \log_{k+1} p \rceil$ steps, was derived in Theorem 3. An algorithm to build the optimal partitioning tree was also presented. Note that we not only derived the theoretically minimal bounds for the numbers of steps and messages required, but also devised effective schemes to achieve them.

ACKNOWLEDGEMENT

The authors would like to thank J. Chen at IBM for her comments and assistance on improving the presentation of this paper.

References

[1] W. C. Athas and C. L. Seitz. Multicomputers: Message-Passing Concurrent Computers. IEEE Computer Mag., 21:9-24, August 1988.
[2] L. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hyperbus Structures for a Computer Network. IEEE Transactions on Computers, C-33(4):323-333, April 1984.
[3] M.-S. Chen, K.-L. Wu, and P. S. Yu. Efficient Decentralized Consensus Protocols in a Distributed Computing System. Proceedings of the 12th International Conference on Distributed Computing Systems, pages 426-433, June 1992.
[4] Intel Corporation. iPSC/2 User's Guide. Intel Corporation, March 1988.
[5] S. B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in Partitioned Networks. ACM Computing Surveys, 17(3):341-370, September 1985.
[6] R. Dechter and L. Kleinrock. Broadcast Communications and Distributed Algorithms. IEEE Transactions on Computers, C-35(3):210-219, March 1986.
[7] P. J. Denning. Parallel Computing and its Evolution. Comm. of ACM, 29:1163-1167, December 1986.
[8] A. M. Farley. Minimal Broadcast Networks. NETWORKS, 9:313-332, 1979.
[9] J. Halpern and Y. Moses. Knowledge and Common Knowledge in a Distributed Environment. Journal of ACM, 37(3):549-587, July 1990.
[10] F. Harary. Graph Theory. Addison-Wesley, MA, 1969.
[11] S. M. Hedetniemi, S. T. Hedetniemi, and A. Liestman. A Survey of Broadcasting and Gossiping in Communication Networks. NETWORKS, 18:319-351, 1988.
[12] S. L. Johnsson and C. T. Ho. Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE Transactions on Computers, C-38(9):1249-1268, September 1989.
[13] P. Kermani and L. Kleinrock. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks, 3:267-286, 1979.
[14] T. V. Lakshman and A. K. Agrawala. Efficient Decentralized Consensus Protocols. IEEE Transactions on Software Engineering, SE-12(5):600-607, May 1986.
[15] H. G. Landau. The Distribution of Completion Times for Random Communication in a Task Oriented Group. Bull. Math. Biophys., pages 187-201, 1954.
[16] S. Levitan. Algorithms for Broadcast Protocol Multiprocessor. Distributed Computing Systems, pages 666-671, 1982.
[17] D. A. Reed and D. C. Grunwald. The Performance of Multicomputer Interconnection Networks. IEEE Computer Mag., 20:63-73, June 1987.
[18] K. A. Ross and C. R. B. Wright. Discrete Mathematics. Prentice-Hall, NJ, 1985.
[19] C. L. Seitz. The Cosmic Cube. Comm. of ACM, 28:22-33, January 1985.
[20] A. Seress. Quick Gossiping without Duplicate Transmissions. Graphs and Combinatorics, 2:363-383, 1986.
[21] K. G. Shin. HARTS: A Distributed Real-Time Architecture. IEEE Computer, pages 25-35, May 1991.
[22] L. G. Valiant. A Scheme for Fast Parallel Communication. SIAM, Journal on Computing, 11(2):350-361, May 1982.
[23] K. N. Venkataraman, G. Cybenko, and D. W. Krumme. Simultaneous Broadcasting in Multiprocessor Networks. Proceedings of the International Conference on Parallel Processing, pages 555-558, 1986.
[24] D. B. West. Gossiping without Duplicate Transmission. SIAM, Journal on Alg. Disc. Meth., (3):418-419, 1982.
[25] C.-B. Yang, R. C. T. Lee, and W.-T. Chen. Parallel Graph Algorithms Based upon Broadcasting Communications. IEEE Transactions on Computers, 39(12):1468-1472, December 1990.
[26] S.-M. Yuan and A. K. Agrawala. A Class of Optimal Decentralized Commit Protocols. Proc. of 8th Int. Conference on Distributed Computing Systems, pages 234-241, 1988.
