IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 2, NO. 4, OCTOBER 1991 385

Design and Analysis of Even-Sized Binary Shuffle-Exchange Networks for Multiprocessors

Krishnan Padmanabhan, Senior Member, IEEE

Abstract: The architecture and performance of binary shuffle-exchange networks of any even size are investigated. It is established that a network with $\lceil \log_2 N \rceil$ shuffle-exchange stages or a single recirculating stage can provide the connectivity between N inputs and N outputs using a distributed tag-based control algorithm. Control tags depend on both source and destination when N is not a power of two, and can be computed in a simple manner. Several structural and dynamic properties of the network are established, contrasting the behavior of the power-of-two and composite sized systems. The performance of the network in a stochastic environment is also investigated analytically. It is seen that the shuffle-exchange networks behave in much the same way with respect to traffic and buffer capacity, whether or not the system size is a power of two.

Index Terms: Interconnection networks, multiprocessors, multistage networks, Omega network, perfect shuffle, shuffle-exchange.

I. INTRODUCTION

The shuffle-exchange network has been extensively investigated in the last two decades as an architecture to interconnect processing elements (and memory modules) together in multiprocessor systems. Even prior to this, it has been used as a building block in telephone switching networks, and more recently single stage shuffle-exchange networks have been considered for local area interconnections. However, one problem that does not appear to have received serious attention in the published literature is the possibility of constructing binary shuffle-exchange networks to interconnect a composite number (i.e., $\neq 2^n$) of nodes. This problem is addressed in this paper. In many instances, primarily for cost and modularity reasons, it is desirable to build the network out of binary switching elements. Designers in these situations, particularly in the early phases of system evolution, are forced to think in terms of the nearest power-of-two sizes. It is the goal of this paper to investigate the feasibility and practical utility of the binary shuffle-exchange architecture for systems of any even size.

In the published literature, Pease [17] appears to be the first to have used the perfect shuffle connection in a parallel computation, viz., the fast Fourier transform of $2^n$ data points. The FFT algorithm itself (or the "prime factor algorithm") is applicable even when N is composite and not a power of two. A generalization of Pease's algorithm to this case results in nonbinary switching (or computational) elements and nonbinary shuffles. Stone [18] went on to show that the perfect shuffle can be used for a variety of other parallel computations (again of $2^n$ data elements), including polynomial evaluation, sorting, and matrix transposition. Lawrie [12] generalized this work further by characterizing the connection capability of the shuffle-exchange structure (i.e., ignoring the computations that are performed in each stage) and showing that many of the data alignment functions required in parallel processing could be performed by such a network, which he termed the "Omega network." These networks were defined for composite sizes also, although a later paper [13] discusses networks of size $2^n$ only.

Manuscript received April 2, 1990; revised January 25, 1991. The author is with AT&T Bell Laboratories, Murray Hill, NJ 07974. IEEE Log Number 9102438.

Since then substantial research has been done on the design and analysis of shuffle-exchange type networks, particularly on their application to MIMD type systems [1], [8], [11], [16], [20]. In such applications, synchronized alignment between inputs and outputs is replaced by fast stochastic access to the outputs as the primary requirement. However, in all of this work, so far as we are aware, the networks considered are either 1) for power-of-two sized systems, and use a single stage or multiple stages of the binary shuffle-exchange structure, or 2) for composite sized systems, and use switching elements corresponding to the prime (or some other) factorization of N to build the network. For composite system sizes, the architect is left with the second option above or else a binary shuffle-exchange network of size equal to the closest power of two. (As has been observed in [13], in the latter case, some unnecessary switching elements and interconnections can be deleted, but this could still leave the network with substantial asymmetry and redundancy.)

It may be desirable to build the network out of exchange elements (2 x 2 switches) instead of larger switching elements for several reasons, and cost is just one of them. (Ease of control and the ability to construct any network out of a single component are others. In addition, in certain emerging technologies like guided wave photonics, a binary switching element is the only one currently available.) Depending on how the switches are fabricated, an Omega network using the prime factorization technique could be more expensive to build than the shuffle-exchange network outlined in this paper. This is likely to be so if the system size has large prime factors (> 5), since this would require the use of bigger crossbar switches. It is difficult to quantify this cost measure in a way that is both analytically illuminating and technologically realistic, and we will not pursue it here.

In this paper we propose the binary shuffle-exchange architecture for interconnecting any even number of inputs and outputs and demonstrate that the network retains most of its structural and control properties even when N is not a power of two. The first and major step toward establishing this is a constructive proof of a distributed tag-based control algorithm for any input in the network to connect to any output. The network in this case consists of $\lceil \log_2 N \rceil$ shuffle-exchange stages and each stage is controlled by a single bit in the tag. (Single stage recirculation networks are also discussed.) We then investigate some of the structural properties of this network and point out the difference between the networks for composite and power-of-two sizes. Several architectural issues are also investigated, the results of which show that the network is as feasible and nearly as easy to use as when N is a power of two. The last part of the paper analyzes the stochastic performance of these networks. This analysis shows that the shuffle-exchange network behaves (with respect to system parameters like traffic and buffer capacity) in a similar manner for both composite and power-of-two system sizes.

0162-8828/91$01.00 © 1991 IEEE

II. THEORETICAL BASIS

In this section we provide the theoretical framework in which the shuffle-exchange network is defined for any even size, N’, and derive a distributed control algorithm for it. Some of the more important structural properties of the network are also presented here. The results in this section have a dual purpose. First, they characterize the structure of the network and contrast it with that of the power-of-two sized network. Second, they will be used in the next two sections to determine how useful the network is in practice and how it performs in a stochastic environment.

A. Network Definition

The perfect shuffle operation on $N'$ terminals ($N'$ even) separates the bottom $N'/2$ terminals from the top $N'/2$ and precisely interleaves them, with the bottom terminal remaining at the bottom [5]. If the terminals are numbered $0, 1, \ldots, N' - 1$, then the operation of the shuffle is equivalent to the permutation $\pi$, defined as

$$\pi(i) = \left(2i + \left\lfloor \frac{2i}{N'} \right\rfloor\right) \bmod N', \quad 0 \le i \le N' - 1. \tag{1}$$

In particular, note that $N'$ does not have to be a power of two. When it is, however, this operation is identical to the circular left shift of the binary representation of $i$:

$$\pi(i_{n-1} i_{n-2} \cdots i_1 i_0) = i_{n-2} i_{n-3} \cdots i_0 i_{n-1}.$$
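As an illustrative sketch (not part of the original paper), the permutation in (1) can be written in a few lines of Python. The function names `perfect_shuffle` and `rotate_left` are our own, chosen only for this example. The sketch compares the shuffle with a circular left shift for a power-of-two size and evaluates it for the composite size 18.

```python
def perfect_shuffle(i: int, n_prime: int) -> int:
    """Image of terminal i under the perfect shuffle of n_prime terminals, as in (1)."""
    if n_prime % 2 != 0:
        raise ValueError("n_prime must be even")
    return (2 * i + (2 * i) // n_prime) % n_prime


def rotate_left(i: int, bits: int) -> int:
    """Circular left shift of the `bits`-bit binary representation of i."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb


if __name__ == "__main__":
    # Power-of-two size: the two lists printed below should coincide.
    print([perfect_shuffle(i, 8) for i in range(8)])
    print([rotate_left(i, 3) for i in range(8)])
    # Composite even size N' = 18: still a permutation of 0..17.
    print([perfect_shuffle(i, 18) for i in range(18)])
```

For a composite size there is no bit-rotation shortcut, so the sketch falls back on the arithmetic form of (1) directly.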

Several properties of the perfect shuffle permutation are derived in [5]. In particular, $\lceil \log_2 N' \rceil$ applications of the perfect shuffle do not result in the original ordering, unless $N'$ is a power of two. The number of such applications required in the general case (called the order of the permutation) is given by $f$, where

$$2^f \equiv 1 \pmod{N' - 1}.$$
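The order of the shuffle can be checked numerically. The following sketch is our own illustration, with hypothetical helper names, and assumes that $f$ is taken as the smallest positive exponent satisfying the congruence. It applies the shuffle repeatedly until the identity ordering returns and compares the count with the multiplicative order of 2 modulo $N' - 1$.

```python
def perfect_shuffle(i, n_prime):
    """Perfect shuffle of n_prime terminals, as in (1)."""
    return (2 * i + (2 * i) // n_prime) % n_prime


def shuffle_order(n_prime):
    """Number of shuffle applications needed to restore the original ordering."""
    identity = list(range(n_prime))
    positions, count = identity, 0
    while True:
        positions = [perfect_shuffle(p, n_prime) for p in positions]
        count += 1
        if positions == identity:
            return count


def multiplicative_order_of_two(modulus):
    """Smallest f >= 1 with 2**f congruent to 1 mod `modulus` (modulus odd, >= 3)."""
    f, power = 1, 2 % modulus
    while power != 1:
        power = (power * 2) % modulus
        f += 1
    return f


if __name__ == "__main__":
    for size in (8, 12, 16, 18):
        print(size, shuffle_order(size), multiplicative_order_of_two(size - 1))
```

For $N' = 18$ the order is 8, larger than $\lceil \log_2 18 \rceil = 5$, whereas for $N' = 8$ and $N' = 16$ it equals $\log_2 N'$.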

A multistage shuffle-exchange network with $N'$ inputs and outputs ($N'$ even) consists of $\lceil \log_2 N' \rceil$ shuffle-exchange stages, i.e., each stage consists of the perfect shuffle of $N'$ terminals followed by $N'/2$ exchange elements. Each exchange element is a two-input crossbar switch that can connect either input to either of the two outputs, so long as there is no conflict. Note that for $N' = 2^n$, the structure is identical to the Omega network defined in [13]. The only reason we do not call it a "general version" of the Omega network is that the latter corresponds to an "Ω-base" derived from a factorization of $N'$. When $N' = 2^n$, all its prime factors are equal to two, so that the structure is identical to the shuffle-exchange network. In our definition, no part is played by the factorization of $N'$. Fig. 1 shows an 18 x 18 shuffle-exchange network with five stages of nine exchange elements each. Each exchange element can be controlled by a single bit accompanying a message or a request: if the bit is 0, a connection is made to the top output, and if it is a 1, then the bottom output is chosen.

Several key topological features can be observed from this figure. First, the two input links to an exchange element need not be connected to the same type of outputs (top or bottom) of exchange elements in the previous stage. Thus, the network does not possess a key property of Delta networks, a very general class of multistage networks [9], [16]. This does not mean that the network cannot be "digit-controlled," as we shall see later. The network is not a Banyan [4] either (Banyans are a class of networks more general than the Delta networks), since certain inputs have two paths to certain outputs (0 to 0, for instance). However, the network is not a multipath Omega network [15] because certain input-output pairs have unique paths between them (2 to 8, for instance).

The structure does not have pairs of nodes in adjacent stages that constitute "butterfly" type cycles ($K_{2,2}$ graphs, to be precise). From the definition of the perfect shuffle in (1), it can be seen that unless $N'/2$ is even, the interconnection will not possess this property. We also note that a "tree-structure" exists from each input until the penultimate stage, but not to the network outputs. Destination addresses (even if augmented by a "don't-care" bit as in multipath networks) do not in general serve as routing tags to get to an output. These and other properties are examined in detail in the next two sections.

B. Network Control

In this section we take up the question of how a path can be established from an input to an output in the shuffle-exchange network. In the process, we will also provide a constructive proof for its connectivity. Let us call a control algorithm "tag-based" if a path can be set up by an input to an output using a control tag $T$. Each bit in the binary representation of the tag is used to control one exchange element in the path as previously explained. (Let us assume that bit $t_n$ is used to control the first exchange element, bit $t_{n-1}$ the next, and so on.) The following theorem provides a tag-based control algorithm for the multistage shuffle-exchange network. For purposes of notation, let us define

$$N' = N + M, \quad N = 2^n, \quad 0 < M \le N.$$

The key point to observe in the theorem is that when N’ is not a power of two, the control tag is a function of both the input and the output.


Fig. 1. An 18 x 18 multistage shuffle-exchange network. Paths to output 15 from inputs 0 and 1 are shown.

Theorem 1:¹ (Control Tag) Any input $i$, $0 \le i < N'$, in the multistage shuffle-exchange network can set up a path to an output $j$, $0 \le j < N'$, by using the control tag

$$T_1 = (j + 2Mi) \bmod N'.$$

In addition, if $T_1 + N' < 2N$, then a second control tag exists and is given by

$$T_2 = T_1 + N'.$$
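The tag computation in the theorem is short enough to state as code. The sketch below is our own illustration; `split_size` and `control_tags` are hypothetical names, and the decomposition $N' = N + M$ is assumed to take $N$ as the largest power of two strictly below $N'$.

```python
def split_size(n_prime):
    """Return (N, M) with N' = N + M, N = 2**n and 0 < M <= N."""
    if n_prime < 2 or n_prime % 2:
        raise ValueError("n_prime must be an even integer >= 2")
    big_n = 1 << ((n_prime - 1).bit_length() - 1)
    return big_n, n_prime - big_n


def control_tags(i, j, n_prime):
    """Valid (n+1)-bit control tags from input i to output j, per Theorem 1."""
    big_n, m = split_size(n_prime)
    t1 = (j + 2 * m * i) % n_prime
    tags = [t1]
    if t1 + n_prime < 2 * big_n:      # second tag exists only if it fits in n+1 bits
        tags.append(t1 + n_prime)
    return tags


if __name__ == "__main__":
    print(control_tags(1, 15, 18))    # input 1 to output 15 in the 18 x 18 network
    print(control_tags(0, 15, 18))    # input 0 to output 15
```

When $N'$ is a power of two, `control_tags` returns only the destination address, which is the familiar Omega-network destination tag.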

Proof: Let $T = 2^n t_n + 2^{n-1} t_{n-1} + \cdots + 2 t_1 + t_0$. The perfect shuffle at each stage maps the output terminal in the previous stage to $\pi(\cdot)$ as defined in (1); the exchange element replaces the least significant bit in this terminal by the tag bit used in that stage. Consider stage 0 first.

$$\text{LSB of } \pi(i) = \pi(i) - 2\left\lfloor \frac{\pi(i)}{2} \right\rfloor.$$

The address of the terminal occupied by the path or message at the output of stage 0 is then given by

$$\text{Terminal}(0) = \pi(i) - \left(\pi(i) - 2\left\lfloor \frac{\pi(i)}{2} \right\rfloor\right) + t_n = 2\left\lfloor \frac{\pi(i)}{2} \right\rfloor + t_n. \tag{2}$$

Note that this value is $< N'$ since $\pi(i)$, the input terminal to the exchange element, is $< N'$, and only the lsb of $\pi(i)$ is replaced by $t_n$. Using the properties of congruences [19] we have

$$\frac{1}{2}\pi(i) \equiv i + \frac{1}{2}\left\lfloor \frac{2i}{N'} \right\rfloor \pmod{\frac{N'}{2}}.$$

It is possible to show that if $N_1 \equiv N_2 \bmod m$, then $\lfloor N_1 \rfloor \equiv \lfloor N_2 \rfloor \bmod m$, so that

$$\left\lfloor \frac{\pi(i)}{2} \right\rfloor = \left\lfloor i + \frac{1}{2}\left\lfloor \frac{2i}{N'} \right\rfloor \right\rfloor \bmod \frac{N'}{2} = i \bmod \frac{N'}{2}.$$

Substituting this into (2) above, we have

$$\text{Terminal}(0) = 2\left(i \bmod \frac{N'}{2}\right) + t_n = (2i \bmod N') + t_n < N'.$$

By following this derivation for the perfect shuffle and exchange element of stage 1, we have

$$\text{Terminal}(1) = (2\,\text{Terminal}(0) \bmod N') + t_{n-1} = (2^2 i + 2 t_n + t_{n-1}) \bmod N' < N'.$$

Using mathematical induction on the stage number, we can show that at the last stage, the network output connected to input $i$ is given by

$$\text{Terminal}(n) = (2^{n+1} i + 2^n t_n + 2^{n-1} t_{n-1} + \cdots + 2 t_1 + t_0) \bmod N' = (2Ni + T) \bmod N'.$$

When $T_1 = (j + 2Mi) \bmod N'$ is used as the control tag,

$$\text{Terminal}(n) = (2Ni + 2Mi + j) \bmod N' = j,$$

since $2Ni + 2Mi = 2N'i$. When $T_2 = T_1 + N' < 2N$, clearly it can also be used as an $(n + 1)$-bit tag, since it results in $\text{Terminal}(n) = j$. □

¹ F. Hwang has brought to our attention a similar theorem derived in [2] for the generalized de Bruijn digraph. Since the shuffle-exchange and de Bruijn digraphs bear a close relationship [6], their theorem is also readily applicable to multistage networks, as pointed out in [2].

Note that one control tag from input 0 to output $j$ is always equal to $j$. If $j < 2N - N'$, then input 0 will have a second control tag to get to $j$. Also note the special case of $N' = 2N$: tag values to an output $j$ are independent of the input and are given by the output address $j$. Finally, the control tags from inputs $i$ and $(i + N'/2) \bmod N'$ are identical to any output $j$.

As an example, consider the 18 x 18 network in Fig. 1. Input 1 can get to output 15 using two different control tags 00001 (1) and 10011 (19). ($N' = 18$, $N = 16$, $M = 2$, in this case.) On the other hand, input 0 can get to the same output using only one control tag, 01111 (15). It can be seen from Fig. 1 that the two control tags in the first case lead through two disjoint paths to the same output. Corollary 1 in the following section generalizes this observation.
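The stage-by-stage argument of the proof can be replayed in software. The following is a minimal sketch under our own naming (`route` and `tags_for` are hypothetical helpers, not from the paper): each stage applies the perfect shuffle and then overwrites the least significant bit with the next tag bit, most significant tag bit first.

```python
def route(i, tag, n_prime):
    """Trace input i under control tag `tag`; return the terminal after each stage."""
    stages = (n_prime - 1).bit_length()              # ceil(log2 N') for N' >= 2
    terminal, path = i, []
    for s in range(stages):
        bit = (tag >> (stages - 1 - s)) & 1          # t_n first, t_0 last
        shuffled = (2 * terminal + (2 * terminal) // n_prime) % n_prime
        terminal = (shuffled & ~1) | bit             # exchange element sets the LSB
        path.append(terminal)
    return path


def tags_for(i, j, n_prime):
    """Control tags T1 and, when it fits in n+1 bits, T2 = T1 + N' (Theorem 1)."""
    two_n = 1 << (n_prime - 1).bit_length()          # 2N = 2**(n+1)
    m = n_prime - two_n // 2
    t1 = (j + 2 * m * i) % n_prime
    return [t for t in (t1, t1 + n_prime) if t < two_n]


if __name__ == "__main__":
    # The example from Fig. 1: input 1 to output 15 with tags 1 and 19.
    for tag in tags_for(1, 15, 18):
        print(tag, route(1, tag, 18))
```

For input 1 and output 15 the two traced paths occupy different terminals at every stage before the last, in line with the disjointness observed in Fig. 1.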

C. Structural Properties

We now generalize and formally state the properties that were observed in regard to the network in Fig. 1 earlier in Section II-A. All of these either follow from or make use of Theorem 1 and its proof.

Corollary 1: (Disjoint paths) If an input has two valid $(n + 1)$-bit control tags to get to an output, then the paths used by these two tags are edge disjoint within the network.

Proof: Since $N' > N$, a second control tag exists only if the first is $< N$. Thus, if an input has two control tags $T_1$ and $T_2$ to get to an output, then the most significant bit $t_n = 0$ for $T_1$ and $t_n = 1$ for $T_2$. From this, and the expressions for $\text{Terminal}(j)$, $0 \le j < n$, that the paths generated by the tags occupy at stage $j$ (in the proof of Theorem 1), we see that $\text{Terminal}(j)$ for path 1 $\neq$ $\text{Terminal}(j)$ for path 2, $0 \le j < n$. □

Theorem 1 establishes the connectivity of the network: any input can be connected to any output using a control tag. From Corollary 1 we see that certain inputs may be connected to certain outputs using two different paths. The number of such inputs for a given output, and vice versa, are characterized by the following two corollaries.

Corollary 2: (Unique path inputs) Given an output $j$, there are $2M$ inputs that have a unique path to $j$. They are specified by

$$i_k = \left\lceil \frac{(N - M - j) \bmod N' + kN'}{2M} \right\rceil, \quad 0 \le k < 2M.$$

Proof: From Theorem 1, such inputs $i$ are characterized by

$$(j + 2Mi) \bmod N' \ge 2N - N' = N - M, \tag{3}$$

or, $2Mi \ge (N - M - j) \bmod N'$. Thus, the first such input is given by

$$i_0 = \left\lceil \frac{(N - M - j) \bmod N'}{2M} \right\rceil.$$

Additional inputs are obtained from the equivalence class, modulo $N'$, of $(N - M - j)$, or

$$i_k = \left\lceil \frac{(N - M - j) \bmod N' + kN'}{2M} \right\rceil.$$

Note, however, that $0 \le k < 2M$, since $i_{2M} \equiv i_0 \bmod N'$. □

Corollary 3: (Unique path outputs) Given an input $i$, there are $2M$ outputs that have a unique path from $i$. These are specified by

$$j_k = (N - M - 2Mi + k) \bmod N', \quad 0 \le k < 2M.$$

Proof: From inequality (3), such outputs are characterized by

$$(j + 2Mi) \bmod N' \ge 2N - N' = N - M.$$

Thus, the first value of $j$ that satisfies this is given by

$$j_0 = (N - M - 2Mi) \bmod N'.$$

In addition, for $1 \le k < 2M$, $(j_0 + k) \bmod N'$ also satisfies inequality (3) since

$$((j_0 + k) \bmod N' + 2Mi) \bmod N' = (N - M - 2Mi + k + 2Mi) \bmod N' = (N - M + k) \bmod N'.$$

For $k < 2M$, $N - M + k < N'$, so that

$$(N - M + k) \bmod N' = N - M + k \ge N - M.$$

However, for $k \ge 2M$,

$$(N - M + k) \bmod N' = N - M + k - N' < N - M,$$

since $k < N'$. □

Summarizing these two corollaries, we see that in a network of size $N' = N + M$, each input has dual paths to $N - M$ outputs and unique paths to $2M$ outputs. Similarly, each output has dual paths from $N - M$ inputs and unique paths from $2M$ inputs.

Fig. 2 shows a 12 x 12 shuffle-exchange network ($N' = 12$, $N = 8$, $M = 4$) and the paths to output 0 from all the inputs. The switches occupied by these paths are shown by the hatched boxes in the last three stages. In the first stage a "1" indicates that inputs connected to that switch have a unique path to output 0. Switches marked by a "2" connect to two hatched boxes in the second stage and inputs connected to these have two paths to output 0. Thus, eight inputs have unique paths and four inputs have dual paths to this (or any other) output. In the 18 x 18 network shown in Fig. 1, four inputs (4, 8, 13, 17) have unique paths to output 0; all the others have dual paths.

Fig. 2. A 12 x 12 multistage shuffle-exchange network. Paths from all inputs to outputs 0/1 are shown.

In Section II-A we observed that in Fig. 1 a tree structure exists from an input only up to the penultimate stage (or from an output till the second stage). This can be seen clearly in Fig. 2 from output 0. The following corollary formalizes this property.

Corollary 4: (Tree Structure) Consider an exchange element at stage $k$, $0 \le k \le n$, in an $N' \times N'$ network. Let $I_0$ be the set of all network inputs with paths to the top input of the exchange element and let $I_1$ be the set of all network inputs with paths to the bottom input. Similarly define $O_0$ and $O_1$ in terms of network and exchange element outputs. Then,

a) $$|I_0 \cap I_1| = \begin{cases} 0 & \text{for } 0 \le k \le n - 1 \\ N - M & \text{for } k = n \end{cases}$$

$$|I_0| = |I_1| = 2^k \quad \text{for } 0 \le k \le n.$$

b) $$|O_0 \cap O_1| = \begin{cases} N - M & \text{for } k = 0 \\ 0 & \text{for } 1 \le k \le n \end{cases}$$

$$|O_0| = |O_1| = 2^{n-k} \quad \text{for } 0 \le k \le n.$$

Proof: We will prove the property for network inputs. The validity of the property for network outputs can be shown by replacing "inputs" and "input sets" in the following by "outputs" and "output sets." $|I_0 \cap I_1| = 0$ in stages $0 \le k < n$ follows from the fact that if this were not the case, each input in the intersection would have two paths to get to the outputs of this switching element. Corollary 1 forbids this. At stage $n$, the number of inputs in $I_0 \cap I_1$ is exactly the number of network inputs with two paths to a given output, which is $N - M$ from Corollary 2.

To see that exactly $2^k$ network inputs can reach either of the two inputs of the exchange element under consideration, we work backward from stage $k$. Note that at any stage $j$, $0 \le j < k$, the input terminals at this stage that connect to one input of the stage $k$ exchange element have to satisfy the property that $|I_p \cap I_q| = 0$ for any two such terminals $p$ and $q$. If this were not the case, the network inputs in the intersection would have two paths to the input terminal at stage $k$, leading to a contradiction with the first part of the corollary. Given this, and the fact that each exchange element has an in-degree of two, we have the desired result. □
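The counts in Corollaries 2 and 3 can be spot-checked by brute force. The sketch below is our own illustration with hypothetical function names. It finds the unique-path inputs and outputs directly from inequality (3) and compares them with the closed forms, where the Corollary 2 expression is reduced modulo $N'$ as the remark at the end of its proof suggests.

```python
def split_size(n_prime):
    """Return (N, M) with N' = N + M and N the largest power of two below N'."""
    big_n = 1 << ((n_prime - 1).bit_length() - 1)
    return big_n, n_prime - big_n


def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def unique_path_inputs(j, n_prime):
    """Inputs i satisfying inequality (3) for output j (direct enumeration)."""
    big_n, m = split_size(n_prime)
    return [i for i in range(n_prime) if (j + 2 * m * i) % n_prime >= big_n - m]


def unique_path_inputs_formula(j, n_prime):
    """Closed form i_k of Corollary 2, reduced modulo N'."""
    big_n, m = split_size(n_prime)
    base = (big_n - m - j) % n_prime
    return sorted({ceil_div(base + k * n_prime, 2 * m) % n_prime
                   for k in range(2 * m)})


def unique_path_outputs(i, n_prime):
    """Outputs j satisfying inequality (3) for input i (direct enumeration)."""
    big_n, m = split_size(n_prime)
    return [j for j in range(n_prime) if (j + 2 * m * i) % n_prime >= big_n - m]


def unique_path_outputs_formula(i, n_prime):
    """Closed form j_k of Corollary 3."""
    big_n, m = split_size(n_prime)
    return sorted((big_n - m - 2 * m * i + k) % n_prime for k in range(2 * m))


if __name__ == "__main__":
    print(unique_path_inputs(0, 18), unique_path_inputs_formula(0, 18))
    print(unique_path_inputs(0, 12), unique_path_inputs_formula(0, 12))
```

For output 0 the enumeration gives inputs 4, 8, 13, 17 in the 18 x 18 network and eight inputs in the 12 x 12 network, matching the discussion of Figs. 1 and 2.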

Thus, we see that through each output of a first stage exchange element, $2^n = N$ destinations can be reached. The paths to these constitute a binary tree rooted at each output. The two trees do not overlap until the last stage. A similar statement can be made for each switch in the last stage and network inputs connecting to it. This corollary will turn out to be quite important in routing traffic through the network in a balanced manner.
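The output-side statement for first stage switches can be illustrated by exploring the network forward. In the sketch below (our own, with the hypothetical name `reachable_outputs`), we start from each of the two output terminals of a first stage exchange element and follow both switch settings through the remaining $n$ stages.

```python
def reachable_outputs(terminal, remaining_stages, n_prime):
    """Network outputs reachable from `terminal` through the remaining stages."""
    frontier = {terminal}
    for _ in range(remaining_stages):
        next_frontier = set()
        for t in frontier:
            shuffled = (2 * t + (2 * t) // n_prime) % n_prime
            # An exchange element can deliver to its top (LSB 0) or bottom (LSB 1) output.
            next_frontier.update({shuffled & ~1, shuffled | 1})
        frontier = next_frontier
    return frontier


if __name__ == "__main__":
    n_prime, n = 18, 4                      # N' = 18, N = 2**4 = 16, M = 2
    for switch in range(n_prime // 2):
        top = reachable_outputs(2 * switch, n, n_prime)
        bottom = reachable_outputs(2 * switch + 1, n, n_prime)
        print(switch, len(top), len(bottom), len(top & bottom))
```

In the 18 x 18 case each output of a first stage switch reaches $N = 16$ destinations, and the two destination sets share $N - M = 14$ outputs, consistent with part b) of Corollary 4 for $k = 0$.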

The final derivative of Theorem 1 characterizes the number of different control tags for each output.

Corollary 5: (Number of control tags) Let $pN'$ be the smallest multiple of $N'$ which is divisible by $2M$. Then the number of different control tags for a given output is $pN'/2M$. The number of inputs that use the same control tag to get to a given output $= 2M/p$.

Proof: From Theorem 1, the control tags to an output are given by $(j + 2Mi) \bmod N'$, $0 \le i < N'$. The number of distinct values of this expression is $pN'/2M$, where $p$ is defined above.

With $N'$ inputs and only $pN'/2M$ distinct tags per output, the number of inputs using the same control tag for an output $= N' \times 2M/(pN') = 2M/p$. In fact, these inputs belong to the same equivalence class modulo $pN'/2M$. □

Note that for $N' = 2N$, $p = 1$, so that $pN'/2M = 1$, and $2M/p = 2N$.

Going back to our example of the 12 x 12 network in Fig. 2, $N' = 12$ and $M = 4$, so that $p = 2$ in this case. Thus, the number of control tags for each output is three, given by $j$, $(j + 4) \bmod 12$, and $(j + 8) \bmod 12$ for each output $j$. Four inputs use the same control tag to get to each output. In the case of output 0, these are {0, 3, 6, 9} using control tags 0/12, {1, 4, 7, 10} using control tag 8, and {2, 5, 8, 11} using control tag 4.
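The grouping of inputs by control tag in this example can be reproduced with a short script. This is our own sketch with hypothetical function names. It assumes that $p$ can be computed as $2M / \gcd(2M, N')$, which is the smallest integer for which $pN'$ is divisible by $2M$, and it groups inputs by the first tag $T_1$ only.

```python
from collections import defaultdict
from math import gcd


def split_size(n_prime):
    """Return (N, M) with N' = N + M and N the largest power of two below N'."""
    big_n = 1 << ((n_prime - 1).bit_length() - 1)
    return big_n, n_prime - big_n


def tag_classes(j, n_prime):
    """Map each control tag T1 for output j to the list of inputs that use it."""
    _, m = split_size(n_prime)
    classes = defaultdict(list)
    for i in range(n_prime):
        classes[(j + 2 * m * i) % n_prime].append(i)
    return dict(classes)


def corollary5_counts(n_prime):
    """Return (distinct tags per output, inputs sharing each tag) per Corollary 5."""
    _, m = split_size(n_prime)
    p = (2 * m) // gcd(2 * m, n_prime)   # smallest p with p * N' divisible by 2M
    return p * n_prime // (2 * m), (2 * m) // p


if __name__ == "__main__":
    print(tag_classes(0, 12))
    print(corollary5_counts(12), corollary5_counts(18), corollary5_counts(16))
```

For the 12 x 12 network the script reports three tag classes of four inputs each for output 0, as stated in the text.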

D. Permutation Capability

It is extremely difficult to derive exact expressions for the number of permutations that can be realized by an Omega type network when redundant paths are present [3]. The situation is more complex in our case because the redundancy is not static: not all inputs have two paths to a particular output, and this depends on the output chosen. Thus, we will not attempt to analyze this in detail here. However, it is possible to derive a lower bound on this number by observing that for a fixed setting of the first stage switches, each different setting (or state) of the remaining switches in the network yields a different permutation. Thus, we have

$$\text{Number of realizable permutations} \ge 2^{\frac{N'}{2} \lfloor \log_2 N' \rfloor}.$$

Note that this value becomes tighter as $N' \to 2N$, since this reduces the redundancy in the network. For $N' = 2N$, the number is exact. Conversely, the estimate is poorest for $N' = N + 2$.

While this lower bound on the number of permutations can be achieved by the distributed tag control algorithm (if two paths are available, one fixed choice would always be made), realizing all the permutations that the network is capable of will require global processing, in a manner similar to the looping algorithm for permutation networks [14]. The algorithm will be much simpler in our case because a choice (of alternate paths) is available only at the two outer stages, so that the algorithm need not be recursively applied to the internal stages.

E. Higher Order Shuffles and Nonbinary Switching Elements

The shuffle-exchange network can be defined for any system of size $k^n \times k^n$, $k \ge 2$; when $k > 2$, the binary exchange elements are replaced by $k \times k$ crossbar switches and the perfect shuffle connections by k-ary shuffle connections [12]. We can define general versions of these latter structures for systems of size $N'$, where $N'$ is a multiple, but not a power, of $k$. Formally, the k-ary shuffle permutation is defined as

$$\pi(i) = \left(ki + \left\lfloor \frac{ki}{N'} \right\rfloor\right) \bmod N', \quad 0 \le i \le N' - 1.$$

Let $N' = N + M$ as before, with $N = k^n$ and $k \le M \le (k - 1)N$. The multistage shuffle-exchange network in this case consists of $\lceil \log_k N' \rceil = n + 1$ stages, with each stage composed of the k-ary shuffle of $N'$ elements followed by $N'/k$ crossbar switches of size $k \times k$. Each such switch can be controlled by a single k-ary digit. Following Theorem 1 and using k-ary arithmetic, it is possible to show that the network control tag from input $i$ to output $j$ is given by

$$T_1 = (j + kMi) \bmod N'.$$

In addition, other control tags (and paths) may be available, specified by

$$T_p = T_1 + (p - 1)N' \quad \text{if } T_p < kN, \quad 1 < p \le k.$$

The maximum number of disjoint paths between any input and any output is given by $\lceil k^{n+1}/N' \rceil$. Note that as in the case of $k = 2$, not all input-output pairs might be connected by this maximum number of paths. An example of an 18 x 18 network constructed using three stages of ternary shuffles and 3 x 3 switching elements is shown in Fig. 3. The maximum number of paths between an input and output in this case is two.
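The k-ary generalization can be exercised in the same way as the binary case. The sketch below is our own illustration with the hypothetical names `kary_route` and `kary_tags`. It assumes that each $k \times k$ switch overwrites the least significant k-ary digit of the shuffled terminal address with the next tag digit, by analogy with the binary proof of Theorem 1.

```python
def num_stages(n_prime, k):
    """ceil(log_k N'), computed with integers only."""
    stages = 1
    while k ** stages < n_prime:
        stages += 1
    return stages


def kary_route(i, tag, n_prime, k):
    """Trace input i under a k-ary control tag; return the terminal after each stage."""
    stages = num_stages(n_prime, k)
    terminal, path = i, []
    for s in range(stages):
        digit = (tag // k ** (stages - 1 - s)) % k        # most significant digit first
        shuffled = (k * terminal + (k * terminal) // n_prime) % n_prime
        terminal = shuffled - shuffled % k + digit         # switch sets the last digit
        path.append(terminal)
    return path


def kary_tags(i, j, n_prime, k):
    """Tags T_p = T_1 + (p - 1) N' that are smaller than kN = k**(n+1)."""
    k_times_n = k ** num_stages(n_prime, k)
    m = n_prime - k_times_n // k
    t1 = (j + k * m * i) % n_prime
    return list(range(t1, k_times_n, n_prime))


if __name__ == "__main__":
    # The two connections highlighted in Fig. 3 (N' = 18, k = 3).
    for source, dest in ((0, 3), (14, 9)):
        for tag in kary_tags(source, dest, 18, 3):
            print(source, dest, tag, kary_route(source, tag, 18, 3))
```

With $N' = 18$ and $k = 3$, input 0 reaches output 3 with two tags while input 14 reaches output 9 with one, so no pair exceeds the stated maximum of two paths.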

III. ARCHITECTURAL ISSUES

Having defined the shuffle-exchange network for $N' \neq 2^n$ and derived some of its structural properties, we need to determine how easily it can be used in practice. Two of the chief reasons for the popularity of the power-of-two sized network are its simple bit-controlled routing algorithm, and its ability to provide reasonable stochastic performance at a low cost. (As mentioned earlier, the ability of the network to realize many commonly used permutations is relevant mainly to array processors.) In this section we try to establish that building and using a shuffle-exchange network of any even size is no more cumbersome than a power-of-two sized network.

A. Efficient Computation of Control Tags One of the main features in which the composite sized net-

work differs from its power-of-two counterpart is the structure of the control tag used by an input ( i ) to set up a connection or send a message to an output U). While in the power-of- two network this tag is just the output address, in the general case it is given by TI = ( j + 2Mi) mod N'. In addition, if TI + N' < 2N, then a second control tag exists, given by Tz = TI + NI. We will show in this section that despite the modulo N' arithmetic in the control tag definitions, generating the latter involves an added delay of only an (n + 1)-bit addition and logic of roughly three (n + 1)-bit adders.

Note that the network interface unit attached to each processing element in a multiprocessor typically includes some address processing capability, for a variety of reasons. Even for a single scalar address generated by the processor, a determination has to be made whether the address is local or global, and whether it is physical or virtual. Once a physical global address is determined, the memory module number is extracted (which in the simplest case is a sequence of most significant bits in the address, and in general depends on the memory interleaving scheme) and used to control the network. For a vector access, additional computation will have to be done to determine successive memory locations (module numbers) based on a base address and a vector access stride. While this discussion applies to a shared memory multiprocessor, computations of a similar complexity need to be done in a distributed memory architecture also. We now describe how the control tags $T_1$ and $T_2$ (if existent) can be generated in a simple and efficient manner, given the memory module number, $j$, to be accessed.

Fig. 3. An 18 × 18 shuffle-exchange network using ternary shuffles and switching elements. Paths from input 0 to output 3 and input 14 to output 9 are shown.

At each network input $i$, $2Mi$ is a constant and $0 \le 2Mi \bmod N' < N'$. Let $C = 2Mi \bmod N'$. Then $T_1(j) = j + C - N'$ if $j + C - N' \ge 0$ and $T_1(j) = j + C$ if $j + C - N' < 0$. This can be accomplished by the simple unit shown in Fig. 4, in which both $j + C$ and $j + C' = j + C - N'$ are generated in parallel and, based on the sign of $j + C'$, one of the values is chosen as the tag $T_1(j)$. (Note that the addition of $j$ and $C'$ is done in signed representation.) If $T_2(j)$ exists, it is given by $T_1(j) + N'$. Since $T_1(j)$ and $N'$ both use $n + 1$ bits, their sum is $< 2N$ if there is no carry out in the addition process. The carry signal can thus be used to validate $T_2(j)$ (Fig. 4).
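The tag generation unit described above can be mimicked in software. The following is a sketch under stated assumptions: `make_tag_generator` and its parameters are hypothetical names, Python integers stand in for the $(n+1)$-bit signed hardware adders, and the carry out is modeled as the bit above position $n$.

```python
def make_tag_generator(i: int, n: int, m: int):
    """Return a function j -> (T1, T2 or None) for input i, with N = 2**n, N' = N + M."""
    big_n = 1 << n
    n_prime = big_n + m
    c = (2 * m * i) % n_prime        # per-input constant C
    c_neg = c - n_prime              # C' = C - N' (always negative)

    def tags(j: int):
        s_plus = j + c               # j + C
        s_minus = j + c_neg          # j + C', formed in parallel in hardware
        # The sign of j + C' selects which sum becomes T1.
        t1 = s_minus if s_minus >= 0 else s_plus
        total = t1 + n_prime
        carry_out = total >> (n + 1)  # carry out of an (n+1)-bit addition
        t2 = total if carry_out == 0 else None
        return t1, t2

    return tags


if __name__ == "__main__":
    gen = make_tag_generator(i=1, n=4, m=2)   # input 1 of an 18 x 18 network
    for j in (0, 10, 12):
        print(j, gen(j))
```

Only additions and a sign test appear in the per-request path; the modulo operation is confined to the constant $C$, which is computed once per input.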

When access is to be made to $V$ elements of a vector that are stored in the memory modules (outputs) with stride $S$, the addition (modulo $N'$) of $S$ to the previous tag can be done either at the input side (i.e., $j + kS$) or at the output side ($T_1(j) + kS$). In the former case we would pump $V$ module numbers through the tag generator to obtain $V$ control tags, while in the latter case, the first tag value generated would be used as a base value to which the stride can be added to get successive tags.

B. Balanced Routing of Traffic

Under equiprobable addressing of outputs by inputs, a connection request by an input is equally likely to be for any of the outputs. A network that does not have any inherent asymmetries will route such traffic uniformly. When considering networks made up of 2 × 2 switching elements, a key manifestation of this symmetry is that at any switching element an incoming message can reach the same number of destinations through either output. This is true when $N' = 2^n$, i.e., in the Omega network, and is also true of the multipath Omega network. When combined with the random choice of a path when multiple paths to a destination exist, this typically leads to similar traffic through the two ports of any switch. Because the secondary topological properties of the structure are not obvious when $N' \ne 2^n$, we need to establish this in the general case. The following lemma accomplishes this for the first stage, the most problematic one. The last paragraph in the section discusses the issue for the remaining stages.

Fig. 4. Control tag generation ($T_1$ and $T_2$) for an output $j$. $C$, $C'$, and $N'$ are constants.

Lemma 1: At the first stage of the shuffle-exchange network (of size N’ = N + M ) each input can access the same number of destinations N through either switch output. Of these, a) N - M destinations can be accessed through both the outputs, b) M destinations can be accessed through the top output only, and c) M destinations can be accessed through the bottom output only.

Proof: Each input has dual paths to $N - M$ destinations and the two paths are disjoint in all cases (Corollary 3). This proves part a). Parts b) and c) together say that the remaining $2M$ unique path destinations are distributed equally across the two outputs. To see this we refer again to Corollary 3, which specifies what these destinations are, and use that in Theorem 1 to get the control tags to these destinations. These tags are given by $N - M + k$, $0 \le k < 2M$. With $N = 2^n$, it is easy to see that the first $M$ of these have control tags $< N$ (i.e., tag bit $t_n = 0$) and these will be accessed through the top output. The rest have control tags $\ge N$ ($t_n = 1$) and these are accessed through the bottom output. □

Note the special case of $N' = 2N$. In this case each input has a unique path to each output and $N'/2$ destinations are accessed through the top output only and $N'/2$ through the bottom output.

Consider the example of the 18 × 18 network in Fig. 1 and input 1 (or 10). It can reach 16 destinations through each output of the first stage exchange element. Out of these, 14 are dual path destinations (0-9, 14-17), which can be reached through both outputs. Destinations 10 and 11 can be reached only through the top output, while 12 and 13 can be reached only through the bottom output.

As a consequence of this lemma, if each input addresses all outputs equiprobably and in addition chooses the path for a dual path output randomly, then any exchange element in the first stage will carry the same amount of traffic through both its outputs. In fact the random choice of an alternate path is not the only scheme that will accomplish equal distribution of traffic through the two outputs. For instance if inputs $i < N'/2$ always choose the path through the top output (when a choice is available) and inputs $i \ge N'/2$ always choose the bottom path, then also the two outputs at any first stage switch will carry identical traffic. Second-order dynamic properties of the system (like waiting time at buffers) will depend on which strategy is chosen. This fact will be made use of in our performance studies of the network in Section IV. Also, one or the other scheme might be easier to implement in practice.
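The counts in Lemma 1 can be checked by enumerating control tags. The sketch below is illustrative only; the function name is hypothetical, and it assumes, as in the proof, that tag bit $t_n$ (the most significant of the $n+1$ tag bits) controls the first stage, with 0 selecting the top output and 1 the bottom output.

```python
def first_stage_reachability(i: int, n: int, m: int):
    """Sets of destinations reachable from input i via the top and bottom
    outputs of its first stage switch, for N = 2**n and N' = N + M."""
    big_n = 1 << n
    n_prime = big_n + m
    top, bottom = set(), set()
    for j in range(n_prime):
        t1 = (j + 2 * m * i) % n_prime
        candidates = [t1]
        if t1 + n_prime < 2 * big_n:      # second tag T2 exists
            candidates.append(t1 + n_prime)
        for t in candidates:
            # Bit t_n of the tag: 0 -> top output, 1 -> bottom output.
            (bottom if (t >> n) & 1 else top).add(j)
    return top, bottom


if __name__ == "__main__":
    top, bottom = first_stage_reachability(i=1, n=4, m=2)   # 18 x 18 network
    print("both outputs:", sorted(top & bottom))
    print("top only:    ", sorted(top - bottom))
    print("bottom only: ", sorted(bottom - top))
```

For input 1 of the 18 × 18 network this reproduces the sets quoted in the example: 14 dual path destinations, with 10 and 11 reachable only through the top output and 12 and 13 only through the bottom output.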

At stages 1 through $n$, Corollary 4 guarantees a tree structure from each output link to the set of destinations reachable from it. Furthermore, the trees associated with the two outputs of any switch do not overlap. This results in an equal distribution of incoming traffic through the two switch outputs under equiprobable addressing. Thus, we have established that the shuffle-exchange network routes traffic in a balanced manner, whether or not $N' = 2^n$.

C. Broadcasts

The broadcast capability of the shuffle-exchange network when $N' \ne 2^n$ is almost identical to that of the power-of-two sized network. Let us begin by characterizing the kind of broadcasts the latter network can accomplish [13]. We are concerned here not with the raw connection capability, but with connections that can be realized using a reasonable control scheme. For broadcasts, this typically means sending a broadcast control tag ($B$) along with the destination control tag ($T$). If $B(i) = 1$, $0 \le i < n$, then at stage $i$ the exchange element that receives the message would send a copy of it to both outputs. If $B(i) = 0$, control would be done based on $T(i)$. Note that this is equivalent to using two control bits for each stage, since such a switch can now be "set" in one of three states, for an incoming request.

The broadcasts that are possible under such a control scheme can be characterized as follows. A broadcast can be made to any set of $2^k$ outputs, $1 \le k \le n$, whose binary addresses differ in one or more of $k$ specified bit positions $j_1, j_2, \ldots, j_k$, $0 \le j_i < n$. This is accomplished by using a broadcast control tag of $B = \{b_m \mid b_m = 1 \text{ if } m \in \{j_1, j_2, \ldots, j_k\},\ b_m = 0 \text{ otherwise}\}$, and a destination control tag of $T = \{t_m \mid t_m = \text{don't care if } m \in \{j_1, j_2, \ldots, j_k\}, \text{ and } t_m = d_m \text{ otherwise}\}$. If $D$ represents the destination in which all the "don't care" bits are set to 0, then the broadcast set is given by $D$, $D + 2^{j_1}$, $D + 2^{j_2}$, $D + 2^{j_1} + 2^{j_2}$, $\ldots$, $D + 2^{j_1} + 2^{j_2} + \cdots + 2^{j_k}$. The simplest (and probably most useful) version of this is broadcasting to a sequence of $2^k$ locations, separated by a distance ("stride") of $2^i$ from each other. In this case, $B$ has the form $0 \ldots 0 1 \ldots 1 0 \ldots 0$, where the 1's are in positions $i + k - 1$ to $i$. $T$ has the form $d_{n-1} d_{n-2} \ldots d_{i+k} * \ldots * d_{i-1} \ldots d_0$.

Let us now consider the same control scheme in a network of size $N' \ne 2^n$, and determine what happens when $B$ is an arbitrary bit string. As in the previous case, the $2^k$ control tags are given by $T$, $T + 2^{j_1}$, $T + 2^{j_2}$, $T + 2^{j_1} + 2^{j_2}$, $\ldots$, $T + 2^{j_1} + 2^{j_2} + \cdots + 2^{j_k}$, where $T$ is the tag value in which all "don't care" bits are set to 0. Let $D$ be the destination addressed by $T$

so that $T = (D + 2Mi) \bmod N'$. Now $T_1 = T + 2^{j_1} = (D + 2^{j_1} + 2Mi) \bmod N' = (D_1 + 2Mi) \bmod N'$, where $D_1 = (D + 2^{j_1}) \bmod N'$. As long as $2^{j_1} < N'$, $D_1 \ne D$. Similarly $T_2 = (D + 2^{j_2} + 2Mi) \bmod N'$ leads to destination $D_2 = (D + 2^{j_2}) \bmod N'$, and as long as $2^{j_2} < N'$, $D_2 \ne D_1$ and $D_2 \ne D$. Proceeding similarly until we reach the last of the control tag values, we can see that as long as $B = 2^{j_1} + 2^{j_2} + \cdots + 2^{j_k}$ is $< N'$, all $2^k$ destinations in the broadcast set will be distinct.

What happens when $B \ge N'$? In this case it is possible that some of the control tags are duplicates. This happens if and only if $B \cdot N' = N'$ (bitwise AND), i.e., if $B$ has 1's in all the bit positions in which $N'$ has 1 bits. To see this, let $N' = 2^{p_1} + 2^{p_2} + \cdots + 2^{p_m}$. Let us say that $B$ has 1's in $k$ bit positions which include $p_1, p_2, \ldots, p_m$. Consider $T = (D + 2Mi) \bmod N'$ and $T_x = (D + 2^{p_1} + 2^{p_2} + \cdots + 2^{p_m} + 2Mi) \bmod N'$, which are valid control tags in the broadcast set. $D_x = D \bmod N'$, and so $T_x$ and $T$ both lead to destination $D$. This is also true of other control tags, if $k > m$. However, note that $B$ could be $> N'$ and still generate a broadcast set of size $2^k$, if $B \cdot N' \ne N'$. In this case, if $D_x \equiv D_y \bmod N'$, we would have $T_x = (D_y + N' + 2Mi) \bmod N'$ or $T_y = (D_x + N' + 2Mi) \bmod N'$, which implies $B$ has 1's in positions $p_1, p_2, \ldots, p_m$, leading to a contradiction. Characterizing the broadcast set further when $B \ge N'$ depends on both the number and location of the broadcast stages, making it impossible to arrive at an elegant formulation. We will thus be content with the following lemma.

Lemma 2: In a $\lceil \log N' \rceil$-stage shuffle-exchange network using the broadcast control tag scheme, any input can broadcast to $2^k$ outputs ($1 \le k \le n$) whose addresses differ in one or more of $k$ specified bit positions less than $n$. If a broadcast is done in bit position $n$ (stage 0) also, then $2^k$ distinct outputs will be reached if and only if the broadcast control tag does not have 1's in all the bit positions that $N'$ has. □

In Fig. 5, we see two examples of broadcasts, one from input 0 with a broadcast control tag of 11000 and a destination tag of **001, and another from input 17, with a broadcast tag of 10010 and a destination tag of *00*0. In the former case, because $B \cdot N' \ne N'$ ($N' = 10010$), four destinations are reached. In the latter case, because $B \cdot N' = N'$, one destination (4) gets two copies of the broadcast message and only two other destinations are reached.

Note that this guarantees single-copy broadcasts to a sequence of $2^k$ locations ($k < n$), as long as the stride is $\le 2^{n-k}$. Let us consider another common broadcast pattern, that of broadcast to all outputs. From Lemma 2 and the control algorithm it can be established that this can only be done with some outputs ($N - M$ of them) receiving duplicate copies of the message. In one sense this can be considered similar to a limitation of the power-of-two sized network, which cannot broadcast to $K$ outputs, $K \ne 2^k$, using the broadcast control tag algorithm.
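The two broadcasts of Fig. 5 can be reproduced by enumerating the control tags generated by a broadcast tag and mapping each back to a destination. This is a sketch under an explicit assumption: a tag value $T$ issued at input $i$ is taken to reach destination $(T - 2Mi) \bmod N'$, i.e., the inverse of the Theorem 1 relation $T = (j + 2Mi) \bmod N'$ used in the argument above. The function name and parameters are hypothetical.

```python
def broadcast_destinations(i: int, b: int, t: int, n: int, m: int) -> list[int]:
    """Destinations (with multiplicity) reached from input i by broadcast tag b.

    t is the destination tag with every "don't care" bit set to 0;
    N = 2**n and N' = N + M.
    """
    n_prime = (1 << n) + m
    positions = [p for p in range(n + 1) if (b >> p) & 1]
    dests = []
    for mask in range(1 << len(positions)):
        tag = t
        for idx, p in enumerate(positions):
            if (mask >> idx) & 1:
                tag |= 1 << p            # fill in one combination of broadcast bits
        # Invert T = (j + 2Mi) mod N' to recover the destination j.
        dests.append((tag - 2 * m * i) % n_prime)
    return sorted(dests)


if __name__ == "__main__":
    # 18 x 18 network: n = 4, M = 2, N' = 18 = 0b10010.
    print(broadcast_destinations(0, 0b11000, 0b00001, n=4, m=2))
    print(broadcast_destinations(17, 0b10010, 0b00000, n=4, m=2))
```

The first call yields four distinct destinations; the second yields destination 4 twice plus two others, matching the behavior described for the two cases $B \cdot N' \ne N'$ and $B \cdot N' = N'$.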

D. Network Decomposition

It is desirable to be able to construct a larger network using smaller networks as building blocks, for system growability and packaging considerations. When $N$ is a power of two, this can be done using subnetworks corresponding to any factorization of $N$ [13]. In this section we consider the partitionability question when $N$ is not a power of two.

Let $N' = N'_1 \times N_2$, where $N'_1 = N_1 + M_1$, $N_1 = 2^{n_1}$, and $N_2 = 2^{n_2}$. In the $N' = N + M$ notation, $N = N_1 N_2$ and $M = M_1 N_2$. An $N' \times N'$ network can in this case be constructed using a column of $N_2$ subnetworks of size $N'_1 \times N'_1$ followed by a column of $N'_1$ subnetworks of size $N_2 \times N_2$. Each subnetwork is a shuffle-exchange network as defined in Section II. The $p$th output of the $q$th network in the first column is connected to the $q$th input of the $p$th network in the second column, $0 \le p < N'_1$, $0 \le q < N_2$. Thus, each network input can get to every network output; redundant paths are available only in the subnetworks of the first column. We will see presently that the control tag for path setup in this partitioned architecture is identical to that in an $N' \times N'$ shuffle-exchange network as defined in Theorem 1.

Consider a path from input $i$ to output $j$. Each subnetwork in the second column is a unique path network and output $j$ is connected to terminal $j - N_2 \lfloor j/N_2 \rfloor$ of subnetwork $\lfloor j/N_2 \rfloor$ in this column. This subnetwork will therefore use control tag $T_2 = j - N_2 \lfloor j/N_2 \rfloor$ to get to output $j$. In addition, the path from $i$ to $j$ will have to occupy output terminal $\lfloor j/N_2 \rfloor$ of the subnetwork $\lfloor i/N'_1 \rfloor$ in the first column, because of the connection between the two columns. Thus, the control tag in the first subnetwork is given by (Theorem 1):

$$T_1 = \left(\lfloor j/N_2 \rfloor + 2M_1\left(i - N'_1 \lfloor i/N'_1 \rfloor\right)\right) \bmod N'_1.$$

The $(n + 1)$-bit composition of $T_1$ and $T_2$ is given by

$$\begin{aligned} T = T_1 \cdot T_2 &= T_1 N_2 + T_2 \quad (< N') \\ &= \left(N_2 \lfloor j/N_2 \rfloor + 2M_1 N_2 i - 2M_1 N'_1 N_2 \lfloor i/N'_1 \rfloor\right) \bmod N'_1 N_2 + \left(j - N_2 \lfloor j/N_2 \rfloor\right) \\ &= (j + 2Mi) \bmod N'. \end{aligned}$$

Thus, the control tag to be used in the partitioned architecture is identical to that in the shuffle-exchange network of size $N' \times N'$.
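The equality of the two tags can be confirmed exhaustively for a small case. The sketch below uses hypothetical function names and simply evaluates the subnetwork tags $T_1$ and $T_2$, concatenates them, and compares the result with $(j + 2Mi) \bmod N'$.

```python
def partitioned_tag(i: int, j: int, n1: int, m1: int, n2: int) -> int:
    """Concatenated tag T1 . T2 for the two-column construction,
    with N1 = 2**n1, N1' = N1 + M1 and N2 = 2**n2."""
    big_n1, big_n2 = 1 << n1, 1 << n2
    n1_prime = big_n1 + m1
    t2 = j % big_n2                                   # j - N2 * floor(j / N2)
    t1 = (j // big_n2 + 2 * m1 * (i % n1_prime)) % n1_prime
    return t1 * big_n2 + t2                           # T1 shifted left by n2 bits, then T2


def direct_tag(i: int, j: int, n1: int, m1: int, n2: int) -> int:
    """Tag (j + 2Mi) mod N' of the undivided N' x N' network, M = M1 * N2."""
    big_n2 = 1 << n2
    n_prime = ((1 << n1) + m1) * big_n2
    m = m1 * big_n2
    return (j + 2 * m * i) % n_prime


if __name__ == "__main__":
    # 24 x 24 network built from 6 x 6 and 4 x 4 subnetworks: N1 = 4, M1 = 2, N2 = 4.
    ok = all(partitioned_tag(i, j, 2, 2, 2) == direct_tag(i, j, 2, 2, 2)
             for i in range(24) for j in range(24))
    print("tags agree for all 24 x 24 pairs:", ok)
```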

Note that if $T_1 + N'_1 < 2N_1$, the first subnetwork can also be controlled by the tag $T_1 + N'_1$. The complete tag for the network in this case is given by $(T_1 + N'_1) \cdot T_2 = T_1 \cdot T_2 + N'$, which is indeed the second control tag in the shuffle-exchange network of size $N'$. In fact the condition $T_1 + N'_1 < 2N_1$ is equivalent to $T + N' < 2N$. This follows from the fact that $T + N' = (T_1 + N'_1)N_2 + T_2$. Since $N_2 = 2^{n_2}$ and $T_2 < 2^{n_2}$, $T + N' < 2N_1 N_2$ iff $T_1 + N'_1 < 2N_1$.

An example of a 24 × 24 network constructed from subnetworks of sizes 6 × 6 and 4 × 4 is shown in Fig. 6.

When neither of the chosen factors is a power of two, a valid $N' \times N'$ interconnection network will still result from this construction and the control tag can be derived as shown above. However, it will not equal $(j + 2Mi) \bmod N'$, and will be more difficult to compute. (The indexes of the subnetworks that $i$ and $j$ are connected to need to be computed first. Then the control tags in each of these subnetworks can be derived from Theorem 1 and concatenated.)

Fig. 5. Two broadcast connections in an 18 × 18 network. Broadcast control tags are shown at the top and bottom.

Fig. 6. A 24 x 24 network constructed using 6 x 6 and 4 x 4 shuffle-exchange networks as building blocks.

E. Recirculating Networks

Shuffle-exchange networks with fewer than $\lceil \log N' \rceil$ stages can be built using recirculation, just as in the case of $N' = 2^n$ [10], [11]. The most common version of this is the single stage network in which $N'$ input registers feed into $N'/2$ exchange elements through a perfect shuffle connection. The outputs of the exchange elements feed back into the registers. The network sources and destinations have access to these registers for insertion of new messages and removal of terminating messages. Each message recirculates $\lceil \log N' \rceil$ times, using one bit of the control tag in each cycle. It is easy to see that the control tag of Theorem 1 applies to the recirculating networks also.

There are, however, a couple of points that are specific to the single stage architecture. Conflicts between packets are handled in one of two ways in such networks: either by using buffers at each output (or input) of an exchange element [10], [1] or by misrouting one of the packets to the wrong output [11]. The operation is fairly straightforward in the first case, so let us consider the second scheme. Each message carries with it a "counter" that specifies the number of recirculations the message has made, and the control tag bit to use in the next cycle. When a message is misrouted, the counter is reset and it effectively "starts over." This is equivalent to the port that received the message generating it anew. This is possible when $N' = 2^n$ because the destination tag is independent of the source. In the general case of $N' \ne 2^n$, the port has to generate a new tag based on the destination address (which has to be a part of the message now). While this involves some time and space overhead, both are minimal. In terms of message lengths, this represents an addition of about 11% for $n = 10$ and a message body of 64 bits. As for computation of the new tag, it can be overlapped to a large extent with incrementing the message hop-counter, both of which are $(n + 1)$-bit operations (Fig. 4).

A technique that is useful in speeding up message delivery in single stage networks is "fast finish" [20]. Here messages are removed from the system as soon as they reach their destination register, instead of waiting to complete $n$ recirculations. This is done by comparing the destination address with the current register address. (Note that a counter is still needed to specify the tag bit to be used in each cycle.) It can be verified that $\log N$ cycles are not necessary for all source-destination pairs. Implementation of this scheme for $N' \ne 2^n$ will also require the destination address to be a part of the message, since checking for termination cannot be done using the routing tag.

IV. PERFORMANCE RESULTS

One of the main questions we wish to answer through our analysis is: do the shuffle-exchange networks behave significantly differently when $N' \ne 2^n$ than when $N' = 2^n$? The latter networks have been thoroughly analyzed in the literature [1], [8], [11], [20] and to answer this question we will consider the same operating environment and modeling assumptions as these earlier networks. The mode of communication within the network will be packet or message switched, with the entire message contained in a single packet. We will use the equiprobable independent reference model under which 1) all network outputs (destinations) are equally likely to be the target of a message generated by any network input (source), and 2) requests generated by different sources are independent of each other, both in terms of destination addresses and in terms of the generation process. Conflict resolution at any switch output is egalitarian.

As mentioned in Section II, each switching element a message transits through can be controlled by a single bit in the control tag of the message. The structure of a switching element is as shown in Fig. 7. (Ignore the notations on the inputs and outputs for the present.) The key point to notice is the presence of buffers at each switch output to handle message conflicts. The switching elements, sources, and destinations operate in cycles, where operations in a cycle are confined to generation of messages, transit of messages from one stage to the next, and absorption of messages. Note that the cycles are defined in terms of individual components and not in terms of the entire network. In particular, messages generated in different cycles could conflict within the network during a later cycle.

Fig. 7. Message rates through a switching element in (a) stage 0, (b) stage $i$, $1 \le i \le n$.

The generation of messages at the sources in each cycle will be modeled as independent Bernoulli processes with probability $m_g$ (the "message generation rate"). As mentioned in the discussion following Lemma 1, we will use the routing scheme in which inputs $i < N'/2$ will always choose the path through the top output of the first stage switch, if the destination has two paths to it. Inputs $i \ge N'/2$ choose the path through the bottom output to a dual path destination. There are two reasons for this. The first is that collisions are reduced in the first stage by using this scheme rather than the random choice. Under this scheme each switch output will receive messages at the rate of $m_g N/N'$ from one input and $m_g M/N'$ from the other. Under random choice, the rates will be $m_g/2$ and $m_g/2$. For $N' \ne 2N$, the first system will have fewer collisions and consequently better throughput and shorter waiting time.

The second reason has to do with maintaining the property of independent arrivals at the two inputs of any switching element in the network. It is clear that the arrivals at the two inputs of a first stage switch are independent. From Corollary 4 we know that the sets of sources that can reach the two inputs of a switch are disjoint in stages 0 through $(n - 1)$, so that this independence assumption holds at these stages also. At the last stage, $N - M$ sources can access both inputs of a switch. Under the random choice scheme, independence of input arrivals would therefore not hold at this stage. With the deterministic choice of a path, we have guaranteed that none of these $N - M$ sources would ever access both inputs to a last stage switch. Thus, the arrivals at the two inputs of a switch at the last stage are also from two disjoint sets of sources, and hence are independent.

All sources are identical (though functionally independent) and when this is combined with Lemma 1 and Corollary 4, we see that all switches at any particular stage in the network behave identically. Thus, it is sufficient to analyze just one switching element in each stage of the network.

A. Lossless System (Infinite Buffers)

In this case we model the system by assuming buffers of infinite capacity at each output of a switching element, so that no messages are ever dropped due to a full buffer. This permits us to derive explicit formulations for message delays and we will see later that buffers of small capacities capture much of the performance under the lossless model.

In order to determine the expected delays of messages, we need to compute the waiting time of a message at stage $i$ ($w_i$) and also the message rate out of a buffer at stage $i$ ($m_i$), $0 \le i \le n$. These two values are identical for all buffers in a stage at steady state.

Consider a switching element at stage 0 [Fig. 7(a)]. As discussed in the previous section, each output of this switch receives messages from the two inputs at the rates $m_g N/N'$ and $m_g M/N'$. The queues can be modeled as G/D/1 systems (general arrival process, deterministic service process). A G/D/1 system with $k$ inputs of rates $r_1, r_2, \ldots, r_k$ can be analyzed and it can be shown that the average system time (1 cycle service time + waiting time in the queue) for an input message is given by [7], [8]

$$w = \frac{1}{2} + \frac{V}{2E(1 - E)}$$

where $E$ is the expectation of the arrival process $= \sum_i r_i$, and $V$ is the variance of the arrival process $= \sum_i r_i(1 - r_i)$. For our case with $r_1 = m_g N/N'$ and $r_2 = m_g M/N'$, we can show that the waiting time at the first stage is given by

$$w_0 = \frac{1 - m_g\left(1 - NM/N'^2\right)}{1 - m_g} = 1 + \frac{m_g}{1 - m_g} \cdot \frac{NM}{N'^2}.$$

For $N = M = N'/2$,

$$w_0 = \frac{1 - 3m_g/4}{1 - m_g} = \frac{4 - 3m_g}{4(1 - m_g)}$$

which would be the waiting time under random choice of alternate paths ($r_1 = r_2 = m_g/2$, in that case).

Also, because no messages are lost from the system, $m_0 = m_g$.

The situation in stages $1 \le i \le n$ is shown in Fig. 7(b). An incoming message can reach exactly $2^{n-i}$ destinations through each output, and the two sets of destinations are disjoint. Thus, the situation in these stages is identical to the $N' = 2^n$ case [8]. With $r_1 = r_2 = m_g/2$, we have

$$w_i = \frac{4 - 3m_g}{4(1 - m_g)}, \quad m_i = m_g, \quad 1 \le i \le n.$$

Since such a message has to traverse all (n+l) stages, we have

$$\text{Average message delay} = \frac{1 - m_g\left(1 - NM/N'^2\right)}{1 - m_g} + n\,\frac{4 - 3m_g}{4(1 - m_g)}.$$

This delay expression is identical to that in the literature for the special case of $N' = 2^{n+1}$. ($w_0 = w_i$, $i > 0$, for $N = M = N'/2$.) For other values of $N'$, it differs by the delay term for the first stage, which has a different form than the rest of the stages.
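The delay expressions can be evaluated directly. The following minimal sketch (function names are illustrative assumptions) computes the G/D/1 system time from the input rates, the closed forms for $w_0$ and $w_i$, and the average message delay, so that the closed forms can be compared against the general formula.

```python
def gd1_system_time(rates: list[float]) -> float:
    """Average system time w = 1/2 + V / (2 E (1 - E)) for Bernoulli inputs."""
    e = sum(rates)
    v = sum(r * (1 - r) for r in rates)
    return 0.5 + v / (2 * e * (1 - e))


def stage0_delay(mg: float, big_n: int, m: int) -> float:
    """w_0 for a network of size N' = N + M under the deterministic path choice."""
    n_prime = big_n + m
    return (1 - mg * (1 - big_n * m / n_prime**2)) / (1 - mg)


def later_stage_delay(mg: float) -> float:
    """w_i for stages 1..n, where both input rates equal m_g / 2."""
    return (4 - 3 * mg) / (4 * (1 - mg))


def average_delay(mg: float, n: int, m: int) -> float:
    """Average message delay over all n + 1 stages, with N = 2**n."""
    return stage0_delay(mg, 1 << n, m) + n * later_stage_delay(mg)


if __name__ == "__main__":
    for m in (2, 8, 16):                      # N' = 18, 24, 32 with N = 16
        print(16 + m, round(average_delay(0.9, n=4, m=m), 3))
```

With $N = 16$ fixed, the printed delay grows with $M$ and reaches the power-of-two value $(n+1)\,w_i$ at $M = N$.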

The average message delay is shown in Fig. 8 as a function of $\log_2 N'$ and as a function of the message generation rate $m_g$.

Fig. 8. Message delays in shuffle-exchange networks as a function of (a) network size $N'$, and (b) message generation rate $m_g$. The lossless system model is used.

[Panel (a): delay versus $\log N'$ from 5 to 10, with curves for $m_g = 0.5$ and $m_g = 0.9$. Panel (b): delay versus $m_g$ from 0.1 to 0.9, with curves for $N' = 100$, 400, and 700.]

Consider Fig. 8(a) first. As mentioned earlier, if we connected just the data points for $N' = 2^n$, we would get the straight line described by the expression $w_i n$. ($m_g$ decides the value of $w_i$ and hence the slope.) Delays for the intermediate values of $N'$ depend on the behavior of $w_0$.

From the expression above, we see that $w_0$ has two components to it: a one cycle service time, which is independent of traffic, and a waiting time component, which does depend on both $m_g$ and $N/N'$ (or $M/N'$). The first component is what leads to (one cycle) "steps" or jumps in going from $N' = 2^n$ to $N' = 2^n + 2$. Note that the second component is negligible for $N' \gtrsim 2^n$. However, as $M$ increases, the waiting time at stage 0 also increases until, when $M = N$, it becomes identical to the waiting time at the remaining stages and we get to the $N' = 2^{n+1}$ point in the curve. Again, at high traffic ($m_g = 0.9$), this second component matters much more than at low traffic ($m_g = 0.5$). Thus, at low traffic, waiting time at all the stages is quite small and message delay comes mainly from just moving through the $n + 1$ stages.

The second figure [Fig. 8(b)] shows the behavior of message delays as a function of traffic (message generation rate $m_g$), for systems of size $N' \ne 2^n$. This behavior is identical to that for power-of-two sized systems, since the functions $w_0$ and $w_i$, $i > 0$, have similar forms with respect to $m_g$. Thus, message delays stay reasonable and the system does not saturate until roughly 80% loading of the network.


B. Lossy System (Finite Buffers)

Under this model we will consider finite buffer capacities at switch outputs and assume that all blocked messages will be lost from the system. The probability of a message successfully transiting the network and reaching the destination (the "acceptance probability") is the primary performance metric in this model. The product of the message generation rate and the acceptance probability gives the normalized bandwidth of the network.

Multistage networks with finite buffer capacities have been analyzed in the literature (for the case of $N' = 2^n$) and we can follow the same outline in our case [1]. The idea is to proceed from stage 0 to stage $n$, computing the message rates into and out of each stage. The difference from the lossless model comes in the analysis of individual buffers, in which the output rate is no longer the sum of the input rates; the difference between the two rates is the rate at which messages are lost at the buffer. And the ratio of the output to input rates is the acceptance probability of messages at the buffer.

The state transition diagram for a two-input buffer is just that of a birth-death system with nearest neighbor transitions. The message rate out of the buffer is given by $(1 - b_0 a_0)$, where $b_0$ is the probability of the state in which no messages are in the buffer and $a_0$ is the probability of no arrivals. In this case the message rates out of each stage are different, and the first stage differs from the rest both quantitatively and qualitatively: it will experience fewer conflicts when $N' \ne 2^n$. A closed form expression for the message rate out of the last stage (which is the normalized bandwidth of the system) has eluded us. We have solved the model numerically.
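The paper does not spell out the numerical model, so the following is only one plausible sketch of a stage-by-stage computation consistent with the output rate $(1 - b_0 a_0)$. It assumes (these are assumptions, not statements from the paper) that the buffer state is the number of queued messages at the start of a cycle, that one message departs in any cycle in which the buffer is nonempty or a message arrives, that excess arrivals beyond the buffer capacity are lost, and that each output buffer in stages 1 through $n$ receives half of the previous stage's output rate on each of its two inputs. The buffer size parameter is called `q` here.

```python
def buffer_output_rate(r1: float, r2: float, q: int) -> float:
    """Output rate 1 - b0 * a0 of a two-input buffer holding at most q messages."""
    a0 = (1 - r1) * (1 - r2)     # probability of no arrivals in a cycle
    a2 = r1 * r2                 # probability of two arrivals in a cycle
    if a0 == 0.0:
        return 1.0               # an arrival every cycle: the output is always busy
    # Birth-death balance: pi[s + 1] * a0 = pi[s] * a2 for 0 <= s < q.
    ratio = a2 / a0
    weights = [ratio ** s for s in range(q + 1)]
    b0 = weights[0] / sum(weights)
    return 1 - b0 * a0


def acceptance_probability(mg: float, n: int, m: int, q: int) -> float:
    """Ratio of the last stage output rate to m_g for N = 2**n, N' = N + M."""
    big_n = 1 << n
    n_prime = big_n + m
    # Stage 0: unequal input rates under the deterministic path choice.
    rate = buffer_output_rate(mg * big_n / n_prime, mg * m / n_prime, q)
    # Stages 1..n: each input contributes half of the previous stage's rate.
    for _ in range(n):
        rate = buffer_output_rate(rate / 2, rate / 2, q)
    return rate / mg


if __name__ == "__main__":
    for q in (0, 1, 2, 3):
        print(q, round(acceptance_probability(0.6, n=4, m=2, q=q), 4))
```

With `q = 0` the buffer model reduces to the familiar unbuffered 2 × 2 switch, whose output rate is $1 - (1 - r_1)(1 - r_2)$.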

The results are shown in Fig. 9, as functions of the parameters $N'$, $m_g$, and $q$ (the buffer size). Again consider Fig. 9(a) first, where we see the more interesting behavior of acceptance probability for different network sizes. $q = 0$ represents the unbuffered system and we see in the graph for $q = 0$ the same kind of functional behavior as in Fig. 8(a), but with much less of a deviation from the curve that would be obtained by joining the $N' = 2^n$ data points. The reason for this is the difference in behavior of acceptance probability and delay functions at stage 0, which is explained below.

The jumps in Fig. 8(a) were the result of the one cycle penalty for traversing an additional stage, whereas as pointed out in the last section, the waiting time itself at this stage is negligible for $N' \gtrsim 2^n$. Acceptance probability (or message loss probability) behaves much like the waiting time at the stage since both are a direct consequence of the probability of conflicts in each cycle. Thus, even though there is a penalty from conflicts in the additional stage, this is small for small $M$, and as $M$ increases to $2^n$, we get to the situation for $N' = 2^{n+1}$. Devoid of a constant added (or multiplicative) penalty for each additional stage, the deviations from $N' = 2^n$ are much less marked. This is even more so, to the point of not being visible on the graph, as the buffer size increases.

It must be pointed out that as in the case of message delays, choosing the particular routing algorithm that we have (when dual paths are available) contributes to the negligible deviation of the behavior of the acceptance probability from the power-of-two case. If a random choice of an alternate path were made, conflicts would be greater at the first stage, even for $N' \gtrsim 2^n$, leading to a more severe deviation.

Fig. 9. Acceptance probability of a message under the lossy system model, as a function of (a) network size $N'$, and (b) message generation rate $m_g$. $q$ is the size of the buffer at each switch output.

Fig. 9(b) shows the behavior of the acceptance probability as a function of message generation rate, in a system of size $N' = 800$. Again, this is identical in form to systems of size $N' = 2^n$; the unbuffered system provides unsatisfactory bandwidth (or throughput), but small buffer sizes (two or three) provide more than 90% of the maximum bandwidth. Given this, it is reasonable to use the lossless model to approximate message delays in a real system.

C. Summary of Performance

Summarizing the performance of the multistage shuffle-exchange networks for general N', we see that their behavior is fundamentally identical to that of the special case structures for N' = 2^n. The only noticeable deviation is the presence of unit cycle jumps in message delays in going from 2^n to 2^n + 2, as opposed to a smooth transition. The effect of these jumps is more prominent at low traffic, since at higher traffic, the increase in message delays for the next higher power-of-two system is sufficiently greater than one cycle. At low traffic, a system of size N' ≠ 2^n would have roughly the same delay as the next higher power-of-two system. However, in that case message delays are almost entirely the result of transiting the stages, with negligible waiting time at each stage.
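The unit jump in going from 2^n to 2^n + 2 follows directly from the stage count ⌈log2 N'⌉ given in the abstract. A minimal sketch, assuming one cycle per stage and negligible waiting time (the low-traffic regime described above); the function name is illustrative:

```python
def num_stages(n_ports: int) -> int:
    """Number of shuffle-exchange stages, ceil(log2 N'), for even N' >= 2."""
    if n_ports < 2 or n_ports % 2 != 0:
        raise ValueError("network size must be an even integer >= 2")
    # (N' - 1).bit_length() equals ceil(log2 N') without floating point.
    return (n_ports - 1).bit_length()


if __name__ == "__main__":
    # Transit delay in cycles at low traffic, assuming one cycle per stage.
    for size in (30, 32, 34, 64, 800):
        print(size, num_stages(size))
```

A network of size 34 already needs the same six stages as one of size 64, which is why its low-traffic delay is close to that of the next higher power-of-two system.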

One issue that we have not considered is that of performance in the presence of hot spots or nonuniform traffic in general. It is necessary to simulate the network to understand its behavior under this kind of traffic, because satisfactory analytical models are presently unavailable. The main difference between the non-power-of-two and the power-of-two structures in this respect is the availability of two disjoint paths between a subset of inputs and each output, which can be used to alleviate, to some extent, the effect of bottlenecks. The path structure depends upon the network size, and rerouting around congestion points, even when possible, will require some additional hardware at the individual switches.

V. CONCLUDING REMARKS

Shuffle-exchange networks constitute a versatile interconnection scheme not only in multiprocessors but also in telecommunications switching. We have shown how they can be constructed and used for any even-sized system, thereby eliminating a serious limitation of these networks. We have addressed three issues that we believe are most important for use of these networks in real systems. The first is controlling the network, and we have presented a distributed tag-based control algorithm similar to that for power-of-two sized networks. Control tags in general depend on both source and destination addresses, and they can be computed in a simple manner. Second, we established several structural properties of the network for composite system sizes. In particular, we have shown that the network can route traffic in a balanced manner, despite the fact that some inputs have dual paths to some outputs. Finally, we considered the stochastic performance of these networks (message delays, network bandwidth). The analysis shows that the behavior of such networks with traffic and buffer capacity is similar, whether or not the system size is a power of two. In a plot of performance measures versus network size, the composite sized networks effectively “fill in the gaps” between power-of-two sized networks, with minimal discontinuity.

REFERENCES

[1] P.-Y. Chen, P.-C. Yew, and D. H. Lawrie, “Performance of packet switching in buffered single-stage shuffle-exchange networks,” in Proc. 3rd Int. Conf. Distributed Comput., May 1982, pp. 622-627.

[2] D. Z. Du and F. K. Hwang, “Generalized de Bruijn digraphs,” Networks, vol. 18, pp. 27-38, 1988.

[3] I. Gazit and M. Malek, “On the number of permutations performable by extra-stage multistage interconnection networks,” in Proc. 1987 Int. Conf. Parallel Processing, Aug. 1987, pp. 461-470.

[4] L. R. Goke and G. J. Lipovski, “Banyan networks for partitioning multiprocessor systems,” in Proc. 1st Annu. Symp. Comput. Architecture, Dec. 1973, pp. 21-28.

[5] S. W. Golomb, “Permutations by cutting and shuffling,” SIAM Rev., vol. 3, no. 4, pp. 293-297, Oct. 1961.

[6] M. Imase and M. Itoh, “Design to minimize diameter on building-block network,” IEEE Trans. Comput., vol. C-30, pp. 439-442, June 1981.

[7] L. Kleinrock, Queueing Systems, Vol. 1. New York: Wiley, 1975, ch. 5.

[8] C. P. Kruskal and M. Snir, “The performance of multistage interconnection networks for multiprocessors,” IEEE Trans. Comput., vol. C-32, pp. 1091-1098, Dec. 1983.

[9] C. P. Kruskal and M. Snir, “A unified theory of interconnection network structure,” Theoret. Comput. Sci., vol. 48, pp. 75-94, 1986.

[10] T. Lang, “Interconnections between processors and memory modules using the shuffle-exchange network,” IEEE Trans. Comput., vol. C-25, pp. 496-503, May 1976.

[11] D. H. Lawrie and D. A. Padua, “Analysis of message switching with shuffle-exchange in multiprocessors,” in Proc. Workshop in Interconnection Networks for Parallel and Distributed Processing, Apr. 1980, pp. 116-123.

[12] D. H. Lawrie, “Memory-processor connection networks,” Ph.D. dissertation, Univ. of Illinois at Urbana-Champaign, UMI #73-15783, Feb. 1973.

[13] D. H. Lawrie, “Access and alignment of data in an array processor,” IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.

[14] D. C. Opferman and N. T. Tsao-Wu, “On a class of rearrangeable switching networks, Part I: Control algorithm,” Bell Syst. Tech. J., vol. 50, no. 5, pp. 1579-1600, May-June 1971.

[15] K. Padmanabhan and D. H. Lawrie, “A class of redundant path multistage interconnection networks,” IEEE Trans. Comput., vol. C-32, pp. 1099-1108, Dec. 1983.

[16] J. H. Patel, “Performance of processor-memory interconnections for multiprocessors,” IEEE Trans. Comput., vol. C-30, pp. 771-780, Oct. 1981.

[17] M. C. Pease, “An adaptation of the fast Fourier transform for parallel processing,” J. ACM, vol. 15, no. 2, pp. 252-264, Apr. 1968.

[18] H. S. Stone, “Parallel processing with the perfect shuffle,” IEEE Trans. Comput., vol. C-20, pp. 153-161, Feb. 1971.

[19] I. M. Vinogradov, Elements of Number Theory. New York: Dover, 1954, ch. III.

[20] P.-C. Yew, D. A. Padua, and D. H. Lawrie, “Stochastic properties of a multiple-layer single-stage shuffle-exchange network in a message switching environment,” J. Digital Syst., vol. VI, no. 4, pp. 387-410, 1982.

Krishnan Padmanabhan (S'80-M'84-SM'90) received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1984.

He is a Member of Technical Staff in the Computing Systems Research Laboratory at AT&T Bell Laboratories, Murray Hill, NJ, a position he has held since 1984. From 1985 to 1990 he was also on the adjunct faculty of the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests lie in the theory and architecture of high-performance computer and switching systems. He was a member of the Cedar Supercomputer Project at Illinois, and is the principal architect of the Multi-Array Processor, a wafer-based multiprocessor project at Bell Labs. He is also active in the photonic switching effort at Bell Labs.

