Chapter 20
Parallel FFTs with
Inter-Processor Permutations
This chapter treats the class of parallel FFTs which employ inter-processor data permutations. That is, part of a processor's initial complement of data may migrate to another processor to accomplish one or all of the following goals:
• To balance arithmetic workload among the processors.
• To reduce communication cost.
• To have the output data elements arranged in a desired ordering.
This chapter contains a description of a collection of algorithms similar to those in the previous chapter which evenly distribute all butterfly computations among the processors, and also reduce the message lengths from $N/P$ elements to $\frac{1}{2}\frac{N}{P}$ elements in each of the $\log_2 P + 1$ concurrent message exchanges. The key to achieving this involves data exchanges among processors to effect permutations as well as simply to convey data for purposes of computing butterflies.
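As a quick numerical check of these counts, the worked instance below plugs in the values $N = 32$ and $P = 4$ that the chapter uses in its examples. It is only an arithmetic illustration of the two expressions quoted above.

```latex
\[
\frac{N}{P} = \frac{32}{4} = 8, \qquad
\frac{1}{2}\cdot\frac{N}{P} = 4, \qquad
\log_2 P + 1 = \log_2 4 + 1 = 3 .
\]
```

With these values each processor holds 8 elements and sends 4 of them in each of the 3 concurrent message exchanges.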
20.1 Improved Parallel DIF NR and DIT NR Algorithms
It was shown in Chapter 19 that when no inter-processor permutation was allowed in the parallel DIF NR FFT, the computation of each butterfly was unevenly split between two processors. To avoid this difficulty, an alternative which allows each processor to replace one half of its own data with the incoming data is described below. The discussion that follows will focus on the DIF NR FFT; as will be apparent by the end of the section, the substance of the discussion is the same for the DIT NR FFT.
20.1.1 The idea and a modified shorthand notation
The idea can be explained using a familiar example: suppose that $N = 32$, and a consecutive data map denoted by $|i_4 i_3|i_2 i_1 i_0$ is used to distribute data among the four processors. Figures 20.1 and 20.2 show how the data are permuted within each pair of processors in advance of the first stage of butterfly computation, and how each processor can then compute exactly the same number of whole butterflies, which, of course, implies equal division of arithmetic work.
© 2000 by CRC Press LLC
A shorthand notation must now reflect both the permutation and the computation accomplished in Figures 20.1 and 20.2. A notation which serves these purposes is obtained by modifying the notation for the corresponding parallel FFT (without data permutation between processors) from Chapter 19.
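The modified notation begins with the initial map and the first stage of butterfly computation, represented by

$$
\begin{array}{ll}
|i_4 i_3|\,i_2 i_1 i_0 & \text{(Initial Map)} \\
|i_2 i_3|\,i_4 i_1 i_0 & \text{(first stage: bits } i_4 \text{ and } i_2 \text{ exchanged, butterflies on } i_4\text{)}
\end{array}
$$

The exchange symbol attached to the two swapped bits denotes one concurrent message exchange of $\frac{1}{2}\frac{N}{P}$ data elements within all pairs of processors, which occurs in the butterfly stages involving bits which form the processor ID number.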
Observe that after data are distributed to individual processors according to the initial mapping $|i_4 i_3|i_2 i_1 i_0$, the element $x_{i_4 i_3 i_2 i_1 i_0}$ in $a[i_4 i_3 i_2 i_1 i_0]$ can be found in $A[i_2 i_1 i_0]$ in processor $P_{i_4 i_3}$. For example, $a[19] = x_{19}$ is shown to be initially in $A[3]$ in $P_2$ in Figure 20.1, $a[14] = x_{14}$ is shown to be initially in $A[6]$ in $P_1$ in Figure 20.2, and so on.
When bit $i_4$ in the PID and bit $i_2$ in the Local M switch their positions in the shorthand notation, the mapping is changed to $|i_2 i_3|i_4 i_1 i_0$, which means that the data in $a[i_4 i_3 i_2 i_1 i_0]$ can now be found in $A[i_4 i_1 i_0]$ in $P_{i_2 i_3}$. For example, $a[19] = x_{19}$ is relocated to $A[7]$ in $P_0$ after the inter-processor permutation shown in Figure 20.1, $a[14] = x_{14}$ is relocated to $A[2]$ in $P_3$ after the inter-processor permutation shown in Figure 20.2, and so on.
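The two lookups described above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `locate` and the convention of writing a mapping as two lists of bit subscripts (read left to right) are assumptions made for illustration, not notation from the text.

```python
def locate(j, pid_bits, local_bits):
    """Return (processor ID, local address) of a[j] under a |PID|Local M mapping.

    pid_bits and local_bits list the subscripts of the index bits, left to right.
    """
    pid = 0
    for b in pid_bits:
        pid = (pid << 1) | ((j >> b) & 1)
    addr = 0
    for b in local_bits:
        addr = (addr << 1) | ((j >> b) & 1)
    return pid, addr

initial = ([4, 3], [2, 1, 0])    # |i4 i3|i2 i1 i0
permuted = ([2, 3], [4, 1, 0])   # |i2 i3|i4 i1 i0

for j in (19, 14):
    print(j, locate(j, *initial), locate(j, *permuted))
# 19 (2, 3) (0, 7)
# 14 (1, 6) (3, 2)
```

The printed pairs agree with the locations quoted for $a[19]$ and $a[14]$ before and after the permutation.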
To identify the one half of the data each processor must send out, the exchange symbol is used to label two different bits: the bit $i_k$, which has just been permuted from the PID into the Local M, and the bit $i_\ell$, which has just been permuted from the Local M to the PID. In the example above, $i_4$ and $i_2$ have switched their respective positions in the PID and the Local M.
Because $i_k$ was in the PID before the switch, $i_k = 1$ in one processor, and $i_k = 0$ in the other processor. On the other hand, because $i_\ell$ was in the Local M before the switch, $i_\ell = 0$ for half of the data, and $i_\ell = 1$ for the other half. Consequently, the value of $i_k$, the PID bit, is equal to $i_\ell$, the Local M bit, for half of the data elements in each processor; swapping two equal bits changes neither the processor nor the local address, so these elements stay in place and the other half is sent out. The notation which represents the switch of these two bits thus identifies both the PID of the other processor and the data to be sent out or received. To depict exactly what happens, the data exchange between two processors and the butterfly computation represented by

$$|i_2 i_3|\,i_4 i_1 i_0$$

is shown in its entirety in Figures 20.1 and 20.2.
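The argument about which half of the data leaves each processor can be checked by brute force for the $N = 32$, four-processor example. The sketch below is illustrative only; `locate` and the list-of-subscripts encoding of a mapping are hypothetical helpers, not part of the text.

```python
def locate(j, pid_bits, local_bits):
    """Return (pid, local address) of a[j]; bit subscripts are listed left to right."""
    pid = 0
    for b in pid_bits:
        pid = (pid << 1) | ((j >> b) & 1)
    addr = 0
    for b in local_bits:
        addr = (addr << 1) | ((j >> b) & 1)
    return pid, addr

N = 32
before = ([4, 3], [2, 1, 0])   # |i4 i3|i2 i1 i0
after = ([2, 3], [4, 1, 0])    # |i2 i3|i4 i1 i0

sent = {}    # (source pid, destination pid) -> global indices that migrate
stay = []    # global indices whose processor AND local address are unchanged
for j in range(N):
    old, new = locate(j, *before), locate(j, *after)
    if old[0] != new[0]:
        sent.setdefault((old[0], new[0]), []).append(j)
    elif old == new:
        stay.append(j)

for (src, dst), items in sorted(sent.items()):
    print(f"P{src} -> P{dst}: {items}")
# P0 -> P2: [4, 5, 6, 7]
# P1 -> P3: [12, 13, 14, 15]
# P2 -> P0: [16, 17, 18, 19]
# P3 -> P1: [24, 25, 26, 27]
```

Each processor sends exactly the four elements whose $i_4$ and $i_2$ differ, and the sixteen elements with $i_4 = i_2$ keep both their processor and their local address.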
Figure 20.1 DIF NR butterfly computation (1st stage) with data migration between $P_0$ and $P_2$.
From E. Chu and A. George [28], Linear Algebra and its Applications, 284:95–124, 1998. With permission.
Figure 20.2 DIF NR butterfly computation (1st stage) with data migration between $P_1$ and $P_3$.
20.1.2 The complete algorithm and output interpretation
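Using the shorthand notation developed in the previous section, the complete parallel algorithm corresponding to the DIF NR FFT is represented below for the $N = 32$ example. A bit written $\tau_k$ is a bit $i_k$ whose butterfly stage has already been computed.

$$
\begin{array}{lll}
\text{Initial Map} & |i_4 i_3|\,i_2 i_1 i_0 & \\
\text{Stage 1} & |i_2 i_3|\,i_4 i_1 i_0 & \text{exchange } i_4, i_2;\ \text{butterflies on } i_4 \\
\text{Stage 2} & |i_2 \tau_4|\,i_3 i_1 i_0 & \text{exchange } i_3, \tau_4;\ \text{butterflies on } i_3 \\
\text{Stage 3} & |\tau_3 \tau_4|\,i_2 i_1 i_0 & \text{exchange } i_2, \tau_3;\ \text{butterflies on } i_2 \\
\text{Stage 4} & |\tau_3 \tau_4|\,\tau_2 i_1 i_0 & \text{local butterflies on } i_1 \\
\text{Stage 5} & |\tau_3 \tau_4|\,\tau_2 \tau_1 i_0 & \text{local butterflies on } i_0
\end{array}
$$

To provide complete information for this example, the second stage of butterfly computations with inter-processor permutation is depicted in Figures 20.3 and 20.4; the third stage of butterfly computations with inter-processor permutation, together with the remaining two stages of local butterfly computations, is depicted in Figures 20.5 and 20.6.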
To determine the data mapping for the output elements, observe the following.
• The in-place butterfly computation in the DIF NR FFT algorithm ensures that
$$a[i_4 i_3 i_2 i_1 i_0] = x^{(5)}_{i_4 i_3 i_2 i_1 i_0} = X_{i_0 i_1 i_2 i_3 i_4}.$$
• The final mapping $|\tau_3 \tau_4|\tau_2 \tau_1 \tau_0$ indicates that the final content in $a[i_4 i_3 i_2 i_1 i_0]$ is now located in $A[i_2 i_1 i_0]$ in processor $P_{i_3 i_4}$ (instead of the initially assigned processor $P_{i_4 i_3}$).
Accordingly, the output data element $X_{i_0 i_1 i_2 i_3 i_4}$, which overwrites the data in $a[i_4 i_3 i_2 i_1 i_0]$, is finally contained in $A[i_2 i_1 i_0]$ in $P_{i_3 i_4}$.
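The whole $N = 32$, $P = 4$ procedure can be simulated in plain Python to confirm both the final location of each output element and the message length. The sketch below rests on several assumptions that are not from the text: processors are modelled as four Python lists, an exchange is simulated by re-placing the data under the new mapping, the forward transform uses the factor $e^{-2\pi i/N}$, and all function names are hypothetical.

```python
import cmath

N_BITS, N, P = 5, 32, 4

def locate(j, pid_bits, local_bits):
    """(pid, local address) of a[j]; bit subscripts are listed left to right."""
    pid = addr = 0
    for b in pid_bits:
        pid = (pid << 1) | ((j >> b) & 1)
    for b in local_bits:
        addr = (addr << 1) | ((j >> b) & 1)
    return pid, addr

def distribute(a, pid_bits, local_bits):
    """Place the global array a into per-processor arrays A[pid][addr]."""
    A = [[0j] * (N // P) for _ in range(P)]
    for j in range(N):
        p, m = locate(j, pid_bits, local_bits)
        A[p][m] = a[j]
    return A

def gather(A, pid_bits, local_bits):
    """Inverse of distribute."""
    a = [0j] * N
    for j in range(N):
        p, m = locate(j, pid_bits, local_bits)
        a[j] = A[p][m]
    return a

def parallel_dif_nr(x):
    pid_bits, local_bits = [4, 3], [2, 1, 0]           # |i4 i3|i2 i1 i0
    A = distribute(x, pid_bits, local_bits)
    sent_counts = []
    for b in range(N_BITS - 1, -1, -1):                # stages on i4, i3, ..., i0
        if b in pid_bits:
            # Swap PID bit b with the pivot (leftmost Local M bit), then move data.
            k = pid_bits.index(b)
            new_pid, new_local = list(pid_bits), list(local_bits)
            new_pid[k], new_local[0] = local_bits[0], b
            moved = [0] * P
            for j in range(N):
                src = locate(j, pid_bits, local_bits)[0]
                if src != locate(j, new_pid, new_local)[0]:
                    moved[src] += 1
            sent_counts.append(moved)
            a = gather(A, pid_bits, local_bits)
            pid_bits, local_bits = new_pid, new_local
            A = distribute(a, pid_bits, local_bits)
        half = 1 << b
        for j in range(N):
            if j & half:
                continue
            p, m0 = locate(j, pid_bits, local_bits)
            q, m1 = locate(j + half, pid_bits, local_bits)
            assert p == q                              # whole butterfly is local
            u, v = A[p][m0], A[p][m1]
            w = cmath.exp(-2j * cmath.pi * (j % half) * (N // (2 * half)) / N)
            A[p][m0], A[p][m1] = u + v, (u - v) * w
    return A, pid_bits, local_bits, sent_counts

def bit_reverse(r):
    return int(format(r, f"0{N_BITS}b")[::-1], 2)

x = [complex((3 * t) % 7, t % 5) for t in range(N)]
A, pid_bits, local_bits, sent_counts = parallel_dif_nr(x)
X = [sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / N) for t in range(N)) for f in range(N)]
max_err = max(abs(A[p][m] - X[bit_reverse(r)])
              for r in range(N) for p, m in [locate(r, pid_bits, local_bits)])
print(pid_bits, local_bits, sent_counts, max_err < 1e-9)
```

Under these assumptions the run ends with the mapping `[3, 4]`, `[2, 1, 0]`, every processor sends 4 elements in each of the 3 exchanges, and the content of $a[r]$ matches a directly computed DFT value with bit-reversed subscript.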
Figure 20.3 DIF NR butterfly computation (2nd stage) with data migration between $P_0$ and $P_1$.
Figure 20.4 DIF NR butterfly computation (2nd stage) with data migration between $P_2$ and $P_3$.
Figure 20.5 DIF NR butterfly computation (3rd stage) with data migration between $P_0$ and $P_2$.
Figure 20.6 DIF NR butterfly computation (3rd stage) with data migration between $P_1$ and $P_3$.
20.1.3 The use of other initial mappings
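The parallel algorithms using other initial mappings may be completely specified using the same notations. Given below are the three parallel DIF FFT algorithms corresponding to the three other cyclic block mappings.

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_3 i_2|\,i_4 i_1 i_0 & \\
\text{Stage 1} & |i_3 i_2|\,i_4 i_1 i_0 & \text{local butterflies on } i_4 \\
\text{Stage 2} & |\tau_4 i_2|\,i_3 i_1 i_0 & \text{exchange } i_3, \tau_4;\ \text{butterflies on } i_3 \\
\text{Stage 3} & |\tau_4 \tau_3|\,i_2 i_1 i_0 & \text{exchange } i_2, \tau_3;\ \text{butterflies on } i_2 \\
\text{Stage 4} & |\tau_4 \tau_3|\,\tau_2 i_1 i_0 & \text{local butterflies on } i_1 \\
\text{Stage 5} & |\tau_4 \tau_3|\,\tau_2 \tau_1 i_0 & \text{local butterflies on } i_0 \\
\text{(Optional)} & |\tau_2 \tau_3|\,\tau_4 \tau_1 \tau_0 & \text{exchange } \tau_4, \tau_2
\end{array}
$$

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_2 i_1|\,i_4 i_3 i_0 & \\
\text{Stage 1} & |i_2 i_1|\,i_4 i_3 i_0 & \text{local butterflies on } i_4 \\
\text{Stage 2} & |i_2 i_1|\,\tau_4 i_3 i_0 & \text{local butterflies on } i_3 \\
\text{Stage 3} & |\tau_4 i_1|\,i_2 \tau_3 i_0 & \text{exchange } i_2, \tau_4;\ \text{butterflies on } i_2 \\
\text{Stage 4} & |\tau_4 \tau_2|\,i_1 \tau_3 i_0 & \text{exchange } i_1, \tau_2;\ \text{butterflies on } i_1 \\
\text{Stage 5} & |\tau_4 \tau_2|\,\tau_1 \tau_3 i_0 & \text{local butterflies on } i_0 \\
\text{(Optional)} & |\tau_1 \tau_2|\,\tau_4 \tau_3 \tau_0 & \text{exchange } \tau_4, \tau_1
\end{array}
$$

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_1 i_0|\,i_4 i_3 i_2 & \\
\text{Stage 1} & |i_1 i_0|\,i_4 i_3 i_2 & \text{local butterflies on } i_4 \\
\text{Stage 2} & |i_1 i_0|\,\tau_4 i_3 i_2 & \text{local butterflies on } i_3 \\
\text{Stage 3} & |i_1 i_0|\,\tau_4 \tau_3 i_2 & \text{local butterflies on } i_2 \\
\text{Stage 4} & |\tau_4 i_0|\,i_1 \tau_3 \tau_2 & \text{exchange } i_1, \tau_4;\ \text{butterflies on } i_1 \\
\text{Stage 5} & |\tau_4 \tau_1|\,i_0 \tau_3 \tau_2 & \text{exchange } i_0, \tau_1;\ \text{butterflies on } i_0 \\
\text{(Optional)} & |\tau_0 \tau_1|\,\tau_4 \tau_3 \tau_2 & \text{exchange } \tau_4, \tau_0
\end{array}
$$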
Observe that the last permutation is optional because the actual mapping of the output elements can be determined given any mapping, although one mapping may be more convenient than the other if the output data elements are used to continue with a subsequent phase of computation.
It is worth noting that in the examples above, if the optional permutation step is performed, then the array elements which were mapped to one processor will stay together (at the same local address) in a different processor.
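The rule behind these sequences (whenever the butterfly bit sits in the PID, swap it with the leftmost Local M bit, and optionally swap the pivot back at the end) is mechanical enough to script. The sketch below is an illustration under stated assumptions: `trace` is a hypothetical helper, mappings are encoded as lists of bit subscripts, and `t3` is printed where the text writes a bit whose butterfly stage is complete.

```python
def trace(pid_bits, local_bits, n_bits=5):
    """List the |PID|Local M mapping in effect at each DIF NR stage."""
    pid, loc = list(pid_bits), list(local_bits)
    pivot_bit = loc[0]            # bit initially in the pivot position
    done = set()                  # subscripts whose butterfly stage is finished

    def show():
        name = lambda b: ("t" if b in done else "i") + str(b)
        return "|" + " ".join(map(name, pid)) + "|" + " ".join(map(name, loc))

    lines = [show()]                              # initial mapping
    for b in range(n_bits - 1, -1, -1):           # stages on i4, i3, ..., i0
        if b in pid:                              # PID bit: swap with the pivot
            k = pid.index(b)
            pid[k], loc[0] = loc[0], pid[k]
        lines.append(show())                      # mapping used by this stage
        done.add(b)
    if pivot_bit in pid:                          # optional final permutation
        k = pid.index(pivot_bit)
        pid[k], loc[0] = loc[0], pid[k]
        lines.append(show() + "  (optional)")
    return lines

for initial in (([3, 2], [4, 1, 0]), ([2, 1], [4, 3, 0]), ([1, 0], [4, 3, 2])):
    print(*trace(*initial), sep="\n", end="\n\n")
```

For the three mappings above the last printed line has the same Local M subscripts as the first, which is the "stay together at the same local address" property noted in the text.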
Further discussion regarding the optional permutation step and the final mapping is deferred to Section 20.3.
As noted at the beginning of this section, the sequential DIT NR FFT differs from the DIF NR FFT only in the application of twiddle factors. Therefore, the specifications of the various DIT NR versions of the parallel algorithm remain the same as those of the DIF NR versions given above.
20.2 Improved Parallel DIF RN and DIT RN Algorithms
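With the explicit understanding that $a[i_0 i_1 i_2 i_3 i_4] = x_{i_4 i_3 i_2 i_1 i_0}$ when input data elements are in bit-reversed order, the parallel FFT (with inter-processor permutation) corresponding to the DIF RN FFT or DIT RN FFT can be specified for the four possible CBM mappings as given below. Note that the final mapping is always specified for $a[i_0 i_1 i_2 i_3 i_4]$, whose initial content of $x_{i_4 i_3 i_2 i_1 i_0}$ is overwritten by the output element $X_{i_0 i_1 i_2 i_3 i_4}$ after the five stages of in-place butterfly computation.

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_0 i_1|\,i_2 i_3 i_4 & \\
\text{Stage 1} & |i_0 i_1|\,i_2 i_3 i_4 & \text{local butterflies on } i_4 \\
\text{Stage 2} & |i_0 i_1|\,i_2 i_3 \tau_4 & \text{local butterflies on } i_3 \\
\text{Stage 3} & |i_0 i_1|\,i_2 \tau_3 \tau_4 & \text{local butterflies on } i_2 \\
\text{Stage 4} & |i_0 \tau_2|\,i_1 \tau_3 \tau_4 & \text{exchange } i_1, \tau_2;\ \text{butterflies on } i_1 \\
\text{Stage 5} & |\tau_1 \tau_2|\,i_0 \tau_3 \tau_4 & \text{exchange } i_0, \tau_1;\ \text{butterflies on } i_0 \\
\text{(Optional)} & |\tau_1 \tau_0|\,\tau_2 \tau_3 \tau_4 & \text{exchange } \tau_2, \tau_0
\end{array}
$$

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_1 i_2|\,i_0 i_3 i_4 & \\
\text{Stage 1} & |i_1 i_2|\,i_0 i_3 i_4 & \text{local butterflies on } i_4 \\
\text{Stage 2} & |i_1 i_2|\,i_0 i_3 \tau_4 & \text{local butterflies on } i_3 \\
\text{Stage 3} & |i_1 \tau_3|\,i_0 i_2 \tau_4 & \text{exchange } i_2, \tau_3;\ \text{butterflies on } i_2 \\
\text{Stage 4} & |\tau_2 \tau_3|\,i_0 i_1 \tau_4 & \text{exchange } i_1, \tau_2;\ \text{butterflies on } i_1 \\
\text{Stage 5} & |\tau_2 \tau_3|\,i_0 \tau_1 \tau_4 & \text{local butterflies on } i_0 \\
\text{(Optional)} & |\tau_2 \tau_1|\,\tau_0 \tau_3 \tau_4 & \text{exchange } \tau_3, \tau_1
\end{array}
$$

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_2 i_3|\,i_0 i_1 i_4 & \\
\text{Stage 1} & |i_2 i_3|\,i_0 i_1 i_4 & \text{local butterflies on } i_4 \\
\text{Stage 2} & |i_2 \tau_4|\,i_0 i_1 i_3 & \text{exchange } i_3, \tau_4;\ \text{butterflies on } i_3 \\
\text{Stage 3} & |\tau_3 \tau_4|\,i_0 i_1 i_2 & \text{exchange } i_2, \tau_3;\ \text{butterflies on } i_2 \\
\text{Stage 4} & |\tau_3 \tau_4|\,i_0 i_1 \tau_2 & \text{local butterflies on } i_1 \\
\text{Stage 5} & |\tau_3 \tau_4|\,i_0 \tau_1 \tau_2 & \text{local butterflies on } i_0 \\
\text{(Optional)} & |\tau_3 \tau_2|\,\tau_0 \tau_1 \tau_4 & \text{exchange } \tau_4, \tau_2
\end{array}
$$

$$
\begin{array}{lll}
\text{Initial |PID|Local M} & |i_3 i_4|\,i_0 i_1 i_2 & \\
\text{Stage 1} & |i_3 i_0|\,i_4 i_1 i_2 & \text{exchange } i_4, i_0;\ \text{butterflies on } i_4 \\
\text{Stage 2} & |\tau_4 i_0|\,i_3 i_1 i_2 & \text{exchange } i_3, \tau_4;\ \text{butterflies on } i_3 \\
\text{Stage 3} & |\tau_4 i_0|\,\tau_3 i_1 i_2 & \text{local butterflies on } i_2 \\
\text{Stage 4} & |\tau_4 i_0|\,\tau_3 i_1 \tau_2 & \text{local butterflies on } i_1 \\
\text{Stage 5} & |\tau_4 \tau_3|\,i_0 \tau_1 \tau_2 & \text{exchange } i_0, \tau_3;\ \text{butterflies on } i_0
\end{array}
$$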
As noted earlier, in addition to evenly distributing all butterfly computations among the processors, the message length is reduced from $N/P$ elements to $\frac{1}{2}\frac{N}{P}$ elements in each of the $\log_2 P + 1$ concurrent message exchanges.
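To make the bit-reversed bookkeeping of this section concrete, the sketch below tabulates which input subscripts and which output subscripts each processor holds for the first mapping, $|i_0 i_1|i_2 i_3 i_4$, assuming the optional exchange is performed so that the final PID is formed by the bits with subscripts 1 and 0, in that order. The helper names and the encoding of a mapping as lists of subscripts are assumptions for illustration.

```python
N_BITS = 5

def bit_reverse(r):
    return int(format(r, f"0{N_BITS}b")[::-1], 2)

def locate(r, pid_labels, local_labels):
    """Locate a[r], where r = (i0 i1 i2 i3 i4) in binary.

    Because i0 is the leftmost digit of r, label i_k is bit (4 - k) of r.
    """
    bit = lambda k: (r >> (N_BITS - 1 - k)) & 1
    pid = addr = 0
    for k in pid_labels:
        pid = (pid << 1) | bit(k)
    for k in local_labels:
        addr = (addr << 1) | bit(k)
    return pid, addr

initial = ([0, 1], [2, 3, 4])   # |i0 i1|i2 i3 i4
final = ([1, 0], [2, 3, 4])     # PID bits 1, 0 after the optional exchange

inputs = {p: [None] * 8 for p in range(4)}    # subscripts of x held at the start
outputs = {p: [None] * 8 for p in range(4)}   # subscripts of X held at the end
for r in range(1 << N_BITS):
    p, m = locate(r, *initial)
    inputs[p][m] = bit_reverse(r)    # a[r] starts as x with the reversed subscript
    p, m = locate(r, *final)
    outputs[p][m] = r                # a[r] ends as X_r

for p in range(4):
    print(f"P{p}: x{inputs[p]} -> X{outputs[p]}")
```

In this instance each processor ends with a block of eight consecutively indexed output elements, although not necessarily the block whose array positions it started with.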
20.3 Further Technical Details and a Generalization
Note that in most examples given in this chapter, the PID bit in question is always exchanged with the leftmost bit in the Local M, which is often referred to as the pivot [95, 104]. In these cases, as shown in Figures 20.1, 20.2, 20.3, 20.4, 20.5, and 20.6, the data migrated between processors are consecutively stored in either the top half or the bottom half of each processor's local array $A$.
However, in one example in Section 20.2, the PID bit in question is always exchanged with the second bit from the right in the Local M; in another example in Section 20.2, the PID bit is always exchanged with the rightmost bit in the Local M. Thus, the pivot could be arbitrarily chosen, if one so desires, from the bits of the Local M.
Since the ID number is formed by consecutive bits when a cyclic block mapping is used, whenever a PID bit is permuted into the local pivot position, it will be exchanged with the next PID bit and occupy the latter's position back in the PID field. After $d$ exchanges, one has the following scenario: the rightmost PID bit is in the Local M, and the pivot $i_\ell$ or $\tau_\ell$ from the Local M is still in the leftmost position in the PID. The example in Section 20.1.2 demonstrates the case involving the $i_\ell$ bit from the Local M, and the three examples in Section 20.1.3 demonstrate the case involving the $\tau_\ell$ bit from the Local M.
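Therefore, one more permutation involving these two bits will get $i_\ell$ or $\tau_\ell$ back into its original position in the Local M, and the rightmost PID bit would be cyclic-shifted into the leftmost position in the PID as shown below.

$$|\tau_{k-d+1}\,\tau_k \cdots \tau_{k-d+2}|\ \tau_{n-1} \cdots \tau_{k+2}\,\tau_{k+1}\; i_{k-d} \cdots i_{\ell+1}\, i_\ell\, i_{\ell-1} \cdots i_1 i_0$$

or

$$|\tau_{k-d+1}\,\tau_k \cdots \tau_{k-d+2}|\ \tau_{n-1} \cdots \tau_{\ell+1}\,\tau_\ell\,\tau_{\ell-1} \cdots \tau_{k+2}\,\tau_{k+1}\; i_{k-d} \cdots i_1 i_0$$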
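The cyclic shift of the PID can be checked for general $n$, $d$, and $k$ with a short script. This is a sketch under assumptions not fixed by the text: the PID is taken to be bits $k$ down to $k-d+1$, the Local M holds the remaining bits in descending order, stages run from bit $n-1$ down to bit 0, and the extra exchange is performed when the stage order reaches the pivot bit (or at the very end if the pivot's stage was already complete). Only the resulting mapping is compared; `run_exchanges` is a hypothetical name.

```python
def run_exchanges(n, d, k, pivot_bit):
    """Return (final PID, final Local M, initial Local M, number of exchanges)."""
    pid = list(range(k, k - d, -1))                          # i_k ... i_{k-d+1}
    loc = [b for b in range(n - 1, -1, -1) if b not in pid]  # remaining bits
    initial_loc = list(loc)
    q = loc.index(pivot_bit)                                 # pivot position
    exchanges = 0
    for b in range(n - 1, -1, -1):
        if b in pid:                        # butterfly bit is in the PID: swap
            s = pid.index(b)
            pid[s], loc[q] = loc[q], pid[s]
            exchanges += 1
    if pivot_bit in pid:                    # pivot still in PID: one more swap
        s = pid.index(pivot_bit)
        pid[s], loc[q] = loc[q], pid[s]
        exchanges += 1
    return pid, loc, initial_loc, exchanges

def shifted_pid(d, k):
    """PID after the cyclic shift: bit k-d+1 first, then k, k-1, ..., k-d+2."""
    return [k - d + 1] + list(range(k, k - d + 1, -1))

print(run_exchanges(5, 2, 4, 2))   # ([3, 4], [2, 1, 0], [2, 1, 0], 3)
print(run_exchanges(5, 2, 3, 4))   # ([2, 3], [4, 1, 0], [4, 1, 0], 3)
print(run_exchanges(8, 3, 6, 1))   # ([4, 6, 5], [7, 3, 2, 1, 0], [7, 3, 2, 1, 0], 4)
```

In each case the PID ends as the cyclic shift, the Local M returns to its initial arrangement, and $d + 1$ exchanges are used, which equals $\log_2 P + 1$ for $P = 2^d$ processors.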
Observe that as long as the PID is not formed by the leftmost $d$ bits, there would be at least one $\tau_\ell$ bit available when the butterfly computation reaches any PID bit. One thus has the option of using a $\tau_\ell$ bit (instead of an $i_\ell$ bit) as the pivot. In this case, the $\tau_\ell$ bit may stay as the leftmost bit in the PID if the local data is not required to remain together in one processor, and one concurrent message exchange can be saved (by not performing the so-called optional permutation for examples in Section 20.1.3), with the final mapping determined by

$$|\tau_\ell\,\tau_k \cdots \tau_{k-d+2}|\ \tau_{n-1} \cdots \tau_{\ell+1}\,\tau_{k-d+1}\,\tau_{\ell-1} \cdots \tau_{k+2}\,\tau_{k+1}\; i_{k-d} \cdots i_1 i_0 .$$

The PID in such a final mapping is no longer formed by consecutive bits. For $N = 2^5$, such a final |PID|Local M is a permutation of the $n = 5$ bits, which still uniquely determines the data mapping of $a[i_4 i_3 i_2 i_1 i_0] = X_{i_0 i_1 i_2 i_3 i_4}$.
In certain contexts, it is important to have the output elements $X$ mapped to the processors in a specific way to facilitate subsequent computations. For example, the final distribution of $X$ (with respect to its bit-reversed subscript) is required to be identical to the initial one for $x$ (with respect to its naturally-ordered subscript) in the solution. This has motivated the development of a number of algorithms which are reviewed in the next chapter.
Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. Part III: Parallel FFT Algorithms. Chapter 20: Parallel FFTs with Inter-Processor Permutations.