SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Scalability of a pseudospectral DNS turbulence code with 2D domain decomposition on Power4+/Federation and Blue Gene systems
D. Pekurovsky (1), P.K. Yeung (2), D. Donzis (2), S. Kumar (3), W. Pfeiffer (1), G. Chukkapalli (1)
(1) San Diego Supercomputer Center  (2) Georgia Institute of Technology  (3) IBM
SP SciComp, Boulder CO, July 20, 2006
Turbulence: examples
The small scales are important.
DNS code
• Code written in Fortran 90 with MPI
• Time evolution: Runge-Kutta, 2nd order
• Spatial derivative calculation: pseudospectral method
• Typically, FFTs are done in all 3 dimensions
• Consider the 3D FFT as a compute-intensive kernel representative of the performance characteristics of the full code
• Input is real, output is complex; or vice versa
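The time stepping named above can be illustrated with a minimal second-order Runge-Kutta (midpoint) step. This is a hedged Python sketch, not the production Fortran code; the right-hand side `f` merely stands in for the pseudospectral evaluation:

```python
import math

import numpy as np

def rk2_step(u, t, dt, f):
    """One 2nd-order Runge-Kutta (midpoint) step for du/dt = f(u, t)."""
    k1 = f(u, t)
    k2 = f(u + 0.5 * dt * k1, t + 0.5 * dt)
    return u + dt * k2

# Usage: linear decay du/dt = -u, whose exact solution is exp(-t).
u = np.array([1.0])
dt = 0.01
for n in range(100):
    u = rk2_step(u, n * dt, dt, lambda v, t: -v)
assert abs(u[0] - math.exp(-1.0)) < 1e-4  # 2nd-order accurate in dt
```

In the real code the state is the Fourier-space velocity field and `f` involves the 3D FFTs discussed next.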
3D FFT
Use ESSL library calls for 1D FFTs on IBM systems, or FFTW elsewhere (FFTW is about 3 times slower than ESSL on IBM).
Forward 3D FFT in serial. Start with a real array (Nx,Ny,Nz):
• 1D FFT in x for all y and z
  • Input is real (Nx,Ny,Nz)
  • Call the SRCFT routine (real-to-complex), size Nx, stride 1
  • Output is complex (Nx/2+1,Ny,Nz) – conjugate symmetry: F(k) = F*(N-k)
  • Pack data as (Nx/2,Ny,Nz), since F(1) and F(Nx/2+1) are real numbers
• 1D FFT in y for all x and z
  • Input is complex (Nx/2,Ny,Nz)
  • Call SCFT (complex-to-complex), size Ny, stride Nx/2
  • Output is complex (Nx/2,Ny,Nz)
• 1D FFT in z for all x and y
  • Input and output are complex (Nx/2,Ny,Nz)
  • Call SCFT (complex-to-complex), size Nz, stride (Nx Ny)/2
Inverse 3D FFT: do the same in reverse order. Call SCFT, SCFT, and SCRFT (complex-to-real).
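The staged forward transform can be mimicked with NumPy. This is an illustrative sketch (the slides use ESSL's SRCFT/SCFT routines; the Nx/2 packing trick is noted in a comment but not implemented here):

```python
import numpy as np

Nx, Ny, Nz = 8, 4, 4
rng = np.random.default_rng(0)
a = rng.random((Nx, Ny, Nz))  # real input array

# Stage 1: real-to-complex 1D FFT in x for all y, z (SRCFT in the slides).
# Conjugate symmetry F(k) = F*(N-k) leaves Nx/2+1 independent modes; since
# modes 0 and Nx/2 are real, they could be packed into Nx/2 complex slots.
b = np.fft.rfft(a, axis=0)           # complex, shape (Nx/2+1, Ny, Nz)

# Stages 2 and 3: complex-to-complex 1D FFTs in y and z (SCFT in the slides).
c = np.fft.fft(b, axis=1)
d = np.fft.fft(c, axis=2)

# The staged result matches the first Nx/2+1 x-planes of a full 3D FFT.
assert np.allclose(d, np.fft.fftn(a)[: Nx // 2 + 1])
```

The three stages commute with each other, which is what makes the transpose-based parallelization in the following slides possible.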
3D FFT cont’d
Note: alternatively, one could transpose the array in memory before calling the FFTs, so that strides are always 1. In practice, with ESSL this doesn't give an advantage (ESSL is efficient even with strides > 1):
• Stride 1: 28% of peak flops on Datastar
• Stride 32: 25% of peak
• Stride 2048: 10% of peak
Parallel version
• Parallel 3D FFT: the so-called transpose strategy, as opposed to the direct strategy. That is, make sure all data in the direction of the 1D transform resides in one processor's memory, and parallelize over the orthogonal dimension(s).
• Data decomposition: N^3 grid points over P processors
  • Originally 1D (slab) decomposition: divide one side of the cube over P, assigning N/P planes to each processor. Limitation: P <= N
  • Currently 2D (pencil) decomposition: divide a side of the cube (N^2 points) over P, assigning N^2/P pencils (columns) to each processor.
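The local array shapes for the two decompositions can be made concrete with a small sketch (the helper names and the p1 x p2 process-grid notation are illustrative, not from the production code):

```python
def slab_shape(N, P):
    """1D (slab): each task holds N/P full planes; requires P <= N."""
    if P > N:
        raise ValueError("slab decomposition needs P <= N")
    return (N // P, N, N)

def pencil_shape(N, p1, p2):
    """2D (pencil): each task holds N^2/(p1*p2) pencils of length N."""
    return (N, N // p1, N // p2)

print(slab_shape(2048, 2048))      # (1, 2048, 2048): at the slab limit
print(pencil_shape(2048, 32, 64))  # (2048, 64, 32): P = 2048 as a 32 x 64 grid
```

The slab limit P <= N is exactly why the pencil decomposition is needed to reach the processor counts discussed below.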
Memory and compute power
• 2048^3 on 2048 processors: 230 MB/proc. This problem fits on Datastar and Blue Gene; extensive simulations are under way.
• 4096^3 on 2048 processors: 1840 MB/proc. This problem doesn't fit on BG (256 MB/proc), and fits very tightly on Datastar.
• In any case, the computational power of 2048 processors is not enough to solve such problems in reasonable time. Scaling to higher processor counts, certainly more than 4096, is necessary.
Therefore, 2D decomposition is a necessity (P > N).
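The quoted per-task figures are consistent with roughly seven 8-byte words of storage per grid point. The array count of seven is an assumption inferred from the numbers, not stated in the slides:

```python
def mb_per_task(N, P, words_per_point=7, bytes_per_word=8):
    """Approximate per-task memory in MB for an N^3 grid on P tasks,
    assuming ~7 real*8 arrays per grid point (an illustrative guess)."""
    return N**3 * words_per_point * bytes_per_word / P / 1e6

print(round(mb_per_task(2048, 2048)))  # ~235 MB, vs. 230 MB quoted
print(round(mb_per_task(4096, 2048)))  # ~1879 MB, vs. 1840 MB quoted
```

Since memory per task scales as N^3/P, doubling N at fixed P multiplies the per-task footprint by 8, which is why 4096^3 cannot fit on BG at 2048 tasks.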
1D Decomposition
2D decomposition
2D Decomposition cont’d
Communication
Global communication has traditionally been a serious challenge for scaling applications to large node counts.
• 1D decomposition: 1 all-to-all exchange involving all P processors
• 2D decomposition: 2 all-to-all exchanges, within p1 groups of p2 processors each (p1 x p2 = P)
• Which is better? Most of the time 1D wins. But again: it can't be scaled beyond P = N.
The crucial parameter is bisection bandwidth.
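The trade-off can be sketched in terms of per-task message counts and sizes. This is order-of-magnitude only; the constants depend on packing details, and the function names are illustrative:

```python
def alltoall_1d(N, P):
    """1D (slab): one all-to-all among all P tasks.
    Each task sends P-1 messages of ~N^3/P^2 points each."""
    return P - 1, N**3 // P**2

def alltoall_2d(N, p1, p2):
    """2D (pencil): two all-to-alls, within rows (p2 tasks) then columns
    (p1 tasks). Returns (messages, points per message) for each exchange."""
    P = p1 * p2
    return (p2 - 1, N**3 // (P * p2)), (p1 - 1, N**3 // (P * p1))

# 1D: fewer, larger messages, but impossible once P > N.
print(alltoall_1d(2048, 2048))     # (2047, 2048)
# 2D: more, smaller messages, but scales to P = p1*p2 >> N.
print(alltoall_2d(2048, 64, 128))  # ((127, 8192), (63, 16384))
```

The smaller 2D messages are why bisection bandwidth, rather than latency, dominates only as long as messages stay large enough.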
Alternative approaches attempted
• Overlap communication and computation: no advantage.
• Hybrid MPI/OpenMP: no advantage.
• Transpose in memory and call ESSL routines with stride 1: no advantage, or worse.
Platforms involved
• Datastar: IBM Power4+, 1.5 GHz
  • at SDSC, up to 2048 CPUs
  • 8 processors/node
  • Fat-tree interconnect
• Blue Gene: IBM PowerPC, 700 MHz
  • at SDSC, up to 2048 CPUs
  • at IBM's T.J. Watson Lab in New York state, up to 32768 CPUs (2nd in the Top500 list)
  • 2 processors/node
  • 3D torus interconnect
Performance on IBM Blue Gene and Datastar
[Figure: scaled performance N^3 log2(N)/(t Nproc) vs. Nproc, from 64 to 32768 processors. Series: CO and VN runs at 512^3, 1024^3, 2048^3, and 4096^3 on Blue Gene, plus Datastar at 1024^3 and 2048^3. VN: two processors per node; CO: one processor per node.]
A closer look at performance on BG: DNS 2048^3
[Figure: T x P vs. Nproc, from 2048 to 32768 processors, for the 2048^3 DNS on Blue Gene. Series: VN total, CO total, VN communication, CO communication.]
Communication model for BG
• 3D torus network: a 3D mesh with wrap-around links. Each compute node has 6 links.
• Modeling communication is a challenge for this network and problem (mapping the 2D processor geometry onto the 3D network topology). We try to make a reasonable estimate.
• Assume message sizes are large enough to consider only bandwidth and ignore latency overhead.
• Model CO mode (VN is similar).
Communication model for BG, cont'd
Two subcommunicators: S1 and S2
P = P1 P2 = Px Py Pz
Step 1: all-to-all exchange among P1 processors within each group S1
Step 2: all-to-all exchange among P2 processors within each group S2
By default, tasks are assigned in a block fashion (although custom mapping schemes are also available to the user and are an interesting option).
• The S1's are rectangles in the X-Y plane: P1 = Px x (Py/k)
• The S2's are k Z-columns: P2 = k Pz
B = 175 MB/s, the 1-link bidirectional bandwidth
Upper bound
• Assume the dimensions are independent
• Find the bottleneck in each dimension and sum up the maximum time per dimension
• Some links are idle some of the time

Take the first step: the communicator group is a plane of Px x (Py/k) tasks.
Assume a torus (wrap-around links) in the x dimension, but not in y, for k > 1.
The bisection bandwidth across the y links is B Px. Proceed in Px stages.
The number of messages exchanged is Px (Py/2k)^2 for each stage.
The total time for the y-dimension bottleneck is ty = (Nb/B) Px (Py/2k)^2.
Now independently consider the x direction, and derive tx = (Nb/B) (Py/k) (Px/2)^2 (1/2).
Summing up, and using Nb = 4N^3/(P P1), we get
T1 = (N^3/(P B)) ((1/2) Px + (1/k) Py)
Upper bound, cont'd
Now step 2: k Z-lines in each communicator, all lying in Y-Z planes.
Again, assume a staged implementation. First communicate along y.
The dimension size is k; the bisection bandwidth is B Px, but it is shared among the P1 groups. Do this Pz times.
The time is (Nb/B) (k/2)^2 (P1 Pz/Px) (1/2) = (N^3/(P B)) Py/2.
Finally, exchange within the Z-lines (k times):
Tz = k (Nb/B) (Pz/2)^2 / 2 = (N^3/(P B)) Pz/2 (the final factor of 2 comes from the torus links)
So T2 = (N^3/(P B)) (Py/2 + Pz/2)
Summing up,
Tup = T1 + T2 = (N^3/(P B)) (Px/2 + (1/2 + 1/k) Py + Pz/2)
For the lower bound, assume all links are busy and only the maximum time counts. We obtain
Tlower = (N^3/(P B)) [max(Px/2, Py/k) + (1/2) max(Py, Pz)]
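The two bounds can be written as a small sketch. Times are in units of grid points per unit bandwidth (multiply by bytes per point for seconds), and the function names are illustrative:

```python
def t_upper(N, Px, Py, Pz, k, B):
    """Upper bound from the slides' model:
    Tup = (N^3/(P*B)) * (Px/2 + (1/2 + 1/k)*Py + Pz/2)."""
    P = Px * Py * Pz
    return N**3 / (P * B) * (Px / 2 + (0.5 + 1.0 / k) * Py + Pz / 2)

def t_lower(N, Px, Py, Pz, k, B):
    """Lower bound (all links busy, only the slowest dimension counts):
    Tlower = (N^3/(P*B)) * (max(Px/2, Py/k) + max(Py, Pz)/2)."""
    P = Px * Py * Pz
    return N**3 / (P * B) * (max(Px / 2, Py / k) + max(Py, Pz) / 2)

# Example: 2048^3 on a hypothetical 8 x 16 x 16 torus partition, k = 4.
args = (2048, 8, 16, 16, 4, 175e6)
assert t_lower(*args) <= t_upper(*args)
```

The upper bound always dominates the lower bound, since each max(...) term is bounded by the corresponding sum in Tup; the measured times in the next figure fall between the two.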
Communication model for BG
[Figure: Tcomm x P vs. P, from 1024 to 16384 processors, for the 2048^3 problem in CO mode: measured 2D communication times against the predicted upper and lower bounds.]
Summary
• 2D decomposition enables significantly increased scalability of the DNS turbulence code
• Good scaling is achieved on both IBM SP4 and BG/L (up to 32k processors)
• Ready for the next generation of machines
• DNS turbulence is one of the 3 Model Problems in the recent NSF Petascale RFP