SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Scalability of a pseudospectral DNS turbulence code with 2D domain decomposition on Power4+/Federation and Blue Gene systems
D. Pekurovsky (1), P.K. Yeung (2), D. Donzis (2), S. Kumar (3), W. Pfeiffer (1), G. Chukkapalli (1)
(1) San Diego Supercomputer Center  (2) Georgia Institute of Technology  (3) IBM
SP SciComp, Boulder CO, July 20, 2006
Turbulence: examples
The small scales are important.
DNS code
• Code written in Fortran 90 with MPI
• Time evolution: Runge-Kutta, 2nd order
• Spatial derivative calculation: pseudospectral method
• Typically, FFTs are done in all 3 dimensions
• Consider the 3D FFT as a compute-intensive kernel representative of the performance characteristics of the full code
• Input is real, output is complex; or vice versa
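The time stepping named above can be illustrated with a minimal second-order Runge-Kutta (midpoint) step. This is a hedged Python sketch, not the production Fortran code; the right-hand side `f` merely stands in for the pseudospectral evaluation:

```python
import math

import numpy as np

def rk2_step(u, t, dt, f):
    """One 2nd-order Runge-Kutta (midpoint) step for du/dt = f(u, t)."""
    k1 = f(u, t)
    k2 = f(u + 0.5 * dt * k1, t + 0.5 * dt)
    return u + dt * k2

# Usage: linear decay du/dt = -u, whose exact solution is exp(-t).
u = np.array([1.0])
dt = 0.01
for n in range(100):
    u = rk2_step(u, n * dt, dt, lambda v, t: -v)
assert abs(u[0] - math.exp(-1.0)) < 1e-4  # 2nd-order accurate in dt
```

In the real code the state is the Fourier-space velocity field and `f` involves the 3D FFTs discussed next.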
3D FFT
Use ESSL library calls for 1D FFTs on IBM systems, or FFTW elsewhere (FFTW is about 3 times slower than ESSL on IBM).
Forward 3D FFT in serial. Start with a real array (Nx,Ny,Nz):
• 1D FFT in x for all y and z
  • Input is real (Nx,Ny,Nz)
  • Call the SRCFT routine (real-to-complex), size Nx, stride 1
  • Output is complex (Nx/2+1,Ny,Nz) – conjugate symmetry: F(k) = F*(N-k)
  • Pack data as (Nx/2,Ny,Nz), since F(1) and F(Nx/2+1) are real numbers
• 1D FFT in y for all x and z
  • Input is complex (Nx/2,Ny,Nz)
  • Call SCFT (complex-to-complex), size Ny, stride Nx/2
  • Output is complex (Nx/2,Ny,Nz)
• 1D FFT in z for all x and y
  • Input and output are complex (Nx/2,Ny,Nz)
  • Call SCFT (complex-to-complex), size Nz, stride (Nx Ny)/2
Inverse 3D FFT: do the same in reverse order. Call SCFT, SCFT, and SCRFT (complex-to-real).
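The staged forward transform can be mimicked with NumPy. This is an illustrative sketch (the slides use ESSL's SRCFT/SCFT routines; the Nx/2 packing trick is noted in a comment but not implemented here):

```python
import numpy as np

Nx, Ny, Nz = 8, 4, 4
rng = np.random.default_rng(0)
a = rng.random((Nx, Ny, Nz))  # real input array

# Stage 1: real-to-complex 1D FFT in x for all y, z (SRCFT in the slides).
# Conjugate symmetry F(k) = F*(N-k) leaves Nx/2+1 independent modes; since
# modes 0 and Nx/2 are real, they could be packed into Nx/2 complex slots.
b = np.fft.rfft(a, axis=0)           # complex, shape (Nx/2+1, Ny, Nz)

# Stages 2 and 3: complex-to-complex 1D FFTs in y and z (SCFT in the slides).
c = np.fft.fft(b, axis=1)
d = np.fft.fft(c, axis=2)

# The staged result matches the first Nx/2+1 x-planes of a full 3D FFT.
assert np.allclose(d, np.fft.fftn(a)[: Nx // 2 + 1])
```

The three stages commute with each other, which is what makes the transpose-based parallelization in the following slides possible.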
3D FFT cont’d
Note: alternatively, one could transpose the array in memory before calling the FFTs, so that strides are always 1. In practice, with ESSL this doesn't give an advantage (ESSL is efficient even with strides > 1):
• Stride 1: 28% of peak flops on Datastar
• Stride 32: 25% of peak
• Stride 2048: 10% of peak
Parallel version
• Parallel 3D FFT: the so-called transpose strategy, as opposed to the direct strategy. That is, make sure all data in the direction of the 1D transform resides in one processor's memory, and parallelize over the orthogonal dimension(s).
• Data decomposition: N^3 grid points over P processors
  • Originally 1D (slab) decomposition: divide one side of the cube over P, assigning N/P planes to each processor. Limitation: P <= N
  • Currently 2D (pencil) decomposition: divide a side of the cube (N^2 points) over P, assigning N^2/P pencils (columns) to each processor.
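The local array shapes for the two decompositions can be made concrete with a small sketch (the helper names and the p1 x p2 process-grid notation are illustrative, not from the production code):

```python
def slab_shape(N, P):
    """1D (slab): each task holds N/P full planes; requires P <= N."""
    if P > N:
        raise ValueError("slab decomposition needs P <= N")
    return (N // P, N, N)

def pencil_shape(N, p1, p2):
    """2D (pencil): each task holds N^2/(p1*p2) pencils of length N."""
    return (N, N // p1, N // p2)

print(slab_shape(2048, 2048))      # (1, 2048, 2048): at the slab limit
print(pencil_shape(2048, 32, 64))  # (2048, 64, 32): P = 2048 as a 32 x 64 grid
```

The slab limit P <= N is exactly why the pencil decomposition is needed to reach the processor counts discussed below.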
Memory and compute power
• 2048^3 on 2048 processors: 230 MB/proc. This problem fits on Datastar and Blue Gene; extensive simulations are under way.
• 4096^3 on 2048 processors: 1840 MB/proc. This problem doesn't fit on BG (256 MB/proc), and fits very tightly on Datastar.
• In any case, the computational power of 2048 processors is not enough to solve such problems in reasonable time. Scaling to higher processor counts, certainly more than 4096, is necessary.
Therefore, 2D decomposition is a necessity (P > N).
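The quoted per-task figures are consistent with roughly seven 8-byte words of storage per grid point. The array count of seven is an assumption inferred from the numbers, not stated in the slides:

```python
def mb_per_task(N, P, words_per_point=7, bytes_per_word=8):
    """Approximate per-task memory in MB for an N^3 grid on P tasks,
    assuming ~7 real*8 arrays per grid point (an illustrative guess)."""
    return N**3 * words_per_point * bytes_per_word / P / 1e6

print(round(mb_per_task(2048, 2048)))  # ~235 MB, vs. 230 MB quoted
print(round(mb_per_task(4096, 2048)))  # ~1879 MB, vs. 1840 MB quoted
```

Since memory per task scales as N^3/P, doubling N at fixed P multiplies the per-task footprint by 8, which is why 4096^3 cannot fit on BG at 2048 tasks.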
1D Decomposition
2D decomposition
2D Decomposition cont’d
Communication
Global communication has traditionally been a serious challenge for scaling applications to large node counts.
• 1D decomposition: 1 all-to-all exchange involving all P processors
• 2D decomposition: 2 all-to-all exchanges, within p1 groups of p2 processors each (p1 x p2 = P)
• Which is better? Most of the time 1D wins. But again: it can't be scaled beyond P = N.
The crucial parameter is bisection bandwidth.
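The trade-off can be sketched in terms of per-task message counts and sizes. This is order-of-magnitude only; the constants depend on packing details, and the function names are illustrative:

```python
def alltoall_1d(N, P):
    """1D (slab): one all-to-all among all P tasks.
    Each task sends P-1 messages of ~N^3/P^2 points each."""
    return P - 1, N**3 // P**2

def alltoall_2d(N, p1, p2):
    """2D (pencil): two all-to-alls, within rows (p2 tasks) then columns
    (p1 tasks). Returns (messages, points per message) for each exchange."""
    P = p1 * p2
    return (p2 - 1, N**3 // (P * p2)), (p1 - 1, N**3 // (P * p1))

# 1D: fewer, larger messages, but impossible once P > N.
print(alltoall_1d(2048, 2048))     # (2047, 2048)
# 2D: more, smaller messages, but scales to P = p1*p2 >> N.
print(alltoall_2d(2048, 64, 128))  # ((127, 8192), (63, 16384))
```

The smaller 2D messages are why bisection bandwidth, rather than latency, dominates only as long as messages stay large enough.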
Alternative approaches attempted
• Overlap communication and computation: no advantage.
• Hybrid MPI/OpenMP: no advantage.
• Transpose in memory and call ESSL routines with stride 1: no advantage, or worse.
Platforms involved
• Datastar: IBM Power4+, 1.5 GHz
  • at SDSC, up to 2048 CPUs
  • 8 processors/node
  • Fat-tree interconnect
• Blue Gene: IBM PowerPC, 700 MHz
  • at SDSC, up to 2048 CPUs
  • at IBM's T.J. Watson Lab in New York state, up to 32768 CPUs (2nd in the Top500 list)
  • 2 processors/node
  • 3D torus interconnect
Performance on IBM Blue Gene and Datastar
[Figure: scaled performance N^3 log2(N)/(t Nproc) vs. Nproc, from 64 to 32768 processors. Series: CO and VN runs at 512^3, 1024^3, 2048^3, and 4096^3 on Blue Gene, plus Datastar at 1024^3 and 2048^3. VN: two processors per node; CO: one processor per node.]
A closer look at performance on BG: DNS 2048^3
[Figure: T x P vs. Nproc, from 2048 to 32768 processors, for the 2048^3 DNS on Blue Gene. Series: VN total, CO total, VN communication, CO communication.]
Communication model for BG
• 3D torus network: a 3D mesh with wrap-around links. Each compute node has 6 links.
• Modeling communication is a challenge for this network and problem (mapping the 2D processor geometry onto the 3D network topology). We try to make a reasonable estimate.
• Assume message sizes are large enough to consider only bandwidth and ignore latency overhead.
• Model CO mode (VN is similar).
Communication model for BG, cont'd
Two subcommunicators: S1 and S2
P = P1 P2 = Px Py Pz
Step 1: all-to-all exchange among P1 processors within each group S1
Step 2: all-to-all exchange among P2 processors within each group S2
By default, tasks are assigned in a block fashion (although custom mapping schemes are also available to the user and are an interesting option).
• The S1's are rectangles in the X-Y plane: P1 = Px x (Py/k)
• The S2's are k Z-columns: P2 = k Pz
B = 175 MB/s, the 1-link bidirectional bandwidth
Upper bound
• Assume the dimensions are independent
• Find the bottleneck in each dimension and sum up the maximum time per dimension
• Some links are idle some of the time

Take the first step: the communicator group is a plane of Px x (Py/k) tasks.
Assume a torus (wrap-around links) in the x dimension, but not in y, for k > 1.
The bisection bandwidth across the y links is B Px. Proceed in Px stages.
The number of messages exchanged is Px (Py/2k)^2 for each stage.
The total time for the y-dimension bottleneck is ty = (Nb/B) Px (Py/2k)^2.
Now independently consider the x direction, and derive tx = (Nb/B) (Py/k) (Px/2)^2 (1/2).
Summing up, and using Nb = 4N^3/(P P1), we get
T1 = (N^3/(P B)) ((1/2) Px + (1/k) Py)
Upper bound, cont'd
Now step 2: k Z-lines in each communicator, all lying in Y-Z planes.
Again, assume a staged implementation. First communicate along y.
The dimension size is k; the bisection bandwidth is B Px, but it is shared among the P1 groups. Do this Pz times.
The time is (Nb/B) (k/2)^2 (P1 Pz/Px) (1/2) = (N^3/(P B)) Py/2.
Finally, exchange within the Z-lines (k times):
Tz = k (Nb/B) (Pz/2)^2 / 2 = (N^3/(P B)) Pz/2 (the final factor of 2 comes from the torus links)
So T2 = (N^3/(P B)) (Py/2 + Pz/2)
Summing up,
Tup = T1 + T2 = (N^3/(P B)) (Px/2 + (1/2 + 1/k) Py + Pz/2)
For the lower bound, assume all links are busy and only the maximum time counts. We obtain
Tlower = (N^3/(P B)) [max(Px/2, Py/k) + (1/2) max(Py, Pz)]
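The two bounds can be written as a small sketch. Times are in units of grid points per unit bandwidth (multiply by bytes per point for seconds), and the function names are illustrative:

```python
def t_upper(N, Px, Py, Pz, k, B):
    """Upper bound from the slides' model:
    Tup = (N^3/(P*B)) * (Px/2 + (1/2 + 1/k)*Py + Pz/2)."""
    P = Px * Py * Pz
    return N**3 / (P * B) * (Px / 2 + (0.5 + 1.0 / k) * Py + Pz / 2)

def t_lower(N, Px, Py, Pz, k, B):
    """Lower bound (all links busy, only the slowest dimension counts):
    Tlower = (N^3/(P*B)) * (max(Px/2, Py/k) + max(Py, Pz)/2)."""
    P = Px * Py * Pz
    return N**3 / (P * B) * (max(Px / 2, Py / k) + max(Py, Pz) / 2)

# Example: 2048^3 on a hypothetical 8 x 16 x 16 torus partition, k = 4.
args = (2048, 8, 16, 16, 4, 175e6)
assert t_lower(*args) <= t_upper(*args)
```

The upper bound always dominates the lower bound, since each max(...) term is bounded by the corresponding sum in Tup; the measured times in the next figure fall between the two.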
Communication model for BG
[Figure: Tcomm x P vs. P, from 1024 to 16384 processors, for the 2048^3 problem in CO mode: measured 2D communication times against the predicted upper and lower bounds.]
Summary
• 2D decomposition enables significantly increased scalability of the DNS turbulence code
• Good scaling is achieved on both IBM SP4 and BG/L (up to 32k processors)
• Ready for the next generation of machines
• DNS turbulence is one of the 3 Model Problems in the recent NSF Petascale RFP