
16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator

Mitsuo Yokokawa1†, Ken’ichi Itakura2, Atsuya Uno2, Takashi Ishihara3 and Yukio Kaneda3

1 Earth Simulator Research and Development Center, Japan Atomic Energy Research Institute, 6-9-3, Higashi-Ueno, Taito-ku, Tokyo 110-0015, Japan

2 Earth Simulator Center, Japan Marine Science and Technology Center, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan, {itakura,uno}@es.jamstec.go.jp

3 Graduate School of Engineering, Nagoya University, Chikusa-ku, Nagoya 464-8603, Japan, {ishihara,kaneda}@cse.nagoya-u.ac.jp

Abstract

The high-resolution direct numerical simulations (DNSs) of incompressible turbulence with numbers of grid points up to 4096³ have been executed on the Earth Simulator (ES). The DNSs are based on the Fourier spectral method, so that the equation for mass conservation is accurately solved. In DNS based on the spectral method, most of the computation time is consumed in calculating the three-dimensional (3D) Fast Fourier Transform (FFT), which requires huge-scale global data transfer and has been the major stumbling block that has prevented truly high-performance computing. By implementing new methods to efficiently perform the 3D-FFT on the ES, we have achieved DNS at 16.4 Tflops on 2048³ grid points. The DNS yields an energy spectrum exhibiting a wide inertial subrange, in contrast to previous DNSs with lower resolutions, and therefore provides valuable data for the study of the universal features of turbulence at large Reynolds number.∗

1 Introduction

Direct numerical simulation (DNS) of turbulence provides us with detailed data on turbulence that is free of experimental uncertainty. DNS is therefore not only a powerful means for finding directly applicable solutions to problems in practical application areas that involve turbulent phenomena, but also for advancing our understanding of turbulence itself – the last outstanding unsolved problem of classical physics, and a phenomenon that is seen in many areas which have societal impacts.

Sufficiently high levels of computational performance are essential to the DNS of turbulence. If we don't have this, we are only able to simulate turbulence with insufficient resolution or for low or moderate values of the Reynolds number Re, which represents the degree of non-linearity of flow in a turbulent system. However, one will then be missing the essence of the turbulence. For example, our recent experience has shown that DNS with a resolution of only around 512³ grid points or so results in a significant overestimate of the Kolmogorov constant, which is one of the most important constants in the theory of turbulence. To obtain asymptotically correct higher-order statistics on small-scale eddies for large Re, which today forms the core of much of the effort in turbulence research, the required degree of resolution for the DNS of incompressible turbulent flow is estimated as at least 2048³ or 4096³ grid points.

† Currently Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology. e-mail: [email protected]
∗ 0-7695-1524-X/02 $17.00 (c) 2002 IEEE

The computer that runs a DNS of this type must have (M) enough memory to accommodate the huge number of degrees of freedom, (S) high enough speeds to run the DNS within a tolerable time, and (A) high levels of accuracy such that it is possible to resolve the motion of small eddies that have velocity amplitudes much smaller than those of the energy-containing eddies.

The Earth Simulator (ES) provides a unique opportunity in these respects. On the ES, we have recently achieved DNS of incompressible turbulence under periodic boundary conditions (BC) by a spectral method on 2048³ grid points with double-precision arithmetic, and DNS on 4096³ grid points with the application of time integration with single-precision arithmetic; double-precision arithmetic was used to obtain the convolutional sums for evaluating the nonlinear terms. Being based on a spectral method, our DNS accurately satisfies the law of mass conservation, as is explained below (§3); this is not achieved by a conventional DNS based on a finite-difference scheme. Such accuracy is of crucial importance in the study of turbulence, and in particular for the resolution of small eddies.

On the other hand, the execution of DNS code that implements the Fourier spectral method is definitely suitable as a way of evaluating the performance of such newly developed distributed-memory systems as the ES from the viewpoints of computational performance, node-to-node bandwidth, and I/O capabilities. The DNS code is a very good benchmark program for the ES. It has also been employed in the final adjustment of the hardware system.

The maximum number of degrees of freedom used in the evaluation is O(10^11). The computational speed for the DNSs was measured by using up to 512 processor nodes of the ES to simulate runs with different numbers of grid points. The best sustained performance of 16.4 Tflops was achieved in a DNS on 2048³ grid points.

This paper presents an overview of the ES and of the numerical methods applied in the DNS, along with the results of the system-performance evaluation and of the DNS itself.

2 Overview of the Earth Simulator

2.1 Structure

The ES is a parallel computer system of the distributed-memory type, and consists of 640 processor nodes (PNs) connected by 640 × 640 single-stage crossbar switches. Each PN is a system with a shared memory, consisting of 8 vector-type arithmetic processors (APs), a 16-GB main memory unit (MMU), a remote access control unit (RCU), and an I/O processor. The peak performance of each AP is 8 Gflops. The ES as a whole thus consists of 5120 APs with 10 TB of main memory and a peak performance of 40 Tflops [1].

Each AP consists of a 4-way super-scalar unit (SU), a vector unit (VU), and a main memory access control unit on a single LSI chip. The AP operates at a clock frequency of 500 MHz, with some circuits operating at 1 GHz. Each SU is a super-scalar processor with 64 KB instruction caches, 64 KB data caches, and 128 general-purpose scalar registers. Branch prediction, data prefetching and out-of-order instruction execution are all employed. Each VU has 72 vector registers, each of which can hold 256 vector elements, along with 8 sets of six different types of vector pipelines: addition/shifting, multiplication, division, logical operations, masking, and load/store. Vector pipelines of the same type work together on a single vector instruction, and pipelines of different types can operate concurrently. The VU and SU support the IEEE 754 floating-point data format.

Figure 1: System configuration of the ES. (Each of the 640 processor nodes, #0–#639, contains arithmetic processors #0–#7 and a 16-GB shared memory; the nodes are connected by the interconnection network, a single-stage full crossbar switch at 12.3 GB/s × 2.)

The RCU is directly connected to the crossbar switches and controls inter-node data communications at a 12.3 GB/s bidirectional transfer rate for both sending and receiving data. Thus the total bandwidth of the inter-node network is about 8 TB/s. Several data-transfer modes, including access to three-dimensional (3D) sub-arrays and indirect access modes, are realized in hardware. In an operation that involves access to the data of a sub-array, the data is moved from one PN to another in a single hardware operation, and relatively little time is consumed in this processing.

The overall MMU is divided into 2048 banks, and the sequence of bank numbers corresponds to increasing addresses of locations in memory. Therefore, the peak throughput is obtained by accessing contiguous data which are assigned to locations in increasing order of memory address.

The fabrication and installation of the ES at the Earth Simulator Center of the Japan Marine Science and Technology Center was completed by the end of February 2002 (Fig. 2) [2].

2.2 Parallel programming on the ES

If we consider vector processing as a sort of parallel processing, then we need to consider three-level parallel programming to attain high levels of sustained performance on the ES.

The first level of parallel processing is vector processing in an individual AP; this is the most fundamental level of processing by the ES. Automatic vectorization is applied by the compilers to programs written in conventional Fortran 90 and C.

The second level is shared-memory parallel processing within an individual PN. Shared-memory parallel programming is supported by microtasking and OpenMP. The microtasking capability is similar in style to that provided for Cray supercomputers, and the same function is realized for the ES. Microtasking is applied in two ways: one (AMT) is automatic parallelization by the compilers, and the other (MMT) is the manual insertion of microtasking directives before target do loops.

The third level is distributed-memory parallel processing among the PNs. A distributed-memory parallel programming model is supported by the Message Passing Interface (MPI). The performance of this system for the MPI_Put function of the MPI-2 specification was measured [3]. The maximum throughput and latency for MPI_Put are 11.63 GB/s and 6.63 µsec, respectively. Only 3.3 µsec is required for barrier synchronization; this is because the system includes a dedicated hardware system for global barrier synchronization among the PNs.

Figure 2: A model of the ES system in the gym-like building. The building is 50 m × 65 m × 17 m and has two stories; it includes a seismic isolation system. (The figure shows the processor node cabinets, the interconnection network cabinets, the disks, the cartridge tape library system, the power supply system, the air conditioning system, and the double floor for the PN–IN cables.)

3 Numerical Methods

3.1 Basic Equations and Spectral Method

The problem of incompressible turbulence under periodic boundary conditions (BC) is one of the most canonical problems in the study of turbulence, and has in fact been extensively studied. It keeps the essence of turbulence – nonlinear convection, pressure, and dissipative mechanisms due to the viscosity – while being free of such extra complexities as those due to the fluid's compressibility, which often make the reliability of DNS less transparent.

We here consider the flow of an incompressible fluid as described by the Navier-Stokes (NS) equations,

∂u/∂t + (u · ∇)u = −∇p + ν∇²u + f ,   (1)

under a periodic boundary condition with period 2π, where u = (u1, u2, u3) is the velocity field, p is the pressure, and f is the external force, which satisfies ∇ · f = 0; the fluid density is assumed to be unity. The pressure term p can be eliminated by using the incompressibility condition

∇ · u = 0. (2)

Let us rewrite (1) in the form

∂u/∂t = u × ω − ∇Π + ν∇²u + f ,   (3)


where ω = rot u = (ω1, ω2, ω3) is the vorticity and Π = p + |u|²/2. Then, taking the divergence of (3) and using (2), we obtain

∇²Π = ∇ · [u × ω].   (4)

In DNS of turbulence, the accurate solution of equations of this type, i.e., Poisson's equation, is important, because a violation of the equation(s) implies a violation of mass conservation, one of the most fundamental laws of physics. However, it is not, in general, easy to accurately solve Poisson's equation by using a finite-difference (FD) scheme. In fact, most of the cpu time consumed in solving a DNS by FD is known to be consumed in solving Poisson's equation, and the cpu time increases rapidly with the required accuracy level. This difficulty can be overcome by using the Fourier spectral method, where (3) is written as

(d/dt + νk²) û(k) = ŝ(k) − k (k · ŝ(k))/k² + f̂(k),   (5)

ŝ(k) = (u × ω)^(k),

where we have used (4). In (5), k is the wave vector, k = |k|, and the hat denotes the Fourier coefficient. For example, û(k) is defined by

u(x) = Σ_{k<K_C} û(k) exp(ik · x),

where K_C is a cut-off wavenumber in the DNS (see below). In the spectral method, the inverse Poisson operator is expressed simply by −1/k², so its evaluation can be accurate to within the limit imposed by the numerical round-off error.

In DNS by a spectral method, most of the cpu time is consumed in the evaluation of the nonlinear terms in the NS equations, which are expressed as convolutional sums in the wave vector space. As is well known, the sum can be efficiently evaluated by using an FFT. In order to achieve high levels of efficiency in terms of computational speed and numbers of modes retained in the calculation, we use the so-called phase-shift method, in which a convolutional sum is obtained for a shifted grid system as well as for the original grid system. This method allows us to keep all of the wave vector modes which satisfy k < K_C = √2 N/3, where N is the number of discretized grid points in each direction of the Cartesian coordinates, thus greatly increasing the number of the retained modes. The nonlinear term in (3) or (5) can be evaluated without aliasing error by 18 real 3D-FFTs [4].
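To illustrate how such a transform-based evaluation of the nonlinear term works, the following minimal serial sketch (our own Python/NumPy illustration, not the Fortran 90 production code; the grid size, the random initial field, and the use of a plain spherical truncation without phase shifts are assumptions for illustration only) builds a solenoidal field, forms u × ω in physical space, and projects the result as in Eq. (5).

```python
import numpy as np

# Minimal serial sketch (not the ES code): evaluate the nonlinear term
# s(k) = (u x omega)^(k) of Eq. (5) on a small periodic box, keeping only
# modes with k < K_C = sqrt(2)*N/3.  No phase shift is applied here, so a
# small aliasing error remains.
N = 32                                                 # illustrative size only
k1 = np.fft.fftfreq(N, d=1.0/N)                        # integer wavenumbers
K = np.array(np.meshgrid(k1, k1, k1, indexing='ij'))   # shape (3, N, N, N)
k2 = np.sum(K**2, axis=0)
k2_safe = np.where(k2 == 0.0, 1.0, k2)                 # avoid division by zero at k = 0
mask = np.sqrt(k2) < np.sqrt(2.0)*N/3.0                # retained modes, k < K_C

rng = np.random.default_rng(0)
u_hat = rng.standard_normal((3, N, N, N)) + 1j*rng.standard_normal((3, N, N, N))
u_hat *= mask
u_hat -= K*np.sum(K*u_hat, axis=0)/k2_safe             # make the field solenoidal, Eq. (2)

u = np.fft.ifftn(u_hat, axes=(1, 2, 3)).real           # velocity in physical space
omega = np.fft.ifftn(1j*np.cross(K, u_hat, axis=0), axes=(1, 2, 3)).real   # vorticity
s_hat = np.fft.fftn(np.cross(u, omega, axis=0), axes=(1, 2, 3))*mask       # (u x omega)^

# right-hand side of Eq. (5), apart from the forcing: s(k) - k (k.s(k))/k^2
rhs = s_hat - K*np.sum(K*s_hat, axis=0)/k2_safe
```

A truncation of this simple kind still leaves some aliasing error; the phase-shift method removes it at the cost of repeating the transforms on a shifted grid, which is why the full scheme needs 18 real 3D-FFTs per evaluation.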

Our code (hereafter called Code-ω), based on (5) and the standard 4th-order Runge-Kutta (R-K) method for advancing time, was written in Fortran 90, and 25N³ main dependent variables are used in the program. The total number of lines of code, excluding comment lines, is about 3,000. The required amount of memory for N = 4096 and double-precision data was thus estimated to be about 12.5 TB, which is beyond the capacity of the ES. However, executing DNS with N = 4096 is of great interest from the viewpoint of studying turbulence. A closer inspection motivated by this consideration showed that the number of variables to be accommodated in the DNS may be reduced to 22N³ by rewriting the NS equations (1) in divergence form, i.e., as

∂u_i/∂t + ∂(u_i u_j)/∂x_j = −∂p/∂x_i + ν∇²u_i + f_i,   (6)

and that the memory requirement can be further reduced by (a) limiting the time integration by the R-K method to relevant wave-vector modes or (b) reducing the precision of the arithmetic from double to single. DNS for N = 4096 by 512 PNs of the ES is then possible. Unlike method (a), method (b) is easy to implement and makes the computation faster, so we were tempted to try method (b).
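A quick back-of-envelope check of the memory figures quoted above (our own arithmetic, assuming 8-byte doubles, 4-byte singles, and 1 TB = 2^40 bytes; not a figure taken from the paper's code):

```python
# Rough check of the memory estimates: 25*N^3 variables for Code-omega and
# 22*N^3 for the divergence form, at 8 bytes per double-precision word.
N = 4096
to_tb = lambda nbytes: nbytes / 2.0**40       # binary terabytes

print(to_tb(25 * N**3 * 8))   # ~12.5 TB for Code-omega in double precision
print(to_tb(22 * N**3 * 8))   # ~11.0 TB for the divergence form, still above the 10 TB of the ES
print(to_tb(22 * N**3 * 4))   # ~5.5 TB if the precision is reduced to single
```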


Before applying method (b) to allow us to execute DNS with N = 4096, it is desirable to have some idea of the potential effects of the arithmetic precision on the results. We therefore performed preliminary DNSs with N = 1024 in (i) double-precision arithmetic, (ii) single-precision arithmetic, and (iii) single-precision arithmetic for time integration in the spectral space and double-precision arithmetic for the convolutional sums used in evaluating the nonlinear term. Method (iii) is interesting because, while the nonlinear coupling in the NS equations is treated more accurately by (iii) than by (ii), the memory capacity of the ES allows DNS for N = 4096 with method (iii). In comparing the results of the DNS test runs, we confirmed that the difference between the results is not significant, at least in terms of such low-order statistics as the energy spectrum presented below (the details of the comparison will be reported elsewhere).

There are slightly fewer operations per time step in Code-ω than in the code (hereafter called Code-div) based on the divergence form (6); this difference, however, is O(N³), which is much less than the number O(N³ log₂ N) of operations for one 3D-FFT [see Eqs. (7) and (8)], since the respective programs have the same numbers of 3D-FFTs per time step. Accordingly, the speed of computation is slightly faster with Code-ω. For example, the respective cpu times per 100 time steps in execution by 512 nodes of simulations with N = 2048 and double-precision arithmetic are 321 sec for Code-ω and 330 sec for Code-div. We confirmed that the results of simulation by the two codes are the same, to within the limit of machine accuracy. With regard to the simulation runs with N ≤ 2048, we present below only the results for Code-ω. For the run with N = 4096, we only used Code-div, in which we apply the Runge-Kutta-Gill method for advancing time to save on memory usage.

3.2 Parallelization

Since the 3D-FFT accounts for more than 90% of the computational cost of executing the code for a DNS of turbulence by the Fourier spectral method, the most crucial factor in maintaining high levels of performance in the DNS of turbulence is the efficient execution of this calculation. In particular, in a parallel implementation, the 3D-FFT has a global data dependence because of its requirement for global summation over PNs, so data is frequently moved about among the PNs during this computation.

Vector processing is capable of efficiently handling the 3D-FFT as decomposed along any of the three axes. An effective approach is to apply the domain decomposition method to assign the calculations for respective sets of several 2D planes to individual PNs, and then have each PN execute the 2D-FFT for its assigned slab by vector processing and microtasking. The FFT in the remaining direction should then be performed after transposition of the 3D array data. Domain decomposition in the k₃ direction in wave-vector space and in the y direction in physical space was implemented in the code.

We achieved a high-performance FFT by implementing the following ideas/methods on the ES.

(1) Data Allocation
Let nd be the number of PNs, and consider the 3D-FFT for N³ real-valued data, which we will call u. In wave-vector space, the Fourier transform û of the real u is divided into a real part û_R and an imaginary part û_I, each of which is of size N³/2. These data are divided into nd data sets, each of which is allocated to the global memory region (GMR) of a PN, where the size of each data set is (N + 1) × N × (N/2/nd). Similarly, the real data of u are divided into nd data sets of size (N + 1) × (N/nd) × N, each of which is allocated to the GMR of the corresponding PN. Here the symbol n_i in n₁ × n₂ × n₃ denotes the data length along the i-th axis, and we set the length along the first axis to (N + 1), so as to speed up the memory throughput by avoiding memory-bank conflict.


(2) Parallelization by Microtasking
For efficiently performing the N × (N/2/nd) 1D-FFTs of length N along the first axis, the data along the second axis is divided up equally among the 8 APs of the given PN. This division can be efficiently achieved by using the microtasking function which is provided by the ES. However, we decided to achieve this in practice by the method referred to as MMT in §2.2, i.e., the manual insertion of the parallelization directive “*cdir parallel do” before target do-loops, which directs the compiler to apply microtasking to parallelize the do-loop. We did this because we had found that the use of automatic parallelization (AMT) by the compiler did not draw out the benefits of parallel execution.

(3) Radix-4 FFT
Though the peak performance of an AP of the ES is 8 Gflops, the bandwidth between an AP and the memory system is 32 GB/s. This means that only one double-precision floating-point datum can be supplied for every two possible double-precision floating-point operations. The ratio of the number of memory accesses to the number of floating-point operations for the radix-2 FFT is 1; the memory system is thus incapable of supplying sufficient data to the processors. This bottleneck of memory access in the kernel loop of a radix-2 FFT function degrades the sustained levels of performance on the overall task. Thus, to obtain efficient calculation of the 1D-FFT within the ES, the radix-4 FFT must replace the radix-2 FFT to the extent that this is possible. This is because of the lower ratio of the number of memory accesses to the number of floating-point operations in the kernel loop of the radix-4 FFT, so the radix-4 FFT better fits the ES. The 1D-FFT along the 2nd axis can be performed by vectorization, and the function of microtasking is applied to the 1st axis by dividing the do-loops.

(4) Data Transposition by Remote Memory Access
Before performing the 1D-FFT along the 3rd axis, we need to transpose the data from the domain decomposition along the 2nd axis to the one along the 3rd axis. The remote memory access (RMA) function is capable of handling the transfer of data which is required for this transposition. RMA is a means for the direct transfer of data from the GMR of a PN of the ES to the GMR of a pre-assigned PN. One then does not need to make copies of the data, i.e., data is copied to neither the MPI library nor the communications buffer region of the OS. In a single cycle of RMA transfer, N × (N/nd) × (N/2/nd) data are transferred from each of the nd PNs to the other PNs. The data transposition can be completed with (nd − 1) such RMA-transfer operations, after which N × (N/nd) × (N/2) data will have been stored at each target PN. The 1D-FFT for the 3rd axis is then executed by dividing the do-loop along the 1st axis so as to apply microtasking. A schematic serial sketch of this transpose-based 3D-FFT is given below.
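The following serial NumPy sketch is our own illustration of the transpose-based 3D-FFT described in (1)–(4); the axis choices, toy sizes, and the list of slabs standing in for nd nodes are assumptions, and NumPy slicing plays the role that the RMA transfer (MPI_Put) plays on the ES.

```python
import numpy as np

# Serial illustration of a slab-decomposed 3D FFT: 2D FFTs on each node's
# slab, an all-to-all transpose of the decomposition axis, then the final
# 1D FFT.  On the ES the transpose step is performed with RMA (MPI_Put).
N, nd = 16, 4                          # toy grid size and "number of nodes"
b = N // nd                            # slab thickness
u = np.random.default_rng(1).standard_normal((N, N, N))

# initial decomposition along the 3rd axis: node p owns u[:, :, p*b:(p+1)*b]
slabs = [u[:, :, p*b:(p+1)*b].copy() for p in range(nd)]

# step 1: each node performs 2D FFTs over the first two axes of its slab
slabs = [np.fft.fftn(s, axes=(0, 1)) for s in slabs]

# step 2: "all-to-all" transpose -- node p sends the block q*b:(q+1)*b of the
# 2nd axis to node q; afterwards node q owns the full 3rd axis
recv = [[slabs[p][:, q*b:(q+1)*b, :] for p in range(nd)] for q in range(nd)]
slabs_t = [np.concatenate(recv[q], axis=2) for q in range(nd)]

# step 3: each node performs the remaining 1D FFT along the 3rd axis
slabs_t = [np.fft.fft(s, axis=2) for s in slabs_t]

# sanity check against a direct 3D FFT of the whole array
ref = np.fft.fftn(u)
for q in range(nd):
    assert np.allclose(slabs_t[q], ref[:, q*b:(q+1)*b, :])
```

The assertion at the end confirms that the slab-by-slab result agrees with a direct 3D FFT of the undecomposed array.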

4 Performance and DNS Results

4.1 Performance of Parallel Computation on Multi-nodes

We have measured the sustained performance for both the double-precision and single-precision versions of Code-ω by changing the number N³ of grid points, setting N to 128, 256, 512, 1024, and 2048. The corresponding numbers of PNs taken up in the ES are 64, 128, 256, and 512; see Table 1 for the correspondences between the number of PNs and the number of grid points which we tested.

The calculation time for 100 time steps of the Runge-Kutta integration was measured by using the MPI function MPI_Wtime. The time taken in initialization and I/O processing is excluded from the measurement because these values are negligible in comparison with the cost of the Runge-Kutta integration, which increases with the number of time steps.


Table 1: Performance in Tflops of the computations with double [single] precision arithmetic as counted by the hardware monitor on the ES. The numbers in ( ) denote the values of the computational efficiency, CE. The number np of APs in each PN is fixed at 8.

N³ \ nd   512                        256                        128                        64
2048³     13.7 (0.43) [15.3 (0.48)]  6.9 (0.43) [7.8 (0.49)]    −                          −
1024³     11.3 (0.35) [11.2 (0.35)]  6.2 (0.39) [7.2 (0.45)]    3.3 (0.41) [3.7 (0.47)]    1.7 (0.43) [1.9 (0.48)]
512³      −                          4.1 (0.26) [4.0 (0.25)]    2.7 (0.34) [3.0 (0.38)]    1.5 (0.38) [1.7 (0.43)]
256³      −                          −                          1.3 (0.16) [1.2 (0.15)]    1.0 (0.26) [1.1 (0.28)]
128³      −                          −                          −                          0.3 (0.07) [0.3 (0.07)]

Table 2: Performance in Tflops as calculated for the same cases as in Table 1 by using the analytical expressions for the numbers of operations.

N³ \ nd   512           256          128          64
2048³     14.6 [16.4]   7.4 [8.4]    −            −
1024³     12.2 [12.1]   6.7 [7.7]    3.5 [4.0]    1.8 [2.1]
512³      −             4.4 [4.3]    3.0 [3.3]    1.7 [1.9]
256³      −             −            1.4 [1.3]    1.1 [1.2]
128³      −             −            −            0.3 [0.3]

The number of floating-point operations in the measurement range, which is needed in the calculation of the sustained performance, may be obtained by either (a) a hardware counter in the ES or (b) some analytical method for estimating the number of operations. The hardware counter obtains the total number of floating-point operations that have been processed in vector operations. Generally, numbers of operations are greater for vector than for scalar processors, because IF statements in do loops in the former case may require extra floating-point operations due to masking operations. Since there are few IF statements in most of the subroutines for calculating 3D-FFTs, the number measured by the hardware counter is appropriate for the calculation of sustained performance.

Method (b) is based on the fact that the number of operations in a 1D-FFT of length N using radix-2 and radix-4 steps is N(5p + 8.5q), where N is represented by 2^p 4^q and p = 0 or 1 according to the value of N [5]. Considering that the DNS code has 72 real 3D-FFTs per time step and summing up the number of operations over the measurement range by hand, we obtain the following analytical expressions for the number of operations:

459 N³ log₂ N + (288 + 16π) N³,  if p = 0,   (7)

459 N³ log₂ N + (369 + 16π) N³,  if p = 1.   (8)

The results of these expressions can be used as reference values for the number of operations when comparing the sustained performance of the code.
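As a rough cross-check of Eqs. (7)–(8), the short script below (our own arithmetic, not from the paper) combines Eq. (8) with the 321 sec per 100 steps quoted in §3.1 for N = 2048 in double precision on 512 PNs; the result reproduces the 14.6 Tflops entry of Table 2.

```python
import math

def ops_per_step(N):
    """Floating-point operations per Runge-Kutta step, Eqs. (7)-(8)."""
    p = 0 if round(math.log2(N)) % 2 == 0 else 1   # N = 2^p * 4^q with p = 0 or 1
    c = 288 if p == 0 else 369
    return 459 * N**3 * math.log2(N) + (c + 16*math.pi) * N**3

N = 2048
steps, seconds = 100, 321.0                        # time per 100 steps quoted in Sec. 3.1
tflops = ops_per_step(N) * steps / seconds / 1e12
print(round(tflops, 1))                            # ~14.6 Tflops, cf. Table 2, nd = 512
```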

Tables 1 and 2 show the results obtained by methods (a) and (b), respectively. N³ and nd in the tables indicate the numbers of grid points and PNs used in each run, respectively. The number np of APs in each PN is fixed at 8 in the executions that produced Tables 1 and 2, by the use of MMT. The total number of APs used in each run is thus 8 × nd. In the measurement runs, the data allocated to the GMR of each PN are transferred by the RMA function MPI_Put, and the tasks in each PN are divided up among the APs by the use of manual microtasking (MMT). The maximum sustained performance of 16.4 Tflops as calculated by method (b) was for the problem size of 2048³ with 512 PNs; 15.3 Tflops (again the maximum) was obtained for the same case in the evaluation by method (a). Table 1 shows that, for a fixed np = 8, the sustained performance is better for the single-precision than for the double-precision code for larger values of N and smaller values of nd. Though the numbers of floating-point operations in the code are the same regardless of the precision, only half as much data is transferred in the transposition for the 3D-FFT in the former case. This is presumably the reason for the better performance of the single-precision code for larger values of N.

Note here that when one uses AMT, the compiler automatically determines the division of the do-loop for microtasking and the number np of available APs within 1 ≤ np ≤ 8. In our experience with AMT, pre-assigning np to np = N/nd/2 has not greatly improved the performance over that for the case with no pre-assignment. This implies that one needs to use MMT to achieve high levels of performance for fixed values of N and nd. In fact, the highest levels of performance are achieved by the manual insertion of microtasking directives that obtain equal utilization of all 8 APs. If AMT were used instead of MMT, only 2 APs in each PN would be utilized to execute each nested do-loop.

In Table 1 we see that the performance in Tflops increases almost linearly with nd or the total number of APs (8 × nd), and the computational efficiency CE, defined as (sustained performance)/(theoretical peak performance), increases with N for a given fixed 8 × nd. This implies that the RMA transfer is working very efficiently despite the huge amount of data being transferred among the PNs, so that communications do not become a bottleneck. A close inspection of the CE values reveals that the value is over 0.6 for some examples of the radix-4 FFT, but is generally only about 0.3 for the radix-2 FFT. Our achievement of CE values in the 0.4 – 0.5 range for larger values of N is presumably due to our strategy of using the radix-4 FFT as much as possible, as was explained in §3.

The CPU times for Code-ω to go through one time step of the 4th-order R-K method, when N = 1024 and 2048, np = 8, and nd = 512, are 0.435 and 3.21 seconds, respectively. It may be interesting to compare this with the performance of other code; for example, on the supercomputer Fujitsu VPP5000/56 installed at the Computation Center of Nagoya University, code developed by the authors (which has been used in various studies of turbulence; see the cited example [6]) achieved a CPU time of 160 seconds for the DNS with N = 1024.

The CPU times to calculate a single eddy-turnover time T, defined as T = L/U (where U and L are the characteristic velocity and length scale of the energy-containing eddies, respectively), were 23 and 268 minutes in the runs with N = 1024 and 2048, respectively, when we were using np = 8 and nd = 512. The time of 23 minutes for N = 1024 is much smaller than the 60 hours achieved with 32 PEs (∼307 Gflops) of the VPP5000/56 [8]. This comparison demonstrates the strong performance of the code we have developed for the ES, in terms of wall-clock measures of computation time.

For reference, it may also be of interest to note that the CPU time of the ES for one time step of the Runge-Kutta-Gill method in single precision for the run with N = 4096, np = 8, and nd = 512 was 30.7 seconds, and the CPU time for a single eddy-turnover time was estimated as 43 hours. A memory capacity of 7.2 TB was required for this execution. Simulations of this size are only truly practicable on the ES.


4.2 DNS Results

All of the runs, except the one with N = 4096, were performed with double-precision arithmetic and continued until the time t ≈ 5T. Single-precision arithmetic was used in the run with N = 4096, except for the 3D-FFT (method (iii) in §3.1), and this run was continued until t ≈ 0.7T.

Table 3 shows some characteristic parameter values for the runs, where Rλ is the so-called Taylor-scale Reynolds number. In experimental and numerical data on fully developed turbulence, it is traditional to use Rλ instead of the Reynolds number Re = UL/ν, since the former is easier to measure [7]. The relation Rλ ∝ Re^{1/2} has been shown to hold for large-Reynolds-number turbulence. The Rλ values of 732 and 1217 for N = 2048 and 4096 are much higher than those in previous DNSs [8, 9]; for example, Rλ = 460 for N = 1024 in the DNS of [8].

One of the most important features of real turbulence at large Reynolds number is the existence of a wide gap, the so-called inertial subrange, between the scale on which the energy is contained and the scale on which it has to be dissipated. These scales are characterized by L (the so-called integral length scale) and the Kolmogorov length scale η, respectively. In order to study the possible universal features of turbulence by DNS, simulating a wide enough inertial subrange is of crucial importance. In this respect, Table 3 shows that the scale ratio L/η is more than 10³ for N = 2048 and 4096, which is much greater than the ratios in previous DNSs.

These simulation runs on the ES may thus have provided us with data which are indispensable to turbulence research. Among these are data on the energy spectrum E(k). According to Kolmogorov's theory (K41) [10], which remains a major source of inspiration for turbulence research, the energy spectrum in the inertial subrange at sufficiently large Reynolds number must obey a power law of the form

E(k) = C_K ε^{2/3} k^{−5/3},   (9)

where C_K is a universal constant and ε is the mean rate of energy dissipation per unit mass. Figure 3 shows the compensated energy spectra of the form k^{5/3}E(k)/ε^{2/3} for four different values of N. Equation (9) indicates that the compensated spectra in Fig. 3 must be flat, i.e., constant independent of k, in the wavenumber range that corresponds to the inertial subrange.

One can estimate C_K from this constant level. To obtain a reliable estimate of C_K in this way, the inertial range has to be wide enough, i.e., N must be large enough. Otherwise, the estimate will be of questionable validity. As a matter of fact, values for C_K of 2.0 ∼ 2.2 on the basis of data from DNSs have been reported in the literature, but we must remember that these estimates have been based on DNSs with N ≤ 512. In contrast, Figure 3 shows that with increases in N (or Re), C_K converges well to a constant C_K = 1.6 ∼ 1.7, which is in good agreement with existing large-scale experimental data [11].
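For readers who wish to form such a plot from their own data, a minimal sketch of the shell-averaged, compensated spectrum is given below (our own NumPy illustration; the binning convention and the spectral estimate of the dissipation rate are assumptions, not the paper's post-processing code).

```python
import numpy as np

def compensated_spectrum(u, nu):
    """u: real velocity field of shape (3, N, N, N) on a 2*pi periodic box;
    returns shell wavenumbers k, the spectrum E(k), and k^(5/3) E(k) / eps^(2/3)."""
    N = u.shape[1]
    u_hat = np.fft.fftn(u, axes=(1, 2, 3)) / N**3            # Fourier coefficients
    k1 = np.fft.fftfreq(N, d=1.0/N)
    K = np.array(np.meshgrid(k1, k1, k1, indexing='ij'))
    kmag = np.sqrt(np.sum(K**2, axis=0))
    e_density = 0.5*np.sum(np.abs(u_hat)**2, axis=0)          # energy per mode
    lows = np.arange(0.5, N//2)                               # shell boundaries
    E = np.array([e_density[(kmag >= lo) & (kmag < lo + 1)].sum() for lo in lows])
    k = lows + 0.5
    eps = 2.0*nu*np.sum(kmag**2 * e_density)                  # mean dissipation rate
    return k, E, k**(5.0/3.0) * E / eps**(2.0/3.0)
```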

Table 3: Parameters in the DNSs; Rλ: Taylor micro-scale Reynolds number, ν: kinematic viscosity, L: integral length scale, η: Kolmogorov length scale.

N       Rλ      ν (×10⁻⁴)   ∆t (×10⁻⁴)   L       η
512     257     2.8         10.0         1.02    0.00395
1024    471     1.1         6.25         1.28    0.00210
2048    732     0.44        4.0          1.23    0.00105
4096    1217    0.173       2.5          1.21    0.00053



Figure 3: Compensated energy spectra k^{5/3}E(k)/ε^{2/3} as obtained in DNSs with 512³, 1024³, 2048³ and 4096³ grid points. (a) Spectra from the DNSs with 512³ and 1024³ grid points, and (b) spectra from all four DNSs.

Moreover, a close inspection of the spectra for N = 2048 and 4096 in Fig. 3(b) suggests a feature that is not seen, or not clearly seen, in DNSs with N ≤ 1024; the compensated spectra for N = 2048 and 4096 suggest a slight slope in the inertial range, i.e., roughly in the range 0.004 < kη < 0.03 (in Fig. 3, we show (a) and (b) separately to make the difference between the spectra with N ≤ 1024 and N ≥ 2048 clearly visible). This implies that the energy spectrum has a power-law form E(k) ∝ k^{−α} with α slightly different from the K41 value, α = 5/3. The detection of such a deviation, if it does in fact exist, is possible only with a DNS that has high enough resolution.

It has been known that K41 agrees quite well with experiments for low-order statistics, but the agreement is poor for high-order statistics. The source of this disagreement is believed to be the intermittency of turbulence, and modern work on turbulence has been focused to a large extent on the problem of intermittency (cf. Frisch [7]). One of the most fundamental measures of this intermittency is the departure, the so-called intermittency factor, which is defined as the degree to which the exponent ζ_p differs from the K41 value, p/3. Here, ζ_p is the scaling exponent of the structure function S_p, which is defined as the p-th order moment of the difference δu_r between the velocities at two points x and x + r:

S_p = ⟨|δu_r|^p⟩ ∝ r^{ζ_p},   (10)

in the inertial subrange satisfying η ≪ r ≪ L, where we have ignored the component dependence of the velocity field. Reliable data on the exponent ζ_p are indispensable to the study of the intermittency problem. Simulating a wide enough inertial subrange where S_p scales like (10) is, however, known to be difficult, especially when the value of p is large and the DNS does not have a large enough N. The ranges achieved in DNS to date have been too narrow to obtain reliable estimates of ζ_p. This difficulty is apparent even when the p value is as low as p = 2. Consider the compensated spectra of Fig. 3. The DNS with N = 2048 and 4096 suggests a slope, i.e., a difference from the K41 scaling (note that ζ₂ = 2/3 as obtained by K41 is equivalent to E(k) ∝ k^{−5/3}); however, it is difficult to detect this effect in the results for the DNS with N ≤ 1024.
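A minimal sketch of how ζ_p can be estimated from data is given below (our own illustration on a 1D velocity record; the set of separations assumed to lie in the inertial subrange, and the simple log-log fit, are choices of ours and are not taken from the paper).

```python
import numpy as np

def zeta_p(u, p, r_inertial):
    """Estimate the scaling exponent zeta_p of S_p(r) = <|du_r|^p>, Eq. (10).

    u: 1D periodic velocity record; r_inertial: separations (in grid units)
    assumed to lie in the inertial subrange."""
    # structure function S_p(r) for each separation r
    S = np.array([np.mean(np.abs(np.roll(u, -r) - u)**p) for r in r_inertial])
    # least-squares fit of log S_p = zeta_p * log r + const
    slope, _ = np.polyfit(np.log(r_inertial), np.log(S), 1)
    return slope, S
```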

The results of a DNS with as high a resolution as possible are thus of fundamental interest. In this regard, the present DNS is expected to provide valuable data for the study of the universal features of turbulence for large values of the Reynolds number, and particularly the scaling properties in the inertial subrange, which has been the objective of numerous studies of turbulence. Analysis of the DNS data behind this report on this point is now under way.


[Visualization]
In general, a DNS generates such a huge amount of data that dealing with the data efficiently is very important for understanding the physics of the phenomena it represents. For this purpose, the importance of visualization techniques is increasing with the amounts of data. Figures 4 to 7 show examples of the visualization of the simulated turbulence field, i.e., snapshots of the vorticity field in the DNS with N = 2048. The impression these images give is quite different from those for, e.g., N ≤ 512. The literature seems to indicate a widespread belief that the structure of individual small eddies is important in determining the inertial-range structure. Although this point of view may be consistent with figures produced by DNSs with N ≤ 512, Figs. 4–7, where N = 2048, suggest that this is unlikely to be the case. The difference between the individual small eddies and the inertial-subrange structure in Figs. 4–7 may remind us of the difference between leaves and a forest; the structure at the former scale may be totally different from that at the latter. Much remains to be studied and explored in such visualizations.

5 Summary

For the DNS of turbulence, the Fourier spectral method has the advantage of accuracy, particularly in terms of solving the Poisson equation, which represents mass conservation and is to be solved accurately for a good resolution of small eddies. However, the method requires frequent execution of the 3D-FFT, the computation of which requires global data transfer. In order to achieve high performance in the DNS of turbulence on the basis of the spectral method, efficient execution of the 3D-FFT is therefore of crucial importance.

By implementing new methods for the 3D-FFT on the ES, as explained in §3, we have accomplished high-performance DNS of turbulence on the basis of the Fourier spectral method. This is presumably the world's first DNS of incompressible turbulence with 2048³ or 4096³ grid points. The DNS provides valuable data for the study of the universal features of turbulence at large Reynolds number. The number of degrees of freedom, 4 × N³ (4 is the number of degrees of freedom (u1, u2, u3, p) per grid point, and N³ is the number of grid points), is about 3.4 × 10^10 in the DNS with N = 2048, and 2.7 × 10^11 in the DNS with N = 4096. The sustained speed is 16.4 Tflops. To the authors' knowledge, these values are the highest in any simulation so far carried out on the basis of spectral methods in any field of science and technology.

Acknowledgments

The authors would like to express their deepest condolences in connection with the late Mr. Hajime Miyoshi, who initiated and directed the ES project. They would also like to thank Dr. Tetsuya Sato, the director of the Earth Simulator Center, for his warm encouragement of this study.

The authors would also like to thank Mr. Minoru Saito of NEC Informatic Systems, Ltd. for his contribution in developing the code, and to thank all the members of ESRDC and ESC who were engaged in the development of the ES for valuable discussions and comments.

References

[1] M. Yokokawa, Present Status of Development of the Earth Simulator, Innovative Architecture for Future Generation High-Performance Processors and Systems, IEEE PR01309, pp. 93-99 (2000).

[2] T. Sato, S. Kitawaki, and M. Yokokawa, Earth Simulator Running, Proceedings of ISC 2002, Heidelberg, June 20-22 (2002).

[3] H. Uehara, M. Tamura, and M. Yokokawa, An MPI Benchmark Program Library and Its Application to the Earth Simulator, Proceedings of ISHPC 2002, LNCS 2327, pp. 219-230 (2002).

[4] C. Canuto, M. Y. Hussaini, A. Quarteroni, and T. A. Zang, Spectral Methods in Fluid Dynamics, Springer-Verlag (1988).

[5] C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM (1992).

[6] T. Ishihara, K. Yoshida, and Y. Kaneda, Anisotropic Velocity Correlation Spectrum at Small Scales in a Homogeneous Turbulent Shear Flow, Phys. Rev. Lett. 88, 154501 (2002).

[7] U. Frisch, Turbulence, Cambridge University Press (1995).

[8] T. Gotoh, D. Fukayama, and T. Nakano, Velocity field statistics in homogeneous steady turbulence obtained using a high-resolution direct numerical simulation, Phys. Fluids 14, pp. 1065-1081 (2002).

[9] T. Ishihara and Y. Kaneda, High resolution DNS of incompressible homogeneous forced turbulence – Time dependence of the statistics –, Proceedings of the International Workshop on "Statistical Theories and Computational Approaches to Turbulence," Kaneda and Gotoh (eds.), Springer, p. 179 (2002).

[10] A. N. Kolmogorov, The local structure of turbulence in incompressible viscous fluid for very large Reynolds number, Dokl. Akad. Nauk SSSR 30, pp. 9-13 (1941); Dissipation of energy in locally isotropic turbulence, Dokl. Akad. Nauk SSSR 32, pp. 16-18 (1941).

[11] K. R. Sreenivasan, On the universality of the Kolmogorov constant, Phys. Fluids 7, pp. 2778-2784 (1995).


Figure 4: Intense-vorticity isosurfaces showing the region where |ω| > ω̄ + 4σ; ω is the vorticity, and ω̄ and σ are the mean and standard deviation of |ω|, respectively. The size of the display domain is (5984² × 1496)η, periodic in the vertical and horizontal directions. η is the Kolmogorov length scale and Rλ = 732 (see Table 3).


Figure 5: A closer view of the inner square region of Fig. 4; the size of the display domain is (2992² × 1496)η.


Figure 6: The same isosurfaces as in Fig. 4; a closer view of the inner square region of Fig. 5. The size of the display domain is (1496² × 1496)η.


Figure 7: The same isosurfaces as in Fig. 4; a closer view of the inner square region of Fig. 6. The size of the display domain is (748² × 1496)η.
