Efficient utilization of computational resources in hybrid clusters Massimiliano Fatica
Overview
! Motivations
! Challenges
  ! Data movement
  ! Accuracy
! Results
  ! Library for DGEMM
  ! TeraTF code
! Conclusions
Motivations
• GPUs are very attractive in High Performance Computing:
  - Massive multithreaded many-core chips
  - High flop count (both SP and DP)
  - High memory bandwidth, ECC
  - Programming languages: CUDA C, CUDA Fortran, OpenACC, ...
  - Tools: debuggers, profilers, libraries (BLAS, FFT, LAPACK, ...)
Motivations
• GPU accelerated clusters are now a popular configuration:
  - Top500 in June 2012: 58 systems with accelerators (53 NVIDIA, 2 AMD, 2 Cell, 1 Intel MIC)
  - Extremely popular for oil and gas, molecular dynamics, astrophysics
• For specific workloads/configurations it is desirable to use both CPUs and GPUs
Data Movement
• CPU and GPU have different memory spaces
• The CPU memory system is optimized for latency; the GPU memory system is optimized for throughput
• CPU and GPU are connected by the PCI-e bus: 6 GB/s (Gen2), 10 GB/s (Gen3)
• Data movement needs to be minimized/hidden:
  - Pinned memory to fully utilize the PCI-e bus
  - Overlap computation and data transfer
• Tesla Fermi GPUs have two DMA engines
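A minimal CUDA C sketch of these two techniques (pinned allocation with cudaMallocHost plus asynchronous copies on a stream); the kernel, sizes and variable names are placeholders chosen for illustration, not code from the talk:

    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {            // trivial placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h, *d;
        cudaStream_t s;
        cudaMallocHost(&h, bytes);       // pinned host memory: full PCI-e rate, usable by async copies
        cudaMalloc(&d, bytes);
        cudaStreamCreate(&s);
        // Async copies return immediately; on Tesla Fermi the two DMA engines let
        // H2D and D2H transfers of different chunks overlap with kernel execution.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
        scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);
        cudaFreeHost(h); cudaFree(d); cudaStreamDestroy(s);
        return 0;
    }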
Accuracy
There may be several reasons for different results:
• Different algorithms: serial to parallel (reductions)
• Use of FMA instructions
• Math libraries
“A man with one watch knows what time it is; a man with two watches is never quite sure” ~Lee Segall
Computing π
Compute pi in single precision (seed 1234567)
Samples=    10000  Pi=3.16720009  Error= 0.2561E-01
Samples=   100000  Pi=3.13919997  Error= 0.2393E-02
Samples=  1000000  Pi=3.14109206  Error= 0.5007E-03
Samples= 10000000  Pi=3.14106607  Error= 0.5267E-03
Mismatch between CPU/GPU: 78534862 vs 78534859
Samples=100000000  Pi=3.14139414  Error= 0.1986E-03

Compute pi in single precision (seed 1234)
Samples=    10000  Pi=3.11120009  Error= 0.3039E-01
Samples=   100000  Pi=3.13632011  Error= 0.5273E-02
Samples=  1000000  Pi=3.14056396  Error= 0.1029E-02
Samples= 10000000  Pi=3.14092445  Error= 0.6683E-03
Samples=100000000  Pi=3.14158082  Error= 0.1192E-04
Where is the difference coming from?
(CPU)  if( ( hostData(i)**2 + hostData(i+Nhalf)**2 ) <= 1._fp_kind ) inside_cpu = inside_cpu + 1
(GPU)  if( ( deviceData(i)**2 + deviceData(i+Nhalf)**2 ) <= 1._fp_kind ) inside = inside + 1
– The sum of the points inside the circle is done with integers (no issues due to floating-point arithmetic)
– The computation of the distance from the origin (x*x+y*y) uses no special functions, just + and *
Accuracy: effect of FMA instructions
! Fermi GPUs are IEEE-754 compliant, both for SP and DP
! Support for the Fused Multiply-Add instruction (IEEE 754-2008)
! Results with FMA can differ* from results without FMA
! It is possible to toggle FMA on/off with a compiler switch
! Extremely useful to compare results against a "golden" CPU output
! FMA will be present in future CPUs
Compute pi in single precision (seed=1234567, FMA disabled)
Samples=    10000  Pi=3.16720009  Error= 0.2561E-01
Samples=   100000  Pi=3.13919997  Error= 0.2393E-02
Samples=  1000000  Pi=3.14109206  Error= 0.5007E-03
Samples= 10000000  Pi=3.14106607  Error= 0.5267E-03
Samples=100000000  Pi=3.14139462  Error= 0.1981E-03
*Single precision GPU results with FMA are identical to double precision CPU results
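As a sketch of how one might compare FMA and non-FMA results, here is the distance test written as a CUDA C kernel; the kernel and variable names are invented for this example, and nvcc's --fmad flag controls whether the multiply-add below may be contracted into a single FMA:

    // nvcc              test.cu     -> FMA contraction enabled (default)
    // nvcc --fmad=false test.cu     -> FMA contraction disabled, matches a non-FMA CPU
    __global__ void count_inside(const float *x, const float *y, int *inside, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // With contraction the compiler may emit fma(x, x, y*y): the x*x product
            // is not rounded before the add, which can change the last bit of d.
            float d = x[i] * x[i] + y[i] * y[i];
            if (d <= 1.0f) atomicAdd(inside, 1);
        }
    }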
Accuracy: effect of math functions
Different libraries could give different results. Example with x = -9.841215180935854789368022*2.0:

        x^6                                   pow(x,6)                              pow(x,6.0)
gcc     58139640.398606322705745697021484     58139640.398606315255165100097656     58139640.398606315255165100097656
icc     58139640.398606322705745697021484     58139640.398606322705745697021484     58139640.398606322705745697021484
pgcc    58139640.398606322705745697021484     58139640.398606322705745697021484     58139640.398606322705745697021484
CUDA    58139640.398606322705745697021484     58139640.398606322705745697021484     58139640.398606322705745697021484
Library for DGEMM
Both the CPU cores and the GPUs are used in synergy, with minor or no modifications to the original source code:
- A host library intercepts the calls to DGEMM and executes them simultaneously on the GPUs and on the CPU cores
- Pinned memory is used for fast PCI-e transfers (up to 6 GB/s on x16 Gen2, 11 GB/s on x16 Gen3) and to overlap computation and communication
- The library has been used to accelerate Linpack and Paratec; a similar approach (the phiGemm library) is used in Quantum Espresso
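One common way such an interception can work (a sketch under that assumption, not the actual library code) is to export the Fortran dgemm_ symbol from a preloaded shared library and dispatch part of the call to the GPU while forwarding the rest to the original CPU BLAS; the splitting and GPU work are only indicated by comments:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    typedef void (*dgemm_fn)(const char*, const char*, const int*, const int*, const int*,
                             const double*, const double*, const int*, const double*, const int*,
                             const double*, double*, const int*);

    // Exported with the Fortran BLAS name so a preloaded library receives the calls.
    void dgemm_(const char *ta, const char *tb, const int *m, const int *n, const int *k,
                const double *alpha, const double *A, const int *lda,
                const double *B, const int *ldb,
                const double *beta, double *C, const int *ldc)
    {
        static dgemm_fn cpu_dgemm = 0;
        if (!cpu_dgemm)                              // locate the original CPU DGEMM (e.g. MKL)
            cpu_dgemm = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");

        int n_gpu = 0;                               // placeholder: chosen from the measured GPU/CPU split
        int n_cpu = *n - n_gpu;
        // ... copy A, B1, C1 to the GPU and launch cublasDgemm on the first n_gpu columns ...
        // ... meanwhile run the original DGEMM on the remaining n_cpu columns:
        cpu_dgemm(ta, tb, m, &n_cpu, k, alpha, A, lda, B + (size_t)(*ldb) * n_gpu, ldb,
                  beta, C + (size_t)(*ldc) * n_gpu, ldc);
        // ... copy C1 back from the GPU ...
    }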
PCI-e transfer speed
CUDA offers a fast PCI-e transfer when host memory is allocated with cudaMallocHost instead of regular malloc.
            SUN Ultra 24               Supermicro 6016GT          Sandybridge
            PCI-e x16 Gen2             PCI-e x16 Gen2             PCI-e x16 Gen3
            Pageable     Pinned        Pageable     Pinned        Pageable     Pinned
H2D         2132 MB/s    5212 MB/s     4665 MB/s    5745 MB/s     3168 MB/s    11163 MB/s
D2H         1882 MB/s    5471 MB/s     4064 MB/s    6059 MB/s     2961 MB/s    10624 MB/s
DGEMM: C = alpha*A*B + beta*C
DGEMM(A,B,C) = DGEMM(A,B1,C1) ∪ DGEMM(A,B2,C2), with (B1,C1) handled by the GPU and (B2,C2) by the CPU
The idea can be extended to multi-GPU configurations and to huge matrices.
Find the optimal split, knowing the relative DGEMM performance of the GPU and of the CPU cores.
Optimal split
If A(M,K), B(K,N) and C(M,N), a DGEMM call performs 2*M*K*N operations.
Split N = N1 + N2 so that the GPU and the CPU portions finish at the same time:
  TGPU(M,K,N1) = TCPU(M,K,N2)
If GCPU denotes the DGEMM performance of the CPU cores in GFLOPS and GGPU that of the GPU, this means
  2*M*K*N1 / GGPU = 2*M*K*N2 / GCPU
so the optimal split is
  N1 = η*N, with η = GGPU / (GCPU + GGPU)
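A small C helper that applies this formula; the function name and the example rates are illustrative only:

    // Split the N columns of B and C between GPU and CPU in proportion to their
    // measured DGEMM rates, so that both parts finish at (roughly) the same time.
    static void split_columns(int n, double gflops_gpu, double gflops_cpu,
                              int *n_gpu, int *n_cpu)
    {
        double eta = gflops_gpu / (gflops_gpu + gflops_cpu);   // eta = G_gpu / (G_gpu + G_cpu)
        *n_gpu = (int)(eta * n);
        *n_cpu = n - *n_gpu;
    }
    // Example: with ~350 GFLOPS on the GPU and ~85 GFLOPS on the CPU cores
    // (the Fermi numbers quoted later), eta ~ 0.80, so about 80% of the columns go to the GPU.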
Overlap DGEMM on CPU and GPU

// Copy A
cublasSetMatrix(m, k, sizeof(A[0]), A, lda, devA, m_gpu);
// Copy B1
cublasSetMatrix(k, n_gpu, sizeof(B[0]), B, ldb, devB, k_gpu);
// Copy C1
cublasSetMatrix(m, n_gpu, sizeof(C[0]), C, ldc, devC, m_gpu);
// DGEMM on GPU: control returns immediately to the CPU
cublasDgemm('n', 'n', m, n_gpu, k, alpha, devA, m, devB, k, beta, devC, m);
// DGEMM on CPU on the remaining n_cpu columns
dgemm('n', 'n', m, n_cpu, k, alpha, A, lda, B+ldb*n_gpu, ldb, beta, C+ldc*n_gpu, ldc);
// Copy C1 back to the host
status = cublasGetMatrix(m, n_gpu, sizeof(C[0]), devC, m, C, ldc);
Using CUDA, it is very easy to express the workflow in the diagram
DGEMM compute/copy analysis
! Assume N, M >> K
! Copy time = 8*(M*K + N*K + 2*M*N)/PCIe ≈ 16*M*N/PCIe
! Compute time = 2*M*N*K/GFLOPS
! Compute fraction = 1/(1 + 8*GFLOPS/(K*PCIe)) = 1/(1+x), with x = 8*GFLOPS/(K*PCIe)
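As a hedged worked example with numbers quoted elsewhere in these slides (about 350 GFLOPS Fermi DGEMM, about 6 GB/s Gen2 PCI-e) and K = 1024: x = 8*350/(1024*6) ≈ 0.46, so the compute fraction is about 1/1.46 ≈ 69% of the total time. Since the copy time is shorter than the compute time (x < 1), the transfers can in principle be hidden behind the computation, and a larger K or a Gen3 link makes this easier.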
Fermi DGEMM Strategy
Slice the matrix into several pieces
Use Stream API
— Overlap COPY + Compute
Copy A H2D
Loop over pieces:
Copy Bi,Ci H2D
DGEMM A,Bi,Ci
Copy Ci D2H
CPU DGEMM A,B_last,C_last
DGEMM Compute/Copy Overlap
Additional Overlap strategy
The copy of the A matrix can be significant for smaller matrix sizes
Split A for additional Overlap
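A minimal sketch of this sliced, stream-based pipeline using the cuBLAS v2 API (without the additional split of A); the function signature, slice count and the use of two streams are assumptions made for this example, not the library's actual implementation. Host B and C are assumed to be pinned so the asynchronous copies can overlap with the DGEMMs:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // A is copied to the GPU once; B and C are processed in column slices so that
    // the copies of slice i+1 can overlap with the DGEMM on slice i.
    void dgemm_sliced(cublasHandle_t handle, int m, int k, int n_gpu, int n_slices,
                      const double *A, int lda, const double *B, int ldb, double *C, int ldc,
                      double alpha, double beta,
                      double *devA, double *devB, double *devC, cudaStream_t streams[2])
    {
        cublasSetMatrix(m, k, sizeof(double), A, lda, devA, m);    // Copy A H2D (once)
        int cols = n_gpu / n_slices;                                // columns per slice (assumed to divide evenly)
        for (int i = 0; i < n_slices; i++) {
            int off = i * cols;
            cudaStream_t s = streams[i % 2];                        // alternate streams for copy/compute overlap
            cublasSetStream(handle, s);
            // Copy Bi, Ci H2D
            cublasSetMatrixAsync(k, cols, sizeof(double), B + (size_t)ldb * off, ldb,
                                 devB + (size_t)k * off, k, s);
            cublasSetMatrixAsync(m, cols, sizeof(double), C + (size_t)ldc * off, ldc,
                                 devC + (size_t)m * off, m, s);
            // DGEMM A, Bi, Ci on the same stream
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, cols, k,
                        &alpha, devA, m, devB + (size_t)k * off, k,
                        &beta,  devC + (size_t)m * off, m);
            // Copy Ci D2H
            cublasGetMatrixAsync(m, cols, sizeof(double), devC + (size_t)m * off, m,
                                 C + (size_t)ldc * off, ldc, s);
        }
        // The remaining n - n_gpu columns are handled by the host DGEMM (not shown).
    }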
DGEMM Performance
[Plot: GFLOPs vs. matrix size (128 to 6080) for a Xeon quad-core 2.8 GHz with MKL 10.3, a Tesla C1060 GPU (1.296 GHz), and CPU + GPU combined]
Fermi DGEMM Performance
[Plot: GFLOPS vs. size N=M (K=1024), up to 18000, for a dual quad-core Xeon X5550 2.66 GHz (8 cores, MKL 10.2.4.032) and a Tesla M2050 "Fermi" 1.15 GHz; roughly 435 GFLOPS for CPU+GPU, 350 GFLOPS for the GPU alone, 85 GFLOPS for the CPU alone]
Optimizations – Auto Split
! Keep track of CPU and GPU performance and adjust the split
! wallclock() for CPU time
! CUDA event record for GPU time
! Compute the optimal split for the next iteration
cudaEventRecord(GPU_start,0);
Loop: launch GPU copies + kernels
cudaEventRecord(GPU_stop,0);
CPU_start = wallclock();
Call CPU_DGEMM
CPU_stop = wallclock();
cudaEventSynchronize(GPU_stop);
GPU_GFLOPS = GPU_FLOPS/GPU_TIME
CPU_GFLOPS = CPU_FLOPS/CPU_TIME
SPLIT = GPU_GFLOPS/(GPU_GFLOPS+CPU_GFLOPS)
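A minimal CUDA C sketch of this timing logic; wallclock() and the flop-count arguments are placeholders, and the actual copies and DGEMM calls are only indicated by comments:

    #include <cuda_runtime.h>
    #include <sys/time.h>

    static double wallclock(void) {                          // wall-clock time in seconds
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    // Time the GPU and CPU portions of one hybrid DGEMM call and return the
    // column fraction to hand to the GPU on the next call.
    double update_split(double gpu_flops, double cpu_flops)
    {
        cudaEvent_t gpu_start, gpu_stop;
        cudaEventCreate(&gpu_start);
        cudaEventCreate(&gpu_stop);

        cudaEventRecord(gpu_start, 0);
        // ... enqueue GPU copies and cublasDgemm here (asynchronous) ...
        cudaEventRecord(gpu_stop, 0);

        double cpu_t0 = wallclock();
        // ... host DGEMM on the remaining columns ...
        double cpu_time = wallclock() - cpu_t0;

        cudaEventSynchronize(gpu_stop);                       // wait for the GPU work to finish
        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_stop);   // elapsed GPU time in milliseconds

        double gpu_gflops = gpu_flops / (gpu_ms * 1.0e6);     // flops / (ms * 1e6) = GFLOPS
        double cpu_gflops = cpu_flops / (cpu_time * 1.0e9);
        cudaEventDestroy(gpu_start);
        cudaEventDestroy(gpu_stop);
        return gpu_gflops / (gpu_gflops + cpu_gflops);        // SPLIT for the next iteration
    }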
Results on a single node
Dual Intel Xeon X5560 2.8 GHz, 96 GB memory, 2 Tesla M2050
- Peak DP: 89 + 515*(2) = 604 (1119) GFLOPS
- DGEMM (about 2/3 of peak on Fermi): 89 + 350*(2) = 439 (789) GFLOPS
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032   768     1     1            2011.42              4.179e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0039415 ...... PASSED
================================================================================
417.9 GFLOPS with 1 GPU = 69% of "PEAK" or 95% of "DGEMM"

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032   768     1     2            1192.13              7.051e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0040532 ...... PASSED
================================================================================
705.1 GFLOPS with 2 GPUs = 63% of "PEAK" or 89% of "DGEMM"
Results on clusters
T/V                    N    NB     P     Q               Time                 Gflops
-------------------------------------------------------------------------------------
WR13C2L4         2359296   768    32   145            6886.10              1.271e+06
-------------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033806 ...... PASSED
53 systems with Tesla GPUs on latest Top500 (Jun 2012)
Several machines with thousands of Tesla GPUs
Results: TeraTF
! TeraTF is a 3D Euler hydrodynamics solver
! 2nd-order Godunov-type scheme, 3rd-order remapping
! Various Riemann solvers:
  ! Exact (used in the SPEC benchmark)
  ! Dukowicz (used in the benchmark from CEA)
  ! Acoustic
! Fortran 90 with MPI and OpenMP parallelization
! Porting to the GPU done with CUDA Fortran
! Part of the SPEC MPI 2007 benchmark
CUDA Fortran
! PGI / NVIDIA collaboration
! Same programming model as CUDA C
! Program the GPU in Fortran syntax
! Strongly typed – variables with the device attribute reside in GPU memory
! Use standard allocate, deallocate
! Copy between CPU and GPU with an assignment statement (GPU_array = CPU_array)
! Copy subsets of arrays with interval notation
Data Layout
! Hydro_vars(vars, i, j, k) = Array of Structs
  - CPU sequential access: efficient
  - GPU parallel "warp" access: inefficient (uncoalesced)
! Hydro_vars(i, j, k, vars) = Struct of Arrays
  - GPU parallel "warp" access: efficient (coalesced)
Data Layout
! Pad leading dimensions
  - multiple of the memory transaction granularity
  - coalescing / alignment
! Equalize leading dimensions
  - reduce register pressure / simplify addressing
[Diagram: padded i-k planes of Hydro_vars with a common plane_stride; labels: comm buffers, Hydro_vars, Cell Work, Node Work]
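A minimal C sketch of the two layouts, with the fastest-varying index written last so that, as in the Fortran code, the i index is contiguous in memory; the variable names (rho, e, ...) and the sizes are made up for illustration and do not come from TeraTF:

    #define NI     100
    #define NJ     100
    #define NK     100
    #define NI_PAD 112   /* leading dimension padded to a multiple of the memory transaction granularity */

    /* Array of Structs: all variables of one cell are adjacent.
       Good for sequential CPU sweeps; a GPU warp reading rho for 32 neighbouring
       cells touches strided addresses -> uncoalesced loads. */
    typedef struct { double rho, e, u, v, w; } cell_t;
    cell_t hydro_aos[NK][NJ][NI];          /* Hydro_vars(vars, i, j, k) */

    /* Struct of Arrays: each variable is its own contiguous, padded 3D array.
       A GPU warp reading rho for 32 consecutive i indices touches consecutive,
       aligned addresses -> coalesced loads. Equal (padded) leading dimensions let
       all variables share one plane stride, simplifying addressing. */
    typedef struct {
        double rho[NK][NJ][NI_PAD];
        double e  [NK][NJ][NI_PAD];
        double u  [NK][NJ][NI_PAD];
        double v  [NK][NJ][NI_PAD];
        double w  [NK][NJ][NI_PAD];
    } hydro_soa_t;                         /* Hydro_vars(i, j, k, vars) */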
TeraTF results
! 240^3 grid, SOD problem, Dukowicz and Exact Riemann solvers
! 8 MPI processes (2x2x2), 1 GPU per MPI process
! 8 M2050 GPUs, Dual Quad-Core Xeon X5560 (2.66 GHz)
! 4 GPUs share an x16 PCIe link (not optimal)
Accuracy of GPU and CPU results
! Dukowicz: GPU and CPU results are identical
! Exact: difference in the last digit (POW?)

                        CPU                    GPU
START:
Masse          : 0.1082530240533624 : 0.1082530240533624
Energie totale : 0.2646185032453942 : 0.2646185032453942
END (Dukowicz):
Masse          : 0.1082530240536995 : 0.1082530240536995
Energie totale : 0.2646185032454266 : 0.2646185032454266
END (Exact):
Masse          : 0.1082530240536390 : 0.1082530240536390
Energie totale : 0.2646185032458129 : 0.2646185032458130
TeraTF Performance
[Chart: speed-ups of 15x and 19x]
TeraTF Performance
MPI tasks   GPUs   Time (s)   Speed-up
1x1x1          1       691       1
2x2x2          8       112       6.1
3x3x3         27        42      16.4
4x4x4         64        24      28.7
5x5x5        125        16      42.3

Medium case of the SPEC MPI configuration (240^3, Exact Riemann solver) on a Cray XK6
Hybrid approach
! Partition the local domain across the GPU and the CPU (OpenMP)
[Diagrams: GPU/CPU partition of the local (i,k) domain, one split used for the X and Y phases and a different split for the Z phase]
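A minimal CUDA C + OpenMP sketch of this kind of split (the actual TeraTF port is CUDA Fortran; the kernel, function and array names below are placeholders): the kernel launch is asynchronous, so the OpenMP loop over the CPU slice runs while the GPU works on its slice.

    #include <cuda_runtime.h>
    #include <omp.h>

    __global__ void update_gpu_slice(double *q, int n) {     // placeholder GPU work on its part of the domain
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) q[i] += 1.0;
    }

    static void update_cpu_slice(double *q, int n) {         // placeholder CPU work, threaded with OpenMP
        #pragma omp parallel for
        for (int i = 0; i < n; i++) q[i] += 1.0;
    }

    // One phase of the hybrid scheme: GPU and CPU each update their slice concurrently.
    void hybrid_phase(double *dev_q, int n_gpu, double *host_q, int n_cpu)
    {
        update_gpu_slice<<<(n_gpu + 255) / 256, 256>>>(dev_q, n_gpu);  // asynchronous launch
        update_cpu_slice(host_q, n_cpu);                               // overlaps with the GPU kernel
        cudaDeviceSynchronize();        // both slices done; exchange boundary data before the next phase
    }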
TeraTF Hybrid Performance
Speed-up relative to 1 CPU core:
  1 CPU core       1
  8 OMP            3.96
  8 MPI            6.4
  GPU             21
  GPU + 8 OMP     27
Conclusions
• It is possible to fully utilize the computational resources available on hybrid clusters
• With the right software design, PCI-e transfer time can be hidden
• Improved power efficiency