MULTI GPU PROGRAMMING (WITH MPI)
Massimo Bernaschi National Research Council of Italy
SOME OF THE SLIDES COME FROM A TUTORIAL PRESENTED DURING THE 2014 GTC
CONFERENCE BY Jiri Kraus and Peter Messmer (NVIDIA)
Download the sample codes from: http://twin.iac.rm.cnr.it/cccsample.tgz
To unpack the codes: tar zxvf cccsample.tgz
Get http://twin.iac.rm.cnr.it/CUDA_C_QuickRef.pdf
WHAT YOU WILL LEARN
§ How to use more than one GPU for the same task
§ How to use MPI for inter-GPU communication with CUDA
§ What CUDA-aware MPI is
§ What the Multi Process Service is and how to use it
§ How to use NVIDIA tools in an MPI environment
§ How to hide MPI communication times
MULTI-GPU MEMORY
ü GPUs do not share global memory
ü But starting with CUDA 4.0, one GPU can copy data from/to another GPU's memory directly if the GPUs are connected to the same PCIe switch
ü Inter-GPU communication
ü Application code is responsible for copying/moving data between GPUs
ü Data travels across the PCIe bus
ü Even when GPUs are connected to the same PCIe switch!
SHARING DATA BETWEEN GPUS
§ Options
— Explicit copies via host
— Zero-copy shared host array (see the sketch below)
— Per-device arrays with peer-to-peer exchange transfers
— Peer-to-peer memory access
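A minimal sketch of the zero-copy option, assuming a device that supports mapped host memory (the kernel and the names h_buf, d_view, N, blocks, threads are illustrative, not part of the sample codes):

/* page-locked host memory mapped into the device address space:
   the GPU reads/writes it directly over PCIe, with no explicit cudaMemcpy */
float *h_buf = NULL, *d_view = NULL;
cudaSetDeviceFlags(cudaDeviceMapHost);                 /* must be set before the context is created */
cudaHostAlloc((void**)&h_buf, N * sizeof(float),
              cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostGetDevicePointer((void**)&d_view, h_buf, 0);   /* device-side alias of h_buf */
kernel<<<blocks, threads>>>(d_view);                   /* kernel accesses host memory over PCIe */
cudaDeviceSynchronize();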
MULTI-GPU ENVIRONMENT
ü GPUs have consecutive integer IDs, starting with 0
ü Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
ü cudaSetDevice allows changing the “active” GPU (and so the context)
ü Device 0 is chosen when cudaSetDevice is not called
ü GPUs are ordered by decreasing performance
ü Remember that multiple host threads can establish contexts with the same GPU
ü The driver handles time-sharing and resource partitioning unless the GPU is in exclusive mode
MULTI-GPU ENVIRONMENT: A SIMPLE APPROACH
// Run an independent kernel on each CUDA device
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    kernel<<<blocks, threads>>>(args);
}
Exercise: Compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run.
Attention! Add the -arch=sm_20 option: nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_20
CUDA FEATURES USEFUL FOR MULTI-GPU
• Control multiple GPUs with a single CPU thread
– Simpler coding: no need for CPU multithreading
• Streams:
– Enable executing kernels and memcopies concurrently
– Up to 2 concurrent memcopies: to/from GPU
• Peer-to-Peer (P2P) GPU memory copies
– Transfer data between GPUs using PCIe P2P support
– Done by GPU DMA hardware – the host CPU is not involved
• Data traverses PCIe links, without touching CPU memory
– Disjoint GPU pairs can communicate simultaneously
PEER-TO-PEER GPU COMMUNICATION
• Before CUDA 4.0, GPUs could not exchange data directly.
• With CUDA >= 4.0, two GPUs connected to the same PCI Express switch can exchange data directly.
Example: read p2pcopy.cu, compile: nvcc -o p2pcopy p2pcopy.cu -arch=sm_30
cudaStreamCreate(&stream_on_gpu_0);
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, 0);    /* allow GPU 0 to access GPU 1 memory */
cudaMalloc(&d_0, num_bytes);         /* create array on GPU 0 */
cudaSetDevice(1);
cudaMalloc(&d_1, num_bytes);         /* create array on GPU 1 */
cudaMemcpyPeerAsync(d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0);
/* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */
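Peer access is not available on every topology, so it is usually checked first; a minimal sketch (the variable names are illustrative):

int can_access_peer = 0;
cudaDeviceCanAccessPeer(&can_access_peer, 0, 1);   /* can GPU 0 access GPU 1? */
if (can_access_peer) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);              /* second argument: flags, must be 0 */
} else {
    /* fall back to staging the copy through host memory */
}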
INTER-GPU COMMUNICATION WITH MPI
• Example: simpleMPI.c
– Generate some random numbers on one node.
– Dispatch them to all nodes.
– Compute their square root on each node's GPU.
– Compute the average of the results by using MPI.
To compile:
nvcc -c simpleCUDAMPI.cu
$ mpic++ -o simpleMPI simpleMPI.c simpleCUDAMPI.o -L/usr/local/cuda/lib64/ -lcudart   (on a single line!)
To run:
$ mpirun -np 2 simpleMPI
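As a rough sketch of the last step (computing the average with MPI), something along these lines could be used; the variable names and the use of MPI_Reduce are assumptions for illustration, not the actual simpleMPI.c code:

/* each rank has accumulated the sum of its square roots in local_sum */
double local_sum = 0.0, global_sum = 0.0;
/* ... fill local_sum from the GPU results ... */
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("average = %f\n", global_sum / total_num_elements);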
MPI+CUDA
[Figure: a cluster of n nodes (Node 0, Node 1, …, Node n-1); each node has a CPU with its system memory, a GPU with its GDDR5 memory attached over PCI-e, and a network card connecting it to the other nodes.]
MPI+CUDA
//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);
//MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
MESSAGE PASSING INTERFACE - MPI
§ Standard to exchange data between processes via messages
— Defines an API to exchange messages
§ Point-to-point: e.g. MPI_Send, MPI_Recv
§ Collectives: e.g. MPI_Reduce
§ Multiple implementations (open source and commercial)
— Bindings for C/C++, Fortran, Python, …
— E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …
MPI – A MINIMAL PROGRAM #include <mpi.h>
int main(int argc, char *argv[]) {
int rank,size;
/* Initialize the MPI library */
MPI_Init(&argc,&argv);
/* Determine the calling process rank and total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
/* Call MPI routines like MPI_Send, MPI_Recv, ... */
...
/* Shutdown MPI library */
MPI_Finalize();
return 0;
}
MPI – COMPILING AND LAUNCHING
$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>
[Figure: mpirun starts 4 instances of myapp]
Solving the Laplace equation with the Jacobi method
[Figure: 5-point stencil around T(i,j), with neighbours T(i-1,j), T(i+1,j), T(i,j-1), T(i,j+1)]
The Laplace equation in 2D (elliptic PDE)
The Jacobi method: each point is updated as the average of its nearest neighbours
EXAMPLE: JACOBI SOLVER – SINGLE GPU
While not converged
§ Do Jacobi step:
for (int i=1; i < n-1; i++)
    for (int j=1; j < m-1; j++)
        Tnew[i][j] = 0.25f*(T[i-1][j] + T[i+1][j]
                          + T[i][j-1] + T[i][j+1]);
§ Copy Tnew to T, or swap Tnew and T
§ Next iteration
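On the GPU the double loop becomes a kernel over the interior points; a minimal sketch, assuming row-major n x m arrays stored as flat device pointers (the kernel and variable names are illustrative, not those used in the exercise code):

__global__ void jacobi_step(const float *T, float *Tnew, int n, int m)
{
    int i = blockIdx.y*blockDim.y + threadIdx.y + 1;   // skip boundary row
    int j = blockIdx.x*blockDim.x + threadIdx.x + 1;   // skip boundary column
    if (i < n-1 && j < m-1)
        Tnew[i*m + j] = 0.25f*(T[(i-1)*m + j] + T[(i+1)*m + j]
                             + T[i*m + (j-1)] + T[i*m + (j+1)]);
}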
Exercise LAPLACE CUDA (serial)
• wget http://twin.iac.rm.cnr.it/eserlaplace.tgz
• tar zxvf eserlaplace.tgz
• cd LAPLACE_CNAF/SERIAL
• C version in the laplace_serial.c, laplace_serial_cuda.c and cuda_tools.c files
• CUDA version activated by using -DCUDA at compile time (otherwise code within #ifdef CUDA is not compiled!)
• Use the Makefile: make c_serial
make c_cuda
Exercise: complete the CUDA parts indicated by EXERCISE in laplace_serial_cuda.c
EXAMPLE: JACOBI SOLVER – MULTI GPU
While not converged
§ Do Jacobi step:
for (int i=1; i < n-1; i++)
    for (int j=1; j < m-1; j++)
        Tnew[i][j] = 0.25f*(T[i-1][j] + T[i+1][j]
                          + T[i][j-1] + T[i][j+1]);
§ Exchange halo with neighbors
§ Copy Tnew to T, or swap Tnew and T
§ Next iteration
EXAMPLE: JACOBI SOLVER
§ Solves the 2D Laplace equation on a rectangle
   Δu(x,y) = 0   ∀ (x,y) ∈ Ω\∂Ω
— Dirichlet boundary conditions (constant values on the boundary)
   u(x,y) = f(x,y)   ∀ (x,y) ∈ ∂Ω
§ 2D domain decomposition with n x k domains
[Figure: k x n grid of MPI ranks, from Rank (0,0), Rank (0,1), …, Rank (0,n-1) in the first row down to Rank (k-1,0), …, Rank (k-1,n-1) in the last row.]
EXAMPLE: JACOBI TOP/BOTTOM HALO UPDATE
MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
Tnew+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
Tnew+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
(Plain C on the CPU: host pointers are passed to MPI.)
EXAMPLE: JACOBI TOP/BOTTOM HALO UPDATE
MPI_Sendrecv(Tnew_d+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
Tnew_d+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(Tnew_d+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
Tnew_d+offset_top_bondary, m-2, MPI_DOUBLE, t_nb, 1,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
(CUDA: device pointers, Tnew_d, are passed directly to MPI.)
Depending on the CUDA (and MPI) version, the MPI_Sendrecv may work or fail. We may need to copy data from the GPU to the CPU to exchange them with other MPI tasks.
EXAMPLE: JACOBI – TOP/BOTTOM HALO UPDATE – WITHOUT CUDA-AWARE MPI
//send to top and receive from bottom – the opposite direction is omitted
cudaMemcpy(Tnew+1, Tnew_d+1, (m-2)*sizeof(double), cudaMemcpyDeviceToHost);
MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
Tnew+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(Tnew_d, Tnew, (m-2)*sizeof(double), cudaMemcpyHostToDevice);
EXAMPLE: JACOBI – LEFT/RIGHT HALO UPDATE
//right neighbor omitted
cuda_copy_to_buffer<<<gs,bs,0,s>>>(to_left_d, u_new_d, n, m);
cudaStreamSynchronize(s);
MPI_Sendrecv( to_left_d, n-2, MPI_DOUBLE, l_nb, 0,
from_left_d, n-2, MPI_DOUBLE, l_nb, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE );
cuda_copy_from_buffer<<<gs,bs,0,s>>>(u_new_d, from_left_d, n, m);
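The left/right halos are not contiguous in memory, hence the pack/unpack kernels; a minimal sketch of what cuda_copy_to_buffer might look like, assuming a row-major n x m layout and one thread per interior row (this is an illustration, not the exercise's actual implementation):

__global__ void cuda_copy_to_buffer(double *to_left_d, const double *u_new_d, int n, int m)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x + 1;   // interior rows 1 .. n-2
    if (i < n-1)
        to_left_d[i-1] = u_new_d[i*m + 1];             // first interior column
}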
Jacobi method with MPI Domain Decomposition
T = T0
until (deltaT_max < tol)
  1. Update the halo regions (MPI_Sendrecv)
  2. Tnew(i,j) = 0.25*(T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1))
  3. deltaT(i,j) = abs(Tnew(i,j) - T(i,j))
  4. deltaT_max_loc = max(deltaT)
  5. compute deltaT_max (MPI_Allreduce)
  6. T = Tnew
• The exercise/sample makes use of the MPI Cartesian decomposition (see the sketch below)
• Each MPI process is in charge of a subdomain and, before executing the update, needs to receive data on the boundaries of its subdomain
• For the evaluation of the global error it is also necessary to carry out a distributed reduction operation (MPI_Allreduce).
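A minimal sketch of how the Cartesian decomposition and the neighbour ranks could be set up with MPI (the variable names are illustrative):

int dims[2] = {0, 0}, periods[2] = {0, 0};
int nprocs, t_nb, b_nb, l_nb, r_nb;
MPI_Comm cart_comm;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 2, dims);                   /* choose a k x n process grid */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
MPI_Cart_shift(cart_comm, 0, 1, &t_nb, &b_nb);      /* top/bottom neighbours */
MPI_Cart_shift(cart_comm, 1, 1, &l_nb, &r_nb);      /* left/right neighbours */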
Exercise LAPLACE CUDA MPI
• wget http://twin.iac.rm.cnr.it/eserlaplace.tgz
• tar zxvf eserlaplace.tgz
• cd LAPLACE/MPI
• C version in the laplace_cuda_mpi.c and cuda_tools.c files
• CUDA version activated by using -DCUDA at compile time (otherwise code within #ifdef CUDA is not compiled!)
• Use the Makefile:
  make c_mpi
  make c_mpi_cuda
Exercise: complete the CUDA parts indicated by EXERCISE. The copies to/from the MPI buffers correspond to kernels that must be implemented.
UNIFIED VIRTUAL ADDRESSING
[Figure: without UVA, CPU system memory and GPU memory are separate address spaces, each starting at 0x0000; with UVA, CPU and GPU memory share a single virtual address space across PCI-e.]
UNIFIED VIRTUAL ADDRESSING
§ One address space for all CPU and GPU memory
— Determine the physical memory location from a pointer value
— Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
§ Supported on devices with compute capability >= 2.0 for
— 64-bit applications on Linux and on Windows (TCC mode)
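With UVA a library can ask the runtime where a pointer lives; a minimal sketch (the CUDA 6-era field name memoryType is assumed here, later releases renamed it):

int is_device_pointer(const void *buf)
{
    /* with UVA the runtime can tell host from device memory for any pointer */
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess)
        return 0;                      /* e.g. plain malloc'ed host memory on older runtimes */
    return attr.memoryType == cudaMemoryTypeDevice;
}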
[Figure: without CUDA-awareness, a message travels GPU buffer → (cudaMemcpy) → host buffer → (memcpy) → pinned fabric buffer before reaching the network, and through the corresponding pinned fabric buffer on the receiving side.]
MPI+CUDA
With UVA and CUDA-aware MPI:
//MPI rank 0
MPI_Send(s_buf_d, size, …);
//MPI rank n-1
MPI_Recv(r_buf_d, size, …);
No UVA and regular MPI:
//MPI rank 0
cudaMemcpy(s_buf_h, s_buf_d, size, …);
MPI_Send(s_buf_h, size, …);
//MPI rank n-1
MPI_Recv(r_buf_h, size, …);
cudaMemcpy(r_buf_d, r_buf_h, size, …);
NVIDIA GPUDIRECT™: ACCELERATED COMMUNICATION WITH NETWORK & STORAGE DEVICES
[Figure: two GPUs (GPU1, GPU2) with their memories, CPU, chipset, system memory and an InfiniBand (IB) adapter on PCI-e; with GPUDirect the GPU and the IB adapter share the same pinned system-memory buffer, avoiding an extra copy through CPU memory.]
NVIDIA GPUDIRECT™: PEER TO PEER TRANSFERS
[Figure: same hardware layout; with GPUDirect P2P, GPU1 and GPU2 exchange data directly over PCI-e, without staging through system memory.]
NVIDIA GPUDIRECT™: SUPPORT FOR RDMA
[Figure: same hardware layout; with GPUDirect RDMA, the IB adapter reads/writes GPU memory directly over PCI-e, so network transfers bypass system memory entirely.]
CUDA-AWARE MPI
Example:
MPI Rank 0: MPI_Send from a GPU buffer
MPI Rank 1: MPI_Recv to a GPU buffer
§ Shows how CUDA+MPI works in principle
— Depending on the MPI implementation, message size, system setup, … the situation might be different
§ Two GPUs in two nodes
MPI GPU TO REMOTE GPU: GPUDIRECT SUPPORT FOR RDMA
//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
[Figure: timeline for ranks 0 and 1 (GPU and host rows); with GPUDirect RDMA the message goes straight from GPU memory on rank 0 to GPU memory on rank 1.]
REGULAR MPI GPU TO REMOTE GPU
//MPI rank 0
cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);
[Figure: timeline for ranks 0 and 1 (GPU and host rows); the device-to-host copy, the MPI transfer between hosts, and the host-to-device copy execute one after the other.]
MPI GPU TO REMOTE GPU WITHOUT GPUDIRECT
//MPI rank 0
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
[Figure: timeline for ranks 0 and 1 (GPU and host rows) showing the transfer steps when GPUDirect is not available.]
[Figure: timeline of the MPI_Sendrecv steps without GPUDirect; the copies and the network transfer can be pipelined.]
More details on pipelines: S4158 - CUDA Streams: Best Practices and Common Pitfalls (Tuesday 03/27 210A)
PERFORMANCE RESULTS: TWO NODES
OpenMPI 1.7.4, MLNX FDR IB (4X), Tesla K40
[Figure: bandwidth (MB/s) vs message size (1 byte to 4 MB) for regular MPI, CUDA-aware MPI, and CUDA-aware MPI with GPUDirect RDMA.]
Latency (1 byte): regular MPI 19.04 us, CUDA-aware MPI 16.91 us, CUDA-aware MPI with GPUDirect RDMA 5.52 us
PERFORMANCE RESULTS: TWO NODES
OpenMPI 1.7.4, MLNX FDR IB (4X), Tesla K40
[Figure: latency (us, log scale) vs message size (1 byte to 4 MB) for regular MPI, CUDA-aware MPI, and CUDA-aware MPI with GPUDirect RDMA.]
GPU ACCELERATION OF A LEGACY MPI APPLICATION
§ Typical legacy application
— MPI parallel
— Single or few threads per MPI rank (e.g. OpenMP)
§ Running with multiple MPI ranks per node
§ GPU acceleration in phases
— Proof of concept prototype, …
— Great speedup at kernel level
§ Application performance misses expectations
[Figure: runtime bars split into serial part, CPU-parallel part and GPU-parallelizable part for N = 1, 2, 4, 8 ranks, comparing multicore CPU only with GPU-accelerated runs, without and with Hyper-Q/MPS (available on K20, K40).]
PROCESSES SHARING A GPU WITHOUT MPS: NO OVERLAP
[Figure: Process A and Process B each own a separate context (Context A, Context B); the GPU serializes work from the two contexts, so kernels from A and B do not overlap.]
PROCESSES SHARING A GPU WITH MPS: MAXIMUM OVERLAP
[Figure: Process A and Process B submit work through the MPS process; kernels from Process A and Process B run concurrently on the GPU.]
CASE STUDY: HYPER-Q/MPS FOR UMT
— Enables overlap between copy and compute of different processes
— Sharing the GPU between multiple MPI ranks increases GPU utilization
USING MPS
- No application modifications necessary
- Not limited to MPI applications
- MPS control daemon
  - Spawns the MPS server upon CUDA application startup
- Typical setup:
  export CUDA_VISIBLE_DEVICES=0
  nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
  nvidia-cuda-mps-control -d
- On Cray XK/XC systems: export CRAY_CUDA_MPS=1
USING MPS ON MULTI-GPU SYSTEMS
- The MPS server only supports a single GPU
- Use one MPS server per GPU
- Target a specific GPU by setting CUDA_VISIBLE_DEVICES
- Adjust the pipe/log directory:
  export DEVICE=0
  export CUDA_VISIBLE_DEVICES=${DEVICE}
  export CUDA_MPS_PIPE_DIRECTORY=${HOME}/mps${DEVICE}/pipe
  export CUDA_MPS_LOG_DIRECTORY=${HOME}/mps${DEVICE}/log
  nvidia-cuda-mps-control -d
  export DEVICE=1
  …
- More at http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
MPS SUMMARY § Easy path to get GPU acceleration for legacy applications § Enables overlapping of memory copies and compute between
different MPI ranks
TOOLS FOR MPI+CUDA APPLICATIONS
§ Memory checking: cuda-memcheck
§ Debugging: cuda-gdb
§ Profiling: nvprof and the NVIDIA Visual Profiler
MEMORY CHECKING WITH CUDA-MEMCHECK
§ cuda-memcheck is a functional correctness checking suite similar to the valgrind memcheck tool
§ Can be used in an MPI environment:
mpirun -np 2 cuda-memcheck ./myapp <args>
§ Problem: the output of different processes is interleaved
— Use the save/log-file command line options and a launcher script
#!/bin/bash
LOG=$1.$OMPI_COMM_WORLD_RANK
#LOG=$1.$MV2_COMM_WORLD_RANK
cuda-memcheck --save $LOG.memcheck $*
mpirun -np 2 cuda-memcheck-script.sh ./myapp <args>
DEBUGGING MPI+CUDA APPLICATIONS: USING CUDA-GDB WITH MPI APPLICATIONS
§ You can use cuda-gdb just like gdb, with the same tricks
§ For smaller applications, just launch xterms and cuda-gdb:
> mpirun -np 4 xterm -e cuda-gdb --cuda-use-lockfile=0 ./myapp
DEBUGGING MPI+CUDA APPLICATIONS: CUDA-GDB ATTACH
§ CUDA 5.0 and later can attach to a running process:
if ( rank == 0 ) {
int i=0;
printf("rank %d: pid %d on %s ready for attach\n.", rank, getpid(),name);
while (0 == i) {
sleep(5);
}
}
> mpiexec -np 2 ./jacobi_mpi+cuda
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and one Tesla M2070 for each process (2049 rows per process).
rank 0: pid 30034 on judge107 ready for attach
> ssh judge107
jkraus@judge107:~> cuda-gdb --pid 30034
PROFILING MPI+CUDA APPLICATIONS: USING NVPROF+NVVP
Three usage modes:
1. Embed the pid in the output filename: mpirun -np 2 nvprof --output-profile profile.out.%p
2. Only save the textual output: mpirun -np 2 nvprof --log-file profile.out.%p
3. Collect profile data on all processes that run on a node: nvprof --profile-all-processes -o profile.out.%p
EXERCISE: profile your laplace_mpi_cuda code by using the second mode.
PROFILING MPI+CUDA APPLICATIONS: THIRD PARTY TOOLS
§ Multiple parallel profiling tools are CUDA-aware
— Score-P
— Vampir
— Tau
§ These tools are good for discovering MPI issues as well as basic CUDA performance inhibitors
BEST PRACTICE: USE NON-BLOCKING MPI
MPI_Request t_b_req[4];
MPI_Irecv(Tnew+offset_top_bondary,m-2,MPI_DOUBLE,t_nb,0,MPI_COMM_WORLD,t_b_req);
MPI_Irecv(Tnew+offset_bottom_bondary,m-2,MPI_DOUBLE,b_nb,1,MPI_COMM_WORLD,t_b_req+1);
MPI_Isend(Tnew+offset_last_row,m-2,MPI_DOUBLE,b_nb,0,MPI_COMM_WORLD,t_b_req+2);
MPI_Isend(Tnew+offset_first_row,m-2,MPI_DOUBLE,t_nb,1,MPI_COMM_WORLD,t_b_req+3);
MPI_Waitall(4, t_b_req, MPI_STATUSES_IGNORE);
MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
Tnew+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
Tnew+offset_top_bondary, m-2, MPI_DOUBLE, t_nb, 1,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
The first snippet (non-blocking) and the second (blocking) are equivalent, but the non-blocking version gives MPI more opportunities to build efficient pipelines.
OPTIMIZED CODE WITH “OLD STYLE” MPI
• Pipelining at user level with non-blocking MPI and CUDA functions
• Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_buf + j*block_sz, s_device + j*block_sz, …);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_buf + j*block_sz, block_sz, MPI_CHAR, 1, 1, …);
}
MPI_Waitall(…);
GENERAL MULTI-GPU PROGRAMMING PATTERN
• The goal is to hide the communication cost
– Overlap it with computation
• So, at every time step, each GPU should:
– Compute the parts (halos) to be sent to the neighbors
– Compute the internal region (bulk)
– Exchange halos with the neighbors (overlapped with the bulk computation)
• Linear scaling as long as the internal computation takes longer than the halo exchange
– Actually, the separate halo computation adds some overhead
OVERLAPPING COMMUNICATION AND COMPUTATION
[Figure: two timelines. No overlap: process the whole domain, then MPI halo exchange. Overlap: process the boundary, then run the MPI exchange concurrently with processing the inner domain (the dependency is only on the boundary), giving a possible speedup.]
PAGE-LOCKED DATA TRANSFERS
ü cudaMallocHost() allows allocation of page-locked (“pinned”) host memory
ü Same syntax as cudaMalloc(), but works with CPU pointers and memory
ü Enables the highest cudaMemcpy performance
ü 3.2 GB/s on PCI-e x16 Gen1
ü 6.5 GB/s on PCI-e x16 Gen2
ü Use with caution!!!
ü Allocating too much page-locked memory can reduce overall system performance
See and test the bandwidth.cu sample
OVERLAPPING DATA TRANSFERS AND COMPUTATION
ü The async copy and stream APIs allow overlap of H2D or D2H data transfers with computation
ü CPU computation can overlap data transfers on all CUDA-capable devices
ü Kernel computation can overlap data transfers on devices with “concurrent copy and execution” (compute capability >= 1.1)
ü Stream = sequence of operations that execute in order on the GPU
ü Operations from different streams can be interleaved
ü The stream ID is used as an argument to async calls and kernel launches
ü An asynchronous host-device memory copy returns control immediately to the CPU
ü cudaMemcpyAsync(dst, src, size, dir, stream);
ü requires pinned host memory (allocated with cudaMallocHost)
ü Overlap CPU computation with data transfer (0 = default stream):
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();
ASYNCHRONOUS DATA TRANSFERS
[Figure: timeline showing the asynchronous copy and kernel overlapped with the CPU function.]
OVERLAPPING KERNEL AND DATA TRANSFER
ü Requires:
ü Kernel and transfer use different, non-zero streams
ü For the creation of a stream: cudaStreamCreate
ü Remember that a CUDA call to stream 0 blocks until all previous calls complete and cannot be overlapped
ü Example:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);  // overlapped
Exercise: read, compile and run simpleStreams.cu
OVERLAPPING COMMUNICATION AND COMPUTATION
process_boundary_and_pack<<<gs_b,bs_b,0,s1>>>(u_new_d,u_d,to_left_d,to_right_d,n,m);
process_inner_domain<<<gs_id,bs_id,0,s2>>>(u_new_d, u_d,to_left_d,to_right_d,n,m);
cudaStreamSynchronize(s1); //wait for boundary
MPI_Request req[8];
//Exchange halo with left, right, top and bottom neighbor
// Using non-blocking MPI primitives
MPI_Waitall(8, req, MPI_STATUSES_IGNORE);
unpack<<<gs_s,bs_s>>>(u_new_d, from_left_d, from_right_d, n, m);
cudaDeviceSynchronize(); //wait for iteration to finish
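The elided exchange could, for instance, post the four receives and four sends suggested by the eight requests; a sketch assuming CUDA-aware MPI and the buffer/neighbour names used in the previous slides (tags and exact offsets are illustrative):

MPI_Irecv(from_left_d,  n-2, MPI_DOUBLE, l_nb, 1, MPI_COMM_WORLD, req+0);
MPI_Irecv(from_right_d, n-2, MPI_DOUBLE, r_nb, 0, MPI_COMM_WORLD, req+1);
MPI_Irecv(u_new_d+offset_top_bondary,    m-2, MPI_DOUBLE, t_nb, 3, MPI_COMM_WORLD, req+2);
MPI_Irecv(u_new_d+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 2, MPI_COMM_WORLD, req+3);
MPI_Isend(to_left_d,  n-2, MPI_DOUBLE, l_nb, 0, MPI_COMM_WORLD, req+4);
MPI_Isend(to_right_d, n-2, MPI_DOUBLE, r_nb, 1, MPI_COMM_WORLD, req+5);
MPI_Isend(u_new_d+offset_first_row, m-2, MPI_DOUBLE, t_nb, 2, MPI_COMM_WORLD, req+6);
MPI_Isend(u_new_d+offset_last_row,  m-2, MPI_DOUBLE, b_nb, 3, MPI_COMM_WORLD, req+7);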
HANDLING MULTI GPU NODES
§ Multi GPU nodes and GPU-affinity: — Use local rank:
int local_rank = //determine local rank
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
cudaSetDevice(local_rank % num_devices);
— Look at function assignDeviceToProcess in file assign_process_gpu.c (directory LAPLACE_CNAF/MPI)
— Use exclusive process mode + cudaSetDevice(0)
HANDLING MULTI GPU NODES
§ How to determine local rank: — Rely on process placement (with one rank per GPU)
int rank = 0;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int num_devices = 0;
cudaGetDeviceCount(&num_devices); // num_devices == ranks per node
int local_rank = rank % num_devices;
— Use environment variables provided by MPI launcher § e.g. for OpenMPI
int local_rank = atoi(getenv("OMPI_COMM_WORLD_LOCAL_RANK"));
§ e.g. for MVAPICH2
int local_rank = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));
HOMEWORK
§ Implement the Jacobi method for the solution of the 2D Laplace equation by using CUDA streams and MPI non-blocking point-to-point primitives to overlap computation and communication.
§ Develop a 3D solution by using a decomposition along one axis (e.g., the Z direction)
CONCLUSIONS
§ Using MPI as abstraction layer for Multi GPU programming allows multi GPU programs to scale beyond a single node
— CUDA-aware MPI delivers ease of use, reduced network latency and increased bandwidth
§ All NVIDIA tools are usable, and third party tools are available
§ Multiple CUDA-aware MPI implementations are available
— OpenMPI, MVAPICH2, Cray, IBM Platform MPI
§ With CUDA streams it is possible to use the CPU as a GPU network-coprocessor.
OVERLAPPING COMMUNICATION AND COMPUTATION – TIPS AND TRICKS
§ CUDA-aware MPI might use the default stream
— Allocate streams with the non-blocking flag (cudaStreamNonBlocking)
§ In case of multiple kernels for boundary handling the kernel processing the inner domain might sneak in
— Use single stream or events for inter stream dependencies via cudaStreamWaitEvent – disables overlapping of boundary and inner domain kernels
— Use high priority streams for boundary handling kernels – allows overlapping of boundary and inner domain kernels
§ As of CUDA 6.0, GPUDirect P2P transfers in a multi-process setting can overlap; disable GPUDirect P2P for older releases
HIGH PRIORITY STREAMS
§ Improve scalability with high priority streams (cudaStreamCreateWithPriority)
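A minimal sketch of creating a high-priority stream for the boundary/halo work (the stream names are illustrative):

int prio_low, prio_high;
cudaStream_t stream_inner, stream_boundary;
cudaDeviceGetStreamPriorityRange(&prio_low, &prio_high);   /* lower number = higher priority */
cudaStreamCreateWithPriority(&stream_inner,    cudaStreamNonBlocking, prio_low);
cudaStreamCreateWithPriority(&stream_boundary, cudaStreamNonBlocking, prio_high);
/* boundary/halo kernels go to stream_boundary so they are scheduled ahead of inner-domain work */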
[Figure: two stream timelines. Without priorities: Stream 1 computes local forces while Stream 2 exchanges non-local atom positions, computes non-local forces and exchanges non-local forces. With Stream 1 at low priority (LP) and Stream 2 at high priority (HP), the non-local work is scheduled earlier, giving a possible speedup.]