EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM
Sreeram Potluri, Anshuman Goswami – NVIDIA
Manjunath Gorentla Venkata, Neena Imam – ORNL
2
SCOPE OF THE WORK
Reliance on CPU for communication limits scaling on GPU clusters
NVSHMEM provides GPU kernel-side OpenSHMEM API
NVLink is a high-bandwidth memory-like fabric between GPUs
Evaluate the effectiveness of NVSHMEM over NVLink using Breadth First Search
CPU-initiated MPI vs. GPU-initiated SHMEM in a particular implementation of BFS
3
AGENDA
GPU Programming Models: MPI and SHMEM
NVSHMEM Overview
DGX and NVLink
MPI+CUDA implementation of BFS
SHMEM implementation of BFS
Performance Evaluation
Conclusion and Future Work
4
GPU CLUSTER PROGRAMMING: OFFLOAD MODEL
Compute on GPU
Communication from CPU
Synchronization at boundaries
Overheads on GPU Clusters
Offload latencies
Synchronization overheads
Limits strong scaling
More reliance on the CPU means more power consumed
void 2dstencil (u, v, …)
{
  for (timestep = 0; …) {
    interior_compute_kernel <<<…>>> (…);   // compute interior points (no halo needed)
    pack_kernel <<<…>>> (…);               // pack boundary data into send buffers
    cudaStreamSynchronize(…);              // CPU waits for the GPU before communicating
    MPI_Irecv(…);
    MPI_Isend(…);
    MPI_Waitall(…);                        // CPU drives the halo exchange
    unpack_kernel <<<…>>> (…);             // unpack received halos
    boundary_compute_kernel <<<…>>> (…);   // compute boundary points
    …
  }
}
5
GPU-CENTRIC COMMUNICATION
GPU capabilities:
Warp scheduling to hide latencies to global memory
Implicit coalescing of loads/stores to achieve efficiency
CUDA helps program to these capabilities; they should also benefit accesses to data over the network
Direct access to remote memory simplifies programming
Continuous fine-grained accesses smooth traffic over the network, achieving efficiency while keeping programming simple
6
GPU GLOBAL ADDRESS SPACE
[Figure: symmetric heap resident in GPU memory on each PE, forming the GPU global address space]
7
COMMUNICATION FROM CUDA KERNELS
Long running CUDA kernels
Communication within parallelism
__global__ void 2dstencil (u, v, sync, …)
{
  for (timestep = 0; …) {
    if (i+1 > nx) {
      v[i+1] = shmem_float_g (v[1], rightpe);   // get right halo from neighbor PE
    }
    if (i-1 < 1) {
      v[i-1] = shmem_float_g (v[nx], leftpe);   // get left halo from neighbor PE
    }
    u[i] = (u[i] + (v[i+1] + v[i-1] . . .
    if (i < 2) {
      shmem_int_p (sync + i, 1, peers[i]);
      shmem_quiet();
      shmem_wait_until (sync + i, EQ, 1);
    }
    //intra-kernel sync
    …
  }
}
void 2dstencil (u, v, …)
{
stencil_kernel <<<…>>> (…)
}
8
NVSHMEM
Implementation of the OpenSHMEM specification (1.3)
Symmetric heap resides in GPU memory
Implements multi-threading support as being discussed in PR #43
Extensions for GPUs:
CUDA stream-based SHMEM communication
Thread-cooperative SHMEM communication
Expected availability: Q4 2017 (single-OS multi-GPU nodes: NVLink, PCIe, QPI)
API categories (supported on HOST/GPU or HOST ONLY):
Library setup, exit and query
Memory management
shmem_ptr
Remote memory access
Atomic memory operations
Collectives
Point-to-point synchronization
Memory ordering
Locking routines
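To make the host/GPU split above concrete, here is a minimal host-side sketch. It assumes the standard OpenSHMEM host API (shmem_init, shmem_malloc, shmem_my_pe); the prototype's exact entry points and header may differ. Setup and symmetric allocation happen on the host, while the allocated buffer lives in GPU memory and is used for communication from inside the kernel.

#include <shmem.h>
#include <cuda_runtime.h>

// Kernel body elided; device-side shmem_* calls (puts/gets/atomics) would go here.
__global__ void bfs_step(int *sym_buf, int mype, int npes)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) sym_buf[0] = mype;
}

int main(void)
{
    shmem_init();                                  // host-only: library setup
    int mype = shmem_my_pe();
    int npes = shmem_n_pes();

    // Host-only: symmetric allocation; with NVSHMEM the symmetric heap resides in
    // GPU memory, so sym_buf is usable by kernels and remotely accessible by peer PEs
    int *sym_buf = (int *) shmem_malloc(1 << 20);

    bfs_step<<<128, 256>>>(sym_buf, mype, npes);   // GPU-initiated communication
    cudaDeviceSynchronize();

    shmem_free(sym_buf);
    shmem_finalize();                              // host-only: library exit
    return 0;
}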
9
DGX-1 ARCHITECTURE
[Figure: DGX-1 topology — 8 P100 GPUs interconnected by NVLink, attached in pairs to 4 PCIe (PLX) switches, 2 CPUs, and 4 InfiniBand adapters at 100 Gb/sec each]
10
Inter-GPU Copies – Single Process
[Figure: GPU 0 – CPU – GPU 1]
cudaSetDevice(0); cudaMalloc(&buf0, size);
cudaSetDevice(1); cudaMalloc(&buf1, size);
…
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);
Unified Virtual Addressing available since CUDA 4.0 (2011)
11
GPUDirect P2P: by-passing CPU memory
cudaSetDevice(0); cudaMalloc(&buf0, size);
cudaDeviceCanAccessPeer(&access, 0, 1);
assert(access == 1);
cudaDeviceEnablePeerAccess(1, 0);
cudaSetDevice(1); cudaMalloc(&buf1, size);
…
cudaSetDevice(0);
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);

cudaDeviceCanAccessPeer(int *canAccessPeer, int device, int peerDevice);
cudaDeviceEnablePeerAccess(int peerDevice, unsigned int flags);
[Figure: without P2P, data moves GPU 0 – CPU – GPU 1; with GPUDirect P2P, GPU 0 and GPU 1 transfer directly, bypassing CPU memory]
Can also access (LD/ST) the peer GPU memory from inside a CUDA kernel
PCIe P2P (copy/LD/ST)
NVLink (copy/LD/ST/atomics)
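A minimal sketch of the kernel-side LD/ST path mentioned above, assuming peer access has been enabled between device 0 and device 1; the buffer names follow the copy example and the kernel itself is illustrative.

// Once cudaDeviceEnablePeerAccess has been called, a kernel running on GPU 0 can
// dereference a pointer that was cudaMalloc'd on GPU 1; the loads travel over
// PCIe P2P or NVLink depending on how the two GPUs are connected.
__global__ void copy_from_peer(float *dst_local, const float *src_peer, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        dst_local[i] = src_peer[i];   // LD from peer GPU memory
}

// Host side (GPU 0 current, buf1 allocated on GPU 1):
//   cudaSetDevice(0);
//   cudaDeviceEnablePeerAccess(1, 0);
//   copy_from_peer<<<(n + 255) / 256, 256>>>(buf0, buf1, n);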
12
USING GPU MEMORY FROM ANOTHER PROCESS
cudaIpcGetMemHandle(), cudaIpcOpenMemHandle() & cudaDeviceEnablePeerAccess()
Start from an existing allocation in physical GPU memory
Get a handle for it and send the handle to the other process
Opening the handle creates a mapping in the receiving process
[Figure: the same physical GPU-0 allocation appears in the GPU-0 VAs of process 1 and, via the handle, in process 2, which can also reach it from GPU-1]
Available for access by another process AND from a second GPU using peer access
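A short sketch of the IPC flow above; how the handle bytes are transported between the two processes (MPI, a pipe, a socket) is application-specific and assumed here, and the function names are illustrative.

#include <cuda_runtime.h>

// Process 1: owner of the allocation
void export_allocation(size_t size, cudaIpcMemHandle_t *handle_out)
{
    float *buf;
    cudaMalloc(&buf, size);
    cudaIpcGetMemHandle(handle_out, buf);
    // send the handle (a plain struct of bytes) to the other process, e.g. via MPI_Send
}

// Process 2: maps the same physical GPU memory into its own address space
void import_allocation(cudaIpcMemHandle_t handle)
{
    float *peer_buf;
    cudaIpcOpenMemHandle((void **) &peer_buf, handle, cudaIpcMemLazyEnablePeerAccess);
    // peer_buf can now be used in cudaMemcpy, or dereferenced from kernels on a
    // second GPU once peer access is enabled
    cudaIpcCloseMemHandle(peer_buf);
}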
13
NVLINK1 W/ PASCAL
[Figure: bandwidth in GB/sec for PCIe vs. 1x NVLink1 vs. 4x NVLink1 across three access patterns — Copy using DMA, LD/ST Linear, and LD/ST Random Access — with annotated speedups over PCIe of 1.48x/5.89x, 1.35x/5.43x, and 2.64x/10.44x respectively]
• GPU memory model is relaxed and scoped
• Same memory model works within/across GPUs
14
BREADTH FIRST SEARCH (BFS)
Baseline: an MPI-based multi-GPU implementation of BFS by M. Bisson et al.
Shown to scale to thousands of GPUs on the TSUBAME supercomputer
Loosely follows the Graph 500 specification
R-MAT generator, mean performance over 64 BFS operations
Uses 32-bit vertices instead of the required 48-bit representation
Uses a parallel level-synchronous algorithm
Frontier: list of vertices from which traversal starts in a given level
Expand: traverse the graph to find unvisited neighbors of frontier vertices
Exchange: exchange list of newly found vertices
Append: remove redundancy and build the frontier for next level
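For reference, a serial sketch of the level-synchronous structure described above (frontier, expand, append); the 2D distribution is ignored and the CSR array names are illustrative, not the baseline code's.

#include <stdlib.h>

// Serial sketch of level-synchronous BFS: expand the current frontier, append newly
// discovered vertices to the next frontier, repeat until the frontier is empty.
void bfs_level_sync(const int *row_ptr, const int *col_idx, int nv,
                    int root, int *level)
{
    int *frontier = malloc(nv * sizeof(int));
    int *next     = malloc(nv * sizeof(int));
    for (int v = 0; v < nv; v++) level[v] = -1;

    int nf = 0;
    frontier[nf++] = root;               // frontier seeded with the root vertex
    level[root] = 0;

    for (int depth = 1; nf > 0; depth++) {
        int nn = 0;
        // Expand: visit unvisited neighbors of every frontier vertex
        for (int f = 0; f < nf; f++) {
            int u = frontier[f];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int v = col_idx[e];
                if (level[v] < 0) {
                    level[v] = depth;
                    next[nn++] = v;      // Append: v joins the next frontier
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;
        nf = nn;                         // convergence when the frontier is empty
    }
    free(frontier); free(next);
}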
15
DATA PARTITIONING
[Figure: the adjacency matrix is partitioned into an R x C grid of blocks, one block per process Pi,j; each block spans N/C columns; processes are numbered column-major, so Pi,j has rank jR + i (ranks 0 … RC-1)]
Partition at Process Pi,j
Process grid: R rows x C columns
Adjacency list of a vertex is distributed along process column
Destination vertex of each edge is in the same process row
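A small helper matching the column-major numbering in the figure, assuming process Pi,j (row i, column j) has rank jR + i; names are illustrative, not the baseline code's.

// Map between a process's (row, column) coordinates and its rank in the R x C grid.
static inline int rank_of(int i, int j, int R) { return j * R + i; }
static inline int row_of(int rank, int R)      { return rank % R; }
static inline int col_of(int rank, int R)      { return rank / R; }
// Example: with R = 4 rows, rank 5 is P1,1 (row_of(5, 4) == 1, col_of(5, 4) == 1).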
16
BFS USING MPI
The frontier list at one process column is populated with the seed vertex
while (frontier not empty)
{
Expand: local expansion
Horizontal Exchange: Alltoall exchange newly discovered vertices (happens only along grid rows)
Append: build new frontier vertex list
Vertical exchange: Allgather frontier list along grid columns (happens only along columns)
Convergence check: Allreduce of the frontier list size among all processes
}
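A hedged sketch of the last two steps of this loop (vertical exchange and convergence check), assuming a per-column communicator col_comm; all buffer and count names are illustrative, not the baseline code's.

#include <mpi.h>

int vertical_exchange_and_check(const int *my_frontier, int my_count,
                                int *all_frontier, int *counts, int *displs,
                                MPI_Comm col_comm)
{
    int nprocs;
    MPI_Comm_size(col_comm, &nprocs);

    // Vertical exchange: Allgather the per-process frontier sizes, then the lists
    MPI_Allgather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, col_comm);
    displs[0] = 0;
    for (int p = 1; p < nprocs; p++)
        displs[p] = displs[p - 1] + counts[p - 1];
    MPI_Allgatherv(my_frontier, my_count, MPI_INT,
                   all_frontier, counts, displs, MPI_INT, col_comm);

    // Convergence check: Allreduce of the frontier size over all processes
    long local = my_count, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    return total > 0;   // continue while any process discovered new vertices
}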
17
EXCHANGE USING MPI
Vertex lists are exchanged in two forms:
Using a vertex list (one integer per vertex)
Using a bitmap (one bit per vertex)
Format is statically picked based on the level of traversal
bitmap when the number of discovered vertices is expected to be large
vertex list when the number of discovered vertices is expected to be small
Communication implemented using non-blocking MPI sends and receives
Enhanced the baseline version to use CUDA-aware MPI
internally uses CUDA IPC/P2P to take advantage of direct PCIe or NVLink connections
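A hedged sketch of the exchange with CUDA-aware MPI: send/receive buffers live in GPU memory and are handed to MPI directly, with non-blocking sends and receives along the process row. Counts are assumed to have been exchanged beforehand; row_comm and all other names are illustrative, not the baseline code's.

#include <mpi.h>
#include <stdlib.h>

void exchange_row(int *d_sendbuf, const int *send_counts, const int *send_offs,
                  int *d_recvbuf, const int *recv_counts, const int *recv_offs,
                  MPI_Comm row_comm)
{
    int npeers, me;
    MPI_Comm_size(row_comm, &npeers);
    MPI_Comm_rank(row_comm, &me);

    MPI_Request *reqs = malloc(2 * npeers * sizeof(MPI_Request));
    int nreq = 0;
    for (int p = 0; p < npeers; p++) {
        if (p == me) continue;
        // d_recvbuf / d_sendbuf are device pointers; a CUDA-aware MPI moves the data
        // using IPC/P2P over PCIe or NVLink without staging through host memory
        MPI_Irecv(d_recvbuf + recv_offs[p], recv_counts[p], MPI_INT, p, 0, row_comm, &reqs[nreq++]);
        MPI_Isend(d_sendbuf + send_offs[p], send_counts[p], MPI_INT, p, 0, row_comm, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}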
18
PERFORMANCE WITH CUDA-AWARE MPI
[Figure: GTEPS vs. graph size (scale 20-25), baseline vs. cuMPI, on 4 K40 GPUs + PCIe and on 4 P100 GPUs + NVLink; annotated improvements of 15% and 4%]
Supermicro server w/ IvyBridge CPU and Broadcom PCIe switch, CUDA 8.0, MVAPICH2 2.2
DGX-1 w/ Broadwell CPU, CUDA 8.0, MVAPICH2 2.2
19
BFS USING NVSHMEM
Fuse communication into CUDA kernels:
expand <- exchange of newly discovered vertices
append <- exchange of frontier vertices
while (frontier not empty)
{
Expand on GPU: local expansion
Append on GPU: build new frontier vertex list
Convergence check on CPU: Allreduce frontier list size among all processors
}
// MPI version: mark the vertex in a local send buffer; the bitmap is exchanged later on the host
atomicOr(sbuf + r[k]/BITS(msk), m[k]);

// NVSHMEM version: set the bit directly in the owning PE's bitmap from inside the kernel
int peer = r[k] / drow_bl;
uint32_t off = disp + r[k] % drow_bl;
q[k] = shmem_int_or ((int *)(msk + off/BITS(msk)), m[k], peer);

This gets rid of the MPI exchange code on the host.
20
WORKAROUND FOR PCIE
GPU-GPU atomics are not supported over PCIe
We take advantage of the fact that 32-bit writes are atomic
The vertex list is represented as integers and marked with shmem_int_p
Higher processing overhead, as the vertex map is 32x larger
Provides a common version to compare behavior over PCIe and NVLink
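A minimal sketch of this workaround, assuming a symmetric per-vertex integer map (one 32-bit word per vertex) in place of the shared bitmap; the function and variable names are illustrative, not the paper's.

// Over NVLink, discovery is recorded with a remote atomic OR into a shared bitmap
// (see shmem_int_or on the previous slide). Over PCIe, where GPU-GPU atomics are
// unavailable, each vertex gets its own 32-bit flag and a plain put suffices,
// relying on 32-bit writes being atomic.
__device__ void mark_discovered_pcie(int *vmap /* symmetric, one int per vertex */,
                                     int local_vertex, int owner_pe)
{
    shmem_int_p(vmap + local_vertex, 1, owner_pe);
}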
21
PERFORMANCE WITH PUTS
[Figure: GTEPS vs. graph size (scale 20-24), cuMPI vs. SHMEM-Put, on 4 K40 GPUs + PCIe and on 4 P100 GPUs + NVLink]
22
PERFORMANCE WITH ATOMICS
[Figure: GTEPS vs. graph size (scale 21-27), cuMPI vs. SHMEM-Atomics, on 8 P100 GPUs + NVLink with 2x4 and 4x2 process grids; annotated improvements of 22% and 75%, with overall improvements of 12% and 52%]
23
FUTURE WORK
Fairer comparison by removing tracking of predecessor information
Hybrid vertex list/bitmap design, as in the MPI version
Offload multiple levels of BFS using stream-based collectives in NVSHMEM
Possibility of kernel fusion using kernel-side collectives in NVSHMEM
Evaluate the use of InfiniBand to extend NVSHMEM across multiple nodes
24
CONCLUSION
NVLink provides a high-bandwidth, memory-like fabric between GPUs
NVSHMEM provides a GPU-side communication API based on OpenSHMEM
We presented a study of using GPU-initiated communication in BFS
Benefits from reduced overheads in strong-scaling cases, compared to using MPI
Trade-off with efficient network utilization for weak scaling
25
THANKS
We thank authors of the baseline BFS code
Mauro Bisson, Massimo Bernaschi, Enrico Mastrostefano: Parallel Distributed Breadth First Search on the Kepler Architecture
This work was supported in part by Oak Ridge National Laboratory under subcontract #4000145249.