Lars Nyland & Stephen Jones, GTC 2019: ALL YOU NEED TO KNOW ABOUT PROGRAMMING NVIDIA’S DGX-2
Transcript
Page 1: ALL YOU NEED TO KNOW ABOUT PROGRAMMING NVIDIA’S DGX-2 · 2019-03-29 · 3 NVIDIA DGX-2 SERVER AND NVSWITCH 16 Tesla™ V100 32 GB GPUs FP64: 125 TFLOPS FP32: 250 TFLOPS Tensor:

Lars Nyland & Stephen Jones, GTC 2019

ALL YOU NEED TO KNOW ABOUT PROGRAMMING NVIDIA’S DGX-2

Page 2:

2

DGX-2: FASTEST COMPUTE NODE EVER BUILT
We’re here to tell you about it

Page 3:

3

NVIDIA DGX-2 SERVER AND NVSWITCH

16 Tesla™ V100 32 GB GPUs
FP64: 125 TFLOPS
FP32: 250 TFLOPS
Tensor: 2000 TFLOPS
512 GB of GPU HBM2

Single-Server Chassis
10U/19-Inch Rack Mount
10 kW Peak TDP
Dual 24-core Xeon CPUs
1.5 TB DDR4 DRAM
30 TB NVMe Storage

New NVSwitch Chip
18 2nd-Generation NVLink™ Ports
25 GBps per Port
900 GBps Total Bidirectional Bandwidth
450 GBps Total Throughput

12-NVSwitch Network
Full-Bandwidth Fat-Tree Topology
2.4 TBps Bisection Bandwidth
Global Shared Memory
Repeater-less

Page 4:

4

USING MULTI-GPU MACHINES
What’s the difference between a mining rig and DGX-2?

Page 5:

5

SINGLE GPU

[Diagram: a single GPU with SMs, L2 slices, HBM, and XBAR, connected through a hub to PCIe I/O and NVLink2; the CPU attaches over the PCIe bus]

Page 8:

8

TWO GPUS
Can read and write each other’s memory over PCIe

[Diagram: two GPUs, each with SMs, L2, HBM, XBAR, and hub, attached to the CPU over a shared PCIe bus]

Page 10:

10

TWO GPUS USING NVLINK
6 Bidirectional Channels Directly Connecting 2 GPUs

[Diagram: two GPUs connected directly by NVLink2 links between their hubs, with the CPU reachable over the PCIe bus]

Page 13:

13

MULTIPLE GPUS

- Requires dedicated connections between GPUs

- Decreases bandwidth between GPUs as more are added

- Not scalable

Directly connected using NVLink2

[Diagram: three GPUs fully connected to each other by NVLink2 links]

Page 14:

14

ADDING A SWITCH

[Diagram: three GPUs, each connecting all of its NVLink2 links to a central NVSwitch instead of to each other]

Page 15:

15

DGX-2 INTERCONNECT INTRO
8 GPUs with 6 NVSwitch Chips

[Diagram: eight V100 GPUs on one baseboard, each linked to all six NVSwitch chips]

Page 18:

18

FULL DGX-2 INTERCONNECT
Baseboard to baseboard

[Diagram: two baseboards, each carrying eight V100 GPUs and six NVSwitch chips; the two banks of switches are wired together baseboard to baseboard]

Page 19:

19

MOVING DATA ACROSS THE NVLINK FABRIC

Bulk transfers

Use the copy engine (DMA) between GPUs to move data

Available with cudaMemcpy()

Word by word

Programs running on SMs can access all memory by address

For LOAD, STORE, and ATOM operations

1 to 16 bytes per thread

The usual performance guidelines for address coalescing apply

Page 20:

20

HOST MEMORY VIA PCIE

2 Intel Xeon 8168 CPUs

1.5 Terabytes DRAM

4 PCIe buses (2 per CPU, 4 GPUs per bus)

GPUs can read/write host memory at 50 GB/s

1.5 TB of host memory accessible via four PCIe channels

[Diagram: full DGX-2 topology: two x86 CPUs joined by QPI, PCIe switches fanning out to four GPUs each and to eight 100G NICs, and all sixteen V100s joined through the NVSwitch fabric]

Page 21:

21

HOST MEMORY BANDWIDTH
User data moving at 49+ GB/s

1 kernel/GPU reading host memory

On GPUs 0, 1, …, 15

10-second delay between each launch

GPUs 0-3 share one PCIe bus

Same for 4-7, 8-11, 12-15

Page 22:

22

MULTI-GPU PROGRAMMING IN CUDA

Programs control each device independently

▪ Streams are per-device work queues

▪ Launching or synchronizing on a stream implies its device

Inter-stream synchronization uses events

▪ Events can mark kernel completion

▪ Kernels queued in a stream on one device can wait for an event from another

[Diagram: one CPU program driving GPU0, GPU1, …, GPUN]

Page 23:

23

VERY BASIC MULTI-GPU LAUNCH

// Create as many streams as I have devices
cudaStream_t stream[16];
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaSetDevice(gpu);
    cudaStreamCreate(&stream[gpu]);
}

// Launch a copy of the first kernel onto each GPU’s stream
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaSetDevice(gpu);
    firstKernel<<< griddim, blockdim, 0, stream[gpu] >>>( ... );
}

// Wait for the kernel to finish on each stream
for(int gpu=0; gpu<numGPUs; gpu++)
    cudaStreamSynchronize(stream[gpu]);

// Now launch a copy of another kernel onto each GPU’s stream
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaSetDevice(gpu);
    secondKernel<<< griddim2, blockdim2, 0, stream[gpu] >>>( ... );
}

Create stream on each GPU
Launch kernel on each GPU
Synchronize stream on each GPU

Page 24:

24

BETTER ASYNCHRONOUS MULTI-GPU LAUNCH

// Launch the first kernel and an event to mark its completion
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaSetDevice(gpu);
    firstKernel<<< griddim, blockdim, 0, stream[gpu] >>>( ... );
    cudaEventRecord(event[gpu], stream[gpu]);
}

// Make GPU 0 sync with other GPUs to know when all are done
for(int gpu=1; gpu<numGPUs; gpu++)
    cudaStreamWaitEvent(stream[0], event[gpu], 0);

// Then make other GPUs sync with GPU 0 for a full handshake
cudaEventRecord(event[0], stream[0]);
for(int gpu=1; gpu<numGPUs; gpu++)
    cudaStreamWaitEvent(stream[gpu], event[0], 0);

// Now launch the next kernel with an event... and so on
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaSetDevice(gpu);
    secondKernel<<< griddim, blockdim, 0, stream[gpu] >>>( ... );
    cudaEventRecord(event[gpu], stream[gpu]);
}

Launch kernel on each GPU
GPU 0 waits for all kernels on all GPUs
Other GPUs wait for GPU 0
Synchronize all GPUs at end

Page 25:

25

MUCH SIMPLER: COOPERATIVE LAUNCH

// Cooperative launch provides easy inter-grid sync.
// Kernels and per-GPU streams are in “launchParams”
cudaLaunchCooperativeKernelMultiDevice(launchParams, numGPUs);

// Now just synchronize to wait for the work to finish.
cudaStreamSynchronize(stream[0]);

// Inside the kernel, instead of kernels make function calls
__global__ void masterKernel( ... ) {
    firstKernelAsFunction( ... );   // All threads on all GPUs run
    this_multi_grid().sync();       // Sync all threads on all GPUs
    secondKernelAsFunction( ... );
    this_multi_grid().sync();
    ...
}

Launch cooperative kernel
Program runs across all GPUs
Synchronize within GPU code
Exit when done

Page 26:

26

MULTI-GPU MEMORY MANAGEMENT

Unified Memory
Program spans all GPUs + CPUs

Individual memory
Independent instances read from neighbours explicitly

[Diagram: left, one program spanning the CPU and all GPUs under Unified Memory; right, a separate program instance per device with explicit P2P and PCIe transfers between them]

Page 27:

27

16 GPUs WITH 32 GB MEMORY EACH

NVSWITCH PROVIDES

All-to-all high-bandwidth peer mapping between GPUs

Full inter-GPU memory interconnect (incl. atomics)

16x 32 GB Independent Memory Regions

[Diagram: GPU0 through GPU15, each with its own independent memory region]

Page 28:

28

UNIFIED MEMORY + DGX-2

UNIFIED MEMORY PROVIDES

Single memory view shared by all GPUs

User control of data locality

Automatic migration of data between GPUs

512 GB Unified Memory

[Diagram: GPU0 through GPU15 sharing one 512 GB unified memory space]

Page 29:

29

WE WANT LINEAR MEMORY ACCESS
But cudaMalloc creates a partitioned global address space

4 GPUs require 4 allocations, giving 4 regions of memory

Problem: the program must now be aware of data & compute layout across GPUs

[Diagram: four separate allocations, one per GPU 0-3]

Page 30:

30

UNIFIED MEMORY
CUDA’s Unified Memory Allows One Allocation To Span Multiple GPUs

Normal pointer arithmetic just works

[Diagram: one contiguous allocation striped across GPUs 0-3]

Page 31:

31

SETTING UP UNIFIED MEMORY

// Allocate data for cube of side “N”
float *data;
size_t size = N*N*N * sizeof(float);
cudaMallocManaged(&data, size);

// Make whole allocation visible to all GPUs
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, gpu);
}

// Now place chunks on each GPU in a striped layout
size_t chunk = size / numGPUs;          // bytes per GPU
float *start = data;
for(int gpu=0; gpu<numGPUs; gpu++) {
    cudaMemAdvise(start, chunk, cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(start, chunk, gpu);
    start += chunk / sizeof(float);     // advance by elements, not bytes
}

[Diagram: the allocation divided into equal chunks, one placed on each of GPUs 0-3]

Page 32:

32

WEAK SCALING
Problem Grows As Processor Count Grows – Constant Work Per GPU

[Diagram: as GPUs are added, each GPU still gets 1x work]

Page 33:

33

IDEAL WEAK SCALING

Page 34:

34

STRONG SCALING
Problem Stays Same As Processor Count Grows – Less Work Per GPU

[Diagram: work per GPU shrinks as GPUs are added: 1x, ½x, ¼x, …]

Page 35:

35

IDEAL STRONG SCALING

Page 36:

36

EXAMPLE: MERGESORT

Break list into 2 equal parts, recursively

Until just 1 element per list

Merge pairs of lists

Keeping them sorted

Until just 1 list remains

[Diagram: the split phase halving lists, then the merge phase recombining them]

Page 37:

37

PARALLELIZING MERGESORT

Parallelize merging of lists

Compute where elements are stored in the result independently

For all items in list A

Find lowest j such that A[i] < B[j]

Store A[i] at D[i+j]

Binary search step impacts O(parallelism)

Extra log(n) work to perform binary search

[Diagram: element i of A and position j in B combine to give destination i+j in D]
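The placement rule above can be sketched on the CPU in plain C++ (a hedged sketch, not the GTC code; on a GPU, each loop iteration would be one thread, and `scatterMerge` is an invented name):

```cpp
#include <algorithm>
#include <vector>

// Scatter-based merge of two sorted lists: each element's destination is
// computed independently with a binary search, so every element can be
// handled by its own thread. For a[i], j is the lowest index with
// a[i] < b[j] (the slide's rule); the symmetric rule for b breaks ties
// so no two elements ever target the same output slot.
std::vector<int> scatterMerge(const std::vector<int>& a,
                              const std::vector<int>& b) {
    std::vector<int> d(a.size() + b.size());
    for (size_t i = 0; i < a.size(); i++) {
        // lowest j such that a[i] < b[j], found in O(log n)
        size_t j = std::upper_bound(b.begin(), b.end(), a[i]) - b.begin();
        d[i + j] = a[i];                    // "store A[i] at D[i+j]"
    }
    for (size_t k = 0; k < b.size(); k++) {
        // number of elements of a strictly less than b[k]
        size_t i = std::lower_bound(a.begin(), a.end(), b[k]) - a.begin();
        d[i + k] = b[k];
    }
    return d;
}
```

Because every destination index depends only on the element's own position and one binary search, the merge is embarrassingly parallel, which is what makes it suitable for tens of thousands of GPU threads.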

Page 38:

38

OUTLINE OF PARALLEL MERGESORT

The “merge” step:

1. Read by thread-id

2. Compute location in write-buffer

3. Write

4. Sync

5. Flip read, write buffers

6. Repeat, doubling list lengths

[Diagram, built up over several slides: rows of threads T reading, writing, and merging, advancing down the page over time]
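The six steps above amount to a bottom-up, double-buffered mergesort. A minimal serial C++ sketch (my own illustration, with `std::merge` standing in for the parallel scatter merge; the buffer swap is step 5 and the doubling run length is step 6):

```cpp
#include <algorithm>
#include <vector>

// Bottom-up mergesort following the slide's outline: read from one
// buffer, merge pairs of sorted runs into the other buffer, sync
// (implicit in serial code), flip the buffers, and double the run
// length until a single sorted run remains.
std::vector<int> bottomUpMergesort(std::vector<int> read) {
    std::vector<int> write(read.size());
    for (size_t len = 1; len < read.size(); len *= 2) {
        for (size_t lo = 0; lo < read.size(); lo += 2 * len) {
            size_t mid = std::min(lo + len,     read.size());
            size_t hi  = std::min(lo + 2 * len, read.size());
            // Merge runs [lo, mid) and [mid, hi) into the write buffer.
            std::merge(read.begin() + lo,  read.begin() + mid,
                       read.begin() + mid, read.begin() + hi,
                       write.begin() + lo);
        }
        std::swap(read, write);   // step 5: flip read/write buffers
    }
    return read;
}
```

On the DGX-2 the same loop structure runs with one thread per element and a grid-wide sync between passes; the two buffers live in unified memory spanning all 16 GPUs.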

Page 42:

42

ALIGNMENT OF THREADS & DATA

Memory pages are 2 MB

Pinned to GPUs in round-robin fashion

Run 80*1024 = 81920 threads on each GPU

One 8-byte read per thread covers 655,360 bytes

16 GPUs cover 10,485,760 bytes

No optimized alignment between threads and memory

Possible to do better (and worse)

[Diagram: blocks of threads laid over the round-robin striped memory pages]
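The arithmetic on this slide can be checked directly. A small C++ sketch of the round-robin placement (`owningGPU` is a hypothetical helper invented here, not part of the deck):

```cpp
#include <cstddef>

// With 2 MB pages dealt out to GPUs in round-robin order, the GPU that
// owns a given byte offset is just (offset / pageSize) % numGPUs.
constexpr std::size_t kPage    = 2u * 1024 * 1024;  // 2 MB page
constexpr int         kNumGPUs = 16;

constexpr int owningGPU(std::size_t byteOffset) {
    return static_cast<int>((byteOffset / kPage) % kNumGPUs);
}

// One sweep of 80*1024 threads per GPU, each reading 8 bytes:
constexpr std::size_t kThreadsPerGPU = 80 * 1024;            // 81,920
constexpr std::size_t kBytesPerSweep = kThreadsPerGPU * 8;   // 655,360
constexpr std::size_t kBytesAllGPUs  = kBytesPerSweep * 16;  // 10,485,760
```

Since 655,360 bytes is far less than one 2 MB page, consecutive sweeps from one GPU mostly stay on one page, but which GPU owns that page is arbitrary, hence "no optimized alignment between threads and memory".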

Page 43:

43

ACHIEVED BANDWIDTH

16 GPUs read & write data at 6 TB/s

Adding more GPUs

Adds more accessible bandwidth

Adds memory capacity

Is 6 TB/s fast enough?

“Speed of light” (the hardware limit) is 2 TB/s for DGX-2

Caching gives a performance boost

Aggregate bandwidth for all loads and stores in mergesort

Page 44:

44

STRONG SCALING
Sorting 8 billion values on 4-16 GPUs

Page 45:

45

COMMUNICATING ALGORITHMS

3x3 Convolution

Page 47:

47

COMMUNICATION IS EVERYTHING

Page 49:

49

COMMUNICATION IS EVERYTHING
Halo Cells Keep Computation Local and Communication Asynchronous

Copy remote node boundary-cell data into local halo cells
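A one-dimensional toy of the halo-cell pattern, as a hedged C++ sketch (the names `exchangeHalos` and `stencilStep` are invented for illustration; a real code would do the exchange with asynchronous copies between nodes):

```cpp
#include <vector>

// Each "node" owns a chunk of cells plus one halo cell on each side:
// [halo | interior ... | halo]. Before a 3-point stencil step, the halo
// cells are filled from the neighbouring node's boundary cells, so the
// stencil itself only ever touches local memory.
using Chunk = std::vector<double>;

void exchangeHalos(std::vector<Chunk>& nodes) {
    for (std::size_t n = 0; n < nodes.size(); n++) {
        Chunk& c = nodes[n];
        // left halo <- right boundary cell of the left neighbour
        c.front() = (n > 0) ? nodes[n - 1][nodes[n - 1].size() - 2] : 0.0;
        // right halo <- left boundary cell of the right neighbour
        c.back() = (n + 1 < nodes.size()) ? nodes[n + 1][1] : 0.0;
    }
}

void stencilStep(Chunk& c) {   // 3-point average over local cells only
    Chunk next = c;
    for (std::size_t i = 1; i + 1 < c.size(); i++)
        next[i] = (c[i - 1] + c[i] + c[i + 1]) / 3.0;
    c = next;
}
```

Because the exchange touches only the thin boundary layer, communication volume grows with the surface of each chunk while computation grows with its volume, which is what keeps the scheme efficient until chunks get small.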

Page 50:

50

STRONG SCALING: DIMINISHING RETURNS

[Chart: strong-scaling curve flattening as GPU count grows]

Page 51:

51

LEAVING COMMUNICATION TO NVLINK
Pretending That Memory Over NVLink Is Local Memory

1. Eliminate halo cells

2. Read directly from neighbor GPUs as if all memory were local

3. NVLink takes care of fast communication

How far can we push this?

Page 54:

54

LEAVING COMMUNICATION TO NVLINK
Pretending That Memory Over NVLink Is Local Memory

As GPU count increases

▪ Halo-to-core ratio increases

▪ Off-chip accesses increase

▪ On-chip accesses decrease

▪ Proportions depend on algorithm

We reach the NVLink bandwidth limit

BUT: Communication is many-to-many, so full aggregate bandwidth is available

Page 56:

56

LOOKING FOR BANDWIDTH LIMITS
Hypothesis: Local-to-Remote Ratio Determines Performance

Stencil codes are memory-bandwidth limited

Limited by HBM2 bandwidth for on-chip reads

Limited by NVLink bandwidth for off-chip reads

NVLink = 120 GB/sec
HBM2 = 880 GB/sec
Ratio = 18.4%

Page 57:

57

NAÏVE 3D STENCIL PROGRAM
How Well Does The Simplest Possible Stencil Perform?

[Chart: measured slope = 25%]

Page 58:

58

BANDWIDTH & OVERHEAD LIMITS

Example: LULESH stencil CFD code

Expect to lose performance when off-chip bandwidth demand exceeds NVLink bandwidth:

NVLink = 120 GB/sec
HBM2 = 880 GB/sec
Ratio = 18.4%

Page 59:

59

BONUS FOR NINJAS: ONE-SIDED COMMUNICATION

NVLink achieves higher bandwidth for writes than for reads:

Read requests consume inbound bandwidth at the remote node

Difference is ~9%

Page 60:

60

WORK STEALING
Weak Scaling Mechanism: GPUs Compete To Process Data

Producers feed work into a FIFO queue

Any consumer can pop the head of the queue whenever it needs more work

Page 61:

61

DESIGN OF A FAST FIFO QUEUE
Basic Building Block For Many Dynamic Scheduling Applications

[Diagram: ring buffer with Head and Tail pointers, built up over several slides]

Push Operation: Adds data to head of queue

1. Advance head pointer to claim more space

2. Write new data into space

Pop Operation: Extracts data from tail of queue

1. Advance tail pointer to next item

2. Read data from new location

Page 68:

68

DESIGN OF A FAST FIFO QUEUE
Problem: Concurrent Access Of Empty Queue

Empty Queue: When Head == Tail

Push Operation: 1. Advance head pointer to claim more space

Pop Operation: 1. Advance tail pointer to next item  2. Read data from new location

If a pop runs concurrently with a push, the pop can advance past the freshly claimed slot and read data the push has not yet written

Page 72:

72

DESIGN OF A FAST FIFO QUEUE
Solution: Two Head (& Tail) Pointers For Thread-Safety

Push Operation: Adds data to head of queue

1. Advance outer head pointer

2. Write new data into space

3. Advance inner head pointer

Pop Operation: Extracts data from tail of queue

1. While “tail” == “inner head”, do nothing

2. Advance tail pointer to next item

3. Read data from new location

Page 79

DESIGN OF A FAST FIFO QUEUE
Thread-Safe Multi-Producer / Multi-Consumer Operation

Queue pointers: OuterHead, InnerHead, OuterTail, InnerTail

Basic Rules

1. Use inner/outer head to avoid underflow

2. Use inner/outer tail to avoid overflow

3. Access all pointers atomically to allow multiple push/pop operations at once

4. NVLink carries atomics – this would be much, much harder over PCIe
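The rules above can be sketched as a host-side analogue in C++. This is an illustration under assumptions, not the presenters’ CUDA implementation: the class and member names are ours, and overflow/underflow are handled by simple spinning. All four pointers are monotonically increasing tickets; the buffer index is the ticket masked by a power-of-two capacity.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the two-head / two-tail FIFO described on this slide.
class Fifo {
public:
    explicit Fifo(std::size_t capacity)            // capacity must be a power of 2
        : buf_(capacity), mask_(capacity - 1) {}

    void push(int value) {
        uint64_t slot = outerHead_.fetch_add(1);           // 1. advance outer head
        while (slot - innerTail_.load() >= buf_.size()) {} // rule 2: avoid overflow
        buf_[slot & mask_] = value;                        // 2. write data into space
        uint64_t exp = slot;                               // 3. advance inner head,
        while (!innerHead_.compare_exchange_weak(exp, slot + 1))
            exp = slot;                                    //    after earlier pushes publish
    }

    int pop() {
        uint64_t slot = outerTail_.fetch_add(1);           // claim the next item
        while (innerHead_.load() <= slot) {}               // rule 1: wait until published
        int value = buf_[slot & mask_];                    // read data from tail
        uint64_t exp = slot;
        while (!innerTail_.compare_exchange_weak(exp, slot + 1))
            exp = slot;                                    // free the slot for producers
        return value;
    }

private:
    std::vector<int> buf_;
    std::size_t mask_;
    std::atomic<uint64_t> outerHead_{0}, innerHead_{0};
    std::atomic<uint64_t> outerTail_{0}, innerTail_{0};
};
```

On the GPU the same scheme would use `atomicAdd`/`atomicCAS` on the queue pointers; per rule 4, NVLink forwards those atomics between GPUs, which is what makes the multi-GPU version practical.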

Page 80

SCALE EASILY BY ADDING MORE CONSUMERS
Limit Is Memory Bandwidth Between Queue & Consumers

Page 81

SCALING OUT TO LOTS OF GPUS
It Works! Here’s an Incredibly Boring Graph To Prove It

Page 82

QUEUE CONTENTION LIMITATION
When Consumers Are Consuming Too Quickly

Why are more consumers worse?

▪ Large number of consumers accessing a single queue

▪ Saturates the memory system at the queue head

▪ Loss of throughput even though bandwidth is available

Page 83

MEMORY CONTENTION LIMITATION
Throttling requests restores throughput, but costs latency

Mitigations

▪ Contention arises when many consumers are available for work

▪ Apply a backoff delay between queue requests

▪ BUT: Unnecessary backoff increases latency

▪ Full-queue management still adds overhead
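A minimal sketch of such a backoff policy (our illustration, not the measured implementation): double the delay after each empty poll up to a cap, and reset it on success, so idle consumers stop hammering the queue head while busy ones keep low latency.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Exponential backoff between queue requests (illustrative sketch).
struct Backoff {
    uint32_t delay_us = 0;                    // current wait; 0 = not throttling
    static constexpr uint32_t kMaxUs = 256;   // cap (assumed, tune per machine)

    void failed() {                           // queue was empty or contended
        delay_us = delay_us ? std::min(delay_us * 2, kMaxUs) : 1;
        std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
    }
    void succeeded() { delay_us = 0; }        // got work: stop backing off
};
```

A consumer loop would call `failed()` whenever a pop attempt finds nothing and `succeeded()` after real work, keeping the extra latency bounded by the cap.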

Page 84

GUIDANCE AND GOTCHAS
NVSwitch Is The Secret Sauce

You can mostly ignore NVLink – it’s just like memory, thanks to NVSwitch

BUT

2 TB/sec is the combined bandwidth, for many-to-many access patterns

If everyone accesses a single GPU, they share 137 GB/sec

ALSO

Sometimes NVLink bandwidth is not the limiting factor

High contention at a single memory location hurts you

You have 2.6 million threads – contention can get very high

Page 86

KEEPING PERFORMANCE HIGH
Spread threads and data across all GPUs to use the most hardware

Spread your data across all GPUs

▪ To avoid colliding memory requests at the “storage” GPU

Spread your threads across all GPUs

▪ To avoid all traffic congesting at the “computing” GPU

“All the wires, all the time”

[Diagram: data and threads striped across GPU0–GPU15]
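One simple way to realize the “spread your data” advice is static striping: interleave a large array across all 16 GPUs at a fixed stripe granularity, so streaming accesses touch every link instead of congesting at one “storage” GPU. A sketch follows; the stripe size and helper name are our assumptions, not a DGX-2 API.

```cpp
#include <cassert>
#include <cstddef>

// Stripe a flat byte range round-robin across the GPUs.
constexpr int kNumGpus = 16;                      // DGX-2 has 16 GPUs
constexpr std::size_t kStripeBytes = 2u << 20;    // 2 MB stripes (assumed)

// Which GPU owns the stripe containing byte offset `off`?
int owning_gpu(std::size_t off) {
    return static_cast<int>((off / kStripeBytes) % kNumGpus);
}
```

With unified virtual memory the striping can be expressed once at allocation time; compute threads then address the whole array linearly and NVSwitch routes each access to the owning GPU.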

Page 87

GUIDANCE: RELY ON DGX-2 HARDWARE
How much tuning is needed?

Unified virtual memory gives you a simple memory model for spanning multiple GPUs, at maximum performance

Remote memory traffic uses the L1 cache

▪ Volta has a 128 KB L1 cache for each SM

▪ Not coherent: use fences to ensure consistency after writes

Explicit management of local & remote memory accesses may improve performance
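Because those L1 copies are not coherent, a writer must fence between publishing data and raising a flag; in CUDA that is the role of `__threadfence_system()`. The same publish/consume pattern looks like this in host C++ (a sketch of the idea, not the deck’s code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Fence-after-write pattern: make the data visible before the signal.
int data = 0;
std::atomic<int> flag{0};

void writer() {
    data = 42;                                            // plain write
    std::atomic_thread_fence(std::memory_order_release);  // fence: publish the write
    flag.store(1, std::memory_order_relaxed);             // then raise the flag
}

int reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {}  // wait for the flag
    std::atomic_thread_fence(std::memory_order_acquire);  // fence: then reads are safe
    return data;
}
```

On the GPU the fences additionally force the non-coherent L1 to be bypassed or invalidated, which is why the slide’s “use fences after writes” rule matters even when the CPU analogue looks like ordinary release/acquire.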

Page 88

CONCLUSIONS

The fabric makes DGX-2 more than just 16 GPUs – it’s not just a mining rig

DGX-2 is a superb strong scaling machine in a time-to-solution sense

Overhead of Multi-GPU programming is low

Naïve code behaves well – great effort/reward ratio

Linearization of addresses through UVM provides a familiar model for free

