Transcript
Page 1

May 8-11, 2017 | Silicon Valley

Mark Harris, November 1, 2017

CUDA 9 AND BEYOND

Page 2

INTRODUCING CUDA 9

BUILT FOR VOLTA
Tesla V100: New GPU Architecture, Tensor Cores, NVLink, Independent Thread Scheduling

COOPERATIVE THREAD GROUPS
Flexible Thread Groups, Efficient Parallel Algorithms, Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs

FASTER LIBRARIES
cuBLAS for Deep Learning, NPP for Image Processing, cuFFT for Signal Processing

DEVELOPER TOOLS & PLATFORM UPDATES
Faster Compile Times, Unified Memory Profiling, NVLink Visualization, New OS and Compiler Support

Page 3

INTRODUCING TESLA V100

The Fastest and Most Productive GPU for Deep Learning and HPC

Volta Architecture: Most Productive GPU

Tensor Core: 125 Programmable TFLOPS Deep Learning

Improved SIMT Model: New Algorithms

Volta MPS: Inference Utilization

Improved NVLink & HBM2: Efficient Bandwidth

Page 4

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

[Chart: V100 performance relative to P100 across HPC applications, ranging from roughly 1.37x to 1.64x]

1.5x HPC Performance in 1 Year

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 2X Tesla P100 or V100.

Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts

Page 5

FASTER LIBRARIES

Page 6

CUDA 9: WHAT’S NEW IN LIBRARIES

VOLTA PLATFORM SUPPORT
Utilize Volta Tensor Cores
Volta optimized GEMMs (cuBLAS)
Out-of-box performance on Volta (all libraries)

PERFORMANCE
GEMM optimizations for RNNs (cuBLAS)
Faster image processing (NPP)
FFT optimizations across various sizes (cuFFT)

NEW ALGORITHMS
Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
Breadth first search, clustering, triangle counting, extraction & contraction (nvGRAPH)

IMPROVED USER EXPERIENCE
New install package for CUDA Libraries (library-only meta package)
Modular NPP with small footprint, support for image batching

DEEP LEARNING

Scientific Computing

Page 7

cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply

[Chart: cuBLAS Mixed Precision (FP16 input, FP32 compute), relative performance vs. matrix size (M = N = K, 512 to 4096): V100 Tensor Cores (CUDA 9) up to 9.3x faster than P100 (CUDA 8)]

[Chart: cuBLAS Single Precision (FP32), relative performance vs. matrix size (M = N = K, 512 to 4096): V100 (CUDA 9) up to 1.8x faster than P100 (CUDA 8)]

Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
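For reference, a minimal sketch (not from the slides; function and variable names are illustrative) of how a mixed-precision GEMM like the one benchmarked above can be invoked through cuBLAS in CUDA 9, opting in to Tensor Core math:

// FP16 inputs, FP32 accumulation; assumes column-major A (m x k), B (k x n), C (m x n) in device memory.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_in_fp32_compute(cublasHandle_t handle,
                               const half *A, const half *B, float *C,
                               int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Opt in to Tensor Core math (available starting in CUDA 9).
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    // FP16 inputs (CUDA_R_16F), FP32 output and compute type (CUDA_R_32F).
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_32F, m,
                 CUDA_R_32F,
                 CUBLAS_GEMM_DFALT);
}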

Page 8

COOPERATIVE GROUPS

Page 9

COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication

Define, synchronize, and partition groups of cooperating threads

Clean composition across software boundaries

Optimize for hardware fast path

Scalable from a few threads to all running threads

Deploy Everywhere: Kepler and Newer GPUs

Supported by CUDA developer tools

[Figure: a thread block group partitioned into smaller thread groups]

Page 10

SYNCHRONIZE AT ANY SCALE
Three Key Capabilities

FLEXIBLE GROUPS
Define and Synchronize Arbitrary Groups of Threads

WHOLE-GRID SYNCHRONIZATION
Synchronize Multiple Thread Blocks

MULTI-GPU SYNCHRONIZATION

* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs

Page 11

COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization

Thread groups are explicit objects in your program

You can synchronize threads in a group

Create new groups by partitioning existing groups

Partitioned groups can also synchronize

thread_group block = this_thread_block();

block.sync();

thread_group tile32 = tiled_partition(block, 32);
thread_group tile4 = tiled_partition(tile32, 4);

tile4.sync();

Note: these calls are part of the cooperative_groups:: namespace

[Figure: a thread block group partitioned into smaller thread groups]
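For context, here is a minimal sketch, assuming an illustrative kernel name, that places these calls inside a complete kernel together with the header they come from:

#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void partition_demo()
{
    thread_block block = this_thread_block();
    block.sync();                                   // barrier across the whole block

    // Split the block into 32-thread tiles, then each tile into 4-thread tiles.
    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4  = tiled_partition(tile32, 4);

    tile4.sync();                                   // barrier across just 4 threads
}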

Page 12

EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient

__device__ int reduce(thread_group g, int *x, int val) {
    int lane = g.thread_rank();
    for (int i = g.size() / 2; i > 0; i /= 2) {
        x[lane] = val;
        g.sync();                         // wait for all threads to store
        if (lane < i) val += x[lane + i];
        g.sync();                         // wait for all threads to load
    }
    return val;                           // thread 0 holds the full sum
}

Per-Warp:
g = tiled_partition<32>(this_thread_block());
reduce(g, ptr, myVal);

Per-Block:
g = this_thread_block();
reduce(g, ptr, myVal);
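A minimal usage sketch for the per-block call, assuming a 256-thread block, an illustrative kernel name, and a final atomicAdd to combine block totals; it shows where the scratch array (ptr) can come from:

#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void sum_kernel(const int *in, int *out, int n)   // launch with 256-thread blocks
{
    __shared__ int scratch[256];                  // one slot per thread in the block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int myVal = (idx < n) ? in[idx] : 0;

    // Per-block reduction: every thread in the block shares the scratch array.
    thread_block block = this_thread_block();
    int blockSum = reduce(block, scratch, myVal);

    // Thread 0 of each block holds the block total (out must be zero-initialized).
    if (block.thread_rank() == 0)
        atomicAdd(out, blockSum);
}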

Page 13

LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales

Block or Sub-Block Sync: launch with <<<>>> or cudaLaunchKernel()

Multi-Block Sync: launch with cudaLaunchCooperativeKernel()

Multi-Device Sync: launch with cudaLaunchCooperativeKernelMultiDevice()

Page 14

EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups


// threads update particles in parallel
integrate<<<blocks, threads, 0, stream>>>(particles);

Page 15

EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

// threads update particles in parallel
integrate<<<blocks, threads, 0, s>>>(particles);

// Collide each particle with others in neighborhood
collide<<<blocks, threads, 0, s>>>(particles);


Note change in how threads map to particles in acceleration data structure

Page 16

EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

// threads update particles in parallel
integrate<<<blocks, threads, 0, s>>>(particles);

// Note: implicit sync between kernel launches

// Collide each particle with others in neighborhood
collide<<<blocks, threads, 0, s>>>(particles);

Note change in how threads map to particles in acceleration data structure


Page 17

WHOLE-GRID COOPERATION
Particle Simulation Update in a Single Kernel

__global__ void particleSim(Particle *p, int N) {
    grid_group g = this_grid();

    for (int i = g.thread_rank(); i < N; i += g.size())
        integrate(p[i]);

    g.sync(); // Sync whole grid!

    for (int i = g.thread_rank(); i < N; i += g.size())
        collide(p[i], p, N);
}

Launch using cudaLaunchCooperativeKernel(…)
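A host-side sketch of that launch (the grid-sizing policy and the particles/N host variables are illustrative assumptions); the cooperative grid must fit co-resident on the device, so occupancy is queried first:

int device = 0, numSms = 0, blocksPerSm = 0;
cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, particleSim, 256, 0);

void *args[] = { &particles, &N };
dim3 grid(numSms * blocksPerSm), block(256);

// <<<>>> cannot provide grid-wide sync; use the cooperative launch API instead.
cudaLaunchCooperativeKernel((void *)particleSim, grid, block, args, 0, 0);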


Page 18

MULTI-GPU COOPERATION
Large-scale Multi-GPU Simulation in a Single Kernel

Launch using cudaLaunchCooperativeKernelMultiDevice(…)

__global__ void particleSim(Particle *p, int N) {
    multi_grid_group g = this_multi_grid();

    for (int i = g.thread_rank(); i < N; i += g.size())
        integrate(p[i]);

    g.sync(); // Sync all GPUs!

    for (int i = g.thread_rank(); i < N; i += g.size())
        collide(p[i], p, N);
}

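A host-side sketch of this multi-device launch (hypothetical helper function; error handling omitted): one cudaLaunchParams entry per GPU, each on its own non-default stream, submitted in a single call:

#include <vector>

void launchOnAllGpus(Particle *particles, int N, dim3 grid, dim3 block, int numGpus)
{
    std::vector<cudaLaunchParams> params(numGpus);
    std::vector<cudaStream_t> streams(numGpus);
    void *args[] = { &particles, &N };

    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        params[i].func      = (void *)particleSim;
        params[i].gridDim   = grid;        // per-GPU grid, sized for co-residency
        params[i].blockDim  = block;
        params[i].args      = args;
        params[i].sharedMem = 0;
        params[i].stream    = streams[i];  // must not be the default stream
    }
    cudaLaunchCooperativeKernelMultiDevice(params.data(), numGpus);
}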

Page 19

ROBUST AND EXPLICIT WARP PROGRAMMING

Volta Independent Thread Scheduling:

Program familiar algorithms and data structures in a natural way

Flexible thread grouping and synchronization

Use explicit synchronization, don’t rely on implicit convergence

CUDA 9 provides a fully explicit synchronization model

Adapt Legacy Code for New Execution Model

Page 20

ROBUST AND EXPLICIT WARP PROGRAMMING

Eliminate implicit warp synchronous programming on all architectures

Use explicit synchronization

Focus synchronization granularity with Cooperative Groups

Transition to new *_sync() primitives

__shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()

CUDA 9 deprecates non-synchronizing __shfl(), __ballot(), __any(), __all()

Adapt Legacy Code for New Execution Model
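As an illustration of this migration (not from the slides; the kernel and variable names are assumptions), here is a warp-level sum written with the new primitives, with the deprecated form left as a comment:

__global__ void warp_sum(const int *in, int *out)   // assumes the grid exactly covers in[]
{
    int val = in[blockIdx.x * blockDim.x + threadIdx.x];

    // Deprecated form relied on implicit warp convergence:
    //   val += __shfl_down(val, offset);

    // New form: the mask states which lanes participate, and the call
    // synchronizes them before exchanging data (full warp assumed here).
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if ((threadIdx.x & 31) == 0)                    // lane 0 of each warp
        atomicAdd(out, val);
}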

Page 21

Learn More

“Cooperative Groups: Flexible CUDA Thread Programming”
https://devblogs.nvidia.com/parallelforall/cooperative-groups/

GTC San Jose 2017: “Cooperative Groups”, Kyrylo Perelygin and Yuan Lin

http://on-demand-gtc.gputechconf.com/gtc-quicklink/pTT9h

Page 22

DEVELOPER TOOLS

Page 23

UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source

Page Fault Correlation

Page 24

NEW UNIFIED MEMORY EVENTS

Page Throttling, Memory Thrashing, Remote Map

Visualize Virtual Memory Activity

Page 25

THE BEYOND SECTION

Page 26

FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc

Removes CUDA-specific allocator restrictions

Data movement is transparently handled

Requires operating system support:

HMM Linux Kernel Module

void sortfile(FILE *fp, int N) {
    char *data;

    // Allocate memory using any standard allocator
    data = (char *) malloc(N * sizeof(char));

    fread(data, 1, N, fp);

    sort<<<...>>>(data, N, 1, compare);

    use_data(data);

    // Free the allocated memory
    free(data);
}

CUDA 9 Code with System Allocator

Progress Update: HMM patchset will be integrated into Linux Kernel 4.14; NVIDIA driver support coming

Page 27

USING TENSOR CORES

Volta Optimized Frameworks and Libraries

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)

{wmma::fragment<matrix_a, …> Amat;wmma::fragment<matrix_b, …> Bmat;wmma::fragment<matrix_c, …> Cmat;

wmma::load_matrix_sync(Amat, a, 16);wmma::load_matrix_sync(Bmat, b, 16);wmma::fill_fragment(Cmat, 0.0f);

wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

wmma::store_matrix_sync(d, Cmat, 16,wmma::row_major);

}

CUDA C++ Warp-Level Matrix Operations

NVIDIA cuDNN, cuBLAS, TensorRT

Page 28

TENSOR CORE
Mixed Precision Matrix Math on 4x4 Matrices

D = AB + C, where A and B are FP16 matrices and C and D are FP16 or FP32 matrices (all 4x4)

Page 29

TENSOR CORE COORDINATION

Warp-synchronizing operation for cooperative matrix math

Full Warp 16x16 Matrix Math

Aggregate Matrix Multiply and Accumulate for 16x16 matrices

Result distributed across warp


Page 30

CUDA TENSOR CORE PROGRAMMING
16x16x16 Warp Matrix Multiply and Accumulate (WMMA)

D = AB + C

Page 31

CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes

wmma::fragment<matrix_a, …> Amat;

Per-Thread fragments to hold components of matrices for use with Tensor Cores

Page 32

CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

wmma::load_matrix_sync(Amat, a, stride);

Warp-level operation to fetch components of matrices into fragments


Page 33

CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation

wmma::mma_sync(Dmat, Amat, Bmat, Cmat);

Warp-level operation to perform matrix multiply and accumulate


Page 34

CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

wmma::store_matrix_sync(d, Dmat, stride);

Warp-level operation to store result fragments back to matrices in memory


Page 35

Learn More

Programming Tensor Cores in CUDA 9
https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/

Page 36

FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility

Partition using an arbitrary label:

// Four groups of threads with same computed value
int label = foo() % 4;
thread_group block = partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency

Page 37

FUTURE COOPERATIVE GROUPS

Reductions, sorting, prefix sum (scan), etc.

Library of Collective Algorithms

// collective key-value sort using all threads in the block
cooperative_groups::sort(this_thread_block(), myValues, myKeys);

// collective scan-based allocate across block
int sz = myAllocationSize(); // amount each thread wants
int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);

Note: preliminary API sketch

Page 38

May 8-11, 2017 | Silicon Valley

#GTC17

CUDA 9 AND BEYOND

[email protected] | @harrism

http://parallelforall.com

http://developer.nvidia.com/cuda-toolkit

