CUDA programming - Aalto University Wiki


NVIDIA Research

!   What is the world going to look like, what should our hardware look like in 5–10 years?
!   And how do we get there?

!   Engage and participate in the academic community

!   ~30 researchers around the globe (4 in Helsinki)

!   See http://research.nvidia.com

Today

!   Brief history to GPU programming

!   CUDA programming model
!   Writing CUDA programs

!   Designing parallel algorithms

Motivation for GPU Programming

!   This thing packs a lot of oomph

!   How to tap into that?

Early Days: GPGPU (bad!)

!   General-Purpose GPU programming
!   The craze around 2004 – 2006

!   Trick the GPU into general-purpose computing by casting the problem as graphics
!   Turn data into images (textures)
!   Turn algorithms into image synthesis (rendering passes)

!   Many attempts to handle these automatically
!   Brook, Sh, PeakStream, MS Accelerator, …
!   Take a “program”, somehow convert to shaders

Problems with GPGPU

!   Highly constrained memory access model
!   No scatter, no read/write access

!   Split computation into highly constrained passes
!   Limited by what shaders can do

!   Tough learning curve

!   To understand limitations, must understand graphics HW
!   Need crazy stunts to circumvent rigidity of hardware

!   Overhead of graphics API

GPGPU: An Illustrated Guide

Using graphics API to express programs

Designing GPGPU algorithms

The Road to CUDA

!   Okay, this GPGPU thing has potential
!   The only problem is that it sucks

!   Let’s design the right tool for the job

!   Need new hardware capabilities? Build it.
!   We are a hardware company, after all

!   Need a better API for poking the GPU? Ok.

!   Don’t invent a new language

CUDA Design Goals

!   Heterogeneous CPU/GPU computing platform

!   Easy to program
!   Also, easy to integrate GPU code into existing programs

!   Close enough to hardware to get best performance
!   For those who know what they’re doing

Some Ingredients of CUDA

!   SIMT execution model
!   Single Instruction, Multiple Thread
!   Allows writing scalar code instead of explicit SIMD

!   Ways to exploit locality
!   Warp and block execution model
!   Shared memory

!   Direct memory access

!   C/C++ with minimal extensions

To Whet Your Appetite

Speedup   Application
146X      Interactive visualization of volumetric white matter connectivity
 36X      Ionic placement for molecular dynamics simulation on GPU
 19X      Transcoding HD video stream to H.264
 17X      Fluid mechanics in Matlab using .mex file CUDA function
100X      Astrophysics N-body simulation
149X      Financial simulation of LIBOR model with swaptions
 47X      GLAME@lab: an M-script API for GPU linear algebra
 20X      Ultrasound medical imaging for cancer diagnostics
 24X      Highly optimized object oriented molecular dynamics
 30X      Cmatch exact string matching to find similar proteins and gene sequences

CUDA Programming Model

Programmer’s View of Hardware

[Figure: The CPU and its memory (DRAM) are connected over the PCIE bus to the GPU. The GPU contains multiple SMs, each with an L1 cache, a shared L2 cache, and its own GPU memory (DRAM).]

Threads, Warps, Blocks

!   A thread in CUDA executes scalar code
!   Very much like a usual CPU program

!   Hardware packs threads into warps
!   Crucial for efficient execution
!   Programmer can ignore warps (but you shouldn’t)

!   Threads are logically grouped into blocks
!   Threads in the same block can communicate and synchronize efficiently

Programmer’s View of SM

[Figure: An SM contains a set of cores and a pool of resident warps (Warp 0 … Warp n). Each warp holds a group of CUDA threads and a single program counter (PC).]

!   Each thread has otherwise independent state, but it shares its PC with the other threads of the warp

Programmer’s View of SM: Execution

[Figure: Executing r1 = r2 * r3 for one warp: the warp is selected, r2 and r3 are read for each of its threads, the cores compute the products, and the result is written to each thread’s r1.]

Programmer’s View of SM: Blocks

[Figure: Within an SM, one group of resident warps forms a CUDA thread block and another group forms another block; each block has its own piece of shared memory.]

Note: Blocks are formed on the fly from the available warps (they don’t need to be consecutive)

Implications

!   All threads in a warp always execute concurrently
!   Same PC, same instruction
!   You can exploit this if you’re careful!

!   But warps in a block are scheduled irregularly
!   Hence, threads of a block are not implicitly synchronized
!   But they are always in the same SM, and can synchronize efficiently and communicate through shared memory

!   Blocks are instantiated in SMs that have space
!   No way of knowing which blocks end up in which SMs
!   Key to good load balancing

Occupancy

!   How many warps can fit in one SM depends on resource usage
!   Number of registers / thread
!   Amount of shared memory / block

!   Block size matters too
!   Work is always launched in full blocks
!   Number of blocks / SM also limited

!   Occupancy = percentage of thread slots used
!   Handy occupancy calculator spreadsheet available
!   Directly affects latency hiding capability

Synchronization

!   Threads can specify a synchronization point
!   __syncthreads() intrinsic
!   This prevents the warp from being scheduled until all warps in the same block have arrived at the sync point
!   Very lightweight mechanism

!   Atomic operations can be used for avoiding race conditions globally
!   E.g., append to an array with atomicAdd(), as in the sketch below

!   Implicit synchronization between launches
!   Unless asynchronous operation is explicitly allowed
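
A minimal sketch of both mechanisms, assuming a made-up kernel that compacts the positive values of an input array; atomicAdd() hands out slots without races, and __syncthreads() orders the shared-memory staging steps. The kernel name, the staging scheme and the 256-thread block size are illustrative, not from the slides.

// Sketch: compact the positive elements of in[] into out[], using shared
// memory as a per-block staging buffer (assumes blockDim.x <= 256).
__global__ void CompactPositive(const float* in, float* out, int* outCount, int n)
{
    __shared__ float staging[256];
    __shared__ int blockCount;   // number of kept elements in this block
    __shared__ int blockBase;    // where this block writes in out[]

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)
        blockCount = 0;
    __syncthreads();                                  // blockCount is initialized for everyone

    if (i < n && in[i] > 0.0f)
    {
        int slot = atomicAdd(&blockCount, 1);         // race-free slot within the block
        staging[slot] = in[i];
    }
    __syncthreads();                                  // staging[] and blockCount are complete

    if (threadIdx.x == 0)
        blockBase = atomicAdd(outCount, blockCount);  // reserve a range in the global output
    __syncthreads();                                  // everyone sees blockBase

    if (threadIdx.x < blockCount)
        out[blockBase + threadIdx.x] = staging[threadIdx.x];
}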

SIMT Execution Model

!   How can threads of a warp diverge if they all have the same PC?

!   Partial solution: Per-instruction execution predication

!   Full solution: Hardware-supported execution mask, execution stack, and related instructions

Example: Instruction Predication

if (a < 10) small++;
else big++;

ISETP.GT.AND P0, pt, R6, 0x9, pt;    Set predicate register P0 if a > 9
@!P0 IADD R5, R5, 0x1;               If P0 is cleared, R5 = R5 + 1
@P0 IADD R4, R4, 0x1;                If P0 is set, R4 = R4 + 1

What About Complex Cases?

!   Nested if-else blocks, loops, recursion …

!   Need hardware execution mask and execution stack

Non-Predicated Example

if (a < 10) foo();
else bar();

/*0048*/ ISETP.GT.AND P0, pt, R4, 0x9, pt;
/*0050*/ @P0 BRA 0x70;
/*0058*/ ...;                                 if branch
/*0060*/ ...;
/*0068*/ BRA 0x80;
/*0070*/ ...;                                 else branch
/*0078*/ ...;
/*0080*/ continue here after the if-block

Case 1: All threads take the if branch
!   At the @P0 BRA, no thread wants to jump: the warp falls through into the if branch, and the BRA 0x80 skips the else branch

Case 2: All threads take the else branch
!   At the @P0 BRA, all threads want to jump: the warp executes only the else branch

Case 3: Some threads take the if branch, some take the else branch
!   At the @P0 BRA, some threads want to jump: the hardware pushes state onto the execution stack, runs both branches with the appropriate threads masked out, then pops and restores the active thread mask at the reconvergence point (0x80)

Benefits of SIMT

!   Supports all structured C++ constructs
!   If/else, switch/case, loops, function calls, exceptions
!   goto is a different beast – supported, but best to avoid

!   Multi-level constructs handled efficiently
!   Break/continue from inside multiple levels of conditionals
!   Function return from inside loops and conditionals
!   Retreating to exception handler from anywhere

!   You only need to care about SIMT when tuning for performance

Some Consequences of SIMT

!   An if statement takes the same number of cycles for any number of threads > 0
!   If nobody participates it’s cheap
!   Also, masked-out threads don’t do memory accesses

!   A loop is iterated until all active threads in the warp are done

!   A warp stays alive until every thread in it has terminated
!   Terminated threads cause “empty slots” in warps
!   Thread utilization = percentage of active threads

Coherent Execution Is Great

!   An if statement is perfectly efficient if either everyone takes it or nobody does
!   All threads stay active

!   A loop is perfectly efficient if everyone does the same number of iterations

!   Note: These are required for traditional SIMD

Incoherent Execution Is Okay

!   Conditionals are efficient as long as threads usually agree

!   Loops are efficient if threads usually take roughly the same number of iterations

!   Much easier to program than explicit SIMD
!   SIMT: incoherence is supported, performance degrades gracefully if control diverges
!   SIMD: performance is fixed, incoherence not supported

Striving for Execution Coherence

!   Learn to spot low-hanging fruit for improving execution coherence

!   Process input in coherent order
!   E.g., process nearby pixels of an image together

!   Fold branches together as much as possible
!   Only put the differing part in a conditional

!   Simple low-level fixes
!   Favor [f]min / [f]max over conditionals (see the sketch below)
!   Bitwise operators sometimes help
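
To illustrate the last point, here is a small sketch (my own, with made-up function names) of a clamp written with a branch and without one; both compute the same result, but the fminf/fmaxf form never diverges. In a case this small the compiler would likely predicate the branchy version anyway, so treat it purely as an illustration of the idiom.

// Branchy clamp: threads whose x falls in different ranges follow different paths.
__device__ float ClampBranchy(float x, float lo, float hi)
{
    if (x < lo)      return lo;
    else if (x > hi) return hi;
    else             return x;
}

// Branch-free clamp: every thread executes the same two instructions.
__device__ float ClampCoherent(float x, float lo, float hi)
{
    return fminf(fmaxf(x, lo), hi);
}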

Memory in CUDA, part 1

!   Global memory
!   Accessible from everywhere, including CPU (memcpy)
!   Requests go through L1, L2, DRAM

!   Shared memory
!   Either 16 or 48 KB per SM in Fermi
!   Pieces allocated to thread blocks when launched
!   Accessible from threads in the same block
!   Requests served directly, very fast

!   Local memory
!   Actually a thread-local portion of global memory
!   Used for register spilling and indexed arrays

Memory in CUDA, part 2

!   Textures
!   Data can also be fetched from DRAM through texture units
!   Separate texture caches
!   High latency, extreme pipelining capability
!   Read-only

!   Surfaces
!   Read / write access with pixel format conversions
!   Useful for integrating with graphics

!   Constants
!   Coherent and frequent access of same data

Simplified

!   Global memory
!   Almost all data access goes here, you will need this

!   Shared memory
!   Use to share data between threads

!   Textures
!   Use to accelerate data fetching

!   Local memory, constants, surfaces
!   Let’s ignore for now, details can be found in manuals

Memory Access Coherence

!   GPU memory buses are wide
!   Both external and internal

!   When a warp executes a memory instruction, the addresses matter a lot
!   Those that land on the same cache line are served together
!   Different cache lines are served sequentially

!   This can have a huge impact on performance
!   Easy to accidentally burden the memory system
!   Incoherent access also easily overflows caches

Improving Memory Coherence

!   Try to access nearby addresses from nearby threads

!   If each thread processes just one element, choose wisely which one

!   If each thread processes multiple elements, preferably use striding

Striding Example

!   We want each thread to process 10 elements of an array
!   64 threads per block

No striding (bad access pattern):
Thread 0:    0   1   2   3   4   5   6   7   8   9
Thread 1:   10  11  12  13  14  15  16  17  18  19
..
Thread 63: 630 631 632 633 634 635 636 637 638 639

With stride of 64 (optimal access pattern: at each time step the warp touches consecutive addresses):
Thread 0:    0  64 128 192 256 320 384 448 512 576
Thread 1:    1  65 129 193 257 321 385 449 513 577
..
Thread 63:  63 127 191 255 319 383 447 511 575 639
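
A sketch of the two access patterns in kernel code, under the assumption that each thread accumulates ELEMS_PER_THREAD elements and that the array holds exactly (number of threads) × ELEMS_PER_THREAD values; the kernel and variable names are made up. The slide strides within a block (stride 64), while the sketch strides across the whole grid, which is the same idea.

#define ELEMS_PER_THREAD 10

// Bad: each thread reads a contiguous chunk, so at any given moment the
// threads of a warp touch addresses that are far apart.
__global__ void SumChunked(const float* data, float* result)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < ELEMS_PER_THREAD; k++)
        sum += data[t * ELEMS_PER_THREAD + k];
    result[t] = sum;
}

// Good: on every iteration, consecutive threads read consecutive addresses,
// so each warp-wide access lands on a small number of cache lines.
__global__ void SumStrided(const float* data, float* result)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;   // total number of threads
    float sum = 0.0f;
    for (int k = 0; k < ELEMS_PER_THREAD; k++)
        sum += data[t + k * stride];
    result[t] = sum;
}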

Launching Work in CUDA

!   Kernel = function running on GPU
!   Written in CUDA C

!   A kernel is launched for a grid of blocks
!   The blocks and the grid can be 1D, 2D or 3D
!   The extra dimensions are really just syntactic sugar but convenient if the data lives in a 2D or 3D domain

!   Every thread gets to know its
!   Thread location within the block (threadIdx)
!   Block location within the grid (blockIdx)
!   Block and grid dimensions (blockDim, gridDim)

Example

!   Each block has 8×8 threads
!   So, 64 threads in a block = 2 warps

!   We launch a grid of 10×5 blocks
!   So, 50 blocks in total

Example values for one particular thread:
threadIdx.x = 1    threadIdx.y = 1
blockIdx.x = 9     blockIdx.y = 0
blockDim.x = 8     blockDim.y = 8
gridDim.x = 10     gridDim.y = 5
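
As a small aside (not on the slide), this is how such a thread typically derives its global 2D position inside a kernel; it is the idiom the later code examples use.

__global__ void ExampleKernel()
{
    // Global coordinates of this thread within the whole grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 9 * 8 + 1 = 73 for the thread above
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 * 8 + 1 = 1
    // ... use (x, y) ...
}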

What’s with the Blocks?

!   Why did we have blocks again, instead of just a flat gigantic grid of threads?
!   Because a block can be guaranteed to be localized
!   Launched at the same time, in the same SM

!   Threads of a block have the same shared memory
!   Load common data together and work on it

!   Threads of a block can synchronize efficiently
!   Synchronization points in code

!   Individual blocks must be truly independent

Writing CUDA Programs

Two APIs

!   CUDA can be used through two APIs

!   Driver API
!   Low-level API
!   GPU code compiled separately into binaries
!   CPU code manually loads GPU code and invokes it

!   Runtime API (this is what I will talk about now)
!   User-friendly high-level API
!   Language extensions for launching kernels
!   Compiler automatically splits code into GPU and CPU parts, compiles them separately, and links together

Defining and Launching Kernels

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()   // N-length vector add
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

In the <<<1, N>>> launch syntax, the first parameter is the number of blocks and the second is the number of threads per block; in general, both are of type dim3.

Defining and Launching Kernels (2D)

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()   // N*N matrix add
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Extending for Multiple Blocks

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()   // N*N matrix add
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
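
One detail worth noting about the launch above: the grid size N / threadsPerBlock.x only covers the whole matrix when N is a multiple of 16. A common variation (my addition, not from the slides) rounds the block count up and relies on the in-kernel bounds check:

// Round the block count up so that N need not be a multiple of the block size;
// the if (i < N && j < N) test in the kernel masks off the extra threads.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);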

Function Type Qualifiers

__global__
!   A kernel function
!   Executed on GPU
!   Callable from CPU only (using <<< >>> launch)

__device__
!   GPU-local function
!   Executed on GPU
!   Callable from GPU only (using standard function call)

__host__
!   Default if nothing else specified
!   CPU-only function
!   Can be combined with __device__ to compile for both
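
A minimal sketch of the three qualifiers in use; the function names are made up for illustration.

// Compiled for both CPU and GPU, callable from either side.
__host__ __device__ float Square(float x) { return x * x; }

// GPU-only helper, callable from kernels and other __device__ functions.
__device__ float LengthSquared(float x, float y) { return Square(x) + Square(y); }

// Kernel: runs on the GPU, launched from the CPU with <<< >>>.
__global__ void LengthSquaredKernel(const float* x, const float* y, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = LengthSquared(x[i], y[i]);
}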

Variable Type Qualifiers

__device__
!   “Global” variable residing in GPU memory
!   Accessible from all threads

__shared__
!   Resides in SM shared memory space
!   Accessible from threads of the same block

#define BLOCK_SIZE 16

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    ...
}

Using Shared Memory

!   Threads in a block may operate on largely the same data
!   Convolution-like operations, matrix multiply, …

!   Load the data once into shared memory, then operate on it
!   Share the loading between threads in the block

!   Synchronization is important
!   Call __syncthreads() after reading the data to ensure that it is valid before starting any computation on it (see the sketch below)
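
A sketch of the load-then-compute pattern, assuming a made-up 1D example where each output needs its own element and its two neighbours; the TILE size, the halo handling and the kernel name are illustrative, and blockDim.x is assumed to equal TILE.

#define TILE 256   // threads per block assumed in this sketch

// out[i] = in[i-1] + in[i] + in[i+1]; each input element is loaded from
// global memory once per block instead of three times.
__global__ void Stencil3(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];                  // +2 for the halo elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int t = threadIdx.x + 1;                          // index within the tile (skip left halo)

    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)                             // left halo
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                // right halo
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();                                  // the whole tile is now valid

    if (i < n)
        out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}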

Using Global Memory

!   GPU memory needs to be allocated
!   cudaMalloc() and cudaFree()

!   Data transfers must be done manually
!   cudaMemcpy()

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int VecAddWrapper(float* h_A, float* h_B, float* h_C, int N)   // h_* are host (CPU) memory pointers
{
    size_t size = N * sizeof(float);

    // Allocate vectors in device memory (d_* are device (GPU) memory pointers)
    float* d_A;
    float* d_B;
    float* d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    ...
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Smart Use of Memory

!   Avoid moving data around unnecessarily
!   Keep intermediate buffers on GPU

!   Only transfer what you need
!   Don’t copy unchanged data again
!   Don’t copy unnecessary data

!   Concurrent data transfer possible on latest devices
!   Needs host memory to be page-locked
!   Needs kernel execution to be non-blocking
!   Needs something useful to be done at the same time
!   Non-trivial, so do only if you know you need it (a sketch follows below)
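
A rough sketch of the concurrent-transfer setup, assuming a made-up kernel Process and a wrapper function of my own; page-locked host memory comes from cudaMallocHost(), and the copy and the launch are issued into a stream so the CPU stays free to do something useful in the meantime.

void RunAsync(int N)
{
    size_t size = N * sizeof(float);

    // Page-locked (pinned) host buffer: required for truly asynchronous copies.
    float* h_data;
    cudaMallocHost(&h_data, size);
    // ... fill h_data ...

    float* d_data;
    cudaMalloc(&d_data, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both calls return immediately; they execute in order within the stream.
    cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
    Process<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);

    // ... do useful CPU work here ...

    cudaStreamSynchronize(stream);   // wait until the copy and the kernel are done

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
}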

Error Checking

!   All functions return an error code
!   cudaSuccess if the call was successful

!   Also possible to check for last error
!   cudaGetLastError() and cudaPeekAtLastError()

!   Error strings available through the API
!   cudaGetErrorString()

!   Checking errors of asynchronous operations is a little more complex, refer to manual
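
A common convenience wrapper built from the calls above; this macro is my own sketch, not something the slides or the toolkit provide.

#include <cstdio>
#include <cstdlib>

// Wrap every runtime API call: CUDA_CHECK(cudaMalloc(&d_A, size));
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// A kernel launch returns no error code itself, so check afterwards:
// MyKernel<<<grid, block>>>(...);
// CUDA_CHECK(cudaGetLastError());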

Manuals and Resources

!   CUDA C Programming Guide
!   Best starting point
!   Describes CUDA programming model, language extensions, builtin functions and types, etc.

!   CUDA Toolkit Reference Manual
!   Documentation of host-side CUDA functions

!   CUDA C Best Practices Guide
!   Information on improving performance and ensuring compatibility

Peeking Under the Hood

!   NVCC compilation chain: CUDA C → PTX → SASS
!   CPU code is compiled by MSVC

!   PTX is device-independent intermediate assembly
!   To export PTX: nvcc -ptx foo.cu
!   PTX is not supposed to be optimized, so don’t expect it to be

!   SASS is device-specific low-level assembly
!   Compile: nvcc -cubin -arch=sm_<nn> foo.cu
!   Dump: cuobjdump -sass foo.cubin
!   SASS instruction sets in cuobjdump manual

Designing Parallel Algorithms

A Full Can of Worms

!   Designing parallel algorithms is not trivial
!   Especially so for thousands of threads
!   Cannot rely on fine-grained global synchronization

!   Active area of research (partially thanks to GPUs)
!   E.g., sorting performance going up all the time

!   A highly parallel algorithm may need to do more work than a sequential one
!   Almost always a higher number of primitive operations
!   Or worse complexity, e.g., O(n log n) instead of O(n)
!   The price we have to pay for better performance

Data-Parallel Is Easy

!   A data-parallel program does some computation for a large number of elements
!   All computations must be independent
!   There must be enough input to utilize the GPU properly

!   Natural way to parallelize: one thread per output element
!   Convolution, Mandelbrot, etc.: thread = pixel
!   Ray tracing: thread = ray

!   Boost performance by sharing data if possible

Everything Else

!   Many interesting tasks are not data-parallel
!   Sorting, compression, variable-length data, etc.
!   Even simple stuff like finding the maximum element

!   Hierarchical processing is often a good idea

!   Split input into a number of chunks
!   Need enough chunks to utilize the GPU

!   Do something per chunk to reduce problem size
!   Then process the remains on CPU or continue on GPU

Example 1: Find Maximum

!   Let’s say we have 100 million elements in an array

!   Split into 100000 chunks with 1000 elements each
!   To find the best performance, experiment with the numbers

!   Process one chunk of 1000 elements per thread
!   Find and output the maximum

!   Now we have 100000 elements, repeat
!   Utilization is bad from here forward, but the heavy part was parallelized successfully
!   Or: download the 100K elements to CPU, process there
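
A sketch of the heavy first step, with a made-up kernel name; to keep the memory accesses coherent it assigns elements to threads with the strided pattern from the earlier striding example rather than literally contiguous chunks, and it assumes the array holds exactly numThreads × ELEMS_PER_THREAD values.

#define ELEMS_PER_THREAD 1000   // "chunk" size from the example above

// Pass 1: each of the 100000 threads reduces 1000 elements to one candidate maximum.
__global__ void ChunkMax(const float* in, float* candidates, int numThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numThreads)
        return;

    float m = in[t];
    for (int k = 1; k < ELEMS_PER_THREAD; k++)
        m = fmaxf(m, in[t + k * numThreads]);
    candidates[t] = m;
}

// The 100000 candidates can then be reduced again on the GPU, or copied
// back to the CPU and finished there, as described above.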

Example 2: Cumulative Sum

!   Add to each element of an array the sum of its predecessors

!   Can be done using n summations
!   Looks like an inherently serial algorithm

Input:   1  4  0  3  4
Output:  1  5  5  8 12

Each summation requires that the previous one has completed. Impossible to parallelize?! Not at all.

Parallel Cumulative Sum

!   Let’s suppose we have 2000000 elements and 1000 threads

!   Slice the input into 1000 segments (in0 … in999), each 2000 elements long

!   Calculate the cumulative sum for each segment in parallel, giving output segments out0 … out999

!   Take the last element of each output segment, giving a 1000-element array sums

!   Compute the cumulative sum over these (Σsums)

!   Add the result to the output segments

… and we’re done!

Parallel Cumulative Sum

!   1st pass: process each input segment
!   Perfectly parallelized, perfectly coherent memory access

!   2nd pass: cumulative sum over last elements
!   Parallelizes badly, but very small amount of work

!   3rd pass: add bias to every output element
!   Perfectly parallelized, perfectly coherent memory access

!   Need two additions per element, but still O(n)
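
A sketch of the three passes as kernels, assuming the 1000-segment split above; it is deliberately simplified (one thread walks each segment sequentially, and pass 2 uses a single thread), so unlike a production implementation it does not have the coherent access pattern the slide calls for. Kernel names are made up.

// Pass 1: inclusive cumulative sum within each segment; one thread per segment.
__global__ void SegmentScan(const float* in, float* out, int segLen, int numSegs)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegs) return;
    float running = 0.0f;
    for (int k = 0; k < segLen; k++)
    {
        running += in[s * segLen + k];
        out[s * segLen + k] = running;
    }
}

// Pass 2: cumulative sum of the segment totals (the last element of each output
// segment). Tiny amount of work, so a single thread is acceptable here.
__global__ void ScanSegmentTotals(const float* out, float* bias, int segLen, int numSegs)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    float running = 0.0f;
    for (int s = 0; s < numSegs; s++)
    {
        bias[s] = running;                        // sum of all preceding segments
        running += out[s * segLen + segLen - 1];  // last element = segment total
    }
}

// Pass 3: add each segment's bias to every element of that segment.
__global__ void AddBias(float* out, const float* bias, int segLen, int numSegs)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegs) return;
    for (int k = 0; k < segLen; k++)
        out[s * segLen + k] += bias[s];
}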

Wrapping Up

Takeaways

!   It’s very easy to get started
!   Just C++, some extra work needed for managing memory

!   But there’s plenty of room for creativity when striving for performance
!   Low-level optimizations, data sharing, algorithmic improvements, concurrent processing, …
!   Profiling tools available

!   Scalable code will be fast on future hardware as well
!   Basically just more blocks running concurrently

Thank You