
GPU Computing with CUDA

CUDA Training Day, June 26, 2012
University of Reims Champagne-Ardenne, France

Gabriel Noaje – gabriel.noaje@univ-reims.fr

INTRODUCTION TO MASSIVELY PARALLEL COMPUTING

Moore’s Law (paraphrased)

“The number of transistors on an integrated circuit doubles every two years.”

– Gordon E. Moore

Moore’s Law (visualized)

Credits: Wikimedia

Serial Performance Scaling is Over

• Cannot continue to scale processor frequencies
  – no 10 GHz chips
• Cannot continue to increase power consumption
  – can't melt the chip
• Can continue to increase transistor density
  – as per Moore's Law

How to Use Transistors?

• Instruction-level parallelism
  – out-of-order execution, speculation, …
  – vanishing opportunities in a power-constrained world
• Data-level parallelism
  – vector units, SIMD execution, …
  – increasing: SSE, AVX, Cell SPE, ClearSpeed, GPU
• Thread-level parallelism
  – increasing: multithreading, multicore, manycore
  – Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, …

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
  – Computation: TFLOP/s vs. 100 GFLOP/s
• GPU in every PC – massive volume & potential impact

[Plot: peak GFLOP/s over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. CPUs (3 GHz Dual-Core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
  – Bandwidth: ~10x
• GPU in every PC – massive volume & potential impact

[Plot: memory bandwidth over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. CPUs (3 GHz Dual-Core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]

KEPLER ARCHITECTURE

Kepler GK110 Block Diagram

• 7.1B transistors
• 15 SMX units
• > 1 TFLOP FP64
• 1.5 MB L2 cache
• 384-bit GDDR5
• PCI Express Gen3

Kepler GK110 SMX vs Fermi SM

SMX: Efficient Performance

• 192 CUDA cores
• 64 FP64 units
• 32 special function units
• 32 load/store units dedicated to memory accesses
• 65,536 registers
• 64 KB shared memory, 48 KB cache

[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is made of blocks, e.g. Block (0, 0) … Block (1, 1), and each block is made of threads indexed in up to three dimensions, e.g. Thread (0, 0, 0) … Thread (3, 1, 0)]

Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
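
A minimal sketch of the idea (not from the slides; the kernel and image layout are hypothetical): each thread derives a 2D pixel coordinate from its block and thread IDs and works on that one pixel.

    // invert one pixel per thread of an 8-bit grayscale image
    __global__ void invert(unsigned char *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            img[y * width + x] = 255 - img[y * width + x];
    }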

The “New” Moore’s Law

• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is the most scalable solution
  – Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores, …
  – You will always have more data than cores – build the computation around the data

Generic Multicore Chip

[Diagram: a handful of processor + on-chip memory pairs sharing a Global Memory]

• Handful of processors, each supporting ~1 hardware thread
• On-chip memory near processors (cache, RAM, or both)
• Shared global memory space (external DRAM)

Generic Manycore Chip

[Diagram: many processor + on-chip memory pairs sharing a Global Memory]

• Many processors, each supporting many hardware threads
• On-chip memory near processors (cache, RAM, or both)
• Shared global memory space (external DRAM)

Enter GPU Computing

• Massive economies of scale
• Massively parallel

[Images: 3D animation, movie playback, CAD (Computer-Aided Design), games]

GPU Evolution

• High-throughput computation
  – GeForce GTX 280: 933 GFLOP/s
• High-bandwidth memory
  – GeForce GTX 280: 140 GB/s
• High availability to all
  – 180+ million CUDA-capable GPUs in the wild

[Timeline 1995–2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]

Why is this different from a CPU?

• Different goals produce different designs
  – GPU assumes the workload is highly parallel
  – CPU must be good at everything, parallel or not
• CPU: minimize latency experienced by one thread
  – big on-chip caches
  – sophisticated control logic
• GPU: maximize throughput of all threads
  – # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  – multithreading can hide latency => skip the big caches
  – share control logic across many threads

[Images of GPU-accelerated applications: financial analysis, molecular docking, medical MRI, geological modeling, weather simulation, computer virus detection]

CUDA

• = “Compute Unified Device Architecture”
• extends the ANSI C standard
• gentle learning curve (compared to Cg, HLSL, etc.)
• opens the underlying architecture to the user

www.nvidia.com/getcuda (driver + toolkit + SDK + docs + …)

Initially – use graphics calls:
• Cg by NVIDIA
• HLSL by Microsoft

Nowadays – a rich GPU programming ecosystem:
• ATI Stream by AMD
• CUDA by NVIDIA (NVIDIA hardware specific)
• OpenCL by the Khronos Group
• DirectCompute by Microsoft

CUDA: Scalable parallel programming

• Augment C/C++ with minimalist abstractions
  – let programmers focus on parallel algorithms
  – not on the mechanics of a parallel programming language
• Provide a straightforward mapping onto hardware
  – good fit to GPU architecture
  – maps well to multicore CPUs too
• Scale to 100s of cores & 10,000s of parallel threads
  – GPU threads are lightweight – create/switch is free
  – GPU needs 1000s of threads for full utilization

Key Parallel Abstractions in CUDA

• Hierarchy of concurrent threads
• Lightweight synchronization primitives
• Shared memory model for cooperating threads

Hierarchy of concurrent threads

• Parallel kernels are composed of many threads
  – all threads execute the same sequential program
• Threads are grouped into thread blocks
  – threads in the same block can cooperate
• Threads/blocks have unique IDs

[Diagram: a Thread t; a Block b grouping threads t0, t1, …, tB]

CUDA Model of Parallelism

• CUDA virtualizes the physical hardware
  – a thread is a virtualized scalar processor (registers, PC, state)
  – a block is a virtualized multiprocessor (threads, shared memory)
• Scheduled onto physical hardware without pre-emption
  – threads/blocks launch & run to completion
  – blocks should be independent

[Diagram: many Block + per-block Memory pairs sharing a Global Memory]

Heterogeneous Computing

[Images: multicore CPU alongside the GPU]

C for CUDA

• Philosophy: provide the minimal set of extensions necessary to expose power

• Function qualifiers:

    __global__ void my_kernel() { }
    __device__ float my_device_func() { }

• Variable qualifiers:

    __constant__ float my_constant_array[32];
    __shared__   float my_shared_array[32];

• Execution configuration:

    dim3 grid_dim(100, 50);    // 5000 thread blocks
    dim3 block_dim(4, 8, 8);   // 256 threads per block
    my_kernel<<<grid_dim, block_dim>>>(...);   // launch kernel

• Built-in variables and functions valid in device code:

    dim3 gridDim;          // grid dimension
    dim3 blockDim;         // block dimension
    dim3 blockIdx;         // block index
    dim3 threadIdx;        // thread index
    void __syncthreads();  // thread synchronization
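
Putting the pieces above together, a minimal sketch (the kernel, array, and symbol names are hypothetical, not from the slides):

    // scale.cu – scale an array by a factor held in constant memory
    __constant__ float scale_factor;

    __global__ void scale(float *data)
    {
        __shared__ float tile[256];                       // one element per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];
        __syncthreads();                                  // all loads done before use
        data[i] = tile[threadIdx.x] * scale_factor;
    }

    // host side (assuming N is a multiple of 256):
    //   cudaMemcpyToSymbol(scale_factor, &h_scale, sizeof(float));
    //   scale<<<N / 256, 256>>>(d_data);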

CUDA PROGRAMMING BASICS

Outline of CUDA Basics

• Basic kernels and execution on the GPU
• Basic memory management
• Coordinating CPU and GPU execution

• See the Programming Guide for the full API

CUDA Programming Model

• Parallel code (kernel) is launched and executed on a device by many threads
• Launches are hierarchical
  – Threads are grouped into blocks
  – Blocks are grouped into grids
• Familiar serial code is written for a single thread
  – Each thread is free to execute a unique code path
  – Built-in thread and block ID variables

High Level View

[Diagram: the GPU – several SMs, each with its SMEM, attached to Global Memory – connected over PCIe to the CPU and chipset]

Blocks of threads run on an SM

[Diagram: a thread runs on a streaming processor with its registers and per-thread memory; a thread block runs on a streaming multiprocessor with its per-block shared memory (SMEM)]

Whole grid runs on GPU

[Diagram: the many blocks of a grid spread across the SMs (each with SMEM), all sharing Global Memory]

Thread Hierarchy

• Threads launched for a parallel section are partitioned into thread blocks
  – Grid = all blocks for a given launch
• A thread block is a group of threads that can:
  – Synchronize their execution
  – Communicate via shared memory

Memory Model

[Diagram: sequential kernels (Kernel 0, Kernel 1, …) all read and write the same per-device Global Memory]

Memory Model

[Diagram: Device 0 memory and Device 1 memory are separate from Host memory; data moves between them with cudaMemcpy()]

Example: Vector Addition Kernel

Device code:

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float *A, float *B, float *C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

Host code:

    int main()
    {
        // Run grid of N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
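
The launch above assumes N is an exact multiple of 256. A common defensive variant (a sketch, not in the original slides) rounds the number of blocks up and guards the store:

    // the kernel ignores the extra threads in the last block
    __global__ void vecAdd(float *A, float *B, float *C, int N)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // host: round the grid size up
    vecAdd<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);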

Example: Host code for vecAdd

    // allocate and initialize host (CPU) memory
    float *h_A = ..., *h_B = ..., *h_C = ...;   // h_C left empty

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float) );
    cudaMalloc( (void**) &d_B, N * sizeof(float) );
    cudaMalloc( (void**) &d_C, N * sizeof(float) );

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Example: Host code for vecAdd (2)

    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

    // copy result back to host memory
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

    // do something with the result…

    // free device (GPU) memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
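
The slides omit error checking; a hedged sketch of how the same host code could verify the launch, using standard CUDA runtime calls:

    cudaError_t err = cudaGetLastError();       // did the launch itself fail?
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();              // wait and catch execution errors
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));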

IDs and Dimensions

• Threads:
  – 3D IDs, unique within a block
• Blocks:
  – 2D IDs, unique within a grid
• Dimensions set at launch
  – Can be unique for each grid
• Built-in variables:
  – threadIdx, blockIdx
  – blockDim, gridDim

[Diagram: the device runs Grid 1, made of Blocks (0, 0) … (2, 1); Block (1, 1) is made of Threads (0, 0) … (4, 2)]

Kernel Variations and Output

    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = 7;
    }
    // Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = blockIdx.x;
    }
    // Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = threadIdx.x;
    }
    // Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Code executed on GPU

• C/C++ with some restrictions:
  – Can only access GPU memory
  – No variable number of arguments
  – No static variables
  – No recursion
  – No dynamic polymorphism
• Must be declared with a qualifier:
  – __global__ : launched by the CPU, cannot be called from the GPU, must return void
  – __device__ : called from other GPU functions, cannot be called by the CPU
  – __host__ : can be called by the CPU
  – __host__ and __device__ qualifiers can be combined
    • the function is compiled for both host and device
    • sample use: overloading operators
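
A minimal sketch of a combined qualifier (the function is hypothetical): the same source is compiled once for the host and once for the device, so it can be called from either side.

    __host__ __device__ float lerp(float a, float b, float t)
    {
        return a + t * (b - a);
    }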

Memory Spaces

• CPU and GPU have separate memory spaces
  – Data is moved across the PCIe bus
  – Use functions to allocate/set/copy memory on the GPU
  – Very similar to the corresponding C functions
• Pointers are just addresses
  – Can’t tell from the pointer value whether the address is on the CPU or the GPU
  – Must exercise care when dereferencing:
    • Dereferencing a CPU pointer on the GPU will likely crash, and vice versa

CUDA Device Memory Model

• Device code can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant memory
• Host code can:
  – Transfer data to/from per-grid global and constant memories

GPU Memory Allocation / Release

• Host (CPU) manages device (GPU) memory:
  – cudaMalloc(void **pointer, size_t nbytes)
  – cudaMemset(void *pointer, int value, size_t count)
  – cudaFree(void *pointer)

    int n = 1024;
    int nbytes = 1024*sizeof(int);
    int *d_a = 0;
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemset( d_a, 0, nbytes );
    cudaFree( d_a );

Data Copies

• cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
  – returns after the copy is complete
  – blocks the CPU thread until all bytes have been copied
  – doesn’t start copying until previous CUDA calls complete
• enum cudaMemcpyKind
  – cudaMemcpyHostToDevice
  – cudaMemcpyDeviceToHost
  – cudaMemcpyDeviceToDevice
• Non-blocking copies are also available
  – cudaMemcpyAsync
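
A hedged sketch of the non-blocking variant (not from the slides): the host buffer is page-locked with cudaMallocHost so the copy can overlap other CPU work, and the copy is queued into a stream.

    size_t nbytes = 1024 * sizeof(float);
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void**)&h_buf, nbytes);           // pinned (page-locked) host memory
    cudaMalloc((void**)&d_buf, nbytes);

    cudaMemcpyAsync(d_buf, h_buf, nbytes,
                    cudaMemcpyHostToDevice, stream);  // returns immediately
    // ... other CPU work can proceed here ...
    cudaStreamSynchronize(stream);                    // wait for the copy to finish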

Code Walkthrough 1

    // walkthrough1.cu
    #include <stdio.h>

    int main()
    {
        int dimx = 16;
        int num_bytes = dimx*sizeof(int);

        int *d_a=0, *h_a=0;   // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc( (void**)&d_a, num_bytes );

        if( 0==h_a || 0==d_a ) {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset( d_a, 0, num_bytes );
        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        for(int i=0; i<dimx; i++)
            printf("%d ", h_a[i] );
        printf("\n");

        free( h_a );
        cudaFree( d_a );

        return 0;
    }

Example: Shuffling Data

Device code:

    // Reorder values based on keys
    // Each thread moves one element
    __global__ void shuffle(int *prev_array, int *new_array, int *indices)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        new_array[i] = prev_array[indices[i]];
    }

Host code:

    int main()
    {
        // Run grid of N/256 blocks of 256 threads each
        shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
    }

Kernel with 2D Indexing

    __global__ void kernel( int *a, int dimx, int dimy )
    {
        int ix  = blockIdx.x*blockDim.x + threadIdx.x;
        int iy  = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = iy*dimx + ix;

        a[idx] = a[idx]+1;
    }

    int main()
    {
        int dimx = 16;
        int dimy = 16;
        int num_bytes = dimx*dimy*sizeof(int);

        int *d_a=0, *h_a=0;   // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc( (void**)&d_a, num_bytes );

        if( 0==h_a || 0==d_a ) {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset( d_a, 0, num_bytes );

        dim3 grid, block;
        block.x = 4;
        block.y = 4;
        grid.x  = dimx / block.x;
        grid.y  = dimy / block.y;

        kernel<<<grid, block>>>( d_a, dimx, dimy );

        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        for(int row=0; row<dimy; row++) {
            for(int col=0; col<dimx; col++)
                printf("%d ", h_a[row*dimx+col] );
            printf("\n");
        }

        free( h_a );
        cudaFree( d_a );

        return 0;
    }

Blocks must be independent

• Any possible interleaving of blocks should be valid
  – presumed to run to completion without pre-emption
  – can run in any order
  – can run concurrently OR sequentially
• A thread block is a batch of threads that can cooperate with each other by:
  – synchronizing their execution: __syncthreads()
  – sharing data via shared memory
• The independence requirement gives scalability

CUDA MEMORIES

Hardware Implementation of CUDA Memories

• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory

[Diagram: a grid of blocks; each block has its Shared Memory and per-thread Registers; all blocks access the per-grid Global Memory and Constant Memory, both of which the Host can access]

CUDA Variable Type Qualifiers

• “automatic” scalar variables without a qualifier reside in a register
  – the compiler will spill to thread-local memory if necessary
• “automatic” array variables without a qualifier reside in thread-local memory

    Variable declaration              Memory     Scope    Lifetime
    int var;                          register   thread   thread
    int array_var[10];                local      thread   thread
    __shared__ int shared_var;        shared     block    block
    __device__ int global_var;        global     grid     application
    __constant__ int constant_var;    constant   grid     application

CUDA Variable Type Performance

• scalar variables reside in fast, on-chip registers
• shared variables reside in fast, on-chip memories
• thread-local arrays & global variables reside in uncached off-chip memory
• constant variables reside in cached off-chip memory

    Variable declaration              Memory     Penalty
    int var;                          register   1x
    int array_var[10];                local      100x
    __shared__ int shared_var;        shared     1x
    __device__ int global_var;        global     100x
    __constant__ int constant_var;    constant   1x

Where to declare variables?

• Can the host access it?
  – Yes → declare outside of any function:

    __constant__ int constant_var;
    __device__ int global_var;

  – No → declare in the kernel:

    int var;
    int array_var[10];
    __shared__ int shared_var;

Example – thread-local variables

    // motivate per-thread variables with
    // Ten Nearest Neighbors application
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];

        // per-thread heap goes in off-chip memory
        float2 heap[10];

        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...

        // write out the contents of heap to result
        ...
    }

Example – shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if(i > 0)
        {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

• What are the bandwidth requirements of this kernel? Two global loads per thread.
• How many times does this kernel load input[i]? Twice: once by thread i and again by thread i+1.
• Idea: eliminate the redundancy by sharing data.

Example – shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;

        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];

        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();

        if(tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if(i > 0)
        {
            // handle thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }

Example – shared variables

    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate a __shared__ array will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }

    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

A Common Programming Strategy

• Partition data into subsets that fit into shared memory
• Handle each data subset with one thread block
• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Perform the computation on the subset from shared memory
• Copy the result from shared memory back to global memory

• Carefully partition data according to access patterns
  – Read-only → __constant__ memory (fast)
  – R/W & shared within the block → __shared__ memory (fast)
  – R/W within each thread → registers (fast)
  – Indexed R/W within each thread → local memory (slow)
  – R/W inputs/results → cudaMalloc'ed global memory (slow)

Communication Through Memory

• Question:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        my_shared_variable = threadIdx.x;

        // what is the value of
        // my_shared_variable?
    }

Communication Through Memory

• This is a race condition
  – The result is undefined
  – The order in which threads access the variable is undefined without explicit coordination
• Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics

Communication Through Memory

• Use __syncthreads to ensure data is ready for access

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();

        // the state of the entire data array
        // is now well-defined for all threads
        // in this block
    }

Communication Through Memory

• Use atomic operations to ensure exclusive access to a variable

    // assume *result is initialized to 0
    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);

        // after this kernel exits, the value of
        // *result will be the sum of the input
    }

Hierarchical Atomics

• Divide & conquer
  – Per-thread atomicAdd to a __shared__ partial sum
  – Per-block atomicAdd to the total sum

[Diagram: per-block partial sums Σ0, Σ1, …, Σi combined into the total sum Σ]
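
A minimal sketch of the hierarchical scheme (not from the slides; it assumes one partial sum per block and that *result is zeroed before the launch):

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial;                   // per-block partial sum
        if (threadIdx.x == 0) partial = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&partial, input[i]);            // per-thread add into shared memory
        __syncthreads();

        if (threadIdx.x == 0)
            atomicAdd(result, partial);           // one per-block add into the total
    }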

SMX EXECUTION

How an SM executes threads

• Overview of how a Streaming Multiprocessor works
• SIMT execution
• Divergence

Scheduling Blocks onto SMs

[Diagram: thread blocks 5, 27, 61, 2001, … being assigned to a Streaming Multiprocessor]

• HW schedules thread blocks onto available SMs
  – No guarantee of ordering among thread blocks
  – HW will schedule a thread block as soon as a previous thread block finishes

Warps

• Each thread block is executed as 32-thread warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 × 3 = 24 warps
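
The warp size is fixed by the hardware, but it can be queried at run time; a small host-side sketch using the standard runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);           // properties of device 0
        printf("warp size = %d\n", prop.warpSize);   // 32 on current NVIDIA GPUs
        return 0;
    }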

Mapping of Thread Blocks

• Each thread block is mapped to one or more warps
• The hardware schedules each warp independently

[Diagram: Thread Block N (128 threads) is split into warps TB N W1 … TB N W4]

Thread Scheduling Example

• The SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by the SM
  – Warps whose next instruction has its inputs ready for consumption are eligible for execution
  – Eligible warps are selected for execution using a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

[Diagram: timeline of warps from several thread blocks (TB = Thread Block, W = Warp) interleaved on one SM; whenever a warp stalls (e.g. TB1 W1, TB2 W1, TB3 W2), another eligible warp is selected]

Control Flow Divergence

• What happens if you have the following code?

    if(foo(threadIdx.x))
    {
        do_A();
    }
    else
    {
        do_B();
    }

[Diagram: at the branch, some threads of a warp take Path A while the others take Path B]

Control Flow Divergence

• Nested branches are handled as well

    if(foo(threadIdx.x))
    {
        if(bar(threadIdx.x))
            do_A();
        else
            do_B();
    }
    else
        do_C();

[Diagram: nested branches produce Paths A, B, and C within the warp]

Control Flow Divergence

• You don’t have to worry about divergence for correctness
• You might have to think about it for performance
  – Depends on your branch conditions

Control Flow Divergence

• Performance drops off with the degree of divergence

    switch(threadIdx.x % N)
    {
        case 0:
            ...
        case 1:
            ...
    }

[Plot: performance versus degree of divergence – performance decreases as divergence increases]
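
One common mitigation (a sketch, not from the slides) is to make the branch condition uniform across each 32-thread warp, so all threads of a warp take the same case:

    // threads in the same warp now select the same branch
    switch ((threadIdx.x / 32) % N)
    {
        case 0:
            ...
        case 1:
            ...
    }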

Compiling a CUDA program

[Diagram: nvcc splits a .cu file into .c/.cpp host code and .gpu device code; the device code is embedded as a fat binary in the object file, which is linked with the C/C++ and CUDA libraries into an executable with embedded GPU code]

CUDA Makefile

    CC       = nvcc
    CUDA_DIR = /opt/cuda/4.1
    SDK_DIR  = ${CUDA_DIR}/sdk
    CFLAGS   = -I. -I${CUDA_DIR}/include -I${SDK_DIR}/C/common/inc
    LDFLAGS  = -L${CUDA_DIR}/lib64 -L${SDK_DIR}/lib -L${SDK_DIR}/C/common/lib
    LIB      = -lm -lrt
    SOURCES  = vector_add.cu
    EXECNAME = vector_add

    all:
            $(CC) --ptxas-options=-v -g -G -keep -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

    clean:
            $(CC) --ptxas-options=-v -g -G -keep -clean -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)
            rm -f *.o core

OPENACC API

3 Ways to Accelerate Applications

[Diagram: three paths from Applications – Libraries (“drop-in” acceleration), OpenACC Directives (easily accelerate applications), and Programming Languages (maximum flexibility)]

OpenACC Directives

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

• Your original Fortran or C code
• Simple compiler hints
• The compiler parallelizes the code
• Works on many-core GPUs & multicore CPUs

OpenACC: Open Programming Standard for Parallel Computing

• Easy: directives are the easy path to accelerate compute-intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU directives allow complete access to the massive parallel power of a GPU

OpenACC: The Standard for GPU Directives

• High-level, with low-level access
  – Compiler directives to specify parallel regions in C, C++, and Fortran
  – OpenACC compilers offload parallel regions from host to accelerator
  – Portable across OSes, host CPUs, accelerators, and compilers
• Create high-level heterogeneous programs
  – Without explicit accelerator initialization
  – Without explicit data or program transfers between host and accelerator
• The programming model allows programmers to start simple
  – Enhance with additional guidance for the compiler on loop mappings, data location, and other performance details
• Compatible with other GPU languages and libraries
  – Interoperates with CUDA C/Fortran and GPU libraries, e.g. CUFFT, CUBLAS, CUSPARSE, etc.

OpenACC Specification and Website

• The full OpenACC 1.0 specification is available online: http://www.openacc-standard.org
• A quick reference card is also available
• Available compilers:
  – CAPS HMPP OpenACC (soon at Romeo)
  – PGI OpenACC (soon at Romeo)

A Very Simple Exercise: SAXPY

SAXPY in C:

    void saxpy(int n, float a, float *x, float *restrict y)
    {
    #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

SAXPY in Fortran:

    subroutine saxpy(n, a, x, y)
      real :: x(:), y(:), a
      integer :: n, i
    !$acc kernels
      do i=1,n
        y(i) = a*x(i)+y(i)
      enddo
    !$acc end kernels
    end subroutine saxpy

    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

Directive Syntax

• Fortran:

    !$acc directive [clause [,] clause] …]

  Often paired with a matching end directive surrounding a structured code block:

    !$acc end directive

• C:

    #pragma acc directive [clause [,] clause] …]

  Often followed by a structured code block
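
Beyond the bare kernels directive, data clauses tell the compiler what to move between host and accelerator; a hedged sketch (not from the slides) of the C SAXPY with explicit data movement:

    void saxpy(int n, float a, float *x, float *restrict y)
    {
    // copy x to the accelerator; copy y in and back out
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }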

CUDA = much more

• Atomic and shuffle operations
• Streams
• GPU Direct
• CUBLAS, NPP, CUFFT, …
• OpenCL

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007–2009 – ECE 498AL, University of Illinois, Urbana-Champaign

References

• www.nvidia.com/cuda (CUDA Zone, developer zone)
• Programming Massively Parallel Processors: A Hands-on Approach, David Kirk and Wen-mei Hwu
• CUDA by Example: An Introduction to General-Purpose GPU Programming, Jason Sanders and Edward Kandrot
• CUDA Application Design and Development, Rob Farber

Questions?

Matrix Multiplication Example (full version)

• Generalize the adjacent_difference example
• AB = A * B
  – Each element AB_ij = dot(row(A, i), col(B, j))
• Parallelization strategy
  – One thread per element AB_ij
  – 2D kernel

First Implementation

    __global__ void mat_mul(float *a, float *b, float *ab, int width)
    {
        // calculate the row & col index of the element
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;

        float result = 0;

        // do dot product between row of a and col of b
        for(int k = 0; k < width; ++k)
            result += a[row*width+k] * b[k*width+col];

        ab[row*width+col] = result;
    }

Idea: Use __shared__ memory to reuse global data

• Each input element is read by width threads
• Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth

Tiled Multiply

• Partition the kernel loop into phases
• Load a tile of both matrices into __shared__ memory each phase
• Each phase, each thread computes a partial result

[Diagram: the output is computed from TILE_WIDTH × TILE_WIDTH tiles of the input matrices]

Better Implementation

    __global__ void mat_mul(float *a, float *b, float *ab, int width)
    {
        // shorthand
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x,  by = blockIdx.y;

        // allocate tiles in __shared__ memory
        __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
        __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

        // calculate the row & col index
        int row = by*blockDim.y + ty;
        int col = bx*blockDim.x + tx;

        float result = 0;

Better Implementation (continued)

        // loop over the tiles of the input in phases
        for(int p = 0; p < width/TILE_WIDTH; ++p)
        {
            // collaboratively load tiles into __shared__
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();

            // dot product between row of s_a and col of s_b
            for(int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }

        ab[row*width+col] = result;
    }

Use of Barriers in mat_mul

• Two barriers per phase:
  – __syncthreads after all data is loaded into __shared__ memory
  – __syncthreads after all data is read from __shared__ memory
  – Note that the second __syncthreads in phase p guards the load in phase p+1
• Use barriers to guard data
  – Guard against using uninitialized data
  – Guard against bashing live data

First Order Size Considerations

• Each thread block should have many threads
  – TILE_WIDTH = 16 → 16×16 = 256 threads
• There should be many thread blocks
  – 1024×1024 matrices → 64×64 = 4096 thread blocks
  – TILE_WIDTH = 16 → gives each SM 4 blocks, 1024 threads → full occupancy
• Each thread block performs 2 × 256 = 512 four-byte loads for 256 × (2 × 16) = 8,192 FP ops (0.25 B/op)
  – Compare to 4 B/op for the untiled version

TILE_SIZE Effects

Memory Resources as Limit to Parallelism

• Effective use of the different memory resources reduces the number of accesses to global memory
• These resources are finite!
  – The more memory locations each thread requires, the fewer threads an SM can accommodate

    Resource            Per GTX 480 SM    For full occupancy on GTX 480
    Registers           32,768            <= 32768 / 1024 threads = 32 per thread
    __shared__ memory   48 KB             <= 48 KB / 8 blocks = 6 KB per block