
Intro to GPGPU with CUDA (DevLink)

Description:
Slides from my talk on CUDA programming at DevLink 2011
Page 1: Intro to GPGPU with CUDA (DevLink)

Intro to GPGPU Programming With CUDA

Rob Gillen
rob.gillenfamily.net
@argodev

Page 2: Intro to GPGPU with CUDA (DevLink)

Intro to GPGPU Programming with CUDA

Rob Gillen

Page 3: Intro to GPGPU with CUDA (DevLink)

Welcome!

Goals: Overview of GPGPU with CUDA

“Vision Casting” for how you can use GPUs to improve your application

Introduction to CUDA C

Outline: Why GPGPUs?

Applications

Tooling

Hands-On: Matrix Multiplication

Page 4: Intro to GPGPU with CUDA (DevLink)

Context Setting

Level of the Talk: Introductory/Overview

Perspective of the Speaker: 12+ years as a professional developer

4+ years at Oak Ridge National Laboratory

Disclaimer: Many (most) of these slides are courtesy of NVIDIA Corporation, although they bear no responsibility for inaccuracies I introduce during this presentation.

Page 5: Intro to GPGPU with CUDA (DevLink)

WHY USE GPUS? (Motivation)

Page 6: Intro to GPGPU with CUDA (DevLink)

CPU vs. GPU

GPU devotes more transistors to data processing

Specialized (purpose-designed) silicon

Page 7: Intro to GPGPU with CUDA (DevLink)

NVIDIA Fermi

~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)

230 GB/s DRAM Bandwidth

Page 8: Intro to GPGPU with CUDA (DevLink)

Motivation

Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU

Page 9: Intro to GPGPU with CUDA (DevLink)

Example: Sparse Matrix-Vector

CPU results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al., Supercomputing 2007

Page 10: Intro to GPGPU with CUDA (DevLink)

Rayleigh-Bénard Results

Double precision

384 x 384 x 192 grid (max that fits in 4GB)

Vertical slice of temperature at y=0

Transition from stratified (left) to turbulent (right)

Regime depends on the Rayleigh number: Ra = gαΔT L³ / (κν)

8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon

Page 11: Intro to GPGPU with CUDA (DevLink)

G80 Characteristics

367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)

265 GFLOPS sustained for apps such as VMD

Massively parallel: 128 cores, 90 W

Massively threaded: sustains thousands of threads per app

30-100x speedups over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

Page 12: Intro to GPGPU with CUDA (DevLink)

Supercomputer Comparison

Page 13: Intro to GPGPU with CUDA (DevLink)

Applications

Exciting applications in the future mass-computing market have traditionally been considered “supercomputing applications”:

Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products. These “super-apps” represent and model the physical, concurrent world.

Various granularities of parallelism exist, but: the programming model must not hinder parallel implementation, and data delivery needs careful management.

Page 14: Intro to GPGPU with CUDA (DevLink)

*Not* for all applications

SPMD (Single Program, Multiple Data) workloads are the best fit (data parallel)

Operations need to be of sufficient size to overcome overhead

Think millions of operations.

Page 15: Intro to GPGPU with CUDA (DevLink)

Raytracing

Page 16: Intro to GPGPU with CUDA (DevLink)

NVIRT: CUDA Ray Tracing API

Page 17: Intro to GPGPU with CUDA (DevLink)

Tooling

VS 2010 C++ (Express is OK… sort-of.)

NVIDIA CUDA-Capable GPU

NVIDIA CUDA Toolkit (v4+)

NVIDIA CUDA Tools (v4+)

GPU Computing SDK

NVIDIA Parallel Nsight

Page 18: Intro to GPGPU with CUDA (DevLink)

Parallel Debugging

Page 19: Intro to GPGPU with CUDA (DevLink)

Parallel Analysis

Page 20: Intro to GPGPU with CUDA (DevLink)

VS Project Templates

Page 21: Intro to GPGPU with CUDA (DevLink)

VS Project Templates

Page 22: Intro to GPGPU with CUDA (DevLink)

Outline of CUDA Basics

Basic Memory Management

Basic Kernels and Execution on GPU

Development Resources

See the Programming Guide for the full API

See the Getting Started Guide for installation and compilation instructions

Both guides are included in the toolkit

Page 23: Intro to GPGPU with CUDA (DevLink)

Memory Spaces

CPU and GPU have separate memory spaces; data is moved across the PCIe bus

Use functions to allocate/set/copy memory on the GPU; very similar to the corresponding C functions

Pointers are just addresses; you can't tell from the pointer value whether the address is on the CPU or the GPU

Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash

Converse is also true

Page 24: Intro to GPGPU with CUDA (DevLink)

GPU Memory Allocation / Release

Host (CPU) manages device (GPU) memory:
cudaMalloc(void** pointer, size_t nbytes)

cudaMemset(void* pointer, int value, size_t count)

cudaFree(void* pointer)

int n = 1024;

int nbytes = 1024*sizeof(int);

int * d_a = 0;

cudaMalloc( (void**)&d_a, nbytes);

cudaMemset(d_a, 0, nbytes);

cudaFree(d_a);

Page 25: Intro to GPGPU with CUDA (DevLink)

Data Copies

cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);

Returns after copy is complete

Blocks CPU thread until all bytes have been copied

Doesn’t start copying until previous CUDA calls complete

enum cudaMemcpyKind:
cudaMemcpyHostToDevice

cudaMemcpyDeviceToHost

cudaMemcpyDeviceToDevice

Non-blocking memcopies are provided
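
To make the direction argument concrete, here is a minimal round-trip sketch; it assumes the d_a buffer and nbytes from the previous slide plus a hypothetical host array h_a:

int h_a[1024];                                        // host-side buffer (hypothetical)
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice); // host -> device
// ... launch kernels that operate on d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost); // device -> host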

Page 26: Intro to GPGPU with CUDA (DevLink)

DEMO: Code Walkthrough 1

Page 27: Intro to GPGPU with CUDA (DevLink)

CUDA Programming Model

Parallel code (kernel) is launched and executed on a device by many threads

Threads are grouped into thread blocks

Parallel code is written for a thread; each thread is free to execute a unique code path

Built-in thread and block ID variables

Page 28: Intro to GPGPU with CUDA (DevLink)

Thread Hierarchy

Threads launched for a parallel section are partitioned into thread blocks

Grid == all blocks for a given launch

A thread block is a group of threads that can:
Synchronize their execution

Communicate via shared memory

[Diagram: threads are grouped into thread blocks; all blocks of a launch form the grid]

Page 29: Intro to GPGPU with CUDA (DevLink)

Block IDs and Threads

Threads: 3D IDs, unique within a block

Blocks: 2D IDs, unique within a grid

Dimensions set at launch time

Can be unique for each grid

Built-in variables: threadIdx, blockIdx, blockDim, gridDim
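
As an illustrative sketch (mine, not from the deck), the built-in variables are typically combined into a global index; the kernel name and bounds check are my additions:

__global__ void fill(float* data, int n)
{
    // global index = this block's offset plus the thread's offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)        // guard: the last block may be partially full
        data[idx] = 0.0f;
}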

[Figure 3.2: An Example of CUDA Thread Organization. The host launches Kernel 1 onto Grid 1 and Kernel 2 onto Grid 2 on the device; each grid is a 2D array of blocks (Block(0,0) … Block(1,1)), and each block, e.g. Block(1,1), is a 3D array of threads Thread(x,y,z). Courtesy: NVIDIA]

Page 30: Intro to GPGPU with CUDA (DevLink)

Code executed on GPU

A C function with some restrictions:
Must return void

Can only dereference GPU pointers

No static variables

Some additional restrictions for older GPUs

Must be declared with a qualifier:
__global__: launched by the CPU, cannot be called from the GPU, must return void

__device__ : called from other GPU functions, cannot be launched by the CPU

__host__ : can be executed only by the CPU

__host__ and __device__ qualifiers can be combined
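
A small sketch of how the qualifiers combine (my example, not from the deck):

__device__ float square(float x) { return x * x; }            // callable only from GPU code

__host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both CPU and GPU

__global__ void kernel(float* a)                              // launched by the CPU
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = square(twice(a[idx]));                           // __device__ calls from the GPU
}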

Page 31: Intro to GPGPU with CUDA (DevLink)

DEMO: Code Walkthrough 2

Page 32: Intro to GPGPU with CUDA (DevLink)

Launching kernels on GPU

Launch parameters:
Grid dimensions (up to 2D), dim3 type

Thread-block dimensions (up to 3D), dim3 type

Shared memory: number of bytes per block, for extern smem variables declared without size (see the sketch below)

Optional, 0 by default

Stream ID: optional, 0 by default

dim3 grid(16, 16);

dim3 block(16, 16);

kernel<<<grid, block, 0, 0>>>(…);

kernel<<<32, 512>>>(…);
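
For the shared-memory parameter, a minimal sketch of an extern smem variable declared without size; the kernel name and the d_data buffer are hypothetical:

__global__ void sumKernel(float* data)
{
    extern __shared__ float smem[];  // size is supplied by the third launch parameter
    smem[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... a block-level reduction over smem would go here ...
}

// e.g. 64 blocks of 256 threads, with 256 floats of shared memory per block
sumKernel<<<64, 256, 256 * sizeof(float)>>>(d_data);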

Page 33: Intro to GPGPU with CUDA (DevLink)

Kernel Variations and Output

__global__ void kernel(int* a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7777777777777777

__global__ void kernel(int* a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0000111122223333

__global__ void kernel(int* a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0123012301230123
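
These outputs correspond to a launch of 4 blocks with 4 threads each into a 16-element array. A host-side sketch of that configuration (my reconstruction; the slide does not show the host code):

int n = 16;
int* d_a = 0;
cudaMalloc((void**)&d_a, n * sizeof(int));
kernel<<<4, 4>>>(d_a);                   // 4 blocks x 4 threads = 16 elements
int h_a[16];
cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_a);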

Page 34: Intro to GPGPU with CUDA (DevLink)

Code Walkthrough 3

Build on Walkthrough 2

Write a kernel to increment n×m integers (sketched below)

Copy the result back to CPU

Print the values
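
A minimal sketch of what such a kernel might look like (the demo's actual code may differ):

__global__ void incrementKernel(int* a, int n, int m)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * m)          // one thread per integer; guard the tail
        a[idx] = a[idx] + 1;
}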

Page 35: Intro to GPGPU with CUDA (DevLink)

DEMO: Code Walkthrough 3

Page 36: Intro to GPGPU with CUDA (DevLink)

Blocks must be independent

Any possible interleaving of blocks should be valid:
Presumed to run to completion without pre-emption

Can run in any order

Can run concurrently OR sequentially

Blocks may coordinate but not synchronize:
Shared queue pointer: OK (see the sketch below)

Shared lock: BAD … can easily deadlock

Independence requirement gives scalability
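
To illustrate why a shared queue pointer is OK, a sketch of blocks coordinating through an atomic counter (my example; all names are hypothetical):

__device__ int queueHead = 0;             // work-queue pointer in global memory, shared by all blocks

__global__ void worker(const int* work, int nItems, int* results)
{
    for (;;) {
        int i = atomicAdd(&queueHead, 1); // claim the next item; safe across blocks
        if (i >= nItems) break;           // queue drained
        results[i] = work[i] * 2;         // stand-in for real work
    }
}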

Page 37: Intro to GPGPU with CUDA (DevLink)

Transparent Scalability

Hardware is free to assign blocks to any processor at any time

A kernel scales across any number of parallel processors

[Figure: the same 8-block kernel grid (Block 0 … Block 7) runs on a smaller device two blocks at a time and on a larger device four blocks at a time; each block can execute in any order relative to other blocks.]

Page 38: Intro to GPGPU with CUDA (DevLink)

EXTENDED EXAMPLE: Matrix Multiplication

Page 39: Intro to GPGPU with CUDA (DevLink)

A Simple Running Example: Matrix Multiplication

A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs

Leave shared memory usage until later

Local, register usage

Thread ID usage

Memory data transfer API between host and device

Assume square matrices for simplicity

Page 40: Intro to GPGPU with CUDA (DevLink)

Programming Model: Square Matrix Multiplication Example

P = M * N of size WIDTH x WIDTH

Without tiling: one thread calculates one element of P

M and N are loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH × WIDTH]

Page 41: Intro to GPGPU with CUDA (DevLink)

Memory Layout of Matrix in C

A matrix is stored in row-major order: element Mi,j (row i, column j) lives at linear offset i * WIDTH + j. For a 4×4 matrix M:

M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3

Linearized: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

Page 42: Intro to GPGPU with CUDA (DevLink)

Simple Matrix Multiplication (CPU)

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}

[Figure: element P(i, j) is the dot product of row i of M and column j of N, each of length WIDTH]

Page 43: Intro to GPGPU with CUDA (DevLink)

Simple Matrix Multiplication (GPU)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)

{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

Page 44: Intro to GPGPU with CUDA (DevLink)

Simple Matrix Multiplication (GPU)

    // 2. Kernel invocation code – to be shown later
    …

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

Page 45: Intro to GPGPU with CUDA (DevLink)

Kernel Function

// Matrix multiplication kernel – per thread code

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)

{
    // Pvalue stores the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

Page 46: Intro to GPGPU with CUDA (DevLink)

Kernel Function (contd.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

[Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to produce element (tx, ty) of the WIDTH × WIDTH result Pd]

Page 47: Intro to GPGPU with CUDA (DevLink)

Kernel Function (full)

// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue stores the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

Page 48: Intro to GPGPU with CUDA (DevLink)

Kernel Invocation (Host Side)

// Set up the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Page 49: Intro to GPGPU with CUDA (DevLink)

Only One Thread Block Used

One block of threads computes matrix Pd

Each thread computes one element of Pd

Each thread:
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and one add for each pair of Md and Nd elements
Compute-to-off-chip-memory-access ratio is close to 1:1 (not very high)

Size of matrix limited by the number of threads allowed in a thread block

[Figure: Grid 1 contains a single Block 1 of WIDTH × WIDTH threads; the highlighted Thread(2, 2) computes one element of Pd from a row of Md and a column of Nd.]

Page 50: Intro to GPGPU with CUDA (DevLink)

Handling Arbitrary Sized Square Matrices

Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix

Each block has (TILE_WIDTH)² threads

Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

[Figure: block (bx, by), thread (tx, ty) computes one element of a TILE_WIDTH × TILE_WIDTH tile of the WIDTH × WIDTH result Pd]

You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)!
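
A sketch of how the kernel generalizes to multiple blocks; this version is my extrapolation of the single-block kernel above, not code from the deck:

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // offset each thread by its block's tile position
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// host side: one block per tile
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);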

Page 51: Intro to GPGPU with CUDA (DevLink)

Small Example

TILE_WIDTH = 2

[Figure: a 4×4 result matrix Pd divided into four 2×2 tiles. Block(0,0) computes Pd0,0, Pd1,0, Pd0,1, Pd1,1; Block(1,0) computes Pd2,0, Pd3,0, Pd2,1, Pd3,1; Block(0,1) and Block(1,1) compute the remaining tiles; each block reads the matching rows of Md and columns of Nd.]

Page 52: Intro to GPGPU with CUDA (DevLink)

Cleanup Topics

Memory Management:
Pinned Memory (Zero-Transfer)

Portable Pinned Memory

Multi-GPU

Wrappers (Python, Java, .NET)

Kernels:

Atomics

Thread Synchronization (staged reductions)

NVCC
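
As one example from this list, pinned (page-locked) host memory is allocated with cudaMallocHost; a minimal sketch, with hypothetical buffer names:

size_t nbytes = 1 << 20;
float *h_buf, *d_buf;
cudaMallocHost((void**)&h_buf, nbytes);  // pinned host allocation: faster PCIe transfers
cudaMalloc((void**)&d_buf, nbytes);
cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);
cudaFreeHost(h_buf);                     // release with cudaFreeHost, not free()
cudaFree(d_buf);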

Page 53: Intro to GPGPU with CUDA (DevLink)

Questions?

[email protected]
@argodev

http://rob.gillenfamily.net

