12/19/11
Intermediate GPGPU Programming in CUDA
CSC 469/585 Winter 2011-12, Louisiana Tech U
NVIDIA Hardware Architecture
[Figure: overview of the device architecture, connected to host memory]
Recall
• 5 steps for CUDA programming:
– Initialize device
– Allocate device memory
– Copy data to device memory
– Execute kernel
– Copy data back from device memory
Initialize Device Calls
• To select the device associated with the host thread
– cudaSetDevice(device)
– This function must be called before any __global__ function call; otherwise device 0 is automatically selected.
• To get the number of devices
– cudaGetDeviceCount(&devicecount)
• To retrieve a device's properties
– cudaGetDeviceProperties(&deviceProp, device)
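The three calls above can be sketched as a minimal host program (compile with nvcc; error checking omitted for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);          // how many CUDA devices?

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    // Select device 0 explicitly; without this call, device 0 is
    // chosen automatically on the first kernel launch.
    cudaSetDevice(0);
    return 0;
}
```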
Hello World Example
• Allocate host and device memory
Hello World Example
• Host code
Hello World Example
• Kernel code
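The code for this example did not survive extraction; a minimal sketch of the whole program, assuming a device of compute capability 2.0 or later (device-side printf is not available on 1.x):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread prints its own block and thread IDs.
__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the kernel and flush its output
    return 0;
}
```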
To Try CUDA Programming
• SSH to 138.47.102.165
• Set environment variables in .bashrc in your home directory:
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories:
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo
• Hello World – Print out block and thread IDs
• Vector Add – C = A + B
CUDA Language Concepts
• CUDA programming model
• CUDA memory model
Some Terminology
• Device = GPU = set of stream multiprocessors
• Stream Multiprocessor (SM) = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
CUDA Programming Model
• Parallel code (kernel) is launched and executed on a device by many threads
• Threads are grouped into thread blocks
• Parallel code is written for a single thread
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
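A hedged sketch of the host side for this kernel, following the 5 steps from the Recall slide (placed in the same file as the kernel; error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 256;
    size_t bytes = N * sizeof(float);
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                                // allocate device memory
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // copy data to device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);                       // execute kernel:
                                                        // one block of N threads

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // copy data back
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```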
Thread Hierarchy
• Threads launched for a parallel section are partitioned into thread blocks
• A thread block is a group of threads that can:
– Synchronize their execution
– Communicate via low-latency shared memory
• Grid = all thread blocks for a given launch
IDs and Dimensions
• Threads
– 3D IDs
– Unique within a block
– Two threads from two different blocks cannot cooperate
• Blocks
– 2D or 3D IDs (depends on the hardware)
– Unique within a grid
• Dimensions are set at launch time
– Can be unique for each launch
• Built-in variables:
– threadIdx, blockIdx
– blockDim, gridDim
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) of Grid 2 is expanded to show its 5×3 array of threads]
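A common idiom built from these variables is to combine blockIdx, blockDim, and threadIdx into one global index so each thread handles one array element; a sketch with a hypothetical kernel name:

```cuda
// Hypothetical kernel: scale each element of a 1-D array by alpha.
__global__ void scale(float* data, float alpha, int n)
{
    // Global index: which element this thread owns across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may be partly empty
        data[i] = alpha * data[i];
}
```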
CUDA Memory Model
• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read-only per-grid constant memory
– Read-only per-grid texture memory
[Figure: memory spaces on a device grid — each thread has its own registers and local memory, each block has shared memory, and all blocks share the per-grid global, constant, and texture memories, which also connect to host memory]
• The host can R/W global, constant, and texture memories
Device DRAM
• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
• Texture and constant memories
– Constants initialized by the host
– Contents visible to all threads
CUDA Device Memory Allocation
• cudaMalloc(pointer, memsize)
– Allocates an object in device global memory
– pointer = address of a pointer to the allocated object
– memsize = size of the allocated object in bytes
• cudaFree(pointer)
– Frees the object from device global memory
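A short fragment showing both calls (the double indirection is the part that trips people up):

```cuda
float* dA = NULL;
size_t memsize = 1024 * sizeof(float);

// Pass the ADDRESS of the pointer: cudaMalloc writes the device
// address of the new object into dA.
cudaMalloc((void**)&dA, memsize);

// ... use dA in kernels ...

cudaFree(dA);   // release the global-memory object
```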
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Requires four parameters, in this order:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
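A minimal round-trip sketch using the four parameters (fragment; assumes a kernel launch in between):

```cuda
float hostA[256];
float* devA;
size_t bytes = sizeof(hostA);

cudaMalloc((void**)&devA, bytes);

// destination, source, byte count, direction
cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
// ... launch kernels that read/write devA ...
cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost);

cudaFree(devA);
```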
CUDA Function Declarations
• __global__ defines a kernel function
– Must return void

Declaration                    | Executed on the: | Only callable from the:
__device__ float DeviceFunc()  | device           | device
__global__ void KernelFunc()   | device           | host
__host__ float HostFunc()      | host             | host
CUDA Function Call Restrictions
• __device__ functions cannot have their address taken
• For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments
Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
– DimGrid = dimension and size of the grid
– DimBlock = dimension and size of each block
– SharedMemBytes specifies the number of bytes in shared memory (optional)
– Streams specifies the associated stream (optional)
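A sketch of the configuration in use; KernelFunc is the name from the slide, assumed here to take no arguments, and `stream` is assumed to have been created earlier with cudaStreamCreate:

```cuda
dim3 dimGrid(16, 16);     // 256 blocks arranged 16 x 16
dim3 dimBlock(8, 8);      // 64 threads per block, arranged 8 x 8

// The two optional arguments may be omitted; that is equivalent to
// passing 0 bytes of dynamic shared memory and the default stream.
KernelFunc<<<dimGrid, dimBlock>>>();

// With the optional arguments: 1 KB of dynamic shared memory per
// block, launched into a previously created stream.
KernelFunc<<<dimGrid, dimBlock, 1024, stream>>>();
```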
NVIDIA Hardware Architecture
[Figure: full device overview, connected to host memory]
NVIDIA Hardware Architecture
[Figure: detail of one streaming multiprocessor (SM)]
Specifications of a Device
• For more details:
– deviceQuery in the CUDA SDK
– Appendix F in Programming Guide 4.0

Specification     | Compute Capability 1.3 | Compute Capability 2.0
Warp size         | 32                     | 32
Max threads/block | 512                    | 1024
Max blocks/grid   | 65535                  | 65535
Shared memory     | 16 KB/SM               | 48 KB/SM
Demo
• deviceQuery – Show hardware specifications in detail
Memory Optimizations
• Reduce the time of memory transfers between host and device
– Use asynchronous memory transfer (CUDA streams)
– Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory
– Memory coalescing
• Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
• Regular (synchronous) memory transfer
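The baseline pattern, sketched with the vecAdd kernel and buffers from the earlier slides: each cudaMemcpy blocks the host until the transfer completes, and the kernel cannot start until its inputs have arrived.

```cuda
// Synchronous baseline: copy in, compute, copy out -- strictly serial.
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
vecAdd<<<1, N>>>(dA, dB, dC);
cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
```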
Reduce Time of Host-Device Memory Transfer
• CUDA streams
– Allow overlap between kernel execution and memory copies
CUDA Streams Example
CUDA Streams Example
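The example code did not survive extraction; a minimal sketch of the pattern, assuming hA/hB/hC are page-locked (allocated with cudaHostAlloc), vecAdd is the kernel from the earlier slide, and N divides evenly into the chunks:

```cuda
// Split the work into chunks and issue each chunk into its own stream,
// so chunk k's copies can overlap chunk k-1's kernel execution.
const int nStreams = 2;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&streams[i]);

int chunkN = N / nStreams;
size_t chunkBytes = chunkN * sizeof(float);
for (int i = 0; i < nStreams; ++i) {
    int off = i * chunkN;
    cudaMemcpyAsync(dA + off, hA + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(dB + off, hB + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    // One block per chunk (valid while chunkN <= max threads/block,
    // since vecAdd indexes with threadIdx.x only).
    vecAdd<<<1, chunkN, 0, streams[i]>>>(dA + off, dB + off, dC + off);
    cudaMemcpyAsync(hC + off, dC + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();                 // wait for all streams
for (int i = 0; i < nStreams; ++i)
    cudaStreamDestroy(streams[i]);
```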
GPU Timers
• CUDA events
– An API
– Timestamps taken on the GPU clock
– Accurate for timing kernel executions
• CUDA timer calls
– Libraries implemented in the CUDA SDK
CUDA Events Example
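A minimal sketch of event-based timing, wrapped around the vecAdd launch from the earlier slide:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);          // record into the default stream
vecAdd<<<1, N>>>(dA, dB, dC);       // the kernel being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // block until 'stop' has occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```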
Demo
• simpleStreams
Reduce Time of Host-Device Memory Transfer
• Zero copy
– Allows device pointers to access page-locked host memory directly
– Page-locked host memory is allocated by cudaHostAlloc()
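A sketch of the zero-copy setup, again reusing the vecAdd kernel from the earlier slide; the kernel then reads and writes host memory directly, with no explicit cudaMemcpy:

```cuda
float *hA, *hB, *hC;          // page-locked host pointers
float *dA, *dB, *dC;          // device aliases of the same memory

cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede first CUDA call
cudaHostAlloc((void**)&hA, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&hB, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&hC, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&dA, hA, 0);
cudaHostGetDevicePointer((void**)&dB, hB, 0);
cudaHostGetDevicePointer((void**)&dC, hC, 0);

vecAdd<<<1, N>>>(dA, dB, dC);
cudaDeviceSynchronize();      // results are now visible in hC

cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
```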
Demo
• Zero copy
Reduce Number of On-chip and Off-chip Memory Transactions
• Threads in a warp access global memory
• Memory coalescing
– Combine the warp's accesses into as few memory transactions as possible, so many words are copied at the same time
Memory Coalescing
• Threads in a warp access global memory in a straightforward way
Memory Coalescing
• Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing
• Memory addresses are not aligned in the same segment
Shared Memory
• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps in achieving memory coalescing
• Bank conflicts may occur
– Two or more threads in a warp access the same bank
– In compute capability 1.x, there is no broadcast
– In compute capability 2.x, the same word is broadcast to all threads that request it
Bank Conflicts
[Figure: left, threads 0–3 access banks 0–3 respectively — no bank conflict; right, pairs of threads map to the same bank — a 2-way bank conflict]
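The access patterns in the figure can be illustrated with index arithmetic; a sketch assuming 16-bank (compute capability 1.x) hardware and a hypothetical shared array:

```cuda
__shared__ float tile[32][32];

int t = threadIdx.x;            // lane within the warp

float a = tile[0][t];           // no conflict: consecutive threads hit
                                // consecutive banks

float b = tile[0][2 * t];       // 2-way conflict: threads t and t+8 map
                                // to the same bank (2t mod 16)

float c = tile[t][0];           // worst case: a row stride of 32 words
                                // puts every thread in bank 0; padding
                                // the array to [32][33] removes this
```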
Matrix MulMplicaMon Example
Matrix MulMplicaMon Example
• Reduce accesses to global memory
– A is read only (B.width/BLOCK_SIZE) times
– B is read only (A.height/BLOCK_SIZE) times
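The standard shared-memory version can be sketched as follows (square n × n matrices, n a multiple of BLOCK_SIZE; each block computes one tile of C, staging tiles of A and B in shared memory):

```cuda
#define BLOCK_SIZE 16

__global__ void matMul(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int m = 0; m < n / BLOCK_SIZE; ++m) {
        // Cooperatively load one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * n + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                     // wait until the tile is loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with this tile
    }
    C[row * n + col] = sum;
}
```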
Demo
• Matrix Multiplication
– With and without shared memory
– Different block sizes
Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp take different execution paths
• The different execution paths are serialized
• Increases the number of instructions executed by that warp
Branch Divergence
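A sketch contrasting a divergent branch with a warp-uniform one (hypothetical kernels; warp size 32):

```cuda
// Divergent: within one warp, even and odd lanes take different
// branches, so the two paths execute one after the other.
__global__ void divergent(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

// Better: branch on a warp-aligned quantity, so every thread of a
// given warp takes the same path and no serialization occurs.
__global__ void uniform(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```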
Summary
• 5 steps for CUDA programming
• NVIDIA hardware architecture
– Memory hierarchy: global memory, shared memory, register file
– Specifications of a device: block, warp, thread, SM
Summary
• Memory optimization
– Reduce host-device memory transfer overhead with CUDA streams and zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
– Try to avoid bank conflicts in shared memory
• Control flow
– Try to avoid branch divergence in a warp
References
• CUDA C Programming Guide
• CUDA Best Practices Guide
• http://www.developer.nvidia.com/cuda-toolkit