12/19/11
Intermediate GPGPU Programming in CUDA
CSC 469/585 Winter 2011-12, Louisiana Tech U
NVIDIA Hardware Architecture
[Figure: overview of the device architecture, connected to host memory]
Recall
• 5 steps for CUDA programming:
– Initialize device
– Allocate device memory
– Copy data to device memory
– Execute kernel
– Copy data back from device memory
Initialize Device Calls
• To select the device associated with the host thread
– cudaSetDevice(device)
– This function must be called before any __global__ function call; otherwise device 0 is automatically selected.
• To get the number of devices
– cudaGetDeviceCount(&devicecount)
• To retrieve a device's properties
– cudaGetDeviceProperties(&deviceProp, device)
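The three calls above can be sketched as a minimal host program (compile with nvcc; error checking omitted for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);          // how many CUDA devices?

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    // Select device 0 explicitly; without this call, device 0 is
    // chosen automatically on the first kernel launch.
    cudaSetDevice(0);
    return 0;
}
```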
Hello World Example
• Allocate host and device memory
Hello World Example
• Host code
Hello World Example
• Kernel code
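The code for this example did not survive extraction; a minimal sketch of the whole program, assuming a device of compute capability 2.0 or later (device-side printf is not available on 1.x):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread prints its own block and thread IDs.
__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the kernel and flush its output
    return 0;
}
```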
To Try CUDA Programming
• SSH to 138.47.102.165
• Set environment variables in .bashrc in your home directory:
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories:
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo
• Hello World – Print out block and thread IDs
• Vector Add – C = A + B
CUDA Language Concepts
• CUDA programming model
• CUDA memory model
Some Terminology
• Device = GPU = set of stream multiprocessors
• Stream Multiprocessor (SM) = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
CUDA Programming Model
• Parallel code (kernel) is launched and executed on a device by many threads
• Threads are grouped into thread blocks
• Parallel code is written for a single thread
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
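A hedged sketch of the host side for this kernel, following the 5 steps from the Recall slide (placed in the same file as the kernel; error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 256;
    size_t bytes = N * sizeof(float);
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                                // allocate device memory
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // copy data to device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);                       // execute kernel:
                                                        // one block of N threads

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // copy data back
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```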
Thread Hierarchy
• Threads launched for a parallel section are partitioned into thread blocks
• A thread block is a group of threads that can:
– Synchronize their execution
– Communicate via low-latency shared memory
• Grid = all thread blocks for a given launch
IDs and Dimensions
• Threads
– 3D IDs
– Unique within a block
– Two threads from two different blocks cannot cooperate
• Blocks
– 2D or 3D IDs (depends on the hardware)
– Unique within a grid
• Dimensions are set at launch time
– Can be unique for each launch
• Built-in variables:
– threadIdx, blockIdx
– blockDim, gridDim
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) of Grid 2 is expanded to show its 5×3 array of threads]
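A common idiom built from these variables is to combine blockIdx, blockDim, and threadIdx into one global index so each thread handles one array element; a sketch with a hypothetical kernel name:

```cuda
// Hypothetical kernel: scale each element of a 1-D array by alpha.
__global__ void scale(float* data, float alpha, int n)
{
    // Global index: which element this thread owns across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may be partly empty
        data[i] = alpha * data[i];
}
```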
CUDA Memory Model
• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read-only per-grid constant memory
– Read-only per-grid texture memory
[Figure: memory spaces on a device grid — each thread has its own registers and local memory, each block has shared memory, and all blocks share the per-grid global, constant, and texture memories, which also connect to host memory]
• The host can R/W global, constant, and texture memories
Device DRAM
• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
• Texture and constant memories
– Constants initialized by the host
– Contents visible to all threads
CUDA Device Memory Allocation
• cudaMalloc(pointer, memsize)
– Allocates an object in device global memory
– pointer = address of a pointer to the allocated object
– memsize = size of the allocated object in bytes
• cudaFree(pointer)
– Frees the object from device global memory
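A short fragment showing both calls (the double indirection is the part that trips people up):

```cuda
float* dA = NULL;
size_t memsize = 1024 * sizeof(float);

// Pass the ADDRESS of the pointer: cudaMalloc writes the device
// address of the new object into dA.
cudaMalloc((void**)&dA, memsize);

// ... use dA in kernels ...

cudaFree(dA);   // release the global-memory object
```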
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Requires four parameters, in this order:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
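A minimal round-trip sketch using the four parameters (fragment; assumes a kernel launch in between):

```cuda
float hostA[256];
float* devA;
size_t bytes = sizeof(hostA);

cudaMalloc((void**)&devA, bytes);

// destination, source, byte count, direction
cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
// ... launch kernels that read/write devA ...
cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost);

cudaFree(devA);
```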
CUDA Function Declarations
• __global__ defines a kernel function
– Must return void

Declaration                    | Executed on the: | Only callable from the:
__device__ float DeviceFunc()  | device           | device
__global__ void KernelFunc()   | device           | host
__host__ float HostFunc()      | host             | host
CUDA Function Call Restrictions
• __device__ functions cannot have their address taken
• For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments
Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
– DimGrid = dimension and size of the grid
– DimBlock = dimension and size of each block
– SharedMemBytes specifies the number of bytes in shared memory (optional)
– Streams specifies the associated stream (optional)
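A sketch of the configuration in use; KernelFunc is the name from the slide, assumed here to take no arguments, and `stream` is assumed to have been created earlier with cudaStreamCreate:

```cuda
dim3 dimGrid(16, 16);     // 256 blocks arranged 16 x 16
dim3 dimBlock(8, 8);      // 64 threads per block, arranged 8 x 8

// The two optional arguments may be omitted; that is equivalent to
// passing 0 bytes of dynamic shared memory and the default stream.
KernelFunc<<<dimGrid, dimBlock>>>();

// With the optional arguments: 1 KB of dynamic shared memory per
// block, launched into a previously created stream.
KernelFunc<<<dimGrid, dimBlock, 1024, stream>>>();
```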
NVIDIA Hardware Architecture
[Figure: full device overview, connected to host memory]
NVIDIA Hardware Architecture
[Figure: detail of one streaming multiprocessor (SM)]
Specifications of a Device
• For more details:
– deviceQuery in the CUDA SDK
– Appendix F in Programming Guide 4.0

Specification     | Compute Capability 1.3 | Compute Capability 2.0
Warp size         | 32                     | 32
Max threads/block | 512                    | 1024
Max blocks/grid   | 65535                  | 65535
Shared memory     | 16 KB/SM               | 48 KB/SM
Demo
• deviceQuery – Show hardware specifications in detail
Memory Optimizations
• Reduce the time of memory transfers between host and device
– Use asynchronous memory transfer (CUDA streams)
– Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory
– Memory coalescing
• Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
• Regular (synchronous) memory transfer
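The baseline pattern, sketched with the vecAdd kernel and buffers from the earlier slides: each cudaMemcpy blocks the host until the transfer completes, and the kernel cannot start until its inputs have arrived.

```cuda
// Synchronous baseline: copy in, compute, copy out -- strictly serial.
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
vecAdd<<<1, N>>>(dA, dB, dC);
cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
```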
Reduce Time of Host-Device Memory Transfer
• CUDA streams
– Allow overlap between kernel execution and memory copies
CUDA Streams Example
CUDA Streams Example
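The example code did not survive extraction; a minimal sketch of the pattern, assuming hA/hB/hC are page-locked (allocated with cudaHostAlloc), vecAdd is the kernel from the earlier slide, and N divides evenly into the chunks:

```cuda
// Split the work into chunks and issue each chunk into its own stream,
// so chunk k's copies can overlap chunk k-1's kernel execution.
const int nStreams = 2;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&streams[i]);

int chunkN = N / nStreams;
size_t chunkBytes = chunkN * sizeof(float);
for (int i = 0; i < nStreams; ++i) {
    int off = i * chunkN;
    cudaMemcpyAsync(dA + off, hA + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(dB + off, hB + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    // One block per chunk (valid while chunkN <= max threads/block,
    // since vecAdd indexes with threadIdx.x only).
    vecAdd<<<1, chunkN, 0, streams[i]>>>(dA + off, dB + off, dC + off);
    cudaMemcpyAsync(hC + off, dC + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();                 // wait for all streams
for (int i = 0; i < nStreams; ++i)
    cudaStreamDestroy(streams[i]);
```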
GPU Timers
• CUDA events
– An API
– Timestamps taken on the GPU clock
– Accurate for timing kernel executions
• CUDA timer calls
– Libraries implemented in the CUDA SDK
CUDA Events Example
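A minimal sketch of event-based timing, wrapped around the vecAdd launch from the earlier slide:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);          // record into the default stream
vecAdd<<<1, N>>>(dA, dB, dC);       // the kernel being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // block until 'stop' has occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```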
Demo
• simpleStreams
Reduce Time of Host-Device Memory Transfer
• Zero copy
– Allows device pointers to access page-locked host memory directly
– Page-locked host memory is allocated by cudaHostAlloc()
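A sketch of the zero-copy setup, again reusing the vecAdd kernel from the earlier slide; the kernel then reads and writes host memory directly, with no explicit cudaMemcpy:

```cuda
float *hA, *hB, *hC;          // page-locked host pointers
float *dA, *dB, *dC;          // device aliases of the same memory

cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede first CUDA call
cudaHostAlloc((void**)&hA, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&hB, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&hC, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&dA, hA, 0);
cudaHostGetDevicePointer((void**)&dB, hB, 0);
cudaHostGetDevicePointer((void**)&dC, hC, 0);

vecAdd<<<1, N>>>(dA, dB, dC);
cudaDeviceSynchronize();      // results are now visible in hC

cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
```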
Demo
• Zero copy
Reduce Number of On-chip and Off-chip Memory Transactions
• Threads in a warp access global memory
• Memory coalescing
– Combine the warp's accesses into as few memory transactions as possible, so many words are copied at the same time
Memory Coalescing
• Threads in a warp access global memory in a straightforward way
Memory Coalescing
• Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing
• Memory addresses are not aligned in the same segment
Shared Memory
• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps in achieving memory coalescing
• Bank conflicts may occur
– Two or more threads in a warp access the same bank
– In compute capability 1.x, there is no broadcast
– In compute capability 2.x, the same word is broadcast to all threads that request it
Bank Conflicts
[Figure: left, threads 0–3 access banks 0–3 respectively — no bank conflict; right, pairs of threads map to the same bank — a 2-way bank conflict]
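The access patterns in the figure can be illustrated with index arithmetic; a sketch assuming 16-bank (compute capability 1.x) hardware and a hypothetical shared array:

```cuda
__shared__ float tile[32][32];

int t = threadIdx.x;            // lane within the warp

float a = tile[0][t];           // no conflict: consecutive threads hit
                                // consecutive banks

float b = tile[0][2 * t];       // 2-way conflict: threads t and t+8 map
                                // to the same bank (2t mod 16)

float c = tile[t][0];           // worst case: a row stride of 32 words
                                // puts every thread in bank 0; padding
                                // the array to [32][33] removes this
```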
Matrix MulMplicaMon Example
Matrix MulMplicaMon Example
• Reduce accesses to global memory
– A is read only (B.width/BLOCK_SIZE) times
– B is read only (A.height/BLOCK_SIZE) times
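The standard shared-memory version can be sketched as follows (square n × n matrices, n a multiple of BLOCK_SIZE; each block computes one tile of C, staging tiles of A and B in shared memory):

```cuda
#define BLOCK_SIZE 16

__global__ void matMul(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int m = 0; m < n / BLOCK_SIZE; ++m) {
        // Cooperatively load one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * n + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                     // wait until the tile is loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with this tile
    }
    C[row * n + col] = sum;
}
```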
Demo
• Matrix Multiplication
– With and without shared memory
– Different block sizes
Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp take different execution paths
• The different execution paths are serialized
• Increases the number of instructions executed by that warp
Branch Divergence
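A sketch contrasting a divergent branch with a warp-uniform one (hypothetical kernels; warp size 32):

```cuda
// Divergent: within one warp, even and odd lanes take different
// branches, so the two paths execute one after the other.
__global__ void divergent(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

// Better: branch on a warp-aligned quantity, so every thread of a
// given warp takes the same path and no serialization occurs.
__global__ void uniform(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```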
Summary
• 5 steps for CUDA programming
• NVIDIA hardware architecture
– Memory hierarchy: global memory, shared memory, register file
– Specifications of a device: block, warp, thread, SM
Summary
• Memory optimization
– Reduce host-device memory transfer overhead with CUDA streams and zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
– Try to avoid bank conflicts in shared memory
• Control flow
– Try to avoid branch divergence in a warp
References
• CUDA C Programming Guide
• CUDA Best Practices Guide
• http://www.developer.nvidia.com/cuda-toolkit