
Intermediate GPGPU Programming in CUDA

Page 1: Intermediate GPGPU Programming in CUDA

Intermediate GPGPU Programming in CUDA

Supada Laosooksathit

Page 2: Intermediate GPGPU Programming in CUDA

NVIDIA Hardware Architecture

[Diagram of the NVIDIA hardware architecture; visible label: host memory]

Page 3: Intermediate GPGPU Programming in CUDA

Recall

• 5 steps for CUDA Programming
– Initialize device
– Allocate device memory
– Copy data to device memory
– Execute kernel
– Copy data back from device memory

Page 4: Intermediate GPGPU Programming in CUDA

Device Initialization Calls

• To select the device associated with the host thread
– cudaSetDevice(device)
– This function must be called before any __global__ function; otherwise device 0 is automatically selected

• To get the number of devices
– cudaGetDeviceCount(&devicecount)

• To retrieve a device's properties
– cudaGetDeviceProperties(&deviceProp, device)
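
The slide's code listing did not survive this transcript; the sketch below shows the three calls together (treating device 0 and the variable names as arbitrary choices):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int devicecount = 0;
    cudaGetDeviceCount(&devicecount);          // number of CUDA-capable devices
    printf("Found %d device(s)\n", devicecount);

    cudaSetDevice(0);                          // bind device 0 to this host thread

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);   // query properties of device 0
    printf("Device 0: %s, compute capability %d.%d\n",
           deviceProp.name, deviceProp.major, deviceProp.minor);
    return 0;
}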

Page 5: Intermediate GPGPU Programming in CUDA

Hello World Example

• Allocate host and device memory

Page 6: Intermediate GPGPU Programming in CUDA

Hello World Example

• Host code

Page 7: Intermediate GPGPU Programming in CUDA

Hello World Example

• Kernel code
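
The code screenshots for these three slides (allocation, host code, kernel code) are missing from the transcript. Below is a minimal sketch in their spirit: since kernel printf requires compute capability 2.0, each thread is assumed to record its block and thread IDs into a device array that the host copies back and prints.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NUM_BLOCKS  2
#define NUM_THREADS 4

// Kernel code: each thread stores its block and thread ID
__global__ void hello(int *blockIds, int *threadIds) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    blockIds[idx] = blockIdx.x;
    threadIds[idx] = threadIdx.x;
}

int main(void) {
    int n = NUM_BLOCKS * NUM_THREADS;
    size_t size = n * sizeof(int);

    // Allocate host and device memory
    int *h_block = (int *)malloc(size);
    int *h_thread = (int *)malloc(size);
    int *d_block, *d_thread;
    cudaMalloc((void **)&d_block, size);
    cudaMalloc((void **)&d_thread, size);

    // Execute kernel, then copy data back from device memory
    hello<<<NUM_BLOCKS, NUM_THREADS>>>(d_block, d_thread);
    cudaMemcpy(h_block, d_block, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_thread, d_thread, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++)
        printf("Hello from block %d, thread %d\n", h_block[i], h_thread[i]);

    free(h_block); free(h_thread);
    cudaFree(d_block); cudaFree(d_thread);
    return 0;
}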

Page 8: Intermediate GPGPU Programming in CUDA

To Try CUDA Programming

• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory:

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK

• Compile the following directories:
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/

• The sample code is in NVIDIA_GPU_Computing_SDK/C/src/

Page 9: Intermediate GPGPU Programming in CUDA

Demo

• Hello World
– Print out block and thread IDs

• Vector Add
– C = A + B

Page 10: Intermediate GPGPU Programming in CUDA

NVIDIA Hardware Architecture

[Diagram of the NVIDIA hardware architecture; visible label: SM]

Page 11: Intermediate GPGPU Programming in CUDA

Specifications of a Device

• For more details
– deviceQuery in the CUDA SDK
– Appendix F in the Programming Guide 4.0

Specification          Compute Capability 1.3    Compute Capability 2.0
Warp size              32                        32
Max threads/block      512                       1024
Max blocks/grid        65535                     65535
Shared memory          16 KB/SM                  48 KB/SM
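
These limits can also be read at run time from the cudaDeviceProp structure; a small sketch for device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);                         // device 0
    printf("Warp size:          %d\n", p.warpSize);
    printf("Max threads/block:  %d\n", p.maxThreadsPerBlock);
    printf("Max grid size (x):  %d\n", p.maxGridSize[0]);   // blocks per grid, x-dimension
    printf("Shared mem/block:   %zu bytes\n", p.sharedMemPerBlock);
    return 0;
}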

Page 12: Intermediate GPGPU Programming in CUDA

Demo

• deviceQuery
– Shows hardware specifications in detail

Page 13: Intermediate GPGPU Programming in CUDA

Memory Optimizations

• Reduce the time of memory transfer between host and device
– Use asynchronous memory transfer (CUDA streams)
– Use zero copy

• Reduce the number of transactions between on-chip and off-chip memory
– Memory coalescing

• Avoid bank conflicts in shared memory

Page 14: Intermediate GPGPU Programming in CUDA

Reduce Time of Host-Device Memory Transfer

• Regular memory transfer (synchronous)
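
The slide's listing is a screenshot; a minimal sketch of the synchronous pattern (the process kernel and buffer size are placeholders):

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                      // placeholder work
}

int main(void) {
    const int n = 1 << 20;
    size_t size = n * sizeof(float);

    float *h = (float *)malloc(size);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, size);

    // Each cudaMemcpy blocks the host until the transfer completes
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);
    process<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);  // also waits for the kernel

    cudaFree(d);
    free(h);
    return 0;
}

Here nothing overlaps: the host waits for each copy, and the device-to-host copy waits for the kernel.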

Page 15: Intermediate GPGPU Programming in CUDA

Reduce Time of Host-Device Memory Transfer

• CUDA streams
– Allow overlap between kernel execution and memory copies

Page 16: Intermediate GPGPU Programming in CUDA

CUDA Streams Example

Page 17: Intermediate GPGPU Programming in CUDA

CUDA Streams Example
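
The two example slides are screenshots; the usual pattern they illustrate splits the data into one chunk per stream so that a copy in one stream can overlap a kernel in another. A minimal sketch (names and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                      // placeholder work
}

int main(void) {
    const int n = 1 << 20, nStreams = 2, chunk = n / nStreams;
    size_t chunkBytes = chunk * sizeof(float);

    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));  // pinned host memory, required for async copies
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaStream_t stream[2];
    for (int s = 0; s < nStreams; s++)
        cudaStreamCreate(&stream[s]);

    // The copy in one stream can overlap the kernel in the other
    for (int s = 0; s < nStreams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                         // wait for all streams

    for (int s = 0; s < nStreams; s++)
        cudaStreamDestroy(stream[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}

With this split, the host-to-device copy for chunk 1 can proceed while the kernel for chunk 0 runs, which is the overlap the previous slide describes.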

Page 18: Intermediate GPGPU Programming in CUDA

GPU Timers

• CUDA Events
– An API provided by the CUDA runtime
– Timestamps are taken on the GPU clock
– Accurate for timing kernel executions

• CUDA timer calls
– Timer libraries implemented in the CUDA SDK

Page 19: Intermediate GPGPU Programming in CUDA

CUDA Events Example
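
The example listing is a screenshot; the standard event-timing pattern looks roughly like this (the empty kernel is a placeholder):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void kernel(void) { }               // placeholder kernel to time

int main(void) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 // record in the default stream
    kernel<<<1, 32>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                // wait until stop has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}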

Page 20: Intermediate GPGPU Programming in CUDA

Demo

• simpleStreams

Page 21: Intermediate GPGPU Programming in CUDA

Reduce Time of Host-Device Memory Transfer

• Zero copy
– Allows device pointers to access page-locked host memory directly
– Page-locked host memory is allocated by cudaHostAlloc()
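
A minimal zero-copy sketch (the increment kernel and sizes are illustrative; note that the mapped-memory flag must be set before any other CUDA call creates the context):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                      // reads/writes host memory over the bus
}

int main(void) {
    const int n = 256;
    cudaSetDeviceFlags(cudaDeviceMapHost);        // enable mapped page-locked memory

    int *h, *d;
    // Page-locked, mapped host allocation
    cudaHostAlloc((void **)&h, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; i++) h[i] = i;

    cudaHostGetDevicePointer((void **)&d, h, 0);  // device alias for the host buffer

    increment<<<(n + 127) / 128, 128>>>(d, n);    // no cudaMemcpy needed
    cudaDeviceSynchronize();

    printf("h[0] = %d\n", h[0]);                  // prints 1
    cudaFreeHost(h);
    return 0;
}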

Page 22: Intermediate GPGPU Programming in CUDA

Demo

• Zero copy

Page 23: Intermediate GPGPU Programming in CUDA

Reduce Number of On-Chip and Off-Chip Memory Transactions

• Threads in a warp access global memory
• Memory coalescing
– The hardware combines those accesses, copying many words in a single transaction

Page 24: Intermediate GPGPU Programming in CUDA

Memory Coalescing

• Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)

Page 25: Intermediate GPGPU Programming in CUDA

Memory Coalescing

• Memory addresses are aligned in the same segment but the accesses are not sequential

Page 26: Intermediate GPGPU Programming in CUDA

Memory Coalescing

• Memory addresses are not aligned in the same segment
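
The three access patterns of the preceding slides can be sketched in kernel code; the accessPatterns kernel and sizes below are illustrative:

#include <cuda_runtime.h>

// Illustrates the three global-memory access patterns discussed above
__global__ void accessPatterns(const float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Coalesced: consecutive threads read consecutive 4-byte words
    float a = in[tid];

    // 2. Aligned but not sequential: reversed order within each 32-thread group
    //    (still within one segment, so one transaction on compute capability 2.x)
    float b = in[(tid & ~31) + 31 - (tid & 31)];

    // 3. Misaligned: shifted off the segment boundary, costing extra transactions
    float c = in[tid + 1];

    out[tid] = a + b + c;
}

int main(void) {
    const int n = 256;
    float *in, *out;
    cudaMalloc((void **)&in, (n + 1) * sizeof(float));  // +1 for the misaligned read
    cudaMalloc((void **)&out, n * sizeof(float));
    cudaMemset(in, 0, (n + 1) * sizeof(float));

    accessPatterns<<<n / 64, 64>>>(in, out);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}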

Page 27: Intermediate GPGPU Programming in CUDA

Shared Memory

• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x

• Helps utilize memory coalescing
• Bank conflicts may occur
– Two or more threads in a warp access the same bank
– In compute capability 1.x, no broadcast
– In compute capability 2.x, the same data is broadcast to all threads that request it

Page 28: Intermediate GPGPU Programming in CUDA

Bank Conflicts

[Diagrams: threads 0-3 mapped to banks 0-3. Left: no bank conflict (one thread per bank). Right: 2-way bank conflict (two threads per bank).]
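
A shared-memory sketch of the two cases in the diagram; stride-1 indexing maps each thread to its own bank, while stride-2 indexing makes pairs of threads share a bank (the 2-way conflict):

#include <cuda_runtime.h>

__global__ void bankConflicts(float *out) {
    __shared__ float s[64];
    int t = threadIdx.x;                 // one warp: t = 0..31

    s[t] = (float)t;
    s[t + 32] = (float)(t + 32);
    __syncthreads();

    float noConflict = s[t];             // stride 1: thread i reads bank i, no conflict
    float twoWay = s[2 * t];             // stride 2: two threads map to each bank (2-way conflict)
    out[t] = noConflict + twoWay;
}

int main(void) {
    float *out;
    cudaMalloc((void **)&out, 32 * sizeof(float));
    bankConflicts<<<1, 32>>>(out);       // a single warp
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}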

Page 29: Intermediate GPGPU Programming in CUDA

Matrix Multiplication Example

Page 30: Intermediate GPGPU Programming in CUDA

Matrix Multiplication Example

• Reduces accesses to global memory
– A is read (B.width/BLOCK_SIZE) times
– B is read (A.height/BLOCK_SIZE) times
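
The example slides show a tiled kernel along the lines of the one in the CUDA Programming Guide; the sketch below assumes square tiles and dimensions divisible by BLOCK_SIZE, and leaves the inputs uninitialized since it is only a launch skeleton:

#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Tiled kernel: each block computes one BLOCK_SIZE x BLOCK_SIZE tile of
// C = A * B, staging tiles of A and B in shared memory
__global__ void matMulShared(const float *A, const float *B, float *C,
                             int Awidth, int Bwidth) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int m = 0; m < Awidth / BLOCK_SIZE; m++) {
        // Each thread loads one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * Awidth + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * Bwidth + col];
        __syncthreads();                                   // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                                   // done with this tile
    }
    C[row * Bwidth + col] = sum;
}

int main(void) {
    const int M = 64, K = 64, N = 64;                      // A: MxK, B: KxN, C: MxN
    float *A, *B, *C;
    cudaMalloc((void **)&A, M * K * sizeof(float));
    cudaMalloc((void **)&B, K * N * sizeof(float));
    cudaMalloc((void **)&C, M * N * sizeof(float));

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(N / BLOCK_SIZE, M / BLOCK_SIZE);
    matMulShared<<<grid, block>>>(A, B, C, K, N);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}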

Page 31: Intermediate GPGPU Programming in CUDA

Demo

• Matrix Multiplication
– With and without shared memory
– Different block sizes

Page 32: Intermediate GPGPU Programming in CUDA

Control Flow

• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp take different execution paths

• Different execution paths are serialized
• Increases the number of instructions executed by that warp

Page 33: Intermediate GPGPU Programming in CUDA

Branch Divergence
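
The slide's illustration is not in the transcript; a sketch contrasting a divergent branch with a warp-uniform one (array contents are placeholders):

#include <cuda_runtime.h>

__global__ void divergence(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: threads within one warp take both paths,
    // so the two paths execute serially for that warp
    if (threadIdx.x % 2 == 0)
        data[tid] += 1.0f;
    else
        data[tid] -= 1.0f;

    // Not divergent: the condition is uniform across each 32-thread warp
    if ((threadIdx.x / 32) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] *= 0.5f;
}

int main(void) {
    const int n = 256;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    divergence<<<n / 128, 128>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}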

Page 34: Intermediate GPGPU Programming in CUDA

Summary

• 5 steps for CUDA Programming
• NVIDIA Hardware Architecture
– Memory hierarchy: global memory, shared memory, register file
– Specifications of a device: block, warp, thread, SM

Page 35: Intermediate GPGPU Programming in CUDA

Summary

• Memory optimization
– Reduce overhead due to host-device memory transfer with CUDA streams and zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
– Try to avoid bank conflicts in shared memory

• Control flow
– Try to avoid branch divergence in a warp

