Page 1: GPU Programming with CUDA – Optimisation

Mike Griffiths

GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/

Page 2: Hardware Model

[Diagram: the host side (CPU with DRAM and I/O) is connected via PCIe to the device side (GPU with GDRAM and I/O). The GPU contains streaming multiprocessors (SMs), each with its own shared memory. The main program code runs on the CPU; GPU kernel code runs on the SMs.]

Page 3: Performance Inhibitors

• Data transfer to/from device memory
• Device under-utilisation
• GPU memory bandwidth
• Code branching

Page 4: Performance Inhibitors

• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching

Page 5: Data Transfer

• The CPU (host) and GPU (device) have separate, dedicated memory
• All data read or written on the device must be copied across the PCIe bus
• This is a very expensive operation

• Optimisation technique: minimise data copies
• Keep resident data on the device
• It may pay to move some computation to the GPU even if it is not computationally expensive
• It might be quicker to re-calculate data on the device than to copy it

Page 6: Data Transfer Example

• Port the inexpensive routine to the device
• Minimise transfers by moving the copies out of the loop

Original code:

Loop over timesteps
  inexpensive_routine_on_host(data_on_host)
  copy data from host to device
  expensive_routine_on_device(data_on_device)
  copy data from device to host
End loop over timesteps

Optimised code:

copy data from host to device
Loop over timesteps
  inexpensive_routine_on_device(data_on_device)
  expensive_routine_on_device(data_on_device)
End loop over timesteps
copy data from device to host
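As a concrete illustration, here is a minimal CUDA C sketch of the optimised pattern above. The kernel bodies, the array size n, the number of timesteps and the launch configuration are placeholders invented for this example; only the structure (one copy in, one copy out, all per-timestep work on the device) reflects the slide.

#include <cuda_runtime.h>

// Hypothetical kernels standing in for the routines on the slide
__global__ void inexpensive_routine(float *data, int n) { /* ... */ }
__global__ void expensive_routine(float *data, int n)   { /* ... */ }

void run_timesteps(float *data_on_host, int n, int timesteps)
{
    float *data_on_device;
    cudaMalloc(&data_on_device, n * sizeof(float));

    // One copy in, before the loop
    cudaMemcpy(data_on_device, data_on_host, n * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int t = 0; t < timesteps; t++) {
        // Both routines now run on the device: no per-step PCIe traffic
        inexpensive_routine<<<blocks, threads>>>(data_on_device, n);
        expensive_routine<<<blocks, threads>>>(data_on_device, n);
    }

    // One copy out, after the loop
    cudaMemcpy(data_on_host, data_on_device, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(data_on_device);
}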

Page 7: Performance Inhibitors

• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching

Page 8: Exposing Parallelism

• GPU performance relies on the use of many threads
• The degree of parallelism must be much higher than on the CPU
• Ideally you need many more threads than cores

• Effort must be made to expose as much parallelism as possible
• This may require re-engineering your problem

• If significant sections of the code are serial then GPU acceleration will be limited, as quantified by Amdahl's Law:

\[
\mathrm{Speedup}(N) = \frac{1}{B + \frac{1-B}{N}}
\]

where B is the fraction of the code that remains serial and N is the number of parallel processing elements.
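To see how quickly the serial fraction dominates, consider a hypothetical code that is 90% parallelisable (B = 0.1) running on 512 cores:

\[
\mathrm{Speedup}(512) = \frac{1}{0.1 + \frac{0.9}{512}} \approx 9.8,
\]

and no number of cores can push the speedup past \(1/B = 10\).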

Page 9: Memory Latency

• Access to GPU memory has several hundred cycles of latency
• A thread that is waiting for data is stalled

• GPUs have very fast context switching
• Stalled threads can be switched out for active threads
• Switching hides memory latency as long as other threads have computation to perform
• This requires many threads, ideally performing large amounts of computation

• Optimisation technique: have lots of threads with high arithmetic intensity
• Arithmetic intensity is defined as the ratio of arithmetic computation to memory accesses (e.g. a kernel that loads two values, adds them and stores the result performs one operation per three memory accesses)

Page 10: Exposing Parallelism Example

Original code:

Loop over i from 1 to 512
  Loop over j from 1 to 512
    independent iteration

1D decomposition (512 threads ✖):

Calc i from thread/block ID
Loop over j from 1 to 512
  independent iteration

2D decomposition (262,144 threads ✔):

Calc i & j from thread/block ID
independent iteration
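The following sketch makes the two decompositions concrete in CUDA C. The kernel names, the array data and the launch configurations are assumptions added for illustration; the 512×512 problem size and thread counts come from the slide.

#define N 512

// 1D decomposition: one thread per i, each thread loops over j (512 threads)
__global__ void kernel_1d(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        for (int j = 0; j < N; j++)
            data[i * N + j] *= 2.0f;   // independent iteration
}

// 2D decomposition: one thread per (i, j) pair (262,144 threads)
__global__ void kernel_2d(float *data)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        data[i * N + j] *= 2.0f;       // independent iteration
}

// Possible launch configurations:
//   kernel_1d<<<N / 64, 64>>>(d_data);            // 8 blocks x 64 = 512 threads
//   dim3 threads(16, 16);
//   dim3 blocks(N / 16, N / 16);
//   kernel_2d<<<blocks, threads>>>(d_data);       // 32x32 blocks x 256 = 262,144 threads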

Page 11: Performance Inhibitors

• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching

Page 12: Memory Coalescing

• GPUs have high peak memory bandwidth
• Maximum bandwidth is achieved when data is accessed in large requests rather than many small requests
• Large requests must come from multiple threads
• Otherwise memory accesses are serialised, degrading performance
• Memory coalescing: consecutive threads accessing consecutive memory locations

• Optimisation technique: coalesced memory accesses reduce the number of requests and achieve higher bandwidth

Page 13: Coalescing Example

• Consecutive threads are those with consecutive threadIdx.x values
• Question: do consecutive threads access consecutive memory locations?

index = blockIdx.x*blockDim.x + threadIdx.x;
output[index] = 2*input[index];

Page 14: Coalescing Example

• Consecutive threads are those with consecutive threadIdx.x values
• Question: do consecutive threads access consecutive memory locations?

index = blockIdx.x*blockDim.x + threadIdx.x;
output[index] = 2*input[index];

• Yes: consecutive threadIdx.x values give consecutive index values, so consecutive threads access consecutive locations of input and output

Page 15: Coalescing Example 2

• Question: do consecutive threads access consecutive memory locations?

Version 1:

i = blockIdx.x*blockDim.x + threadIdx.x;
for (j=0; j<N; j++)
  output[i][j] = 2*input[i][j];

Version 2:

j = blockIdx.x*blockDim.x + threadIdx.x;
for (i=0; i<N; i++)
  output[i][j] = 2*input[i][j];

Page 16: Coalescing Example 2

• Question: do consecutive threads access consecutive memory locations?

Version 1 ✖:

i = blockIdx.x*blockDim.x + threadIdx.x;
for (j=0; j<N; j++)
  output[i][j] = 2*input[i][j];

• No: consecutive threadIdx.x values correspond to consecutive i values, so consecutive threads access locations a whole row (N elements) apart

Version 2 ✔:

j = blockIdx.x*blockDim.x + threadIdx.x;
for (i=0; i<N; i++)
  output[i][j] = 2*input[i][j];

• Yes: consecutive threadIdx.x values correspond to consecutive j values, so consecutive threads access adjacent elements within a row
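For readers working with flat arrays rather than true 2D C arrays, here is the same pair of kernels with explicit row-major indexing (row*N + column), which makes the access stride visible. The kernel names and the value of N are assumptions added for the sketch.

#define N 512

// Version 1: consecutive threads get consecutive i (rows), so at each
// loop iteration adjacent threads touch elements N apart -- uncoalesced
__global__ void one_thread_per_row(const float *input, float *output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        for (int j = 0; j < N; j++)
            output[i * N + j] = 2.0f * input[i * N + j];
}

// Version 2: consecutive threads get consecutive j (columns), so at each
// loop iteration adjacent threads touch adjacent elements -- coalesced
__global__ void one_thread_per_col(const float *input, float *output)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < N)
        for (int i = 0; i < N; i++)
            output[i * N + j] = 2.0f * input[i * N + j];
}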

Page 17: Memory Coalescing in 2D

• What about 2D or 3D decompositions?
• Exactly the same principle applies
• The index derived from threadIdx.x should always be the one that addresses consecutive memory locations
• E.g. matrix addition, where the contiguous column index j comes from threadIdx.x:

int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;

c[i][j] = a[i][j] + b[i][j];
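Fleshed out, the fragment above might look like the following. The kernel name, matrix size, block shape and flattened indexing are assumptions added for this sketch.

#define N 512

__global__ void mat_add(const float *a, const float *b, float *c)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column: contiguous in memory
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < N && j < N)
        c[i * N + j] = a[i * N + j] + b[i * N + j];
}

// Launch with threadIdx.x spanning columns, so each warp reads and
// writes contiguous runs of a, b and c:
//   dim3 threads(16, 16);
//   dim3 blocks((N + 15) / 16, (N + 15) / 16);
//   mat_add<<<blocks, threads>>>(d_a, d_b, d_c);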

Page 18: Performance Inhibitors

• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching

Page 19: Code Branching

• On NVIDIA GPUs there are fewer instruction scheduling units than cores
• Threads are scheduled in groups of 32 (a warp)
• Threads within a warp execute the same instruction in lock-step
• Single Instruction Multiple Data (SIMD)

• CUDA C kernels are free to specify branches
• BUT a warp must step through every code path taken by any of its threads

• Optimisation technique: avoid intra-warp branching (warp divergence) wherever possible

Page 20: Branching Example

• You want to split your threads into two groups:

Version 1 ✖ (diverges within every warp, so both paths are executed):

i = blockIdx.x*blockDim.x + threadIdx.x;
if (i%2 == 0)
  …
else
  …

Version 2 ✔ (branches on warp boundaries, so each warp of 32 threads takes a single path):

i = blockIdx.x*blockDim.x + threadIdx.x;
if ((i/32)%2 == 0)
  …
else
  …
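A minimal runnable sketch of the warp-aligned version follows. The kernel name and the work assigned to each group are invented for illustration; note also that the warp-aligned split assigns different elements to each group than the even/odd split would, which is fine when the grouping itself is arbitrary.

__global__ void split_work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Warp-aligned split: threads 0-31 take one branch, threads 32-63
    // the other, and so on, so no warp executes both paths
    if ((i / 32) % 2 == 0)
        data[i] *= 2.0f;   // placeholder work for group A
    else
        data[i] += 1.0f;   // placeholder work for group B
}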

Page 21: CUDA Profiling

• Set the COMPUTE_PROFILE environment variable to 1
• A log file will be created at runtime, e.g. cuda_profile_0.log
• It contains timing information for kernels and data transfers
• It is possible to output more metrics (see doc/Compute_Profiler.txt)

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M1060
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6e2e9ee8858
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
...

Page 22: Conclusions

• GPUs offer higher floating-point and memory bandwidth performance than CPUs
• A number of factors will inhibit execution performance
• A number of techniques can be applied to circumvent these
• Some techniques may require re-engineering your problem
• If your application can't be adapted, GPU performance will not be good!

• It is important to have an understanding of the application, the architecture and the programming model

