Performance modeling in GPGPU computing
Wenjing Xu
Professor: Dr. Box
What's GPGPU?
GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, engineering, and enterprise applications.
What's modeling?
A model is a simplified representation of a system or phenomenon, and it is the most explicit way to describe that system or phenomenon. We use the parameters we set to build formulas that analyze the system.
Related work
Hong and Kim [3] introduce two metrics, Memory Warp Parallelism (MWP) and Computation Warp Parallelism (CWP), to describe the GPU parallel architecture.
Zhang and Owens [4] develop a performance model based on their microbenchmarks so that they can identify bottlenecks in a program.
Supada [5] presents a performance model that considers memory latencies that vary with the data type and the type of memory.
1 Introduction and background
Different applications and devices cannot use the same configuration. This model finds the relationships among its parameters and chooses the best block size for each application on each device to reach peak performance.
Varying data sizes combined with varying block sizes yield different performance, as the sketch below illustrates.
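As a minimal illustration of this observation, the sketch below times the same hypothetical vector-add kernel (not from the poster) at several block sizes using the CUDA runtime's event timers; on most devices the timings differ from one block size to the next.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, used only to show that block size affects run time.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 22;
        float *a, *b, *c;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Same data size, different block sizes: the timings differ.
        for (int block = 32; block <= 1024; block *= 2) {
            int grid = (n + block - 1) / block;
            cudaEventRecord(start);
            vectorAdd<<<grid, block>>>(a, b, c, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block size %4d: %.3f ms\n", block, ms);
        }

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }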
[Figures: how the GPU works; memory latency hiding; the structure of threads; specification of the GeForce GTX 650]
Parameters
Specification                                Description                                   Symbol
Threads / Warp                               Number of threads in a warp                   NRW
Warps / Multiprocessor                       Number of warps in a multiprocessor           NWM
Threads / Multiprocessor                     Number of threads that can reside on an SM    NRT
Thread Blocks / Multiprocessor               Number of blocks that can reside on an SM     NRB
Thread Blocks / Multiprocessor needed        Number of blocks needed                       NB
Max Shared Memory / Multiprocessor (bytes)   Shared memory available to an SM              MSM
Register File Size                           Registers per block                           RB
Max Registers / Thread                       Max number of registers a thread can use      MTR
Max Thread Block Size                        Max number of threads in a block              NMB
Threads                                      Number of threads needed                      NT
Threads / Block                              Number of threads in a block                  NTB
Threads in Warp                              Number of threads in a warp                   NTW
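Most of these parameters can be read off the device at run time. A minimal sketch using the CUDA runtime's cudaGetDeviceProperties; the mapping from CUDA property fields to the symbols above is an assumption:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);  // device 0, e.g. the GeForce GTX 650
        printf("Threads / Warp (NTW):           %d\n",  p.warpSize);
        printf("Threads / Multiprocessor (NRT): %d\n",  p.maxThreadsPerMultiProcessor);
        printf("Max Thread Block Size (NMB):    %d\n",  p.maxThreadsPerBlock);
        printf("Shared Memory / SM (MSM):       %zu bytes\n", p.sharedMemPerMultiprocessor);
        printf("Register File Size / SM:        %d registers\n", p.regsPerMultiprocessor);
        printf("Max Registers / Block:          %d\n",  p.regsPerBlock);
        return 0;
    }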
Block size setting under the threads limitation
NMB >= NTB = N * NTW >= NRT / NRB   (N is a positive integer)
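A sketch of this rule: enumerate the block sizes NTB that are whole multiples of the warp size and fall inside the bounds above. The NRB value here is an assumption (16 on Kepler-class parts such as the GTX 650), since older runtimes do not expose it as a device property:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        int NTW = p.warpSize;                     // threads per warp
        int NMB = p.maxThreadsPerBlock;           // max threads per block
        int NRT = p.maxThreadsPerMultiProcessor;  // resident threads per SM
        int NRB = 16;                             // resident blocks per SM (assumed: 16 on Kepler)

        // NMB >= NTB = N * NTW >= NRT / NRB, N a positive integer
        for (int NTB = NTW; NTB <= NMB; NTB += NTW)
            if (NTB >= NRT / NRB)
                printf("candidate block size: %d\n", NTB);
        return 0;
    }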
Memory resource
Memory     Location        Hit Latency        Program Scope
Global     Off-chip        200-300 cycles     Global
Local      Off-chip        Same as global     Function
Shared     On-chip         Register latency   Function
Constant   On-chip cache   Register latency   Global
Texture    On-chip cache   >100 cycles        Global
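These memory spaces correspond to CUDA declaration qualifiers. A minimal illustrative kernel (not from the poster) showing where each space appears; texture memory, accessed through texture objects, is omitted:

    #include <cuda_runtime.h>

    __constant__ float coeff[16];       // constant memory: on-chip cache, global scope

    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float tile[256];     // shared memory: on-chip, per-block (function) scope
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // 'in'/'out' live in off-chip global memory
        if (i < n) {
            tile[threadIdx.x] = in[i];  // assumes blockDim.x <= 256
            __syncthreads();
            float v = tile[threadIdx.x] * coeff[0];      // 'v' sits in a register,
            out[i] = v;                 // or spills to off-chip local memory under register pressure
        }
    }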
Block size setting under streaming multiprocessor resources
MR / MTR >= N * NTB,   N * NTB <= NRT,   N <= MSM / MSB   (N is an integer; MR is the register file size per SM and MSB the shared memory used per block)
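Taken together, the three constraints bound the number of blocks N that can be resident on one SM. A sketch under stated assumptions: the per-thread register count and per-block shared memory (MSB) are hypothetical inputs that would normally come from the compiler's resource report (nvcc --ptxas-options=-v):

    #include <algorithm>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);

        int NTB = 256;           // chosen block size (threads per block)
        int regsPerThread = 32;  // assumed; reported by nvcc --ptxas-options=-v
        size_t MSB = 4096;       // shared memory per block in bytes (assumed)

        // Register constraint:      N * NTB * regsPerThread <= register file size
        int byRegs    = p.regsPerMultiprocessor / (NTB * regsPerThread);
        // Thread constraint:        N * NTB <= NRT
        int byThreads = p.maxThreadsPerMultiProcessor / NTB;
        // Shared-memory constraint: N <= MSM / MSB
        int bySmem    = (int)(p.sharedMemPerMultiprocessor / MSB);

        int N = std::min({byRegs, byThreads, bySmem});
        printf("resident blocks per SM: %d (regs %d, threads %d, smem %d)\n",
               N, byRegs, byThreads, bySmem);
        return 0;
    }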
Conclusion
Although more threads can hide memory access latency, more threads also consume more resources. Finding the balance point between the resource limits and memory latency hiding is a shortcut to peak performance. Across different applications and devices, this performance model shows its advantage: it is adaptable and, without any rework or redesign, lets an application run at its best tuning.