Performance modeling in GPGPU computing
Wenjing Xu
Professor: Dr. Box
What's GPGPU?
GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, engineering, and enterprise applications.
What's modeling?
A model is a simplified representation of a system or phenomenon, and it is the most explicit way to describe that system or phenomenon. We use the parameters we set to build formulas that analyze the system.
Related work
Hong and Kim [3] introduce two metrics, Memory Warp Parallelism (MWP) and Computation Warp Parallelism (CWP), to describe the GPU parallel architecture.
Zhang and Owens [4] develop a performance model based on their microbenchmarks so that they can identify bottlenecks in a program.
Supada [5] presents a performance model that considers memory latencies that vary with the data type and the type of memory.
1 Introduction and background
Different applications and devices cannot use the same configuration. This model finds the relationships among its parameters and chooses the best block size for each application on each device to reach peak performance.
Varying data sizes combined with varying block sizes yield different performance, as the sketch below illustrates.
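As a minimal illustration of this observation, the sketch below times the same hypothetical vector-add kernel (not from the poster) at several block sizes using the CUDA runtime's event timers; on most devices the timings differ from one block size to the next.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, used only to show that block size affects run time.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 22;
        float *a, *b, *c;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Same data size, different block sizes: the timings differ.
        for (int block = 32; block <= 1024; block *= 2) {
            int grid = (n + block - 1) / block;
            cudaEventRecord(start);
            vectorAdd<<<grid, block>>>(a, b, c, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block size %4d: %.3f ms\n", block, ms);
        }

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }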
[Figures: how the GPU works; memory latency hiding; the structure of threads; specification of the GeForce GTX 650]
Parameters
Specification                                Description                                   Symbol
Threads / Warp                               Number of threads in a warp                   NRW
Warps / Multiprocessor                       Number of warps in a multiprocessor           NWM
Threads / Multiprocessor                     Number of threads that can reside on an SM    NRT
Thread Blocks / Multiprocessor               Number of blocks that can reside on an SM     NRB
Thread Blocks / Multiprocessor needed        Number of blocks needed                       NB
Max Shared Memory / Multiprocessor (bytes)   Shared memory available to an SM              MSM
Register File Size                           Registers per block                           RB
Max Registers / Thread                       Max number of registers a thread can use      MTR
Max Thread Block Size                        Max number of threads in a block              NMB
Threads                                      Number of threads needed                      NT
Threads / Block                              Number of threads in a block                  NTB
Threads in Warp                              Number of threads in a warp                   NTW
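Most of these parameters can be read off the device at run time. A minimal sketch using the CUDA runtime's cudaGetDeviceProperties; the mapping from CUDA property fields to the symbols above is an assumption:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);  // device 0, e.g. the GeForce GTX 650
        printf("Threads / Warp (NTW):           %d\n",  p.warpSize);
        printf("Threads / Multiprocessor (NRT): %d\n",  p.maxThreadsPerMultiProcessor);
        printf("Max Thread Block Size (NMB):    %d\n",  p.maxThreadsPerBlock);
        printf("Shared Memory / SM (MSM):       %zu bytes\n", p.sharedMemPerMultiprocessor);
        printf("Register File Size / SM:        %d registers\n", p.regsPerMultiprocessor);
        printf("Max Registers / Block:          %d\n",  p.regsPerBlock);
        return 0;
    }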
Block size setting under the threads limitation
NMB >= NTB = N * NTW >= NRT / NRB   (N is a positive integer)
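A sketch of this rule: enumerate the block sizes NTB that are whole multiples of the warp size and fall inside the bounds above. The NRB value here is an assumption (16 on Kepler-class parts such as the GTX 650), since older runtimes do not expose it as a device property:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        int NTW = p.warpSize;                     // threads per warp
        int NMB = p.maxThreadsPerBlock;           // max threads per block
        int NRT = p.maxThreadsPerMultiProcessor;  // resident threads per SM
        int NRB = 16;                             // resident blocks per SM (assumed: 16 on Kepler)

        // NMB >= NTB = N * NTW >= NRT / NRB, N a positive integer
        for (int NTB = NTW; NTB <= NMB; NTB += NTW)
            if (NTB >= NRT / NRB)
                printf("candidate block size: %d\n", NTB);
        return 0;
    }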
Memory resource
Memory     Location        Hit Latency        Program Scope
Global     Off-chip        200-300 cycles     Global
Local      Off-chip        Same as global     Function
Shared     On-chip         Register latency   Function
Constant   On-chip cache   Register latency   Global
Texture    On-chip cache   >100 cycles        Global
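These memory spaces correspond to CUDA declaration qualifiers. A minimal illustrative kernel (not from the poster) showing where each space appears; texture memory, accessed through texture objects, is omitted:

    #include <cuda_runtime.h>

    __constant__ float coeff[16];       // constant memory: on-chip cache, global scope

    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float tile[256];     // shared memory: on-chip, per-block (function) scope
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // 'in'/'out' live in off-chip global memory
        if (i < n) {
            tile[threadIdx.x] = in[i];  // assumes blockDim.x <= 256
            __syncthreads();
            float v = tile[threadIdx.x] * coeff[0];      // 'v' sits in a register,
            out[i] = v;                 // or spills to off-chip local memory under register pressure
        }
    }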
Block size setting under streaming multiprocessor resources
MR / MTR >= N * NTB,   N * NTB <= NRT,   N <= MSM / MSB   (N is an integer; MR is the register file size per SM and MSB the shared memory used per block)
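Taken together, the three constraints bound the number of blocks N that can be resident on one SM. A sketch under stated assumptions: the per-thread register count and per-block shared memory (MSB) are hypothetical inputs that would normally come from the compiler's resource report (nvcc --ptxas-options=-v):

    #include <algorithm>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);

        int NTB = 256;           // chosen block size (threads per block)
        int regsPerThread = 32;  // assumed; reported by nvcc --ptxas-options=-v
        size_t MSB = 4096;       // shared memory per block in bytes (assumed)

        // Register constraint:      N * NTB * regsPerThread <= register file size
        int byRegs    = p.regsPerMultiprocessor / (NTB * regsPerThread);
        // Thread constraint:        N * NTB <= NRT
        int byThreads = p.maxThreadsPerMultiProcessor / NTB;
        // Shared-memory constraint: N <= MSM / MSB
        int bySmem    = (int)(p.sharedMemPerMultiprocessor / MSB);

        int N = std::min({byRegs, byThreads, bySmem});
        printf("resident blocks per SM: %d (regs %d, threads %d, smem %d)\n",
               N, byRegs, byThreads, bySmem);
        return 0;
    }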
Conclusion
Although more threads can hide memory access latency, more threads also consume more resources. Finding the balance point between the resource limits and memory latency hiding is a shortcut to peak performance. Across different applications and devices, this performance model shows its advantage: it is adaptable and, without any rework or redesign, lets an application run at its best tuning.