GPU Programming and SMAC Case Study
Zhengjie Lu
Master Student of Electronic Group
Electrical Engineering
Technische Universiteit Eindhoven, NL
Contents
Part 1: GPU Programming
1.1 NVIDIA GPU Hardware
1.2 NVIDIA CUDA Programming
1.3 Programming Environment
Part 2: SMAC Case Study
2.1 SMAC Introduction
2.2 SMAC Mapping
2.3 Experiment & Analysis
Part 3: Conclusion & Future Development
Concepts
1. GPU
• Graphics processing unit, i.e. the graphics card
• Chip vendors: NVIDIA and ATI
2. CUDA Programming
• "Compute Unified Device Architecture"
• Supported by NVIDIA
3. SMAC Application
• "Simplified Method for Atmospheric Correction"
1.1 NVIDIA GPU Hardware
• Example: NVIDIA 8-Series GPU
− 128 stream processors (SPs): 1.35 GHz per processor
− 16 shared memories: one shared by every 8 SPs; small but fast
− 1 global memory: shared by all 128 SPs; slow but large
− 1 constant memory: shared by all 128 SPs; small but fast
1.1 NVIDIA GPU Hardware
[Diagram: a stream multi-processor (SM) groups stream processors (SPs) around a shared memory; all SMs share the global/constant memory]
1.1 NVIDIA GPU Hardware
• Connection between GPU and CPU
− CPU => main memory => GPU global memory => GPU
− GPU => GPU global memory => main memory => CPU
1.1 NVIDIA GPU Hardware
• Hardware summary:
− Multi-threading is supported physically by the SPs.
− SPs inside an SM communicate with each other through the shared memory.
− SMs communicate with each other through the global memory.
− GPU and CPU communicate with each other through their memories: global memory <=> main memory.
1.2 NVIDIA CUDA Programming
• CUDA programming concepts
− Thread: the basic unit
− Block: a collection of threads
− Grid: a collection of blocks
1.2 NVIDIA CUDA Programming
• CUDA programming concepts
− A grid is mapped onto the GPU by the scheduler
− A block is mapped onto an SM by the scheduler
− A thread is mapped onto an SP by the scheduler
1.2 NVIDIA CUDA Programming
• CPU programming flow:
i. Allocate the CPU memory
ii. Run the CPU kernel
• CUDA programming flow:
i. Allocate the GPU memory
ii. Copy the input to the GPU memory
iii. Run the GPU kernel
iv. Copy the output from the GPU memory
1.2 NVIDIA CUDA Programming
• Example:
/**********************************************/
/* File: main.c                               */
/* Description: 8x8 matrix addition on CPU    */
/**********************************************/
// Data definition
const int mat1[64] = {…};
const int mat2[64] = {…};
int mat3[64];

// Matrix addition on CPU
void matrixAdd_CPU(int index, int* IN1, int* IN2, int* OUT);

// Main body
int main()
{
    // Run the matrix addition on CPU
    matrixAdd_CPU(64, mat1, mat2, mat3);
    return 0;
}
/**********************************************/
/* File: main.cu                              */
/* Description: 8x8 matrix addition on GPU    */
/**********************************************/
// Data definition
const int mat1[64] = {…};
const int mat2[64] = {…};
int mat3[64];

// Matrix addition on GPU
void matrixAdd_GPU(int index, int* IN1, int* IN2, int* OUT);

// Main body
int main()
{
    // Run the matrix addition on GPU
    matrixAdd_GPU(64, mat1, mat2, mat3);
    return 0;
}
1.2 NVIDIA CUDA Programming
/**********************************************/
/* File: main.c                               */
/* Description: 8x8 matrix addition on CPU    */
/**********************************************/
// Matrix addition on CPU
void matrixAdd_CPU(int index, int* IN1, int* IN2, int* OUT)
{
    int i;
    for (i = 0; i < index; i++) {
        OUT[i] = IN1[i] + IN2[i];
    }
}
/**********************************************/
/* File: main.cu                              */
/* Description: 8x8 matrix addition on GPU    */
/**********************************************/
// Matrix addition on GPU
void matrixAdd_GPU(int index, int* IN1, int* IN2, int* OUT)
{
    int* deviceIN1;
    int* deviceIN2;
    int* deviceOUT;
    dim3 grid(1, 1, 1);
    dim3 block(index, 1, 1);

    // Allocate GPU memory
    cutilSafeCall(cudaMalloc((void**) &deviceIN1, sizeof(int) * index));
    cutilSafeCall(cudaMalloc((void**) &deviceIN2, sizeof(int) * index));
    cutilSafeCall(cudaMalloc((void**) &deviceOUT, sizeof(int) * index));

    // Copy the input
    cutilSafeCall(cudaMemcpy(deviceIN1, IN1, sizeof(int) * index, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(deviceIN2, IN2, sizeof(int) * index, cudaMemcpyHostToDevice));

    // Run the matrix addition
    GPU_kernel<<<grid, block>>>(deviceIN1, deviceIN2, deviceOUT);

    // Copy the output
    cutilSafeCall(cudaMemcpy(OUT, deviceOUT, sizeof(int) * index, cudaMemcpyDeviceToHost));

    // Free GPU memory
    cutilSafeCall(cudaFree(deviceIN1));
    cutilSafeCall(cudaFree(deviceIN2));
    cutilSafeCall(cudaFree(deviceOUT));
}

/**********************************************/
/* File: main.cu                              */
/* Description: GPU kernel                    */
/**********************************************/
__global__ void GPU_kernel(int* IN1, int* IN2, int* OUT)
{
    int tx = threadIdx.x;
    OUT[tx] = IN1[tx] + IN2[tx];
}
1.2 NVIDIA CUDA Programming
• CUDA programming optimizations
1) Use the registers and shared memories
2) Maximize the number of threads per block
3) Coalesce global memory accesses
4) Avoid shared memory bank conflicts
5) Group the byte accesses
6) Use stream execution
1.2 NVIDIA CUDA Programming
1) Use the registers and shared memories
− Registers are the fastest (8192 32-bit registers per SM)
− Shared memory is fast, but small (16 KB per SM)
− Global memory is slow, but large (at least 256 MB)
1.2 NVIDIA CUDA Programming
2) Maximize the number of threads per block
i. Determine the register/memory budget per thread, with the tool cudaProf
ii. Determine the maximum number of threads, with the tool cudaCal
iii. Determine the number of blocks
1.2 NVIDIA CUDA Programming
3) Coalesce global memory accesses
− Global memory access pattern: 16 threads at a time
− The 16 threads must access 16 consecutive words in global memory
− The 1st thread must access a global memory address that is 16-word aligned
1.2 NVIDIA CUDA Programming
[Figure: one coalesced and three non-coalesced global memory access patterns]
1.2 NVIDIA CUDA Programming
4) Avoid shared memory bank conflicts
− Shared memory access pattern: 16 threads at a time
− 16 KB shared memory, organized as 16 x 1 KB memory banks
− Two threads shouldn't access two different addresses inside the same memory bank
1.2 NVIDIA CUDA Programming
[Figure: one conflict-free and two bank-conflicting shared memory access patterns]
1.2 NVIDIA CUDA Programming
5) Group the byte accesses
[Figure: ungrouped byte accesses vs. grouped accesses]
1.2 NVIDIA CUDA Programming
• Tips
− Examples: NVIDIA SDK
− Programming: "NVIDIA CUDA Programming Guide"
− Optimization: "NVIDIA CUDA C Programming: Best Practices Guide"
1.3 Programming Environment
1. Preparation
− Windows: Microsoft Visual C++ 2008 Express
− Linux
1.3 Programming Environment
2. CUDA installation
− Step 1: Download the CUDA package suitable for your operating system (http://www.nvidia.com/object/cuda_get.html)
− Step 2: Install the CUDA Driver
− Step 3: Install the CUDA Toolkit
− Step 4: Install the CUDA SDK
− Step 5: Verify the installation by running the SDK examples
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 1: Download "CUDA Wizard" and install it (http://www.comp.hkbu.edu.hk/~kyzhao/)
− Step 2: Open VC++ 2008 Express
− Step 3: Click "File" and choose "New/Project"
− Step 4: Choose "CUDA" in "Project types" and then select "CUDAWinApp" in "Visual Studio installed templates"
− Step 5: Name the CUDA project and click "OK"
− Step 6: Click "Solution Explorer"
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 7: Right-click "Source Files" and choose "Add/New Item…"
− Step 8: Click "Code" in "Categories" and then choose "C++ File (.cpp)" in "Visual Studio installed templates"
− Step 9: Name the file "main.cu" and click "Add"
− Step 10: Repeat Steps 6~8, and create the other file, named "GPU_kernel.cu"
− Step 11: Click "Solution Explorer" and select "GPU_kernel.cu" under the menu "Source Files"
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 12: Right-click "GPU_kernel.cu"
− Step 13: Click "Configuration Properties" and then click "General"
− Step 14: Select "Custom Build Tool" in "Tool" and click OK
− Step 15: Implement your GPU kernel in "GPU_kernel.cu" and the rest in "main.cu"
1.3 Programming Environment
• Tips
− Setting up a CUDA project on Linux:
http://sites.google.com/site/5kk70gpu/installation
http://forums.nvidia.com/lofiversion/index.php?f62.html
2.1 SMAC Introduction
• SMAC algorithm
• SMAC stands for "Simplified Method for Atmospheric Correction"
• A fast computation of the atmospheric reflections
2.3 Experiment & Analysis
• Experiment Preparation
HARDWARE
CPU: Intel Duo-Core, 2.5 GHz per core
GPU: NVIDIA 32-core GPU, 0.95 GHz per core
Main memory: 4 GB
PCI-E: PCI Express 1.0 x16
Operating system: Windows Vista Enterprise
CUDA version: CUDA 1.1

SOFTWARE
GPU maximum registers per thread: 60
GPU thread number: 192 x 4 (#threads per block x #blocks)
CPU thread number: 1
2.3 Experiment & Analysis
• Experiment setup
− Performance:
    CPU time = CPU stop timer - CPU start timer
    GPU time = GPU stop timer - GPU start timer
− GPU improvement:
    GPU improvement = CPU time / GPU time
2.3 Experiment & Analysis
• Experiment setup
− Linear execution-time prediction

    CPU time = CPU overhead + Bytes x CPU speed

    GPU time = GPU memory time + GPU run time
             = (GPU memory overhead + Bytes x GPU memory speed)
               + (GPU kernel overhead + Bytes x GPU kernel speed)
             = (GPU memory overhead + GPU kernel overhead)
               + Bytes x (GPU memory speed + GPU kernel speed)
             = GPU overhead + Bytes x GPU speed

    Improvement = (CPU overhead + Bytes x CPU speed) / (GPU overhead + Bytes x GPU speed)
                ≈ (Bytes x CPU speed) / (Bytes x GPU speed)
                = CPU speed / GPU speed

− Only holds for large-size data!!!
2.3 Experiment & Analysis
• Experiment setup
− Linear execution-time prediction (fitted):

    1-thread: CPU time = 5.39x10^-5 x data size
    1-stream: GPU time = 1.67 + 2.41x10^-6 x data size
    8-stream: GPU time = 4.45 + 2.01x10^-6 x data size
2.3 Experiment & Analysis
• Roofline model
[Figure: generic roofline model; both axes log-scaled]
2.3 Experiment & Analysis
• Roofline model with SMAC kernel
Hardware: NVIDIA Quadro FX570M
PCI Express bandwidth (GB/sec): 4
Peak performance (GFlops/sec): 91.2
Peak performance without FMA (GFlops/sec): 30.4

Software: SMAC kernel on GPU
Data size (Bytes): 59719680
Issued instruction number (Flops): 4189335552
Execution time (ms): 79.2
Instruction density (Flops/Byte): 70.15
Instruction throughput (GFlops/sec): 52.8
2.3 Experiment & Analysis
[Figure: Roofline Model of SMAC on GPU. Log-log plot of GFlops/sec vs. Flops/Byte, showing the peak-performance ceiling, the lower ceiling without FMA, and the hard disk I/O bandwidth; the SMAC kernel sits at 70.15 Flops/Byte, achieving 52.8 GFlops/sec]
3. Conclusion & Future Development
• SMAC application:
− The bottleneck is the hard disk I/O.
• SMAC kernel on GPU:
− The bottleneck is the computation.
− 25 times faster than the CPU when large-size data is processed with streams.
− The performance ceiling would be reached when the data size is "infinitely" large.
3. Conclusion & Future Development
• Future development
− Power measurement: "Consumption of Contemporary Graphics Accelerators"
3. Conclusion & Future Development
• Power measurement: physical setup
[Figure: physical power-measurement setup with 8 x 0.12 ohm (5 W) shunt resistors]
3. Conclusion & Future Development
• Future development
− Improve the hard disk I/O
− Employ a more powerful GPU
[Chart: results over data sizes halving from 520,937,472 bytes down to 254,364 bytes]