  • Joined Advanced Student School (JASS) 2009, March 29 - April 7, 2009, St. Petersburg, Russia

    GPGPU: HIGH-PERFORMANCE COMPUTING

    Dmitry Puzyrev, St. Petersburg State University, Faculty of Physics, Department of Computational Physics

    In recent years, the application of graphics processing units to general-purpose computing has become widespread. GPGPU, which stands for General-Purpose computing on Graphics Processing Units, is making its way into fields of computation traditionally associated with CPUs or clusters of CPUs. The breakthrough in GPU computing was caused by the introduction of programmable stages and higher-precision arithmetic in rendering pipelines, which allows stream processing of non-graphics data.

    Let us understand what makes GPUs effective in high-performance computing. GPUs are built for parallel processing of data and are highly effective in data-parallel tasks. A large number of computing units (GPUs have on the order of 128-800 ALUs, compared to 4 ALUs on a typical quad-core CPU) allows the computational power of a high-end GPU to exceed that of a CPU by up to 10 times, while high-end GPUs cost much less than CPUs (see Fig. 1). Memory bandwidth, essential to many applications, is 100+ GB/s, compared to roughly 10-20 GB/s for CPUs.

    Fig. 1. Comparison of computational power: floating-point operations per second for the GPU and CPU (by NVIDIA)

  • Such computational power can now be applied in many areas, including scientific computing, signal and image processing, and, of course, computer graphics itself (including non-traditional rendering algorithms such as ray tracing). Scientific applications of high-performance GPGPU include molecular dynamics, astrophysics, geophysics, quantum chemistry, and neural networks. Fig. 2 illustrates the speedup achieved with GPGPU in several fields.

    146x - Medical Imaging (U of Utah)
    36x - Molecular Dynamics (U of Illinois, Urbana)
    18x - Video Transcoding (Elemental Tech)
    149x - Financial Simulation (Oxford)
    47x - Linear Algebra (Universidad Jaime)
    20x - 3D Ultrasound (Techniscan)
    50x - Matlab Computing (AccelerEyes)
    100x - Astrophysics (RIKEN)
    130x - Quantum Chemistry (U of Illinois, Urbana)
    30x - Gene Sequencing (U of Maryland)

    Not 2x or 3x: speedups are 20x to 150x.

    Fig. 2. Speedup by use of GPGPU in scientific applications (by NVIDIA)

    GPGPU computing can be performed on various hardware, including virtually all modern GPUs. NVIDIA desktop GPUs (the 9 and GTX series) and modern AMD GPUs support general-purpose computing. Both NVIDIA and AMD also have dedicated high-performance GPGPU products: NVIDIA Tesla and AMD FireStream.

    Of course, GPU architecture is highly specific: the GPU devotes many more transistors to data processing than to data caching and flow control. Fig. 3 roughly illustrates the architectural differences between CPU and GPU.

    Hardware specifics affect the programming model. The basics of the GPGPU programming model are:

    • A small program (called a kernel) works on many data elements

    • Each data element is processed concurrently

    • Communication is effective only inside one execution unit.

    Two slightly different models are used on GPUs: SIMD (Single Instruction, Multiple Data) and SPMD (Single Program, Multiple Data).

    GPGPU: High-performance Computing Puzyrev 2



    Fig. 3. Schematic comparison of CPU and GPU architecture

    The most popular tool for general-purpose GPU computing is NVIDIA's GPGPU implementation, NVIDIA CUDA. CUDA is positioned as a general-purpose parallel computing architecture and works with all modern NVIDIA GPUs. It uses C as its high-level programming language, though other languages will be supported in the future, as shown in Fig. 4.

    Fig. 4. NVIDIA CUDA architecture

    One of the basic concepts of CUDA is the kernel. Kernels are C functions that are executed N times in parallel, invoked with a special syntax from the main function. Fig. 5 contains a basic example of CUDA code with a kernel and a main function.



    A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<<…>>> syntax:

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

    Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. The following sample code adds two vectors A and B of size N and stores the result into vector C:

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

    Fig. 5. Kernel: this code adds two vectors A and B of size N

    The next basic concept of CUDA is the thread hierarchy. Each thread executes the kernel once, and threads are combined into thread blocks. Fig. 6 shows an example of matrix addition using one thread block.

    Each of the threads that executes vecAdd() performs one pair-wise addition. For convenience, threadIdx is a 3-component vector, so threads can be identified using a one-, two- or three-dimensional thread index.

    __global__ void matAdd(float A[N][N], float B[N][N],
                           float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(N, N);
        matAdd<<<1, dimBlock>>>(A, B, C);
    }

    Fig. 6. Thread blocks: this code adds two matrices A and B of size N*N

    Threads within a block have shared memory and can synchronize. On current GPUs, a thread block may contain up to 512 threads. A kernel can be executed by multiple equally shaped thread blocks organized into a grid. Thread blocks in a grid are required to execute independently. Fig. 7 shows an example of matrix addition on a grid, and Fig. 8 shows the full hierarchy of threads.


    __global__ void matAdd(float A[N][N], float B[N][N],
                           float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(16, 16);
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                     (N + dimBlock.y - 1) / dimBlock.y);
        matAdd<<<dimGrid, dimBlock>>>(A, B, C);
    }

    The thread block size of 16x16 = 256 threads was chosen somewhat arbitrarily. The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system.


    Fig. 7. Grid: this code adds two matrices A and B of size N*N

    Fig. 8. Full thread hierarchy

    Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory.

  • The memory hierarchy is another complicated part of CUDA. As shown in Fig. 9, multiple memory spaces are present, including additional specialized ones: constant and texture memory. Different memory usage strategies suit different applications, and memory usage is usually the bottleneck of GPGPU applications.

    Fig. 9. Memory hierarchy

    The host and device model controls the execution of a CUDA program. CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, and the CUDA runtime manages data transfer between them.




  • Fig. 10. Host and device: CUDA program execution


    Serial code executes on the host while parallel code executes on the device.



  • CUDA makes it possible to implement various libraries of functions that are essential for scientific high-performance computing. There is an efficient implementation of FFT on CUDA, as well as an implementation of BLAS called CUBLAS. Unfortunately, CUBLAS does not yet include all BLAS functions. CUDA-based plugins for MATLAB exist, and LAPACK libraries for CUDA are under development. CUDA allows efficient operations on dense matrices; sparse solvers are still under development.

    CUDA still has many limitations. Some of them are fundamental and follow from the specific architecture and the data-parallel programming model, but others should be fixed in the future. For example, double-precision arithmetic has no deviations from the IEEE 754 standard, but single precision is not fully standard-compliant. Double precision is supported only by the latest generation of NVIDIA GPUs. CUDA still lacks an advanced profiler. Emulation on the CPU is slow and often produces results that differ from those on the GPU. Another drawback is that CUDA works only on NVIDIA GPUs.

    An alternative implementation of the GPGPU architecture is Stream Computing by AMD, which is based on the Brook programming language. Brook works on both AMD and NVIDIA GPUs; Brook+ is AMD's hardware-optimized version. Brook is faster in some applications and has better support for double precision.

    There are two main future directions in GPGPU computing. The first focuses on the creation of new hardware that is more suitable for general-purpose computing. The best example is Larrabee, Intel's upcoming discrete GPU, which will compete in both graphics and high-performance computing. Larrabee has a hybrid architecture: 16-32 simple x86 cores with SIMD vector units and per-core L1/L2 caches, without fixed-function hardware except for texturing units.

    The programming model for Larrabee will be task-parallel at the core level and data-parallel on the vector units inside each core. One significant feature is that Larrabee cores can submit work to themselves without the host. Larrabee will use the Intel C/C++ compiler and benefit from all its features. Larrabee gathers much praise from Intel and heavy criticism from GPU manufacturers.

    The other direction is to create a standardized API for programming on different architectures. OpenCL, which stands for Open Computing Language, is a framework for programming CPUs, GPUs, and other processors. It was proposed by Apple and is developed by the Khronos Group, with full support from both AMD and NVIDIA. OpenCL is likely to be introduced with Mac OS X 10.6.

    In conclusion: GPGPU is now a well-developed branch of computing, and NVIDIA CUDA already allows us to utilize the power of GPUs in a convenient way.

    The data-parallel programming model is effective in many applications, but GPGPU could become more flexible by supporting more programming languages and programming concepts.

    A fusion between many-core CPUs and GPUs is a promising direction, and it is being explored by Intel.

    A standardized API for GPGPU is highly anticipated, and the upcoming OpenCL is the main candidate for this API.


  • REFERENCES

    NVIDIA CUDA Programming Guide 2.1

    Various NVIDIA CUDA presentations
