  • Joined Advanced Student School (JASS) 2009, March 29 - April 7, 2009, St. Petersburg, Russia

    GPGPU: HIGH-PERFORMANCE COMPUTING

    Dmitry Puzyrev, St. Petersburg State University, Faculty of Physics, Department of Computational Physics

    In recent years, the application of graphics processing units to general-purpose computing has become widespread. GPGPU, which stands for General-Purpose computing on Graphics Processing Units, is making its way into fields of computation traditionally associated with CPUs or clusters of CPUs. The breakthrough in GPU computing was caused by the introduction of programmable stages and higher-precision arithmetic in rendering pipelines, which allows stream processing of non-graphics data.

    Let us understand what makes GPUs effective in high-performance computing. GPUs are built for parallel processing of data and are highly effective in data-parallel tasks. A large number of computing units (GPUs have on the order of 128-800 ALUs, compared to 4 ALUs on a typical quad-core CPU) allows the computational power of a high-end GPU to exceed that of a CPU by up to 10 times, while high-end GPUs cost much less than CPUs (see Fig. 1). Memory bandwidth, essential to many applications, is 100+ GB/s, compared to roughly 10-20 GB/s for CPUs.

    Fig. 1. Comparison of computational power: floating-point operations per second for the GPU and CPU (by NVIDIA)

  • Such computational power can now be applied in many areas, including scientific computing, signal and image processing, and, of course, computer graphics itself (including non-traditional rendering algorithms such as ray tracing). Scientific applications of high-performance GPGPU include molecular dynamics, astrophysics, geophysics, quantum chemistry, and neural networks. Fig. 2 illustrates the speedup achieved with GPGPU in several fields.

    146x - Medical Imaging (U of Utah)
    36x - Molecular Dynamics (U of Illinois, Urbana)
    18x - Video Transcoding (Elemental Tech)
    149x - Financial Simulation (Oxford)
    47x - Linear Algebra (Universidad Jaime)
    20x - 3D Ultrasound (Techniscan)
    50x - Matlab Computing (AccelerEyes)
    100x - Astrophysics (RIKEN)
    130x - Quantum Chemistry (U of Illinois, Urbana)
    30x - Gene Sequencing (U of Maryland)

    Not 2x or 3x: speedups are 20x to 150x.

    Fig. 2. Speedup by use of GPGPU in scientific applications (by NVIDIA)

    GPGPU computing can be performed on various hardware, including virtually all modern GPUs. NVIDIA desktop GPUs (the 9 and GTX series) and modern AMD GPUs support general-purpose computing. Both NVIDIA and AMD also have dedicated high-performance GPGPU products: NVIDIA Tesla and AMD FireStream.

    Of course, GPU architecture is highly specific: the GPU devotes many more transistors to data processing than to data caching and flow control. Fig. 3 roughly illustrates the architectural differences between CPU and GPU.

    Hardware specifics affect the programming model. The basics of the GPGPU programming model are:

    • A small program (called a kernel) works on many data elements

    • Each data element is processed concurrently

    • Communication is effective only inside one execution unit.

    Two slightly different models are used on GPUs: SIMD (Single Instruction, Multiple Data) and SPMD (Single Program, Multiple Data).

    GPGPU: High-performance Computing Puzyrev 2



    Fig. 3. Schematic comparison of CPU and GPU architecture

    The most popular tool for general-purpose GPU computing is NVIDIA's GPGPU implementation, NVIDIA CUDA. CUDA is positioned as a general-purpose parallel computing architecture and works with all modern NVIDIA GPUs. It uses C as its high-level programming language, though other languages will be supported in the future, as shown in Fig. 4.

    Fig. 4. NVIDIA CUDA architecture

    One of the basic concepts of CUDA is the kernel. Kernels are C functions that are executed N times in parallel, invoked with a special syntax from the main function. Fig. 5 contains a basic example of CUDA code with a kernel and a main function.



    A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<<…>>> syntax:

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

    Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. The following sample code adds two vectors A and B of size N and stores the result into vector C:

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

    Fig. 5. Kernel: this code adds two vectors A and B of size N

    The next basic concept of CUDA is the thread hierarchy. Each thread executes the kernel once, and threads are combined into thread blocks. Fig. 6 shows an example of matrix addition using one thread block.

    Each of the threads that executes vecAdd() performs one pair-wise addition. For convenience, threadIdx is a 3-component vector, so threads can be identified using a one-, two- or three-dimensional thread index.

    __global__ void matAdd(float A[N][N], float B[N][N],
                           float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(N, N);
        matAdd<<<1, dimBlock>>>(A, B, C);
    }

    Fig. 6. Thread blocks: this code adds two matrices A and B of size N*N

    Threads within a block have shared memory and can synchronize. On current GPUs, a thread block may contain up to 512 threads. A kernel can be executed by multiple equally shaped thread blocks organized into a grid. Thread blocks in a grid are required to execute independently. Fig. 7 shows an example of matrix addition on a grid, and Fig. 8 shows the full hierarchy of threads.


    __global__ void matAdd(float A[N][N], float B[N][N],
                           float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(16, 16);
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                     (N + dimBlock.y - 1) / dimBlock.y);
        matAdd<<<dimGrid, dimBlock>>>(A, B, C);
    }

    The thread block size of 16x16 = 256 threads was chosen somewhat arbitrarily. The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system.


    Fig. 7. Grid: this code adds two matrices A and B of size N*N

    Fig. 8. Full thread hierarchy

    Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory.

  • The memory hierarchy is another complicated part of CUDA. As shown in Fig. 9, multiple memory spaces are present, including additional specialized ones: constant and texture memory. Different memory usage strategies suit different applications, and memory usage is usually the bottleneck of GPGPU applications.

    Fig. 9. Memory hierarchy

    The host and device model controls the execution of a CUDA program. CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, and the CUDA runtime manages data transfer between them.




  • Fig. 10. Host and device: CUDA program execution


    Serial code executes on the host while parallel code executes on the device.



  • CUDA makes it possible to implement various libraries of functions that are essential for scientific high-performance computing. There is an efficient implementation of FFT on CUDA, as well as an implementation of BLAS called CUBLAS. Unfortunately, CUBLAS does not yet include all BLAS functions. CUDA-based plugins for MATLAB exist, and LAPACK libraries for CUDA are under development. CUDA allows efficient operations on dense matrices; sparse solvers are still under development.

    CUDA still has many limitations. Some of them are fundamental and follow from the specific architecture and the data-parallel programming model, but others should be fixed in the future. For example, double-precision arithmetic has no deviations from the IEEE 754 standard, but single precision is not fully standard-compliant. Double precision is supported only by the latest generation of NVIDIA GPUs. CUDA still lacks an advanced profiler. Emulation on the CPU is slow and often produces results that differ from those on the GPU. Another drawback is that CUDA works only on NVIDIA GPUs.

    An alternative implementation of the GPGPU architecture is Stream Computing by AMD, which is based on the Brook programming language. Brook works on both AMD and NVIDIA GPUs; Brook+ is AMD's hardware-optimized version. Brook is faster in some applications and has better support for double precision.

    There are two main future directions in GPGPU computing. The first focuses on the creation of new hardware that is more suitable for general-purpose computing. The best example is Larrabee, Intel's upcoming discrete GPU, which will compete in both graphics and high-performance computing. Larrabee has a hybrid architecture: 16-32 simple x86 cores with SIMD vector units and per-core L1/L2 caches, without fixed-function hardware except for texturing units.

    The programming model for Larrabee will be task-parallel at the core level and data-parallel on the vector units inside each core. One significant feature is that Larrabee cores can submit work to themselves without the host. Larrabee will use the Intel C/C++ compiler and benefit from all its features. Larrabee gathers much praise from Intel and heavy criticism from GPU manufacturers.

    The other direction is to create a standardized API for programming on different architectures. OpenCL, which stands for Open Computing Language, is a framework for programming CPUs, GPUs, and other processors. It was proposed by Apple and is developed by the Khronos Group, with full support from both AMD and NVIDIA. OpenCL is likely to be introduced with Mac OS X 10.6.

    In conclusion: GPGPU is now a well-developed branch of computing, and NVIDIA CUDA already allows us to utilize the power of GPUs in a convenient way.

    The data-parallel programming model is effective in many applications, but GPGPU could become more flexible by supporting more programming languages and programming concepts.

    A fusion between many-core CPUs and GPUs is a promising direction, and it is being explored by Intel.

    A standardized API for GPGPU is highly anticipated, and the upcoming OpenCL is the main candidate for this API.


  • REFERENCES

    NVIDIA CUDA Programming Guide 2.1

    Various NVIDIA CUDA presentations
