Lecture 19: OpenCL. ECE 459: Programming for Performance. March 18, 2014
Transcript
  • Lecture 19: OpenCL. ECE 459: Programming for Performance

    March 18, 2014

  • Last Time: Compiler Optimizations

    Compiler reads your program and emits one just like it, but faster.

    Also: profile-guided optimizations.

    2 / 47

  • Part I

    OpenCL concepts

    3 / 47

  • Introduction

    OpenCL: coding on a heterogeneous architecture. No longer just programming the CPU; we will also leverage the GPU.

    OpenCL = Open Computing Language. Usable on both NVIDIA and AMD GPUs.

    4 / 47

  • SIMT

    Another term you may see vendors using: Single Instruction, Multiple Threads. Runs on a vector of data. Similar to SIMD instructions (e.g. SSE). However, the vector is spread out over the GPU.

    5 / 47

  • Other Heterogeneous Programming Examples

    PlayStation 3 Cell; CUDA.

    [PS4: back to a regular CPU/GPU system, albeit on one chip.]

    6 / 47

  • (PS3) Cell Overview

    Cell consists of: a PowerPC core; and 8 SIMD co-processors.

    (from the Linux Cell documentation)

    7 / 47

  • CUDA Overview

    Compute Unified Device Architecture: NVIDIA's architecture for processing on GPUs.

    C for CUDA predates OpenCL; NVIDIA supports it first and foremost.

    May be faster than OpenCL on NVIDIA hardware. The API allows you to use (most) C++ features in CUDA; OpenCL has more restrictions.

    8 / 47

  • GPU Programming Model

The abstract model is simple:
    Write the code for the parallel computation (kernel) separately from the main code.
    Transfer the data to the GPU co-processor (or execute it on the CPU).
    Wait . . .
    Transfer the results back.

    9 / 47

  • Data Parallelism

    Key idea: evaluate a function (or kernel) over a set of points (data).

    Another example of data parallelism.

    Another name for the set of points: index space. Each point corresponds to a work-item.

    Note: OpenCL also supports task parallelism (using different kernels), but documentation is sparse.

    10 / 47

  • Work-Items

    Work-item: the fundamental unit of work in OpenCL. Stored in an n-dimensional grid (ND-Range); 2D above.

    OpenCL spawns a bunch of threads to handle work-items. When executing, the range is divided into work-groups, which execute on the same compute unit.

The set of compute units (or cores) is called something different depending on the manufacturer.

    NVIDIA: warp; AMD/ATI: wavefront.

    11 / 47

  • Work-Items: Three more details

    One thread per work item, each with a different thread ID.

    You can say how to divide the ND-Range into work-groups,or the system can do it for you.

    Scheduler assigns work-items to warps/wavefronts until none are left.

    12 / 47

  • Shared Memory

There are many different types of memory available to you:
    private memory: available to a single work-item;
    local memory (aka shared memory): shared between work-items belonging to the same work-group; like a user-managed cache;
    global memory: shared between all work-items as well as the host;
    constant memory: resides on the GPU and is cached; does not change.

    There is also host memory (normal memory); usually contains app data.

    13 / 47

  • Example Kernel

Here's some traditional code to evaluate Ci = Ai * Bi:

    void traditional_mul(int n,
                         const float *a,
                         const float *b,
                         float *c) {
        int i;
        for (i = 0; i < n; i++) c[i] = a[i] * b[i];
    }

    And as a kernel:

    kernel void opencl_mul(global const float *a,
                           global const float *b,
                           global float *c) {
        int id = get_global_id(0); // dimension 0
        c[id] = a[id] * b[id];
    }

    14 / 47

  • Restrictions when writing kernels in OpenCL

It's mostly C, but:
    no function pointers;
    no bit-fields;
    no variable-length arrays;
    no recursion;
    no standard headers.

    15 / 47

  • OpenCLs Additions to C in Kernels

In kernels, you can also use:
    work-items;
    work-groups;
    vectors;
    synchronization;
    declarations of memory type;
    a kernel-specific library.

    16 / 47

  • Branches in kernels

kernel void contains_branch(global float *a, global float *b) {
        int id = get_global_id(0);
        if (cond) {
            a[id] += 5.0;
        } else {
            b[id] += 5.0;
        }
    }

The hardware will execute all branches that any thread in a warp executes, which can be slow!

    In other words: an if statement will cause each thread to execute both branches; we keep only the result of the taken branch.

    17 / 47

  • Loops in kernels

kernel void contains_loop(global float *a, global float *b) {
        int id = get_global_id(0);
        for (int i = 0; i < id; i++) {
            b[i] += a[i];
        }
    }

    A loop will cause the workgroup to wait for the maximum number of iterations of the loop in any work-item.

    Note: when you set up work-groups, best to arrange for allwork-items in a workgroup to execute the same branches & loops.

    18 / 47

  • Synchronization

Different workgroups execute independently. You can only put barriers and memory fences between work-items in the same workgroup.

    OpenCL supports: memory fences (load and store); barriers; volatile (beware!).

    19 / 47

  • Part II

    Programming with OpenCL

    20 / 47

  • Introduction

    Today, we'll see how to program with OpenCL. We're using OpenCL 1.1. There is a lot of initialization and querying. When you compile your program, include -lOpenCL.

    You can find the official documentation here:http://www.khronos.org/opencl/

    More specifically:http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

    Let's just dive into an example.

    21 / 47

  • First, reminders

All data belongs to an NDRange. The range can be divided into work-groups (in software). The work-groups run on wavefronts/warps (in hardware). Each wavefront/warp executes work-items.

    All branches in a wavefront/warp should execute the same path.

    If an iteration of a loop takes t: when one work-item executes 100 iterations, the total time to complete the wavefront/warp is 100t.

    22 / 47

  • Part III

    Simple Example

    23 / 47

  • Simple Example (1)

    // Note by PL: don't use this example as a template;
    // it uses the C bindings! Instead, use the C++ bindings.
    // source: pages 1-9 through 1-11,
    // http://developer.amd.com/wordpress/media/2013/07/
    // AMD_Accelerated_Parallel_Processing_OpenCL_
    // Programming_Guide-rev-2.7.pdf

    #include <CL/cl.h>
    #include <stdio.h>

    #define NWITEMS 512

    // A simple memset kernel
    const char *source =
    "__kernel void memset( __global uint *dst )     \n"
    "{                                              \n"
    "    dst[get_global_id(0)] = get_global_id(0);  \n"
    "}                                              \n";

    int main(int argc, char **argv)
    {
        // 1. Get a platform.
        cl_platform_id platform;
        clGetPlatformIDs( 1, &platform, NULL );

    24 / 47

  • Explanation (1)

    Include the OpenCL header.

    Request a platform (also known as a host).

    A platform contains compute devices: GPUs or CPUs.

    25 / 47

  • Simple Example (2)

// 2. Find a gpu device.
    cl_device_id device;
    clGetDeviceIDs( platform,
                    CL_DEVICE_TYPE_GPU,
                    1,
                    &device,
                    NULL );

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL,
                                          1,
                                          &device,
                                          NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context,
                                                   device,
                                                   0, NULL );

    26 / 47

  • Explanation (2)

    Request a GPU device.

Request an OpenCL context (representing all of OpenCL's state).

    Create a command-queue: get OpenCL to do work by telling it to run a kernel in a queue.

    27 / 47

  • Simple Example (3)

// 4. Perform runtime source compilation, and obtain
    //    kernel entry point.
    cl_program program = clCreateProgramWithSource( context,
                                                    1,
                                                    &source,
                                                    NULL,
                                                    NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "memset", NULL );

    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer( context,
                                    CL_MEM_WRITE_ONLY,
                                    NWITEMS * sizeof(cl_uint),
                                    NULL, NULL );

    28 / 47

  • Explanation (3)

We create an OpenCL program (runs on the compute unit): kernels; functions; and declarations.

    In this case, we create a kernel called memset from source. OpenCL may also create programs from binaries (which may be in an intermediate representation).

    Next, we need a data buffer (enables inter-device communication).

    This program does not have any input, so we don't put anything into the buffer (just declare its size).

    29 / 47

  • Simple Example (4)

// 6. Launch the kernel. Let OpenCL pick the local work
    //    size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg( kernel, 0, sizeof(buffer), (void *) &buffer );
    clEnqueueNDRangeKernel( queue,
                            kernel,
                            1,                  // dimensions
                            NULL,               // initial offsets
                            &global_work_size,  // number of work-items
                            NULL,               // work-items per work-group
                            0, NULL, NULL );    // events
    clFinish( queue );

    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer,
                                          CL_TRUE, CL_MAP_READ,
                                          0, NWITEMS * sizeof(cl_uint),
                                          0, NULL, NULL, NULL );

    30 / 47

  • Explanation (4)

Set kernel arguments to buffer. We launch the kernel, enqueuing the 1-dimensional index space starting at 0. We specify that the index space has NWITEMS elements; and not to subdivide the program into work-groups. There is also an event interface, which we do not use.

    We copy the results back; the call is blocking (CL_TRUE); hence we don't need an explicit clFinish() call.

    We specify that we want to read the results back into buffer.

    31 / 47

  • Simple Example (5)

    int i;
    for (i = 0; i < NWITEMS; i++)
        printf("%d %d\n", i, ptr[i]);
    return 0;
    }

    The program simply prints 0 0, 1 1, . . . , 511 511.
    Note: I didn't clean up or include error handling for any of the OpenCL functions.

    32 / 47

  • Part IV

    Another Example

    33 / 47

  • C++ Bindings

If we use the C++ bindings, we'll get automatic resource release and exceptions.

    C++ likes to use the RAII style (resource acquisition is initialization).

    Change the header to CL/cl.hpp and define __CL_ENABLE_EXCEPTIONS.
    We'd also like to store our kernel in a file instead of a string.
    The C API is not so nice to work with.

    34 / 47

  • Vector Addition Kernel

Let's write a kernel that adds two vectors and stores the result.
    This kernel will go in the file vector_add_kernel.cl.

    __kernel void vector_add(__global const int *A,
                             __global const int *B,
                             __global int *C) {
        // Get the index of the current element to be processed
        int i = get_global_id(0);
        // Do the operation
        C[i] = A[i] + B[i];
    }

    Other possible qualifiers: local, constant and private.

    35 / 47

  • Vector Addition (1)

    // Vector add example, C++ bindings (use these!)
    // source:
    // http://www.thebigblob.com/getting-started-
    // with-opencl-and-gpu-computing/

    #define __CL_ENABLE_EXCEPTIONS

    #include <CL/cl.hpp>

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Create the two input vectors
        const int LIST_SIZE = 1000;
        int *A = new int[LIST_SIZE];
        int *B = new int[LIST_SIZE];
        for (int i = 0; i < LIST_SIZE; i++) {
            A[i] = i;
            B[i] = LIST_SIZE - i;
        }

    36 / 47

  • Vector Addition (2)

try {
        // Get available platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);

        // Select the default platform and create a context
        // using this platform and the GPU
        cl_context_properties cps[3] = {
            CL_CONTEXT_PLATFORM,
            (cl_context_properties)(platforms[0])(),
            0
        };
        cl::Context context(CL_DEVICE_TYPE_GPU, cps);

        // Get a list of devices on this platform
        std::vector<cl::Device> devices =
            context.getInfo<CL_CONTEXT_DEVICES>();

        // Create a command queue and use the first device
        cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);

    37 / 47

  • Explanation (2)

You can define __NO_STD_VECTOR and use cl::vector (same with strings).

    You can enable profiling by adding CL_QUEUE_PROFILING_ENABLE as the 3rd argument to the queue constructor.

    38 / 47

  • Vector Addition (3)

// Read source file
        std::ifstream sourceFile("vector_add_kernel.cl");
        std::string sourceCode(
            std::istreambuf_iterator<char>(sourceFile),
            (std::istreambuf_iterator<char>())
        );
        cl::Program::Sources source(
            1,
            std::make_pair(sourceCode.c_str(),
                           sourceCode.length()+1)
        );

        // Make program of the source code in the context
        cl::Program program = cl::Program(context, source);

        // Build program for these specific devices
        program.build(devices);

        // Make kernel
        cl::Kernel kernel(program, "vector_add");

    39 / 47

  • Vector Addition (4)

// Create memory buffers
        cl::Buffer bufferA = cl::Buffer(
            context,
            CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int)
        );
        cl::Buffer bufferB = cl::Buffer(
            context,
            CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int)
        );
        cl::Buffer bufferC = cl::Buffer(
            context,
            CL_MEM_WRITE_ONLY,
            LIST_SIZE * sizeof(int)
        );

    40 / 47

  • Vector Addition (5)

// Copy lists A and B to the memory buffers
        queue.enqueueWriteBuffer(
            bufferA,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            A
        );
        queue.enqueueWriteBuffer(
            bufferB,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            B
        );

        // Set arguments to kernel
        kernel.setArg(0, bufferA);
        kernel.setArg(1, bufferB);
        kernel.setArg(2, bufferC);

    41 / 47

  • Explanation (5)

enqueue*Buffer arguments:
    const cl::Buffer & buffer
    cl_bool blocking_write
    ::size_t offset
    ::size_t size
    const void * ptr

    42 / 47

  • Vector Addition (6)

// Run the kernel on specific ND range
        cl::NDRange global(LIST_SIZE);
        cl::NDRange local(1);
        queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,
            global,
            local
        );

        // Read buffer C into a local list
        int *C = new int[LIST_SIZE];
        queue.enqueueReadBuffer(
            bufferC,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            C
        );

    43 / 47

  • Vector Addition (7)

        for (int i = 0; i < LIST_SIZE; i++) {
            std::cout << A[i] << " + " << B[i] << " = " << C[i] << std::endl;
        }
    } catch (cl::Error error) {
        std::cout << error.what() << "(" << error.err() << ")" << std::endl;
    }
    return 0;
    }

    44 / 47

  • Other Improvements

The host memory is still unreleased. With the same number of lines, we could use the C++11 unique_ptr, which would free the memory for us.

    You can use a vector instead of an array, and pass &v[0] where a raw pointer is needed.

    Valid as long as the vector is not resized.

    45 / 47

  • OpenCL Programming Summary

Went through real OpenCL examples. Have the reference card for the API.

    Saw a C++ template for setting up OpenCL.

    Aside: if you're serious about programming in C++, check out Effective C++ by Scott Meyers (slightly dated with C++11, but it still has some good stuff).

    46 / 47

  • Overall summary

    First Half: Brief overview of OpenCL and its programming model.

    Many concepts are similar to plain parallel programming(more structure).

    Second Half: Looked at an OpenCL implementation andhow to organize it.

    Need to write lots of boilerplate!

    47 / 47

    Outline: OpenCL concepts · Programming with OpenCL · Simple Example · Another Example

