Lecture 19: OpenCL
ECE 459: Programming for Performance
March 18, 2014
Last Time: Compiler Optimizations
The compiler reads your program and emits one just like it, but faster.
Also: profile-guided optimizations.
2 / 47
Part I
OpenCL concepts
3 / 47
Introduction
OpenCL: coding on a heterogeneous architecture.
No longer just programming the CPU; we will also leverage the GPU.
OpenCL = Open Computing Language.
Usable on both NVIDIA and AMD GPUs.
4 / 47
SIMT
Another term you may see vendors using: Single Instruction, Multiple Threads.
Runs on a vector of data.
Similar to SIMD instructions (e.g. SSE).
However, the vector is spread out over the GPU.
5 / 47
Other Heterogeneous Programming Examples
PlayStation 3 Cell
CUDA
[PS4: back to a regular CPU/GPU system, albeit on one chip.]
6 / 47
(PS3) Cell Overview
The Cell consists of:
a PowerPC core; and
8 SIMD co-processors.
[Figure: Cell architecture diagram, from the Linux Cell documentation.]
7 / 47
CUDA Overview
Compute Unified Device Architecture:
NVIDIA's architecture for processing on GPUs.
C for CUDA predates OpenCL; NVIDIA supports it first and foremost.
It may be faster than OpenCL on NVIDIA hardware.
The API allows you to use (most) C++ features in CUDA; OpenCL has more restrictions.
8 / 47
GPU Programming Model
The abstract model is simple:
Write the code for the parallel computation (kernel) separately from the main code.
Transfer the data to the GPU co-processor (or execute it on the CPU).
Wait . . .
Transfer the results back.
9 / 47
Data Parallelism
Key idea: evaluate a function (or kernel) over a set of points (data).
This is another example of data parallelism.
Another name for the set of points: index space. Each point corresponds to a work-item.
Note: OpenCL also supports task parallelism (using different kernels), but documentation is sparse.
10 / 47
Work-Items
Work-item: the fundamental unit of work in OpenCL.
Work-items are stored in an n-dimensional grid (ND-Range), e.g. a 2D grid.
OpenCL spawns a bunch of threads to handle work-items.
When executing, the range is divided into work-groups, which execute on the same compute unit.
The group of work-items that executes together in lock-step has a vendor-specific name:
NVIDIA: warp
AMD/ATI: wavefront
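To make the ND-Range idea concrete, here is a minimal sketch (not from the lecture) of a kernel run over an assumed 2D index space; the kernel name, arguments, and the width-based indexing are illustrative only:

// Sketch: one work-item per (x, y) point of a 2D ND-Range.
kernel void scale2d(global float *img, int width, float factor) {
    int x = get_global_id(0);   // position in dimension 0
    int y = get_global_id(1);   // position in dimension 1
    img[y * width + x] *= factor;
}

Each work-item reads its own coordinates with get_global_id() and touches exactly one element; the work-group subdivision is handled separately when the kernel is enqueued.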
11 / 47
Work-Items: Three more details
One thread per work item, each with a different thread ID.
You can say how to divide the ND-Range into work-groups, or the system can do it for you (see the sketch below).
The scheduler assigns work-items to warps/wavefronts until none are left.
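As a hedged illustration of dividing the ND-Range (host-side C API; it assumes a queue and kernel already exist, and the sizes are made up):

size_t global = 1024;   // total number of work-items
size_t local  = 64;     // work-items per work-group (must divide global evenly)

// Let the system pick the work-group size:
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,   0, NULL, NULL);

// Or specify it yourself:
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);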
12 / 47
Shared Memory
There are many different types of memory available to you:
private memory: available to a single work-item;
local memory (aka shared memory): shared between work-items belonging to the same work-group; like a user-managed cache;
global memory: shared between all work-items as well as the host;
constant memory: resides on the GPU and is cached; it does not change.
There is also host memory (normal memory); it usually contains app data.
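A minimal kernel sketch showing how these memory types appear as address-space qualifiers (the kernel and argument names are made up for illustration):

kernel void memory_spaces(global float *data,       // global: host and all work-items
                          constant float *coeffs,   // constant: read-only, cached
                          local float *scratch) {   // local: shared within one work-group
    float tmp;                                      // private: this work-item only
    int lid = get_local_id(0);
    scratch[lid] = data[get_global_id(0)];
    tmp = scratch[lid] * coeffs[0];
    data[get_global_id(0)] = tmp;
}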
13 / 47
Example Kernel
Here's some traditional code to evaluate c[i] = a[i] * b[i]:

void traditional_mul(int n,
                     const float *a,
                     const float *b,
                     float *c) {
    int i;
    for (i = 0; i < n; i++) c[i] = a[i] * b[i];
}

And as a kernel:

kernel void opencl_mul(global const float *a,
                       global const float *b,
                       global float *c) {
    int id = get_global_id(0); // dimension 0
    c[id] = a[id] * b[id];
}
14 / 47
Restrictions when writing kernels in OpenCL
It's mostly C, but:
No function pointers.
No bit-fields.
No variable-length arrays.
No recursion.
No standard headers.
15 / 47
OpenCLs Additions to C in Kernels
In kernels, you can also use:
Work-items.
Work-groups.
Vectors.
Synchronization.
Declarations of memory type.
A kernel-specific library.
16 / 47
Branches in kernels
kernel void contains_branch(global float *a,
                            global float *b) {
    int id = get_global_id(0);
    if (cond) {        // cond: some condition (placeholder)
        a[id] += 5.0;
    } else {
        b[id] += 5.0;
    }
}
The hardware will execute all branches that any thread in a warp executes; this can be slow!
In other words: an if statement will cause each thread to execute both branches; we keep only the result of the taken branch.
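One common workaround, sketched here under an assumed condition (a[id] > 0.0f is made up), is to replace the branch with arithmetic so every work-item executes the same instructions:

// Sketch: a branchless version of the update above, so no divergence occurs.
kernel void branchless(global float *a, global float *b) {
    int id = get_global_id(0);
    float flag = (a[id] > 0.0f) ? 1.0f : 0.0f;  // typically compiles to a select, not a branch
    a[id] += 5.0f * flag;                        // applied only where the condition holds
    b[id] += 5.0f * (1.0f - flag);               // applied only where it does not
}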
17 / 47
Loops in kernels
kernel void contains_loop(global float *a,
                          global float *b) {
    int id = get_global_id(0);

    for (int i = 0; i < id; i++) {
        b[i] += a[i];
    }
}
A loop will cause the workgroup to wait for the maximum numberof iterations of the loop in any work-item.
Note: when you set up work-groups, best to arrange for allwork-items in a workgroup to execute the same branches & loops.
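For example, a sketch (the names and the uniform bound n are assumptions) where every work-item runs the same trip count, so no single work-item stalls its wavefront/warp:

// Sketch: the loop bound n is a kernel argument, identical for all work-items.
kernel void uniform_loop(global const float *a, global float *b, int n) {
    int id = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {   // same number of iterations everywhere
        sum += a[i];
    }
    b[id] = sum;
}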
18 / 47
Synchronization
Different work-groups execute independently.
You can only put barriers and memory fences between work-items in the same work-group.
OpenCL supports:
Memory fences (load and store).
Barriers.
volatile (beware!)
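A minimal sketch of a barrier in action (a per-work-group sum; the kernel is illustrative, not from the lecture): work-items first publish to local memory, the barrier makes those stores visible, then one work-item combines them.

kernel void group_sum(global const float *in, global float *out,
                      local float *scratch) {
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);            // wait until every store to scratch is visible
    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < (int) get_local_size(0); i++)
            sum += scratch[i];
        out[get_group_id(0)] = sum;          // one result per work-group
    }
}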
19 / 47
Part II
Programming with OpenCL
20 / 47
Introduction
Today, we'll see how to program with OpenCL.
We're using OpenCL 1.1.
There is a lot of initialization and querying.
When you compile your program, include -lOpenCL.
You can find the official documentation here:http://www.khronos.org/opencl/
More specifically:http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
Let's just dive into an example.
21 / 47
First, reminders
All data belongs to an ND-Range.
The range can be divided into work-groups (in software).
The work-groups run on wavefronts/warps (in hardware).
Each wavefront/warp executes work-items.
All branches in a wavefront/warp should execute the same path.
If an iteration of a loop takes t: when one work-item executes 100 iterations, the total time to complete the wavefront/warp is 100t.
22 / 47
Part III
Simple Example
23 / 47
Simple Example (1)

// Note by PL: don't use this example as a template;
// it uses the C bindings! Instead, use the C++ bindings.
// source: pages 1-9 through 1-11,
// http://developer.amd.com/wordpress/media/2013/07/
// AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )            \n"
"{                                                      \n"
"    dst[get_global_id(0)] = get_global_id(0);          \n"
"}                                                      \n";

int main(int argc, char **argv)
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );
24 / 47
Explanation (1)
Include the OpenCL header.
Request a platform (also known as a host).
A platform contains compute devices:GPUs or CPUs.
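If you need to know what is available before choosing, here is a hedged sketch of the usual count-then-query pattern (assumes <CL/cl.h> and <stdlib.h>; error checks omitted):

cl_uint num_platforms, num_devices;
clGetPlatformIDs(0, NULL, &num_platforms);               // first call: just count
cl_platform_id *platforms =
    malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL);        // second call: fetch the IDs
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL,
               0, NULL, &num_devices);                   // count devices on platform 0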
25 / 47
Simple Example (2)
// 2. Find a GPU device.
cl_device_id device;
clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU,
                1,
                &device,
                NULL );

// 3. Create a context and command queue on that device.
cl_context context = clCreateContext( NULL,
                                      1,
                                      &device,
                                      NULL, NULL, NULL );

cl_command_queue queue = clCreateCommandQueue( context,
                                               device,
                                               0, NULL );
26 / 47
Explanation (2)
Request a GPU device.
Request an OpenCL context (representing all of OpenCL's state).

Create a command queue:
we get OpenCL to do work by telling it to run a kernel in the queue.
27 / 47
Simple Example (3)
// 4. Perform runtime source compilation, and obtain
//    kernel entry point.
cl_program program = clCreateProgramWithSource( context,
                                                1,
                                                &source,
                                                NULL,
                                                NULL );
clBuildProgram( program, 1, &device, NULL, NULL, NULL );
cl_kernel kernel = clCreateKernel( program, "memset", NULL );

// 5. Create a data buffer.
cl_mem buffer = clCreateBuffer( context,
                                CL_MEM_WRITE_ONLY,
                                NWITEMS * sizeof(cl_uint),
                                NULL, NULL );
28 / 47
Explanation (3)
We create an OpenCL program (runs on the compute unit), which contains:
kernels;
functions; and
declarations.
In this case, we create a kernel called memset from source.
OpenCL may also create programs from binaries (which may be in an intermediate representation).
Next, we need a data buffer (enables inter-device communication).
This program does not have any input, so we don't put anything into the buffer (we just declare its size).
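If the kernel did take input, one way (a sketch; host_data is a made-up array) is to copy host data into the buffer at creation time with CL_MEM_COPY_HOST_PTR:

cl_uint host_data[NWITEMS];   /* filled in by the host beforehand */
cl_mem in_buffer = clCreateBuffer( context,
                                   CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   NWITEMS * sizeof(cl_uint),
                                   host_data, NULL );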
29 / 47
Simple Example (4)
// 6. Launch the kernel. Let OpenCL pick the local work size.
size_t global_work_size = NWITEMS;
clSetKernelArg( kernel, 0, sizeof(buffer), (void *) &buffer );
clEnqueueNDRangeKernel( queue,
                        kernel,
                        1,                  // dimensions
                        NULL,               // initial offsets
                        &global_work_size,  // number of work-items
                        NULL,               // work-items per work-group
                        0, NULL, NULL );    // events
clFinish( queue );

// 7. Look at the results via synchronous buffer map.
cl_uint *ptr;
ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer,
                                      CL_TRUE, CL_MAP_READ,
                                      0, NWITEMS * sizeof(cl_uint),
                                      0, NULL, NULL, NULL );
30 / 47
Explanation (4)
We set the kernel's argument to buffer.
We launch the kernel, enqueuing the 1-dimensional index space starting at 0.
We specify that the index space has NWITEMS elements, and not to subdivide the program into work-groups.
There is also an event interface, which we do not use.

We copy the results back; the call is blocking (CL_TRUE), hence we don't need an explicit clFinish() call.
We specify that we want to read the results back into buffer.
31 / 47
Simple Example (5)
int i;
for (i = 0; i < NWITEMS; i++)
    printf("%d %d\n", i, ptr[i]);
return 0;
}
The program simply prints 0 0, 1 1, . . . , 511 511.
Note: I didn't clean up or include error handling for any of the OpenCL functions.
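For reference, a sketch of what the omitted error handling and cleanup might look like (checking one call's cl_int return value, then releasing the objects created above):

cl_int err = clGetPlatformIDs(1, &platform, NULL);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetPlatformIDs failed: %d\n", err);
    return 1;
}
/* ... same pattern for the other OpenCL calls ... */

clEnqueueUnmapMemObject(queue, buffer, ptr, 0, NULL, NULL);
clReleaseMemObject(buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);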
32 / 47
Part IV
Another Example
33 / 47
C++ Bindings
If we use the C++ bindings, we'll get automatic resource release and exceptions.
C++ likes to use the RAII style (Resource Acquisition Is Initialization).
Change the header to CL/cl.hpp and define __CL_ENABLE_EXCEPTIONS.
We'd also like to store our kernel in a file instead of a string.
The C API is not so nice to work with.
34 / 47
Vector Addition Kernel
Let's write a kernel that adds two vectors and stores the result.
This kernel will go in the file vector_add_kernel.cl.

__kernel void vector_add(__global const int *A,
                         __global const int *B,
                         __global int *C) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}
Other possible qualifiers: local, constant and private.
35 / 47
Vector Addition (1)

// Vector add example, C++ bindings (use these!)
// source:
// http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

#define __CL_ENABLE_EXCEPTIONS

#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Create the two input vectors
    const int LIST_SIZE = 1000;
    int *A = new int[LIST_SIZE];
    int *B = new int[LIST_SIZE];
    for (int i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }

36 / 47
Vector Addition (2)
try {
    // Get available platforms
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Select the default platform and create a context
    // using this platform and the GPU
    cl_context_properties cps[3] = {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)(platforms[0])(),
        0
    };
    cl::Context context(CL_DEVICE_TYPE_GPU, cps);

    // Get a list of devices on this platform
    std::vector<cl::Device> devices =
        context.getInfo<CL_CONTEXT_DEVICES>();

    // Create a command queue and use the first device
    cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
37 / 47
Explanation (2)
You can define __NO_STD_VECTOR and use cl::vector (same with strings).

You can enable profiling by adding CL_QUEUE_PROFILING_ENABLE as the third argument to the queue constructor.
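With profiling enabled, timings come from events. A hedged sketch using the C API calls (it assumes an event obtained from the enqueue call's last argument, and that the command has already completed, e.g. via clWaitForEvents):

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
/* (end - start) is the command's execution time in nanoseconds */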
38 / 47
Vector Addition (3)
// Read source file
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode(
    std::istreambuf_iterator<char>(sourceFile),
    (std::istreambuf_iterator<char>())
);
cl::Program::Sources source(
    1,
    std::make_pair(sourceCode.c_str(),
                   sourceCode.length()+1)
);

// Make program of the source code in the context
cl::Program program = cl::Program(context, source);

// Build program for these specific devices
program.build(devices);

// Make kernel
cl::Kernel kernel(program, "vector_add");
39 / 47
Vector Addition (4)
// Create memory buffers
cl::Buffer bufferA = cl::Buffer(
    context,
    CL_MEM_READ_ONLY,
    LIST_SIZE * sizeof(int)
);
cl::Buffer bufferB = cl::Buffer(
    context,
    CL_MEM_READ_ONLY,
    LIST_SIZE * sizeof(int)
);
cl::Buffer bufferC = cl::Buffer(
    context,
    CL_MEM_WRITE_ONLY,
    LIST_SIZE * sizeof(int)
);
40 / 47
Vector Addition (5)
// Copy lists A and B to the memory buffers
queue.enqueueWriteBuffer(
    bufferA,
    CL_TRUE,
    0,
    LIST_SIZE * sizeof(int),
    A
);
queue.enqueueWriteBuffer(
    bufferB,
    CL_TRUE,
    0,
    LIST_SIZE * sizeof(int),
    B
);

// Set arguments to kernel
kernel.setArg(0, bufferA);
kernel.setArg(1, bufferB);
kernel.setArg(2, bufferC);
41 / 47
Explanation (5)
enqueue*Buffer arguments:
buffer
cl_bool blocking_write
::size_t offset
::size_t size
const void *ptr
42 / 47
Vector Addition (6)
// Run the kernel on a specific ND range
cl::NDRange global(LIST_SIZE);
cl::NDRange local(1);
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    global,
    local
);

// Read buffer C into a local list
int *C = new int[LIST_SIZE];
queue.enqueueReadBuffer(
    bufferC,
    CL_TRUE,
    0,
    LIST_SIZE * sizeof(int),
    C
);
43 / 47
Vector Addition (7)
    for (int i = 0; i < LIST_SIZE; i++) {
        std::cout << A[i] << " + " << B[i] << " = " << C[i] << std::endl;
    }
} catch (cl::Error error) {
    std::cout << error.what() << "(" << error.err() << ")" << std::endl;
}
return 0;
}

44 / 47
Other Improvements
The host memory is still unreleased.
With the same number of lines, we could use the C++11 unique_ptr, which would free the memory for us.
You can use a std::vector instead of an array, and pass &v[0] wherever a raw pointer is needed.
Valid as long as the vector is not resized.
45 / 47
OpenCL Programming Summary
Went through real OpenCL examples.
Have the reference card for the API handy.
Saw a C++ template for setting up OpenCL.
Aside: if you're serious about programming in C++, check out Effective C++ by Scott Meyers (slightly dated with C++11, but it still has some good stuff).
46 / 47
Overall summary
First Half: Brief overview of OpenCL and its programming model.
Many concepts are similar to plain parallel programming (but with more structure).
Second Half: Looked at an OpenCL implementation and how to organize it.
Need to write lots of boilerplate!
47 / 47