Lecture 19: OpenCL
ECE 459: Programming for Performance
March 18, 2014
Last Time: Compiler Optimizations
The compiler reads your program and emits one just like it, but faster.
Also: profile-guided optimizations.
2 / 47
Part I
OpenCL concepts
3 / 47
Introduction
OpenCL: coding on a heterogeneous architecture.
No longer just programming the CPU; we will also leverage the GPU.
OpenCL = Open Computing Language.
Usable on both NVIDIA and AMD GPUs.
4 / 47
SIMT
Another term you may see vendors using: Single Instruction, Multiple Threads.
Runs on a vector of data.
Similar to SIMD instructions (e.g. SSE).
However, the vector is spread out over the GPU.
5 / 47
Other Heterogeneous Programming Examples
PlayStation 3 Cell
CUDA
[PS4: back to a regular CPU/GPU system, albeit on one chip.]
6 / 47
(PS3) Cell Overview
The Cell consists of:
a PowerPC core; and
8 SIMD co-processors.
[Figure: Cell architecture diagram, from the Linux Cell documentation.]
7 / 47
CUDA Overview
Compute Unified Device Architecture:
NVIDIA's architecture for processing on GPUs.
C for CUDA predates OpenCL; NVIDIA supports it first and foremost.
It may be faster than OpenCL on NVIDIA hardware.
The API allows you to use (most) C++ features in CUDA; OpenCL has more restrictions.
8 / 47
GPU Programming Model
The abstract model is simple:
Write the code for the parallel computation (kernel) separately from the main code.
Transfer the data to the GPU co-processor (or execute it on the CPU).
Wait . . .
Transfer the results back.
9 / 47
Data Parallelism
Key idea: evaluate a function (or kernel) over a set of points (data).
This is another example of data parallelism.
Another name for the set of points: index space. Each point corresponds to a work-item.
Note: OpenCL also supports task parallelism (using different kernels), but documentation is sparse.
10 / 47
Work-Items
Work-item: the fundamental unit of work in OpenCL.
Work-items are stored in an n-dimensional grid (ND-Range), e.g. a 2D grid.
OpenCL spawns a bunch of threads to handle work-items.
When executing, the range is divided into work-groups, which execute on the same compute unit.
The group of work-items that executes together in lock-step has a vendor-specific name:
NVIDIA: warp
AMD/ATI: wavefront
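To make the ND-Range idea concrete, here is a minimal sketch (not from the lecture) of a kernel run over an assumed 2D index space; the kernel name, arguments, and the width-based indexing are illustrative only:

// Sketch: one work-item per (x, y) point of a 2D ND-Range.
kernel void scale2d(global float *img, int width, float factor) {
    int x = get_global_id(0);   // position in dimension 0
    int y = get_global_id(1);   // position in dimension 1
    img[y * width + x] *= factor;
}

Each work-item reads its own coordinates with get_global_id() and touches exactly one element; the work-group subdivision is handled separately when the kernel is enqueued.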
11 / 47
Work-Items: Three more details
One thread per work item, each with a different thread ID.
You can say how to divide the ND-Range into work-groups, or the system can do it for you (see the sketch below).
The scheduler assigns work-items to warps/wavefronts until none are left.
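As a hedged illustration of dividing the ND-Range (host-side C API; it assumes a queue and kernel already exist, and the sizes are made up):

size_t global = 1024;   // total number of work-items
size_t local  = 64;     // work-items per work-group (must divide global evenly)

// Let the system pick the work-group size:
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,   0, NULL, NULL);

// Or specify it yourself:
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);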
12 / 47
Shared Memory
There are many different types of memory available to you:
private memory: available to a single work-item;
local memory (aka shared memory): shared between work-items belonging to the same work-group; like a user-managed cache;
global memory: shared between all work-items as well as the host;
constant memory: resides on the GPU and is cached; it does not change.
There is also host memory (normal memory); it usually contains app data.
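A minimal kernel sketch showing how these memory types appear as address-space qualifiers (the kernel and argument names are made up for illustration):

kernel void memory_spaces(global float *data,       // global: host and all work-items
                          constant float *coeffs,   // constant: read-only, cached
                          local float *scratch) {   // local: shared within one work-group
    float tmp;                                      // private: this work-item only
    int lid = get_local_id(0);
    scratch[lid] = data[get_global_id(0)];
    tmp = scratch[lid] * coeffs[0];
    data[get_global_id(0)] = tmp;
}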
13 / 47
Example Kernel
Here's some traditional code to evaluate c[i] = a[i] * b[i]:

void traditional_mul(int n,
                     const float *a,
                     const float *b,
                     float *c) {
    int i;
    for (i = 0; i < n; i++) c[i] = a[i] * b[i];
}

And as a kernel:

kernel void opencl_mul(global const float *a,
                       global const float *b,
                       global float *c) {
    int id = get_global_id(0); // dimension 0
    c[id] = a[id] * b[id];
}
14 / 47
Restrictions when writing kernels in OpenCL
It's mostly C, but:
No function pointers.
No bit-fields.
No variable-length arrays.
No recursion.
No standard headers.
15 / 47
OpenCLs Additions to C in Kernels
In kernels, you can also use:
Work-items.
Work-groups.
Vectors.
Synchronization.
Declarations of memory type.
A kernel-specific library.
16 / 47
Branches in kernels
kernel void contains_branch(global float *a,
                            global float *b) {
    int id = get_global_id(0);
    if (cond) {        // cond: some condition (placeholder)
        a[id] += 5.0;
    } else {
        b[id] += 5.0;
    }
}
The hardware will execute all branches that any thread in a warp executes; this can be slow!
In other words: an if statement will cause each thread to execute both branches; we keep only the result of the taken branch.
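One common workaround, sketched here under an assumed condition (a[id] > 0.0f is made up), is to replace the branch with arithmetic so every work-item executes the same instructions:

// Sketch: a branchless version of the update above, so no divergence occurs.
kernel void branchless(global float *a, global float *b) {
    int id = get_global_id(0);
    float flag = (a[id] > 0.0f) ? 1.0f : 0.0f;  // typically compiles to a select, not a branch
    a[id] += 5.0f * flag;                        // applied only where the condition holds
    b[id] += 5.0f * (1.0f - flag);               // applied only where it does not
}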
17 / 47
Loops in kernels
kernel void contains_loop(global float *a,
                          global float *b) {
    int id = get_global_id(0);

    for (int i = 0; i < id; i++) {
        b[i] += a[i];
    }
}
A loop will cause the workgroup to wait for the maximum numberof iterations of the loop in any work-item.
Note: when you set up work-groups, best to arrange for allwork-items in a workgroup to execute the same branches & loops.
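For example, a sketch (the names and the uniform bound n are assumptions) where every work-item runs the same trip count, so no single work-item stalls its wavefront/warp:

// Sketch: the loop bound n is a kernel argument, identical for all work-items.
kernel void uniform_loop(global const float *a, global float *b, int n) {
    int id = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {   // same number of iterations everywhere
        sum += a[i];
    }
    b[id] = sum;
}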
18 / 47
Synchronization
Different work-groups execute independently.
You can only put barriers and memory fences between work-items in the same work-group.
OpenCL supports:
Memory fences (load and store).
Barriers.
volatile (beware!)
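A minimal sketch of a barrier in action (a per-work-group sum; the kernel is illustrative, not from the lecture): work-items first publish to local memory, the barrier makes those stores visible, then one work-item combines them.

kernel void group_sum(global const float *in, global float *out,
                      local float *scratch) {
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);            // wait until every store to scratch is visible
    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < (int) get_local_size(0); i++)
            sum += scratch[i];
        out[get_group_id(0)] = sum;          // one result per work-group
    }
}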
19 / 47
Part II
Programming with OpenCL
20 / 47
Introduction
Today, we'll see how to program with OpenCL.
We're using OpenCL 1.1.
There is a lot of initialization and querying.
When you compile your program, include -lOpenCL.
You can find the official documentation here:http://www.khronos.org/opencl/
More specifically:http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
Let's just dive into an example.
21 / 47
First, reminders
All data belongs to an ND-Range.
The range can be divided into work-groups (in software).
The work-groups run on wavefronts/warps (in hardware).
Each wavefront/warp executes work-items.
All branches in a wavefront/warp should execute the same path.
If an iteration of a loop takes t: when one work-item executes 100 iterations, the total time to complete the wavefront/warp is 100t.
22 / 47
Part III
Simple Example
23 / 47
Simple Example (1)

// Note by PL: don't use this example as a template;
// it uses the C bindings! Instead, use the C++ bindings.
// source: pages 1-9 through 1-11,
// http://developer.amd.com/wordpress/media/2013/07/
// AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )            \n"
"{                                                      \n"
"    dst[get_global_id(0)] = get_global_id(0);          \n"
"}                                                      \n";

int main(int argc, char **argv)
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );
24 / 47
Explanation (1)
Include the OpenCL header.
Request a platform (also known as a host).
A platform contains compute devices:GPUs or CPUs.
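If you need to know what is available before choosing, here is a hedged sketch of the usual count-then-query pattern (assumes <CL/cl.h> and <stdlib.h>; error checks omitted):

cl_uint num_platforms, num_devices;
clGetPlatformIDs(0, NULL, &num_platforms);               // first call: just count
cl_platform_id *platforms =
    malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL);        // second call: fetch the IDs
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL,
               0, NULL, &num_devices);                   // count devices on platform 0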
25 / 47
Simple Example (2)
// 2. Find a GPU device.
cl_device_id device;
clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU,
                1,
                &device,
                NULL );

// 3. Create a context and command queue on that device.
cl_context context = clCreateContext( NULL,
                                      1,
                                      &device,
                                      NULL, NULL, NULL );

cl_command_queue queue = clCreateCommandQueue( context,
                                               device,
                                               0, NULL );
26 / 47
Explanation (2)
Request a GPU device.
Request an OpenCL context (representing all of OpenCL's state).

Create a command queue:
we get OpenCL to do work by telling it to run a kernel in the queue.
27 / 47
Simple Example (3)
// 4. Perform runtime source compilation, and obtain
//    kernel entry point.
cl_program program = clCreateProgramWithSource( context,
                                                1,
                                                &source,
                                                NULL,
                                                NULL );
clBuildProgram( program, 1, &device, NULL, NULL, NULL );
cl_kernel kernel = clCreateKernel( program, "memset", NULL );

// 5. Create a data buffer.
cl_mem buffer = clCreateBuffer( context,
                                CL_MEM_WRITE_ONLY,
                                NWITEMS * sizeof(cl_uint),
                                NULL, NULL );
28 / 47
Explanation (3)
We create an OpenCL program (runs on the compute unit), which contains:
kernels;
functions; and
declarations.
In this case, we create a kernel called memset from source.
OpenCL may also create programs from binaries (which may be in an intermediate representation).
Next, we need a data buffer (enables inter-device communication).
This program does not have any input, so we don't put anything into the buffer (we just declare its size).
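If the kernel did take input, one way (a sketch; host_data is a made-up array) is to copy host data into the buffer at creation time with CL_MEM_COPY_HOST_PTR:

cl_uint host_data[NWITEMS];   /* filled in by the host beforehand */
cl_mem in_buffer = clCreateBuffer( context,
                                   CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   NWITEMS * sizeof(cl_uint),
                                   host_data, NULL );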
29 / 47
Simple Example (4)
// 6. Launch the kernel. Let OpenCL pick the local work size.
size_t global_work_size = NWITEMS;
clSetKernelArg( kernel, 0, sizeof(buffer), (void *) &buffer );
clEnqueueNDRangeKernel( queue,
                        kernel,
                        1,                  // dimensions
                        NULL,               // initial offsets
                        &global_work_size,  // number of work-items
                        NULL,               // work-items per work-group
                        0, NULL, NULL );    // events
clFinish( queue );

// 7. Look at the results via synchronous buffer map.
cl_uint *ptr;
ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer,
                                      CL_TRUE, CL_MAP_READ,
                                      0, NWITEMS * sizeof(cl_uint),
                                      0, NULL, NULL, NULL );
30 / 47
Explanation (4)
We set the kernel's argument to buffer.
We launch the kernel, enqueuing the 1-dimensional index space starting at 0.
We specify that the index space has NWITEMS elements, and not to subdivide the program into work-groups.
There is also an event interface, which we do not use.

We copy the results back; the call is blocking (CL_TRUE), hence we don't need an explicit clFinish() call.
We specify that we want to read the results back into buffer.
31 / 47
Simple Example (5)
int i;
for (i = 0; i < NWITEMS; i++)
    printf("%d %d\n", i, ptr[i]);
return 0;
}
The program simply prints 0 0, 1 1, . . . , 511 511.
Note: I didn't clean up or include error handling for any of the OpenCL functions.
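For reference, a sketch of what the omitted error handling and cleanup might look like (checking one call's cl_int return value, then releasing the objects created above):

cl_int err = clGetPlatformIDs(1, &platform, NULL);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetPlatformIDs failed: %d\n", err);
    return 1;
}
/* ... same pattern for the other OpenCL calls ... */

clEnqueueUnmapMemObject(queue, buffer, ptr, 0, NULL, NULL);
clReleaseMemObject(buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);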
32 / 47
Part IV
Another Example
33 / 47
C++ Bindings
If we use the C++ bindings, we'll get automatic resource release and exceptions.
C++ likes to use the RAII style (Resource Acquisition Is Initialization).
Change the header to CL/cl.hpp and define __CL_ENABLE_EXCEPTIONS.
We'd also like to store our kernel in a file instead of a string.
The C API is not so nice to work with.
34 / 47
Vector Addition Kernel
Let's write a kernel that adds two vectors and stores the result.
This kernel will go in the file vector_add_kernel.cl.

__kernel void vector_add(__global const int *A,
                         __global const int *B,
                         __global int *C) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}
Other possible qualifiers: local, constant and private.
35 / 47
Vector Addition (1)

// Vector add example, C++ bindings (use these!)
// source:
// http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

#define __CL_ENABLE_EXCEPTIONS

#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Create the two input vectors
    const int LIST_SIZE = 1000;
    int *A = new int[LIST_SIZE];
    int *B = new int[LIST_SIZE];
    for (int i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }

36 / 47
Vector Addition (2)
try {
    // Get available platforms
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Select the default platform and create a context
    // using this platform and the GPU
    cl_context_properties cps[3] = {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)(platforms[0])(),
        0
    };
    cl::Context context(CL_DEVICE_TYPE_GPU, cps);

    // Get a list of devices on this platform
    std::vector<cl::Device> devices =
        context.getInfo<CL_CONTEXT_DEVICES>();

    // Create a command queue and use the first device
    cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
37 / 47
Explanation (2)
You can define __NO_STD_VECTOR and use cl::vector (same with strings).

You can enable profiling by adding CL_QUEUE_PROFILING_ENABLE as the third argument to the queue constructor.
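With profiling enabled, timings come from events. A hedged sketch using the C API calls (it assumes an event obtained from the enqueue call's last argument, and that the command has already completed, e.g. via clWaitForEvents):

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
/* (end - start) is the command's execution time in nanoseconds */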
38 / 47
Vector Addition (3)
// Read source file
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode(
    std::istreambuf_iterator<char>(sourceFile),
    (std::istreambuf_iterator<char>())
);
cl::Program::Sources source(
    1,
    std::make_pair(sourceCode.c_str(),
                   sourceCode.length()+1)
);

// Make program of the source code in the context
cl::Program program = cl::Program(context, source);

// Build program for these specific devices
program.build(devices);

// Make kernel
cl::Kernel kernel(program, "vector_add");
39 / 47
Vector Addition (4)
// Create memory buffers
cl::Buffer bufferA = cl::Buffer(
    context,
    CL_MEM_READ_ONLY,
    LIST_SIZE * sizeof(int)
);
cl::Buffer bufferB = cl::Buffer(
    context,
    CL_MEM_READ_ONLY,
    LIST_SIZE * sizeof(int)
);
cl::Buffer bufferC = cl::Buffer(
    context,
    CL_MEM_WRITE_ONLY,
    LIST_SIZE * sizeof(int)
);
40 / 47
Vector Addition (5)
// Copy lists A and B to the memory buffers
queue.enqueueWriteBuffer(
    bufferA,
    CL_TRUE,
    0,
    LIST_SIZE * sizeof(int),
    A
);
queue.enqueueWriteBuffer(
    bufferB,
    CL_TRUE,
    0,
    LIST_SIZE * sizeof(int),
    B
);

// Set arguments to kernel
kernel.setArg(0, bufferA);
kernel.setArg(1, bufferB);
kernel.setArg(2, bufferC);
41 / 47
Explanation (5)
enqueue*Buffer arguments:
buffer
cl_bool blocking_write
::size_t offset
::size_t size
const void *ptr
42 / 47
Vector Addition (6)
// Run the kernel on a specific ND range
cl::NDRange global(LIST_SIZE);
cl::NDRange local(1);
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    global,
    local
);

// Read buffer C into a local list
int *C = new int[LIST_SIZE];
queue.enqueueReadBuffer(
    bufferC,
    CL_TRUE,
    0,
    LIST_SIZE * sizeof(int),
    C
);
43 / 47
Vector Addition (7)
    for (int i = 0; i < LIST_SIZE; i++) {
        std::cout << A[i] << " + " << B[i] << " = " << C[i] << std::endl;
    }
} catch (cl::Error error) {
    std::cout << error.what() << "(" << error.err() << ")" << std::endl;
}
return 0;
}

44 / 47
Other Improvements
The host memory is still unreleased.
With the same number of lines, we could use the C++11 unique_ptr, which would free the memory for us.
You can use a std::vector instead of an array, and pass &v[0] wherever a raw pointer is needed.
Valid as long as the vector is not resized.
45 / 47
OpenCL Programming Summary
Went through real OpenCL examples.
Have the reference card for the API handy.
Saw a C++ template for setting up OpenCL.
Aside: if you're serious about programming in C++, check out Effective C++ by Scott Meyers (slightly dated with C++11, but it still has some good stuff).
46 / 47
Overall summary
First Half: Brief overview of OpenCL and its programming model.
Many concepts are similar to plain parallel programming (but with more structure).
Second Half: Looked at an OpenCL implementation and how to organize it.
Need to write lots of boilerplate!
47 / 47