Productive OpenCL Programming: An Introduction to OpenCL Libraries with ArrayFire COO Oded Green

Post on 13-Jan-2015


description

In this webinar presentation, ArrayFire COO Oded Green demonstrates best practices to help you quickly get started with OpenCL™ programming. Learn how to get the best performance from AMD hardware in various programming languages using ArrayFire. Oded discusses the latest advancements in the OpenCL™ ecosystem, including cutting-edge OpenCL™ libraries such as clBLAS, clFFT, clMAGMA and ArrayFire. Examples are shown in real code for common application domains.

Watch the webinar here: http://bit.ly/1obT0M2
For more developer resources, visit: http://arrayfire.com/ and http://developer.amd.com/
Follow us on Twitter: https://twitter.com/AMDDevCentral
See the slides for more contact information and resource links.

transcript

An Introduction to OpenCL Libraries

Productive OpenCL Programming

● We make code run faster
  ○ Started in 2007 by Georgia Tech researchers
  ○ 1000s of paying customers

● We build an acceleration library
  ○ for really cool science, engineering, and finance applications
  ○ for mobile computing

Libraries are Great!

Eliminate Hidden Costs

Library Types

● Specialized GPU Libs
  ○ Targeted at a specific set of operators (functionality)
  ○ Optimized for specific systems
  ○ C-like interface
  ○ Raw pointer interface

● General GPU Libs
  ○ Manage GPU resources using containers
  ○ Applicable to a large set of applications and domains
  ○ Portable across multiple architectures
  ○ Higher level functions
  ○ C++ interface (supports templates)

Specialized GPU Libraries

● Fast Fourier Transforms
  ○ clFFT

● Random Number Generation
  ○ Random123

● Linear Algebra
  ○ clBLAS
  ○ MAGMA

● Signal and Image Processing
  ○ OpenCLIPP

Specialized GPU Libraries

● C Interface
  ○ Use pointers to reference data

● Memory management is programmer responsibility

● Mimic existing libraries
  ○ clBLAS ≈ BLAS
  ○ MAGMA ≈ BLAS + LAPACK
  ○ clFFT ≈ FFTW

● Simplifies GPU integration of specialized scientific libraries
  ○ Still requires setting up the GPU

clFFT

● 1D, 2D and 3D transforms
● CPU and GPU backends
● Supports
  ○ Real and complex data types
  ○ Single and double-precision
  ○ Execution of multiple transformations concurrently

Random123

● Counter-based RNG
● Passed SmallCrush, Crush and BigCrush tests
● Four RNG families
  ○ Threefry
  ○ Philox
  ○ AESNI
  ○ ARS
● Not suitable for cryptography
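To make the "counter-based" idea concrete, here is a minimal sketch in plain C++. It is not the Random123 API and it does not implement Threefry or Philox: the names `mix64`, `cbrng` and `cbrng_uniform` are hypothetical, and the mixing step is the SplitMix64 finalizer, swapped in only because it is short. What it shows is the defining property: each value is a pure function of a key and a counter, so any work item can compute its own random numbers without shared state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mixing step (the SplitMix64 finalizer). It stands in for the
// real Threefry/Philox rounds, which are considerably more involved.
std::uint64_t mix64(std::uint64_t z) {
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Counter-based generator: the output depends only on (key, counter),
// so there is no hidden state to advance or to share between threads.
std::uint64_t cbrng(std::uint64_t key, std::uint64_t counter) {
    return mix64(mix64(key) ^ counter);
}

// Map the top 53 bits of the output to a double in [0, 1).
double cbrng_uniform(std::uint64_t key, std::uint64_t counter) {
    const double u = static_cast<double>(cbrng(key, counter) >> 11) *
                     (1.0 / 9007199254740992.0);  // divide by 2^53
    assert(u >= 0.0 && u < 1.0);
    return u;
}
```

In a GPU setting the counter would typically be derived from the work-item index and the key from a per-run seed, which is why this style of generator parallelizes so easily.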

MAGMA & clBLAS

● Implements many popular linear algebra routines
● Supports
  ○ Real and complex data types
  ○ Single and double-precision

OpenCLIPP

● Supports multiple image types
● Similar to Intel IPP
● Primitives
  ○ Arithmetic and logic
  ○ LUT
  ○ Morphology
  ○ Transform
  ○ Resize
  ○ Histogram
  ○ Many more…
● C and C++ interface

General-Purpose GPU Libraries

● Bolt
● OpenCV
● ArrayFire

Images taken from: http://wordlesstech.com/2012/10/12/leatherman-oht-multi-tool/

Bolt

● GPU library which resembles C++ STL
  ○ STL like data structures
  ○ Iterators
  ○ Fully interoperable with OpenCL

● Parallel vector operation methods
  ○ Reductions
  ○ Sorting
  ○ Prefix-Sum

● Customizable GPU kernels using functors
● Some functions only supported on AMD GPUs

Bolt - Data Structures

● Built around the device_vector
● Supports the same data types as C++
  ○ device_vector<float> data(2e6);
● Useful when performing multiple operations on a vector

● Can be passed into STL algorithms
  ○ Allows interoperability
  ○ Data transfer will be costly

Bolt - Algorithms

● Uses a C++ STL like interface
  ○ Pass the begin and end iterators

● Accept functors which allow you to run custom operations on OpenCL devices

● Multiple backends
  ○ OpenCL, C++AMP, and TBB
  ○ Not all algorithms implemented across all backends

● Works on vector and device_vector
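For readers who have not used the STL idiom that Bolt mirrors, the sketch below shows it with the host-side standard library only. It deliberately avoids Bolt's own headers and namespaces, so none of its signatures are asserted here; the helper names (`Saxpy`, `saxpy`, `reduce_sum`, `prefix_sum`, `sort_descending`) are hypothetical. The pattern is the one the slides describe: pass begin and end iterators, and supply a functor for the custom operation.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// User-defined functor: the custom operation handed to the algorithm.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Element-wise a*x + y using begin/end iterators plus a functor.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy{a});
    return out;
}

// Reduction: sum of all elements.
float reduce_sum(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Inclusive prefix sum (scan).
std::vector<int> prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Sorting with a comparison functor.
std::vector<int> sort_descending(std::vector<int> v) {
    std::sort(v.begin(), v.end(), std::greater<int>());
    return v;
}
```

The appeal of an STL-like GPU library is that code shaped like this needs few structural changes: the container and the algorithm namespace change, while the iterator-plus-functor calling style stays the same.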

OpenCV

● Open source computer vision library
● C++ interface with many language wrappers
● Hundreds of CV functions

OpenCV ArrayFire Interop

● Helper Functions
  ○ https://github.com/arrayfire-community/arrayfire_opencv.git

Mat R;
Rodrigues(poses(Rect(0, 0, 1, 3)), R);
af::array af_R = mat_to_array(R);

ArrayFire - Data Structures

● Built around a flexible data structure named "array"
  ○ Lightweight wrapper around the data on the compute device
  ○ Manages the data and basic metadata such as size, type and dimensions

● You can transfer data into an array using constructors
● Column major

float hA[6] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);

ArrayFire - Indexing

#include <arrayfire.h>
#include <af/utils.h>

using namespace af; // array, span and sum live in the af namespace

void af_example()
{
    float f[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    array a(2, 4, f); // 2 rows x 4 cols, filled column by column from f
    array sumSecondCol = sum(a(span, 1)); // reduce-sum over the second column (4 + 8)
    print(sumSecondCol); // 12
}
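The result 12 in the indexing example follows directly from the column-major layout. The sketch below reproduces that arithmetic on the host in plain C++, without ArrayFire; `ColMajorView` and its members are hypothetical names used only for illustration. Element (r, c) of a matrix with `rows` rows lives at offset `r + c * rows`, so the second column of the 2 x 4 array holds the values 4 and 8.

```cpp
#include <cassert>
#include <cstddef>

// Read-only view of a host buffer interpreted as a column-major matrix.
struct ColMajorView {
    const float* data;
    std::size_t rows;
    std::size_t cols;

    // Element (r, c) is stored at offset r + c * rows.
    float at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r + c * rows];
    }

    // Sum of one column, the host-side analogue of sum(a(span, c)).
    float column_sum(std::size_t c) const {
        float s = 0.0f;
        for (std::size_t r = 0; r < rows; ++r) s += at(r, c);
        return s;
    }
};
```

With `hA = {0, 1, 2, 3, 4, 5}` viewed as 2 x 3, the first column is {0, 1}, the second is {2, 3} and the third is {4, 5}.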

ArrayFire Example - swap R and B

Using ArrayFire:

array tmp = img(span,span,0);         // save the R channel
img(span,span,0) = img(span,span,2);  // R channel gets values of B
img(span,span,2) = tmp;               // B channel gets value of R

Can also do it this way:

array swapped = join(2, img(span,span,2),  // blue
                        img(span,span,1),  // green
                        img(span,span,0)); // red

Or simply:

array swapped = img(span,span,seq(2,-1,0));

Using ArrayFire:

array img = loadimage("image.jpg", false); // load grayscale image from disk to device
array img_T = img.T(); // transpose

ArrayFire Functions

[Figure: sample images labeled Original, Grayscale, Box filter blur, Gaussian blur and Image Negative]

Erosion

ArrayFire

// erode an image, 8-neighbor connectivity
array mask8 = constant(1, 3, 3);
array img_out8 = erode(img_in, mask8);

// erode an image, 4-neighbor connectivity
const float h_mask4[] = { 0.0, 1.0, 0.0,
                          1.0, 1.0, 1.0,
                          0.0, 1.0, 0.0 };
array mask4 = array(3, 3, h_mask4);
array img_out4 = erode(img_in, mask4);
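To show what the two masks change, here is a small host-side reference for 3 x 3 erosion in plain C++. It is a sketch under stated assumptions, not ArrayFire's implementation: `Image` and `erode3x3` are hypothetical names, storage is column-major, and neighbors outside the image are simply skipped (the library's border handling may differ). Each output pixel is the minimum of the input pixels selected by the nonzero mask entries.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Minimal column-major image container (hypothetical, for illustration only).
struct Image {
    int rows;
    int cols;
    std::vector<float> px;

    float at(int r, int c) const { return px[r + c * rows]; }
    float& at(int r, int c) { return px[r + c * rows]; }
};

// Erosion with a 3 x 3 structuring element: each output pixel is the minimum
// over the neighbors whose mask entry is nonzero. Assumption: out-of-bounds
// neighbors are ignored.
Image erode3x3(const Image& in, const std::array<float, 9>& mask) {
    assert(in.px.size() == static_cast<std::size_t>(in.rows) *
                               static_cast<std::size_t>(in.cols));
    Image out{in.rows, in.cols, std::vector<float>(in.px.size())};
    for (int c = 0; c < in.cols; ++c) {
        for (int r = 0; r < in.rows; ++r) {
            float m = std::numeric_limits<float>::max();
            for (int dc = -1; dc <= 1; ++dc) {
                for (int dr = -1; dr <= 1; ++dr) {
                    if (mask[(dr + 1) + (dc + 1) * 3] == 0.0f) continue;
                    const int rr = r + dr;
                    const int cc = c + dc;
                    if (rr < 0 || rr >= in.rows || cc < 0 || cc >= in.cols) continue;
                    m = std::min(m, in.at(rr, cc));
                }
            }
            out.at(r, c) = m;
        }
    }
    return out;
}
```

With an all-ones image that has a single zero in one corner, the 8-neighbor mask pulls that zero into the diagonally adjacent pixel, while the 4-neighbor (cross-shaped) mask does not, because the diagonal entry of the mask is zero.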

Filtering

ArrayFire

array R = convolve(img, ker);        // 1, 2 and 3d convolution filter
array R = convolve(fcol, frow, img); // Separable convolution
array R = filter(img, ker);          // 2d correlation filter
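The slide distinguishes a convolution filter from a correlation filter. The difference is only whether the kernel is flipped, which the following 1D host-side sketch makes explicit. It is plain C++ rather than ArrayFire code; `convolve1d` and `correlate1d` are hypothetical names, and the "full" output length (signal length + kernel length - 1) is an assumption chosen for simplicity, not a statement about the library's output size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Full 1D convolution: out[n] = sum over k of s[n - k] * ker[k].
std::vector<float> convolve1d(const std::vector<float>& s,
                              const std::vector<float>& ker) {
    assert(!s.empty() && !ker.empty());
    std::vector<float> out(s.size() + ker.size() - 1, 0.0f);
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = 0; j < ker.size(); ++j)
            out[i + j] += s[i] * ker[j];
    return out;
}

// Correlation slides the kernel without flipping it, which is the same as
// convolving with the reversed kernel.
std::vector<float> correlate1d(const std::vector<float>& s,
                               const std::vector<float>& ker) {
    const std::vector<float> flipped(ker.rbegin(), ker.rend());
    return convolve1d(s, flipped);
}
```

For a symmetric kernel such as {1, 2, 1} the two operations give identical results; for an antisymmetric kernel such as {1, 0, -1} the outputs differ in sign.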

Histograms

ArrayFire

int nbins = 256;
array hist = histogram(img, nbins);
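As a point of reference for what a histogram call computes, here is a host-side sketch in plain C++. The function name `histogram_ref` and its explicit `lo`/`hi` range parameters are assumptions made for this example; they are not taken from the ArrayFire signature shown on the slide. Values are counted into `nbins` equal-width bins over [lo, hi], and values outside that range are skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count values into nbins equal-width bins covering [lo, hi].
// Assumption: out-of-range values are ignored; x == hi goes in the last bin.
std::vector<unsigned> histogram_ref(const std::vector<float>& values,
                                    int nbins, float lo, float hi) {
    assert(nbins > 0 && hi > lo);
    std::vector<unsigned> counts(static_cast<std::size_t>(nbins), 0u);
    for (float x : values) {
        if (x < lo || x > hi) continue;
        int bin = static_cast<int>((x - lo) / (hi - lo) * static_cast<float>(nbins));
        if (bin == nbins) bin = nbins - 1;  // upper edge belongs to the last bin
        ++counts[static_cast<std::size_t>(bin)];
    }
    return counts;
}
```

On a GPU the same computation is usually organized as many threads incrementing shared bin counters, which is why a tuned library routine is preferable to a hand-written loop.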

Transforms

ArrayFire

array half = resize(0.5, img);
array rot90 = rotate(img, af::Pi/2);
array warped = approx2(img, xLocations, yLocations);

Image smoothing

ArrayFire

array S = bilateral(I, sigma_r, sigma_c);

array M = meanshift(I, sigma_r, sigma_c, iter);

array R = medfilt(img, 3, 3);

// Gaussian blur

array gker = gaussiankernel(ncols, ncols);

array res = convolve(img, gker);

FFT

ArrayFire

array R1 = fft2(I); // 2d fft. check fft, fft3

array R2 = fft2(I, M, N); // fft2 with padding

array R3 = ifft2(fft2(I, M, N) * fft2(K, M, N)); // convolve using fft2
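The last line above relies on the convolution theorem: multiplying two spectra element-wise and transforming back yields a convolution, provided both inputs are zero-padded to the full output size first (the role of M and N in the slide). The sketch below checks this in 1D with a naive O(n^2) DFT in plain C++. It is an illustration only: `dft` and `fft_convolve` are hypothetical names, no FFT library is used, and the 2D case is not covered.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(n^2) discrete Fourier transform; inverse = true applies 1/n scaling.
cvec dft(const cvec& x, bool inverse) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    cvec X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi * static_cast<double>(k * t) /
                                 static_cast<double>(n);
            acc += x[t] * std::polar(1.0, angle);
        }
        X[k] = inverse ? acc / static_cast<double>(n) : acc;
    }
    return X;
}

// Linear convolution via the frequency domain: zero-pad both inputs to
// a.size() + b.size() - 1, multiply the spectra, transform back.
std::vector<double> fft_convolve(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    const std::size_t n = a.size() + b.size() - 1;
    cvec A(n), B(n);  // value-initialized to zero, which provides the padding
    for (std::size_t i = 0; i < a.size(); ++i) A[i] = a[i];
    for (std::size_t i = 0; i < b.size(); ++i) B[i] = b[i];
    A = dft(A, false);
    B = dft(B, false);
    for (std::size_t i = 0; i < n; ++i) A[i] *= B[i];
    const cvec c = dft(A, true);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = c[i].real();
    return out;
}
```

Without the padding, the product of the spectra corresponds to a circular convolution, and the tail of the result wraps around onto its beginning.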

ArrayFire Capabilities

● Hundreds of parallel functions for multi-disciplinary work
  ○ Image processing
  ○ Machine learning
  ○ Graphics
  ○ Sets

● Support for multiple languages
  ○ C/C++, Fortran, Java and R

● Linux, Windows, Mac OS X

ArrayFire Capabilities

● OpenGL based graphics
● JIT
  ○ Combine multiple operations into one kernel
● GFOR - data parallel loop
  ○ Allows concurrent execution over multiple data sets (for example images)

ArrayFire Functions

● Supports hundreds of parallel functions
  ○ Building blocks
    ■ Reductions
    ■ Scan
    ■ Set operations
    ■ Sorting
    ■ Statistics
    ■ Basic matrix manipulation

Images taken from:
http://technogems.blogspot.com/2011/06/sorting-included-files-by-importance.html
http://www.cmsoft.com.br/tutorialOpenCL/CLMatrixMultExplanationSubMatrixes.png

ArrayFire Functions

● Hundreds of highly-optimized parallel functions
  ○ Signal/image processing
    ■ Convolution
    ■ FFT
    ■ Histograms
    ■ Interpolation
    ■ Connected components
  ○ Linear Algebra
    ■ Matrix multiply
    ■ Linear system solving
    ■ Factorization

GFOR: What is it?

• Data-Parallel for loop, e.g.

Serial matrix-vector multiplications (3 kernel launches):
for (i = 0; i < 3; i++) C(span,span,i) = A(span,span,i) * B;

Parallel matrix-vector multiplications (1 kernel launch):
gfor (array i, 3) C(span,span,i) = A(span,span,i) * B;

Example: Matrix Multiply

for (i = 0; i < 3; i++) C(span,span,i) = A(span,span,i) * B;

Serial matrix-vector multiplications (3 kernel launches)

[Figure: three frames built up one iteration at a time. Iteration i = 1 computes C(,,1) = A(,,1) * B, iteration i = 2 computes C(,,2) = A(,,2) * B, and iteration i = 3 computes C(,,3) = A(,,3) * B.]

gfor (array i, 3) C(span,span,i) = A(span,span,i) * B;

Parallel matrix multiplications (1 kernel launch)

[Figure: simultaneous iterations i = 1:3, computing C(,,1:3) = A(,,1:3) * B as one stacked multiplication.]

Think of GFOR as compiling 1 stacked kernel with all iterations.
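The serial loop and the gfor form compute the same result; what changes is how the work is issued. The host-side sketch below models that in plain C++. It does not use ArrayFire or launch any kernels: `Stack`, `multiply_serial` and `multiply_batched` are hypothetical names, and the `launches` counter only stands in for the number of kernel launches. In the batched form the slice index becomes one more dimension of a single flat index space, which is possible because the iterations are independent of each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stack of `batch` square n x n matrices, column-major within each slice.
struct Stack {
    std::size_t n;
    std::size_t batch;
    std::vector<float> v;

    float& at(std::size_t i, std::size_t j, std::size_t b) { return v[i + j * n + b * n * n]; }
    float at(std::size_t i, std::size_t j, std::size_t b) const { return v[i + j * n + b * n * n]; }
};

// One element of C(:,:,b) = A(:,:,b) * B, where B is a single n x n matrix.
float dot_row_col(const Stack& A, const std::vector<float>& B,
                  std::size_t i, std::size_t j, std::size_t b) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < A.n; ++k) acc += A.at(i, k, b) * B[k + j * A.n];
    return acc;
}

// Serial form: the outer loop issues one multiplication per slice.
Stack multiply_serial(const Stack& A, const std::vector<float>& B, int& launches) {
    assert(B.size() == A.n * A.n);
    Stack C{A.n, A.batch, std::vector<float>(A.v.size())};
    launches = 0;
    for (std::size_t b = 0; b < A.batch; ++b) {
        ++launches;  // models one kernel launch per slice
        for (std::size_t j = 0; j < A.n; ++j)
            for (std::size_t i = 0; i < A.n; ++i)
                C.at(i, j, b) = dot_row_col(A, B, i, j, b);
    }
    return C;
}

// Batched form: one flat index space covers every element of every slice.
Stack multiply_batched(const Stack& A, const std::vector<float>& B, int& launches) {
    assert(B.size() == A.n * A.n);
    Stack C{A.n, A.batch, std::vector<float>(A.v.size())};
    launches = 1;  // models a single stacked launch
    const std::size_t total = A.n * A.n * A.batch;
    for (std::size_t idx = 0; idx < total; ++idx) {
        const std::size_t i = idx % A.n;
        const std::size_t j = (idx / A.n) % A.n;
        const std::size_t b = idx / (A.n * A.n);
        C.v[idx] = dot_row_col(A, B, i, j, b);
    }
    return C;
}
```

On the host both versions run sequentially, so the sketch says nothing about speed; it only shows why a loop whose iterations do not depend on each other can be flattened into one launch.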

JIT Code Generation

● Run time kernel generation
● Combines multiple element wise operations into one kernel
● Reduces kernel launching overhead
● Intermediate data not allocated
● Improves cache performance
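The benefits listed above can be seen in a hand-written comparison. The sketch below, in plain C++, evaluates (a + b) * c in two ways: as two separate element-wise passes with a temporary buffer, and as one fused loop. It is not runtime code generation and not ArrayFire code; the function names are hypothetical, and the fused loop is written by hand to show the kind of kernel a JIT could emit for this expression.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using vec = std::vector<float>;

// Separate element-wise "kernels": each one makes a full pass over the data.
vec add(const vec& a, const vec& b) {
    assert(a.size() == b.size());
    vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

vec mul(const vec& a, const vec& b) {
    assert(a.size() == b.size());
    vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i];
    return out;
}

// Unfused: two passes and one intermediate buffer.
vec unfused(const vec& a, const vec& b, const vec& c) {
    const vec tmp = add(a, b);  // intermediate result is allocated and written
    return mul(tmp, c);         // then read back in a second pass
}

// Fused: one pass, no intermediate buffer, each input element read once.
vec fused(const vec& a, const vec& b, const vec& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = (a[i] + b[i]) * c[i];
    return out;
}
```

Both functions return the same values; the fused version simply avoids the temporary allocation and the extra trip through memory, which corresponds to the "intermediate data not allocated" and "improves cache performance" bullets.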

Success Stories

Field                      Application                  Speedup
Academia                   Power Systems Simulations    35x
Finance                    Option Pricing               52x
Government                 Radar Image Formation        45x
Life Sciences              Pathology Advances           > 100x
Manufacturing              Tomography of Vegetation     10x
Media & Computer Vision    Digital Holography           17x
Oil & Gas                  Ground Water Simulations     > 20x

Future capabilities

● We are interested in Big Data applications
● Create capabilities for
  ○ Streaming video
  ○ Large number of images
  ○ Machine learning
  ○ Data analysis
  ○ Dynamic data
● Faster rendering utilities for Big Data

Comments on Open Source

● https://github.com/arrayfire-community

Q & A

Speaker: Oded Green (oded@arrayfire.com)

Engineers: Umar Arshad (umar@ArrayFire.com)

Pavan Yalamanchili (pavan@ArrayFire.com)

Sales: Scott Blakeslee (scott@ArrayFire.com)

Look us up

www.ArrayFire.com

For language wrappers and examples: https://github.com/ArrayFire