PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Post on 07-Dec-2014


description

Presentation PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

transcript

Software Libraries for CUDA & OpenCL

Heterogeneous Computing is Hard

Two Examples:

1. Median Filtering

2. Local Windowing

Median Filtering

Increasingly Difficult

Local Windowing

The best algorithm to use depends on which device is in the system.

Runtime (ms)   Device 1   Device 2   Device 3   Device 4
Algorithm 1       395        599        244        102
Algorithm 2       270        703        241        103
Algorithm 3       699        407        138        116
Algorithm 4       380        522        202         98

Why Software Libraries Are Great

Reduce many lines of code to one line

Obsessively tuned by experts; faster than DIY

Well-tested and maintained

Continuously improving

Five Influencers (besides price)

Portability, Scalability, Community, Programmability, Performance

Performance & Programmability

[Chart: approaches placed on two axes, performance (Slower to Faster) and programmability (Time-consuming to Easy-to-use). Approaches shown, added one per slide: SSE or AVX, Writing Kernels, Compiler Directives, Using Libraries.]

Performance


Portability

Flavors of portability

HW vendor options

Accelerator options (GPU, coprocessor, FPGA)

CPU fallback

High-performance mobile computing

Libraries can provide portability

Scalability

Always start with one device

Potential headaches of adding devices

Performance hit

Development complexity

Libraries can make scaling easy

Community

What do you do when bugs arise?

Continuous refinement

Someone to answer questions

Libraries can have great community support

Benefits of Using a Library

Development

Documentation

Test and QA

Maintenance

Porting

TIME

COST

Libraries eliminate hidden costs of software development

Pain Pleasure

ArrayFire: Technical Computing

Performance & Programmability

Super easy to program

Highly optimized

Portability

Scalability

Multi-GPU is one line of code

array *y = new array[n];
for (int i = 0; i < n; ++i) {
    deviceset(i);            // change GPUs
    array x = randu(5,5);    // add work to GPU's queue
    y[i] = fft(x);           // more work in queue
}
// all GPUs are now computing simultaneously

Community

Over 8,000 posts at

http://forums.accelereyes.com

Nightly library update releases

Stable releases a few times a year

v2.0 coming at the end of summer

Example Case Studies 1

45X: Radar Imaging (System Planning)
17X: Neuro-imaging (Georgia Tech)
20X: Video Processing (Google)
12X: Medical Devices (Spencer Tech)
20X: Viral Analyses (CDC)

Example Case Studies 2

70X: Drug Delivery (Georgia Tech)
5X: Weather Models (NCAR)
17X: Surveillance (BAE Systems)
35X: Bioinformatics (Leibnitz)
35X: Power Eng (IIT India)

Hundreds of Functions

reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc.

convolutions
• 2D, 3D, ND

dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power

FFTs
• 2D, 3D, ND

image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms

interpolate & scale
• vectors, matrices
• rescaling

sorting
• along any dimension
• sort detection

and many more…

Intuitive Functions (estimate π)

#include <stdio.h>
#include <arrayfire.h>
using namespace af;

int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(x*x + y*y < 1) / n;
    printf("pi = %g\n", pi);
    return 0;
}

Data Types

array  container object
b8     boolean byte
s32    signed integer
u32    unsigned integer
f32    real, single precision
f64    real, double precision
c32    complex, single precision
c64    complex, double precision

array x = randu(n, f32);

array y = randu(n, f64);

array z = randu(n, u32);

ND Support

vectors, matrices, volumes … ND

Subscripting

A(span,span,2)

ArrayFire Keywords: end, span

A(end,span)

A(1,span)

A(1,1)

A(end,1)

Generate Arrays

constant(0,3) // 3-by-1 column of zeros, single-precision

constant(1,3,2,f64) // 3-by-2 matrix, double-precision

randu(1,8) // row vector (1x8) of random values (uniform)

randn(2,2) // square matrix (2x2) random values (normal)

identity(3,3) // 3-by-3 identity

randu(5,7,c32) // complex random values

Create Arrays from CPU Data

float hA[] = {0,1,2,3,4,5};

array A(2,3,hA); // 2x3 matrix, single-precision

print(A);

// A = [ 0 2 4 ]    Note: Fortran (column-major) storage order
//     [ 1 3 5 ]

Arithmetic

array R = randu(3,3);

array C = constant(1,3,3) + complex(sin(R)); // C is c32

// rescale complex values to unit circle

array a = randn(5,c32);

print(a / abs(a));

L-2 Norm Example

// calculate L-2 norm of every column

sqrt(sum(pow(X, 2))) // norm of every column vector

sqrt(sum(pow(X, 2), 0)) // ..same

sqrt(sum(pow(X, 2), 1)) // norm of every row vector

Subscripting Examples

array A = randu(3,3);

array a1 = A(0); // first element

array a2 = A(0,1); // first row, second column

A(1,span); // second row

A.row(end); // last row

A.cols(1,end); // all but first column

Subscripting Examples

float b_ptr[] = {0,1,2,3,4,5,6,7,8,9};

array b(1,10,b_ptr);

b(seq(3)); // {0,1,2}

b(seq(1,7)); // {1,2,3,4,5,6,7}

b(seq(1,2,7)); // {1,3,5,7}

b(seq(0,2,end)); // {0,2,4,6,8}

Data Manipulation

// setting entries to a constant

A(span) = 4; // fill entire array

A.row(0) = -1; // first row

A(seq(3)) = 3.1415; // first three elements

Data Manipulation

// copy in another matrix

array B = constant(1,4,4,f64);

B.row(0) = randu(1,4,f32); // set row (upcast)

Data Manipulation

// index with another array

float h_inds[] = {0, 4, 2, 1}; // zero-based

array inds(1,4,h_inds);

B(inds) = randu(4,1); // set to random

Linear Algebra

// matrix factorization

array L, U;

lu(L, U, randu(n,n));

// linear systems: A x = b

array A = randu(n,n), b = randu(n,1);

array x = solve(A,b);

Graphics Functions

asynchronous

non-blocking

throttled at 35 Hz

Graphics Functions

non-blocking primitives

surface - surface plotting (2d data)

image - intensity image visualization

arrows - vector fields

plot2 - line plotting (x,y)

plot3 - scatter plot (x,y,z)

volume - volume rendering for 3d data

Graphics Functions

utility commands

keep_on keep_off

subfigure

palette

clearfig

draw (blocking)

figure

title

close

Graphics Example

#include <arrayfire.h>
using namespace af;

int main() {
    // random 3d surface
    const int n = 256;
    while (1) {
        array x = randu(n,n);
        // 3d surface plot
        surface(x);
    }
    return 0;
}

GFOR Parallel Loops

gfor (array i, 3)
    C(span,span,i) = A(span,span,i) * B;

Parallel matrix multiplications (1 kernel launch)

[Diagram: each slice C(,,i) = A(,,i) * B for i = 1, 2, 3; the three products are batched as C(,,1:3) = A(,,1:3) * B.]

Four Quick Stories in Conclusion

Advertising, Healthcare, Finance, Oil & Gas

Virtual Glasses Try-On

Acceleration Demands

The CPU code

45 seconds for one session to complete

Highly optimized OpenMP code leveraging all cores

1,000 sessions/minute required 750 CPU nodes

Convert Mac-only research code to C#

Focus on efficiently developed robust performance

ArrayFire Solution

Linear algebra

Matrix multiply, Transpose

Linear solvers

Image processing

Convolutions

Fast Fourier Transform

Correlation Filter

Sobel Filter

Gaussian Blur

OpenCV functions

Custom edge detection

Graphics

Rendering points

Reductions

Min, Max, Sum

JIT

Increased productivity

Results

3X acceleration

Dropped from 750 nodes to 250 nodes

Benefit from ongoing library support

Culture-Free Microbiology

Filling

Filled

Computer-controlled pipettes

Microscope

A computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to a program

Acceleration Demands

This platform provides a rapid alternative to traditional cell culturing for susceptibility testing

The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic

Culture-based methods can take 2-3 days, which is problematic for many critically ill patients

ArrayFire Solution

Image Processing

Heavily filter based

Convolve, Filter, Resize

Image Statistics

Mean, StdDev, Variance

Results

Realtime throughput

Kernel                                                    Speedup
Image Registration (heavy use of statistics functions)    73.17x
Custom Filter (Prep Center Image)                         26.48x
Gaussian Blur                                             2.19x

Hedge Protection System

Acceleration Demands

CPU-only version was taking 115 hours

Needs to run the entire database of portfolios each night before trading begins the next day

ArrayFire Solution

Statistics Functions

Random number generation

Variance

Exponentials

Arithmetic

Sqrt

Element-wise math

Reductions

Sum

Results

GPU version drops runtime to 7 hours and meets the requirement to run overnight

Time left over to try more permutations

Oil Well Monitoring

Ordinary telecom fiber used as an efficient, high-fidelity acoustic sensor

Threaded along the length of the oil well

Acceleration Demands

Require realtime signal processing from 24 channels per unit with an onsite server

CPU-only solution was 5x slower than realtime

ArrayFire Solution

Heavy usage of signal filtering functions

FIR

IIR

Results

6x performance improvement in signal processing

20x overall performance improvement through more efficiently vectorized code

Software Shop for CUDA & OpenCL

Two ways to work with us:

Use

Hire our CUDA & OpenCL developers

Code development; CUDA & OpenCL training