
Productive GPU Software

Date post: 23-Feb-2016
Upload: bruis
Transcript
Page 1: Productive GPU Software

Productive GPU Software This conference uses integrated audio. To interact with the host, you need a working microphone and speakers.

To speak, please click on the “Raise Hand” button in the Participants box. You can speak into your microphone after the host allows you to.

If you cannot hear the host, or if your voice is not being transmitted, please let us know using the Chat window.

Page 2: Productive GPU Software

Outline
• Introduction to Jacket for MATLAB®
• GFOR
• Comparison with PCT™ alternative
• Moving into the future
• Case studies and code demos

MATLAB® and Parallel Computing Toolbox™ (PCT) are trademarks of MathWorks®

Page 3: Productive GPU Software

n = 20e6; % 20 million random samples
X = grand(1,n,'gdouble');
Y = grand(1,n,'gdouble');
distance_to_origin = sqrt( X.*X + Y.*Y );
is_inside = (distance_to_origin <= 1);
pi = 4 * sum(is_inside) / n;

Easy GPU Acceleration of M code
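
The Monte Carlo estimate on this slide draws random points in the unit square and counts the fraction that land inside the quarter circle of radius 1; that fraction approaches π/4. For comparison, here is a minimal CPU-only sketch of the same computation in plain Python. The function name `estimate_pi`, the fixed seed, and the smaller sample count are assumptions made for illustration and are not part of Jacket.

```python
import math
import random


def estimate_pi(n, seed=0):
    """Estimate pi from n random points in the unit square (illustrative helper)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        # Same test as the slide: distance to the origin is at most 1
        if math.sqrt(x * x + y * y) <= 1.0:
            inside += 1
    return 4.0 * inside / n


if __name__ == "__main__":
    print("pi is approximately", estimate_pi(200_000))
```

The Python loop visits one sample at a time, whereas the M code on the slide expresses the same steps as whole-array operations, which is the form Jacket runs on the GPU.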

Page 4: Productive GPU Software

Matrix Types

gdouble: double precision

gsingle: single precision

glogical: boolean

gint#: integers

guint#: unsigned integers

Page 5: Productive GPU Software

Matrix Types: ND Support

vectors, matrices, volumes … ND

Page 6: Productive GPU Software

Matrix Types: Easy Manipulation

A(1,:)

A(end,1)

A(1,1)

A(end,:)

A(:,:,2)

Page 7: Productive GPU Software

Easy GPU Acceleration of M code

n = 20e6; % 20 million random samples
X = grand(1,n);
Y = grand(1,n);
distance_to_origin = sqrt( X.*X + Y.*Y );
is_inside = (distance_to_origin <= 1);
pi = 4 * sum(is_inside) / n;

Page 8: Productive GPU Software

No GPU-specific stuff involved (no kernels, no threads, no blocks, just regular M code)

“Very little recoding was needed to promote our Lattice Boltzmann Model code to run on the GPU.” –Dr. Kevin Tubbs, HPTi

Easy GPU Acceleration of M code

Page 9: Productive GPU Software

GFOR – Parallel FOR-loop for GPUs
• Like a normal FOR-loop, but faster

for i = 1:3
  C(:,:,i) = A(:,:,i) * B;
end

Regular FOR-loop (3 serial kernel launches)

gfor i = 1:3
  C(:,:,i) = A(:,:,i) * B;
gend

Parallel GPU FOR-loop (only 1 kernel launch)

Page 10: Productive GPU Software

Example: Matrix Multiply

[Diagram: iteration i = 1 computes C(:,:,i) = A(:,:,i) * B]

for i = 1:3
  C(:,:,i) = A(:,:,i) * B;
end

Regular FOR-loop (3 serial kernel launches)

Page 11: Productive GPU Software

Example: Matrix Multiply

[Diagram: iterations i = 1 and i = 2 run one after the other, each computing C(:,:,i) = A(:,:,i) * B]

for i = 1:3
  C(:,:,i) = A(:,:,i) * B;
end

Regular FOR-loop (3 serial kernel launches)

Page 12: Productive GPU Software

Example: Matrix Multiply

[Diagram: iterations i = 1, i = 2, and i = 3 run one after the other, each computing C(:,:,i) = A(:,:,i) * B]

for i = 1:3
  C(:,:,i) = A(:,:,i) * B;
end

Regular FOR-loop (3 serial kernel launches)

Page 13: Productive GPU Software

Example: Matrix Multiply

[Diagram: simultaneous iterations i = 1:3 compute C(:,:,1:3) = A(:,:,1:3) * B]

gfor i = 1:3
  C(:,:,i) = A(:,:,i) * B;
gend

Parallel GPU FOR-loop (only 1 kernel launch)

Page 14: Productive GPU Software

Example: Matrix Multiply

[Diagram: simultaneous iterations i = 1:3, showing C(:,:,1) = A(:,:,1) * B]

gfor i = 1:3
  C(:,:,i) = A(:,:,i) * B;
gend

Parallel GPU FOR-loop (only 1 kernel launch)
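
The matrix-multiply example works because each iteration touches only its own slice of A and C, so the iterations can run in any order or all at once. The Python sketch below illustrates that property only: it compares a serial loop with a version that hands the independent slices to a thread pool. The helper names (`matmul`, `serial_loop`, `batched`) and the sample data are assumptions for illustration; a thread pool is a stand-in here and is not how Jacket launches its single GPU kernel.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def serial_loop(slices, b):
    """One multiply per iteration, in order (the regular FOR-loop)."""
    out = []
    for a in slices:
        out.append(matmul(a, b))
    return out


def batched(slices, b):
    """Submit every independent slice at once; results come back in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda a: matmul(a, b), slices))


# Three 2x2 slices standing in for A(:,:,1:3), and a shared 2x2 matrix B
A = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
    [[2, 0], [0, 2]],
]
B = [[1, 1], [0, 1]]

if __name__ == "__main__":
    print(serial_loop(A, B) == batched(A, B))
```

If an iteration read a slice of C written by another iteration, the two versions could disagree, which is why this style of loop only suits independent iterations.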

Page 15: Productive GPU Software

Example: Summing over Columns
• Think of gfor as “syntactic sugar” to write vectorized code in an iterative style.

for i = 1:3
  A(i) = sum(B(:,i));
end

Three passes to sum all columns of B

gfor i = 1:3
  A(i) = sum(B(:,i));
gend

One pass to sum all columns of B

Both are equivalent to “sum(B)”, but the latter is faster (more explicitly written)
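
To make the "three passes versus one pass" distinction concrete, here is a small Python sketch under stated assumptions: `column_sums_loop` mirrors the per-column loop (one traversal of B per column), while `column_sums_single_pass` accumulates every column total during a single traversal. Both function names and the sample matrix are hypothetical and chosen only for illustration.

```python
def column_sums_loop(b):
    """Sum each column separately: one traversal of the matrix per column."""
    n_cols = len(b[0])
    a = [0] * n_cols
    for i in range(n_cols):
        a[i] = sum(row[i] for row in b)
    return a


def column_sums_single_pass(b):
    """Accumulate all column totals while traversing the matrix once."""
    totals = [0] * len(b[0])
    for row in b:
        for i, value in enumerate(row):
            totals[i] += value
    return totals


# A 2x3 matrix standing in for B
B = [
    [1, 2, 3],
    [4, 5, 6],
]

if __name__ == "__main__":
    print(column_sums_loop(B), column_sums_single_pass(B))
```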

Page 16: Productive GPU Software

y = gzeros( 5, 5, n );
for i = 1:n,
  gselect(i); % choose GPU for this iteration
  x = grand(5,5); % add work to GPU's queue
  y(:,:,i) = fft(x); % more work in queue
end

% all GPUs are now computing simultaneously, until done

Easy Multi GPU Scaling
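
The multi-GPU loop relies on work being queued rather than executed immediately: each iteration picks a device, enqueues its work, and moves on, so all devices end up computing at the same time. The Python sketch below imitates that pattern with one single-worker queue per pretend device. Everything in it is an assumption for illustration: the device count, the round-robin choice of device, and the `work` function standing in for the FFT. It does not use or model any Jacket API.

```python
from concurrent.futures import ThreadPoolExecutor


def work(i):
    """Stand-in for the per-iteration GPU work (hypothetical)."""
    return i * i


def device_for(i, num_devices):
    """Round-robin device choice (an assumption; the slide calls gselect(i) directly)."""
    return i % num_devices


def run_on_devices(n, num_devices=2):
    # One single-worker executor per pretend device acts as that device's queue
    queues = [ThreadPoolExecutor(max_workers=1) for _ in range(num_devices)]
    try:
        # Enqueue everything first; submit() returns immediately
        futures = [queues[device_for(i, num_devices)].submit(work, i) for i in range(n)]
        # Only now wait for the queues to drain
        return [f.result() for f in futures]
    finally:
        for q in queues:
            q.shutdown()


if __name__ == "__main__":
    print(run_on_devices(5))
```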

Page 17: Productive GPU Software

Technology Stack
• A full system making optimizations for you
• Including
  – “Core” brains
  – “JIT” speed
  – “Calls” heavy-lifting

[Diagram: technology stack]

Core: runtime, memory mgt, binary handling, GPU-multiplex, thread mgt

JIT Engine(s): plus.mex, minus.mex, bsxfun.mex, tan.mex, times.mex, power.mex

Calls (library routines + JIT): fft.mex, fft2.mex, bessel.mex, conv2.mex, convn.mex, find.mex, sum.mex, subsasgn.mex, mldivide.mex, lu.mex

Page 18: Productive GPU Software

http://www.accelereyes.com/case_studies

• 17X: Neuro-imaging, Georgia Tech
• 20X: Video Processing, Google
• 12X: Medical Devices, Spencer Tech
• 5X: Weather Modeling, NCAR
• 35X: Power Engineering, IIT India
• 17X: Track Bad Guys, BAE Systems
• 70X: Drug Delivery, Georgia Tech
• 35X: Bioinformatics, Leibniz
• 20X: Bio-Research, CDC
• 45X: Radar Imaging, System Planning

Page 19: Productive GPU Software

Automated Optimizations

[Diagram: CPU, GPU Memory, and GPU Cores evaluating A = sin( x + y ).^2, with a transfer cost of 300 cycles one-way]

Page 20: Productive GPU Software

Automated Optimizations

[Diagram: CPU, GPU Memory, and GPU Cores evaluating A = sin( x + y ).^2, with a transfer cost of 300 cycles one-way]

Optimized via async transfer and smart copy

Optimized via runtime

Page 21: Productive GPU Software

Compare versus PCT
A = sin( x + y ).^2

PCT (Parallel Computing Toolbox™)
• Load x, y (300 cycles)
• + (4 cycles)
• Store Temp1 (300 cycles)
• Load Temp1 (300 cycles)
• Sin (~20 cycles)
• Store Temp2 (300 cycles)
• Load Temp2 (300 cycles)
• .^ (~10 cycles)
• Store A (300 cycles)

Jacket
• Load x, y (300 cycles)
• Sin( x+y ).^2 (34 cycles)
• Store A (300 cycles)

MATLAB and PCT are products and trademarks of MathWorks.
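
The difference between the two columns is whether intermediate results are written out and read back between operations. The Python sketch below shows the idea on plain lists, under the assumption that each list comprehension stands in for one pass over memory: `unfused` builds two temporaries across three passes, while `fused` evaluates the whole expression in a single pass. The function names and sample inputs are hypothetical; neither function reflects how PCT or Jacket is implemented internally.

```python
import math


def unfused(x, y):
    """Three passes with two temporaries, one pass per operation."""
    temp1 = [a + b for a, b in zip(x, y)]   # store Temp1
    temp2 = [math.sin(t) for t in temp1]    # load Temp1, store Temp2
    return [t ** 2 for t in temp2]          # load Temp2, store A


def fused(x, y):
    """One pass: load x and y, compute sin(x + y)^2, store A."""
    return [math.sin(a + b) ** 2 for a, b in zip(x, y)]


x = [0.0, 0.5, 1.0]
y = [0.0, 0.25, 2.0]

if __name__ == "__main__":
    print(unfused(x, y))
    print(fused(x, y))
```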

Page 22: Productive GPU Software

Compare versus PCT
A = sin( x + y ).^2

PCT (Parallel Computing Toolbox™)
• Load x, y (300 cycles)
• + (4 cycles)
• Store Temp1 (300 cycles)
• Load Temp1 (300 cycles)
• Sin (~20 cycles)
• Store Temp2 (300 cycles)
• Load Temp2 (300 cycles)
• .^ (~10 cycles)
• Store A (300 cycles)
• Total: 1834 cycles

Jacket
• Load x, y (300 cycles)
• Sin( x+y ).^2 (34 cycles)
• Store A (300 cycles)
• Total: 634 cycles

MATLAB® and PCT™ are products and trademarks of MathWorks.

Page 23: Productive GPU Software

Compare versus PCT
A = sin( x + y ).^2

PCT (Parallel Computing Toolbox™)
• Load x, y (300 cycles)
• + (4 cycles)
• Store Temp1 (300 cycles)
• Load Temp1 (300 cycles)
• Sin (~20 cycles)
• Store Temp2 (300 cycles)
• Load Temp2 (300 cycles)
• .^ (~10 cycles)
• Store A (300 cycles)
• Total: 1834 cycles

Jacket
• Load x, y (300 cycles)
• Sin( x+y ).^2 (34 cycles)
• Store A (300 cycles)
• Total: 634 cycles

Theoretically, a 3x increase. Actually, a 20x difference:
• Legacy Java system
• Better GPU code
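
The two totals follow directly from the per-step cycle counts listed on the slide. The short Python sketch below adds them up and computes the ratio; the variable names are chosen for illustration, and the numbers are the slide's own estimates rather than measurements.

```python
# Per-step cycle counts as listed on the slide
pct_steps = {
    "load x, y": 300,
    "+": 4,
    "store Temp1": 300,
    "load Temp1": 300,
    "sin": 20,
    "store Temp2": 300,
    "load Temp2": 300,
    ".^": 10,
    "store A": 300,
}
jacket_steps = {
    "load x, y": 300,
    "sin(x+y).^2": 34,  # 4 + 20 + 10 cycles, with no trips to memory in between
    "store A": 300,
}

pct_total = sum(pct_steps.values())
jacket_total = sum(jacket_steps.values())
ratio = pct_total / jacket_total

if __name__ == "__main__":
    print(pct_total, jacket_total, round(ratio, 2))
```

The ratio comes out just under 3, which is the "theoretically, a 3x increase" figure; the arithmetic itself (34 cycles) is the same in both columns, so the entire gap comes from the four extra 300-cycle memory trips.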

Page 24: Productive GPU Software

Jacket has 10X more functions…

reductions
• sum, min, max, any, all, nnz, prod
• vectors, columns, rows, etc

convolutions
• 2D, 3D, ND

dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, det, Matrix Power, Solvers

FFTs
• 2D, 3D, ND

image processing
• filter, rotate, erode, dilate, bwmorph, resize, rgb2gray
• hist, histeq

interp and rescale
• vectors, matrices
• rescaling

sorting
• along any dimension
• find

and many more…

gfor (loops), gcompile (fine-grain), gselect (multi-GPU)

help
• gprofview

Page 25: Productive GPU Software

Easy To Maintain
• Write your code once and let Jacket carry you through the coming hardware evolution.
  – Each new Jacket release improves the speed of your code, without any code modification.
  – Each new Jacket release leverages latest GPU hardware (e.g. Fermi, Kepler), without any code modification.

Page 26: Productive GPU Software

New in Jacket 2.1: Optimization
• Unconstrained Optimization in 2.1
  – Gradient Descent and BFGS methods
  – Jacobian computation with GFOR
• Batched-mode Optimization in 2.2
• Search-based Optimization in 2.2
• Constrained Optimization in 2.3

Page 27: Productive GPU Software

Sparse Roadmap

Current functions supported:
• Matrix multiply
• Triangular matrix solve
• Iterative solvers with no pre-conditioning.
• Examples: CG, BICG, BICGSTAB, BICGSTABL, GMRES, LSQR

Under development:
• Iterative solvers with pre-conditioning and improved performance
• Examples: CG, BICG, BICGSTAB, GMRES

Page 28: Productive GPU Software

Move to C/C++, Fortran, or Python

The World’s Largest, Fastest GPU Library

ArrayFire GPU library
• Free version for most users (single GPU usage)
• Pro version (multi-GPU usage)
• Available for CUDA or OpenCL devices

Page 29: Productive GPU Software

ArrayFire Example (C++)

#include <stdio.h>
#include <arrayfire.h>
using namespace af;
int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(sqrt(mul(x,x)+mul(y,y))<1) / n;
    printf("pi = %g\n", pi);
    return 0;
}

Page 30: Productive GPU Software

Case Studies

See more examples: http://www.accelereyes.com/examples/case_studies

http://blog.accelereyes.com/blog/

Page 31: Productive GPU Software

Case Study: Australian Brokerage
• Description: Nonlinear regressive model fitting
• Speedup: 115x
• Solution: Jacket, Jacket DLA, ArrayFire Pro, Consulting

Page 32: Productive GPU Software

Case Study: Australian Brokerage
• Description: Modified conjugate gradient for sparse matrices
• Speedup: 10-30x (Depends on data size. Larger data gives bigger speedups.)
• Solution: Jacket, Jacket SLA, ArrayFire Pro, Consulting

Page 33: Productive GPU Software

Case Study: Koch Industries
• Description: Option pricing based on Monte-Carlo simulation
• Speedup: 60 - 70x
• Solution: Jacket

Page 34: Productive GPU Software

Case Study: Bank of America

• Description: Visualization of server utilization and workloads, required to run in MATLAB®
• Focus only on visualization, not computation
• Result: Beautiful OpenGL 3D renderings
• Solution: Jacket with the Graphics Library

Page 35: Productive GPU Software

Automotive Trader Example
• Description: Algorithmic trading
• Speedup: 37x on 3 GPUs (14x on 1 GPU)
• Solution: Jacket, Jacket MGL for 3 GPUs
• Learn more: http://www.automatedtrader.net/articles/software-review/107768/mashup
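
The two speedup figures on this slide also say how well the workload scaled from one GPU to three. The Python sketch below works that out; the variable names are chosen for illustration, and "scaling efficiency" here is simply the multi-GPU gain divided by the number of GPUs, a common convention rather than anything stated in the slides.

```python
speedup_1_gpu = 14.0   # from the slide: 14x on 1 GPU
speedup_3_gpus = 37.0  # from the slide: 37x on 3 GPUs
num_gpus = 3

# How much faster 3 GPUs are than 1 GPU
multi_gpu_gain = speedup_3_gpus / speedup_1_gpu

# Fraction of ideal linear scaling achieved (1.0 would be perfect)
scaling_efficiency = multi_gpu_gain / num_gpus

if __name__ == "__main__":
    print(round(multi_gpu_gain, 2), round(scaling_efficiency, 2))
```

On these numbers, three GPUs deliver a bit over 2.6 times the single-GPU performance, which is close to 90% of ideal linear scaling.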

Page 36: Productive GPU Software

Demos

Page 37: Productive GPU Software

Discussion

Faster MATLAB® through GPU computing

