Productive GPU Software
Outline
• Introduction to Jacket for MATLAB®
• GFOR
• Comparison with PCT™ alternative
• Moving into the future
• Case studies and code demos
MATLAB® and Parallel Computing Toolbox™ (PCT) are trademarks of MathWorks®
Easy GPU Acceleration of M code
n = 20e6; % 20 million random samples
X = grand(1,n,'gdouble');
Y = grand(1,n,'gdouble');
distance_to_origin = sqrt( X.*X + Y.*Y );
is_inside = (distance_to_origin <= 1);
pi = 4 * sum(is_inside) / n;
Matrix Types
• gdouble: double precision
• gsingle: single precision
• glogical: boolean
• gint#: integers
• guint#: unsigned integers
Matrix Types: ND Support
• vectors
• matrices
• volumes … ND
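Matrix Types: Easy Manipulation
• A(1,1): first element
• A(1,:): first row
• A(end,1): last element of the first column
• A(end,:): last row
• A(:,:,2): second 2D slice of a 3D array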
Easy GPU Acceleration of M code
n = 20e6; % 20 million random samples
X = grand(1,n);
Y = grand(1,n);
distance_to_origin = sqrt( X.*X + Y.*Y );
is_inside = (distance_to_origin <= 1);
pi = 4 * sum(is_inside) / n;
No GPU-specific constructs are involved: no kernels, no threads, no blocks, just regular M code.
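For readers without a GPU, the same estimate can be sketched on the CPU. This is plain Python used as a neutral stand-in, not Jacket code; the function name estimate_pi and the seed parameter are assumptions introduced here for illustration.

```python
import math
import random


def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n random points in the unit square."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatable runs
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        # Same test as the M code: distance to origin <= 1
        if math.sqrt(x * x + y * y) <= 1.0:
            inside += 1
    # The fraction inside the quarter circle approximates pi / 4
    return 4.0 * inside / n


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The explicit loop here is what the vectorized M code expresses as whole-array operations.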
“Very little recoding was needed to promote our Lattice Boltzmann Model code to run on the GPU.” –Dr. Kevin Tubbs, HPTi
GFOR – Parallel FOR-loop for GPUs
• Like a normal FOR-loop, but faster
Regular FOR-loop (3 serial kernel launches):
for i = 1:3
  C(:,:,i) = A(:,:,i) * B;
end
Parallel GPU FOR-loop (only 1 kernel launch):
gfor i = 1:3
  C(:,:,i) = A(:,:,i) * B;
gend
Example: Matrix Multiply
[Figure: with the regular FOR-loop, iterations i = 1, 2, 3 compute C(:,:,i) = A(:,:,i) * B one after another (3 serial kernel launches). With the parallel GPU FOR-loop (gfor), iterations i = 1:3 run simultaneously, computing C(:,:,1:3) from A(:,:,1:3) and B in only 1 kernel launch.]
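The difference between the two loops can be mimicked on the CPU: one call per slice versus one call that receives all slices. A minimal sketch in plain Python follows. The helper names matmul, multiply_per_slice, and multiply_batched are hypothetical, and the launch counter is only a stand-in for GPU kernel launches, not a model of how Jacket implements gfor.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def multiply_per_slice(slices, b):
    """FOR-loop style: one 'launch' per slice A(:,:,i)."""
    launches = 0
    out = []
    for a in slices:
        out.append(matmul(a, b))
        launches += 1
    return out, launches


def multiply_batched(slices, b):
    """GFOR style: all slices are handed over in a single 'launch'."""
    # Stack the slices vertically, multiply once, then split the result.
    rows = len(slices[0])
    stacked = [row for a in slices for row in a]
    product = matmul(stacked, b)
    out = [product[i * rows:(i + 1) * rows] for i in range(len(slices))]
    return out, 1
```

Both functions return the same products; only the number of separate calls differs. The batched version works because stacking the slices on top of each other and multiplying by B once yields each A(:,:,i) * B as a block of rows.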
Example: Summing over Columns
• Think of gfor as “syntactic sugar” for writing vectorized code in an iterative style.
Regular FOR-loop (three passes to sum all columns of B):
for i = 1:3
  A(i) = sum(B(:,i));
end
GFOR-loop (one pass to sum all columns of B):
gfor i = 1:3
  A(i) = sum(B(:,i));
gend
Both are equivalent to “sum(B)”, but the gfor version is faster (more explicitly written).
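The "three passes versus one pass" distinction can be illustrated on the CPU. The sketch below is plain Python with hypothetical function names; it shows the access pattern only and makes no claim about GPU timing.

```python
def column_sums_loop(b):
    """FOR-loop style: one separate pass over the data per column."""
    n_cols = len(b[0])
    sums = []
    for j in range(n_cols):
        sums.append(sum(row[j] for row in b))
    return sums


def column_sums_one_pass(b):
    """Vectorized style: a single pass over the rows updates every column sum."""
    sums = [0] * len(b[0])
    for row in b:
        for j, value in enumerate(row):
            sums[j] += value
    return sums
```

Both return the same vector of column sums, which is what sum(B) computes in M code.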
Easy Multi GPU Scaling
y = gzeros( 5, 5, n );
for i = 1:n,
  gselect(i);          % choose GPU for this iteration
  x = grand(5,5);      % add work to GPU's queue
  y(:,:,i) = fft(x);   % more work in queue
end
% all GPUs are now computing simultaneously, until done
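The pattern in the multi-GPU loop (queue work on each device without waiting, then wait once at the end) can be sketched with threads standing in for GPUs. This is plain Python, not Jacket. The function run_on_devices is a hypothetical name, and the wrap-around i % n_devices is an assumption added so the sketch also works with more tasks than devices.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_devices(tasks, n_devices):
    """Queue task i on worker i % n_devices and wait for all results.

    Each single-thread executor is a stand-in for one GPU's work queue.
    """
    queues = [ThreadPoolExecutor(max_workers=1) for _ in range(n_devices)]
    try:
        # submit() returns immediately, so all queues fill before any wait
        futures = [queues[i % n_devices].submit(task)
                   for i, task in enumerate(tasks)]
        # Blocking on result() corresponds to "computing ... until done"
        return [f.result() for f in futures]
    finally:
        for q in queues:
            q.shutdown()
```

A possible use: run_on_devices([lambda k=k: k * k for k in range(5)], n_devices=2) returns the five results in submission order.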
Technology Stack
• A full system making optimizations for you
• Including
– “Core” brains
– “JIT” speed
– “Calls” heavy-lifting
Core: runtime, memory mgt, binary handling, GPU-multiplex, thread mgt
JIT Engine(s): plus.mex, minus.mex, bsxfun.mex, tan.mex, times.mex, power.mex
Calls (library routines + JIT): fft.mex, fft2.mex, bessel.mex, conv2.mex, convn.mex, find.mex, sum.mex, subsasgn.mex, mldivide.mex, lu.mex
http://www.accelereyes.com/case_studies
• 17X: Neuro-imaging (Georgia Tech)
• 20X: Video Processing (Google)
• 12X: Medical Devices (Spencer Tech)
• 5X: Weather Modeling (NCAR)
• 35X: Power Engineering (IIT India)
• 17X: Track Bad Guys (BAE Systems)
• 70X: Drug Delivery (Georgia Tech)
• 35X: Bioinformatics (Leibniz)
• 20X: Bio-Research (CDC)
• 45X: Radar Imaging (System Planning)
Automated Optimizations
[Diagram: A = sin( x + y ).^2 moving between CPU, GPU Memory, and GPU Cores. Labels: “300 cycles one-way”; “Optimized via async transfer and smart copy”; “Optimized via runtime”.]
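One kind of optimization a runtime can apply to an expression such as A = sin( x + y ).^2 is to evaluate it in a single pass instead of one pass per operator. The sketch below is plain Python with hypothetical function names. It illustrates the idea only and is not a description of Jacket's internals.

```python
import math


def evaluate_with_temporaries(x, y):
    """Each operator makes its own pass and stores a full temporary array."""
    temp1 = [a + b for a, b in zip(x, y)]     # pass 1: x + y
    temp2 = [math.sin(t) for t in temp1]      # pass 2: sin(...)
    return [t ** 2 for t in temp2]            # pass 3: (...).^2


def evaluate_fused(x, y):
    """One pass: every element is loaded once and stored once."""
    return [math.sin(a + b) ** 2 for a, b in zip(x, y)]
```

Both functions compute the same values; the fused form avoids writing and re-reading the two temporaries.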
Compare versus PCT
A = sin( x + y ).^2
PCT (1834 cycles):
• Load x, y (300 cycles)
• + (4 cycles)
• Store Temp1 (300 cycles)
• Load Temp1 (300 cycles)
• Sin (~20 cycles)
• Store Temp2 (300 cycles)
• Load Temp2 (300 cycles)
• .^ (~10 cycles)
• Store A (300 cycles)
Jacket (634 cycles):
• Load x, y (300 cycles)
• Sin( x+y ).^2 (34 cycles)
• Store A (300 cycles)
Theoretically, a 3x increase. Actually, a 20x difference:
• Legacy Java system
• Better GPU code
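The two totals and the "3x" figure follow directly from the per-step costs on the slide. A short Python check is given below; the cycle counts are copied from the slide, while the variable and function names are introduced here for illustration.

```python
# Per-step cycle costs copied from the slide
PCT_STEPS = [
    ("Load x, y", 300),
    ("+", 4),
    ("Store Temp1", 300),
    ("Load Temp1", 300),
    ("Sin", 20),
    ("Store Temp2", 300),
    ("Load Temp2", 300),
    (".^", 10),
    ("Store A", 300),
]
JACKET_STEPS = [
    ("Load x, y", 300),
    ("Sin( x+y ).^2", 34),
    ("Store A", 300),
]


def total_cycles(steps):
    """Add up the cycle cost of every step."""
    return sum(cycles for _, cycles in steps)


pct_total = total_cycles(PCT_STEPS)        # 1834
jacket_total = total_cycles(JACKET_STEPS)  # 634
ratio = pct_total / jacket_total           # about 2.89, i.e. roughly 3x
```

Almost all of the difference comes from the four extra 300-cycle memory trips for Temp1 and Temp2.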
Jacket has 10X more functions…
• reductions
– sum, min, max, any, all, nnz, prod
– vectors, columns, rows, etc
• convolutions
– 2D, 3D, ND
• dense linear algebra
– LU, QR, Cholesky, SVD, Eigenvalues, Inversion, det, Matrix Power, Solvers
• FFTs
– 2D, 3D, ND
• image processing
– filter, rotate, erode, dilate, bwmorph, resize, rgb2gray
– hist, histeq
• interp and rescale
– vectors, matrices
– rescaling
• sorting
– along any dimension
– find
• and many more…
• gfor (loops)
• gcompile (fine-grain)
• gselect (multi-GPU)
• help
– gprofview
Easy To Maintain
• Write your code once and let Jacket carry you through the coming hardware evolution.
– Each new Jacket release improves the speed of your code, without any code modification.
– Each new Jacket release leverages latest GPU hardware (e.g. Fermi, Kepler), without any code modification.
New in Jacket 2.1: Optimization
• Unconstrained Optimization in 2.1
– Gradient Descent and BFGS methods
– Jacobian computation with GFOR
• Batched-mode Optimization in 2.2
• Search-based Optimization in 2.2
• Constrained Optimization in 2.3
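The slide names gradient descent and derivative computation with GFOR but shows no code. As a rough illustration of why the two fit together, the Python sketch below computes a forward-difference gradient, where every component is independent of the others (the kind of loop that lends itself to parallel iteration), and feeds it to fixed-step gradient descent. All names, the step size, the iteration count, and the test function are assumptions made here, not Jacket's API.

```python
def numerical_gradient(f, x, h=1e-6):
    """Forward-difference gradient; each component is independent of the others."""
    fx = f(x)
    grad = []
    for i in range(len(x)):          # iterations do not depend on each other
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - fx) / h)
    return grad


def gradient_descent(f, x0, step=0.1, iterations=200):
    """Fixed-step gradient descent (hypothetical step size and iteration count)."""
    x = list(x0)
    for _ in range(iterations):
        g = numerical_gradient(f, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def bowl(x):
    """Convex test function with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
```

Calling gradient_descent(bowl, [0.0, 0.0]) moves the starting point toward the minimum at (1, -2).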
Sparse Roadmap
Current functions supported:
• Matrix multiply
• Triangular matrix solve
• Iterative solvers with no pre-conditioning.
• Examples: CG, BICG, BICGSTAB, BICGSTABL, GMRES, LSQR
Under development:
• Iterative solvers with pre-conditioning and improved performance
• Examples: CG, BICG, BICGSTAB, GMRES
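For reference, an iterative solver without pre-conditioning looks roughly like the textbook conjugate gradient (CG) method below. This is a plain Python sketch for a small symmetric positive definite matrix, not Jacket's implementation. The sparse storage format (one {column: value} dict per row), the function names, the tolerance, and the iteration limit are assumptions chosen for brevity.

```python
def sparse_matvec(rows, x):
    """Multiply a sparse matrix (list of {column: value} dicts) by a vector."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def conjugate_gradient(rows, b, tol=1e-10, max_iter=100):
    """Unpreconditioned CG for a symmetric positive definite sparse matrix."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A*x with x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm small enough: converged
        ap = sparse_matvec(rows, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

The only operation that touches the matrix is the sparse matrix-vector product, which is why sparse matrix multiply is the building block for this family of solvers.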
Move to C/C++, Fortran, or Python
The World’s Largest, Fastest GPU Library
ArrayFire GPU library
• Free version for most users (single GPU usage)
• Pro version (multi-GPU usage)
• Available for CUDA or OpenCL devices
ArrayFire Example (C++)
#include <stdio.h>
#include <arrayfire.h>
using namespace af;
int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(sqrt(mul(x,x)+mul(y,y))<1) / n;
    printf("pi = %g\n", pi);
    return 0;
}
Case Studies
See more examples: http://www.accelereyes.com/examples/case_studies
http://blog.accelereyes.com/blog/
Case Study: Australian Brokerage
• Description: Nonlinear regressive model fitting
• Speedup: 115x
• Solution: Jacket, Jacket DLA, ArrayFire Pro, Consulting
Case Study: Australian Brokerage
• Description: Modified conjugate gradient for sparse matrices
• Speedup: 10-30x (Depends on data size. Larger data gives bigger speedups.)
• Solution: Jacket, Jacket SLA, ArrayFire Pro, Consulting
Case Study: Koch Industries
• Description: Option pricing based on Monte-Carlo simulation
• Speedup: 60 - 70x
• Solution: Jacket
Case Study: Bank of America
• Description: Visualization of server utilization and workloads, required to run in MATLAB®
• Focus only on visualization, not computation
• Result: Beautiful OpenGL 3D renderings
• Solution: Jacket with the Graphics Library
Automated Trader Example
• Description: Algorithmic trading
• Speedup: 37x on 3 GPUs (14x on 1 GPU)
• Solution: Jacket, Jacket MGL for 3 GPUs
• Learn more: http://www.automatedtrader.net/articles/software-review/107768/mashup
Demos
Discussion
Faster MATLAB® through GPU computing