Date post: 26-Jun-2018

© 2013 The MathWorks, Inc.

Parallel Computing with MATLAB

Jos Martin Principal Architect, Parallel Computing Tools

[email protected]


Code used in this presentation can be found at http://www.matlabexpo.com/uk/2013/proceedings/parallel-computing-with-matlab.zip

www.matlabexpo.com


Overview

Scene setting

Task Parallel (par*)

Why doesn’t it speed up as much as I expected?

Data parallel (spmd)

GPUs


What I assume

Reasonable MATLAB knowledge

– e.g. vectorization, pre-allocation

Some use of PCT and associated concepts

– What is a cluster

– Simple parfor usage


parfor

Definition

Code in a parfor loop is guaranteed by the programmer to be execution order independent

Why is that important?

We can execute the iterates of the loop in any order, potentially at the same time on many different workers.

Task Parallel (parfor)


A simple parfor loop

parfor i = 1:N

out(i) = someFunction(in(i));

end


parfor – how it works

A loop from 1:N has N iterates which we partition into a number of intervals

– Each interval may have a different number of iterates

Allocate the intervals to execute on the workers

Stitch the results back together


The Mechanics of parfor Loops

Pool of MATLAB Workers

a = zeros(10, 1)

parfor i = 1:10

a(i) = i;

end

[Figure: the iterations 1 to 10 of the loop body a(i) = i; are divided among four workers in the pool, and the results are assembled into a.]

parforIterateDemo


Variable Classification

reduce = 0; bcast = …; in = …;

parfor i = 1:N

temp = foo1(bcast, i);

out(i) = foo2(in(i), temp);

reduce = reduce + foo3(temp);

end
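A minimal runnable sketch of the same loop, with each variable labelled by its class. The bodies standing in for foo1, foo2 and foo3, and the values of N, bcast and in, are assumptions chosen only so the result can be checked by hand.

```matlab
% Hypothetical stand-ins for foo1, foo2 and foo3 from the slide.
N      = 8;
bcast  = 10;            % broadcast: read-only, the whole value goes to each worker
in     = 1:N;           % sliced input: indexed by the loop variable
out    = zeros(1, N);   % sliced output: assigned by the loop variable
reduce = 0;             % reduction: accumulated across iterations

parfor i = 1:N                  % i is the loop variable
    temp   = bcast + i;         % temporary: created afresh in every iteration
    out(i) = in(i) * temp;      % only element i is read and written
    reduce = reduce + temp;     % the order of the additions is not defined
end
```

Because addition of these integers is exact, reduce comes out the same whichever order the workers run in.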


Loop variable

reduce = 0; bcast = …; in = …;

parfor i = 1:N

temp = foo1(bcast, i);

out(i) = foo2(in(i), temp);

reduce = reduce + foo3(temp);

end


Making extra parallelism

No one loop appears to have enough iterations to go parallel effectively

for ii = 1:smallNumber_I

for jj = 1:smallNumber_J

for kk = 1:smallNumber_K

end

end

end

smallNumber_I * smallNumber_J * smallNumber_K == quiteBigNumber
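One way to merge the three loops, sketched under assumptions: run a single parfor over the product of the counts and recover the three indices with ind2sub. The loop counts and the loop body are hypothetical, and this is not necessarily how mergeLoopsDemo does it.

```matlab
nI = 3; nJ = 4; nK = 5;                  % hypothetical small loop counts
total = nI * nJ * nK;                    % 60 iterations in the merged loop
flat  = zeros(1, total);

parfor idx = 1:total
    % Recover the three original loop indices from the merged index.
    [ii, jj, kk] = ind2sub([nI, nJ, nK], idx);
    flat(idx) = 100*ii + 10*jj + kk;     % stand-in for the real loop body
end

% result(ii, jj, kk) holds the value the nested loops would have produced.
result = reshape(flat, [nI, nJ, nK]);
```

The merged loop visits the index triples in a different order from the nested loops, which is fine because parfor iterations must be order independent anyway.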

mergeLoopsDemo


Sliced Variable

reduce = 0; bcast = …; in = …;

parfor i = 1:N

temp = foo1(bcast, i);

out(i) = foo2(in(i), temp);

reduce = reduce + foo3(temp);

end


Broadcast variable

reduce = 0; bcast = …; in = …;

parfor i = 1:N

temp = foo1(bcast, i);

out(i) = foo2(in(i), temp);

reduce = reduce + foo3(temp);

end


Reusing data

D = makeSomeBigData;

for ii = 1:N

parfor jj = 1:M

a(jj) = func(D, jj);

end

end


Reusing data

D = WorkerObjectWrapper(@makeSomeBigData);

for ii = 1:N

parfor jj = 1:M

a(jj) = func(D.value, jj);

end

end

from Edric Ellis on MATLAB Central

www.mathworks.com/matlabcentral/fileexchange/31972-worker-object-wrapper
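Releases newer than this talk (R2015b onward, as far as is known) include parallel.pool.Constant, which serves a similar purpose to WorkerObjectWrapper. The sketch below assumes such a release and an available pool. The data (a magic square) and the per-column sum are hypothetical stand-ins for makeSomeBigData and func.

```matlab
N = 3; M = 200;                                  % hypothetical loop counts
% Built once on each worker and reused by every later parfor loop.
D = parallel.pool.Constant(@() magic(200));      % stand-in for makeSomeBigData
a = zeros(1, M);
for ii = 1:N
    parfor jj = 1:M
        data  = D.Value;                         % no transfer: already on the worker
        a(jj) = sum(data(:, jj));                % stand-in for func(D, jj)
    end
end
```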


Counting events in parallel

Inside the parallel loop you are looking to count the number of times some particular result is obtained

– Histograms, interesting results, etc.
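A small sketch of counting with a reduction variable: each iteration builds a one-hot vector for the bin it hits and adds it to the running counts. The bin rule (mod) is a hypothetical stand-in for the real computation, and this is not necessarily the approach taken in the demo code.

```matlab
numBins = 5;
N       = 1000;
counts  = zeros(1, numBins);        % reduction variable

parfor i = 1:N
    bin = mod(i, numBins) + 1;      % stand-in for "which result did we get?"
    hit = zeros(1, numBins);        % temporary one-hot vector
    hit(bin) = 1;
    counts = counts + hit;          % reduction: combined across workers
end
```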


Reduction Variable

reduce = 0; bcast = …; in = …;

parfor i = 1:N

temp = foo1(bcast, i);

out(i) = foo2(in(i), temp);

reduce = reduce + foo3(temp);

end

parforSearchDemo


Common parallel program

set stuff going

while not all finished {

for next available result do something;

}

Task Parallel (parfeval)


parfeval

New feature in R2013b

Introduces asynchronous programming

f = parfeval(@func, numOut, in1, in2, …)

The return f is a future which allows you to

– Wait for the completion of calling func(in1, in2, …)

– Get the result of that call

– … do other useful parallel programming tasks …
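A minimal sketch of one future, assuming a pool is available. The built-in max is used as the function so that two outputs can be requested and checked.

```matlab
f = parfeval(@max, 2, [3 9 4]);   % request both outputs of max([3 9 4])
wait(f);                          % block until the call has completed
[m, idx] = fetchOutputs(f);       % retrieve the results of that call
```

fetchOutputs blocks on its own if the future has not finished, so the explicit wait is only there to show the first bullet.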


Fetch Next

Fetch next available unread result from an array of futures.

[idx, out1, ...] = fetchNext(arrayOfFutures)

idx is the index of the future from which the result is fetched

Once a particular future has returned a result via fetchNext it will never do so again

– That particular result is considered read, and will not be re-read


Common parallel program (MATLAB)

% Set stuff going

for ii = N:-1:1

fs(ii) = parfeval(@stuff, 1);

end

% While not all finished

for ii = 1:N

% for next available result

[whichOne, result] = fetchNext(fs);

doSomething(whichOne, result);

end

parfevalWaitbarDemo


Better parallel program

set N things going

while not all finished {

set N more things going

for N {

for next available result do something;

}

}
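One possible reading of this pseudocode in MATLAB, assuming a pool is available: the next batch is queued before the current one is drained, so workers always have something to pick up. The batch size, batch count, and the use of sqrt as the work function are hypothetical. parfevalNeedleDemo may be structured differently.

```matlab
batchSize  = 4;        % hypothetical: tasks per batch
numBatches = 3;        % hypothetical: batches to run in total
total      = 0;

% Set the first batch going.
for k = batchSize:-1:1
    current(k) = parfeval(@sqrt, 1, k);
end

for b = 1:numBatches
    % Queue the next batch before draining the current one.
    if b < numBatches
        for k = batchSize:-1:1
            upcoming(k) = parfeval(@sqrt, 1, b*batchSize + k);
        end
    end
    % Handle results from the current batch as they become available.
    for k = 1:batchSize
        [~, value] = fetchNext(current);
        total = total + value;       % stand-in for "do something"
    end
    if b < numBatches
        current = upcoming;
    end
end
```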

parfevalNeedleDemo


Why isn’t it as fast as I expect?

How fast did you expect?

– Why?

Consider

– Data transfer

– Resource contention

– Other overheads

Performance


Data Transfer

parfor (Variable classification)

– Broadcast goes once to each worker (what is actually accessed?)

– Sliced sends just the slice (is all of the slice accessed?)

– Reduction is sent back once per worker (usually efficient)

parfeval

– All inputs for a given call are passed to that worker


Resource Contention

[Figure: two groups of four cores, each core with two hyperthreads (HT) and each group sharing an L3 cache, attached to memory and, through an IO hub, to disk and network.]


Speedup vs. num. Concurrent Processes

[Plot: speedup against number of concurrent processes for a*a, fft(a) and sum(a), where a = bigMatrix.]


Speedup vs. num. Concurrent Processes

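[Plot: speedup against number of concurrent processes for a*a, fft(a) and sum(a), where a = bigMatrix, with the annotations "Hyperthreaded" and "Cores".]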


Speedup vs. Size of Data (6 procs.)

[Plot: speedup against size of data N, using 6 processes, for a*a, sum(a) and svd(a), where a = matrix(N).]


Summary (par*)

Find enough parallelism

– Go parallel as soon as possible

– But not too small with parfeval

Know how much data is being sent

– Try to send as little as possible

Understand how multiple algorithms might interact

Keep workers busy if possible

Task Parallel (par*)


Single Program, Multiple Data (spmd)

Everyone executes the same program

– Just with different data

– Inter-lab communication library enabled

– labindex and numlabs available to distinguish labs

Example

x = 1

spmd

y = x + labindex;

end
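A sketch of what the example leaves behind on the client, assuming a pool is available: y becomes a Composite, and indexing it with braces retrieves the value held by one worker. In recent releases labindex and numlabs have newer names (spmdIndex and spmdSize, as far as is known). The slide's names are kept here.

```matlab
x = 1;
spmd
    y = x + labindex;      % every worker runs this line with its own labindex
end

% On the client, y is a Composite: y{k} is the value held by worker k.
first = y{1};              % worker 1 computed 1 + 1
```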

Data Parallel (spmd)


MPI ring

A Mental Model for spmd ... end

Pool of MATLAB Workers

x = 1;

spmd

y = x + 1;

end

[Figure: each of the four workers in the pool holds its own x = 1 and computes its own y = x + 1.]


Common Parallel Program

forever {

results = independentStuff( params )

if results are OK {

break

} else {

params = chooseNewParams( results, params )

}

}


Solve with parfor

forever {

parfor ii = 1:N {

results(ii) = independentStuff( params(ii) )

}

if results are OK {

break

} else {

params = chooseNewParams( results, params )

}

}


Solve with spmd

spmd { forever {

// Each of the workers computes its results (mine)

results = gcat(independentStuff( params(mine) ))

if results are OK {

break

} else {

params = chooseNewParams( results, params )

}

}}
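A runnable sketch of this pattern, assuming a pool is available. The squaring step, the halving rule and the tolerance are hypothetical stand-ins for independentStuff, chooseNewParams and the "results are OK" test. It is not the code in spmdDemo.

```matlab
tol = 1e-3;                          % hypothetical stopping threshold
spmd
    params     = 1:numlabs;          % hypothetical start: one parameter per worker
    mine       = labindex;           % each worker handles its own entry
    iterations = 0;
    while true
        % Each worker computes one result; gcat concatenates them and
        % returns the full row to every worker.
        results    = gcat(params(mine)^2);
        iterations = iterations + 1;
        if all(results < tol)
            break                    % all workers see the same results, so all stop together
        end
        params = params / 2;         % stand-in for chooseNewParams
    end
end
```

The whole loop stays on the workers, so nothing returns to the client between rounds, which is the difference from the parfor version.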

spmdDemo


Summary (spmd)

Required if inter-worker communication is needed for the algorithm

Can provide better performance for some algorithms


GPUs

Highly threaded

– 10^6 threads not uncommon

Very fast memory access

– 200GB/s (~8x best CPU)

Peak performance (double)

– 1TFlop (~3x best CPU)

GPU


Getting data to the GPU

To make an array exist on the GPU

g = gpuArray( dataOnCpu );

g = zeros( argsToZeros, 'gpuArray' );

g = ones( argsToOnes, 'uint8', 'gpuArray' );

Supported types

– All built-in numeric types

[complex|][[uint|int][8|16|32|64]|double|single]


Using gpuArray

Honestly – it’s just like an ordinary MATLAB array

Except that the methods that are implemented for it will run on the GPU (over 200 currently and growing)

– Maybe some of these will be faster on your GPU

Want to get the data back to the CPU?

c = gather(g);


GPUness spreads

function [a, b, c] = example(d, e, f)

a = sin(d) + e;

b = cos(d) + f;

c = a + b + e + f;


GPUness spreads

function [a, b, c] = example(d, e, f)

% Imagine if the input d were on the GPU

a = sin(d) + e;

b = cos(d) + f;

c = a + b + e + f;
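A sketch of the same three lines run as a script, assuming a supported GPU is present. Only d is placed on the GPU, and the input values are hypothetical. Every result that depends on d ends up as a gpuArray.

```matlab
d = gpuArray(linspace(0, pi, 5));   % only d starts on the GPU
e = 1;                              % ordinary CPU scalars
f = 2;

a = sin(d) + e;                     % gpuArray, because sin(d) is one
b = cos(d) + f;                     % gpuArray for the same reason
c = a + b + e + f;                  % gpuArray, because a and b are

onGpu  = isa(a, 'gpuArray') && isa(b, 'gpuArray') && isa(c, 'gpuArray');
cOnCpu = gather(c);                 % bring the result back as an ordinary array
```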


Getting data in the right place (new in 13b)

sIn = size(in);

out = in * eye(sIn) + ones(sIn);

The problem is that eye and ones make data in CPU memory

– And so we need to transfer data to the GPU (which is relatively slow)

out = in * eye(sIn,'like',in) + ones(sIn,'like',in);

'like' says: make the data in the same place, and of the same type, as the prototype provided


Semantic work pattern: gpuArray

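D = A.*B + C

[Figure: timeline of the statement as written. The element-wise multiplications of A and B produce tmp, and only then do the element-wise additions with C produce D.]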


Lazy Evaluation

Where possible we queue things up on the GPU and return to the program immediately

– We also try to amalgamate sets of operations together


Actual work pattern: gpuArray

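[Figure: timeline with one lane for the GPU and one for the CPU. On the CPU, tmp (future) and D (future) are returned immediately and CPU code continues. On the GPU, the multiplications of A and B produce tmp (actual) and the additions with C then produce D (actual).]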


Lazy Evaluation

Why do you care?

– Improves performance a lot

– CPU & GPU work at the same time.

But be careful because tic;toc; can easily give you the wrong time, since the computation hasn’t finished

d = gpuDevice; % Get the current GPU device

tic

gpuStuffToTime;

wait(d); % wait for computation on the GPU d to finish

toc
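An alternative sketch, assuming a supported GPU and a release that has gputimeit (R2013b or later, as far as is known): gputimeit runs the function several times and handles the synchronisation with the device itself. The matrix size and the operation timed are hypothetical.

```matlab
g = rand(2000, 'gpuArray');      % hypothetical workload
t = gputimeit(@() g * g);        % typical time in seconds, GPU work included
```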


Can we do better?

D = A.*B + C

[Figure: the same statement computed element by element. For each k from 1 to 4, A(k) .* B(k) gives tmp(k), and the addition that produces D(k) follows directly.]


arrayfun

Apply a function to each element of a set of gpuArrays

[o1, o2] = arrayfun(@aFunction, s1, s2, s3)

Some limitations apply

– All code uses scalar variables

– Only a subset of the MATLAB language is supported
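A sketch of the D = A.*B + C example written for arrayfun, saved as a function file. The names fusedMultiplyAdd and scalarMulAdd are hypothetical. The element function uses only scalars, as the limitation above requires.

```matlab
function D = fusedMultiplyAdd(A, B, C)
% Apply scalarMulAdd to each element of A, B and C.
% With gpuArray inputs the element function runs on the GPU;
% with ordinary arrays the same file runs on the CPU.
D = arrayfun(@scalarMulAdd, A, B, C);
end

function d = scalarMulAdd(a, b, c)
% All variables here are scalars.
d = a * b + c;
end
```

Calling it with gpuArray inputs, for example fusedMultiplyAdd(gpuArray(A), gpuArray(B), gpuArray(C)), returns a gpuArray.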


Why is this a good idea?

We know what inputs are being passed to your function

We know what code is in your function

with that we can infer the type of all variables in your code

and then we can generate code for your GPU

for each element of your input arrays we can execute your function on a single CUDA thread – remember a GPU can execute thousands of threads at once, and schedule even more

gpuMandelbrotDemo

Singleton Expansion

Whenever a dimension of an input array is singleton (equal to one), we virtually replicate that array along that dimension to match the other arrays.

– scalar expansion is a specific instance of singleton expansion

Look for functions that support singleton expansion (arrayfun, etc.)
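A small sketch of singleton expansion using bsxfun, a built-in that also supports it. Ordinary CPU arrays are used so it runs without a GPU, and the values are hypothetical.

```matlab
col = (1:3)';                    % 3-by-1: second dimension is singleton
row = [10 20 30 40];             % 1-by-4: first dimension is singleton
S   = bsxfun(@plus, col, row);   % 3-by-4, with S(i, j) = col(i) + row(j)
```

Neither input is ever replicated in memory to 3-by-4, which is where the memory saving comes from.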

singletonExpansionDemo

Batching many small operations (pagefun)

You have many matrices held in the pages of a multi-dimensional array

You want to carry out the same operation on each of the individual pages of the big array, e.g.

for ii = 1:numPages

C(:,:,ii) = A(:,:,ii) * B;

end
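The loop above can be sketched as a single pagefun call, assuming a supported GPU. The page count and matrix sizes are hypothetical.

```matlab
numPages = 100;                          % hypothetical sizes
A = rand(3, 3, numPages, 'gpuArray');
B = rand(3, 3, 'gpuArray');

% One batched call: C(:,:,ii) = A(:,:,ii) * B for every page ii.
C = pagefun(@mtimes, A, B);
```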

gpuPagefunDemo


Invoking CUDA Kernels

MATLAB

% Setup

kern = parallel.gpu.CUDAKernel('myKern.ptx', cFcnSig)

% Configure

kern.ThreadBlockSize=[512 1];

kern.GridSize=[1024 1024];

% Run

[c, d] = feval(kern, a, b);

C & mex

// Setup

mxGPUArray const * A = mxGPUCreateFromMxArray(prhs[0]);

// Create a GPUArray to hold the result and get its underlying

// pointer.

mxGPUArray * B = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(A),

mxGPUGetDimensions(A),

mxGPUGetClassID(A),

mxGPUGetComplexity(A),

MX_GPU_DO_NOT_INITIALIZE);

double * d_B = (double *)(mxGPUGetData(B));

// Standard CUDA kernel call using the CUDA runtime.

TimesTwo<<<blocksPerGrid, threadsPerBlock>>>(d_B, N);

}

// Device code prototype ...

void __global__ TimesTwo(double * const B, int const N) { ... };


Summary (GPU)

Vectorize as much as possible

Performance better for larger arrays (overhead smaller)

Keep data on the GPU as long as possible

Look for opportunities to use arrayfun and pagefun

– In particular, some loops can become serial calls to these functions

– Use less memory with singleton expansion
