Parallel Programming for Exa-Scale Computing
Nicolas Maillard
[email protected]
ASM 2011, Porto Alegre
Outline
• The GPPD
• ExaScale: Challenges
• Parallel Programming Landscape
• Task parallelism: what, why, how?
Exa-Flops/s == 10^18: Challenges
• Heterogeneity
  – GPUs
• Fault-Tolerance
  – Global state
• Scheduling/mapping
  – Dynamicity
• Bandwidth/data-transfer
  – Locality
• IO
• Power supply
• …
So many difficulties!
Parallel Programming: WHAT DO YOU HAVE?
OpenMP: default for Shared-Memory Systems
Sequential code:

  double res[10000];
  for (i = 0; i < 10000; i++)
      compute(res[i]);

Parallel OpenMP code:

  double res[10000];
  #pragma omp parallel for
  for (i = 0; i < 10000; i++)
      compute(res[i]);
Data Parallelism
• Simple idea:
  – Define your data structures (e.g. arrays, trees...);
  – Decide how you distribute them among your CPUs;
  – Run the computation in parallel on each local piece of data (sketched below).
• With few dependencies, simple and efficient.
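A minimal sketch of this recipe, assuming a block distribution of the res array from the OpenMP slide over p CPUs (the function name process_local_block and the even division N/p are my assumptions, not from the talk):

  #define N 10000
  void compute(double *x);   /* per-element work, as on the previous slide */

  /* Each CPU applies compute() to its own contiguous block of res. */
  void process_local_block(int rank, int p, double res[N]) {
      int chunk = N / p;            /* size of one local piece       */
      int begin = rank * chunk;     /* where this CPU's piece starts */
      for (int i = begin; i < begin + chunk; i++)
          compute(&res[i]);
  }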
[email protected] 13 01/09/2010
12/12/11
2
Example: CUDA for GPUs
void MatrixMul(float* M, float* N, float* P, int Width) {
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;   // Device vectors that will be processed in parallel.
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Call a function to be run by each CUDA thread on its own
    // piece of data:
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{…;}
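The kernel body is elided on the slide; a plausible sketch (my assumption, the classic one-thread-per-element version that matches the single Width x Width thread block above) would be:

  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  {
      // Each of the Width*Width threads computes one element of Pd.
      int row = threadIdx.y;
      int col = threadIdx.x;
      float sum = 0.0f;
      for (int k = 0; k < Width; ++k)
          sum += Md[row * Width + k] * Nd[k * Width + col];
      Pd[row * Width + col] = sum;
  }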
Message Passing Interface for Distributed Memory
int main(int argc, char **argv) {
    int r, tag = 103;
    MPI_Status stat;
    double valor;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    if (r == 0) {
        printf("Processor 0 sends a message to 1\n");
        valor = 3.14;
        MPI_Send(&valor, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else {
        printf("Processor 1 receives a message from 0\n");
        MPI_Recv(&valor, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &stat);
        printf("The received value is %.2lf\n", valor);
    }
    MPI_Finalize();
    return 0;
}
Note the MPI-defined datatype (MPI_DOUBLE) and the Send/Receive pair.
An MPI parallel program: p processes interact through message passing.
[Figure sequence: the same SPMD code runs as Process 0, Process 1 and Process 2, mapped onto 2 CPUs; rank 0 sends the value and rank 2 receives it. Each process executes:]

  int main(int argc, char **argv) {
      int r, tag = 103;
      double val;
      MPI_Init(...);
      MPI_Comm_rank(MPI_COMM_WORLD, &r);
      if (r == 0) { val = 3.14; MPI_Send(&val, ..., 2, tag, ...); }
      if (r == 2) MPI_Recv(&val, ..., 0, tag, ...);
  }
Scalability of MPI
• MPI on a Million Processors?
  – Limitations in the API (e.g. some calls may need O(p) arguments; see the signature below),
  – Limitations in the implementation:
    • Mapping rank/pid in a communicator,
  – How to enable fault tolerance in MPI?
[Euro-PVM/MPI 2009]
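One concrete instance of the O(p)-argument problem, as an illustration (the example is mine, not from the talk): MPI_Alltoallv takes four arrays whose length equals the number of processes p, so at a million processes each call carries four million-entry vectors.

  int MPI_Alltoallv(void *sendbuf, int sendcounts[], int sdispls[],
                    MPI_Datatype sendtype,
                    void *recvbuf, int recvcounts[], int rdispls[],
                    MPI_Datatype recvtype, MPI_Comm comm);
  /* sendcounts, sdispls, recvcounts and rdispls each hold p entries:
     the argument size grows linearly with the machine. */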
Anyway…
• As always in parallel programming, the conclusion is that:
  – There is still no universal, ideal API for parallel programming (at whatever scale),
  – You need to redesign your program/algorithm if you want to scale to the Exa:
    • No all-to-all,
    • 3D decomposition of the data,
    • …
Another approach/idea: TASK PARALLELISM
What are tasks?
• Task: a set of sequentially ordered instructions.
• Tasks have an input/output, can synch(), can communicate, can be stored, can be run.
• Fine granularity:
  – The task is a (virtual) entity which MAY be run in parallel by resources from the OS/HW. Or not.
Definition
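A minimal sketch of this definition, assuming nothing beyond the C++ standard library (the names Task, body and run are mine, purely illustrative):

  #include <functional>
  #include <queue>

  struct Task {
      std::function<int(int)> body;      // sequentially ordered instructions
      int input;                         // the task's input
      int run() { return body(input); }  // may be executed now, later, anywhere
  };

  int main() {
      std::queue<Task> pool;             // tasks can be stored...
      pool.push({ [](int n) { return n * n; }, 7 });
      Task t = pool.front(); pool.pop();
      return t.run() == 49 ? 0 : 1;      // ...and run by any resource
  }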
Why do you want tasks?
• "This is not the way one programs with (MPI|OpenMP|Pthreads|CUDA…)!"
  – Okay – but MPI is what you have! Is it what you want?
Parallel Programming Model
• Programming model: how do you want to write down your algorithms?
  – You want to specify/program your algorithm with as much parallelism as possible, independently of any hardware constraint.
• In the sequential world:
  – Do you care about the mapping of the variables to registers/stack/heap?
  – Do you split your instructions into functions based on their runtime?
  – So why would you do it in parallel?
Hardware vs. Software

[Figure, after Asanovic et al, 2006: the parallel programming model sits between the applications and the hardware, supported on one side by libraries, templates, algorithms, … and on the other by virtual machines, simulators, …]
How do you “think parallel”?
The factorial case – idea from Yale Patt.
• n! = n x (n-1) x (n-2) x (n-3) x ... x 3 x 2 x 1
• Basic notion of recursion: n! is defined by:
  1! == 1
  n! = n * (n-1)!

  fact(int n) {
      if (n <= 1) return 1;
      else return n * fact(n-1);
  }

Sequential or parallel?
The factorial case – idea from Yale Patt (continued).

[Figure: the call structure of the recursive fact(): fact(n) calls fact(n-1), which calls fact(n-2), then fact(n-3), … – each multiplication depends on the previous result.]

Sequential or parallel?
Another look at fact()

n! = [ n x (n-1) x (n-2) x ... x (n/2+1) ] x [ n/2 x ... x 3 x 2 x 1 ]

Which means that n! can be computed by:

  fact(int n) {
      int i;
      int res_l = 1;
      int res_h = 1;
      for (i = 1; i <= n/2; i++)
          res_l = res_l * i;
      for (i = n/2 + 1; i <= n; i++)
          res_h = res_h * i;
      return res_l * res_h;
  }
[Figure: fact(n) computed as two independent half-products.]
Run-time divided by 2.
Sequential or parallel?
One step more!

n! = [ n x (n-1) x (n-2) x ... x (n/2+1) ] x [ n/2 x ... x 3 x 2 x 1 ]

Which means that n! can be programmed recursively:

  fact(int start, int end) {   // product start * (start+1) * ... * end
      if (start == end) return start;
      else {
          int mid = (start + end) / 2;
          int res_l = fact(start, mid);
          int res_h = fact(mid + 1, end);
          return res_l * res_h;
      }
  }
  fact(1, n);   // n = 2^k
fact(1, n)
Run-time divided by ??
Sequential or parallel?
[Same slide, with the call tree drawn: fact(1, n) splits into fact(1, n/2) and fact(n/2+1, n), and so on recursively, down to n leaves (n = 2^k). Run-time divided by ??]
Tasks and Recursion
• The first algorithm was recursive and sequential.
• The second algorithm was iterative and sequential.
• Both could be redesigned to expose a high degree of "potential" parallelism.
• Recursion, together with Divide & Conquer, usually provides a lot of parallel tasks.
Moral from the fact() example
Scheduling tasks
• PRAM theory and Brent's Theorem: from a fine-grained, highly parallel program written without HW limits, you can deduce an efficient (i.e. with good speed-up) program:
  T_p(n) ≤ T_∞ + W_par / p, where W_par is the total parallel work and T_∞ the critical-path length.
• Yet, you need an efficient scheduler.
• In practice: Work Stealing serves.
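For instance (my numbers, not from the talk): the task-recursive fact(1, n) has W_par = O(n) multiplications and critical path T_∞ = O(log n), so the bound gives T_p(n) ≤ O(n/p + log n) – near-linear speed-up as long as p stays much smaller than n / log n.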
Tasks with Cilk
• http://supertech.csail.mit.edu/cilk/intro.html

  cilk int fib (int n) {
      if (n < 2) return n;
      else {
          int x, y;
          x = spawn fib(n-1);
          y = spawn fib(n-2);
          sync;
          return (x + y);
      }
  }
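Design note: spawn marks a call that may (but need not) run in parallel with the continuation of its parent, and sync waits for all of the parent's outstanding children, so x and y are guaranteed to be ready at the return.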
OpenMP
• Traditional OpenMP programming is based on loop parallelism.
  – The task is an iteration.

  function soma(integer n, real v(:)): real
      ! soma ("sum" in Portuguese) adds up the n elements of v
      real res
      if (n .eq. 1) then
          res = v(1)
      else
          res = 0
          !$omp parallel do reduction(+:res)
          do i = 1, n
              res = res + v(i)
          end do
      end if
      soma = res
      return
  end
Tasks in OpenMP
• Traditional OpenMP programming is based on loop parallelism.
  – The task is an iteration.
• OpenMP 3 (mid 2008) introduced task parallelism.
  – Heritage from Cilk.

  function soma(integer n, real v(:)): real
      real res, res1, res2
      if (n .eq. 1) then
          res = v(1)
      else
          !$omp task
          res1 = soma(n/2, v(1:n/2))
          !$omp end task
          !$omp task
          res2 = soma(n/2, v(n/2+1:n))
          !$omp end task
          !$omp taskwait
          res = res1 + res2
      end if
      soma = res
      return
  end
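Design note: the two task blocks may run concurrently on different threads; taskwait blocks until both children have completed, so res1 and res2 are safe to read when the sum is formed.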
Intel TBB
• http://www.threadingbuildingblocks.org – Intel's proposal for multicore programming.
  – Based on C++ STL/Containers.
    • "C[++]lear" syntax.
  – The tasks are the elements in the container.
    • You can "iterate" on the tasks in parallel,
    • You can apply algorithms to the tasks.
Example TBB
• "It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms" – Intel.
• "Create many more tasks than there are threads, and let the task scheduler choose the mapping from tasks to threads."
  – The scheduler uses Work Stealing.
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

// Parallel loop
class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<int>& r ) const {
        float *a = my_a;
        for( int i = r.begin(); i != r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) : my_a(a) {}   // Constructor
};

// Parallel invocation
void ParallelApplyFoo( float a[], int n ) {
    parallel_for( blocked_range<int>(0, n, GrainSize), ApplyFoo(a) );
}

// Sequential loop
void SerialApplyFoo( float a[], int n ) {
    for( int i = 0; i < n; ++i )
        Foo(a[i]);
}
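Design note: parallel_for recursively splits the blocked_range into sub-ranges (GrainSize caps how small a sub-range gets), wraps each in a task, and lets the work-stealing scheduler map those tasks onto threads.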
Tasks with Kaapi
Kaapi
• A French project – http://kaapi.gforge.inria.fr/
• A C++ library that provides an API for parallel programming based on tasks.
• A shared global address space:
  – You create objects inside it with the keyword "Shared".
• A task is a call to a function, prefixed by "Fork":
  – Just like Unix / Cilk,
  – Tasks communicate through Shared objects,
  – Tasks specify their access mode to Shared objects (read/write).
• The Kaapi runtime builds the data-flow graph and uses it to schedule the tasks.
A simple example: Fibonacci

  struct Fibonacci {
      void operator()( int n, a1::Shared_w<int> result ) {
          if (n < 2)
              result.write( n );
          else {
              a1::Shared<int> subresult1;
              a1::Shared<int> subresult2;
              a1::Fork<Fibonacci>()( n-1, subresult1 );
              a1::Fork<Fibonacci>()( n-2, subresult2 );
              a1::Fork<Sum>()( result, subresult1, subresult2 );
          }
      }
  };

  struct Sum {
      void operator()( a1::Shared_w<int> result,
                       a1::Shared_r<int> sr1, a1::Shared_r<int> sr2 ) {
          result.write( sr1.read() + sr2.read() );
      }
  };
Tasks & Heterogeneous Parallel Programming
• Adaptive Work Stealing already uses 2 algorithms – 1 sequential, 1 parallel.
• Why not use 2 different implementations (methods) to run on a container, e.g. one for CPU and one for GPU?
  – The merge() method handles the different address spaces.
• This has been done partially by B. Raffin and E. Hermann [EGPGV 09].
• J. Lima's PhD.
Tasks with MPI?
The MPI task
• MPI defines tasks that:
  – Have their own address space,
  – Communicate with other tasks through messages,
  – Are usually all launched at the start of the program.
• The mapping "MPI task" / O.S. is not specified:
  – Usually, 1 task == 1 heavyweight process (the O.S. view); MPI-2 has somewhat reinforced this common understanding,
  – Some MPI distributions use threads (MPICH),
  – A-MPI (Urbana-Champaign) defines an abstract task.
Dynamic Tasks in MPI: D&C and Spawn
• Program with Divide & Conquer techniques.
• Use MPI_Comm_spawn to (recursively) create new tasks.
  – 1 task == 1 (MPI) process.
• Make sure that the child tasks can communicate with their parent (see the sketch below):
  – Have the parent send the children their input data, and then block,
  – Have the children send their results back to the parent.
• This implies very large-grained parallelism, but at least:
  – You can benefit from dynamic resources,
  – You can improve the load balance.
Using MPI-‐2
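A minimal, self-spawning sketch of this pattern, assuming one child per spawn and a trivial stand-in for the real work (the doubling, the counts and the tags are all illustrative, not from the talk):

  #include <mpi.h>
  #include <stdio.h>

  /* The program spawns one copy of itself as a child task, sends it an
     input, blocks, and receives the result back. */
  int main(int argc, char **argv) {
      MPI_Comm parent;
      MPI_Init(&argc, &argv);
      MPI_Comm_get_parent(&parent);

      if (parent == MPI_COMM_NULL) {   /* parent side: create the child task */
          MPI_Comm child;
          double input = 42.0, result;
          MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                         0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
          MPI_Send(&input, 1, MPI_DOUBLE, 0, 0, child);   /* send input...  */
          MPI_Recv(&result, 1, MPI_DOUBLE, 0, 0, child,   /* ...then block  */
                   MPI_STATUS_IGNORE);
          printf("parent received %g\n", result);
      } else {                         /* child side: work, then report back */
          double input, result;
          MPI_Recv(&input, 1, MPI_DOUBLE, 0, 0, parent, MPI_STATUS_IGNORE);
          result = 2.0 * input;        /* hypothetical stand-in for real work */
          MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, parent);
      }
      MPI_Finalize();
      return 0;
  }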
Parallel runtime of Eratosthenes' Sieve

[Figure: measured runtimes of a static MPI implementation vs. a malleable MPI implementation. Cera 2007]
Cluster usage with rigid jobs + malleable MPI programs

[Figure: cluster usage. Cera 2009]
MPI tasks: processes vs. threads
• João Lima's Master's work: a modification of MPICH (C++ API).
• Transparent to the user: the spawned tasks may be run by processes or threads.
• Does not change the Send/Recv semantics.
MPI processes and threads

[Figure: mergesort of 3M numbers, runtime as a function of the number of threads per process. Lima 2009]
Validation with "Real-World" apps
• OLAM – the Ocean-Land-Atmosphere Model,
• Uses finite elements.
Conclusions
• To reach the ExaFlops/s, new tools are required:
  – New hardware, better IO solutions, better interfaces with the user (visualization…),
  – A parallel programming model that decouples the parallelism from the hardware.
• Task-based parallelism may be one way:
  – There are solutions for shared-memory systems (TBB, OpenMP 3…),
  – Support for distributed memory is still on-going research (Kaapi, Charm++…).
Other ideas to go further
• It is probably utopian to try to devise one more API for parallel programming.
• Automatic generation of code:
  – Different from compiling: support for the programmer,
  – Source-to-source compiling, interaction with the programmer.
• Auto-Tuning.