Parallel Programming for Exa-Scale Computing
Nicolas Maillard
[email protected]
ASM 2011, Porto Alegre
Outline
• The GPPD
• ExaScale: Challenges
• Parallel Programming Landscape
• Task parallelism: what, why, how?
Exa-Flops/s == 10^18: Challenges
• Heterogeneity
  – GPUs
• Fault-Tolerance
  – Global state
• Scheduling/mapping
  – Dynamicity
• Bandwidth/data-transfer
  – Locality
• IO
• Power supply
• …
So many difficulties!
Parallel Programming: WHAT DO YOU HAVE?
OpenMP: default for Shared-Memory Systems
Sequential code:

  double res[10000];
  for (i = 0; i < 10000; i++)
      compute(res[i]);

Parallel OpenMP code:

  double res[10000];
  #pragma omp parallel for
  for (i = 0; i < 10000; i++)
      compute(res[i]);
Data Parallelism
• Simple idea:
  – Define your data structures (e.g. arrays, trees...);
  – Decide how you distribute them among your CPUs;
  – Run the computation in parallel on each local piece of data (sketched below).
• With few dependencies, simple and efficient.
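A minimal sketch of this recipe, assuming a block distribution of the res array from the OpenMP slide over p CPUs (the function name process_local_block and the even division N/p are my assumptions, not from the talk):

  #define N 10000
  void compute(double *x);   /* per-element work, as on the previous slide */

  /* Each CPU applies compute() to its own contiguous block of res. */
  void process_local_block(int rank, int p, double res[N]) {
      int chunk = N / p;            /* size of one local piece       */
      int begin = rank * chunk;     /* where this CPU's piece starts */
      for (int i = begin; i < begin + chunk; i++)
          compute(&res[i]);
  }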
[email protected] 13 01/09/2010
12/12/11
2
Example: CUDA for GPUs
void MatrixMul(float* M, float* N, float* P, int Width) {
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;   // Device vectors that will be processed in parallel.
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Call a function to be run by each CUDA thread on its own
    // piece of data:
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{…;}
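The kernel body is elided on the slide; a plausible sketch (my assumption, the classic one-thread-per-element version that matches the single Width x Width thread block above) would be:

  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  {
      // Each of the Width*Width threads computes one element of Pd.
      int row = threadIdx.y;
      int col = threadIdx.x;
      float sum = 0.0f;
      for (int k = 0; k < Width; ++k)
          sum += Md[row * Width + k] * Nd[k * Width + col];
      Pd[row * Width + col] = sum;
  }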
Message Passing Interface for Distributed Memory
int main(int argc, char **argv) {
    int r, tag = 103;
    MPI_Status stat;
    double valor;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    if (r == 0) {
        printf("Processor 0 sends a message to 1\n");
        valor = 3.14;
        MPI_Send(&valor, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else {
        printf("Processor 1 receives a message from 0\n");
        MPI_Recv(&valor, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &stat);
        printf("The received value is %.2lf\n", valor);
    }
    MPI_Finalize();
    return 0;
}
Note the MPI-defined datatype (MPI_DOUBLE) and the Send/Receive pair.
An MPI parallel program: p processes interact through message passing.
[Figure sequence: the same SPMD code runs as Process 0, Process 1 and Process 2, mapped onto 2 CPUs; rank 0 sends the value and rank 2 receives it. Each process executes:]

  int main(int argc, char **argv) {
      int r, tag = 103;
      double val;
      MPI_Init(...);
      MPI_Comm_rank(MPI_COMM_WORLD, &r);
      if (r == 0) { val = 3.14; MPI_Send(&val, ..., 2, tag, ...); }
      if (r == 2) MPI_Recv(&val, ..., 0, tag, ...);
  }
Scalability of MPI
• MPI on a Million Processors?
  – Limitations in the API (e.g. some calls may need O(p) arguments; see the signature below),
  – Limitations in the implementation:
    • Mapping rank/pid in a communicator,
  – How to enable fault tolerance in MPI?
[Euro-PVM/MPI 2009]
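One concrete instance of the O(p)-argument problem, as an illustration (the example is mine, not from the talk): MPI_Alltoallv takes four arrays whose length equals the number of processes p, so at a million processes each call carries four million-entry vectors.

  int MPI_Alltoallv(void *sendbuf, int sendcounts[], int sdispls[],
                    MPI_Datatype sendtype,
                    void *recvbuf, int recvcounts[], int rdispls[],
                    MPI_Datatype recvtype, MPI_Comm comm);
  /* sendcounts, sdispls, recvcounts and rdispls each hold p entries:
     the argument size grows linearly with the machine. */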
Anyway…
• As always in parallel programming, the conclusion is that:
  – There is still no universal, ideal API for parallel programming (at whatever scale),
  – You need to redesign your program/algorithm if you want to scale to the Exa:
    • No all-to-all,
    • 3D decomposition of the data,
    • …
Another approach/idea: TASK PARALLELISM
What are tasks?
• Task: a set of sequentially ordered instructions.
• Tasks have an input/output, can synch(), can communicate, can be stored, can be run.
• Fine granularity:
  – The task is a (virtual) entity which MAY be run in parallel by resources from the OS/HW. Or not.
Definition
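A minimal sketch of this definition, assuming nothing beyond the C++ standard library (the names Task, body and run are mine, purely illustrative):

  #include <functional>
  #include <queue>

  struct Task {
      std::function<int(int)> body;      // sequentially ordered instructions
      int input;                         // the task's input
      int run() { return body(input); }  // may be executed now, later, anywhere
  };

  int main() {
      std::queue<Task> pool;             // tasks can be stored...
      pool.push({ [](int n) { return n * n; }, 7 });
      Task t = pool.front(); pool.pop();
      return t.run() == 49 ? 0 : 1;      // ...and run by any resource
  }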
Why do you want tasks?
• "This is not the way one programs with (MPI|OpenMP|Pthreads|CUDA…)!"
  – Okay – but MPI is what you have! Is it what you want?
Parallel Programming Model
• Programming model: how do you want to write down your algorithms?
  – You want to specify/program your algorithm with as much parallelism as possible, independently of any hardware constraint.
• In the sequential world:
  – Do you care about the mapping of the variables to registers/stack/heap?
  – Do you split your instructions into functions based on their runtime?
  – So why would you do it in parallel?
Hardware vs. Software

[Figure, after Asanovic et al, 2006: the parallel programming model sits between the applications and the hardware, supported on one side by libraries, templates, algorithms, … and on the other by virtual machines, simulators, …]
How do you “think parallel”?
The factorial case – idea from Yale Patt.
• n! = n x (n-1) x (n-2) x (n-3) x ... x 3 x 2 x 1
• Basic notion of recursion: n! is defined by:
  1! == 1
  n! = n * (n-1)!

  fact(int n) {
      if (n <= 1) return 1;
      else return n * fact(n-1);
  }

Sequential or parallel?
The factorial case – idea from Yale Patt (continued).

[Figure: the call structure of the recursive fact(): fact(n) calls fact(n-1), which calls fact(n-2), then fact(n-3), … – each multiplication depends on the previous result.]

Sequential or parallel?
Another look at fact()

n! = [ n x (n-1) x (n-2) x ... x (n/2+1) ] x [ n/2 x ... x 3 x 2 x 1 ]

Which means that n! can be computed by:

  fact(int n) {
      int i;
      int res_l = 1;
      int res_h = 1;
      for (i = 1; i <= n/2; i++)
          res_l = res_l * i;
      for (i = n/2 + 1; i <= n; i++)
          res_h = res_h * i;
      return res_l * res_h;
  }
[Figure: fact(n) computed as two independent half-products.]
Run-time divided by 2.
Sequential or parallel?
One step more!

n! = [ n x (n-1) x (n-2) x ... x (n/2+1) ] x [ n/2 x ... x 3 x 2 x 1 ]

Which means that n! can be programmed recursively:

  fact(int start, int end) {   // product start * (start+1) * ... * end
      if (start == end) return start;
      else {
          int mid = (start + end) / 2;
          int res_l = fact(start, mid);
          int res_h = fact(mid + 1, end);
          return res_l * res_h;
      }
  }
  fact(1, n);   // n = 2^k
fact(1, n)
Run-time divided by ??
Sequential or parallel?
[Same slide, with the call tree drawn: fact(1, n) splits into fact(1, n/2) and fact(n/2+1, n), and so on recursively, down to n leaves (n = 2^k). Run-time divided by ??]
Tasks and Recursion
• The first algorithm was recursive and sequential.
• The second algorithm was iterative and sequential.
• Both could be redesigned to expose a high degree of "potential" parallelism.
• Recursion, together with Divide & Conquer, usually provides a lot of parallel tasks.
Moral from the fact() example
Scheduling tasks
• PRAM theory and Brent's Theorem: from a fine-grained, highly parallel program written without HW limits, you can deduce an efficient (i.e. with good speed-up) program:
  T_p(n) ≤ T_∞ + W_par / p, where W_par is the total parallel work and T_∞ the critical-path length.
• Yet, you need an efficient scheduler.
• In practice: Work Stealing serves.
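For instance (my numbers, not from the talk): the task-recursive fact(1, n) has W_par = O(n) multiplications and critical path T_∞ = O(log n), so the bound gives T_p(n) ≤ O(n/p + log n) – near-linear speed-up as long as p stays much smaller than n / log n.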
Tasks with Cilk
• http://supertech.csail.mit.edu/cilk/intro.html

  cilk int fib (int n) {
      if (n < 2) return n;
      else {
          int x, y;
          x = spawn fib(n-1);
          y = spawn fib(n-2);
          sync;
          return (x + y);
      }
  }
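Design note: spawn marks a call that may (but need not) run in parallel with the continuation of its parent, and sync waits for all of the parent's outstanding children, so x and y are guaranteed to be ready at the return.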
OpenMP
• Traditional OpenMP programming is based on loop parallelism.
  – The task is an iteration.

  function soma(integer n, real v(:)): real
      ! soma ("sum" in Portuguese) adds up the n elements of v
      real res
      if (n .eq. 1) then
          res = v(1)
      else
          res = 0
          !$omp parallel do reduction(+:res)
          do i = 1, n
              res = res + v(i)
          end do
      end if
      soma = res
      return
  end
Tasks in OpenMP
• Traditional OpenMP programming is based on loop parallelism.
  – The task is an iteration.
• OpenMP 3 (mid 2008) introduced task parallelism.
  – Heritage from Cilk.

  function soma(integer n, real v(:)): real
      real res, res1, res2
      if (n .eq. 1) then
          res = v(1)
      else
          !$omp task
          res1 = soma(n/2, v(1:n/2))
          !$omp end task
          !$omp task
          res2 = soma(n/2, v(n/2+1:n))
          !$omp end task
          !$omp taskwait
          res = res1 + res2
      end if
      soma = res
      return
  end
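Design note: the two task blocks may run concurrently on different threads; taskwait blocks until both children have completed, so res1 and res2 are safe to read when the sum is formed.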
Intel TBB
• http://www.threadingbuildingblocks.org – Intel's proposal for multicore programming.
  – Based on C++ STL/Containers.
    • "C[++]lear" syntax.
  – The tasks are the elements in the container.
    • You can "iterate" on the tasks in parallel,
    • You can apply algorithms to the tasks.
Example TBB
• "It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms" – Intel.
• "Create many more tasks than there are threads, and let the task scheduler choose the mapping from tasks to threads."
  – The scheduler uses Work Stealing.
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

// Parallel loop
class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<int>& r ) const {
        float *a = my_a;
        for( int i = r.begin(); i != r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) : my_a(a) {}   // Constructor
};

// Parallel invocation
void ParallelApplyFoo( float a[], int n ) {
    parallel_for( blocked_range<int>(0, n, GrainSize), ApplyFoo(a) );
}

// Sequential loop
void SerialApplyFoo( float a[], int n ) {
    for( int i = 0; i < n; ++i )
        Foo(a[i]);
}
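Design note: parallel_for recursively splits the blocked_range into sub-ranges (GrainSize caps how small a sub-range gets), wraps each in a task, and lets the work-stealing scheduler map those tasks onto threads.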
Tasks with Kaapi
Kaapi
• A French project – http://kaapi.gforge.inria.fr/
• A C++ library that provides an API for parallel programming based on tasks.
• A shared global address space:
  – You create objects inside it with the keyword "Shared".
• A task is a call to a function, prefixed by "Fork":
  – Just like Unix / Cilk,
  – Tasks communicate through Shared objects,
  – Tasks specify their access mode to Shared objects (read/write).
• The Kaapi runtime builds the data-flow graph and uses it to schedule the tasks.
A simple example: Fibonacci

  struct Fibonacci {
      void operator()( int n, a1::Shared_w<int> result ) {
          if (n < 2)
              result.write( n );
          else {
              a1::Shared<int> subresult1;
              a1::Shared<int> subresult2;
              a1::Fork<Fibonacci>()( n-1, subresult1 );
              a1::Fork<Fibonacci>()( n-2, subresult2 );
              a1::Fork<Sum>()( result, subresult1, subresult2 );
          }
      }
  };

  struct Sum {
      void operator()( a1::Shared_w<int> result,
                       a1::Shared_r<int> sr1, a1::Shared_r<int> sr2 ) {
          result.write( sr1.read() + sr2.read() );
      }
  };
Tasks & Heterogeneous Parallel Programming
• Adaptive Work Stealing already uses 2 algorithms – 1 sequential, 1 parallel.
• Why not use 2 different implementations (methods) to run on a container, e.g. one for CPU and one for GPU?
  – The merge() method handles the different address spaces.
• This has been done partially by B. Raffin and E. Hermann [EGPGV 09].
• J. Lima's PhD.
Tasks with MPI?
The MPI task
• MPI defines tasks that:
  – Have their own address space,
  – Communicate with other tasks through messages,
  – Are usually all launched at the start of the program.
• The mapping "MPI task" / O.S. is not specified:
  – Usually, 1 task == 1 heavyweight process (the O.S. view); MPI-2 has somewhat reinforced this common understanding,
  – Some MPI distributions use threads (MPICH),
  – A-MPI (Urbana-Champaign) defines an abstract task.
Dynamic Tasks in MPI: D&C and Spawn
• Program with Divide & Conquer techniques.
• Use MPI_Comm_spawn to (recursively) create new tasks.
  – 1 task == 1 (MPI) process.
• Make sure that the child tasks can communicate with their parent (see the sketch below):
  – Have the parent send the children their input data, and then block,
  – Have the children send their results back to the parent.
• This implies very large-grained parallelism, but at least:
  – You can benefit from dynamic resources,
  – You can improve the load balance.
Using MPI-‐2
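A minimal, self-spawning sketch of this pattern, assuming one child per spawn and a trivial stand-in for the real work (the doubling, the counts and the tags are all illustrative, not from the talk):

  #include <mpi.h>
  #include <stdio.h>

  /* The program spawns one copy of itself as a child task, sends it an
     input, blocks, and receives the result back. */
  int main(int argc, char **argv) {
      MPI_Comm parent;
      MPI_Init(&argc, &argv);
      MPI_Comm_get_parent(&parent);

      if (parent == MPI_COMM_NULL) {   /* parent side: create the child task */
          MPI_Comm child;
          double input = 42.0, result;
          MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                         0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
          MPI_Send(&input, 1, MPI_DOUBLE, 0, 0, child);   /* send input...  */
          MPI_Recv(&result, 1, MPI_DOUBLE, 0, 0, child,   /* ...then block  */
                   MPI_STATUS_IGNORE);
          printf("parent received %g\n", result);
      } else {                         /* child side: work, then report back */
          double input, result;
          MPI_Recv(&input, 1, MPI_DOUBLE, 0, 0, parent, MPI_STATUS_IGNORE);
          result = 2.0 * input;        /* hypothetical stand-in for real work */
          MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, parent);
      }
      MPI_Finalize();
      return 0;
  }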
Parallel runtime of Eratosthenes' Sieve

[Figure: measured runtimes of a static MPI implementation vs. a malleable MPI implementation. Cera 2007]
Cluster usage with rigid jobs + malleable MPI programs

[Figure: cluster usage. Cera 2009]
MPI tasks: processes vs. threads
• João Lima's Master's work: a modification of MPICH (C++ API).
• Transparent to the user: the spawned tasks may be run by processes or threads.
• Does not change the Send/Recv semantics.
MPI processes and threads

[Figure: mergesort of 3M numbers, runtime as a function of the number of threads per process. Lima 2009]
Validation with "Real-World" apps
• OLAM – the Ocean-Land-Atmosphere Model,
• Uses finite elements.
Conclusions
• To reach the ExaFlops/s, new tools are required:
  – New hardware, better IO solutions, better interfaces with the user (visualization…),
  – A parallel programming model that decouples the parallelism from the hardware.
• Task-based parallelism may be one way:
  – There are solutions for shared-memory systems (TBB, OpenMP 3…),
  – Support for distributed memory is still on-going research (Kaapi, Charm++…).
Other ideas to go further
• It is probably utopian to try to devise one more API for parallel programming.
• Automatic generation of code:
  – Different from compiling: support for the programmer,
  – Source-to-source compiling, interaction with the programmer.
• Auto-Tuning.