MPI – An introduction by Jeroen van Hunen
• What is MPI and why should we use it?
• Simple example + some basic MPI functions
• Other frequently used MPI functions
• Compiling and running code with MPI
• Domain decomposition
• Stokes solver
• Tracers/markers
• Performance
• Documentation
What is MPI?
• Mainly a data communication tool: “Message-Passing Interface”
• Allows parallel calculation on distributed memory machines
• Usually the Single-Program-Multiple-Data principle is used: all processors have similar tasks (e.g. in domain decomposition)
• Alternative: OpenMP for shared memory machines
Why should we use MPI?
• If sequential calculations take too long
• If sequential calculations use too much memory
Simple MPI example

Output for 4 processors:

Code (the annotations on the slide):
• mpi.h contains definitions, macros, function prototypes
• initialize MPI, ask the processor ‘rank’, ask the number of processors p
• stop MPI
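A minimal sketch in C of a program with this structure (a hedged reconstruction using the standard MPI calls, not necessarily the exact code shown in the talk):

#include <stdio.h>
#include <mpi.h>   /* contains definitions, macros, function prototypes */

int main(int argc, char *argv[])
{
    int rank, p;

    MPI_Init(&argc, &argv);                 /* initialize MPI        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* ask processor 'rank'  */
    MPI_Comm_size(MPI_COMM_WORLD, &p);      /* ask # processors p    */

    printf("Hello from processor %d of %d\n", rank, p);

    MPI_Finalize();                         /* stop MPI              */
    return 0;
}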
Other frequently used MPI calls
Sending and receiving at the same time with MPI_SENDRECV: no risk of deadlocks.
… or overwrite the send buffer with the received info (MPI_SENDRECV_REPLACE).
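A hedged sketch in C of both variants (the neighbour ranks ‘left’ and ‘right’ are illustrative, not from the talk):

#include <mpi.h>

/* Hypothetical example: exchange one double with neighbouring ranks;
   MPI_Init is assumed to have been called already. */
void exchange(double *sendval, double *recvval, int left, int right)
{
    MPI_Status status;

    /* send to the right neighbour and receive from the left one
       in a single call: no risk of deadlock */
    MPI_Sendrecv(sendval, 1, MPI_DOUBLE, right, 0,
                 recvval, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);

    /* ... or overwrite the send buffer with the received value */
    MPI_Sendrecv_replace(sendval, 1, MPI_DOUBLE, right, 0,
                         left, 0, MPI_COMM_WORLD, &status);
}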
Other frequently used MPI calls
Synchronizing the processors: wait for each other at the barrier (MPI_BARRIER).
Broadcasting a message from one processor to all the others: both sending and receiving processors use same call to MPI_BCAST
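A hedged sketch in C of both calls (the routine and variable names are illustrative, not from the talk):

#include <mpi.h>

/* Hypothetical example: processor 0 broadcasts one parameter to all
   others; all processors wait at a barrier first. */
void distribute(double *param)
{
    /* wait until every processor has reached this point */
    MPI_Barrier(MPI_COMM_WORLD);

    /* the same call is used by the sending processor (root = 0)
       and by all receiving processors */
    MPI_Bcast(param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}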
Other frequently used MPI calls
“Reducing” (combining) data from all processors with MPI_REDUCE: add, find the maximum/minimum, etc.
OP can be one of the following: MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_MAXLOC, MPI_MINLOC, MPI_LAND, MPI_LOR, MPI_BAND, MPI_BOR
For results to be available at all processors, use MPI_Allreduce:
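A hedged sketch in C of both calls (the variable names are illustrative, not from the talk):

#include <mpi.h>

/* Hypothetical example: combine a local value from all processors. */
void global_sum_and_max(double local, double *sum, double *max)
{
    /* result available on processor 0 only */
    MPI_Reduce(&local, sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* result available on all processors */
    MPI_Allreduce(&local, max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
}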
Additional comments:
• ‘Wildcards’ are allowed in MPI calls for:
  • source: MPI_ANY_SOURCE
  • tag: MPI_ANY_TAG
• MPI_SEND and MPI_RECV are ‘blocking’: they wait until the job is done
Deadlocks:
• Deadlock: both processors receive first and then send; neither send is ever reached
• Depending on buffering: both processors send first; this works only if MPI buffers the messages internally
• Safe: one processor sends first while the other receives first
• Don’t let a processor send a message to itself; in this case use MPI_SENDRECV
Non-matching send/receive calls may block the code.
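A hedged sketch in C of the three cases, for two processors exchanging one value (illustrative, not from the talk):

#include <mpi.h>

/* Hypothetical example: two processors (ranks 0 and 1) exchange one double. */
void exchange_two_ranks(int rank, double *sendval, double *recvval)
{
    MPI_Status status;
    int other = 1 - rank;   /* the other of the two processors */

    /* Deadlock: both processors block in MPI_Recv, nobody ever sends.
    MPI_Recv(recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    */

    /* Depending on buffering: both send first; works only if MPI
       buffers the outgoing messages internally.
    MPI_Send(sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);
    */

    /* Safe: one processor sends first, the other receives first. */
    if (rank == 0) {
        MPI_Send(sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);
    } else {
        MPI_Recv(recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);
        MPI_Send(sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    }
}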
Compiling and running code with MPI
Compiling:
Fortran:
  mpif77 -o binary code.f
  mpif90 -o binary code.f
C:
  mpicc -o binary code.c

Running in general (no queueing system):
  mpirun -np 4 binary
  mpirun -np 4 -nolocal -machinefile mach binary

Running on Gonzales (with queueing system):
  bsub -n 4 -W 8:00 prun binary
Domain decomposition
(figure: 3-D computational domain, with axes x, y, z, divided into blocks)
• Total computational domain divided into ‘equal size’ blocks
• Each processor only deals with its own block
• At block boundaries some information exchange necessary
• Block division matters (see the sketch below):
  • surface/volume ratio
  • number of processor boundaries
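One common way to set up such a block division is through MPI’s Cartesian topology routines; a hedged sketch in C (not from the talk):

#include <mpi.h>

/* Hypothetical example: divide p processors over a 3-D grid of blocks
   and find the neighbouring blocks of each processor. */
void make_decomposition(int p)
{
    MPI_Comm cart;
    int dims[3]    = {0, 0, 0};    /* let MPI choose the block layout */
    int periods[3] = {0, 0, 0};    /* non-periodic domain             */
    int west, east, south, north, bottom, top;

    MPI_Dims_create(p, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

    /* ranks of the neighbouring blocks (MPI_PROC_NULL at domain edges) */
    MPI_Cart_shift(cart, 0, 1, &west,   &east);
    MPI_Cart_shift(cart, 1, 1, &south,  &north);
    MPI_Cart_shift(cart, 2, 1, &bottom, &top);
}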
Stokes equation: Jacobi iterative solver

• In the block interior: no MPI needed. The five-point stencil update of a point M from its neighbours N, S, E, W is
    M = 0.25*(N + S + E + W)
• At a block boundary: MPI needed. The stencil is split over the two neighbouring blocks:
    M1 = 0.25*(N1 + S1 + W) on the block that owns N1, S1 and W
    M2 = 0.25*(E) on the neighbouring block
    M = M1 + M2 (combined using MPI_SENDRECV), after which both copies are set to M1 = M2 = M
• A Gauss-Seidel solver performs better, but is also slightly more difficult to implement.
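A hedged sketch in C of one Jacobi sweep with the boundary exchange done through ghost columns (a simplification of the split-sum scheme drawn on the slide; the array sizes and neighbour ranks are illustrative, not from the talk):

#include <mpi.h>

#define NX 66   /* local block size, including one ghost row per side */
#define NY 66

/* Hypothetical example: one Jacobi sweep on a local block, exchanging
   ghost rows with the left/right neighbouring blocks first. */
void jacobi_sweep(double u[NX][NY], double unew[NX][NY],
                  int left, int right)
{
    MPI_Status status;
    int i, j;

    /* send first interior row to the left, receive right ghost row */
    MPI_Sendrecv(&u[1][0],      NY, MPI_DOUBLE, left,  0,
                 &u[NX - 1][0], NY, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, &status);
    /* send last interior row to the right, receive left ghost row */
    MPI_Sendrecv(&u[NX - 2][0], NY, MPI_DOUBLE, right, 1,
                 &u[0][0],      NY, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, &status);

    /* interior update: no MPI needed */
    for (i = 1; i < NX - 1; i++)
        for (j = 1; j < NY - 1; j++)
            unew[i][j] = 0.25 * (u[i][j - 1] + u[i][j + 1] +
                                 u[i - 1][j] + u[i + 1][j]);
}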
Tracers/Markers
(figure: a tracer trajectory crossing the boundary between proc n and proc n+1, with the two Runge-Kutta increments k1 and k2)
2nd order Runge-Kutta scheme:
  k1 = dt v(t, x(t))
  k2 = dt v(t + dt/2, x(t) + k1/2)
  x(t + dt) = x(t) + k2
Procedure:
• Calculate x(t+dt/2). If it lies in proc n+1:
  • proc n sends the tracer coordinates to proc n+1
  • proc n+1 reports the tracer velocity back to proc n
• Calculate x(t+dt). If it lies in proc n+1:
  • proc n sends the tracer coordinates + function values permanently to proc n+1
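A hedged sketch in C of the first hand-off, evaluating the midpoint velocity on the neighbouring block (the tags, variable names and the interpolation routine are illustrative, not from the talk):

#include <mpi.h>

/* Assumed local routine for interpolating the velocity at a point;
   not from the talk, defined elsewhere. */
void interpolate_velocity(const double x[3], double v[3]);

/* Hypothetical example: proc n sends a tracer position to proc n+1,
   which evaluates the velocity there and reports it back. Both
   processors are assumed to already know that this exchange is needed
   (e.g. after first exchanging tracer counts). */
void midpoint_velocity(double x[3], double v[3], int n, int rank)
{
    MPI_Status status;

    if (rank == n) {
        MPI_Send(x, 3, MPI_DOUBLE, n + 1, 10, MPI_COMM_WORLD);
        MPI_Recv(v, 3, MPI_DOUBLE, n + 1, 11, MPI_COMM_WORLD, &status);
    } else if (rank == n + 1) {
        MPI_Recv(x, 3, MPI_DOUBLE, n, 10, MPI_COMM_WORLD, &status);
        interpolate_velocity(x, v);
        MPI_Send(v, 3, MPI_DOUBLE, n, 11, MPI_COMM_WORLD);
    }
}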
Performance
For jobs that are too small, communication quickly becomes the bottleneck.
This problem:
• Rayleigh-Bénard convection (Ra = 10^6)
• 2-D: 64x64 finite elements, 10^4 time steps
• 3-D: 64x64x64 finite elements, 100 time steps
• Calculation on Gonzales
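Communication and computation times can be measured separately with MPI_Wtime; a minimal hedged sketch in C (not from the talk):

#include <stdio.h>
#include <mpi.h>

/* Hypothetical example: time a code section to see how much of it
   is spent in communication. */
void report_time(int rank)
{
    double t0, t1;

    t0 = MPI_Wtime();
    /* ... the computation or communication to be timed ... */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed time: %f s\n", t1 - t0);
}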