Post on 29-May-2020
AMS 529: Finite Element Methods: Fundamentals, Applications, and New Trends
Lecture 24: Distributed-Memory Programming for FEM
Xiangmin Jiao
SUNY Stony Brook
Xiangmin Jiao Finite Element Methods 1 / 23
Outline
1 Introduction to MPI
2 Distributed-Memory Parallelization of FEM
Review of Shared-Memory Programming Paradigm
There is a single (shared) address space: all parallel threads can access all data (if the address is known)
Advantages:
- Easier to get started
Disadvantages:
- Prone to performance bottlenecks due to synchronization (locks, mutexes)
- Difficult to scale due to hardware limitations (number of cores, memory bandwidth, etc.)
Bottom line: difficult to get good performance
Distributed-Memory Programming Paradigm
Multiple machines, each with its own address space
No direct access to remote data; data must be transported explicitly
Advantages:
- Programmer has more control of communication and synchronization
- Easier to scale to a large number of cores (more than 1M cores)
Disadvantages:
- Much more difficult programming model
- You are forced to think algorithmically and make hard decisions
- Practical difficulties in debugging, profiling, etc.
Good distributed-memory algorithms may be converted into efficient shared-memory codes, so it is advantageous to think algorithmically in distributed-memory terms
Distributed-Memory Programming Languages/Libraries
Message Passing Interface (MPI)
- Remote memory access (RMA), i.e., one-sided communication; supported by MPI-2 and MPI-3
Partitioned global address space (PGAS)
- Unified Parallel C (UPC)
- Coarray Fortran (part of Fortran 2008)
- Chapel, X10, Titanium, etc.
Remote procedure calls (RPC)
Most of these belong to SPMD (Single Program, Multiple Data) or MIMD (Multiple Instruction, Multiple Data) parallelism
Message Passing
A parallel program consists of processes, each with its own address space
Data need to be
- distributed explicitly by the algorithm/programmer
- sent/received explicitly by the algorithm/programmer
Communication patterns:
- Point-to-point (one-to-one) communication: one side sends; the other side receives
- Collective (global) communication (broadcast and reduction) and synchronization (barriers) on an arbitrary subset of processes
Message Passing Interface (MPI)
MPI is a library, with C and Fortran bindings
- Bindings also exist for other languages such as Python, MATLAB/Octave, etc.
- A C++ binding was added in MPI-2 but deleted in MPI-3
Communicators in MPI:
- Processes form groups (subsets of processes)
- Each process has its rank (ID) within a group
MPI is composed of:
- data types (MPI_Comm, etc.)
- functions
  - System: MPI_Init/MPI_Init_thread, MPI_Finalize
  - Communicator: MPI_Comm_size, MPI_Comm_rank
  - Point-to-point: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Test, MPI_Wait, MPI_Probe/MPI_Iprobe, etc.
  - Collective: MPI_Barrier, MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, MPI_Allreduce, ...
  - Others: parallel I/O, profiling, communicator/datatype/operator creation, etc.
- constants (MPI_COMM_WORLD, MPI_FLOAT, etc.)
Example MPI Program in C

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int rank, size, i, provided;
  float A[10];

  MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < 10; i++) A[i] = i;

  printf("My rank %d of %d\n", rank, size);
  printf("Here are my values for A\n");
  for (i = 0; i < 10; i++) printf("%f ", A[i]);
  printf("\n");

  MPI_Finalize();
}
Compile using mpicc instead of cc
Run using "mpirun -np nprocs program" (mpirun is optional when nprocs = 1)
Extended Example with Communication
#include "mpi.h"

int main(int argc, char *argv[]) {
  .....
  /* for (i = 0; i < 10; i++) A[i] = i; */
  if (rank == 0) {
    for (i = 0; i < 10; i++) A[i] = i;
    for (i = 1; i < size; i++)
      MPI_Send(A, /*count*/ 10, /*datatype*/ MPI_FLOAT,
               /*to*/ i, /*tag*/ 0, MPI_COMM_WORLD);
  } else {
    MPI_Recv(A, /*count*/ 10, /*datatype*/ MPI_FLOAT,
             /*from*/ 0, /*tag*/ 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  .....
}
Point-to-Point Communication in MPI
MPI_Send/MPI_Recv are blocking
- Safe to change/use the data after the function returns
MPI_Isend and MPI_Irecv are nonblocking
- Not safe to change/use the data until the operation completes
- Use MPI_Test/MPI_Wait[some|any|all] to check for completion
Algorithmic considerations:
1. Avoid deadlock in blocking communication
2. Overlap computation with communication using nonblocking communication
3. Reduce the number of messages by aggregating small messages
Collective Communication in MPI
Barrier: block until all processes reach this point (MPI_Barrier)
Broadcast: from root process to others in group (MPI_Bcast)
Scatter: scatter a buffer in parts to all in group (MPI_Scatter[v])
Gather: gather a buffer in parts from all in group (MPI_Gather[v])
Reduce: reduce values on all processes to a single value (MPI_Reduce/MPI_Allreduce)
All processes in the group must participate to avoid deadlock!
Algorithmic considerations:
1. Collective communications are synchronous and expensive, so use them sparingly
2. Reduce the number of messages by aggregating small messages
3. Load balancing is critical when there are collective communications
Advanced MPI Topics
One-sided communication (MPI_Get, MPI_Put)
- One side gets/puts data without involvement of the other process
An MPI process can be multithreaded (hybrid programming models)
- The process must be initialized using MPI_Init_thread instead of MPI_Init
- All threads share a common MPI rank
- Typically, only one thread should be in charge of MPI communication (MPI_THREAD_FUNNELED)
Debugging using TotalView or gdb (or ddd)
- "mpirun -np 4 xterm -e gdb ./myprog" or "mpirun -np 4 ddd ./myprog"
Timing using MPI_Wtime (typically, call a barrier before timing)
Profiling using TAU and mpiP
Additional resources:
- W. Gropp, CS 598: Designing and Building Applications for Extreme-Scale Systems, University of Illinois
Outline
1 Introduction to MPI
2 Distributed-Memory Parallelization of FEM
Partitioning and Distribution of Mesh
The mesh is typically partitioned and distributed onto processes
Algorithmic considerations:
- Load balancing: computation on each process should be about equal
- Reduce communication: "minimize" partition boundaries (surface-to-volume ratio)
- Cost of partitioner: the partitioner itself should be fast
- Parallel partitioning/repartitioning: the partitioner needs to run in parallel
Graph/Mesh Partitioning Techniques
A graph partitioner partitions the nodes of a graph to
- balance work loads, while
- "minimizing" edge cuts
When partitioning meshes, apply the partitioner to the nodal graph or the dual graph
Geometric Partitioning Techniques
Coordinate nested dissection (CND): split along the shortest axis
- Fast but low quality
Recursive inertial bisection (RIB): extension of CND with PCA
- Fast and better quality than CND
Space-filling curve (SFC): locality-preserving curves
- Fast and good quality; popular for hierarchical meshes
[Figures: CND vs. RIB; Peano-Hilbert SFC]
Combinatorial Partitioning Techniques
Levelized nested dissection (LND): breadth-first search from a seed vertex v0 until half of the vertices are reached
[Figure: LND; before and after KL refinement]
KL/FM partition refinement: starting from an initial partitioning, move vertices to reduce the edge cut; needs to escape local extrema
Others:
- Spectral: based on the graph Laplacian
- Multilevel schemes: recursively coarsen/partition/refine
Comparison of Partitioning Methods
Repartitioning Methods
Needed for adaptive meshes
Key issue: tradeoff between redistribution cost and edge cut
Graph Partitioning Software
                         Chaco   METIS   ParMETIS   Scotch
Geometric schemes        RIB             SFC
Spectral methods         X
Combinatorial schemes    KL/FM   X
Multilevel schemes               X       X          X
Dynamic repartitioning                   X
Parallel partitioning                    X
Another software package similar to ParMETIS was Jostle, but it is no longer available as open source
Additional reading:
- K. Schloegel, G. Karypis, and V. Kumar, "Graph Partitioning for High-Performance Scientific Simulations," in Sourcebook of Parallel Computing, Chapter 18, Morgan Kaufmann, 2003
- Slides by R. van Engelen
Distribution of Mesh and Field Variables
In FEM, all global data must be distributed:
- mesh (coordinates and elements)
- degrees of freedom
- matrix
- right-hand side and solution vectors
- post-processed data (parallel I/O)
It is natural to partition DoFs and matrices based on nodes
- Each node (and the corresponding DoFs and rows in the matrix) is owned by one process
What about elements?
Typical Control Flow of Parallel FEM
Each element is owned by one process (partition by the dual graph)
Each process owns some of the nodes (DoFs) associated with its elements
A process may own ghost nodes
Parallel assembly:
- Each process computes element matrices on its locally owned elements
- Merge partial nodal values along partition boundaries of elements
Required parallel communication patterns:
- Update values on ghost nodes
- Merge partial nodal values during assembly
These communication patterns are in general abstracted into high-level functions, typically implemented using nonblocking point-to-point communication; application codes do not call MPI directly
Linear solves are typically done by calling PETSc or Trilinos
Example code: parallel assembly in deal.II
Alternative Control Flow of Parallel FEM
Each node (and its associated DoFs) is owned by one process (partition by the nodal graph)
Each process owns the elements incident on its nodes
A process may own ghost nodes and ghost elements
Parallel assembly: each process computes element matrices on locally owned and ghost elements incident on its locally owned nodes
Required parallel communication pattern:
- Update values on ghost nodes and ghost elements
Disadvantage: computation of element matrices for elements incident on partition boundaries is duplicated
Advantage: can be adapted to a shared-memory implementation without race conditions