Post on 29-May-2020
AMS 529: Finite Element Methods: Fundamentals, Applications, and New Trends
Lecture 24: Distributed-Memory Programming for FEM
Xiangmin Jiao
SUNY Stony Brook
Xiangmin Jiao Finite Element Methods 1 / 23
Outline
1 Introduction to MPI
2 Distributed-Memory Parallelization of FEM
Review of Shared-Memory Programming Paradigm
There is a single (shared) address space: all parallel threads can access all data (if the address is known)
Advantages:
- Easier to get started
Disadvantages:
- Prone to performance bottlenecks due to synchronization (locks, mutexes)
- Difficult to scale due to hardware limitations (number of cores, memory bandwidth, etc.)
Bottom line: difficult to get good performance
Distributed-Memory Programming Paradigm
Multiple machines, each with its own address space
No direct access to remote data; data must be transported explicitly
Advantages:
- Programmer has more control of communication and synchronization
- Easier to scale to a large number of cores (more than 1M cores)
Disadvantages:
- Much more difficult programming model
- You are forced to think algorithmically and make hard decisions
- Practical difficulties in debugging, profiling, etc.
Good distributed-memory algorithms may be converted into efficient shared-memory codes, so it is advantageous to think algorithmically in distributed-memory terms
Distributed-Memory Programming Languages/Libraries
Message Passing Interface (MPI)
- Remote memory access (RMA), i.e., one-sided communication; supported by MPI-2 and MPI-3
Partitioned global address space (PGAS)
- Unified Parallel C (UPC)
- Coarray Fortran (part of Fortran 2008)
- Chapel, X10, Titanium, etc.
Remote procedure calls (RPC)
Most of these belong to SPMD (Single Program, Multiple Data) or MIMD (Multiple Instruction, Multiple Data) parallelism
Message Passing
A parallel program consists of processes, each with its own address space
Data need to be
- distributed explicitly by the algorithm/programmer
- sent/received explicitly by the algorithm/programmer
Communication patterns:
- Point-to-point (one-to-one) communication: one side sends; the other side receives
- Collective (global) communication (broadcast and reduction) and synchronization (barriers) on an arbitrary subset of processes
Message Passing Interface (MPI)
MPI is a library, with C and Fortran bindings
- Bindings also exist for other languages such as Python, MATLAB/Octave, etc.
- A C++ binding was added in MPI-2 but deleted in MPI-3
Communicators in MPI:
- Processes form groups (subsets of processes)
- Each process has its rank (ID) within a group
MPI is composed of:
- data types (MPI_Comm, etc.)
- functions
  - System: MPI_Init/MPI_Init_thread, MPI_Finalize
  - Communicator: MPI_Comm_size, MPI_Comm_rank
  - Point-to-point: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Test, MPI_Wait, MPI_Probe/MPI_Iprobe, etc.
  - Collective: MPI_Barrier, MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, MPI_Allreduce, ...
  - Others: parallel I/O, profiling, communicator/datatype/operator creation, etc.
- constants (MPI_COMM_WORLD, MPI_FLOAT, etc.)
Example MPI Program in C

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int rank, size, i, provided;
  float A[10];

  MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < 10; i++) A[i] = i;

  printf("My rank %d of %d\n", rank, size);
  printf("Here are my values for A\n");
  for (i = 0; i < 10; i++) printf("%f ", A[i]);
  printf("\n");

  MPI_Finalize();
}
Compile using mpicc instead of cc
Run using "mpirun -np nprocs program" (mpirun is optional when nprocs = 1)
Extended Example with Communication
#include "mpi.h"

int main(int argc, char *argv[]) {
  .....
  /* for (i = 0; i < 10; i++) A[i] = i; */
  if (rank == 0) {
    for (i = 0; i < 10; i++) A[i] = i;
    for (i = 1; i < size; i++)
      MPI_Send(A, /*count*/ 10, /*datatype*/ MPI_FLOAT,
               /*to*/ i, /*tag*/ 0, MPI_COMM_WORLD);
  } else {
    MPI_Recv(A, /*count*/ 10, /*datatype*/ MPI_FLOAT,
             /*from*/ 0, /*tag*/ 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  .....
}
Point-to-Point Communication in MPI
MPI_Send/MPI_Recv are blocking
- Safe to change/use the data after the function returns
MPI_Isend and MPI_Irecv are nonblocking
- Not safe to change/use the data until the operation completes
- Use MPI_Test/MPI_Wait[some|any|all] to check for completion
Algorithmic considerations:
1. Avoid deadlock in blocking communication
2. Overlap computation with communication using nonblocking communication
3. Reduce the number of messages by aggregating small messages
Collective Communication in MPI
Barrier: block until all processes reach this point (MPI_Barrier)
Broadcast: from root process to others in group (MPI_Bcast)
Scatter: scatter a buffer in parts to all in group (MPI_Scatter[v])
Gather: gather a buffer in parts from all in group (MPI_Gather[v])
Reduce: reduce values on all processes to a single value (MPI_Reduce/MPI_Allreduce)
All processes in the group must participate to avoid deadlock!
Algorithmic considerations:
1. Collective communications are synchronous and expensive, so use them sparingly
2. Reduce the number of messages by aggregating small messages
3. Load balancing is critical when there are collective communications
Advanced MPI Topics
One-sided communication (MPI_Get, MPI_Put)
- One side gets/puts data without involvement of the other process
An MPI process can be multithreaded (hybrid programming models)
- The process must be initialized using MPI_Init_thread instead of MPI_Init
- All threads share a common MPI rank
- Typically, only one thread should be in charge of MPI communication (MPI_THREAD_FUNNELED)
Debugging using TotalView or gdb (or ddd)
- "mpirun -np 4 xterm -e gdb ./myprog" or "mpirun -np 4 ddd ./myprog"
Timing using MPI_Wtime (typically, call a barrier before timing)
Profiling using TAU and mpiP
Additional resources:
- W. Gropp, CS 598: Designing and Building Applications for Extreme-Scale Systems, University of Illinois
Outline
1 Introduction to MPI
2 Distributed-Memory Parallelization of FEM
Partitioning and Distribution of Mesh
The mesh is typically partitioned and distributed onto processes
Algorithmic considerations:
- Load balancing: computation on each process should be about equal
- Reduce communication: "minimize" partition boundaries (surface-to-volume ratio)
- Cost of partitioner: the partitioner itself should be fast
- Parallel partitioning/repartitioning: the partitioner needs to run in parallel
Graph/Mesh Partitioning Techniques
A graph partitioner partitions the nodes of a graph to
- balance work loads, while
- "minimizing" edge cuts
When partitioning meshes, apply the partitioner to the nodal graph or the dual graph
Geometric Partitioning Techniques
Coordinate nested dissection (CND): split along the shortest axis
- Fast but low quality
Recursive inertial bisection (RIB): extension of CND with PCA
- Fast and better quality than CND
Space-filling curve (SFC): locality-preserving curves
- Fast and good quality; popular for hierarchical meshes
[Figures: CND vs. RIB; Peano-Hilbert SFC]
Combinatorial Partitioning Techniques
Levelized nested dissection (LND): breadth-first search from a seed vertex v0 until half of the vertices are reached
[Figure: LND; before and after KL refinement]
KL/FM partition refinement: starting from an initial partitioning, move vertices to reduce the edge cut; needs to escape local extrema
Others:
- Spectral: based on the graph Laplacian
- Multilevel schemes: recursively coarsen/partition/refine
Comparison of Partitioning Methods
Repartitioning Methods
Needed for adaptive meshes
Key issue: tradeoff between redistribution cost and edge cut
Graph Partitioning Software
                         Chaco   METIS   ParMETIS   Scotch
Geometric schemes        RIB             SFC
Spectral methods         X
Combinatorial schemes    KL/FM   X
Multilevel schemes               X       X          X
Dynamic repartitioning                   X
Parallel partitioning                    X
Another software package similar to ParMETIS was Jostle, but it is no longer available as open source
Additional reading:
- K. Schloegel, G. Karypis, and V. Kumar, "Graph Partitioning for High-Performance Scientific Simulations," in Sourcebook of Parallel Computing, Chapter 18, Morgan Kaufmann, 2003
- Slides by R. van Engelen
Distribution of Mesh and Field Variables
In FEM, all global data must be distributed:
- mesh (coordinates and elements)
- degrees of freedom
- matrix
- right-hand side and solution vectors
- post-processed data (parallel I/O)
It is natural to partition DoFs and matrices based on nodes
- Each node (and the corresponding DoFs and rows in the matrix) is owned by one process
What about elements?
Typical Control Flow of Parallel FEM
Each element is owned by one process (partition by the dual graph)
Each process owns some of the nodes (DoFs) associated with its elements
A process may own ghost nodes
Parallel assembly:
- Each process computes element matrices on its locally owned elements
- Merge partial nodal values along partition boundaries of elements
Required parallel communication patterns:
- Update values on ghost nodes
- Merge partial nodal values during assembly
These communication patterns are in general abstracted into high-level functions, typically implemented using nonblocking point-to-point communication; application codes do not call MPI directly
Linear solves are typically done by calling PETSc or Trilinos
Example code: parallel assembly in deal.II
Alternative Control Flow of Parallel FEM
Each node (and its associated DoFs) is owned by one process (partition by the nodal graph)
Each process owns the elements incident on its nodes
A process may own ghost nodes and ghost elements
Parallel assembly: each process computes element matrices on locally owned and ghost elements incident on its locally owned nodes
Required parallel communication pattern:
- Update values on ghost nodes and ghost elements
Disadvantage: computation of element matrices for elements incident on partition boundaries is duplicated
Advantage: can be adapted to a shared-memory implementation without race conditions