Parallel Tetrahedral Mesh Adaptation
with Dynamic Load Balancing 1
Leonid Oliker
NERSC, MS 50B-2239, Lawrence Berkeley National Laboratory,
Berkeley, CA 94720. E-mail: [email protected]
Rupak Biswas
MRJ Technology Solutions, MS T27A-1, NASA Ames Research Center,
Moffett Field, CA 94035. E-mail: [email protected]
Harold N. Gabow
Department of Computer Science, University of Colorado,
Boulder, CO 80309. E-mail: [email protected]
Abstract
The ability to dynamically adapt an unstructured grid is a powerful tool for ef-
ficiently solving computational problems with evolving physical features. In this
paper, we report on our experience parallelizing an edge-based adaptation scheme,
called 3D_TAG, using message passing. Results show excellent speedup when a
realistic helicopter rotor mesh is randomly refined. However, performance deteri-
orates when the mesh is refined using a solution-based error indicator since mesh
adaptation for practical problems occurs in a localized region, creating a severe
load imbalance. To address this problem, we have developed PLUM, a global dy-
namic load balancing framework for adaptive numerical computations. Even though
PLUM primarily balances processor workloads for the solution phase, it reduces the
load imbalance problem within mesh adaptation by repartitioning the mesh after
targeting edges for refinement but before the actual subdivision. This dramatically
improves the performance of parallel 3D_TAG since refinement occurs in a more load
balanced fashion. We also present optimal and heuristic algorithms that, when ap-
plied to the default mapping of a parallel repartitioner, significantly reduce the data
redistribution overhead. Finally, portability is examined by comparing performance
on three state-of-the-art parallel machines.
1 Work supported by NASA under Contract Numbers NAS 2-96027 with Universi-
ties Space Research Association and NAS 2-14303 with MRJ Technology Solutions.
Preprint submitted to Elsevier Preprint 12 July 1999
1 Introduction
Unstructured grids2 for solving computational problems have two major ad-
vantages over structured grids. First, unstructured meshes enable efficient grid
generation around highly complex geometries. Second, appropriate unstruc-
tured-grid data structures facilitate the rapid insertion and deletion of points
to allow the mesh to locally adapt to the solution.
Two solution-adaptive strategies are commonly used with unstructured-grid
methods. Regeneration schemes generate a new grid with a higher or lower
concentration of points in different regions depending on an error indicator. A
major disadvantage of such schemes is that they are computationally expen-
sive. This is a serious drawback for unsteady problems where the mesh must
be frequently adapted. However, resulting grids are usually well-formed with
smooth transitions between regions of coarse and fine mesh spacing.
Local mesh adaptation, on the other hand, involves adding points to the ex-
isting grid in regions where the error indicator is high, and removing points
from regions where the indicator is low. The advantage of such strategies is
that relatively few mesh points need to be inserted or deleted at each refine-
ment/coarsening step for unsteady problems. However, complicated logic and
data structures are required to keep track of the points that are added and removed.
For problems that evolve with time, local mesh adaptation procedures have
proved to be robust, reliable, and efficient. By redistributing the available
mesh points to capture physical phenomena of interest, such procedures make
standard computational methods more cost effective. Highly localized regions
of mesh refinement are required in order to accurately capture shock waves,
contact discontinuities, vortices, and shear layers. This provides scientists the
opportunity to obtain solutions on adapted meshes that are comparable to
those obtained on globally-refined grids but at a much lower cost. Even though
adaptive mesh algorithms are commonly used for problems in fluid flow and
structural mechanics, they are also of significant interest in several other areas
like computer vision and graphics.
Advances in adaptive software and methodology notwithstanding, parallel
computational strategies will be an essential ingredient in solving complex real-
life problems. However, parallel computers are usually easier to program with
regular data structures; so the development of efficient parallel adaptive algo-
rithms for unstructured grids (that use complex data structures and indirect
addressing) poses a serious challenge. Their parallel performance for supercom-
puting applications not only depends on the design strategies, but also on the
2 The terms grid and mesh are used synonymously throughout this paper.
choice of efficient data structures which must be amenable to simple manipulation without significant memory contention (for shared-memory architectures) or communication overhead (for message-passing architectures). Nonetheless, it is generally believed that adaptive unstructured-grid techniques will constitute a significant fraction of future high-performance computing.
A significant amount of research has been done to design sequential algo-
rithms to effectively use unstructured meshes for fluid flow applications, e.g.,
the solution of the Euler equations. Unfortunately, many of these techniques
cannot take advantage of the power of parallel computing due to the dif-
ficulties of porting these codes onto distributed-memory architectures. Re-
cently, several adaptive schemes have been successfully developed in a par-
allel environment. Most of these codes are based on two-dimensional finite
elements [1,2,5,9,17,18,25], and some progress has been made towards three-
dimensional unstructured-mesh simulations [4,21,27,28]. Various dynamic load
balancing methods for unstructured-grid applications have also been reported
to date [10-12,16,22,32-34]; however, most of them lack a global view of loads
across all processors.
Fig. 1. Overview of our global dynamic load balancing framework for adaptive numerical computations (Initialization, Partitioning, Mapping, Solution, Execution, Repartitioning, Remapping, and Finalization blocks).
Figure 1 depicts our global dynamic load balancing framework for adaptive
computations. It essentially consists of a numerical solver and our mesh adap-
tor, with a partitioner and a remapper that load balance and redistribute the
computational mesh when necessary. The mesh is first partitioned and mapped
among the available processors. The initialization phase distributes the global
data among the processors and generates a database for all shared objects 3.
The numerical solver then runs for several iterations, updating solution vari-
ables that are typically stored at the vertices of the mesh. When an acceptable
3 The term object is used generically to denote a vertex, edge, tetrahedron, or face in the mesh.
solution is obtained, local mesh adaptation is performed to generate a new
computational mesh, if so desired. A quick evaluation step determines if the
new mesh is sufficiently unbalanced to warrant a repartitioning. If the current
partitioning indicates that it is adequately load balanced, control is passed
back to the solver. Otherwise, a mesh repartitioning procedure is invoked to
divide the new grid into subgrids. The new partitions are then reassigned to
the processors in a way that minimizes the cost of data movement. If the
cost of remapping the data is less than the computational gain that would
be achieved with balanced partitions, all necessary data is appropriately re-
distributed. Otherwise, the new partitioning is discarded and the calculation
continues on the old partitions. The finalization step combines the local grids
on each processor into a single global mesh. This is usually required for some
post-processing tasks, such as visualization, or to save a snapshot of the grid
on secondary storage for future restart runs.
Notice from the framework in Fig. 1 that the computational load is balanced
and the runtime communication reduced only for the solver but not for the
mesh adaptor. This is important since solvers are usually several times more
expensive. However, parallel performance for the mesh adaptation procedure
can be significantly improved if the mesh is repartitioned and remapped in a
load-balanced fashion after edges are targeted for refinement and coarsening
but before performing the actual adaptation. This strategy also reduces the
redistribution cost significantly since a smaller volume of data is moved.
The numerical solver is usually application-dependent, and is beyond the scope
of this paper. Here, we focus on some of the tools that enable numerical sim-
ulations to be accomplished rapidly and efficiently. Parallel mesh adaptation
and dynamic load balancing are two such critical tools.
2 Tetrahedral Mesh Adaptation
We first give a brief description of the tetrahedral mesh adaptation scheme [7]
that is used in this work to better explain the modifications that were made
for the distributed-memory implementation. The 5,000-line C code, called
3D_TAG, has its data structures based on edges that connect the vertices
of a tetrahedral mesh. This means that the elements 4 and boundary faces are
defined by their edges rather than by their vertices. These edge-based data
structures make the mesh adaptation procedure capable of efficiently perform-
ing anisotropic refinement and coarsening. A successful data structure must
contain the right amount of information to rapidly reconstruct the mesh con-
nectivity when vertices are added or deleted while having reasonable memory
4 The terms element and tetrahedron are used synonymously throughout this paper.
requirements.
Recently, the 3D_TAG code has been modified to refine and coarsen hexahedral
meshes [8]. The data structures and serial implementation for the hexahedral
scheme are similar to those for the tetrahedral code. Their parallel implementa-
tions should also be similar; however, this paper focuses solely on tetrahedral
mesh adaptation.
2.1 The Algorithm
At each mesh adaptation step, individual edges are marked for coarsening,
refinement, or no change, based on an error indicator calculated from the
solution. Edges whose error values exceed a user-specified upper threshold are
targeted for subdivision. Similarly, edges whose error values lie below another
user-specified lower threshold are targeted for removal. Only three subdivision
types are allowed for each tetrahedral element and these are shown in Fig. 2.
The 1:8 isotropic subdivision is implemented by adding a new vertex at the
mid-point of each of the six edges. The 1:4 and 1:2 subdivisions can result
either because the edges of a parent tetrahedron are targeted anisotropically
or because they are required to form a valid connectivity for the new mesh.
When an edge is bisected, the solution quantities are linearly interpolated at
the mid-point from its two end-points.
Fig. 2. Three types of subdivision (1:8, 1:4, and 1:2) are permitted for a tetrahedral element.
Mesh refinement is performed by first setting a bit flag to one for each edge
that is targeted for subdivision. The edge markings for each element are then
combined to form a 6-bit pattern as shown in Fig. 3 where the edges marked
with an 'R' are the ones to be bisected. Elements are continuously upgraded to
valid patterns corresponding to the three allowed subdivision types (cf. Fig. 2)
until none of the patterns show any change. Once this edge marking is com-
pleted, each element is independently subdivided into smaller child elements
based on its binary pattern. Special data structures are used to ensure that
this process is computationally efficient.
Edge number: 6 5 4 3 2 1; marking: 0 0 1 0 1 1; pattern = 11.
Fig. 3. Sample edge-marking pattern for element subdivision.
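The bit-level bookkeeping described above can be illustrated with a short, self-contained sketch (not taken from 3D_TAG; the function and variable names are invented for illustration) that packs six per-edge refinement flags into the 6-bit element pattern of Fig. 3 and then queries it during subdivision:

```cpp
#include <cstdio>

// Pack the six edge flags of one element into a 6-bit pattern.
// Bit (e-1) corresponds to local edge number e in Fig. 3.
unsigned makePattern(const int edgeFlag[6]) {
    unsigned pattern = 0;
    for (int e = 0; e < 6; ++e)
        if (edgeFlag[e]) pattern |= 1u << e;
    return pattern;
}

int main() {
    // Edges 1, 2, and 4 are targeted for bisection, as in Fig. 3:
    // binary 001011 = decimal 11.
    int edgeFlag[6] = {1, 1, 0, 1, 0, 0};
    unsigned pattern = makePattern(edgeFlag);
    std::printf("pattern = %u\n", pattern);       // prints 11

    // During subdivision, individual edges are tested with a mask.
    for (int e = 0; e < 6; ++e)
        if (pattern & (1u << e))
            std::printf("bisect edge %d\n", e + 1);
    return 0;
}
```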
Mesh coarsening also uses the edge-marking patterns. If a child element has any edge marked for coarsening, this element and its siblings are removed and their parent is reinstated. Parent edges and elements are retained at each refinement step so they do not have to be reconstructed. Reinstated parent elements have their edge-marking patterns adjusted to reflect that some edges have been coarsened. The parents are then subdivided based on their new patterns by invoking the mesh refinement procedure. As a result, the coarsening
and refinement procedures share much of the same logic.
There are some constraints for mesh coarsening. For example, edges cannot
be coarsened beyond the initial mesh. Edges must also be coarsened in an
order that is reversed from the one by which they were refined. Moreover, an
edge can coarsen if and only if its sibling is also targeted for coarsening. More
details about these coarsening constraints are given in [8].
Details of the data structures are given in [7]; however, a brief description of
the salient features is necessary to understand the distributed-memory imple-
mentation of the mesh adaptation code. Pertinent information is maintained
for" the vertices, elements, edges, and boundary faces of the mesh. For each
vertex, the coordinates are stored in coord[3], the solution in soln[5], and
a pointer to the first entry in the edge list in edges. The edge list for a vertex
is a linked list of pointers to all the edges that are incident upon it. Such lists
eliminate extensive searches and are crucial to the efficiency of the overall
adaptation scheme. The tetrahedral elements have their six edges stored in
tedge[6], the edge-marking pattern in part, the parent element in tparent,
and the first child element in tchild. Sibling elements always reside contigu-
ously in memory; hence, a parent element only needs a pointer to the first
child. For each edge, we store its two end-points in vertex[2], its parent
edge in eparent, its two children edges in echild[2], the two boundary faces
it defines in bfac[2], and a pointer to the first entry in the element list in
elems. The element list for an edge is again a linked list of pointers to all
the elements that share it. Finally, for each boundary face, we store the three
edges in bedge[3], the element to which it belongs in belem, the parent in
bparent, and the first child in bchild. Sibling boundary faces, like elements,
are stored consecutively in memory.
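The description above translates naturally into a set of C-style records. The following declarations are only a sketch assembled from the field names given in the text; the index type, the name of the pattern field, and the singly linked incidence lists are assumptions made for illustration, not the actual 3D_TAG definitions:

```cpp
// Illustrative declarations only; the index type, the pattern field name,
// and the singly linked incidence lists are assumptions, not 3D_TAG code.
typedef int Index;                 // index into the object arrays

struct EdgeListNode {              // linked list of edges incident on a vertex
    Index edge;
    EdgeListNode *next;
};

struct ElemListNode {              // linked list of elements sharing an edge
    Index elem;
    ElemListNode *next;
};

struct Vertex {
    double coord[3];               // coordinates
    double soln[5];                // solution variables
    EdgeListNode *edges;           // first entry of the incident-edge list
};

struct Tetrahedron {
    Index tedge[6];                // six edges defining the element
    unsigned pattern;              // 6-bit edge-marking pattern
    Index tparent;                 // parent element
    Index tchild;                  // first child (siblings are contiguous in memory)
};

struct Edge {
    Index vertex[2];               // two end-points
    Index eparent;                 // parent edge
    Index echild[2];               // two children edges
    Index bfac[2];                 // boundary faces this edge defines
    ElemListNode *elems;           // first entry of the element list
};

struct BoundaryFace {
    Index bedge[3];                // three edges
    Index belem;                   // element to which the face belongs
    Index bparent;                 // parent face
    Index bchild;                  // first child (siblings are contiguous in memory)
};
```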
2.2 Parallel Implementation
The parallel version of the 3D_TAG mesh adaptation code contains an ad-
ditional 3,000 lines of C++ with Message-Passing Interface (MPI), allowing
portability to any system supporting these languages. This code is a wrap-
per around the original mesh adaptation program written in C, and required
the addition of only 10 instructions to link it with the parallel constructs. The
object-oriented approach allowed us to build a clean interface between the two
layers of the program while maintaining efficiency. Only a slight increase in
space was necessary to keep track of the global mappings and shared processor
lists (SPLs) for objects on partition boundaries.
Parallel 3D_TAG consists of three phases: initialization, execution, and final-
ization. The initialization step consists of scattering the global data across the
processors, defining a local numbering scheme for each object, and creating
the mapping for objects that are shared by multiple processors. The execu-
tion step runs a copy of 3D_TAG on each processor that refines or coarsens
its local region, while maintaining a globally-consistent grid along partition
boundaries. Parallel performance is extremely critical during this phase since
it will be executed several times during a computation. Finally, a gather op-
eration is performed in the finalization step to combine the local grids into
one global mesh. Locally-numbered objects and the corresponding pointers are
reordered to represent one single consistent mesh. Note from Fig. 1 that the
initialization and finalization phases are invoked only once for each problem
outside the cycle of solution, execution, and load balancing.
In order to perform parallel mesh adaptation, the initial grid must first be
partitioned among the available processors. A good partitioner minimizes the
total execution time by equidistributing the workload and reducing the inter-
processor communication. However, it is also important within our framework
that the partitioning phase be performed rapidly. There are several excel-
lent heuristic algorithms for solving the NP-hard graph partitioning prob-
lem [20,28,29,32,34]. We used the ParMETIS [19] parallel multilevel partition-
ing algorithm for the test cases in this paper. ParMETIS reduces the size of the
graph by collapsing vertices and edges using a heavy edge matching scheme,
applies a greedy graph growing algorithm for partitioning the coarsest graph,
and then uncoarsens it back using a combination of boundary greedy and
Kernighan-Lin refinement to construct a partitioning for the original graph.
2.2.1 Initialization
The initialization phase takes as input the global initial grid and the cor-
responding partitioning that maps each tetrahedral element to exactly one
partition. The elementdata and partition information are then broadcasttoall processorswhich, in parallel, assigna local, zero-basednatural numbertoeach element.\Ve are thus assumingthat a.ninitial tetrahedral meshexists,and that it is partitioned amongtile availableprocessors.Oncethe elementshavebeenprocessed,local edgeinformation can becomputed.
In three dimensions, an individual edge may belong to an arbitrary number of
elements. Since each element is assigned to only one partition, it is theoretically
possible for an edge to be shared by all the processors. For each partition, a
local zero-based natural number is assigned to every edge that belongs to
at least one element. Each processor then redefines its elements in tedge [6]
in terms of these local edge numbers. Edges that are shared by more than
one processor are identified by searching for elements that lie on partition
boundaries. A bit flag is set to distinguish between shared and internal edges.
A SPL is also generated for each shared edge. Finally, the element list in elems
for each edge is updated to contain only the local elements.
The vertices are initialized using the vertex [2] data structure for each edge.
Every local vertex is assigned a zero-based natural number in each partition.
Next the local edge list for each vertex is created from the appropriate sub-
set of the global edges array. Like shared edges, each shared vertex must be
identified and assigned its SPL. A naive approach would be to thread through
the data structures to the elements and their partitions to determine which
vertices lie on partition boundaries. But this procedure requires excessive in-
direction. A faster approach is based on the following two properties of a
shared vertex: it must be an end-point for at least one shared edge, and its
SPL is the union of its shared edges' SPLs. However, some communication is
required when using this method. For each vertex containing a shared edge in
its edges list, that edge's SPL is communicated to the processors in the SPLs
of all other shared edges until the union of all the SPLs is formed. For the
cases in this paper, this process required no more than three iterations, and
all shared vertices were processed as a function of the number of shared edges
plus a small communication overhead. An example is shown in Fig. 4 where
the SPL is being formed in P0 for the center vertex that is shared by three
other processors. Without communication, P0 would incorrectly conclude that
the vertex is shared only with P1 and P3.
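A minimal sketch of the purely local part of this step is shown below, assuming each shared edge already carries its SPL; the iterative exchange of SPLs between processors, needed to reach the fixed point illustrated in Fig. 4, is deliberately omitted, and the names are illustrative:

```cpp
#include <set>
#include <vector>

// Local part of forming a shared vertex's SPL: the union of the SPLs of its
// incident shared edges. The iterative exchange of SPLs between processors,
// needed to reach the fixed point illustrated in Fig. 4, is omitted here.
using SPL = std::set<int>;                       // set of processor ranks

struct EdgeInfo {
    bool shared;                                 // shared / internal flag
    SPL  spl;                                    // empty if internal
};

SPL vertexSPL(const std::vector<EdgeInfo> &incidentEdges) {
    SPL result;
    for (const EdgeInfo &e : incidentEdges)
        if (e.shared)
            result.insert(e.spl.begin(), e.spl.end());
    return result;                               // may still grow after communication
}
```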
The final step in the initialization phase is the local renumbering of the exter-
nal boundary faces 5. Since a boundary face belongs to only one element, it
is never shared among processors. Each boundary face is defined by its three
edges in bedge[3], while each edge maintains a pair of pointers in bfac[2]
to the boundary faces it defines. Since the global mesh is closed (water-tight),
an edge on the external boundary is shared by exactly two boundary faces.
5 The internal faces are not stored in the mesh data structures.
Fig. 4. An example showing the communication needed to form the SPL for a shared vertex. Before communication, P0 shares the center vertex with P1 and P3; after communication, P0 shares it with P1, P2, and P3.
However, when the mesh is partitioned, this is no longer true. An example is
shown in Fig. 5. An affected edge creates an empty ghost boundary face in
each of the two processors for the execution phase. The ghost boundary faces
do not participate in the adaptation process but are required to create a valid
subgrid in each processor. These ghost faces are later eliminated during the
finalization stage.
Fig. 5. An example showing how boundary faces are represented at partition boundaries. Before partitioning, global edge GE5 is shared by global boundary faces GBF7 and GBF8; after partitioning, GE5 is stored as LE1 and LE3 in P0 and P1, GBF7 as LBF3 in P0, GBF8 as LBF0 in P1, and a ghost boundary face is created in each processor.
A new data structure has been added to the serial code to represent all this
shared information. Each shared edge and vertex contains a two-way mapping
between its local and its global numbers 6, and a SPL of processors where
its shared copies reside. The maximum additional storage depends on the
number of processors used and the fraction of shared objects. For the cases in
this paper, this was less than 10% of the memory requirements of the serial
version.
6 The global numbers for the various mesh objects are obtained trivially during the
initialization phase.
2.2.2 Execution
The first step in the actual mesh adaptation phase is to target edges for refine-
ment or coarsening. This is usually based on an error indicator for each edge
that is computed from the solution. This strategy results in a symmetrical
marking of all shared edges across partitions since such edges have the same
numerical and geometrical information regardless of their processor number.
However, elements have to be continuously upgraded to one of the three al-
lowed subdivision patterns shown in Fig. 2. This causes some propagation of
edges being targeted that could mark local copies of shared edges inconsis-
tently. This is because the local geometry and marking patterns affect the
nature of the propagation. Communication is therefore required after each
iteration of the propagation process. Every processor sends a list of all the
newly-marked local copies of shared edges to all the other processors in their
SPLs. This process may continue for several iterations, and edge markings
could propagate back and forth across partitions.
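The structure of this propagation loop can be sketched as follows. The sketch is heavily simplified: instead of exchanging only newly-marked shared edges with the processors in their SPLs, it keeps one flag per global edge and merges marks with an all-reduce, and the local pattern-upgrade step is a purely illustrative placeholder:

```cpp
#include <cstddef>
#include <mpi.h>
#include <vector>

// Heavily simplified propagation loop. Real 3D_TAG sends only newly-marked
// shared edges to the processors in their SPLs; here every processor holds one
// flag per global edge and marks are merged with an all-reduce. The local
// pattern-upgrade rule below is a purely illustrative placeholder.
static bool upgradeLocalPatterns(std::vector<unsigned char> &mark) {
    bool changed = false;
    // Toy rule: if the first two edges of a (fictitious) 3-edge group are
    // marked, mark the third as well, mimicking pattern propagation.
    for (std::size_t i = 0; i + 2 < mark.size(); i += 3)
        if (mark[i] && mark[i + 1] && !mark[i + 2]) {
            mark[i + 2] = 1;
            changed = true;
        }
    return changed;
}

void propagateMarks(std::vector<unsigned char> &mark, MPI_Comm comm) {
    int globalChanged = 1;
    while (globalChanged) {
        int localChanged = upgradeLocalPatterns(mark) ? 1 : 0;
        // Merge marks set by any processor (logical OR via element-wise MAX).
        MPI_Allreduce(MPI_IN_PLACE, mark.data(), (int)mark.size(),
                      MPI_UNSIGNED_CHAR, MPI_MAX, comm);
        // Terminate only when no processor marked a new edge this iteration.
        MPI_Allreduce(&localChanged, &globalChanged, 1, MPI_INT, MPI_LOR, comm);
    }
}
```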
Figure 6 shows a two-dimensional example of two iterations of the propagation
process across a partition boundary. The process is similar in three dimensions.
Processor P0 marks its local copy of shared edge GE1 and communicates
that to P1. P1 then marks its own copy of GE1, which causes some internal
propagation because element marking patterns must be upgraded to those
that are valid. Note that P1 marks its third internal edge and its local copy of
shared edge GE2 during this phase. Marking information about GE2 is then
communicated to P0, and the propagation phase terminates. The four original
triangles can now be correctly subdivided into a total of 12 smaller triangles.
Fig. 6. A two-dimensional example showing communication during propagation of the edge-marking phase.
Once all edge markings are complete, each processor executes the mesh adap-
tation code without the need for further communication, since all edges are
10
consistently marked. The only task remaining is to update the shared edge and vertex information as the mesh is adapted. This is handled as a post-processing phase.
New edges and vertices that are created during refinement are assigned shared processor information that depends on several factors. Four different cases can
occur when new edges are created:
• If an internal edge is bisected, the center vertex and all new edges incident
on that vertex are also internal to the partition. Shared processor informa-
tion is not required in this case.
• If a shared edge is bisected, its two children and the center vertex inherit
its SPL, since they lie on the same partition boundary.
• If a new edge is created in the interior of an element, it is internal to the
partition since processor boundaries only lie along element faces. Shared
processor information is not required.
• If a new edge is created that lies across an element face, communication is
required to determine whether it is shared or internal. If it is shared, the
SPL must be formed.
All the cases are straightforward, except for the last one. If the intersection of
the SPLs of the two end-points of the new edge is null, the edge is internal.
Otherwise, communication is required with the shared processors to determine
whether they have a local copy of the edge. This communication is necessary
because no information is stored about the internal faces of the tetrahedral
elements. An alternate solution would be to incorporate internal faces as an
additional object into the data structures, and maintaining it through the
adaptation. However, this strategy does not compare favorably in terms of
memory or CPU time to a single communication at the end of the refinement
procedure. This is primarily because the number of triangular faces for a
tetrahedral mesh is asymptotically ten times the number of mesh vertices.
Figure 7 shows the top view of a tetrahedron in processor P0 that shares
two faces with P1 while the third face is internal. The fourth face is not
shown and is irrelevant for this example. Assume that due to mesh refinement,
three new edges LE1, LE2, and LE3, are formed in P0. An intersection of
the SPLs for the two end-points of all the three edges yields P1. However,
when P0 communicates this information to P1, P1 will only have local copies
corresponding to LE1 and LE2. Thus, P0 would be able to correctly classify
LE1 and LE2 as shared edges but LE3 as an internal edge.
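A sketch of the local intersection test is given below; the names are illustrative, and the result is only a list of candidate sharers, since (as the LE3 case above shows) a confirmation message to those processors is still required:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// First, purely local test for a new edge created across an element face:
// intersect the SPLs of its two end-points. An empty intersection means the
// edge is internal. Otherwise the processors returned are only candidates,
// and a confirmation message is still required (LE3 in Fig. 7 is internal
// even though its end-point SPLs intersect).
using SPL = std::set<int>;

std::vector<int> candidateSharers(const SPL &splA, const SPL &splB) {
    std::vector<int> candidates;
    std::set_intersection(splA.begin(), splA.end(),
                          splB.begin(), splB.end(),
                          std::back_inserter(candidates));
    return candidates;   // empty => internal edge, no communication needed
}
```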
The coarsening phase purges the data structures of all edges that are removed,
as well as their associated vertices, elements, and boundary faces. No new
shared processor information is generated since no mesh objects are created
during this step. However, objects are renumbered as a result of compaction
Fig. 7. An example showing how a new edge across a face is classified as shared or internal.
and all internal and shared data are updated accordingly. The refinement
routine is then invoked to generate a valid mesh from the vertices left after
the coarsening.
2.2.3 Finalization
Under certain conditions, it is necessary to create a single global mesh after one
or more adaptation steps. Some post processing tasks, such as visualization,
need to process the whole grid simultaneously. Storing a snapshot of a grid
for future restarts could also require a global view. Our finalization phase
accomplishes this goal by merging the individual subgrids into one global data structure.
Each local object is first assigned a unique global number. Next, all local data
structures are updated in terms of these global numbers. Finally, gather oper-
ations are performed to a host processor to create the global mesh. Individual
processors are responsible for correctly arranging the data so that the host
only collects and concatenates without further processing.
It is relatively simple to assign global element numbers since elements are
not shared among processors. By performing a scan-reduce add 7 on the total
number of elements, each processor can assign the final global element number.
The global boundary face numbering is done similarly since boundary faces too are
not shared among processors.
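With MPI, the scan-reduce add can be expressed directly with MPI_Exscan, as in the following sketch (the function name and integer type are illustrative):

```cpp
#include <mpi.h>

// Scan-reduce add via MPI_Exscan: each rank receives the sum of the element
// counts of all lower-ranked processors, i.e. the offset of its first element
// in the global numbering. Boundary faces are numbered the same way.
long firstGlobalElementNumber(long numLocalElements, MPI_Comm comm) {
    long offset = 0;
    MPI_Exscan(&numLocalElements, &offset, 1, MPI_LONG, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) offset = 0;   // MPI_Exscan leaves rank 0's result undefined
    return offset;               // local element k gets global number offset + k
}
```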
Assigning global numbers to edges and vertices is somewhat more complicated
since they may be shared by several processors. Each shared edge (and vertex)
is assigned an owner from its SPL which is then responsible for generating the
global number. Owners are randomly selected to keep the computation and
communication loads balanced. Once all processors complete numbering their
edges (and vertices), a communication phase propagates the global values from
owners to other processors that have local copies.
7 A scan-reduce add operation creates a vector whose ith element is the sum
of the first i - 1 elements of the argument vector.
After global numbers have been assigned to every object, all data structures are updated to contain consistent global information. Since elements and boundary faces are unique in each processor, no duplicates exist. All unowned edge copies are removed from the data structures, which are then
compacted. However, the element lists in elems cannot be discarded for the
unowned edges. Some communication is required to adjust the pointers in the
local lists so that global lists can be formed without any serial computation.
The pair of pointers in bfac[2] that were split during the initialization phase
for shared edges are glued back by communicating the boundary face informa-
tion to the owner. Vertex data structures are updated much like edges except
for the manner in which their edge lists in edges are handled. Since shared
vertices may contain local copies of the same global edge in their lists on dif-
ferent processors, the unowned edge copies are first deleted. Pointers are next
adjusted as in the elems case with some communication among processors.
At this time, all processors have updated their local data with respect to their
relative positions in the final global data structures. A gather operation by a
host processor is performed to concatenate the local data structures. The host
can then interface the global mesh directly to the appropriate post-processing
module without having to perform any serial computation.
3 Dynamic Load Balancing
PLUM [23] is a novel method to dynamically balance the processor workloads
for unstructured adaptive-grid computations with a global view. It has five
key features:
• Repeated use of the initial mesh dual graph keeps the connectivity and par-
titioning complexity constant during the course of an adaptive computation.
• Parallel mesh repartitioning avoids a potential serial bottleneck.
• Fast heuristic remapping assigns partitions to processors so that the redis-
tribution cost is minimized.
• Efficient data movement significantly reduces the cost of remapping and
mesh subdivision.
• Accurate metrics estimate and compare the computational gain and the re-
distribution cost of having a balanced workload after each mesh adaptation.
3.1 Repartitioning the Initial Mesh Dual Graph
Repeatedly using the dual of the initial computational mesh for dynamic load
balancing is one of the key features of PLUM. Each dual graph vertex has
two weights associated with it. The computational weight, Wcomp, models the
workload for the corresponding element. The remapping weight, Wremap, mod-
els the cost of moving the element from one processor to another. Every edge
of the dual graph also has a weight, Wcomm, that models the runtime interpro-
cessor communication. These three weights are determined by the numerical
algorithm and the data structures. In our current work, Wcomp is set to the
number of leaf elements in the refinement tree, Wremap is set to the total num-
ber of elements in the refinement tree, and Wcomm is set to the number of faces
in the computational mesh that corresponds to the dual graph edge. The mesh
connectivity, Wcomp, and Wcomm together determine how balanced partitions
with minimum runtime communication are formed. The Wremap values determine how
partitions should be assigned to processors such that the data redistribution
cost is minimized. New computational grids obtained by hierarchical adap-
tation are translated to Wcomp and Wremap for every vertex and to Wcomm for
every edge in the dual mesh. If the dual graph with a new set of weights
is deemed unbalanced, the mesh is repartitioned using the ParMETIS [19]
parallel multilevel partitioner.
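As an illustration of how the two vertex weights could be derived from an element's refinement tree, consider the following sketch; the tree representation is an assumption made for this example and is not the PLUM data structure:

```cpp
#include <vector>

// Illustrative refinement-tree representation (not the PLUM data structure):
// Wcomp counts the leaf elements, which carry the computation; Wremap counts
// all elements in the tree, which all move if the root element migrates.
struct RefTreeNode {
    std::vector<RefTreeNode> children;   // empty for a leaf element
};

void countWeights(const RefTreeNode &node, int &wComp, int &wRemap) {
    ++wRemap;                            // every element contributes to Wremap
    if (node.children.empty()) {
        ++wComp;                         // only leaves contribute to Wcomp
        return;
    }
    for (const RefTreeNode &child : node.children)
        countWeights(child, wComp, wRemap);
}
```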
3.2 Processor Reassignment
New partitions generated by a partitioner must be mapped to processors such
that the data redistribution cost is minimized. In general, the number of new
partitions is an integer multiple F of the number of processors, and each
processor is assigned F unique partitions. Allowing multiple partitions per
processor reduces the volume of data movement at the expense of partitioning
and processor reassignment times [23]; however, the simpler scheme of setting
F to unity suffices for most practical applications.
We first generate a similarity measure M that indicates how the remapping
weights Wremap of the new partitions are distributed over the processors. It is
represented as a matrix where entry Mij is the sum of the Wremap values of all
the dual graph vertices in new partition j that already reside on processor i.
Various cost functions are usually needed to solve the processor reassignment
problem using M for different machine architectures. We present three general
metrics: TotalV, MaxV, and MaxSR, which model the remapping cost on most
multiprocessor systems. TotalV minimizes the total volume of data moved
among all the processors, MaxV minimizes the maximum flow of data to or from
any single processor, while MaxSR minimizes the sum of the maximum flow of
data to and from any processor. A greedy heuristic algorithm to minimize the
remapping overhead is also presented.
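For concreteness, the three metrics can be evaluated for a given similarity matrix and assignment as in the sketch below, which assumes F = 1 and α = β = 1 as in Fig. 8; the names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// Evaluate the three metrics for a P x P similarity matrix M (F = 1) and a
// processor assignment, where assign[i] is the new partition given to
// processor i; alpha = beta = 1 is assumed, as in Fig. 8.
struct Metrics { long totalV, maxV, maxSR; };

Metrics evaluate(const std::vector<std::vector<long> > &M,
                 const std::vector<int> &assign) {
    const int P = (int)M.size();
    long maxSent = 0, maxRecd = 0, totalV = 0;
    for (int i = 0; i < P; ++i) {
        long sent = 0, recd = 0;
        for (int j = 0; j < P; ++j)
            if (j != assign[i]) sent += M[i][j];     // data leaving processor i
        for (int k = 0; k < P; ++k)
            if (k != i) recd += M[k][assign[i]];     // data arriving at processor i
        totalV += sent;
        maxSent = std::max(maxSent, sent);
        maxRecd = std::max(maxRecd, recd);
    }
    return { totalV, std::max(maxSent, maxRecd), maxSent + maxRecd };
}
```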
3.2.1 TotalV Metric
The TotalV metric assumes that by reducing network contention and the total
number of elements moved, the remapping time will be reduced. In general,
each processor cannot be assigned F unique partitions corresponding to their
F largest weights. To minimize TotalV, each processor i must be assigned F
partitions $j_{i,f}$, $f = 1, 2, \ldots, F$, such that the objective
\[ \sum_{i=1}^{P} \sum_{f=1}^{F} M_{i,\,j_{i,f}} \]
is maximized subject to the constraint
\[ j_{i,r} \neq j_{k,s} \quad \text{for } i \neq k \text{ or } r \neq s; \qquad i, k = 1, 2, \ldots, P; \quad r, s = 1, 2, \ldots, F. \]
We can optimally solve this by mapping it to a network flow optimization
problem described as follows. Let G = (V, E) be an undirected graph. G is
bipartite if V can be partitioned into two sets A and B such that every edge
has one vertex in A and the other vertex in B. A matching is a subset of edges,
no two of which share a common vertex. A maximum-cardinality matching is
one that contains as many edges as possible. If G has a real-valued cost on each
edge, we can consider the problem of finding a maximum-cardinality matching
whose total edge cost is maximized. We refer to this as the maximally weighted
bipartite graph (MWBG) problem (also known as the assignment problem).
When F = 1, optimally solving for the TotalV metric trivially reduces to
MWBG, where V consists of P processors and P partitions in each set. An edge
of weight Mij exists between vertex i of the first set and vertex j of the second
set. If F > 1, the processor reassignment problem can be reduced to MWBG
by duplicating each processor and all of its incident edges F times. Each set
of the bipartite graph then has P×F vertices. After the optimal solution is
obtained, the solutions for all F copies of a processor are combined to form
a one-to-F mapping between the processors and the partitions. The optimal
solution for the TotalV metric and the corresponding processor assignment of
an example similarity matrix is shown in Fig. 8(a).
The fastest MWBG algorithm [13] can compute a matching in O(|V|^2 log |V| +
|V||E|) time, or in O(|V|^{1/2}|E| log(|V|C)) time if all edge costs are integers of
absolute value at most C [15]. We have implemented the optimal algorithm
with a runtime of O(|V|^3). Since M is generally dense, |E| ≈ |V|^2, implying
that we should not see a dramatic performance gain from a faster implemen-
tation.
Fig. 8. Various cost metrics of a similarity matrix M for P = 4 and F = 1 using
(a) the optimal MWBG, (b) the optimal BMCM, (c) the optimal DBMCM, and
(d) our heuristic algorithms:
(a) TotalV moved = 525, MaxV moved = 275, MaxSR moved = 485;
(b) TotalV moved = 640, MaxV moved = 245, MaxSR moved = 475;
(c) TotalV moved = 570, MaxV moved = 255, MaxSR moved = 465;
(d) TotalV moved = 550, MaxV moved = 260, MaxSR moved = 470.
3.2.2 MaxV Metric
The metric MaxV, unlike TotalV, considers data redistribution in terms of
solving a load imbalance problem, where it is more important to minimize the
workload of the most heavily-weighted processor than to minimize the sum of
all the loads. During the process of remapping, each processor must pack and
unpack send and receive buffers, incur remote-memory latency time, and per-
form the computational overhead of rebuilding internal and shared data struc-
tures. By minimizing max(α × max(ElemsSent), β × max(ElemsRecd)), where
α and β are machine-specific parameters, MaxV attempts to reduce the total
remapping time by minimizing the execution time of the most heavily-loaded
processor. We can solve this optimally by considering the problem of find-
ing a maximum-cardinality matching whose maximum edge cost is minimum.
We refer to this as the bottleneck maximum cardinality matching (BMCM)
problem.
To find the BMCM of the graph G corresponding to the similarity matrix, we
first need to transform M into a new matrix M'. Each entry M'_{ij} represents
the maximum cost of migrating data between processor i and partition j:
\[ M'_{ij} = \max\Bigl( \alpha \sum_{y=1,\, y \neq j}^{P} M_{iy},\;\; \beta \sum_{x=1,\, x \neq i}^{P} M_{xj} \Bigr). \]
Optimally solving the BMCM problem is NP-complete for F > 1. For F = 2,
it is NP-complete by reduction from numerical matching with target sums; for
F > 2, it is NP-complete by reduction from 3-partition. We have implemented
the BMCM algorithm in [3] for F = 1 which combines a maximum cardinality
matching algorithm with a binary search, and runs in O(|V|^{1/2}|E| log |V|). The
fastest known BMCM algorithm [14] has a runtime of O((|V| log |V|)^{1/2} |E|).
The new processor assignment for the similarity matrix in Fig. 8 using this approach with α = β = 1 is shown in Fig. 8(b). Notice that the total number of
elements moved in Fig. 8(b) is larger than the corresponding value in Fig. 8(a);
however, the maximum number of elements moved is smaller.
3.2.3 MaxSR Metric
Our third metric, MaxSR, is similar to MaxV in the sense that the overhead
of the bottleneck processor is minimized during the remapping phase. MaxSR
differs, however, in that it minimizes the sum of the heaviest data flow from
any processor and to any processor, expressed as (α × max(ElemsSent) +
β × max(ElemsRecd)). We refer to this as the double bottleneck maximum
cardinality matching (DBMCM) problem. The MaxSR formulation allows us
to capture the computational overhead of packing and unpacking data, when
these two phases are separated by a barrier synchronization. Additionally,
the MaxSR metric may also approximate the many-to-many communication
pattern of our remapping phase. Since a processor can either be sending or
receiving data, the overhead of these two phases should be modeled as a sum
of costs.
We have developed an algorithm for computing the minimum MaxSR of the
graph G corresponding to our similarity matrix. We first transform M to a new
matrix M''. Each entry M''_{ij} contains a pair of values {S_{ij}, R_{ij}} corresponding
to the total cost of sending and receiving data when partition j is mapped to
processor i:
\[ \Bigl\{\, S_{ij} = \alpha \sum_{y=1,\, y \neq j}^{P} M_{iy}, \qquad R_{ij} = \beta \sum_{x=1,\, x \neq i}^{P} M_{xj} \,\Bigr\}. \]
Optimally solving the MaxSR metric is NP-complete for F > 1, since the
underlying BMCM problem is also NP-complete.
Let σ_1, σ_2, ..., σ_k be the distinct S_{ij} values appearing in M'', sorted in increas-
ing order. Thus, σ_i < σ_{i+1} and k ≤ P^2. Form the bipartite graph G_i = (V, E_i),
where V consists of processor vertices u = 1, 2, ..., P and partition vertices
v = 1, 2, ..., P, and E_i contains edge (u, v) if S_{uv} ≤ σ_i; furthermore, edge (u, v)
has weight R_{uv} if it is in E_i.

For small values of i, graph G_i may not have a perfect matching. Let i_min be
the smallest index such that G_{i_min} has a perfect matching. Obviously, G_i has
a perfect matching for all i ≥ i_min. Solving the BMCM problem of G_i gives a
matching that minimizes the maximum R_{ij} edge weight. It gives a matching
with MaxSR value at most σ_i + MaxV(G_i). Defining
\[ \mathrm{MaxSR}(i) = \min_{i_{\min} \le j \le i} \bigl( \sigma_j + \mathrm{MaxV}(G_j) \bigr), \]
it is easy to see that MaxSR(k) equals the correct value of MaxSR. Thus, our
algorithm computes MaxSR by solving k BMCM problems on the graphs G_i
and computing the minimum value MaxSR(k). However, we can prematurely
terminate the algorithm if there exists an i_max such that σ_{i_max+1} ≥ MaxSR(i_max),
since it is then guaranteed that the MaxSR solution is MaxSR(i_max).
Our implementation has a runtime of O(|V|^{1/2}|E|^2 log |V|) since the BMCM
algorithm is called |E| times in the worst case; however, it can be decreased
to O(|E|^2). The following is a sketch of this more efficient implementation.
Suppose we have constructed a matching μ that solves the BMCM problem
of G_i for i ≥ i_min. We solve the BMCM problem of G_{i+1} as follows. Initialize
a working graph G to be G_{i+1} with all edges of weight greater than MaxV(G_i)
deleted. Take the matching μ on G, and delete all unmatched edges of weight
MaxV(G_i). Choose an edge (u, v) of maximum weight in μ, remove it from μ
and G, and search for an augmenting path from u to v in G. If no such path
exists, we know that MaxV(G_i) = MaxV(G_{i+1}). If an augmenting path is found,
repeat this procedure by choosing a new edge (u', v') of maximum weight in
the matching and searching for an augmenting path. After some repetitions of
this procedure, the maximum weight of a matched edge will have decreased to
the desired value MaxV(G_{i+1}). At this point our algorithm to solve the BMCM
problem of G_{i+1} will stop, since no augmenting path will be found.
This algorithm is of complexity O(|E|^2) since each search for an augmenting
path uses O(|E|) time and there are O(|E|) such searches. A successful search
for an augmenting path for edge (u, v) permanently eliminates it from all
future graphs, so there are at most |E| successful searches. Furthermore, there
are at most |E| unsuccessful searches, one for each value of i.
The new processor assignment for the similarity matrix in Fig. 8 using the
DBMCM algorithm with α = β = 1 is shown in Fig. 8(c). Notice that the
MaxSR solution is minimized; however, the number of TotalV elements moved
is larger than the corresponding value in Fig. 8(a), and more MaxV elements
are moved than in Fig. 8(b). Also note that the optimal similarity matrix
solution for MaxSR is provably no more than twice that of MaxV.
3.2.4 Heuristic Algorithm
We have developed a heuristic greedy algorithm that gives a suboptimal so-
lution to the TotalV metric in O(|E|) steps [23]. All partitions are initially
flagged as unassigned and each processor has a counter set to F that indicates
the remaining number of partitions it needs. The non-zero entries of the simi-
larity matrix M are then sorted in descending order. Starting from the largest
entry, partitions are assigned to processors that have less than F partitions
until done. If necessary, the zero entries in M are also used. It has been proven
that a processor assignment obtained using the heuristic algorithm can never
result in a data movement cost that is more than twice that of the optimal
TotalV assignment [23]. In addition, experimental results in § 4.3 demonstrate
that our heuristic quickly finds high quality solutions for all three metrics. Ap-
plying this heuristic algorithm to the similarity matrix in Fig. 8 generates the
new processor assignment shown in Fig. 8(d).
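A sketch of this greedy heuristic is given below; the container types and names are illustrative, and the similarity matrix is assumed to have P rows and P × F columns:

```cpp
#include <algorithm>
#include <vector>

// Greedy heuristic: sort the entries of the similarity matrix in descending
// order and assign each still-unassigned partition to the processor holding
// the largest share of it, subject to at most F partitions per processor.
// M has P rows (processors) and P*F columns (new partitions).
struct Entry { long weight; int proc, part; };

std::vector<int> greedyAssign(const std::vector<std::vector<long> > &M, int F) {
    const int P = (int)M.size();
    const int N = (int)M[0].size();                  // number of new partitions
    std::vector<Entry> entries;
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < N; ++j)
            entries.push_back({M[i][j], i, j});      // zero entries included
    std::sort(entries.begin(), entries.end(),
              [](const Entry &a, const Entry &b) { return a.weight > b.weight; });

    std::vector<int> owner(N, -1);                   // partition -> processor
    std::vector<int> remaining(P, F);                // partitions left per processor
    for (const Entry &e : entries)
        if (owner[e.part] < 0 && remaining[e.proc] > 0) {
            owner[e.part] = e.proc;
            --remaining[e.proc];
        }
    return owner;
}
```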
3.3 Remapping Cost Model
After the new partitions are reassigned to the processors, a model is required
to predict the redistribution cost for a given machine. Accurately estimating
this time is difficult because of the number and complexity of the costs in-
volved in the remapping procedure. The total remapping cost includes the
computational overhead for rebuilding internal data structures and updating
sha.red boundary information. The communication overhead is architecture-
dependent and complicated because of the many-to-many collective commu-
nication pattern used by the remapper.
Our redistribution algorithm first removes the data objects moving out of a
partition and places them in a buffer. A collective communication then ap-
propriately distributes the data to their final destination, where they are in-
tegrated into the data structures. Finally, the partition boundary information
is consistently updated. This remapping strategy closely follows the superstep
model of BSP [31].
The expected redistribution time on bandwidth-rich systems is then given by:
\[ \gamma \times \mathrm{MaxSR} + O, \]
where MaxSR = max(ElemsSent) + max(ElemsRecd), γ is the total compu-
tation and communication cost to process each redistributed element, and O
is the predicted sum of all constant overheads [23]. This formulation demon-
strates the need to model and minimize the MaxSR metric when performing
processor reassignment. To compute the values of γ and O, a simple least
squares fit through several data points for various redistribution patterns and
their corresponding runtimes can be used. This procedure needs to be per-
formed only once for each architecture, and the values of γ and O can then
be used in actual computations to estimate the redistribution cost.
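The calibration step amounts to an ordinary least-squares fit of a straight line, as sketched below; the names are illustrative and the measured (MaxSR, runtime) pairs are assumed to come from the benchmark redistribution patterns mentioned above:

```cpp
#include <cstddef>
#include <vector>

// One-time calibration: fit gamma and O in  time = gamma * MaxSR + O  by an
// ordinary least-squares line through measured (MaxSR, runtime) pairs taken
// from several benchmark redistribution patterns.
struct RemapModel { double gamma, O; };

RemapModel fitRemapModel(const std::vector<double> &maxSR,
                         const std::vector<double> &time) {
    const std::size_t n = maxSR.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t k = 0; k < n; ++k) {
        sx  += maxSR[k];
        sy  += time[k];
        sxx += maxSR[k] * maxSR[k];
        sxy += maxSR[k] * time[k];
    }
    double gamma = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double O     = (sy - gamma * sx) / n;
    return { gamma, O };
}

// Predicted cost of a candidate remapping:
//   double estimate = model.gamma * predictedMaxSR + model.O;
```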
4 Experimental Results
The parallel 3D_TAG mesh adaptation procedure and the PLUM global load
balancing strategy have been implemented in C and C++, with the parallel
activities in MPI for portability. All experiments were performed on the wide-
node SP2 at NASA Ames, the Origin2000 at NCSA, and the T3E at NASA
Goddard, without any machine-specific optimizations.
Our computational mesh is the one used to simulate the acoustics experiment
of Purcell [24] where a 1/7th-scale model of a UH-1H helicopter rotor blade
was tested over a range of Mach numbers. Detailed numerical results of the
simulation are given elsewhere [30]. This paper reports only on the perfor-
mance of parallel 3D_TAG and PLUM.
Performance results are presented for one refinement and one coarsening step
using various edge-marking strategies. Six strategies are used for the refine-
ment step. The first set of experiments, denoted as RAND_1R, RAND_2R,
and RAND_3R, consists of randomly bisecting 5%, 33%, and 60% of the edges
in the mesh, respectively. The second set, denoted as REAL_1R, REAL_2R,
and REAL_3R, consists of bisecting the same numbers of edges using an error
indicator [30] derived from the actual solution. These strategies represent sig-
nificantly different scenarios. In practice, mesh adaptation tends to be local.
The RAND cases are included as they are expected to behave somewhat ide-
ally because the computational loads are automatically balanced. Thus, the
RAND results should give an indirect indication of how well parallel 3D_TAG
can really perform without explicit load balancing.
Since the coarsening procedure and performance are similar to the refine-
ment method, only two cases are presented where 7% of the edges in the
refined meshes obtained with the RAND_2R and the REAL_2R strategies
are respectively coarsened randomly (RAND_2C) or based on actual solution
(REAL_2C). Table 1 presents the progression of grid sizes through the two
adaptation steps for each edge-marking strategy.
4.1 Refinement Phase
Table 2 presents the computation times and parallel speedup for the refinement
step with the random marking of edges (strategies RAND_1R, RAND_2R,
and RAND_3R). Note that the speedup values are calculated based on the
total time. Performance is excellent with efficiencies of more than 83% on 32
processors and 76% on 64 processors for the RAND_3R case. Parallel mesh
refinement shows a markedly better performance for RAND_3R due to its
bigger computation-to-communication ratio. In general, the total speedup will
Table 1
Grid sizes for the different refinement and coarsening strategies

                 Vertices   Elements     Edges   Bdy Faces
Initial mesh       13,967     60,968    78,343       6,818
RAND_1R            18,274     82,417   104,526       7,672
REAL_1R            17,880     82,489   104,209       7,682
RAND_2R            39,829    201,734   246,949      10,774
REAL_2R            39,332    201,780   247,115      12,008
RAND_3R            60,916    320,919   389,686      15,704
REAL_3R            61,161    321,841   391,233      16,464
RAND_2C            21,756    100,537   126,448       8,312
REAL_2C            20,998    100,124   125,261       8,280
improve as the size of the refined mesh increases. This is because the mesh
adaptation time will increase while the percentage of elements along processor
boundaries will decrease.
Table 2
Performance of mesh refinement when edges are bisected randomly

               RAND_1R            RAND_2R            RAND_3R
      Shared   Comp     Total     Comp     Total     Comp     Total
  P   Edges    Time     Speedup   Time     Speedup   Time     Speedup
  1    0.0%    7.044     1.00    26.904     1.00    45.015     1.00
  2    1.9%    3.837     1.84    13.878     1.94    22.762     1.98
  4    3.7%    2.025     3.48     7.605     3.54    11.569     3.89
  8    6.6%    1.068     6.58     4.042     6.65     5.913     7.61
 16    8.8%    0.587    11.86     2.293    11.67     3.191    14.07
 32   11.6%    0.330    20.72     1.338    19.78     1.678    26.62
 64   15.3%    0.191    32.92     0.711    35.82     0.896    48.66
The communication time is less than 3% of the total time for up to 32 proces-
sors for all three cases. On 64 processors, the communication time, although
still quite small, is more than 12% of the computation time for RAND_1R.
This is because each of the 64 partitions contains less than 1,000 elements
with more than 15% of the edges on partition boundaries. Since additional
work and storage are necessary for shared edges, the speedup deteriorates
as the percentage of such edges increases. The situation is much better for
RAND_3R since the computation time is significantly higher.
Table 3 shows the computation times and speedup when edges are marked using a solution-based error indicator. Performance is extremely poor, especially for REAL_1R and REAL_2R, with speedups of only 9.2X and 19.2X on 64 processors, respectively. This is because mesh adaptation for practical problems occurs in a localized region, causing an almost worst case load-balance behavior. Elements are targeted for refinement on only a small subset of the available processors. Most of the processors remain idle since none of their assigned elements need to be refined. Performance is somewhat better for the REAL_3R strategy because the refinement region is much larger. Since 60% of all edges are bisected in this case, most of the processors are busy doing useful work. This is reflected by an efficiency of more than 56% on 64 processors.

Table 3
Performance of mesh refinement when edges are bisected based on actual solution
             REAL_1R            REAL_2R            REAL_3R
      Comp     Total     Comp     Total     Comp     Total
  P   Time     Speedup   Time     Speedup   Time     Speedup
  1   5.902     1.00    23.780     1.00    41.702     1.00
  2   3.979     1.48    18.117     1.31    26.317     1.58
  4   2.530     2.33     9.173     2.59    14.266     2.92
  8   1.589     3.71     7.091     3.35     8.430     4.95
 16   1.311     4.48     4.046     5.87     4.363     9.55
 32   0.879     6.65     2.277    10.40     2.278    18.25
 64   0.616     9.22     1.224    19.16     1.148    35.95
Note that the communication times constitute a much smaller fraction of the
total time compared to the cases when edges are bisected randomly. This is
due to the difference in the distribution of bisected edges. The RAND cases
require significantly more communication among processors at the partition
boundaries because refinement is scattered all over the problem domain. The
REAL cases, on the other hand, require much less communication since the
refined regions are localized and mostly contained within partitions.
Poor parallel performance of the mesh refinement code for the three REAL
strategies is due to severe load imbalance. It is therefore worthwhile trying to
load balance this phase of 3D_TAG as much as possible. This can be achieved
within PLUM by splitting the mesh refinement step into two distinct phases of
edge marking and mesh subdivision. After edges are marked for bisection, it is
possible to exactly predict the new refined mesh before actually performing the
subdivision phase. This is because elements are independently refined based
on their binary patterns. The mesh is repartitioned if the edge markings are
skewed beyond a specified tolerance. All necessary data is then appropriately
redistributed and the mesh elements are refined in their destination processors.
This enables the subdivision phase to perform in a more load-balanced fashion.
As a bonus, a smaller volume of data has to be moved around since remapping
is performed before the mesh grows in size due to refinement.
Using this methodology, the three REAL cases were run again. Table 4 presents
the performance results of this "load balanced" mesh refinement step. Com-
pared to the results in Table 3, the parallel speedups are now much higher. In
fact, the speedups for REAL_2R consistently beat the corresponding speedups
for RAND_2R, while REAL_3R outperforms RAND_3R when more than eight
processors are used. Even though the RAND cases are expected to behave
somewhat ideally, these results show that explicit load balancing can do better.
An efficiency of 82% is attained for REAL_3R on 64 processors, thereby demon-
strating that mesh adaptation can deliver excellent speedups if the marked
edges are well-distributed among the processors. Communication requires a
larger fraction of the total time for this load balanced strategy because the
mesh refinement work is distributed among more processors after load bal-
ancing. However, communication times are still relatively small, requiring less
than 4% of the total time for all runs except for REAL_1R on 64 processors.
Table 4
Performance of "load balanced" mesh refinement

             REAL_1R            REAL_2R            REAL_3R
      Comp     Total     Comp     Total     Comp     Total
  P   Time     Speedup   Time     Speedup   Time     Speedup
  1   5.902     1.00    23.780     1.00    41.702     1.00
  2   3.311     1.78    12.059     1.97    21.592     1.93
  4   1.980     2.98     6.733     3.53    10.975     3.80
  8   1.369     4.30     3.430     6.92     5.678     7.34
 16   0.702     8.34     1.840    12.88     2.899    14.37
 32   0.414    13.89     1.051    22.41     1.484    27.99
 64   0.217    23.89     0.528    43.24     0.777    52.52
The effect of load balancing the refined mesh before performing the actual
subdivision can be seen more directly from the results presented in Table 5
for RAND_3R and REAL_3R. The quality of load balance is defined as the
ratio of the number of elements on the most heavily-loaded processor to the
optimal number of elements per processor. For the RAND_3R strategy, the
mesh was refined without any load balancing. Two different sets of results
are presented for REAL_3R: one without load balancing (NLB) and the other
using the technique of load balanced mesh refinement (LB). Notice that the
quality of load balance before refinement is excellent, and identical, for both
Table 5
Quality of load balance before and after mesh refinement

        RAND_3R          NLB REAL_3R        LB REAL_3R
  P   Before   After    Before   After     Before   After
1 1.000 1.000 1.000 1.000 1.000 1.000
2 1.000 1.016 1.000 1.556 1.406 1.000
4 1.000 1.033 1.000 2.188 1.948 1.000
8 1.000 1.085 1.000 6.347 2.654 1.000
16 1.000 1.167 1.000 5.591 4.025 1.000
32 1.001 1.226 1.001 7.987 4.212 1.000
64 1.005 1.506 1.005 8.034 6.709 1.004
RAND_3R and NLB REAL_3R because the initial mesh is partitioned using
ParMETIS. However, after mesh refinement, the load imbalance is severe, par-
ticularly for NLB REAL_3R. The load imbalance is not too bad for RAND_3R
since edges are randomly marked for refinement. This is reflected by the dif-
ference in the speedup values in Tables 2 and 3. For LB REAL_3R, the initial
mesh is repartitioned after edge marking is complete. This imbalances the load
before refinement, but generates partitions that are excellently balanced after
subdivision is complete. It also improves the speedup values significantly.
4.2 Coarsening Phase
The coarsening phase consists of three major steps: marking edges to coarsen,
cleaning up all the data structures by removing the coarsened edges and their
associated vertices and tetrahedral elements, and finally invoking the refine-
ment routine to generate a valid mesh from the remaining vertices.
Timings and parallel speedup for the RAND_2C and the REAL_2C coarsening
strategies are presented in Table 6. The follow-up mesh refinement times are
not included because the goal was to demonstrate the parallel performance of
only the modules that are required during the coarsening phase. The compu-
tation time in Table 6 is the time required to mark edges for coarsening. The
communication time is negligible and not shown, but it was included when
calculating the speedup values. The cleanup time, on the other hand, is al-
ways a significant fraction of the total time. The cleanup time decreases as
more and more processors are used due to the reduction in the local mesh
size for each individual partition; however, since it depends on the fraction of
shared objects, performance deteriorates as the problem size is over-saturated
by processors.
Table 6
Performance of mesh coarsening

            RAND_2C                      REAL_2C
   P    Comp    Cleanup   Total      Comp    Cleanup   Total
        Time     Time    Speedup     Time     Time    Speedup
   1    3.619    2.364     1.00      3.989    2.246     1.00
   2    1.832    1.352     1.88      2.026    1.283     1.88
   4    0.963    0.782     3.42      1.066    0.854     3.25
   8    0.572    0.498     5.57      0.600    0.498     5.68
  16    0.303    0.287    10.01      0.334    0.279    10.17
  32    0.170    0.170    16.95      0.167    0.161    19.01
  64    0.070    0.098    31.17      0.093    0.097    32.82
For instance, even though the total efficiency is about 50% on 64 processors
for the results in Table 6, the efficiency when considering only the cleanup
times is barely 37%.
4.3 Comparison of Reassignment Algorithms
Table 7 presents a comparison of our five different processor reassignment al-
gorithms in terms of the reassignment time (in secs) and the amount of data
movement. Results are shown for the REAL_2R strategy on the SP2 with
F = 1. The ParMETIS [19] case does not require any explicit processor reas-
signment since we choose the default partition-to-processor mapping given by
the partitioner. The poor performance is expected since ParMETIS is a global
partitioner that does not attempt to minimize the remapping overhead. A de-
tailed performance comparison of ParMETIS with other partitioners within
the PLUM framework is given in [6].
The execution times of the other four algorithms increase with the number of
processors because of the growth in the size of the similarity matrix; however,
the time for the heuristic algorithm on 64 processors is still very small. The
TotalV, MaxV, and MaxSR metrics are obviously minimized by the MWBG,
BMCM, and DBMCM algorithms, respectively. All the algorithms almost
match the minimum MaxV value given by BMCM. The extremely local re-
finement in our test case requires the migration of a large number of elements
to achieve load balance, causing any reasonable reassignment algorithm to
return a similar MaxV solution.
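For concreteness, the sketch below evaluates the three metrics for a given
mapping of new partitions to processors. It is our own illustration, not PLUM
code, and it assumes definitions along the lines of those given earlier in the
paper: the similarity-matrix entry S[i, j] is taken to be the amount of new
partition j's data already resident on processor i, TotalV the total volume of
data moved, MaxV the largest volume sent or received by any one processor,
and MaxSR the largest sum of data sent and received by any one processor.

```python
import numpy as np

def remapping_metrics(S, assignment):
    """Evaluate TotalV, MaxV and MaxSR for a processor reassignment.

    S[i, j]       -- data of new partition j already residing on processor i
    assignment[j] -- processor that new partition j is mapped to
    """
    P = S.shape[0]
    sent = np.zeros(P, dtype=int)
    recv = np.zeros(P, dtype=int)
    for j, p in enumerate(assignment):
        for i in range(P):
            if i != p:
                sent[i] += S[i, j]   # processor i ships its share of partition j
                recv[p] += S[i, j]   # destination processor p receives it
    total_v = int(sent.sum())                   # == recv.sum()
    max_v = int(np.maximum(sent, recv).max())   # bottleneck send-or-receive
    max_sr = int((sent + recv).max())           # bottleneck send-plus-receive
    return total_v, max_v, max_sr

# Toy example with 3 processors: the identity mapping moves the off-diagonal data.
S = np.array([[40, 10,  0],
              [ 5, 30, 10],
              [ 0, 15, 50]])
print(remapping_metrics(S, [0, 1, 2]))   # -> (40, 25, 40)
```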
The DBMCM algorithm optimally reduces MaxSR, but achieves no more than
a 5% improvement over the other algorithms. Nonetheless, since we believe
Table 7
Comparison of reassignment algorithms for REAL_2R on the SP2 with F = 1

                          P = 32                              P = 64
             TotalV   MaxV   MaxSR   Reass.      TotalV   MaxV   MaxSR   Reass.
Algorithm    Metric  Metric  Metric   Time       Metric  Metric  Metric   Time
ParMETIS     58,297   5,067   7,467   0.000      67,439   2,667   4,452   0.000
MWBG         34,738   4,410   5,822   0.018      38,059   2,261   3,142   0.065
BMCM         49,611   4,410   5,944   0.032      52,837   2,261   3,282   0.133
DBMCM        50,270   4,414   5,733   0.092      54,896   2,261   3,121   1.252
Heuristic    35,032   4,410   5,809   0.002      38,283   2,261   3,123   0.009
that the MaxSR metric can closely approximate the remapping cost on many
architectures, computing its optimal solution can provide useful information.
Notice that the minimum TotalV increases slightly as P grows from 32 to 64,
while MaxSR is dramatically reduced by over 45%. This trend continues as the
number of processors increases, and indicates that PLUM will remain viable
on a large number of processors, since the per processor workload decreases
as P increases.
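Reading the optimal values off Table 7: the minimum TotalV (given by MWBG)
grows from 34,738 to 38,059, roughly a 10% increase, while the minimum
MaxSR (given by DBMCM) falls from 5,733 to 3,121, a reduction of
(5,733 - 3,121)/5,733, or about 46%.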
Finally, observe that the heuristic algorithm does an excellent job in minimiz-
ing all three cost metrics in a trivial amount of time. Although theoretical
bounds have only been established for the TotalV metric, empirical evidence
indicates that the heuristic algorithm closely approximates both MaxV and
MaxSR. Similar results were obtained for the other edge-marking strategies.
Our heuristic algorithm has now been incorporated into ParMETIS [26]. This
feature gives users the option of global repartitioning while minimizing the
remapping overhead.
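To make the flavor of such a reassignment heuristic concrete, here is a minimal
greedy sketch of our own; it is not necessarily the exact heuristic described
earlier or the version incorporated into ParMETIS. It visits similarity-matrix
entries in decreasing order and keeps the largest amounts of data in place:

```python
def greedy_reassignment(S):
    """Greedy sketch of a similarity-matrix-based reassignment heuristic.

    S[i][j] is the data of new partition j already residing on processor i.
    Entries are visited in decreasing order; each partition is mapped to the
    first free processor it is paired with, which keeps large entries 'in
    place' and hence keeps the migration volume small.
    """
    P = len(S)
    entries = sorted(((S[i][j], i, j) for i in range(P) for j in range(P)),
                     reverse=True)
    assignment, used = [None] * P, set()
    for _, i, j in entries:
        if assignment[j] is None and i not in used:
            assignment[j] = i
            used.add(i)
    return assignment

S = [[40, 10, 0], [5, 30, 10], [0, 15, 50]]
print(greedy_reassignment(S))   # -> [0, 1, 2] for this example
```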
4.4 Portability Analysis
The three left plots in Fig. 9 illustrate parallel speedup for the three edge-
marking strategies on the SP2, Origin2000, and T3E. Two sets of results are
presented for each machine: one when data remapping is performed after mesh
refinement, and the other when remapping is performed before refinement. The
REAL_3R case shows the best speedup values because it is the most computa-
tion intensive. Remapping data before refinement has the largest relative effect
for REAL_1R, because it has the smallest refinement region and predictively
load balancing the refined mesh returns the biggest benefit. The best results
are for REAL_3R with remapping before refinement, showing an efficiency
greater than 87% on 32 processors.
Fig. 9. Refinement speedup (left) and remapping time (right) within PLUM on the
SP2, Origin2000, and T3E, when data is redistributed either after or before mesh
refinement.
To compare the performance on the three target machines more critically, one
needs to look at the actual times rather than the speedup values. Table 8 shows
how the execution time (in secs) is spent during the refinement and subsequent
load balancing phases for the REAL_2R case when data is remapped before
the subdivision phase. Notice that the T3E adaptation times are consistently
more than 1.4 times faster than the Origin2000 and three times faster than the
SP2. One reason for this performance difference is the disparity in the clock
speeds of the three machines. Another reason is that the mesh adaptation code
does not use the floating-point units on the SP2, thereby adversely affecting
its overall performance.
Table 8
Anatomy of execution times for REAL_2R on the Origin2000, SP2, and T3E

           Adaptation Time          Remapping Time         Partitioning Time
   P     O2000   SP2    T3E       O2000   SP2    T3E       O2000   SP2    T3E
   2     5.261  12.06   3.455     3.005  3.440  2.648      0.628  0.815  0.701
   4     2.880   6.734  1.956     3.005  3.440  1.501      0.584  0.537  0.477
   8     1.470   3.434  1.034     2.963  3.321  1.449      0.522  0.424  0.359
  16     0.794   1.846  0.568     2.346  2.173  0.880      0.396  0.377  0.301
  32     0.458   1.061  0.333     0.491  1.338  0.592      0.389  0.429  0.302
  64       --    0.550  0.188       --   0.890  0.778        --   0.574  0.425
 128       --      --   0.121       --     --   1.894        --     --   0.599
The three right plots in Fig. 9 show the remapping time for each of the three
cases on the SP2, Origin2000, and T3E. In almost every case, a significant
reduction in remapping time is consistently achieved when the adapted mesh
is load balanced by performing data movement prior to refinement. This is
because the mesh grows in size only after the data has been redistributed.
The remapping times also usually decrease as the number of processors is
increased because more processors are available to share the increase in the
total volume of data movement. The remapping times when data is moved
before mesh refinement are reproduced for the REAL_2R case in Table 8 since
the exact values are difficult to read off the log-scale.
A peculiarity of these results is the behavior of the T3E when P >= 64. When
using up to 32 processors, the T3E closely follows the redistribution cost model
presented earlier; however, for 64 and 128 processors, the remapping
overhead begins to increase even though the MaxSR metric continues to de-
crease. The runtime difference when data is remapped before and after re-
finement is dramatically diminished; in fact, all the remapping times begin
to converge to a single value! This indicates that the remapping time is no
longer affected only by the volume of data redistributed but also by the in-
terprocessor communication pattern. One way of potentially improving these
results is to take advantage of the T3E's ability to efficiently perform one-sided
communication.
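For illustration only, the sketch below shows how a redistribution step could
use one-sided puts (here expressed with MPI RMA via mpi4py, a modern
analogue of the T3E's native one-sided operations) instead of matched
send/receive pairs. This is a hypothetical example, not the PLUM
implementation; the buffer capacity and migration list are placeholders that
the partitioner would normally supply.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
CAPACITY = 100_000                                   # assumed receive-buffer size

# Each processor exposes its receive buffer as an RMA window.
recv_buf = np.zeros(CAPACITY, dtype='i')
win = MPI.Win.Create(recv_buf, disp_unit=recv_buf.itemsize, comm=comm)

# (dest_rank, element_offset, int32 array) tuples; left empty in this sketch.
migrations = []

win.Fence()                                          # open the access epoch
for dest, offset, data in migrations:
    # Write our elements directly into the destination's window; the target
    # never posts a matching receive, which is the point of one-sided RMA.
    win.Put(data, dest, target=offset)
win.Fence()                                          # close the epoch; data is visible
win.Free()
```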
Another surprising result is the dramatic reduction in remapping times when
using 32 processors on the Origin2000. This is probably because network con-
tention with other jobs is essentially removed when using the entire machine.
When using up to 16 processors, the remapping times on the SP2 and the
Origin2000 are comparable, while the T3E is about twice as fast. Recall that
the remapping phase within PLUM consists of both communication (to phys-
ically move data around) and computation (to rebuild the internal and shared
data structures). Since the results in Table 8 indicate that computation is
faster on the Origin2000, it is reasonable to infer that bulk communication is
faster on the SP2. These results generally demonstrate that our methodology
within PLUM is effective in significantly reducing the data remapping time
and improving the parallel performance of mesh refinement.
Table 8 also presents the ParMETIS partitioning times for REAL_2R on all
three systems; the results for REAL_1R and REAL_3R are almost identical be-
cause the time to repartition mostly depends on the initial problem size. There
is, however, some dependence on the number of processors used. When there
are too few processors, repartitioning takes more time because each processor
has a bigger share of the total work. When there are too many processors,
an increase in the communication cost slows down the repartitioner. Table 8
demonstrates that ParMETIS is fast enough to be effectively used within our
framework, and that PLUM can be successfully ported to different platforms
without any code modifications.
5 Conclusions
Dynamic mesh adaptation on unstructured grids is a powerful tool for solving
problems that require local grid modifications to efficiently resolve physical
features of interest. For such problems, the coarsening/refinement step must
be performed frequently, so its efficiency must be comparable to that of the
numerical solver. Furthermore, with the ubiquity of parallel computing, it is
imperative to have efficient parallel implementations of adaptive unstructured-
grid algorithms. Unfortunately, parallel local mesh adaptation requires dy-
namic load balancing. In this paper, we described the parallel implementation
of the 3D_TAG unstructured mesh adaptation algorithm and verified the ef-
fectiveness of our PLUM load balancer for a helicopter rotor blade acoustics
problem.
Six refinement and two coarsening cases were presented with varying fractions
of a realistic-sized domain being targeted for refinement. We demonstrated
excellent parallel performance when repartitioning and remapping the mesh
in a load balanced fashion after edges were targeted for refinement but be-
fore performing the actual subdivision. We presented three generic metrics to
model the remapping cost on most multiprocessor systems. Optimal solutions
for these metrics, as well as a heuristic approach were implemented. It was
shown that our heuristic algorithm quickly finds a solution that satisfies all
three metrics. Additionally, we showed that the data redistribution overhead
can be significantly reduced by applying our heuristic processor reassignment
algorithm to the default mapping given by the global partitioner. Portability
was demonstrated by presenting results on the three vastly different archi-
tectures of the SP2, Origin2000, and T3E, without the need for any code
modifications. Overall, the results showed that our parallel mesh adaptation
and dynamic load balancing strategies will remain viable on large numbers of
processors.
References
[1] M.J. Berger and J.S. Saltzman, AMR on the CM-2, Appl. Numer. Math. 14
(1994) 239-253.
[2] K.S. Bey, J.T. Oden, and A. Patra, A parallel hp-adaptive discontinuous
Galerkin method for hyperbolic conservation laws, Appl. Numer. Math. 20
(1996) 321-336.
[3] K. Bhat, An O(n^2.5 log2 n) time algorithm for the bottleneck assignment
problems (AT&T Bell Laboratories, Murray Hill, NJ, 1984).
[4] R. Biswas and L. Dagum, Parallel implementation of an adaptive scheme for 3D
unstructured grids on a shared-memory multiprocessor, in: A. Ecer, J. Periaux,
N. Satofuka, and S. Taylor, eds., Parallel Computational Fluid Dynamics:
Implementations and Results Using Parallel Computers (Elsevier, Amsterdam,
The Netherlands, 1996) 489-496.
[5] R. Biswas, K.D. Devine, and J.E. Flaherty, Parallel, adaptive finite element
methods for conservation laws, Appl. Numer. Math. 14 (1994) 255-283.
[6] R. Biswas and L. Oliker, Experiments with repartitioning and load balancing
adaptive meshes, Numerical Aerospace Simulation Branch Tech. Rep. NAS-97-
021, NASA Ames Research Center, Moffett Field, CA, 1997.
[7] R. Biswas and R.C. Strawn, A new procedure for dynamic adaption of three-
dimensional unstructured grids, Appl. Numer. Math. 13 (1994) 437-452.
[8] R. Biswas and R.C. Strawn, Tetrahedral and hexahedral mesh adaptation for
CFD problems, Appl. Numer. Math. 26 (1998) 135-151.
[9] J.G. Castanos and J.E. Savage, The dynamic adaptation of parallel mesh-
based computation, Proceedings 8th SIAM Conference on Parallel Processing
for Scientific Computing (SIAM, Minneapolis, MN, 1997).
[10] N. Chrisochoides, Multithreaded model for the dynamic load-balancing of
parallel adaptive PDE computations, Appl. Numer. Math. 20 (1996) 349-365.
[11] H.L. de Cougny, K.D. Devine, J.E. Flaherty, R.M. Loy, C. Ozturan, and M.S.
Shephard, Load balancing for the parallel adaptive solution of partial differential
equations, Appl. Numer. Math. 16 (1994) 157-182.
[12] P. Diniz, S. Plimpton, B. Hendrickson, and R. Leland, Parallel algorithms for
dynamically partitioning unstructured grids, Proceedings 7th SIAM Conference
on Parallel Processing for Scientific Computing (SIAM, San Francisco, CA,
1995) 615-620.
[13] M.L. Fredman and R.E. Tarjan, Fibonacci heaps and their uses in improved
network optimization algorithms, J. ACM 34 (1987) 596-615.
[14] H.N. Gabow and R.E. Tarjan, Algorithms for two bottleneck optimization
problems, J. Alg. 9 (1988) 411-417.
[15] H.N. Gabow and R.E. Tarjan, Faster scaling algorithms for network problems,
SIAM J. Comput. 18 (1989) 1013-1036.
[16] J. Galtier, Automatic partitioning techniques for solving partial differential
equations on irregular adaptive meshes, Proceedings 10th ACM International
Conference on Supercomputing (ACM, Philadelphia, PA, 1996) 157-164.
[17] M.T. Jones and P.E. Plassmann, Parallel algorithms for adaptive mesh
refinement, SIAM J. Sci. Comput. 18 (1997) 686-708.
[18] Y. Kallinderis and A. Vidwans, Generic parallel adaptive-grid Navier-Stokes
algorithm, AIAA J. 32 (1994) 54-61.
[19] G. Karypis and V. Kumar, Parallel multilevel k-way partitioning scheme
for irregular graphs, Department of Computer Science Tech. Rep. 96-036,
University of Minnesota, Minneapolis, MN, 1996.
[20] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for
partitioning irregular graphs, SIAM J. Sci. Comput. 20 (1998) 359-392.
[21] T. Minyard and Y. Kallinderis, A parallel Navier-Stokes method and grid
adapter with hybrid prismatic/tetrahedral grids, Proceedings 33rd AIAA
Aerospace Sciences Meeting (AIAA, Reno, NV, 1995) Paper 95-0222.
[22] T. Minyard, Y. Kallinderis, and K. Schulz, Parallel load balancing for dynamic
execution environments, Proceedings 34th AIAA Aerospace Sciences Meeting
(AIAA, Reno, NV, 1996) Paper 96-0295.
[23] L. Oliker and R. Biswas, PLUM: Parallel load balancing for adaptive
unstructured meshes, J. Parallel Distrib. Comput. 52 (1998) 150-177.
[24] T.W. Purcell, CFD and transonic helicopter sound, 14th European Rotorcraft
Forum, Milan, Italy, 1988, Paper 2.
[25] J.J. Quirk, A parallel adaptive grid algorithm for computational shock
hydrodynamics, Appl. Numer. Math. 20 (1996) 427-453.
[26] K. Schloegel, G. Karypis, V. Kumar, R. Biswas, and L. Oliker, A
performance study of diffusive vs. remapped load-balancing schemes, Proceedings
11th International Conference on Parallel and Distributed Computing Systems
(ISCA, Chicago, IL, 1998) 59-66.
[27] P.M. Selwood, N.A. Verhoeven, J.M. Nash, M. Berzins, N.P. Weatherill, P.M.
Dew, and K. Morgan, Parallel mesh generation and adaptivity: partitioning and
analysis, in: P. Schiano, A. Ecer, J. Periaux, and N. Satofuka, eds., Parallel
Computational Fluid Dynamics: Algorithms and Results Using Advanced
Computers (Elsevier, Amsterdam, The Netherlands, 1997) 166-173.
[28] M.S. Shephard, J.E. Flaherty, H.L. de Cougny, C. Ozturan, C.L. Bottasso, and
M.W. Beall, Parallel automated adaptive procedures for unstructured meshes,
Parallel Computing in CFD AGARD-R-807 (1995) 6.1-6.49.
[29] H.D. Simon, A. Sohn, and R. Biswas, HARP: A dynamic spectral partitioner,
J. Parallel Distrib. Comput. 50 (1998) 83-103.
[30] R.C. Strawn, R. Biswas, and M. Garceau, Unstructured adaptive mesh
computations of rotorcraft high-speed impulsive noise, J. Aircraft 32 (1995)
754-760.
[31] L. Valiant, A bridging model for parallel computation, Comm. ACM 33 (1990)
103-111.
[32] R. Van Driessche and D. Roose, Load balancing computational fluid dynamics
calculations on unstructured grids, Parallel Computing in CFD AGARD-R-807
(1995) 2.1-2.26.
[33] A. Vidwans, Y. Kallinderis, and V. Venkatakrishnan, Parallel dynamic load-
balancing algorithm for three-dimensional adaptive unstructured grids, AIAA
J. 32 (1994) 497-505.
[34] C. Walshaw, M. Cross, and M.G. Everett, Parallel dynamic graph partitioning
for adaptive unstructured meshes, J. Parallel Distrib. Comput. 47 (1997) 102-108.