Parallel Tetrahedral Mesh Adaptation
with Dynamic Load Balancing 1
Leonid Oliker
NERSC, MS 50B-2239, Lawrence Berkeley National Laboratory,
Berkeley, CA 94720. E-mail: [email protected]
Rupak Biswas
MRJ Technology Solutions, MS T27A-1, NASA Ames Research Center,
Moffett Field, CA 94035. E-mail: [email protected]
Harold N. Gabow
Department of Computer Science, University of Colorado,
Boulder, CO 80309. E-mail: [email protected]
Abstract
The ability to dynamically adapt an unstructured grid is a powerful tool for ef-
ficiently solving computational problems with evolving physical features. In this
paper, we report on our experience parallelizing an edge-based adaptation scheme,
called 3D_TAG, using message passing. Results show excellent speedup when a
realistic helicopter rotor mesh is randomly refined. However, performance deteri-
orates when the mesh is refined using a solution-based error indicator since mesh
adaptation for practical problems occurs in a localized region, creating a severe
load imbalance. To address this problem, we have developed PLUM, a global dy-
namic load balancing framework for adaptive numerical computations. Even though
PLUM primarily balances processor workloads for the solution phase, it reduces the
load imbalance problem within mesh adaptation by repartitioning the mesh after
targeting edges for refinement but before the actual subdivision. This dramatically
improves the performance of parallel 3D_TAG since refinement occurs in a more load
balanced fashion. We also present optimal and heuristic algorithms that, when ap-
plied to the default mapping of a parallel repartitioner, significantly reduce the data
redistribution overhead. Finally, portability is examined by comparing performance
on three state-of-the-art parallel machines.
1 Work supported by NASA under Contract Numbers NAS 2-96027 with Universi-
ties Space Research Association and NAS 2-14303 with MRJ Technology Solutions.
Preprint submitted to Elsevier Preprint 12 July 1999
1 Introduction
Unstructured grids2 for solving computational problems have two major ad-
vantages over structured grids. First, unstructured meshes enable efficient grid
generation around highly complex geometries. Second, appropriate unstruc-
tured-grid data structures facilitate the rapid insertion and deletion of points
to allow the mesh to locally adapt to the solution.
Two solution-adaptive strategies are commonly used with unstructured-grid
methods. Regeneration schemes generate a new grid with a higher or lower
concentration of points in different regions depending on an error indicator. A
major disadvantage of such schemes is that they are computationally expen-
sive. This is a serious drawback for unsteady problems where the mesh must
be frequently adapted. However, resulting grids are usually well-formed with
smooth transitions between regions of coarse and fine mesh spacing.
Local mesh adaptation, on the other hand, involves adding points to the ex-
isting grid in regions where the error indicator is high, and removing points
from regions where the indicator is low. The advantage of such strategies is
that relatively few mesh points need to be inserted or deleted at each refine-
ment/coarsening step for unsteady problems. However, complicated logic and
data structures are required to keep track of the points that are added and removed.
For problems that evolve with time, local mesh adaptation procedures have
proved to be robust, reliable, and efficient. By redistributing the available
mesh points to capture physical phenomena of interest, such procedures make
standard computational methods more cost effective. Highly localized regions
of mesh refinement are required in order to accurately capture shock waves,
contact discontinuities, vortices, and shear layers. This provides scientists the
opportunity to obtain solutions on adapted meshes that are comparable to
those obtained on globally-refined grids but at a much lower cost. Even though
adaptive mesh algorithms are commonly used for problems in fluid flow and
structural mechanics, they are also of significant interest in several other areas
like computer vision and graphics.
Advances in adaptive software and methodology notwithstanding, parallel
computational strategies will be an essential ingredient in solving complex real-
life problems. However, parallel computers are usually easier to program with
regular data structures; so the development of efficient parallel adaptive algo-
rithms for unstructured grids (that use complex data structures and indirect
addressing) poses a serious challenge. Their parallel performance for supercom-
puting applications not only depends on the design strategies, but also on the
2 The terms grid and mesh are used synonymously throughout this paper.
choice of efficient data structures which must be amenable to simple manipulation without significant memory contention (for shared-memory architectures) or communication overhead (for message-passing architectures). Nonetheless, it is generally believed that adaptive unstructured-grid techniques will constitute a significant fraction of future high-performance computing.
A significant amount of research has been done to design sequential algo-
rithms to effectively use unstructured meshes for fluid flow applications, e.g.,
the solution of the Euler equations. Unfortunately, many of these techniques
cannot take advantage of the power of parallel computing due to the dif-
ficulties of porting these codes onto distributed-memory architectures. Re-
cently, several adaptive schemes have been successfully developed in a par-
allel environment. Most of these codes are based on two-dimensional finite
elements [1,2,5,9,17,18,25], and some progress has been made towards three-
dimensional unstructured-mesh simulations [4,21,27,28]. Various dynamic load
balancing methods for unstructured-grid applications have also been reported
to date [10-12,16,22,32-34]; however, most of them lack a global view of loads
across all processors.
Fig. 1. Overview of our global dynamic load balancing framework for adaptive numerical computations (Initialization, Partitioning, Mapping, Solution, Execution, Repartitioning, Remapping, and Finalization blocks).
Figure 1 depicts our global dynamic load balancing framework for adaptive
computations. It essentially consists of a numerical solver and our mesh adap-
tor, with a partitioner and a remapper that load balance and redistribute the
computational mesh when necessary. The mesh is first partitioned and mapped
among the available processors. The initialization phase distributes the global
data among the processors and generates a database for all shared objects 3.
The numerical solver then runs for several iterations, updating solution vari-
ables that are typically stored at the vertices of the mesh. When an acceptable
3 The term object is used generically to denote a vertex, edge, tetrahedron, or face in the mesh.
solution is obtained, local mesh adaptation is performed to generate a new
computational mesh, if so desired. A quick evaluation step determines if the
new mesh is sufficiently unbalanced to warrant a repartitioning. If the current
partitioning indicates that it is adequately load balanced, control is passed
back to the solver. Otherwise, a mesh repartitioning procedure is invoked to
divide the new grid into subgrids. The new partitions are then reassigned to
the processors in a way that minimizes the cost of data movement. If the
cost of remapping the data is less than the computational gain that would
be achieved with balanced partitions, all necessary data is appropriately re-
distributed. Otherwise, the new partitioning is discarded and the calculation
continues on the old partitions. The finalization step combines the local grids
on each processor into a single global mesh. This is usually required for some
post-processing tasks, such as visualization, or to save a snapshot of the grid
on secondary storage for future restart runs.
Notice from the framework in Fig. 1 that the computational load is balanced
and the runtime communication reduced only for the solver but not for the
mesh adaptor. This is important since solvers are usually several times more
expensive. However, parallel performance for the mesh adaptation procedure
can be significantly improved if the mesh is repartitioned and remapped in a
load-balanced fashion after edges are targeted for refinement and coarsening
but before performing the actual adaptation. This strategy also reduces the
redistribution cost significantly since a smaller volume of data is moved.
The numerical solver is usually application-dependent, and is beyond the scope
of this paper. Here, we focus on some of the tools that enable numerical sim-
ulations to be accomplished rapidly and efficiently. Parallel mesh adaptation
and dynamic load balancing are two such critical tools.
2 Tetrahedral Mesh Adaptation
We first give a brief description of the tetrahedral mesh adaptation scheme [7]
that is used in this work to better explain the modifications that were made
for the distributed-memory implementation. The 5,000-line C code, called
3D_TAG, has its data structures based on edges that connect the vertices
of a tetrahedral mesh. This means that the elements 4 and boundary faces are
defined by their edges rather than by their vertices. These edge-based data
structures make the mesh adaptation procedure capable of efficiently perform-
ing anisotropic refinement and coarsening. A successful data structure must
contain the right amount of information to rapidly reconstruct the mesh con-
nectivity when vertices are added or deleted while having reasonable memory
4 The terms element and tetrahedron are used synonymously throughout this paper.
requirements.
Recently, the 3D_TAG code has been modified to refine and coarsen hexahedral
meshes [8]. The data structures and serial implementation for the hexahedral
scheme are similar to those for the tetrahedral code. Their parallel implementa-
tions should also be similar; however, this paper focuses solely on tetrahedral
mesh adaptation.
2.1 The Algorithm
At each mesh adaptation step, individual edges are marked for coarsening,
refinement, or no change, based on an error indicator calculated from the
solution. Edges whose error values exceed a user-specified upper threshold are
targeted for subdivision. Similarly, edges whose error values lie below another
user-specified lower threshold are targeted for removal. Only three subdivision
types are allowed for each tetrahedral element and these are shown in Fig. 2.
The 1:8 isotropic subdivision is implemented by adding a new vertex at the
mid-point of each of the six edges. The 1:4 and 1:2 subdivisions can result
either because the edges of a parent tetrahedron are targeted anisotropically
or because they are required to form a valid connectivity for the new mesh.
When an edge is bisected, the solution quantities are linearly interpolated at
the mid-point from its two end-points.
Fig. 2. Three types of subdivision (1:8, 1:4, and 1:2) are permitted for a tetrahedral element.
Mesh refinement is performed by first setting a bit flag to one for each edge
that is targeted for subdivision. The edge markings for each element are then
combined to form a 6-bit pattern as shown in Fig. 3 where the edges marked
with an 'R' are the ones to be bisected. Elements are continuously upgraded to
valid patterns corresponding to the three allowed subdivision types (cf. Fig. 2)
until none of the patterns show any change. Once this edge marking is com-
pleted, each element is independently subdivided into smaller child elements
based on its binary pattern. Special data structures are used to ensure that
this process is computationally efficient.
Edge number: 6 5 4 3 2 1; marking: 0 0 1 0 1 1; pattern = 11.
Fig. 3. Sample edge-marking pattern for element subdivision.
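The bit-level bookkeeping described above can be illustrated with a short, self-contained sketch (not taken from 3D_TAG; the function and variable names are invented for illustration) that packs six per-edge refinement flags into the 6-bit element pattern of Fig. 3 and then queries it during subdivision:

```cpp
#include <cstdio>

// Pack the six edge flags of one element into a 6-bit pattern.
// Bit (e-1) corresponds to local edge number e in Fig. 3.
unsigned makePattern(const int edgeFlag[6]) {
    unsigned pattern = 0;
    for (int e = 0; e < 6; ++e)
        if (edgeFlag[e]) pattern |= 1u << e;
    return pattern;
}

int main() {
    // Edges 1, 2, and 4 are targeted for bisection, as in Fig. 3:
    // binary 001011 = decimal 11.
    int edgeFlag[6] = {1, 1, 0, 1, 0, 0};
    unsigned pattern = makePattern(edgeFlag);
    std::printf("pattern = %u\n", pattern);       // prints 11

    // During subdivision, individual edges are tested with a mask.
    for (int e = 0; e < 6; ++e)
        if (pattern & (1u << e))
            std::printf("bisect edge %d\n", e + 1);
    return 0;
}
```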
Mesh coarsening also uses the edge-marking patterns. If a child element has any edge marked for coarsening, this element and its siblings are removed and their parent is reinstated. Parent edges and elements are retained at each refinement step so they do not have to be reconstructed. Reinstated parent elements have their edge-marking patterns adjusted to reflect that some edges have been coarsened. The parents are then subdivided based on their new patterns by invoking the mesh refinement procedure. As a result, the coarsening
and refinement procedures share much of the same logic.
There are some constraints for mesh coarsening. For example, edges cannot
be coarsened beyond the initial mesh. Edges must also be coarsened in an
order that is reversed from the one by which they were refined. Moreover, an
edge can coarsen if and only if its sibling is also targeted for coarsening. More
details about these coarsening constraints are given in [8].
Details of the data structures are given in [7]; however, a brief description of
the salient features is necessary to understand the distributed-memory imple-
mentation of the mesh adaptation code. Pertinent information is maintained
for" the vertices, elements, edges, and boundary faces of the mesh. For each
vertex, the coordinates are stored in coord[3], the solution in soln[5], and
a pointer to the first entry in the edge list in edges. The edge list for a vertex
is a linked list of pointers to all the edges that are incident upon it. Such lists
eliminate extensive searches and are crucial to the efficiency of the overall
adaptation scheme. The tetrahedral elements have their six edges stored in
tedge[6], the edge-marking pattern in part, the parent element in tparent,
and the first child element in tchild. Sibling elements always reside contigu-
ously in memory; hence, a parent element only needs a pointer to the first
child. For each edge, we store its two end-points in vertex[2], its parent
edge in eparent, its two children edges in echild[2], the two boundary faces
it defines in bfac[2], and a pointer to the first entry in the element list in
elems. The element list for an edge is again a linked list of pointers to all
the elements that share it. Finally, for each boundary face, we store the three
edges in bedge[3], the element to which it belongs in belem, the parent in
bparent, and the first child in bchild. Sibling boundary faces, like elements,
are stored consecutively in memory.
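The description above translates naturally into a set of C-style records. The following declarations are only a sketch assembled from the field names given in the text; the index type, the name of the pattern field, and the singly linked incidence lists are assumptions made for illustration, not the actual 3D_TAG definitions:

```cpp
// Illustrative declarations only; the index type, the pattern field name,
// and the singly linked incidence lists are assumptions, not 3D_TAG code.
typedef int Index;                 // index into the object arrays

struct EdgeListNode {              // linked list of edges incident on a vertex
    Index edge;
    EdgeListNode *next;
};

struct ElemListNode {              // linked list of elements sharing an edge
    Index elem;
    ElemListNode *next;
};

struct Vertex {
    double coord[3];               // coordinates
    double soln[5];                // solution variables
    EdgeListNode *edges;           // first entry of the incident-edge list
};

struct Tetrahedron {
    Index tedge[6];                // six edges defining the element
    unsigned pattern;              // 6-bit edge-marking pattern
    Index tparent;                 // parent element
    Index tchild;                  // first child (siblings are contiguous in memory)
};

struct Edge {
    Index vertex[2];               // two end-points
    Index eparent;                 // parent edge
    Index echild[2];               // two children edges
    Index bfac[2];                 // boundary faces this edge defines
    ElemListNode *elems;           // first entry of the element list
};

struct BoundaryFace {
    Index bedge[3];                // three edges
    Index belem;                   // element to which the face belongs
    Index bparent;                 // parent face
    Index bchild;                  // first child (siblings are contiguous in memory)
};
```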
2.2 Parallel Implementation
The parallel version of the 3D_TAG mesh adaptation code contains an ad-
ditional 3,000 lines of C++ with Message-Passing Interface (MPI), allowing
portability to any system supporting these languages. This code is a wrap-
per around the original mesh adaptation program written in C, and required
the addition of only 10 instructions to link it with the parallel constructs. The
object-oriented approach allowed us to build a clean interface between the two
layers of the program while maintaining efficiency. Only a slight increase in
space was necessary to keep track of the global mappings and shared processor
lists (SPLs) for objects on partition boundaries.
Parallel 3D_TAG consists of three phases: initialization, execution, and final-
ization. The initialization step consists of scattering the global data across the
processors, defining a local numbering scheme for each object, and creating
the mapping for objects that are shared by multiple processors. The execu-
tion step runs a copy of 3D_TAG on each processor that refines or coarsens
its local region, while maintaining a globally-consistent grid along partition
boundaries. Parallel performance is extremely critical during this phase since
it will be executed several times during a computation. Finally, a gather op-
eration is performed in the finalization step to combine the local grids into
one global mesh. Locally-numbered objects and the corresponding pointers are
reordered to represent one single consistent mesh. Note from Fig. 1 that the
initialization and finalization phases are invoked only once for each problem
outside the cycle of solution, execution, and load balancing.
In order to perform parallel mesh adaptation, the initial grid must first be
partitioned among the available processors. A good partitioner minimizes the
total execution time by equidistributing the workload and reducing the inter-
processor communication. However, it is also important within our framework
that the partitioning phase be performed rapidly. There are several excel-
lent heuristic algorithms for solving the NP-hard graph partitioning prob-
lem [20,28,29,32,34]. We used the ParMETIS [19] parallel multilevel partition-
ing algorithm for the test cases in this paper. ParMETIS reduces the size of the
graph by collapsing vertices and edges using a heavy edge matching scheme,
applies a greedy graph growing algorithm for partitioning the coarsest graph,
and then uncoarsens it back using a combination of boundary greedy and
Kernighan-Lin refinement to construct a partitioning for the original graph.
2.2.1 Initialization
The initialization phase takes as input the global initial grid and the cor-
responding partitioning that maps each tetrahedral element to exactly one
partition. The elementdata and partition information are then broadcasttoall processorswhich, in parallel, assigna local, zero-basednatural numbertoeach element.\Ve are thus assumingthat a.ninitial tetrahedral meshexists,and that it is partitioned amongtile availableprocessors.Oncethe elementshavebeenprocessed,local edgeinformation can becomputed.
In three dimensions, an individual edge may belong to an arbitrary number of
elements. Since each element is assigned to only one partition, it is theoretically
possible for an edge to be shared by all the processors. For each partition, a
local zero-based natural number is assigned to every edge that belongs to
at least one element. Each processor then redefines its elements in tedge [6]
in terms of these local edge numbers. Edges that are shared by more than
one processor are identified by searching for elements that lie on partition
boundaries. A bit flag is set to distinguish between shared and internal edges.
A SPL is also generated for each shared edge. Finally, the element list in elems
for each edge is updated to contain only the local elements.
The vertices are initialized using the vertex [2] data structure for each edge.
Every local vertex is assigned a zero-based natural number in each partition.
Next the local edge list for each vertex is created from the appropriate sub-
set of the global edges array. Like shared edges, each shared vertex must be
identified and assigned its SPL. A naive approach would be to thread through
the data structures to the elements and their partitions to determine which
vertices lie on partition boundaries. But this procedure requires excessive in-
direction. A faster approach is based on the following two properties of a
shared vertex: it must be an end-point for at least one shared edge, and its
SPL is the union of its shared edges' SPLs. However, some communication is
required when using this method. For each vertex containing a shared edge in
its edges list, that edge's SPL is communicated to the processors in the SPLs
of all other shared edges until the union of all the SPLs is formed. For the
cases in this paper, this process required no more than three iterations, and
all shared vertices were processed as a function of the number of shared edges
plus a small communication overhead. An example is shown in Fig. 4 where
the SPL is being formed in P0 for the center vertex that is shared by three
other processors. Without communication, P0 would incorrectly conclude that
the vertex is shared only with P1 and P3.
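A minimal sketch of the purely local part of this step is shown below, assuming each shared edge already carries its SPL; the iterative exchange of SPLs between processors, needed to reach the fixed point illustrated in Fig. 4, is deliberately omitted, and the names are illustrative:

```cpp
#include <set>
#include <vector>

// Local part of forming a shared vertex's SPL: the union of the SPLs of its
// incident shared edges. The iterative exchange of SPLs between processors,
// needed to reach the fixed point illustrated in Fig. 4, is omitted here.
using SPL = std::set<int>;                       // set of processor ranks

struct EdgeInfo {
    bool shared;                                 // shared / internal flag
    SPL  spl;                                    // empty if internal
};

SPL vertexSPL(const std::vector<EdgeInfo> &incidentEdges) {
    SPL result;
    for (const EdgeInfo &e : incidentEdges)
        if (e.shared)
            result.insert(e.spl.begin(), e.spl.end());
    return result;                               // may still grow after communication
}
```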
The final step in the initialization phase is the local renumbering of the exter-
nal boundary faces 5. Since a boundary face belongs to only one element, it
is never shared among processors. Each boundary face is defined by its three
edges in bedge[3], while each edge maintains a pair of pointers in bfac[2]
to the boundary faces it defines. Since the global mesh is closed (water-tight),
an edge on the external boundary is shared by exactly two boundary faces.
5 The internal faces are not stored in the mesh data structures.
Fig. 4. An example showing the communication needed to form the SPL for a shared vertex. Before communication, P0 shares the center vertex with P1 and P3; after communication, P0 shares it with P1, P2, and P3.
However, when the mesh is partitioned, this is no longer true. An example is
shown in Fig. 5. An affected edge creates an empty ghost boundary face in
each of the two processors for the execution phase. The ghost boundary faces
do not participate in the adaptation process but are required to create a valid
subgrid in each processor. These ghost faces are later eliminated during the
finalization stage.
Fig. 5. An example showing how boundary faces are represented at partition boundaries. Before partitioning, global edge GE5 is shared by global boundary faces GBF7 and GBF8; after partitioning, GE5 is stored as LE1 and LE3 in P0 and P1, GBF7 as LBF3 in P0, GBF8 as LBF0 in P1, and a ghost boundary face is created in each processor.
A new data structure has been added to the serial code to represent all this
shared information. Each shared edge and vertex contains a two-way mapping
between its local and its global numbers 6, and a SPL of processors where
its shared copies reside. The maximum additional storage depends on the
number of processors used and the fraction of shared objects. For the cases in
this paper, this was less than 10% of the memory requirements of the serial
version.
6 The global numbers for the various mesh objects are obtained trivially during the
initialization phase.
2.2.2 Execution
The first step in the actual mesh adaptation phase is to target edges for refine-
ment or coarsening. This is usually based on an error indicator for each edge
that is computed from the solution. This strategy results in a symmetrical
marking of all shared edges across partitions since such edges have the same
numerical and geometrical information regardless of their processor number.
However, elements have to be continuously upgraded to one of the three al-
lowed subdivision patterns shown in Fig. 2. This causes some propagation of
edges being targeted that could mark local copies of shared edges inconsis-
tently. This is because the local geometry and marking patterns affect the
nature of the propagation. Communication is therefore required after each
iteration of the propagation process. Every processor sends a list of all the
newly-marked local copies of shared edges to all the other processors in their
SPLs. This process may continue for several iterations, and edge markings
could propagate back and forth across partitions.
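The structure of this propagation loop can be sketched as follows. The sketch is heavily simplified: instead of exchanging only newly-marked shared edges with the processors in their SPLs, it keeps one flag per global edge and merges marks with an all-reduce, and the local pattern-upgrade step is a purely illustrative placeholder:

```cpp
#include <cstddef>
#include <mpi.h>
#include <vector>

// Heavily simplified propagation loop. Real 3D_TAG sends only newly-marked
// shared edges to the processors in their SPLs; here every processor holds one
// flag per global edge and marks are merged with an all-reduce. The local
// pattern-upgrade rule below is a purely illustrative placeholder.
static bool upgradeLocalPatterns(std::vector<unsigned char> &mark) {
    bool changed = false;
    // Toy rule: if the first two edges of a (fictitious) 3-edge group are
    // marked, mark the third as well, mimicking pattern propagation.
    for (std::size_t i = 0; i + 2 < mark.size(); i += 3)
        if (mark[i] && mark[i + 1] && !mark[i + 2]) {
            mark[i + 2] = 1;
            changed = true;
        }
    return changed;
}

void propagateMarks(std::vector<unsigned char> &mark, MPI_Comm comm) {
    int globalChanged = 1;
    while (globalChanged) {
        int localChanged = upgradeLocalPatterns(mark) ? 1 : 0;
        // Merge marks set by any processor (logical OR via element-wise MAX).
        MPI_Allreduce(MPI_IN_PLACE, mark.data(), (int)mark.size(),
                      MPI_UNSIGNED_CHAR, MPI_MAX, comm);
        // Terminate only when no processor marked a new edge this iteration.
        MPI_Allreduce(&localChanged, &globalChanged, 1, MPI_INT, MPI_LOR, comm);
    }
}
```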
Figure 6 shows a two-dimensional example of two iterations of the propagation
process across a partition boundary. The process is similar in three dimensions.
Processor P0 marks its local copy of shared edge GE1 and communicates
that to P1. P1 then marks its own copy of GE1, which causes some internal
propagation because element marking patterns must be upgraded to those
that are valid. Note that P1 marks its third internal edge and its local copy of
shared edge GE2 during this phase. Marking information about GE2 is then
communicated to P0, and the propagation phase terminates. The four original
triangles can now be correctly subdivided into a total of 12 smaller triangles.
Fig. 6. A two-dimensional example showing communication during propagation of the edge-marking phase.
Once all edge markings are complete, each processor executes the mesh adap-
tation code without the need for further communication, since all edges are
10
consistently marked. The only task remaining is to update the shared edge and vertex information as the mesh is adapted. This is handled as a post-processing phase.
New edges and vertices that are created during refinement are assigned shared processor information that depends on several factors. Four different cases can
occur when new edges are created:
• If an internal edge is bisected, the center vertex and all new edges incident
on that vertex are also internal to the partition. Shared processor informa-
tion is not required in this case.
• If a shared edge is bisected, its two children and the center vertex inherit
its SPL, since they lie on the same partition boundary.
• If a new edge is created in the interior of an element, it is internal to the
partition since processor boundaries only lie along element faces. Shared
processor information is not required.
• If a new edge is created that lies across an element face, communication is
required to determine whether it is shared or internal. If it is shared, the
SPL must be formed.
All the cases are straightforward, except for the last one. If the intersection of
the SPLs of the two end-points of the new edge is null, the edge is internal.
Otherwise, communication is required with the shared processors to determine
whether they have a local copy of the edge. This communication is necessary
because no information is stored about the internal faces of the tetrahedral
elements. An alternate solution would be to incorporate internal faces as an
additional object into the data structures, and maintaining it through the
adaptation. However, this strategy does not compare favorably in terms of
memory or CPU time to a single communication at the end of the refinement
procedure. This is primarily because the number of triangular faces for a
tetrahedral mesh is asymptotically ten times the number of mesh vertices.
Figure 7 shows the top view of a tetrahedron in processor P0 that shares
two faces with P1 while the third face is internal. The fourth face is not
shown and is irrelevant for this example. Assume that due to mesh refinement,
three new edges LE1, LE2, and LE3, are formed in P0. An intersection of
the SPLs for the two end-points of all the three edges yields P1. However,
when P0 communicates this information to P1, P1 will only have local copies
corresponding to LE1 and LE2. Thus, P0 would be able to correctly classify
LE1 and LE2 as shared edges but LE3 as an internal edge.
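A sketch of the local intersection test is given below; the names are illustrative, and the result is only a list of candidate sharers, since (as the LE3 case above shows) a confirmation message to those processors is still required:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// First, purely local test for a new edge created across an element face:
// intersect the SPLs of its two end-points. An empty intersection means the
// edge is internal. Otherwise the processors returned are only candidates,
// and a confirmation message is still required (LE3 in Fig. 7 is internal
// even though its end-point SPLs intersect).
using SPL = std::set<int>;

std::vector<int> candidateSharers(const SPL &splA, const SPL &splB) {
    std::vector<int> candidates;
    std::set_intersection(splA.begin(), splA.end(),
                          splB.begin(), splB.end(),
                          std::back_inserter(candidates));
    return candidates;   // empty => internal edge, no communication needed
}
```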
The coarsening phase purges the data structures of all edges that are removed,
as well as their associated vertices, elements, and boundary faces. No new
shared processor information is generated since no mesh objects are created
during this step. However, objects are renumbered as a result of compaction
Fig. 7. An example showing how a new edge across a face is classified as shared or internal.
and all internal and shared data are updated accordingly. The refinement
routine is then invoked to generate a valid mesh from the vertices left after
the coarsening.
2.2.3 Finalization
Under certain conditions, it is necessary to create a single global mesh after one
or more adaptation steps. Some post processing tasks, such as visualization,
need to process the whole grid simultaneously. Storing a snapshot of a grid
for future restarts could also require a global view. Our finalization phase
accomplishes this goal by merging the individual subgrids into one global data structure.
Each local object is first assigned a unique global number. Next, all local data
structures are updated in terms of these global numbers. Finally, gather oper-
ations are performed to a host processor to create the global mesh. Individual
processors are responsible for correctly arranging the data so that the host
only collects and concatenates without further processing.
It is relatively simple to assign global element numbers since elements are
not shared among processors. By performing a scan-reduce add 7 on the total
number of elements, each processor can assign the final global element number.
The global boundary face numbering is done similarly since boundary faces too are
not shared among processors.
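With MPI, the scan-reduce add can be expressed directly with MPI_Exscan, as in the following sketch (the function name and integer type are illustrative):

```cpp
#include <mpi.h>

// Scan-reduce add via MPI_Exscan: each rank receives the sum of the element
// counts of all lower-ranked processors, i.e. the offset of its first element
// in the global numbering. Boundary faces are numbered the same way.
long firstGlobalElementNumber(long numLocalElements, MPI_Comm comm) {
    long offset = 0;
    MPI_Exscan(&numLocalElements, &offset, 1, MPI_LONG, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) offset = 0;   // MPI_Exscan leaves rank 0's result undefined
    return offset;               // local element k gets global number offset + k
}
```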
Assigning global numbers to edges and vertices is somewhat more complicated
since they may be shared by several processors. Each shared edge (and vertex)
is assigned an owner from its SPL which is then responsible for generating the
global number. Owners are randomly selected to keep the computation and
communication loads balanced. Once all processors complete numbering their
edges (and vertices), a communication phase propagates the global values from
owners to other processors that have local copies.
7 A scan-reduce add operation creates a vector whose ith element is the sum
of the first i - 1 elements of the argument vector.
After global numbers have been assigned to every object, all data structures are updated to contain consistent global information. Since elements and boundary faces are unique in each processor, no duplicates exist. All unowned edge copies are removed from the data structures, which are then
compacted. However, the element lists in elems cannot be discarded for the
unowned edges. Some communication is required to adjust the pointers in the
local lists so that global lists can be formed without any serial computation.
The pair of pointers in bfac[2] that were split during the initialization phase
for shared edges are glued back by communicating the boundary face informa-
tion to the owner. Vertex data structures are updated much like edges except
for the manner in which their edge lists in edges are handled. Since shared
vertices may contain local copies of the same global edge in their lists on dif-
ferent processors, the unowned edge copies are first deleted. Pointers are next
adjusted as in the elems case with some communication among processors.
At this time, all processors have updated their local data with respect to their
relative positions in the final global data structures. A gather operation by a
host processor is performed to concatenate the local data structures. The host
can then interface the global mesh directly to the appropriate post-processing
module without having to perform any serial computation.
3 Dynamic Load Balancing
PLUM [23] is a novel method to dynamically balance the processor workloads
for unstructured adaptive-grid computations with a global view. It has five
key features:
• Repeated use of the initial mesh dual graph keeps the connectivity and par-
titioning complexity constant during the course of an adaptive computation.
• Parallel mesh repartitioning avoids a potential serial bottleneck.
• Fast heuristic remapping assigns partitions to processors so that the redis-
tribution cost is minimized.
• Efficient data movement significantly reduces the cost of remapping and
mesh subdivision.
• Accurate metrics estimate and compare the computational gain and the re-
distribution cost of having a balanced workload after each mesh adaptation.
3.1 Repartitioning the Initial Mesh Dual Graph
Repeatedly using the dual of the initial computational mesh for dynamic load
balancing is one of the key features of PLUM. Each dual graph vertex has
two weights associated with it. The computational weight, Wcomp, models the
workload for the corresponding element. The remapping weight, Wremap, mod-
els the cost of moving the element from one processor to another. Every edge
of the dual graph also has a weight, Wcomm, that models the runtime interpro-
cessor communication. These three weights are determined by the numerical
algorithm and the data structures. In our current work, Wcomp is set to the
number of leaf elements in the refinement tree, Wremap is set to the total num-
ber of elements in the refinement tree, and Wcomm is set to the number of faces
in the computational mesh that corresponds to the dual graph edge. The mesh
connectivity, Wcomp, and Wcomm together determine how balanced partitions
with minimum runtime communication are formed. The Wremap values determine how
partitions should be assigned to processors such that the data redistribution
cost is minimized. New computational grids obtained by hierarchical adap-
tation are translated to Wcomp and Wremap for every vertex and to Wcomm for
every edge in the dual mesh. If the dual graph with a new set of weights
is deemed unbalanced, the mesh is repartitioned using the ParMETIS [19]
parallel multilevel partitioner.
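As an illustration of how the two vertex weights could be derived from an element's refinement tree, consider the following sketch; the tree representation is an assumption made for this example and is not the PLUM data structure:

```cpp
#include <vector>

// Illustrative refinement-tree representation (not the PLUM data structure):
// Wcomp counts the leaf elements, which carry the computation; Wremap counts
// all elements in the tree, which all move if the root element migrates.
struct RefTreeNode {
    std::vector<RefTreeNode> children;   // empty for a leaf element
};

void countWeights(const RefTreeNode &node, int &wComp, int &wRemap) {
    ++wRemap;                            // every element contributes to Wremap
    if (node.children.empty()) {
        ++wComp;                         // only leaves contribute to Wcomp
        return;
    }
    for (const RefTreeNode &child : node.children)
        countWeights(child, wComp, wRemap);
}
```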
3.2 Processor Reassignment
New partitions generated by a partitioner must be mapped to processors such
that the data redistribution cost is minimized. In general, the number of new
partitions is an integer multiple F of the number of processors, and each
processor is assigned F unique partitions. Allowing multiple partitions per
processor reduces the volume of data movement at the expense of partitioning
and processor reassignment times [23]; however, the simpler scheme of setting
F to unity suffices for most practical applications.
We first generate a similarity measure M that indicates how the remapping
weights Wremap of the new partitions are distributed over the processors. It is
represented as a matrix where entry Mij is the sum of the Wremap values of all
the dual graph vertices in new partition j that already reside on processor i.
Various cost functions are usually needed to solve the processor reassignment
problem using M for different machine architectures. We present three general
metrics: TotalV, MaxV, and MaxSR, which model the remapping cost on most
multiprocessor systems. TotalV minimizes the total volume of data moved
among all the processors, MaxV minimizes the maximum flow of data to or from
any single processor, while MaxSR minimizes the sum of the maximum flow of
data to and from any processor. A greedy heuristic algorithm to minimize the
remapping overhead is also presented.
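For concreteness, the three metrics can be evaluated for a given similarity matrix and assignment as in the sketch below, which assumes F = 1 and α = β = 1 as in Fig. 8; the names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// Evaluate the three metrics for a P x P similarity matrix M (F = 1) and a
// processor assignment, where assign[i] is the new partition given to
// processor i; alpha = beta = 1 is assumed, as in Fig. 8.
struct Metrics { long totalV, maxV, maxSR; };

Metrics evaluate(const std::vector<std::vector<long> > &M,
                 const std::vector<int> &assign) {
    const int P = (int)M.size();
    long maxSent = 0, maxRecd = 0, totalV = 0;
    for (int i = 0; i < P; ++i) {
        long sent = 0, recd = 0;
        for (int j = 0; j < P; ++j)
            if (j != assign[i]) sent += M[i][j];     // data leaving processor i
        for (int k = 0; k < P; ++k)
            if (k != i) recd += M[k][assign[i]];     // data arriving at processor i
        totalV += sent;
        maxSent = std::max(maxSent, sent);
        maxRecd = std::max(maxRecd, recd);
    }
    return { totalV, std::max(maxSent, maxRecd), maxSent + maxRecd };
}
```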
3.2.1 TotalV Metric
The TotalV metric assumes that by reducing network contention and the total
number of elements moved, the remapping time will be reduced. In general,
each processor cannot be assigned F unique partitions corresponding to their
F largest weights. To minimize TotalV, each processor i must be assigned F
partitions $j_{i,f}$, $f = 1, 2, \ldots, F$, such that the objective
\[ \sum_{i=1}^{P} \sum_{f=1}^{F} M_{i,\,j_{i,f}} \]
is maximized subject to the constraint
\[ j_{i,r} \neq j_{k,s} \quad \text{for } i \neq k \text{ or } r \neq s; \qquad i, k = 1, 2, \ldots, P; \quad r, s = 1, 2, \ldots, F. \]
We can optimally solve this by mapping it to a network flow optimization
problem described as follows. Let G = (V, E) be an undirected graph. G is
bipartite if V can be partitioned into two sets A and B such that every edge
has one vertex in A and the other vertex in B. A matching is a subset of edges,
no two of which share a common vertex. A maximum-cardinality matching is
one that contains as many edges as possible. If G has a real-valued cost on each
edge, we can consider the problem of finding a maximum-cardinality matching
whose total edge cost is maximized. We refer to this as the maximally weighted
bipartite graph (MWBG) problem (also known as the assignment problem).
When F = 1, optimally solving for the TotalV metric trivially reduces to
MWBG, where V consists of P processors and P partitions in each set. An edge
of weight Mij exists between vertex i of the first set and vertex j of the second
set. If F > 1, the processor reassignment problem can be reduced to MWBG
by duplicating each processor and all of its incident edges F times. Each set
of the bipartite graph then has P×F vertices. After the optimal solution is
obtained, the solutions for all F copies of a processor are combined to form
a one-to-F mapping between the processors and the partitions. The optimal
solution for the TotalV metric and the corresponding processor assignment of
an example similarity matrix is shown in Fig. 8(a).
The fastest MWBG algorithm [13] can compute a matching in O(|V|^2 log |V| +
|V||E|) time, or in O(|V|^{1/2}|E| log(|V|C)) time if all edge costs are integers of
absolute value at most C [15]. We have implemented the optimal algorithm
with a runtime of O(|V|^3). Since M is generally dense, |E| ≈ |V|^2, implying
that we should not see a dramatic performance gain from a faster implemen-
tation.
Fig. 8. Various cost metrics of a similarity matrix M for P = 4 and F = 1 using
(a) the optimal MWBG, (b) the optimal BMCM, (c) the optimal DBMCM, and
(d) our heuristic algorithms:
(a) TotalV moved = 525, MaxV moved = 275, MaxSR moved = 485;
(b) TotalV moved = 640, MaxV moved = 245, MaxSR moved = 475;
(c) TotalV moved = 570, MaxV moved = 255, MaxSR moved = 465;
(d) TotalV moved = 550, MaxV moved = 260, MaxSR moved = 470.
3.2.2 MaxV Metric
The metric MaxV, unlike TotalV, considers data redistribution in terms of
solving a load imbalance problem, where it is more important to minimize the
workload of the most heavily-weighted processor than to minimize the sum of
all the loads. During the process of remapping, each processor must pack and
unpack send and receive buffers, incur remote-memory latency time, and per-
form the computational overhead of rebuilding internal and shared data struc-
tures. By minimizing max(α × max(ElemsSent), β × max(ElemsRecd)), where
α and β are machine-specific parameters, MaxV attempts to reduce the total
remapping time by minimizing the execution time of the most heavily-loaded
processor. We can solve this optimally by considering the problem of find-
ing a maximum-cardinality matching whose maximum edge cost is minimum.
We refer to this as the bottleneck maximum cardinality matching (BMCM)
problem.
To find the BMCM of the graph G corresponding to the similarity matrix, we
first need to transform M into a new matrix M'. Each entry M'_{ij} represents
the maximum cost of migrating data between processor i and partition j:
\[ M'_{ij} = \max\Bigl( \alpha \sum_{y=1,\, y \neq j}^{P} M_{iy},\;\; \beta \sum_{x=1,\, x \neq i}^{P} M_{xj} \Bigr). \]
Optimally solving the BMCM problem is NP-complete for F > 1. For F = 2,
it is NP-complete by reduction from numerical matching with target sums; for
F > 2, it is NP-complete by reduction from 3-partition. We have implemented
the BMCM algorithm in [3] for F = 1 which combines a maximum cardinality
matching algorithm with a binary search, and runs in O(|V|^{1/2}|E| log |V|). The
fastest known BMCM algorithm [14] has a runtime of O((|V| log |V|)^{1/2} |E|).
The new processor assignment for the similarity matrix in Fig. 8 using this approach with α = β = 1 is shown in Fig. 8(b). Notice that the total number of
elements moved in Fig. 8(b) is larger than the corresponding value in Fig. 8(a);
however, the maximum number of elements moved is smaller.
3.2.3 MaxSR Metric
Our third metric, MaxSR, is similar to MaxV in the sense that the overhead
of the bottleneck processor is minimized during the remapping phase. MaxSR
differs, however, in that it minimizes the sum of the heaviest data flow from
any processor and to any processor, expressed as (α × max(ElemsSent) +
β × max(ElemsRecd)). We refer to this as the double bottleneck maximum
cardinality matching (DBMCM) problem. The MaxSR formulation allows us
to capture the computational overhead of packing and unpacking data, when
these two phases are separated by a barrier synchronization. Additionally,
the MaxSR metric may also approximate the many-to-many communication
pattern of our remapping phase. Since a processor can either be sending or
receiving data, the overhead of these two phases should be modeled as a sum
of costs.
We have developed an algorithm for computing the minimum MaxSR of the
graph G corresponding to our similarity matrix. We first transform M to a new
matrix M''. Each entry M''_{ij} contains a pair of values {S_{ij}, R_{ij}} corresponding
to the total cost of sending and receiving data when partition j is mapped to
processor i:
\[ \Bigl\{\, S_{ij} = \alpha \sum_{y=1,\, y \neq j}^{P} M_{iy}, \qquad R_{ij} = \beta \sum_{x=1,\, x \neq i}^{P} M_{xj} \,\Bigr\}. \]
Optimally solving the MaxSR metric is NP-complete for F > 1, since the
underlying BMCM problem is also NP-complete.
Let σ_1, σ_2, ..., σ_k be the distinct S_{ij} values appearing in M'', sorted in increas-
ing order. Thus, σ_i < σ_{i+1} and k ≤ P^2. Form the bipartite graph G_i = (V, E_i),
where V consists of processor vertices u = 1, 2, ..., P and partition vertices
v = 1, 2, ..., P, and E_i contains edge (u, v) if S_{uv} ≤ σ_i; furthermore, edge (u, v)
has weight R_{uv} if it is in E_i.

For small values of i, graph G_i may not have a perfect matching. Let i_min be
the smallest index such that G_{i_min} has a perfect matching. Obviously, G_i has
a perfect matching for all i ≥ i_min. Solving the BMCM problem of G_i gives a
matching that minimizes the maximum R_{ij} edge weight. It gives a matching
with MaxSR value at most σ_i + MaxV(G_i). Defining
\[ \mathrm{MaxSR}(i) = \min_{i_{\min} \le j \le i} \bigl( \sigma_j + \mathrm{MaxV}(G_j) \bigr), \]
it is easy to see that MaxSR(k) equals the correct value of MaxSR. Thus, our
algorithm computes MaxSR by solving k BMCM problems on the graphs G_i
and computing the minimum value MaxSR(k). However, we can prematurely
terminate the algorithm if there exists an i_max such that σ_{i_max+1} ≥ MaxSR(i_max),
since it is then guaranteed that the MaxSR solution is MaxSR(i_max).
Our implementation has a runtime of O(|V|^{1/2}|E|^2 log |V|) since the BMCM
algorithm is called |E| times in the worst case; however, it can be decreased
to O(|E|^2). The following is a sketch of this more efficient implementation.
Suppose we have constructed a matching μ that solves the BMCM problem
of G_i for i ≥ i_min. We solve the BMCM problem of G_{i+1} as follows. Initialize
a working graph G to be G_{i+1} with all edges of weight greater than MaxV(G_i)
deleted. Take the matching μ on G, and delete all unmatched edges of weight
MaxV(G_i). Choose an edge (u, v) of maximum weight in μ, remove it from μ
and G, and search for an augmenting path from u to v in G. If no such path
exists, we know that MaxV(G_i) = MaxV(G_{i+1}). If an augmenting path is found,
repeat this procedure by choosing a new edge (u', v') of maximum weight in
the matching and searching for an augmenting path. After some repetitions of
this procedure, the maximum weight of a matched edge will have decreased to
the desired value MaxV(G_{i+1}). At this point our algorithm to solve the BMCM
problem of G_{i+1} will stop, since no augmenting path will be found.
This algorithm is of complexity O(|E|^2) since each search for an augmenting
path uses O(|E|) time and there are O(|E|) such searches. A successful search
for an augmenting path for edge (u, v) permanently eliminates it from all
future graphs, so there are at most |E| successful searches. Furthermore, there
are at most |E| unsuccessful searches, one for each value of i.
The new processor assignment for the similarity matrix in Fig. 8 using the
DBMCM algorithm with α = β = 1 is shown in Fig. 8(c). Notice that the
MaxSR solution is minimized; however, the number of TotalV elements moved
is larger than the corresponding value in Fig. 8(a), and more MaxV elements
are moved than in Fig. 8(b). Also note that the optimal similarity matrix
solution for MaxSR is provably no more than twice that of MaxV.
3.2.4 Heuristic Algorithm
We have developed a heuristic greedy algorithm that gives a suboptimal so-
lution to the TotalV metric in O(|E|) steps [23]. All partitions are initially
flagged as unassigned and each processor has a counter set to F that indicates
the remaining number of partitions it needs. The non-zero entries of the simi-
larity matrix M are then sorted in descending order. Starting from the largest
entry, partitions are assigned to processors that have less than F partitions
until done. If necessary, the zero entries in M are also used. It has been proven
that a processor assignment obtained using the heuristic algorithm can never
result in a data movement cost that is more than twice that of the optimal
TotalV assignment [23]. In addition, experimental results in § 4.3 demonstrate
that our heuristic quickly finds high quality solutions for all three metrics. Ap-
plying this heuristic algorithm to the similarity matrix in Fig. 8 generates the
new processor assignment shown in Fig. 8(d).
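A sketch of this greedy heuristic is given below; the container types and names are illustrative, and the similarity matrix is assumed to have P rows and P × F columns:

```cpp
#include <algorithm>
#include <vector>

// Greedy heuristic: sort the entries of the similarity matrix in descending
// order and assign each still-unassigned partition to the processor holding
// the largest share of it, subject to at most F partitions per processor.
// M has P rows (processors) and P*F columns (new partitions).
struct Entry { long weight; int proc, part; };

std::vector<int> greedyAssign(const std::vector<std::vector<long> > &M, int F) {
    const int P = (int)M.size();
    const int N = (int)M[0].size();                  // number of new partitions
    std::vector<Entry> entries;
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < N; ++j)
            entries.push_back({M[i][j], i, j});      // zero entries included
    std::sort(entries.begin(), entries.end(),
              [](const Entry &a, const Entry &b) { return a.weight > b.weight; });

    std::vector<int> owner(N, -1);                   // partition -> processor
    std::vector<int> remaining(P, F);                // partitions left per processor
    for (const Entry &e : entries)
        if (owner[e.part] < 0 && remaining[e.proc] > 0) {
            owner[e.part] = e.proc;
            --remaining[e.proc];
        }
    return owner;
}
```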
3.3 Remapping Cost Model
After the new partitions are reassigned to the processors, a model is required
to predict the redistribution cost for a given machine. Accurately estimating
this time is difficult because of the number and complexity of the costs in-
volved in the remapping procedure. The total remapping cost includes the
computational overhead for rebuilding internal data structures and updating
sha.red boundary information. The communication overhead is architecture-
dependent and complicated because of the many-to-many collective commu-
nication pattern used by the remapper.
Our redistribution algorithm first removes the data objects moving out of a
partition and places them in a buffer. A collective communication then ap-
propriately distributes the data to their final destination, where they are in-
tegrated into the data structures. Finally, the partition boundary information
is consistently updated. This remapping strategy closely follows the superstep
model of BSP [31].
The expected redistribution time on bandwidth-rich systems is then given by:
\[ \gamma \times \mathrm{MaxSR} + O, \]
where MaxSR = max(ElemsSent) + max(ElemsRecd), γ is the total compu-
tation and communication cost to process each redistributed element, and O
is the predicted sum of all constant overheads [23]. This formulation demon-
strates the need to model and minimize the MaxSR metric when performing
processor reassignment. To compute the values of γ and O, a simple least
squares fit through several data points for various redistribution patterns and
their corresponding runtimes can be used. This procedure needs to be per-
formed only once for each architecture, and the values of γ and O can then
be used in actual computations to estimate the redistribution cost.
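The calibration step amounts to an ordinary least-squares fit of a straight line, as sketched below; the names are illustrative and the measured (MaxSR, runtime) pairs are assumed to come from the benchmark redistribution patterns mentioned above:

```cpp
#include <cstddef>
#include <vector>

// One-time calibration: fit gamma and O in  time = gamma * MaxSR + O  by an
// ordinary least-squares line through measured (MaxSR, runtime) pairs taken
// from several benchmark redistribution patterns.
struct RemapModel { double gamma, O; };

RemapModel fitRemapModel(const std::vector<double> &maxSR,
                         const std::vector<double> &time) {
    const std::size_t n = maxSR.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t k = 0; k < n; ++k) {
        sx  += maxSR[k];
        sy  += time[k];
        sxx += maxSR[k] * maxSR[k];
        sxy += maxSR[k] * time[k];
    }
    double gamma = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double O     = (sy - gamma * sx) / n;
    return { gamma, O };
}

// Predicted cost of a candidate remapping:
//   double estimate = model.gamma * predictedMaxSR + model.O;
```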
4 Experimental Results
The parallel 3D_TAG mesh adaptation procedure and the PLUM global load
balancing strategy have been implemented in C and C++, with the parallel
activities in MPI for portability. All experiments were performed on the wide-
node SP2 at NASA Ames, the Origin2000 at NCSA, and the T3E at NASA
Goddard, without any machine-specific optimizations.
Our computational mesh is the one used to simulate the acoustics experiment
of Purcell [24] where a 1/7th-scale model of a UH-1H helicopter rotor blade
was tested over a range of Mach numbers. Detailed numerical results of the
simulation are given elsewhere [30]. This paper reports only on the perfor-
mance of parallel 3D_TAG and PLUM.
Performance results are presented for one refinement and one coarsening step
using various edge-marking strategies. Six strategies are used for the refine-
ment step. The first set of experiments, denoted as RAND_1R, RAND_2R,
and RAND_3R, consists of randomly bisecting 5%, 33%, and 60% of the edges
in the mesh, respectively. The second set, denoted as REAL_1R, REAL_2R,
and REAL_3R, consists of bisecting the same numbers of edges using an error
indicator [30] derived from the actual solution. These strategies represent sig-
nificantly different scenarios. In practice, mesh adaptation tends to be local.
The RAND cases are included as they are expected to behave somewhat ide-
ally because the computational loads are automatically balanced. Thus, the
RAND results should give an indirect indication of how well parallel 3D_TAG
can really perform without explicit load balancing.
Since the coarsening procedure and performance are similar to the refine-
ment method, only two cases are presented where 7% of the edges in the
refined meshes obtained with the RAND_2R and the REAL_2R strategies
are respectively coarsened randomly (RAND_2C) or based on actual solution
(REAL_2C). Table 1 presents the progression of grid sizes through the two
adaptation steps for each edge-marking strategy.
4.1 Refinement Phase
Table 2 presents the computation times and parallel speedup for the refinement
step with the random marking of edges (strategies RAND_1R, RAND_2R,
and RAND_3R). Note that the speedup values are calculated based on the
total time. Performance is excellent with efficiencies of more than 83% on 32
processors and 76% on 64 processors for the RAND_3R case. Parallel mesh
refinement shows a markedly better performance for RAND_3R due to its
bigger computation-to-communication ratio. In general, the total speedup will
Table 1
Grid sizes for the different refinement and coarsening strategies

                 Vertices   Elements     Edges   Bdy Faces
Initial mesh       13,967     60,968    78,343       6,818
RAND_1R            18,274     82,417   104,526       7,672
REAL_1R            17,880     82,489   104,209       7,682
RAND_2R            39,829    201,734   246,949      10,774
REAL_2R            39,332    201,780   247,115      12,008
RAND_3R            60,916    320,919   389,686      15,704
REAL_3R            61,161    321,841   391,233      16,464
RAND_2C            21,756    100,537   126,448       8,312
REAL_2C            20,998    100,124   125,261       8,280
improve as the size of the refined mesh increases. This is because the mesh
adaptation time will increase while the percentage of elements along processor
boundaries will decrease.
Table 2
Performance of mesh refinement when edges are bisected randomly

               RAND_1R            RAND_2R            RAND_3R
      Shared   Comp     Total     Comp     Total     Comp     Total
  P   Edges    Time     Speedup   Time     Speedup   Time     Speedup
  1    0.0%    7.044     1.00    26.904     1.00    45.015     1.00
  2    1.9%    3.837     1.84    13.878     1.94    22.762     1.98
  4    3.7%    2.025     3.48     7.605     3.54    11.569     3.89
  8    6.6%    1.068     6.58     4.042     6.65     5.913     7.61
 16    8.8%    0.587    11.86     2.293    11.67     3.191    14.07
 32   11.6%    0.330    20.72     1.338    19.78     1.678    26.62
 64   15.3%    0.191    32.92     0.711    35.82     0.896    48.66
The communication time is less than 3% of the total time for up to 32 proces-
sors for all three cases. On 64 processors, the communication time, although
still quite small, is more than 12% of the computation time for RAND_1R.
This is because each of the 64 partitions contains less than 1,000 elements
with more than 15% of the edges on partition boundaries. Since additional
work and storage are necessary for shared edges, the speedup deteriorates
as the percentage of such edges increases. The situation is much better for
RAND_3R since the computation time is significantly higher.
Table 3 shows the computation times and speedup when edges are marked using a solution-based error indicator. Performance is extremely poor, especially for REAL_1R and REAL_2R, with speedups of only 9.2X and 19.2X on 64 processors, respectively. This is because mesh adaptation for practical problems occurs in a localized region, causing an almost worst case load-balance behavior. Elements are targeted for refinement on only a small subset of the available processors. Most of the processors remain idle since none of their assigned elements need to be refined. Performance is somewhat better for the REAL_3R strategy because the refinement region is much larger. Since 60% of all edges are bisected in this case, most of the processors are busy doing useful work. This is reflected by an efficiency of more than 56% on 64 processors.

Table 3
Performance of mesh refinement when edges are bisected based on actual solution
             REAL_1R            REAL_2R            REAL_3R
      Comp     Total     Comp     Total     Comp     Total
  P   Time     Speedup   Time     Speedup   Time     Speedup
  1   5.902     1.00    23.780     1.00    41.702     1.00
  2   3.979     1.48    18.117     1.31    26.317     1.58
  4   2.530     2.33     9.173     2.59    14.266     2.92
  8   1.589     3.71     7.091     3.35     8.430     4.95
 16   1.311     4.48     4.046     5.87     4.363     9.55
 32   0.879     6.65     2.277    10.40     2.278    18.25
 64   0.616     9.22     1.224    19.16     1.148    35.95
Note that the communication times constitute a much smaller fraction of the
total time compared to the cases when edges are bisected randomly. This is
due to the difference in the distribution of bisected edges. The RAND cases
require significantly more communication among processors at the partition
boundaries because refinement is scattered all over the problem domain. The
REAL cases, on the other hand, require much less communication since the
refined regions are localized and mostly contained within partitions.
Poor parallel performance of the mesh refinement code for the three REAL
strategies is due to severe load imbalance. It is therefore worthwhile trying to
load balance this phase of 3D_TAG as much as possible. This can be achieved
within PLUM by splitting the mesh refinement step into two distinct phases of
edge marking and mesh subdivision. After edges are marked for bisection, it is
possible to exactly predict the new refined mesh before actually performing the
subdivision phase. This is because elements are independently refined based
on their binary patterns. The mesh is repartitioned if the edge markings are
skewed beyond a specified tolerance. All necessary data is then appropriately
redistributed and the mesh elements are refined in their destination processors.
This enables the subdivision phase to perform in a more load-balanced fashion.
As a bonus, a smaller volume of data has to be moved around since remapping
is performed before the mesh grows in size due to refinement.
Using this methodology, the three REAL cases were run again. Table 4 presents
the performance results of this "load balanced" mesh refinement step. Com-
pared to the results in Table 3, the parallel speedups are now much higher. In
fact, the speedups for REAL_2R consistently beat the corresponding speedups
for RAND_2R, while REAL_3R outperforms RAND_3R when more than eight
processors are used. Even though the RAND cases are expected to behave
somewhat ideally, these results show that explicit load balancing can do better.
An efficiency of 82% is attained for REAL_3R on 64 processors, thereby demon-
strating that mesh adaptation can deliver excellent speedups if the marked
edges are well-distributed among the processors. Communication requires a
larger fraction of the total time for this load balanced strategy because the
mesh refinement work is distributed among more processors after load bal-
ancing. However, communication times are still relatively small, requiring less
than 4% of the total time for all runs except for REAL_1R on 64 processors.
Table 4
Performance of "load balanced" mesh refinement

             REAL_1R            REAL_2R            REAL_3R
      Comp     Total     Comp     Total     Comp     Total
  P   Time     Speedup   Time     Speedup   Time     Speedup
  1   5.902     1.00    23.780     1.00    41.702     1.00
  2   3.311     1.78    12.059     1.97    21.592     1.93
  4   1.980     2.98     6.733     3.53    10.975     3.80
  8   1.369     4.30     3.430     6.92     5.678     7.34
 16   0.702     8.34     1.840    12.88     2.899    14.37
 32   0.414    13.89     1.051    22.41     1.484    27.99
 64   0.217    23.89     0.528    43.24     0.777    52.52
The effect of load balancing the refined mesh before performing the actual
subdivision can be seen more directly from the results presented in Table 5
for RAND_3R and REAL_3R. The quality of load balance is defined as the
ratio of the number of elements on the most heavily-loaded processor to the
optimal number of elements per processor. For the RAND_3R strategy, the
mesh was refined without any load balancing. Two different sets of results
are presented for REAL_3R: one without load balancing (NLB) and the other
using the technique of load balanced mesh refinement (LB). Notice that the
quality of load balance before refinement is excellent, and identical, for both
Table 5
Quality of load balance before and after mesh refinement

        RAND_3R          NLB REAL_3R        LB REAL_3R
  P   Before   After    Before   After     Before   After
1 1.000 1.000 1.000 1.000 1.000 1.000
2 1.000 1.016 1.000 1.556 1.406 1.000
4 1.000 1.033 1.000 2.188 1.948 1.000
8 1.000 1.085 1.000 6.347 2.654 1.000
16 1.000 1.167 1.000 5.591 4.025 1.000
32 1.001 1.226 1.001 7.987 4.212 1.000
64 1.005 1.506 1.005 8.034 6.709 1.004
RAND_3R and NLB REAL_3R because the initial mesh is partitioned using
ParMETIS. However, after mesh refinement, the load imbalance is severe, par-
ticularly for NLB REAL_3R. The load imbalance is not too bad for RAND_3R
since edges are randomly marked for refinement. This is reflected by the dif-
ference in the speedup values in Tables 2 and 3. For LB REAL_3R, the initial
mesh is repartitioned after edge marking is complete. This imbalances the load
before refinement, but generates partitions that are excellently balanced after
subdivision is complete. It also improves the speedup values significantly.
4.2 Coarsening Phase
The coarsening phase consists of three major steps: marking edges to coarsen,
cleaning up all the data structures by removing the coarsened edges and their
associated vertices and tetrahedral elements, and finally invoking the refine-
ment routine to generate a valid mesh from the remaining vertices.
Timings and parallel speedup for the RAND_2C and the REAL_2C coarsening
strategies are presented in Table 6. The follow-up mesh refinement times are
not included because the goal was to demonstrate the parallel performance of
only the modules that are required during the coarsening phase. The compu-
tation time in Table 6 is the time required to mark edges for coarsening. The
communication time is negligible and not shown, but it was included when
calculating the speedup values. The cleanup time, on the other hand, is al-
ways a significant fraction of the total time. The cleanup time decreases as
more and more processors are used due to the reduction in the local mesh
size for each individual partition; however, since it depends on the fraction of
shared objects, performance deteriorates as the problem size is over-saturated
by processors.
Table 6
Performance of mesh coarsening

            RAND_2C                      REAL_2C
   P    Comp    Cleanup   Total      Comp    Cleanup   Total
        Time     Time    Speedup     Time     Time    Speedup
   1    3.619    2.364     1.00      3.989    2.246     1.00
   2    1.832    1.352     1.88      2.026    1.283     1.88
   4    0.963    0.782     3.42      1.066    0.854     3.25
   8    0.572    0.498     5.57      0.600    0.498     5.68
  16    0.303    0.287    10.01      0.334    0.279    10.17
  32    0.170    0.170    16.95      0.167    0.161    19.01
  64    0.070    0.098    31.17      0.093    0.097    32.82
For instance, even though the total efficiency is about 50% on 64 processors
for the results in Table 6, the efficiency when considering only the cleanup
times is barely 37%.
4.3 Comparison of Reassignment Algorithms
Table 7 presents a comparison of our five different processor reassignment al-
gorithms in terms of the reassignment time (in secs) and the amount of data
movement. Results are shown for the REAL_2R strategy on the SP2 with
F = 1. The ParMETIS [19] case does not require any explicit processor reas-
signment since we choose the default partition-to-processor mapping given by
the partitioner. The poor performance is expected since ParMETIS is a global
partitioner that does not attempt to minimize the remapping overhead. A de-
tailed performance comparison of ParMETIS with other partitioners within
the PLUM framework is given in [6].
The execution times of the other four algorithms increase with the number of
processors because of the growth in the size of the similarity matrix; however,
the time for the heuristic algorithm on 64 processors is still very small. The
TotalV, MaxV, and MaxSR metrics are obviously minimized by the MWBG,
BMCM, and DBMCM algorithms, respectively. All the algorithms almost
match the minimum MaxV value given by BMCM. The extremely local re-
finement in our test case requires the migration of a large number of elements
to achieve load balance, causing any reasonable reassignment algorithm to
return a similar MaxV solution.
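For concreteness, the sketch below evaluates the three metrics for a given
mapping of new partitions to processors. It is our own illustration, not PLUM
code, and it assumes definitions along the lines of those given earlier in the
paper: the similarity-matrix entry S[i, j] is taken to be the amount of new
partition j's data already resident on processor i, TotalV the total volume of
data moved, MaxV the largest volume sent or received by any one processor,
and MaxSR the largest sum of data sent and received by any one processor.

```python
import numpy as np

def remapping_metrics(S, assignment):
    """Evaluate TotalV, MaxV and MaxSR for a processor reassignment.

    S[i, j]       -- data of new partition j already residing on processor i
    assignment[j] -- processor that new partition j is mapped to
    """
    P = S.shape[0]
    sent = np.zeros(P, dtype=int)
    recv = np.zeros(P, dtype=int)
    for j, p in enumerate(assignment):
        for i in range(P):
            if i != p:
                sent[i] += S[i, j]   # processor i ships its share of partition j
                recv[p] += S[i, j]   # destination processor p receives it
    total_v = int(sent.sum())                   # == recv.sum()
    max_v = int(np.maximum(sent, recv).max())   # bottleneck send-or-receive
    max_sr = int((sent + recv).max())           # bottleneck send-plus-receive
    return total_v, max_v, max_sr

# Toy example with 3 processors: the identity mapping moves the off-diagonal data.
S = np.array([[40, 10,  0],
              [ 5, 30, 10],
              [ 0, 15, 50]])
print(remapping_metrics(S, [0, 1, 2]))   # -> (40, 25, 40)
```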
The DBMCM algorithm optimally reduces MaxSR, but achieves no more than
a 5% improvement over the other algorithms. Nonetheless, since we believe
Table 7
Comparison of reassignment algorithms for REAL_2R on the SP2 with F = 1

                          P = 32                              P = 64
             TotalV   MaxV   MaxSR   Reass.      TotalV   MaxV   MaxSR   Reass.
Algorithm    Metric  Metric  Metric   Time       Metric  Metric  Metric   Time
ParMETIS     58,297   5,067   7,467   0.000      67,439   2,667   4,452   0.000
MWBG         34,738   4,410   5,822   0.018      38,059   2,261   3,142   0.065
BMCM         49,611   4,410   5,944   0.032      52,837   2,261   3,282   0.133
DBMCM        50,270   4,414   5,733   0.092      54,896   2,261   3,121   1.252
Heuristic    35,032   4,410   5,809   0.002      38,283   2,261   3,123   0.009
that the MaxSR metric can closely approximate the remapping cost on many
architectures, computing its optimal solution can provide useful information.
Notice that the minimum TotalV increases slightly as P grows from 32 to 64,
while MaxSR is dramatically reduced by over 45%. This trend continues as the
number of processors increases, and indicates that PLUM will remain viable
on a large number of processors, since the per processor workload decreases
as P increases.
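Reading the optimal values off Table 7: the minimum TotalV (given by MWBG)
grows from 34,738 to 38,059, roughly a 10% increase, while the minimum
MaxSR (given by DBMCM) falls from 5,733 to 3,121, a reduction of
(5,733 - 3,121)/5,733, or about 46%.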
Finally, observe that the heuristic algorithm does an excellent job in minimiz-
ing all three cost metrics in a trivial amount of time. Although theoretical
bounds have only been established for the TotalV metric, empirical evidence
indicates that the heuristic algorithm closely approximates both MaxV and
MaxSR. Similar results were obtained for the other edge-marking strategies.
Our heuristic algorithm has now been incorporated into ParMETIS [26]. This
feature gives users the option of global repartitioning while minimizing the
remapping overhead.
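To make the flavor of such a reassignment heuristic concrete, here is a minimal
greedy sketch of our own; it is not necessarily the exact heuristic described
earlier or the version incorporated into ParMETIS. It visits similarity-matrix
entries in decreasing order and keeps the largest amounts of data in place:

```python
def greedy_reassignment(S):
    """Greedy sketch of a similarity-matrix-based reassignment heuristic.

    S[i][j] is the data of new partition j already residing on processor i.
    Entries are visited in decreasing order; each partition is mapped to the
    first free processor it is paired with, which keeps large entries 'in
    place' and hence keeps the migration volume small.
    """
    P = len(S)
    entries = sorted(((S[i][j], i, j) for i in range(P) for j in range(P)),
                     reverse=True)
    assignment, used = [None] * P, set()
    for _, i, j in entries:
        if assignment[j] is None and i not in used:
            assignment[j] = i
            used.add(i)
    return assignment

S = [[40, 10, 0], [5, 30, 10], [0, 15, 50]]
print(greedy_reassignment(S))   # -> [0, 1, 2] for this example
```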
4.4 Portability Analysis
The three left plots in Fig. 9 illustrate parallel speedup for the three edge-
marking strategies on the SP2, Origin2000, and T3E. Two sets of results are
presented for each machine: one when data remapping is performed after mesh
refinement, and the other when remapping is performed before refinement. The
REAL_3R case shows the best speedup values because it is the most computa-
tion intensive. Remapping data before refinement has the largest relative effect
for REAL_1R, because it has the smallest refinement region and predictively
load balancing the refined mesh returns the biggest benefit. The best results
are for REAL_3R with remapping before refinement, showing an efficiency
greater than 87% on 32 processors.
Fig. 9. Refinement speedup (left) and remapping time (right) within PLUM on the
SP2, Origin2000, and T3E, when data is redistributed either after or before mesh
refinement.
To compare the performance on the three target machines more critically, one
needs to look at the actual times rather than the speedup values. Table 8 shows
how the execution time (in secs) is spent during the refinement and subsequent
load balancing phases for the REAL_2R case when data is remapped before
the subdivision phase. Notice that the T3E adaptation times are consistently
more than 1.4 times faster than the Origin2000 and three times faster than the
SP2. One reason for this performance difference is the disparity in the clock
speeds of the three machines. Another reason is that the mesh adaptation code
does not use the floating-point units on the SP2, thereby adversely affecting
its overall performance.
Table 8
Anatomy of execution times for REAL_2R on the Origin2000, SP2, and T3E

           Adaptation Time          Remapping Time         Partitioning Time
   P     O2000   SP2    T3E       O2000   SP2    T3E       O2000   SP2    T3E
   2     5.261  12.06   3.455     3.005  3.440  2.648      0.628  0.815  0.701
   4     2.880   6.734  1.956     3.005  3.440  1.501      0.584  0.537  0.477
   8     1.470   3.434  1.034     2.963  3.321  1.449      0.522  0.424  0.359
  16     0.794   1.846  0.568     2.346  2.173  0.880      0.396  0.377  0.301
  32     0.458   1.061  0.333     0.491  1.338  0.592      0.389  0.429  0.302
  64       --    0.550  0.188       --   0.890  0.778        --   0.574  0.425
 128       --      --   0.121       --     --   1.894        --     --   0.599
The three right plots in Fig. 9 show the remapping time for each of the three
cases on the SP2, Origin2000, and T3E. In almost every case, a significant
reduction in remapping time is consistently achieved when the adapted mesh
is load balanced by performing data movement prior to refinement. This is
because the mesh grows in size only after the data has been redistributed.
The remapping times also usually decrease as the number of processors is
increased because more processors are available to share the increase in the
total volume of data movement. The remapping times when data is moved
before mesh refinement are reproduced for the REAL_2R case in Table 8 since
the exact values are difficult to read off the log-scale.
A peculiarity of these results is the behavior of the T3E when P >= 64. When
using up to 32 processors, the T3E closely follows the redistribution cost model
presented earlier; however, for 64 and 128 processors, the remapping
overhead begins to increase even though the MaxSR metric continues to de-
crease. The runtime difference when data is remapped before and after re-
finement is dramatically diminished; in fact, all the remapping times begin
to converge to a single value! This indicates that the remapping time is no
longer affected only by the volume of data redistributed but also by the in-
terprocessor communication pattern. One way of potentially improving these
results is to take advantage of the T3E's ability to efficiently perform one-sided
communication.
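For illustration only, the sketch below shows how a redistribution step could
use one-sided puts (here expressed with MPI RMA via mpi4py, a modern
analogue of the T3E's native one-sided operations) instead of matched
send/receive pairs. This is a hypothetical example, not the PLUM
implementation; the buffer capacity and migration list are placeholders that
the partitioner would normally supply.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
CAPACITY = 100_000                                   # assumed receive-buffer size

# Each processor exposes its receive buffer as an RMA window.
recv_buf = np.zeros(CAPACITY, dtype='i')
win = MPI.Win.Create(recv_buf, disp_unit=recv_buf.itemsize, comm=comm)

# (dest_rank, element_offset, int32 array) tuples; left empty in this sketch.
migrations = []

win.Fence()                                          # open the access epoch
for dest, offset, data in migrations:
    # Write our elements directly into the destination's window; the target
    # never posts a matching receive, which is the point of one-sided RMA.
    win.Put(data, dest, target=offset)
win.Fence()                                          # close the epoch; data is visible
win.Free()
```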
Another surprising result is the dramatic reduction in remapping times when
using 32 processors on the Origin2000. This is probably because network con-
tention with other jobs is essentially removed when using the entire machine.
When using up to 16 processors, the remapping times on the SP2 and the
Origin2000 are comparable, while the T3E is about twice as fast. Recall that
the remapping phase within PLUM consists of both communication (to phys-
ically move data around) and computation (to rebuild the internal and shared
data structures). Since the results in Table 8 indicate that computation is
faster on the Origin2000, it is reasonable to infer that bulk communication is
faster on the SP2. These results generally demonstrate that our methodology
within PLUM is effective in significantly reducing the data remapping time
and improving the parallel performance of mesh refinement.
Table 8 also presents the ParMETIS partitioning times for REAL_2R on all
three systems; the results for REAL_1R and REAL_3R are almost identical be-
cause the time to repartition mostly depends on the initial problem size. There
is, however, some dependence on the number of processors used. When there
are too few processors, repartitioning takes more time because each processor
has a bigger share of the total work. When there are too many processors,
an increase in the communication cost slows down the repartitioner. Table 8
demonstrates that ParMETIS is fast enough to be effectively used within our
framework, and that PLUM can be successfully ported to different platforms
without any code modifications.
5 Conclusions
Dynamic mesh adaptation on unstructured grids is a powerful tool for solving
problems that require local grid modifications to efficiently resolve physical
features of interest. For such problems, the coarsening/refinement step must
be performed frequently, so its efficiency must be comparable to that of the
numerical solver. Furthermore, with the ubiquity of parallel computing, it is
imperative to have efficient parallel implementations of adaptive unstructured-
grid algorithms. Unfortunately, parallel local mesh adaptation requires dy-
namic load balancing. In this paper, we described the parallel implementation
of the 3D_TAG unstructured mesh adaptation algorithm and verified the ef-
fectiveness of our PLUM load balancer for a helicopter rotor blade acoustics
problem.
Six refinement and two coarsening cases were presented with varying fractions
of a realistic-sized domain being targeted for refinement. We demonstrated
excellent parallel performance when repartitioning and remapping the mesh
in a load balanced fashion after edges were targeted for refinement but be-
fore performing the actual subdivision. We presented three generic metrics to
model the remapping cost on most multiprocessor systems. Optimal solutions
for these metrics, as well as a heuristic approach were implemented. It was
shown that our heuristic algorithm quickly finds a solution that satisfies all
three metrics. Additionally, we showed that the data redistribution overhead
can be significantly reduced by applying our heuristic processor reassignment
algorithm to the default mapping given by the global partitioner. Portability
was demonstrated by presenting results on the three vastly different archi-
tectures of the SP2, Origin2000, and T3E, without the need for any code
modifications. Overall, the results showed that our parallel mesh adaptation
and dynamic load balancing strategies will remain viable on large numbers of
processors.
References
[1] M.J. Berger and J.S. Saltzman, AMR on the CM-2, Appl. Numer. Math. 14
(1994) 239-253.
[2] K.S. Bey, J.T. Oden, and A. Patra, A parallel hp-adaptive discontinuous
Galerkin method for hyperbolic conservation laws, Appl. Numer. Math. 20
(1996) 321-336.
[3] K. Bhat, An O(n^2.5 log2 n) time algorithm for the bottleneck assignment
problems (AT&T Bell Laboratories, Murray Hill, NJ, 1984).
[4] R. Biswas and L. Dagum, Parallel implementation of an adaptive scheme for 3D
unstructured grids on a shared-memory multiprocessor, in: A. Ecer, J. Periaux,
N. Satofuka, and S. Taylor, eds., Parallel Computational Fluid Dynamics:
Implementations and Results Using Parallel Computers (Elsevier, Amsterdam,
The Netherlands, 1996) 489-496.
[5] R. Biswas, K.D. Devine, and J.E. Flaherty, Parallel, adaptive finite element
methods for conservation laws, Appl. Numer. Math. 14 (1994) 255-283.
[6] R. Biswas and L. Oliker, Experiments with repartitioning and load balancing
adaptive meshes, Numerical Aerospace Simulation Branch Tech. Rep. NAS-97-
021, NASA Ames Research Center, Moffett Field, CA, 1997.
[7] R. Biswas and R.C. Strawn, A new procedure for dynamic adaption of three-
dimensional unstructured grids, Appl. Numer. Math. 13 (1994) 437-452.
[8] R. Biswas and R.C. Strawn, Tetrahedral and hexahedral mesh adaptation for
CFD problems, Appl. Numer. Math. 26 (1998) 135-151.
[9] J.G. Castanos and J.E. Savage, The dynamic adaptation of parallel mesh-
based computation, Proceedings 8th SIAM Conference on Parallel Processing
for Scientific Computing (SIAM, Minneapolis, MN, 1997).
[10] N. Chrisochoides, Multithreaded model for the dynamic load-balancing of
parallel adaptive PDE computations, Appl. Numer. Math. 20 (1996) 349-365.
[11] H.L. de Cougny, K.D. Devine, J.E. Flaherty, R.M. Loy, C. Ozturan, and M.S.
Shephard, Load balancing for the parallel adaptive solution of partial differential
equations, Appl. Numer. Math. 16 (1994) 157-182.
[12] P. Diniz, S. Plimpton, B. Hendrickson, and R. Leland, Parallel algorithms for
dynamically partitioning unstructured grids, Proceedings 7th SIAM Conference
on Parallel Processing for Scientific Computing (SIAM, San Francisco, CA,
1995) 615-620.
[13] M.L. Fredman and R.E. Tarjan, Fibonacci heaps and their uses in improved
network optimization algorithms, J. ACM 34 (1987) 596-615.
[14] H.N. Gabow and R.E. Tarjan, Algorithms for two bottleneck optimization
problems, J. Alg. 9 (1988) 411-417.
[15] H.N. Gabow and R.E. Tarjan, Faster scaling algorithms for network problems,
SIAM J. Comput. 18 (1989) 1013-1036.
[16] J. Galtier, Automatic partitioning techniques for solving partial differential
equations on irregular adaptive meshes, Proceedings 10th ACM International
Conference on Supercomputing (ACM, Philadelphia, PA, 1996) 157-164.
[17] M.T. Jones and P.E. Plassmann, Parallel algorithms for adaptive mesh
refinement, SIAM J. Sci. Comput. 18 (1997) 686-708.
[18] Y. Kallinderis and A. Vidwans, Generic parallel adaptive-grid Navier-Stokes
algorithm, AIAA J. 32 (1994) 54-61.
[19] G. Karypis and V. Kumar, Parallel multilevel k-way partitioning scheme
for irregular graphs, Department of Computer Science Tech. Rep. 96-036,
University of Minnesota, Minneapolis, MN, 1996.
[20] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for
partitioning irregular graphs, SIAM J. Sci. Comput. 20 (1998) 359-392.
[21] T. Minyard and Y. Kallinderis, A parallel Navier-Stokes method and grid
adapter with hybrid prismatic/tetrahedral grids, Proceedings 33rd AIAA
Aerospace Sciences Meeting (AIAA, Reno, NV, 1995) Paper 95-0222.
[22] T. Minyard, Y. Kallinderis, and K. Schulz, Parallel load balancing for dynamic
execution environments, Proceedings 34th AIAA Aerospace Sciences Meeting
(AIAA, Reno, NV, 1996) Paper 96-0295.
[23] L. Oliker and R. Biswas, PLUM: Parallel load balancing for adaptive
unstructured meshes, J. Parallel Distrib. Comput. 52 (1998) 150-177.
[24] T.W. Purcell, CFD and transonic helicopter sound, 14th European Rotorcraft
Forum, Milan, Italy, 1988, Paper 2.
[25] J.J. Quirk, A parallel adaptive grid algorithm for computational shock
hydrodynamics, Appl. Numer. Math. 20 (1996) 427-453.
[26] K. Schloegel, G. Karypis, V. Kumar, R. Biswas, and L. Oliker, A
performance study of diffusive vs. remapped load-balancing schemes, Proceedings
11th International Conference on Parallel and Distributed Computing Systems
(ISCA, Chicago, IL, 1998) 59-66.
[27] P.M. Selwood, N.A. Verhoeven, J.M. Nash, M. Berzins, N.P. Weatherill, P.M.
Dew, and K. Morgan, Parallel mesh generation and adaptivity: partitioning and
analysis, in: P. Schiano, A. Ecer, J. Periaux, and N. Satofuka, eds., Parallel
Computational Fluid Dynamics: Algorithms and Results Using Advanced
Computers (Elsevier, Amsterdam, The Netherlands, 1997) 166-173.
[28] M.S. Shephard, J.E. Flaherty, H.L. de Cougny, C. Ozturan, C.L. Bottasso, and
M.W. Beall, Parallel automated adaptive procedures for unstructured meshes,
Parallel Computing in CFD AGARD-R-807 (1995) 6.1-6.49.
[29] H.D. Simon, A. Sohn, and R. Biswas, HARP: A dynamic spectral partitioner,
J. Parallel Distrib. Comput. 50 (1998) 83-103.
[30] R.C. Strawn, R. Biswas, and M. Garceau, Unstructured adaptive mesh
computations of rotorcraft high-speed impulsive noise, J. Aircraft 32 (1995)
754-760.
[31] L. Valiant, A bridging model for parallel computation, Comm. ACM 33 (1990)
103-111.
[32] R. Van Driessche and D. Roose, Load balancing computational fluid dynamics
calculations on unstructured grids, Parallel Computing in CFD AGARD-R-807
(1995) 2.1-2.26.
[33] A. Vidwans, Y. Kallinderis, and V. Venkatakrishnan, Parallel dynamic load-
balancing algorithm for three-dimensional adaptive unstructured grids, AIAA
J. 32 (1994) 497-505.
[34] C. Walshaw, M. Cross, and M.G. Everett, Parallel dynamic graph partitioning
for adaptive unstructured meshes, J. Parallel Distrib. Comput. 47 (1997) 102-108.