

AIAA-93-3312-CP

MEDUSA - AN OVERSET GRID FLOW SOLVER FOR NETWORK-BASED PARALLEL COMPUTER SYSTEMS

Merritt H. Smith†, NASA Ames Research Center, Moffett Field, California

Jani M. Pallis, University of California, Davis, California

Abstract

Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallel computer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallel computers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallel computer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.

Introduction

Unlike large mainframe computers (e.g. Cray), which are often saturated with jobs, most personal workstations spend most of their time idle. Assuming that the average engineering workstation is used for forty hours every week, there remain 128 hours during which the machine is unused entirely. Further, very few applications are particularly taxing on the CPU (Central Processing Unit). Editing, writing code, and composing presentations are generally very light on computation. More intensive applications, such as visualization and grid generation, usually result in bursts of computation followed by long idle periods during which the engineer attempts to digest the information displayed, and then decides what to do or look at next.

The idle time on a workstation represents a significant computational resource. Normally a workstation is sized (processor speed and memory) such that computationally intensive operations, like grid generation, may be performed interactively. In practice, this requires 2 to 10 MFLOPS (Million FLoating-point Operations Per Second) processing speed and 64 MB (Mega-Bytes) of memory. If twenty percent utilization of the CPU is realized during normal working hours (a very generous estimate), then a total of 8 CPU hours is actually consumed in a normal week, leaving 160 hours idle. For typical workstation performance, this equates to four Cray-YMP CPU hours per week per workstation.
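
A minimal Python sketch of this arithmetic; the workstation-to-Cray ratio is simply the one implied by the four Cray-YMP hours quoted above, not an independent measurement.

# Back-of-the-envelope idle-time arithmetic from the Introduction.
HOURS_PER_WEEK = 7 * 24                          # 168 hours
working_hours = 40
cpu_utilization = 0.20                           # "a very generous estimate"

off_hours = HOURS_PER_WEEK - working_hours       # 128 h: machine unused entirely
cpu_hours = cpu_utilization * working_hours      # 8 CPU-hours actually consumed
idle_hours = HOURS_PER_WEEK - cpu_hours          # 160 h of idle CPU time

workstation_to_cray = 4.0 / idle_hours           # ratio implied by "four Cray-YMP CPU hours"
print(off_hours, cpu_hours, idle_hours, workstation_to_cray)   # 128 8.0 160.0 0.025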

Conversely, the lack of supercomputer resources has long been an obstacle to the use of advanced Computational Fluid Dynamics (CFD) in the design process. Typical solutions to the three-dimensional Reynolds-Averaged Navier-Stokes equations for complete aircraft require from tens to hundreds of hours of single-processor Cray-YMP time [1,2]. On a heavily loaded system, a computation may take months to complete. By the time a solution is generated, a less accurate but quicker method may have been used, or a key element of the design may have changed, making the solution obsolete.

In a financial sense, the annual maintenance costs of a vector supercomputer are often sufficient to purchase the equivalent computing resource in the form of workstation CPUs. The obvious question, and the subject of this paper, is: How may the otherwise lost workstation CPU cycles be applied efficiently to the design process?

Heterogeneous Parallel Systems

Most engineering companies and research institutions do not keep their computers isolated from each other. Instead, they use networks to allow the transfer of data between machines. This allows more than one worker to access the same data, or one worker to use more than one computer. In either case, an increase in worker efficiency is expected.

If one compares the inter-processor connection diagram of a typical MIMD parallel computer with the network diagram of a moderately large engineering department, certain similarities will be seen (figs. 1 and 2). Paths for data transfer between multiple processors clearly exist in both. The principal differences between a network of workstations and a MIMD parallel computer are in the orientation of the operating system, the physical distance between the processors, and the speed of the interconnecting network.

† Research Scientist, Member AIAA

Copyright © 1993 by the American Institute of Aeronautics and Astronautics, Inc. No copyright is asserted in the United States under Title 17, U.S. Code. The U.S. Government has a royalty-free license to exercise all rights under the copyright claimed herein for Government purposes. All other rights are reserved by the copyright owner.


On a network of workstations, each machine has its own operating system whose primary design goal is multi-user stand-alone operation. Any communications software must be overlaid on the basic operating system. In contrast, the operating system on each node of a MIMD computer has traditionally been of limited scope and oriented toward single-user computation and inter-processor communication.

The physical distance between processors has relatively little consequence for the use of workstations in parallel computing, but the speed of the communication network does limit the number and size of the messages passed between processors. Many fine-grain parallel solvers designed for distributed memory parallel computers (e.g. MIMD Parallel OVERFLOW [3]) send a large number of small messages between processors during critical phases of the computation. The rate of data transfer needed by such applications to keep all processors busy computing is not (yet) available between workstations.

Additional hurdles must be cleared before a network of workstations can be employed as a parallel computer. Different computer vendors usually use different processing units. This results in incompatibilities between executable files and differences in the binary representation of numbers. A useful communication package must be able to distinguish between the computers of different vendors and translate data formats accordingly. The problem of differing executables should probably be addressed by the user. It is most easily solved by compiling the source code on each type of machine in the system, and storing the executable on the local disk.

Computational Method

Parallel Decomposition

The chimera [4] overset grid method was originally developed to simplify the spatial discretization of complex geometries using structured grids. Chimera has been used in the simulation of a wide variety of fluid flows, from the Space Shuttle in ascent [5], to the Harrier in ground effect [1], to the Penn State artificial heart [6]. Individual components or regions of a problem are gridded independently and then combined in a general fashion, requiring only that all regions of the computational domain be covered by one or more grids. Inter-grid communication is achieved on grid boundaries using trilinear interpolation.

For a typical transport aircraft configuration, grids for the fuselage, wing, nacelles, and tail surfaces would be generated separately. If the grid for the fuselage extends to the edges of the computational domain, the remaining grids might be embedded within the fuselage grid. Grid points which fall within solid bodies are "blanked out" so that no solution is computed, and nearby active points are updated by interpolation from underlying grids. In nearly all implementations, chimera has been applied in a sequential manner wherein each grid is brought into memory, boundary conditions are applied, the flow solution is advanced and then written out to an external device. Data for all overset boundaries are maintained in memory and are updated following solution advancement in each grid.

The data structure of multiple grids, whether block structured or overset, contains a natural parallelism. Each grid is essentially autonomous, except at chimera or block boundaries where communication with other grids in the problem is required. The update-communicate nature of the solution process suggests two simple parallel strategies [7].

Regular Crowd: A number of identical processes are started on the workstations, one for each grid. Each process advances the local flow solution and communicates with the other processes as dictated by the overset grid structure. If each process waits for all local boundary points to be updated before advancing the solution, process synchronization will result.

Manager-Worker: This strategy is quite similar to the first, except that an additional management process is added to the regular crowd. Instead of communicating directly with each other, the worker processes communicate through the manager, which handles the grid boundary database and the synchronization.

As is often true with human managers, the contribution of the manager process to the final goal is not easily quantifiable. Indeed, by routing all communication through the manager, a bottleneck is created which can contribute to the idle time of the workers. One advantage of the manager-worker strategy is the simplification of handling and directing data. In a regular crowd strategy it is quite possible that each worker will need to communicate with all others in the problem. For N workers this will result in N(N-1)/2 two-way message paths (assuming, possibly erroneously, that workers do not talk to themselves) and potentially many small messages. Inserting the manager in this process reduces the number of message paths to N and allows consolidation of the smaller messages. This is particularly important in network-based computations where message latencies are generally large.
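
A few lines of Python make the path-count comparison concrete:

# Two-way message paths for the two control strategies.
# Regular crowd: every pair of workers may need a direct path, N(N-1)/2.
# Manager-worker: each worker talks only to the manager, N paths.

def regular_crowd_paths(n_workers: int) -> int:
    return n_workers * (n_workers - 1) // 2

def manager_worker_paths(n_workers: int) -> int:
    return n_workers

for n in (4, 9, 16, 32):
    print(n, regular_crowd_paths(n), manager_worker_paths(n))
# For a nine-grid problem: 36 direct paths versus 9 through the manager.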


The manager-worker strategy also allows modularity, providing a convenient link for multi-disciplinary analysis. For example, an engine performance model may be added to the transport aircraft simulation discussed above. It would be the role of the manager to facilitate communication between the engine performance model and a boundary region of the nacelle grid, exchanging the necessary information so that the interaction between the engine and airframe will be accurately simulated. The worker process solving on the nacelle grid will receive information from the engine model through the manager, but will be unaware of the source of that information, and so will remain indistinguishable from the rest of the flow solution workers.

A heterogeneous parallel computer system need not be limited to a single class of processors. If vector or parallel resources are available, they can be used to advantage. It is rare for an algorithm to run optimally on both parallel and vector computers without major modification. Input data may also be significantly different in both content and format. Thus, if the heterogeneous parallel system is extended to include one or both architectures, the manager-worker strategy is again preferable. The uniformity of the worker processes will be lost; there will now be several different types of workers, but existing workers will remain unchanged and only the manager process needs to be modified to communicate with the new workers.

MEDUSA

The goal of this effort is not the development of new solution algorithms specifically tailored to parallel systems, but the investigation of the issues unique to heterogeneous parallel computing. In this light, an existing flow solver has been modified to operate in the heterogeneous parallel environment. The tangible result of this effort is the MEDUSA flow solver. Named after the mythical beast with snakes in place of hair, MEDUSA uses the manager-worker strategy for the execution of parallel tasks. It currently consists of two separate but interdependent programs, HEAD and SNAKE, which fulfill the roles of manager and worker, respectively. A third program called MIRROR monitors all SNAKE processes and notifies HEAD if any have stopped prematurely. MEDUSA is derived from the OVERFLOW [8] solver, which implements the partially flux-split Steger-Warming (F3D) [9] and the diagonalized Beam-Warming (ARC3D) [10] implicit algorithms.

The network communication software used in this research is the Parallel Virtual Machine (PVM) [11] package developed jointly at Emory University and the Oak Ridge National Laboratory. PVM exists in two parts. The first is a user-callable library of routines for the initiation and termination of remote processes, communication, and process synchronization. The second part is a daemon (pvmd) which runs on all machines in the heterogeneous system. The daemon is responsible for establishing and maintaining the links between machines, translating data formats, and spawning and terminating processes. Starting the daemon on one machine will start the daemon on all machines required for the given computation.

For a typical computation the user will start pvmd as a background process with a list of client machines as input. Once the daemon has been started on all listed machines, the user will start HEAD. HEAD's first function is to determine the dimensions of all grids in the computation, and allocate local temporary memory for grids and solutions sufficient to hold, for transfer, up to the largest grid. HEAD then uses the PVM library routines to start SNAKE processes on the clients, with one process for each grid in the computation. The allocation of grids to processors is governed by a static load balancing algorithm which is described in detail in the next section.

HEAD will then send the grid dimensions to the SNAKEs so that they may allocate local memory for the grid, solution, and metrics. Dynamic memory allocation is necessary for distributed computations of this type. It is a feature of OVERFLOW which made it attractive as a starting point. For large multiple-grid systems, it is impractical to compile a copy of the SNAKE executable with array parameters sized to fit each individual grid. At the opposite extreme, it is a waste of resources to compile only a single executable sized for the largest grid, but used by all.

For each grid in the system, HEAD reads the grid coordinates, an initial or restart solution, the chimera interpolation data and boundary indices, and the flow solution parameters. These data are sent to the appropriate SNAKE process. Memory space is allocated in HEAD to hold the interpolated solution data for the chimera boundaries of all the grids in the system. The first chimera interpolation into this boundary data array is made at this time. Later interpolations will be made by the SNAKE processes and sent back to HEAD for redistribution. After the final grid is assigned, HEAD waits for all SNAKE routines to complete the metrics computations and check in with a status report.

Each step of the solution process begins with HEAD sending a flag and the correct segment of the boundary data array to each SNAKE process in the system. The flag, if positive, indicates the number of iterations between chimera boundary updates for each grid. Each SNAKE applies the boundary data at the points indicated by the chimera boundary indices received at initialization. Following the solution update, each SNAKE interpolates the new solution to the boundaries of the other overset grids. These updated boundary data are sent to HEAD for inclusion in the boundary data array. Negative values of the flag indicate that HEAD desires SNAKE to return the current solution and/or terminate execution.
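
In outline, the worker side of this exchange behaves as in the following Python sketch, where every callable is a hypothetical stand-in for the PVM messaging and the OVERFLOW-derived solver routines; this is a conceptual outline, not the MEDUSA source.

# Conceptual outline of a SNAKE worker's main loop.
def snake_loop(grid, recv_from_head, send_to_head,
               apply_boundary, advance_solution, interpolate_out, extract_solution):
    while True:
        flag, boundary_segment = recv_from_head()         # blocking receive from HEAD
        if flag < 0:                                      # negative flag: return solution, stop
            send_to_head(("solution", extract_solution(grid)))
            break
        apply_boundary(grid, boundary_segment)            # at indices received at initialization
        for _ in range(flag):                             # flag = iterations between updates
            advance_solution(grid)
        send_to_head(("chimera", interpolate_out(grid)))  # data for other grids' boundaries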

The implementation of MEDUSA makes communication a serial bottleneck. In an effort to minimize this bottleneck, HEAD was designed with a "first come, first served" approach to handling the inbound chimera data at the end of an iteration cycle. Imperfections in the load balancing algorithm cause variations in the execution times for each processor. Information from grids which finish early is handled immediately, regardless of order. This allows productive use of some of the idle time left by the load imbalance, and reduces total idle time. For messages outgoing from HEAD, all processors will be idle until they receive their chimera boundary data. Improvement could be found by instituting a "last in, first out" policy, starting first the grids which took longest to compute. Again this would use some of the idle time incurred by the load imbalance for communication. Measurements to determine this improvement are ongoing.
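
The manager side, sketched with the same kind of hypothetical stand-ins rather than actual PVM calls, handles inbound chimera data in arrival order and could order its outbound messages last-in, first-out:

# Conceptual outline of one HEAD iteration cycle.
def head_step(grids, boundary_db, iters_per_update,
              send_to_snake, recv_from_any_snake, last_finish_order):
    # Outbound: flag plus each grid's boundary segment, slowest grids first.
    for g in reversed(last_finish_order):
        send_to_snake(g, iters_per_update, boundary_db[g])

    # Inbound: "first come, first served", regardless of grid order.
    finish_order = []
    for _ in grids:
        g, interpolated = recv_from_any_snake()       # whichever SNAKE finishes next
        for target, data in interpolated.items():     # data destined for other grids
            boundary_db[target].update(data)
        finish_order.append(g)
    return finish_order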

Load Balancing

In order to use a parallel computer system efficiently, it is necessary to keep all available processors busy doing useful work. To achieve this with an existing serial algorithm, it is usually necessary to make drastic modifications in the core routines to fit the parallel architecture. A second approach is to use an entirely different numerical algorithm which may be more slowly convergent, but is more suitable to parallel computation. Neither approach was taken in the current work. Instead, the serial algorithm, with a modification to the boundary update scheme, is applied to subdomains of the computational space which are spread among the processors of the distributed system.

This almost trivially parallel approach brings with it the problem of static load balancing. A static load balancing algorithm is required to minimize the total real time needed to complete the solution on the given grid system and allocated resources. If the total time cannot be changed, the algorithm should maximize the efficiency with which computational resources are used. For example, the sizes of the grids in a chimera grid system can vary by an order of magnitude or more. On a system of identical workstations and one grid per processor, the solution can be advanced only as fast as the largest grid if process synchronization is desired. This will leave a substantial amount of idle time as the other processors wait for the largest grid to finish. There is little a load balancing algorithm can do to speed up this process. But it may be possible to group multiple grids on individual processors, thereby reducing the total number of processors used and improving efficiency. If fewer processors are allocated than there are grids in the system, the load balancing algorithm should come up with the combination of grids and processors which uses the least amount of real time for solution.

The load balancing algorithm used in MEDUSA is the heuristic algorithm described by Crutchfield [12]. Paraphrasing from Crutchfield, the process proceeds as follows:

 1. Sort grids by size
 2. For each grid, starting with the largest
 3.    Assign the grid to the processor with the most excess capacity
 4. End for
 5. Find the processor with the least excess capacity
 6. For each grid i on this processor
 7.    For each grid j not on this processor
 8.       If interchanging i and j decreases the total excess capacity
 9.          Perform the interchange
11.       End if
12.    End for
13. End for

Excess capacity is defined as the product of processor speed and the idle time of the processor. Total excess capacity is simply the sum of the excess capacity across all processors.
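
A Python sketch of steps 1-13, using this definition of excess capacity; the tie-breaking rule and the evaluation of idle time against the most loaded processor's completion time are assumptions of the sketch, not details taken from Crutchfield.

# Sketch of the static load balancing heuristic.  Grid sizes are in points,
# processor speeds in points/second, and excess capacity is speed x idle time.

def busy_times(assign, speeds):
    """assign[p] is the list of grid sizes currently placed on processor p."""
    return [sum(g) / s for g, s in zip(assign, speeds)]

def total_excess(assign, speeds):
    busy = busy_times(assign, speeds)
    t = max(busy)                      # completion time of the slowest processor
    # This equals t * sum(speeds) - total points, so minimizing total excess
    # capacity is the same as minimizing the overall completion time.
    return sum(s * (t - b) for s, b in zip(speeds, busy))

def balance(grid_sizes, speeds):
    assign = [[] for _ in speeds]
    # Steps 1-4: largest grid first, onto the processor with the most excess
    # capacity (ties broken toward the faster processor).
    for g in sorted(grid_sizes, reverse=True):
        busy = busy_times(assign, speeds)
        t = max(busy)
        p = max(range(len(speeds)),
                key=lambda i: (speeds[i] * (t - busy[i]), speeds[i]))
        assign[p].append(g)
    # Steps 5-13: pairwise interchanges off the most heavily loaded processor.
    busy = busy_times(assign, speeds)
    q = busy.index(max(busy))
    for i in range(len(assign[q])):
        for p in range(len(speeds)):
            if p == q:
                continue
            for j in range(len(assign[p])):
                trial = [list(a) for a in assign]
                trial[q][i], trial[p][j] = assign[p][j], assign[q][i]
                if total_excess(trial, speeds) < total_excess(assign, speeds):
                    assign = trial     # perform the interchange
    return assign

# e.g. the nine wing grids of Table 1 on a Crimson, an R4000 Indigo, and one
# 4D/320VGX processor (speeds from Table 2, in points/second):
# balance([68742, 69768, 63650, 64600, 77175, 59508, 61560, 53382, 55104],
#         [4648, 2443, 2207])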

The algorithm described above makes no provision for an extremely slow processor. If one processor is so slow that, even with the smallest grid, it produces idle time on the rest of the processors, it may be possible to eliminate that processor and achieve an overall improvement in real computing time. The Crutchfield algorithm also does not address the problem of efficient resource usage. Therefore the following instructions have been added:

14. Find the processor with the least excess capacity
15. If this processor has only one grid
16.    If eliminating the slowest processor results in the same or better execution time
17.       Eliminate the slowest processor
20. End if

Item 16 above implies that the entire Crutchfield algorithm may be rerun to achieve a new load balance for the new processor configuration.

This modified algorithm still does not guarantee the best possible load balance. Experience has shown that it performs well enough, given a reasonable variety of grid sizes and processor speeds, that load balance penalties are usually surpassed or masked by the costs of communication.

Results

The flow about a portion of the AV-8B Harrier wing has been computed on a number of workstation combinations using an overset grid system. The system consists of nine overset grids varying in size from 53,382 grid points to 77,175 (Table 1). An example MEDUSA solution about this wing is shown in fig. 3. Freestream conditions for this flow are a Mach number of 0.5, a Reynolds number of 100,000, and an angle of attack of 15 degrees.

All solutions were run on the NAS workstation cluster, composed of 42 Silicon Graphics Incorporated (SGI) workstations. Machines in this system vary in performance from an Indigo based on the 33 MHz MIPS R3000 processor to a Crimson based on the 50 MHz MIPS R4000. Several multiple-processor machines are included. Table 2 is a summary of the performance of each type of processor in the system, running a single-grid computation with MEDUSA. All performance data are for the core flow solver routines and do not include times for input, output, initial grid and solution transfer, metrics computation, and the return of the updated solution. These data have been used to determine the maximum performance of a combination of workstations using the MEDUSA flow solver. For instance, a system which includes a Crimson, an Indigo with the R4000 chip, and one processor from a 4D/320VGX would have a maximum performance of 4,648 + 2,443 + 2,207 = 9,298 points per second.
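
Equivalently, the potential of a configuration is taken as the sum of the single-processor speeds measured in Table 2; a minimal sketch:

# Aggregate "CPU potential" of a workstation combination, from Table 2 speeds
# (points per second, single processor).
TABLE2_SPEED = {
    "Indigo R3000": 1467, "4D/35TG": 2052, "4D/380S": 2068, "4D/320VGX": 2207,
    "Indigo R4000": 2443, "4D/440VGX": 2626, "4D/420VGX": 2809, "Crimson": 4648,
}

def cpu_potential(processors):
    return sum(TABLE2_SPEED[p] for p in processors)

print(cpu_potential(["Crimson", "Indigo R4000", "4D/320VGX"]))   # 9298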

On a multi-processor machine, each process runs only on a single CPU. The other processors will impact the speed of solution only if there is more than one grid located on the machine. Within the load balancing routine, each processor of a multi-processor machine is treated as a separate resource. This treatment can produce some unexpected performance results and will be discussed shortly.

Figure 4 is a plot of the actual speed of the MEDUSA flow solver as a function of the combined potential of the workstations used in the solution. Again, all performance data are for the core flow solver routines. The diagonal line across the plot represents linear speedup of the solution with increasing potential of the system.

The coarse-grained parallelism of the algorithm limits the amount of useful performance which may be extracted from the system. Grids are assumed to be indivisible, so the best combination of grids and processors, as defined by the load balancing algorithm, still contains some idle time. The curve marked "possible" in fig. 4 shows the maximum possible performance for the combinations of grids and processors tested. The space between this curve and the diagonal represents the idle time due to the load imbalance.

Inter-processor communication is a limiting factor to the performance of the distributed parallel system. A way of viewing the problem of communication overhead is to look at the ratio of communication surface to computational volume (S/V). This ratio compares the number of grid points which must have their values updated from other grids to the number of points which are updated directly by the flow solver. High S/V indicates a relatively large amount of communication and thus a high overhead. One simple way to effectively halve S/V for a particular grid system is to update the boundary points only every other iteration. The benefits of reducing communication in this way may be seen in fig. 4. Performance figures for the solver for one iteration per update (triangles) and two iterations per update (squares) have been obtained for identical grid and workstation configurations. The minimum performance improvement seen in the cases run so far is 3.2%, with a maximum of 7.0%.
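
Updating the chimera boundaries only every k iterations amounts to dividing S/V by k, as in this sketch; the point counts used are placeholders, not the AV-8B grid system.

# Communication surface-to-volume ratio for a grid, and its effective value
# when chimera boundaries are updated only every k solver iterations.
def surface_to_volume(boundary_points, interior_points):
    return boundary_points / interior_points

def effective_sv(boundary_points, interior_points, iters_per_update):
    # Boundary data are exchanged once per update cycle, while the solver
    # performs iters_per_update sweeps of the interior in the same cycle.
    return surface_to_volume(boundary_points, interior_points) / iters_per_update

print(effective_sv(6_000, 60_000, 1))   # 0.1   update every iteration
print(effective_sv(6_000, 60_000, 2))   # 0.05  every other iteration: S/V halved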

The best performance achieved with MEDUSA has come from nine SGI processors (1 Crimson, 1 4D/420VGX, 1 4D/440VGX, 1 4D/320VGX) running MEDUSA with two iterations between chimera updates. This performance, represented by the highest square on fig. 4, is equivalent to 50.5 MFLOPS, which is one-third of the best speed of the original serial algorithm on a single Cray-YMP processor.

The method of taking multiple iterations between chimera updates could be taken to extremes, resulting in nearly zero overhead, but the solution would not converge. There is probably an optimum scheme for communication which would result in minimum clock time to convergence. One possibility might involve frequent communication initially, while the startup transients leave the computational domain. This might be followed by less frequent communication while the flow pattern develops, followed by frequent communication while flow features are localized. It is likely that an optimal method will be problem specific. Further research in this area is needed.

It is not entirely clear from fig. 4 whether load balancing or communication is causing the greater loss in efficiency. Figure 5 shows the same data as fig. 4, but as a percentage of the maximum achievable. It appears, for one iteration per chimera update, that the relative speed of solution is insensitive to variations in the load balance. When the amount of communication is decreased by half, as for two iterations per update, the load balance becomes more critical. This may be clearly seen at configuration potentials of about 10,000 points/second and 19,000 points/second, where the actual speed of solution has nearly achieved the maximum possible. It is likely that the "first come, first served" approach to receiving chimera data in HEAD is allowing most communication to occur during the idle time incurred by the load imbalance, as was hoped. The small gap between the achieved and possible performance probably represents the outgoing communication from HEAD and incoming communication from the longest-running processor.

The observant reader may have noticed that MEDUSA seems to have achieved impossible performance for the two iterations per chimera update case at the smallest potential tested. This is the unexpected performance result mentioned above. Figure 6 graphically describes the distribution of grids on the processors of the system, in this case one Crimson and two processors of a 4D/320VGX. The load balancing routine treats each processor of a multi-processor machine as a separate machine. As the 4D/320VGX shares common memory space between its processors, this assumption is not entirely valid. It is impossible to dictate which processor of the multi-processor machine will solve any given grid.

In this instance, five grids were allocated to the processors of the 4D/320VGX, all approximately the same size. For the single iteration per chimera update problem, the load balancing algorithm gives the correct possible efficiency of 83.2%, with one of the processors of the 4D/320VGX determining the total running time (heavy dashed line in the figure). For the two iterations per update case the load balancing algorithm gives identical information, but during the run, something unexpected occurs. When the processor solving two grids completes two iterations on each, the other processor solving three grids still has two which have not been completed. As a shared memory machine, the operating system of the 4D/320VGX may simply split the remaining work between the two processors. The result is an unexpected improvement of one grid-iteration equivalent in performance.

If the processors of the 4D/320VGX are treated as a single processor with the combined speed of the two actual processors, an efficiency of 95.3% is predicted, which is more in line with the speed measured. This would suggest that further improvements are needed in the load balancing algorithm to correctly handle multi-processor shared-memory machines.
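
With hypothetical numbers, five equal grids on a two-CPU shared-memory node illustrate why the pooled estimate is the more optimistic one, which is the direction of the 83.2% versus 95.3% difference quoted above:

# Five roughly equal grids on a two-CPU shared-memory node (made-up numbers).
# Scheduled as two separate processors the node finishes with the 3-grid CPU;
# if the operating system lets both CPUs share the leftover work, the node
# behaves more like a single processor of twice the speed.
grid_work = [1.0] * 5                  # equal work units per grid (placeholders)
cpu_speed = 1.0                        # work units per unit time, per CPU

per_cpu_time = max(sum(grid_work[:3]), sum(grid_work[3:])) / cpu_speed   # 3.0
pooled_time = sum(grid_work) / (2 * cpu_speed)                           # 2.5

print(per_cpu_time, pooled_time)       # the pooled estimate is the optimistic one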

Conclusions

The similarities between MIMD parallel computers and workstations on a network have been exploited to produce a working flow solver for complex three-dimensional configurations. The MEDUSA flow solver has been built from OVERFLOW with minimal changes in the flow solution algorithm. At the same time, a framework for large-scale multi-disciplinary computations has been created. Through this work it has become apparent that:

1. The common usage of personal workstations is wasteful of valuable computational resources.

2. Network-based parallel computer systems are capable of contributing to the computational resources of an organization while using otherwise wasted workstation CPU time.

3. Manager-worker control decomposition is a useful parallel strategy for the solution of large-scale multi-disciplinary aerospace problems.

4. The parallelism provided by multiple-grid flow solution methods can be used successfully to decompose a problem into parallel units.

5. Load balancing is necessary where there is a large variation in grid sizes and processor speeds, or where there are more grids than processors. Load imbalance is a significant impediment to parallel efficiency.

6. Communication overhead can be a significant impediment to parallel efficiency. Faster networks will only improve efficiency to the point where load imbalance dominates.

7. A sustained computing speed equivalent to one-third of a single dedicated Cray-YMP processor has been achieved on a network-based system of nine workstation processors.

References

[1] Smith, M. H., Chawla, K., and Van Dalsem, W. R., "Numerical Simulation of a Complete STOVL Aircraft in Ground Effect," AIAA 91-3293, September 1991.

[2] Rizk, Y. M. and Gee, K., "Numerical Prediction of the Unsteady Flowfield Around the F-18 Aircraft at Large Incidence," AIAA 91-0020, January 1991.

[3] Ryan, J. S. and Weeratunga, S., "Parallel Computation of Three-Dimensional Navier-Stokes Flowfields for Supersonic Vehicles," AIAA 93-0064, January 1993.

[4] Benek, J. A., Buning, P. G., and Steger, J. L., "A 3-D Chimera Grid Embedding Technique," AIAA 85-1523, July 1985.

[5] Buning, P. G., Chiu, I. T., Obayashi, S., Rizk, Y. M., and Steger, J. L., "Numerical Simulation of the Integrated Space Shuttle Vehicle in Ascent," AIAA 88-4359-CP, August 1988.

[6] Rogers, S. E., Kutler, P., Kwak, D., and Kiris, C., "Numerical Simulation of Flow Through an Artificial Heart," NASA TM 102183, 1989.

[7] Ragsdale, S., ed., Parallel Programming, McGraw-Hill, 1991.

[8] Buning, P. G. and Chan, W. M., "OVERFLOW/F3D User's Manual, Version 1.6," NASA Ames Research Center, March 1991.

[9] Steger, J. L., Ying, S. X., and Schiff, L. B., "A Partially Flux-Split Algorithm for the Numerical Simulation of Compressible Inviscid and Viscous Flow," Workshop on CFD, Institute of Non-Linear Sciences at UCD, 1986.

[10] Pulliam, T. H. and Chaussee, D. S., "A Diagonal Form of an Implicit Approximate-Factorization Algorithm," Journal of Computational Physics, Volume 39, Number 2, February 1981.

[11] Beguelin, A., Dongarra, J., Geist, A., Manchek, R., and Sunderam, V., "A User's Guide to PVM Parallel Virtual Machine," ORNL/TM-11826, March 1992.

[12] Crutchfield, W. Y., "Load Balancing Irregular Algorithms," UCRL-JC-107679, Lawrence Livermore National Laboratory, July 1991.

9-GRID WING GRID SYSTEM

Grid No.   Name         Points    Dimensions
    1      Wing 1A      68,742    67 x 27 x 38
    2      Wing 1B      69,768    68 x 27 x 38
    3      Wing 2A      63,650    67 x 25 x 38
    4      Wing 2B      64,600    68 x 25 x 38
    5      Wingtip      77,175    45 x 49 x 35
    6      Farfield A   59,508    29 x 36 x 57
    7      Nearfield B  61,560    30 x 36 x 57
    8      Farfield B   53,382    41 x 42 x 31
    9      Nearfield A  55,104    41 x 42 x 32

Table 1  Grid dimensions of 9-grid wing system.

SINGLE PROCESSOR SPEED

Machine      Clock Rate   CPU Type           Solution Speed
Indigo       33 MHz       MIPS R3000         1,467 pts./sec.
4D/35TG      36 MHz       MIPS R3000         2,052 pts./sec.
4D/380S      33 MHz       8 x MIPS R3000     2,068 pts./sec.
4D/320VGX    33 MHz       2 x MIPS R3000     2,207 pts./sec.
Indigo       50 MHz       MIPS R4000         2,443 pts./sec.
4D/440VGX    40 MHz       4 x MIPS R3000     2,626 pts./sec.
4D/420VGX    40 MHz       2 x MIPS R3000     2,809 pts./sec.
Crimson      50 MHz       MIPS R4000         4,648 pts./sec.

Table 2  Summary of machine types and speeds in NAS Workstation Cluster.



Fig. 1 Inter-processor connection diagram of a possible MIMD parallel processor.

Fig. 2 Network communication diagram of a typical engineering department.

Fig. 3 MEDUSA flow solution for the AV-8B Harrier wing showing instantaneous streamlines and contours of helicity. M∞ = 0.5, Re = 100,000, α = 15°.


[Fig. 4 plot area: speed of solution versus CPU potential (thousand points/second); curves for linear speedup, possible performance, and the 9-grid runs with 1 and 2 iterations per update.]

Fig. 4 Parallel performance as a function of aggregate workstation performance for the diagonalized Beam-Warming algorithm as implemented in the MEDUSA flow solver.


[Fig. 6 chart area: grid assignments and completion times on the three processors (two CPUs of the 4D/320VGX and the Crimson), marking the end of the first iteration, the expected end of the second iteration, the actual ending, and the unexpected improvement.]

Fig. 6 Graphical representation of the unexpected improvement in efficiency due to work sharing on a shared memory processor.

Fig. 5 Parallel efficiency as a function of aggregate workstation performance for the diagonalized Beam-Warming algorithm as implemented in the MEDUSA flow solver.

