AIAA-93-0058

Implementation of a Parallel Algorithm on a Distributed Network

Manish Deshpande, Jinzhang Feng and Charles L. Merkle
The Pennsylvania State University, University Park, PA

Ashish Deshpande
Yale University, New Haven, CT

31st Aerospace Sciences Meeting & Exhibit
January 11-14, 1993 / Reno, NV

For permission to copy or republish, contact the American Institute of Aeronautics and Astronautics, 370 L'Enfant Promenade, S.W., Washington, D.C. 20024


Implementation of a Parallel Algorithm on a Distributed Network

Manish Deshpande*, Jinzhang Feng† and Charles L. Merkle‡
Propulsion Engineering Research Center
Department of Mechanical Engineering

The Pennsylvania State University

Ashish Deshpande Department of Computer Science

Yale University

Abstract

The objective of this research is to investigate the potential of using a network of concurrently operating workstations for solving large compute-intensive problems typical of Computational Fluid Dynamics. Such problems have a communication structure based primarily on nearest-neighbour communication and are therefore ideally suited to message passing. The implementation of a 3-D Navier-Stokes code on a network of IBM RISC/6000's is described. The performance of this code is compared with that on conventional high-performance machines such as the Cray-YMP and the Intel iPSC/860. The results suggest that a cluster of workstations presents a viable and economical resource for this purpose. Additionally, a workstation network has the advantages of easy accessibility, fault tolerance and a simple environment for program development.


*Student Member, AIAA. †Member, AIAA. ‡Member, AIAA.

Copyright © 1993 by the American Institute of Aeronautics and Astronautics. All Rights Reserved.

Introduction

High performance computing is increasingly in demand in a wide range of fields such as computational fluid dynamics, structural dynamics, weather forecasting and numerous other scientific and engineering applications. With the advances in parallel computer architectures and parallel algorithms, the solution of large and more complex problems with many interacting components has become more feasible and economical [1]. In general, the computations of each component impose unique requirements. For example, in complex CFD simulations, specific computational requirements include such diverse capabilities as vector processing for fluid flow analysis, high-speed scalar computations for Lagrangian simulation of discrete particles or for the computation of the temperature and stress fields in adjacent surfaces, as well as real-time graphics for user interaction. These requirements are often diverse enough to make them difficult or impractical to execute on a single machine.

Most computing environments, however, possess the hardware diversity required for such multi-component applications. A high-speed local network consisting of high-performance scalar and/or vector processors along with attached graphics workstations is normally appropriate for these diverse requirements. Such autonomous, networked computers can be made to work simultaneously on a computational problem, an approach termed distributed computing. Within this loose definition, several software packages are available to support distributed computing [2, 3, 4]. These packages, in general, help users to achieve better computer performance at a lower cost as well as to utilize available resources in an efficient manner.

One such application that seeks to harness this collection of capabilities and utilize it productively is PVM (Parallel Virtual Machine) [2, 5, 6]. This software attempts to provide a unified framework for developing parallel systems in an efficient manner. PVM allows a collection of heterogeneous machines to be viewed as a general-purpose concurrent computational resource, thus allowing the particular requirements of an algorithm to be executed on the most appropriate hardware available and/or using the most appropriate software for the purpose. This paper addresses the use of PVM on a local network of high-function workstations in solving computational fluid dynamics problems.

PVM was selected as the communication software for the present work for several reasons. First, it is free and easily available. Additionally, it is simple, easy to install, understand and run, and it supports heterogeneity well.

In general, PVM is a capable tool for connecting a few workstations together for applications with relatively low communication needs relative to their computational requirements [7]. Other competing distributed computing packages were not tested in the present research.

This paper presents the design and implementation of a general-purpose Navier-Stokes solver on a local network of workstations using PVM as the interface for communication. The specific problem presented here is the computation of the three-dimensional flow through curved rectangular channels. This problem presents a classic test-bed for parallel implementation because it is relatively compute-intensive, yet is representative of traditional fluid dynamic problems. Like most CFD problems, its communication structure is based primarily on nearest-neighbour communication, thus making it ideally suited for message passing. Additionally, this problem has received considerable attention recently in the analysis of the heat transfer through high-aspect-ratio combustor cooling channels

[8, 9]. This application requires the simultaneous solution of the variable-property Reynolds-averaged Navier-Stokes equations and the energy equation to compute the heat transfer through the combustor walls. Such a problem requires a high degree of grid stretching near the walls to adequately resolve both the turbulent velocity profiles and the heat-transfer characteristics, and is therefore beyond the capabilities of individual workstations. However, a concurrently coupled system of workstations provides a single 'computer' capable of handling a problem of such computational and memory intensity.

The performance of the parallel algorithm was tested on a cluster of workstations connected together on a Local Area Network (LAN). Such a network can be an economical, practical and powerful resource for parallel computing. It is fairly commonly available to many users. It provides considerable potential for fault tolerance. If a single workstation goes down, the remaining nodes can continue to function normally. This is often not true of tightly-coupled multiprocessors, where the failure of a single processor can result in the suspension of the entire machine. A network of workstations within a group can be made available for dedicated use with minimal scheduling problems. Additionally, workstations provide the advantage of both stand-alone and concurrent use while offering a friendly and familiar environment to aid in program development.

The performance of this code was measured on a local network of IBM RISC/6000's and compared with its performance on more powerful computers such as the Cray-YMP and the Intel iPSC/860. The purpose of this comparison is to identify the potential of a collection of simpler and more easily accessible machines for calculating highly memory- and computation-intensive problems.

Computational Problem

As an example problem we consider the calculation of the flow through curved rectangular channels. The primary motivation of this implementation is to provide a means for studying the heat transfer characteristics of high-aspect-ratio cooling channels in rocket engine combustors. A general parametric study of this effect has been conducted and reported in [8] for super-critical hydrogen. Such a computation can then be coupled to the heat transfer from the combustor walls and eventually to the combustion process itself, thus presenting a complete model for the analysis of rocket engine combustors.

For simplicity, the parallel algorithm investigated in this paper is restricted to the flow of a laminar, incompressible fluid, although extension to the more complicated problem discussed above is fairly straightforward.

The Navier-Stokes equations in conservative form in generalized coordinates using artificial compressibility [10] are

\[
\Gamma \frac{\partial Q}{\partial \tau}
+ \frac{\partial}{\partial \xi}\left(E - E_v\right)
+ \frac{\partial}{\partial \eta}\left(F - F_v\right)
+ \frac{\partial}{\partial \zeta}\left(G - G_v\right) = 0
\qquad (1)
\]

where

\[
Q = \begin{pmatrix} p \\ u \\ v \\ w \end{pmatrix}, \qquad
\Gamma = \begin{pmatrix}
1/\beta & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]

and E, F, G and E_v, F_v, G_v are the convective and viscous flux vectors in the three transformed directions. Here p is the pressure and u, v and w are the velocities in the coordinate (x, y, z) directions, β is the artificial compressibility parameter, and U, V and W are the contravariant velocities in the transformed (ξ, η, ζ) directions, respectively. ν is the kinematic viscosity of the fluid.

Figure 1: Representative Grid for Computation

Figure 2: Schematic of Decomposed Domain

For the problem of interest, the assumption of an incompressible fluid is not valid. The analysis therefore requires the solution of the variable-property Navier-Stokes equations along with the energy equation and an equation of state relating the density to the temperature and the pressure.

The artificial compressibility parameter and the time derivative of pressure are added to the continuity equation for computational purposes to couple the pressure field to the velocity field, thus enabling the use of a time-marching scheme to update the flow variables in the system. When the solution converges, all the time derivatives approach zero and the steady-state Navier-Stokes solution is recovered.
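Written out in Cartesian coordinates (a form not given explicitly in the paper and shown here only for clarity), the augmented continuity equation is

\[
\frac{1}{\beta}\frac{\partial p}{\partial \tau}
+ \frac{\partial u}{\partial x}
+ \frac{\partial v}{\partial y}
+ \frac{\partial w}{\partial z} = 0 ,
\]

so that as the pseudo-time derivative vanishes at convergence, the divergence-free constraint of the incompressible equations is recovered.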

A four-stage Runge-Kutta scheme is used to advance the solution in time. Central differencing is used to calculate the spatial derivatives. A fourth-order artificial viscosity is added [11] to prevent odd-even splitting in the numerical solution.
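As a point of reference, a four-stage scheme of this type can be written as below, where R(Q) denotes the discrete residual (flux derivatives plus the artificial viscosity). The coefficients shown are the common 1/4, 1/3, 1/2, 1 choice, which is an assumption here since the paper does not list its coefficients.

\[
Q^{(0)} = Q^{n}, \qquad
Q^{(k)} = Q^{(0)} - \alpha_k\,\Delta\tau\,\Gamma^{-1} R\!\left(Q^{(k-1)}\right), \quad k = 1,\dots,4, \qquad
Q^{n+1} = Q^{(4)},
\]

with \(\alpha_k = \tfrac14,\ \tfrac13,\ \tfrac12,\ 1\).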

The boundary conditions on the system are formulated using a Method of Characteristics (MOC) procedure [12]. The velocity profile is specified at the inlet and a uniform pressure at the exit plane. The remaining variables at the inlet and outlet are calculated from the Riemann variables of the system determined from the MOC formulation. On the passage walls the traditional no-slip boundary conditions are applied, augmented by enforcing the normal momentum equation to obtain the wall pressure.

PVM Implementation

A representative grid for the problem is shown in Fig. 1. The domain was divided into strips in the streamwise direction as suggested by the figure. Since the code was developed for implementation over a network of workstations, latency (communication overhead) was treated as the dominant factor. This translates to a minimum number of communication operations (calls to PVM communication routines), even at the cost of communicating more bytes of data. If latency is not a dominant factor, the domain could possibly be decomposed more efficiently to reduce the total communicated data at the cost of more operations.

The decomposition of the domain for a representative grid is illustrated in Fig. 2. Each processor is given an equal portion of the domain, with shadow regions on both sides to accommodate neighbouring node information as shown in Fig. 2. Each processor needs to communicate two blocks (2 × il × jl) to the neighbouring processors. The communication structure of the problem is based on nearest-neighbour communication only and is explicitly coded in the message-passing implementation for this problem. The two planes of communication are needed in the evaluation of the flux derivatives and artificial viscosity terms. The shadow blocks provide local storage on each processor for the necessary boundary information from neighbouring processors. PVM (Version 2.4 was used in this implementation) provides specific routines to accomplish the required communication.
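The exchange pattern can be sketched as below. This is an illustration rather than the authors' code: the paper used PVM 2.4, whose routine names differ from the later PVM 3 C interface shown here, and the function and variable names (exchange_shadow_planes, left_tid, right_tid, plane_size) are invented for the example.

#include <pvm3.h>

/* Illustrative sketch (not the authors' code): exchange the two shadow
 * planes with the left and right neighbours once per stage.  plane_size
 * is the number of values in one plane (il x jl grid points times the
 * four flow variables); a neighbour tid of -1 marks a domain boundary. */
void exchange_shadow_planes(double *send_left,  double *send_right,
                            double *recv_left,  double *recv_right,
                            int plane_size, int left_tid, int right_tid)
{
    int tag_right = 1, tag_left = 2;   /* arbitrary message tags */

    if (left_tid >= 0) {               /* send first interior plane left */
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(send_left, plane_size, 1);
        pvm_send(left_tid, tag_left);
    }
    if (right_tid >= 0) {              /* send last interior plane right */
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(send_right, plane_size, 1);
        pvm_send(right_tid, tag_right);
    }
    if (right_tid >= 0) {              /* receive right neighbour's plane */
        pvm_recv(right_tid, tag_left);
        pvm_upkdouble(recv_right, plane_size, 1);
    }
    if (left_tid >= 0) {               /* receive left neighbour's plane */
        pvm_recv(left_tid, tag_right);
        pvm_upkdouble(recv_left, plane_size, 1);
    }
}

Because PVM sends are buffered by the daemon, each processor can post both sends before blocking on the receives, so the exchange completes with exactly four communication calls per processor per stage.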

Computational performance was measured in terms of the actual elapsed (wall-clock) time between the start and the completion of the process. This is the real-time performance that a user sees when executing the program on a network. Wall-clock times were chosen as the performance monitor over CPU time because measuring the CPU time neglects a significant amount of overhead associated with communication across the network. CPU time neglects the time spent in managing traffic across the network as well as the PVM system's management of the message traffic (through the PVM daemon) on each processor. CPU timings without this communication time would imply that adding more and more workstations would continue to yield better performance. However, on a network the increased communication should eventually result in degradation due to the increased conflicts in memory access and communication paths, or because of natural inefficiencies in the concurrent algorithms. This drop is observed in our performance data as the number of processors is increased.
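The distinction between the two clocks can be illustrated with the sketch below; fake_work() is a stand-in for one solver iteration and is not part of the paper's code.

#include <stdio.h>
#include <time.h>
#include <sys/time.h>

/* Illustrative sketch: why wall-clock time was used.  clock() charges only
 * CPU time to this process, so time spent blocked in a receive or in the
 * PVM daemon does not appear in it; gettimeofday() captures the elapsed
 * time a user actually waits.                                            */
static void fake_work(void)
{
    volatile double s = 0.0;                 /* stands in for one iteration */
    for (long i = 0; i < 20000000L; i++) s += 1.0 / (double)(i + 1);
}

int main(void)
{
    struct timeval w0, w1;
    clock_t c0, c1;

    gettimeofday(&w0, NULL);
    c0 = clock();

    for (int it = 0; it < 10; it++) fake_work();   /* solver loop placeholder */

    c1 = clock();
    gettimeofday(&w1, NULL);

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_usec - w0.tv_usec) / 1e6;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;

    /* wall - cpu approximates time lost to communication and daemon traffic */
    printf("wall = %.2f s   cpu = %.2f s\n", wall, cpu);
    return 0;
}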

Figure 3: Cross-Stream Velocity Profiles on a Section

Figure 4: Speedup Compared with Theoretical Speedup, Serial Optical Communication

Figure 5: Speedup for Various Grid Sizes, Serial Optical Communication, Lightly Loaded Network

All the performance measurements given here are for an average of at least five samples. Most measurements are on a lightly loaded network, which means that the machines were in regular use, but not performing large CPU-intensive computations. Some results for a dedicated network are also included. Comparisons are also presented for Ethernet, Token Ring and optical communications media.

Results

We present here the results of our experiments with the parallel solver on a network of IBM RISC/6000's and compare the performance with implemented versions of the identical code on the Cray-YMP and the Intel iPSC/860. The speedups are calculated based on the execution time of an efficient sequential program for the same problem. The memory requirements of the largest grid exceed the available memory of a single machine, and the sequential time is therefore extrapolated from the times for smaller grids.

The solutions for the flowfield have been presented and discussed in detail in [8], and are only presented here to define the problem of interest. The cross-stream velocity profiles on a section of the flow in the streamwise direction are shown in Fig. 3.

To place our workstation experiments in perspective, we begin by comparing in Fig. 4 the actual speedup obtained on a network of workstations with several theoretical estimates. The computational results were obtained on a network of six IBM 550's running PVM over a serial optical network for a grid size of 605,000 points. The theoretical estimates are defined as follows. The ideal speedup refers


to the maximum that can be achieved, which for n parallel processors is simply n times a single processor. The Minsky conjecture, which seems rather pessimistic, is the lower bound; it scales as log2 n. A more optimistic estimate is the n/ln n scaling known as Amdahl's law [1]. As the figure shows, the performance of our code is near ideal for two processors and better than the n/ln n estimate for more than two processors. This is because in practical CFD algorithms the computation time far outweighs the communication requirement, which reduces the actual overhead in comparison to the overhead assumed in the theoretical estimates.
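The three reference curves in Fig. 4 are easy to tabulate; the short program below (not from the paper) prints them for one to eight processors.

#include <stdio.h>
#include <math.h>

/* Illustrative sketch: the three theoretical speedup estimates plotted in
 * Fig. 4, for n = 1..8 processors:
 *   ideal      S = n
 *   Minsky     S = log2(n)
 *   n / ln(n)  (the estimate the paper attributes to Amdahl [1])          */
int main(void)
{
    printf("%4s %8s %8s %10s\n", "n", "ideal", "Minsky", "n/ln(n)");
    for (int n = 1; n <= 8; n++) {
        double ideal  = (double)n;
        double minsky = log2((double)n);
        double amdahl = (n > 1) ? n / log((double)n) : 1.0;  /* ln(1) = 0 guard */
        printf("%4d %8.2f %8.2f %10.2f\n", n, ideal, minsky, amdahl);
    }
    return 0;
}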

The relative importance of communication requirements to computational requirements depends on the grid size used for the problem. Representative performance results on the network for different grid sizes ranging from 62,500 grid points to 605,000 grid points are shown in Fig. 5. To maintain a constant ratio of the communication block size to the total grid size, the ratio of the number of grid points in each direction (il, jl, kl) was kept nominally the same in all our grids. Even with the fraction of data transferred held constant, the results show that the parallel speedup factor increases as the number of grid points increases. The reason for this is that the computational requirements scale as a higher power of the number of grid nodes than does the communication time. As long as the communication medium is far from its rated capacity, communication times scale approximately linearly with the number of grid points. This scaling, however, becomes highly nonlinear as the communication rate approaches saturation, and communication time begins to dominate the process, as later results demonstrate.

The results in Fig. 5 show that the speedup for six processors is approximately 4.6 for the largest grid tested. This means a direct reduction in execution time by that factor. Additionally, the memory requirements for running this grid far exceed the available single-processor resources, providing another advantage of the distributed computing implementation.

Table 1: Timings for 400 iterations (six IBM RISC/6000 550's vs. Cray-YMP and 64-node Intel iPSC/860)

Grid Size   RISC Wall   Cray CPU   Cray Wall   iPSC (64 node)
 148,500        896        282         643           455
 294,000       1611        590        1780           878

Figure 6: Speedup for grid size = 148,500

Figure 7: Speedup for grid size = 605,000

The above performance measurements were conducted on a network connected together by a serial optical link. In general, a local area network of workstations can be connected using either the serial link, a Token Ring or Ethernet as the communication medium. The performance measurements on the identical workstations using the different underlying communication media are presented in Figs. 6 and 7 for two different grid sizes. Serial optical links are the fastest of the three, whereas Ethernet is the slowest. In addition, the Ethernet is shared by other users, whereas the Token Ring and serial link are dedicated. This relative communication speed is reflected in the performance results.

For all cases the serial link gives the largest speedup, whereas the Ethernet gives the smallest. For the 148,500 grid size, the speedup on the serial link is about 3.25 for four processors and just under four for six processors. By contrast, the speedup obtained on six processors communicating over Ethernet actually decreases relative to that using four processors. This is possibly due to the communication protocol on Ethernet, which allows stations to transmit data as soon as they are ready, unless the transmission channel is busy. As the load on the network increases, this results in significant degradation due to collisions of transmissions.


Figure 8: Performance in MFlops (IBM RISC/6000 550’s)

This peak processing speed was also affected strongly by processing by other users on the computers, as is discussed later. The effect of going to a dedicated, as opposed to a shared, Ethernet connection was not tested.

The performance for the 605,000 grid point case is better for all numbers of processors, regardless of the communication medium used, than it was for the 148,500 grid point case. In addition, the peak in the speed-up ratio that was noted for the smaller grid operating on Ethernet disappears for the 605,000 grid problem, as Fig. 7 shows. Here Ethernet continues to provide a smaller speedup than the other two media, but substantial computational speedup is observed in going from four to six processors. In general, one can expect to find an optimum number of processors for peak computation rate in all applications, regardless of the communication medium. In our studies, we observed such a peak only for the Ethernet connection and never for the Token Ring or the serial link. The peak computation rate on Ethernet varied from four to eight workstations depending on the problem and the background load on the machines. The peak computation rate was always larger than eight machines for the Token Ring and the serial link media.

To give an indication of the absolute speed of the networked workstations, we present in Fig. 8 the performance for two different grids in MFlops. For computing the MFlop rate, the number of operations in the code was estimated. As mentioned before, the runs are performed over a lightly loaded network. For the largest grid size the performance on six processors approaches 75 MFlops. The maximum rated speed for six concurrent 550's is approximately 150 MFlops.
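The MFlop rate follows directly from the estimated operation count; the sketch below illustrates the arithmetic with placeholder values for the operations per grid point and the elapsed time, which are assumptions and not the paper's estimates.

#include <stdio.h>

/* Illustrative sketch (not the paper's numbers): MFlop rate from an
 * estimated operation count.  ops_per_point_iter and wall_seconds are
 * placeholders chosen only to show the calculation.                     */
int main(void)
{
    double grid_points        = 605000.0;
    double iterations         = 400.0;
    double ops_per_point_iter = 900.0;     /* assumed, for illustration only */
    double wall_seconds       = 2900.0;    /* assumed elapsed time           */

    double mflops = grid_points * iterations * ops_per_point_iter
                    / (wall_seconds * 1.0e6);
    printf("estimated rate: %.1f MFlops\n", mflops);
    return 0;
}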

Figure 9: Performance Comparisons for 1-D and 2-D Solvers

A more significant factor than actual CPU time is the wall-clock time required to obtain a given solution. The performance of the six-processor network is compared in Table 1 with the performance on the Cray-YMP and a 64-node Intel iPSC/860. The timings listed are for 400 iterations for the specified grid size. As the table shows, the six-processor timings compare quite favorably with both the Cray and the Intel. In practice, an actual run requires approximately 3000-5000 iterations to converge to the solution. The IBM RISC/6000 and Intel timings can be readily extrapolated to a practical production run. The Cray timings, however, cannot, because in a normal operating environment Cray usage is subject to CPU limits and queue restrictions. The performance on the Intel is presented only for a 64-node cube because the size of the code prohibits execution on a smaller cube.

The timings on the Intel are quite surprising, since one would expect a 64-node cube to perform much better than the measured timings. The reason for this is not entirely apparent, since the code is efficiently vectorized to run on the vector nodes and the only difference between the PVM and the hypercube versions is the calls to the message-passing routines.

To provide a better perspective, Fig. 9 shows the comparison of the performance of similar 1-D and 2-D codes on the local network, the Intel and the Cray. The local network performance was measured on a dedicated network of IBM RISC/6000 320's (9.4 MFlops per workstation). The comparisons in the figure clearly demonstrate the potential of a local network in solving large computational problems normally assumed to require supercomputer resources.

Another interesting measurement is the percentage of execution time spent in communication relative to the total execution time. Figures 10 and 11 show the percentage communication time for the three different media for two grid sizes.


Figure 10: Communication time for grid size = 605,000

Figure 12: Communication time of different grids for the Serial Link

Figure 11: Communication time for grid size = 148,500

As expected, the Ethernet link results in the maximum time spent in communication while the serial link provides the minimum. In fact, for the six-processor network using Ethernet almost half the execution time is spent in message passing between nodes, as compared to less than 25 percent over the serial link. This has a direct impact on the performance, as discussed previously.

Figure 12 compares the communication times over the serial link for two different grid sizes. On a four-processor network the communication time is less than 20 percent for the smaller grid and less than 10 percent for the larger grid. This reiterates the primary application of PVM as a viable means of connecting a few workstations together for solving compute-intensive applications.
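The communication fraction reported in Figs. 10-12 can be obtained by timing the exchange step separately from the whole run, roughly as sketched below; dummy_compute and dummy_exchange are placeholders for the Runge-Kutta stages and the PVM calls and are not taken from the paper.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

/* Illustrative sketch: accumulate the time spent in the exchange step and
 * report it as a fraction of total wall-clock time.                       */
static double seconds_between(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

static void dummy_compute(void)  { usleep(5000); }   /* placeholder: ~5 ms */
static void dummy_exchange(void) { usleep(1000); }   /* placeholder: ~1 ms */

int main(void)
{
    struct timeval run0, run1, c0, c1;
    double comm = 0.0;

    gettimeofday(&run0, NULL);
    for (int it = 0; it < 100; it++) {
        dummy_compute();                   /* flux evaluation, RK stages, ... */
        gettimeofday(&c0, NULL);
        dummy_exchange();                  /* nearest-neighbour message passing */
        gettimeofday(&c1, NULL);
        comm += seconds_between(c0, c1);
    }
    gettimeofday(&run1, NULL);

    double total = seconds_between(run0, run1);
    printf("communication fraction: %.1f %%\n", 100.0 * comm / total);
    return 0;
}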

All the results described above were for a network of workstations under light or average load. These additional processing tasks substantially affect the throughput of the distributed computing system. Parallel flow computations on a system of dedicated workstations (IBM 550's) using Ethernet communication are shown in Fig. 13. The results are for the smaller grid size of 148,500 and give substantially better results than those given previously in Fig. 6. Not only is the speedup significantly larger for a given number of machines, but the peak computation rate is pushed from four to eight machines, enabling speedups of almost 5.0 for the eight-processor network as compared to peak speedups of less than three on the lightly loaded machines. For the eight-processor network the code is still spending less than 40 percent of the execution time in communication. From the results discussed previously, additional improvement over and above that presented in this figure can be achieved for larger problems or by using faster communication media.

Finally, to test the true viability and applicability of the implemented code, actual practical runs were performed on a local network of heterogeneous workstations (IBM 320, 320H, 530, 560) using Ethernet as the underlying communication medium. These machines have individual benchmark-rated capabilities ranging from 9.4 to 30.4 MFlops. Such a network represents the commonly available resources within most research groups. The runs were performed for a grid size of 148,500. These runs are again the average of five to six samples, although the dedicated network implies very little variation from run to run.

The results of this experiment are presented in Fig. 14. As Fig. 14 shows, the performance of the code peaks at around six processors and then degrades for a higher number of processors. This is also observed in the sharp increase in the communication time required as the number of processors increases beyond six.


Figure 13: Performance on a dedicated Ethernet network (IBM RISC/6000 550's), Grid Size: 148,500

Figure 14: Performance on a dedicated non-homogeneous network

In this experiment the problem was equally distributed among the various processors, despite their very different processing speeds. A more effective implementation would be to subdivide the problem unequally, so that the faster machines have a proportionately larger share of the computation. Because of the equal subdivision of the problem, the problem size was restricted by the memory availability on the smallest machine (16 MB). The performance with a higher number of processors can be expected to improve for larger problems. For larger grid sizes the peak would be expected to shift to the right. As stated before, increasing grid size adds to the viability of the concurrently operating network due to the increase in the available memory resources.
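A proportional subdivision of this kind could be sketched as follows; the machine ratings listed are placeholders within the 9.4-30.4 MFlops range quoted above, and the number of streamwise planes is assumed.

#include <stdio.h>

/* Illustrative sketch of the unequal subdivision suggested above: give each
 * machine a number of streamwise planes proportional to its rated speed.
 * The ratings and plane count below are assumptions, not paper data.       */
int main(void)
{
    double rating[] = {9.4, 9.4, 13.0, 30.4, 30.4, 25.0}; /* assumed machine mix      */
    int nmach = 6, nplanes = 66;                           /* assumed streamwise planes */

    double total = 0.0;
    for (int m = 0; m < nmach; m++) total += rating[m];

    int assigned = 0;
    for (int m = 0; m < nmach; m++) {
        /* the last machine takes the remainder so every plane is assigned */
        int share = (m == nmach - 1)
                        ? nplanes - assigned
                        : (int)(nplanes * rating[m] / total + 0.5);
        assigned += share;
        printf("machine %d (%.1f MFlops): %d planes\n", m, rating[m], share);
    }
    return 0;
}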

Concluding Remarks

A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using PVM, and its performance is compared with that on a Cray and an Intel iPSC/860. The problem is relatively compute-intensive and has a communication structure based primarily on nearest-neighbour communication, making it ideally suited to message passing. Such problems are frequently encountered in CFD, and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in the message passing and produce a near-optimal solution.

Results have been presented for various grid sizes using up to eight processors operating over three different communication media. The aim of this experiment is to demonstrate the viability of such a concurrent network to solve large compute-intensive problems and the extent to which local workstations can be used to solve such problems.

Using a parallel network of workstations has many advantages. First, as the results show, the execution time for a practical problem can be reduced by a large factor relative to the sequential code by implementing a parallel algorithm to solve the same problem. Additionally, the combined available memory of the concurrently operating machines makes the solution of large and complex problems feasible on average-sized workstations. With increasing grid size the computational requirements far outweigh the communication requirements, making the algorithm more and more efficient. Although the use of a serial link as the communication medium is superior to the other media, the performance using Ethernet is also quite satisfactory.

Representative speedup ratios for six processors have typically been around four times the speed of a single processor. More importantly, the wall-clock turnaround times are competitive with those on the Cray-YMP and the Intel iPSC/860. In addition, the PVM-based implementation is definitely a more accessible and economical resource. Additionally, a concurrent network presents the possibility of implementing all the facets of a computational problem, from data management to graphical interaction, on the most appropriate hardware available for the specific purpose.

The primary limiting factor in applying PVM-based parallel implementations of practical CFD problems on a distributed network of workstations is the appearance of a peak operating speed limitation. As the communication time requirements increase, they eventually limit computational throughput so that adding more machines actually increases computation time. Under the worst conditions this peak occurs at four machines, but it is easily pushed substantially higher. The four-processor peak occurred for lightly loaded machines using a shared Ethernet communication and a small grid size (148,500 points). Increasing the grid size to 605,000 points shifted the peak above six processors, and with dedicated machines the peak was shifted to eight processors even for the small grid. Traditional three-dimensional CFD problems with near one million grid points should be very effectively computed on a network of eight to sixteen workstations.

Acknowledgements

This work was supported by the Propulsion Engineering Research Center, Pennsylvania State University, using computational resources provided by the Numerical Aerodynamic Simulation (NAS) program at NASA Ames, the Cornell Theory Center and the Numerically Intensive Computing Group, Penn State.

References

[1] Hwang, K., and Briggs, F., Computer Architecture and Parallel Processing, McGraw-Hill, 1984.

[2] Sunderam, V., "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, vol. 2, no. 4, December 1990, 315-339.

[3] Carriero, N., and Gelernter, D., "Linda in Context," Communications of the ACM, vol. 32, no. 4, April 1989, 444-458.

[4] Butler, R., and Lusk, E., "User's Guide to the p4 Programming System," Tech. Rep. ANL-92/17, Argonne National Laboratory, October 1992.

[5] Geist, G., and Sunderam, V., "Network Based Concurrent Computing on the PVM System," Concurrency: Practice and Experience (in press).

[6] Beguelin, A., Dongarra, J., Geist, A., Manchek, R., and Sunderam, V., "A User's Guide to PVM Parallel Virtual Machine," Tech. Rep. TM-11826, ORNL, July 1991.

[7] Grant, B., and Skjellum, A., "The PVM Systems: An In-Depth Analysis and Documenting Study - Concise Edition," Tech. Rep., LLNL, August 1992.

[8] Yagley, J., Feng, J., Merkle, C., and Lee, Y.-T., "The Effect of Aspect Ratio on the Effectiveness of Combustor Cooling Passages," AIAA Paper 92-3153, AIAA/SAE/ASME/ASEE 28th Joint Propulsion Conference, Nashville, TN, July 1992.

[9] Frohlich, A., Immich, H., Lebail, F., and Popp, M., "Three Dimensional Flow Analysis in a Rocket Engine Coolant Channel of High Depth/Width Ratio," AIAA Paper 91-2183, AIAA/SAE/ASME/ASEE 27th Joint Propulsion Conference, Sacramento, CA, June 1991.

[10] Kwak, D., Chang, J., Shanks, S., and Chakravarthy, S., "A Three-Dimensional Navier-Stokes Flow Solver Using Primitive Variables," AIAA Journal, vol. 24, 1986, 390-396.

[11] Pulliam, T., "Artificial Dissipation Models for the Euler Equations," AIAA Journal, vol. 24, no. 12, December 1986, 1931-1940.

[12] Merkle, C., and Tsai, Y.-L. P., "Application of Runge-Kutta Schemes to Incompressible Flows," AIAA Paper 86-0553, AIAA 24th Aerospace Sciences Meeting, January 1986.

