
Compiler Integrated Multiprocessor Simulation¹

R.S. Francis†, I.D. Mathieson† and A.N. Pears‡

†CSIRO, Division of Information Technology, High Performance Computation Project.

‡Department of Computer Science and Computer Engineering, La Trobe University.

Abstract

The Prism simulation system models the interaction of application, system and architectural structures for shared memory multiprocessors and distributed memory multicomputers. The simulation is achieved using two major components, a compiler and a library of processor and architecture modeling routines. The compiler processes application and system code written in a high level language and links the result to the modeling routines to produce a purpose built simulator. The compiler converts the program into a large number of small fragments, called slices, and then compiles the sequence of slices twice. The first compilation targets an abstract model processor and annotates each slice with a list of code and data accesses that would be emitted by the modeled CPU when executing that slice. The second compilation emits a C program from the annotated slices which interacts with a model machine to represent the program's execution on the modeled architecture. The output from compiling an application program is combined with libraries containing similarly compiled run-time and operating systems. The result is a special purpose simulator for the test application running under a particular kernel and executing on a specific architecture. Architectural features such as the number of processors or memory modules, cache algorithms and cache sizes, and network topology and speed can be selected. The use of simulation allows any amount of performance data to be gathered and analyzed on the fly, without disturbing the history of the execution. It also allows for rapid prototyping of architectural features and implementation techniques. This paper discusses the fundamental operation of the simulator and its integration with the associated compiler.

1 Introduction

As part of the general development of ever more powerful computers, parallel processing technologies are expected to play an important, if not dominant, role in providing higher throughput performance. We can predict that shared memory multiprocessors will be the norm in computer architecture in the immediate future. Bus technologies and cache coherency strategies now exist to support this development[7, 13, 18]. Most manufacturers already have multiple CPU products, and most new product designs allow for multiple CPU processing elements. For these machines, throughput improvement can be immediately derived using a symmetric processing mode in which each processor executes a separate process. As well, some programs can be recompiled for parallel execution using

¹This work is supported by the Australian Research Council, the Commonwealth Scientific and Industrial Research Organisation, La Trobe University and SIGMA DATA Corporation.


the shared memory model[12]. However, delivering the total processing power of such machines to a single application using a parallel processing mode remains a challenge to current software engineering strategies.

As the number of processors increases, the ability to specify large degrees of parallelism while minimizing hardware and software overheads becomes more important. A key feature of design in this context is the isolation and removal of performance bottlenecks. An examination of multiprocessors from this perspective leads directly to the design of distributed memory or network architecture machines. However, the removal of the single path by which all processors can communicate creates a host of additional difficulties and design decisions. It is evident that performance tradeoffs in such parallel systems are more complex than in uniprocessors and that hardware and software performance features are more interdependent.

The development of multiprocessors and multicomputers calls for the ability to evaluate the combination of hardware and software techniques employed in any given implementation. Such evaluation, especially if it involves hardware features, is often carried out using simulation, as the cost and time involved in machine construction make direct experimentation impractical. Traditional computer simulation studies concentrate on either the detailed logic of the hardware or the gross performance characteristics of the software. The detailed logic can be analyzed using circuit simulation and the gross characteristics can be modeled using statistical techniques. The performance evaluation of parallel systems requires less detail than would be obtained by circuit simulation but more precision, especially about component interactions, than can be represented using statistical models.

Early techniques in the multiprocessor arena involved the generation of static address reference traces, typically from a uniprocessor, to which a multiprocessor analysis was applied[2]. Parallel system performance, however, is determined by interactions between all levels of the system including the specific hardware available, the kernel and operating system structures, the application programming language and the algorithms coded in it. The identification of problems in a given design depends on measuring and then understanding the interplay of all these factors. Execution driven simulation is a developing technique which tracks the parallel execution history of programs with sufficient accuracy and speed to allow simulation studies to be performed with confidence and within reasonable times. Several such simulators have been constructed but many of them[1, 6, 11] require the test application to be modified by hand to include calls to timing and contention analysis routines. By adopting a compiler integrated strategy[14], Prism, like the RPPT[4] and Cerberus[3], has the substantial advantage of directly executing unmodified programs. The advantage of Prism over the approach used by Cerberus is in the overall efficiency of the simulation.

Once compiler integration is adopted, a number of benefits can be derived. First, instruction simulation can be replaced by timing analysis, greatly increasing the speed of simulation. Second, the compiler can emit in-line code to maintain the simulation state, removing the need for interpretive execution. Finally, test applications and other software run on the system can be written in a standard high level language. Apart from the obvious advantages in programmer efficiency, this has the added benefit of code transport between actual machines and simulations.


2 Parallel Systems Simulation

From the introductory discussion, the general requirements of a parallel system simulator can be developed.

• It must be capable of tracking all relevant hardware events down to individual processor address emissions and yet must support the analysis of gross program performance characteristics such as speedup and efficiency.

• It must be capable of executing hundreds of seconds of simulated machine time and preferably operate sufficiently quickly to enable interactive use for application level parallel program development, testing and refinement.

• It must allow the evaluation of the total time to solve significant problems and not be restricted to the analysis of the efficiency of an architecture for an abstract or statistical work load.

The Prism simulator meets these requirements through several of its features.

• Its efficiency comes from building two kinds of knowledge into the compiler. First, the compiler has sufficient knowledge of the model processor so that instruction simulation can be replaced by timing analysis. Second, the compiler has sufficient knowledge of the simulator's data structures to emit code which directly modifies the simulator's state to record the effect of model processor execution.

• Its flexibility comes from a clear separation of concerns. Processors are modeled as address reference generators and the dynamically evolving address reference histories generated during simulation are fed into an architecture simulator. Thus changes in cache strategy or network topology can be achieved by linking different architecture modeling routines during compilation.

• Its accuracy comes from interleaving the execution histories of processors so that activities in one processor can affect the execution and hence address reference history of other processors.

• Its ease of use comes from allowing all aspects of programming on the simulated architecture to be carried out in a well defined high level programming language.

Given the above features, Prism supports evaluation of the performance effects of variations in any of the following simulation parameters.

• Hardware structures and their relative speed of operation.

• Application and system algorithms.

• Compilation and run-time support structures.

This paper describes the manner in which the Prism compiler compiles and links a parallel program in order to build a parallel system simulation of that program running on selected parallel machines. It concludes with a discussion of the efficiency and general applicability of the technique.


3 Simulation Technique

The Prism simulator has been specifically designed to be capable of supporting the analysis required for measuring multiprocessor performance effects. While the cause of various effects of interest is highly specific, the eventual performance analysis must be concerned with gross application timings. Therefore the simulator is forced to track hardware and software events and structures in detail but it must operate sufficiently quickly to permit extensive program runs. The way in which a simulation proceeds is best understood by considering the manner in which an application is compiled and then executed.

The basic unit of the Prism simulation is called a slice. The underlying machine simulation provides a raw architecture upon which a kernel, threads management package, runtime library and test application are executed. To simulate parallel execution, all of this software is processed by the Prism compiler (pcomp) and converted into a set of primitive slices of activity. Each slice comprises timing and contention directives followed by simulator state change directives appropriate to the fragment of code modeled by the slice. Modeled processors in the architecture execute these slices according to the dictates of the control flow of the original programs. The result is an interleaving of parallel activities to a precision determined by the grain of the slicing applied by the compiler.

The Compiler

Pcomp, the Prism compiler, is built from three major components: a parser, a timing analyzer and a code generator. The parser, constructed using yacc[19], lex[20] and errec[15], an in-house syntax error recovery tool, builds an in-memory parse tree representation of the source program. The timing analyzer operates in three passes. The first pass allocates any temporaries required to prevent race conditions or maintain consistency of parallel execution. The second pass scans the modified parse tree, generating a linear intermediate form which can be broken down into a large number of fragments called slices. The third pass scans the sliced linear code, generating timing and contention information for execution of the source code on the model target machine. This information, which represents the sequence of processor events that would occur if the code of each slice were compiled to the model machine, is attached to each slice. Following analysis, the code generator emits C in which each slice is coded as a function containing two parts. The code in the first part calls various routines in the machine modules to implement the effects of timing and contention events corresponding to the slice. The code in the second part of each slice updates the state of the simulator to reflect the effect of the slice's original source code. Consequently, the compiler must model both the simulated processor and the eventual simulation structure in order to emit directly executable code which performs the actions appropriate to the two parts of each slice of the original program. These compilation stages are illustrated in Figure 1a.

The C code emitted from pcomp is compiled and linked with runtime libraries to provide an executable image. Figure 1b depicts the integration of a compiled application program, runtime library, operating system and processor and memory simulation into an executable object which is the simulator for the test application. A memory map of the resulting simulator is shown in Figure 2. The only parts of the resulting image which are not generated by pcomp are the machine implementation modules (see later section). All of


 S   E   W   Linear Code             PC   Size   Machine Code

 1   1   1   Link f²                 00    6     Link f

 2   2   2   NF → retval = &TMP1     06    6     Load immediate &TMP1
 :   :   :                           12    6     Store indexed NF,retval

 :   3   3   NF → A = (3 + Y)        18    6     Load immediate 3
 :   :   :                           24    6     Add memory Y
 :   :   :                           30    6     Store indexed NF,A

 :   4   4   NF → B = 6              36    6     Load immediate 6
 :   :   :                           42    6     Store indexed NF,B

 :   5   5   Call                    48    2     Call

 3   6   6   Y = (2 ∗ X) + TMP1      50    6     Load immediate 2
 :   :   :                           56    6     Mul memory X
 :   :   :                           62    6     Add indexed LF,TMP1
 :   :   7                           68    6     Store memory Y

Figure 3: Slicing of code for the statement Y := 2 ∗ X + f(3 + Y, 6);.

the application, thread and parallel iteration management package, runtime mathematical and I/O libraries, and the operating system kernel are compiled by pcomp, and have therefore been converted into an immense number of small slices. Thus all aspects of application and system software execute in parallel on the simulated architecture, and kernel as well as application timing and data interactions can be observed.

Slicing

As an example of the compiler’s slicing of code, consider the source statement:

Y := 2 ∗ X + f(3 + Y, 6);

where X and Y are integers and f is an integer function. This line is compiled by Prism into a parse tree which in turn can be converted into linear intermediate code and then sliced. The slicing grain can be set at statement (S), expression (E) or write interleave (W) to give the slicing depicted in Figure 3. This grain determines the amount of code or simulated activity which occurs in a single slice. The finer the grain the greater the interleaving of concurrent activity and the higher the overhead in the simulation. The numbers on the left of Figure 3 depict the range of each slice at the various grains. Write interleave forces the store associated with each assignment into its own slice, thus permitting concurrent state changes to occur between the calculation and the storing of a value.

²Link and Call instructions establish the new procedure's execution environment and set NF to its base address. The function address is given to Link rather than Call as part of the support for parallel procedure calls[9].


Each slice is converted into a function by the compiler's code generator and written out in C from which it can be compiled using the host's C compiler. It is then linked with the other pre-compiled modules of the system as in Figure 2. The following code segment shows the C emitted for slice 2 of the statement grain (S) from Figure 3.

slice_2() {
    CACHE_Fetch( base+6, 12 );    /* fetch load and store */
    CACHE_Write( &NF->ret_val );  /* write to store address */
    NF->ret_val= &LF->TMP1;       /* update state */

    CACHE_Fetch( base+18, 12 );   /* fetch load and add */
    CACHE_Read( &GF->Y );         /* read add value */
    CACHE_Fetch( base+30, 6 );    /* fetch store instruction */
    CACHE_Write( &NF->A );        /* write to store address */
    NF->A= ( 3 + GF->Y );         /* update state */

    CACHE_Fetch( base+36, 12 );   /* fetch load and store */
    CACHE_Write( &NF->B );        /* write to store address */
    NF->B= 6;                     /* update state */

    { extern slice_3(); CPU_Call( slice_3 ); }
}

Note that several of the linear intermediate code statements have been included in the one slice, in fact the slice covers the entire parameter passing sequence for the call of the function f. This slice concludes with the invocation of a simulated call instruction which is directly handled by the architecture modeling routines. These routines manipulate the simulation to reflect the timing and state changes appropriate to the required stack manipulations. The return address (slice 3) will be stored on the simulated stack and the CPU's PC set to the address which was specified in the preceding link instruction. The next time this CPU is selected for execution, the first slice of function f will be activated.
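
The source of CPU_Call itself is not given in this paper; the following sketch merely illustrates the behaviour just described. The sim_stack_push and CPU_charge helpers, the CPU_link_target variable and the cycle count are assumptions introduced for illustration, while CPU_pc is the program counter name visible in the emitted code above.

/* Sketch only: the stack and clock handling behind a simulated Call.
 * All helper names and the charged cycle count are assumptions.        */
typedef void (*slice_fn)(void);

extern void     sim_stack_push( slice_fn s );  /* push onto simulated stack */
extern slice_fn CPU_link_target;               /* set by the preceding Link */
extern slice_fn CPU_pc;                        /* next slice for this CPU   */
extern void     CPU_charge( int cycles );      /* advance simulated clock   */

void CPU_Call( slice_fn return_slice )
{
    sim_stack_push( return_slice );   /* return address = the next slice    */
    CPU_pc = CPU_link_target;         /* jump to the callee's first slice   */
    CPU_charge( 2 );                  /* assumed cost of the 2 byte Call    */
}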

As a second example, the following C code is generated for slice 3 of the expression grain (E) from Figure 3.

slice_3() {
    CACHE_Fetch( base+18, 12 );   /* fetch load and add instructions */
    CACHE_Read( &GF->Y );         /* read the address of add value */
    CACHE_Fetch( base+30, 6 );    /* fetch store instruction */
    CACHE_Write( &NF->A );        /* write to store address */
    NF->A= ( 3 + GF->Y );         /* update state */

    { extern slice_4(); CPU_pc= slice_4; }  /* execute slice_4 next */
}

This grain of slicing splits the function call up so that each element of the process is capable of interleaving with other activities in the modeled architecture. The finer the slicing the more precise the time location of separate activities and the higher the simulation overhead.


Execution

The main control of the simulation occurs in a basic driver routine. A key component of this driver is the SLICE mechanism. The overall function of SLICE is to interleave the simulated execution of multiple CPUs. Simulation proceeds by selecting one of the modeled CPUs and executing the sequence (or slice) of instructions to which it points. Each model CPU's program counter is typed in C as a function pointer which is set to the function containing the next slice to be executed on that CPU. Execution remains with that modeled CPU until the executed slice returns. SLICE also supports the delivery of interrupts to a CPU, such as timeout, and the processing of kernel mode entry and exit. Note that the gathering of raw performance statistics is implemented within SLICE and the processor modeling library, but the manipulation of that raw data is performed by kernel or user code running on the simulator.

WHILE (simulation running) DO

• Preserve simulated registers in the Current CPU structure

• Select the youngest running CPU as the Current CPU

• Restore registers from the selected Current CPU structure

• Resume any pending partial operation in the selected CPU

• Process interrupts for the selected CPU (if in user mode)

• Call the slice function indicated by the selected CPU’s pc field

END

Figure 4: Operation sequence for the main SLICE execution loop.

The state of each CPU is stored in a data structure maintained by Prism. The structure contains fields for each of the CPU's visible and internal registers. During the execution of a slice, registers for the currently executing CPU are actually cached on the host stack


for efficient access.
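
The layout of this per-CPU structure is not given in the paper; a minimal sketch consistent with the description above and with the function pointer program counter mentioned earlier is shown below. All field names and sizes are assumptions, and CPU_pc in the emitted code is presumed to refer to the pc field of the currently selected record.

/* Sketch only: an assumed layout for the per-CPU state record; Prism's
 * actual structure is not shown in the paper.                           */
typedef void (*slice_fn)(void);

#define MAX_CPUS 32              /* assumed upper limit for this sketch   */

typedef struct {
    slice_fn      pc;            /* function holding this CPU's next slice */
    unsigned long clock;         /* simulated cycle count for this CPU     */
    int           mode;          /* 0 = user, 1 = kernel (assumed encoding)*/
    long          regs[16];      /* visible registers (assumed count)      */
    long          temps[8];      /* internal/temporary registers (assumed) */
    int           running;       /* eligible for selection by SLICE        */
    int           pending_irq;   /* pending interrupt, e.g. a timeout      */
} ModelCPU;

extern ModelCPU cpus[MAX_CPUS];  /* one record per modeled processor       */
extern ModelCPU *current_cpu;    /* the CPU selected by the SLICE driver   */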

As described earlier, the execution within each slice consists of the two actions:

• Execute host code to compute the timing and contention effects of this slice on the modeled machine.

• Execute host code to directly modify the simulator's internal state to reflect the data effects of execution of this slice on the modeled machine.

The compiler ensures that the last action in each slice function is to set the pc field of thecurrent CPU to point to the function representing the next slice which that CPU shouldexecute. Thus most execution occurs in various slices, all of which are composed of directlycompiled host code. The specific activities of the SLICE routine, as listed in Figure 4,provide the mechanism for rotating execution around the simulated CPUs. The decisionto always step the youngest CPU ensures that the simulation keeps all processors neareach other to a degree determined by the slicing grain applied during the compilation.
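
The loop of Figure 4 can be summarised in the sketch below, which reuses the ModelCPU record sketched earlier. Every helper routine named here is an assumption introduced for illustration, not an actual Prism routine, and the youngest CPU is assumed to be the one with the smallest simulated clock.

/* Sketch only: the SLICE driver loop of Figure 4 with assumed helpers.   */
extern int       simulation_running;
extern ModelCPU *current_cpu;
extern ModelCPU *select_youngest_running_cpu( void ); /* smallest clock    */
extern void      save_registers( ModelCPU * );        /* host-cached regs  */
extern void      restore_registers( ModelCPU * );
extern void      resume_partial_operation( ModelCPU * );
extern void      deliver_pending_interrupts( ModelCPU * );

void SLICE( void )
{
    while (simulation_running) {
        save_registers( current_cpu );                /* preserve state     */
        current_cpu = select_youngest_running_cpu();  /* least advanced CPU */
        restore_registers( current_cpu );
        resume_partial_operation( current_cpu );      /* e.g. stalled access */
        if (current_cpu->mode == 0)                   /* user mode only     */
            deliver_pending_interrupts( current_cpu );
        current_cpu->pc();     /* run one slice; the slice itself sets pc
                                  to this CPU's next slice                  */
    }
}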

4 The Machine Model

The Prism simulator treats processors as address reference generators. The entire compilation and slicing strategy is designed to calculate the address emission history of the processors as quickly as possible. These addresses are processed by the architecture modeling library in order to compute the timing effects of cache processing, network structures and memory contention on the processor clock values (Figure 5). Hence the modeling of

Figure 5: Processing of address emissions during architecture modeling.

the architecture is split across two components:

• The processor architecture appears in timing analysis and is static.


• The machine architecture appears in the machine modules and is therefore responsive to the dynamic state of the simulated machine.

The hardware emulation library comprises several interacting modules: a cpu module, a cache module, a memory module, a network module and a machine module. The cpu module directly simulates instructions with complex timing interactions such as stack operations and memory interlocked read-modify-write instructions. The cache module provides various sized code and data caches with various associativities and write through and copy back coherence strategies. The memory module simulates physical memory subsystems and models contention delays during simultaneous access to the same memory subsystem. The network module calculates message delays in network machines for store and forward or circuit switching strategies on a range of topologies.
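
The CACHE_Fetch, CACHE_Read and CACHE_Write entry points are visible in the emitted slices of Section 3, but their internals are not given in the paper. A heavily simplified sketch of what a data read against a direct mapped cache might do is shown below; the data structures, line size and delay accounting are assumptions.

/* Sketch only: a simplified direct mapped cache lookup; not the Prism
 * cache module.  Line size, cache size and costs are assumptions.       */
#define LINE_SIZE    16                 /* bytes per line (assumed)       */
#define CACHE_LINES  2048               /* 32 Kbyte / 16 byte lines       */

typedef struct {
    unsigned long tag[CACHE_LINES];
    int           valid[CACHE_LINES];
} DataCache;

extern DataCache *data_cache_of_current_cpu( void );
extern void       CPU_charge( int cycles );    /* advance simulated clock  */
extern void       MEM_access( void *addr );    /* memory module: adds bus,
                                                  memory and contention
                                                  delays to the CPU clock  */

void CACHE_Read( void *addr )
{
    DataCache    *c    = data_cache_of_current_cpu();
    unsigned long a    = (unsigned long)addr;
    unsigned long line = (a / LINE_SIZE) % CACHE_LINES;
    unsigned long tag  = a / (LINE_SIZE * CACHE_LINES);

    if (c->valid[line] && c->tag[line] == tag) {
        CPU_charge( 1 );                /* hit: one processor cycle        */
    } else {
        c->valid[line] = 1;             /* miss: fill the line and charge  */
        c->tag[line]   = tag;           /* bus, memory and contention cost */
        MEM_access( addr );
    }
}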

Machine Architecture

Several assumptions about the target machine underlie code generation and the simulation of timing and contention effects.

• All CPUs run independently but with the same clock speed.

• Identical caching strategies are used by all CPUs.

• The bus and memory cycle times are identical but CPUs can complete several cycles per bus cycle.

• Bus or network arbitration is impartial, giving the same random distribution of delays for each CPU on contention during the same access cycle.

Cpu Model

The model processor architecture approximates a byte-addressable, register-memory operation machine, with sufficient specific and temporary registers to evaluate expressions of arbitrary complexity. Code and data fetches are pipelined across a 32-bit bus, but may be delayed by hardware contention. Data widths are 1 and 4 bytes, 32-bit aligned. Instructions are formed from a 2 byte opcode and one optional 2 or 4 byte operand, with control transfer points padded to fall at a 32-bit boundary. For example, the 6 byte instructions in the Size column of Figure 3 consist of a 2 byte opcode and a 4 byte operand, while the operand-free Call occupies only 2 bytes. Immediate, memory (direct), register, base-register relative and register-deferred addressing modes are available. Most instructions are assumed to complete within the number of clock increments required to fetch them, but some (such as floating point operations) may take longer. The machine provides both privileged (kernel) and non-privileged (user) modes of operation. The setting of the CPU mode controls access to some of the registers and parts of memory.

Network Model

For multiprocessors, Prism models a single 32-bit bus connecting any number of memory modules. For multicomputers, link bandwidth can be varied and topology can be selected from mesh, torus, hypercube and ring. The network model can be configured for store and forward or wormhole style circuit switching.
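
The delay formulas used by the network module are not given in the paper. The conventional first order estimates for the two switching styles, sketched below with assumed parameter names, indicate why the choice matters: store and forward pays the transmission time at every hop, while wormhole style circuit switching pays it only once.

/* Sketch only: textbook first order message delay estimates for the two
 * switching styles; parameter names are assumptions, not Prism's.        */
typedef struct {
    double cycles_per_byte;    /* inverse link bandwidth                   */
    double routing_cycles;     /* per hop switching latency                */
} NetLink;

/* Store and forward: the whole message is received at each intermediate
 * node before being forwarded, so transmission time is paid at every hop. */
double store_and_forward_delay( const NetLink *n, int hops, int bytes )
{
    return hops * (n->routing_cycles + bytes * n->cycles_per_byte);
}

/* Wormhole style circuit switching: only the header pays the per hop
 * latency; the body streams behind it and is paid for once.              */
double wormhole_delay( const NetLink *n, int hops, int bytes )
{
    return hops * n->routing_cycles + bytes * n->cycles_per_byte;
}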


Memory Model

The simulator models contention for memory by permitting only one CPU to access a given memory module at any point in time. A separate compilation mode allows modeling of CRCW PRAMs where any CPU can access any location at any time. The memory module provides an aligned block of host memory to serve as simulated physical memory, as shown in Figure 2. The run-time model of memory has protected kernel memory allocated from one end of the block, while user memory is allocated from the other end.
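
A minimal sketch of this two ended allocation is given below; the names and the alignment policy are assumptions, since the paper shows only the resulting memory map in Figure 2.

/* Sketch only: kernel and user allocations growing towards each other
 * from opposite ends of the simulated physical memory block.            */
#include <stddef.h>

extern char  *sim_memory;        /* aligned block of host memory          */
extern size_t sim_memory_size;   /* size of the simulated physical memory */

static size_t kernel_top;        /* next free byte for kernel allocations */
static size_t user_bottom;       /* lowest byte of the user region        */

void sim_mem_init( void )
{
    kernel_top  = 0;
    user_bottom = sim_memory_size;
}

void *kernel_alloc( size_t bytes )
{
    bytes = (bytes + 3) & ~(size_t)3;              /* keep 32-bit alignment */
    if (bytes > user_bottom - kernel_top) return NULL;   /* regions collide */
    void *p = sim_memory + kernel_top;
    kernel_top += bytes;
    return p;
}

void *user_alloc( size_t bytes )
{
    bytes = (bytes + 3) & ~(size_t)3;
    if (bytes > user_bottom - kernel_top) return NULL;
    user_bottom -= bytes;
    return sim_memory + user_bottom;
}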

Parameters

A number of switches can be passed to a simulation to control parameters of the machine model. These include the number of processors and memory modules, the size of code and data caches, the ratio of processor to bus cycle times, the number of nodes on a network, the link speed and topology, and various interrupt quanta used in scheduling decisions.
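
The concrete switch names are not given in the paper; the parameter set itself can be summarised as in the placeholder sketch below, in which every name is an assumption.

/* Sketch only: a placeholder grouping of the machine model parameters
 * listed above; none of these names are Prism's actual switches.         */
typedef enum { BUS, MESH, TORUS, HYPERCUBE, RING } Topology;

typedef struct {
    int      n_cpus;              /* number of processors                  */
    int      n_memory_modules;    /* number of memory modules              */
    int      code_cache_bytes;    /* size of each code cache               */
    int      data_cache_bytes;    /* size of each data cache               */
    int      cpu_cycles_per_bus;  /* ratio of processor to bus cycle times */
    int      n_nodes;             /* nodes in a multicomputer network      */
    double   link_speed;          /* network link bandwidth                */
    Topology topology;            /* interconnect topology                 */
    int      interrupt_quantum;   /* interrupt quantum used in scheduling  */
} MachineParameters;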

5 Performance Results

Trade-Offs

As in all system simulations, some approximations must be made to simplify processing and to reduce running time. In Prism there are three major approximations made which affect the accuracy of the results.

• The timing analysis performed by the compiler can only approximate the modeled processor's emissions for code fetches and data reads and writes. The more complex a processor is, and the more it asserts inherent parallelism on its instruction stream, the more difficult this calculation becomes.

• The slice grain applied can have a dramatic effect on accuracy. For example, at grains above entire assignment statements various race conditions cannot appear during interleaved execution. Of course, the finer the grain, the higher the cost of carrying out the simulation. Below the expression level, apart from erroneous parallel programs, finer slicing produces the same results at higher costs.

• The machine simulation libraries which provide the runtime analysis of reference patterns and which generate processor delays due to contention directly affect timing accuracy. Considerable effort must be placed in such contention analysis as it embodies most of the performance effects arising from the non-processor aspects of the machine.

Consider the classic problem of the simultaneous execution of multiple instances of the assignment X := X + 1 to the shared variable X. Prism will place the entire execution of this statement into a single slice unless the slicing grain is set to write interleave. Thus execution interleaving will only show erroneous results for higher precision and therefore higher cost simulation. For larger grain slicing the statement is effectively rendered atomic. As a program which contains such concurrent assignments is erroneous, Prism can be used


CRCW Model
(All processors have simultaneous access to all memory locations)

                     Statement interleave      Write interleave
HOST        CPUs   10³ cycle/sec    kb/sec   10³ cycle/sec    kb/sec

SPARC-1        1        1,358        3,215        376           960
              10        1,108        2,366        241           613
              20          636        1,299        179           401
              30          468          986        153           329

Mips M-120     1        2,264        5,339        678         1,798
              10        1,415        3,463        412         1,032
              20          900        1,858        266           605
              30          615        1,272        210           451

Figure 6: Simulation Speed for CRCW Model.

with high precision slicing during program development, testing and refinement to isolate such errors. Coarser grain slicing can be used for faster simulation on longer architecture evaluations if appropriate.
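
In the style of the emitted code of Section 3, write interleave slicing would split X := X + 1 roughly as sketched below. The slice numbering, fetch offsets and use of a frame temporary are illustrative assumptions, not compiler output.

/* Illustrative sketch only, in the style of the emitted slices shown
 * earlier: write interleave slicing of  X := X + 1  for a shared X.     */
slice_n() {
    CACHE_Fetch( base+0, 12 );    /* fetch load and add                  */
    CACHE_Read( &GF->X );         /* read the shared variable            */
    CACHE_Write( &LF->TMP1 );     /* write to temporary                  */
    LF->TMP1= GF->X + 1;          /* update state                        */

    { extern slice_n1(); CPU_pc= slice_n1; }  /* store slice comes next  */
}

slice_n1() {
    CACHE_Fetch( base+12, 6 );    /* fetch store instruction             */
    CACHE_Write( &GF->X );        /* write the shared variable           */
    GF->X= LF->TMP1;              /* another CPU may run between the read
                                     slice and this store, exposing the
                                     race on X                           */
    { extern slice_n2(); CPU_pc= slice_n2; }
}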

Hardware primitives, such as test-and-set, which form the basis of higher level synchronization structures require special treatment. Prism always places such instructions in individual slices and thus guarantees their correct interleaving.

The processor modeled in Prism is a generic RISC style processor which does not correspond to any specific commercially available processor. Given that the ultimate interest is in gross system performance it is unlikely that minor variations in the processor model would produce discernible changes in total result. However, variations in the machine architecture, such as cache strategy or interconnection network, have a dramatic effect on the result. For this reason, the Prism system does not yet provide alternatives for processor architecture. Such variations in processor design could be provided by coding different timing analysis routines in the compiler.

Performance

Efficiency results for Prism on a range of systems were obtained by adding code to count the number of processor cycles and the number of code bytes fetched during the runs of some typical applications. These counts can be converted to cycles/sec and bytes/sec by dividing by elapsed times for the runs on the various systems. The UNIX system calls gettimeofday() and getrusage() were used to collect total elapsed time, and partition it into user and system fractions. In the results reported here, checks were made to ensure no untoward system loads, such as excessive paging, occurred during a simulation.
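
A minimal sketch of this host time measurement, using the standard UNIX calls named in the text, is given below; run_simulation is a placeholder for the simulator run.

/* Sketch of the host timing measurement described in the text, using the
 * standard UNIX gettimeofday() and getrusage() calls.                    */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

extern void run_simulation( void );            /* placeholder for the run  */

static double seconds( struct timeval t )
{
    return t.tv_sec + t.tv_usec / 1e6;
}

int main( void )
{
    struct timeval start, end;
    struct rusage  ru;

    gettimeofday( &start, NULL );
    run_simulation();
    gettimeofday( &end, NULL );
    getrusage( RUSAGE_SELF, &ru );             /* user and system fractions */

    printf( "elapsed %.2f s, user %.2f s, system %.2f s\n",
            seconds( end ) - seconds( start ),
            seconds( ru.ru_utime ), seconds( ru.ru_stime ) );
    return 0;
}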

The models used are:

• CRCW which represents the contention free model (Figure 6), and

• NORM which is a multiprocessor with separate direct mapped code and data caches per processor (Figure 7).


NORM: Multiprocessor Model
(32Kbyte direct mapped write through cache, CPU clock tied to bus clock)

                     Statement interleave      Write interleave
HOST        CPUs   10³ cycle/sec    kb/sec   10³ cycle/sec    kb/sec

SPARC-1        1          153          347        108           244
              10          157          346        101           220
              20          152          282         93           171
              30          145          247         87           148

Mips M-120     1          200          452        170           380
              10          203          441        149           328
              20          194          361        131           243
              30          187          317        118           199

Figure 7: Simulation Speed for Multiprocessor Architecture.

The figures display the simulated execution speed in processor cycles per second and fetch rates in bytes per second. Given the contention delays introduced in hardware, simulated processor cycles compared to host processor cycles are a poor estimate of relative efficiency. The rate of code bytes fetched is a more interesting parameter which can be compared to the maximum possible fetch rate on each of the hosts. These comparisons produce efficiency ratios of 1:10 for the CRCW model, an impressive example of simulation efficiency, and 1:200 for the NORM model.

6 An Example Experiment

The nature of the analysis which can be performed with Prism is best illustrated by an example. The example described here concerns an evaluation of distributed shared memory.

Distributed virtual memory (DVM)[17, 16] is proposed as a strategy for a transparent, flexible programming model for multicomputers. While DVM has several advantages, the implications of its implementation on hardware and operating system structures are not well understood.

Using Prism it is possible to evaluate some of the operating system and hardware requirements for DVM by performance measurement of prototype systems using various execution loads. The performance of unmodified shared memory algorithms when executed on such a platform can also be evaluated.

The experiment described here generates results for a DVM system executing on four common topologies. The load applied to the model is generated by the execution of a simple shared memory matrix multiplication algorithm. This task embodies a significant simulation load, is simple to analyze and represents a challenge to DVM.

Development of a prototype DVM operating system for Prism involved the implementation of additional facilities. These features manage the distribution and cooperation of computation elements when executing in a multicomputer network. Distribution of


PROCEDURE K ( ROW : INTEGER; VAR MAT, PROD : IntMatrix ) ;
(* Perform parallel multiplication on 1..Size by 1..Size of MAT to
 * yield its square product in PROD. Degree copies of K are
 * spawned where each performs inner products for rows starting from
 * ROW in steps of Degree.
 *)
VAR
    i       : INTEGER;    (* steps through inner product terms *)
    Row     : INTEGER;    (* steps through rows for this thread *)
    Col     : INTEGER;    (* current column for inner product *)
    ProdVal : INTEGER;    (* Value of product element *)

BEGIN (* K *)
    MIGRATE( ROW-1 );
    Row := ROW;
    WHILE Row <= Size DO
        FOR Col := 1 TO Size DO
            ProdVal := zero;
            for i := 1 to Size do
                ProdVal := ProdVal + ( MAT[Row, i] * MAT[i, Col] )
            end;
            PROD[Row, Col] := ProdVal
        END;
        Row := Row + Degree
    END
END K;

Figure 8: Simple multiply routine written in Modula-P as compiled by pcomp.

computation is provided at the thread level. Threads of execution[5, 8], which represent a restartable program state, can be created on one node and then moved to other nodes in the machine. Support for thread migration must be provided by the operating system, where it may be invoked by user code or during scheduling in the kernel.

Cooperation between dispersed threads is accomplished by directly sharing variables and synchronization is accomplished using a distributed barrier algorithm. The standard shared memory barrier algorithm uses a lockable shared data structure to record information about the state of sibling threads which wish to synchronize. Mutually exclusive access to the data structure by threads on the local and remote nodes is guaranteed by the kernel. Remote threads which wish to synchronize send a message to the node upon which the parent data structure is allocated. Queuing of these messages in the kernel until exclusive access can be granted guarantees that operations on the synchronization data are atomic. Once synchronization has been achieved local threads may be reactivated. Activation of threads on remote nodes must be accomplished using message passing.
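
A minimal sketch of the lockable shared barrier record described above is given below; all names are assumptions and the kernel mediated message path for remote threads is omitted.

/* Sketch only: a lockable shared barrier record of the kind described
 * above.  Names are assumptions; remote (message based) arrivals and
 * kernel queuing are omitted.                                          */
typedef struct {
    int lock;        /* protected with a test-and-set style primitive   */
    int expected;    /* number of sibling threads that must arrive      */
    int arrived;     /* threads that have reached the barrier so far    */
    int generation;  /* distinguishes successive uses of the barrier    */
} Barrier;

extern void acquire( int *lock );   /* test-and-set based lock           */
extern void release( int *lock );
extern void block_thread( void );   /* yield until reactivated           */
extern void wake_local_waiters( Barrier *b );

void barrier_wait( Barrier *b )
{
    acquire( &b->lock );
    int my_generation = b->generation;

    if (++b->arrived == b->expected) {      /* last thread to arrive     */
        b->arrived = 0;
        b->generation++;                    /* release this generation   */
        wake_local_waiters( b );            /* remote nodes get messages */
        release( &b->lock );
        return;
    }
    release( &b->lock );

    while (b->generation == my_generation)  /* wait for the release      */
        block_thread();
}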

The source for the inner procedure of the algorithm is given in Figure 8. The example


Figure 9: Speedups For All Architectures

shows the general features of the language supported by pcomp. This code was executed on the following simulated machines: the CRCW model, a multiprocessor with 4-way set associative copy-back caches, and multicomputers with hypercube, torus, ring and mesh topologies. Each multiplication of a 200 × 200 array on the network machines involved the simulation of between 8 × 10⁸ and 1.8 × 10⁹ processor cycles and completed in approximately eight hours. The speedup results for all the runs are given in Figure 9.

7 Summary

The Prism simulator provides a practical and efficient technique for investigating performance effects of hardware and software innovations for parallel systems ranging from shared memory multiprocessors to network based multicomputers. The system provides the following principal features.

• It uses a high level language for application and kernel coding. The system could be extended to provide multiple front ends for mixed language evaluations.

• The compiler integration provides efficient code generation and fast simulation.

• The slicing and interleaved execution support both high precision analysis and lower overheads for lower precision analyses.

• The structure of the architecture simulation libraries allows variations in architectural strategies to be developed and tested quickly and easily.


A unique advantage of Prism is the simulation of the CRCW PRAM, a machine which is impossible to construct. Evaluations run on this model yield a theoretical maximum speedup for parallel programs intended for multiprocessor execution. The relative decline in performance of models of machines which can be constructed, in contrast to the CRCW PRAM performance, gives a benchmark comparison which cannot be obtained by any other technique.

References

[1] S.G. ABRAHAM, “Parallel Simulation of Shared Memory Multiprocessors”, Proc. of the Third Int. Conf. on Supercomputing, Vol. III, International Supercomputing Institute, 1988, pp. 313–322.

[2] T. AXELROD, P. DUBOIS and P. ELTGROTH, “A Simulator for MIMD Performance Prediction: Application to the S-1 MkIIa Multiprocessor”, Parallel Computing, Vol. 1, November 1984, pp. 273–298.

[3] E.D. BROOKS III, T.S. AXELROD and G.A. DARMOHRAY, “The Cerberus Multiprocessor Simulator”, Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987, pp. 384–390.

[4] R.C. COVINGTON, S. MADALA, V. MEHTA, J.R. JUMP and J.B. SINCLAIR, “The Rice Parallel Processing Testbed”, Proc. 1988 ACM SIGMETRICS, ACM Press, May 1988, pp. 4–11.

[5] T.W. DOEPPNER, “Threads: A System for the Support of Concurrent Programming”, Department of Computer Science, Brown University, CS-87-11, June 1987.

[6] M. DUBOIS, F.A. BRIGGS, I. PATIL and M. BALAKRISHNAN, “Trace-Driven Simulations of Parallel and Distributed Algorithms in Multiprocessors”, Proc. 1986 Int. Conf. on Parallel Processing, IEEE, August 1986, pp. 909–916.

[7] M. DUBOIS, C. SCHEURICH and F.A. BRIGGS, “Synchronization, Coherence, and Event Ordering in Multiprocessors”, IEEE Computer, February 1988, pp. 9–21.

[8] R.S. FRANCIS and I.D. MATHIESON, “A Threaded Programming Environment”, Aust. Comp. Sci. Communications, Vol. 9, No. 1, 1987, pp. 173–183.

[9] R.S. FRANCIS and I.D. MATHIESON, “Parallel Procedure Calls”, Proc. 1987 IEEE Int. Conf. on Parallel Processing, Chicago, August 1987, pp. 663–666.

[10] E.F. GEHRINGER, J. ABULLARADE and M.H. GULYN, “A Survey of Commercial Parallel Processors”, ACM Computer Architecture News, Vol. 16, No. 4, September 1988, pp. 75–107.

[11] D.J. KOPETZKY, “Horse: A Simulation of the Horizon Supercomputer”, Proc. Supercomputing ’88, Orlando, FL, IEEE and ACM SIGARCH, November 1988, pp. 53–54.

[12] D.J. KUCK, E.S. DAVIDSON, D.H. LAWRIE and A.H. SAMEH, “Parallel Supercomputing Today and the Cedar Approach”, in Special Topics in Supercomputing, Vol. 1, (ed.) J.J. Dongarra, North-Holland, 1987, pp. 1–23.

[13] T. LOVETT and S. THAKKAR, “The Symmetry Multiprocessor System”, Proc. of the 1988 Int. Conf. on Parallel Processing, Vol. I (Architecture), August 1988, pp. 303–310.


[14] I.D. MATHIESON and R.S. FRANCIS, “A Dynamic-Trace-Driven Simulator for Evaluating Parallelism”, Proceedings of the 21st Hawaii Int. Conf. on System Sciences: Vol. 1 (Architecture track), Kailua-Kona, HI: IEEE Computer Soc. Press, Jan. 1988, pp. 158–166.

[15] A.N. PEARS and R.S. FRANCIS, “Enhanced Error Recovery for UNIX Parsers”, AUUG-89 Conference, AUUGN, Vol. 10, No. 4, August 1989, pp. 127–139.

[16] M. ROZIER, V. ABROSSIMOV, F. ARMAND, I. BOULE, M. GIEN, M. GUILLEMONT, F. HERRMANN, C. KAISER, S. LANGLOIS, P. LEONARD and W. NEUHAUSER, “CHORUS Distributed Operating Systems”, Computing Systems, Vol. 1, No. 4, Fall 1988, pp. 305–370.

[17] D.H.D. WARREN and S. HARIDI, “Data Diffusion Machine: A Scalable Shared Virtual Memory Multiprocessor”, Proc. of the International Conf. on Fifth Generation Computer Systems 1988, ICOT, 1988, pp. 943–952.

[18] H.J. WASSERMAN, M.L. SIMMONS and O.M. LUBECK, “The Performance of Minisupercomputers: Alliant FX/8, Convex C-1, and SCS-40”, Parallel Computing, Vol. 8, No. 1-3, October 1988, pp. 285–293.

[19] “YACC – Yet Another Compiler Compiler”, Support Tools Guide, UNIX System, Western Electric, 1983, pp. 131–167.

[20] “LEX – Lexical Analyser Generator”, Support Tools Guide, UNIX System, Western Electric, 1983, pp. 113–124.


