
ALGORITHMIC DESIGN ON THE CEDAR MULTIPROCESSOR

Michael Berry, Hsin-Chu Chen, Efstratios Gallopoulos, Ulrike Meier, Allan Tuchman, Harry Wijshoff, and Gung-Chung Yang

Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801-2932. Research supported by the U.S. Department of Energy under Grant No. US DOE DE-FG02-85ER25001, and the National Science Foundation under Grants No. NSF MIP-8410110, NSF CCR-8717942, and NSF CCR-890003N.

April 1, 1989

Abstract

The CEDAR system under development in the Center for Supercomputing Research and Development (CSRD) at the University of Illinois at Urbana-Champaign is a clustered shared-memory multiprocessor system which can be used to support parallel processing for a wide range of numerical and non-numerical applications. In this paper, we survey selected algorithms and applications for CEDAR which are under investigation by members of the CSRD Applications Group. Topics include the design of sparse basic linear algebra kernels, the Conjugate Gradient method for linear systems of equations, direct block tridiagonal linear system solvers, a domain decomposition technique for structural mechanics, boundary integral domain decomposition for partial differential equations, parallel algorithms for circuit simulation, and parallelization of a computer graphics technique (ray tracing). Performance results on the 2-cluster CEDAR system, Alliant FX/8, and Cray X-MP/48 are presented.

1 Introduction

Exploitation of the parallelism in the design of algorithms and their applications may well be the determining factor for the success of a given supercomputer. While great strides in algorithm performance have been achieved

on pipelined architectures (Cray X-MP and CDC Cyber 205), there are many applications which cannot be vectorized efficiently. This can be due to insufficient vector lengths or complex data dependencies.

This phenomenon led to the design of architectures which provide not only vectorization but also concurrency. With the associated higher computational rates for such machines, more complex memory systems were needed to satisfy fast data transfers. Today, machines such as the Cray 2, ETA-10, and CEDAR provide vectorization/concurrency at the required data rates via hierarchical memory systems. Thus, the design of efficient algorithms for machines supporting such complex memory systems becomes of paramount importance.

In this paper, we present a survey of the design of key algorithms and applications for the CEDAR multiprocessor. Where applicable, performance results on 2 CEDAR clusters are discussed. In the next section, we outline the significant hardware and software aspects of the CEDAR multiprocessor, and follow in succeeding sections with discussions on algorithmic designs in sparse basic linear algebra kernels, the Conjugate Gradient method for linear systems of equations, the solution of block tridiagonal linear systems, domain decomposition techniques for structural mechanics and for the solution of partial differential equations, algorithms for circuit simulation, and a computer graphics technique (ray tracing).

2 CEDAR Architecture and Software

The CEDAR supercomputer [17], under development at the Center for Supercomputing Research and Development of the University of Illinois at Urbana-Champaign, consists of multiple processor clusters connected through an interconnection network to a globally shared memory. The prototype machine is planned to have 4 clusters, each a slightly modified Alliant FX/8 multiprocessor. The major hardware and software features of this machine are described below.

2.1 Organization

Each CEDAR cluster may contain up to 8 computational elements (CE's) or main processors (the current system has 2 CE's per cluster). Each CE is a microprogrammed general purpose computer consisting of a 5-stage unit and a register set which includes eight 64-bit, 32-element vector registers. These CE's are connected to a shared cache and a concurrency control bus. The

cache is connected to the Alliant's main memory through the memory bus, which has a maximum bandwidth of 188 megabytes/second. The FX/8 can also include up to 12 interactive processors (IP's) to execute the interactive components of user jobs and to perform input/output and other operating system activities. The IP's are connected to a cache (up to three IP's may share a cache) which in turn is connected to main memory through the memory bus. Coherency between the main memory, the CE shared cache, and the IP caches is maintained by the memory bus. A global interface between each CE and a global network input port provides a pathway between a CE and the global network.

2.2 Memory Hierarchy

The memory hierarchy in the CEDAR system includes an I/O system, a shared global memory, a cluster memory, and a shared cache in each cluster. A virtual memory system with a page size of 4 kilobytes is also supported. Between the cluster memory and the CE's, the 128 kilobyte shared write-back cache is 4-way interleaved so that all CE's in a cluster share four ports to the cache. We note that the bandwidth of the shared cache matches the issuing rate of the CE's unless severe conflicts occur. Additionally, each cluster has up to 32 megabytes of cluster memory in the prototype machine. On the current 2-cluster CEDAR system, each cluster has 16 megabytes of cluster memory, the size of the global memory is 4 megabytes, and the shared data cache (64 kilobytes) is 2-way interleaved.

The shared global memory has multiple memory modules so that array data can be stored across all memory modules in order to reduce memory conflicts and to allow data to be accessed in parallel. Long global memory access delays can be offset in two ways (see [9] and [19]):

1. For programs with good data locality (such as numerical methods for domain decomposition), the cluster memory with a smaller and faster shared data cache can be used to minimize the number of global memory accesses from a CE.

2. Array data can be prefetched into local buffers in the global interface modules before they are needed.

2.3 Software

While CEDAR hardware supports vector instructions and intracluster parallelism, intercluster parallelism is supported by the software of the Xylem operating system kernel [8] and the CEDAR Fortran run-time library. Xylem is based on Alliant's operating system, Concentrix, which is in turn based on Berkeley's 4.2 implementation of UNIX.

The constructs in CEDAR Fortran give the programmer access to the primary architectural features. For example, variables and arrays can be declared with the attribute SYNC to specify that the variable, or each element of the array, is a synchronization structure with two components: data and key. Variables or arrays may also be declared GLOBAL or CLUSTER. These declarations determine the location attribute of the page where the variables and arrays are allocated (see [16]).

The request to spawn a new cluster task can be made in CEDAR Fortran by calling the CTSKSTART routine. Synchronization of the spawned task with previously existing tasks can be achieved via the CTSKWAIT kernel. Parameters for the new cluster task may be passed by value or by reference.

Given this brief overview of the CEDAR system, we now survey the design of some numerical and non-numerical algorithms which have been or will be implemented on CEDAR prototype systems.

3 Designing Sparse Basic Linear Algebra Subroutines

With the increasing complexity of writing efficient codes for novel super/parallel architectures, scientific software packages built on highly efficiently implemented basic computational (BLAS) primitives are becoming more prevalent. Whereas the development of these BLAS primitives for dense computations evolved rapidly and has received a lot of attention in the literature (see [6] and [10]), the same cannot be said for sparse computations. This is mainly due to the fact that the computational complexity of sparse computations on parallel/super computers is not very well understood, and that there are no paradigms developed for implementing sparse computations on these architectures.

Sparse computations are characterized by the intrinsic complexity of the data handling. This is mainly caused by the fact that sparse matrices are stored in a condensed format in order to minimize the storage requirements. Some of the commonly used storage formats are the Coordinate Scheme, Sparse Row(Column)-wise Format [15], and Jagged Diagonal Presentation [21].
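As a concrete point of reference for the discussion that follows, the sketch below shows a sparse matrix times dense vector product (SpMxV) for a matrix held in the sparse row-wise (CSR) format: the nonzero values are packed into one array together with their column indices and a row pointer array. It is a minimal illustration in modern standard Fortran rather than CEDAR Fortran, and the routine and array names (spmxv, val, colind, rowptr) are our own, not taken from the paper.

subroutine spmxv(n, rowptr, colind, val, x, y)
  ! y <- A*x for an n-by-n sparse matrix A stored row-wise (CSR):
  !   val(rowptr(i):rowptr(i+1)-1)    = nonzero values of row i
  !   colind(rowptr(i):rowptr(i+1)-1) = their column indices
  implicit none
  integer, intent(in)  :: n
  integer, intent(in)  :: rowptr(n+1), colind(*)
  real,    intent(in)  :: val(*), x(n)
  real,    intent(out) :: y(n)
  integer :: i, k
  real    :: s

  do i = 1, n
     s = 0.0
     ! the inner loop gathers x through the column indices
     ! (indirect addressing), the main obstacle to straightforward
     ! vectorization of sparse kernels
     do k = rowptr(i), rowptr(i+1) - 1
        s = s + val(k) * x(colind(k))
     end do
     y(i) = s
  end do
end subroutine spmxv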

On an architecture with a hierarchical memory system, like CEDAR, the data manipulation requirements become even more stringent. In order to cope with the complexity of the data handling, a design methodology for sparse BLAS primitives is introduced which is based on the data manipulation capabilities of the architecture as well as the requirements imposed by the sparse computation. The design methodology essentially consists of the following four steps:

- defining the suitable data access types,
- handling data locality,
- handling the irregularity of the computation,
- handling parallelism.

We will demonstrate this for the Sparse Matrix times Dense Matrix (or Vector) primitive (SpMxM(V)). The primitive SpMxM is commonly used, for example, in iterative methods for determining the eigenvectors of a sparse matrix. At each iteration, the iteration matrix is multiplied with the approximate eigenvectors. The primitive SpMxV, which is a special case of the former one, is the crucial component with respect to performance in most iterative solvers.

In vector/concurrent architectures (e.g., CEDAR, Alliant FX series, Cray series), there are essentially two different types of data access: vector and scalar. In the SpMxM primitive, there are two realizations of these vector accesses. One realization is based upon the rows/columns of matrix A and/or B, and the other realization is obtained by extending each row (column) of A to a full row (column) by shifting all the non-zero entries of A to the top (to the right). A represents the sparse matrix and B the dense matrix throughout this section. The following table depicts which combinations of these accesses are appropriate for the implementation of SpMxM:

    A access:     scalar   row   column   ext. row   ext. column
    B: scalar       X   X
    B: row          X
    B: column       X   X

The scalar-A/scalar-B version is certainly implementable but does not exploit the vector capabilities of the architecture under investigation. This leaves us with four different types of implementation for SpMxM.

In the CEDAR architecture, data can reside in essentially four different locations: (vector) registers, cluster cache, cluster memory, and global memory. Because access times and storage capacities differ for the possible data locations in the CEDAR architecture, care has to be taken with respect to data locality. By forcing data to reside in the highest level of the system, e.g., vector registers or cache, for as long as possible, one can realize maximum computational throughput. Such data locality in a computation is largely determined by the time delay between re-usage of data. The following strategy is used to optimize data handling in the CEDAR architecture:

1. Vector register utilization: reduction of the number of data streams to be accessed by each CE.
2. Cache utilization: reduction of the length of the data streams.
3. Cluster memory utilization: global decomposition of the data structures.

The reduction of the number of data streams is obtained by keeping each operand as long as possible in a vector register during the computations. For each of the above four versions, we have two variants. For example, in the scalar-A/row-B version, either each row of A or the result matrix can be kept in a vector register for as long as possible. The number of data streams can be decreased even further by applying a blocking technique. By blocking we mean that the innermost loop of a nested loop is not iterated for a maximal number of times, but only in chunks of a certain length.

The length of the data streams can be reduced by again applying a blocking technique. This blocking technique is applied in the same way as above, but its functionality is quite different. Whereas both blocking techniques try to decompose a DO-loop into sections of a certain length, the first blocking technique is constrained to the vector processing capabilities of the architecture. In order to distinguish the two forms of blocking, we refer to the latter as vertical blocking.

The utilization of cluster memory differs from cache and vector registers in the sense that its size is not as restrictive. Cluster memory can hold up to 2 megawords of data. On the other hand, cluster memory is not shared by all the processors. This means that a decomposition of the data structures (algorithm) is necessary in order to avoid having multiple copies of the same data structure across the cluster memories.
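Both uses of blocking described above amount to strip-mining a DO-loop, i.e., iterating over it in chunks whose length is tied either to the vector register length or to the cache capacity. The fragment below is a generic illustration in standard Fortran; the chunk length lb and the operation (a vector update) are illustrative assumptions only.

subroutine blocked_axpy(n, a, x, y)
  ! Strip-mined (blocked) form of  y(1:n) = y(1:n) + a*x(1:n).
  ! For register blocking, lb would match the vector register length
  ! (32 elements on a CEDAR CE); for vertical blocking it would be
  ! chosen so that the strips re-used by an enclosing loop fit in cache.
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a, x(n)
  real,    intent(inout) :: y(n)
  integer, parameter :: lb = 32
  integer :: j, jend

  do j = 1, n, lb
     jend = min(j + lb - 1, n)
     y(j:jend) = y(j:jend) + a * x(j:jend)
  end do
end subroutine blocked_axpy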

Note that for the primitive SpMxM it is not possible to decompose the sparse matrix, the dense matrix, and the dense result matrix in such a way that no data has to be shared. This decomposition can be obtained in essentially three different manners. First, the sparse matrix is shared, and the dense and the result matrix are vertically sliced; secondly, the dense matrix is shared, and the sparse matrix and the result matrix are horizontally sliced; and thirdly, the result matrix is shared, the sparse matrix is vertically sliced, and the dense matrix is horizontally sliced. Because of the condensed storage formats used for the sparse matrix, the slicing of the sparse matrix is more time consuming than the decomposition of the dense or result matrix. However, there is a trade-off between the size of the sparse matrix and the size of the dense and result matrix. If the size of the sparse matrix becomes very large with respect to the dense matrix, then sharing the sparse matrix becomes worthwhile. The shared data can either be copied across the clusters into cluster memory or kept shared in global memory. The latter choice has the advantage that the requests for data can be shared by both the memory bus of cluster memory and the global memory network.

The effects of indirect addressing and parallelization were left out of this discussion. For a more detailed account of these issues and experimental data on the implementation of the primitives SpMxM(V), the reader is referred to [30].

4 Conjugate Gradient Method

Consider the linear system of n^2 equations

    A x = f,                                                          (1)

where A is a symmetric positive definite block tridiagonal matrix with tridiagonal n x n diagonal blocks and diagonal off-diagonal blocks. We can solve (1) iteratively using the classical Conjugate Gradient (CG) algorithm (see [20]), which has the following form. Given x0, an arbitrary initial approximation to x:

    r <- f - A x0;   p <- r;   gamma <- r^T r                         (2)
    do until stopping criteria are fulfilled
        q         <- A p
        sigma     <- p^T q
        alpha     <- gamma / sigma
        x         <- x + alpha p
        r         <- r - alpha q
        gamma_new <- r^T r
        beta      <- gamma_new / gamma
        p         <- r + beta p
        gamma     <- gamma_new
    end

The basic iteration of the classical conjugate gradient algorithm can be vectorized very efficiently for well-structured problems such as the one considered here. The elementary operations required are matrix-vector multiplications, dot products, and linear combinations of vectors.
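A compact rendering of the iteration above is given below in standard Fortran for a matrix supplied in dense form. It is a single-processor sketch meant only to make the operation mix explicit (one matrix-vector product, two dot products, and three vector updates per iteration); the routine and variable names are ours, and the convergence test on the residual norm is one common choice, not necessarily the stopping criterion used on CEDAR.

subroutine cg(n, a, f, x, tol, maxit)
  ! Classical conjugate gradient iteration for A*x = f,
  ! A symmetric positive definite (dense storage for simplicity).
  implicit none
  integer, intent(in)    :: n, maxit
  real,    intent(in)    :: a(n,n), f(n), tol
  real,    intent(inout) :: x(n)          ! on entry: initial guess x0
  real    :: r(n), p(n), q(n)
  real    :: alpha, beta, gamma, gamma_new, sigma
  integer :: it

  r     = f - matmul(a, x)
  p     = r
  gamma = dot_product(r, r)

  do it = 1, maxit
     if (sqrt(gamma) <= tol) exit
     q         = matmul(a, p)              ! matrix-vector product
     sigma     = dot_product(p, q)         ! dot product
     alpha     = gamma / sigma
     x         = x + alpha * p             ! vector updates
     r         = r - alpha * q
     gamma_new = dot_product(r, r)         ! dot product
     beta      = gamma_new / gamma
     p         = r + beta * p
     gamma     = gamma_new
  end do
end subroutine cg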

The algorithm was implemented on the 2-cluster CEDAR system by splitting the matrix A and the associated vectors as shown below (note that the pi must be chosen slightly larger than shown here in order to compute Ai pi), and performing the operations involving Ai, xi, bi, pi, qi and ri on cluster i, i = 1, 2:

    A = [ A1 ]   x = [ x1 ]   b = [ b1 ]   p = [ p1 ]   q = [ q1 ]   r = [ r1 ]
        [ A2 ]       [ x2 ]       [ b2 ]       [ p2 ]       [ q2 ]       [ r2 ]

Not all of these operations are, however, independent of the results obtained on the other cluster. For the computation of the dot products sigma and gamma_new, the partial results sigma_i and gamma_i have to be accumulated. For the evaluation of the matrix-vector product A1 p1 (A2 p2), n elements of p2 (p1), which are calculated in cluster 2 (1), are required. For the exchange of those elements, a work array w of length 2n is created in global memory (GM). The original matrix A, the right-hand side b, the solution x, as well as the variables alpha, beta, gamma, gamma_new, gamma2, sigma, sigma2, and w are stored in global memory, whereas Ai, xi, bi, pi, qi and ri are created in cluster memory. The contents of A and b are initially copied into Ai and bi, respectively, and the contents of xi are copied back into x before termination. The iteration loop as implemented on 2 clusters is given in Table 1.

    Task 1 (cluster 1)                       Task 2 (cluster 2)
    q1 <- A1 p1                              q2 <- A2 p2
    sigma1 <- p1^T q1                        sigma2 <- p2^T q2
    sigma  <- sigma1 + sigma2     (needs sigma2 from Task 2)
    alpha  <- gamma / sigma
    x1 <- x1 + alpha p1                      x2 <- x2 + alpha p2   (needs alpha)
    r1 <- r1 - alpha q1                      r2 <- r2 - alpha q2
    gamma1 <- r1^T r1                        gamma2 <- r2^T r2
    gamma_new <- gamma1 + gamma2  (needs gamma2 from Task 2)
    beta <- gamma_new / gamma;  gamma <- gamma_new
    p1 <- r1 + beta p1                       p2 <- r2 + beta p2    (needs beta)
    copy part of p1 to and from GM           copy part of p2 to and from GM

Table 1: Cluster task operations for the CG method on the 2-cluster CEDAR system. The annotations in parentheses mark operations with intercluster data sharing (indicated by arrows in the original table).

Both versions, the 1-cluster version as well as the 2-cluster version, were implemented on the 2-cluster CEDAR system with 2 processors in each cluster. The experiments in Table 2 were performed for the Laplace equation with Dirichlet boundary conditions on an n x n grid. The timings given below are wall clock times (in seconds), the best out of at least 5 runs for each system size. The results in Table 2 indicate that the performance of the 2-cluster version of the Conjugate Gradient method approaches the ideal speedup of 2 for increasing system orders n.

    n     iterations   1-cluster CG   2-cluster CG   speedup
    64       113            3.46           2.18        1.59
    96       168           12.03           7.14        1.68
    128      223           28.74          16.04        1.79
    160      278           56.86          29.83        1.91
    192      333           96.42          51.80        1.86
    224      387          156.75          81.25        1.93

Table 2: Performance of the Conjugate Gradient method on the 2-cluster CEDAR system (wall-clock times in seconds).

5 Solving Block Tridiagonal Linear Systems

The solution of narrow-banded or block tridiagonal linear systems of equations comprises one of the more dominant computations associated with implementations of the finite element method in scientific applications. Let us suppose, as is the case in several important situations, that these systems are either diagonally dominant or symmetric positive definite. Let the block tridiagonal system be given by

    A x = f,                                                          (3)

where A is of order N = nm (n >> 1, m >= 16) and has the form

    A = [ A1   C1                                ]
        [ B1   A2   C2                           ]
        [      B2   A3   C3                      ]
        [           .    .     .                 ]
        [            B(n-2)  A(n-1)  C(n-1)      ]
        [                    B(n-1)  An          ]

We assume that the Ai, Bi, Ci are each of order m and, in the case that A is symmetric positive definite, we have Bi = Ci^T. For banded systems, the Bi and Ci are upper and lower triangular matrices, respectively.

5.1 Spike Algorithm

The algorithm we have implemented for the solution of (3) on the 2-cluster CEDAR system, which capitalizes upon the efficiency of the block LU (LDL^T) factorization and associated block column sweep algorithms for diagonally dominant (symmetric positive definite) block tridiagonal linear systems, is discussed in [1]. The derivation is based upon dense block LU (LDL^T) algorithms which primarily rely on matrix-matrix (BLAS3) primitives for their efficiency. As discussed in [10], such block methods are preferable on hierarchical memory architectures, such as CEDAR, in that high performance can be sustained via improved data locality. Our algorithm, SPIKE, comprises the highest level of the following hierarchy of efficient BLAS3-based computational modules:

    Level   Description
    3       SPIKE (2-cluster CEDAR)
    2       Block LU (LDL^T) for block tridiagonal systems (1 CEDAR cluster)
    1       Block LU (LDL^T) for dense systems
    0       BLAS3 (matrix-matrix primitives)

The following SPIKE algorithm (see [25]) on the 2-cluster CEDAR system partitions the original block tridiagonal system in (3) into 2 subsystems in which each subsystem has m+1 right-hand sides (including the original right-hand side f), where m is the block size of matrix A. After the solutions to these two independent systems are obtained in parallel, a reduced system which restores the necessary coupling information is formulated and solved. Given the solution of the reduced system, one can easily recover the remaining disjoint partitions of the desired solution vector x using the 2 clusters of the current CEDAR system.

5.2 CEDAR Implementation

Initially, we partition the N x N matrix A in (3) into 2 coupled block tridiagonal matrices, A~1 and A~2, each of order mn/2, where N = nm and m is the block size in matrix A. For our illustrations, we assume that n = 2^q and m = 2^(q+1) for an integer q such that q >= 3. Given the partitioning in Figure 1, cluster i will load A~i from either global or cluster memory, compute its factorization, and store the factors in cluster memory. Let

    B1 = [ B~1 ]        C1 = [ O   ]        (each of dimension mn/2 x m)
         [ O   ]             [ C~1 ]

and

    f = [ f1 ]          (each fi of length mn/2);
        [ f2 ]

then the first phase of SPIKE on the 2-cluster CEDAR system is given by:

    Cluster   Tasks
    1         Load f1, C1, and A~1 from global memory
              (or, alternatively, generate f1, C1, and A~1 in cluster memory).
              Factor A~1 and store the factorization in cluster memory.
              Solve A~1 [W1 | g1] = [C1 | f1].
    2         Load f2, B1, and A~2 from global memory
              (or, alternatively, generate f2, B1, and A~2 in cluster memory).
              Factor A~2 and store the factorization in cluster memory.
              Solve A~2 [W2 | g2] = [B1 | f2].

The coefficient matrix of the resulting global system of equations may then be represented by Figure 2, where the spikes, Y1 and W1, are mn/2 x m matrices.

If we assume that matrix A in (3) is diagonally dominant (or symmetric positive definite), then it directly follows that each block tridiagonal matrix A~i is also diagonally dominant (or symmetric positive definite). Hence, we can apply a block LU (LDL^T) algorithm (see [1]) on each CEDAR cluster in order to solve the above systems with multiple right-hand sides in parallel.

    A = [ A~1   C~1 ]
        [ B~1   A~2 ]

Figure 1: Partitioning of matrix A for the 2-cluster CEDAR system.

Given W1, Y1 (= W2), g1, and g2, we can form and solve the corresponding 2m x 2m reduced system. If

    W1 = [ w1; ...; wk; ...; w(n/2) ],      Y1 = [ y1; ...; yk; ...; y(n/2) ],
    g1 = [ g1; ...; g(n/2) ],               g2 = [ g(n/2+1); ...; g(n) ],

where the wk's and yk's are m x m blocks and the gk's are m-element column vectors, then the reduced system for the case n = 8 is given by

    [ I    w4 ] [ x4 ]   [ g4 ]
    [ y1   I  ] [ x5 ] = [ g5 ].

If the original system of equations (3) is sufficiently diagonally dominant, we can solve the 2m x 2m reduced system via a dense block LU algorithm (no pivoting) on one CEDAR cluster. Given the solution of the reduced system, one then recovers the remaining partitions of the desired solution vector x. This retrieval phase of SPIKE is illustrated in Figure 2 for the case n = 8. Given the solution vector partitions (m-element vectors copied to global memory) x4 and x5, obtained from the solution of the reduced system, we recover the remaining partitions (in parallel)

    u1 = [ x1 ]          u2 = [ x6 ]
         [ x2 ]               [ x7 ]
         [ x3 ]               [ x8 ],

where

    u1 = g1' - W1' x5,
    u2 = g2' - Y1' x4,

and g1', W1' (g2', Y1') denote g1, W1 (g2, Y1) with the block row corresponding to x4 (x5) removed. We note that for cluster i to recover ui, it must fetch a nonresident solution vector partition (x(i+3)) from global memory into its vector registers. As illustrated in Figure 2 for n = 8, cluster 2 must load x4 from global memory in order to determine u2. We note that the work load between the 2 clusters for all phases of the SPIKE algorithm is exactly equal.
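To make the reduced-system and retrieval phases concrete, the sketch below forms and solves the 2m x 2m system [I, w4; y1, I][x4; x5] = [g4; g5] for the n = 8 case and then performs the retrieval u1 = g1' - W1' x5, u2 = g2' - Y1' x4. It is written in standard Fortran, with LAPACK's DGESV standing in for the dense block LU solve; the argument names and the packing of each spike as an (mn/2 x m) array are our own illustrative choices, not the CEDAR implementation.

subroutine spike_reduced(m, half, w1, y1, g1, g2, x1, x2)
  ! Reduced-system and retrieval phases of SPIKE for 2 partitions
  ! (n = 8 blocks of size m, so half = 4*m rows per partition).
  !   w1, y1 : spikes computed in the first phase (half x m each)
  !   g1, g2 : partitioned right-hand sides (length half each)
  !   x1, x2 : solution partitions returned to the caller
  implicit none
  integer, intent(in)  :: m, half
  real(8), intent(in)  :: w1(half,m), y1(half,m), g1(half), g2(half)
  real(8), intent(out) :: x1(half), x2(half)
  real(8) :: rmat(2*m,2*m), rhs(2*m)
  integer :: ipiv(2*m), info, i

  ! Form the 2m x 2m reduced system  [ I  w4 ][x4]   [g4]
  !                                  [ y1  I ][x5] = [g5],
  ! where w4 is the last m x m block of the first spike and y1 is
  ! the first m x m block of the second spike.
  rmat = 0.0d0
  do i = 1, 2*m
     rmat(i,i) = 1.0d0
  end do
  rmat(1:m,     m+1:2*m) = w1(half-m+1:half, :)    ! w4
  rmat(m+1:2*m, 1:m)     = y1(1:m, :)              ! y1
  rhs(1:m)     = g1(half-m+1:half)                 ! g4
  rhs(m+1:2*m) = g2(1:m)                           ! g5

  call dgesv(2*m, 1, rmat, 2*m, ipiv, rhs, 2*m, info)

  x1(half-m+1:half) = rhs(1:m)                     ! x4
  x2(1:m)           = rhs(m+1:2*m)                 ! x5

  ! Retrieval phase: each cluster recovers its remaining partitions.
  x1(1:half-m) = g1(1:half-m) - matmul(w1(1:half-m,:), x2(1:m))
  x2(m+1:half) = g2(m+1:half) - matmul(y1(m+1:half,:), x1(half-m+1:half))
end subroutine spike_reduced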

Figure 2: Solution of the reduced system on the 2-cluster CEDAR system when n = 8. (The figure shows the identity blocks and the spikes W1 and Y1, the partitions u1 and x4 handled by cluster 1, the partitions x5 and u2 handled by cluster 2, and the right-hand sides g1' and g2'.)

The SPIKE algorithm on the current 2-cluster CEDAR system is implemented using 2 cluster tasks which need only be synchronized upon the formation and solution of the reduced system. The arrays which store the coefficient matrices of the partitioned block tridiagonal subsystems, the A~i's, and the partitioned right-hand-side vectors, the gi's, are not shared across clusters, and can thus have the CLUSTER CEDAR Fortran memory attribute. The solution vector partitions, the xi's and ui's, however, are shared across the clusters in the retrieval phase of SPIKE, and thus require the GLOBAL memory attribute. The Yi's and Wi's, which contain the m x m subblocks used to form the reduced system, are not shared across clusters and thus can overwrite the same arrays storing the A~i's. For further descriptions of CEDAR Fortran memory concepts see [16] and [19].

5.3 Two-Cluster CEDAR Results

Two versions of the SPIKE algorithm have been implemented on the 2-cluster CEDAR system in which each cluster has 2 computational elements (CE's). The two versions of SPIKE differ only in the data generation phase of the algorithm: Version 1 assumes that the coefficient matrix A in (3) initially resides in global memory, so that A~1 with C~1 and A~2 with B~1 are copied to separate cluster memories, whereas in Version 2 we generate A~1 and A~2 in cluster memories without paying any overhead for an initial load from global memory. Both 1- and 2-cluster experiments have been made for both versions of the SPIKE algorithm on CEDAR. For the 2-cluster runs, the CEDAR CTSKSTART and CTSKWAIT synchronization primitives are used only once, to initiate the 2 cluster tasks which execute the algorithm. The synchronization for forming and solving the reduced system is efficiently

handled by Fortran BUSY/WAIT kernels.

The particular system Ax = f used in the experiments below is given by

    a(i,j) <- random number in [0, 10],
    a(i,i) <- a(i,i) + sum_j a(i,j)     (i-th row sum),
    f(i)   <- 10.0 + random number in [0, 1].

In Figure 3, we indicate the dedicated elapsed execution (wall-clock) times for solving 50 systems of orders ranging from N = 128 to N = 4096 for the blocksize m = 16. Both 1- and 2-cluster CEDAR timings are indicated (1C, 2C), and prefetching from global memory is enabled for all experiments. Versions 1 and 2 of SPIKE are denoted by V1 and V2, respectively.

Figure 3: SPIKE algorithm elapsed times for 1- and 2-cluster CEDAR experiments (WCT = wall-clock time in seconds). (The plot shows WCT, 0 to 200 seconds, versus system order, 0 to 4000, for the four runs V1/1C, V2/1C, V1/2C, and V2/2C.)

For N = 4096, speedups of 1.9 and 1.96 have been obtained for versions 1 and 2

of the SPIKE algorithm, respectively, for 2-cluster CEDAR runs relative to 1-cluster runs. We note also that, overall, the cost (in elapsed time) of the initial load from global memory in Version 1 is not significant for N >= 2048. The Version 2 SPIKE algorithm, in which the original coefficient matrix of the block tridiagonal linear system is generated completely within the CEDAR cluster memories, can be roughly 20% faster than the Version 1 implementation. Although these initial results for our SPIKE algorithm are quite promising, future research and experiments are needed in order to determine the optimal number (>= 2) of partitionable subsystems in (3) for as many as 4 CEDAR clusters.

6 Solving Partial Differential Equations

We have developed parallel techniques for the solution of separable elliptic equations. The discretization of such equations generates a block tridiagonal system of order N. The system, in this case, has a special structure, namely that the diagonal blocks are tridiagonal and the off-diagonal blocks diagonal. In order to take advantage of this special structure, we consider alternative methods to those proposed in the previous section for the solution of general block tridiagonal linear systems. Frequently the matrix has a block Toeplitz form, in that the blocks in each diagonal are the same. Finally, if A and T denote the diagonal and off-diagonal blocks respectively, a condition of the form AT = TA holds. Under certain circumstances related to the type of boundary conditions and the region under consideration, fast methods of serial complexity O(N log N) can be used. One such method is discussed below.

6.1 Block Cyclic Reduction

One common method for solving block tridiagonal systems arising from the discretization of separable elliptic equations is block cyclic reduction (BCR) ([2]). At step r = 1, ..., k-1 of this procedure, we have to solve systems of the form p_{2^(r-1)}(A) X = Y, where p_l(A) is a polynomial in A. From the fundamental theorem of algebra we can rewrite this as

    prod_{i=1}^{2^(r-1)} (A - lambda_i^(r-1) I) [x_1 | ... | x_{2^(k-r)-1}] = [y_1 | ... | y_{2^(k-r)-1}],      (4)

where A is of order m and n = 2^k - 1. Furthermore, depending on the problem, the roots lambda_i may or may not be available analytically. In any case,

it is clear that as r increases, the effectiveness of a parallel or vector machine in handling (4) decreases rapidly. To avoid this problem, [28,13] resort to the partial fraction decomposition of the inverse of the polynomial operator. This results in

    [x_1 | ... | x_{2^(k-r)-1}] = sum_{i=1}^{2^(r-1)} alpha_i^(r-1) (A - lambda_i^(r-1) I)^(-1) [y_1 | ... | y_{2^(k-r)-1}],      (5)

with the coefficients alpha_i^(r-1) equal to 1 / p'_{2^(r-1)}(lambda_i^(r-1)). A discussion of the mapping and performance of the algorithm for the Alliant FX/8 can be found in [13]. A CEDAR version ([14]) is currently under implementation. We also note that a natural 2-cluster decomposition is offered in the solution of the problem with periodic boundary conditions in the one direction. In that case, the resulting system is block tridiagonal and cyclic. As a result, it can be decoupled into 2 independent block tridiagonal systems of half size each ([27]). Moreover, one of these systems corresponds to a discretization of a problem with Dirichlet boundary conditions, whereas the other to Neumann boundary conditions. As a result, our algorithm consists of a decoupling phase, the solution of each subsystem on one cluster, and a combination phase. We add, however, that despite its simplicity the method has a disadvantage, namely the load imbalance due to the differing operation counts for the two subproblems. We finally note that the methods developed here will also serve for the solution of time dependent problems.
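The parallelism exposed by (5) is simply that the 2^(r-1) shifted systems can be solved independently and their contributions accumulated. The sketch below shows this structure for a single right-hand side, with a generic dense solve (LAPACK's DGESV) standing in for whatever tridiagonal or fast solver is appropriate; the loop over i is the one that would be spread across processors or clusters. The names and the dense storage are illustrative assumptions, not the implementation of [13] or [14].

subroutine bcr_partial_fraction(m, nroots, a, lambda, alpha, y, x)
  ! One block cyclic reduction step via partial fractions:
  !   x = sum_i alpha(i) * (A - lambda(i)*I)^(-1) * y
  ! The nroots shifted solves are independent of one another.
  implicit none
  integer, intent(in)  :: m, nroots
  real(8), intent(in)  :: a(m,m), lambda(nroots), alpha(nroots), y(m)
  real(8), intent(out) :: x(m)
  real(8) :: ashift(m,m), z(m)
  integer :: ipiv(m), info, i, j

  x = 0.0d0
  do i = 1, nroots            ! independent solves: parallelize over i
     ashift = a
     do j = 1, m
        ashift(j,j) = ashift(j,j) - lambda(i)
     end do
     z = y
     call dgesv(m, 1, ashift, m, ipiv, z, m, info)
     x = x + alpha(i) * z
  end do
end subroutine bcr_partial_fraction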

6.2 Boundary Integral Domain Decomposition

For a multi-cluster architecture with hierarchical levels in the computational and memory units, domain decomposition methods offer a natural means for mapping the physical parallelism (in the domain) onto the system. We have designed multi-cluster algorithms ([12,11]) for problems with known fundamental solutions (e.g., Laplace's equation) which first use the method of fundamental solutions to compute the solution at a few selected points in the domain. These points form a set of separator lines which cut the domain into substructures. By taking the computed values as new boundary values for each substructure, a set of independent subproblems, one for each substructure, can be formed and solved on the clusters. Clearly the solution can be computed by applying the elliptic solver most suitable for each subdomain; for appropriate regions, for example, the previously described parallel BCR method can be used. The advantage of the method is that it can be used both to compute the solution on an irregular region (by decoupling it into subproblems which admit more efficient treatment) and to increase the efficiency of a solver on a simple region, by partitioning it into problems small enough to exploit the memory hierarchy. Another important issue is the computation of the solution on the interfaces. This is done by solving, with the QR algorithm in a least-squares sense, a system with a dense coefficient matrix of dimension m x p, where p <= m. The sequential complexity of the solver is O(mp^2). Clearly p has to be small for the method to be competitive, but even then it is wasteful to solve the problem on a single cluster. As this is the part of the algorithm which resolves the coupling between the subdomains, it is also the hardest to map efficiently onto the multi-cluster. We are currently experimenting with methods which achieve this section of the computation on 2 clusters.

7 SAS Decomposition Method

The symmetric-and-antisymmetric (SAS) decomposition is a special domain decomposition method useful for a wide class of engineering application problems. Algebraically, it is a special matrix decomposition method for handling linear systems, eigenvalue problems, or generalized eigenvalue problems when the coefficient matrices involved, A in C^(n x n) say, satisfy the relation A = PAP, where P is some reflection matrix (a symmetric signed permutation matrix excluding the identity matrix) [5], [3], [4]. The matrix A is then said to be reflexive with respect to P, or said to possess the SAS property.

The method takes advantage of special properties possessed by the reflexive matrices to decompose a single problem into two or more smaller independent subproblems via orthogonal transformations. The decomposition, therefore, enables one to solve problems using large grain parallelism on shared-memory multiprocessors such as the Alliant FX/8, ETA-10, and Cray X-MP/48. This approach is especially useful for multiprocessors possessing three levels of parallelism such as the CEDAR system.

7.1 SAS for Linear Systems

To serve as an example of the SAS decomposition method, we consider solving the following linear system

    A x = f,                                                          (6)

where A in C^(n x n) is assumed to be nonsingular and reflexive with respect to some reflection matrix P taking the form

    P = [ O     P1^T ]
        [ P1    O    ],

P1 being some signed permutation matrix of order k. Here we implicitly assume that n = 2k. Let Q be the matrix

    Q = (1/sqrt(2)) [ I     -P1^T ]
                    [ P1     I    ].

Instead of solving (6) directly, the SAS decomposition method solves the problem by applying an orthogonal transformation to the linear system using Q, which leads to the form

    A~ x~ = f~,

where A~ = Q^T A Q, x~ = Q^T x, and f~ = Q^T f. Note that if we partition the matrix A into 2 x 2 blocks,

    A = [ A11   A12 ]
        [ A21   A22 ],

then we have

    A~ = [ A11 + A12 P1            O           ]                      (7)
         [       O            A22 - A21 P1^T   ].

From (7), it is clear that the linear system has been decomposed into two independent subsystems. This decoupling is a direct consequence of the assumption that the matrix A is reflexive with respect to P.

In many engineering applications, both submatrices A11 + A12 P1 and A22 - A21 P1^T still possess the SAS property, with respect to some other reflection matrix. The decomposition can then be carried further to yield four independent subsystems, with each submatrix of order approximately equal to one quarter of the order of A. This decomposition procedure can be applied recursively to the subsystems until no disjoint submatrices possess this property.
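A small sketch of this splitting for the simplest reflection, P1 equal to the identity of order k (so that P merely exchanges the two halves of the unknowns), is given below in standard Fortran, with LAPACK's DGESV used for the two independent subsystem solves. For this choice, reflexivity means A22 = A11 and A21 = A12, so the subsystems are A11 + A12 and A11 - A12; a general signed permutation P1 would be handled analogously. All names are illustrative assumptions, not code from the paper.

subroutine sas_solve(k, a11, a12, f, x)
  ! SAS decomposition solve of A*x = f for a matrix of order n = 2k
  ! that is reflexive with respect to P = [ 0 I ; I 0 ]  (P1 = I),
  ! i.e.  A = [ a11 a12 ; a12 a11 ].
  ! The two subsystems (a11+a12) and (a11-a12) are independent and
  ! could be solved on different clusters.
  implicit none
  integer, intent(in)  :: k
  real(8), intent(in)  :: a11(k,k), a12(k,k), f(2*k)
  real(8), intent(out) :: x(2*k)
  real(8) :: bplus(k,k), bminus(k,k), ft1(k), ft2(k), rsqrt2
  integer :: ipiv(k), info

  rsqrt2 = 1.0d0 / sqrt(2.0d0)

  ! Transformed right-hand side  f~ = Q^T f.
  ft1 = rsqrt2 * (f(1:k) + f(k+1:2*k))
  ft2 = rsqrt2 * (f(k+1:2*k) - f(1:k))

  ! Independent subsystems  (a11+a12) x~1 = f~1  and  (a11-a12) x~2 = f~2.
  bplus  = a11 + a12
  bminus = a11 - a12
  call dgesv(k, 1, bplus,  k, ipiv, ft1, k, info)   ! ft1 becomes x~1
  call dgesv(k, 1, bminus, k, ipiv, ft2, k, info)   ! ft2 becomes x~2

  ! Back-transform  x = Q x~.
  x(1:k)     = rsqrt2 * (ft1 - ft2)
  x(k+1:2*k) = rsqrt2 * (ft1 + ft2)
end subroutine sas_solve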

7.2 CEDAR Implementation

Efficient implementations of the SAS scheme on parallel computers depend not only on the architecture of a given machine, but also on how the compiler and operating system are designed to handle the program and the data. Table 3 shows the potential for parallelism inherent in solving linear systems via the SAS approach, whenever applicable, where s is the number of subdomains or submatrices decomposed by the SAS techniques. Note that the decomposition of the right-hand side b into the bi and the retrieval of the solution x from the xi, i = 1, ..., s, involve only vector operations. On CEDAR, each cluster can be used not only to form an Ai matrix but also to solve the corresponding system Ai xi = bi. The CEDAR CTSKSTART and CTSKWAIT synchronization primitives can be used to fork all cluster tasks.

    Non-SAS Algorithm        SAS Algorithm
    Form A                   Form A1, A2, ..., As
    Form b                   Form b
                             Decompose b into b1, b2, ..., bs
    Solve Ax = b             Solve Ai xi = bi  (i = 1, ..., s)
                             Retrieve x from x1, x2, ..., xs

Table 3: Parallelism in the SAS approach.

So far, we have not specified how each linear subsystem is to be solved. In principle, any algorithm which can be used to solve the undecomposed linear system can be used to solve the subsystems, no matter whether the algorithm is of a sequential or parallel nature. The independence of the subsystems resulting from the SAS domain decomposition implies high level parallelism. Hence, on machines such as the Alliant FX/8 or Cray X-MP/48 one can solve one subsystem per processor using a vectorized solver. On CEDAR, however, one can solve one subsystem on each cluster using parallel solvers, thus taking full advantage of the three levels of parallelism.

7.3 Alliant FX/8 Results

The SAS approach has been successfully employed on the Alliant FX/8 for solving algebraic linear systems and generalized eigenvalue problems derived from the finite element approximation to elasticity problems. Figure 4 shows

some timing results obtained from the application of this approach to a 3D long beam by decomposing the domain into eight subdomains. The speedups from employing 2, 4, and 8 processors of the Alliant FX/8 in solving the linear subsystems using the Cholesky decomposition method are approximately 1.98, 3.75, and 6.75, respectively. See [3] and [4] for more details. The implementation of this method on 2 clusters of CEDAR is in progress and the results are forthcoming.

Figure 4: Execution times (CPU) for solving linear systems on the Alliant FX/8. (The plot shows time in seconds versus scaled system order for 1, 2, 4, and 8 processors: FX/1, FX/2, FX/4, FX/8.)

8 Circuit Simulation

One of the objectives of the CSRD Applications Group is to demonstrate that a machine like CEDAR, with hierarchical memory and cluster constructs, can perform well on a wide range of problems arising from scientific

and engineering applications, particularly for large scale production codes. Based on this conviction, circuit simulation was included as a major research topic in the very early stage of the project, and a public domain circuit simulator called SPICE from UC Berkeley (see [22] and [26]) was chosen as our main focus. Mathematically, the problem arising from circuit simulation is a system of stiff Differential-Algebraic Equations (DAE). As in most DAE solvers, there are three numerical methods involved in the solution process: stiffly-stable numerical integration methods, Newton-Raphson methods for nonlinear algebraic equations, and the solution of sparse linear systems.

For the past ten years, various efforts have been directed at vectorizing SPICE on supercomputers like the Cray-1, through restructuring compiler technology or redesigning parallel algorithms. The results have been quite disappointing, since the underlying operations in circuit simulation are basically scalar operations. Our own experience in this direction confirms that there are few, if any, vectorizable loops in the code. Among a total of 123 subroutines in SPICE, there is no single dominating subroutine during an entire simulation run. Because of the lack of array operations, circuit simulation has become one of the most difficult problems in the area of parallel algorithm design and software development.

Our main efforts in the last couple of years have involved the design of parallel algorithms for SPICE on a single CEDAR cluster. Although array operations do not exist in SPICE, it was recognized through a detailed analysis of the solution process that concurrency is feasible at a higher level of the solution algorithm hierarchy. This implies that multiple scalar operations are the key to designing parallel algorithms for SPICE, and that multitasking is the tool to achieve this goal. Therefore, the single most important architectural feature needed is the efficient scheduling of multiple subroutine calls. On a CEDAR cluster, this can be achieved by hardware synchronization control with very low overhead. Through detailed profiling of SPICE runs, three important modules are identifiable: model evaluation and assembly of the Jacobians (LOAD), the solution of sparse linear systems (SOLVE), and local truncation error estimation for step size control (TRUNC). Together, they account for over 90% of the total simulation time.

Parallel algorithm design for model evaluation and assembly of the Jacobians is conceptually very simple at the algorithmic level. As in finite-element analysis, each component of the circuit computes its own contribution to the Jacobian locally. The module can be processed in parallel, and the only synchronization point needed is in the assembly phase. The difficulty lies in how to achieve such a goal. In other words, it is the software development

that dominates our efforts. Currently, there are two design schemes we are considering: lock and lockless. Apparently, the lock scheme performs better at this stage. Table 4 indicates that the speedup for the LOAD module ranges from 6.5 to 7.6. The high efficiency achieved here is due to a careful design which reduces the number of operations in the critical region and better utilizes the data cache in the CEDAR cluster.

Parallel algorithm design for the TRUNC module is very similar to that of the LOAD module, except that in this case no synchronization is required. Again, each component of the energy-storage type computes its own local truncation error and estimates the suggested step size locally. The final step size for the next time point is the smallest among all suggested values. As shown in Table 4, the speedup for the TRUNC module is about 7 on the average.
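The structure of the TRUNC computation is simply a minimum reduction over the step sizes proposed by the individual energy-storage elements, which is why no synchronization beyond the final reduction is needed. The sketch below expresses this in standard Fortran, with an OpenMP reduction clause standing in for the cluster's hardware-supported concurrent loop; the routine suggested_step and all other names are illustrative, not taken from SPICE.

function next_step(ncomp, hmax, suggested_step) result(hnext)
  ! Local truncation error control: every energy-storage component
  ! proposes a step size; the next time step is the smallest of them.
  implicit none
  integer, intent(in) :: ncomp
  real(8), intent(in) :: hmax
  interface
     function suggested_step(i) result(h)
       integer, intent(in) :: i
       real(8) :: h
     end function suggested_step
  end interface
  real(8) :: hnext
  integer :: i

  hnext = hmax
  !$omp parallel do reduction(min:hnext)
  do i = 1, ncomp
     hnext = min(hnext, suggested_step(i))   ! each component works locally
  end do
  !$omp end parallel do
end function next_step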

Parallel algorithm design for sparse linear systems is the most intellectually challenging problem among the three modules. Linear systems arising from circuit simulation are unstructured, nonsymmetric, and indefinite. Usually there are thousands of linear systems with the same sparsity pattern which need to be solved during the simulation. Therefore, the solver requires both ONE-OFF and 2-OFF methods. The important characteristic of these sparse matrices is that they are extremely sparse, over 99% zeros, and very irregular. Typically, the number of nonzeros per row or column in the factored matrix grows linearly with respect to the order of the matrix, and the constant of the linear function is around 10. Recently, great progress has been made in solving large scale sparse symmetric positive definite linear systems arising from structural analysis problems. Parallel algorithms, such as the supernodal general sparse or multifrontal methods, have achieved peak performance speeds on Cray X-MP and Cray-2 machines. One important factor in their success lies in the fact that the problems being solved usually have high average row counts (around 800) in the factored matrices. Therefore, high utilization of the vector and hardware gather/scatter features underlying the architecture is possible. For the same reason, these methods fail to achieve similar results for network problems like electric power systems and circuit simulation.

Our effort in this area has produced a stand-alone parallel direct sparse matrix package called DSpack. DSpack provides the same functionality as MA28 ([7]) except drop tolerance and iterative refinement. It parallelizes every step involved in the ONE-OFF problem, and provides three (fine-grained, medium-grained, and coarse-grained) parallel computing models for 2-OFF problems. The current software package allows one to experiment with various pivoting strategies and many other options in the system. Benchmark tests conducted on the Boeing-Harwell sparse matrix collection have shown that the package is extremely robust. Currently, DSpack running on a CEDAR cluster outperforms MA28 by a factor of 20 for network problems and a factor of 5 for high fill-in problems. The algorithms used in DSpack have been customized and embedded into SPICE. The result, as shown in Table 4, is about 4.1 to 5.4 times faster than the original code.

The overall speedup for the total simulation time ranges from 5.1 to 5.9, as shown in Table 5. To date, this is the highest speedup achieved on a CEDAR cluster for a real-world production code. Comparisons with a Cray X-MP/48 machine (the Cray X-MP/48 system at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign) with an 8.5 nanosecond clock cycle time are also listed in Table 5. On the average, our parallel SPICE code on a CEDAR cluster runs at about 46% of the speed of 1 CPU of a Cray X-MP/48 machine. The result is impressive when taking into account the fact that the ratio of peak performances of the two machines is 94/235 = 40%, and that the ratio of the best dense codes run on the two machines is 35/215 = 16%. Since all the parallel algorithms designed for the three modules can be scaled up naturally, we expect that our parallel SPICE code running on a two-cluster CEDAR will match 1 CPU of a Cray X-MP machine.

9 Ray Tracing on CEDAR

9.1 Background

Ray tracing is a computer graphics technique for producing some of the most photorealistic images possible today ([29], [24]). The model to be rendered is usually a list of simple geometric primitives. The algorithm iterates through each pixel (sample point) in the output image to compute the color and intensity of that pixel. For display, the output image is naturally organized into y horizontal scan lines of x pixels each. The image plane is placed between the eye and the model.

For each pixel, a primary ray is generated from the eye through that pixel. This ray is intersected with all objects in the scene and the closest intersection is preserved. The surface properties and transparency of the selected object determine which of several new rays will be spawned, simulating the reflection and refraction of light (see Figure 5).

    LOAD
    Data    FX/1     FX/8    Speedup
    CKT1    52.34    7.86    6.7
    CKT2    76.07    11.73   6.5
    CKT3    175.59   22.99   7.6
    CKT4    181.76   25.01   7.3

    SOLVE
    Data    FX/1     FX/8    Speedup
    CKT1    27.75    5.15    5.4
    CKT2    18.00    4.33    4.1
    CKT3    61.83    13.37   4.6
    CKT4    56.36    10.96   5.1

    TRUNC
    Data    FX/1     FX/8    Speedup
    CKT1    7.43     1.05    7.1
    CKT2    17.84    2.42    7.4
    CKT3    10.32    1.44    7.2
    CKT4    10.78    1.68    6.4

Table 4: Profile of SPICE modules on 1 (FX/1) and 8 (FX/8) processors of the Alliant FX/8 (times in seconds).

    Total Simulation Time
    Data    FX/1     FX/8    Speedup   Cray X-MP   FX/8 relative to Cray X-MP/48
    CKT1    94.96    17.93   5.3       7.94        44%
    CKT2    119.65   23.45   5.1       10.95       47%
    CKT3    257.20   43.69   5.9       19.01       44%
    CKT4    259.22   44.74   5.8       22.37       50%

Table 5: Total CPU time for SPICE runs on 1 (FX/1) and 8 (FX/8) processors of the Alliant FX/8.

In this way, a ray tree is built until a stopping criterion is encountered (for example, the maximum tree depth is reached, a ray goes off into space with no intersection, or a ray hits a light source). The color and brightness of each surface and light encountered are propagated back up the tree to yield the final value for that pixel. The bulk of the time spent in ray tracing is in determining which objects are intersected. In some cases this may require the use of iterative equation solvers. Substantially less time is spent in the shading equations which determine the color of a pixel as a function of surface properties and lighting in the scene.

Figure 5: The ray tracing algorithm. (The figure shows a primary ray cast from the eye through the image plane striking a glass sphere, with a refracted ray passing through the sphere toward an opaque box.)

9.2 Parallelization

Well over 100 papers have appeared with ray tracing as their primary subject. Most of these have considered optimizations or extensions to the algorithm; few have dealt with parallelization. The only natural vectors that arise are short vectors in 3-space or 3-D color space. Although the algorithm iterates quite regularly over independent pixels, each primary ray can spawn a tree of secondary rays quite different from its neighbors in depth and shape. For these reasons the brute force algorithm does not lend itself to vectorization. Some vectorization has been done, however ([23], [18]).

For these same reasons, however, the algorithm is embarrassingly easy to parallelize. In the extreme case, an array of 512 by 512 processors, each

with access to the model description, could compute the entire image in the time it takes to compute only the most complicated pixel.

On CEDAR, our parallel ray tracer assigns a portion of the image to each cluster, where pixels are computed concurrently. An entire scan line is assigned to each cluster at a time. The assumption is that adjacent scan lines will be of similar computational complexity, and that each cluster will finish its scan line in about the same amount of time. This assumption has been upheld in actual tests. At the end of each scan line, all clusters are synchronized, then the computed values are output sequentially to the image file. The main loops of the ray tracer are shown below.

      parameter (NCLUSTER = 2, MAXSIZE = 1024)
      global scanline, iypos
      real scanline(MAXSIZE, NCLUSTER)

      do iypos = 1, maxy, NCLUSTER
C ...    spread each scan line across clusters
         sdoall y = 1, min(NCLUSTER, maxy-iypos+1)
            integer ixpos
C ...       compute scan line concurrently within a cluster
            cdoall ixpos = 1, maxx
               call getpixcolor(ixpos, y+iypos-1,
     *                          scanline(ixpos, y))
            end cdoall
         end sdoall
C ...    synchronize and output the scan lines
         do y = 1, min(NCLUSTER, maxy-iypos+1)
            call print(scanline(1, y))
         end do
      end do

If the memory were large enough to contain the entire image (perhaps 3 MB), the synchronization step would be unnecessary and the scan lines could be assigned to processors at random. Figure 6 was computed using the parallel ray tracer and shows the strength of an electromagnetic field in a region. The embedded grid lines, mirrors, and shadows all provide visual cues to the three-dimensional visualization.

Figure 6: Three-dimensional image computed by the parallel ray tracer.

10 Summary

In summary, we have presented a survey of several important algorithms and applications for the CEDAR multiprocessor. Important design concepts discussed include:

- blocking techniques for data-stream reduction in sparse basic linear algebra computations,
- domain decomposition techniques in direct (BCR, SAS, and SPIKE) and iterative (Conjugate Gradient) methods for solving linear systems of equations,
- boundary integral domain decomposition for elliptic partial differential equations,
- parallelization of sparse linear system solvers for circuit simulation applications, and
- parallel ray tracing for three-dimensional visualization.

Research in these areas, along with the development of future CEDAR prototypes, continues at the Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign.

References

[1] M. Berry and A. Sameh. Multiprocessor schemes for solving block tridiagonal linear systems. International Journal of Supercomputer Applications, 2(3):37-57, 1988.

[2] B. Buzbee, G. Golub, and C. Nielson. On direct methods for solving Poisson's equation. SIAM J. Numer. Anal., 7(4):627-656, December 1970.

[3] H.C. Chen. The SAS Domain Decomposition Method for Structural Analysis. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1988. Also available as CSRD Technical Report No. 754, Center for Supercomputing Research and Development.

[4] H.C. Chen and A. Sameh. A matrix decomposition method for orthotropic elasticity problems. SIAM Journal on Matrix Analysis and Applications, 10(1):39-64, January 1989.

[5] H.C. Chen and A. Sameh. Numerical linear algebra algorithms on the CEDAR system. In A.K. Noor, editor, Parallel Computations and Their Impact on Mechanics, pages 101-125, The American Society of Mechanical Engineers, 1987.

[6] J. Dongarra, J. DuCroz, S. Hammarling, and R. Hanson. A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms. Technical Report 41, Argonne National Laboratory, Mathematics and Computer Science Division, December 1984.

[7] I. Duff. A set of Fortran subroutines for sparse unsymmetric linear systems. Technical Report R8730, Atomic Energy Research Establishment, Harwell, England, 1977.

[8] P. Emrath. Xylem: an operating system for the Cedar multiprocessor. IEEE Software, 2(4):30-37, July 1986.

[9] P. Emrath, D. Padua, and P. Yew. CEDAR architecture and its software. In Hawaii International Conference on System Sciences, Hawaii, January 3-6, 1989.

[10] K. Gallivan, W. Jalby, U. Meier, and A. Sameh. The impact of hierarchical memory systems on linear algebra algorithm design. International Journal of Supercomputer Applications, 2(1):12-48, 1988.

[11] E. Gallopoulos and D. Lee. Boundary integral domain decomposition on hierarchical memory multiprocessors. In ACM International Conference on Supercomputing, pages 488-499, July 1988.

[12] E. Gallopoulos and D. Lee. Fast Laplace solver by boundary integral based domain decomposition. Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.

[13] E. Gallopoulos and Y. Saad. Parallel block cyclic reduction algorithm for the fast solution of elliptic equations. Parallel Computing, 1987. To appear.

[14] E. Gallopoulos and A. Sameh. Solving elliptic equations on the Cedar multiprocessor. In M. H. Wright, editor, Aspects of Computation on

Asynchronous Parallel Processors, North-Holland Elsevier, 1988. To appear.

[15] F. G. Gustavson. Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Transactions on Mathematical Software, 4(3):250-269, 1978.

[16] M. Guzzi. CEDAR Fortran Programmer's Handbook. Technical Report 601, Center for Supercomputing Research and Development, University of Illinois, June 1987.

[17] D. Kuck, E. Davidson, D. Lawrie, and A. Sameh. Parallel supercomputing today and the CEDAR approach. Science, 231(4740):967-974, 1986.

[18] Nelson Max. Vectorized procedural models for natural terrain: waves and islands in the sunset. Computer Graphics, 15(3):317-324, July 1981. ACM SIGGRAPH '81 Conference Proceedings.

[19] R. McGrath and P. Emrath. Using Memory in the CEDAR System. Technical Report 655, Center for Supercomputing Research and Development, University of Illinois, June 1987.

[20] U. Meier and A. Sameh. The behavior of conjugate gradient algorithms on a multivector processor with a hierarchical memory. Journal of Computational and Applied Mathematics, 24:13-32, 1988.

[21] R. Melhem. Parallel solution of linear systems with striped sparse matrices. Parallel Computing, 6(3):165-184, 1988.

[22] L. Nagel. SPICE2: A Computer Program to Simulate Semiconductor Circuits. Technical Report ERL-M520, University of California at Berkeley, Berkeley, CA, May 1975.

[23] David J. Plunkett and Michael J. Bailey. The vectorization of a ray-tracing algorithm for improved execution speed. Computer Graphics and Applications, 5(8):52-60, August 1985.

[24] David F. Rogers. Procedural Elements for Computer Graphics. McGraw-Hill Book Company, New York, 1985.

[25] A. Sameh. On two numerical algorithms for multiprocessors. In NATO Advanced Research Workshop on High-Speed Computation, 1983. Series F: Computer and Systems Sciences 7.

[26] A. Sangiovanni-Vincentelli. Circuit simulation. In Computer Aids for VLSI Circuits, pages 19-113, 1981.

[27] R. A. Sweet. A cyclic reduction algorithm for solving block tridiagonal systems of arbitrary dimension. SIAM J. Numer. Anal., 14(4):707-720, September 1977.

[28] R. A. Sweet. A parallel and vector cyclic reduction algorithm. SIAM J. Sci. Statist. Comput., 9(4):761-765, July 1988.

[29] Turner Whitted. An improved illumination model for shaded display. Communications of the ACM, 23(6):343-349, 1980.

[30] H.A.G. Wijshoff. Implementing Sparse BLAS Primitives on Concurrent/Vector Processors: a Case Study. Technical Report 843, Center for Supercomputing Research and Development, University of Illinois, January 1989.


