Editor: Ronald D. Williams, University of Virginia, Thornton Hall/Electrical Engineering, Charlottesville, VA 22903; phone, (804) 924-7960; fax, (804) 924-8818; Internet, [email protected]

The compute cluster and other parallel programming models
D. Douglas Wilmarth, Sky Computers

Suppose you could afford a car that went fast enough and far enough and fulfilled all your other traveling needs. If everyone else had exactly the same requirement, then that car could be the one and only kind of transportation; there would be no need for buses, subways, vans, trains, planes, and so on. Likewise, if everyone could buy a single computer that would always meet his or her processing needs, then parallel processing, symmetric multiprocessing, SIMD, MIMD, SPMD, MPP, and so on would be unnecessary. But we know this is not the case - with cars or computers.

Application requirements. Many computer applications, such as machine vision, radar, sonar, and signal processing, need performance well beyond what any single processor can provide. This is becoming especially true in the imaging market. Until recently, a single CPU chip or board could address the simple applications, but the complex applications required multiple processors and a supercomputer budget.

Today, with the equivalent of supercomputer power available on a single-slot 4 x 6-inch VME card, accessible, affordable imaging system designs are being developed with Gflops of performance and dozens of processors. However, obtaining true supercomputer performance using multiprocessor hardware architectures and parallel processing is still a challenge.

When designing a system, developers take into account how much processing power must be provided to achieve the program goals. The measure of how much performance each additional processor provides is known as scalability.

In a perfect world, each additional processor would contribute just as much as each previous processor, and we would achieve a linear speedup. That is, 10 processors would provide 10 times the performance of a uniprocessor, and 10,000 processors would provide 10,000 times the performance. In reality, however, each additional processor contributes less than the previous one. In fact, system designers can more readily meet performance objectives by using a commercial off-the-shelf architecture that is designed to maximize performance by scaling to application size and complexity.
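
One common way to quantify this diminishing return is Amdahl's law, which the short Python sketch below illustrates with a hypothetical 5 percent serial fraction; the formula and the numbers are illustrative only and are not drawn from the column or from any particular system.

```python
# A minimal sketch (not from the column) of why speedup is sublinear:
# Amdahl's law, where a fraction "serial" of the work cannot be parallelized.

def speedup(processors: int, serial: float) -> float:
    """Predicted speedup over one processor when a fraction
    `serial` of the runtime is inherently sequential."""
    return 1.0 / (serial + (1.0 - serial) / processors)

if __name__ == "__main__":
    for p in (1, 10, 100, 10_000):
        # Even a 5% serial fraction caps the speedup near 20x.
        print(f"{p:>6} processors -> {speedup(p, 0.05):7.2f}x speedup")
```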

Architectural features. The major hardware architectural features in optimizing parallel processing performance are interconnect topology, memory locality, and synchronization facilities.

Interconnect topology. How information moves between processors is a major design issue, since communication speed and bandwidth will be governing characteristics. The Alliant FX/2800, the Cray C90, and the Convex C3 series use hierarchical buses and/or crossbars. Hypercube designs were developed for high-end machines such as the Thinking Machines CM-2 and the Intel iPSC-860. Intel’s follow-on Touchstone uses a two-dimensional mesh. These are just a few of the designs on the market today.
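
As a rough sketch of what these topologies mean to a programmer, the following fragment computes the directly connected neighbors of a node in a hypercube and in a two-dimensional mesh; the node numbering, grid sizes, and lack of wraparound are assumptions made here purely for illustration.

```python
# A hedged sketch (not from the article) of how two of the mentioned
# topologies define a node's directly connected neighbors.

def hypercube_neighbors(node: int, dimensions: int) -> list[int]:
    """In a d-dimensional hypercube, neighbors differ in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]

def mesh_neighbors(x: int, y: int, width: int, height: int) -> list[tuple[int, int]]:
    """In a 2D mesh, neighbors are the adjacent grid points (no wraparound assumed)."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < width and 0 <= cy < height]

if __name__ == "__main__":
    print(hypercube_neighbors(5, 4))   # node 0b0101 in a 16-node hypercube
    print(mesh_neighbors(0, 3, 8, 8))  # an edge node in an 8 x 8 mesh
```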

Memory locality. Bandwidth between the processor and the memory system governs the speed of the innermost calculations and ultimately how fast the application runs. Where the memory is physically situated with respect to the processor governs this speed. Whether memory is globally addressable or strictly local determines the visibility of the interconnect topology to the programmer and the amount of hidden contention with other processors.

Synchronization facilities. How the processors signal status to each other is an important issue for avoiding deadlocks. Simple coordination at the network level is usually done in the operating-system software; the lower the programming level that needs synchronization, the more likely it is to require a hardware solution. SIMD (single-instruction, multiple-data) architectures, like those used by MasPar and Thinking Machines, have built-in synchronization, since there is only one instruction stream. MIMD (multiple-instruction, multiple-data) architectures, from Cray and Convex, for example, have separate buses with synchronization registers.

Parallel processing models. Various parallel processing models are available. They range from job-level parallelism, through SIMD and MIMD, to the newest entry, the parallel compute cluster.

Job-level parallelism and SMP. The oldest and still most prevalent forms of parallel processing are job-level parallelism and symmetric multiprocessing (SMP). In this model, each job is executed on a single processor. No data or code is shared, and no interprocess communication is needed.

Significant system resources may be shared between the processors and even sections of the operating system code, and in most cases the operating system supports multiple jobs - multitasking - on each processor. The interconnect topology can be anything from local area networks to high-speed system buses/crossbars. The main memory is always local to the processor, and the synchronization facilities are always at the operating system level.

This form of parallel processing has lasted so long because it is one of the more efficient approaches. Two conditions tend to be present: (1) the resource needed for an individual job is a small portion of a single uniprocessor, and (2) each job is substantially different in nature from others in the job stream. This is the “sweatshop” model of processing: Each processor takes the next task from a queue, completes it, and goes on to the next.

This is the classic model for an MIS data center. Processors need to be added as the volume of work goes up, but they can be of differing capability, size, and performance (heterogeneous).
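
A minimal sketch of this “sweatshop” queue, with a handful of worker threads standing in for processors and independent placeholder jobs invented for illustration, might look like the following:

```python
# A minimal sketch of the "sweatshop" model described above: independent
# jobs, no shared data, each worker simply takes the next job from a queue.
# The jobs themselves are placeholders, not anything from the article.
import queue
import threading

def worker(jobs: "queue.Queue[int]", results: list) -> None:
    while True:
        try:
            job = jobs.get_nowait()      # take the next task from the queue
        except queue.Empty:
            return                       # queue drained: this worker is done
        results.append((threading.current_thread().name, job * job))
        jobs.task_done()

if __name__ == "__main__":
    jobs: "queue.Queue[int]" = queue.Queue()
    for n in range(20):
        jobs.put(n)                      # 20 unrelated jobs in the stream
    results: list = []
    threads = [threading.Thread(target=worker, args=(jobs, results), name=f"cpu{i}")
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(results), "independent jobs completed")
```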

SMP architectures are characterized by the classical service bureau: General-purpose CPU architectures integrate seamlessly with new, additional systems. A wide user community has access to all the common resources, and load balancing is accomplished by the operating system. Within this context, SMP is a real and necessary model in mainstream parallel computer systems. The job stream can be from one user or several hundred, and the CPUs can be mainframes, supercomputers, superminis, and/or workstations.

Data-level parallelism and SIMD/MPP. Much government and university research has been targeted at an extreme form of parallelism: data-level parallelism. Thousands or millions of jobs need to be done; each job is identical, differing only in the data set supplied. One processor is used for each data set, which reduces the runtime for many jobs to the time for one. The general-purpose computers that fit well in the SMP model have major cost disadvantages for applications using data-level parallelism. Each CPU feature not used for that specific job adds a cost that will be replicated hundreds or thousands of times. It is these types of applications that encouraged the development of SIMD designs and massively parallel processors (MPPs).

SIMD designs are homogeneous collections of functional units sharing a common instruction unit. All units are available for use on a single job. In most cases, the synchronization facilities are fixed; all units work in parallel in lock-step fashion. A topology between the functional units is used for data communication, and the optimal interconnect topology is very application dependent. As an intermediate step to custom hardware, these systems have shown some promise for “embarrassingly” parallel applications - in particular, some imaging applications where one functional unit is allocated to each output pixel.
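
The programming model, if not the lock-step hardware, can be suggested with an ordinary process pool: one operation applied uniformly to every data element. The per-pixel threshold below is made up for the sketch and stands in for whatever the single instruction stream would do.

```python
# A loose, hedged analogue of data-level parallelism: one operation, many
# data sets (here, one "job" per pixel value). Real SIMD hardware runs this
# in lock step; a process pool only mimics the programming model.
from multiprocessing import Pool

def threshold(pixel: int) -> int:
    """The single 'instruction stream' applied to every data element."""
    return 255 if pixel > 128 else 0

if __name__ == "__main__":
    image = [17, 200, 130, 90, 255, 3, 128, 129]   # placeholder pixel values
    with Pool(processes=4) as pool:
        binary = pool.map(threshold, image)        # same op on every element
    print(binary)
```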

MPP designs attempt to be more general purpose. The system is still homogeneous, but each node is a full CPU, not just a set of functional units. Each CPU has its own set of instructions that must be coordinated with the cooperative processors. In theory, a job should be able to dynamically request an arbitrarily large set of processors, use them on time-critical sections, and release them to other jobs when they’re not in use. The difficulty here is that the synchronization facilities must be closely associated with the CPUs to be effective. However, it’s difficult to use a commodity processor and have a tightly coupled, wide-ranging, general-purpose synchronization facility layered on top. For example, it’s clear from trying to get 10,000 people to do something - anything - together that even a little variation causes problems. Imagine sports fans doing “the wave” in a stadium. Now imagine trying it with each person instructed separately.

Algorithm-level parallelism and MIMD. In imaging applications such as sonar, radar, and signal processing, the algorithms have some natural partitions or “threads” that can be delegated and then collected. This characteristic has led to some success using a MIMD system, which is a collection of nearly independent processors, each running a separate thread of a single job. The processor handling each thread has data in a local memory and usually needs access to shared/global data. The ratio of local data accesses to global data accesses determines the speed and extent of the hardware interconnect topology and synchronization facilities. A high ratio (lots of local, no global memory) is workable for most topologies, even with small bandwidths, and has few synchronization requirements. A low ratio (little local, lots of global) stresses even the most robust designs.

Noncomputer examples of this level of parallelism would be
- small group - a team playing an organized sport (one play book, many positions),
- medium-size group - the cast of a play, ballet, or opera (one script, many parts), and
- large group - a marching band in a parade (one score, 76 trombones).

These groups require little communication during critical operations because each individual (processor) follows its own set of instructions. However, they all need significant preparation to operate efficiently.

The MIMD system has the same characteristics. Significant time is spent distributing the data in separate memories for parallel code execution. The threads assigned to each processor need to run for extended periods with little or no communication with other threads, or the overhead will outweigh the improvement. At some point, all the threads must be brought back together to complete that task. The cost of the various overheads defines what applications will run well.
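
A skeletal version of this distribute/compute/collect pattern, with each thread working on a purely local partition and only one small shared write at the end, might be sketched as follows; the sum-of-squares workload is only a stand-in for real per-thread computation.

```python
# A hedged sketch of the MIMD pattern described above: distribute data into
# per-thread local partitions, let each thread run independently on its own
# chunk, then bring the partial results back together in a collection step.
import threading

def run_thread(local_data: list[float], partial_sums: list[float], slot: int) -> None:
    # Long stretch of purely local work: no communication with other threads.
    total = 0.0
    for x in local_data:
        total += x * x
    partial_sums[slot] = total           # one small "global" write at the end

if __name__ == "__main__":
    data = [float(i) for i in range(1_000)]
    n_threads = 4
    chunks = [data[i::n_threads] for i in range(n_threads)]   # distribution step
    partial_sums = [0.0] * n_threads
    threads = [threading.Thread(target=run_thread, args=(chunks[i], partial_sums, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("collected result:", sum(partial_sums))             # collection step
```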

The data center environment with a large number of independent jobs maps well on MIMD systems, just as it would on any collection of networked systems. Applications such as large image-processing jobs get good performance but would do just as well - and be more cost-effective - on a SIMD system. What MIMD appears to address best is simulation applications, such as weather modeling and aerodynamics. The data sets have large amounts of local computation that can be processed independently, but a collection step requires intelligent processors for efficiency. This partitioning of the problem is nontrivial and must be built into the application’s basic structure. Also, tuning this type of solution depends heavily on the specific number of processors, coding style, and development tools (for example, compilers, profilers, debuggers).

Loop-level parallelism and compute clusters. Loop-level parallelism is the lowest level, since the hardware requirements are the most stringent. Multiple processors process a single loop cooperatively within a section of code. The hardware interconnect topology must permit ready access to all memory locations concurrently from all processors, essentially making all memory local. The synchronization facilities must permit coordination between processors in a few cycles, since the time spent by each processor in its section of the loop may be only a few hundred cycles.
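
In software terms, loop-level parallelism looks roughly like the sketch below: several workers share one loop over common data and meet at a barrier after each pass. A library barrier is used here only to stand in for the hardware facilities described next; it is far too slow for loop bodies of a few hundred cycles, and the array and pass counts are invented for the example.

```python
# A hedged sketch of loop-level parallelism: several workers cooperatively
# execute one loop over a shared array, meeting at a barrier after each pass.
import threading

N_WORKERS = 4
PASSES = 3
shared = list(range(16))                 # memory visible to all workers
barrier = threading.Barrier(N_WORKERS)

def loop_worker(worker_id: int) -> None:
    for _ in range(PASSES):
        # Each worker handles an interleaved slice of the single loop.
        for i in range(worker_id, len(shared), N_WORKERS):
            shared[i] += 1
        barrier.wait()                   # cheap-in-hardware synchronization point

if __name__ == "__main__":
    threads = [threading.Thread(target=loop_worker, args=(w,)) for w in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared)                        # every element incremented PASSES times
```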

Sky’s Shamrock board, an example of the parallel compute-cluster model, has a full crossbar switch between processors and memory banks. This allows every processor access to all sections of memory. Multiple accesses to the same bank of memory are serviced in a round-robin fashion so that no processor can be locked out. Arbitration occurs entirely in hardware. Coordination between the processors is through barrier-synchronization registers, which are mapped directly into the executive running on the i860 processor and accessed without going through the crossbar.

These hardware features let the programmer treat Shamrock’s four processors as a uniprocessor and make cluster computing a viable concept.

Myth versus reality. One of the computer industry’s ongoing quests has been the search for an infinitely scalable platform, a system that for an arbitrary application provides a near-linear speedup from one to 10,000 processors. While this is a noble goal, very few applications and/or algorithms scale over a range of 100 times.

The system designer must know, at a very early stage, how many processors are in the production system for the software architecture to function up to expectation. Whether the target is an SMP, SIMD, MPP, or MIMD design, each major step in hardware expansion is accompanied by a major step in performance tuning. In many cases, the tuning at the next level involves increasing the total computation so that it may also be parallelized and thereby contribute less to the overall execution time.

Although a parallel cluster-compute architecture is not the panacea for infinite scalability, it does make several important contributions. For example, Sky’s Shamrock cluster offers the following:

(1) The basic unit of cluster performance is some factor (say, 2.5 to 3.5) better than each uniprocessor in the cluster. Each cluster consists of four i860 processors. Therefore, the user who needs a “processor” that is more powerful than any single CPU available can use a cluster now rather than wait for tomorrow’s promise of infinite scalability.

(2) The first level of parallelization for large systems is transparent to the user. This does not relieve the high-end user from coordinating multiple processors, but it does decrease the amount of tuning resulting from synchronization. Assume, for example, that an application needs 64 i860 equivalents to meet processing goals. If the target product is not cluster based, 64 separate processors must be synchronized. Synchronizing 16 clusters is a significantly simpler task.
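
The idea can be sketched as a two-level barrier: the four processors of a cluster synchronize locally, and a single representative per cluster then takes part in the 16-way global step. The code below illustrates only that structure; it is not a model of Sky’s actual synchronization registers, and the thread counts simply mirror the 64-processor example above.

```python
# A hedged sketch of two-level synchronization: intra-cluster barriers first,
# then one representative per cluster meets at a much smaller global barrier.
import threading

PROCS_PER_CLUSTER = 4
N_CLUSTERS = 16                          # 64 "i860 equivalents" in total

inter_cluster = threading.Barrier(N_CLUSTERS)
intra_cluster = [threading.Barrier(PROCS_PER_CLUSTER) for _ in range(N_CLUSTERS)]

def processor(cluster: int, rank: int) -> None:
    # ... local work would go here ...
    intra_cluster[cluster].wait()        # fast sync inside the four-processor cluster
    if rank == 0:                        # only one member per cluster goes global
        inter_cluster.wait()             # a 16-way sync instead of a 64-way one
    intra_cluster[cluster].wait()        # cluster-mates wait for the global step

if __name__ == "__main__":
    threads = [threading.Thread(target=processor, args=(c, r))
               for c in range(N_CLUSTERS) for r in range(PROCS_PER_CLUSTER)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("all 64 processors synchronized through 16 clusters")
```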

(3) The multicomputer-only model of SMP and MIMD can be supported as well. The programming model of one processor/one memory bank can be supported as a special case of the cluster’s capabilities. This provides an easy migration path to those users already using multiple boards.

The compute cluster cooperatively processes a single thread, making automatic parallelization achievable in real-world applications. For example, Grumman Corporation will be using Sky’s Shamrock to upgrade the US Army’s Firefinder counterbattery radar system. By using commercial off-the-shelf technology, Grumman and Sky have reduced the size of the radar processing unit by a factor of four. In addition, because the products are modular, upgradable, and extensible, Grumman can transfer the technology developed for this program to other similar programs.

Cluster technology is just appearing and is now possible because of recent advances in packaging and processors. The parallel compute-cluster programming model is an easy-to-use, straightforward approach to integrating multiple processors into demanding applications. It can be used to significantly decrease the level of complexity for large applications and can handle multicomputer applications as a matter of course.

D. Douglas Wilmarth is Sky Computers’ product marketing manager. His address is Sky Computers, 27 Industrial Ave., Chelmsford, MA 01824; e-mail, [email protected].
