Introduction to Parallel Programming
March 20 & 21, 2006

Ralf-Peter Mundani & Ioan Lucian Muntean
Department of Computer Science – Chair V
Technische Universität München, Germany

ATHENS Course on Parallel Numerical Simulation
Munich, March 20-24, 2006


1. Classification of Parallel Computers

Parallel Computers

• parallel computers consist of a set of processing elements that can collaborate in a coordinated and (partially) simultaneous way in order to solve a joint task

• possible appearances of such processing elements (in a broader sense):

– specialized units (the steps of a vector pipeline or the vector pipeline of a vector computer's vector unit, for example)

– parallel features in modern monoprocessors (superscalar processor architecture, fine-grain parallelism via instruction pipelining, cooperation of CPU, bus control, DMA unit, and graphics card, VLIW (very long instruction word) processors, multi-threading processor technology, for example)

– several uniform arithmetical units (the processing elements of an array computer, for example)

– processors or processing nodes of a multiprocessor computer (i.e. the actual parallel computers in a narrower sense)

– complete stand-alone computers, connected via a LAN (workstation or PC clusters as virtual parallel computers)

– parallel computers or clusters connected via a remote network (WAN) (so-called metacomputers)

• target machines in the following: multi- and specialized processors as well as clusters (i.e. the so-called high-performance architectures or supercomputers)


Commercial Parallel Computers

• manufacturers: starting from 1983, big players and small start-ups

• names have been coming and going rapidly

• see the table below for producers of commercial parallel computers based on microprocessor CPUs between 1984 and 1993 and their status 2003 (taken from [Qui03])

• in addition to that: several manufacturers of vector computers and non-standard architectures (e.g. Thinking Machines and their Connection Machines, starting from 1986 and ending in the nineties)

Company                  Country  Year  Status in 2001
Sequent                  U.S.     1984  Acquired by IBM
Intel                    U.S.     1984  Out of the business
Meiko                    U.K.     1985  Bankrupt
nCUBE                    U.S.     1985  Out of the business
Parsytec                 Germany  1985  Out of the business
Alliant                  U.S.     1985  Bankrupt
Encore                   U.S.     1986  Out of the business
Floating Point Systems   U.S.     1986  Acquired by Sun
Myrias                   Canada   1987  Out of the business
Ametek                   U.S.     1987  Out of the business
Silicon Graphics         U.S.     1988  Active
C-DAC                    India    1991  Active
Kendall Square Research  U.S.     1992  Bankrupt
IBM                      U.S.     1993  Active
NEC                      U.S.     1993  Active
Sun Microsystems         U.S.     1993  Active
Cray Research            U.S.     1993  Active (as Cray Inc.)

"Out of the business" means the company is no longer selling general-purpose parallel computer systems.


The Arrival of Clusters

• in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices

• growing attractiveness of such commodity components for building parallel computers

• 1994: Beowulf – the first parallel computer built completely out of commodity hardware and freely available software

– NASA Goddard Space Flight Center

– 16 Intel DX4 processors

– multiple 10Mbit Ethernet links

– Linux with GNU compilers

– MPI library

• 1996: Beowulf clusters performing more than 1 GFLOPS for less than $50,000

• 1997: a 140-node cluster performing more than 10 GFLOPS

• 2004: Mozart – SgS department cluster in Stuttgart

– 64 nodes, 128 processors (Intel P4)

– InfiniBand networking technology

– Linux, MPI (as 10 years before ...)

– about 600 GFLOPS sustained performance, 783 GFLOPS peak performance

– overall cost: about €390,000


Supercomputers

• supercomputing or high-performance scientific computing as the most important application of the big number crunchers

• national initiatives due to huge budget requirements

– ASCI – Accelerated Strategic Computing Initiative in the US

* in the wake of the nuclear testing moratorium in 1992/1993

* decision: develop, build, and install a series of 5 supercomputers of up to $100 million each in the US

* start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFLOPS computer)

* then: ASCI Blue Pacific (1998, Lawrence Livermore National Lab), ASCI Blue Mountain, ASCI White, ...

– meanwhile: a new High-End Computing memorandum in the US (2004)

– federal Bundeshöchstleistungsrechner initiative in Germany

* decision in the mid-nineties

* 3 federal supercomputing centres in Germany (München, Stuttgart, and Jülich), one new installation each year, the newest one to be among the top 10 of the world

• overview and state of the art: Top500 list (every six months), see http://www.top500.org


Development 1993 - 2005


Development and Projection


The Earth Simulator – World’s No. 1 from 2002 - 2004

• installed in 2002 in Yokohama, Japan.

• based on the NEC SX architecture

• 640 nodes, each node with 8 vector processors (8 GFlop/s peak per processor), 2 ns cycle time, 16 GB shared memory per node

• a total of 5120 processors with a theoretical peak performance of 40 TFlop/s and 10 TB memory

• single-stage crossbar switch (1800 miles of cable, 83,000 copper cables), 16 GB/s cross-section bandwidth

• 700 TB disk space

• 1.6 PB mass store

• Area of computer = 4 tennis courts, 3 floors

• Linpack Benchmark = 35.6 TFlop/s


BlueGene/L – World’s No. 1 from 2004 -

• installed in 2004 in Lawrence Livermore National Laboratory, United States

• 2 processors, 5.6 GFlop/s, 512 MB memory per node

• 65,536 nodes (131,072 processors) achieved through extreme scalability, 367 TFlop/s theoretical peak, 32 TB memory, 400 TB aggregate global disk

• dual PowerPC 440 microprocessor technology

• 1,024 x 1-Gb/s Ethernet external networking

• delivered bandwidth is 0.684 TB/s, cost effective for molecular dynamics and turbulence

• about 2 MW power needed for computer and cooling, about 4.7 billion joules of heat is generated per hour

• more than 5000 cables are present in the machine and the aggregate cable length counts to more than 12 miles

• area of computer = 2,500 sq.ft (approx. 232 sq.m)

• Linpack Benchmark = 70.72 TFlop/s in Nov. 2004, 136.8 TFlop/s in Jun. 2005, and 280.6 TFlop/s in Nov. 2005


Standard Classification According to Flynn

• principle of the classification: computers as operators on two kinds of information streams:

– instruction streams: sequences of commands to be executed

– data streams: sequences of data subject to instruction streams

• this results in a two-dimensional subdivision of the variety of computer architectures:

– number of instructions a computer executes at a certain point of time

– number of data elements a computer processes at a certain point of time

• hence, Flynn distinguishes four classes of architectures:

– SISD: Single-Instruction-Single-Data

– SIMD: Single-Instruction-Multiple-Data

– MISD: Multiple-Instruction-Single-Data

– MIMD: Multiple-Instruction-Multiple-Data

• drawback: very different computers may belong to the same class


Flynn’s Classes

• SISD:

– the classical monoprocessor following von Neumann’s principle

• SIMD:

– array computers: consist of a large number (65,536 and more) of uniform processing elements arranged in a regular way, which – under central control – all apply the same instruction to some part of the data each, simultaneously

– vector computers: consist of at least one vector pipeline (a functional unit designed as a pipeline for processing vectors of floating point numbers)

• MISD:

– a pipeline of multiple independently executing functional units operating on a single stream of data, forwarding results from one functional unit to the next

– not a very popular class (mainly for special applications such as Digital Signal Processing)

– example: systolic array – a network of primitive processing elements that "pump" data (for example, a hardware priority queue with constant-complexity operations can be built out of primitive three-number sorting elements)

• MIMD:

– multiprocessor systems, i.e. the classical parallel computers

– networks of computers


Processor Coupling

• cooperation of processors or computers as well as their shared use of various resources require communication and synchronization

• depending on the type of processor coupling, we distinguish

– memory-coupled multiprocessor systems

– message-coupled multiprocessor systems

• memory-coupling (strong coupling):

– shared address space (physically and logically) for all processors, shared memory

– communication and synchronization via shared variables

– example: SMP (symmetric multiprocessors), where the access to global memory is identical for all processors

– connection to memory realized via a central bus or via more complex structures (crossbar switch ...)

• message-coupling (weak or loose coupling):

– physically distributed (local) memories and local address spaces, distributed memory

– communication via the exchange of messages through the network

– synchronization implicitly via communication instructions
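To make the two coupling styles concrete, here is a minimal Python sketch (not part of the original slides; all names are illustrative): threads communicate via a shared variable protected by a lock (memory-coupling), while two processes exchange an explicit message over a pipe (message-coupling).

```python
import threading
import multiprocessing as mp

# memory-coupling: threads share the address space and communicate
# via a shared variable, synchronized by a lock
shared = {"value": 0}
lock = threading.Lock()

def add_shared(x):
    with lock:
        shared["value"] += x

# message-coupling: processes have separate address spaces and exchange
# explicit messages; synchronization happens implicitly via send/receive
def echo_plus_one(conn):
    x = conn.recv()
    conn.send(x + 1)

if __name__ == "__main__":
    threads = [threading.Thread(target=add_shared, args=(1,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("shared-memory result:", shared["value"])       # 4

    parent_end, child_end = mp.Pipe()
    proc = mp.Process(target=echo_plus_one, args=(child_end,))
    proc.start()
    parent_end.send(41)
    print("message-passing result:", parent_end.recv())   # 42
    proc.join()
```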


A Hybrid Type: DSM/VSM

• central issues:

– scalability: How simple is it to add new nodes (processors) to the system?

– programming model: How complicated is programming?

– portability: How simple is porting/migration, i.e. the transfer from one processor to another one, if executability and functionality shall be preserved?

– load distribution: How difficult is it to obtain a uniform distribution of the workload among the processors?

• message-coupled systems are advantageous concerning scalability, memory-coupled systems are typically better with respect to the other aspects

• idea: combine advantages of both

• DSM (distributed shared memory) or VSM (virtual shared memory):

– physically distributed (local) memory

– nevertheless one global shared address space


An Alternative Classification due to Processor Coupling

• type of processor coupling allows for an alternative to Flynn’s classification

• UMA – Uniform Memory Access:

– access to shared memory is identical for all processors

– same access times for all processors to all data

– of course, a local cache is possible for each processor

– classical representative: SMP

• NUMA – Non-Uniform Memory Access:

– memory modules are physically distributed among the processors

– nevertheless a shared global address space

– access times depend on the location of the data (local or remote access)

– typical representative: DSM/VSM

• NORMA – No Remote Memory Access:

– systems with distributed memory (physically and logically)

– no direct access to another processor’s local world


Overview: Memory-Coupling, Message-Coupling, and DSM

schematic structure of a UMA configuration


Overview: Memory-Coupling, Message-Coupling, and DSM

schematic structure of a NUMA configuration


Overview: Memory-Coupling, Message-Coupling, and DSM

schematic structure of a NORMA configuration


2. Levels of Parallelism

Granularity

• the decision which type of parallel architecture is best suited for a given parallel program strongly depends on the character and, especially, on the granularity of the parallelism

• some remarks on granularity:

– qualitative meaning: the level on which work is done in parallel

– we distinguish coarse-grain and fine-grain parallelism

– quantitative meaning: ratio of computational effort and communication or synchronization effort (roughly speaking, the number of instructions between two necessary steps of communication)

• starting point of the following considerations: a parallel program (explicit parallelism)

• typically, five different levels are identified:

– program level

– process level

– block level

– instruction level

– sub-instruction level


Program and Process Level

• program level:

– parallel processing of different programs

– independent units without any shared data

– no or only a small amount of communication

– organized by the operating system

• process level:

– notion of process here used as a heavy-weight process (see Section 2.1)

– a program is subdivided into different processes to be executed in parallel

– each process: large number of sequential instructions, private address space

– synchronization is necessary (all processes live and run within a program)

– communication in most cases necessary (data exchange ...)

– example: UNIX processes

– support by the operating system via routines for process management, process synchronization, and process communication
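A minimal sketch (not from the slides) of the process level in Python: two heavy-weight processes with private address spaces, so a change made in the child is invisible to the parent and data has to be communicated explicitly.

```python
import multiprocessing as mp

counter = 0                      # lives in the private address space of each process

def child(queue):
    global counter
    counter += 1                 # modifies only the child's own copy
    queue.put(counter)           # explicit communication back to the parent

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=child, args=(q,))
    p.start()
    p.join()
    print("child saw:", q.get())      # 1
    print("parent sees:", counter)    # still 0 – the address spaces are separate
```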


Block, Instruction, and Sub-Instruction Level

• block level:

– here, the units running in parallel are blocks of instructions or light-weight processes (threads, see Section 2.1)

– smaller number of instructions, which share the address space with other blocks

– examples: threads according to the POSIX standard in multi-threading operating systems, loops in numerical programs (see the sketch after this list)

– communication via shared variables and synchronization mechanisms

• instruction level:

– parallel execution of machine instructions

– optimizing compilers can increase this potential by modifying the order of the commands (better exploitation of superscalar architecture and pipelining mechanisms)

• sub-instruction level:

– instructions are subdivided still further into units that can be executed in parallel or via overlapping

– examples: pipelining in superscalar processors, vector operations
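As referenced in the block-level item above, here is a minimal sketch (not from the slides) of block-level parallelism: a numerical loop is split into blocks that run as threads sharing the same arrays. (In CPython the global interpreter lock limits the actual speed-up, so this only illustrates the structure.)

```python
import threading

N, NUM_THREADS = 100_000, 4
a = [1.0] * N
b = [2.0] * N
c = [0.0] * N                     # result array shared by all threads

def add_block(lo, hi):
    # each thread processes one block of indices of the shared arrays
    for i in range(lo, hi):
        c[i] = a[i] + b[i]

if __name__ == "__main__":
    block = N // NUM_THREADS
    bounds = [(t * block, N if t == NUM_THREADS - 1 else (t + 1) * block)
              for t in range(NUM_THREADS)]
    threads = [threading.Thread(target=add_block, args=bnd) for bnd in bounds]
    for t in threads: t.start()
    for t in threads: t.join()
    print(c[0], c[-1])            # 3.0 3.0
```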


Techniques of Parallel Work

• the different levels of parallelism have methods of parallel work in the hardware as their counterparts

• objective: best exploitation of the inherent parallel potential

• levels of parallel work:

– computer coupling: useful for program level only, sometimes also for process level

– processor coupling:

* message-coupling for program and process level

* memory-coupling for program, process, and block level

– parallel work within the processor architecture: instruction pipelining, superscalar organization, VLIW and so on for instruction level only, possibly also for the sub-instruction level

– SIMD techniques: concerning the sub-instruction level in vector and array computers


3. Quantitative Performance Evaluation

Performance Evaluation

• standard quantities for monoprocessors:

– MIPS: millions of instructions per second

– MFLOPS: millions of floating point operations per second

• not sufficient for parallel computers:

– in which context was the measured performance achieved (interconnection structure, which granularity of parallelism)?

– how efficient is the parallelization itself (obtaining a runtime reduction of a factor of 5 with 10 processors is definitely no great feat!)?

• another issue:

– what is due to the parallel computer?

– what is due to the parallel algorithm or program?

• hence, we have a closer look at these things in the following


Notions of Time in the Execution of Instructions

• not only simple instruction time, but more detailed considerations instead:

– execution time T of a parallel program: time between the start of the execution on the first participating processor and the end of all computations on the last participating processor

– computation time Tcomp of a parallel program: part of the execution time used for computations

– communication time Tcomm of a parallel program: part of the execution time used for send and receive operations

– idle time Tidle of a parallel program: part of the execution time used for waiting (for sending or receiving)

T = Tcomp + Tcomm + Tidle


Notions of Time in the Transmission of Data

• further subdivision of communication:

– communication time Tmsg of a message: time needed to send a message from one processor to another one

– setup time Ts: time for preparing and initializing the communication step

– transfer time Tw per data word transmitted: depends on the bandwidth of the transmission channel

Tmsg = Ts + Tw · N (N data words)

• of course, this relation holds only in case of a dedicated (conflict-free) connection
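A small worked example of the linear message-time model; the values for Ts and Tw below are purely illustrative assumptions, not data from the slides.

```python
def t_msg(t_s, t_w, n_words):
    """Transmission time Tmsg = Ts + Tw * N for N data words."""
    return t_s + t_w * n_words

if __name__ == "__main__":
    t_s = 50e-6          # assumed setup time: 50 microseconds
    t_w = 10e-9          # assumed transfer time per data word: 10 nanoseconds
    for n in (1, 1_000, 1_000_000):
        print(f"N = {n:>9}: Tmsg = {t_msg(t_s, t_w, n) * 1e6:10.1f} microseconds")
    # short messages are dominated by Ts, long messages by Tw * N
```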


Average Parallelism

• total work during a parallel computation:

W(p) := l · ∑_{i=1}^{p} i · t_i ,

where

– l: performance of one single processor

– p: number of processors

– t_i: time when exactly i processors are busy

• average parallelism:

A(p) := (∑_{i=1}^{p} i · t_i) / (∑_{i=1}^{p} t_i) = (1/l) · W(p) / (∑_{i=1}^{p} t_i)

• for A(p), there exist several theoretical estimates (typically quite pessimistic), which were often used as arguments against massively parallel systems

• example: estimate of Minsky (1971):

– problem class: in the first step, p processors can be used, in the second step only p/2, and so on (today considered of no big relevance)

– example: parallel addition of 2p numbers on p processors

– result: A(p) = log2(p)


Comparison Multiprocessor – Monoprocessor

• (program-dependent) operation counts and times:

– P(1): number of unit operations (to be defined in an appropriate way) on a monoprocessor system

– P(p): number of unit operations on a p-processor system

– T(1): execution time on a monoprocessor (normally gauged such that T(1) = P(1), i.e. one time unit per unit operation)

– T(p): execution time on a p-processor system

• speed-up S(p):

S(p) := T(1) / T(p) ; bounds: 1 ≤ S(p) ≤ p

• efficiency E(p):

E(p) := S(p) / p = T(1) / (p · T(p)) ; bounds: 1/p ≤ E(p) ≤ 1

• speed-up and efficiency come in two variants:

– algorithm-independent (absolute): compare the best known sequential algorithm for the respective problem with the given parallel one

– algorithm-dependent (relative): compare the parallel algorithm with its sequential counterpart (or itself used sequentially)

– which point of view is the more objective one?
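A minimal helper (with assumed example times, not from the slides) showing how speed-up and efficiency follow from measured execution times T(1) and T(p).

```python
def speedup(t1, tp):
    """S(p) = T(1) / T(p)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p = T(1) / (p * T(p))."""
    return speedup(t1, tp) / p

if __name__ == "__main__":
    t1, tp, p = 120.0, 16.0, 10     # assumed: 120 s sequentially, 16 s on 10 processors
    print(f"S({p}) = {speedup(t1, tp):.2f}")        # 7.50
    print(f"E({p}) = {efficiency(t1, tp, p):.2f}")  # 0.75
```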


Scalability, Overhead, Parallel Index

• scalability:

– objective: adding more processors to the system shall reduce the execution time significantly (efficiency close to 1) without necessary program modifications

– scalability requires a sufficient problem size: although one porter may carry one suitcase in a minute, sixty won't do it in a second – however, 60 porters will carry 60 suitcases in one minute!

– therefore often scaled problem analysis: with increasing number of processors, increase the problem size, too (then, efficiency 1 means constant execution times)

• overhead R(p) for the parallelization:

– definition: R(p) := P(p) / P(1) , bound: 1 ≤ R(p)

– describes the additional number of operations for organization, synchronization, and communication

• parallel index I(p):

– definition: I(p) := P(p) / T(p)

– measure for the average degree of parallelism, counts the number of parallel operations per time unit

– relative acceleration (taking into account (parallel) overhead)


Amdahl’s Law

• the probably most important and most famous estimate for the speed-up

• underlying model:

– each program consists of parallelizable parts and of parts that can be executed only in a sequential way; let the sequential part be s, 0 ≤ s ≤ 1

– then, the following holds for execution time and speed-up:

T(p) = T(1) · (1 − s)/p + T(1) · s ; S(p) = T(1) / T(p) = 1 / ((1 − s)/p + s)

– thus, we get Amdahl's Law:

S(p) ≤ 1/s

• meaning:

– the sequential part can have a dramatic impact on the speed-up

– example: even for just one per cent (s = 0.01), a speed-up of more than 100 is impossible – even on massively parallel computers with p much bigger than 100!

– therefore central effort of all (parallel) algorithmics: keep s small!

– this is possible: about 75% of all LINPACK routines fulfil s < 0.1
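A small numerical illustration of Amdahl's law (the parameter values below are only examples, not from the slides).

```python
def amdahl_speedup(s, p):
    """S(p) = 1 / ((1 - s)/p + s) for sequential fraction s."""
    return 1.0 / ((1.0 - s) / p + s)

if __name__ == "__main__":
    s = 0.01                          # one per cent strictly sequential part
    for p in (10, 100, 1_000, 100_000):
        print(f"p = {p:>7}: S(p) = {amdahl_speedup(s, p):6.1f}")
    # the values approach but never exceed the bound 1/s = 100
```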


Model of Gustafson

• alternative model for speed-up prediction or estimation

• underlying idea:

– normalize the execution time on the parallel machine to 1

– there: non-parallelizable part σ

– hence execution time on the monoprocessor:

T(1) = σ + p · (1 − σ)

– this results in a speed-up of

S(p) = σ + p · (1 − σ) = p + σ · (1 − p)

• difference to Amdahl:

– the sequential part – with respect to execution time on one processor – is not constant, but gets smaller with increasing p:

s(p) = σ / (σ + p · (1 − σ)) ∈ ]0, 1[

– often more realistic, because more processors are used especially for larger problem sizes, and here the parallelizable parts typically increase (more computations, less declarations ...)

– in Gustafson’s model, speed-up is not bounded for increasing p
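For comparison, the same kind of illustration for Gustafson's model (again with example values only, not from the slides).

```python
def gustafson_speedup(sigma, p):
    """S(p) = p + sigma * (1 - p); sigma is the non-parallelizable part of the parallel run."""
    return p + sigma * (1 - p)

if __name__ == "__main__":
    sigma = 0.01
    for p in (10, 100, 1_000, 100_000):
        print(f"p = {p:>7}: S(p) = {gustafson_speedup(sigma, p):>10.1f}")
    # unlike Amdahl's bound of 1/s, the speed-up keeps growing with p
```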


CCR

• important quantity measuring the success of a parallelization: communication-computation-ratio (CCR)

– gives the relation of pure communication time and pure computing time

– a small CCR is favourable (a lot of computations, only a small amount of communication)

– typically: CCR decreases with increasing problem size

• example (see exercises):

– consider a full N × N matrix

– consider the following iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values (analogous definition at the boundary)

– for the update of each row, we need the two neighbouring rows

– p processors, decompose the matrix into p blocks of N/p rows

– computing time: 8 · N · N/p

– communication time: 2 · (p − 1) · N

– hence, CCR is (p² − p) / (4N)

– interpretation?
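One possible reading of the interpretation question, as a small sketch (not from the slides): evaluating the CCR formula for a few values of N and p shows that the ratio grows with p but shrinks with N.

```python
def ccr(n, p):
    """CCR of the averaging example: communication / computation = (p*p - p) / (4*n)."""
    computation = 8 * n * n / p            # operations per step
    communication = 2 * (p - 1) * n        # transferred boundary rows per step
    return communication / computation

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        for p in (4, 16, 64):
            print(f"N = {n:>6}, p = {p:>3}: CCR = {ccr(n, p):.4f}")
    # the CCR grows with p but shrinks with N: larger problems parallelize better
```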


Characterizing the Performance of Vector Computers

• crucial quantity: vector length

• due to the strict pipelining, vector computers love long vectors

• the model of Hockney and Jesshope determines the performance r(N) as a function of the vector length N and of two parameters r∞ and n_1/2:

r(N) := r∞ / (n_1/2 / N + 1) MFLOPS ,

where

– r∞: performance in MFLOPS for optimum vector length

– n_1/2: vector length where at least 50% of the peak performance, i.e. 0.5 · r∞ MFLOPS, can be obtained
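A small sketch of the Hockney/Jesshope model; the values for r∞ and n_1/2 below are illustrative assumptions, not data from the slides.

```python
def vector_performance(n, r_inf, n_half):
    """r(N) = r_inf / (n_half / N + 1) in MFLOPS."""
    return r_inf / (n_half / n + 1.0)

if __name__ == "__main__":
    r_inf, n_half = 8000.0, 100.0      # assumed: 8000 MFLOPS peak, n_1/2 = 100
    for n in (10, 100, 1_000, 10_000):
        print(f"N = {n:>6}: r(N) = {vector_performance(n, r_inf, n_half):7.1f} MFLOPS")
    # at N = n_half exactly half of r_inf is reached; long vectors approach r_inf
```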


4. Network Topologies

Valuation Criteria

• interconnection network: medium of communication of the processors or nodes, respectively

• topology: way of arrangement or connection of the processor nodes (modulo symmetries)

• different criteria to judge the quality of a network topology:

– complexity, costs: overall hardware costs to realize the net, that is basically the number of connecting lines, network cards, switches, and so on

– connectivity of a node: number of connections that exist between this node and other nodes (important cost factor)

– diameter: maximum distance (number of direct connections) between two nodes

– regularity of the network: extent of deviations in the local net quantities (connectivity, for example)

– length of lines: physical length of the connections (a room, a building; short distances are advantageous)

– blocking: does an existing communication between two nodes block the communication of other pairs of nodes (optimum: blocking-free networks)?


Valuation Criteria (cont’d)

• further criteria to judge the quality of a network topology:

– extensibility/flexibility: in which steps can the network be extended (arbitrarily, only in factors of 2 ...)?

– scalability: sensitivity of the crucial network properties to an increase of the number of nodes

– fault tolerance (redundancy): robustness of the network with respect to breakdowns of components

– throughput (bandwidth): maximum transmission rate of the network in MBit/s

– complexity of routing: how costly is the determination of the route of communication from the sender to the receiver?


Characteristic 1: Way of Connection

• static networks:

– fixed (hard-wired) connections between the different nodes

– control of the connection set-up and other control functions are done by the nodes themselves or by some special connection hardware

• dynamic networks:

– no direct dedicated, hard-wired connections between pairs of nodes

– internal switch network or switch to which all nodes are connected via input and output slots

– control functions are completely concentrated in the switch

– various routes can be switched


Characteristic 2: Way of Data Transfer

• packet switching:

– data packets of fixed length or messages of variable length are sent from the sender to the receiver node

– the crucial point is to determine the route (routing)

– requires some decomposition and packing at the sending node and some unpacking and assembling at the receiving node

– increased administration overhead (each packet must be provided with transmission information to be interpreted at the intermediate nodes)

– nevertheless standard for multiprocessors

• circuit switching:

– a direct private connection between sender and receiver is established

– physical connection stays active for the whole duration of the transmission

– higher transmission rates, generally


Characteristic 3: Addressing Mode

• destination-based:

– header of each packet contains a globally unique address of the receiver

– this address is used by each intermediate node for the determination of the further route

– most frequent choice in static networks

• source-based:

– packet contains the necessary information for the whole way

– intermediate nodes are only responsible for a correct forwarding

– most frequent choice in dynamic networks


Characteristic 4: Way of Routing

• deterministic routing:

– always the same path between two nodes

– advantage: simple path determination (or none at all, respectively)

– drawbacks: increased risk of blockings, poor fault tolerance

• adaptive routing:

– possibility to select alternative paths

– more flexible

– slightly increased hardware costs


Characteristic 5: Organization of Flow Control

• transmission between non-neighbouring nodes requires some buffer mechanism in the intermediate nodes

• flow control organizes the buffer management

• store-and-forward mode:

– intermediate node receives the message, completely stores it locally, and then forwards it

– at no point in time is there any direct connection from the sender to the receiver (unless they are neighbouring)

– used in the early multiprocessors and in WANs

• virtual-cut-through mode:

– message is passed as a sequence of transmission units

– header contains recipient and determines the further path of all units

– units that arrived completely are forwarded immediately

– in case of a blocking, the message is completely stored at the respective intermediate node

• wormhole routing mode:

– without blocking: identical with virtual-cut-through mode

– with blocking of the header: all units stay where they currently are (no memory problems in the front node)


Static Networks

• occur in both memory- and message-coupled systems

• nodes: processors, memory modules, or processors with local memory

• today used in multiprocessor systems, especially

• classification concerning their dimension

• Think about connectivity, diameter, and the other criteria mentioned above for all of the following static network topologies!

• one-dimensional topologies:

– chain: large diameter, therefore not interesting for higher numbers of nodes


Static Networks (cont'd)

• two-dimensional topologies:

– ring: connectivity 2 (cf. chain), but smaller diameter (n/2)

– chordal ring: ring with additional direct connections to every second (third, ...) node

– star: good broadcast abilities, but bottleneck in the central node

– binary tree: logarithmic diameter, but bottleneck in the root

– fat tree: higher bandwidth close to the root to prevent the bottleneck

– grid: regular, easily extensible, arbitrarily scalable, short length of lines

– torus: halving of the grid’s diameter


Static Networks (cont’d)

• three-dimensional topologies:

– cube: connectivity is three, diameter is three

• four-dimensional topologies:

– hypercube: connectivity 4, in the general n-dimensional case n; construction principle: connect the corresponding nodes of two n-dimensional hypercubes to construct the (n + 1)-dimensional hypercube

– dual (hyper)cube: processors on the edges; results in a better connectivity 2
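The hypercube construction can be made concrete with a small sketch (not from the slides): node addresses are n-bit numbers, and two nodes are neighbours exactly if their addresses differ in one bit, which directly gives connectivity n.

```python
def hypercube_neighbours(node, n):
    """The n direct neighbours of `node` in an n-dimensional hypercube (addresses differ in one bit)."""
    return [node ^ (1 << bit) for bit in range(n)]

if __name__ == "__main__":
    n = 4                               # four-dimensional hypercube, 16 nodes
    for node in (0, 5):
        print(f"node {node:04b} -> neighbours",
              [f"{m:04b}" for m in hypercube_neighbours(node, n)])
    # going from n to n+1 dimensions adds one address bit, i.e. connects the
    # corresponding nodes of two n-dimensional hypercubes
```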


Dynamic Networks

• bus topologies: the simple bus is the most frequently used structure for symmetric multiprocessors; poor scalability; multiple buses are fault tolerant, but rare

• modern and more powerful dynamic topologies are based upon switches, i.e. connection elements with n inputs and n outputs; all CPUs or memory elements etc. are connected to it, and the switch provides the one-to-one connections internally

• internal structure of general switches: built as a cascaded network of several elementary switches (each with a small number of inputs and outputs)


Dynamic Networks (cont’d)

• crossbar switch:

– hardware component which can be switched such that, in a set of input and output slots, all possible disjoint pairs can communicate simultaneously without any blocking

– for processor-processor coupling (message-coupled systems) or for processor-memory coupling (memory-coupled systems)

– depending on the state of the switch elements, the different pairs can communicate

– excellent performance, but very high hardware costs (especially for a larger number of nodes to be connected)

– example: the Earth Simulator consists of 640 nodes, all of which are directlyconnected via a 640-port crossbar switch


Dynamic Networks (cont’d)

• networks of simple switches:

– to avoid the high hardware costs of a crossbar switch, the interconnection network can be realized as a network of simple switch elements

– basic element: 2-switch with two input and two output slots (they can either forward the input signals or exchange them)

– further simple elements: 4-switch, broadcaster, ...

– interconnection networks based upon 2-switches are called permutation networks; we distinguish

* single-stage permutation networks: consist of one column of 2-switches

* multi-stage permutation networks: consist of several of those columns


Single-Stage Permutation Networks

• notation: the input vector (e_n, ..., e_1) is mapped onto the output vector (a_n, ..., a_1)

• variants:

– perfect shuffle: cyclic shift of the address bits

M(a_n, ..., a_1) = (a_{n−1}, ..., a_1, a_n)

– butterfly: exchange the first (highest) and the last (lowest) address bit

K(a_n, ..., a_1) = (a_1, a_{n−1}, ..., a_2, a_n)

– exchange: negation of the last (lowest) address bit

T(a_n, ..., a_1) = (a_n, ..., a_2, ¬a_1)
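A minimal sketch (not from the slides) of the three permutations acting on an n-bit node address given as an integer:

```python
def perfect_shuffle(addr, n):
    """Cyclic shift of the n address bits: (a_n, ..., a_1) -> (a_{n-1}, ..., a_1, a_n)."""
    msb = (addr >> (n - 1)) & 1
    return ((addr << 1) & ((1 << n) - 1)) | msb

def butterfly(addr, n):
    """Exchange the highest and the lowest address bit."""
    lo, hi = addr & 1, (addr >> (n - 1)) & 1
    cleared = addr & ~(1 | (1 << (n - 1)))
    return cleared | hi | (lo << (n - 1))

def exchange(addr, n):
    """Negate the lowest address bit."""
    return addr ^ 1

if __name__ == "__main__":
    n = 3
    for a in range(2 ** n):
        print(f"{a:03b} -> shuffle {perfect_shuffle(a, n):03b}, "
              f"butterfly {butterfly(a, n):03b}, exchange {exchange(a, n):03b}")
```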


Multi-Stage Permutation Networks

• for n = 2^p inputs and outputs, there are typically p stages of 2-switches, which means a diameter of p

• variants:

– omega network: serial connection of perfect shuffles

– switching-banyan network: serial connection of butterflies

– shuffle-exchange network: combination of different 2-switches (special arrangement)


5. Design of Parallel Programs

Design Patterns – Ways of Parallelizing

• main problem: how can we parallelize, and what has to be parallelized?

• larger differences between the sequential program and its parallel version are possible

• we meet several design patterns again and again:

– function parallelism: processing of different components of the program such as blocks or procedures is done in parallel; example: assembly-line production in the automotive industry

– data parallelism: parts of programs or whole programs are applied to different partitions of the overall data in parallel; example: solve a partial differential equation numerically with a domain decomposition method

– competition parallelism: all processors solve the same problem, but with different strategies or algorithms; example: determine the fastest of k given sorting algorithms


Function Parallelism

• principle: division of labour with respect to the tasks to be done

• characteristics:

– for each processor, a separate program has to be written

– limited degree of parallelism (each problem allows for a certain number of parallel subtasks only)

– hence, limited scalability

• appearances:

– problem-adjusted solution

– macropipelining:

* overlapping processing of the different subtasks

* the data to be processed are passed from processor to processor

* software-organized pipeline similar to the hardware pipelining within the processor

* requirement: all subtasks should entail roughly the same computational effort

* synchronous transmission of data (simultaneous phase of communication of all processors) or asynchronous with buffering


Data Parallelism

• each processor deals with a part of the data only, but covers all functional subtasks

• characteristics:

– assumption: underlying uniform data structure (array, e.g.)

– each processor covers the whole algorithm or program, but with a reduced input

– extreme example: array computer

– advantages: one program for all processors, often good parallelization properties

• structure of data-parallel programs:

– static: compiler defines parallelization and order (simple organization, but no flexibility if runtimes of the jobs differ or change)

– dynamic: control of the parallel processing is done dynamically during runtime, according to program organization or runtime system (more complex organization, but more flexible with respect to local changes of the load)


Dynamic Structure of the Program

• allows for a permanent dynamic load balancing

• different options of organization:

– order placing (master-slave):

* the master node places orders to slave nodes: this requires at least two different types of programs or program parts (for the master and for the slaves)

* often, the master process is a bottleneck (especially for large numbers of slaves)

– order fetching:

* there is a set of tasks not processed so far (bag of tasks) where idle processors look for new tasks to deal with

* the bag must be accessible to all processors, hence this is primarily used in the memory-coupled case
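A minimal bag-of-tasks sketch in Python (not from the slides; here the "bag" is a shared queue rather than shared memory, and the work per task is only a placeholder):

```python
import multiprocessing as mp

def worker(bag, results):
    # idle worker: repeatedly fetch the next unprocessed task from the bag
    while True:
        task = bag.get()
        if task is None:                  # sentinel: the bag is empty
            break
        results.put((task, task * task))  # placeholder for the real work

if __name__ == "__main__":
    bag, results = mp.Queue(), mp.Queue()
    for task in range(20):                # fill the bag of tasks
        bag.put(task)
    num_workers = 4
    for _ in range(num_workers):          # one sentinel per worker
        bag.put(None)
    workers = [mp.Process(target=worker, args=(bag, results))
               for _ in range(num_workers)]
    for w in workers: w.start()
    for w in workers: w.join()
    print(sorted(results.get() for _ in range(20)))
```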


Important Problem of Parallel Programming

• data access does not happen with some constant speed:

– this observation already holds for monoprocessors: registers, cache, main memory, and so on

– except for true shared memory, we now have an additional level in the memory hierarchy: local memory can be accessed in shorter time than remote memory (holds for direct remote access and for message-coupling)

– ratio can be up to 1:100 or 1:100000

• therefore: organize the program such that data access can be realized locally as often as possible

• this is a fundamental principle of parallel programming

• a lot of consequences for parallel algorithms, which often have to be modified compared with their sequential counterparts:

– loop organization: inner/outer loop (example matrix-vector product)

– order of operations


Examples of Parallelization Paradigms

• data or functional parallelization?

• shared or distributed memory?

• study three examples:

– find prime numbers

– compute flows

– search in trees


Example 1: Prime Numbers

• given: first 40 prime numbers p0, . . . , p39

• wanted: all prime numbers in the interval [p39 + 2, p239]

• algorithm: for all candidates n ∈ [p39 + 2, p239]:

∃i, 0 ≤ i ≤ 39 : n mod pi = 0 ⇒ n not a prime number

• variant A: code partitioning (functional parallelization): each processor checks all candidates for divisibility w.r.t. some of the primes

• variant B: data partitioning (data parallelization): each processor checks some of the candidates for divisibility w.r.t. all primes (both variants are sketched below)
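A hedged, purely sequential emulation of the two partitionings (the helper `first_primes`, the emulated processor count of 4, and the stride-based distribution of work are illustrative choices, not prescribed by the slides):

```python
# Sequential emulation of the two partitionings for the prime example.
def first_primes(k):
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

primes = first_primes(40)                  # p_0, ..., p_39
lo, hi = primes[-1] + 2, primes[-1] ** 2   # the interval [p_39 + 2, p_39^2]
candidates = list(range(lo, hi + 1))

def variant_b(rank, size):
    """Data partitioning: processor `rank` checks its share of the candidates
    against all primes."""
    return [n for n in candidates[rank::size] if all(n % p for p in primes)]

def variant_a_stage(survivors, rank, size):
    """Code partitioning: processor `rank` tests divisibility only by its share
    of the primes; a candidate is prime only if it survives every stage."""
    my_primes = primes[rank::size]
    return [n for n in survivors if all(n % p for p in my_primes)]

# emulate 4 processors on one machine and check that both variants agree
found_b = sorted(n for r in range(4) for n in variant_b(r, 4))
survivors = candidates
for r in range(4):
    survivors = variant_a_stage(survivors, r, 4)
assert survivors == found_b
print(len(found_b), "primes in the interval")
```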

Variant A with Distributed Memory

• basic algorithm:

– all candidates stored in a vector

– processor i checks all of them and sends possible primes to processor i + 1

– drawback: no parallelism!

• first improvement:

– processor i sends each checked candidate (if possibly prime) to processor i + 1 immediately

– now there is parallelism (pipelining), but a bad communication-computation ratio

• further improvement:

– processor i sends candidates in blocks of, say, 100 to processor i + 1

– now a good communication-computation ratio, but problems with the pipeline (e.g., during start-up and drain) are possible (a sketch of the blocked pipeline follows below)
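A possible realization of the blocked pipeline, assuming the mpi4py package is available (the block size of 100, the sentinel-based termination, and the file name are illustrative):

```python
# Blocked pipeline for variant A (distributed memory), assuming mpi4py.
# Run e.g. with: mpirun -n 4 python pipeline.py
from mpi4py import MPI

def first_primes(k):
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
primes = first_primes(40)
my_primes = primes[rank::size]      # stage `rank` tests only these divisors
BLOCK = 100                         # candidates are forwarded in blocks

def stage_input():
    """Yield blocks of candidates: stage 0 generates them, the others receive."""
    if rank == 0:
        lo, hi = primes[-1] + 2, primes[-1] ** 2
        cand = list(range(lo, hi + 1))
        for i in range(0, len(cand), BLOCK):
            yield cand[i:i + BLOCK]
    else:
        while True:
            block = comm.recv(source=rank - 1)
            if block is None:       # sentinel: pipeline is drained
                return
            yield block

found = []
for block in stage_input():
    survivors = [n for n in block if all(n % p for p in my_primes)]
    if rank + 1 < size:
        comm.send(survivors, dest=rank + 1)   # forward one block downstream
    else:
        found.extend(survivors)               # last stage keeps the primes

if rank + 1 < size:
    comm.send(None, dest=rank + 1)  # propagate the termination sentinel
else:
    print(f"{len(found)} primes found in the interval")
```

The start-up and drain phases mentioned above are visible here: while the first blocks travel down the chain, the later stages are still idle, and while the last blocks are processed, the earlier stages have already finished.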

Variant B with Shared Memory

• basic algorithm:

– the set of n candidates is subdivided into four subsets of n/4 candidates each

– when local computation terminates, primes are announced

– good: parallel method, no access conflicts to candidates

– bad: access conflicts to the shared list of found prime numbers are possible (see the sketch below)
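A structural sketch of this variant with Python threads (only to show the access pattern: the candidate subsets are private, while the result list is shared and therefore protected by a lock; CPython's GIL prevents real speed-up here, so this is not a performance recipe):

```python
# Variant B with shared memory, structural sketch only.
import threading

def first_primes(k):
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

primes = first_primes(40)
lo, hi = primes[-1] + 2, primes[-1] ** 2
candidates = list(range(lo, hi + 1))

found = []                       # shared list of results
found_lock = threading.Lock()    # protects `found` against access conflicts

def worker(rank, nthreads):
    # private subset of candidates: no conflicts while checking
    local = [n for n in candidates[rank::nthreads]
             if all(n % p for p in primes)]
    with found_lock:             # announce primes once local work is done
        found.extend(local)

threads = [threading.Thread(target=worker, args=(r, 4)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(found), "primes found")
```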

Variant B with Distributed Memory

• basic algorithm:

– candidates are equally distributed among processors

– primes are sent to processor 0 at the end

– good: parallel method, good communication-computation ratio

Example 2: Computational Fluid Mechanics

• simulation of wind tunnel experiments

– either via continuum mechanics and continuous quantities (velocity, pressure)

– or via motion of particles (used in the following)

• variant A: each processor takes care of a subset of particles

– difficult to find neighbours

– perfect load balance

• variant B: each processor takes care of a subdomain

– easy to find neighbours

– number of particles per processor may vary (both variants are contrasted in the sketch below)
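The two decompositions can be contrasted with a tiny sketch (the 1D positions, the random particle data, and the four processors are made-up illustration values):

```python
# Variant A (particle partitioning) vs. variant B (domain partitioning).
import random

P = 4                                                  # number of processors
particles = [random.random() for _ in range(1000)]     # 1D positions in [0, 1)

# variant A: round-robin distribution of particles (perfect balance,
# but spatial neighbours end up on arbitrary processors)
owner_a = {i: i % P for i in range(len(particles))}

# variant B: processor r owns the slab [r/P, (r+1)/P) (neighbours are local,
# but the particle count per processor may vary)
owner_b = {i: min(int(x * P), P - 1) for i, x in enumerate(particles)}

for r in range(P):
    na = sum(1 for o in owner_a.values() if o == r)
    nb = sum(1 for o in owner_b.values() if o == r)
    print(f"proc {r}: variant A owns {na:4d}, variant B owns {nb:4d} particles")
```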

Variant B with Shared Memory

• basic algorithm:

– each processor administers a list of particle data for the particles belonging to its subdomain

– during an iteration: compute the particle motion and detect (and collect in a list) the leaving particles

– after the iteration: add new particles to the list ("arrived" from neighbouring subdomains)

– good parallelization properties, but access conflicts and poor load balance are possible

• improvement:

– create many more subdomains than there are participating processors

– good load balance is possible (a sketch follows below)
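One simple static way to exploit such an over-decomposition is to assign the many subdomains greedily to the currently least-loaded processor; dynamic order fetching, as discussed earlier, would work as well. A sketch (the particle counts are invented):

```python
# Greedy static assignment of many subdomains to few processors: always give
# the next (largest) subdomain to the currently least-loaded processor.
import heapq, random

P, S = 4, 32                                   # processors, subdomains (S >> P)
particles_per_subdomain = [random.randint(0, 500) for _ in range(S)]

loads = [(0, p) for p in range(P)]             # (current load, processor id)
heapq.heapify(loads)
assignment = {}
for s in sorted(range(S), key=lambda s: -particles_per_subdomain[s]):
    load, p = heapq.heappop(loads)             # least-loaded processor so far
    assignment[s] = p
    heapq.heappush(loads, (load + particles_per_subdomain[s], p))

for p in range(P):
    total = sum(particles_per_subdomain[s]
                for s, q in assignment.items() if q == p)
    print(f"proc {p}: {total} particles")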

Variant B with Distributed Memory

• basic algorithm:

– same scheme

– after iteration:

* send lists of leaving particles to respective neighbouring processors

* receive lists of arriving particles from respective neighbouring processors

• improvements:

– more complicated than in the shared-memory case

– possible solution: reorganize the subdivision during the computation (the particle exchange between neighbours is sketched below)
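A hedged sketch of the particle exchange for a 1D chain of subdomains, assuming mpi4py (`sendrecv` keeps the exchange deadlock-free; the empty `leaving_*` lists are placeholders for the output of the motion step):

```python
# Exchange of leaving/arriving particles after one iteration, assuming mpi4py
# and a 1D chain of subdomains. Run with mpirun.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

local_particles = []                  # placeholder: particles of this subdomain
leaving_left, leaving_right = [], []  # filled during the motion update

# particles moving right: send to the right neighbour, receive from the left
arriving_from_left = comm.sendrecv(leaving_right, dest=right, source=left)
# particles moving left: send to the left neighbour, receive from the right
arriving_from_right = comm.sendrecv(leaving_left, dest=left, source=right)

# at the domain boundaries the neighbour is MPI.PROC_NULL and the receive
# yields nothing, hence the `or []`
local_particles.extend(arriving_from_left or [])
local_particles.extend(arriving_from_right or [])
```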

Example 3: Search Trees

• in each position (node) there are several possible ways of continuing

• branches can be distributed among participating processors

• problems:

– the level (tree depth) at which a solution lies is unknown

– the load of the subtrees may vary and is, hence, not foreseeable

– how to do load balancing?

– how to detect termination?

Search with Shared Memory

• algorithm:

– one processor computes the tree up to level j and puts descriptions of the continuation possibilities in a queue

– free processors take elements from the queue and compute the corresponding subtrees

– good parallelization properties

• load balancing:

– typically no problem

– after arriving at level j + k, append the possible continuations to the queue again (and so on)

• detection of termination:

– if a solution is found, set a flag

– the other processors check the flag from time to time (see the sketch below)
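A minimal sketch of this scheme with Python threads (the toy tree, the goal test, and the timeout-based reaction to an empty queue are simplifications; a real implementation would need proper termination detection for the no-solution case):

```python
# Toy shared-memory tree search: a queue of subtree roots plus a shared flag.
import queue, threading

GOAL, MAX_DEPTH = (3, 3, 3), 6
work = queue.Queue()
found = threading.Event()                    # the termination flag

def children(node):
    return [node + (i,) for i in range(4)]   # branching factor 4 (toy tree)

def worker():
    while not found.is_set():                # check the flag from time to time
        try:
            node = work.get(timeout=0.1)
        except queue.Empty:
            return                           # simplification: assume bag empty
        if node[:len(GOAL)] == GOAL:
            found.set()                      # solution found: set the flag
        elif len(node) < MAX_DEPTH:
            for c in children(node):         # re-append the continuations
                work.put(c)

for c in children(()):                       # "level j" expansion, done once
    work.put(c)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("solution found:", found.is_set())
```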

Search with Distributed Memory

• algorithm:

– one processor computes the tree up to level j and puts descriptions of the continuation possibilities in a local queue

– the other processors contact this one in order to get a subtree to process

– good parallelization properties

• load balancing:

– typically no problem

– solution as before

• detection of termination:

– if a solution is found, broadcast a message

– the other processors check for a "found!" message from time to time or get informed via an interrupt (a sketch follows below)
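A hedged sketch of this termination detection, assuming mpi4py: the rank that finds a solution sends a small "found!" message to everyone else, and the others poll for it with Iprobe between two pieces of work (the "search" is faked so that one rank succeeds after a few steps; the tag value is arbitrary):

```python
# Termination detection by message, assuming mpi4py. Run with mpirun.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
FOUND_TAG = 99
finder = size - 1                            # pretend this rank will succeed

def announce_found():
    for r in range(size):
        if r != rank:
            comm.send(True, dest=r, tag=FOUND_TAG)

def someone_found():
    if comm.Iprobe(source=MPI.ANY_SOURCE, tag=FOUND_TAG):
        comm.recv(source=MPI.ANY_SOURCE, tag=FOUND_TAG)
        return True
    return False

step, done = 0, False
while not done:
    step += 1                                # one unit of (fake) search work
    if rank == finder and step == 5:
        announce_found()
        done = True
    elif someone_found():
        done = True
print(f"rank {rank} stopped after {step} steps")
```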

Phases of the Development of Parallel Programs

• extended phase model of software engineering

• early phases of design (essentially before the first run of the program):

– specification: formal problem description with the help of graphical or textual tools

– verification: formal proof of correctness of the program (hardly any powerful tools available yet)

– coding: implementation in some programming language

– subdivision: partitioning into concurrent, communicating processes

– mapping: assignment of resources to programs or data

– performance evaluation (an overarching activity): simulation and profiling

• late phases of design (at least one program run on the target architecture has taken place):

– performance analysis: observation of the load of the involved resources

– visualization: track the dynamic program execution

– debugging: tests

– dynamic load distribution

Tools for the Development of Parallel Programs

• the use of specific tools during the different steps is crucial

• such tools are primarily available for the late phases

• objective: move from manual to (semi-)automated program development

• especially for the subdivision, we need estimates of various quantities of the program (resources needed, memory requirements, communication profile)

• examples of development tools:

– simulators

– profilers: estimate the runtime of (parts of) programs already during compilation

– monitors: observe the program’s execution (hardware monitors, software monitors, ...; important: small influence on the measurements); note that this notion of a monitor has nothing to do with the one we studied in the context of synchronization!

– parallel debuggers: facilitate the detection of errors in parallel programs

– program flow visualizers: tools for graphical interpretation

Alternative Phase Model

• a strongly heuristic procedure

• four phases in the development of parallel programs:

– partitioning: subdivision of program and data into smaller parts; primary objective: allow for as much parallelism as possible, without taking into account practical restrictions of the hardware and so on

– communication: define communication requirements and appropriate communication hardware

– bundling: evaluation of the first two phases’ results with respect to performance and costs

– mapping: assign tasks to processors (often a difficult balancing between optimum processor loads and minimum communication costs); can be realized statically by the compiler or dynamically during runtime via load balancing

• focus of the first two phases: detection of parallelism, good scalability

• focus of the last two phases: locality, parallel efficiency

