Chapter 2 Parallel Architectures
  • Chapter 2: Parallel Architectures

  • Outline
  - Some chapter references
  - Brief review of complexity
  - Terminology for comparisons
  - Interconnection networks
  - Processor arrays
  - Multiprocessors
  - Multicomputers
  - Flynn's Taxonomy (moved to Chapter 1)

  • Some Chapter References
  - Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook).
  - G. C. Fox, What Have We Learnt from Using Real Parallel Machines to Solve Real Problems?, Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.)
  - A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, 2003 (first edition 1994).
  - Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch. 1, 3-5.

  • References (continued)
  - Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallelism, 2nd Edition, Ch. 2. (Discusses details of some serious problems that MIMDs incur.)
  - Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.
  - Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994, Ch. 1-2.
  - Sayed H. Roosta, Parallel Processing & Parallel Algorithms: Theory and Computation, Springer Verlag, 2000, Ch. 1.
  - Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch. 1-2.

  • Brief Review: Complexity Concepts Needed for Comparisons
  Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes.
  Definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n0 such that 0 ≤ f(n) ≤ c·g(n) for n ≥ n0.

    O(n) is read as "big-oh of n". This notation can be used to separate counts into complexity classes that characterize the size of the count. We can use it for any kind of counting function, such as timings, bisection widths, etc.
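    To make the definition concrete, here is a small worked witness (an illustrative example, not from the slides):

```latex
% Worked witness for the definition above: f(n) = 3n^2 + 5n is in O(n^2).
% Take c = 4 and n_0 = 5; then 5n \le n^2 for all n \ge 5, so
\[
0 \le 3n^2 + 5n \le 3n^2 + n^2 = 4n^2 = c\,g(n)
\quad\text{for all } n \ge n_0 = 5 .
\]
```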

  • Big-Oh and Asymptotic Growth Rate
  - The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function.
  - The statement "f(n) is O(g(n))" means that the growth rate of f(n) is no more than the growth rate of g(n).
  - We can use the big-Oh notation to rank functions according to their growth rate.

    Assume:
                          f(n) is O(g(n))    g(n) is O(f(n))
    g(n) grows faster     Yes                No
    f(n) grows faster     No                 Yes
    Same growth           Yes                Yes

  • Relatives of Big-Oh
  big-Omega: f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n0 ≥ 1 such that f(n) ≥ c·g(n) for n ≥ n0.
  Intuitively, this says that, up to a constant factor, f(n) is asymptotically greater than or equal to g(n).

    big-Theta: f(n) is Θ(g(n)) if there are constants c' > 0 and c'' > 0 and an integer constant n0 ≥ 1 such that 0 ≤ c'·g(n) ≤ f(n) ≤ c''·g(n) for n ≥ n0.
    Intuitively, this says that, up to a constant factor, f(n) and g(n) are asymptotically the same.
    Note: These concepts are covered in algorithms courses.

  • Relatives of Big-Oh
  little-oh: f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that 0 ≤ f(n) < c·g(n) for n ≥ n0.
  Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) ∉ Θ(g(n)).
  little-omega: f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that f(n) > c·g(n) ≥ 0 for n ≥ n0.
  Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) ∉ Θ(g(n)).
  These are not used as much as the earlier definitions, but they round out the picture.

  • Summary of Intuition for Asymptotic Notation
  - big-Oh: f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n).
  - big-Omega: f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n).
  - big-Theta: f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n).
  - little-oh: f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n).
  - little-omega: f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n).

  • A Calculus Definition of O and Θ (often easier to use)
  Definition: Let f and g be functions defined on the positive integers with nonnegative values. We say g is in O(f) if and only if
  lim (n → ∞) g(n)/f(n) = c
  for some nonnegative real number c, i.e., the limit exists and is not infinite.
  Definition: We say f is in Θ(g) if and only if f is in O(g) and g is in O(f).
  Note: L'Hopital's Rule is often used to calculate the limits you need.
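  As an illustration of this definition (our example, not from the slides), L'Hopital's Rule gives:

```latex
% Illustration: n is in O(2^n). By L'Hopital's Rule,
\[
\lim_{n \to \infty} \frac{n}{2^n}
  = \lim_{n \to \infty} \frac{1}{2^n \ln 2}
  = 0 ,
\]
% a finite limit, so n is in O(2^n). In the other direction,
% 2^n / n grows without bound, so 2^n is not in O(n).
```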

  • Why Asymptotic Behavior Is Important
  1) It allows us to compare counts on large sets.
  2) It helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running.
  3) It stresses the fact that even dramatic speedups in hardware do not overcome the handicap of an asymptotically slow algorithm.

  • Recall: ORDER WINS OUT (example from Baase's Algorithms text)
  The TRS-80
  - Main language support: BASIC, a typically slow-running interpreted language
  - For more details on the TRS-80 see: http://mate.kjsl.com/trs80/
  The CRAY-YMP
  - Language used in the example: FORTRAN, a fast-running language
  - For more details on the CRAY-YMP see: http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3

  • ORDER WINS OUT (cont.): CRAY YMP vs. TRS-80
  CRAY YMP with FORTRAN: time grows as 3n³. TRS-80 with BASIC: time grows as 19,500,000n.

    n           CRAY YMP (3n³)    TRS-80 (19,500,000n)
    10          3 microsec        200 millisec
    100         3 millisec        2 sec
    1000        3 sec             20 sec
    2500        50 sec            50 sec
    10000       49 min            3.2 min
    1000000     95 years          5.4 hours

  microsecond (abbr. μsec): one-millionth of a second. millisecond (abbr. msec): one-thousandth of a second.
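  The table can be sanity-checked with a few lines of Python (an illustrative sketch; it assumes both operation counts are in nanoseconds, which is consistent with every entry above):

```python
# Quick sanity check of the table above (illustrative; assumes both
# counts are in nanoseconds, which matches the listed times).
def cray(n):      # CRAY YMP running FORTRAN: 3n^3 ns
    return 3 * n**3 * 1e-9          # seconds

def trs80(n):     # TRS-80 running BASIC: 19,500,000n ns
    return 19_500_000 * n * 1e-9    # seconds

for n in (10, 100, 1000, 2500, 10_000, 1_000_000):
    print(f"n={n:>8}: CRAY {cray(n):.3g} s, TRS-80 {trs80(n):.3g} s")
# Up to n = 2500 the cubic algorithm on the fast machine wins; beyond
# that the asymptotically slower algorithm loses badly
# (n = 1,000,000: about 95 years vs. about 5.4 hours).
```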

  • Interconnection Networks
  Uses of interconnection networks:
  - Connect processors to shared memory
  - Connect processors to each other
  Interconnection media types:
  - Shared medium
  - Switched medium
  Different interconnection networks define different parallel machines. The interconnection network's properties influence the type of algorithm used for various machines, as they affect how data is routed.

  • Shared versus Switched Media

  • Shared Medium
  - Allows only one message at a time
  - Messages are broadcast
  - Each processor listens to every message
  - Before sending a message, a processor listens until the medium is unused
  - Collisions require resending of messages
  - Ethernet is an example

  • Switched Medium
  - Supports point-to-point messages between pairs of processors
  - Each processor is connected to one switch
  Advantages over shared media:
  - Allows multiple messages to be sent simultaneously
  - Allows scaling of the network to accommodate an increase in processors

  • Switch Network Topologies
  - View a switched network as a graph: vertices = processors or switches; edges = communication paths
  - Two kinds of topologies: direct and indirect

  • Direct Topology
  - Ratio of switch nodes to processor nodes is 1:1
  - Every switch node is connected to 1 processor node and at least 1 other switch node
  Indirect Topology
  - Ratio of switch nodes to processor nodes is greater than 1:1
  - Some switches simply connect to other switches

  • Terminology for Evaluating Switch Topologies
  We need to evaluate four characteristics of a network in order to understand its effectiveness in implementing efficient parallel algorithms on a machine with that network:
  - The diameter
  - The bisection width
  - The edges per node
  - The constant edge length
  We'll define these and see how they affect algorithm choice. Then we will investigate several different topologies and see how these characteristics are evaluated.

  • Terminology for Evaluating Switch Topologies
  Diameter: the largest distance between two switch nodes. A low diameter is good: it puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.

  • Terminology for Evaluating Switch Topologies

    Bisection width: the minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within 1 node, if the number of processors is odd). A high bisection width is good: in algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm. Actually proving what the bisection width of a network is can be quite difficult.

  • Terminology for Evaluating Switch Topologies

    Number of edges per node: it is best if the number of edges per node is a constant independent of network size, as that allows the system to scale more easily to a larger number of nodes. The degree is the maximum number of edges per node.
    Constant edge length? (yes/no): again, for scalability, it is best if the nodes and edges can be laid out in 3-D space so that the maximum edge length is a constant independent of network size.

  • Evaluating Switch Topologies
  Many topologies have been proposed and analyzed. We will consider several well-known ones:
  - 2-D mesh
  - linear network
  - binary tree
  - hypertree
  - butterfly
  - hypercube
  - shuffle-exchange
  Several of these have been used in commercial parallel computers.

  • 2-D Meshes
  Note: circles represent switches and squares represent processors in all these slides.

  • 2-D Mesh Network
  - Direct topology
  - Switches arranged into a 2-D lattice or grid
  - Communication allowed only between neighboring switches
  - Torus: a variant that includes wraparound connections between switches on the edge of the mesh

  • Evaluating 2-D Meshes (assumes the mesh is square)
  n = number of processors
  - Diameter: Θ(n^(1/2)). Places a lower bound on algorithms that require arbitrary nodes to share data.
  - Bisection width: Θ(n^(1/2)). Places a lower bound on algorithms that require distribution of data to all nodes.
  - Max number of edges per switch: 4 (note: this is the degree)
  - Constant edge length? Yes
  - Does this scale well? Yes

  • Linear Network
  - Switches arranged into a 1-D mesh
  - Corresponds to a row or column of a 2-D mesh
  - Ring: a variant that allows a wraparound connection between the switches on the ends
  - The linear and ring networks have many applications; essentially they support a pipeline in both directions
  - Although these networks are very simple, they support many optimal algorithms.

  • Evaluating Linear and Ring Networks
  - Diameter: linear, n - 1, or Θ(n); ring, n/2, or Θ(n)
  - Bisection width: linear, 1, or Θ(1); ring, 2, or Θ(1)
  - Degree for switches: 2
  - Constant edge length? Yes
  - Does this scale well? Yes

  • Binary Tree Network
  - Indirect topology
  - n = 2^d processor nodes and 2n - 1 switches, where d = 0, 1, ... is the number of levels
  - e.g., 2^3 = 8 processors on the bottom and 2(8) - 1 = 15 switches

  • Evaluating the Binary Tree Network
  - Diameter: 2 log n (note: this is small)
  - Bisection width: 1, the lowest possible number
  - Degree: 3
  - Constant edge length? No
  - Does this scale well? No

  • Hypertree Network (of degree 4 and depth 2)
  (a) Front view: 4-ary tree of height 2
  (b) Side view: upside-down binary tree of height d
  (c) Complete network

  • Hypertree Network
  - Indirect topology
  - Note: the degree k and the depth d must be specified. From the front, this gives a k-ary tree of height d.
  - From the side, the same network looks like an upside-down binary tree of height d.
  - Joining the front and side views yields the complete network.

  • Evaluating a 4-ary Hypertree with n = 16 Processors
  - Diameter: log n (shares the low diameter of the binary tree)
  - Bisection width: n/2 (a large value; much better than the binary tree)
  - Edges per node: 6
  - Constant edge length? No

  • Butterfly Network
  - Indirect topology
  - n = 2^d processor nodes connected by n(log n + 1) switching nodes
  - A 2^3 = 8 processor butterfly network has 8 × 4 = 32 switching nodes
  - As complicated as this switching network appears to be, it is really quite simple, as it admits a very nice routing algorithm!
  - Note: the bottom row of switches is normally identical with the top row.
  - The rows are called ranks.

  • Building the 2^3 Butterfly Network
  - There are 8 processors.
  - There are 4 ranks (i.e., rows) with 8 switches per rank.
  - Connections: node(i, j), for i > 0, is connected to two nodes on rank i - 1, namely node(i-1, j) and node(i-1, m), where m is the integer found by inverting the i-th most significant bit in the binary d-bit representation of j.
  - For example, suppose i = 2 and j = 3. Then node(2, 3) is connected to node(1, 3). To get the other connection, 3 = 011 in binary; flipping the 2nd most significant bit gives 001, so node(2, 3) is also connected to node(1, 1). (NOTE: There is an error on pg 32 on this example.)
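  The wiring rule above is easy to express in code. A minimal sketch (ours, not the textbook's), where d is the number of address bits and node(i, j) means rank i, column j:

```python
# Sketch of the butterfly wiring rule above (illustrative).
# node(i, j): rank i in 0..d, column j in 0..2**d - 1.
def butterfly_up_links(i, j, d):
    """Return the two rank i-1 nodes that node(i, j) connects to (i > 0)."""
    assert 0 < i <= d and 0 <= j < 2 ** d
    m = j ^ (1 << (d - i))      # invert the i-th most significant of d bits
    return (i - 1, j), (i - 1, m)

# The example from the slide: d = 3, node(2, 3)
print(butterfly_up_links(2, 3, d=3))    # ((1, 3), (1, 1))
```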

  • Why It Is Called a Butterfly Network
  Walk cycles such as node(i, j), node(i-1, j), node(i, m), node(i-1, m), node(i, j), where m is determined by the bit flipping as shown, and you see a butterfly:

  • Butterfly Network Routing
  Send a message from processor 2 to processor 5.

    Algorithm: 0 means ship left; 1 means ship right.
    1) 5 = 101. Pluck off the leftmost bit (1) and send "01" + msg to the right.
    2) Pluck off the leftmost bit (0) and send "1" + msg to the left.
    3) Pluck off the leftmost bit (1) and send msg to the right.
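    A hedged sketch of this routing rule in Python (the 0-left/1-right encoding is exactly as stated above):

```python
# Illustrative butterfly routing: consume the destination address one
# bit at a time, most significant bit first (0 = left, 1 = right).
def butterfly_route(dest, d):
    """Yield the left/right decisions for routing to processor `dest`."""
    for rank in range(d - 1, -1, -1):       # d decisions, MSB first
        yield "right" if (dest >> rank) & 1 else "left"

print(list(butterfly_route(5, d=3)))
# ['right', 'left', 'right'] for 5 = 101, matching the three steps above
```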

  • Evaluating the Butterfly Network
  - Diameter: log n
  - Bisection width: n/2
  - Edges per node: 4 (even for d ≥ 3)
  - Constant edge length? No; as the rank decreases, edge length grows exponentially.

  • Hypercube (or Binary n-Cube)
  - n = 2^d processors and n switch nodes
  - A butterfly with the columns of switch nodes collapsed into a single node.

  • Hypercube (or Binary n-Cube)
  - n = 2^d processors and n switch nodes
  - Direct topology
  - A 2 × 2 × ... × 2 mesh
  - Number of nodes is a power of 2
  - Node addresses are 0, 1, ..., 2^k - 1
  - Node i is connected to the k nodes whose addresses differ from i in exactly one bit position.
  - Example: with k = 4, node 0111 is connected to 1111, 0011, 0101, and 0110.

  • Growing a Hypercube
  Note: for d = 4, it is a 4-dimensional cube.

  • Evaluating the Hypercube Network
  - Diameter: log n
  - Bisection width: n/2
  - Edges per node: log n
  - Constant edge length? No. The length of the longest edge increases as n increases.

  • Routing on the Hypercube Network
  - Example: send a message from node 2 = 0010 to node 5 = 0101.
  - The nodes differ in 3 bits, so the shortest path will be of length 3.
  - One path is 0010 → 0110 → 0100 → 0101, obtained by flipping one of the differing bits at each step.
  - As with the butterfly network, bit flipping helps you route on this network.
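  The bit-flipping route is easy to mechanize. A minimal sketch (illustrative; it repairs differing bits from the most significant downward, one of several valid orders):

```python
# Illustrative hypercube routing: repair differing address bits one at
# a time; each repaired bit corresponds to traversing one edge.
def hypercube_path(src, dst, k):
    """Return a shortest src -> dst path on a k-dimensional hypercube."""
    path, node = [src], src
    for bit in range(k - 1, -1, -1):
        if (node ^ dst) & (1 << bit):   # this address bit still differs
            node ^= 1 << bit            # flip it: move along that edge
            path.append(node)
    return path

print([format(v, "04b") for v in hypercube_path(0b0010, 0b0101, k=4)])
# ['0010', '0110', '0100', '0101'] -- the path from the slide
```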

  • A Perfect Shuffle
  A permutation that is produced as follows is called a perfect shuffle: given a power-of-2 number of cards, numbered 0, 1, 2, ..., 2^d - 1, write the card number with d bits. By left-rotating the bits with a wrap, we calculate the position of the card after the perfect shuffle.
  Example: for d = 3, card 5 = 101. Left-rotating and wrapping gives us 011, so card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position.
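  A one-line left-rotate-with-wrap implements this. A small sketch (ours, for illustration):

```python
# Illustrative left-rotate-with-wrap for the perfect shuffle above.
def perfect_shuffle(card, d):
    """New position of `card` (0 <= card < 2**d) after a perfect shuffle."""
    mask = (1 << d) - 1
    return ((card << 1) | (card >> (d - 1))) & mask

for card in range(8):                   # the d = 3 example from the slide
    print(card, "->", perfect_shuffle(card, 3))
# 5 -> 3, while 0 and 7 stay in position, matching the slide.
```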

  • Shuffle-Exchange Network Illustrated
  (Figure shows nodes 0 through 7.)
  - Direct topology
  - Number of nodes is a power of 2
  - Nodes have addresses 0, 1, ..., 2^d - 1
  - Two outgoing links from node i: a shuffle link to node LeftCycle(i), and an exchange link between node i and node i + 1 when i is even

  • Shuffle-Exchange Addressing, 16 Processors
  - No arrows on a line segment means it is bidirectional; otherwise, you must follow the arrows.
  - Devising a routing algorithm for this network is interesting and will be a homework problem.

  • Evaluating the Shuffle-Exchange
  - Diameter: 2 log n - 1
  - Bisection width: n / log n
  - Edges per node: 3
  - Constant edge length? No

  • Two Problems with the Shuffle-Exchange
  - The shuffle-exchange does not expand well: a large shuffle-exchange network does not decompose well into smaller separate shuffle-exchange networks.
  - In a large shuffle-exchange network, a small percentage of nodes will be hot spots: they will encounter much heavier traffic.
  - The above results are in the dissertation of one of Batcher's students.

  • Comparing Networks
  - All have logarithmic diameter except the 2-D mesh.
  - Hypertree, butterfly, and hypercube have bisection width n/2.
  - All have a constant number of edges per node except the hypercube.
  - Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases.
  - The shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, and good bisection width.
  - However, the negative results on the preceding slide also need to be considered.
  (A side-by-side sketch of these metrics follows.)
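  The sketch below tabulates the metrics quoted on the preceding slides for one network size (illustrative; the 2-D mesh diameter uses the exact value 2(√n - 1) behind the Θ(n^(1/2)) bound):

```python
# Side-by-side view of the metrics from the preceding slides
# (illustrative; n = number of processors, a power of 4 here).
import math

n = 16
lg = lambda x: int(math.log2(x))
networks = {
    #  name                diameter            bisection       edges/node
    "2-D mesh":         (2 * (int(n**.5) - 1), int(n**.5),     4),
    "binary tree":      (2 * lg(n),            1,              3),
    "hypertree":        (lg(n),                n // 2,         6),
    "butterfly":        (lg(n),                n // 2,         4),
    "hypercube":        (lg(n),                n // 2,         lg(n)),
    "shuffle-exchange": (2 * lg(n) - 1,        n // lg(n),     3),
}
print(f"{'network':<17}{'diameter':>9}{'bisection':>11}{'edges/node':>12}")
for name, (dia, bis, deg) in networks.items():
    print(f"{name:<17}{dia:>9}{bis:>11}{deg:>12}")
```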

  • Alternate Names for SIMDs
  Recall that all active processors of a SIMD computer must simultaneously access the same memory location. The value in the i-th processor can be viewed as the i-th component of a vector. SIMD machines are sometimes called vector computers [Jordan et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.

  • SIMD Computers
  SIMD computers that focus on vector operations:
  - Support some vector and possibly matrix operations in hardware
  - Usually limit or provide less support for non-vector operations involving data in the vector components
  General-purpose SIMD computers:
  - Support more traditional operations (e.g., other than for vector/matrix data types)
  - Usually also provide some vector and possibly matrix operations in hardware

  • Pipelined Architectures
  - Pipelined architectures are sometimes considered to be SIMD architectures (see pg 37 of the textbook and pgs 8-9 of Jordan et al.).
  - Vector components are entered successively into the first processor in the pipeline.
  - The i-th processor of the pipeline receives the output from the (i-1)-th processor.
  - Normal operations in each processor are much larger (coarser) in pipelined computers than in true SIMDs.
  - Pipelined architectures are somewhat SIMD in nature in that synchronization is not required.

  • Why Processor Arrays?
  - Historically, the high cost of control units
  - Scientific applications have data parallelism

  • Data/Instruction Storage
  Front-end computer:
  - Also called the control unit
  - Holds and runs the program
  - Data manipulated sequentially
  Processor array:
  - Data manipulated in parallel

  • Processor Array Performance
  - Performance: work done per time unit
  - The performance of a processor array depends on the speed of the processing elements and the utilization of the processing elements.

  • Performance Example 1
  - 1024 processors
  - Each adds a pair of integers in 1 μsec (1 microsecond, one-millionth of a second, or 10^-6 second)
  - What is the performance when adding two 1024-element vectors (one component per processor)?

  • Performance Example 2
  - 512 processors
  - Each adds two integers in 1 μsec
  - What is the performance when adding two vectors of length 600?
  - Since 600 > 512, 88 processors must add two pairs of integers.
  - The other 424 processors add only a single pair of integers.
  (The worked arithmetic for both examples follows.)
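  The arithmetic for both examples, as a small sketch (illustrative; it assumes, as the slides do, that all PEs move in lockstep, so Example 2 takes two addition steps):

```python
# Worked answers to the two performance examples (illustrative).
T_ADD = 1e-6                  # one integer addition per PE: 1 microsecond

# Example 1: 1024 PEs, vectors of length 1024 -> one addition each.
perf1 = 1024 / T_ADD          # 1.024e9 additions/second

# Example 2: 512 PEs, vectors of length 600.
# 600 - 512 = 88 PEs do two additions; the other 424 do one.
# In lockstep, the whole operation takes 2 microseconds.
perf2 = 600 / (2 * T_ADD)     # 3.0e8 additions/second

print(f"Example 1: {perf1:.3e} ops/s, Example 2: {perf2:.3e} ops/s")
```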

  • Example of a 2-D Processor Interconnection Network in a Processor Array
  - Each VLSI chip has 16 processing elements.
  - Each PE can simultaneously send a value to a neighbor.
  - PE = processing element

  • SIMD Execution Style
  The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
  - The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit.
  - The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel.
  - The front end normally executes the sequential portions of the program.
  - Each processing element has a local memory that cannot be directly accessed by the host or other processing elements.

  • SIMD Execution Style
  - Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel.
  - When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel.
  - Although the PEs execute in parallel, some units can be allowed to skip any particular instruction.

  • Masking on Processor Arrays
  - All the processors work in lockstep except those that are masked out (by setting a mask register).
  - The conditional if-then-else is different for processor arrays than the sequential version:
  - Every active processor tests to see if its data meets the negation of the boolean condition.
  - If it does, it sets its mask bit so that those processors will not participate in the operation initially.
  - Next, the unmasked processors execute the THEN part.
  - Afterwards, the mask bits (for the original set of active processors) are flipped and the unmasked processors perform the ELSE part.
  (A small simulation sketch follows the figure slides below.)

  • if (COND) then A else B
  (Three slides of figures illustrating the masking steps for this statement.)
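  A minimal simulation of this masked execution using NumPy array masks (an illustrative sketch, not how a real processor array is programmed):

```python
# Illustrative lockstep masking with NumPy: every PE holds one vector
# component; masked PEs skip the current instruction.
import numpy as np

data = np.array([3, -1, 4, -1, 5, -9, 2, -6])
result = np.empty_like(data)

# COND here (hypothetical example): data >= 0. PEs whose data meets the
# NEGATED condition set their mask bit and sit out the THEN part.
mask = data < 0

result[~mask] = data[~mask] * 2     # THEN part: only unmasked PEs act
result[mask] = -data[mask]          # masks flipped; ELSE part runs on the rest

print(result)                       # [ 6  1  8  1 10  9  4  6]
```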

  • SIMD Machines
  - An early SIMD computer designed for vector and matrix processing was the Illiac IV, built at the University of Illinois (see Jordan et al., pg 7).
  - The MPP, DAP, the Connection Machines CM-1 and CM-2, and the MasPar MP-1 and MP-2 are examples of SIMD computers (see Akl, pgs 8-12, and [Quinn 94]).
  - The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs (see [Jordan et al., pg 7], [Quinn 94, pgs 61-62], and [Quinn 2004, pg 37]).

  • SIMD Machines
  - Quinn [1994, pgs 63-67] discusses the CM-2 Connection Machine and the smaller, updated CM-200.
  - Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO.
  - The ASPRO is a small second-generation STARAN used by the Navy in spy planes.
  - Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution and is currently displayed at a D.C. airport.

  • Today's SIMDs
  - Many SIMDs are being embedded in SISD machines.
  - Others are being built as parts of hybrid architectures.
  - Others are being built as special-purpose machines, although some of them could qualify as general purpose.
  - Much of the recent work with SIMD architectures is proprietary.

  • A Company Building an Inexpensive SIMD
  - WorldScape is producing a COTS (commodity off-the-shelf) SIMD.
  - It is not a traditional SIMD, as the hardware doesn't synchronize every step; instead, the hardware design supports efficient synchronization.
  - Their machine is programmed like a SIMD.
  - The U.S. Navy has observed that their machines process radar an order of magnitude faster than others.
  - There is quite a bit of information about their work at http://www.wscape.com

  • An Example of a Hybrid SIMD: Embedded Massively Parallel Accelerators

    Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan.
    (This and the next three slides are due to Prabhakar R. Gudla (U of Maryland), from a CMSC 838T presentation, 4/23/2003.)

  • Hybrid Architecture
  - Combines the SIMD and MIMD paradigms within one parallel architecture (a hybrid computer).

  • Architecture of Systola 1024
  - Instruction systolic array: a 32 × 32 mesh of processing elements
  - Wavefront instruction execution

  • SIMDs Embedded in SISDs
  - Intel's Pentium 4 includes what they call MMX technology to gain a significant performance boost.
  - IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine.
  - Both MMX technology and the Velocity Engine are the chip manufacturers' names for their proprietary SIMD processors and parallel extensions to their operating code.
  - This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.

  • Special-Purpose SIMDs in the Bioinformatics Arena
  Paracel (acquired by Celera Genomics in 2000):
  - Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence analysis capability
  - Supports over a million processors
  - GeneMatcher was used by Celera in its race with the U.S. government to complete the sequencing of the human genome
  TimeLogic, Inc:
  - Has DeCypher, a reconfigurable SIMD

  • Advantages of SIMDs (reference: [Roosta, pg 10])
  - Less hardware than MIMDs, as they have only one control unit; control units are complex.
  - Less memory needed than MIMD: only one copy of the instructions needs to be stored, which allows more data to be stored in memory.
  - Less startup time in communicating between PEs.

  • Advantages of SIMDs
  - A single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, and debug (similar to sequential programming).
  - Control flow operations and scalar operations can be executed on the control unit while the PEs are executing other instructions.
  - MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.

  • Advantages of SIMDs
  - During a communication operation between PEs, the PEs send data to a neighboring PE in parallel and in lockstep.
  - No need to create a header with routing information, as routing is determined by the program steps.
  - The entire communication operation is executed synchronously, so a tight (worst-case) upper bound on the time for this operation can be computed.
  - Less complex hardware in a SIMD, since no message decoder is needed in the PEs; MIMDs need a message decoder in each PE.

  • SIMD Shortcomings (with some rebuttals)

    These claims are from our textbook [Quinn 2004]; similar statements are found in [Grama et al.].
    Claim 1: Not all problems are data-parallel.
    - While true, most problems seem to have data-parallel solutions. In [Fox et al.], the observation was made, in their study of large parallel applications, that most were data-parallel by nature but often had points where significant branching occurred.

  • SIMD Shortcomings (with some rebuttals)
  Claim 2: Speed drops for conditionally executed branches.
  - Processors in both MIMDs and SIMDs normally have to do a significant amount of condition testing.
  - MIMD processors can execute multiple branches concurrently; with SIMDs, only one of these branches can be executed at a time.
  - For an if-then-else statement with execution times for the then and else parts being roughly equal, about 1/2 of the SIMD processors are idle during its execution.
  - With additional branching, the average number of inactive processors can become even higher.
  - This reason justifies the study of multiple SIMDs (or MSIMDs).

  • SIMD Shortcomings (with some rebuttals)

    Claim 2 (cont.): Speed drops for conditionally executed code.
    - In [Fox et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.
    - The cost of the extremely simple processors used in a SIMD is extremely low.
    - Programmers used to worry about full utilization of memory, but stopped after memory costs became insignificant overall.

  • SIMD Shortcomings (with some rebuttals)

    Claim 3: SIMDs don't adapt to multiple users well.
    - This is true to some degree for all parallel computers.
    - If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by sharing it.
    - This reason also justifies the study of multiple SIMDs (or MSIMDs).
    - The SIMD architecture has not received the attention that MIMD has received and could greatly benefit from further research.

  • SIMD Shortcomings (with some rebuttals)

    Claim 4: SIMDs do not scale down well to starter systems that are affordable.
    - This point is arguable, and its truth is likely to vary rapidly over time.
    - WorldScape/ClearSpeed currently sells a very economical SIMD board that plugs into a PC.

  • SIMD Shortcomings (with some rebuttals)
  Claim 5: SIMDs require customized VLSI for processors, while the expense of control units has dropped.
  - Reliance on COTS (commodity off-the-shelf) parts has dropped the price of MIMDs.
  - The expense of PCs (with control units) has dropped significantly.
  - However, reliance on COTS has fueled the success of the low-level parallelism provided by clusters and has restricted new innovative parallel architecture research for well over a decade.

  • SIMD Shortcomings (with some rebuttals)
  Claim 5 (cont.):
  - There is strong evidence that the period of continual dramatic increases in the speed of PCs and clusters is ending.
  - Continued rapid increases in parallel performance will be necessary in the future in order to solve important problems that are beyond our current capabilities.
  - Additionally, with the appearance of very economical COTS SIMDs, this claim no longer appears to be relevant.

  • Multiprocessors
  - Multiprocessor: a multiple-CPU computer with a shared memory.
  - The same address on two different CPUs refers to the same memory location.
  - Avoids three of the problems cited for SIMDs: it can be built from commodity CPUs, it naturally supports multiple users, and it maintains efficiency in conditional code.

  • Centralized Multiprocessor

  • Centralized Multiprocessor
  - A straightforward extension of the uniprocessor: add CPUs to the bus.
  - All processors share the same primary memory.
  - The memory access time is the same for all CPUs.
  - This is a uniform memory access (UMA) multiprocessor, also called a symmetric multiprocessor (SMP).

  • Private and Shared Data
  - Private data: items used only by a single processor.
  - Shared data: values used by multiple processors.
  - In a centralized multiprocessor (i.e., an SMP), processors communicate via shared data values.

  • Problems Associated with Shared Data: The Cache Coherence Problem
  - Replicating data across multiple caches reduces contention among processors for shared data values.
  - But how can we ensure that different processors have the same value for the same address?
  - The cache coherence problem arises when an obsolete value is still stored in a processor's cache.

  • Write Invalidate Protocol
  - The most common solution to cache coherency.
  - Each CPU's cache controller monitors (snoops) the bus and identifies which cache blocks are requested by other CPUs.
  - A PE gains exclusive control of a data item before performing a write.
  - Before the write occurs, all other copies of the data item cached by other PEs are invalidated.
  - When any other CPU tries to read a memory location from an invalidated cache block, a cache miss occurs, and it has to retrieve the updated data from memory.

  • Cache-Coherence Problem (figure sequence)
  - Memory holds X = 7; no cache has a copy.
  - A CPU reads X from memory and caches the value 7. A read from memory is not a problem.
  - A second CPU also reads X; both caches now hold 7.
  - The second CPU writes 2 to X (the write goes through to main memory). Writing to main memory is a problem: the first CPU's cache still holds the stale value 7.

  • Write Invalidate Protocol (figure sequence)
  - Both caches hold X = 7. Each cache controller monitors (snoops) the bus to see which cache block is being requested by other processors.
  - "Intent to write X": before the write can occur, all other copies of the data at that address are declared invalid.
  - After the invalidation, the writer updates X to 2. When another processor tries to read this location from its invalidated cache block, it receives a cache miss and has to refresh the value from main memory.

  • Synchronization Required for Shared Data
  Mutual exclusion:
  - Definition: at most one process can be engaged in an activity at any time.
  - Example: only one processor can write to the same address in main memory at the same time. We say that a process must mutually exclude all others while it performs this write.
  Barrier synchronization:
  - Definition: guarantees that no process will proceed beyond a designated point (called the barrier) until every process reaches that point.
  (A small sketch of both primitives follows.)
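  Both primitives have direct analogues in ordinary thread libraries. A minimal sketch using Python's threading module (illustrative only; a multiprocessor provides these through hardware or OS support):

```python
# Illustrative use of the two primitives just defined, via threads.
import threading

N = 4
total = 0
lock = threading.Lock()           # mutual exclusion
barrier = threading.Barrier(N)    # barrier synchronization

def worker(my_value):
    global total
    with lock:                    # at most one thread updates at a time
        total += my_value
    barrier.wait()                # no thread proceeds until all arrive
    # Past the barrier, every thread sees the fully accumulated total.
    print(f"thread sees total = {total}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
```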

  • Distributed Multiprocessor
  - Distributes primary memory among the processors.
  - Increases aggregate memory bandwidth and lowers average memory access time.
  - Allows a greater number of processors.
  - Also called a non-uniform memory access (NUMA) multiprocessor: local memory access time is fast, while non-local memory access time can vary.
  - The distributed memories form one logical address space.

  • Distributed Multiprocessors

  • Cache Coherence
  - Some NUMA multiprocessors do not support it in hardware: only instructions and private data are stored in the cache. This policy creates a large memory access time variance.
  - Implementation is more difficult: there is no shared memory bus to snoop, so a directory-based protocol is needed.

  • Directory-Based Protocol
  - A distributed directory contains information about the cacheable memory blocks: one directory entry for each cache block.
  - Each entry has the sharing status and which processors have copies.

  • Sharing Status
  - Uncached (denoted by U): the block is not in any processor's cache.
  - Shared (denoted by S): cached by one or more processors; read only.
  - Exclusive (denoted by E): cached by exactly one processor; that processor has written the block, so the copy in memory is obsolete.

  • Directory-Based Protocol: Walkthrough, Steps 1-26
  (Each of the following slides showed the interconnection network with per-CPU caches, memories, and directories; only the directory entry for X, its bit vector, and the protocol messages are reproduced here.)
  - Step 1: initial configuration.
  - Step 2: X has value 7 in memory; directory entry X: U 0 0 0. The three bits are a bit vector recording which CPUs hold a copy.
  - Steps 3-5: CPU 0 reads X. The entry goes from U 0 0 0 to S 1 0 0, and CPU 0's cache receives the value 7.
  - Steps 6-8: CPU 2 reads X. The entry goes to S 1 0 1.
  - Steps 9-11: CPU 0 writes 6 to X. A write miss is signaled, the other cached copies are invalidated, and the entry becomes E 1 0 0, with X = 6 in CPU 0's cache.
  - Steps 12-15: CPU 1 reads X. A read miss is signaled, the block is switched to shared, and the entry becomes S 1 1 0.
  - Steps 16-18: CPU 2 writes 5 to X. A write miss is signaled, the other copies are invalidated, and the entry becomes E 0 0 1, with X = 5 in CPU 2's cache.
  - Steps 19-24: CPU 0 writes 4 to X. The block is taken away from CPU 2, cache block storage for X is created at CPU 0, and the entry becomes E 1 0 0, with X = 4 in CPU 0's cache.
  - Step 25: CPU 0 writes back the X block (data write back to memory).
  - Step 26: CPU 0 flushes cache block X; the entry returns to U 0 0 0.
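  A minimal sketch of the U/S/E state changes driven through this same sequence (our simplification, not the textbook's implementation; write-backs and message traffic are reduced to print statements):

```python
# Illustrative U/S/E directory entry reacting to reads and writes,
# replayed over the walkthrough above (simplified: the memory copy is
# updated eagerly instead of waiting for an explicit write-back).
class DirectoryEntry:
    def __init__(self, n_cpus, value):
        self.state = "U"               # U = uncached, S = shared, E = exclusive
        self.sharers = [0] * n_cpus    # bit vector: which CPUs hold a copy
        self.memory = value            # the copy in memory

    def read(self, cpu):
        if self.state == "E":          # owner writes back; block -> shared
            print(f"  switch to shared; CPU {self.sharers.index(1)} writes back")
        self.state = "S"
        self.sharers[cpu] = 1

    def write(self, cpu, value):
        if self.state == "E":
            print(f"  take away from CPU {self.sharers.index(1)}")
        elif self.state == "S":
            print("  invalidate other copies")
        self.sharers = [0] * len(self.sharers)
        self.sharers[cpu] = 1
        self.state = "E"
        self.memory = value

entry = DirectoryEntry(3, 7)    # step 2:     X = 7, U 0 0 0
entry.read(0)                   # steps 3-5:  S 1 0 0
entry.read(2)                   # steps 6-8:  S 1 0 1
entry.write(0, 6)               # steps 9-11: E 1 0 0
entry.read(1)                   # steps 12-15: S 1 1 0
entry.write(2, 5)               # steps 16-18: E 0 0 1
entry.write(0, 4)               # steps 19-24: E 1 0 0
print(entry.state, entry.sharers)    # E [1, 0, 0]
```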

  • Characteristics of Multiprocessors
  - Interprocessor communication is done in the memory interface by read and write instructions.
  - Memory may be physically distributed, reads and writes from different processors may take different amounts of time, and congestion of the interconnection network may occur.
  - Memory latency (i.e., the time to complete a read or write) may be long and variable.
  - Most messages through the bus or interconnection network are the size of single memory words.
  - Randomization of requests may be used to reduce the probability of collisions.

  • Multicomputers
  - A distributed-memory, multiple-CPU computer.
  - The same address on different processors refers to different physical memory locations.
  - Processors interact through message passing.

  • Typically, Two Flavors of Multicomputers
  Commercial multicomputers:
  - Custom switch network
  - Low latency (the time it takes to get a response from something)
  - High bandwidth (data path width) across processors
  Commodity clusters:
  - Mass-produced computers, switches, and other equipment
  - Use low-cost components
  - Message latency is higher
  - Communications bandwidth is lower

  • Multicomputer Communication
  - Processors are connected by an interconnection network.
  - Each processor has a local memory and can only access its own local memory.
  - Data is passed between processors using messages, as dictated by the program.
  - Data movement across the network is also asynchronous.
  - A common approach is to use MPI to handle message passing. (A minimal sketch follows.)
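  A minimal message-passing sketch using the mpi4py binding of MPI (illustrative; the textbook uses MPI from C, but the calls correspond directly):

```python
# Minimal MPI send/recv sketch (run with e.g. `mpiexec -n 2 python file.py`).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"x": 7}, dest=1, tag=0)    # explicit send to processor 1
    print("rank 0 sent data")
elif rank == 1:
    data = comm.recv(source=0, tag=0)     # explicit matching receive
    print("rank 1 received", data)
```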

  • Multicomputer Communication (cont.)
  - Multicomputers can be scaled to larger sizes much more easily than multiprocessors.
  - The amount of data transmitted between processors has a huge impact on performance.
  - The distribution of the data among the processors is a very important factor in performance efficiency.

  • Message-Passing Advantages
  - No problem with simultaneous access to data.
  - Allows different PCs to operate on the same data independently.
  - Allows PCs on a network to be easily upgraded when faster processors become available.

  • Disadvantages of Message-Passing
  - Programmers must make explicit message-passing calls in the code. This is low-level programming and is error prone.
  - Data is not shared but copied, which increases the total data size.
  - Data integrity: there is difficulty in maintaining the correctness of multiple copies of a data item.

  • Some Interconnection Network Terminology (1/2)
  References: Wilkinson et al. & Grama et al.; also the earlier slides on architecture and networks.
  - A link is the connection between two nodes.
  - A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed.
  - The link between two nodes can either be bidirectional or use two directional links.
  - Either one wire to carry one bit, or parallel wires (one wire for each bit in a word), can be used.
  - The above choices do not have a major impact on the concepts presented in this course.

  • Network Terminology (2/2)
  - The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).
  - The network latency is the time required to transfer a message through the network.
  - The communication latency is the total time required to send a message, including software overhead and interface delay.
  - The message latency or startup time is the time required to send a zero-length message. It includes the software and hardware overhead, such as choosing a route and packing and unpacking the message.

  • Circuit Switching Message Passing
  - This technique establishes a path and allows the entire message to transfer uninterrupted.
  - Similar to a telephone connection that is held until the end of the call.
  - The links used are not available to other messages until the transfer is complete.
  - Latency (message transfer time): if the length of the control packet sent to establish the path is small wrt (with respect to) the message length, the latency is essentially the constant L/B, where L is the message length and B is the bandwidth.

  • Store-and-Forward Packet Switching
  - The message is divided into packets of information.
  - Each packet includes the source and destination addresses.
  - Packets cannot exceed a fixed maximum size (e.g., 1000 bytes).
  - A packet is stored in a node in a buffer until it can move to the next node.

  • Packet Switching (cont.)
  - At each node, the destination information is examined and used to select which node to forward the packet to.
  - Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams.
  - Significant latency is created by storing each packet in each node it reaches.
  - Latency increases linearly with the length of the route.

  • Virtual Cut-Through Packet Switching
  - Used to reduce latency.
  - Allows a packet to pass through a node without being stored if the outgoing link is available.
  - If the complete path is available, a message can move immediately from source to destination.

  • Wormhole Routing
  - An alternative to store-and-forward packet routing.
  - A message is divided into small units called flits (flow control units).
  - Flits are 1-2 bytes in size.
  - They can be transferred in parallel on links with multiple wires.
  - Only the head flit is transferred initially, when the next link becomes available.

  • Wormhole Routing (cont.)
  - As each flit moves forward, the next flit can move forward.
  - The entire path must be reserved for the message, as the flits are linked together and pull each other along (like the cars of a train).
  - Request/acknowledge bit messages are required to coordinate these pull-along moves (see Wilkinson et al.).
  - Latency: if the head flit is very small compared to the length of the message, then the latency is essentially the constant L/B, with L the message length and B the link bandwidth.
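  First-order latency models for the three switching techniques just discussed, as a sketch (illustrative; startup and per-hop routing overheads are ignored, and the store-and-forward model is the simple non-pipelined one):

```python
# First-order models (illustrative): L = message length in bits,
# B = bandwidth in bits/sec, H = number of hops on the route.
def circuit_switching(L, B, H):
    return L / B                   # path reserved once; transfer uninterrupted

def store_and_forward(L, B, H):
    return H * (L / B)             # buffered in full at every hop: linear in H

def wormhole(L, B, H, flit_bits=16):
    return H * (flit_bits / B) + L / B   # only the tiny head flit pays per hop

L_, B_, H_ = 1_000_000, 1e9, 10    # 1 Mbit message, 1 Gbit/s links, 10 hops
for f in (circuit_switching, store_and_forward, wormhole):
    print(f"{f.__name__:>18}: {f(L_, B_, H_) * 1e6:9.2f} usec")
# Store-and-forward grows linearly with route length; circuit switching
# and wormhole routing stay near L/B, as the slides state.
```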

  • Deadlock
  - Routing algorithms are needed to find a path between the nodes.
  - Adaptive routing algorithms choose different paths depending on traffic conditions.
  - Livelock is a deadlock-type situation where a packet continues to move around the network without ever reaching its destination.
  - Deadlock: no packet can be forwarded because all are blocked by other stored packets waiting to be forwarded.

  • Asymmetrical Multicomputers
  - Has a front end that interacts with users and I/O devices.
  - Processors in the back end are used for computation.
  - Similar to SIMDs (or processor arrays).
  - Common with early multicomputers.
  - Examples of asymmetrical multicomputers are given in the textbook.

  • Asymmetrical MC Advantages
  - Back-end processors are dedicated to parallel computations, making performance easier to understand, model, and tune.
  - Only a simple back-end operating system is needed, which is easy for a vendor to create.

  • Asymmetrical MC Disadvantages
  - The front-end computer is a single point of failure.
  - A single front-end computer limits the scalability of the system.
  - The primitive operating system in the back-end processors makes debugging difficult.
  - Every application requires development of both a front-end and a back-end program.

  • Symmetrical Multicomputers
  - Every computer executes the same operating system and has identical functionality.
  - Users may log into any computer to edit or compile their programs.
  - Any or all computers may be involved in the execution of their program.
  - During execution of a program, every PE executes the same program.
  - When only one PE should execute an operation, an if statement is used to select the PE.

  • Symmetric Multicomputers

  • Symmetrical MC Advantages
  - Alleviates the performance bottleneck caused by a single front-end computer.
  - Better support for debugging.
  - Every processor executes the same program.

  • Symmetrical MC Disadvantages
  - More difficult to maintain the illusion of a single parallel computer.
  - No simple way to balance the program development workload among processors.
  - More difficult to achieve high performance when multiple processes run on each processor (details on the next slide).

  • Symmetrical MC Disadvantages (cont.)
  It is more difficult to achieve high performance when multiple processes run on each processor:
  - Processes on the same processor compete for the same resources: CPU cycles, cache space, memory bandwidth.
  - Increased cache misses, since the cache is PE-oriented instead of process-oriented.

  • ParPar Cluster, a Mixed Model
  - The mixed model incorporates both asymmetrical and symmetrical designs.

  • A Commodity Cluster vs. a Network of Workstations
  A commodity cluster contains components of local area networks:
  - Commodity computers
  - Switches
  A network of workstations is a dispersed collection of computers:
  - Distributed heterogeneous computers
  - Located on primary users' desks
  - Unused cycles available for parallel use

  • Best Model for a Commodity Cluster
  - A full-fledged operating system (e.g., Linux) is desirable, a feature of the symmetrical multicomputer.
  - It is desirable to increase cache hits, which favors having only a single user process on each PE and favors most nodes being off-limits for program development.
  - A fast network is needed: keep program development users off its networks, and access the front end by another path.
  - Overall, a mixed model may be best for commodity clusters.

  • Ideal Commodity Cluster Features
  - Co-located computers
  - Dedicated to running parallel jobs
  - No keyboards or displays
  - Identical operating system
  - Identical local disk images
  - Administered as an entity

  • Network of Workstations
  - Dispersed computers, typically located on users' desks
  - First priority: the person at the keyboard; parallel jobs run in the background
  - Different operating systems and different local images
  - Checkpointing and restarting are important
  - Typically connected by Ethernet, which is too slow for commodity-cluster usage

  • Summary
  - Commercial parallel computers appeared in the 1980s.
  - Multiple-CPU computers now dominate.
  - Small-scale: centralized multiprocessors.
  - Large-scale: distributed-memory architectures (multiprocessors and multicomputers).

