Parallel Computers

Alexandre David, 1.2.05

adavid@cs.aau.dk

08-02-2010 MVP'10 - Aalborg University 2

How much do we need to know?

It is important to know the architecture of parallel hardware.
Not all details are important to programmers:
  keep portability
  keep up with technological changes

The point: Get a meaningful model.

Intel Core Duo

Cache coherence protocol: Modified, Exclusive, Shared, Invalid (MESI).

more shared L2
low latency
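
The slide only names the four states. As a rough sketch of how such a protocol behaves, the table below encodes the usual textbook MESI transitions for one cache line, seen from one cache. The event names are made up for illustration, and real processors differ in the details.

```python
# Simplified textbook MESI transitions for ONE cache line in ONE cache.
# Event names are hypothetical labels, not taken from the slides.
TRANSITIONS = {
    # (state, event): next state
    ("I", "local_read_no_other_copy"): "E",
    ("I", "local_read_others_have_copy"): "S",
    ("I", "local_write"): "M",
    ("I", "remote_read"): "I",
    ("I", "remote_write"): "I",
    ("S", "local_read"): "S",
    ("S", "local_write"): "M",   # other copies get invalidated
    ("S", "remote_read"): "S",
    ("S", "remote_write"): "I",
    ("E", "local_read"): "E",
    ("E", "local_write"): "M",   # no other copy exists, so no invalidation needed
    ("E", "remote_read"): "S",
    ("E", "remote_write"): "I",
    ("M", "local_read"): "M",
    ("M", "local_write"): "M",
    ("M", "remote_read"): "S",   # dirty data is written back first
    ("M", "remote_write"): "I",  # dirty data is written back first
}


def run(events, state="I"):
    """Apply a sequence of events to a line that starts in `state`."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state


if __name__ == "__main__":
    # Read a line nobody else holds, modify it, then another core reads it.
    print(run(["local_read_no_other_copy", "local_write", "remote_read"]))
```

Every write to a shared line forces the other caches to drop their copy, which is one reason sharing data between cores is relatively expensive.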

Example

The point: Relatively expensive.

AMD Dual Core Opteron

cache coherence protocol

Modified, Owned, Exclusive, Shared, Invalid (MOESI)

easier for SMP

Core i7

SMP

Caches “snoop” on the bus.
The bus is a bottleneck.

Larger SMP – Sun Fire

18 boards connected by a crossbar switch.
Snooping buses.
Directory-based cache coherence protocol.
Scalable/higher latency.

Note: Expensive hardware.

Crossbar

N x N connections.
Expensive, limited.

Heterogeneous chips

GPUs
  800 ALUs on ATI’s latest 4800 series.
  Less logic, more computational units.
FPGAs
  PCI boards available
  reconfigurable
Cell
  Dual-threaded PPC – PPU, 64 bits
  8x SPU

Cell architecture

18.2 GB/s
128 bits

No cache coherence protocol.

Different philosophy: the PPU is a coordinator, the SPUs do the job.

Difficult to program.

Clusters

“Cheap” PCs connected together.
  Gigabit Ethernet
  InfiniBand
  …
Memory private to each machine, use message-based communication.
Scalable but high latency.
Sold by racks.

BlueGene

65536 x @ 700 MHz

interesting part

Interconnect

3-D torus for standard data transfers.

Collective network for fast reductions.
Very powerful.
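
To make the torus concrete, here is a minimal sketch of the hop count between two nodes of a 3-D torus. Each dimension wraps around, so a message can go either way around the ring. The function name and the 8 x 8 x 8 size are assumptions for illustration, not the dimensions of the real machine.

```python
def torus_hops(a, b, dims):
    """Minimum number of hops between nodes a and b on a torus of size dims."""
    hops = 0
    for x, y, n in zip(a, b, dims):
        d = abs(x - y) % n
        hops += min(d, n - d)  # the wrap-around link may be the shorter way
    return hops


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical size
    print(torus_hops((0, 0, 0), (7, 0, 0), dims))  # neighbours through the wrap-around link
    print(torus_hops((0, 0, 0), (4, 4, 4), dims))  # farthest node in this torus
```

The farthest node is half-way around in every dimension, which is why the diameter of a torus is about half that of a mesh of the same size.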

Broadcast/Reduction

Broadcast Reduce

Cut-through routing

Simplified packet routing:
  Packets take the same path (1x routing information).
  In-sequence packet delivery (no sequencing information needed).
  Error detection at message level, cheap detection (for good networks).
  Fixed-size unit for packets = flow control digits (flits).
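
The benefit of cut-through routing can be sketched with the usual textbook cost formulas. The symbols are assumptions not found on the slide: t_s is the startup time, t_h the per-hop time, t_w the per-word transfer time, m the message size in words, and l the number of links traversed. The numbers in the example are made up.

```python
def store_and_forward_time(t_s, t_h, t_w, m, l):
    """Each intermediate node receives the whole message before forwarding it."""
    return t_s + (m * t_w + t_h) * l


def cut_through_time(t_s, t_h, t_w, m, l):
    """Flits are pipelined through the path, so the message size is paid only once."""
    return t_s + l * t_h + m * t_w


if __name__ == "__main__":
    # Hypothetical values, in arbitrary time units.
    args = dict(t_s=10, t_h=1, t_w=2, m=100, l=4)
    print(store_and_forward_time(**args))
    print(cut_through_time(**args))
```

With cut-through routing the distance term and the message-size term add up instead of multiplying.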

Lessons

Very different architectures.
  SMP
  Distributed
But we want one meaningful model.
Hints:
  local accesses - cheap
  non-local accesses - expensive

RAM model

Sequential execution unit with unbounded memory.
  every operation takes 1 unit of time
Limited
  ok for algorithms – reason on complexity
  unrealistic

Application of the RAM model

Expected: O(n), O(log n) (array must be sorted)

update of location missing
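
The code from the slide is not in the transcript. As a stand-in, assuming the two bounds refer to sequential search and binary search, the sketch below counts one RAM step per loop iteration. The function names are hypothetical.

```python
def linear_search(a, key):
    """Return (index, steps); O(n) steps under the RAM model."""
    steps = 0
    for i, x in enumerate(a):
        steps += 1
        if x == key:
            return i, steps
    return -1, steps


def binary_search(a, key):
    """Return (index, steps); O(log n) steps, but `a` must be sorted."""
    lo, hi, steps = 0, len(a) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if a[mid] == key:
            return mid, steps
        if a[mid] < key:
            lo = mid + 1  # the search range must shrink on every iteration
        else:
            hi = mid - 1
    return -1, steps


if __name__ == "__main__":
    data = list(range(1024))
    print(linear_search(data, 1023))
    print(binary_search(data, 1023))
```

On 1024 sorted elements the worst case is 1024 steps for the linear search and at most 11 for the binary search.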

PRAM model

Several execution units accessing one shared unbounded memory
  global access
  synchronous access – one global clock
  contention resolved by pre-defined rules
EREW, CREW, CRCW, ERCW
  least powerful, least convenient: EREW
  most powerful, most convenient: CRCW
  lesson: reason on CRCW but apply on EREW because it is possible to simulate one with the other (in polynomial time)
like RAM: good for algorithms, complexity…
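
A small simulation of PRAM-style reasoning, as a sketch only: a tree reduction that sums n values in about log2 n synchronous steps. In each step every cell is read by at most one "processor" and written by at most one, which matches the EREW rules. The sequential loop stands in for the processors, and the function name is hypothetical.

```python
def pram_sum(values):
    """Simulate a synchronous tree reduction; return (sum, number of parallel steps)."""
    cells = list(values)
    n = len(cells)
    stride, steps = 1, 0
    while stride < n:
        # One synchronous step: all reads happen first, then all writes.
        # Processor i reads cell i + stride and adds it into its own cell i.
        reads = {i: cells[i + stride] for i in range(0, n - stride, 2 * stride)}
        for i, v in reads.items():
            cells[i] += v
        stride *= 2
        steps += 1
    return (cells[0] if cells else 0), steps


if __name__ == "__main__":
    print(pram_sum(range(1, 9)))  # 8 values
```

The step count ignores any communication cost, which is exactly the assumption that makes PRAM convenient for analysis and optimistic in practice.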

CTA (Candidate Type Architecture)

Account for communication costs.
Applies to clusters & SMPs.
Local/non-local accesses.
Goal: Achieve in practice the predicted running time. PRAM is misleading in that respect.
The catch: Not easy to estimate communication costs.

Model:
  interconnected processors with RAM
  topology not specified but this impacts communication costs.

CTA

SMP
Cluster
Cell
…
Memory latency specified as a function of the real architecture.
Non-local: λ.

Typical λ

Lesson

Use locality
  temporal & spatial
  sometimes redundant computation is better than sending data around
Exact number of processors supplied at runtime.
  scale/not tied to one setup
Note: λ increases with P.
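
A back-of-the-envelope version of this reasoning, as a sketch: assume a local operation costs 1 time unit and a non-local reference costs λ units. The function names and all numbers below are hypothetical and not measurements of any machine.

```python
def cta_time(local_ops, nonlocal_refs, lam):
    """Estimated time: 1 unit per local operation, lam units per non-local reference."""
    return local_ops + lam * nonlocal_refs


def better_to_recompute(recompute_ops, values_to_fetch, lam):
    """True if recomputing a value locally is cheaper than fetching it remotely."""
    return cta_time(recompute_ops, 0, lam) < cta_time(0, values_to_fetch, lam)


if __name__ == "__main__":
    print(cta_time(local_ops=1000, nonlocal_refs=10, lam=500))
    # Recomputing costs 200 local operations; fetching costs one non-local reference.
    print(better_to_recompute(200, 1, lam=500))  # high latency, e.g. a cluster
    print(better_to_recompute(200, 1, lam=100))  # lower latency
```

The same 200 redundant operations pay off at λ = 500 but not at λ = 100, so the decision depends on the machine the program actually runs on.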

Memory reference mechanisms

Shared memory
  avoid race conditions, needs synchronization
One-sided
  not common
  private (local) & shared non-coherent memory
Message passing – 2-sided
  MPI
  Complex communication protocols.
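
The contrast between the shared-memory and message-passing styles can be sketched with Python threads. This is only an analogy: the threads stand in for processors, and a `queue.Queue` stands in for a message channel such as the one MPI provides. The function names are hypothetical.

```python
import queue
import threading


def shared_memory_sum(n_threads, increments):
    """All workers update one shared counter; the lock prevents a race condition."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:  # synchronization is needed on every shared update
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def message_passing_sum(n_threads, increments):
    """Each worker counts privately, then sends one message with its result."""
    inbox = queue.Queue()

    def worker():
        local = 0  # private memory, no synchronization needed
        for _ in range(increments):
            local += 1
        inbox.put(local)  # send

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(n_threads))  # receive


if __name__ == "__main__":
    print(shared_memory_sum(4, 1000), message_passing_sum(4, 1000))
```

The message-passing version works on private data and communicates once per worker, which is the access pattern that suits machines with a large λ.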

Memory consistency models

Sequential consistency – expensive.
  serialize the operations of all processors
  operations obey specified order
Relaxed consistency – weaker.
  variations
Keep in mind: There are hardware tricks to get sequential consistency (CAS/TAS, i.e. compare-and-swap and test-and-set).

Interconnects

Bus Based Networks

No local cache

Local cache

Serialize accesses – cheap.

Crossbar Networks

Parallel access – expensive.

Omega networks

Multi-stage network – compromise cost/performance.
N nodes – log2 N stages.
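
A minimal sketch of how a message crosses the log2 N stages, assuming the usual construction: a perfect shuffle between stages, and 2 x 2 switches that pick their output from one bit of the destination address (most significant bit first). The function names are hypothetical.

```python
def omega_route(src, dst, n_bits):
    """Return the line positions visited from src to dst in a 2**n_bits omega network."""
    mask = (1 << n_bits) - 1
    pos = src
    path = [pos]
    for stage in range(n_bits):
        # Perfect shuffle: rotate the position left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # The switch sets the lowest bit to the next destination bit.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def omega_switches(n_bits):
    """Number of 2x2 switches: N/2 per stage, log2 N stages."""
    return (1 << n_bits) // 2 * n_bits


if __name__ == "__main__":
    print(omega_route(2, 5, 3))
    print(omega_switches(3), "switches versus", 8 * 8, "crosspoints in a crossbar")
```

For N = 8 this gives 12 switches against 64 crosspoints in a crossbar. The price is that two messages may need the same switch output at the same time, so the network can block.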

Linear Arrays and Meshes

Hypercubes

2^d nodes, d = dimension, good routing, relatively expensive, low congestion
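
"Good routing" can be illustrated with a short sketch. Nodes are numbered so that neighbours differ in exactly one bit, and a message fixes the differing bits one dimension at a time. The function name is hypothetical.

```python
def hypercube_route(src, dst, d):
    """Return the nodes visited from src to dst in a d-dimensional hypercube."""
    path = [src]
    cur = src
    for k in range(d):
        if (cur ^ dst) & (1 << k):  # addresses differ in dimension k
            cur ^= 1 << k           # move to the neighbour along that dimension
            path.append(cur)
    return path


if __name__ == "__main__":
    print(hypercube_route(0b000, 0b101, 3))
```

The number of hops equals the number of differing bits, so no route is longer than d.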

Fat trees

More bandwidth where it is needed.

Evaluating The Networks

All the previous topologies have advantages and disadvantages.
Important factors: cost and performance.
Define criteria to characterize cost and performance.

Criteria

Diameter: maximum distance between any two processors pa ↔ pb.
Connectivity: measure of the multiplicity of paths.
Bisection width: minimum number of links to cut in order to partition the network into 2 equal halves.
Bisection bandwidth: minimum volume of communication allowed between the 2 halves.
Cost: number of communication links, i.e., wires.
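
Two of these criteria are easy to compute for small networks. The sketch below builds a ring and a hypercube as adjacency sets, measures the diameter with a breadth-first search from every node, and counts the links. All function names are hypothetical.

```python
from collections import deque


def ring(p):
    """Ring of p nodes (p >= 3)."""
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}


def hypercube(d):
    """Hypercube with 2**d nodes."""
    return {i: {i ^ (1 << k) for k in range(d)} for i in range(1 << d)}


def diameter(graph):
    """Largest shortest-path distance between any two nodes."""
    def eccentricity(start):
        dist = {start: 0}
        todo = deque([start])
        while todo:
            u = todo.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    todo.append(v)
        return max(dist.values())

    return max(eccentricity(s) for s in graph)


def cost(graph):
    """Number of links; each link appears in two adjacency sets."""
    return sum(len(nbrs) for nbrs in graph.values()) // 2


if __name__ == "__main__":
    for name, g in (("ring(8)", ring(8)), ("hypercube(3)", hypercube(3))):
        print(name, "diameter", diameter(g), "cost", cost(g))
```

With 8 nodes each, the hypercube has the smaller diameter (3 against 4) but needs more links (12 against 8), which is the cost/performance trade-off the criteria are meant to expose.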