Page 1: Computer Architecture

Computer Architecture

MIMD Parallel Processors

Iolanthe II racing in Waitemata Harbour

Page 2: Computer Architecture

Classification of Parallel Processors

• Flynn's Taxonomy
  • Classifies according to instruction and data stream
• Single Instruction Single Data
  • Sequential processors
• Single Instruction Multiple Data
  • CM-2 - multiple small processors
  • Vector processors
  • Parts of commercial processors - MMX, AltiVec
• Multiple Instruction Single Data
  • ?
• Multiple Instruction Multiple Data
  • General parallel processors

Page 3: Computer Architecture

MIMD Systems

• Recipe
  • Buy a few high-performance commercial PEs
    • DEC Alpha
    • MIPS R10000
    • UltraSPARC
    • Pentium?
  • Put them together with some memory and peripherals on a common bus
  • Instant parallel processor!
• How to program it?

Page 4: Computer Architecture

Programming Model

• Problem not unique to MIMD
  • Even sequential machines need one
  • von Neumann (stored program) model
• Parallel - splitting the workload
  • Data
    • Distribute data to PEs
  • Instructions
    • Distribute tasks to PEs
  • Synchronisation
    • Having divided the data & tasks, how do we synchronise the tasks?

Page 5: Computer Architecture

Programming Model

• Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common address space (see the sketch below)
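
To make the model concrete, here is a minimal POSIX-threads sketch (not from the original slides): two workers communicate with the main thread simply by writing partial results into a shared array. The names partial and worker are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared (virtual) address space: all threads see this array. */
    static long partial[2];

    static void *worker(void *arg) {
        long id = (long)arg;
        long sum = 0;
        /* Each worker computes its own partial sum... */
        for (long i = id * 1000; i < (id + 1) * 1000; i++)
            sum += i;
        /* ...and communicates it by writing into shared memory. */
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);   /* join provides the synchronisation */
        printf("total = %ld\n", partial[0] + partial[1]);
        return 0;
    }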

Page 6: Computer Architecture

Data Distribution

• Trivial
  • All the data sits in the common address space
  • Any PE can access it!
• Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access time, t_acc
• Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are "closer" to some addresses
  • More later!

Page 7: Computer Architecture

Synchronisation

• Read static shared data
  • No problem!
• Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value written by PE0?
• Semaphores
  • Lock resources (memory areas or ...) while being updated by one PE

Page 8: Computer Architecture

Synchronisation

• Semaphore
  • Data structure in memory (sketched below)
    • Count of waiters
      • -1 = resource free
      • >= 0 = resource in use
    • Pointer to list of waiters
  • Two operations
    • Wait
      • Proceed immediately if resource free (waiter count = -1)
    • Notify
      • Advise the semaphore that you have finished with the resource
      • Decrement waiter count
      • First waiter will be given control
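
As a rough illustration (assumed structure, not from the slides), the semaphore described above might look like this in C. Note that wait and notify here are deliberately not atomic; the race on the next slide shows exactly why that matters.

    /* Hypothetical rendering of the slide's semaphore; NOT atomic. */
    typedef struct waiter {
        struct waiter *next;      /* list of waiting tasks (TCBs) */
        void          *tcb;
    } waiter_t;

    typedef struct {
        int       count;          /* -1 = free, >= 0 = in use (count of waiters) */
        waiter_t *waiters;        /* pointer to list of waiters */
    } semaphore_t;

    void sem_wait(semaphore_t *s, waiter_t *me) {
        if (s->count == -1) {     /* resource free: proceed immediately */
            s->count = 0;
        } else {                  /* resource in use: join the waiter list */
            s->count++;
            me->next = s->waiters;
            s->waiters = me;
            /* ... block until given control ... */
        }
    }

    void sem_notify(semaphore_t *s) {
        s->count--;               /* advise that we have finished */
        if (s->count >= 0 && s->waiters) {
            s->waiters = s->waiters->next;   /* first waiter is given control */
        }
    }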

Page 9: Computer Architecture

Semaphores - Implementation

• Scenario
  • Semaphore free (-1)
  • PE0: wait ..
    • Resource free, so PE0 uses it (sets 0)
  • PE1: wait ..
    • Reads count (0)
    • Starts to increment it ..
  • PE0: notify ..
    • Gets the bus and writes -1
  • PE1: (finishing wait)
    • Adds 1 to 0, writes 1 to count, adds PE1's TCB to the list
• Stalemate!
  • Who issues notify to free the resource?

Page 10: Computer Architecture

Atomic Operations

• Problem
  • PE0 wrote a new value (-1) after PE1 had read the counter
  • PE1 increments the value it read (0) and writes it back
• Solution
  • PE1's read and update must be atomic
    • No other PE must gain access to the counter while PE1 is updating it
• Usually an architecture will provide
  • Test and set instruction
    • Read a memory location, test it; if it's 0, write a new value, else do nothing
    • Atomic or indivisible .. no other PE can access the value until the operation is complete

Page 11: Computer Architecture

Atomic Operations

• Test & Set
  • Read a memory location, test it; if it's 0, write a new value, else do nothing
  • Can be used to guard a resource
    • When the location contains 0, access to the resource is allowed
    • A non-zero value means the resource is locked
  • Semaphore:
    • Simple semaphore (no wait list)
      • Implement directly
      • Waiter "backs off" and tries again (rather than being queued) - see the sketch below
    • Complex semaphore (with wait list)
      • Test & set guards the wait counter
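
A minimal sketch of the "simple semaphore" built on test-and-set, assuming C11 atomics (atomic_flag) stand in for the hardware instruction; the waiter simply backs off and retries, as described above.

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear = resource free */

    void acquire(void) {
        /* test-and-set: returns the previous value; loop until it was clear */
        while (atomic_flag_test_and_set(&lock))
            ;                                      /* back off and try again */
    }

    void release(void) {
        atomic_flag_clear(&lock);                  /* mark the resource free */
    }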

Page 12: Computer Architecture

Atomic Operations

• Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
    • Multiple processes
    • Interrupts occur at arbitrary points in time
      • including timer interrupts signalling the end of a time-slice
    • Any process can be interrupted in the middle of a read-modify-write sequence
  • Shared memory multi-processors
    • One PE can lose control of the bus after the read of a read-modify-write
• Cache?
  • Later!

Page 13: Computer Architecture

Atomic Operations

• Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
• Read-modify-write bus transactions
  • Memory location is read, modified and written back as a single, indivisible operation
• Test and exchange
  • Check register's value; if 0, exchange with memory
• Reservation register (PowerPC)
  • lwarx - load word and reserve indexed
  • stwcx - store word conditional indexed
  • Reservation register stores the address of the reserved word
  • Reservation and use can be separated by a sequence of instructions (see the sketch below)
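
The lwarx/stwcx. pair supports retry loops like the one sketched below. This is a hedged C approximation only: C11 atomic_compare_exchange_weak stands in for the conditional store, since the actual reservation mechanism is hardware-specific.

    #include <stdatomic.h>

    /* Rough equivalent of the load-reserved / store-conditional retry loop:
       read the word, compute, then store only if no other PE touched the
       word in between; otherwise retry the whole sequence. */
    void atomic_increment(atomic_int *counter) {
        int old = atomic_load(counter);            /* "lwarx": read (and reserve) */
        /* reservation and use can be separated by other instructions */
        while (!atomic_compare_exchange_weak(counter, &old, old + 1))
            ;   /* "stwcx." failed: old now holds the current value, retry */
    }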

Page 14: Computer Architecture

Barriers

• In a shared memory environment
  • PEs must know when another PE has produced a result
• Simplest case: a barrier for all PEs
  • Must be inserted by the programmer (see the sketch below)
• Potentially expensive
  • All PEs stall and waste time in the barrier
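
A minimal barrier sketch using POSIX threads (worker, result and NPE are illustrative names, not from the slides): each PE produces a result, then waits at the barrier; no PE reads another's result until every PE has arrived.

    #include <pthread.h>
    #include <stdio.h>

    #define NPE 4

    static pthread_barrier_t barrier;
    static double result[NPE];

    static void *worker(void *arg) {
        long id = (long)arg;
        result[id] = id * 2.0;               /* produce a result */
        pthread_barrier_wait(&barrier);      /* stall until all PEs arrive */
        /* safe to read results produced by the other PEs from here on */
        printf("PE%ld sees result[0] = %f\n", id, result[0]);
        return NULL;
    }

    int main(void) {
        pthread_t t[NPE];
        pthread_barrier_init(&barrier, NULL, NPE);
        for (long id = 0; id < NPE; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < NPE; id++)
            pthread_join(t[id], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }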

Page 15: Computer Architecture

Cache?

• What happens to cached locations?

Page 16: Computer Architecture

Multiple Caches

• Coherence
  • PE_A reads location x from memory
    • Copy in cache A
  • PE_B reads location x from memory
    • Copy in cache B
  • PE_A adds 1

Page 17: Computer Architecture

Multiple Caches - Inconsistent states

• Coherence
  • PE_A reads location x from memory
    • Copy in cache A
  • PE_B reads location x from memory
    • Copy in cache B
  • PE_A adds 1
    • A's copy is now 201
  • PE_B reads location x
    • reads 200 from cache B

Page 18: Computer Architecture

Multiple Caches - Inconsistent states

• Coherence
  • PE_A reads location x from memory
    • Copy in cache A
  • PE_B reads location x from memory
    • Copy in cache B
  • PE_A adds 1
    • A's copy is now 201
  • PE_B reads location x
    • reads 200 from cache B
• The caches and memory are now inconsistent, or not coherent

Page 19: Computer Architecture

Cache - Maintaining Coherence

• Invalidate on write
  • PE_A reads location x from memory
    • Copy in cache A
  • PE_B reads location x from memory
    • Copy in cache B
  • PE_A adds 1
    • A's copy is now 201
    • A issues "invalidate x"
    • Cache B marks x invalid
    • Invalidate is an address-only transaction

Page 20: Computer Architecture

Cache - Maintaining Coherence

• Reading the new value
  • PE_B reads location x
    • Main memory is wrong also
  • PE_A snoops the read
    • Realises it has the valid copy
  • PE_A issues a retry

Page 21: Computer Architecture

Cache - Maintaining Coherence

• Reading the new value
  • PE_B reads location x
    • Main memory is wrong also
  • PE_A snoops the read
    • Realises it has the valid copy
  • PE_A issues a retry
  • PE_A writes x back
    • Memory is now correct
  • PE_B reads location x again
    • Reads the latest version

Page 22: Computer Architecture

Coherent Cache - Snooping

• The SIU "snoops" the bus for transactions
  • Addresses are compared with the local cache
• Matches
  • Initiate retries
    • Local copy is modified
    • Local copy is written to the bus
  • Invalidate local copies
    • Another PE is writing
  • Mark local copies shared
    • A second PE is reading the same value

Page 23: Computer Architecture

Coherent Cache - MESI protocol

• Cache line has 4 states (see the sketch below)
  • Invalid
  • Modified
    • Only valid copy
    • Memory copy is invalid
  • Exclusive
    • Only cached copy
    • Memory copy is valid
  • Shared
    • Multiple cached copies
    • Memory copy is valid
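
As a rough sketch (not the full protocol), the four states and one example transition - what happens to a line when the local PE writes it - could be expressed as follows; bus_upgrade and bus_read_for_ownership are hypothetical helpers standing in for the bus transactions the protocol would issue.

    typedef enum { INVALID, MODIFIED, EXCLUSIVE, SHARED } mesi_state_t;

    /* New state of a cache line after this PE writes to it. */
    mesi_state_t on_local_write(mesi_state_t s) {
        switch (s) {
        case MODIFIED:  return MODIFIED;   /* write hit, no bus traffic */
        case EXCLUSIVE: return MODIFIED;   /* silent upgrade, no bus traffic */
        case SHARED:    /* must invalidate the other copies first */
                        /* bus_upgrade(); */
                        return MODIFIED;
        case INVALID:   /* write miss: fetch the line with intent to modify */
                        /* bus_read_for_ownership(); */
                        return MODIFIED;
        }
        return INVALID;                    /* unreachable */
    }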

Page 24: Computer Architecture

MESI State Diagram

• Note the number of bus transactions needed!

• Legend:
  • WH - Write Hit
  • WM - Write Miss
  • RH - Read Hit
  • RMS - Read Miss, Shared
  • RME - Read Miss, Exclusive
  • SHW - Snoop Hit on Write

Page 25: Computer Architecture

Coherent Cache - The Cost

• Cache coherency transactions
  • Additional transactions needed
  • Shared
    • Write hit
      • Other caches must be notified
  • Modified
    • Other PE reads
      • Push-out needed
    • Other PE writes
      • Push-out needed - even when writing one word of an n-word line
  • Invalid - modified in another cache
    • Read or write
      • Wait for the push-out

Page 26: Computer Architecture

Clusters

• A bus which is too long becomes slow!
  • eg PCI is limited to 10 TTL loads
• Lots of processors?
  • On the same bus
    • Bus speed must be limited
    • Low communication rate
    • Better to use a single PE!
• Clusters
  • ~8 processors on a bus

Page 27: Computer Architecture

Clusters

• 8 cache coherent (CC) processors on a bus
• Interconnect network
• ~100? clusters

Page 28: Computer Architecture

Clusters

• Network Interface Unit (NIU)
  • Detects requests for "remote" memory

Page 29: Computer Architecture

Clusters

• Memory request message
  • Message despatched to the remote cluster's NIU

Page 30: Computer Architecture

Clusters - Shared Memory

• Non-Uniform Memory Access (NUMA)
  • Access time to memory depends on location!
  • For PEs in a given cluster, the cluster's own memory is much closer than memory in a remote cluster

Page 31: Computer Architecture

Clusters - Shared Memory

• Non-Uniform Memory Access (NUMA)
  • Access time to memory depends on location!
• Worse! The NIU needs to maintain cache coherence across the entire machine

Page 32: Computer Architecture

Clusters - Maintaining Cache Coherence

• NIU (or equivalent) maintains a directory
  • Directory entries
    • All lines from local memory cached elsewhere
  • NIU software (firmware)
    • Checks memory requests against the directory
    • Updates the directory
    • Sends invalidate messages to other clusters
    • Fetches modified (dirty) lines from other clusters
• Remote memory access cost
  • 100s of cycles!

Directory (Cluster 2):

  Address   Status   Clusters
  4340      S        1, 3, 8
  5260      E        9

(A sketch of one such entry follows below.)
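
A hypothetical C rendering of one directory entry like those in the table above; the field names and sharer-bitmap representation are assumptions, not the actual format used by any particular machine.

    #include <stdint.h>

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    typedef struct {
        uint64_t    line_address;   /* e.g. 4340 */
        dir_state_t state;          /* S = shared, E = held exclusively elsewhere */
        uint64_t    sharer_bitmap;  /* bit i set => cluster i holds a copy */
    } directory_entry_t;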

Page 33: Computer Architecture

Clusters - “Off the shelf”

• Commercial clusters
  • Provide page migration
    • Make a copy of a remote page on the local PE
    • Programmer remains responsible for coherence
  • Don't provide hardware support for cache coherence (across the network)
  • Fully CC machines may never be available!
• Software systems
  • ....

Page 34: Computer Architecture

Shared Memory Systems

• Software systems, eg TreadMarks
  • Provide shared memory on a page basis
  • Software
    • detects references to remote pages
    • moves a copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model's convenience
    • Without swamping the interconnection network with messages
    • Message overhead is too high for a single word!
    • A word basis is too expensive!!

Page 35: Computer Architecture

Shared Memory Systems - Granularity

• Granularity
  • A word basis is too expensive!!
  • Sharing data at low granularity
• Fine grain sharing
  • Access / sharing for individual words
  • Overheads too high
    • Number of messages
    • Message overhead is high for one word
• Compare: burst access to memory
  • Don't fetch a single word -
    • the overhead (bus protocol) is too high
  • Amortise the cost of access over multiple words

Page 36: Computer Architecture

Shared Memory Systems - Granularity

• Coarse grain systems
  • Transferring data from cluster to cluster
    • Overhead
      • Messages
      • Updating the directory
    • Amortise the overhead over a whole page
      • Lower relative overhead
• Applies to thread size also
  • Split the program into small threads of control
    • Parallel overhead
      • cost of setting up & starting each thread
      • cost of synchronising at the end of a set of threads
    • Can be more efficient to run a single sequential thread!

Page 37: Computer Architecture

Coarse Grain Systems

• So far ...
  • Most experiments suggest that fine grain systems are impractical
  • Larger, coarser grain
    • Blocks of data
    • Threads of computation
    are needed to reduce overall computation time by using multiple processors
  • Too fine grain parallel systems
    • can run slower than a single processor!

Page 38: Computer Architecture

Parallel Overhead

• Ideal
  • Time = 1/n
• Add overhead
  • Time > optimal
  • No point in using more than 4 PEs!! (see the worked sketch below)

[Chart: execution time versus number of PEs (0-12), comparing the ideal 1/n curve with the curve including parallel overhead]


Page 40: Computer Architecture

Parallel Overhead

• Shared memory systems - best results if you
  • Share on a large-block basis, eg a page
  • Split the program into coarse grain (long running) threads
  • Give away some parallelism to achieve any parallel speedup!
• Coarse grain
  • Data
  • Computation
• There's parallelism at the instruction level too! The instruction issue unit in a sequential processor is trying to exploit it!

Page 41: Computer Architecture

Clusters - Improving multiple PE performance

• Bandwidth to memory
  • Cache reduces dependency on the memory-CPU interface
    • 95% cache hits means only 5% of memory accesses cross the interface
  • but add
    • a few PEs and
    • a few CC transactions
    and even if the interface was coping before, it won't cope in a multiprocessor system!
  • A major bottleneck!

Page 42: Computer Architecture

Clusters - Improving multiple PE performance

• Bus protocols add to access time
  • Request / Grant / Release phases needed
• "Point-to-point" is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for the common bus
• Cross-bar? The name is taken from old telephone exchanges!

Page 43: Computer Architecture

Clusters - Memory Bandwidth

• Modern clusters
  • Use "point-to-point" X-bar interfaces to memory to get bandwidth!
• Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?

Page 44: Computer Architecture

Programming Model

• Distributed memory
  • Message passing
    • Alternative to shared memory
  • Each PE has its own address space
  • PEs communicate with messages
  • Messages provide synchronisation
    • A PE can block or wait for a message

Page 45: Computer Architecture

Programming Model - Distributed Memory

• Distributed memory systems
  • Hardware is simple!
    • Network can be as simple as Ethernet
  • Networks of Workstations model
    • Commodity (cheap!) PEs
    • Commodity network
      • Standard
        • Ethernet
        • ATM
      • Proprietary
        • Myrinet
        • Achilles (UWA!)

Page 46: Computer Architecture

Programming Model - Distributed Memory

• Distributed memory systems
  • Software is considered harder
  • Programmer responsible for
    • Distributing data to individual PEs
    • Explicit thread control
      • Starting, stopping & synchronising
  • At least two commonly available systems
    • Parallel Virtual Machine (PVM)
    • Message Passing Interface (MPI)
  • Built on two operations (see the MPI sketch below)
    • Send(data, destPE, block | don't block)
    • Receive(data, srcPE, block | don't block)
    • Blocking ensures synchronisation
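
A minimal MPI sketch of these two operations (not from the slides): PE 0 performs a blocking send to PE 1, which blocks in the matching receive, so the message also synchronises the pair. Run with, for example, mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest PE 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* src PE 0, blocks */
            printf("PE 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }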

Page 47: Computer Architecture

Programming Model - Distributed Memory

• Distributed memory systems
  • Performance generally better (versus shared memory)
  • Shared memory has hidden overheads
    • Grain size poorly chosen
      • eg data doesn't fit into pages
    • Unnecessary coherence transactions
      • Updating a shared region (each page) before the end of the computation
      • A message-passing system waits and updates the page when the computation is complete

Page 48: Computer Architecture

Programming Model - Distributed Memory

• Distributed memory systems
  • Performance generally better (versus shared memory)
  • False sharing
    • Severely degrades performance
    • May not be apparent on superficial analysis
    • PE_a accesses one part of a memory page and PE_b accesses another: the whole page ping-pongs between PE_a and PE_b (see the sketch below)
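
A sketch of false sharing at cache-line granularity (the page-level case above is analogous); the struct layout and loop counts are illustrative only. Both counters share one line, so the line ping-pongs between the PEs even though the threads never touch each other's data; padding each counter to its own line removes the problem.

    #include <pthread.h>

    struct { long a; long b; } counters;            /* a and b share a cache line */

    static void *bump_a(void *arg) {
        for (long i = 0; i < 10000000; i++) counters.a++;
        return NULL;
    }
    static void *bump_b(void *arg) {
        for (long i = 0; i < 10000000; i++) counters.b++;
        return NULL;
    }

    int main(void) {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, bump_a, NULL);
        pthread_create(&tb, NULL, bump_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        return 0;
    }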

Page 49: Computer Architecture

Distributed Memory - Summary

• Simpler (almost trivial) hardware
• Software
  • More programmer effort
    • Explicit data distribution
    • Explicit synchronisation
• Performance generally better
  • Programmer knows more about the problem
    • Communicates only when necessary
    • Communication grain size can be optimum
  • Lower overheads

Page 50: Computer Architecture

Data Flow

• Conventional programming models are control driven
  • Instruction sequence is precisely specified
  • The sequence specifies control
    • which instruction the CPU will execute next
  • Execution rule:
    • Execute an instruction when its predecessor has completed

  s1: r = a*b;
  s2: s = c*d;
  s3: y = r + s;

  s2 executes when s1 is complete
  s3 executes when s2 is complete

Page 51: Computer Architecture

Data Flow

• Consider the calculation
  • y = a*b + c*d
• Represent it by a graph
  • Nodes represent computations
  • Data flows along arcs
• Execution rule:
  • Execute an instruction when its data is available
  • Data driven rule

[Dataflow graph: a and b feed one multiply node, c and d feed another; both products feed an add node whose output is y]

Page 52: Computer Architecture

Data Flow

• Dataflow firing rule
  • An instruction fires (executes) when its data is available
• Exposes all possible parallelism (see the sketch below)
  • Either multiplication can fire as soon as its data arrives
  • The addition must wait
• Data dependence analysis!
• Instruction issue units:
  • Fire (issue) each instruction when its operands (registers) have been written

[Dataflow graph repeated: the two multiplies (a×b, c×d) feed the add that produces y]
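
A small POSIX-threads sketch (not from the slides) of the firing rule for y = a*b + c*d: either multiply "node" may fire as soon as its inputs exist, and the add fires only after both products are available, enforced here by joining both threads.

    #include <pthread.h>
    #include <stdio.h>

    static double a = 2, b = 3, c = 4, d = 5;
    static double r, s;

    static void *mul_ab(void *arg) { r = a * b; return NULL; }
    static void *mul_cd(void *arg) { s = c * d; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, mul_ab, NULL);   /* both multiplies may fire */
        pthread_create(&t2, NULL, mul_cd, NULL);   /* as soon as their data exists */
        pthread_join(t1, NULL);                    /* the add must wait for both */
        pthread_join(t2, NULL);
        printf("y = %f\n", r + s);                 /* add fires: y = a*b + c*d */
        return 0;
    }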

