Parallel Computing
Architecture
2010@FEUP Architecture 2
Interconnection Networks
• Uses of interconnection networks
  • Connect processors to shared memory
  • Connect processors to each other
• Interconnection media types
  • Shared medium
  • Switched medium
Parallel Computers
• Vector computers
  • Multiple CPUs
  • Instructions include direct vector operations
  • Pipelined – data streams through vector arithmetic units (e.g. CRAY)
  • Processor array – all processors execute the same instruction
• Multiprocessors
  • Multiple CPUs with shared memory
• Multicomputers
  • Multiple CPUs with distributed memory
Processor Array
• Only well suited to data-parallel problems
Multiprocessors
• Shared memory
  • Can be built with commodity components
• Centralized
  • Extension of a uniprocessor: add CPUs to its bus
  • Same memory access time for every processor
  • UMA – Uniform Memory Access
  • Also known as SMP (symmetric multiprocessor)
• Distributed
  • Memory distributed among the processors
  • NUMA – Non-uniform Memory Access
  • Allows greater numbers of processors
Centralized multiprocessors
• Problem: Cache coherence
• Write invalidate protocol
Write Invalidate Protocol
• The most common solution to cache coherence:
1. Each CPU’s cache controller monitors (snoops) the bus and identifies which cache blocks are requested by other CPUs.
2. A processor gains exclusive control of a data item before performing a write.
3. Before the write occurs, all other copies of the data item cached by other processors are invalidated.
4. When any other CPU tries to read a memory location from an invalidated cache block:
  • a cache miss occurs
  • it has to retrieve the updated data from memory
Cache-Coherence Example
A worked example with two CPUs, A and B, sharing a memory location X that initially holds 7:
1. CPU A reads X: the value 7 is copied into A’s cache. Reading from memory is not a problem.
2. CPU B reads X: both caches now hold X = 7.
3. Writing to memory is a problem: if B simply wrote 2 to X, memory and B’s cache would hold 2 while A’s cache still held the stale value 7.
4. Under the write-invalidate protocol, each cache controller snoops the bus to see which cache block is being requested by other processors.
5. Before B’s write can occur, B broadcasts its intent to write X, and all other copies of the data at that address (here, A’s) are declared invalid.
6. B writes X = 2. When A next tries to read X from its invalidated cache block, it receives a cache miss and has to refresh the value from main memory.
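The walkthrough above can be turned into a toy simulation of the snooping write-invalidate protocol. This is a minimal sketch under simplifying assumptions (a single block X, write-through memory); the `Cache` and `Bus` classes are invented names for illustration, not real hardware interfaces.

```python
# Toy write-invalidate simulation: each cache controller "snoops" the
# shared bus, so a write by one CPU invalidates every other cached copy.

class Bus:
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, writer):
        for cache in self.caches:           # snooping controllers listen here
            if cache is not writer:
                cache.block = None          # copy declared invalid

class Cache:
    def __init__(self, bus):
        self.bus, self.block = bus, None    # block is None when invalid
        bus.caches.append(self)

    def read(self, memory):
        if self.block is None:              # cache miss: refresh from memory
            self.block = memory["X"]
        return self.block

    def write(self, memory, value):
        self.bus.broadcast_invalidate(self) # "intent to write X"
        self.block = value
        memory["X"] = value                 # write-through, for simplicity

memory = {"X": 7}
bus = Bus()
a, b = Cache(bus), Cache(bus)
print(a.read(memory), b.read(memory))       # → 7 7 (both caches hold X)
b.write(memory, 2)                          # invalidates A's copy
print(a.block)                              # → None (stale copy is gone)
print(a.read(memory))                       # → 2 (miss, refreshed from memory)
```

Note that A never sees the stale 7 after B's write: the invalidation forces the cache miss described in step 6.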
Distributed Multiprocessors
• Increase local memory bandwidth and lower the average memory access time
• All the memory forms a single address space
Cache Coherence
• Implementation is more difficult
  • No shared memory bus to “snoop”
  • A directory-based protocol is needed
• Some NUMA multiprocessors do not support it in hardware
  • Only instructions and private data are cached
  • Large variance in memory access time
Directory-based Protocol
• A distributed directory contains information about the cacheable memory blocks
  • One directory entry for each cache block
  • Each entry holds the sharing status and which processors have copies (a bit vector)
• Sharing status:
  • Uncached (U) – block not in any processor’s cache
  • Shared (S) – cached by one or more processors, read only
  • Exclusive (E) – cached by exactly one processor, write access
Directory-based Protocol
[Figure: three nodes, each with a CPU, cache, local memory, and directory, joined by an interconnection network.]
Directory-based Protocol: Worked Example
X resides in CPU 0’s local memory and initially holds 7; the directory entry is X: U [0 0 0] (status, presence bit vector over CPUs 0–2).
1. CPU 0 reads X: a read-miss message goes to the directory; the entry becomes S [1 0 0] and CPU 0 caches X = 7.
2. CPU 2 reads X: read miss; the entry becomes S [1 0 1] and CPU 2 also caches X = 7.
3. CPU 0 writes 6 to X: write miss; the directory invalidates CPU 2’s copy; the entry becomes E [1 0 0] and CPU 0 updates its cached copy to 6 (memory still holds 7).
4. CPU 1 reads X: read miss; the directory switches CPU 0’s copy to shared, and the dirty value is written back, so memory becomes 6; the entry becomes S [1 1 0] and CPU 1 caches X = 6.
5. CPU 2 writes 5 to X: write miss; the copies at CPU 0 and CPU 1 are invalidated; the entry becomes E [0 0 1] and CPU 2’s cached copy holds 5 (memory still holds 6).
6. CPU 0 writes 4 to X: write miss; CPU 2’s copy is written back (memory becomes 5) and invalidated; cache block storage for X is created at CPU 0; the entry becomes E [1 0 0] and CPU 0’s cached copy holds 4.
7. CPU 0 flushes cache block X: data write back; memory becomes 4 and the entry returns to U [0 0 0].
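The directory state transitions above can be replayed in a toy simulation. This is a minimal sketch, not real coherence hardware: it assumes a single block X, the `Directory` class with its `read`/`write` methods is an invented illustration of the U/S/E states and the presence bit vector, and eviction (the final flush step) is omitted.

```python
# Toy directory-based coherence protocol for one block X.
# The directory tracks a sharing status (U/S/E), a presence bit vector,
# the home memory copy, and each CPU's cached value (None = invalid).

class Directory:
    def __init__(self, n_cpus, value):
        self.state, self.bits = "U", [0] * n_cpus
        self.memory = value                 # home node's copy of X
        self.caches = [None] * n_cpus       # per-CPU cached value

    def read(self, cpu):
        if self.state == "E":               # dirty owner writes back first
            owner = self.bits.index(1)
            self.memory = self.caches[owner]
        self.state = "S"                    # switch to shared
        self.bits[cpu] = 1
        self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        if self.state == "E":               # write back the owner's dirty copy
            owner = self.bits.index(1)
            self.memory = self.caches[owner]
        for i, bit in enumerate(self.bits): # invalidate all other copies
            if bit and i != cpu:
                self.caches[i] = None
        self.state = "E"
        self.bits = [1 if i == cpu else 0 for i in range(len(self.bits))]
        self.caches[cpu] = value

d = Directory(3, 7)
d.read(0); d.read(2)    # S [1 0 1], both cache 7
d.write(0, 6)           # E [1 0 0], memory still 7
d.read(1)               # write back: memory 6; S [1 1 0]
d.write(2, 5)           # E [0 0 1], memory still 6
d.write(0, 4)           # write back: memory 5; E [1 0 0]
print(d.state, d.bits, d.memory)    # → E [1, 0, 0] 5
```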
Multicomputer
• Distributed-memory multiple-CPU computer
• The same address on different processors refers to different physical memory locations
• Processors interact through message passing
• Flavors:
  • Asymmetrical
  • Symmetrical
  • Mixed
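Since multicomputer processors share no memory, cooperation happens only by exchanging messages. A minimal sketch using Python’s multiprocessing pipes as a stand-in for a real message-passing library such as MPI; the `worker` function and the data are invented for illustration.

```python
# Two processes with separate address spaces exchange messages over a
# pipe: data is copied between them, never shared.

from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()              # blocking receive from the other process
    conn.send(sum(data))            # send a result message back
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send([1, 2, 3, 4])   # message passing: the list is copied
    print(parent_end.recv())        # → 10
    p.join()
```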
Asymmetrical Multicomputer
• Back end dedicated to parallel operations
• The single front-end computer can limit the scalability of the system
• Every application requires development of both a front-end and a back-end program
Symmetrical Multicomputer
• Every processor executes the same program
• No simple way to balance the program-development workload among processors
• More difficult to achieve high performance when several processes run on each processor
Mixed Cluster Multicomputer
• Co-located computers
• Dedicated to running parallel jobs
• Identical operating system
• Identical local disk images
Flynn’s Taxonomy
• Classifies computers by instruction stream and data stream
• Each stream can be single or multiple, giving four combinations:
  • SISD
  • SIMD
  • MISD
  • MIMD
Flynn’s Taxonomy
• SISD
  • Single Instruction, Single Data
  • Single-CPU systems (note: co-processors don’t count)
  • Can execute multiple functions and multiple I/O operations concurrently
  • Example: PCs
• SIMD
  • Single Instruction, Multiple Data
  • Two architectures fit this category:
    • Pipelined vector processor
    • Processor array
Flynn’s Taxonomy
• MISD
  • Multiple Instruction, Single Data
  • Example: systolic array
• MIMD
  • Multiple Instruction, Multiple Data
  • Multiple-CPU computers:
    • Multiprocessors
    • Multicomputers
Systolic Array
• Multiple interconnected processing elements
• Example: a sorting element with three inputs a, b, c
  • Input phase (1 clock): latch the three inputs
  • Output phase (1 clock): emit min(a, b, c), med(a, b, c), max(a, b, c)
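The sorting element above can be sketched in a few lines; `sorting_element` is an invented name for one systolic cell that latches three inputs and emits them ordered.

```python
# One systolic sorting cell: three inputs in one clock, the ordered
# triple (min, median, max) out in the next clock.

def sorting_element(a, b, c):
    lo, hi = min(a, b, c), max(a, b, c)
    med = a + b + c - lo - hi       # the value that is neither min nor max
    return lo, med, hi

print(sorting_element(5, 2, 9))     # → (2, 5, 9)
```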
A Priority Queue in a Systolic Array
• One insertion, two extractions
The cells of the array hold the queued values in sorted order, padded with ∞ sentinels (−∞ marks the input end). A worked example with the queue initially holding 4, 5, 8:
• Inserting 7: the value ripples through the cells until it reaches its sorted position; the queue becomes 4, 5, 7, 8.
• Extraction: the minimum, 4, leaves the array and the remaining values shift toward the head, leaving 5, 7, 8.
• Extraction: the next minimum, 5, leaves, and the queue holds 7, 8.
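The insert/extract behaviour above can be modelled with a small simulation. This is a sketch, not cycle-accurate hardware: the `SystolicPQ` class is invented, `math.inf` stands in for the ∞ sentinels, and each loop iteration plays the role of one clock tick.

```python
# Toy model of a systolic priority queue: a linear array of cells keeps
# its values sorted; an insert enters at the head and compare-exchanges
# its way down, an extract pops the minimum and shifts the rest up.

import math

class SystolicPQ:
    def __init__(self, n_cells):
        self.cells = [math.inf] * n_cells   # empty cells hold the sentinel

    def insert(self, value):
        # The value ripples toward its sorted slot, one
        # compare-exchange per "clock tick".
        for i in range(len(self.cells)):
            if value < self.cells[i]:
                value, self.cells[i] = self.cells[i], value

    def extract_min(self):
        smallest = self.cells[0]
        # Remaining values shift one cell toward the head each tick.
        self.cells = self.cells[1:] + [math.inf]
        return smallest

pq = SystolicPQ(8)
for v in (4, 5, 8):
    pq.insert(v)
pq.insert(7)                # queue is now 4, 5, 7, 8
print(pq.extract_min())     # → 4
print(pq.extract_min())     # → 5
```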