Parallel Computing
Architecture
2010@FEUP Architecture 2
Interconnection Networks
• Uses of interconnection networks
  • Connect processors to shared memory
  • Connect processors to each other
• Interconnection media types
  • Shared medium
  • Switched medium
Parallel Computers
• Vector computers
  • Multiple CPUs
  • Instructions include direct vector operations
  • Pipelined – data streams through vector arithmetic units (e.g. CRAY)
  • Processor array – all processors execute the same instruction
• Multiprocessors
  • Multiple CPUs with shared memory
• Multicomputers
  • Multiple CPUs with distributed memory
Processor Array
• Only well suited to data-parallel problems
Multiprocessors
• Shared memory
  • Can be built with commodity components
• Centralized
  • Extension of a uniprocessor: add CPUs to its bus
  • Same memory access time for every processor
  • UMA – Uniform Memory Access
  • Also known as SMP (symmetric multiprocessor)
• Distributed
  • Memory distributed among the processors
  • NUMA – Non-uniform Memory Access
  • Allows greater numbers of processors
Centralized multiprocessors
• Problem: Cache coherence
• Write invalidate protocol
Write Invalidate Protocol
• The most common solution to cache coherence:
1. Each CPU’s cache controller monitors (snoops) the bus and identifies which cache blocks are requested by other CPUs.
2. A processor gains exclusive control of a data item before performing a write.
3. Before the write occurs, all other copies of the data item cached by other processors are invalidated.
4. When any other CPU tries to read a memory location from an invalidated cache block:
  • a cache miss occurs
  • it has to retrieve the updated data from memory
Cache-Coherence Example
A worked example with two CPUs, A and B, sharing a memory location X that initially holds 7:
1. CPU A reads X: the value 7 is copied into A’s cache. Reading from memory is not a problem.
2. CPU B reads X: both caches now hold X = 7.
3. Writing to memory is a problem: if B simply wrote 2 to X, memory and B’s cache would hold 2 while A’s cache still held the stale value 7.
4. Under the write-invalidate protocol, each cache controller snoops the bus to see which cache block is being requested by other processors.
5. Before B’s write can occur, B broadcasts its intent to write X, and all other copies of the data at that address (here, A’s) are declared invalid.
6. B writes X = 2. When A next tries to read X from its invalidated cache block, it receives a cache miss and has to refresh the value from main memory.
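The walkthrough above can be turned into a toy simulation of the snooping write-invalidate protocol. This is a minimal sketch under simplifying assumptions (a single block X, write-through memory); the `Cache` and `Bus` classes are invented names for illustration, not real hardware interfaces.

```python
# Toy write-invalidate simulation: each cache controller "snoops" the
# shared bus, so a write by one CPU invalidates every other cached copy.

class Bus:
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, writer):
        for cache in self.caches:           # snooping controllers listen here
            if cache is not writer:
                cache.block = None          # copy declared invalid

class Cache:
    def __init__(self, bus):
        self.bus, self.block = bus, None    # block is None when invalid
        bus.caches.append(self)

    def read(self, memory):
        if self.block is None:              # cache miss: refresh from memory
            self.block = memory["X"]
        return self.block

    def write(self, memory, value):
        self.bus.broadcast_invalidate(self) # "intent to write X"
        self.block = value
        memory["X"] = value                 # write-through, for simplicity

memory = {"X": 7}
bus = Bus()
a, b = Cache(bus), Cache(bus)
print(a.read(memory), b.read(memory))       # → 7 7 (both caches hold X)
b.write(memory, 2)                          # invalidates A's copy
print(a.block)                              # → None (stale copy is gone)
print(a.read(memory))                       # → 2 (miss, refreshed from memory)
```

Note that A never sees the stale 7 after B's write: the invalidation forces the cache miss described in step 6.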
Distributed Multiprocessors
• Increase local memory bandwidth and lower the average memory access time
• All the memory forms a single address space
Cache Coherence
• Implementation is more difficult
  • No shared memory bus to “snoop”
  • A directory-based protocol is needed
• Some NUMA multiprocessors do not support it in hardware
  • Only instructions and private data are cached
  • Large variance in memory access time
Directory-based Protocol
• A distributed directory contains information about the cacheable memory blocks
  • One directory entry for each cache block
  • Each entry holds the sharing status and which processors have copies (a bit vector)
• Sharing status:
  • Uncached (U) – block not in any processor’s cache
  • Shared (S) – cached by one or more processors, read only
  • Exclusive (E) – cached by exactly one processor, write access
Directory-based Protocol
[Figure: three nodes, each with a CPU, cache, local memory, and directory, joined by an interconnection network.]
Directory-based Protocol: Worked Example
X resides in CPU 0’s local memory and initially holds 7; the directory entry is X: U [0 0 0] (status, presence bit vector over CPUs 0–2).
1. CPU 0 reads X: a read-miss message goes to the directory; the entry becomes S [1 0 0] and CPU 0 caches X = 7.
2. CPU 2 reads X: read miss; the entry becomes S [1 0 1] and CPU 2 also caches X = 7.
3. CPU 0 writes 6 to X: write miss; the directory invalidates CPU 2’s copy; the entry becomes E [1 0 0] and CPU 0 updates its cached copy to 6 (memory still holds 7).
4. CPU 1 reads X: read miss; the directory switches CPU 0’s copy to shared, and the dirty value is written back, so memory becomes 6; the entry becomes S [1 1 0] and CPU 1 caches X = 6.
5. CPU 2 writes 5 to X: write miss; the copies at CPU 0 and CPU 1 are invalidated; the entry becomes E [0 0 1] and CPU 2’s cached copy holds 5 (memory still holds 6).
6. CPU 0 writes 4 to X: write miss; CPU 2’s copy is written back (memory becomes 5) and invalidated; cache block storage for X is created at CPU 0; the entry becomes E [1 0 0] and CPU 0’s cached copy holds 4.
7. CPU 0 flushes cache block X: data write back; memory becomes 4 and the entry returns to U [0 0 0].
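The directory state transitions above can be replayed in a toy simulation. This is a minimal sketch, not real coherence hardware: it assumes a single block X, the `Directory` class with its `read`/`write` methods is an invented illustration of the U/S/E states and the presence bit vector, and eviction (the final flush step) is omitted.

```python
# Toy directory-based coherence protocol for one block X.
# The directory tracks a sharing status (U/S/E), a presence bit vector,
# the home memory copy, and each CPU's cached value (None = invalid).

class Directory:
    def __init__(self, n_cpus, value):
        self.state, self.bits = "U", [0] * n_cpus
        self.memory = value                 # home node's copy of X
        self.caches = [None] * n_cpus       # per-CPU cached value

    def read(self, cpu):
        if self.state == "E":               # dirty owner writes back first
            owner = self.bits.index(1)
            self.memory = self.caches[owner]
        self.state = "S"                    # switch to shared
        self.bits[cpu] = 1
        self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        if self.state == "E":               # write back the owner's dirty copy
            owner = self.bits.index(1)
            self.memory = self.caches[owner]
        for i, bit in enumerate(self.bits): # invalidate all other copies
            if bit and i != cpu:
                self.caches[i] = None
        self.state = "E"
        self.bits = [1 if i == cpu else 0 for i in range(len(self.bits))]
        self.caches[cpu] = value

d = Directory(3, 7)
d.read(0); d.read(2)    # S [1 0 1], both cache 7
d.write(0, 6)           # E [1 0 0], memory still 7
d.read(1)               # write back: memory 6; S [1 1 0]
d.write(2, 5)           # E [0 0 1], memory still 6
d.write(0, 4)           # write back: memory 5; E [1 0 0]
print(d.state, d.bits, d.memory)    # → E [1, 0, 0] 5
```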
Multicomputer
• Distributed-memory multiple-CPU computer
• The same address on different processors refers to different physical memory locations
• Processors interact through message passing
• Flavors:
  • Asymmetrical
  • Symmetrical
  • Mixed
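Since multicomputer processors share no memory, cooperation happens only by exchanging messages. A minimal sketch using Python’s multiprocessing pipes as a stand-in for a real message-passing library such as MPI; the `worker` function and the data are invented for illustration.

```python
# Two processes with separate address spaces exchange messages over a
# pipe: data is copied between them, never shared.

from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()              # blocking receive from the other process
    conn.send(sum(data))            # send a result message back
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send([1, 2, 3, 4])   # message passing: the list is copied
    print(parent_end.recv())        # → 10
    p.join()
```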
Asymmetrical Multicomputer
• Back end dedicated to parallel operations
• The single front-end computer can limit the scalability of the system
• Every application requires development of both a front-end and a back-end program
Symmetrical Multicomputer
• Every processor executes the same program
• No simple way to balance the program-development workload among processors
• More difficult to achieve high performance when several processes run on each processor
Mixed Cluster Multicomputer
• Co-located computers
• Dedicated to running parallel jobs
• Identical operating system
• Identical local disk images
Flynn’s Taxonomy
• Classifies computers by instruction stream and data stream
• Each stream can be single or multiple, giving four combinations:
  • SISD
  • SIMD
  • MISD
  • MIMD
Flynn’s Taxonomy
• SISD
  • Single Instruction, Single Data
  • Single-CPU systems (note: co-processors don’t count)
  • Can execute multiple functions and multiple I/O operations concurrently
  • Example: PCs
• SIMD
  • Single Instruction, Multiple Data
  • Two architectures fit this category:
    • Pipelined vector processor
    • Processor array
Flynn’s Taxonomy
• MISD
  • Multiple Instruction, Single Data
  • Example: systolic array
• MIMD
  • Multiple Instruction, Multiple Data
  • Multiple-CPU computers:
    • Multiprocessors
    • Multicomputers
Systolic Array
• Multiple interconnected processing elements
• Example: a sorting element with three inputs a, b, c
  • Input phase (1 clock): latch the three inputs
  • Output phase (1 clock): emit min(a, b, c), med(a, b, c), max(a, b, c)
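The sorting element above can be sketched in a few lines; `sorting_element` is an invented name for one systolic cell that latches three inputs and emits them ordered.

```python
# One systolic sorting cell: three inputs in one clock, the ordered
# triple (min, median, max) out in the next clock.

def sorting_element(a, b, c):
    lo, hi = min(a, b, c), max(a, b, c)
    med = a + b + c - lo - hi       # the value that is neither min nor max
    return lo, med, hi

print(sorting_element(5, 2, 9))     # → (2, 5, 9)
```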
A Priority Queue in a Systolic Array
• One insertion, two extractions
The cells of the array hold the queued values in sorted order, padded with ∞ sentinels (−∞ marks the input end). A worked example with the queue initially holding 4, 5, 8:
• Inserting 7: the value ripples through the cells until it reaches its sorted position; the queue becomes 4, 5, 7, 8.
• Extraction: the minimum, 4, leaves the array and the remaining values shift toward the head, leaving 5, 7, 8.
• Extraction: the next minimum, 5, leaves, and the queue holds 7, 8.
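The insert/extract behaviour above can be modelled with a small simulation. This is a sketch, not cycle-accurate hardware: the `SystolicPQ` class is invented, `math.inf` stands in for the ∞ sentinels, and each loop iteration plays the role of one clock tick.

```python
# Toy model of a systolic priority queue: a linear array of cells keeps
# its values sorted; an insert enters at the head and compare-exchanges
# its way down, an extract pops the minimum and shifts the rest up.

import math

class SystolicPQ:
    def __init__(self, n_cells):
        self.cells = [math.inf] * n_cells   # empty cells hold the sentinel

    def insert(self, value):
        # The value ripples toward its sorted slot, one
        # compare-exchange per "clock tick".
        for i in range(len(self.cells)):
            if value < self.cells[i]:
                value, self.cells[i] = self.cells[i], value

    def extract_min(self):
        smallest = self.cells[0]
        # Remaining values shift one cell toward the head each tick.
        self.cells = self.cells[1:] + [math.inf]
        return smallest

pq = SystolicPQ(8)
for v in (4, 5, 8):
    pq.insert(v)
pq.insert(7)                # queue is now 4, 5, 7, 8
print(pq.extract_min())     # → 4
print(pq.extract_min())     # → 5
```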