Post on 11-Jan-2016
ECE 1747: Parallel Programming
Basics of Parallel Architectures:
Shared-Memory Machines
Two Parallel Architectures
• Shared memory machines.
• Distributed memory machines.
Shared Memory: Logical View
[Diagram: processors proc1 … procN, all connected to a single shared memory space]
Shared Memory Machines
• Small number of processors: shared memory with coherent caches (SMP).
• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
SMPs
• 2- or 4-processor PCs are now commodity.
• Good price/performance ratio.
• Memory is sometimes a bottleneck (see later).
• Typical price (8-node): ~ $20-40k.
Physical Implementation
[Diagram: proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to a single shared memory]
CC-NUMA: Physical Implementation
[Diagram: proc1 … procN, each with its own cache (cache1 … cacheN) and its own local memory (mem1 … memN), connected by an interconnect]
Caches in Multiprocessors
• Suffer from the coherence problem:
– the same line appears in two or more caches
– one processor writes a word in the line
– other processors can now read stale data
• Leads to the need for a coherence protocol
– avoids coherence problems
• Many protocols exist; we will just look at a simple one.
What is coherence?
• What does it mean to be shared?
• Intuitively: a read returns the last value written.
• Notion is not well-defined in a system without a global clock.
The Notion of “last written” in a Multi-processor System
[Timeline: P0 and P1 each perform a w(x); P2 and P3 each perform an r(x); the operations overlap in time, so “last written” is ambiguous]
The Notion of “last written” in a Single-machine System
[Timeline: w(x), w(x), r(x), r(x) execute one after another in a single total order]
Coherence: a Clean Definition
• A clean definition is obtained by referring back to the single-machine case.
• Called sequential consistency.
Sequential Consistency (SC)
• Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.
Returning to our Example
[Timeline repeated: P0 and P1 each perform a w(x); P2 and P3 each perform an r(x); under SC the values read must correspond to some single-machine interleaving of these operations]
Another Way of Defining SC
• All memory references of a single process execute in program order.
• All writes are globally ordered.
SC: Example 1
P0: w(x,1), w(y,1)
P1: r(x), r(y)
Initial values of x,y are 0.
What are possible final values?
SC: Example 2
P0: w(x,1), w(y,1)
P1: r(y), r(x)
SC: Example 3
P0: w(x,1)
P1: w(y,1)
P2: r(y), r(x)
SC: Example 4
P0: w(x,1)
P1: w(x,2)
P2: r(x)
P3: r(x)
Implementation
• There are many ways of implementing SC.
• In fact, implementations often enforce even stronger conditions.
• We will look at a simple one: the MSI protocol.
Physical Implementation
[Diagram repeated: proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to a single shared memory]
Fundamental Assumption
• The bus is a reliable, ordered broadcast bus.
– Every message sent by a processor is received by all other processors in the same order.
• Also called a snooping bus.
– Processors (or caches) snoop on the bus.
States of a Cache Line
• Invalid
• Shared
– read-only, one of possibly many cached copies
• Modified
– read-write, the sole valid copy
Processor Transactions
• processor read(x)
• processor write(x)
Bus Transactions
• bus read(x)
– asks for a copy with no intent to modify
• bus read-exclusive(x)
– asks for a copy with intent to modify
State Diagram: Steps 0–9
• Step 0: three states, I, S, M; each edge is labeled “observed event / resulting bus action”.
• Step 1: I → S on PrRd / BuRd
• Step 2: S → S on PrRd / –
• Steps 3–4: I → M and S → M, both on PrWr / BuRdX
• Step 5: M → M on PrWr / –
• Step 6: M → S on BuRd / Flush
• Step 7: S → S on BuRd / –
• Step 8: S → I on BuRdX / –
• Step 9: M → I on BuRdX / Flush
In Reality
• Most machines use a slightly more complicated protocol (4 states instead of 3).
• See architecture books (MESI protocol).
Problem: False Sharing
• Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
• Leads to ping-pong effect.
False Sharing: Example (1 of 3)
for( i=0; i<n; i++ )
a[i] = b[i];
• Let’s assume we parallelize the code:
– p = 2
– an element of a takes 4 words
– a cache line has 32 words
False Sharing: Example (2 of 3)
[Diagram: a single cache line holding a[0] … a[7]; alternating elements are written by processor 0 and processor 1]
False Sharing: Example (3 of 3)
[Timeline: P0 writes a[0], a[2], a[4] while P1 writes a[1], a[3], a[5]; each write invalidates the other cache’s copy of the line, so invalidations and data ping-pong between the two caches]
Summary
• Sequential consistency.
• Bus-based coherence protocols.
• False sharing.