Post on 11-Jan-2016
ECE 1747: Parallel Programming
Basics of Parallel Architectures:
Shared-Memory Machines
Two Parallel Architectures
• Shared memory machines.
• Distributed memory machines.
Shared Memory: Logical View
[Diagram: processors proc1 … procN, all connected to a single shared memory space]
Shared Memory Machines
• Small number of processors: shared memory with coherent caches (SMP).
• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
SMPs
• 2- or 4-processor PCs are now commodity.
• Good price/performance ratio.
• Memory is sometimes a bottleneck (see later).
• Typical price (8-node): ~ $20-40k.
Physical Implementation
[Diagram: proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to a single shared memory]
CC-NUMA: Physical Implementation
[Diagram: proc1 … procN, each with its own cache (cache1 … cacheN) and its own local memory (mem1 … memN), connected by an interconnect]
Caches in Multiprocessors
• Suffer from the coherence problem:
– the same line appears in two or more caches
– one processor writes a word in the line
– other processors can now read stale data
• Leads to the need for a coherence protocol
– avoids coherence problems
• Many protocols exist; we will just look at a simple one.
What is coherence?
• What does it mean to be shared?
• Intuitively: a read returns the last value written.
• Notion is not well-defined in a system without a global clock.
The Notion of “last written” in a Multi-processor System
[Timeline: P0 and P1 each perform a w(x); P2 and P3 each perform an r(x); the operations overlap in time, so “last written” is ambiguous]
The Notion of “last written” in a Single-machine System
[Timeline: w(x), w(x), r(x), r(x) execute one after another in a single total order]
Coherence: a Clean Definition
• A clean definition is obtained by referring back to the single-machine case.
• Called sequential consistency.
Sequential Consistency (SC)
• Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.
Returning to our Example
[Timeline repeated: P0 and P1 each perform a w(x); P2 and P3 each perform an r(x); under SC the values read must correspond to some single-machine interleaving of these operations]
Another Way of Defining SC
• All memory references of a single process execute in program order.
• All writes are globally ordered.
SC: Example 1
P0: w(x,1), w(y,1)
P1: r(x), r(y)
Initial values of x,y are 0.
What are possible final values?
SC: Example 2
P0: w(x,1), w(y,1)
P1: r(y), r(x)
SC: Example 3
P0: w(x,1)
P1: w(y,1)
P2: r(y), r(x)
SC: Example 4
P0: w(x,1)
P1: w(x,2)
P2: r(x)
P3: r(x)
Implementation
• There are many ways of implementing SC.
• In fact, implementations often enforce even stronger conditions.
• We will look at a simple one: the MSI protocol.
Physical Implementation
[Diagram repeated: proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to a single shared memory]
Fundamental Assumption
• The bus is a reliable, ordered broadcast bus.
– Every message sent by a processor is received by all other processors in the same order.
• Also called a snooping bus.
– Processors (or caches) snoop on the bus.
States of a Cache Line
• Invalid
• Shared
– read-only, one of possibly many cached copies
• Modified
– read-write, the sole valid copy
Processor Transactions
• processor read(x)
• processor write(x)
Bus Transactions
• bus read(x)
– asks for a copy with no intent to modify
• bus read-exclusive(x)
– asks for a copy with intent to modify
State Diagram: Steps 0–9
• Step 0: three states, I, S, M; each edge is labeled “observed event / resulting bus action”.
• Step 1: I → S on PrRd / BuRd
• Step 2: S → S on PrRd / –
• Steps 3–4: I → M and S → M, both on PrWr / BuRdX
• Step 5: M → M on PrWr / –
• Step 6: M → S on BuRd / Flush
• Step 7: S → S on BuRd / –
• Step 8: S → I on BuRdX / –
• Step 9: M → I on BuRdX / Flush
In Reality
• Most machines use a slightly more complicated protocol (4 states instead of 3).
• See architecture books (MESI protocol).
Problem: False Sharing
• Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
• Leads to ping-pong effect.
False Sharing: Example (1 of 3)
for( i=0; i<n; i++ )
a[i] = b[i];
• Let’s assume we parallelize the code:
– p = 2
– an element of a takes 4 words
– a cache line has 32 words
False Sharing: Example (2 of 3)
[Diagram: a single cache line holding a[0] … a[7]; alternating elements are written by processor 0 and processor 1]
False Sharing: Example (3 of 3)
[Timeline: P0 writes a[0], a[2], a[4] while P1 writes a[1], a[3], a[5]; each write invalidates the other cache’s copy of the line, so invalidations and data ping-pong between the two caches]
Summary
• Sequential consistency.
• Bus-based coherence protocols.
• False sharing.