Page 1: ECE 1747: Parallel Programming

ECE 1747: Parallel Programming

Basics of Parallel Architectures:Shared-Memory Machines

Page 2: ECE 1747: Parallel Programming

Two Parallel Architectures

• Shared memory machines.
• Distributed memory machines.

Page 3: ECE 1747: Parallel Programming

Shared Memory: Logical View

proc1, proc2, proc3, …, procN, all accessing a single shared memory space

Page 4: ECE 1747: Parallel Programming

Shared Memory Machines

• Small number of processors: shared memory with coherent caches (SMP).

• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

Page 5: ECE 1747: Parallel Programming

SMPs

• 2- or 4-processor PCs are now commodity hardware.
• Good price/performance ratio.
• Memory is sometimes a bottleneck (see later).
• Typical price (8-node): ~$20-40k.

Page 6: ECE 1747: Parallel Programming

Physical Implementation

proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory

Page 7: ECE 1747: Parallel Programming

Shared Memory Machines

• Small number of processors: shared memory with coherent caches (SMP).

• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

Page 8: ECE 1747: Parallel Programming

CC-NUMA: Physical Implementation

proc1 … procN, each with a private cache (cache1 … cacheN) and a local memory (mem1 … memN), connected by an interconnect

Page 9: ECE 1747: Parallel Programming

Caches in Multiprocessors

• Caches suffer from the coherence problem:
  – the same line appears in two or more caches
  – one processor writes a word in the line
  – other processors can now read stale data
• This leads to the need for a coherence protocol that avoids such problems.
• Many protocols exist; we will look at a simple one.

Page 10: ECE 1747: Parallel Programming

What is coherence?

• What does it mean for memory to be shared?
• Intuitively: a read returns the last value written.
• This notion is not well-defined in a system without a global clock.

Page 11: ECE 1747: Parallel Programming

The Notion of “last written” in a Multi-processor System

P0: w(x)    P1: w(x)    P2: r(x)    P3: r(x)    (each processor on its own timeline)

Page 12: ECE 1747: Parallel Programming

The Notion of “last written” in a Single-machine System

w(x) → w(x) → r(x) → r(x)    (one totally ordered timeline)

Page 13: ECE 1747: Parallel Programming

Coherence: a Clean Definition

• A clean definition is achieved by referring back to the single-machine case.
• It is called sequential consistency.

Page 14: ECE 1747: Parallel Programming

Sequential Consistency (SC)

• Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.

Page 15: ECE 1747: Parallel Programming

Returning to our Example

P0: w(x)    P1: w(x)    P2: r(x)    P3: r(x)

Page 16: ECE 1747: Parallel Programming

Another Way of Defining SC

• All memory references of a single process execute in program order.

• All writes are globally ordered.
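
As an illustration (not from the slides), the two rules above can be seen in a small C/pthreads sketch: because the writer's stores execute in program order and all writes are globally ordered, a sequentially consistent memory never lets the reader observe y == 1 together with x == 0. Plain C variables give no such guarantee on real hardware; this sketches the SC model, not portable synchronization code.

  #include <stdio.h>
  #include <pthread.h>

  int x = 0, y = 0;

  void *writer(void *arg) {
      x = 1;              /* w(x,1) */
      y = 1;              /* w(y,1) */
      return NULL;
  }

  void *reader(void *arg) {
      int ry = y;         /* r(y) */
      int rx = x;         /* r(x) */
      /* Under SC, (ry == 1 && rx == 0) is impossible. */
      printf("r(y)=%d r(x)=%d\n", ry, rx);
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, writer, NULL);
      pthread_create(&t2, NULL, reader, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }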

Page 17: ECE 1747: Parallel Programming

SC: Example 1

P0:  w(x,1)   w(y,1)
P1:  r(x)     r(y)

Initial values of x,y are 0.

What are possible final values?

Page 18: ECE 1747: Parallel Programming

SC: Example 2

P0:  w(x,1)   w(y,1)
P1:  r(y)     r(x)

Page 19: ECE 1747: Parallel Programming

SC: Example 3

P0:  w(x,1)
P1:  w(y,1)
P2:  r(y)   r(x)

Page 20: ECE 1747: Parallel Programming

SC: Example 4

P0:  w(x,1)
P1:  w(x,2)
P2:  r(x)
P3:  r(x)

Page 21: ECE 1747: Parallel Programming

Implementation

• There are many ways of implementing SC.
• In fact, some implementations provide even stronger conditions.
• We will look at a simple one: the MSI protocol.

Page 22: ECE 1747: Parallel Programming

Physical Implementation

proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory

Page 23: ECE 1747: Parallel Programming

Fundamental Assumption

• The bus is a reliable, ordered broadcast bus.
  – Every message sent by a processor is received by all other processors in the same order.
• Also called a snooping bus.
  – Processors (or caches) snoop on the bus.

Page 24: ECE 1747: Parallel Programming

States of a Cache Line

• Invalid
• Shared
  – read-only, one of many cached copies
• Modified
  – read-write, sole valid copy

Page 25: ECE 1747: Parallel Programming

Processor Transactions

• processor read(x)
• processor write(x)

Page 26: ECE 1747: Parallel Programming

Bus Transactions

• bus read(x)
  – asks for copy with no intent to modify
• bus read-exclusive(x)
  – asks for copy with intent to modify

Page 27: ECE 1747: Parallel Programming

State Diagram: Step 0

I S M

Page 28: ECE 1747: Parallel Programming

State Diagram: Step 1

I S M

PrRd/BuRd

Page 29: ECE 1747: Parallel Programming

State Diagram: Step 2

I S M

PrRd/BuRd    PrRd/-

Page 30: ECE 1747: Parallel Programming

State Diagram: Step 3

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

Page 31: ECE 1747: Parallel Programming

State Diagram: Step 4

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

PrWr/BuRdX

Page 32: ECE 1747: Parallel Programming

State Diagram: Step 5

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

PrWr/BuRdX

PrWr/-

Page 33: ECE 1747: Parallel Programming

State Diagram: Step 6

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

PrWr/BuRdX

PrWr/-

BuRd/Flush

Page 34: ECE 1747: Parallel Programming

State Diagram: Step 7

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

PrWr/BuRdX

PrWr/-

BuRd/Flush

BuRd/-

Page 35: ECE 1747: Parallel Programming

State Diagram: Step 8

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

PrWr/BuRdX

PrWr/-

BuRd/Flush

BuRd/-

BuRdX/-

Page 36: ECE 1747: Parallel Programming

State Diagram: Step 9

I S M

PrRd/BuRd    PrRd/-

PrWr/BuRdX

PrWr/BuRdX

PrWr/-

BuRd/Flush

BuRd/-

BuRdX/-

BuRdX/Flush
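
The completed diagram can be summarized as a transition function. The following is a sketch (not from the slides) in C: given a line's current state and the observed event, it returns the next state, with the corresponding bus action noted in comments.

  typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
  typedef enum { PR_RD, PR_WR, BU_RD, BU_RDX } event_t;

  line_state_t msi_next(line_state_t s, event_t e) {
      switch (s) {
      case INVALID:
          if (e == PR_RD)  return SHARED;    /* PrRd / BuRd          */
          if (e == PR_WR)  return MODIFIED;  /* PrWr / BuRdX         */
          return INVALID;                    /* bus events: ignore   */
      case SHARED:
          if (e == PR_WR)  return MODIFIED;  /* PrWr / BuRdX         */
          if (e == BU_RDX) return INVALID;   /* BuRdX / -            */
          return SHARED;                     /* PrRd / -,  BuRd / -  */
      case MODIFIED:
          if (e == BU_RD)  return SHARED;    /* BuRd / Flush         */
          if (e == BU_RDX) return INVALID;   /* BuRdX / Flush        */
          return MODIFIED;                   /* PrRd / -,  PrWr / -  */
      }
      return s;
  }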

Page 37: ECE 1747: Parallel Programming

In Reality

• Most machines use a slightly more complicated protocol (4 states instead of 3).

• See architecture books (MESI protocol).

Page 38: ECE 1747: Parallel Programming

Problem: False Sharing

• Occurs when two or more processors access different data in the same cache line, and at least one of them writes.

• Leads to ping-pong effect.

Page 39: ECE 1747: Parallel Programming

False Sharing: Example (1 of 3)

#pragma omp parallel for schedule(static,1)   /* cyclic (round-robin) distribution */
for( i=0; i<n; i++ )
    a[i] = b[i];

• Let’s assume:
  – p = 2
  – an element of a takes 4 words
  – a cache line has 32 words (so 8 consecutive elements of a share one line)

Page 40: ECE 1747: Parallel Programming

False Sharing: Example (2 of 3)

cache line:  a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]
Even-indexed elements are written by processor 0, odd-indexed elements by processor 1.

Page 41: ECE 1747: Parallel Programming

False Sharing: Example (3 of 3)

P0 writes a[0], a[2], a[4], …; P1 writes a[1], a[3], a[5], …. Each write invalidates the line in the other processor's cache, so invalidation and data messages make the line ping-pong between the two caches.
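
Two common remedies follow from this picture; the sketch below is not from the slides, and the 64-byte line size, thread count, and names are assumptions. A block (plain static) schedule gives each thread one contiguous chunk, so threads share at most the cache line at each chunk boundary; per-thread data that must stay separate can be padded to its own line.

  #include <omp.h>

  #define CACHE_LINE 64
  #define NTHREADS   8

  /* Remedy 1: block schedule instead of cyclic. */
  void copy_blocked(double *a, const double *b, int n) {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; i++)
          a[i] = b[i];
  }

  /* Remedy 2: pad per-thread counters so no two share a cache line. */
  struct padded_counter {
      long value;
      char pad[CACHE_LINE - sizeof(long)];
  };
  struct padded_counter counts[NTHREADS];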

Page 42: ECE 1747: Parallel Programming

Summary

• Sequential consistency.• Bus-based coherence protocols.• False sharing.

Page 43: ECE 1747: Parallel Programming

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

J.M. Mellor-Crummey and M.L. Scott (MCS Locks)

Page 44: ECE 1747: Parallel Programming

Introduction

• Busy-waiting techniques are heavily used for synchronization on shared-memory multiprocessors.
• Two general categories: locks and barriers.
  – Locks ensure mutual exclusion.
  – Barriers provide phase separation in an application.

Page 45: ECE 1747: Parallel Programming

Problem

• Busy-waiting synchronization constructs tend to:
  – have a significant impact on network traffic due to cache invalidations
  – suffer contention, which leads to poor scalability
• Main cause: spinning on remote variables.

Page 46: ECE 1747: Parallel Programming

The Proposed Solution

• Minimize access to remote variables.
• Instead, spin on local variables.
• Claim:
  – It can be done entirely in software (no need for fancy and costly hardware support).
  – Spinning on local variables minimizes contention and allows good scalability and performance.

Page 47: ECE 1747: Parallel Programming

Spin Lock 1: Test-and-Set Lock

• Repeatedly test-and-set a boolean flag indicating whether the lock is held.
• Problem: contention for the flag (read-modify-write instructions are expensive).
  – Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations).
• Variation: test-and-test-and-set – less traffic.

Page 48: ECE 1747: Parallel Programming

Test-and-Set with Backoff Lock

• Pause between successive test-and-set attempts (“backoff”).
• T&S with backoff idea:

  while test&set(L) fails {
      pause(delay);
      delay = delay * 2;    /* exponential backoff */
  }
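
A more concrete version of this idea, sketched with C11 atomics (not the paper's code; the names and the backoff cap are assumptions): spin with plain reads until the lock looks free, make one atomic attempt, and double the pause after each failed attempt.

  #include <stdatomic.h>
  #include <sched.h>

  typedef struct { atomic_int held; } ttas_lock_t;   /* 0 = free, 1 = held */

  void ttas_acquire(ttas_lock_t *l) {
      unsigned delay = 1;
      for (;;) {
          /* "test": read-only spin, served from the local cache */
          while (atomic_load_explicit(&l->held, memory_order_relaxed))
              ;
          /* "test-and-set": one read-modify-write attempt */
          if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
              return;                               /* acquired */
          /* backoff: pause, doubling the delay each time (capped) */
          for (unsigned i = 0; i < delay; i++)
              sched_yield();
          if (delay < 1024)
              delay *= 2;
      }
  }

  void ttas_release(ttas_lock_t *l) {
      atomic_store_explicit(&l->held, 0, memory_order_release);
  }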

Page 49: ECE 1747: Parallel Programming

Spin Lock 2: The Ticket Lock

• Two counters: nr_requests and nr_releases.
• Lock acquire: fetch-and-increment on the nr_requests counter; the caller waits until its “ticket” equals the value of the nr_releases counter.
• Lock release: increment the nr_releases counter.

Page 50: ECE 1747: Parallel Programming

Spin Lock 2: The Ticket Lock

• Advantage over T&S: polls with read operations only.
• Still generates lots of traffic and contention.
• Can be further improved by using backoff.
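
A sketch of the ticket lock with C11 atomics (not the paper's code; the field names are assumptions): next_ticket plays the role of nr_requests and now_serving the role of nr_releases, and the acquire loop polls with plain reads only. Proportional backoff would pause roughly in proportion to (my_ticket - now_serving) before re-polling.

  #include <stdatomic.h>

  typedef struct {
      atomic_uint next_ticket;   /* nr_requests: fetch-and-increment here  */
      atomic_uint now_serving;   /* nr_releases: incremented by the holder */
  } ticket_lock_t;

  void ticket_acquire(ticket_lock_t *l) {
      unsigned my_ticket =
          atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
      while (atomic_load_explicit(&l->now_serving, memory_order_acquire)
             != my_ticket)
          ;                      /* read-only polling */
  }

  void ticket_release(ticket_lock_t *l) {
      unsigned next =
          atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
      atomic_store_explicit(&l->now_serving, next, memory_order_release);
  }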

Page 51: ECE 1747: Parallel Programming

Array-Based Queueing Locks

• Each CPU spins on a different location, in a distinct cache line.
• Each CPU clears the lock for its successor (sets it from must-wait to has-lock).
• Lock acquire:
    while (slots[my_place] == must-wait) ;
• Lock release:
    slots[(my_place + 1) % P] = has-lock;
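
Filling in the bookkeeping around these two lines, a sketch with C11 atomics (not the paper's code; P, the 64-byte padding, and the names are assumptions). Each slot sits in its own cache line, slot 0 starts out as has-lock, and a fetch-and-increment hands out places in the queue.

  #include <stdatomic.h>

  #define P          8                  /* number of processors (assumed)  */
  #define CACHE_LINE 64

  enum { MUST_WAIT = 0, HAS_LOCK = 1 };

  typedef struct {
      struct {
          atomic_int flag;
          char pad[CACHE_LINE - sizeof(atomic_int)];   /* one line per slot */
      } slots[P];
      atomic_uint next_place;            /* fetch-and-increment to enqueue  */
  } array_lock_t;
  /* Initialize slots[0].flag = HAS_LOCK and all other slots to MUST_WAIT. */

  unsigned array_acquire(array_lock_t *l) {
      unsigned my_place =
          atomic_fetch_add_explicit(&l->next_place, 1,
                                    memory_order_relaxed) % P;
      while (atomic_load_explicit(&l->slots[my_place].flag,
                                  memory_order_acquire) == MUST_WAIT)
          ;                              /* spin on our own cache line      */
      return my_place;                   /* pass this value to the release  */
  }

  void array_release(array_lock_t *l, unsigned my_place) {
      atomic_store_explicit(&l->slots[my_place].flag, MUST_WAIT,
                            memory_order_relaxed);         /* re-arm my slot */
      atomic_store_explicit(&l->slots[(my_place + 1) % P].flag, HAS_LOCK,
                            memory_order_release);         /* grant successor */
  }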

Page 52: ECE 1747: Parallel Programming

List-Based Queueing Locks (MCS Locks)

• Spins on local flag variables only.
• Requires a small constant amount of space per lock.

Page 53: ECE 1747: Parallel Programming

List-Based Queueing Locks (MCS Locks)

• CPUs are all in a linked list: upon release by the current CPU, the lock is acquired by its successor.
• Spinning is on a local flag.
• The lock points at the tail of the queue (null if not held).
• Compare-and-swap allows a processor to detect that it is the only one in the queue and to atomically remove itself from the queue.

Page 54: ECE 1747: Parallel Programming

List-Based Queueing Locks (MCS Locks)

• The spin in acquire_lock waits for the lock to become free.
• The spin in release_lock compensates for the time window between the fetch-and-store and the assignment to predecessor->next in acquire_lock.
• Without compare-and-swap, the implementation becomes cumbersome.
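
A sketch of the MCS lock with C11 atomics (not the paper's pseudocode verbatim; the names are assumptions). The acquire spin waits on the caller's own locked flag; the release spin covers exactly the window described above; compare-and-swap detects the case where the releaser is the only processor in the queue.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  typedef struct mcs_node {
      _Atomic(struct mcs_node *) next;
      atomic_bool                locked;
  } mcs_node_t;

  typedef struct {
      _Atomic(mcs_node_t *) tail;        /* the lock points at the queue tail */
  } mcs_lock_t;

  void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
      atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
      atomic_store_explicit(&me->locked, true, memory_order_relaxed);
      /* fetch-and-store: append myself at the tail of the queue */
      mcs_node_t *pred =
          atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
      if (pred != NULL) {                /* lock held: link in and spin */
          atomic_store_explicit(&pred->next, me, memory_order_release);
          while (atomic_load_explicit(&me->locked, memory_order_acquire))
              ;                          /* spin on my local flag only */
      }
  }

  void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
      mcs_node_t *succ =
          atomic_load_explicit(&me->next, memory_order_acquire);
      if (succ == NULL) {
          mcs_node_t *expected = me;     /* am I the only one in the queue? */
          if (atomic_compare_exchange_strong_explicit(&l->tail, &expected,
                  NULL, memory_order_acq_rel, memory_order_acquire))
              return;                    /* yes: lock is now free */
          /* A successor is between its fetch-and-store and its assignment
           * to my next field: wait for the link to appear. */
          while ((succ = atomic_load_explicit(&me->next,
                                              memory_order_acquire)) == NULL)
              ;
      }
      atomic_store_explicit(&succ->locked, false, memory_order_release);
  }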

Page 55: ECE 1747: Parallel Programming

The MCS Tree-Based Barrier

• Uses a pair of P-node trees (P = number of CPUs): an arrival tree and a wakeup tree.
• Arrival tree: each node has 4 children.
• Wakeup tree: binary tree.
  – The fastest way to wake up all P processors.

Page 56: ECE 1747: Parallel Programming

Hardware Description

• BBN Butterfly 1 – DSM multiprocessor
  – Supports up to 256 CPUs; 80 used in experiments.
  – Atomic primitives: fetch_and_add, fetch_and_store (swap), test_and_set.
• Sequent Symmetry Model B – cache-coherent, shared-bus multiprocessor
  – Supports up to 30 CPUs; 18 used in experiments.
  – Snooping cache-coherence protocol.
• Neither supports compare-and-swap.

Page 57: ECE 1747: Parallel Programming

Measurement Technique

• Results averaged over 10k (Butterfly) or 100k (Symmetry) acquisitions

• For 1 CPU, time represents latency between acquire and release of lock

• Otherwise, time represents time elapsed between successive acquisitions

Page 58: ECE 1747: Parallel Programming


Spin Locks on Butterfly

Page 59: ECE 1747: Parallel Programming


Spin Locks on Butterfly

Page 60: ECE 1747: Parallel Programming

Spin Locks on Butterfly

• Anderson’s lock fares poorer because the Butterfly lacks coherent caches, and CPUs may spin on statically unpredictable locations, which may not be local.
• T&S with exponential backoff, the Ticket lock with proportional backoff, and MCS all scale very well, with slopes of 0.0025, 0.0021 and 0.00025 μs, respectively.

Page 61: ECE 1747: Parallel Programming


Spin Locks on Symmetry

Page 62: ECE 1747: Parallel Programming


Spin Locks on Symmetry

Page 63: ECE 1747: Parallel Programming


Latency and Impact of Spin Locks

Page 64: ECE 1747: Parallel Programming

Latency and Impact of Spin Locks

• Latency results are poor on the Butterfly because:
  – Atomic operations are inordinately expensive in comparison to non-atomic ones.
  – 16-bit atomic primitives on the Butterfly cannot manipulate 24-bit pointers.

Page 65: ECE 1747: Parallel Programming


Barriers on Butterfly

Page 66: ECE 1747: Parallel Programming


Barriers on Butterfly

Page 67: ECE 1747: Parallel Programming


Barriers on Symmetry

Page 68: ECE 1747: Parallel Programming

Barriers on Symmetry

• Different results from the Butterfly because:
  – More CPUs can spin on the same location (each has its own copy in its local cache).
  – Distributing writes across different memory modules yields no benefit, because the bus serializes all communication.

Page 69: ECE 1747: Parallel Programming

Conclusions

• Criteria for evaluating spin locks:
  – Scalability and induced network load
  – Single-processor latency
  – Space requirements
  – Fairness
  – Implementability with available atomic operations

Page 70: ECE 1747: Parallel Programming

Conclusions

• MCS lock algorithm scales best, together with array-based queueing on cache-coherent machines

• T&S and Ticket Locks with proper backoffs also scale well, but incur more network load

• Anderson and G&T (Graunke & Thakkar): prohibitive space requirements for large numbers of CPUs.

Page 71: ECE 1747: Parallel Programming

Conclusions

• MCS, array-based, and Ticket Locks guarantee fairness (FIFO)

• MCS benefits significantly from existence of compare-and-swap

• MCS is best when contention is expected: excellent scaling, FIFO ordering, least interconnect contention, low space requirements.

