EECE 550 1
5.5: Synchronization Issue: How can synchronization operations
be implemented in bus-based cache-coherent multiprocessors?
Components of a synchronization event
– Acquire method
– Waiting algorithm
  • Busy waiting (processor cannot do other work)
  • Blocking (higher overhead, since process state must be saved)
– Release method
Implementing Mutual Exclusion (Lock-Unlock)
Hardware solution
– Use a set of dedicated "LOCK" bus lines
  • Expensive and nonscalable
(Diagram: processors P1, P2, P3, …, Pp sharing bus lines LOCK1–LOCK4)
Software solution
– Requires hardware support for an atomic test-and-set operation
– Example:

  lock:   ld  reg, location
          cmp reg, #0
          bnz lock
          st  location, #1
          ret

  unlock: st  location, #0
          ret

Does this work?
Simple software test-and-set lock

  lock:   t&s reg, location
          bnz reg, lock
          ret

  unlock: st  location, #0
          ret
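For concreteness, the test-and-set spinlock above could be sketched in C, with the GCC/Clang `__atomic_test_and_set` builtin standing in for the hardware t&s instruction (the builtin names and memory-order flags are toolchain assumptions, not part of the slide's ISA):

```c
/* Minimal test-and-set spinlock sketch. */
static volatile char lock_word = 0;   /* 0 = free, nonzero = held */

static void ts_lock(void) {
    /* t&s: atomically set the word and return its OLD value;
       keep retrying while the old value said "held". */
    while (__atomic_test_and_set(&lock_word, __ATOMIC_ACQUIRE))
        ;  /* every failed attempt still writes, invalidating other caches */
}

static void ts_unlock(void) {
    __atomic_clear(&lock_word, __ATOMIC_RELEASE);  /* st location, #0 */
}
```

Note that even a failed attempt performs a write, which is exactly the bus-traffic problem the later slides address.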
Other possible atomic instructions
– swap reg, location
– fetch&op (operation) location
  • fetch&inc location
  • fetch&add reg, location
– compare&swap reg1, reg2, location
  /* if (reg1 == M[location]) then M[location] ← reg2 */
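The semantics of fetch&add and compare&swap can be illustrated with compiler builtins (a sketch; the GCC/Clang builtin names are assumptions, as the slides describe ISA-level instructions):

```c
/* fetch&add: atomically add, returning the OLD value. */
static int fetch_add_demo(void) {
    int loc = 5;
    int old = __atomic_fetch_add(&loc, 3, __ATOMIC_SEQ_CST); /* old = 5, loc -> 8 */
    return old * 100 + loc;
}

/* compare&swap: if (expected == loc) then loc <- desired,
   matching the slide's compare&swap reg1, reg2, location. */
static int cas_demo(void) {
    int loc = 8, expected = 8;
    __atomic_compare_exchange_n(&loc, &expected, 42, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return loc;
}
```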
Performance of t&s Locks
Figure 5.29, based on the following code:

  lock(L);
  critical-section(c);   /* c = time spent in critical section */
  unlock(L);

Exponential backoff (like CSMA)
– If a lock attempt is unsuccessful, wait k · f^i time units before attempt i+1
– Constants k and f chosen based on experiments
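A t&s lock with exponential backoff might be sketched as follows (k = 16 and f = 2 are made-up constants; the slides say the real values are chosen experimentally):

```c
/* Test-and-set lock with exponential backoff: after the i-th failed
   attempt, spin privately for about k * f^i iterations before retrying. */
static volatile char eb_lock_word = 0;

static void eb_lock(void) {
    unsigned delay = 16;                                   /* k */
    while (__atomic_test_and_set(&eb_lock_word, __ATOMIC_ACQUIRE)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                              /* back off, off the bus */
        if (delay < 65536u)
            delay *= 2;                                    /* f = 2, capped */
    }
}

static void eb_unlock(void) {
    __atomic_clear(&eb_lock_word, __ATOMIC_RELEASE);
}
```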
Test-and-test-and-set Lock
Could be the basis for a better solution
Operation:

  lock: test reg, location
        bnz  reg, lock
        t&s  reg, location
        bnz  reg, lock
        ret
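The same idea in C (a sketch, again assuming GCC/Clang builtins for the atomic step): spin on an ordinary read, which is served from the local cache, and only issue the invalidating t&s when the lock looks free.

```c
/* Test-and-test-and-set spinlock sketch. */
static volatile char ttas_word = 0;

static void ttas_lock(void) {
    for (;;) {
        while (ttas_word != 0)
            ;                               /* "test": cache-local reads only */
        if (!__atomic_test_and_set(&ttas_word, __ATOMIC_ACQUIRE))
            return;                         /* "t&s" saw old value 0: acquired */
    }
}

static void ttas_unlock(void) {
    __atomic_clear(&ttas_word, __ATOMIC_RELEASE);
}
```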
Performance Goals for Locks
– Low latency
– Low traffic
– Scalability
– Low storage cost
– Fairness (starvation should be avoided)

Evaluation of locks:
– swap lock?
– t&s lock?
– test-and-t&s lock?
(LL, SC) Primitives
LL (load-locked)
– Loads the synchronization variable into a register
SC (store-conditional)
– Tries to store the register value into the synchronization variable's memory location iff no other processor has written to that location (or cache block) since the LL
  lock:   LL   reg1, location
          bnz  reg1, lock      /* if locked, try again */
          SC   location, reg2
          beqz reg2, lock      /* if SC failed (reg2 = 0), start again */
          ret

  unlock: st   location, #0
          ret
Comments on LL-SC
Only certain “undo-able” instructions are permitted between LL and SC
Many different types of “fetch&op” instructions can be implemented
SC does not generate invalidations upon a failure
Only one processor can perform LL or SC at any given time instant
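The claim that many fetch&op variants can be implemented follows because the LL/SC retry loop can wrap any computation. A hedged C sketch, using a compare&swap builtin in place of LL/SC and an invented fetch&max as the "op":

```c
/* Synthesize an arbitrary fetch&op from a compare&swap loop, mirroring
   how an LL/SC pair would be used. fetch&max is a made-up example op. */
static int fetch_and_max(int *loc, int v) {
    int old = *loc;                          /* plays the role of LL */
    int desired;
    do {
        desired = (v > old) ? v : old;       /* any "op" goes here */
        /* On failure, __atomic_compare_exchange_n refreshes old with the
           current value and we retry, just as a failed SC restarts the loop. */
    } while (!__atomic_compare_exchange_n(loc, &old, desired, 0,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
    return old;                              /* fetch&op returns the old value */
}
```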
Ticket Lock

  LOCK:   LL    reg1, ticket
          add   reg2, reg1, #1
          SC    ticket, reg2
          beqz  reg2, LOCK     /* retry if SC failed */
  LOCK1:  load  reg3, LED
          cmp   reg1, reg3
          bnz   LOCK1
          ret

  Unlock: load  reg1, LED
          inc   reg1
          store LED, reg1

(Shared variables: ticket counter and "now-serving" display LED)
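The ticket lock can be sketched in C with a fetch&increment builtin replacing the LL/SC loop on the ticket counter (builtin names are toolchain assumptions):

```c
/* Ticket lock sketch: fetch&increment hands out tickets; led is the
   global "now-serving" display everyone busy-waits on. */
static volatile unsigned ticket = 0;     /* next ticket to hand out */
static volatile unsigned led = 0;        /* now-serving number */

static unsigned ticket_lock(void) {
    unsigned my = __atomic_fetch_add(&ticket, 1, __ATOMIC_ACQUIRE);
    while (led != my)
        ;                                /* busy-wait on the global number */
    return my;                           /* returned for illustration only */
}

static void ticket_unlock(void) {
    led = led + 1;                       /* only the lock holder writes this */
}
```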
Array-based LOCK

  LOCK:   LL    reg1, ticket
          add   reg2, reg1, #1 (mod p)
          SC    ticket, reg2
          beqz  reg2, LOCK     /* retry if SC failed */
          store ptr, reg2
  LOCK1:  load  reg3, LED[reg1]
          cmp   reg3, #1
          bnz   LOCK1
          store LED[reg1], #0
          ret

  Unlock: load  reg1, ptr
          store LED[reg1], #1
          ret

(Shared data: ticket counter and array LED[0..p-1])
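A C sketch of the array-based lock for P processes (P = 4 is arbitrary here; in a real lock each LED[] slot would be placed in its own cache block, which this sketch does not do):

```c
/* Array-based queueing lock sketch: each waiter spins on its own slot. */
#define P 4
static volatile unsigned arr_ticket = 0;
static volatile char LED[P] = {1, 0, 0, 0};  /* slot 0 starts "unlocked" */
static unsigned my_slot[P];                  /* private per-process pointer */

static void array_lock(int pid) {
    unsigned slot = __atomic_fetch_add(&arr_ticket, 1, __ATOMIC_ACQUIRE) % P;
    my_slot[pid] = slot;
    while (LED[slot] == 0)
        ;                                    /* spin on own location only */
    LED[slot] = 0;                           /* re-arm slot for wraparound reuse */
}

static void array_unlock(int pid) {
    LED[(my_slot[pid] + 1) % P] = 1;         /* hand the lock to the next waiter */
}
```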
Comments on LL-SC
– LL-SC does not generate bus traffic if the LL fails (the spin stays in the cache)
– LL-SC does not generate invalidations if the SC fails
– But LL-SC does generate read-miss bus traffic even when the SC fails
– O(p) traffic per lock acquisition
– LL-SC is not a fair lock
Comments on Ticket Lock
– Operates like the ticket system at a bank
– Every process wanting to acquire the lock takes a ticket number and then busy-waits on a global "now-serving" number
– To release the lock, a process increments the "now-serving" number
– The ticket lock is fair, generates low bus traffic, and uses a constant, small amount of storage
– Main problem: when "now-serving" changes, all processors' cached copies are invalidated, and they all incur a read miss
Comments on Array-Based Lock
– Uses fetch&increment to obtain a unique location on which to busy-wait (not a value)
– Lock data structure contains an array of p locations (each in a separate cache block)
– Acquire: use fetch&increment to obtain the next available location in the lock array (with wraparound)
– Release: write "unlocked" to the next array location
– It is fair, uses O(p) space, and is more scalable than the ticket lock since only one processor incurs a read miss per release
Comparison
Comparative performance: Fig. 5.30
– LL-SC with exponential backoff is best

NOTE
– "… if a process holding a lock stops or slows down while it is in its critical section, all other processes may have to wait." [pp. 350-351]
  • Try to avoid locks
  • Try to use LL-SC type operations instead of actual locks
5.5.5. Barriers
Hardware barrier
– Use a special bus line and wired-OR logic
Software barrier
– Use locks, shared counters, and flags
– E.g., refer to p. 354 of text
Centralized barrier

  BARRIER(bar_name, p) {
      LOCK(bar_name.lock);
      if (bar_name.counter == 0)
          bar_name.flag = 0;              /* first arrival resets flag */
      mycount = bar_name.counter++;       /* mycount is private */
      UNLOCK(bar_name.lock);
      if (mycount == p) {                 /* last to arrive */
          bar_name.counter = 0;
          bar_name.flag = 1;              /* release waiters */
      } else
          while (bar_name.flag == 0) {}   /* busy-wait */
  }

Problem with this code?
Centralized barrier has a potential problem with flag "re-initialization"

Centralized barrier with sense reversal:

  BARRIER(bar_name, p) {
      local_sense = !(local_sense);       /* private; toggles each barrier episode */
      LOCK(bar_name.lock);
      mycount = bar_name.counter++;
      if (mycount == p) {                 /* last to arrive */
          UNLOCK(bar_name.lock);
          bar_name.counter = 0;
          bar_name.flag = local_sense;    /* release waiters */
      } else {
          UNLOCK(bar_name.lock);
          while (bar_name.flag != local_sense) {}
      }
  }
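A compilable version of the sense-reversing barrier, assuming a pthread mutex stands in for the slide's LOCK/UNLOCK; note this sketch uses pre-increment so the arrival count actually reaches p (the slide follows the textbook's pseudocode convention):

```c
#include <pthread.h>

/* Sense-reversing centralized barrier sketch. */
typedef struct {
    pthread_mutex_t lock;
    volatile int    counter;
    volatile int    flag;
} barrier_t;

/* local_sense must be private to each thread (e.g. on its stack). */
static void barrier_wait(barrier_t *b, int p, int *local_sense) {
    *local_sense = !(*local_sense);          /* toggle sense each episode */
    pthread_mutex_lock(&b->lock);
    int mycount = ++b->counter;              /* pre-increment: last arrival sees p */
    if (mycount == p) {                      /* last to arrive */
        b->counter = 0;                      /* safe: nobody re-enters until the
                                                flag flips to the new sense */
        pthread_mutex_unlock(&b->lock);
        b->flag = *local_sense;              /* release the waiters */
    } else {
        pthread_mutex_unlock(&b->lock);
        while (b->flag != *local_sense)
            ;                                /* spin; no flag re-init race */
    }
}
```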
Improving Barrier Performance
Use a software combining tree
– With a bus, this has no significant benefit
Use a special bus primitive to reduce the number of bus transactions for read misses in a centralized barrier
– A processor monitors the bus and aborts its read miss if it sees the response to a read miss to the same location (by another processor)
5.6. Implications for Software
Use details of the H/W design to build better, more efficient S/W
– Keep the machine fixed and examine how to improve parallel programs

Programmer's "Bag of Tricks"
– Assign tasks to reduce spatial interleaving of access patterns
– Structure data to reduce spatial interleaving of access patterns
  • E.g., 4D arrays instead of 2D arrays for the equation solver kernel
– Beware of conflict misses
  • Figure 5.34
  • Sizing dimensions of allocated arrays to powers of 2 is bad
  • This is a problem with direct-mapped caches
– Use per-processor heaps
  • Heap = reservoir of memory space for a process
– Copy data to increase spatial locality
– Pad arrays
  • Refer to Figure 5.36
  • Try to avoid false sharing within a cache block
– Determine how to organize arrays of records
  • Which data will be used together?
  • Refer to Figure 5.36
– Align arrays to cache block boundaries
  • An array should begin at a cache block boundary