EECE 550 1
5.5: Synchronization Issue: How can synchronization operations
be implemented in bus-based cache-coherent multiprocessors?
Components of a synchronization event
– Acquire method
– Waiting algorithm
  • Busy waiting (processor cannot do other work)
  • Blocking (higher overhead, since process state must be saved)
– Release method
Implementing Mutual Exclusion (Lock-Unlock)
Hardware solution
– Use a set of dedicated "LOCK" bus lines
  • Expensive and nonscalable
(Diagram: processors P1, P2, P3, …, Pp sharing bus lines LOCK1–LOCK4)
Software solution
– Requires hardware support for an atomic test-and-set operation
– Example:

  lock:   ld  reg, location
          cmp reg, #0
          bnz lock
          st  location, #1
          ret

  unlock: st  location, #0
          ret

Does this work?
Simple software test-and-set lock

  lock:   t&s reg, location
          bnz reg, lock
          ret

  unlock: st  location, #0
          ret
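For concreteness, the test-and-set spinlock above could be sketched in C, with the GCC/Clang `__atomic_test_and_set` builtin standing in for the hardware t&s instruction (the builtin names and memory-order flags are toolchain assumptions, not part of the slide's ISA):

```c
/* Minimal test-and-set spinlock sketch. */
static volatile char lock_word = 0;   /* 0 = free, nonzero = held */

static void ts_lock(void) {
    /* t&s: atomically set the word and return its OLD value;
       keep retrying while the old value said "held". */
    while (__atomic_test_and_set(&lock_word, __ATOMIC_ACQUIRE))
        ;  /* every failed attempt still writes, invalidating other caches */
}

static void ts_unlock(void) {
    __atomic_clear(&lock_word, __ATOMIC_RELEASE);  /* st location, #0 */
}
```

Note that even a failed attempt performs a write, which is exactly the bus-traffic problem the later slides address.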
Other possible atomic instructions
– swap reg, location
– fetch&op (operation) location
  • fetch&inc location
  • fetch&add reg, location
– compare&swap reg1, reg2, location
  /* if (reg1 == M[location]) then M[location] ← reg2 */
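The semantics of fetch&add and compare&swap can be illustrated with compiler builtins (a sketch; the GCC/Clang builtin names are assumptions, as the slides describe ISA-level instructions):

```c
/* fetch&add: atomically add, returning the OLD value. */
static int fetch_add_demo(void) {
    int loc = 5;
    int old = __atomic_fetch_add(&loc, 3, __ATOMIC_SEQ_CST); /* old = 5, loc -> 8 */
    return old * 100 + loc;
}

/* compare&swap: if (expected == loc) then loc <- desired,
   matching the slide's compare&swap reg1, reg2, location. */
static int cas_demo(void) {
    int loc = 8, expected = 8;
    __atomic_compare_exchange_n(&loc, &expected, 42, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return loc;
}
```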
Performance of t&s Locks
Figure 5.29, based on the following code:

  lock(L);
  critical-section(c);   /* c = time spent in critical section */
  unlock(L);

Exponential backoff (like CSMA)
– If a lock attempt is unsuccessful, wait k · f^i time units before attempt i+1
– Constants k and f chosen based on experiments
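A t&s lock with exponential backoff might be sketched as follows (k = 16 and f = 2 are made-up constants; the slides say the real values are chosen experimentally):

```c
/* Test-and-set lock with exponential backoff: after the i-th failed
   attempt, spin privately for about k * f^i iterations before retrying. */
static volatile char eb_lock_word = 0;

static void eb_lock(void) {
    unsigned delay = 16;                                   /* k */
    while (__atomic_test_and_set(&eb_lock_word, __ATOMIC_ACQUIRE)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                              /* back off, off the bus */
        if (delay < 65536u)
            delay *= 2;                                    /* f = 2, capped */
    }
}

static void eb_unlock(void) {
    __atomic_clear(&eb_lock_word, __ATOMIC_RELEASE);
}
```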
Test-and-test-and-set Lock
Could be the basis for a better solution
Operation:

  lock: test reg, location
        bnz  reg, lock
        t&s  reg, location
        bnz  reg, lock
        ret
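The same idea in C (a sketch, again assuming GCC/Clang builtins for the atomic step): spin on an ordinary read, which is served from the local cache, and only issue the invalidating t&s when the lock looks free.

```c
/* Test-and-test-and-set spinlock sketch. */
static volatile char ttas_word = 0;

static void ttas_lock(void) {
    for (;;) {
        while (ttas_word != 0)
            ;                               /* "test": cache-local reads only */
        if (!__atomic_test_and_set(&ttas_word, __ATOMIC_ACQUIRE))
            return;                         /* "t&s" saw old value 0: acquired */
    }
}

static void ttas_unlock(void) {
    __atomic_clear(&ttas_word, __ATOMIC_RELEASE);
}
```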
Performance Goals for Locks
– Low latency
– Low traffic
– Scalability
– Low storage cost
– Fairness (starvation should be avoided)

Evaluation of locks:
– swap lock?
– t&s lock?
– test-and-t&s lock?
(LL, SC) Primitives
LL (load-locked)
– Loads the synchronization variable into a register
SC (store-conditional)
– Tries to store the register value into the synchronization variable's memory location iff no other processor has written to that location (or cache block) since the LL
  lock:   LL   reg1, location
          bnz  reg1, lock      /* if locked, try again */
          SC   location, reg2
          beqz reg2, lock      /* if SC failed (reg2 = 0), start again */
          ret

  unlock: st   location, #0
          ret
Comments on LL-SC
Only certain “undo-able” instructions are permitted between LL and SC
Many different types of “fetch&op” instructions can be implemented
SC does not generate invalidations upon a failure
Only one processor can perform LL or SC at any given time instant
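The claim that many fetch&op variants can be implemented follows because the LL/SC retry loop can wrap any computation. A hedged C sketch, using a compare&swap builtin in place of LL/SC and an invented fetch&max as the "op":

```c
/* Synthesize an arbitrary fetch&op from a compare&swap loop, mirroring
   how an LL/SC pair would be used. fetch&max is a made-up example op. */
static int fetch_and_max(int *loc, int v) {
    int old = *loc;                          /* plays the role of LL */
    int desired;
    do {
        desired = (v > old) ? v : old;       /* any "op" goes here */
        /* On failure, __atomic_compare_exchange_n refreshes old with the
           current value and we retry, just as a failed SC restarts the loop. */
    } while (!__atomic_compare_exchange_n(loc, &old, desired, 0,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
    return old;                              /* fetch&op returns the old value */
}
```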
Ticket Lock

  LOCK:   LL    reg1, ticket
          add   reg2, reg1, #1
          SC    ticket, reg2
          beqz  reg2, LOCK     /* retry if SC failed */
  LOCK1:  load  reg3, LED
          cmp   reg1, reg3
          bnz   LOCK1
          ret

  Unlock: load  reg1, LED
          inc   reg1
          store LED, reg1

(Shared variables: ticket counter and "now-serving" display LED)
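The ticket lock can be sketched in C with a fetch&increment builtin replacing the LL/SC loop on the ticket counter (builtin names are toolchain assumptions):

```c
/* Ticket lock sketch: fetch&increment hands out tickets; led is the
   global "now-serving" display everyone busy-waits on. */
static volatile unsigned ticket = 0;     /* next ticket to hand out */
static volatile unsigned led = 0;        /* now-serving number */

static unsigned ticket_lock(void) {
    unsigned my = __atomic_fetch_add(&ticket, 1, __ATOMIC_ACQUIRE);
    while (led != my)
        ;                                /* busy-wait on the global number */
    return my;                           /* returned for illustration only */
}

static void ticket_unlock(void) {
    led = led + 1;                       /* only the lock holder writes this */
}
```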
Array-based LOCK

  LOCK:   LL    reg1, ticket
          add   reg2, reg1, #1 (mod p)
          SC    ticket, reg2
          beqz  reg2, LOCK     /* retry if SC failed */
          store ptr, reg2
  LOCK1:  load  reg3, LED[reg1]
          cmp   reg3, #1
          bnz   LOCK1
          store LED[reg1], #0
          ret

  Unlock: load  reg1, ptr
          store LED[reg1], #1
          ret

(Shared data: ticket counter and array LED[0..p-1])
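A C sketch of the array-based lock for P processes (P = 4 is arbitrary here; in a real lock each LED[] slot would be placed in its own cache block, which this sketch does not do):

```c
/* Array-based queueing lock sketch: each waiter spins on its own slot. */
#define P 4
static volatile unsigned arr_ticket = 0;
static volatile char LED[P] = {1, 0, 0, 0};  /* slot 0 starts "unlocked" */
static unsigned my_slot[P];                  /* private per-process pointer */

static void array_lock(int pid) {
    unsigned slot = __atomic_fetch_add(&arr_ticket, 1, __ATOMIC_ACQUIRE) % P;
    my_slot[pid] = slot;
    while (LED[slot] == 0)
        ;                                    /* spin on own location only */
    LED[slot] = 0;                           /* re-arm slot for wraparound reuse */
}

static void array_unlock(int pid) {
    LED[(my_slot[pid] + 1) % P] = 1;         /* hand the lock to the next waiter */
}
```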
Comments on LL-SC
– LL-SC does not generate bus traffic if the LL fails (the spin stays in the cache)
– LL-SC does not generate invalidations if the SC fails
– But LL-SC does generate read-miss bus traffic even when the SC fails
– O(p) traffic per lock acquisition
– LL-SC is not a fair lock
Comments on Ticket Lock
– Operates like the ticket system at a bank
– Every process wanting to acquire the lock takes a ticket number and then busy-waits on a global "now-serving" number
– To release the lock, a process increments the "now-serving" number
– The ticket lock is fair, generates low bus traffic, and uses a constant, small amount of storage
– Main problem: when "now-serving" changes, all processors' cached copies are invalidated, and they all incur a read miss
Comments on Array-Based Lock
– Uses fetch&increment to obtain a unique location on which to busy-wait (not a value)
– Lock data structure contains an array of p locations (each in a separate cache block)
– Acquire: use fetch&increment to obtain the next available location in the lock array (with wraparound)
– Release: write "unlocked" to the next array location
– It is fair, uses O(p) space, and is more scalable than the ticket lock since only one processor incurs a read miss per release
Comparison
Comparative performance: Fig. 5.30
– LL-SC with exponential backoff is best

NOTE
– "… if a process holding a lock stops or slows down while it is in its critical section, all other processes may have to wait." [pp. 350-351]
  • Try to avoid locks
  • Try to use LL-SC type operations instead of actual locks
5.5.5. Barriers
Hardware barrier
– Use a special bus line and wired-OR logic
Software barrier
– Use locks, shared counters, and flags
– E.g., refer to p. 354 of text
Centralized barrier

  BARRIER(bar_name, p) {
      LOCK(bar_name.lock);
      if (bar_name.counter == 0)
          bar_name.flag = 0;              /* first arrival resets flag */
      mycount = bar_name.counter++;       /* mycount is private */
      UNLOCK(bar_name.lock);
      if (mycount == p) {                 /* last to arrive */
          bar_name.counter = 0;
          bar_name.flag = 1;              /* release waiters */
      } else
          while (bar_name.flag == 0) {}   /* busy-wait */
  }

Problem with this code?
Centralized barrier has a potential problem with flag "re-initialization"

Centralized barrier with sense reversal:

  BARRIER(bar_name, p) {
      local_sense = !(local_sense);       /* private; toggles each barrier episode */
      LOCK(bar_name.lock);
      mycount = bar_name.counter++;
      if (mycount == p) {                 /* last to arrive */
          UNLOCK(bar_name.lock);
          bar_name.counter = 0;
          bar_name.flag = local_sense;    /* release waiters */
      } else {
          UNLOCK(bar_name.lock);
          while (bar_name.flag != local_sense) {}
      }
  }
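A compilable version of the sense-reversing barrier, assuming a pthread mutex stands in for the slide's LOCK/UNLOCK; note this sketch uses pre-increment so the arrival count actually reaches p (the slide follows the textbook's pseudocode convention):

```c
#include <pthread.h>

/* Sense-reversing centralized barrier sketch. */
typedef struct {
    pthread_mutex_t lock;
    volatile int    counter;
    volatile int    flag;
} barrier_t;

/* local_sense must be private to each thread (e.g. on its stack). */
static void barrier_wait(barrier_t *b, int p, int *local_sense) {
    *local_sense = !(*local_sense);          /* toggle sense each episode */
    pthread_mutex_lock(&b->lock);
    int mycount = ++b->counter;              /* pre-increment: last arrival sees p */
    if (mycount == p) {                      /* last to arrive */
        b->counter = 0;                      /* safe: nobody re-enters until the
                                                flag flips to the new sense */
        pthread_mutex_unlock(&b->lock);
        b->flag = *local_sense;              /* release the waiters */
    } else {
        pthread_mutex_unlock(&b->lock);
        while (b->flag != *local_sense)
            ;                                /* spin; no flag re-init race */
    }
}
```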
Improving Barrier Performance
Use a software combining tree
– With a bus, this has no significant benefit
Use a special bus primitive to reduce the number of bus transactions for read misses in a centralized barrier
– A processor monitors the bus and aborts its read miss if it sees the response to a read miss to the same location (by another processor)
5.6. Implications for Software
Use details of the H/W design to build better, more efficient S/W
– Keep the machine fixed and examine how to improve parallel programs

Programmer's "Bag of Tricks"
– Assign tasks to reduce spatial interleaving of access patterns
– Structure data to reduce spatial interleaving of access patterns
  • E.g., 4D arrays instead of 2D arrays for the equation solver kernel
– Beware of conflict misses
  • Figure 5.34
  • Sizing dimensions of allocated arrays to powers of 2 is bad
  • This is a problem with direct-mapped caches
– Use per-processor heaps
  • Heap = reservoir of memory space for a process
– Copy data to increase spatial locality
– Pad arrays
  • Refer to Figure 5.36
  • Try to avoid false sharing within a cache block
– Determine how to organize arrays of records
  • Which data will be used together?
  • Refer to Figure 5.36
– Align arrays to cache block boundaries
  • An array should begin at a cache block boundary