The Performance of Spin Lock Alternatives for
Shared-Memory Multiprocessors
THOMAS E. ANDERSON
Presented by Daesung Park
Introduction
- In shared-memory multiprocessors, each processor can directly access memory
- For consistency of a shared data structure, we need a method to serialize the operations done on it
- Shared-memory multiprocessors provide some form of hardware support for mutual exclusion: atomic instructions
Why is a lock needed?
- If the operations on the critical section are simple enough:
  - Encapsulate these operations in a single atomic instruction
  - Mutual exclusion is guaranteed directly
  - Each processor attempting to access the shared data waits its turn in hardware, without returning control to software
- If the operations are not simple, a LOCK is needed
  - If the lock is busy, waiting is done in software
  - Two choices: block or spin
The topics of this paper
- Are there efficient algorithms for software spinning on a busy lock?
  - Five software solutions are presented
- Is more complex hardware support needed for performance?
  - Hardware solutions for multistage interconnection network multiprocessors and single-bus multiprocessors are presented
Multiprocessor Architectures
- How processors are connected to memory: multistage interconnection network or bus
- Whether or not each processor has a coherent private cache: yes or no
- What the coherence protocol is: invalidation-based or distributed-write
For performance:
- Minimize communication bandwidth
- Minimize the delay between a lock being released and reacquired
- Minimize lock latency when there is no contention, by using a simple algorithm
The problem of spinning – Spin on Test-and-Set
- The performance of spinning on test-and-set degrades as the number of spinning processors increases
- The lock holder must contend with the spinning processors for access to the lock location, and for the other locations it needs for normal operation
The problem of spinning – Spin on TAS
[Figure: P1–P4 on a shared bus to memory; write-through, invalidation-based caches]
  Init    lock := CLEAR;
  Lock    while (TestAndSet(lock) = BUSY)
              ;
  Unlock  lock := CLEAR;
The problem of spinning – Spin on Read (Test-and-Test-and-Set)
- Uses the cache to reduce the cost of spinning
- When the lock is released, each cached copy is updated or invalidated; the waiting processor sees the change and performs a test-and-set
- When the critical section is small, this performs as poorly as spin on test-and-set
- The effect is most pronounced on systems with invalidation-based cache coherence, but also occurs with distributed-write
The problem of spinning – Spin on read
[Figure: P1–P4 on a shared bus to memory; write-through, invalidation-based caches; per-processor cache states shown]
  Lock    while (lock = BUSY or TestAndSet(lock) = BUSY)
              ;
Reasons for the poor performance of spin on read
- There is a separation between detecting that the lock has been released and attempting to acquire it with a test-and-set instruction, so more than one test-and-set can occur
- A test-and-set invalidates the cached copies even if the value is not changed
- Invalidation-based cache coherence requires O(P) bus or network cycles to broadcast an invalidation
The problem of spinning – Measurement Result 1 [figure]
The problem of spinning – Measurement Result 2 [figure]
Software Solutions – Delay Alternatives
- Insert a delay into the spinning loop
- Where to insert the delay:
  - After the lock has been released
  - After every separate access to the lock
- The length of the delay: static or dynamic
- Lock latency is not affected, because processors attempt to acquire the lock before delaying
Delay Alternatives – Delay after a Spinning Processor Notices the Lock Has Been Released
- Reduces the number of test-and-sets when spinning on read
- Each processor can be statically assigned a separate slot, i.e., an amount of time to delay
- The spinning processor with the smallest delay gets the lock; the others may resume spinning without a test-and-set
- When there are few spinning processors, using fewer slots is better
- When there are many spinning processors, using fewer slots results in many test-and-set attempts
Delay Alternatives – Vary Spinning Behavior Based on the Number of Waiting Processors
- The number of collisions estimates the number of waiting processors
- Initially assume that there are no other waiting processors
- Try to test-and-set; a failure counts as a collision, so double the delay, up to some limit
Delay Alternatives – Delay Between Each Memory Reference
- Can be used on architectures without caches or with invalidation-based caches
- Reduces the bandwidth consumed by spinning processors
- The mean delay can be set statically or dynamically
- More frequent polling improves performance when there are few spinning processors
Software Solutions – Queuing in Shared Memory
- Each processor inserts itself into a queue, then spins on a separate memory location (its flag)
- When a processor finishes with the critical section, it sets the flag of the next processor in the queue
- Only one cache read miss occurs per lock hand-off
- Maintaining the queue is expensive – much worse for small critical sections
Queuing
  Init    flags[0] := HAS_LOCK;
          flags[1..P-1] := MUST_WAIT;
          queueLast := 0;
  Lock    myPlace := ReadAndIncrement(queueLast);
          while (flags[myPlace mod P] = MUST_WAIT)
              ;
          CRITICAL SECTION;
  Unlock  flags[myPlace mod P] := MUST_WAIT;
          flags[(myPlace + 1) mod P] := HAS_LOCK;
Queuing – Implementations Across Architectures
- Distributed-write cache coherence:
  - All processors share a counter
  - To release the lock, a processor writes the next sequence number into the shared counter
  - Each cache is updated, directly notifying the next processor to get the lock
- Invalidation-based cache coherence:
  - Each processor should wait on a flag in a separate cache block
  - One cache is invalidated and one read miss occurs per hand-off
- Multistage network without coherence:
  - Each processor should wait on a flag in a separate location
  - Processors have to poll to learn when it is their turn
Queuing – Implementations (continued)
- Bus without coherence:
  - Processors must poll to find out when it is their turn; this can swamp the bus
  - A delay can be inserted between polls, based on the processor's position in the queue and the execution time of the critical section
- Without an atomic read-and-increment instruction:
  - A lock is needed to protect the counter; one of the delay alternatives above may help with contention
  - Problem: increased lock latency
- Releasing the lock takes two writes – clearing the releaser's own flag and setting the next processor's flag; when there is no contention, this extra latency is a loss of performance
Measurement Results of Software Alternatives 1 [figure]
Measurement Results of Software Alternatives 2 [figure]
Measurement Results of Software Alternatives 3 [figure]
Hardware Solutions – Multistage Interconnection Network Multiprocessors
- Combining networks:
  - For spin on test-and-set, only one of the test-and-set requests is forwarded to memory; all other requests are returned with the value already set
  - Lock latency may increase
- Hardware queuing at the memory module:
  - Eliminates polling across the network
  - Issues 'enter' and 'exit' instructions to the memory module
  - Lock latency is likely to be better than with software queuing
- Caches to hold queue links:
  - Stores the name of the next processor in the queue directly in each processor's cache
Hardware Solutions – Single Bus Multiprocessors
- Read broadcast:
  - Eliminates duplicate read-miss requests
  - If a read occurs on the bus for data that is invalid in another processor's cache, that cache takes the data and marks its copy valid
  - Thus invalid copies can be revalidated by another processor's read
- Special handling of test-and-set requests in the cache:
  - A processor can spin on test-and-set, acquiring the lock quickly when it is free, without consuming bus bandwidth while it is busy
  - If the test-and-set would fail, it is not committed
Conclusion
- Simple methods of spin-waiting degrade performance as the number of spinning processors increases
- Software queuing and backoff have good performance even for large numbers of spinning processors
- Backoff has better performance when there is no contention; queuing performs best when there is contention
- Special hardware support can improve performance further