Page 1

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
THOMAS E. ANDERSON
Presented by Daesung Park

Page 2

Introduction
- In shared-memory multiprocessors, each processor can directly access memory
- For consistency of shared data structures, we need a method to serialize the operations done on them
- Shared-memory multiprocessors provide some form of hardware support for mutual exclusion: atomic instructions

Page 3

Why is a lock needed?
- If the operations on the shared data are simple enough:
  - Encapsulate them in a single atomic instruction (see the C sketch below)
  - Mutual exclusion is guaranteed directly
  - Each processor attempting to access the shared data waits its turn without returning control to software
- If the operations are not simple:
  - A lock is needed
  - If the lock is busy, waiting is done in software
  - Two choices: block or spin
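For reference, a minimal C11 sketch of the simple case (illustrative, not from the paper): the whole operation is one atomic read-modify-write, so no software lock is required. The rest of the talk is about what to do when the operation does not fit in one instruction.

#include <stdatomic.h>

atomic_int counter;                      /* shared data that is just one word */

/* The whole update is a single atomic read-modify-write, so mutual
   exclusion is enforced directly by the hardware and no software
   waiting (blocking or spinning) is needed. */
void simple_update(void) {
    atomic_fetch_add(&counter, 1);
}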

Page 4

The topics of this paper
- Are there efficient algorithms for spinning in software on a busy lock?
  - Five software alternatives are presented
- Is more complex hardware support needed for good performance?
  - Hardware solutions for multistage interconnection network multiprocessors and single-bus multiprocessors are presented

Page 5

Multiprocessor Architectures
- How processors are connected to memory: multistage interconnection network or bus
- Whether or not each processor has a coherent private cache: yes or no
- Which cache coherence protocol is used: invalidation-based or distributed-write

Page 6

Performance goals
- Minimize the communication bandwidth consumed by spinning
- Minimize the delay between when a lock is released and when it is reacquired
- Minimize lock latency when there is no contention, by keeping the algorithm simple

Page 7

The problem of spinning: Spin on Test-and-Set
- The performance of spinning on test-and-set degrades as the number of spinning processors increases
- The lock holder must contend with the spinning processors for access to the lock location, and for the other locations it needs for its normal work

Page 8

The problem of spinning: Spin on TAS (example)

[Figure: processors P1-P4 and MEMORY on a write-through, invalidation-based bus]

Init     lock := CLEAR;
Lock     while (TestAndSet(lock) = BUSY) ;
Unlock   lock := CLEAR;
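A hedged C11 rendering of this test-and-set lock (illustrative, not the paper's code):

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock;                        /* false = CLEAR, true = BUSY */

void acquire(void) {
    /* Every iteration issues a test-and-set, i.e. a write, even while the
       lock is busy -- this is the source of the contention described above. */
    while (atomic_exchange(&lock, true))
        ;
}

void release(void) {
    atomic_store(&lock, false);          /* lock := CLEAR */
}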

Page 9

The problem of spinning: Spin on Read (Test-and-Test-and-Set)
- Spin by reading the cached copy of the lock to reduce the cost of spinning
- When the lock is released, each cached copy is updated or invalidated
- A waiting processor sees the change and then performs a test-and-set
- When the critical section is small, this is as poor as spinning on test-and-set
- This is most pronounced for systems with invalidation-based cache coherence, but also occurs with distributed-write

Page 10

The problem of spinning: Spin on Read (example)

[Figure: processors P1-P4 and MEMORY on a write-through, invalidation-based bus, showing the cached copies of the lock]

Lock     while (lock = BUSY or TestAndSet(lock) = BUSY) ;
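A hedged C11 sketch of the spin-on-read acquire loop (illustrative, not the paper's code): the inner loop spins on a cached read, and a test-and-set is attempted only when the lock looks free.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock;                        /* false = CLEAR, true = BUSY */

void acquire(void) {
    for (;;) {
        while (atomic_load(&lock))       /* spin on a cached read: no bus traffic while busy */
            ;
        if (!atomic_exchange(&lock, true))
            return;                      /* the previous value was CLEAR, so we now hold the lock */
    }                                    /* otherwise someone beat us to it: go back to reading */
}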

Page 11

Reasons for the poor performance of spin on read
- There is a separation between detecting that the lock has been released and attempting to acquire it with a test-and-set instruction, so more than one test-and-set can occur
- The cached copies are invalidated by a test-and-set even if it does not change the value
- Invalidation-based cache coherence requires O(P) bus or network cycles to broadcast an invalidation

Page 12

The problem of spinning: Measurement Results (1)

Page 13

The problem of spinning: Measurement Results (2)

Page 14

Software solutions: Delay Alternatives
- Insert a delay into the spinning loop
- Where to insert the delay:
  - After the lock has been released (after a spinner notices the release)
  - After every separate access to the lock
- The length of the delay: static or dynamic
- Lock latency is not affected, because a processor first tries to acquire the lock and only then delays

Page 15

Delay Alternatives: delay after a spinning processor notices the lock has been released
- Reduces the number of test-and-sets when spinning on read
- Each processor can be statically assigned a separate slot, i.e. an amount of time to delay (sketched in C below)
- The spinning processor with the smallest delay gets the lock; the others can resume spinning without issuing a test-and-set
- When there are few spinning processors, using fewer slots is better
- When there are many spinning processors, using fewer slots results in many colliding test-and-set attempts
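A hedged C sketch of the static-slot variant (illustrative only; my_slot, SLOT_DELAY, and pause_for are assumed helpers and constants, not from the paper):

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock;                        /* false = CLEAR, true = BUSY */
int my_slot;                             /* statically assigned slot for this processor (assumed) */
#define SLOT_DELAY 100                   /* delay units per slot (tunable assumption) */

static void pause_for(int units) {
    for (volatile int i = 0; i < units; i++)
        ;                                /* crude busy-wait delay */
}

void acquire_slotted(void) {
    while (atomic_exchange(&lock, true)) {       /* first attempt has no delay, so latency stays low */
        do {
            while (atomic_load(&lock))           /* spin on a cached read while the lock is busy */
                ;
            pause_for(my_slot * SLOT_DELAY);     /* delay after noticing the release */
        } while (atomic_load(&lock));            /* a smaller slot won: resume reading, no test-and-set */
    }
}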

Page 16

Delay Alternatives: vary the spinning behavior based on the number of waiting processors
- The number of collisions is used as an estimate of the number of waiting processors
- Initially assume that there are no other waiting processors
- Each failed test-and-set counts as a collision
- Double the delay after each collision, up to some limit (see the sketch below)
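A hedged C sketch of this backoff scheme (illustrative; MAX_DELAY and pause_for are assumptions, not from the paper):

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock;                        /* false = CLEAR, true = BUSY */
#define MAX_DELAY 4096                   /* backoff cap (tunable assumption) */

static void pause_for(int units) {
    for (volatile int i = 0; i < units; i++)
        ;                                /* crude busy-wait delay */
}

void acquire_backoff(void) {
    int delay = 1;                       /* initially assume no other waiters */
    while (atomic_exchange(&lock, true)) {
        pause_for(delay);                /* a failed test-and-set counts as a collision */
        if (delay < MAX_DELAY)
            delay *= 2;                  /* double the delay, up to the limit */
    }
}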

Page 17

Delay Alternatives: delay between each memory reference
- Can be used on architectures without caches or with invalidation-based caches
- Reduces the bandwidth consumed by spinning processors
- The delay can be set statically or dynamically
- More frequent polling improves performance when there are few spinning processors (sketch below)
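A hedged C sketch of delaying between each reference to the lock (illustrative; poll_delay and pause_for are assumptions, not from the paper):

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock;                        /* false = CLEAR, true = BUSY */
int poll_delay = 50;                     /* static setting; could also be adjusted dynamically (assumption) */

static void pause_for(int units) {
    for (volatile int i = 0; i < units; i++)
        ;                                /* crude busy-wait delay */
}

void acquire_delayed_poll(void) {
    /* Delay between every reference to the lock, so each spinner
       consumes less bus or network bandwidth while the lock is busy. */
    while (atomic_exchange(&lock, true))
        pause_for(poll_delay);
}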

Page 18

Software solutions: Queuing in shared memory
- Each processor inserts itself into a queue and then spins on its own flag in a separate memory location
- When a processor finishes the critical section, it sets the flag of the next processor in the queue
- Only one cache read miss occurs per lock handoff
- Maintaining the queue is expensive, which hurts most for small critical sections

Page 19

Queuing

Init     flags[0] := HAS_LOCK;
         flags[1..P-1] := MUST_WAIT;
         queueLast := 0;

Lock     myPlace := ReadAndIncrement(queueLast);
         while (flags[myPlace mod P] = MUST_WAIT)
             ;

         CRITICAL SECTION;

Unlock   flags[myPlace mod P] := MUST_WAIT;
         flags[(myPlace + 1) mod P] := HAS_LOCK;
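A hedged C11 rendering of the same array-based queue lock (illustrative; P is an assumed compile-time constant, and in a real implementation each flag would be padded to its own cache block or memory module):

#include <stdatomic.h>

#define P 16                                        /* number of processors (assumed) */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

atomic_int  flags[P] = { HAS_LOCK };                /* flags[0] = HAS_LOCK, the rest 0 = MUST_WAIT */
atomic_uint queueLast;                              /* next queue position to hand out */

unsigned lock_acquire(void) {
    unsigned my_place = atomic_fetch_add(&queueLast, 1);   /* ReadAndIncrement */
    while (atomic_load(&flags[my_place % P]) == MUST_WAIT)
        ;                                           /* spin only on my own flag */
    return my_place;                                /* ticket, needed again at release */
}

void lock_release(unsigned my_place) {
    atomic_store(&flags[my_place % P], MUST_WAIT);          /* reset my slot for reuse */
    atomic_store(&flags[(my_place + 1) % P], HAS_LOCK);     /* hand the lock to the next waiter */
}

The caller passes the ticket returned by lock_acquire back to lock_release; how the flags should be laid out on different architectures is discussed on the next slide.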

Page 20

Queuing: implementations across architectures
- Distributed-write cache coherence
  - All processors can share a counter
  - To release the lock, a processor writes the next sequence number into the shared counter
  - Each cache is updated, directly notifying the next processor that it now holds the lock
- Invalidation-based cache coherence
  - Each processor waits on a flag in a separate cache block
  - On release, one cache is invalidated and one read miss occurs
- Multistage network without coherent caches
  - Each processor waits on a flag in a separate memory module
  - Processors have to poll to learn when it is their turn

Page 21

Queuing: implementations (continued)
- Bus without coherent caches
  - Processors must poll to find out when it is their turn, which can swamp the bus
  - A delay can be inserted between polls, based on a processor's position in the queue and the execution time of the critical section
- Without an atomic read-and-increment instruction
  - A lock is needed just to update the queue; one of the delay alternatives above may help with that contention
- Problem: increased lock latency
  - Acquiring and releasing require incrementing the counter, clearing one flag location, and setting another
  - If there is no contention, this extra latency is a pure loss of performance

Page 22

Measurement Results of Software Alternatives (1)

Page 23

Measurement Results of Software Alternatives (2)

Page 24

Measurement Results of Software Alternatives (3)

Page 25

Hardware solutions: Multistage interconnection network multiprocessors
- Combining networks
  - For spin on test-and-set: only one of the test-and-set requests is forwarded to memory, and all the others are returned with the value already set
  - Lock latency may increase
- Hardware queuing at the memory module
  - Eliminates polling across the network when there are no coherent caches
  - The processor issues 'enter' and 'exit' instructions to the memory module
  - Lock latency is likely to be better than with software queuing
- Caches to hold the queue links
  - Stores the name of the next processor in the queue directly in each processor's cache

Page 26

Hardware solutions: Single bus multiprocessors
- Read broadcast
  - Eliminates duplicate read-miss requests
  - If a read appears on the bus for a block that is invalid in a processor's cache, that cache takes the data and makes its copy valid
  - Thus processors' invalid cache copies can be revalidated by another processor's read
- Special handling of test-and-set requests in the cache
  - A processor can spin on test-and-set, acquiring the lock quickly when it is free without consuming bus bandwidth while it is busy
  - A test-and-set that would fail (the lock is already busy) is not committed to the bus

Page 27

Conclusion
- Simple methods of spin-waiting degrade performance as the number of spinning processors increases
- Software queuing and backoff perform well even for large numbers of spinning processors
- Backoff performs better when there is no contention; queuing performs best when there is contention
- Special hardware support can improve performance further

