Spin Locks
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
What Should You Do If You Can’t Get a Lock?
• Keep trying
  – “spin” or “busy-wait”
  – Good if delays are short
• Give up the processor
  – Good if delays are long
  – Always good on a uniprocessor
Our focus: the first option, spinning.
Basic Spin-Lock
[Figure: threads spin on the lock, enter the critical section (CS) one at a time, and reset the lock upon exit.]
…the lock introduces a sequential bottleneck, hence no parallelism.
Review: Test-and-Set
• Boolean value
• Test-and-set (TAS)
  – Swap true with current value
  – Return value tells whether prior value was true or false
• Can reset just by writing false
• TAS aka “getAndSet”
Review: Test-and-Set

public class AtomicBoolean {
  boolean value;
  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;   // swap old and new values
    value = newValue;
    return prior;
  }
}

From package java.util.concurrent.atomic.
Review: Test-and-Set

AtomicBoolean lock = new AtomicBoolean(false);
…
boolean prior = lock.getAndSet(true);

Swapping in true is called “test-and-set”, or TAS.
Test-and-Set Locks
• Locking
  – Lock is free: value is false
  – Lock is taken: value is true
• Acquire lock by calling TAS
  – If result is false, you win
  – If result is true, you lose
• Release lock by writing false
Test-and-set Lock

class TASlock {
  // lock state is an AtomicBoolean
  AtomicBoolean state = new AtomicBoolean(false);

  void lock() {
    // keep trying until lock acquired
    while (state.getAndSet(true)) {}
  }

  void unlock() {
    // release lock by resetting state to false
    state.set(false);
  }
}
Performance
• Experiment
  – n threads
  – Increment shared counter 1 million times
• How long should it take?
• How long does it take?
Graph
[Figure: time vs. threads; the ideal curve is flat, since the sequential bottleneck allows no speedup.]
Mystery #1
[Figure: time vs. threads; the TAS lock’s time grows steeply with the number of threads, far above the flat ideal curve.]
What is going on?
Test-and-Test-and-Set Locks
• Lurking stage
  – Wait until lock “looks” free
  – Spin while read returns true (lock taken)
• Pouncing stage
  – As soon as lock “looks” available, i.e. read returns false (lock free)
  – Call TAS to acquire lock
  – If TAS loses, back to lurking
Test-and-test-and-set Lock

class TTASlock {
  AtomicBoolean state = new AtomicBoolean(false);

  void lock() {
    while (true) {
      while (state.get()) {}        // wait until lock looks free
      if (!state.getAndSet(true))   // then try to acquire it
        return;
    }
  }
}
Mystery #2
[Figure: time vs. threads for the TAS lock, the TTAS lock, and the ideal; the TTAS lock does better than the TAS lock, but both grow with the thread count while the ideal stays flat.]
Opinion
• Our memory abstraction is broken
• TAS & TTAS methods
  – Are provably the same (in our model)
  – Except they aren’t (in field tests)
• Need a more detailed model …
Bus-Based Architectures
[Figure: processors, each with its own cache, connected by a shared bus to memory.]
• Random access memory (10s of cycles)
• Shared bus
  – Broadcast medium
  – One broadcaster at a time
  – Processors and memory all “snoop”
• Per-processor caches
  – Small
  – Fast: 1 or 2 cycles
  – Hold address & state information
Jargon Watch
• Cache hit
  – “I found what I wanted in my cache”
  – Good Thing™
• Cache miss
  – “I had to shlep all the way to memory for that data”
  – Bad Thing™
Cave Canem
• This model is still a simplification
  – But not in any essential way
  – It illustrates basic principles
• We will discuss complexities later
Processor Issues Load Request
[Figure: a processor broadcasts “Gimme data” on the bus.]

Memory Responds
[Figure: memory answers “Got your data right here”; the data is installed in the requester’s cache.]

Processor Issues Load Request
[Figure: a second processor broadcasts a load request for the same data; the processor already caching it announces “I got data”.]

Other Processor Responds
[Figure: the caching processor supplies the data over the bus, so both caches now hold a copy.]

Modify Cached Data
[Figure: one processor overwrites its cached copy with new data; the copies in memory and in the other cache are now stale. What’s up with the other copies?]
Cache Coherence
• We have lots of copies of data
  – Original copy in memory
  – Cached copies at processors
• Some processor modifies its own copy
  – What do we do with the others?
  – How do we avoid confusion?
Write-Back Caches
• Accumulate changes in cache
• Write back when needed
  – Need the cache for something else
  – Another processor wants it
• On first modification
  – Invalidate other entries
  – Requires a non-trivial protocol …
Write-Back Caches
• Cache entry has three states
  – Invalid: contains raw seething bits
  – Valid: I can read but I can’t write
  – Dirty: data has been modified
• Intercept other load requests
• Write back to memory before using cache
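To make the three states concrete, here is a toy C sketch (ours, not from the slides; all names are made up) of a cache entry and the transitions just listed:

#include <stdio.h>

typedef enum { INVALID, VALID, DIRTY } cstate_t;

typedef struct {
  cstate_t state;
  int      data;   /* cached value (garbage when INVALID) */
} cache_line_t;

/* Another processor announces a write: we lose read permission. */
void on_invalidate(cache_line_t *line) {
  line->state = INVALID;
}

/* Local write: first modification invalidates other copies
 * (not modeled here) and marks our copy dirty. */
void on_write(cache_line_t *line, int value) {
  line->data  = value;
  line->state = DIRTY;
}

/* Another processor asks for the line: a dirty owner supplies
 * the data and writes it back before the line is reused. */
int on_remote_load(cache_line_t *line, int *memory) {
  if (line->state == DIRTY) {
    *memory = line->data;   /* write back */
    line->state = VALID;    /* reading OK, no writing */
  }
  return *memory;
}

int main(void) {
  int memory = 1;
  cache_line_t line = { VALID, 1 };
  on_write(&line, 42);                             /* our copy becomes DIRTY */
  printf("%d\n", on_remote_load(&line, &memory));  /* prints 42 */
  return 0;
}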
Invalidate
[Figure: the writing processor announces “Mine, all mine!” on the bus. The other caches lose read permission (their copies become invalid), and this cache acquires write permission. Memory provides data only if it is not present in any cache, so there is no need to update memory now (that would be expensive).]

Another Processor Asks for Data
[Figure: another processor broadcasts a load request for the dirty data.]

Owner Responds
[Figure: the owner announces “Here it is!” and supplies the data over the bus.]

End of the Day …
[Figure: both caches now hold the data with read permission only: reading OK, no writing.]
Simple TASLock
• TAS invalidates cache lines
• Spinners
  – Miss in cache
  – Go to bus
• Thread that wants to release the lock is delayed behind spinners
Test-and-test-and-set
• Wait until lock “looks” free
  – Spin on local cache
  – No bus use while lock busy
• Problem: when lock is released
  – Invalidation storm …
Local Spinning While Lock Is Busy
[Figure: every cache holds “busy”; all threads spin locally without using the bus.]
On Release
[Figure: the releaser writes “free”, invalidating the other cached copies. Everyone misses and rereads over the bus, then everyone tries TAS(…).]
Problems
• Everyone misses
  – Reads satisfied sequentially
• Everyone does TAS
  – Invalidates others’ caches
Mystery Explained
[Figure: time vs. threads; the TTAS lock is better than the TAS lock, but still not as good as ideal.]
Solution: Introduce Delay
[Figure: timeline of a spin lock backing off with delays d, r1·d, r2·d between attempts.]
• If the lock looks free, but I fail to get it
• There must be lots of contention
• Better to back off than to collide again
Exponential Backoff Lock

public class Backoff implements Lock {
  AtomicBoolean state = new AtomicBoolean(false);

  public void lock() {
    int delay = MIN_DELAY;           // fix minimum delay
    while (true) {
      while (state.get()) {}         // wait until lock looks free
      if (!state.getAndSet(true))    // if we win, return
        return;
      sleep(random() % delay);       // back off for random duration
      if (delay < MAX_DELAY)         // double max delay, within reason
        delay = 2 * delay;
    }
  }
}
Spin-Waiting Overhead
[Figure: time vs. threads; the backoff lock outperforms the TTAS lock.]
BUILT-IN FUNCTIONS FOR ATOMIC MEMORY ACCESS
Built-in functions for atomic memory access

type __sync_fetch_and_add  (type *ptr, type value, ...)
type __sync_fetch_and_sub  (type *ptr, type value, ...)
type __sync_fetch_and_or   (type *ptr, type value, ...)
type __sync_fetch_and_and  (type *ptr, type value, ...)
type __sync_fetch_and_xor  (type *ptr, type value, ...)
type __sync_fetch_and_nand (type *ptr, type value, ...)

Semantics:
{ tmp = *ptr; *ptr op= value; return tmp; }
{ tmp = *ptr; *ptr = ~tmp & value; return tmp; }   // nand
Built-in functions for atomic memory access

type __sync_add_and_fetch  (type *ptr, type value, ...)
type __sync_sub_and_fetch  (type *ptr, type value, ...)
type __sync_or_and_fetch   (type *ptr, type value, ...)
type __sync_and_and_fetch  (type *ptr, type value, ...)
type __sync_xor_and_fetch  (type *ptr, type value, ...)
type __sync_nand_and_fetch (type *ptr, type value, ...)

Semantics:
{ *ptr op= value; return *ptr; }
{ *ptr = ~*ptr & value; return *ptr; }   // nand
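For concreteness, a small C example (ours, not from the slides) contrasting the two families: the fetch_and_op forms return the old value, while the op_and_fetch forms return the new one.

#include <stdio.h>

int main(void) {
  int counter = 10;
  int old = __sync_fetch_and_add(&counter, 5);   /* old = 10, counter = 15 */
  int new = __sync_add_and_fetch(&counter, 5);   /* new = 20, counter = 20 */
  printf("old = %d, new = %d, counter = %d\n", old, new, counter);
  return 0;
}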
Built-in functions for atomic memory access

bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval, ...)
type __sync_val_compare_and_swap  (type *ptr, type oldval, type newval, ...)

type __sync_lock_test_and_set (type *ptr, type value, ...)
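Tying back to the spin locks above, here is a minimal sketch (ours, not from the slides) of a TAS-style spin lock built from __sync_lock_test_and_set; its companion __sync_lock_release writes 0 to release.

static volatile int lock_state = 0;   /* 0 = free, 1 = taken */

void spin_lock(void) {
  /* atomically write 1 and get the prior value; retry while it was taken */
  while (__sync_lock_test_and_set(&lock_state, 1)) {
    while (lock_state) {}   /* TTAS-style: spin locally until it looks free */
  }
}

void spin_unlock(void) {
  __sync_lock_release(&lock_state);   /* write 0 with release semantics */
}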
OPENMP
OpenMP
• (Shared-memory, thread-based parallelism) OpenMP is a shared-memory application programming interface (API) whose features are based on prior efforts to facilitate shared-memory parallel programming.
• (Compiler-directive based) OpenMP provides directives, library functions, and environment variables to create and control the execution of parallel programs.
• (Explicit parallelism) OpenMP’s directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code.
OpenMP Example

Serial version:

int main() {
  for (i = 0; i < 10; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  for (i = 1; i <= 10; i++) {
    sum += a[i] * b[i];
  }
  printf("sum = %f", sum);
}

Parallel version:

int main() {
  for (i = 0; i < 10; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  #pragma omp parallel private(t) shared(sum, a, b)
  {
    t = 0;
    #pragma omp for
    for (i = 1; i <= 10; i++) {
      t += a[i] * b[i];
    }
    #pragma omp critical (update_sum)
    sum += t;
  }
  printf("sum = %f \n", sum);
}
PARALLEL Region Construct
• A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct.
• Starting from the beginning of the parallel region, the code is duplicated and all threads execute it.
• There is an implied barrier at the end of a parallel region. Only the master thread continues execution past this point.
• How many threads?
  – Setting of the NUM_THREADS clause
  – Use of the omp_set_num_threads() library function
  – Setting of the OMP_NUM_THREADS environment variable
  – Implementation default, usually the number of CPUs on a node
• A parallel region must be a structured block that does not span multiple routines or code files
• It is illegal to branch into or out of a parallel region
PARALLEL Region Construct

#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    num_threads (integer-expression)

  structured_block
PARALLEL Region Construct

#include <omp.h>
void main() {
  int nthreads, tid;
  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}
Work-Sharing Constructs: for Directive
• The for directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. This assumes a parallel region has already been initiated; otherwise the loop executes serially on a single processor.
Work-Sharing Constructs: for Directive

#pragma omp for [clause ...] newline
    schedule (type [,chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    nowait

  for_loop
Work-Sharing Constructs: for Directive

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
void main() {
  int i, chunk;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
}
Work-Sharing Constructs: SECTIONS Directive
• The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team.
Work-Sharing Constructs: SECTIONS Directive

#pragma omp sections [clause ...] newline
    private (list)
    firstprivate (list)
    lastprivate (list)
    nowait
  {
    #pragma omp section newline
      structured_block
    #pragma omp section newline
      structured_block
  }
Work-Sharing Constructs: SECTIONS Directive

#include <omp.h>
#define N 1000
void main() {
  int i;
  float a[N], b[N], c[N], d[N];
  /* Some initializations */
  for (i=0; i < N; i++) {
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }
  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
      #pragma omp section
      for (i=0; i < N; i++)
        d[i] = a[i] * b[i];
    } /* end of sections */
  } /* end of parallel section */
}
Work-Sharing Constructs: SINGLE Directive
• The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team.
• May be useful when dealing with sections of code that are not thread safe (such as I/O); see the example after the syntax below.

#pragma omp single [clause ...] newline
    private (list)
    firstprivate (list)
    nowait

  structured_block
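A small illustrative example (ours, not from the slides): only one thread performs the non-thread-safe work, while the rest wait at the implicit barrier at the end of single.

#include <omp.h>
#include <stdio.h>

int shared_input;

void read_input(void) {
  #pragma omp parallel
  {
    /* exactly one thread executes the I/O */
    #pragma omp single
    {
      printf("thread %d reads the input\n", omp_get_thread_num());
      shared_input = 42;   /* e.g. scanf("%d", &shared_input); */
    }
    /* implicit barrier: all threads now see shared_input initialized */
    printf("thread %d sees input = %d\n", omp_get_thread_num(), shared_input);
  }
}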
Data Scope Attribute Clauses - PRIVATE
• The PRIVATE clause declares variables in its list to be private to each thread.
• A new object of the same type is declared once for each thread in the team.
• All references to the original object are replaced with references to the new object.
• Variables declared PRIVATE should be assumed to be uninitialized for each thread.
Data Scope Attribute Clauses - SHARED
• The SHARED clause declares variables in its list to be shared among all threads in the team.
• A shared variable exists in only one memory location and all threads can read or write to that address.
• It is the programmer’s responsibility to ensure that multiple threads properly access SHARED variables (such as via CRITICAL sections).
Data Scope Attribute Clauses - FIRSTPRIVATE
• The FIRSTPRIVATE clause combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list.
• Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct.
Data Scope Attribute Clauses - LASTPRIVATE
• The LASTPRIVATE clause combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section back to the original variable object.
• The value copied back into the original variable object is taken from the sequentially last iteration or section of the enclosing construct: the team member that executes the final iteration of a DO/for loop, or the team member that executes the last SECTION of a SECTIONS context, performs the copy with its own values. A short sketch follows below.
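A minimal sketch (ours, not from the slides) of FIRSTPRIVATE and LASTPRIVATE on a for loop:

#include <omp.h>
#include <stdio.h>

int main(void) {
  int offset = 100;   /* each thread's private copy starts at 100 */
  int last = -1;      /* receives the value from the final iteration */
  int i;
  #pragma omp parallel for firstprivate(offset) lastprivate(last)
  for (i = 0; i < 8; i++) {
    last = offset + i;   /* private per thread during the loop */
  }
  /* after the loop, last holds the value from iteration i == 7 */
  printf("last = %d\n", last);   /* prints 107 */
  return 0;
}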
Synchronization Constructs: CRITICAL Directive
• The CRITICAL directive specifies a region of code that must be executed by only one thread at a time.
• If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.
Synchronization Constructs: CRITICAL Directive

#include <omp.h>
void main() {
  int x;
  x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp critical
    x = x + 1;
  } /* end of parallel section */
}
Synchronization Constructs
• MASTER Directive
  – The MASTER directive specifies a region that is to be executed only by the master thread of the team. All other threads in the team skip this section of code.
• BARRIER Directive
  – The BARRIER directive synchronizes all threads in the team.
  – When a BARRIER directive is reached, a thread waits at that point until all other threads have reached the barrier. All threads then resume executing the code that follows the barrier in parallel. A small combined example follows.
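A minimal sketch (ours, not from the slides) combining both directives. Note that MASTER has no implied barrier, so the explicit BARRIER is what makes the master’s write visible before anyone reads it.

#include <omp.h>
#include <stdio.h>

int data;

int main(void) {
  #pragma omp parallel
  {
    /* only the master thread (thread 0) initializes the data */
    #pragma omp master
    data = 42;

    /* make everyone wait until the master's write is done */
    #pragma omp barrier

    printf("thread %d sees data = %d\n", omp_get_thread_num(), data);
  }
  return 0;
}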
Synchronization Constructs
• ATOMIC Directive
  – The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it. In essence, this directive provides a mini-CRITICAL section.
  – The directive applies only to a single, immediately following statement. An example follows.
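A minimal sketch (ours, not from the slides): the increment of counter is performed atomically by every thread.

#include <omp.h>
#include <stdio.h>

int main(void) {
  int counter = 0;
  #pragma omp parallel
  {
    /* the update to counter is performed atomically */
    #pragma omp atomic
    counter++;
  }
  printf("counter = %d\n", counter);  /* equals the number of threads */
  return 0;
}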
Find an error

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  t = 0;
  #pragma omp parallel shared(sum, a, b, n)
  {
    #pragma omp for private(t)
    for (i = 1; i <= n; i++) {
      t = t + a[i]*b[i];
    }
    #pragma omp critical (update_sum)
    sum += t;
  }
  printf("sum = %f \n", sum);
}
Find an error

for (i=0; i<n-1; i++)
  a[i] = a[i] + b[i];

for (i=0; i<n-1; i++)
  a[i] = a[i+1] + b[i];
Find an error

#pragma omp parallel
{
  int Xlocal = omp_get_thread_num();
  Xshared = omp_get_thread_num();
  printf("Xlocal = %d Xshared = %d\n", Xlocal, Xshared);
}

int i, j;
#pragma omp parallel for
for (i=0; i<n; i++)
  for (j=0; j<m; j++) {
    a[i][j] = compute(i,j);
  }
Find an error

void compute(int n)
{
  int i;
  double h, x, sum;
  h = 1.0/(double) n;
  sum = 0.0;
  #pragma omp for reduction(+:sum) shared(h)
  for (i=1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    sum += (1.0 / (1.0 + x*x));
  }
  pi = h * sum;
}
Find an error

void main()
{
  .............
  #pragma omp parallel for private(i,a,b)
  for (i=0; i<n; i++)
  {
    b++;
    a = b+i;
  } /*-- End of parallel for --*/
  c = a + b;
  .............
}
Find an error

int icount;
void lib_func()
{
  icount++;
  do_lib_work();
}
main()
{
  #pragma omp parallel
  {
    lib_func();
  }
}
Find an error

work1()
{
  #pragma omp barrier
}
work2()
{
}
main()
{
  #pragma omp parallel sections
  {
    #pragma omp section
      work1();
    #pragma omp section
      work2();
  }
}
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:
  – to Share — to copy, distribute and transmit the work
  – to Remix — to adapt the work
• Under the following conditions:
  – Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
  – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author’s moral rights.