Posted on 31-Mar-2015
Read-Write Lock Allocation in Software Transactional Memory
Amir Ghanbari Bavarsad and Ehsan Atoofian
Lakehead University
[Figure: processors P1 … Pn, each with a private cache ($), sharing a Global Clock]
Transactional Memory
Software transactional memory (STM) exploits a global clock to validate transactional data
  Pros: reduces validation overhead
  Cons: contention on the global clock
Alternative: Read-Write Lock Allocation (RWLA)
  Pros: no central clock
  Cons: overhead if a TX aborts
Speculative RWLA: changes validation policy dynamically → Speedup: up to 66%
Outline
Background
RWLA
Speculative RWLA
Conclusion
Counter in STM
T1
TM_BEGIN();
local_counter = TM_READ(counter);
local_counter++;
TM_WRITE(counter, local_counter);
TM_END();
Transactional data are validated using:
Global clock
  Shared variable
  Timestamp for transactions
Lock
  Memory is mapped to Lock Table
  Each entry of the table: Version #
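The mapping from memory to the lock table can be sketched as follows. This is a minimal illustration, not TL2's actual code: the table size, the shift amount, and the names `lock_table`, `lock_index`, and `lock_for` are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_SIZE 1024u  /* hypothetical size; must be a power of two */

/* One lock-table entry; in this scheme it only carries a version number. */
typedef struct {
    uint64_t version;
} lock_entry_t;

static lock_entry_t lock_table[LOCK_TABLE_SIZE];

/* Map an address to a table index: drop the low (alignment) bits, then
 * mask to the table size. Distinct addresses may share one entry. */
static size_t lock_index(const void *addr)
{
    assert(addr != NULL);
    return (size_t)(((uintptr_t)addr >> 3) & (LOCK_TABLE_SIZE - 1u));
}

/* Return the lock-table entry that guards the given address. */
static lock_entry_t *lock_for(const void *addr)
{
    return &lock_table[lock_index(addr)];
}
```

With this layout, `lock_for(&counter)` yields the entry whose version number guards `counter`.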
Validation in STM
[Figure: Global Clock, Memory, and Lock Table with a Version # per entry]
Updating Global Clock & Lock
Increment Global Clock
Version # = global_clock
[Figure: Global Clock, Memory holding counter, and the Lock Table entry with its Version #]
Validation in STM
rv (read version) is set to global_clock
[Figure: T1 code; Metadata for TX1 holds rv, set from the Global Clock]
Successful Read Validation
rv >= version #: the most recent write to counter occurred before TM_BEGIN()
[Figure: T1 code; Metadata for TX1 (rv) and Global Clock]
Failed Read Validation
rv < version #: the most recent write to counter occurred after TM_BEGIN()
[Figure: T1 code; Metadata for TX1 (rv) and Global Clock]
Overhead of Validation
This method, called GV4, results in many cache coherence misses if transactions commit frequently
[Figure: processors P1 … Pn, each with a private cache ($), sharing the Global Clock]
Read-Write Lock Allocation (RWLA)
Lock
  Memory is mapped to Lock Table
  Each entry of the table: lock bit, read bits
[Figure: Memory mapped to the Lock Table; each entry holds a lock bit followed by read bits P0, P1, …, Pn-1]
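One possible encoding of an RWLA lock-table entry is a single machine word. The sketch below is an assumption made for illustration: the slides do not give the word width or bit positions, so the 32-bit entry, the lock bit in the top position, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 31u                  /* hypothetical: one read bit per thread */
#define LOCK_BIT    (UINT32_C(1) << 31)  /* assumed position of the lock bit */
#define READ_MASK   (LOCK_BIT - 1u)      /* bits 0..30: read bits P0..Pn-1 */

typedef uint32_t rwla_entry_t;

/* Read bit that belongs to thread `tid`. */
static uint32_t read_bit(unsigned tid)
{
    assert(tid < MAX_THREADS);
    return UINT32_C(1) << tid;
}

/* True if a writer currently holds the entry. */
static bool is_locked(rwla_entry_t e)
{
    return (e & LOCK_BIT) != 0;
}

/* True if at least one thread has its read bit set. */
static bool has_readers(rwla_entry_t e)
{
    return (e & READ_MASK) != 0;
}
```

Packing both fields into one word lets a reader or writer inspect and update the entry with a single atomic operation.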
TM_READ
[Figure: lock entry before the read: 000000 …..]
TM_READ
TM_READ(): Lock bit is free?
  Yes → set read bit in the corresponding lock entry
[Figure: lock entry after the read: 000000 …..1]
TM_READ
TM_READ(): Lock bit is free?
  Yes → set read bit in the corresponding lock entry
  No → abort
[Figure: lock entry with the lock bit set: 100000 …..]
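The TM_READ decision can be sketched with a compare-and-swap loop. This is an illustrative reading of the flowchart, not the authors' implementation: the entry layout (lock bit in the top position, one read bit per thread) and the name `rwla_read` are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 31u                  /* hypothetical: one read bit per thread */
#define LOCK_BIT    (UINT32_C(1) << 31)  /* assumed position of the lock bit */

typedef _Atomic uint32_t rwla_entry_t;

/* Returns true if the read may proceed (read bit set),
 * false if the transaction must abort (lock bit taken). */
static bool rwla_read(rwla_entry_t *entry, unsigned tid)
{
    assert(tid < MAX_THREADS);
    uint32_t old = atomic_load(entry);
    do {
        if (old & LOCK_BIT)
            return false;               /* lock bit is not free: abort */
        /* on CAS failure `old` is refreshed and the lock bit is rechecked */
    } while (!atomic_compare_exchange_weak(entry, &old,
                                           old | (UINT32_C(1) << tid)));
    return true;                        /* read bit for this thread is set */
}
```

The loop retries only when another thread changed the entry between the load and the swap, so a reader never sets its bit on a locked entry.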
TM_WRITE
TM_WRITE: All read bits are clear?
  No → abort
[Figure: lock entry with a read bit set: 000100 …..]
TM_WRITE
TM_WRITE: All read bits are clear?
  No → abort
  Yes → acquire lock; if the acquire fails → abort
[Figure: lock entry with the lock bit already set: 100000 …..]
TM_WRITE
TM_WRITE: All read bits are clear?
  No → abort
  Yes → acquire lock; if the acquire fails → abort
[Figure: lock entry bits after TM_WRITE]
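The TM_WRITE decision can be sketched in the same style. Two points are assumptions rather than facts from the slides: the entry layout, and the choice to ignore the writer's own read bit (the counter example reads before it writes, so counting its own bit would always abort). The name `rwla_write` is hypothetical, and repeated writes by the lock holder are not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 31u                  /* hypothetical: one read bit per thread */
#define LOCK_BIT    (UINT32_C(1) << 31)  /* assumed position of the lock bit */
#define READ_MASK   (LOCK_BIT - 1u)      /* bits 0..30: read bits */

typedef _Atomic uint32_t rwla_entry_t;

/* Returns true if the lock was acquired, false if the transaction must abort. */
static bool rwla_write(rwla_entry_t *entry, unsigned tid)
{
    assert(tid < MAX_THREADS);
    uint32_t own = UINT32_C(1) << tid;
    uint32_t old = atomic_load(entry);

    if (old & READ_MASK & ~own)
        return false;                   /* another thread's read bit is set: abort */
    if (old & LOCK_BIT)
        return false;                   /* lock already held: acquire fails, abort */

    /* Single attempt: if the entry changed since the load, treat it as a
     * failed acquire and abort. */
    return atomic_compare_exchange_strong(entry, &old, old | LOCK_BIT);
}
```

A single compare-and-swap attempt keeps the sketch close to the flowchart, where a failed acquire leads directly to an abort.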
Experimental Framework
Benchmarks: STAMP v0.9.7
  Run to completion
  Measured statistics over 10 runs
TL2 as an STM framework
Two Intel Xeon E5660, 6-way CMP
Performance of RWLA
[Figure: bar chart for Bayes, Kmeans, Labyrinth, Ssca2, Vacation, and Genome with 2, 4, 8, and 16 threads plus AVG.; y-axis from 0 to 1.6, with an arrow marking the "better" direction]
Speculative RWLA
Conflict occurs frequently → select GV4
Conflict occurs rarely → select RWLA
How to predict conflict?
Contention Predictor
Prediction:
  y ≥ 0 → predict commit
  y < 0 → predict abort
Update:
  If the outcomes of the current TX and TXi agree/disagree → increment/decrement wi
[Figure: perceptron with inputs 1, x1, …, xn, weights w0, w1, …, wn, and output y]
$$y = w_0 + \sum_{i=1}^{n} x_i w_i$$
xi: global transaction history, bipolar value
wi: weight vector
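The predictor and the policy choice it drives can be sketched as a small perceptron. Several details are assumptions for illustration only: the history length, encoding commit as +1 and abort as -1, updating the bias weight toward the outcome, shifting the outcome into the history after each update, and mapping "predict abort" to GV4 and "predict commit" to RWLA. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HISTORY_LEN 8  /* hypothetical value of n */

typedef enum { POLICY_GV4, POLICY_RWLA } policy_t;

typedef struct {
    int w[HISTORY_LEN + 1];  /* w[0] is the bias weight w0 */
    int x[HISTORY_LEN];      /* bipolar history: +1 = commit, -1 = abort */
} predictor_t;

static void predictor_init(predictor_t *p)
{
    assert(p != NULL);
    for (int i = 0; i <= HISTORY_LEN; i++) p->w[i] = 0;
    for (int i = 0; i < HISTORY_LEN; i++) p->x[i] = 1;  /* assume past commits */
}

/* y = w0 + sum over i of x_i * w_i */
static int predictor_output(const predictor_t *p)
{
    int y = p->w[0];
    for (int i = 0; i < HISTORY_LEN; i++) y += p->x[i] * p->w[i + 1];
    return y;
}

/* y >= 0 predicts commit, y < 0 predicts abort. */
static bool predict_commit(const predictor_t *p)
{
    return predictor_output(p) >= 0;
}

/* Frequent conflicts (predicted abort) select GV4, rare ones select RWLA. */
static policy_t select_policy(const predictor_t *p)
{
    return predict_commit(p) ? POLICY_RWLA : POLICY_GV4;
}

/* Increment w_i when the current outcome agrees with history entry x_i,
 * decrement it otherwise; then shift the outcome into the history. */
static void predictor_update(predictor_t *p, bool committed)
{
    int t = committed ? 1 : -1;
    p->w[0] += t;                        /* bias input is the constant 1 */
    for (int i = 0; i < HISTORY_LEN; i++)
        p->w[i + 1] += (p->x[i] == t) ? 1 : -1;
    for (int i = HISTORY_LEN - 1; i > 0; i--) p->x[i] = p->x[i - 1];
    p->x[0] = t;
}
```

A runtime would call `select_policy` before starting a transaction and `predictor_update` once the outcome is known.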
Performance of Speculative RWLA
# of threads changes between 2 and 16
On average, performance changes from 21% in Bayes to 47% in Labyrinth
[Figure: bar chart for Bayes, Kmeans, Labyrinth, Ssca2, Vacation, and Genome with 2, 4, 8, and 16 threads plus AVG.; y-axis from 0 to 1.2, with an arrow marking the "better" direction]
Conclusion
RWLA overcomes contention on the global clock
Applications react differently to GV4 and RWLA
Speculative RWLA changes validation policy dynamically
Speculative RWLA improves the performance of STMs by up to 66%
Thank You!
Questions?