Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics
Minjia Zhang, Jipeng Huang, Man Cao, Michael D. Bond
Do We Need Efficient STM?
Problem Solved!
Blue Gene/Q
HTM is limited…
Problem Solved?
Best-effort HTM: no completion guarantee [1]
Performance penalty: short transactions [2]
Language-level support for atomic blocks: STM fallback
[1] I. Calciu et al. Invyswell: A Hybrid Transactional Memory for Haswell’s Restricted Transactional Memory. In PACT, 2014.
[2] R. M. Yoo et al. Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing. In SC, 2013.
atomic {                      // transaction
  from.balance -= amount;
  to.balance += amount;
}
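Java has no built-in `atomic` block, so the transfer above needs a runtime behind it. As a minimal sketch of the semantics the block promises, the code below assumes a single global lock stands in for the TM runtime. That is the simplest correct fallback, not how LarkTM works. The `SketchAccount` type, the `atomic(Runnable)` helper, and `demoTotal` are hypothetical names introduced for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical account type; only the `balance` field appears on the slide.
class SketchAccount {
    long balance;
    SketchAccount(long balance) { this.balance = balance; }
}

class AtomicTransferSketch {
    // Assumption: one global lock plays the role of the TM runtime.
    private static final ReentrantLock GLOBAL = new ReentrantLock();

    // Runs `body` as if it were the slide's atomic { ... } block.
    static void atomic(Runnable body) {
        GLOBAL.lock();
        try {
            body.run();
        } finally {
            GLOBAL.unlock();
        }
    }

    static void transfer(SketchAccount from, SketchAccount to, long amount) {
        atomic(() -> {
            from.balance -= amount;
            to.balance += amount;
        });
    }

    // Two threads transfer in opposite directions; the combined balance is preserved.
    static long demoTotal() {
        SketchAccount a = new SketchAccount(1000);
        SketchAccount b = new SketchAccount(1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(b, a, 2); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(demoTotal());
    }
}
```

Every transfer moves money between the same two accounts, so the combined balance stays at 2000 as long as each transfer runs atomically.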
Software Transactional Memory Is Slow
Existing STMs add high overhead [1,2,3]
Related challenges: scalability, progress guarantees, strong semantics
[1] C. Cascaval et al. Software Transactional Memory: Why Is It Only a Research Toy? In CACM, 2008.
[2] A. Dragojević et al. Why STM Can Be More than a Research Toy. In CACM, 2011.
[3] R. M. Yoo et al. Kicking the Tires of Software Transactional Memory: Why the Going Gets Tough. In SPAA, 2008.
Challenge
Expensive to detect conflicts
T1:
atomic {
  …
  … = o.f;
  … = p.g;
  …
  o.f = …;
  p.g = …;
  …
}
T2 (one concurrent access per build of the slide): o.f = …, p.g = …, t.k = …
instrumentation: ?
LarkTM
LarkTM Contributions
• Adds very low overhead
• Achieves good scalability by using a hybrid approach
• Provides strong progress guarantees
• Provides strong atomicity
Key Insight
Avoid high overhead by minimizing instrumentation costs for non-conflicting accesses
LarkTM Design
Per-object biased reader-writer locks [1,2]
• Minimal instrumentation and synchronization for both transactional and non-transactional non-conflicting accesses
• Does not release locks even if transactions commit
Eager concurrency control
Piggybacking conflict detection and conflict resolution on lock transfers
[1] M. D. Bond et al. Octet: Capturing and Controlling Cross-Thread Dependences Efficiently. In OOPSLA, 2013.
[2] B. Hindman and D. Grossman. Atomicity via Source-to-Source Translation. In MSPC, 2006.
Biased Locks
object o: field f, lock state
lock state ∈ {WrEx_T, RdEx_T, RdSh}
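To make the lock states concrete, here is a minimal sketch of how a per-object state and its fast-path check might look. The state names WrEx, RdEx, and RdSh come from the slide; the `Kind`, `State`, and `Obj` types, the method names, and the simplified handling of RdSh are assumptions made for illustration. The sketch shows only the check, with no cross-thread memory-ordering machinery.

```java
// Hypothetical encoding of the slide's lock states for one object.
class BiasedLockSketch {
    enum Kind { WR_EX, RD_EX, RD_SH }

    // A lock state: a kind plus the owning thread (null for RdSh).
    static final class State {
        final Kind kind;
        final Thread owner;
        State(Kind kind, Thread owner) { this.kind = kind; this.owner = owner; }
    }

    static final class Obj {
        int f;        // the field from the slide
        State lock;   // the per-object lock state
        Obj(State initial) { this.lock = initial; }
    }

    // A write stays on the fast path only if the object is WrEx for this thread.
    static boolean writeFastPath(Obj o, Thread self) {
        State s = o.lock;
        return s.kind == Kind.WR_EX && s.owner == self;
    }

    // A read stays on the fast path if the object is read-shared,
    // or exclusive (WrEx or RdEx) for this thread.
    static boolean readFastPath(Obj o, Thread self) {
        State s = o.lock;
        return s.kind == Kind.RD_SH || s.owner == self;
    }
}
```

When the check succeeds, the access proceeds without any atomic operation; when it fails, the thread has to change the lock state, which is where coordination with the current owner comes in.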
Multi-thread Execution
Timeline with threads T1 and T2; object o has a field f, a lock state, and a last txn entry.
1. T1: transaction start (txn id: 42). o's lock state is WrEx_T1.
2. T1: o.f = 1. T1 updates o's last txn to 42, adds o.f to its undo log, and writes f = 1.
3. T2: o.f = 2, while T1 is still inside transaction 42.
4. Problem! No synchronization on T1's accesses to o.
5. T2 starts coordination.
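The bookkeeping T1 performs on its write (update last txn, add o.f to the undo log, then write in place) can be sketched as follows. This is an illustration under assumptions: the `Obj` and `Txn` types and their method names are hypothetical, the lock-state check is assumed to have already passed, and the undo log records only the old field value.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class UndoLogSketch {
    static final class Obj {
        int f;        // the field from the slides
        int lastTxn;  // id of the last transaction that wrote this object
    }

    static final class Txn {
        final int id;
        // Each entry restores one old value; the newest entry is at the head.
        private final Deque<Runnable> undoLog = new ArrayDeque<>();

        Txn(int id) { this.id = id; }

        // Transactional write to o.f, after the lock-state check has passed.
        void writeF(Obj o, int value) {
            final int old = o.f;
            o.lastTxn = id;                // "update last txn"
            undoLog.push(() -> o.f = old); // "add o.f to undo log"
            o.f = value;                   // in-place (eager) update
        }

        // On commit the old values are no longer needed.
        void commit() { undoLog.clear(); }

        // On abort, undo the writes newest first.
        void abort() {
            while (!undoLog.isEmpty()) {
                undoLog.pop().run();
            }
        }
    }
}
```

Because writes go straight to the object, a commit only has to discard the log, while an abort replays it in reverse to restore the old values.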
Coordination
1. T2 (o.f = 2) updates o's lock state from WrEx_T1 to the intermediate state Int_T2.
2. T2 sends a request to T1, which is still in transaction 42 (o.f = 1 … …).
3. T1 sees the request at a safe point, before its next access (… = o.f).
4. At the safe point, T1 runs conflict detection.
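The request and safe-point steps above can be sketched with ordinary Java concurrency utilities. This is only an illustration: the `Worker` and `Request` types, the inbox queue, and the `"response"` payload are assumptions, and conflict detection is reduced to a comment.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class CoordinationSketch {
    // One pending request; the requester blocks on `response`.
    static final class Request {
        final CompletableFuture<String> response = new CompletableFuture<>();
    }

    static final class Worker {
        final ConcurrentLinkedQueue<Request> inbox = new ConcurrentLinkedQueue<>();

        // Called by the owning thread at points where it can safely hand over objects.
        void safePoint() {
            Request r;
            while ((r = inbox.poll()) != null) {
                // Conflict detection would run here, before responding.
                r.response.complete("response");
            }
        }

        // Called by another thread: send a request, then wait for the response.
        String requestFrom() {
            Request r = new Request();
            inbox.add(r);
            return r.response.join();
        }
    }

    // T1 polls its inbox at safe points; the calling thread plays T2.
    static String demo() {
        Worker t1 = new Worker();
        Thread owner = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                t1.safePoint();
                Thread.yield();
            }
        });
        owner.setDaemon(true);
        owner.start();
        String reply = t1.requestFrom(); // T2 is "waiting" here
        owner.interrupt();
        return reply;
    }
}
```

The requester never touches the owner's state directly; it only learns the outcome from the response, which is what lets the owner run its checks at a point of its own choosing.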
A Transactional Conflict
T2's request for o (o.f = 2) reaches T1 at a safe point while T1 is still in the transaction that wrote o.f = 1; o's last txn is 42.
Detecting Conflicts → detected conflicts → Contention Management → Resolving Conflicts
Not A Transactional Conflict
T1 has since started a new transaction (txn id: 43), while o's last txn is still 42.
Detecting conflicts at T1's safe point reports no conflict for T2's request (o.f = 2).
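The two cases above differ only in whether o's last txn matches the transaction the responding thread is currently running. A minimal sketch of that check, assuming it covers writes tracked through last txn only (reads are not modeled here), with hypothetical type and method names:

```java
class ConflictCheckSketch {
    // Assumption: marker meaning "this thread is not inside a transaction".
    static final int NO_TXN = -1;

    static final class Obj {
        int lastTxn = NO_TXN; // id of the last transaction that wrote the object
    }

    static final class ThreadState {
        int currentTxn = NO_TXN; // id of the transaction the thread is running, if any
    }

    // Run by the responding thread at its safe point, for the requested object.
    // There is a transactional conflict only if the responder's current
    // transaction is the one that last wrote the object.
    static boolean isTransactionalConflict(ThreadState responder, Obj o) {
        return responder.currentTxn != NO_TXN && o.lastTxn == responder.currentTxn;
    }
}
```

With last txn 42, a responder still in transaction 42 reports a conflict, while a responder that has moved on to transaction 43 does not.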
Coordination
After detecting conflicts at its safe point, T1 (txn id: 42) sends a response to T2, which has been waiting since it sent the request for o.f = 2.
Strong Progress Guarantees
When a conflict is detected, either side may abort: the responding thread T1 (in transaction 42) or the waiting requester T2.
Starvation and livelock freedom
Strong Atomicity Semantics
Transactional vs. Transactional Conflict
T2's o.f = 2 is a transactional access. After conflict detection, one of the two transactions aborts and retries.
Transactional vs. Non-transactional Conflict
T2's o.f = 2 is a non-transactional access. Either T1's transaction aborts, or T2's access retries until T1 responds at a safe point after its transaction end.
Non-transactional accesses: short transactions, no setting up/tearing down cost
No Transactional Conflict
1. T2 (transaction start, txn id: 51) sends a request for o (o.f = 2) and waits; T1 has reached its transaction end.
2. T1 detects no conflict at its safe point and sends a response.
3. T2 acquires the lock: o's lock state changes from Int_T2 to WrEx_T2.
4. T2 updates o's last txn to 51, adds o.f to its undo log, and writes o.f = 2.
Two versions of coordination protocol

LarkTM-O
Adds very low overhead and scales well for low-contention cases
High-Contention Applications
T1 (txn: 42, txn: 43) and T2 (txn: 51, txn: 52) repeatedly read and write the same field (… = o.f … o.f = …).
Each conflicting access needs another request and response, and the requester waits for the other thread's safe point.

Handling High Contention
LarkTM-S
LarkTM-S: Hybrid with Traditional Locking
Same execution: T1 (txn: 42, txn: 43) and T2 (txn: 51, txn: 52) repeatedly read and write o.f.
o causes high contention
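The slides do not spell out how an object is moved from biased locking to traditional locking. One plausible shape, offered purely as an assumption, is a per-object counter of conflicting lock transfers with a fixed cutoff. The `Mode` enum, the counter, and the `THRESHOLD` value below are all hypothetical.

```java
class HybridPolicySketch {
    // BIASED: the low-overhead biased reader-writer lock.
    // TRADITIONAL: an ordinary per-object lock acquired on each conflicting use.
    enum Mode { BIASED, TRADITIONAL }

    // Hypothetical cutoff; the slides do not give a value.
    static final int THRESHOLD = 4;

    static final class Obj {
        Mode mode = Mode.BIASED;
        int conflictingTransfers = 0;
    }

    // Called each time acquiring o's biased lock required coordination
    // with another thread.
    static void onConflictingTransfer(Obj o) {
        if (o.mode == Mode.BIASED && ++o.conflictingTransfers >= THRESHOLD) {
            o.mode = Mode.TRADITIONAL; // o causes high contention
        }
    }
}
```

Under this policy, objects that are rarely contended keep the cheap biased fast path, and only objects that keep bouncing between threads pay for traditional locking.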
Comparison Of Concurrency Control
               | Write concurrency control                  | Read concurrency control
LarkTM-O       | Eager per-object biased reader–writer lock | Eager per-object biased reader–writer lock
LarkTM-S       | IntelSTM–LarkTM-O hybrid                   | IntelSTM–LarkTM-O hybrid
IntelSTM [1,2] | Eager per-object lock                      | Lazy version validation
NOrec [3]      | Lazy global seqlock                        | Lazy value validation
[1] B. Saha et al. McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime. In PPoPP, 2006.
[2] T. Shpeisman et al. Enforcing Isolation and Ordering in STM. In PLDI, 2007.
[3] L. Dalessandro et al. NOrec: Streamlining STM by Abolishing Ownership Records. In PPoPP, 2010.
Comparison Of Instrumentation
         | Instrumented accesses
LarkTM-O | All accesses
LarkTM-S | All accesses
IntelSTM | All accesses
NOrec    | All transactional accesses
(except redundant accesses)

Comparison Of Progress Guarantees
         | Progress Guarantee
LarkTM-O | Livelock and starvation free
LarkTM-S | Livelock and starvation free
IntelSTM | None
NOrec    | Livelock free

Comparison Of Semantics
         | Semantics
LarkTM-O | Strong Atomicity
LarkTM-S | Strong Atomicity
IntelSTM | Strong Atomicity
NOrec    | Single Global Lock Atomicity (SLA)
Implementation
• LarkTM-O, LarkTM-S, IntelSTM (McRT), and NOrec
• Developed in Jikes RVM 3.1.3
• All STMs share features as much as possible (e.g., inlining decisions, redundant barrier analysis, name-mangling)
• Source code publicly available on the Jikes RVM Research Archive
Evaluation Methodology
• TM programs
  • STAMP benchmarks
• STM comparison
  • NOrec
  • IntelSTM
  • LarkTM-O
  • LarkTM-S
• Platform
  • Eight 8-core processors (AMD Opteron 6272)
  • Four 8-core processors (Intel Xeon E5-4620)
Single-Thread Performance
[Figure: bar chart of single-thread overhead (%), y-axis 0 to 300, for kmeans_low, kmeans_high, ssca2, intruder, labyrinth3d, genome, vacation_low, vacation_high, and geomean. Series: NOrec, IntelSTM, LarkTM-O, LarkTM-S. Off-scale bar labels as extracted: 610, 2870. Callouts on the final build: 40%, 73%.]
Speedup Geomean
[Figure: line chart of geomean speedup (y-axis 0 to 2) versus Threads (1, 2, 4, 8). Series: NOrec, IntelSTM, LarkTM-O, LarkTM-S.]
Toward Practical STM
[Figure: speedup (y-axis 0 to 2) versus Threads (1, 2, 4, 8) for NOrec, IntelSTM, LarkTM-O, LarkTM-S, annotated with the points below.]
• Low instrumentation overhead
• Scales well
• Strong progress guarantees
• Strong semantics
Thank you