A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording
(ASPLOS’06)
Min Xu, Rastislav Bodik, Mark D. Hill
Shimin Chen
LBA Reading Group Presentation
Why Do You Need a Recorder?
% gcc sim.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45 a = bucket->d;

% gcc para-sim.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program exited normally.
gdb>

% gcc para-sim.c
% a.out
Segmentation fault
Race recorded in “log”
%

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;
Ideally …
% gcc para-sim.c
% a.out
Segmentation fault
Race recorded in “log”
%

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

Long recording: small log
Low runtime overhead
Low cost
Applicability:
• Programs: data races
• Systems: non-SC
Flight Data Recorder (ISCA’03)
Full-system Record-Replay
• Recording memory races:
  • Assumes Sequential Consistency (SC)
  • Record order of instruction interleaving
  • Target cache-coherence multiprocessor server
  • Piggyback on coherence protocol: little extra H/W
• Recording system states: SafetyNet
• Recording I/Os
Results:
• Non-trivial recording interval: 1 second
• Negligible runtime overhead: less than 2%
• Can be “Always On”
RTR
Better memory race log compression
• 1 byte per Kilo instructions
Dealing with Total Store Ordering
In this talk, I will try to describe a full picture combining FDR and RTR.
Outline
• Introduction
• Recording System State
• Recording Input/Output
• Recording Memory Races
• Dealing with TSO
• Summary
Recording System State (based on SafetyNet)
• Purpose: reconstruct the initial state (registers, TLB, main memory) at the beginning of the replay interval
• Policy: FDR’s 1-second replay interval
  • Take a logical checkpoint every 1/3 second
  • Reserve memory space to store logs for 4 checkpoints
• Logical checkpoint:
  • Quiesce entire system to take a physical checkpoint
    • Registers and TLB states (4248 bytes/processor on SPARC V9)
  • Log old value of a cache line upon first update
    • Add an “already-updated” bit per cache line
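The "log old value upon first update" rule can be sketched as follows. This is a toy model, not SafetyNet's actual interface: memory is a Python dict keyed by cache-line address, and `CheckpointLog` with its method names is hypothetical. A set stands in for the per-line "already-updated" bit.

```python
class CheckpointLog:
    """Toy undo log for one checkpoint interval (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory          # cache-line address -> value
        self.undo = []                # (address, old value) pairs
        self.already_updated = set()  # stands in for the per-line bit

    def store(self, addr, value):
        # Log the old value only on the first update in this interval.
        if addr not in self.already_updated:
            self.undo.append((addr, self.memory.get(addr)))
            self.already_updated.add(addr)
        self.memory[addr] = value

    def new_checkpoint(self):
        # Start a new interval: hand back the old log and clear the bits.
        old_log, self.undo = self.undo, []
        self.already_updated.clear()
        return old_log

    def rollback(self):
        # Restore memory to its state at the start of the interval.
        for addr, old in reversed(self.undo):
            if old is None:
                self.memory.pop(addr, None)  # line did not exist before
            else:
                self.memory[addr] = old
        self.undo.clear()
        self.already_updated.clear()


mem = {0x40: 1, 0x80: 2}
log = CheckpointLog(mem)
log.store(0x40, 10)
log.store(0x40, 11)   # second update to the same line: nothing logged
log.store(0xC0, 5)
entries_logged = len(log.undo)
log.rollback()
```

Because only the first update per line is logged, the log size is bounded by the number of distinct lines touched in an interval, not by the number of stores.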
[Figure from the FDR paper]
Recording I/O
• I/O loads
• Instruction count + interrupt number
• DMA store values
Log All Dependence
     Thread I   Thread J
1    ld A       add
2    st B       st C
3    st C       ld B
4    ld D       st A
5    sub        st C
6    ld B       st D
Dependence Log (16 bytes per dependence):
Log J: 2→3, 1→4, 3→5, 4→6
Log I: 2→3
Log Size: 5*16 = 80 bytes (10 integers)
But too many dependencies
Netzer’s Transitive Reduction (TR), approximated by FDR
     Thread I   Thread J
1    ld A       add
2    st B       st C
3    st C       ld B
4    ld D       st A
5    sub        st C
6    ld B       st D
TR Reduced Log:
Log J: 2→3, 3→5, 4→6
Log I: 2→3
Log Size: 64 bytes (8 integers)
How to further reduce log size?
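The reduction on this slide can be reproduced with a small filter. This is a minimal sketch for the two-thread case only, assuming each dependence is a pair (sender instruction count, receiver instruction count); the function name is hypothetical, and with more than two threads a vector of counters per thread would be needed.

```python
def transitive_reduction(deps):
    """Keep only the dependences not implied by earlier logged ones.

    deps: (sender_ic, receiver_ic) pairs from a single sender thread.
    """
    logged = []
    largest_seen = 0  # largest sender instruction count already ordered before us
    for sender_ic, receiver_ic in sorted(deps, key=lambda d: d[1]):
        if sender_ic > largest_seen:
            logged.append((sender_ic, receiver_ic))
            largest_seen = sender_ic
        # Otherwise an earlier logged dependence plus program order
        # already implies this one, so it is dropped.
    return logged


# Dependences from the example: I -> J and J -> I.
log_j = transitive_reduction([(2, 3), (1, 4), (3, 5), (4, 6)])
log_i = transitive_reduction([(2, 3)])
size_bytes = 16 * (len(log_j) + len(log_i))  # 16 bytes per dependence
```

Here 1→4 is dropped because 2→3 is already logged: instruction 1 precedes 2 in Thread I, and 3 precedes 4 in Thread J.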
RTR
Actively creating artificial dependencies
• Stricter
• Vectorized
The Intuition of the RTR Algorithm
[Figure: dependencies after reduction, from I to J and from J to I, grouped into vectors; the vectors “regulate” replay]
Stricter Dependences to Aid Vectorization
     Thread I   Thread J
1    ld A       add
2    st B       st C
3    st C       ld B
4    ld D       st A
5    sub        st C
6    ld B       st D
New Reduced Log:
Log J: 2→3, 4→5 (the stricter 4→5 replaces 3→5 and 4→6)
Log I: 2→3
Log Size: 48 bytes (6 integers)
Fewer dependencies to log
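Why a stricter dependence can stand in for several original ones can be checked mechanically. The sketch below assumes a logged dependence s'→r' implies s→r whenever s ≤ s' and r' ≤ r (program order within each thread); the helper name is hypothetical. It only checks coverage; it does not check that the artificial dependence is consistent with the recorded execution, which a recorder would also have to guarantee.

```python
def implies(logged, dep):
    """True if some logged (s', r') implies dep (s, r): s <= s' and r' <= r."""
    s, r = dep
    return any(s <= ls and lr <= r for ls, lr in logged)


original_j = [(2, 3), (1, 4), (3, 5), (4, 6)]  # all I -> J dependences
stricter_j = [(2, 3), (4, 5)]                  # log from this slide

all_covered = all(implies(stricter_j, d) for d in original_j)
# Without the artificial 4->5 dependence, 3->5 would not be enforced:
missing_without_stricter = not implies([(2, 3)], (3, 5))
```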
Compress Vectorized Dependencies
     Thread I   Thread J
1    ld A       add
2    st B       st C
3    st C       ld B
4    ld D       st A
5    sub        st C
6    ld B       st D
Vectorized Log:
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log Size: 40 bytes (5 integers)
TR → RTR: fewer deps + fewer bytes/dep
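One way to read the "x, ∆" notation is that each receiver instruction x depends on sender instruction x − ∆, so dependences with equal ∆ share one integer. The sketch below assumes that reading and an 8-byte integer (consistent with 10 integers = 80 bytes earlier); the encoding and function names are illustrative, not the paper's exact log format.

```python
from collections import defaultdict


def vectorize(deps):
    """Group (sender_ic, receiver_ic) pairs by delta = receiver_ic - sender_ic."""
    groups = defaultdict(list)
    for sender_ic, receiver_ic in deps:
        groups[receiver_ic - sender_ic].append(receiver_ic)
    return {delta: sorted(xs) for delta, xs in groups.items()}


def expand(vectors):
    """Inverse of vectorize: recover the (sender_ic, receiver_ic) pairs."""
    return sorted((x - delta, x) for delta, xs in vectors.items() for x in xs)


def integers_used(vectors):
    # One integer per delta plus one per receiver position x.
    return sum(1 + len(xs) for xs in vectors.values())


vec_j = vectorize([(2, 3), (4, 5)])  # Log J
vec_i = vectorize([(2, 3)])          # Log I
size_bytes = 8 * (integers_used(vec_j) + integers_used(vec_i))
```

The stricter dependences from the previous slide matter here: 2→3 and 4→5 share ∆=1, whereas the TR log (2→3, 3→5, 4→6) mixes ∆=1 and ∆=2.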
H/W Considerations
(IC) Instruction count per core -- easy
(VIC[p]) record previously seen senders’ largest time stamps for transitive reduction
(CTS[b]) time stamp per cache block:
• i.e. record IC upon load/store commits
• At commit time:
  • Figure out memory address – how difficult?
  • Write CTS: decoupled timestamp memory
H/W Considerations Cont’d
Piggyback on cache coherence messages
• FDR: CTS[b]
• RTR: CTS[b] & sender’s IC
Logic to perform algorithm at the receiver side
• FDR: integer comparison, update VIC[sender], generate log record
• RTR: in addition, max/min, integer subtraction
Augment directory structure
• Record last owner for evicted blocks
Cache must respond to inquiries about evicted blocks: reply with CTS[SET/LRU]
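The FDR-style receiver logic (integer comparison, update VIC[sender], generate log record) can be sketched with the three structures IC, VIC[p] and CTS[b]. This is a software model under assumptions: the `Core` class is hypothetical, which caches reply to a request is passed in by hand instead of being decided by a coherence protocol, and the interleaving below is just one order consistent with the two-thread example. The extra RTR steps (max/min, subtraction) are not modelled.

```python
class Core:
    def __init__(self, core_id, n_cores):
        self.id = core_id
        self.ic = 0               # IC: committed instruction count
        self.vic = [0] * n_cores  # VIC[p]: largest timestamp seen from core p
        self.cts = {}             # CTS[b]: IC of this core's last access to b
        self.log = []             # (sender id, sender timestamp, receiver IC)

    def step(self, block=None, replies=()):
        """Commit one instruction.

        replies: (sender_id, sender CTS[b]) pairs piggybacked on the
        coherence messages for this access (chosen by hand here).
        """
        self.ic += 1
        for sender, ts in replies:
            if ts > self.vic[sender]:          # not implied by earlier records
                self.log.append((sender, ts, self.ic))
                self.vic[sender] = ts
        if block is not None:
            self.cts[block] = self.ic          # stamp the block at commit


core_i, core_j = Core(0, 2), Core(1, 2)
core_i.step("A")                               # I1: ld A
core_i.step("B")                               # I2: st B
core_j.step()                                  # J1: add
core_j.step("C")                               # J2: st C
core_j.step("B", [(0, core_i.cts["B"])])       # J3: ld B
core_j.step("A", [(0, core_i.cts["A"])])       # J4: st A
core_i.step("C", [(1, core_j.cts["C"])])       # I3: st C
core_j.step("C", [(0, core_i.cts["C"])])       # J5: st C
core_i.step("D")                               # I4: ld D
core_j.step("D", [(0, core_i.cts["D"])])       # J6: st D
core_i.step()                                  # I5: sub
core_i.step("B")                               # I6: ld B
```

With this interleaving, the logs match the TR reduced log: J records 2→3, 3→5 and 4→6 from I, and I records 2→3 from J.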
Total Store Ordering
FIFO write buffer
• A store commits by placing its value into the write buffer
• A store is ordered when it exits the write buffer and updates the memory
• Stores are ordered in commit order (FIFO)
Load can obtain values from write buffer or from memory system
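The commit/order distinction can be illustrated with a toy model, assuming one FIFO write buffer per core and a shared dict as memory; `TSOCore` and its methods are hypothetical names. The example shows an outcome that sequential consistency forbids: both cores read 0 even though both stores have committed.

```python
from collections import deque


class TSOCore:
    def __init__(self, memory):
        self.memory = memory   # shared dict: address -> value
        self.buffer = deque()  # FIFO write buffer of (address, value)

    def store(self, addr, value):
        # Commit: the store enters the write buffer.
        self.buffer.append((addr, value))

    def drain_one(self):
        # Order: the oldest store leaves the buffer and updates memory.
        addr, value = self.buffer.popleft()
        self.memory[addr] = value

    def load(self, addr):
        # Forward from the youngest matching buffered store, else read memory.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)


memory = {}
p0, p1 = TSOCore(memory), TSOCore(memory)
p0.store("x", 1)       # committed, not yet ordered
p1.store("y", 1)       # committed, not yet ordered
r0 = p0.load("y")      # reads memory: the store to y is still buffered in p1
r1 = p1.load("x")      # reads memory: the store to x is still buffered in p0
own = p0.load("x")     # forwarded from p0's own write buffer
p0.drain_one()
p1.drain_one()
```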
Problems with TSO
/* XXX */ is memory order
The two examples create cycles that will result in replay deadlocks
Solution
Identify problematic load instructions
• Monitor invalidation in [t1, t2]
• t1: the load (or the previous store that feeds the load) is ordered at memory
• t2: all preceding instructions are ordered
Log load values and replay these load instructions by values
HW: similar to the misspeculation detection circuitry in SC systems (e.g. MIPS R10000)
Insufficient for supporting Processor Consistency and other more relaxed models
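The window check can be sketched as below, assuming abstract integer times and a list of observed invalidations; the `Load` record and `loads_to_log` are hypothetical names, and real hardware would do this check with per-load state, not a list scan.

```python
from collections import namedtuple

# t1: when the load (or the store feeding it) is ordered at memory.
# t2: when all instructions preceding the load are ordered.
Load = namedtuple("Load", "name block t1 t2 value")


def loads_to_log(loads, invalidations):
    """Return (name, value) for loads whose block is invalidated in [t1, t2].

    invalidations: list of (time, block) pairs seen by this core.
    """
    logged = []
    for ld in loads:
        hit = any(b == ld.block and ld.t1 <= t <= ld.t2
                  for t, b in invalidations)
        if hit:
            logged.append((ld.name, ld.value))  # replay this load by value
    return logged


loads = [
    Load("ld y", "y", t1=3, t2=7, value=0),
    Load("ld z", "z", t1=4, t2=5, value=9),
]
invalidations = [(5, "y"), (8, "z")]
value_log = loads_to_log(loads, invalidations)
```

Only `ld y` is logged: its block is invalidated inside its window, while the invalidation of `z` arrives after t2.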
Conclusion
RTR: 1 byte/kilo-instruction
• Based on Netzer’s transitive reduction
• Create stricter dependencies
• Vectorize dependencies to compress log
• Avoid overly strict dependencies, hence no deadlock