1
What is a Data Race?
Two concurrent accesses to a shared location, at least one of them for writing. Indicative of a bug.
Example:
  Thread 1: X++; Z=2
  Thread 2: T=Y; T=X
2
How Can Data Races be Prevented?
Explicit synchronization between threads: locks, critical sections, barriers, mutexes, semaphores, monitors, events, etc.
Example:
  Thread 1: Lock(m); X++; Unlock(m)
  Thread 2: Lock(m); T=X; Unlock(m)
3
Is This Sufficient?
Yes! No!
• Programmer dependent
  – Correctness: the programmer may forget to synchronize
  – Need tools to detect data races
• Expensive
  – Efficiency: to achieve correctness, the programmer may overdo it
  – Need tools to remove excessive synchronization
4
Where is Waldo?
#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
    unlock(g_lock);
}
int find( Type& obj, int number ) {
    int i;
    lock(g_lock);
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break; // Found!!!
    if (i == number) i = -1; // Not found… Return -1 to caller
    unlock(g_lock);
    return i;
}
int find( Type& obj ) {
    return find( obj, g_counter );
}
5
Can You Find the Race?
A similar problem was found in java.util.Vector
#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;                  // write
    unlock(g_lock);
}
int find( Type& obj, int number ) {
    int i;
    lock(g_lock);
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break; // Found!!!
    if (i == number) i = -1; // Not found… Return -1 to caller
    unlock(g_lock);
    return i;
}
int find( Type& obj ) {
    return find( obj, g_counter );  // read (outside the lock)
}
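One way to close this race is for find( obj ) to read g_counter only while holding g_lock. The following is a minimal sketch, not the original author's fix. It assumes std::mutex in place of the slide's Lock type and int in place of Type, fills in a simple push() body so the code can run, and introduces a hypothetical helper find_locked() that expects the caller to hold the lock.

```cpp
#include <mutex>

// Assumptions (not from the slides): Type is int, Lock is std::mutex,
// and push() has a simple body so the sketch can run.
constexpr int N = 100;
using Type = int;

Type* g_stack = new Type[N];
int g_counter = 0;
std::mutex g_lock;

void push(const Type& obj) {
    std::lock_guard<std::mutex> guard(g_lock);
    if (g_counter < N) g_stack[g_counter++] = obj;
}

void popAll() {
    std::lock_guard<std::mutex> guard(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
}

// Hypothetical helper: the caller must already hold g_lock.
int find_locked(const Type& obj, int number) {
    for (int i = 0; i < number; i++)
        if (obj == g_stack[i]) return i;  // found
    return -1;                            // not found
}

int find(const Type& obj, int number) {
    std::lock_guard<std::mutex> guard(g_lock);
    // Clamp so a stale caller-supplied bound cannot pass the live elements.
    if (number > g_counter) number = g_counter;
    return find_locked(obj, number);
}

int find(const Type& obj) {
    // g_counter is now read inside the critical section.
    std::lock_guard<std::mutex> guard(g_lock);
    return find_locked(obj, g_counter);
}
```

The design choice here is to take the lock once in each public function and do the search in an unlocked helper, which avoids locking a non-recursive mutex twice on the same thread.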
6
Detecting Data Races?
• NP-hard [Netzer & Miller 1990]
  – Input size = # instructions performed
  – Even for 3 threads only
  – Even with no loops/recursion
• Execution orders/scheduling: (#threads)^thread_length
• # inputs
• Detection code's side effects
• Weak memory, instruction reordering, atomicity
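To give a feel for how quickly the number of execution orders grows, the sketch below counts the exact number of interleavings of threads that each keep their own program order, which is the multinomial coefficient (n1 + ... + nk)! / (n1! ... nk!). The function names are illustrative, and the count ignores synchronization, which would rule some orders out.

```cpp
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, k) using exact integer arithmetic.
// Overflows std::uint64_t for large inputs; this is only a sketch.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;  // stays integral at every step
    return result;
}

// Number of ways to interleave threads with the given instruction counts
// while keeping each thread's own program order.
std::uint64_t count_interleavings(const std::vector<std::uint64_t>& lengths) {
    std::uint64_t total = 0, ways = 1;
    for (std::uint64_t n : lengths) {
        total += n;
        ways *= binomial(total, n);  // choose slots for this thread's instructions
    }
    return ways;
}
```

For example, two threads of 10 instructions each already admit 184,756 distinct interleavings.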
7
Motivation
• Run-time framework goals
  – Collect a complete trace of a program’s user-mode execution
  – Keep the tracing overhead low for both space and time
  – Re-simulate the traced execution deterministically, based on the collected trace, with full fidelity down to the instruction level
  – Full fidelity: user mode only, no tracing of the kernel, only user-mode I/O callbacks
• Advantages
  – Complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)
  – Trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)
• Challenges: trace size and performance
8
Original Record-Replay Approaches
• InstantReplay ’87
  – Record the order of memory accesses
  – Overhead may affect program behavior
• RecPlay ’00
  – Record only synchronizations
  – Not deterministic if the program has data races
• Netzer ’93
  – Record an optimal trace
  – Too expensive to keep track of all memory locations
• Bacon & Goldstein ’91
  – Record memory bus transactions with hardware
  – High logging bandwidth
9
Motivation
• Increasing use of, and development for, multi-core processors
• MT program behavior is non-deterministic
• To effectively debug software, developers must be able to replay executions that exhibit concurrency bugs
• Shared memory updates happen in a different order
10
Related Concepts
• Runtime interpretation/translation of binary instructions
  – Requires no static instrumentation or special symbol information
  – Handles dynamically generated code and self-modifying code
  – Recording/logging: ~100-200x
• More recent logging
  – Proposed hardware support (for the MT domain)
  – FDR (Flight Data Recorder)
  – BugNet (cache bits set on first load)
  – RTR (Regulated Transitive Reduction)
  – DeLorean (ISCA 2008, chunks of instructions)
  – Strata (time layer across all the logs for the running threads)
  – iDNA (Diagnostic infrastructure using NirvanA, Microsoft)
11
Deterministic Replay
Re-execute the exact same sequence of instructions as recorded in a previous run
• Single-threaded programs
  – Record load values needed for reproducing the behavior of a run (Load Log)
  – Registers updated by system calls and signal handlers (Reg Log)
  – Output of special instructions: RDTSC, CPUID (Reg Log)
  – System calls (virtualization: cloning arguments, updates)
  – Checkpointing (log summary ~10 million)
• Multi-threaded programs
  – Log the interleaving among threads (shared memory updates ordering, SMO Log)
12
PinSEL – System Effect Log (SEL)
Logging program load values needed for deterministic replay:
– First access from a memory location
– Values modified by the system (system effect) and read by the program
– Machine- and time-sensitive instructions (cpuid, rdtsc)
[Figure: program execution around a system call; each load is marked Logged or Not Logged]
  Load A; (A = 111)
  Load C; (C = 9)
  Load D; (D = 10)
  Store A; (A ← 111)
  Store B; (B ← 55)
  system call: modifies location (B -> 0) and (C -> 99)
  Load B; (B = 0)
  Load C; (C = 99)
  Load D; (D = 10)
• Trace size is ~4-5 bytes per instruction
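The logging rules above can be sketched with a shadow copy of memory: a load is logged when the location has not been seen before, or when its value differs from what the program last read or wrote. This is an illustrative model, not PinSEL's implementation. LoadLogger and its members are hypothetical names, the order of the accesses is one reading of the figure, and B's initial value is arbitrary.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorder: `shadow` holds the values the program itself has
// already observed or written, so a later replay can regenerate them.
struct LoadLogger {
    std::map<std::string, int> memory;  // real memory (the system may change it)
    std::map<std::string, int> shadow;  // values already known to the program
    std::vector<std::pair<std::string, int>> log;  // logged (location, value)

    // Program load: log on first access or when the value differs from shadow.
    int load(const std::string& loc) {
        int value = memory[loc];
        auto it = shadow.find(loc);
        if (it == shadow.end() || it->second != value) {
            log.emplace_back(loc, value);
            shadow[loc] = value;
        }
        return value;
    }

    // Program store: reproducible during replay, so nothing is logged.
    void store(const std::string& loc, int value) {
        memory[loc] = value;
        shadow[loc] = value;
    }

    // System effect: memory changes behind the program's back.
    void system_write(const std::string& loc, int value) {
        memory[loc] = value;
    }
};

// Runs the access sequence from the figure and returns the load log.
std::vector<std::pair<std::string, int>> run_slide_trace() {
    LoadLogger m;
    m.memory = {{"A", 111}, {"B", 7}, {"C", 9}, {"D", 10}};  // B = 7 is arbitrary
    m.load("A");
    m.load("C");
    m.load("D");
    m.store("A", 111);
    m.store("B", 55);
    m.system_write("B", 0);   // the system call modifies B ...
    m.system_write("C", 99);  // ... and C
    m.load("B");
    m.load("C");
    m.load("D");              // unchanged since the first load: not logged
    return m.log;
}
```

Under these assumptions, five of the six loads are logged: the three first accesses and the two loads of locations changed by the system call. The final load of D is the only one that is not logged.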
13
Optimization: Trace select reads
• Observation: hardware caches eliminate most off-chip reads
• Optimize logging:
  – Logger and replayer simulate identical cache memories
  – A simple cache (the memory copy structure) decides which values to log. No tags or valid bits to check. If the values mismatch, they are logged.
• Average trace size is <1 bit per instruction
Example:
  i = 1;
  for (j = 0; j < 10; j++) { i = i + j; }
  k = i; // value read is 46
  System_call();
  k = i; // value read is 0 (not predicted)
The only read that is not predicted, and is therefore logged, is the one that follows the system call
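The symmetry between logger and replayer can be sketched as follows: both keep the same copy of memory, the recorder logs a load only when the copy mispredicts, and the replayer takes a value from the log only at those load positions. Recorder, Replayer and run_program are hypothetical names, the copy is a plain map with no tags or valid bits (an untouched location predicts 0), and the system call is modeled as simply setting i to 0, as in the example.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Log = std::vector<std::pair<std::size_t, int>>;  // (load index, value)

// Recording side: real memory plus the simulated copy used as a predictor.
struct Recorder {
    std::map<char, int> memory, copy;
    Log log;
    std::size_t loads = 0;

    int load(char loc) {
        int value = memory[loc];
        if (copy[loc] != value) {   // predictor mismatch: log the value
            log.emplace_back(loads, value);
            copy[loc] = value;
        }
        ++loads;
        return value;
    }
    void store(char loc, int v) { memory[loc] = v; copy[loc] = v; }
    void system_call() { memory['i'] = 0; }  // system effect, unseen by the copy
};

// Replay side: no real memory, only the identical copy plus the log.
struct Replayer {
    std::map<char, int> copy;
    Log log;
    std::size_t loads = 0, next = 0;

    int load(char loc) {
        if (next < log.size() && log[next].first == loads)
            copy[loc] = log[next++].second;  // mispredicted during recording
        ++loads;
        return copy[loc];
    }
    void store(char loc, int v) { copy[loc] = v; }
    void system_call() {}  // not re-executed; its effect comes from the log
};

// The program from the example, written against either machine.
template <typename Machine>
std::vector<int> run_program(Machine& m) {
    std::vector<int> observed;  // the two values read into k
    m.store('i', 1);
    for (int j = 0; j < 10; j++) m.store('i', m.load('i') + j);
    int k = m.load('i');
    m.store('k', k);
    observed.push_back(k);
    m.system_call();
    k = m.load('i');
    m.store('k', k);
    observed.push_back(k);
    return observed;
}
```

Running run_program on a Recorder and then on a Replayer that was given the recorder's log yields the same observed values, with a single log entry for the load that follows the system call.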
14
Example Overhead
• PinSEL and PinPLAY initial work (2006) with single-threaded programs:
  – SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (w/o in-lining)
  – Working with a subset of SPLASH2 benchmarks: 230x slowdown for PinSEL
• Now: geo-mean SPEC2006
  – Pin 1.4x
  – Logger 83.6x
  – Replayer 1.4x
15
Example: Microsoft iDNA Trace Writer Performance
Application  Simulated      Trace File  Trace File    Native     Execution   Execution
             Instructions   Size        Bits /        Execution  Time While  Overhead
             (millions)                 Instruction   Time       Tracing
Gzip         24,097         245 MB      0.09          11.7s      187s        15.98
Excel        1,781          99 MB       0.47          18.2s      105s        5.76
PowerPoint   7,392          528 MB      0.60          43.6s      247s        5.66
IE           116            5 MB        0.50          0.499s     6.94s       13.90
Vulcan       2,408          152 MB      0.53          2.74s      46.6s       17.01
Satsolver    9,431          1300 MB     1.16          9.78s      127s        12.98
• Memchecker and valgrind are in the 30-40x range on CPU 2006
• iDNA: ~11x (does not log shared-memory dependences explicitly)
• Uses a sequential number for every lock-prefixed memory operation: offline data race analysis
16
Logging Shared Memory Ordering (Cristiano’s PinSEL/PLAY Overview)
Emulation of Directory-Based Cache Coherence
• Identifies RAW, WAR, WAW dependences
• Indexed by hashing the effective address
• Each entry represents an address range
[Figure: memory references in the program execution (Store A, Load B) are hashed to Dir Entry slots in the Directory]
17
Directory Entries
• Every DirEntry maintains:
  – Thread id of the last_writer
  – A timestamp: the # of memory references the thread has executed
  – Vector of timestamps of the last access by each thread to that entry
• On loads: update the timestamp for the thread in the entry
• On stores: update the timestamp and the last_writer fields
[Figure: program execution of Thread T1 and Thread T2, and the Directory they update]
  Thread T1: 1: Store A; 2: Load A; 3: Store F
  Thread T2: 1: Load F; 2: Store A; 3: Load F
  DirEntry [A:D]: last writer id, timestamp vector (T1, T2)
  DirEntry [E:H]: last writer id, timestamp vector (T1, T2)
18
Detecting Dependences
• RAW dependency between threads T and T’ is established if:
  – T executes a load that maps to the directory entry A
  – T’ is the last_writer for the same entry
• WAW dependency between T and T’ is established if:
  – T executes a store that maps to the directory entry A
  – T’ is the last_writer for the same entry
• WAR dependency between T and T’ is established if:
  – T executes a store that maps to the directory entry A
  – T’ has accessed the same entry in the past and T is not the last_writer
19
Example
[Figure: program execution of Thread T1 and Thread T2 with the Directory entries [A:D] and [E:H], annotated with WAW, RAW and WAR dependences. Callouts: Last_writer; Last access to the DirEntry]
  Thread T1: 1: Store A; 2: Load A; 3: Store F
  Thread T2: 1: Load F; 2: Store A; 3: Load F
SMO logs:
  T1 2 T2 2: Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2
  T1 3 T2 3: Thread T1 cannot execute memory reference 3 until T2 executes its memory reference 3
  T2 2 T1 1: Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1
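The directory bookkeeping and the three dependence rules can be sketched as below. This is an illustrative model, not the PinSEL/PinPLAY code. The mapping of addresses A..D and E..H to two entries stands in for the hash, the names Directory, DirEntry and Dependence are hypothetical, and the global interleaving used in run_slide_example() is an assumption chosen to be consistent with the SMO logs shown in the example.

```cpp
#include <map>
#include <string>
#include <vector>

struct Dependence {
    std::string kind;       // "RAW", "WAW" or "WAR"
    int thread, ref;        // the access that must wait ...
    int on_thread, on_ref;  // ... for this earlier access
};

struct DirEntry {
    int last_writer = -1;            // -1: no store seen yet
    std::map<int, int> last_access;  // thread id -> timestamp of last access
};

class Directory {
public:
    // Records one access and returns the dependences it establishes.
    std::vector<Dependence> access(int thread, char addr, bool is_store) {
        int ts = ++refs_[thread];                  // per-thread reference count
        DirEntry& e = entries_[(addr - 'A') / 4];  // assumed hash: [A:D], [E:H]
        std::vector<Dependence> deps;
        if (!is_store) {
            // RAW: a load whose entry was last written by another thread.
            if (e.last_writer != -1 && e.last_writer != thread)
                deps.push_back({"RAW", thread, ts, e.last_writer,
                                e.last_access[e.last_writer]});
        } else if (e.last_writer != thread) {
            // Store by a thread that is not the last writer:
            // WAW against the last writer, WAR against other past accessors.
            for (const auto& kv : e.last_access) {
                if (kv.first == thread) continue;
                deps.push_back({kv.first == e.last_writer ? "WAW" : "WAR",
                                thread, ts, kv.first, kv.second});
            }
        }
        e.last_access[thread] = ts;
        if (is_store) e.last_writer = thread;
        return deps;
    }

private:
    std::map<int, int> refs_;
    std::map<int, DirEntry> entries_;
};

// Replays the two-thread example under one assumed interleaving.
std::vector<Dependence> run_slide_example() {
    struct Step { int thread; char addr; bool is_store; };
    const Step steps[] = {{1, 'A', true},  {2, 'F', false}, {2, 'A', true},
                          {1, 'A', false}, {2, 'F', false}, {1, 'F', true}};
    Directory dir;
    std::vector<Dependence> smo_log;
    for (const Step& s : steps)
        for (const Dependence& d : dir.access(s.thread, s.addr, s.is_store))
            smo_log.push_back(d);
    return smo_log;
}
```

With this interleaving the sketch produces three entries: a WAW (T2 2 waits for T1 1), a RAW (T1 2 waits for T2 2) and a WAR (T1 3 waits for T2 3).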
20
Ordering Memory Accesses (Reducing log size)
• Preserving order will reproduce the execution
  – a→b: “a happens-before b”
  – Ordering is transitive: a→b, b→c means a→c
• Two instructions must be ordered if:
  – they both access the same memory, and
  – one of them is a write
21
Constraints: Enforcing Order
[Figure: P1 executes a then b; P2 executes c then d; label: overconstrained]
• To guarantee a→d, any one of: a→d, b→d, a→c, b→c
• Suppose we need b→c
  – b→c is necessary
  – a→d is redundant
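The redundancy argument is a reachability question on the happens-before graph: once b→c is logged, a→d follows from program order on both sides, but the reverse does not hold. A minimal sketch, with HappensBefore and slide_example as hypothetical names:

```cpp
#include <map>
#include <set>
#include <vector>

// Directed happens-before edges between events named by single characters.
class HappensBefore {
public:
    void add(char before, char after) { edges_[before].insert(after); }

    // True if `from` happens-before `to` through one or more edges.
    bool ordered(char from, char to) const {
        std::set<char> seen;
        std::vector<char> work{from};
        while (!work.empty()) {
            char cur = work.back();
            work.pop_back();
            auto it = edges_.find(cur);
            if (it == edges_.end()) continue;
            for (char next : it->second) {
                if (next == to) return true;
                if (seen.insert(next).second) work.push_back(next);
            }
        }
        return false;
    }

private:
    std::map<char, std::set<char>> edges_;
};

// P1 runs a then b; P2 runs c then d; one cross-thread edge is logged.
HappensBefore slide_example(char cross_from, char cross_to) {
    HappensBefore hb;
    hb.add('a', 'b');  // program order on P1
    hb.add('c', 'd');  // program order on P2
    hb.add(cross_from, cross_to);
    return hb;
}
```

Calling slide_example('b', 'c').ordered('a', 'd') returns true, while slide_example('a', 'd').ordered('b', 'c') returns false, which is why logging b→c alone is enough.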
22
Problem Formulation
Reproduce exact same conflicts: no more, no less
[Figure: Recording and Replay panels connected through the Log. Each panel shows Thread I and Thread J executing ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Legend: Conflicts (red), Dependence (black)]
23
Log All Conflicts
• Detect conflicts
• Write log
• Assign IC (logical timestamps)
[Figure: Replay panel with Thread I and Thread J, instructions numbered 1-6 in each thread]
Dependence Log (16 bytes per entry)
  Log J: 2→3, 1→4, 3→5, 4→6
  Log I: 2→3
  Log Size: 5*16 = 80 bytes (10 integers)
But too many conflicts
24
Netzer’s Transitive Reduction
[Figure: Replay panel with Thread I and Thread J, instructions numbered 1-6 in each thread]
TR Reduced Log
  Log J: 2→3, 3→5, 4→6 (1→4 is TR reduced)
  Log I: 2→3
  Log Size: 64 bytes (8 integers)
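For a single pair of threads, the reduction can be illustrated offline: an entry s→d is redundant when another entry s'→d' has s' at or after s and d' at or before d, because program order then gives s→s'→d'→d. The sketch below applies that test to Log J. It is a simplified illustration, not Netzer's online algorithm, and the names Edge, implies and reduce are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One logged dependence: instruction `first` of the source thread must run
// before instruction `second` of the destination thread.
using Edge = std::pair<int, int>;

// True if `strong` plus program order implies `weak`:
// weak.first -> strong.first -> strong.second -> weak.second.
bool implies(const Edge& strong, const Edge& weak) {
    return strong.first >= weak.first && strong.second <= weak.second;
}

// Drops every entry that is implied by another entry of the same log.
std::vector<Edge> reduce(const std::vector<Edge>& log) {
    std::vector<Edge> kept;
    for (std::size_t i = 0; i < log.size(); ++i) {
        bool redundant = false;
        for (std::size_t j = 0; j < log.size() && !redundant; ++j) {
            if (i == j) continue;
            // Of two identical entries, only the later copy is dropped.
            if (implies(log[j], log[i]) && (log[j] != log[i] || j < i))
                redundant = true;
        }
        if (!redundant) kept.push_back(log[i]);
    }
    return kept;
}
```

Applied to the full Log J {2→3, 1→4, 3→5, 4→6}, the sketch keeps {2→3, 3→5, 4→6}: the entry 1→4 is implied by 2→3. Together with the single entry of Log I, that is 4 entries of 16 bytes, or 64 bytes.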
25
RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization
[Figure: Replay panel with Thread I and Thread J, instructions numbered 1-6 in each thread]
New Reduced Log
  Log J: 2→3, 4→5 (stricter; 3→5 and 4→6 are reduced)
  Log I: 2→3
  Log Size: 48 bytes (6 integers)
4% overhead for RTR+FDR (simulated on GEMS). 2 MB/core/second logging (Apache)
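A quick way to see why the stricter entry 4→5 is sufficient is to check that every originally recorded conflict is still enforced by some entry of the smaller log. The sketch below does only that check. It does not show how RTR chooses the stricter dependence, and covers and log_bytes are hypothetical helper names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Instruction `first` of the source thread runs before instruction
// `second` of the destination thread.
using Edge = std::pair<int, int>;

// True if `strong` plus program order implies `weak`.
bool implies(const Edge& strong, const Edge& weak) {
    return strong.first >= weak.first && strong.second <= weak.second;
}

// True if every recorded conflict is enforced by some logged entry.
bool covers(const std::vector<Edge>& log, const std::vector<Edge>& conflicts) {
    for (const Edge& c : conflicts) {
        bool enforced = false;
        for (const Edge& e : log)
            if (implies(e, c)) { enforced = true; break; }
        if (!enforced) return false;
    }
    return true;
}

// Log size at 16 bytes per entry, as in the slides.
std::size_t log_bytes(std::size_t entries) { return entries * 16; }
```

With the recorded conflicts {2→3, 1→4, 3→5, 4→6} for Log J, the log {2→3, 4→5} covers all four, and together with the one entry of Log I the total is log_bytes(3) = 48 bytes.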