What is a Data Race?

Two concurrent accesses to a shared location, at least one of them for writing. Indicative of a bug.

[Figure: Thread 1 executes X++ and then Z=2; Thread 2 executes T=Y and then T=X. Both threads access X.]

How Can Data Races be Prevented?

Explicit synchronization between threads:
    Locks
    Critical sections
    Barriers
    Mutexes
    Semaphores
    Monitors
    Events
    Etc.

[Figure: Thread 1 executes Lock(m); X++; Unlock(m). Thread 2 then executes Lock(m); T=X; Unlock(m).]
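The Lock(m)/Unlock(m) pattern from the figure can be sketched in C++ as follows. This is a minimal illustration, assuming std::mutex stands in for the lock m; the Shared struct and the function names increment and read_value are hypothetical and not part of the slides.

```cpp
#include <mutex>

// Hypothetical shared state: X is the shared variable from the slide,
// and m plays the role of the lock "m".
struct Shared {
    std::mutex m;
    int X = 0;
};

// Thread 1's work: Lock(m); X++; Unlock(m)
void increment(Shared& s) {
    std::lock_guard<std::mutex> guard(s.m);  // unlocks automatically at scope exit
    ++s.X;
}

// Thread 2's work: Lock(m); T = X; Unlock(m)
int read_value(Shared& s) {
    std::lock_guard<std::mutex> guard(s.m);
    return s.X;
}
```

If increment and read_value are run from two std::thread objects, the reader observes either the old or the new value of X, but the two accesses never overlap.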

Is This Sufficient?

Yes! No!

Programmer dependent
    Correctness: the programmer may forget to synchronize
    Need tools to detect data races

Expensive
    Efficiency: to achieve correctness, the programmer may overdo it
    Need tools to remove excessive synchronization

#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    int i;  // declared outside the loop so it is still in scope after it
    lock(g_lock);
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break;  // Found!!!
    if (i == number) i = -1;           // Not found… Return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );
}

Where is Waldo?

Can You Find the Race?

A similar problem was found in java.util.Vector.

[Annotations on the code above: a "write" marker and a "read" marker identify the two racing accesses. popAll( ) writes g_counter while holding g_lock, but find( Type& obj ) reads g_counter before g_lock is acquired.]
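One way to close this race is to read the element count and scan the array inside a single critical section. The sketch below is an assumption-laden illustration, not the slides' code: LockedStack is a hypothetical class, int replaces Type, and std::vector and std::mutex replace the raw array and the Lock type.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the slide's globals, with int as the element type.
class LockedStack {
public:
    void push(int v) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(v);
    }

    void popAll() {
        std::lock_guard<std::mutex> guard(lock_);
        items_.clear();
    }

    // Reads the element count and scans the elements inside ONE critical
    // section, so popAll() cannot empty the stack between the two steps.
    int find(int v) {
        std::lock_guard<std::mutex> guard(lock_);
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (items_[i] == v) return static_cast<int>(i);
        }
        return -1;  // not found
    }

private:
    std::mutex lock_;
    std::vector<int> items_;
};
```

The design choice is that no caller ever passes a previously read count into the locked search, which is exactly what find( obj, g_counter ) does in the racy version.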

Detecting Data Races?

NP-hard [Netzer & Miller 1990]
    Input size = # instructions performed
    Even for 3 threads only
    Even with no loops/recursion

Execution orders/scheduling: (#threads)^(thread_length)
# inputs
Detection code's side effects
Weak memory, instruction reordering, atomicity

Motivation

Run-time framework goals
    Collect a complete trace of a program's user-mode execution
    Keep the tracing overhead for both space and time low
    Re-simulate the traced execution deterministically based on the collected trace, with full fidelity down to the instruction level
        Full fidelity: user mode only, no tracing of the kernel, only user-mode I/O callbacks

Advantages
    Complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)
    Trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)

Challenges: Trace Size and Performance

Original Record-Replay Approaches

InstantReplay '87
    Records the order of memory accesses
    Overhead may affect program behavior

RecPlay '00
    Records only synchronizations
    Not deterministic if the program has data races

Netzer '93
    Records an optimal trace
    Too expensive to keep track of all memory locations

Bacon & Goldstein '91
    Records memory bus transactions with hardware
    High logging bandwidth

Motivation

Increasing use of, and development for, multi-core processors

MT program behavior is non-deterministic
    Shared memory updates happen in a different order

To effectively debug software, developers must be able to replay executions that exhibit concurrency bugs

Related Concepts

Runtime interpretation/translation of binary instructions
    Requires no static instrumentation or special symbol information
    Handles dynamically generated code and self-modifying code
    Recording/Logging: ~100-200x

More recent logging
    Proposed hardware support (for the MT domain)
    FDR (Flight Data Recorder)
    BugNet (cache bits set on first load)
    RTR (Regulated Transitive Reduction)
    DeLorean (ISCA 2008; chunks of instructions)
    Strata (time layer across all the logs for the running threads)
    iDNA (Diagnostic infrastructure using NirvanA; Microsoft)

Deterministic Replay

Re-execute the exact same sequence of instructions as recorded in a previous run

Single-threaded programs
    Record load values needed for reproducing the behavior of a run (Load Log)
    Registers updated by system calls and signal handlers (Reg Log)
    Output of special instructions: RDTSC, CPUID (Reg Log)
    System calls (virtualization: cloning arguments, updates)
    Checkpointing (log summary ~10 million)

Multi-threaded programs
    Log interleaving among threads (shared memory update ordering: SMO Log)

PinSEL – System Effect Log (SEL)

Logging program load values needed for deterministic replay:
– First access from a memory location
– Values modified by the system (system effect) and read by the program
– Machine- and time-sensitive instructions (cpuid, rdtsc)

[Figure: program execution timeline in which each load is marked "Logged" or "Not Logged". Operations shown: Store A (A <- 111); Store B (B <- 55); Load C (C = 9); Load D (D = 10); a system call that modifies locations B (B -> 0) and C (C -> 99); Load A (A = 111); Load B (B = 0); Load C (C = 99); Load D (D = 10).]

• Trace size is ~4-5 bytes per instruction

Optimization: Trace select reads

Observation: hardware caches eliminate most off-chip reads

Optimize logging:
    Logger and replayer simulate identical cache memories
    A simple cache (the memory copy structure) decides which values to log
    No tags or valid bits to check
    If the values mismatch, they are logged

Average trace size is <1 bit per instruction

i = 1;
for (j = 0; j < 10; j++) {
    i = i + j;
}
k = i;            // value read is 46
System_call();
k = i;            // value read is 0 (not predicted)

The only read not predicted, and therefore logged, follows the system call
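The mismatch-only logging idea can be sketched as a pair of classes that keep identical shadow copies of memory. This is a simplified illustration, not PinSEL's implementation: the names Logger, Replayer and LogEntry are hypothetical, and a hash map replaces the tagless memory copy structure mentioned on the slide, so a first access to a location is treated as a mismatch.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Addr = std::uint64_t;
using Value = std::uint64_t;

// One logged load: which load (by count) and the value it actually read.
struct LogEntry { std::uint64_t load_index; Value value; };

class Logger {
public:
    // Program stores update the shadow copy, so later loads are predictable.
    void on_store(Addr a, Value v) { shadow_[a] = v; }

    // A load is logged only when the shadow copy fails to predict it.
    void on_load(Addr a, Value actual) {
        auto it = shadow_.find(a);
        if (it == shadow_.end() || it->second != actual) {
            log_.push_back({loads_, actual});
        }
        shadow_[a] = actual;
        ++loads_;
    }

    const std::vector<LogEntry>& log() const { return log_; }

private:
    std::unordered_map<Addr, Value> shadow_;
    std::vector<LogEntry> log_;
    std::uint64_t loads_ = 0;
};

class Replayer {
public:
    explicit Replayer(std::vector<LogEntry> log) : log_(std::move(log)) {}

    void on_store(Addr a, Value v) { shadow_[a] = v; }

    // Returns the value the load must produce: the logged value if this
    // load was logged, otherwise the value predicted by the shadow copy.
    Value on_load(Addr a) {
        if (next_ < log_.size() && log_[next_].load_index == loads_) {
            shadow_[a] = log_[next_].value;
            ++next_;
        }
        ++loads_;
        return shadow_[a];
    }

private:
    std::unordered_map<Addr, Value> shadow_;
    std::vector<LogEntry> log_;
    std::size_t next_ = 0;
    std::uint64_t loads_ = 0;
};
```

Feeding the logger the slide's sequence (a store of 46 to i, a load that reads 46, then a load that reads 0 after the system call) produces a single log entry, for the second load.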

Example Overhead

PinSEL and PinPLAY

Initial work (2006) with single-threaded programs:
    SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (w/o in-lining)

Working with a subset of SPLASH2 benchmarks: 230x slowdown for PinSEL

Now: geo-mean on SPEC2006
    Pin       1.4x
    Logger    83.6x
    Replayer  1.4x

Example: Microsoft iDNA Trace Writer Performance

Application   Simulated      Trace File   Trace File     Native      Execution Time   Execution
              Instructions   Size         Bits /         Execution   While Tracing    Overhead
              (millions)                  Instruction    Time
Gzip          24,097         245 MB       0.09           11.7s       187s             15.98
Excel         1,781          99 MB        0.47           18.2s       105s             5.76
PowerPoint    7,392          528 MB       0.60           43.6s       247s             5.66
IE            116            5 MB         0.50           0.499s      6.94s            13.90
Vulcan        2,408          152 MB       0.53           2.74s       46.6s            17.01
Satsolver     9,431          1300 MB      1.16           9.78s       127s             12.98

• Memchecker and valgrind are in the 30-40x range on CPU 2006
• iDNA ~11x (does not log shared-memory dependences explicitly)
• Uses a sequential number for every lock-prefixed memory operation: offline data race analysis

Logging Shared Memory Ordering (Cristiano's PinSEL/PLAY Overview)

Emulation of Directory-Based Cache Coherence
    Identifies RAW, WAR, WAW dependences
    Indexed by hashing the effective address
    Each entry represents an address range

[Figure: memory references from the program execution (Store A, Load B) are hashed to select a Dir Entry in the Directory.]

Directory Entries

Every DirEntry maintains:
    Thread id of the last_writer
    A timestamp: the number of memory references the thread has executed
    Vector of timestamps of the last access by each thread to that entry

On loads: update the timestamp for the thread in the entry
On stores: update the timestamp and the last_writer fields

[Figure: program execution of Thread T1 and Thread T2 with numbered memory references (1: Store A, 2: Load A, 1: Load F, 2: Store A, 3: Load F, 3: Store F), and a Directory holding DirEntry [A:D] and DirEntry [E:H]. Each entry shows its last writer id and its vector of per-thread timestamps (T1:, T2:), updated as the references execute.]

Detecting Dependences

RAW dependency between threads T and T’ is established if:
    T executes a load that maps to the directory entry A
    T’ is the last_writer for the same entry

WAW dependency between T and T’ is established if:
    T executes a store that maps to the directory entry A
    T’ is the last_writer for the same entry

WAR dependency between T and T’ is established if:
    T executes a store that maps to the directory entry A
    T’ has accessed the same entry in the past and T is not the last_writer
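The three rules above can be sketched as a small directory that returns the dependences created by each access. This is an illustrative sketch under stated assumptions, not PinSEL's code: the names Directory, DirEntry, Dependence and kRange are hypothetical, each entry covers a range of four addresses, and the hash is a simple modulo.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Dep { RAW, WAW, WAR };

// "to_thread may not execute its reference to_ts until from_thread
// has executed its reference from_ts."
struct Dependence {
    Dep kind;
    int from_thread;
    std::uint64_t from_ts;
    int to_thread;
    std::uint64_t to_ts;
};

struct DirEntry {
    int last_writer = -1;                    // -1: no store seen yet
    std::vector<std::uint64_t> last_access;  // per thread; 0 means "never"
    explicit DirEntry(std::size_t threads) : last_access(threads, 0) {}
};

class Directory {
public:
    Directory(std::size_t entries, std::size_t threads)
        : entries_(entries, DirEntry(threads)), clock_(threads, 0) {}

    std::vector<Dependence> on_load(int t, std::uint64_t addr) { return access(t, addr, false); }
    std::vector<Dependence> on_store(int t, std::uint64_t addr) { return access(t, addr, true); }

private:
    static constexpr std::uint64_t kRange = 4;  // assumed addresses per entry

    std::vector<Dependence> access(int t, std::uint64_t addr, bool is_store) {
        // The timestamp is the number of memory references t has executed.
        const std::uint64_t ts = ++clock_[static_cast<std::size_t>(t)];
        DirEntry& e = entries_[(addr / kRange) % entries_.size()];
        std::vector<Dependence> deps;
        if (!is_store) {
            // RAW: another thread is the last_writer of this entry.
            if (e.last_writer >= 0 && e.last_writer != t) {
                deps.push_back({Dep::RAW, e.last_writer, e.last_access[e.last_writer], t, ts});
            }
        } else {
            for (int other = 0; other < static_cast<int>(e.last_access.size()); ++other) {
                if (other == t || e.last_access[other] == 0) continue;
                if (other == e.last_writer) {
                    // WAW: the other thread is the last_writer.
                    deps.push_back({Dep::WAW, other, e.last_access[other], t, ts});
                } else if (e.last_writer != t) {
                    // WAR: the other thread accessed the entry and t is not the last_writer.
                    deps.push_back({Dep::WAR, other, e.last_access[other], t, ts});
                }
            }
            e.last_writer = t;
        }
        e.last_access[t] = ts;
        return deps;
    }

    std::vector<DirEntry> entries_;
    std::vector<std::uint64_t> clock_;
};
```

Because an entry covers an address range, two accesses to different addresses in the same range are also reported as dependent. That is conservative: it can add ordering constraints but never misses one.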

Example

[Figure: program execution of Thread T1 and Thread T2 against the Directory (DirEntry [A:D] and DirEntry [E:H], each with a last writer id and per-thread timestamps T1:, T2:). Memory references shown: 1: Store A, 2: Load A, 1: Load F, 2: Store A, 3: Load F, 3: Store F. The detected dependences are labeled WAW, RAW and WAR.]

SMO logs:
    T1 2 T2 2: Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2
    T1 3 T2 3
    T2 2 T1 1: Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1

[Callouts on the log fields: "Last_writer" and "Last access to the DirEntry".]
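During replay, an SMO log entry acts as a gate on a thread's next memory reference. The check can be sketched as a pure function; SmoRecord and may_execute are hypothetical names, and a real replayer would block or reschedule the thread rather than just return false.

```cpp
#include <cstdint>
#include <vector>

// One SMO log record: `thread` may not execute its memory reference `ref`
// until `wait_thread` has executed its memory reference `wait_ref`.
struct SmoRecord {
    int thread;
    std::uint64_t ref;
    int wait_thread;
    std::uint64_t wait_ref;
};

// executed[t] is the number of memory references thread t has completed.
// Returns true if `thread` may execute its next memory reference now.
bool may_execute(const std::vector<SmoRecord>& log,
                 const std::vector<std::uint64_t>& executed,
                 int thread) {
    const std::uint64_t next = executed[thread] + 1;
    for (const SmoRecord& r : log) {
        if (r.thread == thread && r.ref == next &&
            executed[r.wait_thread] < r.wait_ref) {
            return false;  // the awaited reference has not happened yet
        }
    }
    return true;
}
```

With the three log entries from the example (threads numbered 0 for T1 and 1 for T2), T1 is held before its reference 2 until T2 has completed two references.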

Ordering Memory Accesses (Reducing log size)

Preserving order will reproduce execution
    a→b: “a happens-before b”
    Ordering is transitive: a→b, b→c means a→c

Two instructions must be ordered if:
    they both access the same memory, and
    one of them is a write

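Constraints: Enforcing Order

To guarantee a→d, any one of these constraints is enough:
    a→d
    b→d
    a→c
    b→c

Suppose we need b→c
    b→c is necessary
    a→d is redundant

[Figure: P1 executes a then b; P2 executes c then d. One of the orderings is labeled "overconstrained".]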

Problem Formulation

Reproduce exact same conflicts: no more, no less

[Figure: Thread I and Thread J are shown during Recording and again during Replay, with a Log between the two phases. The instructions shown are ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Legend: conflicts (red), dependence (black).]

Log All Conflicts

Detect conflicts, write log
Assign IC (logical timestamps)

[Figure: replay of Thread I and Thread J, each with instructions numbered 1-6 (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D), with every conflict drawn as a dependence.]

Dependence Log (16 bytes per dependence)
    Log J: 2 3, 1 4, 3 5, 4 6
    Log I: 2 3
    Log Size: 5*16 = 80 bytes (10 integers)

But too many conflicts

Netzer’s Transitive Reduction

[Figure: replay of Thread I and Thread J, each with instructions numbered 1-6 (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D), with the transitively implied dependence removed.]

TR Reduced Log
    Log J: 2 3, 3 5, 4 6
    Log I: 2 3
    Log Size: 64 bytes (8 integers)
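The reduction of Log J from four pairs to three can be reproduced with a simplified rule. The sketch assumes each pair reads as (source instruction count, destination instruction count), and it only drops an edge that is implied by one other edge in the same direction combined with program order. It is not Netzer's full algorithm, which also follows chains through dependences in the opposite direction. The names Edge, implies and reduce are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Edge (src, dst): instruction `src` of the source thread must execute
// before instruction `dst` of the destination thread.
struct Edge {
    std::uint32_t src;
    std::uint32_t dst;
};

bool same(const Edge& a, const Edge& b) {
    return a.src == b.src && a.dst == b.dst;
}

// `strong` implies `weak` when weak.src runs no later than strong.src and
// strong.dst runs no later than weak.dst (both by program order).
bool implies(const Edge& strong, const Edge& weak) {
    return weak.src <= strong.src && strong.dst <= weak.dst;
}

// Keeps only the edges that no other edge implies (first copy of duplicates).
std::vector<Edge> reduce(const std::vector<Edge>& edges) {
    std::vector<Edge> kept;
    for (std::size_t i = 0; i < edges.size(); ++i) {
        bool redundant = false;
        for (std::size_t j = 0; j < edges.size() && !redundant; ++j) {
            if (i == j) continue;
            if (implies(edges[j], edges[i]) && (!same(edges[j], edges[i]) || j < i)) {
                redundant = true;
            }
        }
        if (!redundant) kept.push_back(edges[i]);
    }
    return kept;
}
```

Applied to the pairs 2 3, 1 4, 3 5, 4 6, the rule drops 1 4, because 2 3 already forces it, and keeps the other three, matching the reduced Log J above.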

RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization

[Figure: replay of Thread I and Thread J, each with instructions numbered 1-6 (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D). A dependence labeled "stricter" is introduced, and the dependences it covers are labeled "Reduced".]

New Reduced Log
    Log J: 2 3, 4 5
    Log I: 2 3
    Log Size: 48 bytes (6 integers)

4% overhead for RTR+FDR (simulated on GEMS)
.2 MB/core/second logging (Apache)

Date post:	06-Jan-2016