Date post: 18-Jan-2018
Upload: philip-hancock
Execution Replay and Debugging
Contents
Introduction
• Parallel program: a set of co-operating processes
• Co-operation using
  – shared variables
  – message passing
• Developing parallel programs is considered difficult:
  – normal errors, as in sequential programs
  – synchronisation errors (deadlock, races)
  – performance errors
• We need good development tools
Debugging of parallel programs
• Most used technique: cyclic debugging
• Requires repeatable, equivalent executions
• This is a problem for parallel programs: lots of non-determinism present
• Solution: an execution replay mechanism:
  – record phase: trace information about the non-deterministic choices
  – replay phase: force an equivalent re-execution using the trace, allowing the use of intrusive debugging techniques
Non-determinism
• Classes:
  – external vs. internal non-determinism
  – desired vs. undesired non-determinism
• Important: the amount of non-determinism depends on the abstraction level. E.g. a semaphore P()-operation can be fully deterministic while consisting of a number of non-deterministic spinlocking operations.
Causes of Non-determinism
– In sequential programs:
  • program code (self-modifying code?)
  • program input (disk, keyboard, network, ...)
  • certain system calls (gettimeofday())
  • interrupts, signals, ...
– In parallel programs:
  • accesses to shared variables: race conditions (synchronisation races and data races)
– In distributed programs:
  • promiscuous receive operations
  • test operations for non-blocking message operations
Main Issues in Execution Replay
• recorded execution = original execution:
  – trace as little as possible in order to limit the overhead
    • in time
    • in space
• replayed execution = recorded execution:
  – faithful re-execution: trace enough
Execution Replay Methods
• Two types: content- vs. ordering-based
  – content-based: force each process to read the same value or to receive the same message as during the original execution
  – ordering-based: force each process to access the variables or to receive the messages in the same logical order as during the original execution
Logical Clocks for Ordering-based Methods
• A clock C() attaches a timestamp C(x) to an event x
• Used for tracing the logical order of events
• Clock condition: a → b ⇒ C(a) < C(b)
• Clocks are strongly consistent if C(a) < C(b) ⇒ a → b
• New timestamp is the increment of the maximum of the old timestamps of the process and the object
Scalar Clocks
• Aka Lamport clocks
• Simple and fast update algorithm:
  SC'(p) = SC'(o) = max(SC(p), SC(o)) + 1
• Scales very well with the number of processes
• Provides only limited information:
  a → b ⇒ SC(a) < SC(b)
  SC(a) < SC(b) ⇒ (a → b) or (a // b)
  SC(a) = SC(b) ⇒ a // b
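The update rule above can be sketched in a few lines of Python (an illustration, not part of the original slides; the function name is ours):

```python
def sc_update(sc_p, sc_o):
    """Lamport-style scalar clock update on an access of object o by
    process p: both clocks become the maximum of the two, plus one."""
    new = max(sc_p, sc_o) + 1
    return new, new  # (new process clock, new object clock)

# a process with clock 4 touching an object with clock 7: both become 8
assert sc_update(4, 7) == (8, 8)
```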
Vector Clocks
• A vector clock for a program using N processes consists of N scalar values
• Update algorithm:
  VC'(p) = VC'(o) = sup(VC(p), VC(o)) + (0, ..., 0, 1, 0, ..., 0)
• Such a clock is strongly consistent: by comparing vector timestamps one can deduce concurrency information:
  VC(a) < VC(b) ⇔ a → b
  VC(a) // VC(b) ⇔ a // b
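The vector-clock update and the comparison rules can be sketched as follows (illustrative helper names, assuming timestamps are plain Python lists):

```python
def vc_update(vc_p, vc_o, i):
    """New timestamp for process i: component-wise sup (max) of the
    process and object clocks, then increment process i's own component."""
    new = [max(a, b) for a, b in zip(vc_p, vc_o)]
    new[i] += 1
    return new

def vc_before(a, b):
    """a -> b  iff  VC(a) < VC(b): every component <=, at least one <."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def vc_concurrent(a, b):
    """a // b: neither event is ordered before the other."""
    return not vc_before(a, b) and not vc_before(b, a)
```

Because the comparison is exact in both directions, the clocks are strongly consistent: concurrency can be read off the timestamps alone.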
An Example Program
• A parallel program with two threads, communicating using shared variables: A, B, MA and MB. Local variables are x and y.
• MA and MB are used as mutexes, built on an atomic swap operation provided by the CPU:
  swap(memloc, value):
    return-value = [memloc]
    [memloc] = value
An Example Program (II)
• Lock operation on a mutex M is implemented (in a library) as:
  L(M): while (swap(M, 1) == 1);
• Unlock operation on a mutex M is implemented as:
  U(M): M = 0;
• All variables are initially 0
An Example Program (III)
• The example program:

Thread 1:
L(MA); A=8; U(MA);
L(MB); B=7; U(MB);

Thread 2:
B=6;
L(MB); x=B; U(MB);
L(MA); y=A; U(MA);
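As an illustration (not part of the original slides), the example program can be simulated in Python; a `threading.Lock` inside `swap()` stands in for the hardware atomicity of the CPU's swap instruction:

```python
import threading

mem = {"A": 0, "B": 0, "MA": 0, "MB": 0}  # all variables initially 0
_atomic = threading.Lock()                # stands in for hardware atomicity

def swap(loc, value):
    # atomic swap: store value, return the old contents of mem[loc]
    with _atomic:
        old = mem[loc]
        mem[loc] = value
        return old

def L(m):                                 # lock: spin until swap returns 0
    while swap(m, 1) == 1:
        pass

def U(m):                                 # unlock
    mem[m] = 0

result = {}

def thread1():
    L("MA"); mem["A"] = 8; U("MA")
    L("MB"); mem["B"] = 7; U("MB")

def thread2():
    mem["B"] = 6                          # unsynchronised write: data race
    L("MB"); result["x"] = mem["B"]; U("MB")
    L("MA"); result["y"] = mem["A"]; U("MA")

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start(); t2.start(); t1.join(); t2.join()
# x is 6 or 7 and y is 0 or 8, depending on the interleaving
```

Running this repeatedly shows the non-determinism the slides discuss: the locked regions are ordered, but the race on B and the relative order of the two threads are not.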
A Possible Execution: Low Level View

[Timeline diagram. Thread 1: swap(MA,1)→0; A=8; MA=0; swap(MB,1)→0; B=7; MB=0. Thread 2: B=6; three failed swap(MB,1)→1 (spinning); swap(MB,1)→0; x=B; MB=0; swap(MA,1)→0; y=A; MA=0.]
A Possible Execution: High Level View

[Timeline diagram, time running downwards. Thread 1: L(MA); A=8; U(MA); L(MB); B=7; U(MB). Thread 2: B=6; L(MB); x=B; U(MB); L(MA); y=A; U(MA).]
Recap
• A content-based replay method: the value read by each load operation is stored
• Trace generation of 1 MB/s was measured on a VAX 11/780
• Not a workable method: the time needed to record the large amount of trace information modifies the initial execution
• One advantage: it is possible to replay a subset of the processes in isolation.
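A minimal sketch of content-based record/replay (illustrative, with hypothetical class names): during recording every loaded value is appended to a trace; during replay, loads return the traced values instead of reading memory, which is why a single process can be replayed in isolation.

```python
class RecordingLoader:
    """Record phase: perform real loads and trace every value read."""
    def __init__(self, mem):
        self.mem = mem
        self.trace = []

    def load(self, addr):
        value = self.mem[addr]
        self.trace.append(value)  # content-based: store the value itself
        return value

class ReplayingLoader:
    """Replay phase: feed back the traced values; memory is not consulted,
    so the other processes are not needed during the replay."""
    def __init__(self, trace):
        self._values = iter(trace)

    def load(self, addr):
        return next(self._values)

# record an execution
mem = {"A": 8, "B": 7}
rec = RecordingLoader(mem)
seen = [rec.load("B"), rec.load("A")]

# replay it: identical values, even though memory changed meanwhile
mem["A"] = 99
rep = ReplayingLoader(rec.trace)
assert [rep.load("B"), rep.load("A")] == seen
```

The drawback from the slide is visible here as well: every load pays the cost of appending to the trace.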
Recap: Example

[The low-level execution from before, annotated with the traced load values: thread 1 logs 0, 0 for its two successful swaps; thread 2 logs 1, 1, 1, 0 for its swaps on MB, 7 for x=B, 0 for its swap on MA, and 8 for y=A.]
Instant Replay
• First ordering-based replay method
• Developed for CREW algorithms
• Each shared object receives a version number that is updated or logged at each CREW operation:
  – read: the version number is logged
  – write:
    • the version number is incremented
    • the number of preceding read operations is logged
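A sketch of this bookkeeping for one shared object (illustrative names; the exact record layout of Instant Replay may differ): readers log the version they observed, while a writer logs how many reads preceded it and then bumps the version.

```python
class SharedObject:
    """Per-object version bookkeeping in the style of Instant Replay."""
    def __init__(self):
        self.version = 1
        self.readers = 0  # reads since the last write

    def on_read(self, log):
        log.append(("read", self.version))  # log the version seen
        self.readers += 1

    def on_write(self, log):
        # log how many reads preceded this write, then start a new version
        log.append(("write", self.version, self.readers))
        self.version += 1
        self.readers = 0
```

During replay, a read waits until the object reaches the logged version, and a write waits until the logged number of reads has completed, reproducing the original CREW ordering.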
Instant Replay: Example

[The high-level execution annotated with Instant Replay's logging: thread 1's writes under MA and MB each log "version 1, 0 preceding reads"; thread 2's reads each log "version 1". PROBLEM: the unsynchronised write B=6 is not a CREW operation on a shared object, so its order with respect to B=7 and x=B is not recorded.]
Netzer
• Widely cited method
• Attaches a vector clock to each process. The clocks attach a timestamp to each memory operation.
• Uses vector clocks to detect concurrent (racing) memory operations
• Automatically traces the transitive reduction of the dependencies
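The transitive-reduction idea can be sketched as follows (an illustrative helper, not Netzer's actual code, reusing vector timestamps as Python lists): a dependency only needs to be traced if it is not already implied by what the receiving process has transitively seen.

```python
def needs_trace(vc_event, vc_known):
    """Trace a dependency on vc_event only if it is not already dominated
    by vc_known, the receiving process's current (transitive) knowledge."""
    return not all(e <= k for e, k in zip(vc_event, vc_known))

# dependency already implied transitively -> nothing is traced
assert needs_trace([2, 0], [3, 1]) is False
# a genuinely new dependency -> must be traced
assert needs_trace([4, 0], [3, 1]) is True
```

Skipping the dominated dependencies is what keeps the trace files small while still recording every race.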
Netzer: Basic Idea

[Diagram: is the order between thread 2's unsynchronised B=6 and thread 1's swap(MB,1)→0 / B=7 guaranteed? Only dependencies that are not already guaranteed need to be traced.]
Netzer: Transitive Reduction

[Diagram: the dependency from thread 1's B=7 / MB=0 to thread 2's swap(MB,1)→0 / x=B transitively orders the earlier operations as well, so only this reduced dependency is traced.]
Netzer: Example

[The low-level execution from before, repeated as the running example.]
Netzer: Example

[The same execution annotated with vector timestamps: thread 1's operations get (1,0), (2,0), (3,0), (4,0), (5,1), (6,4); thread 2's get (0,1), (4,2), (4,3), (4,4), (6,5), (6,6), (6,7), (6,8), (6,9), (6,10).]
Netzer: Problems
• The size of the vector clock grows with the number of processes
  – the method doesn't scale well
  – what about programs that create threads dynamically?
• A vector timestamp has to be attached to all shared memory locations: huge space overhead.
• The method basically detects all data and synchronisation races and replays them.
ROLT
• Attaches a Lamport clock to each process. The clocks attach a timestamp to each memory operation.
• Does not detect racing operations, but merely re-executes them in the same order.
• Also automatically traces the transitive reduction of the dependencies
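A sketch of why scalar clocks suffice here (illustrative, following the increment-of-maximum rule from the Scalar Clocks slide): when an access merely continues a process's own sequence, the new timestamp is the old one plus one and nothing needs tracing; only the "jumps" caused by another process's clock are logged, as (old, new) pairs.

```python
def rolt_access(proc_clock, obj_clock, log):
    """Update the process and object clocks for one access; log only the
    non-consecutive jumps, which encode the cross-process dependencies."""
    new = max(proc_clock, obj_clock) + 1
    if new != proc_clock + 1:
        log.append((proc_clock, new))  # a dependency that must be replayed
    return new, new  # (new process clock, new object clock)
```

During replay, a process that is about to perform the access logged as (old, new) simply waits until all events with smaller timestamps have been executed, reproducing the original order without ever storing full vector clocks.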
ROLT: Example

[The same execution annotated with scalar timestamps: thread 1's operations get 1, 2, 3, 4, 5, 8; thread 2's get 1, 5, 6, 7, 9, 10, 11, 12, 13, 14.]
ROLT: Example

[The same annotated execution; only the non-consecutive timestamp jumps are traced. Traced: thread 1: (5,8); thread 2: (1,5), (7,9).]
ROLT using three phases
• Problem: high overhead due to the tracing of all memory operations
• Solution: only record/replay the synchronisation operations (a subset of all race conditions)
• Problem: no correct replay is possible if the execution contains a data race
• Solution: add a third phase for detecting the data races
ROLT using three phases
• Phase 1: record the order of the synchronisation races
• Phase 2: replay the synchronisation races while using intrusive data race detection techniques
• Phase 3: replay the synchronisation races and use cyclic debugging techniques to find the `normal' errors
ROLT: Example

[The high-level execution with only the synchronisation operations timestamped: thread 1's L(MA), U(MA), L(MB), U(MB) get 1, 2, 3, 4; thread 2's L(MB), U(MB), L(MA), U(MA) get 5, 6, 7, 8. Traced: thread 1: nothing; thread 2: (0,5).]
ROLT
• ROLT replays synchronisation races and detects data races.
• The method scales well and has a small space and time overhead.
• Produces small trace files.
• A total order is imposed, introducing artificial dependencies.
Conclusions