
Execution Replay and Debugging

Contents

Introduction
• Parallel program: set of co-operating processes
• Co-operation using
– shared variables
– message passing
• Developing parallel programs is considered difficult:
– normal errors as in sequential programs
– synchronisation errors (deadlock, races)
– performance errors
⇒ We need good development tools

Debugging of parallel programs
• Most used technique: cyclic debugging
• Requires repeatable, equivalent executions
• This is a problem for parallel programs: lots of non-determinism present
• Solution: execution replay mechanism (a minimal sketch follows below):
– record phase: trace information about the non-deterministic choices
– replay phase: force an equivalent re-execution using the trace, allowing the use of intrusive debugging techniques
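To make the two phases concrete, here is a minimal sketch (all names hypothetical) of how a single non-deterministic choice, e.g. the value returned by a timing call, can be recorded and later forced during replay:

/* Record/replay of one non-deterministic choice: a minimal sketch. */
#include <stdio.h>
#include <time.h>

enum mode { RECORD, REPLAY };
static enum mode run_mode = RECORD;  /* selected when the run starts  */
static FILE *trace;                  /* trace file, opened elsewhere  */

/* Wrapper around a non-deterministic call: logs the value during the
   record phase, returns the logged value during the replay phase.   */
static long logged_time(void)
{
    long t;
    if (run_mode == RECORD) {
        t = (long)time(NULL);        /* the real, non-deterministic value */
        fprintf(trace, "%ld\n", t);  /* record the choice                 */
    } else {
        fscanf(trace, "%ld", &t);    /* force the recorded choice         */
    }
    return t;
}

During replay the program can be stopped, single-stepped or instrumented at will: the intrusion no longer changes the outcome, because every choice is read back from the trace.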

Non-determinism
• Classes:
– external vs. internal non-determinism
– desired vs. undesired non-determinism
• Important: the amount of non-determinism depends on the abstraction level. E.g. a semaphore P()-operation can be fully deterministic while consisting of a number of non-deterministic spinlocking operations.

Causes of Non-determinism
– In sequential programs:
• program code (self-modifying code?)
• program input (disk, keyboard, network, ...)
• certain system calls (gettimeofday())
• interrupts, signals, ...
– In parallel programs:
• accesses to shared variables: race conditions (synchronisation races and data races)
– In distributed programs:
• promiscuous receive operations
• test operations for non-blocking message operations

Main Issues in Execution Replay
• recorded execution = original execution:
– trace as little as possible in order to limit the overhead
• in time
• in space
• replayed execution = recorded execution:
– faithful re-execution: trace enough

Execution Replay Methods
• Two types: content- vs. ordering-based
– content-based: force each process to read the same value or to receive the same message as during the original execution
– ordering-based: force each process to access the variables or to receive the messages in the same logical order as during the original execution

Logical Clocks for Ordering-based Methods

• A clock C() attaches a timestamp C(x) to an event x
• Used for tracing the logical order of events
• Clock condition: $a \to b \Rightarrow C(a) < C(b)$
• Clocks are strongly consistent if in addition $C(a) < C(b) \Rightarrow a \to b$
• New timestamp is the increment of the maximum of the old timestamps of the process and the object

Scalar Clocks
• Aka Lamport Clocks
• Simple and fast update algorithm for an operation of process p on object o (sketched in C below):
$SC'(p) = SC'(o) = \max(SC(p), SC(o)) + 1$
• Scales very well with the number of processes
• Provides only limited information:
$a \to b \Rightarrow SC(a) < SC(b)$
$SC(a) < SC(b) \Rightarrow (a \to b) \lor (a \,//\, b)$
$SC(a) = SC(b) \Rightarrow (a \,//\, b)$
$SC(a) > SC(b) \Rightarrow (b \to a) \lor (a \,//\, b)$
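A sketch of the update rule in C (type and function names are mine, not from the slides): each process and each shared object keeps a single integer clock, and every operation sets both clocks to the incremented maximum.

/* Lamport (scalar) clock update: a sketch. */
typedef struct { int clock; } lc;          /* per process and per object */

/* Process p performs an operation on object o:
   SC'(p) = SC'(o) = max(SC(p), SC(o)) + 1 */
static void sc_update(lc *p, lc *o)
{
    int max = (p->clock > o->clock) ? p->clock : o->clock;
    p->clock = o->clock = max + 1;
}

Note that comparing two scalar timestamps afterwards cannot separate "ordered" from "concurrent", which is exactly the limitation the formulas above express.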

Vector Clocks
• A vector clock for a program using N processes consists of N scalar values
• Update algorithm for an operation of process p on object o, with the unit increment in position p (sketched below):
$VC'(p) = VC'(o) = \sup(VC(p), VC(o)) + (0,\ldots,0,1,0,\ldots,0)$
• Such a clock is strongly consistent: by comparing vector timestamps one can deduce concurrency information:
$VC(a) < VC(b) \Leftrightarrow a \to b$
$VC(a) \,//\, VC(b) \Leftrightarrow a \,//\, b$
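The corresponding vector update, again as a sketch with names of my own choosing, for a program with N processes:

/* Vector clock update: a sketch; N is the number of processes. */
#define N 2

typedef struct { int c[N]; } vc;           /* per process and per object */

/* Process with index p performs an operation on object o:
   VC'(p) = VC'(o) = sup(VC(p), VC(o)) + (0,...,0,1,0,...,0) */
static void vc_update(vc *proc, int p, vc *o)
{
    for (int i = 0; i < N; i++) {          /* componentwise supremum */
        int sup = (proc->c[i] > o->c[i]) ? proc->c[i] : o->c[i];
        proc->c[i] = o->c[i] = sup;
    }
    proc->c[p]++;                          /* unit increment in position p */
    o->c[p] = proc->c[p];
}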

An Example Program
• A parallel program with two threads, communicating using shared variables: A, B, MA and MB. Local variables are x and y.
• M is used as a mutex, using an atomic swap operation provided by the CPU:

swap(memloc, value):
    return_value = [memloc]
    [memloc] = value
    return return_value

An Example Program (II)
• Lock operation on a mutex M is implemented (in a library) as:

while (swap(M, 1) == 1);

• Unlock operation on a mutex M is implemented as:

M = 0;

• All variables are initially 0

An Example Program (III)
• The example program (a runnable C version follows below):

Thread 1:            Thread 2:
L(MA);               B=6;
A=8;                 L(MB);
U(MA);               x=B;
L(MB);               U(MB);
B=7;                 L(MA);
U(MB);               y=A;
                     U(MA);
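A minimal, self-contained C version of the example program, assuming POSIX threads and the GCC/Clang __atomic builtins for the swap; the final printf is added purely for illustration:

/* The example program as a runnable sketch (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

volatile int A = 0, B = 0, MA = 0, MB = 0;   /* all variables initially 0 */

static void L(volatile int *M)               /* lock: spin on atomic swap */
{
    while (__atomic_exchange_n(M, 1, __ATOMIC_ACQUIRE) == 1)
        ;  /* each failed swap is one of the non-deterministic events */
}

static void U(volatile int *M)               /* unlock */
{
    __atomic_store_n(M, 0, __ATOMIC_RELEASE);  /* the slides use plain M=0 */
}

static void *thread1(void *arg)
{
    (void)arg;
    L(&MA); A = 8; U(&MA);
    L(&MB); B = 7; U(&MB);
    return NULL;
}

static void *thread2(void *arg)
{
    int x, y;
    (void)arg;
    B = 6;                        /* unsynchronised write: races with B=7 */
    L(&MB); x = B; U(&MB);
    L(&MA); y = A; U(&MA);
    printf("x=%d y=%d\n", x, y);  /* outcome depends on the interleaving  */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}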

A Possible Execution: Low Level View

(time runs downwards; "swap(M,1) → v" denotes a swap returning v)

Thread 1                 Thread 2
                         B=6
swap(MA,1) → 0
A=8
MA=0
swap(MB,1) → 0
                         swap(MB,1) → 1
                         swap(MB,1) → 1
                         swap(MB,1) → 1
B=7
MB=0
                         swap(MB,1) → 0
                         x=B
                         MB=0
                         swap(MA,1) → 0
                         y=A
                         MA=0

A Possible Execution: High Level View

(time runs downwards)

Thread 1            Thread 2
                    B=6
L(MA)
A=8
U(MA)
L(MB)
B=7
U(MB)
                    L(MB)
                    x=B
                    U(MB)
                    L(MA)
                    y=A
                    U(MA)

Recap
• A content-based replay method: the value read by each load operation is stored (see the sketch below)
• Trace generation of 1 MB/s was measured on a VAX 11/780
• Impractical method: the time needed to record the large amount of trace information perturbs the original execution
• One advantage: it is possible to replay a subset of the processes in isolation.
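A sketch of the content-based idea (hypothetical helper names, not Recap's actual code): during recording every instrumented load appends the value it read to a trace; during replay the load returns the traced value instead of touching memory.

/* Content-based record/replay of loads: a sketch. */
#include <stdio.h>

enum mode { RECORD, REPLAY };
static enum mode run_mode = RECORD;
static FILE *trace;                        /* per-process trace file */

/* Every shared-memory load is replaced by a call to logged_load(). */
static int logged_load(volatile int *addr)
{
    int value;
    if (run_mode == RECORD) {
        value = *addr;                             /* the real load        */
        fwrite(&value, sizeof value, 1, trace);    /* store the value read */
    } else {
        fread(&value, sizeof value, 1, trace);     /* replay from trace    */
    }
    return value;
}

Because replayed loads are fed entirely from the trace, a single process can be re-executed in isolation, which is the advantage mentioned above; the price is the large trace volume.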

Recap: Example

The low-level execution again, now with the value read by every load traced:
• Thread 1: swap(MA,1) reads 0; swap(MB,1) reads 0
• Thread 2: the four swap(MB,1) operations read 1, 1, 1, 0; x=B reads 7; swap(MA,1) reads 0; y=A reads 8

Instant Replay
• First ordering-based replay method
• Developed for CREW algorithms (concurrent read, exclusive write)
• Each shared object receives a version number that is updated or logged at each CREW operation (see the sketch below):
– read: the version number is logged
– write:
• the version number is incremented
• the number of preceding read operations is logged
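A sketch of the record phase under these rules (hypothetical names; the design is due to LeBlanc and Mellor-Crummey): each shared object carries a version number plus a count of the reads of the current version.

/* Instant Replay record phase: a sketch. */
#include <stdio.h>

typedef struct {
    int version;   /* incremented by every write            */
    int readers;   /* number of reads since the last write  */
} shared_object;

static void record_read(shared_object *o, FILE *trace)
{
    fprintf(trace, "R %d\n", o->version);   /* read: log the version      */
    o->readers++;
}

static void record_write(shared_object *o, FILE *trace)
{
    fprintf(trace, "W %d\n", o->readers);   /* write: log preceding reads */
    o->version++;                           /* and increment the version  */
    o->readers = 0;
}

During replay a read waits until the object reaches the logged version, and a write additionally waits until the logged number of reads of the current version has happened.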

Instant Replay: Example

The high-level execution with read (Lr/Ur) and write (Lw/Uw) lock operations:
• Thread 1: Lw(MA) A=8 Uw(MA): version becomes 1, log 0 reads; Lw(MB) B=7 Uw(MB): version becomes 1, log 0 reads
• Thread 2: Lr(MB) x=B Ur(MB): log version 1; Lr(MA) y=A Ur(MA): log version 1
• PROBLEM: the write B=6 takes place outside any CREW operation, so its ordering with respect to B=7 is not captured.

Netzer
• Widely cited method
• Attaches a vector clock to each process. The clocks attach a timestamp to each memory operation.
• Uses vector clocks to detect concurrent (racing) memory operations (the check is sketched below)
• Automatically traces the transitive reduction of the dependencies
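The detection step is plain vector-timestamp comparison: two operations are concurrent, and hence potentially racing, iff neither timestamp is componentwise smaller. A sketch of the check (names are mine):

/* Concurrency test on vector timestamps: a sketch; N processes. */
#include <stdbool.h>

#define N 2

typedef struct { int c[N]; } vclock;

/* a happened before b iff VC(a) <= VC(b) componentwise, strictly somewhere */
static bool happens_before(const vclock *a, const vclock *b)
{
    bool strictly = false;
    for (int i = 0; i < N; i++) {
        if (a->c[i] > b->c[i]) return false;
        if (a->c[i] < b->c[i]) strictly = true;
    }
    return strictly;
}

/* Racing candidates: ordered in neither direction. */
static bool concurrent(const vclock *a, const vclock *b)
{
    return !happens_before(a, b) && !happens_before(b, a);
}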

Netzer: Basic Idea

[Figure: thread 2 writes B=6 while thread 1, holding MB after swap(MB,1) → 0, writes B=7. Caption: Is this order guaranteed?]

Netzer: Transitive Reduction

[Figure: thread 1 executes B=7 and MB=0; thread 2 then executes swap(MB,1) → 0 and x=B. The traced dependency from MB=0 to the successful swap(MB,1) already orders B=7 before x=B, so that dependency need not be traced separately.]


Netzer: Example

The low-level execution annotated with vector timestamps:

Thread 1: swap(MA,1) → 0 (1,0); A=8 (2,0); MA=0 (3,0); swap(MB,1) → 0 (4,0); B=7 (5,1); MB=0 (6,4)
Thread 2: B=6 (0,1); swap(MB,1) → 1 (4,2); swap(MB,1) → 1 (4,3); swap(MB,1) → 1 (4,4); swap(MB,1) → 0 (6,5); x=B (6,6); MB=0 (6,7); swap(MA,1) → 0 (6,8); y=A (6,9); MA=0 (6,10)


Netzer: Problems
• The size of a vector clock grows with the number of processes:
– the method doesn't scale well
– what about programs that create threads dynamically?
• A vector timestamp has to be attached to all shared memory locations: huge space overhead.
• The method basically detects all data and synchronisation races and replays them.

ROLT
• Attaches a Lamport clock to each process. The clocks attach a timestamp to each memory operation.
• Does not detect racing operations, but merely re-executes them in the same order.
• Also automatically traces the transitive reduction of the dependencies (see the sketch below).
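A sketch of the ROLT update and tracing rule as described on these slides (names and trace format are my own): one scalar clock per process and per shared object, and an event is written to the trace only when its clock jumps by more than one, because only then was an inter-process dependency followed that program order does not already imply.

/* ROLT-style scalar clock update with tracing: a sketch. */
#include <stdio.h>

typedef struct { int clock; } entity;   /* a process or a shared object */

/* Called at each clocked operation of process p on object o. */
static void rolt_update(entity *p, entity *o, FILE *trace)
{
    int old = p->clock;
    int max = (p->clock > o->clock) ? p->clock : o->clock;
    int now = max + 1;                  /* increment of the maximum */
    if (now > old + 1)                  /* jump > 1: dependency not implied */
        fprintf(trace, "(%d,%d)\n", old, now);
    p->clock = o->clock = now;
}

Applied to the example execution below, this traces exactly the pairs (1,5), (5,8) and (7,9).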

ROLT: Example

The low-level execution annotated with scalar (Lamport) timestamps:

Thread 1: swap(MA,1) → 0 [1]; A=8 [2]; MA=0 [3]; swap(MB,1) → 0 [4]; B=7 [5]; MB=0 [8]
Thread 2: B=6 [1]; swap(MB,1) → 1 [5]; swap(MB,1) → 1 [6]; swap(MB,1) → 1 [7]; swap(MB,1) → 0 [9]; x=B [10]; MB=0 [11]; swap(MA,1) → 0 [12]; y=A [13]; MA=0 [14]

Traced: (5,8), (1,5), (7,9), i.e. only the operations whose clock jumps by more than one: the transitive reduction of the dependencies.




ROLT using three phases
• Problem: high overhead due to the tracing of all memory operations
• Solution: only record/replay the synchronisation operations (a subset of all race conditions)
• Problem: no correct replay is possible if the execution contains a data race
• Solution: add a third phase for detecting the data races

ROLT using three phases
• Phase 1: record the order of the synchronisation races
• Phase 2: replay the synchronisation races while using intrusive data race detection techniques
• Phase 3: replay the synchronisation races and use cyclic debugging techniques to find the `normal' errors

ROLT: Example

The high-level execution, now timestamping only the synchronisation operations:

Thread 1: L(MA) [1]; A=8; U(MA) [2]; L(MB) [3]; B=7; U(MB) [4]
Thread 2: B=6; L(MB) [5]; x=B; U(MB) [6]; L(MA) [7]; y=A; U(MA) [8]

Traced: (0,5)

ROLT
• ROLT replays synchronisation races and detects data races.
• The method scales well and has a small space and time overhead.
• Produces small trace files.
• A total order is imposed, which introduces artificial dependencies.

Conclusions