
Execution Replay and Debugging

Contents

Introduction
• Parallel program: set of co-operating processes
• Co-operation using
– shared variables
– message passing
• Developing parallel programs is considered difficult:
– normal errors as in sequential programs
– synchronisation errors (deadlock, races)
– performance errors
⇒ We need good development tools

Debugging of parallel programs
• Most used technique: cyclic debugging
• Requires repeatable, equivalent executions
• This is a problem for parallel programs: lots of non-determinism present
• Solution: execution replay mechanism (a minimal sketch follows below):
– record phase: trace information about the non-deterministic choices
– replay phase: force an equivalent re-execution using the trace, allowing the use of intrusive debugging techniques
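To make the two phases concrete, here is a minimal sketch (all names hypothetical) of how a single non-deterministic choice, e.g. the value returned by a timing call, can be recorded and later forced during replay:

/* Record/replay of one non-deterministic choice: a minimal sketch. */
#include <stdio.h>
#include <time.h>

enum mode { RECORD, REPLAY };
static enum mode run_mode = RECORD;  /* selected when the run starts  */
static FILE *trace;                  /* trace file, opened elsewhere  */

/* Wrapper around a non-deterministic call: logs the value during the
   record phase, returns the logged value during the replay phase.   */
static long logged_time(void)
{
    long t;
    if (run_mode == RECORD) {
        t = (long)time(NULL);        /* the real, non-deterministic value */
        fprintf(trace, "%ld\n", t);  /* record the choice                 */
    } else {
        fscanf(trace, "%ld", &t);    /* force the recorded choice         */
    }
    return t;
}

During replay the program can be stopped, single-stepped or instrumented at will: the intrusion no longer changes the outcome, because every choice is read back from the trace.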

Non-determinism
• Classes:
– external vs. internal non-determinism
– desired vs. undesired non-determinism
• Important: the amount of non-determinism depends on the abstraction level. E.g. a semaphore P()-operation can be fully deterministic while consisting of a number of non-deterministic spinlocking operations.

Causes of Non-determinism
– In sequential programs:
• program code (self-modifying code?)
• program input (disk, keyboard, network, ...)
• certain system calls (gettimeofday())
• interrupts, signals, ...
– In parallel programs:
• accesses to shared variables: race conditions (synchronisation races and data races)
– In distributed programs:
• promiscuous receive operations
• test operations for non-blocking message operations

Main Issues in Execution Replay
• recorded execution = original execution:
– trace as little as possible in order to limit the overhead
• in time
• in space
• replayed execution = recorded execution:
– faithful re-execution: trace enough

Execution Replay Methods
• Two types: content- vs. ordering-based
– content-based: force each process to read the same value or to receive the same message as during the original execution
– ordering-based: force each process to access the variables or to receive the messages in the same logical order as during the original execution

Logical Clocks for Ordering-based Methods

• A clock C() attaches a timestamp C(x) to an event x
• Used for tracing the logical order of events
• Clock condition: $a \to b \Rightarrow C(a) < C(b)$
• Clocks are strongly consistent if in addition $C(a) < C(b) \Rightarrow a \to b$
• New timestamp is the increment of the maximum of the old timestamps of the process and the object

Scalar Clocks
• Aka Lamport Clocks
• Simple and fast update algorithm for an operation of process p on object o (sketched in C below):
$SC'(p) = SC'(o) = \max(SC(p), SC(o)) + 1$
• Scales very well with the number of processes
• Provides only limited information:
$a \to b \Rightarrow SC(a) < SC(b)$
$SC(a) < SC(b) \Rightarrow (a \to b) \lor (a \,//\, b)$
$SC(a) = SC(b) \Rightarrow (a \,//\, b)$
$SC(a) > SC(b) \Rightarrow (b \to a) \lor (a \,//\, b)$
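A sketch of the update rule in C (type and function names are mine, not from the slides): each process and each shared object keeps a single integer clock, and every operation sets both clocks to the incremented maximum.

/* Lamport (scalar) clock update: a sketch. */
typedef struct { int clock; } lc;          /* per process and per object */

/* Process p performs an operation on object o:
   SC'(p) = SC'(o) = max(SC(p), SC(o)) + 1 */
static void sc_update(lc *p, lc *o)
{
    int max = (p->clock > o->clock) ? p->clock : o->clock;
    p->clock = o->clock = max + 1;
}

Note that comparing two scalar timestamps afterwards cannot separate "ordered" from "concurrent", which is exactly the limitation the formulas above express.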

Vector Clocks
• A vector clock for a program using N processes consists of N scalar values
• Update algorithm for an operation of process p on object o, with the unit increment in position p (sketched below):
$VC'(p) = VC'(o) = \sup(VC(p), VC(o)) + (0,\ldots,0,1,0,\ldots,0)$
• Such a clock is strongly consistent: by comparing vector timestamps one can deduce concurrency information:
$VC(a) < VC(b) \Leftrightarrow a \to b$
$VC(a) \,//\, VC(b) \Leftrightarrow a \,//\, b$
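The corresponding vector update, again as a sketch with names of my own choosing, for a program with N processes:

/* Vector clock update: a sketch; N is the number of processes. */
#define N 2

typedef struct { int c[N]; } vc;           /* per process and per object */

/* Process with index p performs an operation on object o:
   VC'(p) = VC'(o) = sup(VC(p), VC(o)) + (0,...,0,1,0,...,0) */
static void vc_update(vc *proc, int p, vc *o)
{
    for (int i = 0; i < N; i++) {          /* componentwise supremum */
        int sup = (proc->c[i] > o->c[i]) ? proc->c[i] : o->c[i];
        proc->c[i] = o->c[i] = sup;
    }
    proc->c[p]++;                          /* unit increment in position p */
    o->c[p] = proc->c[p];
}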

An Example Program
• A parallel program with two threads, communicating using shared variables: A, B, MA and MB. Local variables are x and y.
• M is used as a mutex, using an atomic swap operation provided by the CPU:

swap(memloc, value):
    return_value = [memloc]
    [memloc] = value
    return return_value

An Example Program (II)
• Lock operation on a mutex M is implemented (in a library) as:

while (swap(M, 1) == 1);

• Unlock operation on a mutex M is implemented as:

M = 0;

• All variables are initially 0

An Example Program (III)
• The example program (a runnable C version follows below):

Thread 1:            Thread 2:
L(MA);               B=6;
A=8;                 L(MB);
U(MA);               x=B;
L(MB);               U(MB);
B=7;                 L(MA);
U(MB);               y=A;
                     U(MA);
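A minimal, self-contained C version of the example program, assuming POSIX threads and the GCC/Clang __atomic builtins for the swap; the final printf is added purely for illustration:

/* The example program as a runnable sketch (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

volatile int A = 0, B = 0, MA = 0, MB = 0;   /* all variables initially 0 */

static void L(volatile int *M)               /* lock: spin on atomic swap */
{
    while (__atomic_exchange_n(M, 1, __ATOMIC_ACQUIRE) == 1)
        ;  /* each failed swap is one of the non-deterministic events */
}

static void U(volatile int *M)               /* unlock */
{
    __atomic_store_n(M, 0, __ATOMIC_RELEASE);  /* the slides use plain M=0 */
}

static void *thread1(void *arg)
{
    (void)arg;
    L(&MA); A = 8; U(&MA);
    L(&MB); B = 7; U(&MB);
    return NULL;
}

static void *thread2(void *arg)
{
    int x, y;
    (void)arg;
    B = 6;                        /* unsynchronised write: races with B=7 */
    L(&MB); x = B; U(&MB);
    L(&MA); y = A; U(&MA);
    printf("x=%d y=%d\n", x, y);  /* outcome depends on the interleaving  */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}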

A Possible Execution: Low Level View

(time runs downwards; "swap(M,1) → v" denotes a swap returning v)

Thread 1                 Thread 2
                         B=6
swap(MA,1) → 0
A=8
MA=0
swap(MB,1) → 0
                         swap(MB,1) → 1
                         swap(MB,1) → 1
                         swap(MB,1) → 1
B=7
MB=0
                         swap(MB,1) → 0
                         x=B
                         MB=0
                         swap(MA,1) → 0
                         y=A
                         MA=0

A Possible Execution: High Level View

(time runs downwards)

Thread 1            Thread 2
                    B=6
L(MA)
A=8
U(MA)
L(MB)
B=7
U(MB)
                    L(MB)
                    x=B
                    U(MB)
                    L(MA)
                    y=A
                    U(MA)

Recap
• A content-based replay method: the value read by each load operation is stored (see the sketch below)
• Trace generation of 1 MB/s was measured on a VAX 11/780
• Impractical method: the time needed to record the large amount of trace information perturbs the original execution
• One advantage: it is possible to replay a subset of the processes in isolation.
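A sketch of the content-based idea (hypothetical helper names, not Recap's actual code): during recording every instrumented load appends the value it read to a trace; during replay the load returns the traced value instead of touching memory.

/* Content-based record/replay of loads: a sketch. */
#include <stdio.h>

enum mode { RECORD, REPLAY };
static enum mode run_mode = RECORD;
static FILE *trace;                        /* per-process trace file */

/* Every shared-memory load is replaced by a call to logged_load(). */
static int logged_load(volatile int *addr)
{
    int value;
    if (run_mode == RECORD) {
        value = *addr;                             /* the real load        */
        fwrite(&value, sizeof value, 1, trace);    /* store the value read */
    } else {
        fread(&value, sizeof value, 1, trace);     /* replay from trace    */
    }
    return value;
}

Because replayed loads are fed entirely from the trace, a single process can be re-executed in isolation, which is the advantage mentioned above; the price is the large trace volume.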

Recap: Example

The low-level execution again, now with the value read by every load traced:
• Thread 1: swap(MA,1) reads 0; swap(MB,1) reads 0
• Thread 2: the four swap(MB,1) operations read 1, 1, 1, 0; x=B reads 7; swap(MA,1) reads 0; y=A reads 8

Instant Replay
• First ordering-based replay method
• Developed for CREW algorithms (concurrent read, exclusive write)
• Each shared object receives a version number that is updated or logged at each CREW operation (see the sketch below):
– read: the version number is logged
– write:
• the version number is incremented
• the number of preceding read operations is logged
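A sketch of the record phase under these rules (hypothetical names; the design is due to LeBlanc and Mellor-Crummey): each shared object carries a version number plus a count of the reads of the current version.

/* Instant Replay record phase: a sketch. */
#include <stdio.h>

typedef struct {
    int version;   /* incremented by every write            */
    int readers;   /* number of reads since the last write  */
} shared_object;

static void record_read(shared_object *o, FILE *trace)
{
    fprintf(trace, "R %d\n", o->version);   /* read: log the version      */
    o->readers++;
}

static void record_write(shared_object *o, FILE *trace)
{
    fprintf(trace, "W %d\n", o->readers);   /* write: log preceding reads */
    o->version++;                           /* and increment the version  */
    o->readers = 0;
}

During replay a read waits until the object reaches the logged version, and a write additionally waits until the logged number of reads of the current version has happened.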

Instant Replay: Example

The high-level execution with read (Lr/Ur) and write (Lw/Uw) lock operations:
• Thread 1: Lw(MA) A=8 Uw(MA): version becomes 1, log 0 reads; Lw(MB) B=7 Uw(MB): version becomes 1, log 0 reads
• Thread 2: Lr(MB) x=B Ur(MB): log version 1; Lr(MA) y=A Ur(MA): log version 1
• PROBLEM: the write B=6 takes place outside any CREW operation, so its ordering with respect to B=7 is not captured.

Netzer
• Widely cited method
• Attaches a vector clock to each process. The clocks attach a timestamp to each memory operation.
• Uses vector clocks to detect concurrent (racing) memory operations (the check is sketched below)
• Automatically traces the transitive reduction of the dependencies
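The detection step is plain vector-timestamp comparison: two operations are concurrent, and hence potentially racing, iff neither timestamp is componentwise smaller. A sketch of the check (names are mine):

/* Concurrency test on vector timestamps: a sketch; N processes. */
#include <stdbool.h>

#define N 2

typedef struct { int c[N]; } vclock;

/* a happened before b iff VC(a) <= VC(b) componentwise, strictly somewhere */
static bool happens_before(const vclock *a, const vclock *b)
{
    bool strictly = false;
    for (int i = 0; i < N; i++) {
        if (a->c[i] > b->c[i]) return false;
        if (a->c[i] < b->c[i]) strictly = true;
    }
    return strictly;
}

/* Racing candidates: ordered in neither direction. */
static bool concurrent(const vclock *a, const vclock *b)
{
    return !happens_before(a, b) && !happens_before(b, a);
}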

Netzer: Basic Idea

[Figure: thread 2 writes B=6 while thread 1, holding MB after swap(MB,1) → 0, writes B=7. Caption: Is this order guaranteed?]

Netzer: Transitive Reduction

[Figure: thread 1 executes B=7 and MB=0; thread 2 then executes swap(MB,1) → 0 and x=B. The traced dependency from MB=0 to the successful swap(MB,1) already orders B=7 before x=B, so that dependency need not be traced separately.]


Netzer: Example

The low-level execution annotated with vector timestamps:

Thread 1: swap(MA,1) → 0 (1,0); A=8 (2,0); MA=0 (3,0); swap(MB,1) → 0 (4,0); B=7 (5,1); MB=0 (6,4)
Thread 2: B=6 (0,1); swap(MB,1) → 1 (4,2); swap(MB,1) → 1 (4,3); swap(MB,1) → 1 (4,4); swap(MB,1) → 0 (6,5); x=B (6,6); MB=0 (6,7); swap(MA,1) → 0 (6,8); y=A (6,9); MA=0 (6,10)


Netzer: Problems
• The size of a vector clock grows with the number of processes:
– the method doesn't scale well
– what about programs that create threads dynamically?
• A vector timestamp has to be attached to all shared memory locations: huge space overhead.
• The method basically detects all data and synchronisation races and replays them.

ROLT
• Attaches a Lamport clock to each process. The clocks attach a timestamp to each memory operation.
• Does not detect racing operations, but merely re-executes them in the same order.
• Also automatically traces the transitive reduction of the dependencies (see the sketch below).
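A sketch of the ROLT update and tracing rule as described on these slides (names and trace format are my own): one scalar clock per process and per shared object, and an event is written to the trace only when its clock jumps by more than one, because only then was an inter-process dependency followed that program order does not already imply.

/* ROLT-style scalar clock update with tracing: a sketch. */
#include <stdio.h>

typedef struct { int clock; } entity;   /* a process or a shared object */

/* Called at each clocked operation of process p on object o. */
static void rolt_update(entity *p, entity *o, FILE *trace)
{
    int old = p->clock;
    int max = (p->clock > o->clock) ? p->clock : o->clock;
    int now = max + 1;                  /* increment of the maximum */
    if (now > old + 1)                  /* jump > 1: dependency not implied */
        fprintf(trace, "(%d,%d)\n", old, now);
    p->clock = o->clock = now;
}

Applied to the example execution below, this traces exactly the pairs (1,5), (5,8) and (7,9).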

ROLT: Example

The low-level execution annotated with scalar (Lamport) timestamps:

Thread 1: swap(MA,1) → 0 [1]; A=8 [2]; MA=0 [3]; swap(MB,1) → 0 [4]; B=7 [5]; MB=0 [8]
Thread 2: B=6 [1]; swap(MB,1) → 1 [5]; swap(MB,1) → 1 [6]; swap(MB,1) → 1 [7]; swap(MB,1) → 0 [9]; x=B [10]; MB=0 [11]; swap(MA,1) → 0 [12]; y=A [13]; MA=0 [14]

Traced: (5,8), (1,5), (7,9), i.e. only the operations whose clock jumps by more than one: the transitive reduction of the dependencies.




ROLT using three phases
• Problem: high overhead due to the tracing of all memory operations
• Solution: only record/replay the synchronisation operations (a subset of all race conditions)
• Problem: no correct replay is possible if the execution contains a data race
• Solution: add a third phase for detecting the data races

ROLT using three phases
• Phase 1: record the order of the synchronisation races
• Phase 2: replay the synchronisation races while using intrusive data race detection techniques
• Phase 3: replay the synchronisation races and use cyclic debugging techniques to find the `normal' errors

ROLT: Example

The high-level execution, now timestamping only the synchronisation operations:

Thread 1: L(MA) [1]; A=8; U(MA) [2]; L(MB) [3]; B=7; U(MB) [4]
Thread 2: B=6; L(MB) [5]; x=B; U(MB) [6]; L(MA) [7]; y=A; U(MA) [8]

Traced: (0,5)

ROLT
• ROLT replays synchronisation races and detects data races.
• The method scales well and has a small space and time overhead.
• Produces small trace files.
• A total order is imposed, which introduces artificial dependencies.

Conclusions