
Consistent Cuts and Un-coordinated Check-pointing.

Date post: 14-Dec-2015
Upload: dashawn-tinkham
Page 1: Consistent Cuts and Un-coordinated Check-pointing.


Cuts

• A cut is a subset C of the events in a computation
  – some definitions require at least one event from each process
• For each process P, the events in C that executed on P form an initial prefix of all events that executed on P
• Cut: {e0,e1,e2,e4,e7}
• Not a cut: {e0,e2,e4,e7} (e2 is included but e1, which precedes it on the same process, is not)
• Frontier of cut: subset of the cut containing the last event on each process
  – for our example, {e2,e4,e7}

[Figure: space-time diagram of the example computation, with events e0 through e13.]

Equivalent definition of cut

• Subset C of events in computation
• If e' ∈ C, and e → e', and e and e' executed on the same process, then e ∈ C.
• What happens if we remove the condition that e and e' were executed on the same process?


Consistent cut

• Subset C of events in computation
• If e' ∈ C, and e → e', then e ∈ C
  – Consistent cut: {e0,e1,e2,e4,e5,e7}
    • note that e5 → e2, but the cut is still consistent by our definition
  – Inconsistent cut: {e0,e1,e2,e4,e7} (e2 is in the cut, but e5, which happened before it, is not)
  – Not a cut: {e0,e2,e4,e7}
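The two definitions can be checked mechanically. The sketch below is only an illustration: the assignment of events to processes and the single message e5 → e2 are assumptions chosen to agree with the examples on these slides, since the original diagram is not available. It relies on the fact that a cut is consistent exactly when no receive event is included without its matching send.

```python
# Assumed layout, not recovered from the figure: three processes, each with
# its events listed in local execution order.
PROCESSES = {
    "P0": ["e0", "e1", "e2", "e3"],
    "P1": ["e4", "e5", "e6"],
    "P2": ["e7", "e8", "e9", "e10", "e11", "e12", "e13"],
}
# Assumed messages as (send, receive) pairs; e5 -> e2 is the one the slide names.
MESSAGES = [("e5", "e2")]


def is_cut(events):
    """True if, on every process, the chosen events form an initial prefix."""
    for local in PROCESSES.values():
        chosen = [e in events for e in local]
        # A prefix looks like True...True False...False: never False then True.
        if any(not a and b for a, b in zip(chosen, chosen[1:])):
            return False
    return True


def is_consistent_cut(events):
    """True if the set is a cut and every included receive has its send."""
    return is_cut(events) and all(
        send in events for send, recv in MESSAGES if recv in events
    )


def frontier(events):
    """Last included event on each process that contributes to the cut."""
    return {
        max((e for e in local if e in events), key=local.index)
        for local in PROCESSES.values()
        if any(e in events for e in local)
    }
```

With this assumed layout, `is_cut` accepts {e0,e1,e2,e4,e7} and rejects {e0,e2,e4,e7}, while `is_consistent_cut` accepts only the set that also contains e5.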


Properties of consistent cuts (0)

• If a cut C is inconsistent, there must be a message whose receive event is in C but whose send event is not.
• Proof: there must be events e and e' such that e → e', e' is in C, but e is not in C. Consider the causal chain e → e0 → e1 → … → e'. There must be consecutive events ei → ej in this chain such that e, e0, …, ei are not in C, but ej is in C. Because C is a cut, ei and ej must have been executed by different processes (otherwise ei, which precedes ej on the same process, would be in C). Therefore, ei is a send and ej is the matching receive.

[Figure: the example space-time diagram, marking the events e and e' used in the proof.]

Properties of consistent cuts (I)

• Let eP be a computational event (on process P) on the frontier of a consistent cut C. If eP → e'Q, then e'Q cannot be in C.
• Proof: Consider the causal chain eP → e1 → … → e'Q. Event e1 must execute on process P because eP is a computational event, not a send. Since eP is on the frontier, e1 is not in C. Event e1 either is e'Q or happens before it, so by the definition of a consistent cut, e'Q cannot be in C.


Properties (II)

• Let F = {e0,e1,…} be a set of computational events, one from each process. F is the frontier of a consistent cut iff the events in F are pairwise concurrent.
• Proof: from Property (I) and Property (0).


Properties of consistent cuts (III): Lattice of consistent cuts

[Figure: the example space-time diagram with two consistent cuts labelled C1 and C2.]

• C1 ≤ C2 : C1 ⊆ C2
• Join(C1,C2) : C1 ∪ C2
• Meet(C1,C2) : C1 ∩ C2

Un-coordinated check-pointing

• Each process saves its local state at start, and then whenever it wants.
• Events: compute, send, receive, take check-point
• Recovery line: frontier of any consistent cut whose events are all check-points
• Is there an optimum recovery line? How do we find it?

[Figure: time lines of processes p, q, and r with their check-points, and a point marked *.]

Check-point Dependency Graph

• Nodes
  – One for each local check-point
  – One for current state of each surviving process
• Edges: one for each message (e,e') from some P to Q
  – Source is the node for the last check-point on P that happened before e
  – Destination is the node n on Q for the first check-point/current state such that e' happened before n

[Figure: time lines of processes p, q, and r (with a point marked *) and the corresponding check-point dependency graph.]

Properties of check-point dependency graph

• Node c2 is reachable from node c1 in the graph iff the check-point corresponding to c1 happens before the check-point corresponding to c2.

Finding optimum recovery line

• RL0 = { last node on each process }
• While (there exist u,v in RLi such that v is reachable from u)
  – RLi+1 = RLi – {v} + {node before v in same process as v}
• Final RL when loop terminates is the optimum recovery line
• An efficient algorithm based on this loop is given later.
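The loop above can be sketched directly, without any attempt at efficiency. The representation is an assumption made for illustration: each process has nodes numbered from 0 (its initial check-point), a node is a (process, index) pair, `edges` holds one (source, destination) pair per message, and reachability also follows an implicit edge from each node to the next node on the same process. The sketch further assumes that no message edge ends at an initial check-point, which is what keeps the rollback from running off the start of a process.

```python
def reachable_from(start, node_counts, edges):
    """Nodes reachable from `start` via message edges and process order."""
    seen, stack = set(), [start]
    while stack:
        p, i = stack.pop()
        successors = [dst for src, dst in edges if src == (p, i)]
        if i + 1 < node_counts[p]:
            successors.append((p, i + 1))  # assumed implicit process-order edge
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def optimum_recovery_line(node_counts, edges):
    """Roll back until no node on the line is reachable from another."""
    if any(dst[1] == 0 for _, dst in edges):
        raise ValueError("an initial check-point cannot be a message destination")
    line = {p: n - 1 for p, n in node_counts.items()}  # RL0: last node per process
    history = [dict(line)]
    changed = True
    while changed:
        changed = False
        for p in node_counts:
            reach = reachable_from((p, line[p]), node_counts, edges)
            for q in node_counts:
                if q != p and (q, line[q]) in reach:
                    line[q] -= 1  # replace v by the node before it on its process
                    history.append(dict(line))
                    changed = True
    return line, history


# Hypothetical example: p failed after check-point 1; q and r survive, and
# their last node (index 2) is their current state.
EXAMPLE_NODES = {"p": 2, "q": 3, "r": 3}
EXAMPLE_EDGES = [(("p", 1), ("q", 2)), (("q", 1), ("r", 1))]
```

On this made-up example the loop rolls q back once and r back twice, so `history` holds four recovery lines, in the spirit of RL0 through RL3.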

[Figure: time lines and dependency graph for processes p, q, and r, showing the successive recovery lines RL0, RL1, RL2, RL3.]

Correctness

• When the loop terminates, no node on the recovery line is reachable from another, so the algorithm computes a set of pairwise concurrent check-points, one from each process.
• From Property (II), it follows that these check-points are the frontier of a consistent cut.

Optimality

• Suppose O is a better recovery line than the one the algorithm computes.
• O cannot be RL0; otherwise, the algorithm would have terminated with it. So RL0 is better than O.
• Consider the iteration in which RLi is better than O but RLi+1 is not. There exist u,v in RLi such that v is reachable from u, and RLi+1 is obtained from RLi by dropping v and taking the check-point prior to v. Therefore, v must be in O. Let x in O be the check-point on the same process as u. We see that x → u → v (or x = u, so x → v in either case), which contradicts Property (II).

Finding recovery line efficiently

• Node colors
  – Yellow: on current recovery line
  – Red: beyond current recovery line
  – Green: behind current recovery line
• Bad edge:
  – Source is red/yellow
  – Destination is yellow/green
• Algorithm: propagate redness forward from the destinations of bad edges

Algorithm

• Mark all nodes green
• For each node l that is the last node of a process
  – Mark l yellow
  – Add each edge (l,d) to worklist
• While worklist is nonempty do
  – Get edge (s,d) from worklist;
  – If color(d) is red, continue;
  – L = node to left of d;
  – Mark L yellow; add all bad edges with source L to worklist;
  – R = first red node to right of d (or the end of the process if there is none);
  – For each node t in interval [d,R)
    • Mark t red;
    • Add all bad edges with source t to worklist;
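A minimal sketch of this worklist scheme follows, under the same assumed representation as before: nodes are (process, index) pairs, index 0 is the initial check-point, and no message edge ends at an initial check-point. For simplicity it queues every outgoing edge of a node whose color changes and relies on the red test to skip edges that are no longer bad; the function and variable names are illustrative only.

```python
from collections import defaultdict, deque

GREEN, YELLOW, RED = "green", "yellow", "red"


def recovery_line_by_coloring(node_counts, edges):
    """Return {process: index of its yellow node} after propagating redness."""
    outgoing = defaultdict(list)
    for src, dst in edges:
        if dst[1] == 0:
            raise ValueError("an initial check-point cannot be a message destination")
        outgoing[src].append(dst)

    color = {(p, i): GREEN for p, n in node_counts.items() for i in range(n)}
    worklist = deque()  # only the destination of each queued edge is needed
    for p, n in node_counts.items():
        last = (p, n - 1)
        color[last] = YELLOW
        worklist.extend(outgoing[last])

    while worklist:
        q, j = worklist.popleft()
        if color[(q, j)] == RED:
            continue  # destination already beyond the recovery line
        left = (q, j - 1)
        color[left] = YELLOW  # recovery line on q moves to just before d
        worklist.extend(outgoing[left])
        t = j
        # Mark [d, R) red, where R is the first red node to the right of d.
        while t < node_counts[q] and color[(q, t)] != RED:
            color[(q, t)] = RED
            worklist.extend(outgoing[(q, t)])
            t += 1

    return {p: i for (p, i), c in color.items() if c == YELLOW}
```

Each process keeps exactly one yellow node throughout, with green nodes to its left and red nodes to its right, so the returned dictionary has one entry per process.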

Remarks

• Complexity of algorithm: O(|E|+|V|)
  – Each node is touched at most 3 times, to mark it green, yellow, red
  – Each edge is examined at most twice
    • Once when its source goes green → yellow
    • Once when its source goes yellow → red
• Another approach: use rollback dependency graph (see Alvisi et al)

Practical details

• Each process numbers its checkpoints starting at 0.
• When a message is sent from S to R, the number of the last check-point on S is piggybacked on the message.
• Receiver of message saves message + piggyback in log.
• When checkpoint is taken, message log is also saved on disk.
• In-flight messages can be recovered from this log after recovery line has been established.
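The bookkeeping described here is enough to rebuild the dependency-graph edges. The sketch below is a single-program simulation, not a distributed implementation: the `Process` class, its method names, the synchronous delivery in `send`, and the in-memory `saved_logs` dictionary standing in for the disk are all assumptions made for illustration.

```python
class Process:
    """Toy process that numbers check-points and logs received messages."""

    def __init__(self, name):
        self.name = name
        self.checkpoints_taken = 0
        self.message_log = []   # (sender, piggybacked number, next local node, payload)
        self.saved_logs = {}    # check-point number -> log copy "saved on disk"
        self.take_checkpoint()  # check-point 0: local state saved at start

    def take_checkpoint(self):
        number = self.checkpoints_taken
        self.saved_logs[number] = list(self.message_log)
        self.checkpoints_taken += 1
        return number

    def send(self, receiver, payload):
        last_checkpoint = self.checkpoints_taken - 1  # piggybacked on the message
        receiver.receive(self.name, last_checkpoint, payload)

    def receive(self, sender, piggyback, payload):
        # The first node after this receive is the next check-point to be
        # taken here (or the current state if none is taken).
        next_node = self.checkpoints_taken
        self.message_log.append((sender, piggyback, next_node, payload))


def dependency_edges(processes):
    """One (source node, destination node) pair per logged message."""
    return [
        ((sender, piggyback), (proc.name, next_node))
        for proc in processes
        for sender, piggyback, next_node, _payload in proc.message_log
    ]
```

The piggybacked number identifies the source node of the edge, and the receiver's own check-point count at receipt identifies the destination node, so no extra communication is needed to construct the graph.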


Garbage collection of saved states

• Garbage collection of old states is a key problem.

• One solution: run the recovery line algorithm once in a while even if there is no failure, and GC all states behind the recovery line.

Application-level Check-pointing

Recall

• We have seen system-level check-pointing.
• Trouble with system-level check-pointing:
  – lot of data saved at each check-point
    • PC, registers, stack, heap, some O/S state, network state, …
    • thin pipe to disk problem
  – lack of portability
    • processor/OS state is very implementation-specific
    • cannot restart check-point on different platform
    • cannot restart check-point on different number of processors
• One alternative: application-level check-pointing

Application-level check-pointing

• Key idea: permit user to specify
  – what variables should be saved at a check-point
  – program point where check-point should be taken
• Example: protein-folding
  – save only positions and velocities of bases
  – check-point at end of time-step
• Advantages:
  – less data saved
    • only live data needs to be saved
    • check-point at program points where live data is small and there are no in-flight messages
  – data can be saved in implementation-independent manner

Warning

• This is more complex than it appears!
• We must restore
  – PC: need to save where check-point was taken
  – registers
  – stack
    • In general, many procedure invocations are active when the check-point is taken.
    • How do we restore the stack so procedure returns etc. happen correctly?
  – heap: restored heap data will be in different locations than at check-point

Right intuition

• In application-level check-pointing, we must use the saved variables to recompute the system state we would have saved in system-level check-pointing, modulo relocation of heap variables.
• Recovery script:
  – code that is executed to accomplish this
  – distinct from user code, but obviously derived from it
  – however, needs to be woven into user code to simplify problems such as register restoration

Example: DOME (Beguelin et al, CMU)

• Distributed Object Migration Environment (DOME)
• C++ library of data parallel objects automatically distributed over networks of heterogeneous workstations
• Application-level check-pointing and restart supported
  – User-level
  – Pre-processor based

Simple case

• Most computation occurs in a loop in main
• Solution:
  – put one check-point at bottom of loop
  – live variables at bottom of loop are globals
  – write script to save and restore globals
  – weave script into main

Dome example

// Statements marked * are introduced for failure recovery.
// The prefix d on a variable type says "save me at checkpoint".
  main(int argc, char *argv[])
  {
      dome_init(argc, argv);
*     dScalar<int> integer_variable;
*     dScalar<float> float_variable;
*     dVector<int> int_vector;
*     if (!is_dome_restarting())
*         execute_user_initialization_code(…);
      while (!loop_done(…)) {
          // loop_done uses only saved variables
          do_computation(…);
*         dome_check_point();
      }
  }

Analysis

• Let us understand how this code restores processor state
  – PC: we drop into loop after restoring globals
  – registers: by making recovery script part of main, we ensure that register contents at top of loop are same for normal execution and for restart
  – stack: we re-execute main, so frame is restored
  – heap: restored from saved check-point but may be relocated
• Think: this works even if we restart on different machine!

Remarks

• Loop body is allowed to make function calls
  – real restriction is that there is one check-point and it must be in main
• Command-line parameter is used to determine whether execution is normal or restart
• User must write some code to restore variables from check-point
  – perhaps library code can help
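The simple case can be imitated outside Dome to see why dropping back into the loop is enough. The following is a rough analogue, not Dome code: the `run` function, its `restarting` flag (standing in for the command-line parameter), the JSON file format, and the `fail_after` switch used to simulate a crash are all assumptions.

```python
import json
import os
import tempfile


def run(checkpoint_path, total_steps, restarting, fail_after=None):
    """Loop in 'main' with one check-point at the bottom of the loop body."""
    if restarting:
        with open(checkpoint_path) as fh:
            state = json.load(fh)          # restore the saved variables
    else:
        state = {"step": 0, "total": 0}    # user initialization code

    while state["step"] < total_steps:     # loop test uses only saved variables
        state["total"] += state["step"]    # stand-in for do_computation
        state["step"] += 1
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)           # check-point at bottom of loop
        if fail_after is not None and state["step"] == fail_after:
            raise RuntimeError("simulated failure")
    return state["total"]


def demo():
    """Fail after three iterations, then restart from the saved state."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    try:
        run(path, total_steps=5, restarting=False, fail_after=3)
    except RuntimeError:
        pass
    return run(path, total_steps=5, restarting=True)
```

Because the saved state is plain JSON rather than a memory image, the restart could in principle happen on a different machine, which is the portability point made above. A real implementation would also write the file atomically so that a crash during the check-point cannot corrupt it.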

More complex example

  f() {
      dScalar<int> i;
      do_f_stuff;
      g(i);
      next_statement;
      …;
  }

  g(dScalar<int> &I) {
      do_g_stuff_1;
      dome_checkpoint();
      do_g_stuff_2;
  }

General scenario

• Check-point could happen deep inside a bunch of procedure calls.

• On restart, we need to restore stack so procedure returns etc. can happen normally.

• Solution: save information about which procedure invocations are live at check-point

Example with Dome constructs

  f() {
      dScalar<int> i;
      if (is_dome_restarting()) {
          next_call = dome_get_next_call();
          …..}
      do_f_stuff;
      dome_push("g1");
  g1:
      g(i);
      dome_pop();
      next_statement;
      …;
  }

  g(dScalar<int> &I) {
      if (is_dome_restarting())
          goto restart_done;
      do_g_stuff_1;
      dome_checkpoint();
  restart_done:
      do_g_stuff_2;
  }
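The control flow above can be mimicked in a language without goto by recording the live call sites at the check-point and replaying them on restart. This is a loose analogue rather than the Dome mechanism: the `Runtime` class and its methods are hypothetical stand-ins for the Dome calls, and the rule that the innermost frame clears the restarting flag is an assumption, since the slide does not show where that happens.

```python
class Runtime:
    """Hypothetical stand-in for the Dome runtime calls used on the slide."""

    def __init__(self, saved_calls=None):
        self.restarting = saved_calls is not None
        self.pending = list(saved_calls or [])  # live call sites, outermost first
        self.live = []                          # call sites active right now
        self.saved = None                       # recorded by the last check-point
        self.trace = []                         # work actually executed

    def push(self, label):
        self.live.append(label)

    def pop(self):
        self.live.pop()

    def next_call(self):
        return self.pending.pop(0)

    def checkpoint(self):
        self.saved = list(self.live)


def f(rt):
    if rt.restarting:
        if rt.next_call() != "g1":      # which call in f was live at the check-point
            raise ValueError("unexpected call site")
    else:
        rt.trace.append("do_f_stuff")   # skipped on restart
    rt.push("g1")
    g(rt)
    rt.pop()
    rt.trace.append("next_statement")


def g(rt):
    if rt.restarting:
        rt.restarting = False           # assumed: innermost frame ends the replay
    else:
        rt.trace.append("do_g_stuff_1")
        rt.checkpoint()
    rt.trace.append("do_g_stuff_2")     # corresponds to restart_done
```

A normal run records ["g1"] at the check-point; passing that list to a new `Runtime` and calling `f` again rebuilds the frames of f and g and resumes at the statement after the check-point, so the returns and `next_statement` happen as they would have without the failure.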

Challenge

• Do this for MPI code.
• Can compiler determine
  – where to check-point?
  – what data to check-point?
• Need not save all data live at check-point
  – if some variables can be easily recomputed from saved data and program constants, we can re-compute those values in the recovery script.
  – we can modify program to make this easier.
• Measure of success: beat hand-written recovery code

