A New Diskless Check Pointing Approach


A New Diskless Checkpointing Approach for Multiple Processor Failures

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING

Ge-Ming Chiu, Member, IEEE Computer Society, and Jane-Ferng Chiu

Presented by Linda Maria Pulickal

S7 CSE

INTRODUCTION

Checkpoint: a snapshot of the current application state.

Used to restart execution after a failure.

Very important in large-scale distributed computing.

What is Diskless Checkpointing?

Checkpoints are stored in the main memory (primary storage) of peer processors.

No secondary storage is needed, which saves time.

Advantages of the Diskless Approach

No disk-access latency, so less performance degradation.

Usable when stable storage is unavailable, e.g., in mobile computing systems.

Effective at large scale (10,000-100,000 processors).

Diskless checkpointing

Neighbor-based: each processor saves its checkpoints in their entirety in the memory of peer processors.

Parity-based: uses a dedicated checkpoint processor to store the parity of the checkpoints taken by all the application processors, computed with XOR operations.

Reed-Solomon coding-based: encodes the checkpoints of multiple processors using Reed-Solomon erasure coding techniques.
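The parity-based variant can be illustrated with a minimal Python sketch. The function names and the equal-length byte-string checkpoints are assumptions made for illustration; they do not come from the paper.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks: list[bytes]) -> bytes:
    """XOR of all blocks; this is what the checkpoint processor stores."""
    return reduce(xor_bytes, blocks)

# Hypothetical checkpoints of three application processors.
checkpoints = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
stored_parity = parity(checkpoints)

# Processor 1 fails: XOR the parity with the surviving checkpoints.
survivors = [checkpoints[0], checkpoints[2]]
recovered = parity([stored_parity] + survivors)
```

A single parity block of this kind can rebuild only one lost checkpoint, which is why multiple simultaneous failures need a different layout.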

Problems with existing techniques

Extra dedicated processors are needed for storing checkpoint data.

Extra processors can be hard to find, e.g., in mobile computing systems.

Adding processors increases the failure probability.

Memory overhead.

System Model

Collection of n processors (or nodes), P0, P1, P2, ..., Pn-1, interconnected by a (wired or wireless) network.

1. A diskless checkpointing scheme that tolerates up to k simultaneous failures.

2. Reduce memory overhead.

Basic Operation of the Proposed Scheme

Important terms:

1. Checkpoint Storage Nodes

2. Checkpoint Coverage Nodes

Checkpoint COVERAGE - CCi

Checkpoint STORAGE - CSi

Each Pi sends its checkpoint to at least k other processors (CSi).

-- at least one node of CSi will remain alive for each failed processor.

Pi also stores a copy of the state in a distinct section of its memory.

-- to help other failed processors decode their previous checkpoints.

Steps:

Each Pi calculates the parity from CCi using XOR.

Stores only the parity result in memory.

Advantage: each node needs memory space equal only to the size of the largest checkpoint.
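The steps above can be sketched in Python. The 5-node layout with k = 2, the integer checkpoint values, and the function names are all assumptions chosen for illustration. The sketch reads CCi as the set of processors that send their checkpoint to Pi.

```python
from functools import reduce

def coverage_sets(storage):
    """Invert the storage sets: Pj is in CC[i] exactly when Pi is in CS[j]."""
    cc = {i: set() for i in storage}
    for j, nodes in storage.items():
        for i in nodes:
            cc[i].add(j)
    return cc

def node_parity(i, cc, state):
    """XOR of the checkpoints Pi covers; Pi keeps only this value."""
    return reduce(lambda a, b: a ^ b, (state[j] for j in cc[i]))

# Hypothetical storage sets CS[i] for 5 nodes with k = 2.
storage = {0: {2, 3}, 1: {3, 4}, 2: {4, 0}, 3: {0, 1}, 4: {1, 2}}
# Hypothetical checkpoint contents, modeled as integers.
state = {0: 0b0001, 1: 0b0010, 2: 0b0100, 3: 0b1000, 4: 0b1111}

cc = coverage_sets(storage)
parities = {i: node_parity(i, cc, state) for i in storage}
```

Each node ends up holding one parity value regardless of how many checkpoints it covers, which is the memory saving noted above.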

The conceptual framework of the diskless checkpointing approach.

Recovery

P5 wants to recover.

Node P6 is used.

P6 state:

S6 = P1 + P2 + P5

P5 = S6 – P1 – P2

Here + and – both denote bitwise XOR.
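A worked instance of this recovery, reading + and – as bitwise XOR (the operation used to compute the parity). The checkpoint values are hypothetical integers chosen only to make the arithmetic visible.

```python
# Hypothetical checkpoint contents, modeled as small integers.
p1, p2, p5 = 0b1010, 0b0110, 0b1101

# P6 keeps only the parity of the checkpoints it covers.
s6 = p1 ^ p2 ^ p5

# P5 fails; P1 and P2 supply the local copies of their own state,
# and XORing them out of S6 leaves P5's checkpoint.
recovered_p5 = s6 ^ p1 ^ p2
```

Because XOR is its own inverse, the same operation serves for both encoding and decoding.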

Safe Recovery Criterion: for any failed processor Pi, at least one node in CSi has all of its checkpoint coverage nodes (other than Pi itself) intact.
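The criterion can be checked mechanically for a given failure set. The sketch below is a minimal check under stated assumptions: the 5-node layout with k = 2 and the function names are hypothetical, and the storage node itself is also required to be alive.

```python
from itertools import combinations

def coverage_sets(storage):
    """Invert the storage sets: Pj is in CC[i] exactly when Pi is in CS[j]."""
    cc = {i: set() for i in storage}
    for j, nodes in storage.items():
        for i in nodes:
            cc[i].add(j)
    return cc

def safely_recoverable(failed, storage):
    """True if every failed Pi has a live storage node whose other
    coverage nodes are all intact."""
    cc = coverage_sets(storage)
    return all(
        any(s not in failed and not ((cc[s] - {i}) & failed)
            for s in storage[i])
        for i in failed
    )

# Hypothetical storage sets CS[i] for 5 nodes with k = 2.
storage = {0: {2, 3}, 1: {3, 4}, 2: {4, 0}, 3: {0, 1}, 4: {1, 2}}

# Every pair of failures satisfies the criterion in this layout...
all_pairs_ok = all(safely_recoverable(set(pair), storage)
                   for pair in combinations(storage, 2))
# ...but three failures exceed what k = 2 is meant to tolerate.
three_failures_ok = safely_recoverable({0, 1, 4}, storage)
```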

DETERMINING THE CHECKPOINT STORAGE NODE SET

The cardinality of CSi must be at least k.

The cardinality of CCi is k to ensure good load balance.

Fundamentals of CSi

S0 = P3 + P4

S1 = P2 + P3

S2 = P0 + P1

S3 = P2 + P4

S4 = P0 + P1

Not a good design. Why? CS0 ∩ CS1 = {P2, P4}, which has more than one element.

S0 = P3 + P4

S1 = P0 + P4

S2 = P0 + P1

S3 = P1 + P2

S4 = P2 + P3

Not a good design. Why? P1 ∈ CS0, yet CS0 ∩ CS1 = {P2} is not empty.

Theorems

For all Pi and Pr,

(1) |CSi ∩ CSr| ≤ 1, i ≠ r

For each Pi,

(2) CSi ∩ CSr = ∅, for any Pr ∈ CSi.
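Both conditions can be tested against the two 5-node designs given earlier. The sketch below derives each CSi by inverting the parity equations (Pj appears in Si exactly when Pi is in CSj). The function and variable names are hypothetical.

```python
from itertools import combinations

def storage_from_coverage(coverage):
    """Invert the parity equations: Pi is in CS[j] when Pj appears in Si."""
    cs = {j: set() for j in coverage}
    for i, covered in coverage.items():
        for j in covered:
            cs[j].add(i)
    return cs

def condition_1(cs):
    """No two processors share more than one storage node."""
    return all(len(cs[i] & cs[r]) <= 1 for i, r in combinations(cs, 2))

def condition_2(cs):
    """Pi shares no storage node with any of its own storage nodes."""
    return all(not (cs[i] & cs[r]) for i in cs for r in cs[i])

# First design: S0 = P3 + P4, S1 = P2 + P3, S2 = P0 + P1, ...
first = storage_from_coverage(
    {0: {3, 4}, 1: {2, 3}, 2: {0, 1}, 3: {2, 4}, 4: {0, 1}})
# Second design: S0 = P3 + P4, S1 = P0 + P4, S2 = P0 + P1, ...
second = storage_from_coverage(
    {0: {3, 4}, 1: {0, 4}, 2: {0, 1}, 3: {1, 2}, 4: {2, 3}})
```

Here `condition_1(first)` is false because CS0 and CS1 are both {P2, P4}, while the second design passes condition (1) but fails condition (2).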

Design of CSi’s

Cyclic design concept.

Derived from CS0 as,

Only the design of CS0 needs attention.

PSR sequence: d0, d1, d2, ..., dr-1 is PSR if no l, m, p, and q with 0 ≤ l ≤ m < p ≤ q ≤ r – 1 satisfy,

E.g., 2, 1, 5, 3 is not PSR; 1, 3, 5, 2 is PSR.

PSR ensures that no two processors share more than one checkpoint storage node.
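A PSR test can be sketched as follows. The condition itself is not reproduced in this transcript, so the sketch assumes it forbids two disjoint runs of consecutive elements, dl..dm and dp..dq, from having equal sums. That reading is consistent with the index constraint and with both examples (2 + 1 = 3 in the first), but it remains an assumption. The function name is hypothetical.

```python
def is_psr(d):
    """Assumed PSR test: no two disjoint runs of consecutive elements
    d[l..m] and d[p..q] (with m < p) have the same sum."""
    r = len(d)
    # Every run of consecutive elements as (start, end, sum).
    runs = [(a, b, sum(d[a:b + 1])) for a in range(r) for b in range(a, r)]
    return not any(
        s1 == s2 and m < p
        for (_, m, s1) in runs
        for (p, _, s2) in runs
    )

first_example = is_psr([2, 1, 5, 3])
second_example = is_psr([1, 3, 5, 2])
```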

Design of CS0

Steps for k = 4:

Construct a PSR sequence of 3 (i.e., k – 1) positive integers.

Select the sequence with the minimum sum, D. E.g., d0 = 1, d1 = 3, d2 = 2 and D = 6.

First element of CS0: PD+1 = P7.

Add d0, d1, d2 as successive increments starting from P7.

CS0 = {P7, P8, P11, P13}
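The construction for k = 4 can be reproduced with a small brute-force search. This is a sketch, not the paper's algorithm: the PSR test assumes the condition is "no two disjoint runs of consecutive elements have equal sums", and the function names are hypothetical.

```python
from itertools import count, product

def is_psr(d):
    """Assumed PSR test: no two disjoint consecutive runs have equal sums."""
    r = len(d)
    runs = [(a, b, sum(d[a:b + 1])) for a in range(r) for b in range(a, r)]
    return not any(s1 == s2 and m < p
                   for (_, m, s1) in runs for (p, _, s2) in runs)

def min_sum_psr(length):
    """Try sums in increasing order; return the first PSR sequence found."""
    for total in count(length):
        for d in product(range(1, total + 1), repeat=length):
            if sum(d) == total and is_psr(d):
                return d

def build_cs0(d):
    """Start at index D + 1, then add each d_j to the previous index."""
    nodes = [sum(d) + 1]
    for step in d:
        nodes.append(nodes[-1] + step)
    return nodes

k = 4
d = min_sum_psr(k - 1)
cs0 = build_cs0(d)
```

With k = 4 the search yields d = (1, 3, 2), so D = 6 and the indices are 7, 8, 11, 13, matching CS0 above.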

Requirements

Total number of processors in the system ≥ 3D + 2.

Ensure Theorem 2 holds.
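The effect of the size requirement can be checked numerically for CS0 = {P7, P8, P11, P13} (D = 6, so 3D + 2 = 20). The rule for deriving CSi from CS0 is not shown in this transcript. The sketch assumes the cyclic design shifts every index of CS0 by i modulo n. The function names are hypothetical.

```python
from itertools import combinations

def cyclic_storage_sets(cs0, n):
    """Assumed cyclic rule: CS[i] is CS0 with every index shifted by i mod n."""
    return {i: {(p + i) % n for p in cs0} for i in range(n)}

def condition_1(cs):
    """No two processors share more than one storage node."""
    return all(len(cs[i] & cs[r]) <= 1 for i, r in combinations(cs, 2))

def condition_2(cs):
    """Pi shares no storage node with any of its own storage nodes."""
    return all(not (cs[i] & cs[r]) for i in cs for r in cs[i])

cs0 = {7, 8, 11, 13}
large_enough = cyclic_storage_sets(cs0, 20)  # n = 3D + 2
too_small = cyclic_storage_sets(cs0, 19)     # n = 3D + 1
```

Under this assumption both conditions hold for n = 20, while for n = 19 the shifted set CS13 = {P1, P2, P5, P7} overlaps CS0 at P7, so condition (2) fails.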

Performance Analysis

Thank You ….
