Checkpointing-based Rollback Recovery for Parallel Applications on the InteGrade Grid Middleware

Raphael Y. de Camargo, Andrei Goldchleger, Fabio Kon, Alfredo Goldman

Department of Computer Science, University of São Paulo, Brazil

Middleware 2004 – Toronto, Canada
2nd International Workshop on Grid Computing

Summary

- Introduction
- InteGrade Grid middleware
- BSP Computing Model
- Checkpointing-based Rollback Recovery
- Checkpointing Infrastructure
- Preliminary Experiments
- Conclusions

Introduction

Grid Computing:
- Grid computing allows the leveraging and integration of computer resources distributed across LANs and WANs
- Besides dedicated computing resources, it is also possible to use idle computing power from commodity workstations (opportunistic computing)

Challenges:
- The environment is composed of shared user workstations spread across many different LANs
- Machines may fail, become inaccessible, or switch from idle to busy very rapidly
- Some mechanism for fault tolerance is a major requirement for such a system

InteGrade Grid Middleware

Objectives:
- Use idle computing power of commodity workstations (opportunistic computing)
- Allow organizations to increase their available computing power without buying extra hardware
- Ensure quality of service for machine owners who share their computing resources

Implementation Status:
- Basic architecture already implemented
- Uses CORBA distributed object technology for communication
- Provides support for execution of sequential, BSP and bag-of-tasks applications

InterCluster InteGrade Architecture

GRM (Global Resource Manager): Manages the grid resources and schedules applications for execution

ASCT: Allows the submission and control of applications on the Grid

LRM (Local Resource Manager): Manages a node's resources

Runtime Libraries: Provide support for running parallel applications

BSP Parallel Computing Model

Computation is performed using a sequence of parallel supersteps

Each superstep is composed of computation and communication, with a synchronization barrier at the end

All data from communication is available to other processes only in the next superstep

Two communication mechanisms:
- Direct Remote Memory Access (DRMA)
- Bulk Synchronous Message Passing (BSMP)
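
The rule that communicated data becomes visible only in the next superstep can be illustrated with a small single-process toy model. This is only a sketch: the names `bsp_sim`, `sim_send`, `sim_read`, `sim_sync` and the constant `NPROCS` are hypothetical and are not part of any BSP library API.

```c
#include <assert.h>
#include <string.h>

#define NPROCS 4              /* hypothetical number of simulated processes */

/* Toy model of BSP message delivery inside one OS process. */
typedef struct {
    int inbox[NPROCS];        /* values readable in the current superstep */
    int pending[NPROCS];      /* values sent during the current superstep */
    int superstep;            /* index of the current superstep           */
} bsp_sim;

static void sim_init(bsp_sim *s) {
    memset(s, 0, sizeof *s);
}

/* Communication phase: queue a value for process dst. */
static void sim_send(bsp_sim *s, int dst, int value) {
    assert(dst >= 0 && dst < NPROCS);
    s->pending[dst] = value;
}

/* Return what process pid can see in the current superstep. */
static int sim_read(const bsp_sim *s, int pid) {
    assert(pid >= 0 && pid < NPROCS);
    return s->inbox[pid];
}

/* Synchronization barrier: deliver queued values, start the next superstep. */
static void sim_sync(bsp_sim *s) {
    memcpy(s->inbox, s->pending, sizeof s->inbox);
    s->superstep++;
}
```

A value passed to `sim_send` is not returned by `sim_read` until `sim_sync` has been called, which mirrors the barrier at the end of each superstep.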

Checkpointing-based Rollback Recovery

Checkpointing: Consists of periodically saving the application state into a checkpoint, so that the state can later be recovered from it

Checkpointing-based Rollback Recovery: Process of reinitializing an application from an intermediate execution point after a failure is detected

Two approaches for checkpointing:
- System-level: The memory space and processor registers of an application are saved into the checkpoint.
- Application-level: The application is responsible for providing the data to be saved and for reconstructing its state from the checkpoint.
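
As a minimal sketch of the application-level approach, the application itself decides which data forms its state and writes it out. The `app_state` struct and the `save_checkpoint` / `load_checkpoint` functions below are hypothetical names chosen for illustration. Writing the raw struct bytes makes the file architecture dependent.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: only what is needed to resume. */
typedef struct {
    int    iteration;
    double partial_sum;
} app_state;

/* Write the state to a file. Returns 0 on success, -1 on failure. */
static int save_checkpoint(const char *path, const app_state *st) {
    assert(path != NULL && st != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(st, sizeof *st, 1, f);
    int close_rc = fclose(f);
    return (written == 1 && close_rc == 0) ? 0 : -1;
}

/* Read the state back. Returns 0 on success, -1 on failure. */
static int load_checkpoint(const char *path, app_state *st) {
    assert(path != NULL && st != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(st, sizeof *st, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

On restart, an application would call `load_checkpoint` first and continue from the stored `iteration` if the call succeeds.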

Application-level checkpointing

Advantages:
- Semantic information about the data being saved is available, which makes it possible to generate portable checkpoints
- Only the data necessary for recovering the application state needs to be saved

The application is responsible for:
- Specifying which data needs to be saved
- Recovering its state from a previous checkpoint

Disadvantages:
- Need to instrument the source code with checkpointing code
- Access to the application source code is required
- Cannot generate forced checkpoints

Checkpointing of Parallel Applications

In the case of parallel applications, we must consider the dependencies among application processes generated by message exchanges.

Global checkpoint: a collection containing one checkpoint from every application process. In the diagram, the global checkpoint s1 is inconsistent while the global checkpoint s2 is consistent.

BSP applications: consistency can be guaranteed by generating the checkpoints after the synchronization phases.
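
The usual consistency condition is that no message may be recorded as received in one process's checkpoint unless its send is recorded in the sender's checkpoint. The sketch below checks that condition over a list of messages. The `message` struct, the position encoding and `is_consistent` are hypothetical helpers for illustration, not part of the InteGrade libraries.

```c
#include <assert.h>
#include <stddef.h>

/* One message, located by positions in each process's local event sequence. */
typedef struct {
    int sender;
    int send_pos;     /* index of the send event in the sender     */
    int receiver;
    int recv_pos;     /* index of the receive event in the receiver */
} message;

/*
 * ckp_pos[p] is the number of local events of process p that happened
 * before its checkpoint. Returns 1 if the global checkpoint is
 * consistent, 0 if some message was received before the receiver's
 * checkpoint but sent after the sender's checkpoint.
 */
static int is_consistent(const int *ckp_pos, const message *msgs, size_t n) {
    assert(ckp_pos != NULL && (msgs != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        int received_before = msgs[i].recv_pos < ckp_pos[msgs[i].receiver];
        int sent_before     = msgs[i].send_pos < ckp_pos[msgs[i].sender];
        if (received_before && !sent_before)
            return 0;     /* orphan message: receive saved, send not saved */
    }
    return 1;
}
```

In a BSP application, checkpoints taken right after the barrier place every send and every matching receive on the same side of the checkpoint, so this check cannot fail.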

Checkpointing Infrastructure

Pre-Compiler:
- Instruments the source code of a C/C++ application with checkpointing code

Runtime libraries:
- Allow saving the application state into a checkpoint and recovering the data from a previous checkpoint

ExecutionMonitor:
- Keeps information about applications running on the grid, allowing these applications to be restarted in case of failures

PreCompiler

Based on OpenC++, which lets us use compile-time reflection to instrument an application's source code with checkpointing code

Needs to modify application code in order to save the following data:
- Execution Stack: contains runtime data from the functions that are active at a particular moment during application execution
- Position Counter: the current position in the program
- The Heap: contains memory chunks allocated by commands such as malloc and new
- Global variables

Saving and Recovering the Execution Stack State

Execution stack state: Not directly accessible from application code.

Saving the execution stack state: Save a list of the currently active functions and the values of their local variables.

Recovering the execution stack state: Call the functions in the saved list, declare the local variables and recover their values from the checkpoint. The remaining code is skipped.

Position Counter: Process state is saved only at certain points in the source code, marked by a call to some function, such as checkpoint_candidate()

Execution Stack (diagram): each stack frame holds local variables, control information, and function parameters.

Saving Local Vars and Pointers

Local Variables:
- An auxiliary stack keeps the addresses of the local variables that are currently in scope.
- Local variable addresses are pushed onto the stack just after their declaration, and removed when the variables leave scope.
- During checkpoint generation, the values stored at these addresses are saved in the checkpoint.

Pointers:
- In the case of pointers, it is first necessary to dereference the pointer.
- When saving pointers with multiple levels of indirection, it is necessary to follow the pointer graph structure.
- Special care is necessary with graphs containing cycles and when multiple pointers reference the same memory chunk.
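
The auxiliary stack of local variable addresses can be sketched as a fixed-size array of (address, size) pairs. The function names `ckp_push_data` and `ckp_npop_data` follow the calls that appear in the instrumented code in this deck, but the bodies, the capacity `CKP_MAX_VARS` and the helper `ckp_save_stack` are assumptions made for illustration, not the actual library implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CKP_MAX_VARS 64          /* hypothetical capacity of the stack */

typedef struct {
    void  *addr;                 /* address of a live local variable */
    size_t size;                 /* its size in bytes                */
} ckp_entry;

static ckp_entry ckp_stack[CKP_MAX_VARS];
static size_t    ckp_top = 0;

/* Register a local variable right after its declaration. */
static void ckp_push_data(void *addr, size_t size) {
    assert(ckp_top < CKP_MAX_VARS);
    ckp_stack[ckp_top].addr = addr;
    ckp_stack[ckp_top].size = size;
    ckp_top++;
}

/* Unregister the n most recently pushed variables as they leave scope. */
static void ckp_npop_data(size_t n) {
    assert(n <= ckp_top);
    ckp_top -= n;
}

/*
 * Copy the current values of all registered variables into buf.
 * Returns the number of bytes written; returns 0 if buf is too small
 * (or if nothing is registered).
 */
static size_t ckp_save_stack(unsigned char *buf, size_t cap) {
    size_t off = 0;
    for (size_t i = 0; i < ckp_top; i++) {
        if (off + ckp_stack[i].size > cap)
            return 0;
        memcpy(buf + off, ckp_stack[i].addr, ckp_stack[i].size);
        off += ckp_stack[i].size;
    }
    return off;
}
```

Because the stack stores addresses rather than values, `ckp_save_stack` always captures whatever the variables hold at checkpoint time, not what they held when they were registered.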

Saving the Heap Memory

HeapManager:
- Maintains a list of the currently allocated chunks of memory
- Each entry includes the memory address, its size, and a flag that indicates whether that chunk has already been saved during checkpoint generation
- Updated before memory allocation and deallocation calls such as malloc and free for C, and new and delete for C++
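
A heap manager of this kind can be sketched as a linked list of chunk records wrapped around `malloc` and `free`. The names `ckp_malloc`, `ckp_free`, `heap_mark_saved` and `heap_chunk_count` are hypothetical. The saved flag is what keeps a chunk from being written twice when several pointers, or a cycle of pointers, lead to it.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct heap_chunk {
    void  *addr;                 /* start of the allocated chunk          */
    size_t size;                 /* size of the chunk in bytes            */
    int    saved;                /* 1 once written to the current checkpoint */
    struct heap_chunk *next;
} heap_chunk;

static heap_chunk *heap_list = NULL;

/* Allocate memory and record the chunk. Expects size > 0. */
static void *ckp_malloc(size_t size) {
    assert(size > 0);
    heap_chunk *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->addr = malloc(size);
    if (c->addr == NULL) {
        free(c);
        return NULL;
    }
    c->size  = size;
    c->saved = 0;
    c->next  = heap_list;
    heap_list = c;
    return c->addr;
}

/* Free a recorded chunk and drop its record; unknown addresses are ignored. */
static void ckp_free(void *addr) {
    for (heap_chunk **pp = &heap_list; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->addr == addr) {
            heap_chunk *dead = *pp;
            *pp = dead->next;
            free(dead->addr);
            free(dead);
            return;
        }
    }
}

/* Returns 1 if the caller should save the chunk now, 0 if it was
 * already saved or is not a recorded chunk. */
static int heap_mark_saved(void *addr) {
    for (heap_chunk *c = heap_list; c != NULL; c = c->next) {
        if (c->addr == addr) {
            if (c->saved)
                return 0;
            c->saved = 1;
            return 1;
        }
    }
    return 0;
}

/* Number of chunks currently recorded. */
static size_t heap_chunk_count(void) {
    size_t n = 0;
    for (heap_chunk *c = heap_list; c != NULL; c = c->next)
        n++;
    return n;
}
```

A complete version would also reset every `saved` flag at the start of each checkpoint; that step is omitted here to keep the sketch short.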

Classes, Structures and BSP Calls

Structures:
- Saved in the same way as local variables
- The pointers present in the structure must be followed

Classes:
- Use introspection to add methods for saving and restoring the class members

BSP:
- The bsp_begin and bsp_synch standard functions are replaced by functions from the checkpointing library
- During reinitialization, calls to functions that modify the state of the BSP library (e.g., bsp_pushregister) must be re-executed

Precompiler – Example of Instrumented Code

Modified Code:

    int function() {
        int lastFunctionCalled = -1;   // index of the last call site reached
        int localVar = 0;

        // Register the local variables on the auxiliary checkpoint stack
        ckp_push_data(&lastFunctionCalled, sizeof(int));
        ckp_push_data(&localVar, sizeof(int));

        // When restarting, restore the locals and jump to the interrupted call
        if (ckpRecovering == 1) {
            ckp_get_data(&lastFunctionCalled, sizeof(int));
            ckp_get_data(&localVar, sizeof(int));
            if (lastFunctionCalled == 0)
                goto ckp0;
        }

        // Do computations (...)

    ckp0:
        lastFunctionCalled = 0;
        functionA();

        // Do computations (...)

        ckp_npop_data(2);              // unregister both locals before returning
        return localVar;
    }

Checkpointing Runtime Library

Checkpointing Library:
- Provides the functionality for maintaining a stack of local variables, managing heap state and saving the data to a checkpoint
- Provides a timer that applications can set to ensure a minimum time between checkpoints
- Checkpoints are currently architecture dependent and are saved to a file in the file system

BSP Ckp Library: Provides specific functionality for checkpointing BSP applications:

bsp_begin_ckp( ): Registers some addresses necessary for checkpointing coordination and initializes the timer.

bsp_synch_ckp( ): Tests whether the timer has expired and, if so, signals the other processes to generate a new checkpoint.
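
The timer test and the coordinated decision can be sketched as follows. The names `ckp_timer`, `ckp_timer_init`, `ckp_timer_expired` and `should_checkpoint` are hypothetical, and how the real library exchanges the flags between processes is not described in the slides. The sketch only assumes that one expired timer is enough to make every process checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef struct {
    time_t last_ckp;       /* time of the most recent checkpoint         */
    double min_interval;   /* minimum seconds between two checkpoints    */
} ckp_timer;

/* Start the timer at time now with the given minimum interval. */
static void ckp_timer_init(ckp_timer *t, double min_interval_s, time_t now) {
    assert(t != NULL && min_interval_s >= 0.0);
    t->last_ckp = now;
    t->min_interval = min_interval_s;
}

/* Returns 1 if at least min_interval seconds have passed since last_ckp. */
static int ckp_timer_expired(const ckp_timer *t, time_t now) {
    assert(t != NULL);
    return difftime(now, t->last_ckp) >= t->min_interval;
}

/*
 * Coordinated decision taken at a synchronization point: if the timer
 * of any process has expired, all processes generate a checkpoint.
 */
static int should_checkpoint(const int *expired_flags, int nprocs) {
    assert(expired_flags != NULL && nprocs >= 0);
    for (int p = 0; p < nprocs; p++) {
        if (expired_flags[p])
            return 1;
    }
    return 0;
}
```

Taking the decision only at the synchronization point is what keeps the resulting global checkpoint consistent for BSP applications.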

Application Execution Monitoring and Reinitialization

Execution Monitor:
- Contains a list of the applications running on the nodes of its cluster
- Reschedules new executions with the GRM for failed processes

GRM:
- Detects when a node or LRM fails and notifies the Execution Monitor
- Node failures are reported to the GRM

LRM:
- Captures the exit status of running applications and sends it to the ExecutionMonitor
- If a process was explicitly killed by the signals SIGTERM or SIGKILL, it is restarted

BSP Applications:
- For BSP applications, all the processes in the application are reinitialized

Preliminary Experiments

Sequence similarity application:
- Compares two sequences of characters and finds the similarity between them using given criteria
- Used in bioinformatics to compare sequences of DNA
- Was parallelized using the BSP computing model

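Experiments were performed on a cluster of 10 1.4GHz machines connected by a 100Mbps Fast Ethernet network.

tmin    nckp    ttotal    torig     ovh
600s    0       339.7s    339.9s    0%
60s     5       347.1s    339.9s    2.1%
10s     23      371.9s    339.9s    9.4%

Columns: tmin is the minimum time between checkpoints, nckp the number of checkpoints generated, ttotal the execution time with checkpointing, torig the execution time of the original application, and ovh the resulting overhead.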
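
The slides do not state how the overhead column is computed. The reported values are consistent with the relative increase in execution time over the original run, so the following worked calculation is an inference from the numbers rather than a definition given by the authors:

```latex
\[
\mathrm{ovh} = \frac{t_{\mathrm{total}} - t_{\mathrm{orig}}}{t_{\mathrm{orig}}},
\qquad
\frac{347.1 - 339.9}{339.9} \approx 2.1\%,
\qquad
\frac{371.9 - 339.9}{339.9} \approx 9.4\%
\]
```

For the 600s row, where no checkpoint was generated, the same formula gives about -0.06%, which is reported as 0%.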

Conclusions

We described a checkpointing-based rollback recovery mechanism for applications running on the InteGrade Grid middleware

This mechanism will allow better resource utilization in the Grid, since it will be possible to migrate processes between nodes

Preliminary experiments indicate that the checkpointing overhead can be low enough for the mechanism to be used on long-running BSP parallel applications

Ongoing Work

- Improve pre-compiler support for C++
- Support for portable checkpoints
  - Allows better resource utilization in heterogeneous environments
- Robust storage system for checkpoints
  - Data saved in a distributed way
  - Some degree of replication to provide fault tolerance
- Implement an efficient process migration mechanism on InteGrade
  - Can be used for both fault tolerance and dynamic adaptation

Questions?

