
Portable Checkpointing for BSP Applications on Grid Environments

Page 1: Portable Checkpointing for BSP Applications on Grid Environments

Rio de Janeiro, October, 2005 SBAC 2005 1

Portable Checkpointing for BSP Applications on Grid Environments

Raphael Y. de Camargo, Fabio Kon, Alfredo Goldman

Department of Computer Science, IME / USP

Page 2: Portable Checkpointing for BSP Applications on Grid Environments

INTRODUCTION

● Computational Grids: ubiquitous access and coordinated usage of distributed resources
● Opportunistic Grids: usage of idle time of non-dedicated resources (desktop PCs)
  – Resources are heterogeneous (Mac, Windows, Linux)
  – Failure rate is higher than that of dedicated resources
  – Failures occur on a daily basis

Page 3: Portable Checkpointing for BSP Applications on Grid Environments

INTEGRADE

● Grid middleware: usage of idle computing power from personal computers

● Federation of clusters
  – Clusters composed of a collection of resource-providing nodes

● Sequential, parameter sweeping, and BSP applications

Page 4: Portable Checkpointing for BSP Applications on Grid Environments

MOTIVATION

● Fault tolerance is essential, especially when running parallel applications
  – Failure of a single node requires restarting the application from the beginning
  – Checkpointing can be used as a fault-tolerance mechanism
● Mechanisms supporting heterogeneity improve resource utilization
  – A portable checkpointing mechanism allows reinitialization on machines of a different architecture

Page 5: Portable Checkpointing for BSP Applications on Grid Environments

OUR APPROACH

● Source code instrumentation
  – Perform additional tasks
  – Logging, profiling, persistence
● BSP application on heterogeneous nodes
● Portable checkpointing of applications
● Pre-compiler based on OpenC++
  – Open-source tool for compile-time reflection

Page 6: Portable Checkpointing for BSP Applications on Grid Environments

BSP MODEL

● Bridging model
  – Link architecture to software
● Execution performed in supersteps
  – Computation and synchronization phases
● Two communication mechanisms:
  – Direct Remote Memory Access (DRMA)
  – Bulk Synchronous Message Passing (BSMP)
● Existing implementations:
  – Oxford BSPLib, PUB, BSP-G
    ● Work only on homogeneous clusters
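
The superstep structure can be illustrated with a minimal sketch. It simulates the processes inside a single program and is not the BSPLib API: `BspSim`, `Message`, and `all_reduce_sum` are hypothetical names introduced only for illustration. Messages sent during the computation phase become visible only after the synchronization barrier, in the style of BSMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-process simulation of BSP supersteps (not the BSPLib API).
struct Message {
    std::size_t dest;
    long value;
};

class BspSim {
public:
    explicit BspSim(std::size_t nprocs) : inbox_(nprocs) {}

    std::size_t nprocs() const { return inbox_.size(); }

    // BSMP-style send: queued now, visible to the receiver only after sync().
    void send(std::size_t dest, long value) {
        assert(dest < inbox_.size());
        Message m;
        m.dest = dest;
        m.value = value;
        pending_.push_back(m);
    }

    // Barrier: deliver every queued message and end the current superstep.
    void sync() {
        for (auto& box : inbox_) box.clear();
        for (const Message& m : pending_) inbox_[m.dest].push_back(m.value);
        pending_.clear();
        ++superstep_;
    }

    const std::vector<long>& inbox(std::size_t pid) const { return inbox_[pid]; }
    int superstep() const { return superstep_; }

private:
    std::vector<std::vector<long>> inbox_;
    std::vector<Message> pending_;
    int superstep_ = 0;
};

// One superstep: every process sends its local value to all processes,
// the barrier delivers the messages, and each process sums its inbox.
std::vector<long> all_reduce_sum(BspSim& bsp, const std::vector<long>& local) {
    const std::size_t p = bsp.nprocs();
    for (std::size_t pid = 0; pid < p; ++pid)        // computation phase
        for (std::size_t dest = 0; dest < p; ++dest)
            bsp.send(dest, local[pid]);
    bsp.sync();                                      // synchronization phase
    std::vector<long> result(p, 0);
    for (std::size_t pid = 0; pid < p; ++pid)
        for (long v : bsp.inbox(pid)) result[pid] += v;
    return result;
}
```

The end of a superstep is a natural place for a coordinated checkpoint in such a model, since no messages are in flight once the barrier completes.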

Page 7: Portable Checkpointing for BSP Applications on Grid Environments

HETEROGENEOUS NODES

● Extended BSPLib API
  – Some methods receive an extra parameter describing data type information → used to convert data
  – Pointer data types are defined by their declaration
  – Arbitrary data casts are not allowed
    ● Reasonable requirement for portability
● Pre-compiler automatically modifies application source code to use the extended API
  – No need for manual modifications
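
A minimal sketch of how a type descriptor can drive data conversion between nodes. The slides do not show the extended API, so `TypeInfo`, `ByteOrder`, and `convert_buffer` are hypothetical names, and the sketch covers only byte-order conversion of fixed-size elements. Converting types whose size or layout differs between architectures (for example `long double`) needs more than a byte swap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class ByteOrder { Little, Big };

// Hypothetical type descriptor; stands in for the "extra parameter
// describing data type information" mentioned in the slides.
struct TypeInfo {
    std::size_t elem_size;  // size in bytes of one element
};

// Detect the byte order of the machine running this code.
ByteOrder host_byte_order() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// Convert a received buffer in place: when sender and receiver use different
// byte orders, reverse the bytes of every element described by `type`.
void convert_buffer(unsigned char* data, std::size_t nbytes, TypeInfo type,
                    ByteOrder from, ByteOrder to) {
    assert(type.elem_size > 0 && nbytes % type.elem_size == 0);
    if (from == to) return;  // same representation, nothing to do
    for (std::size_t off = 0; off < nbytes; off += type.elem_size)
        std::reverse(data + off, data + off + type.elem_size);
}
```

A receiver would call `convert_buffer(buf, n, TypeInfo{sizeof(std::int32_t)}, sender_order, host_byte_order())` after each communication call. Without the element size, the library could not tell where one value ends and the next begins, which is why arbitrary casts are disallowed.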

Page 8: Portable Checkpointing for BSP Applications on Grid Environments

CHECKPOINTING APPROACHES

● System-level checkpoint
  – Data is copied to checkpoints directly from the application address space
● Application-level checkpoint
  – Instrument application source code to save its state
  – Semantic information about data types is available
    ● Allows generation of portable checkpoints
● Drawbacks
  – Need to modify application source code
  – Checkpoints can be taken only at certain points in the application

Page 9: Portable Checkpointing for BSP Applications on Grid Environments

CHECKPOINTING LIBRARY

● Pre-compiler instruments application source code
  – No manual instrumentation of source code
  – Requires access to source code
● Checkpointing Library
  – Timer with a minimum checkpoint interval
  – Saving performed by a separate thread
  – Checkpoint can be stored in filesystem (NFS) or remote checkpoint repository (TCP/IP)
● Execution Manager
  – Coordinates checkpointing of BSP parallel applications
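
The minimum-interval timer can be sketched as follows. This is an illustration under assumptions, not the library's actual interface: `CheckpointTimer` and `should_checkpoint` are hypothetical names, and the current time is passed in explicitly so the behavior is easy to check.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical timer enforcing a minimum interval between checkpoints.
class CheckpointTimer {
public:
    using Clock = std::chrono::steady_clock;

    CheckpointTimer(std::chrono::seconds min_interval, Clock::time_point start)
        : min_interval_(min_interval), last_(start) {
        assert(min_interval.count() > 0);
    }

    // Called at every checkpoint location inserted by the pre-compiler.
    // Returns true at most once per minimum interval.
    bool should_checkpoint(Clock::time_point now) {
        if (now - last_ < min_interval_) return false;
        last_ = now;
        return true;
    }

private:
    std::chrono::seconds min_interval_;
    Clock::time_point last_;
};
```

In an instrumented application each checkpoint location would call `timer.should_checkpoint(CheckpointTimer::Clock::now())` and hand the collected state to the saving thread only when it returns true, so frequent checkpoint locations do not translate into frequent checkpoints.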

Page 10: Portable Checkpointing for BSP Applications on Grid Environments

SAVING EXECUTION DATA

● Execution stack
  – Information from active function calls
    ● Local variables, function parameters, return address, and control information
  – Dependent on architecture and OS
● Heap area
  – Memory chunks allocated by application
● Necessary to save
  – Execution stack + global variables
  – Data in heap area
  – Other information

[Figure: Execution Stack, each frame holding function parameters, local variables, and control information]

Page 11: Portable Checkpointing for BSP Applications on Grid Environments

EXECUTION STACK

● Save only data necessary for reconstruction
  – List of function calls
  – Value of parameters and local variables
● Data added to an auxiliary stack during execution
● Recovery
  – Data read from checkpoint
  – Functions called
  – Local variables and parameter values assigned
  – Data conversion is performed if necessary
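
A minimal sketch of saving and recovering one function's state at application level. It is not the code generated by the pre-compiler: `Frame`, `Checkpoint`, and the extra parameters of `sum_to` are hypothetical, and the auxiliary stack is collapsed into a single frame written at the checkpoint location. A failure is simulated by returning early right after the state is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical portable record of one active function call.
struct Frame {
    std::string function;
    std::vector<long> locals;  // values of parameters and local variables
};

struct Checkpoint {
    std::vector<Frame> frames;  // list of function calls, outermost first
};

// Instrumented version of a function that sums 1..n.
// restore: saved frame to resume from, or nullptr for a fresh start.
// stop_at: iteration after which a failure is simulated (used only if out is set).
// out:     where to write the checkpoint, or nullptr to disable checkpointing.
long sum_to(long n, const Frame* restore, long stop_at, Checkpoint* out) {
    long i = 1;
    long acc = 0;
    if (restore != nullptr) {  // recovery: assign saved values to the locals
        assert(restore->locals.size() == 2);
        i = restore->locals[0];
        acc = restore->locals[1];
    }
    for (; i <= n; ++i) {
        acc += i;
        if (out != nullptr && i == stop_at) {  // checkpoint location
            Frame f;
            f.function = "sum_to";
            f.locals.push_back(i + 1);  // next iteration to execute
            f.locals.push_back(acc);
            out->frames.assign(1, f);
            return -1;  // simulated failure right after saving
        }
    }
    return acc;
}
```

Because the frame stores values rather than raw stack bytes, it does not depend on the stack layout of the machine that produced it. A restart calls the same function again with the saved frame and continues from the recorded iteration.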

Page 12: Portable Checkpointing for BSP Applications on Grid Environments

POINTERS AND HEAP MEMORY

● Memory addresses
  – Specific to an execution
  – Architecture dependent
● Checkpoint generation
  – Data from heap area is copied to checkpoint
  – Memory addresses → offsets in checkpoint
● Recovery
  – Memory areas are allocated
  – Data is copied to these memory areas
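
The address-to-offset translation can be sketched on a linked list. This is an illustration under assumptions rather than the library's format: `Node`, `SavedNode`, `save_list`, and `restore_list` are hypothetical names, and an index into the saved records plays the role of the offset in the checkpoint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct Node {
    long value;
    Node* next;
};

// Portable record: the pointer is replaced by an index into the checkpoint
// (-1 stands for a null pointer).
struct SavedNode {
    std::int64_t value;
    std::int64_t next_index;
};

// Checkpoint generation: copy heap data and turn addresses into indices.
std::vector<SavedNode> save_list(const Node* head) {
    std::unordered_map<const Node*, std::int64_t> index_of;
    std::vector<const Node*> order;
    for (const Node* p = head; p != nullptr && index_of.count(p) == 0; p = p->next) {
        index_of[p] = static_cast<std::int64_t>(order.size());
        order.push_back(p);
    }
    std::vector<SavedNode> out;
    for (const Node* p : order) {
        SavedNode s;
        s.value = p->value;
        s.next_index = (p->next != nullptr) ? index_of.at(p->next) : -1;
        out.push_back(s);
    }
    return out;
}

// Recovery: allocate fresh memory areas, copy the data, then turn the
// indices back into addresses that are valid in the new execution.
std::vector<std::unique_ptr<Node>> restore_list(const std::vector<SavedNode>& saved) {
    std::vector<std::unique_ptr<Node>> nodes;
    for (const SavedNode& s : saved) {
        std::unique_ptr<Node> n(new Node);
        n->value = static_cast<long>(s.value);
        n->next = nullptr;
        nodes.push_back(std::move(n));
    }
    for (std::size_t i = 0; i < saved.size(); ++i) {
        if (saved[i].next_index >= 0)
            nodes[i]->next = nodes[static_cast<std::size_t>(saved[i].next_index)].get();
    }
    return nodes;
}
```

The visited map also handles shared and cyclic references, since a node reached a second time is written as an index to its existing record rather than copied again.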

Page 13: Portable Checkpointing for BSP Applications on Grid Environments

EXPERIMENTS

● Parallel BSP applications
  – Similarity between large sequences of characters
  – Matrix multiplication
● Testbed:
  – Machines in two labs:
    ● 11 AthlonXP 1700+, 512MB
    ● 1 Power PC G4, 512MB
    ● 2 Athlon 64 2800+, 512MB
    ● 100Mbps Ethernet in 2 connected LANs

Page 14: Portable Checkpointing for BSP Applications on Grid Environments

CHECKPOINTING OVERHEAD

● Simulation parameters
  – Matrix multiplication application using 9 nodes
  – Matrix size: 450x450 and 1800x1800
  – Checkpoint sizes: 2.3MB and 37.1MB
  – Checkpointing intervals: 10, 30, and 60s

Page 15: Portable Checkpointing for BSP Applications on Grid Environments

CHECKPOINTING OVERHEAD

● Storage on local machine or remote repository is faster than with NFS

● When using a remote repository, the overhead was consistently below 10%, even with a 10s interval

Page 16: Portable Checkpointing for BSP Applications on Grid Environments

DYNAMIC GRID SIMULATION

● We simulated a dynamic environment where machines can fail unexpectedly
● Sequence similarity application using 10 nodes
  – Machines fail according to an exponential distribution
    ● MTBF (1/λ) = 600s and 1800s
● Smaller checkpointing intervals → smaller execution times

ckpt interval    t_total (1/λ = 600s)    t_total (1/λ = 1800s)
10s              517.4s                  490.3s
30s              571.4s                  519.3s
60s              699.4s                  534.5s
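
The trend in this table can be explored with a small model. The slides do not describe how the simulation was implemented, so the function below is a simplified sketch under stated assumptions: `simulate_run` is a hypothetical name, failures arrive with exponentially distributed gaps of mean `mtbf`, each failure rolls the run back to the last completed checkpoint, and checkpointing cost and restart delay are ignored. It is not expected to reproduce the measured values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Simulated wall-clock time needed to complete `work` seconds of computation.
// All arguments are in seconds. Assumptions: exponential failure gaps with
// mean `mtbf`, instantaneous checkpoints and restarts.
double simulate_run(double work, double ckpt_interval, double mtbf, std::mt19937& rng) {
    assert(work > 0.0 && ckpt_interval > 0.0 && mtbf > 0.0);
    std::exponential_distribution<double> gap(1.0 / mtbf);
    double elapsed = 0.0;  // wall-clock time spent so far
    double done = 0.0;     // progress already protected by a checkpoint
    while (done < work) {
        const double until_failure = gap(rng);
        const double remaining = work - done;
        if (until_failure >= remaining) return elapsed + remaining;  // finished
        // Only whole checkpoint intervals completed before the failure survive.
        double saved = std::floor(until_failure / ckpt_interval) * ckpt_interval;
        saved = std::min(saved, until_failure);
        done += saved;
        elapsed += until_failure;
    }
    return elapsed;
}
```

Averaging `simulate_run` over many runs for intervals of 10s, 30s, and 60s gives a way to compare how much work is lost per failure. In this model the loss per failure is bounded by one checkpoint interval, which is consistent with shorter intervals giving shorter total times when the checkpoint cost is low.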

Page 17: Portable Checkpointing for BSP Applications on Grid Environments

HETEROGENEOUS NODES

● Matrix multiplication on 4 heterogeneous nodes
  – 3 AthlonXP (x86) + 1 PowerPC G4 (ppc)
● Elements of type long double
● Time spent on data conversion is small compared to total execution time

Matrix size    t_exec    t_x86     t_ppc
500x500        28.8s     0.042s    0.217s
1000x1000      156.8s    0.066s    0.348s
2000x2000      373.5s    0.078s    0.430s

Page 18: Portable Checkpointing for BSP Applications on Grid Environments

RESTART AN APPLICATION

● Time to recover from a checkpoint saved on different architectures

● Application that generates a graph of structures containing 20K nodes

● When recovering on an x86 machine
  – From x86: 0.179s
  – From x86-64: 0.186s → 3.9% slower than x86
  – From PPC: 0.192s → 7.2% slower than x86

● Overhead when reading checkpoint data

Page 19: Portable Checkpointing for BSP Applications on Grid Environments

CONCLUSIONS

● Overhead of portability is small, and can lead to better resource utilization

● Possible to execute BSP applications on heterogeneous nodes

● Ongoing work
  – Distributed checkpoint repository
    ● Scalability and fault tolerance
  – Simulations in large-scale and wide-area Grids
  – Support for multithreaded C++ applications

Page 20: Portable Checkpointing for BSP Applications on Grid Environments

QUESTIONS

For more information, please visit the project page:

http://gsd.ime.usp.br/integrade

