A Fault Tolerant Protocol for Massively Parallel Machines

Sayantan Chakravorty

Laxmikant Kale

University of Illinois, Urbana-Champaign

Parallel Programming Laboratory, Univ. of Illinois, U-C

Outline

• Motivation
• Background
• Design
• Protocols
• Results
• Summary
• Future Work


Motivation

• As machines grow in size, MTBF decreases
• Applications have to tolerate faults
• Checkpoint/rollback doesn’t scale
  • All nodes are rolled back just because one crashed
  • Even nodes independent of the crashed node are restarted
  • Restart cost is similar to the checkpoint period


Requirements

• Fast and scalable checkpoints
• Fast restart
  • Only the crashed processor is to be restarted
  • Minimize effect on fault-free processors
  • Restart cost less than checkpoint period
• Low fault-free runtime overhead
• Transparent to the user


Background

• Checkpoint-based methods
  • Coordinated – Blocking [Tamir84], Non-blocking [Chandy85]
    • Co-check, Starfish, Clip – fault tolerant MPI
  • Uncoordinated – suffers from rollback propagation
  • Communication – [Briatico84], doesn’t scale well
• Log-based
  • Pessimistic – MPICH-V1 and V2, SBML [Johnson87]
  • Optimistic – [Strom85] unbounded rollback, complicated recovery
  • Causal Logging – [Elnozahy93] Manetho, complicated causality tracking and recovery


Design

• Message logging
  • Sender-side message logging
• Asynchronous checkpoints
  • Each processor has a buddy processor
  • Stores its checkpoint in the buddy’s memory
• Processor virtualization
  • Speeds up restart


Processor Virtualization

[Figure: user view vs. system implementation]

• Charm++
  • Parallel C++ with data-driven objects (chares)
  • Runtime maps objects to physical processors
  • Asynchronous method invocation
• Adaptive MPI
  • Implemented on Charm++
  • Multiple virtual processors on a physical processor


Benefits of Virtualization

• Latency tolerant
• Adaptive overlap of communication and computation
• Supports migration of virtual processors


Message Logging Protocol

Correctness: messages should be processed in the same order before and after the crash

Problem:

[Figure: messages exchanged among chares A, B and C, shown before the crash and after the crash]


Message Logging..

• Solution: fix an order the first time and always follow it
  • Receiver gives each message a ticket number
  • Process messages in order of ticket number
• Each message contains
  • Sender ID – who sent it
  • Receiver ID – to whom it was sent
  • Sequence Number (SN) – together with sender and receiver IDs, identifies a message
  • Ticket Number (TN) – decides the order of processing
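The ordering rule above can be illustrated with a minimal sketch. It assumes ticket numbers start at 1 and are consecutive per receiver; the names `Message`, `InOrderProcessor` and `run` are hypothetical and are not taken from Charm++.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str        # Sender ID: who sent it
    receiver: str      # Receiver ID: to whom it was sent
    sn: int            # Sequence Number: with the two IDs, identifies the message
    tn: int            # Ticket Number: decides the order of processing
    payload: str = ""


class InOrderProcessor:
    """Processes messages strictly by ticket number (assumed to start at 1)."""

    def __init__(self):
        self.next_tn = 1
        self.pending = {}      # tn -> message that arrived too early
        self.processed = []

    def deliver(self, msg):
        self.pending[msg.tn] = msg
        # Drain every message whose ticket is next in line.
        while self.next_tn in self.pending:
            self.processed.append(self.pending.pop(self.next_tn))
            self.next_tn += 1


def run(arrival_order):
    proc = InOrderProcessor()
    for msg in arrival_order:
        proc.deliver(msg)
    return [m.payload for m in proc.processed]


m1 = Message("A", "C", sn=1, tn=1, payload="a1")
m2 = Message("B", "C", sn=1, tn=2, payload="b1")
m3 = Message("A", "C", sn=2, tn=3, payload="a2")

before_crash = run([m1, m2, m3])
after_crash = run([m2, m3, m1])   # same messages, different arrival order
```

Because processing follows the ticket numbers rather than arrival order, both runs handle the messages in the same sequence.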


Message to Remote Chares

[Figure: Chare P (sender) and Chare Q (receiver). P sends <Sender, SN> to Q; Q replies with <SN, TN, Receiver>; P then sends <SN, TN, Message>]

• If <sender, SN> has been seen earlier, TN is marked as received
• Otherwise, create a new TN and store the <sender, SN, TN>
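A rough sketch of the ticket exchange for a remote message follows. It interprets "marked as received" as returning the stored TN together with a flag saying the request was seen before, which is an assumption; `TicketTable` and `send_remote` are hypothetical names, and network transport is replaced by direct function calls.

```python
class TicketTable:
    """Receiver-side table mapping (sender, SN) to the ticket number handed out."""

    def __init__(self):
        self.tickets = {}   # (sender, sn) -> tn
        self.last_tn = 0

    def request(self, sender, sn):
        key = (sender, sn)
        seen_before = key in self.tickets
        if not seen_before:
            # First request for this message: create a new TN and store it.
            self.last_tn += 1
            self.tickets[key] = self.last_tn
        # A repeated request gets the TN fixed the first time.
        return self.tickets[key], seen_before


def send_remote(sender_log, table, sender, receiver, sn, payload):
    # Steps 1 and 2: send <sender, SN>, get <SN, TN, receiver> back.
    tn, _ = table.request(sender, sn)
    # Sender-side message logging: keep the message with its ticket.
    sender_log.append({"receiver": receiver, "sn": sn, "tn": tn, "payload": payload})
    # Step 3: the message that travels to the receiver.
    return (sn, tn, payload)


q_tickets = TicketTable()
p_log = []
first = send_remote(p_log, q_tickets, "P", "Q", 1, "hello")
second = send_remote(p_log, q_tickets, "P", "Q", 2, "world")
repeat = q_tickets.request("P", 1)   # same <sender, SN> asked again
```

The repeated request returns the original ticket, so a message that is resent after a crash keeps its place in the processing order.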


Message to Local Chare

• Multiple chares on one processor
• If the processor crashes, all trace of a local message is lost
• After restart, the message should have the same TN
• Store <sender, receiver, SN, TN> on the buddy

[Figure: Chare P and Chare Q on Processor R, and the buddy of Processor R. Messages shown: <sender, SN>; <SN, TN, Receiver>; <sender, receiver, SN, TN> to the buddy; Ack from the buddy; <SN, TN, Message>]
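The local case can be sketched as below, assuming the message is delivered only after the buddy acknowledges the stored <sender, receiver, SN, TN> record. `Buddy`, `Processor`, `store` and `restore_local_log` are hypothetical names, and the buddy is modelled as an in-memory object rather than a remote processor.

```python
class Buddy:
    """Holds ticket records for local messages of another processor."""

    def __init__(self):
        self.local_log = {}   # (sender, receiver, sn) -> tn

    def store(self, sender, receiver, sn, tn):
        self.local_log[(sender, receiver, sn)] = tn
        return True           # stands in for the Ack


class Processor:
    def __init__(self, buddy):
        self.buddy = buddy
        self.tickets = {}     # (sender, receiver, sn) -> tn
        self.next_tn = {}     # receiver -> next ticket number to hand out
        self.delivered = []

    def restore_local_log(self):
        # After a restart, reload the ticket records kept by the buddy.
        self.tickets = dict(self.buddy.local_log)
        for (_, receiver, _), tn in self.tickets.items():
            self.next_tn[receiver] = max(self.next_tn.get(receiver, 1), tn + 1)

    def send_local(self, sender, receiver, sn, payload):
        key = (sender, receiver, sn)
        if key not in self.tickets:
            tn = self.next_tn.get(receiver, 1)
            self.next_tn[receiver] = tn + 1
            self.tickets[key] = tn
        tn = self.tickets[key]
        # Deliver only once the buddy has acknowledged the record.
        if self.buddy.store(sender, receiver, sn, tn):
            self.delivered.append((sn, tn, payload))
        return tn


buddy = Buddy()
r = Processor(buddy)
tn_a = r.send_local("P", "Q", 1, "m1")
tn_b = r.send_local("P", "Q", 2, "m2")

restarted = Processor(buddy)          # processor R after a crash
restarted.restore_local_log()
tn_b_again = restarted.send_local("P", "Q", 2, "m2")
tn_a_again = restarted.send_local("P", "Q", 1, "m1")
tn_new = restarted.send_local("P", "Q", 3, "m3")
```

Even though the restarted processor sees the two old messages in a different order, each gets the ticket recorded on the buddy, and a new message continues the numbering.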


Checkpoint Protocol

• A processor asynchronously decides to checkpoint
• Packs up the state of all its chares and sends it to the buddy
  • Message logs are part of a chare’s state
• Message log on senders can be garbage collected
• Deciding when to checkpoint is an interesting problem
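As an illustration only, the checkpoint step might look like the sketch below. It uses Python's `pickle` as a stand-in for packing chare state, and it assumes a sender may drop a logged message once the receiver's checkpoint covers its ticket number; the slide does not state the exact garbage-collection rule, and all names here are hypothetical.

```python
import pickle


def pack_checkpoint(chares):
    """Serialize the state of all chares on a processor, message logs included."""
    return pickle.dumps(chares)


def unpack_checkpoint(blob):
    return pickle.loads(blob)


def garbage_collect(sender_log, receiver, checkpointed_tn):
    """Drop logged messages to `receiver` already covered by its checkpoint.

    Assumption: entries with tn <= checkpointed_tn never need to be resent.
    """
    return [entry for entry in sender_log
            if not (entry["receiver"] == receiver and entry["tn"] <= checkpointed_tn)]


# State of the chares on processor R; the message log is part of the state.
chares_on_r = {
    "Q": {
        "iteration": 7,
        "last_processed_tn": 2,
        "message_log": [{"receiver": "P", "sn": 1, "tn": 1, "payload": "ack"}],
    },
}

# The checkpoint is kept in the buddy's memory.
buddy_memory = {"R": pack_checkpoint(chares_on_r)}

# Sender-side log on another processor, trimmed after Q's checkpoint.
p_log = [
    {"receiver": "Q", "sn": 1, "tn": 1, "payload": "m1"},
    {"receiver": "Q", "sn": 2, "tn": 2, "payload": "m2"},
    {"receiver": "Q", "sn": 3, "tn": 3, "payload": "m3"},
]
p_log = garbage_collect(p_log, "Q", chares_on_r["Q"]["last_processed_tn"])
```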


Reliability

• Only one scenario in which our protocol fails
  • Processor X (buddy of Y) crashes and restarts
  • Checkpoint of Y is lost
  • Y now crashes before saving its checkpoint
• Result of not assuming reliable nodes for storing checkpoints
• Still increases reliability by orders of magnitude
• Probability can be minimized by having Y checkpoint after X crashes and restarts


Basic Restart Protocol

• After a crash, a Charm++ process is restarted on a new processor
• Gets checkpoint and local message log from buddy
• Chares are restored and other processors are informed of it
• Logged messages for chares on restarted processors are resent
  • The highest TN seen from a crashed chare is also sent
• Messages are reprocessed by the restarted chares
  • Local messages check first in the restored local message log
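The replay step can be sketched as follows, assuming the restored checkpoint records the last ticket number the chare had processed. `replay_order` and `resume_ticket_counter` are hypothetical helpers, and resent messages are modelled as plain dictionaries.

```python
def resume_ticket_counter(checkpointed_tn, highest_tns_seen):
    """Next TN to hand out after restart.

    Senders report the highest TN they saw from the crashed chare, so tickets
    issued after the checkpoint are not handed out a second time.
    """
    return max([checkpointed_tn] + list(highest_tns_seen)) + 1


def replay_order(checkpointed_tn, resent):
    """Payloads to reprocess, in ticket order, skipping those in the checkpoint."""
    todo = [m for m in resent if m["tn"] > checkpointed_tn]
    return [m["payload"] for m in sorted(todo, key=lambda m: m["tn"])]


# Messages resent from sender-side logs, in whatever order they arrive.
resent = [
    {"sender": "P", "sn": 3, "tn": 4, "payload": "p3"},
    {"sender": "B", "sn": 1, "tn": 3, "payload": "b1"},
    {"sender": "P", "sn": 2, "tn": 2, "payload": "p2"},   # already in checkpoint
]

order = replay_order(checkpointed_tn=2, resent=resent)
next_tn = resume_ticket_counter(checkpointed_tn=2, highest_tns_seen=[4, 3])
```

Here the message with TN 2 is skipped because the checkpoint already reflects it, and the remaining messages are reprocessed in their original ticket order.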


Parallel Restart

• Message logging allows fault-free processors to continue with their execution
• However, sooner or later some processors start waiting for the crashed processor
• Virtualization allows us to move work from the restarted processor to waiting processors
  • Chares are restarted in parallel
  • Restart cost can be reduced
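One way to picture moving work off the restarted processor is the sketch below. The round-robin placement is an assumption chosen for simplicity, not the policy described in the slides, and `distribute` is a hypothetical helper.

```python
def distribute(chares, processors):
    """Spread the crashed processor's chares over the given processors, round-robin."""
    placement = {p: [] for p in processors}
    for i, chare in enumerate(chares):
        placement[processors[i % len(processors)]].append(chare)
    return placement


# Chares recovered from the checkpoint of the crashed processor.
crashed_chares = ["c0", "c1", "c2", "c3", "c4"]

# The replacement processor plus two processors that are waiting on it.
placement = distribute(crashed_chares, ["new", "waiting1", "waiting2"])
```

Each processor then replays only the messages for the chares it received, so the reprocessing work proceeds in parallel instead of on a single processor.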


Present Status

• Most of Charm++ has been ported
• Support for migration has not yet been implemented in the fault tolerant protocol
• Simple AMPI programs work
  • Barriers to be done
• Parallel restart not yet implemented


Experimental Evaluation

• NAS benchmarks could not be used
• Used a 5-point stencil computation with a 1-D decomposition
• 8 quad 500 MHz PIII cluster with 500 MB of RAM per node, connected by Ethernet


Overhead

Measurement of overhead for an application with a low communication-to-computation ratio

[Figure: Overhead measurement. x-axis: number of processors (0 to 35); y-axis: normalized performance (0 to 100); series: Normal Charm++, FT without checkpoint, FT full protocol]


Measurement of overhead for an application with a high communication-to-computation ratio

[Figure: Overhead measurement. x-axis: number of processors (0 to 35); y-axis: normalized performance (0 to 100); series: Normal Charm++, FT without checkpoint, FT full protocol]


Recovery Performance

Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s)

[Figure: Execution time with faults. x-axis: number of faults (0 to 7); y-axis: execution time in seconds (0 to 800)]


Summary

• Designed a fault tolerant protocol that
  • Performs fast checkpoints
  • Performs fast parallel restarts
  • Doesn’t depend on any completely reliable node
  • Supports multiple faults
  • Minimizes the effect of a crash on fault-free processors
• Partial implementation of the protocol


Future Work

• Include support for migration in the protocol
• Parallel restart
• Extend to AMPI
• Test with NAS benchmarks
• Study the tradeoffs involved in deciding the checkpoint period