A Fault Tolerant Protocol for Massively Parallel Machines
Sayantan Chakravorty
Laxmikant Kale
University of Illinois, Urbana-Champaign
Parallel Programming Laboratory, Univ. of Illinois, U-C
Outline
• Motivation
• Background
• Design
• Protocols
• Results
• Summary
• Future Work
Motivation
• As machines grow in size
  • MTBF decreases
  • Applications have to tolerate faults
• Checkpoint/Rollback doesn’t scale
  • All nodes are rolled back just because 1 crashed
  • Even nodes independent of the crashed node are restarted
  • Restart cost is similar to Checkpoint period
Requirements
• Fast and scalable Checkpoints
• Fast Restart
  • Only crashed processor to be restarted
  • Minimize effect on fault free processors
  • Restart cost less than checkpoint period
• Low fault free runtime overhead
• Transparent to the user
Background
• Checkpoint based methods
  • Coordinated – Blocking [Tamir84], Non-blocking [Chandy85]
    • Co-check, Starfish, Clip – fault tolerant MPI
  • Uncoordinated – suffers from rollback propagation
  • Communication-induced – [Briatico84], doesn’t scale well
• Log-based
  • Pessimistic – MPICH-V1 and V2, SBML [Johnson87]
  • Optimistic – [Strom85] unbounded rollback, complicated recovery
  • Causal Logging – [Elnozahy93] Manetho, complicated causality tracking and recovery
Design
• Message Logging
  • Sender side message logging
• Asynchronous checkpoints
  • Each processor has a buddy processor
  • Stores its checkpoint in the buddy’s memory
• Processor Virtualization
  • Speed up restart
Processor Virtualization
[Figure: User View vs. System implementation]
• Charm++
  • Parallel C++ with data driven objects – Chares
  • Runtime maps objects to physical processors
  • Asynchronous method invocation
• Adaptive MPI
  • Implemented on Charm++
  • Multiple virtual processors on a physical processor
Benefits of Virtualization
• Latency Tolerant
• Adaptive overlap of communication and computation
• Supports migration of virtual processors
Message Logging Protocol
Correctness: Messages should be processed in the same order before and after the crash
Problem:
[Figure: message exchange among chares A, B and C; panels: Before Crash, After Crash]
Message Logging..
• Solution: Fix an order the first time and always follow it
  • Receiver gives each message a ticket number
  • Process messages in order of ticket number
• Each message contains
  • Sender ID – who sent it
  • Receiver ID – to whom it was sent
  • Sequence Number (SN) – together with sender and receiver IDs, identifies a message
  • Ticket Number (TN) – decides the order of processing
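The ordering rule above can be illustrated with a small sketch. It is not the Charm++ implementation: the `Message` and `TicketOrderedInbox` names are hypothetical, and it assumes tickets start at 1 and are consecutive. It shows that once each message carries a TN, the processing order no longer depends on arrival order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str    # Sender ID
    receiver: str  # Receiver ID
    sn: int        # Sequence Number
    tn: int        # Ticket Number
    payload: str


class TicketOrderedInbox:
    """Buffers arrivals and releases them strictly in ticket-number order."""

    def __init__(self, first_tn=1):
        self.next_tn = first_tn  # assumption: tickets are consecutive from 1
        self.pending = {}        # tn -> Message that arrived too early
        self.processed = []      # payloads in the order they were processed

    def deliver(self, msg):
        self.pending[msg.tn] = msg
        # Process every message whose turn has come.
        while self.next_tn in self.pending:
            self.processed.append(self.pending.pop(self.next_tn).payload)
            self.next_tn += 1


def run(arrival_order):
    inbox = TicketOrderedInbox()
    for m in arrival_order:
        inbox.deliver(m)
    return inbox.processed


m1 = Message("A", "C", 1, 1, "a1")
m2 = Message("B", "C", 1, 2, "b1")
m3 = Message("A", "C", 2, 3, "a2")
before_crash = run([m1, m2, m3])
after_crash = run([m2, m3, m1])  # same messages, different arrival order
```

Both runs process the payloads in the same ticket order, which is the correctness condition stated for the protocol.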
Message to Remote Chares
[Figure: ticket exchange between Chare P (sender) and Chare Q (receiver)]
• P → Q: <Sender, SN>
• Q → P: <SN, TN, Receiver>
• P → Q: <SN, TN, Message>
• If <sender, SN> has been seen earlier, TN is marked as received
• Otherwise create new TN and store the <sender, SN, TN>
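A minimal sketch of the receiver-side ticket lookup follows. The `TicketTable` class is hypothetical, and returning the stored TN for a repeated request is an assumed reading of "marked as received". The point is only that a repeated <sender, SN> request never gets a fresh ticket.

```python
class TicketTable:
    """Receiver-side table mapping <sender, SN> to the TN it was given."""

    def __init__(self):
        self.tickets = {}  # (sender, sn) -> tn
        self.last_tn = 0   # highest TN handed out so far

    def request_ticket(self, sender, sn):
        key = (sender, sn)
        if key in self.tickets:
            # Seen earlier: answer with the same TN (assumed behaviour).
            return self.tickets[key]
        # First request for this message: create and remember a new TN.
        self.last_tn += 1
        self.tickets[key] = self.last_tn
        return self.last_tn


q = TicketTable()
first = q.request_ticket("P", 1)
other = q.request_ticket("R", 1)
repeat = q.request_ticket("P", 1)  # e.g. the request is sent again after a crash
later = q.request_ticket("P", 2)
```

Because the table is part of the receiver's state, a re-sent request is answered consistently and the processing order fixed the first time is preserved.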
Message to Local Chare
• Multiple Chares on 1 processor
  • If processor crashes all trace of local message is lost
  • After restart it should have the same TN
  • Store <sender, receiver, SN, TN> on buddy
[Figure: Chare P and Chare Q on Processor R, and the Buddy of Processor R]
• P → Q: <sender, SN>
• Q → P: <SN, TN, Receiver>
• Processor R → Buddy of Processor R: <sender, receiver, SN, TN>
• Buddy of Processor R → Processor R: Ack
• P → Q: <SN, TN, Message>
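The local-message case can be sketched as below. All names (`Buddy`, `Processor`, `send_local`) are hypothetical, and for simplicity one TN counter per receiver is assumed. The sketch shows the two properties the slide asks for: the ticket record is acknowledged by the buddy before the message is delivered, and a restarted processor reuses the pre-crash TN.

```python
class Buddy:
    """Holds ticket records for local messages of the processor it backs up."""

    def __init__(self):
        self.local_log = {}  # (sender, receiver, sn) -> tn

    def store(self, sender, receiver, sn, tn):
        self.local_log[(sender, receiver, sn)] = tn
        return "Ack"


class Processor:
    def __init__(self, buddy, restored_log=None):
        self.buddy = buddy
        # After a restart: the local message log fetched from the buddy.
        self.restored_log = dict(restored_log or {})
        self.last_tn = {}    # receiver -> highest TN known so far
        for (_, recv, _), tn in self.restored_log.items():
            self.last_tn[recv] = max(tn, self.last_tn.get(recv, 0))
        self.delivered = []  # (receiver, tn, payload)

    def send_local(self, sender, receiver, sn, payload):
        key = (sender, receiver, sn)
        if key in self.restored_log:
            tn = self.restored_log[key]  # reuse the pre-crash TN
        else:
            tn = self.last_tn.get(receiver, 0) + 1
        self.last_tn[receiver] = max(tn, self.last_tn.get(receiver, 0))
        # Deliver only after the buddy has acknowledged the record.
        if self.buddy.store(sender, receiver, sn, tn) != "Ack":
            raise RuntimeError("buddy did not acknowledge")
        self.delivered.append((receiver, tn, payload))
        return tn


buddy = Buddy()
r = Processor(buddy)
tn_before = [r.send_local("P", "Q", 1, "x"), r.send_local("P", "Q", 2, "y")]

# Processor R crashes; the restarted process fetches the log from the buddy
# and re-executes the same sends, here in a different order.
r2 = Processor(buddy, restored_log=buddy.local_log)
tn_after_2 = r2.send_local("P", "Q", 2, "y")
tn_after_1 = r2.send_local("P", "Q", 1, "x")
```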
Checkpoint Protocol
• A processor asynchronously decides to checkpoint
• Packs up the state of all its chares and sends it to the buddy
  • Message logs are part of a chare’s state
• After a checkpoint, senders can garbage collect their logs of messages that the checkpointed chares have already processed
• Deciding when to checkpoint is an interesting problem
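As a rough illustration of the checkpoint step, the sketch below stores a snapshot in the buddy's memory and then trims sender-side logs. It makes several simplifying assumptions that are not from the slides: the buddy is the next processor (`buddy_of`), TNs are tracked per processor rather than per chare, and all nodes live in one Python list.

```python
import copy


def buddy_of(pe, num_pes):
    # Assumption for illustration: the buddy is simply the next processor.
    return (pe + 1) % num_pes


class Node:
    def __init__(self, pe):
        self.pe = pe
        self.last_tn = 0    # highest TN processed by chares on this node
        self.sent_log = []  # sender-side log: (dest_pe, tn, payload)
        self.held = {}      # checkpoints held in memory for other nodes


def checkpoint(nodes, pe):
    node = nodes[pe]
    # Message logs are part of the saved state.
    snapshot = copy.deepcopy({"last_tn": node.last_tn, "sent_log": node.sent_log})
    nodes[buddy_of(pe, len(nodes))].held[pe] = snapshot
    # Senders may now drop log entries already covered by the checkpoint.
    for other in nodes:
        other.sent_log = [
            (dest, tn, msg)
            for (dest, tn, msg) in other.sent_log
            if not (dest == pe and tn <= snapshot["last_tn"])
        ]


nodes = [Node(i) for i in range(3)]
nodes[0].sent_log = [(1, 1, "a"), (1, 2, "b"), (1, 3, "c"), (2, 1, "d")]
nodes[1].last_tn = 2
checkpoint(nodes, 1)
```

Only entries destined for the checkpointed processor with a TN at or below its last processed TN are dropped; everything else stays in the log for a possible replay.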
Reliability
• Only one scenario in which our protocol fails
  • Processor X (buddy of Y) crashes and restarts
  • Checkpoint of Y is lost
  • Y now crashes before saving its checkpoint
• Result of not assuming reliable nodes for storing checkpoints
• Still increases reliability by orders of magnitude
• Probability can be minimized by having Y checkpoint after X crashes and restarts
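The failure scenario can be expressed as a small check over a sequence of events. This is only an illustrative model: the `unrecoverable` function, the event tuples and the buddy map are hypothetical, and it assumes every processor starts with a valid checkpoint on its buddy.

```python
def unrecoverable(events, buddy_of):
    """Return True if some processor crashes while its checkpoint is lost.

    events   -- list of ("crash", p) or ("checkpoint", p) tuples, in order
    buddy_of -- dict mapping each processor to the buddy holding its checkpoint
    """
    has_ckpt = {p: True for p in buddy_of}  # assumed initial state
    for kind, p in events:
        if kind == "checkpoint":
            has_ckpt[p] = True
        elif kind == "crash":
            if not has_ckpt[p]:
                return True  # nothing to restart p from
            # p's memory is wiped, so checkpoints it held for others are lost.
            for q, b in buddy_of.items():
                if b == p:
                    has_ckpt[q] = False
    return False


buddies = {"Y": "X", "X": "Z", "Z": "Y"}  # X is the buddy of Y
fails = unrecoverable([("crash", "X"), ("crash", "Y")], buddies)
survives = unrecoverable(
    [("crash", "X"), ("checkpoint", "Y"), ("crash", "Y")], buddies
)
```

The second sequence models the mitigation from the slide: Y checkpoints again right after X restarts, which closes the window.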
Basic Restart Protocol
• After a crash, a Charm++ process is restarted on a new processor
• Gets checkpoint and local message log from buddy
• Chares are restored and other processors are informed of it
• Logged messages for chares on restarted processors are resent
  • The highest TN seen from a crashed chare is also sent
• Messages are reprocessed by the restarted chares
  • Local messages check first in the restored local message log
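A simplified sketch of the replay step is shown below. The `restart` function and its arguments are hypothetical. It assumes that all logged messages have already been resent when replay starts, and that the highest TNs reported by other processors are used to avoid reissuing a ticket that was already handed out before the crash.

```python
def restart(checkpoint_state, resent, highest_tn_seen):
    """Replay resent messages in TN order on top of a restored checkpoint.

    checkpoint_state -- {"processed": [...], "last_tn": int} from the buddy
    resent           -- list of (tn, payload) pairs resent by the senders
    highest_tn_seen  -- highest TN from the crashed chare seen by each sender
    """
    processed = list(checkpoint_state["processed"])
    last_tn = checkpoint_state["last_tn"]
    for tn, payload in sorted(resent):
        if tn <= last_tn:
            continue  # already reflected in the checkpoint
        processed.append(payload)
        last_tn = tn
    # New tickets must start above every TN that was ever handed out.
    next_new_tn = max([last_tn] + list(highest_tn_seen)) + 1
    return processed, next_new_tn


state = {"processed": ["m1", "m2"], "last_tn": 2}
replayed, next_tn = restart(state, [(4, "m4"), (2, "m2"), (3, "m3")], [4, 5])
```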
Parallel Restart
• Message Logging allows fault-free processors to continue with their execution
• However, sooner or later some processors start waiting for the crashed processor
• Virtualization allows us to move work from the restarted processor to waiting processors
  • Chares are restarted in parallel
  • Restart cost can be reduced
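The idea of spreading the restarted processor's chares over waiting processors can be sketched as follows. The `distribute_chares` function and the round-robin policy are assumptions made for illustration; the slides do not specify a placement policy.

```python
def distribute_chares(chares, restarted_pe, waiting_pes):
    """Assign the crashed processor's chares round-robin to the restarted
    processor and the processors that are idle waiting for it."""
    targets = [restarted_pe] + list(waiting_pes)
    placement = {pe: [] for pe in targets}
    for i, chare in enumerate(chares):
        placement[targets[i % len(targets)]].append(chare)
    return placement


placement = distribute_chares(
    ["c0", "c1", "c2", "c3", "c4", "c5"], restarted_pe=3, waiting_pes=[0, 1]
)
```

With three targets, each one re-executes two chares instead of the restarted processor replaying all six on its own.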
Present Status
• Most of Charm++ has been ported
• Support for migration has not yet been implemented in the fault tolerant protocol
• Simple AMPI programs work
  • Barriers to be done
• Parallel restart not yet implemented
Experimental Evaluation
• NAS benchmarks could not be used
• Used a 5-point stencil computation with a 1-D decomposition
• 8 quad 500 MHz PIII cluster with 500 MB of RAM per node, connected by Ethernet
Overhead
Measurement of overhead for an application with low communication to computation ratio
[Figure: Overhead measurement. x-axis: Number of processors (0 to 35); y-axis: Normalized performance (0 to 100); series: Normal Charm++, FT-without checkpoint, FT-full protocol]
Measurement of overhead for an application with high communication to computation ratio
[Figure: Overhead measurement. x-axis: Number of processors (0 to 35); y-axis: Normalized performance (0 to 100); series: Normal Charm++, FT-without checkpoint, FT-full protocol]
Recovery Performance
Execution time with increasing number of faults on 8 processors (checkpoint period 30 s)
[Figure: Execution Time with Faults. x-axis: Number of faults (0 to 7); y-axis: Execution Time (s) (0 to 800)]
Summary
• Designed a fault tolerant protocol that
  • Performs fast checkpoints
  • Performs fast parallel restarts
  • Doesn’t depend on any completely reliable node
  • Supports multiple faults
  • Minimizes the effect of a crash on fault free processors
• Partial implementation of the protocol