Presented By:
Md. Mohsin Ali
April 24, 2013
Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications
Authors: Md. Mohsin Ali and Peter E. Strazdins
Research School of Computer Science
The Australian National University
Canberra, ACT 0200, Australia
Introduction
High Performance Computing application areas:
Atmosphere, Earth, Environment
Bioscience, Biotechnology, Genetics
Chemistry, Molecular Sciences
Computer Science, Mathematics
Advanced graphics and virtual reality, etc.
Science and Engineering; Industrial and Commercial
Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications | Md. Mohsin Ali |
High Performance Computing components: hundreds of thousands of processing elements concurrently executing millions of threads
HPC systems are becoming larger to meet current demand
But:
As system size grows, the probability of component failure grows
As system size grows, achieving parallelism becomes harder
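The link between system size and failure probability can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that components fail independently and with the same probability `p` over some fixed interval. Neither that assumption nor the name `system_failure_probability` comes from the slides.

```python
import math


def system_failure_probability(p: float, n: int) -> float:
    """Probability that at least one of n components fails.

    Illustrative assumption: components fail independently, each with
    probability p over the interval of interest, so the result is
    1 - (1 - p)**n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if p == 1.0:
        return 1.0 if n > 0 else 0.0
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n * math.log1p(-p))
```

Under these assumptions, a per-component failure probability of 1e-5 gives a system failure probability of roughly 0.01 for 1,000 components and roughly 0.63 for 100,000 components.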
Frequency of Failure
Reasons of Failure
*G. Gibson, B. Schroeder, J. Digney, 2007
Ways of Failure Recovery
Checkpoint/restart
Disadvantages:
I/O bottleneck
up to 25% overhead in current petascale systems
Replication
Disadvantages:
data consistency must be maintained
hard to find proper places for replication
scalability is degraded
Message logging
Disadvantages:
same as replication, but with reduced message size
performance degradation caused by synchronization
Algorithm-based
Advantages:
error detection, correction, and repeated computation are within the algorithm executing within a processing element (PE)
errors propagate to fewer PEs
Motivation
“... partial differential equations (PDEs) are the basis of all physical theorems.”
Bernhard Riemann (1826-1866)
Solution of PDEs
Time-evolving numerical methods
Parallel version for complex PDEs
Even a single process failure postpones the whole computation
More components in a system cause more failures (and more complexity)
Goal
Design and implement a time-evolving application that can:
tolerate process failure
achieve high scalability
Learn the usability of the fault-tolerant semantics of FT-MPI
Challenges
How to detect process failure?
How to determine which processes have failed?
How to recover failed processes?
How to recover the “lost state info” of recovered processes?
How to continue time-stepping from the point of failure?
How to retain scalability?
How to Tackle Challenges
How to detect process failure?
MPI_ERR_OTHER semantics of FT-MPI
How to determine which processes have failed?
Attribute caching mechanism of MPI
How to recover failed processes?
Create new processes with the same ranks as the failed ones
> FT_MPI_CHECK_RECOVER and MPI_Comm_dup semantics of FT-MPI
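The three answers above (detect through a return code, identify the failed ranks, restart them under the same ranks) form a simple control flow. The sketch below models that flow with a toy communicator in plain Python. It does not use MPI, and `ToyComm`, `ERR_FAILURE`, `failed_ranks`, `rebuild` and `send_with_recovery` are hypothetical names invented for illustration, not part of the FT-MPI API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

SUCCESS = 0
ERR_FAILURE = 1  # hypothetical stand-in for the error code returned after a failure


@dataclass
class ToyComm:
    """Toy communicator for illustration only; none of this is the FT-MPI API."""
    size: int
    dead: Set[int] = field(default_factory=set)
    restarts: Dict[int, int] = field(default_factory=dict)

    def kill(self, rank: int) -> None:
        """Simulate the failure of one process."""
        self.dead.add(rank)

    def send(self, dest: int) -> int:
        # Any communication call reports a failure through its return code.
        return ERR_FAILURE if self.dead else SUCCESS

    def failed_ranks(self) -> List[int]:
        # Stand-in for reading the failed ranks from cached attributes.
        return sorted(self.dead)

    def rebuild(self) -> None:
        # Stand-in for recovery: every failed rank is restarted under the same rank.
        for rank in self.dead:
            self.restarts[rank] = self.restarts.get(rank, 0) + 1
        self.dead.clear()


def send_with_recovery(comm: ToyComm, dest: int) -> List[int]:
    """Detect a failure, identify the failed ranks, recover, then retry the send."""
    if comm.send(dest) == SUCCESS:
        return []
    failed = comm.failed_ranks()    # which processes failed?
    comm.rebuild()                  # restart them with their old ranks
    if comm.send(dest) != SUCCESS:  # retry the interrupted call
        raise RuntimeError("send still failing after recovery")
    return failed
```

A caller would wrap each communication call this way; the returned list tells it which ranks need their state restored afterwards.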
How to recover the “lost state info” of recovered processes?
How to continue time-stepping from the point of failure?
The “lost state info” of recovered processes is recovered from the master when FT-MPI restarts the processes
Time-stepping is continued from one step back
[Figure: 1D advection with periodic boundary condition; every two-way exchange goes through the master, which saves the “state info”; labels: worker, “Replaced by”]
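The recovery scheme described here (every exchange goes through the master, which keeps the workers' state; after a failure, time-stepping resumes one step back) can be illustrated with a serial simulation. The sketch below is plain Python and does not use MPI. The first-order upwind scheme, the function names, and the choice to roll every worker back to the master's saved state are assumptions made for illustration; the slides do not specify them.

```python
from typing import Dict, List, Optional, Set


def upwind_step(chunk: List[float], left_halo: float, c: float) -> List[float]:
    """First-order upwind update for u_t + a*u_x = 0 (a > 0), Courant number c."""
    padded = [left_halo] + chunk
    return [padded[i + 1] - c * (padded[i + 1] - padded[i])
            for i in range(len(chunk))]


def run(u0: List[float], n_workers: int, n_steps: int, c: float,
        failures: Optional[Dict[int, Set[int]]] = None) -> List[float]:
    """Advance u0 by n_steps on n_workers simulated workers.

    `failures` maps a step index to the worker ranks whose local state is
    lost just before that step's exchange (a simulated process failure).
    """
    if len(u0) % n_workers != 0:
        raise ValueError("len(u0) must be divisible by n_workers")
    size = len(u0) // n_workers
    workers = [list(u0[r * size:(r + 1) * size]) for r in range(n_workers)]
    saved = [list(w) for w in workers]  # "state info" held on the master
    saved_step = 0                      # time-step the saved state belongs to
    pending = dict(failures or {})
    step = 0
    while step < n_steps:
        failed = pending.pop(step, set())
        if failed:
            for r in failed:
                workers[r] = []         # local state of the failed worker is gone
            # Recovery: restarted workers get their state back from the master,
            # and (assumption) all workers resume from the saved time-step.
            workers = [list(s) for s in saved]
            step = saved_step
            continue
        # Two-way exchange through the master, which saves each worker's state.
        saved = [list(w) for w in workers]
        saved_step = step
        # Periodic boundary: worker 0 takes its halo from the last worker.
        halos = [saved[(r - 1) % n_workers][-1] for r in range(n_workers)]
        workers = [upwind_step(workers[r], halos[r], c) for r in range(n_workers)]
        step += 1
    return [x for w in workers for x in w]
```

In this model a run with injected failures produces the same final field as a failure-free run, because the repeated step starts from exactly the state the master saved.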
Sending from master (process 0) fails
[Figure sequence: 1D advection with periodic boundary condition; every two-way exchange goes through the master, which saves the “state info”; label: worker]
Sending from a worker (process > 0) fails
How to retain scalability?
Scalability is very low in this master-worker model
Overhead of FT-MPI over Open MPI
# cores: 16 (total)
# nodes: 4 (total)
Memory: 4 GB (each node)
Standard GigE switch
Scalability and Recovery Time
Scalability achieved for 16 cores (4 nodes) = 15% (very low)
Recovery time:
1 worker process failed = 1 sec
4 worker processes failed = 2 sec
8 worker processes failed = 3 sec
15 worker processes failed = 5 sec
Future Work
Checkpointing after every T time-steps on a specific node
Checkpointing after every T time-steps on separate nodes
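Checkpointing every T time-steps trades checkpoint frequency against the amount of work repeated after a failure. The sketch below is a minimal in-memory model of that idea, not the planned implementation: the checkpoint is a deep copy kept in a local variable (a stand-in for a checkpoint node), failures are injected at given step indices, and the name `run_with_checkpoints` is hypothetical.

```python
import copy
from typing import Any, Callable, Iterable, Tuple


def run_with_checkpoints(state: Any, step_fn: Callable[[Any], Any],
                         n_steps: int, T: int,
                         fail_steps: Iterable[int] = ()) -> Tuple[Any, int]:
    """Advance `state` by n_steps, checkpointing after every T completed steps.

    A simulated failure at a step index in `fail_steps` discards the live
    state and rolls back to the most recent checkpoint.
    Returns (final_state, number_of_step_fn_calls).
    """
    if T < 1:
        raise ValueError("T must be at least 1")
    checkpoint = (0, copy.deepcopy(state))  # (step index, saved state)
    pending = set(fail_steps)
    step = 0
    calls = 0
    while step < n_steps:
        if step in pending:
            pending.discard(step)           # each failure fires once
            step, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        state = step_fn(state)
        calls += 1
        step += 1
        if step % T == 0:
            checkpoint = (step, copy.deepcopy(state))
    return state, calls
```

With a step function that adds 1, ten steps, and a failure at step 7, T = 4 repeats three steps (13 calls in total), while T = 1 repeats none (10 calls) at the price of checkpointing after every step.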
Conclusion
As system size grows, the probability of component failure grows
As system size grows, achieving parallelism becomes harder
Process failure detection by FT-MPI
Failed process restart by FT-MPI
Algorithm-based fault tolerance technique for data recovery
Overhead of FT-MPI compared to Open MPI is low
Recovery time is short
The master-worker model is not very scalable, but can be used as a prototype
Thank You!