Presented By:

Md. Mohsin Ali

April 24 2013

Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications

Authors: Md. Mohsin Ali and Peter E. Strazdins

Research School of Computer Science

The Australian National University

Canberra, ACT 0200, Australia

Introduction

High Performance Computing application areas

Atmosphere, Earth, Environment

Bioscience, Biotechnology, Genetics

Chemistry, Molecular Sciences

Computer Science, Mathematics

Advanced graphics and virtual reality, etc.

Science and Engineering; Industrial and Commercial

Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications | Md. Mohsin Ali |

High Performance Computing systems comprise hundreds of thousands of processing elements that concurrently execute millions of threads.

HPC systems are becoming larger to meet current demand, but:

Larger system size means a higher probability of component failure.

Larger system size makes parallelism harder to achieve.
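The link between system size and failure probability can be made concrete with a small calculation. The sketch below assumes that components fail independently and with the same probability over some time window; both the independence assumption and the numbers in the usage loop are illustrative and do not come from the slides.

```python
def prob_any_failure(n_components: int, p_single: float) -> float:
    """Probability that at least one of n independent components fails.

    Assumes every component fails with the same probability p_single
    during the time window of interest (an illustrative assumption).
    """
    return 1.0 - (1.0 - p_single) ** n_components


if __name__ == "__main__":
    # Hypothetical per-component failure probability of 1e-5.
    for n in (100, 10_000, 1_000_000):
        print(n, prob_any_failure(n, 1e-5))
```

Under these assumptions the probability of at least one failure approaches 1 as the component count grows, even when each individual component is very reliable.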

[Figures: frequency of failure and reasons of failure. Source: G. Gibson, B. Schroeder, J. Digney, 2007]

Ways of Failure Recovery

checkpoint/restart

disadvantages:
   I/O bottleneck
   up to 25% overhead in current petascale systems

replication

disadvantages:
   data consistency must be maintained
   hard to find proper places for replication
   scalability is degraded

message logging

disadvantages:
   same as replication, but with reduced message size
   performance degradation caused by synchronization

algorithm-based

advantages:
   error detection, correction, and recomputation all happen within the algorithm executing on a processing element (PE)
   errors propagate to fewer PEs

Motivation

“... partial differential equations (PDEs) are the basis of all physical theorems.”
Bernhard Riemann (1826-1866)

Solution of PDEs: time-evolving numerical methods

Parallel version for complex PDEs

Even a single process failure stalls the whole computation.

More components in a system cause more failures (and more complexity).

Goal

Design and implement a time-evolving application that can:
   tolerate process failure
   achieve high scalability

Learn the usability of the fault-tolerant semantics of FT-MPI.

Challenges

How to detect process failure?

How to determine which processes have failed?

How to recover failed processes?

How to recover the “lost state info” of recovered processes?

How to continue time-stepping from the point of failure?

How to retain scalability?

How to Tackle Challenges

How to detect process failure?
   MPI_ERR_OTHER semantics of FT-MPI

How to determine which processes have failed?
   Attribute caching mechanism of MPI

How to recover failed processes?
   Create new processes with the same ranks as before
   > FT_MPI_CHECK_RECOVER and
   > MPI_Comm_dup semantics of FT-MPI

How to recover the “lost state info” of recovered processes?

How to continue time-stepping from the point of failure?

Example application: 1D advection with periodic boundary condition.

Every two-way exchange between workers is replaced by an exchange that goes through the master, which saves the “state info”.

The “lost state info” of recovered processes is recovered from the master by FT-MPI process restart.

Time-stepping continues from one step backwards.
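The recovery idea above can be sketched as a single-process simulation. This is not the authors' FT-MPI code: the first-order upwind scheme, the `Master` class, the `split` helper, and the choice to store each worker's whole block on the master are all assumptions made for illustration, and a worker failure is simulated by discarding that worker's result.

```python
def upwind_step(block, left_halo, c):
    """One first-order upwind step for u_t + a*u_x = 0 with a > 0.

    c = a*dt/dx is the Courant number; left_halo is the last value of the
    left neighbour's block (assumed scheme, not taken from the slides).
    """
    shifted = [left_halo] + block[:-1]
    return [u - c * (u - s) for u, s in zip(block, shifted)]


def split(u, n):
    """Split list u into n contiguous, nearly equal blocks (needs n <= len(u))."""
    q, r = divmod(len(u), n)
    blocks, start = [], 0
    for k in range(n):
        size = q + (1 if k < r else 0)
        blocks.append(u[start:start + size])
        start += size
    return blocks


class Master:
    """Hypothetical master: relays halo values and keeps each worker's state."""

    def __init__(self):
        self.saved = []

    def exchange(self, blocks):
        # Save every worker's current block ("state info") before the step.
        self.saved = [list(b) for b in blocks]
        n = len(blocks)
        # Periodic boundary: worker 0 receives its halo from the last worker.
        return [self.saved[(k - 1) % n][-1] for k in range(n)]


def run(u0, n_workers, n_steps, c, fail_at=None):
    """Advance u0 by n_steps; fail_at=(step, worker) injects one failure."""
    blocks = split(list(u0), n_workers)
    master = Master()
    step = 0
    while step < n_steps:
        halos = master.exchange(blocks)
        new_blocks = [upwind_step(b, h, c) for b, h in zip(blocks, halos)]
        if fail_at is not None and fail_at[0] == step:
            new_blocks[fail_at[1]] = None  # simulated crash: data is lost
            fail_at = None                 # inject the failure only once
        if any(b is None for b in new_blocks):
            # Recovery: restore all workers from the master's copy and redo
            # the step, i.e. resume from one step backwards.
            blocks = [list(b) for b in master.saved]
            continue
        blocks = new_blocks
        step += 1
    return [u for b in blocks for u in b]
```

In this sketch a run with an injected failure, for example `run(u0, 4, 10, 0.5, fail_at=(3, 2))`, produces the same result as a failure-free run, because the master always holds the state from the start of the interrupted step. Routing every exchange through one process is also what limits scalability in this model.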

[Figure sequence: sending from the master (process 0) fails]

[Figure sequence: sending from a worker (process > 0) fails]

How to retain scalability?
   Scalability is very low in this master-worker model.

Overhead of FT-MPI over Open MPI

Test system:
   # cores: 16 (total)
   # nodes: 4 (total)
   Memory: 4 GB (each node)
   Standard GigE switch

Scalability and Recovery Time

Scalability achieved for 16 cores (4 nodes) = 15% (very low)

Recovery time:
   1 worker process failed = 1 sec
   4 worker processes failed = 2 sec
   8 worker processes failed = 3 sec
   15 worker processes failed = 5 sec

Future Work

Checkpointing after every T time-steps on a specific node

Checkpointing after every T time-steps on separate nodes
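Checkpointing every T time-steps trades less frequent state saving against more recomputation after a failure. The following is a minimal sketch of that trade-off, assuming an in-memory checkpoint and a hypothetical `step_fn` that advances the state by one time-step; where the checkpoint is stored (a specific node or separate nodes) is outside the scope of the sketch.

```python
import copy


def run_with_checkpoints(state, step_fn, n_steps, T, fails_at=()):
    """Time-step loop that checkpoints every T steps and rolls back on failure.

    fails_at lists step indices at which a simulated failure occurs before
    the step is executed. Returns the final state and the number of steps
    that had to be recomputed.
    """
    checkpoint = (0, copy.deepcopy(state))  # (step index, saved state)
    pending = set(fails_at)
    step = 0
    redone = 0
    while step < n_steps:
        if step in pending:
            pending.discard(step)           # each failure happens once
            redone += step - checkpoint[0]  # work lost since last checkpoint
            step = checkpoint[0]
            state = copy.deepcopy(checkpoint[1])
            continue
        state = step_fn(state)
        step += 1
        if step % T == 0:
            checkpoint = (step, copy.deepcopy(state))
    return state, redone
```

With `T = 1` the loop saves state after every step and a failure costs no recomputation, which mirrors the per-step saving of the master-worker model; larger values of `T` save less often but redo up to `T - 1` steps per failure.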

Conclusion

Larger system size means a higher probability of component failure and makes parallelism harder to achieve.

Process failure detection by FT-MPI

Failed process restart by FT-MPI

Algorithm-based fault tolerance technique for data recovery

Overhead of FT-MPI compared to Open MPI is low

Recovery time is short

The master-worker model is not very scalable, but can be used as a prototype

Thank You!