
Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving...

Date post: 23-Jun-2020
Transcript
Page 1: Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applicationsusers.cecs.anu.edu.au/~mohsin/downloads/slides-algorithm... · 2016-07-20 · Presented By: Md.

Presented by: Md. Mohsin Ali
April 24, 2013

Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications

Authors: Md. Mohsin Ali and Peter E. Strazdins
Research School of Computer Science
The Australian National University
Canberra, ACT 0200, Australia

Introduction

High Performance Computing application areas:
Atmosphere, Earth, Environment
Bioscience, Biotechnology, Genetics
Chemistry, Molecular Sciences
Computer Science, Mathematics
Advanced graphics and virtual reality, etc.
Science and Engineering; Industrial and Commercial

Algorithm-Based Master-Worker Model of Fault Tolerance in Time-Evolving Applications | Md. Mohsin Ali |

High Performance Computing components:
hundreds of thousands of processing elements that concurrently execute millions of threads

The size of HPC systems is growing to meet current demand, but:
As system size grows, the probability of component failure grows
As system size grows, achieving parallelism becomes harder
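The link between system size and failure probability can be made concrete with a small calculation. The sketch below assumes that components fail independently with the same probability; neither that assumption nor the sample numbers come from the slides, and the function name is only illustrative.

```python
def prob_any_failure(n_components: int, p_fail: float) -> float:
    """Probability that at least one of n independent components fails.

    Assumes every component fails independently with probability p_fail
    over the time window of interest (an illustrative assumption).
    """
    return 1.0 - (1.0 - p_fail) ** n_components


# Illustrative values only: the per-component probability is made up.
for n in (1_000, 10_000, 100_000):
    print(n, prob_any_failure(n, 1e-5))
```

Under this model the chance of at least one failure rises steadily with the component count, even when each individual component is very reliable.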

Frequency of Failure [figure not reproduced]
Reasons of Failure [figure not reproduced]
*G. Gibson, B. Schroeder, J. Digney, 2007

Ways of Failure Recovery

checkpoint/restart
  disadvantages:
    I/O bottleneck
    up to 25% overhead in current petascale systems

replication
  disadvantages:
    data must be kept consistent
    hard to find the proper places for replication
    scalability is degraded

message logging
  disadvantages:
    same as replication, but with reduced message size
    performance degradation caused by synchronization

algorithm-based
  advantages:
    error detection, correction, and repeated computation happen within the algorithm executing on a processing element (PE)
    errors propagate to fewer PEs
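To give a feel for what "algorithm-based" recovery means, here is a minimal sketch of the classic checksum idea: an extra PE holds the element-wise sum of all data blocks, so one lost block can be rebuilt from the survivors. This is a textbook illustration, not the scheme used in these slides (which stores state on a master), and the function names are hypothetical.

```python
from typing import List, Optional


def make_checksum(blocks: List[List[int]]) -> List[int]:
    """Element-wise sum of all data blocks, as held by an extra checksum PE."""
    return [sum(column) for column in zip(*blocks)]


def recover_block(blocks: List[Optional[List[int]]], checksum: List[int]) -> List[int]:
    """Rebuild the one missing block (marked None) from survivors and checksum."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("a single checksum can rebuild exactly one lost block")
    survivors = [block for block in blocks if block is not None]
    if survivors:
        partial = [sum(column) for column in zip(*survivors)]
    else:
        partial = [0] * len(checksum)
    # lost block = checksum - sum of surviving blocks
    return [c - p for c, p in zip(checksum, partial)]


blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = make_checksum(blocks)
damaged = [blocks[0], None, blocks[2]]      # the PE holding block 1 has failed
restored = recover_block(damaged, checksum)
```

Integers are used so the reconstruction is exact; with floating-point data the rebuilt block would match only up to rounding error.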

Motivation

"... partial differential equations (PDEs) are the basis of all physical theorems."
Bernhard Riemann (1826-1866)

Solution of PDEs
Time-evolving numerical methods
Parallel versions for complex PDEs
Even a single process failure stalls the whole computation
More components in a system cause more failures (and more complexity)

Goal

Design and implement a time-evolving application that can
  tolerate process failure
  achieve high scalability
Learn the usability of the fault-tolerant semantics of FT-MPI

Challenges

How to detect process failure?
How to determine which processes have failed?
How to recover failed processes?
How to recover the "lost state info" of recovered processes?
How to continue time-stepping from the point of failure?
How to retain scalability?

How to Tackle Challenges

How to detect process failure?
  MPI_ERR_OTHER semantics of FT-MPI
How to determine which processes have failed?
  Attribute caching mechanism of MPI
How to recover failed processes?
  Create new processes with the same ranks as before
  > FT_MPI_CHECK_RECOVER and
  > MPI_Comm_dup semantics of FT-MPI
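The detect, identify, and restart sequence above can be sketched with a toy model. Everything in the block is hypothetical: `ToyComm`, `ERR_OTHER`, `failed_ranks`, and `recover` merely mimic the control flow and are not the FT-MPI API, whose real calls are C functions and constants such as those named on the slide.

```python
# Toy model only: every name here is hypothetical and is NOT the FT-MPI API.
OK, ERR_OTHER = 0, 1


class ToyComm:
    """A pretend communicator that reports failures through return codes."""

    def __init__(self, size):
        self.alive = {rank: True for rank in range(size)}
        self.inbox = {rank: [] for rank in range(size)}
        self.failed_ranks = []  # stands in for an attribute cached on the communicator

    def kill(self, rank):
        """Simulate the loss of one process."""
        self.alive[rank] = False

    def send(self, dest, payload):
        """Deliver payload, or return ERR_OTHER if any process is dead."""
        dead = sorted(r for r, ok in self.alive.items() if not ok)
        if dead:
            self.failed_ranks = dead
            return ERR_OTHER
        self.inbox[dest].append(payload)
        return OK

    def recover(self):
        """Re-create every failed process under its old rank, with empty state."""
        restarted = list(self.failed_ranks)
        for rank in restarted:
            self.alive[rank] = True
            self.inbox[rank] = []
        self.failed_ranks = []
        return restarted


comm = ToyComm(4)
comm.kill(2)
if comm.send(1, "halo") == ERR_OTHER:      # 1. detect the failure
    failed = list(comm.failed_ranks)       # 2. identify which ranks failed
    restarted = comm.recover()             # 3. restart them with the same ranks
    status = comm.send(1, "halo")          # retry the interrupted send
```

Note that the restarted process comes back with empty state; restoring its data is a separate problem.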

How to recover the "lost state info" of recovered processes?
How to continue time-stepping from the point of failure?

Example: 1D advection with periodic boundary condition
[Diagram: the original communication pattern is replaced by a master-worker pattern]
Every two-way exchange goes through the master, which saves the "state info"
The "lost state info" of recovered processes is recovered from the master by FT-MPI process restart
Time-stepping is continued from one step back
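The recovery idea can be tried out in a serial simulation. The sketch below makes several assumptions that the slides do not state: a first-order upwind scheme for the advection update, a master that keeps a full copy of each worker's chunk at the start of every step, and a failure modelled by discarding one worker's result. All function and parameter names are illustrative.

```python
from typing import Dict, List, Optional


def step_chunk(chunk: List[float], left_halo: float, c: float) -> List[float]:
    """One upwind update of a chunk; left_halo is the left neighbour's last cell."""
    padded = [left_halo] + chunk
    return [padded[i + 1] - c * (padded[i + 1] - padded[i]) for i in range(len(chunk))]


def run(u0: List[float], n_workers: int, n_steps: int, c: float,
        fail_at: Optional[Dict[int, int]] = None) -> List[float]:
    """Advance u0 by n_steps; fail_at maps a step number to the rank that fails in it."""
    if len(u0) % n_workers != 0:
        raise ValueError("grid size must be divisible by the number of workers")
    pending = dict(fail_at or {})
    size = len(u0) // n_workers
    chunks = [list(u0[w * size:(w + 1) * size]) for w in range(n_workers)]
    step = 0
    while step < n_steps:
        # Exchange through the master: it keeps a copy of every chunk ("state info").
        saved = [list(ch) for ch in chunks]
        # Index w - 1 wraps around for w == 0, which gives the periodic boundary.
        halos = [saved[w - 1][-1] for w in range(n_workers)]
        new_chunks: List[Optional[List[float]]] = [
            step_chunk(chunks[w], halos[w], c) for w in range(n_workers)
        ]
        failed = pending.pop(step, None)
        if failed is not None:
            new_chunks[failed] = None              # the failed worker's data is gone
            chunks = [list(s) for s in saved]      # roll back to the master's copies
            continue                               # redo this step
        chunks = new_chunks
        step += 1
    return [x for ch in chunks for x in ch]


u0 = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
clean = run(u0, n_workers=4, n_steps=5, c=0.5)
faulty = run(u0, n_workers=4, n_steps=5, c=0.5, fail_at={2: 3})  # rank 3 fails in step 2
```

Because the step is redone from the saved state, the run with a failure ends with the same field as the failure-free run. The cost is that every exchange and every saved copy passes through one process.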

Case: a send from the master (process 0) fails [diagram sequence not reproduced]
Case: a send from a worker (process > 0) fails [diagram sequence not reproduced]

How to retain scalability?
  Scalability is very low in this master-worker model

Overhead of FT-MPI over Open MPI

# cores: 16 (total)
# nodes: 4 (total)
Memory: 4 GB (each node)
standard GigE switch

Scalability and Recovery Time

Scalability achieved for 16 cores (4 nodes) = 15% (very low)
Recovery time:
  1 worker process failed = 1 sec
  4 worker processes failed = 2 sec
  8 worker processes failed = 3 sec
  15 worker processes failed = 5 sec
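One way to read the recovery-time figures is to divide each time by the number of failed workers. The short calculation below uses only the four data points reported on the slide; the variable names are illustrative.

```python
# Failed worker processes -> recovery time in seconds, as reported on the slide.
recovery_sec = {1: 1, 4: 2, 8: 3, 15: 5}

# Recovery cost per failed worker for each measurement.
per_worker = {n: t / n for n, t in recovery_sec.items()}

for n, cost in per_worker.items():
    print(f"{n:2d} failed: {cost:.3f} s per failed worker")
```

The per-worker cost falls from 1 s for a single failure to about 0.33 s when 15 workers fail, so total recovery time grows more slowly than the number of failures in these measurements.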

Future Work

Checkpointing every T time-steps on a specific node
Checkpointing every T time-steps on separate nodes
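Periodic checkpointing can be sketched as follows. This is a minimal serial illustration under stated assumptions: the checkpoint is an in-memory deep copy rather than data written to a node, a failure simply discards the live state once, and `run_with_checkpoints`, `advance`, and `fail_steps` are hypothetical names. Where the checkpoint lives (one node or separate nodes) is not modelled.

```python
import copy


def run_with_checkpoints(state, advance, n_steps, T, fail_steps=()):
    """Advance state n_steps times, checkpointing every T steps.

    fail_steps lists steps at which the live state is lost once (simulated failure).
    Returns (final_state, number_of_advance_calls).
    """
    pending = set(fail_steps)
    ckpt_step, ckpt_state = 0, copy.deepcopy(state)   # initial state counts as a checkpoint
    step, calls = 0, 0
    while step < n_steps:
        if step in pending:
            pending.discard(step)
            # Roll back to the most recent checkpoint and recompute from there.
            step, state = ckpt_step, copy.deepcopy(ckpt_state)
            continue
        state = advance(state)
        calls += 1
        step += 1
        if step % T == 0:
            ckpt_step, ckpt_state = step, copy.deepcopy(state)
    return state, calls


def increment(x):
    """Stand-in for one time-step of a real solver."""
    return x + 1


no_failure = run_with_checkpoints(0, increment, n_steps=10, T=4)
one_failure = run_with_checkpoints(0, increment, n_steps=10, T=4, fail_steps=(6,))
```

With T = 4, a failure at step 6 rolls back to the checkpoint taken at step 4, so two steps are recomputed. A smaller T reduces recomputation but takes checkpoints more often, which is the trade-off behind choosing T.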

Conclusion

As system size grows, the probability of component failure grows
As system size grows, achieving parallelism becomes harder
Process failure detection by FT-MPI
Failed process restart by FT-MPI
Algorithm-based fault tolerance technique for data recovery
Overhead of FT-MPI compared to Open MPI is low
Recovery time is short
The master-worker model is not very scalable, but can be used as a prototype

Thank You!