
Software Fault Tolerance In A Clustered Architecture: Techniques & Reliability Modeling

Date post: 02-Jan-2016
Page 1: Software Fault Tolerance In A Clustered Architecture: Techniques & Reliability Modeling

SOFTWARE FAULT TOLERANCE IN A CLUSTERED ARCHITECTURE: TECHNIQUES & RELIABILITY MODELING

Hüsnü Şensoy

Page 2

AGENDA

• Introduction
• RCC Principal Techniques & Architecture Assumptions
• Reliability Techniques
• Reliability Modeling & Analysis
• Conclusion

Page 3

INTRODUCTION

Page 4

AVAILABILITY & DATA CONSISTENCY

Page 5

AVAILABILITY IN CLUSTERED ENVIRONMENT

4+2 Configuration

Page 6

RCC PRINCIPAL TECHNIQUES & ARCHITECTURE ASSUMPTIONS

Page 7

CLUSTERED ARCHITECTURE RELIABILITY

[Figure: four cluster nodes, each drawn as a stack of Commercial Hardware, OS, Database, and Application, with two groups of annotations.]

• Error detection
• Switchover

• Error detection
• Consequent recovery actions
• Data backup

Page 8

ZOOM IN TO A PROCESSING NODE

RCC Platform Assets
• WatchDog Interface
• State Server
• Cluster Management
• Process Monitoring
• Resource Monitors: Disk, Network

RCC Aware Application
• Network Systems' Applications

Off-the-shelf Applications

Standard Libraries

RCC Libraries

Commercial UNIX Operating System

Commercial Mirroring/Journaling File System Software

Commercial UNIX System Hardware Drivers

Disk Mirror Pseudo Driver

Page 9

RELIABILITY TECHNIQUES

Page 10

RELIABILITY DIMENSIONS

• Availability
• Data Consistency

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
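As a quick numeric illustration of the availability definition, Availability = MTBF / (MTBF + MTTR), the sketch below computes availability and the corresponding yearly downtime. The function names and the sample numbers are illustrative assumptions, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


if __name__ == "__main__":
    # Illustrative numbers only: one failure per 1000 hours, one hour to repair.
    a = availability(mtbf_hours=1000.0, mttr_hours=1.0)
    print(f"availability = {a:.4%}")
    print(f"downtime     = {downtime_minutes_per_year(a):.0f} min/year")
```

Shortening MTTR raises availability just as effectively as lengthening MTBF, which is why the techniques in this deck focus on fast detection and recovery.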

Page 11

RELIABILITY MODELS

Page 12

LEVELS OF RELIABILITY

Level 0: Basic automatic fault detection by watchdog, no automatic fault recovery, no data consistency
• A small set of fault classes (hardware and software) is detected by the watchdog.
• For a hardware fault, the system is manually reconfigured.
• For a software fault, the application process is restarted at its initial internal state. This requires initializing the faulty processor, since the application may have left its data in an inconsistent or incorrect state.

Level 1: Basic automatic fault detection by watchdog, automatic fault recovery, no data consistency
• A small set of fault classes (hardware and software) is detected by the watchdog, and recovery is automatic.
• When the watchdog detects a fault, the system is recovered automatically: reconfigured for hardware faults and initialized for software faults.

Level 2: Level 1 plus enhanced automatic fault detection by watchdog, plus periodic checkpointing, logging, and recovery of internal state
• The watchdog and the application are enhanced to automatically detect a larger set of faults.
• The internal state of the application process is periodically checkpointed.
• After a hardware failure is detected, the system is reconfigured around the faulty unit.
• The application is restarted at the most recent checkpointed internal state.

Level 3: Level 2 plus persistent data recovery (the highest level achievable with RCC)
• The persistent data of the application is replicated on a backup disk connected to a backup node and is kept consistent with the data on the primary node throughout normal operation of the application.
• In case of a fault, the backup disk on the backup node brings the application's persistent data as close as possible to the state at which the application crashed.

Level 4: Continuous operation without interruption
• This level of reliability is not achievable with RCC.
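The difference between Level 1 (restart at the initial state) and Level 2 (restart at the most recent checkpoint) can be sketched in a few lines. This is a toy model under stated assumptions: `CounterApp` and `Supervisor` are hypothetical names, the "watchdog" is just an exception handler, and the checkpoint is an in-memory copy. None of this is the RCC API.

```python
import copy


class CounterApp:
    """Toy application whose internal (volatile) state is a single counter."""

    def __init__(self):
        self.state = {"processed": 0}

    def step(self, fail: bool = False) -> None:
        if fail:
            raise RuntimeError("simulated software fault")
        self.state["processed"] += 1


class Supervisor:
    """Hypothetical stand-in for a watchdog: detects a crash and restarts."""

    def __init__(self, checkpoint_every=None):
        # checkpoint_every=None models Level 1 (restart from the initial state);
        # an integer models Level 2 (restart from the latest checkpoint).
        self.checkpoint_every = checkpoint_every
        self.checkpoint = None
        self.app = CounterApp()

    def run(self, steps: int, fail_at: int) -> int:
        for i in range(1, steps + 1):
            try:
                self.app.step(fail=(i == fail_at))
            except RuntimeError:
                self.recover()  # fault detected: automatic recovery
                continue
            if self.checkpoint_every and i % self.checkpoint_every == 0:
                # Periodic checkpoint of the internal state.
                self.checkpoint = copy.deepcopy(self.app.state)
        return self.app.state["processed"]

    def recover(self) -> None:
        self.app = CounterApp()  # restart the process at its initial state
        if self.checkpoint is not None:
            # Level 2: roll forward to the most recent checkpoint.
            self.app.state = copy.deepcopy(self.checkpoint)


if __name__ == "__main__":
    print(Supervisor().run(steps=10, fail_at=8))                    # prints 2
    print(Supervisor(checkpoint_every=5).run(steps=10, fail_at=8))  # prints 7
```

With a fault at step 8, the Level 1 run loses all seven completed steps, while the Level 2 run loses only the two steps completed after the checkpoint at step 5.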

Page 13

RELIABILITY MODELING & ANALYSIS

Page 14

BASIC MODEL FOR SOFTWARE FAULT TOLERANCE

[Figure: Markov state diagram with five states: Working, Fault Detection & Recovery, Volatile Data Recovery, Persistent Data Recovery, and Failed. Each transition label combines a rate with a coverage factor $c$, $c_1$, $c_2$, or $c_3$, or with its complement $(1 - c)$, $(1 - c_1)$, $(1 - c_2)$, $(1 - c_3)$.]
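A Markov model of this kind can be evaluated numerically by solving for the steady-state probabilities of the chain. The sketch below is a minimal pure-Python solver, followed by one possible reading of the five-state diagram. The transition structure, the symbol names (`lam`, `mu1` to `mu4`), the short state names, and all numeric values are assumptions for illustration, and treating only the Working state as "available" is likewise an assumption.

```python
def steady_state(states, rates):
    """Stationary distribution of a continuous-time Markov chain.

    states: list of state names
    rates:  dict mapping (source, target) -> transition rate
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Generator matrix Q: off-diagonal entries are rates, rows sum to zero.
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate
    # Solve pi * Q = 0 with sum(pi) = 1: transpose Q, then replace the last
    # balance equation with the normalization condition.
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= f * a[col][k]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][k] * pi[k] for k in range(r + 1, n))
        pi[r] = (b[r] - s) / a[r][r]
    return dict(zip(states, pi))


def five_state_rates(lam, c, mu1, c1, mu2, c2, mu3, c3, mu4):
    """Assumed structure: each recovery stage either succeeds (coverage) or fails."""
    W, D, V, P, F = "Working", "Detection", "Volatile", "Persistent", "Failed"
    return {
        (W, D): lam * c, (W, F): lam * (1 - c),
        (D, V): mu1 * c1, (D, F): mu1 * (1 - c1),
        (V, P): mu2 * c2, (V, F): mu2 * (1 - c2),
        (P, W): mu3 * c3, (P, F): mu3 * (1 - c3),
        (F, W): mu4,  # manual repair back to Working (assumed)
    }


if __name__ == "__main__":
    states = ["Working", "Detection", "Volatile", "Persistent", "Failed"]
    # Hypothetical parameter values, chosen only to exercise the solver.
    rates = five_state_rates(0.001, 0.9, 120.0, 0.99, 2.0, 0.999, 1.0, 0.999, 0.25)
    pi = steady_state(states, rates)
    print(f"P(Working) = {pi['Working']:.6f}")
```

Dropping states from the `rates` dictionary gives the reduced models for the lower reliability levels, so the same solver covers every level.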

Page 15

LEVEL 0 RELIABILITY

[Figure: two-state Markov model with states Working and Failed.]

Surviving parameter values: 41, 0.001

Availability: 99.96%
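For a two-state model of this kind, steady-state availability follows from a single balance equation. As a worked sketch, assume a failure rate $\lambda$ (Working to Failed) and a repair rate $\mu$ (Failed to Working); both symbol names are assumptions, not labels taken from the slide.

```latex
\begin{align}
\pi_W \lambda &= \pi_F \mu, \qquad \pi_W + \pi_F = 1 \\
A = \pi_W &= \frac{\mu}{\lambda + \mu}
  = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}},
  \qquad \text{MTBF} = \frac{1}{\lambda},\quad \text{MTTR} = \frac{1}{\mu}
\end{align}
```

The second line shows that the Markov result is the same quantity as the MTBF/MTTR definition of availability.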

Page 16

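LEVEL 1 RELIABILITY

[Figure: Markov model with three states: Working, Fault Detection & Recovery, and Failed. Transition labels include the coverage factor $c$ and its complement $(1 - c)$.]

Parameter values shown: $c = 0.9$; 30

Availability: 99.98%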
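A three-state model of this shape also has a closed-form solution. The sketch below assumes that detected faults leave Working at rate $\lambda c$ and return from Fault Detection & Recovery at rate $\mu_1$, that undetected faults go to Failed at rate $\lambda (1 - c)$, and that Failed is repaired at rate $\mu_4$. The symbols $\lambda$, $\mu_1$, and $\mu_4$ are assumed names, and counting only the Working state as available is also an assumption.

```latex
\begin{align}
\pi_W \,\lambda c &= \pi_R \,\mu_1
  &&\Rightarrow\quad \pi_R = \pi_W \frac{\lambda c}{\mu_1} \\
\pi_W \,\lambda (1 - c) &= \pi_F \,\mu_4
  &&\Rightarrow\quad \pi_F = \pi_W \frac{\lambda (1 - c)}{\mu_4} \\
\pi_W + \pi_R + \pi_F &= 1
  &&\Rightarrow\quad A = \pi_W
  = \frac{1}{1 + \dfrac{\lambda c}{\mu_1} + \dfrac{\lambda (1 - c)}{\mu_4}}
\end{align}
```

Because automatic recovery is typically much faster than manual repair ($\mu_1 \gg \mu_4$), raising the coverage $c$ moves downtime from the slow term to the fast one.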

Page 17

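LEVEL 2 RELIABILITY

[Figure: Markov model with four states: Working, Fault Detection & Recovery, Volatile Data Recovery, and Failed. Transition labels include the coverage factors $c$ and $c_1$ and their complements $(1 - c)$ and $(1 - c_1)$.]

Parameter values shown: $c = 0.9$, $c_1 = 0.99$; 1800

Availability: 99.99%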

Page 18

LEVEL 3 RELIABILITY

[Figure: Markov model with five states: Working, Fault Detection & Recovery, Volatile Data Recovery, Persistent Data Recovery, and Failed. Transition labels include the coverage factors $c$, $c_1$, and $c_2$ and their complements $(1 - c)$, $(1 - c_1)$, and $(1 - c_2)$.]

Parameter values shown: $c = 0.9$, $c_1 = 0.99$, $c_2 = 0.999$; 1800, 1800; 3600, 100

Availability: ~100%

Page 19

CONCLUSION

Page 20

CONCLUSION

• In this work, an RCC has been proposed.
• Different levels of reliability have been defined.
• A reliability analysis has been carried out via Markov modeling.

Page 21

QUESTIONS & COMMENTS

