Date posted: 02-Jan-2016
SOFTWARE FAULT TOLERANCE IN A CLUSTERED ARCHITECTURE: TECHNIQUES & RELIABILITY MODELING
Hüsnü Şensoy
AGENDA
• Introduction
• RCC Principal Techniques & Architecture Assumptions
• Reliability Techniques
• Reliability Modeling & Analysis
• Conclusion
INTRODUCTION
AVAILABILITY & DATA CONSISTENCY
AVAILABILITY IN CLUSTERED ENVIRONMENT
4+2 Configuration
RCC PRINCIPAL TECHNIQUES & ARCHITECTURE ASSUMPTIONS
CLUSTERED ARCHITECTURE RELIABILITY
[Figure: four cluster nodes, each a stack of Commercial Hardware, OS, Database, and Application.]
Figure callouts:
• Error Detection, Switchover
• Error detection, Consequent recovery actions, Data backup
ZOOM IN TO A PROCESSING NODE
RCC Platform Assets
• WatchDog Interface
• State Server
• Cluster Management
• Process Monitoring
• Resource Monitors: Disk, Network
RCC Aware Application
• Network Systems’ Applications
Off-the-shelf Applications
Standard Libraries
RCC Libraries
Commercial UNIX Operating System
Commercial Mirroring/Journaling File System Software
Commercial UNIX System Hardware Drivers
Disk Mirror Pseudo Driver
RELIABILITY TECHNIQUES
RELIABILITY DIMENSIONS
• Availability
• Data Consistency

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
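As a quick illustration of the availability ratio, here is a minimal Python sketch. The function name and the sample numbers are assumptions made for illustration; they are not taken from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative numbers only: one failure per 1000 h, 4 h to repair.
    print(f"{availability(1000.0, 4.0):.4%}")
```

Shortening MTTR raises availability just as lengthening MTBF does, which is why the recovery techniques in this deck focus on faster, automated recovery.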
RELIABILITY MODELS
LEVELS OF RELIABILITY
Level 0: Basic automatic fault detection by watchdog, no automatic fault recovery, no data consistency
• A small set of fault classes (hardware and software) is detected by the watchdog.
• For a hardware fault, the system is manually reconfigured.
• For a software fault, the application process is restarted at its initial internal state, which requires initialization of the faulty processor, since the application may have left its data in an inconsistent or incorrect state.
Level 1: Basic automatic fault detection by watchdog, automatic fault recovery, no data consistency
• A small set of fault classes (hardware and software) is detected by the watchdog, and recovery is automatic.
• When a fault is detected by the watchdog, the system is automatically recovered: reconfigured for hardware faults and initialized for software faults.
Level 2: Level 1 plus enhanced automatic fault detection by watchdog, plus periodic checkpointing, logging, and recovery of internal state
• The watchdog and application are enhanced to automatically detect a larger set of faults.
• The internal state of the application process is periodically checkpointed.
• After a hardware failure is detected, the system is reconfigured around the faulty unit.
• The application is restarted at the most recent checkpointed internal state.
Level 3: Level 2 plus persistent data recovery (the highest level achievable with RCC)
• The persistent data of the application is replicated on a backup disk connected to a backup node, and is kept consistent with the data on the primary node throughout normal operation of the application.
• In case of a fault, the backup disk on the backup node brings the application’s persistent data as close as possible to the state at which the application crashed.
Level 4: Continuous operation without interruption
• This level of reliability is not achievable with RCC.
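The Level 2 idea of periodic checkpointing and restart from the most recent checkpoint can be sketched in a few lines of Python. This is a toy illustration, not the RCC implementation: the class `CheckpointedCounter`, the pickle file format, and the checkpoint interval are all hypothetical.

```python
import os
import pickle
import tempfile


class CheckpointedCounter:
    """Toy application whose internal state is checkpointed every N steps."""

    def __init__(self, checkpoint_path: str, every: int = 3):
        self.checkpoint_path = checkpoint_path
        self.every = every
        self.state = {"step": 0, "total": 0}

    def checkpoint(self) -> None:
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(self.state, f)
        os.replace(tmp_path, self.checkpoint_path)

    def restore(self) -> bool:
        """Load the most recent checkpoint; return False if none exists."""
        if not os.path.exists(self.checkpoint_path):
            return False
        with open(self.checkpoint_path, "rb") as f:
            self.state = pickle.load(f)
        return True

    def work(self, value: int) -> None:
        self.state["step"] += 1
        self.state["total"] += value
        if self.state["step"] % self.every == 0:
            self.checkpoint()


def demo() -> dict:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.ckpt")
        app = CheckpointedCounter(path, every=3)
        for v in range(1, 6):  # steps 1..5; a checkpoint is taken after step 3
            app.work(v)
        # Simulate a software fault: the process dies and a fresh one starts.
        restarted = CheckpointedCounter(path, every=3)
        restarted.restore()
        return restarted.state


if __name__ == "__main__":
    print(demo())
```

Work done after the last checkpoint (steps 4 and 5 here) is lost on restart, which is the gap that the logging mentioned under Level 2 and the persistent data replication of Level 3 are meant to narrow.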
RELIABILITY MODELING & ANALYSIS
BASIC MODEL FOR SOFTWARE FAULT TOLERANCE
[Figure: Markov state diagram. States: Working; Fault Detection & Recovery; Volatile Data Recovery; Persistent Data Recovery; Failed. Transition labels combine rates indexed 1, 2, and 3 with coverage factors (c and indexed variants) and their complements (1 − c); the rate symbols did not survive extraction.]
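A Markov availability model of this kind is typically evaluated by solving for the steady-state probabilities and reading off the probability of the Working state. The sketch below shows one way to do that in plain Python. The transition topology in `basic_model_rates`, the symbol names (`lam`, `mu1` to `mu4`), and all numeric values are assumptions made for illustration, since the diagram's arrows and rate symbols cannot be recovered from the slide text.

```python
def steady_state(states, rates):
    """Solve pi * Q = 0 with sum(pi) = 1 for a continuous-time Markov chain.

    rates maps (from_state, to_state) -> transition rate.
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Generator matrix Q: off-diagonals are rates, diagonals make rows sum to 0.
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate
    # Each balance equation is a column of Q; the last one is replaced by
    # the normalisation constraint sum(pi) = 1.
    a = [[q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    a[n - 1] = [1.0] * n + [1.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return {s: a[i][n] / a[i][i] for s, i in idx.items()}


STATES = ["Working", "Detect", "Volatile", "Persistent", "Failed"]


def basic_model_rates(lam, c, mu1, c1, mu2, c2, mu3, mu4):
    """ASSUMED topology: each recovery stage either succeeds or falls to Failed."""
    return {
        ("Working", "Detect"): lam * c,          # fault is detected
        ("Working", "Failed"): lam * (1 - c),    # fault goes undetected
        ("Detect", "Volatile"): mu1 * c1,
        ("Detect", "Failed"): mu1 * (1 - c1),
        ("Volatile", "Persistent"): mu2 * c2,
        ("Volatile", "Failed"): mu2 * (1 - c2),
        ("Persistent", "Working"): mu3,
        ("Failed", "Working"): mu4,              # assumed manual repair rate
    }


if __name__ == "__main__":
    # Illustrative values only, not the slide's parameter set.
    pi = steady_state(STATES, basic_model_rates(
        lam=0.001, c=0.9, mu1=30.0, c1=0.99,
        mu2=1800.0, c2=0.999, mu3=3600.0, mu4=0.25))
    print(f"P(Working) = {pi['Working']:.6f}")
```

Dropping states from `STATES` and the rate dictionary gives the smaller chains used for the lower reliability levels, under the same assumptions.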
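LEVEL 0 RELIABILITY
[Figure: two-state Markov model. States: Working; Failed. Parameter values printed on the slide: 0.001 and "41" (parameter symbols lost in extraction).]
Availability: 99.96%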
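LEVEL 1 RELIABILITY
[Figure: Markov model. States: Working; Fault Detection & Recovery; Failed. Transition labels: c, (1 − c), and a rate indexed 1.]
Parameters: c = 0.9; the two index-1 values are both 30 (parameter symbols lost in extraction).
Availability: 99.98%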
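LEVEL 2 RELIABILITY
[Figure: Markov model. States: Working; Fault Detection & Recovery; Volatile Data Recovery; Failed. Transition labels: c, (1 − c), c1, (1 − c1), and rates indexed 1 and 2.]
Parameters: c = 0.9; c1 = 0.99; the two index-2 values are both 1800 (parameter symbols lost in extraction).
Availability: 99.99%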
LEVEL 3 RELIABILITY
[Figure: Markov model. States: Working; Fault Detection & Recovery; Volatile Data Recovery; Persistent Data Recovery; Failed. Transition labels: c, (1 − c), c1, (1 − c1), c2, (1 − c2), and rates indexed 1, 2, and 3.]
Parameters: c = 0.9; c1 = 0.99; c2 = 0.999; index-2 values 1800 and 1800; index-3 values 3600 and 100 (parameter symbols lost in extraction).
Availability: ~100%
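To put the per-level availability figures in perspective, they can be converted to expected downtime per year. The sketch below reads the slide values as 99.96%, 99.98%, and 99.99% for Levels 0 to 2 (the slides print them with the percent sign first and a decimal comma); the helper name and the 8760-hour year are assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for level, pct in [(0, 99.96), (1, 99.98), (2, 99.99)]:
        print(f"Level {level}: {annual_downtime_hours(pct):.2f} h/year")
```

Each step up in level roughly halves the yearly downtime in this reading, from about three and a half hours at Level 0 to under one hour at Level 2.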
CONCLUSION
• In this work, an RCC has been proposed.
• Different levels of reliability have been defined.
• A reliability analysis has been carried out via Markov modeling.
QUESTIONS & COMMENTS
?