7/27/2019 Real Time Systems IX
Fault-Tolerance in Real-Time Systems
Lecture IX
Sidra Rashid
Bahria University, Islamabad Campus
Fault Tolerance

What is fault tolerance? The ability of an operational system to tolerate the presence of faults.

Why tolerate faults? Completely testing a practical-sized system is impossible, so it is important to implement techniques that allow a system to detect and tolerate faults during normal operation.

The four phases of fault tolerance:
- Error detection: detect an erroneous state
- Damage assessment: compute the severity of the fault
- Error processing: replace the erroneous state with an error-free one
- Fault treatment: determine the cause of the error, then apply fault passivation to ensure it doesn't happen again
5/3/13
Fault Classification

Faults are classified along three axes (shown on the original slide as a classification tree rooted at "Faults"):
- Nature: distinguishes the intention of the fault: accidental or intentional
- Origin: categorized into three types:
  - Phenomenon: is the fault from a physical or a human-made phenomenon?
  - Extent: does the internal or the external environment cause the fault?
  - Phase: is the fault caused within the design or the operation of the system?
- Persistence: determines the duration of the fault state: permanent or temporary
Software Fault Tolerance Techniques

The key to fault tolerance is redundancy, in three domains:
- Space: several hardware channels, each executing the same task
- Information: recover the system via data structures that store system contents
- Repetition: restart a module in the event of a faulty module

Two major schemes have evolved:
- Recovery Block (RB): a 1H/NdS/NT system. There is only one hardware channel (1H), and faults are tolerated by executing several diverse software modules (NdS) sequentially (NT).
- N-Version Programming (NVP): an NH/NdS/1T system. The system has a number of (identical) hardware channels (NH), each executing one of the diverse software versions (NdS), hence no redundancy in time (1T).
Software Fault Tolerance: Recovery Block

Flow (per the slide's diagram): on entry, the system state is saved in a checkpoint. A switch selects the primary module, whose result is submitted to the acceptance test. If the test passes, the result is delivered. If it fails, the state is restored from the checkpoint and, as long as alternates remain untried (Alternate 1 ... Alternate N-1) and the deadline has not been exceeded, the switch selects the next alternate. Otherwise a fault is signalled.
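This recovery-block flow can be sketched in Python. This is a minimal illustration only: the names `recovery_block`, `primary`, `alternates`, and `acceptance_test` are assumptions, and the deadline check is omitted for brevity.

```python
def recovery_block(state, primary, alternates, acceptance_test):
    checkpoint = dict(state)              # save the system state on entry
    for module in [primary] + alternates:
        try:
            result = module(state)        # run primary, then alternates in turn
        except Exception:
            result = None                 # a crashing module counts as a failure
        if result is not None and acceptance_test(result):
            return result                 # acceptance test passed: deliver result
        state = dict(checkpoint)          # restore from checkpoint before retrying
    raise RuntimeError("fault: all alternates exhausted")
```

For example, a primary with a sign fault is rejected by an acceptance test requiring a non-negative result, and the alternate's result is delivered after the checkpoint restore.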
Software Fault Tolerance: Recovery Block

Considerations:
- Software diversity. Idea: different teams, one specification, different products, hoping that the failure domains do not overlap.
- Difficulty of designing the acceptance test: a single test serves all modules of the recovery block, and the test is the most crucial element in improving reliability.
- Design of the recovery cache: it must be sufficiently simple to ensure it contains no faults.
- Increased system overhead.
- Domino effect: recovery blocks can push concurrent tasks that communicate into uncontrolled rollback.
Software Fault Tolerance: N-Version Programming

N-Version Programming (NH/NdS/1T):
- Several hardware channels
- Software: diverse versions of the code
- Results are voted upon
- The initial specification is crucial

Flow (per the slide's diagram): a switch dispatches the input to Version 1 ... Version N, which execute in parallel and synchronize at the voter. With majority agreement the voted result is delivered as output; with no agreement, a failure is signalled.
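The voting step can be sketched as follows; `n_version_vote` and `versions` are illustrative names, and the parallel execution and synchronization are simplified to a sequential loop.

```python
from collections import Counter

def n_version_vote(x, versions):
    # Each diverse version computes a result (conceptually in parallel).
    # Results must be hashable so the voter can tally them.
    results = [version(x) for version in versions]
    # The voter looks for a strict majority among the returned results.
    winner, count = Counter(results).most_common(1)[0]
    if count > len(versions) // 2:
        return winner                     # majority agreement: deliver output
    raise RuntimeError("no agreement: failure")
```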
Software Fault Tolerance: N-Version Programming

Considerations:
- Software diversity: it is difficult to create a good specification.
- Decision mechanism: some results will not always be identical (both valid and invalid ones), so a range of valid solutions must be defined, which decreases the distance from the acceptance-test approach.
- System overhead: temporal (synchronization and the decision algorithm) and spatial (multiple hardware channels and space for multiple software versions).

Extensions:
- Community Error Recovery (forward recovery): there is enough information in the good versions to recover the failed versions.
Software Fault Tolerance: Consensus Recovery Block (CRB)

- NH/NdS/1T
- A synthesis of N-Version Programming and the Recovery Block
- Basic assumption: no similar errors will occur (erroneous results resembling each other); therefore, if two or more versions agree, the result is considered correct.

Flow (per the slide's diagram): a switch dispatches the input to Version 1 ... Version N, and the results go to the voter. On agreement, the voted result is the output. With no agreement, the results are submitted to an acceptance test (AT) while versions remain untried and the time limit has not expired; if no result passes, a failure is signalled.
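The two-stage decision can be sketched as below, assuming exact equality as the agreement criterion; all names are illustrative, and the time-limit check is omitted for brevity.

```python
from collections import Counter

def consensus_recovery_block(x, versions, acceptance_test):
    results = [version(x) for version in versions]
    # Stage 1: the voter. Under the no-similar-errors assumption, two or
    # more agreeing results are taken as correct.
    winner, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return winner
    # Stage 2: no agreement, so fall back to the acceptance test on each
    # result in turn.
    for result in results:
        if acceptance_test(result):
            return result
    raise RuntimeError("failure: no agreement and no result passed the AT")
```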
Software Fault Tolerance: Distributed Recovery Block

- NH/NS/1T or NHS/NdS/1T: the RB scheme reproduced on multiple network nodes
- Consideration: synchronization between the nodes, especially during rollback

Flow (per the slide's diagram): the input goes to both a primary node and a secondary node. Each node runs Version A and Version B against the acceptance test, retrying while alternates remain and the deadline has not been exceeded. An accepted result is delivered by the primary node; if it fails, the secondary node takes over.
Extended Distributed Recovery Block

- Heartbeat scheme with three roles: active node, shadow node, and supervisor node
- Each node contains: primary version, alternate version, acceptance test, and device drivers

Per the slide's diagram, the active and shadow nodes each run a node executive over the primary version, alternate version, acceptance test, and device drivers connected to the system; the supervisor's recovery manager exchanges heartbeats, reset requests, and consent with both node executives.
Roll-Forward Checkpointing Scheme

- Used for multiprocessor systems
- A pool of active processing modules, each with a processor, volatile storage, and stable storage
- A dedicated checkpoint processor

The checkpoint processor detects module failures by comparing the states of each pair of processing modules that perform the same task. The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor. The checkpoint processor compares the states; if they match, the new checkpoint is considered correct and replaces the old one.
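The comparison step can be sketched as follows; the function name and the return convention are assumptions for illustration.

```python
def checkpoint_compare(state_a, state_b, old_checkpoint):
    """Checkpoint-processor step for one pair of modules running the same
    task: accept the new checkpoint only when both states match."""
    if state_a == state_b:
        # States agree: the new checkpoint replaces the old one.
        return state_a, False
    # States disagree: one module has failed; keep the old checkpoint so
    # recovery can proceed from the last known-good state.
    return old_checkpoint, True
```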
N Self-Checking Program

- The system is divided into several self-checking components, each made up of different variants (equivalent to alternates in RB and versions in NVP).
- A self-checking component is built in one of two ways: (a) each variant is associated with an acceptance test that checks the variant's results, or (b) variants are paired together and associated with a comparison algorithm.
- The components execute in parallel; fault tolerance is provided by this parallel execution.
- Each component is responsible for determining whether a delivered result is acceptable.
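The two component flavours can be sketched as below; all names are illustrative, and the parallel execution is simplified to a loop.

```python
def at_component(variant, acceptance_test):
    """Case (a): a variant judged by its own acceptance test."""
    def run(x):
        result = variant(x)
        return result, bool(acceptance_test(result))
    return run

def cmp_component(variant_a, variant_b):
    """Case (b): paired variants judged by comparing their results."""
    def run(x):
        result_a, result_b = variant_a(x), variant_b(x)
        return result_a, result_a == result_b
    return run

def n_self_checking(x, components):
    # Components execute (conceptually in parallel); deliver the first
    # result a component judges acceptable.
    for component in components:
        result, ok = component(x)
        if ok:
            return result
    raise RuntimeError("no component delivered an acceptable result")
```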
Data Diversity

Retry Block:
- Executes the algorithm and the acceptance test normally
- If the results are accepted by the test, execution is complete
- If the results are not accepted, the input data is restated and the block runs again

N-Copy Programming:
- Upon entry to the block, the data is restated in N-1 ways, creating N different data sets
- The N copies execute in parallel
- The output is selected by a voting scheme
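The retry block can be sketched as below; `reexpress` stands in for the data-restatement routine, and a retry bound replaces the deadline check, both assumptions for illustration.

```python
def retry_block(x, algorithm, acceptance_test, reexpress, max_retries=3):
    for _ in range(max_retries + 1):
        result = algorithm(x)
        if acceptance_test(result):
            return result              # accepted: execution complete
        x = reexpress(x)               # restate the input data, then run again
    raise RuntimeError("retry block exhausted its retries")
```

For example, an algorithm that fails on a degenerate input (here, division by zero) can succeed once the input is restated.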
Summary

Fault-tolerant design considerations:
- Anticipated faults: in most cases, a simple acceptance test is all that is needed.
- Unanticipated faults: designers must decide what the most practical solution is. Most of the techniques in this report are hardware-based, and many designers will not be able to use them. This leaves designers with:
  - Recovery Blocks (software design diversity)
  - Retry Blocks (data diversity)
Fault-Tolerance in Real-Time Databases
Overview

- The causes of downtime
- Availability solutions
- CASE 1: Clustra
- CASE 2: TelORB
- CASE 3: RODAIN
The Causes of Downtime

Planned downtime:
- Hardware expansion
- Database software upgrades
- Operating system upgrades

Unplanned downtime:
- Hardware failure
- OS failure
- Database software bugs
- Power failure
- Disaster
- Human error
Traditional Availability Solutions

Replication: the standby system needs to duplicate transactions as they occur on the primary system. Ideally, this replication is done in near-real time, so the standby is very close to current in the event of a primary system failure.

Failover: failover is the moment of truth. When a failure occurs on the primary system, all connections must be re-established on the standby, and all active transactions must be rolled back and restarted. Because everything must be transferred, typical failover times are measured in minutes at best, during which time the database is unavailable.

Primary restart: once the standby system takes over, there is no longer a standby. This is an especially vulnerable period, so the primary must be restarted as quickly as possible. In some schemes the primary becomes the new standby; in others, processing must, at some point, be switched back to the primary.
CASE 1: Clustra

- Developed for telephony applications such as mobility management and intelligent networks
- A relational database with location and replication transparency
- Real-time data is locked in main memory, and the API provides precompiled transactions
- NOT a real-time database!
Clustra hardware architecture
Data distribution and replication
How Clustra Handles Failures

- Real-time failover: hot-standby data is up to date, so failover occurs in milliseconds.
- Automatic restart and takeback: restart of the failed node and takeback of operations are automatic and, again, transparent to users and operators.
- Self-repair: if a node fails completely, data is copied from the complementary node to a standby. This is also automatic and transparent.
- Limited failure effects
How Clustra Handles Upgrades

- Hardware, operating system, and database software upgrades without ever going down
- A process called rolling upgrade, i.e. the required changes are performed node by node
- Each node is upgraded and catches up to the status of its complementary node; when this is completed, the operation is performed on the next node
CASE 2: TelORB

Characteristics:
- Very high availability (HA); robustness implemented in software
- (Soft) real time
- Scalability by using loosely coupled processors
- Openness:
  - Hardware: Intel/Pentium
  - Languages: C++, Java
  - Interoperability: CORBA/IIOP, TCP/IP, Java RMI
  - 3rd-party SW: Java
TelORB Availability

A real-time object-oriented DBMS supporting:
- Distributed transactions
- The ACID properties expected from a DBMS
- Data replication (providing redundancy)
- Network redundancy
- Software configuration control
- Automatic restart of processes that originally executed on a faulty processor on the processors that are still working
- Self-healing: in-service upgrade of software with no disturbance to operation
- Hot replacement of faulty processors
Automatic Reconfiguration

(Figure: reloading.)
Software Upgrade

- Smooth software upgrade when the old and new versions of the same process can coexist
- The application can arrange for state transfer between the old and new static process (unless the important state is already stored in the database)
Partitioning: Types and Data

(Figure: types A and B and data items 17, 18, 19, 20, 21, 22 distributed and replicated across processors.)
Advantages

- Standard interfaces through CORBA
- Standard languages: C++, Java
- Based on commercial hardware
- (Soft) real-time OS; fault tolerance implemented in software
- Fully scalable architecture
- Includes powerful middleware: a database management system and functions for software management
- Fully compatible simulated environment for development on Unix/Linux/NT workstations
CASE 3: RODAIN

- Real-Time Object-Oriented Database Architecture for Intelligent Networks
- A real-time main-memory database system
- Runs on a real-time OS: Linux
Rodain Cluster
Rodain Database Node

A RODAIN database node consists of a Database Primary Unit and a Database Mirror Unit connected to a shared disk. Each unit contains the same subsystems:
- User Request Interpreter Subsystem
- Distributed Database Subsystem
- Fault-Tolerance and Recovery Subsystem
- Watchdog Subsystem
- Object-Oriented Database Management Subsystem
RODAIN Database Node II

(Figure: the same structure, a Database Primary Unit and a Database Mirror Unit with identical subsystems sharing a disk.)