CS 505: Computer Structures

Fault Tolerance

Thu D. Nguyen

Spring 2005

Computer Science

Rutgers University


Fault Tolerance

• Computing components WILL fail

– Hardware, software, and people

• The general field of dependability, fault tolerance, reliability, etc. addresses how to keep a computing system running in the presence of component failures

• Lots of jargon (like all areas of computer science), so we need to start with terminology

– See the short paper I posted on the web today


Dependability, Reliability, Availability

• Dependability: the ability of a computing system to deliver service that can justifiably be trusted

– Service delivered by a system is its behavior as perceived by the service’s users

– Dependability is a general concept that encapsulates reliability, availability, etc.

• Availability: readiness for correct service

– What percentage of time is the service available

• Reliability: continuity of correct service

– How long until the next service failure

• Safety: absence of catastrophic consequences on the users and environment, even in the presence of faults


Faults, Errors, and Failures

• Failure: an event that occurs when the delivered service deviates from correct service

– By definition, a failure is visible to the user

• A fault is a failure of a component of a computing system that may lead to service failure

– If the system can tolerate this fault, that is, continue to provide correct service despite the fault, then the fault does not lead to service failure

• An error is the activation of a fault

– Faults may be dormant or latent

– For example, a disk fault may not ever become an error if the service never uses that disk again
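
The difference between a fault, an error, and a failure can be sketched as a toy Python model. This is only an illustration under simplifying assumptions, not something taken from the course materials: the names `Disk`, `DiskError`, and `serve` are hypothetical, and the "service" does nothing except read a disk when a request needs it.

```python
from typing import Optional


class DiskError(Exception):
    """Raised when a read hits the defective part of a disk."""


class Disk:
    def __init__(self, data: bytes, faulty: bool = False):
        self._data = data
        self.faulty = faulty  # the fault: harmless for as long as it stays dormant

    def read(self) -> bytes:
        if self.faulty:
            # Using the disk activates the fault, which produces an error.
            raise DiskError("unreadable sector")
        return self._data


def serve(needs_disk: bool, primary: Disk, mirror: Optional[Disk] = None) -> str:
    """Return what the user observes: 'ok' or 'service failure'."""
    if not needs_disk:
        return "ok"  # the disk is never touched, so any fault stays dormant
    for disk in (primary, mirror):
        if disk is None:
            continue
        try:
            disk.read()
            return "ok"  # correct service, possibly after tolerating an error
        except DiskError:
            continue  # error detected; try the next copy if there is one
    return "service failure"  # the error reached the user


bad, good = Disk(b"data", faulty=True), Disk(b"data")
print(serve(False, bad))       # dormant fault
print(serve(True, bad))        # activated fault, no redundancy
print(serve(True, bad, good))  # activated fault, tolerated by the mirror
```

The three calls print `ok`, `service failure`, and `ok`: the same fault is invisible while dormant, becomes a failure once activated without redundancy, and is tolerated when a second copy exists.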


Fault Tolerance

• How to continue delivering correct service in the presence of errors

• Error detection: figuring out that an error exists in the service

• Fault diagnosis: figuring out the root cause of the detected error(s)

• Error handling and recovery: dynamic reconfiguration of the service to continue delivering correct service

• Fault prediction: predicting when faults are likely to occur

• Fault prevention: pro-active reconfiguration of the service to tolerate likely future faults
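
A minimal Python sketch of the first three activities (detection, a very crude form of diagnosis, and recovery by reconfiguration) might look like the following. It rests on assumptions that are not in the slides: data is replicated, each replica stores a CRC32 checksum recorded at write time, and "diagnosis" only goes as far as naming the bad replica. `Replica`, `Service`, and `detect_error` are hypothetical names; fault prediction and prevention are not modeled.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    payload: bytes
    checksum: int  # CRC32 recorded when the payload was written


def detect_error(replica: Replica) -> bool:
    """Error detection: the stored checksum no longer matches the payload."""
    return zlib.crc32(replica.payload) != replica.checksum


@dataclass
class Service:
    replicas: list                                    # active replicas, in preference order
    quarantined: list = field(default_factory=list)   # names of replicas taken out of service

    def read(self) -> bytes:
        for replica in list(self.replicas):           # iterate over a copy while reconfiguring
            if detect_error(replica):
                self.replicas.remove(replica)          # recovery: reconfigure around the bad replica
                self.quarantined.append(replica.name)  # record which component was at fault
                continue
            return replica.payload
        raise RuntimeError("service failure: no healthy replica left")


good = b"account balance: 100"
crc = zlib.crc32(good)
svc = Service([Replica("a", b"account balance: 900", crc), Replica("b", good, crc)])
print(svc.read())        # served from replica "b"
print(svc.quarantined)   # replica "a" was detected as corrupt and removed
```

The user still receives the correct payload even though the preferred replica was corrupt; the service only fails when every replica has been quarantined.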


Mathematical Definitions

• Availability = MTTF / (MTTF + MTTR)

• Reliability = MTTF
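
As a worked example of the availability formula, here MTTF is the mean time to failure and MTTR the mean time to repair. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; the function names `availability` and `downtime_per_year_hours` are likewise illustrative.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_per_year_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours per year during which the service is unavailable."""
    return (1.0 - avail) * hours_per_year


# Hypothetical system: fails on average every 1000 hours, takes 1 hour to repair.
a = availability(1000.0, 1.0)
print(f"availability  = {a:.5f}")
print(f"downtime/year = {downtime_per_year_hours(a):.2f} hours")
```

Note that the formula rewards short repairs as much as rare failures: halving MTTR improves availability about as much as doubling MTTF.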


Tandem Case Study

• Modularity

• Fail-fast (fail-stop) hardware

– Extensive self-monitoring

– Fault model enforcement

– What happens when the self-monitoring and fault model enforcement hardware fails?

• Replicate hardware for redundancy

– Tolerate single fault

• Fault-tolerance software

• On-line maintenance

• Simplified user interface
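
A rough back-of-the-envelope calculation suggests why replicating hardware to tolerate a single fault pays off. It assumes, and this is an assumption rather than anything from the Tandem data, that the two modules of a pair fail and are repaired independently, that each has availability $A$, and that failover itself never fails. The pair is then unavailable only when both modules are down at once:

```latex
A_{\text{pair}} = 1 - (1 - A)^2,
\qquad
A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
% Example with a hypothetical A = 0.99:
% A_{\text{pair}} = 1 - (0.01)^2 = 0.9999
```

With a hypothetical $A = 0.99$, the pair reaches $0.9999$. Correlated failures, such as a shared power supply or an operator mistake that affects both modules, break the independence assumption and make the real figure lower.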


Tandem NonStop


Tandem Integrity


Census of Tandem Availability


Census of Tandem Availability


Case Study of 1 Tandem Customer


Sources of Failures (Going Beyond Tandem)

• Operator mistakes are a major source of service failures

• Theory: insufficient infrastructural support is a major reason for operator mistakes

– System designers rarely consider human-system interactions

Source of failure    Public Switched Telephone Network    Average of 3 Internet Sites
Operator             59%                                  51%
Hardware             22%                                  15%
Software             8%                                   34%
Overload             11%                                  0%

[Patterson et al. 2002]


Data from Vivo Project

• Conducting a survey to understand database and network administration

– ~100 respondents

– DBAs: all ≥ 2 years experience, 71% ≥ 5 years experience

– Networking: 98% ≥ 2 years experience, 81% ≥ 5 years experience

• Source of failures

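[Figure: two pie charts of failure sources, "Network and Systems" and "Database". Legend: Operator Lack of Understanding/Experience; Complex Operation; Hardware Failure; Buggy Software; Operator Inattentive/tired; Lack of Appropriate Tools; Unfriendly Interface; Not specified. Slice values in extraction order (the slice-to-category mapping is not recoverable): Network and Systems 44%, 15%, 15%, 10%, 10%, 2%, 2%, 2%; Database 16%, 18%, 16%, 26%, 14%, 2%, 8%.]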
