CS 505: Thu D. Nguyen, Rutgers University, Spring 2005
CS 505: Computer Structures
Fault Tolerance
Thu D. Nguyen
Spring 2005
Computer Science
Rutgers University
Fault Tolerance
• Computing components WILL fail
– Hardware, software, and people
• The general field of dependability, fault tolerance, reliability, etc. addresses how to keep a computing system running in the presence of component failures
• Lots of jargon (as in all areas of computer science), so we need to start with terminology
– See short paper I posted on web today
Dependability, Reliability, Availability
• Dependability: the ability of a computing system to deliver service that can justifiably be trusted
– Service delivered by a system is its behavior as perceived by the service’s users
– Dependability is a general concept that encapsulates reliability, availability, etc.
• Availability: readiness for correct service
– What percentage of time is the service available?
• Reliability: continuity of correct service
– How long until the next service failure?
• Safety: absence of catastrophic consequences on the users and environment, even in presence of faults
Faults, Errors, and Failures
• Failure: an event that occurs when the delivered service deviates from correct service
– By definition, a failure is visible to the user
• A fault is a failure of a component of a computing system that may lead to service failure
– If the system can tolerate this fault, that is, continue to provide correct service despite the fault, then the fault does not lead to service failure
• An error is the activation of a fault
– Faults may be dormant or latent
– For example, a disk fault may not ever become an error if the service never uses that disk again
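The distinction between a dormant fault, an error, and a service failure can be sketched in a few lines of Python. This is only an illustration of the disk example above: the `Disk` class, the `faulty` flag, and the `serve` function are hypothetical names, not part of any real system discussed in the lecture.

```python
class Disk:
    """A toy disk; `faulty` marks a (possibly dormant) fault in this component."""

    def __init__(self, name, faulty=False):
        self.name = name
        self.faulty = faulty

    def read(self):
        # The fault is activated, and becomes an error, only when the disk is used.
        if self.faulty:
            raise IOError(f"{self.name}: read error")
        return f"data from {self.name}"


def serve(primary, mirror):
    """Return (data, errors_seen). Raises only when no disk can serve the request."""
    errors_seen = 0
    for disk in (primary, mirror):
        try:
            return disk.read(), errors_seen
        except IOError:
            errors_seen += 1  # error tolerated inside the system, invisible to the user
    raise RuntimeError("service failure: no disk could serve the request")


# Fault stays dormant: the faulty mirror is never read.
dormant = serve(Disk("a"), Disk("b", faulty=True))
# Fault is activated but tolerated: the user still gets correct data.
tolerated = serve(Disk("a", faulty=True), Disk("b"))
```

Only when both disks are faulty does the error become visible to the user, which is the point at which it counts as a service failure.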
Fault Tolerance
• How to continue delivering correct service in the presence of errors
• Error detection: figuring out that an error exists in the service
• Fault diagnosis: figuring out the root cause of the detected error(s)
• Error handling and recovery: dynamic reconfiguration of the service to continue delivering correct service
• Fault prediction: predicting when faults are likely to occur
• Fault prevention: pro-active reconfiguration of the service to tolerate likely future faults
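A minimal sketch of the first three steps above, assuming a replicated record store where each record carries a CRC32 checksum. The checksum choice, the replica names, and the functions `store`, `detect_error`, and `read_with_recovery` are assumptions made for illustration; the lecture does not prescribe a specific mechanism.

```python
import zlib


def store(payload: bytes) -> dict:
    """Keep a CRC32 checksum next to the payload so corruption is detectable."""
    return {"payload": payload, "crc": zlib.crc32(payload)}


def detect_error(record: dict) -> bool:
    """Error detection: True if the payload no longer matches its checksum."""
    return zlib.crc32(record["payload"]) != record["crc"]


def read_with_recovery(replicas: dict) -> tuple:
    """Return (payload, suspects).

    Replicas that fail the check are removed from the active set
    (reconfiguration) and reported so they can be diagnosed later.
    """
    suspects = []
    for name in list(replicas):
        record = replicas[name]
        if detect_error(record):
            suspects.append(name)  # input to fault diagnosis
            del replicas[name]     # stop routing reads to this replica
            continue
        return record["payload"], suspects
    raise RuntimeError("no healthy replica left")


replicas = {"r1": store(b"hello"), "r2": store(b"hello")}
replicas["r1"]["payload"] = b"hellp"  # simulate corruption on one replica
data, suspects = read_with_recovery(replicas)
```

After the call, `data` holds the correct payload from the healthy replica, `suspects` names the replica to diagnose, and the corrupted replica is no longer in the active set.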
Mathematical Definitions
• Availability = MTTF / (MTTF + MTTR)
– MTTF: mean time to failure; MTTR: mean time to repair
• Reliability = MTTF
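As a worked example of the availability formula, the sketch below plugs in hypothetical numbers (an MTTF of 999 hours and an MTTR of 1 hour; these values are made up for illustration, not taken from the lecture). The function names and the 8760-hour year are likewise assumptions.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), a fraction between 0 and 1."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours of unavailability in a (non-leap) year."""
    return (1.0 - avail) * hours_per_year


a = availability(999.0, 1.0)                  # "three nines"
yearly_downtime = downtime_hours_per_year(a)  # roughly 8.76 hours per year

# Halving the repair time improves availability without changing MTTF.
a_faster_repair = availability(999.0, 0.5)
```

The last line shows why repair time matters as much as failure rate: availability can be raised either by failing less often (larger MTTF) or by recovering faster (smaller MTTR).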
Tandem Case Study
• Modularity
• Fail-fast (fail-stop) hardware
– Extensive self-monitoring
– Fault model enforcement
– What happens when the self-monitoring and fault model enforcement hardware fails?
• Replicate hardware for redundancy
– Tolerate single fault
• Fault-tolerance software
• On-line maintenance
• Simplified user interface
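The combination of fail-fast modules and replication can be sketched in software, assuming a module that checks its own output and halts instead of returning a bad result. This is a toy model, not Tandem's actual design: `FailFastModule`, `RedundantPair`, `ModuleStopped`, and the integer square root workload are all hypothetical.

```python
import math


class ModuleStopped(Exception):
    """Raised by a fail-fast module that has halted itself."""


class FailFastModule:
    def __init__(self, name, compute, check):
        self.name = name
        self.compute = compute  # the function this module provides
        self.check = check      # self-monitoring: validates (input, output)
        self.stopped = False

    def run(self, x):
        if self.stopped:
            raise ModuleStopped(self.name)
        result = self.compute(x)
        if not self.check(x, result):
            self.stopped = True  # stop rather than emit a wrong result
            raise ModuleStopped(self.name)
        return result


class RedundantPair:
    """Two replicated modules; tolerates a single stopped module."""

    def __init__(self, primary, backup):
        self.modules = [primary, backup]

    def run(self, x):
        for module in self.modules:
            try:
                return module.run(x)
            except ModuleStopped:
                continue  # fail over to the other module
        raise RuntimeError("both modules stopped: service failure")


def is_isqrt(x, r):
    """Check that r is the integer square root of x."""
    return r * r <= x < (r + 1) ** 2


bad = FailFastModule("A", lambda x: math.isqrt(x) + 1, is_isqrt)  # faulty unit
good = FailFastModule("B", math.isqrt, is_isqrt)
pair = RedundantPair(bad, good)
answer = pair.run(17)
```

Because the faulty module stops itself, the pair never has to decide which of two disagreeing answers is right. Note that the sketch trusts `check` itself, which is exactly the question the slide raises about the self-monitoring hardware failing.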
Tandem NonStop
Tandem Integrity
Census of Tandem Availability
Census of Tandem Availability
Case Study of 1 Tandem Customer
Sources of Failures (Going Beyond Tandem)
• Operator mistakes are a major source of service failures
• Theory: insufficient infrastructural support is a major reason for operator mistakes
– System designers rarely consider human-system interactions
• Sources of failures [Patterson et al. 2002]
– Public Switched Telephone Network: Operator 59%, Hardware 22%, Software 8%, Overload 11%
– Average of 3 Internet Sites: Operator 51%, Hardware 15%, Software 34%, Overload 0%
Data from Vivo Project
• Conducting survey to understand database and network administration
– ~100 respondents
– DBAs: all ≥ 2 years experience, 71% ≥ 5 years experience
– Networking: 98% ≥ 2 years experience, 81% ≥ 5 years experience
• Source of failures
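[Figure: two pie charts of failure sources, "Network and Systems" and "Database"]
– Categories: Operator Lack of Understanding/Experience, Complex Operation, Hardware Failure, Buggy Software, Operator Inattentive/Tired, Lack of Appropriate Tools, Unfriendly Interface, Not Specified
– Network and Systems slice values: 44%, 15%, 15%, 10%, 10%, 2%, 2%, 2%
– Database slice values: 16%, 18%, 16%, 26%, 14%, 2%, 8%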