Fundamentals Of Dependable Computing
ICSE 2014
John C. Knight
Department of Computer Science
University of Virginia
Tutorial Goals
- Answer the following questions:
  - What is dependability and what affects it?
  - How do we achieve dependability?
  - How do we know we have achieved it?
- Learn a few things, have some fun, etc.
- Motivation level: cannot teach several semesters' worth of material in a day
- Assume that fundamental material is unfamiliar
- Emphasis on software
Dependability — A Rigorous, Principled, Engineering Subject
© John C. Knight 2014, All Rights Reserved 2
Tutorial Roadmap
- Basic Concepts (you are here)
- Terminology Of Dependability
- Dependability Requirements
- Faults And Fault Treatment
- Software Technology Review
- Software Fault Avoidance
- Software Fault Elimination
- Fault Tolerance
- Special Topics
Tutorial Schedule
- Session 1: basic concepts; terminology; dependability requirements
- Session 2: faults; fault treatment; software technology review
- Session 3: formal specification; correctness by construction with SPARK Ada
- Session 4: hardware and software fault tolerance; Byzantine agreement; fail-stop machines
Tutorial Protocol
- Please call me John
- Please ask questions anytime
- Please relate your experience
- Please ask me to slow down if necessary
- Please indicate material that is too high- or too low-level, needs clarification, or is wrong
- Make sure that this is fun and useful to you
- Copies omit some slides (copyright)
Tutorial Assumed Background
- High-level language programming
- Basic computer organization
- Elementary probability theory
- Predicate calculus
- Set theory
- Basic principles of operating systems
Starting Point
[Photo: entrance to the Meissen pottery factory near Dresden, Germany]
What Is Our Goal?
A chain is only as strong as its weakest link...
Many things could go wrong:
- Attachment could break
- Box could break
- Chain could break
- Etc.
Dependability Outline
1. Determine what happens if the chain/box breaks
   - Determine the consequences of failure
2. Decide how strong the chain/box has to be
   - Define dependability requirements
3. Determine how the chain/box might fail
   - Determine all possible faults that might occur
4. Buy a high-quality chain/box so as to avoid defects
   - Practice fault avoidance during construction
5. Fix any broken links that remain
   - Practice fault elimination during construction
6. Cope with broken links you didn't find or that break
   - Practice fault tolerance during operation

Make absolutely sure you find all the weak links.
Dependability Outline
1. Determine the consequences of failure
2. Define dependability requirements
3. Determine all possible faults that might occur
4. Practice fault avoidance during construction
5. Practice fault elimination during construction
6. Practice fault tolerance during operation
What Happened?
- Titan IV: power-bus short circuit, guidance reboot
- Mars Climate Orbiter: unit-systems confusion (metric vs. imperial units)
- Mars Polar Lander: touchdown sensor triggered by landing-leg deployment
What Happened?
- Mars rovers:
  - Pathfinder: priority inversion
  - Spirit: flash file system filled up
- Korean Air Flight 801: safety case violated
- Ariane V: unhandled exception
Software And Its Role
[Diagram: the software sits inside a system, which operates in an environment with users and operators]
Software caused all of these consequences.
How To Think About Failures
- Accidents are very complex:
  - Do not jump to conclusions
  - There is never a "root" or single cause
  - There are usually many causal factors
- Lessons learned must be comprehensive: prevent future occurrences of all causal factors
- Always investigate failures
Role of The Software Engineer
- Why should a software engineer be concerned about dependability?
- Four reasons:
  - Software has to perform the intended function
  - Software has to perform the intended function correctly
  - Software has to operate on a target system whose design might have been influenced by the system's dependability goals
  - Software often needs to take action when a hardware component fails
So?
- Acquire sufficient information about the systems side of dependability that the software engineer can:
  - Understand why software is being asked to do what it is being asked to do
  - Understand why software is being made to operate on the particular platform specified by the system designers
- Acquire a conceptual framework within which to think about software and its dependability
Current Software Practice
- Unstructured, natural-language documents
- Some use of visual specification & formalisms from other disciplines (PDEs)
- UML, C, C++, Java
- Tools for test setup and management
- Combative adoption of standards

GREAT CARE by many GREAT PEOPLE
Dependable Systems
- Software is the least well understood technology involved
- Software volume is increasing at an exponential rate
- Software is quickly becoming the dominant cause of critical-system failures
- Many aspects of hardware dependability are essentially solved problems
Step 1: Determine Consequences of Failure
What happens if the chain breaks?
Consequences of Failure
- Injury or loss of life
- Environmental damage
- Damage to or loss of equipment
- Financial loss:
  - Theft
  - Useless or defective mass-produced equipment
  - Loss of production capacity or service
  - Loss of business reputation, customer base
Consequences of Failure
- Direct: device fails and causes injury
- Indirect: defective support tool leads to a device that fails
- Combinations of losses:
  - Immediate damage to equipment
  - Subsequent loss of service
  - Subsequent loss of business reputation
  - Subsequent lawsuits
Hidden Costs of Failure
- Some costs of failure are obvious, others not
- Ariane V example:
  - Obvious: loss of launch vehicle; loss of payload
  - Not so obvious: environmental cleanup; loss of service from payload; vehicle redesign; increased insurance rates for launches
Consequences of Failure
- Loss of revenue:
  - Software defects: $200 billion/year (SCC)
  - Note: recent worms and viruses
- Cost of one hour of downtime (InternetWeek, 2000):
  - Brokerage operations: $6,450,000/hr
  - Credit card authorization: $2,600,000/hr
  - eBay (1 outage, 22 hrs): $225,000/hr
  - Home Shopping Channel: $113,000/hr
  - Airline reservation center: $89,000/hr
  - ATM service fees: $14,000/hr
- Note: eBay's 22-hour outage (1999) caused a $4 billion market-cap loss
Step 2: Define Dependability Requirements
How strong does the chain have to be?
Why Dependability Requirements?
- Need an engineering target: what technology do we have to employ?
- Informal statements are inadequate:
  - "System should not fail."
  - "System must be very reliable."
- Need a target so that, if we meet it, the system will perform "acceptably"
- Need terminology to define requirements
Why Study Terminology?
- We need to be able to communicate in a precise manner: researchers, developers, customers
- There are everyday notions of these terms
- The public has an interest, but public terminology is imprecise
Webster
- Dependable: capable of being depended on: RELIABLE
- Reliable: suitable or fit to be relied on: DEPENDABLE
- Rely:
  1) to be dependent <the system on which we depend for water>
  2) to have confidence based on experience <someone you can rely on>
Historical Evolution of Concerns
- 1940s: ENIAC
  - 18K vacuum tubes → failed roughly every 7 minutes
  - 18K multiplies/minute → 7 minutes ≈ one program execution
  - Need RELIABILITY
- 1960s: interactive systems → + AVAILABILITY
- 1970s: F-8 Crusader, MD-11, Patriot missile defense → + SAFETY
- 1990s-today: Internet, e-commerce, Grid/Web services → + SECURITY
The Definitive Reference
"Basic Concepts and Taxonomy of Dependable and Secure Computing"
Avizienis, Laprie, Randell, and Landwehr
IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1 (2004)
- We will use this material frequently
- For now, we will look at dependability & failure
Dependability
- Old: dependability is the ability to deliver service that can justifiably be trusted.
- New: the dependability of a system is the ability to avoid service failures that are more frequent and more severe than is acceptable.
- Why the change? To allow for the possibility of failure while the system remains acceptable.

Note the subjectivity.
Social Aspects
- Wholesale death:
  - Aircraft, railway accidents
  - Improper detonation of weapons
- Retail death:
  - Car accidents, medical devices
  - Operational procedures
- Societal assessment determines acceptable levels of loss

Resolution of the subjectivity.
Failure
From the taxonomy:
"Correct service is delivered when the service implements the system function. A service failure, often abbreviated here to failure, is an event that occurs when the delivered service deviates from correct service. A service fails either because it does not comply with the functional specification, or because this specification did not adequately describe the system function. A service failure is a transition from correct service to incorrect service, i.e., to not implementing the system function. The period of delivery of incorrect service is a service outage. The transition from incorrect service to correct service is a service restoration. The deviation from correct service may assume different forms that are called service failure modes and are ranked according to failure severities."
Verification and Validation
[Diagram: customer → requirements analysis & elicitation → requirements → system definition → specification → system development → implementation; verification checks the implementation against the specification, validation checks the system against the customer's requirements]
Failure can arise from anywhere. The customer does not care.
Failure Viewpoints
- Failure domain
- Detectability of failures
- Consistency of failures
- Consequences of failure
Failure Domain
- Content failure: supplied service contains incorrect information
- Timing failure: timing or duration of the information is incorrect
- Halt failure: system halts and remains in a fixed state
- Erratic failure: service delivery is erratic
Detectability And Consistency
- Detectability: signaled, unsignaled, false alarm (false positives, false negatives)
- Consistency:
  - Consistent: incorrect service is seen identically by all users
  - Inconsistent: incorrect service is seen differently by different users (Byzantine failures)
Consequences Of Failure
- Different failures, different consequences
- Examples (go with the intuitive notion for now):
  - Availability: long vs. short outage
  - Safety: human life endangered vs. equipment damaged
  - Confidentiality: type of information disclosed
- Range of severity: from minor failure to catastrophic failure
Attributes Of Dependability
- Reliability
- Availability
- Safety
- Confidentiality
- Integrity
- Maintainability

Requirements: these are properties that might be required of a given system. Customers must identify the dependability requirements of their system, and developers must design so as to achieve them.
Attributes Of Dependability
- Note this very carefully…
- Fault tolerance is not a system requirement or a system property
- Fault tolerance is one of the mechanisms that can be used to provide dependability
- This is important

Fault tolerance is not a system property.
Reliability
R(t) = probability that the system will operate correctly in a specified operating environment up until time t
- A system's ability to provide continuous service
- Note that t is important
- If a system only needs to operate for ten hours at a time, then that is the reliability target
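As a small worked illustration: under the common constant-failure-rate (exponential) model, which is an assumption here rather than anything the slides mandate, R(t) falls off as e^(−t/MTBF):

```python
import math

def reliability(t_hours, mtbf_hours):
    """R(t) under a constant-failure-rate (exponential) model:
    R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)

# A ten-hour mission with an assumed 1,000-hour MTBF:
print(round(reliability(10, 1000), 4))  # → 0.99
```

The MTBF figure is purely illustrative; the point is that R(t) is defined relative to a mission time t, so a ten-hour mission is the target, not indefinite operation.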
Failure Per Demand
- Sometimes the notion of time, t, is not what we need
- Some systems are only called on to act on demand (e.g., protection systems)
- For such systems, what we need is the probability of failure per demand
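Probability of failure per demand composes across repeated demands; a minimal sketch, assuming independent demands and an illustrative pfd value:

```python
def p_any_failure(pfd, demands):
    """Probability of at least one failure over a number of
    independent demands, given the probability of failure per demand."""
    return 1 - (1 - pfd) ** demands

# A protection system with an assumed pfd of 1e-4, invoked 100 times:
print(round(p_any_failure(1e-4, 100), 4))  # → 0.01
```

This is why a seemingly tiny per-demand failure probability can still matter for a protection system that is invoked many times over its life.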
Availability
A(t) = probability that the system will be operational at time t
- Literally, readiness for service
- Admits the possibility of brief outages
- A fundamentally different concept from reliability
Reliability vs. Availability
- They are not the same...
- Example: a system that fails, on average, once per hour but restarts automatically in ten milliseconds would not be considered very reliable, but it is highly available:
  Availability = 0.9999972
Nines of Availability

  # 9's   %          Downtime/year    Systems
  2       99%        ~5000 minutes    General web site
  3       99.9%      ~500 minutes     Amazon.com
  4       99.99%     ~50 minutes      Enterprise server
  5       99.999%    ~5 minutes       Telephone system
  6       99.9999%   ~30 seconds      Phone switches

Caveats: How is it measured? What does it mean to be operational?
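The downtime column follows directly from the definition; a quick sketch (using a 365.25-day year):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # = 525,960

def downtime_minutes_per_year(nines):
    """Annual downtime permitted by n nines of availability."""
    unavailability = 10 ** -nines
    return unavailability * MINUTES_PER_YEAR

for n in range(2, 7):
    print(n, round(downtime_minutes_per_year(n), 1))
```

Two nines comes out at about 5260 minutes, three at about 526, and so on down to roughly half a minute (the table's ~30 seconds) at six nines.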
Design Tradeoffs
- How to make availability approach 100%?

    Availability = MTTF / (MTTF + MTTR)

  MTTF → infinity (high reliability); MTTR → zero (fast recovery)
- Need to define the maximum repair time
- Need to define how availability is measured

MTTF = Mean Time To Failure; MTTR = Mean Time To Repair
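Plugging the earlier reliability-vs-availability example into this formula (failures about once per hour, ten-millisecond automatic restart) reproduces the quoted figure:

```python
def availability(mttf, mttr):
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# MTTF = 3600 s (one failure per hour), MTTR = 0.01 s (10 ms restart):
print(round(availability(3600.0, 0.01), 7))  # → 0.9999972
```

Note the formula's two levers: pushing MTTF up (fewer failures) or pushing MTTR down (faster recovery) both raise availability, and the ten-millisecond restart is doing all the work in this example.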
Subtleties
- Which has higher availability?
  - Two 4.5-hour outages per year
  - One 1-minute outage per day
  (both are roughly 99.9%)
- For an Internet-based company such as eBay or Amazon, which would be more desirable? Why?
- For an autonomous rover?
- Need to specify the details of acceptable outages
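Both outage patterns above work out to roughly three nines, which is exactly why the raw percentage alone is an insufficient specification; a sketch (365.25-day year assumed):

```python
HOURS_PER_YEAR = 365.25 * 24

def availability_from_outage(outage_hours_per_year):
    """Availability implied by a given total annual outage."""
    return 1 - outage_hours_per_year / HOURS_PER_YEAR

two_long = availability_from_outage(2 * 4.5)          # two 4.5-hour outages/year
daily_minute = availability_from_outage(365.25 / 60)  # one minute per day
print(round(two_long, 4), round(daily_minute, 4))  # → 0.999 0.9993
```

The numbers are nearly identical, yet the failure profiles are very different: a retailer might prefer many tiny outages, while a rover mid-maneuver might not survive even one long one.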
Specifying Availability
- Minimum probability of readiness for service
- Period over which the probability is measured
- Fixed or moving window
- Maximum acceptable outage (time to repair)
- Minimum sustained operating time
Safety
Absence of catastrophic consequences on the users or the environment
- Are commercial aircraft "safe"? They crash occasionally
- How many crashes are too many?
- Are cars "safe"? They crash quite a lot:
  40K deaths/yr; 800/week ≈ 2 fully loaded Boeing 747s/week
Risk
- Risk is defined to be the expected loss per unit time:

    Risk = Σᵢ pr(accidentᵢ) × cost(accidentᵢ)

- Safety is defined to be an acceptable level of risk:

    Risk < pre-defined δ
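The risk sum is straightforward to compute; the accident classes, probabilities, costs, and threshold below are purely hypothetical numbers for illustration:

```python
def risk(accident_classes):
    """Expected loss per unit time: sum of pr(accident_i) * cost(accident_i)."""
    return sum(p * cost for p, cost in accident_classes)

# Hypothetical accident classes: (probability per year, cost in dollars)
classes = [(1e-3, 5_000_000), (1e-5, 500_000_000)]
delta = 20_000  # hypothetical pre-defined acceptable risk, $/year

print(round(risk(classes), 2), risk(classes) < delta)  # → 10000.0 True
```

Note how a frequent moderate accident and a rare catastrophic one can contribute equally to total risk; the safety judgment is then whether the sum falls below the agreed threshold δ.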
Reliability vs. Availability vs. Safety
- They are not the same...
- Example: a system that is turned off is not very reliable and not very available, but it is probably very safe
- In practice, safety often involves specific intervention
- Never believe an airline that says: "Your safety is our biggest concern."
Confidentiality
Absence of unauthorized disclosure of information
- Microsoft source code vs. Linux source code
- Web browsing
- Operating system security model (files, memory)
- Medical records
- Credit card transaction records
- School grades
Integrity
Absence of improper system-state alterations
- Operating systems: files, memory, network packets
- Linux kernel backdoor attempt
- Database records
- Your bank account
- File transfer: did I really get the right version of software XYZ?
- …
Security
- Security is a combination of attributes: integrity, confidentiality, availability
- Under different circumstances, these attributes are more or less important:
  - Denial of service is an availability issue
  - Exposure of information is a confidentiality issue
Maintainability
Ability to undergo repairs and modifications
General Dependability Requirements
- Telecommunications: availability, maintainability
- Transportation: reliability, availability, safety
- Weapons: safety
- Nuclear systems: safety
Pacemaker Dependability
Pacing now extended to left ventricle
General Characteristics
- Eight-bit processors, moving to 32-bit
- Software: approximately 30K lines, mostly C; vastly more software in the external programmer
- Patient data storage example: 200 samples/sec, 3 channels, 15 minutes
- Long battery life necessary: device "sleeps" between heartbeats
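The storage example is easy to size; the two-bytes-per-sample figure below is an assumption for illustration, not from the slides:

```python
SAMPLES_PER_SEC = 200
CHANNELS = 3
MINUTES = 15
BYTES_PER_SAMPLE = 2  # assumed sample width, not specified by the slides

samples = SAMPLES_PER_SEC * CHANNELS * MINUTES * 60
print(samples, samples * BYTES_PER_SAMPLE)  # → 540000 1080000
```

That is about a megabyte of persistent storage for a quarter-hour of three-channel data, a nontrivial amount for a low-power implanted device.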
Pacemaker Inputs & Outputs
- Three leads, one from each of three chambers (left and right ventricles, right atrium)
- Each lead can sense and pace
- Two shocking electrodes in the right ventricle
- Case-to-lead impedance for breathing rate and lung volume
- Accelerometer to measure activity
- Body temperature
Real-time Control
[Flowchart: on a timer interrupt the device wakes and checks for the expected cardiac event; if the event occurred as expected, it sets the timer and sleeps; if not, it generates a pacing signal, then sets the timer and sleeps until the next heartbeat]
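The wake/check/pace/sleep cycle above can be sketched as a simple loop; `sense_event` and `pace` are hypothetical stand-ins for device primitives, and the timer interrupt is simulated by loop iterations rather than real hardware:

```python
def run_cycles(sense_event, pace, cycles):
    """Simulate the pacemaker control cycle: on each (simulated) timer
    interrupt, check whether the expected natural heartbeat occurred;
    if not, generate a pacing signal; then re-arm the timer and sleep.
    Returns the number of paced beats."""
    paced = 0
    for _ in range(cycles):
        if not sense_event():  # event as expected?
            pace()             # no: generate pacing signal
            paced += 1
        # set timer; device "sleeps" until the next interrupt
    return paced

# Every third beat is missing in this simulated sense stream:
beats = iter([True, True, False] * 3)
print(run_cycles(lambda: next(beats), lambda: None, 9))  # → 3
```

The key design point the flowchart captures is that the device is passive by default (it lets natural beats through) and intervenes only when the expected event fails to appear.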
Multi-Chamber Functionality
- Pacing:
  - Always allow a natural pace if possible
  - Atrium-ventricle sequencing to allow for fill
  - Atrium natural pulse, pace ventricle
  - Biventricular pacing for congestive heart failure
- Defibrillation:
  - Ventricular fibrillation is relatively easy to detect
  - Distinguish benign, moderate, and serious events
  - Atrial fibrillation is also being treated
Dependability of Pacemaker
- Is reliability the goal?
- Typical battery life is five years
- Does failure matter?
  - Device is "off" between heartbeats
  - A restart every heartbeat (~70/minute) is acceptable
  - So t = 100 milliseconds is OK
- But persistent storage is needed
Dependability of Pacemaker
- Is availability the goal?
- How about an availability of 0.99999?
- This corresponds to an average of five minutes per year of downtime
- Death would result if this occurred all at once
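The five-minutes-per-year figure follows directly from the availability definition:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # = 525,960

downtime = (1 - 0.99999) * MINUTES_PER_YEAR
print(round(downtime, 2))  # → 5.26
```

An average of ~5 minutes per year sounds excellent, yet if those five minutes occur as one contiguous outage the patient may die: another reminder that a bare availability number does not constrain the shape of outages.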
Dependability of Pacemaker
- Is safety the goal?
- It's safe when it's off... or is it?
- Leaving the system off might result in death very quickly
- Safety concerns: under/over pacing; unnecessary/missing defibrillation
Attributes Of Dependability
- Reliability
- Availability
- Safety
- Confidentiality
- Integrity
- Maintainability

Requirements: these are properties that might be required of a given system. Customers must identify the dependability requirements of their system, and developers must design so as to achieve them.
How Much Is Enough?
- Dilemma: could we achieve higher dependability? If more resources were expended on developing a given system, the rate of failure might be reduced. That being the case, should the developers expend all possible resources on developing the system?
- For example, formal verification:
  - Known to detect defects
  - Considered difficult and expensive to apply
  - Should its use be required of all systems with large consequences of failure?