Software Reliability
CSCI 5801: Software Engineering
Software Reliability
• What you know after testing:
– Software passes all cases in the test suite
• What the customer wants to know:
– Is the code well written in general?
– How often will it fail?
– What has to happen for it to fail?
– What happens when it fails?
Larger Context of Reliability
• Fault detection (testing and validation)
– Detect faults before the system is put into operation
• Fault avoidance
– Build systems with the objective of creating fault-free software
• Fault tolerance
– Build systems that continue to operate when faults occur
Code Reviews
• Examining code without running it
– Removes dependency on test cases
• Methodology: look for typical flaws
• Best done by others who have a different point of view
– Code walkthroughs done by other programmers
– Pair programming in XP
– Static analysis tools
• Goal: detect flaws before they become faults (fault avoidance)
Code Walkthroughs
• Going through code by hand, statement by statement
– 90–125 statements/hour on average
• Team of ~4 members, with specific roles:
– Moderator: runs the session, ensures it proceeds smoothly
– Code author
– Inspectors (at least 2)
– Scribe: writes down results/suggestions
• Estimated to find 60% to 90% of code errors
Code Walkthroughs
• Preparation
– Developer provides colleagues with the code listing and documentation
– Participants study the documentation in advance
• Meeting
– Developer leads reviewers through the code, describing what each section does and encouraging questions
– Inspectors look for possible flaws and suggest improvements
Code Walkthroughs
• Example checklist:
– Data faults: initialization, constants, array bounds, character strings
– Control faults: conditions, loop termination, compound statements, case statements
– Input/output faults: all inputs used; all outputs assigned a value
– Interface faults: parameter numbers, types, and order; structures and shared memory
– Storage management faults: modification of links, allocation and de-allocation of memory
– Exceptions: possible errors, error handlers
Static Analysis Tools
• Scan source code for possible faults and anomalies
– Lint for C programs
– PMD for Java
• Examples:
– Control flow: loops with multiple exit or entry points
– Data use: undeclared or uninitialized variables, unused variables, multiple assignments, array bounds
– Interface faults: parameter mismatches, non-use of function results, uncalled procedures
– Storage management: unassigned pointers, pointer arithmetic
Good programming practice eliminates all warnings from source code
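To make the idea concrete, here is a tiny, hypothetical static check written in Python (not Lint or PMD themselves): it parses source with the standard `ast` module and flags one data-use anomaly from the list above, a variable that is assigned but never read. The analyzed snippet is only parsed, never executed.

```python
import ast

def unused_assignments(source: str) -> set[str]:
    """Toy static check: report names that are assigned but never read."""
    tree = ast.parse(source)
    assigned, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return assigned - used

# The code below is only parsed, never run, so compute() need not exist.
code = """
x = compute()
y = x + 1
z = 2 * x   # data-use anomaly: z is never read
print(y)
"""
print(unused_assignments(code))  # {'z'}
```

Real analyzers track scopes, control flow, and far more fault classes, but the principle is the same: properties of the code are checked without executing it.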
PMD Example
Static Analysis Tools
• Cross-reference table: Shows every use of a variable, procedure, object, etc.
• Information flow analysis: Identifies input variables on which an output depends.
• Path analysis: Identifies all possible paths through the program.
Software Reliability
• Definition: probability that the system will not fail during a certain period of time in a certain environment
– Measured as failures/CPU hour, etc.
• Questions:
– How much more testing is needed to reach the required reliability?
– What is the expected reliability gain from further testing?
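Under the common simplifying assumption of a constant failure rate λ (failures per CPU hour), the probability of failure-free operation over a period t is R(t) = e^(−λt). A sketch, with a failure rate made up for illustration:

```python
import math

def reliability(failure_rate: float, hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure over `hours`
    of operation, assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * hours)

# Assumed rate for illustration: 0.002 failures/CPU hour, 100-hour mission.
print(round(reliability(0.002, 100), 3))  # 0.819
```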
Statistical Testing
• Testing software for reliability rather than fault detection
– Measuring the number of errors per transaction allows the reliability of the software to be predicted
• Key problem: software will never be 100% reliable!
– An acceptable level of reliability should be specified in the RSD, and the software tested and modified until that level of reliability is reached
Reliability Prediction
• Reliability growth model
– Mathematical model of how system reliability is predicted to change over time as faults are found and removed
– Extrapolated from current data about failures
• Can be used to determine whether the system meets its reliability requirements
– Mean time to failure
– Average failures per transaction
• Can be used to predict when testing will be completed and what level of reliability is feasible
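One of the metrics above, mean time to failure, falls straight out of a failure log. The cumulative failure times below are invented for illustration; note how the growing gaps between failures are what a reliability growth model extrapolates from.

```python
def mttf(failure_times: list[float]) -> float:
    """Mean time to failure: average gap between successive failures."""
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

# Assumed log: cumulative CPU hours at which each failure occurred.
# The widening gaps indicate reliability growth as faults are removed.
times = [2, 5, 9, 15, 24, 40]
print(round(mttf(times), 1))  # 7.6
```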
Operational Profile
• Problem: statistical testing requires a large number of test cases for statistical significance (thousands)
• Where do such test cases come from?
– Often too many to create by hand
– Random generation is not sufficient
Operational Profile
• Operational profile: set of test data whose frequency matches the actual frequency of those inputs in 'normal' usage of the system
– A close match with actual usage is necessary, or the measured reliability will not reflect the reliability seen in actual usage of the system
• Can be generated from real data collected from an existing system, or (more often) depends on assumptions made about the pattern of usage of a system
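A sketch of profile-driven test generation in Python. The input classes and their frequencies are hypothetical (imagine usage logs from a course-registration system); weighted random choice makes the generated test inputs match those frequencies.

```python
import random

# Hypothetical input classes with frequencies assumed from usage logs.
profile = {
    "add_course":    0.55,
    "drop_course":   0.25,
    "view_schedule": 0.15,
    "print_roster":  0.05,
}

def generate_test_inputs(n: int, seed: int = 0) -> list[str]:
    """Draw n test inputs whose frequencies match the operational profile."""
    rng = random.Random(seed)
    return rng.choices(list(profile), weights=list(profile.values()), k=n)

inputs = generate_test_inputs(10_000)
print(round(inputs.count("add_course") / len(inputs), 2))  # close to 0.55
```

In practice each drawn class would be expanded into a concrete input (a particular student and course), but the weighting step is what makes the measured reliability transfer to real usage.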
Example Operational Profile
[Figure: bar chart of number of inputs per input class]
• Note that some types of inputs are much more likely than others
LPM Estimates
• Logarithmic Poisson execution time model (LPM)
– Major bugs are found quickly
– Those major bugs cause most failures
– Effectiveness of fault correction decreases over time
– There exists a point at which further testing has little gain
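A sketch of the logarithmic Poisson model (usually attributed to Musa and Okumoto): failure intensity starts at λ₀ and decays as execution time accumulates, so expected cumulative failures grow only logarithmically. The parameter values below are assumed for illustration.

```python
import math

def lpm_expected_failures(lam0: float, theta: float, tau: float) -> float:
    """Expected cumulative failures after tau units of execution time:
    mu(tau) = ln(lam0 * theta * tau + 1) / theta."""
    return math.log(lam0 * theta * tau + 1) / theta

def time_to_reach_intensity(lam0: float, theta: float, target: float) -> float:
    """Execution time at which the failure intensity
    lam0 / (lam0 * theta * tau + 1) falls to `target` --
    i.e. the point at which further testing yields little gain."""
    return (lam0 / target - 1) / (lam0 * theta)

# Assumed parameters: initial intensity 10 failures/CPU hour, theta = 0.05.
# How long until intensity drops to 0.5 failures/CPU hour?
print(round(time_to_reach_intensity(10, 0.05, 0.5), 1))  # 38.0 CPU hours
```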
Reliability Prediction
[Figure: measured reliability values plotted against time, with a fitted reliability model curve extrapolated to the estimated time at which the required reliability is achieved]
Reliability Measurement Problems
• Operational profile uncertainty
– The operational profile may not be an accurate reflection of the real use of the system
• High costs of test data generation
– Costs can be very high if the test data for the system cannot be generated automatically
• Statistical uncertainty
– You need a statistically significant number of failures to compute the reliability, but highly reliable systems will rarely fail
Stress Testing
• Goal of stress testing: determine what it will take to “break” the system
– “Break” = no longer meets requirements in some way
– Functional: fails to perform required functions
– Reliability: fails more often than specified
– Performance: slower than required
• Approaches:
– Increase load / decrease resources until the system breaks
– Perform “attacks” designed to produce an undesirable result
Stress Testing
• Increase the load on the system in different ways
– Number of students simultaneously adding courses
– Size of files/databases that must be read
– …
• Decrease the resources available to the system (may require fault-injection software)
– Increase the number of other processes running on the system
– Increase the lag time of networked resources
• Goal: the point at which the system fails should be much greater than the scenarios listed in the RSD
Stress Testing
• “Attack” testing is common in security
– Goal of normal testing: for the input of a specific test case, the system produces the desired response for that test case
– Goal of secure programming: for any input, the system does not produce an undesirable result
Stress Testing
• Based on the risk analysis from the design stage:
– Can the roster database be deleted?
– Can an intruder read files (in violation of FERPA)?
– Can a student add a course but not be added to the roster?
Fault Tolerance
• Goals:
– System continues to operate when problems occur
– System avoids critical failures (data loss, etc.)
• Problems can occur from many sources
– Anticipated at the design stage
– Unanticipated (hardware faults, etc.)
• Cannot prevent all failures!
Fault Tolerance
• Usually based on the idea of “backward recovery”
– Record system state at specific events (checkpoints). After a failure, recreate the state at the last checkpoint.
– Combine checkpoints with a system log (audit trail of transactions) that allows transactions since the last checkpoint to be replayed automatically.
• Note that backward-recovery software must also be thoroughly tested!
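A minimal backward-recovery sketch (class and key names are hypothetical): state is checkpointed periodically, transactions since the checkpoint go to a log, and recovery restores the checkpoint and replays the log.

```python
import copy

class RecoverableStore:
    """Checkpoints plus a transaction log enable backward recovery."""

    def __init__(self) -> None:
        self.state: dict[str, int] = {}
        self.checkpoint: dict[str, int] = {}
        self.log: list[tuple[str, int]] = []  # transactions since checkpoint

    def apply(self, key: str, value: int) -> None:
        self.state[key] = value
        self.log.append((key, value))  # audit trail of transactions

    def take_checkpoint(self) -> None:
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self) -> None:
        """Restore the last checkpoint, then replay logged transactions."""
        self.state = copy.deepcopy(self.checkpoint)
        for key, value in self.log:
            self.state[key] = value

store = RecoverableStore()
store.apply("CSCI5801_enrollment", 40)
store.take_checkpoint()
store.apply("CSCI5801_enrollment", 41)  # logged but not checkpointed
store.state = {}                        # simulate a crash wiping state
store.recover()
print(store.state)  # {'CSCI5801_enrollment': 41}
```

In a real system the log would be written to durable storage before each transaction commits; and, as the slide warns, this recovery path itself needs thorough testing.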