Holistic reliability management Keene & Associates
Software and System Reliability Concepts to Assess and Improve Products and Their Underlying Product Development Process
Hewlett Packard February 15, 2006
Dr. Samuel Keene, FIEEE
Dr Samuel Keene, FIEEE
•Six Sigma Sr Master Black Belt
•Past President of IEEE RS
•HW, SW and System Reliability
•Reliability Engineer of the Year in 1996
•Reliability Engineer, Consultant, and Educator
•Education: Physics (BS and MS), Operations Research, and MBA
ho·lis·tic (hō-lĭs′tĭk)
a. Emphasizing the importance of the whole and the interdependence of its parts.
b. Concerned with wholes rather than analysis or separation into parts: holistic medicine; holistic ecology.
Note: Safety, security, reliability, and survivability are system attributes
The whole is more than the sum of its parts
•Hardware reliability
•Hardware to software reliability comparison
•System concepts
•Software reliability concepts
•Measuring software reliability
•Building more reliable software and systems
•Managing variation
Recent Ref: The Quest for Imperfection, Machine Design, 10.5.5 (C, Misapplied measurements and focus)
Notorious Failures (assignable cause)
•Patriot missile misfire (1991) – operational profile change
•Jupiter fly-by – programmed to switch power supplies if communication was not received within 7 days (15-year mission)
•Mars Climate Orbiter (1998) – mix of metric and Imperial units
•DSC Communications failure (1991) – 4 bits changed in 13 LOC but not regression tested
Allegedly, the first time the F-15 crossed the equator
Normal Everyday Flight?
One more special cause driven reliability problem
Pfizer Pharmaceutical products were experiencing intermittent failures in a Paris operating room
On-site investigation revealed that a doctor’s cell phone was interfering with Pfizer’s medical equipment
Solution: redesign chassis covers reducing the orifices (holes) in the equipment covers to block radiation
Bathtub curve
The slope provides insight into the failure mechanism
[Figure: failure rate λ(t) vs. time, with three regions: Infant Mortality (β < 1.0), Random (β = 1.0), Wearout (β > 1.0)]
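The link between slope and failure mechanism can be illustrated with the Weibull failure rate, λ(t) = (β/η)(t/η)^(β-1). The sketch below is a minimal illustration, assuming a two-parameter Weibull model; the function names and the scale parameter `eta` are assumptions introduced here, not part of the slides.

```python
def weibull_hazard(t, beta, eta=1.0):
    """Weibull failure rate: lambda(t) = (beta / eta) * (t / eta) ** (beta - 1).

    beta is the shape parameter (the slope on a Weibull plot);
    eta is the scale parameter (assumed 1.0 here for illustration).
    """
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


def bathtub_region(beta):
    """Map the shape parameter to the bathtub-curve region it indicates."""
    if beta < 1.0:
        return "Infant Mortality"  # failure rate decreases with time
    if beta == 1.0:
        return "Random"            # failure rate is constant
    return "Wearout"               # failure rate increases with time


for beta in (0.5, 1.0, 3.0):
    early, late = weibull_hazard(1.0, beta), weibull_hazard(2.0, beta)
    print(bathtub_region(beta), early, late)
```

A decreasing rate between the two sample times corresponds to infant mortality, an unchanged rate to random failures, and an increasing rate to wearout.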
Hardware failures are special cause driven also
History:
Parts count
MIL-HDBK-217
•Part type
•Stress
•Some application factors
My IBM FA experience
PRISM model
Reliability Prediction Failure Analysis Experience
My IBM Failure Analysis experience Pareto (80-20) effect observed
Special Cause vs Common Cause
Actually, a 99-1% breakdown of parts experiencing reliability problems
Prism Reliability Model
Based upon extensive military and commercial field return data, modified by a broad spectrum of expert application factors (e.g., EMC-related questions):
•Are the equipment orifices smaller than 1/10 of the wavelength of the emissions they are exposed to?
•Will the product be EMC certified by EU for emissions and susceptibility?
•Do traces on alternate layers run orthogonal to each other?
•Are adjacent traces separated by twice their width?
•Plus 5 additional EMC application questions
Built in best practices and lessons learned
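The orifice question above reduces to simple arithmetic: wavelength = c / frequency, and the opening should be smaller than one tenth of that. The sketch below works one case; the 900 MHz frequency is an assumed illustrative value for a cell phone band, and the function names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def wavelength_m(frequency_hz):
    """Free-space wavelength in metres for a given frequency in hertz."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return SPEED_OF_LIGHT_M_PER_S / frequency_hz


def max_orifice_m(frequency_hz, fraction=0.1):
    """Largest opening allowed by the 1/10-wavelength rule of thumb."""
    return fraction * wavelength_m(frequency_hz)


# Assumed example: a 900 MHz emitter (illustrative value, not from the slides).
limit = max_orifice_m(900e6)
print(f"Openings should be smaller than about {limit * 100:.1f} cm")
```

At the assumed 900 MHz the wavelength is roughly a third of a metre, so the rule calls for openings of only a few centimetres.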
Failures vs Faults
Failure: Departure of system behavior in execution from user needs (user-oriented)
Fault: Defect in system implementation that causes, or can cause, the failure when executed (developer-oriented)
The path to failure
Programming oversight (Error) → Fault (Failure susceptibility) → Fault Activation (Failure trigger) → Failure
E.g. F(x) = 1/(x+234); well behaved except at x = -234
Programming error can occur anywhere in the process from requirements development to test
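The F(x) example can be run directly to show a fault that stays dormant until one specific input activates it. This is a minimal sketch; the helper `activating_inputs` and the scanned input range are assumptions added for illustration.

```python
def f(x):
    """F(x) = 1/(x+234): contains a latent fault, since x == -234 is unguarded."""
    return 1 / (x + 234)


def activating_inputs(func, inputs):
    """Return the inputs that trigger a failure (here, an unhandled exception)."""
    triggers = []
    for x in inputs:
        try:
            func(x)
        except ZeroDivisionError:
            triggers.append(x)
    return triggers


# Scan an assumed input range; the fault is present throughout,
# but only one input value turns it into a failure.
print(activating_inputs(f, range(-300, 300)))
```

Testing that never exercises x = -234 would report no failures even though the fault is in the code the whole time.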
Software Maintenance
Perfective – adding functionality, which might be new or overlooked
Adaptive – to have the code work in a changed environment
Corrective – fixing bugs
Preventive – preclude a problem
[Figure: Pie chart of percentage activity vs. maintenance type. Categories: Perfective, Adaptive, Corrective, Preventive]
Operational profile
Established definition: Operational profile is the set of input events that the software will receive during execution, along with the probability that the events will occur.
Modified definition: Operational profile (usage) is: (1) the set of input events that the software will receive during execution, along with the probability that the events will occur, and (2) the set of context-sensitive input events generated by external hardware and software systems that the software can interact with during execution. This is the configuration (C) and machine (M). One could also add the operator variation (O) as a factor impacting software reliability.
Operational Profile Example
Operation                     Occurrence Probability
Enter card                    0.332
Verify PIN                    0.332
Withdraw checking             0.199
Withdraw savings              0.066
Deposit checking              0.040
Deposit savings               0.020
Query status                  0.00664
Test terminal                 0.00332
Input to stolen cards list    0.00058
Backup files                  0.000023
Total                         1.000000
Table 1. Operational Profile for ATM Machine
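One way to use such a profile is to draw test operations at random in proportion to their occurrence probabilities. The sketch below does this with the Table 1 values; the function name `draw_operations` and the seed are assumptions, and `random.choices` treats the probabilities as relative weights, so they need not sum to exactly one.

```python
import random

# Occurrence probabilities taken from Table 1.
ATM_PROFILE = {
    "Enter card": 0.332,
    "Verify PIN": 0.332,
    "Withdraw checking": 0.199,
    "Withdraw savings": 0.066,
    "Deposit checking": 0.040,
    "Deposit savings": 0.020,
    "Query status": 0.00664,
    "Test terminal": 0.00332,
    "Input to stolen cards list": 0.00058,
    "Backup files": 0.000023,
}


def draw_operations(profile, count, seed=None):
    """Draw `count` operations, each chosen with its profile probability."""
    rng = random.Random(seed)
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=count)


# Build a usage-weighted test sequence (seed fixed only for repeatability).
test_sequence = draw_operations(ATM_PROFILE, 10_000, seed=1)
print(test_sequence[:5])
```

A sequence drawn this way spends most of its test effort on the operations users perform most often, which is the point of operational-profile-driven testing.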
Reliability Estimation during Testing
[Figure: FI/FIO plotted against Mcalls, with one curve for conventional test and one for operational-profile-driven test]
Operational-profile-driven test reaches FIO faster
CASRE model selection rules for picking the “best fit model”
1. Do a goodness-of-fit test (i.e., KS or Chi-Square) on the model results
2. Rank the models according to their prequential likelihood values (larger is better)
3. If ranking by -ln(Prequential Likelihood) instead, smaller is better
4. In case of a tie in prequential likelihood, break the tie using the values of model bias
5. In case of a tie in model bias, break the tie using the values of model bias trend
6. Optional - in case of a tie in model bias trend, break the tie using model noise
From Dr Allen Nikora, NASA JPL, CASRE Developer
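The selection rules above amount to a filter followed by a multi-key sort. The sketch below is one way to express that, not CASRE's own code: the `ModelResult` fields, the model names, and all numeric values are hypothetical, and it assumes that smaller values are better for bias, bias trend, and noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Summary statistics for one fitted model (all fields hypothetical)."""
    name: str
    passes_goodness_of_fit: bool            # rule 1: KS or Chi-Square test
    neg_log_prequential_likelihood: float   # rules 2-3: smaller is better
    bias: float                             # rule 4 tie-breaker (assumed smaller is better)
    bias_trend: float                       # rule 5 tie-breaker (assumed smaller is better)
    noise: float                            # rule 6 optional tie-breaker (assumed smaller is better)


def rank_models(results):
    """Drop models that fail goodness of fit, then sort best-first."""
    candidates = [r for r in results if r.passes_goodness_of_fit]
    # Tuple comparison applies each later key only when the earlier ones tie.
    return sorted(
        candidates,
        key=lambda r: (r.neg_log_prequential_likelihood, r.bias, r.bias_trend, r.noise),
    )


# Hypothetical results for three unnamed models.
results = [
    ModelResult("Model A", True, 100.0, 0.20, 0.10, 1.0),
    ModelResult("Model B", True, 100.0, 0.10, 0.30, 2.0),
    ModelResult("Model C", False, 50.0, 0.05, 0.05, 0.5),
]
print([r.name for r in rank_models(results)])
```

Model C is excluded despite its better likelihood because it fails the goodness-of-fit test, and the likelihood tie between A and B is broken on bias.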
Software Reliability Predictive Models
Model Name          Data Inputs
Keene               KSLOCs; SEI Level; fault density; years to maturity
Musa Basic          Error count; time of error detection
Musa Logarithmic    Error count; time of error detection
Shooman             Error count; time of error detection
Jelinski-Moranda    Error count; time of error detection
Lipow               Error count; time of error detection; intervals
Goel-Okumoto        Error count; time of error detection; intervals
Schick-Wolverton    Error count; time of error detection
Dual Test           Common error count; error count from both groups
Weibull             Error count; time of error detection
Testing Success     # of test runs successful; total # of runs
Rayleigh Model Reliability Prediction Based on Profile of Development Process Defect Discovery
[Figure: development phases (Requirements, Design, Code, Unit Test, System Test, Operation), with a prediction made at each stage: Early-Stage Prediction, Code-Phase Prediction, Unit-Test Phase Prediction, System-Test Phase Prediction, Operation Phase Prediction. Other labels: Process/Product Characteristics; Faults/Failure Data Collection; Software Reliability Estimation/Performance Evaluation; Estimation & Development]
Inspection Exercise
Task: You have 60 seconds to document the number of times the 6th letter of the alphabet appears in the following text:
The Necessity of Training Farm Hands for First Class Farms in the Fatherly Handling of Farm Live Stock is Foremost in the Eyes of Farm Owners. Since the Forefathers of the Farm Owners Trained the Farm Hands for First Class Farms in the Fatherly Handling of Farm Live Stock, the Farm Owners Feel they should carry on with the Family Tradition of Training Farm Hands of First Class Farmers in the Fatherly Handling of Farm Live Stock Because they Believe it is the Basis of Good Fundamental Farm Management.
Quantitatively measuring software quality is more like finding flaws in silk than measuring the size of pearls or the firmness of fruit
The Reality
Time Assertion
Software does not wear out over time! If it is logically incorrect today it will be logically incorrect tomorrow
Models need to consider the quality of the test cases and complexity of the software
e.g., 1 LOC vs. 1M LOC
Reliability Focus
“System Management” Failures (Brandon Murphy)
Requirements deficiencies
Interface deficiencies
The best products result from the best development process; for example, “the defect prevention process” that IBM used in becoming the first to achieve SEI Level 5 for its SW development process.
Customer Fulfillment: Kano Diagram
[Figure: Kano diagram. One axis runs from Requirement Unfulfilled to Requirement Fulfilled, the other from Dissatisfaction to Satisfaction. Curves: Expected (Unspoken), Specified, Unexpected (Unspoken)]
Conclusion: Design, Software, Requirements Capture, and the Development Process (especially the quality of communications) made a big difference in reliability!
Keene Process-Based (a priori) SW Reliability Model
•Process Capability (SEI Level)
•Development Organization
•Maintaining Organization
•Code Extent (SLOC)
•Exponential growth to a plateau level
•Historical Factors
•R growth profile
•Usage level
•Fault latency
•% Severity 1 and 2 failures
•Fault activation rate
•MTTR
•I have observed a 10:1 variation in latent fault rate among developers of military quality systems
•The best documented software fault rate has been on the highly touted space shuttle program. It has a published fault rate of 0.1 faults/KSLOC on newly released code (but this is only after 8 months of further customer testing)
•The fault rate at customer turnover is 0.5 faults/KSLOC based upon private correspondence with the lead SS assurance manager.
•The entire code base approaches 6 sigma level of fault rate or 3-4 faults/KSLOC. Boeing Missiles and Space Division, another Level 5 Developer, told me they have achieved like levels of fault rate in their mature code base.
Fault Profile Curves vis a vis the CMM Level
Mapping of the SEI process capability levels (I,II,III,IV,V) against probable fault density distributions of the developed code (Depiction)
Level 1: Initial (ad hoc)
Level 2: Repeatable (policies)
Level 3: Defined (documented)
Level 4: Managed (measured and capable)
Level 5: Optimized (optimizing)
Combined Results - Curt Smith, ISSRE 99
[Figure: Software Errors Remaining / 100 KSLOC (axis range 0 to 700) by month, Feb-96 through Apr-98. Series: SWEEP actuals, SWEEP prediction, CASRE actuals, Generalized Poisson estimate, DPM prediction. Annotations: 91%, 99%, 35%]
Synonyms
Keene Process-Based Model: same as the Development Process Model (Smith)
SWEEP (SW Error Estimation Process), developed by the Software Productivity Consortium, is an implementation of the Rayleigh model (Smith). The Rayleigh prediction model was developed by John Gaffney of IBM.
Progressive Software Reliability Prediction
Steps:
1) Collect Data: Get fault rates for the defect data profile (defect data from SW development phases).
2) Curve fit: Use the Rayleigh Model to project the latent fault density, fi, at delivery.
3) Predict Steady-State MTBF: Insert the observed fi into Keene’s model for the operational MTBF profile.
[Figure: fault density vs. development phase through System Test, showing actual data and the fitted curve, with fi = latent fault density at delivery; a second plot shows operational MTBF vs. time t]
Rayleigh models: Stephen Kan and John Gaffney
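The curve-fit step can be sketched with a Rayleigh defect-discovery curve, f(t) = K (t/σ²) exp(-t²/(2σ²)), where K is the total fault density and σ sets the peak. The code below is a sketch under stated assumptions, not the SWEEP or Keene implementation: it fits K and σ by ordinary least squares on the linearized form ln(f/t) = ln(K/σ²) - t²/(2σ²), then projects the latent density at delivery as K exp(-T²/(2σ²)). The function names and the sample data are hypothetical.

```python
import math


def rayleigh_density(t, total, sigma):
    """Fault discovery density at time t: K * (t / sigma^2) * exp(-t^2 / (2 sigma^2))."""
    return total * (t / sigma ** 2) * math.exp(-t ** 2 / (2 * sigma ** 2))


def fit_rayleigh(times, densities):
    """Estimate (K, sigma) by least squares on ln(f/t) versus t^2."""
    xs = [t * t for t in times]
    ys = [math.log(d / t) for t, d in zip(times, densities)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        raise ValueError("data do not follow a decaying Rayleigh profile")
    sigma = math.sqrt(-1.0 / (2.0 * slope))      # slope = -1 / (2 sigma^2)
    total = sigma ** 2 * math.exp(intercept)     # intercept = ln(K / sigma^2)
    return total, sigma


def latent_fault_density(total, sigma, t_delivery):
    """Faults not yet discovered at delivery: K * exp(-T^2 / (2 sigma^2))."""
    return total * math.exp(-t_delivery ** 2 / (2 * sigma ** 2))


# Hypothetical data: phases 1..5 with densities generated from K=20, sigma=2.
phases = [1, 2, 3, 4, 5]
observed = [rayleigh_density(t, 20.0, 2.0) for t in phases]
k_hat, sigma_hat = fit_rayleigh(phases, observed)
fi = latent_fault_density(k_hat, sigma_hat, t_delivery=5)
print(round(k_hat, 3), round(sigma_hat, 3), round(fi, 3))
```

The projected fi would then be the input to step 3. With real project data the linearized fit is only a starting point, since it weights small late-phase counts heavily.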
Development Focus
Rigorous Development Process
•Requirements Capture
•“Voice of the Customer”
•Prototypes
•Lessons Learned
•High Level Flow Diagrams
•Data Descriptions
•Architecture
•Firewalls
•Partitions
•Safe Subset of Language
Development Focus Continued
•Safety Emphasis is Number 1
•FTA
•FMEA
•Clean Room Approach
•Development Cross Functional Team
•Design Reviews
•Walkthroughs, Inspections
•Built in Safety Monitor
•Robust Human Interface
Development Focus cont.
•Fault Avoidance
•Fault Tolerance
•FMEA
•PFMEA
•DPP ******
•Failure Review Board
•Manage Failure/Anomaly Logs
•Fault Insertion
•Customer Feedback
•Alpha Testing
•Beta Testing
COTS challenge
• Assure interoperability of COTS: incompatibility of data format, protocol, operating assumptions
• Version compatibility, migration and extensibility
• Vendor responsiveness
• Vendor participation and cooperativeness
Visualizations:
•Flow Graphs (devise tests, reduce coupling, manage complexity, prioritize analysis and verification)
•Entity Relationship Diagrams
•State Transition Diagrams
•Data Structures
•Swim Lane Diagrams
•Message Handling Diagrams
•GUI Screens
•Prototypes
•User Feedback
•Data Flow Diagrams
Swim Lanes: Previous Implementation
[Figure: swim lane diagram. Lanes: Creator, Admin, Analyst, Core Team, BI Creator, BI Admin, BI Review. Activities: Create ECR, Review ECR, Tech Review ECR, Business Decision, Disposition ECR, Close ECR, Create Tasks, Assign Tasks, Claim Task, Execute Tasks, Complete Execute Assignment, Close Task, Draft Doc, Doc/Part Admin, Doc/Part Review, Approve Release Doc/Part. Decision labels: Reject, Investigate, Approve]
Looking at Failures
1. Failures cluster
•In particular code areas
•In type or cause
2. All failures count – don’t dismiss
3. Prediction models count test failures only once during testing, but count every failure in the field
4. Software development has been said to be a “defect removal process”
Software changes degrade the architecture and increase code complexity
Design for maintenance
Small Changes are Error Prone
LOC Changed    Likelihood of error
1 line         50%
5 lines        75%
20 lines       35%
Edwards, William, “Lessons Learned from 2 Years Inspection Data”, Crosstalk Magazine, No. 39, Dec 1992, citing Weinberg, G., “Kill That Code!”, IEEE Tutorial on Software Restructuring, 1986, p. 131.
Classic Example: DSC Corp, Plano, Texas: 3 bits of a MSLOC program were changed, leading to municipal phone outages in major metropolitan areas
Good design practices
•Design for change
•Question requirements
•Design for “nots” as well as “shalls”
•FMEA
•Use and maintain documentation, e.g., flow graphs, control charts, entity-relationship diagrams, …
•Question data
FAILURE MODE AND EFFECTS ANALYSIS
Product Panel Assembly Process   Date ______   Team Members ________________   Page __ of __

Process Description: Procure zinc plated plastic panel
Process Function: Provide conductive surface
Failure Mode: Plating may not adhere to plastic surface completely
Causes: Dirty parts during plating process
Effects: Product malfunction
SEV 7   FREQ 3   DET 10   RPN 350
Recommended Control: Use carbonized plastic instead of plating.

Process Description: Mount fuel gage
Process Function: To provide fuel reading
Failure Mode: Gage may be mounted upside down
Causes: Operator error
Effects: Customer will need to send product for repair
SEV 7   FREQ 2   DET 2   RPN 28
Recommended Control: Design the mounting holes in different sizes so the gage cannot be mounted wrong.

Process Description: Assemble function indicators
Process Function: To snap in lamp cover in proper sequence
Failure Mode: Cover installed in wrong sequence
Causes: Operator error
Effects: Customer confused and gets false indications
SEV 8   FREQ 4   DET 2   RPN 64
Recommended Control: Silk screen letters on the panel instead of lamp cover.

Process Description: Install warning lights
Process Function: To install warning light in proper location
Failure Mode: Warning light cover interchanged with caution lamp cover
Causes: Operator error
Effects: Customer may not get warning
SEV 9   FREQ 2   DET 2   RPN 36
Recommended Control: Choose different size socket for warning light.
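The RPN column in an FMEA is conventionally the product SEV × FREQ × DET, which is how three of the four rows above are computed (7 × 2 × 2 = 28, 8 × 4 × 2 = 64, 9 × 2 × 2 = 36). The first row's printed RPN of 350 does not equal 7 × 3 × 10 = 210, so one of its ratings or the product appears to have been mis-transcribed. The sketch below recomputes and ranks the rows using the ratings as listed; the function and variable names are assumptions.

```python
def risk_priority_number(severity, frequency, detection):
    """RPN = severity x frequency (occurrence) x detection rating."""
    return severity * frequency * detection


# (process step, SEV, FREQ, DET) as listed in the worksheet.
FMEA_ROWS = [
    ("Procure zinc plated plastic panel", 7, 3, 10),
    ("Mount fuel gage", 7, 2, 2),
    ("Assemble function indicators", 8, 4, 2),
    ("Install warning lights", 9, 2, 2),
]

# Rank the steps so the highest-risk item is addressed first.
ranked = sorted(
    FMEA_ROWS,
    key=lambda row: risk_priority_number(row[1], row[2], row[3]),
    reverse=True,
)
for step, sev, freq, det in ranked:
    print(step, risk_priority_number(sev, freq, det))
```

Either value for the first row leaves it at the top of the ranking, so the recommended priority does not change.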
Samuel Keene
Why Testing Under Expected Operational Scenarios is Not Sufficient
Software Fault Injection
•A form of software testing
•Not statistical testing, not correctness proofs
•“What-if Game”
•The more you play, the more confident you can become that your software can deal with anomalous situations – Unanticipated Events
•Determines the consequences of incorrect code or input data
•Crash testing software
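A very small version of the "what-if game" is to replace each input of a function with anomalous values and record whether the code survives. The sketch below is an illustration only: `safe_ratio`, `fragile_ratio`, `inject_faults`, and the anomaly list are all hypothetical, and real fault injection tools also corrupt code and internal state, which this does not attempt.

```python
def safe_ratio(numerator, denominator):
    """Hypothetical function under test: returns None instead of crashing on bad input."""
    try:
        return float(numerator) / float(denominator)
    except (TypeError, ValueError, ZeroDivisionError):
        return None


def fragile_ratio(numerator, denominator):
    """Hypothetical counterpart with no protection against anomalous input."""
    return numerator / denominator


def inject_faults(func, nominal_args, anomalies):
    """Replace each argument in turn with each anomalous value and record the outcome."""
    report = []
    for position in range(len(nominal_args)):
        for bad_value in anomalies:
            args = list(nominal_args)
            args[position] = bad_value
            try:
                func(*args)
                outcome = "survived"
            except Exception as exc:
                outcome = "crashed: " + type(exc).__name__
            report.append((position, repr(bad_value), outcome))
    return report


ANOMALIES = [None, "", "abc", 0, float("nan"), float("inf"), -1e308]

safe_crashes = [r for r in inject_faults(safe_ratio, (10, 2), ANOMALIES)
                if r[2] != "survived"]
fragile_crashes = [r for r in inject_faults(fragile_ratio, (10, 2), ANOMALIES)
                   if r[2] != "survived"]
print(len(safe_crashes), len(fragile_crashes))
```

Surviving an injected anomaly only shows the code did not crash; whether the returned value is acceptable still has to be judged against the requirements.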
Six Sigma – World Class Process
[Diagram: contributors to a world-class process, grouped by branch]
Development Process
• Capability (CMM)
• Communication: Collaborative Tools, Common Vocabulary, Team Focus (CFDT), GQM
• Design Point Centering: Optimization, Robustness & Margin, DOE
• Understandability: Program Structure, Design Documentation
• Process Improvement: Metrics, DPP, Stat Analysis
System Management
• Requirements & Interfaces: KANO (M,S,D), Needs & Context, Spiral Model
• Operational Profile
• Traceability
KPIV Variability
• Hardware: Cpk, MSA (GR+R)
• Application: Environment, Off-Nominal Modes
• VOC
Product Verification
• Regression Cases
• Variational Testing
• Alpha - Beta
Developed by Sam Keene Jan 2006
Useful References
Draft Standard for Software Reliability Prediction IEEE_P_1633
[IEEE 90] Institute of Electrical and Electronics Engineers. IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. New York, NY: 1990.
CASRE model: http://www.openchannelfoundation.org/projects/CASRE_3.0