
Holistic reliability management Keene & Associates 1 Software and System Reliability Concepts to...

Date post: 19-Dec-2015

1

Holistic reliability management Keene & Associates

Software and System Reliability Concepts to Assess and Improve Products and Their Underlying Product Development Process

Hewlett Packard February 15, 2006

Dr. Samuel Keene, FIEEE

[email protected]

2


Dr Samuel Keene, FIEEE

• Six Sigma Sr Master Black Belt
• Past President of IEEE RS
• HW, SW and System Reliability
• Reliability Engineer of the Year in 1996
• Reliability Engineer, Consultant, and Educator
• Education: Physics (BS and MS), Operations Research, and MBA

3


ho·lis·tic (hō-lĭs′tĭk)

a. Emphasizing the importance of the whole and the interdependence of its parts.

b. Concerned with wholes rather than analysis or separation into parts: holistic medicine; holistic ecology.

Note: Safety, security, reliability, and survivability are system attributes

4


The whole is more than the sum of its parts

• Hardware reliability
• Hardware to software reliability comparison
• System concepts
• Software reliability concepts
• Measuring software reliability
• Building more reliable software and systems
• Managing variation

Recent Ref: The Quest for Imperfection, Machine Design, 10.5.5 (C, Misapplied measurements and focus)

5


Notorious Failures (assignable cause)

•Patriot missile misfire (1991): operational profile change

•Jupiter fly-by: programmed to switch power supplies if communication was not received within 7 days (15 year mission)

•Mars Climate Orbiter (1998): mix of metric and Imperial units

•DSC Communications failure (1991) – 4 bits changed in 13 LOC but not regression tested

6


Allegedly, the first time the F-15 crossed the equator

Normal Everyday Flight?

7


One more special cause driven reliability problem

Pfizer Pharmaceutical products were experiencing intermittent failures in a Paris operating room

On-site investigation revealed that a doctor’s cell phone was interfering with Pfizer’s medical equipment

Solution: redesign the chassis covers, reducing the size of the orifices (holes) in the covers to block radiation

8

Bathtub curve

The slope provides insight into the failure mechanism

[Figure: failure rate λ(t) versus time, with three regions: Infant Mortality (β < 1.0), Random (β = 1.0), Wearout (β > 1.0)]
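The three regions of the curve can be illustrated with a Weibull hazard function. This is a minimal sketch, assuming the slope symbol on the slide is the Weibull shape parameter β; the scale parameter `eta`, the sample times, and the β values used below are hypothetical.

```python
def weibull_hazard(t, beta, eta=1.0):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def trend(beta, t_early=0.5, t_late=2.0):
    """Classify the hazard as decreasing, constant, or increasing over time."""
    early = weibull_hazard(t_early, beta)
    late = weibull_hazard(t_late, beta)
    if late < early:
        return "decreasing"
    if late > early:
        return "increasing"
    return "constant"


# Hypothetical shape values, one per region of the bathtub curve.
for label, beta in [("infant mortality", 0.5), ("random", 1.0), ("wearout", 3.0)]:
    print(f"{label}: beta={beta}, hazard is {trend(beta)}")
```

A shape below 1 gives a falling failure rate, a shape of exactly 1 gives a constant rate, and a shape above 1 gives a rising rate.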

9


Hardware failures are special cause driven also

History:

Parts count

MIL-HDBK-217

•Part type

•Stress

•Some application factors

My IBM FA experience

PRISM model

10


Reliability Prediction Failure Analysis Experience

My IBM Failure Analysis experience

Pareto (80-20) effect observed

Special Cause vs Common Cause

Actually, closer to a 99-1 breakdown: about 1% of the parts experience the reliability problems

11


Prism Reliability Model

Based upon extensive military and commercial field return data, modified by a broad spectrum of expert application factors (e.g., EMC-related questions):

•Are the equipment orifices smaller than 1/10 of the wavelength of the emissions the equipment is exposed to?

•Will the product be EMC certified by EU for emissions and susceptibility?

•Do traces on alternate layers run orthogonal to each other?

•Are adjacent traces separated by twice their width?

•Plus 5 additional EMC application questions

Built in best practices and lessons learned
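The 1/10 wavelength question can be turned into a quick calculation. This is only a sketch of the arithmetic: wavelength is the speed of light divided by frequency, and the 900 MHz value is a hypothetical cell phone band chosen for illustration, not a figure from the slides.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def max_aperture_m(frequency_hz, fraction=0.1):
    """Largest opening, in metres, that stays under `fraction` of a wavelength."""
    wavelength_m = SPEED_OF_LIGHT_M_PER_S / frequency_hz
    return fraction * wavelength_m


# Hypothetical interference source at 900 MHz.
limit_m = max_aperture_m(900e6)
print(f"Openings should stay below about {limit_m * 100:.1f} cm")
```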

12

Failures vs Faults

Failure (user-oriented): departure of system behavior in execution from user needs

Fault (developer-oriented): defect in system implementation that causes, or can cause, the failure when executed

13

Programming oversight (Error) → Fault (Failure susceptibility) → Fault Activation (Failure trigger) → Failure

The path to failure

E.g. F(x) = 1/(x+234); well behaved except at x = -234

Programming error can occur anywhere in the process from requirements development to test
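The example function makes the path concrete: the fault (a missing guard) is always present in the code, but a failure only appears when an input activates it. A minimal sketch follows; the helper name `run_profile` and the two input ranges are hypothetical.

```python
def f(x):
    # Fault: nothing guards against x == -234, where the denominator is zero.
    return 1 / (x + 234)


def run_profile(inputs):
    """Run f over a set of inputs and return the inputs that triggered a failure."""
    failures = []
    for x in inputs:
        try:
            f(x)
        except ZeroDivisionError:
            failures.append(x)
    return failures


# Usage that never reaches x = -234: the fault stays latent.
print(run_profile(range(0, 1000)))
# Usage that includes x = -234: the fault is activated and a failure occurs.
print(run_profile(range(-300, 0)))
```

The same code is failure-free under one usage pattern and fails under another, which is why the operational profile matters for observed reliability.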

14


A + B = C

1 + 0 =

1 + 1 =

1 + .5 =

1 + A =

15

Perfective changes – adding functionality, which might be new or overlooked

Adaptive – to have the code work in a changed environment

Corrective – fixing bugs

Preventive – preclude a problem

Software Maintenance

16

[Figure: pie chart of percentage of activity by maintenance type (Perfective, Adaptive, Corrective, Preventative)]

17

[Figure: reliability and failure intensity versus time]

18

Operational profile

Established definition: Operational profile is the set of input events that the software will receive during execution along with the probability that the events will occur.

Modified definition: Operational profile (usage) is: (1) the set of input events that the software will receive during execution along with the probability that the events will occur, and (2) the set of context-sensitive input events generated by external hardware and software systems that the software can interact with during execution. This is the configuration (C) and machine (M). One could also add operator variation (O) as a factor affecting software reliability.

19

Operational Profile Example

Operation                     Occurrence Probability
Enter card                    0.332
Verify PIN                    0.332
Withdraw checking             0.199
Withdraw savings              0.066
Deposit checking              0.040
Deposit savings               0.020
Query status                  0.00664
Test terminal                 0.00332
Input to stolen cards list    0.00058
Backup files                  0.000023
Total                         1.000000

Table 1. Operational Profile for ATM Machine
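One way to use such a profile is to draw test cases in proportion to the listed probabilities. The sketch below is illustrative only: the function name `draw_test_cases`, the sample size, and the seed are hypothetical, and `random.choices` treats the probabilities as relative weights.

```python
import random
from collections import Counter

# Occurrence probabilities copied from Table 1.
PROFILE = {
    "Enter card": 0.332,
    "Verify PIN": 0.332,
    "Withdraw checking": 0.199,
    "Withdraw savings": 0.066,
    "Deposit checking": 0.040,
    "Deposit savings": 0.020,
    "Query status": 0.00664,
    "Test terminal": 0.00332,
    "Input to stolen cards list": 0.00058,
    "Backup files": 0.000023,
}


def draw_test_cases(profile, n, seed=0):
    """Sample n operations, each chosen with weight equal to its profile probability."""
    rng = random.Random(seed)
    operations = list(profile)
    weights = list(profile.values())
    return rng.choices(operations, weights=weights, k=n)


cases = draw_test_cases(PROFILE, 10_000)
for operation, count in Counter(cases).most_common(3):
    print(operation, count)
```

Frequent operations dominate the sample, so rare operations such as "Backup files" may need dedicated tests in addition to profile-driven ones.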

20

Reliability Estimation during Testing

[Figure: normalized failure intensity (FI/FIO, axis 0 to 18) versus Mcalls (axis 0 to 0.5), comparing conventional test with operational-profile-driven test]

Operational-profile-driven test reaches FIO faster

21


Failure intensity plot

22

CASRE model selection rules for picking the “best fit model”

1. Do a goodness-of-fit test (i.e., KS or Chi-Square) on the model results

2. Rank the models according to their prequential likelihood values (larger is better; if -ln(prequential likelihood) is reported instead, smaller is better)

3. In case of a tie in prequential likelihood, break the tie using the values of model bias

4. In case of a tie in model bias, break the tie using the values of model bias trend

5. Optional - in case of a tie in model bias trend, break the tie using model noise

From Dr Allen Nikora, NASA JPL, CASRE Developer
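The selection rules above can be sketched as a filter followed by a multi-key sort. This is not CASRE code: the `ModelResult` record, the treatment of the goodness-of-fit test as a pass/fail filter, the assumption that smaller bias, bias trend, and noise values are better, and all of the numbers are hypothetical.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    passes_gof: bool   # outcome of the KS or Chi-Square goodness-of-fit test
    neg_log_pl: float  # -ln(prequential likelihood), smaller is better
    bias: float        # assumed: smaller is better
    bias_trend: float  # assumed: smaller is better
    noise: float       # assumed: smaller is better


def rank_models(results):
    """Drop models that fail goodness of fit, then rank with successive tie-breaks."""
    candidates = [r for r in results if r.passes_gof]
    return sorted(
        candidates,
        key=lambda r: (r.neg_log_pl, r.bias, r.bias_trend, r.noise),
    )


# Hypothetical results for three models.
results = [
    ModelResult("Musa Basic", True, 120.4, 0.21, 0.10, 3.2),
    ModelResult("Goel-Okumoto", True, 120.4, 0.18, 0.12, 2.9),
    ModelResult("Jelinski-Moranda", False, 118.0, 0.15, 0.09, 2.5),
]
ranked = rank_models(results)
print([r.name for r in ranked])
```

Sorting on a tuple applies each later criterion only when all earlier ones tie exactly; a practical tool would probably compare with a tolerance instead.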

23

Software Reliability Predictive Models

Model Name          Data Inputs
Keene               KSLOCs; SEI Level; fault density; years to maturity
Musa Basic          Error count; time of error detection
Musa Logarithmic    Error count; time of error detection
Shooman             Error count; time of error detection
Jelinski-Moranda    Error count; time of error detection
Lipow               Error count; time of error detection; intervals
Goel-Okumoto        Error count; time of error detection; intervals
Schick-Wolverton    Error count; time of error detection
Dual Test           Common error count; error count from both groups
Weibull             Error count; time of error detection
Testing Success     # of test runs successful; total # of runs
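As an illustration of how little data some of these models need, the Dual Test inputs (errors found by each of two independent groups, plus the errors common to both) fit a capture-recapture style estimate. The sketch below assumes that is the intended estimator, which the slide does not state; the function name and the counts are hypothetical.

```python
def dual_test_estimate(found_a, found_b, common):
    """Estimate total and remaining errors from two independent test groups.

    Uses the capture-recapture relation: total ~= found_a * found_b / common.
    """
    if common <= 0:
        raise ValueError("need at least one error found by both groups")
    total = found_a * found_b / common
    found_so_far = found_a + found_b - common
    return total, total - found_so_far


# Hypothetical counts: group A found 25 errors, group B found 30, 15 in common.
total, remaining = dual_test_estimate(25, 30, 15)
print(f"estimated total: {total:.0f}, estimated remaining: {remaining:.0f}")
```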

24

Rayleigh Model Reliability Prediction Based on Profile of Development Process Defect Discovery

[Figure: a prediction is made at each stage (Early-Stage, Code-Phase, Unit-Test Phase, System-Test Phase, Operation Phase) along the lifecycle Requirements, Design, Code, Unit Test, System Test, Operation. Inputs: Process/Product Characteristics and Faults/Failure Data Collection. Output: Software Reliability Estimation/Performance Evaluation. Label: Estimation & Development]

25

Inspection Exercise

Task: You have 60 seconds to document the number of times the 6th letter of the alphabet appears in the following text:

The Necessity of Training Farm Hands for First Class Farms in the Fatherly Handling of Farm Live Stock is Foremost in the Eyes of Farm Owners. Since the Forefathers of the Farm Owners Trained the Farm Hands for First Class Farms in the Fatherly Handling of Farm Live Stock, the Farm Owners Feel they should carry on with the Family Tradition of Training Farm Hands of First Class Farmers in the Fatherly Handling of Farm Live Stock Because they Believe it is the Basis of Good Fundamental Farm Management.

26


Quantitatively measuring software quality is more like finding flaws in silk than measuring the size of pearls or the firmness of fruit

The Reality

27


Time Assertion

Software does not wear out over time! If it is logically incorrect today it will be logically incorrect tomorrow

Models need to consider the quality of the test cases and complexity of the software

e.g., 1 LOC vs. 1M LOC

28


Reliability Focus

“System Management” Failures (Brandon Murphy)

• Requirements deficiencies
• Interface deficiencies

The best products result from the best development process. For example, IBM used the “defect prevention process” to become the first to achieve SEI Level 5 for its SW development process.

29


Customer Fulfillment: Kano Diagram

[Figure: Kano diagram. Horizontal axis: Requirement Unfulfilled to Requirement Fulfilled. Vertical axis: Dissatisfaction to Satisfaction. Curves: Expected (Unspoken), Specified, Unexpected (Unspoken)]

30


Conclusion: Design, Software, Requirements Capture, and the Development Process (especially the quality of communications) made a big difference in reliability!

31

Keene Process-Based (a priori) SW Reliability Model

• Process Capability (SEI Level)
    Development Organization
    Maintaining Organization
• Code Extent (SLOC)
• Exponential growth to a plateau level
• Historical Factors
    R growth profile
    Usage level
    Fault latency
    % Severity 1 and 2 failures
    Fault activation rate
    MTTR

32


•I have observed a 10:1 variation in latent fault rate among developers of military quality systems

•The best documented software fault rate has been on the highly touted space shuttle program. It has a published fault rate of 0.1 faults/KSLOC on newly released code (but this is only after 8 months of further customer testing)

•The fault rate at customer turnover is 0.5 faults/KSLOC based upon private correspondence with the lead SS assurance manager.

•The entire code base approaches 6 sigma level of fault rate or 3-4 faults/KSLOC. Boeing Missiles and Space Division, another Level 5 Developer, told me they have achieved like levels of fault rate in their mature code base.

Fault Profile Curves vis a vis the CMM Level

33


Mapping of the SEI process capability levels (I,II,III,IV,V) against probable fault density distributions of the developed code (Depiction)

Level 1: Initial (ad hoc)

Level 2: Repeatable (policies)

Level 3: Defined (documented)

Level 4: Managed (measured and capable)

Level 5: Optimized (optimizing)

34


Combined Results-Curt Smith ISSRE 99

[Figure: Software Errors Remaining / 100 KSLOC (axis 0 to 700) by month from Feb-96 to Apr-98. Series: SWEEP actuals, SWEEP prediction, CASRE actuals, Generalized Poisson estimate, DPM prediction. Annotations: 91%, 99%, 35%]

35


Synonyms

Keene Process Based Model same as the Development Process Model (Smith)

SWEEP (SW Error Estimation Process), developed by the Software Productivity Consortium, is an implementation of the Rayleigh model (Smith). The Rayleigh prediction model was developed by John Gaffney of IBM.

36

Progressive Software Reliability Prediction

Steps:

1) Collect Data: Get fault rates for the defect data profile (defect data from SW development phases).

2) Curve fit: Use the Rayleigh Model to project the latent fault density, fi, at delivery.

3) Predict Steady-State MTBF: Insert the observed fi into Keene’s model for the operational MTBF profile.

[Figure: fault density versus development phase through System Test, showing actual data and the fitted curve, with fi = latent fault density at delivery; alongside, operational MTBF versus time t]

Rayleigh models: Steven Kan and John Gaffney
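Steps 1 and 2 can be sketched with a simple least-squares fit of a Rayleigh cumulative curve, K(1 - exp(-t²/(2c²))), to cumulative defect data. This is only an illustration under stated assumptions: phases are treated as equally spaced time units, the grid search stands in for whatever fitting procedure a real tool uses, and the defect data are synthetic values generated for the demonstration. Step 3, Keene's conversion to MTBF, is not shown because the slides do not give its formula.

```python
import math


def rayleigh_cdf(t, c):
    """Fraction of all defects discovered by time t for a Rayleigh profile."""
    return 1.0 - math.exp(-t * t / (2.0 * c * c))


def fit_rayleigh(cum_defects):
    """Fit total defects K and scale c to cumulative defects per phase.

    cum_defects[i] is the cumulative defect density found by the end of phase i + 1.
    A grid search over c is used; for each c the best K has a closed form.
    """
    best = None
    for step in range(50, 1001):
        c = step / 100.0
        g = [rayleigh_cdf(i + 1, c) for i in range(len(cum_defects))]
        k = sum(y * gi for y, gi in zip(cum_defects, g)) / sum(gi * gi for gi in g)
        sse = sum((y - k * gi) ** 2 for y, gi in zip(cum_defects, g))
        if best is None or sse < best[0]:
            best = (sse, k, c)
    return best[1], best[2]


def latent_density(k, c, delivery_time):
    """Defect density still undiscovered at delivery."""
    return k * (1.0 - rayleigh_cdf(delivery_time, c))


# Synthetic data: five phases generated from K = 20 defects/KSLOC and c = 2.0.
observed = [20.0 * rayleigh_cdf(t, 2.0) for t in range(1, 6)]
k, c = fit_rayleigh(observed)
fi = latent_density(k, c, delivery_time=5)
print(f"K = {k:.2f}, c = {c:.2f}, latent fault density at delivery = {fi:.3f}")
```

Because the fit only needs defect counts by phase, it can be rerun as each phase closes, which matches the progressive nature of the prediction.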

37

Development Focus

Rigorous Development Process

Requirements Capture
• “Voice of the Customer”
• Prototypes
• Lessons Learned
• High Level Flow Diagrams
• Data Descriptions

Architecture
• Firewalls
• Partitions
• Safe Subset of Language

38

Development Focus Continued

Safety Emphasis is Number 1
• FTA
• FMEA
• Clean Room Approach

Development Cross Functional Team
• Design Reviews
• Walkthroughs, Inspections
• Built in Safety Monitor
• Robust Human Interface

39

Development Focus cont.

• Fault Avoidance
• Fault Tolerance
• FMEA
• PFMEA
• DPP ******
• Failure Review Board
• Manage Failure/Anomaly Logs
• Fault Insertion
• Customer Feedback
• Alpha Testing
• Beta Testing

40


• Assure interoperability of COTS: incompatibility of data format, protocol, operating assumptions

• Version compatibility, migration and extensibility

• Vendor responsiveness

• Vendor participation and cooperativeness

COTS challenge

41

Visualizations:

• Flow Graphs (devise tests, reduce coupling, manage complexity, prioritize analysis and verification)
• Entity Relationship Diagrams
• State Transition Diagrams
• Data Structures
• Swim Lane Diagrams
• Message Handling Diagrams
• GUI Screens
• Prototypes
• User Feedback
• Data Flow Diagrams

42

Swim Lanes: Previous Implementation

[Figure: swim lane diagram. Lanes: Creator, Admin, Analyst, Core Team, BI Creator, BI Admin, BI Review. Activities: Create ECR, Review ECR, Tech Review ECR, Business Decision (Reject, Investigate, Approve), Create Tasks, Assign Tasks, Claim Task, Execute Tasks, Close Task, Draft Doc, Doc/Part Admin, Doc/Part Review, Approve, Release Doc/Part, Complete Execute Assignment, Disposition ECR, Close ECR]

43

Looking at Failures

1. Failures cluster
    In particular code areas
    In type or cause

2. All failures count – don’t dismiss

3. Prediction models count test failures only once during testing, but count every failure in the field

4. Software development has been said to be a “defect removal process”

44


Software changes degrade the architecture and increase code complexity

Design for maintenance

45

Small Changes are Error Prone

LOC Changed    Likelihood of error
1 line         50%
5 lines        75%
20 lines       35%

Edwards, William, “Lessons Learned from 2 Years Inspection Data”, Crosstalk Magazine, No. 39, Dec 1992, citing Weinberg, G., “Kill That Code!”, IEEE Tutorial on Software Restructuring, 1986, p. 131.

Classic Example: DSC Corp, Plano Texas, 3 bits of a MSLOC program were changed, leading to municipal phone outages in major metropolitan areas

46

Good design practices

• Design for change
• Question requirements
• Design for “nots” as well as “shalls”
• FMEA
• Use and maintain documentation, e.g., flow graphs, control charts, entity-relationship diagrams, …
• Question data

47

FAILURE MODE AND EFFECTS ANALYSIS

Product: Panel Assembly Process    Date ______    Team Members ________________    Page of

Process Description: Procure zinc plated plastic panel
Process Function: Provide conductive surface
Failure Mode: Plating may not adhere to plastic surface completely
Causes: Dirty parts during plating process
Effects: Product malfunction
SEV 7, FREQ 3, DET 10, RPN 350
Recommended Control: Use carbonized plastic instead of plating.

Process Description: Mount fuel gage
Process Function: To provide fuel reading
Failure Mode: Gage may be mounted upside down
Causes: Operator error
Effects: Customer will need to send product for repair
SEV 7, FREQ 2, DET 2, RPN 28
Recommended Control: Design the mounting holes in different sizes so the gage cannot be mounted wrong.

Process Description: Assemble function indicators
Process Function: To snap in lamp cover in proper sequence
Failure Mode: Cover installed in wrong sequence
Causes: Operator error
Effects: Customer confused and gets false indications
SEV 8, FREQ 4, DET 2, RPN 64
Recommended Control: Silk screen letters on the panel instead of lamp cover.

Process Description: Install warning lights
Process Function: To install warning light in proper location
Failure Mode: Warning light cover interchanged with caution lamp cover
Causes: Operator error
Effects: Customer may not get warning
SEV 9, FREQ 2, DET 2, RPN 36
Recommended Control: Choose different size socket for warning light.
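The risk priority number in the worksheet is the product of the severity, frequency, and detection ratings, and it is used to rank failure modes for attention. The sketch below recomputes and ranks the last three rows; the function and variable names are hypothetical. The first row is left out because its transcribed ratings (7, 3, 10) multiply to 210 rather than the 350 listed, and the source does not say which figure is off.

```python
def rpn(sev, freq, det):
    """Risk priority number: severity x frequency x detection rating."""
    return sev * freq * det


# (process step, SEV, FREQ, DET) taken from the worksheet rows.
rows = [
    ("Mount fuel gage", 7, 2, 2),
    ("Assemble function indicators", 8, 4, 2),
    ("Install warning lights", 9, 2, 2),
]

# Highest RPN first: these are the failure modes to address first.
ranked = sorted(rows, key=lambda row: rpn(*row[1:]), reverse=True)
for step, sev, freq, det in ranked:
    print(f"{step}: RPN {rpn(sev, freq, det)}")
```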

48

Samuel Keene

Why Testing Under Expected Operational Scenarios is Not Sufficient

49

Software Fault Injection

•A form of software testing
•Not statistical testing, not correctness proofs
•“What-if Game”
•The more you play, the more confident you can become that your software can deal with anomalous situations – Unanticipated Events
•Determines the consequences of incorrect code or input data
•Crash testing software
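The “what-if game” can be sketched as a small harness that feeds deliberately corrupted inputs to a unit and records whether it copes or crashes. Everything here is hypothetical: `parse_reading` stands in for the software under test, and the corruption set is just a starting list of anomalies.

```python
def parse_reading(raw):
    """Hypothetical unit under test: parse a sensor reading, rejecting bad input."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    # Reject NaN (never equal to itself) and infinities.
    if value != value or value in (float("inf"), float("-inf")):
        return None
    return value


def inject_faults(func, good_input, corruptions):
    """Apply each corruption to a known-good input and record the outcome."""
    results = {}
    for name, corrupt in corruptions.items():
        try:
            results[name] = ("ok", func(corrupt(good_input)))
        except Exception as exc:  # a crash is exactly what the game looks for
            results[name] = ("crash", type(exc).__name__)
    return results


# Each entry answers one "what if the input were ...?" question.
corruptions = {
    "empty string": lambda s: "",
    "missing value": lambda s: None,
    "garbled digits": lambda s: "12x.5",
    "not a number": lambda s: "nan",
    "overflow": lambda s: "1e999",
}

report = inject_faults(parse_reading, "12.5", corruptions)
for name, outcome in report.items():
    print(name, outcome)
```

Adding new corruptions is cheap, so the list can grow each time a field anomaly reveals a situation nobody anticipated.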

50


Capability (CMM)

Communication

• Collaborative Tools

• Common Vocabulary

• Team Focus (CFDT)

• GQM

Design Point Centering

• Optimization

• Robustness & Margin

• DOE

Understandability

• Program Structure

• Design Documentation

Process Improvement

• Metrics

• DPP

• Stat Analysis

Development Process

Six Sigma – World Class Process

Requirements & Interfaces

• KANO (M,S,D)

• Needs & Context

• Spiral Model

Operational Profile

Traceability

System Management

KPIV Variability

Hardware

• Cpk

• MSA (GR+R)

Application

• Environment

• Off-Nominal Modes

VOC


• Regression Cases

• Variational Testing

• Alpha - Beta

Product Verification

Developed by Sam Keene Jan 2006

51


Useful References

Draft Standard for Software Reliability Prediction IEEE_P_1633

[IEEE 90] Institute of Electrical and Electronics Engineers. IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. New York, NY: 1990.

CASRE model: http://www.openchannelfoundation.org/projects/CASRE_3.0

