Failure Mode Assumptions and Assumption Coverage David Powell.

Post on 05-Jan-2016






Key questions
– How may components fail? → Prevention strategies
– At what rate may they fail? → The amount of redundancy needed
– What are the important types of faults? → Types of redundancy needed
– What is the relation between dependability, redundancy and faults? → General FT design guidelines

An F-T Paradox/Dilemma

More faulty

More redundancy

More possibility of faults


Solution: Some Key Steps

Classify, quantify and verify the assumptions

Types of Failures

Single-user service
– Service Model
– Potential Errors

Multiple-user service
– Service Model
– Potential Errors

Single-user Service Model

Service items: $s_i$, $i = 1, 2, \ldots$

Value of $s_i$: $vs_i$

Observation time of $s_i$: $ts_i$

Service model: $s_i = \langle vs_i, ts_i \rangle$

An omniscient observer

Correctness Model

Service item $s_i$ is correct iff

$(vs_i \in SV_i) \wedge (ts_i \in ST_i)$

$SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$

Potential Errors

Arbitrary value error: $s_i : vs_i \notin SV_i$

Noncode error: $s_i : vs_i \notin CV$ ($CV$ defines a code)

Arbitrary timing error: $s_i : ts_i \notin ST_i$

Early timing error: $s_i : ts_i < \min(ST_i)$

Late timing error: $s_i : ts_i > \max(ST_i)$

Omission error: $s_i : ts_i = \infty$

Impromptu error: $s_i$ : ($vs_i$ = [symbol missing in source]) $\wedge$ ($ts_i$ = [symbol missing in source])
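The error classes above can be sketched as a small classifier for one service item. This is only an illustration under stated assumptions: the function name `classify`, the representation of $ST_i$ as an (earliest, latest) interval, and the encoding of an omitted item as an infinite observation time are choices made here, not taken from the slides. Impromptu errors are left out.

```python
import math

def classify(vs, ts, sv, st, cv=None):
    """Label the errors of one service item <vs, ts>.

    sv -- specified set of values SV_i
    st -- (earliest, latest) bounds of ST_i (assumed to be an interval)
    cv -- optional set of valid code words CV
    Assumption: an item that is never delivered is encoded as ts = math.inf.
    """
    if ts == math.inf:
        return {"omission"}
    errors = set()
    if vs not in sv:
        errors.add("arbitrary value")
        # A noncode error is a value error that falls outside the code itself.
        if cv is not None and vs not in cv:
            errors.add("noncode value")
    earliest, latest = st
    if ts < earliest:
        errors.add("early timing")
    elif ts > latest:
        errors.add("late timing")
    return errors

# A correct item yields an empty set of labels.
print(classify(2, 1.0, sv={1, 2, 3}, st=(0.0, 2.0)))
# A wrong value delivered after the deadline yields a value and a timing label.
print(classify(7, 3.5, sv={1, 2, 3}, st=(0.0, 2.0)))
```

Returning a set of labels rather than a single label reflects that one item can be wrong in both the value and the time domain.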

Multi-user Service Model

Service item: $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$

Service model: $\langle vs_i(u), ts_i(u) \rangle$, for all $i$, $u$

New issue: "consistency"

Correctness Model

– $vs_i(u)$: the value of service item $i$ on process $u$
– $vs_i$: the value of service item $i$
– $SV_i$: the specified set of values for service item $i$
– $ts_i(u)$: the observation time of service item $i$ on process $u$
– $ST_i(u)$: the range of specified observation times of service item $i$ on process $u$
– [symbol missing in source] with subscript $uv$: the time bound of related occurrences

Examples of Potential Errors

Consistent value error

Consistent timing error

Semi-consistent value error

Failure Mode Assumptions

An attempt to formalize the concept of an assumed failure mode by assertions on the sequences of service items delivered by a component

Examples of Value Error Assertions

No value errors occur ($V_{none}$)

$\forall i,\; vs_i \in SV_i$

The only value errors that occur are noncode value errors ($V_n$)

$\forall i,\; (vs_i \in SV_i) \vee (vs_i \notin CV)$

Arbitrary value errors can occur ($V_{arb}$)

$\forall i,\; (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
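The three value error assertions can be read as predicates over a whole sequence of service items. The sketch below assumes a trace is a list of `(vs_i, SV_i)` pairs; the function names and the sample data are hypothetical and only illustrate how each assertion is weaker than the one before it.

```python
def holds_v_none(items):
    """V_none: every delivered value lies in its specified set."""
    return all(vs in sv for vs, sv in items)

def holds_v_n(items, cv):
    """V_n: each value is either correct or detectably outside the code CV."""
    return all(vs in sv or vs not in cv for vs, sv in items)

def holds_v_arb(items):
    """V_arb: a tautology, so it holds for any sequence."""
    return True

CV = {0, 1, 2, 3}                        # hypothetical code
trace = [(1, {1}), (9, {2}), (3, {3})]   # second value is wrong and outside CV
undetectable = [(1, {1}), (3, {2})]      # second value is wrong but inside CV

print(holds_v_none(trace), holds_v_n(trace, CV), holds_v_arb(trace))
print(holds_v_n(undetectable, CV))
```

A trace that satisfies `holds_v_none` also satisfies `holds_v_n`, and every trace satisfies `holds_v_arb`, which mirrors the ordering of the assertions by severity.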

Examples of Timing Error Assertions

No timing error occurs ($T_{none}$)

The only timing errors are omission errors ($T_O$)

The only timing errors are late timing errors ($T_L$)

The only timing errors are early timing errors ($T_E$)

Arbitrary timing errors can occur ($T_{arb}$)

Permanent omission/crash ($T_p$)

Bounded omission degree ($T_{Bk}$)
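The slides name the last two assertions without defining them, so the following is a sketch under assumptions rather than the author's definition: the omission degree is taken to be the longest run of consecutive omissions, a crash is taken to mean that every item after the first omission is also omitted, and an omitted item is encoded as an infinite observation time. The check that delivered items are otherwise on time is deliberately left out.

```python
import math

OMITTED = math.inf   # assumption: an omitted item has an infinite observation time

def omission_degree(times):
    """Length of the longest run of consecutive omissions in the sequence."""
    longest = run = 0
    for ts in times:
        run = run + 1 if ts == OMITTED else 0
        longest = max(longest, run)
    return longest

def holds_t_bk(times, k):
    """Bounded omission degree: never more than k omissions in a row."""
    return omission_degree(times) <= k

def holds_t_p(times):
    """Permanent omission (crash): no item is delivered after an omission."""
    crashed = False
    for ts in times:
        if ts == OMITTED:
            crashed = True
        elif crashed:
            return False
    return True

times = [1.0, OMITTED, OMITTED, 4.0, OMITTED]
print(omission_degree(times))     # longest run of omissions in this trace
print(holds_t_bk(times, k=2))     # bound of 2 is respected
print(holds_t_p(times))           # an item arrives after an omission
```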

Timing Error Implications

Failure Mode Assertions (FMA)

A complete FMA entails an assertion on errors occurring in both the value and time domains

By taking the Cartesian product of the two domains, we get a family of FMAs
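As a sketch of how the family arises, the Cartesian product can be enumerated directly. The two lists below contain only the example assertions named on the earlier slides, written as plain strings; the complete family in the underlying paper may contain more or differently named members.

```python
from itertools import product

# Example assertions from the slides; the string labels are illustrative only.
VALUE_ASSERTIONS = ["V_none", "V_n", "V_arb"]
TIMING_ASSERTIONS = ["T_none", "T_O", "T_L", "T_E", "T_arb", "T_p", "T_Bk"]

# Every complete FMA pairs one value assertion with one timing assertion.
FMA_FAMILY = list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))

print(len(FMA_FAMILY))   # 3 value assertions x 7 timing assertions = 21
print(FMA_FAMILY[0])     # ('V_none', 'T_none'), the strongest combination
print(FMA_FAMILY[-1])
```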

FMA Implication Graph

So what?

The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity

Assumption Coverage

Establishing a link between the assumed component failure mode and system dependability
– The design of an FT system relies on the assumptions it makes
– The dependability of an FT system is related to the failure modes it assumes


Components may fail

They may fail in a bad way, which leads to a violation of the assumptions of the system

The system, in turn, can fail

Question: to what degree does a component FMA prove to be true in the real system?

The Coverage of the Assumption

$P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$

$P(V_{arb} \wedge T_{arb}) = 1$

$P(V_{none} \wedge T_{none}) = 0$

Coverage of an FT system

$\mathit{PS}(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
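The product above can be evaluated numerically to see the trade-off it expresses. The figures below are invented purely for illustration and do not come from the slides; `system_coverage` is a hypothetical helper name.

```python
def system_coverage(p_processing_given_x, p_x):
    """PS(X) = Pr{correct error processing | X} * Pr{X | component failed}."""
    for p in (p_processing_given_x, p_x):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing_given_x * p_x

# Hypothetical numbers: a strong assumption is easy to process correctly
# but is only true for 95% of component failures.
strong = system_coverage(p_processing_given_x=0.999, p_x=0.95)

# The arbitrary-failure assumption always holds (coverage 1), but error
# processing under it is assumed here to succeed slightly less often.
weak = system_coverage(p_processing_given_x=0.99, p_x=1.0)

print(strong, weak)
```

With these made-up values the weaker assumption gives the higher overall coverage, which is the kind of effect the assumption coverage term is meant to expose.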

Influence of Assumption Coverage on System


A Case Study

The System

– A system of n processors
– Connected via a unidirectional message-passing bus
– Each processor carries out the same computation steps
– The result of each processing step is communicated to all other processors
– Each process has a decision function (DF)
– The DF is applied to the results received from other processors
– …
– Each processor and its associated bus is viewed as a single component

Fail-Silent Processor-bus

A fail-silent processor
– Only has semi-consistent value errors
– Always produces messages on time
– Or ceases to produce messages forever
– If a message is delivered to a processor, it is delivered to all processors with a consistent fixed delay

Fail-Consistent Processor Bus

Only semi-consistent value errors may occur

Faulty processors may send erroneous values

Consistent timing errors may occur

Fail-Uncontrolled Processor Bus

Arbitrary timing errors

Arbitrary value errors

Implications of Assumption Coverage

Failure mode relations

Coverage relations

Dependability Expressions From Markov Models

$r = e^{-\lambda t}$

$\lambda$ = failure rate

A Life-critical Application

System reliability objective: $R > 1 - 10^{-9}$ over 10 hours

Single processor reliability:
– $r = e^{-\lambda t}$
– $1/\lambda$ = 5 years
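The gap between a single processor and the system objective can be checked with a few lines of arithmetic. The sketch assumes a 365-day year when converting the 5-year mean time to failure into hours; the variable names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24          # assumption: 365-day year
mttf_hours = 5 * HOURS_PER_YEAR    # 1/lambda = 5 years
lam = 1.0 / mttf_hours             # failure rate per hour
mission_hours = 10.0

# r = exp(-lambda * t); expm1 keeps precision for the tiny complement 1 - r.
r = math.exp(-lam * mission_hours)
unreliability = -math.expm1(-lam * mission_hours)
target_unreliability = 1e-9

print(f"lambda = {lam:.3e} per hour")
print(f"1 - r  = {unreliability:.3e}")
print(f"ratio to target = {unreliability / target_unreliability:.1e}")
```

A single processor fails within the 10-hour mission with probability of roughly 2.3e-4, several orders of magnitude above the 1e-9 objective, which is why redundancy and its coverage matter in this application.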

A Money-Critical Application

It is about the availability of the system rather than its reliability

Please look at the paper for more details

Unavailability vs. Coverage


A formalism for describing component failure modes

Multiplicity of value and timing errors

The notion of assumption coverage

The relation between dependability, availability and assumption coverage

Thank you