Post on 26-Jul-2018
Software Safety
• Safety is not the same as correctness
• correctness = theorem-proving, model-checking, abstract interpretation, etc.
• safety = analysis of product + context, response in the presence of failure (internal or external)
• correctness and safety are orthogonal concepts
• verifying safety should ideally be cheaper than verifying full correctness.
• Safety is not compositional
• Examples of safety techniques:
• multiple identical redundancy (replication) - typically for hardware
• multiple diverse (non-identical) implementations - for hardware and software
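A classic instance of identical redundancy is 2-out-of-3 voting over replicated components. A minimal sketch (the `vote` function and its shape are my own illustration, not from the slides):

```python
# Hypothetical sketch: 2-out-of-3 majority voting over redundant inputs,
# one classic way that identical redundancy masks a single failure.
from collections import Counter

def vote(a, b, c):
    """Return the majority value of three redundant readings.

    If all three disagree, the failure is detectable but not maskable,
    so we signal it rather than guess.
    """
    winner, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return winner
    raise ValueError("no majority: detectable but unmaskable failure")

print(vote(42, 42, 17))  # one faulty reading is outvoted
```

Note that voting only masks value failures it can outvote; a common-mode fault hitting all replicas defeats it, which is one motivation for the diverse implementations mentioned above.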
Motivation
• ‘Design-in’ safety, at the architectural level
• identify hazards
• identify failures that could trigger a hazard
• demonstrate that failures are caught or mitigated
• certify the whole system
• Issues:
• huge cost (safety analysis is manual: labour-intensive)
• non-modular (safety of system, not components)
• incremental certification (build product in stages)
• impact of late design changes?
• product-line families
• need to capture variability of components and their composition
Contribution
• Given a modular product design (architecture),
• Define a modular, compositional, notation and analysis for failure behaviours.
• Tackle a small portion of the problem first
• Calculate the system failure behaviour automatically from the composition of its parts.
• reduces the cost of prediction ...
• both in initial design, and in estimating the impact of a change
RTN communication protocols
• The four connector types, classified by their read/write semantics:

                                    destructive write    non-destructive write
                                    (non-blocking)       (blocking)
  destructive read (blocking)       signal               channel
  non-destructive read
  (non-blocking)                    pool                 constant
Modelling faults
• Identify all ‘interesting’ fault types (following HAZOP/SHARD guide words):
• value faults:
• stale value
• detectably incorrect
• undetectably incorrect
• timing faults:
• early
• late
• message sequence faults
• omission
• commission
• This set of fault types is not fixed - tailor to specific contextual requirements
Propagation and Transformation
How does a component deal with a failure on input? (* = no failure)
• * → late (source: the component introduces a failure)
• early → * (sink: the component masks the failure)
• omission → omission (propagate: the failure passes through unchanged)
• late → stale value (transform: one failure type becomes another)
Fault Propagation and Transformation
• an FPTC expression is a collection of individual transform clauses (pattern→result)
• each clause expresses one transformation behaviour
• combination of clauses gives full expressivity
• the language encompasses more complex patterns: variables, wildcards, sets of faults
• checks required to ensure pattern is comprehensive, yet no overlaps
• Example clauses:
• * → Omission    _ → *
• (*,*,Value) → *    (*,Value,*) → *    (Value,*,*) → *    (x,y,z) → {x,y,z}
• Late → Value    Early → *    Omission → Value    Commission → *    v → v
• (Late,Value) → (Value,*)    (_,Value) → (*,*)    (x,y) → (y,{x,y})
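The clause semantics above (literals, the `_` wildcard, and variables bound on the left and substituted on the right) can be sketched in Python. This is a minimal illustration, not the paper's tooling; the fault names, single-letter-variable convention, and function names are assumptions:

```python
# A sketch of FPTC clause matching: literals like "late" must match exactly,
# "_" matches anything, and lowercase single-letter names are variables that
# are bound on the left and substituted on the right. "*" is the literal
# "no failure" token, so it matches only "*".

def match(pattern, faults, env):
    """Try to unify a tuple pattern with a tuple of input faults."""
    for p, f in zip(pattern, faults):
        if p == "_":                      # wildcard: matches any fault
            continue
        if len(p) == 1 and p.islower():   # variable: bind, or check binding
            if env.setdefault(p, f) != f:
                return None
        elif p != f:                      # literal (including "*")
            return None
    return env

def apply_clauses(clauses, faults):
    """Apply the first clause whose pattern unifies with the input faults."""
    for pattern, result in clauses:
        env = match(pattern, faults, {})
        if env is not None:
            # substitute variable bindings into the right-hand side
            return tuple(env.get(r, r) for r in result)
    raise ValueError("no matching clause: pattern set is not comprehensive")

clauses = [
    (("late", "value"), ("value", "*")),
    (("_", "value"), ("*", "*")),
    (("x", "y"), ("y", "x")),
]
print(apply_clauses(clauses, ("late", "value")))     # → ('value', '*')
print(apply_clauses(clauses, ("early", "omission"))) # → ('omission', 'early')
```

The final `ValueError` is exactly the comprehensiveness check the slide mentions: every possible input fault tuple should be covered by some clause, and the overlap check would additionally require that at most one clause matches.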
Behaviour of comms protocols

signal:
• early → *
• omission → late
• commission → value

channel:
• early → *
• omission → late
• commission → late
• * → late

pool:
• early → *
• omission → stale value
• commission → *
• late → stale value
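Assuming the signal/channel/pool grouping read off the slide above, the connector behaviours can be encoded as simple lookup tables. An illustrative encoding, not the paper's tooling; any fault not mentioned by a table is propagated unchanged:

```python
# Per-protocol FPTC clauses as lookup tables; "*" means "no failure".
PROTOCOL_FPTC = {
    "signal":  {"early": "*", "omission": "late", "commission": "value"},
    "channel": {"early": "*", "omission": "late", "commission": "late",
                "*": "late"},
    "pool":    {"early": "*", "omission": "stale value", "commission": "*",
                "late": "stale value"},
}

def protocol_response(protocol, fault):
    """Transform a fault through a connector; unmentioned faults propagate."""
    table = PROTOCOL_FPTC[protocol]
    return table.get(fault, fault)

print(protocol_response("pool", "omission"))  # → stale value
print(protocol_response("channel", "*"))      # → late
```

A pool turning omission into a stale value is the intuition to keep in mind: the reader does not block, so a missing write just means the old datum is re-read.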
Automated analysis
• Architecture is a graph
• nodes have propagation/transformation expressions
• edges simply connect nodes
• Regard faults as tokens introduced by potential sources of errors
• Push the tokens round the graph network
• using FPTC expressions to remove, propagate, or transform them
• Collect sets of all possible tokens on each edge
• Calculate the fixpoint (i.e. maximal fault sets)
Data-Flow Equations
• For a component c with inputs 1..i..n and outputs 1..k..m:
• in(ci) = out(predecessor(ci))
• out(ck) = ∪tup rhsk(selecttup(transforms(c)))
• select(f0, ..., fi, ..., fn)(ts) = the t ∈ ts such that lhs(t) unifies with (f0, ..., fi, ..., fn)
• where tup ranges over permutations(in(c0), ..., in(cn)) - i.e. every combination of faults currently possible on the inputs
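The token-pushing fixpoint described above can be sketched in Python. The graph encoding, component names, and the one-output-per-node simplification are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the FPTC fixpoint: faults are tokens on edges, each node maps a
# tuple of input faults to an output fault, and we iterate until the fault
# sets on all edges stop growing (the maximal fault sets).
from itertools import product

def fixpoint(nodes, edges, injected):
    """nodes: name -> function from an input-fault tuple to an output fault.
    edges: list of (src, dst) pairs. injected: edge -> initially injected
    fault set. Every edge also starts with "*" (no failure)."""
    tokens = {e: set(injected.get(e, set())) | {"*"} for e in edges}
    preds = {n: [e for e in edges if e[1] == n] for n in nodes}
    changed = True
    while changed:
        changed = False
        for (src, dst) in edges:
            ins = preds[src]
            if not ins:          # a source node only emits its injected faults
                continue
            # try every combination of faults on the incoming edges
            for combo in product(*(tokens[e] for e in ins)):
                out = nodes[src](combo)
                if out not in tokens[(src, dst)]:
                    tokens[(src, dst)].add(out)
                    changed = True
    return tokens

# Toy pipeline: sensor --e1--> pool --e2--> controller
pool = lambda f: {"omission": "stale value", "late": "stale value",
                  "early": "*"}.get(f[0], f[0])
ctrl = lambda f: f[0]  # propagates everything unchanged
result = fixpoint({"sensor": lambda f: "*", "pool": pool, "ctrl": ctrl},
                  [("sensor", "pool"), ("pool", "ctrl")],
                  {("sensor", "pool"): {"omission"}})
print(sorted(result[("pool", "ctrl")]))  # → ['*', 'stale value']
```

Because clauses only ever add tokens drawn from a finite fault alphabet, the iteration is monotonic over finite sets and is guaranteed to terminate, even when the architecture graph contains cycles.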
Case Study: LRAAM launcher software
• Postulate fault injection by individual hardware components
• IMU: * → omission propagates to the fault set { *, late, stale value, detectable value }
• IMU: * → early propagates to { *, stale value, detectable value }
Case Study: LRAAM launcher software
• Helps to determine where to place fault-accommodation design patterns (towards reduction of failures)
• 1x IMU: * → omission propagates to { *, late, stale value, detectable value }
• 2x IMU: * → omission propagates to { *, stale value, detectable value } (duplicating the IMU eliminates the late failure)
Conclusions
• Architectural description must be modular
• identify components, comms connections, dependencies
• As one aspect of safety, we are interested in failure behaviours, rather than correct behaviour
• express response to failures on input as potentially different failures on output
• manual analysis of each component in all possible contexts
• can be done at design time, before code is written
• Automate the calculation of the whole-system response
• changes to the model only require a cheap re-run of the algorithm
• full impact of change can be assessed
• Aim: validation of the architecture - find bad design early
Future work
• Tracing semantics
• determine the causes of any fault on any edge
• find common causes of separate failures
• Probability/sensitivity
• add numbers representing likelihood of failure injection and likelihoods of particular propagation/transformations
• calculate probability of hazardous system behaviour arising
• requires tracing / common-cause analysis for accuracy
• certification requires low numbers e.g. 1 x 10^-9
• even if the absolute numbers are dubious, you can see whether particular architectures tend to reduce, preserve, or magnify risk
• similar to a QoS analysis
• Open (vs closed) diagrams - FPTC inference
Pattern language design questions
• Default clauses
• e.g. what happens in response to early if no clause mentions it?
• Multiple connections
• implicit multiplexing?
• tupling?
• late → (value,*,late)
• interaction with default clauses
• exponential growth in input clauses
• 6 failure modes, 5 inputs = (6+1)^5 = 16807 patterns
• permit variables, wildcards, sets
* → late    omission → late    value → early