Post on 26-Jul-2018
Software Safety
• Safety is not the same as correctness
• correctness = theorem-proving, model-checking, abstract interpretation, etc.
• safety = analysis of product + context, response in the presence of failure (internal or external)
• correctness and safety are orthogonal concepts
• verifying safety should ideally be cheaper than verifying full correctness.
• Safety is not compositional
• Examples of safety techniques:
• multiple identical redundancy (replication) - typically for hardware
• multiple diverse (non-identical) implementations - for hardware and software
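A classic instance of identical redundancy is 2-out-of-3 voting over replicated components. A minimal sketch (the `vote` function and its shape are my own illustration, not from the slides):

```python
# Hypothetical sketch: 2-out-of-3 majority voting over redundant inputs,
# one classic way that identical redundancy masks a single failure.
from collections import Counter

def vote(a, b, c):
    """Return the majority value of three redundant readings.

    If all three disagree, the failure is detectable but not maskable,
    so we signal it rather than guess.
    """
    winner, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return winner
    raise ValueError("no majority: detectable but unmaskable failure")

print(vote(42, 42, 17))  # one faulty reading is outvoted
```

Note that voting only masks value failures it can outvote; a common-mode fault hitting all replicas defeats it, which is one motivation for the diverse implementations mentioned above.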
Motivation
• ‘Design-in’ safety, at the architectural level
• identify hazards
• identify failures that could trigger a hazard
• demonstrate that failures are caught or mitigated
• certify the whole system
• Issues:
• huge cost (safety analysis is manual: labour-intensive)
• non-modular (safety of system, not components)
• incremental certification (build product in stages)
• impact of late design changes?
• product-line families
• need to capture variability of components and their composition
Contribution
• Given a modular product design (architecture),
• Define a modular, compositional, notation and analysis for failure behaviours.
• Tackle a small portion of the problem first
• Calculate the system failure behaviour automatically from the composition of its parts.
• reduces the cost of prediction ...
• both in initial design, and in estimating the impact of a change
RTN communication protocols
• The four connector types, classified by their read/write semantics:

                                    destructive write    non-destructive write
                                    (non-blocking)       (blocking)
  destructive read (blocking)       signal               channel
  non-destructive read
  (non-blocking)                    pool                 constant
Modelling faults
• Identify all ‘interesting’ fault types (following HAZOP/SHARD guide words):
• value faults:
• stale value
• detectably incorrect
• undetectably incorrect
• timing faults:
• early
• late
• message sequence faults
• omission
• commission
• This set of fault types is not fixed - tailor to specific contextual requirements
Propagation and Transformation
How does a component deal with a failure on input? (* = no failure)
• * → late (source: the component introduces a failure)
• early → * (sink: the component masks the failure)
• omission → omission (propagate: the failure passes through unchanged)
• late → stale value (transform: one failure type becomes another)
Fault Propagation and Transformation
• an FPTC expression is a collection of individual transform clauses (pattern→result)
• each clause expresses one transformation behaviour
• combination of clauses gives full expressivity
• the language encompasses more complex patterns: variables, wildcards, sets of faults
• checks required to ensure pattern is comprehensive, yet no overlaps
• Example clauses:
• * → Omission    _ → *
• (*,*,Value) → *    (*,Value,*) → *    (Value,*,*) → *    (x,y,z) → {x,y,z}
• Late → Value    Early → *    Omission → Value    Commission → *    v → v
• (Late,Value) → (Value,*)    (_,Value) → (*,*)    (x,y) → (y,{x,y})
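The clause semantics above (literals, the `_` wildcard, and variables bound on the left and substituted on the right) can be sketched in Python. This is a minimal illustration, not the paper's tooling; the fault names, single-letter-variable convention, and function names are assumptions:

```python
# A sketch of FPTC clause matching: literals like "late" must match exactly,
# "_" matches anything, and lowercase single-letter names are variables that
# are bound on the left and substituted on the right. "*" is the literal
# "no failure" token, so it matches only "*".

def match(pattern, faults, env):
    """Try to unify a tuple pattern with a tuple of input faults."""
    for p, f in zip(pattern, faults):
        if p == "_":                      # wildcard: matches any fault
            continue
        if len(p) == 1 and p.islower():   # variable: bind, or check binding
            if env.setdefault(p, f) != f:
                return None
        elif p != f:                      # literal (including "*")
            return None
    return env

def apply_clauses(clauses, faults):
    """Apply the first clause whose pattern unifies with the input faults."""
    for pattern, result in clauses:
        env = match(pattern, faults, {})
        if env is not None:
            # substitute variable bindings into the right-hand side
            return tuple(env.get(r, r) for r in result)
    raise ValueError("no matching clause: pattern set is not comprehensive")

clauses = [
    (("late", "value"), ("value", "*")),
    (("_", "value"), ("*", "*")),
    (("x", "y"), ("y", "x")),
]
print(apply_clauses(clauses, ("late", "value")))     # → ('value', '*')
print(apply_clauses(clauses, ("early", "omission"))) # → ('omission', 'early')
```

The final `ValueError` is exactly the comprehensiveness check the slide mentions: every possible input fault tuple should be covered by some clause, and the overlap check would additionally require that at most one clause matches.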
Behaviour of comms protocols

signal:
• early → *
• omission → late
• commission → value

channel:
• early → *
• omission → late
• commission → late
• * → late

pool:
• early → *
• omission → stale value
• commission → *
• late → stale value
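Assuming the signal/channel/pool grouping read off the slide above, the connector behaviours can be encoded as simple lookup tables. An illustrative encoding, not the paper's tooling; any fault not mentioned by a table is propagated unchanged:

```python
# Per-protocol FPTC clauses as lookup tables; "*" means "no failure".
PROTOCOL_FPTC = {
    "signal":  {"early": "*", "omission": "late", "commission": "value"},
    "channel": {"early": "*", "omission": "late", "commission": "late",
                "*": "late"},
    "pool":    {"early": "*", "omission": "stale value", "commission": "*",
                "late": "stale value"},
}

def protocol_response(protocol, fault):
    """Transform a fault through a connector; unmentioned faults propagate."""
    table = PROTOCOL_FPTC[protocol]
    return table.get(fault, fault)

print(protocol_response("pool", "omission"))  # → stale value
print(protocol_response("channel", "*"))      # → late
```

A pool turning omission into a stale value is the intuition to keep in mind: the reader does not block, so a missing write just means the old datum is re-read.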
Automated analysis
• Architecture is a graph
• nodes have propagation/transformation expressions
• edges simply connect nodes
• Regard faults as tokens introduced by potential sources of errors
• Push the tokens round the graph network
• using FPTC expressions to remove, propagate, or transform them
• Collect sets of all possible tokens on each edge
• Calculate the fixpoint (i.e. maximal fault sets)
Data-Flow Equations
• For a component c with inputs 1..i..n and outputs 1..k..m:
• in(ci) = out(predecessor(ci))
• out(ck) = ∪tup rhsk(selecttup(transforms(c)))
• select(f0, ..., fi, ..., fn)(ts) = the t ∈ ts such that lhs(t) unifies with (f0, ..., fi, ..., fn)
• where tup ranges over permutations(in(c0), ..., in(cn)) - i.e. every combination of faults currently possible on the inputs
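The token-pushing fixpoint described above can be sketched in Python. The graph encoding, component names, and the one-output-per-node simplification are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the FPTC fixpoint: faults are tokens on edges, each node maps a
# tuple of input faults to an output fault, and we iterate until the fault
# sets on all edges stop growing (the maximal fault sets).
from itertools import product

def fixpoint(nodes, edges, injected):
    """nodes: name -> function from an input-fault tuple to an output fault.
    edges: list of (src, dst) pairs. injected: edge -> initially injected
    fault set. Every edge also starts with "*" (no failure)."""
    tokens = {e: set(injected.get(e, set())) | {"*"} for e in edges}
    preds = {n: [e for e in edges if e[1] == n] for n in nodes}
    changed = True
    while changed:
        changed = False
        for (src, dst) in edges:
            ins = preds[src]
            if not ins:          # a source node only emits its injected faults
                continue
            # try every combination of faults on the incoming edges
            for combo in product(*(tokens[e] for e in ins)):
                out = nodes[src](combo)
                if out not in tokens[(src, dst)]:
                    tokens[(src, dst)].add(out)
                    changed = True
    return tokens

# Toy pipeline: sensor --e1--> pool --e2--> controller
pool = lambda f: {"omission": "stale value", "late": "stale value",
                  "early": "*"}.get(f[0], f[0])
ctrl = lambda f: f[0]  # propagates everything unchanged
result = fixpoint({"sensor": lambda f: "*", "pool": pool, "ctrl": ctrl},
                  [("sensor", "pool"), ("pool", "ctrl")],
                  {("sensor", "pool"): {"omission"}})
print(sorted(result[("pool", "ctrl")]))  # → ['*', 'stale value']
```

Because clauses only ever add tokens drawn from a finite fault alphabet, the iteration is monotonic over finite sets and is guaranteed to terminate, even when the architecture graph contains cycles.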
Case Study: LRAAM launcher software
• Postulate fault injection by individual hardware components
• IMU: * → omission propagates to the fault set { *, late, stale value, detectable value }
• IMU: * → early propagates to { *, stale value, detectable value }
Case Study: LRAAM launcher software
• Helps to determine where to place fault-accommodation design patterns (towards reduction of failures)
• 1x IMU: * → omission propagates to { *, late, stale value, detectable value }
• 2x IMU: * → omission propagates to { *, stale value, detectable value } (duplicating the IMU eliminates the late failure)
Conclusions
• Architectural description must be modular
• identify components, comms connections, dependencies
• As one aspect of safety, we are interested in failure behaviours, rather than correct behaviour
• express response to failures on input as potentially different failures on output
• manual analysis of each component in all possible contexts
• can be done at design time, before code is written
• Automate the calculation of the whole-system response
• changes to the model only require a cheap re-run of the algorithm
• full impact of change can be assessed
• Aim: validation of the architecture - find bad design early
Future work
• Tracing semantics
• determine the causes of any fault on any edge
• find common causes of separate failures
• Probability/sensitivity
• add numbers representing likelihood of failure injection and likelihoods of particular propagation/transformations
• calculate probability of hazardous system behaviour arising
• requires tracing / common-cause analysis for accuracy
• certification requires low numbers e.g. 1 x 10^-9
• even if the absolute numbers are dubious, you can see whether particular architectures tend to reduce, preserve, or magnify risk
• similar to a QoS analysis
• Open (vs closed) diagrams - FPTC inference
Pattern language design questions
• Default clauses
• e.g. what happens in response to early if no clause mentions it?
• Multiple connections
• implicit multiplexing?
• tupling?
• late → (value,*,late)
• interaction with default clauses
• exponential growth in input clauses
• 6 failure modes, 5 inputs = (6+1)^5 = 16807 patterns
• permit variables, wildcards, sets
* → late    omission → late    value → early