Safety Critical Computing
Osman Kaan EROL
What is SAFETY?
Safety-related system
• Safety is a property of a system: that it will
not endanger human life or the
environment
• A safety-related system is one by which
the safety of equipment or plant is assured
• Safety-critical system is a synonym for a
safety-related system, but in some cases it
suggests a system of high criticality
levels of integrity
• The implications of failure vary greatly
between applications, and this leads to the
concept of levels of integrity that reflect the
importance of correct operation.
• Once a project has been assigned a safety
integrity level, this will determine the
methods of design and implementation
used for the system.
Software and safety
• All safety-critical applications depend on software, but software is among the
most imperfect of the products of modern technology
• However, software alone cannot provide such assurance, as its correct operation is dependent on the system hardware.
Safety aspects
• Primary safety: Includes dangers from electrocution or electric shock, and from burns or fire caused directly by the hardware
• Functional safety: Covers aspects concerned with equipment that is directly controlled by the computer, and is related to the correct functioning of the hardware and its software.
• Indirect safety: Relates to the indirect consequences of a computer failure or the production of incorrect information.
Disadvantages of computer-based
systems
• Their complexity: By their very nature, all
computer-based systems are complex.
• In complex digital devices such as
microprocessors the number of possible
failure modes is so large that it may be
considered to be infinite.
Development lifecycle model
• Requirements → hazards and risk analysis →
specification → architectural design → module
design → module construction and testing → system integration and testing → system
verification → system validation → certification
→ completed system
• Green phases emphasize a "top-down" approach to
design, while orange phases use a "bottom-up" approach to testing.
Verification & Validation
• Verification is the process of determining
that a system, or module, meets its
specification (applies to modules)
• Validation is the process of determining
that a system is appropriate for its
purpose. (system as a whole)
Fault, error, system failure
• A fault is a defect within a system
• An error is a deviation from required
operation of the system or subsystem
• A system failure occurs when the system
fails to perform its required function
Fault classes
• Random faults are associated with
hardware component failures
• Systematic faults arise from design. All
software faults come within this category,
as they are faults within the design of the software.
Fault management
• Fault avoidance techniques aim to prevent faults from entering the system during the design stage
• Fault removal methods attempt to find faults within a system before it enters service
• Fault detection techniques are used during service to detect faults
• Fault tolerance techniques are designed to allow the system to operate correctly in the presence of faults.
Faults are inevitable! We must learn to manage them...
Systems requirements
• Reliability
• Availability
• Failsafe operation
• System integrity
• Data integrity
• System recovery
• Maintainability
• Dependability
Reliability
• is the probability of a component, or
system, functioning correctly over a given
period of time under a given set of
operating conditions
Availability
• The availability of a system is the
probability that the system will be
functioning correctly at any given time
Failsafe operation
• Possessing a set of output states that can be
identified as being ‘safe’.
• An example: the failsafe state of a railway
signalling system. All signalling lights are red and
all points are locked in their previous positions,
bringing all the trains safely to a halt. If not...
• Primary cause(s) of the accident cited: wiring defect
Integrity
• The integrity of a system is its ability to
detect faults in its own operation and to
inform a human operator
Data integrity
• Data integrity is the ability of a system to
prevent damage to its own database and
to detect, and possibly correct, errors that
do occur
System recovery
• It is vital that a failed system detects the
failure and restarts itself quickly.
• Depending on the nature of the
application, the recovery process may
need to determine the current status of the
system, to take appropriate action to
continue operation, and to maintain safety.
Maintainability
• Maintenance is the action taken to retain a system in, or return a system to, its designed operating condition.
• Maintainability is the ability of a system to be maintained.
• Other terms related to this topic:
- Mean time to repair (MTTR)
- Maintenance-induced failures
Dependability
• Dependability is a property of a system
that justifies placing one’s reliance on it.
• It covers considerations of reliability,
availability, safety, maintainability and
other issues of importance in critical
systems
Conflict between system
requirements
• An obvious example: the desire for high
performance versus the desire for low cost.
• Other topics to be covered:
- Global optimization
- Multi-objective decision support systems
- Pareto front
- Coverage of the Pareto front, or diversity in
the archive list
Safety requirements
• Identification of the hazards associated with the system
• Classification of these hazards
• Determination of methods for dealing with the hazards
• Assignment of appropriate reliability and availability requirements
• Determination of an appropriate safety integrity level
• Specification of development methods appropriate to this integrity level
Safety case
• A safety case describes the design and
assessment techniques used in the
development of the system.
• It is often referred to as a:
- Safety argument
- Safety justification
- Safety assessment report
• The provision of such a document is accepted
as good engineering practice
Hazard Analysis
• A hazard is a situation in which there is actual or
potential danger to people or to the environment.
• Among the most widely used hazard analysis
techniques are:
- failure modes and effects analysis (FMEA)
- failure modes, effects and criticality analysis (FMECA)
- event tree analysis
- fault tree analysis
Failure modes and effects analysis
• FMEA for a microswitch (unit: tool guard switch):
- Ref 1. Failure mode: open-circuit contacts. Possible causes: (a) faulty component, (b) excessive current, (c) extreme temperature. Local effect: failure to detect tool guard in place. System effect: prevents use of machine (system fails safe). Remedial action: select a reliable switch; rigid quality control on switch procurement.
- Ref 2. Failure mode: short-circuit contacts. Possible causes: (a) faulty component, (b) excessive current. Local effect: system incorrectly senses guard to be closed. System effect: allows machine to be used when guard is absent (dangerous failure). Remedial action: modify software to detect switch failure and take appropriate action.
- Ref 3. Failure mode: excessive switch-bounce. Possible causes: (a) ageing effects, (b) prolonged high currents. Local effect: slight delay in sensing state of guard. System effect: negligible. Remedial action: ensure hardware design prevents excessive current through switch.
Hazard and operability studies
• HAZOP example (item 1, inter-connection: sensor output, attribute: voltage):
- Guide word "No". Cause: PSU, sensor or cable fault. Consequence: lack of sensor signal is detected and the system shuts down.
- Guide word "More". Cause: sensor fault. Consequence: temperature reading too high, resulting in a decrease in plant efficiency. Recommendation: consider use of a duplicate sensor.
- Guide word "Less". Cause: sensor mounted incorrectly, or sensor failure. Consequence: temperature reading too low, which could result in overheating and possible plant failure. Recommendation: as above.
Fault tree analysis
• Tree components (symbols):
- Fault event resulting from other events
- Basic event, taken as an input
- Fault event not fully traced to its source; it is taken as
an input but its causes may be unknown
- The triangle symbol is used to link trees: the 'in' symbol
indicates an input from another tree (on another sheet),
and the 'out' symbol appears in place of the 'top event'
to indicate that this point forms the input to another tree
- AND gate: the output event occurs if ALL the inputs occur
- OR gate: the output event occurs if ANY of its inputs
occur, either alone or in combination
- Inhibit gate: the control condition determines whether
the input event appears at the output
[Figure: two example fault trees. The first has top event 'Loss of heating', with contributing events 'loss of fuel supply' (from 'loss of solid fuel' and 'loss of liquid fuel') and 'loss of electricity'. The second has top event 'Warning lamp does not operate', with contributing events 'primary lamp failure' and 'no voltage applied to lamp' (from 'primary cable or connector failure' and 'battery supply failure').]
Risk analysis
• An accident is an unintended event or sequence of events that causes death, injury, environmental or material damage.
• An incident is an unintended event, or sequence of events, that does not result in loss but, under different circumstances, has the potential to do so.
• Risk is a combination of the frequency or probability of a specified hazardous event, and its consequence.
• Example: risk class (also risk level or risk factor) = severity (e.g. number of deaths) × frequency (e.g. failures per year)
Severity categories for civil aircraft
• Catastrophic: failure condition which would prevent continued safe flight
• Hazardous: failure conditions which would result in: a large reduction in safety margins; physical distress or a higher workload; adverse effects on occupants, including serious or potentially fatal injuries
• Major: Failure conditions which would reduce the capability of the aircraft
• Minor: Failure conditions which would not significantly reduce the capability of the aircraft
• No effect: Failure conditions which do not affect the operational capability of the aircraft
DO-178B objectives according to
severity classes
• Level A (Catastrophic failure condition): 66 objectives, 25 with independence
• Level B (Hazardous): 65 objectives, 14 with independence
• Level C (Major): 57 objectives, 2 with independence
• Level D (Minor): 28 objectives, 2 with independence
• Level E (No effect): 0 objectives, 0 with independence
Severity conditions for military
systems
• Catastrophic: Multiple deaths
• Critical: a single death, and/or multiple severe injuries or severe occupational illnesses
• Marginal: a single severe injury or occupational illness, and/or multiple minor injuries or minor occupational illnesses
• Negligible: at most a single minor injury or minor occupational illness
Hazard probability classes for
aircraft systems
• Probability per operating hour (European class / US class):
- 10E0 to 10E-2: Frequent / Probable
- 10E-3 to 10E-4: Reasonably frequent / Probable
- 10E-5 to 10E-6: Remote / Improbable
- 10E-7 to 10E-8: Extremely remote / Improbable
- 10E-9 and below: Extremely improbable / Extremely improbable
• Similar definitions are used in Europe (JAR, 1994) and in the US (FAR, 1993)
Allowable probability for civil
aircraft systems
• Category, severity of effect, and maximum probability per operating hour:
- Catastrophic: multiple deaths, usually with loss of the aircraft: 10E-9
- Hazardous: large reduction in safety margins, serious injury or death of a small number of occupants: 10E-7 to 10E-8
- Major: significant reduction in safety margins, passenger injuries: 10E-5 to 10E-6
- Minor: operating limitations: 10E-3 to 10E-4
- Nuisance: 10E-2
- Normal: 10E0 to 10E-1
Safety integrity
• Safety integrity is the likelihood of a safety-
related system satisfactorily performing
the required safety functions under all the
stated conditions within a stated period of
time
Assignment of integrity levels
• Severity of hazardous event + frequency of hazardous
event → risk classification → integrity classification →
hardware integrity classification, systematic integrity
classification and software integrity classification
Fault tolerance
• Faults can be distinguished by their:
- nature (random hardware faults, or systematic system/design faults),
- duration (permanent, transient or intermittent), and
- extent (localized or global).
• Transient faults appear and then disappear after a short time. Alpha-particle strikes on semiconductor memory chips can cause such errors, changing one or more bits within the memory plane.
• Intermittent faults appear, disappear and then reappear at some later time. Poor solder joints or corrosion on connector contacts can cause this class of faults, as can electrical interference. Many permanent faults appear to be intermittent: a software synchronization fault, for example, is permanent, as its code is always present, but its execution will only sometimes result in an error.
Hardware faults
• Possible hardware faults include:
- a break in a conductor,
- a short (or bridge) between various nodes of the circuit.
• Several fault models are used to represent different types of fault:
- the single-stuck-at fault model,
- the bridging fault model,
- the stuck-open fault model.
The single-stuck-at model
• The model assumes that a fault within a
module will cause it to respond as if one of
its inputs or outputs is stuck at a logic 1 or
a logic 0.
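As a minimal illustration (not from the source; the gate and function names are invented for this sketch), a single-stuck-at fault can be simulated by forcing one line of a logic gate to a fixed value and checking which test vectors expose it:

```python
def and_gate(a, b, stuck=None):
    """2-input AND gate; 'stuck' optionally forces one line to a fixed value.

    stuck is a (line, value) pair, e.g. ("a", 0) for input a stuck-at-0
    or ("out", 1) for the output stuck-at-1.
    """
    if stuck is not None:
        line, value = stuck
        if line == "a":
            a = value
        elif line == "b":
            b = value
        elif line == "out":
            return value
    return a & b

def detects(vector, fault):
    """A test vector detects a fault if faulty and fault-free outputs differ."""
    a, b = vector
    return and_gate(a, b) != and_gate(a, b, stuck=fault)

# Vector (1, 1) exposes any line stuck-at-0; (1, 0) exposes input b stuck-at-1.
assert detects((1, 1), ("a", 0))
assert detects((1, 0), ("b", 1))
assert not detects((0, 0), ("a", 0))  # this vector cannot reveal that fault
```

In this model, a complete single-stuck-at test set for a circuit is simply a set of vectors that, between them, detect all of the possible single faults.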
The bridging model
• A bridging or short-circuit fault occurs when two or more nodes of a circuit are accidentally joined together, forming a permanent fault. Such a fault may sometimes be represented by the single-stuck-at model. The result is an unintentional logic operation similar to a 'wired-AND' function (in positive logic, where a logic 0 can mask a logic 1) or a 'wired-OR' function (in negative logic).
• PCB carbonization due to excessive heat can also cause such faults.
The stuck-open model
• The fault occurs when both output
transistors of a CMOS gate are turned off
as a result of an internal open- or short-
circuit. This causes the output to be pulled
neither high nor low, producing an effect
similar to the high-impedance state of a
three-state TTL gate.
The use of fault models
• A circuit with N nodes has 2N potential single-
stuck-at faults. By applying appropriate
combinations of inputs, or test vectors, these can be detected.
• A circuit with N nodes could have 3^N − 1 multiple
stuck-at faults. Multiple faults do occur, because physical
defects can affect more than one node.
• A circuit with N nodes has C(N, M) possible
bridging faults involving M shorted nodes.
Software faults
• Software faults may take an almost
unlimited number of forms. Examples
include:
- software specification faults
- coding faults
- logical errors within calculations
- stack overflows or underflows
- use of uninitialized variables
Redundancy
• All forms of fault tolerance are achieved by
some form of redundancy, that is, the use
of some additional elements within the
system which would not be required in a
system that was free from all faults.
• The most commonly used form of
redundancy is simple triple modular
redundancy (TMR).
TMR
[Figure: TMR. A common input feeds modules 1, 2 and 3; a voting element combines their outputs to produce the system output.]
Forms of redundancy
• The TMR principle can be applied to:
- Hardware redundancy
- Software redundancy
- Information redundancy: parity bits,
checksums...
- Temporal (time) redundancy: the same
operation is repeated three times to detect
transient faults.
Design diversity
• Failures as a result of similar faults in different redundant modules are termed common-mode failures.
• Most attempts to deal with such problems rely on redundancy combined with some form of diversity.
• A TMR system whose redundant modules are designed and manufactured by different teams provides such diversity.
• However, design diversity would not provide protection against mistakes within the specification.
Fault detection techniques
• Functionality checking
• Consistency checking
• Signal comparison
• Checking pairs
• Information redundancy
• Instruction monitoring
• Loopback testing
• Watchdog timers
• Bus monitoring
• Power supply monitoring
Functionality checking
• Functionality checking: This involves the use of routines which check to see that the hardware of the system is functioning correctly.
• RAM checking: Perform a write then read a pattern and later repeat the process with the complement of the previous pattern. This test will fail in the abscence of a memory location due to the capacitance of the bus lines that hold the information for a while.
• In simple design, testing all locations is possible, but in most cases only a fraction of the memory space is tested.
• The processor-s can be checked by executing a sequence of calculations and comparing the results with known values. If theroutine contains a wide range of processing operations this testverifies the operation of a large part of the system, including the processor, sections of the memory and the bus system.
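The bus-capacitance caveat can be made concrete with a toy model, wholly invented for this sketch: a 'missing' location whose reads simply echo the last value driven onto the data bus. Writing an unrelated complement value between the write and the read-back exposes the fault that a naive read-after-write misses:

```python
class ToyRam:
    """Toy memory where one address may be absent: writes to it are lost and
    reads from it return the last value driven onto the data bus (held
    briefly by bus capacitance)."""
    def __init__(self, size, missing=None):
        self.cells = [0] * size
        self.missing = missing
        self.bus = 0

    def write(self, addr, value):
        self.bus = value
        if addr != self.missing:
            self.cells[addr] = value

    def read(self, addr):
        if addr != self.missing:
            self.bus = self.cells[addr]
        return self.bus              # absent cell just echoes the bus

def naive_check(ram, size, pattern=0x55):
    """Write then immediately read back: fooled by the bus capacitance."""
    for addr in range(size):
        ram.write(addr, pattern)
        if ram.read(addr) != pattern:
            return False
    return True

def complement_check(ram, size, pattern=0x55):
    """Disturb the bus with the complement before reading back."""
    for addr in range(1, size):
        ram.write(addr, pattern)
        ram.write(0, pattern ^ 0xFF)   # bus now holds the complement
        if ram.read(addr) != pattern:
            return False
    return True

assert naive_check(ToyRam(16, missing=5), 16)           # fault goes undetected
assert not complement_check(ToyRam(16, missing=5), 16)  # fault exposed
assert complement_check(ToyRam(16), 16)                 # healthy RAM passes
```

Real RAM tests (e.g. marching patterns) are more elaborate, but the principle is the same: never rely on a value that the bus itself could still be holding.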
Consistency checking
• Consistency checking uses some
knowledge of the nature of the information
within the system to test its validity. An
example of this form of testing is range
checking. This compares calculated or
stored values for a variable with
predefined values for its allowable range.
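A minimal sketch of range checking follows; the variable and its limits are invented for illustration:

```python
# Hypothetical allowable range for a coolant temperature reading, in °C.
COOLANT_TEMP_RANGE = (5.0, 95.0)

def range_check(value, allowed=COOLANT_TEMP_RANGE):
    """Return True if the value lies within its predefined allowable range."""
    low, high = allowed
    return low <= value <= high

assert range_check(40.0)          # plausible reading accepted
assert not range_check(120.0)     # implausible reading flagged as a fault
```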
Signal comparison
• In systems with redundancy it is possible
to check the signal at similar points in the
various modules to validate them. This
process is simpler if the modules are
identical, rather than of diverse design.
Checking pairs
• Checking pairs are effectively a special
case of the use of signal comparison. Here
identical modules are designed to allow a
comparison of multiple signals in an
attempt to detect any discrepancies. If the
modules produce identical signals it is
assumed that both are fault free (or both
are faulty)
Information redundancy
• Parity checking
• M-out-of-N codes
• Checksums
• Cyclic redundancy codes
• Error correcting codes
• Each of these techniques uses additional, redundant information (nothing is created from nothing)
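The simplest of these techniques, a single even-parity bit appended to a data word, can be sketched as follows:

```python
def parity_bit(word):
    """Even parity: the bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2

def encode(word):
    """Append the parity bit as the least significant bit."""
    return (word << 1) | parity_bit(word)

def check(coded):
    """Return True if the stored parity matches the data."""
    word = coded >> 1
    return parity_bit(word) == (coded & 1)

coded = encode(0b10110010)
assert check(coded)
assert not check(coded ^ 0b10)   # any single-bit error is detected
```

A single parity bit detects any odd number of bit errors but cannot correct them; the stronger schemes in the list above (CRCs, error-correcting codes) trade more redundant bits for detection and correction power.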
Instruction monitoring
• Normal operation of a processor involves the repeated fetching and execution of instructions; corruption may affect the operation code or the operand. The action taken by a processor in response to such an error varies greatly between devices: some processors immediately raise an exception, others simply fetch the next byte from memory.
• Processors that do not take appropriate action in response to unimplemented instructions must not be used in safety-critical systems.
Loopback testing
• This verifies that signals leaving one point of a circuit arrive at their destination unchanged.
• This is achieved by providing an independent return path for the signal back to its source, and by comparing the outgoing and return signals to ensure equivalence.
• It is used in applications such as serial communications, within the processor. Bus lines, for example, may be constructed as loops so that they leave the processor, go to all relevant nodes of the circuit and then return to the processor for verification.
• In multiboard systems a ‘daisy-chain’ arrangement may be used to ensure that all boards are present.
• The outward and return paths should never be adjacent, to prevent the possibility of a short-circuit.
Watchdog timers
• It is used to detect the 'crash' of a processor.
• A timer (WDT) is arranged to reset the system if it is allowed to time out; it is prevented from doing so by specific processor activity, such as periodically reloading a prescaler value that the WDT decrements. If left undisturbed, the WDT generates a reset.
• WDTs have limitations: following a system crash the processor will often continue to operate for some considerable time until reset. During this time the operation of the processor is unpredictable and potentially hazardous.
• The system may also crash in such a way that the WDT is still refreshed periodically, preventing its intervention.
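The reload/time-out behaviour can be sketched in software (a real WDT is a hardware counter; the class and timeout here are invented, and ticks are driven explicitly so the behaviour is deterministic):

```python
class Watchdog:
    """Toy watchdog: counts down each tick and demands a reset at zero."""
    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.counter = timeout_ticks

    def kick(self):
        """Called by normal processor activity to reload the prescaler."""
        self.counter = self.timeout

    def tick(self):
        """One timer tick; returns True when a reset must be generated."""
        self.counter -= 1
        return self.counter <= 0

wdt = Watchdog(timeout_ticks=3)
wdt.tick()
wdt.kick()           # a healthy main loop kicks the WDT in time
assert not wdt.tick()
assert not wdt.tick()
assert wdt.tick()    # a 'crashed' processor stops kicking: reset asserted
```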
Bus monitoring
• Each bus address is compared with the
allowable range for the program that is
being executed and any out-of-range
value will result in an error being reported
to the processor.
Power supply monitoring
• A well-designed power supply, together with overvoltage protection, can normally protect components from damage due to excessive supply voltages.
• More serious problems can arise when the supply voltage drops below that required for normal operation. This is inevitable when the system is first turned on and when it is turned off. It may also occur if the supply voltage fails or ‘dips’ during operation.
• A power supply monitor warns the processor when the supply voltage reaches a dangerous level, so that the processor can take emergency action.
• If the system must operate continuously, an uninterruptible power supply must be used.
Hardware fault tolerance
• Static redundancy:
- TMR
- N-modular redundancy
• Dynamic redundancy:
- Standby spares
- Self-checking pairs
• Hybrid redundancy:
- N-modular redundancy with spares
- Module synchronization and diversity in hardware redundancy
TMR with triplicated voting
[Figure: a TMR system with triplicated voting. Modules 1 to 3 share a common input; three voting elements produce outputs 1, 2 and 3.]
[Figure: a multistage TMR arrangement. Triplicated modules and voting elements in successive stages are linked by a 3x3 connection matrix; inputs 1 to 3 enter the first stage and the final voting elements produce the outputs.]
[Figure: an N-modular redundant system. Modules 1 to N share a common input; a single voting element produces the output.]
Truth table of a 3-input 1-bit voting element (majority vote):
- inputs (0, 0, 0) → output 0
- inputs (0, 0, 1) → output 0
- inputs (0, 1, 0) → output 0
- inputs (1, 0, 0) → output 0
- inputs (0, 1, 1) → output 1
- inputs (1, 0, 1) → output 1
- inputs (1, 1, 0) → output 1
- inputs (1, 1, 1) → output 1
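The truth table above is the 2-out-of-3 majority function, which can be implemented bit-wise over whole words (a sketch; the function name is invented):

```python
def vote(a, b, c):
    """Bit-wise 2-out-of-3 majority vote over three equal-width words."""
    return (a & b) | (a & c) | (b & c)

# A single corrupted module output is masked by the other two:
assert vote(0b1010, 0b1010, 0b0011) == 0b1010
assert vote(1, 1, 0) == 1
assert vote(0, 0, 1) == 0
```

Note that the voter masks a fault rather than detecting it; practical systems often add a disagreement detector alongside the voter so that failed modules can be reported.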
[Figure: a standby spare arrangement with two modules (extendable to N). A fault detector controls a switch that selects the output of the healthy module.]
[Figure: a self-checking pair. Two modules process the same input; a comparator produces the output and raises a 'failure detected' signal on disagreement.]
[Figure: a self-checking pair using software comparison. Two processor-plus-memory modules exchange results through a dual-port memory; each produces an output and a fail signal.]
[Figure: combining failure-detection signals using switches. Two self-checking pairs, each with its own comparator and fail signal, are switched to form a single output.]
Hybrid redundancy
• Hybrid redundancy uses a combination of
voting, fault detection and module
switching. Many techniques are used,
although most can be generalized as
some form of N-modular redundancy with
spares.
[Figure: N-modular redundancy with spares. Modules 1 to N feed a voter; a disagreement detector controls a switch that replaces failed modules with spares 1 to M. The voter's result forms the output.]
Software fault tolerance
• Has two meanings:
- the "tolerance of software faults" by
using various hardware techniques;
- the "tolerance of faults (hardware or software)
by the use of software", for which two common
methods are:
* N-version programming
* Recovery blocks
N-version programming
• Duplicating modules brings no benefit in terms of redundancy if the same software is used in both.
• To achieve redundancy, different versions of the software can be run sequentially on the same processor; given the same input data they should give the same results. Two versions are analogous to a self-checking pair; three versions plus a voter are analogous to TMR. In practice, often only two versions (N = 2) are used.
• N > 2 with multiple processors is used, for example, in the A330/340 aircraft, where the application is very critical
Recovery blocks
• The technique dates from the early 1970s.
• It uses some form of error detection to validate the operation of a software module; if an error is detected, an alternative routine is used.
• It is based on acceptance tests. For example, the calculation of a square root can be checked by squaring the result and comparing it with the input data; equality after reversing the operation validates the module.
• If the test fails, an alternative, redundant module is executed, followed by another acceptance test.
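The square-root example above can be sketched as a minimal recovery block; the deliberately wrong primary routine is invented to force the fallback path:

```python
import math

def acceptance_test(x, result, tol=1e-9):
    """Reverse the operation: squaring the result should recover the input."""
    return abs(result * result - x) <= tol * max(1.0, abs(x))

def faulty_sqrt(x):
    """Deliberately wrong primary routine, for demonstration only."""
    return x / 2

def recovery_block_sqrt(x):
    """Try each alternate in turn until one passes the acceptance test."""
    for routine in (faulty_sqrt, math.sqrt):   # primary, then alternate
        result = routine(x)
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

assert recovery_block_sqrt(9.0) == 3.0   # primary rejected, alternate accepted
```

A full implementation would also checkpoint the system state on entry and restore it before running an alternate, so that a faulty primary cannot leave side effects behind.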
Comparison of HW and SW
techniques
• N-version programming provides fault masking
in a manner similar to N-modular redundant HW
arrangements.
• Duplication of HW modules can be used to
provide tolerance of some forms of HW fault, but duplication of identical SW modules has little
benefit: it provides redundancy against transient faults arising from HW, but none against SW design faults.
Selecting fault-tolerant
architectures
• PES1 | PES2: Automotive, railway apps
• PES | NP: Automotive, railway apps
• PES1 | PES2 | PES3: Critical avionics
• PES1 | PES2 | NP: Critical avionics
• PES1 | PES2 | PES3 | PES4 (| NP): Nuclear reactor, fly-by-wire aircraft systems
• PES: Programmable Electronic System.
• NP: Non-Programmable module.
Space shuttle example
• 23 serial data buses among 5 CPUs and
memory, displays, sensors, telemetry,
control panels, payloads, boosters and
telecomms +...
• 5 serial interprocessor buses
Fault-tolerant PLC example
[Figure: a fault-tolerant PLC. Triplicated sensors feed triplicated input modules, processors and output modules, which drive the actuator.]
Fault-tolerant avionics systems
[Figure: a fault-tolerant avionics architecture. Repeated processor and power-supply modules with gateways sit on a duplicated ARINC 659 backplane serial data bus, which connects smart sensors, smart actuators and data concentrators serving dumb sensors.]
System reliability
• R(t) = n(t)/N, where R is reliability, n(t) is the number of correctly operating samples and N is the total number of samples.
• Q(t) = nf(t)/N, where Q is unreliability and nf(t) is the number of failed samples.
• Q(t) = 1 − R(t)
Typical variation of failure rate
[Figure: the 'bathtub curve' of failure rate against time, showing the burn-in, useful-life and wear-out phases.]
• Failure rate λ(t) = (dnf(t)/dt) / n(t)
Exponential failure law
• The exponential relationship between reliability and time is known as the exponential failure law: for a constant failure rate, reliability falls exponentially with time.
• R(t) = e^(−λt)
• Here λ denotes the constant failure rate throughout the useful life of the system
Time-variant failure rates
• SW failures are due to design faults, which in some circumstances may be located and removed during the lifetime of the product. In this case the number of failures will tend to decrease with time.
• R(t) = e^(−(t/η)^β)
where β is the shape parameter and η is the characteristic life.
Mean time to failure (MTTF)
• MTTF is another way of describing reliability:
the expected time that a system will operate
before the first failure occurs.
• MTTF = ∫0∞ R(t )dt
• If the failure rate is constant, then
MTTF = 1/λ. At t = MTTF, R = e^−1 ≈ 0.37: the
system has only about a 37% chance of still
operating correctly.
Other equalities...
• Mean time to repair (MTTR): the average time taken to repair a system that has failed
• Mean time between failures (MTBF): MTBF = MTTF + MTTR ≈ MTTF
• Availability = (time system is operational) / (total time)
= MTTF / (MTTF + MTTR)
• Unavailability = 1 − Availability
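The definitions above can be put into numbers; the failure and repair figures here are invented for illustration:

```python
import math

lam = 1e-4               # assumed constant failure rate, per hour
mttf = 1 / lam           # 10 000 hours
mttr = 10.0              # assumed mean time to repair, hours

# Reliability at t = MTTF under the exponential failure law: e^-1 ≈ 0.37.
reliability_at_mttf = math.exp(-lam * mttf)

# Availability: fraction of total time the system is operational.
availability = mttf / (mttf + mttr)

assert abs(reliability_at_mttf - 0.3679) < 1e-3
assert availability > 0.999
```

Note how a very long MTTF still yields imperfect availability whenever repairs take non-zero time.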
Reliability modelling
• Combinational models:
- Series systems (modules arranged in series):
λ = λ1 + λ2 + λ3 + ... + λn
R(t) = R1(t)R2(t)...RN(t)
- Parallel systems (modules in parallel):
Q(t) = [1 − R1(t)][1 − R2(t)]...[1 − Rn(t)]
Since R(t) = 1 − Q(t), for N identical modules
R(t) = 1 − [1 − Rm(t)]^N.
- Series-parallel combinations
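The series and parallel formulas are easily checked numerically; the figures below match Examples 1 and 3 that follow:

```python
import math

def series_reliability(module_reliabilities):
    """Series system: all modules must work, so reliabilities multiply."""
    return math.prod(module_reliabilities)

def parallel_reliability(module_reliabilities):
    """Parallel system: fails only if every module fails."""
    return 1 - math.prod(1 - r for r in module_reliabilities)

# Example 1: 100 series components, each with reliability 0.999.
assert abs(series_reliability([0.999] * 100) - 0.905) < 0.001

# Example 3: 3 parallel identical modules, each with reliability 0.999.
assert abs(parallel_reliability([0.999] * 3) - 0.999999999) < 1e-12
```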
Example 1
• A system is composed of 100 components
and failure of any component will result in
failure of the system. If the failure of the
various components is completely
independent, and each component has a
reliability of 0.999, calculate the overall
system reliability. (R(t)=0.905,Q(t)=0.095)
Example 2
• A series system containing 100
components is required to have a reliability
of at least 0.999. Assuming that each of
the components is equally reliable, what
minimum reliability (Rc)would they require
to achieve the specified system
performance?
• (R(t) = Rc(t)^100 ≥ 0.999 ⇒ Rc(t) ≥ 0.99999)
Example 3
• A system consists of three identical modules and
will operate correctly provided that at least one
module is operational. If the reliability of each of the modules is 0.999, what will be the reliability
of the complete system, assuming that the
modules fail independently?
• (R(t) = 0.999 999 999)
Example 4
• A system requires a minimum reliability of 0.999.
A module designed to fulfil the requirements of
the system is found to have a reliability of only 0.85. If a parallel combination of these modules
is used to implement the system, what is the
minimum number of modules needed to achieve the required reliability?
• (N ≥ 3.64 so N=4)
Triple modular and N-modular
redundancy
• Probability of correct operation =
probability of no failures
+ probability of only module 1 failing
+ probability of only module 2 failing
+ probability of only module 3 failing
TMR reliability
• RTMR(t) =
R1(t )R2(t )R3(t )
+[1-R1(t )]R2(t )R3(t )
+[1-R2(t )]R1(t )R3(t )
+[1-R3(t )]R1(t )R2(t )
• If three modules are identical
RTMR(t) = 3R2(t) – 2R3(t)
Example 5
• A TMR system consists of three identical
modules each with a reliability of 0.95.
Calculate the reliability of the resultant
system, ignoring the effects of the voting
arrangement
• (RTMR(t ) = 0.993)
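The formula for identical modules can be checked against Example 5 in a few lines:

```python
def tmr_reliability(r):
    """TMR system reliability for three identical modules of reliability r,
    ignoring the voting element: 3R^2 - 2R^3."""
    return 3 * r**2 - 2 * r**3

# Example 5: R = 0.95 per module gives roughly 0.993 overall.
assert abs(tmr_reliability(0.95) - 0.993) < 0.001

# TMR only helps for reasonably reliable modules:
assert tmr_reliability(0.4) < 0.4   # below R = 0.5, TMR makes things worse
```

The second assertion illustrates a point worth remembering: majority voting amplifies whatever reliability the modules already have, in either direction.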
Cut and tie sets
• Cut sets are formed by drawing lines through the reliability block diagram to represent combinations of elements in which simultaneous failure would lead to system failure.
• Minimal cut sets represent cut sets in which no subset will result in system failure.
• Tie sets are formed by drawing lines through the reliability block diagram to represent groups which, if all the elements were working, would guarantee the functioning of the system.
• Minimal tie sets represent tie sets in which no subset will perform this function.
[Figure: a four-element reliability block diagram (elements 1 to 4) marked, on the left, with its minimal cut sets and, on the right, with its minimal tie sets.]
Reliability and MTTF
• Reliability is a function of time while the
MTTF of a system is a fixed characteristic
that does not change with time.
• The two seem correlated, but this is not
necessarily so: adding redundancy to a
system increases its reliability over a
given time period, while the added
complexity may reduce the MTTF.
Independence of failures
• Up to now, all failures have been assumed
to be independent.
• This is valid for random component
failures but not for systematic failures:
similar redundant modules running the
same software may fail simultaneously,
since they all receive the same input data
and execute identical code.
Markov models
• An alternative approach to determine the
overall reliability of a system is to assign
various states to a system and to
determine the probability of being in any of
these states.
• As an example, one might assign two
possible states to a system, representing
the working and not working conditions.
A two-state system
[Figure: a two-state Markov diagram. From state 1 the system remains with probability 0.9 and moves to state 2 with probability 0.1; from state 2 it remains with probability 0.6 and returns to state 1 with probability 0.4.]
From a state diagram a tree diagram can be obtained; the tree shows successive time intervals, whereas the state diagram itself has no time parameter. The diagram expresses that a system initially in state 1 stays there with probability 0.9 and switches to state 2 with probability 0.1. The probabilities leaving any state must sum to unity. This is a discrete Markov model.
Continuous Markov modeling
[Figure: a two-state continuous Markov diagram, with transition rate λ from state 1 to state 2 and rate µ from state 2 back to state 1.]
In continuous Markov models the probabilities are replaced by transition rates: λ represents the failure rate and µ the repair rate. The limiting probabilities of being in each state are given by:
P1 = µ/(λ + µ) and P2 = λ/(λ + µ)
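The limiting-probability formulas can be verified numerically by iterating the equivalent discrete model with a small time step; the rates used here are invented:

```python
lam = 1e-3   # failure rate, per hour (assumed)
mu = 1e-1    # repair rate, per hour (assumed)

p1, p2 = 1.0, 0.0          # start in the working state
dt = 0.1                   # time step, hours
for _ in range(20000):
    # One discrete step of the two-state chain: probability mass flows
    # out at rate*dt and in from the other state likewise.
    p1, p2 = (p1 * (1 - lam * dt) + p2 * mu * dt,
              p2 * (1 - mu * dt) + p1 * lam * dt)

assert abs(p1 - mu / (lam + mu)) < 1e-6   # P1 = µ/(λ+µ)
assert abs(p2 - lam / (lam + mu)) < 1e-6  # P2 = λ/(λ+µ)
```

Note that P1 plays the same role as availability: with MTTF = 1/λ and MTTR = 1/µ, µ/(λ + µ) equals MTTF/(MTTF + MTTR).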
Safety-Critical hardware
• Design faults within microprocessors may take a
number of forms:
- failure of the circuit to correctly implement its intended function
- failure of the documentation to correctly
describe the circuit’s operation.
• It is often said that when designing critical
systems one should assume that ‘whatever can go wrong, will go wrong’.
Microprocessor design faults
• Compilers may not use some of the microprocessor's instructions, and any fault in those instructions may stay hidden because they are rarely exercised.
• If part of a microprocessor's operation is incorrect, the manufacturer may simply remove it from the documentation. Manufacturers may also implement several undocumented instructions for testing purposes only.
• In safety-critical applications only documented instructions must be used, but compiler usage may leave the developer with less control over the instructions generated. The writers of safety-critical code should obviously avoid such undocumented features.
Manufacturer's responses
• For insignificant problems, they simply ignore the fault.
• In other situations they will acknowledge the effect as a 'feature' of the device and modify the documentation accordingly.
• In severe cases they may modify the mask of the chip, often without notifying users.
• In extreme circumstances, usually as a result of external publicity, they may acknowledge the problem, modify the device and recall existing circuits.
• It is necessary to ensure that any replacement devices are of the same mask type (revision) as that used during development. Unfortunately, manufacturers generally only mark this revision number on devices meeting certain military standards (such as MIL-STD-883D).
Choice of microprocessors
• Processors that possess potentially dangerous
characteristics should clearly not be used in
critical applications.
• Example: the Motorola 6801 has a test function that fetches an infinite number of bytes from memory. If the function is executed inadvertently, perhaps as a result of a jump garbled by noise, its effects are dramatic.
Microprocessors being used in industry
• The 1750A is defined in MIL-STD-1750A; the 1750B is a more recent version of the 1750A.
• The 1750A and 1750B have become de facto standards in many current military and space applications.
• In automotive applications, the 68HC11 is used widely because of its protected control registers, watchdog timer (WDT), clock monitor, illegal-opcode trap, and the inhibition of its test functions during normal use.
Designing for EMC
• The physical layout of the PCB matters: long tracks act as antennas. A good design can reduce radiated emissions by up to 20 dB.
• Loops built into PCBs also act as antennas. This causes severe problems in the layout of power lines.
• The use of ground planes within multilayer PCBs greatly reduces emissions.
• It is better to break right angles into pairs of 45-degree turns.
• Conductors leaving the enclosure must be filtered adjacent to the aperture.
• Switching power supplies are the major source of electrical noise. Screening the unit and filtering the power lines is essential, and many decoupling capacitors are required on the PCB.
• Increasing the rise and fall times of digital circuits with a capacitor reduces the high-frequency components.
Safety-critical software
• Exhaustive testing – almost impossible.
• Dynamic testing – executing the software within its target environment, or in some cases within a simulation of that environment.
• Static testing – the structure and properties of the software are studied; also called static code analysis. It is called static because the software is not executed. Control flow, data use and information flow are investigated.
• The designer must take steps to design for testability
Common problems found in SW
• Subprogram side-effects: where variables in the calling environment may be unexpectedly changed;
• Aliasing: where two or more distinct names refer (possibly inadvertently) to the same storage location;
• Failure to initialize: where a variable is used before it has been assigned a value;
• Expression evaluation errors: such as those caused by the use of an out-of-range array subscript, an arithmetic divide-by-zero or an arithmetic overflow;
• Portability-driven faults: these arise when the target device has more modest performance than the software development environment.
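Two of the faults listed above can be sketched in a few lines of Python. The variable and function names here are illustrative assumptions, chosen only to make the failure modes visible:

```python
# Aliasing: two distinct names refer to the same storage location.
readings = [10, 20, 30]
backup = readings          # an alias, NOT a copy
readings[0] = 99
assert backup[0] == 99     # the "backup" changed too, possibly inadvertently

# Subprogram side-effect: a callee unexpectedly changes the caller's data.
def scale(values, factor):
    for i in range(len(values)):
        values[i] *= factor    # mutates the caller's list in place
    return values

data = [1, 2, 3]
scale(data, 10)
print(data)  # the caller's variable was modified: [10, 20, 30]
```

In a language with explicit parameter modes (such as Ada's `in`/`out`), the side-effect in `scale` would have to be declared, which is one reason such languages are favoured for critical code.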
Arithmetic handling
• Usage of floating point arithmetic should be
avoided in systems of the highest levels of
criticality.
• Exception handling differs from language to language. For example, when dividing by zero, some languages assign the highest permissible value as the result, while others may assign a special code to represent an infinite number.
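As a concrete sketch of one such policy, the helper below saturates to a chosen limit instead of raising an exception. The function name and the 16-bit limit are illustrative assumptions, not part of any particular language's definition:

```python
# Divide-by-zero behaviour differs between languages: some raise an
# exception (as Python itself does), some saturate to the highest
# permissible value, and IEEE-754 hardware returns an infinity.
# A sketch of the "saturate" policy:

SATURATION_LIMIT = 2**15 - 1  # e.g. the largest 16-bit signed value

def safe_divide(numerator, denominator):
    """Return numerator/denominator, saturating instead of raising."""
    if denominator == 0:
        # Python would raise ZeroDivisionError here; this policy instead
        # assigns the highest permissible value with the correct sign.
        return SATURATION_LIMIT if numerator >= 0 else -SATURATION_LIMIT
    return numerator / denominator

print(safe_divide(5, 0))    # 32767 rather than an exception
print(safe_divide(-5, 0))   # -32767
print(safe_divide(10, 4))   # 2.5
```

Whatever policy is chosen, the point for critical systems is that it must be known, documented and uniform, rather than left to the defaults of the language or hardware.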
Language selection
• Some standards ban the use of assembly code in safety-critical systems, although this is now seen as unrealistic (e.g. UK Defence Standard 00-55).
• In the safety-critical community it is generally agreed that unstructured assembly language and C++ should not be used.
• Ada is the strongly recommended language in safety-critical projects.
Software partitioning
• It aids comprehension of the SW.
• It provides a level of isolation between software functions.
• It simplifies the verification process.
• It will produce a layered structure: high-level command and control functions reside in the upper level, while input-output routines and device drivers lie at the lowest layer.
• The safety-critical routines should be kept as small and as simple as possible.
• Adequate isolation should be achieved between the software modules. This is necessary because any verification of an individual routine is pointless if an external module can interfere with its operation. Also, one routine must not overwrite a memory location that is used by another module.
• The layered structure permits fault-detection modules to be placed on top of two or more distinct versions of the same module (like a voter).
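A minimal sketch of such a voter layered above diverse versions of the same module. The function names and the three toy "versions" are illustrative assumptions, not from the text:

```python
from collections import Counter

# Three independently written versions of the same module ("square x"):
def version_a(x):
    return x * x

def version_b(x):
    return x ** 2

def version_c(x):
    return sum(x for _ in range(x)) if x >= 0 else x * x

def voter(value, versions):
    """Run every version and return the majority result, or raise."""
    results = [v(value) for v in versions]
    answer, count = Counter(results).most_common(1)[0]
    if count < (len(versions) // 2) + 1:
        raise RuntimeError("no majority among results: %r" % results)
    return answer

print(voter(4, [version_a, version_b, version_c]))  # 16
```

The voter itself is small and simple, so it can be verified thoroughly, which is exactly the property the slide asks of safety-critical routines.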
Operating systems in safety-critical applications
• An OS allocates time to a series of concurrent tasks, preventing any single task from taking too much of the processor’s time.
• An OS often makes use of memory management hardware to limit a program’s access to memory, and should thus prevent overwriting of data.
• However, in highly critical real-time applications the use of an OS is not acceptable.
• In smaller systems, some of the functions of an OS are provided by a runtime kernel with a task scheduler. Runtime kernels can be of relatively low complexity and can be verified. The Alsys CSMART Ada kernel is certified for flight-critical systems such as those within the Boeing 777 aircraft.
Requirements for the design description from DO-178B (RTCA)
• A detailed description of how the software satisfies the specified software high-level requirements, including algorithms, data structures, and how software requirements are allocated to processors and tasks
• The description of the software architecture defining the software structure to implement the requirements
• The input/output description, e.g. a data dictionary, both internally and externally throughout the software architecture
• The data flow and control flow of the design
• Resource limitations, the strategy for managing each resource and its limitations, the margins, and the method for measuring those margins, for example timing and memory
• Scheduling procedures and interprocessor/intertask communication mechanisms, including time-rigid sequencing, pre-emptive scheduling, Ada rendezvous and interrupts
Requirements from DO-178B
• Design methods and details for their implementation, for example software data loading, user-modifiable software, or multiple-version dissimilar software
• Partitioning methods and means of preventing partition breaches
• Descriptions of the software components, whether they are new or previously developed, with reference to the baseline from which they were taken
• Derived requirements from the software design process
• If the system contains deactivated code, a description of the means to ensure that the code cannot be enabled by the target computer
• Rationale for those design decisions that are traceable to safety-related system requirements
DO-178B software testing activities
Ref: NASA
Modified condition/decision coverage
• Modified Condition/Decision Coverage (MC/DC) is used in the standard DO-178B to ensure that Level A (Catastrophic) software is tested adequately.
• Independence of a condition is shown by proving that only one condition changes at a time.
• The most critical (Level A) software, defined as that whose failure could prevent continued safe flight and landing of an aircraft, must satisfy MC/DC.
Definition of terms
• Condition – A condition is a leaf-level Boolean expression (it cannot be broken down
into a simpler Boolean expression).
• Decision – A Boolean expression composed of conditions and zero or more
Boolean operators. A decision without a Boolean operator is a condition.
• Condition Coverage – Every condition in a decision in the program has taken all possible
outcomes at least once.
• Decision Coverage – Every point of entry and exit in the program has been invoked at least
once, and every decision in the program has taken all possible outcomes at least once.
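The gap between these criteria can be shown with a tiny example (an illustrative sketch, not from the text): for the decision `A or B`, two tests achieve full condition coverage while missing decision coverage entirely.

```python
# Condition coverage does not imply decision coverage.
def decision(a, b):
    return a or b

tests = [(True, False), (False, True)]

# Every condition takes both outcomes at least once ...
assert {a for a, _ in tests} == {True, False}
assert {b for _, b in tests} == {True, False}

# ... yet the decision itself is only ever True:
outcomes = {decision(a, b) for a, b in tests}
print(outcomes)  # {True}
```

Adding the test (False, False) would supply the missing False outcome of the decision; this is why the stronger combined criteria below are used.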
Types of structural coverage
Statement coverage
• Every executable statement in the program is invoked at least once during SW testing.
• Statement coverage is considered a weak criterion. Example:
• By choosing x = 2, y = 0 and z = 4, every statement is executed once, but if an “or” is coded mistakenly instead of an “and” in the first statement, the test case will not detect the problem.
• On its own it is generally considered “useless”.
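The weakness can be demonstrated directly. The statement bodies below are illustrative assumptions built around the slide's values x = 2, y = 0, z = 4; only the variable names come from the text:

```python
# One test can execute every statement yet miss an operator fault.

def original(x, y, z):
    if x > 1 and y == 0:   # intended condition
        z = z / x
    if x == 2 or z > 1:
        z = z + 1
    return z

def mutant(x, y, z):
    if x > 1 or y == 0:    # "or" coded mistakenly instead of "and"
        z = z / x
    if x == 2 or z > 1:
        z = z + 1
    return z

# x=2, y=0, z=4 executes every statement of original() ...
assert original(2, 0, 4) == 3.0
# ... but cannot tell the faulty version apart:
assert mutant(2, 0, 4) == original(2, 0, 4)
# A test chosen for the condition structure does expose the fault:
assert mutant(3, 1, 6) != original(3, 1, 6)
```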
Condition/Decision Coverage
Every point of entry and exit in the program has
been invoked at least once, every condition in a decision in the program has taken all possible
outcomes at least once, and every decision in
the program has taken all possible outcomes at least once.
Modified Condition/Decision Coverage
Every point of entry and exit in the program has been invoked at least once, every condition in a decision in the program has taken on all possible outcomes at least once, and each condition has been shown to affect that decision’s outcome independently. A condition is shown to affect a decision’s outcome independently by varying just that condition while holding fixed all other possible conditions.
The condition/decision criterion does not guarantee the coverage of all conditions in the module, because in many test cases some conditions of a decision are masked by the other conditions. Using the modified condition/decision criterion, each condition must be shown to be able to act on the decision outcome by itself, everything else being held fixed. The MC/DC criterion is thus much stronger than condition/decision coverage.
Exhaustive testing is not possible
The number of Boolean expressions with n conditions, taken from an airborne software example: testing an expression with 36 conditions exhaustively would require 2^36 test cases.
Minimum tests for 3-input AND
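The minimal set (shown in the original figure, reconstructed here as an illustrative sketch) needs only n + 1 = 4 vectors: all-true, plus the three vectors obtained by flipping one input to false. A brute-force check confirms that each condition independently affects the decision:

```python
def and3(a, b, c):
    return a and b and c

# Minimum MC/DC test set for a 3-input AND: 4 vectors, not 2**3 = 8.
tests = [(True, True, True),
         (False, True, True),
         (True, False, True),
         (True, True, False)]

# For each condition i there must be a pair of tests that differ only
# in condition i and whose decision outcomes differ (independence).
for i in range(3):
    pairs = [(t, u) for t in tests for u in tests
             if t[i] != u[i]
             and all(t[j] == u[j] for j in range(3) if j != i)
             and and3(*t) != and3(*u)]
    assert pairs, "condition %d not shown independent" % i

print(len(tests))  # 4
```

The same construction generalises: an n-input AND (or OR, with the roles of true and false swapped) needs n + 1 tests, which is why MC/DC scales linearly where exhaustive testing scales exponentially.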
Minimum tests for 3-input OR
Minimum tests for 2-input XOR
Testing of NOT operations
Testing of comparisons
2 types of comparators
MC/DC
Better than MC/DC
Loop testing
Loops with exit
Masking test sets
[Figures: the test set before and after masking. After: the T value at the input of the OR gate masks the AND gate’s output.]
MC/DC example
Any false input to an AND gate will mask the other input. Hence, the false outcome of OR1 masks test case 1 for the OR2 gate; similarly, the false outcome of OR2 masks test case 3 for the OR1 gate.
Another example
The outputs computed match those provided. Hence, test cases 1, 2, 4, 5 and (FFTT) provide MC/DC for the example.
Grouped functionality testing
Determine MC/DC
Software usage in aircraft
• In safety-critical applications, it is preferred wherever possible to use simple electromechanical or non-programmable electronic solutions to problems related to safety.
• However, the amount of software used within aircraft has risen dramatically, from 8K words in an inertial navigation system (INS) around 1970 to nearly 10M words in the A330/A340 aircraft in 1993.
• If this trend continues, a further increase by a factor of about 1000 is expected over the next 20 years.
• During the period covered above, the number of fatal accidents throughout the world, measured per million hours of flight, has fallen by a factor of nearly 10 for most classes of civil aircraft.
• In the automotive industry, the dramatic increase in the use of microcontrollers and of bus systems such as the CAN network has led to the notion of safety-criticality. The volume of safety-critical software being used in the automotive industry will increase dramatically over the next years.
PLCs in safety-critical systems
[Block diagram: system sensors → control system → outputs to actuators; shutdown system sensors → PLC with diagnostics]
Level 1 shutdown system using PLCs: a single-processor shutdown system, referred to as a 1OO1 system (a 1-out-of-1 architecture). It has a single shutdown channel, and this unique channel must work correctly. According to the IEC standard, the first number indicates the number of independent channels and the second number indicates the number of channels that must work correctly.
[Block diagram: system sensors → control system → outputs to actuators; shutdown system sensors → two PLCs, each with diagnostics, and a switch]
Level 1 shutdown system using PLCs: a dual-processor shutdown system, referred to as a dual 1OO1D system (a dual 1-out-of-1 architecture).
Integrity levels in PLC systems
• Level 1: 1OO1, 1OO1D
• Level 2: 2OO2. The shutdown channels are in parallel, so the two channels must both work correctly.
• Level 3: 2OO3 (like TMR)
• Level 4: Dual TMR shutdown modules
from different manufacturers.
2OO2 arrangement
[Block diagram: system sensors → control system → outputs to actuators; two sets of shutdown system sensors → PLC 1 and PLC 2, each with diagnostics]
[Block diagram: three sets of shutdown system sensors → PLC 1, PLC 2 and PLC 3; control system → outputs to actuators]
A triple-processor arrangement with triple I/O (a 2-out-of-3 arrangement, 2OO3).
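The MooN voting rule behind these architectures can be sketched directly. The function name and the boolean channel readings are illustrative assumptions (True meaning "this channel demands shutdown"):

```python
def moon_trip(demands, m):
    """MooN logic: trip when at least m of the channels demand shutdown."""
    return sum(1 for d in demands if d) >= m

# 2OO3: two of the three PLC channels must agree before tripping,
# so a single faulty channel causes neither a spurious trip ...
print(moon_trip([True, False, False], m=2))   # False
# ... nor a missed one:
print(moon_trip([True, True, False], m=2))    # True

# 1OO1: the single channel's demand trips the plant directly.
print(moon_trip([True], m=1))                 # True
```

The choice of M trades the two failure directions: a low M favours safety (trips easily) at the cost of spurious shutdowns, a high M the reverse; 2OO3 balances both.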
A340’s fault tolerance
• Mechanical: Mechanical linkages to the rudder and trimmable horizontal stabilizer give control in the event of total electronic system failure
• Computers: Five computers of two types are used. Each computer uses two independent processors with diverse software and diverse programming languages. The primary and secondary computers have diverse hardware (including different processors), diverse software and different hardware manufacturers.
• Sensors: Multiple sensors (two or three) are used in each case.
• Actuators: One, two or three actuators are used for each surface.
• Electrical supplies: The A340 uses six generators, two batteries and five buses. Four of the generators are driven by the engines, one by an auxiliary power unit and the last by the hydraulic system.
References
• Safety-Critical Computer Systems – Neil
Storey – Pearson/Prentice Hall
• Wikipedia
• YouTube