Page 1

Safety Critical Computing

Osman Kaan EROL

What is SAFETY?

Safety-related system

• Safety is a property of a system: that it will not endanger human life or the environment.
• A safety-related system is one by which the safety of equipment or plant is assured.
• Safety-critical system is a synonym for a safety-related system, but in some cases it suggests a system of high criticality.

Levels of integrity

• The implications of failure vary greatly between applications, and this leads to the concept of levels of integrity that reflect the importance of correct operation.
• Once a project has been assigned a safety integrity level, this will determine the methods of design and implementation used for the system.

Software and safety

• All safety-critical applications depend on software, but software is among the most imperfect of the products of modern technology.
• However, software alone cannot provide assurance of safety, as its correct operation is dependent on the system hardware.
• (video)

Safety aspects

• Primary safety: Includes dangers from electrocution or electric shock, and from burns or fire caused directly by the hardware

• Functional safety: Covers aspects concerned with equipment that is directly controlled by the computer, and is related to the correct functioning of the hardware and its software.

• Indirect safety: Relates to the indirect consequences of a computer failure or the production of incorrect information.

Page 2

Disadvantages of computer-based systems

• Their complexity: by their very nature, all computer-based systems are complex.
• In complex digital devices such as microprocessors the number of possible failure modes is so large that it may be considered to be infinite.

Development lifecycle model

• Requirements → hazards and risk analysis → specification → architectural design → module design → module construction and testing → system integration and testing → system verification → system validation → certification → completed system
• In the slide's colour coding, the green phases emphasize a "top-down" approach to design, while the orange phases use a "bottom-up" approach to testing.

Verification & Validation

• Verification is the process of determining that a system, or module, meets its specification. (applies to modules)
• Validation is the process of determining that a system is appropriate for its purpose. (applies to the system as a whole)

Fault, error, system failure

• A fault is a defect within a system.
• An error is a deviation from the required operation of the system or subsystem.
• A system failure occurs when the system fails to perform its required function.

Fault classes

• Random faults are associated with hardware component failures.
• Systematic faults: all software faults come within this category, as they are faults within the design of the software.

Fault management

• Fault avoidance techniques aim to prevent faults from entering the system during the design stage

• Fault removal methods attempt to find faults within a system before it enters service

• Fault detection techniques are used during service to detect faults

• Fault tolerance techniques are designed to allow the system to operate correctly in the presence of faults.

Page 3

Faults are inevitable! We must learn to manage them...

Systems requirements

• Reliability

• Availability

• Failsafe operation

• System integrity

• Data integrity

• System recovery

• Maintainability

• Dependability

Reliability

• Reliability is the probability of a component, or system, functioning correctly over a given period of time under a given set of operating conditions.

Availability

• The availability of a system is the probability that the system will be functioning correctly at any given time.

Failsafe operation

• Possessing a set of output states that can be identified as being 'safe'.
• An example: the failsafe state of a railway signalling system. All signalling lights are red and all points are locked to their previous positions. This brings all the trains safely to a halt. If not...
[Photo slide: example of a signalling accident. Primary cause(s): wiring defect]

Page 4

Integrity

• The integrity of a system is its ability to detect faults in its own operation and to inform a human operator.

Data integrity

• Data integrity is the ability of a system to prevent damage to its own database and to detect, and possibly correct, errors that do occur.

System recovery

• It is vital that the system detects a failure and restarts itself quickly.
• Depending on the nature of the application, the recovery process may need to determine the current status of the system, to take appropriate action to continue operation, and to maintain safety.

Maintainability

• Maintenance is the action taken to retain a system in, or return a system to, its designed operating condition.
• Maintainability is the ability of a system to be maintained.
• Other terms related to this topic:
- Mean time to repair (MTTR)
- Maintenance-induced failures

Dependability

• Dependability is a property of a system that justifies placing one's reliance on it.
• It covers considerations of reliability, availability, safety, maintainability and other issues of importance in critical systems.

Conflict between system requirements

• An obvious example: the desire for high performance and the desire for low cost.
• Other topics to be learned:
- Global optimization
- Multi-objective decision support systems
- Pareto front
- Coverage of the Pareto front, or diversity in the archive list

Page 5

Safety requirements

• Identification of the hazards associated with the system

• Classification of these hazards

• Determination of methods for dealing with the hazards

• Assignment of appropriate reliability and availability requirements

• Determination of an appropriate safety integrity level

• Specification of development methods appropriate to this integrity level

Safety case

• A safety case describes the design and assessment techniques used in the development of the system.
• It is often referred to as a:
- Safety argument
- Safety justification
- Safety assessment report
• The provision of such a document is accepted as good engineering practice.

Hazard Analysis

• A hazard is a situation in which there is actual or potential danger to people or to the environment.
• Among the most widely used hazard analysis techniques are:
- failure modes and effects analysis
- failure modes, effects and criticality analysis
- event tree analysis
- fault tree analysis

Failure modes and effects analysis

FMEA for a microswitch (unit: tool guard switch):

Ref 1. Failure mode: open-circuit contacts. Possible causes: (a) faulty component; (b) excessive current; (c) extreme temperature. Local effect: failure to detect tool guard in place. System effect: prevents use of machine (system fails safe). Remedial action: select a reliable switch; rigid quality control on switch procurement.

Ref 2. Failure mode: short-circuit contacts. Possible causes: (a) faulty component; (b) excessive current. Local effect: system incorrectly senses guard to be closed. System effect: allows machine to be used when guard is absent (dangerous failure). Remedial action: modify software to detect switch failure and take appropriate action.

Ref 3. Failure mode: excessive switch-bounce. Possible causes: (a) ageing effects; (b) prolonged high currents. Local effect: slight delay in sensing state of guard. System effect: negligible. Remedial action: ensure hardware design prevents excessive current through switch.

Hazard and operability studies

HAZOP extract (item 1; inter-connection: sensor output; attribute: voltage):

Guide word "No". Cause: PSU, sensor or cable fault. Consequence: lack of sensor signal is detected and the system shuts down.

Guide word "More". Cause: sensor fault. Consequence: temperature reading too high, resulting in a decrease in plant efficiency. Recommendation: consider use of a duplicate sensor.

Guide word "Less". Cause: sensor mounted incorrectly, or sensor failure. Consequence: temperature reading too low, which could result in overheating and possible plant failure. Recommendation: as above.

Fault tree analysis

• Tree components...
[Figure: fault tree gate symbols, showing 'in', 'out' and 'control' connections]

Page 6

• Fault event resulting from other events.
• Basic event, taken as an input.
• Fault event not fully traced to its source. It is taken as an input, but its causes may be unknown.
• The triangle symbol is used to link trees. The 'in' symbol indicates an input from another tree (on another sheet). The 'out' symbol appears in place of the 'top event' and indicates that this point forms the input to another tree.
• AND gate: the output event occurs if ALL the inputs occur.
• OR gate: the output event occurs if ANY of its inputs occur, either alone or in combination.

Page 7

• INHIBIT gate: the control condition determines whether the input event appears at the output.

[Figure: two example fault trees. (1) Top event 'loss of heating', caused by loss of fuel supply (loss of solid fuel AND loss of liquid fuel) or loss of electricity. (2) Top event 'warning lamp does not operate', caused by primary lamp failure or by no voltage applied to the lamp (primary cable or connector failure, battery supply failure).]

Risk analysis

• An accident is an unintended event or sequence of events that causes death, injury, environmental or material damage.
• An incident is an unintended event or sequence of events that does not result in loss but, under slightly different circumstances, has the potential to do so.
• Risk is a combination of the frequency or probability of a specified hazardous event, and its consequence.
• Example: risk class (or risk level, or risk factor) = severity (e.g. death) × frequency (failures per year).

Severity categories for civil aircraft

• Catastrophic: failure conditions which would prevent continued safe flight.
• Hazardous: failure conditions which would lead to:
- a large reduction in safety margins
- physical distress or a higher workload
- adverse effects on occupants, including serious or potentially fatal injuries
• Major: failure conditions which would reduce the capability of the aircraft.
• Minor: failure conditions which would not significantly reduce the capability of the aircraft.
• No effect: failure conditions which do not affect the operational capability of the aircraft.

DO-178B objectives according to severity classes

Level | Failure condition | Objectives | With independence
A | Catastrophic | 66 | 25
B | Hazardous | 65 | 14
C | Major | 57 | 2
D | Minor | 28 | 2
E | No effect | 0 | 0

Page 8

Severity conditions for military systems

• Catastrophic: multiple deaths.
• Critical: a single death, and/or multiple severe injuries or severe occupational illnesses.
• Marginal: a single severe injury or occupational illness, and/or multiple minor injuries or minor occupational illnesses.
• Negligible: at most a single minor injury or minor occupational illness.

Hazard probability classes for aircraft systems

Probability per operating hour | Europe (JAR) | US (FAR)
10^0 to 10^-2 | Frequent | Probable
10^-3 to 10^-4 | Reasonably frequent | Probable
10^-5 to 10^-6 | Remote | Improbable
10^-7 to 10^-8 | Extremely remote | Improbable
10^-9 | Extremely improbable | Extremely improbable

Similar definitions are used in Europe (JAR, 1994) and in the US (FAR, 1993).

Allowable probability for civil aircraft systems

Category | Severity of effect | Maximum probability per operating hour
Catastrophic | Multiple deaths, usually with loss of aircraft | 10^-9
Hazardous | Large reduction in safety margins; serious injury or death of a small number of occupants | 10^-7 to 10^-8
Major | Significant reduction in safety margins; passenger injuries | 10^-5 to 10^-6
Minor | Operating limitation | 10^-3 to 10^-4
- | Nuisance | 10^-2
- | Normal | 10^0 to 10^-1

Safety integrity

• Safety integrity is the likelihood of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a stated period of time.

Assignment of integrity levels

[Figure: the severity and frequency of the hazardous event feed a risk classification, which yields an integrity classification; this splits into hardware integrity, systematic integrity and software integrity classifications.]

Fault tolerance

• Faults can be distinguished by their:
- nature (random hardware faults, systematic system faults or design faults)
- duration (permanent, transient or intermittent)
- extent (localized or global)
• Transient faults appear and then disappear after a short time. The effect of an alpha-particle strike on a semiconductor memory chip is an example: it can change one or more bits within the memory plane.
• Intermittent faults appear, disappear and then reappear at some later time. Poor solder joints or corrosion on connector contacts can cause this class of fault, as can electrical interference. Many permanent faults appear to be intermittent: a software synchronization fault, for example, is permanent, as its code is always present, but its execution will only sometimes result in an error.

Page 9

Hardware faults

• Possible hardware faults include a break in a conductor, or a short (bridge) between various nodes of the circuit.
• Several fault models are used to represent different types of fault:
- the single-stuck-at fault model
- the bridging fault model
- the stuck-open fault model

The single-stuck-at model

• The model assumes that a fault within a module will cause it to respond as if one of its inputs or outputs is stuck at a logic 1 or a logic 0.

The bridging model

• A bridging or short-circuit fault occurs when two or more nodes of a circuit are accidentally joined together, forming a permanent fault. Such a fault may sometimes be represented by the single-stuck-at model. The result is an unintentional logic operation resembling a 'wired-AND' (in positive logic) or 'wired-OR' (in negative logic) function, in which a logic 0 on one node can mask a logic 1 on another.

• PCB carbonization due to excessive heat can also cause such faults.

The stuck-open model

• The fault occurs when both output transistors of a CMOS gate are turned off as a result of an internal open- or short-circuit. This causes the output to be pulled neither high nor low, producing an effect similar to the high-impedance state of a three-state TTL gate.

The use of fault models

• A circuit with N nodes has 2N potential single-stuck-at faults. By applying appropriate combinations of inputs, or test vectors, these can be detected.
• A circuit with N nodes could have 3^N − 1 multiple faults. Multiple faults do occur, because physical defects can affect more than one node.
• A circuit with N nodes has C(N, M) possible bridging faults involving M shorted nodes (the number of combinations of N nodes taken M at a time).

Software faults

• Software faults may take an almost unlimited number of forms. Examples include:
- software specification faults
- coding faults
- logical errors within calculations
- stack overflows or underflows
- use of uninitialized variables

Page 10

Redundancy

• All forms of fault tolerance are achieved by some form of redundancy: the use of additional elements within the system which would not be required in a system that was free from all faults.
• The most commonly used form of redundancy is simple triple modular redundancy (TMR).

TMR

[Figure: basic TMR arrangement. A single input feeds modules 1, 2 and 3; their outputs go to a voting element, which produces the system output.]

Forms of redundancy

• The TMR principle can be applied as:
- Hardware redundancy
- Software redundancy
- Information redundancy: parity bits, checksums...
- Temporal (time) redundancy: the same operation is repeated three times to detect transient faults.

Design diversity

• Failures resulting from similar faults in different redundant modules are termed common-mode failures.
• Most attempts to deal with such problems rely on redundancy combined with some form of diversity.
• A TMR system whose redundant modules are designed and manufactured by different teams provides design diversity.
• However, design diversity does not provide protection against mistakes within the specification.

Fault detection techniques

• Functionality checking

• Consistency checking

• Signal comparison

• Checking pairs

• Information redundancy

• Instruction monitoring

• Loopback testing

• Watchdog timers

• Bus monitoring

• Power supply monitoring

Functionality checking

• Functionality checking involves the use of routines which check that the hardware of the system is functioning correctly.
• RAM checking: write a pattern, read it back, then repeat the process with the complement of the pattern. Using the complement matters because a simple write-then-read test can pass even when a memory device is absent: the capacitance of the bus lines holds the written value for a short time.
• In simple designs, testing all locations is possible, but in most cases only a fraction of the memory space is tested.
• The processor can be checked by executing a sequence of calculations and comparing the results with known values. If the routine contains a wide range of processing operations, this test verifies the operation of a large part of the system, including the processor, sections of the memory and the bus system.
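A minimal sketch of the pattern-and-complement RAM check described above, assuming a memory region that may be tested destructively at start-up (the region address and length would come from the target's memory map):

```c
#include <stdint.h>

/* Destructive RAM check: write a pattern, verify it, then write and
 * verify its complement. The complement step guards against a missing
 * device whose last bus value lingers through line capacitance. */
int ram_check(volatile uint8_t *base, uint32_t len)
{
    const uint8_t pattern = 0x55;              /* 01010101 */
    for (uint32_t i = 0; i < len; i++) {
        base[i] = pattern;                     /* write test pattern    */
        if (base[i] != pattern)
            return 0;                          /* stuck or missing cell */
        base[i] = (uint8_t)~pattern;           /* write complement 0xAA */
        if (base[i] != (uint8_t)~pattern)
            return 0;
    }
    return 1;                                  /* region passed */
}
```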

Page 11

Consistency checking

• Consistency checking uses some knowledge of the nature of the information within the system to test its validity. An example of this form of testing is range checking, which compares calculated or stored values for a variable with predefined values for its allowable range.
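A minimal sketch of range checking as described above; the variable, limits and safe-state handler are invented for illustration:

```c
/* Range checking: a value outside its physically plausible range is
 * treated as evidence of a sensor or computation fault, not as data. */
typedef enum { VALUE_OK, VALUE_OUT_OF_RANGE } check_t;

static check_t range_check(double value, double min, double max)
{
    return (value >= min && value <= max) ? VALUE_OK : VALUE_OUT_OF_RANGE;
}

/* Hypothetical usage: a boiler temperature outside 0..400 degrees C
 * triggers a transition to a safe state.
 *
 *   if (range_check(temp_c, 0.0, 400.0) != VALUE_OK)
 *       enter_safe_state();
 */
```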

Signal comparison

• In systems with redundancy it is possible to check the signals at similar points in the various modules in order to validate them. This process is simpler if the modules are identical, rather than of diverse design.

Checking pairs

• Checking pairs are effectively a special case of signal comparison. Here identical modules are designed to allow a comparison of multiple signals in an attempt to detect any discrepancies. If the modules produce identical signals, it is assumed that both are fault-free (or both are faulty).

Information redundancy

• Parity checking
• M-out-of-N codes
• Checksums
• Cyclic redundancy codes
• Error-correcting codes
• Each of these techniques uses additional redundant information (nothing is created from nothing).
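Minimal sketches of two of the techniques listed above, parity and a simple additive checksum; both append bits that carry no new data but allow corruption to be detected:

```c
#include <stdint.h>
#include <stddef.h>

/* Even parity over one byte: returns the parity bit to append,
 * 1 if the byte contains an odd number of 1s. */
static uint8_t even_parity(uint8_t byte)
{
    uint8_t p = 0;
    while (byte) { p ^= byte & 1u; byte >>= 1; }
    return p;
}

/* Simple additive checksum (modulo 256), stored so that the sum of
 * the buffer plus the checksum byte is zero; a receiver re-sums and
 * expects 0. */
static uint8_t checksum8(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return (uint8_t)(0u - sum);
}
```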

Instruction monitoring

• Normal operation of a processor involves the repeated fetching and execution of instructions, and a fault may corrupt the operation code or the operand. The action taken by a processor in response to such an error varies greatly between devices: some processors immediately raise an exception, others simply fetch the next byte from memory.
• Processors that do not take appropriate action in response to unimplemented instructions must not be used in safety-critical systems.

Loopback testing

• This verifies that signals leaving one point of a circuit arrive at their destination unchanged.
• This is achieved by providing an independent return path for the signal back to its source, and by comparing the outgoing and return signals to ensure equivalence.
• It is used in applications such as serial communications and within the processor. Bus lines, for example, may be constructed as loops so that they leave the processor, go to all relevant nodes of the circuit and then return to the processor for verification.
• In multi-board systems a 'daisy-chain' arrangement may be used to ensure that all boards are present.
• The outward and return paths should never be adjacent, to prevent the possibility of a short-circuit between them defeating the test.

Page 12

Watchdog timers

• A watchdog timer (WDT) is used to detect the 'crash' of a processor.
• The timer is arranged so that it will cause the system to reset if it is allowed to time out, but it is prevented from timing out by specific processor activity, such as periodically reloading a prescaler value which the WDT then counts down. If left undisturbed, the WDT generates a reset condition.
• WDTs have limitations: following a system crash the processor will often continue to operate for some considerable time until the reset occurs. During this time the operation of the processor is unpredictable and potentially hazardous.
• There may also be cases where the system crashes in such a way that the WDT is still reset periodically, preventing its intervention.
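A minimal sketch of the reload pattern described above, assuming a hypothetical memory-mapped watchdog; the register address and reload key are invented for illustration:

```c
#include <stdint.h>

/* Hypothetical watchdog reload register and key value. */
#define WDT_RELOAD   (*(volatile uint32_t *)0x40001000u)
#define WDT_KICK_KEY 0xA5A5A5A5u

static void control_loop_step(void) { /* application work */ }

int main(void)
{
    for (;;) {
        control_loop_step();
        /* Reload the watchdog only after a complete, healthy loop
         * iteration: if the loop hangs or crashes, the counter times
         * out and the hardware forces a reset. */
        WDT_RELOAD = WDT_KICK_KEY;
    }
}
```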

Bus monitoring

• Each bus address is compared with the allowable range for the program that is being executed, and any out-of-range value results in an error being reported to the processor.

Power supply monitoring

• A well-designed power supply, together with overvoltage protection, can normally protect components from damage due to excessive supply voltages.
• More serious problems can arise when the supply voltage drops below that required for normal operation. This is inevitable when the system is first turned on and when it is turned off; it may also occur if the supply voltage fails or 'dips' during operation.
• A power supply monitor warns the processor when the supply voltage reaches a dangerous level so that the processor can take emergency action.
• If the system must operate continuously, an uninterruptible power source must be used.

Hardware fault tolerance

• Static redundancy:
- TMR
- N-modular redundancy
• Dynamic redundancy:
- Standby spares
- Self-checking pairs
• Hybrid redundancy:
- N-modular redundancy with spares
- Module synchronization and diversity in hardware redundancy

TMR

[Figure: a TMR system with triplicated voting. A single input feeds modules 1, 2 and 3; each of three voting elements receives all three module outputs and produces one of outputs 1, 2 and 3.]

[Figure: a multistage TMR arrangement. Inputs 1, 2 and 3 feed a first stage of three modules and three voting elements; a 3x3 connection matrix passes the voted results to a second, identical stage, which produces the outputs.]

Page 13

[Figure: an N-modular redundant system. A single input feeds modules 1, 2, 3, ... N, whose outputs go to one voting element that produces the output.]

Truth table of a 3-input, 1-bit voting element:

inputs 3 2 1 | output
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 1
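A minimal sketch of the voting element defined by the truth table above: the output is the 2-out-of-3 majority of the inputs, so any single faulty module is out-voted.

```c
#include <stdint.h>

/* 2-out-of-3 majority vote; works per bit, so it can vote on whole
 * words from three redundant modules at once. */
static uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (a & b) | (a & c) | (b & c);
}
```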

[Figure: a standby spare arrangement with two modules (extensible to N). The input feeds module 1, module 2 and a fault detector; a switch selects which module drives the output.]

[Figure: a self-checking pair. The input feeds modules 1 and 2; a comparator checks their outputs, passes the output through and raises a 'failure detected' signal on any discrepancy.]

[Figure: a self-checking pair using software comparison. Each module contains a processor and memory; the two halves exchange results through a dual-port memory, and each can assert a 'fail' signal.]

[Figure: combining failure-detection signals using switches. Two self-checking modules, each with its own comparator and 'fail' signal, are combined so that a failed module is switched out of the output path.]

Page 14

Hybrid redundancy

• Hybrid redundancy uses a combination of voting, fault detection and module switching. Many techniques are used, although most can be generalized as some form of N-modular redundancy with spares.

[Figure: N-modular redundancy with spares. Modules 1 to N feed a voter through a switch; a disagreement detector compares each module with the voted output and switches in spares 1 to M to replace failed modules.]

Software fault tolerance

• The term has two meanings:
- the "tolerance of software faults", using several hardware techniques;
- the "tolerance of faults (hardware or software) by the use of software", for which two common methods are:
* N-version programming
* Recovery blocks

N-version programming

• Duplicating modules has no benefit in terms of redundancy if the same software is used in both.
• To achieve redundancy, different versions of the software may be run sequentially on the same processor; the same input data should give the same results. If two versions are used, the arrangement is analogous to a self-checking pair; three versions plus a voter are analogous to TMR. In practice, often only two different versions (N = 2) are used.
• N > 2 with multiple processors is used, for example, in the A330/A340 aircraft, where the application is highly critical.

Recovery blocks

• The technique dates back to the early 1970s.
• It uses some form of error detection to validate the operation of a software module. If an error is detected, an alternative software routine is used.
• It is based on acceptance tests. For example, the calculation of a square root can be checked by squaring the result and comparing it with the input data: equality after reversing the operation validates the module.
• If the test fails, an alternative, redundant module is executed and the acceptance test is applied again.
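A minimal sketch of a recovery block for the square-root example above; the routine names, the bisection fallback and the tolerance are illustrative, not prescribed by the source:

```c
#include <math.h>

static double sqrt_primary(double x) { return sqrt(x); }

/* Deliberately different (diverse) alternative: crude bisection. */
static double sqrt_alternate(double x)
{
    double lo = 0.0, hi = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 60; i++) {
        double mid = 0.5 * (lo + hi);
        if (mid * mid > x) hi = mid; else lo = mid;
    }
    return lo;
}

/* Acceptance test: reverse the operation and compare with the input. */
static int acceptance_test(double x, double r)
{
    return fabs(r * r - x) < 1e-6 * (x + 1.0);
}

double sqrt_recovery_block(double x)
{
    double r = sqrt_primary(x);
    if (acceptance_test(x, r))
        return r;
    /* Primary failed its acceptance test: run the alternative module
     * (whose result would be acceptance-tested again in turn). */
    return sqrt_alternate(x);
}
```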

Comparison of HW and SW techniques

• N-version programming provides fault masking in a manner similar to N-modular redundant hardware arrangements.
• Duplication of hardware modules can be used to provide tolerance of some forms of hardware fault, but duplication of identical software modules has little benefit: it provides redundancy against transient faults arising from the hardware, but none against software design faults.

Page 15

Selecting fault-tolerant architectures

• PES1 | PES2: automotive and railway applications
• PES | NP: automotive and railway applications
• PES1 | PES2 | PES3: critical avionics
• PES1 | PES2 | NP: critical avionics
• PES1 | PES2 | PES3 | PES4 (| NP): nuclear reactor and fly-by-wire aircraft systems
• PES: Programmable Electronic System; NP: Non-Programmable module.

Space shuttle example

• 23 serial data buses among 5 CPUs and memory, displays, sensors, telemetry, control panels, payloads, boosters and telecomms +...
• 5 serial interprocessor buses

Fault-tolerant PLC example

[Figure: a fault-tolerant PLC. Three sensors feed three input modules; three processors and three output modules drive a single actuator.]

Fault-tolerant avionics systems

[Figure: a fault-tolerant avionics architecture. Each module contains a gateway, two processors and a power supply on an ARINC 659 backplane bus; identical modules are linked by an ARINC 659 serial data bus to smart actuators, smart sensors, and a data concentrator serving dumb sensors.]

System reliability

• R(t) = n(t)/N, where R is reliability, n(t) is the number of correctly operating samples and N is the total number of samples.
• Q(t) = n_f(t)/N, where Q is unreliability and n_f(t) is the number of failed samples.
• Q(t) = 1 − R(t)

Typical variation of failure rate

[Figure: bathtub curve of failure rate against time, with burn-in, useful-life and wear-out regions.]

Failure rate: z(t) = (1/n(t)) · dn_f(t)/dt

Page 16

Exponential failure law

• The exponential relationship between reliability and time is known as the exponential failure law: for a constant failure rate, reliability falls exponentially with time.
• R(t) = e^(−λt)
• Here λ denotes the constant failure rate throughout the useful life of the system.

Time-variant failure rates

• Software failures are due to design faults, which may in some circumstances be located and removed during the lifetime of the product. In this case the number of failures will tend to decrease with time.
• R(t) = e^(−(t/η)^β), where β is the shape parameter and η is the characteristic life.

Mean time to failure (MTTF)

• MTTF is another way of describing reliability: it is the expected time that a system will operate before the first failure occurs.
• MTTF = ∫₀^∞ R(t) dt
• If the failure rate is constant, then MTTF = 1/λ. When t = MTTF, R = e^(−1) ≈ 0.37: the system has only about a 37% chance of still operating correctly at that time.
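For the constant-failure-rate case, both quoted results follow in one line from the formulas above:

```latex
\mathrm{MTTF}=\int_0^{\infty} e^{-\lambda t}\,dt=\frac{1}{\lambda},
\qquad
R(\mathrm{MTTF})=e^{-\lambda\cdot(1/\lambda)}=e^{-1}\approx 0.37
```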

Other equalities...

• Mean time to repair (MTTR): the average time taken to repair a system that has failed.
• Mean time between failures (MTBF): MTBF = MTTF + MTTR ≈ MTTF.
• Availability = (time system is operational) / (total time) = MTTF / (MTTF + MTTR)
• Unavailability = 1 − Availability

Reliability modelling

• Combinational models:
- Series systems (modules arranged in series):
λ = λ₁ + λ₂ + λ₃ + ... + λ_N
R(t) = R₁(t) R₂(t) ... R_N(t)
- Parallel systems (modules in parallel):
Q(t) = [1 − R₁(t)][1 − R₂(t)]...[1 − R_N(t)]
and since R(t) = 1 − Q(t), R(t) = 1 − [1 − R_m(t)]^N when the modules are identical.
- Series-parallel combinations.
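A minimal sketch of the two combinational formulas above for N identical modules of reliability r:

```c
#include <math.h>

/* Series: all N modules must work, so R = r^N. */
static double series_reliability(double r, int n)
{
    return pow(r, n);
}

/* Parallel: the system fails only if all N modules fail,
 * so R = 1 - (1 - r)^N. */
static double parallel_reliability(double r, int n)
{
    return 1.0 - pow(1.0 - r, n);
}
```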

Example 1

• A system is composed of 100 components, and the failure of any component will result in failure of the system. If the failures of the various components are completely independent, and each component has a reliability of 0.999, calculate the overall system reliability. (R(t) = 0.905, Q(t) = 0.095)
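The quoted answer follows directly from the series formula:

```latex
R(t)=0.999^{100}\approx e^{-100\times 0.001}\approx 0.905,
\qquad
Q(t)=1-R(t)\approx 0.095
```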

Page 17

Example 2

• A series system containing 100 components is required to have a reliability of at least 0.999. Assuming that each of the components is equally reliable, what minimum reliability Rc would they require to achieve the specified system performance?
• (R(t) = Rc(t)^100 ≥ 0.999 ⟹ Rc(t) ≥ 0.99999)

Example 3

• A system consists of three identical modules and will operate correctly provided that at least one module is operational. If the reliability of each of the modules is 0.999, what will be the reliability of the complete system, assuming that the modules fail independently?
• (R(t) = 1 − (1 − 0.999)³ = 0.999 999 999)

Example 4

• A system requires a minimum reliability of 0.999. A module designed to fulfil the requirements of the system is found to have a reliability of only 0.85. If a parallel combination of these modules is used to implement the system, what is the minimum number of modules needed to achieve the required reliability?
• ((0.15)^N ≤ 0.001 gives N ≥ 3.64, so N = 4)

Triple modular and N-modular redundancy

• Probability of correct operation = probability of no failures + probability of only module 1 failing + probability of only module 2 failing + probability of only module 3 failing.

TMR reliability

• R_TMR(t) = R₁(t)R₂(t)R₃(t) + [1 − R₁(t)]R₂(t)R₃(t) + [1 − R₂(t)]R₁(t)R₃(t) + [1 − R₃(t)]R₁(t)R₂(t)
• If the three modules are identical: R_TMR(t) = 3R²(t) − 2R³(t)
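With identical modules, setting R₁ = R₂ = R₃ = R in the expression above gives the quoted result:

```latex
R_{\mathrm{TMR}}(t)=R^3+3(1-R)R^2=3R^2-2R^3
```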

Example 5

• A TMR system consists of three identical modules, each with a reliability of 0.95. Calculate the reliability of the resultant system, ignoring the effects of the voting arrangement.
• (R_TMR(t) = 3 × 0.95² − 2 × 0.95³ = 0.993)

Page 18

Cut and tie sets

• Cut sets are formed by drawing lines through the reliability block diagram to represent combinations of elements whose simultaneous failure would lead to system failure.
• Minimal cut sets are cut sets containing no subset that would itself result in system failure.
• Tie sets are formed by drawing lines through the reliability block diagram to represent groups which, if all their elements were working, would guarantee the functioning of the system.
• Minimal tie sets are tie sets containing no subset that will itself perform this function.

[Figure: example reliability block diagram (modules 1 to 4), annotated once with its minimal cut sets and once with its minimal tie sets.]

Reliability and MTTF

• Reliability is a function of time, while the MTTF of a system is a fixed characteristic that does not change with time.
• The two seem to correlate, but this is not necessarily true: adding redundancy to a system increases its reliability over a given time period, while the added complexity can reduce the MTTF.

Independence of failures

• Up to now, all failures have been assumed to be independent.
• This is valid for random component failures but not for systematic failures. Redundant modules running the same software may fail simultaneously, since all of them receive the same input data and their software is identical.

Markov models

• An alternative approach to determining the overall reliability of a system is to assign various states to the system and to determine the probability of being in any of these states.
• As an example, one might assign two possible states to a system, representing the working and not-working conditions.

A two-state system

[Figure: two-state discrete Markov diagram. The system stays in state 1 with probability 0.9 and moves to state 2 with probability 0.1; it moves from state 2 back to state 1 with probability 0.4 and stays in state 2 with probability 0.6.]

From a state diagram a tree diagram can be obtained, in which time intervals are shown explicitly; in the state diagram itself there is no time parameter. The state diagram expresses that the system, if it is initially in state 1, will stay in that state with a probability of 0.9 and will switch to state 2 with a probability of 0.1. The sum of the probabilities leaving a state must equal unity. This is a discrete Markov model.
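A minimal sketch that iterates this two-state model numerically, using the transition probabilities quoted above (it converges to the limiting probabilities P1 = 0.8, P2 = 0.2):

```c
#include <stdio.h>

int main(void)
{
    double p1 = 1.0, p2 = 0.0;             /* start in state 1 */
    for (int step = 1; step <= 20; step++) {
        double n1 = 0.9 * p1 + 0.4 * p2;   /* stay in 1, or return from 2 */
        double n2 = 0.1 * p1 + 0.6 * p2;   /* move to 2, or stay in 2     */
        p1 = n1; p2 = n2;
        printf("step %2d: P1 = %.4f  P2 = %.4f\n", step, p1, p2);
    }
    return 0;
}
```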

Page 19

Continuous Markov modelling

[Figure: two-state continuous Markov diagram, with transition rate λ from state 1 (working) to state 2 (failed) and repair rate µ from state 2 back to state 1.]

In continuous Markov models the probabilities are replaced by transition rates: λ represents the failure rate and µ the repair rate. The limiting probabilities of being in each state are P1 = µ/(λ + µ) and P2 = λ/(λ + µ).
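The limiting probabilities follow from balancing the flow between the two states:

```latex
\lambda P_1=\mu P_2,\quad P_1+P_2=1
\;\Rightarrow\;
P_1=\frac{\mu}{\lambda+\mu},\qquad P_2=\frac{\lambda}{\lambda+\mu}
```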

Safety-critical hardware

• Design faults within microprocessors may take a number of forms:
- failure of the circuit to correctly implement its intended function
- failure of the documentation to correctly describe the circuit's operation
• It is often said that when designing critical systems one should assume that 'whatever can go wrong, will go wrong'.

Microprocessor design faults

• Compilers may not use some of the microprocessor's instructions, so a fault in those instructions can stay hidden because they are rarely exercised.
• If an instruction's operation is incorrect, the manufacturer may simply remove it from the documentation. Manufacturers may also implement several undocumented instructions for testing purposes only.
• In safety-critical applications only documented instructions must be used, but the use of a compiler gives the developer less control over which instructions are generated. The writers of safety-critical code would obviously avoid such undocumented features.

Manufacturers' responses

• For insignificant problems, they simply ignore the fault.
• In other situations they will acknowledge the effect as a 'feature' of the device and modify the documentation accordingly.
• In severe cases they may modify the mask of the chip, often without notifying users.
• In extreme circumstances, usually as a result of external publicity, they may acknowledge the problem, modify the device and recall existing circuits.
• It is necessary to ensure that any replacement devices are of the same mask type (revision) as that used during development. Unfortunately, manufacturers only mark this revision number on devices meeting certain military standards (such as MIL-STD-883D).

Choice of microprocessors

• Processors that possess potentially dangerous characteristics should clearly not be used in critical applications.
• Example: the Motorola 6801 has a test function that fetches an unending stream of bytes from memory. If the function is executed inadvertently, perhaps as a result of a jump garbled by noise, its effects are dramatic.

Microprocessors being used in industry

• The 1750A is defined in MIL-STD-1750A; the 1750B is a more recent version of the 1750A.
• The 1750A and 1750B have become de facto standards in many current military and space applications.
• In automotive applications the 68HC11 is widely used because of its protected control registers, a WDT, a clock monitor, an illegal-opcode trap and the inhibition of its test functions during normal use.

Page 20

Designing for EMC

• The physical layout of the PCB matters: long tracks act as antennas. A good design can reduce radiated emissions by up to 20 dB.
• Loops built into PCBs also act as antennas. This causes severe problems in the layout of power lines.
• The use of ground planes within multilayer PCBs greatly reduces emissions.
• It is better to break right-angle bends into pairs of 45-degree turns.
• Conductors leaving the enclosure must be filtered adjacent to the aperture.
• Switching power supplies are a major source of electrical noise. Screening the unit and filtering the power lines is essential, and many decoupling capacitors are required on the PCB.
• Increasing the rise and fall times of digital circuits with a capacitor reduces the high-frequency components.

Safety-critical software

• Exhaustive testing: almost impossible.
• Dynamic testing: executing the software within its target environment, or in some cases within a simulation of that environment.
• Static testing: the structure and properties of the software are studied without executing it (hence 'static'); it is also called static code analysis. Control flow, data use and information flow are investigated.
• The designer must take steps to design for testability.

Common problems found in SW

• Subprogram side-effects: variables in the calling environment may be unexpectedly changed.
• Aliasing: two or more distinct names refer (possibly inadvertently) to the same storage location.
• Failure to initialize: a variable is used before it has been assigned a value.
• Expression evaluation errors: such as those caused by the use of an out-of-range array subscript, an arithmetic divide-by-zero or an arithmetic overflow.
• Portability-driven faults: these arise when the target device has more modest performance than the software development environment.

Arithmetic handling

• The use of floating-point arithmetic should be avoided in systems of the highest levels of criticality.
• Exception handling differs from language to language. When dividing by zero, for example, some languages assign the highest permissible value as the result, while others assign a special code representing an infinite number.

Language selection

• Some standards have banned the use of assembly code in safety-critical systems, although this is now seen as unrealistic (e.g. UK Defence Standard 00-55).
• In the safety-critical community it is generally agreed that unstructured assembly language and C++ should not be used.
• Ada is the strongly recommended language for safety-critical projects.

Software partitioning

• It aids comprehension of the software.
• It provides a level of isolation between software functions.
• It simplifies the verification process.
• It produces a layered structure: high-level command and control functions reside in the upper layers, while input-output routines and device drivers lie at the lowest layer.
• The safety-critical routines should be kept as small and as simple as possible.
• Adequate isolation should be achieved between the software modules. This is necessary because any verification of an individual routine is pointless if an external module can interfere with its operation; likewise, one routine must not overwrite a memory location in use by another module.
• A layered structure permits fault-detection modules to be placed on top of two or more distinct versions of the same module (like a voter).

Page 21

Operating systems in safety-critical applications

• An OS allocates time to a series of concurrent tasks, preventing any single task from taking too much of the processor's time.
• An OS often makes use of memory-management hardware to limit a program's access to memory, and should thus prevent the overwriting of data.
• However, in highly critical real-time applications the use of an OS is not acceptable.
• In smaller systems, some of the functions of an OS are provided by a runtime kernel with a task scheduler. Runtime kernels can be of relatively low complexity and can be verified: the Alsys CSMART Ada kernel, for example, is certified for flight-critical systems such as those within the Boeing 777 aircraft.

Requirements for the design description from DO-178B (RTCA)

• A detailed description of how the software satisfies the specified software high-level requirements, including algorithms, data structures, and how software requirements are allocated to processors and tasks
• The description of the software architecture defining the software structure to implement the requirements
• The input/output description, e.g. a data dictionary, both internally and externally throughout the software architecture
• The data flow and control flow of the design
• Resource limitations, the strategy for managing each resource and its limitations, the margins, and the method for measuring those margins, for example timing and memory
• Scheduling procedures and interprocessor/intertask communication mechanisms, including time-rigid sequencing, pre-emptive scheduling, Ada rendezvous and interrupts

Requirements from DO-178B

• Design methods and details for their implementation, for example software data loading, user-modifiable software, or multiple-version dissimilar software
• Partitioning methods and means of preventing partition breaches
• Descriptions of the software components, whether they are new or previously developed, with reference to the baseline from which they were taken
• Derived requirements from the software design process
• If the system contains deactivated code, a description of the means to ensure that the code cannot be enabled by the target computer
• Rationale for those design decisions that are traceable to safety-related system requirements

DO-178B software testing activities

[Figure: overview of DO-178B software testing activities. Ref: NASA]

Modified condition/decision coverage

• Modified condition/decision coverage (MC/DC) is used in the standard DO-178B to ensure that Level A (catastrophic) software is tested adequately.
• Independence of a condition is shown by proving that only one condition changes at a time.
• The most critical (Level A) software, defined as that whose failure could prevent continued safe flight and landing of an aircraft, must satisfy MC/DC.

Definition of terms

• Condition: a leaf-level Boolean expression (it cannot be broken down into a simpler Boolean expression).
• Decision: a Boolean expression composed of conditions and zero or more Boolean operators. A decision without a Boolean operator is a condition.
• Condition coverage: every condition in a decision in the program has taken all possible outcomes at least once.
• Decision coverage: every point of entry and exit in the program has been invoked at least once, and every decision in the program has taken all possible outcomes at least once.

Page 22

Types of structural coverage: statement coverage

• Every executable statement in the program is invoked at least once during software testing.
• Statement coverage is considered a weak criterion. Example (the slide's code figure is lost): by choosing x = 2, y = 0 and z = 4, every statement is executed once, but if an "or" is mistakenly coded instead of an "and" in the first decision, the test case will not detect the problem.
• It is generally considered "useless" as a sole coverage measure.
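The slide's code figure is lost; a standard illustration consistent with the values quoted (x = 2, y = 0, z = 4) looks like this:

```c
/* Hypothetical reconstruction of the example program. */
void example(int x, int y, int *z)
{
    if (x > 1 && y == 0)   /* suppose '&&' is mistyped as '||' ... */
        *z = *z / x;
    if (x == 2 || *z > 1)
        *z = *z + 1;
}
/* With x = 2, y = 0, z = 4 every statement executes, and the result
 * (z = 3) is identical whether the first decision uses '&&' or '||',
 * so 100% statement coverage fails to reveal the defect. */
```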

Condition/Decision Coverage

Every point of entry and exit in the program has been invoked at least once, every condition in a decision in the program has taken all possible outcomes at least once, and every decision in the program has taken all possible outcomes at least once.

Modified Condition/Decision Coverage

Every point of entry and exit in the program has been invoked at least once, every condition in a decision in the program has taken on all possible outcomes at least once, and each condition has been shown to affect that decision outcome independently. A condition is shown to affect a decision's outcome independently by varying just that condition while holding all other possible conditions fixed. The condition/decision criterion does not guarantee the coverage of all conditions in the module, because in many test cases some conditions of a decision are masked by the other conditions. Using the modified condition/decision criterion, each condition must be shown to be able to act on the decision outcome by itself, everything else being held fixed. The MC/DC criterion is thus much stronger than condition/decision coverage.
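A minimal illustration of MC/DC for a decision with three conditions (the expression is invented for illustration); four of the eight possible input combinations suffice, each condition being toggled between a pair of tests while the others are held fixed:

```c
/* Decision under test: a && (b || c).
 *
 *   test   a b c   outcome   independence shown for
 *    1     T F T      T      a (pair with 2), c (pair with 4)
 *    2     F F T      F      a
 *    3     T T F      T      b (pair with 4)
 *    4     T F F      F      b, c
 */
int decision(int a, int b, int c) { return a && (b || c); }
```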

Exhaustive testing is not possible

[Table: number of Boolean expressions with n conditions, taken from airborne software.] Exhaustively testing an expression with 36 conditions would require 2^36 test cases.

[Figure-only slide: minimum tests for a 3-input AND gate.]

Page 23

[Figure-only slides: minimum tests for a 3-input OR gate and a 2-input XOR gate; testing of NOT operations and of comparisons (two types of comparators); an MC/DC test set and a stronger-than-MC/DC test set; loop testing and loops with exit.]

Page 24

Masking test sets

[Figure: before/after example in which the TRUE value at the input of the OR gate masks the AND gate's output.]

MC/DC example

[Figure: gate network with gates OR1 and OR2 feeding an AND gate.] Any false input to the AND gate will mask the other input. Hence the false outcome of OR1 masks test case 1 for the OR2 gate; similarly, the false outcome of OR2 masks test case 3 for the OR1 gate.

Page 25

The outputs computed match those provided; hence test cases 1, 2, 4, 5 and (FFTT) provide MC/DC for the example.

[Figure-only slides: grouped functionality testing; determining MC/DC for a given expression.]

Page 26

Software usage in aircraft

• In safety-critical applications it is preferred, wherever possible, to use simple electromechanical or non-programmable electronic solutions to problems related to safety.
• However, the amount of software used within aircraft has risen dramatically: from about 8K words in an inertial navigation system (near 1970) to nearly 10M words in the A330/A340 aircraft in 1993.
• If this trend continues, a further increase by a factor of about 1000 can be expected over the next 20 years.
• During the period covered above, the number of fatal accidents throughout the world, measured per million hours of flight, has fallen by a factor of nearly 10 for most classes of civil aircraft.
• In the automotive industry, the dramatic increase in the use of microcontrollers and bus systems such as the CAN network leads to the notion of safety-criticality. The volume of safety-critical software being used in the automotive industry will increase dramatically over the next years.

PLCs in safety-critical systems

[Figure: a level 1 shutdown system using a single PLC. System sensors feed the control system, which drives the outputs to the actuators; separate shutdown-system sensors feed a PLC with diagnostics.]

This is referred to as a 1OO1 system (a 1-out-of-1 architecture): it has a single shutdown channel, and this unique channel must work correctly. In the IEC notation, the first number indicates the number of independent channels and the second the number of channels that must work correctly.

[Figure: a level 1 shutdown system using two PLCs, each with its own diagnostics and shutdown-system sensors, combined through a switch. Referred to as a dual 1OO1D system (a dual 1-out-of-1 architecture with diagnostics).]

Integrity levels in PLC systems

• Level 1: 1OO1, 1OO1D
• Level 2: 2OO2. The shutdown channels are in parallel, so both channels must work correctly.
• Level 3: 2OO3 (like TMR)
• Level 4: dual TMR shutdown modules from different manufacturers.

2OO2 arrangement

[Figure: a 2OO2 arrangement. System sensors feed the control system and the outputs to the actuators; two PLCs, each with diagnostics and its own shutdown-system sensors, form the two shutdown channels.]

[Figure: a triple-processor arrangement with triple I/O (2-out-of-3 arrangement, 2OO3). Three PLCs, each with its own shutdown-system sensors, act alongside the control system on the outputs to the actuators.]

Page 27

A340's fault tolerance

• Mechanical: mechanical linkages to the rudder and trimmable horizontal stabilizer give control in the event of total electronic system failure.
• Computers: five computers of two types are used. Each computer uses two independent processors with diverse software and diverse programming languages. The primary and secondary computers have diverse hardware (including different processors), diverse software and different hardware manufacturers.
• Sensors: multiple sensors (two or three) are used in each case.
• Actuators: one, two or three actuators are used for each control surface.
• Electrical supplies: the A340 uses six generators, two batteries and five buses. Four of the generators are driven by the engines, one by an auxiliary power unit and the last by the hydraulic system.

References

• Neil Storey, Safety-Critical Computer Systems, Pearson/Prentice Hall
• Wikipedia
• YouTube

