Fault tolerant systems


FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE

By

Glen B. Alleman

Irvine California, Copyright © 1980

Revised and updated

Niwot Colorado, Copyright © 1996

Submitted in Partial Fulfillment

Of

Masters in Systems Management (MSSM)

University of Southern California

Los Angeles, California

June 1980

FAULT–TOLERANT

SYSTEM RELIABILITY IN

THE PRESENCE OF

IMPERFECT DIAGNOSTIC

COVERAGE

Glen B. Alleman

The deployment of computer systems for the control of mission critical processes has become the norm in many industrial and commercial markets. The analysis of the reliability of these systems is usually understood in terms of the Mean Time to Failure. The design and analysis of high reliability systems is now a mature science. Starting with fault–tolerant central office switches (ESS4), dual redundant and n–way redundant systems are now available for a variety of application domains. The technologies of microprocessor based industrial controls and redundant central processor systems create the opportunity to build fault–tolerant computing systems on a much smaller scale than previously found in the commercial marketplace.

The diagnostic facilities utilized in a modern Fault–Tolerant Computer System attempt to detect fault conditions present in the hardware and embedded software. Coverage is the figure of merit describing the effectiveness of the diagnostic system. This effort examines the effects of less than perfect diagnostic coverage on system reliability. The mathematical background for analyzing the coverage factor of fault–tolerant systems is presented in detail, as well as specific examples of practical systems and their relative reliability measures.

In a complex system, malfunction and even total nonfunction may not be detected for long periods, if ever.

— John Gall

TABLE OF CONTENTS

INTRODUCTION
  Fault Tolerant System Definitions
  Fault–Tolerant System Functions
  Overview of This Work

RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS
  Deterministic Models
  Probabilistic Models
  Exponential and Poisson Relationships
  Reliability Availability and Failure Density Functions
  Mean Time to Failure
  Mean Time to Repair
  Mean Time Between Failure
  Mean Time to First Failure
  General Availability Analysis
    Instantaneous Availability
    Limiting Availability

SYSTEM RELIABILITY
  Series Systems
  Parallel Systems
  M–of–N Systems
  Selecting the Proper Evaluation Parameters

IMPERFECT FAULT COVERAGE AND RELIABILITY
  Redundant System with Imperfect Coverage
  Generalized Imperfect Coverage

MARKOV MODELS OF FAULT–TOLERANT SYSTEMS
  Solving the Markov Matrix
    Chapman–Kolmogorov Equations
  Markov Matrix Notation
  Laplace Transform Techniques
  Modeling a Duplex System
  Modeling a Triple–Redundant System
  Modeling a Parallel System with Imperfect Coverage
  Modeling a TMR System with Imperfect Coverage
  Modeling a Generalized TMR System
    Laplace Transform Solution to Systems of Equations
    Specific Solution to the Generalized System

PRACTICAL EFFECTS OF PARTIAL COVERAGE
  Determining Coverage Factors
    Coverage Measurement Statistics
    Coverage Factor Measurement Assumptions
    Coverage Measurement Sampling Method
    Normal Population Statistics
    Sample Size Computation
    General Confidence Intervals
    Proportion Statistics
    Confidence Interval Estimate of the Proportion
    Unknown Population Proportion
    Clopper–Pearson Estimation
    Practical Sample Estimates
    Time Dependent Aspects of Fault Coverage Measurement
  Common Cause Failure Effects
    Square Root Bounding Problem
    Beta Factor Model
    Multi–Nominal Failure Rate (Shock Model)
    Binomial Failure Rate Model
    Multi–Dependent Failure Fraction Model
    Basic Parameter Model
    Multiple Greek Letter Model
    Common Load Model
    Nonidentical Components Model
    Practical Example of Common Cause Failure Analysis
    Common Cause Software Reliability
      Software Reliability Concepts
      Software Reliability and Fail–Safe Operations

PARTIAL FAULT COVERAGE SUMMARY
  Effects of Coverage

REMAINING QUESTIONS
  Realistic Probability Distributions
    Multiple Failure Distributions
    Weibull Distribution
  Periodic Maintenance
    Periodic Maintenance of Repairable Systems
    Reliability Improvement for a TMR System

CONCLUSIONS

LIST OF FIGURES

Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations.

Figure 2 – Assumptions regarding the behavior of a random process that generated events following the Poisson probability distribution function.

Figure 3 – State Transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation.

Figure 4 – Definition of the exponential order of a function.

Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State 2 represents the fault free operation mode, State 1 represents a single fault with a return path to the fault free mode by a repair operation, and State 0 represents the system failure mode, the absorption state.

Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State 2 represents the fault free (TMR) operation mode, State 1 represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State 0 represents the system failure mode, the absorbing state.

Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State 2 represents the fault free mode, State 1 represents a single fault with a return path to the fault free mode by a repair operation, and State 0 represents the system failure mode. State 0 can be reached from State 2 through an uncovered fault, which causes the system to fail without the intermediate State 1 mode.

Figure 8 – The state transition diagram for a Triple Modular Redundant system with repair and imperfect fault coverage. State 3 represents the fault free mode, State 2 represents the single fault (Duplex) mode, State 1 represents the two–fault (Simplex) mode, and State 0 represents the system failure mode.

Figure 9 – The state transition diagram for a Generalized Triple Modular Redundant system with repair and perfect fault detection coverage. The system initially operates in a fault free state $0$. A fault in any module results in the transition to state $1, \ldots, N$. A second fault while in state $1, \ldots, N$ results in the system failure state $N+1$.

Figure 10 – Sample size requirement for a specified estimate as tabulated by Clopper and Pearson.

Figure 11 – Common Cause Failure modes guide figures for electronic programmable system [HSE87]. These are ratios of non–CCF to CCF for various system configurations. CCFs are defined as non–random faults that are designed in or experienced through environmental damage to the system. Other sources [SINT88], [SINT89] provide different figures.

Figure 12 – Four Software Growth Model expressions. The exponential and hyperexponential growth models represent software faults that are time independent. The S–Shaped growth models represent time delayed and time inflection software fault growth rates [Mats88].

Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems.

Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees of coverage.

Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant system with periodic maintenance. This graph shows that maintenance intervals which are greater than one–half of the mean time to failure for one module have little effect on increasing reliability. But frequent maintenance, even low quality maintenance, improves the system reliability considerably.

ACKNOWLEDGMENTS

The author wishes to thank Dr. Wing Toy of AT&T Naperville Laboratories, Naperville, Illinois for his consultation on the ESS4 Central Office Switch and his contributions to this work; Dr. Victor Lowe of Ford Aerospace, Newport Beach, California for his consultation on the general forms of Markov model solutions; Mr. Henk Hinssen of Exxon Corporation, Antwerp, Belgium for his discussion of the effects of partial diagnostic coverage in Triple Modular Redundant Systems at the Exxon Polystyrene Plant, Antwerp, Belgium; Dr. Phil Bennet of The Centre for Software Engineering, Flixborough, England for his ideas regarding software reliability measurements in the presence of undetected faults; and Mr. Daniel Lelivre of Factory Systems, Paris, France for his comments and review of this work and its applicability to safety critical systems at Total, Mobile, and NorSoLor chemical plants.

Several institutions have contributed source material for this work, including The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology (SINTEF), Trondheim, Norway and the United Kingdom Atomic Energy Authority, Systems Reliability Service, Culcheth, Warrington, England.

This work is a derivative of an aborted PhD thesis in Computer Science at the University of California, Irvine. This effort started in the early 1980’s through TRW, when holding a PhD was a naïve dream, requiring much more work than I had the capacity to produce.

PREFACE

This work was originally written, while I was pursuing a Masters in Systems Management, to support the design and development of the Triple Modular Redundant (TMR) computer produced by Triconex Corporation of Irvine, California. In 1987, Triconex designed and manufactured its first digital TMR process control computer, which was deployed in a variety of industrial environments, including turbine controls, boiler controls, fire and gas systems, emergency shutdown systems, and general purpose fault–tolerant real–time control systems.

The Tricon (a classic 1980’s product name) was based on several innovative technologies. As the manager of software development for Triconex, I was intimately involved in the software and hardware of the Tricon. In 1987, TMR was not a completely new concept. Flight control systems and navigation computers were found in aerospace applications. The Space Shuttle used a TMR+1 computer system, which was well understood by the public. What was new to the market was an affordable TMR computer that could be deployed in a rugged industrial environment. The heart of the Tricon was a hardware voting system that performed a 2–out–of–3 vote for all digital input signals presented to the control program. The contents of memory and the computed digital outputs were again voted 2–out–of–3 at the physical output devices. Once the digital command had been applied to the output device, its driven state was verified and the results reported to the control program.

The Tricon consisted of three independent (but identical) 32–bit battery powered microprocessors, a 2–out–of–3 voting digital serial bus connecting the three processors, a dual redundant power system using DC–to–DC converters (state of the art for 1987), and three separate isolated serial I/O buses connecting the I/O subsystem to the three main processors. The I/O subsystem cards were themselves TMR, using onboard 8–bit processors and a quad output device to vote 2–out–of–3 the digital commands received from the control program.
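The 2–out–of–3 vote described above can be illustrated with a minimal software sketch. This is not the Tricon implementation, which voted in hardware; the function name `vote_2oo3` and the choice to report the disagreeing channels are assumptions made only for illustration.

```python
from collections import Counter


def vote_2oo3(a, b, c):
    """Majority-vote three channel readings (hypothetical helper).

    Returns the value at least two channels agree on, plus the indices
    of any channels that disagree with it. Raises ValueError when no two
    channels agree, which cannot happen for purely binary signals.
    """
    readings = (a, b, c)
    value, count = Counter(readings).most_common(1)[0]
    if count < 2:
        raise ValueError("no two channels agree")
    disagreeing = [i for i, r in enumerate(readings) if r != value]
    return value, disagreeing


# One stuck channel is masked; the vote also tells diagnostics where to look.
voted, suspects = vote_2oo3(True, True, False)
print(voted, suspects)
```

Reporting the disagreeing channel alongside the voted value reflects the point of the surrounding text: masking a fault and diagnosing it are separate functions, and the reliability analysis that follows depends on how well the second one works.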

The Tricon executed a control program on a periodic basis. The architecture of

the operating software was modeled after the programmable controllers of the

day, which were programmed in a ladder logic representing mechanical relays and

timers. Both digital and analog devices provided input and output to the control

program. The control program accepted input states from the I/O subsystem,

evaluated the decision logic and produced output commands, which were sent to

the I/O subsystem. This cycle was performed every 10ms in a normally

configured system.

In the presence of faults, the key to the survivability of the Tricon was the

combination of TMR hardware and fault diagnostic software. Diagnostic

software was applied to each processor element and the digital I/O device. This

diagnostic software was capable of detecting all single stuck–at faults, many

multiple stuck–at faults as well as many transient faults. A fault–injection and

reliability evaluation technique developed by the author and described in this

work was used to evaluate the coverage factor of the diagnostic software.

Triconex no longer exists as an independent company, having been absorbed into a larger control systems vendor. The materials presented in this work were critical to the Tricon’s TÜV and SINTEF [SINTF89] certification for industrial safety operations in the North Sea Norwegian Sector, Germany (then the Federal Republic), Belgium, and Britain under the Health and Safety Executive (HSE).

The concept of fault–tolerant computing has become important again in the distributed computing marketplace. The Tandem Non–Stop processor, modern flight and navigation computers, as well as telecommunications computers all depend on some form of diagnostics to initiate the fault recovery process. A recent systems architectural paper mentioned TMR but without sufficient attention to the underlying details. [1] The reissuing of this paper addresses several gaps in the literature:

The foundations of fault–tolerance and fault–tolerance modeling have faded from the computer science literature. The underlying mathematics of fault–tolerant systems present a challenge for an industry focused on rapid software development and short time to market pressures.

That unreliable and untrustworthy software systems are created by latent faults in both the hardware and software is poorly understood in this age of Object–Oriented programming and plug and play systems development.

The Markov models presented in this work have general applicability to distributed computer systems analysis and need to be restated. The application of these models to distributed processing systems with symmetric multi–processor computers is a reemerging science. With the advent of high–availability computing systems, the foundations of these systems need to be understood once again.

The current crop of computer science practitioners has very little understanding of the complexities and subtleties of the underlying hardware and firmware that make up the diagnostic systems of modern computers, their reliability models and the mathematics of system modeling.

Glen B. Alleman Niwot Colorado 80503 Updated, April 2000

[1] “Attribute Based Architectural Styles,” Mark Klein and Rick Kazman, CMU/SEI–99–TR–022, Software Engineering Institute, Carnegie Mellon University, October 1999.

Chapter 1

INTRODUCTION

Two approaches are available to increase the reliability of a digital computer system: fault avoidance (fault intolerance) and fault tolerance [Aviz75]. Fault avoidance results from conservative design techniques utilizing high–reliability components, system burn–in, and careful design and testing processes. The goal of fault avoidance is to reduce the possibility of a failure [Aviz84], [Rand75], [Kim86], [Ozak88]. The presence of faults, however, results in system failure, negating all prior efforts to increase system reliability [Litt75], [Low72]. Fault–tolerance provides the system with the ability to withstand a system fault, maintain a safe state in the presence of a fault, and possibly continue to operate in the presence of this fault.

FAULT TOLERANT SYSTEM DEFINITIONS

A set of consistent definitions is used here to avoid confusion with existing definitions. These definitions are provided by the IFIP Working Group 10.4, Reliable Computing and Fault–Tolerance [Aviz84], [Aviz82], [Ande82], [Robi82], [Lapr84], [TUV86]:

A Failure occurs when the system user perceives a service resource ceases to deliver the expected results.

An Error occurs when some part of a system resource assumes an undesired state. Such a state is contrary to the specification of the resource or to the expectation (requirement) of the user.

A Fault is detected when either a failure of the resource occurs, or an error is observed within the resource. The cause of the failure or error is said to be a fault.

FAULT–TOLERANT SYSTEM FUNCTIONS

In fault–tolerant systems, hardware and software redundancy provides

information needed to negate the effects of a fault [Aviz67]. The design of fault–

tolerant systems involves the selection of a coordinated failure response

mechanism that follows four steps [Siew84], [Mell77], [Toy86]:

Fault Detection

Fault Location and Identification

Fault Containment and Isolation

Fault Masking

During the fault detection process, diagnostics are used to gather and analyze

information generated by the fault detection hardware and software. These

diagnostics determine the appropriate fault masking and fault recovery actions

[Euri84], [Rouq86], [Ossf80], [Gluc86], [John85], [John86], [Kirr86], [Chan70]. It

is the less than perfect operation of the Fault Detection, Location, and

Identification processes of the system that is examined in this work.

The reliability of the fault–tolerant system depends on the ability of the diagnostic

subsystem to correctly detect and analyze faults [Kirr87], [Gall81], [Cook73],

[Brue76], [Lamp82]. The measure of the correct operation of the diagnostic

subsystem is called the Coverage Factor. It is assumed in most fault–tolerant

product offerings that the diagnostic coverage factor is perfect, i.e. 100%. This

work addresses the question:

What is the reliability of the Fault–Tolerant system in the presence of less than perfect coverage?

To answer this question, some background in the mathematics of reliability

theory is necessary.

Overview of This Work

The development of a reliability model of a Triple Modular Redundant (TMR)

system with imperfect diagnostic coverage is the goal of this work. Along the

way, the underlying mathematics for analyzing these models is developed. The

Markov Chain method will be the primary technique used to model the failure

and repair processes of the TMR system. The Laplace transform will be used to

solve the differential equations representing the transition probabilities between

the various states of the TMR system described by the Markov model.

The models developed for a TMR system with partial coverage can be applied to

actual systems. In order to make the models useful in the real–world a deeper

understanding of the diagnostic coverage and fault detection is presented. The

appendices provide the background for the Markov models as well as the

statistical process.

The mathematics of Markov Chains and the statistical processes that underlie system faults and their repair processes can be applied to a variety of other analytical problems, including system performance analysis. It is hoped the reader will gain some appreciation of the complexity and beauty of modern systems as well as the subtleties of their design and operation.

If the reader is interested in skipping to the end, Chapter 7 provides a summary

of the effects of partial coverage on various system configurations.

Chapter 2

RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS

When presented with the reliability figures for a computer system, the user must

often accept the stated value as factual and relevant, and construct a comparison

matrix to determine the goodness of each product offering [Kraf81]. Difficulties

often arise through the definition and interpretation of the term reliability.

This chapter develops the necessary background for understanding the reliability

criteria defined by the manufacturers of computer equipment. Figure 1 lists the

criteria for defining system reliability [Siew82], [Ande72], [Ande79], [Ande81].

Deterministic Models
    Survival of at least $k$ component failures

Probabilistic Models
    $z(t)$ – Hazard (failure rate) function
    $R(t)$ – Reliability function
    $\mu$ – Repair Rate
    $A(t)$ – Availability function

Single Parameter Models
    MTTF – Mean Time to Failure
    MTTR – Mean Time to Repair
    MTBF – Mean Time Between Failure
    $c$ – Coverage

Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations.

DETERMINISTIC MODELS

The simplest reliability model is a deterministic one, in which the minimum

number of component failures that can be tolerated without system failure is

taken as the figure of merit for the system.

Probabilistic Models

The failure rate of electronic and mechanical devices varies as a function of time. This time dependent failure rate is defined by the hazard function, $z(t)$. The hazard function is also referred to as the hazard rate or mortality rate. For electronic components on the normal–life portion of their failure curve, the failure rate is assumed to be a constant, $\lambda$, rather than a function of time.

The exponential probability distribution is the most common distribution encountered in reliability models, since it describes accurately most life testing aspects for electronic equipment [Kapu77]. The probability density function (pdf), Cumulative Distribution Function (CDF), reliability function ($R(t)$), and hazard (failure rate) function ($z(t)$) of the exponential distribution are expressed by the following [Kend77]:

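$$\text{pdf:}\quad f(t) = \lambda e^{-\lambda t} \tag{2.1}$$

$$\text{CDF:}\quad F(t) = 1 - e^{-\lambda t} \tag{2.2}$$

$$\text{Reliability:}\quad R(t) = e^{-\lambda t} \tag{2.3}$$

$$\text{Hazard Function:}\quad z(t) = \lambda \tag{2.4}$$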
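As a numerical illustration of Eqs. (2.1) through (2.4), the following sketch evaluates the four functions for a constant failure rate. The failure rate of 10 failures per million hours and the one–year mission time are assumed values chosen only for the example, and the function names are illustrative.

```python
import math


def pdf(t, lam):
    """Eq. (2.1): failure density f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)


def cdf(t, lam):
    """Eq. (2.2): probability of failure by time t."""
    return 1.0 - math.exp(-lam * t)


def reliability(t, lam):
    """Eq. (2.3): probability of surviving to time t."""
    return math.exp(-lam * t)


def hazard(t, lam):
    """Eq. (2.4): z(t) = f(t) / R(t), which reduces to the constant lam."""
    return pdf(t, lam) / reliability(t, lam)


lam = 10e-6      # assumed: 10 failures per million hours
t = 8760.0       # assumed: one year of continuous operation, in hours

print(f"R(t) = {reliability(t, lam):.4f}")   # about 0.916
print(f"F(t) = {cdf(t, lam):.4f}")
print(f"z(t) = {hazard(t, lam):.2e} per hour")
```

Computing the hazard as the ratio $f(t)/R(t)$ rather than returning $\lambda$ directly makes the constant–hazard property visible: the ratio is the same at every value of $t$.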

The failure rate parameter $\lambda$ describes the rate at which failures occur over time [DoD82]. In the analysis that follows, the failure rate is assumed to be constant, and measured as failures per million hours. Although a time dependent failure rate could be used for un–aged electronic components, the aging of the electronic components can remove the traditional bathtub curve failure distribution. The constant failure rate assumption is also extended to the firmware controlling the diagnostics of the system [Bish86], [Knig86], [Kell88], [Ehre78], [Eckh75], [Gmei79], [RTCA85].

Exponential and Poisson Relationships

In modeling the reliability functions associated with actual equipment, several simplifying assumptions must be made to render the resulting mathematics tractable. These assumptions do not reduce the applicability of the resulting models to real–world phenomena. One simplifying assumption is that the random variables associated with the failure process have exponential probability distributions.

The property of the exponential distribution that makes it easy to analyze is that it does not decay with time. If the lifetime of a component is exponentially distributed, after some amount of time in use, the item is assumed to be as good as new. Formally, this property states that the random variable $X$ is memoryless if the expression $P\{X > s + t \mid X > t\} = P\{X > s\}$ is valid for all $s, t \ge 0$ [Cram66], [Ross83]. If the random variable $X$ is the lifetime of some item, then the probability that the item is functional at time $s + t$, given that it survived to time $t$, is the same as the initial probability that it was functional at time $s$. If the item is functional at time $t$, then the distribution of the remaining amount of time that it survives is the same as the original lifetime distribution. The item does not remember that it has already been in use for a time $t$.

This property is equivalent to the expression

$$\frac{P\{X > s + t,\ X > t\}}{P\{X > t\}} = P\{X > s\},$$

or $P\{X > s + t\} = P\{X > s\}\,P\{X > t\}$. Since the form of this expression is satisfied when the random variable $X$ is exponentially distributed (since $e^{-\lambda (s + t)} = e^{-\lambda s} e^{-\lambda t}$), it follows that exponentially distributed random variables are memoryless. The recognition of this property is vital to the understanding of the models presented in this work. If the underlying failure process is not memoryless, then the exponential distribution model is not valid.
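The memoryless property can be checked numerically. The sketch below compares the conditional survival probability $P\{X > s + t \mid X > t\}$ with the unconditional $P\{X > s\}$ for an exponential lifetime, and contrasts it with a Weibull lifetime of shape 2, which models wear–out and is not memoryless. The parameter values and function names are assumptions chosen only for illustration.

```python
import math


def exp_survival(x, lam=0.001):
    """P{X > x} for an exponential lifetime with failure rate lam."""
    return math.exp(-lam * x)


def weibull_survival(x, scale=1000.0, shape=2.0):
    """P{X > x} for a Weibull lifetime; shape > 1 means the item wears out."""
    return math.exp(-((x / scale) ** shape))


def conditional_survival(survival, s, t):
    """P{X > s + t | X > t} = P{X > s + t} / P{X > t}."""
    return survival(s + t) / survival(t)


s, t = 100.0, 500.0   # assumed: 100 further hours after 500 hours of use

# Exponential: the 500 hours already accumulated make no difference.
print(conditional_survival(exp_survival, s, t), exp_survival(s))

# Weibull, shape 2: the aged item is less likely to last another 100 hours.
print(conditional_survival(weibull_survival, s, t), weibull_survival(s))
```

When the two printed values in a row differ, as they do for the Weibull case, the item's age matters and the exponential model would misstate its remaining life.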

The exponential probability distributions and the related Poisson processes used

in the reliability models are formally based on the assumptions shown in Figure 2

[Cox 62], [Thor26].

Failures occur completely randomly and are independent of any previous failure. A single failure event does not provide any information regarding the time of the next failure event.

The probability of a failure during any interval of time $[0, t]$ is proportional to the length of the interval, with a constant of proportionality $\lambda$. The longer one waits, the more likely it is that a failure will occur.

Figure 2 – Assumptions regarding the behavior of a random process that generates events following the Poisson probability distribution function.

An expression describing the random processes in Figure 2 results from the

Poisson Theorem which states that the probability of an event A occurring k times

in n trials is approximately [Papo65], [Pois37],

$$\frac{n (n - 1) \cdots (n - k + 1)}{1 \cdot 2 \cdots k}\, p^{k} q^{n - k}, \tag{2.5}$$

where $p = P(A)$ is the probability of an event A occurring in a single trial and

$q = 1 - p$. This approximation is valid when $n \to \infty$, $p \to 0$ and the product

$n \cdot p$ remains finite. It should be noted that a large number of different trials of

independent systems is needed for this condition to hold, rather than a large

number of repeated trials on the same system.

The Poisson Theorem can be simplified to the following approximation for the

probability of an event occurring k times in n trials [Kend77],

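$$\begin{aligned}
\frac{n!}{k!\,(n - k)!}\, p^{k} q^{n - k}
&= \frac{n!}{k!\,(n - k)!} \left(\frac{n p}{n}\right)^{k} \left(1 - \frac{n p}{n}\right)^{n - k}, \\
&\approx \frac{\sqrt{2\pi}\, n^{n + \frac{1}{2}} e^{-n}}{k!\, \sqrt{2\pi}\, (n - k)^{n - k + \frac{1}{2}} e^{-(n - k)}}\, \frac{(n p)^{k}}{n^{k}} \left(1 - \frac{n p}{n}\right)^{n - k}, \\
&= \frac{(n p)^{k}}{k!}\, e^{-k} \left(1 - \frac{k}{n}\right)^{-\left(n - k + \frac{1}{2}\right)} \left(1 - \frac{n p}{n}\right)^{n - k}, \\
&\approx \frac{(n p)^{k}}{k!}\, e^{-k}\, e^{k}\, e^{-n p}, \\
&= \frac{(n p)^{k}}{k!}\, e^{-n p}.
\end{aligned} \tag{2.6}$$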

The exponential and Poisson expressions are directly related. A detailed

understanding of this relationship will aid in the development of the analysis

that follows.

Using the Poisson assumptions described in Figure 2, the probability of n

failures prior to time t is,

$$P(N = n \mid T \le t) = P_t(n). \tag{2.7}$$

From Eq. (2.7), the probability that no failures occur ($n = 0$) between time t

and time $t + \Delta t$ is,

$$P_{t + \Delta t}(0) = P_t(0)\,(1 - \lambda \Delta t), \tag{2.8}$$

where the term np describing the total number of failures is of moderate

magnitude [Fell67]. The probability that n failures occur between time t and

time $t + \Delta t$ is then,

$$P_{t + \Delta t}(n) = P_t(n)\,(1 - \lambda \Delta t) + P_t(n - 1)\,\lambda \Delta t, \quad n > 0. \tag{2.9}$$

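Using Eq. (2.9) and Eq. (2.8) and allowing $\Delta t \to 0$, a differential equation can

be constructed describing the rate at which failures occur between time t and

time $t + \Delta t$,

$$\begin{aligned}
\frac{dP_t(0)}{dt} &= -\lambda P_t(0), \\
\frac{dP_t(n)}{dt} &= -\lambda P_t(n) + \lambda P_t(n - 1), \quad \text{for } n > 0,
\end{aligned} \tag{2.10}$$

with the initial conditions of,

$$P_0(0) = 1, \qquad P_0(n) = 0 \ \text{ for } n > 0. \tag{2.11}$$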

The unique solution to the differential equation in Eq. (2.10) is [Klie75],

$$P_t(n) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!}, \quad n = 0, 1, 2, \ldots \tag{2.12}$$

which is the Poisson distribution defined in Eq. (2.6). Using Eq. (2.12), the

probability that no failures have occurred as of time t is,

$$P_t(0) = e^{-\lambda t}. \tag{2.13}$$

The complement of Eq. (2.13), $F(t) = 1 - P_t(0) = 1 - e^{-\lambda t}$, is the Cumulative

Distribution Function, CDF, of the time to the first failure of the Poisson failure process [Fell67]. By using

Eq. (2.19), the probability density function, pdf, of the Poisson process can

be given as,

$$f(t) = \lambda e^{-\lambda t}, \tag{2.14}$$

which is the exponential probability distribution. [2] The following statement

describes the relationship between the Poisson and exponential expressions

[Cox65],

If the number of failures occurring over an interval of time is Poisson distributed, then the time between failures is exponentially distributed.
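The statement above can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the original development: it draws exponentially distributed gaps between failures and compares the observed failure counts with Eq. (2.12). The rate, horizon, trial count, seed, and function names are all hypothetical choices.

```python
import math
import random

def count_failures(rate, horizon, rng):
    """Count failures in [0, horizon] when the gaps between failures are exponential."""
    elapsed, count = 0.0, 0
    while True:
        elapsed += rng.expovariate(rate)
        if elapsed > horizon:
            return count
        count += 1

def poisson_pmf(rate, horizon, n):
    """Eq. (2.12): probability of exactly n failures by time `horizon`."""
    mean = rate * horizon
    return mean ** n * math.exp(-mean) / math.factorial(n)

rng = random.Random(1980)                 # fixed seed so the run is repeatable
rate, horizon, trials = 0.5, 4.0, 20000   # illustrative values only
counts = [count_failures(rate, horizon, rng) for _ in range(trials)]

# Observed relative frequency next to the Poisson probability for small n.
for n in range(5):
    observed = counts.count(n) / trials
    print(n, round(observed, 3), round(poisson_pmf(rate, horizon, n), 3))
```

The observed frequencies track the Poisson probabilities closely, with the remaining differences attributable to sampling noise.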

An alternative method of relating the exponential and Poisson expressions is

useful at this point. The functions defined in Eq. (2.1) and Eq. (2.2) are based

on the interchangeability of the pdf and the CDF for any defined probability

distribution. The Cumulative Distribution Function $F(x)$ of a random variable

X is defined as a function obeying the following relationship [Papo65],

$$F(x) = P(X \le x), \quad -\infty < x < \infty. \tag{2.15}$$

[2] This development of the pdf is very informal. Making use of the forward reference to construct an

expression is circular logic and would not be permitted in more formal circumstances. For the purposes of

this work, this type of behavior can be tolerated, since the purpose of this development is to get to the

results rather than dwell on the analysis process. This is a fundamental difference between mathematics

and engineering.

The probability density function $f(x)$ of a random variable X can be derived

from the CDF using the following [Dave70],

$$f(x) = \frac{d}{dx} F(x). \tag{2.16}$$

The CDF can be obtained from the pdf by the following,

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \quad -\infty < x < \infty. \tag{2.17}$$

Using Eq. (2.16) and Eq. (2.17), the CDF and pdf expressions for an exponential

distribution can be developed. If the time between failures is an

exponentially distributed random variable, the CDF is,

$$F(t) = \begin{cases} 1 - e^{-\lambda t}, & 0 \le t < \infty, \\ 0, & \text{otherwise.} \end{cases} \tag{2.18}$$

The number of failures in the time interval $[0, t]$ is then a Poisson distributed random

variable, and the time between failures has a probability density function of,

$$f(t) = \frac{d}{dt} F(t) = \begin{cases} \lambda e^{-\lambda t}, & t \ge 0, \\ 0, & \text{otherwise,} \end{cases} \tag{2.19}$$

where t is a random variable denoting the time between failures.

Reliability Availability and Failure Density Functions

An expression for the reliability of a system can be developed using the following

technique. The probability of a failure as a function of time is defined as,

$$P(T \le t) = F(t), \quad t \ge 0, \tag{2.20}$$

where T is a random variable denoting the failure time. $F(t)$ is a function

defining the probability that the system will fail by time t. $F(t)$ is also the

Cumulative Distribution Function (CDF) of the random variable T [Papo65]. The

probability that the system will perform as intended at a certain time t is defined

as the Reliability function and is given by,

$$R(t) = 1 - F(t) = P(T > t). \tag{2.21}$$

If the random variable T describing the time to failure has a probability density

function $f(t)$, then using Eq. (2.21) the Reliability function is,

$$R(t) = 1 - F(t) = 1 - \int_{0}^{t} f(x)\,dx = \int_{t}^{\infty} f(x)\,dx. \tag{2.22}$$

Assuming the time to failure random variable T has an exponential distribution, its

failure density defined by Eq. (2.19) is,

$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0, \ \lambda > 0. \tag{2.23}$$

The resulting reliability function is then,

$$R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}. \tag{2.24}$$

A function describing the rate at which a system fails as a function of time is

referred to as the Hazard function (Eq. (2.4)). Let T be a random variable

representing the service life remaining for a specified system. Let $F(x)$ be the

distribution function of T and let $f(x)$ be its probability density function. A

new function $z(x)$, termed the Hazard Function or the Conditional Failure Function

of T, is given by $z(x) = \frac{f(x)}{1 - F(x)}$. The function $z(x)\,dx$ is the conditional

probability that the item will fail between x and $x + dx$ given it has survived a

time T greater than x.

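For a given hazard function $z(x)$ the corresponding distribution function is

$1 - F(x) = \left[1 - F(x_0)\right] \exp\left(-\int_{x_0}^{x} z(y)\,dy\right)$, where $x_0$ is an arbitrary value of x. In

a continuous time reliability model the hazard function is defined as the

instantaneous failure rate of the system [Kapu77],

$$\begin{aligned}
z(t) &= \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\, R(t)}, \\
&= -\frac{1}{R(t)} \frac{dR(t)}{dt}, \\
&= \frac{f(t)}{R(t)}, \\
&= \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}}, \\
&= \lambda.
\end{aligned} \tag{2.25}$$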

The quantity $z(t)\,dt$ represents the probability that a system of age t will fail in

the small interval of time $[t, t + dt]$. The hazard function is an important

indicator of the change in the failure rate over the life of the system. For a system

with an exponential failure distribution, the hazard function is constant as shown in

Eq. (2.25), and the exponential is the only distribution that exhibits this property [Barl85]. Other

reliability distributions will be shown in later chapters that have variable hazard

rates.
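The contrast between a constant and a variable hazard rate can be sketched numerically. The example below is illustrative only: the exponential rate is an arbitrary value, and the Weibull distribution with its shape and scale parameters is an assumed stand-in for "a distribution with a variable hazard rate", not one taken from this chapter.

```python
import math

def hazard(pdf, reliability, t):
    """z(t) = f(t) / R(t), the instantaneous failure rate of Eq. (2.25)."""
    return pdf(t) / reliability(t)

lam = 0.01  # assumed constant failure rate, illustrative only

def exp_pdf(t):
    return lam * math.exp(-lam * t)

def exp_rel(t):
    return math.exp(-lam * t)

# Hypothetical Weibull parameters; a shape greater than 1 models wear-out.
shape, scale = 2.0, 100.0

def weibull_rel(t):
    return math.exp(-((t / scale) ** shape))

def weibull_pdf(t):
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_rel(t)

for t in (10.0, 50.0, 100.0, 200.0):
    print(t, hazard(exp_pdf, exp_rel, t), hazard(weibull_pdf, weibull_rel, t))
```

The exponential column stays at the failure rate for every age, while the Weibull column grows with t, which is the kind of age dependence the exponential model cannot represent.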

If a system contains no redundancy – that is, every component must function

properly for the system to continue operation – and if component failures are

statistically independent, the system reliability function is the product of the

component reliabilities and, for exponentially distributed components, follows an exponential probability distribution. The

failure rate of such a system is the sum of the failure rates of the individual

components,

$$R_{sys}(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = \exp\left(-\sum_{i=1}^{n} \lambda_i t\right). \tag{2.26}$$
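Eq. (2.26) can be exercised with a short sketch. The component failure rates and mission time below are hypothetical values chosen only for illustration, and the function names are not from the original text.

```python
import math

def series_failure_rate(rates):
    """Failure rate of a non-redundant system: the sum of the component rates."""
    return sum(rates)

def series_reliability(rates, t):
    """Product of the component reliabilities exp(-lambda_i * t), as in Eq. (2.26)."""
    product = 1.0
    for rate in rates:
        product *= math.exp(-rate * t)
    return product

rates = [1e-4, 2.5e-4, 5e-5]   # hypothetical component failure rates per hour
t = 1000.0                     # hypothetical mission time in hours

print(series_reliability(rates, t))
print(math.exp(-series_failure_rate(rates) * t))   # same value from the summed rate
```

Both lines print the same reliability, showing that the product of exponential component reliabilities is again exponential with the summed failure rate.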

In most cases it is possible to repair or replace failed components and accurate

models of system reliability will consider this. As will be shown the repair activity

is not as easily modeled as the failure mechanisms.

For systems that can be repaired, a new measure of reliability can be defined,

The probability that the system is operational at time "t."

This new measure is the Availability and is expressed as $A(t)$. Availability

$A(t)$ differs from reliability $R(t)$ in that any number of system failures can

occur prior to time t but the system is considered available if those failures have

been repaired prior to time t.

For systems that can be repaired, it is assumed that the behavior of the repaired

system and the original system are identical from a failure standpoint. In general,

this is not true, as perfect renewal of the system configuration is not possible. The

terms Mean Time to First Failure and Mean Time to Second Failure now become

relevant.

Assuming a constant failure rate $\lambda$, a constant repair rate $\mu$, and identical failure

behaviors between the repaired system and the original system, the steady–state

system availability can be expressed as,

$$A_{SS} = \frac{\mu}{\lambda + \mu}. \tag{2.27}$$

The expression in Eq. (2.27) is an approximation. The exact expression for the

availability with repair requires the solution of the appropriate Markov model,

which will be developed in a later chapter.

Mean Time to Failure

The Mean Time to Failure (MTTF) is the expected time to the first failure in a

population of identical systems, given a successful system startup at time $t = 0$.

The Cumulative Distribution function $F(x)$ in Eq. (2.15) and the probability

density function $f(x)$ in Eq. (2.16) characterize the behavior of the probability

distribution function of the underlying random failure process. These expressions

are in a continuous integral form and require the solution of integral equations to

produce a usable result. A concise parameter that describes the expected value

of the random process is useful for comparison of different reliability models.

This parameter is the Mean or Expected Value of the random variable denoted by

$E[X]$ and is defined by [Parz60], [Dave70],

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{2.28}$$

The expression in Eq. (2.28) denotes the expected value of a random variable with the continuous

density function $f(x)$. It is important to note that this definition assumes $x f(x)$ is

integrable in the interval $(-\infty, \infty)$.

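For an exponential probability density function of,

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0, \tag{2.29}$$

the mean or expected value of the exponential function is given by,

$$E[X] = \int_{0}^{\infty} x f(x)\,dx = \int_{0}^{\infty} \lambda x e^{-\lambda x}\,dx. \tag{2.30}$$

The evaluation of Eq. (2.30) can be done in a straightforward manner using the

Gamma function [Arfk70], which is defined as,

$$\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\,dx, \quad \alpha > 0, \tag{2.31}$$

or alternately,

$$\Gamma(\alpha) = \lambda^{\alpha} \int_{0}^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx. \tag{2.32}$$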

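Rewriting the expression in Eq. (2.30) for the expected value as,

$$E[X] = \frac{1}{\lambda} \int_{0}^{\infty} u e^{-u}\,du, \tag{2.33}$$

where substituting the variables,

$$u = \lambda x \quad \text{and} \quad du = \lambda\,dx, \tag{2.34}$$

results in,

$$\begin{aligned}
E[X] &= \frac{1}{\lambda} \int_{0}^{\infty} u e^{-u}\,du, \\
&= \frac{1}{\lambda}\,\Gamma(2), \\
&= \frac{1}{\lambda},
\end{aligned} \tag{2.35}$$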

which is the MTTF for a simple system. Although this expression is useful for

simple systems, a general–purpose expression representing the MTTF is needed.

This function can be developed in the following manner.

Let X denote the lifetime of a system so that the reliability function is,

$$R(t) = P(X > t), \tag{2.36}$$

and the derivative of the reliability function, which follows from Eq. (2.21) and

Eq. (2.22), is,

$$\frac{d}{dt} R(t) = -f(t). \tag{2.37}$$

The expression for the expected value or MTTF using Eq. (2.28) is given by:

$$E[X] = \int_{0}^{\infty} t f(t)\,dt = -\int_{0}^{\infty} t\, \frac{d}{dt} R(t)\,dt. \tag{2.38}$$

The technique of integration by parts [Smai49], [Arfk70], shown in

Eq. (2.39),

$$\int_{a}^{b} f(x) \frac{d}{dx} g(x)\,dx = f(x)\, g(x) \Big|_{a}^{b} - \int_{a}^{b} g(x) \frac{d}{dx} f(x)\,dx, \tag{2.39}$$

can be used to evaluate Eq. (2.38). Integrating by parts gives the expected value as,

$$E[X] = -t R(t) \Big|_{0}^{\infty} + \int_{0}^{\infty} R(t)\,dt. \tag{2.40}$$

Since $R(t)$ approaches zero faster than t approaches infinity, Eq. (2.40) can be

reduced to,

$$E[X] = \int_{0}^{\infty} R(t)\,dt = MTTF, \tag{2.41}$$

which is the expression for the Mean Time to Failure for a general system

configuration. This direct relationship between MTTF and the system failure rate

is one reason the constant failure rate assumption is often made when the

supporting reliability data is scanty [Barl75]. Appendix G describes the analysis of

the variance for this distribution.
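Eq. (2.41) lends itself to a direct numerical check. The following sketch approximates the integral of the reliability function with the trapezoidal rule; the failure rate, the finite upper limit that stands in for infinity, and the step count are illustrative assumptions.

```python
import math

def mttf(reliability, upper, steps=200000):
    """Approximate the integral of R(t) over [0, upper] with the trapezoidal rule."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

lam = 0.001   # assumed failure rate, illustrative only

# Truncating at 50 mean lifetimes leaves a negligible tail for an exponential R(t).
estimate = mttf(lambda t: math.exp(-lam * t), upper=50.0 / lam)
print(estimate, 1.0 / lam)
```

The same `mttf` helper accepts any reliability function, so it can be reused for configurations where the integral has no simple closed form.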

Using an exponential failure distribution implies two important behaviors for the

system,

Since a used subsystem is stochastically as good as a new subsystem, a policy of scheduled replacement of used subsystems which are known to still be functioning, does not increase the lifetime of the system.

In estimating the mean system life and reliability, data can be collected consisting only of the number of hours of observed life and the number of observed failures; the ages of the subsystems under observation are of no concern.

Mean Time to Repair

The Mean Time to Repair (MTTR) is the expected time for the repair of a failed

system or subsystem. For exponential distributions, $MTTF = \frac{1}{\lambda}$ and

$MTTR = \frac{1}{\mu}$. The steady state availability $A_{SS}$ defined in Eq. (2.27) can be

rewritten in terms of these parameters,

$$A_{SS} = \frac{MTTF}{MTTR + MTTF}. \tag{2.42}$$
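The equivalence of Eq. (2.27) and Eq. (2.42) can be shown with a small sketch. The MTTF and MTTR figures are hypothetical values, and the function names are illustrative.

```python
def availability_from_rates(lam, mu):
    """Eq. (2.27): steady-state availability from the failure and repair rates."""
    return mu / (lam + mu)

def availability_from_means(mttf, mttr):
    """Eq. (2.42): the same quantity written with MTTF = 1/lam and MTTR = 1/mu."""
    return mttf / (mttr + mttf)

mttf_hours, mttr_hours = 10000.0, 4.0   # hypothetical values for illustration

print(availability_from_means(mttf_hours, mttr_hours))
print(availability_from_rates(1.0 / mttf_hours, 1.0 / mttr_hours))
```

Both forms print the same availability; which one is more convenient depends on whether field data is recorded as rates or as mean times.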

Mean Time Between Failure

The Mean Time Between Failure (MTBF) is often mistakenly used in place of Mean

Time to Failure (MTTF). The MTBF is the mean time between failures in a system

with repair, and is derived from a combination of repair and failure processes.

The simplest approximation for MTBF is:

$$MTBF = MTTF + MTTR. \tag{2.43}$$

In this work, it is assumed that $MTTR \ll MTTF$ so that MTTF is used in place of

MTBF. The Mean Time to Failure is considered since in fault–tolerant systems

failure occurs only when the redundancy features of the system fail to function

properly. In the presence of perfect coverage and perfect repair the system should

operate continuously. Therefore, failure of the system implies total loss of system

capabilities.

Mean Time to First Failure

The Mean Time to Failure is defined as the expected time of the first failure in a

population of identical systems. This development depends on the assumptions

that the failure rate is constant, Eq. (2.25), that failure times are exponentially distributed, Eq. (2.14),

and that the repair rate $\mu$ is constant. In the general case, these assumptions may not

be valid and the Mean Time to Failure (MTTF) is not equivalent to the Mean Time to

First Failure (MTFF).

By removing the exponential probability failure distribution restriction in

Eq. (2.29) a generalized expression for the first failure time can be derived.

Given a population of n subsystems each with a random variable

$X_i,\ i = 1, 2, \ldots, n$ and a continuous pdf of $f(x)$, the failure time for the nth

subsystem is given by summing all the failure times prior to the failure,

$$S_n = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i. \tag{2.44}$$

If the random variables $X_1, X_2, \ldots, X_n$ are independent and identically

distributed, all with pdfs of $f(x)$, the random process described by these

variables is referred to as an Ordinary Renewal Process [Cox62], [Ross70]. The details

of the Renewal Process are shown in Appendix E.

Given the random process described by Eq. (2.44) the distribution function of $S_n$

is provided by convolving each individual distribution function $F(t)$. The

convolution of two functions is defined as [Brac65], [Papo65]:

$$f(x) * g(x) = \int_{-\infty}^{\infty} f(u)\, g(x - u)\,du. \tag{2.45}$$

The resulting convolution function for the n+1 subsystem failure, with $f(x)$ the density of $F(x)$, is given by:

$$F_{n+1}(t) = \int_{0}^{t} F_n(t - x)\, f(x)\,dx. \tag{2.46}$$

In renewal processes, the random variables are actually functions and can be

substituted in the reliability computations when:

$$N(t) = n \iff S_n \le t < S_{n+1}. \tag{2.47}$$

When the conditions in Eq. (2.47) are met, the probability of n renewals in a time

interval is given by,

$$\begin{aligned}
P(N(t) = n) &= P(S_n \le t < S_{n+1}), \\
&= P(S_n \le t) - P(S_{n+1} \le t), \\
&= F_n(t) - F_{n+1}(t).
\end{aligned} \tag{2.48}$$

The renewal function $H(t)$ can be defined as the average number of subsystem

failures and repairs as a function of time, and is given as,

$$H(t) = E[N(t)]. \tag{2.49}$$

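Using Eq. (2.48) in the evaluation of Eq. (2.49) and Eq. (2.30) as the definition of

the expectation value, gives the following for the renewal function,

$$\begin{aligned}
H(t) &= \sum_{n=0}^{\infty} n\, P(N(t) = n), \\
&= \sum_{n=0}^{\infty} n F_n(t) - \sum_{n=0}^{\infty} n F_{n+1}(t), \\
&= \sum_{n=0}^{\infty} n F_n(t) - \sum_{n=1}^{\infty} (n - 1) F_n(t).
\end{aligned} \tag{2.50}$$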

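Simplifying Eq. (2.50) results in an expression for the renewal function of,

$$H(t) = F(t) + \sum_{n=1}^{\infty} F_{n+1}(t). \tag{2.51}$$

The term $F_{n+1}$ is the convolution of $F_n$ with the density $f$ of $F$, which gives,

$$F_{n+1}(t) = \int_{0}^{t} F_n(t - x)\, f(x)\,dx, \tag{2.52}$$

which results in the expression for the renewal function of,

$$H(t) = F(t) + \sum_{n=1}^{\infty} \int_{0}^{t} F_n(t - x)\, f(x)\,dx. \tag{2.53}$$

Rearranging the integral term in Eq. (2.53) gives,

$$H(t) = F(t) + \int_{0}^{t} \left[\sum_{n=1}^{\infty} F_n(t - x)\right] f(x)\,dx. \tag{2.54}$$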

The summation term in Eq. (2.54) is the renewal function evaluated at $t - x$,

giving,

$$H(t) = F(t) + \int_{0}^{t} H(t - x)\, f(x)\,dx. \tag{2.55}$$

Using Eq. (2.16), the renewal density function $h(t)$ is the derivative of the

renewal function, giving,

$$h(t) = \frac{d}{dt} H(t). \tag{2.56}$$

Using Eq. (2.50) to evaluate the derivative, with $f_n$ the density of $F_n$, results in,

$$h(t) = \sum_{n=1}^{\infty} f_n(t), \tag{2.57}$$

and using Eq. (2.54) as a substitute for the right–hand side of Eq. (2.57) results in,

$$h(t) = f(t) + \int_{0}^{t} h(t - x)\, f(x)\,dx. \tag{2.58}$$

Eq. (2.58) is known as the Renewal Equation [Ross70]. To solve the renewal

equation, the Laplace transform will be used. The transform of the probability

density function is,

$$f^{*}(s) = \int_{0}^{\infty} e^{-sx} f(x)\,dx, \tag{2.59}$$

and the transform of the renewal density is,

$$h^{*}(s) = \int_{0}^{\infty} e^{-sx} h(x)\,dx. \tag{2.60}$$

Using the convolution property of the Laplace transform [Brac65], an equation

for the renewal density can be generated,

$$h^{*}(s) = f^{*}(s) + h^{*}(s)\, f^{*}(s), \tag{2.61}$$

and simplified to,

$$h^{*}(s) = \frac{f^{*}(s)}{1 - f^{*}(s)}. \tag{2.62}$$

Eq. (2.62) is now the generalized expression for the failure distribution for a

random process with an arbitrary probability distribution.
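As a worked check of the transform solution, consider the exponential case. This is an illustrative sketch, with an asterisk marking a Laplace transform; it assumes only the exponential density of Eq. (2.29) and standard transform pairs.

```latex
\begin{aligned}
f(t) &= \lambda e^{-\lambda t}
  &&\Rightarrow\quad f^{*}(s) = \int_{0}^{\infty} e^{-st}\, \lambda e^{-\lambda t}\,dt = \frac{\lambda}{s + \lambda}, \\
h^{*}(s) &= \frac{f^{*}(s)}{1 - f^{*}(s)}
  = \frac{\lambda / (s + \lambda)}{s / (s + \lambda)} = \frac{\lambda}{s}
  &&\Rightarrow\quad h(t) = \lambda, \\
H(t) &= \int_{0}^{t} h(x)\,dx = \lambda t.
\end{aligned}
```

The renewal density is the constant failure rate and the expected number of renewals grows linearly in time, which agrees with the mean of the Poisson distribution in Eq. (2.12).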

General Availability Analysis

The steady state system availability defined in Eq. (2.42) assumes an exponential

distribution for the failure rate of the system or subsystems. An important activity

in the analysis of Fault–Tolerant systems is the development of a general–

purpose availability expression, independent of the underlying failure distribution.

In the analysis that follows, it will be assumed that when a subsystem fails it is

repaired and the system restored to its functioning state. It will also be assumed

that the restored system functions as if it were new, that is with the failure

probability function restarted at $t = 0$.

Let $T_i$ be the duration of the ith functioning period and let $D_i$ be the system

downtime because of the failure of the system while the ith repair takes place.

These durations will form the basis of the renewal process.

By combining the subsystem failure interval and the subsystem repair duration, a

random variable sequence is constructed such that,

$$X_i = T_i + D_i; \quad i = 1, 2, \ldots \tag{2.63}$$

It must be assumed that the durations of the functioning periods are identically

distributed with a common Cumulative Distribution Function $W(t)$ and a common

probability density function $w(t)$ and that the repair periods are also identically

distributed with $G(t)$ and $g(t)$. Using these assumptions the terms in Eq. (2.63)

are also identically distributed such that,

$$\{X_i;\ i = 1, 2, \ldots\}, \tag{2.64}$$

meets the definition of a Renewal process developed in Eq. (2.44). Using this

development, the Laplace transform of the convolution of the two independent random

processes is given by,

$$f^{*}(s) = w^{*}(s)\, g^{*}(s). \tag{2.65}$$

Using Eq. (2.62) gives,

$$h^{*}(s) = \frac{w^{*}(s)\, g^{*}(s)}{1 - w^{*}(s)\, g^{*}(s)}. \tag{2.66}$$

The average number of repairs $M(t)$ in the time interval $[0, t]$ has the Laplace

transform:

$$M^{*}(s) = \frac{w^{*}(s)\, g^{*}(s)}{s \left[1 - w^{*}(s)\, g^{*}(s)\right]}. \tag{2.67}$$

Instantaneous Availability

The steady state availability defined in Eq. (2.42) can now be replaced with the

instantaneous availability $A(t)$. In the absence of a repair mechanism the

availability $A(t)$ is equivalent to the reliability, $R(t) = A(t)$, of the

subsystem.

The subsystem may be functioning at time t because of two mutually exclusive

reasons,

The subsystem has not failed from the beginning.

The last renewal occurred within the time period and the subsystem continued to function since that time.

The probability associated with the second case is the convolution of the

reliability function and the renewal density, giving,

$$\int_{0}^{t} R(t - x)\, h(x)\,dx, \tag{2.68}$$

which results in an expression for the instantaneous availability of,

$$A(t) = R(t) + \int_{0}^{t} R(t - x)\, h(x)\,dx. \tag{2.69}$$

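Taking the Laplace transform of both sides of Eq. (2.69) gives,

$$\begin{aligned}
A^{*}(s) &= R^{*}(s) + R^{*}(s)\, h^{*}(s), \\
&= R^{*}(s) \left[1 + h^{*}(s)\right], \\
&= R^{*}(s) \left[1 + \frac{w^{*}(s)\, g^{*}(s)}{1 - w^{*}(s)\, g^{*}(s)}\right].
\end{aligned} \tag{2.70}$$

Since the reliability of the system is given as $R(t) = 1 - W(t)$,

$$\begin{aligned}
R^{*}(s) &= \frac{1}{s} - W^{*}(s), \\
&= \frac{1}{s} - \frac{w^{*}(s)}{s} = \frac{1 - w^{*}(s)}{s}.
\end{aligned} \tag{2.71}$$

Substituting gives,

$$A^{*}(s) = \frac{1 - w^{*}(s)}{s \left[1 - w^{*}(s)\, g^{*}(s)\right]}. \tag{2.72}$$

Given the failure–time distribution and the repair–time distribution, Eq. (2.72) can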

be used to compute the instantaneous availability as a function of time.
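For exponentially distributed failure and repair times, Eq. (2.72) can be inverted by partial fractions to give $A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$. The sketch below evaluates this closed form and cross-checks it against a simple Euler integration of the two-state up/down rate equation; the rates, the time point, and the use of the two-state model as a check are assumptions made for illustration.

```python
import math

def availability(lam, mu, t):
    """Closed-form A(t) from inverting Eq. (2.72) with exponential w(t) and g(t)."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def availability_euler(lam, mu, t, steps=100000):
    """Cross-check: integrate dA/dt = -lam*A + mu*(1 - A) from A(0) = 1."""
    a, h = 1.0, t / steps
    for _ in range(steps):
        a += h * (-lam * a + mu * (1.0 - a))
    return a

lam, mu = 0.01, 0.5   # assumed failure and repair rates, illustrative only
print(availability(lam, mu, 5.0), availability_euler(lam, mu, 5.0))
print(availability(lam, mu, 1e6), mu / (lam + mu))   # long-run value
```

The closed form starts at 1 for a system that is up at time zero and decays toward the steady-state value of Eq. (2.27).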

Limiting Availability

An important question to ask is – what is the availability of the system after some long

period of time? The limiting availability $A(t)$ as $t \to \infty$ is defined as $A$ or simply

the Availability.

To derive an expression for the limiting availability the Final Value Theorem of

the Laplace transform can be used [Doet61], [Widd46], [Brac65], [Ogat70], [Gupt66].

This theorem states that the steady state behavior of $f(t)$ is the same as the

behavior of $s f^{*}(s)$, where $f^{*}(s)$ is the Laplace transform of $f(t)$, in the neighborhood of $s = 0$. Thus it is possible to obtain the

value of $f(t)$ as $t \to \infty$.

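Let,

$$F(t) = \int_{0}^{t} f(x)\,dx + F(0), \tag{2.73}$$

then using a table of Laplace transforms [Doet61], [Brac65],

$$s F^{*}(s) - F(0) = f^{*}(s) = \int_{0}^{\infty} e^{-st} f(t)\,dt, \tag{2.74}$$

and by letting $s \to 0$,

$$\begin{aligned}
\lim_{s \to 0} s F^{*}(s) &= \int_{0}^{\infty} f(t)\,dt + F(0), \\
&= \lim_{t \to \infty} \int_{0}^{t} f(x)\,dx + F(0), \\
&= \lim_{t \to \infty} F(t).
\end{aligned} \tag{2.75}$$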

The limiting availability is then given as,

$$A = \lim_{t \to \infty} A(t) = \lim_{s \to 0} s A^{*}(s). \tag{2.76}$$

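For small values of s the following approximations can be made [Apos74],

$$e^{-st} \approx 1 - st, \tag{2.77}$$

giving,

$$\begin{aligned}
w^{*}(s) &= \int_{0}^{\infty} e^{-st} w(t)\,dt, \\
&\approx \int_{0}^{\infty} w(t)\,dt - s \int_{0}^{\infty} t\, w(t)\,dt, \\
&= 1 - \frac{s}{\lambda},
\end{aligned} \tag{2.78}$$

where $\frac{1}{\lambda} = MTTF$ and,

$$g^{*}(s) \approx 1 - \frac{s}{\mu}, \tag{2.79}$$

and where $\frac{1}{\mu} = MTTR$, giving the limiting availability as,

$$A = \lim_{s \to 0} \frac{1 - \left(1 - \frac{s}{\lambda}\right)}{1 - \left(1 - \frac{s}{\lambda}\right)\left(1 - \frac{s}{\mu}\right)} = \frac{\frac{1}{\lambda}}{\frac{1}{\lambda} + \frac{1}{\mu}} = \frac{MTTF}{MTTF + MTTR}. \tag{2.80}$$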

Eq. (2.80) is an important result in the analysis of system reliability, because it

shows that the limiting availability depends only on the Mean Time to Failure and

the Mean Time to Repair and not on the underlying distributions of the failure and

repair times.
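The distribution-free nature of Eq. (2.80) can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions: it alternates up and down periods for two hypothetical systems that share the same MTTF and MTTR but use different distributions (exponential times in one case, uniform uptimes with a fixed repair time in the other). The specific distributions, means, cycle count, and seed are arbitrary choices.

```python
import random

def simulated_availability(draw_uptime, draw_repair, cycles, rng):
    """Fraction of time spent up over a number of failure/repair cycles."""
    up = down = 0.0
    for _ in range(cycles):
        up += draw_uptime(rng)
        down += draw_repair(rng)
    return up / (up + down)

MTTF, MTTR = 100.0, 5.0     # assumed means, illustrative only
rng = random.Random(42)     # fixed seed so the run is repeatable

# Case 1: exponential uptimes and exponential repair times.
exp_case = simulated_availability(
    lambda r: r.expovariate(1.0 / MTTF),
    lambda r: r.expovariate(1.0 / MTTR),
    200000, rng)

# Case 2: uniform uptimes with the same mean and a constant repair time.
other_case = simulated_availability(
    lambda r: r.uniform(0.0, 2.0 * MTTF),
    lambda r: MTTR,
    200000, rng)

expected = MTTF / (MTTF + MTTR)
print(exp_case, other_case, expected)
```

Both simulated values settle near MTTF / (MTTF + MTTR), even though the two systems have quite different failure and repair time distributions.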

Chapter 3

SYSTEM RELIABILITY

This chapter provides the basis for the computation of the overall system

reliability given a redundant architecture with partial fault detection coverage.

Redundant systems can be modeled under a variety of operational assumptions. Of

most interest in this work are dual and triple redundant systems that contain

repair facilities.

Series Systems

Creating a reliable system often involves a series or parallel combination of

independent systems or subsystems. If $R_i(t)$ is the reliability of module i and all

the modules are statistically independent, then the overall system reliability of

modules connected in series is,

$$R_{series}(t) = \prod_{i=1}^{n} R_i(t). \tag{3.1}$$

For a series system the failure probability $F_{series}(t)$ is given by,

$$\begin{aligned}
F_{series}(t) &= 1 - R_{series}(t) = 1 - \prod_{i=1}^{n} R_i(t), \\
&= 1 - \prod_{i=1}^{n} \left[1 - F_i(t)\right].
\end{aligned} \tag{3.2}$$

Expanding Eq. (3.1) will illustrate an aspect of the exponential distribution. For a

system of n subsystems connected in series the reliability of the system is given by

Eq. (3.1). If a general purpose hazard function is used for the failure rate

[Shoo68] defined by,

$$h_i(t) = \lambda_i + c_i t^{k}, \tag{3.3}$$

where $\lambda_i$, $c_i$, and k are constants, then the reliability function for the individual

subsystem is given by,

$$R_i(t) = \exp\left(-\lambda_i t - c_i \frac{t^{k+1}}{k + 1}\right), \tag{3.4}$$

and the reliability function for the system is given by,

$$R_{series}(t) = \exp\left(-\sum_{i=1}^{n} \lambda_i t - \sum_{i=1}^{n} c_i \frac{t^{k+1}}{k + 1}\right). \tag{3.5}$$

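Defining a new term for the summation of the failure rates and new terms for

the time constant adjustment gives, $\lambda = \sum_{i=1}^{n} \lambda_i$, $c = \sum_{i=1}^{n} c_i$, and $T = \lambda t$, which results

in the series reliability expression of,

$$R_{series}(t) = \exp\left(-T - \frac{c}{\lambda^{k+1}} \frac{T^{k+1}}{k + 1}\right). \tag{3.6}$$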

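As the number of subsystems grows large ($n \to \infty$), the term $\frac{c}{\lambda^{k+1}}$ is

bounded (for $k > 0$ and subsystems of comparable size it shrinks as n grows, since c grows in proportion to n while $\lambda^{k+1}$ grows as $n^{k+1}$) and the expression for the system reliability becomes,

$$\lim_{n \to \infty} R_{series}(t) = e^{-T} = e^{-\lambda t}. \tag{3.7}$$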

Eq. (3.7) defines the failure distribution of the system as the number of

subsystems grows without bound. This implies that a large complex system will

tend to follow exponential distribution failure models regardless of the internal

organization of the subsystems.

Parallel Systems

In a parallel redundant configuration, the system fails only if all modules fail. The

probability of a system failure in a parallel system is given by,

$$F_{parallel}(t) = \prod_{i=1}^{n} F_i(t). \tag{3.8}$$

The system reliability for a parallel system is given by,

$$\begin{aligned}
R_{parallel}(t) &= 1 - F_{parallel}(t) = 1 - \prod_{i=1}^{n} F_i(t), \\
&= 1 - \prod_{i=1}^{n} \left[1 - R_i(t)\right].
\end{aligned} \tag{3.9}$$
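Eq. (3.9) can be sketched in a few lines. The module reliabilities below are hypothetical values at some fixed mission time, and the function name is illustrative.

```python
def parallel_reliability(module_reliabilities):
    """Eq. (3.9): the system fails only if every module fails."""
    unreliability = 1.0
    for r in module_reliabilities:
        unreliability *= (1.0 - r)
    return 1.0 - unreliability

# Hypothetical module reliabilities at a fixed mission time.
print(parallel_reliability([0.9]))            # a single module
print(parallel_reliability([0.9, 0.9]))       # two modules in parallel
print(parallel_reliability([0.9, 0.8, 0.7]))  # three dissimilar modules
```

Each added module multiplies the system unreliability by that module's own unreliability, so the benefit of further redundancy diminishes as modules are added. This sketch assumes statistically independent modules and perfect switching between them.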

M–of–N Systems

An M–of–N system is a generalized form of the parallel system. Instead of requiring

only one of the N modules of the system to remain functional, M modules are

required. The system of interest in this work is a Triple Modular Redundant (TMR)

configuration in which two of the three modules must function for the system to

operate properly [Lyons 62], [Kuehn 69]. [3] For a given module reliability of $R_m$

the TMR reliability is given by,

$$R_{tmr} = R_m^{3} + \binom{3}{2} R_m^{2} \left(1 - R_m\right). \tag{3.10}$$

In Eq. (3.10) all working states are enumerated. The $R_m^{3}$ term represents the

state in which all three modules are functional. The $\binom{3}{2} R_m^{2} (1 - R_m)$ term

represents the three states in which any one module has failed and the other two

modules are functional.

[3] In practical TMR systems, a simplex mode is allowed, which usually places the system in a shutdown mode,

allowing the controlled process to be safely stopped.
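The state enumeration behind Eq. (3.10) generalizes to any M–of–N arrangement of identical, independent modules. The following is a minimal sketch under that assumption; the function name and the module reliability value are illustrative.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent modules are working."""
    return sum(comb(n, k) * r ** k * (1.0 - r) ** (n - k) for k in range(m, n + 1))

r_m = 0.9   # hypothetical module reliability

print(m_of_n_reliability(2, 3, r_m))        # TMR, Eq. (3.10)
print(r_m ** 3 + 3 * r_m ** 2 * (1 - r_m))  # Eq. (3.10) written out directly
print(m_of_n_reliability(1, 3, r_m))        # 1-of-3, the parallel case of Eq. (3.9)
```

Setting m = 1 recovers the parallel system and m = n recovers the series system, so the two earlier configurations are the end points of this family.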

Selecting the Proper Evaluation Parameters

In comparing different redundant system configurations, it is desirable to

summarize their reliability by a single parameter. The reliability may be an

arbitrary complex function of time. The selection of the wrong summary

parameter could lead to incorrect conclusions, as will be shown below.

Consider a simplex system, with a reliability function of,

$$R_{simplex}(t) = e^{-\lambda t}, \tag{3.11}$$

and using Eq. (2.41) to derive the Mean Time to Failure results in,

$$MTTF_{simplex} = \frac{1}{\lambda}. \tag{3.12}$$

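For a TMR system with an exponential reliability function,

$$\begin{aligned}
R_{tmr}(t) &= e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t} \left(1 - e^{-\lambda t}\right), \\
&= 3 e^{-2\lambda t} - 2 e^{-3\lambda t},
\end{aligned} \tag{3.13}$$

and using Eq. (2.41) results in a Mean Time to Failure of,

$$MTTF_{tmr} = \frac{3}{2\lambda} - \frac{2}{3\lambda}. \tag{3.14}$$

Comparing the simplex and TMR reliability expressions gives,

$$MTTF_{tmr} = \frac{5}{6\lambda} < \frac{1}{\lambda} = MTTF_{simplex}. \tag{3.15}$$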

By using the MTTF figure of merit, the TMR system can be shown to be less

reliable than the Simplex system. The above equations do not include the facility

for module repair. Once the TMR system has exhausted its redundancy, there is

more hardware to fail than the remaining modules of the non–redundant system.

This effect lowers the total system reliability. With online repair, the MTTF figure

of merit for the TMR system becomes an important measure of the overall

system reliability.

These results illustrate why simplistic assumptions and calculations may result in

erroneous information.
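The comparison can be made concrete with a short sketch that evaluates Eq. (3.11) and Eq. (3.13) at several mission times. The failure rate is an arbitrary illustrative value, and the crossover point is derived here by setting the two reliability expressions equal, which gives a module reliability of one half, that is $\lambda t = \ln 2$.

```python
import math

def r_simplex(lam, t):
    """Eq. (3.11): reliability of a single module."""
    return math.exp(-lam * t)

def r_tmr(lam, t):
    """Eq. (3.13): reliability of a TMR system without repair."""
    return 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)

lam = 1e-3                         # assumed module failure rate, illustrative only
crossover = math.log(2.0) / lam    # time at which the two reliabilities are equal

for t in (0.1 / lam, crossover, 2.0 / lam):
    print(round(t, 1), round(r_simplex(lam, t), 4), round(r_tmr(lam, t), 4))

print(1.0 / lam, 5.0 / (6.0 * lam))   # MTTF of simplex and TMR, Eq. (3.12) and (3.15)
```

For missions much shorter than the crossover time the TMR system is the more reliable of the two, even though its MTTF is lower. A single summary figure therefore hides the mission-time dependence of the comparison.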

Chapter 4

IMPERFECT FAULT COVERAGE AND RELIABILITY

Reliability models of systems with dynamic redundancy usually depend on perfect

fault detection [Arno73], [Stif80]. The ability of the system to detect faults that

occur can be classified as [Geis84],

Covered – faults that are detected. The probability that a fault belongs to this class is given by c.

Uncovered – faults that are not detected. The probability that a fault belongs to this class is given by $1 - c$.

The underlying diagnostic firmware and hardware may not provide perfect

coverage for many reasons, primarily due to the complexity of the system under

diagnosis [Rous79], [Cona72], [Wood79], [Soma86]. Because of this built–in

complexity, an exhaustively tested set of diagnostics may not be possible.

Another factor affecting the diagnostic coverage is the presence of intermittent

faults [Dahb82], [Mall78]. The detection and analysis of these intermittent or

permanent faults is further complicated by the presence of transient faults which

behave as real faults but are only present in the system for a short time [Glas82],

[Sosn86]. Modeling a fault–tolerant system in the presence of imperfect fault

coverage becomes an important aspect in predicting the overall system reliability.

Redundant System with Imperfect Coverage

Before developing the Markov method of analyzing Fault–Tolerant systems, a

conditional probability method will be used to derive the MTTF and MTBF for a

redundant system with imperfect fault detection [Bour69]. Assume that the failure

rate for each subsystem of the redundant system is $\lambda$ and that the subsystem

lifetimes are independent and exponentially distributed. Let X denote the lifetime of a system with two modules, one

active and the other in standby mode. Assume that the module in the standby

mode does not experience a fault during the mission time interval. [4] Let Y be a

random variable where, Y = 0 if a fault is not covered, and Y = 1 if a fault is

covered, then, $P(Y = 0) = 1 - c$ and $P(Y = 1) = c$.

To compute the MTTF of this system, the conditional expectation value of the

system lifetime X given the fault coverage state Y must be derived.

If an uncovered fault occurs the MTTF of the system is the MTTF of the initially

active module,

$$E[X \mid Y = 0] = \frac{1}{\lambda}. \tag{4.1}$$

If a covered fault occurs the MTTF of the system is the sum of the MTTF of the

active module and the MTTF of the inactive module,

$$E[X \mid Y = 1] = \frac{2}{\lambda}. \tag{4.2}$$

The total expectation value of the system lifetime is then given by,

$$E[X] = \frac{1 - c}{\lambda} + \frac{2c}{\lambda} = \frac{1 + c}{\lambda} = MTTF. \tag{4.3}$$

The computation of the system reliability depends on the combination of the two

independent exponential distribution functions when a covered fault occurs,

$$f_{X \mid Y}(t \mid y = 1) = \lambda^{2} t e^{-\lambda t}, \tag{4.4}$$

and when an uncovered fault occurs

$$f_{X \mid Y}(t \mid y = 0) = \lambda e^{-\lambda t}. \tag{4.5}$$

[4] This is an invalid assumption in a practical sense, but it greatly simplifies this example.

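The joint exponential distribution function for both conditions is given by,

$$\begin{aligned}
f(t, y) &= f_{X \mid Y}(t \mid y)\, P(y), \\
f(t, y) &= (1 - c)\, \lambda e^{-\lambda t}; \quad t > 0,\ y = 0, \\
f(t, y) &= c\, \lambda^{2} t e^{-\lambda t}; \quad t > 0,\ y = 1,
\end{aligned} \tag{4.6}$$

and the marginal density function of X is computed by summing over the joint

density function,

$$f(t) = c \lambda^{2} t e^{-\lambda t} + (1 - c) \lambda e^{-\lambda t}. \tag{4.7}$$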

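The system reliability as a function of the coverage is then given by integrating

the marginal density function in Eq. (4.7) to give,

$$\begin{aligned}
R(t) &= 1 - \int_{0}^{t} f(x)\,dx, \\
&= 1 - \int_{0}^{t} \left[c \lambda^{2} x e^{-\lambda x} + (1 - c) \lambda e^{-\lambda x}\right] dx, \\
&= 1 - c \left[1 - (1 + \lambda t) e^{-\lambda t}\right] - (1 - c) \left[1 - e^{-\lambda t}\right], \\
&= (1 + c \lambda t)\, e^{-\lambda t}.
\end{aligned} \tag{4.8}$$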
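The effect of the coverage factor can be explored with a short sketch that evaluates Eq. (4.8) and confirms, by numerical integration as in Eq. (2.41), that it is consistent with the MTTF of Eq. (4.3). The failure rate, coverage values, integration limit, and helper names are illustrative assumptions.

```python
import math

def standby_reliability(lam, c, t):
    """Eq. (4.8): R(t) = (1 + c*lam*t) * exp(-lam*t) for coverage c."""
    return (1.0 + c * lam * t) * math.exp(-lam * t)

def standby_mttf(lam, c):
    """Eq. (4.3): MTTF = (1 + c) / lam."""
    return (1.0 + c) / lam

def integrate(fn, upper, steps=200000):
    """Trapezoidal approximation of the integral of fn over [0, upper]."""
    h = upper / steps
    total = 0.5 * (fn(0.0) + fn(upper))
    for i in range(1, steps):
        total += fn(i * h)
    return total * h

lam = 1e-3   # assumed module failure rate, illustrative only
for c in (0.0, 0.9, 1.0):
    numeric = integrate(lambda t: standby_reliability(lam, c, t), upper=60.0 / lam)
    print(c, round(numeric, 2), standby_mttf(lam, c))
```

With c = 0 the standby module contributes nothing and the result is the simplex MTTF; with c = 1 the MTTF doubles. Intermediate coverage values scale the benefit of the spare linearly.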

Generalized Imperfect Coverage

In the previous example, the system consisted of two modules, one in the active

state and one in the standby state. The conditional probability that a fault will go

undetected (uncovered) was computed using the conditional probability that the

system will survive for a specified period. Cox [Cox55] analyzed the general case

of a stage–type conditional probability distribution. The principle on which the

method of stages is based is the memoryless property of the exponential

distribution of Eq. (2.1) [Klie75]. The lack of memory is defined by the fact that

the distribution of the time remaining for an exponentially distributed random

variable is independent of the current age of the random variable, that is the

variable is memoryless. Appendix D develops further the memoryless property of

random variables with exponential distributions.
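The memoryless property can be illustrated with a few lines of code. This is a sketch under assumed values (the rate `lam` and the ages are arbitrary): the probability of surviving a further 250 hours is the same whatever the current age of the module.

```python
import math

def survival(t, lam):
    """P(X > t) for an exponentially distributed lifetime with rate lam."""
    return math.exp(-lam * t)

def conditional_survival(age, t, lam):
    """P(X > age + t | X > age): survival for t more hours given survival to `age`."""
    return survival(age + t, lam) / survival(age, lam)

lam = 0.002  # assumed failure rate per hour
for age in (0.0, 100.0, 1000.0):
    # each line prints the same conditional probability, independent of age
    print(age, conditional_survival(age, 250.0, lam), survival(250.0, lam))
```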

In the generalized model, it is assumed that individual modules are always in one

of two states – working or failed. It is also assumed that the modules are

statistically independent and module repair can take place while the remainder of

the system continues to function.

In the general case of N active and S standby modules, the lifetime of the system is defined by a stage–type distribution. An active module has an exponential failure distribution with a constant failure rate $\lambda$. Assume that the modules in the standby state can fail at a rate $\mu$ (presuming $0 \le \mu \le \lambda$). Let $X_i$ $(1 \le i \le N)$ be a random variable denoting the lifetime of the active modules and let $Y_j$ $(1 \le j \le S)$ be a random variable denoting the lifetime of the standby modules. The system lifetime $L(m \mid N, S)$ is then,

\[
\begin{aligned}
L(m \mid N, S) &= \min\{X_1, X_2, \ldots, X_N;\ Y_1, Y_2, \ldots, Y_S\} + L(m \mid N, S - 1), \\
&= W(N, S) + L(m \mid N, S - 1),
\end{aligned}
\tag{4.9}
\]

where $W(N, S)$ is the time to first failure among the $N + S$ modules. After the removal of the failed module, the system has N active modules and $S - 1$ standby modules. As a result, the remaining $N + S - 1$ modules have not aged, by the memoryless exponential assumption, and therefore the system lifetime is,

\[ L(m \mid N, S) = L(m \mid N, 0) + \sum_{i=1}^{S} W(N, i). \tag{4.10} \]

Here $L(m \mid N, 0) = L(m \mid N)$ is the lifetime of the m–out–of–N system and is therefore a $k^{th}$ order statistic with $k = N - m + 1$ [Kend77]. The distribution of $L(m \mid N, 0)$ is an $(N - m + 1)$–phase Hypoexponential distribution with parameters $N\lambda, (N - 1)\lambda, \ldots, m\lambda$. The time to first failure $W(N, i)$ has an exponential distribution with the parameter $N\lambda + i\mu$.

Using Theorem D.1 in Appendix D, the distribution of $L(m \mid N, S)$ is an $(N + S - m + 1)$–stage Hypoexponential distribution [Koba78], [Cox55], [Ash70] with parameters $N\lambda + S\mu, N\lambda + (S - 1)\mu, \ldots, N\lambda + \mu, N\lambda, (N - 1)\lambda, \ldots, m\lambda$.

Let $R_{m \mid N, S}(t)$ denote the reliability of such a system, then the reliability function is defined as,

\[ R_{m \mid N, S}(t) = \sum_{i=1}^{S} a_i\, e^{-(N\lambda + i\mu)t} + \sum_{i=m}^{N} b_i\, e^{-i\lambda t}, \tag{4.11} \]

where,

\[ a_i = \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{N\lambda + j\mu}{j\mu - i\mu} \prod_{j=m}^{N} \frac{j\lambda}{j\lambda - N\lambda - i\mu}, \tag{4.12} \]

and,

\[ b_i = \prod_{j=1}^{S} \frac{N\lambda + j\mu}{N\lambda + j\mu - i\lambda} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j\lambda}{j\lambda - i\lambda}. \tag{4.13} \]
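The stage rates and the reliability sum of Eq. (4.11) can be evaluated directly. The sketch below assumes that the stage parameters are the ones listed above (N lambda + S mu down to N lambda + mu, then N lambda down to m lambda) and that the coefficients take the standard hypoexponential product form; the function names and the numerical values are illustrative, not from the original text. The product form requires all stage rates to be distinct, so it does not apply when the standby failure rate is exactly zero.

```python
import math

def stage_rates(m, n, s, lam, mu):
    """Stage rates for an m-out-of-n system with s standby modules.

    lam: failure rate of an active module; mu: failure rate of a standby module.
    """
    standby_stages = [n * lam + j * mu for j in range(s, 0, -1)]
    active_stages = [j * lam for j in range(n, m - 1, -1)]
    return standby_stages + active_stages

def hypoexp_reliability(t, rates):
    """R(t) = sum_i coeff_i * exp(-rate_i * t), coeff_i = prod_{j != i} rate_j / (rate_j - rate_i)."""
    total = 0.0
    for i, ri in enumerate(rates):
        coeff = 1.0
        for j, rj in enumerate(rates):
            if j != i:
                coeff *= rj / (rj - ri)
        total += coeff * math.exp(-ri * t)
    return total

def hypoexp_mean(rates):
    """Mean lifetime: the sum of the mean stage durations."""
    return sum(1.0 / r for r in rates)

# Assumed example: 2-out-of-3 system with one standby module
rates = stage_rates(m=2, n=3, s=1, lam=1.0e-3, mu=1.0e-4)
print(rates)                              # three stages: N + S - m + 1 = 3
print(hypoexp_reliability(500.0, rates))  # reliability at 500 hours
print(hypoexp_mean(rates))                # MTTF in hours
```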

Defining the constant K as the ratio of the active–module failure rate to the standby–module failure rate gives a new expression for the active and

standby terms in the reliability equation Eq. (4.11) of,

1

1 1

1

1 1 1,

1 1 11

! !1 1

! ! 1 ! !

1 ! ! !

,

1 ! ! !

1

11

1

N m

i

i N m

N m

NK S NK N N ma

i i iNK i S i iN m

K K K

NK S S i

NK i NK S S i

iN N N m

k

i im M m N m

K K

NK s S N

S i m

.

iN mi

KNK

N m

(4.14)

A similar expression can be developed for,


1,

1 1 1 1

! 1 ! ! 1,

! ! 1 ! ! !

! ! ! ! !1 ,

! ! ! ! ! !

1

i

i m

i m

i m

NK S NK N mb

N i K S N i K i N m i

NK S N K N

NK N i K S i m N i i m

NK S S N i K N i m

S NK N i K S i i m m

NK S N i

S i m

N i K Si

m S

.

(4.15)

An expectation value of the reliability function derived from a general stage–type

distribution can be found using the Laplace transform [Cox 55]. The Laplace

transform of a stage–type random variable X is,

1 1 2 1

1 1

,ir

j

X i i

i j j

ss

(4.16)

where 1i i for 1 i r and 1 1r . Defining the Laplace transform of

the system described in Eq. (4.9) gives,

1

1 1

12

1 1

11

1

1.

1

iSi

X

i j

S N M

j j

N S js c c

s N S j

N jN jc

s N j s N j

(4.17)

By inverting the transformation in Eq. (4.17) an expression for the MTTF with

imperfect coverage can be given as,


1 2

1 1 1

1 1 11 .

S S S Ni

i j S i j j M

E X c c cN j N j j

(4.18)

The above development is described in more detail in [Ing76],

[Chan72], [King69], [Saat65], [Math70], [Triv82]. In the example described above,

the system does not provide for repair. When repairable systems are analyzed in

this manner, the number of stages becomes infinite. To deal with the infinite

number of conditional probabilities a different technique must be employed. The

Markov Chain is just such a technique, capable of dealing with a system

configuration of many modules, each with repairability.

An additional caution should be noted. The assumption of statistical

independence is questionable in the case of stage–type failure distributions. In

addition, the fixed probability distribution associated with each failure in the

stage–type should be removed in the detailed analysis [Rams76].

Chapter 5

MARKOV MODELS OF FAULT–TOLERANT SYSTEMS

A generalized modeling technique is required to deal with an arbitrary number of

modules, failure events, and repair events in the analysis of Fault–Tolerant

systems [Boss82]. Several techniques are available, including Petri Nets [Duga84],

[Duga85], Fault Tree Analysis [Fuss76], Failure Mode and Effects Analysis

[Mil1629], [Jame74], Event Tree Analysis [Gree82], and Hazard and Operability

Studies [Lee80], [Robi78], [Smit85]. When system components are not

independent, a state based analysis technique is needed which includes

redundancy and repair [Biro86], [Guid86].

A Continuous Parameter Markov Chain is a method used to analyze systems that have

state transitions that include repair processes [Hoel72], [Kend50], [Kend53]. A

Markov Process is a stochastic process whose dynamic behavior is such that the

probability distributions for its future behavior depend only on the present state

and not how the process arrived in that state [Mark07], [Fell67], [Issa76],

[Chun76], [Kulk84].

To illustrate the principles of a Markov process, consider a system S described in Figure 3, which is changing over time in such a way that its state at any instant in time t can be described in terms of a finite dimensional vector $X(t)$, [Triv74], [Triv75a], [Triv75]. Assume that the state of the system at any time t, for $t > v$, can be described by a predetermined function of the state at the starting time v and the ending time t:

\[ X(t) = G[X(v), t]. \tag{5.1} \]

Given a set of reasonable starting conditions and the continuity of the function G, a differential equation for $X(t)$ describing the rate at which transitions between each state of the system take place can be derived by expanding both sides of Eq. (5.1) in powers of t to give,

\[ \frac{dX}{dt} = \mathbf{H}[X(t)]. \tag{5.2} \]

Finite–dimensional deterministic systems described by the set of state vectors are

equivalent to systems described by sets of ordinary differential equations [Bell60],

[Brau67], [Beiz78], [Brue80]. This property will serve as the basis for analysis of

fault–tolerant systems that include repair.

It will be assumed that the system described by the set of differential equations in

Eq. (5.2) can exist in only one of the finite number of states [Keme60], [Koba78].

The transition from state i to state j in this system takes place with some random

probability defined by,

\[ p_{ij}(v, t) = P[X(t) = j \mid X(v) = i]; \quad t \ge v,\ i, j \in S. \tag{5.3} \]

Eq. (5.3) is the conditional pdf of the system of state transitions and satisfies the

relation,

\[ \sum_{j \in S} p_{ij}(v, t) = 1; \quad 0 \le v \le t. \tag{5.4} \]

The unconditional pdf of the state transition vector X t is given by,

\[ p_j(t) = P[X(t) = j], \quad j = 1, 2, 3, \ldots \tag{5.5} \]

with,

\[ \sum_{j \in S} p_j(t) = 1, \quad t \ge 0, \tag{5.6} \]

since the process at any time t must be in a unique state. An Absorbing Markov

Process is one in which transitions have the following properties [Gave73],

There is at least one absorbing state,

From every state, it is possible to get to the absorbing state.

Figure 3 – State transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation.

The fundamental assumption of the Markov model is that the probability of a

given state transition depends only on the current state of the system and not on

any previous state. For continuous–time Markov processes, that is, those


described by ordinary differential equations, the length of time already spent in

the current state does not influence either the probability distribution of the next

state or the probability distribution of the remaining time in the same state before

the next transition. The Markov model fits with the standard assumption of the

reliability models developed so far in this work, that the failure rates are constant,

leading to an exponentially distributed state transition time for failures and a

Poisson distribution for the occurrence of these failures.

Solving the Markov Matrix

In order to describe a continuous–time Markov process using transition matrices,

it is necessary to specify the entire family of stochastic matrices, P t . Only

those matrices that meet certain conditions are useful in finding the solution to

the final absorption state rate of the system described by the Markov

Chain [Cour77].

Initial value problems involving systems of equations may be solved using the

Laplace transform. The advantage of this technique over traditional methods

(Elimination, Eigenvalue solutions, and Fundamental Matrix [Pipe63], [Cour43])

is that satisfaction of initial values is automatically provided. No special

techniques are needed to find particular solutions of the fundamental matrix, such

as repeated eigenvalues [Lome88].

Chapman–Kolmogorov Equations

A set of differential equations describing the transitions between each state can

be derived if the following conditions are met by the transition probability

matrix [Bhar60], [Parz62], [Howa71]. These equations are the Chapman–Kolmogorov

Equations and are defined as the transition probabilities of the Markov chain that

satisfy Eq. (5.7) for all i and j, using Figure 3 as an example,

\[ p_{ij}(v, t) = \sum_{k} p_{ik}(v, u)\, p_{kj}(u, t), \quad v \le u \le t. \tag{5.7} \]
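Eq. (5.7) can be checked on a small example. The sketch below uses a two–state, time–homogeneous chain (state 0 working, state 1 failed, with repair), for which the transition probabilities have a well–known closed form. The rates and time points are assumed values chosen for illustration, and the chain is not one of the models analyzed in this chapter.

```python
import math

def transition_matrix(v, t, fail_rate, repair_rate):
    """Closed-form P(v, t) for a two-state chain: 0 -> 1 at fail_rate, 1 -> 0 at repair_rate."""
    a, b = fail_rate, repair_rate
    decay = math.exp(-(a + b) * (t - v))
    return [
        [(b + a * decay) / (a + b), (a - a * decay) / (a + b)],
        [(b - b * decay) / (a + b), (a + b * decay) / (a + b)],
    ]

def matmul(x, y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

a, b = 0.01, 0.2          # assumed failure and repair rates
v, u, t = 0.0, 12.0, 30.0  # any intermediate time u with v <= u <= t

direct = transition_matrix(v, t, a, b)
via_u = matmul(transition_matrix(v, u, a, b), transition_matrix(u, t, a, b))

# The two matrices agree element by element, as Eq. (5.7) requires.
print(direct)
print(via_u)
```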

A simplified notation for the matrix elements defined in Eq. (5.7) can be created

where the elements of each matrix are given by,

\[ \mathbf{H}(v, t) = \mathbf{H}(v, u)\, \mathbf{H}(u, t), \quad v \le u \le t, \tag{5.8} \]

and where,

\[ \mathbf{H}(t, t) = \mathbf{I}, \tag{5.9} \]

is the identity matrix.

The Forward Chapman–Kolmogorov Equation is now defined as,

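\[ \frac{\partial \mathbf{H}(v, t)}{\partial t} = \mathbf{H}(v, t)\, \mathbf{Q}(t), \quad v \le t, \tag{5.10} \]

where the new matrix $\mathbf{Q}(t)$ is defined as,

\[ \mathbf{Q}(t) = \lim_{\Delta t \to 0} \frac{\mathbf{P}(\Delta t) - \mathbf{I}}{\Delta t}, \tag{5.11} \]

with,

\[ \Delta t = t - v. \tag{5.12} \]

The matrix $\mathbf{Q}(t)$ is now defined as the transition rate matrix [Papo65a]. The elements of $\mathbf{Q}(t)$ are $q_{ij}(t)$ and are defined by,

\[ q_{ii}(t) = \lim_{\Delta t \to 0} \frac{p_{ii}(t, t + \Delta t) - 1}{\Delta t}, \tag{5.13} \]

and

\[ q_{ij}(t) = \lim_{\Delta t \to 0} \frac{p_{ij}(t, t + \Delta t)}{\Delta t}, \quad i \ne j. \tag{5.14} \]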

If the system at time t is in state i, then the probability that a transition occurs to any state other than state i during the time interval $(t, t + \Delta t)$ is given by,

\[ -q_{ii}(t)\, \Delta t + o(\Delta t), \tag{5.15} \]

where $o(h)$ is any function of h that approaches zero faster than h, that is $\lim_{h \to 0} o(h)/h = 0$. The magnitude of Eq. (5.13) is the rate at which the process departs state i when starting in state i.

Similarly, given that the system is in state i at time t, the conditional probability that it will make a transition from state i to state j in the time interval $(t, t + \Delta t)$ is given by,

\[ q_{ij}(t)\, \Delta t + o(\Delta t). \tag{5.16} \]

Eq. (5.14) is the rate at which the process moves from state i to state j given that the system is in state i. Since,

\[ \sum_{j} p_{ij}(v, t) = 1, \tag{5.17} \]

Eq. (5.13) and Eq. (5.14) imply,

\[ \sum_{j} q_{ij}(t) = 0, \quad i \in S. \tag{5.18} \]
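The limit that defines the transition rates can be approximated by a finite difference. The sketch below again uses an assumed two–state chain with closed–form transition probabilities (failure rate `a`, repair rate `b`, both illustrative) and shows that the estimate of each row of the rate matrix sums to zero, as in Eq. (5.18).

```python
import math

def two_state_p(dt, a, b):
    """Transition probabilities over an interval dt: 0 -> 1 at rate a, 1 -> 0 at rate b."""
    decay = math.exp(-(a + b) * dt)
    return [
        [(b + a * decay) / (a + b), (a - a * decay) / (a + b)],
        [(b - b * decay) / (a + b), (a + b * decay) / (a + b)],
    ]

def rate_matrix_estimate(dt, a, b):
    """Finite-difference estimate (P(dt) - I) / dt of the transition rate matrix."""
    p = two_state_p(dt, a, b)
    return [[(p[i][j] - (1.0 if i == j else 0.0)) / dt for j in range(2)]
            for i in range(2)]

a, b = 0.01, 0.5                        # assumed rates
q = rate_matrix_estimate(1.0e-6, a, b)  # small but finite interval
print(q)                                # close to [[-a, a], [b, -b]]
print([sum(row) for row in q])          # each row sums to approximately zero
```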

Using these developments, the Backward Chapman–Kolmogorov equation is given by,

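\[ \frac{\partial \mathbf{H}(v, t)}{\partial v} = -\mathbf{Q}(v)\, \mathbf{H}(v, t), \quad v \le t. \tag{5.19} \]

The forward equation may be expressed in terms of its elements,

\[ \frac{\partial p_{ij}(v, t)}{\partial t} = q_{jj}(t)\, p_{ij}(v, t) + \sum_{k \ne j} q_{kj}(t)\, p_{ik}(v, t). \tag{5.20} \]

The initial state i at the initial time v affects the solution of this set of differential equations only through the following conditions,

\[ p_{ij}(v, v) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.21} \]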

The backward matrix equation may be expressed in terms of its elements,

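\[ \frac{\partial p_{ij}(v, t)}{\partial v} = -q_{ii}(v)\, p_{ij}(v, t) - \sum_{k \ne i} q_{ik}(v)\, p_{kj}(v, t), \tag{5.22} \]

with the initial conditions,

\[ p_{ij}(t, t) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.23} \]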

Markov Matrix Notation

The expressions developed in the previous section can be represented by a

transition probability matrix [Papo62] of the form,

\[
P = [p_{ij}] =
\begin{bmatrix}
p_{00} & p_{01} & \cdots & p_{0n} \\
p_{10} & p_{11} & \cdots & p_{1n} \\
\vdots & \vdots & & \vdots \\
p_{m0} & p_{m1} & \cdots & p_{mn}
\end{bmatrix}.
\]

The entries in this matrix satisfy two properties: $0 \le p_{ij} \le 1$ and $\sum_j p_{ij} = 1$, which is a restatement of Eq. (5.17). The Transition Probability Matrix can also be represented by a directed graph [Maye72], [Deo74]. A node labeled i in the directed graph represents state i of the Markov Chain and a branch labeled $p_{ij}$ from node i to node j implies that the conditional probability $P[X_{n+1} = j \mid X_n = i] = p_{ij}$ is met by the Markov Process represented by the directed graph.

The transition probabilities represent a set of differential equations describing the

rate at which the transitions take place between each node in the directed graph.

The differential equations are then represented by a matrix structure of,

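\[
\begin{bmatrix}
\dfrac{dP_0}{dt} \\ \dfrac{dP_1}{dt} \\ \vdots \\ \dfrac{dP_n}{dt}
\end{bmatrix}
=
\begin{bmatrix}
p_{00} & p_{01} & \cdots & p_{0n} \\
p_{10} & p_{11} & \cdots & p_{1n} \\
\vdots & \vdots & & \vdots \\
p_{m0} & p_{m1} & \cdots & p_{mn}
\end{bmatrix}
\begin{bmatrix}
P_0 \\ P_1 \\ \vdots \\ P_n
\end{bmatrix}.
\]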

The solution to this set of linear homogeneous differential equations can be

derived by elimination using the Laplace transform method.


Laplace Transform Techniques

Given a set of differential equations in Eq. (5.20) and Eq. (5.22), the Laplace

transform can be used to generate solutions to these equations [Lome88]. One

advantage of using the Laplace transform method is its ability to handle initial

conditions automatically, without having first to find a general solution and then

having to evaluate the integration constants. The Laplace transform is defined as,

\[ F(s) = \int_0^\infty e^{-st} f(t)\, dt = \mathcal{L}[f(t)]. \tag{5.24} \]

The differential equation solution method depends on the following operational

property of the Laplace transform [Krey72]. The Laplace transform of the

derivative of a function is,

\[ \mathcal{L}[f'(t)] = \int_0^\infty e^{-st} f'(t)\, dt = \lim_{b \to \infty} \Big[ e^{-st} f(t) \Big]_0^b + s \int_0^\infty e^{-st} f(t)\, dt. \tag{5.25} \]

In the limit, the integral appearing on the right–hand side of Eq. (5.25) is $\mathcal{L}[f(t)]$, so that the first term in Eq. (5.25) can be evaluated in the following manner [McLac39],

\[ \lim_{b \to \infty} e^{-sb} f(b) - e^{0} f(0). \tag{5.26} \]

Using the property of absolute values and limits [Arfk70], the limit in Eq. (5.26) can be rewritten as,

\[ \left| \lim_{b \to \infty} e^{-sb} f(b) \right| = \lim_{b \to \infty} e^{-sb} \left| f(b) \right|. \tag{5.27} \]

The term $|f(b)|$ is of the order $e^{\alpha b}$ as $b \to \infty$. For $b > T$, using the definition for exponential order, Eq. (5.27) can be reevaluated to the following,

\[ \lim_{b \to \infty} e^{-sb} \left| f(b) \right| \le \lim_{b \to \infty} e^{-sb} M e^{\alpha b} = \lim_{b \to \infty} M e^{-(s - \alpha) b}. \tag{5.28} \]

The function $f(b)$ is said to be of exponential order as $b \to \infty$ if there exists a constant $\alpha$ such that $e^{-\alpha b} |f(b)|$ is bounded for all b greater than some T. If this statement is true, there also exists a constant M, such that $|f(b)| \le M e^{\alpha b}$, $b > T$.

Figure 4 – Definition of the exponential order of a function.

If $s > \alpha$, then $-(s - \alpha) < 0$, giving,

\[ \lim_{b \to \infty} M e^{-(s - \alpha) b} = 0, \tag{5.29} \]

so that in the limit,

\[ \lim_{b \to \infty} e^{-sb} f(b) = 0, \tag{5.30} \]

giving the final form of the Laplace transform of a derivative as,

\[ \mathcal{L}[f'(t)] = s\, \mathcal{L}[f(t)] - f(0). \tag{5.31} \]
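As a brief illustration of Eq. (5.31), and not part of the original derivation, consider a hypothetical single module with a constant failure rate $\lambda$ that starts in the working state, so that $dP(t)/dt = -\lambda P(t)$ with $P(0) = 1$. Transforming both sides turns the differential equation into an algebraic one, and the initial condition enters without a separate integration constant:

```latex
\begin{align*}
sP(s) - P(0) &= -\lambda P(s), \\
(s + \lambda)\, P(s) &= 1, \\
P(s) &= \frac{1}{s + \lambda}
\quad \Longrightarrow \quad
P(t) = e^{-\lambda t}.
\end{align*}
```

The same three steps (transform, solve algebraically, invert) are applied to each of the multi–state models that follow.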

The notation for the Laplace transform for the differential equation for the rate

of arrival at the transition state i is then given by,

\[ \mathcal{L}[P_i(t)] = P_i(s). \tag{5.32} \]

From this point on, this Laplace transform notation will be used in the solution of the Markov transition matrix differential equations. Using the expression $R(t) = 1 - F(t) = P(T > t)$ to define the system reliability, where $F(t)$ is the probability distribution function of the time to failure, a new random variable, Y, can be defined which represents the time to system failure. A notation can be defined such that $f_Y(t) = -\dfrac{dR(t)}{dt} = \dfrac{dP_0(t)}{dt}$ is the failure density of the random variable Y. The Laplace transform of this failure density is denoted by $\mathcal{L}[f_Y(t)] = L_Y(s) = \bar{f}_Y(s) = sP_0(s)$. In this work $P_0(s)$ represents the absorbing state of the Markov model. By using the Laplace transform notation in the solution of differential equations, the inverse transform can be used to generate the failure density function for the random variable Y. Using Eq. (2.38), the failure density function, weighted by t, can be integrated to produce the Mean Time to Failure, $MTTF = E[Y] = -\displaystyle\int_0^\infty t\, \frac{d}{dt} R(t)\, dt$. The inversion of the Laplace transform may be straightforward in some cases and more complex in other cases.

MODELING A DUPLEX SYSTEM

Duplex systems or Parallel Redundant systems have been utilized in electronic

central office switching systems and other high–reliability systems for the past 35

years [Toy78]. Parallel redundant systems depend on fault detection and recovery

for their proper operation. In most dual redundant architectures both systems are

monitored continuously, providing fault detection in the primary subsystem as

well as the standby subsystem.

This section describes the detailed development of the Markov model for a

parallel redundant system with perfect diagnostic coverage. The failure rate of both subsystems is assumed to be a constant $\lambda$ and the repair rate a constant $\mu$. The system is considered failed when both subsystems have failed. The number of properly functioning subsystems is described in the state space $S = \{2, 1, 0\}$, where 0 is the failure state of the system. The state diagram for the system is shown in Figure 5.

Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State 2 represents the fault free operation mode, State 1 represents a single fault with a return path to the fault free mode by a repair operation, and State 0 represents the system failure mode, the absorption state.

The initial state of the system is 2 and the initial conditions for the transition

equations are,

\[ P_2(0) = 1, \quad P_1(0) = P_0(0) = 0. \tag{5.33} \]

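Using the initial conditions, the system of differential equations derived from the transition matrix,

\[
\begin{bmatrix}
\dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt}
\end{bmatrix}
=
\begin{bmatrix}
-2\lambda & \mu & 0 \\
2\lambda & -(\lambda + \mu) & 0 \\
0 & \lambda & 0
\end{bmatrix}
\begin{bmatrix}
P_2(t) \\ P_1(t) \\ P_0(t)
\end{bmatrix},
\]

is given by,

\[
\begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda P_2(t) + \mu P_1(t), \\
\frac{dP_1(t)}{dt} &= 2\lambda P_2(t) - (\lambda + \mu) P_1(t), \\
\frac{dP_0(t)}{dt} &= \lambda P_1(t).
\end{aligned}
\tag{5.34}
\]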

Using the Laplace transform solution technique described in the previous section

and in detail in [Doet61], [Widd46], [Lome88], [Rea78], and [Lath65] gives the

following set of equations in Laplace form,

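\[
\begin{aligned}
sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\
sP_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\
sP_0(s) &= \lambda P_1(s).
\end{aligned}
\tag{5.35}
\]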

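Solving Eq. (5.35)(a) for state 2 gives,

\[
\begin{aligned}
sP_2(s) &= -2\lambda P_2(s) + \mu P_1(s) + 1, \\
(s + 2\lambda) P_2(s) &= \mu P_1(s) + 1, \\
P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda},
\end{aligned}
\tag{5.36}
\]

and solving Eq. (5.35)(b) for state 2 gives,

\[
\begin{aligned}
sP_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\
sP_1(s) + (\lambda + \mu) P_1(s) &= 2\lambda P_2(s), \\
P_2(s) &= \frac{(s + \lambda + \mu) P_1(s)}{2\lambda}.
\end{aligned}
\tag{5.37}
\]

Equating Eq. (5.36) and Eq. (5.37), a solution representing state 1 can be derived, giving,

\[ \frac{\mu P_1(s) + 1}{s + 2\lambda} = \frac{(s + \lambda + \mu) P_1(s)}{2\lambda}. \tag{5.38} \]

Multiplying each side by $\dfrac{1}{P_1(s)}$ gives,

\[ \frac{\mu + \dfrac{1}{P_1(s)}}{s + 2\lambda} = \frac{s + \lambda + \mu}{2\lambda}, \]

which results in,

\[ \frac{1}{P_1(s)} = \frac{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu}{2\lambda}. \tag{5.39} \]

Solving Eq. (5.39) for state 1 gives,

\[ P_1(s) = \frac{2\lambda}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu}. \tag{5.40} \]

Expanding and simplifying Eq. (5.40) gives,

\[ P_1(s) = \frac{2\lambda}{s^2 + 3\lambda s + \mu s + 2\lambda^2}. \tag{5.41} \]

Substituting Eq. (5.41) into Eq. (5.35)(c) gives the solution to the final absorbing state 0 as,

\[
\begin{aligned}
sP_0(s) &= \lambda P_1(s), \\
sP_0(s) &= \frac{2\lambda^2}{s^2 + 3\lambda s + \mu s + 2\lambda^2}, \\
P_0(s) &= \frac{2\lambda^2}{s \left( s^2 + 3\lambda s + \mu s + 2\lambda^2 \right)}.
\end{aligned}
\tag{5.42}
\]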

After producing the inverse Laplace transform of Eq. (5.42)(c), the probability that no subsystems are operating at time $t \ge 0$ is the result. Let the random variable Y be the time to failure of the system and $P_0(t)$ be the probability that the system has failed at or before time t. The reliability of the system is then defined by,

\[ R(t) = 1 - P_0(t). \tag{5.43} \]

Using Eq. (2.37), the failure density function for the random variable Y is given by,

\[ f_Y(t) = -\frac{dR}{dt} = \frac{dP_0(t)}{dt}, \tag{5.44} \]

and using Eq. (5.31), its Laplace transform is given by,

\[ L_Y(s) = \bar{f}_Y(s) = sP_0(s) - P_0(0) = \frac{2\lambda^2}{s^2 + (3\lambda + \mu)s + 2\lambda^2}. \tag{5.45} \]

Inverting Eq. (5.45) gives the failure density of Y as,

\[ f_Y(t) = \frac{2\lambda^2}{\alpha_1 - \alpha_2} \left( e^{-\alpha_2 t} - e^{-\alpha_1 t} \right), \tag{5.46} \]

where,

\[ \alpha_1, \alpha_2 = \frac{(3\lambda + \mu) \pm \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}}{2}. \tag{5.47} \]

Using Eq. (2.28), the MTTF of the Parallel Redundant system with repair is given

by,

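\[
\begin{aligned}
E[Y] &= \int_0^\infty y f_Y(y)\, dy, \\
&= \frac{2\lambda^2}{\alpha_1 - \alpha_2} \left[ \int_0^\infty y e^{-\alpha_2 y}\, dy - \int_0^\infty y e^{-\alpha_1 y}\, dy \right], \\
&= \frac{2\lambda^2}{\alpha_1 - \alpha_2} \left[ \frac{1}{\alpha_2^2} - \frac{1}{\alpha_1^2} \right], \\
&= \frac{2\lambda^2 (\alpha_1 + \alpha_2)}{\alpha_1^2 \alpha_2^2}
 = \frac{2\lambda^2 (3\lambda + \mu)}{\left( 2\lambda^2 \right)^2}, \\
&= \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}.
\end{aligned}
\tag{5.48}
\]

The MTTF of a two element Parallel Redundant system without repair ($\mu = 0$) would have been equal to the first term in the final line of Eq. (5.48). The effect of adding a repair facility to the system increases the mean life of the system by,

\[ \Delta MTTF_{\text{as a result of Repair}} = \frac{\mu}{2\lambda^2}, \tag{5.49} \]

or a factor of,

\[ \frac{\mu / 2\lambda^2}{3 / 2\lambda} = \frac{\mu}{3\lambda}, \tag{5.50} \]

over a system without repair facilities.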
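The MTTF of Eq. (5.48) can be cross–checked without inverting any transform. The sketch below uses first–step (mean time to absorption) equations for the chain of Figure 5, which is a different technique from the Laplace method used in the text; the function names and the numerical rates are assumptions for illustration.

```python
import math

def duplex_mttf(lam, mu):
    """Mean time to absorption from state 2, from the first-step equations

        tau2 = 1/(2*lam) + tau1
        tau1 = 1/(lam + mu) + (mu/(lam + mu)) * tau2

    where tau_k is the expected time to reach state 0 starting from state k.
    """
    # Substituting tau1 into tau2 leaves tau2 * lam/(lam + mu) on the left-hand side.
    return (1.0 / (2.0 * lam) + 1.0 / (lam + mu)) * (lam + mu) / lam

def duplex_mttf_closed(lam, mu):
    """Closed form of Eq. (5.48)."""
    return 3.0 / (2.0 * lam) + mu / (2.0 * lam**2)

lam, mu = 1.0e-3, 1.0e-1   # assumed: one failure per 1000 h, ten-hour mean repair
print(duplex_mttf(lam, mu), duplex_mttf_closed(lam, mu))
print("improvement factor from repair:", mu / (3.0 * lam))   # Eq. (5.50)
```

Setting `mu` to zero recovers the no–repair value 3/(2 lam), the first term of Eq. (5.48).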

MODELING A TRIPLE–REDUNDANT SYSTEM

A Triple Modular Redundant (TMR) system continues to operate correctly as

long as two of the three subsystems are functioning properly. A second

subsystem failure causes the system to fail. This model is referred to as 3–2–0. A

second architecture (shown in Figure 8) is possible in which the system will continue to operate in the presence of two (2) subsystem failures. This system operates in simplex mode 3–2–1–0. The 3–2–0 model without coverage will be developed in this section. Figure 6 describes a TMR system with a constant failure rate $\lambda$ and a constant repair rate $\mu$.

The repair activity takes place with a constant response time whenever a

subsystem fails, giving a Markov transition matrix of,

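\[
\begin{bmatrix}
\dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt}
\end{bmatrix}
=
\begin{bmatrix}
-3\lambda & \mu & 0 \\
3\lambda & -(2\lambda + \mu) & 0 \\
0 & 2\lambda & 0
\end{bmatrix}
\begin{bmatrix}
P_2(t) \\ P_1(t) \\ P_0(t)
\end{bmatrix}.
\tag{5.51}
\]

The set of differential equations derived from the transition matrix is given by,

\[
\begin{aligned}
\frac{dP_2(t)}{dt} &= -3\lambda P_2(t) + \mu P_1(t), \\
\frac{dP_1(t)}{dt} &= 3\lambda P_2(t) - (2\lambda + \mu) P_1(t), \\
\frac{dP_0(t)}{dt} &= 2\lambda P_1(t).
\end{aligned}
\tag{5.52}
\]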

Rewriting the differential equations in the Laplace transform format gives,

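\[
\begin{aligned}
sP_2(s) - 1 &= -3\lambda P_2(s) + \mu P_1(s), \\
sP_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\
sP_0(s) &= 2\lambda P_1(s).
\end{aligned}
\tag{5.53}
\]

Using Eq. (5.53)(a) to solve for state 2 gives,

\[
\begin{aligned}
sP_2(s) &= -3\lambda P_2(s) + \mu P_1(s) + 1, \\
(s + 3\lambda) P_2(s) &= \mu P_1(s) + 1, \\
P_2(s) &= \frac{\mu P_1(s) + 1}{s + 3\lambda}.
\end{aligned}
\tag{5.54}
\]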

Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State 2 represents the fault free (TMR) operation mode, State 1 represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State 0 represents the system failure mode, the absorbing state.

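Using Eq. (5.53)(b) to solve again for state 2 gives,

\[
\begin{aligned}
sP_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\
sP_1(s) + (2\lambda + \mu) P_1(s) &= 3\lambda P_2(s), \\
P_2(s) &= \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda}.
\end{aligned}
\tag{5.55}
\]

Equating Eq. (5.54) and Eq. (5.55) and solving for state 1 gives,

\[
\begin{aligned}
\frac{\mu P_1(s) + 1}{s + 3\lambda} &= \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda}, \\
P_1(s) &= \frac{3\lambda}{(s + 2\lambda + \mu)(s + 3\lambda) - 3\lambda\mu}.
\end{aligned}
\tag{5.56}
\]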

Simplifying Eq. (5.56)(b) gives,

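\[ P_1(s) = \frac{3\lambda}{s^2 + 5\lambda s + \mu s + 6\lambda^2}. \tag{5.57} \]

Substituting the solution for state 1, Eq. (5.57), into Eq. (5.53)(c) gives the solution for the final absorbing state 0,

\[
\begin{aligned}
sP_0(s) &= 2\lambda P_1(s) = 2\lambda\, \frac{3\lambda}{s^2 + 5\lambda s + \mu s + 6\lambda^2}, \\
P_0(s) &= \frac{6\lambda^2}{s \left( s^2 + 5\lambda s + \mu s + 6\lambda^2 \right)}.
\end{aligned}
\tag{5.58}
\]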

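Expanding and factoring the denominator of Eq. (5.58)(b) gives the transform of the absorption state probability as,

\[ P_0(s) = \frac{6\lambda^2}{s \left[ s + \tfrac{1}{2}\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \right] \left[ s + \tfrac{1}{2}\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \right]}. \tag{5.59} \]

Expanding the partial fractions of Eq. (5.59) and taking the inverse Laplace transform results in the following reliability function,

\[
\begin{aligned}
R(t) = {} & \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\; e^{-\frac{1}{2}\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) t} \\
& - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\; e^{-\frac{1}{2}\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) t}.
\end{aligned}
\tag{5.60}
\]

Integrating Eq. (5.60) using Eq. (2.24) produces the MTTF of,

\[
\begin{aligned}
MTTF = {} & \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}} \\
& - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}.
\end{aligned}
\tag{5.61}
\]

Simplifying Eq. (5.61) gives the MTTF for a TMR system with repair as,

\[ MTTF = \frac{5\lambda + \mu}{6\lambda^2}. \tag{5.62} \]

Rearranging Eq. (5.62) and isolating the repair term from the failure term gives,

\[ MTTF = \frac{5}{6\lambda} + \frac{\mu}{6\lambda^2}. \tag{5.63} \]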
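The closed–form reliability can be compared with a direct numerical integration of the state equations in Eq. (5.52). The following sketch uses a classical fourth–order Runge–Kutta step, which is an assumption of this example rather than a method used in the text; the rates `lam` and `mu` and the step count are illustrative.

```python
import math

def tmr_derivatives(p, lam, mu):
    """Right-hand side of Eq. (5.52) for the state vector (P2, P1, P0)."""
    p2, p1, _p0 = p
    return (
        -3.0 * lam * p2 + mu * p1,
        3.0 * lam * p2 - (2.0 * lam + mu) * p1,
        2.0 * lam * p1,
    )

def rk4_step(p, h, lam, mu):
    """One classical Runge-Kutta step of size h."""
    k1 = tmr_derivatives(p, lam, mu)
    k2 = tmr_derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, mu)
    k3 = tmr_derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, mu)
    k4 = tmr_derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, mu)
    return tuple(x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for x, a, b, c, d in zip(p, k1, k2, k3, k4))

def tmr_reliability_numeric(t_end, lam, mu, steps=2000):
    """R(t_end) = 1 - P0(t_end), starting from P2(0) = 1."""
    p = (1.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        p = rk4_step(p, h, lam, mu)
    return 1.0 - p[2]

def tmr_reliability_closed(t, lam, mu):
    """Two-exponential reliability function obtained by inverting the transform of P0."""
    root = math.sqrt(lam**2 + 10.0 * lam * mu + mu**2)
    r1 = 0.5 * (5.0 * lam + mu - root)
    r2 = 0.5 * (5.0 * lam + mu + root)
    return (r2 * math.exp(-r1 * t) - r1 * math.exp(-r2 * t)) / root

def tmr_mttf(lam, mu):
    """Eq. (5.63)."""
    return 5.0 / (6.0 * lam) + mu / (6.0 * lam**2)

lam, mu = 0.01, 0.1   # assumed failure and repair rates per hour
print(tmr_reliability_numeric(100.0, lam, mu), tmr_reliability_closed(100.0, lam, mu))
print(tmr_mttf(lam, mu))
```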

MODELING A PARALLEL SYSTEM WITH IMPERFECT COVERAGE

A more realistic model of a Parallel Redundant System assumes that not all faults

are recoverable and that the coverage factor c denotes the conditional probability

that the system detects the fault and survives. The state diagram for this system is

shown in Figure 7.

Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State 2 represents the fault free mode, State 1 represents a single fault with a return path to the fault free mode by a repair operation, and State 0 represents the system failure mode. State 0 can be reached from State 2 through an uncovered fault, which causes the system to fail without the intermediate State 1 mode.

The transition matrix for Figure 7 is,

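\[
\begin{bmatrix}
\dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt}
\end{bmatrix}
=
\begin{bmatrix}
-2\lambda c - 2\lambda(1 - c) & \mu & 0 \\
2\lambda c & -(\lambda + \mu) & 0 \\
2\lambda(1 - c) & \lambda & 0
\end{bmatrix}
\begin{bmatrix}
P_2(t) \\ P_1(t) \\ P_0(t)
\end{bmatrix}.
\tag{5.64}
\]

With an initial state of 2 producing a set of starting conditions,

\[ P_2(0) = 1, \quad P_1(0) = P_0(0) = 0, \]

the system of equations describing the state transitions is,

\[
\begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda c P_2(t) - 2\lambda(1 - c) P_2(t) + \mu P_1(t), \\
\frac{dP_1(t)}{dt} &= 2\lambda c P_2(t) - (\lambda + \mu) P_1(t), \\
\frac{dP_0(t)}{dt} &= 2\lambda(1 - c) P_2(t) + \lambda P_1(t).
\end{aligned}
\tag{5.65}
\]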

Using the Laplace transform method, the above equations are reduced to,

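\[
\begin{aligned}
sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\
sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu) P_1(s), \\
sP_0(s) &= 2\lambda(1 - c) P_2(s) + \lambda P_1(s).
\end{aligned}
\tag{5.66}
\]

Using Eq. (5.66)(a) and solving for state 2 gives,

\[
\begin{aligned}
sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\
(s + 2\lambda) P_2(s) &= \mu P_1(s) + 1, \\
P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda}.
\end{aligned}
\tag{5.67}
\]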

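Using Eq. (5.66)(b) to solve for state 2 gives,

\[
\begin{aligned}
sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu) P_1(s), \\
(s + \lambda + \mu) P_1(s) &= 2\lambda c P_2(s), \\
P_2(s) &= \frac{(s + \lambda + \mu) P_1(s)}{2\lambda c}.
\end{aligned}
\tag{5.68}
\]

Equating Eq. (5.67)(c) and Eq. (5.68)(c) gives,

\[ \frac{\mu P_1(s) + 1}{s + 2\lambda} = \frac{(s + \lambda + \mu) P_1(s)}{2\lambda c}. \tag{5.69} \]

Simplifying Eq. (5.69) and solving for state 1 gives,

\[
\begin{aligned}
2\lambda c \mu P_1(s) + 2\lambda c &= (s + 2\lambda)(s + \lambda + \mu) P_1(s), \\
P_1(s) &= \frac{2\lambda c}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}.
\end{aligned}
\tag{5.70}
\]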

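Using Eq. (5.66)(a) and solving for state 1 gives,

\[
\begin{aligned}
sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\
(s + 2\lambda) P_2(s) - 1 &= \mu P_1(s), \\
P_1(s) &= \frac{(s + 2\lambda) P_2(s) - 1}{\mu}.
\end{aligned}
\tag{5.71}
\]

Using Eq. (5.66)(b) and solving for state 1 gives,

\[
\begin{aligned}
sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu) P_1(s), \\
(s + \lambda + \mu) P_1(s) &= 2\lambda c P_2(s), \\
P_1(s) &= \frac{2\lambda c P_2(s)}{s + \lambda + \mu}.
\end{aligned}
\tag{5.72}
\]

Equating Eq. (5.71) and Eq. (5.72) and solving for state 2 gives,

\[
\begin{aligned}
\frac{(s + 2\lambda) P_2(s) - 1}{\mu} &= \frac{2\lambda c P_2(s)}{s + \lambda + \mu}, \\
P_2(s) &= \frac{s + \lambda + \mu}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}.
\end{aligned}
\tag{5.73}
\]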

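Substituting Eq. (5.70) and Eq. (5.73) into Eq. (5.66)(c) and solving for state 0 gives,

\[
\begin{aligned}
sP_0(s) &= 2\lambda(1 - c) P_2(s) + \lambda P_1(s), \\
&= \frac{2\lambda(1 - c)(s + \lambda + \mu)}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c} + \frac{2\lambda^2 c}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}, \\
&= \frac{2\lambda s + 2\lambda^2 + 2\lambda\mu - 2\lambda c s - 2\lambda^2 c - 2\lambda\mu c + 2\lambda^2 c}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}, \\
&= \frac{2\lambda \left[ s + \lambda + \mu - c(s + \mu) \right]}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}.
\end{aligned}
\tag{5.74}
\]

Simplifying Eq. (5.74) for state 0 gives,

\[ P_0(s) = \frac{2\lambda \left[ s + \lambda + \mu - c(s + \mu) \right]}{s \left[ (s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c \right]}. \tag{5.75} \]

If X is the random variable describing the time to the system failure, then,

\[ F_X(t) = P_0(t), \tag{5.76} \]

and,

\[ L_X(s) = \bar{f}_X(s) = sP_0(s) = \frac{2\lambda \left[ s + \lambda + \mu - c(s + \mu) \right]}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}. \tag{5.77} \]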

Inverting Eq. (5.77) is difficult in this case and an alternative method of finding

the mean time to system failure will be used.

If X denotes the time to failure of a system, then from the knowledge of the Laplace transform $L_X(s)$ the MTTF can be obtained using the Moment Generating Property of the Laplace transform.

Let X be a random variable possessing a Laplace transform $L_X(s)$. Then the $k^{th}$ $(k = 1, 2, \ldots)$ moment of X is given by,

\[ E\left[ X^k \right] = (-1)^k \left. \frac{d^k L_X(s)}{ds^k} \right|_{s = 0}. \tag{5.78} \]

In this case Eq. (5.77) can be rewritten as,

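\[ L_X(s) = \frac{2\lambda U}{V}, \tag{5.79} \]

where,

\[ U = s + \lambda + \mu - c(s + \mu), \tag{5.80} \]

and,

\[ V = (s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c. \tag{5.81} \]

The evaluation of the derivative term in Eq. (5.79) can be performed using the quotient rule [Midd46], [Smai49],

\[ \frac{d}{ds}\left( \frac{U}{V} \right) = \frac{V \dfrac{dU}{ds} - U \dfrac{dV}{ds}}{V^2}. \tag{5.82} \]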

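This method is useful when it is difficult to obtain the density function $f_X(t)$ and the reliability function $R_X(t)$. Using the moment generating property, Eq. (5.78) is evaluated as,

\[
\begin{aligned}
E[X] &= -\left. \frac{dL_X}{ds} \right|_{s = 0}
 = -2\lambda \left. \frac{V(1 - c) - U(2s + 3\lambda + \mu)}{V^2} \right|_{s = 0}, \\
&= \frac{2\lambda \left\{ \left[ \lambda + \mu(1 - c) \right](3\lambda + \mu) - 2\lambda \left[ \lambda + \mu(1 - c) \right](1 - c) \right\}}{\left\{ 2\lambda \left[ \lambda + \mu(1 - c) \right] \right\}^2},
\end{aligned}
\tag{5.83}
\]

which gives the expression for the MTTF of a parallel redundant system with imperfect coverage as,

\[ E[X] = \frac{\lambda(1 + 2c) + \mu}{2\lambda \left[ \lambda + \mu(1 - c) \right]}. \tag{5.84} \]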
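The moment generating property lends itself to a quick numerical check. In the sketch below the transform is written as L_X(s) = 2 lam U / V with U = s + lam + mu - c (s + mu) and V = (s + 2 lam)(s + lam + mu) - 2 lam mu c, and its derivative at s = 0 is approximated by a central difference. The step size `h`, the rates, and the function names are assumptions made for illustration.

```python
import math

def transform(s, lam, mu, c):
    """Laplace transform of the failure density, L_X(s) = 2*lam*U/V."""
    u = s + lam + mu - c * (s + mu)
    v = (s + 2.0 * lam) * (s + lam + mu) - 2.0 * lam * mu * c
    return 2.0 * lam * u / v

def mttf_closed(lam, mu, c):
    """Closed-form MTTF of the parallel system with imperfect coverage."""
    return (lam * (1.0 + 2.0 * c) + mu) / (2.0 * lam * (lam + mu * (1.0 - c)))

def mttf_from_transform(lam, mu, c, h=1.0e-7):
    """E[X] = -dL_X/ds at s = 0, approximated by a central difference."""
    return -(transform(h, lam, mu, c) - transform(-h, lam, mu, c)) / (2.0 * h)

lam, mu = 1.0e-3, 1.0e-1   # assumed failure and repair rates per hour
for c in (1.0, 0.99, 0.95, 0.90):
    # MTTF drops sharply as coverage falls below 1
    print(c, mttf_closed(lam, mu, c), mttf_from_transform(lam, mu, c))
```

With c = 1 the result reduces to the perfect–coverage value 3/(2 lam) + mu/(2 lam^2), and the printed table shows how strongly the MTTF depends on the coverage factor when the repair rate is much larger than the failure rate.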

MODELING A TMR SYSTEM WITH IMPERFECT COVERAGE

In some Triple Modular Redundant systems a simplex mode of operation is

allowed, providing the facility to have two successive subsystem failures before

total system failure is experienced [Toy87]. Figure 8 describes the states of the

TMR system operating in the 3–2–1–0 mode.

Figure 8 – The state transition diagram for a Triple Modular Redundant system with repair and imperfect fault coverage. State 3 represents the fault free mode, State 2 represents the single fault (Duplex) mode, State 1 represents the two–fault (Simplex) mode, and State 0 represents the system failure mode.

With an initial state of 3 , the starting conditions are,

\[ P_3(0) = 1, \quad P_2(0) = P_1(0) = P_0(0) = 0. \tag{5.85} \]

The transition matrix is,


33

22

11

00

3 0 0

3 2 2 1 0

.

0 2 0

0 2 1 0

dP tP t

dt

dP tc c P t

dt

dP tc P t

dt

dP tc P t

dt

(5.86)

producing a system of differential equations,

33 2

23 2 2 1

12 1

02 1

3 ,

3 2 2 1 ,

2 ,

2 1 .

dPP t P t

dt

dPP t cP t c P t P t

dt

dPcP t P t

dt

dPc P t P t

dt

(5.87)

The Laplace transform of these equations produces,

3 3 2

2 3 2 2 1

1 2 1

0 2 1

1 3 ,

3 2 2 1 ,

2 ,

2 1 .

sP s P s P s

sP s P s cP s c P s P s

sP s cP s P s

sP s c P s P s

(5.88)

Solving Eq. (5.88) for the absorption state and inverting the Laplace transform

gives the mean time to failure for a TMR system with imperfect coverage as,

2 2 2 3 3 4 3 2 4

3

7 2 8 17 13 1 5 6 12 6 6.

6 1

MTTF

c c c c c c c

c

(5.89)

MODELING A GENERALIZED TMR SYSTEM

In practice, the state transitions for a Triple Modular Redundant computing

system take place in a controlled manner. The physical structure of the underlying

hardware and software create a transition network similar to that described in Figure 8. In normal operation, the system functions as a TMR machine. Failure of a subsystem transforms the system into an intermediate mode in which failure of a second subsystem results in the final absorption state. The states $1, \ldots, N$ represent the possible remaining subsystems after the failure of one of the TMR subsystems.

A closed form solution to this general TMR model using perfect fault coverage

would be useful [Dona84], [Dona85], [Iyer84], [Meye82]. Using Figure 8 as the

basis for the transition model, this section develops an expression for the MTTF

of a generalized TMR system with the following behavior:

Each subsystem is itself a TMR system.

The subsystems are connected to form a fault–tolerant system.

The system starts in state 0.

A fault in any subsystem results in the transition to one of the states $1, \ldots, N$.

A fault in the remaining subsystems results in the absorption state $N + 1$.

Laplace Transform Solution to Systems of Equations

The scalar differential equation of the form $\dfrac{d}{dt} p(t) = a\, p(t)$ has solutions of the form $c\, e^{at}$, where c is a constant [Lome86], [Brau70], [Brau67]. Given a system of differential equations of the form,

\[ \frac{d}{dt} P_i(t) = \mathbf{A} P_i(t), \tag{5.90} \]

the generation of the fundamental matrix A leads to a unique solution. Taking the Laplace transform of both sides of Eq. (5.90) and using the initial conditions vector $P(0)$ gives,

\[ sP(s) - P(0) = \mathbf{A} P(s). \tag{5.91} \]

Rearranging Eq. (5.91) results in,

\[ (s\mathbf{I} - \mathbf{A})\, P(s) = P(0). \tag{5.92} \]

The system of equations described in Eq. (5.92) is a linear nonhomogeneous system of n algebraic equations in n unknowns. If s is not equal to an eigenvalue of A, i.e. $\det(s\mathbf{I} - \mathbf{A}) \ne 0$, Eq. (5.92) can be solved for $P(s)$ in terms of $P(0)$ and s, using Cramer's Rule and / or Gauss Reduction (see Appendix B).

Since $\det(s\mathbf{I} - \mathbf{A})$, where

\[
\det(s\mathbf{I} - \mathbf{A}) = \det
\begin{bmatrix}
s - a_{11} & \cdots & -a_{1n} \\
\vdots & & \vdots \\
-a_{n1} & \cdots & s - a_{nn}
\end{bmatrix},
\]

is a polynomial of degree n, $P(s)$ is a vector whose components are rational functions of s and linear in $P_1(0), P_2(0), \ldots, P_n(0)$, providing a unique solution to the system described in Eq. (5.92).

Figure 9 – The state transition diagram for a Generalized Triple Modular Redundant system with repair and perfect fault detection coverage. The system initially operates in a fault free state $0$. A fault in any module results in the transition to state $\{1, \ldots, N\}$. A second fault while in state $\{1, \ldots, N\}$ results in the system failure state $N+1$.

Specific Solution to the Generalized System

The solution set of differential equations describing the reliability of the system in Figure 9 can be developed using the Gauss Reduction method by transforming this set of differential equations into a set of algebraic equations using the Laplace transform [Lehm62], [Bell65], [Jame67]. Appendix F provides additional details on this solution method. The matrix described in Eq. (5.93) represents the state transition rates between each node of Figure 9,

1 2

1

1 1

2 2

3 3

0

0 0

0 0,

0 0

0 0

0 0 0 0

N

i N

i

N N

s

A (5.93)

where the fundamental matrix in the Laplace Transform notation of A, is given

by,

s I A , (5.94)

resulting in,

1 2

1

1 1

2 2

3 3

0

0 0

0 0,

0 0

0 0

0 0 0 0

N

i N

i

N N

s

s

s

s

s

s

(5.95)

using the initial conditions for the system of equations,

0 ,P P s (5.96)

and the normalizing term,

0

1

,

; 1, 2, , ,

N

i

i

i i

s

s i N

(5.97)

the system of equations can be written in the transposed form as,

0 0

1 1 1

2 1 2

3 3

1 2 1

1 0

0 0 0 0

0 0

0 0 ,

0 0

0 0

N N

N N

P s

P s

P s

P s

P s

s P s

(5.98)

which can be rewritten in an augmented form as,

0

1 1

2 2

3 3

1 1 1 1

1 0

0 0 0 0 0

0 0 0 0

0 0 0 0 0 .

0 0 0 0 0

0 0

N N

s

(5.99)

Using the Gauss Reduction method to multiply row 1i by, 1i and add it

to row 1 to give the new form of the matrix as,

0

1 1

2 2

3 3

1 1 1 1

1 0 0 0 0 0

0 0 0 0 0

0 0 0 0

.0 0 0 0 0

0 0 0 0 0

0 0

i i

N N

s

(5.100)

The augmented matrix is now in lower triangular form and can be solved for

$P_{N+1}(s)$.

Defining the term,

0 0 ,i

B

(5.101)

gives the following sets of equations from Eq. (5.100),

0 0

1 0 1 1

2 0 3 2

3 0 3 3

0

1 1 2 2 1

1 ,

0 ,

0 0 ,

0 ,

0 ,

0 .

N N N

N N N

B P s

P s P s

P s P s

P s P s

P s P s

P s P s P s sP s

(5.102)

Solving for each differential equation gives,

1

0 0

1 0 11

1 1 0

2 0 22

2 2 0

1 0 1

1 1 0

1

1 1 10 0

1

1

1 10 0

,

,

,

,

1 1 1,

1 1.

N

N N Ni i i i

N i i

i i ii i

N Ni i

N i i i

i ii

P s B

P sP s

B

P sP s

B

P sP s

B

P s P ss s B sB

sP sB B

(5.103)

Solving the differential equations set for the absorption state $N+1$ using the

moment generating function described in Eq. (5.78) gives,

1

0

,0

N

dsP s p t dt MTBF

sds

(5.104)

as a general expression. Defining the intermediate terms,

1

0 0

1

0

1

0

,

,

,

i i

T i i

T i i

B

B s

B s s

(5.105)

and the derivative term,

1

0 1 ,i i

d dB s

ds ds

(5.106)

and

2

0 21 1 .

i

i i

i

dB s

ds

(5.107)

The term 01u B , is defined so that the derivative 2 2

0

11 i

i

du

ds B

can be defined resulting in

11

1 2 2

1 1i

ii

d ds

ds ds s

.

Expanding Eq. (5.106) and Eq. (5.107) to,

1

1

0

1,N i i i

d dsP s

ds ds B

(5.108)

and,

1 1

1

0 0

1 1,N i i i i i i

d d dsP s

ds ds B B ds

(5.109)

evaluating at 0s gives an expression for the MTTF of the generalized TMR

system,

1

2 2

2

0

1

.

N

i i i i i iT

i ii i

iT

i

dsP s

sds

(5.110)

Simplifying Eq. (5.110) gives the closed form solution for the generalized TMR

equation of,

1

1

1

.

Ni

i i

Ni i

i i

MTBF

(5.111)

The expression shown in Eq. (5.111) represents the generalized form of a Triple

Modular Redundant system with individual failure rates and a constant repair rate

for each failure event. Although this model appears generic, it can be utilized to

represent actual TMR systems with a high degree of accuracy [Sint89], [Whit82],

[UKAE88].
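As a numerical cross-check of a model of this shape, the following sketch computes the mean time to absorption for a chain with a fault free state 0, intermediate states 1..N and an absorbing state N+1. The rate names are assumptions introduced for illustration only: `lam[i]` is the rate from state 0 to intermediate state i, `mu[i]` is the repair rate from that state back to state 0, and `nu[i]` is the rate from that state to the failure state. The closed form used here comes from first-step analysis under those assumptions and is offered as a sanity check, not as a transcription of Eq. (5.111).

```python
from fractions import Fraction


def mttf_closed_form(lam, mu, nu):
    """Mean time from state 0 to absorption, from first-step analysis.

    lam, mu, nu are equal-length sequences of assumed (hypothetical) rates.
    """
    numerator = 1 + sum(l / (m + v) for l, m, v in zip(lam, mu, nu))
    denominator = sum(l * v / (m + v) for l, m, v in zip(lam, mu, nu))
    return numerator / denominator


def mttf_linear_solve(lam, mu, nu):
    """Same quantity, obtained by solving Q * T = -1 on the transient states."""
    size = len(lam) + 1
    q = [[Fraction(0)] * size for _ in range(size)]
    q[0][0] = -sum(lam, Fraction(0))
    for i in range(1, size):
        q[0][i] = lam[i - 1]                    # 0 -> i (first fault)
        q[i][0] = mu[i - 1]                     # i -> 0 (repair)
        q[i][i] = -(mu[i - 1] + nu[i - 1])      # total exit rate of state i
    aug = [row + [Fraction(-1)] for row in q]   # augmented matrix [Q | -1]
    for col in range(size):                     # Gauss-Jordan reduction
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return aug[0][size] / aug[0][0]


if __name__ == "__main__":
    lam = [Fraction(1, 1000)] * 3   # hypothetical rates per hour
    mu = [Fraction(1, 8)] * 3
    nu = [Fraction(2, 1000)] * 3
    print(float(mttf_closed_form(lam, mu, nu)))
    print(float(mttf_linear_solve(lam, mu, nu)))
```

Exact rational arithmetic is used so that the two routes can be compared for equality rather than within a floating point tolerance.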

Chapter 6

PRACTICAL EFFECTS OF PARTIAL COVERAGE

The analysis presented in the previous section utilizes a coverage factor c in

evaluating the Mean Time to Failure of various fault–tolerant system configurations.

In practice, the coverage factor of the underlying diagnostics firmware must be

determined through analysis and verification. The important aspects of the

diagnostics subsystem are [Kraf81],

The physical determination of the coverage factor through some means.

The effects of systematic failures in the diagnostic system on the reliability of the system.

DETERMINING COVERAGE FACTORS

Throughout this work, the term coverage has been referred to as a scalar value

describing the probability that the system successfully recovers from a specific

type of fault. What has not been described is the method by which the coverage

factor is determined for a particular system or subsystem.

Several difficulties arise when the coverage factor is measured:

For a particular subsystem, exhaustive testing of each fault condition may not be possible. A typical printed circuit board may have 20,000 device connection points, each of which can exhibit several different fault states. Defining the fault condition and the proper recovery state for each point can severely restrict the test activities.

Reconfirmation of the coverage factor for the next generation of hardware or firmware could require exhaustive retesting. Once the original coverage factor has been computed or measured, some method for reconfirming the coverage is necessary, without expending the same effort.

What is needed is,

A method by which the coverage factor can be measured and then reconfirmed for each iteration of the system design with the minimum of effort while maintaining a high degree of confidence that the measured factor represents the actual coverage.

One method used to address this question is Physical Fault Injection. This technique

places physical faults in the system under test and observes the resulting error or

failure response [Crou82], [Lala83], [Damm88], [Schu86], [Gunn87].

Coverage Measurement Statistics

The behavior of a system in the presence of faults can be determined through a

suitably selected set of proportion sample tests. Each test induces a fault in the

system and the resulting behavior is observed. From these sample tests an

inference can be made regarding the total population of faults that can occur in

the system. The method used in the following sections is based on Statistical

Inference theory [Coch77], [Lars81], [Hoel62], [Yama67], [Hoel72], [Bern88].

Three aspects of the sampling process and statistical inference are applicable here:

As the sample size increases, the estimate of the parameter of interest generally gets closer to the true value, with complete correspondence reached when the sample size equals the entire population. This aspect is referred to as Consistency.

Whatever the sample size, the sample should be representative of the underlying population. This aspect is referred to as Bias.

For certain statistical distributions, the arithmetic mean is considered to be more stable from sample to sample than any other measure of the central tendency of the sampled population. For any size sample the sample mean will tend to be closer to the population mean than any other unbiased estimator. The precision of the sample mean is referred to as the Efficiency of the estimator.

Coverage Factor Measurement Assumptions

There are several assumptions used in the development of the coverage factor

calculations:

Each fault that can occur in the system will result in an error or failure of the system. This assumption states that there are no hidden faults or circuit elements that when faulty do not result in an error or failure.

The goal of the diagnostics software is to detect and properly recover from 100% of the faults. The determination of the proper recovery behavior is determined by the functional specification of the system under evaluation.

The fault locations can be enumerated in some way. The population of possible faults can then serve as a sample population for the sampling method described below.

Coverage Measurement Sampling Method

In a practical implementation what is needed is a sampling method by which a

small number of fault locations can be tested to determine, within a desired

confidence interval, that the remaining (unselected) fault locations will exhibit

similar behavior. The determination of the proper number of samples from the

total population of available faults is based on the technique of estimating the

population proportion.

The supporting theory for the sampling method used here relies on the following assumptions:

If a random variable $X$ is normally distributed with a mean of $\mu$ and a standard deviation of $\sigma$, and a random sample of size $n$ is drawn, then the sample mean $\bar{X}$ will be normally distributed with a mean $\mu$ and a standard deviation $\sigma / \sqrt{n}$.

A population proportion is considered a special case of the mean, where the random variable $X$ takes on only the values 0 or 1 (successful or unsuccessful). The sample proportion p is an unbiased estimator of the population proportion P.

Normal Population Statistics

The probability density function for a normally distributed random variable is,

$$f(x) = \frac{1}{\sigma_X\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_X}{\sigma_X}\right)^2\right]. \tag{6.1}$$

Eq. (6.1) is difficult to work with since the mean and standard deviation values must always be adjusted to the distribution. A standardizing technique can be used to Normalize the distribution using the expression,

$$Z = \frac{X - \mu_X}{\sigma_X}. \tag{6.2}$$

For a small sample population, the confidence interval estimate for the population proportion is given by [Bend71],

$$p_S - Z\sqrt{\frac{p_S q_S}{n}} \le p \le p_S + Z\sqrt{\frac{p_S q_S}{n}}. \tag{6.3}$$
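The interval of Eq. (6.3) can be evaluated directly once a fault injection campaign has produced a count of covered faults. The sketch below is a minimal illustration; the function name and the example counts (285 covered faults out of 300 injections) are hypothetical and not taken from this work.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval p_S +/- Z * sqrt(p_S * q_S / n)."""
    p_s = successes / n                 # sample proportion
    q_s = 1.0 - p_s
    half_width = z * math.sqrt(p_s * q_s / n)
    return p_s - half_width, p_s + half_width


if __name__ == "__main__":
    low, high = proportion_confidence_interval(285, 300)  # hypothetical counts
    print(f"coverage estimate lies in [{low:.4f}, {high:.4f}]")
```

The default `z=1.96` corresponds to the 95% confidence level used later in this chapter.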

Sample Size Computation

The confidence interval estimate of the true proportion $p$ obtained from Eq. (6.3) is,

$$p_S \pm Z\sqrt{\frac{p_S q_S}{n}}. \tag{6.4}$$

To determine the sample size that meets the confidence interval estimate, Eq. (6.3) can be rearranged to give,

$$Z\sqrt{\frac{pq}{n}} = p_S - p. \tag{6.5}$$

In this case the sampling error $e$ is the difference between the sample estimate $p_S$ and the population parameter $p$ given by,

$$e = Z\sqrt{\frac{pq}{n}}. \tag{6.6}$$

Solving Eq. (6.6) for the sample size gives,

$$n = \frac{Z^2 pq}{e^2}. \tag{6.7}$$

For a finite population size using sampling without replacement the error on estimating the sample proportion is,

$$e = Z\sqrt{\frac{pq}{n}}\sqrt{\frac{N - n}{N - 1}}. \tag{6.8}$$

Without considering the finite population error correction, the number of samples necessary for a desired error is,

$$n_0 = \frac{Z^2 pq}{e^2}, \tag{6.9}$$

where $n_0$ is the sample size without considering the finite population correction factor. Applying the finite population correction factor, the number of samples is given by,

$$n = \frac{n_0 N}{n_0 + (N - 1)}. \tag{6.10}$$

General Confidence Intervals

What is now needed is a technique which produces a sample size which results in

the desired confidence estimate for the coverage factor. In practice the selection

of the sample size depends on three unknowns. Any normal random variable X is

converted to a standardized normal random variable Z with a mean of 0 and a

standard deviation of 1. The probability density function for Z is then given by,

$$f(Z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}Z^2\right). \tag{6.11}$$

If the samples are drawn from a non–normal distribution Eq. (6.11) may not be

valid. An important concept is used at this point – The Central Limit Theorem

[Bend71], [Hoel62], [Papo65].

As the sample size (the number of observations in each sample) grows "large enough," the sampling distribution of the mean can be approximated by the normal distribution. This is true regardless of the shape of the distribution of the individual values in the population.

Proportion Statistics

In the case of the fault coverage measurement statistics, the variable of interest X

indicates success or failure of the diagnostics. What is of interest is the proportion

of faults that are covered by the diagnostic software rather than the actual

number of covered faults.

The proportion of the sample values that exhibit successful results is given by,

$$p_S = \frac{X}{n}, \tag{6.12}$$

with a sample proportion mean of,

$$\mu_{p_S} = p \quad \text{and} \quad q = 1 - p, \tag{6.13}$$

and a sample proportion standard deviation of,

$$\sigma_{p_S} = \sqrt{\frac{pq}{n}}. \tag{6.14}$$

Rewriting Eq. (6.2) in terms of the proportion statistics results in a normalization expression of,

$$Z = \frac{p_S - p}{\sqrt{\dfrac{pq}{n}}}. \tag{6.15}$$

The Central Limit Theorem and the Standard Error of the Mean are based on the

premise that the samples selected are chosen with replacement. However, when

the fault location samples are drawn they are done so without replacement from a

finite population of size $N$. When the sample size is not small compared to the population, that is $n > 0.05\,N$, a finite population correction factor is applied to the standard error of the mean. Eq. (6.16) shows this correction factor [Coch77], [Bern88],

$$\sigma_{p_S} = \sqrt{\frac{pq}{n}}\sqrt{\frac{N - n}{N - 1}}. \tag{6.16}$$

Confidence Interval Estimate of the Proportion

Using the Central Limit Theorem or knowledge about the population distribution,

an estimate can be made regarding the percentage of sample means that fall

within a certain distance from the population mean [Coch77], [Sned80],

The level of confidence desired, Z,

The sampling error permitted e and

The true proportion of success p.

Unknown Population Proportion

The selection of the three unknowns described above is often difficult. Once the

desired level of confidence is chosen and the appropriate Z computed from the

normal distribution, the sampling error e indicates the amount of error that is

acceptable. The third quantity, the true proportion of success p is actually the

population parameter the measurement is attempting to quantify. The question is,

How can a value for the true proportion be stated for the value of the sample proportion that is being measured?

There are two alternatives. The first is to use past information regarding the

population proportion to make an educated estimate of the true proportion p. If

past information is not available an estimate is provided which never underestimates

the sample size needed. Referring to Eq. (6.16) it is observed that the quantity

$pq$ appears in the numerator. The value of $p$ that will make $pq$ the largest is

$p = 0.5$.

With no prior knowledge or estimate of the true proportion $p$, using $p = 0.5$ will

result in the most conservative determination of the necessary sample size. If the

actual sample proportion is very different from 0.5, the width of the real

confidence interval may be substantially narrower than the estimate obtained

using this method.
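The conservative nature of the choice $p = 0.5$ can be checked numerically. The sketch below scans a grid of candidate proportions for the one that maximizes the product $pq$, and compares the uncorrected sample size for two guesses of $p$. The function names and the grid resolution are illustrative assumptions.

```python
def worst_case_proportion(steps=100):
    """Return the grid value of p that maximizes p * (1 - p)."""
    candidates = [k / steps for k in range(steps + 1)]
    return max(candidates, key=lambda p: p * (1.0 - p))


def infinite_population_samples(p, z=1.96, e=0.05):
    """Uncorrected sample size Z^2 * p * q / e^2 for a guessed proportion p."""
    return z * z * p * (1.0 - p) / (e * e)


if __name__ == "__main__":
    print("worst-case p:", worst_case_proportion())
    for guess in (0.5, 0.8, 0.95):
        print(guess, round(infinite_population_samples(guess), 1))
```

Any prior estimate of $p$ away from 0.5 reduces the computed sample size, which is why 0.5 never underestimates the number of samples needed.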

Clopper–Pearson Estimation

When sampling proportion values from an unknown (non–normal) distribution, some constraints must be placed on the sample size so that the

expressions in the preceding section are valid. For small sample sizes, a lower

limit is set for a finite population and a specified confidence interval. Details of

these constraints are given in [Yama67] pp. 89–95. The sample size limits were

tabulated by C. J. Clopper and E. S. Pearson in [Clop34], [Kend61].

If the proportion estimate is to have a 95% confidence with a 5% error, (these

figures are general guidelines described in [Yama67]) Figure 10 is used. The

normal approximation for sample and the absolute sample size n follows the

guidelines in Figure 10 [Lawr68].

Sample Proportion p     Sample Size n must be at least
0.4                     50
0.3 or 0.7              80
0.2 or 0.8              200
0.1 or 0.9              600
0.05 or 0.95            1400

Figure 10 – Sample size requirement for a specified estimate as tabulated by Clopper and Pearson.

Practical Sample Estimates

Assume an electronic circuit assembly containing 6000 fault points consisting of

375 packages with 8 outputs, each capable of being in one of three states

[Myer64], [Barr73], [Boss70], [Seth77], [McCl86], [TUV86],

Operating properly

Stuck HIGH

Stuck LOW

The question is,

How many random samples must be taken on the circuit assembly to determine with 95% confidence ($Z = 1.96$) and 5% error ($e = 0.05$) that the coverage factor for the circuit is 0.95?

Using Eq. (6.16) and the worst case estimator for the population proportion $p = 0.5$ gives the number of samples from an infinite population of,

$$n_0 = \frac{Z^2 pq}{e^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.05)^2} \approx 385 \text{ samples}, \tag{6.17}$$

rounding up, and a corrected sample count from a finite population of 6000 without replacement of,

$$n = \frac{n_0 N}{n_0 + (N - 1)} = \frac{385 \times 6000}{385 + (6000 - 1)} \approx 362 \text{ samples}. \tag{6.18}$$

A fault injection procedure using 362 fault locations is a feasible task during the

engineering validation portion of a product development cycle. Using the

sampling method, an incremental engineering design can be revalidated with

relative ease. Such testing procedures have been performed on industrial control

electronics as a mechanism for verifying each incremental release of hardware and

software.
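Because the sample count has to be recomputed for each design iteration, the calculation of Eq. (6.9) and Eq. (6.10) is easy to script. The following is a minimal sketch; the function names are hypothetical, and rounding up to a whole number of fault injections is an assumption about how the result would be used.

```python
import math


def sample_size_infinite(z, p, e):
    """Eq. (6.9): sample size ignoring the finite population correction."""
    return z * z * p * (1.0 - p) / (e * e)


def sample_size_finite(n0, population):
    """Eq. (6.10): sample size corrected for sampling without replacement."""
    return n0 * population / (n0 + (population - 1))


if __name__ == "__main__":
    n0 = sample_size_infinite(z=1.96, p=0.5, e=0.05)
    n = sample_size_finite(n0, population=6000)
    # Round up, since a fraction of a fault injection cannot be performed.
    print("infinite population:", math.ceil(n0))
    print("finite population of 6000:", math.ceil(n))
```

The inputs shown are those of the circuit assembly example above: 95% confidence, the worst case proportion, a 5% error, and 6000 fault points.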

Time Dependent Aspects of Fault Coverage Measurement

In the previous section, it was assumed that each injected fault immediately

resulted in an observable error or system failure. In many instances, this is not the

case; rather, some time may pass before the effects of the injected fault are

observed. Two time intervals require further analysis [Aviz86], [Bour69],

The fault dormancy is the time interval between the occurrence of a fault and its activation as an error.

The error detection latency is the time interval between an error and its detection by the diagnostic subsystem.

In addition to single faults causing observable errors, near coincident faults are of

important interest in multiprocessor and / or real–time control systems

[McGo83].

An expression for the coverage in the presence of time–delayed realization will be

developed using the following notation [Boss82], [Shin86], [Arla85]. Let the following variables be defined,

$T_P$ – is a random variable denoting the time at which the system is checked for an observable error or failure as a result of a physical fault injection.

$n$ – is the number of physical fault injections performed.

$N(t)$ – is the total number of times the system is checked in the interval $[0, t]$.

COMMON CAUSE FAILURE EFFECTS

Systematic failures differ from random failures in that the protection mechanism

for random failures may not function properly in the presence of a systematic

failure [Bour81], [Wats79], [Hene81]. Systematic failures occur as a result of

design, operation, or environmentally induced errors which affect two or more

channels of a redundant hardware system simultaneously, possibly resulting in a

total system failure. Software systematic failures may also occur, but are more

difficult to quantify [Cost78], [Hals79], [Shoo73], [Iyer85], [Leve83], [Rama79]. A

reliability analysis metric, such as those described in Figure 1 is needed to estimate

the effects of systematic failures in a redundant system. To date no clear method

for deriving these metrics has been developed, although practical experience has

shown that the effects of systematic failures are present but are not quantified.

In the literature, systematic failures are referred to as Common Cause Failures

[HSE87], [Heik79]. This is the term that will be used in this work. There are many

conflicting issues, as well as conflicting definitions in the area of Common Cause

Failure analysis. The confusion is due to different interpretations of the term

failure versus unavailable functionality. There are several techniques for modeling the

Common Cause Failure process and defining the detailed terms and expressions

including,

Square Root Bounding [Harr86]

Beta–Factor [Flem74], [Hump87], [Evan84], [Wall85]

Multinomial Failure Rate [Apos87]

Basic Parameter [Flem85], [Apos87]

Multiple Dependent Failure Function [Heis84]

Common Load [Mank77]

Non Identical Components [Vese77]

Multiple Greek Letter Models [Apos87], [Flem85]

In the following section, an overview of the Common Cause Failure modeling

techniques will be presented. Throughout the discussion, several terms will be

used to describe the various failure rates. The term $\lambda$ is the total failure rate of one specific component, regardless of the effects of the failure. The term $\lambda_i$ is the rate of independent component failures. The failure of an independent

component in a system does not lead to the failure of the system if,

The diagnostic coverage factor for that component is equal to 1.

The specific component is not a single point of failure in the architecture of the system.

The term $\lambda_d$ is the rate of dependent failures that possibly affect other components. The failure of a dependent component may lead to the failure of the system, if the failure of the specific component leads to the failure of other components (i.e. components dependent on the specific component). The term $\lambda_S^{(k)}$ is the rate of system failures in which exactly $k$ components fail, where $k = 1, 2, \ldots, n$ is termed the failure multiplicity. In a Triple Modular Redundant system, $\lambda_S^{(k)}$ with $k = 1$ is the rate of failure of any single subsystem or component of the triad.

The total failure rate is then given by $\lambda = \lambda_i + \lambda_d$. This notation is different than other notations found in this work. The subscript $i$ is used to indicate independent failures and the subscript $d$ is used to indicate dependent failures. In previous sections the subscript $i$ is used to indicate the $i$th component of a system or subsystem.

The following section describes several methods for evaluating Common Cause

Failures.

Square Root Bounding Problem

A Common Cause Failure analysis method introduced in the WASH–1400 Report [WASH75] provides a technique for measuring the upper, $P_{\text{upper}}$, and lower, $P_{\text{lower}}$, bounds of the Common Cause Failure probabilities. These bounds are denoted by $P_{\text{upper}} = q$ (complete dependence) and $P_{\text{lower}} = q^n$ (complete independence). The probability of a simultaneous failure of all components is given by the geometric mean of $P_{\text{upper}}$ and $P_{\text{lower}}$, which implies the square root of their product, i.e. $\sqrt{P_{\text{lower}}\,P_{\text{upper}}} = q^{(n+1)/2}$.
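A short numerical illustration of the square root bounding idea follows. It assumes the bounds are the complete dependence probability $q$ and the complete independence probability $q^n$; the function name and the example values are hypothetical.

```python
import math


def square_root_bound(q, n):
    """Geometric mean of the complete-dependence and complete-independence bounds."""
    p_upper = q          # all n components fail together with probability q
    p_lower = q ** n     # n independent failures, each with probability q
    return math.sqrt(p_lower * p_upper)


if __name__ == "__main__":
    q, n = 0.01, 3       # hypothetical per-component failure probability, triplicated system
    print(square_root_bound(q, n))
```

For a triplicated system the estimate reduces to $q^2$, which lies between the two bounds.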

Beta Factor Model

The Beta–Factor model is most commonly used to account for the occurrences of dependent failures [Flem74] because of its simplicity. Given that a component has failed, this failure will with probability $\beta$ be a dependent failure, and all components of the redundant system will fail. With probability $1 - \beta$, there is an independent failure and only the specific component which is a subset of the system fails.

The rate of independent component failures is given by $\lambda_i = (1 - \beta)\lambda$ and dependent component failures given by $\lambda_d = \beta\lambda$. It also follows that the system failure rates are $\lambda_S^{(1)} = n(1 - \beta)\lambda$ for a single component, $\lambda_S^{(k)} = 0$, $k = 2, 3, \ldots, n - 1$, and $\lambda_S^{(n)} = \beta\lambda$ for multiple components.
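The Beta–Factor split of a total component failure rate can be sketched as follows. The function name, the dictionary layout and the example numbers are illustrative assumptions; the sketch takes the total rate and the fraction of dependent failures as inputs and returns the independent rate, the dependent rate and the system rate for each failure multiplicity.

```python
def beta_factor_rates(total_rate, beta, n):
    """Split a total component failure rate using the Beta-Factor model.

    Returns (independent_rate, dependent_rate, system_rates), where
    system_rates[k] is the rate of events in which exactly k of the
    n components fail. Assumes n >= 2.
    """
    independent = (1.0 - beta) * total_rate
    dependent = beta * total_rate
    system_rates = {k: 0.0 for k in range(1, n + 1)}
    system_rates[1] = n * independent   # any one of the n components fails alone
    system_rates[n] = dependent         # a dependent failure takes out all n
    return independent, dependent, system_rates


if __name__ == "__main__":
    # Hypothetical values: 1e-5 failures per hour, 10% dependent, triplicated system.
    print(beta_factor_rates(1e-5, 0.1, 3))
```

In this model intermediate multiplicities never occur, which is the simplification that makes a single parameter sufficient.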

Multinomial Failure Rate (Shock Model)

The Multinomial Failure Rate or Shock Model [Apos87] technique accounts for the possibility that any number of components may fail simultaneously. The fundamental assumption of the Shock Model is that two causes of failure exist. The variable $\lambda$ describes the failure rate for one specific component that experiences an independent failure. The variable $\nu$ describes the rate of External System Shocks, which may cause one or more components to fail simultaneously. Given that a shock of rate $\nu$ occurs, the multiplicity of failure distribution is given by $f_k$. This term is the probability that exactly $k$ components, among the total of $n$ components, fail due to the external shock.

Using the failure distribution $f_k$ requires that the term $\sum_{k=0}^{n} f_k = 1$ be satisfied.

The system failure rate of multiplicity $k$ is given by the following,

$\lambda_S^{(1)} = n\lambda + \nu f_1$, $k = 1$ and $\lambda_S^{(k)} = \nu f_k$, $k = 2, 3, \ldots, n$.

Binomial Failure Rate Model

The Binomial Failure Rate Model can be defined in a similar manner to the Multinomial model [Atwo86] with the following specific assumptions.

At the occurrence of the shock, each component has the same probability $p$ of failure.

Given that a shock has occurred, all components fail independently.

With these assumptions, the number of components affected by the shock will fail according to the binomial distribution with parameters $n$ and $p$. The multiplicity failure distribution is then $f_k = \binom{n}{k} p^k (1 - p)^{n-k}$ [Vese77]. The mean number of components that fail is $np$ and the component failure rate as a result of the shock is $\nu p$. The total failure rate of a single component is given by $\lambda + \nu p$.
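The binomial multiplicity distribution is straightforward to tabulate. In the sketch below, `n` is the number of components and `p` the per-component failure probability given a shock; `independent_rate` and `shock_rate` are hypothetical names for the independent component failure rate and the rate of external shocks.

```python
from math import comb


def multiplicity_distribution(n, p):
    """Probability that exactly k of n components fail in a shock, k = 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def single_component_rate(independent_rate, shock_rate, p):
    """Total failure rate of one component: independent part plus shock part."""
    return independent_rate + shock_rate * p


if __name__ == "__main__":
    f = multiplicity_distribution(3, 0.1)       # hypothetical triplicated system
    print(f)
    print(sum(k * fk for k, fk in enumerate(f)))   # mean number failing per shock
    print(single_component_rate(1e-5, 1e-6, 0.1))
```

The distribution includes the term for zero failures, since a shock in this model may leave every component intact.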

Multi–Dependent Failure Fraction Model

The previous models can be expanded to provide information about the general behavior of the failures of identical components. This model [Heis84] can be considered a special case of the Multiple Failure Rate Model when $f_0 = 0$ and $\lambda = 0$. Thus $\nu$ denotes the total rate of system failures and $\lambda_S^{(k)} = \nu f_k$, $k = 1, 2, \ldots, n$. In this failure model the sum of the multiplicity terms $f_k$ ($k \ge 1$) equals 1 with only $n$ parameters required for the estimation.

Basic Parameter Model

This model could be formulated as a Shock Model [Flem85], [Apos87] with $\lambda = 0$. However, the following parameter is chosen for the model,

$\lambda_k$ = failure rate of a specific group of $k$ components.

The meaning of this parameter and its relation to the parameters of the Multiple Failure Rate Model is made clear when a Triple Modular Redundant system is considered. The rate of failure of one specific component in the system is $\lambda_1 = \nu f_1 / 3$. The failure rate of two specific components of the Triplicate system is given by $\lambda_2 = \nu f_2 / 3$. This failure rate is the double simultaneous rate of the Triplicate system. The simultaneous failure rate of all three components is then given by $\lambda_3 = \nu f_3$. The total failure rate of one specific component is then given by, $\lambda = \lambda_1 + 2\lambda_2 + \lambda_3$.

Multiple Greek Letter Model

The Multiple Greek Letter Model [Flem85], [Apos87] represents another form of the Basic Parameter Model. Using a Triplicate system as an example, $\lambda$ is the total failure rate of one specific component. The conditional probability that the component failure will be shared by at least one additional component is given by $\beta$. The conditional probability that a component failure which is known to be shared by at least one additional component will in fact be shared by two or more additional components is given by $\gamma$.

Using these three parameters a group of expressions can be formed which describe the failure rates of one, two, or three components. The failure rate of one specific component is given by $\lambda_1 = (1 - \beta)\lambda$. This single component is one of the triplicate set of components in the system. The failure rate of two components is given by $\lambda_2 = \frac{1}{2}\beta(1 - \gamma)\lambda$. The two components that fail are members of the set of three components of the triplicate system. The failure rate of three components is given by $\lambda_3 = \beta\gamma\lambda$.
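The three-component Multiple Greek Letter expressions can be checked for internal consistency with a few lines of code. This sketch assumes the usual triplicate form of the model (a total component rate plus the two conditional probabilities, here called `beta` and `gamma`); the function name and example values are hypothetical.

```python
def mgl_rates(total_rate, beta, gamma):
    """Rates for one specific component, one specific pair, and all three.

    total_rate: total failure rate of one specific component.
    beta: probability the failure is shared by at least one other component.
    gamma: probability a shared failure involves both other components.
    """
    rate_single = (1.0 - beta) * total_rate
    rate_pair = 0.5 * beta * (1.0 - gamma) * total_rate
    rate_triple = beta * gamma * total_rate
    return rate_single, rate_pair, rate_triple


if __name__ == "__main__":
    single, pair, triple = mgl_rates(1e-5, 0.1, 0.3)   # hypothetical values
    # A given component belongs to two distinct pairs and to the one triple.
    print(single + 2 * pair + triple)
```

Summing the single rate, twice the pair rate and the triple rate recovers the total component rate, which matches the bookkeeping of the Basic Parameter Model.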

Common Load Model

The Common Load Model [Mank77] describes a stochastic resistance to failure

for each independent component. These components are then exposed to a

stochastic stress. The probability that exactly k components fail due to some

common stress is calculated. A generalization of this model has been developed

which is similar to the Shock Model [Harr86].

Nonidentical Components Model

In each of the above models, homogeneous components have been assumed in

order to simplify the number of parameters. Nonidentical component models

have been developed [Vese77] in which diverse components that operate in

parallel are considered.

Practical Example of Common Cause Failure Analysis

The determination of systematic failure rate in various systems models is based

on the empirical estimation from field measurements [Wats79]. Field data is

expressed in terms of Common Cause Failure Rate for each set of parallel

subsystems. In Figure 8, a systematic failure mode would result in a transition from state $3$ to state $0$. In the presence of the Common Cause Failure event this transition would take place with a probability of $p = 1$ (since the failure mode is not random).

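The Beta–Factor model is the most straightforward technique (if its limitations are understood [SINT88]). The Common Cause Failure rate using the Beta Factor is related to the independent subsystem failure rate by the following,

$$\beta = \frac{\lambda_{\text{ccf}}}{\lambda_{\text{ccf}} + \lambda_i}. \tag{6.19}$$

The ratio of the Common Cause Failure rate to the independent failure rate is,

$$\frac{\lambda_{\text{ccf}}}{\lambda_i} = \frac{\beta}{1 - \beta}, \tag{6.20}$$

giving,

$$\lambda_{\text{ccf}} = \frac{\beta\,\lambda_i}{1 - \beta}. \tag{6.21}$$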

The total decrease in the system reliability is adjusted by the Beta Factor. Some

examples of these factors are shown in Figure 11 [HSE87].

                      Minimum   Typical   Maximum
Identical Systems     3         10        30
Diverse Systems       0.3       1         3

Figure 11 – Common Cause Failure mode guide figures for electronic programmable systems [HSE87]. These are ratios of non–CCF to CCF for various system configurations. CCFs are defined as non–random faults that are designed in or experienced through environmental damage to the system. Other sources [SINT88], [SINT89] provide different figures.

Common Cause Software Reliability

Throughout this work the reliability discussion has centered on the underlying

hardware and subsystems that make up the electronic system. The reliability

figures for hardware components are based on a statistical failure mode

developed in Chapter 2 in which the failure rate is based on an underlying

physical failure mechanism.

When discussing software failures, statistical models are also used. However, the

basis for these models is not founded in the underlying physical failures of

semiconductor or mechanical components, but rather on less tangible metrics,

including [Musa89a], [Harr89], [Litt73], [Musa75], [Goel79], [Leve87], [Leve89],

Field measurements of existing software systems and their observed failure rates.

Calculations based on theoretical models of software systems and their execution.

Because of this less tangible failure process, traditional failure prediction methods

themselves sometimes fail to produce proper results [Lipo79], [Boeh76]. Further

understanding of software reliability models will be necessary.

Software Reliability Concepts

The reliability of software can be defined by the conditional survival probability of two random variables $S_k$ and $X_k$ which describe the behavior of the software system, such that $R(x \mid t) = P\{X_k > x \mid S_{k-1} = t\}$, where $X_k$ is the time interval between the $(k-1)$th and $k$th software failure and $S_k$ is the $k$th software failure occurrence time. Software reliability then corresponds to the probability that a failure does not occur in the time interval $(t, t + x]$ [Yama83].

Through a testing procedure, software faults are detected and removed.

Assuming no new software faults are introduced as a result of the repair process,

the software reliability increases with time. The test–repair cycle is well

understood and is usually explained by a Software Reliability Growth Model [Goel79],

[OHBA84], [Yama83]. The growth curve $H(t)$ relates the number of detected software faults to the time span of the program testing. By definition $H(\infty)$ is the cumulative number of faults to be eventually detected. The number of residual faults at time $t$ can be estimated by $H(\infty) - H(t)$. Figure 12 describes the various software reliability growth models.

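Exponential Growth Models with parameters $N, \phi$:

Exponential: $H(t) = N \left( 1 - e^{-\phi t} \right)$

Hyperexponential: $H(t) = N \sum_{i=1}^{n} p_i \left( 1 - e^{-\phi_i t} \right)$

N – initial number of faults in the system

$\phi$ – fault detection rate

S–Shaped Growth Models with parameters $N, \phi, r > 0$:

Delayed S–Shaped: $H(t) = N \left[ 1 - (1 + \phi t)\, e^{-\phi t} \right]$

Inflection S–Shaped: $H(t) = N \, \dfrac{1 - e^{-\phi t}}{1 + \frac{1 - r}{r}\, e^{-\phi t}}$

N – initial number of faults in the system.

$\phi$ – fault detection rate.

r – ratio of total faults to detectable faults.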

Figure 12 – Four Software Growth Model expressions. The exponential and hyperexponential growth models represent software faults that are time independent. The S–Shaped growth models represent time delayed and time inflection software fault growth rates [Mats88].
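The four growth curves of Figure 12 can be evaluated numerically. The following is a minimal sketch, not the author's implementation: the function names, the symbol `phi` for the fault detection rate, and the mixing weights `weights` in the hyperexponential model are assumptions introduced for illustration.

```python
import math


def exponential_growth(t, n_faults, phi):
    """Exponential model: H(t) = N (1 - exp(-phi t))."""
    return n_faults * (1.0 - math.exp(-phi * t))


def hyperexponential_growth(t, n_faults, weights, rates):
    """Hyperexponential model: H(t) = N sum_i p_i (1 - exp(-phi_i t)).

    The weights p_i are assumed to sum to one (an assumption of this sketch).
    """
    return n_faults * sum(
        p * (1.0 - math.exp(-phi * t)) for p, phi in zip(weights, rates)
    )


def delayed_s_shaped_growth(t, n_faults, phi):
    """Delayed S-shaped model: H(t) = N [1 - (1 + phi t) exp(-phi t)]."""
    return n_faults * (1.0 - (1.0 + phi * t) * math.exp(-phi * t))


def inflection_s_shaped_growth(t, n_faults, phi, r):
    """Inflection S-shaped model: H(t) = N (1 - e^{-phi t}) / (1 + ((1-r)/r) e^{-phi t})."""
    decay = math.exp(-phi * t)
    return n_faults * (1.0 - decay) / (1.0 + ((1.0 - r) / r) * decay)


def residual_faults(detected, n_faults):
    """Residual faults H(infinity) - H(t), using H(infinity) = N."""
    return n_faults - detected
```

With `r = 1` the inflection S-shaped curve reduces to the exponential curve, which is a convenient sanity check when fitting either model to test data.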

In contrast to hardware failure models, software reliability cannot take advantage

of the aging aspects of the product. Once a software error is detected, diagnosed

and repaired, it is corrected forever. The result of this action is an improvement

in the reliability of the system as shown in Figure 12. However, during the

lifetime of the system, errors continue to appear at some rate and a simple model

of the underlying reliability is needed. (The subject of software reliability is

complex and the following description is a simple approach for the purpose of

this work.)

Let $N^{*}$ be the number of distinct input types to the system, and let $N$ be the number of input types that result in software failure. $N^{*}$ is assumed to be large relative to $N$, which is assumed to be unknown. Once an error is detected and corrected, the total number of possible errors, $N$, is reduced by one. If inputs to the software system occur according to a Poisson process with intensity $\omega$, then given $N$ and $N^{*}$, the probability of no failure in the interval $[0, t)$ is expressed by

$\sum_{i=0}^{\infty} \frac{e^{-\omega t} (\omega t)^{i}}{i!} \left( \frac{N^{*} - N}{N^{*}} \right)^{i} = \exp\left( -\frac{\omega N t}{N^{*}} \right).$

This is the Shock Model for failures [Barl85].

Let $T_i$ denote the time between the $(i-1)$th and $i$th failure, $i \le N$. Given $N$ and $N^{*}$, the survival function of $T_i$ is given by

$\bar{F}_i(t \mid N, N^{*}) = \exp\left( -\frac{\omega (N - i + 1)\, t}{N^{*}} \right).$

This expression and its refinements [Jeli72], [Musa75], [Schi78], provide appropriate approximations for a static reliability model.
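The Shock Model identity can be checked numerically: summing the Poisson-weighted probability that every input avoids the failing input types gives the same value as the closed-form exponential. The sketch below assumes the notation $N^{*}$ (total input types), $N$ (failing input types), and intensity $\omega$; the function names are hypothetical.

```python
import math


def no_failure_probability_series(omega, t, n_total, n_failing, terms=200):
    """Sum over i of Poisson(i; omega t) * ((N* - N) / N*)^i, truncated at `terms`."""
    keep = (n_total - n_failing) / n_total
    total = 0.0
    term = math.exp(-omega * t)  # i = 0 term of the series
    for i in range(terms):
        total += term
        # Next term: multiply by (omega t) * keep / (i + 1).
        term *= omega * t * keep / (i + 1)
    return total


def no_failure_probability_closed_form(omega, t, n_total, n_failing):
    """Closed form of the same sum: exp(-omega N t / N*)."""
    return math.exp(-omega * n_failing * t / n_total)


def interfailure_survival(t, i, omega, n_total, n_failing):
    """Survival function of T_i: exp(-omega (N - i + 1) t / N*)."""
    return math.exp(-omega * (n_failing - i + 1) * t / n_total)
```

For the first interfailure time (`i = 1`) the survival function coincides with the no-failure probability, and each corrected fault lowers the effective failure rate by one unit of $\omega / N^{*}$.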

The software utilized in a high reliability system most often runs continuously, rather than on demand. This continuous operation provides the observer with an unchanging set of functions, from which inferences can be made regarding reliability. Inputs to the software system can be considered a finite set of random variables. Specification of these inputs and the relative frequency of their occurrence is termed the Operational Profile [Musa89]. Modeling the failure rate of a steady state software system can be performed using the function $f(t) = \lambda t$, where $\lambda$ is the failure rate constant. This function models a system where both the software and the operational profile remain fixed in their behaviors.

Changes to the software functionality are usually made only when the system is not in use. Modeling the failure rate for a software system under development and being maintained (i.e. the software is being changed periodically) can be done with the function,

$f(t) = f_0 \left[ 1 - \exp\left( -\frac{\lambda_0}{f_0}\, t \right) \right].$

During the system test phase, software faults that are found can be repaired, so the failure rate should decrease as the testing proceeds. In this model the Operational Profile remains unchanged, but the total number of faults $f_0$ are corrected immediately whenever a failure

occurs. If the faults are not corrected immediately, repeated occurrences of the

same fault are discarded and the model becomes 00

0

1 expf t tf

.

This model is referred to as the Basic Execution Model [Musa87].

The Basic Execution Model provides a reasonable representation of the software failure modes. The advantage of this model is that the parameters $\lambda_0$, the initial software failure rate, and $f_0$, the total number of failure possibilities, can be related to other parameters in the software failure analysis.

A second model for software failures can be constructed if it is assumed that some faults are more likely than others to cause a system failure. If a second assumption is made that the improvement in the failure rate as a result of the software fault repair decreases exponentially as the repairs are made, the failure function becomes $f(t) = \frac{1}{\theta} \ln\left( \lambda_0 \theta t + 1 \right)$. This function models a system in which the Operational Profile remains unchanged and where corrections are made to the software when a failure occurs. For this model the failure rate is given as $f'(t) = \frac{\lambda_0}{\lambda_0 \theta t + 1}$ and is referred to as the Logarithmic Poisson Execution Model [Musa89]. The parameter $\lambda_0$ has the same meaning as it does in the Basic Execution Model. The parameter $\theta$ characterizes the exponential rate at which the failure rate is reduced as a function of time [Chri88].
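The two execution models can be compared side by side. This is a sketch under stated assumptions: the failure functions are taken as $f(t) = f_0[1 - \exp(-\lambda_0 t / f_0)]$ for the Basic Execution Model and $f(t) = \frac{1}{\theta}\ln(\lambda_0 \theta t + 1)$ for the Logarithmic Poisson Execution Model, and the failure rates are obtained here by differentiating those functions. The function names are hypothetical.

```python
import math


def basic_expected_failures(t, lambda0, f0):
    """Basic Execution Model: f(t) = f0 [1 - exp(-lambda0 t / f0)]."""
    return f0 * (1.0 - math.exp(-lambda0 * t / f0))


def basic_failure_rate(t, lambda0, f0):
    """Derivative of the basic failure function: lambda0 exp(-lambda0 t / f0)."""
    return lambda0 * math.exp(-lambda0 * t / f0)


def log_poisson_expected_failures(t, lambda0, theta):
    """Logarithmic Poisson model: f(t) = (1 / theta) ln(lambda0 theta t + 1)."""
    return math.log(lambda0 * theta * t + 1.0) / theta


def log_poisson_failure_rate(t, lambda0, theta):
    """Derivative of the logarithmic failure function: lambda0 / (lambda0 theta t + 1)."""
    return lambda0 / (lambda0 * theta * t + 1.0)
```

Both models start from the same initial failure rate $\lambda_0$ at $t = 0$. The basic model saturates at $f_0$ expected failures, while the logarithmic model keeps growing without bound, only more and more slowly.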

In each of the models described above, it is assumed that each software fault results in one observable error. Suppose a software system initially contains an unknown number $m$ of faults. Fault number $i$ will cause an error in accordance with a Poisson process having an unknown failure rate $\lambda_i$. Then the number of errors due to fault $i$ that occur in any $s$ units of operating time is Poisson distributed with a mean $\lambda_i s$. Assume the processes for the faults $i = 1, 2, \ldots, m$ are independent and assume the software system is to be executed for time $t$ with all the resulting errors observed. At the end of this time an analysis will be performed to determine which faults caused which errors. These faults will then be removed and the error rate of the repaired software system determined.

Let,

$f_i(t) = \begin{cases} 1, & \text{if fault } i \text{ has not caused an error by time } t, \\ 0, & \text{otherwise,} \end{cases}$

then the quantity to be estimated is given by,

$F(t) = \sum_i \lambda_i f_i(t),$ (6.22)

which is the error rate of the final software package. It should be noted that $E[F(t)] = \sum_i \lambda_i E[f_i(t)] = \sum_i \lambda_i e^{-\lambda_i t}$. Let $M_1(t)$ be the number of faults that caused exactly 1 error, $M_2(t)$ the number of faults that caused exactly 2 errors, etc., with $\sum_j j M_j(t)$ being the total number of errors that resulted. The expected number of faults that caused exactly one error is given by $E[M_1(t)]$.

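A set of indicator variables can be defined such that,

$I_i(t) = \begin{cases} 1, & \text{if fault } i \text{ causes exactly 1 error,} \\ 0, & \text{otherwise.} \end{cases}$

The number of faults causing exactly one error is given by $M_1(t) = \sum_i I_i(t)$, resulting in the expression for the expected value,

$E[M_1(t)] = \sum_i E[I_i(t)] = \sum_i \lambda_i t\, e^{-\lambda_i t}.$ (6.23)

From Eq. (6.22) and Eq. (6.23) an interesting result can be obtained,

$E\left[ F(t) - \frac{M_1(t)}{t} \right] = 0.$ (6.24)

Eq. (6.24) suggests that $M_1(t)/t$ could be a good estimate of $F(t)$. To determine this, the variance of the difference between the two expressions is given as,

$E\left[ \left( F(t) - \frac{M_1(t)}{t} \right)^{2} \right] = \operatorname{var}\left( F(t) - \frac{M_1(t)}{t} \right)$

$\qquad = \operatorname{var}(F(t)) - \frac{2}{t} \operatorname{cov}(F(t), M_1(t)) + \frac{1}{t^{2}} \operatorname{var}(M_1(t)).$

Expanding the variance in terms gives,

$\operatorname{var}(F(t)) = \sum_i \lambda_i^{2} \operatorname{var}(f_i(t)) = \sum_i \lambda_i^{2} e^{-\lambda_i t} \left( 1 - e^{-\lambda_i t} \right),$

$\operatorname{var}(M_1(t)) = \sum_i \operatorname{var}(I_i(t)) = \sum_i \lambda_i t\, e^{-\lambda_i t} \left( 1 - \lambda_i t\, e^{-\lambda_i t} \right).$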

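$\operatorname{cov}(F(t), M_1(t)) = \operatorname{cov}\left( \sum_i \lambda_i f_i(t), \sum_j I_j(t) \right)$

$\qquad = \sum_i \sum_j \lambda_i \operatorname{cov}(f_i(t), I_j(t))$

$\qquad = \sum_i \lambda_i \operatorname{cov}(f_i(t), I_i(t))$

$\qquad = -\sum_i \lambda_i e^{-\lambda_i t}\, \lambda_i t\, e^{-\lambda_i t}.$

The two indicators $f_i(t)$ and $I_j(t)$ are independent when $i \neq j$ since they refer to different Poisson processes. This results in $\operatorname{cov}(f_i(t), I_j(t)) = 0$ for $i \neq j$, which gives an expression for the mean squared difference between the error rate for the entire software package and the error rate estimated from single faults,

$E\left[ \left( F(t) - \frac{M_1(t)}{t} \right)^{2} \right] = \sum_i \lambda_i^{2} e^{-\lambda_i t} + \frac{1}{t} \sum_i \lambda_i e^{-\lambda_i t}$

$\qquad = \frac{E\left[ M_1(t) + 2 M_2(t) \right]}{t^{2}}.$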
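The final identity can be verified numerically for any set of fault rates, since $E[M_j(t)] = \sum_i (\lambda_i t)^j e^{-\lambda_i t} / j!$ for independent Poisson fault processes. The sketch below is illustrative only; the function names and the example rates used to exercise it are assumptions, not data from this work.

```python
import math


def expected_m(j, rates, t):
    """E[M_j(t)]: expected number of faults that caused exactly j errors by time t."""
    return sum(
        (lam * t) ** j * math.exp(-lam * t) / math.factorial(j) for lam in rates
    )


def expected_residual_rate(rates, t):
    """E[F(t)] = sum_i lambda_i exp(-lambda_i t)."""
    return sum(lam * math.exp(-lam * t) for lam in rates)


def mean_squared_difference(rates, t):
    """E[(F(t) - M_1(t)/t)^2] = sum_i lambda_i^2 e^{-lambda_i t} + (1/t) sum_i lambda_i e^{-lambda_i t}."""
    return (
        sum(lam ** 2 * math.exp(-lam * t) for lam in rates)
        + expected_residual_rate(rates, t) / t
    )


def estimate_from_counts(m1, m2, t):
    """Estimate F(t) by M_1/t and its mean squared error by (M_1 + 2 M_2)/t^2."""
    return m1 / t, (m1 + 2 * m2) / t ** 2
```

In practice only the observed counts $M_1(t)$ and $M_2(t)$ are available, so `estimate_from_counts` is the usable form: it needs no knowledge of the individual rates $\lambda_i$.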

Software Reliability and Fail–Safe Operations

Fault–Tolerant computer systems are often applied to safety applications, where statistical reliability figures provide little comfort to the user [Brow80]. In this environment, there is a tradeoff between reliability and safety. Reliability does not assure safety [NSCC80]. Fail–Safe functionality becomes the design goal, rather than Fault–Tolerance [Leve86], with specific techniques being applied to software to assure the Fail–Safe operation of the system [TÜV86]. Although this work deals with the reliability and availability aspects of fault–tolerant systems, it can serve as the basis for the analysis of Fail–Safe operations as well [Leve83].

Unlike hardware, software does not have a wear out failure mode, even though specific failure modes do exist [Thom81]. In Fault–Tolerant systems these failure modes include,

Failure to perform an expected command.

Failure to diagnose a latent fault (fault coverage).

Generation of an erroneous, premature, or delayed command.

In the context of Fault–Tolerant system failures, faults have several definitions

[SINT89],

Dangerous Revealed – are failures that will be revealed by the effects of the fault.

Dangerous Unrevealed – are failures that will be unrevealed and will be found only through proof testing of the system.

Safe Revealed – are failures that are safe and will cause the system under control to assume a safe – but operative – state.

Safe Unrevealed – are failures that are safe but result in continued operation of the system under control. They will be revealed through the proof testing of the system.

Chapter 7

PARTIAL FAULT COVERAGE SUMMARY

The previous sections have developed the background and details of several reliability measures for redundant systems. Figure 13 compares the relative improvements in system MTTF for various redundant configurations. For comparison purposes, the failure rate $\lambda$ is assumed to have a value of $10^{-4}$ per hour, one failure every $10^{4}$ hours. The repair time for each system with repair is assumed to be 8 hours, giving a repair rate of $\mu = 0.125$ per hour. The MTTF for a simplex system is approximately 1 year. In the parallel redundant configuration, the MTTF is dominated by the second term, which represents the added redundancy of the second subsystem.

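System | MTTF | MTTF in years

Simplex | $\frac{1}{\lambda}$ | 1

Parallel Redundant | $\frac{3}{2\lambda} + \frac{\mu}{2\lambda^{2}}$ | 625

TMR in the 3–2 mode | $\frac{5}{6\lambda} + \frac{\mu}{6\lambda^{2}}$ | 208

TMR in the 3–2–1 mode | $\frac{11}{6\lambda} + \frac{2\mu}{3\lambda^{2}} + \frac{\mu^{2}}{6\lambda^{3}}$ | 260,400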

Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems.
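The entries of Figure 13 can be reproduced with a few lines of arithmetic. This sketch assumes the MTTF expressions as read from the figure, $\lambda = 10^{-4}$ per hour, $\mu = 0.125$ per hour, and the text's approximation that $10^{4}$ hours is about one year; the function names are hypothetical. The tabulated values appear to correspond to the dominant (highest order in $\mu / \lambda$) term of each expression, so the full expressions come out slightly larger.

```python
HOURS_PER_YEAR = 1.0e4  # the text treats 10^4 hours as approximately 1 year


def mttf_simplex(lam):
    """Simplex system: 1 / lambda."""
    return 1.0 / lam


def mttf_parallel(lam, mu):
    """Parallel redundant system with repair: 3/(2 lambda) + mu/(2 lambda^2)."""
    return 3.0 / (2.0 * lam) + mu / (2.0 * lam ** 2)


def mttf_tmr_3_2(lam, mu):
    """TMR in the 3-2 mode: 5/(6 lambda) + mu/(6 lambda^2)."""
    return 5.0 / (6.0 * lam) + mu / (6.0 * lam ** 2)


def mttf_tmr_3_2_1(lam, mu):
    """TMR in the 3-2-1 mode: 11/(6 lambda) + 2 mu/(3 lambda^2) + mu^2/(6 lambda^3)."""
    return (
        11.0 / (6.0 * lam)
        + 2.0 * mu / (3.0 * lam ** 2)
        + mu ** 2 / (6.0 * lam ** 3)
    )


def in_years(hours):
    """Convert hours to years using the text's 10^4 hours per year approximation."""
    return hours / HOURS_PER_YEAR
```

For example, `in_years(mttf_parallel(1e-4, 0.125))` is dominated by $\mu / (2\lambda^{2}) = 6.25 \times 10^{6}$ hours, which is the 625 years shown in the figure.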

In the TMR configuration, the system continues to operate as long as two of the three subsystems are functional. This configuration is similar to the Parallel Redundant configuration, except that there is one additional subsystem. As a result, the MTTF is reduced by a factor of 3 when compared to a perfect coverage Parallel Redundant system.

If the TMR system is allowed to operate in the Simplex mode, the MTTF is increased by a factor of 260,400 relative to the Simplex system.

EFFECTS OF COVERAGE

When coverage is added to the reliability models, a significant change occurs in

the relative reliability figures for various configurations. Figure 14 describes the

effect of coverage on the MTTF of the various configurations.

MTTF, years

System | $c = 1$ | $c = 0.99$ | $c = 0.95$ | $c = 0$

Dual 2–1 | 625 | 50 | 10 | 0.5

TMR 3–2 | 208 | – | – | –

TMR 3–2–1 | 260,400 | 20,833 | 4167 | 208

Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees of coverage.

Chapter 8

REMAINING QUESTIONS

Although Figure 13 describes the effect of coverage on redundant architectures with exponential distributions for the failure rate of the underlying hardware, the figures stated are a simplified version of commercial products with the assumptions stated in Figure 2.

Several major topics of discussion remain open:

If the distributions of an actual system are not exponential and / or are not independent, what effect does this have on the reliability calculations presented in this work?

Given an actual redundant configuration, what dependencies are present for a specific hardware or software structure? Does the common cause failure model affect the reliability in a significant way?

Can the assumption that all failure rates are constant be verified for actual hardware and software systems in commercial products?

Can repairable systems be modeled using similar techniques if the repair time is extremely long (> 6 months) or repair is carried out at regular intervals, rather than on demand? This question arises when dealing with unmanned systems.

REALISTIC PROBABILITY DISTRIBUTIONS

Throughout this work an assumption has been made regarding the probability

distributions of the failure rates and repair rates. Although these assumptions may

be appropriate in describing physical events, the validation of their

appropriateness has not been stated formally. Several motivations are present for

assuming Exponential and Poisson Failure distributions:

There is evidence that electronic components follow an exponential failure curve. Although the goodness of fit of this curve is acceptable in many instances, alternative distributions provide more accurate models [Mels78], [Snyd75].

The use of the exponential distribution provides known Laplace transforms which are utilized in the Markov models of system reliability. Alternative distributions, both periodic and cyclostationary, do not possess Laplace transforms and therefore are intractable when forming differential equations [Cast80].

A constant hazard rate (Eq. (2.25)) provides an easy means of defining the underlying failure mechanism of the system under analysis. In practice the hazard rate is usually a function of time, use, and application [Bell78].

Faults present in the system which result in an error or a failure are assumed to be permanent rather than transient. This assumption allows exponential distributions to be used in the Markov model. In practice transient faults may represent the majority of system upset conditions [Stif80]. In an attempt to analyze transient fault conditions, the Weibull distribution provides a better estimator of the underlying statistical process. This distribution has no known Laplace transform and therefore is intractable to analysis with the Markov process [McCo79].

Multiple Failure Distributions

Before examining the Weibull distribution, the effects of ignoring the assumptions made in Figure 2 must be understood. The question to be answered is,

What is the effect on the system reliability if the failure distribution is different for each subsystem; if the individual faults are not independent; and if the repair / replacement strategy is also not constant?

Suppose there are a large number $n$ of failure types, with individual failure times $Y_1, \ldots, Y_n$. That is, $Y_i$ denotes the failure time that would be observed if all failures except the $i$th failure were suppressed. The actual failure time, denoted by $X_n$, when there are $n$ failure types is then given as,

$X_n = \min(Y_1, \ldots, Y_n).$ (8.1)

Next assume that the $Y_i$ are mutually independent and identically distributed random variables with a Cumulative Distribution Function $L(y)$. Then since,

$X_n > x$ iff $Y_i > x;\; i = 1, 2, \ldots, n,$ (8.2)

it follows that,

$P(X_n > x) = P(Y_i > x;\; i = 1, 2, \ldots, n) = \left[ 1 - L(x) \right]^{n}.$ (8.3)

Suppose now that $L(x) \sim a x^{\alpha}$ for $\alpha > 0$ as $x \to 0$. Then for a sufficiently large $n$, only small $x$ need be considered in Eq. (8.3) and,

$P(X_n > x) \approx e^{-n L(x)} \approx e^{-n a x^{\alpha}}.$ (8.4)

Let $X_n^{*} = X_n / k_n$ where $k_n$ is a normalizing constant to be chosen so that $X_n^{*}$ has a limiting distribution as $n \to \infty$. Then,

$P(X_n^{*} > x) = P(X_n > k_n x) \approx e^{-n a k_n^{\alpha} x^{\alpha}}.$ (8.5)

Take $k_n = (na)^{-1/\alpha}$, then,

$P(X_n^{*} > x) \approx e^{-x^{\alpha}}.$ (8.6)

The standardized variable $(na)^{1/\alpha} X_n$ has a limiting Weibull distribution. The index of the Weibull distribution is determined by the local behavior near $x = 0$ of the underlying Cumulative Distribution Function $L(x)$. If $\alpha = 1$, so that $L(x)$ is locally rectangular near $x = 0$, the limiting distribution function is exponential.
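The limiting argument can be illustrated with a concrete distribution. The sketch below assumes $L(y) = a y^{\alpha}$ exactly on its support (a hypothetical choice made so that the survival probability of the minimum has a closed form) and compares $[1 - L(k_n x)]^{n}$ with the limit $e^{-x^{\alpha}}$ as $n$ grows.

```python
import math


def survival_of_scaled_minimum(x, n, alpha, a=1.0):
    """P(X_n / k_n > x) = [1 - L(k_n x)]^n with L(y) = a y^alpha and k_n = (n a)^(-1/alpha)."""
    k_n = (n * a) ** (-1.0 / alpha)
    return (1.0 - a * (k_n * x) ** alpha) ** n


def weibull_limit(x, alpha):
    """Limiting survival function exp(-x^alpha)."""
    return math.exp(-(x ** alpha))
```

With `alpha = 1` the underlying distribution is rectangular near zero and the limit is the exponential survival function $e^{-x}$, matching the special case noted in the text.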

Weibull Distribution

The Weibull distribution is best described in its Cumulative Distribution form [Kapu77]. The Cumulative Distribution of a random variable $t$ distributed as the three parameter Weibull is given as,

$F(t) = 1 - \exp\left[ -\left( \frac{t - \gamma}{\eta} \right)^{\beta} \right], \text{ for } t \ge \gamma,$ (8.7)

where $\beta > 0$, $\eta > 0$, $\gamma \ge 0$, and where $\beta$ is the shape parameter or Weibull Slope [Kimb60], [Whit69]. As $\beta$ increases, the mean of the distribution approaches the characteristic life $\eta$; with $\beta = 1$, the Weibull distribution becomes the exponential distribution. The location parameter $\gamma$ defines the minimum life of the system. The scale of the distribution is controlled by $\eta$.

The three parameter expression in Eq. (8.7) can be reduced to a two parameter form by assuming the minimum life parameter $\gamma$ is zero [Naka75], [Kapu77].

The probability density function pdf, Cumulative Distribution Function CDF, and Hazard Function z of the Weibull two parameter distribution are expressed in the following,

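pdf: $f(t) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1} \exp\left[ -\left( \frac{t}{\eta} \right)^{\beta} \right],$ (8.8)

CDF: $F(t) = 1 - \exp\left[ -\left( \frac{t}{\eta} \right)^{\beta} \right],$ (8.9)

Reliability: $R(t) = \exp\left[ -\left( \frac{t}{\eta} \right)^{\beta} \right],$ (8.10)

Hazard Function: $z(t) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1}.$ (8.11)

The values of Eq. (8.8) through Eq. (8.11) depend on time only through the scaled time $t / \eta$. Since the failure rate is given by $z(t)$ in Eq. (8.11), this rate is directly influenced by the shape parameter $\beta$ [Thom69], [Meno63]:

If $\beta < 1$ the failure rate is decreasing with time.

If $\beta = 1$ the failure rate is constant with time and the Weibull distribution is identical to the exponential distribution.

If $\beta > 1$ the failure rate is increasing with time.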

The Weibull distribution most closely matches the failure distribution of transient faults [McCo79] and therefore is useful in the analysis of Fault–Tolerant systems which are capable of handling transient as well as permanent faults.

The primary difficulty with the Weibull distribution is that no known Laplace transform exists, making it intractable to generate closed form differential equations for the transition probabilities in a Markov model. Other properties of the Weibull distribution are useful for the analysis of reliable systems.

The $k$th moment of the Weibull distribution is given by [Kend77],

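$\mu_k = E\left[ t^{k} \right] = \int_0^{\infty} t^{k}\, \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1} \exp\left[ -\left( \frac{t}{\eta} \right)^{\beta} \right] dt.$ (8.12)

Using the transformation $u = (t/\eta)^{\beta}$ and $du = \frac{\beta}{\eta} (t/\eta)^{\beta - 1}\, dt$ gives a new form of the $k$th moment,

$\mu_k = \eta^{k} \int_0^{\infty} u^{k/\beta}\, e^{-u}\, du.$ (8.13)

Eq. (8.13) is the recognizable form of the Gamma function [Arfk70] and results in,

$\mu_k = \eta^{k}\, \Gamma\left( \frac{k}{\beta} + 1 \right).$ (8.14)

The mean for the Weibull distribution is given by,

$\mu = \eta\, \Gamma\left( \frac{1}{\beta} + 1 \right),$ (8.15)

and the variance is given by,

$\sigma^{2} = \eta^{2} \left[ \Gamma\left( \frac{2}{\beta} + 1 \right) - \Gamma^{2}\left( \frac{1}{\beta} + 1 \right) \right].$ (8.16)

With only the mean and the variance (standard deviation) of a sample data set available, the Weibull failure rate can be determined to be increasing, constant, or decreasing. The relationship between the mean and the standard deviation can be a useful indicator,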

$\mu < \sigma$, the failure rate is decreasing

$\mu = \sigma$, the failure rate is constant

$\mu > \sigma$, the failure rate is increasing
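The mean versus standard deviation indicator can be exercised directly from the Gamma function expressions for the moments. This sketch assumes the two parameter form $F(t) = 1 - \exp[-(t/\eta)^{\beta}]$, with `beta` the shape parameter and `eta` the characteristic life; the function names and the tolerance used to declare the two values equal are illustrative assumptions.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t / eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t, beta, eta):
    """z(t) = (beta / eta) (t / eta)^(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def weibull_moment(k, beta, eta):
    """k-th moment: eta^k Gamma(k / beta + 1)."""
    return eta ** k * math.gamma(k / beta + 1.0)


def weibull_mean(beta, eta):
    return weibull_moment(1, beta, eta)


def weibull_std(beta, eta):
    variance = weibull_moment(2, beta, eta) - weibull_mean(beta, eta) ** 2
    return math.sqrt(variance)


def failure_rate_trend(mean, std, rel_tol=1e-9):
    """Classify the failure rate trend from the sample mean and standard deviation."""
    if math.isclose(mean, std, rel_tol=rel_tol):
        return "constant"
    return "decreasing" if mean < std else "increasing"
```

For instance, `failure_rate_trend(weibull_mean(0.5, 100.0), weibull_std(0.5, 100.0))` classifies a shape parameter below one as a decreasing failure rate, in agreement with the hazard function.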

PERIODIC MAINTENANCE

In the previous sections the reliability models assumed a fixed Mean Time to Repair,

, for each module which resulted in the system availability measures given in

Eq. (2.42) and Eq. (2.80). The assumption in these expressions is that a

maintenance request will be generated as a result of a failure and that the length

of time between the failure event and the repair of the module is a fixed interval.

In a practical application, responding to a module repair request within the repair

time may not be possible [Ingl77]. The inability to respond in the minimum time

may be caused by,

The remoteness of the installation

System operation rules

Lack of spare parts

In many instances a periodic maintenance interval is substituted for the on demand

maintenance technique assumed in the previous sections.

Periodic Maintenance of Repairable Systems

The overall system reliability of a general purpose configuration given in

Eq. (2.26) can be restated as the reliability of an unmaintained system by [Ng77],

0 ,i t

i

i

R t a e (8.17)

where ia is a normalizing constant. In this model the following notation will be

used,

– is the time interval from the latest periodic maintenance action

– is the time interval between two adjacent maintenance actions

i – is the thi occurrence of a maintenance action

R – is the reliability of a maintained system

R – is the reliability of an unmaintained system

The reliability of a system after receiving the thi maintenance action is iR t .

The reliability of an unmanned system is given by 0i R i R and the reliability

of a maintained system is given by 1i i iR i R i R .

A convenient representation of the reliability of each system is developed through

the generating function of the probability sequence of each maintenance action

[Ahlf66], [Fell71], [Jury64]. The generating function for the sequence of failure

events for an unmaintained system is given by 1

i

i

i

G s s

R and the

generating function for the sequence of failure events if a maintained system is

given by 1

i

i

i

G s

R . By maintaining a Fault–Tolerant system in a periodic

manner, an improvement in the change in the overall reliability can be expected.

The probability that the maintenance action does not result in an improvement in

the reliability is given by 1 . While the probability that the maintenance action

brings the system to a condition identical to the reliability at the initial system

start, 0t , is given by 2 .

After time units, the system is inspected and if module failures have occurred,

repair of the system takes place. With probability 1 , all failures, if any, are

ignored or insufficiently repaired. With probability 2 all failures are removed

from the system and the operation of the system is restored to full perfection.

Since all failures are removed from the system, partial restoration of a failed

system is not considered [Naka79], [Oda81], [Naka81], such that the probability

of all failures being removed is given by, 1 21 .

From the definition of 1 and 2 it can be shown that after i maintenance

actions the system reliability is given by 1iR t with probability 1 and

1 0iR i R t i with probability 2 . Defining a term for the time from the

latest maintenance action as t i , the system reliability can be expressed as,

1 1 2 0 1 ,i i iR i R i R R i (8.18)

with 0i and 0 .

Restating the system reliability as a function of the maintenance interval gives,

0 ,j

j

j

R a e

(8.19)

and

1 1 2 1 .j j

j

j

R a e e

R (8.20)

A general expression can be developed by using the symmetry of Eq. (8.18)

resulting in,

1

1 2 1

0

.j j j

ii kt

i j i k

j k

R i a e e e

R (8.21)

To obtain iR from Eq. (8.21) the reliability time base is set to the time between

adjacent maintenance actions, t and the sequence of summation changed to

give,

1

1 1 1 2 1 1

0

.i

i k

i i k i k

k

R R R R (8.22)

The probability of survival until the first maintenance action is the same, whether

the system is maintained or not, that is,

1 1 .R R (8.23)

The iR may be determined recursively from Eq. (8.22) and Eq. (8.23). The iR

may also be obtained from its generating function defined by,

1

1 2 1

,G s

G sG s

(8.24)

and

,1

j

jj

j

sG s a e

se

(8.25)

where Eq. (8.24) and Eq. (8.25) are obtained from Eq. (8.22) and Eq. (8.19)

[Heiv80]. The determination of iR provides the final expression for the

reliability equation from Eq. (8.21),

1 2 1

1

1 .j j

i kt i

i j k

j k

R t a e e

R (8.26)

The Mean Time to Failure using Eq. (2.41) is,

0 0

,i

i

MTTF R i

(8.27)

1

1

2

1 1

11 .

1

j

j

j

j j

aG eMTTF

e

(8.28)

Reliability Improvement for a TMR System

The analysis of periodically maintained TMR systems has been presented in several other works [Maka81], [Helv80], [Math70]. Although the details of these works are relevant to the discussion presented here, the mathematics is beyond the scope of this presentation and is the subject of a future work.

A simplifying assumption will be made regarding the periodic maintenance of the TMR system. Only the relative increase in reliability will be considered in this work. For a TMR system with the unmaintained reliability expression given in Eq. (3.13), Figure 15 shows the increase in system reliability as a result of periodic maintenance [Helv80].

[Figure 15 plot: Normalized Mean Time to Failure (vertical axis) versus Normalized Maintenance Interval (horizontal axis, ticks at 0.001, 0.010, 0.100, 1.000).]

Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant system with periodic maintenance. This graph shows that maintenance intervals which are greater than one–half of the mean time to failure for one module have little effect on increasing reliability. But frequent maintenance, even low quality maintenance, improves the system reliability considerably.

Chapter 9

CONCLUSIONS

A broad range of topics has been presented in support of the original question,

What is the reliability of a system in the presence of less than perfect diagnostic coverage?

The following conclusions can be made regarding this question:

Analytical expressions have been developed which describe the time dependent behavior of Fault–Tolerant systems. These expressions are based on a continuous–time Markov Chain, which assumes instantaneous transition times of an underlying random process. The statistical distribution that drives the random process is assumed to be exponential in nature.

A coverage factor was introduced in the Markov Chain model that results in a closed form solution for the system Mean Time To Failure for various fault–tolerant configurations. Several assumptions have been made regarding the underlying failure process as well as the repair process.

In the presence of imperfect coverage a scheme has been described which determines the actual coverage present in the system.

In addition to the above summary conclusions, several questions regarding the

validity of the original assumptions have been presented. The alternative

considerations present a different set of analytical problems that in turn require

simplifying assumptions to produce tractable results.

These questions include,

A modeling technique capable of dealing with the Weibull distribution.

A modeling technique capable of dealing with finite transition times between fault detection and fault repair. This technique would provide realistic models for systems with actual repair processes, rather than instantaneous repair processes.

Analytical expressions for a general configuration of models, including repair cycles.

Index

A

Absorbing Markov Process, 45

aged, 39

Availability, 14, 27

Steady State, 15

B

Basic Parameter, 92

Beta–Factor, 92

Bias, 82

C

Central Limit Theorem, 86, 88

Chapman–Kolmogorov Equations

Backward, 49

Forward, 47

Chapman–Kolmogorov Equations,

46

Clopper–Person Estimation, 89

Common Cause Failure, 92

Common Cause Failures, 92

Common Load, 92

Conditional Failure Function, 13

confidence interval, 84

Consistency, 82

convolution, 22

convolution function, 21

coverage, 81

Coverage, i

Coverage Factor, 2

Cramers Rule, 73

Cumulative Distribution Function,

5, 25

D

Deterministic Models, 4

Distribution

Hypoexponential, 39

stage–type, 41

stage–type, 38

Distributions

stage–type failure, 42

E

Efficiency, 82

Eigenvalue solutions, 46

Error, 1

error detection latency, 91

Expected Value, 15

exponential failure distribution, 18

exponential order, 52

F

Failure, 1

failure probability, 30

failure rate parameter, 6

failure–rate distribution, 27

failures per million hours, 6

Fault, 1

Fault avoidance, 1

Fault Containment and Isolation, 2

Fault Detection, 2

fault dormancy, 91

fault injection procedure, 90

Fault Location and Identification, 2

Fault Masking, 2

Fault tolerance, 1

Faults

Covered, 35

Uncovered, 35

Final Value Theorem, 27

fundamental matrix, 73

Fundamental Matrix, 46

G

Gamma function, 16

Gauss Reduction, 73, 74, 76

Generalized Modeling

Event Tree Analysis, 43

Failure Mode and Effects

Analysis, 43

Hazard and Operability Studies,

43

Generalized Modeling

Fault Tree Analysis, 43

Petri Nets, 43

goodness, 4

H

hazard function, 5

Hazard Function, 13, 113

Hypoexponential, 39

I

IFIP Working Group, 1

instantaneous availability, 26

instantaneous failure rate, 13

L

Laplace transform, 23, 24, 51

limiting availability, 27

Limiting availability, 28

M

Markov Chain, 3, 42

Continuous Parameter, 43

Markov Process, 43

Mean Time to Failure, i, 15, 18, 29,

34, 53, 81

Mean Time To Failure, 122

Mean Time to First Failure, 15, 20

Mean Time to Repair, 19, 29

Mean Time to Second Failure, 15

memoryless, 6, 7, 38

Multinomial Failure Rate, 92

Multiple Dependent Failure

Function, 92

Multiple Greek Letter Models, 92

N

Non Identical Components, 92

O

Ordinary Renewal Process, 20

P

Physical Fault Injection, 82

Poisson and exponential

relationships, 10

Poisson assumptions, 8

Poisson Theorem, 7

population parameter, 85

population proportion, 83

Probabilistic Models, 4

probability density function, 5, 25

R

reliability function, 5

Reliability Function, 12

renewal density, 26

Renewal Equation, 23

renewal function, 21

Renewal process, 25

Renewal Process, 20

repairability, 26

repair–time distribution, 27

S

sample population, 84

sample proportion, 83

series reliability configuration, 31

service life, 13

simplex mode of operation, 69

Simplex system, 34

Single Parameter Models, 4

Square Root Bounding, 92

Standard Error of the Mean, 87

state vectors, 44

Statistical Inference theory, 82

stuck–at faults, vii

system

series redundant configuration,

30

System

M–of–N configuration, 32

Parallel Redundant, 53, 63

parallel redundant configuration,

32

simplex, 33

Triple Modular Redundant, 59,

69

system lifetime, 36, 38

T

transition probability matrix, 49

transition rate matrix, 47

U

Unaged, 6

unavailable functionality, 92

Appendix A

MARKOV CHAINS

Modeling of Partial Diagnostic Coverage in Fault–Tolerant systems with a

Markov Matrix technique requires an underlying assumption regarding the failure

probability distributions for the electronic components of the system. In addition,

the properties of the probability generating functions for the failure processes are

assumed to be understood in detail. In the body of this work, some assumptions

were made regarding Markov Chains without detailed development. This

appendix provides the definition and theorems for the properties of the transition

matrix theory utilized in Chapter 5. [Chun67], [Parz60], [Parz62], [Midd46],

[Pipe63], [Keme60], [Cox62], [Cox65], [Bhar60], [Bell60], [Hoel72].

Definition A.1

A Stochastic Process is defined as any collection of random variables $\{X(t),\, t \in T\}$ defined on a common probability space, where $T$ is a subset of $(-\infty, \infty)$ and is referred to as the time parameter set. The random process is called a continuous parameter process if $T$ is an interval having positive length and a discrete parameter process if $T$ is a subset of the integers. If the random variables $X(t)$ take values from the fixed set $\mathbb{R}$, then $\mathbb{R}$ is called the State Space of the process ∴

Definition A.2

A Stochastic Process $\{X_t\}_{t \in \mathbb{R}^{+}}$ is said to satisfy the Markov property if for all times $t_0 < t_1 < \cdots < t_n < t$ and for all $n$ it is true that,

$P\{X_t = j \mid X_{t_0} = i_0, X_{t_1} = i_1, \ldots, X_{t_n} = i_n\} = P\{X_t = j \mid X_{t_n} = i_n\}$ ∴ (A.1)

Definition A.3

The Markov Chain $\{X_t\}_{t \in \mathbb{R}^+}$ is said to be stationary if for every $i$ and $j$ in the state space $S$ of the probability transition matrix, $P\{X_{t+h} = j \mid X_t = i\}$ is independent of $t$ ∴

The following conditions must be satisfied by the probability transition matrices

used in Chapter 5.

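$$p_{ij}(t) \ge 0 \quad \forall\, i, j \in S \text{ and } t \ge 0 \tag{A.2}$$

$$\sum_{j \in S} p_{ij}(t) = 1 \quad \forall\, i \in S \text{ and } t \ge 0 \tag{A.3}$$

$$p_{ij}(t+s) = \sum_{k \in S} p_{ik}(t)\, p_{kj}(s) \quad \forall\, i, j \in S \text{ and } t, s \ge 0 \tag{A.4}$$

$$\lim_{t \to 0^+} p_{ij}(t) = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \ne j. \end{cases} \tag{A.5}$$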
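Conditions (A.2) through (A.5) can be checked numerically on a small case. The sketch below assumes a hypothetical two–state chain (state 0 operating, state 1 failed) with an illustrative failure rate `lam` and repair rate `mu`. The closed form used for the transition function is the standard result for such a chain; the rate values and function names are assumptions and are not taken from Chapter 5.

```python
import math

def p(t, lam=0.002, mu=0.1):
    """Transition matrix P(t) of a two-state chain (0 = up, 1 = down).

    lam and mu are assumed failure and repair rates, chosen for illustration.
    """
    s = lam + mu
    e = math.exp(-s * t)
    return [[(mu + lam * e) / s, lam * (1.0 - e) / s],
            [mu * (1.0 - e) / s, (lam + mu * e) / s]]

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def check_conditions(t=3.0, s=7.0, tol=1e-12):
    """Return one boolean per condition (A.2), (A.3), (A.4), (A.5)."""
    pt, ps, pts = p(t), p(s), p(t + s)
    nonneg = all(x >= 0.0 for row in pt for x in row)             # (A.2)
    rows = all(abs(sum(row) - 1.0) < tol for row in pt)           # (A.3)
    prod = matmul(pt, ps)
    ck = all(abs(pts[i][j] - prod[i][j]) < tol
             for i in range(2) for j in range(2))                 # (A.4)
    p0 = p(1e-9)                                                  # t close to 0+
    ident = abs(p0[0][0] - 1.0) < 1e-6 and abs(p0[0][1]) < 1e-6   # (A.5)
    return nonneg, rows, ck, ident
```

Calling `check_conditions()` evaluates the four conditions at the chosen times; a chain with more states would need a general matrix exponential in place of the closed form.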

Theorem A.1

Let $X_t$ be a continuous–time Markov chain with a transition probability function $p_{ij}(t)$. If $i$ and $j$ are fixed states in $S$ then $p_{ij}(t)$ is a uniformly continuous function of time $t$.

Proof of Theorem A.1

Let $\varepsilon > 0$ be given. In order to show that $p_{ij}(t)$ is uniformly continuous, it must be shown that there exists a $\delta > 0$ such that $|p_{ij}(t+h) - p_{ij}(t)| < \varepsilon$ for all $t$ whenever $0 < h < \delta$. Using the Chapman–Kolmogorov equations defined in Eq. (5.20) and Eq. (5.22) results in,

$$\begin{aligned} |p_{ij}(t+h) - p_{ij}(t)| &= \Big| \sum_{k} p_{ik}(h)\, p_{kj}(t) - p_{ij}(t) \Big|, \\ &\le \sum_{k \ne i} p_{ik}(h)\, p_{kj}(t) + p_{ij}(t)\,\big(1 - p_{ii}(h)\big). \end{aligned} \tag{A.6}$$

Since $p_{kj}(t) \le 1$ for all $k$, $j$, and $t$ it follows that,

$$|p_{ij}(t+h) - p_{ij}(t)| \le \sum_{k \ne i} p_{ik}(h) + 1 - p_{ii}(h). \tag{A.7}$$

But since $\sum_{k \in S} p_{ik}(h) = 1$ results in $\sum_{k \ne i} p_{ik}(h) = 1 - p_{ii}(h)$ the following relationship holds,

$$|p_{ij}(t+h) - p_{ij}(t)| \le 2\,\big(1 - p_{ii}(h)\big), \tag{A.8}$$

and by Eq. (A.5) it is given that $\lim_{h \to 0^+} \big(1 - p_{ii}(h)\big) = 0$. Therefore given $\varepsilon > 0$ a $\delta > 0$ can be found so that $1 - p_{ii}(h) < \varepsilon / 2$ for all $0 < h < \delta$. Since $\delta$ is independent of $t$ the proof is complete ∴

Theorem A.1 establishes the uniform continuity of ( )ijp t . The next step is to

show that the transition probability function is differentiable. This result is

important in developing the set of ordinary differential equations used to model

the reliability transition. The matrix containing these transitions describes the next

move or next state of the system modeled by the Markov Chain. For a continuous–

time Markov process the next state is not so easily defined. Since the process is

free to take any state (as long as the path through the state diagram is followed)

there is no basic time unit $t^* > 0$ such that $p_{ij}(t^*)$ yields the probability

transition of the next state from the current state. Since no positive time unit can

be used in considering where the system state can go next, instantaneous state

changes must be considered in the analysis of the system.

This requirement leads to the analysis of the derivative of the transition probability function $p_{ij}(t)$ for $i \in S$, $j \in S$. Since the derivative is defined as the limit of the difference quotient, the determination of limits is necessary. In order to make this determination the following Lemma will be used, which involves certain properties of the supremum of a set of real numbers.

Lemma A.1

Let $f(t)$ be a real–valued function defined on the open interval $(0, \infty)$. Assume that $\lim_{t \to 0^+} f(t) = 0$ and assume that $f$ is subadditive, that is $f(s+t) \le f(s) + f(t)$ for all $s, t \in \mathbb{R}^+$. Then $\lim_{t \to 0^+} \dfrac{f(t)}{t}$ exists and in fact equals $\sup_{t \in \mathbb{R}^+} \dfrac{f(t)}{t}$.

Theorem A.2

Let $X_t$ be a continuous–time Markov chain with a transition probability function $p_{ij}(t)$. In the case where $i = j$ the resulting probability function of $t$ has a right–hand derivative at zero in the sense that the limit $q_{ii} = \lim_{t \to 0^+} \dfrac{p_{ii}(t) - 1}{t}$ exists. This limit will always be nonpositive and in some cases, the limit will be $-\infty$.

Proof of Theorem A.2

The transition probability condition for the Markov chain described in Eq. (A.4) states that for all $s, t \in [0, \infty)$, $p_{ii}(t+s) = \sum_{k} p_{ik}(t)\, p_{ki}(s)$, so when the sum is reduced to the case where $k = i$ it follows that for all $s, t \in [0, \infty)$, $p_{ii}(t+s) \ge p_{ii}(t)\, p_{ii}(s)$. Consider the function $f(t) = -\log p_{ii}(t)$. It can be shown that $f(t)$ is subadditive and that $\lim_{t \to 0^+} f(t) = 0$. It also follows from Lemma A.1 that $\lim_{t \to 0^+} \dfrac{-\log p_{ii}(t)}{t}$ exists. The difference quotient in question is now $\dfrac{p_{ii}(t) - 1}{t}$, so the following fact from calculus is used to reduce this term.

For small values of $x > 0$, the Taylor series expansion for $-\log(1 - x)$ about zero is $-\log(1 - x) = x + \dfrac{x^2}{2} + \dfrac{x^3}{3} + \dfrac{x^4}{4} + \cdots = x + R(x)$. The dominant term in this series is the first term. It can be shown that $\lim_{x \to 0^+} \dfrac{R(x)}{x} = 0$. Using the above developments an expression for the right–hand derivative is given as,

$$-\log p_{ii}(t) = -\log\big(1 - (1 - p_{ii}(t))\big) = 1 - p_{ii}(t) + R\big(1 - p_{ii}(t)\big). \tag{A.9}$$

It follows that,

$$\begin{aligned} \frac{-\log p_{ii}(t)}{t} &= \frac{1 - p_{ii}(t)}{t} + \frac{R\big(1 - p_{ii}(t)\big)}{t}, \\ &= \frac{1 - p_{ii}(t)}{t} \left( 1 + \frac{R\big(1 - p_{ii}(t)\big)}{1 - p_{ii}(t)} \right), \end{aligned} \tag{A.10}$$

resulting in,

$$\frac{1 - p_{ii}(t)}{t} = \frac{-\log p_{ii}(t)}{t} \left( 1 + \frac{R\big(1 - p_{ii}(t)\big)}{1 - p_{ii}(t)} \right)^{-1}. \tag{A.11}$$

Taking the limits of both sides of Eq. (A.11) as $t \to 0^+$ and noting that both terms on the right–hand side have limits at $t \to 0^+$, results in confirming the existence of $q_{ii} = \lim_{t \to 0^+} \dfrac{p_{ii}(t) - 1}{t} = -\lim_{t \to 0^+} \dfrac{1 - p_{ii}(t)}{t}$ ∴

Theorem A.3

Let $X_t$ be a continuous–time Markov process with a probability transition function $p_{ij}(t)$. In the case where $i \ne j$ the resulting probability transition function of $t$ has a right–hand derivative at zero in the sense that

$$q_{ij} = \lim_{t \to 0^+} \frac{p_{ij}(t)}{t}$$

exists and is finite ∴

Proof of Theorem A.3

In proving this theorem $X_t$ will be considered only at certain discrete times, even though this is a continuous Markov process. This is done by examining the $X_t$ process at times $h, 2h, 3h, \ldots$ where $h$ is a small positive number. With this discrete version of $X_t$ it can be noted that $p_{ij}(mh) = p_{ij}^{(m)}(h)$. In this case $p_{ij}(mh)$ will be considered as an $m$–step transition of a discrete–time Markov process with a probability transition matrix of $p_{ij}(h)$. In the same way the forbidden probabilities ${}_{j}p_{ii}^{(m)}(h)$, the probabilities of returning to state $i$ in $m$ steps without passing through state $j$, are defined, giving,

$$p_{ij}(mh) \ge \sum_{k=0}^{m-1} {}_{j}p_{ii}^{(k)}(h)\; p_{ij}(h)\; p_{jj}(mh - kh - h). \tag{A.12}$$

The right–hand side of Eq. (A.12) is smaller since not all possible probability transitions are considered. By using Eq. (A.6) it follows that $\lim_{h \to 0^+} {}_{j}p_{ii}^{(k)}(h) = 1$. This is illustrated by writing,

$$p_{ii}^{(k)}(h) = {}_{j}p_{ii}^{(k)}(h) + \sum_{l=0}^{k-1} f_{ij}^{(l)}(h)\; p_{ji}\big((k - l)h\big), \tag{A.13}$$

where $f_{ij}^{(l)}(h)$ is the probability of the first visit to state $j$ from state $i$ in $l$ steps. Now,

$$\lim_{h \to 0^+} p_{ji}\big((k - l)h\big) = 0, \tag{A.14}$$

so,

$$\lim_{h \to 0^+} {}_{j}p_{ii}^{(k)}(h) = \lim_{h \to 0^+} p_{ii}^{(k)}(h) = 1. \tag{A.15}$$

Let $\varepsilon > 0$ be given and choose $t_0 > 0$ such that $p_{jj}(t) > 1 - \varepsilon$ for all $t \in (0, t_0)$ and ${}_{j}p_{ii}^{(k)}(h) > 1 - \varepsilon$ for all $kh \in (0, t_0)$. Then using Eq. (A.12) and Eq. (A.15) it follows that,

$$p_{ij}(mh) \ge \sum_{k=0}^{m-1} (1 - \varepsilon)^2\, p_{ij}(h) = (1 - \varepsilon)^2\, m\, p_{ij}(h), \tag{A.16}$$

whenever $mh < t_0$. Dividing Eq. (A.16) by $mh$ gives,

$$\frac{p_{ij}(mh)}{mh} \ge (1 - \varepsilon)^2\, \frac{p_{ij}(h)}{h} \quad \text{whenever } mh < t_0. \tag{A.17}$$

Let the transition rate in the limit be defined by,

$$L = \liminf_{t \to 0^+} \frac{p_{ij}(t)}{t}. \tag{A.18}$$

The limit $L$ is finite since, if $m$ is chosen so that $\dfrac{t_0}{2} \le mh \le t_0$, then,

$$\frac{p_{ij}(h)}{h} \le \frac{1}{(1 - \varepsilon)^2} \cdot \frac{p_{ij}(mh)}{mh} \le \frac{1}{(1 - \varepsilon)^2\, mh} \le \frac{2}{(1 - \varepsilon)^2\, t_0}. \tag{A.19}$$

Using the definition of lim inf, choose $t_1 < \dfrac{t_0}{2}$ such that $\dfrac{p_{ij}(t_1)}{t_1} < L + \varepsilon$. By continuity there is an interval about $t_1$ where this inequality holds. That is, there exists $h_0 \in (0, t_1)$ such that $\dfrac{p_{ij}(t)}{t} < L + \varepsilon$ for all $t$ satisfying $|t - t_1| < h_0$. Choose $h < h_0$ and choose $m$ so that,

$$t_1 - h_0 < mh < t_1 + h_0 < t_0. \tag{A.20}$$

Using these choices of $m$ and $h$ and using Eq. (A.17) it is given that,

$$\frac{p_{ij}(h)}{h} \le \frac{1}{(1 - \varepsilon)^2} \cdot \frac{p_{ij}(mh)}{mh} \le \frac{L + \varepsilon}{(1 - \varepsilon)^2}. \tag{A.21}$$

Since Eq. (A.21) holds for all $h < h_0$ it follows that,

$$\limsup_{h \to 0^+} \frac{p_{ij}(h)}{h} \le \frac{L + \varepsilon}{(1 - \varepsilon)^2}, \tag{A.22}$$

but $\varepsilon > 0$ is arbitrary, so $\limsup_{h \to 0^+} \dfrac{p_{ij}(h)}{h} \le L$. Therefore the limit of $\dfrac{p_{ij}(t)}{t}$ as $t \to 0^+$ exists and is finite. This limit is denoted by $q_{ij}$ ∴
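The limits in Theorems A.2 and A.3 can be illustrated with difference quotients. The sketch below reuses an assumed two–state chain with illustrative rates `lam` and `mu` (hypothetical values, not taken from the body of this work). For that chain the quotients should approach $q_{01} = \lambda$ and $q_{00} = -\lambda$ as the step $h$ shrinks.

```python
import math

def p01(t, lam=0.002, mu=0.1):
    """Off-diagonal transition probability p_01(t) of a two-state chain."""
    s = lam + mu
    # expm1 keeps precision when s * t is very small.
    return -lam * math.expm1(-s * t) / s

def p00(t, lam=0.002, mu=0.1):
    """Diagonal transition probability p_00(t) of the same chain."""
    s = lam + mu
    return (mu + lam * math.exp(-s * t)) / s

def rate_estimates(h):
    """Difference quotients approximating q_01 and q_00 at step h."""
    return p01(h) / h, (p00(h) - 1.0) / h
```

Evaluating `rate_estimates` for a decreasing sequence of steps such as `1e-2`, `1e-4`, `1e-6` shows both quotients settling on the assumed rates.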

Appendix B

SOLUTIONS TO LINEAR SYSTEMS

The generalized TMR system model described in Chapter 5 depends on the

solution of a set of differential equations using the Gauss Reduction method.

This appendix outlines the details of this solution method [Brau70], [Cour43],

[Jame67], [Rals65], [Hild56].

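A set of $m$ simultaneous linear equations in $n$ unknowns, $x_1, \ldots, x_n$, has the form,

$$\begin{aligned} a_{11} x_1 + \cdots + a_{1n} x_n &= b_1 \\ &\;\;\vdots \\ a_{m1} x_1 + \cdots + a_{mn} x_n &= b_m \end{aligned} \tag{B.1}$$

where the $a_{ij}$, $b_i$ are the known constants. Rewriting the matrix terms as $\mathbf{A} = [a_{ij}]$, $\mathbf{x} = [x_1, \ldots, x_n]$, $\mathbf{b} = [b_1, \ldots, b_m]$, then using a standard matrix notation, Eq. (B.1) becomes $\mathbf{A}\mathbf{x} = \mathbf{b}$.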

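In the case of a square matrix, where $m = n$ and the rank of the matrix is $r(\mathbf{A}) = n$ so that $|\mathbf{A}| \ne 0$, the unique solution obtained by $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$ can be written as,

$$x_i = \frac{1}{|\mathbf{A}|} \sum_{j=1}^{n} A_{ji}\, b_j, \quad i = 1, \ldots, n, \tag{B.2}$$

where $A_{ji}$ denotes the cofactor of the element $a_{ji}$.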

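From the definition of the determinant, it is observed that $\sum_{j} A_{ji}\, b_j$ is the determinant of the matrix formed from $\mathbf{A}$ by replacing column $i$ with $\mathbf{b}$, giving,

$$x_1 = \frac{1}{|\mathbf{A}|} \begin{vmatrix} b_1 & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ b_n & a_{n2} & \cdots & a_{nn} \end{vmatrix}; \;\ldots\;; \; x_n = \frac{1}{|\mathbf{A}|} \begin{vmatrix} a_{11} & \cdots & a_{1,n-1} & b_1 \\ \vdots & & \vdots & \vdots \\ a_{n1} & \cdots & a_{n,n-1} & b_n \end{vmatrix}. \tag{B.3}$$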

Eq. (B.2) is called Cramer’s Rule for finding the solution to a system of n equations

in n unknowns. An alternative method, which provides more efficiency in

numerical computation, is called Gauss Reduction.

The Gauss Reduction method assumes that $a_{11} \ne 0$ such that the first equation in Eq. (B.1) is divided by $a_{11}$ and the resulting equation is used to eliminate $x_1$ in equations $2, \ldots, n$.

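This method gives a new set of equations with augmented terms of,

$$\begin{aligned} x_1 + \hat{a}_{12} x_2 + \cdots + \hat{a}_{1n} x_n &= \hat{b}_1 \\ \hat{a}_{22} x_2 + \cdots + \hat{a}_{2n} x_n &= \hat{b}_2 \\ &\;\;\vdots \\ \hat{a}_{n2} x_2 + \cdots + \hat{a}_{nn} x_n &= \hat{b}_n \end{aligned} \tag{B.4}$$

where the new terms $\hat{a}_{1j} = \dfrac{a_{1j}}{a_{11}}$, $j = 2, \ldots, n$; $\hat{a}_{ij} = a_{ij} - a_{i1} \dfrac{a_{1j}}{a_{11}}$, $i = 2, \ldots, n$, $j = 2, \ldots, n$; $\hat{b}_1 = \dfrac{b_1}{a_{11}}$; and $\hat{b}_i = b_i - a_{i1} \dfrac{b_1}{a_{11}}$, $i = 2, \ldots, n$, result from the elimination terms.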

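The second equation in the set of Eq. (B.4) is then divided by $\hat{a}_{22}$ (this method assumes that the equations and variables are renumbered such that $\hat{a}_{22} \ne 0$) and is used to eliminate $x_2$ in equations $3, \ldots, n$. In a finite number of steps a set of equations can be formed (provided $|\mathbf{A}| \ne 0$) such that,

$$\begin{bmatrix} 1 & h_{12} & \cdots & h_{1n} \\ & 1 & \cdots & h_{2n} \\ & & \ddots & \vdots \\ & & & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix}. \tag{B.5}$$

Then $x_n = g_n$, and by back substitution $x_{n-1} = g_{n-1} - h_{n-1,n}\, g_n$, etc.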
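The reduction and back substitution steps can be sketched in a few lines. This is an illustrative implementation rather than the procedure used for the Chapter 5 model; the function name `gauss_solve` is an assumption, and choosing the largest available pivot is an added numerical safeguard that goes slightly beyond the renumbering described above.

```python
def gauss_solve(a, b):
    """Solve Ax = b by Gauss reduction with back substitution.

    Rows are swapped when a pivot is zero (the renumbering mentioned in the
    text); picking the largest pivot is an added numerical safeguard.
    """
    n = len(a)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular, |A| = 0")
        m[col], m[pivot] = m[pivot], m[col]
        # Divide the pivot row by the pivot so the diagonal entry becomes 1.
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        # Eliminate the current unknown from the equations below.
        for r in range(col + 1, n):
            factor = m[r][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    # Back substitution on the unit upper triangular system of Eq. (B.5).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
    return x
```

For example, `gauss_solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])` solves a three–equation system whose exact solution is $x = (2, 3, -1)$.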

A general case can be considered where $m$ is different from $n$. The condition under which there is at least one solution to Eq. (B.1) can be determined. A new matrix can be introduced with a dimension $m \times (n + 1)$ giving $\mathbf{A}_b = (\mathbf{A}, \mathbf{b})$, with $\mathbf{A}_b$ defined as the augmented matrix for the system. This matrix is formed by annexing to the matrix $\mathbf{A}$ the vector $\mathbf{b}$, which becomes column $n + 1$. It should be noted that $r(\mathbf{A}_b) \ge r(\mathbf{A})$ since every minor in $\mathbf{A}$ also appears in $\mathbf{A}_b$. Also if $r(\mathbf{A}_b) > r(\mathbf{A})$, then there does not exist a solution to Eq. (B.1). This follows since if $r(\mathbf{A}_b) = k > r(\mathbf{A})$, every set of $k$ linearly independent columns from $\mathbf{A}_b$ must contain $\mathbf{b}$. Hence $\mathbf{b}$ cannot be written as a linear combination of columns in $\mathbf{A}$, therefore there does not exist a set of $x_j$ such that $\sum_{j} x_j \mathbf{a}_j = \mathbf{b}$. On the other hand, if $r(\mathbf{A}_b) = k = r(\mathbf{A})$, then there exists a set of $k$ columns from $\mathbf{A}$ such that every column in $\mathbf{A}_b$, and in particular $\mathbf{b}$, can be written as a linear combination of these $k$ columns. Therefore, there exists at least one solution to Eq. (B.1) and in fact a solution with no more than $k$ of the variables different from zero.
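The rank condition can be tested directly by comparing the rank of the coefficient matrix with the rank of the augmented matrix. The sketch below is one possible way to do this, using exact rational arithmetic to avoid rounding questions; the helper names `rank` and `is_consistent` are illustrative.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix by row reduction over exact rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

def is_consistent(a, b):
    """True when r(A_b) = r(A), i.e. Ax = b has at least one solution."""
    augmented = [list(row) + [rhs] for row, rhs in zip(a, b)]
    return rank(augmented) == rank(a)
```

With the rank–one matrix `[[1, 2], [2, 4]]`, the right–hand side `[3, 6]` lies in the column space and the system is consistent, while `[3, 7]` raises the rank of the augmented matrix and no solution exists.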

Theorem B.1

Consider the initial value problem,

$$\frac{d}{dt} y = \mathbf{A} y + g(t), \quad y(0) = y_0, \tag{B.6}$$

where $\mathbf{A}$ is a constant matrix in $F_{nn}$, $g(t)$ is continuous and of exponential growth at infinity. Then the solution $\phi(t)$ of Eq. (B.6) exists and is unique for $0 \le t < \infty$ and its derivative $\phi'(t)$ is of exponential growth at infinity.

Proof of Theorem B.1

The initial value problem in Eq. (B.6) is equivalent to the problem of finding a continuous function $\phi(t)$ such that,

$$\phi(t) = y_0 + \int_0^t \big[ \mathbf{A}\,\phi(s) + g(s) \big]\, ds, \tag{B.7}$$

for every $t$ in the interval $I$.

By hypothesis, there exist constants $M > 0$ and $c$ and a time $T$ such that,

$$|g(t)| \le M e^{ct}, \tag{B.8}$$

for $t \ge T$. Assume $c > 0$, since increasing $c$ increases the right–hand side of Eq. (B.8) and does not affect the relation of the inequality. Eq. (B.7) can be written as,

$$\phi(t) = y_0 + \int_0^T \mathbf{A}\,\phi(s)\, ds + \int_0^T g(s)\, ds + \int_T^t \mathbf{A}\,\phi(s)\, ds + \int_T^t g(s)\, ds. \tag{B.9}$$

It can be shown that the solution $\phi(t)$ of Eq. (B.7) is bounded on the interval $0 \le t \le T$. Using this fact, and the continuity of $g(t)$, there exists a constant $K$ such that,

$$|y_0| + \int_0^T |\mathbf{A}|\,|\phi(s)|\, ds + \int_0^T |g(s)|\, ds \le K. \tag{B.10}$$

The norms in Eq. (B.9) are taken using Eq. (B.8) and the following properties: the inequality $|\mathbf{A}\mathbf{v}| \le |\mathbf{A}|\,|\mathbf{v}|$ is true for $\mathbf{A} \in F_{nn}$ and for any $\mathbf{v} \in F_{n}$, and $\left| \int_a^b f(t)\, dt \right| \le \int_a^b |f(t)|\, dt$ for every $b > a$ and every continuous vector function $f$ on the interval $a \le t \le b$.

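Using the triangle inequality for vector norms, the following relationship is obtained,

$$\begin{aligned} |\phi(t)| &\le K + |\mathbf{A}| \int_T^t |\phi(s)|\, ds + \int_T^t M e^{cs}\, ds, \\ &\le K + |\mathbf{A}| \int_T^t |\phi(s)|\, ds + \frac{M}{c} \big( e^{ct} - e^{cT} \big), \\ &\le K + \frac{M}{c} e^{ct} + |\mathbf{A}| \int_T^t |\phi(s)|\, ds. \end{aligned} \tag{B.11}$$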

Multiplying Eq. (B.11) by $e^{-ct}$, the following expression is obtained,

$$\begin{aligned} |\phi(t)|\, e^{-ct} &\le K e^{-ct} + \frac{M}{c} + |\mathbf{A}| \int_T^t |\phi(s)|\, e^{-ct}\, ds, \\ &\le K e^{-ct} + \frac{M}{c} + |\mathbf{A}| \int_T^t |\phi(s)|\, e^{-cs}\, ds, \end{aligned} \tag{B.12}$$

since $e^{-ct} \le e^{-cs}$ for $t \ge s$. Since $c > 0$, there exists a constant $L$ such that $K e^{-ct} + \dfrac{M}{c} \le L$ for all $t \ge T$. The inequality described in Eq. (B.12) now becomes $|\phi(t)|\, e^{-ct} \le L + |\mathbf{A}| \int_T^t |\phi(s)|\, e^{-cs}\, ds$.

Now let $K > 0$ and $a \ge 0$ be constants. Suppose that $r(t)$ is a continuous nonnegative function for $t \ge t_0$ which satisfies the inequality $r(t) \le a + K \int_{t_0}^{t} r(\tau)\, d\tau$ on some interval $I$. Then $r(t) \le a\, e^{K(t - t_0)}$ for $t \ge t_0$ and $t$ in the interval $I$. Using this development with $r(t) = |\phi(t)|\, e^{-ct}$ gives $|\phi(t)|\, e^{-ct} \le L e^{|\mathbf{A}|(t - T)}$ or $|\phi(t)| \le L e^{-|\mathbf{A}| T} e^{(c + |\mathbf{A}|) t}$ for $t \ge T$. Thus $\phi(t)$ has exponential growth at infinity. Since $\phi'(t) = \mathbf{A}\,\phi(t) + g(t)$ the following inequality exists,

$$\begin{aligned} |\phi'(t)| &\le |\mathbf{A}|\,|\phi(t)| + |g(t)| \le |\mathbf{A}|\, L e^{-|\mathbf{A}| T} e^{(c + |\mathbf{A}|) t} + M e^{ct}, \\ &\le \big( |\mathbf{A}|\, L e^{-|\mathbf{A}| T} + M \big)\, e^{(c + |\mathbf{A}|) t}, \end{aligned} \tag{B.13}$$

and thus $\phi'(t)$ also has exponential growth at infinity ∴

Appendix C

PROBABILITY GENERATING FUNCTIONS

Throughout the body of this paper a probability density function is used to

represent a random variable associated with the failure rate of a system. In many

instances, these random variables assume only nonnegative integer values

representing the occurrence of a random event or the duration between the

occurrences of random events [Dave70].

The series of random variables generated by the underlying random process can

be represented by a sequence of numbers with several interesting properties.

Definition C.1

Let $a_0, a_1, \ldots, a_n$ be a sequence of real numbers representing a random process. If $A(s) = a_0 + a_1 s + a_2 s^2 + \cdots = \sum_{j} a_j s^j$ converges in some open interval $(-s_0 < s < s_0)$ then $A(s)$ is defined as the Generating Function of the sequence $\{a_j\}$ [Fell67] ∴

Unlike the Laplace transform notation, the variable $s$ has no significance in this instance. If the sequence $\{a_j\}$ is bounded, then a comparison with the geometric series shows that $A(s)$ converges at least for $|s| < 1$. Let $X$ be a random variable assuming the values $0, 1, 2, \ldots$. Let $P\{X = j\} = p_j$ and $P\{X > j\} = q_j = 1 - (p_0 + p_1 + \cdots + p_j)$, then $q_k = p_{k+1} + p_{k+2} + \cdots$. The Generating Functions of the sequences $\{p_j\}$ and $\{q_k\}$ are defined as,

$$\begin{aligned} \mathbf{P}(s) &= p_0 + p_1 s + p_2 s^2 + \cdots = \sum_{j} p_j s^j, \\ \mathbf{Q}(s) &= q_0 + q_1 s + q_2 s^2 + \cdots = \sum_{j} q_j s^j. \end{aligned} \tag{C.1}$$

Since $\mathbf{P}(1) = 1$, the series for $\mathbf{P}(s)$ converges absolutely at least for $-1 \le s \le 1$. The coefficients of $\mathbf{Q}(s)$ are less than unity, and so the series for $\mathbf{Q}(s)$ converges at least in the open interval $(-1 < s < 1)$.

Theorem C.1

For the interval $(-1 < s < 1)$, $\mathbf{Q}(s) = \dfrac{1 - \mathbf{P}(s)}{1 - s}$ ∴

Proof of Theorem C.1

The coefficient of $s^n$ in $(1 - s) \cdot \mathbf{Q}(s)$ is equivalent to $q_n - q_{n-1} = -p_n$ when $n \ge 1$ and is equivalent to $q_0 = p_1 + p_2 + \cdots = 1 - p_0$ when $n = 0$. Therefore $(1 - s)\,\mathbf{Q}(s) = 1 - \mathbf{P}(s)$ ∴
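Theorem C.1 can be checked numerically by truncating both series. The sketch below assumes a geometric distribution $p_j = (1 - p)\,p^j$ purely as an example; the function name and the truncation length `terms` are illustrative choices.

```python
def pgf_check(p, s, terms=200):
    """Return Q(s) and (1 - P(s)) / (1 - s) for a geometric distribution.

    P{X = j} = (1 - p) * p**j is an assumed example distribution, and both
    series are truncated after `terms` coefficients.
    """
    pj = [(1.0 - p) * p ** j for j in range(terms)]
    # Build the tail probabilities q_j = p_{j+1} + p_{j+2} + ... from the back.
    qj, tail = [], 0.0
    for c in reversed(pj):
        qj.append(tail)   # tail currently holds the sum of the later p values
        tail += c
    qj.reverse()
    big_p = sum(c * s ** j for j, c in enumerate(pj))
    big_q = sum(c * s ** j for j, c in enumerate(qj))
    return big_q, (1.0 - big_p) / (1.0 - s)
```

For `p = 0.6` and `s = 0.5` both returned values should agree with the closed form $p / (1 - ps)$, up to truncation and rounding error.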

Restating the Generating Function for the random variable $X$ gives,

$$G(s) = \sum_{j} P\{X = j\}\, s^j = E\big[ s^X \big]. \tag{C.2}$$

If $s$ is allowed to be any complex value, $s = x + iy$, an important generalization can take place [Fell66], [Arfk70]. By defining a new variable $\xi = e^{is}$, that is $s \Leftrightarrow e^{is}$, where $i = \sqrt{-1}$, and substituting this variable in the expression in Eq. (C.2), a generalization of the Generating Function can be given as,

$$G(s) = \sum_{j} P\{X = j\}\, e^{isj} = \sum_{j} P\{X = j\}\, \big( e^{is} \big)^{j}. \tag{C.3}$$

The summation on the right–hand side of Eq. (C.3) is a power series in $e^{is}$. Given $G(s)$ as a function of $s$, the expansion of a power series in $s$ gives the probability $P\{X = j\}$ directly as a coefficient of $s^j$. This relationship is also the definition of the z–Transform of a complex function [Papo65].

Defining $z = e^{t_0 s}$ and expanding the definition to $z = e^{t_0 (\sigma + i\omega)} = e^{\sigma t_0} \left[ \cos \omega t_0 + i \sin \omega t_0 \right]$ and substituting this definition into the discrete representation of the Laplace Transform defined by [Iser77],

$$\mathcal{L}\{x(t)\} = x(s) = \sum_{k=0}^{\infty} x(k t_0)\, e^{-k t_0 s} \tag{C.4}$$

results in,

$$Z\{x(k t_0)\} = \sum_{k=0}^{\infty} x(k t_0)\, z^{-k} = x(0) + x(1)\, z^{-1} + x(2)\, z^{-2} + \cdots \tag{C.5}$$

This infinite series converges if $x(k t_0)$ is restricted to finite values and if $|z| > 1$. Since $\sigma$ in the term $s = \sigma + i\omega$ can be chosen appropriately, convergence is possible for many functions of the form $x(k t_0)$ ∴
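A partial sum of Eq. (C.5) gives a quick feel for the convergence condition. The sketch below is illustrative only: the function name `z_transform` and the truncation length are assumptions, and the sequence is passed in as an ordinary Python function of the sample index $k$.

```python
def z_transform(x, z, terms=500):
    """Partial sum of Eq. (C.5) for a sequence given as a function x(k)."""
    return sum(x(k) * z ** (-k) for k in range(terms))
```

For the unit step sequence $x(k) = 1$ the series is geometric in $z^{-1}$, so at $z = 2$ the partial sum should be close to $1 / (1 - z^{-1}) = 2$; moving $z$ toward the unit circle slows the convergence.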

Appendix D

POISSON PROCESSES

Throughout this paper the Poisson process is referred to with little or no formal

development. In searching the literature, it is curious that the fundamental axioms

of the Poisson process are rarely developed. Although several relationships are

developed in this paper between the Poisson process and the Exponential

process, the underlying source of a random process exhibiting a Poisson behavior

has not been described in sufficient detail. This appendix attempts to correct this

situation.

Numerous physical and organic processes exhibit behavior that is meaningfully

modeled by the Poisson process. One of the first observations of the Poisson

process was that it:

Properly represented the number of army soldiers that died as a result of being

kicked by a horse. [Fry28]

It has been shown that the sum of a large number of independent stationary

renewal processes, each with an arbitrary distribution renewal time, will tend to a

Poisson process [Palm43], [Khin60].

The exponential function is perhaps the most important function in the analysis

of physical systems. This is because the exponential function has an almost magical

property that its derivative and its integral yield the function itself. For example $\dfrac{d}{dt} e^t = \int e^t\, dt = e^t$. Most of the phenomena observed in nature can be described by exponential functions [Lath65].

Although the Poisson process arises frequently as a model for random events, it

also serves as a building block for other useful stochastic processes, most

importantly the Markov process.

Definition D.1

Consider a finite time interval $[0, T)$ in which the probability distribution function describing the number of events occurring in the interval is sought. This function can be found by dividing the period $T$ into $m$ subintervals, each of length $h = T/m$. Let $\lambda$ represent the average event occurrence rate within the interval of interest. For any subinterval, the probability that one (1) event occurs is $\lambda h + O(h)$, that two or more events occur is $O(h)$ and therefore, that no events occur is $1 - \lambda h + O(h)$, where the function $O(h)$ represents any quantity that approaches zero faster than $h$ as $h \to 0$; that is $O(h)/h \to 0$ as $h \to 0$.

The rate at which events occur is said to be Poisson when the following property

holds,

The event occurrence rate in any subinterval is statistically independent of the

event occurrence rate in any other subinterval not overlapping this subinterval.

This is a formal restatement of the properties given in Figure 3. This property can best be described if an event is considered a success of a Bernoulli trial [Bern31]; the event pattern observed over the period $T = mh$ can then be regarded as the result of a sequence of $m$ Bernoulli trials. The probability that exactly $i$ events occur in the $m$ subintervals can be approximated by using the binomial distribution $\dbinom{m}{i} \left[ \lambda h + O(h) \right]^{i} \left[ 1 - \lambda h + O(h) \right]^{m-i}$. By taking the limits $h \to 0$ and $m \to \infty$ while keeping $mh = T$ constant, the number of events in the interval $T$, given by $N(T)$, has the probability distribution function,

$$\begin{aligned} P\{N(T) = i\} &= \lim_{m \to \infty} \frac{m!}{i!\,(m - i)!} \left( \frac{\lambda T}{m} \right)^{i} \cdot \lim_{m \to \infty} \left( 1 - \frac{\lambda T}{m} \right)^{m - i}, \\ &= \frac{(\lambda T)^{i}}{i!}\, e^{-\lambda T}, \end{aligned} \tag{D.1}$$

which is the Poisson distribution of Eq. (2.6).
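The limiting argument can be observed numerically by increasing the number of subintervals $m$ while holding $T$ fixed. The sketch below drops the $O(h)$ terms and uses $\lambda h = \lambda T / m$ as the per–subinterval success probability; the function names and the sample values of $\lambda$ and $T$ are illustrative assumptions.

```python
import math

def binomial_pmf(i, m, lam, T):
    """P{i events} when [0, T) is split into m Bernoulli subintervals."""
    q = lam * T / m          # success probability per subinterval, lambda * h
    return math.comb(m, i) * q ** i * (1.0 - q) ** (m - i)

def poisson_pmf(i, lam, T):
    """Limiting Poisson probability of Eq. (D.1)."""
    return (lam * T) ** i / math.factorial(i) * math.exp(-lam * T)
```

Comparing `binomial_pmf(2, m, 0.5, 4.0)` with `poisson_pmf(2, 0.5, 4.0)` for `m = 10` and `m = 10000` shows the gap between the two shrinking as the subdivision is refined.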

Another important property of the Poisson distribution is the distribution of intervals between events. Let $X$ be an arbitrarily chosen interval from the time origin to the first event. The distribution for the random variable $X$ can be developed by noting that no arrivals occur in the interval $(0, x)$ if and only if $X > x$; that is $P\{X > x\} = P\{N(x) = 0\}$ where $N(x)$ represents the number of events occurring in $x$ time units. Using Eq. (D.1) results in $P\{N(x) = 0\} = e^{-\lambda x}$. The probability distribution function of the random variable $X$ is then given by $F_X(x) = 1 - e^{-\lambda x}$ and using Eq. (2.15) the probability density function is given as $f_X(x) = F_X'(x) = \lambda e^{-\lambda x}$ ∴

The Poisson process can be characterized in several ways:

§ The pure birth process with constant birth rates. A birth–death process is a continuous parameter Markov chain $\{N(t),\ t \ge 0\}$ with the state space $\{0, 1, 2, \ldots\}$ and homogeneous transition probabilities. The conditional probability that the population will increase by one (1) may depend on the population size $n$ and is denoted by $\lambda_n$. The conditional probability that the population may decrease by one (1) may depend on the population size $n$ and is denoted by $\mu_n$.

§ The renewal counting process with exponentially distributed inter–arrival times. Let $f(t)$, $g(t)$, and $h(t)$ be functions defined for $t \ge 0$ satisfying the relation $g(t) = h(t) + \int_0^t g(t - s)\, f(s)\, ds$, $t \ge 0$, where $f(t)$ and $h(t)$ are known functions, and $g(t)$ is an unknown function to be determined as the solution to the integral equation. This integral equation is referred to as the Renewal Equation. The mean value function of a renewal counting process can be stated as, $m(t) = E[N(t)] = \sum_{n=0}^{\infty} n\, p_{N(t)}(n)$.

§ An integer–valued process with stationary independent increments which has unit jumps.

Definition D.2

The counting process $\{N(t),\ t \ge 0\}$ is said to be a Poisson process having rate $\lambda$, $\lambda > 0$ if

§ $N(0) = 0$.

§ The process has independent increments.

§ The number of events in any interval of length $t$ is Poisson distributed with a mean of $\lambda t$. That is for all $s, t \ge 0$, $P\{N(t + s) - N(s) = n\} = e^{-\lambda t} \cdot \dfrac{(\lambda t)^{n}}{n!}$.

Definition D.3

In Chapter 2, the memoryless property of the exponential distribution was

described. This property is important to many areas of reliability analysis. In the

theory of recurrent events (Appendix E) and the theory of Markov Chains, the

underlying probability distributions are memoryless.

Suppose the random variable $X$ represents the lifetime of a component, and that $X > t$. If the component has been observed for some time $t$ then the observer would be interested in the probability distribution of the remaining lifetime of the component, $Y = X - t$, or the residual life of the component.

Let the conditional probability of $Y \le y$, given that $X > t$, be denoted by the function $F_t(y)$. Thus for $y \ge 0$,

$$\begin{aligned} F_t(y) &= P\{Y \le y \mid X > t\}, \\ F_t(y) &= P\{X - t \le y \mid X > t\}, \\ F_t(y) &= P\{X \le y + t \mid X > t\}, \\ F_t(y) &= \frac{P\{X \le y + t \text{ and } X > t\}}{P\{X > t\}}. \end{aligned} \tag{D.2}$$

By definition of the conditional probability [Kend77],

$$F_t(y) = \frac{P\{t < X \le y + t\}}{P\{X > t\}}. \tag{D.3}$$

Now consider the exponential density function $f(x) = \lambda e^{-\lambda x}$. If this function represents both the previous life of a component and the remaining life (residual life) of the component, then the ratio of the two distributions should not be a function of time. Thus,

$$\frac{\text{residual}}{\text{lifetime}} = \frac{\int_t^{y+t} f(x)\, dx}{\int_t^{\infty} f(x)\, dx} = \frac{\int_t^{y+t} \lambda e^{-\lambda x}\, dx}{\int_t^{\infty} \lambda e^{-\lambda x}\, dx} = \frac{e^{-\lambda t} \big( 1 - e^{-\lambda y} \big)}{e^{-\lambda t}} = 1 - e^{-\lambda y}. \tag{D.4}$$
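The cancellation of $e^{-\lambda t}$ in Eq. (D.4) can be confirmed by evaluating Eq. (D.3) for several observation times. The sketch below is a minimal check; the function name and the rate used in the example are illustrative assumptions.

```python
import math

def residual_cdf(y, t, lam):
    """F_t(y) = P{t < X <= y + t} / P{X > t} for an exponential lifetime."""
    def cdf(x):
        return 1.0 - math.exp(-lam * x)
    return (cdf(y + t) - cdf(t)) / (1.0 - cdf(t))
```

With `lam = 0.001` and `y = 100`, the value returned for `t = 0`, `t = 10` and `t = 500` is the same, $1 - e^{-0.1}$, which is the memoryless property stated above.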

Thus the ratio of the two exponential distributions is independent of time $t$. Stated more clearly, a component whose failure process is described by an exponential distribution does not age, i.e. it is good as new or it forgets how long it has been operating, and thus its eventual breakdown is a result of a sudden failure, not its gradual deterioration.

If the arrival time of a component failure is exponentially distributed, then the

memoryless property implies that the time of arrival of the next failure is

independent of how long it has been since the last failure. This can be shown if

we assume X is a nonnegative continuous random variable with the Markov

property. Then the distribution of X must be exponential.

Using Eq. (D.3) then $\dfrac{P\{t < X \le y + t\}}{P\{X > t\}} = P\{X \le y\} = P\{0 < X \le y\}$. The Cumulative Distribution Function (CDF) is then given by $F_X(y + t) - F_X(t) = \left[ 1 - F_X(t) \right] \left[ F_X(y) - F_X(0) \right]$. Since $F_X(0) = 0$, the CDF can be rearranged to give $\dfrac{F_X(y + t) - F_X(y)}{t} = \dfrac{F_X(t)}{t} \left[ 1 - F_X(y) \right]$. Taking the limit as $t \to 0$ gives, $F_X'(y) = \lim_{t \to 0} \dfrac{F_X(y + t) - F_X(y)}{t}$ which is an expression of the derivative of $F_X$. Using Eq. (2.21) for the definition of the reliability function, i.e. $R_X(y) = 1 - F_X(y)$, then the derivative of the reliability is given by,

$$\frac{d}{dy} R_X(y) = \frac{d}{dy} R_X(0)\; R_X(y). \tag{D.5}$$

The solution to the differential equation Eq. (D.5) is given by $R_X(y) = K \exp\left[ \dfrac{d}{dy} R_X(0)\, y \right]$, where $K$ is a constant of integration and $-\dfrac{d}{dy} R_X(0) = \dfrac{d}{dy} F_X(0) = f_X(0)$ which is the pdf of the random variable $X$ evaluated at 0. Noting that $R_X(0) = 1$ and that $f_X(0) = \lambda$ results in $R_X(y) = e^{-\lambda y}$ and thus $F_X(y) = 1 - e^{-\lambda y}$, $y > 0$, which by inspection is an exponential distribution [Bhat72].

When working with exponential distributions it is desirable to treat them at times as joint probability distributions rather than as conditional probability distributions. The following two definitions provide the basis for this process.

Definition D.4

Given two random variables $X$ and $Y$ and a function $g(x, y)$ of the real variables $x$ and $y$. If $g(x, y)$ satisfies certain general conditions, namely (1) the set $I_y$ of all real numbers $x$ such that $g(x) \le y$ should be a countable union or intersection of intervals for any $y$; only then $\{Y \le y\}$ is an event. If $g(x)$ has this property it is called a Baire function. (2) the set of outcomes $\zeta$ such that $g(X(\zeta)) = \pm\infty$ should be an event with zero probability, then the function $Z = g(X, Y)$ is a random variable and its value $Z(\zeta)$ is given by $Z(\zeta) = g\big( X(\zeta), Y(\zeta) \big)$ ∴

Definition D.5

If the random variables $X$ and $Y$ are independent, then the density of their sum $Z = X + Y$ equals the convolution of their respective densities, $f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$. It should be noted that if $X$ and $Y$ take only positive values, i.e. if $f_X(x) = 0$ for $x < 0$ and $f_Y(y) = 0$ for $y < 0$, then the convolution takes the form $f_Z(z) = \int_0^z f_X(z - y)\, f_Y(y)\, dy$ for $z > 0$. In the case of the subtraction of two independent distributions, $Z = X - Y$, the convolution is given by $f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(-y)\, dy$ ∴

In the analysis of Fault–Tolerant systems, situations frequently arise in which a

series of exponential distributions are arranged in sequence. The resulting density

is known as an r–stage Hypoexponential distribution.

Definition D.6

In the previous section of this appendix, the Poisson distributions have been

treated as separate and independent processes. In practical systems a group of

Poisson processes can be combined or separated in a manner modeling the

physical organization of the underlying systems.

The superposition of Poisson processes can be described by considering $m$ independent sources that generate random events. Assume each source follows a Poisson distribution with rate $\lambda_k$, $k = 1, 2, \ldots, m$. If these sources are combined into a single source, then this new source is a Poisson process with a rate that is the sum of all components, $\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_m$. This additive property can be shown by using the probability generating function. Consider an interval of length $T$. The number of events from the $k$th Poisson source in this interval is Poisson distributed with the parameter $\lambda_k T$, and using Eq. (C.2) its probability generating function is given as $G_k(z) = e^{-\lambda_k T (1 - z)}$. The total number of events from all sources has the probability generating function $G(z) = \prod_{k=1}^{m} G_k(z) = e^{-\lambda T (1 - z)}$, with a parameter $\lambda = \sum_{k=1}^{m} \lambda_k$, where the product form is due to the statistical independence of the $m$ generating sources. The total number of events in a merged stream is Poisson–distributed with mean $\lambda T$.
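The additive property can also be checked without generating functions, by convolving the individual count distributions. The sketch below is a minimal illustration in which the list `means` holds assumed values of $\lambda_k T$ for each source; the function names are hypothetical.

```python
import math

def poisson_pmf(n, mean):
    """Poisson probability of n events for a given mean."""
    return mean ** n / math.factorial(n) * math.exp(-mean)

def merged_pmf(n, means):
    """pmf of the total count from independent Poisson sources, by convolution."""
    dist = [1.0] + [0.0] * n          # distribution of an empty sum
    for mean in means:
        src = [poisson_pmf(k, mean) for k in range(n + 1)]
        # Discrete convolution, kept only up to n events.
        dist = [sum(dist[j] * src[k - j] for j in range(k + 1))
                for k in range(n + 1)]
    return dist[n]
```

For `means = [0.3, 1.2, 0.5]` the convolved probabilities should match `poisson_pmf(n, 2.0)`, the Poisson distribution whose mean is the sum of the source means.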

The second situation in which Poisson processes are combined is one in which a source of random events is decomposed into $m$ separate Poisson processes. If the originating source has a generating rate of $\lambda$ the probability of selecting the $k$th decomposed process is $r_k$.

Let $N(T)$ denote as before the number of events occurring in $T$ time units and let $N_k(T)$ denote the number of events in the $k$th decomposed process. The conditional joint distribution of $N_k(T)$, $k = 1, 2, \ldots, m$, given $N(T) = n$, is a multinomial distribution with selection probabilities $r_k$, resulting in the expression,

$$P\{N_1(T) = n_1, N_2(T) = n_2, \ldots, N_m(T) = n_m \mid N(T) = n\} = \frac{n!}{n_1!\, n_2! \cdots n_m!}\; r_1^{n_1} r_2^{n_2} \cdots r_m^{n_m}.$$

By multiplying by the probability of the random variable $n$, which is the Poisson distribution with parameter $\lambda T$, an expression for the joint probability of the $m$ decomposed Poisson distributions is given by,

$$\begin{aligned} P\{n_1, n_2, \ldots, n_m\} &= \frac{n!}{n_1!\, n_2! \cdots n_m!}\; r_1^{n_1} r_2^{n_2} \cdots r_m^{n_m} \cdot \frac{(\lambda T)^{n}}{n!}\, e^{-\lambda T}, \\ &= \prod_{k=1}^{m} \frac{(r_k \lambda T)^{n_k}}{n_k!}\, e^{-r_k \lambda T}. \end{aligned} \tag{D.6}$$

Since the joint probability factors into $m$ Poisson distributions, the random variables $n_1, n_2, \ldots, n_m$ are statistically independent for an arbitrarily chosen interval $T$.

Suppose there are r identical exponential distributions and an overall process that

experiences these distributions in sequence. The resulting density is given as,

$$f(t) = \frac{\lambda^r t^{r-1} e^{-\lambda t}}{(r-1)!}, \quad t > 0,\ \lambda > 0,\ r = 1, 2, \ldots \tag{D.7}$$

The corresponding distribution function is given by,

$$F(t) = 1 - \sum_{k=0}^{r-1} \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}, \quad t \ge 0,\ \lambda > 0,\ r = 1, 2, \ldots \tag{D.8}$$

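A single–stage distribution of this type is equivalent to an exponential distribution. When $r = 1$ the density of Eq. (D.7) reduces to,

$$\left. \frac{\lambda^r t^{r-1} e^{-\lambda t}}{(r-1)!} \right|_{r=1} = \lambda e^{-\lambda t} \;\therefore \tag{D.9}$$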
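As a sanity check on Eq. (D.7) and Eq. (D.8), the sketch below integrates the density numerically and compares it with the closed-form distribution function. The rate, stage count, and the Simpson's-rule helper `integrate_pdf` are illustrative assumptions, not part of the original derivation.

```python
import math

def erlang_pdf(t, r, lam):
    """Density of Eq. (D.7): r identical exponential stages in sequence."""
    return lam**r * t**(r - 1) * math.exp(-lam * t) / math.factorial(r - 1)

def erlang_cdf(t, r, lam):
    """Distribution function of Eq. (D.8)."""
    partial = sum((lam * t)**k / math.factorial(k) for k in range(r))
    return 1.0 - partial * math.exp(-lam * t)

def integrate_pdf(t, r, lam, steps=20000):
    """Composite Simpson's rule for the integral of the density over (0, t)."""
    h = t / steps
    total = erlang_pdf(0.0, r, lam) + erlang_pdf(t, r, lam)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * erlang_pdf(i * h, r, lam)
    return total * h / 3.0

numeric = integrate_pdf(1.5, 3, 2.0)   # hypothetical t, r, lambda
closed = erlang_cdf(1.5, 3, 2.0)
```

With `r = 1` the function `erlang_pdf` returns the exponential density, which mirrors Eq. (D.9).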

Theorem D.1

Let $Z = \sum_{i=1}^{r} X_i$, where $X_1, X_2, \ldots, X_r$ are mutually independent and $X_i$ is an exponentially distributed random variable with the parameter $\lambda_i$, $\lambda_i \ne \lambda_j$ for $i \ne j$. Then the density of Z, which is an r–stage hypoexponentially distributed random variable, is given as,

$$f_Z(z) = \sum_{i=1}^{r} a_i \lambda_i e^{-\lambda_i z}, \quad z > 0, \tag{D.10}$$

where,

$$a_i = \prod_{\substack{j=1 \\ j \ne i}}^{r} \frac{\lambda_j}{\lambda_j - \lambda_i}, \quad 1 \le i \le r \;\therefore \tag{D.11}$$
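Two consequences of Theorem D.1 are easy to verify numerically: the coefficients $a_i$ sum to one (so the density integrates to one), and the mean of Z equals the sum of the stage means $1/\lambda_i$. The sketch below assumes three hypothetical, distinct rates.

```python
import math

def hypoexp_coefficients(rates):
    """Coefficients a_i of Eq. (D.11); the rates must be pairwise distinct."""
    coeffs = []
    for i, lam_i in enumerate(rates):
        a = 1.0
        for j, lam_j in enumerate(rates):
            if j != i:
                a *= lam_j / (lam_j - lam_i)
        coeffs.append(a)
    return coeffs

def hypoexp_pdf(z, rates):
    """Density of Eq. (D.10)."""
    return sum(a * lam * math.exp(-lam * z)
               for a, lam in zip(hypoexp_coefficients(rates), rates))

rates = [1.0, 2.5, 4.0]                      # hypothetical stage rates (assumption)
coeffs = hypoexp_coefficients(rates)
total_mass = sum(coeffs)                     # integral of the density over (0, inf)
mean = sum(a / lam for a, lam in zip(coeffs, rates))
```

Here `total_mass` evaluates to 1 and `mean` to 1/1.0 + 1/2.5 + 1/4.0 = 1.65, up to rounding.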


Appendix E

RENEWAL THEORY

In Chapter 3, the Mean Time to First Failure expression was developed using

Renewal Theory [Cox62], [Ross70], [Smit58], [Fell66]. This appendix develops

further the foundations of Renewal Theory, also referred to as the Theory of

Recurrent Events. Renewal Theory started as the study of the probability

associated with the failure and replacement of components.

Suppose there is a population of components, each component characterized by a

non–negative random variable X called its failure–time. The failure–time is the age

of the component at which a clearly defined event occurs – its failure. The

random variable X is non–negative and two cases are considered: (i) there is a positive constant h such that the only possible values of X are $\{0, h, 2h, \ldots\}$; (ii) the random variable has a continuous distribution over the range $(0, \infty)$, its distribution being determined by a probability density function.

In the analysis of Fault–Tolerant systems which are capable of repair, the fault–

failure–repair sequence can be modeled as a recurring sequence of random

variables. A pattern of random variables R , qualifies as a renewal process if after

each occurrence of an event in R , the random process generating the events

starts again with the restriction that the sequence of events following an

occurrence in R forms a replica of the entire process. The duration of waiting

times between occurrences in R are mutually independent random variables and

have the same probability distribution.


In the simplest example, R represents the event that a success occurs in a sequence of Bernoulli trials. The waiting time to the first occurrence of the event has a geometric distribution. When the first success occurs the trials start again, so the number of trials between the $r$th and the $(r+1)$th success has the same geometric distribution. The waiting time up to the $r$th success is then the sum of r independent variables.
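The statement that the waiting time to the $r$th success is a sum of r independent geometric variables can be illustrated directly. The sketch below convolves r geometric distributions and compares the result with the negative binomial probability, a standard closed form that is not stated in the text; the success probability and the values of n and r are assumptions chosen for illustration.

```python
import math

def geometric_pmf(n, p):
    """P{first success occurs at trial n}, n = 1, 2, ..."""
    return (1.0 - p)**(n - 1) * p if n >= 1 else 0.0

def waiting_time_pmf(n, r, p):
    """P{r-th success occurs at trial n}, by convolving r geometric laws."""
    dist = [1.0] + [0.0] * n            # zero successes need zero trials
    for _ in range(r):
        dist = [sum(dist[j] * geometric_pmf(i - j, p) for j in range(i + 1))
                for i in range(n + 1)]
    return dist[n]

def negative_binomial_pmf(n, r, p):
    """Standard closed form for the same probability."""
    return math.comb(n - 1, r - 1) * p**r * (1.0 - p)**(n - r)

by_convolution = waiting_time_pmf(7, 3, 0.3)    # hypothetical n, r, p
closed_form = negative_binomial_pmf(7, 3, 0.3)
```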

This sequence of events also holds when R represents a success followed by a failure.

In the analysis of reliability systems, this model can be used to represent a

sequence of proper operations followed by a fault event, which results in a repair,

followed by another series of proper operations.

Consider a sequence of repeated trials with possible outcomes $E_j\ (j = 1, 2, \ldots)$. Since these outcomes need not be independent, they can be used to analyze Markov processes. Suppose it is possible to continue the trials indefinitely, with the probabilities $P\{E_{j_1}, E_{j_2}, \ldots, E_{j_n}\}$ being defined consistently for all finite sequences. Let R be an attribute of these finite sequences.

Definition E.1

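The attribute R defines a recurrent event if: (a) In order that R occurs at the $n$th and the $(n+m)$th location in the sequence $(E_{j_1}, E_{j_2}, \ldots, E_{j_{n+m}})$, it is necessary and sufficient that R occurs at the last location in each of the two subsequences $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ and $(E_{j_{n+1}}, E_{j_{n+2}}, \ldots, E_{j_{n+m}})$; (b) If R occurs at the $n$th location then identically

$$P\{E_{j_1}, \ldots, E_{j_{n+m}}\} = P\{E_{j_1}, \ldots, E_{j_n}\} \cdot P\{E_{j_{n+1}}, \ldots, E_{j_{n+m}}\} \;\therefore$$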


With this definition, the statement that R occurs in the sequence $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ for the first time at the $n$th place is well defined. With each recurrent event R there are associated two sequences of numbers, defined for $n = 1, 2, \ldots$ as follows,

$$u_n = P\{\text{R occurs at the } n\text{th trial}\},$$

and

$$f_n = P\{\text{R occurs for the first time at the } n\text{th trial}\}.$$

For convenience define $f_0 = 0$ and $u_0 = 1$, which allows the generating functions $F(s) = \sum_{k=1}^{\infty} f_k s^k$ and $U(s) = \sum_{k=0}^{\infty} u_k s^k$ to be defined. The sequence $\{u_k\}$ is not in general a probability distribution; in representative cases $\sum u_k = \infty$. The events described by R occurring for the first time at the $n$th trial are mutually exclusive, therefore $f = \sum_{n=1}^{\infty} f_n \le 1$.

The term $1 - f$ can be interpreted as the probability that R does not occur in an indefinitely prolonged sequence of trials. If $f = 1$ a random variable T can be defined by $P\{T = n\} = f_n$. The same notation is used when $f < 1$, in which case T is an improper, or defective, random variable, which with probability $1 - f$ does not assume a numerical value.
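The two sequences are linked by the standard renewal relation $u_n = \sum_{k=1}^{n} f_k u_{n-k}$ with $u_0 = 1$; this relation is not stated in the text and is quoted here as an assumption taken from the general theory of recurrent events. The sketch below applies it to the Bernoulli-success example, where the first-occurrence probabilities are geometric and every $u_n$ should equal the success probability p.

```python
def occurrence_probabilities(first_time_prob, n_max):
    """Build u_0..u_{n_max} from f_k using u_n = sum_{k=1}^{n} f_k * u_{n-k}.

    first_time_prob is a function k -> f_k (assumed helper, for illustration).
    """
    u = [1.0]                                   # u_0 = 1 by convention
    for n in range(1, n_max + 1):
        u.append(sum(first_time_prob(k) * u[n - k] for k in range(1, n + 1)))
    return u

p = 0.3                                          # hypothetical success probability

def geometric_first_time(k):
    """f_k for 'a success occurs' in Bernoulli trials."""
    return (1.0 - p)**(k - 1) * p

u = occurrence_probabilities(geometric_first_time, 10)
```

For this example the recursion returns `u[n]` equal to `p` for every `n >= 1`, since a success is equally likely at each trial.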

Theorem E.1

Let $f_n^{(r)}$ be the probability that the $r$th occurrence of R takes place at the $n$th trial. The sequence $\{f_n^{(r)}\}$ is the probability distribution of the sum $T^{(r)} = T_1 + T_2 + \cdots + T_r$ of independent random variables $T_1, \ldots, T_r$, each having the distribution $P\{T = n\} = f_n$. For a fixed r the sequence $\{f_n^{(r)}\}$ has the generating function $F^r(s)$ ∴

Proof of Theorem E.1

It follows that $\sum_{n=1}^{\infty} f_n^{(r)} = F^r(1) = f^r$ ∴

It is preferable to consider the number n of trials as a fixed value and determine the number $N_n$ of occurrences of R in the first n trials as a random variable. In the analysis of system failures and repairs the behavior of the distribution of $N_n$ for large n is important.

Let $T^{(r)}$ represent the number of trials up to and including the $r$th occurrence of R. The probability distributions of $T^{(r)}$ and $N_n$ are related by the identity,

$$P\{N_n \ge r\} = P\{T^{(r)} \le n\} \;\therefore \tag{E.1}$$

Theorem E.2

If the recurrent event R is persistent and its recurrence times have finite mean $\mu$ and variance $\sigma^2$, then both the number $T^{(r)}$ of trials up to the $r$th occurrence of R and the number $N_n$ of occurrences of R in the first n trials are asymptotically normally distributed.

Proof of Theorem E.2

Assume the case where R is persistent and the distribution $\{f_n\}$ of its recurrence times has a finite mean $\mu$ and variance $\sigma^2$. Since $T^{(r)}$ is the sum of r independent variables, the central limit theorem asserts for each fixed x as $r \to \infty$,

$$P\left\{ \frac{T^{(r)} - r\mu}{\sigma\sqrt{r}} < x \right\} \to \mathcal{N}(x), \tag{E.2}$$

where $\mathcal{N}(x)$ is the normal distribution function. Now let $n \to \infty$ and $r \to \infty$ in such a way that,

$$\frac{n - r\mu}{\sigma\sqrt{r}} \to x, \tag{E.3}$$

then Eq. (E.1) and Eq. (E.2) together lead to $P\{N_n \ge r\} \to \mathcal{N}(x)$. This relation can be seen in a familiar form if the reduced variable $N_n^* = (\mu N_n - n)\sqrt{\mu / (\sigma^2 n)}$ is introduced. The inequality $N_n \ge r$ is identical with,

$$N_n^* \ge \frac{r\mu - n}{\sigma\sqrt{r}} \cdot \sqrt{\frac{r\mu}{n}} = -x\sqrt{\frac{r\mu}{n}}. \tag{E.4}$$

By dividing Eq. (E.3) by $\sqrt{r}$ it is seen that $n/r \to \mu$, and the right side of Eq. (E.4) tends to $-x$. Since $\mathcal{N}(-x) = 1 - \mathcal{N}(x)$ it follows that $P\{N_n^* \ge -x\} \to \mathcal{N}(x)$ or $P\{N_n^* < -x\} \to 1 - \mathcal{N}(x)$ ∴

Unfortunately many of the recurrence times occurring in various stochastic

processes and in statistical applications have infinite expectations. In such cases the

normal approximation is replaced by more general limit theorems of an entirely

different character [Fell49], and the chance fluctuations exhibit unexpected

features.


One expects intuitively that $E[N_n]$ should increase linearly with n because on the average R must occur twice as often in twice as many trials. However, this is not true when the recurrence times have infinite expectation [Fell49]. The previous discussion is important to understand when analyzing the response of a Fault–Tolerant system in the presence of a request for action, with a finite probability that the system will not respond properly. This issue is best illustrated in an example.

Suppose a Fault–Tolerant system is operating normally, that is all faults that can

occur are detected properly – the coverage factor is 1.0. When an external event

generates a demand on the system, it will respond properly, with a probability of

1.0. If however, the coverage factor is less than 1.0, the probability that the

system will respond properly is now less than 1.0.

In Chapter 7, there are summary tables that describe the Mean Time Between

Failures for the various configurations. In order to understand the consequences

of this situation on these tables, the probability of a demand must be included in

the model. A paradox will develop when this factor is included.

Let $A_k$ represent the $k$th uncovered fault in the system, which is assumed to occur at time $t_k$. Assume that the intervals $t_{k+1} - t_k$ are independent and identically distributed random variables with the distribution function given as $F(x) \triangleq P\{t_{k+1} - t_k \le x\}$. The common pdf for these intervals is defined as $f(x) \triangleq \frac{d}{dx} F(x)$.

Now choose some random point in time, $\tau$, for a demand on the system to occur. Let X denote the interarrival time for the undetected fault and let Y denote the time to the next occurrence of the undetected fault after the arrival of a system demand. The sequence of arrival events $\{t_k\}$ forms a renewal process. In this case $\{t_k\}$ forms the sequence of instants when the system experiences an undetected fault. The random variable X is defined as the lifetime of the system under consideration, the random variable Y is defined as the residual lifetime of the system at time $\tau$, and $X_0 = X - Y$ is defined as the age of the system at time $\tau$.

A paradox will be discovered if the pdf for the lifetime and the residual lifetime are compared. It is assumed the renewal process has been operating for an arbitrarily long time and limiting distributions are observed. The result will be that the random variable X will not be distributed according to $F(x) \triangleq P\{t_{k+1} - t_k \le x\}$. This means that the interval during which the demand on the system occurs is not a typical interval. A long interval is more likely to be selected than a short one, and for Poisson arrivals the selected interval is on the average twice as long as a typical interval.

Let the residual life have a distribution $F_Y(x) \triangleq P\{Y \le x\}$ with a density function of $f_Y(x) = \frac{d}{dx} F_Y(x)$. Let the selected lifetime X have a distribution $F_X(x) \triangleq P\{X \le x\}$ with a density function of $f_X(x) = \frac{d}{dx} F_X(x)$.

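[Figure: time axis showing the fault events $A_k$ at $t_k$ and $A_{k+1}$ at $t_{k+1}$, the demand instant $\tau$ between them, the age $X_0$ from $t_k$ to $\tau$, the residual life Y from $\tau$ to $t_{k+1}$, and the selected lifetime X spanning both.]

Figure E.1 – Residual life Y and the process life X of a system demand event occurring between two system fault events.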

The derivation of the residual life density $f_Y(x)$ is performed by observing that the event $\{Y \le y\}$ can occur if and only if $\tau < t_k \le \tau + y < t_{k+1}$ for some k. The distribution function is given by,

$$F_{Y,\tau}(y) \triangleq P\{Y \le y \mid \tau\} = \sum_{k=1}^{\infty} \int_{\tau}^{\tau + y} \left[ 1 - F(\tau + y - x) \right] dP\{t_k \le x\}. \tag{E.5}$$

Observing that $t_k \le x$ if and only if $\alpha(x)$, the number of arrivals in the interval $(0, x)$, is at least k, that is $P\{t_k \le x\} = P\{\alpha(x) \ge k\}$, the sum in Eq. (E.5) can be written as $\sum_{k=1}^{\infty} P\{t_k \le x\} = \sum_{k=1}^{\infty} P\{\alpha(x) \ge k\} = \sum_{k=1}^{\infty} k\, P\{\alpha(x) = k\} = E[\alpha(x)]$. For large x, this mean–value expression is $x / m_1$. Letting $F_Y(y) = \lim_{\tau \to \infty} F_{Y,\tau}(y)$, there is a corresponding pdf of,

$$f_Y(y) = \frac{1 - F(y)}{m_1}. \tag{E.6}$$

An intuitive explanation of the above derivation can be used for the lifetime density which takes advantage of the physical properties of this situation. It is observed that long intervals between renewal points occupy larger segments of the time axis than do shorter intervals, and therefore it is more likely that the random point $\tau$ will fall in a long interval. The probability that an interval of length x is chosen should be proportional to the length x as well as to the relative occurrence of such intervals, which is given by $f(x)\,dx$. Thus for the selected interval,

$$f_X(x)\,dx = K x f(x)\,dx, \tag{E.7}$$

where the left–hand side is $P[x < X \le x + dx]$ and the right–hand side expresses the linear weighting with respect to interval length and includes a constant K, which must be evaluated so as to properly normalize this density. Integrating both sides of Eq. (E.7) it is found that $K = 1/m_1$, where $m_1 \triangleq E[t_k - t_{k-1}]$ is the common average time between renewals (between failures and repair of the system). The density associated with the selected intervals is given in terms of the density of a typical interval by,

$$f_X(x) = \frac{x f(x)}{m_1}. \tag{E.8}$$


The results described in Eq. (E.8) and Eq. (E.6) can be expressed in terms of their Laplace transforms $f_X(x) \Leftrightarrow F_X^*(s)$ and $f_Y(x) \Leftrightarrow F_Y^*(s)$. Using the Laplace transform pairs $u_{-1}(t) = \delta(t) \Leftrightarrow \frac{1}{s}$ (the unit step function) and $\int_{-\infty}^{t} f(t)\,dt \Leftrightarrow \frac{F^*(s)}{s}$ to transform Eq. (E.6) gives,

$$F_Y^*(s) = \frac{1 - F^*(s)}{s\, m_1}. \tag{E.9}$$

The moments of the residual life can now be found in terms of the lifetimes

themselves.

Denote the $n$th moment of the lifetime by $m_n$ and the $n$th moment of the residual life by $r_n$, that is,

$$m_n \triangleq E\left[ (t_k - t_{k-1})^n \right], \tag{E.10}$$

$$r_n \triangleq E\left[ Y^n \right]. \tag{E.11}$$

Using the moment generating property of the transform, $A^{*(n)}(0) = (-1)^n E[X^n]$, Eq. (E.9) can be differentiated to obtain the moments of the residual life defined in Eq. (E.11). As $s \to 0$ indeterminate forms are produced which may be evaluated using L’Hospital’s rule to give the moments of the residual life as

$$r_n = \frac{m_{n+1}}{(n+1)\, m_1}.$$


If $f(x)$ and $g(x)$ are continuous in an interval including $x = a$, and if the derivatives $f'(x)$ and $g'(x)$ exist with $g'(x) \ne 0$ in this interval (except possibly at $x = a$) and if $f(a) = 0$ and $g(a) = 0$, then

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)},$$

provided the latter limit exists.

Figure E-2 – L’Hospital’s Rule used to evaluate the indeterminate form 0/0 of two continuous functions over an interval.

This expression is used to evaluate $r_1$, the mean residual life, which is found to be $r_1 = \frac{m_2}{2 m_1}$. The mean residual life may also be expressed in terms of the lifetime variance (the second central moment, denoted by $\sigma^2 \triangleq m_2 - m_1^2$) to give $r_1 = \frac{m_1}{2} + \frac{\sigma^2}{2 m_1}$. This last form shows the correct answer for the paradox is $m_1 / 2$, one half the mean interarrival time, only if the variance is zero (regularly spaced arrivals). However for Poisson arrivals, $m_1 = 1/\lambda$ and $\sigma^2 = 1/\lambda^2$, giving $r_1 = 1/\lambda = m_1$, which confirms the earlier solution to the residual life paradox.
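The two limiting cases quoted above can be evaluated with a few lines of arithmetic. The sketch below is illustrative only: the failure rate and the regular spacing are hypothetical values, and the mean of the selected interval, $m_2 / m_1$, is obtained by taking the first moment of Eq. (E.8).

```python
def mean_residual_life(m1, variance):
    """r_1 = m_1/2 + sigma^2 / (2 m_1)."""
    return m1 / 2.0 + variance / (2.0 * m1)

def mean_selected_interval(m1, variance):
    """First moment of Eq. (E.8): integral of x * x f(x) / m_1 = m_2 / m_1."""
    m2 = variance + m1**2
    return m2 / m1

# Regularly spaced faults every 4 time units (zero variance).
regular_residual = mean_residual_life(4.0, 0.0)

# Poisson faults with a hypothetical rate of 0.25 per time unit.
lam = 0.25
m1 = 1.0 / lam
var = 1.0 / lam**2
poisson_residual = mean_residual_life(m1, var)
poisson_selected = mean_selected_interval(m1, var)
```

With these numbers the regular case gives a mean residual life of 2 (half the spacing), while the Poisson case gives 4 (the full mean interarrival time) and a selected interval of mean length 8, twice the typical interval.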

Since the distributions representing the lifetime, $F_X(x) \triangleq P\{X \le x\}$, and the residual life, $F_Y(x) \triangleq P\{Y \le x\}$, are independent, the distribution of the system age can be obtained through the convolution of the two distributions.


Appendix F

LAPLACE TRANSFORM GENERALIZED SOLUTION METHODS

The generalized solution to the reliability problems described in Chapter 5 is based on an approach to computing the Markov reliability matrix using eigenvalues [Wilk65], [Wibe71], [Shoo68], [Buza70]. These methods can be adapted to the solution of systems that have been transformed to the Laplace transform form.

The following assumptions can be made regarding the generalized solution,

§ The underlying system can be represented by a set of n states. These states represent different conditions the system S may assume and are contained in the transition state space matrix A consisting of elements $p_{ij}$.

§ The transition from one state to another is a Markov process [Hoel72].

§ The time in each state is exponentially distributed.

§ The probability of more than one state change during a time increment $\Delta t$ is of the order $o(\Delta t)$, that is, it vanishes faster than $\Delta t$.

The key to the solution of the generalized reliability system is to partition the resulting Markov matrix into an appropriate set of states. Two sub–states can exist: non–absorbing and absorbing. For the absorbing states the following condition is true: $p_{nj} = 0$ for $j = 1, 2, \ldots, n-1$. What is of interest is the probability for a state–change from $S_i$ to $S_j$ during the time $\Delta t$. Although the discrete time equation is useful, the continuous time form will be used in this appendix. By utilizing this form, a set of differential equations is converted into a set of algebraic equations (derived from the Laplace transform) which can be solved using techniques described below.

The continuous time state–space equations corresponding to the probability

transition rates are given by,

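$$\left[ \frac{d}{dt} P_1(t),\ \frac{d}{dt} P_2(t),\ \ldots,\ \frac{d}{dt} P_n(t) \right] = \left[ P_1(t),\ P_2(t),\ \ldots,\ P_n(t) \right] \cdot \begin{bmatrix} -\sum_{j=2}^{n} p_{1j} & p_{12} & \cdots & p_{1n} \\ p_{21} & -\sum_{\substack{j=1 \\ j \ne 2}}^{n} p_{2j} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & -\sum_{j=1}^{n-1} p_{nj} \end{bmatrix} \tag{F.1}$$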

where $P_i(t)$, the $i$th element of the row–vector $[P_1(t), P_2(t), \ldots, P_n(t)]$, is the probability of the system A being in the $i$th state as a function of time. The corresponding matrix differential equations are of the simplified form,

$$\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot \mathbf{A}, \tag{F.2}$$

where $\vec{P}(t)$ is now the general row–vector probability of the system being in a certain state.

The initial condition for this differential equation system is $\vec{P}(0) = P_0$.
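To make Eq. (F.2) concrete, the sketch below integrates the row–vector equation numerically for a hypothetical two–state repairable system (state 1 operating, state 2 failed) and compares the result with the well-known closed-form availability of that system. The failure rate, repair rate, and the fourth-order Runge–Kutta helper are assumptions introduced for illustration; they are not the solution method developed in this appendix.

```python
import math

def vec_mat(p, a):
    """Row vector times matrix: (p * A)_j = sum_i p_i * A_ij."""
    n = len(p)
    return [sum(p[i] * a[i][j] for i in range(n)) for j in range(n)]

def integrate(p0, a, t_end, steps=1000):
    """Classical RK4 integration of dP/dt = P * A from 0 to t_end."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = vec_mat(p, a)
        k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], a)
        k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], a)
        k4 = vec_mat([x + h * k for x, k in zip(p, k3)], a)
        p = [x + h * (c1 + 2 * c2 + 2 * c3 + c4) / 6.0
             for x, c1, c2, c3, c4 in zip(p, k1, k2, k3, k4)]
    return p

lam, mu = 0.01, 0.5                  # hypothetical failure and repair rates
A = [[-lam, lam],                    # each row sums to zero, as in Eq. (F.1)
     [mu, -mu]]
P0 = [1.0, 0.0]                      # system starts in the operating state

t = 10.0
numeric = integrate(P0, A, t)
closed = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
```

The first component of `numeric` matches `closed`, and the components sum to one at every step because the rows of A sum to zero.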


Definition F.1

A recurrent Markov Chain is a chain with a transition matrix A having the

following properties:

§ There is at least one recurrent state.

§ From every state it is possible to reach a recurrent state.

The nonrecurring states are referred to as transient states. The canonical form of the transition matrix can be written with the recurrent states listed first and the nonrecurring states listed last. Suppose there are a total of r transient states and s absorbing states. Then the canonical form for the transition matrix A is [Gave73],

$$\mathbf{A} = \begin{bmatrix} \mathbf{I} & \mathbf{O} \\ \mathbf{R} & \mathbf{Q} \end{bmatrix}.$$

Definition F.2

Each element of the canonical matrix is defined as,

§ The matrix Q with elements $p_{ij}$ is not stochastic, since its row sums satisfy only $\sum_{j \in S} p_{ij} \le 1,\ \forall i \in S$, but is defined as sub–stochastic if $p_{ij} \ge 0$ for all i and j and if $\sum_{j} p_{ij} \le 1$ for all i.

§ The matrix Q by itself does not have the usual probabilistic interpretations. Since the sum of some of the rows is less than one, the entries in Q represent all the transitions among the transient states of the system.

§ The matrix R corresponds to the transition from the transient states to recurrent states. The R matrix is not square and in general does not have an interpretation as a transition matrix from a state space into itself.


§ The matrix I denotes the stochastic matrix corresponding to transitions within the set of recurrent states.

Definition F.3

The fundamental matrix of an absorbing Markov Chain is the matrix,

$$\mathbf{N} = (\mathbf{I} - \mathbf{Q})^{-1} = \mathbf{I} + \mathbf{Q} + \mathbf{Q}^2 + \cdots + \mathbf{Q}^k + \cdots \;\therefore$$
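The series in Definition F.3 converges whenever Q is sub–stochastic with at least one path to absorption from every transient state. The sketch below sums the series for a hypothetical two–state Q and checks that the result is the inverse of $\mathbf{I} - \mathbf{Q}$; the matrix entries and the number of terms are assumptions chosen for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def fundamental_matrix_series(q, terms=400):
    """Partial sum I + Q + Q^2 + ... + Q^terms."""
    n = len(q)
    total = identity(n)
    power = identity(n)
    for _ in range(terms):
        power = mat_mul(power, q)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total

Q = [[0.5, 0.3],      # hypothetical transitions among two transient states;
     [0.2, 0.6]]      # each row sums to less than one (sub-stochastic)

N = fundamental_matrix_series(Q)
I_minus_Q = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(2)]
             for i in range(2)]
check = mat_mul(N, I_minus_Q)     # should reproduce the identity matrix
```

For this Q the closed-form inverse is $\frac{1}{0.14}\begin{bmatrix} 0.4 & 0.3 \\ 0.2 & 0.5 \end{bmatrix}$, which the series reproduces.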

Definition F.4

The Laplace transform can be used for constructing a fundamental matrix for the system described by $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot \mathbf{A}$, where A is an arbitrary $n \times n$ constant matrix. If $\mathbf{f}(t)$ is a vector function with n components defined on $0 \le t < \infty$ and $\mathbf{f} \in \Lambda$, where $\Lambda$ is the class of complex functions, then there exists the Laplace transform $\mathcal{L}\{\mathbf{f}\} = \int_0^{\infty} e^{-st}\, \mathbf{f}(t)\, dt$. Using Theorem B.1 (in the scalar sense) and applying it to the vector case, it follows that if $\phi$ is a solution to $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot \mathbf{A}$, with $\vec{P}(0)$ as the initial conditions vector, then the solution vector $\phi \in \Lambda$ for any vector $\vec{P}(0)$.

This can be shown in the following manner. Let the Laplace transform of the solution to $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot \mathbf{A}$ be given by $\vec{P}(s) = \mathcal{L}(\phi)$. Taking the Laplace transform of both sides of $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot \mathbf{A}$ and using the initial conditions $\vec{P}(0)$ gives a system of equations,

$$s \vec{P}(s) - \vec{P}(0) = \vec{P}(s)\, \mathbf{A}. \tag{F.3}$$


Rearranging Eq. (F.3) (this is possible at this point since Eq. (F.3) is treated as an algebraic equation, rather than a differential equation) results in,

$$\vec{P}(s)\, (s\mathbf{I} - \mathbf{A}) = \vec{P}(0). \tag{F.4}$$

The system in Eq. (F.4) is a linear nonhomogeneous system of n algebraic equations in n unknowns, the components $P_1(s), P_2(s), \ldots, P_n(s)$ of $\vec{P}(s)$. If s is not equal to an eigenvalue of A, that is $\det(s\mathbf{I} - \mathbf{A}) \ne 0$, then Eq. (F.4) can be solved uniquely for $\vec{P}(s)$ in terms of $\vec{P}(0)$ and s by Cramer’s Rule (Eq. (B.2)). Since $\det(s\mathbf{I} - \mathbf{A})$ is a polynomial of degree n, it is clear that $\vec{P}(s)$ is a vector whose components are rational functions of s and linear in the components $(\eta_1, \eta_2, \ldots, \eta_n)$ of $\vec{P}(0)$. Each component of $\vec{P}(s)$ can be expanded in partial fractions (the denominators will be integral powers of $(s - \lambda_j)$, where $\lambda_j$ is an eigenvalue of A). $\vec{P}(s)$ can then be inverted to find the solution $\phi(t)$ corresponding to any initial vector $\vec{P}(0)$.

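Letting $\vec{P}(0)$ take on successive values,

$$\vec{P}_1(0) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \vec{P}_2(0) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad \vec{P}_n(0) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix},$$

the solutions $\phi_1, \phi_2, \ldots, \phi_n$ used as column vectors of the matrix $\Phi$ generate a fundamental matrix of $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot \mathbf{A}$, such that $\Phi(0) = \mathbf{I}$.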


Appendix G

EXPONENTIAL DISTRIBUTION VARIANCE

In the analysis of Mean Time to Failure models in Chapter 2, hard data may not be

available for calculating a traditional mean for the random variable representing

the failure rate λ . What is of interest is the effect of the variance on the failure

rate of the underlying system. If the failure rate is λ , then for a total cumulative

observation time T the number of failures n has a Poisson distribution. The

restatement of Eq. (2.6) is given in a conditional probability form as,

$$P\{n \mid \lambda\} = \frac{(\lambda T)^n}{n!}\, e^{-\lambda T}. \tag{G.1}$$

The parameter of interest is λ and the prior estimate, ( )h λ , of this parameter

will be described by the Gamma distribution, which is the natural conjugate of

the Poisson distribution [Kapu77]. The Gamma prior pdf for λ is given by,

$$h(\lambda) = \frac{\rho^{\delta} \lambda^{\delta - 1} e^{-\rho \lambda}}{\Gamma(\delta)}, \quad \lambda \ge 0,\ \delta \ge 0,\ \rho \ge 0. \tag{G.2}$$

Definition G.1

The failure density function for a Gamma distribution is given by $f(t) = \frac{\lambda^{\eta}}{\Gamma(\eta)}\, t^{\eta - 1} e^{-\lambda t}$, $t \ge 0,\ \eta > 0,\ \lambda > 0$, where $\eta$ is the shape parameter and $\lambda$ is the scale parameter. The failure distribution function is then given by $F(t) = \int_0^t \frac{\lambda^{\eta}}{\Gamma(\eta)}\, \tau^{\eta - 1} e^{-\lambda \tau}\, d\tau$. If $\eta$ is an integer, it can be shown by successive integration by parts that $F(t) = \sum_{k=\eta}^{\infty} \frac{(\lambda t)^k e^{-\lambda t}}{k!}$. Then the reliability function can be given as $R(t) = 1 - F(t) = \sum_{k=0}^{\eta - 1} \frac{(\lambda t)^k e^{-\lambda t}}{k!}$, and the hazard function as

$$z(t) = \frac{f(t)}{R(t)} = \frac{\dfrac{\lambda^{\eta}}{\Gamma(\eta)}\, t^{\eta - 1} e^{-\lambda t}}{\displaystyle\sum_{k=0}^{\eta - 1} \frac{(\lambda t)^k e^{-\lambda t}}{k!}}$$
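The expressions in Definition G.1 can be evaluated directly for integer shape parameters. The following sketch uses hypothetical parameter values; it shows that the hazard is constant and equal to $\lambda$ when $\eta = 1$ (the exponential case) and increases toward $\lambda$ when $\eta > 1$.

```python
import math

def gamma_pdf(t, eta, lam):
    """Failure density f(t) of Definition G.1."""
    return lam**eta * t**(eta - 1) * math.exp(-lam * t) / math.gamma(eta)

def gamma_reliability(t, eta, lam):
    """R(t) for integer eta: finite Poisson sum over k = 0 .. eta - 1."""
    return sum((lam * t)**k * math.exp(-lam * t) / math.factorial(k)
               for k in range(eta))

def gamma_hazard(t, eta, lam):
    """z(t) = f(t) / R(t)."""
    return gamma_pdf(t, eta, lam) / gamma_reliability(t, eta, lam)

lam = 2.0                                # hypothetical rate parameter
exp_hazard = gamma_hazard(1.0, 1, lam)   # eta = 1: constant hazard
early = gamma_hazard(1.0, 3, lam)        # eta = 3, t = 1
late = gamma_hazard(5.0, 3, lam)         # eta = 3, t = 5
```

For these values `early` is 0.8 and `late` is about 1.64, both below the limiting value of 2.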

The posterior pdf for $\lambda$, given n failures over the time interval T, can be found using Bayes’ theorem. That is, the solution to,

$$k(\lambda \mid n) = \frac{h(\lambda)\, P\{n \mid \lambda\}}{f_2(n)}, \tag{G.3}$$

must be found for the parameters $\lambda$ and n. The joint pdf for n and $\lambda$ is found from $f(n, \lambda) = h(\lambda)\, P\{n \mid \lambda\}$, which in the instance of Eq. (G.3) becomes,

$$f(n, \lambda) = \frac{T^n \rho^{\delta}}{\Gamma(\delta)\, \Gamma(n + 1)}\; \lambda^{n + \delta - 1} e^{-\lambda(\rho + T)}, \quad \lambda \ge 0. \tag{G.4}$$

The marginal pdf for n is,

$$f_2(n) = \frac{T^n \rho^{\delta}}{\Gamma(\delta)\, \Gamma(n + 1)} \int_0^{\infty} \lambda^{n + \delta - 1} e^{-\lambda(\rho + T)}\, d\lambda. \tag{G.5}$$

Solving the integral in Eq. (G.5) by letting $u = \lambda(\rho + T)$ results in,

$$f_2(n) = \frac{T^n \rho^{\delta}}{\Gamma(\delta)\, \Gamma(n + 1)\, (\rho + T)^{n + \delta}} \int_0^{\infty} u^{n + \delta - 1} e^{-u}\, du. \tag{G.6}$$

The term under the integral in Eq. (G.6) is the Gamma function resulting in,

$$f_2(n) = \frac{T^n \rho^{\delta}\, \Gamma(n + \delta)}{\Gamma(\delta)\, \Gamma(n + 1)\, (\rho + T)^{n + \delta}}, \quad n = 0, 1, 2, \ldots \tag{G.7}$$

Substituting Eq. (G.4) and Eq. (G.7) into Eq. (G.3) to obtain the posterior distribution for $\lambda$ gives,

$$k(\lambda \mid n) = \frac{(\rho + T)^{n + \delta}}{\Gamma(n + \delta)}\; \lambda^{n + \delta - 1} e^{-\lambda(\rho + T)}, \quad \lambda \ge 0. \tag{G.8}$$

The expression in Eq. (G.8) is recognized as a Gamma pdf with parameters $(\rho + T)$ and $(\delta + n)$. The Bayesian point estimator for $\lambda$ is the mean of the Gamma posterior pdf given in Eq. (G.8) and is,

$$\hat{\lambda}_b = \frac{\delta + n}{\rho + T}.$$
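The update from prior to posterior amounts to adding the observed failures to the shape parameter and the observation time to the rate parameter. A minimal sketch follows, assuming a hypothetical prior ($\delta = 2$, $\rho = 1000$ hours) and hypothetical field data (3 failures in 5000 hours).

```python
def posterior_parameters(delta, rho, n, T):
    """Shape and rate of the Gamma posterior in Eq. (G.8)."""
    return delta + n, rho + T

def bayes_failure_rate(delta, rho, n, T):
    """Bayesian point estimate: mean of the Gamma posterior."""
    shape, rate = posterior_parameters(delta, rho, n, T)
    return shape / rate

prior_mean = 2 / 1000.0                          # delta / rho, per hour
estimate = bayes_failure_rate(2, 1000.0, 3, 5000.0)
classical = 3 / 5000.0                           # n / T, ignoring the prior
```

The estimate, 5/6000 per hour, lies between the classical estimate n/T and the prior mean, and moves toward n/T as the observation time grows relative to $\rho$.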

The upper and lower confidence limits for $\lambda$ can be obtained from the posterior distribution given in Eq. (G.8). The values $\lambda_{upper}$ and $\lambda_{lower}$ respectively define $100(1 - \alpha)\%$ one–sided upper and lower Bayesian confidence limits for $\lambda$. Since the posterior pdf in Eq. (G.8) is Gamma, a simple transformation defined by,

$$z = 2\lambda(\rho + T), \tag{G.9}$$

can be applied to the pdf in Eq. (G.8) to produce a random variable z which is Chi–squared distributed with $2(\delta + n)$ degrees of freedom. Using tabled Chi–squared values results in a set of confidence limits for the variable $\lambda$.

Definition G.2

The exponential probability distribution and the Chi–squared distribution are related in the following manner. Consider an interval of time t to the first failure where $f(t) = \lambda e^{-\lambda t},\ t \ge 0$. The random variable $y = 2\lambda t$ is Chi–squared distributed with two degrees of freedom. This can be shown by defining $dy = 2\lambda\, dt$ and $t = \frac{y}{2\lambda}$. Using the relationship $g(y)\, dy = f\left(h^{-1}(y)\right) \left| \frac{d h^{-1}(y)}{dy} \right| dy$ results in $g(y)\, dy = \frac{1}{2} e^{-y/2}\, dy,\ y \ge 0$, which is a Chi–squared distribution with two degrees of freedom ∴

Considering the transform in Eq. (G.9), the confidence limit can be given as,

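$$P\left\{ \chi^2_{1 - \alpha,\, 2(\delta + n)} \le 2\lambda(\rho + T) \right\} = 1 - \alpha. \tag{G.10}$$

Rearranging the inequality gives,

$$\frac{\chi^2_{1 - \alpha,\, 2(\delta + n)}}{2(\rho + T)} \le \lambda \le \frac{\chi^2_{\alpha,\, 2(\delta + n)}}{2(\rho + T)}, \tag{G.11}$$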

as the confidence bounds on the failure rate random variable λ .
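The bounds in Eq. (G.11) are normally read from Chi–squared tables. As a sketch of the same calculation in code, the block below makes two assumptions that are not stated in the text: $\delta + n$ is an integer, so the Chi–squared tail with an even number of degrees of freedom has a finite closed form, and $\chi^2_{\alpha, \nu}$ denotes the point exceeded with probability $\alpha$ (the upper–tail convention implied by Eq. (G.10)). The prior and field data are hypothetical.

```python
import math

def chi2_upper_tail(x, m):
    """P{chi-squared with 2*m degrees of freedom > x}, m a positive integer."""
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(m))

def chi2_upper_point(alpha, m):
    """Value x with upper-tail probability alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, m) > alpha:    # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_upper_tail(mid, m) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bayes_limits(delta, rho, n, T, alpha):
    """Lower and upper bounds of Eq. (G.11); delta + n must be an integer."""
    m = delta + n
    lower = chi2_upper_point(1.0 - alpha, m) / (2.0 * (rho + T))
    upper = chi2_upper_point(alpha, m) / (2.0 * (rho + T))
    return lower, upper

# Hypothetical prior (delta = 2, rho = 1000 h) and data (3 failures in 5000 h).
lower, upper = bayes_limits(2, 1000.0, 3, 5000.0, 0.05)
point = (2 + 3) / (1000.0 + 5000.0)
```

The point estimate $(\delta + n)/(\rho + T)$ falls between the two bounds, as expected for a Gamma posterior with shape 5.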


B i b l i o g r a p h y

[Ande79] Anderson, T. and Randell, B., Computing Systems Reliability, Cambridge University Press, 1979.

[Ande81] Anderson, T. and Lee, P. A., Fault Tolerant Principles and Practice, Prentice Hall, 1981.

[Ande82] Anderson, T. and Lee, P. A., “Fault Tolerant Terminology Proposals,” Proceedings of the 12th Annual International Symposium on Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 29–33.

[Apos74] Apostolakis, G., “Mathematical Methods of Probabilistic Safety Analysis,” Technical Report, School of Engineering and Applied Science, University of California, Los Angeles, 1974.

[Apos87] Apostolakis, G. and Moieni, P., “The Foundation of Models of Dependence in Probabilistic Safety Assessment,” Reliability Engineering, 18(3), pp. 177–195, 1987.

[Arfk70] Arfken, G., Mathematical Methods for Physicists, Academic Press, 1970.

[Arla85] Arlat, J. and Laprie, J.–C., “On the Dependability Evaluation of High Safety Systems,” Proceedings of the 15th International Symposium on Fault–Tolerant Computing, June 1985, pp. 318–322.

[Arla85] Arlat, J., Croizet, Y. and Laprie, J.–C., “Fault Injection for Dependability Validation of Fault–Tolerant Computing Systems,” Proceedings of the 15th International Symposium on Fault–Tolerant Computing, June 1985, pp. 348–355.

[Arno73] Arnold, T. F., “The Concept of Coverage and Its Effects on the Reliability Model of a Repairable System,” IEEE Transactions on Computers, Volume 22, March 1973, pp. 251–254.

[Arse80] Arsenault, J. E. and Roberts, J. A., Reliability and Maintainability of Electronic Systems, Computer Science Press, Rockville, MD., 1980.

[Ash70] Ash, R. B., Basic Probability Theory, John Wiley & Sons, 1970.

[Atwo86] Atwood, C. L., “The Binomial Failure Rate Common Cause Model,” Technometrics, Volume 28, 1986, pp. 139–148.

[Aviz67] Avizienis, A., “The Design of Fault Tolerant Computers,” AFIPS Conference Proceedings, 1967, 31:733–743.

[Aviz85] Avizienis, A., “Fault–Tolerance and Fault–Intolerance: Complementary Approaches to Reliable Computing,” Proceedings of the International Conference on Reliable Software, in SigPlan Notices, 10(6), June 1995. pp. 458–464.


[Aviz82] Avizienis, A., “The Four–Universe Information System Model for the Study of Fault–Tolerance,” Proceedings of the 12th Annual International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 6–13.

[Aviz75] Avizienis, A., “Fault Tolerant and Fault–Intolerance: Complementary Approaches to Reliable Computing,” Proceedings of the International Conference on Reliable Software in SigPlan Notices, 10(6), June 1975, pp. 458–464.


[Aviz84] Avizienis, A. and Kelly, P. J., “Fault Tolerance by Design Diversity: Concepts and Experiments,” Computer, 17(8), August 1984, pp. 67–80.

[Aviz86] Avizienis, A. and Laprie, J.–C., “Dependable Computing: From Concepts to Design Diversity,” Proceedings of the IEEE, May 1986, pp. 629–638.

[Barl75] Barlow, R. E. and Proschan, F., Statistical Theory of Reliability and Life Testing: Probability Models, Holt, Rinehart and Winston, 1975.

[Barl85] Barlow, R. E. and Singpurwalla, N. D., “Assessing the Reliability of Computer Software and Computer Networks: An Opportunity for Partnership with Computer Scientists,” The American Statistician, 39(2), May 1985, pp. 88–94.

[Barr73] Barrett, L. S., “Reliability Design and Application Considerations for Classical and Current Redundancy Schemes,” Lockheed Missiles and Space Company, Inc., Sunnyvale, California, September 30, 1973.

[Bell65] Bellman, R. and Kalaba, R., Quasilinearization and Nonlinear Boundary–Value Problems, American Elsevier Publishing, 1965.

[Bell87] Bellis, H., “Comparing Analytical Reliability Models to Hard and Transient Failure Data,” Master’s Thesis, Carnegie–Mellon University, Department of Electrical Engineering, 1987.

[Bend71] Bendat, J. S. and Piersol, A. G., Random Data: Analysis and Measurements Procedures, Wiley–Interscience, 1971.

[Bern13] Bernoulli, J., Ars Conjectandi, The main work of James Bernoulli (1654–1705) published in 1713.

[Bern88] Berenson, M. L., Levine, D. M. and Rindskopf, D., Applied Statistics, Prentice Hall, 1988.


[Bhar88] Bharucha–Reid, A. T., Elements of the Theory of Markov Processes and Their Applications, McGraw–Hill, 1988.

[Bhat87] Bhat, U. N., Elements of Applied Stochastic Processes, John Wiley & Sons, 1972.

[Biez87] Beizer, B., Micro–Analysis of Computer System Performance, Van Nostrand Reinhold, 1987.

[Biro86] Birolini, A., “On the Use of Stochastic Processes in Modeling Reliability Problems,” 5th International Conference on Reliability and Maintainability, Biarritz France, October 1986.

[Bish86] Bishop, P. G., “PODS – A Project on Diverse Software,” IEEE Transactions on Software Engineering, SE12(9), 1986, pp. 929–940.

[Brac65] Bracewell, R., The Fourier Transform and Its Applications, McGraw Hill, 1965, pp. 25–50.

[Bray67] Brauer, F., and Nohel, J. A., Ordinary Differential Equations, W. A. Benjamin, 1967.

[Brau83] Brauer, F., Nohel, J. A., and Schneider, H., Linear Mathematics, W. A. Benjamin, 1983.

[Breu76] Breuer, M. A. and Friedman, A. D., Diagnosis and Reliable Design of Digital Systems, Computing Science Press, 1976.

[Brue80] Bruell, S. C., and Balbo, G., Computational Algorithms for Closed Queuing Networks, Elsevier North Holland, 1980.

[Brow80] Browning, R. L., The Loss Rate Concept in Safety Engineering, Marcel Dekker, 1980.

[Boeh76] Boehm, B. W., “Software Engineering,” TRW Software Series TRW–SS–76–08, TRW Systems Engineering and Integration Division, One Space Park, Redondo Beach, CA 1976.

[Borr87] Borrelli and Coleman, C., Differential Equations, A Modeling Approach, Prentice Hall, 1987.

[Boss70] Bossen, D. C., Ostapko, D. L., and Patel, A. M., “Optimum Test Patterns for Parity Networks,” Proceedings AFIPS Fall 1970, Joint Computer Conference, Volume 37, Houston Texas, November, 1970, pp. 63–68.

[Boss82] Bossen, D. C. and Hsiao, M. Y., “Model for Transient and Permanent Error–Detection and Fault–Isolation Coverage,” IBM Journal of Research and Development, 26(1), 1982, pp. 67–77.

[Bour69] Bouricius, W. G., “Reliability Modeling Techniques for Self–Repairing Computer Systems,” Proceedings of the 24th National Conference of the ACM, August 1969, pp. 295–309.

[Born81] Borne, A. J., Edwards, G. T., Hunns, D. M., Poulter, D. R., and Watson, I. A., “Defense Against Common Cause Mode Failures in Redundancy Systems,” Safety and Reliability Directorate, SRD–R196, 1981.

[Buza80] Buzacott, J. A., “Markov Approach to Finding Failure Times of Repairable Systems,” IEEE Transactions on Reliability, Volume R–19, November 1980, pp. 128–134.

[Cast80] Castillo, X., “Workload, Performance, and Reliability of Digital Computing Systems,” Carnegie–Mellon University Technical Report, Computer Science Department, 1980.

[Chan70] Chang, H. Y., Manning, E. G., and Metze, G., Fault Diagnosis of Digital Systems, Wiley–Interscience, 1970.

[Chan72] Chandy, K. M., “Analysis and Solutions of General Queuing Networks,” Proceedings of the Sixth Annual Princeton Conference on Information Sciences and Systems, Princeton University, March 1972.

[Chri88] Christenson, D. A., “Using Software–Reliability Models to Predict Field Failure Rates in Electronic Switching Systems,” Proceedings of the National Security Industrial Association Annual Joint Conference on Software Quality and Reliability, National Security Industrial Association, Washington D. C., 1988.

[Chun76] Chung, K. L., Markov Chains with Stationary Transition Probabilities, 2nd Edition, Springer–Verlag, 1976.

[Clop34] Clopper, C. J. and Pearson, E. S., “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial,” Biometrika, 26, pp. 104, 1934.

[Coch77] Cochran, W. G., Sampling Techniques, 3rd Edition, John Wiley & Sons, 1977.

[Cohe65] Cohen, A. C., “Estimation in Mixtures of Poisson and Mixtures of Exponential Distributions,” NASA Technical Memo, NASA TMX–53245, April 1965.

[Cona72] Conant, R. C., “Detecting Subsystems of a Complex System,” IEEE Transactions on Systems, Man and Cybernetics, Volume SMC–2, Number 4, September 1972, pp. 550–553.

[Cook73] Cook, R. W., Sisson, W. H., Storey, T. F., and Toy, W. N., “Design of a Self–Checking Microprogram Control,” IEEE Transactions on Computers, Volume C–22, March 1973, pp. 255–262.

[Cost78] Costes, A., Landrault, C., and Laprie, J. C., “Reliability and Availability Models for Maintained Systems Featuring Hardware Failures and Design Faults,” IEEE Transactions on Computers, Volume C–27, June 1978, pp. 548–560.

[Cour43] Courant, R. and Hilbert, D., Methods of Mathematical Physics, Interscience Publishers, 1943.

[Cour77] Courtois, P. J., Decomposability: Queuing and Computer System Applications, Academic Press, 1977.

[Cox62] Cox, D. R., Renewal Theory, Methuen & Company, 1962.

[Cox65] Cox, D. R. and Miller, H. D., The Theory of Stochastic Processes, Methuen, 1965.

[Cram66] Cramer, H. and Leadbetter, M., Stationary and Related Stochastic Processes, John Wiley & Sons, 1966.

[Crou82] Crouzet, Y. and Decouty, B., “Measurements of Fault Detection Mechanisms Efficiency: Results,” Proceedings of the 12th Annual International Symposium on Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 373–376.

[Dahl82] Dahbura, A. T. and Masson, G. M., “A New Diagnosis Theory as the Basis of Intermittent–Fault/Transient–Upset Tolerant System Design,” Proceedings of the 12th Annual International Symposium on Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 353–356.

[Damm82] Damm, A., “Experimental Evaluation of Error–Detection and Self–Checking Coverage of Components of a Distributed Real–Time System,” Doctorate Thesis, Technical University of Vienna, October, 1982.

[Dave70] Davenport, W. B. Jr., Probability and Random Processes, McGraw–Hill, 1970.

[Deo74] Deo, N., Graph Theory with Applications to Engineering and Computer Science, Prentice Hall, 1974.

[DoD82] Department of Defense, “Reliability Prediction of Electronic Equipment,” Military Handbook, MIL–HDBK–217D, 15 January 1982.

[Doet61] Doetsch, G., Guide to the Laplace Transform, Van Nostrand, 1961.

[Dona84] Donatiello, L. and Iyer, B. R., “Analysis of a Composite Performance Reliability Measure for Fault–Tolerant Systems,” IBM Research Report RC–10325, Yorktown Heights, New York, 1984.

[Dona84] Donatiello, L. and Iyer, B. R., “Closed–form Solution for System Availability Distribution,” IBM Research Report RC–11169, Yorktown Heights, New York, 1984.

[Drak87] Drake, H. D. and Wolting, D. E., “Reliability Theory Applied to Software Testing,” Hewlett–Packard Journal, April 1987, pp. 35–39.

[Dung84] Dugan, J. B., Trivedi, K. S., Geist, R. M. and Nicola, V. F., “Extended Stochastic Petri Nets: Analysis and Applications,” Performance ’84, Paris, North Holland, December 1984.

[Dung85] Dugan, J. B., Bobbio, A., Ciardo, G. and Trivedi, K. S., “The Design of a Unified Package for the Solution of Stochastic Petri Net Models,” Proceedings of the International Workshop on Timed Petri Nets, Torino Italy, July 1985.

[Eckh85] Eckhardt, D. E. and Lee, L. D., “A Theoretical Basis for the Analysis of Multi–Version Software Subject to Coincident Errors,” IEEE Transactions on Software Engineering, Volume SE–11, Number 12, December 1985, pp. 1511–1517.

[Ehre78] Ehrenberger, W. and Plogert, K., “Statistical Verification of Reactor Protection Software,” Proceedings of the International Symposium on Nuclear Plant Control, Cannes France, April 1978.

[Euri84] Euriger, M. and Reichert, W., “The AS220 EHF Fault–Tolerant and Fail–Safe Automation System with Two–Out–of–Three Redundancy,” Siemens Power Engineering, November/December 1984, pp. 323–327.

[Evan84] Evans, M. G. K., Parry, G. W. and Wreathall, J., “On the Treatment of Common–Cause Failures in Systems Analysis,” Reliability Engineering, 9(2), 1984, pp. 107–115.

[Fell49] Feller, W., “Fluctuation Theory of Recurrent Events,” Transactions of the American Mathematical Society, Volume 67, 1949, pp. 98–119.

[Fell67] Feller, W., An Introduction to Probability Theory and Its Applications, Volume I, 3rd Edition, John Wiley & Sons, 1967.

[Fell71] Feller, W., An Introduction to Probability Theory and Its Applications, Volume II, 3rd Edition, John Wiley & Sons, 1971.

[Flem74] Fleming, K. N., “A Reliability Model for Common Mode Failures in Redundant Safety Systems,” General Atomic Report, GA–13284, 1974.

[Felm85] Fleming, K. N., Mosleh, A. and Deremer, R. K., “A Systematic Procedure for the Incorporation of Common Cause Events into Risk and Reliability Models,” Nuclear Engineering and Design, 93, 1985, pp. 245–279.

[Fry28] Fry, T. C., Probability and Its Engineering Uses, Van Nostrand, 1928.

[Fuss76] Fussell, J., “Fault Tree Analysis–Concepts and Techniques,” in Generic Techniques in Reliability Assessment, Henley, E. and Lynn, J., editors, Noordhoff Publishing Company, Leyden Holland, 1976.

[Gall76] Gallaher, L. E. and Toy, W. N., “Fault–Tolerant Design of the 3B20 Processor,” NCC–81 Proceedings, Chicago, May, 1976.

[Gave73] Gaver, D. P. and Thompson, G. L., Programming and Probability Models in Operations Research, Brooks/Cole Publishing, 1973, pp. 417–426.

[Geis84] Geist, R., Trivedi, K., Dugan, J. B. and Smotherman, M., “Modeling Imperfect Coverage in Fault–Tolerant Systems,” Proceedings IEEE 14th Fault Tolerant Computing Symposium, June 1984, pp. 77–82.

[Glas82] Glaser, R. E. and Masson, G. M., “The Containment Set Approach to Upset Handling in Microprocessor Control Design,” Proceedings 12th Annual International Symposium of Fault–Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 215–222.

[Gkuc83] Gluch, D. P. and Paul, M. J., “Fault Tolerance in Distributed Digital Fly–by–Wire Flight Control Systems,” Digest of the 13th Annual International Symposium on Fault–Tolerant Computing, Milan Italy, June 1983, pp. 121–126.

[Goel79] Goel, A. L. and Okumoto, K., “Time–Dependent Error Detection Rate Model for Software Reliability and Other Performance Measures,” IEEE Transactions on Reliability, August 1979, Volume R–28, pp. 206–211.

[Gree82] Green, A. E. (editor), High Risk Safety Technology, Wiley–Interscience, 1982.

[Guid86] Guidal, C., “Computerized FMECA: An Overview of FMEGEN Program,” 5th International Conference on Reliability and Maintainability, Biarritz France, October 1986.

[Gunn87] Gunneflo, J. B., Johnson, J., Karlsson, S., Lowb, S., and Torin, J., “A Fault Injection System for the Study of Transient Fault Effects on Computer Systems,” Technical Report Number 47, Chalmers University, 1987.

[Gupt66] Gupta, D., Transform and State Variable Methods in Linear Systems, John Wiley & Sons, 1966.

[Hals79] “Commemorative Issue in Honor of Dr. Maurice H. Halstead,” Special Edition of IEEE Transactions on Software Engineering, Volume SE–5, Number 2, March 1979.

[Harr89] Harrington, P., “Applying Customer–Oriented Metrics,” IEEE Software, November 1989, pp. 71–74.

[Heik79] Heikkila, M., “A Model for Common Mode Failures,” Proceedings of the Second National Reliability Conference, 1979.

[Heis84] Heising, C. D. and Guey, C. H., “A Comparison of Methods for Calculating System Unavailability Due to Common Cause Failures: The Beta Factor and Multiple Dependent Failure Fraction Methods,” Reliability Engineering, 8(2), 1984, pp. 106–116.

[Heiv80] Heivik, B. E., “Periodic Maintenance, on the Effect of Imperfectness,” Digest of Papers of the 10th International Symposium of Fault–Tolerant Computing, IEEE Computer Society Press, 1980, pp. 204–206.

[Hene81] Henley, E. J. and Kumamoto, H., Reliability Engineering and Risk Assessment, Prentice Hall, 1981.

[Hild56] Hildebrand, F. B., Introduction to Numerical Analysis, McGraw–Hill, 1956.

[Hoel62] Hoel, P. G., Introduction to Mathematical Statistics, John Wiley & Sons, 1962.

[Hoel72] Hoel, P. G., Port, S. C. and Stone, C. J., Introduction to Stochastic Processes, Houghton Mifflin, 1972.

[Howa71] Howard, R. A., Dynamic Probabilistic Systems: Volume I (Markov Models) and Volume II (Semi–Markov and Decision Processes), John Wiley & Sons, 1971.

[HSE87] Health and Safety Executive, Programmable Electronic Systems in Safety Related Applications, Rimington, J. D., Director General, Her Majesty’s Stationery Office (HMSO), London, 1987.

[Hump87] Humphreys, R. A., “Assigning a Numerical Value to the Beta Factor Common Cause Evaluation,” Proceedings of Reliability ’87, Paper 2C, 1987.

[Ingl77] Ingle, A. D. and Siewiorek, D. P., “Reliability Models for Multiprocessor Systems With and Without Maintenance,” Proceedings Fault–Tolerant Computing Symposium, 1977, pp. 3.

[Iser76] Isermann, R., Digital Control Systems, Springer–Verlag, 1977.

[Issa76] Issacson, D. L. and Madsen, R. W., Markov Chains Theory and Applications, John Wiley & Sons, 1976.

[Iyer80] Iyer, R. K., “A Study of the Effect of Uncertainty in Failure Rate Estimation of System Reliability,” Digest of Papers of the 10th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, 1980, pp. 219–224.

[Iyer84] Iyer, B. R., Donatiello, L. and Heidelberger, P., “Analysis of Performability of Stochastic Models of Fault Tolerant Systems,” IBM Research Report RC 10719, Yorktown Heights, New York, 1984.

[Iyer85] Iyer, B. R. and Velardi, P., “Hardware–Related Software Errors,” IEEE Transactions on Software Engineering, Volume SE–11, February, 1985.

[Jame67] James, M. L., Smith, G. M. and Wolford, J. C., Applied Numerical Methods for Digital Computation with Fortran, International Textbook Company, 1967.

[Jame74] James, L. E., Sheffield, T. S. and Plein, K. M., Study of Reliability Prediction Techniques for Conceptual Phases of Development, Final Report, Rome Air Development Center, RADC–TR–74–235, October, 1974.

[Jeli72] Jelinski, Z. and Moranda, P. B., “Software Reliability Research,” Statistical Computer Performance Evaluation, edited by W. Freiberger, Academic Press, 1972, pp. 465–484.

[John85] Johnson, B. W. and Aylor, J. H., “Reliability and Safety Analysis of a Fault–Tolerant Controller,” IEEE Transactions on Reliability, Volume R–35, Number 4, October 1985, pp. 355–362.

[John86] Johnson, B. W. and Julich, P. M., “Fault–Tolerant Computer Systems for the A129 Helicopter,” IEEE Transactions on Aerospace and Electronic Systems, Volume AES–21, Number 2, March 1985, pp. 220–229.

[Jury64] Jury, E. I., Theory and Application of the z–Transform Method, John Wiley and Sons, 1964.

[Kapur77] Kapur, K. C. and Lamberson, L. R., Reliability in Engineering Design, John Wiley & Sons, 1977.

[Kell88] Kelly, J. P., Eckhardt, D. E., et al, “A Large Scale Second Generation Experiment in Multi–Version Software: Description and Early Results,” Proceedings 18th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, June 1988, pp. 9–14.

[Keme60] Kemeny, J. G. and Snell, J. L., Finite Markov Chains, D. Van Nostrand, 1960.

[Kend50] Kendall, D. G., “Some Problems in the Theory of Queues,” Journal Royal Statistical Society, Series B 13, 1950, pp. 151–185.

[Kend53] Kendall, D. G., “Stochastic Processes Occurring in the Theory of Queues and Their Analysis by the Method of Imbedded Markov Chain,” Annals of Mathematical Statistics, 24, 1953, pp. 338–354.

[Kend61] Kendall, M. G., and Stuart, A., The Advanced Theory of Statistics, Volume 2, Inference and Relationships, Hafner Publishing, 1961.

[Kend77] Kendall, M. G. and Stuart, A., The Advanced Theory of Statistics, Volume 1, Distribution Theory, 4th Edition, MacMillan Publishing Company, 1977.

[Khin60] Khinchin, A. J., Mathematical Methods in the Theory of Queuing, Griffin, London, 1960.

[Kimb60] Kimball, B. F., “On the Choice of Plotting Positions on Probability Paper,” Journal of the American Statistical Association, Volume 55, September, 1960.

[Kim86] Kim, K. H., “A Scheme for Coordinated Execution of Independently Designed Recoverable Distributed Processes,” Proceedings 16th International Symposium of Fault–Tolerant Computing, Computer Society Press, 1986, pp. 130–135.

[King69] Kingman, J. F. C., “Markov Population Processes,” Journal of Applied Probability, 6, 1969, pp. 1–18.

[Kirr86] Kirrmann, H., “Fault–Tolerant Issues in the Design of a Highly Available High–Speed Controller for HVDC Transmission,” Proceedings 16th International Symposium on Fault–Tolerant Computing, Computer Society Press, 1986, pp. 184–189.

[Klie75] Kleinrock, L., Queuing Systems, Volume 1: Theory, John Wiley & Sons, 1975.

[Knig86] Knight, J. C. and Leveson, N. G., “An Experimental Evaluation of the Assumption of Independence in Multiversion Programming,” IEEE Transactions on Software Engineering, Volume SE–12, Number 1, January 1986, pp. 96–109.

[Koba78] Kobayashi, H., Modeling and Analysis: An Introduction to System Performance Evaluation Methodology, Addison Wesley, 1978.

[Kraf81] Kraft, G. D. and Toy, W. D., Microprogrammed Control and Reliable Design of Small Computers, Prentice Hall, 1981.

[Krey72] Kreyszig, E., Advanced Engineering Mathematics, John Wiley & Sons, 1972.

[Kueh69] Kuehn, R. E., “Computer Redundancy: Design, Performance, and Future,” IEEE Transactions on Reliability, Volume R–18, Number 1, February 1969, pp. 3–11.

[Kulk84] Kulkarni, V. G., Nicola, V. F., and Trivedi, K. S., “A Unified Model for Performance and Reliability of Fault–Tolerant / Multi–Modal Systems,” CS–1984–12, Department of Computer Science, Duke University.

[Lala83] Lala, J. H., “Fault Detection, Isolation, and Reconfiguration in FTMP: Methods and Experimental Results,” Proceedings AIAA/IEEE Avionics Systems Conference, November 1983, pp. 21.3.1–21.3.9.

[Lamp82] Lamport, L., Shostak, R., and Pease, M., “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems, Volume 4, Number 3, July 1982, pp. 382–401.

[Lapr84] Laprie, J., “Dependable Computing and Fault–Tolerance: Concepts and Terminology,” Proposal to the IFIP WG 10.4 Summer 1984 Meeting, Kissimmee Florida, June 16–19, 1984.

[Lars81] Larsen, R. L. and Marx, M. L., Introduction to Mathematical Statistics and Its Applications, Prentice Hall, 1981.

[Lath65] Lathi, B. P., Signals and Communication, John Wiley & Sons, 1965.

[Lawr68] Lawrence, W. A., “Proposed Methods for Evaluation of Current Trouble Location Manuals – Case 36279–133,” Internal Memo, Bell Telephone Laboratories, December 9 1968.

[Lee80] Lees, F. P., Loss Prevention in the Process Industries, Butterworth, 1980.

[Lehm62] Lehman, R. S., “Dynamic Programming and Gaussian Elimination,” Journal of Mathematical Analysis and Applications, Volume 5, pp. 1–16, 1962.

[Leve83] Leveson, N. G. and Harvey, P. R., “Analyzing Software Safety,” IEEE Transactions on Software Engineering, Volume SE–9, Number 5, September 1983, pp. 569–579.

[Leve86] Leveson, N. G., “Software Safety: Why, What, and How,” ACM Computing Surveys, June 1986, pp. 125–163.

[Leve86a] Levendal, Y., “Quality and Reliability Estimation for Large Software Projects Using a Time–dependent Model,” Proceedings of COMPSAC87, Tokyo Japan, October 1987, pp. 340–346.

[Leve89] Levendal, Y., “Quality and Reliability Prediction: A Time–Dependent Model with Controllable Testing Coverage and Repair Intensity,” Proceedings of the 4th Israel Conference on Computer Systems and Software Engineering, Tel–Aviv, June 1989.

[Lill69] Lilliefors, H. W., “On the Kolmogorov–Smirnov Test for the Exponential Distribution with Mean Unknown,” Journal of the American Statistical Association, 64, 1969, pp. 387–389.

[Lipo79] Lipow, M., “Prediction of Software Failures,” TRW–SS–79–05, TRW Systems and Integration Division, One Space Park, Redondo Beach, California, 1979.

[Litt73] Littlewood, B. and Verrall, J. L., “A Bayesian Reliability Growth Model for Computer Software,” Applied Statistics, Volume 22, 1973, pp. 332–346.

[Litt75] Littlewood, B., “A Reliability Model for Markov Structured Software,” Proceedings of the International Conference on Reliable Software, SIGPLAN Notices, Volume 10, Number 6, June 1975, pp. 204–207.

[Lome88] Lomen, D. and Mark, J., Differential Equations, Prentice Hall, 1988.

[Low72] Low, T. A. W. and Noltingk, B. E., “Quantitative Aspects of Reliability in Process–Control,” Proceedings of the I.E.E., I.E.E. Reviews, Volume 119, Number 8R, August, 1972.

[Luen79] Luenberger, D., Introduction to Dynamic Systems, John Wiley & Sons, 1979.

[Lyon62] Lyons, R. E. and Vanderkulk, W., “The Use of Triple–Modular Redundancy to Improve Computer Reliability,” IBM Journal of Research and Development, Volume 6, April 1962, pp. 200–209.

[Mall78] Mallela, S. and Masson, G. M., “Diagnosable Systems for Intermittent Faults,” IEEE Transactions on Computers, Volume C–27, June 1978, pp. 306–366.

[Maka81] Makam, S. V. and Avizienis, A., “Modeling and Analysis of Periodically Renewed Closed Fault–Tolerant Systems,” Proceedings of the Eleventh Annual International Symposium on Fault–Tolerant Computing, June 1981, pp. 134–141.

[Mank77] Mankamo, T., “Common Load Method, A Tool for Common Cause Failure Analysis,” Sahkotekniikan Laboratorio, Technical Research Centre of Finland, Tiedonanto 31, 1977.

[Mann71] Mann, N. R., Fertig, K. W., and Scheuer, E. M., Confidence and Tolerance Bounds and a New Goodness–of–Fit for Two Parameter Weibull or Extreme Value Distribution (with Tables for Censored Sample Sizes), Aerospace Research Laboratories, Wright–Patterson Air Force Base, Ohio, ARL 71–0077, Contract No. F33(615)–70–C–1216, May 1971.

[Mark07] Markov, A. A., “Extensions of the Limit Theorems of Probability Theory to a Sum of Variables Connected in a Chain,” The Notes of the Imperial Academy of Sciences of St. Petersburg, VIII Series, Physio–Mathematical College, Volume XXII, No. 9, December 5, 1907.

[Math70] Mathur, F. P. and Avizienis, A., “Reliability Analysis and Architecture of a Hybrid Redundant Digital System: Generalized Triple Modular Redundancy with Self–Repair,” AFIPS Conference Proceedings, Spring Joint Computer Conference, Volume 36, 1970, pp. 375–383.

[Mats88] Matsumoto, K., Inoue, K., Kikuno, T., and Torii, K., “Experimental Evaluation of Software Reliability Growth Models,” Proceedings of the 18th International Symposium on Fault–Tolerant Computing, June 1988, pp. 148–153.

[Maye72] Mayeda, W., Graph Theory, Wiley–Interscience, 1972.

[McMl86] McCluskey, E. J., Logic Design Principles with Emphasis on Testable Semi–Custom Circuits, Prentice Hall, 1986.

[McCo79] McConnell, S. R., Siewiorek, D. P., and Tsao, M. M., “The Measurement and Analysis of Transient Errors in Digital Computer Systems,” Digest Ninth International Fault–Tolerant Computing Symposium, IEEE Computer Society Press, 1979, pp. 67–70.

[McGo83] McGough, J., “Effects of Near–Coincident Faults in Multiprocessor Systems,” Proceedings AIAA/IEEE Digital Avionics Systems Conference, November 1983, pp. 16.6.1–16.6.7.

[McGo85] McGough, J., Smotherman, M., and Trivedi, K. S., “The Conservativeness of Reliability Estimates Based on Instantaneous Coverage,” IEEE Transactions on Computers, Volume C–34, July 1985, pp. 602–609.

[Mels78] Melsa, J. L. and Cohen, D. L., Decision and Estimation Theory, McGraw Hill, 1978.

[Meno63] Menon, M. V., “Estimation of the Shape and Scale Parameters of the Weibull Distribution,” Technometrics, Volume 5, Number 2, May 1963.

[Midd46] Middlemiss, R. R., Differential and Integral Calculus, McGraw–Hill, 1946.

[MIL1629] MIL–STD–1629A, Procedures for Performing a Failure Mode and Effects Analysis.

[Mell77] Melliar–Smith, P. M. and Randell, B., “Software Reliability: The Role of Programmed Exception Handling,” Proceedings of Conference on Language Design for Reliable Software, SIGPLAN Notice, Volume 12(3), March 1977, pp. 95–100.

[McLac39] McLachlan, N. W., Complex Variable and Operational Calculus, Cambridge University Press, 1939.

[Meye82] Meyer, J. F., “Closed Form Solutions of Performability,” IEEE Transactions on Computers, July 1982, pp. 648–657.

[Morg78] Morganti, M., Coppadoro, G., and Ceru, S., “UDET 7116–Common Control for PCM Telephone Exchange: Diagnostic Software Design and Availability Evaluation,” Digest of the Eighth International Fault–Tolerant Computing Symposium, IEEE Computer Society, Toulouse, France, 1978, pp. 16–23.

[Musa75] Musa, J. D., “A Theory of Software Reliability and Its Applications,” IEEE Transactions on Software Engineering, SE–1, September 1975, pp. 312–327.

[Musa87] Musa, J. D., Iannino, A., and Okumoto, K., Software Reliability: Measurement, Prediction, Application, McGraw–Hill, 1987.

[Musa89] Musa, J., “Faults, Failures, and a Metrics Revolution,” IEEE Software, March 1989, pp. 85–91.

[Myer64] Myers, R., Wong, K. and Gordy, H., Reliability Engineering for Electronic Systems, John Wiley & Sons, 1964.

[Naka75] Nakagawa, T. and Osaki, S., “The Discrete Weibull Distribution,” IEEE Transactions on Reliability, R–24, Number 5, December 1975, pp. 300–301.

[Naka79] Nakagawa, T., “Optimum Policies when Preventive Maintenance is Imperfect,” IEEE Transactions on Reliability, Volume R–28, Number 4, October 1979.

[Naka81] Nakagawa, T., Yasui, K., and Osaki, S., “Optimum Maintenance Policies for a Computer System with Restart,” Proceedings of the Eleventh Annual International Symposium on Fault–Tolerant Computing, June 1981, pp. 148–150.

[Ng76] Ng, Y–W., “Reliability Modeling and Analysis for Fault Tolerant Computers,” Ph. D Dissertation, Computer Science Department, University of California, Los Angeles, 1976.

[Ng77] Ng, Y–W. and Avizienis, A., “A Reliability Model for Gracefully Degrading and Repairable Fault–Tolerant Systems,” Proceedings of Fault–Tolerant Computing Symposium, 1977, pp. 22.

[NSCC80] NSCC/PATE Guidebooks: Volume IIA – Nuclear Safety Cross–Check Analysis and Technical Evaluation Process, SED–80204–1, Logicon, San Pedro, California, 1980.

[Oda81] Oda, Y., Tohma, Y., and Furuya, K., “Reliability and Performance Evaluation of Self–Reconfigurable Systems with Periodic Maintenance,” Proceedings of the Eleventh Annual International Symposium on Fault–Tolerant Computing, June 1981, pp. 142–147.

[Ogat70] Ogata, K., Modern Control Engineering, Prentice Hall, 1970.

[Ohba84] Ohba, M., “Software Reliability Analysis Models,” IBM Journal of Research and Development, July 1984, Volume 28, Number 4, pp. 259–265.

[Ossf80] Ossfeldt, B. and Jonsson, I., “Recovery and Diagnostics in the Central Control of the AXE Switching System,” IEEE Transactions on Computers, June 1980, pp. 482–491.

[Ozak88] Ozaki, B. M., Fernandez, E. B., and Gudes, E., “Software Fault Tolerance in Architectures with Hierarchical Protection Levels,” IEEE Micro, August, 1988, pp. 30–43.

[Palm43] Palm, C., “Intensitätsschwankungen im Fernsprechverkehr,” Ericsson Technics, Volume 44, 1943, pp. 1–89.

[Papo65] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw–Hill, 1965.

[Papo65a] Papoulis, A., “Markoff and Wide–Sense Markoff Sequences,” Proceedings of the IEEE, October 1965.

[Parz60] Parzen, E., Modern Probability Theory and Its Applications, John Wiley & Sons, 1960.

[Parz62] Parzen, E., Stochastic Processes, Holden Day, 1962.

[Pipe63] Pipes, L. A., Matrix Methods of Engineering, Prentice Hall, 1963.

[Pois37] Poisson, Siméon D., Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile, Précédées des Règles Générales du Calcul des Probabilités, appearing in 1837.

[Rals65] Ralston, A., A First Course in Numerical Analysis, McGraw–Hill, 1965.

[Rama79] Ramamoorthy, C. V., Bastani, F. B., Favaro, J. M., Mok, Y. R., Nam, C. W., and Suzuki, K., “A Systematic Approach to the Development and Validation of Critical Software for Nuclear Power Plants,” Proceedings of the 4th International Conference on Software Engineering, 1979, pp. 231–240.

[Rams79] Ramshaw, L. H., “Formalizing the Analysis of Algorithms,” Stanford University Computer Science Department Technical Report, STAN–CS–79–741, 1979.

[Rand75] Randell, B., “System Structures for Software Fault Tolerance,” IEEE Transactions on Software Engineering, Volume SE–1, Number 3, June 1975, pp. 221–232.

[Rea78] Research and Education Association, The Differential Equations Problem Solver, Research and Education Association, 1978.

[Robi82] Robison, A. S., “A User Oriented Perspective of Fault–Tolerant System Models and Terminologies,” Proceedings 12th Annual International Symposium Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 22–28.

[Ross83] Ross, S., Stochastic Processes, John Wiley & Sons, 1983.

[Ross70] Ross, S., Applied Probability Models with Optimization Applications, Holden–Day, 1970.

[Rous79] Rouse, W. B. and Rise, S. H., “Measures of Complexity of Fault Diagnosis Tasks,” IEEE Transactions on Systems, Man, and Cybernetics, Volume SMC–9, Number 11, November 1979, pp. 720–727.

[Rouq86] Rouquet, J. C. and Traverse, P., “Safe and Reliable Computing on Board the Airbus and ATR Aircraft,” Proceedings Fifth International Workshop on Safety of Computer Control Systems, SAFECOMP 86, Sarlat, France, 1986, pp. 93–97.

[RTCA85] RTCA, Radio Technical Commission for Aeronautics, “Software Considerations in Airborne Systems and Equipment Certification,” Technical Report DO–178A, Washington D. C., March 1985. RTCA Secretariat, One McPherson Square, 1425 K Street, N. W., Suite 500, Washington D. C., 20005.

[Saat65] Saaty, T. L., “Stochastic Network Flow: Advances in Networks of Queues,” Proceedings Symposium Congestion Theory, University of North Carolina Press, 1965, pp. 86–107.

[Schi78] Schick, G. K. and Wolverton, R. W., “An Analysis of Competing Software Reliability Models,” IEEE Transactions on Reliability, R–32.

[Schu86] Schuette, M. A., Shen, J. P., Siewiorek, D. P., and Zhu, Y. X., “Experimental Evaluation of Two Concurrent Error Detection Schemes,” Proceedings of the 16th International Symposium on Fault–Tolerant Computing, July 1986, pp. 138–143.

[Sega88] Segall, Z., et al., “FIAT – Fault Injection Based Automated Testing Environment,” Proceedings of the 18th International Symposium on Fault–Tolerant Computing, June 1988, pp. 102–107.

[Seth77] Seth, S. C. and Kodandapani, K. L., “Diagnosis of Faults in Linear Tree Network,” IEEE Transactions on Computers, Volume C–26, Number 1, January 1977, pp. 29–33.

[Shin86] Shin, K. G. and Lee, Y. H., “Measurement and Application of Fault Latency,” IEEE Transactions on Computers, Volume C–35, April 1986, pp. 370–375.

[Shoo68] Shooman, M. L., Probabilistic Reliability: An Engineering Approach, McGraw–Hill, 1968.

[Shoo73] Shooman, M. L., “Operational Testing and Software Reliability Estimation During Program Development,” in Record, IEEE Symposium on Computer Software Reliability, 1973, pp. 51–57.

[Siew82] Siewiorek, D. P. and Swarz, R. S., The Theory and Practice of Reliable System Design, Digital Press, 1982.

[Siew84] Siewiorek, D. P., “Architecture of Fault–Tolerant Computers,” Computer, Volume 17, Number 8, August 1984, pp. 9–17.

[SINT88] SINTEF, “Reliability Evaluation of Safety System Configurations,” The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology, Report Number STF75 F88002, 1988.

[SINT89] SINTEF, “Reliability Assessment of TRICON for Boiler Management Applications,” The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology, Report Number DRAFT–890914, 1989.

[Smai49] Smail, L. L., Calculus, Appleton–Century–Crofts, 1949.

[Smit58] Smith, W. L., “Renewal Theory and Its Ramifications,” Journal of the Royal Statistical Society, Series B, 20, pp. 243–302, 1958.

[Smit85] Smith, D. J., Reliability and Maintainability in Perspective, MacMillan, 1985.

[Sned80] Snedecor, G. W. and Cochran, W. G., Statistical Methods, 7th Edition, Iowa State University Press, 1980.

[Snyd75] Snyder, D. L., Random Point Processes, John Wiley & Sons, 1975.

[Soma86] Somani, A. and Agarwal, A. V., “On Complexity of Diagnosability and Diagnosis Problems in System–Level Diagnosis,” Proceedings 16th Annual International Symposium on Fault–Tolerant Computing Systems, IEEE Press, pp. 232–237, July 1986.

[Sosn86] Sosnowski, J., “Evaluation of Transient Hazards in Microprocessor Controllers,” Proceedings 16th Annual International Symposium on Fault Tolerant Computing, IEEE Press, July 1986, pp. 364–369.

[Stif80] Stiffler, J. J., “Robust Detection of Intermittent Faults,” Proceedings 10th Annual International Symposium of Fault Tolerant Computing, IEEE Press, October 1980, pp. 216–218.

[Thay78] Thayer, T. A., Lipow, M. and Nelson, E. C., Software Reliability – A Study of Large Project Reality, TRW Series on Software Technology, Volume 2, Elsevier North Holland, 1978.

[Thom69] Thoman, D. R., Bain, L. J. and Antle, C. E., “Inferences on the Parameters of the Weibull Distribution,” Technometrics, Volume 7, Number 4, November 1969.

[Thom81] Thomas, J. C. and Leveson, N. G., “Applying Existing Safety Design Techniques to Software Safety,” Technical Report Number 180, University of California, Irvine, 1981.

[Thor26] Thorndike, F., “The Applications of Poisson’s Probability Summation,” The Bell System Technical Journal, Volume 5 (1926), pp. 604–624.

[Toy78] Toy, W. N., “Fault–Tolerant Design of Local ESS Processors,” Proceedings of the IEEE, October 1978, pp. 1126–1145.

[Toy86] Toy, W. N. and Zee, B., “Fault Tolerant Computing,” in Computer Hardware/Software Architecture, Prentice Hall, 1986, pp. 337–392.

[Toy87] Toy, W. N., “Dual Versus Triplication Reliability Estimations,” AT&T Technical Journal, November/December, 1987, Volume 66, Issue 6, pp. 15–20.

[Triv74] Trivedi, A. K., “Reliability Estimation and Prediction,” Reading Report, Department of EE/EP, Polytechnic Institute of New York, June 1975.

[Triv75] Trivedi, A. K., “Computer Software Reliability: Many State Markov Techniques,” Ph.D. Dissertation, Department of Electrical Engineering, Polytechnic Institute of New York, June 1975.

[Triv75a] Trivedi, A. K. and Shooman, M. L., “A Many–State Markov Model for the Estimation and Prediction of Computer Software Performance Parameters,” Proceedings of the International Conference on Reliable Software, SIGPLAN Notices, Volume 10, Number 6, June 1975, pp. 208–215.

[Triv82] Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice Hall, 1982.

[TÜV86] TÜV Study Group on Computer Safety, Microcomputers in Safety Technique, an Aid to Orientation for Developer and Manufacturer, Technischer Überwachungs–Verein Bayern e. V., Verlag TÜV Rheinland, 1986.

[UKAE88] United Kingdom Atomic Energy Authority, “A Reliability Analysis of the Relay Logic for a Burner Control and Safety System in a Boiler Installation,” Safety and Reliability Directorate Report SRS/ASG/31610/2, 1988.

[Vses77] Vesely, W. E., “Estimating Common Cause Failure Probabilities in Reliability and Risk Analysis: Marshall–Olkin Specializations,” in Nuclear Systems Reliability Engineering and Risk Assessment, editors, Fussell, J. B. and Burdick, G. R., Society for Industrial and Applied Mathematics, Philadelphia, 1977, pp. 314–341.

[Wall85] Waller, R. A., “A Brief Survey and Comparison of Common Cause Failure Analysis,” NUREG/CR–4314, Los Alamos National Laboratory, 1985.

[WASH75] WASH–1400, Reactor Safety Study: Appendix IV: Common Mode Failures–Bounding Techniques and Special Techniques, NUREG/75/014, US Nuclear Regulatory Commission, Washington, D. C., 1975.

[Wats79] Watson, I. A. and Edwards, G. T., “A Study of Common Cause Mode Failures,” United Kingdom Atomic Energy Authority (UKAEA), Report SRD–R146, Safety and Reliability Directorate, 1979.

[Whit69] White, J. S., “The Moments of Log–Weibull Order Statistics,” Technometrics, Volume 11, Number 2, May 1969.

[Whit82] Whittingham, R. B., “Air: Ammonia Ratio on a Nitric Acid Plant, Associate Member’s Overview of the SRS Study,” United Kingdom Atomic Energy Authority, Systems Reliability Service, SRS/GR/54, 1982.

[Wibe71] Wiberg, D. M., State Space and Linear Systems, Schaum’s Outline Series, McGraw Hill, 1971.

[Widd46] Widder, D. V., The Laplace Transform, Princeton University Press, 1946.

[Wilk65] Wilkinson, J. H., The Algebraic Eigenvalue Problem, Oxford University Press, 1965.

[Wood79] Woodfield, S. N., “An Experiment on Unit Increase in Problem Complexity,” IEEE Transactions on Software Engineering, Volume SE–5, Number 2, 1979, pp. 76–79.

[Yama67] Yamane, T., Elementary Sampling Theory, Prentice Hall, 1967.

[Yama85] Yamada, S., Ohba, M., and Osaki, S., “S–Shaped Reliability Growth Modeling: Models and Applications,” IEEE Transactions on Reliability, December 1983, Volume R-32, pp. 475–478.
