[American Institute of Aeronautics and Astronautics 8th Computing in Aerospace Conference -...

AIAA-91-3791-CP

A MARKOV MODEL REDUCTION TECHNIQUE FOR FAULT TOLERANT PROCESSOR RELIABILITY ANALYSIS

Andrei L. Schor and Gene Rosch

Technical Staff, The Charles Stark Draper Laboratory, Inc.

Cambridge, Massachusetts

Abstract

A fault tolerant processor (FTP) plays a key role in many high performance, safety-critical control system applications. Realistic modeling of an FTP is crucial to gaining a high degree of confidence in the reliability and safety analysis of such a system. While fidelity is clearly a major consideration, a practical model must also be kept to a moderate size to allow its incorporation into the overall system model. This paper presents a systematic reduction technique that starts from a complex, detailed model of a triple redundant FTP and produces a low order approximation of very high accuracy. The existence of two distinct time scales is the key to the success of the technique. No eigenvalue solution or coordinate transformation is needed. The reduced model captures all the important features of the detailed model, is amenable to an analytical solution and provides insight into the reconfiguration behavior of an FTP.

Introduction

There is an increasing demand for fault tolerant control systems in high performance, critical applications. At the core of such systems, there are one or more fault tolerant processors (FTPs). The FTP receives information from sensors, sends commands to actuators and performs redundancy management. Given its critical role in any control system, the operation and performance of the FTP must be very carefully modeled if the safety and performance evaluation process is to be relied upon.

An FTP may contain computing elements, dedicated and/or shared memories, information replicating components, I/O-dedicated electronics, etc. The key feature of an FTP is the ability to handle faults in a controlled, timely manner that enables correct reconfiguration and uninterrupted (on the time scale of interest) operation. An accurate model must be able to describe in detail rate processes such as component failures, fault detection, fault isolation and reconfiguration.

The Markov modeling method has clearly emerged as the preferred approach with regard to the analysis of such processing systems. The ability of this approach to correctly capture rate processes and event sequence dependencies is of crucial importance. While discrete simulation approaches might be perfectly adequate from the point of view of modeling flexibility, computational efficiency strongly tilts the balance towards the Markovian approach. This is so because, for such highly reliable systems, the component failure rates are very low. Consequently, the fault occurrence event rate is so low as to require a prohibitively large number of trials in order to accumulate statistically meaningful and reasonably accurate results.
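The cost of simulating rare events can be made concrete with a rough estimate. The sketch below assumes a crude (unweighted) Monte Carlo estimator of a loss probability, for which the relative standard error is sqrt((1 - p) / (N p)); the function name and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def trials_needed(p_event, rel_error):
    """Approximate number of independent trials needed so that a crude
    Monte Carlo estimate of the probability p_event has the requested
    relative standard error (binomial estimator: sqrt((1 - p) / (N * p)))."""
    return math.ceil((1.0 - p_event) / (p_event * rel_error ** 2))

# Hypothetical target: a loss probability of 1e-7 estimated to within 10%.
example_trials = trials_needed(1.0e-7, 0.10)
```

For a loss probability of the order of 1e-7, the required number of trials is of the order of a billion, which is the kind of burden the Markov approach avoids.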

The Markov method, however, suffers from a major drawback. The number of states proliferates rapidly, often leading to an intractably large model. The FTP proper represents a system of moderate size, such that even a rather detailed model does not generally pose a major problem. However, when attempting to analyze the entire control system of which the FTP is a part, maintaining the level of detail desirable for the FTP alone would most likely give rise to an unwieldy, perhaps even intractable model. Techniques such as aggregation, truncation and decomposition are used to mitigate this basic difficulty. Still, it would be highly desirable to devise a simplified model of the FTP which would greatly contribute to alleviating the state space explosion problem while correctly preserving the main features of the detailed model.

The purpose of this paper is to describe the development of such an approximate model. First a detailed Markov model of a triply redundant FTP is introduced. The model reduction technique is then presented, leading to an excellent approximate model for this FTP. A complete analytical solution of the reduced model follows. To get a feel for the approximations involved, the procedure is shown applied to a simple example, for which analytical solutions are feasible for both the exact and reduced models. The paper ends with a few concluding remarks regarding the reduction technique presented.

Copyright © 1991 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.

Detailed Triplex FTP Markov Model

A detailed Markov model for a quad FTP has been described by Harper et al. (1). The model tracks separately, within a computational channel, the processor element, the associated dedicated memory and the replicating/broadcasting interstage. Both permanent and transient failures are accounted for, along with the appropriate reconfiguration mechanism. Provisions are made for including common mode failures as well. This model provides a realistic paradigm of the fault occurrence/fault handling processes.

In many critical control applications, a triplex FTP is quite satisfactory, often representing an optimum design solution accounting for competing factors such as reliability, cost, weight, power, etc. The approach used in constructing the quad FTP model was used to generate a Markov model for a similar, "Draper-style" triplex FTP. In order to further simplify the description of the model reduction method, some additional assumptions were made:

- the processor and its associated memory in one channel were treated as one component,

- the triplex coverage was assumed perfect and

- the common mode failure was disregarded.

The Markov model for the triplex FTP, based on these assumptions, is shown in Figure 1. The notation used for the state transition rates is the following:

- $\lambda^{n}_{ab(,sl)}$ is the failure rate for component a (where a = p for the processor and a = i for the interstage) from configuration n (where n = t, d or s, i.e., triplex, dual or single, respectively); b indicates the type of failure (with b = t or p denoting transient or permanent failure); sl stands for system loss.

- $\mu^{n}_{ab}$ is the rate of the reconfiguration process initiated by a b-type failure of component a starting from configuration n; here a, b and n have the same meaning as previously indicated.

State 1 represents operation with no failures. States 6 and 11 correspond to degraded modes of operation, namely operation with one channel failed and two channels failed, respectively. Finally, state 12 denotes an aggregated system loss condition. This state is reached either as a result of incorrect reconfiguration or because of exhaustion.

Figure 1 Detailed Markov Model of a Triplex FTP (state groups: no failures (triplex operation); successful recovery after one failure (duplex operation); successful recovery after two failures (simplex operation); system loss caused by incorrect reconfiguration or exhaustion)

The group comprising states 2, 3, 4 and 5 is associated with the state of the system after one failure and the group including states 7, 8, 9 and 10 corresponds to the system after two failures. The states in these two groups are characterized by extremely short holding times, when compared to states 1, 6, 11 or 12. Specifically, the "fast" states represent intermediate configurations, persisting for only very short periods of time during the reconfiguration processes. The reader should note that both permanent and transient failures are accounted for, with a distinct reconfiguration path. Specifically, transient failures, i.e., states 2, 5, 7 and 10, are reconfigured back to their respective origin states, i.e., states 1 and 6. In contrast, permanent failures, i.e., states 3, 4, 8 and 9, lead to reconfigurations to the appropriate degraded operational modes, states 6 and 11.

The detailed Markov model described so far captures the key fault occurrence and handling processes in a triplex FTP, designed to withstand Byzantine faults.

It is perfectly feasible to use this model to analyze the FTP on a stand alone basis. Still, care must be taken in the solution technique to overcome difficulties caused by the very pronounced stiffness of the resulting system of ordinary differential equations. Indeed, there is an enormous discrepancy between the time constants characterizing the failure events and those associated with reconfiguration processes.
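To give a feel for the stiffness mentioned above, the following sketch compares a failure rate with a reconfiguration rate and estimates how many explicit (forward Euler) steps the fast time constant would force over a mission. The rates and the mission time are hypothetical values chosen only for illustration; the paper does not quote specific numbers.

```python
def stiffness_ratio(fast_rate, slow_rate):
    """Ratio between the fastest and slowest rates in the model."""
    return fast_rate / slow_rate

def explicit_euler_steps(fast_rate, mission_time):
    """Minimum number of forward Euler steps over the mission.

    Forward Euler applied to dP/dt = -fast_rate * P is stable only for
    step sizes below 2 / fast_rate, regardless of how slowly the
    quantities of interest actually evolve."""
    max_step = 2.0 / fast_rate
    return mission_time / max_step

# Hypothetical values: 1e-4 failures per hour, reconfiguration in about 1 s.
failure_rate = 1.0e-4        # per hour (assumed)
reconfig_rate = 3600.0       # per hour (assumed)
ratio = stiffness_ratio(reconfig_rate, failure_rate)
steps = explicit_euler_steps(reconfig_rate, mission_time=10.0)
```

With these assumed values the two time scales differ by more than seven orders of magnitude, which is why an implicit solver, or a reduction that removes the fast states altogether, is attractive.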

When the FTP must be analyzed as a subsystem within a much larger control system, the high level of detail in the model becomes a liability. This is so both because of the large state space the analyst will have to deal with and in view of the stiffness aspect mentioned above. It is thus natural to search for an approximation technique allowing a high level of fidelity, while providing a much more tractable model to work with. Such a technique will be described in the next section.

FTP Model Reduction

As already mentioned, the detailed Markov model has a number of states characterized by a very short time constant relative to the time scale of the failure events. This situation immediately suggests the possibility of a behavioral decomposition. The existence of distinct time scales is often encountered in control system applications and is exploited to generate a reduced model of the original system. Reference 5 presents a good review of this approach in the system control area. The same basic idea is also often used in simplifying various physical models (see, for example, Segal and Slemrod (6)). These reduction techniques often rely on detailed eigenvalue analysis and consequently are rather cumbersome. The reduction techniques become considerably more appealing when it is obvious which states are characterized by fast time constants and which ones are "slow".

In the reliability analysis field, the need to model both the fault occurrence (slow) and the fault handling (fast) processes leads naturally to the situation previously described. There is a strong incentive to perform a systematic behavioral (or temporal) decomposition in order to both reduce the size of the state space and also remove the severe stiffness of the mathematical model. References 2 and 3 propose a decomposition approach. While the approach is well founded, it is quite impractical for complex applications because of some rather cumbersome probabilistic arguments used to determine aggregated transition rates. Bobbio and Trivedi (4) suggest another approach, similar to that used by Segal and Slemrod, which leads to a systematic and straightforward model reduction procedure. The technique does not use a formal eigenvalue analysis, relying solely on an examination of the original, detailed model structure and transition rates. This technique will be applied to obtain a reduced FTP model.

The state transition rates may clearly be divided into two separate sets, one consisting of the slow rates, i.e., the failure rates, the other consisting of the fast rates, i.e., the reconfiguration rates. We can then partition the n states of the model into two disjoint and exhaustive subsets defined as follows:

- [S] is the set of $n_S$ slow states (1, 6, 11 and 12), i.e., states with no outgoing transitions classified as fast,

- [F] is the set of $n_F$ fast states, i.e., states with at least one fast outgoing transition.

For convenience, we further subdivide the set [F] into the subsets [F1] and [F2], corresponding to the fast states reached following a single failure (2, 3, 4 and 5) and two failures (7, 8, 9 and 10), respectively. The transition matrix and the associated probability vector are then reordered such that the equations governing the states in [S] become the first $n_S$ equations, followed by the $n_{F1}$ equations corresponding to the states in [F1] and the $n_{F2}$ equations corresponding to the states in [F2], with $n_F = n_{F1} + n_{F2}$.

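After this reordering, the equations describing the Markov model can be written as:

$$\frac{d}{dt}\begin{bmatrix} P_S \\ P_{F1} \\ P_{F2} \end{bmatrix} = \begin{bmatrix} D_0 & B_{01} & B_{02} \\ B_{10} & D_1 & 0 \\ B_{20} & 0 & D_2 \end{bmatrix} \begin{bmatrix} P_S \\ P_{F1} \\ P_{F2} \end{bmatrix} \qquad (1)$$

where:

$$D_0 = \begin{bmatrix} -\lambda_1 & 0 & 0 & 0 \\ 0 & -\lambda_6 & 0 & 0 \\ 0 & 0 & -\lambda_{11} & 0 \\ 0 & 0 & +\lambda_{11} & 0 \end{bmatrix}$$

$$D_1 = \mathrm{diag}(-\delta_2, -\delta_3, -\delta_4, -\delta_5), \qquad D_2 = \mathrm{diag}(-\delta_7, -\delta_8, -\delta_9, -\delta_{10}), \qquad 0 = \text{null matrix}$$

$$B_{01} = \begin{bmatrix} \mu_{21} & \mu_{31} & \mu_{41} & \mu_{51} \\ \mu_{26} & \mu_{36} & \mu_{46} & \mu_{56} \\ 0 & 0 & 0 & 0 \\ \lambda_2 & \lambda_3 & \lambda_4 & \lambda_5 \end{bmatrix}, \qquad B_{02} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ \mu_{7,6} & \mu_{8,6} & \mu_{9,6} & \mu_{10,6} \\ \mu_{7,11} & \mu_{8,11} & \mu_{9,11} & \mu_{10,11} \\ \lambda_7 & \lambda_8 & \lambda_9 & \lambda_{10} \end{bmatrix}$$

$$B_{10} = \begin{bmatrix} \lambda_{12} & 0 & 0 & 0 \\ \lambda_{13} & 0 & 0 & 0 \\ \lambda_{14} & 0 & 0 & 0 \\ \lambda_{15} & 0 & 0 & 0 \end{bmatrix}, \qquad B_{20} = \begin{bmatrix} 0 & \lambda_{6,7} & 0 & 0 \\ 0 & \lambda_{6,8} & 0 & 0 \\ 0 & \lambda_{6,9} & 0 & 0 \\ 0 & \lambda_{6,10} & 0 & 0 \end{bmatrix}$$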

For facility, the notation follows the convention indicated in Figure 2. The total outgoing transition rate from the fast state "k" is denoted $\delta_k$ and is given by:

$$\delta_k = \mu_{ki} + \mu_{kj} + \lambda_k$$

while the total outgoing transition rate from slow state "i" is denoted $\lambda_i$ and is given by:

$$\lambda_i = \sum_k \lambda_{ik}$$

where the summation is implied over all the fast states "fed" by slow state "i".

Figure 2 Notation Convention (fast state "k", the slow states it is connected to, and system loss)

At this point, the key approximation is made that the fast states reach their steady state well within the time scale of interest, i.e., mission time. In other words, the fast states are assumed to respond instantaneously to changes in the slow states. Setting the temporal derivatives of the fast states' probabilities to zero leads to the following approximate expressions for the probability subvectors $P_{F1}$ and $P_{F2}$:

$$P_{F1} = -D_1^{-1} B_{10} P_S, \qquad P_{F2} = -D_2^{-1} B_{20} P_S \qquad (2)$$

These expressions for $P_{F1}$ and $P_{F2}$ are used to eliminate them in favor of $P_S$ in the first $n_S$ equations, leading to the following reduced system of equations for $P_S$:

$$\dot{P}_S = \left[ D_0 - B_{01} D_1^{-1} B_{10} - B_{02} D_2^{-1} B_{20} \right] P_S = A^r P_S \qquad (3)$$

It should be noted that the algebraic manipulations implied in (2) and (3) are particularly easy to carry out in our application because of the specific structure of the Markov model. Indeed, the submatrices D1 and D2 are strictly diagonal, making their inversion trivial. The significance of their strictly diagonal structure is that no direct coupling exists among the fast states. Moreover, B10 and B20 are very sparse, further reducing the computational effort required to obtain the approximate model equations.
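The elimination step described above can be sketched in a few lines. The function below is a minimal illustration, assuming (as in the text) that the fast block is strictly diagonal; it treats all fast states as a single set rather than splitting them into [F1] and [F2], which gives the same result. The function name, the example rates and the self-test coverage value are hypothetical, and the example model is a simplified triplex with one component per channel and no transient failures.

```python
def reduce_fast_states(D0, B_sf, B_fs, delta):
    """Return A_r = D0 - B_sf * D^-1 * B_fs, where D = diag(-delta).

    D0    : n_s x n_s block coupling slow states to slow states
    B_sf  : n_s x n_f block, rates from fast states into slow states
    B_fs  : n_f x n_s block, rates from slow states into fast states
    delta : total outgoing rate of each fast state
    """
    n_s, n_f = len(D0), len(delta)
    A_r = [row[:] for row in D0]
    for i in range(n_s):
        for j in range(n_s):
            for k in range(n_f):
                # D is diagonal, so -B_sf * D^-1 * B_fs = +B_sf * B_fs / delta
                A_r[i][j] += B_sf[i][k] * B_fs[k][j] / delta[k]
    return A_r

# Hypothetical example: slow states (no failure, one failure, two failures,
# system loss) and one fast state after each of the first two failures.
lam, mu, c_isol = 1.0e-4, 3600.0, 0.95          # assumed values
D0 = [[-3 * lam, 0.0, 0.0, 0.0],
      [0.0, -2 * lam, 0.0, 0.0],
      [0.0, 0.0, -lam, 0.0],
      [0.0, 0.0, lam, 0.0]]
B_fs = [[3 * lam, 0.0, 0.0, 0.0],
        [0.0, 2 * lam, 0.0, 0.0]]
B_sf = [[0.0, 0.0],
        [mu, 0.0],
        [0.0, c_isol * mu],
        [2 * lam, (1 - c_isol) * mu + lam]]
delta = [mu + 2 * lam, mu + lam]
A_r = reduce_fast_states(D0, B_sf, B_fs, delta)
```

Each column of the reduced matrix still sums to zero, so total probability is conserved by the reduced model.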

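The transition matrix of the reduced model has the following structure:

$$A^r = \begin{bmatrix} a^r_{11} & 0 & 0 & 0 \\ a^r_{21} & a^r_{22} & 0 & 0 \\ 0 & a^r_{32} & a^r_{33} & 0 \\ a^r_{41} & a^r_{42} & a^r_{43} & 0 \end{bmatrix} \qquad (4)$$

where:

$$a^r_{11} = -\sum_{k=2}^{5} \frac{\lambda_{1k}}{\delta_k} (\mu_{k6} + \lambda_k), \qquad a^r_{21} = \sum_{k=2}^{5} \frac{\lambda_{1k}\,\mu_{k6}}{\delta_k}, \qquad a^r_{41} = \sum_{k=2}^{5} \frac{\lambda_{1k}\,\lambda_k}{\delta_k}$$

$$a^r_{22} = -\sum_{k=7}^{10} \frac{\lambda_{6k}}{\delta_k} (\mu_{k,11} + \lambda_k), \qquad a^r_{32} = \sum_{k=7}^{10} \frac{\lambda_{6k}\,\mu_{k,11}}{\delta_k}, \qquad a^r_{42} = \sum_{k=7}^{10} \frac{\lambda_{6k}\,\lambda_k}{\delta_k}$$

$$a^r_{33} = -\lambda_{11}, \qquad a^r_{43} = \lambda_{11} \qquad (5)$$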

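The reduced formulation is thus fully defined in terms of the original model parameters. A more insightful and convenient, but still fully equivalent formulation, that clearly identifies the successful as well as the failed reconfigurations, can be obtained by rewriting the reduced model transition matrix in the following form:

$$A^r = \begin{bmatrix} -\lambda^t & 0 & 0 & 0 \\ c^t \lambda^t & -\lambda^d & 0 & 0 \\ 0 & c^d \lambda^d & -\lambda^s & 0 \\ (1 - c^t)\lambda^t & (1 - c^d)\lambda^d & \lambda^s & 0 \end{bmatrix} \qquad (6)$$

with $\lambda^t = -a^r_{11}$, $\lambda^d = -a^r_{22}$ and $\lambda^s = -a^r_{33}$. Here, the equivalent triplex and duplex coverages are obtained by simply comparing formulations (4) and (6):

$$c^t = \frac{\displaystyle\sum_{k=2}^{5} \frac{\lambda_{1k}\,\mu_{k6}}{\delta_k}}{\displaystyle\sum_{k=2}^{5} \frac{\lambda_{1k}}{\delta_k}(\mu_{k6} + \lambda_k)} \qquad \text{and} \qquad c^d = \frac{\displaystyle\sum_{k=7}^{10} \frac{\lambda_{6k}\,\mu_{k,11}}{\delta_k}}{\displaystyle\sum_{k=7}^{10} \frac{\lambda_{6k}}{\delta_k}(\mu_{k,11} + \lambda_k)} \qquad (7)$$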

A few remarks are in order at this point. Since it was assumed that the triplex coverage (i.e., the detection and the isolation) is perfect, a system loss can be caused only by a coincident failure. Since the reconfiguration rate is many orders of magnitude greater than the failure rate, the equivalent triplex coverage is very nearly 1.0. In contrast, the situation for the dual operation, reached after a successful recovery following a single failure, is quite different. Here, the detection is still assumed perfect, but for isolation we must rely on self-test, which is assigned a probability of success $c_{d,isol} < 1.0$. Consequently, a transition to system loss may take place not only because of a coincident failure but also predominantly because of improper reconfiguration. As a result, the equivalent duplex coverage may be significantly less than 1.0.
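The equivalent coverage of a redundancy level is the fraction of the net flow leaving that level which arrives at the next degraded level rather than at system loss. The sketch below evaluates that ratio for an arbitrary list of fast states. It is an illustration under assumptions: the tuple layout, the function name and all numerical values are hypothetical, and an unsuccessful self-test is modelled here as a fast-state transition to system loss at rate (1 - c_isol) * mu, which the paper does not spell out.

```python
def equivalent_coverage(fast_states):
    """Equivalent coverage of one redundancy level.

    fast_states: iterable of (lam_in, mu_back, mu_next, lam_loss), where
      lam_in   = rate from the slow state into this fast state
      mu_back  = rate back to the originating slow state (transient faults)
      mu_next  = rate to the next degraded slow state (permanent faults)
      lam_loss = rate from this fast state to system loss
    """
    to_next = 0.0
    leaving = 0.0
    for lam_in, mu_back, mu_next, lam_loss in fast_states:
        delta = mu_back + mu_next + lam_loss   # total exit rate of the fast state
        to_next += lam_in * mu_next / delta
        leaving += lam_in * (mu_next + lam_loss) / delta
    return to_next / leaving

# Assumed rates (per hour) for illustration only.
lam_p, lam_t, mu, c_isol = 1.0e-4, 1.0e-3, 3600.0, 0.95
lam = lam_p + lam_t

# Triplex level, permanent faults only, perfect isolation.
c_t = equivalent_coverage([(3 * lam_p, 0.0, mu, 2 * lam_p)])

# Duplex level with one permanent and one transient fast state.
c_d = equivalent_coverage([
    (2 * lam_p, 0.0, c_isol * mu, (1 - c_isol) * mu + lam),
    (2 * lam_t, c_isol * mu, 0.0, (1 - c_isol) * mu + lam),
])
```

With these assumed numbers the triplex coverage is within about 1e-7 of 1.0, while the duplex coverage drops well below the self-test coverage because mis-handled transients also lead to system loss.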

A simple example will further clarify this important aspect. Disregarding transient failures and treating the processor/interstage set as one "component" with a failure rate $\lambda$ and a reconfiguration rate $\mu$, we have, from (7):

$$c^t = \frac{\dfrac{3\lambda\,\mu}{\mu + 2\lambda}}{\dfrac{3\lambda\,(\mu + 2\lambda)}{\mu + 2\lambda}} = \frac{\mu}{\mu + 2\lambda} = \frac{1}{1 + 2\lambda/\mu} \approx 1 - \frac{2\lambda}{\mu} \qquad (8)$$

Since $\lambda \ll \mu$, then $c^t \approx 1.0$ and $c^d \approx c_{d,isol}$. The expressions (7) properly reduce to the simple model often used to represent a triplex FTP.

The reduced model obtained in this section is illustrated in Figure 3. The three operating states (1,2 and 3) in the reduced model correspond on a one-to-one basis to the operating states (1,6 and 11) in the original model. The transition rates used in this model are given by the expressions (5) and (7). From the initial n-state model (n = 12), an approximate model containing only nS states (ns = 4) has been obtained.

Extensive numerical experimentation, comparing the original model with the reduced one, has consistently indicated an excellent agreement, proving the validity of this temporal decomposition technique in our application.

Figure 3 Reduced Markov Model of a Triplex FTP (states: no failures, one failure, two failures, system loss)

The simplicity of the reduced model allows a fully analytical solution, which will be introduced in the next section.

Analytical Solution of the Reduced FTP Model

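The simple, non-cyclic structure of the reduced model allows a compact, analytical solution. The reduced model is basically a chain, with additional transitions to system loss before component exhaustion. It can be easily shown that the solution to the model depicted in Figure 3 is:

$$P_1 = e^{-\lambda^t t} \qquad (10a)$$

$$P_2 = \frac{\lambda^t c^t}{\lambda^t - \lambda^d} \left( e^{-\lambda^d t} - e^{-\lambda^t t} \right) \qquad (10b)$$

$$P_3 = (\lambda^t c^t)(\lambda^d c^d) \times \left[ \frac{e^{-\lambda^t t}}{(\lambda^t - \lambda^d)(\lambda^t - \lambda^s)} + \frac{e^{-\lambda^d t}}{(\lambda^d - \lambda^t)(\lambda^d - \lambda^s)} + \frac{e^{-\lambda^s t}}{(\lambda^s - \lambda^t)(\lambda^s - \lambda^d)} \right] \qquad (10c)$$

The probabilities correspond to the three operational modes postulated for the triplex FTP, i.e., operation with no failures, with one failure and with two failures, respectively. For the particular case when $\lambda^t = 3\lambda$, $\lambda^d = 2\lambda$ and $\lambda^s = \lambda$, these formulas take on an especially simple, compact form, i.e., a binomial formula modified to account for imperfect coverage:

$$P_1 = R_0^3$$

$$P_2 = 3\, c^t R_0^2 (1 - R_0)$$

$$P_3 = 3\, c^t c^d R_0 (1 - R_0)^2$$

where $R_0 = \exp(-\lambda t)$. In these formulas, the appearance of the coverage probabilities accounts for the obvious need for successful reconfiguration if the FTP is to continue to operate in a degraded mode.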
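As a quick numerical check of the closed-form chain solution against the binomial special case, the sketch below evaluates both. It assumes distinct exit rates for the three operating states; the function names and the numerical values are illustrative and not taken from the paper.

```python
import math

def reduced_triplex_probs(t, rate_t, rate_d, rate_s, c_t, c_d):
    """Operating-state probabilities of the reduced chain (no failure, one
    failure, two failures) for distinct exit rates rate_t, rate_d, rate_s."""
    e_t = math.exp(-rate_t * t)
    e_d = math.exp(-rate_d * t)
    e_s = math.exp(-rate_s * t)
    p1 = e_t
    p2 = c_t * rate_t * (e_d - e_t) / (rate_t - rate_d)
    p3 = (c_t * rate_t) * (c_d * rate_d) * (
        e_t / ((rate_t - rate_d) * (rate_t - rate_s))
        + e_d / ((rate_d - rate_t) * (rate_d - rate_s))
        + e_s / ((rate_s - rate_t) * (rate_s - rate_d))
    )
    return p1, p2, p3

def binomial_form(t, lam, c_t, c_d):
    """Special case rate_t = 3*lam, rate_d = 2*lam, rate_s = lam."""
    r0 = math.exp(-lam * t)
    return (r0 ** 3,
            3.0 * c_t * r0 ** 2 * (1.0 - r0),
            3.0 * c_t * c_d * r0 * (1.0 - r0) ** 2)

# Assumed values: lam = 1e-4 per hour, 1000 h exposure, illustrative coverages.
lam, t_end, c_t, c_d = 1.0e-4, 1000.0, 0.9999, 0.95
general = reduced_triplex_probs(t_end, 3 * lam, 2 * lam, lam, c_t, c_d)
special = binomial_form(t_end, lam, c_t, c_d)
p_system_loss = 1.0 - sum(general)
```

The two evaluations agree to rounding error, and the system loss probability follows as one minus the sum of the three operating-state probabilities.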

This analytical formulation is a powerful tool for carrying out extensive parametric studies. The formulation can be easily adapted to a different type of FTP, for example a quad configuration.

Exact and Reduced Models: Simple Example

It is instructive to use a simple example to further examine the approximations involved in carrying out the model order reduction procedure outlined up to this point. Let us consider a dual FTP, treating a processor/interstage set as one component, subject to both transient and permanent failures occurring at the rates $\lambda_t$ and $\lambda_p$, respectively. The same reconfiguration rate, $\mu$, is assumed for both transient and permanent failures. The self-test coverage, accounting for the imperfect isolation characteristic of the dual architecture, is denoted by c. In view of the much greater rate of the reconfiguration process compared to the rate of failure events, the effect of a coincident failure will be disregarded. The Markov model incorporating all these assumptions is illustrated in Figure 4. This model can be solved analytically to yield the following expression for the probability of system loss:

$$P_{SL}^{exact} = 1 - A\, e^{-s_1 t} - B\, e^{-s_2 t} - C\, e^{-\lambda t} \qquad (12)$$

where the coefficients are given by:

$$A = \frac{\lambda\,[s_2 - 2\mu(1-c)]}{(s_1 - s_2)(s_1 - \lambda)}$$

$$B = -\frac{\lambda\,[s_1 - 2\mu(1-c)]}{(s_1 - s_2)(s_2 - \lambda)}$$

$$C = \frac{2 c\, \lambda_p\, \mu}{(s_1 - \lambda)(s_2 - \lambda)}$$

the exponents $s_1$ and $s_2$ are the roots of the quadratic:

$$s^2 - (2\lambda + \mu)\, s + 2\mu\,[\lambda_p + (1-c)\lambda_t] = 0$$

and $\lambda$ is the total failure rate, i.e., $\lambda = \lambda_t + \lambda_p$.
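The exact dual-FTP expression can be evaluated directly once the two roots are known. The sketch below assumes the convention that s1 and s2 are the positive decay rates appearing as exp(-s t), with s1 + s2 = 2*lam + mu and s1 * s2 = 2*mu*[lam_p + (1 - c)*lam_t], and computes the small root from the product of the roots to avoid cancellation. The function name and the numerical values are illustrative assumptions.

```python
import math

def exact_system_loss(t, lam_t, lam_p, mu, c):
    """System loss probability of the detailed dual-FTP model.

    lam_t, lam_p : transient and permanent failure rates per channel
    mu           : reconfiguration rate
    c            : self-test (isolation) coverage
    """
    lam = lam_t + lam_p
    root_sum = 2.0 * lam + mu
    root_prod = 2.0 * mu * (lam_p + (1.0 - c) * lam_t)
    s1 = 0.5 * (root_sum + math.sqrt(root_sum ** 2 - 4.0 * root_prod))
    s2 = root_prod / s1          # slow root, obtained without cancellation
    a = lam * (s2 - 2.0 * mu * (1.0 - c)) / ((s1 - s2) * (s1 - lam))
    b = -lam * (s1 - 2.0 * mu * (1.0 - c)) / ((s1 - s2) * (s2 - lam))
    cc = 2.0 * c * lam_p * mu / ((s1 - lam) * (s2 - lam))
    return (1.0 - a * math.exp(-s1 * t)
            - b * math.exp(-s2 * t)
            - cc * math.exp(-lam * t))

# Assumed rates (per hour) for illustration only.
p_loss_10h = exact_system_loss(10.0, lam_t=2.0e-4, lam_p=1.0e-4, mu=3600.0, c=0.95)
```

The expression is singular when s2 equals the total failure rate (a repeated exponent); parameter sets close to that condition would need the corresponding limiting form, which is not treated here.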

Figure 4 Markov Model of a Dual FTP (states: no failures (duplex operation); recovery after one permanent failure (simplex operation); system loss caused by incorrect reconfiguration or exhaustion)

Applying the procedure previously outlined, the initial model is reduced to the approximate model shown in Figure 5. The probability of system loss for this model is given by:

$$P_{SL}^{approx} = 1 - \bar{B}\, e^{-2[\lambda_p + (1-c)\lambda_t]\, t} - \bar{C}\, e^{-\lambda t} \qquad (13)$$

where

$$\bar{B} = \frac{(2c - 1)\,\lambda}{\lambda - 2[\lambda_p + (1-c)\lambda_t]}, \qquad \bar{C} = \frac{2 c\, \lambda_p}{2[\lambda_p + (1-c)\lambda_t] - \lambda}$$

It can be easily shown that if $\lambda \ll \mu$, then the roots of the quadratic are well approximated by:

$$s_2 \approx 2[\lambda_p + (1-c)\lambda_t] \quad \text{and} \quad s_1 \approx \mu \qquad (14)$$

Substituting (14) into the expression of the coefficients in (12) leads to the conclusion that:

$$A \to 0, \quad B \to \bar{B} \quad \text{and} \quad C \to \bar{C} \quad \text{as } (\lambda/\mu) \to 0$$

In addition, it is clear that the first exponential will decay very rapidly compared to the other two. Consequently, for any length of time sufficiently in excess of the time scale of the reconfiguration process (i.e., $1/\mu$), the approximate solution, (13), is in excellent agreement with the exact solution, (12).
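The agreement between the reduced and detailed dual-FTP models can also be checked numerically without relying on the closed-form exact solution. The sketch below evaluates the reduced-model expression and compares it with a backward Euler integration of a detailed model. The detailed model used here is an assumption chosen to be consistent with the text: separate fast states for transient and permanent faults, each left at total rate mu, with a fraction c of reconfigurations succeeding and the remainder leading to system loss. Function names, step count and rates are illustrative.

```python
import math

def approx_system_loss(t, lam_t, lam_p, c):
    """Reduced-model system loss probability (two slow exponentials)."""
    lam = lam_t + lam_p
    sigma = 2.0 * (lam_p + (1.0 - c) * lam_t)   # net exit rate from duplex
    b_bar = (2.0 * c - 1.0) * lam / (lam - sigma)
    c_bar = 2.0 * c * lam_p / (sigma - lam)
    return 1.0 - b_bar * math.exp(-sigma * t) - c_bar * math.exp(-lam * t)

def detailed_system_loss(t, lam_t, lam_p, mu, c, steps=10000):
    """Backward Euler integration of the assumed detailed model.

    States: p1 (duplex), f_t / f_p (fast states after a transient /
    permanent fault), p2 (simplex). The implicit step is solved by hand
    because of the simple structure of the model."""
    lam = lam_t + lam_p
    h = t / steps
    p1, f_t, f_p, p2 = 1.0, 0.0, 0.0, 0.0
    for _ in range(steps):
        denom = 1.0 + h * mu
        p1 = (p1 + h * c * mu * f_t / denom) / (
            1.0 + 2.0 * lam * h - 2.0 * h * h * c * mu * lam_t / denom)
        f_t = (f_t + 2.0 * h * lam_t * p1) / denom
        f_p = (f_p + 2.0 * h * lam_p * p1) / denom
        p2 = (p2 + h * c * mu * f_p) / (1.0 + h * lam)
    return 1.0 - (p1 + f_t + f_p + p2)

# Assumed rates (per hour) and a 10 h mission, for illustration only.
args = dict(lam_t=2.0e-4, lam_p=1.0e-4, c=0.95)
p_approx = approx_system_loss(10.0, **args)
p_detailed = detailed_system_loss(10.0, mu=3600.0, **args)
```

The implicit scheme is used because an explicit one would need step sizes of the order of 1/mu; with the assumed values the two results differ by far less than the loss probability itself.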

Figure 5 Reduced Markov Model of a Dual FTP (states: no failures, one failure, system loss)

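It is interesting to note the expression of the equivalent dual coverage for this example. From Figure 5 and according to equations (7), this effective coverage is given by:

$$c^d = \frac{c\, \lambda_p}{\lambda_p + (1-c)\,\lambda_t} = \frac{c}{1 + (1-c)\,\lambda_t/\lambda_p}$$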

The effective coverage is equal to the self-test coverage when the transient failures are disregarded. It decreases monotonically as the ratio of transient - to - permanent failures increases. This result is quite general, in spite of the simple model used to illustrate it.
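The dependence of the effective coverage on the transient-to-permanent failure ratio is easy to tabulate. The sketch below assumes the form c / (1 + (1 - c) * ratio), obtained as the rate into simplex operation, 2*c*lam_p, divided by the net exit rate from duplex operation, 2*[lam_p + (1 - c)*lam_t]; the function name and the sample values are illustrative.

```python
def effective_dual_coverage(c, transient_to_permanent_ratio):
    """Effective duplex coverage for self-test coverage c and a given
    ratio lam_t / lam_p of transient to permanent failure rates."""
    return c / (1.0 + (1.0 - c) * transient_to_permanent_ratio)

# Illustrative sweep for an assumed self-test coverage of 0.95.
ratios = [0.0, 1.0, 2.0, 5.0, 10.0]
sweep = [effective_dual_coverage(0.95, r) for r in ratios]
```

Even a high self-test coverage is eroded noticeably when transients dominate, since every mis-handled transient is a path to system loss that never passes through simplex operation.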

Conclusions

A methodology enabling a systematic reduction of a complex FTP model has been developed. The technique relies solely on an examination of the structure of the transition matrix, with no eigenvalue analysis or coordinate transformation necessary. The reduced model captures all the relevant features of the detailed model and represents an excellent approximation. The simplicity of the reduced model allows a fully analytical solution, which is extremely effective, especially when extensive trade studies are required.

The technique is illustrated on a simple but instructive example, for which an analytical solution is possible for both the exact and the reduced models. The high quality of the approximation is clearly shown. In addition, the crucial impact of the transient failures on the success of the reconfiguration process is demonstrated.

The methodology and the results presented herein provide considerable insight regarding the difficulties and the subtleties involved in the rigorous reliability and performance analysis of an FTP.

Acknowledgment

The work presented in this paper was sponsored by the Langley Research Center of the National Aeronautics and Space Administration under Contract NAS1-18565. Their support is gratefully acknowledged.

References

1. R. Harper, L. Alger and J. Lala, "Advanced Information Processing System: Design and Validation Knowledgebase," NASA Contractor Report, Contract NAS1-18565, July 1990.

2. K. Trivedi and R. Geist, "Decomposition in Reliability Analysis of Fault-Tolerant Systems," IEEE Trans. on Reliability, vol. R-32, No. 5, December 1983.

3. J. McGough, K. Smotherman and K. Trivedi, "The Conservativeness of Reliability Estimates Based on Instantaneous Coverage," IEEE Trans. on Computers, vol. C-34, No. 7, July 1985.

4. A. Bobbio and K. Trivedi, "An Aggregation Technique for the Transient Analysis of Stiff Markov Chains," IEEE Trans. on Computers, vol. C-35, No. 9, September 1986.

5. H. Kando, T. Iwazumi and H. Ukai, "Singular Perturbation Modelling of Large Scale Systems with Multi-Time-Scale Property," Int. J. Control, vol. 48, No. 6, 1988.

6. L. Segal and M. Slemrod, "The Quasi-Steady State Assumption: A Case Study in Perturbation," SIAM Review, vol. 31, No. 3, September 1989.
