Design of Adaptive Procedures for Fault Detection and Isolation

IEEE TRANSACTIONS ON RELIABILITY, VOL. R-20, NO. 1, FEBRUARY 1971



MARTIN COHN AND GENE OTT

Abstract-An algorithm is presented for designing minimum-expected-cost test trees for detecting and isolating single faults in a system. A test is specified by the subset of components that must be good for the test to pass, and with each test is associated a fixed cost. Each component is assumed to have an a priori probability of failure. The test tree specifies an adaptive testing procedure that detects a failure and isolates the faulty component while minimizing the expected cost of testing.

Reader Aids:
Purpose: Widen state of the art
Special math needed for explanations: Set notation
Special math needed for results: Set notation
Results useful to: Computer designers and reliability engineers

Manuscript received February 16, 1970; revised December 30, 1970. The authors are with the Sperry Rand Research Center, 100 North Road, Sudbury, Mass. 01776.

INTRODUCTION

THE PROBLEM solved in this paper is the design of minimum-expected-cost testing procedures for detecting and isolating single faults in a system. The system is assumed to be a collection of elements, which might be anything from basic components to large modules. A fault is any anomaly in the input-output behavior of an element. We assume that the a priori probabilities of element failures can be accurately estimated. Furthermore, the elements must be defined in a way that renders negligible the probabilities of multiple failure.

A test is any procedure that provides sufficient information to determine whether or not all members of a particular subset of elements of the system are functioning properly. The tests are defined simply in terms of which elements each of them checks; i.e., a test is named by specifying a subset of elements. The test associated with the full set checks whether the system is operating; the test associated with the empty set accomplishes nothing. With each available test is associated an arbitrary cost, reflecting the time, expense, or difficulty involved in executing the test. If a particular subset cannot be checked in practice, the corresponding test may either be ignored or be assigned an infinite cost.

The present formulation generalizes two previously treated problems. In the case where the cost per test is the same for all tests, the problem is identical to the design of a minimum-redundancy binary code for a source whose messages are statistically independent. The test tree becomes a decoding tree whereby the receiver decides from a sequence of received symbols what message was sent. At each node in the decoding tree a received 0 or 1 may be construed as passage or failure of a test directing the decoder to branch left or right to the next node. This coding problem was originally solved by Huffman [1], whose solution is periodically rediscovered.

Another problem treated in the literature restricts the repertoire of tests to those which test single elements. In this case the fault-isolating tree structure is always a single limb with one less node than the number of elements in the system, as depicted in Fig. 1. The solution consists of deciding which test to omit and in what sequence to perform the remaining tests. This problem can be solved by interpreting it as a machine-setup problem [2] or as a dynamic
program [3]. A further complication, which can be handled by the latter method, is the introduction of test-reliability measures that assign to the tests probabilities of yielding correct decisions.

Fig. 1. A single-limb tree.

Other work [4], [5] has treated the isolation of a fault in a system organized into modules that in turn are composed of elements. As above, tests are restricted to checking a single module at a time, and then single elements within the faulty module. Finally, all the results cited deal with the fault-isolation problem only. The method presented below yields procedures for detection as well as isolation of faults and entails no restrictions on the type of tests available, on their costs, or on system organization.

FAULT DETECTION AND ISOLATION

This section treats the design of testing procedures capable of detecting as well as isolating faults. Such procedures pertain to routine maintenance in contrast to troubleshooting, where the existence of a fault has been established. The latter case is discussed in the next section.

Problem Statement

Let a system be described as a collection of n elements:

$$S = \{s_1, s_2, \ldots, s_n\}.$$

In practice these elements may vary in complexity, and will probably represent the smallest replaceable modules. Let there be associated with each element $s_i$ an a priori probability $p_i$ of being in failure; it is assumed that the probabilities of multiple failures are negligible. Finally, consider a repertoire of $2^n$ possible tests, corresponding to all the subsets of the set of n elements. A test checks just those elements in its corresponding subset; the test is failed if any one of them is faulty, and otherwise is passed. A convenient way of denoting such tests is by binary n-vectors; a test checks element $s_i$ if and only if the ith component of the test vector is a 1. The tests can further be enumerated by interpreting the vectors as binary numbers. For example, in a five-element system the test T13 (01101) checks $s_2$, $s_3$, and $s_5$. Note that the test T0 checks nothing and so can be ignored, while the test $T_{2^n-1}$ checks the entire system (detects a fault). As mentioned earlier, to each test $T_j$ there corresponds a fixed cost $C_j$.

The problem at hand is to design minimum-expected-cost adaptive test procedures. These can be described by tree structures, with nodes labeled by tests and twigs labeled by system elements. Each node of a tree can be interpreted as a state of ignorance, called an ambiguity subset. The ambiguity subset at each node consists of the twigs that are descendant from the node. The test applied at a node serves to partition the associated ambiguity subset, thus reducing the ambiguity. The root node, or full subset, corresponds to a state of complete ignorance, while at the twigs, which correspond to unit subsets and hence where the outcome is determined, there is no further ambiguity.

For example, Fig. 2 shows the system of Example 1, and Fig. 3 illustrates a tree that describes an adaptive (nonoptimal) testing procedure for that system, where the notation is: test (cost) [elements not checked; elements checked]. The convention is to branch left if the test is passed, right if failed. For clarity the test costs appear in parentheses, the ambiguity subsets in square brackets, and element probabilities are given at the twigs. The procedure shown is adaptive because the test performed at the second level depends on the outcome of the first test. If at each level all the tests are identical, the procedure is said to be preset, in which case the procedure can be given as a simple sequence of tests rather than as a tree. It can be shown that if all test costs are equal, a preset procedure appears among those of least expected cost.

It is crucial that not only the topology of the tree be decided, but also the choice of test at each node. In the present example, for instance, the same information would be gained, albeit at different expected cost, if T1 were replaced by T3, T5, or T7. Similarly, T5 could be replaced by T2, T3, or T4. In general, given a topology and labeled twigs, the test used at any node must cause the correct partition of the ambiguity subset; the effect of the test on elements outside the ambiguity subset of interest is irrelevant. In the notation of test vectors, a test is sufficient at a node if it has 0's for elements in one class of the partition, 1's for elements in the complementary class, and either value for elements outside the ambiguity subset. In an optimal procedure, a least expensive sufficient test must be used at every node.

The expected cost of the tree (really of the testing procedure) can be computed in two equivalent ways. The expected cost of a node is defined as the product of the cost of its test and its probability of being reached. This probability is simply the sum of probabilities of all descendant twigs. The expected cost of the tree is then the sum of all expected costs of nodes.

Alternatively, the expected cost of a path (from root to twig) is defined as the product of the probability of traversing the path (i.e., the probability at the twig) and the cost of the path, which is the sum of all test costs along the path. The expected cost of the tree is then the sum of all expected costs of paths. For instance, the expected cost of the tree shown in Fig. 3 can be computed as 30(0.90 + 0.05 + 0.03 + 0.02) + 1(0.90 + 0.05) + 35(0.03 + 0.02) = 32.70, or as 0.90(30 + 1) + 0.05(30 + 1) + 0.03(30 + 35) + 0.02(30 + 35) = 32.70.

Solution of the Problem

As a formal convenience toward the solution, the condition "no fault" will be replaced by a dummy element $s_0$. The augmented system now has n + 1 elements, but the number of possible tests is unchanged; since the fictitious element $s_0$ cannot be checked, every test vector must have a 0 in the zeroth component.
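As an illustration of the test-vector notation and of the dummy "no fault" element, the following Python sketch encodes Example 1 (the probabilities and test costs of Fig. 2 and the tree of Fig. 3) and computes the expected cost of that tree both as a sum over nodes and as a sum over paths. The function names, the tuple encoding of the tree, and the bit ordering are assumptions made for this illustration only; they do not come from the paper.

```python
# Example 1 (Fig. 2), augmented with the dummy "no fault" element 0.
# Test j is read as a binary number; element i (1..n) is the ith component
# counted from the left, so T1 = 001 checks only element 3.
N = 3
PROB = {0: 0.90, 1: 0.02, 2: 0.03, 3: 0.05}
COST = {1: 1, 2: 30, 3: 25, 4: 30, 5: 35, 6: 30, 7: 20}


def checks(test, element, n=N):
    """True if the test vector has a 1 in the component for `element`."""
    if element == 0:  # the fictitious element can never be checked
        return False
    return bool((test >> (n - element)) & 1)


def fails(test, faulty, n=N):
    """Single-fault assumption: a test fails iff it checks the faulty element."""
    return checks(test, faulty, n)


# A tree is a twig (an element number) or (test, passed_subtree, failed_subtree).
# This is the nonoptimal tree of Fig. 3.
FIG3 = (6, (1, 0, 3), (5, 2, 1))


def twigs(tree):
    """Elements labeling the twigs below a node (its ambiguity subset)."""
    if isinstance(tree, int):
        return [tree]
    _, passed, failed = tree
    return twigs(passed) + twigs(failed)


def cost_by_nodes(tree):
    """Sum over nodes of (test cost) x (probability of reaching the node)."""
    if isinstance(tree, int):
        return 0.0
    test, passed, failed = tree
    reach = sum(PROB[e] for e in twigs(tree))
    return COST[test] * reach + cost_by_nodes(passed) + cost_by_nodes(failed)


def path_cost(tree, faulty):
    """Total cost of the tests actually run when `faulty` is the true state."""
    total = 0
    while not isinstance(tree, int):
        test, passed, failed = tree
        total += COST[test]
        tree = failed if fails(test, faulty) else passed
    assert tree == faulty, "the tree must isolate the true state"
    return total


def cost_by_paths(tree):
    """Sum over twigs of (twig probability) x (cost of the path to the twig)."""
    return sum(PROB[e] * path_cost(tree, e) for e in PROB)


print(round(cost_by_nodes(FIG3), 2), round(cost_by_paths(FIG3), 2))  # 32.7 32.7
```

Because element 0 is never checked, the "no fault" state passes every test, which is why it always ends up on the all-passed path of any tree.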

With a 0 placed in the zeroth component of every test vector, the system of Example 1 would appear as shown in Fig. 4.

Element         1     2     3
Probability   0.02  0.03  0.05    Cost
T1              0     0     1     C1 = 1
T2              0     1     0     C2 = 30
T3              0     1     1     C3 = 25
T4              1     0     0     C4 = 30
T5              1     0     1     C5 = 35
T6              1     1     0     C6 = 30
T7              1     1     1     C7 = 20

Fig. 2. Example 1.

Fig. 3. A testing procedure for Example 1. Root: T6(30) [no fault, 3; 2, 1]. If passed: T1(1) [no fault; 3], with twigs [no fault] 0.90 and [3] 0.05. If failed: T5(35) [2; 1], with twigs [2] 0.03 and [1] 0.02.

Element         0     1     2     3
Probability   0.90  0.02  0.03  0.05    Cost
T1              0     0     0     1      1
T2              0     0     1     0     30
T3              0     0     1     1     25
T4              0     1     0     0     30
T5              0     1     0     1     35
T6              0     1     1     0     30
T7              0     1     1     1     20

Fig. 4. Augmented system of Example 1.

The solution is both motivated and verified by the observation that if to every possible ambiguity subset there can be assigned an evaluation, consisting of the least expected cost of resolving that ambiguity, then the evaluation of the subset of complete ignorance is the cost of the optimal tree.

This evaluation function can, in fact, be computed by a recursion on the number of elements in the ambiguity subsets. For this recursion it is convenient to arrange the subsets in a latticelike array, all subsets of a given cardinality appearing in a single row. The empty subset is omitted since it corresponds to no state of ambiguity. For the earlier example the array is shown in Fig. 5.

                     [0 1 2 3]
       [0 1 2]   [0 1 3]   [0 2 3]   [1 2 3]
  [0 1]   [0 2]   [0 3]   [1 2]   [1 3]   [2 3]
             [0]    [1]    [2]    [3]

Fig. 5. Ambiguity subset array.

The partial resolution of ambiguity by a test amounts to partitioning the ambiguity subset into two nontrivial subsets necessarily occurring lower in the array. The evaluation of the subset in question is simply the minimum, over all partitionings, of the expected cost of the test plus the evaluations of the two subsets thus reached. Symbolically, the evaluation $E_D$ of a subset D is

$$E_D = \min_{A,B} \left[ P_D C_{A,B} + E_A + E_B \right]$$

where $D = A \cup B$, $A \cap B = \emptyset$, $P_D$ is the sum of probabilities of elements in D, and $C_{A,B}$ is the cost of the least expensive test partitioning D into nontrivial subsets A and B. The subsets in the lowest row contain single elements, so they represent states of no ambiguity; their evaluations are all zero, since no further testing is needed. From this observation and the expression for $E_D$, the entire array can clearly be evaluated by a recursion on cardinality. The evaluation of Fig. 5 for the system of Example 1 is shown in Fig. 6; sample evaluations of subsets [1 2 3] and [0 1 2 3] are shown in Fig. 7. Under each subset in Fig. 6 is given not only the evaluation, but also the test used to achieve it. From these data it is straightforward to draw the tree shown in Fig. 8.

Subset               Test   Evaluation
[0 1 2 3]            T1     21.25
[0 1 2]              T7     20.25
[0 1 3]              T1     19.37
[0 2 3]              T1     19.58
[1 2 3]              T1      1.35
[0 1]                T7     18.40
[0 2]                T7     18.60
[0 3]                T1      0.95
[1 2]                T3      1.25
[1 3]                T1      0.07
[2 3]                T1      0.08
[0], [1], [2], [3]   none    0

Fig. 6. Evaluated array for Example 1.

Partition   Test   P_D C_{A,B} + E_A + E_B
1 | 23      T3     0.10 x C3 + E1 + E23 = 2.50 + 0 + 0.08 = 2.58
2 | 13      T2     0.10 x C2 + E2 + E13 = 3.00 + 0 + 0.07 = 3.07
3 | 12      T1     0.10 x C1 + E3 + E12 = 0.10 + 0 + 1.25 = 1.35

0 | 123     T7     1 x C7 + E0 + E123 = 20 + 0 + 1.35 = 21.35
1 | 023     T4     1 x C4 + E1 + E023 = 30 + ...
2 | 013     T2     1 x C2 + E2 + E013 = 30 + ...
3 | 012     T1     1 x C1 + E3 + E012 = 1 + 0 + 20.25 = 21.25
01 | 23     T3     1 x C3 + E01 + E23 = 25 + ...
02 | 13     T5     1 x C5 + E02 + E13 = 35 + ...
03 | 12     T6     1 x C6 + E03 + E12 = 30 + ...

Fig. 7. Evaluations of subsets [1 2 3] and [0 1 2 3] for Example 1. (Note how computations can be halted as soon as the partially computed evaluation exceeds the best previous sum.)

Fig. 8. Least-expected-cost tree for Example 1 (E0123 = 21.25). Root: T1(1) [0, 1, 2; 3]. If failed: twig [3] 0.05. If passed: T7(20) [0; 1, 2]. If T7 is passed: twig [0] 0.90. If T7 is failed: T3(25) [1; 2], with twigs [1] 0.02 and [2] 0.03.
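The recursion on cardinality can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the names `design_tree` and `read_tree`, the dictionary inputs, and the use of frozensets for ambiguity subsets are assumptions, and the early halting noted in Fig. 7 is omitted for brevity.

```python
from itertools import combinations


def design_tree(prob, cost, n):
    """Evaluate every ambiguity subset by the recursion for E_D.

    prob: {element: probability}; element 0, if present, is "no fault".
    cost: {test number: cost}; n: number of real elements (vector length).
    Returns (evaluation, choice), both keyed by frozenset subsets.
    """
    # Real elements (1..n) whose component in each test vector is a 1.
    checked_sets = {
        t: frozenset(i for i in range(1, n + 1) if (t >> (n - i)) & 1)
        for t in cost
    }

    def cheapest_test(d, a):
        """Least expensive test that checks, within d, exactly a or d - a."""
        best = None
        for t, s in checked_sets.items():
            if (s & d) in (a, d - a) and (best is None or cost[t] < cost[best]):
                best = t
        return best

    elements = sorted(prob)
    evaluation = {frozenset([e]): 0.0 for e in elements}  # lowest row
    choice = {}  # D -> (test, part not checked, part checked)
    for k in range(2, len(elements) + 1):
        for combo in combinations(elements, k):
            d = frozenset(combo)
            p_d = sum(prob[e] for e in d)
            first, rest = combo[0], combo[1:]
            best = (float("inf"), None)
            # Visit each unordered partition {A, B} once: A contains combo[0].
            for r in range(len(rest)):
                for extra in combinations(rest, r):
                    a = frozenset((first,) + extra)
                    b = d - a
                    t = cheapest_test(d, a)
                    if t is None:
                        continue  # no available test makes this partition
                    value = p_d * cost[t] + evaluation[a] + evaluation[b]
                    if value < best[0]:
                        inside = checked_sets[t] & d
                        best = (value, (t, d - inside, inside))
            evaluation[d], choice[d] = best
    return evaluation, choice


def read_tree(d, choice):
    """Nested (test, passed_subtree, failed_subtree); twigs are elements."""
    if len(d) == 1:
        return next(iter(d))
    test, unchecked, checked = choice[d]
    return (test, read_tree(unchecked, choice), read_tree(checked, choice))


# Augmented system of Example 1 (Fig. 4).
PROB = {0: 0.90, 1: 0.02, 2: 0.03, 3: 0.05}
COST = {1: 1, 2: 30, 3: 25, 4: 30, 5: 35, 6: 30, 7: 20}
E, CHOICE = design_tree(PROB, COST, n=3)
ROOT = frozenset(PROB)
print(round(E[ROOT], 2), read_tree(ROOT, CHOICE))  # 21.25 (1, (7, 0, (3, 1, 2)), 3)
```

Ties between equally expensive tests are broken here in favor of the first test found; any other tie-breaking rule yields the same evaluations.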

The optimal tree is now read from the array (Fig. 6) starting at the top: [0 1 2 3] is partitioned by T1 into [3] and [0 1 2]; [0 1 2] by T7 into [0] and [1 2]; [1 2] by T3 into [1] and [2]. It may be confirmed that the expected cost of the tree is 21.25, the evaluation of [0 1 2 3].

Note that the check of the entire system is called for in the middle of the procedure, it being the most efficient way to partition [0 1 2] in spite of the fact that element 3 has already been checked.

TROUBLESHOOTING

In the event that fault detection is of no interest, the problem at hand reduces to fault isolation, or troubleshooting. Naturally, the procedure of the last section can be applied, with zero probability assigned to the fictitious "no-fault" element $s_0$. But significant savings in computation can be achieved by deleting $s_0$ and using the single-fault assumption to reduce the repertoire of tests.

Consider a system (with no fictitious element) in which exactly one element is faulty. If a test that checks some subset of elements is passed, the test that checks the complementary subset must be failed, and conversely. Therefore such a pair of tests is redundant, and in the quest for least-expected-cost procedures the more expensive of the pair can be ignored. By this argument the repertoire of tests for an n-element system can number at most $2^{n-1}$. Furthermore, the test vector of one of any complementary pair of tests must have a 0 in its first component, and it is only a matter of notation to insist that all the tests retained in the repertoire have this property. Thus, the fault-isolation problem is seen to fit the format discussed in the last section, and therefore can be solved by the algorithm developed there. The number of tests, however, has essentially been halved. The saving in computation that accrues from this reduction will become clearer in the next section. An example of fault isolation is shown in Figs. 9 and 10. Note that the probabilities of element failures are now conditioned on there being a fault, and so they sum to unity.

Element         1     2     3     4
Probability    0.4   0.3   0.1   0.2    Cost
T1              0     0     0     1     13
T2              0     0     1     0      8
T3              0     0     1     1     13
T4              0     1     0     0     16
T5              0     1     0     1     16
T6              0     1     1     0     15
T7              0     1     1     1     16

Subset       Test   Evaluation
[1 2 3 4]    T3     25.9
[1 2 3]      T6     15.2
[1 2 4]      T5     20.9
[1 3 4]      T3     11.5
[2 3 4]      T3     10.2
[1 2]        T6     10.5
[1 3]        T2      4.0
[1 4]        T1      7.8
[2 3]        T2      3.2
[2 4]        T1      6.5
[3 4]        T2      2.4

Fig. 9. Example 2.

Fig. 10. Least-expected-cost troubleshooting tree for Example 2. Root: T3(13) [1, 2; 3, 4]. If passed: T6(15) [1; 2], with twigs [1] 0.4 and [2] 0.3. If failed: T2(8) [4; 3], with twigs [4] 0.2 and [3] 0.1.

COMPUTATION AND MEMORY

The expense of applying the algorithm to design testing procedures for a system of n real elements may be estimated as follows. Approximately $2^{n+1}$ memory locations are required to store the array of subsets. To evaluate a subset of k elements, exactly $2^{k-1} - 1$ partitions must be considered, each involving a computation of the form

$$P_D C_{A,B} + E_A + E_B.$$

There are thus

$$\sum_{k=2}^{n+1} \binom{n+1}{k} \left( 2^{k-1} - 1 \right) \approx \tfrac{1}{2} \left( 3^{n+1} \right)$$

such computations. For the same system, fault isolation alone will require about $2^n$ memory locations and about $\tfrac{1}{2}(3^n)$ computations.

Note that a partition explicitly names the subsets A and B from which $E_A$ and $E_B$ can be read, and that the subsets can naturally be denoted by binary vectors, suggesting the efficacy of associative-memory techniques in programming the algorithm.

REFERENCES

[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, Sept. 1952, pp. 1098-1101.
[2] W. Eastman, private communication.
[3] R. Bellman, Dynamic Programming. Princeton, N.J.: Princeton
[4] B. Gluss, "An optimum policy for detecting a fault in a complex system," Oper. Res., vol. 7, July/Aug. 1959, pp. 468-477.
[5] S. Firstman and B. Gluss, "Optimum search routines for automatic fault location," Oper. Res., vol. 8, July/Aug. 1960, pp. 512-523.
