Monte Carlo Simulation of Markov Unreliability Models

Nuclear Engineering and Design 77 (1984) 49-62, North-Holland, Amsterdam


E.E. Lewis and Franz Böhm *

Department of Mechanical and Nuclear Engineering, The Technological Institute, Northwestern University, Evanston, Illinois 60201, USA

Received 9 June 1983

A Monte Carlo method is formulated for the evaluation of the unreliability of complex systems with known component failure and repair rates. The formulation is in terms of a Markov process, allowing dependencies between components to be modeled and computational efficiencies to be achieved in the Monte Carlo simulation. Two variance reduction techniques, forced transition and failure biasing, are employed to increase the computational efficiency of the random walk procedure. For an example problem these result in improved computational efficiency by more than three orders of magnitude over analog Monte Carlo. The method is generalized to treat problems with distributed failure and repair rate data, and a batching technique is introduced and shown to result in substantial increases in computational efficiency for an example problem. A method for separating the variance due to the data uncertainty from that due to the finite number of random walks is presented.

1. Introduction

Fault tree methodologies [1-3] are widely employed in probabilistic risk assessment of nuclear reactors. Reactor shutdown, emergency core cooling and other safety systems require such low failure probabilities that sufficient reliability estimates often cannot be made from operating experience or system test data. Fault trees provide a method for estimating system reliability parameters in terms of more easily obtainable data for component failure and repair rates. For large fault trees computer analysis is needed both to express the logical structure of the tree in terms of minimal cut sets, and for the quantitative evaluation of the system unreliability or unavailability. In what follows we shall consider Markov Monte Carlo methods for the evaluation of system unreliability.

Early use of Monte Carlo techniques was made for the quantitative evaluation of fault trees [4,5]. While some effort has continued in the use of purely Monte Carlo methods, they have largely been supplanted by deterministic techniques often referred to as Kinetic Tree methods [5-9]. Two limitations, however, present themselves in the use of Kinetic Tree methods.

* Current address: Institut für Kerntechnik und Energiesysteme, Universität Stuttgart, Pfaffenwaldring 32, 7000 Stuttgart 80, Fed. Rep. Germany.

First, in Kinetic Tree methods the reliability characteristics of each component are modeled separately. To evaluate the fault tree by combining component failure probabilities, the components are assumed to behave independently of one another. In fact, dependencies often arise from common mode failures, from the increased stress in partially disabled systems, and from a variety of errors in testing, maintenance and repair.

Due to this limitation of the Kinetic Tree formulation there is increasing use of Markov models for reliability analysis [2,3,10-12], for with such models quite general dependencies between components may be treated. For systems with more than a few components, however, Markov analysis by deterministic means becomes a prodigious task. Even though innovative methods have been employed to reduce the complexity of the computations [10-12], the fact remains that one must solve a set of $2^I$ coupled first-order differential equations, where $I$ is the number of components. Thus even a system with only ten components will result in a system of over one thousand coupled equations with a transition matrix with over a million elements. Moreover, if some of the components are repairable, the equations are likely to be quite stiff, requiring that very small time steps be used in the numerical integration.

A second limitation on Kinetic Tree methods is a result of the lack of precision to which the component failure and repair rates are normally known. A means is required to determine the variance in the result of the fault tree analysis in terms of the variance of the failure and repair rate data from which the component characteristics are calculated. Invariably this is accomplished by Monte Carlo sampling of the failure rate data using log-normal or other distributions [13-18]. The fault tree is evaluated deterministically with data from each data sampling, and the mean, variance and other characteristics of the system are estimated. A similar procedure is also applied to Markov models [10], requiring that the solution of the coupled set of differential equations be repeated thousands of times.

0029-5493/84/$03.00 © Elsevier Science Publishers B.V. (North-Holland Physics Publishing Division)

What follows is the formulation of a class of Monte Carlo methods which provides a natural framework for the treatment of both component dependencies and data uncertainties. In section 2 we formulate Monte Carlo simulation of the unreliability of systems with repairable components within the framework of a Markov process. This approach retains the power of deterministic Markov methods in modeling component dependencies that would not be possible if direct Monte Carlo simulation were to be carried out. At the same time the Monte Carlo simulation requires very little computer memory. In section 3, variance reduction techniques, similar to those that have been highly developed for neutral particle transport calculations [19-22], are applied to greatly increase the computational efficiency of Monte Carlo reliability calculations. In section 4 the Monte Carlo formulation is generalized to include probability distributions that represent the uncertainty in the component failure and repair rate data. The variance in the result is then due to two causes: the finite number of random walk simulations, and the uncertainty in the data. A batching technique is introduced and is shown to further reduce that part of the variance due to the finite number of random walks without a commensurate increase in computing effort.

2. Analog Markov Monte Carlo

2.1. Markov formulation

To formulate a Markov process [23,24] for system failure in a form suitable for Monte Carlo simulation of the random walks, we assume a system consisting of $I$ components, each of which may be either operating or failed. There are then $2^I$ system states arising from all possible combinations of operating and failed components; let $\Omega$ represent the set of all possible system states. Certain combinations of component failures correspond to system failure; let $\Gamma$ be the set of all system failure states.

The equations governing system failure are constructed from two probability density functions. Let

$$f(t|t',k') \equiv \text{probability density that the system will make a state transition at } t, \text{ given that it is in state } k' \text{ at time } t' \ (t' \le t), \tag{1}$$

$$q(k|k') \equiv \text{probability that the system will enter state } k \text{ following a transition out of state } k'. \tag{2}$$

Two classes of states may be considered, absorbing and nonabsorbing [23]. An absorbing state is one out of which no transitions can be made. Consequently $f(t|t',k') = 0$ and $q(k|k') = \delta_{kk'}$. For nonabsorbing states the probability densities are normalized by

$$\int_{t'}^{\infty} dt\, f(t|t',k') = 1, \tag{3}$$

$$\sum_{k \in \Omega} q(k|k') = 1. \tag{4}$$

Since all transitions by definition must cause a change in system state, for nonabsorbing states

$$q(k|k) = 0. \tag{5}$$

In this study we consider only system unreliability: once the system fails it remains in the failed state. Thus the set of absorbing states is identical to $\Gamma$, the set of failed states. If system unavailability were to be considered, there would be failed states that are not absorbing states, for repairs could be made on the failed system.

In the numerical examples that follow only time-independent failure and repair rates are considered. Under these restrictions the transition probability density may be written as

$$f(t|t',k') = \gamma_{k'} \exp[-\gamma_{k'}(t - t')], \tag{6}$$

where $0 < \gamma_{k'} < \infty$ for nonabsorbing states and $\gamma_k = 0$ for absorbing states. We refer to $\gamma_k$ as the transition rate, since it may include both failure and repair rates.

2.2. Integral equations

The Markov process is governed by a set of integral equations which we now derive. Although they are equivalent, these equations are not in the same form used in deterministic calculations. In deterministic calculations the equations are cast in terms of $P_k(t)$, the probability that the system occupies state $k$ at time $t$ [10]. For Monte Carlo simulation we are interested in

$$\psi_k^n(t) \equiv \text{probability density that the system arrives in state } k \text{ at time } t \text{ after the } n\text{th transition}, \tag{7}$$

and in the probability density for arrival in state $k$:

$$\psi_k(t) = \sum_{n=0}^{\infty} \psi_k^n(t). \tag{8}$$

The problem is initialized by placing the system in state $k_0 = 0$ at $t_0 = 0$. Hence

$$\psi_k^0(t) = \delta_{k0}\,\delta(t). \tag{9}$$

A recursive relationship for the states for which $n > 0$ follows immediately from the definitions of the probability densities:

$$\psi_k^n(t) = \sum_{k'} q(k|k') \int_0^t dt'\, f(t|t',k')\, \psi_{k'}^{n-1}(t'), \quad n > 0. \tag{10}$$

The integral equation for $\psi_k(t)$ is obtained by combining eqs. (8) through (10):

$$\psi_k(t) = \delta_{k0}\,\delta(t) + \sum_{k'} q(k|k') \int_0^t dt'\, f(t|t',k')\, \psi_{k'}(t'). \tag{11}$$

To relate the solution of this set of integral equations to the Monte Carlo simulation it is useful to write the Von Neumann series solution for $\psi_k^n(t)$ and $\psi_k(t)$. From eqs. (9) and (10) we have

$$\psi_k^1(t) = q(k|0)\, f(t|0,0),$$

$$\psi_k^2(t) = \sum_{k_1} \int_0^t dt_1\, q(k|k_1)\, q(k_1|0)\, f(t|t_1,k_1)\, f(t_1|0,0),$$

$$\psi_k^3(t) = \sum_{k_2} \sum_{k_1} \int_0^t dt_2 \int_0^{t_2} dt_1\, q(k|k_2)\, q(k_2|k_1)\, q(k_1|0)\, f(t|t_2,k_2)\, f(t_2|t_1,k_1)\, f(t_1|0,0), \tag{12}$$

and so on. Note that the probability density $\psi_k^n(t)$ consists of the sum of contributions from random walks with transitions at $t_1, t_2, \ldots, t_{n-1}$ and passing through states $k_1, k_2, \ldots, k_{n-1}$, where $t_0 = 0$ and $k_0 = 0$. For higher terms we may write these sums of multiple integrals more compactly by defining a random walk in which $n$ state transitions are made by transition time and state vectors

$$\mathbf{t}_n = \{t_1, t_2, \ldots, t_n\}^T, \tag{13}$$

$$\mathbf{k}_n = \{k_1, k_2, \ldots, k_n\}^T. \tag{14}$$

We may write the Von Neumann expansion as follows. Let

$$\sum_{\mathbf{k}_n} \equiv \sum_{k_n} \sum_{k_{n-1}} \cdots \sum_{k_1}, \tag{15}$$

$$\int^t d\mathbf{t}_n \equiv \int_0^t dt_n \int_0^{t_n} dt_{n-1} \cdots \int_0^{t_2} dt_1. \tag{16}$$

Then generalizing eq. (12) we may write

$$\psi_k^n(t) = \sum_{\mathbf{k}_{n-1}} \int^t d\mathbf{t}_{n-1}\, q(k|\mathbf{k}_{n-1})\, f(t|\mathbf{t}_{n-1}, \mathbf{k}_{n-1}), \tag{17}$$

where the probability densities for a particular random walk, specified by the components of $\mathbf{t}_n$ and $\mathbf{k}_n$, are

$$q(k|\mathbf{k}_n) = q(k|k_n) \prod_{n'=1}^{n} q(k_{n'}|k_{n'-1}), \tag{18}$$

$$f(t|\mathbf{t}_n, \mathbf{k}_n) = f(t|t_n, k_n) \prod_{n'=1}^{n} f(t_{n'}|t_{n'-1}, k_{n'-1}). \tag{19}$$

Finally, the probability density for the system entering a state $k$ at $t$ is determined by summing, as in eq. (8), over random walks of all lengths,

$$\psi_k(t) = \delta_{k0}\,\delta(t) + \sum_{n=0}^{\infty} \sum_{\mathbf{k}_n} \int^t d\mathbf{t}_n\, q(k|\mathbf{k}_n)\, f(t|\mathbf{t}_n, \mathbf{k}_n), \tag{20}$$

where in the $n = 0$ term the sum and integral over $\mathbf{k}_n$ and $\mathbf{t}_n$ are replaced by unity.

Monte Carlo simulations of the integral equations may be used to estimate weighted integrals of the form

$$\bar{\zeta} = \sum_k \int dt\, \psi_k(t)\, \zeta_k(t). \tag{21}$$

In particular, the unreliability is just the probability that the system will enter a failed state during $0 < t < T$, where $T$ is the design life. Thus the unreliability is given by

$$\bar{u} = \sum_{k \in \Gamma} \int_0^T dt\, \psi_k(t). \tag{22}$$

2.3. Analog simulation

The analog Monte Carlo simulation of the foregoing Markov process is straightforward. It is shown schematically for one random walk in fig. 1. The walk is initialized at time zero. To sample time intervals $\Delta t$ between transitions, one first finds the cumulative distribution corresponding to eq. (6):

$$\int_{t'}^{t'+\Delta t} dt\, f(t|t',k') = 1 - \exp\{-\gamma_{k'} \Delta t\}. \tag{23}$$

Fig. 1. Procedure for one random walk (i.e. trial) with design life T. [Flow chart: initialize t = 0, n = 1; determine the time t_n of the n-th transition; determine the state k_n.]

We sample $\Delta t$ using the inverse transform method [24]. A random number $\xi$ is sampled from a uniform distribution $0 < \xi \le 1$, and set equal to the cumulative distribution on the left of eq. (23). Inverting the equation then yields

$$\Delta t = -\frac{1}{\gamma_{k'}} \ln \xi, \tag{24}$$

where we have utilized the fact that $\xi$ and $1 - \xi$ have the same probability densities. To sample for the resulting state $k$, one chooses a second uniformly distributed random number, say $\xi'$, and chooses the state which satisfies

$$\sum_{k''=0}^{k} q(k''|k') < \xi' \le \sum_{k''=0}^{k+1} q(k''|k'). \tag{25}$$
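The two sampling steps of eqs. (24) and (25) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `sample_interval` and `sample_state`, the list `q_row` holding $q(k|k')$ for a fixed $k'$, and the rate 2.0 are all hypothetical.

```python
import math
import random


def sample_interval(gamma_k, rng):
    """Time to the next transition out of a state with rate gamma_k, eq. (24)."""
    xi = 1.0 - rng.random()  # rng.random() lies in [0, 1), so xi lies in (0, 1]
    return -math.log(xi) / gamma_k


def sample_state(q_row, rng):
    """Index k drawn with probability q_row[k], in the spirit of eq. (25)."""
    xi = rng.random()
    cumulative = 0.0
    chosen = None
    for k, q in enumerate(q_row):
        if q <= 0.0:
            continue
        chosen = k  # remember the last state with nonzero probability
        cumulative += q
        if xi < cumulative:
            return k
    return chosen  # reached only through floating-point round-off


# Illustrative check: the mean sampled interval should be near 1 / gamma.
rng = random.Random(2024)
intervals = [sample_interval(2.0, rng) for _ in range(100_000)]
mean_interval = sum(intervals) / len(intervals)
```

The cumulative search is linear in the number of reachable states, which is adequate here because only a handful of transitions are possible from any one state.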

Since the unreliability is given by eq. (22), the analog sampling is binomial, with 1 for failure and 0 for nonfailure. Thus if $M$ random walks are carried out and $m$ of these are found to result in system failure, the sample mean unreliability is

$$\hat{u} = m/M. \tag{26}$$

Each random walk is terminated when $t$ exceeds the design life $T$, or when a failed state is reached. Failed state determination is made by comparing the set of failed components to the minimal cut sets [1] after each transition. After the first transition only the singlets in the cut sets need to be considered, and after the second transition only singlets or singlets and doublets, depending on the number of failed components, and so on.

To estimate the variance of the result we first note that with the general weight function given in eq. (21) the variance of $\zeta_k$ is given by [21]

$$\sigma^2 = \sum_k \int dt\, \psi_k(t)\, \zeta_k^2(t) - \bar{\zeta}^2. \tag{27}$$

For the unreliability defined by eq. (22), however, this reduces to the binomial form

$$\sigma^2 = \bar{u}(1 - \bar{u}), \tag{28}$$

which may be approximated by the sample variance

$$S^2 = \hat{u}(1 - \hat{u}). \tag{29}$$

Finally, using the central limit theorem, for a large number of histories we find the variance of $\hat{u}$ to be

$$\sigma^2(\hat{u}) = \frac{1}{M}\, \bar{u}(1 - \bar{u}), \tag{30}$$

or using the sample variance

$$S^2(\hat{u}) = \frac{1}{M}\, \hat{u}(1 - \hat{u}). \tag{31}$$
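As a small illustration of eqs. (26), (29) and (31), the following Python sketch computes the analog estimate and its standard error from a failure count. The function name `analog_estimate` and the counts used (50 failures in 1,000,000 walks) are illustrative assumptions, not results taken from the paper.

```python
import math


def analog_estimate(m, M):
    """Return (u_hat, S2, standard error) for m failures in M analog walks."""
    u_hat = m / M                    # eq. (26)
    s2 = u_hat * (1.0 - u_hat)       # eq. (29), binomial sample variance
    std_err = math.sqrt(s2 / M)      # square root of eq. (31)
    return u_hat, s2, std_err


# Hypothetical counts chosen only to show the orders of magnitude involved.
u_hat, s2, std_err = analog_estimate(50, 1_000_000)
```

For rare failures the sample variance is nearly equal to the estimate itself, so the standard error is close to the square root of `u_hat / M`.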

In contrast to deterministic methods, Monte Carlo simulation of the foregoing Markov process is greatly facilitated by small computer memory requirements; it is unnecessary to store the large transition matrix required in deterministic methods, or even to enumerate the $2^I$ possible states for a system consisting of $I$ components. Rather, the necessary coefficients may be calculated "on the fly" as the simulation proceeds from the $2I$ component failure and repair rates along with a nominal amount of data representing common mode failures and other dependencies. For clarity we consider the case of no component dependencies. One of a number of standard notations may be employed to include dependencies in the failure and repair rates.

Consider a system of $I$ components, each component $i$ with a failure rate $\lambda_i$ and a repair rate $\mu_i$. Initially the system is assumed to have no failed components. Hence the initial value of the transition rate is

$$\gamma_0 = \sum_{i=1}^{I} \lambda_i. \tag{32}$$

Succeeding transition rates are determined by subtracting $\lambda_i$ and adding $\mu_i$ for each failure and adding $\lambda_i$ and subtracting $\mu_i$ for each repair of component $i$.

We express $\gamma$ as

$$\gamma = \lambda + \mu, \tag{33}$$

where

$$\lambda = \sum_{i \in U} \lambda_i \tag{34}$$

and

$$\mu = \sum_{i \in F} \mu_i. \tag{35}$$

Here $U$ and $F$ are the sets of unfailed and failed components, respectively.

Determination of which component has failed or been repaired, and thereby of the new state of the system, is carried out as follows. A random number $\xi'$ is first generated. If $\gamma\xi' < \lambda$, the failed component is determined from

$$\sum_{\substack{i'=1 \\ i' \in U}}^{i} \lambda_{i'} \le \gamma\xi' \le \sum_{\substack{i'=1 \\ i' \in U}}^{i+1} \lambda_{i'}. \tag{36}$$

If $\gamma\xi' > \lambda$, the repaired component is determined from

$$\lambda + \sum_{\substack{i'=1 \\ i' \in F}}^{i} \mu_{i'} < \gamma\xi' \le \lambda + \sum_{\substack{i'=1 \\ i' \in F}}^{i+1} \mu_{i'}. \tag{37}$$

This procedure may be further sped up by using rejection sampling [21] on the failure rates.
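The analog random walk described above, for independent components, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the code used in the paper: the names `pick`, `analog_walk` and `analog_unreliability` are hypothetical, cut sets are represented as frozensets of component indices, and the two-component system at the end is an invented test case with a known answer.

```python
import math
import random


def pick(indices, rates, target):
    """Return the first index whose running sum of rates exceeds target."""
    running = 0.0
    for i in indices:
        running += rates[i]
        if target < running:
            return i
    return indices[-1]  # round-off guard


def analog_walk(lam, mu, cut_sets, T, rng):
    """One analog random walk: 1 if a cut set fails before time T, else 0."""
    failed = set()
    t = 0.0
    while True:
        up = [i for i in range(len(lam)) if i not in failed]
        down = sorted(failed)
        lam_sum = sum(lam[i] for i in up)             # eq. (34)
        mu_sum = sum(mu[i] for i in down)             # eq. (35)
        gamma = lam_sum + mu_sum                      # eq. (33)
        if gamma <= 0.0:
            return 0                                  # no transition can occur
        t += -math.log(1.0 - rng.random()) / gamma    # eq. (24)
        if t > T:
            return 0
        target = gamma * rng.random()
        if target < lam_sum or mu_sum == 0.0:
            failed.add(pick(up, lam, target))                  # failure, eq. (36)
        else:
            failed.discard(pick(down, mu, target - lam_sum))   # repair, eq. (37)
        if any(cs <= failed for cs in cut_sets):
            return 1


def analog_unreliability(lam, mu, cut_sets, T, M, seed=0):
    """Estimate the unreliability from M analog walks, eq. (26)."""
    rng = random.Random(seed)
    m = sum(analog_walk(lam, mu, cut_sets, T, rng) for _ in range(M))
    return m / M


# Hypothetical two-component parallel system without repair; for unit rates
# and T = 1 the exact unreliability is (1 - exp(-1))**2.
u_hat = analog_unreliability([1.0, 1.0], [0.0, 0.0], [frozenset({0, 1})], 1.0, 20_000)
```

Only the component rates and the current set of failed components are stored; no transition matrix is built, in keeping with the "on the fly" evaluation described above.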

2.4. Figure of merit

The standard criterion for judging the efficiency of Monte Carlo methods is the figure of merit [24,25], $1/(\sigma^2 t)$, where $\sigma^2$ is the variance of the sampled distribution and $t$ is the time per trial (i.e. random walk). It is instructive to compare the figure of merit of the Markov Monte Carlo formulation to that of the more traditional formulation [4,5], realizing that with the traditional model dependencies between components are not readily simulated.

Consider a system with a substantial number of components and a small unreliability. In traditional simulation, each trial must consist, at a minimum, of a sampling of the time to first failure of every component, even though most components do not fail within the design life $T$. In contrast, for systems with small unreliability the majority of Markov trials will consist of only one sampling of eq. (24) from which it is determined that no transitions take place before $t = T$. Thus the average time per trial is much shorter for the Markov process. Since both traditional and Markov simulation are binomial, the variance will be the same, as indicated by eq. (31). Consequently the Markov formulation results in the larger figure of merit.

While Markov Monte Carlo may offer substantial improvements in modeling and computational efficiency over traditional analog Monte Carlo methods, it is still likely to be expensive for the simulation of highly reliable systems. For if $\sigma(\hat{u})/\bar{u}$ is used as a measure of relative error, then for $\bar{u} \ll 1$ we have from eqs. (26) and (31)

$$\frac{\sigma(\hat{u})}{\bar{u}} \approx \frac{1}{(\bar{u} M)^{1/2}}. \tag{38}$$

Even though the time per trial may be small, very large numbers of trials are required. Fortunately, Markov Monte Carlo lends itself well to a number of powerful variance reduction techniques that are similar to those in Monte Carlo neutral particle transport calculations. We next consider these.
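The step from the binomial variance to the relative error estimate can be written out as a short worked derivation. The numerical values in the last line (an unreliability of $0.5 \times 10^{-4}$ with $10^6$ trials) are used only as an illustration of the scale of the problem.

```latex
\begin{align*}
\frac{\sigma(\hat{u})}{\bar{u}}
  &= \frac{1}{\bar{u}} \left[ \frac{\bar{u}(1-\bar{u})}{M} \right]^{1/2}
   = \left[ \frac{1-\bar{u}}{\bar{u} M} \right]^{1/2}
   \approx \frac{1}{(\bar{u} M)^{1/2}} \qquad (\bar{u} \ll 1), \\
\bar{u} = 0.5 \times 10^{-4},\ M = 10^{6}
  &\;\Rightarrow\; \bar{u} M = 50, \qquad
   \frac{\sigma(\hat{u})}{\bar{u}} \approx \frac{1}{\sqrt{50}} \approx 0.14 .
\end{align*}
```

Under these assumed values a million analog trials still leave a relative error of roughly 14 percent, since the error depends only on the expected number of failures $\bar{u} M$.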

3. Variance reduction methods

Broadly defined, variance reduction methods modify analog Monte Carlo simulations in order to increase the figure of merit, $1/(\sigma^2 t)$, while maintaining an unbiased estimate of the expectation value, $\bar{u}$. Ordinarily a substantial reduction in $\sigma^2$ is achieved, more than compensating for any increase that may result in the time per trial [19-21]. Some techniques, such as Russian Roulette, however, achieve increased figures of merit by decreasing the time per history even though some increase in $\sigma^2$ may result. Here two techniques are considered, which we shall refer to as forced transition and transition biasing. Both of these may be considered within the following framework.

3.1. Biased random walk formulation

Suppose we consider a modified or "biased" random walk where the probability density $f(t|t',k')$ and the discrete probabilities $q(k|k')$, both defined above, are replaced respectively by the modified distributions $\tilde{f}(t|t',k')$ and $\tilde{q}(k|k')$. We require that $\tilde{f}$ and $\tilde{q}$ also satisfy eqs. (3) and (4), and that $\tilde{f} > 0$ for values of $t$, $t'$, $k$, and $k'$ for which $f > 0$, and similarly if $q(k|k') > 0$ then $\tilde{q}(k|k') > 0$. Likewise we may define

$$\tilde{\psi}_k^n(t) \equiv \text{modified density that the system arrives in state } k \text{ at time } t \text{ after the } n\text{th transition}, \tag{39}$$

and similar to eqs. (8) and (9)

$$\tilde{\psi}_k(t) = \sum_{n=0}^{\infty} \tilde{\psi}_k^n(t) \tag{40}$$

and

$$\tilde{\psi}_k^0(t) = \delta_{k0}\,\delta(t). \tag{41}$$

Following the same logic as for analog Markov Monte Carlo we may show that $\tilde{\psi}_k(t)$ satisfies integral equations analogous to eqs. (10) and (11):

$$\tilde{\psi}_k^n(t) = \sum_{k'} \tilde{q}(k|k') \int_0^t dt'\, \tilde{f}(t|t',k')\, \tilde{\psi}_{k'}^{n-1}(t'), \tag{42}$$

$$\tilde{\psi}_k(t) = \delta_{k0}\,\delta(t) + \sum_{k'} \tilde{q}(k|k') \int_0^t dt'\, \tilde{f}(t|t',k')\, \tilde{\psi}_{k'}(t'). \tag{43}$$

Suppose a Markov Monte Carlo is carried out with $\tilde{f}(t|t',k')$ and $\tilde{q}(k|k')$ to sample $\tilde{\psi}_k(t)$. We require a relationship from which $\psi_k(t)$ can be estimated, given a sampling of $\tilde{\psi}_k(t)$. To do this we define a weight $w_k^n(t)$ such that

$$\psi_k^n(t) = w_k^n(t)\, \tilde{\psi}_k^n(t). \tag{44}$$

Substituting this expression into eq. (9) and requiring eq. (41) to hold yields the initial condition

$$w_k^0(t) = 1. \tag{45}$$

Then if eq. (44) is substituted into eq. (10) we obtain

$$w_k^n(t)\, \tilde{\psi}_k^n(t) = \sum_{k'} q(k|k') \int_0^t dt'\, f(t|t',k')\, w_{k'}^{n-1}(t')\, \tilde{\psi}_{k'}^{n-1}(t'). \tag{46}$$

However $\tilde{\psi}_k^n(t)$ must also satisfy eq. (42). This condition is met provided the weights satisfy the recursive relationship

$$w_k^n(t) = \frac{q(k|k')\, f(t|t',k')}{\tilde{q}(k|k')\, \tilde{f}(t|t',k')}\, w_{k'}^{n-1}(t'). \tag{47}$$

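From this expression it is clear that the weighting of the $\tilde{\psi}_k^n(t)$ sampling is a function of the successive steps in the random walk defined by $\mathbf{t}_n$ and $\mathbf{k}_n$:

$$w_k^n(t) = u(t, k|\mathbf{t}_{n-1}, \mathbf{k}_{n-1}), \tag{48}$$

where

$$u(t, k|\mathbf{t}_n, \mathbf{k}_n) = \frac{q(k|\mathbf{k}_n)\, f(t|\mathbf{t}_n, \mathbf{k}_n)}{\tilde{q}(k|\mathbf{k}_n)\, \tilde{f}(t|\mathbf{t}_n, \mathbf{k}_n)}. \tag{49}$$

Analogous to eqs. (18) and (19) we have defined

$$\tilde{q}(k|\mathbf{k}_n) \equiv \tilde{q}(k|k_n) \prod_{n'=1}^{n} \tilde{q}(k_{n'}|k_{n'-1}) \tag{50}$$

and

$$\tilde{f}(t|\mathbf{t}_n, \mathbf{k}_n) \equiv \tilde{f}(t|t_n, k_n) \prod_{n'=1}^{n} \tilde{f}(t_{n'}|t_{n'-1}, k_{n'-1}). \tag{51}$$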

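With the foregoing weight definitions we may insert eqs. (44), (45) and (48) into eq. (20), to obtain the solution for $\psi_k(t)$ in terms of $\tilde{f}$ and $\tilde{q}$:

$$\psi_k(t) = \delta_{k0}\,\delta(t) + \sum_{n=0}^{\infty} \sum_{\mathbf{k}_n} \int^t d\mathbf{t}_n\, \tilde{q}(k|\mathbf{k}_n)\, \tilde{f}(t|\mathbf{t}_n, \mathbf{k}_n)\, u(t, k|\mathbf{t}_n, \mathbf{k}_n). \tag{52}$$

Then combining this expression with eq. (22) the unreliability may be expressed in terms of $\tilde{f}$ and $\tilde{q}$ as

$$\bar{u} = \sum_{k \in \Gamma} \sum_{n=0}^{\infty} \sum_{\mathbf{k}_n} \int_0^T dt \int^t d\mathbf{t}_n\, \tilde{q}(k|\mathbf{k}_n)\, \tilde{f}(t|\mathbf{t}_n, \mathbf{k}_n)\, u(t, k|\mathbf{t}_n, \mathbf{k}_n). \tag{53}$$

Since the unreliability is now expressed as an average of $u(t, k|\mathbf{t}_n, \mathbf{k}_n)$ over those biased random walks leading to system failure, the variance of $u$ becomes

$$\sigma^2 = \sum_{k \in \Gamma} \sum_{n=0}^{\infty} \sum_{\mathbf{k}_n} \int_0^T dt \int^t d\mathbf{t}_n\, \tilde{q}(k|\mathbf{k}_n)\, \tilde{f}(t|\mathbf{t}_n, \mathbf{k}_n)\, \left[ u(t, k|\mathbf{t}_n, \mathbf{k}_n) - \bar{u} \right]^2. \tag{54}$$

Observe that as $\tilde{f} \to f$ and $\tilde{q} \to q$, the weight of the trial will go to one, and eqs. (53) and (55) revert to the binomial sampling expressions.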

3.2. Implementation

With the foregoing equations we may construct a biased Monte Carlo simulation in which $\tilde{f}$ and $\tilde{q}$ instead of $f$ and $q$ are sampled, while still obtaining an unbiased estimate of the unreliability. This is done by associating a weight with each random walk. The weight is initialized to one, and then adjusted to correct for the bias at each sampling. The estimate of the unreliability for $M$ random walks is now

$$\hat{u} = \frac{1}{M} \sum_{m \in \Gamma} u_m, \tag{55}$$

where $u_m$ here denotes the weight $u(t, k|\mathbf{t}_n, \mathbf{k}_n)$ of the $m$th random walk at the time a failed state is entered. Likewise the variance of the random walk contributions about $\bar{u}$, given by eq. (54), may be estimated from the sample variance:

$$S^2 = \frac{1}{M - 1} \sum_{m \in \Gamma} (u_m - \hat{u})^2. \tag{56}$$

Then from the central limit theorem the variance of the unreliability estimate $\hat{u}$ is given by

$$\sigma(\hat{u})^2 = S^2/M. \tag{57}$$

The object of the biasing is to derive Monte Carlo methods with reduced variance. This can be accomplished most directly by causing more trials to contribute to the result than in the binomial case. For as this happens there will be fewer zero weight trials, but those trials that do contribute will have weights much smaller than one, leading to smaller values of the bracketed term in eq. (54) and thus in eq. (56). In what follows we utilize two such variance reduction techniques.

3.2.1. Forced transitions

For the forced transition method we take

$$\tilde{f}(t|t',k') = \frac{\gamma\, e^{-\gamma(t - t')}}{1 - e^{-\gamma(T - t')}} \quad \text{for } t' \le t \le T, \tag{58}$$

and $\tilde{f}(t|t',k') = 0$ otherwise. We then multiply the trial weight by

$$\frac{f(t|t',k')}{\tilde{f}(t|t',k')} = 1 - e^{-\gamma(T - t')} \tag{59}$$

to maintain an unbiased result. This technique is only applied when $\gamma$ is sufficiently small that $\gamma(T - t')$ is not large, and therefore there is only a small chance that an additional transition will take place before the end of design life. When failed components are present $\gamma$ is usually sufficiently large that there is no significant gain in variance if we take simply $\tilde{f} = f$. When no failed components are present often $\gamma(T - t') \ll 1$ and therefore the rare event approximation may be used. This is accomplished by expanding the foregoing exponentials to yield

$$\tilde{f}(t|t',k') \approx \frac{1}{T - t'}, \quad t' < t < T, \tag{60}$$

and

$$\frac{f(t|t',k')}{\tilde{f}(t|t',k')} \approx \gamma(T - t'). \tag{61}$$
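A forced transition can be sampled by inverting the cumulative distribution of the truncated exponential in eq. (58). The sketch below is illustrative only: the function name `forced_transition`, the rate, the design life, and the tally of transitions in the first half of the design life are all assumptions made to check that the weighted estimate is unbiased.

```python
import math
import random


def forced_transition(gamma, t_prev, T, rng):
    """Sample a transition time in [t_prev, T) from eq. (58).

    Returns the sampled time and the weight factor of eq. (59).
    """
    p = -math.expm1(-gamma * (T - t_prev))   # 1 - exp(-gamma (T - t'))
    xi = rng.random()
    dt = -math.log1p(-xi * p) / gamma        # inverse of the truncated CDF
    return t_prev + dt, p


# Illustrative check with hypothetical values: estimate the probability of a
# transition before T/2 using forced transitions and their weights.
rng = random.Random(7)
gamma, T, N = 1.0e-3, 10.0, 20_000
tally = 0.0
for _ in range(N):
    t, weight = forced_transition(gamma, 0.0, T, rng)
    if t < T / 2:
        tally += weight
estimate = tally / N
exact = -math.expm1(-gamma * T / 2)          # analog probability of the same event
```

Every forced walk contributes a transition, whereas in analog sampling only about one walk in a hundred would do so for these values. The functions `expm1` and `log1p` are used because $\gamma(T - t')$ is small in the regime where the method is applied.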

3.2.2. Failure biasing

The second type of variance reduction that is employed we shall refer to as failure biasing. The notion here is to force the system toward states with more component failures and fewer repaired components in order to bring it closer to system failure. The need for such biasing is amplified by the fact that for most component types repair rates are much larger than failure rates. Thus if analog sampling is used most failures are followed by repair at the next transition, leading to very few system failures even though the forced transition method is being applied.

Suppose for a particular system state $k'$ we divide all possible transitions $k' \to k$ into two classes: the transition $k \in A$ corresponds to an additional component failure, while $k \in R$ corresponds to an additional component repair. Hence for nonabsorbing states

$$\sum_{k \in A} q(k|k') + \sum_{k \in R} q(k|k') = 1. \tag{62}$$

Note that if there are no dependencies, then the two sums in this expression have numbers of terms equal to the number of unfailed components and to the number of failed components, respectively. Since repair rates are much larger than failure rates the second sum normally dominates in systems with any failed components. To bias the transitions toward additional failures, we define a parameter $x$ such that

$$\sum_{k \in A} \tilde{q}(k|k') = x, \tag{63}$$

and hence

$$\sum_{k \in R} \tilde{q}(k|k') = 1 - x. \tag{64}$$

Thus in the nonanalog random walk a fraction $x$ of the transitions is forced to result in additional component failures. Within the class of failure transitions, the $\tilde{q}(k|k')$ are determined by increasing $q(k|k')$ by the same multiplier:

$$\tilde{q}(k|k') = \frac{q(k|k')}{\sum_{k'' \in A} q(k''|k')}\, x, \quad k \in A, \tag{65}$$

and

$$\tilde{q}(k|k') = \frac{q(k|k')}{\sum_{k'' \in R} q(k''|k')}\, (1 - x), \quad k \in R. \tag{66}$$

With such a transition the weight of the trial is multiplied by

$$\frac{q(k|k')}{\tilde{q}(k|k')} = \frac{1}{x} \sum_{k'' \in A} q(k''|k') \tag{67}$$

for component failures, and by

$$\frac{q(k|k')}{\tilde{q}(k|k')} = \frac{1}{1 - x} \sum_{k'' \in R} q(k''|k') \tag{68}$$

for repair.

As in analog Monte Carlo there is no need to store all of the $\tilde{q}(k|k')$ in the computer, for in the absence of component dependencies only the failure and repair rates of the components are required. Suppose that in state $k'$ before the transition $\lambda$ and $\mu$, given by eqs. (34) and (35), are the sums of the failure and repair rates for the unfailed and failed components respectively, and $\gamma$ is again given by eq. (33). The sampling proceeds as follows: choose a random number $\xi'$ and compare it to $x$. If $\xi' < x$, the failure is of the component $i$ for which

$$\sum_{\substack{i'=1 \\ i' \in U}}^{i} \lambda_{i'} < \frac{\lambda}{x}\, \xi' \le \sum_{\substack{i'=1 \\ i' \in U}}^{i+1} \lambda_{i'}, \tag{69}$$

and the weight of the trial is multiplied by

$$\frac{q(k|k')}{\tilde{q}(k|k')} = \frac{\lambda}{x \gamma}. \tag{70}$$

If $\xi' \ge x$, the newly repaired component $i$ is determined from

$$\sum_{\substack{i'=1 \\ i' \in F}}^{i} \mu_{i'} < \frac{\mu\,(\xi' - x)}{1 - x} \le \sum_{\substack{i'=1 \\ i' \in F}}^{i+1} \mu_{i'}, \tag{71}$$

and the weight is multiplied by

$$\frac{q(k|k')}{\tilde{q}(k|k')} = (1 - x)^{-1}\, \frac{\mu}{\gamma}. \tag{72}$$

The foregoing forced transition and failure biasing techniques lead to large improvements in the figure of merit for systems with small unreliabilities and repairable components. This may be seen from the numerical results that follow.
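The failure-biased choice of the next transition, with its weight correction, can be sketched as below for independent components. This is a minimal sketch under stated assumptions: the names `pick` and `biased_transition` and the rates in the check are hypothetical, and the handling of states in which only failures or only repairs are possible (sampled without biasing, weight factor one) is an assumption that the text does not spell out.

```python
import random


def pick(indices, rates, target):
    """Return the first index whose running sum of rates exceeds target."""
    running = 0.0
    for i in indices:
        running += rates[i]
        if target < running:
            return i
    return indices[-1]  # round-off guard


def biased_transition(lam, mu, failed, x, rng):
    """Return (component, is_failure, weight_factor) for one biased transition."""
    up = [i for i in range(len(lam)) if i not in failed]
    down = sorted(failed)
    lam_sum = sum(lam[i] for i in up)        # eq. (34)
    mu_sum = sum(mu[i] for i in down)        # eq. (35)
    gamma = lam_sum + mu_sum                 # eq. (33)
    # Assumption: if only one class of transition is possible, sample it unbiased.
    if mu_sum == 0.0:
        return pick(up, lam, lam_sum * rng.random()), True, 1.0
    if lam_sum == 0.0:
        return pick(down, mu, mu_sum * rng.random()), False, 1.0
    xi = rng.random()
    if xi < x:
        i = pick(up, lam, lam_sum * xi / x)                  # eq. (69)
        return i, True, lam_sum / (x * gamma)                # eq. (70)
    i = pick(down, mu, mu_sum * (xi - x) / (1.0 - x))        # eq. (71)
    return i, False, mu_sum / ((1.0 - x) * gamma)            # eq. (72)


# Illustrative check with hypothetical rates: the weighted fraction of failure
# transitions should reproduce the analog probability lam_sum / gamma.
rng = random.Random(11)
lam, mu = [0.01, 0.02, 0.03], [1.0, 2.0, 3.0]
N = 20_000
tally = 0.0
for _ in range(N):
    _, is_failure, w = biased_transition(lam, mu, {0}, 0.85, rng)
    if is_failure:
        tally += w
weighted_failure_fraction = tally / N
```

With component 0 failed, the analog probability of a further failure is about 0.048, yet 85 percent of the biased transitions are failures; the weight factor of eq. (70) restores the correct expectation.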

3.2.3. Code structure

Two versions of a Markov process Monte Carlo code have been written for a PDP 11/44 minicomputer. Both have the logical arrangement shown in fig. 1. The input consists of failure and repair rates for each component along with the minimal cut sets for the system. The analog and nonanalog versions differ in the methods for determining the time to transition, the system state following the transition, and the tallies.

The module for determining whether system failure has occurred compares the current system state to the minimal cut sets. This is carried out by testing only those cut sets that are possible for the current number of component failures: singlets for one failure, singlets and doublets for two failures and so on. By defining a number of component failures beyond which the system is assumed to be failed the cut set testing can be truncated at the price of a conservative overestimate of the system unreliability.

Fig. 2. Fault tree for example problem.

Table 1. Data for example problem

  i     λ_i (10^-5 h^-1)     μ_i (h^-1)
  1     0.26                 0.042
  2     0.26                 0.042
  3     0.26                 0.042
  4     3.5                  0.17
  5     3.5                  0.17
  6     3.5                  0.17
  7     0.5                  0
  8     0.5                  0
  9     0.8                  0
 10     0.8                  0

3.3. Illustrative example

The characteristics of the variance reduction techniques are nicely illustrated by a problem very similar to that published by Vesely [5]. The fault tree is shown in fig. 2 and the failure and repair rate data are given in table 1.

The problem is a severe test of the Monte Carlo method in that the unreliability is very small. Both analog and nonanalog (with $x = 0.85$) methods have been applied to the problem, and the results are compared in columns a and b in table 2. A very large number of trials was required for the analog run in order to obtain statistically significant results. The most important result is the increase of well over three orders of magnitude of the figure of merit, $1/(S^2 t)$, when the nonanalog method is employed. Although the time per history is nominally larger, the great reduction in the variance is the prominent feature of the nonanalog method.

The reason for the variance reduction in the nonanalog method may be illustrated quite graphically [25] by considering the density distribution of weights in the tallies. Suppose we let

$$f(u) \equiv \text{probability density of a failed system weight } u \text{ resulting from a Monte Carlo random walk}. \tag{73}$$

Written in terms of $f(u)$ the unreliability is

$$\bar{u} = \int du\, u\, f(u), \tag{74}$$

and the variance of $u$ is

$$\sigma^2 = \int du\, (u - \bar{u})^2 f(u). \tag{75}$$

Since according to the central limit theorem the variance in the estimate $\hat{u}$ about $\bar{u}$ is $\sigma^2/M$, where $M$ is the number of random walks, bunching the weights about $\bar{u}$ leads to small variance.

Table 2. Comparison of Monte Carlo unreliability results

Quantity                       a: Analog          b: Nonanalog        c: Nonanalog
                               (point data)       (point data)        (distributed data)
Trials M                       1,000,000          10,000              10,000
Time/trial t (ms)              1.61               10.01               119.86
Sample variance S^2            0.5 × 10^-4        0.24 × 10^-8        0.18 × 10^-7
Figure of merit 1/(S^2 t)      1.26 × 10^7        4.16 × 10^10        6.53 × 10^8
Unreliability                  0.50 × 10^-4       0.444 × 10^-4       0.673 × 10^-4
Uncertainty ± S/√M             ± 0.50 × 10^-4     ± 0.0049 × 10^-4    ± 0.0136 × 10^-4

In analog Monte Carlo the sampling is binomial, yielding

$$f(u) = (1 - \bar{u})\,\delta(u) + \bar{u}\,\delta(u - 1), \tag{76}$$

which is consistent with the results given in eqs. (28) and (30). In the nonanalog Monte Carlo the distribution takes on the form

$$f(u) = (1 - c)\,\delta(u) + c\,g(u), \tag{77}$$

where

$$\int du\, g(u) = 1, \tag{78}$$

$$\bar{u} = c \int du\, u\, g(u), \tag{79}$$

$$\sigma^2 = (1 - c)\,\bar{u}^2 + c \int du\, (u - \bar{u})^2 g(u). \tag{80}$$

Here $c$ is just the fraction of trials that terminate in failed states and thus contribute to the unreliability tally. The effectiveness of the nonanalog modification of the game depends first on the degree to which the fraction of random walks which contribute to the tally can be increased over the analog value $\bar{u}$, and second, on the degree to which the weights are bunched about $\bar{u}$.
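The variance expression for the nonanalog weight distribution follows directly from inserting the two-part form of $f(u)$ into the definition of the variance, and the analog result is recovered as a special case. The short derivation below is a worked check using only the definitions in eqs. (75) through (79).

```latex
\begin{align*}
\sigma^2 &= \int du\,(u-\bar{u})^2 \left[ (1-c)\,\delta(u) + c\,g(u) \right] \\
         &= (1-c)\,\bar{u}^2 + c \int du\,(u-\bar{u})^2\, g(u). \\[4pt]
\text{Analog limit } c = \bar{u},\ g(u) = \delta(u-1):\quad
\sigma^2 &= (1-\bar{u})\,\bar{u}^2 + \bar{u}\,(1-\bar{u})^2 = \bar{u}\,(1-\bar{u}).
\end{align*}
```

The first term is the penalty for walks that contribute nothing, which shrinks as $c$ approaches one; the second term is the spread of the nonzero weights, which shrinks as they bunch about $\bar{u}$.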

For the nonanalog sample of 10000 random walks the sample value of $c$ is found to be increased from $\bar{u} = 0.5 \times 10^{-4}$ to $c = 0.994$, since only 30 random walks do not contribute to the nonanalog results. The nonzero weights are clustered about the value of $\bar{u}$, as indicated by the histogram of $f(u)$ shown in fig. 3. These features lead to the small variance observed in table 2 when the variance reduction techniques are applied.

4. Monte Carlo with data uncertainties

It has been assumed implicitly in the foregoing analysis that the failure and repair rates $\lambda_i$ and $\mu_i$ for each of the components are known with perfect precision. In actuality these data most often have substantial uncertainties associated with them [1]. Accordingly, to carry out realistic unreliability analysis, it is necessary to represent failure and repair data as random variables characterized by probability density functions. In this section we perform this task as follows. First the input (i.e. failure and repair rate) data are expressed as probability density functions. Then the Monte Carlo random walk procedure is generalized to take the uncertainty of the data into account. This leads naturally to the idea of batching as well as to some insight as to the relative contributions to the variance of the system unreliability made by the random walk simulation, on the one hand, and by the data uncertainty on the other.

Fig. 3. Approximate representation of f(u). [Histogram; horizontal axis: weight (×10^-5).]

4.1. Data distributions

To begin, we let $\nu$ represent a vector whose components are the $\lambda_i$ and $\mu_i$ of each component of a system. Thus the probabilities of the preceding section are conditioned on the data $\nu$. In particular the probability density given by eq. (73) is expressed in terms of $\nu$ as

$f(u|\nu)$ = probability density of a failed system weight $u$ resulting from a Monte Carlo random walk, given the data $\nu$. (81)

Consequently, the unreliability and variance given by eqs. (74) and (75) also become dependent on $\nu$. We indicate this by a subscript $\nu$:

$$\bar{u}_\nu = \int du\, u\, f(u|\nu), \tag{82}$$

$$\sigma_\nu^2 = \int du\, [u - \bar{u}_\nu]^2 f(u|\nu). \tag{83}$$

To estimate the unreliability $\bar{u}$ in the presence of data uncertainty, we define for each piece of data $\nu_i$

$f(\nu_i)$ = probability density of $\nu_i$. (84)

We then define

$f(\nu)$ = probability density of $\nu$. (85)

If the uncertainties in the $\nu_i$ are independent,

$$f(\nu) = f(\nu_1)\, f(\nu_2) \cdots f(\nu_I). \tag{86}$$

In the presence of data uncertainty the unreliability is given by

$$\bar{u} = \int du \int d\nu\; u\, f(u|\nu)\, f(\nu), \tag{87}$$

and the variance of the associated random walk is

$$\sigma^2 = \int du \int d\nu\, (u - \bar{u})^2 f(u|\nu)\, f(\nu), \tag{88}$$

where we have used the convention

$$\int d\nu = \int d\nu_1 \int d\nu_2 \cdots \int d\nu_I. \tag{89}$$

4.2. Illustrative example

To carry out the Monte Carlo simulation, the PDP 11/44 code discussed earlier was generalized so that at the beginning of each random walk the failure and repair rates are chosen from probability density functions. At present each of these is sampled from independent log-normal distributions, with the spread in data characterized by a factor $c$ as detailed in the Appendix. Generalizations to other distributions or to dependencies between component data are straightforward.

To examine the effects of data uncertainties the problem described in the preceding section was reconsidered using the data in table 1 as the distribution mean values, and with $c = 3$ for all failure and repair rates. The nonanalog results with point and distributed data are compared in columns b and c of table 2. It is apparent that both the time per random walk, $t$, and the sample variance of the result are larger for distributed than for point data. The reasons for this may be understood as follows. The time per random walk is substantially larger because a sampling of the log-normal distributions must be carried out at the beginning of each random walk. That the variance should be larger is clear if we perform some algebra to write eq. (88) in the more instructive form

$$\sigma^2 = \int d\nu\, f(\nu)\, \sigma_\nu^2 + \int d\nu\, f(\nu)\, (\bar{u}_\nu - \bar{u})^2, \tag{90}$$

where $\bar{u}_\nu$ and $\sigma_\nu^2$ with fixed data are given by eqs. (82) and (83). Eq. (90) follows on writing $u - \bar{u} = (u - \bar{u}_\nu) + (\bar{u}_\nu - \bar{u})$ in eq. (88); the cross term vanishes on integration over $u$ by eq. (82). The variance is thus made up of two contributions. The first is due to the variance of the random walk procedure with fixed data, while the second is due to the data uncertainty. If there is no data uncertainty, the second term vanishes. This is seen by taking $\nu$ equal to the mean values, $\nu_0$,

$$f(\nu) = \delta(\nu - \nu_0), \tag{91}$$

in which case $\sigma^2 = \sigma_{\nu_0}^2$. This is just the result obtained in the preceding section. Conversely, if the Monte Carlo simulation of the random walk is replaced by an analytical calculation of $\bar{u}_\nu$, then $\sigma_\nu = 0$ and only the second term remains. This corresponds to the situation where Kinetic Tree methods are combined with Monte Carlo sampling of failure and repair rate data.

4.3. Random walk batching

Substantial improvements in the results given in table 2 have been found to result if, instead of performing just one random walk for each set of data, a batch of $M$ random walks is run for each of $N$ sampled data sets. Substantial variance reduction can be achieved. This is illustrated by the plot of the figure of merit, $1/(S^2 t)$, vs batch size shown in fig. 4.

In batch calculations the unreliability is estimated as an average over N independent batches,

$$\bar{u} = \frac{1}{N} \sum_{n=1}^{N} \bar{u}_{\nu_n}, \tag{92}$$

where $\bar{u}_{\nu_n}$ is the average over the $M$ random walks with data $\nu_n$. The variance in $\bar{u}$ is $S^2/N$, where for large $N$

$$S^2 = \frac{1}{N} \sum_{n=1}^{N} \bar{u}_{\nu_n}^2 - \bar{u}^2 \tag{93}$$

is the sample variance corresponding to eq. (90). For the 10000 batch calculations plotted in fig. 4, we obtain, for example, $\bar{u} \times 10^4 = 0.6878 \pm 0.0085$ and $0.6827 \pm 0.0071$, respectively, for $M = 8$ and 64.
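The batch estimator of eqs. (92) and (93) can be sketched as follows. This is an illustrative outline rather than the PDP 11/44 code: `sample_data` and `run_walk` are hypothetical callbacks, and the toy versions given here (a single log-normally distributed failure probability and a zero-or-one weight) merely stand in for the data sampling and the Markov random walk.

```python
import math
import random


def batch_estimate(sample_data, run_walk, n_batches, batch_size, rng):
    """Estimate the unreliability from N batches of M random walks.

    sample_data(rng) -> one data set (hypothetical callback)
    run_walk(data, rng) -> weight of one random walk (hypothetical callback)
    Returns (u_bar, s2, std_err) with std_err = sqrt(s2 / N).
    """
    batch_means = []
    for _ in range(n_batches):
        data = sample_data(rng)  # the data are sampled once per batch
        total = sum(run_walk(data, rng) for _ in range(batch_size))
        batch_means.append(total / batch_size)
    u_bar = sum(batch_means) / n_batches                         # eq. (92)
    s2 = sum(m * m for m in batch_means) / n_batches - u_bar**2  # eq. (93)
    return u_bar, s2, math.sqrt(max(s2, 0.0) / n_batches)


def toy_sample_data(rng):
    """Toy stand-in: a failure probability with a log-normal spread."""
    return min(1.0, 0.1 * math.exp(rng.gauss(0.0, 0.5)))


def toy_run_walk(p, rng):
    """Toy stand-in for a random walk: weight 1 with probability p, else 0."""
    return 1.0 if rng.random() < p else 0.0


if __name__ == "__main__":
    rng = random.Random(1)
    print(batch_estimate(toy_sample_data, toy_run_walk, 2000, 16, rng))
```

Because the data are drawn once per batch, the cost of data sampling is shared by all $M$ walks of that batch, while the $N$ batch averages remain independent of one another.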

For small batch sizes the improvement in the figure of merit comes from two sources. First, the substantial computational effort required to sample the input data is carried out only once per batch. Thus the time per batch does not rise in proportion to the number of random walks per batch but rather as

$$\tau = a + bM, \tag{94}$$

where $a$ is the data sampling time and $b$ is the time per random walk. For the present problem we find empirically that $a \approx 110$ ms and $b \approx 10$ ms/random walk. Thus for this problem about twelve random walks can be included in a batch before the time per batch doubles. The benefit should be larger as more components are considered.

Fig. 4. Figure of merit vs. batch size M, observed and calculated [calculated results from eqs. (94) and (97)].

Fig. 5. Sample variance vs. batch size M, observed and calculated [calculated result from eq. (97)].

Second, for small batch sizes the variance of the result is substantially reduced as the batch size grows. This is illustrated in fig. 5, where the sample variance is plotted versus batch size. The reason for this is understood by considering $M$ sufficiently large (say > 30) that the central limit theorem can be applied to the batch average. According to the central limit theorem the batch averages form a Gaussian distribution about $\bar{u}_\nu$:

$$f_M(u|\nu) = \frac{1}{\sqrt{2\pi}\,\sigma_{M\nu}} \exp\left\{ -\frac{1}{2} \left[ (u - \bar{u}_\nu)/\sigma_{M\nu} \right]^2 \right\}. \tag{95}$$

The batch variance then is related to the variance of the underlying random walk by

$$\sigma_{M\nu}^2 = \sigma_\nu^2 / M. \tag{96}$$

If the batch quantities are used to evaluate the variance $\sigma^2$ in eq. (88), we obtain, instead of eq. (90),

$$\sigma^2 = \frac{1}{M} \int d\nu\, f(\nu)\, \sigma_\nu^2 + \int d\nu\, f(\nu)\, (\bar{u}_\nu - \bar{u})^2. \tag{97}$$

Thus by taking sufficiently large batches, the variance due to the random walk simulation can be reduced to insignificance compared to that due to the data uncertainty. By comparing the results in fig. 5 with eq. (97) we may estimate the contribution to the variance due to the data uncertainty to be

$$\int d\nu\, f(\nu)\, (\bar{u}_\nu - \bar{u})^2 = 0.51 \times 10^{-8}. \tag{98}$$

Subtracting this result from the variance for M = 1 then yields the contribution from the underlying random walk to be

$$\int d\nu\, f(\nu)\, \sigma_\nu^2 = 1.33 \times 10^{-8}. \tag{99}$$

The dotted line represents the estimate of eq. (97) using these values.
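The separation of the two contributions can also be sketched as a two-point fit of the form $S^2(M) = A/M + B$ suggested by eq. (97), where $A$ is the random walk term and $B$ the data uncertainty term. This is an alternative illustration, not the procedure used above (which reads the data term from fig. 5 and subtracts it from the $M = 1$ variance). The function names are hypothetical, and the quoted $\pm$ values for $M = 8$ and 64 are assumed to be one standard deviation, $\sqrt{S^2/N}$, with $N = 10000$.

```python
def sample_variance_from_error(std_err, n_batches):
    """Recover S^2 from a quoted standard error, assuming std_err = sqrt(S^2 / N)."""
    return n_batches * std_err**2


def split_variance(m1, s2_1, m2, s2_2):
    """Solve S^2(M) = A / M + B for (A, B) from results at two batch sizes.

    A is the random walk contribution, B the data uncertainty contribution.
    """
    a_term = (s2_1 - s2_2) / (1.0 / m1 - 1.0 / m2)
    b_term = s2_1 - a_term / m1
    return a_term, b_term


# Quoted results for M = 8 and M = 64 (values of u_bar carry a factor 1e-4).
s2_8 = sample_variance_from_error(0.0085e-4, 10000)
s2_64 = sample_variance_from_error(0.0071e-4, 10000)
print(split_variance(8, s2_8, 64, s2_64))
```

Under these assumptions the fit gives roughly $A \approx 2.0 \times 10^{-8}$ and $B \approx 0.47 \times 10^{-8}$, of the same order as the 1.33 x 10^-8 of eq. (99) and the data uncertainty term of eq. (98). Exact agreement should not be expected, since the quoted errors are rounded to two digits.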

From the above it is seen that an optimum similar to that illustrated in fig. 4 may be expected to exist for all problems. For too small a batch size the variance $\sigma_\nu^2$ in the first term of eq. (97) causes the figure of merit to deteriorate; for too large a batch size $\sigma^2$ remains essentially constant, so that the dominant effect is the increase of the time per batch with batch size.
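The existence of the optimum can be illustrated by combining the time model of eq. (94) with the variance model of eq. (97). The sketch below is a hedged illustration with hypothetical function names. It takes $a = 110$ ms and $b = 10$ ms from the text and the variance contributions from eqs. (98) and (99), with the exponent of eq. (98) assumed to be $10^{-8}$; only the ratio of the two contributions affects the location of the optimum. Setting the derivative of $(A/M + B)(a + bM)$ to zero gives $M^* = \sqrt{Aa/(Bb)}$.

```python
import math


def time_per_batch(m, a=110.0, b=10.0):
    """Eq. (94): time per batch in ms for a batch of m random walks."""
    return a + b * m


def batch_variance(m, walk_var, data_var):
    """Eq. (97): random walk term divided by m, plus the data uncertainty term."""
    return walk_var / m + data_var


def figure_of_merit(m, walk_var, data_var, a=110.0, b=10.0):
    """Reciprocal of (variance per batch) times (time per batch)."""
    return 1.0 / (batch_variance(m, walk_var, data_var) * time_per_batch(m, a, b))


def optimal_batch_size(walk_var, data_var, a=110.0, b=10.0):
    """Continuous maximizer of the figure of merit, M* = sqrt(A a / (B b))."""
    return math.sqrt(walk_var * a / (data_var * b))


if __name__ == "__main__":
    walk_var, data_var = 1.33e-8, 0.51e-8  # eqs. (99) and (98), exponent assumed
    print(optimal_batch_size(walk_var, data_var))
    for m in (1, 2, 4, 8, 16, 32, 64):
        print(m, figure_of_merit(m, walk_var, data_var))
```

With these assumed inputs the model peaks at a batch size of about five; this is a property of the fitted model under the stated assumptions, not a value quoted in the text.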

5. Discussion

In the foregoing sections we have formulated the evaluation of fault trees for the unreliability of repairable systems as a Markov process suitable for effective simulation by Monte Carlo methods. Following the exposition of analog Monte Carlo simulation, a pair of variance reduction methods was constructed. For an illustrative example these were shown to lead to improvements in a figure of merit for computational efficiency of well over three orders of magnitude.

The Markov Monte Carlo formulation was extended to problems with data uncertainties, and a batching technique was shown to lead to further improvements in the figure of merit. While we have not had the opportunity to make numerical comparisons between Markov Monte Carlo and Kinetic Tree methods which use Monte Carlo data sampling, an observation seems in order. For equal data sampling one would equate the number of Markov Monte Carlo batches to the number of Kinetic Tree trials. If one then chose the batch size just large enough so that the random walk variance could be ignored relative to the variance due to data uncertainty, a fair comparison of computational efficiency would be the Monte Carlo time per batch versus the Kinetic Tree time per trial. This, of course, assumes that a problem is chosen in which component dependencies do not rule out the use of Kinetic Tree methods.

With the encouraging results obtained from this initial study it appears reasonable to generalize the method both to treat more general reliability models and to achieve yet greater computational efficiency. Modeling generalization may proceed along several lines including (1) the inclusion of unavailability as well as unreliability estimates, (2) the restructuring of the computational algorithms to treat the full range of component dependencies with which the Markov process is capable of dealing, and (3) the inclusion of time-dependent failure and/or repair rate sampling. A variety of variance reduction techniques beyond those discussed in section 3 may lead to further improvements in computational efficiency; these may include Russian Roulette and splitting, correlated sampling, and a number of importance sampling techniques.

Acknowledgements

This work was supported in part by the Deutscher Akademischer Austauschdienst. It is based in part on work submitted by the second author in partial fulfillment of the requirements for the Diplom Ingenieur, Universität Stuttgart.

Appendix

The log-normal distribution is widely used in fault tree analysis for the representation of uncertainty in failure and repair rate data [1,13]. The probability density is written as

$$f(\nu) = \frac{1}{\sqrt{2\pi}\,\sigma\nu} \exp\left\{ -\frac{(\ln\nu - \bar{\xi})^2}{2\sigma^2} \right\},$$

where $\nu$ is a failure or repair rate. The parameters $\bar{\xi}$ and $\sigma^2$ are not identical to the mean value $\bar{\nu}$ and the variance of $\nu$. They are determined as follows.

Suppose that there is a 90% probability that $\nu$ lies between $\bar{\nu}/c$ and $\bar{\nu}c$, where $c \geq 1$ is the error factor referred to in the text. Then using the log-normal probability density, it may be shown that the parameters $\bar{\xi}$ and $\sigma^2$ are given by

$$\sigma = \ln c / 1.645,$$

$$\bar{\xi} = \ln\bar{\nu} - 0.5\,\sigma^2.$$

In the Monte Carlo simulation the error function is first sampled using the central limit technique [24], with twelve random numbers $\rho_i$ uniformly distributed between 0 and 1,

$$\xi = \sigma \left[ \sum_{i=1}^{12} \rho_i - 6 \right] + \bar{\xi},$$

and then the value of $\nu$ is determined from $\xi = \ln\nu$.
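The sampling recipe of this Appendix can be sketched in a few lines. This is a minimal illustration, not the original code: the function names and the use of Python's `random` module are assumptions, and the sum of twelve uniform random numbers minus six is only an approximation to a standard normal deviate (it is bounded between -6 and 6).

```python
import math
import random


def lognormal_parameters(mean_rate, error_factor):
    """Return (xi_bar, sigma) from the mean rate and the error factor c."""
    sigma = math.log(error_factor) / 1.645
    xi_bar = math.log(mean_rate) - 0.5 * sigma**2
    return xi_bar, sigma


def sample_rate(mean_rate, error_factor, rng):
    """Sample one failure or repair rate by the central limit technique."""
    xi_bar, sigma = lognormal_parameters(mean_rate, error_factor)
    # Twelve uniform numbers on (0, 1): the sum has mean 6 and unit variance.
    z = sum(rng.random() for _ in range(12)) - 6.0
    xi = sigma * z + xi_bar
    return math.exp(xi)  # invert xi = ln(nu)


if __name__ == "__main__":
    rng = random.Random(0)
    rates = [sample_rate(1.0e-3, 3.0, rng) for _ in range(100000)]
    print(sum(rates) / len(rates))  # should lie close to the mean rate 1.0e-3
```

The offset $-0.5\sigma^2$ in $\bar{\xi}$ is what makes the mean of the sampled rates, rather than their median, agree with the specified mean value.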

References

[1] H.R. Roberts, W.E. Vesely, D.F. Hassl and F.F. Goldberg, Fault Tree Handbook, U.S. Nuclear Regulatory Commission, NUREG-0492 (1981).

[2] E.J. Henley and H. Kumamoto, Reliability Engineering and Risk Assessment (Prentice-Hall, Englewood Cliffs, NJ, 1981).

[3] N.J. McCormick, Reliability and Risk Analysis (Academic Press, New York, 1981).

[4] B.J. Garrik, Principals of unified safety analysis, Nucl. Engrg. Des. 15 (1970) 245-321.

[5] W.E. Vesely, Time dependent methodology for fault tree evaluation, Nucl. Engrg. Des. 13 (1970) 337-357.

[6] W.E. Vesely and R.E. Narum, PREP and KITT: computer codes for the automatic evaluation of a fault tree, USAEC Report IN-1349 (1970).

[7] W.E. Vesely and F.F. Goldberg, FRANTIC--a computer code for time-dependent unavailability analysis, U.S. Nuclear Regulatory Commission Report, NUREG-0193 (1977).

[8] R.C. Erdmann, J.E. Kelley, H.R. Kirch, F.L. Leverenz and E.T. Rumble, A method for quantifying logic models for safety analysis, in: Nuclear Systems Reliability Engineering and Risk Assessment, Eds.: J.B. Fussell and G.R. Burdick, SIAM, Philadelphia (1977).

[9] J.K. Vaurio, Availability on standby safety systems, in: Proc. ANS/ENS Intl. Topical Meeting on Probabilistic Risk Assessment, Sept. 20-24, 1981, Port Chester, NY.

[10] I.A. Papazoglou and E.P. Gyftopoulos, Markovian reliability analysis under uncertainty with an application on the shutdown system of the Clinch River Breeder Reactor, Nucl. Sci. Engrg. 73 (1980) 1-18.


[11] H. Saskawa and M. Sugawara, Application of time-dependent reliability code MARKOV to nuclear power plants, in: Proc. ANS/ENS Intl. Topical Meeting on Probabilistic Risk Assessment, Sept. 20-24, 1981, Port Chester, NY.

[12] R.J. Bartholomew, Failure mode analysis using state variables derived from fault trees with application, in: Proc. ANS/ENS Intl. Topical Meeting on Probabilistic Risk Assessment, Sept. 20-24, 1981, Port Chester, NY.

[13] Reactor Safety Study--An assessment of accident risks in U.S. commercial nuclear power plants, U.S. Nuclear Regulatory Commission Report WASH-1400, NUREG-75/014 (1975).

[14] S.D. Mathews, MOCARS: a Monte Carlo simulation code for determining the distribution of simulation limits, U.S. Energy Research and Development Agency Report TREE-1138 (1977).

[15] R.C. Erdmann et al., Probabilistic Safety Analysis III, Electric Power Research Institute Report EPRI NP-749 (1978).

[16] M. Modarres, N.O. Rasmussen and L. Wolf, Monte Carlo simulation by the DL-MODMC code, Trans. Am. Nucl. Soc. 35 (1980) 395.

[17] D.J. Wakefield and D. Ligon, Quantification of uncertainties in risk assessment using the STADIC-2 code, in: Proc. ANS/ENS Intl. Topical Meeting on Probabilistic Risk Assessment, Sept. 20-24, 1981, Port Chester, NY.

[18] P.S. Jackson, S.H. Lee, S.H. Levinson, M.Y. Yeater, R.W. Hockenbury and P.B. Gerrard, Comparison of the Monte Carlo and systems moments methods for uncertainty analysis, in: Proc. ANS/ENS Intl. Topical Meeting on Probabilistic Risk Assessment, Sept. 20-24, 1981, Port Chester, NY.

[19] J.M. Hammersley and D.C. Handscomb, Monte Carlo Methods (Methuen, London, 1967).

[20] G. Goertzel and M.H. Kalos, Monte Carlo methods in transport problems, in: Progress in Nuclear Energy, Vol. II, Series II, Eds.: G.D.J. Hughes, J.E. Sanders and J. Horwitz (Pergamon Press, New York, 1958).

[21] E.M. Gelbard and J. Spanier, Monte Carlo Principles and Neutron Transport Problems (Addison-Wesley, Reading, Mass., 1969).

[22] D.K. Trubey and B.L. McGill (eds.), A review of the theory and applications of Monte Carlo methods, Oak Ridge National Laboratory Report ORNL/RSIC-44 (1980).

[23] E. Cinlar, Introduction to Stochastic Processes (Prentice-Hall, Englewood Cliffs, New Jersey, 1975).

[24] R.Y. Rubinstein, Simulation and the Monte Carlo Method (John Wiley & Sons, New York, 1981).

[25] MCNP--a general Monte Carlo code for neutron and photon transport, Los Alamos Scientific Laboratory Report LA-7396-M (1981).
