
How to Measure Software Reliability and How Not To


IEEE TRANSACTIONS ON RELIABILITY, VOL. R-28, NO. 2, JUNE 1979 103

How to Measure Software Reliability and How Not To

Bev Littlewood
The City University, London

Key Words- Software reliability, Software errors, Software failure costs, Software life-cycle modeling, Bayesian reliability modeling.

Reader Aids-
Purpose: Critique of existing models and suggestions for future research
Special math needed: None.
Results useful to: Software engineers, Reliability theoreticians.

Summary & Conclusions-The paper criticises the underlying assumptions which have been made in much early modeling of computer software reliability. The following suggestions will improve modeling.
1) Do not apply hardware techniques to software without thinking carefully. Software differs from hardware in important respects; we ignore these at our peril. In particular-
2) Do not use MTTF, MTBF for software, unless certain that they exist. Even then, remember that-
3) Distributions are always more informative than moments or parameters; so try to avoid commitment to a single measure of reliability. Anyway-
4) There are better measures than MTTF. Percentiles and failure rates are more intuitively appealing than means.
5) Software reliability means operational reliability. Who cares how many bugs are in a program? We should be concerned with their effect on its operation. In fact-
6) Bug identification (and elimination) should be separated from reliability measurement, if only to ensure that the measurers do not have a vested interest in getting good results.
7) Use a Bayesian approach and do not be afraid to be subjective. All our statements will ultimately be about our beliefs in the quality of programs.
8) Do not stop at a reliability analysis; try to model life-time utility (or cost) of programs.
9) Now is the time to devote effort to structural models.
10) Structure should be of a kind appropriate to software, e.g. top-down, modular.

1. INTRODUCTION

My intention in writing this paper is to provoke discussion about some aspects of software reliability measurement. The method I shall adopt is one of critical analysis of some previous research, together with suggestions for future directions. In order not to overburden the argument, I shall not bend over backwards to praise the positive aspects of work I criticise; I am sure the authors concerned will be quite capable of performing this task, and in doing so help make for interesting discussion!

In order to forestall criticism, I explicitly state that the aim of this paper is to improve the range of tools available to software managers, not to instigate a theological debate about mathematical niceties. Although some of the criticisms of existing techniques in the following pages rest on fairly subtle mathematical points, I believe that they have important practical implications. They should be judged on these alone.

2. CLASSICAL RELIABILITY MEASURES

Let us begin by looking briefly at the measures which have been used in the hardware field. One of the most careful accounts is still that of Barlow & Proschan [1]. They give two basic measures which will generally have meaning: reliability and availability.

2.1 Reliability

Barlow & Proschan give two definitions. We shall consider here only the more general one: Interval reliability is the probability that at a specified time, T, the system is operating and will continue to operate for an interval of duration x. In many situations, steady-state interval reliability, i.e. the limit of the above as T → ∞, will suffice (so long as it exists).

2.2 Availability

There are two common definitions:
Point availability is the probability that the system will be able to operate within the tolerances at a given instant of time, t.
Interval availability is the s-expected fraction of a given interval of time (a, b) that the system will be able to operate within the tolerances (repair and/or replacement allowed).
Barlow & Proschan define steady-state (limiting) interval availability to be the limit of the above as b → ∞ with a = 0. It is also sometimes sensible to consider a limiting version of point availability.

2.3 How adequate are these?

Given that, for reasons of simplicity, it is necessary to summarise all available information into a single numerical measure of reliability, then one or other of these definitions will be appropriate for most situations. Often, however, there is no such overriding need for simplicity; and more insight into the process of failures will be gained by considering, say, percentiles of time-to-next-failure distributions. This is particularly true of software, as I hope to show below.

A technical criticism can be made of one of the availability definitions which, again, will be particularly serious in the case

0018-9529/79/0600-0103 $00.75 © 1979 IEEE


of software. Consider interval availability: "the s-expected fraction of a given interval of time that the system will ... operate ...". What is really of interest is the actual fraction of time the system will operate, but this is a random variable. Merely quoting the s-expected value of the random variable gives no idea of how much the actual result might deviate from this. We need the distribution of the fraction in order to be able to calculate a tolerance interval. In many practical situations, of course, the steady-state availability is most appropriate. In such cases it might be thought that the fraction of an interval (0, T) that the system will operate would converge in probability, for large T, to the steady-state s-expected fraction (i.e. the steady-state availability). This result would normally be established using Tchebycheff's inequality [3, p 46]. Unfortunately, convergence cannot be guaranteed. In fact, if we model the system's behaviour by an alternating renewal process [4, p 80 et seq.] with the two types of time intervals representing operating and repairing, then it can be shown that the fraction of time spent working does converge in probability to μ_w/(μ_w + μ_r) (where μ_w and μ_r are the means of the time-to-failure and time-to-repair distributions) as long as these means exist. If the distributions do not have moments, convergence may still take place, but it is possible to construct examples where it does not. In such a situation, where the steady-state interval reliability does not have the interpretation of (probabilistic) limiting fraction of time operating, it is difficult to assign it any practical meaning. I contend that we are more likely to encounter this kind of difficulty with software than with hardware.

These comments can be summarised in the following: we should be extremely careful of replacing the wealth of information in a probability distribution with single summaries - whether these be parameters or moments of the distributions.

It is worth mentioning the origin of the conventional obsession with these summaries. In hardware reliability there is justification for thinking that certain complex devices might follow an exponential failure law [1, pp 18-22]. In such a case, the mean time to failure (or failure rate) totally describes the failure behaviour and comments such as those above no longer hold. Unfortunately, such justification does not apply to software. It remains an open question whether any failure law can be developed which reflects the nature and structure of software; but in the absence of such a law we should beware of a blind and ignorant adoption of those hardware reliability concepts which have the exponential distribution as their basis.

3. SOFTWARE RELIABILITY MEASUREMENT

There are instances, then, where the hardware reliability experience has proved a mixed blessing for software reliability. In what follows I shall give a personal view of some aspects of software reliability measurement.

3.1 Bugs bared

At some risk of oversimplifying, bugs (errors) can be defined as those defects in the program which cause failures in its dynamic operation. As such, they are the things which the software engineer will try to eliminate during program testing and development. Some would prefer to concentrate effort on ensuring they never get into the program in the first place (see for example Mills [13, 14]).

I am not convinced that the advocates of measuring reliability via bug-counting have presented a strong case. Our objective should be to measure the quality of the behaviour of the software, its operational reliability, rather than the quality of its state. A good program is defined in terms of what it does, not what it is.

Of course, when the program fails in some way, it is the software engineer's job to find the cause(s) of that failure; he has to eliminate bugs. Although the bug eliminator and the reliability measurer might be embodied in the same person, the operations are quite different and only confusion ensues from combining them. I suspect that the distinction might have important economic and social consequences; would it not be a great help in getting better quality software to insist that the reliability improvement/measurement interface coincide with that of the contractor/customer? I have no direct experience of USA practice, but I know of cases in the UK where contractors have acted as their own judge and jury.

It can be argued that the state of a program (number of bugs) determines its performance (operational reliability), but any relationship here is likely to be very complicated and unknown. Certainly the kind of assumptions which have been made seem very naive; thus Shooman [18] says:

"...the software failure rate (crash rate) is proportional to the number of remaining errors."

I have never seen a program where this assumption could be valid. It is easy to imagine a scenario where a program with two bugs in little exercised portions of code is more reliable than a program with only one frequently encountered bug.

I concede that it might sometimes be possible to model realistically the relationship between operational reliability and number of residual bugs. But why bother? Why introduce this extra risk of modeling error, when we can do anything we want in terms of operational reliability, directly? Thus, for example, if we want to estimate the debugging time needed to obtain a specified operational reliability, we can achieve it via a suitable model based upon operational reliability (see, for example, [11, p 113]).

3.2 How to measure operational reliability

Because operational reliability is what we should be measuring, we now have to consider ways of doing this. Although the problem is a dual one of modeling and estimation, my present concern is with the former, since it is in the modeling stage that I believe we must be concerned with some unique properties of software.

I must consider a renewal process in continuous time; the successive renewals represent successive failures of the software. For simplicity assume initially that repairs are instantaneous (there is surprisingly little information available about repair times). It is probably worth mentioning, also, that time should be execution time, Musa [16].


Such renewal processes can be characterised either in terms of their successive inter-event times, or via the numbers of events in fixed time intervals. The former method is more appropriate for my purposes. I contend that the distributions of times between failures may have unusual properties, in particular their moments may not exist (viz, are infinite). If this were true then classical measures such as mean-time-to-next-failure (MTTF), mean-time-between-failures (MTBF) would be infinite, and any estimates of them (although finite themselves) would be meaningless.

This assertion is quite revolutionary, and it would be pleasant to be able to say that I have evidence to support it from actual software failure data. Unfortunately this is not the case. As far as I am aware there is no good statistical test of the hypothesis that a set of data comes from a momentless population. In any case, the nature of the problem is likely to require such a test be based upon a large amount of data. Software failure data are still notoriously difficult to obtain in large quantities.

Support for this idea, then, must be analytic rather than directly evidential. In the first place it could be asked, slightly frivolously, what evidence there is for the existence of moments. Those people who wish to estimate MTTF should be asked to furnish evidence that the quantity they are estimating does indeed exist. One can always obtain a finite average of some data, but it may not estimate a population mean. More seriously, there is the unique property of software that it suffers no natural degradation; once perfect, it will never fail. If we concede, therefore, that there is some chance (however small) that the program is perfect, then the mean time to failure must be infinite [6]. This case is so extreme that not even fractional moments would exist. But is it, indeed, such an extreme assumption? When we come to look at the problem of incorporating structural information into our reliability modeling we shall consider modular programming - it does not seem unreasonable to believe that a small enough module might be perfect. Mills goes much further, asserting that top-down structuring is likely to produce perfect programs, even when they are large [14]:

"The new reality is that you can learn to consistently write programs which are correct ab initio, and prove to be error free in their debugging and subsequent use."

It is possible to construct models which have momentless time-to-failure distributions as a consequence of quite unexceptional assumptions about the properties of software. My own (with coauthor) model [10, 12] is one such case. In this work he and I took failure rate as the measure of interest, and modeled the improvement of reliability in a subjective, Bayesian fashion (cf. the error removal models of Shooman, et al. [17, 18, 20], and Jelinski & Moranda [5]). This model has great flexibility and allows exact distributions of time-to-failure to be computed, with associated percentiles, medians, etc., in addition to failure rate measures. If we accept the plausibility of models of this kind, it seems to me that we should not baulk at any consequences, such as infinite mean time to failure.

These considerations about the possible non-existence of the mean time to failure may cause difficulties with the definition of availability, as mentioned earlier. Even if the mean does not exist, it might be that the fraction of time available converges in probability to some constant, but this cannot be assumed. Detailed knowledge of the distributions of time-to-failure and time-to-repair will be needed. Scant attention seems to have been paid to repair-time distributions in the literature.

If we are to eschew the use of mean time-to-failure, what measures are left? I have already argued that we should be altogether less obsessed with single measures; preferring instead distributions from which we can calculate, if required, many appropriate measures. Thus from a time-to-next-failure distribution we could quote the probability that the time-to-next-failure exceeded any required value. Even more attractive are tolerance bounds for the time-to-next-failure. These refer to the quantity of most practical interest, the time at which the next failure will occur, and can be calculated at any appropriate level: 50%, 80%, 90% etc. [10, 11]. It seems to me that they have a greater intuitive appeal than MTTF (even when it exists). After all, how can a MTTF be used, except in conjunction with an assumption about the distribution, usually exponential, of times to failure? The case against MTTF is even stronger in practice, since most models end with s-confidence bounds for MTTF being quoted [16]. What practical conclusions can be drawn from the fact that a particular interval contains the true MTTF with, say, 90% s-confidence? All of this illustrates a non-mathematical requirement of our modeling which must never be forgotten: our work must result in management tools which are simple, easy to use, and intuitively meaningful.

A measure of quality which exists under wide conditions, and has a direct intuitive meaning, is the failure (hazard) rate. Indeed, for the decreasing failure rate [1] situations which we hope to encounter in software, this has a more reasonable interpretation than instantaneous mean time-to-failure.

3.3 Software engineers should be Bayesians

In this section I hope to convince you that we should use Bayesian interpretations and methods for software reliability.

The subjective interpretation of probability, which is usually associated with the Bayesian school, seems more appropriate for software than a frequentist approach. Because each program is unique, there is usually no sense in which one can envisage an ensemble of repetitions of the reliability measuring operation upon which a frequentist interpretation would depend. Since there appears to be considerable disagreement about these matters, it is worth considering them in more detail. The best place to start is by carefully analysing the origins of the uncertainty (randomness) concerning the failure behaviour of software.

Why cannot the failure times of a program be predicted exactly? If we knew how the program behaved for every conceivable input, and could predict future inputs, then I suppose it would be possible to predict the next failure epoch. Unfortunately we never have such a total knowledge. Most authors, therefore, agree that the failure process is random. But what


[Fig. 1. The program is a mapping of I into O, p: I → O.]
[Fig. 2. Two programs, p1 and p2, as different mappings of the same input space I into the same output space O.]

are the precise sources of this randomness? The most widely used conceptual model of software is the input-program-output model (Fig. 1). This is usually interpreted as follows: some (random) mechanism selects points from the input space to be processed by the program and produces points of the output space. The program is a mapping of I into O, that is p: I → O. We observe failures in the output whenever the program receives input from the subset I_F; this subset is encountered randomly, and thus the failures in the output space occur randomly. Thus, if we know the properties of the program totally, it might be reasonable to assume that the failure process would be random and reflect the fluctuations of the input data stream. It could be modeled by a Poisson process, for example. It is at this point that the reasoning about this model usually stops; the only source of randomness is seen in the inputs.

But what about the program itself, the mapping of I into O? Surely there is also uncertainty about the nature of the mapping? Imagine that two programming teams have been set the same task, each to write a program to the same specification. The resulting programs (p1, p2) then operate in identical environments (i.e. have the same input space, I) and their outputs are compared (e.g. by a quality-engineer or customer). The comparator will, from the program specification, define a single set of correct outputs, and a single set of incorrect outputs (failures), O_F, which together form the total output set, O. Since the input set is the same for both programs, the difference between them is revealed as a difference between two mappings from I to O, with the same subset O_F defining potential failures in each case (Fig. 2). In other words, the two programs will differ in the way they partition the input space into a region I_F, which will produce failures, and its complement. Our uncertainty about the program, then, can be regarded as uncertainty about the nature of I_F. We might go further and suggest that the size of I_F is related to the failure rate of the program.

If we try to predict whether a failure will occur at a particular time, we now must consider two sources of uncertainty. We shall not be able to predict which part of the input space will be encountered, and we shall not be able to tell whether a given input will lie inside I_F for the program. It might be possible to describe the first type of uncertainty in a frequentist fashion; we could argue, for example, that for a given program, the limiting proportion of inputs which result in a failure defines the probability that an unknown input will result in failure. But it does not seem to me that this argument could be used to define frequentist probability statements about the program itself. In most cases we shall only write one program; if we wish to talk about the probability that a particular input lies inside I_F for this program (or, more precisely, that the random set I_F contains the input under consideration), we must use a subjective interpretation.

These effects which necessitate a subjective view of probability statements become more pronounced in practice because of the effect of programmer intervention after failure. If the program is changed, in a bug-fixing attempt, the partitioning of I changes: I_F becomes I'_F. Although the intention is to improve reliability by removing sources of failure, i.e. making I'_F ⊂ I_F, this cannot be guaranteed. Indeed, the bug fixing operation is itself a new source of uncertainty.

The problem, then, is to incorporate the ideas of this conceptual model into a detailed mathematical model of software reliability. This can be done in many ways; it seems to have been attempted only once, by John Verrall & myself [10, 11]. The brief details of this work are as follows. It is assumed that the first source of uncertainty mentioned above can be described by a Poisson process (i.e. inputs from I_F occur as points in a Poisson process in time). This can be justified if points in I_F are selected randomly, and I_F is small. The failure rate of the Poisson process, λ, represents the size of I_F. Since the second source of uncertainty concerns this failure rate, it is treated as a random variable. The unconditional failure process therefore compounds these two sources of uncertainty, and results in a point process which is not Poisson (viz, times-to-next-failure are not exponentially distributed). Uncertainty about the efficacy of the bug-fixing operation is also included in the model.
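A compound process of this general kind is easy to simulate. The sketch below is my own illustration in the spirit of the model just described, not the actual Littlewood-Verrall construction: the failure rate λ is drawn from a gamma distribution (uncertainty about the program), and the time to the next failure is exponential given λ (uncertainty about the inputs). The resulting marginal distribution is Pareto-like; for gamma shape ≤ 1 its mean is infinite, while percentiles (tolerance bounds) remain perfectly well defined. All parameter choices are illustrative.

```python
import random

def time_to_next_failure(shape, scale):
    """One draw from a doubly stochastic failure model: gamma-distributed
    rate (program uncertainty), exponential time given the rate (input
    uncertainty).  For shape <= 1 the marginal mean ('MTTF') is infinite."""
    lam = random.gammavariate(shape, scale)
    return random.expovariate(lam)

random.seed(2)
draws = sorted(time_to_next_failure(2.0, 1.0) for _ in range(100_000))

# Percentiles exist whether or not moments do; with shape 2, scale 1 the
# marginal survival function is (1 + t)**-2, so theoretically:
median = draws[50_000]   # ≈ sqrt(2) - 1 ≈ 0.414
p90 = draws[90_000]      # ≈ sqrt(10) - 1 ≈ 2.162
```

This is exactly the kind of calculation the paper recommends: quote the 50% or 90% tolerance bound for the time-to-next-failure rather than commit to a mean that may not exist.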


instantaneous,s trxx failure r-atefailure rate

(hazard rate) (hazard rate)

i th timitth (failure failure time i th ( i+l)thfailure failure time

Fig. 3. Reliability growth with effective bug-fixing.Fig. 4. Ineffective bug-flxing.

To summarise: there are two sources of uncertainty about 4. FUTURE WORKprogram reliability, relating to the input space and the programitself. Since the latter can only be described subjectively, any Apart from these questions concerning the ability of modelsglobal probability statements which incorporate both types of to achieve their desired objectives, it is striking that these objec-uncertainty must be interpreted subjectively. tives should have been so restricted. There are at least two

At the beginning of this section, I stated that we should use Bayesian interpretations and methods for software reliability. I shall now try to justify the second assertion, using my Bayesian model [10, 12] as an illustration. The essence of Bayes Theorem is that it provides a means of continuously updating previous reliability measurements in the light of new data. Consider the kind of calculation which is possible with our model. Figure 3 gives a portion of the plot of failure rate which can be obtained, showing the development as time passes. Thus, prior to failure i, whilst the program is working without fault, faith in it increases and the failure rate falls. When failure i occurs, faith drops and the failure rate increases, but immediately falls by a finite amount owing to the efficacy of the repair. After the (instantaneous) repair the program works again in continuous time, and during this period the failure rate again falls continuously. If our faith in the skill of the debugger is not sufficiently great to overcome our pessimism at the occurrence of a failure, then a plot such as Fig. 4 will result. In fact it is quite possible that faith in a program lessens progressively if failures are sufficiently frequent, a situation which may occur in practice (Shooman & Natarajan [20, footnote p 164]). The strength of the model lies in its ability to produce the appropriate answer automatically; reliability growth or reliability decay does not need to be input a priori.

The most important advantage of Bayesian methods, though, lies in their ability to reflect that attitude to programming which has Mills as its best proponent. He states [13]: "... never finding the first error gives more confidence than finding the last error". The Bayesian, no news is good news, property represents this exactly: periods of failure-free working cause the reliability to improve. It is interesting to compare this with the bug-counting methods of Jelinski & Moranda, and Shooman et al.; here reliability improvement can only take place at a failure, since it is only at such a time that an error can be removed. Thus they directly contradict the Mills doctrine: they have increasing faith in the program as failures occur!

... areas where great rewards might be gained by considering wider problems than black-box failure point processes.

4.1 Reliability versus Utility

A surprising omission from most of the software literature is a concern for the consequence of failures. This, again, may be a result of the close connection with hardware reliability, which has traditionally (and likewise unfortunately) concentrated on modeling the failure process. The omission is rectified to some extent in the wider software engineering context: management techniques, on the whole, are cost-conscious. This literature, unfortunately, tends to be more qualitative than quantitative.

Hardware redundancy allows us to make a system as reliable as we desire by using sufficiently many components of a given unreliability. It is thus possible to compare the improvement in life-cycle cost with the extra system cost needed to make that improvement. Since we do not have such techniques for software (and we may never get them), it seems to me particularly important that we at least estimate life-time costs of our programs. I wonder how many projects we would embark on if we based decisions on life-cycle costs rather than development costs. Buzen et al. [2] compared virtual machine and conventional operating systems using linear utility functions. My own work [9] attaches a random cost to each failure mode. Both these approaches, however, assume fairly specialised program structure; what is needed is a simple-to-use life-cycle cost model for quite general programs.

It is often argued, against this kind of approach, that we hardly ever have much information about costs. This seems unnecessarily defeatist; after all, merely counting failures is equivalent to giving a unit cost to each. Surely we always know better than that?!
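To make the point concrete, here is a minimal sketch (an editorial illustration, not a model from this paper; the failure modes, rates, and costs below are invented). Under a Poisson assumption, expected life-cycle failure cost is just a rate-times-cost sum, and "merely counting failures" is the special case in which every cost equals one:

```python
# Hedged sketch: expected failure cost over a mission of length T,
# assuming (hypothetically) that each failure mode occurs as a
# Poisson process with a known rate and carries an average cost.
# The mode names, rates, and costs are invented for illustration.

def expected_cost(modes, mission_hours):
    """Expected total cost: sum over modes of rate * cost * mission length."""
    return sum(rate * cost * mission_hours for rate, cost in modes.values())

modes = {
    "round-off error": (1e-3, 5.0),     # frequent but cheap
    "system crash":    (1e-5, 5000.0),  # rare but expensive
}

T = 10_000  # hours of operational use

# Cost-weighted view: the rare crash dominates the total.
print(expected_cost(modes, T))

# "Unit cost per failure" view, i.e. merely counting failures:
unit = {m: (rate, 1.0) for m, (rate, _) in modes.items()}
print(expected_cost(unit, T))
```

Note how the cost-weighted total is dominated by the rare crash, while unit-costing every failure would rank the cheap round-off errors as the larger problem.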


4.2 Structural models

Most of the earlier part of this paper was concerned with measuring the quality of a single, black-box program: the equivalent of a device or component in hardware terminology. Fortunately, we usually have a large amount of information available about the structure of the program, and it is sensible to use this in our reliability modeling. This does not, however, seem to be an easy task. Most attempts so far have looked at a fairly specialised structure, but the real goal is the modeling of general systems.

One of the most interesting attempts to go beyond the black-box approach is that of [2] for virtual machine techniques. They [2] are able to prove the superiority of the virtual machine approach in the organisation of a particular operating system under quite weak assumptions. My own work [7 - 9] has treated modular programs by assuming a Markov, or semi-Markov, dynamic behaviour; I show (under a plausible limiting operation) that failures of the overall program will occur approximately as a Poisson process. Shooman [19] has developed a model which incorporates microstructure via the execution frequencies and failure probabilities of paths within the program.

All of this work is dogged by the difficulty of translating into dynamic behaviour the structural information which is present in static forms such as program listings. However, it is beginning to be obvious that techniques such as modularity and top-down structuring are gaining in popularity for a good practical reason: they deliver the goods. If they do become more widely used, the following intriguing question becomes of interest: Do these methodologies inevitably result in the final program having a particular failure law? The case of top-down structuring is particularly interesting in view of the simplicity of the structure - three basic program structures [13]. Alternatively, in modular terminology, where we imagine a program to be comprised of subprograms which in turn are comprised of sub-subprograms, what is the relationship between the reliability of a module and the reliabilities of the submodules it comprises? It seems likely that modules and submodules have the same failure law (with, presumably, different parameter values), because the methodology used in their creation is similar. What would such a failure law be? An answer in a special case is given in my Markov model, where it is proved that if the subprograms follow a Poisson law, then the program itself will also (in the limit). It would be of great interest to see whether there are other cases where the form of the failure law is dictated by program structure.

ACKNOWLEDGMENT

An earlier version of this paper was presented at the Software Life Cycle Management Workshop, Airlie, VA USA in 1977 August. I am grateful for the long discussions and arguments I had there; in particular with Bob McHenry, Marty Shooman, John Musa, Francis Parr, and Barry Boehm.

Referee Comments

W.E. Thompson
P.O. Chelson

Littlewood states his purpose is to provoke discussion about some aspects of software reliability measurement. Discussion is very much needed at the present state of definition and modeling of software reliability and he is well qualified to lead off.

A key point made by Littlewood is that "software reliability means operational reliability," and he points out that we do not know how to define software reliability in terms other than performance of the system in which software is executed. One bug, two bugs, corrected - no one knows how to characterize a software corrective action other than in terms of resulting operation. The concept of software containing exactly N bugs, each of which requires identification and correction, is misleading in the application to software reliability measurement and testing. Once one decides, as Littlewood suggests, not to attempt to count bugs and instead to characterize software reliability in terms of operational performance, then software reliability can be defined as the probability that there will occur no software-related malfunctions in a given time interval. Then all the existing body of statistical techniques relating to analysis of time series can be applied.

Littlewood states, "We must consider a renewal-type process ...; the successive failures of the software." Our argument is with the terminology, "failure of the software." This terminology is misleading. In fact, the software does not fail and Littlewood does not mean that it does. An operational malfunction occurs - a malfunction being ultimately any feature of operation that is unacceptable to the user and which occurs as a point event in time. When that malfunction is traced to a property of the software, then the malfunction can be said to be software related, but we do not mean that the computer software fails in the same sense as the operational malfunction occurred.

Littlewood recommends a Bayesian approach both in a subjective interpretation of probability as a degree of belief and in the repeated application of Bayes theorem to the inference problem to generate a posterior distribution of the reliability parameters. This makes sense and has the further advantage that the posterior distribution of reliability is a natural point of departure for modeling risk or cost in final testing, for example.

Littlewood points out the fact that if the prior probability of zero errors is greater than zero, then the posterior mean life may be infinite. The current trend is to ignore the possibility of infinite mean life in reliability models, even though the potential of infinite mean life is real when one considers that there is a non-zero probability that some software modules (or hardware not subject to wear out) could be free from sources of failure at the onset of operational use.

The criticism of most current popular models can be extended in that (for the same reason just mentioned) these models do not behave properly in the limit as software becomes more and more reliable - in particular, as the last several bugs are removed from a program. The software reliability models of Jelinski-Moranda, Shick-Wolverton, and Shooman, for example, are meant for software systems in development with a large number of initial errors and thus a large malfunction and correction rate. The fact that these may give acceptable answers is probably more a result of the robustness of the rate parameter and process model than an accurate assessment of the number of errors in the software.

Littlewood's arguments for not trying to measure software bugs, but instead measuring operational reliability, are valid. But if an attempt is made to account for errors present, his arguments suggest that it would be more realistic to include a measure of the importance or impact of each error. If operational reliability is our concern, then not only would a program with two bugs in little exercised portions of code be more reliable than a program with only one frequently encountered bug, but code in which an error resulted only in an infrequent round-off error might be considered more reliable than one in which the error results in a system crash. It is true that this effect fits more in the category of system effectiveness, but it is ignored by most modelers in the field today.

Littlewood, in his paper, has surveyed the current state of software reliability modeling in a harsh but accurate manner. All practitioners in this field would do well to heed his comments.

Referee Comments

S.J. Amster

The following comments are concerned only with the 'practical implications' of this paper:

1. It is better to know F (the time-to-failure distribution) than just M (its mean). But a good estimate of M is more useful than an estimate of F which requires additional assumptions of doubtful validity.
2. Distributions without means have interesting mathematical properties but are not the real problem. The only cited reference for "perfect" programs is H.D. Mills, whose paper has many qualifiers. Also, even if such a program was written, failures are caused by environmental changes, improvements, new inputs, etc.
3. The Bayesian approach depends on an ensemble of programs which could have been written but were not. Since the data come from a sampling frame of one single program, the relevance of this hypothetical ensemble is not clear.
4. Despite some nice mathematical results, additional components are seldom used to increase hardware reliability. Improved design and manufacturing is the usual pragmatic approach.
5. Naive models with inaccurate data do not produce good decisions. Hardware history indicates that 'a simple-to-use life cycle cost model for quite general programs' is a futuristic dream.

Despite my negative tone, I consider this a stimulating paper and worthy of several readings. Many important nuances of the real world cannot be captured in a precise mathematical framework. But, from such attempts, we can learn.

Author Reply

I would like to make a few comments on the referees' remarks, and clear up a few misunderstandings.

I cannot agree with Amster's first comment. How can a good estimate of M be used when F is unknown? I realise that in hardware reliability MTBF has an almost mystical significance, but even here there is usually an unacknowledged (often unconscious) assumption that F is negative exponential. In all the bug-counting models referenced in my paper, such exponential assumptions are made. The issue, then, is not whether to make assumptions about F (we all have to do that), but whether these assumptions should be openly used in our quantitative analysis. I believe that if we quote F we are leaving our customers free to use whatever measures are most convenient and informative for their purposes.

Amster's third comment completely misunderstands my argument, and I must take some blame for this. My hypothetical experiment involving two programmers working to the same specification was intended only to suggest that there is variability, or unpredictability, in the program as well as the input stream. I then go on to argue that this unpredictability can only be modeled via Bayesian, subjectivist probability - the reason being that in practice we only write one program and cannot define the hypothetical ensemble of programs required for a frequentist interpretation. It is the frequentist interpretation which depends on programs which could have been written but were not.

I wonder whether Amster's objection is not to mathematical modeling per se? We all agree that naive models will not produce good decisions, but I believe the answer is to make our models less naive. My intention was to pinpoint naiveties and suggest improvements. The danger in not having formal models is that decisions will be made on hunch and prejudice; these are very naive.

Thompson & Chelson argue that 'failure' is a misleading word to apply to software. I agree. In my early work with Verrall we discussed this issue at some length and adopted a subjective definition: a failure is any event which the user thinks of as a malfunction. Certainly there is less agreement about what constitutes a failure in software than in hardware. In particular, we shall hardly ever experience the dichotomous, on/off, behaviour which is allegedly common for hardware. This is an added reason for the importance of a study of software failure consequences.

I thank the referees for their careful reading and comments. I hope other readers who have opinions on these issues will be encouraged to continue the debate.
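The F-versus-M point above can be illustrated with a small sketch (an editorial illustration, assuming a hypothetical Pareto failure-time law, not one of the models discussed in the paper): when the shape parameter is at most 1, the mean time to failure does not exist, yet the distribution F, and every quantile of it, remains perfectly usable.

```python
# Hedged sketch: a failure-time distribution F can be well defined
# while its mean M ("MTBF") does not exist. A Pareto distribution
# with shape alpha <= 1 has infinite mean, but finite quantiles.
import random

def pareto_sample(alpha, xm=1.0):
    """Draw one failure time from Pareto(alpha, xm) by inversion sampling."""
    u = random.random()
    return xm / (1.0 - u) ** (1.0 / alpha)

alpha = 0.8                            # shape <= 1  =>  E[X] is infinite
median = 1.0 / 0.5 ** (1.0 / alpha)    # the x with F(x) = 0.5 is still finite

random.seed(1)
samples = sorted(pareto_sample(alpha) for _ in range(100_000))

print("theoretical median:", median)
print("sample median:     ", samples[len(samples) // 2])
# The sample mean, by contrast, drifts upward with the sample size and
# never settles: quoting an "MTBF" for this F would be meaningless.
```

The sample median converges on the theoretical one; a sample mean computed from the same data does not converge to anything, which is exactly why a single moment can be a dangerous summary of F.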


REFERENCES

[1] R.E. Barlow, F. Proschan, Mathematical Theory of Reliability. New York: Wiley, 1965.
[2] J.P. Buzen, P.P. Chen, R.P. Goldberg, "Virtual machine techniques for improving system reliability", in [21, pp 12-17].
[3] K.L. Chung, A Course in Probability Theory. New York: Harcourt, Brace and World, 1968.
[4] D.R. Cox, Renewal Theory. London: Methuen, 1962.
[5] Z. Jelinski, P.B. Moranda, "Software reliability research", Statistical Computer Performance Evaluation, Ed.: W. Freiberger. New York: Academic, 1972, pp 465-484.
[6] B. Littlewood, "MTBF is meaningless in software reliability", (letter) IEEE Trans. Reliability, vol R-24, 1975 Apr, p 82.
[7] B. Littlewood, "A reliability model for Markov structured software", Proc. 1975 International Conf. Reliable Software, Los Angeles, Cal., 1975 Apr 21-23, pp 204-207.
[8] B. Littlewood, "A reliability model for systems with Markov structure", Applied Statistics (J. Roy. Statist. Soc., Series C), vol 24, No. 2, 1975, pp 172-177.
[9] B. Littlewood, "A semi-Markov model for software reliability with failure costs", Proc. Symp. Computer Software Engineering, New York, N.Y., 1976 Apr 20-22, pp 281-300.
[10] B. Littlewood, J.L. Verrall, "A Bayesian reliability growth model for computer software", Applied Statistics (J. Roy. Statist. Soc., Series C), vol 22, No. 3, 1973, pp 332-346.
[11] B. Littlewood, J.L. Verrall, "A Bayesian reliability model with a stochastically monotone failure rate", IEEE Trans. Reliability, vol R-23, 1974 Jun, pp 108-114.
[12] B. Littlewood, J.L. Verrall, "A Bayesian reliability growth model for computer software", in [21, pp 70-76].
[13] H.D. Mills, "On the development of large programs", in [21, pp 155-159].
[14] H.D. Mills, "How to write correct programs and know it", same source as [7], pp 363-370.
[15] D.E. Morgan, D.J. Taylor, "A survey of methods of achieving reliable software", Computer, vol 10, 1977 Feb, pp 44-53.
[16] J.D. Musa, "A theory of software reliability and its application", IEEE Trans. Software Engineering, vol SE-1, 1975 Sep, pp 312-327.
[17] M. Shooman, "Probabilistic models for software reliability and prediction", same source as [5], pp 485-502.
[18] M. Shooman, "Operational testing and software reliability estimation during program development", in [21, pp 51-57].
[19] M. Shooman, "Structural models for software reliability prediction", Proc. 2nd International Conf. Software Engineering, San Francisco, 1976 Oct, pp 268-280.
[20] M. Shooman, S. Natarajan, "Effect of manpower deployment and bug generation on software error models", same source as [9], pp 155-170.
[21] Record, 1973 IEEE Symp. Computer Software Reliability, New York, N.Y., 1973 Apr 30 - May 2.

AUTHOR

Dr. B. Littlewood; Mathematics Department; The City University; Northampton Square; London EC1V 0HB ENGLAND.

Bev Littlewood holds BSc and MSc degrees from the University of London in Mathematics and Statistics, respectively, and a PhD from The City University, London in Statistics and Computer Science. His teaching and research interests are in Applied Probability. He is a Fellow of the Royal Statistical Society.

Manuscript TR77-150 received 1977 November 29; revised 1978 August 3. Formal Referee comments received 1978 September 8 & 18. Formal author reply received 1979 February 28.
