Availability, Reliability, and Survivability: An Introduction and Some Contractual Implications

Dr. Thomas Ward Morgan, CACI Federal
Dr. Jack Murphy, DeXIsive Inc.

CrossTalk: The Journal of Defense Software Engineering, March 2006

This article is directed toward information technology professionals that enter into contractual agreements requiring service-level agreements (SLAs) that specify availability, reliability, or survivability objectives. Its purpose is to show a relationship between cost, performance, and SLA levels established by the customer.

Information technology (IT) outsourcing arrangements frequently employ service-level agreements (SLAs) that use terms such as availability and reliability. The intent is that the buyer requests a specific system availability and reliability (e.g., 98 percent to 99.9 percent, and 85 percent to 90 percent, respectively). The service provider is typically rewarded for exceeding specified limits and/or punished for falling below these limits.

In recent years, another term, survivability, has become popular and is used to express yet another objective: the ability of a system to continue functioning after the failure of one of its components. This article examines these terms so buyer and seller can understand and use them in a contractual context and designers/operators can choose optimal approaches to satisfying the SLAs.

The northeastern U.S. power grid failure in August 2003 drew attention to the availability, reliability, and survivability of business-critical IT systems. Catastrophe can be the catalyst for new thinking about the survivability of IT systems.

From the buyer's perspective, an increase in availability, reliability, and survivability comes at a price: 100 percent is not possible, but 98 percent might be affordable and adequate while 99.99 percent might be unaffordable or excessive. From the service provider's perspective, under-engineering or inadequate operating practices can result in penalties for failing to meet SLAs.

Availability

What Is Availability?

Availability is influenced by the following:

• Component Reliability. A measure of the expected time between component failures. Component reliability is affected by electromechanical failures as well as component-level software failure.

• System Design. The manner in which components are interrelated to satisfy required functionality and reliability. Designers can enhance availability through judicious use of redundancy in the arrangement of system components.

• Operational Practices. Operational practices come into play after the system is designed and implemented with selected components. Once the design is fixed and the components are installed, operational practices are the only factor that can improve or degrade availability.

Informally, system availability is the expected ratio of uptime to total elapsed time. More precisely, availability is the ratio of uptime to the sum of uptime, scheduled downtime, and unscheduled downtime:

A1 = Uptime / (Uptime + Downtime)    (1)

Formula (1) is useful for measuring availability over a given period of time such as a calendar quarter, but not very useful for predicting availability or engineering a system to satisfy availability requirements. For this purpose, system designers frequently employ a model based on mean time between failure (MTBF) and mean time to repair (MTTR), usually expressed in units of hours:

A2 = MTBF / (MTBF + MTTR)    (2)

Formula (2) is analogous to formula (1), but is based on statistical measures instead of direct observation. Most vendors publish MTBF data. MTTR can often be estimated from historical data. Interestingly, MTTR is partially within the control of the system operator. For example, the system operator may establish a strategy for spares or provide more training to the support staff to reduce the MTTR. Because of these factors, component vendors typically do not publish MTTRs. Finally, it should be noted that A2 does not explicitly account for scheduled downtime.
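
As an illustration, the following minimal Python sketch (not from the article; the helper name and the MTBF and MTTR figures are hypothetical) evaluates formula (2):

def availability_a2(mtbf_hours: float, mttr_hours: float) -> float:
    """A2 = MTBF / (MTBF + MTTR), per formula (2)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures: a component with a 50,000-hour published MTBF and a
# 4-hour MTTR (spares on site, trained support staff).
print(f"A2 = {availability_a2(50_000, 4):.5%}")  # ~99.992%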

The most common approach to include scheduled maintenance time is to include it in the total time represented in the mathematical model's denominator, thus reducing expected availability commensurately. This provides an additional challenge for the designer, but like MTTR it is somewhat controllable through operational procedures. If the system can be designed for only infrequent preventive maintenance, then availability is enhanced.

In most operational environments, a system is allowed to operate normally, including unscheduled outages, for some fixed period of time, t, after which it is brought down for maintenance for some small fraction of that time, λt (see Figure 1).

If the scheduled maintenance is periodic and on a predictable schedule, then following t hours there is a scheduled outage of λt, so that the fraction of time that the system is not in maintenance is:

t / (t + λt) = 1 / (1 + λ) (3)

Since the denominator in A2 does not include the time in preventive maintenance, an adjustment to formula (2) is needed whenever preventive maintenance is part of the operational routine. To include this time, the denominator needs to be increased by a factor of 1 + λ to accurately reflect the smaller actual availability expected. Stated differently, the denominator in A2 needs to be modified to accurately represent all time, whether operational (MTBF), in repair (MTTR), or in maintenance (λ(MTBF + MTTR)). The revised availability model in cases of scheduled downtime is:

A3 = MTBF / ((1 + λ)(MTBF + MTTR))    (4)

This model, like the model A2, assumes independence between the model variables. In reality there may be some relationship between the variables MTBF, MTTR, and λ.
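
Extending the earlier sketch to formula (4) shows how the maintenance overhead factor λ discounts availability; again, the numbers are hypothetical:

def availability_a3(mtbf_hours: float, mttr_hours: float, lam: float) -> float:
    """A3 = MTBF / ((1 + lam) * (MTBF + MTTR)), per formula (4).

    lam is the maintenance overhead factor: for every t hours of normal
    operation, the system spends lam * t hours in scheduled maintenance.
    """
    return mtbf_hours / ((1 + lam) * (mtbf_hours + mttr_hours))

# Same hypothetical component, now spending 0.1% of operating time in
# scheduled maintenance (lam = 0.001).
print(f"A3 = {availability_a3(50_000, 4, 0.001):.5%}")  # ~99.892%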

Availability Engineering

A complex system is composed of many interrelated components; failure of one component may not impact availability if the system is designed to withstand such a failure, while failure of another component may cause system downtime and hence degradation in availability.
Consequently, system design can be used to achieve high availability despite unreliable components.

For example, if Entity1 has component availability 0.9 and Entity2 has component availability 0.6, then the system availability depends on how the entities are arranged in the system. As shown in Figure 2, the availability of System1 is dependent on the availability of both Entity1 and Entity2.

In Figure 3, the system is unavailable only when both components fail. To compute the overall availability of complex systems involving both serially dependent and parallel components, a simple recursive use of the formulas in Figures 2 and 3 can be employed.
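
That recursion can be sketched in a few lines of illustrative Python (the helper names are ours; the component values reproduce Figures 2 and 3): serial blocks multiply availabilities, while parallel blocks multiply unavailabilities.

def serial(*availabilities: float) -> float:
    """Series arrangement: the system is up only when every component is up (Figure 2)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """Parallel arrangement: the system is down only when every component is down (Figure 3)."""
    down = 1.0
    for a in availabilities:
        down *= (1.0 - a)
    return 1.0 - down

print(serial(0.9, 0.6))    # System1: 0.54 (54%)
print(parallel(0.9, 0.6))  # System2: 0.96 (96%)

# Recursive use: two identical serial strings of Entity1 and Entity2,
# arranged in parallel with each other.
print(parallel(serial(0.9, 0.6), serial(0.9, 0.6)))  # ~0.7884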

Thus far, the engineer has two tools to achieve availability requirements specified in SLAs: a selection of reliable components and system design techniques. With these two tools, the system designer can achieve almost any desired availability, albeit at added cost and complexity.

The system designer must consider a third component of availability: operational practices. Reliability benchmarks have shown that 70 percent of reliability issues are operations induced versus the traditional electromechanical failures. In another study, more than 50 percent of Internet site outage events were a direct result of operational failures, and almost 60 percent of public-switched telephone network outages were caused by such failures [1]. Finally, Cisco asserts that 80 percent of non-availability occurs because of failures in operational practices [2]. One thing is clear: Only a change in operational practices can improve availability after the system components have been purchased and the system is implemented as designed.

In many cases, operational practices are within control of the service provider while product choice and system design are outside of its control. Conversely, a system designer often has little or no control over the operational practices. Consequently, if a project specifies a certain availability requirement, say 99.9 percent (3-nines), the system architect must design the system with more than 3-nines of availability to leave enough head space for operational errors.

To develop a model for overall availability, it is useful to consider failure rate instead of MTBF. Let α denote the failure rate due to component failure only. Then α = 1/MTBF. Also, let τ denote the total failure rate, including component failure as well as failure due to operational errors. Then 1/τ is the mean time between failure when both component failure and failures due to operational errors are considered (MTBFTot).

If β denotes the fraction of outages that are operations related, then (1 – β)τ is the rate of outages that are due to component failure. Thus:

(1 – β)τ = α, so τ = α / (1 – β), and

MTBFTot = (1 – β) / α    (5)

 The revised model becomes:

A4 = MTBFTot / ((1 + λ)(MTBFTot + MTTR))    (6)

where MTBFTot = (1 – β) / α
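
Combining formulas (5) and (6), a short illustrative sketch (hypothetical inputs, not from the article) computes overall availability from the component failure rate α, the operations-related outage fraction β, the maintenance factor λ, and the MTTR:

def availability_a4(alpha: float, beta: float, lam: float, mttr_hours: float) -> float:
    """A4 = MTBFTot / ((1 + lam) * (MTBFTot + MTTR)), per formula (6),
    where MTBFTot = (1 - beta) / alpha, per formula (5)."""
    mtbf_tot = (1.0 - beta) / alpha
    return mtbf_tot / ((1.0 + lam) * (mtbf_tot + mttr_hours))

# Hypothetical system: component MTBF of 50,000 hours (alpha = 1/50,000
# failures per hour), half of all outages operations related (beta = 0.5),
# 0.1% maintenance overhead (lam = 0.001), and a 4-hour MTTR.
print(f"A4 = {availability_a4(1 / 50_000, 0.5, 0.001, 4):.5%}")  # ~99.884%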

When an SLA specifies an availability of 99.9 percent, the buyer typically assumes the service provider considers all forms of outage, including component failure, scheduled maintenance outage, and outages due to operational error. So the buyer has in mind a model like that defined by A4. But the designer typically has in mind a model like A2 because the design engineer seldom has control over the maintenance outages or operationally induced outages, but does have control over product selection and system design. Thus, the buyer is frequently disappointed by insufficient availability and the service provider is frustrated because the SLAs are difficult or impossible to achieve at the contract price.

If the system design engineer is given insight into λ, the maintenance overhead factor, and β, then A2 can be accurately determined so that A4 is within the SLA. For example, if A4 = 99 percent, it may be necessary for the design engineer to build a system with A2 = 99.999 percent availability to leave sufficient room for maintenance outages and outages due to operational errors.

Given an overall availability requirement (A4) and information about λ and β, the design availability A2 can be computed from formula (7). Note that high maintenance ratios become a limiting factor in being able to engineer adequate availability.

A2 = (1 + λ)A4 / (1 + β((1 + λ)A4 – 1))    (7)

For 3-nines of overall availability, it is necessary to engineer a system for over 6-nines of availability (less than 20 seconds/year downtime due to component failure) even if only 50 percent of outages are the result of operational errors when the maintenance overhead is 0.1 percent. Engineering systems for 6-nines of availability may have a dramatic impact on system cost and complexity. It may be better to develop operational practices that minimize repair time and scheduled maintenance time.
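
The 3-nines example above can be checked with a small sketch of formula (7) (illustrative code, not from the article):

def required_design_availability(a4: float, lam: float, beta: float) -> float:
    """A2 = (1 + lam) * A4 / (1 + beta * ((1 + lam) * A4 - 1)), per formula (7)."""
    x = (1.0 + lam) * a4
    return x / (1.0 + beta * (x - 1.0))

# Overall SLA of 99.9%, 0.1% maintenance overhead, 50% of outages operational.
a2 = required_design_availability(0.999, 0.001, 0.5)
seconds_per_year = 365 * 24 * 3600
print(f"A2 = {a2:.7f}")                                         # ~0.9999995 (over 6-nines)
print(f"downtime = {(1 - a2) * seconds_per_year:.0f} s/year")   # ~16 s/year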

Reliability

What Is Reliability?

There is an important distinction between the notion of availability presented in the preceding section and reliability. Availability is the expected fraction of time that a system is operational. Reliability is the probability that a system will be available (i.e., will not fail) over some period of time, t. It does not measure or model downtime. Instead, reliability only models the time until failure occurs without concern for the time to repair or return to service.

Reliability Engineering

To model reliability, it is necessary to know something about the failure stochastic process, that is, the probability of failure before time, t. The Poisson process, based upon the exponential probability distribution, is usually a good model.

Figure 1: Preventative Maintenance Cycle (the system operates for t hours of operational time, then undergoes λt hours of maintenance time, for a total cycle time of t + λt).

Figure 2: Availability With Serial Components (System1: Entity1 at 90% and Entity2 at 60% in series; availability = 90% x 60% = 54%).

Figure 3: Availability With Parallel Components (System2: Entity1 at 90% and Entity2 at 60% in parallel; availability = 1 – (1 – 0.9)(1 – 0.6) = 96%).

For this process, it is only necessary to estimate the mean of the exponential distribution to predict reliability over any given time interval. Figure 4 depicts the exponential distribution function and the related density function. As shown, the probability of a failure approaches 1 as the period of time increases.

If F(t) and f(t) are the exponential distribution and density functions respectively, then the reliability function R(t) = 1 – F(t). So that,

R(t) = 1 – ∫_0^t f(s) ds = ∫_t^∞ f(s) ds = e^(–t/θ)    (8)

where θ = MTBF.

The mean of this distribution is θ, the MTBF. It can be measured directly from empirical observations over some historical period of observation or estimated using the availability models presented earlier.

Table 1 shows the reliability for various values of θ and t. Note that when t = MTBF, R(t) = 36.79 percent. That is, whatever the MTBF, the reliability over that same time period is always 36.79 percent.

Network components such as Ethernet switches typically have an MTBF of approximately 50,000 hours (about 70 months). Thus the annual reliability of a single component is about 85 percent (use t = 12 and θ = 70 in the formula above, or interpolate using Table 1). If that component is a single point of failure from the perspective of an end workstation, then based on component failure alone the probability of outage for such workstations is at least 15 percent. When operational errors are considered, it is MTBFTot, not MTBF, that determines reliability, so the probability of an unplanned outage within a year's time increases accordingly.
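
A brief sketch of formula (8) reproduces both the 36.79 percent observation and the switch example (the values are the article's; the code itself is illustrative):

import math

def reliability(t_months: float, mtbf_months: float) -> float:
    """R(t) = exp(-t / theta), per formula (8), with theta = MTBF."""
    return math.exp(-t_months / mtbf_months)

# Whatever the MTBF, reliability over one MTBF interval is always ~36.79%.
print(f"{reliability(12, 12):.2%}")  # 36.79%

# Ethernet switch: MTBF ~50,000 hours (~70 months); annual reliability is
# roughly 84-85%.
print(f"{reliability(12, 70):.2%}")  # ~84.25%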

The preceding assumes that the exponential density function accurately models system behavior. For systems with periodic scheduled downtime, this assumption is invalid. At a discrete point in time, there is a certainty that such a system will be unavailable: R(t) = e^(–t/θ) for any t < t0, where t0 is the point in time of the next scheduled maintenance, and R(t) = 0 for t ≥ t0.
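
The piecewise model can be sketched as follows (illustrative only; the 6-month maintenance horizon is a hypothetical value):

import math

def reliability_with_maintenance(t_months: float, mtbf_months: float,
                                 t0_months: float) -> float:
    """R(t) = exp(-t/theta) for t < t0; R(t) = 0 once the next scheduled
    maintenance outage at t0 is reached."""
    if t_months >= t0_months:
        return 0.0
    return math.exp(-t_months / mtbf_months)

# A system with a 70-month MTBF and scheduled maintenance 6 months out
# cannot run uninterrupted for a year, however reliable its components.
print(reliability_with_maintenance(3, 70, 6))   # ~0.958
print(reliability_with_maintenance(12, 70, 6))  # 0.0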

Survivability

What Is Survivability?

Survivability of IT systems is a significant concern, particularly among critical infrastructure providers. Availability and reliability analysis assume that failures are somewhat random and the engineer's job is to design a system that is robust in the face of random failure. There is thus an implicit assumption that system failure is largely preventable.

Survivability analysis implicitly makes the conservative assumption that failure will occur and that the outcome of the failure could negatively impact a large segment of the subscribers to the IT infrastructure. Such failures could be the result of deliberate, malicious attacks against the infrastructure by an adversary, or they could be the result of natural phenomena such as catastrophic weather events. Regardless of the cause, survivability analysis assumes that such events can and will occur and that the impact to the IT infrastructure and those who depend on it will be significant.

Survivability has been defined as “the capability of a system to fulfill its mission in a timely manner, in the presence of attacks, failures, or accidents” [3]. Survivability analysis is influenced by several important principles:

• Containment. Systems should be designed to minimize mission impact by containing the failure geographically or logically.

• Reconstitution. System designers should consider the time, effort, and skills required to restore essential mission-critical IT infrastructure after a catastrophic event.

• Diversity. Systems that are based on multiple technologies, vendors, locations, or modes of operation could provide a degree of immunity to attacks, especially those targeted at only one aspect of the system.

• Continuity. What must continue in the event of a catastrophe is the mission-critical business function, not any specific aspect of the IT infrastructure.

If critical functions are composed of both IT infrastructure (network) and function-specific technology components (servers), then both must be designed to be survivable. An enterprise IT infrastructure can be designed to be survivable, but unless the function-specific technologies are also survivable, irrecoverable failure could result.

Measuring Survivability

From the designers' and the buyers' perspectives, comparing various designs based upon their survivability is critical for making cost and benefit tradeoffs. Next, we discuss several types of analysis that can be performed on a network design that can provide a more quantitative assessment of survivability.

Residual measures for an IT infrastructure are the same measures used to describe the infrastructure before a catastrophic event, but are applied to the expected state of the infrastructure after the effects of the event are taken into consideration. Here we discuss four residual measures that are usually important:

• Residual Single Points of Failure. In comparing two candidate infrastructure designs, the design with fewer single points of failure is generally considered more robust than the alternative. When examining the survivability of an infrastructure with respect to a particular catastrophic event, the infrastructure with the fewer residual single points of failure is intuitively more survivable. This measure is a simple count.

• Residual Availability. The same availability analysis done on an undamaged infrastructure can be applied to an infrastructure after it has been damaged by a catastrophic event. Generally, the higher the residual availability of an infrastructure, the more survivable it is with respect to the event being analyzed.

• Residual Performance. A residual infrastructure that has no single point of failure and has high residual availability may not be usable from the per-

Figure 4: Exponential Density and Distribution Functions With MTBF = 1.0 (the exponential density function with mean 1.0 and the probability of failure before time t, which approaches 1 as t increases).

t (months):       1        2        4        8       12       24       48       96
MTBF =  3     71.65%   51.34%   26.36%    6.95%    1.83%    0.03%    0.00%    0.00%
MTBF =  6     84.65%   71.65%   51.34%   26.36%   13.53%    1.83%    0.03%    0.00%
MTBF = 12     92.00%   84.65%   71.65%   51.34%   36.79%   13.53%    1.83%    0.03%
MTBF = 24     95.92%   92.00%   84.65%   71.65%   60.65%   36.79%   13.53%    1.83%
MTBF = 48     97.94%   95.92%   92.00%   84.65%   77.88%   60.65%   36.79%   13.53%
MTBF = 96     98.96%   97.94%   95.92%   92.00%   88.25%   77.88%   60.65%   36.79%

Table 1: Reliability Over t Months for MTBF Ranging From 3 to 96 Months
