Availability, Reliability, and Survivability: An Introduction and Some Contractual Implications

Dr. Thomas Ward Morgan, CACI Federal
Dr. Jack Murphy, DeXIsive Inc.

CrossTalk: The Journal of Defense Software Engineering, March 2006

This article is directed toward information technology professionals that enter into contractual agreements requiring service-level agreements (SLAs) that specify availability, reliability, or survivability objectives. Its purpose is to show a relationship between cost, performance, and SLA levels established by the customer.

Information technology (IT) outsourcing arrangements frequently employ service-level agreements (SLAs) that use terms such as availability and reliability. The intent is that the buyer requests a specific system availability and reliability (e.g., 98 percent to 99.9 percent, and 85 percent to 90 percent, respectively). The service provider is typically rewarded for exceeding specified limits and/or punished for falling below these limits.

In recent years, another term, survivability, has become popular and is used to express yet another objective: the ability of a system to continue functioning after the failure of one of its components. This article examines these terms so buyer and seller can understand and use them in a contractual context and designers/operators can choose optimal approaches to satisfying the SLAs.

The northeastern U.S. power grid failure in August 2003 drew attention to the availability, reliability, and survivability of business-critical IT systems. Catastrophe can be the catalyst for new thinking about the survivability of IT systems.

From the buyer's perspective, an increase in availability, reliability, and survivability comes at a price: 100 percent is not possible, but 98 percent might be affordable and adequate while 99.99 percent might be unaffordable or excessive. From the service provider's perspective, under-engineering or inadequate operating practices can result in penalties for failing to meet SLAs.

Availability

What Is Availability?

Availability is influenced by the following:

• Component Reliability. A measure of the expected time between component failures. Component reliability is affected by electromechanical failures as well as component-level software failure.

• System Design. The manner in which components are interrelated to satisfy required functionality and reliability. Designers can enhance availability through judicious use of redundancy in the arrangement of system components.

• Operational Practices. Operational practices come into play after the system is designed and implemented with selected components. Once the design is fixed and the components are installed, operational practices are the only factor that can improve or degrade availability.

Informally, system availability is the expected ratio of uptime to total elapsed time. More precisely, availability is the ratio of uptime to the sum of uptime, scheduled downtime, and unscheduled downtime:

A1 = Uptime / (Uptime + Downtime)    (1)

Formula (1) is useful for measuring availability over a given period of time such as a calendar quarter, but not very useful for predicting availability or engineering a system to satisfy availability requirements. For this purpose, system designers frequently employ a model based on mean time between failure (MTBF) and mean time to repair (MTTR), usually expressed in units of hours:

A2 = MTBF / (MTBF + MTTR)    (2)

Formula (2) is analogous to formula (1), but is based on statistical measures instead of direct observation. Most vendors publish MTBF data. MTTR can often be estimated from historical data. Interestingly, MTTR is partially within the control of the system operator. For example, the system operator may establish a strategy for spares or provide more training to the support staff to reduce the MTTR. Because of these factors, component vendors typically do not publish MTTRs. Finally, it should be noted that A2 does not explicitly account for scheduled downtime.
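
As an illustration, the following minimal Python sketch (not from the article; the helper name and the MTBF and MTTR figures are hypothetical) evaluates formula (2):

def availability_a2(mtbf_hours: float, mttr_hours: float) -> float:
    """A2 = MTBF / (MTBF + MTTR), per formula (2)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures: a component with a 50,000-hour published MTBF and a
# 4-hour MTTR (spares on site, trained support staff).
print(f"A2 = {availability_a2(50_000, 4):.5%}")  # ~99.992%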

The most common approach to include scheduled maintenance time is to include it in the total time represented in the mathematical model's denominator, thus reducing expected availability commensurately. This provides an additional challenge for the designer, but like MTTR it is somewhat controllable through operational procedures. If the system can be designed for only infrequent preventive maintenance, then availability is enhanced.

In most operational environments, a system is allowed to operate normally, including unscheduled outages, for some fixed period of time, t, after which it is brought down for maintenance for some small fraction of that time, λt (see Figure 1).

If the scheduled maintenance is periodic and on a predictable schedule, then following t hours there is a scheduled outage of λt, so that the fraction of time that the system is not in maintenance is:

t / (t + λt) = 1 / (1 + λ) (3)

Since the denominator in A2 does not include the time in preventive maintenance, an adjustment to formula (2) is needed whenever preventive maintenance is part of the operational routine. To include this time, the denominator needs to be increased by a factor of 1 + λ to accurately reflect the smaller actual availability expected. Stated differently, the denominator in A2 needs to be modified to accurately represent all time, whether operational (MTBF), in repair (MTTR), or in maintenance (λ(MTBF + MTTR)). The revised availability model in cases of scheduled downtime is:

A3 = MTBF / ((1 + λ)(MTBF + MTTR))    (4)

This model, like the model A2, assumes independence between the model variables. In reality there may be some relationship between the variables MTBF, MTTR, and λ.
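
Extending the earlier sketch to formula (4) shows how the maintenance overhead factor λ discounts availability; again, the numbers are hypothetical:

def availability_a3(mtbf_hours: float, mttr_hours: float, lam: float) -> float:
    """A3 = MTBF / ((1 + lam) * (MTBF + MTTR)), per formula (4).

    lam is the maintenance overhead factor: for every t hours of normal
    operation, the system spends lam * t hours in scheduled maintenance.
    """
    return mtbf_hours / ((1 + lam) * (mtbf_hours + mttr_hours))

# Same hypothetical component, now spending 0.1% of operating time in
# scheduled maintenance (lam = 0.001).
print(f"A3 = {availability_a3(50_000, 4, 0.001):.5%}")  # ~99.892%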

Availability Engineering

A complex system is composed of many interrelated components; failure of one component may not impact availability if the system is designed to withstand such a failure, while failure of another component may cause system downtime and hence degradation in availability.
Consequently, system design can be used to achieve high availability despite unreliable components.

For example, if Entity1 has component availability 0.9 and Entity2 has component availability 0.6, then the system availability depends on how the entities are arranged in the system. As shown in Figure 2, the availability of System1 is dependent on the availability of both Entity1 and Entity2.

In Figure 3, the system is unavailable only when both components fail. To compute the overall availability of complex systems involving both serially dependent and parallel components, a simple recursive use of the formulas in Figures 2 and 3 can be employed.
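
That recursion can be sketched in a few lines of illustrative Python (the helper names are ours; the component values reproduce Figures 2 and 3): serial blocks multiply availabilities, while parallel blocks multiply unavailabilities.

def serial(*availabilities: float) -> float:
    """Series arrangement: the system is up only when every component is up (Figure 2)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """Parallel arrangement: the system is down only when every component is down (Figure 3)."""
    down = 1.0
    for a in availabilities:
        down *= (1.0 - a)
    return 1.0 - down

print(serial(0.9, 0.6))    # System1: 0.54 (54%)
print(parallel(0.9, 0.6))  # System2: 0.96 (96%)

# Recursive use: two identical serial strings of Entity1 and Entity2,
# arranged in parallel with each other.
print(parallel(serial(0.9, 0.6), serial(0.9, 0.6)))  # ~0.7884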

Thus far, the engineer has two tools to achieve availability requirements specified in SLAs: a selection of reliable components and system design techniques. With these two tools, the system designer can achieve almost any desired availability, albeit at added cost and complexity.

The system designer must consider a third component of availability: operational practices. Reliability benchmarks have shown that 70 percent of reliability issues are operations induced versus the traditional electromechanical failures. In another study, more than 50 percent of Internet site outage events were a direct result of operational failures, and almost 60 percent of public-switched telephone network outages were caused by such failures [1]. Finally, Cisco asserts that 80 percent of non-availability occurs because of failures in operational practices [2]. One thing is clear: Only a change in operational practices can improve availability after the system components have been purchased and the system is implemented as designed.

In many cases, operational practices are within control of the service provider while product choice and system design are outside of its control. Conversely, a system designer often has little or no control over the operational practices. Consequently, if a project specifies a certain availability requirement, say 99.9 percent (3-nines), the system architect must design the system with more than 3-nines of availability to leave enough head space for operational errors.

To develop a model for overall availability, it is useful to consider failure rate instead of MTBF. Let α denote the failure rate due to component failure only. Then α = 1/MTBF. Also, let τ denote the total failure rate, including component failure as well as failure due to operational errors. Then 1/τ is the mean time between failure when both component failure and failures due to operational errors are considered (MTBFTot).

If β denotes the fraction of outages that are operations related, then (1 – β)τ is the rate of outages that are due to component failure. Thus:

(1 – β)τ = α, so τ = α / (1 – β), and

MTBFTot = (1 – β) / α    (5)

 The revised model becomes:

A4 = MTBFTot / ((1 + λ)(MTBFTot + MTTR))    (6)

where MTBFTot = (1 – β) / α
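
Combining formulas (5) and (6), a short illustrative sketch (hypothetical inputs, not from the article) computes overall availability from the component failure rate α, the operations-related outage fraction β, the maintenance factor λ, and the MTTR:

def availability_a4(alpha: float, beta: float, lam: float, mttr_hours: float) -> float:
    """A4 = MTBFTot / ((1 + lam) * (MTBFTot + MTTR)), per formula (6),
    where MTBFTot = (1 - beta) / alpha, per formula (5)."""
    mtbf_tot = (1.0 - beta) / alpha
    return mtbf_tot / ((1.0 + lam) * (mtbf_tot + mttr_hours))

# Hypothetical system: component MTBF of 50,000 hours (alpha = 1/50,000
# failures per hour), half of all outages operations related (beta = 0.5),
# 0.1% maintenance overhead (lam = 0.001), and a 4-hour MTTR.
print(f"A4 = {availability_a4(1 / 50_000, 0.5, 0.001, 4):.5%}")  # ~99.884%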

When an SLA specifies an availability of 99.9 percent, the buyer typically assumes the service provider considers all forms of outage, including component failure, scheduled maintenance outage, and outages due to operational error. So the buyer has in mind a model like that defined by A4. But the designer typically has in mind a model like A2 because the design engineer seldom has control over the maintenance outages or operationally induced outages, but does have control over product selection and system design. Thus, the buyer is frequently disappointed by insufficient availability and the service provider is frustrated because the SLAs are difficult or impossible to achieve at the contract price.

If the system design engineer is given insight into λ, the maintenance overhead factor, and β, then A2 can be accurately determined so that A4 is within the SLA. For example, if A4 = 99 percent, it may be necessary for the design engineer to build a system with A2 = 99.999 percent availability to leave sufficient room for maintenance outages and outages due to operational errors.

Given an overall availability requirement (A4) and information about λ and β, the design availability A2 can be computed from formula (7). Note that high maintenance ratios become a limiting factor in being able to engineer adequate availability.

A2 = (1 + λ)A4 / (1 + β((1 + λ)A4 – 1))    (7)

For 3-nines of overall availability, it is necessary to engineer a system for over 6-nines of availability (less than 20 seconds/year downtime due to component failure) even if only 50 percent of outages are the result of operational errors when the maintenance overhead is 0.1 percent. Engineering systems for 6-nines of availability may have a dramatic impact on system cost and complexity. It may be better to develop operational practices that minimize repair time and scheduled maintenance time.
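
The 3-nines example above can be checked with a small sketch of formula (7) (illustrative code, not from the article):

def required_design_availability(a4: float, lam: float, beta: float) -> float:
    """A2 = (1 + lam) * A4 / (1 + beta * ((1 + lam) * A4 - 1)), per formula (7)."""
    x = (1.0 + lam) * a4
    return x / (1.0 + beta * (x - 1.0))

# Overall SLA of 99.9%, 0.1% maintenance overhead, 50% of outages operational.
a2 = required_design_availability(0.999, 0.001, 0.5)
seconds_per_year = 365 * 24 * 3600
print(f"A2 = {a2:.7f}")                                         # ~0.9999995 (over 6-nines)
print(f"downtime = {(1 - a2) * seconds_per_year:.0f} s/year")   # ~16 s/year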

Reliability

What Is Reliability?

There is an important distinction between the notion of availability presented in the preceding section and reliability. Availability is the expected fraction of time that a system is operational. Reliability is the probability that a system will be available (i.e., will not fail) over some period of time, t. It does not measure or model downtime. Instead, reliability only models the time until failure occurs without concern for the time to repair or return to service.

Reliability Engineering

To model reliability, it is necessary to know something about the failure stochastic process, that is, the probability of failure before time, t. The Poisson process, based upon the exponential probability distribution, is usually a good model.

Figure 1: Preventative Maintenance Cycle (the system operates for t hours of operational time, then undergoes λt hours of maintenance time, for a total cycle time of t + λt).

Figure 2: Availability With Serial Components (System1: Entity1 at 90% and Entity2 at 60% in series; availability = 90% x 60% = 54%).

Figure 3: Availability With Parallel Components (System2: Entity1 at 90% and Entity2 at 60% in parallel; availability = 1 – (1 – 0.9)(1 – 0.6) = 96%).

For this process, it is only necessary to estimate the mean of the exponential distribution to predict reliability over any given time interval. Figure 4 depicts the exponential distribution function and the related density function. As shown, the probability of a failure approaches 1 as the period of time increases.

If F(t) and f(t) are the exponential distribution and density functions respectively, then the reliability function R(t) = 1 – F(t). So that,

R(t) = 1 – ∫_0^t f(s) ds = ∫_t^∞ f(s) ds = e^(–t/θ)    (8)

where θ = MTBF.

The mean of this distribution is θ, the MTBF. It can be measured directly from empirical observations over some historical period of observation or estimated using the availability models presented earlier.

Table 1 shows the reliability for various values of θ and t. Note that when t = MTBF, R(t) = 36.79 percent. That is, whatever the MTBF, the reliability over that same time period is always 36.79 percent.

Network components such as Ethernet switches typically have an MTBF of approximately 50,000 hours (about 70 months). Thus the annual reliability of a single component is about 85 percent (use t = 12 and θ = 70 in the formula above, or interpolate using Table 1). If that component is a single point of failure from the perspective of an end workstation, then based on component failure alone the probability of outage for such workstations is at least 15 percent. When operational errors are considered, it is MTBFTot, not MTBF, that determines reliability, so the probability of an unplanned outage within a year's time increases accordingly.
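
A brief sketch of formula (8) reproduces both the 36.79 percent observation and the switch example (the values are the article's; the code itself is illustrative):

import math

def reliability(t_months: float, mtbf_months: float) -> float:
    """R(t) = exp(-t / theta), per formula (8), with theta = MTBF."""
    return math.exp(-t_months / mtbf_months)

# Whatever the MTBF, reliability over one MTBF interval is always ~36.79%.
print(f"{reliability(12, 12):.2%}")  # 36.79%

# Ethernet switch: MTBF ~50,000 hours (~70 months); annual reliability is
# roughly 84-85%.
print(f"{reliability(12, 70):.2%}")  # ~84.25%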

The preceding assumes that the exponential density function accurately models system behavior. For systems with periodic scheduled downtime, this assumption is invalid. At a discrete point in time, there is a certainty that such a system will be unavailable: R(t) = e^(–t/θ) for any t < t0, where t0 is the point in time of the next scheduled maintenance, and R(t) = 0 for t ≥ t0.
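
The piecewise model can be sketched as follows (illustrative only; the 6-month maintenance horizon is a hypothetical value):

import math

def reliability_with_maintenance(t_months: float, mtbf_months: float,
                                 t0_months: float) -> float:
    """R(t) = exp(-t/theta) for t < t0; R(t) = 0 once the next scheduled
    maintenance outage at t0 is reached."""
    if t_months >= t0_months:
        return 0.0
    return math.exp(-t_months / mtbf_months)

# A system with a 70-month MTBF and scheduled maintenance 6 months out
# cannot run uninterrupted for a year, however reliable its components.
print(reliability_with_maintenance(3, 70, 6))   # ~0.958
print(reliability_with_maintenance(12, 70, 6))  # 0.0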

Survivability

What Is Survivability?

Survivability of IT systems is a significant concern, particularly among critical infrastructure providers. Availability and reliability analysis assume that failures are somewhat random and the engineer's job is to design a system that is robust in the face of random failure. There is thus an implicit assumption that system failure is largely preventable.

Survivability analysis implicitly makes the conservative assumption that failure will occur and that the outcome of the failure could negatively impact a large segment of the subscribers to the IT infrastructure. Such failures could be the result of deliberate, malicious attacks against the infrastructure by an adversary, or they could be the result of natural phenomena such as catastrophic weather events. Regardless of the cause, survivability analysis assumes that such events can and will occur and that the impact to the IT infrastructure and those who depend on it will be significant.

Survivability has been defined as “the capability of a system to fulfill its mission in a timely manner, in the presence of attacks, failures, or accidents” [3]. Survivability analysis is influenced by several important principles:

• Containment. Systems should be designed to minimize mission impact by containing the failure geographically or logically.

• Reconstitution. System designers should consider the time, effort, and skills required to restore essential mission-critical IT infrastructure after a catastrophic event.

• Diversity. Systems that are based on multiple technologies, vendors, locations, or modes of operation could provide a degree of immunity to attacks, especially those targeted at only one aspect of the system.

• Continuity. What must continue in the event of a catastrophe is the mission-critical business function, not any specific aspect of the IT infrastructure.

If critical functions are composed of both IT infrastructure (network) and function-specific technology components (servers), then both must be designed to be survivable. An enterprise IT infrastructure can be designed to be survivable, but unless the function-specific technologies are also survivable, irrecoverable failure could result.

Measuring Survivability

From the designers' and the buyers' perspectives, comparing various designs based upon their survivability is critical for making cost and benefit tradeoffs. Next, we discuss several types of analysis that can be performed on a network design that can provide a more quantitative assessment of survivability.

Residual measures for an IT infrastructure are the same measures used to describe the infrastructure before a catastrophic event, but are applied to the expected state of the infrastructure after the effects of the event are taken into consideration. Here we discuss four residual measures that are usually important:

• Residual Single Points of Failure. In comparing two candidate infrastructure designs, the design with fewer single points of failure is generally considered more robust than the alternative. When examining the survivability of an infrastructure with respect to a particular catastrophic event, the infrastructure with the fewer residual single points of failure is intuitively more survivable. This measure is a simple count.

• Residual Availability. The same availability analysis done on an undamaged infrastructure can be applied to an infrastructure after it has been damaged by a catastrophic event. Generally, the higher the residual availability of an infrastructure, the more survivable it is with respect to the event being analyzed.

• Residual Performance. A residual infrastructure that has no single point of failure and has high residual availability may not be usable from the per-

Figure 4: Exponential Density and Distribution Functions With MTBF = 1.0 (the exponential density function with mean 1.0 and the probability of failure before time t, which approaches 1 as t increases).

t (months):       1        2        4        8       12       24       48       96
MTBF =  3     71.65%   51.34%   26.36%    6.95%    1.83%    0.03%    0.00%    0.00%
MTBF =  6     84.65%   71.65%   51.34%   26.36%   13.53%    1.83%    0.03%    0.00%
MTBF = 12     92.00%   84.65%   71.65%   51.34%   36.79%   13.53%    1.83%    0.03%
MTBF = 24     95.92%   92.00%   84.65%   71.65%   60.65%   36.79%   13.53%    1.83%
MTBF = 48     97.94%   95.92%   92.00%   84.65%   77.88%   60.65%   36.79%   13.53%
MTBF = 96     98.96%   97.94%   95.92%   92.00%   88.25%   77.88%   60.65%   36.79%

Table 1: Reliability Over t Months for MTBF Ranging From 3 to 96 Months
