IEEE TRANSACTIONS ON RELIABILITY, VOL. 40, NO. 2, 1991 JUNE 189

Scalability Analysis in Gracefully-Degradable Large Systems

Walid A. Najjar, Colorado State University, Fort Collins

Jean-Luc Gaudiot, University of Southern California, Los Angeles

Key Words - Large system, Scalability, Computational reliability.

Reader Aids -
Purpose: Report a reliability analysis
Special math needed for explanation: Probability theory
Special math needed to use results: Same
Results useful to: Reliability analysts

Summary & Conclusions - The availability of large multi- processor systems raises new issues in the design of highly fault- tolerant architectures. As the number of active processors in large systems increases, its computing power increases but so does the anticipated failure rate in the system. On the other hand, the hard- ware redundancy available in such system is capable of providing, in addition to large amounts of computing power, improved reliability and graceful degradation. This paper analyzes the scalability of large de-gradabk homogemus multiprocessors. The objective is to assess the limitations, imposed by reliability con- siderations, on the number of processors. The analysis of the mean- time-to-failure and the mission-time shows that, for a given value of the coverage factor, there exists a value of the number of pro- cessors at which these measures are maximal. As the system size is increased beyond this value, the reliability of the system becomes a rapidly decreasing function of the number of processors.

The measure of processor-hours is defined as the amount of potential computational work. This measure is upper-bounded, and the upper-bound is independent of the initial number of processors. For computations with linear speed-up, the amount of reliable computational work is constant for large system-sizes. When the speed-up is not linear, this amount is a decreasing function of the number of processors. Therefore, for large system-sizes and the same technology, increasing the number of processors results in a decrease of the average amount of reliable computational work the system can deliver.

Graceful degradation in large fault-tolerant systems is not scalable. In order to preserve the same performance and reliability level, an increase in the number of processors must be matched by a decrease of the same magnitude in the probability of failed recovery.

1. INTRODUCTION

The past few years have witnessed the advent of large-scale distributed systems (LSDS). These systems are characterized by many computing nodes (several hundreds or thousands) communicating over a high-bandwidth interconnection network and designed for highly-parallel compute-intensive applications.

As the number of processors (N) is increased, the effective computational power of the system increases at a rate less than or equal to N because of the added parallelization overhead. On the other hand, the rate of failures in the system increases at a rate larger than or equal to N because of the added complexity of the system and the increase in the size of the interconnecting network. Therefore the reliability of the system could become a limiting factor on the size of a feasible large distributed system. On the other hand, LSDSs have a large degree of intrinsic redundancy that can be exploited to achieve a higher reliability, albeit at the expense of reduced performance. Large systems can therefore be made to degrade gracefully upon the successful detection and isolation of a faulty element [1-3]. The cumulative probability of the detection, isolation, and successful recovery from a failure is defined as the coverage (c).

The main objective of this paper is to analyze the scalability of gracefully degradable large systems. A large system is scalable if an increase in the number of active processors results in proportionately improved reliability and performance levels. Traditional reliability and performability analysis has focused mostly on the steady-state and transient evaluation of time-varying measures such as reliability and availability. The present analysis, therefore, focuses on the evaluation of reliability and performability measures as a function of the number of processors in the system.

The rest of this paper is organized as follows. A brief survey of previous work is presented in section 2. The system and fault models are described in section 3. An analysis of time-based measures (such as mean-time-to-failure and mission-time) as a function of the number of processors is presented in section 4. Measures of computational reliability (such as reliable-processor-hours and computational-work) are introduced and analyzed in section 5. The results are discussed in section 6.


2. BACKGROUND

Fault-tolerant systems, such as the SIFT [4], the FTMP [5], the ESS [6] and several commercial systems [7] have used redundancy to achieve a higher reliability, a better availability, or a longer mission-time. In a multiprocessor system, the redundancy is intended for higher throughput or faster execution. It could, however, be exploited for fault-tolerance to provide gracefully degraded operation.

The traditional fault-tolerance measures of reliability, availability and mission-time do not provide an adequate evaluation of degradable systems. New measures, therefore, have been devised that provide a combined performance/reliability evaluation of degradable fault-tolerant computing systems. Meyer proposed performability, viz, the probability that the system maintains a certain accomplishment level over a utilization interval [8]. Several researchers have used this measure in the evaluation


of the availability of different types of degradable systems. De Souza, et al [9] apply randomization methods to estimate numerically the distribution of cumulative up-time, and therefore the distribution of the availability, for repairable systems over a finite observation period. Iyer, et al [10] show that for large mission-times, the cumulative reward, in a degradable system, has an asymptotic s-normal distribution. Closed-form solutions for performability and the performability distribution have been proposed in [11-14]. Ingle & Siewiorek [15] show that periodic maintenance and integrity checks can increase the mean life of gracefully degradable and reconfigurable multiprocessor systems. This improvement, however, is shown to be limited by the non-redundant parts of a system.

A set of computation-oriented reliability measures has been proposed by Beaudry [16]. These include computational reliability, viz, the probability that at a given time the system correctly executes a given task; the mean computation before failure (MCBF), viz, the mean amount of computational work the system can deliver before the first failure; the computational availability, viz, the mean value of the computation capacity of the system; and the capacity threshold, viz, the time when a specific value of the computational availability is reached. The use of these measures was demonstrated in the comparative evaluation of 2 processors. Performability and computational availability are measures that evaluate the computational capacity of a system as a function of time and therefore are ideal for the evaluation of throughput-oriented or highly available systems.

Fortes & Raghavendra [17] described the design of a gracefully degradable and reconfigurable array processor and derived closed-form expressions of the reliability, performability and computational availability. Smotherman, et al [18] demonstrate a set of methods for the derivation of provable bounds on reliability, both optimistic and conservative, for complex systems.

3. SYSTEM AND FAULT MODELS

The model is a large, homogeneous multiprocessor. The computation is, initially, uniformly partitioned among N identical processing elements. The system supports graceful degradation: upon the detected failure of a processor, its computational load is picked up by another processor or set of processors with near-uniform load partitioning. An algorithm for distributed fault-tolerance has been proposed by Preparata, et al [19]. A detailed description and discussion of a distributed fault-tolerance algorithm is in [20]. It is based on 3 steps:

1. fault detection
2. fault isolation
3. system reconfiguration and recovery.

However, the ability of a system to degrade gracefully hinges on the combined success of these three steps. The failure to detect, isolate, or recover from a fault can result in a system failure. The cumulative probability of success of these three steps is defined as coverage [21]. In a distributed system, the recovery procedure relies on communication among processors, and therefore on the network being connected. Unless the system has a fully connected network topology, it might be partitioned by successive failures. The probability of disconnection is therefore a parameter of the coverage. An analytic evaluation of this probability is presented in [22].

Assumptions

1. No communication costs among processors.
2. Fail-proof communication links.
3. No overhead associated with recovery and reconfiguration procedures.
4. All failures are statistically independent events.
5. The system is fully degradable, viz, allows a graceful degradation up to the last processor.
6. Constant, state-independent coverage. This assumption is unrealistic since the probability of partitioning the network increases with increasing i and might reach values comparable to c [22].

These assumptions are strong, although they are justifiable in an analysis of asymptotic behavior. Based on such assumptions, this analysis of graceful degradation determines upper-bounds that are in practice never reached. The system is modeled by a continuous-time Markov chain (CTMC), shown in figure 1 [23,24]. Since this analysis focuses on gracefully degradable systems, we do not consider system repair and therefore the CTMC is acyclic.

Figure 1. Markov Model of Failures

Notation

i            number of failed processors; i = 0, ..., D; F denotes the state of system failure
P_i(t)       occupation probability of state i
D            number of allowable degradation states (D = N - 1 unless otherwise noted)
λ            failure rate of a single processor
c            coverage, probability of successful recovery from a single failure in any state i
λ_i, μ_i     state transition rates
R(N,t)       system reliability
MTTF(N,c)    mean time to first failure
MT(N,c)      system mission-time


Other, standard notation is given in “Information for Readers & Authors” at the rear of each issue.

The state transition rates are a function of the single-processor failure rate:

λ_i = c (N - i) λ

μ_i = (1 - c) (N - i) λ

The state-space description of this Markov model is:

dP_0(t)/dt = -N λ P_0(t)

dP_i(t)/dt = -(N - i) λ P_i(t) + c (N - i + 1) λ P_{i-1}(t),   i = 1, ..., D

subject to the following constraints:

P_0(0) = 1

P_i(0) = 0,   i = 1, ..., D.

The state probabilities are:

P_0(t) = e^{-Nλt}

P_i(t) = \binom{N}{i} c^i e^{-(N-i)λt} (1 - e^{-λt})^i,   i = 0, ..., D.   (1)

The reliability is the probability of being in any one of the states i = 0, ..., D:

R(t) = \sum_{i=0}^{D} P_i(t).   (2)

The mean time to first failure is:

MTTF = \int_0^∞ R(t) dt.
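A minimal numerical sketch of this model may help fix ideas. It is written in Python under the assumptions λ = 1 (so time is in units of MTTF_1 = 1/λ) and D = N - 1; the function names and parameters are illustrative and not part of the original analysis.

```python
from math import comb, exp

LAM = 1.0  # single-processor failure rate; time is in units of 1/LAM = MTTF_1

def p_state(N, c, i, t):
    """Occupation probability P_i(t) of the acyclic CTMC (eq. 1), with D = N - 1."""
    return comb(N, i) * (c ** i) * exp(-(N - i) * LAM * t) * (1.0 - exp(-LAM * t)) ** i

def reliability(N, c, t):
    """R(N,t): probability of being in any degradation state i = 0 .. D (eq. 2)."""
    return sum(p_state(N, c, i, t) for i in range(N))

def mttf_numeric(N, c, t_max=30.0, steps=6000):
    """MTTF = integral of R(t) over [0, inf), truncated at t_max (trapezoid rule)."""
    h = t_max / steps
    total = 0.5 * (reliability(N, c, 0.0) + reliability(N, c, t_max))
    total += sum(reliability(N, c, k * h) for k in range(1, steps))
    return h * total

if __name__ == "__main__":
    for N in (1, 4, 16, 64):
        print(f"N = {N:3d}   MTTF = {mttf_numeric(N, 0.99):.3f}")
```

With c = 0.99 the value for N = 1 is 1 (one MTTF_1), and the values first rise and then begin to fall as N grows, the unimodal behavior analyzed in section 4.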

4. TIME-BASED ANALYSIS

This section analyzes two time measures:

Mean-Time-To-Failure (MTTF)
Mission-Time (MT)

as functions of the number of processors (N) and the coverage factor (c).

The mission-time is defined, for a given minimum reliability R_min, as the time interval during which the reliability is larger than R_min:

for all t < MT,   R(t) ≥ R_min.   (3)

The unit-time is normalized to 1/λ = MTTF_1 (the MTTF of a single processor) and a value of R_min = 0.99 is used.

4.1 Analysis of MTTF

The MTTF as a function of N and c can be obtained from (1) and (2):

MTTF(N,c) = \frac{1}{λ} \sum_{i=0}^{D} \frac{c^i}{N - i}.   (5)

The MTTF(N,c) according to (5) is plotted in figure 2. The following recurrence relation can be derived for MTTF(N,c):

MTTF(N,c) = \frac{1}{Nλ} + c · MTTF(N-1,c).

Figure 2. MTTF as Function of N and c (curves for c = 0.90, 0.95, 0.99; N on a log2 scale)

We conclude that: MTTF(N,c) is a unimodal function that converges to zero as N → ∞. There exists a value of N (N_m) for which MTTF(N,c) is maximal. For large values of N, MTTF(N,c) decreases rapidly with increasing N. The maximum value of MTTF(N,c) is an increasing function of c. MTTF(N_m,c) is proportional to 1/(1-c).


In order to evaluate N_m and the peak value MTTF(N_m,c) we use the following approximation in (5):

c^i ≈ 1 - iε,   ε = 1 - c,   (6)

obtaining (with time in units of 1/λ)

MTTF(N,c) = H_N (1 - Nε) + Nε,   (7)

where H_N = \sum_{i=1}^{N} \frac{1}{i} is the harmonic number. Since ln(N) < H_N < ln(N+1), we can approximate (7) further by:

MTTF(N,c) ≈ ln(N) (1 - Nε) + Nε.

TABLE 1
Maximum Values of MTTF(N,c)

c       N_m     MTTF(N_m,c)
0.99     30       3.2
0.95      8       2.0
0.90      6       1.6

The approximation in (6) is valid when i is small and therefore the approximation of MTTF(N,c) in (7) is not valid for large N. Table 1 shows the values of N_m and MTTF(N_m,c) as obtained from the above two approximations. These values correspond to those in figure 2. The following relations can be derived:

N_m ln(N_m) = 1/(1 - c)

MTTF(N_m,c) < ln(1/(1 - c)).

These relations demonstrate that very large gracefully-degradable systems are inherently unreliable since MTTF → 0 as the number of processors increases.
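As a rough check, the closed form (5) can also be scanned directly. The small Python sketch below (λ = 1, D = N - 1; purely illustrative) gives maxima close to those in Table 1, which was computed with the approximations (6)-(7):

```python
def mttf_closed(N, c):
    """MTTF(N,c) from (5), in units of 1/lambda, with D = N - 1."""
    return sum(c ** i / (N - i) for i in range(N))

for c in (0.99, 0.95, 0.90):
    N_m = max(range(1, 1000), key=lambda N: mttf_closed(N, c))
    print(f"c = {c}:  N_m ~ {N_m},  MTTF(N_m,c) ~ {mttf_closed(N_m, c):.2f}")
```

Because the curve is very flat around its maximum, the exact location of N_m is less informative than the peak value itself and the rapid decline that follows it.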

4.2 Analysis of MT

The Mission-Time is defined as the time interval where R(N,t) ≥ R_min. The values of MT, as obtained from (3), are plotted in figure 3 as a function of N for various values of c. We deduce that: for a given c, there exists a value of N at which the MT is maximal. We denote this value by N_p:

MT(c,N_p) ≥ MT(c,N) for all N.

Figure 3. Mission-Time as Function of N and c

These curves show that for smaller values of N (N < N_p) the inherent redundancy of the system provides a higher mission-time. When N > N_p the higher failure rate dominates and reduces the mission-time. As c increases, the value of N_p also increases. The peak value of mission-time is appreciably larger than that of a single processor. For example, for R_min = 0.99 the mission-time of a single processor is MT(1) = 0.01 and therefore MT(N_p, 0.99) ≈ 23 MT(1). As the number of processors is increased beyond N_p, ie, N > N_p, the mission-time decreases. This decrease is inversely proportional to N, ie, MT(2N,c) = 0.5 MT(N,c). While the peak value of the mission-time of a multicomputer system can be appreciably larger than that of a single processor, the reverse becomes true for very large N. For example, MT(1) ≈ 10 MT(1024, 0.99). For N > N_p, a 10-fold decrease in (1 - c) (the probability of failed recovery) results in a 10-fold increase in the MT for the same number of processors. For example, MT(128, 0.99) = 10 MT(128, 0.9). In other words, the mission-time

is inversely proportional to (1 - c).

Because of the cumulative effects of the probability of successive recovery, the reliability of the system, after failure i, is constrained by:

R(t) ≤ c^i.

Let K' be defined such that:

c^{K'} = R_min.

Therefore, an integer value of K', K, can be derived as:

K = \left\lfloor \frac{\log R_{min}}{\log c} \right\rfloor.   (8)


Given the definition of MT, K' is the mean number of failures in the interval [0, MT], constrained by the condition that D ≥ K'; ie, if the number of processors is large enough, K failures are sufficient to reach R(t) = R_min. Since K, as defined in (8), can take only integer values, it is an approximation of the mean number of failures K'. K strictly depends on R_min and c (it is a function of the desired reliability level to be maintained and the quality of the recovery process) and is the number of degradation states corresponding to a system with N_p processors.

We can deduce that when the number of allowed degradation states is sufficiently large (D ≫ K), the necessary condition to reach the minimum reliability level (R(t) = R_min), and therefore the mission-time, is that K processors fail. Since the rate of failures is proportional to the number of processors, the time interval [0, MT] is inversely proportional to N. The following proportionality relation holds:

MT ∝ K / (Nλ).   (9)

For ε ≪ 1 and x = 1 - ε we use the following approximation: log x ≈ -(1 - x) = -ε. We have therefore demonstrated that for D ≫ K:

MT ∝ \frac{-\log R_{min}}{(1 - c) N}.   (10)

This proportionality expression implies that in order to maintain a constant mission-time, any increase in the number of processors must be matched by an equivalent decrease in the probability of failed recovery (1 - c) or by a proportional increase in the MTTF of a single processor.

The time-based analysis can be summarized by comparing the results of the analysis of the mean-time-to-failure and the mission-time. Both are unimodal functions of N. While the maximum value of MTTF(N,c) is determined by c alone, that of MT(N,c) is a function of both R_min and c. In both cases the values of N where the maximum is reached are relatively small for acceptable values of c. In figures 2 and 3, for c = 0.99, N_m = 32 and N_p = 4; in these figures, N takes only values that are powers of 2.
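The mission-time behavior can be checked numerically from the state probabilities (1). The sketch below (Python; λ = 1, R_min = 0.99, D = N - 1; names are illustrative assumptions) finds MT(N,c) by bisection on R(N,t) = R_min and should roughly reproduce the peak near N_p and the 1/((1-c)N) decline quoted above.

```python
from math import comb, exp

LAM, R_MIN = 1.0, 0.99   # unit time = 1/lambda = MTTF_1; minimum reliability level

def reliability(N, c, t):
    """R(N,t) = sum of the state probabilities (1), with D = N - 1."""
    return sum(comb(N, i) * c ** i * exp(-(N - i) * LAM * t) * (1 - exp(-LAM * t)) ** i
               for i in range(N))

def mission_time(N, c, r_min=R_MIN):
    """Largest t with R(N,t) >= r_min (R decreases monotonically in t)."""
    lo, hi = 0.0, 1.0
    while reliability(N, c, hi) >= r_min:
        hi *= 2.0
    for _ in range(60):                      # bisection
        mid = 0.5 * (lo + hi)
        if reliability(N, c, mid) >= r_min:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    for N in (1, 2, 4, 8, 32, 128, 1024):
        print(f"N = {N:5d}   MT(N, 0.99) = {mission_time(N, 0.99):.4f}")
    ratio = mission_time(128, 0.99) / mission_time(128, 0.90)
    print(f"MT(128, 0.99) / MT(128, 0.90) = {ratio:.1f}")
```

With these settings MT(1) ≈ 0.01, the largest value among the printed sizes occurs at N = 4, and the final ratio comes out close to 10, in line with the examples quoted above.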

5. COMPUTATION-BASED ANALYSIS

This section evaluates performance and reliability of large degradable systems based on the notion of computational work. There is no formally defined unit of computational work; we use processor-hours. Another related unit of computational work is machine instructions. Any computational task is characterized by a certain amount of computational work, measured in processor-hours. When this task is executed over several processors, the execution time is reduced, but the amount of processor-hours required for that computation is kept constant if the speed-up is linear. For a non-linear speed-up the amount of required processor-hours increases due to added overhead.

Notation & Dejnitions

T,,

S,,

execution time of a given computation over n processors. attainable speed-up define by T,/T,,. S,, is equivalent to the number of effective processors (number of vir- tual processors fully utilized by the given computation).

E,, parallel efficiency of the computation; defined by S,,/N.

CW(N,t)    amount of effective computational work a system will deliver for a given computational speed-up:

CW(N,t) = \int_0^t \sum_{i=0}^{D} S_{N-i} P_i(τ) dτ

PH(N,t)    amount of processor-hours a system, with initially N processors, can deliver up to time t:

PH(N,t) = \int_0^t \sum_{i=0}^{D} (N - i) P_i(τ) dτ.   (11)

RPH(N,c)   amount of reliable processor-hours, defined as the processor-hours available while the reliability is maintained above a given minimum:

RPH(N,c) = PH(N,MT) = \int_0^{MT} \sum_{i=0}^{D} (N - i) P_i(t) dt.   (12)

PH(N,∞)    mean computation before failure (MCBF)
CW(N,t)    integral of the computational availability [16]

For a computation that exhibits linear speed-up (S_n = n) we have CW(N,t) = PH(N,t). Unless otherwise noted, in the rest of this discussion we assume a best case of linear speed-up and therefore E_N = 1. Both PH(N,t) and CW(N,t) are mean values of the processor-hours and computational-work measures.

5.1 Upper Bound on PH

This section proves that the amount of computational work a purely degradable system can deliver is upper bounded and that the upper bound is independent of the initial number of processors. We derive this upper bound using: 1) a discrete analysis, and 2) a continuous-time Markov model.

Theorem 1. For all N and c < 1 there exists PH_max such that PH(N,t) < PH_max, for all t.

5.1.1 Proof 1: Continuous-time Markov model

Using the binomial theorem, (1) can be rewritten as:

P_i(t) = c^i \binom{N}{i} \sum_{k=0}^{i} \binom{i}{k} (-1)^{i-k} \left(e^{-λt}\right)^{N-k}.   (13)


Therefore, (11) can be rewritten as:

PH(N,t) = \int_0^t \sum_{i=0}^{D} (N - i)\, c^i \binom{N}{i} \sum_{k=0}^{i} \binom{i}{k} (-1)^{i-k} \left(e^{-λτ}\right)^{N-k} dτ.   (14)

Integrate over τ, and take the limit as t → ∞:

PH(N,∞) = \sum_{i=0}^{D} (N - i)\, c^i \binom{N}{i} \sum_{k=0}^{i} \binom{i}{k} (-1)^{i-k} \frac{1}{λ(N - k)}.   (15)

The second summation in (15) can be transformed using binomial identities into:

\sum_{k=0}^{i} \binom{i}{k} (-1)^{i-k} \frac{1}{N - k} = \frac{1}{(N - i)\binom{N}{i}}.   (16)

Eq (15) reduces to:

PH(N,∞) = \frac{1}{λ} \sum_{i=0}^{D} c^i = \frac{1 - c^{D+1}}{λ(1 - c)}.

Therefore:

PH_max = \frac{1}{λ} \cdot \frac{1}{1 - c}

and PH(N,∞) < PH_max for all N. Q.E.D.

5.1.2 Proof 2: Discrete Analysis

The amount of processor-hours, PH, can be expressed as a function of the number of failures i. PH(i) is therefore the amount of processor-hours at failure i:

PH(1) = N \cdot \frac{1}{Nλ} = \frac{1}{λ}

PH(i) = c\,PH(i-1) + \frac{1}{λ} = \frac{1}{λ}\left(1 + c + c^2 + \dots + c^{i-1}\right) = \frac{1}{λ} \cdot \frac{1 - c^i}{1 - c}.

Therefore:

PH_max = \lim_{i→∞} PH(i) = \frac{1}{λ(1 - c)}.

Therefore PH_max is a constant upper bound on PH(i) as i → ∞. Q.E.D.

The conclusion from theorem 1 is that no matter how large the initial number of processors, there is an upper bound on the processor-hours that are obtainable when c < 1. This upper bound is determined only by c and λ, and is reached asymptotically.

PH_max is therefore the upper limit on the mean computation before failure. Comparing this result to that in section 3 shows that while the mean time to failure of a degradable system increases logarithmically with N, the mean computational work performed in that interval is upper bounded for all N. Therefore increasing the system size does not increase the mean computational work the system can deliver before total failure.

5.2 Reliable Processor-Hours

Figure 4 shows RPH according to (12). The values of c have been chosen in the range [0.99, 1); R_min has been set to 0.99. RPH(N,c) is expressed in processor-hours where the unit time is taken as 1/λ.

Figure 4. RPH as Function of N and c

Two observations can be made:

1. For a given c, there is a value of N, N_ph(c), beyond which an increase in N does not increase the amount of reliable processor-hours. For example, N_ph(0.999) = 64. We formally define N_ph(c) as:

for all c < 1,   N ≥ N_ph(c)  ⟹  RPH(N,c) = RPH_max(c).

2. N_ph and RPH_max increase with increasing c. For example:

N_ph(0.99) = 16 and RPH_max = 1.0,
N_ph(0.999) = 64 and RPH_max = 10.0,
N_ph(0.9999) = 512 and RPH_max = 100.

Theorem 1 states that the mean amount of processor-hours in the interval [0,∞] is upper bounded by PH_max. These results show that the mean amount of processor-hours in the interval [0,MT] is also upper bounded, the upper bound being RPH_max.
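The first bound can be illustrated with a small Monte Carlo sketch (Python; λ = 1; the simulation routine and its parameters are illustrative assumptions, not part of the original analysis): each covered failure removes one processor, an uncovered failure ends the run, and the mean processor-hours delivered stays below PH_max = 1/(λ(1-c)) no matter how large N is.

```python
import random

def simulate_ph(N, c, lam=1.0, trials=20000, rng=random.Random(1)):
    """Mean processor-hours delivered before total system failure (Monte Carlo)."""
    total = 0.0
    for _ in range(trials):
        alive, ph = N, 0.0
        while alive > 0:
            dt = rng.expovariate(alive * lam)   # time to the next failure
            ph += alive * dt                    # processor-hours accrued meanwhile
            if rng.random() > c:                # recovery fails: system failure
                break
            alive -= 1                          # covered failure: graceful degradation
        total += ph
    return total / trials

if __name__ == "__main__":
    c = 0.99
    print(f"PH_max = {1.0 / (1.0 - c):.0f} processor-hours (unit time 1/lambda)")
    for N in (10, 100, 1000):
        print(f"N = {N:5d}   mean PH before failure ~ {simulate_ph(N, c):.1f}")
```

The estimates track (1 - c^N)/(1 - c) and saturate near 100 for c = 0.99, as predicted by the closed form derived above.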

RPH_max is therefore the maximum mean amount of computational work the system can deliver subject to the constraint R(t) ≥ R_min. An analytic derivation of RPH_max(c) follows.

The maximum value of RPH(N,c), RPH_max(c), can be derived analytically by using the expression for the mean number of failures in the interval [0,MT], K, as defined in section 4:

RPH_max ≈ RPH' = \frac{1 - c^K}{1 - c}.   (21)

Eq (21) is an approximation of RPH_max because K can take only integer values while MT in (12) has real values.

These results show that there is no increase in reliable computational work when N is increased above N_ph for a given c. This confirms the results obtained in the MT-based evaluation. Although the data in figure 3 cover only a few values of N, the values of N_p, where MT is maximal, and those of N_ph, where RPH(N,c) becomes constant, are both determined by K.

Therefore, for a given coverage c there exists an optimal value of N, N_opt, that maximizes the mission-time MT and the amount of reliable processor-hours.
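A quick check of approximation (21), with R_min = 0.99 and unit time 1/λ (a sketch under these assumptions only), reproduces the RPH_max values quoted for figure 4:

```python
from math import floor, log

R_MIN = 0.99

def rph_prime(c, r_min=R_MIN):
    """Approximation (21): K = floor(log r_min / log c), RPH' = (1 - c^K) / (1 - c)."""
    K = floor(log(r_min) / log(c))
    return K, (1.0 - c ** K) / (1.0 - c)

for c in (0.99, 0.999, 0.9999):
    K, rph = rph_prime(c)
    print(f"c = {c}:  K = {K},  RPH' ~ {rph:.1f} processor-hours")
```

The output (roughly 1, 10, and 100 processor-hours) matches the values listed in observation 2 above.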


5.3 Reliable Computational Work

RPH evaluates the amount of reliable processor-hours potentially available from the system. The fraction of RPH that is actually used by a computation depends on the speed-up S_n. When S_n = n the computation exhibits linear speed-up. This implies that the communication and synchronization overhead in that computation are negligible compared to the execution time. When S_n < n the speed-up is sub-linear. Similarly to RPH we define RCW as:

RCW(N,c) = CW(N,MT)

RCW is therefore the amount of effective reliable computational work a system can deliver, with respect to a given computation, while R(t) ≥ R_min. Therefore RPH = RCW for S_n = n. In evaluating RCW, we take as example a sub-linear speed-up case where:

S_n = \frac{n}{\log n}.

Figure 5. RCW as Function of N and c for S_n = n / log n

The results, plotted in figure 5, show that there exists N_cw such that:

RCW(N_cw) ≥ RCW(N), for all N.

In other words, there exists a value of N, denoted by N_cw, at which the value of RCW is maximal.

Figure 6 is a plot of both RPH and RCW vs N for c = 0.995 and S_n = n/(log n). The effect is predictable: since RPH is constant for N > N_ph under a linear speed-up, for a sub-linear speed-up RCW must be a decreasing function of N.

Figure 6. RCW and RPH as Function of N, for c = 0.995 and S_n = n / log n


This implies that as the system size is increased over N_ph, the probability that a computation will complete reliably decreases if the speed-up of the computation is sub-linear. This result has implications for the scalability of graceful degradation. For a large gracefully degradable system to be scalable, any increase in the system size must be matched by an increase in the quality of the recovery scheme, viz, the coverage, in order to maintain the same performability level.
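The trend in figures 5 and 6 can be reproduced numerically from (11)-(12). The sketch below (Python; λ = 1, R_min = 0.99, D = N - 1, and a base-2 logarithm assumed in S_n = n/log n; all names are illustrative) integrates \sum_i S_{N-i} P_i(t) over [0, MT] for a linear and a sub-linear speed-up:

```python
from math import comb, exp, log2

LAM, R_MIN = 1.0, 0.99

def p_state(N, c, i, t):
    return comb(N, i) * c ** i * exp(-(N - i) * LAM * t) * (1 - exp(-LAM * t)) ** i

def reliability(N, c, t):
    return sum(p_state(N, c, i, t) for i in range(N))

def mission_time(N, c):
    lo, hi = 0.0, 1.0
    while reliability(N, c, hi) >= R_MIN:
        hi *= 2.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if reliability(N, c, mid) >= R_MIN else (lo, mid)
    return lo

def reliable_work(N, c, speedup, steps=400):
    """Trapezoidal integral of sum_i S_{N-i} P_i(t) over [0, MT] (cf. eqs. 11-12)."""
    mt = mission_time(N, c)
    f = lambda t: sum(speedup(N - i) * p_state(N, c, i, t) for i in range(N))
    h = mt / steps
    return h * (0.5 * f(0.0) + sum(f(k * h) for k in range(1, steps)) + 0.5 * f(mt))

linear = lambda n: n                                   # S_n = n  -> RPH
sublinear = lambda n: n / log2(n) if n > 1 else 1.0    # S_n = n / log n (base 2 assumed)

if __name__ == "__main__":
    c = 0.995
    for N in (2, 4, 8, 16, 32, 64, 128):
        rph = reliable_work(N, c, linear)
        rcw = reliable_work(N, c, sublinear)
        print(f"N = {N:4d}   RPH ~ {rph:.2f}   RCW ~ {rcw:.2f}")
```

In this sketch RPH approaches its limiting value as N grows, while RCW peaks and then declines, which is the behavior sketched in figure 6.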

6. DISCUSSION

The results of our analysis show that gracefully degradable large systems do not scale-up. If a minimum reliability level is to be maintained throughout a computation, then there is a system size that provides the maximum mission-time and amount of computational work. For any larger system there would be a decrease in either performance (as expressed in computational work) or reliability (as represented by the minimum level).

Throughout this analysis one quantity has been maintained constant and used as a measure of unit time, viz, the mean-time-to-failure of the single processor (MTTF_1). We have therefore assumed that the system is built using the same set of basic building blocks. A system architect, however, has a wide range of choice of basic building blocks, eg, from high-speed high-power ECL to low-speed low-power CMOS. In addition to the speed and power-consumption factors, the age of a technology matters: components of older, well-established technologies typically have very narrow tolerance levels and they generally have a very low failure rate. New technologies, on the other hand, are more prone to either design errors or higher component-failure rates.

Therefore, the choice of a technology by a system designer not only determines the allowable switching speed, and thereby the potential computing power of the system, but also the anticipated rate of failure. By determining the failure rate, the age of a technology determines the unit-time MTTF_1. Another major factor relevant to this choice is the cost of design and components associated with a given technology; this factor, however, is not immediately relevant to our analysis. While new technologies can offer switching speeds several orders of magnitude larger than older technologies, the failure rates of the older technologies can be several orders of magnitude better. It is conceivable, therefore, that a system built with a time-proven but slow-switching technology might actually outperform a system built with a faster but less reliable technology.

This analysis is based on optimistic, and thereby unrealistic, assumptions. It is evident, however, that taking into account the failure rate of the communication network, and the overhead costs of fault-detection and recovery, can only decrease the derived performance and reliability upper-bounds.

ACKNOWLEDGMENT

This material is based upon work supported in part by the US Department of Energy, Office of Energy Research under Grant No. DE-FG03-87ER25043, and in part by the US National Science Foundation Research Initiation Grant No. CCR-9010240.

REFERENCES

[1] J. G. Kuhl, S. M. Reddy, "Fault-tolerance considerations in large, multiple-processor systems", IEEE Computer, vol 19, 1986 Mar, pp 56-67.
[2] P. Agrawal, "RAFT: A recursive algorithm for fault-tolerance", Proc. 1985 Int'l Conf. Parallel Processing, 1985, pp 814-821.
[3] P. Agrawal, R. Agrawal, "Software implementation of a recursive fault-tolerance algorithm on a network of computers", Proc. 13th Ann. Symp. Computer Architecture, 1986, pp 65-72.
[4] Wensley, Lamport, Goldberg, Green, Levitt, Melliar-Smith, Shostak, Weinstock, "SIFT: Design and analysis of a fault-tolerant computer for aircraft control", Proc. IEEE, vol 66, 1978 Oct.
[5] A. L. Hopkins, T. B. Smith, J. H. Lala, "FTMP - A highly reliable fault-tolerant multiprocessor for aircraft", Proc. IEEE, vol 66, 1978 Oct.
[6] W. N. Toy, "Fault-tolerant design of local ESS processors", Proc. IEEE, vol 66, 1978 Oct.
[7] O. Serlin, "Fault-tolerant systems in commercial applications", IEEE Computer, vol 17, 1984 Aug, pp 19-30.
[8] J. F. Meyer, "On evaluating the performability of degradable computing systems", IEEE Trans. Computers, vol C-29, 1980 Aug, pp 720-731.
[9] E. de Souza e Silva, H. R. Gail, "Calculating cumulative operational time distributions of repairable computer systems", IEEE Trans. Computers, vol C-35, 1986 Apr, pp 322-332.
[10] B. R. Iyer, L. Donatiello, P. Heidelberger, "Analysis of performability for stochastic models of fault-tolerant systems", IEEE Trans. Computers, vol C-35, 1986 Oct, pp 902-907.
[11] J. F. Meyer, "Closed-form solutions of performability", IEEE Trans. Computers, vol C-31, 1982 Jul, pp 648-657.
[12] D. G. Furchtgott, J. F. Meyer, "A performability solution method for degradable non-repairable systems", IEEE Trans. Computers, vol C-33, 1984 Jun.
[13] L. Donatiello, B. R. Iyer, "Analysis of a composite performance reliability measure for fault-tolerant systems", J. ACM, vol 34, 1987 Jan, pp 179-199.
[14] A. Goyal, A. N. Tantawi, "Evaluation of performability for degradable computer systems", IEEE Trans. Computers, vol C-36, 1987 Jun, pp 738-744.
[15] A. D. Ingle, D. P. Siewiorek, "Reliability models for multiprocessor systems with and without periodic maintenance", Proc. 1977 Symp. Fault-Tolerant Computing Systems, 1977, pp 3-9.
[16] M. D. Beaudry, "Performance-related reliability measures for computing systems", IEEE Trans. Computers, vol C-27, 1978 Jun, pp 540-547.
[17] J. A. B. Fortes, C. S. Raghavendra, "Gracefully degradable processor arrays", IEEE Trans. Computers, vol C-34, 1985 Nov, pp 1033-1044.
[18] M. Smotherman, R. M. Geist, K. S. Trivedi, "Provably conservative approximations to complex reliability models", IEEE Trans. Computers, vol C-35, 1986 Apr, pp 333-338.
[19] F. P. Preparata, G. Metze, R. T. Chien, "On the connection assignment problem of diagnosable systems", IEEE Trans. Electronic Computers, vol EC-16, 1967 Dec, pp 848-854.

[20] J. G. Kuhl, S. M. Reddy, "Distributed fault-tolerance for large multiprocessor systems", Proc. 7th Ann. Symp. Computer Architecture, 1980 Jul, pp 23-30.
[21] Bouricius, Carter, Jessep, Schneider, Wadia, "Reliability modeling for fault-tolerant computers", IEEE Trans. Computers, vol C-20, 1971 Nov, pp 1306-1311.
[22] W. Najjar, J-L. Gaudiot, "Network disconnection in distributed systems", Proc. 8th Int'l Conf. Distributed Computing Systems, 1988 Jun.
[23] K. S. Trivedi, Probability and Statistics with Reliability, Queueing and Computer Science Applications, 1982; Prentice-Hall.
[24] D. P. Siewiorek, R. S. Swarz, The Theory and Practice of Reliable System Design, 1982; Digital Press.

AUTHORS

Dr. Walid Najjar; Computer Science Dept.; Colorado State University; Fort Collins, Colorado 80523 USA.

Walid Najjar (S'84-M'88) was born in Beirut, Lebanon, in 1957. He received a BE in Electrical Engineering from the American University of Beirut in 1979 and the MSc and PhD in Computer Engineering from the University of Southern California, Los Angeles, in 1985 and 1988 respectively. He has done research work in parallel processing and concurrent discrete-event simulation at the USC/Information Sciences Institute, Marina del Rey (1986-1989). He has been on the faculty of the Computer Science Department at Colorado State University since 1989, where he is an Assistant Professor. His interests include fault-tolerance in multicomputers, parallel processing, and discrete-event simulation models. Dr. Najjar is a member of the Association for Computing Machinery.

Dr. Jean-Luc Gaudiot; Dept. of Electrical Engineering-Systems; University of Southern California; Los Angeles, California 90089-0781 USA.

Jean-Luc Gaudiot (S'75-M'82-SM'91) was born in Nancy, France in 1954. He received the Diplôme d'Ingénieur from the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris in 1976 and the MSc and PhD in Computer Science from the University of California, Los Angeles, in 1977 and 1982, respectively. His experience includes microprocessor systems design at Teledyne Controls, Santa Monica (1979-1980), and research in innovative architectures at the TRW Technology Research Center, El Segundo. Since graduating in 1982, he has been on the faculty of the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, where he is an Associate Professor. His research interests include data-flow architectures, fault-tolerant multiprocessor systems, and implementation of artificial neural systems. In addition to his academic interests, he has consulted for several aerospace companies in the Southern California area. Dr. Gaudiot is a member of the Association for Computing Machinery.

Manuscript TR88-160 received 1988 August 31; revised 1990 June 3.

IEEE Log Number 42316
