A. Bobbio Bertinoro, March 10-14, 2003 1
Dependability Theory and Methods
Part 1: Introduction and definitions
Andrea Bobbio
Dipartimento di Informatica
Università del Piemonte Orientale, “A. Avogadro”
15100 Alessandria (Italy)
http://www.mfn.unipmn.it/~bobbio
Bertinoro, March 10-14, 2003
Dependability: Definition
Dependability is the property of a system to be dependable in time, i.e. such that reliance can justifiably be placed on the service it delivers.
Dependability extends the interest in the system from the design and construction phase to the operational phase (life cycle).
What dependability theory and practice want to avoid
Dependability: Taxonomy
dependability
  measures: reliability, availability, maintainability, safety, security
  means: fault forecasting, fault tolerance, fault removal, fault prevention
  threats: faults, errors, failures
Quantitative analysis
The quantitative analysis aims at numerically evaluating measures to characterize the dependability of an item:
Risk assessment and safety
Design specifications
Technical assistance and maintenance
Life cycle cost
Market competition
Risk assessment and safety
The risk R associated with an activity is proportional to the probability of occurrence P of the activity and to the magnitude M of the consequences:
R = P · M
A safety-critical system is a system whose incorrect behavior may cause a risk to occur, with undesirable consequences for the item, the operators, the population, or the environment.
Design specifications
Technological items must be dependable.
Sometimes, dependability requirements (both qualitative and quantitative) are part of the design specifications:
Mean time between failures
Total down time
Technical assistance and maintenance
The planning of all activities related to technical assistance and maintenance is linked to the system dependability (expected number of failures over time).
planning spare parts and maintenance crews;
cost of the technical assistance (warranty period);
preventive vs reactive maintenance.
Market competition
The choice of the consumers is strongly influenced by the perceived dependability.
advertisement messages stress the dependability;
the image of a product or of a brand may depend on the dependability.
Purpose of evaluation
Understanding a system
– Observation
– Operational environment
– Reasoning
Predicting the behavior of a system
– Need a model
– A model is a convenient abstraction
– Accuracy based on degree of extrapolation
Methods of evaluation
Measurement-Based
Most believable, most expensive
Not always possible or cost effective during system design
Model-Based
Less believable, less expensive
Analytic vs Discrete-Event Simulation
Combinatorial vs State-Space Methods
Measurement-Based
Most believable, most expensive;
Data are obtained by observing the behavior of physical objects.
field observations;
measurements on prototypes;
measurements on components (accelerated tests).
Models
Analytic: closed-form answers or numerical solution
Simulation
All models are wrong; some models are useful
Methods of evaluation
Measurements + Models data bank
The probabilistic approach
The mechanisms that lead a technological object to failure are very complex and depend on many physical, chemical, technical, human, and environmental factors.
The time to failure cannot be expressed by a deterministic law.
We are forced to assume the time to failure as a random variable.
The quantitative dependability analysis is based on a probabilistic approach.
Reliability
Reliability is a measurable attribute of dependability, defined as follows:
The reliability R(t) of an item at time t is the probability that the item performs the required function in the interval (0, t) given the stress and environmental conditions in which it operates.
Basic Definitions: cdf
Let X be the random variable representing the time to failure of an item.
The cumulative distribution function (cdf) F(t) of the r.v. X is given by:
F(t) = Pr { X ≤ t }
F(t) represents the probability that the item has already failed at time t (unreliability).
Basic Definitions: cdf
Equivalent terminology for F(t):
CDF (cumulative distribution function)
Probability distribution function
Distribution function
Basic Definitions: cdf
[Figure: F(t) plotted against t, rising from 0 toward 1, with the values F(a) and F(b) marked at times a and b]
$F(0) = 0$
$\lim_{t \to \infty} F(t) = 1$
F(t) is non-decreasing
Basic Definitions: Reliability
Let X be the random variable representing the time to failure of an item.
The survivor function (sf) R(t) of the r.v. X is given by:
R (t) = Pr { X > t } = 1 - F(t)
R(t) represents the probability that the item is correctly working at time t and gives the reliability function .
Basic Definitions
Equivalent terminology for R(t) = 1 -F(t) :
Reliability
Complementary distribution function
Survivor function
Basic Definitions: Reliability
[Figure: R(t) plotted against t, falling from 1 toward 0, with the value R(a) marked at time a and a later time b on the axis]
$R(0) = 1$
$\lim_{t \to \infty} R(t) = 0$
R(t) is non-increasing
Basic Definitions: density
Let X be the random variable representing the time to failure of an item and let F(t) be a differentiable cdf.
The density function f(t) is defined as:
$f(t) = \frac{dF(t)}{dt}$
$f(t)\,dt = \Pr\{ t \le X < t + dt \}$
Basic Definitions: Density
[Figure: f(t) plotted against t, with the points a and b marked on the t axis]
$\int_a^b f(x)\,dx = \Pr\{ a < X \le b \} = F(b) - F(a)$
Basic Definitions: Density
[Figure: f(t) plotted against t]
$MTTF = E[X] = \int_0^\infty t\,f(t)\,dt = \int_0^\infty R(t)\,dt$
Basic Definitions
Equivalent terminology: pdf
probability density function
density function
density
$f(t) = \frac{dF(t)}{dt}$
$F(t) = \int_{-\infty}^{t} f(x)\,dx = \int_0^t f(x)\,dx$ for a non-negative random variable
Quiz 1: The higher the MTTF is, the higher the item reliability is.
1. Correct
2. Wrong
The correct answer is "Wrong"! The MTTF is only the mean of the time-to-failure distribution; it does not determine the value of R(t) at a given time t.
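A minimal numerical sketch of why a higher MTTF does not imply a higher reliability. The two items below are hypothetical and not taken from the slides: item A is assumed to have an exponential time to failure with rate 1, and item B a Weibull time to failure with R(t) = exp(-scale * t^shape), scale 1 and shape 0.5. Under these assumptions B has twice the MTTF of A, yet it is less reliable for a short mission.

```python
import math

def exponential_reliability(t, rate):
    """R(t) = exp(-rate * t) for an exponential time to failure."""
    return math.exp(-rate * t)

def weibull_reliability(t, scale, shape):
    """R(t) = exp(-scale * t**shape) for a Weibull time to failure."""
    return math.exp(-scale * t ** shape)

def weibull_mttf(scale, shape):
    """Mean of the Weibull above: scale**(-1/shape) * Gamma(1 + 1/shape)."""
    return scale ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)

# Hypothetical item A: exponential with rate 1, so MTTF = 1/rate = 1.
mttf_a = 1.0 / 1.0
# Hypothetical item B: Weibull with scale 1 and shape 0.5, so MTTF = Gamma(3) = 2.
mttf_b = weibull_mttf(1.0, 0.5)

# Compare the two reliabilities at a short (assumed) mission time.
mission_time = 0.5
r_a = exponential_reliability(mission_time, 1.0)
r_b = weibull_reliability(mission_time, 1.0, 0.5)

print(f"MTTF: A = {mttf_a:.2f}, B = {mttf_b:.2f}")
print(f"R({mission_time}): A = {r_a:.3f}, B = {r_b:.3f}")
```

The comparison depends on the mission time: with these assumed parameters the ordering of the two reliabilities reverses for mission times above 1.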
Hazard (failure) rate
h(t) Δt = conditional probability that the system will fail in (t, t + Δt), given that it has survived until time t
f(t) Δt = unconditional probability that the system will fail in (t, t + Δt)
$h(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)}$
The Failure Rate of a Distribution
h(t) Δt is the conditional probability that the unit will fail in the interval (t, t + Δt), given that it is functioning at time t.
f(t) Δt is the unconditional probability that the unit will fail in the interval (t, t + Δt).
Difference between the two sentences:
– probability that someone will die between 90 and 91, given that he lives to 90
– probability that someone will die between 90 and 91
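The distinction between the conditional and the unconditional probability can be checked numerically. The sketch below assumes an illustrative lifetime distribution with cdf F(t) = 1 - exp(-t^2), chosen only for the example and not taken from the slides, and compares the exact interval probabilities with the approximations based on the density and on the failure rate.

```python
import math

def cdf(t):
    """Assumed illustrative cdf: F(t) = 1 - exp(-t**2)."""
    return 1.0 - math.exp(-t ** 2)

def density(t):
    """f(t) = dF/dt = 2 t exp(-t**2)."""
    return 2.0 * t * math.exp(-t ** 2)

def hazard(t):
    """h(t) = f(t) / (1 - F(t)); equals 2 t for this distribution."""
    return density(t) / (1.0 - cdf(t))

def prob_fail_in(t, dt):
    """Unconditional probability of failing in (t, t + dt)."""
    return cdf(t + dt) - cdf(t)

def prob_fail_in_given_survival(t, dt):
    """Probability of failing in (t, t + dt), given survival up to t."""
    return prob_fail_in(t, dt) / (1.0 - cdf(t))

t, dt = 1.5, 1e-4
unconditional = prob_fail_in(t, dt)
conditional = prob_fail_in_given_survival(t, dt)

print("unconditional:", unconditional, "vs f(t) dt =", density(t) * dt)
print("conditional:  ", conditional, "vs h(t) dt =", hazard(t) * dt)
```

The conditional probability is the larger of the two because it is the unconditional one divided by R(t), which is at most 1.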
Bathtub curve
[Figure: bathtub curve, failure rate h(t) versus t, in three phases]
DFR: decreasing failure rate (infant mortality, burn-in)
CFR: constant failure rate (useful life)
IFR: increasing failure rate (wear-out phase)
Infant mortality (dfr)
Also called infant mortality phase or reliability growth phase. The failure rate decreases with time.
Caused by undetected hardware/software defects;
Can cause significant prediction errors if steady-state failure rates are used;
Weibull model can be used.
Useful life (cfr)
The failure rate remains constant in time (age independent).
Failure rate much lower than in early-life period.
Failures caused by random effects (such as environmental shocks).
Wear-out phase (ifr)
The failure rate increases with age.
It is characteristic of irreversible aging phenomena (deterioration, wear-out, fatigue, corrosion, etc.).
Applicable to mechanical and other systems.
(Properly qualified electronic parts do not exhibit wear-out failures during their intended service life.)
Weibull failure model can be used.
Exponential Distribution
Cumulative distribution function: $F(t) = 1 - e^{-\lambda t}, \quad t \ge 0$
Reliability: $R(t) = e^{-\lambda t}, \quad t \ge 0$
Density function: $f(t) = \lambda e^{-\lambda t}, \quad t \ge 0$
Failure rate (CFR): $h(t) = \frac{f(t)}{R(t)} = \lambda$
Mean time to failure: $MTTF = \frac{1}{\lambda}$
Failure rate is age-independent (constant).
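The exponential formulas can be sketched in a few lines of code. The function names and the failure rate value used below are illustrative assumptions; the point is that the ratio f(t)/R(t) returns the same value at every age.

```python
import math

def exp_cdf(t, lam):
    """F(t) = 1 - exp(-lam * t), for t >= 0."""
    return 1.0 - math.exp(-lam * t)

def exp_reliability(t, lam):
    """R(t) = exp(-lam * t), for t >= 0."""
    return math.exp(-lam * t)

def exp_density(t, lam):
    """f(t) = lam * exp(-lam * t), for t >= 0."""
    return lam * math.exp(-lam * t)

def exp_hazard(t, lam):
    """h(t) = f(t) / R(t); constant and equal to lam."""
    return exp_density(t, lam) / exp_reliability(t, lam)

def exp_mttf(lam):
    """MTTF = 1 / lam."""
    return 1.0 / lam

lam = 0.002  # hypothetical failure rate, in failures per hour
for t in (0.0, 100.0, 500.0, 1000.0):
    print(f"t = {t:6.0f} h  R(t) = {exp_reliability(t, lam):.4f}  h(t) = {exp_hazard(t, lam):.4f}")
print("MTTF =", exp_mttf(lam), "h")
```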
[Figure: The Cumulative Distribution Function of an Exponentially Distributed Random Variable With Parameter λ = 1, plotted for t from 0 to 5.00]
$F(t) = 1 - e^{-t}$
[Figure: The Reliability Function of an Exponentially Distributed Random Variable With Parameter λ = 1, plotted for t from 0 to 5.00]
$R(t) = e^{-t}$
Exponential Density Function (pdf)
[Figure: f(t) plotted against t]
MTTF = 1/λ
Memoryless Property of the Exponential Distribution
Assume X > t, i.e. we have observed that the component has not failed until time t.
Let Y = X - t be the remaining (residual) lifetime.
$G_t(y) = P(Y \le y \mid X > t)$
$= P(X \le y + t \mid X > t)$
$= \frac{P(t < X \le y + t)}{P(X > t)}$
$= 1 - e^{-\lambda y}$
Memoryless Property of the Exponential Distribution (cont.)
Thus G_t(y) is independent of t and is identical to the original exponential distribution of X.
The distribution of the remaining life does not depend on how long the component has been operating.
An observed failure is the result of some suddenly appearing failure, not due to gradual deterioration.
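The memoryless property can be verified numerically. The sketch below evaluates the residual-life distribution directly from its definition as a conditional probability, for an exponential time to failure, and compares it with the original exponential cdf. The rate value and the ages are illustrative assumptions.

```python
import math

def residual_life_cdf(y, t, lam):
    """G_t(y) = P(t < X <= t + y) / P(X > t) for exponential X with rate lam."""
    survive_to_t = math.exp(-lam * t)
    fail_between = math.exp(-lam * t) - math.exp(-lam * (t + y))
    return fail_between / survive_to_t

def exp_cdf(y, lam):
    """Original exponential cdf: F(y) = 1 - exp(-lam * y)."""
    return 1.0 - math.exp(-lam * y)

lam = 0.1  # hypothetical failure rate
y = 5.0    # additional operating time of interest
for age in (0.0, 10.0, 100.0):
    print(f"age = {age:5.1f}  G_t(y) = {residual_life_cdf(y, age, lam):.6f}  F(y) = {exp_cdf(y, lam):.6f}")
```

Whatever the age of the component, the two columns agree up to rounding.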
Quiz 3: If two components (say, A and B) have independent identical exponentially distributed times to failure, by the “memoryless” property, which of the following is true?
1. They will always fail at the same time
2. They have the same probability of failing at time ‘t’ during operation
3. When these two components are operating simultaneously, the component which has been operational for a shorter duration of time will survive longer
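Weibull Distribution
Distribution function: $F(t) = 1 - e^{-\lambda t^{\alpha}}, \quad t \ge 0$
Density function: $f(t) = \lambda \alpha t^{\alpha - 1} e^{-\lambda t^{\alpha}}, \quad t \ge 0$
Reliability: $R(t) = e^{-\lambda t^{\alpha}}, \quad t \ge 0$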
Weibull Distribution
α: shape parameter;
λ: scale parameter.
Failure rate: $h(t) = \frac{f(t)}{R(t)} = \lambda \alpha t^{\alpha - 1}$
$\alpha < 1$: DFR
$\alpha = 1$: CFR
$\alpha > 1$: IFR
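A short sketch of how the shape parameter drives the trend of the Weibull failure rate. It assumes the parameterization R(t) = exp(-scale * t^shape), so that the failure rate is scale * shape * t^(shape - 1); the function names and the sample values are illustrative.

```python
def weibull_hazard(t, scale, shape):
    """Failure rate h(t) = scale * shape * t**(shape - 1), for t > 0."""
    return scale * shape * t ** (shape - 1.0)

def failure_rate_trend(shape):
    """Classify the failure rate as decreasing, constant or increasing."""
    if shape < 1.0:
        return "DFR"
    if shape == 1.0:
        return "CFR"
    return "IFR"

# Compare the failure rate at two ages for three assumed shape values.
for shape in (0.5, 1.0, 2.0):
    early = weibull_hazard(1.0, 1.0, shape)
    late = weibull_hazard(4.0, 1.0, shape)
    print(f"shape = {shape}: h(1) = {early:.3f}, h(4) = {late:.3f} -> {failure_rate_trend(shape)}")
```

With shape equal to 1 the expression reduces to the constant failure rate of the exponential distribution.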
[Figure: Failure Rate of the Weibull Distribution with Various Values of α]
[Figure: Weibull Distribution for Various Values of α; panels: cdf, density]
Failure Rate Models
We use a truncated Weibull model.
The infant mortality phase is modeled by a DFR Weibull and the steady-state phase by the exponential.
[Figure 2.34: Weibull Failure-Rate Model. Failure-rate multiplier (0 to 7) versus operating time (0 to 17,520 hrs)]
Failure Rate Models (cont.)
This model has the form:
$\lambda_W(t) = \begin{cases} C_1 t^{-\alpha}, & 1 \le t < 8{,}760 \\ \lambda_{SS}, & t \ge 8{,}760 \end{cases}$
where:
$C_1 = \lambda_W(1)$, $\lambda_{SS}$ = steady-state failure rate
$\alpha$ is the Weibull shape parameter
Failure rate multiplier = $\lambda_W(t) / \lambda_{SS}$
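A minimal sketch of a truncated Weibull failure-rate model, under stated assumptions: the failure rate is taken to be C1 * t^(-alpha) before 8,760 operating hours and equal to the steady-state rate afterwards, and C1 is chosen here so that the two pieces join continuously at 8,760 hours. The continuity choice, the function names, and the numerical values are assumptions made for illustration, not given in the slides.

```python
HOURS_PER_YEAR = 8760.0

def truncated_weibull_rate(t, lambda_ss, alpha):
    """Failure rate at operating time t (hours, t > 0)."""
    if t <= 0.0:
        raise ValueError("operating time must be positive")
    # Assumption: C1 makes the curve continuous at 8,760 h.
    c1 = lambda_ss * HOURS_PER_YEAR ** alpha
    if t < HOURS_PER_YEAR:
        return c1 * t ** (-alpha)   # infant mortality phase (DFR)
    return lambda_ss                # steady-state phase (constant)

def failure_rate_multiplier(t, lambda_ss, alpha):
    """Ratio of the failure rate at time t to the steady-state rate."""
    return truncated_weibull_rate(t, lambda_ss, alpha) / lambda_ss

lambda_ss = 1e-5  # hypothetical steady-state failure rate, per hour
alpha = 0.5       # hypothetical shape parameter of the early-life phase
for hours in (2190.0, 4380.0, 8760.0, 17520.0):
    print(f"{hours:8.0f} h  multiplier = {failure_rate_multiplier(hours, lambda_ss, alpha):.3f}")
```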
Failure Rate Models (cont.)
There are several ways to incorporate time-dependent failure rates in availability models.
The easiest way is to approximate a continuous function by a piecewise constant step function.
[Figure: Discrete Failure-Rate Model. Failure-rate multiplier (0 to 7) versus operating time (0 to 17,520 hrs), with step levels $\lambda_1$, $\lambda_2$ and $\lambda_{SS}$]
Failure Rate Models (cont.)
Here the discrete failure-rate model is defined by:
$\lambda_W(t) = \begin{cases} \lambda_1, & 0 \le t < 4{,}380 \\ \lambda_2, & 4{,}380 \le t < 8{,}760 \\ \lambda_{SS}, & t \ge 8{,}760 \end{cases}$
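The step model amounts to a lookup on the operating time. The sketch below assumes three constant levels with breakpoints at 4,380 and 8,760 hours; the level values and the function name are hypothetical and chosen only to make the example run.

```python
def step_failure_rate(t, lambda_1, lambda_2, lambda_ss):
    """Piecewise constant failure rate at operating time t (hours)."""
    if t < 0.0:
        raise ValueError("operating time must be non-negative")
    if t < 4380.0:
        return lambda_1    # first half of the first year
    if t < 8760.0:
        return lambda_2    # second half of the first year
    return lambda_ss       # steady state

# Hypothetical levels, per hour, decreasing toward the steady-state value.
lambda_1, lambda_2, lambda_ss = 3e-5, 1.5e-5, 1e-5
for hours in (0.0, 4379.0, 4380.0, 8760.0, 17520.0):
    print(f"{hours:8.0f} h  rate = {step_failure_rate(hours, lambda_1, lambda_2, lambda_ss):.1e}")
```

A constant rate within each interval is what allows each interval to be treated with exponential (Markov) models.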
A lifetime experiment
N i.i.d. components are put in a life test experiment.
[Figure: components 1, 2, 3, 4, ..., N start operating at t = 0 and fail at times X 1, X 2, X 3, X 4, ..., X N]
Repairable systems
Availability
Repairable systems
X 1, X 2, ..., X n: successive UP times
Y 1, Y 2, ..., Y n: successive DOWN times
[Figure: system state versus t, alternating between UP intervals X 1, X 2, X 3, ... and DOWN intervals Y 1, Y 2, ...]
Repairable systems
The usual hypothesis in modeling repairable systems is that:
The successive UP times X 1, X 2, ..., X n are i.i.d. random variables, i.e. samples from a common cdf F(t)
The successive DOWN times Y 1, Y 2, ..., Y n are i.i.d. random variables, i.e. samples from a common cdf G(t)
Repairable systems
The dynamic behaviour of a repairable system is characterized by:
the r.v. X of the successive up times
the r.v. Y of the successive down times
[Figure: system state versus t, alternating between UP intervals X 1, X 2, X 3, ... and DOWN intervals Y 1, Y 2, ...]
Maintainability
Let Y be the r.v. of the successive down times:
$G(t) = \Pr\{ Y \le t \}$ (maintainability)
$g(t) = \frac{dG(t)}{dt}$ (density)
$h_g(t) = \frac{g(t)}{1 - G(t)}$ (repair rate)
$MTTR = \int_0^\infty t\,g(t)\,dt$ (Mean Time To Repair)
Availability
The availability A(t) of an item at time t is the probability that the item is correctly working at time t.
Availability
The measure to characterize a repairable system is the availability (unavailability):
A(t) = Pr { time t, system = UP }
U(t) = Pr { time t, system = DOWN }
A(t) + U(t) = 1
Definition of Availability
An important difference between reliability and availability is:
reliability refers to failure-free operation during an interval (0, t);
availability refers to failure-free operation at a given instant of time t (the time when a device or system is accessed to provide a required function), independently of the number of failure/repair cycles.
Definition of Availability
System Failure and Restoration Process
[Figure: indicator function I(t) versus t, alternating between 1 (operating and providing a required function) and 0 (failed and being restored)]
I(t) indicator function: 1 = working, 0 = failed
Availability evaluation
In the special case when times to failure and times to restoration are both exponentially distributed, the alternating process can be viewed as a two-state homogeneous Continuous Time Markov Chain.
Time-independent failure rate λ
Time-independent repair rate μ
2-State Markov Availability Model
[Figure: two-state Markov chain with states UP (1) and DN (0); transition UP → DN at rate λ = 1/MTTF and DN → UP at rate μ = 1/MTTR]
Transient availability analysis:
for each state, we apply a flow balance equation:
– Rate of buildup = rate of flow IN - rate of flow OUT
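2-State Markov Availability Model
[Figure: A(t) versus t, starting at 1 and approaching the steady-state value $A_{SS}$]
$A_{SS} = \frac{\mu}{\lambda + \mu} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{MTTF}{MTTF + MTTR}$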
2-State Markov Model
1) Pointwise availability A(t): $A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$
2) Steady state availability: limiting value as $t \to \infty$: $A_{SS} = \frac{\mu}{\lambda + \mu}$
3) If there is no restoration ($\mu = 0$) the availability becomes the reliability: $A(t) = R(t) = e^{-\lambda t}$
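A sketch of the two-state Markov availability model, assuming the system is UP at t = 0, a constant failure rate lam = 1/MTTF and a constant repair rate mu = 1/MTTR. Under these assumptions the standard closed form is A(t) = mu/(lam + mu) + lam/(lam + mu) * exp(-(lam + mu) t). The MTTF and MTTR values below are hypothetical.

```python
import math

def pointwise_availability(t, lam, mu):
    """A(t) for the two-state model, with the system UP at t = 0."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows: mu / (lam + mu)."""
    return mu / (lam + mu)

mttf, mttr = 1000.0, 10.0        # hypothetical values, in hours
lam, mu = 1.0 / mttf, 1.0 / mttr

for t in (0.0, 5.0, 20.0, 100.0):
    print(f"A({t:5.1f}) = {pointwise_availability(t, lam, mu):.6f}")
print("A_ss =", steady_state_availability(lam, mu))

# With no restoration (mu = 0) the availability reduces to the reliability.
print("A(100) with mu = 0:", pointwise_availability(100.0, lam, 0.0))
```

The steady-state value computed from the rates equals MTTF / (MTTF + MTTR) for the same inputs.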
Steady-state Availability
In many system models, the limit
$A_{SS} = \lim_{t \to \infty} A(t)$
exists and is called the steady-state availability.
The steady-state availability represents the probability of finding a system operational after many fail-and-restore cycles.
Steady-state Availability
[Figure: system state versus t, alternating between UP (1) and DOWN (0)]
Expected UP time E[U(t)] = MUT = MTTF
Expected DOWN time E[D(t)] = MDT = MTTR
$A_{SS} = \frac{MUT}{MUT + MDT} = \frac{MTTF}{MTTF + MTTR}$
Availability: Example (I)
Let a system have a steady state availability
Ass = 0.95
This means that, given a mission time T, it is expected that the system works correctly for a total time of 0.95*T.
Or, alternatively, it is expected that the system is out of service for a total time:
Uss * T = (1- Ass) * T
Availability: Example (II)
Let a system have a rated productivity of W $/year.
The loss due to system out of service can be estimated as:
Uss * W = (1- Ass) * W
The availability (unavailability) is an index to estimate the real productivity, given the rated productivity.
Alternatively, if the goal is to have a net productivity of W $/year, the plant must be designed such that its rated productivity W’ satisfies:
Ass * W’ = W
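The productivity reasoning can be sketched as simple arithmetic, assuming that the net productivity equals the rated productivity multiplied by the steady-state availability. The function names and the dollar figures are hypothetical.

```python
def expected_loss(rated_productivity, a_ss):
    """Yearly loss due to unavailability: (1 - A_ss) * W."""
    return (1.0 - a_ss) * rated_productivity

def required_rated_productivity(net_target, a_ss):
    """Rated productivity needed to reach a net target: W' = W / A_ss."""
    return net_target / a_ss

a_ss = 0.95                      # steady-state availability, as in Example (I)
rated = 1_000_000.0              # hypothetical rated productivity, $/year

loss = expected_loss(rated, a_ss)
needed = required_rated_productivity(rated, a_ss)
print(f"expected loss: {loss:,.0f} $/year")
print(f"rated productivity needed for a net {rated:,.0f} $/year: {needed:,.0f} $/year")
```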
Availability
We can show that:
$A_{SS} = \frac{MTTF}{MTTF + MTTR}$
This result is valid without making any assumptions on the form of the distributions of times to failure and times to repair.
Also:
downtime (in minutes per year) = $(1 - A_{SS}) \times 8760 \times 60$

Availability, A    Unavailability, U    Downtime in min./year
0.99               0.01                 5,256
0.999              0.001                525.6
0.9999             0.0001               52.56
0.99999            0.00001              5.256
0.999999           0.000001             0.5256
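The downtime figures follow directly from the formula, using 8760 hours per year and 60 minutes per hour. A minimal sketch, with an illustrative function name:

```python
MINUTES_PER_YEAR = 8760 * 60  # 525,600 minutes in a year of 8760 hours

def downtime_minutes_per_year(availability):
    """Expected yearly downtime: (1 - A_ss) * 8760 * 60 minutes."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for a in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
    print(f"A = {a:<9}  U = {1.0 - a:.6f}  downtime = {downtime_minutes_per_year(a):.4f} min/year")
```

Each additional "nine" of availability divides the yearly downtime by ten.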
Motivation – High Availability
Maintainability
MDT (Mean Down Time, or MTTR, mean time to restoration).
The total down time (Y) consists of:
• Failure detection time
• Alarm notification time
• Dispatch and travel time of the repair person(s)
• Repair or replacement time
• Reboot time
$MDT = E[Y] = \int_0^\infty t\,g(t)\,dt$
Maintainability
The total down time (Y) consists of:
• Logistic time
  Administrative times
  Dispatch and travel time of the repair person(s)
  Waiting time for spares, tools …
• Effective restoration time
  Access and diagnosis time
  Repair or replacement time
  Test and reboot time
Maintenance Costs
The total cost of a maintenance action consists of:
Cost of spares and replaced parts
Cost of person/hours for repair
Down-time cost (loss of productivity)
The down-time cost (due to a loss of productivity) can be the most relevant cost factor.
Maintenance Policy
A maintenance policy is the sequence of actions that minimizes the total cost related to a down time:
Reactive maintenance: maintenance action is triggered by a failure.
Proactive maintenance: preventive maintenance policy.