
Prediction of Reliability of Environmental

Control and Life Support Systems

Haibei Jiang∗ and Luis F. Rodríguez†

University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA

Scott Bell‡ and David Kortenkamp§

NASA-Johnson Space Center, Houston, TX, 77058, USA

Francisco Capristan¶

Georgia Institute of Technology, Atlanta, GA, 30332, USA

An increasing awareness of life support system reliability has been no-

ticed in the aerospace community as long-term space missions become real-

istic objectives. Literature review indicates a significant knowledge gap in

the accurate evaluation of the reliability of environmental control and life

support systems. Quantitative determination of system reliability, however,

is subject to large data requirements, which often limit its applicability. In

an effort to address this issue, this paper presents an approach to reliability

analysis for exploration life support system design. A simulation tool has

been developed with the capability of representing complex dynamic sys-

tems with configurable failure rate functions for life support hardware. This

tool has been applied and compared with classical reliability prediction ap-

proaches. As a result of this work, it has been determined that typical life

support system configurations are likely to be more reliable than classical approaches might suggest. This is due to an inherent buffering capacity in life-support system design, which might be leveraged to improve the cost effectiveness of future life support system design.

∗Graduate Student, Department of Agricultural and Biological Engineering, and Department of Aerospace Engineering, 376D Agricultural Engineering Sciences Building, MC-644, 1304 West Pennsylvania Avenue, Urbana, IL 61801.

†Corresponding author. Assistant Professor, Department of Agricultural and Biological Engineering, 376C Agricultural Engineering Sciences Building, MC-644, 1304 West Pennsylvania Avenue, Urbana, IL 61801. AIAA Member.

‡Research Scientist, TRACLabs Inc., 1012 Hercules, Houston, TX 77058.

§Senior Scientist, TRACLabs Inc., 1012 Hercules, Houston, TX 77058. AIAA Member.

¶Undergraduate Student, School of Aerospace Engineering, 47 Northwest Boulevard, Miami, FL 33126.

Nomenclature

R(t) system reliability

Ri(t) reliability of subsystem i

F (t) cumulative failure function

f(t) probability density function

λ the characteristic parameter of the exponential distribution, and a characteristic parameter

of the two parameter Weibull distribution. In the case of the exponential distribution,

generally units are failure per unit time, or failure/hour, in this case. The reciprocal of

lambda is generally regarded as the mean time to failure of the exponential distribution.

In the Weibull distribution, λ is a shape parameter.

µ one of the two characteristic parameters describing the normal distribution. Generally

the units of µ are in time, hours in this case. The mean time to failure of the normal

function is µ.

σ one of the two characteristic parameters describing the normal distribution. σ is the

standard deviation, expressed in the same units as µ (hours in this case).

L(X1, . . . , Xn; Θ) the likelihood function of the system described by failure data X1, . . . , Xn,

as a function of the distribution parameter Θ.

Θ the distribution parameter estimated by maximum likelihood

λ∗ the maximum likelihood estimate of the parameter λ of the exponential distribution

n the number of data points utilized to determine the maximum likelihood estimate for the

exponential distribution

β a characteristic parameter of the two parameter Weibull distribution.

Tsys the estimated mean time to failure of the system

Ti the estimated mean time to failure of subsystem i

min(a, b, c) minimum function, returning the smallest value of a, b, or c


I. Introduction

Large scale complex environmental control systems, such as those considered by the

National Aeronautics and Space Administration (NASA) feature non-deterministic charac-

teristics. These systems are not accurately represented by combinations of series and parallel

reliability block diagrams, thus they present major challenges for quantitative risk analysis.

Nevertheless, robust environmental control and life support systems with multi-degree fault

tolerance and well proven contingency plans are desired by NASA and its space exploration

program.ᵃ Long duration human activity in a Lunar Outpost has been proposed as a gateway to future Martian exploration. As mission length increases, resupplies of food, water,

air and life essentials become more and more costly. Since crew survivability is the most im-

portant factor in manned space exploration, designing and building an authentically reliable

regenerative life support system is of critical importance. The classical design of reliable

systems involves accurate prediction of random component failure, the related cascading

effects, contingency planning, and maintenance strategies.

Current NASA reliability analysis is a “lessons learned” style database built on historical

data and expert opinions. Reliability, or failure probability, is determined by experiment or,

more often, by assumption. A widely used database compiled based on the International

Space Station (ISS) is known as the ISS Risk Management Application (IRMA)1 which

emerges from the Futron Integrated Risk Management Application.ᵇ It uses a two-dimensional risk assessment approach to predict the likelihood and consequence of any given event.

These judgments are made by designers, operators, astronauts and analysts in a score ma-

trix. Possible reliability issues will thus be addressed according to the priority decided by

these scores. NASA has also developed a Probabilistic Risk Assessment (PRA) tool considering the failure modes of the Space Shuttle.ᶜ In this case, failure modes are identified

by personnel working in Space Shuttle design, maintenance, operations, or analysis. Failure

modes and their related effects are evaluated for their impacts on system health. Failure

Modes and Effects Analysis (FMEA)2 is a similar approach, popular in the industry due

to its successful applications in many important projects, such as the Concorde and Airbus

projects,3 the Lunar Module, and many other applications, such as military systems, car

manufacturing, and nuclear power plants.4,5 Other alternative approaches exist as well, in-

cluding Fault Tree Analysis (FTA),6 What-If Analysis, Functions-Components-Parameters

Analysis (FCP),7 and Hazard and Operability Method (HAZOP),8 all coming from analogous

ᵃ See Exploration Life Support Overview, available at: http://els.jsc.nasa.gov/, last accessed on December 9, 2009.

ᵇ “Risk Management: Futron Integrated Risk Management Application (FIRMA),” http://www.futron.com/, last accessed: July 2009.

ᶜ “Probabilistic Risk Assessment: What is it and Why is it Worth Performing,” http://www.hq.nasa.gov/office/codeq/qnews/pra.pdf, last accessed: August 2009.


challenges existing within the chemical processing or nuclear industry.9,10 The limitations of

these approaches can be summarized as follows:

1. All these approaches heavily rely on operational data, which can only be acquired after

the system is operational.

2. The magnitude of effort required to assemble all the possible failure modes limits their

applicability.

3. In the case there is a large but incomplete amount of data available, the effectiveness

of the analysis depends heavily on the focus and objectivity of the assessment team

due to the inherent use of opinion from individuals close to the system.

4. None of the existing approaches can address the impact of buffering capacity, repairable

components, maintenance quality, or reliability degradation.

Overall, these limitations reveal the concern that the classical reliability and risk analysis

approaches may not be precise and effective for systems like life support systems. In the

case of environmental control and life support systems (ECLSS), there exists a demonstrated

capacity to recover from major system failures given the ingenuity of crew and mission control

and an opportunity to provide corrective maintenance. For example, recall the miraculous

recovery during the Apollo 13 mission, where the very volume of the habitat provided

enough of a buffer to allow the crew and mission control to reconfigure the system and return

to Earth safely. It is this buffering capacity that we seek to quantify here by considering

the design of ECLSS. Buffering capacity in life support system design is represented by

additional stored resources in the crew habitat environment or in storage buffers. These

resources can be utilized by the crew in the event of failure of life support hardware, and can

prove critical in ensuring crew survival. Moreover, as mission length and distance from Earth increase, crew challenges such as the failure of life support hardware must be expected, and contingency plans will need to be prepared considering the limited availability of support from Earth. With better quantification of the amount of buffering capacity available, contingency design can be based on a quantitative understanding as to which resources define the most

critical buffers. It is shown here that the buffering capacity of ECLSS can have a large impact

on the accuracy of standard reliability approaches and, therefore, alternative methods should

be considered.

This paper will present the recent findings to address this challenge. The main con-

tribution of this paper lies in the demonstration of the capability of several developmental

reliability assessment approaches and a component-based simulation tool for studying the

reliability and cost of complex environmental systems in space applications. The virtual

environment we built is intended to deliver the following advantages:


• Provide a virtual test bed which allows mission designers to test different system designs

and study the tradeoff between design and system reliability;

• Predict the reliability function for the integrated system based on component reliability

functions;

• Determine the minimum component reliability requirements given various system level

reliability objectives;

• Test different corrective and preventive maintenance strategies to determine the opti-

mal maintenance scheduling;

• Compare system ESM (Equivalent System Mass) and MTTF (Mean Time To Failure);

• Address the buffering capacity in ECLSS and its impact on system reliability and cost.

Due to the complexity of the problem and the depth of study we plan to conduct, the

overall objectives of this study can be divided into three interrelated phases.

Phase I Compare life testing results using classical reliability block diagrams, modified

reliability block diagrams, and simulation experiments coupled with quantitative sta-

tistical methods.

Phase II Establish a reliability theory which considers system buffering capacity (similar

to response delay defined in modern control theory). With such a theory, the objective

is to obtain more accurate reliability prediction results using a modified conventional

reliability theory in studying complex environmental systems.

Phase III Model preventive and corrective maintenance functions and study the impact

of their quality and schedules. Demonstrate the importance of employing appropriate

contingency plans by testing systems with and without them. Optimize system design

by balancing the tradeoff between reliability and cost. Reconfigurable control systems

can be designed and tested at this stage as well.

The first two phases of the plan have been completed and the corresponding results

are presented in this paper. The remainder of the paper is organized as follows: Section II

introduces the reliability prediction approaches adopted for this analysis, including reliability

block diagram, modified reliability block diagram, simulation experiments and statistical

methods; Section III describes a simplified life support system in a 180-day Lunar Outpost

mission and discusses the experimental results obtained for this system; Section IV presents

the conclusions and the directions for future research.


II. Methodology

A. Reliability Modeling and Prediction Approaches

Four reliability prediction methods and the reasoning behind their selection will be discussed

in this section. These methods include RBD (Reliability Block Diagram), MRBD (Modified

Reliability Block Diagram), MTTF (Mean Time To Failure), and MC (Monte Carlo) style

simulation with MLE (Maximum Likelihood Estimation).

1. Reliability Block Diagrams

A fundamental approach to represent system reliability in terms of component reliability

is the use of RBDs.11 Component interactions are presented by a network of blocks in

accordance with the actual physical relationship of the components in the system. Let n denote the number of components in the system. Four special configurations are depicted in Figure 1, where

• System A represents a series system.

• System B represents a parallel system.

• System C represents a k-out-of-n system.

• System D represents a system with passive (offline) redundancy.

Figure 1. Reliability Block Diagrams

In conventional reliability theory, the integrated system is in its operational state when

there is an open pathway between the start (I) and end (O), representing the inputs and

outputs of the system; the system is determined to be in a failed state when there is no


continuous path between I and O. The advantage of such a graphical representation of

system configuration is that the reliability can be determined using a binary characterization

of the state of each component within the system. A time-variant binary structure function,

Φ(t), equals one if the system is in a working state (UP), and zero if the system is in a failed

state (DOWN). Thus, the reliability can be defined as the probability that the structure

function Φ is equal to one, R(t) = P (Φ(t) = 1). Mathematically, the reliability function of

a series system can be expressed as,

$$R(t) = R_1(t)\,R_2(t) \cdots R_n(t) = \prod_{i=1}^{n} R_i(t). \tag{1}$$

For parallel systems, the reliability function is,

$$R(t) = 1 - F_1(t)\,F_2(t) \cdots F_n(t) = 1 - \prod_{i=1}^{n} \left(1 - R_i(t)\right). \tag{2}$$

The generalized k-out-of-n system is commonly used for systems with higher reliability requirements. This type of reliability improvement is also known as active (online) redundancy. The system functions as long as at least k of its n components function, and its reliability function can be mathematically represented in the form,

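$$R(t) = \sum_{r=k}^{n} \binom{n}{r} R_i(t)^{r} \left(1 - R_i(t)\right)^{n-r}, \tag{3}$$

where the n components are assumed to be identical and independent, each with reliability R_i(t).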
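The three block-diagram formulas above can be checked numerically with a short sketch. The function names and the sample reliabilities are illustrative assumptions, not values from this study, and the k-out-of-n function uses the standard binomial form for identical, independent components.

```python
from math import comb, prod


def series_reliability(rs):
    """Equation 1: every component must work."""
    return prod(rs)


def parallel_reliability(rs):
    """Equation 2: the system fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in rs)


def k_out_of_n_reliability(k, n, r):
    """Equation 3 for n identical components with common reliability r."""
    return sum(comb(n, j) * r**j * (1.0 - r) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    sample = [0.9, 0.8, 0.95]  # hypothetical component reliabilities at one time t
    print("series:  ", series_reliability(sample))
    print("parallel:", parallel_reliability(sample))
    print("2-of-3:  ", k_out_of_n_reliability(2, 3, 0.9))
```

As a sanity check, setting k = n reproduces the series result and k = 1 reproduces the parallel result for identical components.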

Passive (offline) redundancy considers a two-unit system where a standby unit assumes

the function of the primary unit. The reliability of the system is the sum of the probability

that the primary unit does not fail until time t and the probability that the primary unit

fails at some time τ , 0 < τ < t while the standby unit functions successfully from τ to time

t. Mathematically,

$$R(t) = R_1(t) + \int_{\tau=0}^{t} f_1(\tau)\, R_2(t-\tau)\, d\tau, \tag{4}$$

where R1(t) and R2(t) denote the reliabilities of the primary unit and the standby unit at

time t respectively, and f1(t) is the probability density function of the failure of the first unit.

More generally, we can extend the two-unit standby system to an n-unit standby system with

the assumption that each unit process has a constant failure rate λ. The reliability of such

a multi-unit standby system is given by

$$R(t) = e^{-\lambda t}\left[1 + \lambda t + \frac{(\lambda t)^{2}}{2!} + \cdots + \frac{(\lambda t)^{n-1}}{(n-1)!}\right]. \tag{5}$$
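As a consistency check between Equations 4 and 5, the following sketch evaluates the closed form for n standby units and compares it, for n = 2, with a numerical integration of the two-unit convolution. It assumes identical units with a constant failure rate and perfect switching; the function names and the midpoint-rule integration are illustrative choices, not part of the tool described in this paper.

```python
from math import exp, factorial


def standby_reliability(lam, t, n):
    """Equation 5: n identical units used one at a time, constant failure rate lam."""
    return exp(-lam * t) * sum((lam * t) ** j / factorial(j) for j in range(n))


def two_unit_standby_numeric(lam, t, steps=10_000):
    """Equation 4 by midpoint-rule integration, for two identical exponential units."""
    h = t / steps
    integral = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        f1 = lam * exp(-lam * tau)    # failure density of the primary unit at tau
        r2 = exp(-lam * (t - tau))    # standby unit survives from tau to t
        integral += f1 * r2 * h
    return exp(-lam * t) + integral


if __name__ == "__main__":
    lam, t = 0.00046, 2160.0  # illustrative rate (failures/hour) and time (hours)
    print(standby_reliability(lam, t, 2), two_unit_standby_numeric(lam, t))
```

For exponential units the integrand in Equation 4 reduces to the constant λe^{−λt}, so the two results agree up to floating-point error.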

In general, these reliability functions indicate that system reliability increases as the

number of standby units increases. However, the rate of system reliability improvement


decreases exponentially as the number of standby units increases. Maintenance and cost

requirements increase with additional standby units. Hence, a decision regarding the number

of standby units needed by the system needs to be made, which should account for both

the cost of adding standby units and the requirements for system reliability. To properly

apply these methods to life support system analysis, several critical assumptions need to be

made. First of all, conditional component failure probability functions need to be determined,

and independent component failures need to be assumed. Secondly, the components in the

system need to have a linear relationship with specified inputs and outputs. A reasonable

approximation is to use mass flows as component inputs and outputs. Lastly, it is currently

assumed that no preventive or corrective maintenance is available for system components

since the maintenance actions will drastically change the fundamental implementation of

RBD. Such maintenance actions and contingency plans are to be tested in the future.

2. Modified Reliability Block Diagrams

The modified reliability block diagram approach is introduced for the purpose of modeling

the buffering capacity in life support systems. The buffering capacity in ECLSS is caused by

the fact that system failure is no longer determined by component states, rather, the crew

state and their productivity is the major concern. The major innovation presented here is

the use of reliability blocks to represent the system buffering capacity which supports crew

habitat after certain regenerative components fail. A graphical representation of the modified

system reliability diagram is depicted in Figure 2. The blocks in subsystem E represent the

buffering capacity, or more generally, the remaining resources in the environment. The

blocks numbered 10, 11, 12 and 13 contain the same resource that is produced by system

A, B, C and D respectively. They will begin to provide the necessary resources to keep

the crew members alive when regenerative system A, B, C or D fail to be functional. This

modification in system reliability diagram is believed to affect system reliability prediction

results since the system will not fail instantly even if the components are connected in a

series configuration.

Quantifying the reliability of such a system appears to be straightforward since it is very

similar to a parallel configuration. However, the remaining challenge of selecting a function

that physically represents the buffering capacity, and properly sizing that function, is still

to be addressed. At the current stage, we assume that the environmental buffers are idle

when the regenerative components are functional, and that they are activated only when the production of a certain resource ceases due to a component failure.

Figure 2. Modified Reliability Block Diagrams

A normal reliability model has been proposed to represent the buffering capacity. The advantage of a normal model is that, with a small standard deviation, it closely approximates a binary (step) transition between the working and failed states. It can also be utilized as a system reliability indicator since the probability of system failure is one when the reliability of the buffer becomes zero, which physically means the exhaustion of a critical resource. Mathematically, a normal reliability model can be expressed in the following form:

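$$R(t) = 1 - F(t) = 1 - \int_{-\infty}^{t} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\tau-\mu}{\sigma}\right)^{2}\right] d\tau, \tag{6}$$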

where µ and σ are, respectively, the mean and the standard deviation of the distribution.

The plot in Figure 3 has a very sharp reliability decrement after 4,320 hours since the selected

µ and σ are 4320 and 1 respectively. These values need to be carefully selected by the system

analyst for properly sizing the buffering capacity for real systems.
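A minimal sketch of the normal reliability model in Equation 6 follows, evaluated through the complementary error function. The helper `buffered_block`, which combines a regenerative unit with its buffer as a parallel pair, is an illustrative assumption based on the remark above that the modified diagram is very similar to a parallel configuration; it is not the authors' implementation.

```python
from math import erfc, sqrt


def normal_reliability(t, mu, sigma):
    """Equation 6: R(t) = 1 - Phi((t - mu) / sigma), written with erfc."""
    return 0.5 * erfc((t - mu) / (sigma * sqrt(2.0)))


def buffered_block(r_regen, r_buffer):
    """Regenerative unit backed by a buffer, treated as a parallel pair (assumption)."""
    return 1.0 - (1.0 - r_regen) * (1.0 - r_buffer)


if __name__ == "__main__":
    mu, sigma = 4320.0, 1.0  # values used for the plot in Figure 3
    for t in (4310.0, 4320.0, 4330.0):
        print(t, normal_reliability(t, mu, sigma))
```

With σ = 1 the function behaves almost like a step at t = µ, which is why it can stand in for a buffer that is either available or exhausted.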

3. Mean Time To Failure Approach

Another way to approximate system reliability is the direct use of component MTTF. This

is a deterministic method based on the reliability assumptions for individual components.

The core idea in this approach is to find the bound for system life time from subsystem

life times by decomposition. For a series system, $\mathrm{MTTF}_{sys} = \min\{\mathrm{MTTF}_1, \mathrm{MTTF}_2, \ldots\}$; for a system with active redundancy, $\mathrm{MTTF}_{sys} = \max\{\mathrm{MTTF}_1, \mathrm{MTTF}_2, \ldots\}$; for a system with standby redundancy, $\mathrm{MTTF}_{sys} = \sum_{i=1}^{n} \mathrm{MTTF}_i$, where n is the number of standby components. The major advantage of this approach is its convenience.
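The three decomposition rules above translate directly into code. The sketch below is only an illustration of those rules; the function names are hypothetical, and the example values are the regenerative component MTTFs listed later in Table 2.

```python
def mttf_series(mttfs):
    """Series system: the shortest-lived subsystem bounds the system life."""
    return min(mttfs)


def mttf_active_redundancy(mttfs):
    """Active (online) redundancy: the longest-lived unit bounds the system life."""
    return max(mttfs)


def mttf_standby_redundancy(mttfs):
    """Standby (offline) redundancy: units are used one after another."""
    return sum(mttfs)


if __name__ == "__main__":
    regenerative = {"OGS": 2160, "WRS": 4320, "VCCR": 4320}  # hours
    print("series bound:", mttf_series(regenerative.values()))
    print("two OGS units in standby:", mttf_standby_redundancy([2160, 2160]))
```

Because the rules operate on point estimates only, they ignore the spread of each failure distribution, which is the price paid for the convenience noted above.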

Figure 3. A normal reliability function with µ = 4320, σ = 1.

4. Monte Carlo Style Simulation with Maximum Likelihood Estimation

BioSim is a dynamic system simulation tool developed by NASA Johnson Space Center.12–15 Mathematical models for typical components found in various life support systems are fully integrated and highly configurable. Simulation progresses in hourly time increments, with

each unit process producing and consuming various resources in designated stores. An Ex-

tensible Markup Language (XML) configuration file containing the design of the system

initializes the simulation including settings such as random failure and stochastic perfor-

mance.16 BioSim has been successfully utilized and verified in many ECLSS design applica-

tions, including optimization,17 reliability analysis,16 control system testing,13,18 and power

system design verification.19

Monte Carlo Simulation20 (MC) allows the analyst to consider various outcomes the

system may encounter. The simulation environment we developed also enables us to study

many reliability and cost related aspects which cannot be easily captured by analytical

models, such as different maintenance schedules and quality, reliability degradation, repair

priorities, and the focus of this paper, buffering capacity. In this study, five reliability

testing experiments are conducted, each of which involves destructive simulation on 100

identical systems, whose failure time and causes are captured for reliability prediction. In

our application, the major concern is that even if numerous trials have been conducted, there is still no guarantee that we can exhaustively span the whole search space and identify all the possible consequences.
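The structure of one such experiment can be illustrated with a deliberately simplified stand-in for the simulator. The sketch below is not BioSim: it assumes a series arrangement of components with exponentially distributed lifetimes and a fixed buffer that keeps the crew alive for a set number of hours after the first component failure. All names and parameter values are hypothetical.

```python
import random


def simulate_failure_time(mttfs, buffer_hours, rng):
    """One destructive trial: first component failure plus a fixed buffer (assumption)."""
    first_failure = min(rng.expovariate(1.0 / m) for m in mttfs)
    return first_failure + buffer_hours


def run_experiment(mttfs, buffer_hours, trials=100, seed=0):
    """Repeat the trial on `trials` identical systems and collect the failure times."""
    rng = random.Random(seed)  # seeded so that the experiment is repeatable
    return [simulate_failure_time(mttfs, buffer_hours, rng) for _ in range(trials)]


if __name__ == "__main__":
    times = run_experiment([2160.0, 4320.0, 4320.0], buffer_hours=1440.0)
    print("observed mean time to failure:", sum(times) / len(times))
```

The recorded failure times play the role of the data that the estimation step described next operates on.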

The Maximum Likelihood Estimation (MLE) method is one of the most widely used methods

for estimating the parameters of a probability distribution function using the likelihood

function. The likelihood function L is given by

$$L(X_1, \ldots, X_n; \Theta) = \prod_{i=1}^{n} f(X_i; \Theta). \tag{7}$$

The maximum likelihood estimator (MLE) of Θ, denoted Θ∗, maximizes L. In most cases, the MLE is obtained by differentiating Equation 7 (or, equivalently, its logarithm) with respect to Θ, setting the derivative equal to zero, and solving for the unknowns. For an exponential distribution the characteristic parameter is the failure rate λ, and the MLE is determined by taking the reciprocal of the mean of the failure times as in Equation 8.

$$\lambda^{*} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}, \tag{8}$$

where xi is the time of the ith failure and the MTTF is simply the inverse of λ∗.

The same approach can be applied to many other widely used reliability models, such

as the Two-Parameter Exponential distribution model, the Weibull distribution model, the

Normal and Lognormal distribution model, and the Inverse Gaussian distribution model.

The final selection of system reliability model needs to be made so as to best match the


actual experiment results. If any of these probability distribution functions can adequately

model the failure data of the system, the parameters of those distributions can be identified

utilizing this approach.

The identified parameters can subsequently be utilized to predict reliability, R(t). In the case of the exponential distribution, the reliability function takes the form of Equation 9.

$$R(t) = e^{-\lambda^{*} t} \tag{9}$$
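Equations 8 and 9 amount to a few lines of arithmetic. The sketch below estimates λ∗ from a list of observed failure times and evaluates the fitted exponential reliability function; the function names and the sample failure times are hypothetical and are not data from this study.

```python
from math import exp


def exponential_mle(failure_times):
    """Equation 8: lambda* = n / sum(x_i), the reciprocal of the mean failure time."""
    return len(failure_times) / sum(failure_times)


def exponential_reliability(t, lam):
    """Equation 9: R(t) = exp(-lambda* t)."""
    return exp(-lam * t)


if __name__ == "__main__":
    observed = [1000.0, 2000.0, 3000.0]  # hypothetical failure times in hours
    lam_star = exponential_mle(observed)
    print("lambda*:", lam_star, "MTTF:", 1.0 / lam_star)
    print("R(MTTF):", exponential_reliability(1.0 / lam_star, lam_star))
```

Evaluated at t equal to the estimated MTTF, the fitted reliability is always e^{−1}, which is a convenient check on any implementation.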

B. Sensitivity Analysis

A sensitivity analysis was conducted to study the relationship between varying buffer sizes

and system reliability. It was determined through this work that the environment defined a

key resource and therefore the analysis focused on the impact of varying the size of this buffer.

Seven different environmental buffer sizes ranging from 15 days to 105 days of life support

capacity were considered. The MRBD approach and MTTF approach were employed to

estimate the system reliability boundaries. In the MRBD approach, 100 system failure times

were captured for each buffer size and the average value was used to calculate the system

reliability parameter assuming an exponential reliability model. The MRBD approach is used

to predict the lower bound for system reliability, and it treats the buffering capacity as if it were a parallel resource component. The MTTF approach is considered to be more optimistic and is therefore used to predict the upper bound for system reliability. The observed MTTF is reported and shown to lie between these two bounds.

III. Case Study

The objectives of the case study presented here are as follows:

1. Develop a software application where various reliability modeling and prediction ap-

proaches can be performed and compared;

2. Utilize the application to study a representative example and compare the difference

between the reliability prediction results;

3. Discuss the cause of deviation in reliability prediction results and its impact on system

design and operation.

A. Life Support System for Lunar Outpost

The ECLSS tested in this project is designed for a six-month Lunar Outpost mission. It

consists of four types of components: bulk storage components (i.e. gases and water), regenerative components (i.e. Oxygen Generation System and Water Recovery System), control

components, and crew members. A typical series configuration of such a system is depicted

in Figure 4, where the mass and power flow is shown. In Figure 4, horizontal cylinders repre-

sent regenerative processors and vertical cylinders represent storage units. Arrows represent

the flow of mass from one unit to the next. Color coding is utilized to represent the type of

material flowing. External power is required to operate the regenerative components includ-

ing the oxygen generation system (OGS), water recovery system (WRS), and the variable

configuration carbon dioxide removal (VCCR) system. Gaseous flows are generally mixed

air streams of various quality, save for the pure carbon dioxide stream exiting the VCCR.

The WRS handles both waste and grey water produced by the crew and produces potable water, though the quality of these water streams is not modeled. The available storage

volume of all resources is sized to target the six month mission length selected. Currently

one crew member is considered, and all hardware has been sized accordingly. The crew ex-

changes gases directly with the crew environment, which models the interior volume of the

habitat. However, water and food are taken directly from the appropriate stores. Similarly, a

parallel configuration with standby components is illustrated in Figure 5 where the standby

components are connected using dashed lines with perfect switches.

Some system level assumptions are applied to all reliability prediction approaches. Most importantly, component failures are assumed to be independent. In simulation, however, failed hardware no longer consumes or produces resources. Thus, some failures may

be observed as resources are not provided down the process chain. For example, all power

consumers, including the OGS, VCCR and WRS cannot function if the power supply has

failed. Note that those unit processes are still functional. This will later allow us to test

parallel systems with multiple resources stores. In addition, several other assumptions are

held in all cases and should be made explicit:

1. Components in the system have two states, UP and DOWN. Performance degradation

is not currently under consideration.

2. The habitat environment can provide enough resources for the crew member to survive

for 60 days (1,440 hours).

3. All components are non-repairable and no preventive maintenance is provided.

4. System failure is determined by component reliability function in RBD and MRBD,

while for simulation, it is determined by crew survival conditions.

The crew habitat failure is modeled via the normal distribution. The parameter µ, representing buffering capacity, is selected to be 1,440 hours with a standard deviation of 1 hour. This is approximately 1/3 of the length of the baseline mission and is selected from the perspective of system survivability in Martian missions, where 60 days would provide the crew ample time to diagnose and mitigate system upsets. A sensitivity analysis of the results considering the impact of the size of this buffer is provided.

Figure 4. Mass and power flow diagram in BioSim simulation tool.

Figure 5. Mass flow diagram for a parallel configuration in the BioSim simulation tool.

B. Assumptions for Component Reliability

Before assigning realistic reliability models to each of the components within the system,

a preliminary experiment is conducted using the assumption that all the components are

modeled with an exponential probability density function. This test exploits the fact that

the exponential reliability model is mathematically convenient for reliability analysis. The

only parameter for the exponential model is λ, whose inverse is the MTTF. The same MTTF

values are later utilized for more realistic probability density function assumptions. The

following section describes the assumptions made for system components and the component

reliability functions are graphically represented in Figure 6.

Figure 6. Assumptions for component reliability: (a) Stores, (b) OGS, (c) VCCR, (d) WRS, (e) Injector, (f) Crew.

1. Storage Component Reliability

Gas and water stores are modeled with similar reliability, as these tanks are similar in function and very reliable. An exponential reliability model is assigned to the storage components with an MTTF value of eight years. This assumption implies that the hazard ratesᵈ of the storage components remain constant throughout the entire mission.

Resource stores for food, power and water also use exponential models, but different

failure rates. Unlike the waste store, which is simply a recycling tank, the food store is

considered more vulnerable due to various risks such as limited food shelf life and sensitivity

to the storage environment. The power store is also more likely to fail since it faces many

failure modes, for example, short circuit, overload, overheat, or blackout periods. Table 1

summarizes the design parameters for each of the storage components within the system.

Table 1. Storage component reliability assumptions.

Component            Model        λ (failures/hr)  MTTF (hrs)

O2 Store             Exponential  0.0000145        69120

CO2 Store            Exponential  0.0000145        69120

H2 Store             Exponential  0.0000145        69120

Potable Water Store  Exponential  0.0000145        69120

Dirty Water Store    Exponential  0.0000145        69120

Grey Water Store     Exponential  0.0000145        69120

Waste Store          Exponential  0.0000145        69120

Food Store           Exponential  0.000231         4320

Power Store          Exponential  0.000231         4320
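The entries in Table 1 can be cross-checked with a short sketch, since for the exponential model λ is the reciprocal of the MTTF. The function names are hypothetical, and the 4,320-hour horizon corresponds to the 180-day mission used in this case study.

```python
from math import exp

MISSION_HOURS = 180 * 24  # 4,320 hours


def failure_rate(mttf_hours):
    """Exponential model: lambda = 1 / MTTF."""
    return 1.0 / mttf_hours


def survival_probability(mttf_hours, t_hours):
    """Exponential model: R(t) = exp(-t / MTTF)."""
    return exp(-t_hours / mttf_hours)


if __name__ == "__main__":
    for name, mttf in (("O2 Store", 69120.0), ("Food Store", 4320.0)):
        print(name, failure_rate(mttf), survival_probability(mttf, MISSION_HOURS))
```

Under these assumptions a component whose MTTF equals the mission length, such as the food or power store, has a probability of only e^{−1} of surviving the whole mission without failure.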

2. Regenerative Components

For the regenerative components, reliability assumptions were selected based on the previous operation of similar devices. The OGS is considered to be the least reliable component within the system boundary since there were three reported OGS failures on ISS, occurring on September 8, 2004, January 1, 2005, and September 18, 2006, respectively, during the eight-year mission. For the purpose of demonstrating the impact

of component random failure on system reliability, an exponential model is selected for the

OGS with a down-scaled MTTF which is half of the mission length. Another important

regenerative component, the WRS, consists of tubes, valves and various tanks. Most of its

components are associated with increasing risks caused by repeated cyclic loads and severe

wear-out during long term missions. Historical testing data show that although there is no

recorded integrated WRS failure, many of its components have to be replaced in practice

due to performance degradation and water leakage. A 2-parameter Weibull model is thus

selected for the WRS to exhibit the hazard rate variation over time. The VCCR, on the

dHazard rate, or hazard function h(t), is the conditional probability of failure in the interval t to t+ δt,given that there was no failure at t. It is expressed as h(t) = f(t)

R(t)

17 of 35

other hand, is much more reliable. A normal model is assumed for the VCCR so that each

regenerative component has its distinct reliability model. Table 2 summarizes the design

parameters for each of the regenerative component included in the system.

Table 2. Regenerative component reliability assumptions.

Component   Model                   λ         µ      σ   β   MTTF (hrs)
OGS         Exponential             0.00046   –      –   –   2160
WRS         Weibull (2-parameter)   0.00023   –      –   3   4320
VCCR        Normal                  –         4320   5   –   4320
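The Weibull and Normal entries of Table 2 can be turned into reliability functions as in the sketch below. The parameterization is an assumption: the Weibull scale is taken to be the listed 4,320-hour MTTF (approximately 1/λ), which may differ from the convention used by the authors, and the Normal model is read as a normally distributed failure time with mean µ and standard deviation σ.

```python
import math

def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Two-parameter Weibull: R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t: float, scale: float, shape: float) -> float:
    """h(t) = f(t) / R(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def normal_reliability(t: float, mu: float, sigma: float) -> float:
    """Normal failure time: R(t) = 1 - Phi((t - mu) / sigma), written with erfc."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))

# Assumed parameters: WRS scale = 4320 h, shape = 3; VCCR mu = 4320 h, sigma = 5 h.
for t in (1080, 2160, 4320):
    print(t,
          round(weibull_reliability(t, 4320.0, 3.0), 4),
          round(weibull_hazard(t, 4320.0, 3.0), 7),
          round(normal_reliability(t, 4320.0, 5.0), 4))
```

With a shape parameter greater than one the Weibull hazard rate grows with time, which matches the wear-out behavior described for the WRS. The small σ of the Normal model keeps the VCCR almost perfectly reliable until just before its mean life.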

3. Control Components

The injector in the system is designed to consume oxygen from the storage tank and inject it into the habitation environment to adjust the oxygen and carbon dioxide partial pressures. The injector undergoes repeated cyclic loads; therefore, an MTTF value of 90% of the desired mission length is assigned with a Normal model. This suggests that the injectors are generally less reliable than the WRS, although strictly speaking this choice is arbitrary. Table 3 shows the parameters selected for the reliability function of the control components.

Table 3. Control component reliability assumptions.

Component   Model    µ      σ   MTTF (hrs)
Injector    Normal   3888   3   3888

4. Crew Members

The crew members are considered to be very reliable, although they are still subject to failures. Since a crew failure will impact the system just as a failure of any other unit process would, a crew failure rate model has been incorporated. The failure rate model is based on previous work by Horneck and Comet,21 and takes the form of a linearly decreasing reliability function, which degrades from 1 to 0.9953 in 180 days. Table 4 shows the parameters selected for the crew reliability function.

Table 4. Crew reliability assumptions.

Component   Model    Slope (per hr)
Crew        Linear   −1.09 × 10^−6
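As a quick check, the slope in Table 4 follows from the stated degradation from 1 to 0.9953 over 180 days, assuming time is measured in hours as in the other tables:

```latex
\text{slope} = \frac{0.9953 - 1}{180 \times 24\ \text{hr}}
             = \frac{-0.0047}{4320\ \text{hr}}
             \approx -1.09 \times 10^{-6}\ \text{hr}^{-1},
\qquad
R_{crew}(t) \approx 1 - 1.09 \times 10^{-6}\, t
```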


C. Reliability Prediction

1. Reliability Block Diagrams

As previously described, the system reliability has been determined using the proposed reliability prediction approaches. The 'naïve' RBD approach, so called because it does not take system buffering capacity into account, is tested first, by theoretical derivation in Excel and by stochastic simulation in Matlab. This validation of the stochastic approach against theory is performed to provide confidence for more complicated experiments, where stochastic simulation becomes the only viable approach for reliability prediction. System reliability over time is first computed in Excel using the assumed component reliability functions. Matlab simulations are then conducted using a tool we have named the "FailureDecider," which determines component status (functional or failed) at any given time by applying random numbers to the distribution functions described in Section III.B.16 System failure data are then collected and processed using the MLE method. An exponential fit was determined to be superior to the Normal and Weibull models, based on the quality of the fit observed across the various techniques (Figure 8). This practice has been maintained throughout the case study and results in a similarly shaped curve in all analyses, facilitating comparison of results across the various scenarios. Figure 7 illustrates the simplified system, whose components are connected in series so that the failure of any one of them causes system failure.
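The procedure described above can be sketched as follows. This is not the authors' FailureDecider, which was written in Matlab. It is a minimal Python illustration that draws one random failure time per component, takes the minimum as the series-system failure time, and fits an exponential rate by maximum likelihood. The function names, the three-component system, and the Weibull scale (set to the listed MTTF) are assumptions.

```python
import random

def sample_failure_time(model, params, rng):
    """Draw one random failure time (hours) from a component's reliability model."""
    if model == "exponential":
        return rng.expovariate(params["rate"])
    if model == "weibull":
        # weibullvariate(alpha, beta): alpha is the scale, beta is the shape
        return rng.weibullvariate(params["scale"], params["shape"])
    if model == "normal":
        # Clip at zero so a failure time is never negative
        return max(0.0, rng.gauss(params["mu"], params["sigma"]))
    raise ValueError(f"unknown model: {model}")

def is_functional(failure_time, t):
    """Component status at time t: functional until its sampled failure time."""
    return t < failure_time

def series_failure_time(components, rng):
    """A series system fails as soon as its first component fails."""
    return min(sample_failure_time(m, p, rng) for m, p in components)

def exponential_mle_rate(failure_times):
    """MLE of an exponential rate: number of failures / total time on test."""
    return len(failure_times) / sum(failure_times)

rng = random.Random(0)
components = [
    ("exponential", {"rate": 0.00046}),            # OGS (Table 2)
    ("weibull", {"scale": 4320.0, "shape": 3.0}),  # WRS, scale assumed = listed MTTF
    ("normal", {"mu": 4320.0, "sigma": 5.0}),      # VCCR (Table 2)
]
times = [series_failure_time(components, rng) for _ in range(100)]
print("fitted system MTTF (hrs):", 1.0 / exponential_mle_rate(times))
```

For a system made only of exponential components the fitted rate should approach the sum of the component rates, which gives a simple way to validate such a simulation against theory.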

The reliability prediction results are presented in Figure 8, which illustrates the several different scenarios designed for the RBD approach. Given an exponential fit, Equation 9 was utilized for reliability prediction. It can be observed that the RBD approach using an exponential model for all components is outperformed by those using various reliability models. This is because, given the same MTTF, the reliability of an exponential model degrades faster than that of Normal or Weibull models early in the life cycle. It is also shown that the reliability prediction results obtained from deterministic calculations and Matlab simulations are quite consistent with each other, which lends credibility to the FailureDecider tool. Lastly, it should be noted that the system reliability becomes rather low at the end of the mission.
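The early-life difference between the models can be illustrated by evaluating each at half of a common characteristic life $T$. The comparison below assumes the Weibull scale equals $T$ with $\beta = 3$ (as in Table 2) and a Normal model with $\mu = T$ and $\sigma \ll T$. These parameter choices are illustrative and are not a reproduction of the paper's calculation.

```latex
\begin{aligned}
\text{Exponential:}\quad & R(T/2) = e^{-1/2} \approx 0.607 \\
\text{Weibull } (\beta = 3):\quad & R(T/2) = e^{-(1/2)^{3}} = e^{-0.125} \approx 0.882 \\
\text{Normal } (\sigma \ll T):\quad & R(T/2) = 1 - \Phi\!\left(\frac{T/2 - T}{\sigma}\right) \approx 1
\end{aligned}
```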

2. Modified Reliability Block Diagrams

The second approach employed is the proposed MRBD method, designed for modeling the impact of buffering capacity in reliability prediction. In this experiment, the system diagram is slightly different from the naïve system representation. The major difference is the introduction of buffers for each regenerative subsystem, or for the entire system, as illustrated in Figure 9 and Figure 10, respectively.

Figure 7. Reliability block diagram for ECLSS without buffering capacity.

Figure 8. Reliability prediction results from the RBD approach.

Figure 9. Modified reliability block diagram for ECLSS with several buffers.

Figure 10. Modified reliability block diagram for ECLSS with one buffer.

The reliability prediction results for both scenarios are based on simulation, since the reliability models in a parallel-series configuration are cumbersome for mathematical derivation. The FailureDecider tool is once again used for simulating component random failures.16 The system failure time, taken at the end of mission, is recorded stochastically for 100 identical system configurations. These data are analyzed using MLE to determine the exponential parameter, which predicts system reliability and is comparable with the results from the naïve series system tested using RBDs. Again, Equation 9 was utilized to produce the results presented in Figure 11, which demonstrates the difference between RBD and MRBD in reliability prediction. The MRBD prediction results are consistently higher than those of the RBD approach. The dashed lines represent the 95% confidence interval of the predicted system reliability over time, based on the observed variance in the data utilized to determine the MTTF. The overlap of the reliability prediction results of the one-buffer model and the multiple-buffer model suggests that these models are very similar in reliability and may be interchangeable. Interestingly, systems with buffering capacity have a reliability less than one, even at times less than the buffer MTTF. This is due to the nature of the exponential function selected (Equation 9), where reliability is one only at time equal to zero.
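One way to obtain the fitted MTTF and a 95% band of the kind described above is sketched below. The paper does not state how its interval was computed, so the normal approximation used here (standard error of the sample mean equal to MTTF/√n) is an assumption, as are the function names. The exponential curve R(t) = exp(-t/MTTF) is assumed to match the form of Equation 9.

```python
import math

def exponential_mttf_with_ci(failure_times, z=1.96):
    """MLE of the exponential MTTF (the sample mean) with an approximate CI.

    Uses a normal approximation: the standard error of the mean of n
    exponential failure times is MTTF / sqrt(n).
    """
    n = len(failure_times)
    mttf = sum(failure_times) / n
    half_width = z * mttf / math.sqrt(n)
    if mttf - half_width <= 0:
        raise ValueError("too few samples for this approximation")
    return mttf, (mttf - half_width, mttf + half_width)

def reliability_band(t, mttf, ci):
    """Return (lower, estimate, upper) reliability at time t for R(t) = exp(-t / MTTF)."""
    low_mttf, high_mttf = ci
    return (math.exp(-t / low_mttf), math.exp(-t / mttf), math.exp(-t / high_mttf))

# Illustrative data only: four made-up failure times in hours.
mttf, ci = exponential_mttf_with_ci([2.0, 4.0, 6.0, 8.0])
print(mttf, ci, reliability_band(5.0, mttf, ci))
```

Because the fitted curve is exponential, every band member equals one only at t = 0, which is the behavior noted in the text for systems with buffering capacity.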

3. Mean Time To Failure Approach

The system MTTF ($T_{sys}$) for the ECLSS in the Lunar Outpost mission can be numerically expressed in terms of component or subsystem MTTFs, for example $T_{OGS}$ and $T_{VCCR}$, or $T_{air}$ and $T_{water}$. More specifically, for the ECLSS configuration that has one buffer for the entire air and water regenerative system, the estimated system MTTF can be expressed as follows:

$$T_{sys} = \min\{(T_{buffer} + \min\{T_{air}, T_{water}\}),\ T_{food},\ T_{waste},\ T_{power},\ T_{crew}\} \tag{10}$$

On the other hand, if the ECLSS configuration has multiple buffers, one for each regenerative subsystem, each buffer is equivalent to a standby parallel subsystem. Thus, the equation for calculating the estimated system MTTF becomes

$$T_{sys} = \min\{(T_{airbuffer} + T_{air}),\ (T_{waterbuffer} + T_{water}),\ T_{food},\ T_{waste},\ T_{power},\ T_{crew}\} \tag{11}$$

By substituting the MTTF assumptions for each component into Equations 10 and 11, the results are 3,600 hours for both configurations.
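Equations 10 and 11 translate directly into code. In the sketch below the function and argument names are illustrative, and the buffer and crew values in the usage example are placeholders rather than the values used in the paper, so the output is not expected to reproduce the 3,600-hour result.

```python
def system_mttf_one_buffer(t_buffer, t_air, t_water,
                           t_food, t_waste, t_power, t_crew):
    """Equation 10: one buffer shared by the air and water regenerative systems."""
    return min(t_buffer + min(t_air, t_water), t_food, t_waste, t_power, t_crew)

def system_mttf_multi_buffer(t_air_buffer, t_air, t_water_buffer, t_water,
                             t_food, t_waste, t_power, t_crew):
    """Equation 11: each regenerative subsystem has its own standby buffer."""
    return min(t_air_buffer + t_air, t_water_buffer + t_water,
               t_food, t_waste, t_power, t_crew)

# Placeholder inputs (hours). Store MTTFs follow Table 1; the air subsystem is
# assumed to be limited by the OGS (2160 h); the 1000 h buffers and the crew
# value are hypothetical.
common = dict(t_food=4320, t_waste=69120, t_power=4320, t_crew=1_000_000)
one = system_mttf_one_buffer(t_buffer=1000, t_air=2160, t_water=4320, **common)
multi = system_mttf_multi_buffer(t_air_buffer=1000, t_air=2160,
                                 t_water_buffer=1000, t_water=4320, **common)
print(one, multi)
```

Both expressions return the same value whenever the weakest regenerative subsystem plus its buffer is the limiting term, which is consistent with the identical results reported for the two configurations.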

To compare the estimated values of system MTTF with the results from the simulation approach, the results obtained above are assumed to define the parameters of an exponential system, as suggested by the simplest system configurations. Note, however, that the MTTF assumed for the buffering capacity is based on the designed size of the buffer at time zero. When a failure occurs, the actual mass of material stored in the buffer is not likely to be the same as at time zero, despite the use of regenerative technologies. This leads to the observed over-prediction of system reliability using this function.

Figure 11. Reliability prediction results from the RBD and MRBD approaches.

4. Monte Carlo Style Simulation with Maximum Likelihood Estimation

The simulation tool, BioSim, is utilized to perform destructive life testing and generate system failure data. These data are then processed using MLE to estimate the parameters of the exponential model that describes the system reliability. As described previously, Figure 4 and Figure 5 depict the mass flow of the simulated series and parallel systems, respectively. Several additional assumptions are made for the simulation, including:

1. The OGS, VCCR, and WRS require power, so if the power store is DOWN, none of the regenerative components can consume or produce any resources, even though they are still functional.

2. For all stores, if the amount of resources to be stored exceeds the designed capacity, the extra material is dumped into space.

3. The WRS has 100% conversion efficiency.

4. The crew daily schedule is 8 hours of sleep (intensity level 0), 12 hours of lab work (intensity level 2), and 4 hours of exercise (intensity level 4).

5. The initial power, food, and water storage levels are designed to satisfy the requirements for the nominal mission length.
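The first two assumptions in the list above can be sketched in a few lines. The classes and functions below are hypothetical and do not reflect the BioSim API. They only illustrate a store that dumps overflow and a regenerative component whose output is zero whenever its power demand cannot be met.

```python
from dataclasses import dataclass

@dataclass
class Store:
    capacity: float
    level: float = 0.0

    def add(self, amount: float) -> float:
        """Store what fits and return the overflow that is dumped into space."""
        accepted = min(amount, self.capacity - self.level)
        self.level += accepted
        return amount - accepted

    def take(self, amount: float) -> float:
        """Withdraw up to `amount`; return what was actually available."""
        taken = min(amount, self.level)
        self.level -= taken
        return taken

def regenerative_output(nominal_rate: float, component_ok: bool,
                        power_store: Store, power_demand: float) -> float:
    """Output for one time step: zero if the component failed or power ran short."""
    if not component_ok:
        return 0.0
    if power_store.take(power_demand) < power_demand:
        return 0.0  # functional, but unable to run without enough power
    return nominal_rate

tank = Store(capacity=10.0, level=8.0)
print("dumped:", tank.add(5.0), "level:", tank.level)
```

Keeping the "functional but unpowered" case separate from the "failed" case matters for diagnosis, since both produce the same zero output at the sensors.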

In this experiment, results are generated using only the BioSim simulation tool. The system is considered to have failed only when the crew can no longer survive; the crew survival conditions are bounded by food and water availability and by oxygen and carbon dioxide concentrations. The crew is assumed to be capable of living without food for three weeks and without water for two days. The oxygen concentration limit takes into account both an upper bound, above which fire risk increases, and a lower bound, below which insufficient oxygen is available for crew respiration. The carbon dioxide concentration is limited to avoid carbon dioxide toxicity. Two illustrative examples regarding system failure modes are discussed in Section III.D.
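The crew survival conditions can be expressed as a single check, sketched below. The food and water limits (three weeks and two days) come from the text. The oxygen and carbon dioxide thresholds are hypothetical defaults chosen only to make the example runnable, since the paper does not give numerical limits here, and the function name is illustrative.

```python
FOOD_LIMIT_HOURS = 21 * 24   # three weeks without food
WATER_LIMIT_HOURS = 2 * 24   # two days without water

def crew_failure_cause(hours_without_food, hours_without_water,
                       o2_fraction, co2_fraction,
                       o2_low=0.10, o2_high=0.30, co2_max=0.01):
    """Return the first violated survival condition, or None if the crew survives.

    The o2_low, o2_high and co2_max defaults are placeholder values.
    """
    if hours_without_water > WATER_LIMIT_HOURS:
        return "water"
    if hours_without_food > FOOD_LIMIT_HOURS:
        return "food"
    if o2_fraction < o2_low:
        return "oxygen too low"
    if o2_fraction > o2_high:
        return "oxygen too high (fire risk)"
    if co2_fraction > co2_max:
        return "carbon dioxide toxicity"
    return None

print(crew_failure_cause(0, 49, 0.21, 0.001))
```

A simulation run would evaluate such a check at every time step and record the first time it returns a cause as the system failure time.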

The reliability prediction results shown in Figure 12 indicate that the average MTTF obtained using the simulation tool is approximately 27 times higher than that from RBD and four times higher than that from MRBD. The parallel configuration improves MTTF by 20%, while the MTTF approach consistently over-predicts the simulated system MTTF by approximately 7-8%.

It is believed that the simulation approximates system dynamics more accurately, and therefore the difference in reliability prediction results validates the concerns raised previously. It is clearly demonstrated that the RBD, MRBD, and MTTF approaches all have limited ability to model and predict reliability for complex systems: they tend either to underestimate reliability for systems with buffering capacity or to overestimate reliability due to an inaccurate description of the buffering capacity. However, marked improvement is observed with the MRBD and MTTF approaches, which take the buffering capacity into account.

Figure 12. Reliability prediction results from the RBD, MRBD, MTTF and simulation approaches.

5. Sensitivity Analysis

A sensitivity analysis was performed, varying the size of the environment, to consider the impact of buffering capacity on system MTTF (Figure 13). The horizontal axis represents seven different environmental buffer sizes, in terms of MTTF. The vertical axis shows the range of corresponding system MTTFs obtained using the MRBD method, denoted by diamonds, and the MTTF method, denoted by circles. The middle bars represent the actual system MTTFs determined via a series of 100 simulations at each level. This suggests that the MTTF and MRBD techniques may be utilized to define a confidence interval bounding the actual system MTTF.

The results indicate that, given large buffering capacity, the predicted upper bound on system MTTF has a ceiling of 4,320 hours. This ceiling is imposed by the power component MTTF assumed in the current system design. The lower bound, however, continues to increase with increasing buffering capacity. Overall, the width of the confidence interval increases with buffer size until the power system limits any further increase in the upper bound.
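The ceiling on the upper bound can be reproduced with a simplified form of Equation 10 in which only the regenerative subsystem and the power store are retained. The 2,160-hour regenerative MTTF (the OGS value from Table 2), the reduction to two terms, and the seven buffer sizes swept below are assumptions for illustration and are not the values plotted in Figure 13.

```python
def mttf_upper_bound(t_buffer, t_regen=2160.0, t_power=4320.0):
    """Simplified Equation 10: buffer plus regenerative MTTF, capped by the power store."""
    return min(t_buffer + t_regen, t_power)

# Hypothetical sweep over seven buffer sizes, expressed as buffer MTTF in hours.
for t_buffer in (0, 720, 1440, 2160, 2880, 3600, 4320):
    print(t_buffer, mttf_upper_bound(t_buffer))
```

Once the buffer MTTF exceeds the gap between the power store and the regenerative subsystem, enlarging the buffer no longer raises the predicted bound, so further gains would have to come from the power system.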

Figure 13. The impact of varying the environmental buffer volume on system reliability


D. System Failure Modes

A wide range of failure modes has been observed for the ECLSS under investigation. The most frequently observed failure is the air subsystem failure, where the carbon dioxide concentration exceeds tolerance limits and terminates the simulation. Failures in the food and water systems have also been observed. An example failure event is presented below to demonstrate how system failure can occur in the BioSim simulation tool. Figures 14 to 16 plot sensor data collected during the simulations. These data describe the inputs and outputs of the regenerative hardware components, the storage levels of various resources, and the environmental conditions for crew habitation. By studying typical failure modes such as those illustrated by the modeling tool, system designers are presented with an opportunity to improve system design. With this implementation of BioSim, designers can better understand the impact of component failures on system performance and the adjustments necessary for improving system reliability. The model also provides critical evidence of the buffering capacity within ECLSS, which allows the system to continue functioning until the buffer itself is exhausted. The current results suggest that one can define an upper bound for system reliability by utilizing the MTTF approach; however, these results may be deceptive when choosing which buffer to augment in order to maximize system reliability.

In this example, a system failure is caused by the water subsystem. In Figure 14, the oxygen production rate suddenly drops to zero after 389 hours of operation, and the system fails 48 hours later. One may initially conclude that an OGS failure must have occurred; however, Figure 15 shows that the OGS is not the cause of the system failure, since the injector manages to maintain the oxygen and carbon dioxide concentrations even after the malfunction. The actual cause of the system failure is identified in Figure 16, where the potable water storage level drops to zero due to a failure in the potable water store. Component random failures in BioSim are assumed to cause zero input and output, whereas storage failures cause tank levels to become zero instantly. Therefore, because there is no potable water available for the OGS to produce oxygen, the production rate becomes zero. In actuality, the crew member's potable water demand can no longer be satisfied, and the mission comes to an end 48 hours later.

Figure 14. I/O Sensors

Figure 15. Environmental Condition Sensors

Figure 16. Store Level Sensors

IV. Conclusion

This paper demonstrates the use of several approaches for studying the reliability of life support systems in long-term space missions. The comparison between the prediction results shows a significant difference between classical and simulated approaches, which is believed to be caused by the unique characteristics of environmental systems. Classical reliability theory focuses heavily on the operational state of individual components, owing to the original application area of reliability engineering in logistics. This focus ignores the potential impact that the environment can have on the function of the system. Life support hardware certainly enables the work of the crew, but system success or failure may be decided by the ability of the crew to perform rather than strictly by the ability of hardware to function. Experiments have been designed to show the impact of buffering capacity on system reliability, and examples are given to illustrate how the system performs from this perspective. An approach utilizing the predicted mean time to failure of individual subsystems has been proposed here, and although it over-predicts system reliability slightly, the accuracy is improved. These results depend highly on the system design assumptions the analyst selects; thus, a sensitivity analysis has been prepared showing the behavior of system MTTF as the buffer MTTF is adjusted. As expected, when the environmental buffers are reduced, the bounds on system reliability are similarly reduced. Future system designs can now be improved by this information: if the designer is confident in the selection of the controlling buffer for the system, systems may be designed with reliability performance as a design constraint.

Thus, for a system designer, this work should lead toward a new perspective on design.

Given a classical approach to reliability prediction, and the observed under-prediction, there

is either an opportunity to greatly reduce system cost by reducing the buffering capacity

provided to the crew, or there is an opportunity to utilize the crew survival time available

after malfunctions to repair failed components. This amount of time is not trivial, and given

adequate resources it is expected that the crew will have ample time to diagnose a wide

variety of unknown failures and fabricate solutions. However, a system designer needs to

have a strong command of the system dynamics to understand what resources will become

most limiting for the crew in the event of failures. Without such understanding, it is not

necessarily obvious exactly which buffer should be augmented in size, where to perform

preventive versus corrective maintenance, or where to provide redundancy.

Acknowledgments

The authors gratefully acknowledge the generosity of the University of Illinois, the National Aeronautics and Space Administration, the National Science Foundation, and the Illinois Space Grant Consortium in support of this work. The authors would also like to thank several individuals who contributed to this work, particularly Izaak Neveln, David Kane, and Christian Douglass, who supported this work while participating in an NSF Research Experience for Undergraduates.


References

1 Perera, J. and Field, S., "Integrated Risk Management Application (IRMA)," NASA Risk Management Conference, 2005.

2 Leveson, N., Safeware, Addison-Wesley Publishing Company, Inc., 1995.

3 Lievens, C., System Security, Caepadues Editions, Toulouse, 1976.

4 Anonymous, "IEEE Guide for General Principles of Reliability Analysis of Nuclear Power Generating Station Protection Systems," IEEE, 1975.

5 Yamada, K., "Reliability Activities at Toyota Motor Company," Reports of Statistical Application Research, Vol. 24, 1977.

6 Fussel, J. B., Fault Tree Analysis - Concepts and Techniques, Vol. E, University of Liverpool, UK, 1973.

7 Pages, A. and Gondran, M., System Reliability Evaluation & Prediction in Engineering, Springer-Verlag, NY, 1st ed., 1986.

8 Kletz, T., Hazop and Hazan, Taylor & Francis, 4th ed., 1999.

9 Anonymous, Guidelines for Hazard Evaluation Procedures, with Worked Examples, Wiley-AIChE, 2nd ed., 1992.

10 Vesely, W. E., Goldberg, F. F., Roberts, N. H., and Hassel, D. F., Fault Tree Handbook, U.S. Nuclear Regulatory Commission, 1981.

11 O'Connor, D. T., Newton, D., and Bromley, R., Practical Reliability Engineering, Wiley, West Sussex, England, 4th ed., 2002.

12 Kortenkamp, D. and Bell, S., "BioSim: An Integrated Simulation of an Advanced Life Support System for Intelligent Control Research," International Conference on Environmental Systems, SAE, 2003.

13 Rodríguez, L. F., Bell, S., and Kortenkamp, D., "The Role of Modeling in Advanced Life Support System Design and Operation," Tech. rep., 2004.

14 Rodríguez, L. F., Bell, S., and Kortenkamp, D., "Using Dynamic Simulations and Automated Decision Tools to Design Lunar Habitats," International Conference on Environmental Systems, No. 2005-01-3011, SAE, 2005.

15 Rodríguez, L. F., Jiang, H., Bell, S., and Kortenkamp, D., "Testing Heuristic Tools for Life Support System Analysis," International Conference on Environmental Systems, No. 2007-01-3225, SAE, 2007.

16 Jiang, H., Bhalerao, K., Soboyejo, A., Bell, S., Kortenkamp, D., and Rodríguez, L. F., "Modeling Stochastic Performance and Random Failure," International Conference on Environmental Systems, No. 2007-01-3027, SAE, 2007.

17 Rodríguez, L. F., Bell, S., and Kortenkamp, D., "Use of Genetic Algorithms and Transient Models for Life Support Systems Analysis," AIAA Journal of Spacecraft and Rockets, Vol. 43, No. 6, 2006.

18 Klein, T., Subramanian, D., Kortenkamp, D., and Bell, S., "Using Reinforcement Learning to Control Life Support Systems," Proceedings of the International Conference on Environmental Systems, SAE, 2004.

19 Kortenkamp, D., Izygon, M., Lawler, D., Schreckenghost, D., Bonasso, R. P., Wang, L., and Kennedy, K., "A Testbed for Evaluating Lunar Habitat Autonomy Architectures," Proceedings of the 6th Conference on Human/Robotic Technology and the Vision for Space Exploration in the Space Technology and Applications International Forum (STAIF), Vol. 969, American Institute of Physics Conference Proceedings, 2008.

20 Righini, R., Bottazi, A., Cobopoulos, Y., Fichera, C., Giacomo, M., and Perasso, L., "A New Monte Carlo Method for Reliability Centered Maintenance Improvement," International Conference on Safety and Reliability, Vol. 3, 1996, p. 14.

21 Horneck, G. and Comet, B., "General human health issues for Moon and Mars missions: Results from the HUMEX study," Advanced Space Research, Vol. 37, 2006, pp. 100–108.
