Coordinability and Consistency: Application of Systems
Theory to Accident Causation and Prevention
Raghvendra V. Cowlagi1
Worcester Polytechnic Institute, Worcester, MA, USA.
Joseph H. Saleh2
Georgia Institute of Technology, Atlanta, GA, USA.
Abstract – Recent works in the safety literature report several fruitful attempts to introduce
mathematically rigorous results from systems and control theory to bear upon accident prevention and
system safety. Previously, we discussed the implications on safety of the systems theoretic principles of
coordinability and consistency, and we identified the lack of coordinability and/or consistency as
fundamental failure modes in hierarchical multilevel systems. In this work, we further develop system
safety analysis techniques based on these principles. We demonstrate that these principles not only
provide a domain-independent vocabulary for expressing the results of post-mortem accident analyses,
but they can also be applied to guide design and operational choices for accident prevention and system
safety. We develop these ideas with the help of an illustrative case study. This case study represents a
broad class of systems where operational policies and procedures of individual stakeholders in the system
interact with physical processes such that new system behaviors emerge, and unanticipated safety issues
arise. We argue, and illustrate our arguments using this case study, that the coordinability and consistency
principles can be developed to deliver a threefold impact on accident analysis and prevention: firstly,
these principles provide domain-independent procedural templates and vocabulary for post-mortem
accident analysis. Secondly, these principles provide theoretical safety specifications to be met during
system design and operation. Finally, these safety specifications can precipitate the formulation of a series
of questions directly related to safety-oriented choices in the design, operation, and control of systems.
Keywords: coordinability; consistency; chemical reactor; accident prevention; system safety.
1 Assistant Professor, Aerospace Engineering Program, Department of Mechanical Engineering. 2 Associate Professor, Guggenheim School of Aerospace Engineering.
1 Introduction
Accident prevention and system safety are influenced not only by the reliability and failure
behavior of various subsystems and components, but also by the nature of interactions between
these components, as well as their interactions with external factors or environmental conditions.
For example, large-scale systems such as nuclear power plants, air traffic control systems, and
offshore oil platforms exhibit closely interacting technical, managerial, regulatory, and social
components. Within the realm of technical systems, emerging cyber-physical systems such as
intelligent transportation systems and mobile robots exhibit close interactions between
components of fundamentally different nature: namely, computational and physical components
(Asare, Broman, Lee, Torngren, & Sunder, 2012). In the safety literature, the terms man-made
disasters (Turner, 1978), organizational accidents (Reason, 1997), and system accidents
(Perrow, 1999) have been used to describe adverse events arising due not only to isolated failures
in human and technical elements of large systems, but also due to their flawed interactions.
These interactions are not properly understood and, when examined, it is often on an ad-hoc
basis and without an underlying formal and theoretical foundation. Such a theoretical foundation
is nevertheless essential to the study of domain-independent principles of accident prevention
and system safety, and the identification of such principles for hierarchical multilevel systems is
a crucial area of ongoing research. In this work, we contribute to this research by further
developing a previously introduced formal framework (Cowlagi & Saleh, 2013) for accident
analysis and system safety. To this end, we briefly review the relevant literature and accordingly
contextualize the proposed work. The interested reader is referred to (Saleh, Marais, Bakolas, &
Cowlagi, 2011) for a thorough review and critical appraisal of the major ideas in accident
prevention and system safety.
The literature reports on qualitative ideas and quantitative methods to guide design,
operational, and organizational choices for accident prevention and system safety. Notably, on the
one hand, the High Reliability Organization (HRO) paradigm (Rochlin, La Porte, & Roberts,
1987; Weick & Sutcliffe, 2007) presents a qualitative description of the salient managerial and
organizational features of entities that maintain high safety standards and low accident
occurrence rates. On the other hand, the method of quantitative risk assessment (QRA)
(Apostolakis, 2004) – first introduced as Probabilistic Risk Assessment (PRA) for nuclear power
plants (N. Rasmussen, 1975) – provides quantitative bases for making risk-informed design and
operational decisions related to system safety. QRA and PRA involve technical details of the
system configuration and operation, and develop an exhaustive list of possible accident
scenarios, along with their potential consequences, and the likelihood of their occurrences.
Excellent examples of such analyses for informing risk assessment in the chemical process
industries include (Khan & Abbasi, 1999; Kleindorfer et al., 2003; Kleindorfer, Makris, &
Conomos, 1999). The recent literature exhibits a thrust to support ideas and methods such as
HRO and QRA/PRA with domain-independent design principles to inform technical, managerial,
and organizational design choices for system safety (Saleh, Marais, & Favaro, 2014). Most
notably, the defense-in-depth safety principle (NRC, 2000; Sorensen, Apostolakis, Kress, &
Powers, 1999; Sorensen, Apostolakis, & Powers, 2000) emphasizes the implementation of
multiple and diverse “barriers” (Hollnagel, 2004) for interrupting potential accident sequences at
various stages. The purpose of these “barriers” is to prevent accident sequences from initiating,
and/or to prevent them from escalating, and/or to mitigate their eventual consequences. The
inherent safety principle (Khan & Amyotte, 2003, 2004) complements defense-in-depth by
providing guidelines for choosing in the early design stages the types and locations of safety
barriers.
These perspectives on accident prevention and system safety have now culminated in the so-
called systems and control theoretic approach to system safety (Saleh et al., 2011), which
pursues two complementary objectives: (1) to encapsulate the preceding perspectives on system
safety originating from diverse technological domains into a single theoretical and
mathematically rigorous framework, and (2) to leverage for accident prevention and system
safety the vast arsenal of analytical and algorithmic tools from systems and control theory. The
connections between control theory and the implementation and enforcement of safety barriers
and safety constraints have been recognized (Leveson, 2004a; J. Rasmussen, 1997), and the role
in system safety of the control theoretic notion of observability has been recently highlighted
(Bakolas & Saleh, 2011; Favaro & Saleh, 2014). The connection between systems theory
(Bertalanffy, 1969; Mesarovic, Macko, & Takahara, 1970; Weinberg, 1975) and system safety is
motivated by the observation that accidents can result “from dysfunctional interactions among
system components” (Leveson, 2004a), and that fundamental failure modes resulting due to such
dysfunctional interactions are ill-understood (Leveson, 2004b). In a recent work (Cowlagi &
Saleh, 2013), we discussed the implications for accident causation and system safety of the
systems theoretic principles of coordinability and consistency, hereafter referred to as C&C.
Specifically, we identified the lack of coordinability and/or consistency as fundamental
failure modes in hierarchical multilevel systems, and we illustrated this claim using relevant
accident case studies.
In this work, we further develop system safety analysis techniques based on the
coordinability and consistency (C&C) principles. Specifically, the novel contributions of this
paper are as follows. Firstly, we demonstrate that the C&C principles provide a theoretical
vocabulary for expressing the results of post-mortem accident analyses, which can assist in
extracting important lessons to be learnt, and in identifying common accident pathogens from
epidemiological studies of accidents in diverse technological domains. Secondly, and more
importantly, we demonstrate the value of C&C-based system safety analysis for making design
and operational choices. In particular, we illustrate the influence of this safety analysis on the
choice of measurement equipment and estimation algorithms for various attributes of the system,
thereby relating the “systems-” and “control-” theoretic facets of the system safety problem.
More generally, we demonstrate that, for system design, the C&C principles can provide
theoretical and general “safety specifications” that are more informative than the tautological
specification of “the system must be safe”, and more concise and domain-independent than
specifications consisting of an exhaustive list of potential scenarios that must be avoided. To aid
the exposition of these ideas, we present details on the application of the C&C principles for
system safety analysis via a detailed illustrative example of a chemical reactor. Although the
case study treated here is from the chemical industry, and the analytical model developed is
specific to our reactor, this case study represents a broad class of multi-level systems where
operational policies and procedures of individual stakeholders in the system interact with
physical processes such that new system behaviors emerge, and unanticipated safety issues arise.
The rest of this paper is organized as follows: In Section 2, we provide a brief discussion of
the C&C principles for the sake of completeness. The reader interested in further details is
referred to (Cowlagi & Saleh, 2013) for a thorough discussion of these principles. In Section 3,
we introduce a model of a chemical process plant, and in Section 4, we illustrate the application
of the C&C principles for safety analysis of this plant. Finally, in Section 5, we provide
conclusions of the proposed analysis, and directions for future research.
2 System Theoretic Framework for Accident Analysis and System
Safety
In this work, we focus on systems involving components interacting over multilevel
hierarchies. Hierarchical multilevel structures are omnipresent in systems both in a purely
technical context (e.g., cyber-physical systems) and in a sociotechnical context. Hierarchical
multilevel structures enable tractable solutions of management and control of systems with ever-
increasing technical and organizational complexity. Specifically, these structures support
functional specialization, modular design, and multiplicity of decision-making units to break
down the overall problem into manageable sub-problems. For simplicity of exposition, we
consider a two-level hierarchy as is common practice (Mesarovic et al., 1970; Zhong &
Wonham, 1990), with the understanding that the proposed developments can be iteratively
applied to multilevel systems by analyzing pairs of components at successive hierarchical levels,
and by aggregating components at multiple levels. In this section, we first introduce the formal
concepts of coordinability and consistency in hierarchical multilevel systems using the
terminology and definitions of (Mesarovic et al., 1970). Then, we summarize the implications of
the C&C principles on system safety, which we discussed in detail in (Cowlagi & Saleh, 2013)
using illustrative examples. With this background information covered, we will be ready for the
analytical model development and the accident analysis using the C&C concepts in Sections 3
and 4.
2.1 Formal Definitions of Coordinability and Consistency
The following description of the C&C principles is a summary of the extended discussion in
(Cowlagi & Saleh, 2013); the reader already familiar with these concepts may skip this subsection.
We represent a system as a mapping between a set of inputs and a set of outputs. In
particular, a decision-making system chooses a decision $x$ to solve a decision problem $\mathcal{D}$. The
output $y$ of a decision-making system is a transformation of the decision, i.e., $y = T(x)$. A
decision problem, in general, involves the selection of a decision that satisfies a pre-specified set
of constraints and may involve the optimization of a cost or reward function. A two-level
hierarchical system, consisting of a single higher-level unit and N lower-level units, and
controlling a certain process, is shown in Figure 1. Following (Mesarovic et al., 1970), we refer
to the higher level decision-making unit as the supremal unit and to the lower level decision-
making units as the infimal units.
Figure 1: Illustration of a two-level hierarchical decision-making system.
Each of the $N$ infimal units contributes to controlling the process: the $i$-th unit provides input
$y_i$ to the process and receives feedback $z_i$ from the process. The supremal decision-making unit
provides a coordination input $\gamma_i$ to the $i$-th infimal unit, and receives feedback from that unit. The
decision problem of each infimal unit depends on the input $\gamma_i$.
The overall decision problem $\mathcal{D}$ is what the supremal and infimal units attempt to jointly
solve by addressing their local decision problems. We denote by $\mathcal{D}_S$ and $\mathcal{D}_{I,i}$, respectively, the
local decision problems of the supremal unit and of the $i$-th infimal unit. The supremal unit cannot
solve by itself the overall decision problem; instead it solves its own decision problem $\mathcal{D}_S$ to
determine the coordination inputs $\gamma_1, \ldots, \gamma_N$, which it then provides to the infimal units.
These coordination inputs define the decision problems of the infimal units, which, in turn,
attempt to solve their decision problems $\mathcal{D}_{I,i}$ and thereby affect the process to solve the overall
decision problem $\mathcal{D}$. To make explicit the dependence of the infimal decision problems on the
coordination inputs, we denote these problems by $\mathcal{D}_{I,i}(\gamma_i)$.
The C&C principles are formulated using predicate logic (Alagar & Periyasamy, 1998; Huth
& Ryan, 2004), the basic elements of which are briefly presented here for the reader’s
convenience: a proposition is a statement that can be either true or false. For example, the
statement “The Earth is flat” is a valid proposition, but the sentence “Will it rain today?” is not.
A predicate indicates either a property of an object or a relationship between different objects. A
predicate acting on a particular object is a proposition. Following (Mesarovic et al., 1970), we
introduce a logical predicate $P(\cdot\,,\cdot)$ defined such that $P(x, \mathcal{D})$ is true if and only if $x$ is a solution
to the decision problem $\mathcal{D}$.
In this framework for hierarchical multilevel systems, we first discuss the notion of
coordinability of the infimal decision problem relative to the supremal decision problem
(Mesarovic et al., 1970), henceforth abbreviated as, simply, coordinability. Conceptually,
coordinability means that there exists a coordinating input such that each of the corresponding
infimal decision problems has at least one solution. Stated differently, coordinability means that
the supremal unit can solve its decision problem to find a coordinating input $\gamma = (\gamma_1, \ldots, \gamma_N)$, and
that there exists a solution for each of the infimal decision problems $\mathcal{D}_{I,i}(\gamma_i)$. Conversely, the
lack of coordinability implies that no solution for the supremal unit decision problem exists for
which all the infimal units can solve their associated problems. Stated differently, lack of
coordinability means that the infimal units have been given coordinating inputs and decision
problems that cannot be solved—more colloquially, they have been inadvertently set up to fail.
Formally, the infimal decision problems $\mathcal{D}_{I,i}$ are said to be coordinable relative to the supremal
decision problem $\mathcal{D}_S$ if and only if the following proposition is true (Mesarovic et al., 1970):

$$\exists \gamma \; \exists x \; \big[\, P\big(x, \mathcal{D}_I(\gamma)\big) \ \text{and}\ P\big(\gamma, \mathcal{D}_S\big) \,\big]. \qquad (1)$$
Proposition (1) reads as follows: there exist $\gamma$ and $x$ such that $\gamma$ solves the supremal
decision problem and $x$ solves the infimal decision problems. In other words, there exists some
solution to the supremal decision problem for which the infimal units can also solve their
problems. Figure 2 illustrates this definition of coordinability.
Figure 2: Illustration of the concept of coordinability relative to the supremal decision problem (Cowlagi &
Saleh, 2013). © 2013 by John Wiley & Sons.
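For finite decision sets, proposition (1) can be checked by direct enumeration. The following Python sketch encodes the predicate $P$ as a boolean function and searches for a coordinating input under which every infimal problem is solvable; the decision sets and predicate definitions are entirely hypothetical toy examples, not taken from this paper:

```python
# Hypothetical finite decision sets and toy predicates (illustrative only).
GAMMAS = ["g1", "g2"]   # candidate coordination inputs for the supremal unit
XS = [0, 1, 2]          # candidate decisions for each infimal unit

def solves_supremal(gamma):
    """Predicate P(gamma, D_S): gamma solves the supremal problem."""
    return gamma in ("g1", "g2")

def solves_infimal(x, gamma, i):
    """Predicate P(x, D_{I,i}(gamma)): x solves the i-th infimal
    problem defined by the coordination input gamma."""
    if gamma == "g1":
        # Toy rule: unit 0 needs x >= 1; unit 1 needs x <= 1.
        return x >= 1 if i == 0 else x <= 1
    return False  # under "g2", every infimal problem is infeasible

def coordinable(n_units=2):
    """Proposition (1): there exists gamma solving D_S such that every
    infimal problem D_{I,i}(gamma) has at least one solution."""
    return any(
        solves_supremal(g)
        and all(any(solves_infimal(x, g, i) for x in XS)
                for i in range(n_units))
        for g in GAMMAS
    )

print(coordinable())  # True: gamma = "g1" works (x = 1 solves both units)
```

Note that if `"g1"` were removed from `GAMMAS`, the system would not be coordinable: under `"g2"` the infimal units are, colloquially, set up to fail.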
Next, we discuss the notion of consistency as proposed in (Mesarovic et al., 1970), which
relates the local decision problems of the supremal and infimal units to the overall decision
problem $\mathcal{D}$. Conceptually, the two-level system of Figure 1 is “consistent” if the overall decision
problem is solved when the supremal unit and each of the infimal units solve their own decision
problems. In other words, consistency in multilevel systems means that by solving their local
decision problems, the infimal and supremal decision units solve the overall problem. Formally,
the infimal and supremal decision problems are said to be consistent if the following proposition
is true:
$$\forall \gamma \; \forall x \; \big[\, P\big(x, \mathcal{D}_I(\gamma)\big) \ \text{and}\ P\big(\gamma, \mathcal{D}_S\big) \Rightarrow P\big(T(x), \mathcal{D}\big) \,\big]. \qquad (2)$$

Proposition (2) is called the consistency postulate (Mesarovic et al., 1970) and it reads as
follows: for all $x$ and all $\gamma$ that solve, respectively, the infimal and supremal decision problems,
the outputs $T(x)$ of the infimal units solve the overall decision problem $\mathcal{D}$. Stated differently, if
a two-level system is consistent, the overall decision problem is solved whenever the supremal
decision unit coordinates the infimal units relative to its own objectives. Figure 3 illustrates this
definition of consistency.
Figure 3: Illustration of the notion of consistency (Cowlagi & Saleh, 2013). © 2013 by John Wiley & Sons.
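The consistency postulate (2) admits the same enumerative check for finite sets. A minimal sketch, again with hypothetical toy problems and a placeholder output transformation `T` of our own choosing:

```python
# Hypothetical finite sets and toy predicates (illustrative only).
GAMMAS = [0, 1]                # candidate coordination inputs
XS = [(0, 0), (0, 1), (1, 1)]  # candidate joint infimal decisions

def P_sup(gamma):     # P(gamma, D_S): gamma solves the supremal problem
    return gamma == 1

def P_inf(x, gamma):  # P(x, D_I(gamma)): x solves the infimal problems
    return sum(x) == gamma

def T(x):             # output transformation applied to infimal decisions
    return max(x)

def P_overall(y):     # P(y, D): output y solves the overall problem
    return y == 1

def consistent():
    """Postulate (2): whenever gamma solves D_S and x solves D_I(gamma),
    the resulting output T(x) must solve the overall problem D."""
    return all(
        P_overall(T(x))
        for gamma in GAMMAS if P_sup(gamma)
        for x in XS if P_inf(x, gamma)
    )

print(consistent())  # True for this toy system
```

The universal quantifiers of (2) become nested `all(...)` filters: only the pairs $(\gamma, x)$ that solve their local problems are tested against the overall problem.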
2.2 Implications on System Safety Analysis
We proposed in (Cowlagi & Saleh, 2013) that the C&C principles are important bases for
accident prevention, and their absences are fundamental failure modes that cause or contribute to
accidents in systems involving hierarchical multilevel structures. The C&C principles identify
fundamental properties that are independent of the system’s technological domain. In (Cowlagi
& Saleh, 2013), we analyzed the Tenerife accident, and we identified the lack of coordinability
between the air traffic control tower at Tenerife airport and the flight crew of one of the two
aircraft involved as a factor contributing to the disaster. We also studied examples of accidents
from domains as diverse as mobile robotics, marine transport, and airport ground operations, to
identify absences of coordinability and/or consistency as contributing factors. We argued that the
identification of lack of coordinability and/or consistency as fundamental failure modes can help
to reformulate in a mathematically rigorous manner some established views on accident
causation and system safety. For example, it has been claimed that “sloppy management” is the
root cause of many industrial accidents (Turner, 1978) and that “sloppy management” is
responsible for inadequate efforts to implement and respond to early warning systems (Hopkins,
2001). We asserted that formal analyses based on the C&C principles can refine or displace the
narrative of “sloppy management”, which is of limited value for making technical, managerial,
or organizational decisions for accident prevention and system safety. Finally, we also asserted
that the C&C principles may help to formalize qualitative HRO characteristics, and in particular,
to examine whether HRO characteristics are aligned with the safety ramifications of the C&C
principles.
To further study the application of C&C principles for accident analysis and system safety in
hierarchical multilevel systems, we develop in what follows an illustrative example of a chemical
reactor and discuss an accident scenario for this reactor. The purpose of this discussion is to
examine (a) failure modes that can arise due to flawed interactions between properly functional
components of a system, and (b) the application of C&C principles to not only identify these
failure modes in post-mortem accident analysis but also to provide guidance for making design
and operational choices for accident prevention and system safety.
3 Model Development and Illustrative Example
In this section, we describe and mathematically model a chemical reactor. Next, we describe
an accident scenario in this chemical reactor, which will serve as a concrete case study to
exemplify the role of the C&C principles in accident analysis and prevention. This case study
represents a broad class of systems where operational policies and procedures of individual
stakeholders in the system interact with physical processes such that new system behaviors
emerge, and unanticipated safety issues arise. This case study involves a particular example of
the general two-level hierarchical control structure discussed in Section 2, where the supremal
and infimal units each solve their local decision problems to contribute towards the solution of
the overall decision problem. We also examine via this case study important details involved in
this framework, such as communication delays, predetermined operational policies, the role of
observation and estimation of process variables, and differences in decision-making and process
time scales. As we discuss later, these issues influence accident causation and prevention in
hierarchical multilevel systems in ways that are not yet well-characterized.
After describing the chemical reactor model and an accident scenario in this Section, we will
turn to its analysis using the C&C principles in Section 4. Thereafter, we will discuss broad
lessons that can be extracted from this analysis and applied more generally to hierarchical
multilevel systems.
3.1 Chemical Reactor Model
Consider the operation of a chemical reactor illustrated in Figure 4. A large enclosed vessel is
provided with an inlet, an outlet, and two gas vents, with one referred to as the “standard” vent,
and the other as the “emergency” vent. Substances $C_1$ and $C_2$ are let into the vessel, where they
chemically react with each other – we will henceforth refer to this chemical reaction as the core
reaction – to yield a useful substance $C_3$ in liquid form, which is the desired product of this
chemical process. The reaction of $C_1$ and $C_2$ also results in a gaseous by-product $C_4$.
A mathematical model of this chemical reactor is given by the following pair of linear
differential equations. Table 1 explains the nomenclature used in these equations.
$$\dot p(t) = q(t) - r_{s0} - \hat r_s(t) - r_e\, e(t) + w_p(t), \qquad (3)$$

$$\dot q(t) = \begin{cases} k_n\, s(t) - v_0 - \hat v(t) + w_q(t), & p_n \le p(t) \le p_b, \\ k_h\, s(t) - v_0 - \hat v(t) + w_q(t), & p(t) > p_b. \end{cases} \qquad (4)$$
By Eqn. (4), the core reaction is pressure-sensitive: whereas the rate of production of $C_3$,
namely $\dot q$, is low (and therefore not of our interest) when the pressure of $C_4$ is below $p_b$, it is
high when the pressure exceeds $p_b$; that is, $k_h > k_n$.
Figure 4: Illustration of the chemical reactor.
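For readers who wish to experiment with the model, Eqns. (3)–(4) can be integrated numerically. The sketch below uses forward-Euler integration; all numerical parameter values are hypothetical placeholders, not taken from this paper:

```python
def reactor_step(p, q, s, v0, vhat, rs0, rshat, e, dt,
                 kn=0.5, kh=2.0, pb=2.0, re=1.0, wp=0.0, wq=0.0):
    """One forward-Euler step of Eqns. (3)-(4).
    p: pressure of C4; q: mass of C3; s: inlet flow rate;
    v0 + vhat: total outlet flow; rs0 + rshat: standard-vent pressure loss;
    re * e: emergency-vent loss (e is 0 or 1); wp, wq: disturbances."""
    k = kh if p > pb else kn               # pressure-sensitive rate, Eqn. (4)
    p_dot = q - rs0 - rshat - re * e + wp  # Eqn. (3)
    q_dot = k * s - v0 - vhat + wq         # Eqn. (4)
    return p + dt * p_dot, q + dt * q_dot

# At a (hypothetical) nominal point with q_n = rs0 and kn * s = v0, both
# derivatives vanish and (p, q) remain at their set-points.
p, q = reactor_step(1.0, 1.5, s=2.0, v0=1.0, vhat=0.0,
                    rs0=1.5, rshat=0.0, e=0, dt=0.1)
print(p, q)  # 1.0 1.5
```

Choosing `q_n = rs0` and `kn * s = v0` makes the nominal point an equilibrium of (3)–(4), which is the situation the automatic controller of Section 3.2 is meant to preserve under small disturbances.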
To maintain the variables $(p, q)$ at their nominal operating values $(p_n, q_n)$, the standard
vent is designed to cause a nominal pressure loss at the rate $r_{s0}$, and the outlet flow rate is
nominally maintained at a value $v_0$. An automatic controller, described in Section 3.2, makes
small adjustments $\hat r_s$ and $\hat v$, respectively, to the standard vent and the outlet to counteract
process disturbances. These adjustments are limited: specifically, $\hat r_s \in [0, \bar r_s]$ and $\hat v \in [-\bar v, \bar v]$,
where the bounds $\bar r_s$ and $\bar v$ are prespecified. The emergency vent is manually operated, as we
explain next. Furthermore, whenever the emergency vent is open, the aforesaid automatic
controller is designed to saturate the adjustments $\hat r_s$ and $\hat v$ to their maximum values.
Instrumentation is installed on the reactor to measure and indicate the pressure $p$ of the gas
$C_4$, and the quantity $q$ of the substance $C_3$ inside the reaction vessel.
The gas $C_4$ is flammable, and self-ignites when compressed beyond a certain threshold
pressure $p_s$. One of the major safety hazards identified for this reactor is an uncontrolled self-
ignition of the gas $C_4$, which constitutes a catastrophic failure of the reactor. To maintain safe
and profitable operation, this chemical reactor is controlled and supervised as described next.
Table 1: Nomenclature used in this case study.

Symbol | Meaning
$p$ | Pressure of the gaseous substance $C_4$ within the reaction vessel
$q$ | Mass of the substance $C_3$ within the reaction vessel
$s$ | Mass flow rate of each of the input substances $C_1$ and $C_2$ at the inlet
$v_0,\ \hat v,\ v$ | Mass flow rates of the substance $C_3$ at the outlet: respectively, setpoint, automatic controller adjustment, and total flow rate
$p_n,\ q_n$ | Nominal operating values of $p$ and $q$
$p_b$ | Threshold pressure beyond which the production rate of $C_3$ significantly increases
$p_d$ | Threshold pressure, the breaching of which is considered a near-miss accident
$p_s$ | Pressure at which the gaseous substance $C_4$ self-ignites
$s_N,\ s_S$ | Preset inlet flow rates in modes $N$ and $S$, respectively
$v_N,\ v_S,\ v_D$ | Preset outlet flow rates in modes $N$, $S$, and $D$, respectively
$v_m$ | True outlet flow rate in the accident scenario due to malfunctioning outlet pump
$k_n,\ k_h$ | Constants of proportionality, with $k_h > k_n$
$r_{s0},\ \hat r_s,\ r_e$ | Rates of pressure loss due to the standard and emergency vents
$e$ | Binary state (closed = 0, open = 1) of the emergency vent
$\tau_c$ | Communication delay
$w_p,\ w_q$ | Process disturbances
$\bar w_p,\ \bar w_q$ | Known maximum absolute values of process disturbances
3.2 Hierarchical Control Structure and Safety Policies
The reactor is controlled by a two-level hierarchy of human supervisors. The lower-level
supervisor, whom we call the operator, chooses set-points for the inlet and outlet flow rates, and
decides whether to open or close the emergency vent. The higher-level supervisor, whom we call
the manager, instructs the operator of the mode of operation of the reactor. Specifically, the
manager chooses one of the following modes of operation of the reactor:
1. Normal mode $N$: the inlet and outlet flow rates are set to the nominal values of,
respectively, $s_N$ and $v_N$, to maintain the production of $C_3$ at a nominal rate and to
maintain the quantity of $C_3$ within the reactor vessel at a constant, nominal value of $q_n$.
2. Safe mode $S$: the inlet and outlet flow rates are set to, respectively, $s_S < s_N$ and
$v_S > v_N$, to lower the rate of production and to reduce the quantity of $C_3$ within the
reactor vessel.
3. Dead-stop mode $D$: the inlet flow rate is reduced to zero, and the outlet flow rate is set
to its maximum value $v_D > v_S > v_N$.
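The manager's mode choices amount to a lookup of preset flow rates, together with the automatic switch to Dead-stop at the threshold $p_d$ described in Section 3.2. A sketch with hypothetical rate values (only the orderings $s_S < s_N$ and $v_N < v_S < v_D$ are meaningful):

```python
# Preset flow rates per mode (hypothetical values for illustration).
MODES = {
    "N": {"s": 2.0, "v": 1.0},   # Normal
    "S": {"s": 1.0, "v": 1.5},   # Safe: lower inlet, higher outlet
    "D": {"s": 0.0, "v": 2.0},   # Dead-stop: zero inlet, maximum outlet
}
P_D = 2.5  # hypothetical automatic Dead-stop threshold, p_d < p_s

def setpoints(mode, p):
    """Return the (inlet, outlet) set-points for the current mode; the
    switch to Dead-stop at p > p_d is automatic, regardless of the
    manager's chosen mode."""
    if p > P_D:
        mode = "D"
    m = MODES[mode]
    return m["s"], m["v"]

print(setpoints("N", 1.0))  # (2.0, 1.0): Normal set-points
print(setpoints("S", 2.6))  # (0.0, 2.0): auto-switched to Dead-stop
```

The automatic override at `P_D` sits below the manager in the hierarchy yet acts without the manager's approval, which is exactly the kind of cross-level interaction the C&C analysis of Section 4 examines.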
As previously mentioned, an automatic controller makes small adjustments to the standard
vent and the outlet flow rate. Specifically:
1. During the Safe and Dead-stop modes, and/or whenever the emergency vent is open, this
controller saturates these adjustments to their maximum values, i.e., it sets
$\hat r_s(t) = \bar r_s$ and $\hat v(t) = \bar v$.
2. During the Normal mode, and when the emergency vent is closed, this controller
determines these adjustments by the following linear feedback control law:

$$\begin{bmatrix} \hat r_s(t) \\ \hat v(t) \end{bmatrix} = K \begin{bmatrix} \tilde p(t) \\ \tilde q(t) \end{bmatrix}, \qquad (5)$$

where $\tilde p(t) = p(t) - p_n$, $\tilde q(t) = q(t) - q_n$, and $K$ is a gain matrix. The matrix $K$ can be
obtained by several control-theoretic algorithms: linear quadratic regulation (LQR), for
example (Anderson & Moore, 1990). This control law counteracts small process
disturbances to stabilize the variables $p$ and $q$ at their nominal set-points, respectively,
$p_n$ and $q_n$.
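As a sanity check of the feedback law (5): linearizing Eqns. (3)–(4) about the set-point gives deviation dynamics that the gain $K$ must stabilize. The sketch below uses a hand-picked gain, a stand-in for an actual LQR design, and verifies by simulation that a small deviation decays; all values are hypothetical:

```python
# Linearized deviation dynamics about (p_n, q_n), from Eqns. (3)-(4):
#   d/dt p~ = q~ - rs_hat,   d/dt q~ = -v_hat,
# with the feedback law (5): [rs_hat, v_hat]^T = K [p~, q~]^T.
# This K is hand-picked for illustration, not an LQR solution; the
# closed-loop matrix [[-1, -0.5], [-0.5, -1]] has eigenvalues -0.5, -1.5.
K = [[1.0, 1.5],
     [0.5, 1.0]]

def closed_loop_step(pt, qt, dt=0.01):
    """One Euler step of the closed-loop deviation dynamics."""
    rs_hat = K[0][0] * pt + K[0][1] * qt
    v_hat = K[1][0] * pt + K[1][1] * qt
    return pt + dt * (qt - rs_hat), qt + dt * (-v_hat)

pt, qt = 0.2, 0.1           # small initial deviation from set-point
for _ in range(2000):       # simulate 20 time units
    pt, qt = closed_loop_step(pt, qt)
print(abs(pt) < 1e-3 and abs(qt) < 1e-3)  # True: deviations decay
```

An LQR design would instead solve an algebraic Riccati equation for $K$; the point here is only that a stabilizing linear gain keeps $(p, q)$ near $(p_n, q_n)$ as long as the adjustments stay within their saturation limits.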
The following safety policies are adopted by the manager and operator hierarchy to
safeguard against the self-ignition of the gaseous substance $C_4$. Firstly, the operator opens the
emergency vent whenever s/he notices an upward trend in the pressure $p$ that is not successfully
counteracted by the maximum capability of the automatic controller (i.e., the pressure rises
despite saturation of the automatic control inputs). Secondly, the reactor is designed to switch
automatically (i.e., without the manager's explicit approval) to the Dead-stop mode when the
pressure $p$ exceeds a threshold $p_d < p_s$. The threshold $p_d$ is designed such that if the plant is
switched to the Dead-stop mode at that threshold, the self-ignition pressure will not be breached.
Thirdly, a set of adverse safety events is defined in terms of certain predetermined inequality
constraints on the variables $(p, q)$. These constraints define envelopes for safe operation of the
reactor. Whenever any of these constraints is violated, i.e., an envelope of safe operation is
breached, an adverse safety event is said to have occurred. The operator notifies the manager,
who, in turn, decides whether to change the reactor’s mode of operation and accordingly notifies
the operator to implement the change of mode (if any). This back-and-forth communication
between the operator and manager requires a time of at most $\tau_c$, including the time required by
the manager to assess the situation and make a decision. Two such adverse safety events are as
follows:
SE1 – defined as the set of all $(p, q)$ such that $p$ will exceed $p_b$ in time $\tau_c$ assuming worst-
case disturbances, given that the current reactor mode is Normal and the emergency vent is
currently open. Precisely, SE1 occurs when the following inequality is violated³:

$$p(t) + \tau_c \big( q(t) - r_{s0} - \bar r_s - r_e + \bar w_p \big) + \frac{\tau_c^2}{2} \big( k_n\, s_N - v_N - \bar v + \bar w_q \big) \le p_b. \qquad (6)$$
SE2 – defined as the set of all $(p, q)$ such that $p$ will exceed $p_d$ in time $\tau_c$ assuming
worst-case disturbances, given that the current reactor mode is either Normal or Safe and the
emergency vent is currently open. Precisely, SE2 occurs when the following inequality is
violated:

$$p(t) + \tau_c \big( q(t) - r_{s0} - \bar r_s - r_e + \bar w_p \big) + \frac{\tau_c^2}{2} \big( k_n\, s_N - v_N - \bar v + \bar w_q \big) \le p_d. \qquad (7)$$
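The safety-event conditions are worst-case, second-order predictions of the pressure over the communication delay $\tau_c$: a first-derivative term from Eqn. (3) and a second-derivative term from Eqn. (4). A sketch of the trigger logic, with hypothetical parameter values:

```python
# Hypothetical parameter values; only the structure of the worst-case
# prediction over tau_c is illustrated.
TAU_C = 0.5                       # communication delay
RS0, RS_BAR, RE = 1.5, 0.2, 1.0   # vent loss rates (standard vent saturated)
KN, S_N, V_N, V_BAR = 0.5, 2.0, 1.0, 0.1
WP_BAR, WQ_BAR = 0.05, 0.05       # worst-case disturbance magnitudes
P_B, P_D = 2.0, 2.5               # thresholds for SE1 and SE2

def predicted_p(p, q, tau=TAU_C):
    """Worst-case p after a delay tau: Normal mode, emergency vent open,
    controls saturated; p_dot from Eqn. (3), p_ddot = q_dot from Eqn. (4)."""
    p_dot = q - RS0 - RS_BAR - RE + WP_BAR
    q_dot = KN * S_N - V_N - V_BAR + WQ_BAR
    return p + tau * p_dot + 0.5 * tau ** 2 * q_dot

def se1(p, q):
    return predicted_p(p, q) > P_B   # safe-operation inequality violated

def se2(p, q):
    return predicted_p(p, q) > P_D   # near-miss inequality violated

print(se1(1.0, 1.5), se1(1.95, 3.5), se2(1.95, 3.5))  # False True False
```

Note that the prediction credits the outlet with its set-point rate; if the true outlet flow is lower (as in the accident scenario below), the trigger fires later than it should.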
In addition to these predetermined and well-defined safety events, the safety policy allows
for so-called operator-triggered adverse safety events. These are events where the operator
detects highly anomalous reactor behavior, and notifies the manager to request a mode change.
The operator can recommend switching to a particular mode based on his/her experience and
situational awareness about the reactor’s state of operation. However, the operator is trained to
3 The inequalities (6) and (7) can be easily derived from the process equations (3)-(4) and the stated definitions
of the safety events.
sparingly trigger these “early” adverse safety events – and instead let the “standard” adverse
safety events occur – in order to avoid excessive false alarms that disrupt the reactor’s
production schedule.
The hierarchical control structure of the chemical reactor is summarized in Figure 5.
Figure 5: Illustration of the hierarchical control structure of the chemical reactor.
3.3 Accident Scenario
Consider an accident scenario consisting of the following sequence of events.
1. The reactor is operating in the Normal mode and the variables $(p, q)$ are at their nominal
operating values. Unknown to the operator, the outlet pump malfunctions at time $t = t_0$,
and the outlet flow rate gradually starts to decrease. At time $t = t_1$, the outlet flow rate
settles to a value of $v_m < v_N$ (nomenclature as in Table 1).
2. The automatic controller responds to the decreased flow rate. However, because $v_m$ is
significantly lower than $v_N$, both of the control inputs $\hat r_s$ and $\hat v$ saturate to their
maximum values. The quantity of $C_3$ within the reaction vessel exhibits a consistent
upward trend, followed by the pressure of the gas $C_4$, which also exhibits the same trend.
3. Noting these upward trends, and noting that the automatic control inputs are saturated,
the operator opens the emergency vent at time $t = t_2$.
4. The operator notices a downward trend in $p$, which matches his/her expectations. The
operator notes that $q$ continues to increase (according to the gauge that indicates $q$ to the
operator), which is anomalous behavior. However, because $p$ is decreasing as expected,
the operator does not notify the manager, assuming an instrument fault in the outlet flow
meter. Note that the outlet flow rate is not indicated to the operator.
5. At time t = t_3, safety event SE_1 occurs. The operator notifies the manager, who instructs
the operator to switch the reactor to the Safe mode.
6. At time t_4 = t_3 + τ_c, the operator switches the reactor to the Safe mode. By this time,
however, p has already exceeded p_b. Therefore, the rate of production of C_3, and
consequently, the rates of increase of p and q, significantly increase.
7. At time t = t_5, safety event SE_2 occurs. The time period t_5 − t_4 is too short for the
operator to react to the situation and notify the manager before the occurrence of SE_2.
8. Following the manager's instructions, the operator switches the plant to Dead-stop mode.
By this time, p has exceeded p_d. The variables p and q both continue to rise, and at
time t = t_6, the variable p breaches the self-ignition pressure p_s, and a catastrophic
failure occurs.
In Section 4.2, we discuss a “post-mortem” analysis of this accident in the C&C framework.
Informally, the “standard” adverse safety events (namely, SE_1 and SE_2) were designed
assuming that the actual outlet flow rate and the actual rates of pressure loss due to the two vents
were equal to the set-point values. Any serious anomalies between the actual and set-point rates
were left for the operator to detect and accordingly trigger an early adverse safety event. In this
accident scenario, the actual outlet flow rate was significantly less than the set-point flow rates of
the Normal and Safe modes. Furthermore, the operator did not trigger an early adverse safety
event. After the operator opened the emergency vent at time t = t_2, an initial reduction in the
pressure p occurred due to the process dynamics described by Eqns. (3) and (4).
To demonstrate the plausibility of this accident scenario, we performed a numerical
simulation of the process with the MATLAB® software tool, using the parameter values
shown in Table 2. Figure 6 shows the accident trajectory resulting from this numerical
simulation. Figure 7 and Figure 8 show, respectively, the evolutions over time of the variables
p, q and the control inputs r_s, v̂. For this simulation, Table 3 shows the time instants of
occurrence of each of the events described above.
Table 2: Parameter values used for numerical simulation.
Parameters | Values | Parameters | Values
p_n, p_b, p_d, p_s | 100, 250, 600, 1000 kPa | q_n | 1000 kg
v_m, v_N, v_S, v_D | 125, 200, 300, 25 kg/min | s_N, s_S | 125, 75 kg/min
v̂ | 25 kg/min | r_s0, r_s | 300, 30 kPa/min
k_n, k_h | 1.0, 1.8 | τ_c | 1 min
 | 0.3 kPa/kg-min | w_p, w_q | 0, 0
Table 3: Time instants (units: min.) of the various events described in the accident scenario.
t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6
5.000 | 9.050 | 12.525 | 13.525 | 15.500 | 16.500 | 19.375
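As a quick sanity check on Table 3, the reported event times can be encoded and verified to be strictly increasing, and the inter-event gaps computed. This is a sketch in Python rather than the authors' MATLAB tool; the comments labeling each instant restate the accident scenario description:

```python
# Event times (min) taken from Table 3 of the simulated accident scenario.
event_times = {
    "t0": 5.000,   # outlet pump malfunction begins
    "t1": 9.050,   # outlet flow rate settles at v_m
    "t2": 12.525,  # operator opens the emergency vent
    "t3": 13.525,  # safety event SE1 occurs
    "t4": 15.500,  # reactor switched to Safe mode
    "t5": 16.500,  # safety event SE2 occurs
    "t6": 19.375,  # p breaches p_s; catastrophic failure
}

times = list(event_times.values())

# The event sequence must be strictly increasing in time.
assert all(a < b for a, b in zip(times, times[1:]))

# Inter-event gaps (min); note the short SE1-response-to-SE2 window.
items = list(event_times.items())
gaps = {f"{k2}-{k1}": round(v2 - v1, 3)
        for (k1, v1), (k2, v2) in zip(items, items[1:])}
print(gaps["t5-t4"])  # 1.0
```

The one-minute gap between t_4 and t_5 is consistent with the scenario's claim that the operator had too little time to react before SE_2 occurred.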
Figure 6: Accident trajectory obtained via numerical simulation. The red-colored dots indicate events listed
in the preceding description of the accident scenario. The red-colored dot with bold black border at
(p, q) = (100 kPa, 1000 kg) is the nominal operating condition. For adverse safety events SE_1 and SE_2, a crossing
from the left to the right of the boundary indicated for each event constitutes an occurrence of that adverse
safety event.
Figure 7: The time-histories of the variables p and q in the numerically simulated accident. The dotted red
lines indicate the time instants for each of the events listed in the accident scenario.
Figure 8: The time-histories of the automatic control inputs r_s and v̂ in the numerically simulated accident.
The dotted red lines indicate the time instants for each of the events listed in the accident scenario. Note that
both the control inputs saturate soon after the outlet pump malfunction occurs, and the operator reacts to
these saturations by opening the emergency vent. Recall that, once the emergency vent is opened, the
automatic controller is designed to saturate the control inputs regardless of the state of the reactor.
4 Accident Analysis and Discussion
In this section, we analyze the chemical reactor model and the preceding accident scenario
in the systems-theoretic framework introduced in Section 2. We show that the consistency
postulate did not hold true for the manager-operator hierarchy. This lack of consistency may be
considered a fundamental failure mode that was among the “root causes” of the accident. We
discuss the ramifications of identifying the lack of consistency as one of the “root causes”,
including the design guidelines that can be extracted. In addition to the conclusions drawn
regarding consistency (or lack thereof) in hierarchical control structures, a secondary purpose of
the following analysis is to provide a detailed and rigorous example of the systems- and control-
theoretic treatment of system safety, and to spur further research into such analyses for real-
world systems.
4.1 Real-World Accident Analogs of the Case Study
We constructed the preceding chemical reactor model and accident scenario for illustrative
purposes, and we showed that this accident scenario is mathematically feasible. Here, we further
corroborate the argument that our case study is illustrative of real-world accidents – and that, by
consequence, the proposed analysis based on the C&C principles carries practical value – by
pointing out several characteristics of this example that bear resemblance to real-world, large-
scale systems that suffered catastrophic accidents.
Firstly, consider the absence of a gauge indicating the outlet flow rate to the operator, which
may be considered a glaring and “obvious” omission that hinders the operator’s situational
awareness. However, an example of a similarly “obvious” (in hindsight) omission of
instrumentation may be found in the accident analysis of the isomerization plant at the BP Texas
City refinery (Bakolas & Saleh, 2011; Mogford, 2005). There, the raffinate tower level indicator
was designed such that overflows above 100% would not be indicated, and this instrumentation
flaw played a crucial role in the accident (Mogford, 2005).
Secondly, consider the “simple” malfunction of the outlet pump, perhaps pointing to a lack
of redundancy of this crucial component. The real-world parallels of this accident are found in
the Piper Alpha disaster, where a crucial pump malfunctioned at a time when its redundant pump
was shut down for maintenance, and another crucial fire-fighting pump had no redundancy (Pate-
Cornell, 1993).
Thirdly, consider the operator's inaction before the occurrence of each safety event despite
clear indications that the quantity q of C_3 in the reaction vessel was abnormally high and rising. This
situation illustrates operator confusion and impaired situational awareness during accident
sequences, as exemplified in the loss-of-coolant accident at the Three Mile Island nuclear power
station (Hopkins, 2001). In that accident, the operators inadvertently abetted the loss of coolant,
despite warnings and indications pointing to contradictory actions; nevertheless, it has been
argued that their reactions may not have been “erroneous,” and that “operator error” should not
be considered among the root causes of the Three Mile Island accident (Hopkins, 2001; Perrow,
1999).
Finally, consider the manager's decision to switch to the Safe mode following the
occurrence of SE_1, in conjunction with the operator's inaction in requesting to switch directly to
the Dead-stop mode. In addition to the issue of the operator’s confusion and impaired situational
awareness, organizational policies and, perhaps, cost-related considerations influenced the
decisions of both manager and operator. These decisions and actions (or lack thereof) bear
resemblance to a similar decision that contributed to the Piper Alpha offshore platform accident:
there, a nearby interconnected platform called Claymore decided to continue production despite
unambiguous indications that a fire had already started at the Piper Alpha platform (Pate-
Cornell, 1993).
4.2 Analysis in the C&C Framework
We first identify the elements of the chemical reactor case study in the context of the
framework of the C&C principles presented in Section 2. The hierarchical control structure is
clear: the manager constitutes the supremal unit, the operator constitutes the (solitary) infimal
unit, and the reactor, including the automatic controller, constitutes the process (see Figure 1).
We assume that the system safety is to be assessed over the time interval [0, t_f], where t_f defines
some timeframe of interest.
For the purposes of this analysis, we assume that the overall decision problem Γ is
informally expressed as “ensure safety of the reactor,” which we consider to be equivalent to the
problem of satisfying the constraint p(t) ≤ p_s at all times t ∈ [0, t_f]. In practice, this decision
problem will be an optimization problem with objectives related to profitability and constraints
related to supply chains and market demands, with the aforesaid safety constraint as one of the
constraints. The supremal decision problem Γ_S will involve some of these objectives with a
narrower scope. We assume that the objective of minimizing the deviation of the variables (p, q)
from their nominal operating condition (p_n, q_n) is included among the objectives of the supremal
unit. To circumvent the need for listing all of the objectives and constraints in the supremal
decision problem, we will consider a control policy for the manager, which encapsulates – and
decouples from our analysis – these additional objectives and constraints in Γ_S, so that we may
focus on the safety-related objective of minimizing the deviation of (p, q) from the nominal
operating condition. The infimal decision problem Γ_I is narrower in scope, and we define it as
the problem of (solely) minimizing the deviation of the reactor's operating condition from the
nominal operating condition (p_n, q_n). This problem may be mathematically expressed as the
minimization of the quantity
minimization of the quantity
f
T
o0
ˆ ˆ( ) ( )
ˆ ˆ( ) (d
),
t p t p t
q t qQ t
t
(8)
where p̂ and q̂ are deviations from the nominal operating condition (see Eqn. (5)), and Q_o is a
positive-definite matrix. Note that an LQR-based automatic controller will “help” the operator by
adjusting the standard vent and the outlet flow rate to minimize a quantity similar to that in
expression (8).
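The quadratic deviation cost of expression (8) can be evaluated numerically for any sampled trajectory. A minimal Python sketch, using hypothetical deviation histories (not the paper's simulation data) and taking Q_o as the identity matrix, since the paper does not specify its entries:

```python
import numpy as np

def deviation_cost(t, p_hat, q_hat, Q_o):
    """Numerically evaluate the quadratic deviation cost of expression (8),
    i.e. the integral of [p_hat(t) q_hat(t)] Q_o [p_hat(t) q_hat(t)]^T
    over [0, t_f], via the trapezoidal rule on sampled histories."""
    z = np.stack([p_hat, q_hat])                     # 2 x N deviations
    integrand = np.einsum('in,ij,jn->n', z, Q_o, z)  # z(t)^T Q_o z(t)
    return float(np.sum((integrand[:-1] + integrand[1:]) / 2.0 * np.diff(t)))

# Hypothetical deviation histories, for illustration only.
t = np.linspace(0.0, 20.0, 201)
p_hat = 5.0 * np.sin(0.3 * t)     # pressure deviation, kPa
q_hat = 2.0 * (1.0 - np.exp(-t))  # quantity deviation, kg
cost = deviation_cost(t, p_hat, q_hat, np.eye(2))
print(cost > 0.0)  # True
```

A positive-definite Q_o guarantees a positive cost for any nonzero deviation history, which is what makes expression (8) a meaningful measure of departure from the nominal operating condition.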
Next, we discuss the decision sets of the supremal and infimal units – in the context of this
case study, these are the “safety levers” that the manager and operator hold. Informally, the
supremal decision set is the set of all possible sequences of modes of operation in the interval
[0, t_f]. Precisely, the supremal decision set, which we denote by Γ*_S, is defined by the set

Γ*_S := { {(M_i, τ_i)}_{i=0}^{n} : n ∈ ℕ, M_i ∈ {N, S, D}, and τ_i ∈ [0, t_f] for each i ∈ {0, 1, …, n} }.   (9)
Any supremal decision must be an element of this set, and is characterized by the number
of mode switches n, the time instants τ_i at which the switches occur, and the modes of
operation M_i selected at each switch i ∈ {0, 1, …, n}. For example, the manager's decision in the
case study accident scenario is γ_acc = {(M_i, τ_i)}_{i=0}^{2} = {(N, 0), (S, t_3), (D, t_5)}, where
M_0 = N, M_1 = S, M_2 = D, and τ_0 = 0, τ_1 = t_3, τ_2 = t_5.
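The membership condition defining the supremal decision set in Eqn. (9) can be checked mechanically. A sketch in Python (in place of the paper's MATLAB); the horizon value and the requirement of increasing switching times are illustrative assumptions:

```python
T_FINAL = 20.0           # illustrative horizon t_f, in minutes
MODES = {"N", "S", "D"}  # Normal, Safe, Dead-stop

def is_supremal_decision(decision, t_final=T_FINAL):
    """Check membership in the supremal decision set of Eqn. (9): a finite
    sequence of (mode, switching-time) pairs with modes in {N, S, D} and
    switching times in [0, t_final], here assumed strictly increasing."""
    if not decision:
        return False
    modes, times = zip(*decision)
    return (all(m in MODES for m in modes)
            and all(0.0 <= tau <= t_final for tau in times)
            and all(a < b for a, b in zip(times, times[1:])))

# The manager's decision in the accident scenario: start in Normal,
# switch to Safe at t3 = 13.525 min, then to Dead-stop at t5 = 16.5 min.
gamma_acc = [("N", 0.0), ("S", 13.525), ("D", 16.5)]
print(is_supremal_decision(gamma_acc))  # True
```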
To characterize the infimal decision set, note that the operator has access to four variables
by which s/he can control the reactor: the inlet flow rate set-point s, the outlet flow rate set-point
v_0, the opening/closing of the emergency vent (previously denoted by e), and the time instants
of notifying the manager about adverse safety events (“standard” as well as operator-triggered
events), which we denote by {ν_j}_{j=0}^{m}, where m ∈ ℕ, and for each j ∈ {0, 1, …, m},
ν_j ∈ [0, t_f] and ν_{j+1} > ν_j. Recall from Section 3.2 that the inlet and outlet flow rate
set-points are decided by the manager's decision regarding the mode of operation. This
observation conforms with the framework in Section 2, where it was noted that each infimal
decision problem depends on the coordinating input chosen by the supremal unit. Here, this
dependence is in the form of equality constraints for the inlet and outlet flow rate set-points.
Specifically, the supremal decision {(M_i, τ_i)}_{i=0}^{n} automatically determines s and v_0 as:

s(t) = s_{M_0} for t ∈ [0, τ_1), …, s_{M_n} for t ∈ [τ_n, t_f];
v_0(t) = v_{M_0} for t ∈ [0, τ_1), …, v_{M_n} for t ∈ [τ_n, t_f].   (10)
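The equality constraints (10) amount to a piecewise-constant lookup: the mode in force at time t fixes both set-points. A sketch in Python; the per-mode numeric values below are illustrative placeholders, not taken from the paper:

```python
import bisect

# Per-mode (inlet, outlet) flow-rate set-points (s_M, v_M). The numeric
# values are illustrative placeholders, not the paper's parameters.
SETPOINTS = {"N": (125.0, 200.0), "S": (75.0, 300.0), "D": (0.0, 25.0)}

def setpoints_at(t, decision):
    """Implement the equality constraints of Eqn. (10): the supremal
    decision {(M_i, tau_i)} fixes s(t) and v_0(t) piecewise-constantly,
    according to the most recent mode switch at or before time t."""
    modes = [m for m, _ in decision]
    times = [tau for _, tau in decision]
    i = bisect.bisect_right(times, t) - 1  # last switch at or before t
    return SETPOINTS[modes[i]]

gamma_acc = [("N", 0.0), ("S", 13.525), ("D", 16.5)]
print(setpoints_at(10.0, gamma_acc))  # Normal-mode set-points
print(setpoints_at(17.0, gamma_acc))  # Dead-stop set-points
```

This is exactly the sense in which the manager's coordinating input constrains the infimal unit: the operator cannot choose s and v_0 freely, only the remaining "safety levers."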
Therefore, the infimal decision set is informally described as the combination of (1) all
possible ways of switching the emergency vent between the “open” and “closed” states during
the time interval [0, t_f], (2) all possible ways of selecting the adverse safety event notification
times, and (3) all possible ways of selecting the inlet and outlet flow rate set-points. Precisely, the
infimal decision set is

Γ*_I := { (e, s, v_0, {ν_j}_{j=0}^{m}) : e ∈ Γ*_{I,E}, s ∈ Γ*_{I,I}, v_0 ∈ Γ*_{I,O}, m ∈ ℕ, and ν_j ∈ [0, t_f] for each j ∈ {0, 1, …, m} },   (11)
where Γ*_{I,E} is the set of all piecewise constant functions taking values in {0, 1}, Γ*_{I,I} is the set of
all piecewise constant functions taking values in {0, s_S, s_N}, and Γ*_{I,O} is the set of all piecewise
constant functions taking values in {v_D, v_S, v_N}. Any infimal decision x must be an element of
Γ*_I. For example, the operator's actions in the case study accident scenario are concisely
denoted by the infimal decision x_acc = (e, s, v_0, {ν_0, ν_1}), where ν_0 = t_3, ν_1 = t_5, and the
functions e ∈ Γ*_{I,E}, s ∈ Γ*_{I,I}, v_0 ∈ Γ*_{I,O} are given by the expressions
e(t) = 0 for t < t_2, and e(t) = 1 for t ≥ t_2;
s(t) = s_N for t ∈ [0, t_3), s_S for t ∈ [t_3, t_5), and 0 for t ∈ [t_5, t_f];
v_0(t) = v_N for t ∈ [0, t_3), v_S for t ∈ [t_3, t_5), and v_D for t ∈ [t_5, t_f].   (12)
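The operator's decision (12) can be encoded the same way: each component is a piecewise-constant function of time. A sketch in Python, using the event times of Table 3 and assuming, per the scenario description, that the notifications coincide with the safety-event occurrences:

```python
def emergency_vent(t, t_open=12.525):
    """e(t) of the operator's decision: 0 before the operator opens the
    emergency vent at t2 = 12.525 min, 1 thereafter (the scenario notes
    the vent is never closed again)."""
    return 0 if t < t_open else 1

# Safety-event notification times fed back to the manager (assumed to
# coincide with the occurrences of SE1 at t3 and SE2 at t5).
notifications = [13.525, 16.5]

print([emergency_vent(t) for t in (5.0, 12.525, 18.0)])  # [0, 1, 1]
```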
We define the output of the infimal decision unit, Π(x), as Π(x)(t) = (e(t), s(t), v_0(t)). The
sequence {ν_j}_{j=0}^{m} of safety event notification times constitutes the feedback given by the
infimal unit to the supremal unit (see Figure 1). Table 4 summarizes the preceding identification
of the chemical reactor case study with the elements of the systems-theoretic framework
discussed in Section 2.
Table 4: Identification of the chemical reactor case study with elements of the framework in Section 2.
Entity | Description
Overall problem Γ | Ensure p(t) ≤ p_s for all t ∈ [0, t_f].
Supremal decision set Γ*_S | All possible sequences of modes (see Eqn. (9)).
Supremal decision γ | A particular sequence of modes, with switching times.
Supremal problem Γ_S | Higher-level objectives and constraints, including the objective of minimizing deviation from the nominal operating condition.
Infimal decision set Γ*_I | All possible ways of operating the emergency vent, combined with all possible ways of choosing adverse safety event notification times (see Eqn. (11)).
Infimal decision x | A particular way (time history) of operating the emergency vent, combined with a particular sequence of notification times.
Infimal problem Γ_I | Minimize deviation from the nominal operating condition (see Eqn. (8)).
Output of infimal unit | Π(x), with Π(x)(t) = (e(t), s(t), v_0(t)).
Feedback from infimal unit to supremal unit | Sequence of safety event notification times {ν_j}_{j=0}^{m}.
Next, we examine the supremal and infimal decisions in the case study accident scenario.
We assume that the supremal decision unit, i.e., the manager, makes its decision according to a
policy that recommends the next mode based on the current situation. An example of such a
policy is the following:

M_{i+1} = S, if M_i = N, p(τ_{i+1}) ≤ p_b, and SE_1 occurs;
M_{i+1} = D, if M_i = S and SE_2 occurs;
where τ_{i+1} := min{ ν_j : ν_j > τ_i, j ∈ {0, 1, …, m} }.   (13)
Such a policy not only incorporates feedback from the infimal unit, but also reacts to the
current situation, rather than implementing a preprogrammed plan of action. We may assume
that the policy adopted by the manager also incorporates objectives and constraints other than the
“minimize deviation from the nominal operating condition” objective, and that any particular
decision resulting from this policy solves the supremal decision problem Γ_S. In particular, the
supremal decision in the case study accident scenario, γ_acc = {(M_i, τ_i)}_{i=0}^{2} =
{(N, 0), (S, t_3), (D, t_5)}, solves Γ_S, and the proposition P(γ_acc, Γ_S) is true (recall from
Section 2.1 that the truth of the predicate P(·, ·) means that the first argument solves the
decision problem in the second argument of the predicate).
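The manager's policy described above is, in effect, a small state machine over modes, driven by the infimal unit's notifications. A hedged Python sketch of one such policy; the threshold value and the guard structure are illustrative, not the paper's exact specification:

```python
def next_mode(current_mode, event, p=None, p_b=250.0):
    """A feedback policy in the spirit of the manager's policy: escalate
    from Normal to Safe on SE1 (while the pressure is still below the
    threshold p_b), and from Safe to Dead-stop on SE2. The p_b value
    here is an illustrative placeholder."""
    if current_mode == "N" and event == "SE1" and (p is None or p <= p_b):
        return "S"
    if current_mode == "S" and event == "SE2":
        return "D"
    return current_mode  # otherwise, hold the current mode

# The accident scenario's escalation sequence:
mode = "N"
mode = next_mode(mode, "SE1", p=240.0)  # Normal -> Safe
mode = next_mode(mode, "SE2")           # Safe -> Dead-stop
print(mode)  # D
```

Note that the policy reacts only to notifications it receives; this is precisely why the operator's failure to trigger an early safety event leaves the manager acting on stale assumptions.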
To determine whether the infimal decision x_acc = (e, s, v_0, {ν_0, ν_1}), where e, s, and v_0
are given by (12), solves the infimal decision problem, we must answer the question “Did the
operator take the best possible actions to minimize the deviation of the reactor state from the
nominal operating condition?” We reexamine the accident scenario and note that the operator
opened the emergency vent soon after4 s/he noticed the rise in p and q and the saturation of the
automatic control inputs r_s and v̂, and did not close this vent again. Assuming no malicious
intent from the operator, and noting that the initial reduction in pressure immediately after
opening the emergency vent may have corroborated his/her belief that there was no need to
trigger an “early” adverse safety event, we conclude that the operator's decision x_acc was in
fact the best possible decision that s/he could make, and that the proposition P(x_acc, Γ_I) is true.
The fact that a catastrophic failure occurred – that is, the constraint p(t) ≤ p_s was not
satisfied during the operation of the reactor – implies that the overall decision problem Γ was
not solved. Therefore, the proposition P(Π(x_acc), Γ) is false. Now consider the truth of the
statement

[ P(γ_acc, Γ_S) and P(x_acc, Γ_I) ] ⟹ P(Π(x_acc), Γ).   (14)

Because we know the truth values of each of these propositions in (14), we may rewrite it as:

[ True and True ] ⟹ False.   (15)

Comparing the expression in (15) with the truth table (Huth & Ryan, 2004) for implication (⟹),
we find that the expression in (15) is false. Therefore, we have found one pair of supremal and
infimal decisions, namely, the pair (γ_acc, x_acc), for which the statement
“P(γ, Γ_S) and P(x, Γ_I) implies P(Π(x), Γ)” is false. In other words, we have shown that the
consistency postulate given in Proposition (2) is violated in the hierarchical control
structure of the chemical reactor. In what follows, we discuss the ramifications of this
observation.
4 The delay between the saturation of the automatic controller inputs and the operator opening the emergency
vent, seen in Figure 8, may be attributed to the operator deliberating about his/her actions. In any case, it is easy to
show that this accident would have occurred even if the operator had opened the emergency vent immediately after
the saturation of the automatic controller inputs.
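The propositional argument in (14)-(15) is easy to mechanize. A minimal Python sketch of the truth-table check, with the variable names standing in for the propositions established in the analysis:

```python
def implies(a, b):
    """Material implication: 'a implies b' is false only when a is true
    and b is false (the last row of the implication truth table)."""
    return (not a) or b

# Truth values established in the analysis:
P_gamma = True     # the manager's decision solves the supremal problem
P_x = True         # the operator's decision solves the infimal problem
P_overall = False  # the overall safety problem was NOT solved

# Statement (14): consistency requires this implication to hold.
print(implies(P_gamma and P_x, P_overall))  # False
```

The printed `False` is the formal content of the consistency violation: both levels solved their own problems, yet the overall problem went unsolved.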
4.3 Discussion
4.3.1 Identification of a Fundamental Failure Mode
Firstly, we claim that the diagnosis of “violation of the consistency postulate” (briefly, “lack
of consistency”) can be considered a fundamental failure mode of the system. For the chemical
reactor case study, the lack of consistency may be considered a “root cause” of the accident. The
individual elements of the manager-operator hierarchy operated “correctly”, yet they were
together unable to prevent the accident. The diagnosis of lack of consistency provides a rigorous
and unambiguous answer to the question of “what was wrong” about the interaction between the
manager and operator that caused this accident. Furthermore, due to its theoretical and domain-
independent formulation, the lack of consistency (and also, as we showed in (Cowlagi & Saleh,
2013), the lack of coordinability) provides a powerful vocabulary to express the results of post-
mortem accident analyses. Consequently, abstract similarities in accidents from widely different
technological domains can be identified from epidemiological studies, and the lessons learnt (on,
say, how to prevent the lack of consistency) from accident analysis in one technological domain
can be transferred to another. As we discuss next, the diagnosis of lack of consistency can also
serve as a starting point for learning lessons from post-mortem accident analyses. Before this
discussion, we comment on the following potential counter-arguments – and corresponding
rebuttals – to the claim that the lack of consistency was a “root cause” of the accident:
“The malfunctioning of the outlet pump was the root cause.” – The malfunctioning of the
outlet pump was an initiating event, but by itself was not sufficient to cause catastrophic
failure. Conversely, a different event, such as a malfunction in one of the gas vents, could
have initiated a similar accident scenario.
“The organizational focus on profit over safety (evident in operator training) was the root
cause.” – The operator mistook the initial reduction in pressure after opening the emergency
vent as an indication of relative normalcy of operation. Whereas the operator’s training may
have contributed to his/her inaction in triggering an early adverse safety event, there is no
reason to believe that s/he would have done so otherwise, especially given his/her
compromised situational awareness.
“The ‘standard’ safety events (SE_1 and SE_2) were incorrectly designed, and the flaws in
their design were the root cause.” – The safety events were designed to withstand worst-case
disturbances. It may be argued that if the process disturbance caused by the malfunctioning
outlet pump was lower than the worst-case value, that is, if v_N − v_m ≤ w_q, then the “usual”
safety events would have sufficed to stop the accident trajectory from reaching the point of
catastrophic failure. However, the worst-case disturbances cannot be chosen arbitrarily high, for
otherwise the nominal operating condition itself would violate the safety event constraints.
For “sufficiently high” values of the worst-case disturbances, there will be sufficiently small
values of v_m that will initiate similar accidents.
Each of these candidates for “root causes” is clearly a contributing factor to this particular
accident scenario. However, unlike the lack of consistency, these factors are specific to the
particular system and accident scenario considered in this case study, and they do not provide
easily transferable lessons that may be learnt to improve systems in other technological domains.
Whereas these contributing factors need to be recognized, there is also the need for identifying
fundamental failure modes that manifested themselves through these (and, possibly, other)
factors.
4.3.2 Further Analysis
The identification of lack of consistency in the manager-operator hierarchy is a starting point
for deeper analysis of the accident, which may be guided by questions such as: Why did the lack
of consistency occur? How can it be identified and prevented? We pursue the resolution of these
questions for the chemical reactor case study with a chain of questions and answers that examine
the accident in progressively greater detail.
By definition, the lack of consistency occurred because (at least) for the pair of supremal
and infimal decisions (γ_acc, x_acc), the overall decision problem Γ was not solved.
Why was Γ not solved? As the manager and operator both made “correct” decisions (i.e.,
the problems Γ_S and Γ_I were both solved), either the manager or the operator, or both, must
have made assumptions that led Γ to remain unsolved.
What were these assumptions? One such assumption made by the manager is that his/her
responses to adverse safety events (whether “standard” or operator-triggered) were barriers that
prevented the reactor from entering a hazardous regime of operation. For example, SE_1 was
designed such that the manager's response (including the communication delay τ_c) to switch to
the Safe mode would prevent the pressure from breaching the threshold p_b, where the rate of
production of C_3 and C_4 would significantly increase. However, SE_1 was designed assuming
an outlet flow rate of v_N, whereas the actual outlet flow rate was v_m < v_N. Therefore, the
pressure had already breached p_b by the time the reactor was switched to the Safe mode.
How could this situation have been prevented? One possible method would have been to
make the operator aware that SE_1 was likely to fail in its role as a barrier to the hazard of
exceeding the pressure threshold p_b. For example, if the operator had been provided guidelines
for triggering early adverse safety events based on the observed or estimated rates of change of
p and q, rather than on the values of p and q themselves, then the operator would have noted
that (1) the rate of reduction in pressure immediately after opening the emergency vent was
anomalous, and (2) the rate of increase of the quantity of C_3 in the reaction vessel was
anomalous. The observation of these two process anomalies and of the saturation of the
automatic control inputs may have provided the operator sufficient evidence to trigger an early
safety event.
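The rate-based guideline suggested above can be made concrete with finite-difference rate estimates. A Python sketch; the sample data, expected rate, and tolerance are illustrative assumptions, not values from the paper:

```python
def rate_anomalies(times, values, expected_rate, tolerance):
    """Flag sampling intervals where the finite-difference rate of change
    of a measured variable departs from its expected value by more than
    a tolerance. Thresholds here are illustrative, not from the paper."""
    flags = []
    for i in range(1, len(times)):
        rate = (values[i] - values[i - 1]) / (times[i] - times[i - 1])
        flags.append(abs(rate - expected_rate) > tolerance)
    return flags

# Hypothetical pressure samples after the emergency vent is opened:
# p falls, but far more slowly than the expected venting rate.
t = [12.5, 13.0, 13.5, 14.0]          # min
p = [260.0, 255.0, 251.0, 248.0]      # kPa
print(rate_anomalies(t, p, expected_rate=-30.0, tolerance=10.0))
# [True, True, True]
```

The point of the sketch is that a slowly falling pressure, which reassured the operator in the accident scenario, is itself flagged as anomalous once the guideline is stated in terms of rates rather than values.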
As previously mentioned, we developed a MATLAB®-based simulation of this case study
to aid its understanding. This simulation is freely available at the following URL:
http://users.wpi.edu/~rvcowlagi/software.html. The accident scenario and various prevention
scenarios for this accident have been pre-programmed. The reader is invited to download this
simulation and to not only examine these scenarios, but also to modify the parameters to create
his/her own scenarios to further pursue this research.
4.3.3 Broader Lessons Learned
Beyond the particulars of the chemical reactor case study, we point out a series of
characteristics of the preceding analysis that are more generally applicable in systems with
similar hierarchical multilevel structures.
Firstly, we note that the chain of reasoning – recall that the identification of lack of
consistency led to the identification of some of the flaws in the system’s control structure – can
be further developed into a set of domain-independent “procedural templates” for accident
analysis. Post-mortem accident reports expressed using these templates may be used to identify
common flaws and to develop common accident prevention principles.
Secondly, we propose that ensuring the truth values of the coordinability and consistency
propositions (1) and (2) provides the basis for developing a “safety specification” during system
design. For example, safety specifications have often been applied in software development
(Alagar & Periyasamy, 1998). These specifications are expressed in the form of logical
propositions, whose truth must be ensured in all possible ways of execution of the software.
There are several algorithms, such as model checking (Alagar & Periyasamy, 1998) that aid the
design and analysis of software systems to ensure the satisfaction of these safety specifications.
Similar safety specifications for large-scale systems, where software and physical systems
interact with each other, can be developed using the C&C propositions. Note that such safety
specifications will be concise and abstract, so as to avoid exhaustively listing all possible
undesirable scenarios (an exercise that can be time-consuming and not always feasible during the
systems design phase).
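A C&C-based safety specification of the kind described here can be stated as a predicate over execution traces and checked mechanically, much as model checkers do for software. A minimal Python sketch over a sampled pressure trajectory; the threshold comes from the overall problem statement, while the trace values are hypothetical:

```python
P_S = 1000.0  # self-ignition pressure threshold p_s, kPa

def satisfies_safety_spec(pressure_trace, p_max=P_S):
    """The overall safety specification stated as a predicate: p(t) <= p_s
    at every sampled instant (a discrete analog of the constraint that
    must hold over the whole interval of operation)."""
    return all(p <= p_max for p in pressure_trace)

# A hypothetical nominal run versus a runaway pressure rise:
nominal = [100.0, 105.0, 98.0, 102.0]
runaway = [100.0, 250.0, 600.0, 1050.0]
print(satisfies_safety_spec(nominal), satisfies_safety_spec(runaway))
# True False
```

The abstraction matters: the specification names no particular accident scenario, so it does not require exhaustively enumerating undesirable trajectories in advance.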
Thirdly, the problem of satisfying these C&C-based safety specifications, namely, the
problem of ensuring the truth values of propositions (1) and (2), leads to a series of questions that
must be resolved, and the resolutions to which will lead to important design guidelines for
system safety. For example, we encounter the question “What quantities must be measured or
estimated, to ensure satisfaction of the propositions (1) and (2)?” Note that the control-theoretic
notion of observability may not suffice here. For example, in the chemical reactor case study, we
concluded that the measurement or estimation of the rates of change of p and q was necessary in
addition to the measurement of the variables p and q themselves (the measurement of p and q
alone means that the system is observable). Further crucial questions related to the satisfaction of C&C-based
safety specifications, which we did not address in the chemical reactor case study, include:
How can we ensure that a hierarchical multilevel system remains coordinable and consistent
under changing circumstances?
Which entity or entities should be responsible for ensuring coordinability and consistency at
all times?
Finally, we point out potential avenues for integrating system safety analysis based on the
C&C principles with some of the traditional and well-established approaches to safety. In the
chemical reactor case study, we studied safety barriers implemented in the form of adverse safety
events and the associated control policies of the manager, and the relation between these barriers
and consistency in the hierarchical control structure. More generally, the study of relations
between the C&C principles and the defense-in-depth principle is a potential area for future
research that can help establish strong theoretical guidelines for establishing and implementing
multiple safety barriers for accident prevention. Similarly, the study of relations between the
C&C principles and the HRO paradigm may provide formal, mathematically rigorous meanings to
the qualitative HRO characteristics. For example, one HRO characteristic is the “shifting from
centralized authority during routine operations to local/decentralized authority for hazardous
operations” (Saleh et al., 2011). In the chemical reactor case study, this characteristic closely
relates to the approach of ensuring consistency by providing the operator additional authority to
trigger safety events early. As a final comment on the need to further investigate potential
relations between the C&C principles and established approaches to system safety, we note that
the fundamental failure modes identified by the C&C principles may serve as the starting point
of exhaustive analyses to identify different accident trajectories for QRA and PRA-based system
design. These quantitative design methods strongly depend on such exhaustive analyses:
“brainstorming for possible accident scenarios is an essential step [in PRA] for understanding
how a system might fail, and in turn, for making risk-informed design and operational choices to
support accident prevention” (Saleh et al., 2011). The failure modes identified by the C&C
principles can add a mathematically rigorous element to this brainstorming process.
5 Conclusions
In this work, we developed a case study to illustrate details of the application of C&C
principles for accident analysis and system safety. This case study was chosen to represent a
broad class of multi-level systems where operational policies and procedures of individual
stakeholders interact with physical processes to spawn unanticipated system behaviors – and,
consequently, new failure modes. We argued that the lack of coordinability and/or consistency
constitutes one such fundamental failure mode in hierarchical multilevel systems. We proposed,
on the one hand, that the recognition of this failure mode in post-mortem accident analysis can
help to conceptually relate the lessons learnt from post-mortem analyses of accidents in
different technological domains. On the other hand, we proposed that efforts directed at
preventing and mitigating the manifestations of these fundamental failure modes can directly
lead to a series of questions related to the safety-oriented design, operation, and control of a
hierarchical multilevel system. Crucially, we emphasized that accident prevention and system
safety based on the C&C principles is not an isolated idea, and we provided specific avenues to
allow the C&C principles to inform and embellish well-established ideas in system safety, such
as the HRO paradigm, QRA/PRA, and the defense-in-depth principle.
References
Alagar, V. S., & Periyasamy, K. (1998). Specification of Software Systems. New York, NY: Springer-Verlag.
Anderson, B. D. O., & Moore, J. B. (1990). Optimal Control: Linear Quadratic Methods. Englewood Cliffs, NJ:
Prentice-Hall.
Apostolakis, G. E. (2004). How useful is quantitative risk assessment? Risk Analysis, 24(3), 515-520. doi:
10.1111/j.0272-4332.2004.00455.x
Asare, P., Broman, D., Lee, E. A., Torngren, M., & Sunder, S. S. (2012). Cyber-physical Systems - A Concept Map.
Retrieved from http://cyberphysicalsystems.org/
Bakolas, E., & Saleh, J. H. (2011). Augmenting Defense-in-depth with the Concepts of Observability and
Diagnosability from Control Theory and Discrete Event Systems. Reliability Engineering and System
Safety, 96, 184-193.
Bertalanffy, L. (1969). General System Theory: Foundations, Development, Applications. New York: George
Braziller.
Cowlagi, R. V., & Saleh, J. H. (2013). Coordinability and Consistency in Accident Causation and Prevention:
Formal System-Theoretic Concepts for Safety in Multilevel Systems. Risk Analysis, 33(3), 420-433.
Favaro, F. M., & Saleh, J. H. (2014). Observability-in-Depth: an Essential Complement to the Defense-in-Depth
Safety Strategy in the Nuclear Industry. Nuclear Engineering & Technology.
Hollnagel, E. (2004). Barriers and Accident Prevention. Burlington, VT: Ashgate.
Hopkins, A. (2001). Was Three Mile Island a 'Normal Accident'? Journal of Contingencies and Crisis Management,
9(2), 65-72.
Huth, M., & Ryan, M. (2004). Logic in Computer Science. Cambridge, UK: Cambridge University Press.
Khan, F. I., & Abbasi, S. A. (1999). Major accidents in process industries and an analysis of causes and
consequences. Journal of Loss Prevention in the Process Industries, 12(5), 361-378. doi:
10.1016/S0950-4230(98)00062-X
Khan, F. I., & Amyotte, P. R. (2003). How to make inherent safety practice a reality. Canadian Journal of Chemical
Engineering, 81(1), 2-16.
Khan, F. I., & Amyotte, P. R. (2004). Integrated inherent safety index (I2SI): A tool for inherent safety evaluation.
Process Safety Progress, 23(2), 136-148. doi: 10.1002/prs.10015
Kleindorfer, P. R., Belke, J. C., Elliott, M. R., Lee, K., Lowe, R. A., & Feldman, H. I. (2003). Accident
epidemiology and the US chemical industry: Accident history and worst-case data from RMP*Info. Risk
Analysis, 23(5), 865-881. doi: 10.1111/1539-6924.00365
Kleindorfer, P. R., Makris, J. L., & Conomos, M. G. (1999). Accident epidemiology: Understanding the major
drivers of major accident risk. Epidemiology, 10(4), S115-S115.
Leveson, N. G. (2004a). A new accident model for engineering safer systems. Safety Science, 42(4), 237-270. doi:
10.1016/S0925-7535(03)00047-X
Leveson, N. G. (2004b). Role of Software in Spacecraft Accidents. Journal of Spacecraft and Rockets, 41(4),
564-575.
Mesarovic, M. D., Macko, D., & Takahara, Y. (1970). Theory of Hierarchical, Multilevel Systems. New York:
Academic Press.
Mogford, J. (2005). Fatal Accident Investigation Report: Isomerization Unit Explosion Final Report. Texas City,
TX.
NRC. (2000). Causes and significance of design basis issues at US nuclear power plants. Washington, DC: US
Nuclear Regulatory Commission, Office of Nuclear Regulatory Research.
Pate-Cornell, M. E. (1993). Learning from the Piper Alpha Accident: A Postmortem Analysis of Technical and
Organizational Factors. Risk Analysis, 13(2), 215-232.
Perrow, C. (1999). Normal Accidents: Living with High-Risk Technologies. Princeton, NJ: Princeton University
Press.
Rasmussen, J. (1997). Risk management in a dynamic society: A modelling problem. Safety Science, 27(2-3),
183-213. doi: 10.1016/S0925-7535(97)00052-0
Rasmussen, N. (1975). Reactor safety study: an assessment of accident risks in US nuclear power plants.
Washington, DC: US Nuclear Regulatory Commission.
Reason, J. T. (1997). Managing the Risks of Organizational Accidents. Vermont: Ashgate.
Rochlin, G. I., La Porte, T. R., & Roberts, K. H. (1987). The Self-designing High Reliability Organization. Naval
War College Review, 40(4), 76-90.
Saleh, J. H., Marais, K., Bakolas, E., & Cowlagi, R. V. (2011). Highlights from the literature on accident causation
and system safety: Review of major ideas, recent contributions, and challenges. Reliability Engineering and
System Safety, 95(11), 1105-1116.
Saleh, J. H., Marais, K. B., & Favaro, F. M. (2014). System Safety Principles: A Multidisciplinary Engineering
Perspective. Journal of Loss Prevention in the Process Industries, 29, 283-294.
Sorensen, J. N., Apostolakis, G. E., Kress, T. S., & Powers, D. A. (1999, Aug 22-26). On the Role of Defense-in-
Depth in Risk-informed Regulation. Paper presented at the PSA '99 International Topical Meeting on
Probabilistic Safety Assessment, Washington, DC.
Sorensen, J. N., Apostolakis, G. E., & Powers, D. A. (2000). On the role of safety culture in risk-informed
regulation. PSAM 5: Probabilistic Safety Assessment and Management, Vols. 1-4(34), 2205-2210.
Turner, B. A. (1978). Man-made Disasters. London: Wykeham Publications.
Weick, K. E., & Sutcliffe, K. M. (2007). Managing the Unexpected: Resilient Performance in an Age of Uncertainty
(2nd ed.). San Francisco, CA: Jossey-Bass.
Weinberg, G. M. (1975). An Introduction to General Systems Thinking. Dorset House.
Zhong, H., & Wonham, W. M. (1990). On the Consistency of Hierarchical Supervision in Discrete-Event Systems.
IEEE Transactions on Automatic Control, 35(10), 1125-1134.