Coordinability and Consistency: Application of Systems
Theory to Accident Causation and Prevention
Raghvendra V. Cowlagi1
Worcester Polytechnic Institute, Worcester, MA, USA.
Joseph H. Saleh2
Georgia Institute of Technology, Atlanta, GA, USA.
Abstract – Recent works in the safety literature report several fruitful attempts to introduce
mathematically rigorous results from systems and control theory to bear upon accident prevention and
system safety. Previously, we discussed the implications on safety of the systems theoretic principles of
coordinability and consistency, and we identified the lack of coordinability and/or consistency as
fundamental failure modes in hierarchical multilevel systems. In this work, we further develop system
safety analysis techniques based on these principles. We demonstrate that these principles not only
provide a domain-independent vocabulary for expressing the results of post-mortem accident analyses,
but they can also be applied to guide design and operational choices for accident prevention and system
safety. We develop these ideas with the help of an illustrative case study. This case study represents a
broad class of systems where operational policies and procedures of individual stakeholders in the system
interact with physical processes such that new system behaviors emerge, and unanticipated safety issues
arise. We argue, and illustrate our arguments using this case study, that the coordinability and consistency
principles can be developed to deliver a threefold impact on accident analysis and prevention: firstly,
these principles provide domain-independent procedural templates and vocabulary for post-mortem
accident analysis. Secondly, these principles provide theoretical safety specifications to be met during
system design and operation. Finally, these safety specifications can precipitate the formulation of a series
of questions directly related to safety-oriented choices in the design, operation, and control of systems.
Keywords: coordinability; consistency; chemical reactor; accident prevention; system safety.
1 Assistant Professor, Aerospace Engineering Program, Department of Mechanical Engineering. 2 Associate Professor, Guggenheim School of Aerospace Engineering.
1 Introduction
Accident prevention and system safety are influenced not only by the reliability and failure
behavior of various subsystems and components, but also by the nature of interactions between
these components, as well as their interactions with external factors or environmental conditions.
For example, large-scale systems such as nuclear power plants, air traffic control systems, and
offshore oil platforms exhibit closely interacting technical, managerial, regulatory, and social
components. Within the realm of technical systems, emerging cyber-physical systems such as
intelligent transportation systems and mobile robots exhibit close interactions between
components of fundamentally different nature: namely, computational and physical components
(Asare, Broman, Lee, Torngren, & Sunder, 2012). In the safety literature, the terms man-made
disasters (Turner, 1978), organizational accidents (Reason, 1997), and system accidents
(Perrow, 1999) have been used to describe adverse events arising due not only to isolated failures
in human and technical elements of large systems, but also due to their flawed interactions.
These interactions are not properly understood and, when examined, it is often on an ad-hoc
basis and without an underlying formal and theoretical foundation. Such a theoretical foundation
is nevertheless essential to the study of domain-independent principles of accident prevention
and system safety, and the identification of such principles for hierarchical multilevel systems is
a crucial area of ongoing research. In this work, we contribute to this research by further
developing a previously introduced formal framework (Cowlagi & Saleh, 2013) for accident
analysis and system safety. To this end, we briefly review the relevant literature and accordingly
contextualize the proposed work. The interested reader is referred to (Saleh, Marais, Bakolas, &
Cowlagi, 2011) for a thorough review and critical appraisal of the major ideas in accident
prevention and system safety.
The literature reports on qualitative ideas and quantitative methods to guide design,
operational, and organizational choices for accident prevention and system safety. Notably, on the
one hand, the High Reliability Organization (HRO) paradigm (Rochlin, La Porte, & Roberts,
1987; Weick & Sutcliffe, 2007) presents a qualitative description of the salient managerial and
organizational features of entities that maintain high safety standards and low accident
occurrence rates. On the other hand, the method of quantitative risk assessment (QRA)
(Apostolakis, 2004) – first introduced as Probabilistic Risk Assessment (PRA) for nuclear power
plants (N. Rasmussen, 1975) – provides quantitative bases for making risk-informed design and
operational decisions related to system safety. QRA and PRA involve technical details of the
system configuration and operation, and develop an exhaustive list of possible accident
scenarios, along with their potential consequences, and the likelihood of their occurrences.
Excellent examples of such analyses for informing risk assessment in the chemical process
industries include (Khan & Abbasi, 1999; Kleindorfer et al., 2003; Kleindorfer, Makris, &
Conomos, 1999). The recent literature exhibits a thrust to support ideas and methods such as
HRO and QRA/PRA with domain-independent design principles to inform technical, managerial,
and organizational design choices for system safety (Saleh, Marais, & Favaro, 2014). Most
notably, the defense-in-depth safety principle (NRC, 2000; Sorensen, Apostolakis, Kress, &
Powers, 1999; Sorensen, Apostolakis, & Powers, 2000) emphasizes the implementation of
multiple and diverse “barriers” (Hollnagel, 2004) for interrupting potential accident sequences at
various stages. The purpose of these “barriers” is to prevent accident sequences from initiating,
and/or to prevent them from escalating, and/or to mitigate their eventual consequences. The
inherent safety principle (Khan & Amyotte, 2003, 2004) complements defense-in-depth by
providing guidelines for choosing in the early design stages the types and locations of safety
barriers.
These perspectives on accident prevention and system safety have now culminated in the so-
called systems and control theoretic approach to system safety (Saleh et al., 2011), which
pursues two complementary objectives: (1) to encapsulate the preceding perspectives on system
safety originating from diverse technological domains into a single theoretical and
mathematically rigorous framework, and (2) to leverage for accident prevention and system
safety the vast arsenal of analytical and algorithmic tools from systems and control theory. The
connections between control theory and the implementation and enforcement of safety barriers
and safety constraints have been recognized (Leveson, 2004a; J. Rasmussen, 1997), and the role
in system safety of the control theoretic notion of observability has been recently highlighted
(Bakolas & Saleh, 2011; Favaro & Saleh, 2014). The connection between systems theory
(Bertalanffy, 1969; Mesarovic, Macko, & Takahara, 1970; Weinberg, 1975) and system safety is
motivated by the observation that accidents can result “from dysfunctional interactions among
system components” (Leveson, 2004a), and that fundamental failure modes resulting due to such
dysfunctional interactions are ill-understood (Leveson, 2004b). In a recent work (Cowlagi &
Saleh, 2013), we discussed the implications for accident causation and system safety of the
systems theoretic principles of coordinability and consistency, hereafter referred to as C&C.
Specifically, we identified the lack of coordinability and/or consistency as fundamental
failure modes in hierarchical multilevel systems, and we illustrated this claim using relevant
accident case studies.
In this work, we further develop system safety analysis techniques based on the
coordinability and consistency (C&C) principles. Specifically, the novel contributions of this
paper are as follows. Firstly, we demonstrate that the C&C principles provide a theoretical
vocabulary for expressing the results of post-mortem accident analyses, which can assist in
extracting important lessons to be learnt, and in identifying common accident pathogens from
epidemiological studies of accidents in diverse technological domains. Secondly, and more
importantly, we demonstrate the value of C&C-based system safety analysis for making design
and operational choices. In particular, we illustrate the influence of this safety analysis on the
choice of measurement equipment and estimation algorithms for various attributes of the system,
thereby relating the “systems-” and “control-” theoretic facets of the system safety problem.
More generally, we demonstrate that, for system design, the C&C principles can provide
theoretical and general “safety specifications” that are more informative than the tautological
specification of “the system must be safe”, and more concise and domain-independent than
specifications consisting of an exhaustive list of potential scenarios that must be avoided. To aid
the exposition of these ideas, we present details on the application of the C&C principles for
system safety analysis via a detailed illustrative example of a chemical reactor. Although the
case study treated here is from the chemical industry, and the analytical model developed is
specific to our reactor, this case study represents a broad class of multi-level systems where
operational policies and procedures of individual stakeholders in the system interact with
physical processes such that new system behaviors emerge, and unanticipated safety issues arise.
The rest of this paper is organized as follows: In Section 2, we provide a brief discussion of
the C&C principles for the sake of completeness. The reader interested in further details is
referred to (Cowlagi & Saleh, 2013) for a thorough discussion of these principles. In Section 3,
we introduce a model of a chemical process plant, and in Section 4, we illustrate the application
of the C&C principles for safety analysis of this plant. Finally, in Section 5, we provide
conclusions of the proposed analysis, and directions for future research.
2 System Theoretic Framework for Accident Analysis and System
Safety
In this work, we focus on systems involving components interacting over multilevel
hierarchies. Hierarchical multilevel structures are omnipresent in systems both in a purely
technical context (e.g., cyber-physical systems) and in a sociotechnical context. Hierarchical
multilevel structures enable tractable solutions of management and control of systems with ever-
increasing technical and organizational complexity. Specifically, these structures support
functional specialization, modular design, and multiplicity of decision-making units to break
down the overall problem into manageable sub-problems. For simplicity of exposition, we
consider a two-level hierarchy as is common practice (Mesarovic et al., 1970; Zhong &
Wonham, 1990), with the understanding that the proposed developments can be iteratively
applied to multilevel systems by analyzing pairs of components at successive hierarchical levels,
and by aggregating components at multiple levels. In this section, we first introduce the formal
concepts of coordinability and consistency in hierarchical multilevel systems using the
terminology and definitions of (Mesarovic et al., 1970). Then, we summarize the implications of
the C&C principles on system safety, which we discussed in detail in (Cowlagi & Saleh, 2013)
using illustrative examples. With this background information covered, we will be ready for the
analytical model development and the accident analysis using the C&C concepts in Sections 3
and 4.
2.1 Formal Definitions of Coordinability and Consistency
The following description of the C&C principles is a summary of the extended discussion in
(Cowlagi & Saleh, 2013); the reader already familiar with these concepts may skip this subsection.
We represent a system as a mapping between a set of inputs and a set of outputs. In
particular, a decision-making system chooses a decision $x$ to solve a decision problem $\mathcal{D}$. The
output $y$ of a decision-making system is a transformation of the decision, i.e., $y = T(x)$. A
decision problem, in general, involves the selection of a decision that satisfies a pre-specified set
of constraints and may involve the optimization of a cost or reward function. A two-level
hierarchical system, consisting of a single higher-level unit and N lower-level units, and
controlling a certain process, is shown in Figure 1. Following (Mesarovic et al., 1970), we refer
to the higher level decision-making unit as the supremal unit and to the lower level decision-
making units as the infimal units.
Figure 1: Illustration of a two-level hierarchical decision-making system.
Each of the $N$ infimal units contributes to controlling the process: the $i$-th unit provides input
$y_i$ to the process and receives feedback $z_i$ from the process. The supremal decision-making unit
provides a coordination input $\gamma_i$ to the $i$-th infimal unit, and receives feedback from that unit. The
decision problem of each infimal unit depends on the input $\gamma_i$.
The overall decision problem $\mathcal{D}$ is what the supremal and infimal units attempt to jointly
solve by addressing their local decision problems. We denote by $\mathcal{D}_S$ and $\mathcal{D}_{I,i}$, respectively, the
local decision problems of the supremal unit and of the $i$-th infimal unit. The supremal unit cannot
solve by itself the overall decision problem; instead it solves its own decision problem $\mathcal{D}_S$ to
determine the coordination inputs $\gamma_1, \ldots, \gamma_N$, which it then provides to the infimal units.
These coordination inputs define the decision problems of the infimal units, which, in turn,
attempt to solve their decision problems $\mathcal{D}_{I,i}$ and thereby affect the process to solve the overall
decision problem $\mathcal{D}$. To make explicit the dependence of the infimal decision problems on the
coordination inputs, we denote these problems by $\mathcal{D}_{I,i}(\gamma_i)$.
The C&C principles are formulated using predicate logic (Alagar & Periyasamy, 1998; Huth
& Ryan, 2004), the basic elements of which are briefly presented here for the reader’s
convenience: a proposition is a statement that can be either true or false. For example, the
statement “The Earth is flat” is a valid proposition, but the sentence “Will it rain today?” is not.
A predicate indicates either a property of an object or a relationship between different objects. A
predicate acting on a particular object is a proposition. Following (Mesarovic et al., 1970), we
introduce a logical predicate $P(\cdot\,,\cdot)$ defined such that $P(x, \mathcal{D})$ is true if and only if $x$ is a solution
to the decision problem $\mathcal{D}$.
In this framework for hierarchical multilevel systems, we first discuss the notion of
coordinability of the infimal decision problem relative to the supremal decision problem
(Mesarovic et al., 1970), henceforth abbreviated as, simply, coordinability. Conceptually,
coordinability means that there exists a coordinating input such that each of the corresponding
infimal decision problems has at least one solution. Stated differently, coordinability means that
the supremal unit can solve its decision problem to find a coordinating input $\gamma = (\gamma_1, \ldots, \gamma_N)$, and
that there exists a solution for each of the infimal decision problems $\mathcal{D}_{I,i}(\gamma_i)$. Conversely, the
lack of coordinability implies that no solution for the supremal unit decision problem exists for
which all the infimal units can solve their associated problems. Stated differently, lack of
coordinability means that the infimal units have been given coordinating inputs and decision
problems that cannot be solved—more colloquially, they have been inadvertently set up to fail.
Formally, the infimal decision problems $\mathcal{D}_{I,i}$ are said to be coordinable relative to the supremal
decision problem $\mathcal{D}_S$ if and only if the following proposition is true (Mesarovic et al., 1970):

$$\exists \gamma \; \exists x \; \big[\, P\big(x, \mathcal{D}_I(\gamma)\big) \ \text{and}\ P\big(\gamma, \mathcal{D}_S\big) \,\big]. \qquad (1)$$
Proposition (1) reads as follows: there exist $\gamma$ and $x$ such that $\gamma$ solves the supremal
decision problem and $x$ solves the infimal decision problems. In other words, there exists some
solution to the supremal decision problem for which the infimal units can also solve their
problems. Figure 2 illustrates this definition of coordinability.
Figure 2: Illustration of the concept of coordinability relative to the supremal decision problem (Cowlagi &
Saleh, 2013). © 2013 by John Wiley & Sons.
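For finite decision sets, proposition (1) can be checked by direct enumeration. The following Python sketch encodes the predicate $P$ as a boolean function and searches for a coordinating input under which every infimal problem is solvable; the decision sets and predicate definitions are entirely hypothetical toy examples, not taken from this paper:

```python
# Hypothetical finite decision sets and toy predicates (illustrative only).
GAMMAS = ["g1", "g2"]   # candidate coordination inputs for the supremal unit
XS = [0, 1, 2]          # candidate decisions for each infimal unit

def solves_supremal(gamma):
    """Predicate P(gamma, D_S): gamma solves the supremal problem."""
    return gamma in ("g1", "g2")

def solves_infimal(x, gamma, i):
    """Predicate P(x, D_{I,i}(gamma)): x solves the i-th infimal
    problem defined by the coordination input gamma."""
    if gamma == "g1":
        # Toy rule: unit 0 needs x >= 1; unit 1 needs x <= 1.
        return x >= 1 if i == 0 else x <= 1
    return False  # under "g2", every infimal problem is infeasible

def coordinable(n_units=2):
    """Proposition (1): there exists gamma solving D_S such that every
    infimal problem D_{I,i}(gamma) has at least one solution."""
    return any(
        solves_supremal(g)
        and all(any(solves_infimal(x, g, i) for x in XS)
                for i in range(n_units))
        for g in GAMMAS
    )

print(coordinable())  # True: gamma = "g1" works (x = 1 solves both units)
```

Note that if `"g1"` were removed from `GAMMAS`, the system would not be coordinable: under `"g2"` the infimal units are, colloquially, set up to fail.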
Next, we discuss the notion of consistency as proposed in (Mesarovic et al., 1970), which
relates the local decision problems of the supremal and infimal units to the overall decision
problem $\mathcal{D}$. Conceptually, the two-level system of Figure 1 is “consistent” if the overall decision
problem is solved when the supremal unit and each of the infimal units solve their own decision
problems. In other words, consistency in multilevel systems means that by solving their local
decision problems, the infimal and supremal decision units solve the overall problem. Formally,
the infimal and supremal decision problems are said to be consistent if the following proposition
is true:
$$\forall \gamma \; \forall x \; \big[\, P\big(x, \mathcal{D}_I(\gamma)\big) \ \text{and}\ P\big(\gamma, \mathcal{D}_S\big) \Rightarrow P\big(T(x), \mathcal{D}\big) \,\big]. \qquad (2)$$

Proposition (2) is called the consistency postulate (Mesarovic et al., 1970) and it reads as
follows: for all $x$ and all $\gamma$ that solve, respectively, the infimal and supremal decision problems,
the outputs $T(x)$ of the infimal units solve the overall decision problem $\mathcal{D}$. Stated differently, if
a two-level system is consistent, the overall decision problem is solved whenever the supremal
decision unit coordinates the infimal units relative to its own objectives. Figure 3 illustrates this
definition of consistency.
Figure 3: Illustration of the notion of consistency (Cowlagi & Saleh, 2013). © 2013 by John Wiley & Sons.
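The consistency postulate (2) admits the same enumerative check for finite sets. A minimal sketch, again with hypothetical toy problems and a placeholder output transformation `T` of our own choosing:

```python
# Hypothetical finite sets and toy predicates (illustrative only).
GAMMAS = [0, 1]                # candidate coordination inputs
XS = [(0, 0), (0, 1), (1, 1)]  # candidate joint infimal decisions

def P_sup(gamma):     # P(gamma, D_S): gamma solves the supremal problem
    return gamma == 1

def P_inf(x, gamma):  # P(x, D_I(gamma)): x solves the infimal problems
    return sum(x) == gamma

def T(x):             # output transformation applied to infimal decisions
    return max(x)

def P_overall(y):     # P(y, D): output y solves the overall problem
    return y == 1

def consistent():
    """Postulate (2): whenever gamma solves D_S and x solves D_I(gamma),
    the resulting output T(x) must solve the overall problem D."""
    return all(
        P_overall(T(x))
        for gamma in GAMMAS if P_sup(gamma)
        for x in XS if P_inf(x, gamma)
    )

print(consistent())  # True for this toy system
```

The universal quantifiers of (2) become nested `all(...)` filters: only the pairs $(\gamma, x)$ that solve their local problems are tested against the overall problem.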
2.2 Implications on System Safety Analysis
We proposed in (Cowlagi & Saleh, 2013) that the C&C principles are important bases for
accident prevention, and their absences are fundamental failure modes that cause or contribute to
accidents in systems involving hierarchical multilevel structures. The C&C principles identify
fundamental properties that are independent of the system’s technological domain. In (Cowlagi
& Saleh, 2013), we analyzed the Tenerife accident, and we identified the lack of coordinability
between the air traffic control tower at Tenerife airport and the flight crew of one of the two
aircraft involved as a factor contributing to the disaster. We also studied examples of accidents
from domains as diverse as mobile robotics, marine transport, and airport ground operations, to
identify absences of coordinability and/or consistency as contributing factors. We argued that the
identification of lack of coordinability and/or consistency as fundamental failure modes can help
to reformulate in a mathematically rigorous manner some established views on accident
causation and system safety. For example, it has been claimed that “sloppy management” is the
root cause of many industrial accidents (Turner, 1978) and that “sloppy management” is
responsible for inadequate efforts to implement and respond to early warning systems (Hopkins,
2001). We asserted that formal analyses based on the C&C principles can refine or displace the
narrative of “sloppy management”, which is of limited value for making technical, managerial,
or organizational decisions for accident prevention and system safety. Finally, we also asserted
that the C&C principles may help to formalize qualitative HRO characteristics, and in particular,
to examine whether HRO characteristics are aligned with the safety ramifications of the C&C
principles.
To further study the application of C&C principles for accident analysis and system safety in
hierarchical multilevel systems, we develop in what follows an illustrative example of a chemical
reactor and discuss an accident scenario for this reactor. The purpose of this discussion is to
examine (a) failure modes that can arise due to flawed interactions between properly functional
components of a system, and (b) the application of C&C principles to not only identify these
failure modes in post-mortem accident analysis but also to provide guidance for making design
and operational choices for accident prevention and system safety.
3 Model Development and Illustrative Example
In this section, we describe and mathematically model a chemical reactor. Next, we describe
an accident scenario in this chemical reactor, which will serve as a concrete case study to
exemplify the role of the C&C principles in accident analysis and prevention. This case study
represents a broad class of systems where operational policies and procedures of individual
stakeholders in the system interact with physical processes such that new system behaviors
emerge, and unanticipated safety issues arise. This case study involves a particular example of
the general two-level hierarchical control structure discussed in Section 2, where the supremal
and infimal units each solve their local decision problems to contribute towards the solution of
the overall decision problem. We also examine via this case study important details involved in
this framework, such as communication delays, predetermined operational policies, the role of
observation and estimation of process variables, and differences in decision-making and process
time scales. As we discuss later, these issues influence accident causation and prevention in
hierarchical multilevel systems in ways that are not yet well-characterized.
After describing the chemical reactor model and an accident scenario in this Section, we will
turn to its analysis using the C&C principles in Section 4. Thereafter, we will discuss broad
lessons that can be extracted from this analysis and applied more generally to hierarchical
multilevel systems.
3.1 Chemical Reactor Model
Consider the operation of a chemical reactor illustrated in Figure 4. A large enclosed vessel is
provided with an inlet, an outlet, and two gas vents, with one referred to as the “standard” vent,
and the other as the “emergency” vent. Substances $C_1$ and $C_2$ are let into the vessel, where they
chemically react with each other – we will henceforth refer to this chemical reaction as the core
reaction – to yield a useful substance $C_3$ in liquid form, which is the desired product of this
chemical process. The reaction of $C_1$ and $C_2$ also results in a gaseous by-product $C_4$.
A mathematical model of this chemical reactor is given by the following pair of linear
differential equations. Table 1 explains the nomenclature used in these equations.
$$\dot p(t) = q(t) - r_{s0} - \hat r_s(t) - r_e\, e(t) + w_p(t), \qquad (3)$$

$$\dot q(t) = \begin{cases} k_n\, s(t) - v_0 - \hat v(t) + w_q(t), & p_n \le p(t) \le p_b, \\ k_h\, s(t) - v_0 - \hat v(t) + w_q(t), & p(t) > p_b. \end{cases} \qquad (4)$$
By Eqn. (4), the core reaction is pressure-sensitive: whereas the rate of production of $C_3$,
namely $\dot q$, is low (and therefore not of our interest) when the pressure of $C_4$ is below $p_b$, it is
high when the pressure exceeds $p_b$; that is, $k_h > k_n$.
Figure 4: Illustration of the chemical reactor.
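For readers who wish to experiment with the model, Eqns. (3)–(4) can be integrated numerically. The sketch below uses forward-Euler integration; all numerical parameter values are hypothetical placeholders, not taken from this paper:

```python
def reactor_step(p, q, s, v0, vhat, rs0, rshat, e, dt,
                 kn=0.5, kh=2.0, pb=2.0, re=1.0, wp=0.0, wq=0.0):
    """One forward-Euler step of Eqns. (3)-(4).
    p: pressure of C4; q: mass of C3; s: inlet flow rate;
    v0 + vhat: total outlet flow; rs0 + rshat: standard-vent pressure loss;
    re * e: emergency-vent loss (e is 0 or 1); wp, wq: disturbances."""
    k = kh if p > pb else kn               # pressure-sensitive rate, Eqn. (4)
    p_dot = q - rs0 - rshat - re * e + wp  # Eqn. (3)
    q_dot = k * s - v0 - vhat + wq         # Eqn. (4)
    return p + dt * p_dot, q + dt * q_dot

# At a (hypothetical) nominal point with q_n = rs0 and kn * s = v0, both
# derivatives vanish and (p, q) remain at their set-points.
p, q = reactor_step(1.0, 1.5, s=2.0, v0=1.0, vhat=0.0,
                    rs0=1.5, rshat=0.0, e=0, dt=0.1)
print(p, q)  # 1.0 1.5
```

Choosing `q_n = rs0` and `kn * s = v0` makes the nominal point an equilibrium of (3)–(4), which is the situation the automatic controller of Section 3.2 is meant to preserve under small disturbances.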
To maintain the variables $(p, q)$ at their nominal operating values $(p_n, q_n)$, the standard
vent is designed to cause a nominal pressure loss at the rate $r_{s0}$, and the outlet flow rate is
nominally maintained at a value $v_0$. An automatic controller, described in Section 3.2, makes
small adjustments $\hat r_s$ and $\hat v$, respectively, to the standard vent and the outlet to counteract
process disturbances. These adjustments are limited: specifically, $\hat r_s \in [0, \bar r_s]$ and $\hat v \in [-\bar v, \bar v]$,
where the bounds $\bar r_s$ and $\bar v$ are prespecified. The emergency vent is manually operated, as we
explain next. Furthermore, whenever the emergency vent is open, the aforesaid automatic
controller is designed to saturate the adjustments $\hat r_s$ and $\hat v$ to their maximum values.
Instrumentation is installed on the reactor to measure and indicate the pressure $p$ of the gas
$C_4$, and the quantity $q$ of the substance $C_3$ inside the reaction vessel.
The gas $C_4$ is flammable, and self-ignites when compressed beyond a certain threshold
pressure $p_s$. One of the major safety hazards identified for this reactor is an uncontrolled self-
ignition of the gas $C_4$, which constitutes a catastrophic failure of the reactor. To maintain safe
and profitable operation, this chemical reactor is controlled and supervised as described next.
Table 1: Nomenclature used in this case study.

Symbol | Meaning
$p$ | Pressure of the gaseous substance $C_4$ within the reaction vessel
$q$ | Mass of the substance $C_3$ within the reaction vessel
$s$ | Mass flow rate of each of the input substances $C_1$ and $C_2$ at the inlet
$v_0,\ \hat v,\ v$ | Mass flow rates of the substance $C_3$ at the outlet: respectively, setpoint, automatic controller adjustment, and total flow rate
$p_n,\ q_n$ | Nominal operating values of $p$ and $q$
$p_b$ | Threshold pressure beyond which the production rate of $C_3$ significantly increases
$p_d$ | Threshold pressure, the breaching of which is considered a near-miss accident
$p_s$ | Pressure at which the gaseous substance $C_4$ self-ignites
$s_N,\ s_S$ | Preset inlet flow rates in modes $N$ and $S$, respectively
$v_N,\ v_S,\ v_D$ | Preset outlet flow rates in modes $N$, $S$, and $D$, respectively
$v_m$ | True outlet flow rate in the accident scenario due to malfunctioning outlet pump
$k_n,\ k_h$ | Constants of proportionality, with $k_h > k_n$
$r_{s0},\ \hat r_s,\ r_e$ | Rates of pressure loss due to the standard and emergency vents
$e$ | Binary state (closed = 0, open = 1) of the emergency vent
$\tau_c$ | Communication delay
$w_p,\ w_q$ | Process disturbances
$\bar w_p,\ \bar w_q$ | Known maximum absolute values of process disturbances
3.2 Hierarchical Control Structure and Safety Policies
The reactor is controlled by a two-level hierarchy of human supervisors. The lower-level
supervisor, whom we call the operator, chooses set-points for the inlet and outlet flow rates, and
decides whether to open or close the emergency vent. The higher-level supervisor, whom we call
the manager, instructs the operator of the mode of operation of the reactor. Specifically, the
manager chooses one of the following modes of operation of the reactor:
1. Normal mode $N$: the inlet and outlet flow rates are set to the nominal values of,
respectively, $s_N$ and $v_N$, to maintain the production of $C_3$ at a nominal rate and to
maintain the quantity of $C_3$ within the reactor vessel at a constant, nominal value of $q_n$.
2. Safe mode $S$: the inlet and outlet flow rates are set to, respectively, $s_S < s_N$ and
$v_S > v_N$, to lower the rate of production and to reduce the quantity of $C_3$ within the
reactor vessel.
3. Dead-stop mode $D$: the inlet flow rate is reduced to zero, and the outlet flow rate is set
to its maximum value $v_D > v_S > v_N$.
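The manager's mode choices amount to a lookup of preset flow rates, together with the automatic switch to Dead-stop at the threshold $p_d$ described in Section 3.2. A sketch with hypothetical rate values (only the orderings $s_S < s_N$ and $v_N < v_S < v_D$ are meaningful):

```python
# Preset flow rates per mode (hypothetical values for illustration).
MODES = {
    "N": {"s": 2.0, "v": 1.0},   # Normal
    "S": {"s": 1.0, "v": 1.5},   # Safe: lower inlet, higher outlet
    "D": {"s": 0.0, "v": 2.0},   # Dead-stop: zero inlet, maximum outlet
}
P_D = 2.5  # hypothetical automatic Dead-stop threshold, p_d < p_s

def setpoints(mode, p):
    """Return the (inlet, outlet) set-points for the current mode; the
    switch to Dead-stop at p > p_d is automatic, regardless of the
    manager's chosen mode."""
    if p > P_D:
        mode = "D"
    m = MODES[mode]
    return m["s"], m["v"]

print(setpoints("N", 1.0))  # (2.0, 1.0): Normal set-points
print(setpoints("S", 2.6))  # (0.0, 2.0): auto-switched to Dead-stop
```

The automatic override at `P_D` sits below the manager in the hierarchy yet acts without the manager's approval, which is exactly the kind of cross-level interaction the C&C analysis of Section 4 examines.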
As previously mentioned, an automatic controller makes small adjustments to the standard
vent and the outlet flow rate. Specifically:
1. During the Safe and Dead-stop modes, and/or whenever the emergency vent is open, this
controller saturates these adjustments to their maximum values, i.e., it sets
$\hat r_s(t) = \bar r_s$ and $\hat v(t) = \bar v$.
2. During the Normal mode, and when the emergency vent is closed, this controller
determines these adjustments by the following linear feedback control law:

$$\begin{bmatrix} \hat r_s(t) \\ \hat v(t) \end{bmatrix} = K \begin{bmatrix} \tilde p(t) \\ \tilde q(t) \end{bmatrix}, \qquad (5)$$

where $\tilde p(t) = p(t) - p_n$, $\tilde q(t) = q(t) - q_n$, and $K$ is a gain matrix. The matrix $K$ can be
obtained by several control-theoretic algorithms: linear quadratic regulation (LQR), for
example (Anderson & Moore, 1990). This control law counteracts small process
disturbances to stabilize the variables $p$ and $q$ at their nominal set-points, respectively,
$p_n$ and $q_n$.
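As a sanity check of the feedback law (5): linearizing Eqns. (3)–(4) about the set-point gives deviation dynamics that the gain $K$ must stabilize. The sketch below uses a hand-picked gain, a stand-in for an actual LQR design, and verifies by simulation that a small deviation decays; all values are hypothetical:

```python
# Linearized deviation dynamics about (p_n, q_n), from Eqns. (3)-(4):
#   d/dt p~ = q~ - rs_hat,   d/dt q~ = -v_hat,
# with the feedback law (5): [rs_hat, v_hat]^T = K [p~, q~]^T.
# This K is hand-picked for illustration, not an LQR solution; the
# closed-loop matrix [[-1, -0.5], [-0.5, -1]] has eigenvalues -0.5, -1.5.
K = [[1.0, 1.5],
     [0.5, 1.0]]

def closed_loop_step(pt, qt, dt=0.01):
    """One Euler step of the closed-loop deviation dynamics."""
    rs_hat = K[0][0] * pt + K[0][1] * qt
    v_hat = K[1][0] * pt + K[1][1] * qt
    return pt + dt * (qt - rs_hat), qt + dt * (-v_hat)

pt, qt = 0.2, 0.1           # small initial deviation from set-point
for _ in range(2000):       # simulate 20 time units
    pt, qt = closed_loop_step(pt, qt)
print(abs(pt) < 1e-3 and abs(qt) < 1e-3)  # True: deviations decay
```

An LQR design would instead solve an algebraic Riccati equation for $K$; the point here is only that a stabilizing linear gain keeps $(p, q)$ near $(p_n, q_n)$ as long as the adjustments stay within their saturation limits.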
The following safety policies are adopted by the manager and operator hierarchy to
safeguard against the self-ignition of the gaseous substance $C_4$. Firstly, the operator opens the
emergency vent whenever s/he notices an upward trend in the pressure $p$ that is not successfully
counteracted by the maximum capability of the automatic controller (i.e., the pressure rises
despite saturation of the automatic control inputs). Secondly, the reactor is designed to switch
automatically (i.e., without the manager's explicit approval) to the Dead-stop mode when the
pressure $p$ exceeds a threshold $p_d < p_s$. The threshold $p_d$ is designed such that if the plant is
switched to the Dead-stop mode at that threshold, the self-ignition pressure will not be breached.
Thirdly, a set of adverse safety events is defined in terms of certain predetermined inequality
constraints on the variables $(p, q)$. These constraints define envelopes for safe operation of the
reactor. Whenever any of these constraints is violated, i.e., an envelope of safe operation is
breached, an adverse safety event is said to have occurred. The operator notifies the manager,
who, in turn, decides whether to change the reactor’s mode of operation and accordingly notifies
the operator to implement the change of mode (if any). This back-and-forth communication
between the operator and manager requires a time of at most $\tau_c$, including the time required by
the manager to assess the situation and make a decision. Two such adverse safety events are as
follows:
SE1 – defined as the set of all $(p, q)$ such that $p$ will exceed $p_b$ in time $\tau_c$ assuming worst-
case disturbances, given that the current reactor mode is Normal and the emergency vent is
currently open. Precisely, SE1 occurs when the following inequality is violated³:

$$p(t) + \tau_c \big( q(t) - r_{s0} - \bar r_s - r_e + \bar w_p \big) + \frac{\tau_c^2}{2} \big( k_n\, s_N - v_N - \bar v + \bar w_q \big) \le p_b. \qquad (6)$$
SE2 – defined as the set of all $(p, q)$ such that $p$ will exceed $p_d$ in time $\tau_c$ assuming
worst-case disturbances, given that the current reactor mode is either Normal or Safe and the
emergency vent is currently open. Precisely, SE2 occurs when the following inequality is
violated:

$$p(t) + \tau_c \big( q(t) - r_{s0} - \bar r_s - r_e + \bar w_p \big) + \frac{\tau_c^2}{2} \big( k_n\, s_N - v_N - \bar v + \bar w_q \big) \le p_d. \qquad (7)$$
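The safety-event conditions are worst-case, second-order predictions of the pressure over the communication delay $\tau_c$: a first-derivative term from Eqn. (3) and a second-derivative term from Eqn. (4). A sketch of the trigger logic, with hypothetical parameter values:

```python
# Hypothetical parameter values; only the structure of the worst-case
# prediction over tau_c is illustrated.
TAU_C = 0.5                       # communication delay
RS0, RS_BAR, RE = 1.5, 0.2, 1.0   # vent loss rates (standard vent saturated)
KN, S_N, V_N, V_BAR = 0.5, 2.0, 1.0, 0.1
WP_BAR, WQ_BAR = 0.05, 0.05       # worst-case disturbance magnitudes
P_B, P_D = 2.0, 2.5               # thresholds for SE1 and SE2

def predicted_p(p, q, tau=TAU_C):
    """Worst-case p after a delay tau: Normal mode, emergency vent open,
    controls saturated; p_dot from Eqn. (3), p_ddot = q_dot from Eqn. (4)."""
    p_dot = q - RS0 - RS_BAR - RE + WP_BAR
    q_dot = KN * S_N - V_N - V_BAR + WQ_BAR
    return p + tau * p_dot + 0.5 * tau ** 2 * q_dot

def se1(p, q):
    return predicted_p(p, q) > P_B   # safe-operation inequality violated

def se2(p, q):
    return predicted_p(p, q) > P_D   # near-miss inequality violated

print(se1(1.0, 1.5), se1(1.95, 3.5), se2(1.95, 3.5))  # False True False
```

Note that the prediction credits the outlet with its set-point rate; if the true outlet flow is lower (as in the accident scenario below), the trigger fires later than it should.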
In addition to these predetermined and well-defined safety events, the safety policy allows
for so-called operator-triggered adverse safety events. These are events where the operator
detects highly anomalous reactor behavior, and notifies the manager to request a mode change.
The operator can recommend switching to a particular mode based on his/her experience and
situational awareness about the reactor’s state of operation. However, the operator is trained to
3 The inequalities (6) and (7) can be easily derived from the process equations (3)-(4) and the stated definitions
of the safety events.
sparingly trigger these “early” adverse safety events – and instead let the “standard” adverse
safety events occur – in order to avoid excessive false alarms that disrupt the reactor’s
production schedule.
The hierarchical control structure of the chemical reactor is summarized in Figure 5.
Figure 5: Illustration of the hierarchical control structure of the chemical reactor.
3.3 Accident Scenario
Consider an accident scenario consisting of the following sequence of events.
1. The reactor is operating in the Normal mode and the variables $(p, q)$ are at their nominal
operating values. Unknown to the operator, the outlet pump malfunctions at time $t = t_0$,
and the outlet flow rate gradually starts to decrease. At time $t = t_1$, the outlet flow rate
settles to a value of $v_m < v_N$ (nomenclature as in Table 1).
2. The automatic controller responds to the decreased flow rate. However, because $v_m$ is
significantly lower than $v_N$, both of the control inputs $\hat r_s$ and $\hat v$ saturate to their
maximum values. The quantity of $C_3$ within the reaction vessel exhibits a consistent
upward trend, followed by the pressure of the gas $C_4$, which also exhibits the same trend.
3. Noting these upward trends, and noting that the automatic control inputs are saturated,
the operator opens the emergency vent at time $t = t_2$.
4. The operator notices a downward trend in $p$, which matches his/her expectations. The
operator notes that $q$ continues to increase (according to the gauge that indicates $q$ to the
operator), which is anomalous behavior. However, because $p$ is decreasing as expected,
the operator does not notify the manager, assuming an instrument fault in the outlet flow
meter. Note that the outlet flow rate is not indicated to the operator.
5. At time t = t_3, safety event SE_1 occurs. The operator notifies the manager, who instructs
the operator to switch the reactor to the Safe mode.
6. At time t_4 = t_3 + τ_c, the operator switches the reactor to the Safe mode. By this time,
however, p has already exceeded p_b. Therefore, the rate of production of C_3, and
consequently, the rates of increase of p and q, significantly increase.
7. At time t = t_5, safety event SE_2 occurs. The time period t_5 − t_4 is too short for the
operator to react to the situation and notify the manager before the occurrence of SE_2.
8. Following the manager's instructions, the operator switches the plant to Dead-stop mode.
By this time, p has exceeded p_d. The variables p and q both continue to rise, and at
time t = t_6, the variable p breaches the self-ignition pressure p_s, and a catastrophic
failure occurs.
In Section 4.2, we discuss a “post-mortem” analysis of this accident in the C&C framework.
Informally, the “standard” adverse safety events (namely, SE_1 and SE_2) were designed
assuming that the actual outlet flow rate and the actual rates of pressure loss due to the two vents
were equal to the set-point values. Any serious anomalies between the actual and set-point rates
were left for the operator to detect and accordingly trigger an early adverse safety event. In this
accident scenario, the actual outlet flow rate was significantly less than the set-point flow rates of
the Normal and Safe modes. Furthermore, the operator did not trigger an early adverse safety
event. After the operator opened the emergency vent at time t = t_2, an initial reduction in the
pressure p occurred due to the process dynamics described by Eqns. (3) and (4).
To demonstrate the plausibility of this accident scenario, we performed a numerical
simulation of the process with the MATLAB® software tool, using the parameter values
shown in Table 2. Figure 6 shows the accident trajectory resulting from this numerical
simulation. Figure 7 and Figure 8 show, respectively, the evolutions over time of the variables
p, q and the control inputs r_s, v̂. For this simulation, Table 3 shows the time instants of
occurrence of each of the events described above.
Table 2: Parameter values used for numerical simulation.
Parameters | Values | Parameters | Values
p_n, p_b, p_d, p_s | 100, 250, 600, 1000 kPa | q_n | 1000 kg
v_m, v_N, v_S, v_D | 125, 200, 300, 25 kg/min | s_N, s_S | 125, 75 kg/min
v̂ | 25 kg/min | r_s0, r_s | 300, 30 kPa/min
k_n, k_h | 1.0, 1.8 | τ_c | 1 min
 | 0.3 kPa/kg-min | w_p, w_q | 0, 0
Table 3: Time instants (units: min.) of the various events described in the accident scenario.
t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6
5.000 | 9.050 | 12.525 | 13.525 | 15.500 | 16.500 | 19.375
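As a quick sanity check on Table 3, the reported event times can be encoded and verified to be strictly increasing, and the inter-event gaps computed. This is a sketch in Python rather than the authors' MATLAB tool; the comments labeling each instant restate the accident scenario description:

```python
# Event times (min) taken from Table 3 of the simulated accident scenario.
event_times = {
    "t0": 5.000,   # outlet pump malfunction begins
    "t1": 9.050,   # outlet flow rate settles at v_m
    "t2": 12.525,  # operator opens the emergency vent
    "t3": 13.525,  # safety event SE1 occurs
    "t4": 15.500,  # reactor switched to Safe mode
    "t5": 16.500,  # safety event SE2 occurs
    "t6": 19.375,  # p breaches p_s; catastrophic failure
}

times = list(event_times.values())

# The event sequence must be strictly increasing in time.
assert all(a < b for a, b in zip(times, times[1:]))

# Inter-event gaps (min); note the short SE1-response-to-SE2 window.
items = list(event_times.items())
gaps = {f"{k2}-{k1}": round(v2 - v1, 3)
        for (k1, v1), (k2, v2) in zip(items, items[1:])}
print(gaps["t5-t4"])  # 1.0
```

The one-minute gap between t_4 and t_5 is consistent with the scenario's claim that the operator had too little time to react before SE_2 occurred.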
Figure 6: Accident trajectory obtained via numerical simulation. The red-colored dots indicate events listed
in the preceding description of the accident scenario. The red-colored dot with bold black border at
(p, q) = (100 kPa, 1000 kg) is the nominal operating condition. For adverse safety events SE_1 and SE_2, a crossing
from the left to the right of the boundary indicated for each event constitutes an occurrence of that adverse
safety event.
Figure 7: The time-histories of the variables p and q in the numerically simulated accident. The dotted red
lines indicate the time instants for each of the events listed in the accident scenario.
Figure 8: The time-histories of the automatic control inputs r_s and v̂ in the numerically simulated accident.
The dotted red lines indicate the time instants for each of the events listed in the accident scenario. Note that
both the control inputs saturate soon after the outlet pump malfunction occurs, and the operator reacts to
these saturations by opening the emergency vent. Recall that, once the emergency vent is opened, the
automatic controller is designed to saturate the control inputs regardless of the state of the reactor.
4 Accident Analysis and Discussion
In this section, we analyze the chemical reactor model and the preceding accident scenario
in the systems-theoretic framework introduced in Section 2. We show that the consistency
postulate did not hold true for the manager-operator hierarchy. This lack of consistency may be
considered a fundamental failure mode that was among the “root causes” of the accident. We
discuss the ramifications of identifying the lack of consistency as one of the “root causes”,
including the design guidelines that can be extracted. In addition to the conclusions drawn
regarding consistency (or lack thereof) in hierarchical control structures, a secondary purpose of
the following analysis is to provide a detailed and rigorous example of the systems- and control-
theoretic treatment of system safety, and to spur further research into such analyses for real-
world systems.
4.1 Real-World Accident Analogs of the Case Study
We constructed the preceding chemical reactor model and accident scenario for illustrative
purposes, and we showed that this accident scenario is mathematically feasible. Here, we further
corroborate the argument that our case study is illustrative of real-world accidents – and that, by
consequence, the proposed analysis based on the C&C principles carries practical value – by
pointing out several characteristics of this example that bear resemblance to real-world, large-
scale systems that suffered catastrophic accidents.
Firstly, consider the absence of a gauge indicating the outlet flow rate to the operator, which
may be considered a glaring and “obvious” omission that hinders the operator’s situational
awareness. However, an example of a similarly “obvious” (in hindsight) omission of
instrumentation may be found in the accident analysis of the isomerization plant at the BP Texas
City refinery (Bakolas & Saleh, 2011; Mogford, 2005). There, the raffinate tower level indicator
was designed such that overflows above 100% would not be indicated, and this instrumentation
flaw played a crucial role in the accident (Mogford, 2005).
Secondly, consider the “simple” malfunction of the outlet pump, perhaps pointing to a lack
of redundancy of this crucial component. The real-world parallels of this accident are found in
the Piper Alpha disaster, where a crucial pump malfunctioned at a time when its redundant pump
was shut down for maintenance, and another crucial fire-fighting pump had no redundancy (Pate-
Cornell, 1993).
Thirdly, consider the operator's inaction before the occurrence of each safety event despite
clear indications that the quantity q of C_3 in the reaction vessel was abnormally high and rising. This
situation illustrates operator confusion and impaired situational awareness during accident
sequences, as exemplified in the loss-of-coolant accident at the Three Mile Island nuclear power
station (Hopkins, 2001). In that accident, the operators inadvertently abetted the loss of coolant,
despite warnings and indications pointing to contradictory actions; nevertheless, it has been
argued that their reactions may not have been “erroneous,” and that “operator error” should not
be considered among the root causes of the Three Mile Island accident (Hopkins, 2001; Perrow,
1999).
Finally, consider the manager's decision to switch to the Safe mode following the
occurrence of SE_1, in conjunction with the operator's inaction in requesting to switch directly to
the Dead-stop mode. In addition to the issue of the operator’s confusion and impaired situational
awareness, organizational policies and, perhaps, cost-related considerations influenced the
decisions of both manager and operator. These decisions and actions (or lack thereof) bear
resemblance to a similar decision that contributed to the Piper Alpha offshore platform accident:
there, a nearby interconnected platform called Claymore decided to continue production despite
unambiguous indications that a fire had already started at the Piper Alpha platform (Pate-
Cornell, 1993).
4.2 Analysis in the C&C Framework
We first identify the elements of the chemical reactor case study in the context of the
framework of the C&C principles presented in Section 2. The hierarchical control structure is
clear: the manager constitutes the supremal unit, the operator constitutes the (solitary) infimal
unit, and the reactor, including the automatic controller, constitutes the process (see Figure 1).
We assume that the system safety is to be assessed over the time interval [0, t_f], where t_f defines
some timeframe of interest.
For the purposes of this analysis, we assume that the overall decision problem Γ is
informally expressed as “ensure safety of the reactor,” which we consider to be equivalent to the
problem of satisfying the constraint p(t) ≤ p_s at all times t ∈ [0, t_f]. In practice, this decision
problem will be an optimization problem with objectives related to profitability and constraints
related to supply chains and market demands, with the aforesaid safety constraint as one of the
constraints. The supremal decision problem Γ_S will involve some of these objectives with a
narrower scope. We assume that the objective of minimizing the deviation of the variables (p, q)
from their nominal operating condition (p_n, q_n) is included among the objectives of the supremal
unit. To circumvent the need for listing all of the objectives and constraints in the supremal
decision problem, we will consider a control policy for the manager, which encapsulates – and
decouples from our analysis – these additional objectives and constraints in Γ_S, so that we may
focus on the safety-related objective of minimizing the deviation of (p, q) from the nominal
operating condition. The infimal decision problem Γ_I is narrower in scope, and we define it as
the problem of (solely) minimizing the deviation of the reactor's operating condition from the
nominal operating condition (p_n, q_n). This problem may be mathematically expressed as the
minimization of the quantity
minimization of the quantity
f
T
o0
ˆ ˆ( ) ( )
ˆ ˆ( ) (d
),
t p t p t
q t qQ t
t
(8)
where p̂ and q̂ are deviations from the nominal operating condition (see Eqn. (5)), and Q_o is a
positive-definite matrix. Note that an LQR-based automatic controller will “help” the operator by
adjusting the standard vent and the outlet flow rate to minimize a quantity similar to that in
expression (8).
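The quadratic deviation cost of expression (8) can be evaluated numerically for any sampled trajectory. A minimal Python sketch, using hypothetical deviation histories (not the paper's simulation data) and taking Q_o as the identity matrix, since the paper does not specify its entries:

```python
import numpy as np

def deviation_cost(t, p_hat, q_hat, Q_o):
    """Numerically evaluate the quadratic deviation cost of expression (8),
    i.e. the integral of [p_hat(t) q_hat(t)] Q_o [p_hat(t) q_hat(t)]^T
    over [0, t_f], via the trapezoidal rule on sampled histories."""
    z = np.stack([p_hat, q_hat])                     # 2 x N deviations
    integrand = np.einsum('in,ij,jn->n', z, Q_o, z)  # z(t)^T Q_o z(t)
    return float(np.sum((integrand[:-1] + integrand[1:]) / 2.0 * np.diff(t)))

# Hypothetical deviation histories, for illustration only.
t = np.linspace(0.0, 20.0, 201)
p_hat = 5.0 * np.sin(0.3 * t)     # pressure deviation, kPa
q_hat = 2.0 * (1.0 - np.exp(-t))  # quantity deviation, kg
cost = deviation_cost(t, p_hat, q_hat, np.eye(2))
print(cost > 0.0)  # True
```

A positive-definite Q_o guarantees a positive cost for any nonzero deviation history, which is what makes expression (8) a meaningful measure of departure from the nominal operating condition.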
Next, we discuss the decision sets of the supremal and infimal units – in the context of this
case study, these are the “safety levers” that the manager and operator hold. Informally, the
supremal decision set is the set of all possible sequences of modes of operation in the interval
[0, t_f]. Precisely, the supremal decision set, which we denote by Γ*_S, is defined by the set

Γ*_S := { {(M_i, τ_i)}_{i=0}^{n} : n ∈ ℕ, M_i ∈ {N, S, D}, and τ_i ∈ [0, t_f] for each i ∈ {0, 1, …, n} }.   (9)
Any supremal decision must be an element of this set, and is characterized by the number
of mode switches n, the time instants τ_i at which the switches occur, and the modes of
operation M_i selected at each switch i ∈ {0, 1, …, n}. For example, the manager's decision in the
case study accident scenario is γ_acc = {(M_i, τ_i)}_{i=0}^{2} = {(N, 0), (S, t_3), (D, t_5)}, where
M_0 = N, M_1 = S, M_2 = D, and τ_0 = 0, τ_1 = t_3, τ_2 = t_5.
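The membership condition defining the supremal decision set in Eqn. (9) can be checked mechanically. A sketch in Python (in place of the paper's MATLAB); the horizon value and the requirement of increasing switching times are illustrative assumptions:

```python
T_FINAL = 20.0           # illustrative horizon t_f, in minutes
MODES = {"N", "S", "D"}  # Normal, Safe, Dead-stop

def is_supremal_decision(decision, t_final=T_FINAL):
    """Check membership in the supremal decision set of Eqn. (9): a finite
    sequence of (mode, switching-time) pairs with modes in {N, S, D} and
    switching times in [0, t_final], here assumed strictly increasing."""
    if not decision:
        return False
    modes, times = zip(*decision)
    return (all(m in MODES for m in modes)
            and all(0.0 <= tau <= t_final for tau in times)
            and all(a < b for a, b in zip(times, times[1:])))

# The manager's decision in the accident scenario: start in Normal,
# switch to Safe at t3 = 13.525 min, then to Dead-stop at t5 = 16.5 min.
gamma_acc = [("N", 0.0), ("S", 13.525), ("D", 16.5)]
print(is_supremal_decision(gamma_acc))  # True
```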
To characterize the infimal decision set, note that the operator has access to four variables
by which s/he can control the reactor: the inlet flow rate set-point s, the outlet flow rate set-point
v_0, the opening/closing of the emergency vent (previously denoted by e), and the time instants
of notifying the manager about adverse safety events (“standard” as well as operator-triggered
events), which we denote by {ν_j}_{j=0}^{m}, where m ∈ ℕ, and for each j ∈ {0, 1, …, m},
ν_j ∈ [0, t_f] and ν_{j+1} > ν_j. Recall from Section 3.2 that the inlet and outlet flow rate
set-points are decided by the manager's decision regarding the mode of operation. This
observation conforms with the framework in Section 2, where it was noted that each infimal
decision problem depends on the coordinating input chosen by the supremal unit. Here, this
dependence is in the form of equality constraints for the inlet and outlet flow rate set-points.
Specifically, the supremal decision {(M_i, τ_i)}_{i=0}^{n} automatically determines s and v_0 as:

s(t) = s_{M_0} for t ∈ [0, τ_1), …, s_{M_n} for t ∈ [τ_n, t_f];
v_0(t) = v_{M_0} for t ∈ [0, τ_1), …, v_{M_n} for t ∈ [τ_n, t_f].   (10)
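The equality constraints (10) amount to a piecewise-constant lookup: the mode in force at time t fixes both set-points. A sketch in Python; the per-mode numeric values below are illustrative placeholders, not taken from the paper:

```python
import bisect

# Per-mode (inlet, outlet) flow-rate set-points (s_M, v_M). The numeric
# values are illustrative placeholders, not the paper's parameters.
SETPOINTS = {"N": (125.0, 200.0), "S": (75.0, 300.0), "D": (0.0, 25.0)}

def setpoints_at(t, decision):
    """Implement the equality constraints of Eqn. (10): the supremal
    decision {(M_i, tau_i)} fixes s(t) and v_0(t) piecewise-constantly,
    according to the most recent mode switch at or before time t."""
    modes = [m for m, _ in decision]
    times = [tau for _, tau in decision]
    i = bisect.bisect_right(times, t) - 1  # last switch at or before t
    return SETPOINTS[modes[i]]

gamma_acc = [("N", 0.0), ("S", 13.525), ("D", 16.5)]
print(setpoints_at(10.0, gamma_acc))  # Normal-mode set-points
print(setpoints_at(17.0, gamma_acc))  # Dead-stop set-points
```

This is exactly the sense in which the manager's coordinating input constrains the infimal unit: the operator cannot choose s and v_0 freely, only the remaining "safety levers."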
Therefore, the infimal decision set is informally described as the combination of (1) all
possible ways of switching the emergency vent between the “open” and “closed” states during
the time interval [0, t_f], (2) all possible ways of selecting the adverse safety event notification
times, and (3) all possible ways of selecting the inlet and outlet flow rate set-points. Precisely, the
infimal decision set is

Γ*_I := { (e, s, v_0, {ν_j}_{j=0}^{m}) : e ∈ Γ*_{I,E}, s ∈ Γ*_{I,I}, v_0 ∈ Γ*_{I,O}, m ∈ ℕ, and ν_j ∈ [0, t_f] for each j ∈ {0, 1, …, m} },   (11)
where Γ*_{I,E} is the set of all piecewise constant functions taking values in {0, 1}, Γ*_{I,I} is the set of
all piecewise constant functions taking values in {0, s_S, s_N}, and Γ*_{I,O} is the set of all piecewise
constant functions taking values in {v_D, v_S, v_N}. Any infimal decision x must be an element of
Γ*_I. For example, the operator's actions in the case study accident scenario are concisely
denoted by the infimal decision x_acc = (e, s, v_0, {ν_0, ν_1}), where ν_0 = t_3, ν_1 = t_5, and the
functions e ∈ Γ*_{I,E}, s ∈ Γ*_{I,I}, v_0 ∈ Γ*_{I,O} are given by the expressions
e(t) = 0 for t < t_2, and e(t) = 1 for t ≥ t_2;
s(t) = s_N for t ∈ [0, t_3), s_S for t ∈ [t_3, t_5), and 0 for t ∈ [t_5, t_f];
v_0(t) = v_N for t ∈ [0, t_3), v_S for t ∈ [t_3, t_5), and v_D for t ∈ [t_5, t_f].   (12)
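The operator's decision (12) can be encoded the same way: each component is a piecewise-constant function of time. A sketch in Python, using the event times of Table 3 and assuming, per the scenario description, that the notifications coincide with the safety-event occurrences:

```python
def emergency_vent(t, t_open=12.525):
    """e(t) of the operator's decision: 0 before the operator opens the
    emergency vent at t2 = 12.525 min, 1 thereafter (the scenario notes
    the vent is never closed again)."""
    return 0 if t < t_open else 1

# Safety-event notification times fed back to the manager (assumed to
# coincide with the occurrences of SE1 at t3 and SE2 at t5).
notifications = [13.525, 16.5]

print([emergency_vent(t) for t in (5.0, 12.525, 18.0)])  # [0, 1, 1]
```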
We define the output of the infimal decision unit, Π(x), as Π(x)(t) = (e(t), s(t), v_0(t)). The
sequence {ν_j}_{j=0}^{m} of safety event notification times constitutes the feedback given by the
infimal unit to the supremal unit (see Figure 1). Table 4 summarizes the preceding identification
of the chemical reactor case study with the elements of the systems-theoretic framework
discussed in Section 2.
Table 4: Identification of the chemical reactor case study with elements of the framework in Section 2.
Entity | Description
Overall problem Γ | Ensure p(t) ≤ p_s for all t ∈ [0, t_f].
Supremal decision set Γ*_S | All possible sequences of modes (see Eqn. (9)).
Supremal decision γ | A particular sequence of modes, with switching times.
Supremal problem Γ_S | Higher-level objectives and constraints, including the objective of minimizing deviation from the nominal operating condition.
Infimal decision set Γ*_I | All possible ways of operating the emergency vent, combined with all possible ways of choosing adverse safety event notification times (see Eqn. (11)).
Infimal decision x | A particular way (time history) of operating the emergency vent, combined with a particular sequence of notification times.
Infimal problem Γ_I | Minimize deviation from the nominal operating condition (see Eqn. (8)).
Output of infimal unit | Π(x), with Π(x)(t) = (e(t), s(t), v_0(t)).
Feedback from infimal unit to supremal unit | Sequence of safety event notification times {ν_j}_{j=0}^{m}.
Next, we examine the supremal and infimal decisions in the case study accident scenario.
We assume that the supremal decision unit, i.e., the manager, makes its decision according to a
policy that recommends the next mode based on the current situation. An example of such a
policy is the following:

M_{i+1} = S, if M_i = N, p(τ_{i+1}) ≤ p_b, and SE_1 occurs;
M_{i+1} = D, if M_i = S and SE_2 occurs;
where τ_{i+1} := min{ ν_j : ν_j > τ_i, j ∈ {0, 1, …, m} }.   (13)
Such a policy not only incorporates feedback from the infimal unit, but also reacts to the
current situation, rather than implementing a preprogrammed plan of action. We may assume
that the policy adopted by the manager also incorporates objectives and constraints other than the
“minimize deviation from the nominal operating condition” objective, and that any particular
decision resulting from this policy solves the supremal decision problem Γ_S. In particular, the
supremal decision in the case study accident scenario, γ_acc = {(M_i, τ_i)}_{i=0}^{2} =
{(N, 0), (S, t_3), (D, t_5)}, solves Γ_S, and the proposition P(γ_acc, Γ_S) is true (recall from
Section 2.1 that the truth of the predicate P(·, ·) means that the first argument solves the
decision problem in the second argument of the predicate).
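The manager's policy described above is, in effect, a small state machine over modes, driven by the infimal unit's notifications. A hedged Python sketch of one such policy; the threshold value and the guard structure are illustrative, not the paper's exact specification:

```python
def next_mode(current_mode, event, p=None, p_b=250.0):
    """A feedback policy in the spirit of the manager's policy: escalate
    from Normal to Safe on SE1 (while the pressure is still below the
    threshold p_b), and from Safe to Dead-stop on SE2. The p_b value
    here is an illustrative placeholder."""
    if current_mode == "N" and event == "SE1" and (p is None or p <= p_b):
        return "S"
    if current_mode == "S" and event == "SE2":
        return "D"
    return current_mode  # otherwise, hold the current mode

# The accident scenario's escalation sequence:
mode = "N"
mode = next_mode(mode, "SE1", p=240.0)  # Normal -> Safe
mode = next_mode(mode, "SE2")           # Safe -> Dead-stop
print(mode)  # D
```

Note that the policy reacts only to notifications it receives; this is precisely why the operator's failure to trigger an early safety event leaves the manager acting on stale assumptions.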
To determine whether the infimal decision x_acc = (e, s, v_0, {ν_0, ν_1}), where e, s, and v_0
are given by (12), solves the infimal decision problem, we must answer the question “Did the
operator take the best possible actions to minimize the deviation of the reactor state from the
nominal operating condition?” We reexamine the accident scenario and note that the operator
opened the emergency vent soon after4 s/he noticed the rise in p and q and the saturation of the
automatic control inputs r_s and v̂, and did not close this vent again. Assuming no malicious
intent from the operator, and noting that the initial reduction in pressure immediately after
opening the emergency vent may have corroborated his/her belief that there was no need to
trigger an “early” adverse safety event, we conclude that the operator's decision x_acc was in
fact the best possible decision that s/he could make, and that the proposition P(x_acc, Γ_I) is true.
The fact that a catastrophic failure occurred – that is, the constraint p(t) ≤ p_s was not
satisfied during the operation of the reactor – implies that the overall decision problem Γ was
not solved. Therefore, the proposition P(Π(x_acc), Γ) is false. Now consider the truth of the
statement

[ P(γ_acc, Γ_S) and P(x_acc, Γ_I) ] ⟹ P(Π(x_acc), Γ).   (14)

Because we know the truth values of each of these propositions in (14), we may rewrite it as:

[ True and True ] ⟹ False.   (15)

Comparing the expression in (15) with the truth table (Huth & Ryan, 2004) for implication (⟹),
we find that the expression in (15) is false. Therefore, we have found one pair of supremal and
infimal decisions, namely, the pair (γ_acc, x_acc), for which the statement
“P(γ, Γ_S) and P(x, Γ_I) implies P(Π(x), Γ)” is false. In other words, we have shown that the
consistency postulate given in Proposition (2) is violated in the hierarchical control
structure of the chemical reactor. In what follows, we discuss the ramifications of this
observation.
4 The delay between the saturation of the automatic controller inputs and the operator opening the emergency
vent, seen in Figure 8, may be attributed to the operator deliberating about his/her actions. In any case, it is easy to
show that this accident would have occurred even if the operator had opened the emergency vent immediately after
the saturation of the automatic controller inputs.
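The propositional argument in (14)-(15) is easy to mechanize. A minimal Python sketch of the truth-table check, with the variable names standing in for the propositions established in the analysis:

```python
def implies(a, b):
    """Material implication: 'a implies b' is false only when a is true
    and b is false (the last row of the implication truth table)."""
    return (not a) or b

# Truth values established in the analysis:
P_gamma = True     # the manager's decision solves the supremal problem
P_x = True         # the operator's decision solves the infimal problem
P_overall = False  # the overall safety problem was NOT solved

# Statement (14): consistency requires this implication to hold.
print(implies(P_gamma and P_x, P_overall))  # False
```

The printed `False` is the formal content of the consistency violation: both levels solved their own problems, yet the overall problem went unsolved.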
4.3 Discussion
4.3.1 Identification of a Fundamental Failure Mode
Firstly, we claim that the diagnosis of “violation of the consistency postulate” (briefly, “lack
of consistency”) can be considered a fundamental failure mode of the system. For the chemical
reactor case study, the lack of consistency may be considered a “root cause” of the accident. The
individual elements of the manager-operator hierarchy operated “correctly”, yet they were
together unable to prevent the accident. The diagnosis of lack of consistency provides a rigorous
and unambiguous answer to the question of “what was wrong” about the interaction between the
manager and operator that caused this accident. Furthermore, due to its theoretical and domain-
independent formulation, the lack of consistency (and also, as we showed in (Cowlagi & Saleh,
2013), the lack of coordinability) provides a powerful vocabulary to express the results of post-
mortem accident analyses. Consequently, abstract similarities in accidents from widely different
technological domains can be identified from epidemiological studies, and the lessons learnt (on,
say, how to prevent the lack of consistency) from accident analysis in one technological domain
can be transferred to another. As we discuss next, the diagnosis of lack of consistency can also
serve as a starting point for learning lessons from post-mortem accident analyses. Before this
discussion, we comment on the following potential counter-arguments – and corresponding
rebuttals – to the claim that the lack of consistency was a “root cause” of the accident:
“The malfunctioning of the outlet pump was the root cause.” – The malfunctioning of the
outlet pump was an initiating event, but by itself was not sufficient to cause catastrophic
failure. Conversely, a different event, such as a malfunction in one of the gas vents, could
have initiated a similar accident scenario.
“The organizational focus on profit over safety (evident in operator training) was the root
cause.” – The operator mistook the initial reduction in pressure after opening the emergency
vent as an indication of relative normalcy of operation. Whereas the operator’s training may
have contributed to his/her inaction in triggering an early adverse safety event, there is no
reason to believe that s/he would have done so otherwise, especially given his/her
compromised situational awareness.
“The ‘standard’ safety events (SE_1 and SE_2) were incorrectly designed, and the flaws in
their design were the root cause.” – The safety events were designed to withstand worst-case
disturbances. It may be argued that if the process disturbance caused by the malfunctioning
outlet pump was lower than the worst-case value, that is, if v_N − v_m ≤ w_q, then the “usual”
safety events would have sufficed to stop the accident trajectory from reaching the point of
catastrophic failure. However, the worst-case disturbances cannot be chosen arbitrarily high, for
otherwise the nominal operating condition itself would violate the safety event constraints.
For “sufficiently high” values of the worst-case disturbances, there will be sufficiently small
values of v_m that will initiate similar accidents.
Each of these candidates for “root causes” is clearly a contributing factor to this particular
accident scenario. However, unlike the lack of consistency, these factors are specific to the
particular system and accident scenario considered in this case study, and they do not provide
easily transferable lessons that may be learnt to improve systems in other technological domains.
Whereas these contributing factors need to be recognized, there is also the need for identifying
fundamental failure modes that manifested themselves through these (and, possibly, other)
factors.
4.3.2 Further Analysis
The identification of lack of consistency in the manager-operator hierarchy is a starting point
for deeper analysis of the accident, which may be guided by questions such as: Why did the lack
of consistency occur? How can it be identified and prevented? We pursue the resolution of these
questions for the chemical reactor case study with a chain of questions and answers that examine
the accident in progressively greater detail.
By definition, the lack of consistency occurred because (at least) for the pair of supremal
and infimal decisions (γ_acc, x_acc), the overall decision problem Γ was not solved.
Why was Γ not solved? As the manager and operator both made “correct” decisions (i.e.,
the problems Γ_S and Γ_I were both solved), either the manager or the operator, or both, must
have made assumptions that led Γ to remain unsolved.
What were these assumptions? One such assumption made by the manager is that his/her
responses to adverse safety events (whether “standard” or operator-triggered) were barriers that
prevented the reactor from entering a hazardous regime of operation. For example, SE_1 was
designed such that the manager's response (including the communication delay τ_c) to switch to
the Safe mode would prevent the pressure from breaching the threshold p_b, where the rate of
production of C_3 and C_4 would significantly increase. However, SE_1 was designed assuming
an outlet flow rate of v_N, whereas the actual outlet flow rate was v_m < v_N. Therefore, the
pressure had already breached p_b by the time the reactor was switched to the Safe mode.
How could this situation have been prevented? One possible method would have been to
make the operator aware that SE_1 was likely to fail in its role as a barrier to the hazard of
exceeding the pressure threshold p_b. For example, if the operator had been provided guidelines
for triggering early adverse safety events based on the observed or estimated rates of change of
p and q, rather than on the values of p and q themselves, then the operator would have noted
that (1) the rate of reduction in pressure immediately after opening the emergency vent was
anomalous, and (2) the rate of increase of the quantity of C_3 in the reaction vessel was
anomalous. The observation of these two process anomalies and of the saturation of the
automatic control inputs may have provided the operator sufficient evidence to trigger an early
safety event.
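The rate-based guideline suggested above can be made concrete with finite-difference rate estimates. A Python sketch; the sample data, expected rate, and tolerance are illustrative assumptions, not values from the paper:

```python
def rate_anomalies(times, values, expected_rate, tolerance):
    """Flag sampling intervals where the finite-difference rate of change
    of a measured variable departs from its expected value by more than
    a tolerance. Thresholds here are illustrative, not from the paper."""
    flags = []
    for i in range(1, len(times)):
        rate = (values[i] - values[i - 1]) / (times[i] - times[i - 1])
        flags.append(abs(rate - expected_rate) > tolerance)
    return flags

# Hypothetical pressure samples after the emergency vent is opened:
# p falls, but far more slowly than the expected venting rate.
t = [12.5, 13.0, 13.5, 14.0]          # min
p = [260.0, 255.0, 251.0, 248.0]      # kPa
print(rate_anomalies(t, p, expected_rate=-30.0, tolerance=10.0))
# [True, True, True]
```

The point of the sketch is that a slowly falling pressure, which reassured the operator in the accident scenario, is itself flagged as anomalous once the guideline is stated in terms of rates rather than values.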
As previously mentioned, we developed a MATLAB®-based simulation of this case study
to aid its understanding. This simulation is freely available at the following URL:
http://users.wpi.edu/~rvcowlagi/software.html. The accident scenario and various prevention
scenarios for this accident have been pre-programmed. The reader is invited to download this
simulation and to not only examine these scenarios, but also to modify the parameters to create
his/her own scenarios to further pursue this research.
4.3.3 Broader Lessons Learned
Beyond the particulars of the chemical reactor case study, we point out a series of
characteristics of the preceding analysis that are more generally applicable in systems with
similar hierarchical multilevel structures.
Firstly, we note that the chain of reasoning – recall that the identification of lack of
consistency led to the identification of some of the flaws in the system’s control structure – can
be further developed into a set of domain-independent “procedural templates” for accident
analysis. Post-mortem accident reports expressed using these templates may be used to identify
common flaws and to develop common accident prevention principles.
Secondly, we propose that ensuring the truth values of the coordinability and consistency
propositions (1) and (2) provides the basis for developing a “safety specification” during system
design. For example, safety specifications have often been applied in software development
(Alagar & Periyasamy, 1998). These specifications are expressed in the form of logical
propositions, whose truth must be ensured in all possible ways of execution of the software.
There are several algorithms, such as model checking (Alagar & Periyasamy, 1998) that aid the
design and analysis of software systems to ensure the satisfaction of these safety specifications.
Similar safety specifications for large-scale systems, where software and physical systems
interact with each other, can be developed using the C&C propositions. Note that such safety
specifications will be concise and abstract, so as to avoid exhaustively listing all possible
undesirable scenarios (an exercise that can be time-consuming and not always feasible during the
systems design phase).
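A C&C-based safety specification of the kind described here can be stated as a predicate over execution traces and checked mechanically, much as model checkers do for software. A minimal Python sketch over a sampled pressure trajectory; the threshold comes from the overall problem statement, while the trace values are hypothetical:

```python
P_S = 1000.0  # self-ignition pressure threshold p_s, kPa

def satisfies_safety_spec(pressure_trace, p_max=P_S):
    """The overall safety specification stated as a predicate: p(t) <= p_s
    at every sampled instant (a discrete analog of the constraint that
    must hold over the whole interval of operation)."""
    return all(p <= p_max for p in pressure_trace)

# A hypothetical nominal run versus a runaway pressure rise:
nominal = [100.0, 105.0, 98.0, 102.0]
runaway = [100.0, 250.0, 600.0, 1050.0]
print(satisfies_safety_spec(nominal), satisfies_safety_spec(runaway))
# True False
```

The abstraction matters: the specification names no particular accident scenario, so it does not require exhaustively enumerating undesirable trajectories in advance.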
Thirdly, the problem of satisfying these C&C-based safety specifications, namely, the
problem of ensuring the truth values of propositions (1) and (2), leads to a series of questions that
must be resolved, and the resolutions to which will lead to important design guidelines for
system safety. For example, we encounter the question “What quantities must be measured or
estimated, to ensure satisfaction of the propositions (1) and (2)?” Note that the control-theoretic
notion of observability may not suffice here. For example, in the chemical reactor case study, we
concluded that the measurement or estimation of the rates of change of p and q was necessary in
addition to the measurement of the variables p and q themselves (the measurement of p and q
alone means that the system is observable). Further crucial questions related to the satisfaction of C&C-based
safety specifications, which we did not address in the chemical reactor case study, include:
How can we ensure that a hierarchical multilevel system remains coordinable and consistent
under changing circumstances?
Which entity or entities should be responsible for ensuring coordinability and consistency at
all times?
Finally, we point out potential avenues for integrating system safety analysis based on the
C&C principles with some of the traditional and well-established approaches to safety. In the
chemical reactor case study, we studied safety barriers implemented in the form of adverse safety
events and the associated control policies of the manager, and the relation between these barriers
and consistency in the hierarchical control structure. More generally, the study of relations
between the C&C principles and the defense-in-depth principle is a potential area for future
research that can help establish strong theoretical guidelines for establishing and implementing
multiple safety barriers for accident prevention. Similarly, the study of relations between the
C&C principles and the HRO paradigm may provide formal, mathematically rigorous meanings to
the qualitative HRO characteristics. For example, one HRO characteristic is the “shifting from
centralized authority during routine operations to local/decentralized authority for hazardous
operations” (Saleh et al., 2011). In the chemical reactor case study, this characteristic closely
relates to the approach of ensuring consistency by providing the operator additional authority to
trigger safety events early. As a final comment on the need to further investigate potential
relations between the C&C principles and established approaches to system safety, we note that
the fundamental failure modes identified by the C&C principles may serve as the starting point
of exhaustive analyses to identify different accident trajectories for QRA and PRA-based system
design. These quantitative design methods strongly depend on such exhaustive analyses:
“brainstorming for possible accident scenarios is an essential step [in PRA] for understanding
how a system might fail, and in turn, for making risk-informed design and operational choices to
support accident prevention” (Saleh et al., 2011). The failure modes identified by the C&C
principles can add a mathematically rigorous element to this brainstorming process.
5 Conclusions
In this work, we developed a case study to illustrate details of the application of C&C
principles for accident analysis and system safety. This case study was chosen to represent a
broad class of multi-level systems where operational policies and procedures of individual
stakeholders interact with physical processes to spawn unanticipated system behaviors – and,
consequently, new failure modes. We argued that the lack of coordinability and/or consistency
constitutes one such fundamental failure mode in hierarchical multilevel systems. We proposed,
on the one hand, that the recognition of this failure mode in post-mortem accident analysis can
help to conceptually relate the lessons learnt from post-mortem analyses of accidents in
different technological domains. On the other hand, we proposed that efforts directed at
preventing and mitigating the manifestations of these fundamental failure modes can directly
lead to a series of questions related to the safety-oriented design, operation, and control of a
hierarchical multilevel system. Crucially, we emphasized that accident prevention and system
safety based on the C&C principles is not an isolated idea, and we provided specific avenues to
allow the C&C principles to inform and embellish well-established ideas in system safety, such
as the HRO paradigm, QRA/PRA, and the defense-in-depth principle.
References
Alagar, V. S., & Periyasamy, K. (1998). Specification of Software Systems. New York, NY: Springer-Verlag.
Anderson, B. D. O., & Moore, J. B. (1990). Optimal Control: Linear Quadratic Methods. Englewood Cliffs, NJ:
Prentice-Hall.
Apostolakis, G. E. (2004). How useful is quantitative risk assessment? Risk Analysis, 24(3), 515-520. doi:
10.1111/j.0272-4332.2004.00455.x
Asare, P., Broman, D., Lee, E. A., Torngren, M., & Sunder, S. S. (2012). Cyber-physical Systems - A Concept Map.
Retrieved from http://cyberphysicalsystems.org/
Bakolas, E., & Saleh, J. H. (2011). Augmenting Defense-in-depth with the Concepts of Observability and
Diagnosability from Control Theory and Discrete Event Systems. Reliability Engineering and System
Safety, 96, 184-193.
Bertalanffy, L. (1969). General System Theory: Foundations, Development, Applications. New York: George
Braziller.
Cowlagi, R. V., & Saleh, J. H. (2013). Coordinability and Consistency in Accident Causation and Prevention:
Formal System-Theoretic Concepts for Safety in Multilevel Systems. Risk Analysis, 33(3), 420-433.
Favaro, F. M., & Saleh, J. H. (2014). Observability-in-Depth: an Essential Complement to the Defense-in-Depth
Safety Strategy in the Nuclear Industry. Nuclear Engineering & Technology.
Hollnagel, E. (2004). Barriers and Accident Prevention. Burlington, VT: Ashgate.
Hopkins, A. (2001). Was Three Mile Island a 'Normal Accident'? Journal of Contingencies and Crisis Management,
9(2), 65-72.
Huth, M., & Ryan, M. (2004). Logic in Computer Science. Cambridge, UK: Cambridge University Press.
Khan, F. I., & Abbasi, S. A. (1999). Major accidents in process industries and an analysis of causes and
consequences. Journal of Loss Prevention in the Process Industries, 12(5), 361-378. doi:
10.1016/S0950-4230(98)00062-X
Khan, F. I., & Amyotte, P. R. (2003). How to make inherent safety practice a reality. Canadian Journal of Chemical
Engineering, 81(1), 2-16.
Khan, F. I., & Amyotte, P. R. (2004). Integrated inherent safety index (I2SI): A tool for inherent safety evaluation.
Process Safety Progress, 23(2), 136-148. doi: 10.1002/prs.10015
Kleindorfer, P. R., Belke, J. C., Elliott, M. R., Lee, K., Lowe, R. A., & Feldman, H. I. (2003). Accident
epidemiology and the US chemical industry: Accident history and worst-case data from RMP*Info. Risk
Analysis, 23(5), 865-881. doi: 10.1111/1539-6924.00365
Kleindorfer, P. R., Makris, J. L., & Conomos, M. G. (1999). Accident epidemiology: Understanding the major
drivers of major accident risk. Epidemiology, 10(4), S115-S115.
Leveson, N. G. (2004a). A new accident model for engineering safer systems. Safety Science, 42(4), 237-270. doi:
10.1016/S0925-7535(03)00047-X
Leveson, N. G. (2004b). Role of Software in Spacecraft Accidents. Journal of Spacecraft and Rockets, 41(4),
564-575.
Mesarovic, M. D., Macko, D., & Takahara, Y. (1970). Theory of Hierarchical, Multilevel Systems. New York:
Academic Press.
Mogford, J. (2005). Fatal Accident Investigation Report: Isomerization Unit Explosion Final Report. Texas City,
TX.
NRC. (2000). Causes and significance of design basis issues at US nuclear power plants. Washington, DC: US
Nuclear Regulatory Commission, Office of Nuclear Regulatory Research.
Pate-Cornell, M. E. (1993). Learning from the Piper Alpha Accident: A Postmortem Analysis of Technical and
Organizational Factors. Risk Analysis, 13(2), 215-232.
Perrow, C. (1999). Normal Accidents: Living with High-Risk Technologies. Princeton, NJ: Princeton University
Press.
Rasmussen, J. (1997). Risk management in a dynamic society: A modelling problem. Safety Science, 27(2-3),
183-213. doi: 10.1016/S0925-7535(97)00052-0
Rasmussen, N. (1975). Reactor safety study: an assessment of accident risks in US nuclear power plants.
Washington, DC: US Nuclear Regulatory Commission.
Reason, J. T. (1997). Managing the Risks of Organizational Accidents. Vermont: Ashgate.
Rochlin, G. I., La Porte, T. R., & Roberts, K. H. (1987). The Self-designing High Reliability Organization. Naval
War College Review, 40(4), 76-90.
Saleh, J. H., Marais, K., Bakolas, E., & Cowlagi, R. V. (2011). Highlights from the literature on accident causation
and system safety: Review of major ideas, recent contributions, and challenges. Reliability Engineering and
System Safety, 95(11), 1105-1116.
Saleh, J. H., Marais, K. B., & Favaro, F. M. (2014). System Safety Principles: A Multidisciplinary Engineering
Perspective. Journal of Loss Prevention in the Process Industries, 29, 283-294.
Sorensen, J. N., Apostolakis, G. E., Kress, T. S., & Powers, D. A. (1999, Aug 22-26). On the Role of Defense-in-
Depth in Risk-informed Regulation. Paper presented at the PSA '99 International Topical Meeting on
Probabilistic Safety Assessment, Washington, DC.
Sorensen, J. N., Apostolakis, G. E., & Powers, D. A. (2000). On the role of safety culture in risk-informed
regulation. PSAM 5: Probabilistic Safety Assessment and Management, Vols. 1-4(34), 2205-2210.
Turner, B. A. (1978). Man-made Disasters. London: Wykeham Publications.
Weick, K. E., & Sutcliffe, K. M. (2007). Managing the Unexpected: Resilient Performance in an Age of Uncertainty
(2nd ed.). San Francisco, CA: Jossey-Bass.
Weinberg, G. M. (1975). An Introduction to General Systems Thinking. Dorset House.
Zhong, H., & Wonham, W. M. (1990). On the Consistency of Hierarchical Supervision in Discrete-Event Systems.
IEEE Transactions on Automatic Control, 35(10), 1125-1134.