A systems approach to risk management through leading safety indicators☆

Nancy Leveson*

    Aeronautics and Astronautics, Engineering Systems, MIT, Cambridge, MA, USA

Article info

Article history: Received 22 April 2014; Received in revised form 5 October 2014; Accepted 8 October 2014; Available online 18 October 2014

Keywords: Leading indicators; System safety; Process safety; STAMP; STPA; Risk management

Abstract

The goal of leading indicators for safety is to identify the potential for an accident before it occurs. Past efforts have focused on identifying general leading indicators, such as maintenance backlog, that apply widely in an industry or even across industries. Other recommendations produce more system-specific leading indicators, but start from system hazard analysis and thus are limited by the causes considered by the traditional hazard analysis techniques. Most rely on quantitative metrics, often based on probabilistic risk assessments. This paper describes a new and different approach to identifying system-specific leading indicators and provides guidance in designing a risk management structure to generate, monitor, and use the results. The approach is based on the STAMP (System-Theoretic Accident Model and Processes) model of accident causation and tools that have been designed to build on that model. STAMP extends current accident causality to include more complex causes than simply component failures and chains of failure events or deviations from operational expectations. It incorporates basic principles of systems thinking and is based on systems theory rather than traditional reliability theory.

© 2014 Elsevier Ltd. All rights reserved.

    1. Introduction

There are always warning signs before a major accident, but these signs may only be noticeable or interpretable as a leading indicator in hindsight. In fact, most major accidents have multiple precursors and cues that an accident is likely to happen. Before an accident, such “weak signals” are often perceived only as noise. The problem then becomes how to distinguish the important signals from the noise. Defining effective leading indicators is a way to accomplish this goal by providing specific clues that can be monitored.

There is commonly a belief—or perhaps, hope—that a small number of general “leading indicators” can identify increasing risk of an accident. While some general indicators may be useful, large amounts of effort over decades have not provided much progress [1]. The lack of progress may be a sign that such general, industry-wide indicators do not exist or will not be particularly effective in identifying increasing risk. An alternative, which is the focus of this paper, is to identify leading indicators that are specific to the system being monitored.

Underlying and justifying the use of leading indicators is a belief that most major accidents do not result simply from a unique set of proximal, physical events but from the migration of the organization to a state of heightened risk over time as safeguards and controls are relaxed due to conflicting goals and tradeoffs [2]. If this belief is correct, there should be ways to detect evidence of this migration and intervene before a loss occurs.

As an example, consider the accidental release of methyl isocyanate (MIC) from the Union Carbide plant in Bhopal, India, in 1984, one of the worst industrial accidents in history. Almost all the factors involved at Bhopal existed before the actual triggering event that led directly to the loss. The plant was losing money. In response, Union Carbide had ordered that costs be reduced, without considering how these cuts might conflict with safety. Requirements in the operating manual, such as never filling the tanks more than half their volume, the use of safety equipment for potentially hazardous operations, and the operation of a refrigeration unit to keep the MIC at a safe temperature, were not followed. In fact, when the refrigeration unit was turned off (most likely to save money), the high temperature alarm threshold was raised correspondingly, which eliminated the possibility of an early warning of rising temperatures. Valves leaked, and gauges frequently were inaccurate or out of order. Maintenance procedures were severely cut back, and critical jobs were left unfilled in shifts when someone called in sick.


☆ This research was partially supported by a research grant from BP.

* Correspondence address: Room 33-334, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA. Tel.: +1 617 258 0505. E-mail address: [email protected]


A review and audit two years before had noted that many of the safety devices, such as alarms, the flare tower and the gas scrubber, were inoperable or inadequate. Most of the specific practices leading directly to the accident, such as filter-cleaning operations without using slip blinds, leaking valves, bad pressure gauges, etc., were noted in the report and never fixed. Union Carbide did not follow up to ensure the deficiencies were corrected. Qualifications of personnel went down. Training and oversight were reduced. A similar accident had occurred the year before at the plant under circumstances where the results were less severe (one person was killed), but nothing was done about fixing the hazardous operation of the plant. Given this state of the plant and its operations, some events were bound to occur that would trigger an accident.

While the events and practices at Bhopal were strikingly bad, in hindsight nearly every major accident has involved a similar migration toward the accident over time that potentially could have been detected and the accident prevented. These changes are often ignored in accident reports, which tend to concentrate on proximal events. The challenge in preventing accidents is to try to prevent and, if unsuccessful, detect migration toward a state of unacceptable risk before an accident occurs.

But detection alone is not enough—there must be a management process in place to act when the leading indicators show that action is necessary. Note that at Bhopal there had been an audit report showing the conditions existed, but they were never adequately addressed.

The process of tracking leading indicators of increasing risk, where that process is embedded within an effective risk management structure, can play an important role in preventing accidents, but a way to derive effective leading indicators is required. The signs are not always as clear as at Bhopal, and, of course, we cannot wait until hindsight shows us what we should have noted before the loss occurred.

This paper proposes an approach to identifying and monitoring system-specific leading indicators and provides guidance in designing a risk management structure to use such indicators effectively. In contrast to the usual ad hoc approach to leading indicators, the paper suggests a formal foundation and structured process for identifying them. It also includes suggestions for operationalizing and managing a leading indicator program.

The approach is based on a new model of accident causation called STAMP and on tools that have been designed to build on that model [3,4]. STAMP extends current accident causality models to include more complex causes than simply component failures and chains of failure events. It incorporates basic principles of systems thinking and is based on systems theory rather than traditional reliability theory.

While the subject of the paper is limited to identifying leading indicators related to safety and accidents, the ideas apply to leading indicators and risk management for system properties other than safety.

    2. Background

There has been much industrial effort devoted to developing leading indicators, as well as academic interest in precursors. The problems in assessing risk, which arise in determining what precursors to check, are also relevant.

    2.1. Leading indicators

Much effort has been spent on trying to identify leading indicators, particularly in the petrochemical industry. Almost all of the past effort has involved finding a set of generally applicable metrics or signals that presage an accident. Examples of such identified leading indicators are quality and backlog of maintenance, inspection, and corrective action; minor incidents such as leaks or spills; equipment failure rates; and so on. Some depend on surveys about employee culture and beliefs, with the underlying assumption that all or most accidents are caused by employee misbehavior, and include as leading indicators such culture aspects as safety awareness, mutual trust, empowerment, and promotion of safety [5].

A large number of proposals for leading indicators outside the petrochemical industry focus on occupational safety rather than system safety, and some are simply a listing of potential hazards, such as lack of safety training; whether there is a lock-out, tag-out policy or a stop-work policy; and whether there are medical facilities on site [6]. In fact, the BP Grangemouth Major Incident Investigation Report suggested that industries may have a false sense of safety performance due to their focus on managing personal safety rates rather than process safety¹ [7].

As a result of major accidents in the chemical industry, a concerted and long-term effort has been devoted to identifying leading indicators of risk. Khawaji [1] provides a comprehensive description of these efforts. To summarize Khawaji’s analysis, early attempts to develop process safety performance metrics (leading indicators) date from the mid-1900s, but attempts accelerated after the Grangemouth report recommended that “companies should develop key performance indicators for major hazards and ensure that process safety performance is monitored” [7].

A series of documents have been issued since that time by the AIChE [8–11], OECD [12,13], UK HSE [14], OSHA [15], IEC [16], Step Change in Safety [17], and the API [18,19]. Most of these standards recommend that the identification of leading indicators start from the hazard analysis, but they assume that accidents are caused by a linear chain of events and do not address indirect interactions and complex systemic factors in accidents [1]. Most assume that accidents are caused by component failures and that likelihood of failures should be used to reduce the scope of the search for leading indicators, despite the fact that likelihood may often be unknown and the practice may result in overlooking low-likelihood events.

Beyond these industrial efforts, a large number of research papers have been written about identifying precursors to accidents. The proposals generally can be divided into those that consider technical or organizational precursors.

On the technical side, many people have suggested using incident reporting systems to identify precursors, for example [20–22]. The information could come from a root cause analysis that identifies the events that led up to the specific loss or near miss that occurred. A limitation is that only those events that have occurred will be identified, and usually simple chains of failure events are the only precursors identified. Most root cause analysis techniques used widely are limited in the factors they can identify.

Another common suggestion is to use probabilistic risk analysis to detect and analyze precursor events. A leading proponent of this approach is Paté-Cornell [23].

A third general approach to identifying technical precursors is to use Hazard Analysis, for example [24]. The power of the hazard analysis to identify scenarios leading to losses will impact the effectiveness of the approach. Most current hazard analysis techniques focus on component failures and do not handle software requirements flaws, system design errors, or the role of operators in accidents very well, and they usually ignore management and sophisticated errors in decision making.

¹ While the term “system safety” is common in most industries, the same thing is called “process safety” in the process industries. The more general term is used in this paper as the approach being described applies in any industry.


Kongvik [25] provides a nice summary of the suggested approaches to identifying organizational precursors to accidents. Many of these try to identify a small number of common dimensions or general factors (say 5 or 6) that are applicable to every organization [26,27]. Some emphasize the input obtained from safety surveys and risk analyses [26].

Many of the organizational precursor proposals apply quantitative risk analysis, for example fault trees and Bayesian networks, to try to quantify the effect of safety management systems on risk, for example [28,29]. Oien uses what he calls an organizational risk influence model using Bayesian networks [30]. The human and technical factors in the causal chain are used to calculate conditional probabilities. The approach leans heavily on expert judgment, as do many of the others.

Kongvik [25] has proposed a qualitative description of organizational risk controlling systems as well as some quantitative measures. Investigated accidents are used to identify causal factors, as are task analyses of critical operations associated with a hazard. Again the goal appears to be finding general factors relevant for everyone in an industry.

While the processes for identifying technical precursors use a common but perhaps overly simple model of accident causation, the organizational approaches suffer from not having a model that specifies the causes, content, and consequences of safety culture/climate [25]. Hudson and others describe the problem as a lack of a theoretically coherent framework for how and why accidents happen [31].

    2.2. Risk and risk assessment

Risk is usually defined as the severity of an event combined with the probability or likelihood of that event occurring. Combining the severity of all loss events associated with a system with their estimated future probability or likelihood provides a metric for risk of loss associated with the system itself. This definition is widely used and accepted.
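As a rough illustration of this definition (the symbols R, s_i, and p_i are introduced here and are not from the paper), the combined metric is often written as a sum over the identified loss events:

```latex
% Illustrative formalization of "severity combined with likelihood":
% s_i is the severity of loss event i and p_i is its estimated future likelihood.
R = \sum_{i} s_i \, p_i
```

The difficulties discussed in the rest of this section concern almost entirely the p_i terms rather than the s_i terms.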

Risk is used in engineering in many ways, including determining how much effort and resources should be applied during development of an engineered system to prevent particular types of hazards or loss events in order to provide an acceptable level of risk in the operational system. It can also be used to select what events or conditions should be monitored during use of the system to prevent losses that occur despite the efforts applied during development. It is the latter topic that is the subject of this paper. There is usually a relationship between the two because events that are considered too unlikely to bother about during development will often not be assigned a high priority during operations, although there may be exceptions.

Determining severity, or worst case losses, is usually not difficult. The problems arise in estimating the likelihood of events in the future. While these estimates may be informed estimates, they are always estimates. Handling risk effectively and preventing accidents is often highly dependent on their accuracy. If the system design and use in the future are very similar to those in the past and adequate historical data is available, estimating probability is not subject to much controversy. If new designs or new technology is involved or usage may differ from the past, likelihood estimates are more difficult to establish.

There is very little scientific data validating probabilistic risk assessment or evaluating the methods for calculating it, particularly for complex engineered systems that include software and humans [32–34]. An obvious problem is that risk involves predicting the future, so validation would require long periods of time. There have been some studies comparing probabilistic risk assessments performed by different groups on the same system where the results indicated large differences in the frequencies calculated for the event [32,35]. Further problems were identified by Rae and Nicholson, i.e., that there can be significant divergence between the modeled system and the as-built system, interactions between the social and technical parts of the system may invalidate the technical assumptions underlying the probabilistic analysis, and the effectiveness of the mitigation measures may change over time [32].

Many major, well-known (as well as lesser-known) accidents have occurred in systems where the probability of an accident was previously calculated to be 10⁻⁹ or less, including Chernobyl, Fukushima, Texas City, Deepwater Horizon, the Therac-25, Challenger, and Columbia, to name but a few. Follensbee, in a report skeptical of the use of probabilistic risk assessment, cites five large transport aircraft accidents and one near accident where the calculated probabilities were 10⁻⁹ or less [36]. In several of these cases, the need for specific protection against the loss events was judged not necessary based on these probability calculations. Boeing has had recent poor experiences with estimates of risk in lithium-ion batteries on the B-787, where the likelihood assumed was one in 10,000,000 flight hours but the actual occurrence was twice in the first 52,000 [37].
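A back-of-the-envelope check using only the figures quoted above shows the size of the gap between the assumed and observed rates; the short calculation below is illustrative and not from the paper:

```python
# Rough comparison of the assumed vs. observed B-787 battery event rates quoted above.
# The figures come from the text; the calculation itself is only illustrative.
assumed_rate = 1 / 10_000_000   # assumed: one event per 10,000,000 flight hours
observed_rate = 2 / 52_000      # observed: two events in the first 52,000 flight hours

print(f"assumed rate:  {assumed_rate:.2e} events per flight hour")
print(f"observed rate: {observed_rate:.2e} events per flight hour")
print(f"observed/assumed ratio: {observed_rate / assumed_rate:.0f}x")
# The observed rate is roughly 385 times the assumed rate.
```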

Why are likelihood estimates inaccurate in practice? Often important causal factors are omitted, such as operator error, flawed decision making, and sometimes software, for which probability estimates of unsafe behavior are difficult (and perhaps impossible) to determine. Humans usually play a major role in accidents, but human error when supervising complex, usually automated systems is not quantifiable (although that has not stopped some from trying). When technology changes or conditions differ from the past, historical experience is not available. The other problem is that many of the errors involved in the leading indicator events are not stochastic, and therefore the likelihood cannot be estimated using a probability.

The reason these limitations are important in identifying leading indicators is that an unlimited number cannot be checked, so there needs to be a way to determine which are the most important. There is usually some type of selection or judgment involved. A non-probabilistic way to make these decisions is suggested in this paper.

A problem common to both probabilistic and non-probabilistic estimates of risk is heuristic biases. Psychologists have written extensively about the biases inherent in assessing risk, e.g., [38–43]. These biases may have an impact on the leading indicators we design and how we react to them. For example, confirmation bias is the name given to the tendency of people to pay more attention to information that supports their views than to evidence that conflicts with them. So people tend to be overconfident in the accuracy of their forecasts, tending to deny uncertainty and vulnerability.

Another common bias is called the availability heuristic: people tend to base likelihood judgments of an event on the ease with which instances or occurrences of that or similar events can be brought to mind. While this heuristic may often be a reasonable one to use, it can also lead to systematic bias. For example, psychologists have found that judgments of the risk of various hazards or events will tend to be correlated with how often they are mentioned in the news media.

A third bias occurs when people think about future events whose likelihood cannot be based on past historical rates. They will often construct their own simple causal scenarios of how the event could occur, using the difficulty of producing reasons for an event’s occurrence as an indicator of the event’s likelihood. If no plausible cause or scenario comes to mind, an assumption may be made that the event is impossible or highly unlikely.

People also have difficulty predicting cumulative causes. They tend to identify simple, dramatic events rather than causes that are chronic or cumulative. Dramatic changes are given a relatively high probability or likelihood, whereas a change resulting from a slow shift in social attitudes is more difficult to imagine and thus is given a lower probability. At the same time, the conjunction fallacy says that an outcome paired with a likely cause is often judged to be more probable than the outcome alone even though this conclusion violates the laws of probability.

A further bias is caused by an incomplete search for possible causes. Searches are often stopped once one possible cause or explanation for an event has been identified. If that first possible cause is not very compelling, stopping the search can mean that other, more plausible and compelling, causes are not identified and likelihood is underestimated.

A final common psychological bias is called defensive avoidance. This type of bias may be reflected in the rejection or downgrading of the accuracy of leading indicators or in people’s inability to take them seriously or to accept that risk may be increasing. Defensive avoidance is based on the common psychological tendency to rationalize and avoid consideration of a topic that is stressful or conflicts with other pressing goals.

In addition to these psychological biases, organizational culture and politics can cause likelihood and vulnerability to be underestimated or overestimated.

Successful creation and use of leading indicators will require ways to control the psychological biases involved in assessing risk. Suggestions for achieving this goal are included in this paper.

    3. Assumption-based leading indicators

The basic hypothesis in this paper is that useful leading indicators can be identified based on the assumptions underlying our safety engineering practices and on the vulnerability of those assumptions, rather than on the likelihood of loss events. All engineering involves assumptions about the behavior of the operational system and its components and the environment in which the system will operate. The goal of a leading indicators program is to monitor the assumptions upon which the safety of the system was assured, both to find assumptions that originally were incorrect and those that have become incorrect over time. As will be shown, the assumptions considered go beyond simply mechanical or structural assumptions but include social and managerial ones. The rest of this paper describes how to create an assumption-based leading indicator program.

The idea of assumptions being the basis for identifying leading indicators was originally proposed for risk management programs outside of engineering. RAND developed the methodology of assumption-based planning (ABP) primarily to assist U.S. Army clients with mid- and long-term defense planning and to reduce uncertainty and manage risk [44]. Some terminology and ideas from ABP are used in the leading indicator process being proposed in this paper.

    3.1. Basic concepts and definitions

    Some basic terminology is needed.

Leading indicator: A warning sign that can be used in monitoring a safety-critical process to detect when a safety-related assumption is broken or dangerously weak and that action is required to prevent an accident. Alternatively, a leading indicator is a warning signal that the validity or vulnerability of an assumption is changing.

Shaping actions: Actions intended to maintain assumptions, to prevent hazards, and to control migration to states of higher risk. These are essentially actions taken during the design of the physical system or the safety control structure to prevent hazards and to prevent the violation of the assumptions underlying the analysis and design. In control theory terms, these provide feedforward control and are built into the physical and organizational safety controls, either originally or later added in response to an accident or serious incident. Physical examples might be an interlock to ensure that two events occur in a particular sequence or the use of a desiccant to prevent moisture that could lead to corrosion in a tank or pipe. For human behavior, shaping actions may be to design the operation of a safety control action to be easy to perform and difficult to omit. A final example of a shaping action is the design of operational procedures to be followed under various types of conditions and following hypothesized events, such as creating an evacuation plan.

Hedging (contingency) actions: Actions that prepare for the possibility that an assumption will fail. Hedging actions come from thinking through a possible scenario (hazard analysis) in which the assumption collapses and asking what might be done now to prepare for that scenario. Some responses will be shaping actions. Others will involve situations where shaping actions are deemed impossible or impractical or the shaping actions are unsuccessful. Another way of saying this is that the hazard analysis generates scenarios from broken assumptions (worst case analysis) to identify hedging actions that might be taken. In control theory and STAMP terminology, hedging actions involve feedback control during system operation using set points that maintain safety constraints. Examples of feedback include performance audits to determine whether the system and the safety controls are operating as designed and operators are following designed procedures.

Signposts: Points in the unfolding future where changes in the current safety controls (shaping and hedging actions) may be necessary or advisable. In essence, they involve planning for monitoring and responding to particular identified changes in the assumptions underlying the safety controls. For example, new construction or known future changes in the system or in the environment may trigger a planned response. Signposts may be established to trigger standard Management of Change procedures common in industry.

Assumption checking: The process of checking whether the assumptions underlying the safety design are still valid. Assumption checking differs from signposts in that signposts are identified during the design and development process and specific responses created and specified. In assumption checking, risk managers and controllers monitor the system during the operation of the plant and ask whether the assumptions are still valid. Such monitoring might focus on signposts or perhaps just changes and failures of assumptions that have not been adequately handled by shaping and hedging actions.

Accidents often occur after some type of change [35]. While signposts can be used for planned or expected changes, assumption checking is useful for detecting unplanned and potentially unsafe change.
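One way to operationalize these definitions is to keep, for each safety-related assumption, an explicit record that links it to its shaping actions, hedging actions, signposts, and checks. The sketch below is hypothetical (the paper does not prescribe any data format); the class, field, and example names are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Assumption:
    """A safety-related assumption and the actions and monitoring attached to it (illustrative)."""
    statement: str                                                   # the assumption underlying the safety design
    shaping_actions: List[str] = field(default_factory=list)         # feedforward: prevent the violation
    hedging_actions: List[str] = field(default_factory=list)         # prepare for the assumption failing
    signposts: List[str] = field(default_factory=list)               # planned triggers, e.g., Management of Change events
    checks: List[Callable[[], bool]] = field(default_factory=list)   # assumption checking during operations

    def still_valid(self) -> bool:
        """Assumption checking: every registered check must still hold."""
        return all(check() for check in self.checks)

# Hypothetical example loosely based on the Bhopal discussion earlier in the paper.
mic_temperature = Assumption(
    statement="The refrigeration unit keeps stored MIC below its safe temperature limit.",
    shaping_actions=["Refrigeration unit installed and required to run continuously."],
    hedging_actions=["High-temperature alarm set at the original design threshold; emergency response plan."],
    signposts=["Any proposal to shut down or repurpose the refrigeration unit triggers Management of Change."],
    checks=[lambda: True],  # placeholder: e.g., query whether the unit is running and the alarm threshold is unchanged
)
```

In these terms, a leading indicator corresponds to a check or signpost whose result warns that the assumption is broken or weakening.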

    3.2. Characteristics of a good leading indicators process

In some organizations, the desire to predict the future leads to collecting a large amount of information based on the hope that something will be obtained that is useful. The NASA Space Shuttle program was collecting 600 metrics a month, for example, right before the loss of the Columbia, none of which turned out to be helpful in predicting the loss or identifying the clear migration of the program to states of increasing risk [45].

A structured process may provide a more effective set of leading indicators than an ad hoc process. There are several goals for such a process and for the resulting set of leading indicators:

• Complete: All critical assumptions leading to an accident are identified. Of course, no process is perfect, but that does not negate the goal of aiming for perfection. Because completeness may mean that a very large set of leading indicators is identified, a process for determining what should be checked, how, and when will be a critical part of the leading indicators program.

• Consistent: Inconsistencies in the assumptions underlying the leading indicators need to be identified and handled. Inconsistency may indicate a flawed safety design process.

• Effective: The indicators should appropriately address the underlying assumptions, uncertainties, and vulnerabilities and accurately evaluate risk.

• Traceable: Each leading indicator and the action attached to it (see Section 5) should be identified as a response to one or more assumptions (a simple mechanical check of this property is sketched after this list).

• Minimal: There should be no extraneous assumptions, checks, or actions that are not necessary to prevent accidents.

• Continually improving: The design of the leading indicators program should be continually updated over time in response to feedback about its effectiveness.

• Unbiased: The leading indicator process should minimize (combat) standard biases in risk assessment and management.
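Continuing the hypothetical record structure sketched in Section 3.1, several of these goals, particularly traceability, minimality, and completeness, can be checked mechanically by comparing the indicator set against the registry of monitored assumptions. The function and example data below are illustrative only:

```python
from typing import Dict, List, Set

def audit_indicator_set(indicator_to_assumptions: Dict[str, List[str]],
                        monitored_assumptions: Set[str]) -> None:
    """Flag untraceable indicators and unmonitored assumptions (illustrative check only)."""
    for indicator, assumptions in indicator_to_assumptions.items():
        if not assumptions:
            print(f"Not traceable: '{indicator}' is tied to no assumption (candidate for removal).")

    covered = {a for assumptions in indicator_to_assumptions.values() for a in assumptions}
    for assumption in monitored_assumptions - covered:
        print(f"Completeness gap: assumption '{assumption}' has no leading indicator.")

# Hypothetical usage with invented indicator and assumption names:
audit_indicator_set(
    indicator_to_assumptions={
        "high-temperature alarm threshold unchanged": ["MIC stored below safe temperature"],
        "monthly count of overdue maintenance items": [],
    },
    monitored_assumptions={"MIC stored below safe temperature",
                           "operators follow tank fill limits"},
)
```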

Removing or significantly reducing biases is perhaps the most difficult of these characteristics to achieve. While heuristic biases can never be totally eliminated, there are ways to reduce them. One way is to use a structured method for identifying, detecting, and managing leading indicators. Following a structured process can diminish the power of our biases and encourage us to do a more thorough search. Biases may also have an impact on decisions about which leading indicators to use and in recognizing the changes that do occur and accepting that the leading indicator is in fact accurately predicting increased risk. A structured process and rules to follow can potentially combat these biases too.

In addition to using a structured process, biases can be controlled by concentrating on plausibility (vulnerability) rather than likelihood. That is, thinking about whether an assumption could fail to hold in a given way, not whether it is likely to do so, and concentrating on causal mechanisms rather than likelihoods. Anything that could happen within the expected lifetime of the system should be accorded serious attention as a vulnerability. Vulnerability is discussed further in the next section.

A final way to control biases is to use worst-case analysis. Worst-case thinking can assist in deterring people from concentrating on the more likely but usually less severe consequences of events or ignoring cases completely due to confirmation bias. The nuclear industry, for example, has traditionally used design basis accidents in analyzing safety. According to the U.S. Nuclear Regulatory Agency [46], a design basis accident is “a postulated accident that a nuclear facility must be designed and built to withstand without loss to the systems, structures, and components necessary to ensure public health and safety.” In contrast, a beyond design basis accident is defined as an accident sequence that is possible but is not fully considered in the design process because it is judged to be too unlikely. Defining design basis accidents (and thus by inference beyond design basis accidents) is obviously subject to heuristic biases. At Fukushima, for example, the sea wall was built to withstand likely events, which was reasonable. Worst case analysis, however, which is common outside the nuclear industry, would in addition try to protect against the worst case, for example, designing the plant so that a breach of the sea wall (which was possible even if considered highly unlikely) would not lead to losing all power in the plant and a potential catastrophe. At the least, if an accident scenario is not fully incorporated into the design process and that scenario is not impossible, then it needs to be considered for inclusion in a leading indicators program.

    3.3. The starting point: Assumptions about why accidents occur

Despite much effort to avoid them, accidents still occur. Theoretically, if we design a safe system, that is, eliminate or adequately control or mitigate all the hazards, and nothing changes, then we should not have accidents. The problem is that neither of these conditions is usually true in practice: no engineering process is perfect, nor is human behavior. In addition, every system and its environment are subject to change over time. The starting point in seeking more effective leading indicators is with the assumptions about why accidents occur.

3.3.1. A general categorization of the causes of accidents

The causes for accidents may arise in technical system development or in physical operations and may reflect management and cultural deficiencies, in both development and operations. Often several or all of these types of causes can be found in accident scenarios. The following list describes the way accident causes can arise in each of these three areas:

Development and implementation

• Inadequate hazard analysis: Assumptions about the system hazards or the process used to identify them do not hold.
  ○ HA is not performed or is not completed.
  ○ Some hazards are not identified due to inadequacies in the hazard analysis process or in how it is performed.
  ○ Hazards are identified but they are not handled because they are assumed to be “sufficiently unlikely”.
  ○ HA is incomplete, that is, important causes are omitted and therefore not handled.
• Inadequate design of control and mitigation measures for the identified hazards, possibly due to inadequate engineering knowledge or to inappropriate assumptions about operations.
• Inadequate construction of control and mitigation measures.

Operations

• Controls that designers assumed would exist during operations are not adequately implemented or used.
• Controls are implemented, but changes over time violate the assumptions underlying the original design of the controls.
  ○ New hazards arise with changing conditions, were not anticipated during design and development, or were dismissed as unlikely to occur.
  ○ Physical controls and mitigation measures degrade over time in ways not accounted for in the analysis and design process.
  ○ Components (including humans) behave differently over time (violate assumptions made during design and analysis).
  ○ The system environment changes over time (violates assumptions made during design and analysis).

Management

• The safety management system design is flawed.
• The safety management system does not operate the way it was designed (assumed) to operate. While there may be many reasons for misbehavior, one general cause is that the safety culture, i.e., the goals and values of the organization with respect to safety, degrades over time. In addition, the behavior of those making safety-related decisions may be influenced by competitive, financial or other pressures.


To prevent accidents, we must eliminate or reduce the occurrence of these causes. The response may be in the form of shaping or hedging actions. A leading indicators program can be used to attempt to detect them before an accident occurs.

Some of these causes involve engineering errors, for example, inadequate design of physical control mechanisms. Simple calculation or knowledge errors may be involved, but incorrect assumptions can also play an important role. The common engineering design assumption about independence when redundancy is used to protect against failures is an example. Consider the Macondo (Deepwater Horizon) blowout preventer. There was redundancy in the means to control a potential blowout, but the redundant units contained a common cause failure mode. Acceptance of the belief that blowout preventers never failed was widespread in the industry despite the fact that ineffective blowout preventers had previously contributed to several serious accidents (a heuristic bias?) [47,48]. Challenger is another example of a common cause failure, only this time the assumption about the independence of the O-rings was checked and invalidated scientifically many years before the fatal Challenger flight, but the change was never documented in the Marshall Space Center database where the launch decision was made [45]. Given the very large number of accidents that have involved common mode/cause failure [35], it appears to be an important assumption to revisit, as are others that have contributed to many accidents.

The goals and values of those participating in an industry or organization, i.e., the safety culture, constitute an important assumption that, when wrong, can be a major factor in accidents and must be reflected in the set of leading indicators. For example, a safety policy is a basic requirement for every company or organization to communicate the desired safety culture and behavior expected of individuals. There must be a way to measure how well that policy is being followed and whether adherence changes over time. Assumptions about management behavior and decision making are also commonly found to be violated after accidents occur and must be monitored.

Sometimes the safety-related assumptions underlying the system or operational design process hold originally but become untrue due to changes over time. The world is constantly changing, especially human behavior, and, as noted, major accidents are usually preceded by the migration of the system to a state of unrecognized high risk. Using the terminology developed so far, that migration basically involves moving to states where the assumptions used during the design and development of the system and the shaping actions are violated. So even if a good job in terms of hazard analysis and design for safety has been done during development and manufacturing, there will still, almost inevitably, be accidents. This potential for migration toward the violation of the assumptions underlying the safety of the system needs to be reflected in the set of leading indicators. For example, operators may start to take shortcuts or turn off safety devices in order to operate more efficiently or, in an air traffic control system, the airspace may become more crowded than originally considered in the system design.

3.3.2. Vulnerability vs. likelihood

The evaluation of “likelihood” when assessing risk is a key assumption that, if incorrect, can lead to accidents that might have been prevented. Too often, there is no scientific basis for making such assumptions about likelihood, and occasionally politics intervenes. After accidents, it is common to find that the hazard involved had been identified but not controlled because it was deemed too unlikely to occur [35].

Instead of trying to predict the likelihood that an event will occur or an assumption will fail, the similar but different concept of vulnerability can be used. Vulnerability in the world of assumption-based planning involves assessing whether an assumption could plausibly fail during the lifetime of the system, not the specific probability of that happening. Trying to assess the exact probability of software issuing an unsafe command or operators making a mistake or changing their behavior is impossible in a world of imperfect engineering and changing human behavior.

If an assumption is vulnerable, then it makes no sense not to protect against it in some way. While some events may not be technically impossible, they may be implausible enough that for practical reasons they are treated as such. An example is a meteorite hitting an aircraft. Such an event has never occurred in the past and nothing has changed to alter the vulnerability, that is, to increase the number of meteorites penetrating our atmosphere or to make our aircraft more vulnerable to them. Note, however, that vulnerability may change over time. Part of a leading indicator program involves identifying when vulnerability has changed from that previously assumed.

The potential for reducing biases by using a structured process was described in Section 3.2. In addition to using a structured process, biases can be controlled by concentrating on plausibility and causal mechanisms rather than likelihood. That is, thinking about whether an assumption could fail to hold in a given way, not whether it is likely to do so, and concentrating on causal mechanisms rather than likelihoods, i.e., what are the potential scenarios that could lead to the failure of the assumption. Anything that could happen within the expected lifetime of the system should be accorded serious attention as vulnerable. Ensuring the assumptions are revisited occasionally and checks are instituted for failures of the assumptions is not expensive.

It can reasonably be argued that likelihood is actually being considered here, using plausibility as a proxy. The difference is in the potential for error in the two types of estimates. Vulnerability does not require assigning a probability, such as 3.5E−7, or even relative (but usually undefined or poorly defined) categories such as those often used in a risk matrix. Examples of such categories are frequent, probable, occasional, remote, improbable, and impossible. The difference is that instead of trying to assign a numerical likelihood estimate or one of a set of poorly defined categories, only two categories, possible and impossible, are used. That is, if the likelihood is not zero, then the assumption needs to be considered for inclusion in the leading indicators program. That conclusion does not necessarily imply that costly controls will be used, but the hazard cannot be dismissed at the beginning of development (during preliminary hazard analysis) and never considered again—at least until the first accident caused by it.
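Stated as a screening rule, the contrast with a conventional risk matrix is simple; the sketch below is illustrative and not a prescribed implementation, and its function and parameter names are invented:

```python
def include_in_leading_indicator_program(assumption: str, failure_is_plausible: bool) -> bool:
    """Vulnerability-based screening with only two categories: possible and impossible.

    If failure of the assumption is not impossible within the system's expected lifetime,
    the assumption is kept as a candidate for the leading indicators program; no numerical
    likelihood (e.g., 3.5E-7) and no fuzzy category (frequent, probable, occasional, ...)
    is assigned.
    """
    return failure_is_plausible

# Illustrative examples drawn from this section:
print(include_in_leading_indicator_program("redundant blowout preventer channels are independent", True))   # True: include
print(include_in_leading_indicator_program("a meteorite strikes the aircraft", False))                      # False: treat as implausible
```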

    4. Identifying safety-critical assumptions

How do we identify the safety-related assumptions that need to be included in a leading indicator program? This section describes a concrete process. The process is based on an accident causality model, STAMP (System-Theoretic Accident Model and Processes), that is more comprehensive than traditional models. The process also includes a new hazard analysis method, STPA (System-Theoretic Process Analysis), built on STAMP as a theoretical foundation. STPA can be used to identify safety-critical assumptions that can then form the basis for a leading indicator program.

    4.1. STAMP: The accident causality model underlying the process

Accident causality models underlie all our efforts to engineer for safety. Basically they provide an explanation for why accidents occur and impose patterns on accident causation. You may not be aware you are using one, but you are: our mental models of how accidents occur determine how we investigate and prevent accidents and the causal factors we identify.

Identifying leading indicators will necessarily be related to our assumptions about why accidents occur. Traditional accident causality models explain accidents in terms of a chain of directly related events that cause the accident. The events usually involve failures² of system components, and such causality models are built on reliability theory. Chain-of-failure-event models are limited in their ability to handle accidents in complex systems, organizational and managerial (social and cultural) factors in accidents, and the systemic causes of the events, most of which do not involve failures.

In contrast, STAMP is a new model of accident causation that extends the old models to include non-linear and indirect relationships and thus can better handle the levels of complexity and technical innovation in today’s systems [4]. STAMP is based on systems theory rather than reliability theory. A systems-theoretic model allows capturing the non-linear dynamics of interactions among system components and anticipating the risk-related consequences of change and adaptation over time.

In STAMP, accidents are conceived as resulting not simply from system component failures but more generally from interactions among system components (both physical and social) that violate system safety constraints. Examples of safety constraints are that a highly reactive chemical must be stored below a maximum temperature, pressure in a well must be within acceptable levels at all times, aircraft must maintain minimum separation when aloft, and a nuclear power plant must not discharge more than a minimum amount of radioactive materials into the environment. The constraints must be enforced in the operating process, and contingency action must be taken if the constraints are somehow violated.
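Safety constraints of this kind lend themselves to being written down as explicit, monitorable conditions on the process state. The sketch below is hypothetical; the field names and numeric limits are invented for illustration and are not values from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProcessState:
    """Snapshot of a controlled process (fields invented for illustration)."""
    chemical_temp_c: float
    well_pressure_kpa: float

# Safety constraints expressed as named predicates over the process state.
SAFETY_CONSTRAINTS = {
    "reactive chemical stored below maximum temperature": lambda s: s.chemical_temp_c < 40.0,
    "well pressure within acceptable limits": lambda s: s.well_pressure_kpa < 50_000.0,
}

def violated_constraints(state: ProcessState) -> List[str]:
    """Return the names of any safety constraints the current state violates."""
    return [name for name, holds in SAFETY_CONSTRAINTS.items() if not holds(state)]
```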

STAMP views process or system safety as a control problem, not a reliability problem: accidents occur when component failures, external disturbances, and/or potentially unsafe interactions among system components are not controlled (handled) adequately. The controls may be managerial, organizational, physical, operational, or manufacturing. Major accidents rarely result from a single root cause such as component failure or human error (which are symptoms rather than root causes) but from the inadequate enforcement of safety-related constraints on the development, design, construction, and operation of the entire socio-technical system. Often they can also be characterized as resulting from an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex and changing set of goals and values [2].

Safety-related constraints and the safety requirements are enforced by a safety control structure that must be carefully designed and evaluated to ensure that the controls are adequate to maintain the constraints on behavior necessary to control risk. Fig. 1 shows the safety control structure existing at the time of the Macondo well blowout. Each component has specific assigned responsibilities for maintaining the safety of the system, that is, enforcing the safety constraints and preventing losses. For example, the mud logger is responsible for creating a detailed record of a borehole by examining the contents of the circulating drilling medium, the cementer is responsible for properly sealing off a wellbore, and local management has responsibilities for overseeing that these and other activities are carried out properly and safely. The government oversight agency may be responsible for ensuring that safe practices are being followed and acceptable equipment being used. And so on. Higher level component responsibilities may be broader than those of the components below them but no less important in maintaining safety.

Major accidents are rarely the result of unsafe behavior by only one of the components but usually the result of unsafe behavior and interactions involving all or most of the components in the control structure. When accidents occur and they are investigated thoroughly, as was true for the Macondo well blowout, it almost always turns out that more than one component (and often all) did not fulfill its control responsibilities. Also, as shown in Fig. 1, more than one company may participate in the safety control structure, with the controllers of the components (whether part of their own company or another) having individual responsibilities for ensuring that the controlled processes or components are fulfilling their safety responsibilities. As long as the necessary constraints are enforced by someone or by several groups, different designs of the safety control structure can be equally effective.

Fig. 2 shows a more general example of a safety control structure with a focus on producing a product. This structure might be more typical for an oil refinery or aircraft manufacturer, where the government agency involved in safety oversight might be OSHA, EPA, or the FAA. As with Fig. 1, each component in the control structure has responsibility for controlling the behavior of some lower level components in the structure.

Between the levels of the safety control structure there are classic feedback control loops (Fig. 3): the controllers provide control actions to maintain a “set point,” in this case a set of safety constraints on the behavior of the controlled process. In turn the controllers get feedback from the controlled processes to assist in providing appropriate and effective control actions. Feedback may be direct from the physical process, such as sensors that provide information about the state of the well at that point in time, or may involve feedback from lower level controllers to higher level controllers to provide information about the current state of the safety control activities and the perceived level of risk.

Note that the use of the term “control” does not imply only managerial and operator controls. Physical component behavior and interactions among components can be controlled through the use of physical controls such as interlocks or through various types of barriers and fault tolerance features. Standard managerial controls may also be involved. In addition to physical and managerial controls, all behavior is influenced and at least partially “controlled” by the social and organizational context in which the behavior occurs. Control is provided not only by engineered systems and direct management intervention, but also indirectly by policies, procedures, shared values, and other aspects of the organizational culture, sometimes called the “safety culture.”

In order to provide effective control, the controller must have an accurate model of the process it is controlling (Fig. 3). For human controllers, this model is commonly called a mental model. For both automated and human controllers, the process model or mental model is used to determine what control actions are necessary to keep the system operating effectively.
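The loop in Fig. 3 can be caricatured in a few lines: the controller keeps a process model, updates it from feedback, and chooses control actions intended to keep the modeled state within the safety constraints. This is only a schematic sketch of the idea, not an implementation from the paper, and all names are invented:

```python
class Controller:
    """Schematic STAMP-style controller (illustrative only)."""

    def __init__(self, safety_constraints: dict):
        self.safety_constraints = safety_constraints  # the "set point": constraint -> corrective action
        self.process_model = {}                       # the controller's model of the controlled process

    def update_model(self, feedback: dict) -> None:
        # Feedback from sensors or lower-level controllers keeps the process model current.
        # Accidents often follow when this model diverges from the actual process state.
        self.process_model.update(feedback)

    def control_actions(self) -> list:
        # Issue the corrective action for any constraint that appears violated *according to the model*.
        return [action
                for constraint, action in self.safety_constraints.items()
                if not self.process_model.get(constraint, False)]

# Illustrative use: the controller believes the "annulus sealed" constraint holds,
# so it issues no corrective action, whether or not that belief matches reality.
controller = Controller({"annulus sealed before mud removal": "stop mud removal"})
controller.update_model({"annulus sealed before mud removal": True})
print(controller.control_actions())   # [] -- no action, because the process model says all is well
```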

The process model includes assumptions about how the controlled process operates and about the current state of the controlled process. Accidents in complex systems, particularly those related to software or human controllers, often result from inconsistencies between the model of the process used by the controller and the actual process state. The inconsistency contributes to the controller providing inadequate control. The local BP manager on Deepwater Horizon thought the cement had properly sealed the annulus and ordered the mud to be removed, the operators at Texas City thought the level of liquid in the isomerization unit was below the appropriate threshold, and the Mars Polar Lander software thought the spacecraft had landed and turned off the descent engines prematurely. Usually, these models of the controlled system become incorrect due to missing or inadequate feedback and communication channels. As examples, Deepwater Horizon had limited sensors to provide information about the state of the well; Texas City had no sensors above the maximum fill level of the tank; and the Mars Polar Lander software misinterpreted noise from the landing leg sensors that occurred when the parachute deployed.

² A standard definition of failure is “The nonperformance or inability of the system or component to perform its specified function for a specified time under specified environmental conditions.”

The effectiveness of the safety control structure in preventing accidents is greatly dependent on the accuracy of the information each controller has about the state of the controlled system, often in the form of feedback from the controlled process, although other sources of such information can and often do exist. Performance metrics and leading indicators of changes in the safety control structure are a form of feedback and can provide a means for measuring the risk in the current state of the process and the safety control structure. They provide important signals about the potential for an accident.

The identification of assumptions to support a leading indicators program starts from the causes of accidents in Section 3.3 and the STAMP accident causality model. In general, safety-critical assumptions involve:

1. Assumptions about the system hazards and the paths to (causes of) hazards. New hazards may arise or assumptions underlying the causal analysis of existing hazards may change.

2. Assumptions about the effectiveness of the controls, that is, the shaping and hedging actions, used to reduce or manage hazards. For example, the flare tower in a chemical plant may be sufficient to handle the maximum amount of gas released when the plant is designed, but changes in the plant or even new information about the hazards may invalidate these assumptions over time.

3. Assumptions about how the system will be operated and the environment (context) in which it will operate. For example, assumptions that the controls will be operating as assumed by the designers (e.g., refrigeration units would control the reactivity of the MIC at Bhopal). Assumptions about human behavior are particularly vulnerable, as humans tend to adapt their behavior over time.

4. Assumptions about the development environment and processes.

5. Assumptions about the organizational and societal safety control structure during operations, i.e., that it is working as designed, the design was adequate to ensure the system safety requirements are enforced, and the system controllers are fulfilling their safety responsibilities and operating as designed. For example, accident investigations often uncover the fact that some feedback and communication channels are broken or degraded and are not operating as assumed. Such assumptions include those about the state of the safety culture, for example, that the organizational safety policy is being followed.

6. Assumptions about vulnerability or severity in risk assessment that may change over time and thus require a redesign of the risk management and leading indicators system itself.

A process for identifying the first three technical (engineering) assumptions is described in Section 4.2, while identifying the last three managerial and organizational assumptions is discussed in Section 4.3.
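One possible way to make these six categories operational is to record each safety-critical assumption in a registry, keyed by category, so that leading indicators can later be attached to it. The data layout below is only an illustrative sketch, not a structure prescribed by STAMP; the enumeration values paraphrase the list above.

    # Sketch of a registry of safety-critical assumptions by the six categories above.
    from dataclasses import dataclass, field
    from enum import Enum, auto

    class AssumptionType(Enum):
        HAZARDS_AND_CAUSES = auto()         # 1. hazards and paths to hazards
        CONTROL_EFFECTIVENESS = auto()      # 2. effectiveness of controls
        OPERATION_AND_ENVIRONMENT = auto()  # 3. how and where the system is operated
        DEVELOPMENT_PROCESS = auto()        # 4. development environment and processes
        CONTROL_STRUCTURE = auto()          # 5. organizational/societal control structure
        RISK_ASSESSMENT = auto()            # 6. vulnerability/severity in risk assessment

    @dataclass
    class Assumption:
        ident: str
        text: str
        category: AssumptionType
        leading_indicators: list = field(default_factory=list)  # checks derived from it

    registry: dict[str, Assumption] = {}

    def record(a: Assumption) -> None:
        registry[a.ident] = a

    record(Assumption("EA3", "All aircraft have operating transponders.",
                      AssumptionType.OPERATION_AND_ENVIRONMENT))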

Fig. 1. The safety control structure existing at the time of the Macondo accident.

Fig. 2. An example of a generic safety control structure. This generic model would need to be instantiated for a particular hazard and control structure design.

Fig. 3. Every controller uses a model of the state of the controlled process to determine what control actions are needed.


4.2. Assumptions underlying engineering decisions

TCAS II (Traffic alert and Collision Avoidance System), called ACAS in Europe and referred to simply as TCAS in the rest of this paper, is used here as an example of identifying safety-related engineering assumptions that can be used to create leading indicators. TCAS is a device required on most commercial and some general aviation aircraft that assists in avoiding midair collisions. It was intended to be an independent backup to the normal Air Traffic Control (ATC) system and the pilot's "see and avoid" responsibilities by reading signals from air traffic control transponders on aircraft in its vicinity and analyzing them with respect to slant range and relative altitude. TCAS determines which aircraft represent potential collision threats and provides appropriate indications, called advisories, to the flight crew to assure proper separation. Two types of advisories can be issued: Resolution Advisories (RAs), which provide instructions to the pilots to ensure safe separation from nearby traffic in the vertical plane (horizontal maneuvers are planned but not yet implemented for TCAS/ACAS), and Traffic Advisories (TAs), which indicate the positions of intruding aircraft and may later cause resolution advisories to be displayed.

A sophisticated system engineering process should document the assumptions and rationale underlying design decisions when they are made, as that information is important in standard system engineering processes such as rework and upgrades. If assumptions and rationale have not already been specified, then a special effort will be needed to identify them.

Documentation is critical both in system engineering and in operating the systems that result. One way to integrate the assumptions and design rationale into the system engineering documentation is to use Intent Specifications [4,49], but other documentation methods can be used. Assumptions and design rationale are specified when needed to explain a decision or to record fundamental information on which the design is based. The specification statements quoted below are taken from an Intent Specification for TCAS II created by Leveson [50].

    The goals for TCAS are:

G1: Provide affordable and compatible collision avoidance system options for a broad spectrum of National Airspace System Users.

G2: Detect potential midair collisions with other aircraft in all meteorological conditions; throughout navigable airspace, including airspace not covered by ATC primary or secondary radar systems; and in the absence of ground equipment.

From the goals, a set of high-level system requirements can be derived, for example:

1.18: TCAS shall provide collision avoidance protection for any two aircraft closing horizontally at any rate up to 1200 knots and vertically up to 10,000 ft per minute.

Assumption: This requirement is derived from the assumption that commercial aircraft can operate up to 600 knots and 5000 ft per minute during vertical climb or controlled descent, and therefore two planes can close horizontally at up to 1200 knots and vertically at up to 10,000 fpm.

This assumption is an example of something that will need to be checked in the future to ensure that technological changes have not contradicted it and thus made vulnerable all the technical design decisions based on it (which can be identified by the traceability pointers in an intent specification).
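As a sketch of what such a check might look like, the assumption behind requirement 1.18 could be re-evaluated whenever fleet performance data are updated. Only the 600 knot and 5000 fpm figures come from the requirement; the function name and the data source for observed performance are hypothetical.

    # Sketch: periodically re-checking the assumption behind requirement 1.18.
    # The observed figures passed in are hypothetical placeholders; in practice they
    # would come from whoever tracks aircraft performance envelopes.
    ASSUMED_MAX_AIRCRAFT_SPEED_KT = 600      # per-aircraft horizontal speed assumed in 1.18
    ASSUMED_MAX_VERTICAL_RATE_FPM = 5000     # per-aircraft climb/descent rate assumed in 1.18

    def assumption_1_18_still_holds(max_observed_speed_kt: float,
                                    max_observed_vertical_rate_fpm: float) -> bool:
        """Return False if technology changes have invalidated the 1200 kt / 10,000 fpm
        closing-rate assumption, which would flag every design decision traced to it."""
        return (max_observed_speed_kt <= ASSUMED_MAX_AIRCRAFT_SPEED_KT and
                max_observed_vertical_rate_fpm <= ASSUMED_MAX_VERTICAL_RATE_FPM)

    if not assumption_1_18_still_holds(max_observed_speed_kt=640,
                                       max_observed_vertical_rate_fpm=4800):
        print("Leading indicator: assumption under requirement 1.18 violated; "
              "review all design decisions traced to it")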

Another system requirement is:

1.19.1: TCAS shall operate in en route and terminal areas with traffic densities up to 0.3 aircraft per square nautical mile (i.e., 24 aircraft within 5 nmi).

Assumption: Traffic density may increase to this level by 1990, and this will be the maximum density over the next 20 years.

Again, future aircraft performance limits may change or there may be significant changes in airspace management, such as reduced vertical separation or very different ways of handling air traffic. Many computations in TCAS are based on the assumption underlying requirement 1.19.1, and it needs to be monitored to trigger recomputation of safety parameters if it changes.
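A similar periodic check could watch the density assumption itself. In this hypothetical sketch the sample values are invented; the only fixed number is the 0.3 aircraft per square nautical mile limit from requirement 1.19.1.

    # Sketch: monitoring the traffic-density assumption under requirement 1.19.1.
    ASSUMED_MAX_DENSITY = 0.3   # aircraft per square nautical mile (requirement 1.19.1)

    def check_density_assumption(observed_densities: list[float]) -> bool:
        """Return True if the assumption still holds for all observed samples."""
        return all(d <= ASSUMED_MAX_DENSITY for d in observed_densities)

    quarterly_samples = [0.21, 0.27, 0.33]   # illustrative values only
    if not check_density_assumption(quarterly_samples):
        print("Leading indicator: observed traffic density exceeds the 1.19.1 assumption; "
              "recompute TCAS safety parameters that depend on it")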

Another type of assumption may be specified to explain a decision or to record fundamental information on which the design is based. For example, the design may be based on assumptions about the environment in which the system will operate. Examples from TCAS include:

EA1: High-integrity communications exist among aircraft.

EA2: The TCAS-equipped aircraft carries a Mode-S air traffic control transponder. (An aircraft transponder sends information that assists air traffic control in maintaining aircraft separation.)

EA3: All aircraft have operating transponders.

EA4: All aircraft have legal identification numbers.

EA5: Altitude information is available from intruding targets with a minimum precision of 100 ft.

EA6: The altimetry system that provides own aircraft pressure altitude to the TCAS equipment will satisfy the requirements in RTCA Standard …

EA7: Threat aircraft will not make an abrupt maneuver that thwarts the TCAS escape maneuver.

New technology and new types of aircraft integrated into controlled airspace could violate these assumptions. EA4 is an example of a non-technical assumption. Identification numbers are usually provided by the aviation authorities in each country. That assumption will need to be ensured by international agreement and monitored by some international agency. The assumption that aircraft have operating transponders (EA3) may be enforced by the airspace rules in a particular country and, again, must be ensured by some group. The truth of this assumption is critical, as TCAS will not display any aircraft without an operating transponder nor provide an RA. EA7 is an example of an assumption on the behavior of pilots and the air traffic control system and could also be violated by the introduction of unmanned or other types of new aircraft into the airspace.
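Assumptions such as EA3 and EA4 lend themselves to simple operational counts. The sketch below is one hypothetical way to turn encounter reports into a leading indicator; the record fields and example data are invented.

    # Sketch: turning environmental assumptions EA3 and EA4 into monitorable counts.
    # Any non-zero trend suggests an assumption TCAS depends on is being violated
    # in the operational airspace.
    def ea_violation_counts(encounters: list[dict]) -> dict[str, int]:
        """Count observed violations of EA3 (operating transponder) and EA4 (legal ID)."""
        return {
            "EA3_no_transponder": sum(1 for e in encounters if not e["transponder_on"]),
            "EA4_no_legal_id": sum(1 for e in encounters if not e["has_legal_id"]),
        }

    monthly_encounters = [   # illustrative data only
        {"transponder_on": True, "has_legal_id": True},
        {"transponder_on": False, "has_legal_id": True},
    ]
    counts = ea_violation_counts(monthly_encounters)
    if any(counts.values()):
        print("Leading indicator:", counts)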

Some assumptions may be imposed on the system by environmental requirements and constraints. Those assumptions may lead to restrictions on the use of the new system (which will require assumption checking) or may indicate the need for system safety and other analyses to determine the constraints that must be imposed on the system being created or on the larger encompassing system to ensure safety. Examples for TCAS include:

E1: The behavior or interaction of non-TCAS equipment with TCAS must not degrade the performance of the TCAS equipment or the performance of the equipment with which TCAS interacts.

E2: Among the aircraft environmental alerts, the hierarchy shall be: windshear has first priority, then the Ground Proximity Warning System (GPWS), then TCAS.

E3: The TCAS alerts and advisories must be independent of those using the master caution and warning system.



The process for identifying safety-critical assumptions also gets important input from system hazard analysis and the larger system engineering process within which such analyses are embedded. First, accidents for the system under consideration are defined. Without defining what is being considered as a loss, it is not possible to identify when a loss might be imminent. For TCAS, an important loss is a mid-air collision. Because TCAS can affect airworthiness and safety of flight in other ways, other types of losses also need to be considered, such as a collision with the ground (terrain).

Then the system hazards are identified. A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an accident (loss event). Hazards are defined as within the system boundaries, that is, within the control of the system designers. For TCAS, the hazards are:

H1: TCAS causes or contributes to a near midair collision (NMAC), defined as a pair of controlled aircraft violating minimum separation standards.

H2: TCAS causes or contributes to an aircraft coming too close to a fixed structure or natural terrain.

H3: TCAS causes or contributes to the pilot losing control over the aircraft.

H4: TCAS interferes with other safety-related aircraft systems (for example, ground proximity warning).

H5: TCAS interferes with the ground-based air traffic control system (e.g., transponder transmissions to the ground or radar or radio services).

H6: TCAS interferes with an ATC advisory that is safety-related (e.g., avoiding a restricted area or adverse weather conditions).

The first basic set of safety-critical assumptions is that hazards will not occur in a properly designed and operated system. Any occurrence of one of these hazards (even if an accident does not result) should trigger a complete review of the safety engineering process, in this case the process used to eliminate or mitigate TCAS hazards. Checking an assumption only after the hazard has occurred is, however, likely too late to prevent a loss; the identification of hazards serves as a starting point from which earlier checks can be derived by identifying the scenarios that can lead to a hazard.

Additional assumptions can be deduced even at this high level, for example, that there is a ground-based air traffic control system (which could change in the future) and that TCAS will not interfere with its operation. While hazards rarely change, new ones may be introduced when changes are made to the system, and the process used to handle them may be undermined.

Checks for the occurrence of hazards also provide important information about the adequacy of the hazard analysis process itself. The goal of hazard analysis and safety engineering is to identify hazards and then either eliminate or prevent them. If they cannot be prevented, then they need to be mitigated. Hazards that the engineers thought were eliminated or prevented should, of course, never occur. If they do, this event is an indication of flaws in the engineering process or perhaps in the assumptions made about the operational system, such as assumptions about pilot or air traffic controller behavior. It is not enough simply to fix the technical process; the holes in the development process that allowed hazardous behavior to occur need to be fixed as well.
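One way to institutionalize this idea is to treat any reported occurrence of H1 through H6 as an automatic trigger for reviewing the hazard analysis. In the sketch below the hazard descriptions are abbreviated paraphrases of the list above, the reporting function is a placeholder for whatever incident-reporting channel is actually in place, and the example occurrence is invented.

    # Sketch: any occurrence of a TCAS hazard, even without an accident, triggers a
    # review of the hazard analysis and safety engineering process.
    TCAS_HAZARDS = {
        "H1": "TCAS causes or contributes to an NMAC",
        "H2": "TCAS causes or contributes to approach to terrain or a fixed structure",
        "H3": "TCAS causes or contributes to loss of control",
        "H4": "TCAS interferes with other safety-related aircraft systems",
        "H5": "TCAS interferes with the ground-based ATC system",
        "H6": "TCAS interferes with a safety-related ATC advisory",
    }

    def report_hazard_occurrence(hazard_id: str, description: str) -> None:
        # A hazard that was supposedly eliminated or mitigated has occurred, so the
        # flaw is in the engineering process or its assumptions, not just the event.
        print(f"{hazard_id} ({TCAS_HAZARDS[hazard_id]}): {description}")
        print("ACTION: review the hazard analysis and the assumptions it rests on")

    report_hazard_occurrence("H4", "GPWS alert suppressed during a TCAS resolution advisory")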

Ideally, flaws in engineering practices or operational behavior assumptions will be identified by leading indicators before the actual hazardous states occur. This goal can be achieved by identifying the assumptions underlying the hazardous scenarios identified by hazard analysis. Most of the traditional hazard analysis methods are limited to failure events or simple scenarios. Clearly, more powerful hazard analysis techniques, which can generate a larger set of causal scenarios, will be preferable. The causal scenarios are used to design the controls and thus form the assumptions under which the controls are created.

STPA [4] is a new hazard analysis technique based on the theoretical STAMP accident causality model. It is basically a rigorous method for examining the control loops in the safety control structure to identify scenarios or paths to accidents. Because the STAMP framework extends current accident models and thus includes accidents caused by component failure(s), STPA can not only identify the hazard scenarios typically generated by the traditional hazard analysis methods, but it also includes those factors not included or poorly handled by them, such as software requirements errors, component interaction accidents, complex human decision-making errors, inadequate coordination among multiple controllers, and flawed management and regulatory decision making. STPA works not only on the physical system but on the management structure and organizational design, as discussed in the next section.

Fig. 4 shows some of the types of general flaws considered in an STPA analysis. The analysis is performed on a specification of the system's safety control structure and is broken into two steps in order to structure it more carefully.

The first step in STPA assists in identifying the safety control requirements and the unsafe, ineffective, and missing control actions that can lead to hazards. There are four types of unsafe control actions (a sketch classifying them follows the list):

1. An action required for safety is not provided or not followed, e.g., the operator does not close the intake valve when the tank is full.

2. An unsafe control action is provided, e.g., mud is removed before the well has been properly sealed.

3. A potentially safe control action is provided too early or too late, that is, at the wrong time or in the wrong sequence, e.g., the plant evacuation signal is delayed or, as in Bhopal, the operators do not investigate a reported leak until after the tea break.

4. A control action required for safety is stopped too soon or applied too long, e.g., a pilot continues a control action, such as ascending or descending, beyond the target altitude level, or a valve stops closing or opening before the valve is fully open or closed.
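The sketch referred to above simply records step-one results against these four types, with traceability back to a hazard. The two example entries paraphrase the valve and mud-removal examples in the list; the data layout is an illustrative convention, not a prescribed STPA format.

    # Sketch: recording step-one STPA results by the four unsafe-control-action types.
    from dataclasses import dataclass
    from enum import Enum, auto

    class UnsafeControlType(Enum):
        NOT_PROVIDED = auto()              # 1. required action not provided or not followed
        UNSAFE_PROVIDED = auto()           # 2. unsafe control action provided
        WRONG_TIMING_OR_ORDER = auto()     # 3. too early, too late, or out of sequence
        STOPPED_TOO_SOON_OR_LONG = auto()  # 4. stopped too soon or applied too long

    @dataclass
    class UnsafeControlAction:
        controller: str
        action: str
        uca_type: UnsafeControlType
        hazard: str                        # traceability back to the system hazard

    ucas = [
        UnsafeControlAction("Operator", "Close intake valve when tank is full",
                            UnsafeControlType.NOT_PROVIDED, "Tank overflow"),
        UnsafeControlAction("Well manager", "Remove mud before well is sealed",
                            UnsafeControlType.UNSAFE_PROVIDED, "Blowout"),
    ]
    for uca in ucas:
        print(f"{uca.controller}: {uca.action} [{uca.uca_type.name}] -> {uca.hazard}")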

After the potentially unsafe control actions are identified, the second step in STPA determines how these unsafe control actions could occur, that is, the scenarios that can lead to a hazardous system state or accident. It also identifies the scenarios where safe control is provided but the control actions are not executed correctly, perhaps because of a component failure in the controlled process.

The analysis begins by modeling the high-level TCAS operational safety control structure, shown in Fig. 5. This control structure should be created using the system requirements and hazards.

Each controller is assigned responsibilities involving the avoidance of hazards. Important safety-critical assumptions are related to these responsibilities and to whether the controllers are enforcing the safety constraints (responsibilities) that they have been assigned. The migration to states of higher risk commonly found in major accidents can be explained as a weakening of the enforcement of the safety constraints on the controlled process by the safety control structure.

A midair collision over Uberlingen, Germany, in 2002 demonstrates the role of assumptions about the operation of the control structure in safety [51]. There are three groups with potential responsibilities over the pilot's response to a potential NMAC: TCAS, the ground ATC, and the airline operations center. The latter provides the airline procedures for responding to TCAS alerts and trains the pilots in them. Clearly, any potential conflicts and coordination problems between these three controllers will need to be resolved in the overall air traffic management system. In the case of TCAS, the RA provided by TCAS was always to be followed in the case of conflicting advisories. The designers decided that, because there was no practical way at that time to downlink information to the ground controllers about any TCAS advisories that might have been issued for the crew, the pilot was to immediately implement the TCAS advisory and the co-pilot would transmit the TCAS alert information by radio to the ground ATC so that the ground air traffic controller would know the state of the airspace and the advisories being given. The airline would provide the appropriate procedures and training to implement this protocol.

Several important assumptions about how conflicting advisories would be handled were violated in the Uberlingen midair collision. For example, there were supposed to be two controllers in the ground ATC tower, the pilots were supposed to follow the TCAS maneuver when there was a conflict between the advisory provided by the ground ATC system and TCAS, and the airline operations center was assumed to be training pilots to follow TCAS in such a conflict situation. The first of these assumptions had been violated for a while at night in the Swiss air traffic control center handling the two aircraft at the time of the tragedy. It is unknown whether the second one had been violated previously, as that information was never checked. The third assumption, i.e., that the airline involved was training pilots to always follow TCAS when presented with conflicting advisories, also had not held for a long time, but apparently nobody had been given the responsibility for ensuring that such training was occurring, or they had not been exercising that responsibility. These incorrect assumptions about the operation of the control structure could have served as leading indicators that the designed control structure was degrading.

Fig. 4. General factors in unsafe control used to create causal scenarios for hazards.

Fig. 5. Safety control structure for TCAS.

In STAMP, accidents result from the violation of safety constraints on system operation. The system hazards form the basis for identifying the high-level system safety constraints that must be maintained for safety. Essentially, the system safety constraints are the system safety requirements. For example, H5 gives rise to the following system safety constraint:

SC.2: TCAS must not interfere with the ground ATC system or other aircraft transmissions to the ground ATC system (H5).

STAMP assumes that accidents occur when safety constraints on the behavior of the system are violated. STPA hazard analysis identifies scenarios leading to the violation of the safety constraints (and thus the hazards). These scenarios lead to more detailed safety constraints and requirements on the design and operation of the system. For example, STPA can be used to identify causes for the violation of SC.2. This information can then be refined into a more detailed safety constraint SC2.1:

SC2.1: The system design must not interfere with ground-based secondary surveillance radar, distance-measuring equipment channels, and with other radio services that operate in the 1030/1090 MHz frequency band (2.5.1).

The assumption underlying a safe design of TCAS is that such interference will never occur. If it does, then this is a leading indicator that the design or operation of the system is flawed.

Humans tend to change their behavior over time and to use automation in ways different from those originally intended by the designers, so assumptions about operator behavior provide an important source for identifying leading indicators. For example, H3 is that TCAS causes or contributes to the pilots losing control over the aircraft. Safety constraint SC.6, which is derived by STPA from H3, says:

SC.6: TCAS must not disrupt the pilot and ATC operations during critical phases of flight nor disrupt aircraft operation (H3, 2.2.3, 2.19, 2.24.2).

Besides identifying the related hazard from which this safety constraint was derived (in this case H3), the specification also points to features of the design (2.2.3, 2.19, and 2.24.2) used to control that hazard, i.e., to enforce SC.6. These controls also contain important assumptions that need to be checked. The most basic assumption is that these controls will be effective in preventing the hazardous scenario and that they are implemented correctly. For example, in the STPA analysis, one of the scenarios identified that could lead to the violation of SC.6 is that TCAS provides distracting resolution advisories while the pilot is on the ground or in the middle of taking off. A control was designed to prevent this scenario that allows the pilot to inhibit resolution advisories during critical phases of takeoff and landing:

SC6.1: The pilot of a TCAS-equipped aircraft must have the option to switch to the Traffic-Advisory-Only mode, where traffic advisories are displayed but display of resolution advisories is inhibited (2.2.3).

Assumption: This feature will be used only during takeoff or in final approach to parallel runways, when two aircraft are projected to come close to each other and TCAS would call for an evasive maneuver.

Addition of the control, i.e., the ability of the pilot to inhibit TCAS resolution advisories by switching to TA-only mode, creates another hazardous scenario that must be controlled through pilot procedures, training, etc. It also leads to another assumption that should be checked during operation of the system to ensure that pilots are not violating the assumption associated with SC6.1.
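A hypothetical operational audit of the SC6.1 assumption might compare recorded TA-only selections against the flight phases the designers assumed. The record fields, flight identifiers, and phase names below are invented for illustration.

    # Sketch: auditing the assumption attached to SC6.1, i.e., that pilots select
    # TA-only mode only during takeoff or final approach to parallel runways.
    ALLOWED_PHASES = {"TAKEOFF", "PARALLEL_APPROACH"}

    def ta_only_violations(flight_records: list[dict]) -> list[dict]:
        """Return records where TA-only mode was selected outside the assumed phases."""
        return [r for r in flight_records
                if r["ta_only_selected"] and r["flight_phase"] not in ALLOWED_PHASES]

    records = [  # illustrative data only
        {"flight": "A123", "ta_only_selected": True, "flight_phase": "CRUISE"},
        {"flight": "B456", "ta_only_selected": True, "flight_phase": "TAKEOFF"},
    ]
    violations = ta_only_violations(records)
    if violations:
        print(f"Leading indicator: {len(violations)} flight(s) used TA-only mode "
              "outside the phases assumed when SC6.1 was designed")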

Other examples of operational procedures that were created to eliminate or control hazardous scenarios are:

OP4: After the threat is resolved, the pilot shall return promptly and smoothly to his/her previously assigned flight path.

OP9: The pilot must not maneuver on the basis of a Traffic Advisory only.

Because these procedures were created to counter specific scenarios that were identified as leading to hazards, they represent a source of assumptions that should be checked to identify hazardous behavior that could lead to an accident.

As another example, in the Uberlingen accident there were additional causal factors not mentioned earlier. One was that maintenance was being performed on the ATC equipment at the time of the collision, which disabled the controller's aural conflict alert. If the controller had known it was not working, he could have adjusted his behavior. This type of causal factor can be controlled in operational procedures, in this case the procedures for performing maintenance while the ATC tower is still operating. An important assumption, of course, is that such procedures are being followed, and this assumption will need to be checked.

A final source for assumptions that can be used to identify leading indicators is limitations in the design of safety-related controls. These limitations should be documented, as they represent important information in the decision about whether the system should be deployed. Some limitations are related to the basic functional requirements, for example:

L4: TCAS does not currently indicate horizontal escape maneuvers and therefore does not (and is not intended to) increase horizontal separation.

Other limitations are related to the environmental assumptions, for example:

L1: TCAS provides no protection against aircraft without transponders or with nonoperational transponders (EA3).

L6: Aircraft performance limitations constrain the magnitude of the escape maneuver that the flight crew can safely execute in response to a resolution advisory. It is possible for these limitations to preclude a successful resolution of the conflict (H3, 2.38, 2.39).

L4: TCAS is dependent on the accuracy of the threat aircraft's reported altitude. Separation assurance may be degraded by errors in intruder pressure altitude as reported by the transponder of the intruder aircraft (EA5).

An example assumption related to L1 is that the operation of aircraft without transponders will be precluded in operations.

Limitations may relate to hazards or hazard causal factors that could not be completely eliminated or controlled in the design; thus they represent accepted risks.

L3: TCAS will not issue an advisory if it is turned on or enabled to issue resolution advisories in the middle of a conflict.

An implied assumption here is that pilots will, except under unusual circumstances, turn TCAS on before taking off, which can be checked in performance audits.

Finally, limitations may be related to problems encountered or tradeoffs made during system design. For example, TCAS has a high-level performance-monitoring requirement that led to the inclusion of a self-test function in the system design to determine whether TCAS is functioning correctly. The following system limitation relates to this self-test facility:

L9: Use by the pilot of the self-test function in flight will inhibit TCAS operation for up to 20 seconds, depending upon the number of targets being tracked. The ATC transponder will not function during some portion of the self-test sequence.


A safety-related assumption is that this behavior will be rare and therefore will not result in frequent periods of non-operation of TCAS and increased risk of an NMAC.

4.3. Management and organizational safety control structure assumptions

To provide a richer example for deriving the assumptions underlying the management and organizational components of the safety control structure, the Space Shuttle operational safety control structure is used. After the Columbia accident, an Independent Technical Authority (ITA) with safety responsibility was integrated into the existing NASA manned space program management structure.

As with the engineering design process, assessing the design and operational effectiveness of the higher levels of the safety control structure starts from the safety requirements, which are derived from the hazards. The hazard here will be broader than that for the technical process alone:

System hazard: Poor engineering and management decision making leading to the loss of a Space Shuttle or serious injury to the crew.

This hazard can be used to generate detailed requirements, such as:

SC1: Safety considerations must be first and foremost in technical decision making.

SC1a: State-of-the-art standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public.

SC1b: Safety-related technical decision making must be independent from programmatic considerations, including cost and schedule.

SC1c: Safety-related decision making must be based on correct, complete, and up-to-date information.

SC1d: Overall (final) decision making must include transparent and explicit consideration of both safety and programmatic concerns.

SC1e: NASA must provide for effective assessment and improvement in safety-related decision making.

See [4] for the complete list of organizational safety constraints for this management system.

These safety requirements/constraints should be manifested in the safety control structure. An example of the safety control structure for the Shuttle program created after the Space Shuttle Columbia loss is shown in Fig. 6.

The components have specific responsibilities with respect to maintaining safety in the Space Shuttle Program [4]. Part of the STPA analysis of the safety control structure involves ensuring that the high-level requirements/constraints are traceable to the component responsibilities and that adequate responsibility for ensuring the safety requirements has been allocated and designed into the structure.

The next step is to perform an STPA hazard analysis on the structure to identify basic management and organizational risks. For example, one of the system safety requirements is:

SC1a: State-of-the-art safety standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public.

The basic design of the safety control structure must enforce this constraint by allocating appropriate responsibilities for enforcing this requirement (safety constraint) and the feedback loops necessary to enforce it successfully.

As an example, the Chief Engineer is responsible for the technical standards and system safety requirements used in the Shuttle program and for all changes, variances, and waivers to the requirements. The control actions the Chief Engineer has available to implement this responsibility are:

• To develop, monitor, and maintain technical standards and policy.

• To establish or approve, in coordination with programs and projects, the technical requirements and ensure they are enforced and implemented in the programs and projects (i.e., to ensure the design is compliant with the requirements).

• To approve all changes to the initial technical requirements.

• To approve all variances (waivers, deviations, exceptions to the requirements).

• etc.

    Fig. 6. The NASA ITA control structure design.


Taking just the first of these, i.e., the control responsibility to develop, monitor, and maintain technical standards and policy, the potential unsafe control actions identified using STPA include:

1. General technical and safety standards are not created.

2. Inadequate standards and requirements are created.

3. Standards degrade over time due to external pressures to weaken them. The process for approving changes is flawed.

4. Standards are not changed over time as the environment and needs change.

Each of these violates an assumption underlying the design of the safety control structure and could provide important information if included in a leading indicator program.

As another example, the Chief Engineer cannot perform all the duties listed above with respect to technical standards, so he or she has a network of people in the hierarchy below to whom some of the responsibilities are delegated or "warranted." The Chief Engineer retains responsibility for ensuring that the warrant holders perform their duties adequately, as in any hierarchical management structure.

The Chief Engineer's responsibility to approve all variances and waivers to technical requirements is assigned to the System Technical Warrant Holder (STWH). The risks or potentially unsafe control actions of the STWH with respect to this responsibility are:

1. An unsafe engineering variance or waiver is approved.

2. Designs are approved without determining conformance with safety requirements. Waivers become routine.

3. Reviews and approvals take so long that ITA becomes a bottleneck. Mission achievement is threatened. Engineers start to ignore the need for approvals and work around the STWH in other ways.

In fact, all of these risks were part of the causal factors in the Space Shuttle Columbia loss. They were, unfortunately, not identified as such at the time [45]. Once again, they represent a migration of the state of the Shuttle program safety control structure over time due to financial and time pressures. The ITA program was designed to prevent a recurrence. But the same financial and political pressures that led to the loss still existed after Shuttle program operation resumed, and the assumptions underlying the design of the new ITA management structure should have been part of a leading indicators program.
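The STWH risks above suggest organizational leading indicators that can be computed from routine records, for example waiver volume and approval latency. The thresholds and function name in this sketch are hypothetical and would have to be set by the organization itself.

    # Sketch: tracking variance/waiver activity as organizational leading indicators
    # for the STWH risks listed above ("waivers become routine", approvals become a
    # bottleneck that engineers work around).
    def waiver_indicators(waivers_per_quarter: list[int],
                          mean_approval_days: float,
                          routine_threshold: int = 10,
                          latency_threshold_days: float = 30.0) -> list[str]:
        signals = []
        if len(waivers_per_quarter) >= 2 and waivers_per_quarter[-1] > waivers_per_quarter[0]:
            signals.append("waiver volume is trending upward")
        if waivers_per_quarter and waivers_per_quarter[-1] > routine_threshold:
            signals.append("waivers may be becoming routine")
        if mean_approval_days > latency_threshold_days:
            signals.append("approvals are slow; engineers may start working around the STWH")
        return signals

    print(waiver_indicators(waivers_per_quarter=[3, 6, 12], mean_approval_days=41.0))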

Important assumptions also arise from the safety culture. The safety culture is the set of values and deep cultural assumptions upon which safety-related actions are taken and decisions are made in the safety control structure. Although these assumptions are not directly reflected in the design of the structure, they are implicit in its operation because they affect the behavior of the human controllers and human-implemented controls and also underlie the basic requirements (such as SC1 above), i.e., that safety considerations will be first and foremost in decision making.

The core organizational safety values (the desired safety culture) should be documented, along with provisions for educating employees and obtaining their buy-in. A critical assumption underlying safety is that the safety culture is accepted and underlies decision making. The weakening or inaccuracy of this assumption is an important leading indicator that risk is increasing [4].

    4.4. Assumptions underlying coordination risks

The risks and assumptions underlying the safety control structure (at both the operational process and organizational control structure levels) may also involve coordination problems. Coordination risks arise when multiple people or groups control the same process. The types of unsafe interactions that may result include: (1) both controllers assume that the other is performing the control responsibilities and, as a result, nobody does, or (2) controllers provide conflicting control actions that have unintended side effects. Coordination problems among multiple controllers were a factor in the Uberlingen accident.

When similar responsibilities related to the same system requirements are assigned to multiple controllers, then the assumptions that the control structure designers made about how the actions would be coordinated need to be recorded and used in the leading indicators program.
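A simple consistency check over the documented control structure can flag exactly this situation: a responsibility shared by several controllers with no recorded coordination assumption to monitor. The table contents below are illustrative only.

    # Sketch: flagging responsibilities assigned to more than one controller without
    # a recorded coordination assumption, the situation behind the Uberlingen confusion.
    responsibility_map = {
        "Resolve NMAC threat": ["TCAS", "Ground ATC", "Airline operations center"],
        "Maintain aircraft separation": ["Ground ATC"],
    }
    coordination_assumptions = {
        # responsibility -> documented rule for how the controllers are coordinated
        "Resolve NMAC threat": "Pilot follows the TCAS RA when advisories conflict",
    }

    for responsibility, controllers in responsibility_map.items():
        if len(controllers) > 1 and responsibility not in coordination_assumptions:
            print(f"Coordination risk: '{responsibility}' is shared by {controllers} "
                  "with no recorded coordination assumption to monitor")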

    5. An assumption-based leading indicator program

Once safety-related assumptions are identified and documented, they can be used as the basis for a leading indicators program. Such a program has three aspects: (1) identifying appropriate and effective leading indicators, (2) creating a safety indicator monitoring program to use them, and (3) embedding this monitoring system within a risk management program. The next three sections propose a process for each of these in turn.

5.1. A structured process for identifying leading indicators based on STAMP

In general, the assumptions under which the system will be used and the environment in which the system will operate are of three types:

1. The models and assumptions used during initial decision making and design are correct.

2. The system will be constructed, operated, and maintained in the manner assumed by the designers.

3. The models and assumptions are not violated by changes in the system, such as workarounds or unauthorized changes in procedures, or by changes in the environment.

These assumptions are used to determine what should be checked, how, when, and what actions should be taken if the checking determines an assumption no longer holds.
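One hypothetical way to record the result is to attach to each assumption a check specification stating what to check, how, when, and the action to take when the check fails. All field names and values below are illustrative, not a format defined by the paper.

    # Sketch: a check specification attached to a recorded assumption.
    from dataclasses import dataclass

    @dataclass
    class AssumptionCheck:
        assumption: str       # the assumption being monitored
        what_to_check: str    # the observable signal (the leading indicator)
        how: str              # audit, sensor data, incident reports, survey, ...
        when: str             # continuous, per flight, quarterly, at every change, ...
        action_if_violated: str

    check = AssumptionCheck(
        assumption="EA3: all aircraft have operating transponders",
        what_to_check="count of encounters with non-transponder aircraft",
        how="operational radar and incident reports",
        when="monthly",
        action_if_violated="revisit the TCAS hazard analysis and airspace rules",
    )
    print(check)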

