
Applying Systems Thinking to

Safety in Complex, Socio-Technical

Systems

Prof. Nancy Leveson

Aeronautics and Astronautics

Engineering Systems

Complex Systems Research Lab

MIT

Topics

• Need for a new accident (loss) causality model

• A causality model based on systems theory

• Example: Causal analysis of an E. coli water

contamination incident

• Uses and applications

Chain-of-Events Model

• Explains accidents in terms of multiple events,

sequenced as a forward chain over time.

– Simple, direct relationship between events in chain

– Omits non-linear, indirect relationships (e.g., feedback)

• Events chosen almost always involve component failure,

human error, or energy-related event

• Forms basis for engineering safety analysis

e.g., FTA, FMECA, HAZOP, etc.

And design for safety

e.g. redundancy, safety margins, barriers and interlocks

Chain-of-events example

Limitations of Chain-of-Events Model

• Human error

• Social and organizational factors in accidents

• Adaptation

– Systems are continually changing

– Systems and organizations migrate toward accidents

(states of high risk) under cost and productivity

pressures in an aggressive, competitive environment

• Accidents without component failure

Human Error

• Human error

– Defined as deviation from normative procedures, but

operators always deviate from standard procedures

• Normative vs. effective procedures

• Sometimes violation of rules has prevented accidents

– Cannot effectively model human behavior by

decomposing it into individual decisions and acts and

studying it in isolation from

• Physical and social context

• Value system in which it takes place

• Dynamic work process

System’s Theoretic View of Safety

• Safety is an emergent system property

– Accidents arise from interactions among system

components (human, physical, social)

– That violate the constraints on safe component

behavior and interactions

• Losses are the result of complex processes, not

simply chains of failure events

• Most major accidents arise from a slow migration of

the entire system toward a state of high-risk

STAMP: System’s Theoretic Accident

Model and Processes (1)

• Views safety as a dynamic control problem rather than a

component failure problem, e.g.,

– MPL software did not adequately control descent speed

– O-ring did not control release of hot gases from Shuttle field joint

– Public health system did not adequately control contamination of

the milk supply with melamine

– Financial system did not adequately control the use of financial

instruments

• Events are the result of inadequate control

– Result from lack of enforcement of safety constraints

– Need to examine larger process and not just event chain

Example

Control

Structure

Controlling and managing dynamic systems

requires a model of the controlled process

[Figure: control loop. Labels: Controller (containing a Model of Process), Control Actions, Feedback, Controlled Process]

Process models must contain:

- Required relationship among

process variables

- Current state (values of

process variables)

- The ways the process can

change state
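The three ingredients listed above can be made concrete with a toy controller. The following is a minimal sketch, not taken from the slides: the tank, the valve, the class names `ProcessModel` and `Controller`, and all numeric values are hypothetical, chosen only to show a controller that decides from its model of the process and keeps that model current through feedback.

```python
from dataclasses import dataclass


@dataclass
class ProcessModel:
    """The controller's internal picture of the controlled process (hypothetical tank)."""
    level: float           # current state: believed value of the process variable
    max_safe_level: float  # required relationship: level must stay <= this limit
    fill_rate: float       # way the process changes state: level gained per step with valve open

    def predict(self, valve_open: bool) -> float:
        """Predicted level after one step under the given control action."""
        return self.level + (self.fill_rate if valve_open else 0.0)


class Controller:
    def __init__(self, model: ProcessModel) -> None:
        self.model = model

    def update(self, measured_level: float) -> None:
        """Feedback: bring the believed state in line with the measurement."""
        self.model.level = measured_level

    def decide(self) -> bool:
        """Control action: open the valve only if the predicted level stays within the limit."""
        return self.model.predict(valve_open=True) <= self.model.max_safe_level


def run(steps: int, feedback: bool) -> float:
    """Simulate the tank for a number of steps and return the true final level."""
    true_level = 0.0
    controller = Controller(ProcessModel(level=0.0, max_safe_level=10.0, fill_rate=3.0))
    for _ in range(steps):
        if feedback:
            controller.update(true_level)
        if controller.decide():
            true_level += 3.0  # the real process follows the same dynamics as the model
    return true_level


if __name__ == "__main__":
    print(run(6, feedback=True))   # model tracks the process
    print(run(6, feedback=False))  # model never updated
```

In this sketch every component behaves as written in both runs; the only difference is whether feedback keeps the model's current state consistent with the real process. Without feedback the controller keeps opening the valve and the true level passes the assumed limit of 10.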

STAMP: System’s Theoretic Accident

Model and Processes (2)

• To understand accidents, need to examine control

structure to determine

• Why it was inadequate to enforce safety constraints

• Why events occurred

• To prevent accidents, need to create a control structure

that will enforce the safety constraints

• Critical paradigm change:

Prevent failures → Enforce safety constraints

Accident Causality

• Accidents occur when

– Control structure or control actions do not enforce safety constraints

• Unhandled environmental disturbances or conditions

• Unhandled or uncontrolled component failures

• Dysfunctional (unsafe) interactions among components

– Control structure degrades over time (asynchronous evolution)

– Control actions inadequately coordinated among multiple controllers

Dysfunctional Controller Interactions

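[Figure: two cases of dysfunctional controller interaction.

Boundary areas: Controller 1 controls Process 1 and Controller 2 controls Process 2.

Overlap areas (side effects of decisions and control actions): Controller 1 and Controller 2 both control a single Process.]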

Uncoordinated “Control Agents”

• “SAFE STATE”: the ATC control agent provides coordinated instructions to both planes

• “SAFE STATE”: the TCAS control agent provides coordinated instructions to both planes

• “UNSAFE STATE”: both TCAS and ATC provide uncoordinated and independent instructions (no coordination between the two control agents)
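The three states above can be reproduced in a few lines. This is an illustrative sketch under stated assumptions: the maneuver names, the specific instructions each agent issues, and the rule for which plane obeys which agent are all hypothetical, and the safety constraint is reduced to "the two planes must not make the same vertical maneuver".

```python
from typing import Dict

CLIMB, DESCEND = "climb", "descend"

# Hypothetical instructions: each control agent, taken alone, is internally coordinated.
INSTRUCTIONS: Dict[str, Dict[str, str]] = {
    "TCAS": {"plane_a": CLIMB, "plane_b": DESCEND},
    "ATC": {"plane_a": DESCEND, "plane_b": CLIMB},
}


def resolve(obeys: Dict[str, str]) -> Dict[str, str]:
    """Return the maneuver each plane flies, given which agent each plane obeys."""
    return {plane: INSTRUCTIONS[agent][plane] for plane, agent in obeys.items()}


def is_safe(maneuvers: Dict[str, str]) -> bool:
    """Toy safety constraint: the planes must move in opposite vertical directions."""
    return maneuvers["plane_a"] != maneuvers["plane_b"]


if __name__ == "__main__":
    for obeys in (
        {"plane_a": "ATC", "plane_b": "ATC"},    # ATC alone
        {"plane_a": "TCAS", "plane_b": "TCAS"},  # TCAS alone
        {"plane_a": "ATC", "plane_b": "TCAS"},   # both agents, no coordination
    ):
        print(obeys, "safe" if is_safe(resolve(obeys)) else "UNSAFE")
```

Neither agent issues a wrong instruction in the third case; the constraint is violated only because the two sets of instructions are combined without coordination.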

Uses for STAMP

• Basis for new, more powerful hazard analysis techniques (STPA)

• Safety-driven design (physical, operational, organizational)

• More comprehensive accident/incident investigation and root

cause analysis

• Organizational and cultural risk analysis

– Identifying physical and project risks

– Defining safety metrics and performance audits

– Designing and evaluating potential policy and structural

improvements

– Identifying leading indicators of increasing risk (“canary in the coal

mine”)

• New holistic approaches to security

Current Uses (1)

• Safety analysis of new missile defense system (MDA)

• Designing for safety in new JPL outer planets explorer

(NASA)

• Incorporating risk into early trade studies (NASA)

• Analysis of the management structure of the space shuttle

program (post-Columbia) (NASA)

• Risk management in the development of NASA’s new

manned space program (Constellation) (NASA)

• NASA Mission Control: re-planning and changing

procedures safely (NSF)

Current Uses (2)

• Accident/incident analysis (BP)

• New hazard analysis technique for the petrochemical

industry (BP?)

• Analysis and prevention of corporate fraud

• Safety of maglev trains (Japan Central Railway)

• Food safety

• Safety in pharmaceutical drug development (NSF)

• Medical malpractice

• Risk analysis of outpatient (ambulatory) GI surgery at Beth

Israel Deaconess Hospital (NIH/AHRQ)

Ballistic Missile Defense System (BMDS)

Non-Advocate Safety Assessment using STPA

• A layered defense to defeat all ranges of threats in all phases of flight (boost, mid-course, and terminal)

• Made up of many existing systems (BMDS Element)

– Early warning radars

– Aegis

– Ground-Based Midcourse Defense (GMD)

– Command and Control Battle Management and Communications (C2BMC)

– Others

• MDA used STPA to evaluate (prior to deployment and test) the residual safety risk of inadvertent launch

© Copyright Nancy Leveson, Aug. 2006


Safety Control Structure Diagram for FMIS


Results

• Deployment and testing held up for 6 months because so many scenarios identified for inadvertent launch (the only hazard considered so far). In many of these scenarios:

– All components were operating exactly as intended

– Complexity of component interactions led to unanticipated system behavior

• STPA also identified component failures that could cause inadequate control (most analysis techniques consider only these failure events)

• As changes are made to the system, the differences are assessed by updating the control structure diagrams and assessment analysis templates.

• Adopted as primary safety approach for BMDS


Safety-driven Model-based

System Engineering Methodology for

an Outer Planets Explorer Spacecraft

• Top-down specification and analysis of a deep space exploration

mission system with a “Deep Dive” into the area of communications

antenna pointing (important hazard for these types of systems)

– Specification encompassed all aspects of the mission system (i.e.,

spacecraft, launch vehicle, ground network, etc.) to the extent that they

informed the specific hazard selected

• Generated requirements and design constraints from hazards and

STPA hazard analysis to guide design decisions as they are made.

– Results included traceability from hazards to design decisions,

documentation of design rationale, and formal executable models

of system design

– Performed trade studies to evaluate safety of various design

options

Example: Early System Architecture

Trades for Space Exploration

• Part of an MIT/Draper Labs contract with NASA

• Wanted to include risk, but little information available

• Not possible to evaluate likelihood when no design information available

• Can consider severity by using worst-case analysis associated with specific hazards.

• Developed three-step process:

– Identify system-level hazards and associated severities

– Identify mitigation strategies and associated impact

– Calculate safety/risk metrics for each architecture
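The slides do not give the metric itself, so the following is only a sketch of how the three steps could fit together. Everything in it is an assumption made for illustration: the hazard names, the 1 to 4 severity scale, the mitigation fractions, the architecture names, and the `residual_severity` formula (worst-case severity scaled by the unmitigated fraction and summed over hazards).

```python
from typing import Dict, List, Tuple

# Step 1 (hypothetical data): system-level hazards and their worst-case severities.
SEVERITY: Dict[str, int] = {
    "loss_of_crew": 4,
    "loss_of_mission": 3,
    "equipment_damage": 1,
}

# Step 2 (hypothetical data): for each architecture, the fraction of each hazard's
# severity that its mitigation strategies are judged to remove (0 = none, 1 = all).
MITIGATION: Dict[str, Dict[str, float]] = {
    "architecture_a": {"loss_of_crew": 0.5},
    "architecture_b": {"loss_of_crew": 0.25, "loss_of_mission": 0.5, "equipment_damage": 1.0},
}


def residual_severity(mitigation: Dict[str, float]) -> float:
    """Step 3 (assumed metric): sum of severity times unmitigated fraction."""
    return sum(sev * (1.0 - mitigation.get(hazard, 0.0)) for hazard, sev in SEVERITY.items())


def rank_architectures() -> List[Tuple[str, float]]:
    """Return architectures ordered from lowest to highest residual severity."""
    scores = {name: residual_severity(m) for name, m in MITIGATION.items()}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, score in rank_architectures():
        print(f"{name}: {score}")
```

No likelihood appears anywhere in the calculation, which matches the constraint that only worst-case severity can be considered before design information exists.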


Risk Analysis of NASA ITA

1. Preliminary Hazard Analysis

2. Modeling the ITA Safety Control Structure

3. Mapping Requirements to Responsibilities

4. Detailed Hazard Analysis using STPA

• System hazards

• System safety requirements and constraints

• Roles and responsibilities

• Feedback mechanisms

• Gap analysis

• System risks (inadequate controls)

5. Categorizing & Analyzing Risks

6. System Dynamics Modeling and Analysis

7. Findings and Recommendations

• Immediate and longer term risks

• Sensitivity

• Leading indicators

• Risk Factors

• Policy

• Structure

• Leading indicators and measures of effectiveness


Successful vs. Unsuccessful ITA

Implementation

Identification of Lagging vs.

Leading Indicators

• Number of waivers issued is a good

indicator but lags the rapid increase

in risk

• Number of incidents under investigation is

a better leading indicator

Managing Tradeoffs Among Risks

• Good risk management requires understanding

tradeoffs among

– Schedule

– Cost

– Performance

– Safety

• Demonstrated how to do this using modeling and

analysis for NASA’s new manned space program to

return to the moon and go on to Mars

STAMP-SD model of NASA

Exploration Systems Mission

[Figure: system dynamics model diagram. Labeled quantities include Design Work Remaining and Design Work Completed; Pending, Completed, and Abandoned Technology Development Tasks; Pending and Completed Hazard Analyses; Hazard Analyses used and unused in Design Decisions; Design Schedule Pressure from Management; Fraction of HAs Too Late to Influence Design; Average Hazard Analysis Quality; Design Work Completed with Undiscovered Safety and Integration Problems; Flaw Discovery Rate; Unplanned Rework Decision Rate; Design Work with Accepted Problems or Unsatisfied Requirements; Additional Operations Cost for Safety and Integration Workaround; Efficacy of Safety Assurance (SMA); Safety Assurance Resources; Incentives to Report Flaws; Efficacy of System Integration; System Performance; Safety of Operational System.]
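A few of the quantities named in the diagram labels above (Design Work Remaining, Design Work Completed, work completed with undiscovered flaws, Flaw Discovery Rate, rework) can be wired into a very small stock-and-flow simulation. This is a toy sketch, not the STAMP-SD model: the three-stock structure, the parameter names, and every numeric value are assumptions, and the integration is a plain fixed-step update.

```python
from typing import Tuple


def simulate(
    steps: int,
    flaw_fraction: float,
    completion_fraction: float = 0.1,  # assumed share of remaining work finished per step
    discovery_fraction: float = 0.2,   # assumed share of hidden flaws found per step
) -> Tuple[float, float, float]:
    """Return (remaining, clean, flawed) work after the given number of steps."""
    remaining = 100.0  # stock: design work remaining
    clean = 0.0        # stock: design work completed without flaws
    flawed = 0.0       # stock: design work completed with undiscovered flaws
    for _ in range(steps):
        completion = completion_fraction * remaining  # flow: design task completion rate
        discovery = discovery_fraction * flawed       # flow: flaw discovery rate
        remaining += discovery - completion           # discovered flaws return as rework
        clean += (1.0 - flaw_fraction) * completion
        flawed += flaw_fraction * completion - discovery
    return remaining, clean, flawed


if __name__ == "__main__":
    for fraction in (0.0, 0.3):
        remaining, clean, flawed = simulate(24, fraction)
        print(f"flaw_fraction={fraction}: remaining={remaining:.1f}, "
              f"clean={clean:.1f}, flawed={flawed:.1f}")
```

Total work is conserved across the three stocks, so the effect of a higher assumed flaw fraction shows up as less clean completed work and more work cycling back through rework.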

NASA ESMD Workforce Planning

Simulation varied:

• Initial experience distribution of ESMD civil servant workforce

• Maximum civil servant hiring rates

• Transfers from Shuttle ops during Shuttle retirement

Important Issues:

- Increase in retirements

- Hiring limits

- Transfers

Example: Schedule Pressure and Safety Priority

in Developing the Shuttle Replacement

1. Overly aggressive schedule

enforcement has little effect on

completion time (<2%) and cost,

but has a large negative impact

on safety

2. Priority of safety activities has a

large positive impact, including a

positive cost impact (less

rework)

[Figure: Relative Cost plotted against Schedule Pressure and Safety Priority]

Work in Progress: Pharma Safety

• Modeling development dynamics of new drugs, including

factors impacting drug safety, benefits, and profitability

• Our focus is modeling and understanding how drug

safety is impacted by pharma, regulatory and medical

organizational structure and culture and how these

factors interact

Potential General Pharma Topics

• Factors underlying the pipeline slowdown (including organizational

and social factors)

• Long-term safety impact of alternative organizational and social

structures

• Impact of regulatory decision making on

– Drug development strategies

– Pharmacovigilance

– Safety

– Cost and profits

• Potential impact of other changes such as EHR

Modeling Risk in Complex

Healthcare Settings

• Goal: To identify ways unnecessary risk can be mitigated without impacting overall performance

• Testbed: Ambulatory/outpatient (GI) surgical care at BIDMC

– Technologically rich environment with

• Significant human-machine interface issues

• Dynamically changing patient status

• Outcomes depend heavily on success with which humans recognize and respond appropriately to changes

– Drivers

• Cost-containment pressures

• Desire to deliver care more efficiently

– Tensions among safety, productivity, efficiency lead to pressures to waive safety controls

Healthcare (cont’d)

• Designed a series of experiments to study complex

interactions among:

– Production pressures

– Historical experience with adverse outcomes

– Inherent risk tolerance/propensity

– Confidence in and compliance with safety controls

The aim is to identify how these interactions drive the system

beyond acceptable safety thresholds.

• Paradigm change for healthcare: assess risk in terms

of processes rather than outcomes

Differences from Traditional Approaches

• More comprehensive view of causality

• A top-down systems approach to preventing losses

• Includes organizational, social, and cultural aspects

of risk

• Treats accidents as dynamic processes

• Includes human decision making and mental models

• Handles much more complex systems than traditional

safety analysis approaches
