Applying Systems Thinking to
Safety in Complex, Socio-Technical
Systems
Prof. Nancy Leveson
Aeronautics and Astronautics, Engineering Systems
Complex Systems Research Lab
MIT
Topics
• Need for a new accident (loss) causality model
• A causality model based on systems theory
• Example: Causal analysis of an E. coli water
contamination incident
• Uses and applications
Chain-of-Events Model
• Explains accidents in terms of multiple events,
sequenced as a forward chain over time.
– Simple, direct relationship between events in chain
– Omits non-linear, indirect relationships (e.g., feedback)
• Events chosen almost always involve component failure,
human error, or energy-related event
• Forms basis for engineering safety analysis
(e.g., FTA, FMECA, HAZOP) and design for safety
(e.g., redundancy, safety margins, barriers and interlocks)
Chain-of-events example
Limitations of Chain-of-Events Model
• Human error
• Social and organizational factors in accidents
• Adaptation
– Systems are continually changing
– Systems and organizations migrate toward accidents
(states of high risk) under cost and productivity
pressures in an aggressive, competitive environment
• Accidents without component failure
Human Error
• Human error
– Defined as deviation from normative procedures, but
operators always deviate from standard procedures
• Normative vs. effective procedures
• Sometimes violation of rules has prevented accidents
– Cannot effectively model human behavior by
decomposing it into individual decisions and acts and
studying it in isolation from
• Physical and social context
• Value system in which it takes place
• Dynamic work process
Systems-Theoretic View of Safety
• Safety is an emergent system property
– Accidents arise from interactions among system
components (human, physical, social)
– That violate the constraints on safe component
behavior and interactions
• Losses are the result of complex processes, not
simply chains of failure events
• Most major accidents arise from a slow migration of
the entire system toward a state of high-risk
STAMP: Systems-Theoretic Accident
Model and Processes (1)
• Views safety as a dynamic control problem rather than a
component failure problem, e.g.,
– MPL software did not adequately control descent speed
– O-ring did not control release of hot gases from Shuttle field joint
– Public health system did not adequately control contamination of
the milk supply with melamine
– Financial system did not adequately control the use of financial
instruments
• Events are the result of inadequate control
– Result from lack of enforcement of safety constraints
– Need to examine larger process and not just event chain
Example Control Structure
Controlling and managing dynamic systems
requires a model of the controlled process
[Figure: control loop. The Controller, which holds a Model of Process, sends Control Actions to the Controlled Process and receives Feedback from it.]
Process models must contain:
- Required relationship among
process variables
- Current state (values of
process variables)
- The ways the process can
change state
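The role of the process model can be illustrated with a minimal sketch. Everything in it is a hypothetical example that is not from the slides: a tank-filling process, the names `ProcessModel`, `Controller`, and `run`, and all numeric values. It shows the three required contents of a process model and what can happen when feedback is missing and the model drifts away from the real process, even though no component fails.

```python
from dataclasses import dataclass


@dataclass
class ProcessModel:
    """The controller's belief about the controlled process (hypothetical tank)."""
    level: float           # current state: believed value of the process variable
    max_safe_level: float  # required relationship: level must stay <= this bound
    fill_rate: float       # how the process changes state per step with the valve open

    def predict(self, valve_open: bool) -> float:
        """Predicted next level for a given control action."""
        return self.level + (self.fill_rate if valve_open else 0.0)


class Controller:
    def __init__(self, model: ProcessModel) -> None:
        self.model = model

    def update(self, measured_level: float) -> None:
        """Feedback: bring the process model in line with the measurement."""
        self.model.level = measured_level

    def decide(self) -> bool:
        """Control action: open the valve only if the predicted state is safe."""
        return self.model.predict(valve_open=True) <= self.model.max_safe_level


def run(steps: int, feedback: bool) -> float:
    """Simulate the loop and return the true final level of the tank."""
    true_level = 0.0
    true_fill_rate = 2.0  # assumption: the real tank fills faster than the model says
    controller = Controller(
        ProcessModel(level=0.0, max_safe_level=10.0, fill_rate=1.0)
    )
    for _ in range(steps):
        if feedback:
            controller.update(true_level)
        valve_open = controller.decide()
        if valve_open:
            true_level += true_fill_rate
        if not feedback:
            # With no feedback, the controller can only advance its own belief.
            controller.model.level = controller.model.predict(valve_open)
    return true_level


if __name__ == "__main__":
    print("with feedback:   ", run(steps=20, feedback=True))
    print("without feedback:", run(steps=20, feedback=False))
```

In this toy setup the controller with feedback stops at the safe bound, while the controller without feedback overfills the tank because its process model no longer matches the process. Both controllers execute exactly the same decision logic.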
STAMP: Systems-Theoretic Accident
Model and Processes (2)
• To understand accidents, need to examine control
structure to determine
• Why it was inadequate to enforce safety constraints
• Why events occurred
• To prevent accidents, need to create a control structure
that will enforce the safety constraints
• Critical paradigm change:
Prevent failures → Enforce safety constraints
Accident Causality
• Accidents occur when
– Control structure or control actions do not enforce safety constraints
• Unhandled environmental disturbances or conditions
• Unhandled or uncontrolled component failures
• Dysfunctional (unsafe) interactions among components
– Control structure degrades over time (asynchronous evolution)
– Control actions inadequately coordinated among multiple controllers
Dysfunctional Controller Interactions
• Boundary areas: Controller 1 controls Process 1 and Controller 2 controls Process 2
• Overlap areas (side effects of decisions and control actions): Controller 1 and Controller 2 both control the same Process
Uncoordinated “Control Agents”
• “SAFE STATE”: ATC provides coordinated instructions to both planes
• “SAFE STATE”: TCAS provides coordinated instructions to both planes
• “UNSAFE STATE”: both TCAS and ATC provide uncoordinated and independent instructions (no coordination between the two control agents)
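The three states can be sketched in a few lines. This is an illustrative toy, not a model of real ATC or TCAS logic: the plane names, the climb/descend assignments, and the functions `atc_instructions`, `tcas_instructions`, `is_safe`, and `uncoordinated` are all assumptions made for the example. The point it shows is that each control agent is safe on its own, while mixing instructions from both can violate the safety constraint.

```python
from typing import Dict

# Hypothetical vertical maneuver instructions.
CLIMB, DESCEND = "climb", "descend"


def atc_instructions() -> Dict[str, str]:
    """ATC alone separates the planes: one descends, the other climbs."""
    return {"plane_a": DESCEND, "plane_b": CLIMB}


def tcas_instructions() -> Dict[str, str]:
    """TCAS alone also separates them, here with the opposite assignment."""
    return {"plane_a": CLIMB, "plane_b": DESCEND}


def is_safe(maneuvers: Dict[str, str]) -> bool:
    """Safety constraint: the planes must move in opposite vertical directions."""
    return maneuvers["plane_a"] != maneuvers["plane_b"]


def uncoordinated(follows: Dict[str, str]) -> Dict[str, str]:
    """Each crew follows the agent it chooses; nothing reconciles the two agents."""
    sources = {"ATC": atc_instructions(), "TCAS": tcas_instructions()}
    return {plane: sources[agent][plane] for plane, agent in follows.items()}


if __name__ == "__main__":
    print(is_safe(atc_instructions()))   # both planes follow ATC
    print(is_safe(tcas_instructions()))  # both planes follow TCAS
    mixed = uncoordinated({"plane_a": "ATC", "plane_b": "TCAS"})
    print(mixed, is_safe(mixed))         # one follows ATC, the other TCAS
```

Neither agent issues a wrong instruction in isolation here. The unsafe outcome comes only from the missing coordination between them.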
Uses for STAMP
• Basis for new, more powerful hazard analysis techniques (STPA)
• Safety-driven design (physical, operational, organizational)
• More comprehensive accident/incident investigation and root
cause analysis
• Organizational and cultural risk analysis
– Identifying physical and project risks
– Defining safety metrics and performance audits
– Designing and evaluating potential policy and structural
improvements
– Identifying leading indicators of increasing risk (“canary in the coal
mine”)
• New holistic approaches to security
Current Uses (1)
• Safety analysis of new missile defense system (MDA)
• Designing for safety in new JPL outer planets explorer
(NASA)
• Incorporating risk into early trade studies (NASA)
• Analysis of the management structure of the space shuttle
program (post-Columbia) (NASA)
• Risk management in the development of NASA’s new
manned space program (Constellation) (NASA)
• NASA mission control: re-planning and changing
procedures safely (NSF)
Current Uses (2)
• Accident/incident analysis (BP)
• New hazard analysis technique for the petrochemical
industry (BP?)
• Analysis and prevention of corporate fraud
• Safety of maglev trains (Japan Central Railway)
• Food safety
• Safety in pharmaceutical drug development (NSF)
• Medical malpractice
• Risk analysis of outpatient (ambulatory) GI surgery at Beth
Israel Deaconess Hospital (NIH/AHRQ)
Ballistic Missile Defense System (BMDS)
Non-Advocate Safety Assessment using STPA
• A layered defense to defeat all ranges of threats in all phases of flight (boost, mid-course, and terminal)
• Made up of many existing systems (BMDS Elements)
– Early warning radars
– Aegis
– Ground-Based Midcourse Defense (GMD)
– Command and Control Battle Management and Communications (C2BMC)
– Others
• MDA used STPA to evaluate (prior to deployment and test) the residual safety risk of inadvertent launch
© Copyright Nancy Leveson, Aug. 2006
Safety Control Structure Diagram for FMIS
Results
• Deployment and testing were held up for 6 months because so many scenarios were identified for inadvertent launch (the only hazard considered so far). In many of these scenarios:
– All components were operating exactly as intended
– Complexity of component interactions led to unanticipated system behavior
• STPA also identified component failures that could cause inadequate control (most analysis techniques consider only these failure events)
• As changes are made to the system, the differences are assessed by updating the control structure diagrams and assessment analysis templates.
• Adopted as primary safety approach for BMDS
Safety-driven Model-based
System Engineering Methodology for
an Outer Planets Explorer Spacecraft
• Top-down specification and analysis of a deep space exploration
mission system with a “Deep Dive” into the area of communications
antenna pointing (important hazard for these types of systems)
– Specification encompassed all aspects of the mission system (i.e.,
spacecraft, launch vehicle, ground network, etc.) to the extent that they
informed the specific hazard selected
• Generated requirements and design constraints from hazards and
STPA hazard analysis to guide design decisions as they are made.
– Results included traceability from hazards to design decisions,
documentation of design rationale, and formal executable models
of system design
– Performed trade studies to evaluate safety of various design
options
Example: Early System Architecture
Trades for Space Exploration
• Part of an MIT/Draper Labs contract with NASA
• Wanted to include risk, but little information available
• Not possible to evaluate likelihood when no design information available
• Can consider severity by using worst-case analysis associated with specific hazards.
• Developed three-step process:
– Identify system-level hazards and associated severities
– Identify mitigation strategies and associated impact
– Calculate safety/risk metrics for each architecture
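The slides do not give the actual metric, so the following is only a sketch of how the three steps could fit together. The severity scale, the mitigation impact as a fraction between 0 and 1, the residual-severity formula, the hazard names, and the functions `residual_severity` and `rank_architectures` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hazard:
    name: str
    severity: int  # step 1: worst-case severity, assumed scale 1 (minor) to 4 (catastrophic)


def residual_severity(hazards: List[Hazard], mitigation: Dict[str, float]) -> float:
    """Step 3 (assumed metric): severity left after mitigation; lower is better."""
    total = 0.0
    for hazard in hazards:
        # Step 2: mitigation impact as a fraction in [0, 1]; 0 means unmitigated.
        impact = mitigation.get(hazard.name, 0.0)
        total += hazard.severity * (1.0 - impact)
    return total


def rank_architectures(
    hazards: List[Hazard], architectures: Dict[str, Dict[str, float]]
) -> List[str]:
    """Order candidate architectures from lowest to highest residual severity."""
    scores = {
        name: residual_severity(hazards, mitigation)
        for name, mitigation in architectures.items()
    }
    return sorted(scores, key=lambda name: scores[name])


if __name__ == "__main__":
    # Hypothetical hazards and architectures, for illustration only.
    hazards = [Hazard("loss of crew during ascent", 4), Hazard("loss of communications", 2)]
    architectures = {
        "A": {"loss of crew during ascent": 0.5, "loss of communications": 0.5},
        "B": {"loss of communications": 1.0},
    }
    print(rank_architectures(hazards, architectures))
```

Because no likelihood is used, the score depends only on worst-case severity and on how much each architecture's mitigation strategies reduce it, which matches the constraint that no design information is available at this stage.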
Risk Analysis of NASA ITA
1. Preliminary Hazard Analysis
2. Modeling the ITA Safety Control Structure
3. Mapping Requirements to Responsibilities
4. Detailed Hazard Analysis using STPA
• System hazards
• System safety requirements and constraints
• Roles and responsibilities
• Feedback mechanisms
• Gap analysis
• System risks (inadequate controls)
5. Categorizing & Analyzing Risks
6. System Dynamics Modeling and Analysis
7. Findings and Recommendations
• Immediate and longer term risks
• Sensitivity
• Leading indicators
• Risk Factors
• Policy
• Structure
• Leading indicators and measures of effectiveness
Successful vs. Unsuccessful ITA
Implementation
Identification of Lagging vs.
Leading Indicators
• Number of waivers issued is a good
indicator, but it lags the rapid increase
in risk
• Number of incidents under investigation is
a better leading indicator
Managing Tradeoffs Among Risks
• Good risk management requires understanding
tradeoffs among
– Schedule
– Cost
– Performance
– Safety
• Demonstrated how to do this using modeling and
analysis for NASA’s new manned space program to
return to the moon and go on to Mars
STAMP-SD model of NASA
Exploration Systems Mission
[Figure: system dynamics diagram of the design process. Elements include Design Work Remaining, Design Work Completed, Design Work Completed with Undiscovered Safety and Integration Problems, Work Discovered with Safety and Integration Problems, Design Work with Accepted Problems or Unsatisfied Requirements, Pending and Completed Technology Development Tasks, Pending and Completed Hazard Analyses, Hazard Analyses used in Design, and Abandoned Technologies. Influencing variables include Design Schedule Pressure from Management, Fraction of HAs Too Late to Influence Design, Average Hazard Analysis Quality, Efficacy of Safety Assurance (SMA), Safety Assurance Resources, Incentives to Report Flaws, Time to Discover Flaws, Efficacy of System Integration, System Design Overwork, System Performance, and Safety of Operational System.]
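For readers unfamiliar with system dynamics, the mechanics of such a model can be sketched with a few stocks and flows. This toy is not the STAMP-SD model itself: it borrows three element names from the diagram (Design Work Remaining, Design Work Completed, and Design Work Completed with Undiscovered Safety and Integration Problems), and every equation and number in it is a hypothetical assumption, including the way schedule pressure raises both the completion rate and the fraction of flawed work.

```python
from typing import Dict


def simulate(schedule_pressure: float, steps: int = 200, dt: float = 0.25) -> Dict[str, float]:
    """Euler integration of a three-stock toy model (all parameters hypothetical)."""
    work_remaining = 100.0  # stock: Design Work Remaining
    work_completed = 0.0    # stock: Design Work Completed (without flaws)
    work_with_flaws = 0.0   # stock: completed with undiscovered problems
    base_rate = 2.0         # tasks per unit time with no schedule pressure

    for _ in range(steps):
        # Assumption: pressure speeds up completion by 10% per unit of pressure.
        desired_rate = base_rate * (1.0 + 0.1 * schedule_pressure)
        # Never complete more work in one step than is left in the stock.
        completion_rate = min(desired_rate, work_remaining / dt)
        # Assumption: pressure also raises the fraction of work done with flaws.
        flaw_fraction = min(1.0, 0.05 + 0.05 * schedule_pressure)

        work_remaining -= completion_rate * dt
        work_completed += completion_rate * (1.0 - flaw_fraction) * dt
        work_with_flaws += completion_rate * flaw_fraction * dt

    return {
        "remaining": work_remaining,
        "completed": work_completed,
        "with_flaws": work_with_flaws,
    }


if __name__ == "__main__":
    for pressure in (0.0, 5.0):
        print(pressure, simulate(pressure))
```

The output of this sketch reflects only the assumed equations and should not be read as a result of the NASA study. What it does show is the structure: flows move work between stocks, the total amount of work is conserved, and a single variable such as schedule pressure can act on several flows at once.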
NASA ESMD Workforce Planning
Simulation varied:
• Initial experience distribution of ESMD civil servant workforce
• Maximum civil servant hiring rates
• Transfers from Shuttle ops during Shuttle retirement
Important Issues:
- Increase in retirements
- Hiring limits
- Transfers
Example: Schedule Pressure and Safety Priority
in Developing the Shuttle Replacement
1. Overly aggressive schedule
enforcement has little effect on
completion time (<2%) and cost,
but has a large negative impact
on safety
2. Priority of safety activities has a
large positive impact, including a
positive cost impact (less
rework)
[Figure: Relative Cost plotted against Schedule Pressure and Safety Priority, each on a scale from 0 to 10]
Work in Progress: Pharma Safety
• Modeling development dynamics of new drugs, including
factors impacting drug safety, benefits, and profitability
• Our focus is modeling and understanding how drug
safety is impacted by pharma, regulatory and medical
organizational structure and culture and how these
factors interact
Potential General Pharma Topics
• Factors underlying the pipeline slowdown (including organizational
and social factors)
• Long-term safety impact of alternative organizational and social
structures
• Impact of regulatory decision making on
– Drug development strategies
– Pharmacovigilance
– Safety
– Cost and profits
• Potential impact of other changes such as EHR
Modeling Risk in Complex
Healthcare Settings
• Goal: To identify ways unnecessary risk can be mitigated without impacting overall performance
• Testbed: Ambulatory/outpatient (GI) surgical care at BIDMC
– Technologically rich environment with
• Significant human-machine interface issues
• Dynamically changing patient status
• Outcomes depend heavily on success with which humans recognize and respond appropriately to changes
– Drivers
• Cost-containment pressures
• Desire to deliver care more efficiently
– Tensions among safety, productivity, efficiency lead to pressures to waive safety controls
Healthcare (cont’d)
• Designed a series of experiments to study complex
interactions among:
– Production pressures
– Historical experience with adverse outcomes
– Inherent risk tolerance/propensity
– Confidence in and compliance with safety controls
The aim is to identify how these interactions drive the system
beyond acceptable thresholds of safety.
• Paradigm change for healthcare: assess risk in terms
of processes rather than outcomes
Differences from Traditional Approaches
• More comprehensive view of causality
• A top-down systems approach to preventing losses
• Includes organizational, social, and cultural aspects
of risk
• Treats accidents as dynamic processes
• Includes human decision making and mental models
• Handles much more complex systems than traditional
safety analysis approaches