Responding to Outages Maturely
John Allspaw
SVP, Tech Ops
Code As Craft, Berlin
Tuesday, April 24, 12
OPERABILITY
PRODUCTION
http://WhoOwnsMyAvailability.com
How important is this?
How Can This Happen?
Complicated? Complex?
Complex Systems
• Cascading Failures
• Difficult to determine boundaries
• Complex systems may be open
• Complex systems may have a memory
• Complex systems may be nested
• Dynamic network of multiplicity
• May produce emergent phenomena
• Relationships are non-linear
• Relationships contain feedback loops
How Can This Happen?
It does happen.
And it will again.
And again.
Optimization
MTBF
MTTR
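One way to read the MTBF/MTTR slide is through the common steady-state approximation availability = MTBF / (MTBF + MTTR). The sketch below assumes that formula; the function name and the figures (500 hours, 2 hours) are made up for illustration and do not come from the talk.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability under the MTBF / (MTBF + MTTR) approximation."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only to illustrate the comparison.
baseline = availability(mtbf_hours=500.0, mttr_hours=2.0)
doubled_mtbf = availability(mtbf_hours=1000.0, mttr_hours=2.0)
halved_mttr = availability(mtbf_hours=500.0, mttr_hours=1.0)

print(f"baseline:     {baseline:.5f}")
print(f"doubled MTBF: {doubled_mtbf:.5f}")
print(f"halved MTTR:  {halved_mttr:.5f}")

# Under this formula, failing half as often and recovering twice as fast
# buy the same availability.
print(math.isclose(doubled_mtbf, halved_mttr))
```

Under this approximation the two optimizations are interchangeable on paper, so the choice between them comes down to which one is actually achievable for a given system.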
http://www.flickr.com/photos/sparktography/75499095/
How does team troubleshooting happen?
Time: Problem Starts → Detection → Evaluation → Response → Stable → Confirmation → All Clear → PostMortem
Time (same timeline, with Stress added): Problem Starts → Detection → Evaluation → Response → Stable → Confirmation → All Clear → PostMortem
Forced beyond learned roles
Actions whose consequences are both important and difficult to see
Cognitively and perceptually noisy
Coordinative load increases exponentially
So What Can We Do?
We Learn From Others
Characteristics of response to escalating scenarios
• ...tend to neglect how processes develop over time (awareness of rates) versus assessing how things are in the moment
• ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)
• ...inclined to think in causal series instead of causal nets: "A, therefore B" instead of "A, therefore B and C (therefore D and E)", etc.
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
Pitfalls
• Thematic Vagabonding
• Goal Fixation (encystment)
• Refusal to make decisions
• Heroism: non-communicating lone wolf-isms
• Distraction: irrelevant noise in comm channels
Jens Rasmussen, 1983
Senior Member, IEEE
“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”
IEEE Transactions On Systems, Man, and Cybernetics, May 1983
SKILL-BASED: Simple, routine
RULE-BASED: Knowable, but unfamiliar
KNOWLEDGE-BASED: WTF IS GOING ON?
(Reason, 1990)
Team Troubleshooting
• Which causes did you consider first?
• Which ones did you not consider at all?
• How much of what you considered comes from recent history?
• How much comes from observations from other team members?
• How effective is the response team in communicating to other groups? Users?
• How long does it take to exhaust obvious cause(s)?
Team Dynamics
High Reliability Organizations
• Air Traffic Control
• Naval Air Operations At Sea
• Electrical Power Systems
• Etc.

• Complex Socio-Technical systems
• Efficiency <-> Thoroughness
• Time/Resource Constrained
• Engineering-driven
“The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”
Rochlin, La Porte, and Roberts. Naval War College Review, 1987
http://govleaders.org/reliability.htm
Close interdependence between groups
Close reciprocal coordination and information sharing, resulting in overlapping knowledge
High redundancy: multiple people observing the same event and sharing information
Broad definition of who belongs to the team.
Teammates are included in the communication loops rather than excluded.
Lots of error correction.
High levels of situation comprehension: maintain constant awareness of the possibility of accidents.
High levels of interpersonal skills
Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.
Patterns of authority are changed to meet the demands of the events: organizational flexibility.
The reporting of errors and faults is rewarded, not punished.
So What Else Can We Do?
We Drill
We GameDay
We Learn To Improvise
IMPROVISATION
We Learn From Our Mistakes
Postmortems
• Full timelines: What happened, when, who involved
• Review in public, everyone invited
• Search for “second stories” instead of “human error”
• Cultivating a blameless environment
• Giving requisite authority to individuals to improve things
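A "full timeline" can be kept as plain structured data. The sketch below is one hypothetical way to record what happened, when, and who was involved, and to read off intervals between the phases named on the earlier timeline slide. The class, field, and function names and the incident data are all invented for illustration; none of it comes from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime   # when it happened
    who: str         # who was involved
    what: str        # what happened, in plain words
    phase: str = ""  # optional marker such as "Problem Starts" or "Detection"


def elapsed_between(entries: list[TimelineEntry], start_phase: str, end_phase: str) -> timedelta:
    """Time from the first entry marked start_phase to the first marked end_phase."""
    ordered = sorted(entries, key=lambda e: e.when)
    start = next(e.when for e in ordered if e.phase == start_phase)
    end = next(e.when for e in ordered if e.phase == end_phase)
    return end - start


# Entirely made-up incident data.
timeline = [
    TimelineEntry(datetime(2012, 4, 24, 14, 2), "(system)", "Error rate starts climbing", "Problem Starts"),
    TimelineEntry(datetime(2012, 4, 24, 14, 9), "bob", "Alert fires and is acknowledged", "Detection"),
    TimelineEntry(datetime(2012, 4, 24, 14, 20), "alice, bob", "Suspect a bad config push"),
    TimelineEntry(datetime(2012, 4, 24, 14, 41), "bob", "Config rolled back, errors drop", "Stable"),
]

time_to_detect = elapsed_between(timeline, "Problem Starts", "Detection")
time_to_stable = elapsed_between(timeline, "Problem Starts", "Stable")
print(f"time to detect: {time_to_detect}, time to stable: {time_to_stable}")
```

Entries without a phase marker still belong in the timeline. They are the raw material for the "second stories" even though they do not feed the interval calculations.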
Qualifying Response
High signal:noise in comm channels?
Troubleshooting fatigue?
Troubleshooting handoff?
All tools on-hand and working?
Improvised tooling or solutions?
Metrics visibility?
Collaborative and skillful communication?
Remediation
We Share Near-Miss Events
Near Misses
Hey everybody -
Don’t be like me. I tried to X, but that wasn’t a good idea.
It almost exploded everyone.
So, don’t do: (details about X)
Love, Joe
Near Misses
• Can act like “vaccines” - help system safety without actually hurting anything
• Happen more often, so provide more data on latent failures
• Powerful reminder of hazards, and slows down the process of forgetting to be afraid
Practice!
• How we troubleshoot in the moment, as a distributed team
• How we handle time pressure
• How we Observe/Orient/Decide/Act
• How we communicate during emergencies
• How we trust (or not) each other during emergencies
• How we relate to emergencies when things are normal
• How we could detect how we are protected during normal times (i.e., why aren’t we going down RIGHT NOW?)
Resilient Response
• Can learn from other fields
• Can train for outages
• Can learn from mistakes
• Can learn from successes as well as failures
http://www.flickr.com/photos/sparktography/75499095/
THE END
A parting word
A parting challenge
Two Propositions
100 changes
6 change-related issues
100 > 6
Proposition #1
“Ways in which things go right are special cases of the ways in which things go wrong.”
Proposition #1
Successes = failures gone wrong
Study the failures, generalize from that.
Potential data sources: 6 out of 100
Proposition #2
“Ways in which things go wrong are special cases of the ways in which things go right.”
Proposition #2
Failures = successes gone wrong
Study the successes, generalize from that
Potential data sources: 94 out of 100
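The arithmetic behind the two propositions, as a small sketch. The counts (100 changes, 6 change-related issues) are the ones from the slides; the variable names are mine.

```python
changes = 100
change_related_issues = 6

# Proposition #1 studies only what went wrong.
failure_sources = change_related_issues

# Proposition #2 studies what went right.
success_sources = changes - change_related_issues

print(f"Proposition #1: {failure_sources} out of {changes} ({failure_sources / changes:.0%})")
print(f"Proposition #2: {success_sources} out of {changes} ({success_sources / changes:.0%})")
```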
94/100? OR 6/100?
What and WHY Do Things Go RIGHT?
Not just: why did we fail?
But also: why did we succeed?
Mature Role of Automation
http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
“Ironies of Automation” - Lisanne Bainbridge
Mature Role of Automation
• Moves humans from manual operator to supervisor
• Extends and augments human abilities, doesn’t replace them
• Doesn’t remove “human error”
• Are brittle
• Recognizes that there is always discretionary space for humans
• Recognizes the Law of Stretched Systems
Law of Stretched Systems
“Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”
D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006