Date posted: 05-Apr-2017
Category: Technology
Uploaded by: matthew-boeckman
The evolving role of context in Incident Management
Matthew Boeckman
Developer Advocate
Victorops.com/blog
@matthewboeckman
Background
● 18 years on-call Ops
● 15 years w/software teams
● Startup junkie
● DevOps enthusiast
3
What is VictorOps?
VictorOps ingests all of your alerts from your current monitoring tools and becomes the logical layer between your alerts and the people who receive them.
victorops.com/IMA
5
5 Phases of Incident Management
Detection
monitoring, metrics, thresholds
Response
alerting, on-call, escalation
Remediation
fixes, tickets, deployments
Analysis
postmortem, how or why, understand
Readiness
improvement, game days, learning
6
Standard Incident Workflow
Detection Response Remediation Analysis Readiness
7
Incident Management Assessment Matrix
Detection Response Remediation Analysis Preparedness
Novice
Beginner
Competent
Proficient
Expert
8
Incident Management Maturity Matrix
Detection Response Remediation Analysis Preparedness
Novice
Beginner x
Competent x x
Proficient x x
Expert
9
Self Assessment
Poll: How would you rate your overall team maturity?
A. Novice
B. Beginner
C. Competent
D. Proficient
E. Expert
10
The Focus Question
How can we help teams mature their incident management practice?
(Stated plainly: Make On-Call suck less)
11
Situational Context
12
Incident Management Key Metrics
● Mean Time to Repair (MTTR)
● Availability (SLA)
● Ticket Volumes
● Escalations
● Customer Satisfaction
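As a rough illustration of the first two metrics, here is a minimal Python sketch. The incident records, the seven-day window, and the function names are all hypothetical, and it assumes incidents do not overlap and that repair time runs from detection to resolution.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2017, 4, 1, 2, 0), datetime(2017, 4, 1, 2, 45)),
    (datetime(2017, 4, 3, 14, 10), datetime(2017, 4, 3, 14, 25)),
    (datetime(2017, 4, 4, 23, 30), datetime(2017, 4, 5, 0, 30)),
]


def mean_time_to_repair(records):
    """Average of (resolved - detected) across all incidents."""
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


def availability(records, window):
    """Fraction of the window not spent in an incident (assumes no overlap)."""
    downtime = sum((end - start for start, end in records), timedelta())
    return 1 - downtime / window


mttr = mean_time_to_repair(incidents)
uptime = availability(incidents, timedelta(days=7))
print(f"MTTR: {mttr}, availability: {uptime:.4%}")
```

Tracking both numbers over time, rather than as one-off values, is what makes them useful for judging whether a practice is maturing.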
13
Incident Management Key Metrics
14
Time Spent Managing Incidents - Low Maturity
Detection Response Remediation Analysis
Readiness
Time to Repair (MTTR)
15
Time Spent Managing Incidents - Medium Maturity
Detection Response Remediation Analysis
Readiness
Time to Repair (MTTR)
16
Time Spent Managing Incidents - High Maturity
Detection
Response
Remediation Analysis Readiness
Time to Repair (MTTR)
17
A New Core Metric
Detection
Response
Remediation Analysis Readiness
Time to Repair (MTTR)
Time to Learn (TTL)
Identify trends
Capacity plan
Improve infrastructure
Gamedays
Cross train
Update runbooks
18
Beep Beep Beep
19
Standard Incident Workflow
20
Standard Diagnostic Procedure
1. Fire up the VPN
2. Navigate dashboards, find relevant section
3. Review ticket or incident history for host
4. Review Runbooks for associated host
21
Common Bottlenecks to Establishing Context
● Multiple sources of record
● Duplicate Runbooks or documentation
● Metric overload
● New responders unfamiliar with systems
22
Where Does it Hurt?
Poll: Which is the most painful problem you experience in establishing context?
A. Multiple sources of record
B. Duplicate documentation
C. Metric overload
D. Everything is equally on fire
E. Everything is fantastic
23
Beep Beep Beep
24
A Tale of Two Graphs
Massive spike above expected norm
Response: Fire up the laptop and put a pot of coffee on
25
A Tale of Two Graphs
Small spike for a consistently loaded box.
Response: ACK alert, go back to sleep
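The two graphs can be told apart mechanically by comparing a reading against that host's own history. The sketch below is one possible way to do it, using a simple standard-deviation score. The sample data and the function name are made up for illustration, and it is not a description of how any particular monitoring tool works.

```python
from statistics import mean, pstdev


def deviation_from_norm(history, current):
    """How many standard deviations `current` sits from the history mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any change at all is infinitely unusual.
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


# Hypothetical CPU readings (percent) for two hosts.
quiet_box = [10, 12, 11, 9, 10, 11, 12, 10]
busy_box = [85, 90, 88, 92, 87, 91, 89, 90]

# The same reading of 95 means very different things on each host.
quiet_score = deviation_from_norm(quiet_box, 95)
busy_score = deviation_from_norm(busy_box, 95)
print(f"quiet box: {quiet_score:.1f} sigma, busy box: {busy_score:.1f} sigma")
```

Attaching a score like this to the alert gives the responder the "expected norm" without having to open a dashboard first.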
26
This Time, with Context!
27
Enhanced Contextual Workflow
28
Alert Enhancements
Poll: My team is doing some enhancement of alerts today.
A. True
B. False
Many incidents can be tracked to deploys
Developer Velocity = Constant Change
Silos impair communication
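One way to surface the deploy connection is to attach recent deploys to an alert when it fires. This is a minimal sketch under stated assumptions: the deploy log format, the 30-minute window, and the `recent_deploys` helper are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical deploy log; in practice this might come from a CI/CD system.
deploys = [
    {"service": "web", "version": "1.4.2", "at": datetime(2017, 4, 5, 9, 50)},
    {"service": "api", "version": "2.0.1", "at": datetime(2017, 4, 5, 7, 15)},
    {"service": "web", "version": "1.4.1", "at": datetime(2017, 4, 4, 16, 0)},
]


def recent_deploys(alert_time, service, log, window=timedelta(minutes=30)):
    """Deploys of `service` that landed within `window` before the alert."""
    return [
        d for d in log
        if d["service"] == service
        and timedelta(0) <= alert_time - d["at"] <= window
    ]


alert_time = datetime(2017, 4, 5, 10, 5)
suspects = recent_deploys(alert_time, "web", deploys)
for d in suspects:
    print(f"{d['service']} {d['version']} deployed at {d['at']:%H:%M}")
```

The window length is a judgment call: too short misses slow-burning regressions, too long buries the responder in unrelated deploys.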
29
CI/CD Exacerbates the Contextual Challenge
30
A Tale of Two Incidents
31
A Tale of Two Incidents
32
Introducing: The Scientific Method
Make Observations (the measurement)
Ask a question (why would a webserver quit working?)
Form a hypothesis (because we just deployed?)
33
The Sandstorm
34
No. Do not.
35
Measure Everything: the Anti-pattern
Measurements cost time and money
Busy dashboards lead to subconscious filtering
Measurements create a natural impulse to alert
36
Enhance
37
Stop
38
An Embarrassment of Dashboards
39
Rule of Thumb
Measure much
Alert on some
Contextualize all
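The rule of thumb can be sketched in a few lines of Python. Everything here is an illustrative assumption: the metric names, thresholds, and example.com URLs are placeholders, and `build_alerts` is a hypothetical helper rather than part of any product.

```python
# Measure much: every metric collected for a host (hypothetical values).
measurements = {"cpu_pct": 97, "disk_pct": 40, "queue_depth": 12, "gc_pause_ms": 30}

# Alert on some: only a subset of metrics has a paging threshold.
alert_thresholds = {"cpu_pct": 90, "disk_pct": 85}

# Contextualize all: every alertable metric carries links for the responder.
context = {
    "cpu_pct": {
        "runbook": "https://wiki.example.com/runbooks/high-cpu",
        "dashboard": "https://graphs.example.com/d/cpu",
    },
    "disk_pct": {
        "runbook": "https://wiki.example.com/runbooks/disk-full",
        "dashboard": "https://graphs.example.com/d/disk",
    },
}


def build_alerts(measured, thresholds, ctx):
    """Return one enriched alert per metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = measured.get(name)
        if value is not None and value > limit:
            alerts.append(
                {"metric": name, "value": value, "threshold": limit, **ctx.get(name, {})}
            )
    return alerts


alerts = build_alerts(measurements, alert_thresholds, context)
for alert in alerts:
    print(alert)
```

Metrics without a threshold are still recorded and remain available for diagnosis, but they never page anyone.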
40
Iteration is Key
Dialing in context takes time
Conduct blameless postmortems
Experiment with more and less context
Be objective in your assessment of what works
41
Leverage Situational Context
Providing incident responders with context can meaningfully impact MTTR, paying dividends in time to move your practice forward
42
The Beginning
Detection Response Remediation Analysis
Readiness
Time to Repair (MTTR)
43
The Goal
Detection
Response
Remediation Analysis Readiness
Time to Repair (MTTR)
Time to Learn (TTL)
Identify trends
Capacity plan
Improve infrastructure
Gamedays
Cross train
Update runbooks
Take the IMA!
http://victorops.com/ima
Questions?
44
Thank you!
Matthew Boeckman
@matthewboeckman
Slides on devops.com & slideshare.com