Monitoring is Never “Done”
@melaniemj
Responsibilities @ Yardi
Implementation and administration of monitoring, alerting, and log aggregation/analysis tools.
o 15,000+ Devices
o 9 Datacenters
o 5,000+ Customer Installations
o We monitor Windows envs with Linux envs
This was me in 2008 @ Point2
How code is delivered
How code operates in production
A good problem to have
Everyone wants “the monitoring” so they can say “it’s monitored”
Communicating Work
o Classify
o Quantify
o Qualify
Words....
o Logging
o Alerting
o Dashboards
o Reports
o 4-9s
o 24x7x365 this shit can’t go down
Can it be this simple?
Let’s talk about “the monitoring” for X
Be awesome
X is monitored
DCVA (OODA)
1. Definition
I can hit this one page so it’s up right?
No thanks, let’s redefine status
1. Definition
o What questions are you trying to answer?
o What information do you need when a failure occurs?
o What are the most common failures?
o Who is the audience for the information?
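One way to keep the four definition questions honest is to write the answers down as data and check that none is left blank. The sketch below is only an illustration: `MonitoringDefinition`, its field names, and the sample answers are hypothetical, not part of any tool named in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringDefinition:
    """Answers to the four definition questions for one service (hypothetical structure)."""
    service: str
    questions: list[str] = field(default_factory=list)        # what are we trying to answer?
    failure_info: list[str] = field(default_factory=list)     # what do we need when it fails?
    common_failures: list[str] = field(default_factory=list)  # what usually breaks?
    audience: list[str] = field(default_factory=list)         # who reads the information?

    def missing(self) -> list[str]:
        """Return the names of the definition questions that have no answer yet."""
        answers = {
            "questions": self.questions,
            "failure_info": self.failure_info,
            "common_failures": self.common_failures,
            "audience": self.audience,
        }
        return [name for name, value in answers.items() if not value]

    def is_complete(self) -> bool:
        """A definition is complete once every question has at least one answer."""
        return not self.missing()


# Sample answers are made up for illustration.
definition = MonitoringDefinition(
    service="X",
    questions=["Can customers log in?"],
    audience=["on-call engineer"],
)
print(definition.missing())
```

An incomplete definition is a prompt to go back and ask more questions before building checks.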
2. Checks & Collections
o Environment & Code
o Data points
o Detailed logs
o Current state
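A check that records a data point, a detail line, and the current state says more than "I can hit this one page". The following is a minimal sketch under stated assumptions: `CheckResult`, `run_check`, `overall_status`, and both probe functions are hypothetical stand-ins, and the probes fake their results rather than touching a real system.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    state: str         # current state: "ok" or "failed"
    duration_s: float  # data point: how long the probe took
    detail: str        # detailed log line to keep for later analysis


def run_check(name: str, probe: Callable[[], str]) -> CheckResult:
    """Run one probe, timing it and capturing either its output or its error."""
    started = time.monotonic()
    try:
        detail = probe()
        state = "ok"
    except Exception as exc:
        detail = f"{type(exc).__name__}: {exc}"
        state = "failed"
    return CheckResult(name, state, time.monotonic() - started, detail)


def overall_status(results: list[CheckResult]) -> str:
    """Status is 'ok' only when there are results and every one of them passed."""
    if results and all(r.state == "ok" for r in results):
        return "ok"
    return "degraded"


# Hypothetical probes: the page answers, but the work queue is backing up.
def page_loads() -> str:
    return "200 OK"


def queue_drains() -> str:
    raise TimeoutError("queue depth still rising")


results = [run_check("page", page_loads), run_check("queue", queue_drains)]
print(overall_status(results))
```

Here the page check alone would have reported "up" while the service was degraded.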
3. Visualization
o Analysis
o Dashboards
o Correlations
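As one illustration of the correlation step, two collected series can be compared with a Pearson coefficient. This is a sketch only: the talk does not say which correlation method is used, and the function name and sample numbers are made up.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up samples: requests per second and response time in milliseconds.
requests_per_s = [1.0, 2.0, 3.0, 4.0]
response_ms = [2.0, 4.0, 6.0, 8.0]
print(pearson(requests_per_s, response_ms))
```

A coefficient near 1 or -1 suggests two metrics worth placing side by side on a dashboard. It does not show that one causes the other.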
4. Action
o Fault detection
o Alerting
o RCA
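Fault detection and alerting can be sketched as a rule over recent check states plus a message that carries enough history for later RCA. The rule of three consecutive failures, the function names, and the alert fields below are all assumptions chosen for illustration.

```python
from typing import Optional


def detect_fault(states: list[str], threshold: int = 3) -> bool:
    """True when the most recent `threshold` check states are all 'failed'."""
    recent = states[-threshold:]
    return len(recent) == threshold and all(s == "failed" for s in recent)


def build_alert(service: str, states: list[str], recipients: list[str],
                threshold: int = 3) -> Optional[dict]:
    """Return an alert for the right people, or None when no fault is detected."""
    if not detect_fault(states, threshold):
        return None
    return {
        "service": service,
        "notify": list(recipients),  # the audience named during definition
        "summary": f"{service} failed {threshold} consecutive checks",
        "history": list(states),     # kept so RCA can see what led up to the fault
    }


alert = build_alert("X", ["ok", "failed", "failed", "failed"], ["on-call engineer"])
print(alert["summary"])
```

Requiring several consecutive failures is one common way to avoid paging on a single blip. The right threshold depends on the service.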
Cycle
o Definition (What to collect)
o Checks & Collections (How to collect)
o Visualization (Make collections pretty)
o Action (Inform on failure)
Team Time Distribution
Time Distribution (Desired)
Is “X” monitored?
When “X” goes into some degraded state:
o The right people know.
o They have enough information to find the problem, recover, and later to do RCA.
o If they don’t, they revisit the definition.
How does your team
o Classify
o Quantify
o Qualify
Monitoring is Never “Done”
Melanie Cey @melaniemj
Senior Systems Analyst
Systems Reliability Engineering @ Yardi