
Top-Down Approach to Monitoring

Date post: 18-Aug-2015
Upload: bigpanda
Page 1: Top-Down Approach to Monitoring

Top-Down Approach to Monitoring
July 30, 2015

Page 2: Top-Down Approach to Monitoring

1996

Tivoli Software acquired by IBM

Patrol Software acquired by BMC

Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell NetWare servers

“HOW to monitor?” is the primary question

Page 3: Top-Down Approach to Monitoring

2015

https://www.bigpanda.io/monitoringscape/

Page 4: Top-Down Approach to Monitoring

Shifting from “How?” to “What?”

Page 6: Top-Down Approach to Monitoring

Bottom-Up Approach

Network, Servers, Apps → Overall System Health

Page 7: Top-Down Approach to Monitoring

Problem #1: Inflation of Tools

Page 8: Top-Down Approach to Monitoring

Problem #2: Inflation of “Whats”

Page 9: Top-Down Approach to Monitoring

Problem #3: Inflation of Alerts

Page 11: Top-Down Approach to Monitoring

We’re trying to answer a simple question:

Is our system in a healthy state?

Page 12: Top-Down Approach to Monitoring

No Alerts ≠ Healthy System

Many Alerts ≠ Unhealthy System

Page 13: Top-Down Approach to Monitoring

Healthy System = A system that continuously generates value for its users under a well-known set of KPIs
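
One way to read this definition operationally is "every agreed KPI is currently inside its accepted range". The sketch below is only an illustration of that reading, not something taken from the slides: the `Kpi` class, the KPI names, and the ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A KPI plus the range considered healthy (hypothetical values)."""
    name: str
    low: float
    high: float

    def is_met(self, value: float) -> bool:
        return self.low <= value <= self.high


def unmet_kpis(kpis, readings):
    """Return the names of KPIs that are out of range or have no reading."""
    failing = []
    for kpi in kpis:
        value = readings.get(kpi.name)
        if value is None or not kpi.is_met(value):
            failing.append(kpi.name)
    return failing


def is_healthy(kpis, readings):
    """The system counts as healthy only when every KPI is met."""
    return not unmet_kpis(kpis, readings)


# Hypothetical KPIs, for illustration only.
KPIS = [
    Kpi("checkouts_per_minute", low=50, high=float("inf")),
    Kpi("p99_page_load_seconds", low=0, high=2.0),
]
```

Note that a missing reading is treated as a failure here: under this assumption, not knowing the value of a KPI is itself a reason to doubt that the system is healthy.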

Page 14: Top-Down Approach to Monitoring

Top-Down Approach

KPIs, UX → Overall System Health

Page 15: Top-Down Approach to Monitoring

Top-Down vs Bottom-Up

Top-Down: KPIs, UX → Overall System Health

• Selective

• Proactive

Bottom-Up: Network, Servers, Apps → Overall System Health

• Exhaustive

• Reactive

Page 16: Top-Down Approach to Monitoring

Definition of KPI

A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization.

Page 17: Top-Down Approach to Monitoring

Let’s play a game!

This is Sam

• CPU Utilization

• # Clicks on a button

• Temperature

What does Sam’s company do?

Page 18: Top-Down Approach to Monitoring

The Pulse of Netflix

We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”.

http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html

Page 19: Top-Down Approach to Monitoring

Healthy SPS Pattern

http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html

Page 20: Top-Down Approach to Monitoring

Unhealthy SPS Pattern

http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html

Page 21: Top-Down Approach to Monitoring

What’s so special about SPS?

• SPS is easy for all stakeholders to understand

• One metric covers different points of failure: server problems, device problems, etc.

• Most important: it’s a clear KPI that indicates when user experience is compromised
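
Because a metric like SPS follows a predictable pattern, a simple health check can compare the current value with what was seen at the same time slot in earlier weeks. The sketch below illustrates that idea only; it is not Netflix's actual detection logic, and the median baseline, the function names, and the 25% drop threshold are assumptions made for the example.

```python
from statistics import median


def expected_sps(history):
    """Baseline: median of the SPS values seen at the same time slot in earlier weeks."""
    if not history:
        raise ValueError("need at least one historical sample")
    return median(history)


def sps_deviation(current, history):
    """Relative deviation from the baseline, e.g. -0.4 means 40% below normal."""
    baseline = expected_sps(history)
    if baseline == 0:
        raise ValueError("baseline is zero; relative deviation is undefined")
    return (current - baseline) / baseline


def is_unhealthy(current, history, max_drop=0.25):
    """Flag the KPI when SPS falls more than max_drop below the baseline (assumed threshold)."""
    return sps_deviation(current, history) < -max_drop
```

For example, with a history of `[1000, 1040, 980, 1020]` the baseline is 1010, so a reading of 990 passes while a reading of 600 is flagged. A single comparison like this is easy to explain to every stakeholder, which is the point the list above makes about SPS.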

Page 22: Top-Down Approach to Monitoring

But what about root cause analysis?

KPIs UX

Overall System Health

Network Servers Apps

Page 23: Top-Down Approach to Monitoring

GitHub: need for speed

The most important factor in web application design is responsiveness. And the first step toward responsiveness is speed. But speed within a web application is complicated.

https://github.com/blog/1252-how-we-keep-github-fast

Page 24: Top-Down Approach to Monitoring

Start from the Top: Response Times Dashboard

https://github.com/blog/1252-how-we-keep-github-fast

• Each row represents a different major component

• Clicking one of the rows lets you dive in and see the mean, 98th percentile, and 99.9th percentile response times
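
The three numbers shown per component can be computed from raw response-time samples. The following is a minimal sketch using the nearest-rank percentile definition; it is not GitHub's implementation, and the function names and the millisecond unit are assumptions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Summary row for one component: mean plus the two tail percentiles."""
    return {
        "mean": mean(samples_ms),
        "p98": percentile(samples_ms, 98),
        "p99.9": percentile(samples_ms, 99.9),
    }
```

The tail percentiles matter because the mean hides slow outliers: a component can have a comfortable mean while one request in a thousand is painfully slow.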

Page 25: Top-Down Approach to Monitoring

Digging Deeper: Mission Control Bar

https://github.com/blog/1252-how-we-keep-github-fast

• Total Time

• Render Time

• Cache & Database

• JS & CSS Size

Page 26: Top-Down Approach to Monitoring

And Deeper

https://github.com/blog/1252-how-we-keep-github-fast

Render Breakdown

SQL Query Viewer

Page 27: Top-Down Approach to Monitoring

Why talk about BigPanda?

Because Pandas are awesome!

Page 28: Top-Down Approach to Monitoring

BigPanda

Because...

• We’re not Netflix or GitHub: we’re a growing startup (7 devs, 1 full-time Ops)

• We feel the pain!

• Our KPIs are easy to describe and understand (especially if you’re an Ops person)

Page 29: Top-Down Approach to Monitoring

BigPanda

As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast.

KPI: Low data pipeline latency

Page 30: Top-Down Approach to Monitoring

Pipeline Latency Metric

• Metrics are sent from within the apps

• Stored in Graphite

• Sum of the average latencies of all alerts that went through the pipeline

• Monitored by Nagios
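
A rough sketch of how an app could compute such a metric and ship it to Graphite is shown below. It reads "sum of the average latencies" as the sum of per-stage averages, which is an assumption, as are the stage names, the metric path, and the host name. The only external fact relied on is Graphite's plaintext protocol, which accepts lines of the form "path value timestamp" on carbon's default port 2003.

```python
import socket


def pipeline_latency(stage_latencies):
    """Sum the per-stage average latencies (seconds) into one pipeline-wide value."""
    total = 0.0
    for samples in stage_latencies.values():
        if samples:  # stages with no alerts in this interval contribute nothing
            total += sum(samples) / len(samples)
    return total


def graphite_line(path, value, timestamp):
    """Format one metric for Graphite's plaintext protocol, ending with a newline."""
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host="graphite.example.com", port=2003):
    """Send one plaintext line to carbon (the host name is a placeholder)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A caller would build the line with something like `graphite_line("pipeline.latency.total", pipeline_latency(stages), time.time())` and pass it to `send_to_graphite`. Keeping the formatting separate from the socket call makes the metric logic testable without a Graphite server.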

Page 31: Top-Down Approach to Monitoring

Pipeline Latency Metric

• Very good indicator of a possible service outage

• Must-have for detecting SLA violations

• Very good indicator of performance bottlenecks (can be broken down by sub-pipeline, specific organization, etc.)

• Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
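
To turn the latency metric into an alert, a Nagios check only needs to map the reading to the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below assumes the latency value has already been fetched; the 30 and 60 second thresholds, the function names, and the sub-pipeline names are hypothetical, not BigPanda's actual configuration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_latency(latency_s, warn_s=30.0, crit_s=60.0):
    """Map a latency reading to a Nagios status code and a one-line message."""
    if latency_s is None:
        return UNKNOWN, "UNKNOWN - no latency data"
    if latency_s >= crit_s:
        status = CRITICAL
    elif latency_s >= warn_s:
        status = WARNING
    else:
        status = OK
    # The text after "|" is Nagios performance data.
    message = f"{LABELS[status]} - pipeline latency {latency_s:.1f}s | latency={latency_s:.1f}s"
    return status, message


def worst_sub_pipeline(latencies_by_name):
    """Return (name, latency) of the slowest sub-pipeline, a hint at the bottleneck."""
    return max(latencies_by_name.items(), key=lambda item: item[1])


def main(argv):
    """Print the status line and return the exit code for the given reading."""
    reading = float(argv[0]) if argv else None
    status, message = check_latency(reading)
    print(message)
    return status

# As a plugin script, the entry point would be: import sys; sys.exit(main(sys.argv[1:]))
```

Setting the critical threshold at or below the SLA limit is what makes the check useful for spotting violations, and `worst_sub_pipeline` shows how the same metric, broken down, points at where the time is being spent.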

Page 32: Top-Down Approach to Monitoring

TL;DR

• The bottom-up approach (“monitor all the things”) is easier to start with, but soon leads to alert fatigue and disorientation.

• The top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.

• High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.

• Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate what you need to measure.

• Monitoring is, first of all, about your business.

Page 33: Top-Down Approach to Monitoring

Questions?

Page 34: Top-Down Approach to Monitoring

Thanks!

