

By Kasper Nissen


Monitoring with Prometheus

Hi! My name is Kasper


What am I going to cover?





Monitoring - why and what?

Prometheus - an introduction

Short demo

DEMO Part 1



Why monitor?




Analyzing long-term trends




Comparing over time or experiment groups








Building dashboards



Conducting ad hoc retrospective analysis




What is broken, and why?

What to monitor?




Hosts: CPU, Memory, I/O, Network, Filesystem




Containers: CPU, Memory, I/O, Restarts, Throttling




Applications: Throughput, Latency


The Four Golden Signals


Site Reliability Engineering - How Google Runs Production Systems



Latency: The time it takes to service a request. It is important to distinguish between the latency of successful and failed requests.




Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric.




Errors: The rate of requests that fail, either explicitly (e.g. HTTP 500s) or implicitly (e.g. an HTTP 200 success with the wrong content).




Saturation: How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g. in a memory-constrained system, show memory).
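As a rough illustration of the four signals, the sketch below derives them from a handful of request records. The `Request` record, its field names, the sample values, and the `capacity_rps` figure are all assumptions invented for this example; none of them come from the talk or from Prometheus. Only explicit failures (status 500 and above) are counted as errors here, since implicit failures need an application-level check.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    duration_s: float
    status: int


def golden_signals(requests, window_s, capacity_rps):
    """Summarise a window of requests as the four golden signals."""
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r.duration_s for r in requests if r.status >= 500]
    traffic = len(requests) / window_s
    return {
        # Latency is reported separately for successful and failed requests.
        "latency_ok_s": median(ok) if ok else None,
        "latency_failed_s": median(failed) if failed else None,
        # Traffic: requests per second over the window.
        "traffic_rps": traffic,
        # Errors: share of requests that failed explicitly.
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Saturation: traffic relative to an assumed capacity.
        "saturation": traffic / capacity_rps,
    }


sample = [
    Request(0.10, 200),
    Request(0.20, 200),
    Request(0.30, 200),
    Request(2.00, 500),
]
signals = golden_signals(sample, window_s=2.0, capacity_rps=10.0)
print(signals)
```

In a real setup these values would normally be computed by the monitoring system from exported counters and histograms rather than in application code; the sketch only shows what each signal measures.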






Prometheus

Prometheus was presented to be the protector and benefactor of mankind.








Heavily inspired by Borgmon

Built by ex-Googlers at SoundCloud

Pull-based (scrapes at regular intervals)

Many integration possibilities

The 2nd project in CNCF

What is Prometheus?








Monitoring system and time series database


Metrics collection and storage



Dashboard / Graphing / Trending

Source: https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

Prometheus focuses on




Operational systems monitoring

Dynamic cloud environments

Source: https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

Prometheus does not do








Raw log / event collection (use ELK stack)

Request tracing (use opentracing.io)

“Magic” anomaly detection

Durable long-term storage

Automatic horizontal scaling

User / auth management

Prometheus Architecture


[Architecture diagram. Components shown: long-lived jobs, short-lived jobs, Pushgateway, Alertmanager]


The Data model


<metric name>{<label name>=<label value>, …}

api_http_requests_total{method="POST", handler="/messages"}



Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels.
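To make the identity rule concrete, here is a minimal sketch of how a metric name plus a label set can act as the key of a time series. The `series_id` helper is hypothetical and written only for this example; it is not how Prometheus stores series internally.

```python
def series_id(name, labels):
    """Render a series identity in <metric name>{<label name>=<label value>, ...} form."""
    # Sort the labels so the same label set always yields the same identity.
    body = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{body}}}"


post_a = series_id("api_http_requests_total", {"method": "POST", "handler": "/messages"})
post_b = series_id("api_http_requests_total", {"handler": "/messages", "method": "POST"})
get = series_id("api_http_requests_total", {"method": "GET", "handler": "/messages"})

print(post_a)
print(post_a == post_b)  # same name and labels: same series, whatever the label order
print(post_a == get)     # one label value differs: a different series
```

Changing any single label value produces a new series, which is why label values with very many distinct values (user IDs, for instance) multiply the number of series.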

How to get metrics?


Directly instrumented

Not Directly instrumented


Source: https://promcon.io/2016-berlin/talks/so-you-want-to-write-an-exporter/
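A directly instrumented program keeps its own counters and exposes them over HTTP so that Prometheus can scrape them. The sketch below shows the idea using only the Python standard library. The `REQUESTS` table, `render_metrics`, `MetricsHandler`, and `scrape_once` are names invented for this example, and the output covers only a small subset of the text exposition format (no label-value escaping, for instance). In practice an official client library would normally do this work.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by (method, handler).
REQUESTS = {("POST", "/messages"): 3, ("GET", "/messages"): 7}


def render_metrics():
    """Render the counter as plain text, one sample per line."""
    lines = ["# TYPE api_http_requests_total counter"]
    for (method, handler), value in sorted(REQUESTS.items()):
        lines.append(
            f'api_http_requests_total{{method="{method}",handler="{handler}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape_once():
    """Start the server on a free local port, fetch /metrics once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Bypass any configured proxy so the request stays on the loopback interface.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with opener.open(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `scrape_once()` plays the role of a single scrape: it returns the same text that `render_metrics()` produces. Software that cannot be instrumented this way is the "not directly instrumented" case, where a separate exporter process translates existing metrics into this format.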


Directly instrumented software


cAdvisor, Doorman, Etcd, Kubernetes-Mesos, Kubernetes, RobustIRC, SkyDNS, Weave Flux

Official Prometheus Exporters


Node/system metrics exporter, AWS CloudWatch exporter, Blackbox exporter, Collectd exporter, Consul exporter, Graphite exporter, HAProxy exporter, InfluxDB exporter, JMX exporter, Memcached exporter, Mesos task exporter, MySQL server exporter, SNMP exporter, StatsD exporter

3rd party exporters


Databases: Aerospike exporter, ClickHouse exporter, CouchDB exporter, MongoDB exporter, PgBouncer exporter, PostgreSQL exporter, ProxySQL exporter, Redis exporter, RethinkDB exporter, SQL query result set metrics exporter

3rd party exporters


Hardware related: apcupsd exporter, IoT Edison exporter, IPMI exporter, knxd exporter, Ubiquiti UniFi exporter

Messaging systems: NATS exporter, NSQ exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter, Mirth Connect exporter

3rd party exporters


Storage: Ceph exporter, ScaleIO exporter

HTTP: Apache exporter, Nginx metric library, Passenger exporter, Varnish exporter, WebDriver exporter

APIs: Docker Hub exporter, GitHub exporter, OpenWeatherMap exporter, Rancher exporter, Speedtest.net exporter

Logging: Google's mtail log data extractor, Grok exporter

Other monitoring systems: Cloud Foundry Firehose exporter, scollector exporter, Heka dashboard exporter, Heka exporter, Munin exporter, New Relic exporter

Miscellaneous: BIG-IP exporter, BIND exporter, BOSH exporter, Jenkins exporter, Meteor JS web framework exporter, Minecraft exporter module, PowerDNS exporter, rTorrent exporter, SMTP/Maildir MDA blackbox prober, Xen exporter






Non-SQL Query Language

Better for metrics computation

Only does reads

Source: https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

PromQL - Operators


Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (exponentiation)

Comparison: == (equal), != (not-equal), > (greater-than), < (less-than), >= (greater-or-equal), <= (less-or-equal)

Logical/set: and (intersection), or (union), unless (complement)

… and vector matching

Source: https://prometheus.io

PromQL - Aggregation Operators


sum stddev bottomk

min stdvar topk

max count quantile

avg count_values

Source: https://prometheus.io
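An aggregation operator collapses many series into fewer by grouping on a chosen set of labels, as in the PromQL expression `sum by (job) (some_metric)`. The sketch below mimics that grouping in plain Python. The `sum_by` helper and the sample data are assumptions made up for this example, not part of Prometheus.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Sum sample values per group, keeping only the labels listed in `by`."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A label that is absent is treated like an empty value.
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


# Hypothetical instant vector: (labels, value) pairs at one point in time.
samples = [
    ({"job": "api", "instance": "a"}, 3.0),
    ({"job": "api", "instance": "b"}, 4.0),
    ({"job": "db", "instance": "c"}, 5.0),
]
per_job = sum_by(samples, by=["job"])
print(per_job)
```

Here the two `api` instances fold into a single value of 7.0 and the `instance` label disappears from the result, which is the typical way to go from per-instance series to a per-service view.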

PromQL - Examples



errors{job="foo"} / total{job="foo"}

Source: https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/
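The division in the example query is applied per series: each sample on the left is paired with the sample on the right that carries the same label set, and unmatched samples are dropped. A minimal Python sketch of that one-to-one matching follows. The `divide` helper, the `instance` label, and the numbers are assumptions for illustration only, and denominators are assumed to be non-zero.

```python
def divide(lhs, rhs):
    """Divide two instant vectors, pairing samples with identical label sets."""
    rhs_by_labels = {frozenset(labels.items()): value for labels, value in rhs}
    result = []
    for labels, value in lhs:
        denominator = rhs_by_labels.get(frozenset(labels.items()))
        if denominator is None:
            continue  # no matching series on the right-hand side: dropped
        result.append((labels, value / denominator))
    return result


# Hypothetical samples already filtered by the job="foo" selector.
errors = [
    ({"job": "foo", "instance": "a"}, 5.0),
    ({"job": "foo", "instance": "b"}, 1.0),
]
total = [
    ({"job": "foo", "instance": "a"}, 100.0),
    ({"job": "foo", "instance": "b"}, 50.0),
    ({"job": "foo", "instance": "c"}, 10.0),
]
ratios = divide(errors, total)
print(ratios)
```

Instance `c` has no matching `errors` sample, so it does not appear in the result. The output is one error ratio per remaining instance rather than a single number for the whole job.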

DEMO Part 2







Symptom-based alerting: Be proactive




Prevent alert fatigue
- Use ticketing systems (avoid email spam)
- Warnings are tasks, like new features




Provide runbooks
- Keep them concise
- Explanation, hints, links
- Dynamic: include recent observations




Practice outages: “Firedrills”, “Gamedays”. Repeat regularly.



Start being proactive. Don't be firefighters.

… and remember …


Hope is NOT a strategy

@phennex

Source: Site Reliability Engineering: How Google Runs Production Systems (2016), B. Beyer et al.

If you wanna know more…


- prometheus.io
- promcon.io
- The Site Reliability Engineering book
- Podcasts:
  - https://dev.to/sedaily/prometheus-monitoring-with-brian-brazil
  - https://dev.to/sedaily/the-art-of-monitoring-with-james-turnbull (prefers push-based, the opposite of Prometheus)
  - https://dev.to/sedaily/prometheus-with-julius-volz


The 3rd project in CNCF


Thank you! @phennex

