
Understanding and Extending Prometheus AlertManager

Date post: 22-Jan-2018
Upload: lee-calcote
Understanding and Extending Prometheus AlertManager Lee Calcote calcotestudios.com/talks
Transcript
Page 1: Understanding and Extending Prometheus AlertManager

Understanding and Extending Prometheus AlertManager

Lee Calcote

calcotestudios.com/talks

Page 2: Understanding and Extending Prometheus AlertManager

Lee Calcote

linkedin.com/in/leecalcote

@lcalcote

blog.gingergeek.com

[email protected]

clouds, containers, infrastructure, applications and their management

calcotestudios.com/talks

Page 3: Understanding and Extending Prometheus AlertManager

Show of Hands

Page 4: Understanding and Extending Prometheus AlertManager

Prometheus AlertManager

Page 5: Understanding and Extending Prometheus AlertManager

Alertmanager is an alert...

Purpose

ingester

grouper

de-duplicator

silencer

throttler

notifier

Page 6: Understanding and Extending Prometheus AlertManager

Nomenclature (\ˈnō-mən-ˌklā-chər\): a brief Prometheus AlertManager construct review

Routes - match alerts to their receiver and how often to notify.

Receivers - where and how to send alerts.

Page 7: Understanding and Extending Prometheus AlertManager

Nomenclature (\ˈnō-mən-ˌklā-chər\): a brief Prometheus AlertManager construct review

Silencers (muting) - match alerts with specific labels and prevent them from being included in notifications.

Inhibitors (suppressing) - suppress specific notifications when other specific alerts are already firing.

Grouping (correlating) - categorizes alerts of similar nature into a single notification.

group_by: ['alertname', 'cluster']

group_wait: 30s

group_interval: 5m
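To make the grouping idea concrete, here is a minimal Go sketch of bucketing alerts by their group_by labels. The Alert type and the groupKey and groupAlerts helpers are hypothetical names made up for illustration, not AlertManager's own code, and the timing knobs (group_wait, group_interval) are not modeled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert is a simplified, hypothetical stand-in for an alert:
// just a set of label name/value pairs.
type Alert map[string]string

// groupKey builds a key from the values of the group_by labels.
// Alerts that share a key belong to the same notification group.
func groupKey(a Alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+a[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alerts by their group key.
func groupAlerts(alerts []Alert, groupBy []string) map[string][]Alert {
	groups := make(map[string][]Alert)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	alerts := []Alert{
		{"alertname": "instance_down", "cluster": "a", "instance": "host1"},
		{"alertname": "instance_down", "cluster": "a", "instance": "host2"},
		{"alertname": "instance_down", "cluster": "b", "instance": "host3"},
	}
	groups := groupAlerts(alerts, []string{"alertname", "cluster"})

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s -> %d alert(s)\n", k, len(groups[k]))
	}
}
```

With group_by set to alertname and cluster, the two alerts from cluster a collapse into one group and the alert from cluster b forms its own.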

Page 8: Understanding and Extending Prometheus AlertManager

Multiple approaches to suppression

repeat_interval (per route)

vs

Inhibition (global)

vs

Silences (via UI / API)
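Of the three, inhibition is the least obvious, so a rough Go sketch may help. It assumes the usual shape of an inhibition rule (source matchers, target matchers, and a list of labels that must be equal on both alerts). The Labels and InhibitRule types and the helper functions are hypothetical simplifications, not AlertManager's implementation.

```go
package main

import "fmt"

// Labels is a simplified stand-in for an alert's label set.
type Labels map[string]string

// InhibitRule is a hypothetical, simplified model of an inhibition rule.
type InhibitRule struct {
	SourceMatch Labels   // labels the firing (source) alert must carry
	TargetMatch Labels   // labels the alert to be muted (target) must carry
	Equal       []string // labels whose values must agree on both alerts
}

// matchesAll reports whether every name/value pair in want is present in have.
func matchesAll(have, want Labels) bool {
	for k, v := range want {
		if have[k] != v {
			return false
		}
	}
	return true
}

// inhibits reports whether a firing source alert mutes the target alert.
func (r InhibitRule) inhibits(source, target Labels) bool {
	if !matchesAll(source, r.SourceMatch) || !matchesAll(target, r.TargetMatch) {
		return false
	}
	for _, name := range r.Equal {
		if source[name] != target[name] {
			return false
		}
	}
	return true
}

// isInhibited checks the target against every firing alert and every rule.
func isInhibited(target Labels, firing []Labels, rules []InhibitRule) bool {
	for _, r := range rules {
		for _, src := range firing {
			if r.inhibits(src, target) {
				return true
			}
		}
	}
	return false
}

func main() {
	rules := []InhibitRule{{
		SourceMatch: Labels{"severity": "critical"},
		TargetMatch: Labels{"severity": "warning"},
		Equal:       []string{"cluster"},
	}}
	firing := []Labels{{"alertname": "cluster_down", "severity": "critical", "cluster": "a"}}

	sameCluster := Labels{"alertname": "high_latency", "severity": "warning", "cluster": "a"}
	otherCluster := Labels{"alertname": "high_latency", "severity": "warning", "cluster": "b"}
	fmt.Println(isInhibited(sameCluster, firing, rules))  // true
	fmt.Println(isInhibited(otherCluster, firing, rules)) // false
}
```

The warning in cluster a is muted because a critical alert is already firing there; the same warning in cluster b still notifies.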

Page 9: Understanding and Extending Prometheus AlertManager

Alerts: a shared construct

ALERT <alert name>

IF <PromQL vector expression>

FOR <duration>

LABELS { ... }

ANNOTATIONS { ... }

Prometheus: state transition between inactive, pending and firing.

AlertManager: is notified when alerts transition state (inactive, firing) and sends notifications. Supports clients other than Prometheus.

Page 10: Understanding and Extending Prometheus AlertManager

Notification Integrations


Page 11: Understanding and Extending Prometheus AlertManager

Notifying to Multiple Destinations

Use a receiver that goes to both destinations.

route:
  receiver: email_webhook

receivers:
- name: email_webhook
  email_configs:
  - to: '[email protected]'
  webhook_configs:
  - url: <webhook url here>

or

Use continue to advance to next receiver.

route:
  receiver: ops-team-all # default
  routes:
  - match:
      severity: page
    receiver: ops-team-b
    continue: true
  - match:
      severity: critical
    receiver: ops-team-a

receivers:
- name: ops-team-all
  email_configs:
  - to: '[email protected]'
- name: ops-team-a
  email_configs:
  - to: '[email protected]'
- name: ops-team-b
  email_configs:
  - to: '[email protected]'
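How continue changes the walk through a routing tree can be sketched in Go. The Route type and its methods below are hypothetical simplifications. The tree is loosely modeled on the ops-team example, except that the second child matches a made-up team label so that a single alert can match both children.

```go
package main

import "fmt"

// Route is a hypothetical, simplified routing tree node.
type Route struct {
	Match    map[string]string
	Receiver string
	Continue bool
	Routes   []Route
}

// matches reports whether the labels satisfy every matcher on this node.
// A node without matchers (such as the root) matches everything.
func (r Route) matches(labels map[string]string) bool {
	for k, v := range r.Match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// receivers walks the tree and returns the receivers an alert is sent to.
// Siblings are tried in order; after a match the walk stops unless that
// child sets Continue. If no child matches, the node's own receiver is used.
func (r Route) receivers(labels map[string]string) []string {
	if !r.matches(labels) {
		return nil
	}
	var out []string
	for _, child := range r.Routes {
		got := child.receivers(labels)
		out = append(out, got...)
		if got != nil && !child.Continue {
			break
		}
	}
	if len(out) == 0 {
		out = append(out, r.Receiver)
	}
	return out
}

func main() {
	root := Route{
		Receiver: "ops-team-all", // default
		Routes: []Route{
			{Match: map[string]string{"severity": "page"}, Receiver: "ops-team-b", Continue: true},
			{Match: map[string]string{"team": "ops"}, Receiver: "ops-team-a"},
		},
	}
	fmt.Println(root.receivers(map[string]string{"severity": "page", "team": "ops"}))
	fmt.Println(root.receivers(map[string]string{"severity": "page"}))
	fmt.Println(root.receivers(map[string]string{"severity": "info"}))
}
```

The first alert reaches both ops-team-b and ops-team-a because the first child sets continue; the last alert matches no child and falls back to the default receiver.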

Page 12: Understanding and Extending Prometheus AlertManager

Non-HA AlertManager Architecture

[Architecture diagram. Components shown: API, UI, Alert Provider, Silence Provider, store, Dispatcher, Router, Inhibitor, Silencer, notification pipeline (de-duplication, Retry), Notify Provider, Maintenance Script. Flows shown: alerts, subscribe, batched alerts.]

Dispatcher sorts incoming alerts into aggregation groups and assigns the correct notifiers to each.

Notify Provider: checks for previously sent notifications.

Page 13: Understanding and Extending Prometheus AlertManager

High Availability

being introduced in 0.5

I ♥ gossip protocols.

built atop Weave Mesh

With HA, you no longer have to monitor the monitor.

 

Designed for an alert to be sent to all instances in the cluster.

 

All Prometheus instances send alerts to all Alertmanager instances.

 

Guarantees notifications to be sent at least once.


Page 14: Understanding and Extending Prometheus AlertManager
Page 15: Understanding and Extending Prometheus AlertManager

AlertManager UI


Page 16: Understanding and Extending Prometheus AlertManager


Page 17: Understanding and Extending Prometheus AlertManager

Story: As an Operator, I would like to not only see a list of firing alerts, but also a list of all transpired alerts, so that I may have additional context as to the thresholding behavior for a given defined alert.

Prologue: Alert troubleshooting is improved when operators have a view of what is firing, what has recently fired and what is normal, and can also go back in time and see what fired an hour ago. Understanding firing order assists in root cause analysis and in identifying problem areas.

Limitations:

1. AlertManager database (SQLite) is not intended to provide long-term storage.

Acceptance Criteria:

1. Once fired, whether actively firing or not, alerts will be displayed on the History page.

2. Optionally, fired alerts will be notified to a Slack channel.

Stretch:

Include pagination

Add a date range picker

Add a host filter

Page 18: Understanding and Extending Prometheus AlertManager

Environment: test setup

Page 19: Understanding and Extending Prometheus AlertManager

Random Sample Targets

Fetch and compile the client library code example.

$ git clone https://github.com/prometheus/client_golang.git
$ cd client_golang/examples/random
$ go get -d
$ go build

Start example targets in separate terminals.

$ ./random -listen-address=:8080
$ ./random -listen-address=:8081
$ ./random -listen-address=:8082

Be sure to create and run the random sample targets and point it at your soon-to-be AlertManager.

Page 20: Understanding and Extending Prometheus AlertManager

Prometheus and Alert Rules Setup

Follow the getting started instructions to download, configure and run Prometheus.

$ ./prometheus -config.file=prometheus.yml -alertmanager.url=http://localhost:9093

/alert.rules

ALERT instance_down
  IF up == 0
  FOR 5s
  LABELS {severity="page"}
  ANNOTATIONS {
    DESCRIPTION="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 seconds.",
    SUMMARY="Instance {{$labels.instance}} down"}

A simple alert rule that will fire when any given target is unreachable for longer than 5 seconds.

/prometheus.yml

...
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
- "alert.rules"
...

Page 21: Understanding and Extending Prometheus AlertManager

Environment: development setup

Page 22: Understanding and Extending Prometheus AlertManager

Grab Repos

Clone AlertManager repo

$ git clone https://github.com/prometheus/alertmanager.git

Given that our user story includes making front-end changes to AlertManager, ensure that you install a small utility to generate Go code from any file.

Get, build and copy go-bindata into any directory on your PATH

$ go get -u github.com/jteeuwen/go-bindata/...
$ cd $GOPATH/src/github.com/jteeuwen/go-bindata/go-bindata
$ go build

Page 23: Understanding and Extending Prometheus AlertManager

Notification Integration

Create an alert notification receiver. Of the supported AlertManager receivers, let’s opt for integrating Slack.

route:
  group_by: [cluster]
  # If an alert isn't caught by a route, send it to Slack.
  receiver: slack_general
  routes:
  # Send severity=page alerts to Slack.
  - match:
      severity: page
    receiver: slack_general

receivers:
- name: slack_general
  slack_configs:
  - api_url: '<your-web-url-here>'
    channel: '#<your-channel-name-here>'
    send_resolved: true

Page 24: Understanding and Extending Prometheus AlertManager

The visual editor can assist in building routing trees.

Page 25: Understanding and Extending Prometheus AlertManager

Build, Run, Test

Verify you have a functional development environment by building and running the project:

$ make assets    # invokes go-bindata to inject static web files
$ go build       # compiles go code
$ ./alertmanager -config.file=slack.yml    # runs alertmanager with the specified configuration

Reload Prometheus or AlertManager configs

$ curl -X POST http://localhost:9090/-/reload
$ kill -HUP `pgrep alertmanager`

Validate Prometheus config and alert rules

$ ./promtool check-config <config file>
$ ./promtool check-rules <rules file>

Page 26: Understanding and Extending Prometheus AlertManager

Test

If you chose to set up a Slack channel, you should now see new alerts firing as and when your random targets go up and down.

Page 27: Understanding and Extending Prometheus AlertManager

Changelog

Angular: /ui/app/js/app.js

HTML: /ui/app/partials/history.html

Go: /api.go

Go & SQL: /provider/provider.go, /provider/sqlite/sqlite.go, /provider/boltmem/boltmem.go

Page 28: Understanding and Extending Prometheus AlertManager

/api.go

All UI functionality should be addressable via API.

Let’s register a new /history API endpoint:

r.Get("/history", ihf("history", api.listAllAlerts))

With /api/v1/history now an addressable API endpoint, we’ll need to build a function to handle requests made to it. The api.listAllAlerts function will handle inbound HTTP requests made to the new endpoint.

func (api *API) listAllAlerts(w http.ResponseWriter, r *http.Request) {
	alerts := api.alerts.GetAll()
	defer alerts.Close()
	...

Page 29: Understanding and Extending Prometheus AlertManager

/provider

With the API endpoint in place, let’s turn our attention to the backend for collecting the right recordset from our data provider.

1. Add a new AlertIterator (e.g. GetAll() AlertIterator) to /provider/provider.go

2. Add a new AlertProvider and SQL query to /provider/sqlite/sqlite.go

3. Add a new AlertIterator and AlertProvider to /provider/boltmem/boltmem.go
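A GetAll() that hands back an iterator might look roughly like the following in-memory Go sketch. The AlertIterator interface here is a simplified, hypothetical version (a channel of alerts plus Close), and memProvider is made up for illustration; the real provider interfaces carry more than this.

```go
package main

import (
	"fmt"
	"sync"
)

// Alert is a hypothetical, trimmed-down alert record.
type Alert struct {
	Name     string
	Resolved bool
}

// AlertIterator is a simplified, hypothetical iterator over alerts.
type AlertIterator interface {
	Next() <-chan Alert
	Close()
}

// memIterator streams a fixed snapshot of alerts over a channel.
type memIterator struct {
	ch   chan Alert
	done chan struct{}
	once sync.Once
}

func newMemIterator(alerts []Alert) *memIterator {
	it := &memIterator{ch: make(chan Alert), done: make(chan struct{})}
	go func() {
		defer close(it.ch)
		for _, a := range alerts {
			select {
			case it.ch <- a:
			case <-it.done: // consumer closed the iterator early
				return
			}
		}
	}()
	return it
}

func (it *memIterator) Next() <-chan Alert { return it.ch }

// Close stops the producer goroutine; it is safe to call more than once.
func (it *memIterator) Close() { it.once.Do(func() { close(it.done) }) }

// memProvider is a hypothetical in-memory alert provider.
type memProvider struct {
	mu     sync.RWMutex
	alerts []Alert
}

// GetAll returns every stored alert, whether still firing or resolved.
func (p *memProvider) GetAll() AlertIterator {
	p.mu.RLock()
	defer p.mu.RUnlock()
	snapshot := append([]Alert(nil), p.alerts...)
	return newMemIterator(snapshot)
}

func main() {
	p := &memProvider{alerts: []Alert{
		{Name: "instance_down", Resolved: false},
		{Name: "high_latency", Resolved: true},
	}}
	it := p.GetAll()
	defer it.Close()
	for a := range it.Next() {
		fmt.Printf("%s resolved=%v\n", a.Name, a.Resolved)
	}
}
```

The usage in main mirrors the GetAll() followed by defer Close() pattern seen in listAllAlerts.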

Page 30: Understanding and Extending Prometheus AlertManager

/ui/app/js/app.js

NavCtrl for the History menu item:

angular.module('am.controllers').controller('NavCtrl',
  function($scope, $location) {
    $scope.items = [{
      name: 'History',
      url: 'history'
    },
    ...

as well as a new History service:

angular.module('am.services').factory('History',
  function($resource) {
    return $resource('', {}, {
      'query': {
        method: 'GET',
        url: 'api/v1/history'
      }
    });
  }
);

and a new History controller:

angular.module('am.controllers').controller('HistoryCtrl',
  function($scope, History) {
    $scope.refresh = function () {
      History.query({},
        function(data) {
          $scope.groups = data.data;
          console.log($scope.groups);
        },
        function(data) {
          console.log(data.data);
        }
      );
    };
    $scope.refresh();
  }
);

Insert a new History directive:

angular.module('am.directives').directive('history',
  function() {
    return {
      restrict: 'E',
      scope: {
        alert: '=',
        group: '='
      },
      templateUrl: 'app/partials/history.html'
    };
  }
);

Page 31: Understanding and Extending Prometheus AlertManager

/ui/app/partials/history.html

Finally, we’ll need a page in which to view the transpired alerts. So, create a new file, history.html, under /ui/app/partials.

history.html will simply format and display a tabular recordset. A new recordset will be retrieved from our data provider.

Page 32: Understanding and Extending Prometheus AlertManager

Summary

This example enhancement provides a view of transient history: that of the period that the SQLite database holds.

AlertManager is not currently intended to provide long-term storage.

Contributing is easier than you may think.

Reference

Alert History fork

Alert History tutorial

Page 33: Understanding and Extending Prometheus AlertManager

Resources

IRC: #prometheus on irc.freenode.net

Mailing lists:

prometheus-users – discussing Prometheus usage and community support

prometheus-developers – contributing to Prometheus development

@PrometheusIO

Prometheus repositories to file bugs and feature requests

Page 34: Understanding and Extending Prometheus AlertManager

Lee Calcote

Thank you. Questions?

clouds, containers, infrastructure, applications and their management

linkedin.com/in/leecalcote

@lcalcote

blog.gingergeek.com

[email protected]

calcotestudios.com/talks

yes, we're hiring

