Alertmanager - PromCon

Date post: 02-Feb-2022
Page 1: Alertmanager - PromCon

Alertmanager and high availability

Frederic Branczyk
Software Engineer at CoreOS

Prometheus/Alertmanager/Kubernetes
@brancz

Page 2: Alertmanager - PromCon

Where does CoreOS fit in?

● Automating Monitoring infrastructure

● Prometheus + Kubernetes

Page 3: Alertmanager - PromCon

What will I be talking about?

● From alert to notification

● High availability contract

● High availability implementation

● Implications on operating HA Alertmanager

Page 4: Alertmanager - PromCon

Alertmanager Features

● Receives and groups alerts

● Deduplicates alerts

● Sends notifications to providers

○ Pagerduty, email, Slack, etc.

● Silencing

Page 5: Alertmanager - PromCon

Prometheus & Alertmanager

Page 6: Alertmanager - PromCon

Alerting Rule Alerting Rule Alerting Rule Alerting Rule...

04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
. . .
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST

Page 7: Alertmanager - PromCon

Grouped in one notification

● 3 x HighLatency

● 10 x HighErrorRate

● 2 x CacheServerSlow

● (+individual Alerts)
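
As a rough illustration of the grouping idea on this slide, the following Python sketch buckets alerts by a set of labels and counts them per alert name. It is not Alertmanager's code; `group_alerts`, `summarize`, the choice of `service` and `zone` as grouping labels, and the sample alerts are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

def summarize(group):
    """Count the alerts in one group per alert name."""
    return Counter(alert["alertname"] for alert in group)

# Hypothetical sample, modeled on the alert stream from the previous slide.
ALERTS = [
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "GET"},
    {"alertname": "HighErrorRate", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "POST"},
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/index", "method": "POST"},
]

# All three alerts share service and zone, so they land in a single group
# and would be delivered as one notification.
GROUPS = group_alerts(ALERTS, ["service", "zone"])
```

Calling `summarize` on the single group yields the per-name counts that would head the grouped notification, with the individual alerts still available in the group list.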

Page 8: Alertmanager - PromCon

Boiled down:

Alertmanager reliably sends notifications

Page 9: Alertmanager - PromCon

High Availability

Page 10: Alertmanager - PromCon

Infrastructure Scaling Story

Prometheus

Prometheus

Alertmanager

Alertmanager

Gossip

Microservice 1

Microservice 2

Microservice 3

Microservice 1

Microservice 2

Microservice 3

...

Page 11: Alertmanager - PromCon

Why decoupled?

● Keep Prometheus alerting simple

● High availability of Prometheus

● No state sharing between Prometheus servers

Page 12: Alertmanager - PromCon

Example Alerting Rule

ALERT NoLeader
  IF etcd_has_leader == 0
  FOR 10m
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "etcd no leader",
    description = "etcd instance has no leader",
  }

Page 13: Alertmanager - PromCon

Alert Evaluation in Prometheus

Rule 1

Rule 2

Rule 3

...

● Evaluate Rule/Alert

● Fire alert against Alertmanager

Repeat every *rule evaluation interval*
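
The evaluation loop can be sketched in Python as follows. This is a simplified model, not Prometheus code: `AlertState`, `run`, and the sample values are hypothetical, and the condition mirrors the example rule (`etcd_has_leader == 0` with a `FOR` of 10 minutes). An alert is assumed to stay pending until its condition has held for the full `FOR` duration, and to be sent to Alertmanager on every evaluation while it is firing.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AlertState:
    for_seconds: float                    # the FOR clause of the rule
    active_since: Optional[float] = None  # when the condition first held

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for one evaluation."""
        if not condition_true:
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"

def run(samples: List[int], interval: float, for_seconds: float) -> List[float]:
    """Evaluate once per interval; return the times an alert was sent."""
    state = AlertState(for_seconds)
    sent = []
    for i, value in enumerate(samples):
        now = i * interval
        # Condition of the example rule: etcd_has_leader == 0
        if state.evaluate(value == 0, now) == "firing":
            sent.append(now)  # stand-in for firing against Alertmanager
    return sent

# Hypothetical series of etcd_has_leader values, one per 5-minute evaluation.
SENT_AT = run([1, 0, 0, 0, 0, 1, 0], interval=300, for_seconds=600)
```

Because the alert is re-sent on each evaluation while firing, Alertmanager needs no shared state with Prometheus to know that an alert is still active.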

Page 14: Alertmanager - PromCon

Simple configuration

● Resolve alerts in 5m

● Group by job label

● Group for 10 seconds

● Send via webhook receiver

global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'

Page 15: Alertmanager - PromCon

Notification Pipeline

● Silence: Do not continue

● Wait: Position in cluster multiplied by 5 seconds

● Dedup: Has notification already been sent?

● Send: Send notification via favorite provider

● Gossip: Tell other peers notification has been sent
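
The pipeline stages can be sketched as one Python function per alert group. This is an illustrative model under stated assumptions, not Alertmanager's implementation: `notify`, `wait_seconds`, and the use of a plain set as the notification log are hypothetical, and the wait is only computed, not slept.

```python
from typing import Callable, Set

WAIT_STEP_SECONDS = 5  # per-position delay named on the slide

def wait_seconds(position: int) -> int:
    """Delay before an instance acts: its position in the cluster times 5s."""
    return position * WAIT_STEP_SECONDS

def notify(group_key: str, position: int, silenced: bool, log: Set[str],
           send: Callable[[str], None], gossip: Callable[[str], None]) -> str:
    """Run one alert group through the stages and report what happened."""
    if silenced:                  # Silence: do not continue
        return "silenced"
    _ = wait_seconds(position)    # Wait: a real instance would sleep this long
    if group_key in log:          # Dedup: has the notification been sent?
        return "deduplicated"
    send(group_key)               # Send via the configured provider
    log.add(group_key)
    gossip(group_key)             # Gossip: tell peers it has been sent
    return "sent"
```

With a log shared through gossip, the instance at position 0 sends immediately, and the instance at position 1 finds the entry after its 5-second wait and stops at the dedup stage.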

Page 16: Alertmanager - PromCon

What is gossiped?

● Yes

○ Sent notifications

○ Silences

● No

○ Received alerts

Page 17: Alertmanager - PromCon

How? CRDTs!

● Conflict-free replicated data type

● Associativity (a+(b+c)=(a+b)+c)

● Commutativity (a+b=b+a)

● Idempotence (a+a=a)

● Well suited for AP systems
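
To make the three laws concrete, here is a small Python check using set union as the merge operation (a grow-only set, one of the simplest CRDTs). The names `merge`, `check_laws`, and the replica contents are illustrative assumptions, not taken from Alertmanager.

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Join two replicas of a grow-only set; set union is the merge."""
    return a | b

def check_laws(a: frozenset, b: frozenset, c: frozenset) -> dict:
    """Check the three properties listed on the slide for given replicas."""
    return {
        "associative": merge(a, merge(b, c)) == merge(merge(a, b), c),
        "commutative": merge(a, b) == merge(b, a),
        "idempotent": merge(a, a) == a,
    }

# Three hypothetical replicas that each recorded a different notification.
A = frozenset({"notification-1"})
B = frozenset({"notification-2"})
C = frozenset({"notification-1", "notification-3"})

LAWS = check_laws(A, B, C)
```

Because all three laws hold, replicas converge regardless of the order in which gossip messages arrive or how often a message is repeated, which is what makes this style of data type a fit for AP systems.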

Page 18: Alertmanager - PromCon

Yes, but how? mesh by Weaveworks!

● Eventually consistent

● LWW-element-set

● Mergeable log of records

● Merges based on UID

○ On conflict latest timestamp wins
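
A last-writer-wins merge keyed by UID might look like the following Python sketch. It is not the mesh library's API: `merge_lww`, the record layout, and the `ts` field are assumptions for illustration. Ties on equal timestamps keep the local record here, which is a simplification; a real implementation needs a deterministic tie-break so that all peers agree.

```python
from typing import Dict

Record = Dict[str, object]  # hypothetical record; must carry a "ts" field

def merge_lww(local: Dict[int, Record],
              incoming: Dict[int, Record]) -> Dict[int, Record]:
    """Merge a gossip delta into local state; the latest timestamp wins."""
    merged = dict(local)
    for uid, record in incoming.items():
        current = merged.get(uid)
        if current is None or record["ts"] > current["ts"]:
            merged[uid] = record
    return merged

# Local silences on one instance, and a delta updating silence 1.
LOCAL = {
    1: {"query": "Query", "start": "Start", "end": "End", "ts": 1},
    2: {"query": "Query", "start": "Start", "end": "End", "ts": 2},
}
DELTA = {1: {"query": "Query", "start": "Start1", "end": "End", "ts": 5}}

MERGED = merge_lww(LOCAL, DELTA)
```

Merging the same delta again leaves the state unchanged, and an older delta arriving late does not overwrite the newer record.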

Page 19: Alertmanager - PromCon

Why not etcd?

● Simple operation

○ Fewer moving pieces

○ Single binary

● Want: AP not CP

Page 20: Alertmanager - PromCon

Silences

Page 21: Alertmanager - PromCon

Create Silences

Create Silence

Alertmanager 0

Silences Database

ID  Values
1   Query, Start, End
2   Query, Start, End

Alertmanager 1

Silences Database

ID  Values
1   Query, Start, End
2   Query, Start, End

Gossip Delta
ID: 2 ...

Merge Gossip Data

Page 22: Alertmanager - PromCon

Update Silences

Update Silence
UID: 1
Start: Start1

Alertmanager 0

Silences Database

ID  Values
1   Query, Start, End
2   Query, Start, End

Alertmanager 1

Silences Database

ID  Values
1   Query, Start, End
2   Query, Start, End

Gossip Delta
ID: 1
Start: Start1

Merge Gossip Data

Alertmanager 0: 1   Query, Start1, End
Alertmanager 1: 1   Query, Start1, End

Page 23: Alertmanager - PromCon

Notification Log

Page 24: Alertmanager - PromCon

Non-silenced alert example

Alertmanager 1

Alertmanager 0

Prometheus

● Wait 0s

● Wait 5s

● Dedup: Not sent → Send

● Gossip

● Receive Gossip Data

● Deduplicate → Do not send

Page 25: Alertmanager - PromCon

Gossip Partition

Alertmanager 1

Alertmanager 0

Prometheus

● Wait 0s

● Wait 5s

● Dedup: Not sent → Send

● Gossip

● Dedup: Not sent → Send

Network Partition
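
The difference between the healthy case and the partitioned case can be reproduced with a small Python simulation. It is a sketch under assumptions: `simulate` is a hypothetical helper, instances act in order of their position-based wait, and gossip is assumed to reach every peer in less than one 5-second step when the network is healthy.

```python
from typing import List, Tuple

def simulate(num_instances: int, partitioned: bool,
             step: int = 5) -> List[Tuple[int, int]]:
    """Return (time, instance) pairs for every notification actually sent."""
    logs = [set() for _ in range(num_instances)]  # one notification log each
    sent = []
    for position in range(num_instances):
        at = position * step              # Wait: position times 5 seconds
        if "group" in logs[position]:     # Dedup: already sent by a peer
            continue
        sent.append((at, position))       # Send
        logs[position].add("group")
        if not partitioned:               # Gossip reaches peers only if healthy
            for log in logs:
                log.add("group")
    return sent

HEALTHY = simulate(2, partitioned=False)
PARTITIONED = simulate(2, partitioned=True)
```

In the healthy run only instance 0 sends. Under a partition, instance 1 never learns of the first send and notifies again after its wait, so the failure mode is a duplicate notification rather than a missed one.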

Page 26: Alertmanager - PromCon

Notification Log

Alert Firing

Alertmanager 0

Notification Log

UID  Values
1    Resolve, Notify, TS, ...
2    Resolve, Notify, TS, ...

Alertmanager 1

Notification Log

UID  Values
1    Resolve, Notify, TS, ...
2    Resolve, Notify, TS, ...

Gossip Delta
UID: 2 ...

Merge Gossip Data

Page 27: Alertmanager - PromCon

Group Key

● Group at runtime

○ By Group By labels

● XOR with Route

● Concat with Receiver

global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
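
One way to read the group key recipe is sketched below in Python. The details are assumptions rather than Alertmanager's actual scheme: the hash function (`blake2b` truncated to 64 bits), the string `route_id` used to identify a route, and the output format are all chosen for illustration only.

```python
import hashlib
from typing import Dict, List

def fingerprint(text: str) -> int:
    """64-bit hash of a string; the hash choice is an assumption."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def group_key(labels: Dict[str, str], group_by: List[str],
              route_id: str, receiver: str) -> str:
    """Hash the group-by labels, XOR with the route, concat the receiver."""
    # Only the labels named in group_by contribute to the group.
    group_labels = sorted((name, labels.get(name, "")) for name in group_by)
    labels_fp = fingerprint(repr(group_labels))
    return f"{labels_fp ^ fingerprint(route_id):016x}:{receiver}"

# With group_by: ['job'], alerts that differ only in other labels share a key.
KEY_A = group_key({"job": "etcd", "instance": "a"}, ["job"], "root", "webhook")
KEY_B = group_key({"job": "etcd", "instance": "b"}, ["job"], "root", "webhook")
KEY_C = group_key({"job": "node", "instance": "a"}, ["job"], "root", "webhook")
```

Since every instance derives the key from the alert labels and the configuration alone, peers running the same configuration compute the same key independently, and that key is what the notification log entries are matched on.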

Page 28: Alertmanager - PromCon

DEMO!

Page 29: Alertmanager - PromCon

GitHub: @brancz

Twitter: @fredbrancz

QUESTIONS?

Thanks!

We’re hiring: coreos.com/careers

Let’s talk!

#prometheus on Freenode

More events: coreos.com/community

LONGER CHAT?

also in Berlin!

