~ A ~ THEORY ~~ and ~~ PRACTICE ~~~ of ~~~ SERVICE LEVEL OBJECTIVES Jamie Wilkinson SRECon Asia, June 2018
Transcript
Page 1: ~ A ~ THEORY - USENIX · the cost of maintenance must scale sublinearly with the growth of the service service size: e.g. queries, storage footprint, cores used, watts

~ A ~ THEORY ~~ and ~~ PRACTICE ~~~ of ~~~ SERVICE LEVEL OBJECTIVES

Jamie Wilkinson
SRECon Asia, June 2018

Page 2:

Alice Goldfuss, Monitorama 2017, Used with permission

Page 3:

Symptom Based Alerting

https://www.flickr.com/photos/chris-warren-photos/2220257496/ CC-BY-NC 2.0

Page 4:

the cost of maintenance must scale sublinearly with the growth of the service

service size: e.g. queries, storage footprint, cores used, watts

Why does X ∀ X ∈ {Ops} suck?

[Figure labels: “ops work”, cost, time, capacity]

Page 5:

symptom

IMAGE CREDITS: from Taiyo no Yusha Faibado (The Brave Fighter of Sun Fighbird) Episode 3 - All Members In! Space Police! Screenshot from: Hirano, Yasushi (Writer), & Yatabe, Katsuyoshi. (1991) All Members In! Space Police [Series Episode] In S. Imai, Y. Honna, T. Takayuki, Taiyo no Yusha Faibado [Brave Fighter of Sun Fighbird]. Tokyo, Japan. Takara and Sunrise

ATTRIBUTION CREDITS: Philippines Department of Science and Technology https://www.facebook.com/DOSTph/photos/a.1778676488821528.1073741834.1124649117557605/1786984204657423/?type=3&theater

Page 6:

What makes this a symptom?

https://pixabay.com/en/corridor-arcade-arches-passage-1251517/ public domain

Page 7:

https://www.flickr.com/photos/dbreg2007/4462205185 CC-BY-SA 2.0

Page 8:

Engineering Tolerance

https://en.wikipedia.org/wiki/Engineering_tolerance Public Domain

Page 9:

Availability “Tolerance”

100%, 99.9%, 99.5%, 99%
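
To make the availability “tolerance” concrete, the following Go sketch converts each target on the slide into the downtime it would permit. The 30-day window and the helper name allowedDowntimeMinutes are assumptions made for illustration; the slide does not state a measurement period.

```go
package main

import "fmt"

// allowedDowntimeMinutes returns the minutes of downtime that an
// availability target permits over a window of windowDays days.
// Hypothetical helper; the 30-day window used below is an assumption.
func allowedDowntimeMinutes(target, windowDays float64) float64 {
	return (1 - target) * windowDays * 24 * 60
}

func main() {
	const windowDays = 30 // assumed measurement window
	for _, target := range []float64{1.0, 0.999, 0.995, 0.99} {
		fmt.Printf("%5.1f%% availability tolerates %6.1f minutes of downtime in %d days\n",
			target*100, allowedDowntimeMinutes(target, windowDays), windowDays)
	}
}
```

A 100% target leaves no tolerance at all, which is why the looser targets are the ones worth engineering towards.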

Page 10:

By Jenna Fair [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

By Contributor(s): Queensland Newspapers Pty Ltd [Public domain], via Wikimedia Commons

By G.E. Ulrich, USGS. Cropping by Hike395 (talk · contribs) (USGS) [Public domain], via Wikimedia Commons

Page 11:

SLAs, SLOs, SLIs

● SLI → Indicator: a measurement
  ○ distribution of response time over 10 minutes
  ○ response error ratios over 10 minutes

● SLO → Objective: a goal
  ○ 99.9th percentile response latency below 5ms
  ○ lower than 1% rate of errors

● SLA → Agreement: economic incentives
  ○ or we get paged
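
The relationship between an SLI and an SLO can be sketched in Go: the indicators are measurements (a latency percentile and an error ratio), and the objective is a threshold applied to them. This is a minimal sketch, not the speaker's implementation. The function names, the nearest-rank percentile method, and the sample data are all assumptions; only the 99.9th percentile, 5ms, and 1% figures come from the slide.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using
// the nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy so the input is not reordered
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// errorRatio is the SLI "response error ratio": errors divided by total.
func errorRatio(errors, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(errors) / float64(total)
}

// meetsSLO applies the two example objectives from the slide:
// 99.9th percentile latency below 5ms and an error ratio below 1%.
func meetsSLO(latenciesMs []float64, errors, total int) bool {
	return percentile(latenciesMs, 99.9) < 5 && errorRatio(errors, total) < 0.01
}

func main() {
	latenciesMs := []float64{0.8, 1.2, 1.9, 2.4, 3.1, 4.7} // made-up samples
	fmt.Println("p99.9 latency (ms):", percentile(latenciesMs, 99.9))
	fmt.Println("error ratio:", errorRatio(3, 1000))
	fmt.Println("meets SLO:", meetsSLO(latenciesMs, 3, 1000))
}
```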

Page 12:

“As a mechanical engineer in an R&D lab I frequently ask myself, what is a reasonable tolerance to set on this part?”

https://engineerdog.com/2017/12/02/engineering-guidelines-for-selecting-mechanical-design-tolerances/

Page 13:
Page 14:

A symptom is anything that can be measured by the SLO.

A symptom-based alert is an alert when the SLO is in danger of being missed.

Page 15:

For availability SLAs we often talk about system uptime:

How do you measure uptime of a distributed system?

Page 16:

By David Hall [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons

Page 17:

Another way to calculate this is with a request success rate:

Defining SLOs in terms of request success rate makes it easier to measure an error budget
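
The formula shown on this slide did not survive in the transcript. A common formulation, assumed here, is availability = successful requests / total requests, with the error budget being the share of requests the SLO allows to fail. The Go sketch below illustrates that assumption; the function names and the traffic figures are hypothetical.

```go
package main

import "fmt"

// availability is the request success rate: succeeded / total.
// This formulation is an assumption; the slide's formula was not captured.
func availability(succeeded, total float64) float64 {
	if total == 0 {
		return 1 // no requests means nothing failed
	}
	return succeeded / total
}

// errorBudget is the number of failed requests an SLO permits out of total.
func errorBudget(slo, total float64) float64 {
	return (1 - slo) * total
}

// budgetRemaining is the error budget left after failed requests.
// A negative result means the SLO has been missed.
func budgetRemaining(slo, total, failed float64) float64 {
	return errorBudget(slo, total) - failed
}

func main() {
	total, failed := 1000000.0, 4000.0 // made-up traffic figures
	slo := 0.99
	fmt.Printf("availability: %.4f\n", availability(total-failed, total))
	fmt.Printf("error budget: %.0f requests\n", errorBudget(slo, total))
	fmt.Printf("budget remaining: %.0f requests\n", budgetRemaining(slo, total, failed))
}
```

Counting requests rather than minutes of uptime means the budget is spent in the same unit that the errors are counted in.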

Page 18:

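// Requires "net/http", "strconv", and
// "github.com/prometheus/client_golang/prometheus".
var responses = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "responses",
		Help: "total responses served, by status code and user",
	},
	[]string{"code", "user"})

...

// Label with the numeric status code so rules can match code="200".
responses.WithLabelValues(
	strconv.Itoa(http.StatusBadRequest),
	GetUser(req)).Inc()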

Page 19:

[Figure: queries per second vs. time, 1 second samples; annotations: "1/qps sample density", "?"]

Page 20:
Page 21:

record: error_ratio_by_user
expr: sum by (job, user)(rate(responses{code!~"200"}[10s]))
  / on (job, user)
  sum by (job, user)(rate(responses[10s]))

alert: ErrorRatioTooHigh
expr: error_ratio_by_user > 0.01
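
The aggregation these rules perform can be sketched in plain Go over in-memory counts: sum the non-200 responses per user, divide by all responses per user, and flag any ratio above 0.01. This is an illustration only and does not query Prometheus. The key type, function names, and traffic numbers are hypothetical, and the job label is left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// key identifies one labelled series, like responses{code, user}.
type key struct{ code, user string }

// errorRatioByUser divides non-200 responses by all responses, per user.
// counts holds the number of responses seen in the rate window.
func errorRatioByUser(counts map[key]float64) map[string]float64 {
	errs := map[string]float64{}
	total := map[string]float64{}
	for k, n := range counts {
		total[k.user] += n
		if k.code != "200" {
			errs[k.user] += n
		}
	}
	ratios := map[string]float64{}
	for user, t := range total {
		if t > 0 {
			ratios[user] = errs[user] / t
		}
	}
	return ratios
}

// usersOverThreshold returns, in sorted order, the users whose error
// ratio exceeds threshold (0.01 in the alert on the slide).
func usersOverThreshold(ratios map[string]float64, threshold float64) []string {
	var out []string
	for user, r := range ratios {
		if r > threshold {
			out = append(out, user)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	counts := map[key]float64{ // made-up window counts
		{"200", "alice"}: 980, {"500", "alice"}: 20,
		{"200", "bob"}: 1000,
	}
	ratios := errorRatioByUser(counts)
	fmt.Println("ratios:", ratios)
	fmt.Println("alerting for:", usersOverThreshold(ratios, 0.01))
}
```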

Page 22:

SLO burn

[Figure: cumulative errors vs. time, annotated with the alerting window, the scaled error budget, and the error rate threshold]

Page 23:

Burn rate maths

Average QPS rate: 1000
SLO: 99% over 1 week
= 604,800,000 total queries
= 6,048,000 permissible errors

Page 24:

Take 1 hour moving average of errors
Page if error budget is going to be exhausted in less than 24 hours
= 6,048,000 errors consumed per day
= 70 err/s = 252,000 errors in 1 hour
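
The burn rate arithmetic above can be reproduced with a short Go sketch. The inputs (1000 QPS, a 99% SLO over one week, a 24 hour burn period, a 1 hour window) are the slide's; the function name burnThresholds and its parameter names are assumptions made for illustration.

```go
package main

import "fmt"

// burnThresholds derives paging thresholds from an SLO.
// All periods are in seconds.
//   budget:       errors permitted over the whole SLO period
//   errPerSec:    error rate that would exhaust the budget in burnPeriod
//   errPerWindow: errors at that rate within one alerting window
func burnThresholds(qps, slo, sloPeriod, burnPeriod, window float64) (budget, errPerSec, errPerWindow float64) {
	budget = qps * sloPeriod * (1 - slo)
	errPerSec = budget / burnPeriod
	errPerWindow = errPerSec * window
	return budget, errPerSec, errPerWindow
}

func main() {
	const (
		week = 7 * 24 * 60 * 60
		day  = 24 * 60 * 60
		hour = 60 * 60
	)
	budget, perSec, perHour := burnThresholds(1000, 0.99, week, day, hour)
	fmt.Printf("error budget: %.0f errors per week\n", budget)
	fmt.Printf("fast-burn rate: %.0f err/s\n", perSec)
	fmt.Printf("fast-burn count: %.0f errors in 1 hour\n", perHour)
}
```

Keeping the burn period as a parameter makes it easy to see how a shorter period raises the paging threshold.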

Page 25:

Page if 15m rate over 70.

[Figure: cumulative errors vs. time; alerting window = 1h, scaled error budget = 252000, error rate threshold = 70 err/s]

Page 26:

expr: delta(errors[1h]) > (expected_events * error_budget / burn_period)
=
expr: delta(errors[1h]) > ((1000 qps * 7d) * 0.01 / 24h)
=
expr: delta(errors[1h]) > 252000

SLO Fast Burn
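
A fast-burn check of this shape can be sketched in Go as "growth of the cumulative error count over the alerting window, compared with a threshold". This is a simplification, not how Prometheus evaluates delta(): it takes the difference between the newest sample and the oldest sample inside the window, with no extrapolation. The sample type, function names, and data are hypothetical; the threshold of 252000 is the scaled error budget for a 1 hour window.

```go
package main

import "fmt"

// sample is one scrape of a cumulative error counter.
type sample struct {
	unixSec int64   // scrape time, in seconds
	errors  float64 // cumulative error count at that time
}

// deltaOver returns how much the counter grew between the oldest sample
// inside the last windowSec seconds and the newest sample.
// samples must be sorted by time, oldest first.
func deltaOver(samples []sample, windowSec int64) float64 {
	if len(samples) == 0 {
		return 0
	}
	newest := samples[len(samples)-1]
	oldest := newest
	for _, s := range samples {
		if s.unixSec >= newest.unixSec-windowSec {
			oldest = s
			break
		}
	}
	return newest.errors - oldest.errors
}

// shouldPage reports whether errors in the window exceed the threshold.
func shouldPage(samples []sample, windowSec int64, threshold float64) bool {
	return deltaOver(samples, windowSec) > threshold
}

func main() {
	samples := []sample{{0, 0}, {1800, 100000}, {3600, 300000}} // made-up scrapes
	fmt.Println("errors in last hour:", deltaOver(samples, 3600))
	fmt.Println("page:", shouldPage(samples, 3600, 252000))
}
```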

Page 27:

EDITORIALISE ABOUT OBSERVABILITY

https://pixabay.com/en/german-zeiss-binoculars-blc-lens-3310355/ CC0

Page 28:

“one of the most powerful context-sensitive incredibly adaptive anomaly-detecting and responding agents in the world”

-- John Allspaw, Monitorama 2013

Page 29:

1. Symptom-based alerts are good for your health
2. SLO is defined by you, customers, and system
3. SLO implies error budget, informs engineering tolerance
4. Page only on SLO risk, because that’s what matters
