SLO Review
Takeshi Kondo / @chaspy 2020/01/25
SRE NEXT 2020 #srenext #srenextC
Service Level Objectives
Questions
• ✋ Do you know the meaning of SLO?
• ✋ Do you define SLOs for your service?
• ✋ Do you have an Error Budget Policy for your service?
Target
• People who want to know SLI/SLO
• People who want to know how to use SLI/SLO
• People who want to keep the reliability and agility of product development
Site Reliability Engineering: Measuring and Managing Reliability 🎉
https://www.coursera.org/learn/site-reliability-engineering-slos
tl;dr
• It is worth defining and reviewing SLI / SLO
• But the SLI / SLO is not perfect from the beginning
• Reduce cognitive load and introduce it gradually to the team
Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve
Agenda: Learn SLO
What
• SLI / Service Level Indicators
  • A quantifiable measure of service reliability
  • e.g. http success rate, response time
• SLO / Service Level Objectives
  • Set a reliability target for an SLI
  • 99%, 99.9%, 99.99%…
• Error Budget
  • An SLO implies an acceptable level of unreliability
  • This is a budget that can be allocated
The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0
SLI should be related to user happiness
[Figure: SLI (%) axis running between 😥 and 😄]
SLI (%) = good events / valid events × 100
Example: SLI (%) = http 2xx status count / (http 2xx status count + 5xx status count) × 100
SLO is a reliability target for an SLI
SLI (%) = http 2xx status count / (http 2xx status count + 5xx status count) × 100
SLO: 99.9%
Present: 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) = 99.95%
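The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `success_rate_sli` and the variable names are made up for illustration and are not from the talk.

```python
def success_rate_sli(count_2xx: int, count_5xx: int) -> float:
    """Availability SLI in percent: 2xx / (2xx + 5xx) * 100."""
    valid = count_2xx + count_5xx
    if valid == 0:
        raise ValueError("no valid events in the window")
    return 100.0 * count_2xx / valid


SLO_PERCENT = 99.9  # target from the slide

# Counts from the slide: 10000 successes and 5 server errors.
sli = success_rate_sli(count_2xx=10000, count_5xx=5)
meets_slo = sli >= SLO_PERCENT
print(f"SLI = {sli:.2f}%, SLO = {SLO_PERCENT}%, met = {meets_slo}")
```

Only 2xx and 5xx responses count as valid events here, matching the formula on the slide.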
We can accept Errors as Error Budget (Event based SLO)
SLO: 99.9%
Present: 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) = 99.95%
Error Budget: we can accept 5 more 5xx errors 😌
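Where the "5 more" comes from: with a 99.9% target, 0.1% of the valid events may fail. A rough sketch follows, assuming the budget is rounded down to whole events; `error_budget_events` is a hypothetical helper name, and exact fractions are used to avoid floating-point surprises.

```python
import math
from fractions import Fraction


def error_budget_events(valid_events: int, slo_percent: str) -> int:
    """Number of bad events the SLO tolerates for this many valid events."""
    allowed_ratio = 1 - Fraction(slo_percent) / 100  # 99.9% -> 1/1000
    return math.floor(valid_events * allowed_ratio)


count_2xx, count_5xx = 10000, 5  # counts from the slide
budget = error_budget_events(count_2xx + count_5xx, "99.9")
remaining = budget - count_5xx
print(f"budget = {budget} errors, used = {count_5xx}, remaining = {remaining}")
```

With 10005 valid events the budget is 10 errors; 5 are already used, so 5 remain.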
We can accept Errors as Error Budget (Monitor based SLO)
SLI (%) = time windows where the 95th percentile response time over the last 1 minute is < 100 msec / all time windows × 100
SLO: 99.9%
Present: 99.95%
Window: 7 days
Error Budget is only 10 minutes in 7 days 😅
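The "10 minutes in 7 days" figure follows directly from the target. Below is a small sketch, assuming the monitor is evaluated once per minute so that each bad evaluation costs one minute of budget; the function names are illustrative only.

```python
from datetime import timedelta


def time_error_budget(window: timedelta, slo_percent: float) -> timedelta:
    """Time the monitor may be in a bad state within the window."""
    return window * (1 - slo_percent / 100)


def monitor_sli(good_minutes: int, total_minutes: int) -> float:
    """Monitor based SLI in percent: good minutes / all minutes * 100."""
    return 100.0 * good_minutes / total_minutes


window = timedelta(days=7)
budget_minutes = time_error_budget(window, 99.9).total_seconds() / 60
total_minutes = int(window.total_seconds() // 60)  # 10080 minutes in 7 days

# Assumed example: 5 bad minutes so far in the window.
sli = monitor_sli(total_minutes - 5, total_minutes)
print(f"budget = {budget_minutes:.2f} min, SLI = {sli:.2f}%")
```

Five bad minutes out of 10080 give roughly the 99.95% shown as "Present", and about half of the weekly budget is already spent.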
Why
• Fact-based decision making
• Team can develop with a balance between reliability and agility
• Especially important in a microservices architecture
Team can develop with a balance between reliability and agility
[Figure: Reliability (Ops 🙂 "Keep the reliability") and Agility (Dev 😎 "Let’s release new feature!") balanced 🤔 by the SLO]
Especially important in a microservices architecture
[Diagram: ServiceA, ServiceB, ServiceC; success rates shown: 99.9%, 99%, 99% 😥]
Reliability depends on other services
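One way to see why dependencies matter: if a request has to pass through every service, and failures are independent, the success rates multiply. Both conditions are assumptions made for this sketch; the slide does not state how the three services are connected.

```python
from math import prod


def serial_success_rate(rates_percent: list[float]) -> float:
    """Success rate in percent of a request that needs every service to succeed.

    Assumes a serial call chain and independent failures.
    """
    return 100.0 * prod(rate / 100 for rate in rates_percent)


# Success rates shown on the slide.
combined = serial_success_rate([99.9, 99.0, 99.0])
print(f"combined success rate = {combined:.2f}%")
```

Under these assumptions the combined rate comes out near 97.9%, lower than any single service's rate.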
Where
Measurement points: Synthetics Client, Frontend, CDN, LoadBalancer, Application, DataStore
Many options, Trade-off
• Some requests might not reach the apps
• Need more engineering effort to generate E2E tests
In Quipper
Measurement points: Synthetics Client, Frontend, CDN, LoadBalancer, Application, DataStore
Send everything to Datadog
Agenda: Case Study in Quipper
Self-Contained
“Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
SRE Mission for 2020 / Self-Contained
• Product Teams can develop by themselves
  • No need to ask SREs
• We SREs provide the process
  • Design Doc
  • Production Readiness Check
  • Delegate Infrastructure Management (Terraform)
  • SLI/SLO
Timeline
2019 to 2020 (axis marks: Mar., Jun., Sep., Mar.)
• Migrated to Kubernetes
• Define the Ownership
• Production Readiness Checklist
• SLO review by myself
• SLO review with Devs
• SRE NEXT
• Set Error Budget Policy
Why do we need such steps?
• Are the SLIs/SLOs we defined appropriate?
  • If not, the Error Budget Policy won’t work well
• Can the product team start the process itself?
  • If not, it needs some scaffolding, preparation, and training
Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
Case Study in Quipper: Define the Ownership
Know your systems and organizations
• 2 Products
• 4 Branches 🇯🇵🇮🇩🇵🇭🇲🇽
• 97 Kubernetes Deployments
• 84 Developers (including 6 SREs)
• 48 subdomains
Where is the Ownership?
Define the Owner
Services / Teams
• Japan: 7
• Global: 8
• Philippines: 3
• Indonesia: 4
• Shared: 1
Define Service Owner In Design Doc for new service
SLO review by myself
• Establish SLO Review process
  • How to set SLO?
  • How to monitor SLO?
  • What action do we take on an SLO violation?
  • How to investigate?
• Improve SLI / SLO accuracy
  • How should we think about revising them?
How to set and monitor SLO?
• Unfortunately, there is no alerting or recording system 😅
• Use a Slack reminder and record the results in a GitHub Issue
Availability Table
https://landing.google.com/sre/sre-book/chapters/availability-table/
Too many errors 🤔
Target too high 🤔
Start with this!
Realized that “SLO Review” is a good habit
• Good habit?
  • Like Pair-Programming or Unit Tests
• Why?
  • Motivates us to get metrics
  • No burnout, a feeling of relief
  • Makes us aware of the factors that hinder reliability
    • Platform Outage
    • Push notification
    • Resource Capacity
    • Rolling Update
Case Study in Quipper: SLO review with Devs
Many Problems…
• Noisy metrics by dos detector
• Developing SLIs
• Send http path tag for shared service
• No available metrics for microservices SLIs
Dos Detector: Rate limiting by Reverse Proxy
If a large number of requests are made from the same client in a short time, the reverse proxy returns 503.
Noisy metrics by dos detector
SLI (%) = http 2xx status count / (http 2xx status count + 5xx status count) × 100
These 503 responses are counted as 5xx, so they add noise to an SLI that should be related to user happiness.
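One possible way to reduce this noise is to leave rate-limited responses out of the valid events. The sketch below assumes each response can be marked as rate-limited (the `rate_limited` flag is hypothetical); the talk does not say how the problem was actually solved, and the counts are made-up sample data.

```python
from typing import Iterable, NamedTuple


class Response(NamedTuple):
    status: int
    rate_limited: bool  # hypothetical flag: 503 returned by the rate limiter


def success_rate_sli(responses: Iterable[Response], drop_rate_limited: bool) -> float:
    """Availability SLI in percent, optionally ignoring rate-limited responses."""
    good = bad = 0
    for response in responses:
        if drop_rate_limited and response.rate_limited:
            continue  # not a valid event for this SLI
        if 200 <= response.status < 300:
            good += 1
        elif 500 <= response.status < 600:
            bad += 1
    if good + bad == 0:
        raise ValueError("no valid events")
    return 100.0 * good / (good + bad)


# Made-up sample: 9990 successes, 5 real errors, 50 rate-limited 503s.
sample = (
    [Response(200, False)] * 9990
    + [Response(500, False)] * 5
    + [Response(503, True)] * 50
)
noisy = success_rate_sli(sample, drop_rate_limited=False)
clean = success_rate_sli(sample, drop_rate_limited=True)
print(f"with rate-limited 503s: {noisy:.2f}%, without: {clean:.2f}%")
```

In this sample the noisy SLI falls below a 99.9% target while the filtered one stays above it. Dropping every 503 by status code alone would also hide genuine 503s, which is why the sketch relies on an explicit flag.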
Send http path tag for shared service
• Coaching Team uses example.quipper.com/coaching
• School Team uses example.quipper.com/school
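A sketch of how a per-team tag could be derived from the request URL, so that metrics of a shared service can be split by team. The tag key `http_path` and the choice of the first path segment are assumptions for illustration, not the actual Quipper configuration.

```python
from urllib.parse import urlsplit


def http_path_tag(url: str) -> str:
    """Build a tag from the first path segment of the URL."""
    path = urlsplit(url).path
    first_segment = path.strip("/").split("/", 1)[0]
    return f"http_path:/{first_segment}"


coaching_tag = http_path_tag("https://example.quipper.com/coaching/lessons/1")
school_tag = http_path_tag("https://example.quipper.com/school")
print(coaching_tag, school_tag)
```

Keeping only the first segment keeps the number of distinct tag values small, which matters when every tag value becomes a separate metric series.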
No available metrics for microservices SLIs
[Diagram: ServiceA, ServiceB, ServiceC; service-to-service calls GET http://serviceb and GET http://servicec; the second build adds a side-car container]
Case Study in Quipper: Set Error Budget Policy
• To be continued…
Agenda: Takeaways
Provide Standardized / Recommended SLIs
• Ideally, the Product Team sets its own SLIs, but…
• Start with defaults first
SLI menu
• Availability
  • http success rate
• Latency
  • upstream response time < x msec
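The latency item on the menu can be computed the same way as the success rate: good events are requests faster than the threshold. This is a minimal sketch with made-up sample data; `latency_sli` is a hypothetical name, and the 100 msec threshold stands in for the "x" on the slide.

```python
def latency_sli(response_times_msec: list[float], threshold_msec: float) -> float:
    """Latency SLI in percent: requests faster than the threshold / all requests."""
    if not response_times_msec:
        raise ValueError("no requests in the window")
    good = sum(1 for t in response_times_msec if t < threshold_msec)
    return 100.0 * good / len(response_times_msec)


# Made-up upstream response times in milliseconds.
samples = [20, 35, 80, 120, 95, 40, 60, 300, 55, 70]
sli = latency_sli(samples, threshold_msec=100)
print(f"latency SLI = {sli:.1f}%")
```

Eight of the ten sample requests are under 100 msec, so the SLI is 80%.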
Make the configuration as code
Developers can easily change it by pull request
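What "configuration as code" might look like, sketched in Python. The talk does not show the real format, so the field names, service names, and team names below are all hypothetical. The idea is that an SLO is a small, validated record in the repository that a developer edits in a pull request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    service: str          # hypothetical service name
    owner: str            # owning team
    sli: str              # item from the SLI menu, e.g. "availability"
    target_percent: float
    window_days: int

    def __post_init__(self) -> None:
        # Reject values that cannot be a meaningful target or window.
        if not 0 < self.target_percent < 100:
            raise ValueError("target_percent must be between 0 and 100")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


# A developer changes a target by editing this list in a pull request.
SLOS = [
    SloDefinition("coaching-api", "coaching-team", "availability", 99.9, 7),
    SloDefinition("coaching-api", "coaching-team", "latency", 99.0, 7),
]
print(len(SLOS), "SLOs defined")
```

Validation at load time means a typo such as a 100% target fails in review or CI rather than in the monitoring system.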
Have a steep learning curve
Good Documentation
Work together 🤝
Summary
• It is worth defining and reviewing SLI / SLO
• But the SLI / SLO is not perfect from the beginning
• Reduce cognitive load and introduce it gradually to the team
Thank You!
chaspy
chaspy_
Site Reliability Engineer at Quipper
Takeshi Kondo
SRE Lounge Terraform-jp