Date posted: 23-Jan-2017
Category: Engineering
Uploaded by: todd-palino
I’m No Hero
Full Stack Reliability at LinkedIn
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Todd Palino
Staff Site Reliability Engineer
LinkedIn, Data Infrastructure Streaming
What is Site Reliability Engineering?
Types of SRE
Embedded
Central (or Production SRE)
Tools and Infrastructure
We Can’t Do It Alone
The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore
We manage over 6000 application instances
  – 100 Kafka clusters, with 1800 brokers
  – Over 1 trillion messages a day
The environment is never static from one day to the next
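The figures on this slide can be turned into rough per-second and per-person numbers. The sketch below is only back-of-the-envelope arithmetic: it treats "over 1 trillion messages a day" as exactly 1 trillion spread evenly over the day, and it counts the team as 3 + 1.5 = 4.5 SREs. Real traffic is bursty, so peak rates would be higher than this average. All variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the slide; "over 1 trillion" is treated as a lower bound.
messages_per_day = 1_000_000_000_000
kafka_clusters = 100
kafka_brokers = 1800
application_instances = 6000
sre_headcount = 3 + 1.5  # US team plus Bangalore

# Average message rate, assuming an even spread over the day.
avg_rate = messages_per_day / SECONDS_PER_DAY

# Average cluster size and load per SRE.
brokers_per_cluster = kafka_brokers / kafka_clusters
brokers_per_sre = kafka_brokers / sre_headcount
instances_per_sre = application_instances / sre_headcount

print(f"Average rate: {avg_rate:,.0f} messages/second")
print(f"Brokers per cluster (average): {brokers_per_cluster:.0f}")
print(f"Brokers per SRE: {brokers_per_sre:.0f}")
print(f"Application instances per SRE: {instances_per_sre:.0f}")
```

Under these assumptions the average works out to roughly 11.6 million messages per second and 400 brokers per SRE, which is the point of the slide title: a team this size cannot run that footprint without the rest of the stack.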
Maslow’s Hierarchy
Todd’s Hierarchy of Reliability
Infrastructure as a Service
SREs do not deploy hardware and OS
Production Operations
  – Datacenter Technicians
  – Systems Operations
  – Network Operations
Provide all basic OS and network services
There is still tweaking for individual applications
Common Repositories
All source code and configurations are committed to one place
Subversion and Git centrally managed
Consistent management
  – Precommit checks
  – ACLs and Review boards
Connects directly to the build systems
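The slide does not say how the precommit checks or ACLs are implemented. As a purely hypothetical sketch of the idea, the gate in front of a shared repository can be modeled as a list of small check functions that each accept or reject a proposed change. The `Change` type, the `ACL` table, the user names, and both checks are invented for illustration and are not LinkedIn's tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Change:
    """A proposed commit (hypothetical model)."""
    path: str
    author: str
    reviewer: Optional[str] = None


# Hypothetical ACL: path prefix -> users allowed to commit under it.
ACL: Dict[str, Set[str]] = {"kafka/": {"alice", "bob"}}

# A check returns an error message, or None when the change passes.
Check = Callable[[Change], Optional[str]]


def check_acl(change: Change) -> Optional[str]:
    for prefix, allowed in ACL.items():
        if change.path.startswith(prefix) and change.author not in allowed:
            return f"{change.author} may not commit under {prefix}"
    return None


def check_reviewed(change: Change) -> Optional[str]:
    if change.reviewer is None or change.reviewer == change.author:
        return "change needs a reviewer other than the author"
    return None


def run_precommit(change: Change, checks: List[Check]) -> List[str]:
    """Run every check and collect the failures; an empty list means accept."""
    errors = []
    for check in checks:
        message = check(change)
        if message is not None:
            errors.append(message)
    return errors


CHECKS: List[Check] = [check_acl, check_reviewed]

good = Change(path="kafka/server.properties", author="alice", reviewer="bob")
bad = Change(path="kafka/server.properties", author="carol")

print(run_precommit(good, CHECKS))
print(run_precommit(bad, CHECKS))
```

Running every check instead of stopping at the first failure gives the committer the full list of problems in one pass, which suits a self-service repository.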
Containerization
Most of our stack is Java
  – Python is well-supported
  – Always a few one-offs
Java applications have Tomcat and Jetty containers
  – Hooks for monitoring
  – Client libraries are managed by the team that owns the application
Provides a consistent control surface for applications
Build and Deployment
When code is committed, it is automatically built
  – Successes become deployment artifacts
  – Failures are tracked via Jira
Build systems are centrally managed
Common tools
  – Dependency management and introspection
  – Version management
  – Error budgeting
  – Deployment
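The commit-to-artifact flow described on this slide can be sketched as a small routing step: every commit triggers a build, a successful build yields a deployable artifact, and a failed build yields a tracked issue. This is a minimal sketch under stated assumptions, not the actual build system. The `Pipeline` class, the `fake_build` function, and the artifact naming are all hypothetical, and a plain list stands in for the Jira issue tracker.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BuildResult:
    commit_id: str
    succeeded: bool
    log: str = ""


@dataclass
class Pipeline:
    """Routes build outcomes (hypothetical stand-in for the build system)."""
    artifacts: List[str] = field(default_factory=list)
    tickets: List[str] = field(default_factory=list)  # stand-in for an issue tracker

    def on_commit(self, commit_id: str,
                  build: Callable[[str], BuildResult]) -> BuildResult:
        result = build(commit_id)
        if result.succeeded:
            # A success becomes a deployment artifact.
            self.artifacts.append(f"{commit_id}.artifact")
        else:
            # A failure is recorded so that someone owns the fix.
            self.tickets.append(f"Build failed for {commit_id}: {result.log}")
        return result


def fake_build(commit_id: str) -> BuildResult:
    """Pretend builder: commits whose id ends in 'bad' fail to compile."""
    ok = not commit_id.endswith("bad")
    return BuildResult(commit_id, ok, "" if ok else "compile error")


pipeline = Pipeline()
pipeline.on_commit("abc123", fake_build)
pipeline.on_commit("def456-bad", fake_build)

print(pipeline.artifacts)
print(pipeline.tickets)
```

The build step is passed in as a function so that the routing logic can be exercised without a real compiler or tracker behind it.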
Monitoring
Monitoring, graphing, and alerting as a service
Completely self-service
  – Applications annotate metrics and they are automatically collected
  – Monitoring dashboards can be created by anyone
Automatic metrics and dashboards for common features
  – HTTP servers, system and OS metrics
  – Client libraries (such as Kafka)
Additional metrics can be published outside the container
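One way to picture "applications annotate metrics and they are automatically collected" is a decorator that registers a metric by name and updates it on every call, with a separate collector reading the registry. The slide does not describe the real mechanism (LinkedIn's stack is mostly Java), so the sketch below is an assumption-laden illustration in Python. The `metric` decorator, the `REGISTRY`, the `collect` function, and the metric name are all hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Process-wide registry that a collection agent could scrape (hypothetical).
REGISTRY = defaultdict(lambda: {"count": 0, "total_seconds": 0.0})


def metric(name):
    """Annotate a function so its call count and elapsed time are recorded."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even when the function raises.
                entry = REGISTRY[name]
                entry["count"] += 1
                entry["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


def collect():
    """Return a snapshot of all metrics, as a collector might poll it."""
    return {name: dict(values) for name, values in REGISTRY.items()}


@metric("profile.lookup")
def lookup_profile(member_id):
    return {"id": member_id}


lookup_profile(1)
lookup_profile(2)
print(collect()["profile.lookup"]["count"])
```

The application author only adds the annotation. Collection, graphing, and alerting stay with the monitoring service, which is what makes the model self-service.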
Site Up
With the stack supporting it, applications sit on top
  – SREs architect and run the application
  – SREs and developers respond to failures
The NOC monitors high-level metrics
  – Overall site health and growth metrics
  – They also coordinate incident response
Incident response is blameless
  – Fix the problem, don’t fix the blame
Review and Revise
All components are constantly improving
  – Incidents expose issues in the infrastructure
  – Feedback from usage of the tools
Steering committees discuss large-scale changes
  – Production Operations, SRE, and Development all have their own
  – Composed of individual contributors, not managers
Open collaboration
  – Common repositories mean everyone can help