Date posted: 16-Jul-2015
Uploaded by: christian-beedgen
Scaling A Start-up DevOps Team To 10x
While Scaling The System 50x
Christian Beedgen – Co-Founder & CTO
Stefan Zier – Lead Architect
DevOpsDays Austin 2014
Intro
Christian Beedgen
– Co-Founder, CTO
– ArcSight, Amazon, …
– No prior experience running production systems
Stefan Zier
– Lead Architect, first engineer
– ArcSight, Amazon, …
– No prior experience running production systems
Scaling
Spreading constructive beliefs and behavior from the few to the many.
Robert I. Sutton
Scaling Up Excellence: Getting to More Without Settling for Less
The Sumo Logic Service
Petabyte scale log management platform
Big Data™, High Velocity, Human Real Time
Distributed
100% in AWS
Service Oriented Architecture
99% in Scala
Run by engineers
Data Ingest
Code Commits, Services
Engineering Head Count
[Chart; vertical axis from 0 to 60]
Sumo Logic Confidential
The Challenge
Scaling Sumo Logic
– More confidence and uptime
– More operators
– More change
– More services
How We Scaled
DevOps Culture
Spreading Knowledge
Control surfaces
Culture
a shared, learned system of values, beliefs, and attitudes that shapes and influences perception and behavior: an abstract “mental blueprint” or “mental code.”
Being On Call
One week, 24/7 responsibility for
– Operational decision making
– Alert response
– Deploying the bits
– Configuration changes
Pair of people (primary, secondary)
– Social schedules & travel
– Training
– Relief after a noisy night
Feedback Loops
Sumo on Sumo
– Perfect dogfooding use case
Post mortems
– Drive improvements from incidents
Alerting
– Code I wrote yesterday just woke me up at 4am
Change Management
Mandated for PCI compliance
– Change Management Board = Channel on Slack
– Change Request = JIRA ticket
– Audit trail = Paste Slack conversation into JIRA
Actually helpful
– Good documentation
– Starts good discussions
– Makes change mindful
Spreading Knowledge
Tactical
– Daily Standups
– Chat
– Playbooks
Strategic
– Mentoring
– “How the sausage is made” sessions
– Checklists
Playbooks
Linked to alert
– GitHub wikis
– URL in alert
Focused on MTTR
– Steps to restore service
– List of Subject Matter Experts to call
Continuously improved
– Boy Scout rule
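A minimal Scala sketch of the "URL in alert" idea above: the alert carries its playbook link and the experts to call, so the responder does not have to search for them. The `Alert` type, its fields, and the example URL are assumptions made for illustration; the talk does not show how Sumo Logic's alerts are actually structured.

```scala
object PlaybookAlertSketch {
  // Hypothetical alert model; field names are illustrative only.
  final case class Alert(
      name: String,
      summary: String,
      playbookUrl: String,   // link to the wiki page with restore steps
      experts: List[String]  // subject matter experts to call
  )

  // Render the notification so the playbook link follows the summary directly.
  def render(alert: Alert): String = {
    val experts = alert.experts.mkString(", ")
    List(
      s"ALERT: ${alert.name}",
      alert.summary,
      s"Playbook: ${alert.playbookUrl}",
      s"Experts: $experts"
    ).mkString("\n")
  }
}
```

Keeping the link in the alert payload itself means a playbook edit never requires touching the alert text, only the wiki page behind the URL.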
Three Pillars
Culture
Knowledge
Control surfaces
Checklists
Improve outcomes
– Ensure experts don’t miss any critical steps
– Prevent repeating mistakes
Well designed
– Coherent
– Living documents
– Concise, clear and require specific actions
– Need to be short and well-organized
– Are NOT step-by-step instructions
DevOps Friendly
Control Surfaces matter for scale
– Simplify complex operations
– Consistent view
– Built-in safety
Natural to use
– Easy to learn, discover
Natural to extend
– Every developer
dsh
dsh
– CLI
– Full stack
– Fast
– Safe
– Secure
– Proactive
– Discoverable
Model Driven
Creates consistency
Provides guard rails
Deployment
– Cluster
• Instance
– Assembly
Configured at all levels
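One way to read "configured at all levels" is layered configuration, where a more specific level overrides a more general one. The Scala sketch below illustrates that reading only. The case classes, the `effectiveConfig` helper, and the override order (instance over cluster over deployment) are assumptions; Assembly is left out because the slide does not say how it relates to the other levels.

```scala
object DeploymentModelSketch {
  type Config = Map[String, String]

  // Hypothetical model: a deployment holds clusters, a cluster holds instances.
  final case class Instance(id: String, config: Config = Map.empty)
  final case class Cluster(name: String, instances: List[Instance], config: Config = Map.empty)
  final case class Deployment(name: String, clusters: List[Cluster], config: Config = Map.empty)

  // Resolve the effective configuration for one instance.
  // With `++` the right-hand map wins, so instance settings override
  // cluster settings, which override deployment-wide defaults.
  def effectiveConfig(d: Deployment, clusterName: String, instanceId: String): Option[Config] =
    for {
      c <- d.clusters.find(_.name == clusterName)
      i <- c.instances.find(_.id == instanceId)
    } yield d.config ++ c.config ++ i.config
}
```

Returning `Option` makes a misspelled cluster or instance name an explicit "not found" rather than an empty configuration, which is one small example of a guard rail.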
daemon restart api:p:25,receiver:p:10
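The command above takes a comma-separated list of colon-separated triples. The slides do not define what the fields mean, so the Scala sketch below only shows how such an argument could be split and validated before anything is restarted. The names `Target`, `service`, `option`, and `value` are guesses for illustration and not dsh's real vocabulary.

```scala
object TargetSpecSketch {
  // One entry such as "api:p:25". Field names are guesses; the slide
  // does not explain them.
  final case class Target(service: String, option: String, value: Int)

  // Parse "a:x:1,b:y:2" into targets, rejecting the whole spec if any
  // entry is malformed.
  def parse(spec: String): Either[String, List[Target]] = {
    val parsed: List[Either[String, Target]] = spec.split(",").toList.map { entry =>
      entry.trim.split(":") match {
        case Array(service, option, value) =>
          scala.util.Try(value.toInt).toOption
            .map(n => Target(service, option, n))
            .toRight(s"not a number in target: '$entry'")
        case _ =>
          Left(s"malformed target: '$entry'")
      }
    }
    // Report the first problem instead of silently skipping bad entries.
    parsed.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(parsed.collect { case Right(t) => t })
    }
  }
}
```

Rejecting the whole argument on the first bad entry fits the "prevents mistakes" goal: a typo stops the command instead of restarting only part of the intended set.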
dsh
dsh
– Scala
– Model based
– Trivial to extend
– Specific to OUR needs
– Meaningful defaults
– Prevents mistakes
// Select only the running instances of the "zk" cluster
val filter = FilterBuilder.withCluster("zk").withOnlyRunningInstances.build()
val instances = deployment.connect.describeInstances(filter)
// Run the command on all selected instances in parallel
instances.par.foreach { instance =>
  val ssh = instance.connectSSH
  ssh.execute("sudo service api restart")
}
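The snippet above depends on dsh internals that the slides do not show. To make the pattern concrete, here is a self-contained Scala sketch with stand-in types: an immutable filter builder and a helper that applies an action to each selected instance. Everything in it (`Instance`, `Filter`, this `FilterBuilder`, `forEachMatching`) is a hypothetical stand-in and not the real dsh API. It runs sequentially, and the action is a plain function rather than an SSH call.

```scala
object DshScriptSketch {
  // Stand-ins for the types used on the slide; not the real dsh API.
  final case class Instance(id: String, cluster: String, running: Boolean)

  final case class Filter(cluster: Option[String], onlyRunning: Boolean) {
    def matches(i: Instance): Boolean =
      cluster.forall(_ == i.cluster) && (!onlyRunning || i.running)
  }

  // Immutable builder: each call returns a new builder with one more constraint.
  final case class FilterBuilder(cluster: Option[String] = None, onlyRunning: Boolean = false) {
    def withCluster(name: String): FilterBuilder = copy(cluster = Some(name))
    def withOnlyRunningInstances: FilterBuilder = copy(onlyRunning = true)
    def build(): Filter = Filter(cluster, onlyRunning)
  }

  // Apply `action` to every instance the filter selects; return the ids touched.
  def forEachMatching(all: List[Instance], filter: Filter)(action: Instance => Unit): List[String] = {
    val selected = all.filter(filter.matches)
    selected.foreach(action)
    selected.map(_.id)
  }
}
```

Returning the list of touched ids gives the operator a record of exactly which instances were affected, which is useful for the audit trail described earlier.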
What would we do differently next time?
Make system upgrades less monolithic
Don’t ask UI developers to do operations
Clearer guidelines on managers & operations
Next Experiments
Divide up big rotation
Bring India development team into rotation
Switch from 24/7 shifts to 12/7
Deploy smaller parts of the system more often
Bring full-time operations people into the mix
Thank You!
Christian Beedgen
@raychaser
Stefan Zier
@stefanzier
We’re hiring! go.sumologic.com/jobs