Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
Surge '13
Netflix, Inc.
• World's leading internet television network
• ~38 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US internet traffic at night
• Recent notables:
  • Increased Originals catalog
  • Large open source contributions
  • OpenConnect (homegrown CDN)
About Me
• Manage Cloud Performance Engineering team
  • sub-team of the Cloud Solutions organization
• Focus on performance since 2000
  • large-scale billing applications, eCommerce, datacenter mgmt., etc.
  • Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud scale
• Looking for great performance engineers: [email protected]
Freedom and Responsibility
• Culture deck… a great read
• Good performers: 2x; top performers: 10x
• What engineers dislike:
  • cumbersome processes
  • deployment inefficiency
  • restricted access
  • restricted technical freedom
  • lack of trust
• If these are removed, you maximize:
  • engineering velocity
  • engineer satisfaction
Maximizing: Engineering Velocity
How
• Implementation freedom
  • SCM, libraries, language
  • that said, platform benefits exist
• Deployment freedom
  • service team owns:
    • push schedule, functionality, performance
    • operational activities (being paged)
• On-demand cloud capacity
  • thousands of instances at the push of a button
Rapid Deployment?
Impossible… or 3-6 months?
Rapid (Cloud) Deployment
3-5 Minutes
BaseAMI
• Supplies the foundation
  • monitoring, Java, Apache, Tomcat, etc.
• Open source project: Aminator
Pushing Code: Red-Black
• Gracefully roll code into, or out of, production (sequence sketched below)
• Asgard is our AWS configuration management tool
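A minimal sketch of the red-black sequence that Asgard automates. The helper methods (createAsg, enableTraffic, etc.) are illustrative stubs, not Asgard's actual API:

// Red-black push: bring the new ASG in alongside the old one, shift
// traffic, and keep the old ASG idle for instant rollback.
public class RedBlackPush {

    public static void main(String[] args) {
        String oldAsg = "api-v041";                        // currently serving traffic
        String newAsg = createAsg("api", "ami-1234abcd");  // AMI baked by Aminator

        waitUntilHealthy(newAsg);   // instances must pass ELB health checks
        enableTraffic(newAsg);      // register new ASG with the ELB
        disableTraffic(oldAsg);     // deregister old ASG, but leave it running

        // Rolling code *out* is the same move in reverse: re-enable the old
        // ASG and disable the new one if the push misbehaves.
    }

    private static String createAsg(String cluster, String amiId) {
        System.out.printf("creating ASG for %s from %s%n", cluster, amiId);
        return cluster + "-v042";
    }

    private static void waitUntilHealthy(String asg) {
        System.out.println("waiting for " + asg + " to pass health checks");
    }

    private static void enableTraffic(String asg) {
        System.out.println("enabling traffic to " + asg);
    }

    private static void disableTraffic(String asg) {
        System.out.println("disabling traffic to " + asg);
    }
}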
Not all Roses
• Compounded risks with increased velocity
• Risks: decreased reliability, performance, and scalability
Goal: CI (Continuous Improvement)
Maximizing: Reliability
Fear (Revere) the Monkeys
• Simulate (sketch below):
  • latency
  • errors
• Initiate:
  • instance termination
  • availability zone failure
• Identify:
  • configuration drift

… in both test and production
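A toy Java illustration of the latency/error simulation idea: wrap a downstream call and occasionally inject delay or a failure. The class, probabilities, and delay are assumptions; the actual monkeys inject faults at the infrastructure level rather than in application code.

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class FaultInjectingClient {
    private static final Random RANDOM = new Random();
    private static final double LATENCY_PROBABILITY = 0.05; // assumed rate
    private static final double ERROR_PROBABILITY   = 0.01; // assumed rate

    public String call() throws InterruptedException {
        if (RANDOM.nextDouble() < LATENCY_PROBABILITY) {
            TimeUnit.MILLISECONDS.sleep(2000);  // simulate a slow dependency
        }
        if (RANDOM.nextDouble() < ERROR_PROBABILITY) {
            throw new RuntimeException("simulated downstream error");
        }
        return "real response from the downstream service";
    }

    public static void main(String[] args) throws InterruptedException {
        FaultInjectingClient client = new FaultInjectingClient();
        for (int i = 0; i < 5; i++) {
            try {
                System.out.println(client.call());
            } catch (RuntimeException e) {
                System.out.println("caught: " + e.getMessage());
            }
        }
    }
}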
Tracking Change: Chronos
• Aggregate significant events* (example adapter call below)
• Current sources:
  • pushes (Asgard)
  • production change requests (JIRA)
  • AWS notifications
  • dynamic property changes
  • ASG scaling events
• Implementation: simple REST service with customized adapters

* "can disrupt production service"
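A hypothetical adapter call reporting a push event to Chronos. The endpoint URL and JSON shape are assumptions; the slide only describes Chronos as a simple REST service fed by customized adapters.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChronosAdapter {
    public static void main(String[] args) throws Exception {
        // Example event from the Asgard adapter: a production push.
        String event = "{\"source\":\"asgard\",\"type\":\"push\","
                + "\"detail\":\"api-v042 deployed to us-east-1\"}";

        URL url = new URL("http://chronos.example.com/api/v1/event"); // assumed endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(event.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Chronos responded: " + conn.getResponseCode());
    }
}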
Chronos, cont.
Automated Canary Analysis
• Identify regressions between new and existing code
• Point ACA at the baseline (prod) ASG and the canary ASG
• Typically analyze an hour's worth of time-series data
  • Compare the ratio of averages between canary and baseline
  • Evaluate range and noise; determine quality of signal
• Bucket each metric: Hot, Cold, Noisy, or OK
  • Multiple classifiers available
  • Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
  • Constrained: along metric dimensions
  • Final: score the canary
• Implementation: R-based analysis (Java sketch of the core idea below)
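An illustrative Java sketch of the ratio-of-averages bucketing for a single metric (the production analysis is R-based); the thresholds are invented for illustration.

import java.util.Arrays;

public class CanaryClassifier {
    enum Bucket { HOT, COLD, NOISY, OK }

    static Bucket classify(double[] baseline, double[] canary) {
        double ratio  = mean(canary) / mean(baseline);              // ratio of averages
        double spread = (max(canary) - min(canary)) / mean(canary); // range vs. level
        if (spread > 1.0)  return Bucket.NOISY;  // too noisy to yield a signal
        if (ratio  > 1.25) return Bucket.HOT;    // canary significantly above baseline
        if (ratio  < 0.75) return Bucket.COLD;   // canary significantly below baseline
        return Bucket.OK;
    }

    static double mean(double[] xs) { return Arrays.stream(xs).average().orElse(0); }
    static double max(double[] xs)  { return Arrays.stream(xs).max().orElse(0); }
    static double min(double[] xs)  { return Arrays.stream(xs).min().orElse(0); }

    public static void main(String[] args) {
        double[] baseline = {100, 105, 98, 102};  // e.g. avg response time samples
        double[] canary   = {150, 160, 145, 155};
        System.out.println(classify(baseline, canary)); // HOT
    }
}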
ACA: in Action
[Chart: per-metric buckets (HOT, COLD, NOISY, OK), constrained rollups (dashed), and the final rollup score]
Hystrix: Defend Your App
• Protection from downstream service failures (minimal command below)
• Failures may be functional (unavailable) or performance-related in nature
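A minimal HystrixCommand showing the defense pattern: the downstream call runs inside the command, and a failure or timeout routes callers to the fallback instead of cascading upstream. The wrapped service call here is a stand-in.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class MovieMetadataCommand extends HystrixCommand<String> {
    private final String movieId;

    public MovieMetadataCommand(String movieId) {
        super(HystrixCommandGroupKey.Factory.asKey("MovieMetadata"));
        this.movieId = movieId;
    }

    @Override
    protected String run() {
        // Slow or failing calls trip the circuit breaker, sending
        // subsequent callers straight to the fallback.
        return fetchFromDownstream(movieId);
    }

    @Override
    protected String getFallback() {
        return "{}"; // degraded but safe default response
    }

    private String fetchFromDownstream(String id) {
        return "{\"id\":\"" + id + "\"}"; // stand-in for the real service call
    }

    public static void main(String[] args) {
        System.out.println(new MovieMetadataCommand("60029591").execute());
    }
}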
Maximizing: Scalability and Performance
Dynamic Scaling
EC2 footprint autoscales by 2,500-3,500 instances per day
• on the order of tens of thousands of EC2 instances overall
• a larger ASG spans 200-900 m2.4xlarge instances daily

Why:
• improved scalability during unexpected workloads
• absorbs variance in the service performance profile
• reactive chain of dependencies
• creates "reserved instance troughs" for batch activity
Dynamic Scaling, cont.
Example covers three services:
• 2 edge (A, B), 1 mid-tier (C)
• C has more upstream services than just A and B

Multiple autoscaling policies (sketched below):
• (A) system load average
• (B, C) request-rate based
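A sketch of wiring a request-rate policy with the AWS Java SDK. The ASG name, custom metric, and thresholds are assumptions, and in practice Asgard manages this configuration rather than raw SDK calls.

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class RequestRateScaling {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient();
        AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient();

        // Scale the ASG out by 10% whenever the alarm below fires.
        String policyArn = autoScaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("serviceB-v042")    // assumed ASG name
                .withPolicyName("scale-up-on-request-rate")
                .withAdjustmentType("PercentChangeInCapacity")
                .withScalingAdjustment(10)
                .withCooldown(300))
                .getPolicyARN();

        // Alarm on a (hypothetical) per-ASG requests-per-second custom metric.
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("serviceB-high-request-rate")
                .withNamespace("ServiceB/Custom")             // assumed namespace
                .withMetricName("RequestsPerSecond")          // assumed metric
                .withDimensions(new Dimension()
                        .withName("AutoScalingGroupName")
                        .withValue("serviceB-v042"))
                .withStatistic(Statistic.Average)
                .withPeriod(60)
                .withEvaluationPeriods(5)
                .withThreshold(1000.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions(policyArn));
    }
}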
Dynamic Scaling, cont.
Dynamic Scaling, cont.
• Response-time variability is greatest during scaling events
• Average response time is primarily between 75 and 150 msec
Dynamic Scaling, cont.
• Instance count grows 3x; aggregate requests grow 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
Cassandra Performance

Study performed:
• 24-node SSD-based C* cluster (hi1.4xlarge)
• Mid-tier service load application
• Targeting 2x production rates
  • Increased read ops from 30k to 70k in ~3 minutes
  • Increased write ops from 750 to 1,500 in ~3 minutes

Results:
• 95th percentile response time increased from ~17 msec to ~45 msec
• 99th percentile response time increased from ~35 msec to ~80 msec
EVcache (memcached) Scalability
• Response times stayed consistent during a 4x increase in load*

* The load increase was due to an upstream code change
Cloud-scale Load Testing
• Ad hoc or CI-based load-test model
• (CI) Run-over-run comparison; email on rule violation (sketch below)

1. Jenkins initiates the job
2. JMeter instances apply load
3. Results are written to S3
4. Instance metrics are published to Atlas
5. Raw data is fetched and processed
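A hypothetical sketch of the CI-side run-over-run rule check fed by step 5; the metric names, the 10% regression threshold, and the email trigger are assumptions.

import java.util.HashMap;
import java.util.Map;

public class RunOverRunCheck {

    // A rule is violated when the current run exceeds the previous run by
    // more than the allowed percentage (for "higher is worse" metrics).
    static boolean violates(double previous, double current, double maxRegressionPct) {
        return current > previous * (1 + maxRegressionPct / 100.0);
    }

    public static void main(String[] args) {
        // In the real pipeline these values come from S3/Atlas (steps 3-5).
        Map<String, Double> previousRun = new HashMap<>();
        previousRun.put("p99LatencyMs", 80.0);
        previousRun.put("errorRate", 0.002);

        Map<String, Double> currentRun = new HashMap<>();
        currentRun.put("p99LatencyMs", 95.0);
        currentRun.put("errorRate", 0.001);

        for (Map.Entry<String, Double> prev : previousRun.entrySet()) {
            if (violates(prev.getValue(), currentRun.get(prev.getKey()), 10.0)) {
                System.out.println("RULE VIOLATION: " + prev.getKey()
                        + " regressed >10% run-over-run -> send email");
            }
        }
    }
}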
Conclusions
• Continually accelerate engineering velocity
  • Evolve architecture and processes to mitigate risks
• Stateless micro-service architectures win!
• Remove barriers for engineers
  • Reducing the rate of change should be the last resort
• Exercise failure and "thundering herd" scenarios
• Cloud-native scaling and resiliency are key factors
  • Leverage pre-existing OSS PaaS when possible
Netflix Open Source
Our open source software simplifies management at scale.

Great projects, stunning colleagues: jobs.netflix.com
Q&A
• Netflix Tech Blog: http://techblog.netflix.com