
Engineering Netflix Global Operations in the Cloud

Date post: 16-Apr-2017
Upload: josh-evans
Josh Evans - Director of Operations Engineering
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Transcript


Internet

Web-scale distributed systems are like a living organism.
Services are like the organs of the human body: a diverse set of functions that create a whole.
We can understand many aspects, but there are always mysteries or things unobserved.
They are complex and can become fragile or sick.
They are constantly adapting and renewing themselves.

If you're responsible for operating such a system, you'll quickly learn that:

It is impossible to test every permutation of failure.
It is impossible to know exactly how it will behave under adverse conditions.
It is constantly changing to meet the needs of the business; even if you learn the system once, your knowledge will quickly become stale.

Today we'll talk about strategies for successfully operating complex distributed systems in the face of these challenges.

Notes: Vivid details; complex, overwhelming; no one understands the whole system.

Our Journey
Two Operational Challenges
Operational Excellence
Operations Engineering

I'm going to take you on a journey exploring...
We'll delve deep into operational excellence; this is the core of our talk today.
By the end of this talk you'll have a strategic framework and tools to pursue operational excellence for your business.

Our Journey
Two Operational Challenges
Operational Excellence
Operations Engineering

Product Innovation

winning moments of truth

Acquisition, retention, engagement

Continuous Innovation
Every facet of the product
1400 A/B tests in the last year & accelerating

Talk about blurring the lines between user interface and watching

Challenge #1: Accelerate Innovation and Rate of Change

Scale & Complexity

100,000s of requests per second
1000s of Global Starts per Second

This is the heartbeat of the Netflix streaming service

Approaching Global Reach
October - Spain, Portugal, Italy
Early 2016 - Korea, Taiwan, Singapore, Hong Kong
65m members, heading to 100m
~60 countries, heading to 200

EU-West

US-East
US-West
Multi-Zone, Multi-Region

Netflix CDN (Open Connect)
Cloud Control Plane
Internet

The Bigger Picture
Service Partners

In addition to our cloud service & control plane for devices, we have our CDN: thousands of caches at ISPs & IXs, terabits/sec, petabytes of content.
Service partners: Xbox Live, PSN, Samsung, etc.

Challenge #2: Sustain & Improve Quality in the face of ever-growing scale & complexity

Our Journey
Two Operational Challenges
Operational Excellence
Operations Engineering

Greg Peters: "We're leaving money on the table regarding operations & quality."
Did some reading and realized that for Netflix, and for most internet-based services, the operational challenge is about the tension between quality & velocity.

Operational Excellence

Quality
Velocity

Chart: Availability vs. Rate of Change
x-axis: Rate of Change (1, 10, 100, 1000); y-axis: Availability (nines), 0 to 6
Availability and downtime per year:
99.9999%: 31.5 seconds
99.999%: 5.26 minutes
99.99%: 52.56 minutes
99.9%: 8.76 hours
99%: 3.65 days
90%: 36.5 days
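The downtime figures on the chart follow directly from the number of nines. A minimal sketch of that arithmetic, assuming a 365-day year; the function names are illustrative and not from the talk:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (assumes a 365-day year)


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year at `nines` nines of availability.

    1 nine = 90%, 2 nines = 99%, 3 nines = 99.9%, and so on.
    """
    return SECONDS_PER_YEAR / 10 ** nines


def humanize(seconds: float) -> str:
    """Format a duration using the largest unit that fits."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for nines in range(6, 0, -1):
        print(nines, "nines:", humanize(downtime_seconds_per_year(nines)))
```

Each additional nine cuts the yearly downtime budget by a factor of ten, which is why the availability axis on the chart is effectively logarithmic.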

Quality vs. Velocity

Chart: Availability vs. Rate of Change (same axes and downtime figures as the previous slide)

The Zero Sum Game

Chart: Availability vs. Rate of Change (same axes and downtime figures as the previous slide)

The Zero Sum Game

Chart: Availability vs. Rate of Change (same axes as the previous slide)

Shifting the Curve

Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.

Remember: member & engineer environments; this is why velocity is an operational challenge.

Our Journey
Two Operational Challenges
Operational Excellence
Operations Engineering

Build It
design, code, build, bake, test, deploy

Run It
operate, configure, monitor, respond
You build it, you run it globally

Aligns incentives: if you write bad code, you get called.
Daunting task for each engineering team.

Undifferentiated Heavy Lifting

Operations Engineering is the application of software engineering practices and principles to achieve and sustain operational excellence.
automation, modular components, tools & services, best practices

The leverage comes from
automation, modularity, tools, services, best practices

Our Journey: Operations Engineering
Engineering Tools
Insight & Real-time Analytics
Performance & Reliability

Leverage

Our Journey
Engineering Tools
Insight & Real-time Analytics
Performance & Reliability

Leverage

Our Artisanal Past
Data Center
Delayed provisioning
Hand-crafted servers
Variations and complexity
Delivery
Late night, manual deployments
Repeated mistakes
Painful delays to production fixes

Engineering Tools
productivity, velocity, quality

cloud management, delivery engine, automation platform

Mention: Asgard replacement

Global Cloud Management

powerful, easy to use, global

Delivery Pipelines

feature rich, modular, parallel or serial

fully automated

Automated Global Delivery

feature rich, modular, parallel or serial

fully automated

The Paved Road
Stash, Gradle, Ubuntu, Jenkins, Spinnaker

Our Journey
Engineering Tools
Insight & Real-time Analytics
Performance & Reliability

Leverage

Insight & Real-Time Analytics

OODA loop

An outage may not be life or death, but...

We know this is the experience that our customers have; every minute counts.
So we strive to continuously improve time to detect and time to recover.

DES on time series data

Predict the future based on history

Favor recent history

Threshold-based alerts

6-8 minute delay
Anomaly Detection
Alert!

SPS (starts per second) is the heartbeat of the Netflix service. Think of EKG heart monitoring: arrhythmia or cardiac arrest require intervention.

Double Exponential Smoothing
Mini-batches of time series data
Predict future values based on history
Favor recent values
Look for the gap
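The steps on this slide can be sketched with Holt's linear-trend form of double exponential smoothing, run over one mini-batch of a metric such as SPS. This is only an illustration: the smoothing factors `alpha` and `beta`, the gap threshold, and the function names are assumptions, not values from the talk.

```python
def des_forecasts(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from double exponential smoothing.

    alpha weights recent values when updating the level;
    beta weights recent changes when updating the trend.
    """
    if len(series) < 2:
        raise ValueError("need at least two points")
    level = series[0]
    trend = series[1] - series[0]
    forecasts = [series[0]]  # no history before the first point
    for value in series[1:]:
        forecasts.append(level + trend)  # prediction made before seeing `value`
        new_level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def find_gaps(series, threshold, alpha=0.5, beta=0.5):
    """Indexes where the actual value is far from the forecast."""
    forecasts = des_forecasts(series, alpha, beta)
    return [
        i
        for i, (actual, predicted) in enumerate(zip(series, forecasts))
        if abs(actual - predicted) > threshold
    ]


if __name__ == "__main__":
    sps = [100, 102, 104, 106, 108, 60]  # made-up mini-batch with a sudden drop
    print(find_gaps(sps, threshold=10))
```

Larger `alpha` and `beta` values favor recent history more strongly, at the cost of making the forecast itself react faster to an anomaly.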

Latency of Detection
Stream processing vs. time series
8 minutes to < 1 minute

Finer Granularity, Shorter Time Windows

Ensemble Learning

Use multiple algorithms to obtain better predictive performance.
Simple Ensemble Methods:
Voting: used when each classifier produces a single class label.
Averaging: used when each classifier produces a confidence estimate.
Tend to yield better results when there is a diversity among the algorithms.

Median Absolute Deviation
IQR
Least Squares
HDI
Voting
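As a rough illustration of voting, the sketch below combines three of the detectors named on the slide (median absolute deviation, IQR, and a least-squares trend fit) and flags a new value only when a majority agree. HDI is left out for brevity. The cutoffs (3.5, 1.5, 3.0) are conventional textbook choices and the function names are hypothetical; none of them come from the talk. It assumes at least three history points.

```python
import math
import statistics


def mad_vote(history, value, cutoff=3.5):
    """Flag value if its MAD-based modified z-score exceeds cutoff."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return value != med
    return abs(0.6745 * (value - med) / mad) > cutoff


def iqr_vote(history, value, k=1.5):
    """Flag value if it falls outside the IQR fences of the history."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    spread = q3 - q1
    return value < q1 - k * spread or value > q3 + k * spread


def least_squares_vote(history, value, k=3.0):
    """Flag value if it is far from a straight-line fit extrapolated one step."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    spread = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    predicted = intercept + slope * n
    return abs(value - predicted) > k * spread


DETECTORS = (mad_vote, iqr_vote, least_squares_vote)


def count_votes(history, value):
    return sum(1 for detect in DETECTORS if detect(history, value))


def is_anomaly(history, value):
    """Majority vote across the detectors."""
    return count_votes(history, value) > len(DETECTORS) / 2


if __name__ == "__main__":
    history = [10, 11, 9, 10, 12, 10, 11]  # made-up metric values
    print(count_votes(history, 13), is_anomaly(history, 13))
    print(count_votes(history, 50), is_anomaly(history, 50))
```

With this made-up history, a value of 13 trips only the IQR detector and is not flagged, while 50 trips all three. That is the benefit of diversity among the algorithms: one over-sensitive detector does not raise an alert by itself.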

observe, orient, decide, act
Alert!
From 6-8 minutes to < 1 minute

observe, orient

decide, act

How do we take humans out of the equation?

Outlier Detection & Remediation

These outliers are like cancer cells: we systematically detect, study, and remove them from our ecosystem to maintain health.

Unsupervised machine learning
Density-based clustering algorithm

Kepler
Actions
Email, page
OOS, detach, terminate
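The slides do not name the density-based algorithm used. As an illustration only, the sketch below uses DBSCAN, a well-known density-based clustering algorithm, to find servers whose metrics sit away from the main cluster. The metric pairs, `eps`, and `min_pts` are made-up assumptions, and real metrics would need normalizing so that one dimension does not dominate the distance.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


def outlier_indexes(points, eps, min_pts):
    return [i for i, label in enumerate(dbscan(points, eps, min_pts)) if label == NOISE]


if __name__ == "__main__":
    # Hypothetical (latency_ms, error_percent) pairs, one per server.
    servers = [(100, 1.0), (102, 1.1), (98, 0.9), (101, 1.0), (99, 1.2), (250, 9.0)]
    print(outlier_indexes(servers, eps=5, min_pts=3))
```

The indexes returned are the candidates for the actions listed on the slide, from an email or page up to taking the instance out of service, detaching it, or terminating it.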

An ounce of prevention

Canary Release Process
Customers, Load Balancer, Metrics
Old Version (v1.0): 100 Servers, 95%
New Version (v1.1): 5 Servers, 5%

Canary Release Process
Customers, Load Balancer, Metrics
Old Version (v1.0): 0 Servers
New Version (v1.1): 100 Servers, 100%

Define
Metrics
A threshold

Automatic Canary Analysis
Every n minutes
Classify metrics
Compute score
Make a decision

Every n minutes
Classify each metric
Compute the mean value for the canary & control
Calculate the ratio of the mean values
Classify the ratio as high, low, etc.
Compute the final canary score
% of metrics that match in performance
Make go/no-go decision
Continue with release if score is > 95%
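The scoring steps above can be sketched as follows. The 10% tolerance used to classify a ratio, the metric names, and the function names are illustrative assumptions; only the overall flow and the 95% go/no-go threshold come from the slide. The sketch assumes every control mean is non-zero.

```python
from statistics import mean


def classify_ratio(ratio, tolerance=0.1):
    """Classify a canary/control ratio as 'high', 'low', or 'pass'."""
    if ratio > 1 + tolerance:
        return "high"
    if ratio < 1 - tolerance:
        return "low"
    return "pass"


def canary_score(canary, control, tolerance=0.1):
    """Return (score, classes): score is the % of metrics classified 'pass'."""
    classes = {}
    for name in control:
        ratio = mean(canary[name]) / mean(control[name])
        classes[name] = classify_ratio(ratio, tolerance)
    passing = sum(1 for result in classes.values() if result == "pass")
    return 100.0 * passing / len(classes), classes


def should_continue(score, threshold=95.0):
    """Go/no-go: continue the release only if the score is above the threshold."""
    return score > threshold


if __name__ == "__main__":
    # Made-up samples collected over one analysis interval.
    control = {"latency_ms": [100, 110, 90], "errors": [10, 12, 8], "cpu": [50, 52, 48]}
    canary = {"latency_ms": [105, 100, 95], "errors": [20, 22, 18], "cpu": [51, 49, 50]}
    score, classes = canary_score(canary, control)
    print(classes, round(score, 1), should_continue(score))
```

Running this every n minutes against fresh samples gives the repeated classify, score, decide loop described on the slide.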

Thinking Globally
Systematic observation of facets & permutations
Unsupervised monitoring & decision-making
Automated tuning & recovery
Alerts with analysis

Our Journey
Engineering Tools
Insight & Real-time Analytics
Performance & Reliability

Leverage

Performance & Reliability

Internet, Zuul, API, NCCP, Playback History, Playback Sessions, MAP

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Living system analogy - inoculation

Like giving your service a flu shot: it introduces a safer version of the disease in production under controlled circumstances.
Needs periodic boosters.

Imagine a monkey loose in your data center
Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D

Xen Hypervisor vulnerability, 9/25/14

218 out of 2700+ Cassandra nodes rebooted.
22 did not reboot successfully.
Automation handled the rest.

A State of Xen: Chaos Monkey & Cassandra

Out of our 2700+ Cassandra nodes
218 rebooted
22 did not reboot successfully
Automation replaced failed nodes
0 downtime due to reboot

Fault-Injection Testing (FIT)
Device, Internet, ELB, Edge, Zuul, FIT, Service A, Service B, Service C
Simulate service failures
Override by device or account
% of member traffic
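A minimal sketch of the decision FIT has to make per request, under stated assumptions: the `FaultScenario` type, the hashing scheme used to pick a stable percentage of accounts, and all names below are hypothetical illustrations, not the actual FIT implementation.

```python
import hashlib
from dataclasses import dataclass, field


class ServiceUnavailable(Exception):
    """Raised in place of a real call when a failure is being simulated."""


@dataclass
class FaultScenario:
    service: str                      # which downstream service to fail
    traffic_percent: float = 0.0      # % of member traffic to affect
    device_ids: set = field(default_factory=set)   # explicit device overrides
    account_ids: set = field(default_factory=set)  # explicit account overrides


def bucket(account_id: str) -> float:
    """Map an account id to a stable value in [0, 100)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10000 / 100.0


def should_inject(scenario, service, account_id, device_id):
    """Decide whether this request should see a simulated failure."""
    if service != scenario.service:
        return False
    if device_id in scenario.device_ids or account_id in scenario.account_ids:
        return True
    return bucket(account_id) < scenario.traffic_percent


def call_service(scenario, service, account_id, device_id, real_call, fallback):
    """Run real_call unless a failure is injected; fall back on failure."""
    try:
        if should_inject(scenario, service, account_id, device_id):
            raise ServiceUnavailable(f"FIT: simulated failure of {service}")
        return real_call()
    except ServiceUnavailable:
        return fallback()


if __name__ == "__main__":
    scenario = FaultScenario("service-b", device_ids={"dev-1"})
    print(call_service(scenario, "service-b", "acct-9", "dev-1",
                       real_call=lambda: "live data", fallback=lambda: "cached data"))
```

Hashing the account id keeps the affected set of members stable for the length of an experiment, so the same member does not flip between failing and working from request to request.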


Global Traffic Management
US-East, US-West, EU-West
AZ1

The Internet
DNS-based Routing
Zuul Proxy Back Channel

###, ###, ###

Production Ready?
Alerting and Monitoring
Apache & Tomcat Hardening
Automated Canary Analysis
Autoscaling
Chaos Participation
Consistent Naming
ELB Configuration
Healthcheck Configured
Red-Black Pipeline
Squeeze Testing
Timeout & Fallback Tuning
Workload Reliability

Our Journey
Engineering Tools
Insight & Real-time Analytics
Performance & Reliability

Leverage

Operational Tools as a Product
A federation of tools
Common UI elements
Deep linking

Canary Analysis, Conformity, Integration Tests, Citrus, Chaos, Static, Unit Tests

Deep Integration
Modular Components

Functional Testing

RTA auto-tuning, Alerts, Apache/Tomcat, Auto-scaling, Hystrix fallbacks

RTA decision support, ACA, Citrus, Flow, Conformity checks, Consistent names, ELBs, Health check, Red/black deployment

Production Ready Automation & Integration
Delivery integration, ACA, Citrus, FIT

Containing failures
Recovering quickly
Successfully shifting the curve

Internet

Our Journey Ends

Shift the Curve

Continuously engineer your operations to increase quality of customer experience & engineering velocity

https://netflix.github.io/

Session | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Availability: The New Kind of Innovator's Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B


Josh Evans [email protected]

@josh_evans_nflx

Thank you!

