Post on 16-Oct-2020

Tweet @jedberg with feedback!

QConSP 2013

Do you have...

• A release engineer?

• A QA department?

• Chef or Puppet to manage your systems?

Do you have...

• Upwards of 100 releases a day?

Jeremy Edberg

What is Netflix?

Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. For one low monthly price, Netflix members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen.

Source: http://ir.netflix.com

The Netflix way

• Everything is “built for three”

• Fully automated build tools to test and make packages

• Fully automated machine image bakery

• Fully automated image deployment

• Independent teams responsible for both Dev and Ops

Philosophy

Freedom and Responsibility

• We hire responsible adults and keep rules and policies to a minimum

• Developers can change any code in production at any time

• And things don’t break (usually)

• Not eXtreme Go Horse

Automate all the things!

• Application startup

• Configuration

• Code deployment

• System deployment

Automation

• Standard base image

• Tools to manage all the systems

• Automated code deployment

Shared state should be stored in a shared service.

Data on an instance should be replicated to other instances.

“Build for three”

We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.

[Diagram: the Netflix API and its dependencies. Labels: API, Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]

2B requests per day into the Netflix API

12B outbound requests per day to API dependencies

[Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch]


Highly aligned, loosely coupled

• Services are built by different teams who work together to figure out what each service will provide.

• The service owner publishes an API that anyone can use.

Advantages to a Service Oriented Architecture

• Easier auto-scaling

• Easier capacity planning

• Identify problematic code-paths more easily

• Narrow in on the effects of a change

• More efficient local caching

Freedom and Responsibility

• Developers deploy when they want

• They also manage their own capacity and autoscaling

• And fix anything that breaks at 4am!

Decision making

Risk to my service

Risk to Netflix

Time of Day/Week

All system choices assume that some part will fail at some point.

Reliability and $$

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

Execution

Photo from I, Robot, copyright 20th Century Fox

Netflix built a global PaaS

• Service Oriented Architecture

• HTTP/REST interfaces between services

Netflix PaaS features

• Supports all regions and zones

• Multiple accounts

• Cross region/account replication

• Internationalized, localized and GeoIP routed

• Advanced key management

• Autoscaling with 1000s of instances

• Monitoring and alerting on millions of metrics

What AWS Provides

• Instances

• Machine Images

• Elastic IPs

• Load Balancers

• Security groups / Autoscaling groups

• Availability zones and regions

Linux Base AMI (CentOS or Ubuntu)

Java (JDK 6 or 7)

Tomcat

Optional Apache

Monitoring

Log Rotation to S3

AppDynamics Machine Agent

AppDynamics App Agent

monitoring

Application war file, base servlet, platform, interface jars for dependent services

GC and thread dump logging

Healthcheck, status servlets, JMX interface, Servo autoscale

The Netflix Platform

• Discovery (Eureka)

• Entrypoints (Edda)

• Configuration (Archaius)

• Zookeeper (Exhibitor)

• logging (Blitz4j & Honu)

• NIWS (Ribbon)

• GeoBase

• Circuit Breakers (Hystrix)

• Cassandra (Priam & Astyanax & CassJMeter)

• Cryptex

• AKMS

• EVCache

• Zuul

• i18n

• L10n

Open Source

Finding things

• Discovery (Eureka)

• Application to instance mapping

• Heartbeat to keep track of health

• Entrypoints (Edda)

• Local database of AWS resources

• NIWS (Ribbon)

• On instance software load balancer

• Handles retry logic

• Geo (Geolocation library)

• Provides IP to Lat/Lon mapping for any service that needs it.
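
The discovery and load-balancing bullets above can be sketched in a few lines. This is a toy illustration, not Eureka's or Ribbon's actual API: the `Registry` class, `call_with_retry`, the TTL value, and the instance ids are all hypothetical. It shows an application-to-instance mapping that drops instances whose heartbeat has lapsed, and a caller that round-robins over live instances and retries on connection errors.

```python
import itertools
import time


class Registry:
    """Toy application-to-instance registry with heartbeat expiry (hypothetical)."""

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # arbitrary value chosen for the sketch
        self.clock = clock
        self._last_seen = {}  # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        """Register the instance or renew its lease."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return instances of `app` whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            inst
            for (name, inst), seen in self._last_seen.items()
            if name == app and now - seen <= self.ttl
        )


def call_with_retry(registry, app, request, max_attempts=3):
    """Round-robin over live instances, retrying on ConnectionError."""
    live = registry.instances(app)
    if not live:
        raise LookupError(f"no live instances for {app}")
    last_error = None
    for instance in itertools.islice(itertools.cycle(live), max_attempts):
        try:
            return request(instance)
        except ConnectionError as err:
            last_error = err  # try the next instance
    raise last_error
```

Doing the balancing on the calling instance means there is no separate load balancer tier to fail between services; the trade-off is that every client needs a reasonably fresh view of the registry.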

Entrypoints (Edda)

• REST API

• GET /REST/v1/instance/$id

• Keeps track of all resources

• Autoscaling groups, EIPs, Instances, Applications, Clusters, History
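
As a rough sketch of calling the REST endpoint shown above: the path `/REST/v1/instance/$id` is taken from the slide, while the base URL, the helper names, and the assumption that the response body is JSON are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def instance_url(base_url, instance_id):
    """Build the URL for one instance, following the path on the slide."""
    return f"{base_url.rstrip('/')}/REST/v1/instance/{quote(instance_id, safe='')}"


def fetch_instance(base_url, instance_id, timeout=5.0):
    """Fetch and decode one instance record (assumes a JSON response)."""
    with urlopen(instance_url(base_url, instance_id), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `instance_url("http://edda.example.com", "i-4a12d3b9")` yields `http://edda.example.com/REST/v1/instance/i-4a12d3b9`, where `edda.example.com` is a placeholder host.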

Entrypoints Exploration

Find all active instances:

all()

Find all instances in a group:

%(cloudmonkey)

How many instances are not in an autoscale group?

count(all(),-info(eval(INSTANCES;asg())))

Which ELB contains a particular instance?

filter(TYPE;asg;*(i-4a12d3b9))

Keeping it all straight

• Configuration (Archaius)

• Global variables (Fast properties)

• Base

• Base system. Prod vs. Test, etc.

• Zookeeper (Curator)

• Locks, other similar coordination

• logging (Blitz4j and Honu)

• Keep track of what happened and store it for post analysis.
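
To illustrate the idea of "fast properties" (global variables that can change at runtime without a redeploy), here is a minimal sketch. It is not the Archaius API: `PropertyStore`, `DynamicProperty`, and the property name are hypothetical, and an in-memory dict stands in for the central configuration source.

```python
class PropertyStore:
    """In-memory stand-in for a central configuration service (hypothetical)."""

    def __init__(self):
        self._values = {}
        self._listeners = {}

    def get(self, name, default=None):
        return self._values.get(name, default)

    def on_change(self, name, callback):
        """Register a callback invoked with the new value when `name` changes."""
        self._listeners.setdefault(name, []).append(callback)

    def set(self, name, value):
        old = self._values.get(name)
        self._values[name] = value
        if old != value:
            for callback in self._listeners.get(name, []):
                callback(value)


class DynamicProperty:
    """Reads the current value on every access instead of caching it."""

    def __init__(self, store, name, default):
        self._store = store
        self._name = name
        self._default = default

    def get(self):
        return self._store.get(self._name, self._default)
```

Code that holds a `DynamicProperty` sees a new value on its next `get()`, so an operator can change behavior across a fleet by updating one entry in the store.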

Keeping it secure

• Cryptex

• Service for key management

• High, medium and low value keys

• AKMS (Amazon Key Management System)

• Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance

Key Management

• Cryptex service provides keys

• Low value: Cookie encryption keys

• Med value: Device activation keys

• High value: Credit card encryption

Cryptex

• Pass in encrypted string, get decrypted string out

• Decryption is in a different place depending on value of key

• Always try to design for lowest value key

Translating it

• i18n (Internationalization)

• Make it easy to translate things from one language to another

• L10n (Localization)

• The library that actually does the translations

Storing it

• Cassandra (Priam, Astyanax)

• Configure and access Cassandra

• Provide OO abstractions; handle connection pooling and discovery of hosts

• EVCache (Ephemeral Volatile Cache)

• Wrapper for memcached to handle zone awareness and replication

• Proxies

• Get data out of the datacenter and into the cloud.
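
The zone awareness and replication mentioned for EVCache can be sketched as follows. This is an illustration of the general idea only, not EVCache's implementation: `ZoneAwareCache` is a hypothetical name, plain dicts stand in for memcached clusters, and the zone names are examples.

```python
class ZoneAwareCache:
    """Toy cache wrapper: write to every zone, read from the local zone first."""

    def __init__(self, local_zone, zones):
        if local_zone not in zones:
            raise ValueError(f"unknown local zone: {local_zone}")
        self.local_zone = local_zone
        # One dict per zone stands in for one memcached cluster per zone.
        self.stores = {zone: {} for zone in zones}

    def set(self, key, value):
        """Replicate the write to every zone."""
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        """Prefer the local zone; on a miss, fall back to other zones."""
        local = self.stores[self.local_zone]
        if key in local:
            return local[key]
        for zone, store in self.stores.items():
            if zone != self.local_zone and key in store:
                local[key] = store[key]  # repair the local copy
                return store[key]
        return None
```

Reading locally keeps latency low in the common case, while replicated writes mean losing one zone's cache does not turn into a wave of misses against the backing store.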

Data

What do we do with it all?

We store it!

• Cache (memcached)

• Cassandra

• RDS (MySQL)

Cassandra

Why Cassandra?

• Availability over consistency

• Writes over reads

• We know Java

• Open source + support

Cassandra Benefits

• Fast writes

• Fast negative lookups

• Easy incremental scalability

• Distributed -- No SPoF

Things we store in Cassandra

• Video Quality

• Network issues

• Usage History

• Playback Errors

• A/B Tests

A/B Testing

Online Data:

• Test Cell allocation

• Test Metadata

• Start/End date

• UI Directives

Offline Data:

• Test tracking

• Retention

• Fraction Viewed

• Pages Viewed

Using Cassandra at Netflix

• Priam

• Zero touch auto-config

• State management

• Token assignment

• Node replacement

• Backup/restore to/from S3

• Astyanax

• OO abstraction to Cassandra

• Multi-region support
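
To make "token assignment" and "node replacement" concrete, here is a small sketch. It is not Priam's algorithm: it assumes Cassandra's RandomPartitioner token range of 0 to 2^127 and simply spaces tokens evenly, and the function names and node ids are hypothetical.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token range (assumed for this sketch)


def even_tokens(node_count):
    """Return evenly spaced tokens so each node owns an equal slice of the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


def replace_node(ring, dead_node, new_node):
    """Give a replacement node the dead node's token so ownership is unchanged."""
    ring = dict(ring)  # leave the caller's mapping untouched
    ring[new_node] = ring.pop(dead_node)
    return ring
```

Having the replacement inherit the dead node's token means the rest of the ring keeps its ranges, so only the replaced node's data has to be restored.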

Cassandra Architecture

For more info, see DAT202: Optimizing your Cassandra Database on AWS

Tools

• Asgard

• AWS usage

• Atlas

• Chronos

• Build system

• Explorers (Cassandra and SimpleDB)

Deploying Code; Step 1

Auto Scaling Group

Launch Configuration

Security Group

Amazon Machine Image

Instances

Configuration

Elastic Load Balancer

api-usprod-v007

api-frontend

api-usprod-v008

Netflix has moved the granularity from the instance to the cluster.
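
One way to read the `api-usprod-v007` / `api-usprod-v008` diagram is as a cluster-level rollout: a new versioned group is launched next to the old one behind the same front end, and traffic moves over once the new group is healthy. The sketch below is an assumption about that flow, not Asgard's implementation; the function names are hypothetical and a plain set stands in for the load balancer's registered groups.

```python
import re

# Versioned group names as they appear on the slide, e.g. "api-usprod-v007".
_VERSIONED = re.compile(r"^(?P<cluster>.+)-v(?P<num>\d{3})$")


def next_group_name(current):
    """Return the name of the next versioned group in the same cluster."""
    match = _VERSIONED.match(current)
    if match is None:
        raise ValueError(f"not a versioned group name: {current}")
    return f"{match['cluster']}-v{int(match['num']) + 1:03d}"


def deploy_new_version(load_balancer, old_group, is_healthy):
    """Attach the new group, then drop whichever group should not take traffic."""
    new_group = next_group_name(old_group)
    load_balancer.add(new_group)  # both versions briefly take traffic
    if is_healthy(new_group):
        load_balancer.discard(old_group)  # old group can be kept for rollback
    else:
        load_balancer.discard(new_group)  # roll back by detaching the new group
    return new_group
```

Because the unit of change is the whole group rather than a single machine, rolling back is a matter of pointing traffic at the previous group again.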

Why Bake?

Traditional (Generic AMI → Instance):

• launch OS

• install packages

• install app

Netflix (App AMI → Instance):

• launch OS + app

Getting Baked

[Diagram: the build pipeline. Stages: sync, resolve, compile, build, test, report, publish. Tools and stores: Jenkins; Perforce / Git (source); Ivy and Artifactory (libraries; snapshot / release libraries / apps); Ant targets; Groovy all over. Output: app bundles]

Base Image Baking

[Diagram: the Bakery, running on ec2 slave instances, mounts the foundation AMI (Linux: CentOS, Fedora, Ubuntu) from S3 / EBS, installs RPMs (Apache, Java...) from Yum / Apt, and snapshots the result as the base AMI. Ready for app bake]

App Image Baking

[Diagram: the Bakery, running on ec2 slave instances, mounts the base AMI (Linux, Apache, Java, Tomcat) from S3 / EBS, installs the app bundle from Jenkins / Yum / Artifactory, and snapshots the result as the app AMI. Ready to launch!]

Linux Base AMI (CentOS or Ubuntu)

Java (JDK 6 or 7)

Tomcat

Optional Apache

Monitoring

Log Rotation to S3

AppDynamics Machine Agent

AppDynamics App Agent

monitoring

Application war file, base servlet, platform, interface jars for dependent services

GC and thread dump logging

Healthcheck, status servlets, JMX interface, Servo autoscale

Linux Base AMI (CentOS or Ubuntu)

Java (JDK 6 or 7)

JBoss

Optional Apache

Monitoring

Log Rotation to S3

AppDynamics Machine Agent

AppDynamics App Agent

monitoring

Application war file, base servlet, platform, interface jars for dependent services

GC and thread dump logging

Healthcheck, status servlets, JMX interface, Servo autoscale

Linux Base AMI (CentOS or Ubuntu)

Python

Django

Optional Apache

Monitoring

Log Rotation to S3

AppDynamics Machine Agent

monitoring

Application file, base server, platform, interface libs for dependent services

logging

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

The Simian Army

• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates
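
The first monkey in the list, killing random instances, can be sketched as below. This is an illustration of the idea rather than the real Chaos Monkey: `pick_victims`, the `probability` parameter, and the group and instance names are hypothetical, and the sketch only selects instances; terminating them is left to the caller.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Pick at most one random instance per group to terminate.

    `groups` maps a group name to a list of instance ids. Each non-empty
    group is selected with the given probability.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for group, instances in groups.items():
        if instances and rng.random() < probability:
            victims[group] = rng.choice(sorted(instances))
    return victims
```

Running something like this continuously against production is what forces every service to tolerate the loss of any single instance.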

What’s going on?!

Atlas

Example Alert Config

    {
      "clusters": [
        "epic_aggregator",
        "epic_aggregator-dev"
      ],
      "alerts": [
        // you can use javascript style comments in the config
        {
          "metricName": "EpicPlugin_NumDropped",
          "applyTo": "cluster",
          "condition": {
            "type": "StaticThreshold",
            "max": 0.0
          },
          "severity": "major",
          "description": "plugin is dropping metrics"
        },
        {
          "metricName": "EpicPlugin_NumDropped_Instance",
          "applyTo": "instance",
          "condition": {
            "type": "NumOccurrences",
            "num": …,
            "condition": {
              "type": "StaticThreshold",
              "max": 0.0
            }
          },
          "overrides": {
            "service_key_override": "…",
            "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
            "email_override": "devnull@netflix.com"
          },
          "severity": "minor"
        },
        {
          "metricName": "EpicPlugin_MetricCount",
          "applyTo": "instance",
          "description": "${instanceId} is reporting too many metrics",
          "condition": {
            "type": "NumOccurrences",
            "num": …,
            "condition": {
              "type": "StaticThreshold",
              "max": 0.0
            }
          },
          "additionalDetails": {
            "statusUrl": "http://${publicDnsName}:…/Status",
            "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
          },
          "overrides": {
            "subject": "${instanceId} is reporting too many metrics",
            "incident_key": "${metricName}:${instanceId}",
            "service_key_override": "…",
            "email_override": "devnull@netflix.com"
          },
          "severity": "minor"
        }
      ]
    }

Alert Tuning

Alert Systems

[Diagram components: alerting API, CORE Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, Atlas, AppDynamics]

Chronos

Best Practices

Incident Reviews

Ask the key questions:

• What went wrong?

• How could we have detected it sooner?

• How could we have prevented it?

• How can we prevent this class of problem in the future?

• How can we improve our behavior for next time?

Best Practices for Data

• Have multiple copies of all data

• Keep those copies in multiple AZs

• Avoid keeping state on a single instance

• Take frequent snapshots of EBS disks

• No secret keys on the instance

Circuit Breakers (Hystrix)

Be liberal in what you accept, strict in what you send
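
For readers unfamiliar with the pattern, here is a minimal circuit breaker sketch. It illustrates the general technique, not Hystrix's API or defaults: the class name, the thresholds, and the fallback convention are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: fail fast to a fallback while a dependency is down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        """Run `func`, or return `fallback()` if it fails or the circuit is open."""
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after:
            return fallback()  # open: do not touch the dependency at all
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open the failing dependency gets no traffic, which gives it room to recover and keeps callers from tying up threads waiting on timeouts.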

Netflix autoscaling

[Chart annotations: (1) Traffic Peak, (2) Deployment]

AWS Usage

Dollar amounts have been carefully removed

Going multi-zone

Benefits of Amazon’s Zones

• Loosely connected

• Low latency between zones

• 99.95% uptime guarantee per region

Going Multi-region

Leveraging Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money

Multi-Region Challenges

• Data replication

• Cache invalidation

• Misdirected users

• Sudden load increase during failover

• When do you fail over?

Data Replication

Cache Replication

• Three strategies available to users:

• No replication

• Invalidation only

• Full copy
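
The three strategies can be contrasted with a short sketch. The strategy names come from the list above; everything else (the `Replication` enum, the `write` function, and dicts standing in for the caches in two regions) is hypothetical.

```python
from enum import Enum


class Replication(Enum):
    NONE = "no replication"
    INVALIDATE = "invalidation only"
    FULL_COPY = "full copy"


def write(local, remote, key, value, strategy):
    """Write to the local region's cache and treat the remote region per strategy."""
    local[key] = value
    if strategy is Replication.INVALIDATE:
        remote.pop(key, None)  # remote re-fetches from its own source on next read
    elif strategy is Replication.FULL_COPY:
        remote[key] = value  # remote cache stays warm, at the cost of bandwidth
    # Replication.NONE: the remote cache is left as-is and may serve stale data
```

Invalidation sends only keys across regions, while a full copy sends values too but keeps the remote cache warm for a failover.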

Traffic Routing and Failover

• Need to scale up and not get overwhelmed

• Don’t want to suddenly give a bad experience to people

• Make sure that misrouted users are sent “home”

• Can’t fail over at the first sign of trouble; need to strike a balance
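
One simple way to avoid failing over at the first sign of trouble is to require several consecutive bad health checks before acting. The sketch below illustrates that idea only; it is not how Netflix decides, and the class name, threshold, and check count are assumptions.

```python
class FailoverDecider:
    """Toy failover trigger that requires a streak of bad health checks."""

    def __init__(self, error_threshold=0.5, required_bad_checks=3):
        self.error_threshold = error_threshold
        self.required_bad_checks = required_bad_checks
        self._bad_streak = 0

    def observe(self, error_rate):
        """Record one health check; return True when failover should start."""
        if error_rate >= self.error_threshold:
            self._bad_streak += 1
        else:
            self._bad_streak = 0  # a single good check resets the streak
        return self._bad_streak >= self.required_bad_checks
```

A longer required streak avoids needless failovers on brief blips but lengthens the time users see errors during a real outage, which is the balance the slide refers to.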

Coming soon...

• We’re in the testing phases now

• Expect to see more info and a tech blog post in the future

Just a quick reminder...

(Some of) Netflix is open source:

https://github.com/netflix

Netflix is hiring

http://jobs.netflix.com/jobs.html

Please don’t forget to vote!

Voting is how we know what to present to you next time. :)

Questions?

Getting in touch

Email: jedberg@{gmail,netflix}.com

Twitter: @jedberg

Web: www.jedberg.net

Facebook: facebook.com/jedberg

LinkedIn: www.linkedin.com/in/jedberg