Netflix Open Source & What I Have Done in a Year
Andrew Spyker, Senior Software Engineer, Netflix
Back to the Past
Previous talks at @TriangleDevops
● 10/16/2013 - Learn about NetflixOSS
● 6/18/2014 - Learn about Docker
About Netflix
● 69M members
● 2,000+ employees (1,400 tech)
● 80+ countries
● > 100M hours watched per day
● > ⅓ of North American internet download traffic
● 500+ microservices
● Many tens of thousands of VMs
● 3 regions across the world
About the Speaker
● Cloud platform technologies
○ Distributed configuration, service discovery, RPC, application frameworks, non-Java sidecar
● Container cloud
○ Resource management and scheduling, making Docker containers operational in Amazon EC2/ECS
● Open Source
○ Organize @NetflixOSS meetups & internal group
● Performance
○ Assist across Netflix, but focused mainly on cloud platform perf
With Netflix for ~ 1 year. Previously at IBM here in Raleigh/Durham (RTP)
@aspyker
ispyker.blogspot.com
Agenda
● NetflixOSS
Netflix Cloud Architecture
Getting Started
Personal Projects
Why does Netflix open source?
● Allows engineers to gather feedback
○ Openly talk, through code, about our approach
○ Collaboration on key projects with the world
○ Happily use proven outside open source
■ And improve it for Netflix scale and availability
● Netflix culture of freedom and responsibility
○ Want to open source?
○ Go for it, be responsible!
● Recruiting and retention
○ Candidates know exactly what they can work on
○ NetflixOSS engineers choose to stay at Netflix
NetflixOSS is widely used
● The architecture has shaped public cloud usage
○ Immutability, red/black deploys, chaos, regional and worldwide high availability
● Offerings
○ Pivotal Spring Cloud
● Large usage
○ IBM Watson as a Service (on IBM Cloud)
○ Nike Digital is hiring NetflixOSS experts
● Interesting usage
○ “To help locate new troves of data claiming to be the files stolen from AshleyMadison, the company’s forensics team has been using a tool that Netflix released last year called Scumblr”
NetflixOSS Website Relaunch
http://netflix.github.io
Key aspects of the NetflixOSS website
● Show how the pieces fit together
○ Projects are now discussed with each other in context
● OSS categories mirror internal teams
○ No artificial categories; focal points for each area
● Focus on projects that are core to Netflix
○ Projects mentioned are core and strategic
Agenda
NetflixOSS
● Netflix Cloud Architecture
Getting Started
Personal Projects
Elastic, Web and Hyper Scale
Doing this
Not doing that
Elastic, Web and Hyper Scale
[Diagram: load balancers → front-end API → microservices (e.g., a recommendation microservice and another microservice), backed by a temporal caching tier and durable storage]
Strategy | Benefit
Automate everything | Fewer errors, more consistency than manual runbooks
Expose well-designed APIs to users | Offloads presentation complexity to clients
Remove state from mid-tier services | Allows easy elastic scale-out
Push temporal state to client and caching tier | Leverages clients, avoids data-tier overload
Use partitioned data storage | Data design and storage scale with HA
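The last row's idea, partitioned data storage, is commonly implemented with consistent hashing so data placement scales as nodes join or leave. A minimal sketch (the node names are hypothetical):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring: keys map to the first node clockwise."""

    def __init__(self, nodes, vnodes=100):
        # Place each node at many virtual points on the ring to even out load.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node at or after the key's hash.
        idx = bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cass-a", "cass-b", "cass-c"])
owner = ring.node_for("member:12345")  # every caller computes the same owner
```

Because each caller computes placement locally, no central lookup service is on the hot path, and adding a node only remaps a fraction of the keys.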
HA and Automatic Recovery
Feeling This
Not Feeling That
Highly Available Service Runtime Recipe
[Diagram: Microservice #1 (REST services) executes calls through a Ribbon REST client with Eureka awareness, wrapped in Hystrix with a Karyon fallback implementation, to reach Microservice #2; instances register with redundant Eureka server(s)]
Implementation Detail | Benefits
Decompose into microservices | Key user path always available; failure does not propagate across service boundaries
Karyon w/ automatic Eureka registration | New instances are quickly found; failing individual instances disappear
Ribbon client with Eureka awareness | Load balances & retries across instances with “smarts”; handles temporal instance failure
Hystrix as dependency circuit breaker | Allows for fast failure; provides graceful cross-service degradation/recovery
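The circuit-breaker behavior Hystrix provides can be sketched in a few lines (an illustrative toy, not Hystrix's real API): after repeated failures the breaker opens, calls fail fast to a fallback, and after a cooldown a trial call may close it again.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: fail fast to a fallback while a dependency is sick."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, command, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # open: fail fast, no slow call
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = command()
            self.failures = 0              # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()              # graceful degradation
```

The key property is the "fast failure" row above: once open, callers get the degraded answer immediately instead of queueing threads behind a dying dependency.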
IaaS High Availability
[Diagram: Region (us-east-1) spanning availability zones us-east-1c, us-east-1d, us-east-1e; ELBs route to Web App, Service1, and Service2 clusters managed by cluster auto recovery and scaling services (Auto Scaling Groups), with Eureka in each zone]
Rule | Why?
Always > 2 of everything | 1 is a SPOF; 2 doesn’t scale, slows DR recovery, and makes majority consensus impossible
Including IaaS and cloud services | You’re only as strong as your weakest dependency
Use auto scaler/recovery monitoring | Clusters guarantee availability and service latency
Use application-level health checks | An instance on the network != a healthy instance
Worldwide availability | Data replication, global front-end routing, cross-region traffic
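The "instance on the network != healthy" rule implies the health check must exercise real dependencies rather than just answer the socket. A sketch, where `check_database` and `check_cache` are hypothetical dependency probes:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    return True   # e.g. run "SELECT 1" through the connection pool

def check_cache():
    return True   # e.g. round-trip a sentinel key through the cache client

def health_status(checks):
    """Return (HTTP status, body) based on the app's real dependencies."""
    healthy = all(check() for check in checks.values())
    return (200, "UP") if healthy else (503, "DOWN")

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The load balancer / Eureka polls this instead of a bare TCP check.
        code, body = health_status({"db": check_database, "cache": check_cache})
        self.send_response(code)
        self.end_headers()
        self.wfile.write(body.encode())

# HTTPServer(("", 8077), HealthCheckHandler).serve_forever()
```

Returning 503 while a dependency is broken lets the auto-recovery layer replace the instance rather than keep routing traffic to it.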
A truly global service
● Replicate data across regions
● Be able to redirect traffic from region to region
● Be able to migrate regional traffic to other regions
● Have automated control across regions

Flux Demo
Testing is the only way to prove HA
● Chaos Monkey
○ Kills instances in production; runs regularly
● Chaos Gorilla
○ Kills availability zones (a single datacenter)
○ Also important for testing split-brain scenarios
● Chaos Kong
○ Kills an entire region and shifts traffic globally
○ Run frequently, but with prior scheduling
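The core of a Chaos Monkey run is simple enough to sketch: pick one random instance per eligible cluster and terminate it. Here `terminate` stands in for the real cloud API call (e.g. EC2 instance termination), and the guard rail is illustrative:

```python
import random

def chaos_run(clusters, terminate, rng=random):
    """Toy Chaos Monkey: kill one random instance per eligible cluster."""
    killed = []
    for name, instances in clusters.items():
        if len(instances) < 2:
            continue  # never take down a cluster's only instance
        victim = rng.choice(instances)
        terminate(victim)                 # stand-in for the cloud API call
        killed.append((name, victim))
    return killed
```

Running this continuously in production forces every service to actually implement the recovery patterns above, rather than assuming instances are permanent.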
Continuous Delivery
Reading This
Not This
Continuous Delivery
[Diagram: Cluster v1 → Canary v2 → Cluster v2]
Step | Technology
Developers test locally | Unit test frameworks
Continuous build | Continuous build server based on Gradle builds
Build “bakes” a full instance image | Aminator and deployment pipeline bake images from build artifacts
Developers work across dev and test | Archaius allows for environment-based context
Developers do canary tests, red/black deployments in prod | Asgard console provides an app/cluster common devops approach, security patterns, and visibility
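A canary deployment only promotes the new version when its behavior is acceptable against the baseline cluster. A toy automated check on error rates (the thresholds are illustrative, not Netflix's real canary analysis):

```python
def canary_passes(baseline_errors, baseline_total,
                  canary_errors, canary_total,
                  max_ratio=1.2, min_requests=1000):
    """Promote only if the canary's error rate is not meaningfully worse."""
    if canary_total < min_requests:
        return False  # not enough traffic to judge the canary yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    # Allow the canary a small margin over the baseline error rate; the
    # 0.001 floor keeps a near-zero baseline from failing healthy canaries.
    return canary_rate <= max_ratio * max(baseline_rate, 0.001)
```

In practice many more signals (latency, load, app metrics) feed the decision, but the shape is the same: compare canary against baseline, then promote or roll back automatically.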
[Diagram: continuous build server artifacts baked into images (AMIs)]
From Asgard to Spinnaker
● Spinnaker is our CI/CD solution
○ End-to-end CI/CD, including baking and Jenkins integration
○ Workflow engine for continuous delivery
○ Pipeline-based deployment, including baking
○ Global visibility across all of our AWS regions
○ API-first design
○ Microservices runtime with an HA architecture
○ More flexible cloud model so the community can contribute back improvements not related to AWS
● Asgard continues to work side-by-side
● Spinnaker is the new end-to-end CI/CD tool
Spinnaker Examples
Works at Netflix scale
Views of global pipelines
From simple Asgard-like deployment to advanced CI/CD pipelines
Operational Visibility
If you can’t see it, you can’t improve it
Operational Visibility
Visibility Point | Technology
Basic IaaS instance monitoring | Not enough (not scalable, not app-specific)
User-like external monitoring | SaaS offerings or OSS like Uptime
Targeted performance, sampling | Vector performance and app-level metrics
Service-to-service interconnects | Hystrix streams → Turbine aggregation → Hystrix dashboard
Application-centric metrics | Servo/Spectator gauges, counters, and timers sent to a metrics store like Atlas
Remote logging | Logstash/Kibana or similar log aggregation and analysis frameworks
Threshold monitoring and alerts | Services like Atlas and PagerDuty for incident management
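Application-centric metrics in the Servo/Spectator style boil down to counters, gauges, and timers that a background publisher periodically flushes to a store like Atlas. A minimal sketch (not the real Spectator API):

```python
import time
from collections import defaultdict

class Registry:
    """Toy metrics registry: counters and timers a publisher would flush."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timers = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def record_timer(self, name, seconds):
        self.timers[name].append(seconds)

registry = Registry()

def timed(name):
    # Decorator that records each call's latency under the given metric name.
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                registry.record_timer(name, time.monotonic() - start)
        return inner
    return wrap

@timed("api.request.latency")
def handle_request():
    registry.increment("api.request.count")
    return "ok"
```

Instrumenting at the application level like this is what makes the threshold alerting in the last table row possible: alerts fire on metrics the app itself defines, not just on CPU and disk.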
[Diagram: microservices emit Servo/Spectator metrics and Hystrix/Turbine streams into metric/event repositories such as Atlas; external uptime monitoring, Vector, LogStash/ElasticSearch/Kibana log analysis, and incident alerting surround them]
Security
Dynamic Security
Done in new ways
NOT
Dynamic, Web Scale & Simpler Security

Security Monkey
● Monitors security policies, tracks changes, alerts on situations
Scumblr
● Searches the internet for security “nuggets” (credentials, hacking discussions)

Sketchy
● A safe way to collect text and screenshots from websites

FIDO
● Automated event detection, analysis, enrichment & enforcement

Sleepy Puppy
● Delayed cross-site scripting propagation testing framework

Lemur
● X.509 certificate orchestration framework
What did we not cover?
Over 50 GitHub projects
● NetflixOSS is “Technical indigestion as a service”
Big Data, Data Persistence and UI Engineering
● Big Data tools used well beyond Netflix
● Ephemeral, semi-, and fully persistent data systems
● Recent addition of UI OSS and Falcor
Agenda
NetflixOSS
Netflix Cloud Architecture
● Getting Started
Personal Projects
How do I get started?
● All of the previous slides show NetflixOSS components
○ Code: http://netflix.github.io
○ Announcements: http://techblog.netflix.com/
● Want to get running a bit faster?
● ZeroToCloud
○ Workshop for getting started with build/bake/deploy in Amazon EC2
● ZeroToDocker
○ Docker images containing running Netflix technologies (not production ready, but easy to understand)
ZeroToDocker Demo
[Diagram: Mac OS X → VirtualBox → Ubuntu 14.04 (single kernel) running containers, each a filesystem + process: Container #1, a Eureka container, a Zuul container, another container, ...]
● Docker running instances
○ Single kernel
○ Contained processes
● Zookeeper and Exhibitor
● A microservices app and surrounding NetflixOSS services (Zuul to Karyon with Eureka)
Agenda
NetflixOSS
Netflix Cloud Architecture
Getting Started
● Personal Projects
Performance Focus
● Reduced Karyon startup time by ⅔
○ Removed classpath scanning
○ Moved Eureka “UP” registration to be event-based
○ Java 8 (faster startup was a focus)
● Investigated other opportunities now being considered for Karyon 3
○ Loading components asynchronously (console)
● Beyond platform startup time: a key service
○ Fixes to the platform that saved 3 minutes (library version tracking, Ribbon connection priming)
○ Fixes to application logic (distributed indexing/filtering)
Performance Focus - Eureka
● Identified issues with OOMs & the Eureka client
○ A “full update” used 2 GB of memory
○ Was crashing discovery for our EVCache nodes
● Helped prototype the following
○ XStream: required 370 MB of heap
○ Jackson v1 (first attempt): down to 260 MB
○ Jackson v2 (current): down to 130 MB
○ Jackson v2 (+ compact, for future scenarios): down to 64 MB
Performance Automation
● Implemented automated performance measurement
● Jenkins pipeline as part of every platform candidate
● Uses Elastic(search) and Kibana dashboards
● Measures
○ Boot to Tomcat start time
○ Tomcat start to up in discovery
○ Profiles the startup
○ Number of dependencies
○ Used/unused dependencies
○ JaCoCo code coverage
● In-our-face monitoring dashboard
Platform Sidecar (Prana)
● Prana started as an edge-focused “what was needed”, then saw wider Netflix usage
● Created release management
○ User-oriented smoke tests - Acme Air NodeJS
○ Now releases can be done with confidence
● Supported the Netflix desktop experience
○ Uses isomorphic JavaScript on NodeJS + Prana
○ Added circuit breaker, LB & distributed config support
○ Caused my first partial outage (insert story here)
● Supported the EVCache clusters
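The sidecar idea behind Prana: a co-located process exposes platform facilities (service discovery, dynamic config, circuit breaking) over local HTTP so a non-Java app like a NodeJS service can use them without linking the Java libraries. A toy sketch with a hypothetical `/hosts/<vip>` endpoint and an in-memory registry standing in for Eureka:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for what the sidecar would fetch from Eureka on the app's behalf.
SERVICE_REGISTRY = {"recommendations": ["10.0.0.12:7001", "10.0.0.13:7001"]}

class SidecarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /hosts/recommendations -> JSON list of known instances
        if self.path.startswith("/hosts/"):
            vip = self.path.split("/", 2)[2]
            body = json.dumps(SERVICE_REGISTRY.get(vip, [])).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# The app then asks its local sidecar instead of speaking Eureka's protocol:
# HTTPServer(("127.0.0.1", 8078), SidecarHandler).serve_forever()
```

The payoff is that platform upgrades ship in the sidecar, and every language gets discovery, config, and circuit breaking through plain local HTTP.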
Strategy - Platform Direction
● Helped define some of the platform direction
● Improvements in Eureka to ensure its continued scalability
● Key improvements needed in Karyon 3
○ Performance improvements (footprint/startup)
○ Focus on mocks needed in dev, unit test, and CI environments
○ Ability to narrow features for infrastructural services
○ Rework of Prana to be on the same platform base
Open Source
● Led internal & external meetups on OSS
● Web site redesign to help external users
● Implemented ZeroToDocker
○ Implemented the platform-focused aspects
○ Helped other teams onboard into ZeroToDocker
● Worked to operationalize prod deployments
○ Separate dev stack, metrics, consistent pipelines
○ Built up teams (existing impl, strategic work)
● Created strategy for going forward
○ Increase leverage of “Mantis” technology for scheduling and job management
○ Increase leverage of ECS for Docker AWS integration & resource management
● Working on strategy for non-runtime components
○ Changes to Netflix build/bake/deploy
○ Changes to key supporting services
Container cloud
Questions?