AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration with AWS (CON313)



Andrew Spyker, Sr. Software Engineer, Netflix

December 2016

CON313

Netflix: Container Scheduling, Execution, and Integration with AWS

What to Expect from the Session

• Why containers?

• Including current use cases and scale

• How did we get there?

• Overview of our container cloud platform

• Collaboration with ECS

About Netflix

• 86.7M members

• 1000+ developers

• 190+ countries

• > ⅓ of North American internet download traffic

• 500+ microservices

• Over 100,000 VMs

• 3 regions across the world

Why containers?

Given that our VM architecture is already …

amazingly resilient,

microservice driven,

cloud native,

CI/CD DevOps enabled,

and elastically scalable …

do we really need containers?

Our Container System Provides Innovation Velocity

• Iterative local development, deploy when ready

• Manage app and dependencies easily and completely

• Simpler way to express resources; let the system manage them

Innovation Velocity - Use Cases

• Media encoding - encoding research development time

• Using VMs - 1 month, using containers - 1 week

• Niagara

• Build all Netflix codebases in hours

• Saves development teams hundreds of hours of debugging

• Edge Rearchitecture with Node.js

• Focus returns to app development

• Simplifies, speeds test and deployment

Why not use an existing container mgmt solution?

• Most solutions are focused on the datacenter

• Most solutions are:

  • Working to abstract the datacenter and go cross-cloud

  • Delivering more than a cluster manager

  • Not yet at our level of scale

• Wanted to leverage our existing cloud platform

Not appropriate for Netflix

Batch

What do batch users want?

• Simple shared resources, run till done, job files

• NOT

• EC2 instance sizes, automatic scaling, AMI OS

• WHY

• Offloads resource management ops, simpler

Historic use of containers

• General workflow (Meson) and stream processing (Mantis)

• Proven using Linux cgroups and Mesos

• With simple isolation

• Using specific packaging formats

Enter Titus

[Titus architecture: Job Management (Batch), Resource Management & Optimization, Container Execution, AWS Integration]

Sample batch use cases

• Algorithm model training (GPU usage)

  • Personalization and recommendation

  • Deep learning with neural nets / mini-batch

• Titus

  • Added g2 support using nvidia-docker-plugin

  • Mounts NVIDIA drivers and devices into Docker containers (see the sketch below)

  • Distribution of training jobs and infrastructure made self-service

  • Recently moved to p2.8xl instances

  • 2x performance improvement with the same CUDA-based code
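A rough illustration of the nvidia-docker-plugin style of GPU enablement described above: the NVIDIA device nodes and a volume carrying the host's driver libraries are mounted into the container. This is a hedged sketch, not Titus's actual invocation; the device paths, driver volume name, image, and entrypoint are illustrative assumptions.

```python
import subprocess

# Hedged sketch: launch a GPU container the way nvidia-docker-plugin-style
# integrations do, by exposing the NVIDIA device nodes and mounting a volume
# that holds the host's driver libraries. All names/paths are illustrative.
cmd = [
    "docker", "run", "--rm",
    "--device=/dev/nvidiactl",          # control device
    "--device=/dev/nvidia-uvm",         # unified virtual memory device
    "--device=/dev/nvidia0",            # first GPU
    "-v", "nvidia_driver_375.20:/usr/local/nvidia:ro",  # driver volume (assumed name)
    "training-image:latest",            # assumed image
    "python", "train.py",               # assumed entrypoint
]
subprocess.run(cmd, check=True)
```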

Sample batch use cases

• Media encoding experimentation

• Digital watermarking

Sample batch use cases

• Ad hoc reporting

• Open Connect CDN reporting

Lessons learned from batch

• Docker helped generalize use cases

• Cluster automatic scaling adds efficiency

• Advanced scheduling required

• Initially ignored failures (with retries)

• Time-sensitive batch came later

Titus Batch Usage (Week of 11/7)

• Started ~ 300,000 containers during the week

• Peak of 1000 containers per minute

• Peak of 3,000 instances (a mix of r3.8xl and m4.4xl)

Services

Adding Services to Titus

[Titus architecture: Job Management now spans Batch and Service; Resource Management & Optimization; Container Execution; AWS Integration]

Services are just long-running batches, right?

Services are more complex

Services resize constantly and run forever

• Automatic scaling

• Hard to upgrade underlying hosts

Have more state

• Ready for traffic vs. just started/stopped

• Even harder to upgrade

Existing, well-defined dev, deploy, runtime, & ops tools

Real Networking is Hard

Multi-Tenant Networking is Hard

• IP per container

• Security group support

• IAM role support

• Network bandwidth isolation

Solutions

• VPC Networking driver

• Supports ENIs - full IP functionality

• With scheduling - security groups

• Support traffic control (isolation)

• EC2 Metadata proxy

• Adds container “node” identity

• Delivers IAM roles

VPC Networking Integration with Docker

The Titus executor drives the Titus networking driver and Docker through these steps:

1. Create and attach an ENI with the requested security group and IP address.

2. Launch a "pod root" container with that IP address, using the "pause" container and net=none; Docker creates the network namespace.

3. In that namespace (pod_root_id), create a virtual ethernet pair, configure routing rules, configure the metadata proxy iptables NAT, and configure traffic control for bandwidth.

4. Create the application container with --net=container:pod_root_id (sketched below).
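A minimal sketch of steps 2 and 4 above using plain Docker commands driven from Python. The container names and application image are assumptions for illustration, and the networking-driver work of step 3 is only indicated by a comment.

```python
import subprocess

def run(cmd):
    # Run a command and fail loudly on error.
    subprocess.run(cmd, check=True)

pod_root_id = "titus-pod-root-123"  # assumed name for illustration

# Step 2: launch the "pod root" (pause) container with no networking.
# Docker creates the network namespace that the Titus networking driver
# then populates with the ENI-backed veth, routes, NAT, and traffic control.
run(["docker", "run", "-d", "--name", pod_root_id,
     "--net=none", "gcr.io/google_containers/pause"])

# Step 3 happens here (networking driver): veth pair, routing rules,
# metadata-proxy iptables NAT, tc bandwidth limits -- omitted.

# Step 4: launch the application container inside the pod root's namespace,
# so it sees the routable VPC IP attached to the ENI.
run(["docker", "run", "-d", "--name", "app",
     "--net=container:" + pod_root_id,
     "app-image:latest"])  # assumed image
```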

Metadata Proxy

Container requests to the Amazon metadata service (169.254.169.254) leave via veth<id> and are redirected by an iptables NAT rule from 169.254.169.254:80 to the Titus metadata proxy on host_ip:9999. The proxy answers:

• What is my IP, instance id, hostname? - Return the Titus-assigned values

• What is my AMI, instance type, etc.? - Unknown

• Give me my role credentials - Assume role to the container's role and return credentials

• Give me anything else - Proxy through to the real metadata service
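A toy sketch of what such a proxy could look like, assuming a small HTTP server on the host at port 9999. This is not Titus's implementation; the identity values, credential handling, and paths are illustrative.

```python
import http.server
import urllib.request

REAL_METADATA = "http://169.254.169.254"   # reachable directly from the host
LISTEN_PORT = 9999
# Container traffic is redirected here by a NAT rule along the lines of:
# iptables -t nat -A PREROUTING -d 169.254.169.254 -p tcp --dport 80 \
#          -j DNAT --to-destination <host_ip>:9999

# Titus-assigned identity for this container (illustrative values).
CONTAINER_IDENTITY = {
    "/latest/meta-data/local-ipv4": "100.66.12.34",
    "/latest/meta-data/instance-id": "titus-task-0001",
    "/latest/meta-data/hostname": "titus-task-0001.example.internal",
}

class MetadataProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in CONTAINER_IDENTITY:
            # Identity questions: answer with Titus-assigned values.
            self._respond(200, CONTAINER_IDENTITY[self.path])
        elif self.path.startswith("/latest/meta-data/iam/security-credentials"):
            # Credentials: a real proxy would call STS AssumeRole for the
            # container's IAM role and return the temporary credentials.
            self._respond(200, '{"AccessKeyId": "...", "SecretAccessKey": "..."}')
        elif "ami-id" in self.path or "instance-type" in self.path:
            # Host details the container should not see.
            self._respond(404, "unknown")
        else:
            # Everything else: proxy through to the real metadata service.
            with urllib.request.urlopen(REAL_METADATA + self.path) as upstream:
                self._respond(upstream.status, upstream.read().decode())

    def _respond(self, code, body):
        self.send_response(code)
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    http.server.HTTPServer(("0.0.0.0", LISTEN_PORT), MetadataProxy).serve_forever()
```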

Putting it all together

[Diagram: a virtual machine host with multiple ENIs (ENI1 sg=A, ENI2 sg=X, ENI3 sg=Y,Z) and their IP addresses (a non-routable IP plus IP1, IP2, IP3). Four containers each pair an app container with a pod root and its own veth<id>, attached according to their security groups (non-routable IP with sg=A, sg=X, sg=X, sg=Y,Z). Linux policy based routing plus traffic control steer the traffic, and each container's 169.254.169.254 traffic is NAT'ed to the host metadata proxy.]

Additional AWS Integrations

• Access to live log files and logs rotated to S3

• Multi-tenant resource isolation (disk)

• Environmental context

• Automatic instance type selection

• Elastic scaling of underlying resource pool

Netflix Infrastructure Integration

• Spinnaker CI/CD

• Atlas telemetry

• Discovery/IPC

• Edda (and dependent systems)

• Healthcheck, system metrics pollers

• Chaos testing


Why? Single consistent cloud platform

[Diagram: within EC2/VPC, the AWS Auto Scaler manages virtual machines running service applications, while Titus Job Control manages containers running service applications and batch applications. All of them use the same cloud platform libraries (metrics, IPC, health) and the same supporting services: Edda, Eureka, and Atlas.]

Titus Spinnaker Integration

• Deploy based on new Docker registry tags

• Deployment strategies same as Auto Scaling groups

• IAM roles and security groups per container

• Basic resource requirements

• Easily see health check & service discovery status

Fenzo – The heart of Titus scheduling

Extensible library for scheduling frameworks

• Plugin-based scheduling objectives

• Bin packing, etc.

• Heterogeneous resources & tasks

• Cluster automatic scaling

• Multiple instance types

• Plugin-based constraints evaluator

• Resource affinity, task locality, etc.

• Single offer mode added in support of ECS

Fenzo scheduling strategy

For each task:

  On each host:

    Validate hard constraints

    Evaluate fitness and soft constraints (via plugins)

  Until fitness is "good enough", and a minimum number of hosts has been evaluated
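A rough Python rendering of that loop, to make the short-circuiting concrete. It is an illustration of the strategy, not Fenzo's API; the plugin callables and the "good enough" threshold are assumptions.

```python
def schedule(task, hosts, hard_constraints, fitness_plugins,
             good_enough=0.9, min_hosts_to_eval=10):
    """Pick a host for one task: stop early once the fitness is good
    enough and at least a minimum number of hosts has been evaluated."""
    best_host, best_fitness = None, -1.0
    evaluated = 0

    for host in hosts:
        # Hard constraints (e.g. resource availability, zone or affinity
        # requirements) must all pass, or the host is skipped entirely.
        if not all(check(task, host) for check in hard_constraints):
            continue

        # Fitness and soft constraints are plugin-based (bin packing,
        # task locality, zone balancing, ...); average their scores.
        fitness = sum(p(task, host) for p in fitness_plugins) / len(fitness_plugins)
        evaluated += 1

        if fitness > best_fitness:
            best_host, best_fitness = host, fitness

        # Short-circuit once the result is "good enough" and enough hosts
        # have been looked at.
        if best_fitness >= good_enough and evaluated >= min_hosts_to_eval:
            break

    return best_host
```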

Scheduling – Capacity Guarantees

Titus maintains …

• Critical tier - guaranteed capacity & start latencies

• Flex tier - more dynamic capacity & variable start latency

[Diagram: Titus Master with the Fenzo scheduler managing desired and max capacity for each tier]

Scheduling – Bin Packing, Elastic Scaling

User adds work tasks

• Titus does bin packing to ensure that we can downscale entire hosts efficiently

[Diagram: Titus Master with the Fenzo scheduler packing tasks between the min, desired, and max of the agent Auto Scaling group so that idle hosts can be terminated]

Scheduling – Constraints including zone balancing

User specifies constraints

• Availability Zone balancing

• Resource and task affinity

• Hard and soft

[Diagram: Titus Master with the Fenzo scheduler balancing tasks across Availability Zones A and B]

Scheduling – Rolling new Titus code

Operator updates the Titus agent codebase

• New scheduling on the new cluster

• Batch jobs drain

• Service tasks are migrated via Spinnaker pipelines

• Old cluster scales down

[Diagram: Titus Master with the Fenzo scheduler moving work from Auto Scaling group version 001 to version 002]

Current Service Usage

• Approach

• Started with internal applications

• Moved on to line-of-fire Node.js (shadow first, prod 1Q17)

• Moved on to stream processing (prod 4Q)

• Current - ~ 2000 long running containers

[Timeline: 1Q Batch, 2Q Service pre-prod, 3Q Service shadow, 4Q Service prod]

Collaboration with ECS

Why ECS?

• Decrease operational overhead of underlying cluster state management

• Allow open source collaboration on ECS agent

• Work with Amazon and others on EC2 enablement

• GPUs, VPC, security groups, IAM roles, etc.

• Over time, this enablement should result in less maintenance

Titus Today

[Diagram: the Titus Scheduler (with EC2 integration) talks to a Mesos master, which drives a mesos-agent on each container host; the Titus executor on the host runs the containers.]

Outbound

- Launch/terminate container

- Reconciliation

Inbound

- Container host events (and offers)

- Container events

First Titus ECS Implementation

[Diagram: the Titus Scheduler (with EC2 integration) talks to ECS, which drives the ECS agent on each container host; the Titus executor on the host runs the containers.]

Outbound

- Launch/terminate container

Polling for

- Container host events

- Container events

Collaboration with ECS team starts

• Collaboration on ECS "event stream" that could provide "real time" task & container instance state changes

• Event-based architecture more scalable than polling

• Great engineering collaboration

• Face to face focus

• Monthly interlocks

• Engineer to engineer focused

Current Titus ECS Implementation

[Diagram: the Titus Scheduler (with EC2 integration) talks to ECS; the ECS agent and Titus executor run the containers on each container host. Container events flow back through CloudWatch Events and SQS.]

Outbound

- Launch/terminate container

- Reconciliation

Inbound (via CloudWatch Events and SQS)

- Container host events

- Container events

Analysis - Periodic Reconciliation

For tasks in listTasks, call describeTasks in batches of 100 (see the sketch below)

Number of API calls: 1 + (number of tasks / 100) per reconcile

Example: 1,280 containers across 40 nodes
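A hedged boto3 sketch of that reconciliation pass (illustrative, not Titus's code; the cluster name is assumed). describeTasks accepts up to 100 task ARNs per call, which is where the call count above comes from.

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "titus-cluster"   # assumed cluster name for illustration

def reconcile():
    """List every task in the cluster, then describe them 100 at a time
    (the describe_tasks batch limit), comparing ECS state with local state."""
    task_arns = []
    for page in ecs.get_paginator("list_tasks").paginate(cluster=CLUSTER):
        task_arns.extend(page["taskArns"])

    for i in range(0, len(task_arns), 100):
        batch = task_arns[i:i + 100]
        resp = ecs.describe_tasks(cluster=CLUSTER, tasks=batch)
        for task in resp["tasks"]:
            # Compare ECS's view (lastStatus) with the scheduler's view here.
            print(task["taskArn"], task["lastStatus"])
```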

Analysis - Scheduling

• Number of API calls: 2X number of tasks

• registerTaskDefinition and startTask (sketched below)

• Largest Titus historical job

• 1000 tasks per minute

• Possible with increased rate limits
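For reference, a minimal boto3 sketch of that per-task call pair (illustrative; the family name, container definition, and container instance ARN are assumptions).

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "titus-cluster"   # assumed cluster name

def launch_task(image, cpu, memory_mb, container_instance_arn):
    """Two ECS API calls per task: register a task definition, then start
    the task on a specific container instance chosen by the scheduler."""
    task_def_arn = ecs.register_task_definition(
        family="titus-task",                 # assumed family name
        containerDefinitions=[{
            "name": "app",
            "image": image,
            "cpu": cpu,
            "memory": memory_mb,
        }],
    )["taskDefinition"]["taskDefinitionArn"]

    # start_task (unlike run_task) lets the caller choose the container
    # instance, which is what an external placement engine like Fenzo needs.
    return ecs.start_task(
        cluster=CLUSTER,
        taskDefinition=task_def_arn,
        containerInstances=[container_instance_arn],
    )
```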

Continued areas of scheduling collaboration

• Combining/batching registerTaskDefinition and startTask

• More resource types in the control plane

• Disk, network bandwidth, ENIs

• To fit with existing scheduler approach

• Extensible message fields in task state transitions

• Named tasks (beyond ARNs) for terminate

• Starting vs. started state

Possible phases of ECS support in Titus

• Work in progress

  • ECS completing the scheduling collaboration items

  • Complete transition to ECS for overall cluster management

  • Allows us to contribute Netflix cloud platform and EC2 integration points to the open source ECS agent

• Future

  • Provide Fenzo as the ECS task placement service

  • Extend Titus job management features to ECS

Titus Future Focus

Future Strategy of Titus

• Service automatic scaling and global traffic integration

• Service/batch SLA management

• Capacity guarantees, fair shares, and pre-emption

• Trough / Internal Spot market management

• Exposing pods to users

• More use cases and scale

Questions?

Andrew Spyker (@aspyker)

Thank you!

Remember to complete your evaluations!