© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Peter Bakas, Director of Engineering, Event and Data Pipelines, Netflix
October 2015
BDT318
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second
Peter Bakas
@ Netflix : Cloud Platform Engineering - Event and Data Pipelines
@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure
@ Yahoo : Display Advertising, Behavioral Targeting, Payments
@ PayPal : Site Engineering and Architecture
@ Play : Advisor to various Startups (Data, Security, Containers)
Who is this guy?
What to Expect from the Session
• Architectural design and principles for Keystone
• Current state of technologies that Keystone is leveraging
• Best practices in operating Kafka and Samza
Why are we here?
Publish, Collect, Process, Aggregate & Move Data
@ Cloud Scale
• 550 billion events per day
• 8.5 million events & 21 GB per second during peak
• 1+ PB per day
• Hundreds of event types
By the numbers
How did we get here?
Chukwa
Chukwa/Suro + Real-Time
Now what?!!
Keystone
Split Fronting Kafka Clusters
Normal-priority (majority)
• 2 copies, 12 hour retention
High-priority (streaming activities etc.)
• 3 copies, 24 hour retention
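The two tiers above map naturally onto per-topic Kafka settings. A minimal sketch of that mapping, using only the JDK — `retention.ms` is a standard Kafka topic-level config, while the replication factor is set at topic creation time; the class and field names here are illustrative, not Netflix's:

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

// Sketch of the two fronting-cluster tiers as per-topic settings.
class FrontingTierConfig {
    static Properties tier(int replicas, long retentionHours) {
        Properties p = new Properties();
        // Replication factor is a topic-creation parameter in Kafka.
        p.setProperty("replication.factor", Integer.toString(replicas));
        // retention.ms is a real Kafka topic-level config, in milliseconds.
        p.setProperty("retention.ms",
                Long.toString(TimeUnit.HOURS.toMillis(retentionHours)));
        return p;
    }

    static final Properties NORMAL_PRIORITY = tier(2, 12); // majority of topics
    static final Properties HIGH_PRIORITY   = tier(3, 24); // streaming activities etc.
}
```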
Split Fronting Kafka Clusters
Instance type: d2.xlarge (D2XL)
• Large disk (6 TB) with 450-475 MB/s measured sequential I/O throughput
• Large memory (30 GB)
• Medium network capability (~700 Mbps)
• Replication lag starts to show when inbound traffic exceeds 18 MB/second per broker with thousands of partitions
• A PR is available against Apache Kafka
• https://github.com/apache/kafka/pull/132
• https://issues.apache.org/jira/browse/KAFKA-1215
• Improved availability
• Reduced cost of maintenance
Kafka Zone Aware Replica Assignment
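The idea behind the zone-aware assignment can be sketched in plain Java. This is an illustrative toy, not the actual Kafka PR: each partition's replicas are spread across distinct availability zones (assuming the replication factor does not exceed the zone count), so losing one zone leaves at least one replica alive; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

// Toy sketch of zone-aware replica placement.
class ZoneAwareAssigner {
    // brokersByZone: zone name -> broker ids in that zone.
    static List<Integer> replicasFor(int partition, int replicationFactor,
                                     SortedMap<String, List<Integer>> brokersByZone) {
        List<String> zones = new ArrayList<>(brokersByZone.keySet());
        List<Integer> replicas = new ArrayList<>();
        for (int r = 0; r < replicationFactor; r++) {
            // Rotate the starting zone by partition so leaders spread evenly;
            // consecutive replicas land in consecutive (distinct) zones.
            String zone = zones.get((partition + r) % zones.size());
            List<Integer> brokers = brokersByZone.get(zone);
            replicas.add(brokers.get(partition % brokers.size()));
        }
        return replicas;
    }
}
```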
Keystone
Control Plane + Data Plane
• The job manager is the control plane for the router
• The infrastructure is the data plane
• Declarative, reconciliation driven
• Smart scheduling managing tradeoffs
• Auto Scaling based on traffic
• Fault tolerance
• Application (router) faults
• AWS hardware faults
Keystone
Routing Service - Samza
Amazon S3 Routing
• ~5800 long-running containers for Amazon S3 routing
• ~500 c3.4xlarge AWS instances for Amazon S3 routing
Elasticsearch Routing
• ~850 long-running containers for Elasticsearch routing
• ~70 c3.4xlarge AWS instances for Elasticsearch routing
Kafka Routing
• ~3200 long-running containers for Kafka routing
• ~280 c3.4xlarge AWS instances for Kafka routing
Routing Service - Samza
Container Footprint:
• 2-5 GB memory
• 160 Mbps max network bandwidth
• 1 CPU share
• 20 GB disk for buffer & logs
• Processes 1-12 partitions
• Periodically reports health to infrastructure
Routing Service - Samza
Observed Numbers:
• Avg memory usage of ~1.8 GB per container
• Avg memory usage of ~20 GB per node (range: 7-25 GB)
• Avg CPU utilization of 8% per node
• Avg NetworkIn 256 Mbps per node
• Avg NetworkOut 156 Mbps per node
Routing Service - Samza
Publish to Amazon S3 sink:
• Every 200 MB or 2 minutes
• Average S3 upload latency: 200 ms
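The size-or-time upload rule above is easy to state precisely. A minimal sketch, with illustrative names and the thresholds from the slide: flush a buffered chunk to the sink when it reaches 200 MB or when 2 minutes have elapsed, whichever comes first.

```java
// Sketch of the S3 sink's size-or-time flush rule.
class FlushPolicy {
    static final long MAX_BYTES = 200L * 1024 * 1024; // 200 MB
    static final long MAX_AGE_MS = 2 * 60 * 1000;     // 2 minutes

    // Flush when either threshold is hit, whichever comes first.
    static boolean shouldFlush(long bufferedBytes, long msSinceLastFlush) {
        return bufferedBytes >= MAX_BYTES || msSinceLastFlush >= MAX_AGE_MS;
    }
}
```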
Producer to Router latency:
• 30th percentile of topics: under 500 ms
• 70th percentile: under 1 sec
• 90th percentile: under 2 sec
• Overall average: about 2.5 seconds
Kafka to Router consumer lag (est. time to catch up):
• 65th percentile: under 500 ms
• 90th percentile: under 5 seconds
+ Alterations
• Internal build of Samza version 0.9.1
• Fixed SAMZA-41 in 0.9.1
• Support static partition range assignment
• Added SAMZA-775 in 0.9.1
• Prefetch buffer specified based on heap to use
• Backported SAMZA-655 to 0.9.1
• Environment variable configuration rewriter
• Backported SAMZA-540 to version 0.9.1
• Expose latency related metrics in OffsetManager
• Integration with Netflix Alert & Monitoring systems
Keystone
Real-time processing
• Streaming jobs to analyze movie plays, A/B tests, etc.
• Direct Kafka API for Spark Streaming, available since Spark 1.3
• Observed 2x performance improvement compared to Spark 1.2
• Additional improvement possible with prefetching and connection pooling (not available yet)
• Campaigned for backpressure support
• Result: the Spark 1.5 release has community-developed backpressure support (SPARK-7398)
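For reference, the Spark 1.5 backpressure feature (SPARK-7398) is enabled with a single configuration flag. Shown below as plain `Properties` rather than a `SparkConf`, to keep the sketch dependency-free; only the config key is real, the class name is illustrative:

```java
import java.util.Properties;

// Sketch: enabling Spark Streaming backpressure (available since Spark 1.5).
class StreamingBackpressure {
    static Properties conf() {
        Properties p = new Properties();
        // Real Spark config key; lets the receiver rate adapt to processing speed.
        p.setProperty("spark.streaming.backpressure.enabled", "true");
        return p;
    }
}
```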
Great. How do I use it?
Annotation-based event definition
@Resource(type = ConsumerStorageType.DB, name = "S3Diagnostics")
public class S3Diagnostics implements Annotatable {
    ....
}

S3Diagnostics s3Diagnostics = new S3Diagnostics();
....
LogManager.logEvent(s3Diagnostics); // log this diagnostic event
Java
{
"eventName" : "ksproxytest",
"payload" : {
"k1" : "v1",
"k2" : "v2"
}
}
Non-Java : Keystone Proxy
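A minimal sketch of assembling the proxy payload shown above using only the JDK; a real non-Java client would typically use a JSON library and POST this to the Keystone proxy (the endpoint details are not part of this deck, and the class name here is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build the {"eventName": ..., "payload": {...}} wire format.
class ProxyEvent {
    static String toJson(String eventName, Map<String, String> payload) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"eventName\":\"").append(eventName).append("\",\"payload\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : payload.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append("}}").toString();
    }
}
```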
Wire format
• Extensible
• Currently supports JSON
• Will support Avro
• Encapsulated as a shareable jar
• Immutable message through the pipeline
Producer Resilience
• An outage should never disrupt existing instances from serving their business purpose
• An outage should never prevent new instances from starting up
• After service is restored, event producing should resume automatically
Fail, but never block
block.on.buffer.full=false
Handle potential blocking of the first metadata request
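These two points can be sketched as producer settings, shown as plain `Properties` to stay dependency-free. `block.on.buffer.full` and `metadata.fetch.timeout.ms` are real 0.8.x/0.9-era Kafka producer configs; with `block.on.buffer.full=false` the producer drops events (raising an exception) instead of stalling the caller when the local buffer fills up. The broker address and timeout value are illustrative:

```java
import java.util.Properties;

// Sketch of "fail, but never block" producer settings.
class ResilientProducerConfig {
    static Properties props(String brokers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", brokers);
        // Drop rather than block when the client-side buffer is full.
        p.setProperty("block.on.buffer.full", "false");
        // Bound the first metadata fetch so startup can't hang indefinitely.
        p.setProperty("metadata.fetch.timeout.ms", "5000");
        return p;
    }
}
```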
Trust me, it works!
Keystone Dashboard
Trust, but verify!
• Broker monitoring
• Alert on offline brokers reported by ZooKeeper
• Consumer monitoring
• Alert on consumer lag, stuck consumers, and unconsumed partitions
• Heart-beating
• Produce/consume messages and measure end-to-end latency
• Broker performance testing
• Produce tens of thousands of messages per second on a single instance
• Create multiple consumer groups to test consumer impact on the broker
Auditor
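The consumer-side checks above reduce to simple offset arithmetic: per-partition lag is the log-end offset minus the committed offset, and a partition with no committed offset at all is "unconsumed". A minimal sketch with an illustrative threshold and hypothetical names:

```java
// Sketch of the auditor's consumer-lag check.
class ConsumerLagCheck {
    // Lag = how far the committed offset trails the log-end offset.
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    // Alert on unconsumed partitions (no committed offset) or excessive lag.
    static boolean shouldAlert(long logEndOffset, Long committedOffset, long threshold) {
        if (committedOffset == null) return true; // unconsumed partition
        return lag(logEndOffset, committedOffset) > threshold;
    }
}
```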
Auditor - Broker Monitoring
Auditor - Consumer Monitoring
(dashboard charts: consumer offset, consumer lag, stuck consumers, unconsumed partitions)
Meet Winston
New Internal Automation Engine:
• Collect diagnostic information based on alerts & other operational events
• Help services self-heal
• Reduce MTTR
• Reduce pager fatigue
• Improve developer productivity
Winston
How do you like your Kaffee?
Kaffee
What’s next?
• Performance tuning + optimizations
• Self service
• Schemas + registry
• Event discovery + visualization
• Open Source Auditor/Kaffee
Near Term
And then???
Global real-time data stream + stream processing network
Remember to complete your evaluations!
Thank you!