
(BDT318) How Netflix Handles Up To 8 Million Events Per Second

Transcript
Page 1: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Peter Bakas, Director of Engineering, Event and Data Pipelines, Netflix

October 2015

BDT318

Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second

Page 2: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Peter Bakas

@ Netflix : Cloud Platform Engineering - Event and Data Pipelines

@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure

@ Yahoo : Display Advertising, Behavioral Targeting, Payments

@ PayPal : Site Engineering and Architecture

@ Play : Advisor to various Startups (Data, Security, Containers)

Who is this guy?

Page 3: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

What to Expect from the Session

• Architectural design and principles for Keystone

• Current state of technologies that Keystone is leveraging

• Best practices in operating Kafka and Samza

Page 4: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Why are we here?

Page 5: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Publish, Collect, Process, Aggregate & Move Data

Page 6: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

@ Cloud Scale

Page 7: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

• 550 billion events per day

• 8.5 million events & 21 GB per second during peak

• 1+ PB per day

• Hundreds of event types

By the numbers
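
A quick sanity check on these numbers: at peak, 21 GB/s ÷ 8.5 million events/s ≈ 2.5 KB per event, and 550 billion events/day at ~2 KB each ≈ 1.1 PB/day, consistent with the 1+ PB figure above.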

Page 8: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

How did we get here?

Page 9: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Chukwa

Page 10: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Chukwa/Suro + Real-Time

Page 11: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Chukwa/Suro + Real-Time

Page 12: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Now what?!!

Page 13: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone

Page 14: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone

Page 15: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Split Fronting Kafka Clusters

Normal-priority (majority)

• 2 copies, 12 hour retention

High-priority (streaming activities etc.)

• 3 copies, 24 hour retention
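
As an illustration of these two tiers (not Netflix's actual tooling), the sketch below creates one topic per tier with the replication factor and retention above, using the Kafka 0.8.x-era admin API; topic names and partition counts are hypothetical.

import java.util.Properties;

import org.I0Itec.zkclient.ZkClient;

import kafka.admin.AdminUtils;
import kafka.utils.ZKStringSerializer$;

public class FrontingTopics {
    public static void main(String[] args) {
        // ZKStringSerializer so topic metadata is written as plain strings.
        ZkClient zk = new ZkClient("zookeeper:2181", 10000, 10000,
                                   ZKStringSerializer$.MODULE$);

        // Normal priority: 2 copies, 12 hour retention.
        Properties normal = new Properties();
        normal.put("retention.ms", String.valueOf(12 * 60 * 60 * 1000L));
        AdminUtils.createTopic(zk, "events-normal", 32, 2, normal);

        // High priority: 3 copies, 24 hour retention.
        Properties high = new Properties();
        high.put("retention.ms", String.valueOf(24 * 60 * 60 * 1000L));
        AdminUtils.createTopic(zk, "events-high", 32, 3, high);

        zk.close();
    }
}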

Page 16: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Split Fronting Kafka Clusters

Instance type: d2.xlarge

• Large disk (6 TB) with 450-475 MB/s of measured sequential I/O throughput

• Large memory (30 GB)

• Medium network capability (~700 Mbps)

• Replication lag starts to show when inbound bytes exceed 18 MB/second per broker with thousands of partitions

Page 17: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

• PR submitted to Apache Kafka

• https://github.com/apache/kafka/pull/132

• https://issues.apache.org/jira/browse/KAFKA-1215

• Improved availability

• Reduced maintenance cost

Kafka Zone Aware Replica Assignment
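
The actual change is in the PR above; as a rough sketch of the idea (not the PR's code), zone-aware assignment spreads each partition's replicas across distinct availability zones so a single zone outage cannot take out all copies. The sketch assumes the replication factor does not exceed the number of zones (the common 2-3 copies across 3 AZs).

import java.util.*;

// Illustrative only: place each partition's replicas in distinct
// availability zones, rotating the starting zone per partition.
public class ZoneAwareAssignment {
    static List<Integer> assignReplicas(Map<Integer, String> brokerZone,
                                        int partition, int replicas) {
        // Group brokers by zone, e.g. {us-east-1a=[0,3], us-east-1b=[1,4], ...}.
        // TreeMap keeps zone order deterministic.
        Map<String, List<Integer>> byZone = new TreeMap<>();
        for (Map.Entry<Integer, String> e : brokerZone.entrySet())
            byZone.computeIfAbsent(e.getValue(), z -> new ArrayList<>())
                  .add(e.getKey());

        List<String> zones = new ArrayList<>(byZone.keySet());
        List<Integer> assignment = new ArrayList<>();
        for (int r = 0; r < replicas; r++) {
            // Rotate the starting zone by partition so leaders spread out.
            List<Integer> candidates =
                byZone.get(zones.get((partition + r) % zones.size()));
            assignment.add(candidates.get(partition % candidates.size()));
        }
        return assignment;
    }
}

With 3 copies across 3 zones, losing a zone still leaves two in-sync replicas, which is where the availability and maintenance-cost gains come from.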

Page 18: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone

Page 19: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Control Plane + Data Plane

• Control plane for the router is the job manager

• Infrastructure is the data plane

• Declarative, reconciliation driven

• Smart scheduling managing tradeoffs

• Auto scaling based on traffic

• Fault tolerance

  • Application (router) faults

  • AWS hardware faults
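
The job manager itself is internal to Netflix, but a declarative, reconciliation-driven control plane follows a well-known pattern; here is a minimal sketch (all names and the per-container throughput figure are hypothetical):

// Hypothetical sketch of a reconciliation loop: compare declared (desired)
// router capacity with what is actually running, and converge. Scaling
// follows traffic; failed containers simply fall out of the actual count
// and get replaced on the next pass.
public class RouterReconciler {
    interface Infrastructure {                  // the data plane (hypothetical)
        int runningContainers(String route);
        void launchContainer(String route);
        void killContainer(String route);
    }

    private final Infrastructure infra;
    RouterReconciler(Infrastructure infra) { this.infra = infra; }

    int desiredContainers(String route, double eventsPerSec) {
        // Auto scaling based on traffic: assume ~10k events/s per container.
        return Math.max(1, (int) Math.ceil(eventsPerSec / 10_000));
    }

    void reconcile(String route, double eventsPerSec) {
        int desired = desiredContainers(route, eventsPerSec);
        int actual = infra.runningContainers(route);
        for (int i = actual; i < desired; i++) infra.launchContainer(route);
        for (int i = desired; i < actual; i++) infra.killContainer(route);
    }
}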

Page 20: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone

Page 21: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Routing Service - Samza

Page 22: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Routing Service - Samza

Page 23: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Routing Service - Samza

Amazon S3 Routing

• ~5800 long-running containers for Amazon S3 routing

• ~500 c3.4xlarge AWS instances for Amazon S3 routing

Elasticsearch Routing

• ~850 long-running containers for Elasticsearch routing

• ~70 c3.4xlarge AWS instances for Elasticsearch routing

Kafka Routing

• ~3200 long-running containers for Kafka routing

• ~280 c3.4xlarge AWS instances for Kafka routing

Page 24: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Routing Service - Samza

Container Footprint:

• 2-5 GB memory

• 160 Mbps max network bandwidth

• 1 CPU share

• 20 GB disk for buffer & logs

• Processes 1-12 partitions

• Periodically reports health to infrastructure

Page 25: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Routing Service - Samza

Observed Numbers:

• Avg memory usage of ~1.8 GB per container

• Avg memory usage per node of ~20 GB (range: 7-25 GB)

• Avg CPU utilization of 8% per node

• Avg NetworkIn of 256 Mbps per node

• Avg NetworkOut of 156 Mbps per node

Page 26: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Routing Service - Samza

Publish to Amazon S3 sink:

• Every 200 MB or 2 minutes

• S3 average upload latency of 200 ms

Producer to Router latency:

• 30th percentile of topics under 500 ms

• 70th percentile of topics under 1 sec

• 90th percentile under 2 sec

• Overall average about 2.5 seconds

Kafka to Router consumer lag (est. time to catch up):

• 65th percentile under 500 ms

• 90th percentile under 5 seconds

Page 27: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

+ Alterations

• Internal build of Samza version 0.9.1

• Fixed SAMZA-41 in 0.9.1

  • Support static partition range assignment

• Added SAMZA-775 in 0.9.1

  • Prefetch buffer specified based on heap to use

• Backported SAMZA-655 to 0.9.1

  • Environment variable configuration rewriter (see the sketch below)

• Backported SAMZA-540 to 0.9.1

  • Expose latency-related metrics in OffsetManager

• Integration with Netflix alert & monitoring systems
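
Of these, the environment variable configuration rewriter (SAMZA-655) plugs into Samza's standard ConfigRewriter hook, enabled through the job.config.rewriters properties. A minimal version (not Netflix's internal build) might look like this, with the SAMZA_ prefix convention assumed for illustration:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.config.ConfigRewriter;
import org.apache.samza.config.MapConfig;

// Minimal sketch of an environment-variable config rewriter:
// any SAMZA_FOO_BAR env var overrides the foo.bar config key.
public class EnvConfigRewriter implements ConfigRewriter {
    @Override
    public Config rewrite(String name, Config config) {
        Map<String, String> merged = new HashMap<>(config);
        for (Map.Entry<String, String> e : System.getenv().entrySet()) {
            if (e.getKey().startsWith("SAMZA_")) {
                String key = e.getKey().substring("SAMZA_".length())
                              .toLowerCase().replace('_', '.');
                merged.put(key, e.getValue());  // env var wins over file config
            }
        }
        return new MapConfig(merged);
    }
}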

Page 28: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone

Page 29: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Real-time processing

Page 30: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Real-time processing

Page 31: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Real-time processing

Page 32: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Real-time processing

Page 33: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

• Streaming jobs to analyze movie plays, A/B tests, etc.

• Direct API for Kafka in 1.3

• Observed 2x performance improvement compared to 1.2

• Additional improvement possible with prefetching and connection pooling (not available yet)

• Campaigned for backpressure support

• Result: the Spark 1.5 release has community-developed backpressure support (SPARK-7398)
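
The direct (receiver-less) Kafka API from Spark 1.3 is used roughly as below; broker addresses and the topic name are placeholders:

import java.util.HashMap;
import java.util.HashSet;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Minimal sketch of Spark Streaming's direct Kafka API (Spark 1.3+).
public class MoviePlayStream {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("movie-plays");
        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(10));

        HashMap<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");

        HashSet<String> topics = new HashSet<>();
        topics.add("movie-plays");  // hypothetical topic name

        JavaPairInputDStream<String, String> events = KafkaUtils.createDirectStream(
            jssc, String.class, String.class,
            StringDecoder.class, StringDecoder.class, kafkaParams, topics);

        events.count().print();  // e.g. events per 10 s batch
        jssc.start();
        jssc.awaitTermination();
    }
}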

Page 34: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Great. How do I use it?

Page 35: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Annotation-based event definition

@Resource(type = ConsumerStorageType.DB, name = "S3Diagnostics")
public class S3Diagnostics implements Annotatable {
    ....
}

S3Diagnostics s3Diagnostics = new S3Diagnostics();
....
LogManager.logEvent(s3Diagnostics); // log this diagnostic event

Java

Page 36: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

{
  "eventName": "ksproxytest",
  "payload": {
    "k1": "v1",
    "k2": "v2"
  }
}

Non-Java: Keystone Proxy
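
A non-Java client simply sends that JSON to the proxy over HTTP; the sketch below illustrates the shape of the call, with a hypothetical endpoint URL since the real proxy address is internal to Netflix:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative only: POST the event JSON above to the Keystone proxy.
// The endpoint URL here is hypothetical.
public class ProxyPublish {
    public static void main(String[] args) throws Exception {
        String json = "{ \"eventName\": \"ksproxytest\","
                    + "  \"payload\": { \"k1\": \"v1\", \"k2\": \"v2\" } }";
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://ksproxy.example.com/event").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("proxy responded " + conn.getResponseCode());
    }
}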

Page 37: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Wire format

• Extensible

• Currently supports JSON

• Will support Avro

• Encapsulated as a shareable jar

• Immutable message through the pipeline
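
As a sketch of what an immutable message implies in code (field names hypothetical), the envelope is a final-field value object that routers pass through without modification:

// Hypothetical sketch of an immutable wire-format envelope: all fields are
// final and set once at produce time; routers never mutate it in flight.
public final class KeystoneEnvelope {
    private final String eventName;
    private final byte[] payload;     // JSON today, Avro later
    private final String contentType; // e.g. "application/json"

    public KeystoneEnvelope(String eventName, byte[] payload, String contentType) {
        this.eventName = eventName;
        this.payload = payload.clone();  // defensive copy preserves immutability
        this.contentType = contentType;
    }

    public String eventName()   { return eventName; }
    public byte[] payload()     { return payload.clone(); }
    public String contentType() { return contentType; }
}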

Page 38: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Producer Resilience

• An outage should never disrupt existing instances from serving their business purpose

• An outage should never prevent new instances from starting up

• After service is restored, event producing should resume automatically

Page 39: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Fail, but never block

Set block.on.buffer.full=false

Handle the potentially blocking first metadata request
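
In producer code (0.8.2-era Java client), the "fail, but never block" policy looks roughly like this sketch; the topic name is a placeholder:

import java.util.Properties;

import org.apache.kafka.clients.producer.BufferExhaustedException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a producer that drops events rather than block the caller.
public class ResilientProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Fail fast instead of blocking when the send buffer fills up.
        props.put("block.on.buffer.full", "false");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            // Asynchronous send: delivery failures surface in the callback.
            producer.send(
                new ProducerRecord<>("events-normal", "{\"k1\":\"v1\"}"),
                (metadata, exception) -> {
                    if (exception != null)
                        System.err.println("event dropped: " + exception);
                });
        } catch (BufferExhaustedException e) {
            // Buffer full: drop the event; never disrupt serving traffic.
        }
        // Note: the very first send can still block on the initial metadata
        // fetch, so pre-warm or isolate that call as the slide warns.
        producer.close();
    }
}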

Page 40: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Trust me, it works!

Page 41: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone Dashboard

Page 42: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone Dashboard

Page 43: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Keystone Dashboard

Page 44: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Trust, but verify!

Page 45: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

• Broker monitoring

  • Alert on offline brokers from ZooKeeper

• Consumer monitoring

  • Alert on consumer lag, stuck consumers, and unconsumed partitions

• Heart-beating

  • Produce/consume messages and measure latency (see the sketch below)

• Broker performance testing

  • Produce tens of thousands of messages per second on a single instance

  • Create multiple consumer groups to test consumer impact on brokers

Auditor
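
The heart-beating check above can be sketched as a canary consumer reading messages that carry their produce timestamp; this uses the 0.9 consumer API for brevity (the actual auditor is internal), and the topic name is hypothetical:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of the heart-beat idea: the producer side writes
// System.currentTimeMillis() as the message value; this side reports
// end-to-end latency. Runs forever as a canary.
public class HeartbeatConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "auditor-heartbeat");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("heartbeat"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> r : records) {
                    long sentAt = Long.parseLong(r.value());
                    System.out.println("heartbeat latency ms: "
                        + (System.currentTimeMillis() - sentAt));
                }
            }
        }
    }
}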

Page 46: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Auditor - Broker Monitoring

Page 47: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Auditor - Consumer Monitoring

[Dashboard panels: Consumer Offset, Consumer Lag, Stuck Consumer, Unconsumed Partitions]

Page 48: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Meet Winston

Page 49: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

New Internal Automation Engine:

• Collect diagnostic information based on alerts & other operational events

• Help services self-heal

• Reduce MTTR

• Reduce pager fatigue

• Improve developer productivity

Winston

Page 50: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Winston

Page 51: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

How do you like your Kaffee?

Page 52: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Kaffee

Page 53: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Kaffee

Page 54: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Kaffee

Page 55: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

What’s next?

Page 56: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

• Performance tuning + optimizations

• Self service

• Schemas + registry

• Event discovery + visualization

• Open Source Auditor/Kaffee

Near Term

Page 57: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

And then???

Page 58: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Global real-time data stream + stream processing network

Page 59: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Office Hours: Wed 4:00 PM – 5:30 PM

@ Booth

[email protected]

@peter_bakas

Page 60: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Remember to complete your evaluations!

Page 61: (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Thank you!

