Apache Kafka at LinkedIn

Posted on 19-Aug-2014

description

Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers both how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

transcript

Jay Kreps
Introduction to Apache Kafka

The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing

Apache Kafka

A brief history of Apache Kafka

Characteristics
• Scalability of a filesystem
  – Hundreds of MB/sec/server throughput
  – Many TB per server
• Guarantees of a database
  – Messages strictly ordered
  – All data persistent
• Distributed by default
  – Replication
  – Partitioning model

Kafka is about logs

What is a log?

Logs: pub/sub done right
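The log idea behind these slides can be sketched in a few lines. This is a toy in-memory model, not Kafka's implementation: the `Log` class, its `append` and `read` methods, and the subscriber names are all hypothetical and exist only for illustration. It shows the two properties the slides rely on: records are appended in a strict order and identified by offset, and each subscriber reads independently from its own position.

```python
class Log:
    """Toy append-only log (hypothetical, for illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The offset of a record is simply its position in the sequence.
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, offset, max_records=10):
        # Reading never removes data, so many subscribers can share one log.
        return self._records[offset:offset + max_records]


log = Log()
for event in ["page_view", "click", "search"]:
    log.append(event)

# Each subscriber tracks only its own position in the log.
positions = {"hadoop_loader": 0, "monitoring": 2}
batches = {name: log.read(pos) for name, pos in positions.items()}
```

Because a reader's state is just an offset, a slow subscriber does not hold back a fast one, which is one way to read "pub/sub done right".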

Partitioning

Nodes Host Many Partitions

Producers Balance Load

Consumers Divide Up Partitions
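A rough sketch of the two ideas on the preceding slides: a producer picks a partition for each message, and the consumers in a group split the partitions among themselves. The function names, the CRC32 hash, and the round-robin assignment are assumptions chosen for illustration; Kafka's actual partitioner and assignment strategies differ in detail.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition (illustrative hash, not Kafka's)."""
    # The same key always lands in the same partition, which preserves
    # per-key ordering.
    return zlib.crc32(key) % num_partitions


def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin (illustrative only)."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partition = choose_partition(b"user-42", 8)
assignment = assign_partitions(range(6), ["c1", "c2"])
```

In this sketch each partition is owned by exactly one consumer, so adding consumers (up to the number of partitions) spreads the read load.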

End-to-End

Kafka at LinkedIn
• 175 TB of in-flight log data per colo
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration

Performance
• Producer (3x replication):
  – Async: 786,980 records/sec (75.1 MB/sec)
  – Sync: 421,823 records/sec (40.2 MB/sec)
• Consumer:
  – 940,521 records/sec (89.7 MB/sec)
• End-to-end latency:
  – 2 ms (median)
  – 14 ms (99.9th percentile)
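The throughput figures above can be cross-checked with a little arithmetic. Dividing bytes per second by records per second gives the implied record size. The slide does not state the record size or how "MB" is defined; the sketch below assumes 1 MB = 2^20 bytes, under which all three figures work out to roughly 100 bytes per record.

```python
MB = 1024 * 1024  # assumption: the slide's "MB" means 2**20 bytes


def bytes_per_record(records_per_sec, mb_per_sec):
    """Implied average record size from a throughput pair."""
    return mb_per_sec * MB / records_per_sec


async_size = bytes_per_record(786_980, 75.1)
sync_size = bytes_per_record(421_823, 40.2)
consumer_size = bytes_per_record(940_521, 89.7)
```

Small records like these make the records/sec figures a stress test of per-message overhead rather than of raw disk or network bandwidth.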

The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing

Data Integration

Maslow’s Hierarchy for Data

New Types of Data
• Database data
  – Users, products, orders, etc.
• Events
  – Clicks, Impressions, Pageviews, etc.
• Application metrics
  – CPU usage, requests/sec
• Application logs
  – Service calls, errors

New Types of Systems
• Live Stores
  – Voldemort
  – Espresso
  – Graph
  – OLAP
  – Search
  – InGraphs
• Offline
  – Hadoop
  – Teradata

Bad

Good

Example: User views job

Comparing Data Transfer Mechanisms

The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing

Stream Processing

Stream processing is a generalization of batch processing

Stream Processing = Logs + Jobs
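One way to read "Logs + Jobs" is that a job is a function that reads records from an input log, starting at some offset, and appends results to an output log. The sketch below uses plain Python lists as stand-ins for logs; `run_job`, `keep_job_views`, and the event fields are hypothetical names for illustration, not the Samza API.

```python
def run_job(input_log, output_log, process, start_offset=0):
    """Apply `process` to every record from start_offset; return the next offset."""
    offset = start_offset
    while offset < len(input_log):
        result = process(input_log[offset])
        if result is not None:
            output_log.append(result)
        offset += 1
    return offset


def keep_job_views(event):
    # Hypothetical filter: keep only views of job pages.
    return event if event["page"].startswith("/jobs/") else None


page_views = [
    {"user": "a", "page": "/jobs/1"},
    {"user": "b", "page": "/home"},
    {"user": "a", "page": "/jobs/2"},
]
job_views = []

# Running over a fixed input is a batch job...
next_offset = run_job(page_views, job_views, keep_job_views)

# ...and resuming from the saved offset as new records arrive is streaming.
page_views.append({"user": "c", "page": "/jobs/3"})
next_offset = run_job(page_views, job_views, keep_job_views, next_offset)
```

The same function handles both cases, which illustrates the sense in which stream processing generalizes batch processing: a batch is just a log whose end has been reached.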

Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL

Frameworks Can Help

Samza Architecture

Log-centric Architecture

Kafka: http://kafka.apache.org

Samza: http://samza.incubator.apache.org

Log Blog: http://linkd.in/199iMwY

Benchmark: http://t.co/40fkKJvanx

Me: http://www.linkedin.com/in/jaykreps

@jaykreps