Date post: | 16-Apr-2017 |
Category: | Software |
Upload: | otavio-carvalho |
Apache Kafka
● Apache Kafka is a distributed messaging system
  ○ Provides fast, highly scalable and redundant messaging through a pub-sub model
● It was built at LinkedIn to serve as a central hub for all messaging between their systems
● Focus on scalability and fault tolerance
Motivation
● Microservices
  ○ "In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery." - Martin Fowler
● Monolith First
  ○ Using microservices as a way to decompose monolithic infrastructures
● Message Queues
  ○ Asynchronous processing
  ○ Decoupling
  ○ Load balancing
  ○ Scalability
How is it different?
● High throughput
  ○ Millions of events per second per node
● Fault-tolerance guarantees
  ○ Relies on Apache ZooKeeper for detection of node failures and leader election
  ○ Maintains a structure called the ISR (In-Sync Replica set) in order to tolerate node failures
  ○ (Claims to) tolerate up to f failures with f+1 replicas without losing data
● Distributed
  ○ More nodes can be added while the system keeps its high performance and fault tolerance
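The f+1 claim above can be illustrated with a toy model. This is a minimal sketch, not Kafka's implementation: the `Partition` class, its methods and the replica names are hypothetical, and it assumes a write counts as committed only once every replica currently in the ISR holds it.

```python
class Partition:
    """Toy model of one replicated partition (hypothetical, not Kafka code)."""

    def __init__(self, replicas):
        # Every replica starts in sync with an empty log.
        self.logs = {r: [] for r in replicas}
        self.isr = set(replicas)
        self.leader = replicas[0]

    def write(self, msg):
        # Assumption: a write is committed once all ISR members hold it.
        for r in self.isr:
            self.logs[r].append(msg)

    def fail(self, replica):
        # A failed replica leaves the ISR; if it was the leader,
        # a surviving in-sync replica takes over.
        self.isr.discard(replica)
        if replica == self.leader:
            self.leader = min(self.isr) if self.isr else None

    def committed(self):
        return list(self.logs[self.leader]) if self.leader else []


p = Partition(["b1", "b2", "b3"])   # f + 1 = 3 replicas, so f = 2
p.write("m1")
p.fail("b1")                        # first failure (the leader)
p.write("m2")
p.fail("b2")                        # second failure
print(p.leader, p.committed())      # b3 ['m1', 'm2']
```

With three replicas, two failures still leave one in-sync replica holding every committed message. The guarantee only covers replicas that were in the ISR when the write was committed.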
Comparison with AMQP
● Broker-centric (AMQP)
  ○ AMQP implementations are usually broker-centric
  ○ Focus on delivery guarantees between producers and consumers
  ○ Transient messages preferred over durable ones
  ○ The broker itself maintains the state of what has been consumed (via message acknowledgements)
● Producer-centric (Kafka)
  ○ Partitions a fire hose of event data into durable message brokers with cursors (pointers)
  ○ Supports batch consumers that may be offline, as well as online consumers that want messages at low latency
  ○ Has no message acknowledgements; it assumes the consumer tracks what has been consumed so far
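The consumer-side cursor in this comparison can be sketched in a few lines. This is an illustrative in-memory model only: `Log` and `Consumer` are hypothetical names, not Kafka client classes, and the log here stands in for a single partition.

```python
class Log:
    """Append-only log standing in for one topic partition (hypothetical)."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1   # offset of the new message

    def read(self, offset, max_msgs=10):
        # The log keeps no per-consumer state; the caller supplies the offset.
        return self.messages[offset:offset + max_msgs]


class Consumer:
    """Tracks its own position instead of acknowledging messages."""

    def __init__(self, log):
        self.log = log
        self.offset = 0                 # cursor owned by the consumer

    def poll(self, max_msgs=10):
        batch = self.log.read(self.offset, max_msgs)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(5):
    log.append(f"event-{i}")

online = Consumer(log)       # reads as messages arrive
batch_job = Consumer(log)    # "offline" consumer that catches up later

first = online.poll(max_msgs=2)
log.append("event-5")
rest = online.poll()
catch_up = batch_job.poll()
print(first, len(rest), len(catch_up))
```

Because the log never deletes a message on delivery, the batch consumer can start late and still read everything from offset 0, while the online consumer simply continues from its own cursor.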
Kafka Terminology
● Producers
  ○ Processes that publish messages to topics
● Consumers
  ○ Processes that read messages from topics
● Topic
  ○ Name of the feed to which messages are published
● Broker
  ○ Process running on a single machine
● Cluster
  ○ Group of brokers working together
Kafka Terminology
● Partitions
  ○ Subdivision of topics
    ■ Scalability
    ■ Load balancing
  ○ Consumers control their own offsets
● Replication
  ○ In-Sync Replica (ISR) sets
Kafka Terminology
Figure 1. A Kafka cluster with 4 brokers, 1 topic and 2 partitions, each with 3 replicas
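The layout in Figure 1 can be approximated in code. The sketch below is an assumption made for illustration: the function `assign_replicas`, the broker names and the round-robin placement are hypothetical and are not taken from the figure or from Kafka's actual assignment logic.

```python
def assign_replicas(brokers, partitions, replication_factor):
    """Place each partition's replicas on consecutive brokers (round-robin).

    Hypothetical helper: the first broker in each list is treated as leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed broker count")
    layout = {}
    for p in range(partitions):
        layout[p] = [brokers[(p + i) % len(brokers)]
                     for i in range(replication_factor)]
    return layout


layout = assign_replicas(
    ["broker-1", "broker-2", "broker-3", "broker-4"],
    partitions=2,
    replication_factor=3,
)
for partition, replicas in layout.items():
    print(f"partition {partition}: leader={replicas[0]}, followers={replicas[1:]}")
```

With 4 brokers, 2 partitions and 3 replicas each, every broker holds at least one replica and no broker holds two copies of the same partition.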
Use Cases
● Messaging
● Distributed log / Log aggregation
● Change Data Capture
● Stream Processing / Event Sourcing
Use Cases - Messaging
● Messaging
  ○ Simple queueing
    ■ e.g. a queue for sending e-mails
  ○ Tracking user events
  ○ Near real-time metrics
Use Cases - Distributed Log
● Distributed log / Log aggregation
  ○ LinkedIn usage
    ■ The whole platform is built around a central log
    ■ 13 million messages/sec, 15 gigabytes per sec
    ■ Over 1100 brokers in more than 60 clusters
Use Cases - Change Data Capture
Use Cases - Stream Processing
● Stream Processing / Event Sourcing
[Figures: LinkedIn's example; Netflix's example]
DEMO
ISSUES
Issues
● CAP theorem (Consistency, Availability, Partition tolerance)
  ○ "You can't sacrifice partition tolerance"
● Jepsen tests (@aphyr)
  ○ To force a failure in Kafka, the test shrinks the ISR (In-Sync Replica set) to a single node (the leader) and then loses that node
    ■ This triggers a leader election, and a new leader is elected
      ● Kafka then loses ~50% of the writes made during the partition
    ■ Kafka users usually set a replication factor of 2 or 3 for each partition of a given topic
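The failure sequence described above can be replayed with a toy simulation. This is a sketch under stated assumptions, not a reproduction of the Jepsen test: the `Replica` class, the `run_scenario` function and the node names are hypothetical, and the fraction of lost writes is determined entirely by the parameters chosen here.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []


def run_scenario(total_writes=10, partition_starts_at=5):
    leader, f1, f2 = Replica("n1"), Replica("n2"), Replica("n3")
    isr = [leader, f1, f2]
    acknowledged = []

    for i in range(total_writes):
        if i == partition_starts_at:
            # Followers become unreachable, so the ISR shrinks to the leader.
            isr = [leader]
        for replica in isr:
            replica.log.append(i)
        acknowledged.append(i)      # the producer sees every write succeed

    # The leader is now lost and a stale follower is elected in its place.
    new_leader = f1
    lost = [w for w in acknowledged if w not in new_leader.log]
    return acknowledged, new_leader.log, lost


acked, surviving, lost = run_scenario()
print(f"acknowledged={len(acked)} surviving={len(surviving)} lost={lost}")
# acknowledged=10 surviving=5 lost=[5, 6, 7, 8, 9]
```

Every write accepted while the ISR contained only the old leader disappears once a follower that never saw those writes becomes leader.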
Links
● https://aphyr.com/posts/315-jepsen-rabbitmq
● https://aphyr.com/posts/293-jepsen-kafka
● https://thoughtworks.jiveon.com/people/tbartlet/blog/2015/11/02/project-metamorphosis-with-kafka-spark
● https://thoughtworks.jiveon.com/message/1013489
● https://medium.com/@ikem/event-sourcing-and-cqrs-a-look-at-kafka-e0c1b90d17d8#.x4f9ezrwn
● https://martin.kleppmann.com/2016/01/29/event-sourcing-stream-processing-at-ddd-europe.html
● http://microservices.io/patterns/microservices.html
● http://martinfowler.com/articles/microservices.html
● https://engineering.linkedin.com/kafka/running-kafka-scale
● https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
● http://martinfowler.com/bliki/MonolithFirst.html
● https://www.oreilly.com/learning/making-sense-of-stream-processing/page/3/integrating-databases-and-kafka-with-change-data-capture
● http://kafka.apache.org/documentation.html
● https://github.com/toddpalino/kafkafromscratch/blob/master/Apache%20Kafka%20from%20Scratch.pdf
● http://www.javaworld.com/article/3060078/big-data/big-data-messaging-with-kafka-part-1.html
● https://sookocheff.com/post/kafka/kafka-in-a-nutshell/
Use Cases - Change Data Capture
● Log compaction
  ○ Kafka + Kafka Connect
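The idea behind log compaction is that only the most recent record per key needs to be retained, which is what makes a change log usable as a table snapshot. A minimal sketch, assuming records are simple `(key, value)` pairs; the `compact` function and the sample data are hypothetical, and real compaction details such as delete markers are not modelled.

```python
def compact(log):
    """Keep only the latest record per key, preserving log order."""
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset            # later offsets overwrite earlier ones
    return [(key, value)
            for offset, (key, value) in enumerate(log)
            if latest[key] == offset]


changes = [
    ("user:1", "alice@old.example"),
    ("user:2", "bob@example.com"),
    ("user:1", "alice@new.example"),
]
compacted = compact(changes)
print(compacted)
# [('user:2', 'bob@example.com'), ('user:1', 'alice@new.example')]
```

A consumer replaying the compacted log ends up with the same final state per key as one replaying the full history, while reading fewer records.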
Partitioning
● Custom Partitioner
  ○ Write your own logic
● Default Partitioner
  ○ Manual
  ○ Hashing
    ■ The most common approach
    ■ Messages with the same key go to the same partition
  ○ Spraying
    ■ Random partitioning
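The three strategies listed above can be sketched as a single selection function. This is an illustration only: `choose_partition` and its helpers are hypothetical names, and `zlib.crc32` is used as a stand-in deterministic hash rather than the hash function Kafka itself uses.

```python
import random
import zlib


def hash_partition(key, num_partitions):
    # Stand-in hash (assumption): any deterministic hash sends equal keys
    # to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def spray_partition(num_partitions, rng=random):
    # Spraying: pick a partition at random.
    return rng.randrange(num_partitions)


def choose_partition(num_partitions, key=None, partition=None):
    if partition is not None:           # manual: caller names the partition
        return partition
    if key is not None:                 # hashing: derived from the key
        return hash_partition(key, num_partitions)
    return spray_partition(num_partitions)


same_a = choose_partition(4, key="user-42")
same_b = choose_partition(4, key="user-42")
manual = choose_partition(4, partition=3)
sprayed = choose_partition(4)
print(same_a == same_b, manual, 0 <= sprayed < 4)
```

Hashing keeps all messages for one key in one partition, which preserves their relative order; spraying gives up that property in exchange for an even spread when no key is available.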