Apache Kafka
Yinhao He, Jiaqi Xiao, Ananth Gottumukkala
Publish/subscribe messaging pattern
● Publisher: classifies each message (e.g. by topic) without knowing whether any subscribers exist
● Subscriber: subscribes to a class of messages without knowing whether any publishers exist
● Broker: decouples publishers from subscribers
(Similar to a bulletin board)
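The pattern above can be sketched as a toy in-memory broker. This is an illustration only, not Kafka's API: the names Broker, subscribe, and publish are hypothetical, and messages are pushed to callbacks for brevity.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory broker: routes messages by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        # The subscriber names a topic, never a publisher.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        # The publisher names a topic, never a subscriber.
        # Returns how many subscribers received the message.
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = Broker()
received: List[str] = []
broker.subscribe("orders", received.append)
delivered = broker.publish("orders", "order-1 created")
undelivered = broker.publish("payments", "no one is listening")
```

Publishing to a topic with no subscribers succeeds and reaches nobody, which is the decoupling the bulletin-board analogy describes.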
What is Kafka?
● Open source publish/subscribe messaging system
● Distributed event log (persistent on disk)
● Hybrid between a messaging system and a database
● High throughput platform
● Real-time data streams
● Originally developed at LinkedIn; also used by Twitter and Netflix
Kafka structure
Message
● Single unit of data (byte array)
● Batch
  ○ Collection of messages produced for the same topic and partition
  ○ Trade-off: larger batches raise throughput but add latency
  ○ Can be compressed
● Additional structure (schema) can be imposed on the payload
  ○ E.g. JSON, XML, Avro, or Protobuf
● Message ordering is not guaranteed across multiple partitions
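A minimal sketch of batching, assuming made-up records and a simple length-prefixed framing (Kafka's real batch format is different): messages are grouped by (topic, partition), and each group is compressed as one unit.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical records: (topic, partition, payload bytes).
records = [
    ("clicks", 0, b'{"user": 1, "page": "/home"}'),
    ("clicks", 1, b'{"user": 2, "page": "/home"}'),
    ("clicks", 0, b'{"user": 3, "page": "/home"}'),
    ("clicks", 0, b'{"user": 4, "page": "/home"}'),
]


def build_batches(recs: List[Tuple[str, int, bytes]]) -> Dict[Tuple[str, int], List[bytes]]:
    """Group payloads by (topic, partition), keeping arrival order."""
    batches: Dict[Tuple[str, int], List[bytes]] = defaultdict(list)
    for topic, partition, payload in recs:
        batches[(topic, partition)].append(payload)
    return dict(batches)


def compress_batch(payloads: List[bytes]) -> bytes:
    """Length-prefix each payload, then compress the whole batch at once."""
    framed = b"".join(len(p).to_bytes(4, "big") + p for p in payloads)
    return zlib.compress(framed)


def decompress_batch(blob: bytes) -> List[bytes]:
    """Inverse of compress_batch."""
    framed = zlib.decompress(blob)
    payloads, i = [], 0
    while i < len(framed):
        n = int.from_bytes(framed[i:i + 4], "big")
        payloads.append(framed[i + 4:i + 4 + n])
        i += 4 + n
    return payloads


batches = build_batches(records)
blob = compress_batch(batches[("clicks", 0)])
restored = decompress_batch(blob)
```

Waiting to fill a batch is where the latency cost comes from: a message sits in the batch until it is sent.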
Producer & Consumer
Producer
● Creates new messages and sends them to a specific topic
Consumer
● Reads messages
  ○ In order within each partition
● Offset
  ○ Created when a message is written to Kafka
  ○ The consumer remembers which offset it has reached in each partition
  ○ Stored in ZooKeeper (older versions) or in Kafka itself
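Offsets can be sketched with an append-only list per partition. PartitionLog and Consumer below are hypothetical toy classes, not the Kafka client API; they only show that the offset is assigned on write and that the consumer tracks its own position.

```python
from typing import Dict, List, Tuple


class PartitionLog:
    """Append-only log; a record's offset is its position in the log."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, record: bytes) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset assigned at write time

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, r) for i, r in enumerate(chunk)]


class Consumer:
    """Pulls records and remembers its position in each partition."""

    def __init__(self, partitions: Dict[int, PartitionLog]) -> None:
        self._partitions = partitions
        self.positions: Dict[int, int] = {p: 0 for p in partitions}

    def poll(self, partition: int, max_records: int = 10) -> List[bytes]:
        batch = self._partitions[partition].read(self.positions[partition], max_records)
        if batch:
            self.positions[partition] = batch[-1][0] + 1
        return [record for _, record in batch]

    def seek(self, partition: int, offset: int) -> None:
        # Moving the position back lets the consumer re-read old records.
        self.positions[partition] = offset


log = PartitionLog()
offsets = [log.append(m) for m in (b"a", b"b", b"c")]
consumer = Consumer({0: log})
first = consumer.poll(0, max_records=2)
rest = consumer.poll(0)
consumer.seek(0, 0)
replay = consumer.poll(0)
```

Because the log keeps records after they are read, resetting the position is enough to consume them again.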
Consumer Group
● Each partition is consumed by only one member of a consumer group
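One way to satisfy this rule is a round-robin split of partitions over group members. The function below is a simplified sketch with a hypothetical name; it is not Kafka's actual assignment strategy.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Give every partition to exactly one member, round-robin."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


two_members = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
three_members = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

With more members than partitions, the extra members receive nothing and sit idle, so the partition count caps the parallelism of a group.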
Broker
● A Kafka cluster consists of multiple servers called brokers
● The controller broker is responsible for administrative operations
  ○ Assigns partitions to brokers
  ○ Monitors for broker failures
● Replication provides redundancy of the messages in a partition
  ○ Protects against broker failure
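A simplified sketch of what a controller does after a broker failure, assuming a made-up replica layout: for each partition, pick the first replica that is still alive as leader. It ignores details such as whether a replica is fully caught up.

```python
from typing import Dict, List, Optional, Set

# Hypothetical layout: partition -> brokers holding a replica, preferred leader first.
replicas: Dict[int, List[str]] = {
    0: ["broker-1", "broker-2"],
    1: ["broker-2", "broker-3"],
    2: ["broker-3", "broker-1"],
}


def elect_leaders(layout: Dict[int, List[str]], live: Set[str]) -> Dict[int, Optional[str]]:
    """Choose the first live replica per partition; None if all replicas are down."""
    return {
        partition: next((b for b in brokers if b in live), None)
        for partition, brokers in layout.items()
    }


healthy = elect_leaders(replicas, {"broker-1", "broker-2", "broker-3"})
after_failure = elect_leaders(replicas, {"broker-1", "broker-3"})
only_one_left = elect_leaders(replicas, {"broker-3"})
```

A partition stays available as long as one of its replicas survives; partition 0 in the last case has no live replica and therefore no leader.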
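Retention
● Provides durable storage of messages for a configured period
● Time: messages older than the limit are deleted
● Size: the oldest messages are deleted once the log exceeds the limit
● Individual topics can also configure their own retention settings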
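The two retention limits can be sketched as a pruning function over a toy log. This is a simplification: the function name and record layout are hypothetical, and Kafka deletes whole log segments rather than individual records.

```python
from typing import List, Optional, Tuple

Record = Tuple[float, bytes]  # (timestamp in seconds, payload)


def apply_retention(
    log: List[Record],
    now: float,
    max_age_s: Optional[float] = None,
    max_bytes: Optional[int] = None,
) -> List[Record]:
    """Return the records that survive the time and size limits (oldest first)."""
    kept = list(log)
    if max_age_s is not None:
        # Time limit: drop anything older than max_age_s.
        kept = [(ts, p) for ts, p in kept if now - ts <= max_age_s]
    if max_bytes is not None:
        # Size limit: drop the oldest records until the log fits.
        total = sum(len(p) for _, p in kept)
        while kept and total > max_bytes:
            _, payload = kept.pop(0)
            total -= len(payload)
    return kept


log = [(0.0, b"aaaa"), (50.0, b"bbbb"), (90.0, b"cccc")]
by_time = apply_retention(log, now=100.0, max_age_s=60.0)
by_size = apply_retention(log, now=100.0, max_bytes=5)
```

In both cases the oldest data goes first, whether or not any consumer has read it.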
Reliability Guarantees
● Guarantees the order of messages within one partition
● Committed messages won't be lost as long as at least one replica remains alive and the retention policy holds
● Consumers can only read committed messages
● At-least-once message delivery semantics
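At-least-once delivery follows from processing a record before committing its offset. The toy loop below (hypothetical names, no real Kafka client) simulates a crash between the two steps to show where a duplicate comes from.

```python
from typing import Callable, List, Optional


def consume(
    log: List[str],
    committed: int,
    process: Callable[[str], None],
    crash_before_commit_at: Optional[int] = None,
) -> int:
    """Process records starting at the committed offset; return the new committed offset."""
    offset = committed
    while offset < len(log):
        process(log[offset])            # step 1: handle the record
        if offset == crash_before_commit_at:
            return committed            # simulated crash: this record was never committed
        offset += 1
        committed = offset              # step 2: commit only after processing
    return committed


log = ["m0", "m1", "m2"]
seen: List[str] = []

# First run crashes after processing m1 but before committing it.
committed = consume(log, 0, seen.append, crash_before_commit_at=1)
after_crash = committed

# Restart resumes from the last committed offset, so m1 is processed again.
committed = consume(log, committed, seen.append)
```

No record is skipped, but m1 is handled twice, so consumers should tolerate duplicates.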
Advantages of Kafka
Deals with Integration Complexity
High Throughput and Fairly Low Latency
Handles Big Data
Many Configuration Options
Data Retention
Multiple Producers/Consumers
Disadvantages of Kafka
Steep Learning Curve
Latency Not Low Enough for Some Use Cases
Susceptible to Data Loss
● Split-Brain
● Partition Leader Failover
Kafka vs JMS/ActiveMQ
Kafka                                    | JMS/ActiveMQ
Real-Time Data Stream                    | Traditional Messaging
Consumers Pull Messages from Brokers     | Messages Pushed to Consumers
Implements Backpressure                  | Hard to Achieve Backpressure
Data Retention to Disk                   | No Data Retention
Guarantees Message Ordering in Partition | No Ordering Guarantees
Can Rewind and Re-consume Data           | Consumer Does Not Track Offset
Kafka vs Kinesis
Kafka | Kinesis
Requires setting up your own cluster, nodes, replicas, partitions, etc. | AWS manages infrastructure, config, etc.
Flexible config, but producers (amount of data to send to the broker) and consumers (# replicas, # consumers per partition/topic) need tuning | Config is not as flexible, but AWS ensures availability/durability for 7 days; configure # shards for throughput
Higher Maintenance/Risk Mgmt Cost | Pay-as-you-go / Per # Shards
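Sizing the shard count for throughput can be sketched as below. The per-shard limits are assumptions, not figures from these slides, and should be checked against current AWS documentation. The function name is hypothetical.

```python
import math

# Assumed per-shard limits (verify against current AWS documentation).
WRITE_MB_PER_S = 1.0
WRITE_RECORDS_PER_S = 1000
READ_MB_PER_S = 2.0


def shards_needed(in_mb_per_s: float, in_records_per_s: float, out_mb_per_s: float) -> int:
    """Smallest shard count that satisfies every assumed per-shard limit."""
    return max(
        1,
        math.ceil(in_mb_per_s / WRITE_MB_PER_S),
        math.ceil(in_records_per_s / WRITE_RECORDS_PER_S),
        math.ceil(out_mb_per_s / READ_MB_PER_S),
    )


busy_stream = shards_needed(in_mb_per_s=5, in_records_per_s=2000, out_mb_per_s=12)
small_stream = shards_needed(in_mb_per_s=0.1, in_records_per_s=10, out_mb_per_s=0.1)
```

Since cost is charged per shard, this number also drives the pay-as-you-go bill.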
Thank you