Distributed messaging with Apache Kafka

Date post: 21-Apr-2017
Upload: saumitra-srivastav
Transcript
Page 1: Distributed messaging with Apache Kafka

Distributed messaging with Apache Kafka

Saumitra Srivastav (@_saumitra_)

http://www.meetup.com/Bangalore-Apache-Kafka-Group/

Page 2: Distributed messaging with Apache Kafka

Introduction

Kafka is a:
• distributed
• replicated
• persistent
• partitioned
• high throughput
• pub-sub

messaging system.

Originally developed at LinkedIn. Written in Scala and Java.

Page 3: Distributed messaging with Apache Kafka

Demo Application

Twitter stream analytics

Page 4: Distributed messaging with Apache Kafka

[Diagram: Twitter Streaming API → StreamProducer → Kafka Cluster (Broker-1, Broker-2, Broker-3) → consumers: Solr-1 and Solr-2 (realtime search), Cassandra-1 and Cassandra-2 (data store for longer retention), and Sentiment Analysis]

Page 5: Distributed messaging with Apache Kafka

Terminology

Topics: categories under which a feed of messages is maintained.

Producers: processes that publish messages to a Kafka topic.

Consumers: processes that subscribe to topics and process the feed of published messages.

Brokers: servers that form a Kafka cluster and act as the data transport channel between producers and consumers.

[Diagram: Producers → Kafka Cluster (three Brokers) → Consumers]

Page 6: Distributed messaging with Apache Kafka

Simplified View of a Kafka System

[Diagram: Zookeeper; Broker 1, Broker 2, Broker 3; Producer 1 and Producer 2 publishing to the brokers; Consumer 1, Consumer 2 and Consumer 3 reading from them]

Page 7: Distributed messaging with Apache Kafka

Topics and Partitions

TOPIC – 1 (error log)

TOPIC – 2 (security log)

Page 8: Distributed messaging with Apache Kafka

Partitions

• Each partition is an ordered, immutable sequence of messages.

• Messages are continuously appended to it.

• Each message in a partition is assigned a unique sequential id number called the offset.

• Any message in a partition can be accessed using this offset.
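
As a rough illustration of the bullets above, a partition can be modelled as an append-only list whose indexes are the offsets. This is a toy in-memory sketch, not Kafka's on-disk format; the class name Partition and the sample messages are made up for the example.

```python
class Partition:
    """Toy in-memory model of one partition: an append-only sequence of messages."""

    def __init__(self):
        self._log = []  # messages are only ever appended, never modified

    def append(self, message: bytes) -> int:
        """Append a message and return the offset assigned to it."""
        offset = len(self._log)  # next sequential id
        self._log.append(message)
        return offset

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        return self._log[offset]


partition = Partition()
first = partition.append(b"disk full")
second = partition.append(b"connection reset")
print(first, second)           # 0 1
print(partition.read(second))  # b'connection reset'
```

Because an offset is just a position in the sequence, reading by offset needs no per-message index or lookup table.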

Page 9: Distributed messaging with Apache Kafka

Partitions

• Partitions serve 2 purposes:
1. Scaling
2. Parallelism

• Scaling: A topic can be divided into multiple partitions, and each partition can be on a different server.

• Parallelism: A consumer can consume from multiple partitions at the same time (ordering is guaranteed within each partition).

Page 10: Distributed messaging with Apache Kafka

Distribution & Replication

• The partitions of the log are distributed over the servers in the Kafka cluster.

• Each server handles data and requests for some number of partitions.

• Each partition is replicated for fault tolerance.

• Each partition has one server which acts as the leader.

• The leader handles all read and write requests for the partition.

• Followers keep replicating the leader.
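
One way to picture the leader/follower layout is the sketch below. The function assign_replicas and its simple round-robin placement are assumptions made for illustration; Kafka's real assignment logic is more involved. The sketch only shows that every partition ends up with one leader plus followers on other brokers.

```python
def assign_replicas(brokers, num_partitions, replication_factor):
    """Place each partition on `replication_factor` distinct brokers.

    The first replica is treated as the leader, the rest as followers.
    Round-robin placement is a simplification chosen for this example.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment


layout = assign_replicas(["broker-1", "broker-2", "broker-3"],
                         num_partitions=4, replication_factor=2)
for partition_id, roles in layout.items():
    print(partition_id, roles)
```

With three brokers and four partitions, leadership is spread across all brokers, so no single server handles every read and write.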

Page 11: Distributed messaging with Apache Kafka

Producers

• Producers publish data to the topics of their choice.

• A producer can choose the partition of the topic to which each message is assigned.

• Partitions can be selected in a round-robin manner for load balancing.

• Kafka doesn’t care about the serialization format. All it needs is a byte array.
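
The producer behaviour described above can be sketched as follows. RoundRobinProducer is a hypothetical toy class, not the Kafka producer API; JSON is just one possible serialization, chosen here because the broker side only ever sees bytes.

```python
import itertools
import json


class RoundRobinProducer:
    """Toy producer: spreads messages over partitions unless one is named."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = itertools.cycle(range(num_partitions))

    def send(self, value: bytes, partition=None) -> int:
        """Store `value` and return the partition it was written to."""
        if not isinstance(value, bytes):
            raise TypeError("the message must already be serialized to bytes")
        if partition is None:
            partition = next(self._next)  # round-robin for load balancing
        self.partitions[partition].append(value)
        return partition


producer = RoundRobinProducer(num_partitions=3)
tweet = {"user": "alice", "text": "hello kafka"}
payload = json.dumps(tweet).encode("utf-8")  # serialization is the caller's job

chosen = [producer.send(payload) for _ in range(4)]
pinned = producer.send(payload, partition=2)  # explicit partition choice
print(chosen, pinned)  # [0, 1, 2, 0] 2
```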

Page 12: Distributed messaging with Apache Kafka

Consumers

• Other messaging systems basically follow 2 models:
  • Queuing
  • Publish-Subscribe

• Kafka uses the concept of a consumer group, which generalizes both of these models.

• Consumers label themselves with a consumer group name.

• Each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
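
A minimal sketch of how consumer groups generalize the two models, assuming a simple round-robin split of partitions inside each group. The function assign_partitions and the consumer names are illustrative only and do not reflect Kafka's actual rebalancing protocol.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer of a single group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Each group sees every partition; inside a group the partitions are shared out.
group_a = assign_partitions(partitions, ["a1", "a2"])
group_b = assign_partitions(partitions, ["b1"])
print(group_a)  # {'a1': ['P0', 'P2'], 'a2': ['P1', 'P3']}
print(group_b)  # {'b1': ['P0', 'P1', 'P2', 'P3']}
```

If all consumers share one group, the topic behaves like a queue (work is divided among them). If every consumer has its own group, it behaves like publish-subscribe (everyone gets every message).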

Page 13: Distributed messaging with Apache Kafka

Consumers

Page 14: Distributed messaging with Apache Kafka

Consumer Groups

[Diagram: Zookeeper; Broker 1, Broker 2, Broker 3; Producer 1 and Producer 2; Consumer 1, Consumer 2 and Consumer 3 organized into Consumer-Group A and Consumer-Group B]

Page 15: Distributed messaging with Apache Kafka

Consumer groups

[Diagram: Zookeeper; Broker 1, Broker 2 and Broker 3, each holding partitions of Topic-1 (P0, P2, P3, P4, P5 shown); Producer 1 and Producer 2; Consumer 1, Consumer 2 and Consumer 3 organized into Consumer-Group A and Consumer-Group B]

Page 16: Distributed messaging with Apache Kafka

Message Persistence

• Unlike in other messaging systems, messages are not deleted on consumption.

• Messages are retained for a configurable period of time, after which they are deleted (even if they have NOT been consumed).

• Consumers can re-consume any chunk of older messages using the message offset.

• Kafka performance is effectively constant with respect to data size, so retaining a large amount of data is not an issue.
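
The retention behaviour above can be sketched with a toy log that keeps timestamps. RetainedLog and its explicit `now` argument are assumptions made so the example is deterministic; real brokers apply retention to log segments on disk.

```python
class RetainedLog:
    """Toy log with time-based retention; reading never deletes anything."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._entries = []      # (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message, now):
        self._entries.append((self._next_offset, now, message))
        self._next_offset += 1

    def read_from(self, offset):
        """Return all retained messages with an offset >= `offset`."""
        return [m for (o, _, m) in self._entries if o >= offset]

    def expire(self, now):
        """Drop messages older than the retention period, consumed or not."""
        self._entries = [e for e in self._entries
                         if now - e[1] < self.retention_seconds]


log = RetainedLog(retention_seconds=60)
log.append(b"m0", now=0)
log.append(b"m1", now=30)
log.append(b"m2", now=50)

first_read = log.read_from(0)  # consuming does not remove messages
replay = log.read_from(1)      # so an older range can be read again
log.expire(now=70)             # only m0 is older than 60 seconds
after_expiry = log.read_from(0)
print(first_read, replay, after_expiry)
```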

Page 17: Distributed messaging with Apache Kafka

Demo: Running a multi-broker Kafka cluster

Page 18: Distributed messaging with Apache Kafka

Guarantees

1. Ordering guarantee
• Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
• A consumer instance sees messages in the order they are stored in the log.

2. At-least-once delivery

3. Fault tolerance
• For a topic with replication factor N, up to N-1 server failures will not cause any loss of committed messages.

4. No corruption of data:
• Over the network
• On the disk
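
At-least-once delivery usually follows from committing the offset only after a message has been processed. The sketch below simulates that with a plain list as the log; the function consume and its crash_before_commit_at parameter are hypothetical names invented for the example.

```python
def consume(log, committed_offset, process, crash_before_commit_at=None):
    """Process messages starting at `committed_offset`.

    The offset is committed only after a message has been processed, so a
    crash between the two steps causes a redelivery rather than a loss.
    Returns the last committed offset.
    """
    offset = committed_offset
    while offset < len(log):
        process(log[offset])
        if offset == crash_before_commit_at:
            return committed_offset  # simulated crash: this progress is not saved
        offset += 1
        committed_offset = offset    # commit after successful processing
    return committed_offset


log = ["m0", "m1", "m2"]
seen = []

committed = consume(log, 0, seen.append, crash_before_commit_at=1)
after_crash = committed                         # offset 1: m1 was not committed
committed = consume(log, committed, seen.append)  # restart from last commit
print(after_crash, committed, seen)  # 1 3 ['m0', 'm1', 'm1', 'm2']
```

The message m1 is processed twice and nothing is skipped, which is what "at least once" means. Consumers that cannot tolerate duplicates need to make their processing idempotent.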

Page 19: Distributed messaging with Apache Kafka

Demo: Consumer/Producer Java API

Page 20: Distributed messaging with Apache Kafka

Misc Design features

1. Stateless broker
• Each consumer maintains its own state (offset)

2. Load balancing
3. Asynchronous send
4. Push/pull model (producers push, consumers pull) instead of push/push
5. Consumer position
6. Offline data load
7. Simple API
8. Low overhead
9. Batch send and receive
10. No message caching in JVM
11. Rely on file system buffering
• Mostly sequential access patterns

12. Zero-copy transfer: file -> socket
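
To give a feel for the file -> socket idea in item 12, the sketch below hands a file to the operating system to send, instead of reading it into application buffers first. It uses Python's socket.sendfile as a stand-in for the mechanism; the temporary file, its contents and the socket pair are made up for the example, and Python falls back to ordinary send() on platforms where the sendfile system call is unavailable.

```python
import os
import socket
import tempfile

# Stand-in for a log segment on disk (contents are invented for the example).
with tempfile.NamedTemporaryFile(delete=False) as segment:
    segment.write(b"message-0\nmessage-1\n")
    path = segment.name

# Stand-in for a broker-to-consumer network connection.
sender, receiver = socket.socketpair()
try:
    with open(path, "rb") as log_file:
        # The kernel moves the bytes from the file to the socket where possible.
        sent = sender.sendfile(log_file)
    sender.close()  # signal end of stream to the reader

    received = b""
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    sender.close()
    receiver.close()
    os.unlink(path)

print(sent, received)
```

This fits with items 10 and 11: if messages stay in the file system cache rather than in the application's heap, sending them on to consumers can skip extra copies.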

Page 21: Distributed messaging with Apache Kafka

Use Cases

1. Messaging
2. Website Activity Tracking
3. Metrics
4. Log Aggregation
5. Stream Processing
