1. Intro to Apache Kafka
Jason Hubbard | Systems Engineer
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.

2. Kafka Overview

3. What is Kafka?
- Developed by LinkedIn after challenges building pipelines into Hadoop
- Message-based store used to build data pipelines and support streaming applications
- Kafka offers
  - Publish & subscribe semantics
  - Horizontal scalability
  - High availability
- Nodes in a Kafka cluster (called brokers) can handle
  - Reads/writes in the hundreds of MB per second
  - Thousands of producers and consumers
  - Multiple node failures (with proper configuration)

4. Why Kafka? (Or rather, why not Flume?)
- No ability to replay events
- Multiple sinks require event replication (via multiple channels)
- Sinks that share a source (mostly) process events in sync
[Diagram: Flume topology. Two agents (spool source, channel, Avro sink) read "Logs" and "More Logs" and forward to an Avro source, which feeds an HBase sink and an HDFS sink through separate channels]

5. Why Kafka for Hadoop?
[Figures labeled 2009 and 2012]

6. Why Kafka? Decoupling
[Figures labeled 2012 and 2013+?]

7. A Departure from Legacy Models
- Message stores have two well-known types
  - Queues (producer-consumers)
  - Topics (publisher-subscribers)
- One consumer gets one message from a queue, then it's gone
  - Consumers might work alone or in concert
- Multiple subscribers can get one message from a topic
  - Messages are published to every subscriber
- Kafka inverts or blends these concepts
  - Tracks consumers by group identification
  - Retains messages by expiration, not consumer interaction
  - Bakes in partitioning for scalability and parallel operations
  - Bakes in replication for availability and fault tolerance

8. Components & Roles
- A Kafka server is called a broker
  - Brokers can work together in a cluster
- Each broker hosts message stores called topics
  - You can partition a topic across brokers for scale and parallelism
  - You can also replicate a topic for resilience to failure
- Producers push to a Kafka topic, consumers pull
  - Kafka provides Consumer and Producer APIs

9. Detailed Architecture
- It's all about the logs!
- No, not application logs

10. Kafka Detailed Architecture
- Brokers and consumers initialize their state in ZooKeeper
  - Broker state includes host name, port, and partition list
  - Consumer state includes group name and message offsets (deprecated)
[Diagram: producers write to the brokers of a Kafka cluster and consumers read from them; the brokers and consumers connect to ZooKeeper, which also holds the offsets]

11. Kafka and ZooKeeper
- Kafka uses ZooKeeper
  - To indicate liveness of each broker
  - To store broker and consumer state
  - To coordinate leader elections for failover
- ZooKeeper stores consumer offsets by default
  - This can be switched to the brokers, if desired
- ZooKeeper also tracks and supports state changes such as
  - Adding/removing brokers and consumers
  - Rebalancing consumers
  - Directing producers and consumers to partition leaders

12. Topic Partitions
- A partition is a totally ordered store of messages (a log)
- Partition order is immutable
  - Messages are deleted as their time runs out
  - New messages can only be appended
- The message offset is both a sequence number and a unique identifier within a (topic, partition) pair
[Diagram: Partition 0, Partition 1, and Partition 2 drawn as rows of numbered offsets (0-13, 0-11, and 0-13), ordered old to new, with writes appended at the new end]
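
The append-only behaviour described on this slide can be sketched in a few lines of plain Scala. This is a toy in-memory model, not Kafka's storage code; the class name PartitionLog and its methods are hypothetical and exist only for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of one topic partition: an append-only sequence in which a
// message's position is its offset. Hypothetical; not Kafka's implementation.
final class PartitionLog {
  private val messages = ArrayBuffer.empty[String]

  // Append at the "new" end of the log and return the assigned offset.
  def append(message: String): Long = {
    messages += message
    (messages.length - 1).toLong
  }

  // Read up to `max` messages starting at `offset`, each paired with its offset.
  def read(offset: Long, max: Int): Seq[(Long, String)] =
    messages.zipWithIndex
      .drop(offset.toInt)
      .take(max)
      .map { case (m, i) => (i.toLong, m) }
      .toSeq
}

object PartitionLogDemo {
  def main(args: Array[String]): Unit = {
    val log = new PartitionLog
    val first = log.append("M1")
    val second = log.append("M2")
    println(s"M1 -> offset $first, M2 -> offset $second")
    println(log.read(1, 10))
  }
}
```

Because existing entries are never rewritten, an offset handed out once keeps pointing at the same message for as long as that message is retained.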

13. How are partitions distributed?
- Partitions are usually distributed across brokers
  - Each broker may host partitions of several topics
- One broker acts as leader for any replicated partition
  - Other brokers with a replica act as followers
  - Only leaders serve read/write requests
- If the leader blinks out, a follower is elected to take over
  - Election occurs only among in-sync replicas (ISRs)
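
As a rough illustration of the failover rule, the sketch below promotes a follower only if it is still in the ISR. The PartitionState type and the "first eligible replica in assignment order" choice are assumptions made for this example; the real election is coordinated through ZooKeeper and involves more state.

```scala
// Hypothetical snapshot of one replicated partition.
final case class PartitionState(leader: Int, replicas: List[Int], isr: Set[Int])

object LeaderElection {
  // Remove the failed broker from the ISR. If it was the leader, promote the
  // first assigned replica that is still in sync; None means no eligible
  // follower remains, so the partition has no leader.
  def onBrokerFailure(state: PartitionState, failed: Int): Option[PartitionState] = {
    val remainingIsr = state.isr - failed
    if (state.leader != failed) Some(state.copy(isr = remainingIsr))
    else state.replicas
      .find(id => remainingIsr.contains(id))
      .map(next => state.copy(leader = next, isr = remainingIsr))
  }
}
```

With replicas 1, 2, 3 and an ISR of {1, 3}, losing broker 1 promotes broker 3 in this model, because broker 2 has fallen out of sync and is skipped.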

14. Scalability & Parallelism
- Partitions allow a topic to store more messages than one broker's capacity
  - More brokers = greater message capacity
- Partitions also allow consumer groups to read a topic in parallel
  - Each member can read a partition
  - Kafka ensures that members of one group never contend for the same partition

15. Replication
- A topic partition is the unit of replication
- A replica remains in sync with its leader so long as
  - It maintains communication with ZooKeeper
  - It does not fall too far behind the leader (configurable)
- Replicating to n brokers
  - Allows Kafka to offer availability under n - 1 losses
  - In practice this guarantee is tempered by how many replicas are in the ISR
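
The two in-sync conditions can be written as a single predicate. The sketch below is a simplified model: the Follower type, the maxLag parameter, and measuring lag in messages are assumptions for illustration, and the actual broker settings and their units may differ by Kafka version.

```scala
// Hypothetical view of a follower replica as tracked by the leader.
final case class Follower(brokerId: Int, sessionAlive: Boolean, logEndOffset: Long)

object IsrCheck {
  // A follower counts as in sync when its ZooKeeper session is alive and its
  // lag behind the leader's log end is within `maxLag` (an assumed threshold).
  def inSync(leaderEndOffset: Long, maxLag: Long)(f: Follower): Boolean =
    f.sessionAlive && (leaderEndOffset - f.logEndOffset) <= maxLag

  // The ISR is the leader plus every follower that passes the check.
  def isr(leaderId: Int, leaderEndOffset: Long, maxLag: Long,
          followers: Seq[Follower]): Set[Int] =
    followers
      .filter(f => inSync(leaderEndOffset, maxLag)(f))
      .map(_.brokerId)
      .toSet + leaderId
}
```

If the ISR shrinks to the leader alone, a partition replicated to three brokers can no longer survive the loss of that one broker without risk, which is the caveat the last bullet points at.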

16. Fault Tolerance
- A broker may lead for some partitions and follow for others
  - The replication factor for each topic determines how many brokers will follow
  - Followers passively replicate the leader
- You can set an ISR policy
  - Boils down to a preference for high, medium, or low throughput
- The right ISR policy strikes some balance between
  - Availability: electing a leader quickly in the event of failure
  - Latency: assuring a producer its messages are safe (i.e., durable)

17. Producers
- Producers publish data (messages) to Kafka topics
- Producers choose the partition a message goes to
  - By selecting in round-robin fashion to distribute the load
  - By assigning a semantic partitioning function to key the messages
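
The two selection strategies can be compared side by side in a short sketch. The names below are hypothetical, and String.hashCode is only a stand-in: Kafka's client libraries may use a different hash function, so the partition numbers computed here would not necessarily match a real cluster.

```scala
object PartitionChoice {
  // Round-robin: hand out partitions 0, 1, ..., n-1, 0, 1, ... in turn
  // so that load is spread evenly.
  final class RoundRobin(numPartitions: Int) {
    private var next = 0
    def choose(): Int = {
      val p = next
      next = (next + 1) % numPartitions
      p
    }
  }

  // Keyed: hash the message key so equal keys always map to the same
  // partition. floorMod keeps the result non-negative for negative hashes.
  def byKey(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)
}
```

Round-robin gives the most even spread, while keyed selection trades some balance for the guarantee that related messages stay together and in order.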

18. Consumers
- A consumer reads messages published to Kafka topics by moving its offset
  - The offset increments by default
- Every consumer specifies a group label
- A consumer's actions in one group do not affect other groups
  - If one group "tails" a topic's messages, it does not change what another group can consume
- Consumers come and go with little impact on the cluster or other consumers
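
Group isolation follows from keeping one read position per group over a log that is retained regardless of who has read it. The sketch below models that idea in memory; the GroupOffsets class is hypothetical and ignores partitions, commits, and rebalancing.

```scala
import scala.collection.mutable

// Toy model: one retained log plus an independent read offset per group.
final class GroupOffsets(log: Vector[String]) {
  // Groups that have not read anything yet start at offset 0.
  private val offsets = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Return the next message for `group` and advance only that group's offset.
  def poll(group: String): Option[String] = {
    val offset = offsets(group)
    if (offset < log.length) {
      offsets(group) = offset + 1
      Some(log(offset))
    } else None
  }
}
```

Polling as group "A" until the log is exhausted leaves group "B" untouched: its first poll still returns the first message.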

19. Kafka Consumer Group Operation
- Every message in a partition is read by the same instance of a consumer group
- Group members can be processes residing on separate machines
- The diagram below shows a two-broker cluster
  - The brokers host one topic in four partitions, P0-P3
  - Group A has two instances; each instance reads two partitions
  - Group B has four instances; each instance reads one partition
[Diagram: Kafka cluster with Broker 1 hosting P0 and P3 and Broker 2 hosting P1 and P2; Consumer Group A contains C1 and C2; Consumer Group B contains C3, C4, C5, and C6]
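
The partition counts in this example can be reproduced with a simple assignment function. The sketch deals partitions out to members in turn, which is an assumed strategy chosen for brevity; which consumer actually reads which partition in the diagram is not recoverable from the slide.

```scala
object GroupAssignment {
  // Deal partitions out to group members in turn, so every partition has
  // exactly one reader within the group. Members left without a partition
  // (more members than partitions) do not appear in the result.
  def assign(partitions: Seq[String], members: Seq[String]): Map[String, Seq[String]] =
    partitions.zipWithIndex
      .groupBy { case (_, i) => members(i % members.length) }
      .map { case (member, ps) => member -> ps.map(_._1) }
}

object GroupAssignmentDemo {
  def main(args: Array[String]): Unit = {
    val partitions = Seq("P0", "P1", "P2", "P3")
    println(GroupAssignment.assign(partitions, Seq("C1", "C2")))
    println(GroupAssignment.assign(partitions, Seq("C3", "C4", "C5", "C6")))
  }
}
```

Adding a fifth member to Group B would leave that member idle in this model, since a partition is never shared inside one group.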

20. Messages
- Kafka stores messages in its own format
  - Producers and consumers also use this format for transfer efficiency
- Any serializable object can be a message
  - Popular formats include string, JSON, and Avro
- Each message's ID is also its unique identifier in a topic partition

21. Traditional Message Ordering
- Traditional queues store messages in the order received
  - Consumers draw messages in store order
- With multiple consumers, however, messages are not received in order
  - Consumers may experience different delays
  - They might also consume messages at different rates
- To retain order, only one process may consume from the queue
  - This comes at the expense of parallelism

22. Guarantees for Ordering
- Kafka appends messages sent by a producer to one partition in sending order
- If a producer sends M1 followed by M2 to the same partition
  - M1 will have a lower offset than M2
  - M1 will appear earlier in the partition
- A consumer always sees messages in stored order
- Given a partition with N replicas, up to N-1 server failures may occur without message loss

23. Message Retention
- The Kafka cluster retains messages for a length of time
- You can set retention time per topic or for all topics
- You can also set a storage limit on any topic
- Kafka deletes messages upon expiration
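
The two retention limits can be modelled as functions that trim the old end of a log. This is a deliberately simplified sketch: the Record type and function names are hypothetical, and Kafka itself removes whole log segment files rather than individual messages.

```scala
// Hypothetical record metadata: position, write time, and size.
final case class Record(offset: Long, timestampMs: Long, sizeBytes: Long)

object Retention {
  // Time-based: drop records from the old end that are older than
  // `retentionMs` relative to `nowMs`.
  def byTime(log: Vector[Record], nowMs: Long, retentionMs: Long): Vector[Record] =
    log.dropWhile(r => nowMs - r.timestampMs > retentionMs)

  // Size-based: drop the oldest records until the total fits in `maxBytes`.
  def bySize(log: Vector[Record], maxBytes: Long): Vector[Record] = {
    var kept = log
    while (kept.nonEmpty && kept.map(_.sizeBytes).sum > maxBytes) kept = kept.tail
    kept
  }
}
```

Neither rule looks at whether a consumer has read a message, which matches the earlier point that Kafka retains by expiration rather than by consumer interaction.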

24. Demo

25. Creating Topics
- Kafka ships with command line tools useful for exploring
- The kafka-topics tool creates topics via ZooKeeper
  - The default ZooKeeper port is 2181
- To create and list the topic device_status
  - Use the --list parameter to list all topics
$ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic device_status
$ kafka-topics --list --zookeeper localhost:2181

26. Creating a Producer
- Use kafka-console-producer to publish messages
- Requires a broker list, e.g., localhost:9092
  - Provide a comma-delimited list for failover protection
- Provide the name of the topic
  - We will log messages to the topic named device_status
$ kafka-console-producer --broker-list localhost:9092 --topic device_status

27. Creating a Consumer
- The kafka-console-consumer tool is a simple consumer
- It uses ZooKeeper to connect; below we access localhost:2181
- We also name a topic: device_status
- To read all available messages on the topic, we use the --from-beginning option
$ bin/kafka-console-consumer --zookeeper localhost:2181 --topic device_status --from-beginning

28. Creating a Spark Consumer
- The kafka-console-consumer tool is a simple consumer; a Spark Streaming job can consume from Kafka as well
- The code uses sc, the SparkContext that a spark-shell session provides

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // One-second micro-batches on the existing SparkContext
    val ssc = new StreamingContext(sc, Seconds(1))

    // Consumer settings: broker address, deserializers, group, and offset policy
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafkaintro",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val topics = Array("TopicA")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    // Print the value of each record in every batch
    stream.map(_.value).print()
    ssc.start()

29. Common & Best Practices

30. Tip: Balance Throughput & Durability
- Producers specify the durability they need with the property request.required.acks
- Adding brokers can improve throughput

Durability   Behaviour                        Per-Event Latency   Required Acks (request.required.acks)
Highest      All in-sync replicas ACK         Highest             -1
Moderate     Leader ACKs message              Medium              1
Lowest       No ACKs required                 Lowest              0

Common practice:

Property                Value
replication             3
min.insync.replicas     2
request.required.acks   -1
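
As a minimal sketch, the producer-side part of these settings can be assembled with java.util.Properties. The property names are those of the older producer that this slide refers to (request.required.acks, metadata.broker.list); the newer Java producer uses acks and bootstrap.servers instead. The object and method names are hypothetical, and no Kafka library is needed to build the object.

```scala
import java.util.Properties

object DurableProducerConfig {
  // Build producer settings for the highest-durability row.
  // Note: the replication factor and min.insync.replicas are topic/broker
  // settings and are therefore not part of the producer's properties.
  def highDurability(brokers: String): Properties = {
    val props = new Properties()
    props.setProperty("metadata.broker.list", brokers)
    props.setProperty("request.required.acks", "-1") // wait for all in-sync replicas
    props
  }
}
```

Combined with a replication factor of 3 and min.insync.replicas set to 2, a write is acknowledged only once at least two replicas hold it, so one broker can fail without losing acknowledged messages.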

31. Tip: Consider Message Keys
- A Kafka message is stored as a KV pair
  - The key is not used in the default case
- A producer can set content in a message key, then use a Partitioner subclass to hash the key
  - This allows the producer to effect semantic partitioning
  - Example: DEBUG, INFO, WARN, ERROR partitions for a syslog topic
- Kafka guarantees that messages whose keys hash to the same partition are stored in that partition
- A consumer group could then pair each member with an intended partition
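
The syslog example amounts to a fixed mapping from key to partition. The sketch below shows only that mapping, as a plain function; it does not implement Kafka's Partitioner interface, and the object name and the fallback to partition 0 for unknown keys are assumptions made for illustration.

```scala
object SyslogPartitioner {
  // Partition index for each log level; assumes a topic with four partitions.
  private val levels = Vector("DEBUG", "INFO", "WARN", "ERROR")

  // Map the message key (the log level) to its partition.
  // Unknown keys fall back to partition 0 in this sketch.
  def partitionFor(key: String): Int = math.max(levels.indexOf(key), 0)
}
```

A four-member consumer group could then dedicate one member to each level, for example routing the ERROR partition to an alerting process.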

32. Tip: Writing Files to Topics
- Kafka will accept file content as a message
- Write a file's data to the device_alerts topic:
$ cat alerts.txt | kafka-console-producer --broker-list localhost:9092 --topic device_alerts
- Then read it:
$ kafka-console-consumer --zookeeper localhost:2181 --topic device_alerts --from-beginning
- Remember that the consumer offsets might be stored in Kafka instead of ZooKeeper

33. Best Uses
- Kafka is intended for storing messages
  - Log records
  - Event information
- For small messages, latency in the tens of milliseconds is common
- Kafka is not well-suited for large file transfers
  - Keeping messages under 10 KB helps keep latency low

34. Thank you
[email protected]