
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

Date post: 27-Nov-2014
Upload: christopher-curtin
Description:
I presented an introduction to Kafka 0.8.0 to the Atlanta Java User's Group.
Transcript
Page 1: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

Introduction to Apache Kafka
Chris Curtin
Head of Technical Research

Atlanta Java Users Group March 2013

Page 2: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

2

About Me
• 20+ years in technology
• Head of Technical Research at Silverpop (12+ years at Silverpop)
• Built a SaaS platform before the term ‘SaaS’ was being used
• Prior to Silverpop: real-time control systems, factory automation and warehouse management
• Always looking for technologies and algorithms to help with our challenges
• Car nut

Page 3: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

3

Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com – Go to Careers under About

Page 4: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

4

Caveats
• We don’t use Kafka in production
• I don’t have any experience with Kafka in operations
• I am not an expert on messaging systems/JMS/MQSeries etc.

Page 5: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

5

Apache Kafka – from Apache
• Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
– Persistent messaging with O(1) disk structures that provide constant-time performance even with many TB of stored messages.
– High throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
– Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
– Support for parallel data load into Hadoop.

Page 6: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

6

Background
• LinkedIn product donated to Apache
• Most core developers are from LinkedIn
• Pretty good pickup outside of LinkedIn: Airbnb & Urban Airship, for example
• Fun fact: no logo yet

Page 7: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

7

Why?

Data Integration

Page 8: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

8

Point to Point integration (thanks to LinkedIn for slide)

Page 9: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

9

Page 10: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

10

(Thanks to http://linkstate.wordpress.com/2011/04/19/recabling-project/)

Page 11: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

11

What we’d really like (thanks to LinkedIn for slide)

Page 12: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

12

Looks Familiar: JMS to the rescue!

Page 13: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

13

Okay: Data warehouse to the rescue!

Page 14: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

14

Okay: CICS to the rescue!

Page 15: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

15

Kafka changes the paradigm

Kafka doesn’t keep track of who consumed which message

Page 16: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

16

Consumption Management
• Kafka leaves management of what was consumed up to the business logic
• Each message has a unique identifier (within the topic and partition)
• Consumers can ask for messages by identifier, even if they are days old
• Identifiers are sequential within a topic and partition
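The bullets above can be sketched with a toy model: an offset in Kafka is simply a position in an append-only, per-partition log, which is why a consumer can re-read any retained message by identifier. The class below is an illustrative in-memory stand-in (names are mine), not Kafka's actual storage code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Kafka partition: offsets are positions in an
// append-only list, so "fetch from offset N" is just a tail read.
public class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Append a message and return the offset it was assigned.
    public long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Consumers may re-request any retained offset, even days later.
    public List<String> fetchFrom(long offset) {
        return messages.subList((int) offset, messages.size());
    }
}
```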

Page 17: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

17

Why is Kafka Interesting?

Horizontally scalable messaging system

Page 18: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

18

Terminology
• Topics are the main grouping mechanism for messages
• Brokers store the messages, take care of redundancy issues
• Producers write messages to a broker for a specific topic
• Consumers read from brokers for a specific topic
• Topics can be further segmented by partitions
• Consumers can read a specific partition from a topic

Page 19: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

19

API (1 of 25) - Basics
• Producer: send(String topic, String key, Message message)
• Consumer: Iterator<Message> fetch(…)

Page 20: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

20

API• Just kidding – that’s pretty much it for the API

• Minor variation on the consumer for ‘Simple’ consumers but that’s really it

• ‘under the covers’ functions to get current offsets or implement non-trivial consumers

Page 21: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

21

Architecture (thanks to LinkedIn for slide)

Page 22: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

22

Producers
• Pretty basic API
• Partitioning is a little odd; it requires Producers to know about the partition scheme
• Producers DO NOT know about consumers
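Because producers must know the partitioning scheme, Kafka 0.8 lets them plug in a partitioner class (the producer listing later in this deck configures a custom OrganizationPartitioner). Below is a minimal sketch of the usual hash-based logic in plain Java; the class and method names are illustrative, not the kafka.producer interface itself.

```java
// Illustrative hash partitioner: map a message key to one of
// numPartitions. The same key always lands on the same partition,
// which preserves per-key ordering within that partition.
public class KeyHashPartitioner {
    public static int partition(Object key, int numPartitions) {
        // key.hashCode() % n lies in (-n, n); Math.abs folds it into [0, n).
        return Math.abs(key.hashCode() % numPartitions);
    }
}
```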

Page 23: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

23

Consumers: Consumer Groups
• Easiest to get started with
• Kafka makes sure only one thread in the group sees a message for a topic (or a message within a partition)
• Uses ZooKeeper to keep track of what messages were consumed in which topics/partitions
• No ‘once and only once’ delivery semantics here
• A rebalance may mean a message gets replayed

Page 24: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

24

Consumers: Simple Consumer
• Consumer subscribes to a specific topic and partition
• Consumer has to keep track of what message offset was last consumed
• A lot more error handling is required if brokers have issues
• But a lot more control over which messages are read; does allow for ‘exactly once’ messaging
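Because the application owns the offset with the Simple Consumer, it can commit the offset and the processing result in one atomic step (in practice, one database transaction), so a replay after a crash resumes exactly where the stored offset says. A minimal in-memory sketch of that idea, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of 'exactly once' consumption: the result store and the
// last-applied offset are updated together, so replayed messages
// whose offsets were already applied are ignored.
public class ExactlyOnceStore {
    private final Map<String, Long> results = new HashMap<>();
    private long lastOffset = -1;

    // Apply a message only if its offset is newer than what we stored.
    public void process(long offset, String key, long value) {
        if (offset <= lastOffset) return; // duplicate from a replay
        results.merge(key, value, Long::sum);
        lastOffset = offset; // committed atomically with the result
    }

    public long result(String key) { return results.getOrDefault(key, 0L); }
    public long lastOffset() { return lastOffset; }
}
```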

Page 25: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

25

Consumer Model Design
• Partition design impacts overall throughput
– Producers know the partitioning class
– Producers write to a single broker ‘leader’ for a partition
• Offsets as the only transaction identifier complicate the consumer
– ‘Throw more hardware’ at the backlog is complicated
– Consumer Groups == 1 thread per partition
• If operations are expensive, you can’t throw more threads at them
• Not a lot of ‘real world’ examples on balancing # of topics vs. # of partitions

Page 26: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

26

Why is Kafka Interesting?

Memory Mapped Files

Kernel-space processing

Page 27: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

27

What is a commit log? (thanks to LinkedIn for slide)

Page 28: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

28

Brokers
• Lightweight, very fast message storing
• Writes messages to disk using kernel space, NOT the JVM
• Uses the OS pagecache
• Data is stored in flat files on disk, one directory per topic and partition
• Handles the replication
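The "kernel space NOT JVM" point refers to sendfile-style zero-copy: bytes move from the page cache to the destination channel without being copied onto the JVM heap. In Java that facility is FileChannel.transferTo; the sketch below (illustrative, not broker code) streams a "segment" file through any writable channel that way.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Zero-copy transfer of a log segment file to a channel.
// transferTo lets the kernel move the bytes directly, so the data
// never passes through a Java byte[] buffer.
public class ZeroCopySend {
    public static long send(Path segment, WritableByteChannel out) throws IOException {
        try (FileChannel in = FileChannel.open(segment, StandardOpenOption.READ)) {
            long sent = 0, size = in.size();
            // transferTo may move fewer bytes than requested; loop until done.
            while (sent < size) {
                sent += in.transferTo(sent, size - sent, out);
            }
            return sent;
        }
    }
}
```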

Page 29: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

29

Brokers continued
• Very low memory utilization – almost nothing is held in memory
• (Remember, the broker doesn’t keep track of who has consumed a message)
• Handles TTL operations on data
• Drops a file when the data is too old
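Because data lives in flat segment files, retention is file-level: when a file's data passes the TTL, the broker just deletes the file. A sketch of that policy using last-modified times (illustrative only; the real broker tracks its own segment metadata and naming):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// File-level TTL: delete every *.log segment in a partition directory
// whose last-modified time is older than the retention window.
public class SegmentRetention {
    public static int dropOldSegments(Path partitionDir, Duration ttl) throws IOException {
        Instant cutoff = Instant.now().minus(ttl);
        int dropped = 0;
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(partitionDir, "*.log")) {
            for (Path segment : segments) {
                if (Files.getLastModifiedTime(segment).toInstant().isBefore(cutoff)) {
                    Files.delete(segment); // whole file at once -- no per-message bookkeeping
                    dropped++;
                }
            }
        }
        return dropped;
    }
}
```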

Page 30: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

30

Why is Kafka Interesting?

Stuff just works

Producers and Consumers are about business logic

Page 31: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

31

Consumer Use Case: batch loading
• Consumers don’t have to be online all the time
• Wake up every hour, ask Kafka for events since the last request
• Load into a database, push to external systems, etc.
• Load into Hadoop (stream if using MapR)

Page 32: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

32

Consumer Use Case: Complex Event Processing
• Feed to Storm or a similar CEP system
• Partition on user id, subsystem, product, etc., independent of Kafka’s partitioning
• Execute rules on the data
• Made a mistake? Replay the events and fix it

Page 33: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

33

Consumer Use Case: Operations Logs
• Load ‘old’ operational messages to debug problems
• Do it without impacting production systems (remember, consumers can start at any offset!)
• Have business logic write to a different output store than production, but drive off production data

Page 34: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

34

Adding New Business Logic (thanks to LinkedIn for slide)

Page 35: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

35

Adding Producers
• Define topics and # of partitions via Kafka tools
• (Possibly tell Kafka to balance leaders across machines)
• Start producing

Page 36: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

36

Adding Consumers
• With Kafka, adding consumers doesn’t impact producers
• Minor impact on brokers (just keeping track of connections)

Page 37: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

37

Producer Code

public class TestProducer {
    public static void main(String[] args) {
        long events = Long.parseLong(args[0]);
        long blocks = Long.parseLong(args[1]);
        Random rnd = new Random();

        Properties props = new Properties();
        props.put("broker.list", "vrd01.atlnp1:9092,vrd02.atlnp1:9092,vrd03.atlnp1:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("partitioner.class", "com.silverpop.kafka.playproducer.OrganizationPartitioner");
        ProducerConfig config = new ProducerConfig(props);

        Producer<Integer, String> producer = new Producer<Integer, String>(config);
        for (long nBlocks = 0; nBlocks < blocks; nBlocks++) {
            for (long nEvents = 0; nEvents < events; nEvents++) {
                long runtime = new Date().getTime();
                String msg = runtime + "," + (50 + nBlocks) + "," + nEvents + "," + rnd.nextInt(1000);
                long orgId = 50 + nBlocks; // orgId was undefined on the slide; the message above uses 50 + nBlocks as the org id
                String key = String.valueOf(orgId);
                KeyedMessage<Integer, String> data = new KeyedMessage<Integer, String>("test1", key, msg);
                producer.send(data);
            }
        }
        producer.close();
    }
}

Page 38: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

38

Simple Consumer Code

String topic = "test1";
int partition = 0;
SimpleConsumer simpleConsumer = new SimpleConsumer("vrd01.atlnp1", 9092, 100000, 64 * 1024, "test");
boolean loop = true;
long maxOffset = -1;
while (loop) {
    FetchRequest req = new FetchRequestBuilder().clientId("randomClient")
        .addFetch(topic, partition, maxOffset + 1, 100000)
        .build();
    FetchResponse fetchResponse = simpleConsumer.fetch(req);
    loop = false;
    for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(topic, partition)) {
        loop = true;
        ByteBuffer payload = messageAndOffset.message().payload();
        maxOffset = messageAndOffset.offset();
        byte[] bytes = new byte[payload.limit()];
        payload.get(bytes);
        System.out.println(String.valueOf(maxOffset) + ": " + new String(bytes, "UTF-8"));
    }
}
simpleConsumer.close();

Page 39: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

39

Consumer Groups Code

// create 4 partitions of the stream for topic "test", to allow 4 threads to consume
Map<String, List<KafkaStream<Message>>> topicMessageStreams =
    consumerConnector.createMessageStreams(ImmutableMap.of("test", 4));
List<KafkaStream<Message>> streams = topicMessageStreams.get("test");

// create a list of 4 threads to consume from each of the partitions
ExecutorService executor = Executors.newFixedThreadPool(4);

// consume the messages in the threads
for (final KafkaStream<Message> stream : streams) {
    executor.submit(new Runnable() {
        public void run() {
            for (MessageAndMetadata msgAndMetadata : stream) {
                // process message (msgAndMetadata.message())
            }
        }
    });
}

Page 40: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

40

Demo
• 4-node Kafka cluster
• 4-node Storm cluster
• 4-node MongoDB cluster
• Test producer in IntelliJ creates website events into Kafka
• Storm-Kafka spout reads from Kafka into a Storm topology
– Trident groups by organization and counts visits by day
• Trident endpoint writes to MongoDB
• MongoDB shell query to see counts change

Page 41: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

41

LinkedIn Clusters (2012 presentation)
• 8 nodes per datacenter
– ~20 GB RAM available for Kafka
– 6 TB storage, RAID 10, basic SATA drives
• 10,000 connections into the cluster for both production and consumption

Page 42: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

42

Performance (LinkedIn 2012 presentation)
• 10 billion messages/day
• Sustained peak:
– 172,000 messages/second written
– 950,000 messages/second read
• 367 topics
• 40 real-time consumers
• Many ad hoc consumers
• 10k connections/colo
• 9.5 TB log retained
• End-to-end delivery time: 10 seconds (avg)

Page 43: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

43

Questions so far?

Page 44: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

44

Something Completely Different
• Nathan Marz (Twitter, BackType)
• Creator of Storm

Page 45: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

45

Immutable Applications
• No updates to data
• Either insert or delete
• ‘Functional Applications’
• http://manning.com/marz/BD_meap_ch01.pdf

Page 46: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

46

(thanks to LinkedIn for slide)

Page 47: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

47

Information
• Apache Kafka site: http://kafka.apache.org/
• List of presentations: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations
• Kafka wiki: https://cwiki.apache.org/confluence/display/KAFKA/Index
• Paper: http://sites.computer.org/debull/A12june/pipeline.pdf
• Slides: http://www.slideshare.net/chriscurtin
• Me: [email protected], @ChrisCurtin on Twitter

Page 48: Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013

48

Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com/marketing-company/careers/open-positions.html

