From a Kafkaesque Story to the Promised Land


LivePerson moved from an ETL based data platform to a new data platform based on emerging technologies from the Open Source community: Hadoop, Kafka, Storm, Avro and more. This presentation tells the story and focuses on Kafka.

Transcript


From a Kafkaesque Story to the Promised Land

Ran Silberman, 7/7/2013

Open Source paradigm

The Cathedral & the Bazaar by Eric S. Raymond, 1999: the struggle between top-down and bottom-up design

Challenges of a data platform [1]

• High throughput

• Horizontal scale to address growth

• High availability of data services

• No data loss

• Satisfy Real-Time demands

• Enforce structural data with schemas

• Process Big Data and Enterprise Data

• Single Source of Truth (SSOT)

SLAs of the data platform

[Diagram: a central Data Bus connects the real-time servers to three kinds of consumers: real-time customers, real-time dashboards, and offline customers served through the BI DWH.]

• Offline path SLA: 1. 98% of data in < 1/2 hr; 2. 99.999% in < 4 hrs
• Real-time path SLA: 1. 98% of messages in < 500 msec; 2. no send > 2 sec

Legacy Data flow in LivePerson

[Diagram: real-time servers feed an ETL pipeline (Sessionize, Modeling, Schema View) that loads the BI DWH (Oracle); customers view reports on top of the DWH.]

1. Move to Hadoop

[Diagram: real-time servers write raw data to HDFS. Hadoop runs the pipeline steps (Sessionize, Modeling, Schema View), and an MR job transfers the results to the BI DWH (now Vertica), where customers view reports.]

2. Move to Kafka

[Diagram: real-time servers now publish events to Kafka (Topic-1). Hadoop consumes from Kafka into HDFS; the MR job still transfers data to the BI DWH (Vertica) for customer reports.]

3. Integrate with new producers

[Diagram: new real-time servers publish to a second Kafka topic (Topic-2) alongside Topic-1; the flow from Kafka through Hadoop to the BI DWH is unchanged.]

4. Add Real-time BI

[Diagram: a Storm topology consumes the Kafka topics in parallel to Hadoop, adding a real-time path beside the batch flow.]

5. Standardize Data Model using Avro

[Diagram: messages on the Kafka topics are now Avro-encoded; Camus consumes them from Kafka into HDFS, and the Storm topology consumes the same topics in real time.]

6. Define Single Source of Truth (SSOT)

[Diagram: the same architecture, with HDFS now designated as the Single Source of Truth for all downstream consumers.]

Kafka[2] as Backbone for Data

• Central "Message Bus"

• Support multiple topics (MQ style)

• Write-ahead log persisted to files

• Distributed & Highly Available

• Horizontal Scale

• High throughput (tens of MB/sec per server)

• Service is agnostic to consumers' state

• Retention policy

Kafka Architecture

Kafka Architecture cont.

[Diagram: Producers 1-3 write to a three-node Kafka cluster coordinated by ZooKeeper; three consumers in a single consumer group (Group1) divide the topic's partitions among themselves.]

Kafka Architecture cont.

[Diagram: two producers write to two topics (Topic1, Topic2) spread across a four-node cluster; three independent consumers read them, with ZooKeeper coordinating brokers and consumers.]

Kafka replays messages

[Diagram: a partition's log on the broker retains everything between a min offset and a max offset; a consumer may fetch from any retained offset.]

fetchRequest = new FetchRequest(topic, partition, offset, size);

The current offset is taken from ZooKeeper; the special value -2 requests the earliest available offset and -1 the latest.
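
A minimal replay sketch against the Kafka 0.8 SimpleConsumer Java API (the snippet above is the older 0.7-style constructor); broker host, topic, partition, and offset values are illustrative:

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

// Direct connection to one broker; SimpleConsumer leaves broker discovery
// and failover to the caller.
SimpleConsumer consumer =
        new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "replay-client");

long replayOffset = 42L;  // e.g., an old offset previously saved in ZooKeeper

FetchRequest request = new FetchRequestBuilder()
        .clientId("replay-client")
        .addFetch("topic1", 0, replayOffset, 100000)  // topic, partition, offset, maxBytes
        .build();

FetchResponse response = consumer.fetch(request);
for (MessageAndOffset mo : response.messageSet("topic1", 0)) {
    // re-process the old message; mo.offset() advances toward the max offset
}
consumer.close();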

Kafka API[3]

• Producer API

• Consumer API

o High-level API: uses ZooKeeper to access brokers and to save offsets

o SimpleConsumer API: direct access to Kafka brokers

• Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer

Kafka API[3] cont.

• Producer

List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
producer.send(messages);

• Consumer

Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        consumer.createMessageStreams(Collections.singletonMap("topic1", 1));
for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
    // do something with message
}

Kafka in Unit Testing

• Use the class KafkaServer

• Run an embedded broker inside the test, as sketched below
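
A minimal sketch of such a fixture, assuming Kafka 0.8's classes on the test classpath and a test ZooKeeper already listening on localhost:2181; the log directory stands in for a per-test temp folder:

import java.util.Properties;
import kafka.server.KafkaConfig;
import kafka.server.KafkaServer;

Properties props = new Properties();
props.put("broker.id", "0");
props.put("port", "9092");
props.put("log.dir", "/tmp/kafka-test-logs");      // stand-in for a temp dir
props.put("zookeeper.connect", "localhost:2181");  // assumes a test ZooKeeper

// The second argument is Kafka's clock abstraction (a Scala singleton).
KafkaServer server = new KafkaServer(new KafkaConfig(props), kafka.utils.SystemTime$.MODULE$);
server.startup();
try {
    // exercise producers and consumers against localhost:9092 ...
} finally {
    server.shutdown();
    server.awaitShutdown();
}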

Introducing Avro[5]

• Schema representation using JSON

• Supported types

o Primitive types: boolean, int, long, string, etc.

o Complex types: Record, Enum, Union, Array, Map, Fixed

• Data is serialized using its schema

• Avro data files include the schema in the file header
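
For example, a record schema and a message conforming to it might look like this; the field names are illustrative, not LivePerson's actual model, built with the standard org.apache.avro API:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// JSON schema: a record with two primitive-typed fields.
String schemaJson =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"sessionId\",\"type\":\"string\"},"
      + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";
Schema schema = new Schema.Parser().parse(schemaJson);

// Build a record that conforms to the schema.
GenericRecord event = new GenericData.Record(schema);
event.put("sessionId", "102122");
event.put("timestamp", 12346L);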

Add Avro protocol to the story

[Diagram: Producer 1, a Schema Repo, Kafka topics (Topic 1, Topic 2), and the consumers (Camus/Storm).]

1. The producer registers schema 1.0 in the Schema Repo.
2. The producer creates a message according to schema 1.0 and encodes it with that schema.
3. The producer adds the schema revision to the message header and sends the message.
4. Kafka passes the message on: each Kafka message is a header (schema revision, e.g. 1.0) followed by the Avro payload.
5. The consumers (Camus/Storm) read the message, extract the header, and obtain the schema version.
6. The consumers get schema 1.0 by version from the Schema Repo and decode the message with it.

Example payload: {event1:{header:{sessionId:"102122", timestamp:"12346"}}...}
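
A hedged sketch of this round trip: a one-byte schema revision is prepended to the Avro binary payload, and the consumer resolves the writer's schema through a repository lookup (schemaRepo is a hypothetical stand-in for the Schema Repo; schema and event are from the previous sketch):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Producer side: version byte + Avro-encoded record.
ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(1);  // schema revision 1.0 as a single byte (assumed header layout)
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(event, encoder);
encoder.flush();
byte[] kafkaPayload = out.toByteArray();  // the Kafka message body

// Consumer side: read the version byte, fetch the schema, decode the rest.
int version = kafkaPayload[0];
Schema writerSchema = schemaRepo.get(version);  // hypothetical repository lookup
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(writerSchema);
BinaryDecoder decoder = DecoderFactory.get()
        .binaryDecoder(kafkaPayload, 1, kafkaPayload.length - 1, null);
GenericRecord decoded = reader.read(null, decoder);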

Kafka + Storm + Avro example

• Demonstrates Avro data passing from Kafka to Storm

• Explains Avro revision evolution

• Requires Kafka and Zookeeper installed

• Uses the Storm artifact and the Kafka-Spout artifact in Maven

• A Maven plugin generates Java classes from the Avro schema

• https://github.com/ransilberman/avro-kafka-storm
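
For orientation, wiring the spout into a topology looks roughly like this (storm-kafka's SpoutConfig/KafkaSpout; the ZooKeeper address and the decoder bolt are placeholders, and the linked repo has the full working example):

import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

// The spout reads "topic1" and keeps its offsets under the given ZooKeeper path.
SpoutConfig spoutConfig =
        new SpoutConfig(new ZkHosts("localhost:2181"), "topic1", "/kafka-spout", "avro-demo");
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
builder.setBolt("avro-decoder", new AvroDecoderBolt(), 2)  // hypothetical bolt
       .shuffleGrouping("kafka-spout");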

Resiliency

[Diagram: on the producer machine, the producer persists every message to a local file, and a Kafka Bridge re-sends the spooled messages to a "Consistent" topic; the producer also sends directly to a "Fast" topic. Storm consumes the Fast topic in real time; Hadoop consumes the Consistent topic offline.]
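
A sketch of the producer side of this design, under the assumptions above (spool is a hypothetical append-only writer for the local file; the Kafka Bridge replays spooled messages into the Consistent topic):

// Every message is made durable locally first; the fast path is best-effort.
void publish(String msg) throws java.io.IOException {
    spool.write(msg + "\n");  // consistent path: survives broker outages
    spool.flush();
    try {
        producer.send(new kafka.producer.KeyedMessage<String, String>("fast-topic", null, msg));
    } catch (Exception e) {
        // the fast path lost this message, but the bridge still delivers it
        // to the Consistent topic from the local file
    }
}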

Challenges of Kafka

• Still not mature enough

• Not enough supporting tools (viewers, maintenance)

• Duplicates may occur

• API is not documented well enough

• Open source: supported by the community only

• Difficult to replay messages from a specific point in time

• Eventually consistent...

Eventually Consistent

Because it is a distributed system:

• No guarantee of delivery order

• No way to tell which broker a message was sent to

• Kafka does not guarantee that there are no duplicates

• ...but eventually, all messages will arrive!

[Image: an event's long trek from "Event generated" to "Event destination", across a desert.]

Major Improvements in Kafka 0.8[4]

• Partition replication

• Message send guarantees

• Consumer offsets are sequential numbers (e.g., 1, 2, 3, ...) instead of byte positions
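
The send guarantee is configured on the producer. A hedged Kafka 0.8 sketch (the broker list is a placeholder):

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.ProducerConfig;

Properties props = new Properties();
props.put("metadata.broker.list", "broker1:9092,broker2:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");
// 0 = fire and forget, 1 = leader ack, -1 = ack from all in-sync replicas
props.put("request.required.acks", "-1");
Producer<String, String> producer =
        new Producer<String, String>(new ProducerConfig(props));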

Addressing Data Challenges

• High throughput

o Kafka, Hadoop

• Horizontal scale to address growth

o Kafka, Storm, Hadoop

• High availability of data services

o Kafka, Storm, Zookeeper

• No data loss

o Highly Available services, No ETL

Addressing Data Challenges Cont.

• Satisfy Real-Time demands

o Storm

• Enforce structural data with schemas

o Avro

• Process Big Data and Enterprise Data

o Kafka, Hadoop

• Single Source of Truth (SSOT)

o Hadoop, No ETL

References

• [1] Satisfying New Requirements for Data Integration, David Loshin

• [2] Apache Kafka

• [3] Kafka API

• [4] Kafka 0.8 Quick Start

• [5] Apache Avro

• [6] Storm

Thank you!
