Online Media Data Stream Processing with Kafka

Post on 14-Dec-2014


CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3


Overview

•  What is Streaming Data?
•  Why Kafka?
•  Kafka Architecture
•  Use Case: Prospective Search

18. September 2012

About Sentric

•  Spin-off of MeMo News AG, the leading provider for Social Media Monitoring & Analytics in Switzerland

•  Big Data expert, focused on Hadoop, HBase and Solr

•  Objective: Transforming data into insights


CC 2.0 by audreyjm529| http://flic.kr/p/mNMtL  


Data Streams

•  Website Activity Data
   •  User activity
   •  Server activity
•  Social Media Data
•  News Data
•  …
•  How to Analyze in Real-Time?

What is Streaming Data?


Offline vs. Online


[Figure: time axis t with a "now" marker, contrasting Offline (Hadoop/MR) with Online (Kafka).]

CC 2.0 by Tom Hilton | http://flic.kr/p/54KSXy  


Streaming Systems

•  Message Queues (RabbitMQ, ActiveMQ)
   •  do not scale / have no persistence
•  Flume / Scribe
   •  Log-Aggregation only, high throughput and scalable, push model
   •  Focus on offline consumption
•  Kafka
   •  High throughput and scalable, pull model
   •  Different consumption profiles

Why Kafka?


Consumer Performance


Source: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

CC 2.0 by Presidente | http://flic.kr/p/2ptSZ  


Key Concepts

•  Messaging System
•  Publish-Subscribe
•  Persistent
•  High-Throughput

Kafka Architecture


Messaging


[Diagram: Producers push messages to the Broker; Consumers pull messages from the Broker; ZooKeeper is part of the setup.]


Publish-Subscribe


[Diagram: messages (Msg) are published to Topics such as "logs" and "page-views"; Consumers read from the topics.]
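The publish-subscribe idea on this slide can be sketched in a few lines. This is a toy in-memory model, not Kafka's API: the class `TinyBroker` and its methods `publish` and `fetch` are hypothetical names chosen for illustration. Only the topic names "logs" and "page-views" come from the slide.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory broker (hypothetical): one append-only list per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers push: the message is appended to the topic's list.
        self.topics[topic].append(message)

    def fetch(self, topic, position):
        # Consumers pull: return every message at or after `position`.
        return self.topics[topic][position:]


broker = TinyBroker()
broker.publish("logs", "server started")
broker.publish("page-views", "/home")
broker.publish("page-views", "/about")

# Two independent subscribers of the same topic each see every message,
# because reading does not remove anything from the topic.
dashboard_view = broker.fetch("page-views", 0)
archiver_view = broker.fetch("page-views", 0)
logs_view = broker.fetch("logs", 0)
```

Because a fetch never deletes messages, any number of consumers can read the same topic at their own pace.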


Persistent

•  Persists messages to disk
•  Topic is the base abstraction
•  Binary write-ahead log
•  No message ID
•  Message offset ID (byte position)
•  Messages are retained for a specific time
   •  Default is 7 days
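A minimal sketch of a log addressed by byte position with time-based retention, under stated assumptions: the class `ByteLog`, the 4-byte length prefix, and the per-message expiry are all illustrative choices, not Kafka's on-disk format or retention mechanism. The log lives in memory here; only the 7-day figure and the byte-position addressing come from the slide.

```python
import struct
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # retention default named on the slide, in seconds


class ByteLog:
    """Toy append-only log (hypothetical); a message is addressed by byte position."""

    def __init__(self):
        self.data = bytearray()
        self.base = 0    # logical byte position of data[0]
        self.index = []  # (logical position, append timestamp) per message

    def append(self, payload, now=None):
        position = self.base + len(self.data)
        # Assumed framing: 4-byte big-endian length, then the payload bytes.
        self.data += struct.pack(">I", len(payload)) + payload
        self.index.append((position, time.time() if now is None else now))
        return position

    def read(self, position):
        start = position - self.base
        if start < 0 or start + 4 > len(self.data):
            raise IndexError("position is expired or past the end of the log")
        (length,) = struct.unpack_from(">I", self.data, start)
        payload = bytes(self.data[start + 4:start + 4 + length])
        return payload, position + 4 + length  # payload and next position

    def expire(self, max_age_seconds, now=None):
        now = time.time() if now is None else now
        expired = 0
        while (expired < len(self.index)
               and now - self.index[expired][1] > max_age_seconds):
            expired += 1
        if expired == 0:
            return
        if expired == len(self.index):
            new_base = self.base + len(self.data)
        else:
            new_base = self.index[expired][0]
        # Drop the oldest bytes; surviving positions keep their original values.
        del self.data[:new_base - self.base]
        del self.index[:expired]
        self.base = new_base


log = ByteLog()
first = log.append(b"hello", now=0)
second = log.append(b"kafka", now=SEVEN_DAYS)
payload, next_position = log.read(first)
log.expire(SEVEN_DAYS, now=SEVEN_DAYS + 1)  # only the first message is too old
```

Positions stay stable after expiry, so a consumer that stored `second` can still read that message once the older one is gone.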


High-Throughput

•  API Simplicity
   •  Append message
   •  Fetch message from given byte position
•  Batching
•  Stateless Broker
•  O(1) disk access (no seeks)
•  Use of operating system features
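The append/fetch API, batching, and the stateless broker can be illustrated together. The following is a sketch under assumptions, not Kafka's wire protocol: `StatelessBroker`, `Consumer`, `poll`, and the 4-byte length-prefix framing are hypothetical. The point it shows is that the broker keeps no per-consumer state; each consumer remembers its own byte position.

```python
import struct

HEADER = struct.Struct(">I")  # assumed framing: 4-byte big-endian length prefix


class StatelessBroker:
    """Toy broker (hypothetical): stores bytes, knows nothing about consumers."""

    def __init__(self):
        self.log = bytearray()

    def append(self, payloads):
        # Batching: several messages are appended in a single call.
        for payload in payloads:
            self.log += HEADER.pack(len(payload)) + payload

    def fetch(self, position, max_bytes):
        # Return raw bytes starting at `position`; no state is recorded.
        return bytes(self.log[position:position + max_bytes])


class Consumer:
    def __init__(self, broker):
        self.broker = broker
        self.position = 0  # the consumer, not the broker, tracks progress

    def poll(self, max_bytes=64):
        chunk = self.broker.fetch(self.position, max_bytes)
        messages, cursor = [], 0
        while cursor + HEADER.size <= len(chunk):
            (length,) = HEADER.unpack_from(chunk, cursor)
            end = cursor + HEADER.size + length
            if end > len(chunk):
                break  # partial message at the end of the chunk; refetch later
            messages.append(chunk[cursor + HEADER.size:end])
            cursor = end
        # Advance only past complete messages. A message larger than
        # max_bytes would never be delivered in this simplified sketch.
        self.position += cursor
        return messages


broker = StatelessBroker()
broker.append([b"a", b"bb", b"ccc"])  # 5 + 6 + 7 = 18 bytes in the log

consumer = Consumer(broker)
first_batch = consumer.poll(max_bytes=12)   # room for the first two messages
second_batch = consumer.poll(max_bytes=12)  # picks up the third message
```

Since progress is just an integer on the consumer side, replaying the stream amounts to resetting `consumer.position` to an earlier value.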


CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf


Solution Architecture

Prospective Search


[Diagram: solution components: n News Agents, REST, Kafka, HBase, MySQL, Solr, Web-UI, RT Alerts.]

Icons by http://dryicons.com


Prospective Search with Kafka


[Diagram: a Kafka Consumer pulls messages in batches for Processing; Prospective Search produces RT Alerts.]
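Prospective search inverts ordinary search: the queries are stored, and each incoming document is matched against them. The slides do not say how the matching is implemented, so the sketch below uses a simple stand-in, assuming a query matches when all of its terms occur in the document. `ProspectiveSearch`, `register`, `match`, `process_batch`, the query IDs, and the sample news texts are all hypothetical.

```python
import re


def tokenize(text):
    # Lowercased word set; a deliberately naive stand-in for real text analysis.
    return set(re.findall(r"\w+", text.lower()))


class ProspectiveSearch:
    """Toy matcher (hypothetical): stored queries are sets of required terms."""

    def __init__(self):
        self.queries = {}

    def register(self, query_id, terms):
        self.queries[query_id] = {term.lower() for term in terms}

    def match(self, document):
        tokens = tokenize(document)
        # A query matches when every one of its terms occurs in the document.
        return sorted(
            query_id
            for query_id, terms in self.queries.items()
            if terms <= tokens
        )


def process_batch(search, batch):
    """Match one pulled batch of (doc_id, text) pairs; return alerts to send."""
    alerts = []
    for doc_id, text in batch:
        for query_id in search.match(text):
            alerts.append((query_id, doc_id))
    return alerts


search = ProspectiveSearch()
search.register("q-hadoop", ["hadoop"])
search.register("q-swiss-bank", ["swiss", "bank"])

# Stands in for one batch pulled by the Kafka consumer.
batch = [
    (1, "Swiss bank reports record profit"),
    (2, "Hadoop summit opens in Zurich"),
    (3, "Weather stays mild"),
]
alerts = process_batch(search, batch)
```

Each alert pairs a stored query with the document that triggered it, which is the information a real-time alerting step would need.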

Icons by http://dryicons.com


Resources to get started

•  http://incubator.apache.org/kafka/
•  http://sites.computer.org/debull/A12june/A12JUN-CD.pdf


Thank you!

Questions? Christian Gügi, christian.guegi@sentric.ch

Swiss Big Data User Group
