Event Stream Processing with Kafka and Samza

Post on 12-Jul-2015


Transcript

Event Stream Processing with Kafka and Samza

Zach Cox - @zcox - zcox522@gmail.com
Iowa Code Camp - 1 Nov 2014

Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly


Event
Something happened
Record that fact so we can process it

Event
Describes what happened:
- Who did it?
- What did they do?
- What was the result?

Provides context:
- When did it happen?
- Where did it happen?
- How did they do it?
- Why did they do it?

Event Example: Pageview
User viewed web page

User
- ID: a2be9031-9465-4ecb-9302-9b962fa854ac
- IP: 65.121.142.238
- User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36

Web Page
- URL: https://www.mycompany.com/page.html

Context
- Time: 2014-10-14T10:49:24.438-05:00

Event Example: Clickthrough
User clicked link

User
- ID: a2be9031-9465-4ecb-9302-9b962fa854ac
- IP: 65.121.142.238
- User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36

Link
- URL: https://www.mycompany.com/product.html
- Referer: https://www.othersite.com/foo.html

Context
- Time: 2014-10-14T10:49:24.438-05:00

Event Example: User Update
User changed first name

User
- ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
- First name: Zach

Context
- Time: 2014-10-14T10:59:56.481-05:00
- IP: 65.121.142.238

Event Example: User Update
User uploaded a new profile image

User
- ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5

Profile Image
- URL: http://profile-images.s3.amazonaws.com/katy-perry.jpg

Context
- Time: 2014-10-14T10:59:56.481-05:00
- IP: 65.121.142.238
- Using: webcam

Event Example: Tweet
User posted a tweet

User
- ID:
- Username: @zcox
- Name: Zach Cox
- Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr

Tweet
- ID: 527152511568719872
- URL: https://twitter.com/zcox/status/527152511568719872
- URL: http://iowacodecamp.com/session/list#66
- Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp
- Mentions: @apachekafka, @samzastream, @iowacodecamp
- URLs: http://iowacodecamp.com/session/list#66

Context
- Time: 2014-10-14T10:59:56.481-05:00
- Using: Twitter for Android
- Location: 41.7146365,-93.5914038

Event Example: HTTP Request Latency
Some measured code took some time to execute

Code
- production.my-app.some-server.http.get-user-profile

Time to execute
- Min: 20 msec
- Max: 950 msec
- Average: 190 msec
- Median: 110 msec
- 50%: 100 msec
- 75%: 120 msec
- 95%: 150 msec
- 99%: 500 msec

Context
- Time: 2014-10-14T11:17:01.597-05:00

Event Example: Runtime Exception
Some code threw a runtime exception

Some code
- Stack trace: [...]

Exception
- Message: HBase read timed out

Context
- Time: 2014-10-14T11:21:23.749-05:00
- Application: my-app
- Machine: some-server.my-company.com

Event Example: Application Logging
Some code logged some information

[INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started

- Message: Slf4jEventHandler started
- Level: INFO
- Time: 2014-10-14 11:25:44,750
- Thread: sentry-akka.actor.default-dispatcher-2
- Logger: akka.event.slf4j.Slf4jEventHandler

Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly

Unified Log
- Events need to be sent somewhere
- Events should be accessible to any program
- Log provides a place for events to be sent and accessed
- Kafka is a great log service

Data Integration

Data Integration

Log

- Sequence of records
- Append-only
- Ordered by time
- Each record assigned a unique sequential number
- Records stored persistently on disk
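The log described above can be sketched in a few lines of plain Scala. This is an illustrative in-memory model, not Kafka's API: Kafka additionally persists records to disk and partitions/replicates the log across a cluster. The `Log` class and its method names are assumptions for the sketch.

```scala
// Minimal in-memory log: append-only, ordered, each record assigned
// a unique sequential number (its offset).
import scala.collection.mutable.ArrayBuffer

class Log[A] {
  private val records = ArrayBuffer.empty[A]

  // Append a record; return the offset it was assigned.
  def append(record: A): Long = {
    records += record
    records.size - 1L
  }

  // Any consumer can start reading at any offset it chooses;
  // each consumer tracks its own position independently.
  def read(offset: Long, max: Int): Seq[A] =
    records.slice(offset.toInt, offset.toInt + max).toSeq
}
```

Because consumers only hold an offset, many programs can read the same log at their own pace, which is what decouples producers from consumers.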

Log Service

Logs in Distributed Databases

Traditional Cache

- Cache misses
- Cache invalidation

Infrastructure as Distributed Database

Cache is now replicated from DB

Infrastructure as Distributed Database

Cache can be in-process with web app

Log for Event Streams
- Simple to send events to
- Broadcasts events to all consumers
- Buffers events on disk: producers and consumers decoupled
- Consumers can start reading at any offset

Kafka
- Apache OSS, mainly from LinkedIn
- Handles all the logs/event streams
- High-throughput: millions of events/sec
- High-volume: TBs - PBs of events
- Low-latency: single-digit msec from producer to consumer
- Scalable: topics are partitioned across cluster
- Durable: topics are replicated across cluster
- Available: auto failover

Twitter Example

- Receive messages via long-lived HTTP connection as JSON
- Write messages to a Kafka topic

Twitter Streaming API

Twitter Example

Twitter rate-limits clients:
- <1% sample, ~50-100 tweets/sec
- 400 keywords, ? tweets/sec

1 weird trick to get more tweets: multiple clients, same Kafka topic!

Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly

Event Stream Processing
- Turn events into valuable, actionable information
- Process events as they happen, not later (batch)
- Do all of this reliably, at scale

Event Stream Processor

Event Stream Processor: Input

Event Stream Processor: Output

Samza
- Event stream processing framework
- Apache OSS, mainly from LinkedIn
- Simple Java API
- Scalable: runs jobs in parallel across cluster
- Reliable: fault-tolerance and durability built-in
- Tools for stateful stream processing

Samza Job
1) Class that extends StreamTask:

    import org.apache.samza.system.IncomingMessageEnvelope
    import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

    class MyTask extends StreamTask {
      override def process(
          envelope: IncomingMessageEnvelope,
          collector: MessageCollector,
          coordinator: TaskCoordinator): Unit = {
        // process message in envelope
      }
    }

2) my-task.properties config file:

    job.factory.class=org.apache.samza.job.local.ThreadJobFactory
    job.name=my-task
    task.class=com.banno.MyTask
    ...

Stateless Processing
- One event at a time
- Take action using only that event

SELECT * FROM raw_messages WHERE message_type = 'status';
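Stateless processing like that query can be sketched in plain Scala (not the Samza API): each message is examined on its own, with no state carried between messages, so a simple filter is enough. The `RawMessage` field names are illustrative.

```scala
// Stateless: decide per message, using only that message.
// Mirrors: SELECT * FROM raw_messages WHERE message_type = 'status'
case class RawMessage(messageType: String, payload: String)

def statusesOnly(messages: Seq[RawMessage]): Seq[RawMessage] =
  messages.filter(_.messageType == "status")
```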

Samza Job: Separate Message Types

- Many message types from Twitter
- Samza job to separate into type-specific streams
- Other jobs process specific message types

Stateful Stream Processing
- One event at a time
- Take action using that event and state
- State = data built up from past events
  - Aggregation
  - Grouping
  - Joins

Aggregation
- State = aggregated values (e.g. count)
- Incorporate each new event into that aggregation
- Output aggregated values as events to new stream
- What happens if job stops? Crash, deploy, ...
- Can't lose state! Samza handles this all for you

SELECT COUNT(*) FROM statuses;

Samza Job: Total Status Count

- Increment a counter on every status (tweet)
- Periodically output current count
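A plain-Scala sketch of that counting job (not the Samza API): the state is a single counter, and the current value is emitted periodically. In Samza the counter would survive crashes via a changelog-backed store; here it is just a `var`, and `emitEvery` is an assumed stand-in for a windowing timer.

```scala
// Stateful aggregation: count every status, periodically emit the total.
class StatusCounter(emitEvery: Int) {
  private var count = 0L
  private val emitted = scala.collection.mutable.ArrayBuffer.empty[Long]

  def process(status: String): Unit = {
    count += 1
    // Stand-in for periodically sending the count to an output stream.
    if (count % emitEvery == 0) emitted += count
  }

  def outputs: Seq[Long] = emitted.toSeq
}
```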

Grouping
- State = some data per group
- Two Samza jobs:
  1) Output statuses by user (map)
  2) Count statuses per user (reduce)
- Output: (user, count)
- Could use as input to job that sorts by count (most active users)

SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;

SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
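The two GROUP BY queries above can be sketched in plain Scala: one function produces the (user, count) pairs, the other sorts them for the "most active users" query. In Samza this would be the map/reduce job pair described earlier, with one counter per user in partitioned state; the `Status` field names are illustrative.

```scala
// Grouping: state is one count per user.
case class Status(userId: String, text: String)

// SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id
def countsByUser(statuses: Seq[Status]): Map[String, Int] =
  statuses.groupBy(_.userId).map { case (user, ss) => user -> ss.size }

// ... ORDER BY COUNT(user_id) DESC LIMIT n
def mostActive(counts: Map[String, Int], limit: Int): Seq[(String, Int)] =
  counts.toSeq.sortBy { case (_, n) => -n }.take(limit)
```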

Joins
- Samza job has multiple input streams
- Stream-Stream join: ad impressions + ad clicks
- Stream-Table join: page views + user zip code
- Table-Table join: user data + user settings
- Joins involving tables need DB changelog

SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
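A stream-table join like that query can be sketched in plain Scala: each status on the stream is enriched from a local user table. In Samza that table would be a local store kept current by consuming the users DB changelog; here a `Map` stands in for it, and the case class names are illustrative.

```scala
// Stream-table join: enrich each status with the user's username.
case class Status(userId: String, text: String)
case class User(id: String, username: String)

// SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id
def joinStatuses(
    statuses: Seq[Status],
    userTable: Map[String, User]): Seq[(String, String)] =
  statuses.flatMap { s =>
    // Inner join: statuses with no matching user are dropped.
    userTable.get(s.userId).map(u => (u.username, s.text))
  }
```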

What else can we compute?
- Tweets per sec/min/hour (recent, not for-all-time)
- Enrich tweets with weather at current location
- Most active users, locations, etc
- Emojis: % of tweets that contain, top emojis
- Hashtags: % of tweets that contain, top #hashtags
- URLs: % of tweets that contain, top domains
- Photo URLs: % of tweets that contain, top domains
- Text analysis: sentiment, spam

Reprocessing
http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html

Druid

- Send it events
- Druid reads from Kafka topic
- That Kafka topic is a Samza output stream

Super fast time-series queries: aggregations, filters, top-n, etc

http://druid.io

Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly

Let's chat!
Zach Cox
@zcox
zcox522@gmail.com
Banno is hiring!