Event Stream Processing with Kafka and Samza
Zach Cox - @zcox - zcox522@gmail.com
Iowa Code Camp - 1 Nov 2014
Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly
References
Kafka:
- Kafka Documentation
- The Log: What every software engineer should know about real-time data's unifying abstraction
- Benchmarking Apache Kafka
Samza:
- Samza Documentation
- Questioning the Lambda Architecture
- Moving faster with data streams: The rise of Samza at LinkedIn
- Why local state is a fundamental primitive in stream processing
- Real-time insights into LinkedIn's performance using Apache Samza
Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly
Event
- Something happened
- Record that fact so we can process it

Event
Describes what happened:
- Who did it?
- What did they do?
- What was the result?
Provides context:
- When did it happen?
- Where did it happen?
- How did they do it?
- Why did they do it?
Event Example: Pageview
User viewed web page
User:
- ID: a2be9031-9465-4ecb-9302-9b962fa854ac
- IP: 65.121.142.238
- User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Web Page:
- URL: https://www.mycompany.com/page.html
Context:
- Time: 2014-10-14T10:49:24.438-05:00
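As a concrete wire format, an event like this might be serialized as the JSON value of a Kafka message. The field names below are illustrative only, not a fixed schema from the talk:

```json
{
  "type": "pageview",
  "time": "2014-10-14T10:49:24.438-05:00",
  "user": {
    "id": "a2be9031-9465-4ecb-9302-9b962fa854ac",
    "ip": "65.121.142.238",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36"
  },
  "page": {
    "url": "https://www.mycompany.com/page.html"
  }
}
```

A self-describing encoding like this lets any downstream consumer process the event without coordinating with the producer.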
Event Example: Clickthrough
User clicked link
User:
- ID: a2be9031-9465-4ecb-9302-9b962fa854ac
- IP: 65.121.142.238
- User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Link:
- URL: https://www.mycompany.com/product.html
- Referer: https://www.othersite.com/foo.html
Context:
- Time: 2014-10-14T10:49:24.438-05:00
Event Example: User Update
User changed first name
User:
- ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
- First name: Zach
Context:
- Time: 2014-10-14T10:59:56.481-05:00
- IP: 65.121.142.238
Event Example: User Update
User uploaded a new profile image
User:
- ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
Profile Image:
- URL: http://profile-images.s3.amazonaws.com/katy-perry.jpg
Context:
- Time: 2014-10-14T10:59:56.481-05:00
- IP: 65.121.142.238
- Using: webcam
Event Example: Tweet
User posted a tweet
User:
- ID:
- Username: @zcox
- Name: Zach Cox
- Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr
Tweet:
- ID: 527152511568719872
- URL: https://twitter.com/zcox/status/527152511568719872
- Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp
- Mentions: @apachekafka, @samzastream, @iowacodecamp
- URLs: http://iowacodecamp.com/session/list#66
Context:
- Time: 2014-10-14T10:59:56.481-05:00
- Using: Twitter for Android
- Location: 41.7146365,-93.5914038
Event Example: HTTP Request Latency
Some measured code took some time to execute
Code:
- production.my-app.some-server.http.get-user-profile
Time to execute:
- Min: 20 msec
- Max: 950 msec
- Average: 190 msec
- Median: 110 msec
- 50%: 100 msec
- 75%: 120 msec
- 95%: 150 msec
- 99%: 500 msec
Context:
- Time: 2014-10-14T11:17:01.597-05:00
Event Example: Runtime Exception
Some code threw a runtime exception
Some code:
- Stack trace: [...]
Exception:
- Message: HBase read timed out
Context:
- Time: 2014-10-14T11:21:23.749-05:00
- Application: my-app
- Machine: some-server.my-company.com
Event Example: Application Logging
Some code logged some information:

[INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started

- Message: Slf4jEventHandler started
- Level: INFO
- Time: 2014-10-14 11:25:44,750
- Thread: sentry-akka.actor.default-dispatcher-2
- Logger: akka.event.slf4j.Slf4jEventHandler
Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly
Unified Log
- Events need to be sent somewhere
- Events should be accessible to any program
- A log provides a place for events to be sent and accessed
- Kafka is a great log service
Data Integration
Log
- Sequence of records
- Append-only
- Ordered by time
- Each record assigned a unique sequential number
- Records stored persistently on disk
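The structure above can be sketched as a toy in-memory log. This is an illustration only; Kafka layers partitioning, replication, and durable disk storage on top of the same append-only idea:

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: records receive unique, sequential offsets
// and are never modified after being appended.
public class ToyLog {
    private final List<String> records = new ArrayList<>();

    // Append a record; return its offset (the sequential number).
    public long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Any consumer may start reading at any offset it chooses,
    // fully decoupled from the producer.
    public List<String> readFrom(long offset) {
        return records.subList((int) offset, records.size());
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append("pageview"); // offset 0
        log.append("click");    // offset 1
        log.append("tweet");    // offset 2
        System.out.println(log.readFrom(1)); // [click, tweet]
    }
}
```

Because the offset is just a number held by each consumer, two consumers can read the same log at different positions without interfering with each other.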
Log Service
Logs in Distributed Databases
Traditional Cache
- Cache misses
- Cache invalidation
Infrastructure as Distributed Database
Cache is now replicated from DB
Infrastructure as Distributed Database
Cache can be in-process with web app
Log for Event Streams
- Simple to send events to
- Broadcasts events to all consumers
- Buffers events on disk: producers and consumers decoupled
- Consumers can start reading at any offset
Kafka
- Apache OSS, mainly from LinkedIn
- Handles all the logs/event streams
- High-throughput: millions of events/sec
- High-volume: TBs - PBs of events
- Low-latency: single-digit msec from producer to consumer
- Scalable: topics are partitioned across the cluster
- Durable: topics are replicated across the cluster
- Available: auto failover
Twitter Example
Twitter Streaming API:
- Receive messages via long-lived HTTP connection as JSON
- Write messages to a Kafka topic
Twitter Example
Twitter rate-limits clients:
- <1% sample: ~50-100 tweets/sec
- 400 keywords: ? tweets/sec
1 weird trick to get more tweets: multiple clients, same Kafka topic!
Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly
Event Stream Processing
- Turn events into valuable, actionable information
- Process events as they happen, not later (batch)
- Do all of this reliably, at scale
Event Stream Processor
Event Stream Processor: Input
Event Stream Processor: Output
Samza
- Event stream processing framework
- Apache OSS, mainly from LinkedIn
- Simple Java API
- Scalable: runs jobs in parallel across cluster
- Reliable: fault-tolerance and durability built-in
- Tools for stateful stream processing
Samza Job
1) Class that extends StreamTask:

import org.apache.samza.system.IncomingMessageEnvelope
import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

class MyTask extends StreamTask {
  override def process(
      envelope: IncomingMessageEnvelope,
      collector: MessageCollector,
      coordinator: TaskCoordinator): Unit = {
    // process message in envelope
  }
}

2) my-task.properties config file:

job.factory.class=org.apache.samza.job.local.ThreadJobFactory
job.name=my-task
task.class=com.banno.MyTask
...
Stateless Processing
- One event at a time
- Take action using only that event

SELECT * FROM raw_messages WHERE message_type = 'status';
Samza Job: Separate Message Types
- Many message types from Twitter
- Samza job to separate them into type-specific streams
- Other jobs process specific message types
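The routing logic of such a job can be sketched without the Samza API itself. The stream names below are made up for illustration; in a real Samza task you would forward each message with collector.send(...) instead of returning a name:

```java
// Sketch of the "separate message types" job: each incoming raw
// message is routed to a type-specific output stream based on a
// type field. (Hypothetical stream names.)
public class MessageRouter {
    public static String outputStreamFor(String messageType) {
        switch (messageType) {
            case "status": return "twitter-statuses";
            case "delete": return "twitter-deletes";
            case "limit":  return "twitter-limits";
            default:       return "twitter-other";
        }
    }

    public static void main(String[] args) {
        System.out.println(outputStreamFor("status")); // twitter-statuses
        System.out.println(outputStreamFor("scrub_geo")); // twitter-other
    }
}
```

Downstream jobs then subscribe only to the stream for the type they care about, which keeps each job simple and stateless.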
Stateful Stream Processing
- One event at a time
- Take action using that event and state
- State = data built up from past events:
  - Aggregation
  - Grouping
  - Joins
Aggregation
- State = aggregated values (e.g. count)
- Incorporate each new event into that aggregation
- Output aggregated values as events to a new stream
- What happens if the job stops? Crash, deploy, ...
- Can't lose state!
- Samza handles this all for you
SELECT COUNT(*) FROM statuses;
Samza Job: Total Status Count
- Increment a counter on every status (tweet)
- Periodically output current count
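Independent of the Samza API, the job's logic is just a counter with periodic flushes. The output interval here is arbitrary; in Samza the counter would live in fault-tolerant local state (backed by a changelog topic) so a crash or redeploy does not lose it:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the total-status-count job: increment a counter for
// every status, and emit the running total every `outputEvery`
// messages (simulating periodic output to a new stream).
public class TotalCount {
    public static List<Long> countWithPeriodicOutput(int numStatuses, int outputEvery) {
        List<Long> outputs = new ArrayList<>();
        long count = 0;
        for (int i = 0; i < numStatuses; i++) {
            count++;                        // process one status
            if (count % outputEvery == 0) {
                outputs.add(count);         // periodic output event
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(countWithPeriodicOutput(10, 3)); // [3, 6, 9]
    }
}
```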
Grouping
- State = some data per group
- Two Samza jobs:
  - Output statuses by user (map)
  - Count statuses per user (reduce)
- Output: (user, count)
- Could use as input to a job that sorts by count (most active users)
SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;
SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
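The map/reduce pair above boils down to per-key counting followed by a sort, mirroring the two SQL queries. A self-contained sketch (not using the Samza API; user IDs are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the grouping pipeline: count statuses per user (the
// GROUP BY), then sort the (user, count) pairs descending to find
// the most active users (the ORDER BY ... LIMIT).
public class UserCounts {
    public static Map<String, Long> countByUser(List<String> statusUserIds) {
        return statusUserIds.stream()
            .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
    }

    public static List<String> topUsers(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countByUser(List.of("zcox", "alice", "zcox"));
        System.out.println(topUsers(counts, 1)); // [zcox]
    }
}
```

Splitting this into two Samza jobs matters for scale: repartitioning by user (the map step) lets each task hold only its own users' counts in local state.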
Joins
- Samza job has multiple input streams
- Stream-Stream join: ad impressions + ad clicks
- Stream-Table join: page views + user zip code
- Table-Table join: user data + user settings
- Joins involving tables need a DB changelog
SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
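A stream-table join like that SQL can be sketched as a local table, kept current by consuming the users DB changelog, that is probed for each status. In a real Samza job this table would live in the task's local key-value store; the method and field names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a stream-table join: one input stream carries user
// changelog records (upserts into a local table), the other carries
// statuses, each enriched with the username from the table.
public class StreamTableJoin {
    private final Map<String, String> usernamesById = new HashMap<>();

    // Changelog consumer: keep the latest user record per id.
    public void onUserChange(String userId, String username) {
        usernamesById.put(userId, username);
    }

    // Status consumer: join each status against the local table.
    public String onStatus(String userId, String text) {
        String username = usernamesById.getOrDefault(userId, "<unknown>");
        return username + ": " + text;
    }

    public static void main(String[] args) {
        StreamTableJoin job = new StreamTableJoin();
        job.onUserChange("1", "zcox");
        System.out.println(job.onStatus("1", "hello")); // zcox: hello
    }
}
```

This is why tables need a changelog: the join can only stay correct if every update to the users table also flows through the log.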
What else can we compute?
- Tweets per sec/min/hour (recent, not for-all-time)
- Enrich tweets with weather at current location
- Most active users, locations, etc
- Emojis: % of tweets that contain, top emojis
- Hashtags: % of tweets that contain, top hashtags
- URLs: % of tweets that contain, top domains
- Photo URLs: % of tweets that contain, top domains
- Text analysis: sentiment, spam
Reprocessing
http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html
Other Stream Processing Frameworks
- Storm
- Spark Streaming
- Hadoop Streaming
- Akka
- Riemann
- Esper
Druid
- Send it events: Druid reads from a Kafka topic
- That Kafka topic is a Samza output stream
- Super fast time-series queries: aggregations, filters, top-n, etc
http://druid.io
Why?
- Businesses generate and process events
- Unified event log promotes data integration
- Process event streams to take actions quickly
Let's chat!
Zach Cox
@zcox
zcox522@gmail.com
Banno is hiring!