Real time data viz with Spark Streaming, Kafka and D3.js

Date post: 21-Apr-2017
Upload: ben-laird
Stream processing and visualization for transaction investigation Using Kafka, Spark, and D3.js Ben Laird Capital One Labs
Transcript
Page 1: Real time data viz with Spark Streaming, Kafka and D3.js

Stream processing and visualization for transaction investigation
Using Kafka, Spark, and D3.js

Ben Laird
Capital One Labs

Page 2: Real time data viz with Spark Streaming, Kafka and D3.js

C1 Labs Data Science

About me

Cornell Engineering ’07: BS, Operations Research

Johns Hopkins ’12: MS, Applied Math

• Data Engineer
• Northrop Grumman
• IBM
• Space debris tracking
• NLP of intel documents
• Counter-IED GIS analysis

Cornell expectations

Cornell reality

Page 3: Real time data viz with Spark Streaming, Kafka and D3.js

Now: Data Scientist at Capital One Labs

Page 4: Real time data viz with Spark Streaming, Kafka and D3.js

A technical challenge: Build a dynamic, rich visualization of large, streaming data

Normally, we have two options

Small data: easy visualization

Big data: no visualization

Page 5: Real time data viz with Spark Streaming, Kafka and D3.js

Data Science: More than just Hadoop

• Understanding all the requirements of your problem, and the architecture that meets those demands, is ever more important for a data scientist

• A data processing solution doesn’t matter if the browser takes an hour to load.

• Visualization doesn’t matter if there is no way to process and store the data

Stream Handling → Stream Processing → Intermediate Storage → Web Server/Framework → Event Based Comm → Browser Viz

Page 6: Real time data viz with Spark Streaming, Kafka and D3.js

Our system must be able to process and visualize a real time transaction stream

• Requirement: System must handle 1B+ transactions

• Loading 1B records on the client side isn’t feasible

• Our data is not only big, it is live.

• Assume a stream of 50 records/second

Page 7: Real time data viz with Spark Streaming, Kafka and D3.js

Proposed solution: Use existing big data tools to process stream before web stack

Tool                      Purpose
Apache Kafka              Distributed messaging for the transaction stream
Apache Spark Streaming    Distributed processing of the transaction stream; aggregates to levels the browser can handle
MongoDB                   Intermediate storage in a capped collection for web server access
Node.js                   Server-side framework for the web server and Mongo interaction
Socket.io                 Event-based communication; passes new data from the stream into the browser
Crossfilter               Client-side data index
DC.js/D3.js               D3.js graphics and integration with Crossfilter

How/why did I pick these for our architecture?

Page 8: Real time data viz with Spark Streaming, Kafka and D3.js

A foray into data visualization tools

From the beautiful: Minard Map, 1869

Source: http://www.edwardtufte.com/tufte/minard

Page 9: Real time data viz with Spark Streaming, Kafka and D3.js

to the ‘not beautiful’

Sources: http://www.excelcharts.com/, http://www.datavis.ca/gallery/evil-pies.php

Page 10: Real time data viz with Spark Streaming, Kafka and D3.js

With most solutions, you face a trade-off between ease of use and flexibility

• If you need a quick solution or don’t need full control or customization, there are fantastic options

• Tableau

• ElasticSearch Kibana

Page 11: Real time data viz with Spark Streaming, Kafka and D3.js

D3.js provides an extremely powerful way of joining data with completely custom graphics

Limitless possibilities. Complete control over data and viz. Not trivial to use though!

Page 12: Real time data viz with Spark Streaming, Kafka and D3.js

Bind data directly to elements in the DOM. Create graphics from scratch

http://bl.ocks.org/mbostock/7341714

Page 13: Real time data viz with Spark Streaming, Kafka and D3.js

All about finding the right level of abstraction. Introduce DC.js

• Don’t always want to construct bar charts from the ground up.

• Build axes, ticks, set colors, scales, bar widths, height, projections... Too tedious sometimes.

• DC.js adds a thin layer on top of d3.js to construct most chart types and to link charts together for fast filtering.

Page 14: Real time data viz with Spark Streaming, Kafka and D3.js

DC.js combines d3.js with Square’s crossfilter

• Built by Square

• JavaScript library for very fast (<50ms) filtering of multi-dimensional datasets

• Developed for transaction analysis (Perfect!)

• Very fast sorting and filtering

• Downside: Only practical up to a couple million records.

Page 15: Real time data viz with Spark Streaming, Kafka and D3.js

Need some backend processing to aggregate data before we hit the web stack

Transaction Stream (Apache Kafka)

• Developed by LinkedIn

• Fast, scalable publish-subscribe messaging service that runs on a distributed cluster

Transaction Processing (Apache Spark Streaming)

• Part of the larger Apache Spark compute engine

• Fast, in-memory stream processing over sliding windows

• Handles data aggregation steps

• Can be used to run ML algorithms

Page 16: Real time data viz with Spark Streaming, Kafka and D3.js

What is Apache Spark?

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Source: http://spark-summit.org/wp-content/uploads/2013/10/McDonough-spark-tutorial_spark-summit-2013.pdf
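
The split between transformations and actions matters because Spark transformations are lazy: they only describe the computation, and nothing runs until an action asks for a result. The sketch below is an analogy in plain Scala, not Spark code. It uses a standard-library Iterator in place of an RDD, and the names LazyPipeline and run are hypothetical.

```scala
object LazyPipeline {
  /** Returns (elements evaluated before the action, after the action, result). */
  def run(amounts: Seq[Double]): (Int, Int, List[Double]) = {
    var evaluated = 0

    // "Transformations": chaining map and filter on an iterator only describes the work.
    val large = amounts.iterator
      .map { a => evaluated += 1; a }
      .filter(_ > 100.0)
    val before = evaluated

    // "Action": materializing the result forces the pipeline to run.
    val result = large.toList
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit =
    println(run(Seq(12.5, 250.0, 3.0, 975.0)))
}
```

With the four sample amounts, the counter is still 0 after the map and filter calls and reaches 4 only once toList runs.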

Page 17: Real time data viz with Spark Streaming, Kafka and D3.js

Word Count in Spark vs Java MapReduce

scala> val rdd = sc.textFile("all_text_corpus.txt")

scala> val allWords = rdd.flatMap(sentence => sentence.split(" "))

scala> val counts = allWords.map(word => (word, 1)).reduceByKey(_ + _)

scala> // Swap to (count, word), sort by count descending, swap back, take the top 25

scala> counts.map { case (k, v) => (v, k) }.sortByKey(ascending = false).map { case (v, k) => (k, v) }.take(25)

Array(("",70230), (the,63641), (and,38896), (of,34986), (to,31743), (a,22481), (in,18710), (his,14712), (was,13963), (that,13735), (he,13588), (I,11761), (with,11308), (had,9303), (her,8429), (not,7900), (as,7641), (it,7626), (for,7619), (at,7574), (on,7350), (is,6383), (you,6173), (be,5525), (by,5315))

Page 18: Real time data viz with Spark Streaming, Kafka and D3.js

Word Count in Spark vs Java MapReduce

Page 19: Real time data viz with Spark Streaming, Kafka and D3.js

Transaction Aggregation with Spark

Batch up incoming transactions every 30 seconds, and compute the average transaction size and total number of transactions for every merchant and zip code over a 5-minute sliding window. Write the batched results to MongoDB.
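
The transcript does not include the code for this slide, so the following is only a sketch of the aggregation logic in plain Scala collections, with no Spark dependency. It assumes results are keyed by (merchant, zip code) pairs, which is one reading of the slide, and every name in it (Txn, Stats, WindowAggregation, aggregateWindow) is hypothetical. In Spark Streaming itself, a windowed operation such as reduceByKeyAndWindow with a 5-minute window and a 30-second slide would be a natural fit. At the 50 records/second assumed earlier, each 30-second batch holds about 1,500 transactions and each 5-minute window about 15,000.

```scala
// Hypothetical record and result types for illustration only.
final case class Txn(merchant: String, zip: String, amount: Double, timestampSec: Long)
final case class Stats(count: Int, avgAmount: Double)

object WindowAggregation {
  val BatchSec = 30L   // batch interval from the slide
  val WindowSec = 300L // 5-minute sliding window from the slide

  /** Aggregates transactions with timestamps in (windowEndSec - WindowSec, windowEndSec]. */
  def aggregateWindow(txns: Seq[Txn], windowEndSec: Long): Map[(String, String), Stats] =
    txns
      .filter(t => t.timestampSec > windowEndSec - WindowSec && t.timestampSec <= windowEndSec)
      .groupBy(t => (t.merchant, t.zip))
      .map { case (key, group) =>
        val total = group.map(_.amount).sum
        key -> Stats(group.size, total / group.size)
      }

  def main(args: Array[String]): Unit = {
    val txns = Seq(
      Txn("coffee-shop", "22102", 4.0, 10L),
      Txn("coffee-shop", "22102", 6.0, 290L),
      Txn("bookstore", "10001", 20.0, 310L)
    )
    // The window slides forward by one batch interval each time.
    for (end <- Seq(300L, 300L + BatchSec))
      println(s"window ending at ${end}s: ${aggregateWindow(txns, end)}")
  }
}
```

Each map of results is small enough to write to the intermediate store as one batch, which is the point of aggregating before the data reaches the web stack.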

Page 20: Real time data viz with Spark Streaming, Kafka and D3.js

MongoDB for intermediate storage

• Use a capped collection to immediately find the last element.

• No costly O(N) or worse searches.

• Tap into Mongo with Node.js
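
A capped collection has a fixed size, preserves insertion order, and overwrites its oldest documents once full, so the newest batch always sits at the tail. The sketch below mimics that behavior in memory to show why reading the latest element needs no search. It is not MongoDB code: CappedBuffer and CappedBufferDemo are hypothetical names, and it assumes Scala 2.13 or later for mutable.ArrayDeque.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a capped collection: fixed capacity,
// insertion order preserved, oldest entry dropped when full.
final class CappedBuffer[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val items = mutable.ArrayDeque.empty[A]

  def insert(item: A): Unit = {
    if (items.size == capacity) items.removeHead() // evict the oldest entry
    items.append(item)
  }

  /** Most recently inserted element, read directly from the tail. */
  def last: Option[A] = items.lastOption

  def toList: List[A] = items.toList
}

object CappedBufferDemo {
  def main(args: Array[String]): Unit = {
    val buf = new CappedBuffer[String](3)
    Seq("batch-1", "batch-2", "batch-3", "batch-4").foreach(buf.insert)
    println(buf.toList) // the three newest batches, oldest first
    println(buf.last)   // the newest batch
  }
}
```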

Page 21: Real time data viz with Spark Streaming, Kafka and D3.js

Node.js and Socket.io for server side updates

• Add a socket.io listener in the client-side JavaScript

Page 22: Real time data viz with Spark Streaming, Kafka and D3.js

Demo!

