Date posted: 21-Apr-2017
Category: Engineering
Uploaded by: ben-laird
Stream processing and visualization for transaction investigation: Using Kafka, Spark, and D3.js
Ben Laird, Capital One Labs
C1 Labs Data Science
About me
• Cornell Engineering '07: BS, Operations Research
• Johns Hopkins '12: MS, Applied Math
• Data Engineer: Northrop Grumman, IBM
  • Space debris tracking
  • NLP of intel documents
  • Counter-IED GIS analysis
[Images: Cornell expectations vs. Cornell reality]
Now: Data Scientist at Capital One Labs
A technical challenge: Build a dynamic, rich visualization of large, streaming data
Normally, we have two options:
• Small data, easy visualization
• Big data, no visualization
Data Science: More than just Hadoop
• Understanding all the requirements of your problem, and the architecture that meets those demands, is ever more important for a data scientist
• A data processing solution doesn't matter if you have a 1-hour load time in the browser
• A visualization doesn't matter if there is no way to process and store the data
Stream Handling → Stream Processing → Intermediate Storage → Web Server/Framework → Event-Based Comm → Browser Viz
Our system must be able to process and visualize a real-time transaction stream
• Requirement: System must handle 1B+ transactions
• Loading 1B records on the client side isn’t feasible
• Our data is not only big, it is live.
• Assume a stream of 50 records/second
Proposed solution: Use existing big data tools to process stream before web stack
• Apache Kafka: distributed messaging for the transaction stream
• Apache Spark Streaming: distributed processing of the transaction stream; aggregates to levels the browser can handle
• MongoDB: intermediate storage in a capped collection for web server access
• Node.js: server-side framework for the web server and Mongo interaction
• Socket.io: event-based communication; passes new data from the stream into the browser
• Crossfilter: client-side data index
• DC.js/D3.js: D3.js graphics and integration with Crossfilter

How/why did I pick these for our architecture?
A foray into data visualization tools
From the beautiful: Minard Map, 1869
Source: http://www.edwardtufte.com/tufte/minard
to the ‘not beautiful’
Sources: http://www.excelcharts.com/, http://www.datavis.ca/gallery/evil-pies.php
With most solutions, you face a trade off between ease of use and flexibility
• If you need a quick solution or don’t need full control or customization, there are fantastic options
• Tableau
• ElasticSearch Kibana
D3.js provides an extremely powerful way of joining data with completely custom graphics
Limitless possibilities. Complete control over data and viz. Not trivial to use though!
Bind data directly to elements in the DOM. Create graphics from scratch
http://bl.ocks.org/mbostock/7341714
It's all about finding the right level of abstraction. Introducing DC.js
• You don't always want to construct bar charts from the ground up: building axes, ticks, colors, scales, bar widths, heights, projections... too tedious sometimes
• DC.js adds a thin layer on top of D3.js to construct most chart types and to link charts together for fast filtering
DC.js combines D3.js with Square's Crossfilter
• Built by Square
• JavaScript library for very fast (<50ms) filtering of multi-dimensional datasets
• Developed for transaction analysis (perfect!)
• Very fast sorting and filtering
• Downside: only practical up to a couple million records
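Crossfilter's speed comes from indexing: record ids are kept sorted by each dimension's value once, so a range filter reduces to two binary searches rather than a scan over every record. The idea can be sketched in Scala (a conceptual model only, not crossfilter's JavaScript API):

```scala
// Sketch of crossfilter's core trick: record ids are sorted by each
// dimension's value once, so every range filter is two binary searches
// (O(log n)) instead of a full scan. Conceptual only, not crossfilter's API.
class Dimension(values: IndexedSeq[Int]) {
  // record ids ordered so that their values ascend
  private val order: IndexedSeq[Int] = values.indices.sortBy(values)
  private val sortedVals: IndexedSeq[Int] = order.map(values)

  // first position in sortedVals whose value is >= key
  private def lowerBound(key: Int): Int = {
    var lo = 0
    var hi = sortedVals.length
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (sortedVals(mid) < key) lo = mid + 1 else hi = mid
    }
    lo
  }

  // ids of records whose value lies in [lo, hi)
  def filterRange(lo: Int, hi: Int): Set[Int] =
    order.slice(lowerBound(lo), lowerBound(hi)).toSet
}
```

Crossfilter builds one such index per dimension and intersects the filters, which is why applying or removing a filter stays under ~50ms even on a million records.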
Need some backend processing to aggregate data before we hit the web stack
Apache Kafka (transaction stream)
• Developed by LinkedIn
• Fast, scalable publish-subscribe messaging service that runs on a distributed cluster

Apache Spark Streaming (transaction processing)
• Part of the larger Apache Spark compute engine
• Fast, in-memory stream processing over sliding windows
• Handles the data aggregation steps
• Can be used to run ML algorithms
What is Apache Spark?
Write programs in terms of transformations on distributed datasets.

Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Source: http://spark-summit.org/wp-content/uploads/2013/10/McDonough-spark-tutorial_spark-summit-2013.pdf
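The transformation/action split means transformations are lazy: they only record a lineage, and nothing executes until an action forces it. A lazy Scala view makes a reasonable stand-in for that behavior (plain collections here, not Spark):

```scala
// Lazy view as a stand-in for an RDD lineage: map/filter only build a
// recipe; the "action" (sum) is what actually triggers evaluation.
object LazyDemo {
  var evaluations = 0
  val pipeline = (1 to 10).view
    .map { x => evaluations += 1; x * 2 } // transformation: deferred
    .filter(_ > 10)                       // transformation: deferred
  def run(): Int = pipeline.sum           // action: forces the work
}
```

In Spark, this same laziness lets the scheduler pipeline transformations together and rebuild lost partitions by replaying the lineage.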
Word Count in Spark vs Java MapReduce
scala> val rdd = sc.textFile("all_text_corpus.txt")
scala> val allWords = rdd.flatMap(sentence => sentence.split(" "))
scala> val counts = allWords.map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.map { case (k, v) => (v, k) }.sortByKey(ascending = false).map { case (v, k) => (k, v) }.take(25)
Array(("",70230), (the,63641), (and,38896), (of,34986), (to,31743), (a,22481), (in,18710), (his,14712), (was,13963), (that,13735), (he,13588), (I,11761), (with,11308), (had,9303), (her,8429), (not,7900), (as,7641), (it,7626), (for,7619), (at,7574), (on,7350), (is,6383), (you,6173), (be,5525), (by,5315))
Word Count in Spark vs Java MapReduce
Transaction Aggregation with Spark
Batch up incoming transactions every 30 seconds, and compute the average transaction size and total number of transactions for every (merchant, zip code) pair over a 5-minute sliding window. Write the batched results to MongoDB.
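The aggregation itself is a keyed group-and-reduce. A minimal plain-Scala sketch shows what one batch produces; the Txn case class and its field names are illustrative assumptions, and an ordinary Seq stands in for the Spark DStream:

```scala
// One batch of the aggregation, using plain Scala collections in place
// of a Spark DStream. Txn and its fields are illustrative assumptions.
case class Txn(merchant: String, zip: String, amount: Double)

object BatchAgg {
  // For each (merchant, zip) key: (average transaction size, count)
  def aggregate(batch: Seq[Txn]): Map[(String, String), (Double, Int)] =
    batch.groupBy(t => (t.merchant, t.zip)).map { case (key, txns) =>
      val avg = txns.map(_.amount).sum / txns.size
      (key, (avg, txns.size))
    }
}
```

In the streaming job, the equivalent key/value pairs would be reduced with Spark Streaming's windowing operators (e.g. `reduceByKeyAndWindow`) to maintain the 5-minute sliding window across 30-second batches.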
MongoDB for intermediate storage
• Use a capped collection to immediately find the last element
• No costly O(N) or worse searches
• Tap into Mongo with Node.js
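Conceptually, a capped collection is a fixed-size buffer that preserves insertion order and evicts the oldest documents, so the newest entry is always readable without a scan. A toy Scala model of that behavior (not the MongoDB API):

```scala
import scala.collection.mutable

// Toy model of a Mongo capped collection: a fixed-size FIFO buffer that
// evicts the oldest entries and preserves insertion order, so the most
// recent document is available without a scan. Not the Mongo driver API.
class CappedBuffer[A](capacity: Int) {
  private val buf = mutable.Queue.empty[A]

  def insert(a: A): Unit = {
    if (buf.size == capacity) buf.dequeue() // evict oldest, as Mongo does
    buf.enqueue(a)
  }

  def last: Option[A] = buf.lastOption // newest entry
  def size: Int = buf.size
}
```

This is why the web server can poll the collection for the latest Spark batch cheaply: insertion order is natural order, and the collection never grows beyond its cap.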
Node.js and Socket.io for server side updates
• Add a socket.io listener in the client-side JavaScript
Demo!