Date posted: 16-Apr-2017
Category: Technology
Uploaded by: mongodb
Spark in the Leaf
Ross Lawley, JVM software engineer on the drivers team
Twitter: @RossC0

Agenda
– The data challenge
– Spark
– Use cases
– Connectors
– Demo
c. 18,000 BCE
First recorded example of Humans saving data.
Tally sticks used to track trading activity and record inventory.
1663
First recorded statistical analysis of data.
John Graunt started the field of demographics in an attempt to predict the spread of the bubonic plague.
1928
First use of magnetic tape to store data.
Fritz Pfleumer’s invention formed the basis of modern digital data storage.
1965
The start of Big Data?
The US Government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints.
1970
The start of accessible data
Relational database model developed by Edgar F. Codd.
1997
Michael Lesk estimates that the digital universe is increasing tenfold in size every year.
2001
Big Data challenges defined
Doug Laney defined the Three “Vs” of Big Data
2010
Eric Schmidt:
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Big Data
Big Challenge
“Apache Spark is the Taylor Swift of big data software.”
– Derrick Harris, Fortune
What is Spark?
Fast and general computing engine for clusters
• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• It’s fundamentally different from what’s come before
Why not just use Hadoop?
• Spark is FAST
  – Faster to write.
  – Faster to run.
• Up to 100x faster than Hadoop in memory
• 10x faster on disk
A visual comparison
Hadoop vs. Spark
Spark Programming Model
Resilient Distributed Datasets
• An RDD is an immutable, distributed, fault-tolerant collection of elements.
• Transformations can be applied to an RDD, producing a new RDD.
• Actions can be applied to an RDD to obtain a value.
• RDDs are lazy: nothing is computed until an action is called.
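The bullets above can be seen in a few lines of code. A minimal sketch, assuming Spark is on the classpath and run in local mode (the value names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddDemo {
  def main(args: Array[String]): Unit = {
    // Run Spark in-process; no cluster needed for this sketch.
    val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

    val nums    = sc.parallelize(1 to 10)   // an RDD: immutable and partitioned
    val evens   = nums.filter(_ % 2 == 0)   // transformation: returns a new RDD, computes nothing yet
    val doubled = evens.map(_ * 2)          // transformation: still lazy

    val total = doubled.reduce(_ + _)       // action: triggers the whole pipeline
    println(total)                          // 4 + 8 + 12 + 16 + 20 = 60

    sc.stop()
  }
}
```

Nothing touches the data until `reduce` runs; the two transformations only record what should happen.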
RDD Operations

Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
Actions: reduce, collect, count, save, lookupKey, take, foreach
Example: Filtering text

val searches = spark.textFile("hdfs://...")
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB")).count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB")).collect()

(Diagram: the driver sends tasks to three workers; each worker reads one block of the input, caches its partition, and returns results to the driver.)
Built-in fault tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions.

val searches = spark.textFile("hdfs://...")
  .filter(_.contains("Search"))
  .map(_.split("\t")(2))
  .cache()

searches
  .filter(_.contains("MongoDB"))
  .count()

Lineage: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → count
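One way to see the lineage Spark tracks is `RDD.toDebugString`, which prints the chain of parent RDDs. A sketch under the same assumptions as before (local mode; the input path is the slide's placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val searches = sc.textFile("hdfs://...")        // HDFS RDD
      .filter(_.contains("Search"))                 // Filtered RDD
      .map(_.split("\t")(2))                        // Mapped RDD
      .cache()                                      // Cached RDD

    // Prints the lineage graph; if a cached partition is lost,
    // Spark replays exactly this chain to rebuild it.
    println(searches.toDebugString)

    sc.stop()
  }
}
```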
Spark higher-level libraries

Spark
Spark SQL | Spark Streaming | MLlib | GraphX
Spark + MongoDB
MongoDB and Spark

Spark
Spark SQL | Spark Streaming | MLlib | GraphX
Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User-facing services
– Fraud detection
Data Management

OLTP: applications, fine-grained operations
Offline processing: analytics, data warehousing
Fraud Detection

I'm so in love!
Me, too <3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$
Sharing Workloads

(Diagram: a chat app (login, user profiles, contacts, messages) runs against MongoDB, while Spark over HDFS handles archiving, data crunching, fraud detection, segmentation, and recommendations.)
MongoDB + Spark Connectors
Choices, choices
– MongoDB Hadoop Connector
– Stratio Spark-MongoDB Connector
MongoDB Hadoop Connector

(Diagram: Spark reads from a sharded MongoDB cluster through the MongoDB Hadoop Connector, alongside HDFS data nodes; a second variant runs the same stack under YARN.)
MongoDB Hadoop Connector

Positive:
– Battle tested
– Integrated with existing Hadoop components
– Supports Hive and Pig

Not so good:
– Not the fastest thing
– Not dedicated to Spark
– Dependent on HDFS
http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
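For reference, reading a collection through the MongoDB Hadoop Connector typically goes through `newAPIHadoopRDD`. A sketch assuming the mongo-hadoop artifact is on the classpath and a local mongod is running; the URI, database, and collection names are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

object HadoopConnectorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mongo-hadoop").setMaster("local[*]"))

    val mongoConfig = new Configuration()
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/marketdata.ticks")

    // Each record arrives as a (document _id, full BSON document) pair.
    val documents = sc.newAPIHadoopRDD(
      mongoConfig,
      classOf[MongoInputFormat],
      classOf[Object],
      classOf[BSONObject])

    println(documents.count())
    sc.stop()
  }
}
```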
Stratio Spark-MongoDB
http://spark-packages.org/?q=mongodb
https://github.com/Stratio/spark-mongodb

(Diagram: Spark reads directly from a sharded MongoDB cluster through the Stratio Spark-MongoDB connector.)
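With Stratio's connector the usual entry point is Spark SQL's data source API. A sketch assuming the spark-mongodb package is loaded and a local mongod is running; the host, database, collection, and field names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object StratioDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stratio").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Register the collection as a DataFrame via the data source API.
    val ticks = sqlContext.read
      .format("com.stratio.datasource.mongodb")
      .options(Map(
        "host"       -> "localhost:27017",
        "database"   -> "marketdata",
        "collection" -> "ticks"))
      .load()

    // Query it with Spark SQL; filters can be pushed down to MongoDB.
    ticks.registerTempTable("ticks")
    sqlContext.sql("SELECT Symbol, Close FROM ticks WHERE Close > 24").show()

    sc.stop()
  }
}
```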
                                         MongoDB Hadoop Connector   Stratio Spark-MongoDB Connector
Machine Learning                         Yes                        Yes
SQL                                      No                         Yes
Data Frames                              No                         Yes
Streaming                                No                         No
Python                                   Yes                        Yes (Spark SQL syntax)
Secondary indexes to filter input data   Yes                        Yes
Replica sets and sharding                Yes                        Yes
HDFS support                             Yes                        Yes
MongoDB BSON files                       Yes                        Partial (write only)
Commercial support                       Yes (with MongoDB          Yes (provided by Stratio)
                                         Enterprise Advanced)
Spark Streaming
Spark Streaming

(Diagram: a Twitter feed streams into Spark.)

Each status in the stream looks like:

{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          { "text": "freebandnames", "indices": [20, 34] }
        ],
        "user_mentions": []
      }
    }
  ]
}

After one such status, the per-minute hashtag count is:

{ "time": "Mon Sep 24 03:35", "freebandnames": 1 }

After four statuses in the same minute:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
MongoDB and Spark Streaming future

Write each aggregate into a capped collection and follow it with a tailable cursor:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
{ "time": "Mon Nov 5 09:40", "mongoDBLondon": 400 }
{ "time": "Mon Nov 5 11:50", "spark": 7556 }
{ "time": "Mon Nov 24 12:50", "itshappening": 100 }
Spark SQL
Demo

Spark with the Stratio Spark-MongoDB connector
Open, High, Low, Close

Symbol, Timestamp, Day, Open, High, Low, Close, Volume
MSFT, 2009-08-24 09:30, 24, 24.41, 24.42, 24.31, 24.31, 683713
MongoDB + Spark performance

Document design matters.

One document per tick:

db.ticks.find()
{
  _id: 'MSFT_12',                       // resource
  type: 'Open',                         // type
  date: ISODate("2015-07-12 10:00"),    // when
  volume: 12.9                          // data
}

Time-series bucketing, one document per hour:

db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 1699342,
  minutes: {
    "0": 12.9,
    "1": 14.4,
    ...
    "59": 15.8
  }
}

WiredTiger: very high speed.
Spark I/O Matters

val searches = spark.fromMongoDB(mongoDBConfig)
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB")).count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB")).collect()

(Diagram: the Spark driver and workers read directly from MongoDB instead of HDFS.)
Spark and MongoDB
• An extremely powerful combination
• Many possible use cases
• Some operations are actually faster when performed using the Aggregation Framework
• Evolving all the time
References
• Resources
  – https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
  – http://spark.apache.org/docs/latest/quick-start.html
  – https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
  – http://techanjs.org/
• Images
  – https://commons.wikimedia.org/wiki/File:SAM_PC_1_-_Tally_sticks_1_-_Overview.jpg
  – http://www.pieria.co.uk/articles/a_17th_century_spreadsheet_of_deaths_in_london
  – http://www.snipview.com/q/Fritz_Pfleumer
  – https://news.google.com/newspapers?id=ZGogAAAAIBAJ&sjid=3GYFAAAAIBAJ&dq=data-center&pg=933%2C5465131
  – http://www.slideshare.net/renguzi/codd
  – http://www.datasciencecentral.com/profiles/blogs/a-little-known-component-that-should-be-part-of-most-data-science
  – https://medium.com/deepend-indepth/know-your-audience-better-than-asio-4802839c3fd3#.fj0dxq99w
  – http://timschreiber.com/img/cardboard-tank.jpg
  – http://olap.com/forget-big-data-lets-talk-about-all-data/
  – http://www.engadget.com/2015/10/07/lexus-cardboard-electric-car/
  – http://cdn.theatlantic.com/static/infocus/ngt051713/n10_00203194.jpg
  – http://www.businessinsider.com/the-red-pill-reddit-2013-8
  – https://www.flickr.com/photos/dogfaceboy/2572744331/