Date post: | 28-Jul-2015 |
Category: | Engineering |
Upload: | mate-gulyas |
Apache Spark
Mate Gulyas
WHY DO WE DO IT?
54% average viewability**
36% non-human visitors*
What percentage of digital ads reach
people?
Clickbots, botnets
Invisible, hidden ads
Transparency in the market
*http://technorati.com/iab-keynote-36-percent-ad-traffic-from-bots-and-threatening-industry/
**http://www.statista.com/statistics/255061/viewability-rates-for-rich-media-ads-worldwide-by-industry/
JavaScript Segmentation
Behaviour analysis
WHAT DO WE DO?
Distributed data processing
WHAT DO WE NEED?
Average-size client
30 GB / day
900 GB / month
20 average-size clients
600 GB / day
18 TB / month
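The figures above follow from simple multiplication. A minimal sketch of that arithmetic, assuming a 30-day month and 1 TB = 1000 GB (both inferred from the numbers on the slide, not stated on it); the object and method names are made up for illustration:

```scala
object DataVolume {
  // Assumptions implied by the slide's figures, not stated on it:
  // a month is counted as 30 days, and 1 TB = 1000 GB.
  val DaysPerMonth = 30
  val GbPerTb = 1000

  // Monthly volume in GB for a given daily volume in GB.
  def monthlyGb(dailyGb: Int): Int = dailyGb * DaysPerMonth

  def main(args: Array[String]): Unit = {
    val perClientDailyGb = 30 // one average-size client
    val clients = 20
    val totalDailyGb = perClientDailyGb * clients

    println(s"1 client: ${monthlyGb(perClientDailyGb)} GB / month")
    println(s"$clients clients: $totalDailyGb GB / day")
    println(s"$clients clients: ${monthlyGb(totalDailyGb) / GbPerTb} TB / month")
  }
}
```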
Recurring data transformations
WHAT DO WE NEED?
Interactive / Batch / Streaming / SQL / Graph processing
IT’S SO COOL
In-memory
WHY SPARK?
Productive API
WHY SPARK?
Multiple languages
WHY SPARK?
Active community
WHY SPARK?
Developer friendly
Analyst friendly
CFO friendly
WHY SPARK?
Resilient Distributed Dataset (RDD)
ONE THING TO REMEMBER
RDD
IT RUNS ON
Mesos
YARN
Standalone
AWS EC2
IT GETS DATA FROM
Amazon S3, HDFS, Cassandra, Hive, HBase, Tachyon, Local Filesystem, ODBC databases, etc.
Batch processing
THE OLD WAY
Interactive analytics
THE NEW WAY
SPARK WITH IPYTHON
Spark SQL
THE NOT THAT OLD WAY
{"name": "Mate Gulyas", "twitter": "gulyasm"}
{"name": "John Doe", "email": "[email protected]"}
{"name": "Jane Doe", "email": "[email protected]"}
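// hiveCtx is assumed to be an existing org.apache.spark.sql.hive.HiveContext
val input = hiveCtx.jsonFile("example.json")
// registerTempTable replaces the deprecated registerAsTable
input.registerTempTable("users")
// Query the table under the name it was registered with
hiveCtx.sql("SELECT name, twitter FROM users")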
SQL WITH JSON
Spark Streaming
THE LOW LATENCY WAY
DStream
MLlib
THE SKYNET WAY
GraphX
I LOVE GRAPHS
Third party modules
THE OTHERS WAY
On-premises
AWS
Databricks Cloud
BUT… WHERE TO GO?
TAKEAWAY I
Spark can provide one platform to cover most of the use cases in data analytics.
TAKEAWAY II
A productive, fast data processing framework that helps you minimize time to business impact.