Apache Spark: The modern data analytics platform

Apache Spark

Mate Gulyas

WHY DO WE DO IT?

54% average viewability**

36% non-human visitors*

What percentage of digital ads reach people?

Clickbots, botnets

Invisible, hidden ads

Transparency in the market

* http://technorati.com/iab-keynote-36-percent-ad-traffic-from-bots-and-threatening-industry/
** http://www.statista.com/statistics/255061/viewability-rates-for-rich-media-ads-worldwide-by-industry/

JavaScript Segmentation

Behaviour analysis

WHAT DO WE DO?

Distributed data processing

WHAT DO WE NEED?

Average-size client: 30 GB / day

900 GB / month

20 average-size clients: 600 GB / day

18 TB / month
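
The volume figures above follow from simple multiplication. A minimal sketch of that arithmetic, assuming a 30-day month and decimal units (1 TB = 1000 GB); neither assumption is stated on the slide, and the names `DataVolume`, `monthlyGb`, `DaysPerMonth`, and `GbPerTb` are illustrative only:

```scala
// Worked arithmetic for the per-client and total data volumes.
// Assumptions (not from the slides): 30-day month, 1 TB = 1000 GB.
object DataVolume {
  val DaysPerMonth = 30 // assumed month length
  val GbPerTb = 1000    // assumed decimal terabyte

  // Monthly volume in GB for a given daily volume in GB.
  def monthlyGb(dailyGb: Int): Int = dailyGb * DaysPerMonth

  def main(args: Array[String]): Unit = {
    val perClientDailyGb = 30 // from the slide
    val clients = 20          // from the slide

    val perClientMonthlyGb = monthlyGb(perClientDailyGb)
    val totalDailyGb = perClientDailyGb * clients
    val totalMonthlyTb = monthlyGb(totalDailyGb) / GbPerTb

    println(s"One client: $perClientMonthlyGb GB / month")
    println(s"$clients clients: $totalDailyGb GB / day, $totalMonthlyTb TB / month")
  }
}
```

At this scale a single month of raw data no longer fits comfortably on one machine, which is the motivation for distributed processing.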

Recurring data transformations

WHAT DO WE NEED?

Interactive / Batch / Streaming / SQL / Graph processing

IT’S SO COOL

In-memory

WHY SPARK?

Productive API

WHY SPARK?

Multiple languages

WHY SPARK?

Active community

WHY SPARK?

Developer, analyst, and CFO friendly

WHY SPARK?

Resilient Distributed Dataset (RDD)

ONE THING TO REMEMBER

RDD

IT RUNS ON

Mesos, YARN, Standalone, AWS EC2

WHERE DOES IT GET DATA FROM?

Amazon S3, HDFS, Cassandra, Hive, HBase, Tachyon, local filesystem, ODBC databases, etc.

Batch processing

THE OLD WAY

Interactive analytics

THE NEW WAY

SPARK WITH IPYTHON

Spark SQL

THE NOT-SO-OLD WAY

{"name": "Mate Gulyas", "twitter": "gulyasm"}
{"name": "John Doe", "email": "[email protected]"}
{"name": "Jane Doe", "email": "[email protected]"}

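// Load newline-delimited JSON and infer its schema (Spark 1.x HiveContext API)
val input = hiveCtx.jsonFile("example.json")
// Register the result as a table named "users" so SQL can refer to it
input.registerAsTable("users")
// Query the registered table; rows without a "twitter" field yield null
hiveCtx.sql("SELECT name, twitter FROM users")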

SQL WITH JSON

Spark Streaming

THE LOW LATENCY WAY

DStream

MLlib

THE SKYNET WAY

GraphX

I LOVE GRAPHS

Third party modules

THE OTHERS' WAY

On-premises, AWS, Databricks Cloud

BUT… WHERE TO GO?

TAKEAWAY I

Spark can provide one platform to cover most use cases in data analytics.

TAKEAWAY II

A productive, fast data processing framework that helps you minimize time to business impact.

MATE [email protected]

@gulyasm, @enbritely

THANK YOU!

Date post:	28-Jul-2015
Category:	Engineering
Upload:	mate-gulyas