
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

Transcript
Page 1: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

Spark or Hadoop: Is it an either-or proposition?

By Slim Baltagi (@SlimBaltagi) sbaltagi@gmail.com

http://www.SparkBigData.com

OR? XOR?

Los Angeles Spark Users Group, March 12, 2015

2

Your Presenter - Slim Baltagi
• Sr. Big Data Solutions Architect living in Chicago
• Over 17 years of IT and business experience
• Over 4 years of Big Data experience, working on over 12 Hadoop projects
• Speaker at Big Data events
• Creator and maintainer of the Apache Spark Knowledge Base http://www.SparkBigData.com, with over 4,000 categorized Apache Spark web resources

@SlimBaltagi

https://www.linkedin.com/in/slimbaltagi

sbaltagi@gmail.com

Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing or promoting any product or vendor mentioned in this talk.

3

Agenda
I. Motivation
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
III. Spark with Hadoop
IV. Spark without Hadoop
V. More Q&A

4

I. Motivation
1. News
2. Surveys
3. Vendors
4. Analysts
5. Key Takeaways

5

1. News
• Is it Spark vs. OR and Hadoop?
• Apache Spark: Hadoop friend or foe?
• Apache Spark: killer or savior of Apache Hadoop?
• Apache Spark's Marriage To Hadoop Will Be Bigger Than Kim And Kanye
• Adios Hadoop, Hola Spark!
• Apache Spark: Moving on from Hadoop
• Apache Spark Continues to Spread Beyond Hadoop
• Escape From Hadoop
• Spark promises to end up Hadoop, but in a good way

6

2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015 http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

9

3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

10

3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

12

4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles! FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate!
• "Big Data refers to datasets and flows large enough that have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2. Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name Incomplete but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
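To make the "assembly code" point concrete, here is word count sketched in plain Python, spelling out the separate map, shuffle, and reduce phases that Hadoop MapReduce forces you to think in. This is a conceptual sketch, not Hadoop API code; all function names are illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word, as Hadoop's Mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: the framework groups all values by key between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Spark or Hadoop", "Hadoop or Spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hadoop"])  # -> 2
```

The higher-level APIs listed above (Pig, Hive, Scalding, Crunch, ...) exist precisely so that this three-phase plumbing does not have to be written by hand for every job.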

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real-Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics.

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model | On-Disk (disk-based parallelization); Batch | On-Disk; Batch, Interactive | In-Memory and On-Disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility to data types and data sources | Same | Same | Same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
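The flavor of such a translation can be sketched in plain Python. The `LocalRDD` class below is a hypothetical, in-process stand-in for a Spark RDD, just to show how the map/shuffle/reduce plumbing of a MapReduce job collapses into a chain of `flatMap`/`map`/`reduceByKey` calls; it is not the actual Spark API and would not run on a cluster:

```python
import functools
from collections import defaultdict

class LocalRDD:
    """A tiny local stand-in for a Spark RDD (illustrative only)."""
    def __init__(self, data):
        self.data = list(data)
    def flatMap(self, f):
        # Apply f to each element and flatten the results, like RDD.flatMap.
        return LocalRDD(x for item in self.data for x in f(item))
    def map(self, f):
        return LocalRDD(f(item) for item in self.data)
    def reduceByKey(self, f):
        # Group (key, value) pairs by key, then fold the values with f.
        groups = defaultdict(list)
        for k, v in self.data:
            groups[k].append(v)
        return LocalRDD((k, functools.reduce(f, vs)) for k, vs in groups.items())
    def collect(self):
        return self.data

lines = LocalRDD(["Spark or Hadoop", "Hadoop or Spark"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w.lower(), 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))
```

The mapper body becomes the `flatMap`/`map` lambdas and the reducer body becomes the `reduceByKey` function, which is why existing mapper and reducer logic often carries over with little change.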

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (Expected in 3.1 release)

• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (Expected in Mahout 1.0)

• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
(Diagram: Hadoop services and example open source tools - Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
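Conceptually, Spark Streaming consumes a feed like Kafka by slicing the continuous stream into small time-based micro-batches and running a regular Spark job on each one. The idea can be sketched in plain Python; the timestamps, batch interval, and `micro_batches` helper are all made up for illustration and are not part of either API:

```python
def micro_batches(events, interval):
    """Group (timestamp, payload) events into `interval`-second batches,
    the way Spark Streaming turns a stream into a sequence of RDDs."""
    batches = {}
    for ts, payload in events:
        batches.setdefault(int(ts // interval), []).append(payload)
    return [batches[k] for k in sorted(batches)]

# A pretend Kafka feed: (seconds, message) pairs.
feed = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (2.4, "d")]
for batch in micro_batches(feed, interval=1.0):
    print(batch)  # each batch would be processed as one small Spark job
```

In the real integration, the batch interval is set when the StreamingContext is created, and each resulting batch is an RDD processed by normal Spark code.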

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
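The schema-inference idea can be illustrated in plain Python: scan JSON records and derive the union of fields and their types. This is only a toy sketch of the concept, not Spark SQL's actual algorithm (which also merges conflicting types, handles nesting, and so on); `infer_schema` is a hypothetical helper:

```python
import json

def infer_schema(json_lines):
    """Derive a field -> type-name mapping from a set of JSON records,
    roughly what Spark SQL does before letting you query the data."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # Keep the first type seen for each field; records may
            # contribute different subsets of fields.
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "spark", "year": 2014}',
           '{"name": "hadoop", "batch": true}']
print(infer_schema(records))  # -> {'name': 'str', 'year': 'int', 'batch': 'bool'}
```

The point of the slide is that this inference step replaces hand-written DDL: the query layer discovers the columns from the data itself.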

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(Diagram: Hadoop ecosystem and Spark ecosystem components)

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4 Complementarity

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group

• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

• Framework for the Future of Hadoop, March 9, 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways

1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13

4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
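Under the hood, Spark stays storage-agnostic because it goes through Hadoop's FileSystem API, which picks the storage backend from the URI scheme of the path. The helper below is a plain-Python sketch of that dispatch idea (the function name `storage_scheme` is made up for this illustration; the scheme strings are real examples from the list above):

```python
# Sketch: Spark selects a storage backend from the URI scheme of the input
# path, which is why HDFS is replaceable by S3, Tachyon, Swift, etc.
from urllib.parse import urlparse

def storage_scheme(path):
    """Return the storage scheme a path would be dispatched on ('file' if none)."""
    return urlparse(path).scheme or "file"

for p in ["hdfs://namenode:8020/logs", "s3n://bucket/logs",
          "tachyon://master:19998/table", "/local/path"]:
    print(p, "->", storage_scheme(p))
```

In real Spark code the same idea shows up as `sc.textFile("s3n://bucket/logs")` versus `sc.textFile("hdfs://...")`: only the scheme changes, not the program.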

75

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local: httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) - httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
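In practice, the cluster manager is chosen purely through the master URL passed to Spark. The snippet below is an illustrative sketch in plain Python (the `MASTER_FORMATS` table and `master_url` helper are invented for this example), listing the master URL formats Spark 1.x documents for the main deployment modes:

```python
# Sketch: the master URL formats Spark accepts for different cluster managers.
# Swapping the deployment mode means swapping this one string, e.g.
# spark-submit --master spark://host:7077 ...
MASTER_FORMATS = {
    "local": "local[*]",                 # all cores on one machine
    "standalone": "spark://{host}:7077", # Spark's own standalone manager
    "mesos": "mesos://{host}:5050",      # Apache Mesos
    "yarn": "yarn-client",               # or yarn-cluster (Spark 1.x naming)
}

def master_url(mode, host="master"):
    """Return the master URL string for a deployment mode."""
    return MASTER_FORMATS[mode].format(host=host)

print(master_url("standalone", host="node1"))
```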

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions

• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. httpwwwstratiocom

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. httpstratiogithubiostreaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40

83

• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39

84

• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37

85

• Guavus (httpwwwguavuscom) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

            Hadoop ecosystem   Spark ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware

89

• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

Criteria           YARN                        Mesos
Resource sharing   Yes                         Yes
Written in         Java                        C++
Scheduling         Memory only                 CPU and Memory
Running tasks      Unix processes              Linux Container groups
Requests           Specific requests and       More generic, but more coding
                   locality preference         for writing frameworks
Maturity           Less mature                 Relatively more mature

91

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
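The "mix SQL with imperative code" idea can be illustrated without a Spark cluster. The hedged sketch below uses only Python's standard-library sqlite3 as a stand-in (the `events` table and its rows are invented for the example); Spark SQL applies the same declarative-then-imperative pattern to distributed data:

```python
# Illustration only (sqlite3, not Spark SQL): aggregate declaratively with SQL,
# then post-process the result with ordinary imperative code.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, clicks INT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("ann", 3), ("bob", 5), ("ann", 4)])

# Declarative step: let the SQL engine do the grouping ...
rows = con.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# ... imperative step: arbitrary logic over the result set.
top = {user: total for user, total in rows if total > 5}
print(top)
```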

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark Streaming

Criteria                      Storm                    Spark Streaming
Processing model              Record at a time         Mini batches
Latency                       Sub-second               Few seconds
Fault tolerance (every        At least once            Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available            Core Spark API
Supported languages           Any programming          Scala, Java, Python
                              language
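The "mini batches" row in the table above is the key architectural difference: a record-at-a-time engine invokes the handler per event, while Spark Streaming collects events into small batches and processes each batch as a job. The sketch below shows that grouping in plain Python (the `mini_batches` helper is invented for illustration; batch size stands in for Spark's batch interval):

```python
# Sketch: group an event stream into fixed-size mini batches, the way a
# micro-batching engine turns a stream into a sequence of small batch jobs.
def mini_batches(stream, batch_size):
    """Yield lists of events; the last batch may be shorter."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(mini_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each yielded batch is what Spark Streaming would hand to the regular batch API, which is why the table lists "Core Spark API" under batch framework integration.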

96

GraphX

'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook

• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook

• ISpark is an Apache Spark-shell backend for IPython. httpsgithubcomtribbloidISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring Your Own Storage!
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


2 Surveys

• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015 httpwwwmarketwiredcompress-releasesurvey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162htm

• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe httptypesafecomblogapache-spark-preparing-for-the-next-wave-of-reactive-big-data

bull Apache Spark Preparing for the Next Wave of Reactive Big Data January 27 2015 by Typesafehttptypesafecomblogapache-spark-preparing-for-the-next-wave-of-reactive-big-data

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3 Vendors

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica httpsdatabrickscomblog20140121spark-and-hadoophtml

• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia httpwwwslidesharenetdatabricksspark-summit2014

• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015

9

3 Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 httpvisionclouderacommapreduce-spark

• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." httpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml

10

3 Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." httpswwwmaprcomproductsapache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop httpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3 Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." httphortonworkscomhadoopspark

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml

• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

12

4 Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?

• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum

13

4 Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.

• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.

• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014 httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5 Key Takeaways

1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: Listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: Thorough understanding of the market dynamics.

15

II Big Data, Typical Big Data Stack, Hadoop, Spark

1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways

16

1 Big Data

• Big Data is still one of the most inflated buzzwords of recent years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. httpenwikipediaorgwikiBig_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3 Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: httpbigdataandreamostosiname. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. httpswwwyoutubecomwatchv=1KvTZZAkHy0

19

4 Apache Spark

• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming: httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL: httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning): httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX: httpsparkbigdatacomcomponenttagstag6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

20

5 Key Takeaways

1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.

21

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

22

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
• Pig: httppigapacheorg
• Hive: httphiveapacheorg
• Scoobi, a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
• Cascading: httpwwwcascadingorg
• Scalding, a Scala API for Cascading: httptwittercomscalding
• Crunch: httpcrunchapacheorg
• Scrunch: httpcrunchapacheorgscrunchhtml

23

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

1st Generation: Batch
2nd Generation: Batch, Interactive
3rd Generation: Batch, Interactive, Near-Real-Time
4th Generation: Batch, Interactive, Real-Time, Iterative

24

1 Evolution

• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

25

1 Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1 Evolution

• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria             MapReduce                Tez                      Spark
License              Open Source Apache 2.0,  Open Source Apache 2.0,  Open Source Apache 2.0,
                     version 2.x              version 0.x              version 1.x
Processing model     On-disk (disk-based      On-disk; Batch,          In-memory and on-disk;
                     parallelization); Batch  Interactive              Batch, Interactive,
                                                                       Streaming (near real-time)
Language written in  Java                     Java                     Scala
API                  [Java, Python, Scala],   Java [ISV/Engine/Tool    [Scala, Java, Python],
                     user-facing              builder]                 user-facing
Libraries            None; separate tools     None                     Spark Core, Spark
                                                                       Streaming, Spark SQL,
                                                                       MLlib, GraphX

29

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                  Tez                      Spark
Installation      Bound to Hadoop            Bound to Hadoop          Isn't bound to Hadoop
Ease of use       Difficult to program,      Difficult to program;    Easy to program, no need
                  needs abstractions; no     no interactive mode      of abstractions;
                  interactive mode except    except Hive, Pig         interactive mode
                  Hive, Pig
Compatibility     Compatibility to data types and data sources is the same for all three
YARN integration  YARN application           Ground-up YARN           Spark is moving towards
                                             application              YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce             Tez                   Spark
Deployment   YARN                  YARN                  Standalone, YARN, SIMR,
                                                         Mesos, …
Performance  -                     -                     Good performance when data
                                                         fits into memory; performance
                                                         degradation otherwise
Security     More features and     More features and     Still in its infancy;
             projects              projects              partial support

31

III Spark with Hadoop

1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

32

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
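Point 1 above can be sketched without any cluster at all: the per-record mapper and per-key reducer written for a MapReduce job can be called unchanged from a Spark-style functional pipeline. Plain Python stands in for the RDD API here, and the word-count mapper/reducer pair is an invented example, not code from the talk:

```python
# Sketch: reuse MapReduce-style mapper/reducer functions in a flatMap ->
# shuffle -> reduceByKey pipeline (pure Python, no Spark required).
from functools import reduce
from itertools import groupby

def mapper(line):          # the same emit-(word, 1) logic an MR mapper uses
    return [(w, 1) for w in line.split()]

def reducer(a, b):         # the same per-key sum an MR reducer uses
    return a + b

lines = ["spark and hadoop", "spark or hadoop"]
pairs = sorted(kv for line in lines for kv in mapper(line))   # flatMap + shuffle
counts = {key: reduce(reducer, (v for _, v in group))         # reduceByKey
          for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(counts)
```

In actual Spark the last three lines collapse to something like `rdd.flatMap(mapper).reduceByKey(reducer)`, with `mapper` and `reducer` untouched.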

33

2 Transition

3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (status: open, Q1 2015): httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark

• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop 2: Support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence based recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: Hadoop ecosystem services and the open source tools that provide them – storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

44

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration

• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra in Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: an introduction to Spark with Cassandra (part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

49

3. Integration

• There is also NSMC, a native Spark MongoDB connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving. Open Spark/YARN issues (JIRA query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/issues/
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: integrating Kafka and Spark Streaming – code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume integration guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
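The schema-inference idea is easy to picture outside Spark: scan the JSON records and union the fields observed across them. A toy sketch in plain Python (an illustration of the concept only, not Spark SQL's actual implementation):

```python
import json

def infer_schema(json_lines):
    """Union the field names and value types observed across JSON records,
    similar in spirit to Spark SQL's schema inference for JSON datasets."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # Record the type name on first sight; a real engine would
            # also merge and widen conflicting types across records
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "city": "Chicago"}',
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Records with different fields still contribute to one unified schema, which is what lets Spark SQL query heterogeneous JSON files without any DDL.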

57

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

58

3. Integration

• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: the Hadoop ecosystem and the Spark ecosystem side by side]

65

4. Complementarity: Spark + Tachyon + HDFS

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos references

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: can't we all just get along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

69

4. Complementarity: Spark + Tez

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity

• Emergence of the "smart execution engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity

• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System

Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions

• Using Spark on a non-Hadoop distribution

80

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: ultra fast data analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming CEP engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop ecosystem component | Spark ecosystem alternative
HDFS                       | Tachyon
YARN                       | Mesos
Pig                        | Spark native API
Hive                       | Spark SQL
Mahout                     | MLlib
Storm                      | Spark Streaming
Giraph                     | GraphX
HUE                        | Spark Notebook / ISpark

88

Tachyon

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code changes
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria         | YARN                                       | Mesos
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and memory
Running tasks    | Unix processes                             | Linux container groups
Requests         | Specific requests and locality preference  | More generic, but more coding for writing frameworks
Maturity         | Less mature                                | Relatively more mature

91

Spark Native API

• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, with much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark – first Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
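The concision that lambda expressions bring can be seen even without a Spark cluster: a word count in plain Python, written in the same map/reduce-by-key style as Spark's RDD API (illustrative only; this is not actual Spark code):

```python
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# "map" phase: flatten the lines into individual words
words = [word for line in lines for word in line.split()]

# "reduce-by-key" phase: fold the words into per-word counts with a lambda
counts = reduce(lambda acc, w: acc + Counter({w: 1}), words, Counter())
print(counts["to"])  # 4
```

In Spark the same shape appears as `flatMap` followed by `reduceByKey`, and Java 8 lambdas let Java express it nearly as tersely as Scala.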

92

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
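The "mini batches" processing model in the table can be illustrated in plain Python: instead of handling each record as it arrives (Storm's model), records are grouped into small batches that are processed as a unit (a toy sketch, not Spark's implementation):

```python
def micro_batches(records, batch_size):
    """Group an incoming stream of records into fixed-size mini batches,
    mimicking in spirit Spark Streaming's discretized-stream model.
    Real Spark Streaming batches by time interval, not by record count."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # hand the whole batch to the engine as one unit
            batch = []
    if batch:                # flush any final partial batch
        yield batch

stream = range(7)
print(list(micro_batches(stream, 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what gives Spark Streaming its exactly-once semantics and Core Spark API integration, at the cost of a few seconds of latency versus Storm's per-record model.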

96

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways

1. File system: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015. http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe. http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica. https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia. http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia. http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

9

3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013. http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

10

3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop. https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014. https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014. http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

12

4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015. http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data: Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• "Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate." http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• "Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop."

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
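Point 1 can be made concrete with a word-count sketch in plain Python: the classic mapper and reducer functions stay unchanged and are simply driven by flatMap/reduceByKey-style plumbing, which is essentially what calling them from Spark looks like (this is an illustrative sketch, not the code from the Cloudera how-to).

```python
# Classic MapReduce-style functions, reusable unchanged.
def mapper(line):                 # emits (word, 1) pairs for one input line
    return [(w, 1) for w in line.split()]

def reducer(a, b):                # merges two counts for the same key
    return a + b

# Spark-style driver: lines.flatMap(mapper).reduceByKey(reducer)
# simulated here with plain-Python equivalents of flatMap / reduceByKey.
lines = ["spark and hadoop", "spark or hadoop"]
pairs = [kv for line in lines for kv in mapper(line)]        # flatMap
counts = {}
for word, n in pairs:                                        # reduceByKey
    counts[word] = reducer(counts[word], n) if word in counts else n
```

The point of the exercise: the business logic (mapper, reducer) survives the migration untouched; only the driver plumbing changes.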

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run. https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Apache Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Apache Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop ecosystem services and the open source tools that integrate with Spark, by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store resilient distributed datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" Resource Negotiator).
• Integration is still improving; see the open Spark JIRA issues mentioning YARN (query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC, at https://issues.apache.org/jira/).
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
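Under the hood, Spark Streaming turns a Kafka topic into a DStream: a sequence of small RDDs, one per batch interval. The micro-batching idea itself can be sketched without Kafka or Spark (the 2-second interval and the event stream below are made-up illustration, not a connector API):

```python
# Micro-batching sketch: a stream of (timestamp_seconds, event) records is
# cut into 2-second batches, mimicking how a DStream groups Kafka messages.
from collections import defaultdict

BATCH_INTERVAL = 2  # seconds, like StreamingContext(sc, Seconds(2))

stream = [(0.5, "a"), (1.9, "b"), (2.1, "c"), (3.0, "d"), (4.2, "e")]

batches = defaultdict(list)
for ts, event in stream:
    batches[int(ts // BATCH_INTERVAL)].append(event)  # assign to its batch

# Each batch would become one small RDD, processed by ordinary Spark code.
result = [batches[k] for k in sorted(batches)]
```

This is why Spark Streaming is labelled "near real-time" elsewhere in the deck: latency is bounded below by the batch interval, unlike record-at-a-time engines.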

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
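The columnar idea behind Parquet fits in a few lines: rows are pivoted into per-column arrays, so a query that touches one column reads only that column's data (toy data; the real Parquet format adds row groups, encodings, and compression on top of this layout):

```python
# Row layout vs. column layout: Parquet stores each column contiguously,
# so scanning one column does not have to read the others.
rows = [
    {"id": 1, "lang": "scala", "stars": 5},
    {"id": 2, "lang": "java",  "stars": 3},
]

# Pivot rows into a columnar layout: a dict of column name -> values.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query like SELECT stars FROM t now touches a single contiguous array.
stars = columns["stars"]
```

The same layout is what makes per-column compression and predicate pushdown effective, which is why Spark SQL favors Parquet for analytical workloads.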

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• Hue is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad, a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
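In practice, Spark's storage agnosticism comes down to addressing data by URI, with the scheme selecting the storage connector (hdfs://, s3n://, plain local paths, and so on). The dispatch idea can be sketched without Spark; the connector names below are illustrative only, since real Spark resolves schemes through Hadoop FileSystem implementations and data source packages:

```python
from urllib.parse import urlparse

# Illustrative scheme -> connector mapping (not real Spark class names).
CONNECTORS = {
    "hdfs": "HDFS connector",
    "s3n":  "Amazon S3 connector",
    "file": "local file system",
    "":     "local file system",   # bare paths default to local files
}

def pick_connector(path):
    """Choose a storage backend from the URI scheme, Spark-style."""
    scheme = urlparse(path).scheme
    try:
        return CONNECTORS[scheme]
    except KeyError:
        raise ValueError(f"no connector registered for scheme {scheme!r}")

assert pick_connector("hdfs://namenode:8020/logs/part-0") == "HDFS connector"
assert pick_connector("s3n://bucket/key") == "Amazon S3 connector"
assert pick_connector("/tmp/data.txt") == "local file system"
```

"Bring Your Own Storage" then just means registering another scheme, which is why the list of HDFS alternatives above keeps growing.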

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
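From the application's point of view, all of these deployments are selected the same way: the master URL handed to Spark picks the cluster manager. A simplified sketch of that dispatch (the real parsing lives inside Spark's SparkContext; the return strings here are just labels):

```python
def deployment_mode(master):
    """Map a Spark 1.x-style master URL to the cluster manager it selects."""
    if master == "local" or master.startswith("local["):
        return "local mode (threads in one JVM)"
    if master.startswith("spark://"):
        return "standalone cluster"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master in ("yarn-client", "yarn-cluster"):
        return "Hadoop YARN"
    raise ValueError(f"unrecognized master URL: {master}")

assert deployment_mode("local[4]").startswith("local mode")
assert deployment_mode("spark://host:7077") == "standalone cluster"
assert deployment_mode("mesos://zk://a:2181/mesos") == "Apache Mesos"
assert deployment_mode("yarn-cluster") == "Hadoop YARN"
```

Switching from, say, standalone to Mesos is therefore a deployment-time decision, not a code change.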

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems

• 'xPatterns' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software
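Tachyon's pitch, one in-memory copy of data shared across otherwise separate frameworks, can be pictured as a store that outlives any single job. A toy sketch (job names and paths are invented; Tachyon itself exposes a Hadoop-compatible file system API, which is why no code change is needed):

```python
class InMemoryStore:
    """Toy stand-in for a memory-centric storage layer shared by jobs."""
    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]

store = InMemoryStore()  # lives outside any one framework's address space

def spark_job(storage):            # hypothetical producer job
    storage.write("/shared/result", [1, 2, 3])

def mapreduce_job(storage):        # hypothetical consumer job, no recompute
    return sum(storage.read("/shared/result"))

spark_job(store)
assert mapreduce_job(store) == 6
```

The point of the design is that the consumer reads the producer's output at memory speed, without a round trip through disk-backed HDFS.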

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
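The deepest difference behind this comparison is the scheduling handshake: YARN frameworks send specific container requests to the ResourceManager, while Mesos inverts this and makes resource offers to frameworks, which accept or decline them. A toy sketch of the Mesos-style offer model (node names, sizes, and the acceptance threshold are all invented):

```python
# Toy Mesos-style two-level scheduling: the manager OFFERS resources,
# and each framework decides what to accept (YARN instead takes requests).

cluster_offers = [
    {"node": "n1", "cpus": 4, "mem_gb": 16},
    {"node": "n2", "cpus": 2, "mem_gb": 64},
]

def spark_like_framework(offer):
    # Accept offers with enough memory for an executor (invented threshold).
    return offer["mem_gb"] >= 32

accepted = [o["node"] for o in cluster_offers if spark_like_framework(o)]
declined = [o["node"] for o in cluster_offers if not spark_like_framework(o)]

assert accepted == ["n2"]
assert declined == ["n1"]
```

Declined offers go back into the pool for other frameworks, which is what makes the fine-grained sharing described above possible.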

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
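The conciseness argument is easiest to see on the classic word count: in the functional style the Spark API (and Java 8 lambdas) enable, it is a couple of chained transformations rather than a full MapReduce job. A dependency-free Python sketch of the same shape (in real Spark this would be roughly lines.flatMap(...).map(...).reduceByKey(...)):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to spark or to hadoop"]

# flatMap -> map -> reduceByKey, expressed with plain Python equivalents.
words = chain.from_iterable(line.split() for line in lines)   # flatMap
pairs = map(lambda w: (w, 1), words)                          # map
counts = Counter()                                            # reduceByKey
for word, one in pairs:
    counts[word] += one

assert counts["to"] == 4
assert counts["be"] == 2
```

The same three-step pipeline in classic Hadoop MapReduce requires a Mapper class, a Reducer class, and a driver, which is the "assembly code of Big Data" point made earlier in the deck.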

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
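The "mix and match SQL with imperative code" workflow can be tried in miniature with Python's built-in sqlite3: run a declarative query, then keep transforming the result programmatically. This is only an analogy for what Spark SQL unifies at cluster scale (the schema and data below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative part: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: post-process the result with ordinary code.
top_users = [user for user, total in rows if total >= 5]

assert rows == [("ann", 5), ("bob", 7)]
assert top_users == ["ann", "bob"]
```

In Spark SQL the query would run over a distributed dataset and the post-processing over RDDs, but the programming model, SQL and code side by side in one program, is the same.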

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
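The processing-model row is the key contrast: Storm hands each record to the topology as it arrives, while Spark Streaming collects records into small time-sliced batches and runs a batch job on each slice. The batching side can be sketched in plain Python (the batch interval and events are invented; this is a toy model, not the real DStream API):

```python
def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into Spark-Streaming-style
    time slices of length batch_interval seconds."""
    batches = {}
    for timestamp, value in events:
        slot = int(timestamp // batch_interval)
        batches.setdefault(slot, []).append(value)
    return [batches[s] for s in sorted(batches)]

events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]

# One-second slices: the engine runs one small batch job per slice.
assert mini_batches(events, 1.0) == [["a", "b"], ["c"], ["d"]]

# A record-at-a-time engine like Storm would instead invoke processing
# once per event, giving sub-second latency but no batch-level wins.
```

Batching is also what gives Spark Streaming its exactly-once semantics and its seamless reuse of the core batch API.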

96

GraphX

'GraphX' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
  • Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
  • Sqoop on Spark
  • Cascading (Expected in 3.1 release)
  • Apache Crunch
  • Mahout (Expected in Mahout 1.0)
  • Mahout (Expected in Mahout 1.0)
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark without Hadoop
  • 5 Key Takeaways
  • V More Q&A
Page 4: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

4

I Motivation1 News 2 Surveys3 Vendors4 Analysts5 Key Takeaways

5

1 Newsbull Is it Spark vs OR and Hadoopbull Apache Spark Hadoop friend or foebull Apache Spark killer or savior of Apache Hadoopbull Apache Sparks Marriage To Hadoop Will Be Bigger

Than Kim And Kanye bull Adios Hadoop Hola Sparkbull Apache Spark Moving on from Hadoopbull Apache Spark Continues to Spread Beyond Hadoopbull Escape From Hadoopbull Spark promises to end up Hadoop but in a good way

6

2 Surveysbull Hadoops historic focus on batch processing of data

was well supported by MapReduce but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processingrdquo 2015 Apache Spark Survey by Typesafe January 27 2015 httpwwwmarketwiredcompress-releasesurvey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162htm

bull Apache Spark Preparing for the Next Wave of Reactive Big Data January 27 2015 by Typesafehttptypesafecomblogapache-spark-preparing-for-the-next-wave-of-reactive-big-data

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3 Vendorsbull Spark and Hadoop Working Together January 21

2014 by Ion Stoica httpsdatabrickscomblog20140121spark-and-hadoophtml

bull Uniform API for diverse workloads over diverse storage systems and runtimes Source Slide 16 of lsquoSparks Role in the Big Data Ecosystem (Spark Summit 2014) November 2014 Matei Zahariahttpwwwslidesharenetdatabricksspark-summit2014

bull The goal of Apache Spark is to have one engine for all data sources workloads and environmentsrdquo Source Slide 15 of lsquoNew Directions for Apache Spark in 2015 February 20 2015 Strata + Hadoop Summit Matei Zahariahttpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015

9

3 Vendorsbull ldquoSpark is already an excellent piece of software and is

advancing very quickly No vendor mdash no new project mdash is likely to catch up Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason rdquoSource MapReduce and Spark December 302013 httpvisionclouderacommapreduce-spark

bull ldquoApache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast unified Big Data applications combining batch streaming and interactive analytics on all your datardquohttpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml

10

3 Vendorsbull ldquoApache Spark is a general-purpose engine for large-

scale data processing Spark supports rapid application development for big data and allows for code reuse across batch interactive and streaming applications Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performancerdquo httpswwwmaprcomproductsapache-spark

bull MapR Adds Complete Apache Spark Stack to its Distribution for Hadoophttpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3 Vendorsbull ldquoApache Spark provides an elegant attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast in-memory data processingrdquo httphortonworkscomhadoopspark

bull Hortonworks A shared vision for Apache Spark on Hadoop October 21 2014httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml

bull ldquoAt Hortonworks we love Spark and want to help our customers leverage all its benefitsrdquo October 30th 2014httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

12

4 Analystsbull Is Apache Spark replacing Hadoop or complementing

existing Hadoop practicebull Both are already happening

bull With uncertainty about ldquowhat is Hadooprdquo there is no reason to think solution stacks built on Spark not positioned as Hadoop will not continue to proliferate as the technology matures

bull At the same time Hadoop distributions are all embracing Spark and including it in their offerings

Source Hadoop Questions from Recent Webinar Span Spectrum February 25 2015httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum

13

4 Analysts bull ldquoAfter hearing the confusion between Spark and Hadoop

one too many times I was inspired to write a report The Hadoop Ecosystem Overview Q4 2104

bull For those that have day jobs that donrsquot include constantly tracking Hadoop evolution I dove in and worked with Hadoop vendors and trusted consultants to create a framework

bull We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and extended group of components that leverage but do not require itrdquo Source Elephants Pigs Rhinos and Giraphs Oh My ndash Its Time To Get A Handle On Hadoop Posted by Brian Hopkins on November 26 2014

httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5 Key Takeaways1 News Big Data is no longer a Hadoop

monopoly2 Surveys Listen to what Spark developers are

saying 3 Vendors ltHadoop Vendorgt-tinted goggles

FUD is still being lsquoofferedrsquo by some Hadoop vendors Claims need to be contextualized

4 Analysts Thorough understanding of the market dynamics

15

II Big Data Typical Big Data Stack Hadoop Spark

1 Big Data2 Typical Big Data Stack 3 Apache Hadoop4 Apache Spark5 Key Takeaways

16

1 Big Databull Big Data is still one of the most inflated buzzword of the last years

bull Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate httpenwikipediaorgwikiBig_data

bull Hadoop is becoming a traditional tool Above definition is inadequate

bull ldquoBig Data refers to datasets and flows large enough that has outpaced our capability to store process analyze and understandrdquo Amir H Payberah Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3 Apache Hadoopbull Apache Hadoop as an example of a Typical Big Data

Stack bull Hadoop ecosystem = Hadoop Stack + many other tools

(either open source and free or commercial ones)bull Big Data Ecosystem Dataset httpbigdataandreamostosiname

Incomplete but a useful list of Big Data related projects packed into a JSON dataset

bull Hadoops Impact on Data Managements Future - Amr Awadallah (Strata + Hadoop 2015) February 19 2015 Watch video at 236 on lsquoHadoop Isnrsquot Just Hadoop Anymorersquo for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0

19

4 Apache Sparkbull Apache Spark as an example of a Typical Big Data Stackbull Apache Spark provides you Big Data computing and more

bull BYOS Bring Your Own Storage bull BYOC Bring Your Own Clusterbull Spark Core httpsparkbigdatacomcomponenttagstag11-core-sparkbull Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-

streamingbull Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sqlbull MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllibbull GraphX httpsparkbigdatacomcomponenttagstag6-graphx

bull Spark ecosystem is emerging fast with roots from BDAS Berkley Data Analytics Stack and new tools from both the open source community and commercial one Irsquom compiling a list Stay tuned

20

5 Key Takeaways1 Big Data Still one of the most inflated

buzzword 2 Typical Big Data Stack Big Data Stacks look

similar on paper Arenrsquot they3 Apache Hadoop Hadoop is no longer

lsquosynonymousrsquo of Big Data4 Apache Spark Emergence of the Apache

Spark ecosystem

21

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

22

1 Evolution of Programming APIsbull MapReduce in Java is like assembly code of Big

Data httpwikiapacheorghadoopWordCount

bull Pig httppigapacheorg

bull Hive httphiveapacheorg

bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi

bull Cascading httpwwwcascadingorg

bull Scalding A Scala API for Cascading httptwittercomscalding

bull Crunch httpcrunchapacheorg

bull Scrunch httpcrunchapacheorgscrunchhtml

23

1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007 MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop Now we have in addition to MapReduce v2 Tez Spark and Flink

bull Batch bull Batchbull Interactive

bull Batchbull Interactivebull Near-Real

time

bull Batchbull Interactivebull Real-Timebull Iterative

bull 1st Generation

bull 2nd

Generationbull 3rd

Generationbull 4th

Generation

24

1 Evolutionbull This is how Hadoop MapReduce is branding itself ldquoA YARN-based

system for parallel processing of large data sets httphadoopapacheorg

bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip

bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job

bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics

25

1 Evolutionbull Tez Hindi for ldquospeedrdquobull This is how Apache Tez is branding itself ldquoThe Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data It is currently built atop YARNrdquo

Source httptezapacheorg

bull Apachetrade Tez is an extensible framework for building high performance batch and interactive data processing applications coordinated by YARN in Apache Hadoop

26

1 Evolution bull lsquoSparkrsquo for lightning fast speedbull This is how Apache Spark is branding itself ldquoApache Sparktrade is a fast and general engine for large-scale data processingrdquo httpssparkapacheorg

bull Apache Spark is a general purpose cluster computing framework its execution model supports wide variety of use cases batch interactive near-real time

bull The rapid in-memory processing of resilient distributed datasets (RDDs) is the ldquocore capabilityrdquo of Apache Spark

27

1 Evolution Apache Flinkbull Flink German for ldquonimble swift speedyrdquobull This is how Apache Flink is branding itself ldquoFast and

reliable large-scale data processing enginerdquobull Apache Flink httpflinkapacheorg offers

bull Batch and Streaming in the same systembull Beyond DAGs (Cyclic operator graphs)bull Powerful expressive APIsbull Inside-the-system iterationsbull Full Hadoop compatibility bull Automatic language independent optimizer

bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink

28

Hadoop MapReduce vs Tez vs SparkCriteria

License Open SourceApache 20 version 2x

Open SourceApache 20 version 0x

Open SourceApache 20 version 1x

Processing Model

On-Disk (Disk- based parallelization) Batch

On-Disk Batch Interactive

In-Memory On-Disk Batch Interactive Streaming (Near Real-Time)

Language written in

Java Java Scala

API [Java Python Scala] User-Facing

Java[ ISVEngineTool builder]

[Scala Java Python] User-Facing

Libraries None separate tools None [Spark Core Spark Streaming Spark SQL MLlib GraphX]

29

Hadoop MapReduce vs Tez vs SparkCriteria

Installation Bound to Hadoop Bound to Hadoop Isnrsquot bound to Hadoop

Ease of Use Difficult to program needs abstractions

No Interactive mode except Hive Pig

Difficult to program

No Interactive mode except Hive Pig

Easy to program no need of abstractionsInteractive mode

Compatibility

to data types and data sources is same

to data types and data sources is same

to data types and data sources is same

YARN integration

YARN application Ground up YARN application

Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs SparkCriteria

Deployment YARN YARN [Standalone YARN SIMR Mesos hellip]

Performance - Good performance when data fits into memory

- performance degradation otherwise

Security More features and projects

More features and projects

Still in its infancy

Partial support

31

IV Spark with Hadoop

1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways

32

2 Transitionbull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine1 You can often reuse your mapper and

reducer functions and just call them in Spark from Java or Scala

2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark

httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark

33

2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark

bull Pig Hive Sqoop Cascading Crunch Mahout hellip

34

Pig on Spark (Spork)bull Run Pig with ldquondashx sparkrdquo option for an easy migration

without development effortbull Speed up your existing pig scripts on Spark ( Query

Logical Plan Physical Pan)bull Leverage new Spark specific operators in Pig such as

Cachebull Still leverage many existing Pig UDF librariesbull Pig on Spark Umbrella Jira (Status Passed end-to-end test

cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059bull Fix outstanding issues and address additional Spark functionality

through the community

bull lsquoPig on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (Currently in Beta Expected in Hive 110)

bull New alternative to using MapReduce or Tez hivegt set hiveexecutionengine=sparkbull Help existing Hive applications running on

MapReduce or Tez easily migrate to Spark without development effort

bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop

bull Performance benefits especially for Hive queries involving multiple reducer stages

bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (Currently in Beta Expected in Hive 110)

bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles

bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started

bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark

bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final

bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop

bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

38

(Expected in 31 release)

bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop

bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release

Source httpwwwcascadingorgnew-fabric-support

bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg

bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml

bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

(Expec (Expected in Mahout 10 )

bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg

bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed

Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark

bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

41

(Expected in Mahout 10 )

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

42

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

43

3. Integration

[Slide diagram: Hadoop ecosystem services and the open source tools that provide them - Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
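Whatever the backing store, the RDD API looks the same from application code; a minimal sketch, with hypothetical host names and paths (not from the slides):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Same API, different storage backends (all paths and hosts are hypothetical).
val sc = new SparkContext(new SparkConf().setAppName("storage-demo"))

val local = sc.textFile("file:///tmp/input.txt")          // local file system
val hdfs  = sc.textFile("hdfs://namenode:8020/data/in")   // HDFS
val s3    = sc.textFile("s3n://my-bucket/logs/")          // Amazon S3 (s3n scheme in the Spark 1.x era)

local.union(hdfs).union(s3).saveAsTextFile("hdfs://namenode:8020/data/out")
```

Only the URI scheme changes; the transformations and actions are identical across storage systems.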

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
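A sketch in the spirit of HBaseTest.scala, reading an HBase table as an RDD via newAPIHadoopRDD (the table name "mytable" is hypothetical):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-demo"))

// Point the Hadoop InputFormat at the table we want to scan.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "mytable")

// Each record is a (row key, row result) pair.
val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"Rows in mytable: ${hBaseRDD.count()}")
```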

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
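The connector's core API can be sketched as follows; the keyspace "test", table "words" and column names are hypothetical:

```scala
import com.datastax.spark.connector._ // DataStax Spark Cassandra Connector
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point
val sc = new SparkContext(conf)

// Expose a Cassandra table as an RDD and project columns out of each row.
val pairs = sc.cassandraTable("test", "words")
  .map(row => (row.getString("word"), row.getInt("count")))

// Write any RDD back to a Cassandra table, naming the target columns.
sc.parallelize(Seq(("spark", 100), ("hadoop", 90)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))
```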

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop and Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
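The Hive support is exposed through a HiveContext; a minimal sketch using the Spark 1.2-era API (the "sales" table and its columns are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-demo"))

// HiveContext reads table metadata from the Hive metastore.
val hiveCtx = new HiveContext(sc)

// Run HiveQL directly over an existing Hive table.
val totals = hiveCtx.sql(
  "SELECT customer, SUM(amount) AS total FROM sales GROUP BY customer")

totals.collect().foreach(println)
```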

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
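The receiver-based integration from the guide can be sketched as follows (Spark 1.2-era API; the ZooKeeper quorum, consumer group and topic name are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-demo")
val ssc  = new StreamingContext(conf, Seconds(2)) // 2-second micro-batches

// Receiver-based stream: (topic -> number of receiver threads).
val lines = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",        // ZooKeeper quorum
  "demo-consumer-group", // consumer group id
  Map("events" -> 1)
).map(_._2)              // drop the message key, keep the body

lines.count().print()    // print per-batch record counts
ssc.start()
ssc.awaitTermination()
```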

55

3. Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
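Schema inference makes this a three-line affair; a sketch using the Spark 1.2-era API (people.json is a hypothetical file of one JSON object per line):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-demo"))
val sqlContext = new SQLContext(sc)

// Schema is inferred from the JSON records themselves - no DDL needed.
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register the SchemaRDD as a table and query it with SQL.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```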

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
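The Parquet round trip mirrors the JSON case; a sketch with the Spark 1.2-era SchemaRDD API (paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-demo"))
val sqlContext = new SQLContext(sc)

// Any SchemaRDD can be written out as Parquet, schema included.
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.saveAsParquetFile("hdfs:///data/people.parquet")

// Read it back: the schema is preserved in the Parquet metadata.
val parquet = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquet.registerTempTable("people")
sqlContext.sql("SELECT COUNT(*) FROM people").collect().foreach(println)
```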

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
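The elasticsearch-hadoop native integration can be sketched as follows; the cluster address, index/type name and documents are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs / esRDD to SparkContext and RDDs

val conf = new SparkConf()
  .setAppName("es-demo")
  .set("es.nodes", "localhost:9200") // Elasticsearch cluster to talk to
val sc = new SparkContext(conf)

// Save any RDD whose records translate into documents (here: Maps).
val docs = Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop"))
sc.makeRDD(docs).saveToEs("talks/slides")

// Read an index/type back as an RDD of (document id, document) pairs.
val fromEs = sc.esRDD("talks/slides")
println(fromEs.count())
```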

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Slide diagram: Hadoop ecosystem alongside Spark ecosystem]

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
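From the application's point of view, the cluster manager is just a --master URL at submit time; a sketch (host names and the application jar are hypothetical, using Spark 1.x-era syntax):

```shell
# Same application, different cluster managers - only the master URL changes.
spark-submit --master local[4]                 --class demo.App app.jar  # local, 4 threads
spark-submit --master spark://master-host:7077 --class demo.App app.jar  # standalone cluster
spark-submit --master mesos://mesos-host:5050  --class demo.App app.jar  # Apache Mesos
spark-submit --master yarn-cluster             --class demo.App app.jar  # Hadoop YARN
```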

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark and Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

             Hadoop ecosystem   Spark ecosystem
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and memory
Running tasks     Unix processes                             Linux container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
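As a flavor of the native Scala API, the classic word count is three transformations and one action (input and output paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

val counts = sc.textFile("hdfs:///data/pages.txt")
  .flatMap(line => line.split("\\s+"))  // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word

counts.saveAsTextFile("hdfs:///data/wordcounts")
```

The equivalent MapReduce job takes two classes and substantially more boilerplate, which is much of the API's appeal.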

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                   Storm                              Spark Streaming
Processing model                           Record at a time                   Mini batches
Latency                                    Sub-second                         Few seconds
Fault tolerance - every record processed   At least once (may be duplicates)  Exactly once
Batch framework integration                Not available                      Core Spark API
Supported languages                        Any programming language           Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


2 Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015: http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm

• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe: http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3 Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html

• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014

• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

9

3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark

• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

10

3 Vendorsbull ldquoApache Spark is a general-purpose engine for large-

scale data processing Spark supports rapid application development for big data and allows for code reuse across batch interactive and streaming applications Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performancerdquo httpswwwmaprcomproductsapache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

12

4. Analysts
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data! http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: A Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org
• Scalding: A Scala API for Cascading http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation - MapReduce v1: Batch
• 2nd Generation - MapReduce v2, Tez: Batch, Interactive
• 3rd Generation - Spark: Batch, Interactive, Near-Real-time
• 4th Generation - Flink: Batch, Interactive, Real-Time, Iterative

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics.

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1. Evolution
• 'Spark' for lightning-fast speed!
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink http://flink.apache.org offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

Hadoop MapReduce vs. Tez vs. Spark

Criteria             | Hadoop MapReduce                           | Tez                                | Spark
License              | Open Source, Apache 2.0, version 2.x       | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model     | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive        | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in  | Java                                       | Java                               | Scala
API                  | [Java, Python, Scala], User-Facing         | Java, [ISV/Engine/Tool builder]    | [Scala, Java, Python], User-Facing
Libraries            | None, separate tools                       | None                               | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                                         | Tez                                               | Spark
Installation     | Bound to Hadoop                                          | Bound to Hadoop                                   | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is same                   | to data types and data sources is same            | to data types and data sources is same
YARN integration | YARN application                                         | Ground-up YARN application                        | Spark is moving towards YARN

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

33
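To make the translation concrete, here is a minimal sketch (not from the original deck) of the classic MapReduce word count expressed with Spark's RDD API; the input/output paths and app name are placeholders:

```scala
// Hypothetical sketch: MapReduce-style word count on Spark (1.x-era API).
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/text")   // placeholder path
      .flatMap(line => line.split(" "))              // the "map" side
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // the "reduce" side
    counts.saveAsTextFile("hdfs:///output/counts")   // placeholder path
    sc.stop()
  }
}
```

The same mapper/reducer logic that was spread across Mapper and Reducer classes in Java MapReduce collapses into a few chained RDD transformations.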

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35
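As a sketch of the migration path described above, an existing Pig script would be launched on the Spark engine with the "-x spark" execution mode (the script name here is a placeholder):

```shell
# Run an unchanged Pig script on the Spark execution engine (Spork).
pig -x spark wordcount.pig

# For comparison, the same script on the classic MapReduce engine:
pig -x mapreduce wordcount.pig
```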

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292

36
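A minimal sketch of such a migrated session: the engine is switched per-session, and the HiveQL itself stays unchanged (the table and column names below are placeholders):

```sql
-- Switch this Hive session from MapReduce/Tez to the Spark engine:
hive> set hive.execution.engine=spark;

-- Existing queries run unmodified on the new engine:
hive> SELECT category, count(*) FROM products GROUP BY category;
```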

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (Expected in 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: categories of open source tools that integrate with Spark, by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cached data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory http://hortonworks.com/blog/ddm, to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46
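A minimal sketch of the newAPIHadoopRDD approach mentioned above, in the spirit of Spark's bundled HBaseTest.scala (it assumes an existing SparkContext `sc`, and the table name is a placeholder):

```scala
// Hypothetical sketch: read an HBase table as an RDD via the Hadoop API.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

println(hBaseRDD.count())   // e.g. count the rows of the table
```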

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47
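A minimal sketch of the DataStax connector usage described above (it assumes an existing SparkContext `sc` configured with `spark.cassandra.connection.host`; keyspace, table and column names are placeholders):

```scala
// Hypothetical sketch: Cassandra tables as RDDs via the Spark Cassandra Connector.
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD:
val users = sc.cassandraTable("test_keyspace", "users")
println(users.first())

// Transform and write an RDD back to another Cassandra table:
users.map(row => (row.getString("name"), 1))
     .reduceByKey(_ + _)
     .saveToCassandra("test_keyspace", "name_counts", SomeColumns("name", "count"))
```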

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator. An implicit reference to Mesos as the Resource Negotiator?
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones!
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52
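A minimal sketch of submitting a Spark application to a YARN cluster with the Spark 1.x master URLs (the class, jar, path and resource sizes below are placeholders):

```shell
# Run the driver inside the YARN cluster ("yarn-cluster" mode in Spark 1.x):
spark-submit \
  --class org.example.WordCount \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  wordcount.jar hdfs:///input/text

# Or keep the driver on the client machine for interactive use:
spark-submit --class org.example.WordCount --master yarn-client wordcount.jar
```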

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55
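A minimal sketch of the receiver-based Kafka integration from the guide above, as it looked in the Spark Streaming 1.2 era (it assumes an existing SparkContext `sc`; the ZooKeeper quorum, consumer group and topic name are placeholders):

```scala
// Hypothetical sketch: count Kafka messages in 10-second batches.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// (zkQuorum, consumer group, Map[topic -> receiver threads])
val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
  Map("events" -> 1)).map(_._2)   // keep only the message payload

lines.count().print()
ssc.start()
ssc.awaitTermination()
```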

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

57
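A minimal sketch of the schema-inference workflow described above, using the Spark 1.2-era SchemaRDD API (it assumes an existing SparkContext `sc`; the file path and field names are placeholders):

```scala
// Hypothetical sketch: query JSON with Spark SQL, no DDL required.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records:
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
          .collect().foreach(println)
```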

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58
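A minimal sketch of the round trip to Parquet with the Spark 1.2-era API (it assumes an existing SQLContext `sqlContext` and a SchemaRDD `people`; paths are placeholders):

```scala
// Hypothetical sketch: write a SchemaRDD to Parquet and query it back.
people.saveAsParquetFile("hdfs:///data/people.parquet")

// Parquet files are self-describing, so the schema is preserved:
val parquetPeople = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquetPeople.registerTempTable("parquet_people")

sqlContext.sql("SELECT count(*) FROM parquet_people")
          .collect().foreach(println)
```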

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61
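A minimal sketch of the elasticsearch-hadoop native integration described above (the ES node address, index/type name and document fields are placeholders):

```scala
// Hypothetical sketch: write an RDD to Elasticsearch and read it back.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD

val conf = new SparkConf().setAppName("EsDemo")
  .set("es.nodes", "es-host:9200")           // placeholder ES node
val sc = new SparkContext(conf)

// Any RDD whose content translates into documents can be indexed:
sc.makeRDD(Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop")))
  .saveToEs("articles/docs")                 // placeholder index/type

// Read the index back as an RDD of (id, document) pairs:
val esRDD = sc.esRDD("articles/docs")
println(esRDD.count())
```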

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75
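As a minimal sketch of option 4 above, reading directly from Amazon S3 with no HDFS involved (it assumes an existing SparkContext `sc`; the bucket name and credentials are placeholders, and `s3n://` was the common scheme in the Spark 1.x era):

```scala
// Hypothetical sketch: Spark over S3, no Hadoop cluster required.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET")

val logs = sc.textFile("s3n://my-bucket/logs/*.gz")   // placeholder bucket
println(logs.count())
```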

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
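These modes differ mostly in the --master URL handed to spark-submit or spark-shell. Illustrative invocations (host names and the application jar are placeholders; Spark 1.x syntax is assumed):

```shell
spark-submit --master local[4]            my-app.jar   # local mode, 4 worker threads
spark-submit --master spark://host:7077   my-app.jar   # standalone cluster
spark-submit --master mesos://host:5050   my-app.jar   # Apache Mesos
spark-submit --master yarn-cluster        my-app.jar   # Hadoop YARN
```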

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

                 Hadoop Ecosystem   Spark Ecosystem
Component        HDFS               Tachyon
                 YARN               Mesos
Tools            Pig                Spark native API
                 Hive               Spark SQL
                 Mahout             MLlib
                 Storm              Spark Streaming
                 Giraph             GraphX
                 HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and memory
Running tasks     Unix processes                Linux container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
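To illustrate the mix-and-match point, a hedged sketch of the SQL side: a query like the following could be issued through Spark SQL against a Hive table or a registered temporary table (the table and column names are invented for the example):

```sql
-- Count requests per HTTP status over a hypothetical logs table
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status
ORDER BY hits DESC;
```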

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini-batches
Latency                      Sub-second                  Few seconds
Fault tolerance:             At least once               Exactly once
every record processed       (may be duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions offered as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 6: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

6

2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015: http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe: http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

9

3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

10

3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

12

4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: beware of <Hadoop vendor>-tinted goggles. FUD is still being 'offered' by some Hadoop vendors; claims need to be contextualized.
4. Analysts: build a thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2. Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (machine learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!

20

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); user-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• The consequence: a need to integrate many disparate tools for advanced Big Data analytics covering queries, streaming analytics, machine learning, and graph analytics.

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1. Evolution
• 'Spark': for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• An automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                 Tez                       Spark
License           Open source, Apache 2.0;  Open source, Apache 2.0;  Open source, Apache 2.0;
                  version 2.x               version 0.x               version 1.x
Processing model  On-disk (disk-based       On-disk; batch,           In-memory and on-disk;
                  parallelization); batch   interactive               batch, interactive, streaming
                                                                      (near real-time)
Written in        Java                      Java                      Scala
API               [Java, Python, Scala];    Java; [ISV/engine/tool    [Scala, Java, Python];
                  user-facing               builder]                  user-facing
Libraries         None; separate tools      None                      [Spark Core, Spark Streaming,
                                                                      Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                 Tez                       Spark
Installation      Bound to Hadoop           Bound to Hadoop           Isn't bound to Hadoop
Ease of use       Difficult to program;     Difficult to program;     Easy to program, no need
                  needs abstractions; no    no interactive mode       of abstractions;
                  interactive mode except   except Hive/Pig           interactive mode
                  Hive/Pig
Compatibility     Same for data types       Same for data types       Same for data types
                  and data sources          and data sources          and data sources
YARN integration  YARN application          Ground-up YARN            Spark is moving
                                            application               towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce                 Tez                       Spark
Deployment   YARN                      YARN                      [Standalone, YARN, SIMR, Mesos, …]
Performance                                                      Good performance when data fits into
                                                                 memory; performance degradation
                                                                 otherwise
Security     More features and         More features and         Still in its infancy;
             projects                  projects                  partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
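The mapper/reducer reuse in point 1 is often mechanical: a MapReduce word count's map() and reduce() become arguments to Spark's flatMap and reduceByKey. A minimal plain-Python sketch of that translation (the function names are illustrative; no Spark installation is needed to follow the logic):

```python
from collections import defaultdict

def mapper(line):
    # MapReduce map(): emit a (word, 1) pair for each word in the line
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # MapReduce reduce(): sum all counts emitted for one key
    return (word, sum(counts))

def word_count(lines):
    # "flatMap" step: apply the mapper and flatten the emitted pairs
    pairs = [kv for line in lines for kv in mapper(line)]
    # "reduceByKey" step: group by key (a stand-in for the shuffle), then reduce
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())

print(word_count(["spark and hadoop", "spark or hadoop"]))
# {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```

In Spark the same mapper and reducer would be passed to lines.flatMap(...) and the per-key reduction handled by reduceByKey, which is why the reuse requires so little new code.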

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
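Assuming a Pig build that ships the Spark execution engine, the migration is meant to be a one-flag change (the script name is a placeholder):

```shell
pig -x mapreduce wordcount.pig   # existing MapReduce execution
pig -x spark     wordcount.pig   # the same script on the Spark engine
```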

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
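In practice the switch is a single session setting, as sketched below (the query and table name are illustrative):

```sql
set hive.execution.engine=spark;  -- instead of the default mr (or tez)
SELECT COUNT(*) FROM my_table;    -- now planned and executed on Spark
```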

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: categories of Hadoop ecosystem services paired with example open-source tools]
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, such as the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• A benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra in Spark, and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its ability to read and write JSON text files.

49

3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)

• Integration is still improving; open Spark-on-YARN issues can be tracked in JIRA with the query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC

• Some issues are critical ones.
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
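A minimal sketch of querying an existing Hive table from Spark SQL, assuming a Spark 1.2.x build with Hive support and a configured metastore; the table name "logs" is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-demo").setMaster("local[*]"))

// HiveContext reads table definitions from the Hive metastore
val hiveContext = new HiveContext(sc)

// Run HiveQL directly over an existing Hive table
val levels = hiveContext.sql("SELECT level, COUNT(*) FROM logs GROUP BY level")
levels.collect().foreach(println)
```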

53

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
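A minimal sketch of the receiver-based Kafka integration from the guide above, assuming the spark-streaming-kafka artifact (Spark 1.x API); the ZooKeeper address, consumer group and topic name are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-wordcount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))

// createStream yields (key, message) pairs; keep only the message payload
val lines = KafkaUtils.createStream(
  ssc, "zk-host:2181", "demo-group", Map("events" -> 1)).map(_._2)

// Word count over each 2-second micro-batch
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```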

55

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html
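A minimal sketch of approach 1 (push-based), where Spark Streaming acts as an Avro sink that a Flume agent pushes events to; assumes the spark-streaming-flume artifact, and the host/port are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("flume-events").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Listen on localhost:9999 for events pushed by a Flume Avro sink
val stream = FlumeUtils.createStream(ssc, "localhost", 9999)

// Report how many Flume events arrived in each 5-second batch
stream.count().map(c => s"Received $c Flume events").print()

ssc.start()
ssc.awaitTermination()
```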

56

3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
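A minimal sketch of the Spark 1.2-era JSON API; the input file "people.json" (one JSON object per line) and its fields are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records, no DDL needed
val people = sqlContext.jsonFile("people.json")
people.printSchema()

// Register the inferred SchemaRDD as a table and query it with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```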

57

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
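A minimal round-trip sketch with the Spark 1.2-era API (JSON in, Parquet out, then query the Parquet copy); the file paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Write a SchemaRDD out as a Parquet file, preserving its schema
val events = sqlContext.jsonFile("events.json")
events.saveAsParquetFile("events.parquet")

// Read the Parquet data back and query it with SQL
val parquetEvents = sqlContext.parquetFile("events.parquet")
parquetEvents.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
```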

58

3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/

• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.

• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
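A minimal sketch of the elasticsearch-hadoop native integration described above, assuming its Spark artifact on the classpath and a local Elasticsearch node; the "logs/errors" index/type is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD / saveToEs to SparkContext and RDDs

val conf = new SparkConf()
  .setAppName("es-demo")
  .setMaster("local[*]")
  .set("es.nodes", "localhost")
val sc = new SparkContext(conf)

// Read an index as an RDD of (documentId, fieldMap) pairs
val errors = sc.esRDD("logs/errors")
println(errors.count())

// Any RDD whose elements translate to documents can be saved back
sc.parallelize(Seq(Map("level" -> "WARN", "msg" -> "disk low")))
  .saveToEs("logs/errors")
```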

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity: References

• Apache Mesos vs Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4 Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
  • Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
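File-system agnosticism in practice: the same Spark code reads local files, HDFS paths, or S3 objects, depending only on the URL scheme. A minimal sketch using the Spark 1.x s3n scheme; the bucket and key pattern are hypothetical, and AWS credentials are assumed in the Hadoop configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-wordcount").setMaster("local[*]"))

// No HDFS anywhere: read straight from S3 by URL scheme
val logs = sc.textFile("s3n://my-bucket/logs/2015-03-*.log")
val errorCount = logs.filter(_.contains("ERROR")).count()
println(s"errors: $errorCount")
```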

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS https://www.quantcast.com/engineering/qfs
• …

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
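The deployment target is selected by the master URL, so the application code itself does not change. A minimal sketch with illustrative Spark 1.x master URL values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The same application can target different cluster managers just by
// changing the master URL (values below are illustrative):
//   "local[*]"                    - single JVM, all cores
//   "spark://host:7077"           - standalone cluster
//   "mesos://zk://zk1:2181/mesos" - Mesos cluster
//   "yarn-client"                 - YARN (Spark 1.x syntax)
val conf = new SparkConf().setAppName("deploy-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

println(sc.parallelize(1 to 100).sum())
```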

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Components:
  HDFS   | Tachyon
  YARN   | Mesos
Tools:
  Pig    | Spark native API
  Hive   | Spark SQL
  Mahout | MLlib
  Storm  | Spark Streaming
  Giraph | GraphX
  HUE    | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos
Criteria          | YARN                                      | Mesos
Resource sharing  | Yes                                       | Yes
Written in        | Java                                      | C++
Scheduling        | Memory only                               | CPU and Memory
Running tasks     | Unix processes                            | Linux Container groups
Requests          | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity          | Less mature                               | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming
Criteria                                  | Storm                               | Spark Streaming
Processing model                          | Record at a time                    | Mini batches
Latency                                   | Sub-second                          | Few seconds
Fault tolerance (every record processed)  | At least once (may be duplicates)   | Exactly once
Batch framework integration               | Not available                       | Core Spark API
Supported languages                       | Any programming language            | Scala, Java, Python
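The mini-batch model in the comparison above can be sketched in a few lines: the stream is chopped into small batches (here 1 second) and each batch is processed with the core Spark API. The socket host and port are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")

// Batch interval of 1 second: each DStream operation runs once per batch
val ssc = new StreamingContext(conf, Seconds(1))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
words.map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```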

96

GraphX

'GraphX' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup and even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • IV More QampA
Page 7: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

7

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

8

3 Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica https://databricks.com/blog/2014/01/21/spark-and-hadoop.html

• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia http://www.slideshare.net/databricks/spark-summit2014

• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

9

3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/

• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

10

3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

12

4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:

  • With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.

  • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

13

4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."

• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."

• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014

http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.

2 Surveys: listen to what Spark developers are saying.

3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4 Analysts: thorough understanding of the market dynamics.

15

II Big Data, Typical Big Data Stack, Hadoop, Spark

1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways

16

1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name An incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:

  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!

20

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
• Pig: httppigapacheorg
• Hive: httphiveapacheorg
• Scoobi, a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
• Cascading: httpwwwcascadingorg
• Scalding, a Scala API for Cascading: httptwittercomscalding
• Crunch: httpcrunchapacheorg
• Scrunch: httpcrunchapacheorgscrunchhtml
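The "assembly code" point can be made concrete without a cluster: the word count that takes dozens of lines of raw MapReduce Java collapses to a short functional pipeline in the higher-level APIs listed above. A minimal pure-Python sketch of that pipeline (no Hadoop or Spark required; `lines` stands in for an input dataset):

```python
from collections import Counter

def word_count(lines):
    """Word count as a flatMap -> reduceByKey pipeline, mirroring what
    Pig, Hive, Scalding or Spark generate from a few lines of
    high-level code, versus ~60 lines of raw MapReduce Java."""
    words = (word for line in lines for word in line.split())  # flatMap
    return Counter(words)                                      # reduceByKey

counts = word_count(["to be or", "not to be"])
print(counts["to"], counts["be"], counts["not"])  # 2 2 1
```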

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

| Generation | Engine | Processing modes |
|---|---|---|
| 1st generation | MapReduce | Batch |
| 2nd generation | Tez | Batch, Interactive |
| 3rd generation | Spark | Batch, Interactive, Near-Real-Time |
| 4th generation | Flink | Batch, Interactive, Real-Time, Iterative |

24

1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch, scalability, abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
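The single-job limitation is easy to see in miniature: a logical query like "filter, then aggregate, then rank" becomes several chained MapReduce jobs, each materializing its output to HDFS before the next one starts, which is exactly the overhead a DAG engine avoids. A hedged pure-Python sketch (the three-stage pipeline and the `disk` dict standing in for HDFS are illustrative, not from the deck):

```python
# Each "MapReduce job" must write its result to disk (a dict standing in
# for HDFS) before the next job can read it: three round-trips for one query.
disk = {}

def mr_job(out_name, in_name, fn):
    disk[out_name] = fn(disk[in_name])   # read from disk, write to disk
    return out_name

disk["input"] = [("page", 3), ("page", 1), ("home", 5), ("cart", 2)]
s1 = mr_job("filtered", "input", lambda d: [e for e in d if e[1] > 1])
s2 = mr_job("summed", s1,
            lambda d: {k: sum(v for x, v in d if x == k) for k in {x for x, _ in d}})
s3 = mr_job("top", s2, lambda d: max(d, key=d.get))
print(disk["top"])   # "home"
```

A DAG engine such as Tez or Spark plans all three stages as one graph and keeps the intermediate results in memory instead.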

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
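Why in-memory RDDs matter shows up whenever a dataset is reused: without caching, the lineage is recomputed on every action. A toy pure-Python stand-in for that behavior (this is not the Spark API; the class and counter are invented for illustration):

```python
class LazyDataset:
    """Toy stand-in for an RDD: lazy, recomputed per action unless cached."""
    def __init__(self, compute):
        self._compute = compute      # lineage: how to rebuild the data
        self._cache = None
        self.computations = 0        # counts uncached recomputations

    def cache(self):
        self._cache = list(self._compute())
        return self

    def collect(self):               # an "action"
        if self._cache is not None:
            return self._cache
        self.computations += 1
        return list(self._compute())

rdd = LazyDataset(lambda: (x * x for x in range(5)))
rdd.collect(); rdd.collect()
print(rdd.computations)   # 2: recomputed on every action
rdd.cache()
rdd.collect(); rdd.collect()
print(rdd.computations)   # still 2: now served from memory
```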

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (httpflinkapacheorg) offers:
  - Batch and streaming in the same system
  - Beyond DAGs (cyclic operator graphs)
  - Powerful, expressive APIs
  - Inside-the-system iterations
  - Full Hadoop compatibility
  - An automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink

28

Hadoop MapReduce vs Tez vs Spark

| Criteria | MapReduce | Tez | Spark |
|---|---|---|---|
| License | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x |
| Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time) |
| Language written in | Java | Java | Scala |
| API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing |
| Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX] |

29

Hadoop MapReduce vs Tez vs Spark

| Criteria | MapReduce | Tez | Spark |
|---|---|---|---|
| Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop |
| Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode |
| Compatibility | To data types and data sources: the same | To data types and data sources: the same | To data types and data sources: the same |
| YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN |

30

Hadoop MapReduce vs Tez vs Spark

| Criteria | MapReduce | Tez | Spark |
|---|---|---|---|
| Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …] |
| Performance | | | Good performance when data fits into memory; performance degradation otherwise |
| Security | More features and projects | More features and projects | Still in its infancy; partial support |

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark, from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
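Point 1 can be sketched in a few lines: a MapReduce-style mapper/reducer pair is just two functions, and a Spark-style flatMap/reduceByKey pipeline can call them unchanged. A hedged pure-Python illustration (the function names are mine, not from the Cloudera post; the sort + groupby stands in for the shuffle):

```python
from itertools import groupby

# Existing "MapReduce-style" functions, reused unchanged:
def mapper(line):                  # emits (word, 1) pairs
    return [(w, 1) for w in line.split()]

def reducer(word, counts):         # sums the counts for one key
    return (word, sum(counts))

# Spark-style reuse: flatMap(mapper), then reduceByKey(reducer).
lines = ["spark and hadoop", "spark or hadoop"]
pairs = sorted(p for line in lines for p in mapper(line))   # "shuffle"
result = dict(reducer(k, [v for _, v in grp])
              for k, grp in groupby(pairs, key=lambda p: p[0]))
print(result["spark"], result["hadoop"])  # 2 2
```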

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open; Q1 2015): httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• "Hive on Spark", Szehon Ho (Cloudera), February 11, 2015: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
  - A reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  - Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

41

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
(Slide diagram: for each service in the Hadoop ecosystem, an open source tool that integrates with Spark: storage/serving layer, data formats, data ingestion services, resource management, search, SQL.)

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM, Discardable Distributed Memory (httphortonworkscomblogddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector, httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  - Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
  - Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
  - Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• "Get the most out of Spark on YARN": httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  - Import relational data from Hive tables
  - Run SQL queries over imported data
  - Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  - Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  - Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: "What's Coming in 2015 for Drill": httpdrillapacheorgblog20141216whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide", httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
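The integration model is worth spelling out: Spark Streaming consumes a Kafka topic in micro-batches, turning the messages that arrive in each batch interval into one RDD that is then processed like any other. A minimal pure-Python sketch of micro-batching (no Kafka or Spark involved; `stream` stands in for a topic, and real batching is by time interval rather than by count):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded message stream into fixed-size micro-batches,
    loosely mimicking how Spark Streaming turns a Kafka topic into a
    DStream of RDDs."""
    batch = []
    for msg in stream:
        batch.append(msg)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # flush the final partial batch
        yield batch

stream = (f"event-{i}" for i in range(7))
batches = list(micro_batches(stream, 3))
print([len(b) for b in batches])   # [3, 3, 1]
```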

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  - Approach 1: Flume-style push-based approach
  - Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": httpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
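Schema inference is the key convenience here: Spark SQL scans JSON records and derives column names and types, even when records disagree on which fields they carry. A toy pure-Python version of the idea (this is not Spark's actual algorithm; it only unions keys and records the types seen per field):

```python
import json

def infer_schema(json_lines):
    """Union the fields seen across JSON records and note each field's
    observed types, loosely mimicking Spark SQL's schema inference."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return {k: sorted(v) for k, v in schema.items()}

records = ['{"name": "ada", "age": 36}',
           '{"name": "alan", "city": "london"}']
print(infer_schema(records))
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```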

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  - Import relational data from Parquet files
  - Run SQL queries over imported data
  - Easily write RDDs out to Parquet files
  httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
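What "columnar" buys can be shown in a few lines: row storage keeps each record together, while column storage keeps each field's values together, so a query that touches one column reads only that column, and runs of similar values compress well. A pure-Python sketch of the transposition (the sample data is invented for illustration):

```python
rows = [  # row-oriented: one record per entry, as in a CSV or Avro file
    {"user": "a", "clicks": 10, "country": "US"},
    {"user": "b", "clicks": 3,  "country": "DE"},
    {"user": "c", "clicks": 7,  "country": "US"},
]

# Column-oriented: one list per field, as in a Parquet file.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A query like "SELECT sum(clicks)" touches a single contiguous column:
print(sum(columns["clicks"]))   # 20
print(columns["country"])       # ['US', 'DE', 'US']
```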

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  - Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  - Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  - Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  - Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• "Big Data Web applications for Interactive Hadoop": httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(Slide diagram: Hadoop ecosystem and Spark ecosystem, side by side.)

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41

67

4. Complementarity: Mesos + YARN, references
• "Apache Mesos vs Apache Hadoop YARN": httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad, a Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs MESOS: Can't We All Just Get Along?": httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than the cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• "Get the most out of Spark on YARN": httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
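The Data >> RAM vs Data << RAM rule of thumb above can be written down as a tiny decision helper. The function and its 10x threshold are purely illustrative encodings of the slide's heuristic, not from the deck:

```python
def suggest_engine(data_gb, cluster_ram_gb):
    """Illustrative encoding of the heuristic: Spark shines when the
    working set fits in cluster memory; Tez when it vastly exceeds it."""
    if data_gb <= cluster_ram_gb:
        return "spark"    # data << RAM: cache parsed data in memory
    if data_gb > 10 * cluster_ram_gb:
        return "tez"      # data >> RAM: stream-oriented, mature shuffle
    return "either"       # in between: benchmark both on your workload

print(suggest_engine(100, 512), suggest_engine(50_000, 512))
# spark tez
```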

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• "Framework for the Future of Hadoop", March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  - Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  - MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  - httpssparkapacheorgdocslateststorage-openstack-swifthtml
  - httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
  - Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
  - Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40

83

xPatterns
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39

84

BlueData
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37

85

• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component (Hadoop Ecosystem → Spark Ecosystem):
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware

89

• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

Criteria         | YARN                                     | Mesos
Resource sharing | Yes                                      | Yes
Written in       | Java                                     | C++
Scheduling       | Memory only                              | CPU and Memory
Running tasks    | Unix processes                           | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                              | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. httpssparkapacheorgsql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark Streaming

Criteria                               | Storm                          | Spark Streaming
Processing model                       | Record at a time               | Mini batches
Latency                                | Sub-second                     | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration            | Not available                  | Core Spark API
Supported languages                    | Any programming language       | Scala, Java, Python
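The micro-batch model in the table above can be sketched without any streaming framework at all. The following is a toy, plain-Python illustration (not Storm or Spark Streaming code): a record-at-a-time handler versus grouping a stream into small batches. Real Spark Streaming batches by time interval; batching by count here just keeps the sketch deterministic.

```python
import itertools

def record_at_a_time(stream, handler):
    """Storm-style: invoke the handler once per record (lowest latency)."""
    for record in stream:
        handler(record)

def micro_batches(stream, batch_size):
    """Spark-Streaming-style: group records into small batches,
    then process each batch with the regular batch machinery."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

events = ["a", "b", "c", "d", "e"]
batches = list(micro_batches(events, 2))
# batches == [["a", "b"], ["c", "d"], ["e"]]
```

The trade-off in the table falls out of this shape: per-record dispatch minimizes latency, while batching amortizes scheduling overhead and lets the same code path serve batch and streaming jobs.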

96

GraphX

'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook

• ISpark is an Apache Spark-shell backend for IPython. httpsgithubcomtribbloidISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


8

3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: httpsdatabrickscomblog20140121spark-and-hadoophtml

• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: httpwwwslidesharenetdatabricksspark-summit2014

• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015

9

3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, and no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: httpvisionclouderacommapreduce-spark

• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming and interactive analytics on all your data." httpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml

10

3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." httpswwwmaprcomproductsapache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: httpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." httphortonworkscomhadoopspark

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml

• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

12

4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening.

• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.

• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.

• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.

• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014:

httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. httpenwikipediaorgwikiBig_data

• Hadoop is itself becoming a traditional tool, so the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: httpbigdataandreamostosiname. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. httpswwwyoutubecomwatchv=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming: httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL: httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning): httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX: httpsparkbigdatacomcomponenttagstag6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount

• Pig: httppigapacheorg

• Hive: httphiveapacheorg

• Scoobi, a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi

• Cascading: httpwwwcascadingorg

• Scalding, a Scala API for Cascading: httptwittercomscalding

• Crunch: httpcrunchapacheorg

• Scrunch: httpcrunchapacheorgscrunchhtml
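Why "assembly code"? The raw MapReduce model forces you to spell out a mapper, a shuffle, and a reducer even for trivial jobs. Here is a toy, plain-Python sketch of those three phases for word count (illustrative only, not Hadoop API code):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair per word."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle phase: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts for one key."""
    return (key, sum(values))

lines = ["spark and hadoop", "spark or hadoop"]
pairs = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
# counts == {"spark": 2, "and": 1, "hadoop": 2, "or": 1}
```

The higher-level APIs listed above (Pig, Hive, Scalding, Crunch, and later Spark) exist precisely so that this whole pipeline collapses into one or two declarative lines.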

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-Time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg

• Batch. Scalability. Abstractions (see slide on evolution of programming APIs). User Defined Functions (UDFs)...

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source: httptezapacheorg

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." httpssparkapacheorg

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (httpflinkapacheorg) offers:

• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria            | MapReduce                          | Tez                              | Spark
License             | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java                               | Java                             | Scala
API                 | [Java, Python, Scala], user-facing | Java [ISV / engine / tool builder] | [Scala, Java, Python], user-facing
Libraries           | None, separate tools               | None                             | Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

29

Hadoop MapReduce vs Tez vs Spark

Criteria         | MapReduce                                   | Tez                                | Spark
Installation     | Bound to Hadoop                             | Bound to Hadoop                    | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Compatibility to data types and data sources is the same | Same | Same
YARN integration | YARN application                            | Ground-up YARN application         | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, ...]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:

httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
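The reuse idea in point 1 can be shown without any cluster at all. Below is a plain-Python stand-in (not the real Spark API): `flat_map` and `reduce_by_key` mimic the shape of Spark's `flatMap`/`reduceByKey`, and the existing `mapper`/`reducer` functions are called unchanged inside them. All names are illustrative.

```python
from collections import defaultdict

# Pretend these are your existing Hadoop mapper/reducer functions:
def mapper(record):
    yield (record % 2, record)        # key each number by its parity

def reducer(key, values):
    return (key, max(values))         # keep the largest value per key

# Stdlib stand-ins for the Spark-style operators that would call them:
def flat_map(func, data):
    return [pair for record in data for pair in func(record)]

def reduce_by_key(func, pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [func(k, v) for k, v in groups.items()]

result = dict(reduce_by_key(reducer, flat_map(mapper, [1, 2, 3, 4])))
# result == {1: 3, 0: 4}
```

In real Spark the pipeline would read `rdd.flatMap(mapper).groupByKey()...` against a SparkContext; the migration cost is mostly in the job wiring, not in the map/reduce logic itself.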

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark

• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading on Spark (expected in the 3.1 release)

• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.

Source: httpwwwcascadingorgnew-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. httpscrunchapacheorg

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml

• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

Mahout on Spark (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. httpmahoutapacheorg

• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL. httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

41

Mahout on Spark (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. httpmahoutapacheorgusersbasicsalgorithmshtml

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
Spark integrates with open source tools across the following services:
• Storage / Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767

• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. httpsissuesapacheorgjirabrowseHDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark codebase: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. httpsgithubcomdatastaxspark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. httpstratiogithubiodeep-spark

• Getting started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store resilient distributed datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.

• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files. httpsgithubcommongodbmongo-hadoop

49

3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" Resource Negotiator).

• Integration is still improving; see the open Spark JIRA issues mentioning YARN: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml

• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, httpsissuesapacheorgjirabrowseSPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. httpdrillapacheorg

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. httpkafkaapacheorg

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
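To make "automatically infer the schema" concrete, here is a toy, stdlib-only sketch of the idea (not Spark code): scan the JSON records and collect each field's name and type, tolerating fields that appear only in some records. The sample data is invented.

```python
import json

records = [
    '{"name": "spark", "stars": 4500}',
    '{"name": "hadoop", "stars": 9000, "retired": false}',
]

# Union the fields seen across all records; keep the first type observed.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

# schema == {"name": "str", "stars": "int", "retired": "bool"}
```

Spark SQL does a more sophisticated version of this pass (merging compatible types, handling nesting and arrays), then exposes the result as a queryable SchemaRDD/DataFrame with no DDL written by hand.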

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. httpparquetincubatorapacheorg

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
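Why "columnar" matters can be shown in a few lines of plain Python (a toy sketch of the layout idea only; real Parquet adds encoding, compression and on-disk paging):

```python
# Row layout: one record per dict, as a row-oriented format stores it.
rows = [
    {"user": "ann", "clicks": 10},
    {"user": "bob", "clicks": 3},
]

# Column layout (Parquet's idea): one contiguous list per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}
# columns == {"user": ["ann", "bob"], "clicks": [10, 3]}

# A query touching one column, e.g. SELECT sum(clicks), now scans
# only that column instead of every full row:
total = sum(columns["clicks"])   # 13
```

This column pruning, plus per-column compression, is why analytical engines such as Spark SQL read Parquet so efficiently.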

58

3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro

• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro

• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
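The "Data << RAM" advantage comes from caching a parsed dataset once and reusing it across iterations, instead of re-reading and re-parsing it every pass. A plain-Python sketch (not PySpark; the `parse` step and function names are invented stand-ins for deserialization and `rdd.cache()`):

```python
def parse(raw):
    # stand-in for an expensive parse/deserialization step
    return [int(x) for x in raw]

raw_data = [str(i) for i in range(100_000)]

# MapReduce-style: every iteration goes back to storage and re-parses.
def iterate_uncached(n_iters):
    return [sum(parse(raw_data)) for _ in range(n_iters)]

# Spark-style: parse once, keep the parsed dataset in memory across
# iterations (analogous to rdd.cache()) -- same results, far less
# repeated work when the working set fits in RAM.
def iterate_cached(n_iters):
    cached = parse(raw_data)
    return [sum(cached) for _ in range(n_iters)]

print(iterate_cached(3) == iterate_uncached(3))  # True
```

For n iterations the uncached version pays the parse cost n times and the cached version pays it once, which is why iterative workloads (machine learning, interactive queries) benefit most.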

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: Evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: Healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
• http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS
• https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
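The deployment choices above map to different master URLs when a SparkContext is created. A hedged summary as a small lookup table (illustrative of the Spark 1.x documented forms; host names and ports are placeholders):

```python
# Master URL forms accepted by SparkContext for the main deployment
# modes (illustrative; see the Spark docs for the authoritative list).
MASTER_URLS = {
    "local":      "local[*]",           # all cores on one machine
    "standalone": "spark://host:7077",  # Spark's own cluster manager
    "mesos":      "mesos://host:5050",  # Apache Mesos
    "yarn":       "yarn-client",        # Hadoop YARN, client mode (Spark 1.x)
}

def master_url(mode):
    """Return the master URL form for a deployment mode."""
    try:
        return MASTER_URLS[mode]
    except KeyError:
        raise ValueError("unknown deployment mode: %s" % mode)

print(master_url("local"))  # local[*]
```

The point of the slide holds in code form: only this one string changes between deployments; the application logic stays the same.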

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos

Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and Memory
Running tasks     Unix processes                   Linux Container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
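The conciseness point can be illustrated with word count written as the familiar flatMap/map/reduceByKey chain. This is plain Python mimicking the shape of the Spark API, not PySpark itself; the function name is invented:

```python
from collections import Counter

def word_count(lines):
    """Word count as a flatMap/map/reduceByKey-style chain -- a
    plain-Python stand-in for the few-line Spark version:
    sc.textFile(path).flatMap(split).map(pair).reduceByKey(add)."""
    words = (word for line in lines for word in line.split())  # flatMap
    pairs = ((w, 1) for w in words)                            # map
    counts = Counter()                                         # reduceByKey
    for w, n in pairs:
        counts[w] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
```

Compare this handful of lines with the classic multi-class Java MapReduce WordCount linked earlier in the deck; the logic is identical, only the ceremony differs.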

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
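The "mix and match SQL and imperative APIs" pattern can be sketched with the stdlib's sqlite3 standing in for Spark SQL (this is an analogy, not the Spark API; table and column names are invented): run a declarative query, then keep transforming the result imperatively.

```python
import sqlite3

# sqlite3 stands in for Spark SQL here: run SQL over structured data,
# then post-process the result imperatively -- the same mix-and-match
# pattern Spark SQL enables over SchemaRDDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("alice", 34), ("bob", 19), ("carol", 45)])

# Declarative step: SQL selects and filters.
rows = conn.execute(
    "SELECT name, age FROM people WHERE age > 21 ORDER BY name"
).fetchall()

# Imperative step: arbitrary code over the query result
# (in Spark this could be a .map over the returned rows).
labels = [name.upper() for name, _age in rows]
print(labels)  # ['ALICE', 'CAROL']
```

In Spark SQL the query result is itself an RDD, so the imperative step composes directly with the rest of the pipeline instead of leaving the engine.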

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                 Storm                         Spark Streaming
Processing model         Record at a time              Mini batches
Latency                  Sub-second                    Few seconds
Fault tolerance (every   At least once (may be         Exactly once
record processed)        duplicates)
Batch framework          Not available                 Core Spark API
integration
Supported languages      Any programming language      Scala, Java, Python
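The key distinction in the table is the processing model: Spark Streaming discretizes a stream into mini-batches rather than handling one record at a time. A plain-Python sketch of that discretization (batching by count instead of by time interval, to keep the example deterministic; function name invented):

```python
def micro_batches(stream, batch_size):
    """Group an event stream into mini-batches, the way Spark
    Streaming discretizes a stream into one RDD per batch interval
    (real DStreams batch by time, e.g. every 2 seconds, not by count)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

print(list(micro_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each emitted batch is then processed with the ordinary batch API, which is exactly why the table lists "Core Spark API" as Spark Streaming's batch-framework integration, at the cost of a few seconds of latency per batch.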

96

GraphX

'GraphX' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 9: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

9

3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/

• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

10

3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

12

4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:

• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.

• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.

• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.

• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014 http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool. The above definition is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data http://wiki.apache.org/hadoop/WordCount

• Pig http://pig.apache.org

• Hive http://hive.apache.org

• Scoobi: A Scala productivity framework for Hadoop https://github.com/NICTA/scoobi

• Cascading http://www.cascading.org

• Scalding: A Scala API for Cascading http://twitter.com/scalding

• Crunch http://crunch.apache.org

• Scrunch http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see slide on evolution of programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
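The RDD model has two halves: lazy transformations that only record lineage, and actions that trigger execution. A deliberately tiny plain-Python sketch of that idea (the class and its list-based "lineage" are invented for illustration; real RDDs are partitioned, distributed, and fault-tolerant through that lineage):

```python
class ToyRDD:
    """Minimal sketch of the RDD idea: transformations (map, filter)
    are recorded lazily; nothing runs until an action (collect)."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # recorded lineage of transformations

    def map(self, f):
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):
        # The action: replay the recorded lineage over the data.
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
print(rdd.collect())  # [20, 30, 40]
```

Recording lineage instead of materializing each step is also what lets real Spark recompute lost partitions after a failure, rather than replicating intermediate data.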

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria             MapReduce                  Tez                        Spark
License              Open Source Apache 2.0,    Open Source Apache 2.0,    Open Source Apache 2.0,
                     version 2.x                version 0.x                version 1.x
Processing model     On-Disk (disk-based        On-Disk, Batch,            In-Memory, On-Disk, Batch,
                     parallelization), Batch    Interactive                Interactive, Streaming
                                                                           (Near Real-Time)
Language written in  Java                       Java                       Scala
API                  [Java, Python, Scala],     Java [ISV/Engine/Tool      [Scala, Java, Python],
                     User-Facing                builder]                   User-Facing
Libraries            None, separate tools       None                       [Spark Core, Spark Streaming,
                                                                           Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                  Tez                        Spark
Installation      Bound to Hadoop            Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program,      Difficult to program;      Easy to program, no need
                  needs abstractions; no     no interactive mode        of abstractions;
                  interactive mode (except   (except Hive, Pig)         interactive mode
                  Hive, Pig)
Compatibility     Same for data types        Same for data types        Same for data types
                  and data sources           and data sources           and data sources
YARN integration  YARN application           Ground-up YARN             Spark is moving
                                             application                towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce                Tez                      Spark
Deployment   YARN                     YARN                     [Standalone, YARN, SIMR, Mesos, …]
Performance  -                        -                        Good performance when data fits into
                                                               memory; performance degradation otherwise
Security     More features and        More features and        Still in its infancy; partial support
             projects                 projects

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
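The "reuse your mapper and reducer" point can be made concrete in miniature. Below is plain Python, not the Hadoop or Spark API: hypothetical word-count `mapper`/`reducer` functions are driven by an explicit MR-style shuffle (sort + group), and the very same functions would slot into a Spark-style chain such as `lines.flatMap(mapper).reduceByKey(reducer)`.

```python
from functools import reduce
from itertools import groupby

# The same per-record logic can drive both an MR-style job and a
# Spark-style pipeline -- which is why mappers/reducers often carry
# over unchanged when migrating (hypothetical word-count functions).
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

lines = ["spark and hadoop", "spark"]

# MapReduce-style driver: map, then an explicit shuffle (sort + group),
# then reduce per key.
mapped = sorted(kv for line in lines for kv in mapper(line))
mr_result = {key: reduce(reducer, (v for _, v in group))
             for key, group in groupby(mapped, key=lambda kv: kv[0])}

print(mr_result)  # {'and': 1, 'hadoop': 1, 'spark': 2}
```

Only the driver changes between the two worlds; the per-record functions are the portable part.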

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho (Cloudera): http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

38

(Expected in the Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: services of the Hadoop ecosystem and the open-source tools at each layer — storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
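As a rough illustration of what that automatic inference does, the toy below (plain Python, not Spark SQL) scans JSON records and builds the union of fields with each field's observed type. Spark's real inference is considerably richer (nested structures, type reconciliation across records, and so on).

```python
import json

# Toy sketch of JSON schema inference: scan every record, take the union
# of field names, and note the type observed for each field.
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "Chicago"}',
]

schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema[field] = type(value).__name__

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```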

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files.
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
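Why a columnar layout pays off for analytical scans can be sketched in a few lines. This is a plain-Python toy of the idea only, not the Parquet format itself, which adds encodings, compression and predicate pushdown on top.

```python
# Toy sketch of row-oriented vs. column-oriented storage: a row layout
# must touch every record to read one field, while a columnar layout
# keeps each field contiguous, so scanning one column touches one array.
rows = [
    {"user": "a", "bytes": 100},
    {"user": "b", "bytes": 250},
    {"user": "c", "bytes": 50},
]

# Row-oriented: scan whole records to sum a single field.
total_row = sum(r["bytes"] for r in rows)

# Column-oriented: the "bytes" column is already a contiguous array.
columns = {"user": ["a", "b", "c"], "bytes": [100, 250, 50]}
total_col = sum(columns["bytes"])

print(total_row, total_col)  # 400 400
```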

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark."
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open-source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem and Spark ecosystem components side by side]

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos. References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
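The "Data << RAM" point can be sketched in plain Python: without caching, an iterative job pays the parsing cost on every pass over the data; caching the parsed result (conceptually what rdd.cache() buys you in Spark) pays it once. A toy illustration:

```python
# Toy sketch of in-memory caching for iterative workloads: count how
# many times the "expensive" parse runs with and without caching.
parse_calls = 0

def parse(raw):
    """Stands in for an expensive parse/deserialize step."""
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching: 10 iterations -> 10 parses.
parse_calls = 0
for _ in range(10):
    total = sum(parse(raw_data))
uncached_parses = parse_calls

# With caching: parse once, then iterate over the in-memory result.
parse_calls = 0
cached = parse(raw_data)
for _ in range(10):
    total = sum(cached)
cached_parses = parse_calls

print(uncached_parses, cached_parses)  # 10 1
```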

70

4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

|           | Hadoop ecosystem | Spark ecosystem          |
| Component | HDFS             | Tachyon                  |
| Component | YARN             | Mesos                    |
| Tool      | Pig              | Spark native API         |
| Tool      | Hive             | Spark SQL                |
| Tool      | Mahout           | MLlib                    |
| Tool      | Storm            | Spark Streaming          |
| Tool      | Giraph           | GraphX                   |
| Tool      | HUE              | Spark Notebook / ISpark  |

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

| Criteria         | YARN                                      | Mesos                                                |
| Resource sharing | Yes                                       | Yes                                                  |
| Written in       | Java                                      | C++                                                  |
| Scheduling       | Memory only                               | CPU and memory                                       |
| Running tasks    | Unix processes                            | Linux container groups                               |
| Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks |
| Maturity         | Less mature                               | Relatively more mature                               |

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

93

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

| Criteria                                 | Storm                           | Spark Streaming     |
| Processing model                         | Record at a time                | Mini-batches        |
| Latency                                  | Sub-second                      | Few seconds         |
| Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once      |
| Batch framework integration              | Not available                   | Core Spark API      |
| Supported languages                      | Any programming language        | Scala, Java, Python |
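The first row of the table is the crux: Storm hands each record to the topology as it arrives, while Spark Streaming collects records into mini-batches per batch interval. A plain-Python toy of the batching idea (assuming a 2-second batch interval; not the Spark Streaming API):

```python
# Toy sketch of micro-batching: group a timestamped event stream into
# mini-batches by arrival time, as Spark Streaming conceptually does.
events = [(0.3, "a"), (0.9, "b"), (2.1, "c"), (3.5, "d"), (4.2, "e")]

batch_interval = 2.0  # seconds per mini-batch
batches = {}
for timestamp, record in events:
    batch_id = int(timestamp // batch_interval)
    batches.setdefault(batch_id, []).append(record)

print(batches)  # {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e']}
```

Each mini-batch is then processed as one small Spark job, which is what buys batch-framework integration at the cost of a few seconds of latency.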

96

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 10: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

10

3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

11

3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

12

4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum

13

4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it."

Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool. The above definition is inadequate!

• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage!
  • BYOC: Bring Your Own Cluster!
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots from BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi: a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding: a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:

• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark', for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria            | Hadoop MapReduce                            | Tez                                  | Spark
License             | Open source, Apache 2.0, version 2.x        | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                        | Java                                 | Scala
API                 | [Java, Python, Scala]; user-facing          | Java; [ISV/engine/tool builder]      | [Scala, Java, Python]; user-facing
Libraries           | None; separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                                                                | Tez                                                         | Spark
Installation     | Bound to Hadoop                                                                 | Bound to Hadoop                                             | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources                                           | Same for data types and data sources                        | Same for data types and data sources
YARN integration | YARN application                                                                | Ground-up YARN application                                  | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | —                          | —                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark, from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark":
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
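The "reuse your mapper and reducer" idea can be sketched with a toy word count in plain Python, with lists standing in for RDDs (illustrative only, not real Spark code; the commented chain shows roughly what the equivalent Spark program would look like):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do or not to do"]

# MapReduce style: explicit map and reduce phases.
def mapper(line):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    return [(w, 1) for w in line.split()]

def reducer(pairs):
    """Sum counts per key (the shuffle/grouping step is implicit here)."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

mr_result = reducer(chain.from_iterable(mapper(l) for l in lines))

# Spark style: the same logic as one chain of transformations.
# With a real SparkContext this would be roughly:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
spark_style = Counter(chain.from_iterable(l.split() for l in lines))

assert mr_result == dict(spark_style)
```

The point of the migration: the mapper and reducer bodies survive unchanged; only the plumbing around them (job setup, chaining of stages) collapses into a few transformation calls.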

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast… or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: "Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading on Spark (expected in the Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.

Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout on Spark (expected in Mahout 1.0)

• Mahout News, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: categories of Hadoop ecosystem services and open source tools that integrate with Spark: storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB, without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%3DSPARK+AND+summary~yarn+AND+status%3DOPEN+ORDER+BY+priority+DESC

• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
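What "automatically infer the schema" means can be illustrated with a toy stand-in in plain Python (Spark SQL's real inference is far more complete; `infer_schema` and the sample records here are made up for illustration):

```python
import json

records = [
    '{"name": "Alice", "age": 34, "city": "Chicago"}',
    '{"name": "Bob", "age": 29, "score": 7.5}',
]

def infer_schema(json_lines):
    """Toy version of JSON schema inference: scan every record and union
    the fields seen, mapping Python types to SQL-ish type names."""
    type_names = {str: "string", int: "integer", float: "double", bool: "boolean"}
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type_names.get(type(value), "string")
    return schema

schema = infer_schema(records)
# schema maps each field seen in any record to an inferred type, e.g.
# {'name': 'string', 'age': 'integer', 'city': 'string', 'score': 'double'}
```

Note that the schema is the union over all records: fields missing from some records ("city", "score") still appear, which is why no up-front DDL is needed.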

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
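The row-versus-column idea behind Parquet can be sketched in plain Python (a toy illustration of the layout, not the Parquet format itself; the sample table is made up):

```python
# Row-oriented layout (like a CSV or Avro file): one record after another.
rows = [
    {"id": 1, "name": "a", "price": 9.99},
    {"id": 2, "name": "b", "price": 1.50},
    {"id": 3, "name": "c", "price": 4.25},
]

# Columnar layout (the idea behind Parquet): one contiguous array per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query touching a single column only has to scan that column's values;
# this is why columnar formats speed up analytical scans, and why runs of
# similar values in one column compress so well.
total = sum(columns["price"])
```

In a real engine, Spark SQL pushes column selection down to the Parquet reader, so unneeded columns are never read off disk at all.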

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem alongside Spark ecosystem]

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos, References

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5 Key Takeaways

1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!

73

IV Spark without Hadoop

1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system, such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
• Amazon S3
• http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS
• https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• …

76

IV Spark without Hadoop

1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) - http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
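In practice, the deployment modes above mostly differ in the `--master` URL passed to spark-submit. A minimal sketch, using Spark 1.x syntax; the host names and `app.py` are placeholders:

```shell
# The --master URL selects the cluster manager (hosts are hypothetical).
spark-submit --master local[4]           app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077  app.py   # standalone Spark cluster
spark-submit --master mesos://host:5050  app.py   # Apache Mesos
spark-submit --master yarn-cluster       app.py   # Hadoop YARN (Spark 1.x syntax)
```

The application code itself does not change between these modes; only the submission command does.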

78

IV Spark without Hadoop

1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions

• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop

1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

Component        | Hadoop Ecosystem | Spark Ecosystem
-----------------|------------------|-------------------------
Storage          | HDFS             | Tachyon
Resource Manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria         | YARN                                       | Mesos
-----------------|--------------------------------------------|------------------------------------------------------
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and Memory
Running tasks    | Unix processes                             | Linux Container groups
Requests         | Specific requests and locality preference  | More generic, but more coding for writing frameworks
Maturity         | Less mature                                | Relatively more mature

91

Spark Native API

• Spark Native API in Scala, Java and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
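The conciseness of that functional style can be sketched in plain Python without Spark installed; in PySpark, the same chain would read `rdd.filter(...).map(...).reduce(...)`:

```python
# Plain-Python stand-in for the chained, functional style of the Spark API.
from functools import reduce

data = range(1, 11)

# filter -> map -> reduce, the same shape an RDD pipeline takes
evens_squared = map(lambda x: x * x, filter(lambda x: x % 2 == 0, data))
total = reduce(lambda a, b: a + b, evens_squared)

print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

The point of the slide stands: one short chain of transformations replaces the boilerplate a hand-written MapReduce job would need.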

92

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
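The "mix and match SQL with imperative code" idea can be sketched with nothing but the Python standard library; here sqlite3 stands in for the SQL engine (in Spark SQL the query would run over a SchemaRDD/DataFrame instead), and the sample rows are made up:

```python
import sqlite3

# Hypothetical sample records standing in for schema-carrying input (JSON, Parquet, ...).
rows = [("alice", 34), ("bob", 19), ("carol", 52)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Declarative step: SQL selects and filters...
adults = conn.execute("SELECT name, age FROM people WHERE age >= 21").fetchall()

# ...imperative step: ordinary code post-processes the result set.
labels = [f"{name} ({age})" for name, age in sorted(adults)]
print(labels)  # ['alice (34)', 'carol (52)']
```

The appeal of Spark SQL is that both steps run in one engine over one dataset, rather than across two systems.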

93

Spark MLlib

• 'Spark MLlib' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
-----------------------------------------|-----------------------------------|---------------------
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
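The "record at a time" vs "mini batches" distinction can be sketched in plain Python: Spark Streaming groups the incoming stream into small batches and processes each batch with the ordinary batch API (the batch size here is a made-up stand-in for the real batch interval):

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Group an unbounded stream into fixed-size mini batches (Spark Streaming style)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(10)                                   # pretend event stream
batches = [sum(b) for b in mini_batches(events, 4)]  # batch-level computation
print(batches)  # [0+1+2+3, 4+5+6+7, 8+9] -> [6, 22, 17]
```

Batching is what gives Spark Streaming its "few seconds" latency and exactly-once semantics, while a record-at-a-time system like Storm reacts per event.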

96

GraphX

• 'GraphX' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook

• Zeppelin http://zeppelin-project.org is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop

1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways

1 File System: Spark is file system agnostic. Bring Your Own Storage!
2 Deployment: Spark is cluster infrastructure agnostic. Choose your deployment!
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 11: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

11

3 Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

12

4 Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?

• Both are already happening:

• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.

• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

13

4 Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.

• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.

• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014

http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5 Key Takeaways

1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: Listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors! Claims need to be contextualized.
4 Analysts: Thorough understanding of the market dynamics.

15

II Big Data, Typical Big Data Stack, Hadoop, Spark

1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways

16

1 Big Data

• Big Data is still one of the most inflated buzzwords of recent years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3 Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name — an incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4 Apache Spark

• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5 Key Takeaways

1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.

21

III Spark with Hadoop

1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

22

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data http://wiki.apache.org/hadoop/WordCount

• Pig http://pig.apache.org

• Hive http://hive.apache.org

• Scoobi: A Scala productivity framework for Hadoop https://github.com/NICTA/scoobi

• Cascading http://www.cascading.org

• Scalding: A Scala API for Cascading http://twitter.com/scalding

• Crunch http://crunch.apache.org

• Scrunch http://crunch.apache.org/scrunch.html
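To see why MapReduce-in-Java gets called "assembly code", compare the several-dozen-line WordCount linked above with the same computation written as a few functional steps; this plain-Python sketch mirrors the shape of the higher-level APIs listed above:

```python
from collections import Counter

lines = ["hello hadoop", "hello spark hello big data"]

# One expression covers the map phase (emit each word) and the
# shuffle/reduce phase (sum the counts per word).
counts = Counter(word for line in lines for word in line.split())

print(counts["hello"])  # 3
```

Each higher-level API (Pig, Hive, Scalding, Crunch) exists precisely to let you write this one-liner instead of hand-wiring Mapper and Reducer classes.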

23

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

Generation     | Engine    | Processing models
---------------|-----------|------------------------------------------
1st Generation | MapReduce | Batch
2nd Generation | Tez       | Batch, Interactive
3rd Generation | Spark     | Batch, Interactive, Near-Real time
4th Generation | Flink     | Batch, Interactive, Real-Time, Iterative

24

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of Programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.

25

1 Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1 Evolution

• 'Spark': for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:

• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria            | MapReduce                                    | Tez                                  | Spark
--------------------|----------------------------------------------|--------------------------------------|------------------------------------------------------------------
License             | Open Source, Apache 2.0, version 2.x         | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model    | On-Disk (disk-based parallelization); Batch  | On-Disk; Batch, Interactive          | In-Memory, On-Disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java                                         | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing           | Java [ISV/Engine/Tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                         | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria         | MapReduce                                                                      | Tez                                                        | Spark
-----------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------
Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is the same                                     | to data types and data sources is the same                 | to data types and data sources is the same
YARN integration | YARN application                                                               | Ground-up YARN application                                 | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria    | MapReduce                  | Tez                        | Spark
------------|----------------------------|----------------------------|--------------------------------------------------------------------------------
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III Spark with Hadoop

1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

32

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
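The "reuse your mapper and reducer" idea can be sketched in plain Python (the function names and the temperature records are hypothetical; in Spark you would pass the same functions to `rdd.flatMap` and `rdd.reduceByKey`):

```python
from collections import defaultdict
from functools import reduce

# Existing MapReduce-style functions (hypothetical names), kept unchanged:
def mapper(record):
    """Emit a (key, value) pair per input record: (year, temperature)."""
    year, temp = record.split(",")
    yield (year, int(temp))

def reducer(a, b):
    """Combine two values for the same key: keep the maximum."""
    return max(a, b)

# Reused in a Spark-like pipeline: flatMap(mapper), then reduceByKey(reducer).
records = ["1950,0", "1950,22", "1951,-11", "1951,13"]
grouped = defaultdict(list)
for r in records:
    for key, value in mapper(r):
        grouped[key].append(value)

max_temps = {k: reduce(reducer, vs) for k, vs in grouped.items()}
print(max_temps)  # {'1950': 22, '1951': 13}
```

The migration cost is low because the per-record and per-key logic carries over; only the job-driver boilerplate changes.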

33

2 Transition

3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still Open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in Beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (status: Open), Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in Beta, expected in Hive 1.1.0)

• Design http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting Started https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in 3.1 release)

• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html

42

III Spark with Hadoop

1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

43

3 Integration

Service categories integrated with Spark, each paired with an open source tool: Storage / Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.

44

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851

45

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark/

• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra to Spark, and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra http://tuplejump.github.io/calliope/

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.

• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files https://github.com/mongodb/mongo-hadoop

49

3 Integration bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop & Spark:
bull Part 1 (Introduction & Setup) httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
bull Part 2 (Hive Example) httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
bull Part 3 (Spark Example & Key Takeaways) httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

bull Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration bull Neo4j is a highly scalable, robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3 Integration YARN bull YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones
bull Running Spark on YARN

httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integration bull Spark SQL provides built-in support for Hive tables

bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables

bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
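To make the Hive integration above concrete, here is a minimal PySpark sketch (Spark 1.2-era API) that runs a SQL query against an existing Hive table through HiveContext. The table name `web_logs`, the column `status_code`, and the helper `top_n_query` are hypothetical, for illustration only; the Spark wiring only runs when pyspark is installed.

```python
def top_n_query(table, group_col, n):
    """Build the aggregation SQL pushed down to Spark SQL / Hive; a pure
    helper so the query text can be inspected without a cluster."""
    return ("SELECT {col}, COUNT(*) AS cnt FROM {tbl} "
            "GROUP BY {col} ORDER BY cnt DESC LIMIT {n}").format(
                col=group_col, tbl=table, n=n)

if __name__ == "__main__":
    try:  # needs Spark 1.2+ built with Hive support
        from pyspark import SparkContext
        from pyspark.sql import HiveContext
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        sc = SparkContext("local[2]", "hive-sketch")
        hc = HiveContext(sc)  # reads the Hive metastore configured on the cluster

        # "web_logs" and "status_code" are hypothetical table/column names.
        for row in hc.sql(top_n_query("web_logs", "status_code", 10)).collect():
            print(row)
        sc.stop()
```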

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark integration is work in progress in 2015, to address new use cases:
bull Use a Drill query (or view) as the input to Spark: Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline

Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integration bull Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull 'Kafka' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag24-kafka
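As a hedged sketch of the native integration described above, here is a receiver-based Kafka stream using `KafkaUtils.createStream` (available in PySpark from Spark 1.3, and requiring the spark-streaming-kafka artifact on the classpath). The ZooKeeper host, topic name, and JSON event layout are hypothetical.

```python
import json

def parse_event(raw):
    """Decode one Kafka message into a (user, action) pair.
    The JSON layout is hypothetical -- adapt it to your topic's format."""
    event = json.loads(raw)
    return (event["user"], event["action"])

if __name__ == "__main__":
    try:  # the streaming demo needs Spark 1.3+ plus the spark-streaming-kafka jar
        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext
        from pyspark.streaming.kafka import KafkaUtils
    except ImportError:
        print("pyspark not available; skipping the streaming demo")
    else:
        sc = SparkContext("local[2]", "kafka-sketch")
        ssc = StreamingContext(sc, 5)  # 5-second micro-batches

        # Receiver-based stream: (ZooKeeper quorum, consumer group,
        # {topic: partitions}); "zk-host:2181" and "events" are placeholders.
        stream = KafkaUtils.createStream(ssc, "zk-host:2181",
                                         "demo-group", {"events": 1})

        # Messages arrive as (key, value) pairs; count (user, action) per batch.
        stream.map(lambda kv: parse_event(kv[1])).countByValue().pprint()

        ssc.start()
        ssc.awaitTermination(timeout=60)
        sc.stop()
```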

55

3 Integration bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume. There are two approaches to this:
bull Approach 1: Flume-style Push-based Approach
bull Approach 2 (Experimental): Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integration bull Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
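The schema inference described above can be sketched as follows (Spark 1.2-era API, where the result is a SchemaRDD). The sample records and the `frequent_visitors` helper are illustrative; the helper mirrors the SQL predicate in pure Python so the logic can be checked without Spark.

```python
import json

SAMPLE = [
    '{"name": "alice", "visits": 3}',
    '{"name": "bob", "visits": 7}',
]

def frequent_visitors(records, threshold):
    """Pure-Python stand-in for the SQL predicate below."""
    return [json.loads(r)["name"] for r in records
            if json.loads(r)["visits"] >= threshold]

if __name__ == "__main__":
    try:  # Spark 1.2-era API; from 1.3 on, SchemaRDD becomes DataFrame
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        sc = SparkContext("local[2]", "json-sketch")
        sqlContext = SQLContext(sc)

        # Schema is inferred automatically -- no DDL needed.
        people = sqlContext.jsonRDD(sc.parallelize(SAMPLE))
        people.registerTempTable("people")

        for row in sqlContext.sql(
                "SELECT name FROM people WHERE visits >= 5").collect():
            print(row.name)
        sc.stop()
```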

57

3 Integration bull Apache Parquet is a columnar storage format available

to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language httpparquetincubatorapacheorg

bull Built-in support in Spark SQL allows you to:
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
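A minimal write-then-query round trip through Parquet might look like this (Spark 1.2-era `saveAsParquetFile`/`parquetFile` API; the output path and rows are illustrative). The `project_columns` helper mimics, in pure Python, the column pruning that Parquet's columnar layout enables.

```python
def project_columns(rows, *cols):
    """Column pruning in pure Python: Parquet lets Spark SQL read only the
    requested columns; this mimics the projected result."""
    return [tuple(row[c] for c in cols) for row in rows]

if __name__ == "__main__":
    try:  # needs Spark 1.2+
        from pyspark import SparkContext
        from pyspark.sql import SQLContext, Row
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        sc = SparkContext("local[2]", "parquet-sketch")
        sqlContext = SQLContext(sc)

        people = sqlContext.inferSchema(sc.parallelize(
            [Row(name="alice", age=34), Row(name="bob", age=29)]))

        # Write out as Parquet, then read it back and query it.
        people.saveAsParquetFile("/tmp/people.parquet")  # path is illustrative
        loaded = sqlContext.parquetFile("/tmp/people.parquet")
        loaded.registerTempTable("people")
        print(sqlContext.sql("SELECT name FROM people WHERE age > 30").collect())
        sc.stop()
```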

58

3 Integration bull Spark SQL Avro Library, for querying Avro data with Spark

SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem:

bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result:
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDK bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

bull Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark
bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases: Tachyon leading the pack, January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together, each for what

it is especially good at, rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services

bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' tag at SparkBigData.com

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching)

bull The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data >> RAM: when processing huge data volumes,

much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration

bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementarity bull Emergence of the 'Smart Execution Engine' layer:

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementarity bull Operating in a Multi-execution Engine Hadoop Environment, by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms, February 23 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways
1 Evolution: compute models are still evolving. Watch out for the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3 Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a non-HDFS file system already supported by Spark: bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FS
bull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)
bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
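To illustrate running Spark against a non-HDFS store, here is a minimal PySpark sketch that reads log files directly from Amazon S3. The bucket name and path are hypothetical, and AWS credentials must be provided in the Hadoop configuration; the pure `error_lines` helper mirrors the filter used in the job.

```python
def error_lines(lines):
    """Pure filter used both locally and inside the Spark job below."""
    return [line for line in lines if "ERROR" in line]

if __name__ == "__main__":
    try:
        from pyspark import SparkContext
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        # Spark reads S3 directly via s3n:// URIs -- no HDFS cluster involved.
        sc = SparkContext("local[2]", "s3-sketch")
        logs = sc.textFile("s3n://my-bucket/logs/*.log")  # hypothetical bucket
        print(logs.filter(lambda line: "ERROR" in line).count())
        sc.stop()
```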

75

1 File System When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull ...

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) - httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions bull Using Spark on a non-Hadoop distribution:

80

Cloud

bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud: From raw data to insights and data products in an instant, March 4 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE bull DSE (DataStax Enterprise), built on Apache Cassandra,

presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready httpwwwstratiocom

bull Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine

bull 'Stratio' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull 'xPatterns' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull 'BlueData' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull 'Guavus' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

                 Hadoop Ecosystem  Spark Ecosystem
Component        HDFS              Tachyon
                 YARN              Mesos
Tools            Pig               Spark native API
                 Hive              Spark SQL
                 Mahout            MLlib
                 Storm             Spark Streaming
                 Giraph            GraphX
                 HUE               Spark Notebook / ISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg

bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs

bull Mesos as Data Center "OS":
bull Share a datacenter between multiple cluster computing apps; provide new abstractions and services
bull Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

bull 'Mesos' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and Memory
Running tasks     Unix processes               Linux container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature

91

Spark Native API
bull Spark native API in Scala, Java, and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup, May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull 'Spark Core' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag11-core-spark
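The conciseness of the native API can be shown with the classic word count, sketched here in PySpark. The input file name is illustrative, and the map-side logic is kept in a pure helper so it can be checked without a Spark installation.

```python
from operator import add

def word_pairs(line):
    """Map side of word count: split a line into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    try:  # the cluster demo needs a local Spark installation
        from pyspark import SparkContext
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        sc = SparkContext("local[2]", "wordcount-sketch")
        counts = (sc.textFile("README.md")   # any text file works here
                    .flatMap(word_pairs)
                    .reduceByKey(add))
        for word, n in counts.take(10):
            print(word, n)
        sc.stop()
```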

92

Spark SQL bull Spark SQL is a new SQL engine designed from the

ground up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
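The mix-and-match of SQL and imperative APIs can be sketched as follows (Spark 1.2-era API): a declarative SQL step followed by an arbitrary Python `map` over the same result set. The `orders` data and the `label` helper, including its size threshold of 100, are made up for illustration.

```python
def label(customer, total):
    """Imperative post-processing applied to SQL results; the size
    threshold of 100 is an arbitrary, illustrative rule."""
    return "%s: %s" % (customer.upper(), "big" if total > 100 else "small")

if __name__ == "__main__":
    try:  # Spark 1.2-era API
        from pyspark import SparkContext
        from pyspark.sql import SQLContext, Row
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        sc = SparkContext("local[2]", "sql-mix-sketch")
        sqlContext = SQLContext(sc)

        orders = sc.parallelize([Row(customer="acme", total=250),
                                 Row(customer="zen", total=40)])
        sqlContext.inferSchema(orders).registerTempTable("orders")

        # Declarative step: SQL over the registered table ...
        result = sqlContext.sql("SELECT customer, total FROM orders")
        # ... imperative step: arbitrary Python over the same result set.
        print(result.map(lambda r: label(r.customer, r.total)).collect())
        sc.stop()
```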

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

'Spark Streaming' tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing Model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


12

4 Analysts bull Is Apache Spark replacing Hadoop or complementing

existing Hadoop practice?
bull Both are already happening

bull With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures

bull At the same time, Hadoop distributions are all embracing Spark and including it in their offerings

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25 2015 httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum

13

4 Analysts bull "After hearing the confusion between Spark and Hadoop

one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.

bull For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.

bull We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26 2014

httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly
2 Surveys: listen to what Spark developers are saying
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized
4 Analysts: thorough understanding of the market dynamics

15

II Big Data, Typical Big Data Stack, Hadoop, Spark

1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways

16

1 Big Data bull Big Data is still one of the most inflated buzzwords of the last years

bull Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate httpenwikipediaorgwikiBig_data

bull Hadoop is becoming a traditional tool; the above definition is inadequate

bull "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3 Apache Hadoopbull Apache Hadoop as an example of a Typical Big Data

Stack bull Hadoop ecosystem = Hadoop Stack + many other tools

(either open source and free or commercial ones)bull Big Data Ecosystem Dataset httpbigdataandreamostosiname

Incomplete but a useful list of Big Data related projects packed into a JSON dataset

bull Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0

19

4 Apache Spark bull Apache Spark as an example of a Typical Big Data Stack
bull Apache Spark provides you Big Data computing and more:

bull BYOS: Bring Your Own Storage
bull BYOC: Bring Your Own Cluster
bull Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
bull Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
bull Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
bull MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
bull GraphX httpsparkbigdatacomcomponenttagstag6-graphx

bull The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: emergence of the Apache Spark ecosystem

21

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

22

1 Evolution of Programming APIs bull MapReduce in Java is like the assembly code of Big

Data httpwikiapacheorghadoopWordCount

bull Pig httppigapacheorg

bull Hive httphiveapacheorg

bull Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi

bull Cascading httpwwwcascadingorg

bull Scalding A Scala API for Cascading httptwittercomscalding

bull Crunch httpcrunchapacheorg

bull Scrunch httpcrunchapacheorgscrunchhtml

23

1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007 MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop Now we have in addition to MapReduce v2 Tez Spark and Flink

bull 1st Generation: Batch
bull 2nd Generation: Batch, Interactive
bull 3rd Generation: Batch, Interactive, Near-Real time
bull 4th Generation: Batch, Interactive, Real-Time, Iterative

24

1 Evolution bull This is how Hadoop MapReduce is branding itself: "A YARN-based

system for parallel processing of large data sets." httphadoopapacheorg

bull Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)...

bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job

bull Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics
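The single-MR-job limitation can be contrasted with a Spark program that chains several stages (aggregate, join, filter) in one lineage, where classic MapReduce would typically need a separate job per stage. A minimal PySpark sketch with made-up user/event data; the pure `count_per_user` helper mirrors the reduce stage so it can be checked without Spark.

```python
def count_per_user(events):
    """Pure-Python equivalent of the reduceByKey stage below."""
    counts = {}
    for user, _action in events:
        counts[user] = counts.get(user, 0) + 1
    return counts

if __name__ == "__main__":
    try:
        from pyspark import SparkContext
    except ImportError:
        print("pyspark not available; skipping the Spark demo")
    else:
        sc = SparkContext("local[2]", "pipeline-sketch")
        events = sc.parallelize([("u1", "click"), ("u2", "view"), ("u1", "buy")])
        users = sc.parallelize([("u1", "US"), ("u2", "DE")])

        # Three chained stages in one program: aggregate, join, filter.
        per_user = events.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)
        enriched = per_user.join(users)          # (user, (count, country))
        print(enriched.filter(lambda kv: kv[1][0] > 1).collect())
        sc.stop()
```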

25

1 Evolution bull Tez: Hindi for "speed"
bull This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source httptezapacheorg

bull Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop

26

1 Evolution bull 'Spark' for lightning fast speed
bull This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg

bull Apache Spark is a general purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time

bull The rapid in-memory processing of resilient distributed datasets (RDDs) is the ldquocore capabilityrdquo of Apache Spark

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                   | Tez                                         | Spark
License          | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x        | Open Source, Apache 2.0, version 1.x
Processing Model | On-Disk (disk-based parallelization); Batch | On-Disk; Batch, Interactive                 | In-Memory and On-Disk; Batch, Interactive, Streaming (near real-time)
Written in       | Java                                        | Java                                        | Scala
API              | [Java, Python, Scala], user-facing          | Java [ISV/Engine/Tool builder]              | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                        | None                                        | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                                                      | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same data types and data sources                                               | Same data types and data sources                           | Same data types and data sources
YARN integration | YARN application                                                               | Ground-up YARN application                                 | Spark is moving towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | (not compared)             | (not compared)             | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
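The "reuse your mapper and reducer" point can be made concrete without a cluster. Below is a minimal, Spark-free Python sketch (all names illustrative): the same mapper/reducer pair a MapReduce word-count job would use, driven by a tiny local "shuffle", with the equivalent Spark call shown in a comment.

```python
from collections import defaultdict

# Classic MapReduce-style functions, written so they could be
# reused verbatim from an existing job.
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    return (key, sum(values))

def run_job(lines):
    # Shuffle phase: group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: apply the reducer to each group.
    return dict(reducer(k, vs) for k, vs in groups.items())

# In Spark, the same mapper could be called as, roughly:
#   sc.textFile(path).flatMap(mapper).reduceByKey(lambda a, b: a + b)
counts = run_job(["spark and hadoop", "spark or hadoop"])
```

The shuffle/reduce plumbing above is exactly what Spark's `flatMap`/`reduceByKey` replace, which is why the user-written functions carry over.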

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop

bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop ecosystem services and the open source tools that integrate with Spark: storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3 Integrationbull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

46

3 Integration bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra

47

3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new avenues

bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly supported by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

49

3 Integration bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu

ction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph

databasebull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones bull Running Spark on YARN

httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
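To illustrate what "automatically infer the schema" means, here is a toy, Spark-free Python sketch (illustrative only, not Spark SQL's actual algorithm) that scans JSON records and derives a field-to-type mapping:

```python
import json

def infer_schema(json_lines):
    """Infer a simple field -> type-name mapping by scanning every
    record, widening to 'str' when two records disagree on a type."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            if field in schema and schema[field] != t:
                schema[field] = "str"  # crude widening on conflict
            else:
                schema.setdefault(field, t)
    return schema

schema = infer_schema([
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
])
```

Fields missing from some records (like `city` above) still appear in the inferred schema, mirroring how Spark SQL unions the fields it sees across a dataset.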

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
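The benefit of a columnar format like Parquet can be sketched in plain Python (illustrative only, nothing Parquet-specific): pivoting rows into per-column arrays is what makes column pruning and per-column encoding cheap.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]
cols = to_columnar(rows)

# A query touching only 'clicks' now reads a single contiguous list,
# skipping the 'user' column entirely (column pruning).
total_clicks = sum(cols["clicks"])
```

Real columnar formats add per-column compression and statistics on top of this layout, but the row-to-column pivot is the core idea.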

58

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, it achieves higher throughput than traditional disk-based storage systems like HDFS.

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
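The directed-acyclic-graph execution model that Tez and Spark share can be sketched minimally in Python (illustrative only; stage names are made up): stages run in topological order of their dependencies.

```python
def topo_order(dag):
    """Return the stages of `dag` (stage -> set of upstream deps) in an
    order where every stage runs after all of its dependencies."""
    order, done = [], set()

    def visit(stage):
        if stage in done:
            return
        for dep in dag[stage]:
            visit(dep)  # schedule dependencies first
        done.add(stage)
        order.append(stage)

    for stage in dag:
        visit(stage)
    return order

# A toy ETL plan: two map stages feeding a join, then an aggregate.
plan = {"map_a": set(), "map_b": set(),
        "join": {"map_a", "map_b"}, "agg": {"join"}}
schedule = topo_order(plan)
```

Both engines build such a graph from the user's program and then add their own optimizations (pipelining, caching, locality) on top of this ordering.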

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop ecosystem component | Spark ecosystem alternative
HDFS   | Tachyon
YARN   | Mesos
Pig    | Spark native API
Hive   | Spark SQL
Mahout | MLlib
Storm  | Spark Streaming
Giraph | GraphX
HUE    | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria: Storm | Spark Streaming
Processing model: Record at a time | Mini batches
Latency: Sub-second | Few seconds
Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
Batch framework integration: Not available | Core Spark API
Supported languages: Any programming language | Scala, Java, Python
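The two processing models in the table can be sketched with a toy generator in pure Python (no Storm or Spark involved): record-at-a-time processing invokes the handler once per event, while micro-batching invokes it once per small batch, trading a little latency for throughput.

```python
import itertools

def micro_batches(stream, batch_size):
    """Group an event stream into fixed-size mini-batches (Spark Streaming style)."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

events = ["e1", "e2", "e3", "e4", "e5"]

# Record-at-a-time (Storm style): the handler fires once per event.
per_record = [e.upper() for e in events]

# Mini-batch (Spark Streaming style): the handler fires once per batch.
per_batch = [[e.upper() for e in b] for b in micro_batches(events, 2)]

print(per_record)  # ['E1', 'E2', 'E3', 'E4', 'E5']
print(per_batch)   # [['E1', 'E2'], ['E3', 'E4'], ['E5']]
```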

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

6 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


13

4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."

• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."

• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.

http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

14

5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: Listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: Thorough understanding of the market dynamics.

15

II Big Data Typical Big Data Stack Hadoop Spark

1 Big Data2 Typical Big Data Stack 3 Apache Hadoop4 Apache Spark5 Key Takeaways

16

1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the definition above is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete but a useful list of Big Data related projects packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!

20

5 Key Takeaways
1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.

21

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

22

1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data. http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

23

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-real-time
• 4th generation: Batch, Interactive, Real-time, Iterative

24

1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)...

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

25

1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
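That core capability can be sketched with a toy stand-in class (our own illustration, not Spark's API): transformations build a lazy chain, and `cache()` makes the computed result reusable across actions instead of being recomputed from scratch each time.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a lazy transformation chain with optional caching."""
    def __init__(self, compute):
        self._compute = compute      # zero-arg function that produces the data
        self._cache = None
        self._cached = False

    def map(self, f):
        # builds a new lazy node; nothing is computed yet (like RDD transformations)
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cached = True
        return self

    def collect(self):
        # an "action": triggers computation
        if self._cached:
            if self._cache is None:
                self._cache = self._compute()   # computed once, reused afterwards
            return self._cache
        return self._compute()                  # recomputed on every action

squares = ToyRDD(lambda: list(range(5))).map(lambda x: x * x).cache()
print(squares.collect())  # [0, 1, 4, 9, 16]  (computed)
print(squares.collect())  # [0, 1, 4, 9, 16]  (served from the in-memory cache)
```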

27

1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model: On-disk (disk-based parallelization), Batch | On-disk, Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility to data types and data sources: same | same | same
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance: n/a | n/a | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support

31

III Spark with Hadoop

1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

32

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:

http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
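Point 1 can be sketched in plain Python (a conceptual stand-in, not the actual Spark Java/Scala API): the existing mapper and reducer functions survive unchanged and are simply called from a functional pipeline.

```python
# Existing MapReduce-style functions, kept unchanged.
def mapper(line):
    # emit (word, 1) pairs, as a Hadoop Mapper would
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # sum the counts for one key, as a Hadoop Reducer would
    return key, sum(values)

# Reusing them in a Spark-like functional pipeline:
lines = ["a b a", "b c"]
pairs = [kv for line in lines for kv in mapper(line)]         # flatMap(mapper)
grouped = {}
for k, v in pairs:                                            # groupByKey
    grouped.setdefault(k, []).append(v)
counts = sorted(reducer(k, vs) for k, vs in grouped.items())  # map(reducer)
print(counts)  # [('a', 2), ('b', 2), ('c', 1)]
```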

33

2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
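A minimal session sketch of that engine switch (the table and query are hypothetical, and it assumes a Hive build with Spark support):

```sql
-- Switch the execution engine for this session, run a query, then fall back.
set hive.execution.engine=spark;
SELECT dept, count(*) FROM employees GROUP BY dept;
set hive.execution.engine=mr;   -- back to classic MapReduce
```

Because the switch is a session-level setting, the same HiveQL runs unmodified on either engine.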

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop

bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

38

(Expected in 31 release)

bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop

bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release

Source httpwwwcascadingorgnew-fabric-support

bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg

bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml

bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org

• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 10 )

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

42

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

43

3 Integration

Service categories from the Hadoop ecosystem that integrate with Spark (the original slide pairs each category with its open source tools):
• Storage / Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

44

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3 Integrationbull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

46

3 Integration bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra

47

3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new avenues

bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo

httpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

49

3 Integration bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu

ction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph

databasebull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones bull Running Spark on YARN

httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
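The schema-inference idea can be sketched in plain Python with the stdlib json module (a loose analogy only; Spark SQL's actual inference also handles nesting and type widening):

```python
import json

def infer_schema(json_lines):
    """Infer a flat schema (field -> set of observed type names) from JSON
    records, loosely mimicking what Spark SQL's JSON loader does automatically."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

records = ['{"name": "a", "age": 3}', '{"name": "b", "city": "LA"}']
print(infer_schema(records))
# {'name': {'str'}, 'age': {'int'}, 'city': {'str'}}
```

Fields missing from some records still appear in the unified schema, which is why no up-front DDL is needed.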

57

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
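The columnar idea behind Parquet can be shown with a toy sketch (this illustrates the layout only, not the Parquet format itself): storing values column-by-column lets a query that touches one column scan only that column's data.

```python
# Row-oriented records, as a row store (or a CSV file) would lay them out.
rows = [("alice", 34, "NY"), ("bob", 29, "LA"), ("carol", 41, "SF")]

# Column-oriented layout: one contiguous list per column, which is what
# lets a columnar format read (and compress) each column independently.
names, ages, cities = (list(col) for col in zip(*rows))

# A query touching only one column scans just that column.
avg_age = sum(ages) / len(ages)
print(ages)  # [34, 29, 41]
```

Same data, different layout: the columnar form also compresses better, since each column holds values of a single type.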

58

3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in elasticsearch-hadoop was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from

MapReduce to Sparkbull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4 Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
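The rule of thumb above reduces to back-of-the-envelope arithmetic. In this sketch the cluster sizes and the 0.6 usable-memory fraction are illustrative assumptions, not Spark defaults:

```python
def fits_in_cluster_ram(data_gb, nodes, ram_per_node_gb, usable_fraction=0.6):
    """If the working set fits in cluster memory, in-memory caching (Spark)
    pays off; if data >> RAM, a more disk/stream-oriented engine may win.
    The usable fraction accounts for OS, daemons and framework overhead."""
    usable_gb = nodes * ram_per_node_gb * usable_fraction
    return data_gb <= usable_gb

print(fits_in_cluster_ram(500, 10, 128))   # True  (500 GB vs 768 GB usable)
print(fits_in_cluster_ram(5000, 10, 128))  # False (5 TB vs 768 GB usable)
```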

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer:
A smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015 at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
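In most of the deployments above, what changes is essentially the master URL handed to spark-submit. A hedged sketch, not a runnable recipe (master URL forms are from the Spark 1.x documentation; host names and app.py are placeholders):

```sh
# The same application, pointed at different cluster managers via --master:
spark-submit --master local[4]           app.py   # 1. local mode, 4 threads
spark-submit --master spark://host:7077  app.py   # 2. standalone cluster
spark-submit --master mesos://host:5050  app.py   # 3. Apache Mesos
spark-submit --master yarn-cluster       app.py   # on YARN (Spark 1.x syntax)
```

The application code itself is unchanged across these deployments; only the cluster manager behind the master URL differs.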

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos
• Resource sharing: YARN: yes; Mesos: yes
• Written in: YARN: Java; Mesos: C++
• Scheduling: YARN: memory only; Mesos: CPU and memory
• Running tasks: YARN: Unix processes; Mesos: Linux container groups
• Requests: YARN: specific requests and locality preference; Mesos: more generic, but more coding for writing frameworks
• Maturity: YARN: less mature; Mesos: relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
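The schema-from-JSON idea mentioned above can be illustrated in plain Python. This is a conceptual sketch only, not Spark SQL code: it mimics the kind of schema inference Spark SQL performs when ingesting JSON records (the function name and the widen-to-string rule are simplifications invented for the example):

```python
# Illustration only (plain Python, not Spark SQL): scan JSON records and
# derive a field -> type mapping that SQL queries could then be run against.
import json

def infer_schema(json_lines):
    """Merge field types across records, widening to 'str' on conflict."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            if field in schema and schema[field] != t:
                schema[field] = "str"          # conflicting types widen
            else:
                schema.setdefault(field, t)
    return schema

records = ['{"name": "alice", "age": 34}', '{"name": "bob", "city": "LA"}']
print(infer_schema(records))   # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Real Spark SQL infers a richer, nested schema (structs, arrays, nullability), but the principle is the same: the data source itself supplies the schema that SQL then queries.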

93

Spark MLlib

• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming
• Processing model: Storm: record at a time; Spark Streaming: mini-batches
• Latency: Storm: sub-second; Spark Streaming: a few seconds
• Fault tolerance (every record processed): Storm: at least once (may be duplicates); Spark Streaming: exactly once
• Batch framework integration: Storm: not available; Spark Streaming: core Spark API
• Supported languages: Storm: any programming language; Spark Streaming: Scala, Java, Python
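The "processing model" row above can be sketched in plain Python, with no Storm or Spark involved. This is a conceptual illustration only (function names and the event format are invented for the example): record-at-a-time handles each event as it arrives; mini-batching groups events by arrival time and processes each group at once, trading latency for batch semantics.

```python
def record_at_a_time(events, handle):
    for event in events:          # Storm-style: one tuple per invocation
        handle([event])

def mini_batches(events, handle, batch_seconds=2):
    # Spark Streaming-style: bucket (timestamp, payload) events by a time
    # window, then handle each bucket as one small batch.
    buckets = {}
    for timestamp, payload in events:
        buckets.setdefault(timestamp // batch_seconds, []).append((timestamp, payload))
    for window in sorted(buckets):
        handle(buckets[window])

events = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]
batches = []
mini_batches(events, batches.append)
print(batches)   # [[(0, 'a'), (1, 'b')], [(2, 'c')], [(5, 'd')]]
```

The mini-batch design is what lets Spark Streaming reuse the core batch API and get exactly-once semantics per batch, at the cost of the few seconds of latency listed in the table.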

96

GraphX

• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 14: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

14

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.

2. Surveys: Listen to what Spark developers are saying.

3. Vendors: <Hadoop vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4. Analysts: Thorough understanding of the market dynamics.

15

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2. Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete but a useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.

2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4. Apache Spark: Emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html
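The "assembly code" point above can be shown without any Hadoop at all: the same word count written twice in plain Python, once with explicit map/shuffle/reduce phases and once as a single dataflow expression, which is the style the higher-level APIs (Pig, Scalding, Spark, …) give you. Names and data here are invented for the example:

```python
from collections import Counter, defaultdict

text = ["to be or not to be"]

# MapReduce style: explicit map, shuffle (group by key), and reduce phases.
mapped = [(word, 1) for line in text for word in line.split()]
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)
reduced = {word: sum(ones) for word, ones in shuffled.items()}

# Dataflow style: the whole job is one expression.
concise = Counter(word for line in text for word in line.split())

print(reduced == dict(concise))   # True
```

Both produce identical counts; the difference is how much plumbing the programmer writes, which is exactly what the APIs listed above abstract away.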

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real time
• 4th generation (Flink): batch, interactive, real-time, iterative
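Why the "iterative" capability matters can be sketched in plain Python (a conceptual illustration only; the file is simulated with a string, and both functions deliberately compute the same result): an iterative algorithm re-reads its working set on every pass, so a disk-oriented engine re-parses the input per iteration, while an in-memory engine parses once and iterates over cached data.

```python
import io

raw = "1\n2\n3\n4\n"          # stands in for an input file on HDFS

def disk_style(iterations):
    total = 0
    for _ in range(iterations):                  # re-parse the "file" each pass
        data = [int(x) for x in io.StringIO(raw)]
        total += sum(data)
    return total

def cached_style(iterations):
    data = [int(x) for x in io.StringIO(raw)]    # parse once, keep in RAM
    return sum(sum(data) for _ in range(iterations))

print(disk_style(3), cached_style(3))   # 30 30
```

On real data the results are equal but the costs are not: the per-iteration read and parse is where disk-based engines lose time on machine-learning-style workloads.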

24

1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

1. Evolution: Apache Tez
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
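The directed-acyclic-graph idea can be illustrated with a toy in plain Python (this is not Tez code; the task names and pipeline are invented): tasks declare which tasks they depend on, and the framework runs them in a dependency-respecting order instead of forcing everything into map-then-reduce pairs.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# edges: task -> set of tasks it depends on (a small ETL-ish pipeline)
dag = {
    "read_logs": set(),
    "read_users": set(),
    "join": {"read_logs", "read_users"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}

# A valid execution order: every task appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A real DAG engine additionally runs independent branches (here, the two reads) in parallel and streams data between adjacent tasks without materializing it to HDFS, which is where the speedup over chained MapReduce jobs comes from.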

26

1. Evolution: Apache Spark
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
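The RDD idea can be sketched minimally in plain Python (this is not Spark's API; the class and method bodies are invented to illustrate the concept): a dataset records its lineage of transformations lazily, and nothing is computed until an action such as collect() is called.

```python
class MiniRDD:
    def __init__(self, data, ops=()):
        self._data, self._ops = data, ops          # lineage, not results

    def map(self, f):
        return MiniRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return MiniRDD(self._data, self._ops + (("filter", p),))

    def collect(self):                             # action: run the lineage
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = MiniRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# nothing has executed yet; collect() triggers the whole pipeline
print(rdd.collect())   # [0, 4, 16]
```

In Spark the lineage serves a second purpose this sketch omits: if a partition is lost, it is recomputed from the recorded transformations, which is what makes the datasets "resilient".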

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark
• License: all open source, Apache 2.0. MapReduce: version 2.x; Tez: version 0.x; Spark: version 1.x

• Processing model: MapReduce: on-disk (disk-based parallelization), batch; Tez: on-disk, batch, interactive; Spark: in-memory and on-disk, batch, interactive, streaming (near real-time)

• Language written in: MapReduce: Java; Tez: Java; Spark: Scala

• API: MapReduce: [Java, Python, Scala], user-facing; Tez: Java [ISV/engine/tool builder]; Spark: [Scala, Java, Python], user-facing

• Libraries: MapReduce: none, separate tools; Tez: none; Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs Tez vs Spark
• Installation: MapReduce: bound to Hadoop; Tez: bound to Hadoop; Spark: isn't bound to Hadoop

• Ease of use: MapReduce: difficult to program, needs abstractions, no interactive mode (except Hive/Pig); Tez: difficult to program, no interactive mode (except Hive/Pig); Spark: easy to program, no need of abstractions, interactive mode

• Compatibility: to data types and data sources, it is the same for all three

• YARN integration: MapReduce: YARN application; Tez: ground-up YARN application; Spark: moving towards YARN

Hadoop MapReduce vs Tez vs Spark
• Deployment: MapReduce: YARN; Tez: YARN; Spark: [standalone, YARN, SIMR, Mesos, …]

• Performance: Spark: good performance when data fits into memory, performance degradation otherwise

• Security: MapReduce: more features and projects; Tez: more features and projects; Spark: still in its infancy, partial support

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
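Point 1 above can be sketched in plain Python (illustration only, not Spark code; the mapper/reducer bodies are invented): per-record mapper and per-key reducer functions written for MapReduce remain ordinary functions, so they can be reused inside a Spark-style functional pipeline unchanged.

```python
from itertools import groupby

def mapper(line):                     # pre-existing MapReduce-era code
    return [(w, 1) for w in line.split()]

def reducer(key, values):             # pre-existing MapReduce-era code
    return key, sum(values)

lines = ["a b a", "b b"]
# the "pipeline": flatMap(mapper) -> groupByKey -> map(reducer)
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)   # {'a': 2, 'b': 3}
```

The migration cost is then mostly in the driver/plumbing code, not in the business logic carried by the mapper and reducer themselves.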

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout on Spark (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout on Spark (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
Spark integrates with open source tools across the Hadoop ecosystem, by service category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1 - Introduction & Setup: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 - Hive Example: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 - Spark Example & Key Takeaways: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0; support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
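Conceptually, a Spark Streaming Kafka receiver consumes records appended to a topic and hands them on for processing, tracking how far it has read. A minimal pure-Python sketch of that publish/consume pattern (no Kafka or Spark required; all names here are illustrative, not the real APIs):

```python
from collections import defaultdict

class ToyBroker:
    """A tiny in-memory stand-in for a Kafka broker: topics are append-only logs."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of records
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, group, topic, max_records=10):
        """Return up to max_records unread records and advance the group's offset."""
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_records]
        self.offsets[(group, topic)] += len(batch)
        return batch

broker = ToyBroker()
for i in range(5):
    broker.publish("clicks", {"user": i % 2, "n": i})

first = broker.consume("stream-job", "clicks", max_records=3)
rest = broker.consume("stream-job", "clicks", max_records=3)
```

Real Kafka additionally partitions each topic and replicates the log across brokers; Spark Streaming turns such consumed batches into DStreams for processing.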

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
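The schema-inference step mentioned above can be illustrated without Spark: scan a sample of JSON records and collect a field-to-type mapping. This is a simplified sketch; Spark SQL's real inference additionally merges conflicting types and handles nested structures:

```python
import json

def infer_schema(json_lines):
    """Map each top-level field to the set of Python type names seen for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

sample = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 27, "city": "LA"}',   # extra field appears later
]
schema = infer_schema(sample)
```

Note how the `city` field is discovered even though it is absent from the first record; this is why no up-front DDL is needed.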

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
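The "dynamically split the data" step in the use case above can be sketched in plain Python: route heterogeneous inbound records into per-layout buckets keyed by their field signature. This is illustrative only; the real pipeline would do this on Spark and write each bucket out as Avro:

```python
from collections import defaultdict

def split_by_layout(records):
    """Group records by their field signature, so each bucket shares one layout."""
    buckets = defaultdict(list)
    for record in records:
        signature = tuple(sorted(record))   # the set of field names identifies the layout
        buckets[signature].append(record)
    return buckets

inbound = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 4.2, "currency": "USD"},   # a new layout, arriving without notice
]
buckets = split_by_layout(inbound)
```

Each resulting bucket is internally homogeneous, which is what makes a compact schema-based format like Avro applicable per bucket.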

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption (February 12, 2015): http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms (February 23, 2015): http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop (March 9, 2015): http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS (July 11, 2012): https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage (March 9, 2015): http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant (March 4, 2015): https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014 (July 2, 2014): https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform; data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos (September 25, 2014, by Eric Carr): http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives
Hadoop ecosystem component | Spark ecosystem alternative
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook, ISpark

88

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos
Criteria: YARN | Mesos
• Resource sharing: yes | yes
• Written in: Java | C++
• Scheduling: memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: specific requests and locality preference | more generic, but more coding for writing frameworks
• Maturity: less mature | relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python; interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get Java code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine, designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
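The "mix and match SQL with imperative code" idea can be illustrated with any embeddable SQL engine. Below is a small sketch using Python's built-in sqlite3 as a stand-in for Spark SQL (the pattern, not the API, is the point): run a declarative aggregation, then post-process the result with ordinary host-language logic. Table and column names are made up for the example:

```python
import sqlite3

# Load some "ingested" rows into an in-memory SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events GROUP BY user ORDER BY total DESC"
).fetchall()

# Imperative step: arbitrary logic over the query result.
top_user, top_total = rows[0]
labels = {user: ("big" if total > 6 else "small") for user, total in rows}
```

In Spark SQL the same two-step shape applies, except the query runs distributed over RDDs and the result flows back into Scala, Java, or Python code.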

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming
Criteria: Storm | Spark Streaming
• Processing model: record at a time | mini-batches
• Latency: sub-second | few seconds
• Fault tolerance (every record processed): at least once (may be duplicates) | exactly once
• Batch framework integration: not available | Core Spark API
• Supported languages: any programming language | Scala, Java, Python
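The record-at-a-time vs mini-batch distinction in the table above can be sketched in plain Python (illustrative only, no Storm or Spark involved): the same stream is processed either one record per handler call, or buffered into fixed-size batches, which is the trade of latency for throughput that Spark Streaming makes:

```python
def record_at_a_time(stream, handle):
    """Storm-style: invoke the handler once per record (lowest latency)."""
    calls = 0
    for record in stream:
        handle([record])
        calls += 1
    return calls

def mini_batches(stream, handle, batch_size):
    """Spark Streaming-style: buffer records and hand over whole batches."""
    calls, batch = 0, []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            calls += 1
            batch = []
    if batch:                 # flush the final partial batch
        handle(batch)
        calls += 1
    return calls

seen = []
n_single = record_at_a_time(range(7), seen.append)
n_batched = mini_batches(range(7), seen.append, batch_size=3)
```

Fewer handler invocations per record is what lets the mini-batch model reuse the Core Spark batch API; the cost is that each record waits up to one batch interval.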

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


15

II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." (Amir H. Payberah, Swedish Institute of Computer Science (SICS))

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete, but useful, list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!

20

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets": http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." (Source: http://tez.apache.org)
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing": https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark
Criteria: MapReduce | Tez | Spark
• License: open source, Apache 2.0, version 2.x | open source, Apache 2.0, version 0.x | open source, Apache 2.0, version 1.x
• Processing model: on-disk (disk-based parallelization), batch | on-disk, batch, interactive | in-memory and on-disk; batch, interactive, streaming (near real-time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
• Libraries: none, separate tools | none | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark (continued)
Criteria: MapReduce | Tez | Spark
• Installation: bound to Hadoop | bound to Hadoop | isn't bound to Hadoop
• Ease of use: difficult to program, needs abstractions; no interactive mode except Hive/Pig | difficult to program; no interactive mode except Hive/Pig | easy to program, no need of abstractions; interactive mode
• Compatibility: to data types and data sources, the same for all three
• YARN integration: YARN application | ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark (continued)
Criteria: MapReduce | Tez | Spark
• Deployment: YARN | YARN | [standalone, YARN, SIMR, Mesos, ...]
• Performance (Spark): good performance when data fits into memory; performance degradation otherwise
• Security: more features and projects | more features and projects | still in its infancy, partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
  1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
  2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
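Point 1 above, reusing existing mapper and reducer functions, can be sketched without a cluster. Below, hypothetical word-count mapper/reducer functions written in MapReduce style are driven by a plain-Python pipeline that mimics Spark's flatMap / shuffle / reduceByKey chain (all names are illustrative, not Spark's API):

```python
from itertools import groupby
from operator import itemgetter

# MapReduce-style functions, as they might exist in a legacy job (illustrative).
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

def run_like_spark(lines):
    """flatMap(mapper) -> sort (shuffle stand-in) -> reduce per key."""
    pairs = [kv for line in lines for kv in mapper(line)]          # flatMap
    pairs.sort(key=itemgetter(0))                                  # shuffle stand-in
    return dict(
        reducer(word, [c for _, c in group])                       # reduceByKey
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = run_like_spark(["spark or hadoop", "spark and hadoop"])
```

The legacy `mapper` and `reducer` bodies are untouched; only the driver changes, which is the essence of this migration path.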

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark: Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho (Cloudera): http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop ecosystem services and the open source tools that provide them – Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as *the* Resource Negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
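The schema-inference idea can be sketched without Spark: scan the JSON records, collect the union of their fields, and note each field's type. This toy version handles only flat records (Spark SQL's real inference also merges nested structures, arrays, and conflicting types):

```python
import json

# Toy schema inference over flat JSON records (illustration only, not
# Spark SQL's actual algorithm).
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

records = ['{"name": "Alice", "age": 30}', '{"name": "Bob", "city": "NYC"}']
schema = infer_schema(records)
# schema == {"name": "str", "age": "int", "city": "str"}
```

Note how the inferred schema is the union of fields across records, which is exactly why no up-front DDL is needed.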

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
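The benefit of a columnar format like Parquet is easy to see in miniature: row-oriented storage keeps each record together, while column-oriented storage keeps each field together, so a query touching one column reads only that column's data. A Spark-free sketch of the layout difference (the table and field names are made up):

```python
# Row-oriented vs column-oriented layout of the same table — a toy
# illustration of the idea behind Parquet, not its actual encoding.
rows = [
    {"user": "alice", "clicks": 3},
    {"user": "bob", "clicks": 7},
    {"user": "carol", "clicks": 5},
]

# Pivot rows into columns: one contiguous list per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A scan over a single column now touches only that column's values,
# which is why columnar formats speed up analytical queries.
total_clicks = sum(columns["clicks"])
# total_clicks == 15
```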

58

3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

 | Hadoop Ecosystem | Spark Ecosystem
--- | --- | ---
Component | HDFS | Tachyon
Component | YARN | Mesos
Tools | Pig | Spark native API
Tools | Hive | Spark SQL
Tools | Mahout | MLlib
Tools | Storm | Spark Streaming
Tools | Giraph | GraphX
Tools | HUE | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
--- | --- | ---
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
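The conciseness of the native API comes from chaining transformations on an RDD. A toy, Spark-free model of that chaining style in plain Python (the class and log lines below are illustrative stand-ins, not Spark's actual RDD):

```python
# A minimal stand-in for an RDD supporting the chained transformation style of
# the Spark API; real RDDs are distributed and lazily evaluated, this toy is neither.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def collect(self):
        return list(self.data)

logs = ToyRDD([
    "INFO starting job",
    "ERROR disk full",
    "INFO shuffling",
    "ERROR network timeout",
])

# One chained expression, close in shape to the Scala/Python Spark API:
error_kinds = (logs.filter(lambda line: line.startswith("ERROR"))
                   .map(lambda line: line.split()[1])
                   .collect())
# error_kinds == ["disk", "network"]
```

Each transformation returns a new dataset object, which is what makes the fluent one-expression pipelines possible in Scala, Java 8, and Python alike.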

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
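The mix-and-match idea — SQL for the relational step, an ordinary programming API for the rest — can be demonstrated with Python's stdlib sqlite3 standing in for Spark SQL (illustration only; the table and data are made up):

```python
import sqlite3

# Stand-in for Spark SQL's blend of declarative SQL and imperative code,
# using sqlite3 instead of a SchemaRDD/DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 7), ("carol", 5)])

# Declarative step: SQL picks out the heavy clickers...
rows = conn.execute(
    "SELECT user, clicks FROM events WHERE clicks > 4 ORDER BY user").fetchall()

# ...imperative step: ordinary code post-processes the result set.
labels = [f"{user}:{clicks}" for user, clicks in rows]
# labels == ["bob:7", "carol:5"]
```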

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
--- | --- | ---
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance – every record processed | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
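The "mini batches" row is the crux of the comparison: Spark Streaming discretizes a stream into small batches and runs a batch job on each, while Storm hands every record to the processing logic as it arrives. A plain-Python sketch of the micro-batch model (the batch size and event names are made up; real Spark Streaming batches by time interval, not count):

```python
# Toy illustration of Spark Streaming's discretized-stream model: incoming
# records are grouped into fixed-size micro-batches, and each batch is then
# processed as one small batch job.
def micro_batches(records, batch_size):
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def process_batch(batch):
    # Stand-in for a batch job, e.g. counting events per batch.
    return len(batch)

stream = ["click", "view", "click", "buy", "view"]
per_batch_counts = [process_batch(b) for b in micro_batches(stream, 2)]
# per_batch_counts == [2, 2, 1]
```

Batching is what gives Spark Streaming its exactly-once semantics and Core Spark API integration, at the cost of a few seconds of latency.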

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


16

1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.

• Big Data is "a broad term for data sets so large or complex that traditional data processing tools are inadequate." http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool, so the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• "Hadoop's Impact on Data Management's Future", Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)...

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

• License: MapReduce – open source, Apache 2.0, version 2.x; Tez – open source, Apache 2.0, version 0.x; Spark – open source, Apache 2.0, version 1.x

• Processing model: MapReduce – on-disk (disk-based parallelization), batch; Tez – on-disk, batch, interactive; Spark – in-memory and on-disk, batch, interactive, streaming (near real-time)

• Language written in: MapReduce – Java; Tez – Java; Spark – Scala

• API: MapReduce – [Java, Python, Scala], user-facing; Tez – Java [for ISV/engine/tool builders]; Spark – [Scala, Java, Python], user-facing

• Libraries: MapReduce – none, separate tools; Tez – none; Spark – [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs. Tez vs. Spark

• Installation: MapReduce – bound to Hadoop; Tez – bound to Hadoop; Spark – isn't bound to Hadoop

• Ease of use: MapReduce – difficult to program, needs abstractions, no interactive mode (except via Hive/Pig); Tez – difficult to program, no interactive mode (except via Hive/Pig); Spark – easy to program, no need of abstractions, interactive mode

• Compatibility: the same for data types and data sources across all three

• YARN integration: MapReduce – YARN application; Tez – ground-up YARN application; Spark – moving towards YARN

Hadoop MapReduce vs. Tez vs. Spark

• Deployment: MapReduce – YARN; Tez – YARN; Spark – [standalone, YARN, SIMR, Mesos, ...]

• Performance: Spark – good performance when data fits into memory, performance degradation otherwise

• Security: MapReduce – more features and projects; Tez – more features and projects; Spark – still in its infancy, partial support

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
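The translation pattern behind point 1 can be sketched without any cluster: a Hadoop mapper's emit loop corresponds to Spark's flatMap, and the reducer's per-key fold corresponds to reduceByKey. The plain-Python stand-in below (no Spark required; function names are illustrative) shows the same mapper/reducer logic being reused:

```python
# Plain-Python sketch of the classic WordCount translation (illustrative only):
# a Hadoop Mapper's emit loop becomes Spark's flatMap, and the Reducer's
# per-key fold becomes reduceByKey. The same two functions drive both models.

def mapper(line):
    """Hadoop-style mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    """Hadoop-style reducer logic: combine two counts for the same key."""
    return a + b

def word_count(lines):
    # flatMap step: apply the mapper to every line and flatten the pairs,
    # mirroring sc.textFile(...).flatMap(...) in Spark
    pairs = [pair for line in lines for pair in mapper(line)]
    # reduceByKey step: fold the values per key with the reducer function
    counts = {}
    for key, value in pairs:
        counts[key] = reducer(counts[key], value) if key in counts else value
    return counts

print(word_count(["spark and hadoop", "spark or hadoop"]))
```

In actual Spark code the body of word_count collapses to a flatMap followed by map and reduceByKey, with mapper and reducer passed in unchanged.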

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig-on-Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive-on-Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.

Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

Mahout on Spark (expected in Mahout 1.0)

• Mahout news, April 25, 2014: goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

(Expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
Spark integrates with open source tools across the stack:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
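The storage-agnostic point above comes down to URI schemes: the same read call accepts hdfs://, s3n://, file:// and other paths. A plain-Python sketch of that dispatch idea (the scheme names are real; the helper function itself is hypothetical, not a Spark API):

```python
from urllib.parse import urlparse

# Hypothetical helper illustrating how a storage-agnostic reader can route on
# the URI scheme alone, the way sc.textFile() accepts hdfs://, s3n://,
# file:// or bare local paths without storage-specific code in the caller.
SUPPORTED_SCHEMES = {"hdfs", "s3n", "file", ""}  # "" covers bare local paths

def storage_backend(path):
    scheme = urlparse(path).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError("unsupported storage scheme: " + scheme)
    return scheme or "file"

print(storage_backend("hdfs://namenode:8020/data/events"))  # -> hdfs
print(storage_backend("s3n://my-bucket/logs/2015-03-12"))   # -> s3n
print(storage_backend("/tmp/local-file.txt"))               # -> file
```

In Spark itself this routing is handled by the Hadoop FileSystem API, which is what lets any Hadoop-supported store plug in underneath.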

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/

• A Cassandra storage backend with Spark is opening many new avenues.

• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• "Using MongoDB with Hadoop & Spark":
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• "Categorical PageRank Using Neo4j and Apache Spark", Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• "Using Apache Spark and Neo4j for Big Data Graph Analytics", Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving; see the open SPARK JIRA issues mentioning YARN.

• Some of the issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
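The schema-inference idea described above can be illustrated in a few lines of plain Python. This is a deliberately simplified stand-in, not Spark SQL's actual algorithm (which also handles nested structures and type widening):

```python
import json

# Simplified sketch of JSON schema inference: scan every record and collect
# the union of field names, together with the Python type seen for each field.
# Spark SQL's real inference is richer (nested structs, type conflicts).
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

records = [
    '{"name": "spark", "stars": 4000}',
    '{"name": "hadoop", "stars": 9000, "batch_only": true}',
]
print(infer_schema(records))
# -> {'name': 'str', 'stars': 'int', 'batch_only': 'bool'}
```

The point of inference is exactly what the slide says: no DDL is needed, because the schema is recovered from the data itself.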

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
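Why a columnar format matters can be shown with a toy model in plain Python: storing a table column-wise lets a query touch only the columns it projects. This is a sketch of the layout idea only; Parquet adds encoding, compression and row groups on top:

```python
# Toy model of row-oriented vs. column-oriented storage. Reading one column
# from the columnar layout touches a single list, while the row layout forces
# a scan over every full record. That is the core Parquet advantage for
# analytical queries that project only a few columns.
rows = [
    {"user": "a", "clicks": 3, "country": "US"},
    {"user": "b", "clicks": 7, "country": "FR"},
    {"user": "c", "clicks": 5, "country": "US"},
]

# The same table in a column-oriented layout
columns = {
    "user": ["a", "b", "c"],
    "clicks": [3, 7, 5],
    "country": ["US", "FR", "US"],
}

# SELECT sum(clicks): row storage reads whole records ...
total_from_rows = sum(row["clicks"] for row in rows)
# ... while column storage reads exactly one column.
total_from_columns = sum(columns["clicks"])

print(total_from_rows, total_from_columns)  # -> 15 15
```

Both layouts hold identical data; only the access pattern differs, which is what makes Parquet a good fit for the Spark SQL query workloads listed above.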

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• Hue is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references

• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group. Slides: http://files.meetup.com/12753252/

• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• "The Perfect Match: Apache Spark Meets Swift": https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
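In practice, the deployment choice surfaces as the `--master` flag of spark-submit; the application jar is unchanged across all of them (hostnames, ports and jar names below are placeholders):

```shell
# The --master flag selects the cluster infrastructure; the app code is identical
spark-submit --master local[4]          myapp.jar   # local mode, 4 cores
spark-submit --master spark://host:7077 myapp.jar   # Spark standalone cluster
spark-submit --master mesos://host:5050 myapp.jar   # Apache Mesos
spark-submit --master yarn-cluster      myapp.jar   # Hadoop YARN (Spark 1.x syntax)
```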

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40

83

• xPatterns (httpatigeocomtechnology) is a complete Big Data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39

84

• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37

85

• Guavus (httpwwwguavuscom) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives
Component (Hadoop ecosystem → Spark ecosystem):
• HDFS → Tachyon
• YARN → Mesos
Tools (Hadoop ecosystem → Spark ecosystem):
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
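Because Tachyon exposes a Hadoop-compatible FileSystem, switching an existing Spark job onto it is, as a sketch, only a URI change (the master hostname and paths below are placeholders; 19998 is Tachyon's default master port):

```scala
// Same Spark API as for HDFS; only the URI scheme changes to tachyon://
val rdd = sc.textFile("tachyon://tachyon-master:19998/user/data.txt")
val filtered = rdd.filter(_.contains("ERROR"))
filtered.saveAsTextFile("tachyon://tachyon-master:19998/user/output")
```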

89

• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

Criteria         | YARN                                       | Mesos
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and memory
Running tasks    | Unix processes                             | Linux container groups
Requests         | Specific requests and locality preference  | More generic, but more coding to write frameworks
Maturity         | Less mature                                | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
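To illustrate how concise the native API is compared with a multi-class MapReduce WordCount, here is the classic word count as a sketch in the Scala shell (the input path is a placeholder):

```scala
// Word count: a full MapReduce-equivalent job in a few lines of the native API
val counts = sc.textFile("hdfs://namenode:8020/input.txt")
  .flatMap(line => line.split(" "))   // map phase: split lines into words
  .map(word => (word, 1))             // emit (word, 1) pairs
  .reduceByKey(_ + _)                 // reduce phase: sum counts per word

counts.take(10).foreach(println)
```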

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
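As a sketch of the Hive compatibility mentioned above (the table and column names are hypothetical; requires a Spark build with Hive support and `sc` from spark-shell):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads table metadata from the existing Hive metastore
val hiveContext = new HiveContext(sc)

// Declarative SQL over a Hive table (table/columns are placeholders) ...
val top = hiveContext.sql(
  "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")

// ... mixed with the imperative RDD API on the result
top.map(row => row(0) + ": " + row(1)).collect().foreach(println)
```

This is the "mix and match" point: the SQL result is an ordinary distributed dataset, immediately usable with functional transformations.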

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
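As a minimal sketch of MLlib (the Hadoop alternative to Mahout in the table above), here is k-means clustering over numeric feature vectors; the input path and parameters are hypothetical, and `sc` is the spark-shell SparkContext:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors (path is a placeholder)
val data = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()  // iterative algorithms benefit from caching the input in memory

// Train k-means with 3 clusters and up to 20 iterations
val model = KMeans.train(data, 3, 20)
model.clusterCenters.foreach(println)
```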

94

Spark Streaming

'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
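The mini-batch model in the table can be sketched as follows: the stream is cut into small RDD batches, and each batch is processed with the same operators as the batch API (the source host/port are placeholders, `sc` is the spark-shell SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Mini-batch model: the input stream becomes a sequence of 2-second RDD batches
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder text source

// The exact same word-count logic as the batch API, applied per mini-batch
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```

The "Core Spark API" row is visible here: batch and streaming share one set of operators, which is exactly the integration Storm lacks.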

96

GraphX

'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


17

2 Typical Big Data Stack

18

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial)
• Big Data Ecosystem Dataset: httpbigdataandreamostosiname. Incomplete, but a useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: httpswwwyoutubecomwatchv=1KvTZZAkHy0

19

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming: httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL: httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning): httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX: httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
• Pig: httppigapacheorg
• Hive: httphiveapacheorg
• Scoobi, a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
• Cascading: httpwwwcascadingorg
• Scalding, a Scala API for Cascading: httptwittercomscalding
• Crunch: httpcrunchapacheorg
• Scrunch: httpcrunchapacheorgscrunchhtml

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real-Time
• 4th generation: Batch, Interactive, Real-Time, Iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria         | MapReduce                                   | Tez                            | Spark
License          | Open source, Apache 2.0, version 2.x        | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive    | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Written in       | Java                                        | Java                           | Scala
API              | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries        | None; separate tools                        | None                           | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria         | MapReduce                             | Tez                       | Spark
Installation     | Bound to Hadoop                       | Bound to Hadoop           | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same for data types and data sources  | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application                      | Ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
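As a sketch of the translation pattern (the log path and the ERROR-counting logic are hypothetical examples, `sc` is the spark-shell SparkContext): a MapReduce mapper that emits (key, 1) pairs and a reducer that sums them map directly onto `flatMap` and `reduceByKey`, and the mapper function itself can be reused unchanged:

```scala
// The old mapper, reusable as an ordinary function: emit ("ERROR", 1) per matching line
def mapper(line: String): Seq[(String, Int)] =
  if (line.contains("ERROR")) Seq(("ERROR", 1)) else Seq()

val counts = sc.textFile("hdfs:///logs/app.log")
  .flatMap(mapper)       // MapReduce map phase
  .reduceByKey(_ + _)    // MapReduce reduce phase: sum values per key

counts.collect().foreach(println)
```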

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout…

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

41

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
Services and their open source tools that integrate with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API: Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
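The newAPIHadoopRDD route mentioned above can be sketched as follows (modeled on Spark's HBaseTest.scala; the table name is a placeholder, and the HBase client jars must be on the classpath):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Read an HBase table through the Hadoop InputFormat API (table name is a placeholder)
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],  // row key type
  classOf[Result])                  // row contents type

println(hbaseRDD.count())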

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
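As a sketch of the Spark Cassandra Connector API (the keyspace, table and column names are placeholders; the connector jar must be on the classpath):

```scala
import com.datastax.spark.connector._  // enriches SparkContext and RDDs

// Expose a Cassandra table as an RDD of CassandraRow (names are placeholders)
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// Write an ordinary RDD back to a Cassandra table with matching columns
sc.parallelize(Seq(("alice", 30), ("bob", 25)))
  .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age"))
```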

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store resilient distributed datasets (RDDs) from Spark into Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
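A minimal sketch of the native Kafka integration, using the receiver-based API of 2015-era Spark Streaming (the ZooKeeper quorum, consumer group and topic name are placeholders; the spark-streaming-kafka artifact must be on the classpath):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))

// Consume the "events" topic with one receiver thread (all names are placeholders)
val messages = KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("events" -> 1))

// Each record is a (key, value) pair; count the payloads per 5-second batch
messages.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()
```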

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
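The "no more DDL" point can be sketched as follows with the Spark 1.2-era API (the file path, table name and columns are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Point Spark SQL at JSON files; the schema is inferred from the records
val events = sqlContext.jsonFile("hdfs:///data/events.json")
events.printSchema()  // show the inferred nested schema

// Register and query without ever writing a CREATE TABLE statement
events.registerTempTable("events")
sqlContext.sql("SELECT type, COUNT(*) FROM events GROUP BY type")
  .collect().foreach(println)
```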

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
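A round-trip sketch with the Spark 1.2-era API (paths and the Person schema are placeholders; `sqlContext` is a SQLContext created from `sc`):

```scala
// Write an RDD of case classes out to Parquet, then query it back
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

import sqlContext.createSchemaRDD  // implicit RDD-to-SchemaRDD conversion (Spark 1.x)
people.saveAsParquetFile("hdfs:///data/people.parquet")

val loaded = sqlContext.parquetFile("hdfs:///data/people.parquet")
loaded.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)
```

The schema travels with the Parquet files, so the read side needs no declaration at all.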

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
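The "RDD content translated into documents" idea above can be sketched in PySpark via elasticsearch-hadoop's Hadoop OutputFormat (a hedged example: the `logs/entry` index and the record fields are made up, and `main()` is deliberately not invoked since it needs a cluster plus the elasticsearch-hadoop jar on the classpath). The `to_document` helper is plain Python:

```python
def to_document(record):
    """Translate a (host, hits) pair into an Elasticsearch document dict."""
    host, hits = record
    return {"host": host, "hits": hits}

def main():
    # Requires a Spark cluster and the elasticsearch-hadoop jar.
    from pyspark import SparkContext

    sc = SparkContext(appName="EsExample")
    counts = sc.parallelize([("web-1", 42), ("web-2", 7)])

    # elasticsearch-hadoop consumes (key, document) pairs; the key is unused.
    docs = counts.map(lambda r: ("key", to_document(r)))
    docs.saveAsNewAPIHadoopFile(
        path="-",  # ignored by EsOutputFormat
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={"es.resource": "logs/entry"})
```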

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"

• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity: references
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad, a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
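As a hedged sketch of option 4 above, reading a dataset straight from Amazon S3 with no HDFS involved (the bucket, key and log format are made up; the `s3n://` scheme and Hadoop configuration keys follow the Spark 1.x era). The pure `parse_access_line` helper is testable on its own; `main()` is deliberately not invoked here, since it needs a Spark installation and valid AWS credentials:

```python
def parse_access_line(line):
    """Split one access-log line into (ip, path); illustrative format only."""
    ip, path = line.split(" ", 1)
    return ip, path

def main():
    # Requires Spark and AWS credentials; not executed here.
    from pyspark import SparkContext

    sc = SparkContext(appName="S3Example")
    # s3n:// URI and fs.s3n.* keys per the Spark 1.x / Hadoop docs;
    # bucket and key names are hypothetical.
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET")
    logs = sc.textFile("s3n://my-bucket/access.log")
    hits_by_ip = logs.map(parse_access_line).countByKey()
    print(dict(hits_by_ip))
```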

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
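In code, the deployment choice above largely reduces to the master URL handed to Spark when building the context. A hedged sketch (the hostnames and ports are placeholders, and `yarn-client` follows the Spark 1.x syntax):

```python
# Master URLs for the cluster managers listed above
# (hostnames and ports are placeholders, not real endpoints).
MASTERS = {
    "local": "local[*]",                       # 1. local mode, all cores
    "standalone": "spark://master-host:7077",  # 2. Spark standalone cluster
    "mesos": "mesos://mesos-host:5050",        # 3. Apache Mesos
    "yarn": "yarn-client",                     # YARN, Spark 1.x client mode
}

def master_url(mode):
    """Return the master URL for a given deployment mode."""
    return MASTERS[mode]

def main():
    # Requires a Spark installation; not executed here.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("demo").setMaster(master_url("local"))
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())
```

The same application code runs unchanged against any of these masters; only the URL differs.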

78

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3 Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4 Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine, designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
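The "mix and match SQL and imperative APIs" point can be sketched with the Spark 1.2-era `jsonRDD` API (a hedged example: the event records and field names are made up). The `make_events` helper is plain Python; `main()` is deliberately not invoked here, since it needs a Spark installation:

```python
import json

def make_events():
    """Semi-structured sample records as JSON strings (illustrative only)."""
    return [json.dumps({"user": "alice", "clicks": 3}),
            json.dumps({"user": "bob", "clicks": 5})]

def main():
    # Requires a Spark installation; API names follow the Spark 1.2 docs.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="SparkSQLExample")
    sqlContext = SQLContext(sc)

    # The schema is inferred from the JSON records themselves.
    events = sqlContext.jsonRDD(sc.parallelize(make_events()))
    events.registerTempTable("events")

    # Mix declarative SQL with an imperative map over the result rows.
    top = sqlContext.sql("SELECT user, clicks FROM events WHERE clicks > 4")
    print(top.map(lambda row: row.user).collect())
```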

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria: Storm | Spark Streaming
• Processing model: Record at a time | Mini-batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
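The "record at a time" vs. "mini-batches" distinction in the table above can be sketched in plain Python (a toy simulation, not the Storm or Spark Streaming API): Spark Streaming groups the incoming stream into small batches and runs a batch job on each, which is why its latency is a few seconds rather than sub-second.

```python
def micro_batches(stream, batch_size):
    """Spark Streaming style: group the stream into mini-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # one small batch job per group of records
            batch = []
    if batch:
        yield batch          # flush the final partial batch

def record_at_a_time(stream, handle):
    """Storm style: each record is handled individually as it arrives."""
    for record in stream:
        handle(record)
```

For example, `list(micro_batches(range(5), 2))` yields `[[0, 1], [2, 3], [4]]`, while `record_at_a_time` invokes the handler once per record.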

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 18: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

18

3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• "Hadoop's Impact on Data Management's Future", Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on "Hadoop Isn't Just Hadoop Anymore" for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

19

4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

22

1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
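The gap these higher-level APIs close can be sketched in plain Python (a toy stand-in, not Hadoop or Spark code): the same word count written MapReduce-style, with explicit map, shuffle and reduce phases, versus the one-liner a higher-level API boils it down to.

```python
from collections import Counter, defaultdict

def wordcount_mapreduce_style(lines):
    """Explicit map / shuffle / reduce phases, as in hand-written MapReduce."""
    # Map phase: emit (word, 1) pairs.
    emitted = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for word, one in emitted:
        groups[word].append(one)
    # Reduce phase: sum the values for each key.
    return {word: sum(ones) for word, ones in groups.items()}

def wordcount_highlevel(lines):
    """What a higher-level API (Pig, Hive, Spark...) boils the job down to."""
    return dict(Counter(word for line in lines for word in line.split()))
```

Both return `{"a": 2, "b": 1}` for `["a b", "a"]`; the point is how much ceremony the first version needs for the same result.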

23

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
• License: open source, Apache 2.0, version 2.x | open source, Apache 2.0, version 0.x | open source, Apache 2.0, version 1.x
• Processing model: on-disk (disk-based parallelization), batch | on-disk, batch, interactive | in-memory, on-disk; batch, interactive, streaming (near-real-time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
• Libraries: none, separate tools | none | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
• Installation: bound to Hadoop | bound to Hadoop | isn't bound to Hadoop
• Ease of use: difficult to program, needs abstractions; no interactive mode except Hive, Pig | difficult to program; no interactive mode except Hive, Pig | easy to program, no need of abstractions; interactive mode
• Compatibility: to data types and data sources is the same for all three
• YARN integration: YARN application | ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
• Deployment: YARN | YARN | [standalone, YARN, SIMR, Mesos, ...]
• Performance: - | - | good performance when data fits into memory; performance degradation otherwise
• Security: more features and projects | more features and projects | still in its infancy, partial support

31

III Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

33

2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout on Spark (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout on Spark (expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3 Integration
Services and their open source tools (tool logos omitted from this transcript):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

44

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration • Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
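The InputFormat route looks roughly like the following sketch, modeled on the HBaseTest.scala example above (the table name is hypothetical, and a running HBase cluster is assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name

    // Expose the HBase table as an RDD via the Hadoop InputFormat API.
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(s"Rows in table: ${hBaseRDD.count()}")
    sc.stop()
  }
}
```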

46

3. Integration • Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
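A short sketch of the connector's RDD API (keyspace, table, and contact point are hypothetical; a Cassandra cluster and the spark-cassandra-connector dependency are assumed):

```scala
import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-example")
      .set("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point

    val sc = new SparkContext(conf)

    // Read a Cassandra table as an RDD.
    val rdd = sc.cassandraTable("test_ks", "words")
    println(rdd.first())

    // Write an RDD back to Cassandra, mapping tuple fields to columns.
    sc.parallelize(Seq(("spark", 1), ("cassandra", 2)))
      .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))

    sc.stop()
  }
}
```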

47

3. Integration • Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration • MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration • There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1 https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration • Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN • YARN (Yet Another Resource Negotiator): the name is an implicit reference to Mesos as the resource negotiator.

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

• Some issues are critical ones.

• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration • Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
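A sketch of the Hive integration from the Spark side (assumes a Spark build with Hive support and an existing metastore; the table is the classic `src` example from the Spark SQL docs):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-example"))
    val hiveContext = new HiveContext(sc)

    // Run HiveQL directly against tables registered in the Hive metastore.
    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    val top = hiveContext.sql("SELECT key, value FROM src ORDER BY key LIMIT 10")

    // The result is an ordinary RDD of Rows that MLlib or any Spark code can consume.
    top.collect().foreach(println)
    sc.stop()
  }
}
```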

53

3. Integration • Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration • Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
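The receiver-based approach from the integration guide looks roughly like this sketch (ZooKeeper host, consumer group, and topic name are hypothetical; a Kafka cluster and the spark-streaming-kafka artifact are assumed):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-wordcount")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second micro-batches

    // Receiver-based stream: ZooKeeper quorum, consumer group, topic -> #threads.
    val lines = KafkaUtils.createStream(
      ssc, "zkhost:2181", "spark-consumer-group", Map("events" -> 1)).map(_._2)

    // Classic streaming word count over each micro-batch.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```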

55

3. Integration • Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration • Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
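A sketch of schema inference in practice (the path and field names are hypothetical; `jsonFile` is the Spark 1.x API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-example"))
    val sqlContext = new SQLContext(sc)

    // Schema is inferred automatically from the JSON records; no DDL required.
    val people = sqlContext.jsonFile("hdfs:///data/people.json") // hypothetical path
    people.printSchema()
    people.registerTempTable("people")

    // Query the inferred schema with plain SQL.
    sqlContext.sql("SELECT name FROM people WHERE age >= 21")
      .collect().foreach(println)

    sc.stop()
  }
}
```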

57

3. Integration • Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet
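The three bullets above can be sketched as follows (Spark 1.x SchemaRDD API; the case class and paths are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class defines the schema of the Parquet file.
case class Person(name: String, age: Int)

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-example"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit conversion for case-class RDDs

    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

    // Write out as Parquet, read it back, and query it with SQL.
    people.saveAsParquetFile("people.parquet")
    val loaded = sqlContext.parquetFile("people.parquet")
    loaded.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26")
      .collect().foreach(println)

    sc.stop()
  }
}
```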

58

3. Integration • Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK • The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration • Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
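A sketch of the elasticsearch-hadoop RDD API (the node address and index/type names are hypothetical; a running Elasticsearch instance and the es-hadoop dependency are assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs on RDDs and esRDD on SparkContext

object ElasticsearchExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-example")
      .set("es.nodes", "localhost:9200") // Elasticsearch node (hypothetical)

    val sc = new SparkContext(conf)

    // Index two documents into the "spark/docs" index/type.
    val docs = Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop"))
    sc.makeRDD(docs).saveToEs("spark/docs")

    // Read them back as an RDD of (id, document) pairs.
    sc.esRDD("spark/docs").collect().foreach(println)
    sc.stop()
  }
}
```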

61

3. Integration • Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration • HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter http://vimeo.com/83192197

• Big Data Web Applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop: 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways

64

4. Complementarity: components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: Mesos + YARN • Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos references

• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez • Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez • Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity • Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity • Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

74

1. File System: Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon. http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

• MapR-FS
• https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System: when coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

77

2. Deployment: while Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

79

3. Distributions • Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE • DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

87

4. Alternatives

Component | Hadoop Ecosystem | Spark Ecosystem
Storage | HDFS | Tachyon
Resource Management | YARN | Mesos
Tools | Pig | Spark native API
 | Hive | Spark SQL
 | Mahout | MLlib
 | Storm | Spark Streaming
 | Giraph | GraphX
 | HUE | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions that get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

96

GraphX

'GraphX' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 19: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

19

4. Apache Spark • Apache Spark as an example of a typical Big Data stack. • Apache Spark provides you Big Data computing and more:

• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

20

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

21

III. Spark with Hadoop: 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data http://wiki.apache.org/hadoop/WordCount

• Pig http://pig.apache.org

• Hive http://hive.apache.org

• Scoobi: a Scala productivity framework for Hadoop https://github.com/NICTA/scoobi

• Cascading http://www.cascading.org

• Scalding: a Scala API for Cascading http://twitter.com/scalding

• Crunch http://crunch.apache.org

• Scrunch http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models: when the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:

• MapReduce (1st generation): batch
• Tez (2nd generation): batch, interactive
• Spark (3rd generation): batch, interactive, near-real time
• Flink (4th generation): batch, interactive, real-time, iterative

24

1. Evolution • This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)...

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

25

1. Evolution • Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution • 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink • Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

| Criteria | Hadoop MapReduce | Tez | Spark |
|---|---|---|---|
| License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x |
| Processing Model | On-Disk (disk-based parallelization); Batch | On-Disk; Batch, Interactive | In-Memory, On-Disk; Batch, Interactive, Streaming (Near Real-Time) |
| Language written in | Java | Java | Scala |
| API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing |
| Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX] |

29

Hadoop MapReduce vs. Tez vs. Spark

| Criteria | Hadoop MapReduce | Tez | Spark |
|---|---|---|---|
| Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop |
| Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode |
| Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same |
| YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN |

30

Hadoop MapReduce vs. Tez vs. Spark

| Criteria | Hadoop MapReduce | Tez | Spark |
|---|---|---|---|
| Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …] |
| Performance | | | Good performance when data fits into memory; performance degradation otherwise |
| Security | More features and projects | More features and projects | Still in its infancy; partial support |

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
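The translation is usually mechanical. As a hedged sketch (not from the deck), here is the classic MapReduce word count expressed with Spark's 1.x RDD API; the file paths are placeholders:

```scala
// Hypothetical sketch: the canonical MapReduce word count rewritten
// against Spark's RDD API. Paths such as "input.txt" are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input.txt")
      .flatMap(line => line.split("\\s+"))   // the "map" side
      .map(word => (word, 1))
      .reduceByKey(_ + _)                    // the "reduce" side
    counts.saveAsTextFile("hdfs:///counts")
    sc.stop()
  }
}
```

The mapper and reducer logic survive as the functions passed to `flatMap`/`map` and `reduceByKey`; the job-driver boilerplate disappears.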

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop-ecosystem services and the open source tools that integrate with Spark, across the storage/serving layer, data formats, data ingestion services, resource management, search, and SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration: HBase
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
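In the spirit of the HBaseTest.scala example above, a hedged sketch of reading an HBase table as an RDD via newAPIHadoopRDD (the table name "my_table" is a placeholder):

```scala
// Hypothetical sketch: expose an HBase table as a Spark RDD through
// Hadoop's TableInputFormat. Assumes HBase config on the classpath.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

val rows = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"Row count: ${rows.count()}")
```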

46

3. Integration: Cassandra
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
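As a hedged sketch of the DataStax connector's 1.x-era API (keyspace "test" and table "kv" are placeholders):

```scala
// Hypothetical sketch using the Spark Cassandra Connector: read a
// Cassandra table as an RDD, and save an RDD back to Cassandra.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as an RDD...
val rdd = sc.cassandraTable("test", "kv")
println(rdd.first())

// ...and write an RDD back to Cassandra.
sc.parallelize(Seq(("key1", 1), ("key2", 2)))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))
```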

47

3. Integration: Cassandra
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration: MongoDB
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration: MongoDB
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration: Neo4j
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving; see the open Spark-on-YARN issues in the Apache JIRA: https://issues.apache.org/jira/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration: Hive
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
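A hedged sketch of the Spark 1.x HiveContext, which is how Spark SQL reaches tables in the Hive metastore (the table "src" is a placeholder):

```scala
// Hypothetical sketch: query Hive tables from Spark SQL via the
// 1.x-era HiveContext. Assumes a reachable Hive metastore.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
val hiveContext = new HiveContext(sc)

// Run HiveQL against tables registered in the Hive metastore.
val rows = hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
rows.collect().foreach(println)
```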

53

3. Integration: Drill
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration: Kafka
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
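A hedged sketch of the receiver-based integration described in the guide above (ZooKeeper address, consumer group and topic name are placeholders):

```scala
// Hypothetical sketch: word count over a Kafka topic with Spark
// Streaming's 1.x-era KafkaUtils.createStream.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

val lines = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1)).map(_._2)

// Count words in each 2-second mini-batch.
lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```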

55

3. Integration: Flume
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration: JSON
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
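A hedged sketch of the schema-inference workflow on the Spark 1.2 API ("people.json" is a placeholder file with one JSON object per line):

```scala
// Hypothetical sketch: infer a schema from JSON and query it with SQL.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
val sqlContext = new SQLContext(sc)

// Schema is inferred automatically; no DDL needed.
val people = sqlContext.jsonFile("people.json")
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```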

57

3. Integration: Parquet
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
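A hedged sketch of the round trip on the Spark 1.2 SchemaRDD API (file paths are placeholders):

```scala
// Hypothetical sketch: write a dataset out as Parquet, then query it.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
val sqlContext = new SQLContext(sc)

// Load JSON, then write it back out as Parquet...
val people = sqlContext.jsonFile("people.json")
people.saveAsParquetFile("people.parquet")

// ...and query the Parquet file with SQL.
val parquetFile = sqlContext.parquetFile("people.parquet")
parquetFile.registerTempTable("people")
sqlContext.sql("SELECT COUNT(*) FROM people").collect().foreach(println)
```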

58

3. Integration: Avro
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
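A hedged sketch of elasticsearch-hadoop's native RDD integration (the index/type "spark/docs" is a placeholder, and the cluster is assumed on localhost):

```scala
// Hypothetical sketch: save an RDD as Elasticsearch documents and read
// the index back as an RDD, via org.elasticsearch.spark.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Any RDD whose content translates into documents can be saved...
val docs = sc.makeRDD(Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop")))
docs.saveToEs("spark/docs")

// ...and an index can be read back as an RDD of (id, document) pairs.
val esRdd = sc.esRDD("spark/docs")
println(esRdd.count())
```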

61

3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration: HUE
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem side by side with Spark ecosystem]

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
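This infrastructure-agnosticism shows up directly in the API. A hedged sketch (host names and ports are placeholders): the same application targets different cluster managers just by changing the master URL.

```scala
// Hypothetical sketch: one application, several possible masters.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DeploymentAgnostic")
  .setMaster("local[4]")                 // local mode, 4 threads
// .setMaster("spark://master:7077")     // standalone cluster
// .setMaster("mesos://master:5050")     // Apache Mesos
// .setMaster("yarn-client")             // Hadoop YARN (1.x syntax)

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())
```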

78

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

| Hadoop Ecosystem | Spark Ecosystem |
|---|---|
| HDFS | Tachyon |
| YARN | Mesos |
| Pig | Spark native API |
| Hive | Spark SQL |
| Mahout | MLlib |
| Storm | Spark Streaming |
| Giraph | GraphX |
| HUE | Spark Notebook / ISpark |

88

Tachyon
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
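Because Tachyon implements the Hadoop FileSystem API, Spark reaches it through a `tachyon://` URL with no code change beyond the path. A hedged sketch (host, port and paths are placeholders):

```scala
// Hypothetical sketch: the word-count pipeline is unchanged; only the
// storage URLs point at Tachyon instead of HDFS.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TachyonExample"))

val lines = sc.textFile("tachyon://master:19998/input.txt")
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("tachyon://master:19998/counts")
```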

89

Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

| Criteria | YARN | Mesos |
|---|---|---|
| Resource sharing | Yes | Yes |
| Written in | Java | C++ |
| Scheduling | Memory only | CPU and Memory |
| Running tasks | Unix processes | Linux Container groups |
| Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks |
| Maturity | Less mature | Relatively more mature |

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, making code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
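The conciseness is easiest to see in the interactive Scala shell (spark-shell), where `sc` is provided automatically. A hedged sketch (the log path is a placeholder):

```scala
// Hypothetical sketch: top-10 words in a log file, as a spark-shell
// one-liner-style pipeline.
val topWords = sc.textFile("hdfs:///logs/app.log")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)

topWords.foreach(println)
```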

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
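A minimal sketch of the mini-batch model (the socket host and port are placeholders; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Count words over 10-second mini batches from a text socket.
val ssc    = new StreamingContext(sc, Seconds(10))
val lines  = ssc.socketTextStream("localhost", 9999)  // placeholder source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
```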

95

Storm vs Spark Streaming

Criteria                      Storm                               Spark Streaming
Processing model              Record at a time                    Mini batches
Latency                       Sub-second                          Few seconds
Fault tolerance: every        At least once (may be duplicates)   Exactly once
record processed
Batch framework integration   Not available                       Core Spark API
Supported languages           Any programming language            Scala, Java, Python

96

GraphX

'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System: Spark is file system agnostic. Bring your own storage.
2 Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


20

5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: emergence of the Apache Spark ecosystem

21

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

22

1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
• Pig: httppigapacheorg
• Hive: httphiveapacheorg
• Scoobi, a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
• Cascading: httpwwwcascadingorg
• Scalding, a Scala API for Cascading: httptwittercomscalding
• Crunch: httpcrunchapacheorg
• Scrunch: httpcrunchapacheorgscrunchhtml

23

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real time
• 4th Generation: Batch, Interactive, Real-Time, Iterative

24

1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets" (httphadoopapacheorg)
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics

25

1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." (Source: httptezapacheorg)
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing" (httpssparkapacheorg)
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

27

1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria             MapReduce                              Tez                                  Spark
License              Open Source Apache 2.0, version 2.x    Open Source Apache 2.0, version 0.x  Open Source Apache 2.0, version 1.x
Processing model     On-disk (disk-based parallelization);  On-disk; Batch, Interactive          In-memory, on-disk; Batch, Interactive, Streaming (Near Real-Time)
                     Batch
Language written in  Java                                   Java                                 Scala
API                  [Java, Python, Scala], user-facing     Java [ISV/Engine/Tool builder]       [Scala, Java, Python], user-facing
Libraries            None, separate tools                   None                                 [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                                  Tez                                   Spark
Installation      Bound to Hadoop                            Bound to Hadoop                       Isn't bound to Hadoop
Ease of use       Difficult to program, needs abstractions;  Difficult to program;                 Easy to program, no need for abstractions;
                  no interactive mode except Hive, Pig       no interactive mode except Hive, Pig  interactive mode
Compatibility     Same for data types and data sources       Same for data types and data sources  Same for data types and data sources
YARN integration  YARN application                           Ground-up YARN application            Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce                   Tez                         Spark
Deployment   YARN                        YARN                        [Standalone, YARN, SIMR, Mesos, ...]
Performance  -                           -                           Good performance when data fits into memory; performance degradation otherwise
Security     More features and projects  More features and projects  Still in its infancy; partial support

31

III Spark with Hadoop

1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

32

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
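As a sketch of point 2, a typical MapReduce "sum values per key" job collapses to a few lines of Spark (Scala; the input layout and paths are placeholders, and the parsing lines stand in for reused mapper logic):

```scala
// A MapReduce-style "sum per key" pipeline rewritten with Spark RDDs.
// Input format (one record per line): key<TAB>value - a made-up layout.
val records = sc.textFile("hdfs:///tmp/records.tsv")
val sums = records
  .map(_.split("\t"))              // the old mapper's parsing logic, reused
  .map(f => (f(0), f(1).toLong))   // emit (key, value), as a mapper would
  .reduceByKey(_ + _)              // the reducer: sum values per key
sums.saveAsTextFile("hdfs:///tmp/sums")  // output path is a placeholder
```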

33

2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open, Q1 2015): httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release (source: httpwwwcascadingorgnew-fabric-support)
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark (source: httpscaldingio201410running-scalding-on-apache-spark)

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

43

3 Integration
Services and the open source tools that provide them, integrating with Spark:
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

44

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851

45

3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of the Hadoop API anymore: Spark-HBase Connector, httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
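A sketch of the newAPIHadoopRDD approach, along the lines of the HBaseTest.scala example above (the table name is a placeholder; assumes HBase client jars on the classpath and an existing SparkContext `sc`):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Read an HBase table as an RDD via the Hadoop InputFormat API.
// "my_table" is a placeholder table name.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
println("Rows: " + hbaseRDD.count())
```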

46

3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
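A minimal sketch with the DataStax connector (the keyspace and table names are made up; assumes the connector jar on the classpath and a SparkContext `sc` configured with the Cassandra host):

```scala
import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra

// Read a Cassandra table as an RDD and write a derived RDD back.
// Keyspace "test" and tables "words" / "word_totals" are placeholders.
val words = sc.cassandraTable("test", "words")
println("Rows: " + words.count())

val totals = words.map(row => (row.getString("word"), row.getInt("count")))
                  .reduceByKey(_ + _)
totals.saveToCassandra("test", "word_totals", SomeColumns("word", "count"))
```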

47

3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

49

3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
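A sketch of the native integration (the Spark 1.2-era receiver-based API; the ZooKeeper address, consumer group and topic are placeholders; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Consume a Kafka topic in 5-second mini batches.
// "zk:2181", "demo-group" and the topic map are placeholders.
val ssc = new StreamingContext(sc, Seconds(5))
val stream = KafkaUtils.createStream(
  ssc, "zk:2181", "demo-group", Map("events" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)
stream.map(_._2)   // keep the message payload
      .count()
      .print()     // events per batch
ssc.start()
ssc.awaitTermination()
```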

55

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
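A sketch of the schema inference described above (the Spark 1.2-era SQLContext API; the file path and its fields are made up):

```scala
import org.apache.spark.sql.SQLContext

// Infer a schema from JSON, register the result, and query it with SQL.
// "people.json" and its fields are placeholders.
val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
val people = sqlContext.jsonFile("people.json")
people.printSchema()                  // shows the inferred schema
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18")
          .collect().foreach(println)
```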

57

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
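The three bullets above can be sketched as follows (the Spark 1.2-era API; both paths and the JSON source are placeholders):

```scala
import org.apache.spark.sql.SQLContext

// Write a SchemaRDD out to Parquet, read it back, and query it with SQL.
val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
val people = sqlContext.jsonFile("people.json")  // any SchemaRDD source works
people.saveAsParquetFile("people.parquet")       // placeholder output path

val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT COUNT(*) FROM parquet_people")
          .collect().foreach(println)
```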

58

3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
  Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

59

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
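A sketch of the elasticsearch-hadoop RDD integration (the "demo/trips" index/type is a placeholder; assumes a SparkContext `sc` whose conf sets "es.nodes" to the Elasticsearch host):

```scala
import org.elasticsearch.spark._  // adds saveToEs / esRDD (elasticsearch-hadoop)

// Save an RDD of case classes to Elasticsearch, then read the index back.
case class Trip(departure: String, arrival: String)
sc.makeRDD(Seq(Trip("OTP", "SFO"), Trip("MUC", "OTP")))
  .saveToEs("demo/trips")           // placeholder index/type

val trips = sc.esRDD("demo/trips")  // RDD of (documentId, fieldMap)
println("Docs indexed: " + trips.count())
```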

61

3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4 Complementarity
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

69

4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4 Use a non-HDFS file system already supported by Spark:
  • Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• ...

76

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

77

2. Deployment: While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
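The list above boils down to one flag at submit time: the same application binary runs unchanged on each cluster manager, and only the `--master` URL changes. The host names below are hypothetical:

```shell
# One application, four deployment targets (hosts are placeholders).
spark-submit --master local[4]              my_app.py   # local mode, 4 threads
spark-submit --master spark://host:7077     my_app.py   # standalone cluster
spark-submit --master mesos://host:5050     my_app.py   # Apache Mesos
spark-submit --master yarn-cluster          my_app.py   # Hadoop YARN
```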

78

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

79

3. Distributions: Using Spark on a non-Hadoop distribution.

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

87

4. Alternatives

              Hadoop Ecosystem    Spark Ecosystem
Component     HDFS                Tachyon
              YARN                Mesos
Tools         Pig                 Spark native API
              Hive                Spark SQL
              Mahout              MLlib
              Storm               Spark Streaming
              Giraph              GraphX
              HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria           YARN                           Mesos
Resource sharing   Yes                            Yes
Written in         Java                           C++
Scheduling         Memory only                    CPU and memory
Running tasks      Unix processes                 Linux container groups
Requests           Specific requests and          More generic, but more coding
                   locality preference            for writing frameworks
Maturity           Less mature                    Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8, whose much more concise lambda expressions get Java code nearly as simple as the Scala API.
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
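To make the conciseness point concrete, here is the classic word count in the shape of Spark's lambda-based API. The sketch is plain Python, runnable without a cluster; the equivalent (hypothetical) PySpark pipeline is shown in the trailing comment:

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    # flatMap step: split every line into words
    words = chain.from_iterable(line.split() for line in lines)
    # map + reduceByKey steps: pair each word with 1, then sum per word
    return dict(Counter(words))

# The same pipeline in PySpark would read (file name is hypothetical):
#   sc.textFile("input.txt") \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
```

In classic Hadoop MapReduce the same logic requires a Mapper class, a Reducer class, and job-configuration boilerplate; the functional API collapses it to a few chained lambdas.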

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini-batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once               Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python
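The "mini-batches" row is the key architectural difference: Spark Streaming chops the continuous stream into small batches, each processed as an ordinary batch job, whereas Storm handles each record as it arrives. A toy illustration of the mini-batch model in plain Python (not actual Spark Streaming code):

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable, batch_size: int) -> Iterator[List]:
    """Toy sketch of Spark Streaming's model: group the incoming stream
    into small batches; each batch is then processed like one RDD job.
    A record-at-a-time system (Storm's model) would instead handle every
    element individually, trading throughput for per-record latency."""
    batch: List = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand one "mini-batch" to batch logic
            batch = []
    if batch:
        yield batch              # flush the final partial batch
```

In real Spark Streaming the batching is driven by a time interval (the batch duration) rather than a count, but the processing shape is the same.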

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic; bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic; choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


21

III. Spark with Hadoop: 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models: When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-defined functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria            MapReduce                   Tez                        Spark
License             Open source Apache 2.0,     Open source Apache 2.0,    Open source Apache 2.0,
                    version 2.x                 version 0.x                version 1.x
Processing model    On-disk (disk-based         On-disk; batch,            In-memory and on-disk;
                    parallelization); batch     interactive                batch, interactive,
                                                                           streaming (near real-time)
Written in          Java                        Java                       Scala
API                 [Java, Python, Scala],      Java [ISV/engine/tool      [Scala, Java, Python],
                    user-facing                 builder]                   user-facing
Libraries           None; separate tools        None                       [Spark Core, Spark Streaming,
                                                                           Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs. Tez vs. Spark

Criteria            MapReduce                   Tez                        Spark
Installation        Bound to Hadoop             Bound to Hadoop            Isn't bound to Hadoop
Ease of use         Difficult to program,       Difficult to program;      Easy to program, no need
                    needs abstractions; no      no interactive mode        for abstractions;
                    interactive mode except     except Hive, Pig           interactive mode
                    Hive, Pig
Compatibility       Same for data types         Same for data types        Same for data types
                    and data sources            and data sources           and data sources
YARN integration    YARN application            Ground-up YARN             Spark is moving
                                                application                towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria            MapReduce                   Tez                        Spark
Deployment          YARN                        YARN                       [Standalone, YARN, SIMR,
                                                                           Mesos, ...]
Performance                                                                Good performance when data
                                                                           fits into memory; performance
                                                                           degradation otherwise
Security            More features and           More features and          Still in its infancy;
                    projects                    projects                   partial support

31

III. Spark with Hadoop

1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
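A toy sketch of point 1, reusing existing mapper/reducer functions: the MapReduce-style functions stay exactly as they are, and only the driver that wires them together changes. This plain-Python illustration (runnable without a cluster) shows the shape of the idea, not the Cloudera translation recipe itself:

```python
from collections import defaultdict

def mapper(line):
    """MapReduce-style map(): emit (word, 1) pairs for one input line."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """MapReduce-style reduce(): sum all values seen for one key."""
    return word, sum(counts)

def run_mapreduce(lines):
    """A minimal driver: map, shuffle (group by key), then reduce.
    In Spark, this driver would be replaced by an RDD pipeline
    (map over records, then groupByKey/reduceByKey), while mapper()
    and reducer() above are reused unchanged."""
    groups = defaultdict(list)          # the "shuffle" phase
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())
```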

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
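The migration path described above is a one-flag change at launch time (the script name below is hypothetical):

```shell
# Run an existing Pig script on the Spark execution engine ("Spork")
# instead of MapReduce -- no changes to the script itself.
pig -x spark my_script.pig
```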

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
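The switch is a per-session configuration change; existing HiveQL queries then run unchanged:

```sql
-- Config fragment (Hive CLI / Beeline session): switch the execution
-- engine from MapReduce (or Tez) to Spark for this session.
set hive.execution.engine=spark;
```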

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout (expected in Mahout 1.0)

• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

III. Spark with Hadoop: 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways

43

3. Integration: Spark integrates with open source tools across the Hadoop ecosystem, in every service category: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark codebase: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
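To illustrate what "automatically infer the schema" means, here is a toy, plain-Python version of the idea (Spark SQL does this at scale and maps the results to SQL types, not Python type names):

```python
import json

def infer_schema(json_lines):
    """Toy sketch of Spark SQL-style schema inference: scan a dataset
    of JSON records (one per line) and collect the union of field
    names mapped to the Python type name of their values."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            # first occurrence wins in this simplified sketch;
            # Spark SQL would instead compute a compatible merged type
            schema.setdefault(key, type(value).__name__)
    return schema
```

With the schema inferred, records missing a field simply get a null for it, which is how heterogeneous JSON datasets become queryable as one table.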

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity: Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: Mesos + YARN

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment

• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"

• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching)

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)

• Tez supports enterprise security

69

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
  • http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS
  • https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
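A sketch of point 4 above: reading directly from S3 with no HDFS involved at all (the bucket name and path are hypothetical, and AWS credentials are assumed to be configured in the Hadoop configuration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-demo"))

// s3n:// URIs work like any other Hadoop-supported file system;
// the bucket and key prefix here are hypothetical.
val logs = sc.textFile("s3n://my-bucket/logs/2015/*.log")
println(logs.count())
```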

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

             Hadoop ecosystem   Spark ecosystem
Components:
             HDFS               Tachyon
             YARN               Mesos
Tools:
             Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
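Because Tachyon speaks the Hadoop file-system API, using it from Spark is just a matter of URI scheme. A sketch (the Tachyon master host/port and paths are hypothetical; Spark 1.x is assumed for the off-heap storage level):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("tachyon-demo"))

// Read and write through Tachyon exactly as through HDFS.
val data = sc.textFile("tachyon://master:19998/data/input.txt")
data.saveAsTextFile("tachyon://master:19998/data/output")

// In Spark 1.x, OFF_HEAP persistence stores RDD blocks in Tachyon,
// outside the executor JVM heap.
data.persist(StorageLevel.OFF_HEAP)
```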

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs

• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                                  Mesos
Resource sharing  Yes                                   Yes
Written in        Java                                  C++
Scheduling        Memory only                           CPU and Memory
Running tasks     Unix processes                        Linux Container groups
Requests          Specific requests and                 More generic, but more coding
                  locality preference                   for writing frameworks
Maturity          Less mature                           Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
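The conciseness claim is easy to illustrate: the classic word count, which takes a full Mapper/Reducer/driver in MapReduce, is a handful of lines against the Spark native API (input and output paths here are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

// Word count: split lines into words, pair each word with 1, sum per word.
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/counts")
```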

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
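A minimal sketch of that schema-on-ingest workflow, using the Spark 1.2-era API (an existing SparkContext `sc` is assumed; the file path and field names are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// JSON carries its own schema, so it can be queried with SQL directly.
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.registerTempTable("people")

// Mix SQL with the programmatic API on the same data.
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```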

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
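Spark Streaming's mini-batch model (contrasted with Storm in the table below) looks like regular RDD code over a discretized stream. A sketch, with a hypothetical socket source on localhost:9999:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-demo")
val ssc = new StreamingContext(conf, Seconds(2))   // 2-second mini batches

// Each batch of lines is processed with the same operators as a static RDD.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```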

95

Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 22: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

22

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

23

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative

24

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets": http://hadoop.apache.org

• Batch; scalability; abstractions (see slide on the evolution of programming APIs); User Defined Functions (UDFs); ...

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job

• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics

25

1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop

26

1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing": https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria            MapReduce                    Tez                          Spark
License             Open source Apache 2.0,      Open source Apache 2.0,      Open source Apache 2.0,
                    version 2.x                  version 0.x                  version 1.x
Processing model    On-disk (disk-based          On-disk; batch,              In-memory and on-disk; batch,
                    parallelization); batch      interactive                  interactive, streaming
                                                                              (near real-time)
Language written in Java                         Java                         Scala
API                 [Java, Python, Scala],       Java [ISV/engine/tool        [Scala, Java, Python],
                    user-facing                  builder]                     user-facing
Libraries           None, separate tools         None                         Spark Core, Spark Streaming,
                                                                              Spark SQL, MLlib, GraphX

29

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                       Tez                            Spark
Installation      Bound to Hadoop                 Bound to Hadoop                Isn't bound to Hadoop
Ease of use       Difficult to program, needs     Difficult to program;          Easy to program, no need of
                  abstractions; no interactive    no interactive mode            abstractions; interactive mode
                  mode (except Hive, Pig)         (except Hive, Pig)
Compatibility     Same for data types and         Same for data types and        Same for data types and
                  data sources                    data sources                   data sources
YARN integration  YARN application                Ground-up YARN application     Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce                    Tez                          Spark
Deployment   YARN                         YARN                         Standalone, YARN, SIMR, Mesos, ...
Performance  -                            -                            Good performance when data fits into
                                                                       memory; performance degradation otherwise
Security     More features and projects   More features and projects   Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
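A sketch of point 1 above: the `parse` and `merge` functions below stand in for the body of an existing mapper and reducer (both hypothetical); the per-record logic is reused and only the driver changes (an existing SparkContext `sc` and the input path are assumed):

```scala
// Former Mapper.map body: turn one input line into a (key, value) pair.
def parse(line: String): (String, Int) = {
  val fields = line.split("\t")
  (fields(0), fields(1).toInt)
}

// Former Reducer.reduce body: combine two values for the same key.
def merge(a: Int, b: Int): Int = a + b

// The Spark driver replaces the MapReduce job configuration.
val result = sc.textFile("hdfs:///data/input")
  .map(parse)
  .reduceByKey(merge)
```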

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
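In practice the migration is a single flag on the Pig command line (the script name below is hypothetical):

```shell
# run an existing Pig script on the Spark execution engine instead of MapReduce
pig -x spark myscript.pig

# or start the grunt shell on Spark
pig -x spark
```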

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop

• Performance benefits, especially for Hive queries involving multiple reducer stages

• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
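A sketch of what the switch looks like inside the Hive shell once Hive on Spark is available (the table name is hypothetical; existing HiveQL runs unchanged):

```sql
-- pick Spark as the execution engine for this session
set hive.execution.engine=spark;

-- an existing multi-stage query then runs on Spark instead of MapReduce
SELECT category, COUNT(*) FROM products GROUP BY category;
```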

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources

• The Sqoop2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
  • Mahout interactive shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
Open-source tools across the Hadoop ecosystem integrate with Spark, by service category: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
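The first bullet is the everyday case; a minimal sketch of reading from and writing back to HDFS (the namenode address and paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hdfs-demo"))

// Read a text file from HDFS, filter it, and write the result back.
val events = sc.textFile("hdfs://namenode:8020/data/events")
val errors = events.filter(_.contains("ERROR"))
errors.saveAsTextFile("hdfs://namenode:8020/data/errors")
```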

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
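A sketch of the newAPIHadoopRDD route from the first bullet, along the lines of HBaseTest.scala (the table name is hypothetical; an HBase cluster reachable from the Spark driver is assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-demo"))

// Point the Hadoop InputFormat at an HBase table.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "users")

// Each row arrives as a (row key, Result) pair.
val rows = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(rows.count())
```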

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
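A sketch of the Spark Cassandra Connector API from the first bullet (the keyspace, table, columns, and node address are hypothetical):

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as an RDD...
val words = sc.cassandraTable("test", "words")
println(words.count())

// ...and write an RDD back to a table, mapping tuple fields to columns.
sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))
```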

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
  • GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).

• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for feeding datasets to machine learning algorithms in MLlib.
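The bullets above translate into a few lines of Spark code via HiveContext. A minimal Spark 1.2-era sketch (the table name "src" is illustrative; a Hive metastore must be reachable):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext

// Queries run against tables registered in the Hive metastore;
// results come back as an RDD of Rows usable by any Spark API
val rows = hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
rows.collect().foreach(println)
```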

53

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
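The native integration mentioned above is exposed through KafkaUtils. A minimal sketch of the Spark 1.x receiver-based approach (ZooKeeper host, consumer group, and topic name are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))  // sc: an existing SparkContext

// (topic -> number of receiver threads); messages arrive as (key, value) pairs
val lines = KafkaUtils.createStream(
  ssc, "zkhost:2181", "my-consumer-group", Map("my-topic" -> 1)
).map(_._2)  // drop the message key, keep the payload

lines.count().print()   // print how many messages arrived in each 2s batch
ssc.start()
ssc.awaitTermination()
```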

55

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
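A minimal sketch of Approach 1 (push-based): Flume's Avro sink pushes events to a receiver started by Spark Streaming. Host and port below are placeholders and must match the Flume sink configuration:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(sc, Seconds(5))  // sc: an existing SparkContext

// Listen for Flume events pushed to this host:port
val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 9999)

// Decode each event body and print a sample per batch
flumeStream.map(e => new String(e.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()
```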

56

3. Integration

• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
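The schema-inference workflow above looks like this in the Spark 1.2-era API (the file path and query are illustrative):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Point Spark SQL at JSON files; the schema is inferred, no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register the SchemaRDD as a table and query it with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect()
```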

57

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of Parquet and Spark SQL integration: http://www.infoobjects.com/spark-sql-parquet/
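The three bullets above make a round trip through Parquet. A minimal Spark 1.2-era sketch (paths, table name, and the filter column are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Import: Parquet files carry their own schema
val events = sqlContext.parquetFile("hdfs:///warehouse/events.parquet")

// Query: register and filter with SQL
events.registerTempTable("events")
val errors = sqlContext.sql("SELECT * FROM events WHERE level = 'ERROR'")

// Write: save the result back out as Parquet
errors.saveAsParquetFile("hdfs:///warehouse/error_events.parquet")
```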

58

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data, and leverage Avro to store the data in a compact binary format.

59

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
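The elasticsearch-hadoop native integration described above is a couple of lines in each direction. A minimal sketch (node address and index/type names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD to SparkContext and RDDs

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Save any RDD whose elements map to documents
sc.makeRDD(Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop")))
  .saveToEs("library/books")

// Read an index back as an RDD of (document id, document fields)
val docs = sc.esRDD("library/books")
println(docs.count())
```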

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration

• HUE is the open-source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(Diagram: Hadoop ecosystem alongside Spark ecosystem)

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open-source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption (February 12, 2015): http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms (February 23, 2015): http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop (March 9, 2015): http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System

Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
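In practice, pointing Spark at non-HDFS storage is mostly a matter of the URI scheme. A small sketch (bucket and paths are placeholders, and S3 credentials must be configured in the Hadoop configuration):

```scala
// sc: an existing SparkContext
val s3Logs = sc.textFile("s3n://my-bucket/logs/*.log")   // Amazon S3
val local  = sc.textFile("file:///tmp/sample.txt")       // local file system

println(s3Logs.count() + local.count())
```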

75

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS" (July 11, 2012): https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage (March 9, 2015): http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
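The same application code runs against any of these cluster managers; typically only the master URL changes (host names below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("AnyClusterApp")
  .setMaster("local[4]")                 // local mode, 4 threads
  // .setMaster("spark://master:7077")   // standalone cluster
  // .setMaster("mesos://master:5050")   // Apache Mesos
  // (on YARN, the master is usually supplied via spark-submit instead)
val sc = new SparkContext(conf)
```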

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions

• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant (March 4, 2015): https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014 (July 2, 2014): https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos (September 25, 2014, by Eric Carr): http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open-source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem | Spark Ecosystem
Components: HDFS             | Tachyon
            YARN             | Mesos
Tools:      Pig              | Spark native API
            Hive             | Spark SQL
            Mahout           | MLlib
            Storm            | Spark Streaming
            Giraph           | GraphX
            HUE              | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
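Because Tachyon is Hadoop-compatible, using it from Spark again comes down to the URI scheme. A small sketch (host and paths are placeholders; 19998 is Tachyon's default master port):

```scala
// sc: an existing SparkContext
val rdd = sc.textFile("tachyon://master:19998/input/data.txt")
rdd.saveAsTextFile("tachyon://master:19998/output/result")
```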

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API.

• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
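As a flavor of what MLlib replaces Mahout's MapReduce algorithms with, here is a minimal k-means sketch (the input path, number of clusters, and iteration count are illustrative):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a text file of space-separated numbers into dense vectors
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()   // iterative algorithms benefit from caching the input

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)
```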

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
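GraphX covers the Giraph-style graph workloads listed in the alternatives table. A minimal PageRank sketch (the edge-list path and convergence tolerance are illustrative):

```scala
import org.apache.spark.graphx.GraphLoader

// Load a graph from a file of "srcId dstId" edge pairs
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices

// Show the five highest-ranked vertices
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)
```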

97

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


23

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-real-time
• 4th generation (Flink): Batch, Interactive, Real-time, Iterative

24

1. Evolution

• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch-oriented: scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), ...

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• The result: a need to integrate many disparate tools for advanced Big Data analytics, covering queries, streaming analytics, machine learning, and graph analytics.

25

1. Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • An automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                        | Tez                            | Spark
License          | Open source, Apache 2.0, version 2.x    | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Written in       | Java                                    | Java                           | Scala
API              | [Java, Python, Scala], user-facing      | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                    | None                           | Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                        | Tez                            | Spark
Installation     | Bound to Hadoop                         | Bound to Hadoop                | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | To data types and data sources is the same | (same)                      | (same)
YARN integration | YARN application                        | Ground-up YARN application     | Spark is moving towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, ...]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark: Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig-on-Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive-on-Spark umbrella JIRA (status: open; Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark: a reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41
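The co-occurrence approach behind Mahout's Spark-based recommender can be illustrated without any Mahout or Spark code. The following is a minimal pure-Python sketch (the function names and data are invented for the example, not Mahout APIs): count, for each pair of items, how many users interacted with both, then recommend the items that co-occur most often with what a user already has.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(baskets):
    # Count how often each unordered item pair appears in the same user's basket.
    counts = Counter()
    for items in baskets:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(user_items, counts, k=2):
    # Score candidate items by their total co-occurrence with the user's items.
    scores = Counter()
    for (a, b), n in counts.items():
        if a in user_items and b not in user_items:
            scores[b] += n
        elif b in user_items and a not in user_items:
            scores[a] += n
    return [item for item, _ in scores.most_common(k)]

baskets = [["beer", "chips"], ["beer", "chips", "salsa"], ["chips", "salsa"]]
counts = cooccurrence(baskets)
print(recommend({"beer"}, counts))  # chips co-occurs with beer twice, salsa once
```

Mahout's DSL expresses the same pairwise counting as distributed matrix operations, which Spark then executes in parallel.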

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

(Stack diagram: services and the open source tools that provide them, across the storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.)

44

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration

• A benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: an introduction to Spark with Cassandra (part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

49

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop and Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator)
• Integration is still improving; see the open YARN-related issues in the Spark JIRA: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some of the issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
• Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: integrating Kafka and Spark Streaming: code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume integration guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

57
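Spark SQL's schema inference can be mimicked in miniature: scan a sample of JSON records and take the union of all fields seen, recording a type for each. This pure-Python sketch shows the idea only; it is not Spark's actual algorithm or API, and the sample records are invented:

```python
import json

def infer_schema(lines):
    # Union of all field names across records, with the Python type last seen.
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

sample = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "city": "Chicago"}',
]
print(infer_schema(sample))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Records with different fields still merge into one schema, which is why no up-front DDL is needed.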

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58

3. Integration

• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(Diagram: the Hadoop ecosystem and the Spark ecosystem side by side.)

65

4. Complementarity: Spark + Tachyon + HDFS

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos references

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs Mesos: can't we all just get along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

69

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity

• Emergence of the "smart execution engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity

• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System

Spark does not require HDFS, the Hadoop Distributed File System; your Big Data use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75
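Being file-system agnostic largely comes down to dispatching on the URI scheme of a path. A toy illustration in plain Python follows; the scheme-to-backend mapping is invented for the example and is not how Spark is implemented internally:

```python
from urllib.parse import urlparse

# Hypothetical mapping of URI schemes to storage backends.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "file": "local file system",
    "tachyon": "Tachyon in-memory file system",
}

def backend_for(path):
    # Paths with no scheme (plain local paths) default to the local file system.
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown backend")

print(backend_for("hdfs://namenode:8020/data/events"))  # Hadoop Distributed File System
print(backend_for("/tmp/local.csv"))                    # local file system
```

The same application code can therefore point at HDFS, S3, Tachyon, or a local disk just by changing the path string.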

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions

• Using Spark on a non-Hadoop distribution

80

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: ultra fast data analysis with Spark and Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming CEP engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

Tachyon

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92
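The conciseness the slide attributes to Scala and Java 8 lambdas comes from expressing a job as a chain of small functions over a collection. A word count in that functional style, written here in plain Python rather than the actual Spark API:

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# The flatMap -> map -> reduceByKey shape of the Spark version,
# expressed with ordinary Python building blocks.
words = chain.from_iterable(line.split() for line in lines)
counts = Counter(words)

print(counts["to"], counts["be"])  # 4 2
```

In Spark the same pipeline runs over a distributed dataset, but the code a user writes is this compact.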

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93
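The "mix and match SQL and imperative APIs" point can be sketched with Python's built-in sqlite3 standing in for a Spark SQL context (the table, column names, and threshold are invented for the example; this is not the Spark API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 7), ("alice", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative step: arbitrary host-language logic over the query result.
top = [(user, total) for user, total in rows if total >= 5]
print(top)  # [('alice', 5), ('bob', 7)]
```

In Spark SQL the query result is an RDD (a DataFrame from 1.3 on), so the hand-off between the SQL step and the programmatic step needs no export or copy.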

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

96
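The "mini batches" row of the comparison can be made concrete: instead of handling each record the moment it arrives (Storm's model), events are grouped into small time windows and each window is processed as a batch. A schematic pure-Python version, with illustrative timings and function names only (not the Spark Streaming API):

```python
def mini_batches(events, batch_seconds=2):
    # events: (timestamp_seconds, value) pairs, assumed ordered by time.
    batches = {}
    for ts, value in events:
        window = int(ts // batch_seconds)  # assign each event to a time window
        batches.setdefault(window, []).append(value)
    # One aggregate (here, a sum) per window, in time order.
    return [sum(values) for _, values in sorted(batches.items())]

events = [(0.5, 1), (1.2, 2), (2.1, 3), (3.9, 4), (4.0, 5)]
print(mini_batches(events))  # [3, 7, 5]
```

Batching is what gives Spark Streaming its few-seconds latency floor, and also what lets each window reuse the core Spark batch API.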

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways

1. File system: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 24: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

24

1. Evolution

• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets": http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), user-defined functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• The need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink, http://flink.apache.org, offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark
• License: Open Source, Apache 2.0, for all three (MapReduce version 2.x, Tez version 0.x, Spark version 1.x).
• Processing model: MapReduce is on-disk (disk-based parallelization), batch; Tez is on-disk, batch and interactive; Spark is in-memory and on-disk, batch, interactive and streaming (near real-time).
• Language written in: MapReduce in Java; Tez in Java; Spark in Scala.
• API: MapReduce [Java, Python, Scala], user-facing; Tez Java [for ISV/engine/tool builders]; Spark [Scala, Java, Python], user-facing.
• Libraries: MapReduce none (separate tools); Tez none; Spark [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX].

29

Hadoop MapReduce vs. Tez vs. Spark (continued)
• Installation: MapReduce is bound to Hadoop; Tez is bound to Hadoop; Spark isn't bound to Hadoop.
• Ease of use: MapReduce is difficult to program, needs abstractions, and has no interactive mode except via Hive/Pig; Tez is difficult to program, with no interactive mode except via Hive/Pig; Spark is easy to program, with no need for abstractions, and has an interactive mode.
• Compatibility: the same for all three with respect to data types and data sources.
• YARN integration: MapReduce is a YARN application; Tez is a ground-up YARN application; Spark is moving towards YARN.

30

Hadoop MapReduce vs. Tez vs. Spark (continued)
• Deployment: MapReduce on YARN; Tez on YARN; Spark on [Standalone, YARN, SIMR, Mesos, …].
• Performance: Spark shows good performance when data fits into memory, and performance degradation otherwise.
• Security: MapReduce and Tez have more security features and projects; Spark security is still in its infancy, with partial support.

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
  1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
  2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
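The reuse idea above can be sketched without any Spark dependency. Plain Python stands in for both frameworks below (the data and function names are illustrative, not taken from the Cloudera post): the same mapper and reducer bodies survive the migration, only the wiring changes. In real PySpark, the second half would be `sc.textFile(...).flatMap(mapper).reduceByKey(lambda a, b: a + b)`.

```python
from collections import defaultdict
from itertools import chain

lines = ["spark and hadoop", "spark or hadoop"]

# MapReduce style: a mapper emits (key, value) pairs and a reducer
# folds all values gathered for one key.
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    return (word, sum(counts))

shuffled = defaultdict(list)  # stands in for the framework's shuffle phase
for key, value in chain.from_iterable(mapper(l) for l in lines):
    shuffled[key].append(value)
mr_result = dict(reducer(k, v) for k, v in shuffled.items())

# Spark style: the very same mapper body reused inside a chain of
# transformations (a flatMap followed by a reduceByKey).
spark_result = defaultdict(int)
for word, one in chain.from_iterable(mapper(l) for l in lines):  # flatMap
    spark_result[word] += one                                    # reduceByKey

assert mr_result == dict(spark_result)
print(mr_result)  # {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```

The point of the sketch: the per-record logic is engine-agnostic, which is why option 1 (reusing mappers and reducers) is often the cheapest migration path.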

33

2. Transition
  3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
  • Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
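Assuming a Pig build with Spork support, the migration is just a launch-flag change; the script name here is illustrative:

```shell
# Same Pig script, different engine: only the execution mode flag changes.
pig -x mapreduce wordcount.pig   # existing MapReduce run
pig -x spark     wordcount.pig   # same script on the Spark engine (Spork)
```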

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
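Assuming a Hive build with the Spark engine available, switching an existing query over is a one-line session setting; the table and query below are illustrative:

```shell
hive -e "
  set hive.execution.engine=spark;  -- was 'mr' or 'tez'; the query itself is unchanged
  SELECT page, COUNT(*) FROM clicks GROUP BY page;
"
```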

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho (Cloudera): http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading, http://www.cascading.org, is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop-ecosystem services integrating with Spark, grouped by category, each paired with its open source tool: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.]

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory), http://hortonworks.com/blog/ddm/, to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
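To make the "no more DDL" point concrete, here is a toy, dependency-free sketch of the idea behind schema inference: scan the JSON records once and derive a field-to-type mapping, widening to string when records disagree. This is an illustration only, not Spark's API; Spark SQL's real inference is far richer (nested structures, numeric widening, etc.).

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from JSON records."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            if field not in schema or schema[field] != t:
                # first sighting keeps the type; a type conflict widens to str
                schema[field] = t if field not in schema else "str"
    return schema

records = [
    '{"name": "spark", "stars": 12000}',
    '{"name": "hadoop", "stars": "many"}',
]
print(infer_schema(records))  # {'name': 'str', 'stars': 'str'}
```

The one-pass scan is why no table definition is needed up front: the data itself supplies the schema.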

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

65

4. Complementarity (Spark + Tachyon + HDFS)
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity (YARN + Mesos)
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity (YARN + Mesos): references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity (Spark + Tez)
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity (Spark + Tez)
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of a 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
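Because only the cluster-manager URL differs, the same application can move between these deployment modes with a launch-flag change. The hosts and file names below are illustrative (Spark 1.x flag syntax):

```shell
spark-submit --master local[4]            app.py   # 1. local mode, 4 worker threads
spark-submit --master spark://host:7077   app.py   # 2. standalone cluster
spark-submit --master mesos://host:5050   app.py   # 3. Apache Mesos
spark-submit --master yarn-cluster        app.py   # YARN (when Hadoop is present)
```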

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution.

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
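To give a feel for how concise the native API is, here is the classic word count written against the standard RDD operations (the input and output paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    val counts = sc.textFile("hdfs:///input/docs")   // one element per line
      .flatMap(_.split("\\s+"))                      // split lines into words
      .map(word => (word, 1))                        // pair each word with a count of 1
      .reduceByKey(_ + _)                            // sum the counts per word
    counts.saveAsTextFile("hdfs:///output/word-counts")
    sc.stop()
  }
}
```

The same pipeline in Java 8 is only slightly longer thanks to lambda expressions; in Java 7 it required anonymous inner classes.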

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
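The mix-and-match style looks roughly like this: register a case-class RDD as a table, query it declaratively, then keep processing the result with RDD operations (table and field names are invented; the `createSchemaRDD` implicit is the Spark 1.2-era API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Visit(page: String, durationSec: Int)

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion: RDD -> SchemaRDD

    val visits = sc.parallelize(Seq(Visit("/home", 12), Visit("/docs", 95)))
    visits.registerTempTable("visits")

    // Declarative SQL...
    val slow = sqlContext.sql("SELECT page FROM visits WHERE durationSec > 60")
    // ...mixed with imperative RDD operations on the result.
    slow.map(row => row.getString(0).toUpperCase).collect().foreach(println)
  }
}
```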

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


25

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework: its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], User-Facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:

http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
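As a sketch of point 2, the canonical translation replaces the Mapper's emit with `map`/`flatMap` and the Reducer with `reduceByKey` (the job, field layout and paths here are invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MigratedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-to-spark"))
    // Mapper phase: emit (key, value) pairs, here (ipAddress, bytes).
    val pairs = sc.textFile("hdfs:///logs/access.log").map { line =>
      val fields = line.split(" ")
      (fields(0), fields(fields.length - 1).toLong)
    }
    // Reducer phase: aggregate all values for each key.
    val bytesPerIp = pairs.reduceByKey(_ + _)
    bytesPerIp.saveAsTextFile("hdfs:///reports/bytes-per-ip")
    sc.stop()
  }
}
```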

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (Expected in 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.

Source: http://www.cascading.org/new-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout on Spark (Expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: Hadoop ecosystem services and the open source tools that integrate with Spark]
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
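The newAPIHadoopRDD route looks roughly like this, along the lines of the HBaseTest.scala example (the table name is a placeholder):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "events")   // hypothetical table name

    // Each element is a (rowKey, Result) pair read through the HBase InputFormat.
    val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"Row count: ${rows.count()}")
    sc.stop()
  }
}
```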

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
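With the DataStax connector on the classpath, reading and writing a table takes a few lines. A sketch (keyspace, table, column names and host are hypothetical):

```scala
import com.datastax.spark.connector._              // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD and transform it.
    val temps = sc.cassandraTable("sensors", "readings")
      .map(row => (row.getString("sensor_id"), row.getDouble("temperature")))
    // Write an RDD back to another table with matching columns.
    temps.saveToCassandra("sensors", "latest", SomeColumns("sensor_id", "temperature"))
    sc.stop()
  }
}
```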

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark, and store Resilient Distributed Datasets (RDD) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original Resource Negotiator).

• Integration is still improving; see the open SPARK issues mentioning YARN at https://issues.apache.org/jira

• Some of these issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
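In code, this goes through HiveContext, which reads the Hive metastore; a sketch (the table and columns are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-sketch"))
    val hiveContext = new HiveContext(sc)

    // Query an existing Hive table; the result supports normal RDD operations.
    val top = hiveContext.sql(
      "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
    top.collect().foreach(println)
    sc.stop()
  }
}
```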

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
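A minimal receiver-based sketch following the integration guide (the ZooKeeper address, consumer group and topic are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    // Subscribe to one topic with one receiver thread; keep only the message bodies.
    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "sketch-group", Map("events" -> 1)).map(_._2)

    messages.count().print()   // how many events arrived in each batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```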

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
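The schema-inference workflow in a nutshell (the file path and field names are invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-sketch"))
    val sqlContext = new SQLContext(sc)

    // The schema is inferred from the JSON records themselves; no DDL needed.
    val people = sqlContext.jsonFile("hdfs:///data/people.json")
    people.printSchema()
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println)
  }
}
```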

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
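With elasticsearch-hadoop on the classpath, indexing an RDD is one call; a sketch (the index/type name and ES node are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs to RDDs

object EsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-sketch")
      .set("es.nodes", "localhost:9200")
    val sc = new SparkContext(conf)

    // Each Map becomes one JSON document in the "logs/event" index/type.
    val events = sc.makeRDD(Seq(
      Map("level" -> "ERROR", "msg" -> "disk full"),
      Map("level" -> "INFO",  "msg" -> "job done")))
    events.saveToEs("logs/event")
    sc.stop()
  }
}
```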

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

65

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
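In practice, the deployment choice above surfaces as little more than the `--master` URL passed to `spark-submit`. A sketch of the documented Spark 1.x master URL formats; the class name, JAR name, and hostnames are placeholders:

```shell
# Local mode: run Spark in-process with 4 worker threads (no cluster needed).
spark-submit --master "local[4]" --class com.example.MyApp my-app.jar

# Standalone mode: connect to a Spark standalone master.
spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar

# Apache Mesos: connect to a Mesos master; no Hadoop involved.
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp my-app.jar

# YARN (for comparison): the one mode that does require a Hadoop cluster.
spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar
```

The application code is identical in all four cases; only the submission command changes.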

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• 'Databricks Cloud: From raw data to insights and data products in an instant', March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• 'Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra', Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• 'Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector', Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component | Hadoop Ecosystem | Spark Ecosystem
Storage | HDFS | Tachyon
Resource Management | YARN | Mesos

Tool | Hadoop Ecosystem | Spark Ecosystem
Scripting | Pig | Spark native API
SQL | Hive | Spark SQL
Machine Learning | Mahout | MLlib
Streaming | Storm | Spark Streaming
Graph | Giraph | GraphX
UI | HUE | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as a data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, allowing much more concise lambda expressions, to get code nearly as simple as the Scala API.

• 'ETL with Spark', First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
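The first two rows of the table can be illustrated without either framework. The toy code below, in plain Python rather than the Storm or Spark APIs, processes the same event stream one record at a time and in fixed-size mini-batches; the mini-batch variant trades per-record latency for the ability to apply one function to each group of records, which is essentially the Spark Streaming model.

```python
from typing import Callable, Iterable, List

def record_at_a_time(events: Iterable[int], handle: Callable[[int], int]) -> List[int]:
    """Storm-style: every record is handed to the handler as it arrives."""
    return [handle(e) for e in events]

def mini_batches(events: List[int], batch_size: int,
                 handle_batch: Callable[[List[int]], List[int]]) -> List[int]:
    """Spark Streaming-style: records are grouped into small batches first."""
    out: List[int] = []
    for i in range(0, len(events), batch_size):   # one DStream "interval" per slice
        out.extend(handle_batch(events[i:i + batch_size]))
    return out

events = [1, 2, 3, 4, 5]
double = lambda e: e * 2
assert record_at_a_time(events, double) == [2, 4, 6, 8, 10]
# Same result, but the work arrives in groups of 2 (plus a final group of 1),
# so each record waits until its batch closes: higher latency, batch semantics.
assert mini_batches(events, 2, lambda b: [double(e) for e in b]) == [2, 4, 6, 8, 10]
```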

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics and has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


26

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
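To make the RDD idea concrete: an RDD records a lineage of transformations (map, filter, …) and materializes nothing until an action is called. The toy class below, in plain Python and not the actual Spark API, mimics only that lazy-evaluation behavior.

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD: transformations are recorded lazily,
    and nothing is computed until an action (collect/count) is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # the recorded lineage

    def map(self, fn):                 # transformation: returns a new ToyRDD
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):            # transformation: also lazy
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):                 # action: replay the lineage now
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):                   # action built on collect
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
assert rdd.collect() == [0, 4, 16, 36, 64]   # computed only here, not above
assert rdd.count() == 5
```

Spark's real RDDs add the parts that matter at scale: partitioning across the cluster, in-memory caching, and recomputation from lineage on node failure.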

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
License | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None; separate tools | None | Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark: 'How-to: Translate from MapReduce to Apache Spark', http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
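The reuse point in step 1 can be sketched in plain Python (the drivers below are conceptual illustrations, not Hadoop or Spark API calls): the same word-count mapper and reducer functions written for MapReduce can be driven by a Spark-style flatMap/reduceByKey pipeline unchanged.

```python
from collections import defaultdict

# Classic MapReduce-style functions, written once.
def mapper(line):                       # emits (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reducer(a, b):                      # combines two counts for one key
    return a + b

def mapreduce_style(lines):
    """Hadoop-style driver: map phase, shuffle by key, then reduce phase."""
    groups = defaultdict(list)
    for line in lines:                  # map + shuffle
        for key, value in mapper(line):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():  # reduce
        acc = values[0]
        for v in values[1:]:
            acc = reducer(acc, v)
        result[key] = acc
    return result

def spark_style(lines):
    """Spark-style driver reusing the SAME functions, chained the way
    lines.flatMap(mapper).reduceByKey(reducer) would be."""
    counts = {}
    for key, value in (p for line in lines for p in mapper(line)):   # flatMap
        counts[key] = reducer(counts[key], value) if key in counts else value  # reduceByKey
    return counts

lines = ["spark and hadoop", "spark or hadoop"]
assert mapreduce_style(lines) == spark_style(lines) == \
    {"spark": 2, "and": 1, "hadoop": 2, "or": 1}
```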

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started

• 'Hive on Spark', Szehon Ho (Cloudera), February 11, 2015: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• 'Hive on Spark is blazing fast, or is it?', Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, April 25, 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• 'Mahout Scala and Spark bindings', Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• 'Co-occurrence Based Recommendations with Mahout, Scala and Spark', May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
(Diagram: open source tools integrating with Spark, by service category: Storage / Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark (status: still experimental, no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/

• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/

• A Cassandra storage backend with Spark is opening many new avenues.

• 'Kindling: An Introduction to Spark with Cassandra (Part 1)': http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• 'MongoDB and Hadoop: Driving Business Insights': http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its ability to read and write JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• 'Getting Started with Apache Spark and Neo4j Using Docker Compose', Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• 'Categorical PageRank Using Neo4j and Apache Spark', Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• 'Using Apache Spark and Neo4j for Big Data Graph Analytics', Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving; see the open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some of these issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html

• 'Get the most out of Spark on YARN': https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: 'What's Coming in 2015 for Drill': http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• 'Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game': http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• 'An introduction to JSON support in Spark SQL', February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
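What "automatically infer the schema" means can be shown with a toy inference pass in plain Python (stdlib only; Spark SQL's real algorithm additionally merges conflicting types across records, handles nesting and arrays, etc.):

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: scan newline-delimited JSON records and map
    each field name to the set of value types observed for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

# Records need not share fields: the schema is the union across all of them.
rows = [
    '{"name": "spark", "stars": 4500}',
    '{"name": "hadoop", "stars": 3200, "retired": false}',
]
assert infer_schema(rows) == {"name": {"str"}, "stars": {"int"}, "retired": {"bool"}}
```

This is exactly the step that replaces hand-written DDL: the structure is discovered from the data, then queried with SQL.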

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• 'Ingesting HDFS data into Solr using Spark': http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• 'Big Data Web applications for Interactive Hadoop': https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity: Tachyon + Spark + Hadoop

• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• 'The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark', October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• 'Spark and in-memory databases: Tachyon leading the pack', January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos. References:

• 'Apache Mesos vs. Apache Hadoop YARN': https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• 'Myriad Project Marries YARN and Apache Mesos Resource Management': http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• 'YARN vs. MESOS: Can't We All Just Get Along?': http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
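The Data << RAM point can be illustrated without Spark at all. Below is a minimal pure-Python sketch (not Spark code; `parse` and the counters are made-up names) of why caching a parsed dataset in memory pays off once it is reused by several computations, which is exactly what Spark's `RDD.cache()` exploits:

```python
# Pure-Python sketch: re-parsing per query vs. parsing once and caching.
parse_calls = 0

def parse(raw):
    """Stand-in for an expensive parse of on-disk records."""
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw]

raw_data = ["1", "2", "3", "4"]

# Without caching: every query re-reads and re-parses (MapReduce-style).
total = sum(parse(raw_data))
maximum = max(parse(raw_data))
uncached_parses = parse_calls

# With caching: parse once, keep the result in memory, reuse it
# (the effect of rdd.cache() when the data fits in cluster RAM).
parse_calls = 0
cached = parse(raw_data)
total2 = sum(cached)
maximum2 = max(cached)
cached_parses = parse_calls

print(uncached_parses, cached_parses)  # 2 1
```

Two queries cost two parses without the cache and one with it; with iterative algorithms running tens of passes, the gap is correspondingly larger.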

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group:

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption (February 12, 2015): http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms (February 23, 2015): http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop (March 9, 2015): http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
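In practice, Spark's storage agnosticism shows up in the URI you pass to it: the scheme selects the storage backend. Below is a hedged pure-Python sketch of that dispatch idea (the scheme-to-backend table is illustrative; Spark actually resolves schemes through the Hadoop FileSystem API, not a dict):

```python
from urllib.parse import urlparse

# Illustrative scheme -> backend mapping; not Spark's real registry.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "cfs": "Cassandra File System",
    "tachyon": "Tachyon",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def backend_for(path):
    """Pick a storage backend from the URI scheme of a dataset path."""
    scheme = urlparse(path).scheme or "file"   # bare paths are local files
    return BACKENDS.get(scheme, "unknown")

print(backend_for("hdfs://namenode:8020/logs"))  # Hadoop Distributed File System
print(backend_for("/tmp/local.txt"))             # local file system
```

The same application code can therefore point at S3, Tachyon, Swift or a plain local path just by changing the URI, which is why "no HDFS" is a viable configuration.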

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS" (July 11, 2012): https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage (March 9, 2015): http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
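The deployment mode is chosen through the master URL handed to Spark at startup. The sketch below is plain Python, not Spark, but the URL formats it classifies (`local[*]`, `spark://host:7077`, `mesos://…`, `yarn`) are the ones Spark actually accepts:

```python
import re

def deployment_mode(master):
    """Classify a Spark master URL into its deployment mode."""
    if master == "yarn" or master.startswith("yarn-"):
        return "YARN"                       # e.g. yarn-client, yarn-cluster
    if re.fullmatch(r"local(\[(\d+|\*)\])?", master):
        return "Local"                      # local, local[4], local[*]
    if master.startswith("spark://"):
        return "Standalone"                 # Spark's own cluster manager
    if master.startswith("mesos://"):
        return "Mesos"
    raise ValueError("unrecognized master URL: " + master)

print(deployment_mode("local[*]"))           # Local
print(deployment_mode("spark://host:7077"))  # Standalone
```

Switching a job from a laptop to a standalone cluster, Mesos or YARN is thus mostly a matter of changing this one string, which is what "cluster-infrastructure agnostic" means in practice.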

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant (March 4, 2015): https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014 (July 2, 2014): https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos (September 25, 2014, by Eric Carr): http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component: Hadoop ecosystem | Spark ecosystem
HDFS | Tachyon
YARN | Mesos

Tools: Hadoop ecosystem | Spark ecosystem
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria: YARN | Mesos
Resource sharing: yes | yes
Written in: Java | C++
Scheduling: memory only | CPU and memory
Running tasks: Unix processes | Linux container groups
Requests: specific requests and locality preference | more generic, but more coding for writing frameworks
Maturity: less mature | relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
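The conciseness claim is easy to see in miniature. The following is not Spark code, just a pure-Python word count written in the same flatMap-then-reduceByKey chaining style that Spark's native API encourages:

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# flatMap: split every line into words.
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: (word, 1) pairs folded into per-word counts;
# Counter performs the same fold that reduceByKey(_ + _) would.
counts = Counter(words)

print(counts["to"])  # 4
```

In actual Spark the same pipeline is `sc.textFile(...).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`: one chained expression rather than a mapper class plus a reducer class.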

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
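The "mix and match SQL and imperative code" idea can be sketched without Spark at all. Here Python's built-in sqlite3 stands in for Spark SQL (the `events` table and the query are made up for illustration):

```python
import sqlite3

# An in-memory relational table standing in for a Hive table or Parquet file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

# Declarative step: SQL does the grouping...
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...imperative step: ordinary code post-processes the result set, the way
# Spark SQL results flow straight into RDD/DataFrame operations.
report = {user: total for user, total in rows if total > 6}
print(report)  # {'alice': 17.5}
```

The point of Spark SQL is that both halves run in one engine over the same distributed data, instead of exporting query results into a separate program.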

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria: Storm | Spark Streaming
Processing model: record at a time | mini batches
Latency: sub-second | few seconds
Fault tolerance (every record processed): at least once (may be duplicates) | exactly once
Batch framework integration: not available | core Spark API
Supported languages: any programming language | Scala, Java, Python
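The "mini batches" row is the key architectural difference. Below is a hedged pure-Python sketch of micro-batching (batch size and events are made up): events are grouped into fixed-size batches, and each whole batch is handed to an ordinary batch function, which is how Spark Streaming reuses the core batch engine instead of processing one record at a time:

```python
def micro_batches(events, batch_size):
    """Group a stream of events into fixed-size mini batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # hand a complete batch to the batch engine
            batch = []
    if batch:
        yield batch              # flush the final partial batch

stream = iter(range(7))
# Batch-level aggregation, as a DStream transformation would do per interval.
batches = [sum(b) for b in micro_batches(stream, 3)]
print(batches)  # [3, 12, 6]
```

Real Spark Streaming cuts batches by time interval rather than by count, but the trade-off is the same: a few seconds of latency in exchange for exactly-once semantics and the full batch API.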

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 27: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

27

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

28

Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model: on-disk (disk-based parallelization), batch | on-disk, batch, interactive | in-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries: none, separate tools | none | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Installation: bound to Hadoop | bound to Hadoop | isn't bound to Hadoop
Ease of use: difficult to program, needs abstractions; no interactive mode (except Hive, Pig) | difficult to program; no interactive mode (except Hive, Pig) | easy to program, no need of abstractions; interactive mode
Compatibility: same for data types and data sources | same for data types and data sources | same for data types and data sources
YARN integration: YARN application | ground-up YARN application | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [standalone, YARN, SIMR, Mesos, …]
Performance: - | - | good performance when data fits into memory; performance degradation otherwise
Security: more features and projects | more features and projects | still in its infancy; partial support

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
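Point 1, reusing existing mapper and reducer functions, can be sketched in pure Python (this is neither Hadoop nor Spark code; `tokenize` and `add` are illustrative stand-ins for pre-existing MapReduce logic):

```python
from functools import reduce
from itertools import chain, groupby

# Existing "MapReduce" logic, left unchanged:
def tokenize(line):
    """The old mapper: one line in, (key, value) pairs out."""
    return [(w, 1) for w in line.split()]

def add(a, b):
    """The old reducer's combine step."""
    return a + b

lines = ["spark and hadoop", "spark or hadoop"]

# A Spark-style pipeline calling the very same functions:
pairs = sorted(chain.from_iterable(tokenize(l) for l in lines))   # flatMap
counts = {key: reduce(add, (v for _, v in grp))                   # reduceByKey
          for key, grp in groupby(pairs, key=lambda kv: kv[0])}

print(counts["spark"])  # 2
```

The shuffle (here, sort plus groupby) is what the framework provides; the domain logic in the mapper and reducer carries over untouched, which is why this migration path is often cheap.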

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.

Source: http://www.cascading.org/new-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

Service | Open Source Tool
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark, and store Resilient Distributed Datasets (RDDs) from Spark into Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
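Schema inference itself is easy to demystify. Below is a minimal pure-Python sketch (not Spark SQL's implementation; the records and the string fallback rule are made up for illustration) that scans JSON records and infers a field-to-type schema, falling back to a common type when records disagree:

```python
import json

def infer_schema(json_lines):
    """Infer {field: type-name} from newline-delimited JSON records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t            # first time we see this field
            elif schema[field] != t:
                schema[field] = "string"     # conflicting types: widen
    return schema

records = ['{"name": "ann", "age": 29}',
           '{"name": "bob", "age": "unknown", "city": "LA"}']
print(infer_schema(records))
# {'name': 'str', 'age': 'string', 'city': 'str'}
```

Spark SQL does the same kind of scan in a distributed pass over the files, then lets you run SQL against the inferred schema without writing any DDL.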

57

3. Integration
• Apache Parquet is a columnar storage format, available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
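Why columnar? The sketch below is illustrative pure Python, not Parquet's actual encoding: the same three-column table stored row-wise and column-wise, where a single-column aggregate only has to touch one contiguous column in the columnar layout:

```python
# The same table in two physical layouts.
rows = [("alice", 2014, 10.0), ("bob", 2015, 5.0), ("carol", 2015, 7.5)]

# Row-oriented: values of one column are scattered across every record.
row_store = rows

# Column-oriented (Parquet-style): each column is stored contiguously,
# so a scan of 'amount' never touches 'name' or 'year'.
col_store = {
    "name":   [r[0] for r in rows],
    "year":   [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

total_row_store = sum(r[2] for r in row_store)  # must read every full record
total_col_store = sum(col_store["amount"])      # reads one column only
print(total_row_store, total_col_store)  # 22.5 22.5
```

On disk this difference translates into less I/O for analytical queries, and same-typed contiguous values also compress much better, which is the other half of Parquet's appeal.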

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem alongside Spark ecosystem]

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop Ecosystem Component | Spark Ecosystem Component
HDFS                       | Tachyon
YARN                       | Mesos
Pig (tools)                | Spark native API
Hive                       | Spark SQL
Mahout                     | MLlib
Storm                      | Spark Streaming
Giraph                     | GraphX
HUE                        | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: Datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 28: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

28

Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                            | Tez                                  | Spark
License          | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory, on-disk; batch, interactive, streaming (near real-time)
Written in       | Java                                        | Java                                 | Scala
API              | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

29

Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                                                               | Tez                                                       | Spark
Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                           | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same data types and data sources                                               | Same data types and data sources                          | Same data types and data sources
YARN integration | YARN application                                                               | Ground-up YARN application                                | Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support

31

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: open source tools that integrate with Spark, grouped by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1 - Introduction & Setup: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 - Hive Example: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 - Spark Example & Key Takeaways: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An Introduction to JSON Support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
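The automatic schema inference just described can be pictured as a single scan that unions the fields and observed value types across records; a minimal pure-Python sketch of the idea (illustrative only, not Spark SQL's actual implementation):

```python
import json

def infer_schema(json_lines):
    """Scan JSON records and union the fields seen, recording the set of
    observed value types per field -- no DDL needed up front."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

lines = ['{"name": "alice", "age": 34}', '{"name": "bob", "city": "LA"}']
print(infer_schema(lines))
# {'name': {'str'}, 'age': {'int'}, 'city': {'str'}}
```

Spark SQL does the equivalent over a distributed dataset, then exposes the result as a queryable SchemaRDD/DataFrame.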

57

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet

58

3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web Applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity +
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• ...

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
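In practice, switching between the deployments listed above is mostly a matter of the master URL handed to spark-submit; the invocations below are a config sketch (Spark 1.x syntax, with app.py as a placeholder application; not runnable without a Spark installation and the corresponding cluster):

```shell
# The --master URL alone selects the cluster manager;
# the application code does not change between deployments.
spark-submit --master local[4]          app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.py   # standalone Spark cluster
spark-submit --master mesos://host:5050 app.py   # Apache Mesos
spark-submit --master yarn-client       app.py   # Hadoop YARN
```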

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40

83

• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39

84

• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37

85

• Guavus (httpwwwguavuscom) Embeds Apache Spark into its Operational Intelligence Platform, Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

Hadoop Ecosystem        Spark Ecosystem
Component
  HDFS                  Tachyon
  YARN                  Mesos
Tools
  Pig                   Spark native API
  Hive                  Spark SQL
  Mahout                MLlib
  Storm                 Spark Streaming
  Giraph                GraphX
  HUE                   Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware

89

• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

Criteria          YARN                            Mesos
Resource sharing  Yes                             Yes
Written in        Java                            C++
Scheduling        Memory only                     CPU and Memory
Running tasks     Unix processes                  Linux Container groups
Requests          Specific requests and           More generic, but more coding
                  locality preference             for writing frameworks
Maturity          Less mature                     Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark: First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
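The conciseness of the native API is usually shown with word count, which the RDD API expresses as a short flatMap / map / reduceByKey chain; here is a pure-Python imitation of that style (a sketch of the pattern only, not the Spark API itself):

```python
from collections import Counter

def word_count(lines):
    """Word count in the flatMap -> map -> reduceByKey style that the
    Spark native API expresses in about three lines of Scala or Python."""
    words = (word for line in lines for word in line.split())  # flatMap
    pairs = ((word, 1) for word in words)                      # map
    counts = Counter()
    for word, one in pairs:                                    # reduceByKey
        counts[word] += one
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a cluster, the same chain runs distributed over an RDD's partitions instead of a local generator pipeline.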

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
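The mix-and-match of declarative SQL and imperative code described above can be sketched with sqlite3 standing in for the SQL engine (an analogy only: Spark SQL runs the query distributed over a cluster, sqlite3 does not):

```python
import sqlite3

def top_products(rows, min_total):
    """Run a declarative aggregation with SQL, then continue with
    arbitrary host-language logic over the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # Declarative step: aggregate with SQL.
    totals = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product").fetchall()
    conn.close()
    # Imperative step: filter and sort in plain Python.
    return sorted(p for p, total in totals if total >= min_total)

print(top_products([("a", 5), ("b", 1), ("a", 7)], 6))
# ['a']
```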

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark Streaming

Criteria                     Storm                        Spark Streaming
Processing model             Record at a time             Mini batches
Latency                      Sub-second                   Few seconds
Fault tolerance (every       At least once                Exactly once
record processed)            (may be duplicates)
Batch framework integration  Not available                Core Spark API
Supported languages          Any programming language     Scala, Java, Python
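The processing-model row above is the key architectural difference: Storm handles one record at a time, while Spark Streaming discretizes the stream into mini batches. A toy pure-Python simulation of that discretization (an illustration, not Spark Streaming code; real DStreams batch by time interval rather than by record count):

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream into fixed-size mini batches, the way
    Spark Streaming turns a live stream into a sequence of small RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

print(list(micro_batches(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what buys Spark Streaming exactly-once semantics and reuse of the core batch API, at the cost of a few seconds of latency.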

96

GraphX

'GraphX' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


29

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                    Tez                          Spark
Installation      Bound to Hadoop              Bound to Hadoop              Isn't bound to Hadoop
Ease of Use       Difficult to program,        Difficult to program;        Easy to program, no need
                  needs abstractions;          no interactive mode,         of abstractions;
                  no interactive mode,         except Hive, Pig             interactive mode
                  except Hive, Pig
Compatibility     Compatibility to data types and data sources is the same for all three
YARN integration  YARN application             Ground-up YARN application   Spark is moving towards YARN

30

Hadoop MapReduce vs Tez vs Spark

Criteria      MapReduce                    Tez                          Spark
Deployment    YARN                         YARN                         [Standalone, YARN, SIMR, Mesos, ...]
Performance   -                            -                            Good performance when data fits into
                                                                        memory; performance degradation otherwise
Security      More features and projects   More features and projects   Still in its infancy; partial support

31

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

32

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
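The reuse in point 1 can be pictured in a few lines; below is a pure-Python illustration with hypothetical mapper/reducer names (not code from the how-to linked above), where a local sort-based shuffle stands in for the cluster:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical MapReduce-era functions, reused unchanged.
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    return (key, sum(values))

def run_pipeline(lines):
    """Drive the same mapper/reducer through a local sort-based shuffle,
    mirroring what flatMap + reduceByKey would do on a Spark cluster."""
    pairs = [kv for line in lines for kv in mapper(line)]          # map phase
    pairs.sort(key=itemgetter(0))                                  # shuffle
    return [reducer(key, (v for _, v in group))                    # reduce phase
            for key, group in groupby(pairs, key=itemgetter(0))]

print(run_pipeline(["to be or", "not to be"]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Spark, only the driver changes: the mapper feeds a flatMap and the reducer a reduceByKey, while the functions themselves stay as they were.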

33

2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

41

(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml

42

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

43

3 Integration
Services integrated with Spark, each via open source tools:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

44

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
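To make the newAPIHadoopRDD approach concrete, here is a minimal sketch, assuming the HBase client jars are on the classpath and using a hypothetical table name `mytable`:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))

    // Standard HBase configuration; TableInputFormat reads the table name from it
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable") // hypothetical table name

    // Each record is a (row key, Result) pair, one per HBase row
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(s"Row count: ${hBaseRDD.count()}")
    sc.stop()
  }
}
```

This is the same pattern used by the HBaseTest.scala example referenced above.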

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
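As a sketch of the connector's RDD API, assuming the spark-cassandra-connector is on the classpath and using a hypothetical keyspace `test` with a table `words(word, count)`:

```scala
import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node

    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD of rows
    val rdd = sc.cassandraTable("test", "words")
    println(s"Rows: ${rdd.count()}")

    // Write an RDD back to the same table, mapping tuple fields to columns
    sc.parallelize(Seq(("spark", 10), ("hadoop", 7)))
      .saveToCassandra("test", "words", SomeColumns("word", "count"))

    sc.stop()
  }
}
```

Note how no Hadoop InputFormat is involved; the connector talks to Cassandra directly.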

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, a native Spark MongoDB connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC

• Some of these issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
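A minimal sketch of the Hive support described above, using the Spark 1.2-era HiveContext API and assuming a reachable Hive metastore plus a hypothetical table `src`:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))

    // HiveContext picks up hive-site.xml from the classpath for metastore access
    val hiveCtx = new HiveContext(sc)

    // Run HiveQL directly; the result is a SchemaRDD (DataFrame from Spark 1.3 on)
    val rows = hiveCtx.sql("SELECT key, value FROM src WHERE key < 10")
    rows.collect().foreach(println)

    sc.stop()
  }
}
```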

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
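As a sketch of the native integration, using the receiver-based API of the Spark 1.2/1.3 era; the ZooKeeper address, consumer group and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaSketch")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second micro-batches

    // Map of (topic -> receiver threads); all names here are placeholders
    val stream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "my-consumer-group", Map("events" -> 1))

    // Count words per batch from the (key, message) pairs
    stream.map(_._2).flatMap(_.split(" "))
      .map((_, 1L)).reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```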

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
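The schema-inference flow can be sketched as follows (Spark 1.2-era API; the input path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonSketch"))
    val sqlCtx = new SQLContext(sc)

    // The schema is inferred by scanning the JSON records; no DDL is needed
    val people = sqlCtx.jsonFile("hdfs:///data/people.json") // placeholder path
    people.printSchema()

    // Register the inferred SchemaRDD as a table and query it with SQL
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21")
      .collect().foreach(println)

    sc.stop()
  }
}
```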

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
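A sketch of round-tripping data through Parquet with the Spark 1.2-era API (output path and schema are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema used for illustration
case class Person(name: String, age: Int)

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetSketch"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.createSchemaRDD // implicit RDD -> SchemaRDD conversion

    // Write an RDD of case classes out as a Parquet file
    val people = sc.parallelize(Seq(Person("Ada", 36), Person("Alan", 41)))
    people.saveAsParquetFile("hdfs:///tmp/people.parquet") // placeholder path

    // Read it back; the schema is preserved in the Parquet metadata
    val loaded = sqlCtx.parquetFile("hdfs:///tmp/people.parquet")
    loaded.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 40")
      .collect().foreach(println)

    sc.stop()
  }
}
```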

58

3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
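A sketch of the elasticsearch-hadoop RDD integration, assuming the connector jar is on the classpath; the node address and index/type names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs and esRDD to SparkContext

object EsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsSketch")
      .set("es.nodes", "localhost:9200") // assumption: a local ES node

    val sc = new SparkContext(conf)

    // Any RDD whose elements can be translated to documents can be indexed
    val docs = sc.parallelize(Seq(
      Map("title" -> "Spark and ES", "views" -> 10),
      Map("title" -> "Hadoop and ES", "views" -> 5)))
    docs.saveToEs("articles/posts") // "index/type", placeholder names

    // Read the index back as an RDD of (id, document) pairs
    val readBack = sc.esRDD("articles/posts")
    println(s"Indexed docs: ${readBack.count()}")

    sc.stop()
  }
}
```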

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-packhtml
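A sketch of sharing cached data via Tachyon from Spark 1.x, where StorageLevel.OFF_HEAP stored RDD blocks in Tachyon; the Tachyon master address and output path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TachyonSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TachyonSketch")
      // assumption: a Tachyon master running at this address
      .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")

    val sc = new SparkContext(conf)
    val data = sc.parallelize(1 to 1000000)

    // In Spark 1.x, OFF_HEAP stored blocks in Tachyon, outside the executor
    // JVMs, so cached data survives executor crashes and can be shared
    data.persist(StorageLevel.OFF_HEAP)
    println(s"Sum: ${data.sum()}")

    // Files written to a tachyon:// path are readable by other frameworks too
    data.saveAsTextFile("tachyon://tachyon-master:19998/out") // placeholder path

    sc.stop()
  }
}
```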

66

4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: Mesos + YARN references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop ecosystem    Spark ecosystem
----------------    ---------------
HDFS                Tachyon
YARN                Mesos
Pig                 Spark native API
Hive                Spark SQL
Mahout              MLlib
Storm               Spark Streaming
Giraph              GraphX
HUE                 Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria            YARN                                  Mesos
Resource sharing    Yes                                   Yes
Written in          Java                                  C++
Scheduling          Memory only                           CPU and memory
Running tasks       Unix processes                        Linux container groups
Requests            Specific requests and                 More generic, but more coding
                    locality preference                   for writing frameworks
Maturity            Less mature                           Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, with much more concise lambda expressions, making code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
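For illustration, the canonical word count in the Scala native API (the input path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // Split each line into words, then count occurrences per word
    val counts = sc.textFile("hdfs:///input.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

The same pipeline can be typed line by line into the interactive Scala or Python shell.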

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini-batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once               Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File system: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 30: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

30

Hadoop MapReduce vs Tez vs SparkCriteria

Deployment YARN YARN [Standalone YARN SIMR Mesos hellip]

Performance - Good performance when data fits into memory

- performance degradation otherwise

Security More features and projects

More features and projects

Still in its infancy

Partial support

31

IV Spark with Hadoop

1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways

32

2 Transitionbull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine1 You can often reuse your mapper and

reducer functions and just call them in Spark from Java or Scala

2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark

httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark

33

2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark

bull Pig Hive Sqoop Cascading Crunch Mahout hellip

34

Pig on Spark (Spork)bull Run Pig with ldquondashx sparkrdquo option for an easy migration

without development effortbull Speed up your existing pig scripts on Spark ( Query

Logical Plan Physical Pan)bull Leverage new Spark specific operators in Pig such as

Cachebull Still leverage many existing Pig UDF librariesbull Pig on Spark Umbrella Jira (Status Passed end-to-end test

cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059bull Fix outstanding issues and address additional Spark functionality

through the community

bull lsquoPig on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19

35

Hive on Spark (Currently in Beta Expected in Hive 110)

bull New alternative to using MapReduce or Tez hivegt set hiveexecutionengine=sparkbull Help existing Hive applications running on

MapReduce or Tez easily migrate to Spark without development effort

bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop

bull Performance benefits especially for Hive queries involving multiple reducer stages

bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (Currently in Beta Expected in Hive 110)

bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles

bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started

bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark

bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final

bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop

bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

38

(Expected in 31 release)

bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop

bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release

Source httpwwwcascadingorgnew-fabric-support

bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunch
bull The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

bull Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

bull Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (Expected in Mahout 1.0)

bull Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

bull Integration of Mahout and Spark:
bull Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

bull Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (Expected in Mahout 1.0)

bull Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

bull Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

bull Mahout 1.0 Features by Engine (unreleased) - MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

43

3 Integration
(diagram: for each service - storage/serving layer, data formats, data ingestion services, resource management, search, SQL - the open source tools in the Hadoop ecosystem that integrate with Spark)

44

3 Integration
bull Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3

bull Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

bull Use DDM (Discardable Distributed Memory: http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
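As a minimal sketch of the first point above, a Spark job that reads from and writes back to HDFS (the namenode address and paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsExample"))
    // Spark accepts any Hadoop-supported URI: hdfs://, file://, s3n://, ...
    val lines  = sc.textFile("hdfs://namenode:8020/logs/input.txt")
    val errors = lines.filter(_.contains("ERROR"))
    errors.saveAsTextFile("hdfs://namenode:8020/logs/errors")
    sc.stop()
  }
}
```

Pointing the same code at `file://` or `s3n://` paths requires no change beyond the URI, which is the portability the slide is describing.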

45

3 Integration
bull Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

bull SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
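A condensed sketch of the newAPIHadoopRDD approach that HBaseTest.scala demonstrates (the table name and an already-created SparkContext are assumptions):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkContext

// Count the rows of an HBase table by exposing it as an RDD
def countRows(sc: SparkContext, table: String): Long = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, table)
  val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  rdd.count()
}
```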

46

3 Integration
bull Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

bull Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/

bull Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

bull 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
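A minimal sketch of the Spark Cassandra Connector usage described above (the keyspace, table names and case class are hypothetical; sc is a SparkContext configured with spark.cassandra.connection.host):

```scala
import com.datastax.spark.connector._  // from spark-cassandra-connector

case class User(id: Int, name: String)

// Expose a Cassandra table as a Spark RDD
val users = sc.cassandraTable[User]("my_keyspace", "users")

// Transform it and write an RDD back to another table
users.filter(_.name.nonEmpty)
     .saveToCassandra("my_keyspace", "active_users")
```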

47

3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

bull Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/

bull The Cassandra storage backend with Spark is opening many new avenues

bull Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3 Integration
bull MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector
bull MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

bull MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support, via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3 Integration
bull There is also NSMC: Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
bull GitHub: https://github.com/spirom/spark-mongodb-connector

bull Using MongoDB with Hadoop & Spark:
bull Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
bull Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
bull Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

bull Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3 Integration
bull Neo4j is a highly scalable, robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

bull Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

bull Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3 Integration: YARN
bull YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)

bull Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

bull Some issues are critical ones
bull Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3 Integration
bull Spark SQL provides built-in support for Hive tables:
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables

bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
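A minimal sketch of querying a Hive table from Spark SQL (Spark 1.2-era API; the 'src' table follows Spark's own Hive examples and is an assumption here):

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext; HiveContext reads the Hive metastore
val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val rows = hiveContext.sql("SELECT key, value FROM src").collect()
```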

53

3 Integration
bull Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

bull Drill and Spark integration is work in progress in 2015, to address new use cases:
bull Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3 Integration
bull Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

bull Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

bull Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

bull 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
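A minimal sketch of the native Kafka integration, following the receiver-based approach from the integration guide (the ZooKeeper quorum, consumer group and topic are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// sc is an existing SparkContext
val ssc = new StreamingContext(sc, Seconds(2))

// (zkQuorum, consumer group, Map of topic -> number of receiver threads)
val messages = KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("events" -> 1))

messages.map(_._2)  // drop the Kafka key, keep the payload
        .count()
        .print()

ssc.start()
ssc.awaitTermination()
```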

55

3 Integration
bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

bull Spark Streaming integrates natively with Flume. There are two approaches to this:
bull Approach 1: Flume-style Push-based Approach
bull Approach 2 (Experimental): Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3 Integration
bull Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

bull An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
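A minimal sketch of the schema-inference workflow above (Spark 1.2-era API; people.json is a hypothetical file with one JSON object per line):

```scala
import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext
val sqlContext = new SQLContext(sc)

val people = sqlContext.jsonFile("people.json")  // schema inferred automatically
people.printSchema()                             // inspect the inferred schema
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
```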

57

3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

bull Built-in support in Spark SQL allows you to:
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

bull This is an illustrative example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
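The Parquet round trip above can be sketched as follows (Spark 1.2-era API; file names are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

val people = sqlContext.jsonFile("people.json")      // any SchemaRDD will do
people.saveAsParquetFile("people.parquet")           // write out as Parquet

val parquetPeople = sqlContext.parquetFile("people.parquet")  // read it back
parquetPeople.registerTempTable("parquet_people")
val names = sqlContext.sql("SELECT name FROM parquet_people")
```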

58

3 Integration
bull Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

bull This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

bull Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
bull Problem:
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result:
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format

59

3 Integration: Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

bull Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

bull Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

bull Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

bull Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
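A minimal sketch of the elasticsearch-hadoop RDD integration described above, modeled on that project's documentation (the "spark/docs" index/type is hypothetical; sc is a SparkContext configured with es.nodes):

```scala
import org.elasticsearch.spark._  // elasticsearch-hadoop's Spark support

val numbers  = Map("one" -> 1, "two" -> 2)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

// Save any RDD whose elements translate into documents
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

// Read an index back as an RDD of (id, document) pairs
val docs = sc.esRDD("spark/docs")
```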

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

bull Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

bull A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter: http://vimeo.com/83192197

bull Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(diagram: Hadoop ecosystem alongside Spark ecosystem)

65

4 Complementarity: Spark + Tachyon + HDFS

bull Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

bull The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4 Complementarity: YARN + Mesos
bull Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment

bull "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"

bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity: YARN + Mesos - References

bull Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

bull Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4 Complementarity: Spark + Tez
bull Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)

bull The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling

bull Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity: Spark + Tez
bull Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration

bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory

bull Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster

bull Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

bull The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group

bull http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

bull Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
bull Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):
bull https://spark.apache.org/docs/latest/storage-openstack-swift.html
bull https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
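As a sketch of option 4 above, reading logs straight from Amazon S3 with no HDFS involved (the bucket name and credentials are placeholders; s3n:// was the usual scheme in the Spark 1.x era):

```scala
// sc is an existing SparkContext
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val logs = sc.textFile("s3n://my-bucket/logs/*.gz")  // globs and gzip work as on HDFS
println(logs.count())
```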

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
bull Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

bull Quantcast QFS: https://www.quantcast.com/engineering/qfs/
bull ...

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
bull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

bull Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

bull Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
bull DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

bull Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

bull Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

bull 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

bull xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications

bull xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems

bull 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

bull The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

bull 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

bull Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

bull The Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

bull 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
HDFS | Tachyon
YARN | Mesos
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org

bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

bull Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

bull Mesos as Data Center "OS":
bull Share a datacenter between multiple cluster computing apps; provide new abstractions and services
bull Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

bull 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
bull Spark native API in Scala, Java, and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

bull 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
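For a feel of how concise the native Scala API is, the canonical word count (sc is an existing SparkContext; the paths are hypothetical):

```scala
// Read a text file, split it into words, and count each word
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("counts")
```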

92

Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

bull Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

bull Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
bull Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

bull Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

bull ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 31: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

31

IV Spark with Hadoop

1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways

32

2 Transitionbull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine1 You can often reuse your mapper and

reducer functions and just call them in Spark from Java or Scala

2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark

httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark

33

2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark

bull Pig Hive Sqoop Cascading Crunch Mahout hellip

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (Currently in Beta; expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (Currently in Beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho (Cloudera): http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (Expected in the Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (Expected in Mahout 1.0)

• Mahout News, 25 April 2014, "Goodbye MapReduce": Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3 Integration

[Diagram: Hadoop ecosystem layers and the open source tools that integrate with Spark at each layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files

49

3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3 Integration: YARN
• YARN (Yet Another Resource Negotiator): an implicit reference to Mesos as "the" Resource Negotiator
• Integration is still improving; some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
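The schema-inference idea can be illustrated without a cluster: Spark SQL walks the JSON records and unions the fields it finds. The `sqlContext.jsonFile(...)` lines are a hedged sketch of the Spark SQL API of that era (they need a SQLContext); the pure-Python part below only mimics the field-union step.

```python
import json

# Hedged sketch of the Spark SQL calls (requires a SQLContext `sqlContext`):
# people = sqlContext.jsonFile("people.json")   # schema inferred automatically
# people.registerTempTable("people")
# sqlContext.sql("SELECT name FROM people")

# Toy illustration of the inference step: union the fields across records,
# even when individual records carry different subsets of fields
records = ['{"name": "Ana", "age": 33}', '{"name": "Bo", "city": "LA"}']
schema = sorted({field for r in records for field in json.loads(r)})
print(schema)  # ['age', 'city', 'name']
```

Spark SQL additionally infers a type per field and marks missing fields as nullable; the toy version above only recovers the field names.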

57

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
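What "columnar" buys you can be shown in a few lines of plain Python. This is an illustration of the storage idea only, not of Parquet's actual encoding: row-oriented storage keeps each record together, while a columnar format stores each column contiguously, so a query touching one column reads only that column's data.

```python
# Row-oriented: one record per entry
rows = [
    {"name": "Ana", "age": 33, "city": "LA"},
    {"name": "Bo",  "age": 41, "city": "SF"},
]

# Column-oriented: one contiguous list per column (the Parquet idea)
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A scan of a single column now touches only that column's values
print(columns["age"])                             # [33, 41]
print(sum(columns["age"]) / len(columns["age"]))  # 37.0
```

On top of this layout, Parquet adds per-column compression and encoding, which is why columnar files are typically much smaller and faster to scan for analytical queries.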

58

3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

65

4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity: References

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

69

4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
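The Data-vs-RAM rule of thumb above is easy to turn into a back-of-the-envelope check. The 0.6 cache fraction below is an illustrative assumption, not a Spark default to rely on; tune it to your own configuration.

```python
def fits_in_cluster_ram(data_gb, nodes, ram_per_node_gb, cache_fraction=0.6):
    """Rough check: can the parsed dataset be cached across the cluster?

    cache_fraction is the share of each node's RAM assumed available
    for caching (illustrative assumption, not a Spark default).
    """
    usable_gb = nodes * ram_per_node_gb * cache_fraction
    return data_gb <= usable_gb

# 10 nodes x 64 GB at a 60% usable fraction = 384 GB of cache:
print(fits_in_cluster_ram(200, nodes=10, ram_per_node_gb=64))   # True
print(fits_in_cluster_ram(1000, nodes=10, ram_per_node_gb=64))  # False
```

If the check fails (Data >> RAM), the slide's advice points toward a more stream-oriented engine such as Tez; if it passes (Data << RAM), Spark's in-memory caching is the bigger win.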

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine, interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3 Distributions
• Using Spark on a Non-Hadoop distribution

80

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4 Alternatives

Component (Hadoop ecosystem vs Spark ecosystem):
• File system: HDFS vs Tachyon
• Resource manager: YARN vs Mesos

Tools (Hadoop ecosystem vs Spark ecosystem):
• Pig vs Spark native API
• Hive vs Spark SQL
• Mahout vs MLlib
• Storm vs Spark Streaming
• Giraph vs GraphX
• HUE vs Spark Notebook / ISpark

88

Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
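The conciseness the native API is praised for comes from chaining functional transformations. The flavor can be shown with plain Python built-ins; the commented lines are a hedged sketch of the equivalent PySpark calls, which need a SparkContext.

```python
from functools import reduce

# PySpark equivalent (sketch only, requires a SparkContext `sc`):
# result = (sc.parallelize(range(10))
#             .filter(lambda x: x % 2 == 0)
#             .map(lambda x: x * x)
#             .reduce(lambda a, b: a + b))

# The same pipeline with plain Python, in the same lambda style:
evens_squared = map(lambda x: x * x,
                    filter(lambda x: x % 2 == 0, range(10)))
result = reduce(lambda a, b: a + b, evens_squared)
print(result)  # 0 + 4 + 16 + 36 + 64 = 120
```

Java 8 lambdas bring the same chained style to the Java API, which before Java 8 required verbose anonymous inner classes for each function.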

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
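The processing-model row is the key difference: Storm hands each record to the topology as it arrives, while Spark Streaming chops the stream into mini batches and runs a small Spark job per batch. A toy illustration of the mini-batch idea in plain Python, with a fixed batch size standing in for the batch interval:

```python
def to_mini_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini batches,
    the way Spark Streaming discretizes a stream by batch interval."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each mini batch is then processed as one small job, e.g. a per-batch count
stream = ["a", "b", "c", "d", "e"]
counts = [len(batch) for batch in to_mini_batches(stream, batch_size=2)]
print(counts)  # [2, 2, 1]
```

Batching is what gives Spark Streaming its exactly-once semantics and Core Spark API integration, at the cost of the few seconds of latency shown in the table.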

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark without Hadoop
  • 5 Key Takeaways
  • V More Q&A

32

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. See "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
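Point 1 in miniature: a plain-Python sketch (no Spark required) in which a hypothetical `MiniRDD` class mimics the shape of Spark's RDD API, so the classic MapReduce word count becomes a chain of `flatMap` / `map` / `reduceByKey` calls. In real Spark the chain would start from `sc.textFile(...)` and run distributed; this only illustrates how mapper and reducer logic carries over.

```python
class MiniRDD:
    """A tiny local stand-in for a Spark RDD, backed by a plain Python list.
    Only the three operations used below are implemented."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the resulting sequences.
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(item) for item in self.data)

    def reduceByKey(self, f):
        # Merge values for each key using f, like the reduce side of MapReduce.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

# The classic MapReduce word count, re-expressed as a chain of RDD operations:
lines = MiniRDD(["spark and hadoop", "spark or hadoop"])
counts = (lines
          .flatMap(lambda line: line.split())   # mapper's tokenization step
          .map(lambda word: (word, 1))          # mapper's (key, value) emit
          .reduceByKey(lambda a, b: a + b)      # reducer's aggregation
          .collect())
print(sorted(counts))  # [('and', 1), ('hadoop', 2), ('or', 1), ('spark', 2)]
```

The same lambdas would drop unchanged into a real `SparkContext` pipeline, which is why the migration often needs little new code.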

33

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• "Running Crunch with Spark": http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 – "Goodbye MapReduce": Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• "Mahout 1.0 Features by Engine" (unreleased) – MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop ecosystem services and the open source tools that provide them – storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• "Getting Started with Apache Spark and Cassandra": http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving; see the open YARN-related issues in the Spark JIRA: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
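What "inferring the schema" means can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in (flat records only, first-seen type wins, no type widening across conflicting records); Spark SQL itself does this over distributed data and then lets you run SQL against the result.

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from a sample of JSON records,
    in the spirit of Spark SQL's automatic schema inference (much simplified)."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # Record the first type seen for each field; note that fields
            # missing from some records (like "city" below) still appear.
            schema.setdefault(field, type(value).__name__)
    return schema

records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": 29, "city": "Chicago"}',
]
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

In Spark 1.2 the equivalent one-liner would be along the lines of `sqlContext.jsonFile(path)`, after which the inferred schema is queryable with SQL.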

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
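The core idea behind a columnar format can be shown in plain Python. This is only the layout concept, not Parquet's actual on-disk format, which adds row groups, encodings, and compression on top of it.

```python
# Row layout: one record at a time, all fields interleaved.
rows = [
    {"user": "a", "bytes": 10, "country": "US"},
    {"user": "b", "bytes": 20, "country": "FR"},
    {"user": "c", "bytes": 30, "country": "US"},
]

# Columnar layout (Parquet's core idea): one contiguous list per field,
# so a query touching one column reads only that column's data.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Summing "bytes" scans three ints and never touches "user" or "country";
# with a row layout every full record would have to be read.
total = sum(columns["bytes"])
print(total)  # 60
```

Same data, same answers; the win is that analytical queries usually touch a few columns of a wide table, and the columnar layout skips the rest.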

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos – References
• "Apache Mesos vs Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

69

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'smart execution engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus
• "Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus (http://www.guavus.com) operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component            Hadoop Ecosystem    Spark Ecosystem
Storage              HDFS                Tachyon
Resource management  YARN                Mesos
Tools                Pig                 Spark native API
                     Hive                Spark SQL
                     Mahout              MLlib
                     Storm               Spark Streaming
                     Giraph              GraphX
                     HUE                 Spark Notebook / ISpark

Tachyon
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code changes
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                              Mesos
Resource sharing  Yes                               Yes
Written in        Java                              C++
Scheduling        Memory only                       CPU and memory
Running tasks     Unix processes                    Linux container groups
Requests          Specific requests and             More generic, but more coding
                  locality preference               for writing frameworks
Maturity          Less mature                       Relatively more mature
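The "Scheduling" row is the most visible difference in practice. A toy check (resource names and numbers are purely illustrative, and newer YARN versions can also schedule CPU) shows how a task can be admissible under a memory-only policy yet rejected once CPU is part of the offer:

```python
def fits_yarn(node_free, task):
    """Classic YARN-style admission: containers are sized by memory only."""
    return task["mem"] <= node_free["mem"]

def fits_mesos(node_free, task):
    """Mesos-style admission: resource offers carry both CPU and memory."""
    return task["mem"] <= node_free["mem"] and task["cpu"] <= node_free["cpu"]

node = {"mem": 8, "cpu": 2}   # 8 GB and 2 cores free on the node
task = {"mem": 4, "cpu": 4}   # task wants 4 GB and 4 cores

print(fits_yarn(node, task))   # True  -- memory fits; CPU is not modeled
print(fits_mesos(node, task))  # False -- CPU demand exceeds the offer
```

The memory-only scheduler happily co-locates CPU-hungry tasks, which is exactly the kind of oversubscription a two-dimensional offer model avoids.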

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• "ETL with Spark – First Spark London Meetup", May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
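The mix-and-match style reads like the sketch below, which uses Python's built-in sqlite3 as a stand-in SQL engine so it runs anywhere. In Spark SQL the query would go through a SQLContext over a registered table and return an RDD/DataFrame that you could keep transforming with the programmatic API; the table name and values here are made up for illustration.

```python
import sqlite3

# Declarative part: a small in-memory table and a SQL aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, ms INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("INFO", 12), ("ERROR", 250), ("ERROR", 90), ("INFO", 7)])

rows = conn.execute(
    "SELECT level, AVG(ms) FROM logs GROUP BY level ORDER BY level").fetchall()

# Imperative part: post-process the relational result with ordinary code
# (in Spark this could be a .map() over the result, or input to MLlib).
slow = {level: avg for level, avg in rows if avg > 50}
print(slow)  # {'ERROR': 170.0}
```

The point is the hand-off: SQL handles the relational aggregation, and the surrounding program consumes the result without leaving the language.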

93

Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
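The "mini batches" row is the key difference. A toy sketch in plain Python (not Spark Streaming itself) shows what micro-batching means: records arriving on a stream are grouped and processed together, rather than one at a time; here batches are cut by count rather than by time interval to keep the sketch deterministic.

```python
def micro_batches(records, batch_size):
    # Group an incoming stream into fixed-size mini batches, loosely the way
    # Spark Streaming discretizes a stream into one RDD per interval.
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

stream = range(1, 8)  # pretend these records arrive one at a time
batches = list(micro_batches(stream, batch_size=3))
print(batches)                    # [[1, 2, 3], [4, 5, 6], [7]]
print([sum(b) for b in batches])  # per-batch aggregation: [6, 15, 7]
```

A record-at-a-time engine like Storm would instead invoke processing once per record, which lowers latency but gives up the per-batch view that makes Spark Streaming's exactly-once batch semantics natural.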

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


33

2. Transition
The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
[Diagram: Hadoop services and the open source tools Spark integrates with, grouped by service: storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC: Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
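As a rough illustration of what "infer the schema" means (plain Python with the stdlib json module, not Spark SQL's actual algorithm), one can scan JSON records and collect a field-to-type mapping; the sample records are invented for the sketch.

```python
import json

def infer_schema(json_lines):
    # Walk every record and note the value type seen for each field,
    # loosely mimicking how Spark SQL derives a schema from a JSON dataset.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(t) for f, t in schema.items()}

data = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]
print(infer_schema(data))
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```

Note that the second record contributes a field ("city") the first one lacks; schema inference over the whole dataset is what makes the "no DDL" workflow possible.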

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
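The benefit of a columnar format like Parquet can be sketched in a few lines (a toy illustration with invented records, not the Parquet format itself): storing data by column lets a query read only the columns it touches.

```python
# Row layout: one record per entry; a query touching one field still
# walks every whole record.
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 7, "country": "FR"},
]

# Columnar layout: one contiguous list per field, analogous to
# Parquet's column chunks.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# SELECT SUM(clicks): the columnar layout reads a single list.
print(columns["clicks"])       # [3, 7]
print(sum(columns["clicks"]))  # 10
```

On disk this difference also enables per-column compression and encoding, which is where much of Parquet's space and scan-speed advantage comes from.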

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services

• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)

• Tez supports enterprise security

69

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
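The rule of thumb above can be phrased as a tiny decision helper. This is purely illustrative: the function name and thresholds are invented for the sketch, and any real choice should be benchmarked on your own workload.

```python
def pick_engine(data_gb, cluster_ram_gb):
    # Illustrative heuristic from the slide: Spark shines when the working
    # set fits in cluster memory; a more stream-oriented engine such as Tez
    # may be preferable when data vastly exceeds RAM. Thresholds are made up.
    if data_gb <= cluster_ram_gb:
        return "spark"   # Data << RAM: cache parsed data in memory
    if data_gb > 10 * cluster_ram_gb:
        return "tez"     # Data >> RAM: stream-oriented processing
    return "either"      # in between: benchmark your own workload

print(pick_engine(100, 512))    # spark
print(pick_engine(50000, 512))  # tez
```

The "smart execution engine" idea discussed on the next slides automates exactly this kind of decision, per step, using platform type, data attributes, and cluster condition as inputs.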

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: A healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
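Spark's storage agnosticism comes down to URI schemes: the same job can point at different backends just by changing the path prefix. A minimal sketch of scheme-based dispatch in plain Python (illustrative only; the backend table is a small sample, and in real Spark the Hadoop FileSystem API does this resolution pluggably):

```python
from urllib.parse import urlparse

# Sample mapping of path scheme to storage backend.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def backend_for(path):
    # A bare path with no scheme falls back to the local file system.
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

print(backend_for("hdfs://namenode:8020/data/events"))  # Hadoop Distributed File System
print(backend_for("s3n://bucket/logs/2015"))            # Amazon S3
print(backend_for("/tmp/local.txt"))                    # local file system
```

This is why "bring your own storage" works: the processing code never changes, only the path handed to it.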

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
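Each deployment mode corresponds to a different "master" URL handed to Spark at submit time. A small sketch of how such URLs can be classified (illustrative only; it mirrors the documented local / spark:// / mesos:// / yarn-* conventions of Spark 1.x, and the function itself is invented for this sketch):

```python
def deployment_mode(master_url):
    # Classify a Spark master URL into the deployment modes listed above.
    if master_url.startswith("local"):
        return "local"        # e.g. local, local[4], local[*]
    if master_url.startswith("spark://"):
        return "standalone"   # Spark's own standalone cluster manager
    if master_url.startswith("mesos://"):
        return "mesos"        # Apache Mesos
    if master_url in ("yarn-client", "yarn-cluster"):
        return "yarn"         # Hadoop YARN (Spark 1.x style)
    return "unknown"

print(deployment_mode("local[4]"))                  # local
print(deployment_mode("spark://host:7077"))         # standalone
print(deployment_mode("mesos://zk://a:2181/mesos")) # mesos
print(deployment_mode("yarn-cluster"))              # yarn
```

The point of the slide is that only this one string changes between environments; the application code stays the same.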

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop ecosystem   Spark ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share a data center between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria          YARN                     Mesos
Resource sharing  Yes                      Yes
Written in        Java                     C++
Scheduling        Memory only              CPU and memory
Running tasks     Unix processes           Linux container groups
Requests          Specific requests and    More generic, but more coding
                  locality preference      for writing frameworks
Maturity          Less mature              Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, giving code that is much more concise – nearly as simple as the Scala API

• "ETL with Spark" – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
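As a minimal illustration of the native Scala API, the classic word count can be typed straight into spark-shell (where `sc` is predefined; the input file name is a placeholder):

```scala
// Word count with the Spark core API (runnable in spark-shell)
val counts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))   // split lines into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum counts per word

counts.take(5).foreach(println)
```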

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
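A sketch of that mixing of SQL and the programmatic API, using the Spark 1.2-era interface (the file name `people.json` and the schema are placeholders; `sc` is a SparkContext):

```scala
// Infer a schema from JSON, then combine SQL with RDD operations
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")  // schema inferred automatically
people.registerTempTable("people")

// Declarative SQL feeding an imperative transformation
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => "Name: " + row(0)).collect().foreach(println)
```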

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini-batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 34: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

34

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

36

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Apache Mahout (expected in Mahout 1.0)

• Mahout news, April 25, 2014 – goodbye, MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: services and the open source tools that integrate with Spark – Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without using the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
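The newAPIHadoopRDD route mentioned above can be sketched as follows (the table name `my_table` is a placeholder; `sc` is a SparkContext and the HBase client jars are on the classpath):

```scala
// Expose an HBase table as an RDD of (row key, row result) pairs
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table

val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println("rows: " + hbaseRDD.count())
```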

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
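A minimal sketch of the connector's read and write paths (keyspace, table and column names are placeholders; the connector jar must be on the classpath and `spark.cassandra.connection.host` configured):

```scala
// Read a Cassandra table as an RDD, aggregate, and write back
import com.datastax.spark.connector._

val albums = sc.cassandraTable("music", "albums")      // hypothetical keyspace/table
val perArtist = albums
  .map(row => (row.getString("artist"), 1))
  .reduceByKey(_ + _)                                  // album count per artist

perArtist.saveToCassandra("music", "albums_by_artist",
  SomeColumns("artist", "count"))
```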

47

3. Integration
• A benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official mongo-hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
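Reading a collection through mongo-hadoop follows the same newAPIHadoopRDD pattern as HBase (the connection URI is a placeholder; the mongo-hadoop and MongoDB driver jars must be on the classpath):

```scala
// Expose a MongoDB collection as an RDD of (id, BSON document) pairs
import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")

val docs = sc.newAPIHadoopRDD(mongoConf, classOf[MongoInputFormat],
  classOf[Object], classOf[BSONObject])
println("documents: " + docs.count())
```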

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• "Using MongoDB with Hadoop & Spark":
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving; see the open Spark JIRA issues mentioning YARN (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira

• Some of these issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
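The Hive support above is exposed through HiveContext; a minimal sketch, assuming a Spark build with Hive support and a Hive table named `src` (hypothetical):

```scala
// Query an existing Hive table from Spark SQL
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val top = hiveContext.sql("SELECT key, value FROM src ORDER BY key LIMIT 10")
top.collect().foreach(println)
```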

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
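The native integration can be sketched with the Spark 1.2-era receiver API (the ZooKeeper address, consumer group and topic name are placeholders; `sc` is a SparkContext):

```scala
// Consume a Kafka topic as a DStream of (key, message) pairs
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))          // 2-second micro-batches
val stream = KafkaUtils.createStream(ssc, "zk-host:2181",
  "demo-group", Map("events" -> 1))                     // topic -> receiver threads

stream.map(_._2).count().print()                        // messages per batch
ssc.start()
ssc.awaitTermination()
```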

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
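The push-based approach (Approach 1) can be sketched as follows: Spark Streaming listens on the host/port that a Flume Avro sink is configured to point at (host and port are placeholders):

```scala
// Receive Flume events pushed by an Avro sink
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(sc, Seconds(5))
val events = FlumeUtils.createStream(ssc, "worker-host", 9999)
events.count().print()            // events received per 5-second batch
ssc.start()
ssc.awaitTermination()
```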

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
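A round-trip through Parquet with the Spark 1.2-era API can be sketched as (file paths are placeholders; `sqlContext` is an existing SQLContext):

```scala
// Write a SchemaRDD out as Parquet, then read it back and query it
val people = sqlContext.jsonFile("people.json")      // any SchemaRDD works
people.saveAsParquetFile("people.parquet")

val parquet = sqlContext.parquetFile("people.parquet")
parquet.registerTempTable("parquet_people")
sqlContext.sql("SELECT COUNT(*) FROM parquet_people").collect().foreach(println)
```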

58

3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in elasticsearch-hadoop was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
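The native RDD integration can be sketched as follows (the `talks/slides` index/type and the documents are placeholders; the elasticsearch-hadoop jar must be on the classpath and `es.nodes` configured):

```scala
// Save an RDD of maps to Elasticsearch, one document per element
import org.elasticsearch.spark._

val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark or Hadoop", "year" -> 2015),
  Map("title" -> "Escape from Hadoop", "year" -> 2014)))

docs.saveToEs("talks/slides")   // index "talks", type "slides"
```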

61

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem | Spark ecosystem]

65

4. Complementarity: Tachyon + Spark + Hadoop

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: Mesos + YARN

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references

• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the "smart execution engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)

• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System

Spark does not require HDFS, the Hadoop Distributed File System; your Big Data use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
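Because Spark resolves storage from the path scheme, the same call works against any of these systems; a sketch (bucket, host and file names are placeholders, and S3 credentials must be configured):

```scala
// File-system-agnostic reads: only the URI scheme changes
val s3Logs   = sc.textFile("s3n://my-bucket/logs/2015-03-12.log")       // Amazon S3
val inMemory = sc.textFile("tachyon://tachyon-master:19998/shared/input.txt") // Tachyon
val local    = sc.textFile("file:///tmp/input.txt")                     // local FS

println(s3Logs.count() + inMemory.count() + local.count())
```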

75

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
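In code, the cluster manager is chosen by nothing more than the master URL; a sketch (host names are placeholders), showing that standalone and Mesos modes need no Hadoop or YARN at all:

```scala
// The same application targets different cluster managers via setMaster
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deployment-demo")
  .setMaster("spark://master-host:7077")  // standalone cluster; alternatives:
                                          //   "local[4]" (local mode)
                                          //   "mesos://zk://zk-host:2181/mesos"
val sc = new SparkContext(conf)
```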

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

                 Hadoop Ecosystem   Spark Ecosystem
Component        HDFS               Tachyon
                 YARN               Mesos
Tools            Pig                Spark native API
                 Hive               Spark SQL
                 Mahout             MLlib
                 Storm              Spark Streaming
                 Giraph             GraphX
                 HUE                Spark Notebook / ISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs

bull Mesos as Data Center "OS"
bull Share datacenter between multiple cluster computing apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS…

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

Criteria          YARN                              Mesos
Resource sharing  Yes                               Yes
Written in        Java                              C++
Scheduling        Memory only                       CPU and Memory
Running tasks     Unix processes                    Linux Container groups
Requests          Specific requests and             More generic but more coding
                  locality preference               for writing frameworks
Maturity          Less mature                       Relatively more mature

91

Spark Native API
bull Spark native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 lambda expressions for much more concise code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
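To make the conciseness point concrete, here is the classic word count written with the same flatMap / map / reduceByKey shape Spark's API exposes, in plain Python (a conceptual analogue of an RDD pipeline, not Spark itself; the sample lines are made up):

```python
from collections import defaultdict

lines = ["spark or hadoop", "spark with hadoop", "spark without hadoop"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]
# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'spark': 3, 'or': 1, 'hadoop': 3, 'with': 1, 'without': 1}
```

In Spark the same three steps collapse into one short chain over an RDD, which is exactly the conciseness the lambda-friendly APIs are after.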

92

Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
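The "mix and match SQL and imperative APIs" idea can be sketched with Python's built-in sqlite3 standing in for the SQL engine (this is not Spark SQL; the table and column names are made up):

```python
import sqlite3

# An in-memory table standing in for a Hive table or JSON-backed SchemaRDD
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 6), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()  # [('ann', 5), ('bob', 6)]

# Imperative step: post-process the result set in ordinary code
top_user, top_total = max(rows, key=lambda r: r[1])
```

In Spark SQL the query result is itself an RDD, so the hand-off between the SQL step and the programmatic step is exactly this seamless.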

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
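MLlib provides distributed implementations of standard algorithms (regression, clustering, collaborative filtering, and more). As a conceptual sketch of the simplest of these, here is single-machine ordinary least squares in plain Python (not the MLlib API; MLlib computes the equivalent over an RDD of points):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # exact fit: a == 2.0, b == 1.0
```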

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark Streaming

Criteria                       Storm                       Spark Streaming
Processing model               Record at a time            Mini batches
Latency                        Sub-second                  Few seconds
Fault tolerance                At least once               Exactly once
(every record processed)       (may be duplicates)
Batch framework integration    Not available               Core Spark API
Supported languages            Any programming language    Scala Java Python
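The processing-model row of the table can be sketched in plain Python: record-at-a-time handles each event as soon as it arrives, while micro-batching (Spark Streaming's model) cuts the stream into small batches, trading a little latency for throughput and reuse of the batch API:

```python
def record_at_a_time(stream, handle):
    # Storm-style: process each record as soon as it arrives
    return [handle(r) for r in stream]

def micro_batches(stream, batch_size):
    # Spark Streaming-style: group the stream into small batches
    batch = []
    for r in stream:
        batch.append(r)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

events = list(range(7))
print(record_at_a_time(events, lambda r: r * 2))  # [0, 2, 4, 6, 8, 10, 12]
print(list(micro_batches(events, 3)))             # [[0, 1, 2], [3, 4, 5], [6]]
```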

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
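GraphX ships graph algorithms such as PageRank out of the box. As a conceptual sketch of what that computes, here is a plain-Python power iteration (not the GraphX API; the three-node graph is made up):

```python
def pagerank(links, iters=20, d=0.85):
    """links maps each node to its list of outbound neighbors."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                # each node shares its rank equally among its out-links
                new[dst] += d * rank[src] / len(outs)
        rank = new
    return rank

# 'c' is linked to by both 'a' and 'b', so it should rank highest
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```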

97

Notebook
bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System Spark is file-system agnostic Bring your own storage
2 Deployment Spark is cluster-infrastructure agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

V More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 35: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

35

Hive on Spark (Currently in Beta Expected in Hive 1.1.0)

bull New alternative to using MapReduce or Tez hive> set hive.execution.engine=spark
bull Help existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort

bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop

bull Performance benefits especially for Hive queries involving multiple reducer stages

bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292

36

Hive on Spark (Currently in Beta Expected in Hive 1.1.0)

bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles

bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started

bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark

bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final

bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12

37

Sqoop on Spark (Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop

bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

38

Cascading (Expected in 3.1 release)

bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop

bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release

Source httpwwwcascadingorgnew-fabric-support

bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark

39

Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg

bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml

bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

40

Mahout (Expected in Mahout 1.0)

bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg

bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed

Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark

bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

41

Mahout (Expected in Mahout 1.0)

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

42

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

43

3 Integration
[Figure slide pairing each Hadoop-ecosystem service with the open source tools Spark integrates with Storage Serving Layer Data Formats Data Ingestion Services Resource Management Search SQL]

3 Integration
bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3

bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851

45

3 Integration
bull Out of the box Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

46

3 Integration
bull Spark Cassandra Connector This library lets you expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra

47

3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new avenues

bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3 Integration
bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector
bull MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

49

3 Integration
bull There is also NSMC Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop & Spark
bull PART 1 httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
bull PART 2 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
bull PART 3 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones
bull Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integration
bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 Spark-2883 httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib

53

3 Integration
bull Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark integration is work in progress in 2015 to address new use cases
bull Use a Drill query (or view) as the input to Spark Drill extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
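What Kafka buys this pairing is a buffer that decouples producers from Spark Streaming, which then drains it one micro-batch per interval. A minimal sketch with Python's stdlib queue standing in for a topic partition (no Kafka client involved; the field names are made up):

```python
import queue

topic = queue.Queue()  # stand-in for a Kafka topic partition

# Producer side: events arrive at their own pace
for i in range(5):
    topic.put({"offset": i, "value": "event-%d" % i})

# Consumer side, Spark Streaming-like: drain whatever is buffered
# into one micro-batch per polling interval
def poll_batch(q, max_records=10):
    batch = []
    while not q.empty() and len(batch) < max_records:
        batch.append(q.get())
    return batch

batch = poll_batch(topic)
print([m["offset"] for m in batch])  # [0, 1, 2, 3, 4]
```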

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integration
bull Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
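What "automatically infer the schema" means can be sketched in a few lines of plain Python: walk the records, note the type seen for each field, and take the union of fields across records (a toy version of what Spark SQL does when pointed at JSON files; the sample records are made up):

```python
import json

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

# Union of fields across records, with the type observed for each field
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema[field] = type(value).__name__

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Spark SQL additionally merges conflicting types and handles nesting, but the records-to-schema direction is the same.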

57

3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming language httpparquetincubatorapacheorg

bull Built-in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
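The columnar idea behind Parquet can be sketched in plain Python: pivot rows into per-field columns, so a query touching one field reads one column instead of whole rows (and each column compresses well because it holds values of a single type):

```python
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 5, "country": "FR"},
]

# Row layout -> column layout (the heart of a columnar format)
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query over 'clicks' now touches one column, not every row
total_clicks = sum(columns["clicks"])
print(columns["clicks"], total_clicks)  # [3, 5] 8
```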

58

3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro

bull AvroSpark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine httpwwwelasticsearchorg

bull Apache Spark support in Elasticsearch Hadoop was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving of searchable complex data "CrunchIndexerTool on Spark"

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services

bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics or… HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity +
bull Data >> RAM Processing huge data volumes much bigger than cluster RAM Tez might be better since it is more "stream oriented" has a more mature shuffling implementation and closer YARN integration

bull Data << RAM Since Spark can cache parsed data in memory it can be much better when we process data smaller than the cluster's memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer A smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways
1 Evolution of compute models is still ongoing Watch out for the Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at One size doesn't fit all

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS the Hadoop Distributed File System Your 'Big Data' use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a non-HDFS file system already supported by Spark
bull Amazon S3 httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FS httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)
bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
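In practice "bring your own storage" surfaces as the URI scheme on the paths handed to Spark (hdfs://, s3n://, tachyon://, or a plain local path). Here is a sketch of scheme-based dispatch using Python's stdlib; the scheme names match Spark-era conventions but the lookup itself is purely illustrative:

```python
from urllib.parse import urlparse

def storage_backend(path):
    # No scheme means a plain local path
    scheme = urlparse(path).scheme or "file"
    backends = {
        "hdfs": "Hadoop Distributed File System",
        "s3n": "Amazon S3",
        "tachyon": "Tachyon in-memory FS",
        "file": "local file system",
    }
    return backends.get(scheme, "unknown")

print(storage_backend("s3n://bucket/logs"))   # Amazon S3
print(storage_backend("/data/local.txt"))     # local file system
```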

75

1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include
bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull …

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) - httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) - httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
bull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, whose much more concise lambda expressions make the code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
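The functional, chained style of the native API (flatMap, map, reduceByKey over an RDD) can be imitated in a few lines of plain Python. This is only a stand-in sketch to show the shape of the API, not actual Spark code:

```python
from functools import reduce

# Plain-Python imitation of the RDD word-count pipeline:
# lines -> words -> (word, 1) pairs -> summed counts.
lines = ["spark and hadoop", "spark with hadoop"]

words = [w for line in lines for w in line.split()]   # like flatMap
pairs = [(w, 1) for w in words]                       # like map
counts = reduce(                                      # like reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)

print(counts["spark"])  # 2
```

In real Spark, the same chain runs distributed over partitions; the lambda-based style above is exactly what the Java 8 and Scala APIs make concise.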

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
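To make the "mix SQL with imperative code" idea concrete, here is a toy sketch that uses Python's stdlib sqlite3 as a stand-in for Spark SQL; the `events` table and its values are invented for illustration:

```python
import sqlite3

# Toy stand-in for the Spark SQL pattern: run a declarative
# query, then post-process the result with imperative code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 6), ("a", 2)])

rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step on top of the SQL result.
top = max(rows, key=lambda r: r[1])
print(top)  # ('b', 6)
```

In Spark SQL the query would run over a SchemaRDD/DataFrame on a cluster, and the post-processing step would be ordinary RDD code in the same program.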

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
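The "mini batches" row in the table can be sketched in plain Python (no Spark involved): incoming records are grouped into small fixed-size batches, and each batch is processed as one unit:

```python
# Plain-Python sketch of the mini-batch streaming model.
def mini_batches(records, batch_size):
    """Group an incoming record sequence into fixed-size batches."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

stream = list(range(10))          # stand-in for an incoming stream
batch_sums = [sum(b) for b in mini_batches(stream, 4)]

# A record-at-a-time system (Storm's model) would instead handle
# each of the 10 records individually as it arrives.
print(batch_sums)  # [6, 22, 17]
```

Batching is what gives Spark Streaming its few-seconds latency floor, but also its exactly-once semantics and direct reuse of the batch API on each mini batch.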

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


36

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

37

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from an RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

(The original slide shows a grid of services and the open source tools Spark integrates with, by category: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.)

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark in Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
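The schema-inference idea can be sketched with the stdlib json module: scan the records and note the type observed for each field. This is a toy illustration of the concept, not Spark code, and the records are invented:

```python
import json

# Toy schema inference over JSON records: record the Python type
# observed for each field, in the spirit of Spark SQL's jsonFile.
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

schema = {}
for raw in records:
    for field, value in json.loads(raw).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how the `city` field, present in only one record, still lands in the inferred schema; Spark SQL handles such ragged JSON the same way, by unioning fields across records.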

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
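The row-to-columnar transformation at the heart of a format like Parquet can be sketched in a few lines of plain Python (illustrative only; real Parquet adds encodings, compression, and file metadata on top):

```python
# Rows -> columns: the core idea behind a columnar format.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
]

# Pivot the row-oriented records into one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A columnar layout lets a query read only the columns it needs.
print(columns["clicks"])  # [3, 5]
```

This is why Parquet pairs well with Spark SQL: a query touching one column scans only that column's data instead of whole rows.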

58

3. Integration
• spark-avro is a Spark SQL library for querying Avro data. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(The original slide shows the components of the Hadoop ecosystem alongside those of the Spark ecosystem.)

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching for general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System

Spark does not require HDFS, the Hadoop Distributed File System; your "Big Data" use case might be implemented without it. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop Ecosystem    Spark Ecosystem
HDFS                Tachyon
YARN                Mesos
Pig                 Spark native API
Hive                Spark SQL
Mahout              MLlib
Storm               Spark Streaming
Giraph              GraphX
HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and memory
Running tasks     Unix processes                             Linux container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
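To give a feel for the API's concision: the canonical word count in Spark's Scala API is essentially a one-liner, `sc.textFile(path).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`. A plain-Python sketch of the same flatMap → map → reduceByKey pipeline (no Spark required; the two input lines are invented, and this only illustrates the shape of the computation, not distributed execution):

```python
from collections import Counter
from itertools import chain

lines = ["spark and hadoop", "spark with hadoop"]

words = chain.from_iterable(line.split() for line in lines)  # flatMap
pairs = ((word, 1) for word in words)                        # map: word -> (word, 1)
counts = Counter()                                           # reduceByKey(_ + _)
for word, one in pairs:
    counts[word] += one

print(dict(counts))  # {'spark': 2, 'and': 1, 'hadoop': 2, 'with': 1}
```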

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
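That "mix and match" pattern (run a declarative SQL query, then keep working on the result with ordinary code) can be sketched with Python's built-in sqlite3 standing in for Spark SQL; the table name and data are invented for illustration:

```python
import sqlite3

# Stand-in for Spark SQL: a declarative aggregation, followed by
# imperative post-processing of the result set in the host language.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, n INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("home", 3), ("docs", 4), ("home", 2)])

rows = conn.execute(
    "SELECT page, SUM(n) FROM hits GROUP BY page").fetchall()
top_page = max(rows, key=lambda row: row[1])  # imperative step on SQL output
print(top_page)  # ('home', 5)
```

In Spark SQL the query would run over an RDD-backed table and the post-processing over the resulting RDD, distributed across the cluster.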

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini-batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
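The "mini-batches" row is the key architectural difference: Spark Streaming discretizes a live stream into small batches (one RDD per batch interval) instead of handling each record as it arrives. A toy illustration in plain Python (the timestamps and one-second interval are invented for the example):

```python
def discretize(events, interval):
    """Group (timestamp, value) events into consecutive mini-batches,
    mimicking how Spark Streaming turns a stream into per-interval RDDs."""
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.9, "d")]
print(discretize(events, interval=1.0))  # [['a', 'b'], ['c'], ['d']]
```

A record-at-a-time engine like Storm would instead invoke a handler per event, which is what buys it sub-second latency at the cost of batch-framework integration.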

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
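PageRank is the canonical GraphX example; a minimal, self-contained power-iteration sketch (the four-edge graph is invented) shows the kind of computation GraphX distributes across a cluster:

```python
def pagerank(edges, n, d=0.85, iterations=20):
    """Power iteration over an edge list of (src, dst) pairs; assumes
    every node has at least one outgoing edge (no dangling nodes)."""
    ranks = [1.0 / n] * n
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    for _ in range(iterations):
        contrib = [0.0] * n
        for src, dst in edges:
            contrib[dst] += ranks[src] / out_degree[src]
        ranks = [(1 - d) / n + d * c for c in contrib]
    return ranks

# Node 2 is linked from both 0 and 1, so it ends up ranked highest.
ranks = pagerank([(0, 1), (1, 2), (2, 0), (0, 2)], n=3)
```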

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta)
  • Hive on Spark (Currently in Beta) (2)
  • Sqoop on Spark
  • Cascading (Expected in 3.1 release)
  • Apache Crunch
  • Mahout (Expected in Mahout 1.0)
  • Mahout (Expected in Mahout 1.0) (2)
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV. Spark without Hadoop
  • 5. Key Takeaways
  • V. More Q&A
Page 37: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

37

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

38

Cascading (Expected in Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

Mahout (Expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

Mahout (Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: Hadoop ecosystem services and the open source tools Spark integrates with: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark codebase: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
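Schema inference here just means scanning the JSON records and unioning the fields they contain. A toy version of the idea (the records are invented; Spark SQL's real inference also reconciles conflicting types and handles nesting):

```python
import json

def infer_schema(json_lines):
    """Union the fields seen across records, remembering each field's
    type name: roughly what Spark SQL does when loading JSON with no DDL."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "city": "LA"}']
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```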

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
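The point of a columnar format is that values of one column are stored contiguously, so a query touching one field never reads the others. The idea in miniature (the rows are invented; real Parquet adds per-column encodings and compression that this sketch omits):

```python
rows = [{"name": "alice", "age": 34},
        {"name": "bob", "age": 28}]

# Row layout -> column layout ("Parquet-style"): one list per column.
columns = {field: [row[field] for row in rows] for field in rows[0]}

print(columns["age"])  # [34, 28] -- read without touching "name"
```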

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem and Spark ecosystem components side by side]

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos (References)
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop Ecosystem | Spark Ecosystem
HDFS             | Tachyon
YARN             | Mesos
Pig              | Spark native API
Hive             | Spark SQL
Mahout           | MLlib
Storm            | Spark Streaming
Giraph           | GraphX
HUE              | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
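The conciseness claim is easiest to see in a word count. The sketch below is plain Python rather than the actual Spark API, but it chains functional transformations the way an RDD pipeline would, so its shape mirrors what flatMap / map / reduceByKey look like in PySpark.

```python
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: fold each word into a running per-word count
counts = reduce(lambda acc, w: acc + Counter([w]), words, Counter())

print(counts["to"])  # 4
```

In actual Spark the same pipeline is `sc.textFile(...).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, which is why the lambda-style APIs in Scala, Java 8 and Python stay so close to one another.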

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
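The "record at a time" vs. "mini-batches" row can be made concrete with a toy sketch in plain Python (this is neither Storm's nor Spark Streaming's API; the function names and batch size are invented for illustration). The per-record path handles each event as it arrives, while the mini-batch path groups events and runs one operation per group, trading a little latency for batch-friendly processing.

```python
def process_per_record(stream):
    """Storm-style: emit a result for every record as it arrives."""
    return [record.upper() for record in stream]

def process_mini_batches(stream, batch_size=3):
    """Spark-Streaming-style: group records into small batches,
    then run one batch operation per group (here, join and uppercase)."""
    batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
    return [" ".join(batch).upper() for batch in batches]

events = ["a", "b", "c", "d", "e"]
print(process_per_record(events))    # ['A', 'B', 'C', 'D', 'E']
print(process_mini_batches(events))  # ['A B C', 'D E']
```

The batch grouping is also what lets Spark Streaming reuse the core Spark batch API on each mini-batch, which is the "Core Spark API" entry in the table above.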

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


38

(Expected in Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

39

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

40

(Expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration
Service categories with open source tools integrating with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
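Schema inference itself is easy to sketch in plain Python (an illustration of the idea only, not Spark SQL's actual algorithm): scan the JSON records and note the type seen for each field, which is roughly what Spark SQL does before exposing the data as a queryable table.

```python
import json

# Newline-delimited JSON records, as Spark SQL's jsonFile/jsonRDD expects
records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

# Infer a field -> type-name mapping across all records;
# fields missing from some records (like "city") become nullable columns
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Spark SQL additionally reconciles conflicting types across records and handles nested structures, but the no-DDL experience comes from exactly this kind of scan.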

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
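The row-vs-columnar distinction at the heart of Parquet fits in a few lines of plain Python (a conceptual sketch only; real Parquet adds encodings, compression and file metadata): the same records are pivoted so that each column's values sit together, which is what makes column scans and compression efficient.

```python
# Row-oriented records, as a query engine would receive them
rows = [
    {"user": "ann", "score": 10},
    {"user": "bob", "score": 7},
    {"user": "cat", "score": 10},
]

# Pivot to a columnar layout: one contiguous list per column
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(columns["user"])   # ['ann', 'bob', 'cat']
print(columns["score"])  # [10, 7, 10]

# A column scan (e.g. SUM(score)) now touches only one list,
# skipping the "user" column entirely
print(sum(columns["score"]))  # 27
```

Repeated values within a column (the two 10s above) also sit adjacent, which is why columnar formats compress and encode so well.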

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• Guavus' operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component   Hadoop Ecosystem   Spark Ecosystem
Storage     HDFS               Tachyon
Resources   YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92
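To make the "nearly as simple as Scala" point concrete, here is the classic word-count dataflow that the Spark native API expresses with a few lambdas (in real Spark this would be `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`), mimicked in plain Python so it runs without a cluster; lists stand in for RDDs:

```python
from collections import Counter

# A Spark-free sketch of the flatMap -> map -> reduceByKey word count;
# plain Python lists stand in for RDDs here.
lines = ["spark or hadoop", "hadoop and spark"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
def reduce_by_key(pairs):
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

word_counts = reduce_by_key(pairs)
print(word_counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```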

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93
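Spark SQL's automatic schema inference for JSON can be illustrated with a toy, plain-Python version of the idea. This is a deliberate simplification under stated assumptions (flat records, conflicting types collapse to "string"); real Spark SQL handles nesting, nulls, and type widening far more carefully:

```python
import json

# Toy sketch of JSON schema inference: scan every record and note the type
# seen for each field; on a type conflict, fall back to "string".
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            kind = type(value).__name__
            if schema.get(field, kind) != kind:
                schema[field] = "string"
            else:
                schema[field] = kind
    return schema

records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": "unknown"}',  # conflicting type for "age"
]
print(infer_schema(records))  # {'name': 'str', 'age': 'string'}
```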

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python

96
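The key row in the table above is the processing model: Storm hands each record to the topology as it arrives, while Spark Streaming discretizes the stream into mini batches and runs a small Spark job per batch. A plain-Python sketch of that micro-batching idea (batching by record count here for simplicity; Spark Streaming actually batches by time interval):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an incoming record stream into mini batches, roughly the way
    Spark Streaming discretizes a stream (here by count, not by time)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = ["e1", "e2", "e3", "e4", "e5"]

# Record-at-a-time (Storm-like): each event is processed as it arrives.
processed_individually = [e.upper() for e in events]

# Mini-batch (Spark Streaming-like): whole batches are processed at once,
# trading a little latency for batch-level throughput and semantics.
processed_in_batches = [[e.upper() for e in b] for b in micro_batches(events, 2)]
print(processed_in_batches)  # [['E1', 'E2'], ['E3', 'E4'], ['E5']]
```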

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr. http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• Guavus' operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop Ecosystem → Spark Ecosystem

Component:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long running Spark jobs.

• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria: YARN | Mesos
• Resource sharing: yes | yes
• Written in: Java | C++
• Scheduling: memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: specific requests and locality preference | more generic, but more coding for writing frameworks
• Maturity: less mature | relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014. http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
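The functional style that the native API (and Java 8 lambdas) makes so concise can be sketched without a cluster. This stdlib-only Python sketch mimics the classic RDD word count with hypothetical `flat_map` / `reduce_by_key` helpers; the names echo, but are not, Spark's actual API:

```python
from collections import defaultdict
from functools import reduce

def flat_map(f, xs):
    # Like RDD.flatMap: apply f to each element, then flatten the results.
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    # Like RDD.reduceByKey: combine all values that share a key with f.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

lines = ["to be or not to be"]
words = flat_map(str.split, lines)               # split lines into words
pairs = [(w, 1) for w in words]                  # map each word to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark the same chain runs distributed over partitions; the shape of the code is nearly identical in Scala, Java 8 and Python.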

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore.

• Spark SQL also allows manipulating (semi-) structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
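The "mix and match" workflow can be illustrated with a stdlib-only analogue, using Python's sqlite3 as a stand-in for the SQL engine (table, column and threshold are made up for the example): a declarative aggregation followed by an imperative post-processing step, the pattern Spark SQL enables over RDDs.

```python
import sqlite3

# Load some relational rows, aggregate them declaratively with SQL...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 4.5), ("ann", 2.5)])

rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...then post-process the SQL result imperatively: flag big spenders.
flagged = [(user, total, total > 5) for user, total in rows]
print(flagged)  # [('ann', 12.5, True), ('bob', 4.5, False)]
```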

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria: Storm | Spark Streaming
• Processing model: record at a time | mini batches
• Latency: sub-second | few seconds
• Fault tolerance (every record processed): at least once (may be duplicates) | exactly once
• Batch framework integration: not available | Core Spark API
• Supported languages: any programming language | Scala, Java, Python
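The "mini batches" row is the key difference: Spark Streaming discretizes a stream into small batches that are then processed with the core Spark engine, trading a few seconds of latency for exactly-once batch semantics. A stdlib-only sketch of that discretization (a fixed batch size stands in for the time-based batch interval):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    # Group an already time-ordered record stream into fixed-size mini
    # batches -- the discretization Spark Streaming performs per batch
    # interval, versus Storm's one-record-at-a-time model.
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

records = range(7)
print(list(micro_batches(records, 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each yielded batch is what Spark Streaming would hand to the batch engine as one small RDD.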

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


40

(Expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org

• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014. http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014. http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: Hadoop ecosystem services and the open source tools Spark integrates with – storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache. https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra. http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1). http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights. http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015. http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015. http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014. http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill. http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
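Conceptually, Kafka keeps an append-only log per topic, and each consumer group reads that log in order at its own offset; a Spark Streaming receiver simply polls the new messages each batch interval. A toy stdlib-only broker sketching that model (the class and method names are illustrative, not Kafka's API):

```python
from collections import defaultdict

class MiniBroker:
    # Toy publish/subscribe broker illustrating the Kafka model that
    # Spark Streaming consumes from: producers append to a named topic,
    # and each consumer group reads the topic at its own offset.
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only log
        self.offsets = defaultdict(int)   # (topic, group) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, group):
        log = self.topics[topic]
        start = self.offsets[(topic, group)]
        self.offsets[(topic, group)] = len(log)
        return log[start:]

broker = MiniBroker()
broker.publish("clicks", {"user": "ann"})
broker.publish("clicks", {"user": "bob"})
print(broker.poll("clicks", "spark-streaming"))  # both messages
print(broker.poll("clicks", "spark-streaming"))  # [] -- already consumed
```

Because offsets are tracked per group, a second consumer group would re-read the topic from the beginning, independently of the first.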

55

3. Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015. http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
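The schema-inference idea can be sketched in a few lines of stdlib Python: scan JSON records and derive a field-to-type mapping. This is a deliberately simplified stand-in for what Spark SQL does automatically (Spark additionally merges conflicting types and handles nesting):

```python
import json

def infer_schema(json_lines):
    # Derive field -> type-name from a sample of JSON records, in the
    # spirit of Spark SQL's automatic JSON schema inference (simplified:
    # first type seen wins, no nesting, no type widening).
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = ['{"name": "ann", "age": 34}', '{"name": "bob", "city": "LA"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how fields absent from some records still end up in the schema: the inferred schema is the union over all records, which is what lets Spark SQL query ragged JSON without any DDL.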

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
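Why columnar? The core trick can be shown in miniature: store each field contiguously, so a query touching one column reads only that column's data (real Parquet adds encodings, compression, and nested-data shredding on top of this idea):

```python
def to_columnar(rows, fields):
    # Row-to-column conversion in miniature: a columnar format like
    # Parquet lays each field out contiguously, so a scan over one
    # column never touches the bytes of the others.
    return {f: [row[f] for row in rows] for f in fields}

rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
cols = to_columnar(rows, ["id", "name"])
print(cols["id"])  # [1, 2] -- aggregate 'id' by scanning just this list
```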

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014). http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015. http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster. https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014. http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015. http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015. http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015. http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
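Spark stays storage-agnostic by resolving the backend from the path's URI scheme (hdfs://, s3n://, tachyon://, file://, and so on). A toy stdlib-only resolver mimicking that dispatch; the scheme names are real, but the handler table and function are purely illustrative:

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to backing store; Spark's actual
# resolution goes through Hadoop FileSystem implementations.
HANDLERS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory FS",
    "file": "local file system",
}

def storage_for(path):
    # A bare path with no scheme falls back to the local file system.
    scheme = urlparse(path).scheme or "file"
    return HANDLERS.get(scheme, "unknown storage")

print(storage_for("s3n://bucket/logs/2015/03/12"))  # Amazon S3
print(storage_for("/tmp/data.txt"))                 # local file system
```

The same job code can therefore point at S3, Tachyon, or a local disk just by changing the input path.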

75

1. File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012. https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015. http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support). http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) - http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• Guavus' operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component          Hadoop Ecosystem   Spark Ecosystem
File system        HDFS               Tachyon
Resource manager   YARN               Mesos
Tools              Pig                Spark native API
SQL                Hive               Spark SQL
Machine learning   Mahout             MLlib
Streaming          Storm              Spark Streaming
Graph              Giraph             GraphX
Notebook           HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long running Spark jobs.

• Mesos as Data Center "OS"
• Share the datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and Memory
Running tasks      Unix processes                              Linux Container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
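To make the "native API" point concrete: the canonical word count chains flatMap, map and reduceByKey over an RDD. The sketch below mimics that chain with plain Python built-ins (no Spark installation is assumed here), so the functional shape of the computation is visible even though the real API distributes it across a cluster:

```python
from collections import Counter

# Stand-in for an RDD pipeline; in Spark this would be
# sc.textFile(...).flatMap(...).map(...).reduceByKey(...)
lines = ["to be or not to be", "to see or not to see"]

words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = [(w, 1) for w in words]                       # map: word -> (word, 1)

counts = Counter()                                    # reduceByKey: sum the 1s
for word, one in pairs:
    counts[word] += one

print(counts["to"])   # prints 4
```

Each step is a pure function over the data, which is exactly what lets Spark parallelize the same chain without changing its structure.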

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
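The "mix and match SQL and imperative APIs" working style can be sketched without a Spark cluster. Below, Python's built-in sqlite3 stands in for the SQL engine (an illustrative assumption, not Spark SQL itself): a declarative query does the filtering and aggregation, then ordinary code post-processes the rows, which is the same pattern Spark SQL enables over its tables:

```python
import sqlite3

# In-memory table standing in for a Hive/Parquet-backed Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, duration INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 30), ("bob", 120), ("ann", 90)])

# Declarative part: SQL selects and aggregates.
rows = conn.execute(
    "SELECT user, SUM(duration) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: arbitrary code over the query result.
total = sum(duration for _, duration in rows)
print(rows, total)   # [('ann', 120), ('bob', 120)] 240
```

In Spark SQL the query would run over distributed data and the post-processing over RDDs/DataFrames, but the division of labor is the same.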

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
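The "record at a time" vs "mini batches" row is the crux of the latency difference. The hypothetical helper below (plain Python, no Spark assumed) chops a stream of (timestamp, value) records into fixed-width batch intervals, conceptually like Spark Streaming's DStream model: a record is only processed once the interval containing it closes, which is why latency is "a few seconds" rather than sub-second:

```python
def mini_batches(records, interval):
    """Group (timestamp, value) records into fixed-width windows.

    Conceptual sketch of Spark Streaming's batching: a record
    arriving at time t is processed only when the window
    [n*interval, (n+1)*interval) that contains t is complete.
    """
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
print(mini_batches(stream, 1.0))   # [['a', 'b'], ['c'], ['d']]
```

A record-at-a-time system like Storm would hand "a" to the operator immediately at t=0.2; the batched model waits until t=1.0, trading latency for per-batch throughput and the exactly-once semantics in the table above.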

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is File System Agnostic. Bring Your Own Storage!
2. Deployment: Spark is Cluster Infrastructure Agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 41: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

41

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration

[Diagram: open source tools from the Hadoop ecosystem that integrate with Spark, grouped by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.

• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop

49

3. Integration
• There is also NSMC: Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator)

• Integration still improving: https://issues.apache.org/jira/issues/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)

• Some issues are critical ones.

• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink

• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
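A toy version of that schema inference fits in a few lines. The hypothetical infer_schema below (plain Python, not Spark's actual implementation) scans JSON records and maps each field to a type name, widening to string on conflict; this is essentially what lets Spark SQL skip the DDL step:

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from JSON records.

    Toy sketch of what Spark SQL does when pointed at JSON files;
    conflicting types widen to string, the usual fallback.
    """
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            if schema.setdefault(field, t) != t:
                schema[field] = "str"   # widen on type conflict
    return schema

records = ['{"name": "ann", "age": 34}',
           '{"name": "bob", "age": "unknown", "city": "NYC"}']
print(infer_schema(records))
# {'name': 'str', 'age': 'str', 'city': 'str'}
```

Note how "age" starts as an integer and is widened once a string value appears, and how "city" is picked up even though the first record lacks it: the schema is the union over all records.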

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
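The benefit of a columnar format like Parquet is easy to see in miniature. The sketch below (plain Python, illustrative only; not the actual Parquet encoding) stores the same table row-wise and column-wise: an aggregate over one field reads a single contiguous list in the columnar layout instead of touching every record:

```python
# The same three-row table in the two layouts.
rows = [
    {"name": "ann", "age": 34, "city": "NYC"},
    {"name": "bob", "age": 41, "city": "LA"},
    {"name": "cat", "age": 29, "city": "NYC"},
]
columns = {
    "name": ["ann", "bob", "cat"],
    "age":  [34, 41, 29],
    "city": ["NYC", "LA", "NYC"],
}

# Row layout: a scan over "age" must visit every record.
avg_row = sum(r["age"] for r in rows) / len(rows)

# Columnar layout: the query reads exactly one column.
avg_col = sum(columns["age"]) / len(columns["age"])

assert avg_row == avg_col   # same answer, very different I/O pattern
```

On disk, reading one column of a hundred-column Parquet table skips the other ninety-nine entirely, and per-column compression works better because values of one type are stored together.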

58

3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"

• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon. http://sparkbigdata.com/component/tags/tag/13

4. Use a Non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a Non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways


Page 42: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

42

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

43

3. Integration (Service : Open Source Tool)
The original slide maps each service category to its open source tools:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

44

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) is planned, to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark JIRA issues mentioning YARN (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/issues/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
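To make the schema-inference idea concrete, here is a toy sketch in plain Python (my illustration, not Spark SQL's actual algorithm) of deriving a field-to-type mapping by scanning JSON records, widening a field to a string type when records disagree:

```python
import json

def infer_schema(json_lines):
    """Merge the field -> type mapping observed across all records,
    the way Spark SQL scans a JSON dataset to build a SchemaRDD."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            # A field seen with two conflicting types is widened to 'string'.
            if schema.get(field, t) != t:
                schema[field] = "string"
            else:
                schema[field] = t
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": "unknown", "city": "NYC"}',
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'string', 'city': 'str'}
```

Spark SQL performs this kind of scan distributed over the whole dataset and yields a typed SchemaRDD/DataFrame rather than a dict.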

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
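Why a columnar format helps analytics can be shown with a toy example in plain Python (an illustration of the layout idea only, not the actual Parquet encoding): a query that touches one column reads just that column, not every record:

```python
# Row-oriented layout: every query touches whole records.
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob",   "score": 75},
    {"id": 3, "name": "carol", "score": 82},
]

# Column-oriented layout (conceptually what Parquet stores per column chunk).
columns = {
    "id":    [1, 2, 3],
    "name":  ["alice", "bob", "carol"],
    "score": [90, 75, 82],
}

# 'SELECT avg(score)' only needs to read one column out of three.
avg_score = sum(columns["score"]) / len(columns["score"])
print(avg_score)
```

On disk, this column pruning (plus per-column compression) is what makes Parquet scans cheap for analytical queries.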

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
(The original slide shows the Hadoop ecosystem and the Spark ecosystem side by side.)

65

4. Complementarity (Spark + Tachyon + HDFS)
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity (YARN + Mesos)
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity (YARN + Mesos): References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity (Spark + Tez)
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity (Spark + Tez)
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption (February 12, 2015): http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms (February 23, 2015): http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop (March 9, 2015): http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
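The 'no HDFS required' point boils down to writing the computation against an abstract stream of records, so the backing store only changes the reader, not the logic. A toy sketch in plain Python (not Spark's API; the reader function names are made up for illustration):

```python
def word_count(records):
    """Storage-agnostic computation: works on any iterable of lines."""
    counts = {}
    for line in records:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Two interchangeable 'storage backends' feeding the same computation.
def from_memory():
    yield from ["spark on hdfs", "spark without hdfs"]

def from_file(path):
    # A local file, standing in for S3, CassandraFS, GridFS, Tachyon, etc.
    with open(path) as f:
        yield from f

print(word_count(from_memory()))  # {'spark': 2, 'on': 1, 'hdfs': 2, 'without': 1}
```

In Spark the same separation holds: the RDD transformations stay identical whether the input URI points at hdfs://, s3n://, file:// or another supported store.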

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS (July 11, 2012): https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage (March 9, 2015): http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant (March 4, 2015): https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014 (July 2, 2014): https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos (September 25, 2014, by Eric Carr): http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Component            Hadoop Ecosystem   Spark Ecosystem
Storage              HDFS               Tachyon
Resource Management  YARN               Mesos
Tools                Pig                Spark native API
                     Hive               Spark SQL
                     Mahout             MLlib
                     Storm              Spark Streaming
                     Giraph             GraphX
                     HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and Memory
Running tasks      Unix processes                              Linux Container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
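The flavor of the native API can be mimicked with plain Python over a list (a toy stand-in: the built-in `map`/`filter`/`reduce` here play the role of the corresponding RDD transformations and action; no SparkContext is involved):

```python
from functools import reduce

# A toy 'RDD pipeline' over a plain list: map -> filter -> reduce,
# mirroring the shape of Spark's Scala/Python native API.
data = [1, 2, 3, 4, 5]
squared = map(lambda x: x * x, data)             # like rdd.map(lambda x: x * x)
evens = filter(lambda x: x % 2 == 0, squared)    # like .filter(lambda x: x % 2 == 0)
total = reduce(lambda a, b: a + b, evens)        # like .reduce(lambda a, b: a + b)
print(total)  # 4 + 16 = 20
```

The real API has the same chained, lambda-driven shape, which is why Java 8 lambdas bring the Java version close to the Scala one.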

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
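The 'mix and match SQL with imperative APIs' idea can be illustrated with Python's stdlib sqlite3 standing in for Spark SQL (a toy sketch only; Spark's actual entry point would be a SQLContext and DataFrames, not sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 7), ("alice", 2)])

# Declarative step: aggregate with SQL ...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ... then an imperative step over the result, as one would chain
# RDD/DataFrame operations after a Spark SQL query.
top = [user for user, total in rows if total > 4]
print(top)  # ['alice', 'bob']
```

In Spark the two steps would share one engine and one distributed dataset, which is the point of unifying SQL and programmatic analysis.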

93

Spark MLlib
• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming
• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                   Storm                              Spark Streaming
Processing model                           Record at a time                   Mini batches
Latency                                    Sub-second                         Few seconds
Fault tolerance (every record processed)   At least once (maybe duplicates)   Exactly once
Batch framework integration                Not available                      Core Spark API
Supported languages                        Any programming language           Scala, Java, Python
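The first row of the table, record-at-a-time versus mini batches, can be sketched in plain Python (a toy model of the two processing styles, not either framework's API):

```python
def record_at_a_time(stream, handle):
    """Storm-style: invoke the handler once per incoming record."""
    for record in stream:
        handle(record)

def mini_batches(stream, batch_size, handle_batch):
    """Spark-Streaming-style toy: group records into batches and hand each
    batch to the handler at once (the real system batches by time interval,
    e.g. every 2 seconds, rather than by count)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:
        handle_batch(batch)

events = [1, 2, 3, 4, 5]
per_record, per_batch = [], []
record_at_a_time(events, per_record.append)
mini_batches(events, 2, per_batch.append)
print(per_record)  # [1, 2, 3, 4, 5]
print(per_batch)   # [[1, 2], [3, 4], [5]]
```

Batching is what gives Spark Streaming its few-seconds latency floor, its exactly-once semantics, and its reuse of the core batch API.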

96

GraphX
• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A

43

3 Integration
[Slide diagram: Hadoop-ecosystem service categories, each paired with an open source tool: storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

44

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) is planned, to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

45

3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3 Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop and Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some of the issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
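Schema inference here means scanning the JSON records and deriving a field-to-type mapping, so no DDL has to be written up front. A toy version of the idea in plain Python (an illustration of the concept only, not Spark SQL's actual inference code, which also merges conflicting types and handles nested structures):

```python
import json

def infer_schema(json_lines):
    """Derive a {field: type-name} mapping from newline-delimited JSON
    records, taking the union of fields across records (as Spark SQL does;
    the first type seen for a field wins in this simplified version)."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, type(value).__name__)
    return schema

records = [
    '{"name": "hdfs", "replicas": 3}',
    '{"name": "kafka", "replicas": 5, "streaming": true}',
]
print(infer_schema(records))
# {'name': 'str', 'replicas': 'int', 'streaming': 'bool'}
```

Spark SQL performs this scan for you when you point it at a JSON dataset, then lets you run SQL against the inferred columns immediately.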

57

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
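The benefit of a columnar format like Parquet is that a query touching one column reads only that column's values, not every row. A toy row-store vs column-store comparison in plain Python (conceptual only; real Parquet adds encodings, compression, and row groups on top of this layout idea):

```python
# Row layout: one dict per record; column layout: one list per field.
rows = [
    {"user": "ann", "country": "US", "bytes": 120},
    {"user": "bob", "country": "FR", "bytes": 300},
    {"user": "eve", "country": "US", "bytes": 50},
]

def to_columnar(rows):
    """Pivot row-oriented records into a {column: [values...]} layout."""
    return {col: [r[col] for r in rows] for col in rows[0]}

columns = to_columnar(rows)

# "SELECT sum(bytes)": the column store touches one list of three ints,
# while a row store must walk every field of every record.
print(sum(columns["bytes"]))    # 470
print(columns["country"])       # ['US', 'FR', 'US']
```

Grouping a column's values together is also what makes Parquet compress so well: values of one type and similar range sit next to each other on disk.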

58

3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • The data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways

64

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
[Slide graphic: Hadoop-ecosystem logos alongside Spark-ecosystem logos]

65

4 Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4 Complementarity: YARN + Mesos references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4 Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of the Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark and Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform, deployed at the world's largest telcos; September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

                 Hadoop ecosystem   Spark ecosystem
Component        HDFS               Tachyon
                 YARN               Mesos
Tools            Pig                Spark native API
                 Hive               Spark SQL
                 Mahout             MLlib
                 Storm              Spark Streaming
                 Giraph             GraphX
                 HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and memory
Running tasks     Unix processes                             Linux container groups
Requests          Specific requests, locality preference     More generic, but more coding to write frameworks
Maturity          Less mature                                Relatively more mature

91

Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, so code can be nearly as simple as with the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 44: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

44

3 Integrationbull Spark was designed to read and write data from and to HDFS as

well as other storage systems supported by Hadoop API such as your local file system Hive HBase Cassandra and Amazonrsquos S3

bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memoryThis allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 30 httpsissuesapacheorgjirabrowseHDFS-5851

45

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

46

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

48

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
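To illustrate what schema inference means here, a toy pure-Python sketch (not Spark SQL's actual implementation) that infers a flat schema, as field names and type names, from a set of JSON lines:

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of all fields seen, with the
    Python type name of the first value observed for each field."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "city": "LA"}',   # new field appears: schema grows
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Spark SQL does the same kind of pass over the data (handling nesting and type conflicts), which is why no DDL is needed up front.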

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
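The benefit of a columnar format can be sketched in plain Python. This toy transposition shows only the row-vs-column layout idea, not Parquet's actual encoding (which adds compression, encodings and metadata):

```python
# Row layout: each record holds all its fields together.
rows = [
    {"user": "alice", "age": 34, "clicks": 10},
    {"user": "bob",   "age": 41, "clicks": 3},
    {"user": "carol", "age": 29, "clicks": 7},
]

# Columnar layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query touching a single column only reads that column's values,
# instead of scanning every full row.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 20
```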

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(Diagram: Hadoop ecosystem and Spark ecosystem side by side)

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

                Hadoop ecosystem    Spark ecosystem
Component:      HDFS                Tachyon
                YARN                Mesos
Tools:          Pig                 Spark native API
                Hive                Spark SQL
                Mahout              MLlib
                Storm               Spark Streaming
                Giraph              GraphX
                HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
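The conciseness of this functional style can be illustrated with a plain-Python word count written in the flatMap/reduceByKey shape popularized by the RDD API (no Spark required; stdlib stand-ins for the Spark operators):

```python
# Toy word count: chain.from_iterable plays the role of flatMap,
# Counter plays the role of map-to-(word, 1) followed by reduceByKey(+).
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

words = chain.from_iterable(line.split() for line in lines)  # ~ flatMap
counts = Counter(words)                                      # ~ reduceByKey

print(counts["to"])  # 4
```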

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

93

Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                      Storm                               Spark Streaming
Processing model              Record at a time                    Mini batches
Latency                       Sub-second                          Few seconds
Fault tolerance               At least once (may be duplicates)   Exactly once
(every record processed)
Batch framework integration   Not available                       Core Spark API
Supported languages           Any programming language            Scala, Java, Python
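The processing-model row of the table can be sketched in plain Python (no Storm or Spark involved): record-at-a-time handles each event as it arrives, while micro-batching groups events into small batches before processing each batch:

```python
# Toy contrast of the two stream-processing models.
events = list(range(10))

# Record-at-a-time (Storm-style): handle each event individually,
# giving the lowest per-event latency.
processed_one_by_one = [e * 2 for e in events]

# Mini-batches (Spark Streaming-style): group events into batches,
# then process each batch as a small job.
batch_size = 4
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
processed_in_batches = [sum(batch) for batch in batches]

print(batches)               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(processed_in_batches)  # [6, 22, 17]
```

Batching trades a few seconds of latency for the ability to reuse the core batch engine, which is why Spark Streaming integrates with the rest of the Spark API.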

96

GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 45: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

45

3 Integrationbull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

46

3 Integration bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra

47

3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new avenues

bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo

httpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

49

3 Integration bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu

ction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph

databasebull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving; open YARN-related Spark issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
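The Hive bullets above look like this in the Spark 1.2-era shell. This is an illustrative sketch: it assumes a hive-site.xml on the classpath and an existing table, and the table and column names are hypothetical.

```scala
// Querying an existing Hive table from Spark SQL via HiveContext.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Import relational data from a Hive table and run SQL over it
val topPages = hiveContext.sql(
  "SELECT page, SUM(hits) AS total FROM page_views GROUP BY page ORDER BY total DESC LIMIT 10")

// The result is a SchemaRDD; it can be transformed and collected like any RDD
topPages.collect().foreach(println)
```

The same `SchemaRDD` could then be handed to MLlib as a training set, which is the "fetching datasets" use mentioned above.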

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, or embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
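The native integration mentioned above boils down to one call in the receiver-based API of the integration guide. This is an illustrative sketch: the ZooKeeper quorum, consumer group id, and topic name are hypothetical, and the spark-streaming-kafka artifact must be on the classpath.

```scala
// Consuming a Kafka topic with Spark Streaming (Spark 1.2-era receiver API).
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

val lines = KafkaUtils.createStream(
    ssc,
    "zookeeper1:2181",        // ZooKeeper quorum
    "spark-consumer-group",   // consumer group id
    Map("events" -> 1))       // topic -> number of receiver threads
  .map(_._2)                  // drop the key, keep the message body

lines.count().print()

ssc.start()
ssc.awaitTermination()
```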

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
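Approach 1 above can be sketched as follows. This is illustrative only: the hostname and port are hypothetical and must match the Avro sink configured in the Flume agent, and the spark-streaming-flume artifact must be on the classpath.

```scala
// Push-based Flume integration: Spark Streaming listens as an Avro endpoint
// that the Flume agent's Avro sink pushes events to.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(sc, Seconds(5))

// Listen on the address the Flume agent's Avro sink points at
val flumeStream = FlumeUtils.createStream(ssc, "spark-worker-1", 44444)

// Each element is a SparkFlumeEvent wrapping the Flume event body
flumeStream.map(event => new String(event.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()
```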

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
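The "no more DDL" point looks like this in the 1.2-era shell (the file path is hypothetical; the file is expected to hold one JSON object per line).

```scala
// Schema inference over JSON with Spark SQL.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Point Spark SQL at the JSON file; the schema is inferred automatically
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Query it like any table -- no DDL was written
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```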

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
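The three bullets above form a round trip that can be sketched with the Spark 1.2-era SchemaRDD API (the paths, case class, and data are hypothetical).

```scala
// Writing an RDD out as Parquet and reading it back with Spark SQL.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD-of-case-class -> SchemaRDD

case class Event(id: Int, name: String)
val events = sc.parallelize(Seq(Event(1, "start"), Event(2, "stop")))

// Easily write an RDD out to a Parquet file...
events.saveAsParquetFile("hdfs:///data/events.parquet")

// ...and import it back; the schema travels in the file's metadata
val restored = sqlContext.parquetFile("hdfs:///data/events.parquet")
restored.registerTempTable("events")
sqlContext.sql("SELECT name FROM events WHERE id = 1").collect().foreach(println)
```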

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
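A minimal sketch of the spark-avro library named above, assuming the library jar is on the shell's classpath; the file path and field name are hypothetical.

```scala
// Querying Avro data with Spark SQL via the databricks/spark-avro library.
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds avroFile to SQLContext

val sqlContext = new SQLContext(sc)

// Expose the Avro records as a table and query them with Spark SQL
val episodes = sqlContext.avroFile("hdfs:///data/episodes.avro")
episodes.registerTempTable("episodes")
sqlContext.sql("SELECT title FROM episodes").collect().foreach(println)
```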

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
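The "any RDD can be saved as documents" point above can be sketched with elasticsearch-hadoop's native Spark support. This is illustrative only: the index name and documents are hypothetical, and the cluster location is taken from `es.nodes` in the Spark configuration.

```scala
// Writing an RDD to Elasticsearch and reading it back as an RDD.
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

val articles = sc.makeRDD(Seq(
  Map("title" -> "Spark and ES", "views" -> 10),
  Map("title" -> "RDDs as documents", "views" -> 3)))

// Each Map becomes a JSON document in the media/articles index
articles.saveToEs("media/articles")

// Reading back: every matching document becomes an (id, document) pair
val hits = sc.esRDD("media/articles", "?q=Spark")
hits.take(5).foreach(println)
```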

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
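Option 4 above is mostly a matter of the URI scheme. An illustrative sketch for S3 (the bucket name and key layout are hypothetical; the s3n credentials scheme shown is the 1.x-era Hadoop convention):

```scala
// Reading from Amazon S3 instead of HDFS by changing only the path scheme.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// The same code accepts hdfs://, tachyon://, file://, ... paths
val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
println(logs.filter(_.contains("ERROR")).count())
```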

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

Hadoop ecosystem component | Spark ecosystem alternative
HDFS | Tachyon
YARN | Mesos
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

88

Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
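The conciseness point above is easy to see with the classic word count in the Scala shell (the input path is hypothetical):

```scala
// Word count in Spark's native Scala API, runnable in the interactive shell.
val counts = sc.textFile("hdfs:///data/pages.txt")
  .flatMap(line => line.split("\\s+"))  // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word

counts.take(10).foreach(println)
```

The Java 8 version reads almost identically, with `line -> Arrays.asList(line.split(" "))`-style lambdas replacing the Scala closures.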

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
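The connector's RDD API described above can be sketched as follows. This is illustrative only: it assumes a shell launched with the connector on the classpath and `spark.cassandra.connection.host` set, and the keyspace, table, and column names are hypothetical.

```scala
// Exposing a Cassandra table as an RDD and writing an RDD back.
import com.datastax.spark.connector._

// Read: the table becomes an RDD of CassandraRow, filterable like any RDD
val users = sc.cassandraTable("test_keyspace", "users")
println(users.filter(row => row.getInt("age") > 21).count())

// Write: save an RDD of tuples to matching columns of a Cassandra table
sc.parallelize(Seq(("alice", 30), ("bob", 25)))
  .saveToCassandra("test_keyspace", "users", SomeColumns("name", "age"))
```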

47

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

48

3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo

httpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

49

3 Integration bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu

ction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph

databasebull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones bull Running Spark on YARN

httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integrationbull Spark SQL provides built in support for Hive tables

bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml

57

3 Integrationbull Apache Parquet is a columnar storage format available

to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

58

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem

bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
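The RDD-to-document translation amounts to serializing each record as JSON next to an action line, the newline-delimited shape Elasticsearch's `_bulk` endpoint expects. A minimal sketch, with a hypothetical `to_bulk_body` helper and index name:

```python
import json

def to_bulk_body(index, records):
    """Build an Elasticsearch _bulk request body: an action line,
    then the document itself, for each record."""
    lines = []
    for doc in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # _bulk bodies must end with a newline

body = to_bulk_body("logs", [{"msg": "start"}, {"msg": "stop"}])
print(body)
```

elasticsearch-hadoop does this conversion (and the batching and routing) for you when you call its RDD save methods.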

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
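The Data >> RAM vs. Data << RAM rule of thumb above is simple enough to state as code. The `pick_engine` helper is purely illustrative; a real engine choice would weigh far more than this one ratio.

```python
def pick_engine(data_gb, cluster_ram_gb):
    """Toy rule of thumb from the slide: cache-friendly Spark when the data
    fits in cluster memory, a more stream-oriented engine like Tez when not."""
    return "spark" if data_gb < cluster_ram_gb else "tez"

print(pick_engine(data_gb=200, cluster_ram_gb=512))   # spark
print(pick_engine(data_gb=5000, cluster_ram_gb=512))  # tez
```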

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
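Spark's storage agnosticism comes down to resolving a path by its URI scheme, which is how the options above (S3, MapR-FS, Swift, Tachyon) all plug in without changing application code. A minimal sketch of that dispatch, with hypothetical reader names:

```python
from urllib.parse import urlparse

# Hypothetical readers keyed by URI scheme; Spark resolves storage similarly,
# which is why the same job runs against HDFS, S3, Tachyon, or a local disk.
READERS = {
    "hdfs":    "Hadoop FileSystem client",
    "s3n":     "S3 connector",
    "tachyon": "Tachyon client",
    "file":    "local filesystem",
}

def resolve_storage(path):
    scheme = urlparse(path).scheme or "file"   # bare paths default to local files
    return READERS.get(scheme, "unknown scheme")

print(resolve_storage("s3n://bucket/logs/2015/"))   # S3 connector
print(resolve_storage("/tmp/local-data.txt"))       # local filesystem
```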

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

             Hadoop ecosystem | Spark ecosystem
Component:   HDFS             | Tachyon
             YARN             | Mesos
Tools:       Pig              | Spark native API
             Hive             | Spark SQL
             Mahout           | MLlib
             Storm            | Spark Streaming
             Giraph           | GraphX
             HUE              | Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, making code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
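The RDD API's shape (flatMap, map, reduceByKey) can be mirrored on pure Python lists, which is a handy way to reason about a pipeline before running it on a cluster. A word-count sketch of that shape, not PySpark itself:

```python
from functools import reduce

# Pure-Python analogue of the classic RDD chain:
# sc.textFile(...).flatMap(split).map(lambda w: (w, 1)).reduceByKey(add)
lines = ["spark and hadoop", "spark or hadoop"]

words = [w for line in lines for w in line.split()]        # flatMap
pairs = [(w, 1) for w in words]                            # map
counts = reduce(                                           # reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```

The real API distributes each stage across partitions, but the lambda-driven style is the same, which is why Java 8 lambdas bring the Java API close to Scala's.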

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
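The "mini-batches" row is the key difference: Spark Streaming chops the stream into small batches and runs the ordinary batch API on each one, trading a little latency for throughput and exactly-once semantics. A pure-Python simulation of that micro-batching:

```python
import itertools

def micro_batches(stream, batch_size):
    """Spark Streaming style: group the stream into small batches so each
    batch can be processed with the ordinary batch API."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(7)
print([sum(b) for b in micro_batches(events, batch_size=3)])  # [3, 12, 6]
```

Storm, by contrast, hands each record to the topology the moment it arrives, which is what buys its sub-second latency.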

96

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 47: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

47

3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new avenues

bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

48

3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo

httpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

49

3 Integration bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu

ction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

50

3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph

databasebull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

51

3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones bull Running Spark on YARN

httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integrationbull Spark SQL provides built in support for Hive tables

bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml

57

3 Integrationbull Apache Parquet is a columnar storage format available

to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

58

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem

bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from

MapReduce to Sparkbull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3:
  http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS:
  https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
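The list above works because storage is addressed by URI scheme, so the engine stays storage-agnostic. A hedged, stdlib-only sketch of that dispatch idea (the registry and loader functions here are hypothetical, for illustration only; this is not Spark's internal API):

```python
# Sketch: dispatch on the URI scheme the way a storage-agnostic engine
# resolves paths like hdfs://, s3n://, or file://. Hypothetical registry.
from urllib.parse import urlparse

LOADERS = {
    "file": lambda uri: f"reading local file {uri}",
    "s3n":  lambda uri: f"reading S3 object {uri}",
    "hdfs": lambda uri: f"reading HDFS block(s) {uri}",
}

def open_dataset(uri):
    # Fall back to the local file system when no scheme is given.
    scheme = urlparse(uri).scheme or "file"
    try:
        return LOADERS[scheme](uri)
    except KeyError:
        raise ValueError(f"no storage backend registered for scheme: {scheme}")

print(open_dataset("s3n://bucket/logs/2015/"))
```

Adding a new backend means registering one more loader; nothing in the processing layer changes, which is exactly why HDFS is replaceable.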

75

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions

• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

xPatterns

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

88

Tachyon

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria           YARN                          Mesos
Resource sharing   Yes                           Yes
Written in         Java                          C++
Scheduling         Memory only                   CPU and memory
Running tasks      Unix processes                Linux container groups
Requests           Specific requests and         More generic, but more coding
                   locality preference           for writing frameworks
Maturity           Less mature                   Relatively more mature

91

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
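To give a feel for the chained map/filter/reduce style the native API encourages, here is a word count written in ordinary Python builtins (a hedged analogue only; this is not the RDD API itself):

```python
# Plain-Python analogue of a Spark word count.
# Spark (Scala): lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
from collections import Counter
from functools import reduce

lines = ["spark and hadoop", "spark with hadoop"]

words = (w for line in lines for w in line.split())   # flatMap analogue
pairs = ((w, 1) for w in words)                       # map analogue
# reduceByKey analogue: fold the (word, 1) pairs into per-word counts.
counts = reduce(lambda acc, kv: acc.update([kv[0]]) or acc, pairs, Counter())

print(counts["spark"], counts["hadoop"], counts["and"])  # 2 2 1
```

In Spark the same pipeline shape runs distributed over partitions; the Java 8 lambda support mentioned above is what makes the Java version read almost this tersely.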

92

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
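The "mix and match SQL and imperative code" idea can be sketched with stdlib sqlite3 standing in for Spark SQL (an illustration of the workflow only, not Spark's API; the table and data are made up):

```python
# Declarative step (SQL aggregation) followed by an imperative step
# (ordinary code over the result) - the pattern Spark SQL unifies
# over SchemaRDDs. sqlite3 is used here only so the sketch runs anywhere.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative: SQL expresses the aggregation.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative: post-process the SQL result in plain code.
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 8), ('bob', 7)]
print(top)   # ('ann', 8)
```

In Spark SQL the `rows` step would return an RDD/DataFrame, so the imperative step could itself be a distributed transformation (e.g. feeding MLlib).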

93

Spark MLlib

• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini-batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once (may be       Exactly once
record processed)             duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python
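The table's central distinction - record-at-a-time vs. mini-batches - can be sketched in a few lines of plain Python (a toy model; real Storm/Spark Streaming add distribution, time-based windows, and fault tolerance):

```python
# Record-at-a-time (Storm-like) vs. mini-batch (Spark-Streaming-like)
# processing of the same stream, with the same per-element logic.
stream = [1, 2, 3, 4, 5, 6, 7]

# Storm-like: the operator fires once per record -> lowest latency.
per_record_results = [x * x for x in stream]

# Spark-Streaming-like: cut the stream into fixed batch intervals,
# then run the same computation as a small batch job on each interval.
def mini_batches(records, batch_size):
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

batched_results = []
for batch in mini_batches(stream, batch_size=3):  # one "batch interval"
    batched_results.extend(x * x for x in batch)

print(per_record_results == batched_results)  # True
```

Both paths compute identical answers; the difference is the latency profile (per-record is sub-second, mini-batches add at least one batch interval) and, as the table notes, the batch-framework reuse that mini-batches make possible.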

96

GraphX

• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways

1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


48

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

49

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving; see the open Spark/YARN issues in JIRA: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.

• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

54

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
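A minimal stdlib sketch of what "schema inference" means for a JSON dataset - scan the records and derive a field-to-type mapping, roughly the first step behind Spark SQL's JSON loading (simplified: no nesting, no type-conflict resolution):

```python
# Infer a flat field -> type-name schema from JSON lines.
import json

raw = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

schema = {}
for line in raw:
    for field, value in json.loads(line).items():
        # First type seen wins; records may contribute new fields.
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how the second record adds the `city` field: inferring over the whole dataset is what lets Spark SQL expose a unified schema even when individual records are sparse.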

57

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files.
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
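Why a columnar format helps analytics can be shown with a toy in-memory model of the two layouts (real Parquet adds encoding, compression, and row groups; this only illustrates the scan-cost difference):

```python
# Row-oriented vs. column-oriented layout for the same records.
rows = [  # row layout: all values of one record stored together
    {"user": "ann", "country": "US", "clicks": 3},
    {"user": "bob", "country": "FR", "clicks": 7},
    {"user": "cat", "country": "US", "clicks": 5},
]

columns = {  # columnar layout: all values of one field stored together
    field: [r[field] for r in rows] for field in rows[0]
}

# For SELECT SUM(clicks), the columnar layout touches only one column.
row_scan_cells = sum(len(r) for r in rows)   # 9 cells read (whole rows)
col_scan_cells = len(columns["clicks"])      # 3 cells read (one column)
total_clicks = sum(columns["clicks"])

print(total_clicks, row_scan_cells, col_scan_cells)  # 15 9 3
```

This is the core reason Spark SQL can push column pruning down to Parquet: queries that touch few columns read a fraction of the bytes.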

58

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

65

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."


bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 49: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

49

3. Integration
• There is also NSMC, a Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the first resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
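Conceptually, the schema inference Spark SQL performs boils down to scanning records and merging the fields (and their types) seen across them. A toy pure-Python sketch of that idea, not the actual Spark SQL implementation (flat records only, no type widening or nested structs):

```python
import json

def infer_schema(json_lines):
    """Merge field -> type across all records, the way a schema
    inference pass conceptually works (toy version)."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',  # a new field appearing mid-dataset
]
print(infer_schema(records))
# {'name': 'str', 'age': 'int', 'city': 'str'}
```

In Spark 1.2 the equivalent one-liner is roughly `sqlContext.jsonFile(path)`, which returns a SchemaRDD ready to query.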

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
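The rule of thumb above can be stated as a tiny decision function. This is purely illustrative: the function name and the idea of comparing data size to cluster RAM are a sketch of the heuristic, not a tool from the slides:

```python
def suggest_engine(data_gb, cluster_ram_gb):
    """Toy decision rule for the 'Data >> RAM vs Data << RAM' heuristic."""
    if data_gb > cluster_ram_gb:
        return "Tez"    # stream oriented, mature shuffling, close YARN integration
    return "Spark"      # can cache parsed data in memory

print(suggest_engine(data_gb=500, cluster_ram_gb=128))  # Tez
print(suggest_engine(data_gb=32, cluster_ram_gb=128))   # Spark
```

In practice the decision also depends on job duration, iteration count, and cluster contention, so treat this as a first-pass filter only.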

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
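Spark's storage agnosticism shows up directly in the path URI: the scheme prefix selects the file-system implementation behind the scenes. A small illustrative sketch of that dispatch idea; the scheme names are real, but the mapping table and function here are hypothetical, just for illustration (Spark actually resolves schemes through the pluggable Hadoop FileSystem API):

```python
from urllib.parse import urlparse

# Illustrative mapping of URI schemes to backing stores.
STORES = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "file": "local file system",
    "tachyon": "Tachyon in-memory file system",
}

def backing_store(path):
    """Pick the store from the URI scheme; no scheme means local file."""
    scheme = urlparse(path).scheme or "file"
    return STORES.get(scheme, "unknown")

print(backing_store("s3n://my-bucket/logs/2015/03/12"))  # Amazon S3
print(backing_store("/tmp/local-data.txt"))              # local file system
```

The application code that follows `sc.textFile(path)` is identical whichever store the URI names, which is what makes "bring your own storage" practical.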

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
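All of these deployments are selected the same way from application code: through the master URL handed to Spark (via `--master` or `SparkConf.setMaster`). A hedged sketch of the common master-URL forms for Spark 1.x; the helper function, host names, and default ports here are placeholders, not a Spark API:

```python
def master_url(mode, host="cluster-host", port=7077):
    """Return the master URL string for a deployment mode (sketch)."""
    urls = {
        "local": "local[*]",                     # all cores on one machine
        "standalone": f"spark://{host}:{port}",  # Spark's own cluster manager
        "mesos": f"mesos://{host}:5050",         # Apache Mesos
        "yarn": "yarn-client",                   # Hadoop YARN (Spark 1.x syntax)
    }
    return urls[mode]

print(master_url("standalone"))  # spark://cluster-host:7077
```

Because the cluster manager is chosen purely by this string, the rest of the application does not change between deployments.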

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

           | Hadoop Ecosystem | Spark Ecosystem
Components | HDFS             | Tachyon
           | YARN             | Mesos
Tools      | Pig              | Spark native API
           | Hive             | Spark SQL
           | Mahout           | MLlib
           | Storm            | Spark Streaming
           | Giraph           | GraphX
           | HUE              | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                       | Mesos
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and memory
Running tasks    | Unix processes                             | Linux container groups
Requests         | Specific requests and locality preference  | More generic, but more coding to write frameworks
Maturity         | Less mature                                | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
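The flavor of the native API is a chain of functional transformations over a dataset. Here is a pure-Python analogue of the classic word count, shown only to illustrate the style; in Spark the same chain would run distributed over an RDD (roughly `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`):

```python
from collections import Counter

lines = ["spark and hadoop", "spark or hadoop"]

# flatMap -> map -> reduceByKey, expressed with plain Python builtins
words = (w for line in lines for w in line.split())   # flatMap: lines to words
counts = Counter(words)                               # map + reduceByKey: count each word

print(counts["spark"])   # 2
print(counts["hadoop"])  # 2
```

The point of the Scala, Java 8, and Python APIs is that the distributed version reads almost exactly like this local one.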

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
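The "mini batches" row is the crux of the latency difference: Spark Streaming groups incoming records into small time-sliced batches and runs a job per batch, instead of processing each record the moment it arrives. A toy simulation of that slicing; the timestamps and interval below are made up, and this is a sketch of the idea, not Spark Streaming's implementation:

```python
def to_mini_batches(records, interval):
    """Group (timestamp_ms, value) records into batches of `interval`
    milliseconds, like a DStream's time slicing (toy version)."""
    batches = {}
    for ts, value in records:
        batch_start = ts - (ts % interval)   # which slice this record falls in
        batches.setdefault(batch_start, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(300, "a"), (900, "b"), (1200, "c"), (2800, "d")]
print(to_mini_batches(stream, interval=1000))  # [['a', 'b'], ['c'], ['d']]
```

A record arriving early in a slice still waits until the slice closes before it is processed, which is why the table lists "few seconds" of latency in exchange for exactly-once semantics and the Core Spark API.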

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More Q&A

50

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

51

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

52

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs.
  • Use BI tools to query in-memory data in Spark.
  • Embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open-source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem / Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

             Hadoop ecosystem    Spark ecosystem
Component    HDFS                Tachyon
             YARN                Mesos
Tools        Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria           YARN                            Mesos
Resource sharing   Yes                             Yes
Written in         Java                            C++
Scheduling         Memory only                     CPU and memory
Running tasks      Unix processes                  Linux container groups
Requests           Specific requests and           More generic, but more coding
                   locality preference             for writing frameworks
Maturity           Less mature                     Relatively more mature

91

Page 51: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

51

3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A

bull Some issues are critical ones bull Running Spark on YARN

httpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

52

3 Integrationbull Spark SQL provides built in support for Hive tables

bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
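Columnar formats like Parquet store each field's values contiguously, so a query that touches one column reads only that column. A tiny pure-Python illustration of the row-to-column pivot (illustrative only; Parquet adds encodings, compression, and row groups on top):

```python
rows = [
    {"user": "alice", "clicks": 3},
    {"user": "bob",   "clicks": 5},
    {"user": "carol", "clicks": 2},
]

def to_columns(rows):
    """Pivot row-oriented records into column-oriented arrays."""
    return {field: [r[field] for r in rows] for field in rows[0]}

cols = to_columns(rows)
# An aggregate over one column scans only that array, not whole rows:
print(sum(cols["clicks"]))  # 10
```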

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:

Hadoop ecosystem Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
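The storage-agnosticism above comes down to dispatching on the URI scheme rather than assuming HDFS. A toy sketch of that dispatch (hypothetical `read_lines` helper, with an in-memory dict standing in for a Tachyon-like store; Spark actually delegates this to Hadoop's pluggable FileSystem API):

```python
def read_lines(uri):
    """Pick a storage backend from the URI scheme, as Spark does when you
    pass s3://, hdfs://, file://, etc. to its input methods."""
    scheme, _, path = uri.partition("://")
    readers = {
        "memory": lambda p: IN_MEMORY[p],                      # in-memory store
        "file":   lambda p: open(p).read().splitlines(),       # local disk
    }
    return readers[scheme](path)

IN_MEMORY = {"events": ["a", "b", "a"]}
print(read_lines("memory://events"))  # ['a', 'b', 'a']
```

The processing code never changes; only the URI does.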

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos; September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop ecosystem    Spark ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                            Mesos
Resource sharing  Yes                             Yes
Written in        Java                            C++
Scheduling        Memory only                     CPU and memory
Running tasks     Unix processes                  Linux container groups
Requests          Specific requests and           More generic, but more coding
                  locality preference             for writing frameworks
Maturity          Less mature                     Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions make code nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
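The native API's hallmark is chaining transformations with lambdas. A tiny eager stand-in (hypothetical `MiniRDD` class; real RDDs are lazy, partitioned, and distributed) showing the shape of that chaining:

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy stand-in mimicking the RDD chaining style; it evaluates eagerly
    on a local list instead of lazily on a cluster."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return MiniRDD(f(x) for x in self.data)
    def filter(self, f):
        return MiniRDD(x for x in self.data if f(x))
    def reduce(self, f):
        return _reduce(f, self.data)

# Sum of squares of even numbers below 10, in the familiar chained style:
result = (MiniRDD(range(10))
          .filter(lambda x: x % 2 == 0)   # 0, 2, 4, 6, 8
          .map(lambda x: x * x)           # 0, 4, 16, 36, 64
          .reduce(lambda a, b: a + b))
print(result)  # 120
```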

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
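Mixing declarative SQL with imperative post-processing, as described above, can be sketched with the stdlib's sqlite3 playing the role of the SQL engine (an analogy only, not Spark SQL's API; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, n INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("home", 40), ("docs", 25), ("blog", 10)])

# Declarative step: SQL handles filtering and ordering.
rows = conn.execute(
    "SELECT page, n FROM hits WHERE n > 20 ORDER BY n DESC").fetchall()

# Imperative step: ordinary code post-processes the result set,
# here computing each page's share of the filtered traffic.
total = sum(n for _, n in rows)
share = {page: n / total for page, n in rows}
print(share)
```

In Spark SQL the same pattern applies, with the SQL result coming back as an RDD (later a DataFrame) that regular Spark code can keep transforming.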

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once (may be       Exactly once
record processed)             duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python
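The first two comparison rows can be made concrete: record-at-a-time processing handles each event the moment it arrives, while micro-batching groups arrivals into intervals and trades a few seconds of latency for reuse of the batch engine. A toy sketch (illustrative values only):

```python
events = list(range(6))
seen = []

# Storm-style: process each record the moment it arrives (lowest latency).
for e in events:
    seen.append(("record", e))

# Spark Streaming-style: group arrivals into fixed intervals and hand each
# mini batch to the same engine that runs regular batch jobs.
batch_size = 3
for i in range(0, len(events), batch_size):
    seen.append(("batch", events[i:i + batch_size]))

print(seen[-2:])  # [('batch', [0, 1, 2]), ('batch', [3, 4, 5])]
```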

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 52: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

52

3 Integrationbull Spark SQL provides built in support for Hive tables

bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib

53

3 Integration bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline

Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

54

3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml

57

3 Integrationbull Apache Parquet is a columnar storage format available

to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

58

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem

bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from

MapReduce to Sparkbull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
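The storage-agnostic idea behind the list above can be sketched in plain Python: a path's URI scheme (hdfs://, s3n://, tachyon://, file://, ...) selects a pluggable file-system backend. The registry below is a hypothetical stand-in, not Spark's actual resolver:

```python
from urllib.parse import urlparse

# Hypothetical registry mapping URI schemes to storage backends,
# mimicking how a path's scheme selects a file-system implementation.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "maprfs": "MapR-FS",
    "tachyon": "Tachyon (in-memory)",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def resolve_backend(path):
    # No scheme (a bare path) falls back to the local file system.
    scheme = urlparse(path).scheme or "file"
    if scheme not in BACKENDS:
        raise ValueError(f"no file system registered for scheme {scheme!r}")
    return BACKENDS[scheme]

print(resolve_backend("s3n://bucket/logs/2015/"))   # Amazon S3
print(resolve_backend("/tmp/data.txt"))             # local file system
```

Because the compute code never names a concrete storage system, swapping HDFS for any of the alternatives above is a matter of changing the path.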

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Components   HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria           YARN                             Mesos
Resource sharing   Yes                              Yes
Written in         Java                             C++
Scheduling         Memory only                      CPU and memory
Running tasks      Unix processes                   Linux container groups
Requests           Specific requests and            More generic, but more coding
                   locality preference              for writing frameworks
Maturity           Less mature                      Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
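The conciseness claim is easiest to see with the classic word count. Below is a plain-Python sketch of the same flatMap/map/reduceByKey pipeline style (ordinary Python functions, not the actual RDD API):

```python
from collections import Counter
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap -> map -> reduceByKey, written with ordinary Python constructs.
words = (w for line in lines for w in line.split())   # flatMap: line -> words
pairs = ((w, 1) for w in words)                       # map: word -> (word, 1)

def reduce_by_key(acc, pair):                         # reduceByKey: sum counts
    word, n = pair
    acc[word] += n
    return acc

counts = reduce(reduce_by_key, pairs, Counter())

print(counts["spark"])   # 2
print(counts["or"])      # 1
```

In Spark the same pipeline is a short chain of transformations on an RDD, and Java 8 lambdas bring the Java version close to this level of brevity.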

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
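To illustrate the "mix SQL with imperative code" idea without a Spark cluster, here is an analogous sketch using Python's stdlib sqlite3 (the table and data are made up; Spark SQL would run the query distributed and hand back a SchemaRDD/DataFrame instead of rows):

```python
import sqlite3

# Hypothetical data standing in for a registered Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL ...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ... then an imperative step over the result, as one would continue
# processing a query result with the RDD/DataFrame API in Spark.
heavy_users = [user for user, total in rows if total > 5]
print(heavy_users)   # ['ann', 'bob']
```

The point is the hand-off: the set-oriented part stays in SQL, and the result flows directly into general-purpose code in the same program.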

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                      Storm                        Spark Streaming
Processing model              Record at a time             Mini batches
Latency                       Sub-second                   Few seconds
Fault tolerance               At least once                Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available                Core Spark API
Supported languages           Any programming language     Scala, Java, Python
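The processing-model difference can be sketched in plain Python: record-at-a-time hands each event to the handler as it arrives, while mini-batching (Spark Streaming's model) slices the stream into short intervals and processes each slice with a batch engine. This is a toy illustration, not either system's API:

```python
def record_at_a_time(events, handle):
    """Storm-style: process each record immediately (lowest latency)."""
    return [handle(e) for e in events]

def mini_batches(events, batch_interval, handle_batch):
    """Spark-Streaming-style: group events into batches by arrival time,
    then process each batch as a small batch job."""
    batches = {}
    for timestamp, payload in events:
        batches.setdefault(timestamp // batch_interval, []).append(payload)
    return [handle_batch(batches[k]) for k in sorted(batches)]

events = [(0, 1), (1, 2), (3, 4), (4, 8)]   # (arrival_second, value)
print(mini_batches(events, batch_interval=2, handle_batch=sum))   # [3, 4, 8]
```

Batching is what buys Spark Streaming its exactly-once semantics and reuse of the core batch API, at the cost of a few seconds of latency per interval.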

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 53: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

53

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
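The core abstraction Spark Streaming consumes from Kafka - an ordered, offset-addressed log per topic-partition - can be sketched in a few lines of plain Python (a toy stand-in, not the Kafka protocol or client API):

```python
class ToyLog:
    """Tiny stand-in for one Kafka topic-partition: an append-only log
    that consumers read by offset, so each consumer tracks its own position."""

    def __init__(self):
        self.messages = []

    def produce(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1           # offset of the new message

    def consume(self, offset, max_msgs=10):
        batch = self.messages[offset:offset + max_msgs]
        return batch, offset + len(batch)       # next offset to poll from

topic = ToyLog()
for word in ["spark", "streaming", "reads", "kafka"]:
    topic.produce(word)

batch, next_offset = topic.consume(offset=0, max_msgs=2)
print(batch)         # ['spark', 'streaming']
print(next_offset)   # 2
```

Offset tracking is what lets a streaming consumer resume after failure without losing or re-reading more data than necessary, which is central to the Spark-Kafka integration approaches the guide above describes.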

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
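The automatic schema inference described above can be sketched in plain Python: scan the JSON records and record a type for each field seen. This is a toy version of what Spark SQL does over the dataset, not its actual algorithm:

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of fields across all records, with the
    Python type first observed for each field (real Spark SQL also merges
    conflicting types and handles nested structures)."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]
print(infer_schema(records))
# {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that the inferred schema is the union of fields: records missing a field simply contribute nulls for it, which is how schema-on-read copes with ragged JSON.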

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
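Why columnar? A plain-Python sketch of the row-to-column pivot at the heart of formats like Parquet (toy data, none of Parquet's encoding or compression): a query that touches one column reads only that column's values.

```python
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 7, "country": "FR"},
]

# Row layout -> column layout (the pivot a columnar writer performs).
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A scan of one column now touches only that column's contiguous values,
# which is what enables column pruning and per-column compression.
print(columns["clicks"])        # [3, 7]
print(sum(columns["clicks"]))   # 10
```

In a row-oriented file, computing that same sum would have forced a read of every field of every record.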

58

3. Integration
• The Spark SQL Avro library is for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem / Spark ecosystem

65

4. Complementarity: + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: + (references)
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs Mesos

  Criteria          YARN                                        Mesos
  Resource sharing  Yes                                         Yes
  Written in        Java                                        C++
  Scheduling        Memory only                                 CPU and memory
  Running tasks     Unix processes                              Linux container groups
  Requests          Specific requests and locality preference   More generic, but more coding to write frameworks
  Maturity          Less mature                                 Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
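The functional style of the native API can be pictured without a cluster. This is a plain-Python sketch (not Spark itself, and no Spark installation assumed) of the map/filter/reduce chaining that the API exposes:

```python
from functools import reduce

# Plain-Python stand-in for an RDD pipeline: Spark's native API chains
# transformations (map/flatMap, filter) and a final action (reduce).
lines = ["spark and hadoop", "spark streaming", "hadoop yarn"]

# flatMap-like step: split each line into words and flatten
words = [w for line in lines for w in line.split()]

# filter: keep only the words equal to "spark"
spark_words = list(filter(lambda w: w == "spark", words))

# reduce (action): count them by summing 1 per word, a word count in miniature
count = reduce(lambda acc, _: acc + 1, spark_words, 0)

print(count)  # prints 2
```

In real Spark the same shape appears as `rdd.flatMap(...).filter(...).reduce(...)`, with the work distributed across the cluster instead of a single list.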

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
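The "mix and match SQL with imperative code" pattern can be sketched with Python's built-in sqlite3 module standing in for a SQL engine. This is not Spark SQL; it only illustrates the declarative-then-imperative workflow the slide describes:

```python
import sqlite3

# In-memory SQL table standing in for a Hive/Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code, the way
# Spark SQL lets you hand a DataFrame result to RDD-style logic.
top = {user: total for user, total in rows if total > 4}
print(top)  # prints {'ann': 5, 'bob': 7}
```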

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

  Criteria                                   Storm                               Spark Streaming
  Processing model                           Record at a time                    Mini-batches
  Latency                                    Sub-second                          Few seconds
  Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
  Batch framework integration                Not available                       Core Spark API
  Supported languages                        Any programming language            Scala, Java, Python
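The two processing models compared above can be sketched in a few lines of plain Python (this is neither Storm nor Spark, just the record-at-a-time vs. mini-batch idea):

```python
# Conceptual sketch: per-record handling vs. grouping records into
# mini-batches before processing, as in the comparison table above.
stream = list(range(10))  # stand-in for an incoming event stream

# Record-at-a-time (Storm-style): the handler fires once per record.
per_record_results = [event * 2 for event in stream]

# Mini-batches (Spark Streaming-style): records are buffered into fixed
# intervals and each batch is processed as one unit.
batch_size = 4
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
per_batch_results = [sum(b) for b in batches]

print(len(stream), len(batches))  # 10 records, but only 3 batch invocations
```

Fewer, larger invocations are what give Spark Streaming its throughput and its few-seconds latency floor; per-record firing is what gives Storm sub-second latency.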

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
Page 54: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

54

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
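Kafka's core abstraction, an append-only log that consumers read by offset, is easy to model. The following is a toy plain-Python model (not the Kafka client API; all names here are made up for illustration) of why the Spark Streaming integration is natural: a receiver just pulls the next slice of the log each batch interval:

```python
# Toy model of Kafka's abstraction: an append-only log per topic,
# with each consumer group tracking its own read offset.
log = {"clicks": []}          # topic -> append-only list of messages
offsets = {"spark-job": 0}    # consumer group -> next offset to read

def produce(topic, message):
    log[topic].append(message)

def poll(topic, group, max_records=2):
    """Return up to max_records unread messages and advance the offset,
    mimicking how a streaming receiver pulls one batch per interval."""
    start = offsets[group]
    batch = log[topic][start:start + max_records]
    offsets[group] += len(batch)
    return batch

for m in ["a", "b", "c"]:
    produce("clicks", m)

first = poll("clicks", "spark-job")   # ["a", "b"]
second = poll("clicks", "spark-job")  # ["c"]
print(first, second)
```

Because the consumer controls its offset, a restarted job can re-read from its last committed position, which is the basis of the fault-tolerance guarantees discussed later in the deck.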

55

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): Pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

56

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
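Schema inference itself is easy to picture: scan the records and union the field names and types seen. A toy version in plain Python (Spark SQL's real inference additionally handles nesting, nulls, and type widening):

```python
import json

# Toy schema inference: union the field names and value types observed
# across a JSON dataset, roughly what Spark SQL does when loading JSON
# into a SchemaRDD/DataFrame without any DDL.
records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print(schema)  # prints {'name': {'str'}, 'age': {'int'}, 'city': {'str'}}
```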

57

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
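The benefit of a columnar format is concrete: a query touching one column reads only that column's values. A minimal sketch of row vs. column layout in plain Python (illustrative only, not the actual Parquet encoding):

```python
# Row layout vs. columnar layout (illustrative only, not real Parquet).
rows = [("ann", 34, "LA"), ("bob", 40, "NY"), ("cat", 28, "SF")]

# Columnar: store each column contiguously, as Parquet does on disk.
names, ages, cities = (list(col) for col in zip(*rows))
columns = {"name": names, "age": ages, "city": cities}

# A query like SELECT AVG(age) only has to scan the "age" column,
# instead of deserializing every full row.
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # prints 34.0
```

Contiguous same-typed values are also what make Parquet's compression and encoding effective.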

58

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

59

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
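The Data << RAM point is essentially about caching parsed data. A toy plain-Python illustration (conceptual only, not Spark's `cache()`) of why keeping the parsed form in memory pays off across repeated queries:

```python
# Conceptual sketch of the in-memory caching advantage: parse the raw data
# once, keep the parsed form in RAM, and let every later query skip parsing.
raw = ["1,ann", "2,bob", "3,cat"]
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1           # stands in for expensive deserialization
    ident, name = line.split(",")
    return int(ident), name

cached = [parse(line) for line in raw]   # like rdd.cache(): parse once

# Two "queries" over the same dataset reuse the cached parsed records.
query1 = [name for _, name in cached]
query2 = sum(ident for ident, _ in cached)

print(parse_calls)  # prints 3: each line parsed once, not once per query
```

When the dataset no longer fits in memory, this advantage shrinks, which is exactly the Data >> RAM case in the bullet above.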

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 55: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

55

3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

56

3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data

bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml

57

3 Integrationbull Apache Parquet is a columnar storage format available

to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

58

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem

bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

65

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: the Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions that get code nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                  Storm                               Spark Streaming
Processing model                          Record at a time                    Mini-batches
Latency                                   Sub-second                          Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)   Exactly once
Batch framework integration               Not available                       Core Spark API
Supported languages                       Any programming language            Scala, Java, Python

96
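The first two rows of the table come from the processing model itself. As a toy illustration (plain Python, no Storm or Spark required; all names are made up for the example), record-at-a-time processing handles each event as it arrives, while mini-batching groups events and processes each group as one unit, so latency is bounded below by the batch interval:

```python
# Toy illustration of the two stream-processing models compared above.

def record_at_a_time(events, handle):
    # Storm-style: one handler invocation per event -> sub-second latency per record.
    return [handle(e) for e in events]

def mini_batches(events, batch_size):
    # Spark Streaming-style: events are grouped into small batches and each
    # batch is processed as one unit. (Spark Streaming groups by time
    # interval rather than count, but the latency effect is the same.)
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

events = list(range(10))
out = record_at_a_time(events, lambda e: e * 2)
batches = mini_batches(events, batch_size=3)
print(len(out), len(batches))  # 10 events -> 10 results, 4 mini-batches
```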

GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More Q&A
Page 56: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

56

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at the JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

57
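The schema inference described above can be sketched in a few lines of spark-shell Scala (1.2-era API). This is an illustrative sketch: `people.json` and the column names are placeholders, not from the deck.

```scala
// Illustrative sketch only: point Spark SQL at JSON, no DDL needed.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: the spark-shell's SparkContext

// The schema is inferred automatically from the JSON records.
val people = sqlContext.jsonFile("people.json")   // placeholder path
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println)
```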

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

58
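The three Parquet capabilities listed above can be sketched in a short spark-shell session (Scala, 1.2-era API). This is an illustrative sketch: the paths, table name, and columns are placeholders.

```scala
// Illustrative sketch only: Parquet in and out of Spark SQL.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: the spark-shell's SparkContext

// Import relational data from a Parquet file...
val users = sqlContext.parquetFile("users.parquet")  // placeholder path
users.registerTempTable("users")

// ...run SQL queries over the imported data...
val active = sqlContext.sql("SELECT name FROM users WHERE active = true")

// ...and write a SchemaRDD back out as Parquet.
active.saveAsParquetFile("active-users.parquet")     // placeholder path
```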

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem

bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from

MapReduce to Sparkbull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

4. Alternatives

Hadoop Ecosystem    Spark Ecosystem
HDFS                Tachyon
YARN                Mesos
Pig                 Spark native API
Hive                Spark SQL
Mahout              MLlib
Storm               Spark Streaming
Giraph              GraphX
HUE                 Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs. Mesos

Criteria           YARN                             Mesos
Resource sharing   Yes                              Yes
Written in         Java                             C++
Scheduling         Memory only                      CPU and Memory
Running tasks      Unix processes                   Linux Container groups
Requests           Specific requests and            More generic, but more coding
                   locality preference              for writing frameworks
Maturity           Less mature                      Relatively more mature

Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
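The lambda style shared across these APIs is easiest to see in a word count, the canonical Spark example. This sketch uses only plain Python built-ins to mirror the shape of the RDD chain (flatMap, then map, then reduceByKey); with a real SparkContext the chain would look the same, only distributed.

```python
from collections import Counter
from functools import reduce

lines = ["spark or hadoop", "hadoop and spark"]

# flatMap: one line -> many words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: fold (word, 1) pairs into per-word counts
counts = reduce(
    lambda acc, w: acc + Counter({w: 1}),  # merge one pair into the accumulator
    words,
    Counter(),
)

print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In PySpark the equivalent would be `rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)` — the same three lambdas, applied across the cluster.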

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
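That mix of declarative SQL and programmatic post-processing is the core idea. In miniature, with the standard library's sqlite3 standing in for Spark SQL's engine (in Spark the query result would be an RDD/DataFrame flowing into further transformations, not a list of rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

# Declarative part: aggregate with SQL...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...imperative part: keep transforming the result in ordinary code,
# the way a Spark SQL result flows into RDD/DataFrame operations.
top = {user: total for user, total in rows if total >= 5}
print(top)  # {'ann': 5, 'bob': 5}
```

The table name, column names, and threshold here are made up for illustration; the takeaway is the back-and-forth between the SQL step and ordinary code, which Spark SQL makes seamless.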

Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once (may be       Exactly once
record processed)             duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python
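The "record at a time" vs. "mini batches" distinction can be made concrete with a toy simulation in plain Python (no Storm or Spark involved): the stream processor sees each record alone, while the mini-batch processor sees a small list per interval — which is exactly what makes batch-style (Core Spark API) operations possible on a stream.

```python
from itertools import islice

stream = iter(range(10))  # stand-in for an unbounded event stream

# Storm-style: handle each record as it arrives.
def per_record(records, handle):
    return [handle(r) for r in records]

# Spark Streaming-style: cut the stream into mini batches and
# run a batch computation (here: sum) over each one.
def mini_batches(records, batch_size, handle_batch):
    out = []
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        out.append(handle_batch(batch))
    return out

print(per_record(range(4), lambda r: r * 2))   # [0, 2, 4, 6]
print(mini_batches(stream, 3, sum))            # [3, 12, 21, 9]
```

Batching in time (Spark Streaming's batch interval) rather than by count works the same way; the trade is the few seconds of added latency shown in the table above.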

GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 5 Key Takeaways
  • V More Q&A

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem | Spark ecosystem

4. Complementarity: + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

4. Complementarity: +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

4. Complementarity: + References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

4. Complementarity: +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

4. Complementarity: +
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
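The "Data << RAM" point is the classic cache trade-off: pay the parse cost once, then reuse the in-memory result on every later pass. A tiny stdlib illustration of why `rdd.cache()` helps iterative jobs (no Spark involved; `parse` is a made-up stand-in for an expensive deserialization step):

```python
# Count how often the expensive step runs, with and without a cache.
calls = {"n": 0}

def parse(record: str) -> int:
    """Stand-in for expensive parsing/deserialization of one record."""
    calls["n"] += 1
    return int(record)

data = ["1", "2", "3"]

# Without caching: every pass over the data re-parses it,
# like re-computing an uncached RDD in each iteration.
for _ in range(3):
    total = sum(parse(r) for r in data)
uncached_calls = calls["n"]          # 9 parses for 3 passes

# With caching: parse once, then iterate over the in-memory result,
# which is what rdd.cache() buys an iterative Spark job.
calls["n"] = 0
cached = [parse(r) for r in data]    # materialize once
for _ in range(3):
    total = sum(cached)
cached_calls = calls["n"]            # 3 parses total

print(uncached_calls, cached_calls)  # 9 3
```

The benefit evaporates once the cached working set no longer fits in cluster memory — which is exactly the "Data >> RAM" case above.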

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
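What makes this possible is that Spark resolves storage through Hadoop's FileSystem abstraction, keyed on the URI scheme of the path handed to it. A stdlib sketch of that dispatch idea — the scheme table below is illustrative, not Spark's actual registry:

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to backing store; Spark/Hadoop
# keep a similar registry of FileSystem implementations.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory FS",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def backend_for(path: str) -> str:
    """Pick the storage backend implied by a path's URI scheme."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

print(backend_for("hdfs://namenode:8020/logs"))      # Hadoop Distributed File System
print(backend_for("tachyon://master:19998/tmp"))     # Tachyon in-memory FS
print(backend_for("/local/data.txt"))                # local file system
```

Swapping storage is then a matter of changing the path prefix in `sc.textFile(...)`; the job logic is untouched, which is what "bring your own storage" means in practice.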

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 58: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

58

3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark

SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem

bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem | Spark ecosystem]

65

4. Complementarity: Spark + Tachyon + HDFS

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: YARN + Mesos (references)

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
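The rule of thumb above (Tez when the data is much bigger than cluster RAM, Spark when the working set fits in memory) can be sketched as a toy decision function. This is purely illustrative; the function and the single-ratio threshold are assumptions made for the example, not anything Spark or Tez actually exposes:

```python
def pick_engine(data_gb: float, cluster_ram_gb: float) -> str:
    """Toy heuristic from the slide: Tez for data >> RAM, Spark for data << RAM.

    Illustrative only; a real engine choice also depends on shuffle
    behavior, caching patterns, and cluster load, not one ratio.
    """
    if data_gb > cluster_ram_gb:
        return "tez"    # stream-oriented, mature shuffling implementation
    return "spark"      # in-memory caching of parsed data pays off
```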

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
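In practice, Spark's storage agnosticism shows up in the path's URI scheme: the same program can read hdfs://, s3n://, tachyon://, or plain local paths. A minimal pure-Python sketch of that scheme-based dispatch idea (the lookup table below is illustrative; Spark actually delegates resolution to Hadoop's FileSystem API):

```python
from urllib.parse import urlparse

# Illustrative scheme -> backend table; only a sketch of the idea,
# not Spark's real resolver.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory FS",
    "file": "local file system",
    "": "local file system",  # bare paths default to the local FS
}

def backend_for(path: str) -> str:
    """Pick a storage backend from a path's URI scheme."""
    scheme = urlparse(path).scheme
    return BACKENDS.get(scheme, "unknown")
```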

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
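The deployment choice surfaces as the --master URL handed to spark-submit or SparkContext: local[*], spark://host:7077, mesos://host:5050, or yarn-client/yarn-cluster. A small illustrative classifier of those strings (a pure-Python sketch, not Spark's actual parser):

```python
def cluster_manager(master: str) -> str:
    """Classify a Spark master URL; illustrative, not Spark's real logic."""
    if master.startswith("local"):
        return "local"        # e.g. local, local[4], local[*]
    if master.startswith("spark://"):
        return "standalone"   # Spark's built-in standalone manager
    if master.startswith("mesos://"):
        return "mesos"
    if master.startswith("yarn"):
        return "yarn"         # yarn-client / yarn-cluster
    raise ValueError("unknown master URL: " + master)
```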

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria           YARN                             Mesos
Resource sharing   Yes                              Yes
Written in         Java                             C++
Scheduling         Memory only                      CPU and memory
Running tasks      Unix processes                   Linux container groups
Requests           Specific requests and            More generic, but more coding
                   locality preference              for writing frameworks
Maturity           Less mature                      Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
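To give the native API some flavor: the classic word count is a short chain of functional transformations; in PySpark it reads roughly sc.textFile(f).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add). The sketch below reproduces the same flatMap/map/reduceByKey shape in plain Python so it runs without a cluster (illustrative only):

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    """Plain-Python stand-in for Spark's flatMap -> map -> reduceByKey chain."""
    words = chain.from_iterable(line.split() for line in lines)  # flatMap
    # Counter folds the implicit (word, 1) pairs, like reduceByKey(add)
    return dict(Counter(words))
```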

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
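Part of what makes semi-structured sources like JSON ingestible is that a schema can be inferred by scanning records, which is the idea behind Spark SQL's JSON support. A toy pure-Python illustration of that inference step (the record fields and type names below are invented for the example):

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of all fields, with observed type names."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            # Record every Python type seen for this field
            schema.setdefault(key, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}
```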

93

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini-batches
Latency                       Sub-second                 Few seconds
Fault tolerance               At least once              Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python
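The processing-model row is the heart of the comparison: Storm hands each record to the topology as it arrives, while Spark Streaming buffers the input into mini-batches and runs a small Spark job per batch. A pure-Python sketch of the batching idea (real Spark Streaming cuts batches by a time interval; a record count is used here only to keep the sketch deterministic):

```python
def mini_batches(stream, batch_size):
    """Group a record stream into mini-batches, Spark-Streaming style.

    Illustrative: Spark Streaming cuts batches by batch interval (time),
    not by count; one Spark job would then run per yielded batch.
    """
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the partial final batch
```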

96

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 59: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

59

3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

60

3 Integration bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from

MapReduce to Sparkbull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr. http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

             Hadoop Ecosystem    Spark Ecosystem
Components   HDFS                Tachyon
             YARN                Mesos
Tools        Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
   • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
   • Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and Memory
Running tasks      Unix processes                              Linux Container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature

91

Spark Native API
• Spark's native API is available in Scala, Java and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014. http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93
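Spark SQL itself needs a Spark cluster, but the mix-and-match pattern the slide describes – a declarative SQL step feeding an imperative post-processing step – can be sketched with the stdlib sqlite3 module as a stand-in for Spark SQL (the events table and its rows are invented for illustration):

```python
import sqlite3

# Hypothetical event data, standing in for a Hive table or a JSON source.
rows = [("click", 3), ("view", 10), ("click", 7), ("view", 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Declarative step: a SQL aggregation, as you would write against Spark SQL.
totals = conn.execute(
    "SELECT kind, SUM(n) FROM events GROUP BY kind ORDER BY kind"
).fetchall()

# Imperative step: arbitrary programmatic post-processing of the result set.
report = {kind: total for kind, total in totals}
print(report)  # {'click': 10, 'view': 15}
```

In Spark the two steps would share one engine and one set of cached data, which is the point of the unification.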

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini batches
Latency                       Sub-second                  Few seconds
Fault tolerance               At least once               Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python

96
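The first row of the table – record-at-a-time versus mini-batches – can be sketched in plain Python with no Storm or Spark required; `handle` stands in for whatever per-record or per-batch processing the topology or DStream would do:

```python
from itertools import islice

def record_at_a_time(stream, handle):
    """Storm-style: each record is handed off as soon as it arrives."""
    for record in stream:
        handle([record])          # latency ~ one record

def mini_batches(stream, handle, batch_size=3):
    """Spark Streaming-style: records are grouped into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        handle(batch)             # latency ~ one batch interval

calls = []
mini_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The batching is what lets Spark Streaming reuse the core batch API (and its exactly-once semantics), at the cost of a few seconds of latency.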

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
Page 60: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

60

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch (elasticsearch-hadoop) was added in version 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

61

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
   • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
   • Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

62

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark. October 14, 2014. http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack. January 2015. http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster. https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70
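The Data >> RAM vs Data << RAM rule of thumb above can be caricatured as a one-line decision function. This is a deliberate oversimplification for illustration only: real engine choice also depends on workload shape, shuffle behavior, and iteration count.

```python
def pick_engine(working_set_gb, cluster_ram_gb):
    # Toy heuristic from the slide: Spark shines when the parsed working set
    # can be cached in cluster memory; otherwise Tez's stream-oriented,
    # disk-friendly shuffle may cope better.
    return "Spark" if working_set_gb < cluster_ram_gb else "Tez"

print(pick_engine(200, 512))    # data fits in memory
print(pick_engine(5000, 512))   # data is much bigger than cluster RAM
```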

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine. By Peter Voss, September 30, 2014. http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption. February 12, 2015. http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms. February 23, 2015. http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop. March 9, 2015. http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a Non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75
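Spark achieves this storage independence by delegating I/O to the Hadoop FileSystem API, which dispatches on the URI scheme (hdfs://, s3n://, file://, tachyon://, …), so swapping storage is usually just a different input URI. A toy sketch of that scheme-based dispatch idea (the reader registry and its messages are invented for illustration):

```python
from urllib.parse import urlparse

# Hypothetical readers keyed by URI scheme; Spark gets the real equivalents
# for free from Hadoop's pluggable FileSystem implementations.
READERS = {
    "file": lambda p: f"local file {p}",
    "hdfs": lambda p: f"HDFS path {p}",
    "s3":   lambda p: f"S3 object {p}",
}

def open_uri(uri):
    parsed = urlparse(uri)
    scheme = parsed.scheme or "file"   # bare paths default to local files
    if scheme not in READERS:
        raise ValueError(f"no reader registered for scheme {scheme!r}")
    return READERS[scheme](parsed.path)

print(open_uri("s3://bucket/logs/2015-03-12.log"))  # S3 object /logs/2015-03-12.log
```

Adding a new storage backend is then a registry entry, not a change to the analytics code.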

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS. July 11, 2012. https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage. March 9, 2015. http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78
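Concretely, the deployment choice is mostly a matter of the --master URL passed to spark-submit. The host names and ports below are placeholders, and the syntax is the Spark 1.x form contemporary with this talk:

```shell
# The same application, pointed at different cluster managers.
# Only the YARN mode requires a Hadoop installation.
spark-submit --master local[4]            app.py   # single machine, 4 cores
spark-submit --master spark://host:7077   app.py   # Spark standalone cluster
spark-submit --master mesos://host:5050   app.py   # Apache Mesos
spark-submit --master yarn-cluster        app.py   # Hadoop YARN
```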

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 61: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

61

3 Integration

bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from

MapReduce to Sparkbull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

62

3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

63

III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or… HDFS caching).

• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
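Because Spark is file-system agnostic, switching among these storage back ends is mostly a matter of changing the input URI. A minimal sketch: the bucket, host, and path names below are hypothetical placeholders, and a live SparkContext `sc` is assumed for the Spark call, so it is shown in comments rather than executed.

```python
# Sketch: the same Spark logic against different storage back ends.
# Only the URI scheme changes; the processing code stays identical.

def count_errors(lines):
    """Pure helper: count lines containing 'ERROR' (works on any iterable)."""
    return sum(1 for line in lines if "ERROR" in line)

def count_errors_on(sc, uri):
    """The same count as a Spark job, given a live SparkContext."""
    return sc.textFile(uri).filter(lambda line: "ERROR" in line).count()

# Hypothetical invocations: identical logic, different file systems.
#   count_errors_on(sc, "hdfs://namenode:8020/logs/app.log")    # HDFS
#   count_errors_on(sc, "s3n://my-bucket/logs/app.log")         # Amazon S3
#   count_errors_on(sc, "tachyon://master:19998/logs/app.log")  # Tachyon
#   count_errors_on(sc, "file:///var/log/app.log")              # local FS
```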

75

1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System, Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
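In practice, the deployment choice shows up as the master URL handed to spark-submit (or set in conf/spark-defaults.conf); the application code itself does not change. A hedged sketch in Spark 1.x syntax; the host names and ports are hypothetical placeholders:

```shell
# Hypothetical master URLs, one per deployment mode (Spark 1.x syntax).
# Each would be passed as: spark-submit --master "$MASTER" my_app.py
MASTER_LOCAL="local[4]"               # local mode, 4 worker threads
MASTER_STANDALONE="spark://host:7077" # Spark standalone cluster manager
MASTER_MESOS="mesos://host:5050"      # Apache Mesos
MASTER_YARN="yarn-cluster"            # Hadoop YARN, shown for contrast
```

That the same application runs unmodified under any of these masters is what "cluster-infrastructure agnostic" means here.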

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
• Using Spark on a non-Hadoop distribution:

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes both streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives

Hadoop Ecosystem → Spark Ecosystem

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• The Spark native API is available in Scala, Java, and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
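For flavor, here is the classic word count. The runnable version below is plain Python shaped like the RDD chain; the corresponding PySpark chain, which assumes a live SparkContext `sc` and a hypothetical input file, is shown in comments:

```python
# Word count: pure-Python version mirroring the shape of the Spark native API.
from collections import Counter

def word_count(lines):
    """flatMap(split) -> map((word, 1)) -> reduceByKey(+), evaluated eagerly."""
    return Counter(word for line in lines for word in line.split())

# Equivalent PySpark chain (assumes a live SparkContext `sc`):
#   sc.textFile("input.txt") \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
```

In Scala, or in Java 8 with lambda expressions, the same chain reads almost identically, which is the point of the bullet above.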

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
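A sketch of that mix-and-match style, using assumed Spark 1.2-era Python API names such as jsonFile and registerTempTable. A live SQLContext is required, so the function is defined but not executed here, and the pure predicate is factored out:

```python
# Sketch: mixing declarative SQL with the imperative API in Spark SQL.
SENIOR_AGE = 65

def is_senior(age):
    """Pure predicate, mirrored by the SQL WHERE clause below."""
    return age >= SENIOR_AGE

def top_senior_names(sql_ctx, path):
    """Given a live SQLContext (1.2-era API), mix SQL and imperative calls."""
    people = sql_ctx.jsonFile(path)        # schema inferred from JSON records
    people.registerTempTable("people")     # expose the data to SQL
    seniors = sql_ctx.sql(
        "SELECT name, age FROM people WHERE age >= 65")
    # Continue imperatively on the SQL result:
    return seniors.map(lambda row: row.name).take(10)
```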

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
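The processing-model row can be illustrated with a toy sketch in plain Python. This is not Storm or Spark Streaming code, only the two delivery patterns they represent:

```python
# Toy illustration of record-at-a-time vs mini-batch stream processing.

def process_per_record(stream, handle):
    """Storm-style: hand each record to the handler as soon as it arrives."""
    for record in stream:
        handle([record])          # one record per call -> lowest latency

def process_mini_batches(stream, handle, batch_size):
    """Spark Streaming-style: group records into small batches first."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)         # one call per batch -> seconds of latency
            batch = []
    if batch:
        handle(batch)             # flush the final partial batch
```

Tuning batch_size (the batch interval, in Spark Streaming terms) trades latency for throughput, while keeping each batch processable by the core Spark API.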

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V. More Q&A

63

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

64

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. Mesos: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity: Spark + Tez

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption (February 12, 2015): http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms (February 23, 2015): http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop (March 9, 2015): http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS (July 11, 2012): https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage (March 9, 2015): http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
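In practice, the deployment choice surfaces mainly through the --master URL handed to spark-submit. A few illustrative invocations, as a sketch only: the host names, ports, and application file are placeholders, not taken from this deck:

```shell
# Local mode: one JVM, with as many worker threads as cores
spark-submit --master local[*] my_app.py

# Standalone cluster manager
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 my_app.py

# Hadoop YARN (the cluster location is read from the Hadoop configuration)
spark-submit --master yarn-cluster my_app.py
```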

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant (March 4, 2015): https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014 (July 2, 2014): https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos (September 25, 2014, by Eric Carr): http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

           | Hadoop Ecosystem | Spark Ecosystem
Components | HDFS             | Tachyon
           | YARN             | Mesos
Tools      | Pig              | Spark native API
           | Hive             | Spark SQL
           | Mahout           | MLlib
           | Storm            | Spark Streaming
           | Giraph           | GraphX
           | HUE              | Spark Notebook, ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup (May 28, 2014): http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 64: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

64

4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them

Hadoop ecosystem Spark ecosystem

65

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
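To make the fine-grained sharing point concrete, a submission to a Mesos master might look like the sketch below. The hostname, port, memory setting and application name are placeholders; spark.mesos.coarse=false selects the fine-grained mode (the 2015-era default) in which executors give resources back to the cluster when idle.

```shell
# Hypothetical submission of a Spark app to a Mesos master.
# mesos-master.example.com:5050 and my_analytics_job.py are placeholders.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.mesos.coarse=false \
  --conf spark.executor.memory=4g \
  my_analytics_job.py
```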

90

YARN vs Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and memory
Running tasks     Unix processes                Linux container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
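The native API's functional style can be made concrete with the classic word count. The chain below is sketched in plain Python so it runs standalone; in Spark itself the same three steps hang off an RDD, e.g. sc.textFile(...).flatMap(...).map(...).reduceByKey(...).

```python
from collections import Counter

# Plain-Python sketch of Spark's word-count chain; `lines` stands in
# for an RDD created with sc.textFile(...).
lines = ["spark or hadoop", "spark and hadoop"]

words = [w for line in lines for w in line.split()]  # flatMap: line -> words
pairs = [(w, 1) for w in words]                      # map: word -> (word, 1)

counts = Counter()                                   # reduceByKey: sum the 1s
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The same lambda-heavy shape is what the Scala, Java 8 and Python APIs all expose.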

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
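The "mix and match" point is the interesting one: a declarative SQL step feeding an imperative post-processing step. Purely for illustration, the sketch below shows that pattern with Python's stdlib sqlite3 (not Spark's API; Spark SQL's 2015-era entry point was a SQLContext/HiveContext, and it runs the same pattern at cluster scale). Table and column names are invented.

```python
import sqlite3

# Illustrative stand-in for the SQL-plus-code mix that Spark SQL unifies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user").fetchall()

# Imperative step: arbitrary program logic over the query result.
heavy_users = sorted(u for u, total in rows if total >= 5)
```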

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                  Storm                       Spark Streaming
Processing model          Record at a time            Mini batches
Latency                   Sub-second                  Few seconds
Fault tolerance           At least once               Exactly once
(every record processed)  (may be duplicates)
Batch framework           Not available               Core Spark API
integration
Supported languages       Any programming language    Scala, Java, Python
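The processing-model row is the key difference. The sketch below, in plain Python with invented events and window size, contrasts the two: a record-at-a-time processor emits one result per event, while a mini-batch processor (Spark Streaming's model) groups events into time windows and runs ordinary batch code on each window, trading latency for batch semantics.

```python
# Illustrative only: record-at-a-time vs mini-batch processing.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.3, "e")]

# Record-at-a-time (Storm's model): one result per event, minimal latency.
record_results = [value.upper() for _, value in events]

# Mini-batch (Spark Streaming's model): bucket events into 1-second windows,
# then process each small batch with normal batch code.
batch_seconds = 1.0
batches = {}
for ts, value in events:
    batches.setdefault(int(ts // batch_seconds), []).append(value)
batch_results = [sorted(vals) for _, vals in sorted(batches.items())]
```

The mini-batch side is also why Spark Streaming integrates so cleanly with the core Spark API: each window is just a small batch job.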

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter - Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
Page 65: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

65

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

66

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

67

4. Complementarity: References

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

69

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
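The data-to-RAM argument can be made concrete with a toy model, sketched in plain Python and purely illustrative: every analytic pass over uncached data repeats the expensive parse step, while cached in-memory data is parsed once, which is exactly why Spark shines when the working set fits in cluster memory.

```python
# Toy model (illustrative only): why caching matters when data fits in RAM.
class Dataset:
    def __init__(self, raw_lines):
        self.raw_lines = raw_lines
        self.parse_count = 0      # how many times the expensive parse ran
        self._cache = None

    def parsed(self, cached=False):
        if cached and self._cache is not None:
            return self._cache    # served from memory, no re-parse
        self.parse_count += 1     # "expensive" parse happens here
        result = [int(x) for x in self.raw_lines]
        if cached:
            self._cache = result
        return result

uncached = Dataset(["1", "2", "3"])
cached = Dataset(["1", "2", "3"])
for _ in range(3):                # three analytic passes over the same data
    uncached.parsed()             # re-parses every time (data re-read each pass)
    cached.parsed(cached=True)    # parses once, akin to rdd.cache() in Spark
```

When the data is much larger than memory, the cache stops helping, which is the Data >> RAM case above.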

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: Compute models are still evolving. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
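The storage options above all plug into Spark through the same read API; what changes is the URI scheme in the path. A small sketch (the paths and hostnames are invented, and the s3n scheme reflects 2015-era Hadoop S3 support):

```python
# Swapping storage backends under Spark is a URI change, not a code change,
# because sc.textFile() accepts any Hadoop-supported URI scheme.
paths = {
    "hdfs":    "hdfs://namenode:9000/logs/2015/03/",
    "s3":      "s3n://my-bucket/logs/2015/03/",       # Amazon S3 (s3n scheme)
    "local":   "file:///data/logs/2015/03/",
    "tachyon": "tachyon://tachyon-master:19998/logs/",
}
# In a real job each path would be read identically:
#     rdd = sc.textFile(paths["s3"])
schemes = {name: uri.split("://", 1)[0] for name, uri in paths.items()}
```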

75

1. File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
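In practice, the deployment choice largely reduces to the --master URL handed to spark-submit; the application code stays the same. A sketch with invented hostnames and a placeholder app name (the yarn-client form is the 2015-era syntax):

```shell
# The same application, pointed at different cluster managers.
spark-submit --master local[4]                        my_app.py  # local, 4 cores
spark-submit --master spark://standalone-master:7077  my_app.py  # standalone
spark-submit --master mesos://mesos-master:5050       my_app.py  # Apache Mesos
spark-submit --master yarn-client                     my_app.py  # Hadoop YARN
```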

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 66: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

66

4 Complementarity + bull Mesos and YARN can work together each for what

it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom

httpsparkbigdatacomcomponenttagstag41

67

4 Complementarity + References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

68

4 Complementarity + bull Spark on Tez for efficient ETL https

githubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

69

4 Complementarity + bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
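In Spark itself, the deployment mode above is selected by the master URL handed to spark-submit or SparkConf. The classification helper below is my own sketch of that convention, not a Spark API (the URL prefixes themselves are Spark's real ones from the 1.x era):

```python
def cluster_manager(master: str) -> str:
    """Classify a Spark master URL the way spark-submit interprets it.
    Illustrative only; Spark's real resolution lives in SparkContext."""
    if master.startswith("local"):        # local, local[4], local[*]
        return "local"
    if master.startswith("spark://"):     # standalone cluster manager
        return "standalone"
    if master.startswith("mesos://"):     # Apache Mesos
        return "mesos"
    if master in ("yarn", "yarn-client", "yarn-cluster"):  # Hadoop YARN
        return "yarn"
    raise ValueError("unknown master URL: " + master)

print(cluster_manager("local[*]"))                    # local
print(cluster_manager("spark://host:7077"))           # standalone
print(cluster_manager("mesos://zk://a:2181/mesos"))   # mesos
```

The same application jar runs unchanged under any of these; only the master URL (and cluster-side packaging) differs.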

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a Non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives
          | Hadoop Ecosystem | Spark Ecosystem
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature

91

Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
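The conciseness point is easiest to see on the classic word count. The snippet below mirrors the RDD flatMap → map → reduceByKey chain using plain Python built-ins as a stand-in (this is not PySpark, just the same shape of computation):

```python
from collections import Counter
from itertools import chain

lines = ["spark and hadoop", "spark or hadoop"]

# flatMap: split each line into words
words = chain.from_iterable(line.split() for line in lines)

# map: each word to a (word, 1) pair
pairs = ((w, 1) for w in words)

# reduceByKey(+): sum the counts per word
counts = Counter()
for w, n in pairs:
    counts[w] += n

print(dict(counts))   # {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```

In Spark the same pipeline is one short chain of RDD calls in Scala or Python, which is what the "nearly as simple as the Scala API" comparison with Java 8 lambdas is about.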

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
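The "mix SQL and imperative code" idea can be sketched with stdlib sqlite3 standing in for the Spark SQL engine (the events table and threshold are invented for the example; in Spark the declarative step would be a SQLContext query over an RDD or DataFrame):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 7), ("a", 2)])

# Declarative step: aggregate in SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the relational result in ordinary code
top = [user for user, total in rows if total >= 5]
print(rows)   # [('a', 5), ('b', 7)]
print(top)    # ['a', 'b']
```

Spark SQL's value proposition is that both steps run in one engine over the same distributed data, instead of exporting query results into a separate program.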

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
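The processing-model row is the core difference: Spark Streaming discretizes the input into mini-batches, then runs a regular batch job on each. A toy pure-Python sketch of that discretization (batch size stands in for the batch interval; event names are invented):

```python
def micro_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini-batches,
    the way a DStream discretizes input by time interval. In real
    Spark Streaming each yielded batch becomes one RDD."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # flush the final partial batch

events = ["e1", "e2", "e3", "e4", "e5"]
print(list(micro_batches(events, 2)))   # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```

This batching is why Spark Streaming's latency is "a few seconds" rather than Storm's sub-second per-record handling, and also why it can reuse the core Spark batch API unchanged.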

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V. More Q&A
Page 67: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

67

4. Complementarity + (References)
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

68

4. Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

69

4. Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
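The Data << RAM point is that caching parsed data once and reusing it across jobs beats re-parsing the input for every job. A minimal sketch that makes the saving countable (the parse function and dataset are invented; `cached` plays the role of `rdd.cache()`):

```python
parse_calls = 0

def parse(raw):
    """Stand-in for expensive parsing/deserialization work."""
    global parse_calls
    parse_calls += 1
    return raw.upper()

raw_data = ["a", "b", "c"]

# Without caching: every job re-reads and re-parses the input
# (the MapReduce-style pattern, where each job starts from disk).
job1 = [parse(r) for r in raw_data]
job2 = [parse(r) for r in raw_data]
assert parse_calls == 6   # parsed twice per record

# With caching: materialize the parsed data once, reuse it in memory.
parse_calls = 0
cached = [parse(r) for r in raw_data]   # analogue of rdd.cache()
job1 = list(cached)
job2 = list(cached)
assert job1 == job2
print(parse_calls)   # 3
```

When the working set no longer fits in cluster memory (Data >> RAM), this advantage shrinks, which is the slide's argument for Tez in that regime.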

70

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

72

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

73

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a Non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots


bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 69: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

69

4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

70

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine, interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

71

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

72

5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each for what it is especially good at. One size doesn't fit all.

73

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

74

1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

77

2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

79

3 Distributions
• Using Spark on a Non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

87

4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

91

Spark Native API
• The Spark native API is available in Scala, Java and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

93

Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
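The first row of the table, record-at-a-time versus mini batches, can be illustrated without either framework. This plain-Python toy is purely conceptual, not Storm or Spark Streaming code: it shows how micro-batching groups records so that batch-style logic can be reused, at the cost of a little latency.

```python
def process_record_at_a_time(stream, handle):
    """Storm-style: handle each record as soon as it arrives (lowest latency)."""
    for record in stream:
        handle([record])            # one record per invocation

def process_micro_batches(stream, handle, batch_size):
    """Spark Streaming-style: group records into mini batches first; latency is
    a bit higher, but each batch can reuse ordinary batch-processing code."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)           # a whole mini batch per invocation
            batch = []
    if batch:
        handle(batch)               # flush the final partial batch

calls = []
process_micro_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```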

96

GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways

99

5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
Page 70: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

70

4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

71

4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

78
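From the application's point of view, the deployment options above differ mainly in the master URL handed to Spark, via `SparkConf.setMaster(...)` or `spark-submit --master ...`. A hedged cheat sheet with placeholder host names (the URL formats follow Spark's documented master-URL conventions; the helper is purely illustrative):

```python
# Master-URL cheat sheet; host names are illustrative placeholders. Spark
# selects the cluster manager from this one string, passed either through
# SparkConf.setMaster(...) or `spark-submit --master ...`.
masters = {
    "local":      "local[4]",                  # local mode with 4 threads
    "standalone": "spark://master-host:7077",  # Spark standalone cluster
    "mesos":      "mesos://mesos-host:5050",   # Apache Mesos
    "yarn":       "yarn-client",               # Hadoop YARN (Spark 1.x style)
}

def scheme(master_url):
    """Cluster manager implied by a master URL (simplified helper)."""
    if "://" in master_url:
        return master_url.split("://", 1)[0]
    if master_url.startswith("yarn"):
        return "yarn"
    return "local"
```

Swapping a cluster manager is then a one-string change in configuration, which is what makes the deployments above interchangeable.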

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant," March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra," Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector," Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

                 Hadoop Ecosystem   Spark Ecosystem
Components:      HDFS               Tachyon
                 YARN               Mesos
Tools:           Pig                Spark native API
                 Hive               Spark SQL
                 Mahout             MLlib
                 Storm              Spark Streaming
                 Giraph             GraphX
                 HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria           YARN                              Mesos
Resource sharing   Yes                               Yes
Written in         Java                              C++
Scheduling         Memory only                       CPU and memory
Running tasks      Unix processes                    Linux container groups
Requests           Specific requests and             More generic, but more coding
                   locality preference               for writing frameworks
Maturity           Less mature                       Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• "ETL with Spark" – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

92
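To give a feel for the functional style of the native API, here is the classic word count expressed in plain Python (runnable without Spark) in the same flatMap / map / reduceByKey shape the RDD code would take; the input lines are made up:

```python
# Pure-Python word count in the same functional shape as the Spark RDD API
# (flatMap -> map -> reduceByKey). Runnable without Spark; input is made up.
from functools import reduce

lines = ["spark and hadoop", "spark without hadoop"]

words = [w for line in lines for w in line.split()]   # flatMap: lines -> words
pairs = [(w, 1) for w in words]                       # map: word -> (word, 1)
counts = reduce(                                      # reduceByKey: sum per word
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)

# The equivalent PySpark chain would be:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
```

The lambda-heavy style is exactly what the Java 8 bullet above refers to: the same pipeline reads almost identically in Scala, Python, and modern Java.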

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

93
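A hedged sketch of the SQL-plus-imperative mixing described above, using the Spark 1.x Python API that was current when this deck was written (the JSON path is a placeholder; a local Spark installation is assumed, so this does not run stand-alone):

```python
# Hedged sketch, Spark 1.x Python API (circa this deck): schema inference
# from JSON plus mixing SQL with RDD-style code. The path is a placeholder
# and a local Spark installation is assumed.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "sql-demo")
sqlContext = SQLContext(sc)

# Infer schema directly from semi-structured JSON (placeholder path).
people = sqlContext.jsonFile("file:///tmp/people.json")
people.registerTempTable("people")

# Declarative SQL...
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
# ...then drop back to imperative, RDD-style code on the same result.
names = adults.map(lambda row: row.name).collect()
sc.stop()
```

The point of the slide is this round trip: the same data moves between SQL statements and programmatic transformations without leaving Spark.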

Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini-batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once               Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python

96
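To make the "record at a time" vs. "mini-batches" row concrete, here is a small pure-Python simulation (an illustration of the micro-batch idea, not Spark Streaming code): timestamped records are grouped into fixed-width time windows, which is why end-to-end latency is bounded below by the batch interval:

```python
# Illustrative micro-batching: group a timestamped record stream into
# fixed-width batches, as Spark Streaming's DStream model does conceptually.
def micro_batches(records, interval):
    """records: list of (timestamp, value) pairs; interval: batch width in seconds."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // interval)          # which window this record falls into
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

# Records arriving at 0.1s, 0.4s, 1.2s, and 2.9s, with a 1-second interval:
stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
result = micro_batches(stream, interval=1.0)
# result == [["a", "b"], ["c"], ["d"]]
```

A record arriving just after a window opens waits almost a full interval before its batch is processed, whereas a record-at-a-time system like Storm handles it immediately; that is the latency trade-off in the table.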

GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V. More Q&A

Page 72: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

72

5 Key Takeaways1 Evolution of compute models is still ongoing Watch

out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

73

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

74

1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt

he-perfect-match-apache-spark-meets-swift

75

1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS

storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4. Alternatives (Hadoop Ecosystem → Spark Ecosystem, by component)

HDFS → Tachyon
YARN → Mesos
Pig → Spark native API
Hive → Spark SQL
Mahout → MLlib
Storm → Spark Streaming
Giraph → GraphX
HUE → Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs. Mesos

Criteria         | YARN                                       | Mesos
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and memory
Running tasks    | Unix processes                             | Linux container groups
Requests         | Specific requests and locality preference  | More generic, but more coding to write frameworks
Maturity         | Less mature                                | Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, whose lambda expressions make code much more concise, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
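To make the "concise functional API" point concrete, here is a plain-Python sketch of the classic word-count pipeline, mirroring Spark's flatMap / map / reduceByKey chain with stdlib building blocks so it runs without a cluster (the sample lines are made up):

```python
from collections import Counter
from itertools import chain

lines = ["spark or hadoop", "spark with hadoop"]

# flatMap: split every line into words and flatten into one stream
words = list(chain.from_iterable(line.split() for line in lines))

# map + reduceByKey: count the occurrences of each word
counts = Counter(words)

print(sorted(counts.items()))
```

In Spark the same pipeline would be expressed over an RDD with the same three operations; the functional shape is identical.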

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
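The "mix and match SQL with imperative code" idea can be illustrated without a cluster. The sketch below uses Python's stdlib sqlite3 as a stand-in for an embedded SQL engine (it is not Spark SQL; the table and column names are invented): a declarative SQL aggregation followed by imperative post-processing of the result.

```python
import sqlite3

# In-memory table standing in for a DataFrame registered as a temp table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 100), ("bob", 250), ("alice", 50)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(bytes) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the query result in plain code
heavy_users = [user for user, total in rows if total > 120]
print(heavy_users)
```

Spark SQL applies the same pattern to distributed data: SQL for the relational part, the programmatic API for everything around it.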

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs. Spark Streaming

Criteria                                  | Storm                              | Spark Streaming
Processing model                          | Record at a time                   | Mini batches
Latency                                   | Sub-second                         | Few seconds
Fault tolerance (every record processed)  | At least once (may be duplicates)  | Exactly once
Batch framework integration               | Not available                      | Core Spark API
Supported languages                       | Any programming language           | Scala, Java, Python
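The "mini batches" row is the crux: Storm hands each record to user code as it arrives, while Spark Streaming slices the stream into small batches and runs a Spark job per batch. A toy plain-Python illustration of that micro-batch model (batching here is by count rather than by time interval, to keep the sketch deterministic):

```python
def micro_batches(records, batch_size):
    """Group an incoming stream into fixed-size mini batches, the way
    Spark Streaming slices a stream by time interval (by count here,
    so the sketch is deterministic)."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

stream = [1, 2, 3, 4, 5, 6, 7]
batches = list(micro_batches(stream, batch_size=3))
print(batches)  # each inner list would be processed as one Spark job
```

This is why Spark Streaming trades sub-second latency for batch semantics (and exactly-once processing) per mini batch.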

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

V. More Q&A

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter - Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V. More Q&A
Page 73: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

73

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

74

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

75

1. File System
Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
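From the application's point of view, the deployment choices above differ mainly in the master URL handed to Spark at launch. The plain-Python sketch below lists the common URL forms (host names and ports are placeholders) with a toy dispatcher that mimics, very roughly, how the URL scheme selects a cluster manager; it is an illustration, not Spark's actual resolution logic.

```python
# Master URL forms Spark accepts (host names and ports are placeholders)
MASTER_URLS = {
    "local":      "local[4]",                  # run in-process with 4 threads
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos":      "mesos://mesos-host:5050",   # Apache Mesos
    "yarn":       "yarn-cluster",              # Hadoop YARN (for comparison)
}

def cluster_manager(master_url):
    """Pick the cluster manager from the URL scheme, roughly as Spark does."""
    if master_url.startswith("local"):
        return "local"
    scheme = master_url.split("://", 1)[0]
    return {"spark": "standalone", "mesos": "mesos"}.get(scheme, "yarn")

print(cluster_manager(MASTER_URLS["mesos"]))
```

With spark-submit, the same application code runs unchanged across these deployments; only the --master value changes.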

78

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE, DataStax Enterprise built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

Page 75: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

75

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

76

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

77

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
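What makes all of these deployments possible with the same application code is that the cluster manager is selected by the master URL given at submission time. As a toy, non-authoritative sketch in plain Python (not Spark source code; the function name `cluster_manager` is made up for illustration), the dispatch follows Spark's documented master-URL prefixes:

```python
# Toy sketch (plain Python, NOT Spark internals): how a Spark master URL
# selects the cluster manager. Prefixes follow Spark's documented
# master-URL convention: local[...], spark://, mesos://, yarn.
def cluster_manager(master: str) -> str:
    if master == "yarn" or master.startswith("yarn-"):
        return "YARN"          # Hadoop YARN
    if master.startswith("mesos://"):
        return "Mesos"         # Apache Mesos
    if master.startswith("spark://"):
        return "Standalone"    # Spark's built-in cluster manager
    if master == "local" or master.startswith("local["):
        return "Local"         # single-machine mode, no cluster at all
    raise ValueError("unrecognized master URL: " + master)

for m in ["local[4]", "spark://host:7077", "mesos://zk1:2181", "yarn"]:
    print(m, "->", cluster_manager(m))
```

The point of the sketch: the same application runs unchanged under any of these; only the master URL passed at launch differs.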

78

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

79

3. Distributions
• Using Spark on a Non-Hadoop distribution

80

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• 'Databricks Cloud: From raw data to insights and data products in an instant', March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• 'Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra', Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• 'Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector', Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine for complex event processing. http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share a datacenter between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and Memory
Running tasks     Unix processes               Linux Container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• 'ETL with Spark' – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
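To make the flavor of that functional API concrete, here is a hedged, Spark-free sketch in plain Python of the classic word count expressed as the same flatMap / map / reduceByKey chain the native API provides. The operation names in the comments are the Spark transformations being imitated, not calls to any real Spark API:

```python
# Plain-Python imitation of an RDD word-count pipeline (no Spark required).
# Each step mirrors the Spark transformation named in its comment.
lines = ["spark and hadoop", "spark without hadoop", "hadoop"]

words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = [(w, 1) for w in words]                        # map: w -> (w, 1)

counts = {}
for w, n in pairs:                                     # reduceByKey: sum counts
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 3, 'without': 1}
```

In Spark itself the same pipeline distributes across a cluster, and Java 8 lambdas let the Java version read almost as tersely as the Scala one.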

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

93

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                        Storm                       Spark Streaming
Processing model                Record at a time            Mini batches
Latency                         Sub-second                  A few seconds
Fault tolerance (every          At least once               Exactly once
record processed)               (may be duplicates)
Batch framework integration     Not available               Core Spark API
Supported languages             Any programming language    Scala, Java, Python
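The "record at a time" vs "mini batches" row is the core distinction between the two systems. The toy sketch below, in plain Python (neither Storm's nor Spark Streaming's actual API), contrasts the two processing models on the same event stream:

```python
# Toy contrast of streaming models (plain Python, not Storm/Spark APIs).
from typing import Iterable, List

def record_at_a_time(events: Iterable[int]) -> List[int]:
    # Storm-style: one result per event, handled as each event arrives.
    return [e * 2 for e in events]

def micro_batches(events: List[int], batch_size: int) -> List[List[int]]:
    # Spark Streaming-style: group events into small batches,
    # then process each batch as a unit.
    batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
    return [[e * 2 for e in batch] for batch in batches]

events = [1, 2, 3, 4, 5]
print(record_at_a_time(events))   # [2, 4, 6, 8, 10]
print(micro_batches(events, 2))   # [[2, 4], [6, 8], [10]]
```

Batching is what gives Spark Streaming its few-seconds latency but also its exactly-once semantics and direct reuse of the core Spark batch API.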

96

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 76: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

76

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 77: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

77

2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -

httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -

httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

78

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics and has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions offered as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop: Is it an either-or proposition?
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
Page 79: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

79

3 Distributionsbull Using Spark on a Non-Hadoop distribution

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 80: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

80

Cloud

bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

81

DSEbull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

83

bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications

bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp

sparkbigdatacomcomponenttagstag39

84

bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Page 81: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

81

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014. http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014. http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

82

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

88

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software/
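"Hadoop-compatible" in practice means the file system is selected by its URI scheme, so a job's code stays the same when only the scheme changes from hdfs:// to tachyon://. A stdlib-only Python sketch of that idea (hostnames and ports are illustrative, not real endpoints):

```python
from urllib.parse import urlparse

# Conceptual sketch: Hadoop-compatible file systems are addressed by URI
# scheme. The "job" below is identical for both paths; only the scheme
# that selects the storage backend differs.
def storage_scheme(path):
    """Return the URI scheme that would pick the storage implementation."""
    return urlparse(path).scheme

for path in ("hdfs://namenode:8020/data/events",
             "tachyon://master:19998/data/events"):
    # Same job logic either way; only the backend selection changes.
    print(storage_scheme(path))
```

Running it prints `hdfs` then `tachyon`, with no change to the processing code itself.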

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

90

YARN vs Mesos

Criteria          YARN                                    Mesos
Resource sharing  Yes                                     Yes
Written in        Java                                    C++
Scheduling        Memory only                             CPU and memory
Running tasks     Unix processes                          Linux container groups
Requests          Specific requests, locality preference  More generic, but more coding to write frameworks
Maturity          Less mature                             Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
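To give a feel for the chained functional style that the Scala, Java 8, and Python APIs all share, here is a stdlib-only sketch. `LocalRDD` is a made-up, in-memory stand-in written for this illustration; it is not part of Spark, but the word-count pipeline at the bottom reads exactly like the real RDD version:

```python
# Illustrative sketch only: a tiny local stand-in for Spark's RDD API,
# mimicking the chained flatMap / map / reduceByKey style. NOT Spark itself.
class LocalRDD:
    """Minimal in-memory list wrapper with a few RDD-like transformations."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the resulting sequences.
        return LocalRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge values of (key, value) pairs sharing a key with f.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return LocalRDD(acc.items())

    def collect(self):
        return list(self.data)

lines = LocalRDD(["to be or", "not to be"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(sorted(counts.collect()))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

With Java 8 lambdas, the same three-step chain is nearly as compact as the Scala or Python versions, which is the point the slide is making.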

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
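Spark SQL itself needs a Spark installation, but the pattern it enables (ingest semi-structured records, query them with SQL, then continue imperatively on the result) can be sketched with the standard library alone, using sqlite3 purely as a stand-in engine:

```python
import json
import sqlite3

# Illustration only: sqlite3 standing in for Spark SQL to show the
# "mix SQL with programmatic code" workflow the slide describes.
records = [json.loads(s) for s in (
    '{"name": "alice", "visits": 3}',
    '{"name": "bob", "visits": 7}',
)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, visits INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(r["name"], r["visits"]) for r in records])

# A declarative SQL step...
rows = conn.execute(
    "SELECT name, visits FROM users WHERE visits > 5").fetchall()

# ...followed by an imperative step on the same result.
frequent = [name.upper() for name, _ in rows]
print(frequent)  # ['BOB']
```

In real Spark SQL the ingest step would be something like reading the JSON directly into a table, and the imperative step would run distributed rather than on a local list.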

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                     Storm                              Spark Streaming
Processing model             Record at a time                   Mini batches
Latency                      Sub-second                         Few seconds
Fault tolerance (every
record processed)            At least once (may be duplicates)  Exactly once
Batch framework integration  Not available                      Core Spark API
Supported languages          Any programming language           Scala, Java, Python
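The first row of the table is the core difference between the two systems. A plain-Python sketch (neither Storm nor Spark code) of the two processing models:

```python
# Conceptual sketch: Storm-style handlers see one record at a time, while
# Spark Streaming groups arriving records into mini-batches and processes
# each batch with the ordinary batch API. Plain Python, for illustration.
def record_at_a_time(stream, handle):
    # Storm-style: invoke the handler once per record as it arrives.
    return [handle(record) for record in stream]

def mini_batches(stream, batch_size):
    # Spark-Streaming-style: chop the stream into batches (a fixed count
    # here, standing in for Spark's time-based batch interval).
    return [stream[i:i + batch_size]
            for i in range(0, len(stream), batch_size)]

events = [1, 2, 3, 4, 5]
print(record_at_a_time(events, lambda r: r * 10))  # [10, 20, 30, 40, 50]
print(mini_batches(events, 2))                     # [[1, 2], [3, 4], [5]]
```

The mini-batch model is what lets Spark Streaming reuse the core batch API (and get exactly-once semantics), at the cost of a few seconds of latency per batch interval.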

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

6. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

100

IV. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter – Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More Q&A
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 85: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes both streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

86

IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

87

4 Alternatives

                Hadoop Ecosystem    Spark Ecosystem
   Component    HDFS                Tachyon
                YARN                Mesos
   Tools        Pig                 Spark native API
                Hive                Spark SQL
                Mahout              MLlib
                Storm               Spark Streaming
                Giraph              GraphX
                HUE                 Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
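A toy sketch of why fine-grained sharing helps: with static partitioning a job can only ever use its fixed share of the cluster, while fine-grained sharing lets it also soak up whatever is currently idle. This is plain Python with invented numbers, not Mesos code:

```python
# Toy comparison of static partitioning vs Mesos-style fine-grained sharing.
# All cluster sizes and task counts are made up for illustration.

def finish_time(tasks, cpus_available):
    """Seconds to run `tasks` one-second tasks on `cpus_available` CPUs."""
    return -(-tasks // cpus_available)  # ceiling division

CLUSTER_CPUS = 16
JOB_TASKS = 64

# Static partitioning: the job is pinned to a quarter of the cluster,
# even when the other frameworks are idle.
static = finish_time(JOB_TASKS, CLUSTER_CPUS // 4)

# Fine-grained sharing: the job may also claim the currently idle CPUs.
idle_cpus = 8  # CPUs the other frameworks are not using right now
fine_grained = finish_time(JOB_TASKS, CLUSTER_CPUS // 4 + idle_cpus)

print(static, fine_grained)  # the shared run finishes sooner
```

The longer the job runs, the more idle-resource windows it can exploit, which is why the gain is largest for long-running Spark jobs.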

90

YARN vs Mesos

   Criteria           YARN                              Mesos
   Resource sharing   Yes                               Yes
   Written in         Java                              C++
   Scheduling         Memory only                       CPU and Memory
   Running tasks      Unix processes                    Linux Container groups
   Requests           Specific requests and             More generic, but more coding
                      locality preference               for writing frameworks
   Maturity           Less mature                       Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as with the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
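To illustrate the functional, lambda-heavy style of the native API, here is the classic word count written in plain Python with the same shape as the Spark RDD version; the equivalent RDD transformations are noted in comments, and no Spark installation is assumed:

```python
from collections import Counter
from itertools import chain

lines = ["hadoop or spark", "spark with hadoop", "spark"]

# rdd.flatMap(lambda line: line.split())
words = chain.from_iterable(line.split() for line in lines)

# rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts = Counter(words)

print(counts["spark"], counts["hadoop"])
```

In actual Spark the same pipeline runs distributed across the cluster, but the code the user writes is barely longer than this sketch, which is the point of the concise native API.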

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
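The "mix SQL and imperative code" idea can be sketched without a Spark cluster. The toy below uses Python's built-in sqlite3 as a stand-in engine (it is not Spark SQL, and the table and values are invented) purely to show declarative SQL feeding imperative post-processing:

```python
import sqlite3

# Stand-in engine; with Spark SQL this would be a SparkSession instead.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("ann", 3), ("bob", 5), ("ann", 4)])

# Declarative part: aggregate with SQL.
rows = db.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: arbitrary post-processing of the result set.
top_users = [user for user, total in rows if total > 5]
print(top_users)
```

Spark SQL's advantage over a plain database is that both halves run on the same distributed engine, so the hand-off between SQL results and programmatic code costs no data movement.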

93

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

   Criteria                      Storm                       Spark Streaming
   Processing model              Record at a time            Mini batches
   Latency                       Sub-second                  Few seconds
   Fault tolerance (every        At least once (may be       Exactly once
   record processed)             duplicates)
   Batch framework integration   Not available               Core Spark API
   Supported languages           Any programming language    Scala, Java, Python
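The processing-model row is the key difference. The plain-Python sketch below (invented record stream, not Storm or Spark code) contrasts record-at-a-time handling with grouping the same stream into mini batches the way Spark Streaming's DStreams do:

```python
# Contrast record-at-a-time processing with mini-batch processing.
# The stream and batch size are invented for illustration.

stream = list(range(10))  # ten incoming records

# Record-at-a-time (Storm-style): one processing call per record.
per_record_calls = [[r] for r in stream]

# Mini batches (Spark Streaming-style): one call per batch interval.
BATCH_SIZE = 4  # stands in for the records arriving per batch interval
mini_batches = [stream[i:i + BATCH_SIZE]
                for i in range(0, len(stream), BATCH_SIZE)]

print(len(per_record_calls), len(mini_batches))  # 10 calls vs 3 calls
```

Batching is what buys Spark Streaming its exactly-once semantics and Core Spark API reuse, at the cost of the few seconds of latency shown in the table.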

96

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 86: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

86

IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 87: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

87

4 AlternativesHadoop Ecosystem Spark EcosystemComponent

HDFS Tachyon YARN Mesos

ToolsPig Spark native APIHive Spark SQL

Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 88: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

88

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

89

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing

apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 89: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

89

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as a Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
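As a minimal sketch of the deployment choice described above (the hostname, class name and jar are hypothetical), pointing spark-submit at a Mesos master instead of YARN is just a configuration change, and spark.mesos.coarse=false selects the fine-grained sharing mode:

```shell
# Hypothetical master hostname and application jar.
# spark.mesos.coarse=false requests fine-grained mode, in which each Spark
# task runs as a separate Mesos task and idle CPUs are returned to the
# cluster between tasks.
spark-submit \
  --master mesos://mesos-master:5050 \
  --conf spark.mesos.coarse=false \
  --class com.example.MyJob \
  myjob.jar
```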

90

YARN vs Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature

91

Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
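To illustrate the conciseness the slide refers to, here is a minimal word count written with plain Python builtins, mirroring the chain one would write against the RDD API in PySpark (sc.textFile(...).flatMap(...).map(...).reduceByKey(...)); only the functional chaining style is being illustrated, and no Spark cluster is involved:

```python
from collections import Counter
from itertools import chain

lines = ["spark or hadoop", "spark with hadoop"]

# flatMap: split each line into words and flatten into one stream
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: count occurrences of each word
counts = Counter(words)

print(counts["spark"])  # 2
```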

92

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
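The "mix and match" point can be sketched without a Spark cluster by using Python's built-in sqlite3 as a stand-in SQL engine (the table and values below are made up): a declarative SQL query does the filtering, then ordinary imperative code post-processes the rows, which is the same division of labor Spark SQL enables between SQL and the programmatic APIs:

```python
import sqlite3

# In-memory stand-in for a SQL engine; table and values are illustrative only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("cid", 5)])

# Declarative part: SQL selects the heavy clickers
rows = conn.execute(
    "SELECT user, clicks FROM events WHERE clicks > 4 ORDER BY user").fetchall()

# Imperative part: arbitrary code over the query result
total = sum(clicks for _, clicks in rows)
print(rows, total)  # [('bob', 7), ('cid', 5)] 12
```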

93

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

94

Spark Streaming

'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

95

Storm vs Spark Streaming

Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
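The "record at a time" vs "mini batches" distinction in the table can be sketched in a few lines of plain Python (the timestamps and records are made up): Spark Streaming groups incoming records into small time windows and processes each window as one batch, whereas a record-at-a-time system would invoke the handler once per event:

```python
from itertools import groupby

# (timestamp_seconds, record) pairs, in arrival order; values are illustrative
events = [(0.2, "a"), (0.9, "b"), (1.4, "c"), (2.1, "d"), (2.8, "e")]

BATCH = 1.0  # micro-batch interval in seconds

# Assign each record to a window by integer-dividing its timestamp, then
# process one whole window at a time (the micro-batch model)
batches = [[record for _, record in group]
           for _, group in groupby(events, key=lambda e: int(e[0] // BATCH))]

print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
```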

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

98

IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

99

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

100

V. More Q&A

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 90: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

90

YARN vs MesosCriteria

Resource sharing

Yes Yes

Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups

Requests Specific requests and locality preference

More generic but more coding for writing frameworks

Maturity Less mature Relatively more mature

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 91: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

91

Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup

bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 92: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

92

Spark SQLbull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore

bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 93: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

93

Spark MLlib

lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 94: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

94

Spark Streaming

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • IV More QampA
Page 95: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

95

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash every record processed

At least one ( may be duplicates)

Exactly one

Batch Framework integration

Not available Core Spark API

Supported languages

Any programming language

Scala Java Python

96

GraphX

lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx

97

Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook

bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

98

IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways

99

6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another

100

IV More QampA

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

  • Spark or Hadoop Is it an either-or proposition
  • Your Presenter ndash Slim Baltagi
  • Agenda
  • I Motivation
  • 1 News
  • 2 Surveys
  • Apache Spark Survey 2015 by Typesafe - Quick Snapshot
  • 3 Vendors
  • 3 Vendors (2)
  • 3 Vendors (3)
  • 3 Vendors (4)
  • 4 Analysts
  • 4 Analysts
  • 5 Key Takeaways
  • II Big Data Typical Big Data Stack Hadoop Spark
  • 1 Big Data
  • 2 Typical Big Data Stack
  • 3 Apache Hadoop
  • 4 Apache Spark
  • 5 Key Takeaways (2)
  • III Spark with Hadoop
  • 1 Evolution of Programming APIs
  • 1 Evolution of Compute Models
  • 1 Evolution
  • 1 Evolution (2)
  • 1 Evolution
  • 1 Evolution Apache Flink
  • Hadoop MapReduce vs Tez vs Spark
  • Hadoop MapReduce vs Tez vs Spark (2)
  • Hadoop MapReduce vs Tez vs Spark (3)
  • IV Spark with Hadoop
  • 2 Transition
  • 2 Transition (2)
  • Pig on Spark (Spork)
  • Hive on Spark (Currently in Beta Expected i
  • Hive on Spark (Currently in Beta Expec
  • Sqoop on Spark
  • (Expected in 31 r
  • Apache Crunch
  • (Expec
  • (Expected in Mahout 10 )
  • III Spark with Hadoop (2)
  • 3 Integration
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration
  • 3 Integration (2)
  • 3 Integration (3)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration YARN
  • 3 Integration (3)
  • 3 Integration (6)
  • 3 Integration (4)
  • 3 Integration (5)
  • 3 Integration (6)
  • 3 Integration (7)
  • 3 Integration (8)
  • 3 Integration Kite SDK
  • 3 Integration
  • 3 Integration
  • 3 Integration
  • III Spark with Hadoop (3)
  • 4 Complementarity
  • 4 Complementarity + +
  • 4 Complementarity +
  • 4 Complementarity + (2)
  • 4 Complementarity +
  • 4 Complementarity +
  • 4 Complementarity (2)
  • 4 Complementarity (3)
  • 5 Key Takeaways (3)
  • IV Spark without Hadoop
  • 1 File System
  • 1 File System (2)
  • IV Spark without Hadoop (2)
  • 2 Deployment
  • IV Spark without Hadoop (3)
  • 3 Distributions
  • Cloud
  • DSE
  • Slide 82
  • Slide 83
  • Slide 84
  • IV Spark without Hadoop (4)
  • 4 Alternatives
  • (2)
  • YARN vs Mesos
  • Spark Native API
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Storm vs Spark Streaming
  • GraphX
  • Notebook
  • IV Spark on Non-Hadoop
  • 6 Key Takeaways
  • V More Q&A
Page 96: Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

96

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

97

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark

