SPARK: THE STATE OF THE ART ENGINE FOR BIG DATA PROCESSING
Presented by: Ramaninder Singh Jhajj
Seminar on Internet Technologies
AGENDA
• Problem
• Limitations of MapReduce
• Spark Computing Framework
• Resilient Distributed Datasets
• A Unified Stack
• Who uses Spark?
• Demo
PROBLEM
• Data is growing faster than processing speeds.
• MapReduce: restrict the programming interface so that the system can do more automatically.
• Express jobs as high-level operators.
• MapReduce is efficient (but not always).
LIMITATIONS OF MAPREDUCE
• Works very well for one-pass computations, but is inefficient for multi-pass algorithms.
Source: http://www.slideshare.net/aknahs/spark-16667619
SOLUTION: IN-MEMORY DATA SHARING
Source: http://www.slideshare.net/aknahs/spark-16667619
SPARK: IN A TWEET
"Spark ... is what you might call a Swiss Army knife of big data analytics tools"
- Reynold Xin (@rxin),
Berkeley AMPLab Shark Development Lead
SPARK COMPUTING FRAMEWORK
• Spark is a fast and general engine for large-scale data processing.
• Handles batch, interactive, iterative and real-time application scenarios, and provides clean APIs in Java, Scala and Python.
• "Here's an operation, run it on all the data": I don't care where it runs or how faults are handled.
RESILIENT DISTRIBUTED DATASETS (RDD)
• Spark's primary memory abstraction.
• A read-only collection of objects, partitioned across the cluster, that can be rebuilt if a partition is lost.
• Can be cached explicitly in memory.
• Two kinds of operations: transformations and actions.
RDD OPERATIONS
RDD --(transformations)--> RDD --(action)--> value
Transformations: map, filter, flatMap, mapPartitions, groupByKey, reduceByKey, join
Actions: reduce, collect, count, first, take(n), saveAsTextFile, foreach
https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
RDD EXAMPLE
lines = spark.textFile("hdfs://........ ")       // base RDD from HDFS
errors = lines.filter(_.startsWith("Error"))     // transformed to a new RDD
messages = errors.map(_.split("\t")(2))
messages.cache()                                 // stored in memory
messages.filter(_.contains(str)).count()         // action
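The same dataflow can be sketched without a cluster. The `Rdd` class below is a hypothetical plain-Python emulation (not PySpark): transformations build up a lazy iterator chain, and only the action pulls data through it.

```python
# Plain-Python emulation of the lazy RDD pipeline above (NOT PySpark).
# filter/map are transformations (lazy); count is an action (eager).

class Rdd:
    def __init__(self, iterable_factory):
        # A factory so the pipeline can be re-run (like re-reading from HDFS).
        self._factory = iterable_factory

    def filter(self, pred):                       # transformation: lazy
        return Rdd(lambda: (x for x in self._factory() if pred(x)))

    def map(self, fn):                            # transformation: lazy
        return Rdd(lambda: (fn(x) for x in self._factory()))

    def count(self):                              # action: forces evaluation
        return sum(1 for _ in self._factory())

lines = Rdd(lambda: iter([
    "Error\tdisk\tfull",
    "Info\tboot\tok",
    "Error\tnet\ttimeout",
]))
errors = lines.filter(lambda l: l.startswith("Error"))
messages = errors.map(lambda l: l.split("\t")[2])
print(messages.filter(lambda m: "timeout" in m).count())  # prints 1
```

Note that nothing is computed when `errors` or `messages` is defined; the work happens only inside `count()`.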
FAULT TOLERANCE IN RDDS
• Achieved through the notion of lineage.
• Each RDD keeps track of how it was derived from other RDDs, so a lost partition can be recomputed rather than replicated.
Ex: messages = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))
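A hypothetical sketch of the lineage idea (not Spark's actual internals): each derived partition stores only its parent and the function that produced it, so its data can always be recomputed from the base dataset.

```python
# Sketch of lineage-based recovery (illustrative, not Spark internals).
# A partition that has lost its data recomputes it from its parent.

class Partition:
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def compute(self):
        if self.data is not None:
            return self.data
        # Lost (or never materialized): recompute via the lineage chain.
        return [y for x in self.parent.compute() for y in self.fn(x)]

base = Partition(data=["Error\tdisk\tfull", "Info\tboot\tok"])
errors = Partition(parent=base,
                   fn=lambda l: [l] if l.startswith("Error") else [])
msgs = Partition(parent=errors, fn=lambda l: [l.split("\t")[2]])

print(msgs.compute())  # derived purely from lineage: ['full']
```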
EXAMPLE: WORD COUNT
• Word Count in MapReduce: 50-70 lines of code.
• What about Spark?
[Code shown in Python and in Java: a handful of lines each.]
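The classic Spark word count is a three-step `flatMap` → `map` → `reduceByKey` pipeline. Since running it needs a cluster, here is the same pipeline rendered over an in-memory list in plain Python:

```python
# Spark's word-count pipeline (flatMap -> map -> reduceByKey),
# emulated over a list instead of an HDFS file.
from collections import defaultdict

lines = ["to be or not to be", "to spark"]

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = defaultdict(int)
for w, n in pairs:                                    # reduceByKey
    counts[w] += n

print(counts["to"])  # prints 3
```

Compared with the 50-70 lines of a hand-written MapReduce job, the whole logic fits in three expressions.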
SPARK: A UNIFIED STACK
Spark Core: RDD APIs, fault tolerance, processing
A UNIFIED STACK : SPARK SQL
[Stack: Spark SQL sits on top of Spark Core]
• Spark SQL unifies access to structured data.
• Seamlessly mix SQL queries with Spark programs.
sqlCtx.jsonFile("s3n://... ").registerAsTable("json")
schema_rdd = sqlCtx.sql("""SELECT * FROM hiveTable JOIN json ...""")
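Spark SQL itself needs a cluster, but the idea of registering semi-structured records as a table and then querying them with SQL from inside a program can be shown with the standard-library `sqlite3` module (the table and column names below are made up for the analogy):

```python
# Analogy only (sqlite3, not Spark SQL): register JSON records as a
# table, then mix SQL with ordinary program logic.
import json
import sqlite3

records = [json.loads(s) for s in (
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7}',
)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE json_tbl (user TEXT, clicks INT)")
db.executemany("INSERT INTO json_tbl VALUES (?, ?)",
               [(r["user"], r["clicks"]) for r in records])

total = db.execute("SELECT SUM(clicks) FROM json_tbl").fetchone()[0]
print(total)  # prints 10
```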
A UNIFIED STACK : SPARK GRAPHX
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Seamlessly works with both graphs and collections.
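GraphX exposes a Scala API, so as a rough plain-Python sketch: a graph is just a collection of edges, and a graph-parallel step such as counting in-degrees is an aggregation over that edge list (the edge data below is hypothetical):

```python
# Rough sketch: a graph as an edge list, with in-degree computed as an
# aggregation over edges (hypothetical data, plain Python, not GraphX).
from collections import Counter

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

in_degree = Counter(dst for _src, dst in edges)
print(in_degree["c"])  # prints 2
```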
A UNIFIED STACK : SPARK MLLIB
• MLlib is Apache Spark's scalable machine learning library.
• High-quality algorithms, 100x faster than MapReduce.
points = spark.textFile("hdfs://... ").map(parsePoint)
model = KMeans.train(points, k=10)
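To see roughly what `KMeans.train` computes, here is a minimal 1-D k-means sketch in plain Python (fixed initial centers for determinism; the real MLlib version runs the same iterations in parallel over RDD partitions):

```python
# Minimal 1-D k-means sketch (plain Python, not MLlib): alternate
# between assigning points to the nearest center and re-averaging.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = [0.0, 10.0]                    # fixed init for determinism

for _ in range(10):                      # a few Lloyd iterations
    clusters = [[], []]
    for p in points:
        j = min(range(2), key=lambda j: abs(p - centers[j]))
        clusters[j].append(p)
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # two centers, near 1.0 and 9.0
```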
A UNIFIED STACK : SPARK STREAMING
• Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
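Spark Streaming discretizes the input into micro-batches and runs the same RDD operations on each one. A toy emulation of a running word count over hypothetical batches (plain Python, not the Streaming API):

```python
# Toy emulation of the micro-batch model (plain Python): each "batch"
# is processed with the same logic, updating running state.
from collections import Counter

batches = [["spark streams"], ["spark scales", "spark"], []]

running = Counter()                  # running state across batches
for batch in batches:                # one iteration per batch interval
    words = [w for line in batch for w in line.split()]
    running.update(words)

print(running["spark"])  # prints 3
```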
WHO USES SPARK?
Source: Spark Wiki (https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark)
CONCLUSION
• Spark is the first system that allows an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.
• The same engine performs data extraction, model training and interactive queries, with no need for a separate framework for each function.
INTERESTED IN READING MORE?
• https://spark.apache.org/
• http://ampcamp.berkeley.edu/
• https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
• https://spark.apache.org/documentation.html
• edx.org offers a course, "Introduction to Big Data with Apache Spark"