Apache Spark


Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Mahdi Esmail oghli

Dr. Bagheri

Amirkabir University of Technology

SPARK

http://BigData.ceit.aut.ac.ir

A big data processing framework.


“What is Big Data?”


Dealing with Big Data


Sampling

Hashing

Approximation methods

Map-Reduce Model

Map-Reduce Model

A programming model for big data

Exposes the potential for parallelism

Can be executed on a cluster

Map-Reduce Model

[Diagram: input is split across parallel Map tasks, whose output is shuffled to Reduce tasks that produce the final output.]
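To make the model concrete, here is a minimal word-count sketch of the map and reduce phases in plain Scala collections, independent of any framework (all names are illustrative):

val input = Seq("spark is fast", "hadoop is stable")

// Map phase: turn each line into (word, 1) pairs.
val mapped = input.flatMap(line => line.split(" ").map(word => (word, 1)))

// Shuffle: group the pairs by word.
val grouped = mapped.groupBy(_._1)

// Reduce phase: sum the counts for each word.
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }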

Problems with current computing frameworks, especially Map-Reduce

They provide an abstraction for accessing the cluster's computational resources, but lack an abstraction for distributed memory.

This makes them inefficient for applications that reuse intermediate results across multiple computations.

SPARK Motivation

Problems with current computing frameworks (e.g., Map-Reduce) hurt:

Iterative algorithms

Interactive data mining tools

Data reuse examples

Iterative machine learning and graph algorithms:

PageRank

K-means clustering

Logistic regression

Data reuse examples

Interactive data mining (running multiple queries on the same subset of data):

Statistical queries

Fraud detection

Stream queries

Current Solution

The only way to reuse data between computations in current frameworks is to write it to an external stable storage system (e.g., a distributed file system).

Map-Reduce Model

[Diagram: the same Map-Reduce data flow, but intermediate results must be written to and re-read from stable storage between jobs.]

Systems developed for reusing intermediate data

Pregel: iterative graph computation

HaLoop: an iterative Map-Reduce interface

These support only specific computation patterns; we need an abstraction for more general reuse.

RDD: Resilient Distributed Dataset

RDD

A read-only, partitioned collection of records.

Can be created from data in stable storage or from other RDDs (through transformations).

Users can control persistence and partitioning.
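A minimal sketch of both creation paths, assuming a SparkContext named sc (the path and variable names are illustrative):

// From stable storage:
val fromStorage = sc.textFile("hdfs://host/data.txt")

// From another RDD, through a transformation:
val fromOtherRdd = fromStorage.filter(_.nonEmpty)

// User-controlled persistence and partitioning:
fromOtherRdd.persist()
val partitioned = fromOtherRdd.map(line => (line, 1))
  .partitionBy(new org.apache.spark.HashPartitioner(8))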

RDD

Enables efficient data reuse

A parallel data structure that lets users explicitly persist intermediate results

Supports in-memory computation on large clusters in a fault-tolerant manner

Current fault-tolerance approaches

Replicating data across machines

Logging updates across machines

Both require copying large amounts of data over the cluster network, which is expensive for data-intensive workloads.
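RDDs take a different approach: they log the coarse-grained transformations used to build a dataset (its lineage) and recompute lost partitions from it. A minimal sketch, assuming a SparkContext named sc; toDebugString is Spark's built-in way to print an RDD's lineage:

// Build an RDD through a chain of transformations (the path is illustrative):
val words = sc.textFile("hdfs://host/logs.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Print the chain of transformations SPARK would replay to rebuild lost partitions:
println(counts.toDebugString)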

Two main interfaces of RDDs

Transformations

e.g., map, filter, and join

The interface used for fault tolerance in RDDs

Actions

Actions return a value, e.g., count, collect, save.

SPARK computes RDDs lazily (this helps pipelining); an action is what triggers the actual computation.
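A minimal sketch of the two interfaces, assuming a SparkContext named sc (variable names are illustrative):

val nums = sc.parallelize(1 to 1000000)   // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)       // transformation: lazy, no cluster work yet
val doubled = evens.map(_ * 2)            // transformation: still lazy
val total = doubled.count()               // action: triggers the pipelined computation

Because filter and map are only materialized when count runs, SPARK can pipeline them into a single pass over the data.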

RDDs can express other models

Map-Reduce

SQL

Pregel

HaLoop
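For instance, the word count from the earlier sketch can be expressed directly with RDD operations; a minimal sketch assuming a SparkContext named sc and an illustrative input path:

// The "map" phase becomes a flatMap transformation, the "reduce" phase a reduceByKey:
val input = sc.textFile("hdfs://host/input.txt")
val mapped = input.flatMap(line => line.split(" ").map(word => (word, 1)))
val counts = mapped.reduceByKey(_ + _)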

rdd1.join(rdd2).groupBy(…).filter(…)

[Diagram: the chain of transformations (join → groupBy → filter) forms a task graph; the task scheduler breaks it into tasks that workers execute, returning the results.]

SPARK Runtime

[Diagram: a driver program sends tasks to multiple workers; each worker holds a slice of the input data in RAM, executes its tasks, and returns results to the driver.]

An Example

// Load a log file from HDFS and keep only the error lines:
val lines = spark.textFile("hdfs://…")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

// Count the errors mentioning HDFS; the map extracts field 3 of each tab-separated line:
errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).count()

An Example

[Lineage graph:
lines
  -> filter(_.startsWith("ERROR")) -> errors
  -> filter(_.contains("HDFS")) -> HDFS errors
  -> map(_.split('\t')(3)) -> field 3]

The persist Function

Indicates which RDDs we want to reuse in future actions.

Other persistence strategies are available, such as:

Storing the RDD only on disk

Replicating it across machines

Users can also set persistence priorities on RDDs.
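A minimal sketch of these strategies using Spark's StorageLevel API, applied to the errors RDD from the example above:

import org.apache.spark.storage.StorageLevel

// Store the RDD only on disk:
errors.persist(StorageLevel.DISK_ONLY)

// Alternatively, keep it in memory replicated on two machines. This is shown
// commented out because an RDD's storage level can only be set once:
// errors.persist(StorageLevel.MEMORY_ONLY_2)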

SPARK

RDDs have been implemented in a system called SPARK, written in the Scala language.

What benchmarks show about SPARK

Up to 20x faster than HADOOP for iterative applications

Can scan a 1 TB dataset with 5-7 s latency

Evaluation (Logistic Regression)

100 GB of data on 100 nodes; running time per iteration, in seconds:

            First iteration   Later iterations
HADOOP            80                76
HADOOPBM         139                62
SPARK             46                 3

Evaluation (K-means)

Running time per iteration, in seconds:

            First iteration   Later iterations
HADOOP           115               106
HADOOPBM         182                87
SPARK             82                33

SPARK Stack

Spark SQL | Spark Streaming | MLlib | GraphX

Apache Spark (core engine)

Distributed file system, e.g., HDFS, GlusterFS

SPARK won the Daytona GraySort contest 2014

Spark officially set a new record in large-scale sorting.

Spark: the fastest open-source engine for sorting

                         HADOOP MR             SPARK               SPARK 1 PB
Data size                102.5 TB              100 TB              1000 TB
Elapsed time             72 min                23 min              234 min
# Nodes                  2,100                 206                 190
# Cores                  50,400 physical       6,592 virtualized   6,080 virtualized
Cluster disk throughput  3,150 GB/s            618 GB/s            570 GB/s
Environment              dedicated datacenter  EC2 (i2.8xlarge)    EC2 (i2.8xlarge)
Sort rate                1.42 TB/min           4.27 TB/min         4.27 TB/min

All sorting was done on disk, without using Spark's in-memory cache.

Current committers


SPARK > HADOOP MR


References


Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.

Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.

http://Spark.apache.org

https://databricks.com

Thank you