
Spark: Resilient Distributed Datasets for In-Memory Cluster Computing

Brad Karp, UCL Computer Science
(with slides contributed by Matei Zaharia)

CS M038 / GZ06, 9th March 2016

Motivation

MapReduce greatly simplified “big data” analysis on large, unreliable clusters.

But as soon as it became popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage → slow!

Examples

[Figure: data sharing through stable storage in MapReduce. An iterative job reads its input from HDFS, and each iteration writes its result to HDFS and reads it back for the next one (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …). For interactive use, queries 1-3 each re-read the input from HDFS to produce results 1-3.]

Slow because of replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing

[Figure: the same workloads with in-memory sharing. The iterative job keeps its data in RAM between iter. 1, iter. 2, …; for interactive use, one-time processing of the input leaves a dataset in RAM that queries 1-3 then share.]

RAM is 10-100× faster than network/disk, but how to get fault tolerance?

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
» RAMCloud, databases, distributed shared memory, Piccolo

This requires replicating data or logs across nodes for fault tolerance:
» Costly for data-intensive apps
» 10-100× slower than a memory write

Solution: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

RDD Recovery

[Figure: the in-memory data sharing diagrams again (iter. 1, iter. 2, … and one-time processing feeding queries 1-3); with RDDs, a lost in-memory partition is recovered by recomputing it rather than by replicating every write to stable storage.]

Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items

Unify many current programming models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don’t.

Tradeoff Space

[Figure: systems plotted by granularity of updates (fine to coarse) against write throughput (low to high). Fine-grained: K-V stores, databases, RAMCloud (best for transactional workloads), with write throughput limited by network bandwidth. Coarse-grained: HDFS and RDDs (best for batch workloads), with RDDs able to write at memory bandwidth.]

Outline

Spark programming interface
Implementation
Demo
How people are using Spark

Spark Programming Interface

DryadLINQ-like API in the Scala language; usable interactively from the Scala interpreter.

Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.persist()

  messages.filter(_.contains("foo")).count        // action
  messages.filter(_.contains("bar")).count

[Figure: a master and three workers, each holding one HDFS block of the log (Block 1-3). The first action ships tasks to the workers, which scan their blocks, cache their partition of messages (Msgs. 1-3) in RAM, and return results to the master; later queries send tasks that run directly against the cached partitions.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
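The slide code omits the surrounding setup. Below is a minimal, self-contained sketch of the same session against current Spark, assuming a local SparkContext, an elided HDFS path as on the slide, and a tab-separated log format with the message in the third field; the object and app names are illustrative.

  import org.apache.spark.{SparkConf, SparkContext}

  object LogMining {
    def main(args: Array[String]): Unit = {
      // Local master for illustration; on a real cluster the master URL differs.
      val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[*]"))

      val lines    = sc.textFile("hdfs://...")            // base RDD (path elided as on the slide)
      val errors   = lines.filter(_.startsWith("ERROR"))  // transformation: defines a new RDD, runs nothing yet
      val messages = errors.map(_.split('\t')(2))         // assumes tab-separated records, message in field 3
      messages.persist()                                   // keep the filtered messages in RAM after first use

      // Actions force evaluation; the second count reuses the cached partitions.
      println(messages.filter(_.contains("foo")).count())
      println(messages.filter(_.contains("bar")).count())

      sc.stop()
    }
  }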

Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data.

E.g.: messages = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…)).]
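As an aside, the lineage an RDD carries can be inspected directly. A small sketch, assuming the `sc` and log format from the previous example (current Spark names the intermediate RDDs differently, e.g. MapPartitionsRDD rather than FilteredRDD/MappedRDD):

  // Build the same chain and print its lineage.
  val messages = sc.textFile("hdfs://...")
    .filter(_.contains("error"))
    .map(_.split('\t')(2))
  println(messages.toDebugString)   // prints the chain of parent RDDs back to the input file
  // If a cached partition of `messages` is lost, only the corresponding input
  // partition is re-read, re-filtered, and re-mapped, not the whole dataset.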

Fault Recovery Results

[Figure: iteration time (s) for iterations 1-10 of an iterative job: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The failure happens during iteration 6; its time rises to 81 s while lost partitions are recomputed from lineage, and later iterations return to ~58 s.]

Fault Tolerance vs. Performance

With RDDs, the programmer controls the tradeoff between fault tolerance and performance:

• Persist frequently (e.g., with REPLICATE): fast recovery, but slower execution
• Persist infrequently: fast execution, but slow recovery
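A sketch of the two ends of this tradeoff using storage levels, assuming the `messages` RDD from earlier (an RDD's storage level can only be set once, so the two calls are alternatives, not a sequence):

  import org.apache.spark.storage.StorageLevel

  // Fast recovery, slower execution: keep two in-memory replicas of each partition,
  // so a single node failure needs no recomputation (extra memory and network on write).
  messages.persist(StorageLevel.MEMORY_ONLY_2)

  // Fast execution, slower recovery: the default single in-memory copy; a lost
  // partition is rebuilt from lineage, which costs nothing if no failure occurs.
  // messages.persist(StorageLevel.MEMORY_ONLY)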

Why Wide vs. Narrow Dependencies?

RDD metadata includes:
• lineage (graph of RDDs and operations)
• wide dependencies: require a shuffle (and recovering a lost partition may need the entire preceding RDD(s) in the lineage)
• narrow dependencies: no shuffle (recovering a lost partition needs only isolated partition(s) of the preceding RDD(s) in the lineage)
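A sketch of which common operations create which kind of dependency, assuming `pairs` is an RDD of (key, value) pairs:

  // Narrow dependencies: each output partition reads from exactly one parent partition,
  // so a lost partition can be recomputed from a single parent partition.
  val scaled = pairs.map { case (k, v) => (k, v * 2) }
                    .filter { case (_, v) => v > 0 }

  // Wide dependency: reduceByKey needs a shuffle, so each output partition may depend
  // on all parent partitions (unless the data is already suitably partitioned).
  val totals = scaled.reduceByKey(_ + _)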

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs

  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
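A self-contained sketch of the same loop, assuming the `sc` from earlier; the toy link data and iteration count are illustrative:

  val iterations = 10
  val links = sc.parallelize(Seq(
    ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
  )).persist()                                   // links are reused every iteration, so keep them in RAM

  var ranks = links.mapValues(_ => 1.0)          // step 1: start each page with rank 1

  for (_ <- 1 to iterations) {                   // step 2: spread rank along out-links, then sum per page
    val contribs = links.join(ranks).flatMap {
      case (_, (neighbors, rank)) =>
        neighbors.map(dest => (dest, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _)
  }

  ranks.collect().foreach(println)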

Optimizing Placement

links & ranks are repeatedly joined.

Can co-partition them (e.g. hash both on URL) to avoid shuffles:

  links = links.partitionBy(new URLPartitioner())

Can also use app knowledge, e.g., hash on DNS name.

[Figure: PageRank dataflow. Links (url, neighbors) and Ranks_0 (url, rank) are joined to produce the first contributions RDD, which is reduced into Ranks_1; the join/reduce cycle repeats, producing Contribs_i and Ranks_{i+1} on each iteration.]
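URLPartitioner on the slide stands for a custom partitioner; a minimal sketch of the same idea with Spark's built-in HashPartitioner (the partition count is illustrative):

  import org.apache.spark.HashPartitioner

  // Hash-partition the links once and keep them in RAM; because mapValues preserves
  // the partitioner, ranks derived from these links stay co-partitioned with them,
  // so the repeated join no longer shuffles the large links RDD.
  val partitionedLinks = links.partitionBy(new HashPartitioner(64)).persist()
  var ranks = partitionedLinks.mapValues(_ => 1.0)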

PageRank Performance

[Figure: time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23.]

Memory Exhaustion

Partitions are evicted with a Least Recently Used (LRU) policy:
• First evict already-used, non-persisted partitions; they simply vanish, as they are not stored on disk
• Then evict persisted ones

Behavior with Insufficient RAM

[Figure: iteration time (s) vs. percent of the working set in memory: 0% → 68.8, 25% → 58.1, 50% → 40.7, 75% → 29.7, 100% → 11.5. Performance degrades gracefully as less of the working set fits in RAM.]

Scalability

[Figure: iteration time (s) vs. number of machines (25, 50, 100).
Logistic Regression: Hadoop 184, 111, 76; HadoopBinMem 116, 80, 62; Spark 15, 6, 3.
K-Means: Hadoop 274, 157, 106; HadoopBinMem 197, 121, 87; Spark 143, 61, 33.]

Contributors to Speedup

[Figure: iteration time (s) by input source.
Text input: in-memory HDFS 15.4, in-memory local file 13.1, Spark RDD 2.9.
Binary input: in-memory HDFS 8.4, in-memory local file 6.9, Spark RDD 2.9.]

Implementation

Runs on Mesos [NSDI ’11] to share the cluster with Hadoop.

Can read from any Hadoop input source (HDFS, S3, …).

[Figure: Spark, Hadoop, and MPI frameworks running side by side on Mesos across the cluster’s nodes.]

No changes to the Scala language or compiler:
» Reflection + bytecode analysis to correctly ship code

www.spark-project.org

Programming Models Implemented on Spark

RDDs can express many existing parallel models:
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)

All of these are based on coarse-grained operations.

Enables apps to efficiently intermix these models.
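For instance, the MapReduce model reduces to ordinary RDD transformations. A minimal sketch, assuming the `sc` from earlier and an illustrative input path:

  // Word count: the classic MapReduce example expressed with coarse-grained RDD operations.
  val counts = sc.textFile("hdfs://.../docs")
    .flatMap(_.split("\\s+"))          // "map" phase: emit one record per word
    .map(word => (word, 1))
    .reduceByKey(_ + _)                // "reduce" phase: sum the counts for each word
  counts.take(10).foreach(println)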

RDDs: Summary

RDDs offer a simple and efficient programming model for a broad range of applications:

• Avoid disk I/O overhead for intermediate results of multi-phase computations
• Leverage lineage: cheaper than checkpointing when there are no failures, and low-cost when failures do occur
• Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

The scheduler is significantly more complicated than MapReduce’s; we did not discuss it!