
Spark: Resilient Distributed Datasets for In-Memory Cluster Computing

Brad Karp, UCL Computer Science
(with slides contributed by Matei Zaharia)

CS M038 / GZ06, 9th March 2016

Motivation

MapReduce greatly simplified “big data” analysis on large, unreliable clusters.

But as soon as it became popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage → slow!

Examples

[Figure: data sharing through stable storage in MapReduce. An iterative job reads its input from HDFS, and each iteration writes its result to HDFS and reads it back for the next one (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …). For interactive use, queries 1-3 each re-read the input from HDFS to produce results 1-3.]

Slow because of replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing

[Figure: the same workloads with in-memory sharing. The iterative job keeps its data in RAM between iter. 1, iter. 2, …; for interactive use, one-time processing of the input leaves a dataset in RAM that queries 1-3 then share.]

RAM is 10-100× faster than network/disk, but how to get fault tolerance?

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
» RAMCloud, databases, distributed shared memory, Piccolo

This requires replicating data or logs across nodes for fault tolerance:
» Costly for data-intensive apps
» 10-100× slower than a memory write

Solution: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

RDD Recovery

[Figure: the in-memory data sharing diagrams again (iter. 1, iter. 2, … and one-time processing feeding queries 1-3); with RDDs, a lost in-memory partition is recovered by recomputing it rather than by replicating every write to stable storage.]

Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items

Unify many current programming models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don’t.

Tradeoff Space

[Figure: systems plotted by granularity of updates (fine to coarse) against write throughput (low to high). Fine-grained: K-V stores, databases, RAMCloud (best for transactional workloads), with write throughput limited by network bandwidth. Coarse-grained: HDFS and RDDs (best for batch workloads), with RDDs able to write at memory bandwidth.]

Outline

Spark programming interface
Implementation
Demo
How people are using Spark

Spark Programming Interface

DryadLINQ-like API in the Scala language; usable interactively from the Scala interpreter.

Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.persist()

  messages.filter(_.contains("foo")).count        // action
  messages.filter(_.contains("bar")).count

[Figure: a master and three workers, each holding one HDFS block of the log (Block 1-3). The first action ships tasks to the workers, which scan their blocks, cache their partition of messages (Msgs. 1-3) in RAM, and return results to the master; later queries send tasks that run directly against the cached partitions.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
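The slide code omits the surrounding setup. Below is a minimal, self-contained sketch of the same session against current Spark, assuming a local SparkContext, an elided HDFS path as on the slide, and a tab-separated log format with the message in the third field; the object and app names are illustrative.

  import org.apache.spark.{SparkConf, SparkContext}

  object LogMining {
    def main(args: Array[String]): Unit = {
      // Local master for illustration; on a real cluster the master URL differs.
      val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[*]"))

      val lines    = sc.textFile("hdfs://...")            // base RDD (path elided as on the slide)
      val errors   = lines.filter(_.startsWith("ERROR"))  // transformation: defines a new RDD, runs nothing yet
      val messages = errors.map(_.split('\t')(2))         // assumes tab-separated records, message in field 3
      messages.persist()                                   // keep the filtered messages in RAM after first use

      // Actions force evaluation; the second count reuses the cached partitions.
      println(messages.filter(_.contains("foo")).count())
      println(messages.filter(_.contains("bar")).count())

      sc.stop()
    }
  }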

Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data.

E.g.: messages = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…)).]
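As an aside, the lineage an RDD carries can be inspected directly. A small sketch, assuming the `sc` and log format from the previous example (current Spark names the intermediate RDDs differently, e.g. MapPartitionsRDD rather than FilteredRDD/MappedRDD):

  // Build the same chain and print its lineage.
  val messages = sc.textFile("hdfs://...")
    .filter(_.contains("error"))
    .map(_.split('\t')(2))
  println(messages.toDebugString)   // prints the chain of parent RDDs back to the input file
  // If a cached partition of `messages` is lost, only the corresponding input
  // partition is re-read, re-filtered, and re-mapped, not the whole dataset.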

Fault Recovery Results

[Figure: iteration time (s) for iterations 1-10 of an iterative job: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The failure happens during iteration 6; its time rises to 81 s while lost partitions are recomputed from lineage, and later iterations return to ~58 s.]

Fault Tolerance vs. Performance

With RDDs, the programmer controls the tradeoff between fault tolerance and performance:

• Persist frequently (e.g., with REPLICATE): fast recovery, but slower execution
• Persist infrequently: fast execution, but slow recovery
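A sketch of the two ends of this tradeoff using storage levels, assuming the `messages` RDD from earlier (an RDD's storage level can only be set once, so the two calls are alternatives, not a sequence):

  import org.apache.spark.storage.StorageLevel

  // Fast recovery, slower execution: keep two in-memory replicas of each partition,
  // so a single node failure needs no recomputation (extra memory and network on write).
  messages.persist(StorageLevel.MEMORY_ONLY_2)

  // Fast execution, slower recovery: the default single in-memory copy; a lost
  // partition is rebuilt from lineage, which costs nothing if no failure occurs.
  // messages.persist(StorageLevel.MEMORY_ONLY)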

Why Wide vs. Narrow Dependencies?

RDD metadata includes:
• lineage (graph of RDDs and operations)
• wide dependencies: require a shuffle (and recovering a lost partition may need the entire preceding RDD(s) in the lineage)
• narrow dependencies: no shuffle (recovering a lost partition needs only isolated partition(s) of the preceding RDD(s) in the lineage)
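A sketch of which common operations create which kind of dependency, assuming `pairs` is an RDD of (key, value) pairs:

  // Narrow dependencies: each output partition reads from exactly one parent partition,
  // so a lost partition can be recomputed from a single parent partition.
  val scaled = pairs.map { case (k, v) => (k, v * 2) }
                    .filter { case (_, v) => v > 0 }

  // Wide dependency: reduceByKey needs a shuffle, so each output partition may depend
  // on all parent partitions (unless the data is already suitably partitioned).
  val totals = scaled.reduceByKey(_ + _)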

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs

  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
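A self-contained sketch of the same loop, assuming the `sc` from earlier; the toy link data and iteration count are illustrative:

  val iterations = 10
  val links = sc.parallelize(Seq(
    ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
  )).persist()                                   // links are reused every iteration, so keep them in RAM

  var ranks = links.mapValues(_ => 1.0)          // step 1: start each page with rank 1

  for (_ <- 1 to iterations) {                   // step 2: spread rank along out-links, then sum per page
    val contribs = links.join(ranks).flatMap {
      case (_, (neighbors, rank)) =>
        neighbors.map(dest => (dest, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _)
  }

  ranks.collect().foreach(println)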

Optimizing Placement

links & ranks are repeatedly joined.

Can co-partition them (e.g. hash both on URL) to avoid shuffles:

  links = links.partitionBy(new URLPartitioner())

Can also use app knowledge, e.g., hash on DNS name.

[Figure: PageRank dataflow. Links (url, neighbors) and Ranks_0 (url, rank) are joined to produce the first contributions RDD, which is reduced into Ranks_1; the join/reduce cycle repeats, producing Contribs_i and Ranks_{i+1} on each iteration.]
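URLPartitioner on the slide stands for a custom partitioner; a minimal sketch of the same idea with Spark's built-in HashPartitioner (the partition count is illustrative):

  import org.apache.spark.HashPartitioner

  // Hash-partition the links once and keep them in RAM; because mapValues preserves
  // the partitioner, ranks derived from these links stay co-partitioned with them,
  // so the repeated join no longer shuffles the large links RDD.
  val partitionedLinks = links.partitionBy(new HashPartitioner(64)).persist()
  var ranks = partitionedLinks.mapValues(_ => 1.0)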

PageRank Performance

[Figure: time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23.]

Memory Exhaustion

Partitions are evicted with a Least Recently Used (LRU) policy:
• First evict already-used, non-persisted partitions; they simply vanish, as they are not stored on disk
• Then evict persisted ones

Behavior with Insufficient RAM

[Figure: iteration time (s) vs. percent of the working set in memory: 0% → 68.8, 25% → 58.1, 50% → 40.7, 75% → 29.7, 100% → 11.5. Performance degrades gracefully as less of the working set fits in RAM.]

Scalability

[Figure: iteration time (s) vs. number of machines (25, 50, 100).
Logistic Regression: Hadoop 184, 111, 76; HadoopBinMem 116, 80, 62; Spark 15, 6, 3.
K-Means: Hadoop 274, 157, 106; HadoopBinMem 197, 121, 87; Spark 143, 61, 33.]

Contributors to Speedup

[Figure: iteration time (s) by input source.
Text input: in-memory HDFS 15.4, in-memory local file 13.1, Spark RDD 2.9.
Binary input: in-memory HDFS 8.4, in-memory local file 6.9, Spark RDD 2.9.]

Implementation

Runs on Mesos [NSDI ’11] to share the cluster with Hadoop.

Can read from any Hadoop input source (HDFS, S3, …).

[Figure: Spark, Hadoop, and MPI frameworks running side by side on Mesos across the cluster’s nodes.]

No changes to the Scala language or compiler:
» Reflection + bytecode analysis to correctly ship code

www.spark-project.org

Programming Models Implemented on Spark

RDDs can express many existing parallel models:
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)

All of these are based on coarse-grained operations.

Enables apps to efficiently intermix these models.
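For instance, the MapReduce model reduces to ordinary RDD transformations. A minimal sketch, assuming the `sc` from earlier and an illustrative input path:

  // Word count: the classic MapReduce example expressed with coarse-grained RDD operations.
  val counts = sc.textFile("hdfs://.../docs")
    .flatMap(_.split("\\s+"))          // "map" phase: emit one record per word
    .map(word => (word, 1))
    .reduceByKey(_ + _)                // "reduce" phase: sum the counts for each word
  counts.take(10).foreach(println)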

RDDs: Summary

RDDs offer a simple and efficient programming model for a broad range of applications:

• Avoid disk I/O overhead for intermediate results of multi-phase computations
• Leverage lineage: cheaper than checkpointing when there are no failures, and low-cost when failures do occur
• Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

The scheduler is significantly more complicated than MapReduce’s; we did not discuss it!