
Big Data Processing with MapReduce and Spark

Matei Zaharia
UC Berkeley AMPLab
spark-project.org

Outline

The big data problem

MapReduce model

Limitations of MapReduce

Spark model

Future directions

The Big Data Problem

Data is growing faster than computation speeds:

Growing data sources
» Web, mobile, scientific, …

Cheap storage
» Doubling every 18 months

Stalling CPU speeds
» Even multicores are not enough

Examples

Facebook’s daily logs: 60 TB

1000 genomes project: 200 TB

Google web index: 10+ PB

Cost of 1 TB of disk: $50

Time to read 1 TB from disk: 6 hours (50 MB/s)
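A quick back-of-the-envelope check of that last figure, assuming 1 TB = 1,000,000 MB:

1,000,000 MB / (50 MB/s) = 20,000 s ≈ 5.6 hours ≈ 6 hours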

The Big Data Problem

A single machine can no longer process, or even store, all the data!

Only solution is to distribute over large clusters

Google Datacenter

How do we program this thing?

Traditional Network Programming

Message passing between nodes

Really hard to do at scale:
» How to split the problem across nodes?
  • Must consider network and data locality
» How to deal with failures? If 1 server fails every 3 years, a 10,000-node cluster sees about 10 faults per day (see the quick check below)
» Even worse: stragglers (a node that has not failed, but is slow)

Almost nobody does message passing!
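A quick check of the failure-rate figure above, assuming one failure per server every 3 years and independent failures:

10,000 nodes × (1 failure / (3 × 365 node-days)) ≈ 9 failures per day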

Data-Parallel Models

Restrict the programming interface so that the system can do more automatically

“Here’s an operation, run it on all of the data”

»I don’t care where it runs (you schedule that)

»In fact, feel free to run it twice on different nodes

Biggest example: MapReduce

MapReduce

First widely popular programming model for data-intensive apps on clusters

Published by Google in 2004
» Processes 20 PB of data / day

Popularized by open-source Hadoop project

»40,000 nodes at Yahoo!, 70 PB at Facebook

MapReduce Programming Model

Data type: key-value records

Map function:

(K_in, V_in) → list(K_inter, V_inter)

Reduce function:

(K_inter, list(V_inter)) → list(K_out, V_out)

Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Word Count Execution

[Figure: Input → Map → Shuffle & Sort → Reduce → Output. Three map tasks process the lines "the quick brown fox", "the fox ate the mouse", and "how now brown cow", emitting (word, 1) pairs. Two reduce tasks then aggregate the counts: one produces (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3); the other produces (ate, 1), (cow, 1), (mouse, 1), (quick, 1).]

MapReduce Execution

Automatically split work into many small tasks

Send map tasks to nodes based on data locality

Load-balance dynamically as tasks finish

Fault Recovery

1. If a task crashes:
» Retry it on another node
  • OK for a map because it had no dependencies
  • OK for a reduce because map outputs are on disk
» If the same task repeatedly fails, end the job

Requires user code to be deterministic

Fault Recovery

2. If a node crashes:
» Relaunch its current tasks on other nodes
» Relaunch any maps the node previously ran
  • Necessary because their output files were lost along with the crashed node

Fault Recovery

3. If a task is going slowly (a straggler):

»Launch second copy of task on another node

»Take the output of whichever copy finishes first, and kill the other one

Example Applications

1. Search

Input: (lineNumber, line) records

Output: lines matching a given pattern

Map: if(line matches pattern): output(line)

Reduce: identity function
Alternative: no reducer (map-only job)
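A rough local Scala sketch of the map-only variant (written for this transcript, not code from the talk; the file name and pattern are placeholders):

import scala.io.Source

// Map-only search: emit every line matching the pattern; no reduce step.
val pattern = "ERROR"
Source.fromFile("input.txt").getLines()
      .filter(line => line.contains(pattern))
      .foreach(println)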

2. Sort

Input: (key, value) records

Output: same records, sorted by key

Map: identity function

Reduce: identity function

Trick: pick a partitioning function p so that k1 < k2 => p(k1) <= p(k2)

[Figure: three map tasks pass their inputs through unchanged ("pig, sheep, yak, zebra", "aardvark, ant", "bee, cow, elephant"); the partitioning function routes keys in [A-M] to one reducer and keys in [N-Z] to another, so the reducer outputs, concatenated, are globally sorted: (aardvark, ant, bee, cow, elephant) followed by (pig, sheep, yak, zebra).]
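As a concrete illustration of the trick (a sketch written for this transcript, not from the talk), here is a Scala partitioning function for lowercase word keys with two reducers, matching the [A-M] / [N-Z] split in the figure:

// Order-preserving partitioner: keys starting with 'a'..'m' go to
// reducer 0, keys starting with 'n'..'z' go to reducer 1. Because p
// is monotone in the key (k1 < k2 => p(k1) <= p(k2)), concatenating
// the sorted outputs of reducer 0 and reducer 1 gives a globally
// sorted result.
def p(key: String): Int = if (key.head <= 'm') 0 else 1

// p("aardvark") == 0, p("cow") == 0, p("pig") == 1, p("zebra") == 1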

3. Inverted Index

Input: (filename, text) records

Output: list of files containing each word

Map:
    for word in text.split():
        output(word, filename)

Reduce:
    def reduce(word, filenames):
        output(word, unique(filenames))

Inverted Index Example

[Figure: hamlet.txt contains "to be or not to be"; 12th.txt contains "be not afraid of greatness". Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). Reduce output: afraid → (12th.txt); be → (12th.txt, hamlet.txt); greatness → (12th.txt); not → (12th.txt, hamlet.txt); of → (12th.txt); or → (hamlet.txt); to → (hamlet.txt).]

4. Most Popular Words

Input: (filename, text) records

Output: the 100 words occurring in most files

Two-stage solution:
– Job 1:
  • Create an inverted index, giving (word, list(file)) records
– Job 2:
  • Map each (word, list(file)) to (count, word)
  • Sort these records by count, as in the sort job
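To make the two-stage plan concrete, here is a small local Scala sketch using ordinary collections rather than a cluster API; the docs map and its contents are made up for illustration:

// Hypothetical input: filename -> text
val docs = Map(
  "hamlet.txt" -> "to be or not to be",
  "12th.txt"   -> "be not afraid of greatness")

// Job 1: inverted index, giving (word, set of files containing it)
val index: Map[String, Set[String]] =
  docs.toSeq
      .flatMap { case (file, text) => text.split(" ").map(word => (word, file)) }
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).toSet) }

// Job 2: map each (word, files) entry to (count, word), sort by count,
// and keep the top 100
val top100: Seq[(Int, String)] =
  index.toSeq
       .map { case (word, files) => (files.size, word) }
       .sortBy { case (count, _) => -count }
       .take(100)

top100.foreach(println)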

Summary

By providing a data-parallel model, MapReduce greatly simplified cluster programming:
» Automatic division of the job into tasks
» Locality-aware scheduling
» Load balancing
» Recovery from failures & stragglers

But… the story doesn’t end here!

Outline

The big data problem

MapReduce model

Limitations of MapReduce

Spark model

Future directions

When an Abstraction is Useful…

People want to compose it!

Most real applications require multiple MR steps

» Google indexing pipeline: 21 steps
» Analytics queries (e.g. sessions, top K): 2-5 steps
» Iterative algorithms (e.g. PageRank): 10s of steps

Problems: programmability & performance

Programmability

Multi-step jobs create spaghetti code
» 21 MR steps -> 21 mapper and reducer classes

Lots of boilerplate wrapper code per step

API doesn’t provide type safety

Performance

MR only provides one pass of computation
» Must write data out to the file system in between

Expensive for apps that need to reuse data
» Multi-step algorithms (e.g. PageRank)
» Interactive data mining (many queries on the same data)

Users often hand-optimize by merging steps

Spark

Aims to address both problems

Programmability: clean, functional API
» Parallel transformations on collections
» 5-10x less code than MR
» Available in Scala, Java and Python

Performance:
» In-memory computing primitives
» Automatic optimization across operators

Spark Programmability

#include "mapreduce/mapreduce.h"

// User’s map functionclass SplitWords: public Mapper { public: virtual void Map(const MapInput& input) { const string& text = input.value(); const int n = text.size(); for (int i = 0; i < n; ) { // Skip past leading whitespace while (i < n && isspace(text[i])) i++; // Find word end int start = i; while (i < n && !isspace(text[i])) i++; if (start < i) Emit(text.substr( start,i-start),"1"); } }};

REGISTER_MAPPER(SplitWords);

// User’s reduce functionclass Sum: public Reducer { public: virtual void Reduce(ReduceInput* input) { // Iterate over all entries with the // same key and add the values int64 value = 0; while (!input->done()) { value += StringToInt( input->value()); input->NextValue(); } // Emit sum for input->key() Emit(IntToString(value)); }};

REGISTER_REDUCER(Sum);

int main(int argc, char** argv) { ParseCommandLineFlags(argc, argv); MapReduceSpecification spec; for (int i = 1; i < argc; i++) { MapReduceInput* in= spec.add_input(); in->set_format("text"); in->set_filepattern(argv[i]); in->set_mapper_class("SplitWords"); }

// Specify the output files MapReduceOutput* out = spec.output(); out->set_filebase("/gfs/test/freq"); out->set_num_tasks(100); out->set_format("text"); out->set_reducer_class("Sum");

// Do partial sums within map out->set_combiner_class("Sum");

// Tuning parameters spec.set_machines(2000); spec.set_map_megabytes(100); spec.set_reduce_megabytes(100); // Now run it MapReduceResult result; if (!MapReduce(spec, &result)) abort(); return 0; }

Google MapReduce WordCount:

Spark Programmability

Spark WordCount:

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.save("out.txt")

Spark Performance

Iterative algorithms:

[Charts: time per iteration. K-means clustering: Hadoop MR 121 s vs. Spark 4.1 s. Logistic regression: Hadoop MR 80 s vs. Spark 0.96 s.]

Spark Concepts

Resilient distributed datasets (RDDs)
» Immutable, partitioned collections of objects
» May be cached in memory for fast reuse

Operations on RDDs
» Transformations (build RDDs), actions (compute results)

Restricted shared variables
» Broadcast variables, accumulators

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")             // Base RDD
errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count         // Action
messages.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to workers; each worker reads its block of the file (Block 1-3), builds an in-memory cache of its messages partition (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: search of 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Ex:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Figure: lineage graph: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]

Demo

Example: Logistic Regression

Goal: find the best line separating two sets of points

[Figure: a scatter of + and – points in the plane; starting from a random initial line, the algorithm iteratively adjusts it toward the target line separating the two sets.]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

w is automatically shipped to the cluster

Logistic Regression Performance

[Chart: running time (min) vs. number of iterations (1 to 30) for Hadoop and Spark. Hadoop takes about 110 s per iteration; Spark takes about 80 s for the first iteration and about 1 s for each further iteration.]

Other RDD Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, cogroup, flatMap, union, join, cross, mapValues, ...

Actions (output a result):
collect, reduce, take, fold, count, saveAsTextFile, saveAsHadoopFile, ...
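A small sketch (written for this transcript, not from the slides) that chains a few of these operations, assuming an existing SparkContext named sc and, on older Spark releases, the SparkContext._ implicits in scope:

val nums   = sc.parallelize(1 to 100)                     // build an RDD
val evens  = nums.filter(_ % 2 == 0)                      // transformation
val byMod  = evens.map(n => (n % 10, n))                  // key by last digit
val sums   = byMod.reduceByKey(_ + _)                     // sum per key
val names  = sc.parallelize(Seq((0, "zero"), (2, "two")))
val joined = sums.join(names)                             // (key, (sum, name))

println(sums.count())                                     // action
joined.collect().foreach(println)                         // action
sums.saveAsTextFile("sums_out")                           // action; hypothetical output path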

Spark in Java and Python

Java:

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();

Python:

lines = sc.textFile(...)
lines.filter(lambda x: "error" in x).count()

Shared Variables

So far we’ve seen that RDD operations can use variables from outside their scope

By default, each task gets a read-only copy of each variable (no sharing)

Good place to enable other sharing patterns!

Example: Collaborative Filtering

Goal: predict users’ movie ratings based on past ratings of other movies

[Figure: a sparse ratings matrix R over users (rows) and movies (columns), with most entries unknown:

R =  | 1  ?  ?  4  5  ?  3 |
     | ?  ?  3  5  ?  ?  3 |
     | 5  ?  5  ?  ?  ?  1 |
     | 4  ?  ?  ?  ?  2  ? | ]

Model and Algorithm

Model R as the product of user and movie feature matrices A and B, of size U×K and M×K:

R ≈ A × Bᵀ

Alternating Least Squares (ALS)
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged

Serial ALS

var R = readRatingsMatrix(...)

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U) is a Range object
  B = (0 until M).map(i => updateMovie(i, A, R))
}

Naïve Spark ALS

var R = readRatingsMatrix(...)

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R is re-sent to all nodes in each iteration

Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as a broadcast variable

Result: 3× performance improvement

Accumulators

Apart from broadcast, another common sharing pattern is aggregation
» Add up multiple statistics about the data
» Count various events for debugging

Spark’s reduce operation does aggregation, but accumulators are another nice way to express it

Usage

val badRecords = sc.accumulator(0)
val badBytes = sc.accumulator(0.0)

records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).save(...)

printf("Total bad records: %d, avg size: %f\n",
       badRecords.value, badBytes.value / badRecords.value)

Accumulator Rules

Create with SparkContext.accumulator(initialVal)

“Add” to the value with += inside tasks
» Each task’s effect is only counted once

Access with .value, but only on the master
» Exception if you try it on workers

Retains efficiency and fault tolerance!

Job Scheduler

Captures the RDD dependency graph

Pipelines functions into “stages”

Cache-aware for data reuse & locality

Partitioning-aware to avoid shuffles (see the sketch after the figure below)

[Figure: an example dependency graph of RDDs A-G built with map, union, groupBy, and join, cut into three stages (Stage 1-3); partitions already in the cache (legend: "= cached partition") let the scheduler skip recomputing them.]
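To illustrate the partitioning-aware point, here is a minimal sketch written for this transcript (not from the slides), assuming a recent Spark release and an existing SparkContext named sc. Pre-partitioning and caching the large pair RDD lets later joins against it reuse its partitioning instead of reshuffling it:

import org.apache.spark.HashPartitioner

// Pre-partition and cache the large dataset once.
val userData = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
                 .partitionBy(new HashPartitioner(8))
                 .cache()

// A smaller dataset joined against it in a later job.
val events = sc.parallelize(Seq((1, "click"), (3, "view")))

// Because userData's partitioner is known, the scheduler only shuffles
// `events` to the matching partitions; the cached userData stays put.
val joined = userData.join(events)     // RDD[(Int, (String, String))]
joined.collect().foreach(println)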

User Community

3000 people attended online training

600 meetup members

15 companies contributing

Outline

The big data problem

MapReduce model

Limitations of MapReduce

Spark model

Future directions

Future Directions

As “big data” starts to be used for more apps, users’ demands are also growing:

»Latency: instead of training a model every night, can you train in real-time?

»High-level abstractions: matrices, graphs, etc – what is the equivalent of Matlab or R for clusters?

Spark Streaming

Extends the Spark API to do stream processing

»Run as a series of small, deterministic batches

Intermix with batch and ad-hoc queries:

sc.twitterStream(...)
  .filter(_.contains("spark"))
  .map(t => (t.user, 1))
  .runningReduce(_ + _)

[Figure: at each batch interval (t = 1, t = 2, …) the tweets RDD is mapped to (user, 1) pairs and reduced into running counts; boxes denote RDDs and their partitions.]

Streaming Results

Better performance than other models, while providing fault recovery properties they lack

[Charts: Scalability: records/s (millions) vs. nodes in the cluster (up to 100) for a sliding WordCount + Top K workload. Fault Recovery: processing time (s) with 30 s checkpoints on 20 and 40 nodes.]

Higher-Level Abstractions

SparkGraph: graph processing model

MLbase: declarative machine learning library

Shark: SQL queries

[Chart: query runtimes in seconds for Shark vs. Hadoop on selection, aggregation, and join queries; Shark runs the selection in about 1.1 s and the aggregation in about 32 s, well below the Hadoop runtimes.]

Conclusion

Commodity clusters are needed to handle big data, but pose key challenges (faults, stragglers)

Data-parallel models like MapReduce and Spark handle these automatically

Look for similar models for new problems

www.spark-project.org

Other Resources

Hadoop MapReduce: http://hadoop.apache.org/

Spark: http://spark-project.org

Hadoop video tutorials: www.cloudera.com/hadoop-training

Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/

Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of the working set in memory: cache disabled 68.8 s, 25% cached 58.1 s, 50% cached 40.7 s, 75% cached 29.7 s, fully cached 11.5 s.]