Parallel Processing MapReduce, FlumeJava and Dryad

Amir H. Payberah

KTH Royal Institute of Technology

Amir H. Payberah (KTH) Parallel Processing 2016/09/12 1 / 72

What do we do when there is too much data to process?


Scale Up vs. Scale Out (1/2)

- Scale up or scale vertically: adding resources to a single node in a system.

- Scale out or scale horizontally: adding more nodes to a system.


Scale Up vs. Scale Out (2/2)

- Scale up: more expensive than scaling out.

- Scale out: more challenging for fault tolerance and software development.


Taxonomy of Parallel Architectures

DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. Communications of the ACM, 35(6), 85-98, 1992.


MapReduce


MapReduce

- A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters of commodity hardware.


Challenges

- How to distribute computation?

- How can we make it easy to write distributed programs?

- How to cope with machine failures?


Idea

- Issue:
  • Copying data over a network takes time.

- Idea:
  • Bring computation close to the data.
  • Store files multiple times for reliability.


Simplicity

- Don’t worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these).

- Hide system-level details from programmers.

Simplicity!


MapReduce Definition

- A programming model: to batch process large data sets (inspired by functional programming).

- An execution framework: to run parallel algorithms on clusters of commodity hardware.


Programming Model


Warmup Task

- We have a huge text document.

- Count the number of times each distinct word appears in the file.

- If the file fits in memory: words(doc.txt) | sort | uniq -c

- If not?


MapReduce Programming Model

- words(doc.txt) | sort | uniq -c

- Sequentially read a lot of data.

- Map: extract something you care about.

- Group by key: sort and shuffle.

- Reduce: aggregate, summarize, filter or transform.

- Write the result.
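The steps above can be sketched in a single process. This is only an illustration of the programming model, not how a real framework executes it: the class `MiniMapReduce` and its methods `run`, `emitWords` and `wordCount` are hypothetical names, everything lives in memory, and a `TreeMap` stands in for the sort-and-shuffle step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Single-process sketch of the pipeline: map -> group by key -> reduce.
// MiniMapReduce and its methods are illustrative names, not a real API.
final class MiniMapReduce {

    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, O> reducer) {
        // Map and group by key; the TreeMap keeps keys sorted, like the sort step.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> pair : mapper.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        // Reduce: one call per distinct key.
        Map<K, O> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Map function for word count: emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> emitWords(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Word count: the reduce function sums the ones collected for each word.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                MiniMapReduce::emitWords,
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

The user only supplies the two functions passed to `run`; reading, grouping and writing are the framework's job, which is the point of the model.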










MapReduce Dataflow

- map function: processes data and generates a set of intermediate key/value pairs.

- reduce function: merges all intermediate values associated with the same intermediate key.


Word Count in MapReduce

- Consider doing a word count of the following file using MapReduce:

Hello World Bye World

Hello Hadoop Goodbye Hadoop


Word Count in MapReduce - map

- The map function reads in words one at a time and outputs (word, 1) for each parsed input word.

- The map function output is:

(Hello, 1)

(World, 1)

(Bye, 1)

(World, 1)

(Hello, 1)

(Hadoop, 1)

(Goodbye, 1)

(Hadoop, 1)


Word Count in MapReduce - shuffle

- The shuffle phase between the map and reduce phases creates a list of values associated with each key.

- The reduce function input is:

(Bye, (1))

(Goodbye, (1))

(Hadoop, (1, 1))

(Hello, (1, 1))

(World, (1, 1))


Word Count in MapReduce - reduce

- The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.

- The output of the reduce function is the output of the MapReduce job:

(Bye, 1)

(Goodbye, 1)

(Hadoop, 2)

(Hello, 2)

(World, 2)


Combiner Function (1/2)

- In some cases, there is significant repetition in the intermediate keys produced by each map task, and the reduce function is commutative and associative.

Machine 1:

(Hello, 1)

(World, 1)

(Bye, 1)

(World, 1)

Machine 2:

(Hello, 1)

(Hadoop, 1)

(Goodbye, 1)

(Hadoop, 1)


Combiner Function (2/2)

- Users can specify an optional combiner function to partially merge data before it is sent over the network to the reduce function.

- Typically the same code is used to implement both the combiner and the reduce function.


Machine 1:

(Hello, 1)

(World, 2)

(Bye, 1)

Machine 2:

(Hello, 1)

(Hadoop, 2)

(Goodbye, 1)
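As a rough illustration of what the combiner saves, the sketch below pre-sums the (word, 1) pairs of one map machine before anything would be sent over the network. `CombinerSketch` and `combine` are hypothetical names, and the input lists simply replay the map output of the two example lines.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a word-count combiner: pre-sum the (word, 1) pairs on each map machine.
// Names are illustrative; a real combiner runs inside the framework.
final class CombinerSketch {

    // Merges one machine's map output locally; insertion order is kept for readability.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (String word : emittedWords) {
            // Each emitted word carries the value 1; summing is commutative and associative.
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> machine1 = Arrays.asList("Hello", "World", "Bye", "World");
        List<String> machine2 = Arrays.asList("Hello", "Hadoop", "Goodbye", "Hadoop");
        // Each machine now ships 3 pairs instead of 4.
        System.out.println(combine(machine1)); // {Hello=1, World=2, Bye=1}
        System.out.println(combine(machine2)); // {Hello=1, Hadoop=2, Goodbye=1}
    }
}
```

The reduce function still has to add up the partial sums from different machines, which is why the operation must be commutative and associative.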


Example: Word Count - map

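import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters (elided as <...> on the slide) are inferred from the map() signature.
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token of the input line.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}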


Example: Word Count - reduce

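import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters (elided as <...> on the slide) are inferred from the summed IntWritable values.
public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // The values arrive as an Iterable; an Iterator parameter would not override reduce().
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}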


Example: Word Count - driver

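import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job.getInstance replaces the deprecated Job(conf, name) constructor.
    Job job = Job.getInstance(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(MyMap.class);
    // Summation is commutative and associative, so the reducer doubles as the combiner.
    job.setCombinerClass(MyReduce.class);
    job.setReducerClass(MyReduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // args[0] is the input path, args[1] the output path.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}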


Implementation


Architecture


MapReduce Execution (1/7)

- The user program divides the input files into M splits.
  • A typical size of a split is the size of an HDFS block (64 MB).
  • It converts the splits to key/value pairs.

- It starts up many copies of the program on a cluster of machines.
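To make the number of splits M concrete, here is a small arithmetic sketch. It assumes one split per 64 MB block and rounds up for a trailing partial block; `SplitSketch` and `numSplits` are hypothetical names, and real input formats may apply additional rules.

```java
// Arithmetic sketch: how many splits M a file yields for a given split size.
// Assumes one split per block plus one for a trailing partial block.
final class SplitSketch {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, as on the slide

    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        if (fileSizeBytes < 0 || splitSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and split size positive");
        }
        // Integer ceiling of fileSizeBytes / splitSizeBytes.
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(numSplits(oneGiB, BLOCK_SIZE));     // 16
        System.out.println(numSplits(oneGiB + 1, BLOCK_SIZE)); // 17
    }
}
```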



MapReduce Execution (2/7)

- One of the copies of the program is the master, and the rest are workers.

- The master assigns work to the workers.
  • It picks idle workers and assigns each one a map task or a reduce task.



MapReduce Execution (3/7)

- A map worker reads the contents of the corresponding input splits.

- It parses key/value pairs out of the input data and passes each pair to the user-defined map function.

- The intermediate key/value pairs produced by the map function are buffered in memory.



MapReduce Execution (4/7)

- The buffered pairs are periodically written to local disk.
  • They are partitioned into R regions (hash(key) mod R).

- The locations of the buffered pairs on the local disk are passed back to the master.

- The master forwards these locations to the reduce workers.
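The partitioning rule hash(key) mod R can be sketched as follows. `PartitionSketch` and `partition` are hypothetical names; the only assumption is that the key's `hashCode()` plays the role of the hash. `Math.floorMod` is used instead of `%` because `%` can return a negative value when the hash code is negative.

```java
import java.util.Arrays;

// Sketch of hash partitioning: every key is assigned to one of R regions.
final class PartitionSketch {

    // Returns a region index in [0, numRegions), identical for equal keys.
    static int partition(Object key, int numRegions) {
        if (numRegions <= 0) {
            throw new IllegalArgumentException("numRegions must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numRegions);
    }

    public static void main(String[] args) {
        int r = 2; // number of reduce tasks, chosen arbitrarily for the demo
        for (String word : Arrays.asList("Hello", "World", "Bye", "Hadoop", "Goodbye")) {
            System.out.println(word + " -> region " + partition(word, r));
        }
    }
}
```

Because the region depends only on the key, all pairs with the same key end up at the same reduce worker, no matter which map worker produced them.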



MapReduce Execution (5/7)

- A reduce worker reads the buffered data from the local disks of the map workers.

- When a reduce worker has read all intermediate data, it sorts it by the intermediate keys.



MapReduce Execution (6/7)

- The reduce worker iterates over the intermediate data.

- For each unique intermediate key, it passes the key and the corresponding set of intermediate values to the user-defined reduce function.

- The output of the reduce function is appended to a final output file for this reduce partition.
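The sort-then-iterate behaviour of a reduce worker might look like the sketch below: sorting brings equal keys next to each other, so one linear scan can hand each run of values to the reduce function. `ReduceWorkerSketch` and `sortAndReduce` are hypothetical names, the reduce function is hard-wired to summation, and the "output file" is just a list of lines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Sketch of a reduce worker: sort fetched pairs by key, then reduce each run of equal keys.
final class ReduceWorkerSketch {

    static List<String> sortAndReduce(List<Map.Entry<String, Integer>> fetched) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(fetched);
        sorted.sort(Map.Entry.comparingByKey());

        List<String> outputLines = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).getKey();
            int sum = 0;
            // Consume the whole run of pairs that share this key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();
                i++;
            }
            // Append one output record per distinct key.
            outputLines.add(key + "\t" + sum);
        }
        return outputLines;
    }

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        System.out.println(sortAndReduce(Arrays.asList(
                pair("World", 1), pair("Hello", 1), pair("World", 1))));
        // [Hello	1, World	2]
    }
}
```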



MapReduce Execution (7/7)

- When all map tasks and reduce tasks have been completed, the master wakes up the user program.


Hadoop MapReduce and HDFS


Fault Tolerance - Worker

- Detect failure via periodic heartbeats.

- Re-execute in-progress map and reduce tasks.

- Re-execute completed map tasks: their output is stored on the local disk of the failed machine and is therefore inaccessible.

- Completed reduce tasks do not need to be re-executed since their output is stored in a global filesystem.
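The re-execution rules above can be written down as a small decision procedure. This is a sketch of the bookkeeping only, with hypothetical names (`WorkerFailureSketch`, `Task`, `onWorkerFailure`); it assumes the master keeps, per task, its kind, its state and the worker it was assigned to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the master's reaction to a failed worker.
final class WorkerFailureSketch {

    enum Kind { MAP, REDUCE }
    enum State { IDLE, IN_PROGRESS, COMPLETED }

    static final class Task {
        final String id;
        final Kind kind;
        State state;
        String worker;

        Task(String id, Kind kind, State state, String worker) {
            this.id = id;
            this.kind = kind;
            this.state = state;
            this.worker = worker;
        }
    }

    // Resets every task that must be re-executed to IDLE and returns their ids.
    static List<String> onWorkerFailure(List<Task> tasks, String failedWorker) {
        List<String> rescheduled = new ArrayList<>();
        for (Task task : tasks) {
            if (!failedWorker.equals(task.worker)) {
                continue;
            }
            // In-progress work is lost; completed map output sat on the failed local disk.
            boolean lost = task.state == State.IN_PROGRESS
                    || (task.state == State.COMPLETED && task.kind == Kind.MAP);
            if (lost) {
                task.state = State.IDLE;
                task.worker = null;
                rescheduled.add(task.id);
            }
        }
        return rescheduled;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
                new Task("m1", Kind.MAP, State.COMPLETED, "w1"),
                new Task("r1", Kind.REDUCE, State.COMPLETED, "w1"),
                new Task("r2", Kind.REDUCE, State.IN_PROGRESS, "w1"));
        System.out.println(onWorkerFailure(tasks, "w1")); // [m1, r2]
    }
}
```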


Fault Tolerance - Master

- State is periodically checkpointed: a new copy of the master starts from the last checkpoint state.


MapReduce Limitations

- Redundant processing

- Lack of early termination

- Lack of iteration

- Lack of interactive processing

- Lack of real-time processing


FlumeJava


Motivation (1/2)

- It is easy in MapReduce:
  words(doc.txt) | sort | uniq -c

- What about this one?
  words(doc.txt) | grep | sed | sort | awk | perl


Motivation (2/2)

- Big jobs in MapReduce run as more than one MapReduce stage.

- Reducers of each stage write to replicated storage, e.g., HDFS.


FlumeJava

- FlumeJava is a library provided by Google to simplify the creation of pipelined MapReduce tasks.


Parallel Collections

- A few classes that represent parallel collections and abstract away the details of how data is represented.

- PCollection<T>: an immutable bag of elements of type T.

- PTable<K, V>: an immutable multi-map with keys of type K and values of type V.

- The main way to manipulate these collections is to invoke a data-parallel operation (transform) on them.














Transforms (1/2)

- parallelDo(): elementwise computation over an input PCollection<T> to produce a new output PCollection<S>.

- groupByKey(): converts a multi-map of type PTable<K, V> into a uni-map of type PTable<K, Collection<V>>.


Transforms (2/2)

- combineValues(): takes an input PTable<K, Collection<V>> and an associative combining function on Vs, and returns a PTable<K, V>, where each input collection of values has been combined into a single output value.

- flatten(): takes a list of PCollection<T>s and returns a single PCollection<T> that contains all the elements of the input PCollections.
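To make the semantics of the four transforms concrete, the sketch below mimics them on ordinary in-memory collections. It is not the FlumeJava API: a `List<T>` stands in for `PCollection<T>`, a list of map entries stands in for `PTable<K, V>`, and `TransformsSketch` with its methods is a hypothetical helper class.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// In-memory stand-ins that mirror the transform semantics only.
final class TransformsSketch {

    // parallelDo: apply fn to every element; each element may emit zero or more outputs.
    static <T, S> List<S> parallelDo(List<T> input, Function<T, List<S>> fn) {
        List<S> output = new ArrayList<>();
        for (T element : input) {
            output.addAll(fn.apply(element));
        }
        return output;
    }

    // groupByKey: multi-map of (K, V) pairs -> one collection of values per key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : table) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // combineValues: fold each key's values with an associative combiner.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> combined = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            group.getValue().stream().reduce(combiner)
                 .ifPresent(value -> combined.put(group.getKey(), value));
        }
        return combined;
    }

    // flatten: several collections -> one collection with all their elements.
    static <T> List<T> flatten(List<List<T>> collections) {
        List<T> all = new ArrayList<>();
        for (List<T> collection : collections) {
            all.addAll(collection);
        }
        return all;
    }

    // Word count as a chain of the four transforms.
    static Map<String, Integer> wordCount(List<List<String>> sources) {
        List<String> lines = flatten(sources);
        List<Map.Entry<String, Integer>> ones = parallelDo(lines, line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
            return out;
        });
        Map<String, List<Integer>> grouped = groupByKey(ones);
        return combineValues(grouped, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                Arrays.asList("Hello World Bye World"),
                Arrays.asList("Hello Hadoop Goodbye Hadoop"))));
        // {Hello=2, World=2, Bye=1, Hadoop=2, Goodbye=1}
    }
}
```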


Word Count in FlumeJava

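import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

// This example uses the API of Apache Crunch, an open-source library modeled on FlumeJava.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(WordCount.class);
        PCollection<String> lines = pipeline.readTextFile(args[0]);

        // parallelDo: split every line into words.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // Count the occurrences of each distinct word.
        PTable<String, Long> counts = Aggregate.count(words);
        pipeline.writeTextFile(counts, args[1]);
        // Nothing runs until the pipeline is asked to finish.
        pipeline.done();
    }
}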


Dataflow Optimization (1/2)

- ParallelDo fusion
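One way to picture ParallelDo fusion is as function composition: two consecutive element-wise operations f and g become a single operation, so the intermediate collection between them is never materialized. The sketch below shows that idea only; `FusionSketch`, `fuse`, `splitWords` and `keepLongLowercased` are hypothetical names, and the optimizer's actual rules are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Sketch of fusing two element-wise operations into one.
final class FusionSketch {

    // Returns one function that applies f, then g to each of f's outputs.
    static <T, S, R> Function<T, List<R>> fuse(Function<T, List<S>> f, Function<S, List<R>> g) {
        return element -> {
            List<R> out = new ArrayList<>();
            for (S intermediate : f.apply(element)) {
                out.addAll(g.apply(intermediate));
            }
            return out;
        };
    }

    // First operation: split a line into words.
    static List<String> splitWords(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    // Second operation: keep words longer than three characters, lowercased.
    static List<String> keepLongLowercased(String word) {
        return word.length() > 3
                ? Collections.singletonList(word.toLowerCase(Locale.ROOT))
                : Collections.<String>emptyList();
    }

    static List<String> demo(String line) {
        Function<String, List<String>> fused =
                FusionSketch.<String, String, String>fuse(
                        FusionSketch::splitWords, FusionSketch::keepLongLowercased);
        return fused.apply(line);
    }

    public static void main(String[] args) {
        System.out.println(demo("Hello World Bye World")); // [hello, world, world]
    }
}
```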


Dataflow Optimization (2/2)

- MapShuffleCombineReduce (MSCR): combining ParallelDo, GroupByKey, CombineValues, and Flatten into single MapReduces.

- Generalizes MapReduce
  • Multiple reducers/combiners
  • Multiple outputs per reducer
  • Pass-through outputs


FlumeJava Workflow


Dryad


Motivation (1/3)


Motivation (2/3)

- In Dryad, each job is represented with a DAG.

- Intermediate vertices write to channels.

- More operations than map and reduce, e.g., join and distribute.


Motivation (3/3)

- Dataflow is a popular abstraction for parallel programming.

- Don’t worry about the global state of a system: just write simple vertices that maintain local state and communicate with other vertices through edges.

- MapReduce is a simple form of dataflow, with two types of vertices: the mapper and the reducer.


Programming Model


Programming Model (1/2)

- Jobs are expressed as a dataflow graph: a Directed Acyclic Graph (DAG).

- Vertices are computations.

- Edges are communication channels.


Programming Model (2/2)

- Each vertex can have several input and output channels.

- Each vertex runs one or more times.

- Stop when all vertices have completed their execution at least once.
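A minimal way to picture this model is a graph whose vertices become runnable once all of their inputs have been produced. The sketch below computes such an execution order with a ready queue (Kahn's topological sort). It is an illustration rather than Dryad's scheduler: `DagSketch` and `executionOrder` are hypothetical names, vertices are plain strings, and each vertex runs exactly once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of running a dataflow DAG: a vertex is ready when all its inputs are done.
final class DagSketch {

    // edges maps each vertex to the vertices that consume its output.
    static List<String> executionOrder(Map<String, List<String>> edges) {
        // Count how many inputs each vertex is still waiting for.
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            pendingInputs.putIfAbsent(entry.getKey(), 0);
            for (String consumer : entry.getValue()) {
                pendingInputs.merge(consumer, 1, Integer::sum);
            }
        }
        // Vertices without inputs can start immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pendingInputs.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.poll();
            order.add(vertex); // "run" the vertex
            for (String consumer : edges.getOrDefault(vertex, Collections.<String>emptyList())) {
                if (pendingInputs.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        if (order.size() != pendingInputs.size()) {
            throw new IllegalArgumentException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("in1", Collections.singletonList("join"));
        edges.put("in2", Collections.singletonList("join"));
        edges.put("join", Collections.singletonList("out"));
        System.out.println(executionOrder(edges)); // [in1, in2, join, out]
    }
}
```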


Graph Description Operators (1/2)


Graph Description Operators (2/2)

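// Operators: ^ replicates a vertex, >= connects two stages pointwise,
// >> connects two stages as a complete bipartite graph, || merges two graphs.
GraphBuilder XSet = moduleX^N;
GraphBuilder DSet = moduleD^N;
GraphBuilder MSet = moduleM^(N*4);
GraphBuilder SSet = moduleS^(N*4);
GraphBuilder YSet = moduleY^N;
GraphBuilder HSet = moduleH^1;

// Both inputs feed the X stage.
GraphBuilder XInputs = (ugriz1 >= XSet) ||
                       (neighbor >= XSet);
GraphBuilder YInputs = ugriz2 >= YSet;

GraphBuilder XToY = XSet >= DSet >> MSet >= SSet;
// Every group of four S vertices feeds one Y vertex.
for (int i = 0; i < N*4; ++i) {
    XToY = XToY ||
           (SSet.GetVertex(i) >= YSet.GetVertex(i/4));
}

GraphBuilder YToH = YSet >= HSet;
GraphBuilder HOutputs = HSet >= output;

// Merge all sub-graphs into the final job graph.
GraphBuilder final = XInputs || YInputs ||
                     XToY || YToH || HOutputs;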
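The two composition operators used above can be illustrated by the edges they create. The sketch assumes the usual reading of the operators: `>=` connects outputs to inputs one by one, wrapping around round-robin when the stage sizes differ, and `>>` connects every vertex of the first stage to every vertex of the second. `CompositionSketch` and its methods are hypothetical names, and vertices are plain strings rather than GraphBuilder objects.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the edges produced by replication (^), pointwise (>=) and bipartite (>>) composition.
final class CompositionSketch {

    // name^n: n copies of one vertex, named name0 .. name(n-1).
    static List<String> replicate(String name, int n) {
        List<String> stage = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            stage.add(name + i);
        }
        return stage;
    }

    // a >= b: pointwise edges, assigned round-robin when the stages differ in size.
    static List<String> pointwise(List<String> a, List<String> b) {
        List<String> edges = new ArrayList<>();
        if (a.isEmpty() || b.isEmpty()) {
            return edges;
        }
        int n = Math.max(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            edges.add(a.get(i % a.size()) + "->" + b.get(i % b.size()));
        }
        return edges;
    }

    // a >> b: complete bipartite graph between the two stages.
    static List<String> bipartite(List<String> a, List<String> b) {
        List<String> edges = new ArrayList<>();
        for (String from : a) {
            for (String to : b) {
                edges.add(from + "->" + to);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(pointwise(replicate("X", 2), replicate("D", 2)));
        // [X0->D0, X1->D1]
        System.out.println(bipartite(replicate("D", 2), replicate("M", 3)));
        // [D0->M0, D0->M1, D0->M2, D1->M0, D1->M1, D1->M2]
    }
}
```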


Word Count in DryadLINQ

using System;
using System.Collections.Generic;

// DryadLinqContext and LineRecord are provided by the DryadLINQ library.
public class WordCount {
    public static void WordCountExample() {
        var config = new DryadLinqContext(1);
        var lines = new LineRecord[] { new LineRecord("This is a dummy line") };
        var input = config.FromEnumerable(lines);

        // Split every line into words, then group identical words.
        var words = input.SelectMany(x => x.Line.Split(' '));
        var groups = words.GroupBy(x => x);

        // One (word, count) pair per group.
        var counts = groups.Select(x =>
            new KeyValuePair<string, int>(x.Key, x.Count()));
        var toOutput = counts.Select(x =>
            new LineRecord(String.Format("{0}: {1}", x.Key, x.Value)));

        foreach (LineRecord line in toOutput) {
            Console.WriteLine(line.Line);
        }
    }
}


Implementation


Dryad Architecture

- Job manager (JM)

- Vertices (V)

- Daemon (D)

- Name server (NS)


Job Manager

- Constructs the job’s DAG.

- Schedules the work across the available resources in the cluster.

- Performs dynamic graph refinements.


Vertices and Channels

- Vertex: arbitrary binary application code.
  • The binary code will be sent to the corresponding node directly from the JM.

- Channels: transport a finite sequence of structured items between vertices.
  • Files, TCP pipes, or shared memory (FIFO)


Daemons

- Run on each computer in the cluster.

- Create processes on behalf of the JM.

- Act as a proxy so that the JM can communicate with the remote vertices.


Name Server

- Enumerates all the available computers in the cluster.

- Exposes the position of each computer within the network topology: locality.


Dryad Execution (1/2)


Dryad Execution (2/2)

- The dataflow is mapped onto a set of computation engines.

- At runtime, the JM monitors the states of the vertices through the daemons.

- When all input channels of a vertex become ready, a new execution record is created for the vertex and placed in a scheduling queue.

- Prefer executing a vertex near its inputs.

- If every vertex eventually completes, then the job is deemed to have completed successfully.


Job Stages and Scalability (1/2)


Job Stages and Scalability (2/2)

- Stage manager
  • Locality
  • Replicated stages to avoid the straggler problem

- words(doc.txt) | grep | sed | sort | awk | perl


Fault Tolerance

- JM fails
  • Computation fails.

- Vertex computation fails
  • Restart the vertex with a different version number.
  • A previous instance of the vertex may run in parallel with new instances.


Summary


Summary

- Scaling out: shared-nothing architecture

- MapReduce
  • Programming model: Map and Reduce
  • Execution framework

- FlumeJava
  • Dataflow DAG
  • Parallel collections: PCollection and PTable
  • Transforms: ParallelDo, GroupByKey, CombineValues, Flatten

- Dryad
  • Dataflow DAG
  • Job manager, vertices and channels, name server


Questions?

