MapReduce_2

1/26

Lecture 2

MapReduce

CPE 458 Parallel Programming, Spring 2009

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
http://creativecommons.org/licenses/by/2.5


2/26

Outline

MapReduce: Programming Model

MapReduce Examples

A Brief History

MapReduce Execution Overview

Hadoop

MapReduce Resources


3/26

MapReduce

"A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs."

Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.


4/26

MapReduce

More simply, MapReduce is:

A parallel programming model and associated implementation.


5/26


6/26

Programming Models

von Neumann model
  Execute a stream of instructions (machine code)
  Instructions can specify:
    Arithmetic operations
    Data addresses
    Next instruction to execute
  Complexity
    Track billions of data locations and millions of instructions
    Manage with:
      Modular design
      High-level programming languages (isomorphic)


7/26

Programming Models

Parallel Programming Models
  Message passing
    Independent tasks encapsulating local data
    Tasks interact by exchanging messages
  Shared memory
    Tasks share a common address space
    Tasks interact by reading and writing this space asynchronously
  Data parallelization
    Tasks execute a sequence of independent operations
    Data usually evenly partitioned across tasks
    Also referred to as "embarrassingly parallel"


8/26

MapReduce: Programming Model

Process data using special map() and reduce() functions
  The map() function is called on every item in the input and emits a series of intermediate key/value pairs
  All values associated with a given key are grouped together
  The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output
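The three phases on this slide can be sketched as a single-process driver. This is only an illustration of the programming model, not of any real framework: run_mapreduce, length_map, and count_reduce are hypothetical names, and everything runs in memory on one machine.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    inputs: Iterable,
    map_fn: Callable[[object], Iterable[Tuple[object, object]]],
    reduce_fn: Callable[[object, List], object],
) -> Dict:
    """Single-process stand-in for the framework (hypothetical helper)."""
    groups: Dict = defaultdict(list)
    # Map phase: call map_fn on every input item.
    for item in inputs:
        for key, value in map_fn(item):
            # Grouping: all values for a given key are collected in one list.
            groups[key].append(value)
    # Reduce phase: call reduce_fn once per unique key with its value list.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def length_map(word: str):
    # Emit one intermediate pair: (length of the word, the word itself).
    yield len(word), word


def count_reduce(length: int, words: List[str]) -> int:
    # Emit how many words share this length.
    return len(words)


result = run_mapreduce(["map", "reduce", "key", "shards"], length_map, count_reduce)
print(result)  # {3: 2, 6: 2}
```

The user supplies only length_map and count_reduce; the grouping step in the middle is what the framework provides.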


9/26

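MapReduce: Programming Model

[Figure: word-count data flow, Input -> Map -> MapReduce Framework -> Reduce -> Output. The input lines "How now", "Brown cow", "How does", and "It work now" feed four map tasks (M). Two reduce tasks (R) produce the output: brown 1, cow 1, does 1, How 2 and it 1, now 2, work 1.]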
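The word-count example in the figure can be reproduced with a few lines of Python. This is a sketch under one stated assumption: every word is lowercased before counting, so the result shows "how" where the slide shows "How". The names map_words and reduce_counts are illustrative.

```python
from collections import defaultdict


def map_words(line):
    # Emit (word, 1) for every word. Lowercasing is an assumption of this sketch.
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    # Sum the list of 1s collected for this word.
    return sum(counts)


lines = ["How now", "Brown cow", "How does", "It work now"]

# Group intermediate values by key, as the framework would.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_words(line):
        grouped[word].append(one)

word_counts = {word: reduce_counts(word, ones) for word, ones in sorted(grouped.items())}
for word, count in word_counts.items():
    print(word, count)
# brown 1
# cow 1
# does 1
# how 2
# it 1
# now 2
# work 1
```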


10/26

MapReduce: Programming Model

More formally,

Map(k1,v1) --> list(k2,v2)

Reduce(k2, list(v2)) --> list(v2)
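The two signatures can be written down as Python type aliases, which makes it visible that the input pair (k1, v1) and the intermediate pair (k2, v2) may come from different domains. The aliases MapFn and ReduceFn and the sample functions are hypothetical names chosen for this sketch.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# Map(k1, v1) --> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# Reduce(k2, list(v2)) --> list(v2)
ReduceFn = Callable[[K2, List[V2]], List[V2]]


def longest_map(line_no: int, line: str) -> List[Tuple[str, int]]:
    # k1 = line number, v1 = line text; k2 = initial letter, v2 = word length.
    return [(word[0], len(word)) for word in line.split()]


def longest_reduce(initial: str, lengths: List[int]) -> List[int]:
    # Returns a list of v2, here holding the single longest length.
    return [max(lengths)]


print(longest_map(0, "map merge reduce"))  # [('m', 3), ('m', 5), ('r', 6)]
print(longest_reduce("m", [3, 5]))         # [5]
```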


11/26

MapReduce Runtime System

1. Partitions input data

2. Schedules execution across a set of machines

3. Handles machine failure

4. Manages interprocess communication


12/26

MapReduce Benefits

Greatly reduces parallel programming complexity
  Reduces synchronization complexity
  Automatically partitions data
  Provides failure transparency
  Handles load balancing

Practical
  Approximately 1000 Google MapReduce jobs run every day.


13/26

MapReduce Examples

Word frequency

[Figure: word-frequency pipeline with the labels doc, Map, Runtime System, and Reduce.]


14/26

MapReduce Examples

Distributed grep
  Map function emits if word matches search criteria
  Reduce function is the identity function

URL access frequency
  Map function processes web logs, emits
  Reduce function sums values and emits
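Both examples can be sketched in Python. The slide text does not preserve exactly what each function emits, so the sketch makes assumptions that follow the Dean and Ghemawat paper: grep emits the matching line, the URL map emits a (URL, 1) pair per request, and the URL reduce emits the total per URL. The log format "client url" and all function names are hypothetical.

```python
import re
from collections import defaultdict


def grep_map(line, pattern):
    # Assumption: emit the whole line when any word in it matches the pattern.
    if any(re.search(pattern, word) for word in line.split()):
        yield line, line


def identity_reduce(key, values):
    # Identity: pass the intermediate values through unchanged.
    return values


def url_map(log_line):
    # Assumption: hypothetical log format "<client> <url>"; emit (url, 1).
    _client, url = log_line.split()
    yield url, 1


def sum_reduce(url, counts):
    # Sum the 1s to get the number of accesses for this URL.
    return sum(counts)


def run(inputs, map_fn, reduce_fn):
    # Minimal in-memory stand-in for the framework's grouping step.
    grouped = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


log = ["10.0.0.1 /index.html", "10.0.0.2 /about.html", "10.0.0.3 /index.html"]
url_counts = run(log, url_map, sum_reduce)
print(url_counts)  # {'/index.html': 2, '/about.html': 1}

text = ["map tasks run first", "then reduce tasks", "output is written"]
matches = run(text, lambda line: grep_map(line, r"^tasks$"), identity_reduce)
print(list(matches))  # ['map tasks run first', 'then reduce tasks']
```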


15/26

A Brief History

Functional programming (e.g., Lisp)
  map() function
    Applies a function to each value of a sequence
  reduce() function
    Combines all elements of a sequence using a binary operator
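The slide names Lisp; the same two functions exist in Python, which is used here only for illustration. map() is a built-in and reduce() lives in functools.

```python
from functools import reduce
import operator

values = [1, 2, 3, 4]

# map(): apply a function to each value of the sequence.
squares = list(map(lambda x: x * x, values))

# reduce(): combine all elements using a binary operator (here, addition).
total = reduce(operator.add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```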


16/26

MapReduce Execution Overview

1. The user program, via the MapReduce library, shards the input data.

[Figure: the User Program splits the Input Data into Shard 0 through Shard 6.]

* Shards are typically 16-64 MB in size.
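As a rough sketch of sharding, the helper below cuts an input of a given size into fixed-size byte ranges. It is a simplification: shard_ranges is a hypothetical name, and a real library would also respect record boundaries, which this sketch ignores.

```python
def shard_ranges(total_bytes: int, shard_bytes: int):
    """Return (start, end) byte offsets for consecutive shards (hypothetical helper)."""
    if shard_bytes <= 0:
        raise ValueError("shard_bytes must be positive")
    return [
        (start, min(start + shard_bytes, total_bytes))
        for start in range(0, total_bytes, shard_bytes)
    ]


MB = 1024 * 1024
# A 200 MB input with 64 MB shards gives three full shards and one 8 MB shard.
ranges = shard_ranges(200 * MB, 64 * MB)
print(len(ranges))  # 4
```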


17/26

MapReduce Execution Overview

2. The user program creates process copies distributed on a machine cluster. One copy will be the Master and the others will be workers.

[Figure: the User Program starts one Master and several Workers.]


18/26

MapReduce Execution Overview

3. The master distributes M map and R reduce tasks to idle workers.
  M == number of shards
  R == the intermediate key space is divided into R parts

[Figure: the Master sends Message(Do_map_task) to an Idle Worker.]
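The slide says only that the intermediate key space is divided into R parts. One common way to do this, and the default described in the Dean and Ghemawat paper, is hash(key) mod R. The sketch below assumes that scheme; partition and partition_pairs are hypothetical names, and CRC32 stands in for the hash function.

```python
import zlib


def partition(key: str, R: int) -> int:
    # crc32 is used instead of the built-in hash() because str hashes
    # differ between Python processes, and every worker must agree.
    return zlib.crc32(key.encode("utf-8")) % R


def partition_pairs(pairs, R):
    # Place each intermediate (key, value) pair into one of R regions.
    regions = [[] for _ in range(R)]
    for key, value in pairs:
        regions[partition(key, R)].append((key, value))
    return regions


pairs = [("how", 1), ("now", 1), ("brown", 1), ("cow", 1), ("how", 1)]
regions = partition_pairs(pairs, 2)
for index, region in enumerate(regions):
    print(index, region)
```

Because the region depends only on the key, both ("how", 1) pairs land in the same region, so a single reduce task sees every value for that key.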


19/26

MapReduce Execution Overview

4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs. Output is buffered in RAM.

[Figure: a Map worker reads Shard 0 and produces key/value pairs.]


20/26

MapReduce Execution Overview

5. Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process.

[Figure: a Map worker writes to Local Storage and sends the disk locations to the Master.]


21/26

MapReduce Execution Overview

6. Master process gives disk locations to an available reduce-task worker, which reads all associated intermediate data.

[Figure: the Master sends disk locations to a Reduce worker, which reads from remote storage.]


22/26

MapReduce Execution Overview

7. Each reduce-task worker sorts its intermediate data. It calls the reduce function, passing in each unique key and its associated values. Reduce function output is appended to the reduce task's partition output file.

[Figure: a Reduce worker sorts data and writes a partition output file.]
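Step 7 can be sketched as sort, then group, then reduce. In this sketch a Python list stands in for the partition output file, and run_reduce_task is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(pairs, reduce_fn):
    # Sort so that all occurrences of a key are adjacent.
    ordered = sorted(pairs, key=itemgetter(0))
    output = []  # stands in for the partition output file
    # groupby yields each unique key once, with its run of adjacent pairs.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _key, value in group]
        output.append((key, reduce_fn(key, values)))
    return output


pairs = [("now", 1), ("it", 1), ("work", 1), ("now", 1)]
result = run_reduce_task(pairs, lambda key, values: sum(values))
print(result)  # [('it', 1), ('now', 2), ('work', 1)]
```

The sort matters: groupby only merges adjacent equal keys, so unsorted input would call the reduce function more than once for the same key.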


23/26

MapReduce Execution Overview

8. Master process wakes up user process when all tasks have completed. Output is contained in R output files.

[Figure: the Master wakes up the User Program; the results are in the output files.]


24/26

MapReduce Execution Overview

Fault Tolerance
  Master process periodically pings workers
  Map-task failure
    Re-execute
    All output was stored locally
  Reduce-task failure
    Only re-execute partially completed tasks
    All output stored in the global file system
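The re-execution rules above can be sketched as the bookkeeping a master might do when a ping to a worker goes unanswered. The Task record and handle_worker_failure are hypothetical; the sketch only encodes the two rules from this slide: map output lives on the failed machine's local disk, so even completed map tasks are rerun, while completed reduce output is already in the global file system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # name of the worker the task was assigned to


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset the tasks that must be re-executed and return them."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output was stored locally on the failed machine.
        lost_map_output = task.kind == "map" and task.state == "completed"
        # Completed reduce output is in the global file system, so it is kept.
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
rerun = handle_worker_failure(tasks, "w1")
print(len(rerun))  # 3
```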


25/26

Hadoop

Open source MapReduce implementation
  http://hadoop.apache.org/core/index.html

Uses Hadoop Distributed Filesystem (HDFS)
  http://hadoop.apache.org/core/docs/current/hdfs_design.html

Java

ssh


26/26

References

Introduction to Parallel Programming and MapReduce, Google Code University
http://code.google.com/edu/parallel/mapreduce-tutorial.html

Distributed Systems
http://code.google.com/edu/parallel/index.html

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Hadoop
http://hadoop.apache.org/core/
