Post on 08-Apr-2018
8/7/2019 MapReduce_2
1/26
Lecture 2
MapReduce
CPE 458 Parallel Programming, Spring 2009
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
http://creativecommons.org/licenses/by/2.5
Outline
MapReduce: Programming Model
MapReduce Examples
A Brief History
MapReduce Execution Overview
Hadoop
MapReduce Resources
MapReduce
A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.
MapReduce
More simply, MapReduce is:
  A parallel programming model and associated implementation.
Programming Models
von Neumann model
  Execute a stream of instructions (machine code)
  Instructions can specify
    Arithmetic operations
    Data addresses
    Next instruction to execute
Complexity
  Track billions of data locations and millions of instructions
  Manage with:
    Modular design
    High-level programming languages (isomorphic)
Programming Models
Parallel Programming Models
Message passing
  Independent tasks encapsulating local data
  Tasks interact by exchanging messages
Shared memory
  Tasks share a common address space
  Tasks interact by reading and writing this space asynchronously
Data parallelization
  Tasks execute a sequence of independent operations
  Data usually evenly partitioned across tasks
  Also referred to as "embarrassingly parallel"
MapReduce: Programming Model
Process data using special map() and reduce() functions
  The map() function is called on every item in the input and emits a series of intermediate key/value pairs
  All values associated with a given key are grouped together
  The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output
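The three steps above (map, group by key, reduce) can be sketched in a single Python process. This is only an illustration of the programming model, not the distributed implementation; the names map_reduce, emit_reading, and hottest, and the temperature readings, are hypothetical.

```python
from collections import defaultdict


def map_reduce(items, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce in a single process."""
    groups = defaultdict(list)
    for item in items:
        # Map phase: map_fn is called on every input item and emits
        # intermediate (key, value) pairs.
        for key, value in map_fn(item):
            # Grouping phase: gather all values that share a key.
            groups[key].append(value)

    # Reduce phase: reduce_fn is called once per unique key with its value list.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical example data: (city, temperature) readings.
readings = [("Oslo", 3), ("Lima", 19), ("Oslo", 7), ("Lima", 22)]


def emit_reading(reading):
    city, temperature = reading
    yield city, temperature


def hottest(city, temperatures):
    return max(temperatures)


result = map_reduce(readings, emit_reading, hottest)
print(result)
```

The user supplies only emit_reading and hottest; everything inside map_reduce is what a framework would take over.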
MapReduce: Programming Model
[Figure: word-count data flow through the MapReduce framework. Input: "How now Brown cow" and "How does It work now". Four map tasks (M) feed two reduce tasks (R). Output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]
MapReduce: Programming Model
More formally,
Map(k1,v1) --> list(k2,v2)
Reduce(k2, list(v2)) --> list(v2)
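One way to read these signatures is as type aliases. The sketch below writes them with Python's typing module and runs a small job whose input domain (file name, size) differs from its intermediate domain (extension, size). The aliases Mapper and Reducer, the run helper, and the file data are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# Map(k1, v1) --> list(k2, v2)
Mapper = Callable[[K1, V1], List[Tuple[K2, V2]]]
# Reduce(k2, list(v2)) --> list(v2)
Reducer = Callable[[K2, List[V2]], List[V2]]


def run(records: Iterable[Tuple[K1, V1]],
        mapper: Mapper, reducer: Reducer) -> Dict[K2, List[V2]]:
    """Apply mapper to each (k1, v1), group by k2, then reduce each group."""
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in grouped.items()}


def by_extension(filename: str, size: int) -> List[Tuple[str, int]]:
    # (k1, v1) = (file name, size)  -->  [(k2, v2)] = [(extension, size)]
    extension = filename.rsplit(".", 1)[-1]
    return [(extension, size)]


def total(extension: str, sizes: List[int]) -> List[int]:
    # (k2, list(v2)) --> list(v2): a one-element list holding the sum.
    return [sum(sizes)]


files = [("a.txt", 10), ("b.log", 5), ("c.txt", 7)]
totals = run(files, by_extension, total)
print(totals)
```

Note that the input keys and values (k1, v1) come from a different domain than the intermediate ones (k2, v2), while the reduce output stays in the v2 domain.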
MapReduce Runtime System
1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication
MapReduce Benefits
Greatly reduces parallel programming complexity
  Reduces synchronization complexity
  Automatically partitions data
  Provides failure transparency
  Handles load balancing
Practical
  Approximately 1000 Google MapReduce jobs run every day.
MapReduce Examples
Word frequency
[Figure: a document (doc) passes through Map, the runtime system, and Reduce.]
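A minimal single-process sketch of the word frequency example follows. The function names and the four short input strings are assumptions chosen for illustration, and words are lowercased before counting, which is a simplification rather than something the slides specify.

```python
from collections import defaultdict


def map_words(doc):
    """Map: emit (word, 1) for every word in one document."""
    for word in doc.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum the 1s emitted for a single word."""
    return sum(counts)


def word_frequency(docs):
    # Stand-in for the runtime system: group intermediate values by key.
    grouped = defaultdict(list)
    for doc in docs:
        for word, count in map_words(doc):
            grouped[word].append(count)
    return {word: reduce_counts(word, counts)
            for word, counts in sorted(grouped.items())}


docs = ["How now", "Brown cow", "How does", "It work now"]
frequencies = word_frequency(docs)
for word, count in frequencies.items():
    print(word, count)
```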
MapReduce Examples
Distributed grep
  Map function emits a key/value pair if the word matches the search criteria
  Reduce function is the identity function
URL access frequency
  Map function processes web logs and emits a key/value pair for each URL access
  Reduce function sums the values and emits the total for each URL
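Both examples can be sketched with the same tiny in-process driver. The run_job helper, the search pattern, the choice of (line number, text) as the emitted grep pair, and the log line format are all assumptions made for this illustration.

```python
import re
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Tiny single-process stand-in for the MapReduce runtime."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# --- Distributed grep ---
PATTERN = re.compile(r"error")  # hypothetical search criterion


def grep_map(numbered_line):
    line_number, text = numbered_line
    if PATTERN.search(text):
        yield line_number, text  # emit only matching lines


def identity_reduce(key, values):
    return values  # pass the matches through unchanged


# --- URL access frequency ---
def url_map(log_line):
    # Assumed log format: "client method url".
    url = log_line.split()[2]
    yield url, 1


def sum_reduce(url, counts):
    return sum(counts)


lines = [(1, "boot ok"), (2, "disk error"), (3, "net error"), (4, "done")]
matches = run_job(lines, grep_map, identity_reduce)

log = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /index.html",
]
hits = run_job(log, url_map, sum_reduce)
print(matches)
print(hits)
```

Only the map and reduce functions change between the two jobs; the driver stays the same.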
A Brief History
Functional programming (e.g., Lisp)
  map() function
    Applies a function to each value of a sequence
  reduce() function
    Combines all elements of a sequence using a binary operator
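Python carries the same two functional primitives, which makes for a quick illustration (Python is used here instead of Lisp only to match the other sketches; the sample values are arbitrary).

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]

# map(): apply a function to each value of a sequence.
squares = list(map(lambda x: x * x, values))

# reduce(): combine all elements of a sequence using a binary operator.
total = reduce(add, squares)

print(squares, total)
```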
MapReduce Execution Overview
1. The user program, via the MapReduce library, shards the input data
[Figure: the user program splits the input data into Shard 0 through Shard 6.]
* Shards are typically 16-64 MB in size
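Sharding can be sketched as cutting the input into fixed-size pieces. The shard function below is hypothetical and ignores record boundaries, which a real library would have to respect; the sizes are toy values.

```python
def shard(data: bytes, shard_size: int) -> list:
    """Split data into consecutive shards of at most shard_size bytes."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [data[start:start + shard_size]
            for start in range(0, len(data), shard_size)]


# Toy sizes for illustration only; real shards are far larger.
data = b"x" * 100
shards = shard(data, shard_size=16)
print(len(shards), [len(s) for s in shards])
```

With 100 bytes and 16-byte shards this yields seven shards, the last one shorter than the rest.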
MapReduce Execution Overview
2. The user program creates process copies distributed on a machine cluster. One copy will be the Master and the others will be worker threads.
[Figure: the user program starts one Master and several Workers.]
MapReduce Execution Overview
3. The master distributes M map and R reduce tasks to idle workers.
  M == number of shards
  R == the intermediate key space is divided into R parts
[Figure: the Master sends a Do_map_task message to an idle worker.]
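Step 3 involves two ideas: dividing the key space into R parts and handing tasks to idle workers. The sketch below assumes the common "hash(key) mod R" partitioning rule and a simple first-come assignment; the function names, worker names, and task tuples are hypothetical, and no real messaging is performed.

```python
import zlib
from collections import deque


def partition(key: str, R: int) -> int:
    """Pick one of R reduce partitions for a key (hash(key) mod R)."""
    # crc32 is used instead of the built-in hash() so that results are
    # stable across Python processes.
    return zlib.crc32(key.encode("utf-8")) % R


def assign_tasks(tasks, idle_workers):
    """Hand pending tasks to idle workers; return (assignments, leftover)."""
    pending = deque(tasks)
    assignments = {}
    for worker in idle_workers:
        if not pending:
            break
        assignments[worker] = pending.popleft()
    return assignments, list(pending)


M, R = 7, 2  # M map tasks (one per shard), R reduce tasks
tasks = [("map", i) for i in range(M)] + [("reduce", j) for j in range(R)]
assignments, leftover = assign_tasks(tasks, ["worker-a", "worker-b", "worker-c"])
print(assignments)
print(len(leftover), "tasks still waiting for an idle worker")
```

Because every map worker applies the same partition function, all values for a given key end up in the same reduce partition.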
MapReduce Execution Overview
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs. Output is buffered in RAM.
[Figure: a map worker reads Shard 0 and produces key/value pairs.]
MapReduce Execution Overview
5. Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process.
[Figure: the map worker writes to local storage and sends the disk locations to the Master.]
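The flush in step 5 can be sketched as writing the buffered pairs into R files on local disk and returning their paths. The file naming, the JSON-lines format, and the crc32-based partitioning are assumptions for illustration; the returned list merely stands in for the notification sent to the Master.

```python
import json
import os
import tempfile
import zlib


def flush_to_regions(buffered_pairs, R, directory):
    """Write buffered (key, value) pairs into R region files.

    Returns the file paths, standing in for the disk locations that
    the worker would report to the Master.
    """
    regions = [[] for _ in range(R)]
    for key, value in buffered_pairs:
        regions[zlib.crc32(key.encode("utf-8")) % R].append((key, value))

    locations = []
    for index, pairs in enumerate(regions):
        path = os.path.join(directory, f"region-{index}.jsonl")
        with open(path, "w", encoding="utf-8") as handle:
            for key, value in pairs:
                handle.write(json.dumps([key, value]) + "\n")
        locations.append(path)
    return locations


buffer = [("how", 1), ("now", 1), ("brown", 1), ("cow", 1)]
with tempfile.TemporaryDirectory() as workdir:
    locations = flush_to_regions(buffer, R=2, directory=workdir)
    line_counts = []
    for path in locations:
        with open(path, encoding="utf-8") as handle:
            line_counts.append(sum(1 for _ in handle))

print(len(locations), "region files holding", sum(line_counts), "pairs")
```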
MapReduce Execution Overview
6. The Master process gives the disk locations to an available reduce-task worker, which reads all associated intermediate data.
[Figure: the Master passes disk locations to a reduce worker, which reads from remote storage.]
MapReduce Execution Overview
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
[Figure: the reduce worker sorts its data and writes to a partition output file.]
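Sorting is what brings equal keys next to each other so that the reduce function can be called once per key. A minimal sketch, assuming the intermediate data fits in memory and using a list of lines as a stand-in for the partition output file (run_reduce_task and the sample pairs are hypothetical):

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(intermediate_pairs, reduce_fn):
    """Sort pairs by key, then call reduce_fn once per unique key."""
    output_lines = []
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    # groupby only merges adjacent items, which is why the sort comes first.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        # Append the result, as a worker would append to its output file.
        output_lines.append(f"{key}\t{reduce_fn(key, values)}")
    return output_lines


pairs = [("now", 1), ("how", 1), ("now", 1), ("cow", 1), ("how", 1)]
partition_output = run_reduce_task(pairs, lambda key, values: sum(values))
print(partition_output)
```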
MapReduce Execution Overview
8. The Master process wakes up the user process when all tasks have completed. The output is contained in R output files.
[Figure: the Master wakes up the user program; the results are in the output files.]
MapReduce Execution Overview
Fault Tolerance
  Master process periodically pings workers
  Map-task failure
    Re-execute
    All output was stored locally
  Reduce-task failure
    Only re-execute partially completed tasks
    All output stored in the global file system
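The re-execution rules above can be sketched as bookkeeping on the Master. The Task record, the state names, and handle_worker_failure are hypothetical; the sketch assumes a worker has already been judged dead after missed pings and models only the rescheduling decision.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str      # "map" or "reduce"
    state: str     # "idle", "in_progress", or "completed"
    worker: str    # worker the task was assigned to, "" if none


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be re-executed after a worker dies."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lived on the failed worker's local disk,
        # so it is lost; completed reduce output is in the global file system.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = ""
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
rescheduled = handle_worker_failure(tasks, "w1")
print(len(rescheduled), "tasks returned to the idle pool")
```

The completed reduce task on the failed worker is left alone, and nothing assigned to the healthy worker is touched.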
Hadoop
Open source MapReduce implementation
  http://hadoop.apache.org/core/index.html
Uses Hadoop Distributed File System (HDFS)
  http://hadoop.apache.org/core/docs/current/hdfs_design.html
Java
ssh
References
Introduction to Parallel Programming and MapReduce, Google Code University
  http://code.google.com/edu/parallel/mapreduce-tutorial.html
Distributed Systems
  http://code.google.com/edu/parallel/index.html
MapReduce: Simplified Data Processing on Large Clusters
  http://labs.google.com/papers/mapreduce.html
Hadoop
  http://hadoop.apache.org/core/