MapReduce_2

Date post: 08-Apr-2018
Upload: rhshriva

Transcript
  • 8/7/2019 MapReduce_2

    1/26

    Lecture 2

    MapReduce

    CPE 458 Parallel Programming, Spring 2009

    Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. http://creativecommons.org/licenses/by/2.5

    Outline

    MapReduce: Programming Model

    MapReduce Examples

    A Brief History

    MapReduce Execution Overview

    Hadoop

    MapReduce Resources

    MapReduce

    "A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs."

    Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.

    MapReduce

    More simply, MapReduce is a parallel programming model and an associated implementation.

    Programming Models

    von Neumann model
        Execute a stream of instructions (machine code)
        Instructions can specify
            Arithmetic operations
            Data addresses
            Next instruction to execute
        Complexity
            Track billions of data locations and millions of instructions
            Manage with:
                Modular design
                High-level programming languages (isomorphic)

    Programming Models

    Parallel programming models
        Message passing
            Independent tasks encapsulating local data
            Tasks interact by exchanging messages
        Shared memory
            Tasks share a common address space
            Tasks interact by reading and writing this space asynchronously
        Data parallelization
            Tasks execute a sequence of independent operations
            Data usually evenly partitioned across tasks
            Also referred to as "embarrassingly parallel"

    MapReduce: Programming Model

    Process data using special map() and reduce() functions
        The map() function is called on every item in the input and emits a series of intermediate key/value pairs
        All values associated with a given key are grouped together
        The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
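The three steps above (map, group by key, reduce) can be sketched in a single process. This is only an illustration of the programming model, not the distributed implementation; the names `run_mapreduce`, `max_temp_map`, and `max_temp_reduce` and the "city,temperature" record format are assumptions made for this example.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map_fn and reduce_fn over inputs in a single process."""
    grouped = defaultdict(list)
    # Map phase: call map_fn on every input item.
    for item in inputs:
        for key, value in map_fn(item):
            # Grouping: collect all values that share a key.
            grouped[key].append(value)
    # Reduce phase: call reduce_fn once per unique key with its value list.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def max_temp_map(line):
    """Emit one (city, temperature) pair per 'city,temperature' record."""
    city, temp = line.split(",")
    yield city, int(temp)


def max_temp_reduce(city, temps):
    """Keep the highest temperature seen for a city."""
    return max(temps)


readings = ["paris,21", "oslo,12", "paris,25", "oslo,9"]
result = run_mapreduce(readings, max_temp_map, max_temp_reduce)
print(result)
```

The user supplies only the two small functions; everything else belongs to the driver, which is the part a real framework distributes across machines.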

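    MapReduce: Programming Model

    [Figure: word count through the MapReduce framework. The input lines "How now", "Brown cow", "How does", and "It work now" pass through four map tasks (M) and two reduce tasks (R) to produce the output.]

    Output:
        brown 1
        cow 1
        does 1
        How 2
        it 1
        now 2
        work 1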
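The word count in the figure can be reproduced with a short single-process sketch. The function names are hypothetical, and lowercasing every word is an assumption of this example (the figure keeps "How" capitalized), so the keys differ slightly in case from the slide.

```python
from collections import defaultdict


def map_words(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Sum the 1s emitted for a single word."""
    return sum(counts)


def word_count(lines):
    grouped = defaultdict(list)
    for line in lines:                       # one map call per input line
        for word, one in map_words(line):
            grouped[word].append(one)        # group values by key
    # One reduce call per unique word, visited in sorted key order.
    return {word: reduce_counts(word, ones)
            for word, ones in sorted(grouped.items())}


counts = word_count(["How now", "Brown cow", "How does", "It work now"])
for word, n in counts.items():
    print(word, n)
```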

    MapReduce: Programming Model

    More formally,

    Map(k1,v1) --> list(k2,v2)

    Reduce(k2, list(v2)) --> list(v2)
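One way to read these signatures is as Python type aliases. The sketch below applies them to a small inverted index, where k1 is a document id, v1 is its text, k2 is a word, and v2 is a document id. The alias names `Mapper` and `Reducer` and the functions `index_map` and `index_reduce` are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# Map(k1, v1) --> list(k2, v2)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# Reduce(k2, list(v2)) --> list(v2)
Reducer = Callable[[K2, List[V2]], List[V2]]


def index_map(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """k1 = document id, v1 = text; emits (word, document id) pairs."""
    return [(word, doc_id) for word in text.split()]


def index_reduce(word: str, doc_ids: List[str]) -> List[str]:
    """k2 = word; returns the sorted, de-duplicated ids containing it."""
    return sorted(set(doc_ids))


print(index_map("d1", "a b a"))
print(index_reduce("a", ["d2", "d1", "d1"]))
```

Note that the input key/value types (k1, v1) are independent of the intermediate types (k2, v2), while the reduce output reuses the intermediate value type.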

    MapReduce Runtime System

    1. Partitions input data

    2. Schedules execution across a set of machines

    3. Handles machine failure

    4. Manages interprocess communication

    MapReduce Benefits

    Greatly reduces parallel programming complexity
        Reduces synchronization complexity
        Automatically partitions data
        Provides failure transparency
        Handles load balancing

    Practical
        Approximately 1000 Google MapReduce jobs run every day.

    MapReduce Examples

    Word frequency

    [Figure: a document (doc) passes through Map, the runtime system, and Reduce.]

    MapReduce Examples

    Distributed grep
        Map function emits a key/value pair if a word matches the search criteria
        Reduce function is the identity function

    URL access frequency
        Map function processes web logs and emits a (URL, 1) pair for each request
        Reduce function sums the values and emits a (URL, total count) pair
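Both examples fit the same single-process driver. The sketch below is illustrative only: the search pattern, the "client url" log format, the choice of line number as the grep key, and all function names are assumptions that do not come from the slides.

```python
import re
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Minimal in-memory driver: map, group by key, reduce."""
    grouped = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


PATTERN = re.compile(r"cow")  # hypothetical search criteria


def grep_map(record):
    """Emit (line number, line) only when the line matches PATTERN."""
    line_no, line = record
    if PATTERN.search(line):
        yield line_no, line


def identity_reduce(key, values):
    """Pass the grouped values through unchanged."""
    return values


def url_map(log_line):
    """Emit (url, 1) for each request; assumes 'client url' log lines."""
    client, url = log_line.split()
    yield url, 1


def url_reduce(url, counts):
    """Sum the per-request counts for one URL."""
    return sum(counts)


lines = list(enumerate(["How now", "Brown cow", "It work now"], start=1))
matches = run_mapreduce(lines, grep_map, identity_reduce)

log = ["10.0.0.1 /index.html", "10.0.0.2 /about.html", "10.0.0.3 /index.html"]
hits = run_mapreduce(log, url_map, url_reduce)
print(matches, hits)
```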

    A Brief History

    Functional programming (e.g., Lisp)
        map() function
            Applies a function to each value of a sequence
        reduce() function
            Combines all elements of a sequence using a binary operator
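Python keeps both functional primitives, which makes the lineage easy to see. The example uses Python rather than Lisp, and the input values are arbitrary.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]

# map(): apply a function to each value of the sequence.
squares = list(map(lambda x: x * x, values))

# reduce(): combine all elements using a binary operator (here, addition).
total = reduce(add, squares)

print(squares, total)
```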

    MapReduce Execution Overview

    1. The user program, via the MapReduce library, shards the input data.

    [Figure: the user program splits the input data into Shard 0 through Shard 6.]

    * Shards are typically 16-64 MB in size.
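Sharding can be pictured as cutting the input into fixed-size pieces. The sketch below is a simplification: `make_shards` is a hypothetical helper, the shard size is tiny so the result is easy to inspect, and a real library would also have to respect record boundaries rather than cut at arbitrary byte offsets.

```python
from typing import List


def make_shards(data: bytes, shard_size: int) -> List[bytes]:
    """Split data into consecutive shards of at most shard_size bytes."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


shards = make_shards(b"abcdefghij", 4)
print(shards)
```

The number of shards produced here is what the later steps call M, the number of map tasks.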

    MapReduce Execution Overview

    2. The user program creates process copies distributed on a machine cluster. One copy will be the master and the others will be workers.

    [Figure: the user program starts one master and several workers.]

    MapReduce Execution Overview

    3. The master distributes M map tasks and R reduce tasks to idle workers.
        M == the number of shards
        R == the number of parts into which the intermediate key space is divided

    [Figure: the master sends a Do_map_task message to an idle worker.]
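Dividing the intermediate key space into R parts is commonly done with a hash of the key modulo R. The sketch below assumes that scheme. It uses `zlib.crc32` so the assignment stays stable across runs (Python's built-in `hash()` is randomized per process for strings), and `partition` and `bucket_pairs` are hypothetical names.

```python
import zlib
from typing import Dict, Iterable, List, Tuple


def partition(key: str, r: int) -> int:
    """Assign a key to one of r reduce partitions (hash of key mod r)."""
    return zlib.crc32(key.encode("utf-8")) % r


def bucket_pairs(pairs: Iterable[Tuple[str, int]],
                 r: int) -> Dict[int, List[Tuple[str, int]]]:
    """Split a map task's output into r regions, one per reduce task."""
    regions: Dict[int, List[Tuple[str, int]]] = {i: [] for i in range(r)}
    for key, value in pairs:
        regions[partition(key, r)].append((key, value))
    return regions


pairs = [("how", 1), ("now", 1), ("brown", 1), ("cow", 1), ("how", 1)]
regions = bucket_pairs(pairs, r=2)
print(regions)
```

Because the partition depends only on the key, every pair with the same key lands in the same region no matter which map task produced it, which is what lets a single reduce task see all the values for that key.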

    MapReduce Execution Overview

    4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs. Output is buffered in RAM.

    [Figure: a map worker reads Shard 0 and emits key/value pairs.]

    MapReduce Execution Overview

    5. Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the master process.

    [Figure: a map worker writes to local storage and sends the disk locations to the master.]

    MapReduce Execution Overview

    6. The master process gives the disk locations to an available reduce-task worker, which reads all associated intermediate data.

    [Figure: the master sends the disk locations to a reduce worker, which reads from remote storage.]

    MapReduce Execution Overview

    7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.

    [Figure: a reduce worker sorts the data and writes to its partition output file.]
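The sort-then-reduce behavior of a reduce worker can be sketched with `itertools.groupby`, which only groups adjacent items and therefore needs the sort first. The name `reduce_partition` is hypothetical, and the output is returned as a list here instead of being appended to a file.

```python
from itertools import groupby
from operator import itemgetter


def reduce_partition(pairs, reduce_fn):
    """Sort one partition's pairs by key, then reduce each key's values."""
    ordered = sorted(pairs, key=itemgetter(0))   # bring equal keys together
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]   # all values for this key
        output.append((key, reduce_fn(key, values)))
    return output


partition_output = reduce_partition(
    [("now", 1), ("how", 1), ("now", 1), ("cow", 1)],
    lambda key, values: sum(values),
)
print(partition_output)
```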

    MapReduce Execution Overview

    8. The master process wakes up the user process when all tasks have completed. The output is contained in R output files.

    [Figure: the master wakes up the user program; the results are in the output files.]

    MapReduce Execution Overview

    Fault Tolerance
        Master process periodically pings workers
        Map-task failure
            Re-execute
            All output was stored locally
        Reduce-task failure
            Only re-execute partially completed tasks
            All output stored in the global file system
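The re-execution rules above can be written as a small bookkeeping function for the master. This is a sketch under assumptions: the `Task` record, its state names, and `reset_after_failure` are all hypothetical, and detecting the failure itself (the periodic ping) is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker the task was assigned to


def reset_after_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Mark the tasks that must be re-executed after a worker fails."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lived on the failed worker's local disk,
        # so it is lost; completed reduce output is in the global file system.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
rescheduled = reset_after_failure(tasks, "w1")
print(len(rescheduled))
```

The asymmetry between map and reduce tasks comes entirely from where their output is stored, which is why the completed reduce task on the failed worker is left alone.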

    Hadoop

    Open source MapReduce implementation
        http://hadoop.apache.org/core/index.html

    Uses Hadoop Distributed File System (HDFS)
        http://hadoop.apache.org/core/docs/current/hdfs_design.html

    Java

    ssh

    References

    Introduction to Parallel Programming and MapReduce, Google Code University
        http://code.google.com/edu/parallel/mapreduce-tutorial.html

    Distributed Systems
        http://code.google.com/edu/parallel/index.html

    MapReduce: Simplified Data Processing on Large Clusters
        http://labs.google.com/papers/mapreduce.html

    Hadoop
        http://hadoop.apache.org/core/