Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Page 1: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Hung-chih Yang 1, Ali Dasdan 1

Ruey-Lung Hsiao 2, D. Stott Parker 2

Yahoo! 1

Computer Science Department, UCLA 2

SIGMOD 2007, Beijing, China

Presented by Jongheum Yeon, 2009. 08. 13.

Page 2: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Copyright 2009 by CEBT

Outline

Introduction

Map-Reduce

Map-Reduce-Merge

Conclusions

Page 3: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Introduction

New data-processing systems should consider alternatives to using big, traditional databases

Map-Reduce does a good job, in a limited context, with extraordinary simplicity

Map-Reduce-Merge will try to extend the applicability without giving up too much simplicity

Page 4: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Introduction (cont’d)

[Figure: data-processing software stacks compared by layer (Application, Language, Execution, Storage)]

Execution: parallel databases, Map-Reduce, Dryad, Hadoop

Storage: GFS, BigTable; Cosmos, Azure, SQL Server; HDFS, S3

Language: Sawzall; DryadLINQ, Scope; Pig, Hive

Language style: SQL, ≈SQL, LINQ, Sawzall

Page 5: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Motivation

Many special purpose tasks that operate on and produce large amounts of data

Crawled documents, web requests, etc

Inverted indices, summaries, other kinds of derived data

The work needs to be distributed across a large number of machines to finish in a reasonable time

Parallelize the computation

Distribute data

These extra concerns obscure the original computation

Page 6: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Benefits

Automatic parallelization and distribution

User code complexity and size reduced

Transparent fault-tolerance

I/O scheduling

Fine grained partitioning of tasks

Dynamically scheduled on available workers

Status and monitoring

Page 7: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Programming Model

Input & Output: each a set of key/value pairs

Programmer specifies two functions:

map (in_key, in_value) -> list (out_key, intermediate_value)

– Processes input key/value pair

– Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list (out_value)

– Produces a set of merged output values (usually just one)
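As a rough illustration of the two signatures above, here is a minimal single-process sketch of a word count. The names `map_fn`, `reduce_fn` and `run_map_reduce` are hypothetical, and the driver only imitates what a real framework would do across many machines.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Process one input pair: emit (word, 1) for every word in the text."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Merge all values for one key into a (single-element) output list."""
    return [sum(intermediate_values)]

def run_map_reduce(inputs, map_fn, reduce_fn):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

word_counts = run_map_reduce(
    [("doc1", "a rose is a rose"), ("doc2", "a daisy")], map_fn, reduce_fn
)
```

The user writes only `map_fn` and `reduce_fn`; grouping by key is the framework's job.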

Page 8: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Data Flow

[Figure: three Data blocks feed three Map tasks, whose outputs feed two Reduce tasks]

Page 9: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Data Flow

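Map: generates a new key and its value

Reduce: combines the values that share the same key

[Figure: Map tasks read input pairs such as (Key1, Value1) and emit intermediate pairs (KeyA, ValueX), (KeyB, ValueY), (KeyB, ValueZ); Reduce tasks combine them per key into A=X and B=Y,Z]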

Page 10: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Architecture

[Figure: a Master coordinates four Workers that run the Map and Reduce tasks; input and output files are stored in GFS]

Page 11: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Architecture

Master

Assigns map/reduce tasks and maintains the state of each

Propagates intermediate files to reduce tasks

Worker

Executes Map or Reduce tasks at the request of the Master

Page 12: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Distributed Processing

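[Figure: the input file is split into Input 1 … Input M, one per Map task; each Map task writes an intermediate file divided into partitions 1 … R; partition r of every Map output is shuffled to Reduce task r, which writes Output r of the output file (Output 1 … Output R)]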
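The split of every Map output into R partitions can be sketched as below. The slide does not say how a key is assigned to a partition; hashing the key modulo R is an assumption made here for illustration, and `partition` and `shuffle` are hypothetical names.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reduce task for a key (assumed rule: stable hash modulo R)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every intermediate pair to the bucket of its reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for output in map_outputs:  # one list of (key, value) pairs per Map task
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets

# Two Map tasks (M = 2) feeding two Reduce tasks (R = 2).
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
buckets = shuffle(map_outputs, num_reducers=2)
```

Whatever rule is used, the property that matters is that all pairs with the same key land in the same bucket, so one Reduce task sees every value for that key.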

Page 13: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Example

Inverted Index

DocID=1: IDS lab's page

DocID=2: IDB lab's page

Word dictionary:

Word    wordID
lab     101
's      201
page    203
IDS     301
IDB     302

Inverted index:

wordID  docID  Location
101     1      1
101     2      1
201     1      2
201     2      2
203     1      3
203     2      3
301     1      0
302     2      0

Page 14: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Example (cont’d)

Input data to Map

Key (docID)  Value (text)
1            IDS lab's page
2            IDB lab's page

Output of Map

Map task for docID 1:

Key (wordID)  Value (docID:Location)
301           1:0
101           1:1
201           1:2
203           1:3

Map task for docID 2:

Key (wordID)  Value (docID:Location)
302           2:0
101           2:1
201           2:2
203           2:3

Page 15: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Example (cont’d)

Shuffle

Collect same keys and convey them to Reduce

Reduce writes the final result

Input to Reduce (after shuffle):

Key (wordID)  Value (docID:Location)
101           1:1 2:1
201           1:2 2:2
203           1:3 2:3
301           1:0
302           2:0

Final result written by Reduce:

101=1:1, 2:1

201=1:2, 2:2

203=1:3, 2:3

301=1:0

302=2:0
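The inverted-index example can be reproduced with a short sketch. It assumes the documents are already tokenized and that the word dictionary is available to every Map task; the tokens are written in English, and `WORD_IDS`, `map_fn` and `reduce_fn` are hypothetical names.

```python
from collections import defaultdict

# Word-to-wordID dictionary of the example (tokens written in English).
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}

def map_fn(doc_id, tokens):
    """Emit (wordID, "docID:location") for each token of one document."""
    return [(WORD_IDS[tok], f"{doc_id}:{loc}") for loc, tok in enumerate(tokens)]

def reduce_fn(word_id, postings):
    """Collect every posting of one word into a single index entry."""
    return sorted(postings)

documents = {1: ["IDS", "lab", "'s", "page"], 2: ["IDB", "lab", "'s", "page"]}

grouped = defaultdict(list)  # shuffle: gather postings by wordID
for doc_id, tokens in documents.items():
    for word_id, posting in map_fn(doc_id, tokens):
        grouped[word_id].append(posting)

inverted_index = {w: reduce_fn(w, p) for w, p in sorted(grouped.items())}
```

Under these assumptions `inverted_index` holds the same five entries as the final result of the example.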

Page 16: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce : Example (cont’d)

Other Examples

Distributed Grep

Count URL Access Frequency

– Map emits <URL, 1>

– Reduce emits <URL, total count>

Reverse Web-Link Graph

– Map emits <target, source>

– Reduce emits <target, list(source)>
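Two of these examples fit in a few lines each. The sketch below runs in a single process, with `group_by_key` standing in for the shuffle; the request log and link list are made-up sample data.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: gather all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# URL access frequency: the map step emits (URL, 1), the reduce step sums.
requests = ["/home", "/about", "/home"]
url_counts = {
    url: sum(ones) for url, ones in group_by_key((u, 1) for u in requests).items()
}

# Reverse web-link graph: the map step emits (target, source),
# the reduce step collects the sources of each target.
links = [("a.html", "c.html"), ("b.html", "c.html"), ("a.html", "b.html")]
reverse_graph = {
    target: sorted(sources)
    for target, sources in group_by_key((t, s) for s, t in links).items()
}
```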

Page 17: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge

Map-Reduce is an extremely simple model, but it applies only in a limited context

Map-Reduce handles mainly homogeneous datasets

Relational operators are hard to implement with Map-Reduce (especially join operations)

Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete

Page 18: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge

Adds a merge phase to the Map-Reduce algorithm

Allows processing of multiple heterogeneous datasets

Like Map and Reduce, the Merge phase is implemented by the developer

Example:

Two datasets: department and employee

Goal: compute each employee’s bonus based on individual rewards and the department’s bonus adjustment

Page 19: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Page 20: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge

Example

Match keys on dept_id in tables
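The department/employee example might look like the following sketch. The row layouts, the sample values, and the rule "bonus = total rewards × department adjustment" are assumptions made for illustration; each `reduce_*` function stands for a whole Map-Reduce pass over one dataset.

```python
from collections import defaultdict

# Hypothetical rows: (emp_id, dept_id, reward) and (dept_id, bonus_adjustment).
employee_rows = [("e1", "d1", 100.0), ("e1", "d1", 50.0), ("e2", "d2", 200.0)]
department_rows = [("d1", 1.25), ("d2", 0.5)]

def reduce_employee(rows):
    """Map-Reduce over employee: total rewards per (dept_id, emp_id), sorted."""
    totals = defaultdict(float)
    for emp_id, dept_id, reward in rows:
        totals[(dept_id, emp_id)] += reward
    return sorted(totals.items())

def reduce_department(rows):
    """Map-Reduce over department: one adjustment per dept_id."""
    return dict(rows)

def merge(employee_totals, adjustments):
    """Merge: match both reduced outputs on dept_id and apply the adjustment."""
    bonuses = {}
    for (dept_id, emp_id), total in employee_totals:
        if dept_id in adjustments:  # inner join on dept_id
            bonuses[emp_id] = total * adjustments[dept_id]
    return bonuses

bonuses = merge(reduce_employee(employee_rows), reduce_department(department_rows))
```

The two datasets have different schemas, which is why a single Map-Reduce pass cannot combine them directly; the Merge step is where they meet.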

Page 21: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge: Extending Map-Reduce

Changes: a modified Reduce phase (it now returns its key) and a new Merge phase

Phases

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → [v3]

becomes:

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → (k2, [v3])

3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])

Page 22: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge

Additional user-definable operations

Merger: same principle as map and reduce

– analogous to the map and reduce definitions, define logic to do the merge operation

Processor: processes data from one source

– process data on an individual source

Partition selector: selects the data that should go to the merger

– which data should go to which merger?

Configurable iterator: how to iterate through each list as the merging is done

– how to step through each of the lists as you merge
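One way to picture how the four user-definable pieces fit together is the sketch below. All function names, signatures, and the "merger i reads partition i" rule are hypothetical; the actual interfaces of a Map-Reduce-Merge framework may differ.

```python
def processor(record):
    """Processor: per-source logic applied to each record before merging."""
    key, value = record
    return key, value

def partition_selector(merger_id, left_partitions, right_partitions):
    """Partition selector: choose which reducer outputs this merger reads.
    Assumed rule: merger i reads partition i from both sources."""
    return ([p for p in left_partitions if p == merger_id],
            [p for p in right_partitions if p == merger_id])

def move_iterators(left_key, right_key):
    """Configurable iterator: say which side(s) to advance after a comparison."""
    if left_key < right_key:
        return "left"
    if left_key > right_key:
        return "right"
    return "both"

def merger(left, right):
    """Merger: combine one record from each source when their keys match."""
    (left_key, left_value), (_, right_value) = left, right
    return left_key, (left_value, right_value)

def run_merge(left_records, right_records):
    """Drive processor, iterator and merger over two key-sorted lists."""
    i = j = 0
    out = []
    while i < len(left_records) and j < len(right_records):
        left, right = processor(left_records[i]), processor(right_records[j])
        step = move_iterators(left[0], right[0])
        if step == "both":
            out.append(merger(left, right))
        if step in ("left", "both"):
            i += 1
        if step in ("right", "both"):
            j += 1
    return out

selected = partition_selector(1, [0, 1, 2], [0, 1, 2])
merged = run_merge([(1, "a"), (2, "b"), (4, "d")], [(2, "x"), (3, "y"), (4, "z")])
```

Swapping in a different `move_iterators` or `merger` changes the operation (join, union, difference) without touching the driver loop.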

Page 23: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge

Page 24: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge : Relational Data Processing

Relational operators can be implemented using the Map-Reduce-Merge model. This includes:

Projection

Aggregation

Generalized selection

Joins

Set union

Set intersection

Set difference

Etc…

Page 25: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge : Example, Set Union

Each of the two Map-Reduces emits a sorted list of unique elements

The Merge merges the two lists by iterating in the following way:

Store the smaller of the two current values and advance its iterator by one

If they are equal, store one of them and advance both iterators

When one list is exhausted, store the remaining values of the other
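The union merge can be sketched as a two-pointer walk over the two sorted lists. `merge_union` is a hypothetical name, and plain Python lists stand in for the reducer outputs.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements, kept sorted and unique."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:        # a holds the smaller value: store it, advance a
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:      # b holds the smaller value: store it, advance b
            out.append(b[j])
            j += 1
        else:                  # equal: store one copy, advance both
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])          # one list is exhausted: keep the other's tail
    out.extend(b[j:])
    return out

union = merge_union([1, 3, 5], [2, 3, 6, 7])
```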

Page 26: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge : Example, Set Difference

We have two sets, A and B, and we want to compute A-B

Each of the two Map-Reduces emits a sorted list of unique elements

The merge iterates simultaneously over the two lists:

If the value of A is less than B’s, store A’s value and increment A’s iterator

If the value of B is smaller, increment B’s iterator

If the two are equal, increment both iterators

When B is exhausted, store the remaining values of A
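A minimal sketch of the difference merge follows, again with plain sorted lists standing in for the reducer outputs and `merge_difference` as a hypothetical name. It also covers the case where B runs out first, in which the rest of A belongs to the result.

```python
def merge_difference(a, b):
    """Compute A - B for two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:        # A's value cannot appear later in B: keep it
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:      # B's value is irrelevant: skip it
            j += 1
        else:                  # value is in both sets: drop it
            i += 1
            j += 1
    out.extend(a[i:])          # B is exhausted: the rest of A survives
    return out

difference = merge_difference([1, 2, 4, 6, 9], [2, 3, 6])
```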

Page 27: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge : Example, Sort-Merge Join

Map: partition records into mutually exclusive key-range buckets, and assign each key range to a reducer

Reduce: sort the data in each bucket, merging it into a sorted set

Merge: the merger joins the sorted data for each key range
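The three phases of the sort-merge join can be sketched in one process as below. The function names and the `boundaries` parameter (the upper bounds that separate key ranges) are hypothetical; the point is that both datasets are partitioned with the same ranges, so bucket i of one side only ever needs to meet bucket i of the other.

```python
import bisect

def map_partition(records, boundaries):
    """Map: range-partition (key, value) records into mutually exclusive buckets."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets

def reduce_sort(bucket):
    """Reduce: sort the records of one key range."""
    return sorted(bucket)

def merge_join(left, right):
    """Merge: join two key-sorted lists, pairing all records that share a key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            out.extend((key, lv, rv)
                       for _, lv in left[i:i_end] for _, rv in right[j:j_end])
            i, j = i_end, j_end
    return out

def sort_merge_join(left_records, right_records, boundaries):
    """Run the three phases bucket by bucket and concatenate the results."""
    result = []
    for lb, rb in zip(map_partition(left_records, boundaries),
                      map_partition(right_records, boundaries)):
        result.extend(merge_join(reduce_sort(lb), reduce_sort(rb)))
    return result

joined = sort_merge_join(
    [(3, "c"), (1, "a"), (7, "g"), (3, "cc")],
    [(7, "G"), (3, "C"), (2, "B")],
    boundaries=[5],
)
```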

Page 28: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge : Optimizations

Map-Reduce already optimizes using locality and backup tasks

Optimize the number of connections between the outputs of the reduce phase and the input of the merge phase (example: set intersection)

Combine two phases into one (example: ReduceMerge)

Page 29: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Conclusions

Map-Reduce-Merge allows us to work on heterogeneous datasets

Map-Reduce-Merge supports joins, which Map-Reduce does not directly support

Next step: develop an SQL-like interface and an optimizer that simplify the development of a Map-Reduce-Merge workflow
