Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Presented by: Niketan R. Pansare ([email protected])

Outline

Introduction: Map-Reduce, Databases

Motivation & Contributions of this paper.

Map-Reduce-Merge framework

Implementation of relational operators

Conclusion

Introduction: Map Reduce

Programming model for processing large data sets on cheap, unreliable hardware. The framework handles distributed processing (recovery, coordination, ...) with a high degree of transparency.

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions: map and reduce. (Note: map here is not the higher-order function, but the function passed to it.)

map: (k1, v1) → list(k2, v2)

reduce: (k2, list(v2)) → list(v2)
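To make the two signatures concrete, here is a minimal single-process word-count sketch; the function names and the shuffle simulation are illustrative, not the library's actual API:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # map (k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce (k2, list(v2)) -> list(v2): sum the partial counts
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Simulate the shuffle: collect all map outputs, group by key
    pairs = []
    for k1, v1 in inputs:
        pairs.extend(map_fn(k1, v1))
    pairs.sort(key=itemgetter(0))
    return {k: reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))}

counts = run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)
# counts == {"a": [2], "b": [2], "c": [1]}
```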

Map Reduce: Big Picture

MapReduce: Pros/Cons

Extremely good for data-processing tasks on homogeneous data sets (distributed grep, count of URL access frequency).

Bad for heterogeneous data sets (e.g. Employee, Department, Sales, ...).

Can process heterogeneous data sets via "homogenization" (i.e. inserting two extra attributes: a join key and a data-source tag), but this costs a lot of extra disk space, causes excessive map-reduce communication, and is limited to queries that can be rewritten as equi-joins.
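The homogenization workaround can be sketched as follows; this is a toy single-process version, and the field names and source tags are invented for illustration:

```python
def homogenize(records, key_field, tag):
    # Insert the two extra attributes: the join key and a data-source tag
    return [(rec[key_field], (tag, rec)) for rec in records]

def join_reducer(key, tagged):
    # With homogenized input, a single reducer can perform the equi-join
    left = [r for t, r in tagged if t == "EMP"]
    right = [r for t, r in tagged if t == "DEPT"]
    return [(l, r) for l in left for r in right]

employees = [{"name": "ann", "dept_id": 1}]
departments = [{"dept_id": 1, "dept": "sales"}]

tagged = homogenize(employees, "dept_id", "EMP") + \
         homogenize(departments, "dept_id", "DEPT")
# The shuffle groups records by join key; the reducer pairs them up
joined = join_reducer(1, [tv for k, tv in tagged if k == 1])
```

Note how every record must be copied with its tag and key, which is where the extra disk space and communication come from.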

Introduction: Databases

A simple extension of set theory (excluding normalization and indexing; see Codd's 1970 paper).

Supports the following operators: the four mathematical operations (union, intersection, difference, Cartesian product), the four extension operators proposed by Codd (selection, projection, join, relational division), and others such as aggregation, group-by, and order-by.

Example:

select s_name
from Student s, Courses c
where s.s_course_id = c.c_course_id
  and c.c_course_name = 'COMP 620'

What interests me the most?

Join: It is one of the most frequently occurring operations and perhaps one of the most difficult to optimize.

Property of join: the output lineage differs from the input lineage, so it cannot easily be plugged into MapReduce.

Many different ways to implement it: nested-loop join (the naïve algorithm), hash join (the smaller relation is hashed), and sort-merge join (both relations are sorted on the join attribute).

Motivation for this paper

Databases are slow when we process large amounts of data.

Why? The fastest databases are usually implemented on shared-everything SMP architectures (e.g. Itanium, 64 processors, 128 cores, no clusters). The bottleneck is memory access (cache + memory + disk).

Then why not go for shared-nothing? Joining two relations is difficult with current frameworks (i.e. Map-Reduce). So why not extend it?

Contribution of this paper

Processes heterogeneous data sets using an extension of the MapReduce framework: an added Merge phase and hierarchical workflows.

Supports different join algorithms.

Map-Reduce-Merge is “relationally complete”.

Map-Reduce-Merge

Map: (k1, v1) → [(k2, v2)]

Reduce: (k2, [v2]) → [v3]

becomes:

Map: (k1, v1) → [(k2, v2)]

Reduce: (k2, [v2]) → (k2, [v3])

Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
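A toy illustration of the new Merge signature, as a hypothetical single-process function (the real framework also supplies partition selectors and iterators, omitted here); the merge logic shown is an equi-join-style pairing:

```python
def merge_fn(left, right):
    # Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5])
    (k2, v3s), (k3, v4s) = left, right
    if k2 == k3:
        # Equi-join style merge: pair every v3 with every v4
        return (k2, [(a, b) for a in v3s for b in v4s])
    return (k2, [])

merged = merge_fn(("x", [1, 2]), ("x", ["a"]))
# merged == ("x", [(1, "a"), (2, "a")])
```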

Programmer-Definable operations

Partition selector - decides which data should go to which merger.

Processor - processes data from an individual source.

Merger - analogous to the map and reduce definitions; defines the logic for the merge operation.

Configurable iterators - control how to step through each of the lists as you merge.

Projection

Returns a subset of the data passed in.

The mapper can handle this: Map: (k1, v1) → [(k2, v2)]

v2 may have different schema than v1.
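A projection mapper can be sketched like this (the record schema and field names are invented for illustration):

```python
def project_mapper(key, record):
    # Emit v2 with a narrower schema than v1: keep only s_name
    return [(key, {"s_name": record["s_name"]})]

out = project_mapper(7, {"s_name": "ann", "s_course_id": 620, "gpa": 3.9})
# out == [(7, {"s_name": "ann"})]
```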

Aggregation

At the Reduce phase, Map-Reduce performs the sort-by-key and group-by-key functions to ensure that the input to a reducer is a set of tuples t = (k, [v]), in which [v] is the collection of all the values associated with the key k.

Therefore, the reducer can implement "group-by" and "aggregate" operators.
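Since the shuffle has already grouped all values for a key together, an aggregation reducer is just a fold over [v]; a minimal sketch (names illustrative):

```python
def sum_reducer(key, values):
    # The input is (k, [v]) with [v] already grouped by the shuffle,
    # so aggregation is a single fold over the value list
    return (key, sum(values))

result = sum_reducer("sales", [10, 20, 5])
# result == ("sales", 35)
```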

Selection

If the selection condition involves only the attributes of one data source, it can be implemented in the mappers.

If it is on aggregates or a group of values contained in one data source, it can be implemented in the reducers.

If it involves attributes or aggregates from both data sources, it must be implemented in the mergers.
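The first case, a mapper-side selection, can be sketched as follows (the predicate and the records are invented for illustration):

```python
def select_mapper(key, record, predicate):
    # Mapper-side selection: emit the record only if it satisfies
    # a predicate over a single data source's attributes
    return [(key, record)] if predicate(record) else []

rows = [(1, {"dept": "sales"}), (2, {"dept": "hr"})]
kept = [pair
        for k, rec in rows
        for pair in select_mapper(k, rec, lambda r: r["dept"] == "sales")]
# kept == [(1, {"dept": "sales"})]
```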

Set union / intersection / difference

Let each of the two MapReduces emit a sorted list of unique elements.

Therefore, a naïve merge-like algorithm (as in merge sort) can perform set union/intersection/difference by iterating over the two sorted lists.
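That single pass over two sorted, duplicate-free lists can compute all three set operations at once; a sketch (illustrative, not the paper's code):

```python
def merge_sets(a, b):
    # a and b are sorted lists of unique elements, each produced by
    # one of the two MapReduces; one linear merge-sort-style pass
    # yields union, intersection, and difference (a - b)
    i = j = 0
    union, inter, diff = [], [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            union.append(a[i]); inter.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            union.append(a[i]); diff.append(a[i]); i += 1
        else:
            union.append(b[j]); j += 1
    union += a[i:] + b[j:]   # leftovers belong to the union
    diff += a[i:]            # leftovers of a are not in b
    return union, inter, diff

# merge_sets([1, 2, 4], [2, 3]) == ([1, 2, 3, 4], [2], [1, 4])
```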

Cartesian Products

Set the reducers up to output the two sets you want the Cartesian product of.

Each merger will get one partition F from the first set of reducers, and the full set of partitions S from the second.

Each merger emits F x S.
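A sketch of such a merger, where F and S_partitions are illustrative in-memory stand-ins for the reducer outputs:

```python
def cartesian_merger(F, S_partitions):
    # F: one partition from the first set of reducers
    # S_partitions: the full set of partitions from the second set;
    # the merger emits F x S
    out = []
    for S in S_partitions:
        out.extend((f, s) for f in F for s in S)
    return out

pairs = cartesian_merger([1, 2], [["a"], ["b"]])
# pairs == [(1, "a"), (2, "a"), (1, "b"), (2, "b")]
```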

Sort-Merge Join

Map: partition records into key ranges according to the values of the attributes you are sorting on, aiming for an even distribution of values across reducers.

Reduce: sort the data.

Merge: join the sorted data for each key range.
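The merge step for one key range can be sketched as a lockstep walk over the two reducer outputs, both already sorted on the join key (for brevity, this sketch assumes unique keys on the right side):

```python
def sort_merge_join(left, right):
    # Both inputs are lists of (key, value) sorted on the join key
    # by the reduce phase; the merger advances two cursors in lockstep
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, lv, rv))
            i += 1
    return out

joined = sort_merge_join([(1, "ann"), (2, "bob")], [(2, "sales"), (3, "hr")])
# joined == [(2, "bob", "sales")]
```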

Hash Join / Nested-Loop Join

Map: use the same hash function for both sets of mappers.

Reduce: produce a hash table from the mapped values. (For a nested-loop join: don't hash.)

Merge: operate on corresponding hash buckets. Use one bucket as the build set and the other as the probe set. (For a nested-loop join: do a loop-match instead of a hash probe.)
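The merger's build-and-probe logic over one pair of corresponding buckets can be sketched as follows (names invented; a nested-loop variant would scan the build set instead of hashing it):

```python
def hash_join_merger(build_bucket, probe_bucket, key_fn):
    # Build: hash every record of one bucket on its join key
    table = {}
    for rec in build_bucket:
        table.setdefault(key_fn(rec), []).append(rec)
    # Probe: look up each record of the other bucket in the table
    out = []
    for rec in probe_bucket:
        for match in table.get(key_fn(rec), []):
            out.append((match, rec))
    return out

out = hash_join_merger([(1, "dept-a")],
                       [(1, "ann"), (2, "bob")],
                       key_fn=lambda r: r[0])
# out == [((1, "dept-a"), (1, "ann"))]
```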

Conclusions

The Map-Reduce-Merge programming model retains Map-Reduce's many great features, while adding relational algebra to the list of database principles it upholds.

The authors suggest, as future work, using Map-Reduce-Merge as a framework for parallel databases.

Thank You.

Questions?

References

"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; Google Labs.

"Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters", by Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker; Yahoo and UCLA; in Proc. of ACM SIGMOD, pp. 1029–1040, 2007.

David DeWitt and Michael Stonebraker, "MapReduce: A major step backwards", databasecolumn.com, http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html. Retrieved 2008-08-27.

"Google's MapReduce Programming Model -- Revisited", by Ralf Lämmel; Microsoft.

Codd, E. F. (1970), "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM 13 (6): 377–387.

http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=107061802

http://cs.baylor.edu/~speegle/5335/2007slides/MapReduceMerge.pdf

