Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Presented By:

Niketan R. Pansare

Outline

Introduction: Map-Reduce, Databases

Motivation & Contributions of this paper.

Map-Reduce-Merge framework

Implementation of relational operators

Conclusion

Introduction: Map Reduce

Programming model:

Processing large data sets

Cheap, unreliable hardware

Distributed processing (recovery, coordination,…)

High degree of transparency

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions: map and reduce. (Note, map is not the higher order function, but a function passed to it.)

map (k1,v1) -> list(k2,v2)

reduce (k2, list (v2)) -> list(v2)
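The two signatures above can be illustrated with a toy, single-process sketch of the URL access count task. The driver `run_map_reduce`, the function names, and the sample log are assumptions made for illustration only; a real MapReduce library distributes these steps across machines.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Toy driver: apply map, group values by key, then apply reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Iterating over sorted keys mimics the sort-by-key step before reduce.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}

def map_url(line_no, url):
    # map (k1, v1) -> list(k2, v2): emit one count per URL access
    return [(url, 1)]

def reduce_count(url, counts):
    # reduce (k2, list(v2)) -> list(v2): a single total per URL
    return [sum(counts)]

# Hypothetical access log: (line number, requested URL)
log = [(0, "/a"), (1, "/b"), (2, "/a")]
url_counts = run_map_reduce(log, map_url, reduce_count)
```

Note that `map_url` and `reduce_count` are the user-supplied functions; the driver plays the role of the library.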

Map Reduce: Big Picture

MapReduce: Pros/Cons

Extremely good for data processing tasks on homogeneous data sets (e.g. distributed grep, count of URL access frequency).

Bad for heterogeneous data sets (e.g. Employee, Department, Sales, …)

Can process heterogeneous data sets by “homogenization” (i.e. inserting two extra attributes: key and data-source tag), but at a cost:

Lots of extra disk space

Excessive map-reduce communication

Limited to queries that can be rewritten as equi-joins

Introduction: Databases

Simple extension of set theory (excluding Normalization, Indexing: See Codd’s paper from 1970).

Supports the following operators:

4 mathematical operations (Union, Intersection, Difference, Cartesian Product)

4 extension operators (Selection, Projection, Join, Relational Division) proposed by Codd.

Others: Aggregation, Group-by, Order-by

Example:

select s_name
from Student s, Courses c
where s.s_course_id = c.c_course_id
  and c.c_course_name = 'COMP 620'

What interests me the most?

Join: It is one of the most frequently occurring operations and perhaps one of the most difficult to optimize.

Properties of join: the output lineage differs from the input lineage. (Cannot easily be plugged into MapReduce.)

Many different ways to implement it:

Nested loop join (naïve algorithm)

Hash join (smaller relation is hashed)

Sort-merge join (both relations sorted on the join attribute)

Motivation for this paper

Databases are slow when we process large amounts of data.

Why? The fastest databases are usually implemented on shared-everything SMP architectures (e.g. Itanium, 64 processors, 128 cores, no clusters). Bottleneck: memory access (cache + memory + disk).

Then why not go for shared-nothing? Joining two relations is difficult with current frameworks (i.e. Map-Reduce).

Why not extend it?

Contribution of this paper

Process heterogeneous data sets using an extension of the MapReduce framework:

Added Merge phase

Hierarchical workflows

Supports different join algorithms.

Map-Reduce-Merge is “relationally complete”.

Map-Reduce-Merge

Map: (k1, v1) → [(k2, v2)]

Reduce: (k2, [v2]) → [v3]

becomes:

Map: (k1, v1) → [(k2, v2)]

Reduce: (k2, [v2]) → (k2, [v3])

Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
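A minimal in-memory sketch of the three signatures above, assuming two small hypothetical data sources (employees and departments). The helper names `map_reduce` and `merge`, and the pairing of every left output with every right output, are illustrative assumptions and not the framework's actual API.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """One lineage: Map (k1, v1) -> [(k2, v2)], Reduce (k2, [v2]) -> (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Unlike plain Map-Reduce, the reducer output keeps its key.
    return [(k2, reduce_fn(k2, vs)) for k2, vs in sorted(groups.items())]

def merge(left, right, merge_fn):
    """Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]); None means no output."""
    out = []
    for left_pair in left:
        for right_pair in right:
            result = merge_fn(left_pair, right_pair)
            if result is not None:
                out.append(result)
    return out

# Hypothetical sources: (emp_id, (dept_id, bonus)) and (dept_id, dept_name)
employees = [(1, ("d1", 100)), (2, ("d2", 50)), (3, ("d1", 25))]
departments = [("d1", "Sales"), ("d2", "Ops")]

bonus_by_dept = map_reduce(employees,
                           lambda k, v: [(v[0], v[1])],   # re-key by dept_id
                           lambda k, vs: [sum(vs)])       # total bonus
name_by_dept = map_reduce(departments,
                          lambda k, v: [(k, v)],
                          lambda k, vs: vs)

def join_on_dept(left, right):
    (k2, totals), (k3, names) = left, right
    return (names[0], totals) if k2 == k3 else None

merged = merge(bonus_by_dept, name_by_dept, join_on_dept)
```

The two `map_reduce` calls produce outputs with different schemas, which is what the extra Merge phase is there to combine.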

Programmer-Definable operations

Partition selector - which data should go to which merger?

Processor - process data on an individual source.

Merger - analogous to the map and reduce definitions, defines the logic of the merge operation.

Configurable iterators - how to step through each of the lists as you merge.

Projection

Return the subset of the data passed in.

Mapper can handle this: Map: (k1, v1) → [(k2, v2)]

v2 may have different schema than v1.
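As a sketch of projection in the mapper: the field names and the `keep` parameter below are hypothetical, chosen only to show v2 carrying a narrower schema than v1.

```python
def project_mapper(k1, v1, keep=("name", "dept")):
    # Map: (k1, v1) -> [(k2, v2)], where v2 keeps only the projected fields.
    return [(k1, {field: v1[field] for field in keep})]

# Hypothetical rows keyed by employee id
rows = [
    (1, {"name": "Ann", "dept": "d1", "salary": 10}),
    (2, {"name": "Bob", "dept": "d2", "salary": 20}),
]
projected = [pair for k1, v1 in rows for pair in project_mapper(k1, v1)]
```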

Aggregation

At the Reduce phase, Map-Reduce performs the sort-by-key and group-by-key functions to ensure that the input to a reducer is a set of tuples t = (k, [v]) in which [v] is the collection of all the values associated with the key k.

Therefore, reducer can implement “group-by” and “aggregate” operators.
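A small sketch of this, assuming an in-memory list of mapped pairs. The `shuffle` helper stands in for the framework's sort-by-key and group-by-key step, and `sum_reducer` is a hypothetical aggregate.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort by key, then group by key: yields tuples (k, [v])."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def sum_reducer(key, values):
    # "group-by" comes from the shuffle; the reducer supplies the aggregate.
    return key, [sum(values)]

# Hypothetical mapper output: (dept_id, salary)
mapped = [("d1", 100), ("d2", 50), ("d1", 25)]
totals = [sum_reducer(k, vs) for k, vs in shuffle(mapped)]
```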

Selection

If the selection condition involves only the attributes of one data source, can implement in mappers.

If it’s on aggregates or a group of values contained in one data source, can implement in reducers.

If it involves attributes or aggregates from both data sources, implement in mergers.
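The three placements can be sketched side by side. The thresholds, field names, and the budgets source below are all hypothetical and exist only to give each phase a condition to test.

```python
from collections import defaultdict

def mapper(k1, v1):
    # Condition on attributes of one source: keep salaries above 40.
    if v1["salary"] > 40:
        return [(v1["dept"], v1["salary"])]
    return []

def reducer(k2, values):
    # Condition on an aggregate: keep departments whose total exceeds 100.
    total = sum(values)
    return (k2, [total]) if total > 100 else None

def merger(left, right):
    # Condition spanning both sources: total must exceed the department budget.
    (dept_l, totals), (dept_r, budgets) = left, right
    if dept_l == dept_r and totals[0] > budgets[0]:
        return (dept_l, totals)
    return None

employees = [
    (1, {"dept": "d1", "salary": 90}),
    (2, {"dept": "d1", "salary": 30}),
    (3, {"dept": "d1", "salary": 60}),
    (4, {"dept": "d2", "salary": 120}),
    (5, {"dept": "d3", "salary": 50}),
]
# Already-reduced output of a second, hypothetical source: (dept_id, [budget])
budgets = [("d1", [100]), ("d2", [200])]

groups = defaultdict(list)
for k1, v1 in employees:
    for k2, v2 in mapper(k1, v1):
        groups[k2].append(v2)

reduced = []
for k2, values in sorted(groups.items()):
    result = reducer(k2, values)
    if result is not None:
        reduced.append(result)

over_budget = []
for left in reduced:
    for right in budgets:
        result = merger(left, right)
        if result is not None:
            over_budget.append(result)
```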

Set union / intersection / difference

Let each of the two MapReduces emit a sorted list of unique elements.

Therefore, a naïve merge-like algorithm in merge sort can perform set union/intersection/difference (i.e. iteration over two sorted lists).
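The iteration over two sorted lists might look like the following sketch. The function name `merge_sets` and its `op` parameter are assumptions for illustration; both inputs must be sorted and free of duplicates, as the reducers are assumed to guarantee.

```python
def merge_sets(a, b, op):
    """Single pass over two sorted, duplicate-free lists.

    op is "union", "intersection", or "difference" (meaning a minus b).
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Element is in both inputs.
            if op in ("union", "intersection"):
                out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Element is only in a.
            if op in ("union", "difference"):
                out.append(a[i])
            i += 1
        else:
            # Element is only in b.
            if op == "union":
                out.append(b[j])
            j += 1
    # Whatever remains in one list has no counterpart in the other.
    if op in ("union", "difference"):
        out.extend(a[i:])
    if op == "union":
        out.extend(b[j:])
    return out

a = [1, 3, 5, 7]
b = [3, 4, 7, 9]
```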

Cartesian Products

Set the reducers up to output the two sets you want the Cartesian product of.

Each merger will get one partition F from the first set of reducers, and the full set of partitions S from the second.

Each merger emits F x S.
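A sketch of one merger per partition of the first set, assuming the reducer outputs are available as in-memory lists of partitions. The name `cartesian_merger` and the sample partitions are hypothetical.

```python
from itertools import product

def cartesian_merger(f_partition, s_partitions):
    """One merger: a single partition F of the first set against all of S."""
    s_all = [row for partition in s_partitions for row in partition]
    return list(product(f_partition, s_all))

# Hypothetical reducer outputs, each split into two partitions
first = [["a", "b"], ["c"]]
second = [[1], [2, 3]]

# Running one merger per partition of the first set covers the whole product.
result = [pair for f in first for pair in cartesian_merger(f, second)]
```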

Sort-Merge Join

Map: partition records into key ranges according to the values of the attributes on which you’re sorting, aiming for an even distribution of values to reducers.

Reduce: sort the data.

Merge: join the sorted data for each key range.
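The three steps can be sketched in a single process as follows. The range boundaries, the helper names, and the student and course records are assumptions for illustration; in a real deployment each bucket would be handled by a separate reducer and merger.

```python
from bisect import bisect_right

def range_partition(records, boundaries):
    """Map step: route (key, value) records into ordered key-range buckets."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect_right(boundaries, key)].append((key, value))
    return buckets

def sort_bucket(bucket):
    """Reduce step: sort one bucket on the join key."""
    return sorted(bucket, key=lambda record: record[0])

def merge_join(left, right):
    """Merge step: join two key-sorted lists from the same key range."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Pair every left record in the run with every right record.
            while i < len(left) and left[i][0] == key:
                for _, right_value in right[j:j_end]:
                    out.append((key, left[i][1], right_value))
                i += 1
            j = j_end
    return out

def sort_merge_join(left_records, right_records, boundaries):
    left = [sort_bucket(b) for b in range_partition(left_records, boundaries)]
    right = [sort_bucket(b) for b in range_partition(right_records, boundaries)]
    joined = []
    for left_bucket, right_bucket in zip(left, right):
        joined.extend(merge_join(left_bucket, right_bucket))
    return joined

# Hypothetical data keyed by course id
students = [(620, "Ann"), (101, "Bob"), (620, "Cy"), (530, "Dee")]
courses = [(101, "Intro"), (620, "COMP 620"), (700, "Seminar")]
joined = sort_merge_join(students, courses, boundaries=[500])
```

Because both sides use the same boundaries, matching keys always land in buckets with the same index, so each merger only needs its own pair of buckets.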

Hash Join/Nested loop

Map: use the same hash function for both sets of mappers.

Reduce: produce a hash table from the values mapped. (For nested loop: Don’t hash)

Merge: operates on corresponding hash buckets. Use one bucket as a build set, and the other as a probe. (For nested loop: do loop-match instead of hash-probe.)
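Both variants can be sketched over in-memory buckets. The helper names, the bucket count, and the sample records are hypothetical; integer keys are used so that Python's built-in `hash` gives the same bucket on every run.

```python
from collections import defaultdict

def hash_partition(records, n_buckets):
    """Map step: the same hash function is applied to both data sources."""
    buckets = [[] for _ in range(n_buckets)]
    for key, value in records:
        buckets[hash(key) % n_buckets].append((key, value))
    return buckets

def build_table(bucket):
    """Reduce step for the build side: hash table from key to values."""
    table = defaultdict(list)
    for key, value in bucket:
        table[key].append(value)
    return table

def hash_merge(build_bucket, probe_bucket):
    """Merge step: probe the build table with each record of the other bucket."""
    table = build_table(build_bucket)
    return [(key, build_value, probe_value)
            for key, probe_value in probe_bucket
            for build_value in table.get(key, [])]

def nested_loop_merge(left_bucket, right_bucket):
    """Nested-loop variant: no hash table, compare every pair in the buckets."""
    return [(left_key, left_value, right_value)
            for left_key, left_value in left_bucket
            for right_key, right_value in right_bucket
            if left_key == right_key]

# Hypothetical data keyed by course id; courses is the smaller (build) side.
courses = [(101, "Intro"), (620, "COMP 620")]
students = [(620, "Ann"), (101, "Bob"), (620, "Cy"), (530, "Dee")]

course_buckets = hash_partition(courses, 2)
student_buckets = hash_partition(students, 2)

hash_joined = []
loop_joined = []
for build_bucket, probe_bucket in zip(course_buckets, student_buckets):
    hash_joined.extend(hash_merge(build_bucket, probe_bucket))
    loop_joined.extend(nested_loop_merge(build_bucket, probe_bucket))
```

Hashing the smaller relation keeps the build table small, which matches the earlier remark that in a hash join the smaller relation is hashed.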

Conclusions

The Map-Reduce-Merge programming model retains Map-Reduce’s many great features, while adding relational algebra to the list of database principles it upholds.

Suggestion (or rather future work): use Map-Reduce-Merge as a framework for parallel databases.

Thank You.

Questions?

References

Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". Google Labs.

Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters". Yahoo and UCLA. Proc. of ACM SIGMOD, pp. 1029–1040, 2007.

David DeWitt and Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html. Retrieved 2008-08-27.

Ralf Lämmel. "Google's MapReduce Programming Model -- Revisited". Microsoft.

E. F. Codd. "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387, 1970.

http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=107061802

http://cs.baylor.edu/~speegle/5335/2007slides/MapReduceMerge.pdf

http://en.wikipedia.org/wiki/Communications_of_the_ACM





