Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Presented By:
Niketan R. Pansare ([email protected])
Outline
Introduction: Map-Reduce, Databases
Motivation & Contributions of this paper.
Map-Reduce-Merge framework
Implementation of relational operators
Conclusion
Introduction: Map Reduce
Programming model:
  Processing large data sets
  Cheap, unreliable hardware
  Distributed processing (recovery, coordination, …)
  High degree of transparency
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.
The user of the MapReduce library expresses the computation as two functions: map and reduce. (Note, map is not the higher order function, but a function passed to it.)
map (k1,v1) -> list(k2,v2)
reduce (k2, list (v2)) -> list(v2)
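To make the two signatures above concrete, here is a minimal single-process sketch of word counting. The driver `run_mapreduce` and the sample documents are hypothetical stand-ins for the real library and its distributed shuffle; only the shapes of `map_fn` and `reduce_fn` follow the signatures on this slide.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map (k1, v1) -> list(k2, v2): emit (word, 1) for every word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce (k2, list(v2)) -> list(v2): sum all counts seen for one word
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical in-memory driver: groups map output by key (the "shuffle"),
    # then calls the reducer once per key.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())}

docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

Note that the user only writes `map_fn` and `reduce_fn`; everything else belongs to the framework.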
Map Reduce: Big Picture
MapReduce: Pros/Cons
Extremely good for data processing tasks on homogeneous data sets (e.g. distributed grep, count of URL access frequency).
Bad for heterogeneous data sets (e.g. Employee, Department, Sales, …).
Can process heterogeneous data sets by “homogenization” (i.e. inserting two extra attributes into every record: a key and a data-source tag), but this has costs:
  Lots of extra disk space
  Excessive map-reduce communication
  Limited to queries that can be rewritten as equi-joins
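A rough sketch of what homogenization looks like in practice, assuming two tiny hypothetical tables (employees and departments). Every record is rewritten into one common shape (join key, data-source tag, payload) so that a single reducer can see both sources; the extra key and tag on every record are where the additional disk space and communication come from.

```python
from collections import defaultdict

# Hypothetical sample data, not from the paper.
employees = [("e1", "d1"), ("e2", "d2"), ("e3", "d1")]   # (emp_id, dept_id)
departments = [("d1", "Sales"), ("d2", "R&D")]           # (dept_id, dept_name)

def homogenize(employees, departments):
    # Rewrite both sources into one schema: (join key, source tag, payload).
    tagged = [(dept_id, "emp", emp_id) for emp_id, dept_id in employees]
    tagged += [(dept_id, "dept", name) for dept_id, name in departments]
    return tagged

def reduce_join(records):
    # Group by join key, split by source tag, then pair the two sides.
    groups = defaultdict(lambda: {"emp": [], "dept": []})
    for key, tag, payload in records:
        groups[key][tag].append(payload)
    out = []
    for key in sorted(groups):
        for emp in groups[key]["emp"]:
            for name in groups[key]["dept"]:
                out.append((emp, name))
    return out

joined = reduce_join(homogenize(employees, departments))
print(joined)
```

Because records are grouped by equality on the key, this trick only covers equi-joins.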
Introduction: Databases
Simple extension of set theory (excluding Normalization, Indexing: See Codd’s paper from 1970).
Supports the following operators:
  4 mathematical operations (Union, Intersection, Difference, Cartesian Product)
  4 extension operators (Selection, Projection, Join, Relational Division) proposed by Codd
  Others: Aggregation, Group-by, Order-by
Example:
select s.s_name from Student s, Courses c where
s.s_course_id = c.c_course_id AND
c.c_course_name = 'COMP 620'
What interests me the most?
Join: It is one of the most frequently occurring operations and perhaps one of the most difficult to optimize.
Properties of join: the output lineage differs from the input lineage, so it cannot easily be plugged into MapReduce.
Many different ways to implement it:
  Nested loop join (naïve algorithm)
  Hash join (smaller relation is hashed)
  Sort-merge join (both relations sorted on the join attribute)
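The three join strategies listed above can be sketched on in-memory lists of (key, value) pairs. This is a single-machine illustration under that assumption; the function names and sample relations are hypothetical.

```python
from collections import defaultdict

def nested_loop_join(r, s):
    # Naive algorithm: compare every pair of tuples.
    return [(k1, a, b) for k1, a in r for k2, b in s if k1 == k2]

def hash_join(r, s):
    # Build a hash table on the smaller relation, probe with the larger one.
    build, probe, swapped = (r, s, False) if len(r) <= len(s) else (s, r, True)
    table = defaultdict(list)
    for k, v in build:
        table[k].append(v)
    out = []
    for k, v in probe:
        for w in table.get(k, []):
            # Always emit (key, value from r, value from s).
            out.append((k, v, w) if swapped else (k, w, v))
    return out

def sort_merge_join(r, s):
    # Sort both relations on the join attribute, then scan them in step.
    r = sorted(r, key=lambda t: t[0])
    s = sorted(s, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key, j_start = r[i][0], j
            while i < len(r) and r[i][0] == key:
                j = j_start  # rewind to the start of the matching run in s
                while j < len(s) and s[j][0] == key:
                    out.append((key, r[i][1], s[j][1]))
                    j += 1
                i += 1
    return out

r = [(1, "a"), (2, "b"), (2, "c")]
s = [(2, "x"), (3, "y"), (2, "z")]
results = [sorted(f(r, s)) for f in (nested_loop_join, hash_join, sort_merge_join)]
print(results[0])
```

All three produce the same set of joined tuples; they differ only in how much work is needed to find the matches.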
Motivation for this paper
Databases are slow when processing large amounts of data.
Why? The fastest databases are usually implemented on shared-everything SMP architectures (e.g. Itanium, 64 processors, 128 cores, no clusters). Bottleneck: memory access (cache + memory + disk).
Then why not go shared-nothing? Joining two relations is difficult with current frameworks (i.e. Map-Reduce).
Why not extend it?
Contribution of this paper
Process heterogeneous data sets using an extension of the MapReduce framework:
  Added Merge phase
  Hierarchical work-flows
Supports different join algorithms.
Map-Reduce-Merge is “relationally complete”.
Map-Reduce-Merge
Map: (k1, v1) → [(k2, v2)]
Reduce: (k2, [v2]) → [v3]
becomes:
Map: (k1, v1) → [(k2, v2)]
Reduce: (k2, [v2]) → (k2, [v3])
Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
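A minimal in-memory sketch of the extended signatures, assuming two hypothetical inputs (per-employee sales and a department table). Each input goes through its own map and reduce, with the reducer now keeping its key, and a merge step then combines the two reduced outputs. The drivers `map_reduce` and `merge` are illustrative helpers, not the framework's API.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: (k1, v1) -> [(k2, v2)];  Reduce: (k2, [v2]) -> (k2, [v3])
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [(k2, reduce_fn(k2, v2s)) for k2, v2s in sorted(groups.items())]

def merge(left, right, merge_fn):
    # Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5])
    # Simplification: try every pair; merge_fn returns None when a pair does not match.
    out = []
    for l in left:
        for r in right:
            m = merge_fn(l, r)
            if m is not None:
                out.append(m)
    return out

# Hypothetical sample data.
sales = [("e1", ("d1", 10)), ("e2", ("d2", 5)), ("e3", ("d1", 7))]  # emp -> (dept, amount)
depts = [("d1", "Sales"), ("d2", "R&D")]                             # dept -> name

totals = map_reduce(sales, lambda k, v: [(v[0], v[1])], lambda k, vs: [sum(vs)])
names = map_reduce(depts, lambda k, v: [(k, v)], lambda k, vs: vs)
merged = merge(totals, names,
               lambda l, r: (r[1][0], l[1]) if l[0] == r[0] else None)
print(merged)
```

The key point is that the reducer output stays keyed, so the merger can line up records from the two lineages.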
Programmer-Definable operations
Partition selector - which data should go to which merger?
Processor - process data on an individual source.
Merger - analogous to the map and reduce definitions, defines the logic that performs the merge operation.
Configurable iterators - how to step through each of the lists as you merge.
Projection
Return a subset of the attributes of the data passed in.
Mapper can handle this: Map: (k1, v1) → [(k2, v2)]
v2 may have a different schema than v1.
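As a small illustration of projection in a mapper, the sketch below keeps only selected attributes of each record. Records are modeled as dictionaries, and the column names and sample rows are hypothetical.

```python
def project_map(key, record, columns=("name",)):
    # Map: (k1, v1) -> [(k2, v2)], where v2 keeps only the requested attributes.
    return [(key, {c: record[c] for c in columns})]

rows = [
    (1, {"name": "Ann", "dept": "d1", "salary": 100}),
    (2, {"name": "Bob", "dept": "d2", "salary": 80}),
]
projected = [kv for k, v in rows for kv in project_map(k, v)]
print(projected)
```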
Aggregation
At the Reduce phase, Map-Reduce performs the sort-by-key and group-by-key functions to ensure that the input to a reducer is a set of tuples t = (k, [v]) in which [v] is the collection of all the values associated with the key k.
Therefore, a reducer can implement the “group-by” and “aggregate” operators.
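A sketch of group-by plus an aggregate, assuming the map output is a list of (department, salary) pairs. The `shuffle` helper is a hypothetical stand-in for the framework's sort-by-key and group-by-key step; the reducer computes an average per key.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    # Sort by key, then group by key: yields (k, [v]) tuples as the reducer expects.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def avg_reduce(key, values):
    # Aggregate operator: average of all values associated with one key.
    return (key, sum(values) / len(values))

mapped = [("d1", 100), ("d2", 80), ("d1", 50)]   # hypothetical map output
averages = [avg_reduce(k, vs) for k, vs in shuffle(mapped)]
print(averages)
```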
Selection
If the selection condition involves only the attributes of one data source, implement it in the mappers.
If it is on aggregates or a group of values contained in one data source, implement it in the reducers.
If it involves attributes or aggregates from both data sources, implement it in the mergers.
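The three placements of a selection predicate can be sketched as follows. The predicates (salary above 50, department payroll above 150, payroll above budget) and the sample data are hypothetical and chosen only to show one filter at each phase.

```python
from collections import defaultdict

def select_map(key, row):
    # Single-source attribute predicate: keep rows with salary > 50.
    return [(row["dept"], row["salary"])] if row["salary"] > 50 else []

def select_reduce(dept, salaries):
    # Predicate on an aggregate of one source: keep departments with payroll > 150.
    total = sum(salaries)
    return [(dept, total)] if total > 150 else []

def select_merge(dept_total, dept_budget):
    # Predicate across both sources: keep departments whose payroll exceeds budget.
    (dept, total), (dept2, budget) = dept_total, dept_budget
    return [(dept, total, budget)] if dept == dept2 and total > budget else []

rows = [
    (1, {"dept": "d1", "salary": 100}),
    (2, {"dept": "d1", "salary": 90}),
    (3, {"dept": "d2", "salary": 40}),    # dropped by the mapper
    (4, {"dept": "d2", "salary": 200}),
    (5, {"dept": "d3", "salary": 60}),    # its group is dropped by the reducer
]
budgets = [("d1", 150), ("d2", 300), ("d3", 10)]

groups = defaultdict(list)
for k, row in rows:
    for dept, salary in select_map(k, row):
        groups[dept].append(salary)
reduced = [t for dept in sorted(groups) for t in select_reduce(dept, groups[dept])]
selected = [m for t in reduced for b in budgets for m in select_merge(t, b)]
print(selected)
```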
Set union / intersection / difference
Let each of the two MapReduces emit a sorted list of unique elements.
Therefore, a naïve merge step like the one in merge sort (i.e. iterating over the two sorted lists in parallel) can perform set union, intersection, or difference.
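A sketch of that merge-style iteration, assuming each side arrives as a sorted list of unique elements. One generic loop covers all three operators; the flag names are hypothetical and simply say which elements to keep.

```python
def merge_sets(a, b, keep_left_only, keep_both, keep_right_only):
    # Walk two sorted, duplicate-free lists in step, as in the merge phase of merge sort.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            if keep_left_only:
                out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            if keep_right_only:
                out.append(b[j])
            j += 1
        else:
            if keep_both:
                out.append(a[i])
            i += 1
            j += 1
    # Whatever remains on one side has no counterpart on the other.
    if keep_left_only:
        out.extend(a[i:])
    if keep_right_only:
        out.extend(b[j:])
    return out

def union(a, b):
    return merge_sets(a, b, True, True, True)

def intersection(a, b):
    return merge_sets(a, b, False, True, False)

def difference(a, b):
    return merge_sets(a, b, True, False, False)

a, b = [1, 3, 5, 7], [3, 4, 7, 9]
print(union(a, b), intersection(a, b), difference(a, b))
```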
Cartesian Products
Set the reducers up to output the two sets you want the Cartesian product of.
Each merger gets one partition F from the first set of reducers and the full set of partitions S from the second.
Each merger emits F × S.
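A small sketch of this scheme, assuming the reducer outputs are already available as lists of partitions. Each simulated merger receives one partition F of the first set and all partitions of the second set; the sample partitions are hypothetical.

```python
from itertools import product

def cartesian_merge(first_partitions, second_partitions):
    # Every merger sees the full second set S ...
    full_s = [row for part in second_partitions for row in part]
    # ... and exactly one partition F of the first set, and emits F x S.
    return [list(product(f, full_s)) for f in first_partitions]

first = [["a", "b"], ["c"]]     # partitions from the first set of reducers
second = [[1], [2, 3]]          # partitions from the second set of reducers
merger_outputs = cartesian_merge(first, second)
all_pairs = [pair for out in merger_outputs for pair in out]
print(all_pairs)
```

The union of all merger outputs is the complete product, and no pair is emitted twice because each F goes to exactly one merger.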
Sort-Merge Join
Map: partition records into key ranges according to the values of the attributes on which you’re sorting, aiming for an even distribution of values to reducers.
Reduce: sort the data.
Merge: join the sorted data for each key range.
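The three phases can be sketched in memory as below. The split points in `BOUNDARIES` are a hypothetical choice; a real job would pick them (for example by sampling) so that the ranges are evenly loaded. Both relations must use the same boundaries so that matching keys land in matching ranges.

```python
import bisect

BOUNDARIES = [10, 20]  # hypothetical split points: keys < 10, 10..19, >= 20

def range_partition(records):
    # Map: send each record to the key range that contains its join key.
    parts = [[] for _ in range(len(BOUNDARIES) + 1)]
    for key, value in records:
        parts[bisect.bisect_right(BOUNDARIES, key)].append((key, value))
    return parts

def sort_partition(part):
    # Reduce: sort one key range on the join key.
    return sorted(part, key=lambda kv: kv[0])

def merge_join(left, right):
    # Merge: join two sorted partitions covering the same key range.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair this left record with the whole run of equal keys on the right.
            j_run = j
            while j_run < len(right) and right[j_run][0] == lk:
                out.append((lk, left[i][1], right[j_run][1]))
                j_run += 1
            i += 1
    return out

def sort_merge_join(r, s):
    r_parts = [sort_partition(p) for p in range_partition(r)]
    s_parts = [sort_partition(p) for p in range_partition(s)]
    return [row for lp, rp in zip(r_parts, s_parts) for row in merge_join(lp, rp)]

r = [(25, "r25"), (3, "r3"), (12, "r12"), (12, "r12b")]
s = [(12, "s12"), (25, "s25"), (7, "s7")]
joined = sort_merge_join(r, s)
print(joined)
```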
Hash Join/Nested loop
Map: use the same hash function for both sets of mappers.
Reduce: produce a hash table from the values mapped. (For nested loop: don’t hash.)
Merge: operates on corresponding hash buckets. Use one bucket as a build set, and the other as a probe. (For nested loop: do loop-match instead of hash-probe.)
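A single-process sketch of the hash-join phases, assuming integer join keys and a hypothetical bucket count `NUM_BUCKETS`. Both relations are partitioned with the same function, each bucket is turned into a hash table, and the merger probes one table with the other for corresponding buckets.

```python
NUM_BUCKETS = 3  # hypothetical number of partitions

def bucket_of(key):
    # The same hash function must be used for both relations.
    return hash(key) % NUM_BUCKETS

def map_partition(records):
    # Map: route each record to the bucket chosen by its join key.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records:
        buckets[bucket_of(key)].append((key, value))
    return buckets

def reduce_build(bucket):
    # Reduce: build a hash table (key -> list of values) for one bucket.
    table = {}
    for key, value in bucket:
        table.setdefault(key, []).append(value)
    return table

def merge_probe(build_table, probe_table):
    # Merge: use one side as the build set and probe it with the other side.
    out = []
    for key, probe_values in probe_table.items():
        for b in build_table.get(key, []):
            for p in probe_values:
                out.append((key, b, p))
    return out

def hash_join(r, s):
    r_tables = [reduce_build(b) for b in map_partition(r)]
    s_tables = [reduce_build(b) for b in map_partition(s)]
    return [row for rt, st in zip(r_tables, s_tables) for row in merge_probe(rt, st)]

r = [(1, "a"), (4, "b"), (5, "c")]
s = [(4, "x"), (5, "y"), (6, "z"), (4, "w")]
joined = hash_join(r, s)
print(sorted(joined))
```

For the nested-loop variant, the reducers would pass the buckets through unchanged and the merger would compare every pair of records in corresponding buckets instead of probing a table.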
Conclusions
The Map-Reduce-Merge programming model retains Map-Reduce’s many great features, while adding relational algebra to the list of database principles it upholds.
Suggestion (or rather future work): use Map-Reduce-Merge as a framework for parallel databases.
Thank You.
Questions?
References
Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". Google Labs.
Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters". Yahoo and UCLA. Proc. of ACM SIGMOD, pp. 1029–1040, 2007.
David DeWitt and Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html. Retrieved 2008-08-27.
Ralf Lämmel. "Google's MapReduce Programming Model -- Revisited". Microsoft.
E. F. Codd (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387.
http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=107061802
http://cs.baylor.edu/~speegle/5335/2007slides/MapReduceMerge.pdf