Lecture 2 – MapReduce: Theory and Implementation
CSE 490h – Introduction to Distributed Computing, Winter 2008
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
Last Class
How do I process lots of data?
  Distribute the work
Can I distribute the work?
  Maybe… if it’s not dependent on other tasks
  Example: Fibonacci.
Last Class
What problems can occur?
  Large tasks
  Unpredictable bugs
  Machine failure
How do we solve / avoid these?
  Break up into small chunks?
  Restart tasks?
  Use known working solutions
MapReduce
Concept from functional programming
Implemented by Google
Applied to a large number of problems
Functional Programming Review
Java:
int fooA(String[] list) {
    return bar1(list) + bar2(list);
}

int fooB(String[] list) {
    return bar2(list) + bar1(list);
}
Do they give the same result?
Functional Programming Review
Functional Programming:
fun fooA(l: int list) = bar1(l) + bar2(l)
fun fooB(l: int list) = bar2(l) + bar1(l)
Do they give the same result?
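The two questions above have different answers, and a small Python sketch can show why. Everything here is hypothetical illustration: the slides do not define bar1 or bar2, so the shared `counter` and both pairs of helper functions are assumptions chosen to make the effect visible.

```python
# Impure versions: both helpers read and modify shared state,
# so the result depends on which one runs first.
counter = {"n": 0}

def bar1_impure(items):
    counter["n"] += 1                  # side effect on shared state
    return len(items) * counter["n"]

def bar2_impure(items):
    counter["n"] *= 10                 # side effect on shared state
    return counter["n"]

def foo_a_impure(items):
    return bar1_impure(items) + bar2_impure(items)

def foo_b_impure(items):
    return bar2_impure(items) + bar1_impure(items)

def run_from_fresh_state(fn, items):
    """Reset the shared counter, then call fn."""
    counter["n"] = 0
    return fn(items)

# Pure versions: the result depends only on the argument,
# so the order of the two calls cannot matter.
def bar1(items):
    return len(items)

def bar2(items):
    return sum(items)

def foo_a(items):
    return bar1(items) + bar2(items)

def foo_b(items):
    return bar2(items) + bar1(items)
```

With side effects, `run_from_fresh_state(foo_a_impure, [1, 2, 3])` and `run_from_fresh_state(foo_b_impure, [1, 2, 3])` disagree; without them, `foo_a` and `foo_b` always agree.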
Functional Programming Review
Operations do not modify data structures: They always create new ones
Original data still exists in unmodified form
Functional Updates Do Not Modify Structures
fun foo(x, lst) =
  let val lst' = rev lst
  in rev (x :: lst') end
foo: 'a * 'a list -> 'a list
The foo() function above reverses the list, adds the new element to the front, and reverses the result. The net effect is to append the item to the end of the list.
But it never modifies lst!
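The same idea can be sketched in Python, assuming we restrict ourselves to operations that build new lists (slicing and concatenation) instead of mutating methods such as `append`. The variable names `original` and `updated` are illustrative only.

```python
def foo(x, lst):
    """Return a new list with x appended; lst itself is left untouched."""
    reversed_lst = lst[::-1]             # a reversed copy, not an in-place reverse
    return ([x] + reversed_lst)[::-1]    # put x on the front, then reverse again

original = [1, 2, 3]
updated = foo(4, original)
```

After the call, `updated` holds the longer list while `original` still holds its three elements.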
Functions Can Be Used As Arguments
fun DoDouble(f, x) = f (f x)
It does not matter what f does to its argument; DoDouble() will do it twice.
What is the type of this function?
  x: 'a
  f: 'a -> 'a
  DoDouble: ('a -> 'a) * 'a -> 'a
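A rough Python counterpart, with type hints standing in for the inferred SML type. The name `do_double` and the type variable `T` are this sketch's own; the hints are documentation only and are not enforced at run time.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def do_double(f: Callable[[T], T], x: T) -> T:
    """Apply f twice: the input and output of f must have the same type."""
    return f(f(x))

plus_three_twice = do_double(lambda n: n + 3, 10)
shout_twice = do_double(lambda s: s + "!", "hi")
```

The function passed in can do anything to its argument, as long as its result can be fed back in as an argument.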
map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
[Figure: f applied to each element of the input list]
map f lst: ('a -> 'b) -> ('a list) -> ('b list)
map Implementation
This implementation moves left-to-right across the list, mapping elements one at a time
… But does it need to?
fun map f [] = []
  | map f (x::xs) = (f x) :: (map f xs)
Implicit Parallelism In map
In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
If the order in which f is applied to the elements of the list does not affect the result (that is, f has no side effects), we can reorder or parallelize execution
This is the “secret” that MapReduce exploits
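A small sketch of this point, assuming a pure function (`square`, a hypothetical example) and Python's standard `concurrent.futures` thread pool. `Executor.map` hands elements to workers in whatever order they become free but returns results in input order. CPython threads will not speed up CPU-bound work, so this shows the independence of the calls, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """A pure function: the result depends only on n."""
    return n * n

def sequential_map(f, items):
    """Left-to-right map, like the recursive SML definition."""
    return [f(x) for x in items]

def parallel_map(f, items, workers=4):
    """Apply f to the elements concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, items))

numbers = list(range(10))
same_result = parallel_map(square, numbers) == sequential_map(square, numbers)
```

Because `square` cannot observe the other calls, both versions produce the same list.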
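Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.
[Figure: the initial accumulator value is passed through successive applications of f, one per list element; the last application yields the returned value]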
fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
fold left vs. fold right
Order of list elements can be significant
Fold left moves left-to-right across the list
Fold right moves right-to-left across the list
SML Implementation:
fun foldl f a [] = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs
fun foldr f a [] = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
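To see when the direction matters, here is a Python transcription of the two SML definitions, written as loops. It keeps the SML argument order f(element, accumulator); the `cons` helper is a hypothetical example of a function for which order is significant.

```python
def foldl(f, acc, items):
    """Fold left: combine elements with the accumulator from left to right."""
    for x in items:
        acc = f(x, acc)
    return acc

def foldr(f, acc, items):
    """Fold right: combine elements with the accumulator from right to left."""
    for x in reversed(items):
        acc = f(x, acc)
    return acc

def cons(x, acc):
    """Put x on the front of acc, returning a new list."""
    return [x] + acc

left = foldl(cons, [], [1, 2, 3])    # builds the list in reverse
right = foldr(cons, [], [1, 2, 3])   # rebuilds the list in its original order
```

For a commutative and associative function such as addition, both folds give the same answer; for `cons` they do not.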
Example
fun foo(l: int list) = sum(l) + mul(l) + length(l)
How can we implement this?
Example (Solved)
fun foo(l: int list) = sum(l) + mul(l) + length(l)
fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
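The same three folds can be sketched with Python's `functools.reduce`, which is a left fold. Note that `reduce` passes the accumulator first and the element second, the opposite of the SML lambdas above. The names `total`, `product`, and `count` are chosen here to avoid shadowing Python built-ins; they are not from the slides.

```python
from functools import reduce

def total(lst):
    """Sum of the elements, starting from 0."""
    return reduce(lambda acc, x: x + acc, lst, 0)

def product(lst):
    """Product of the elements, starting from 1."""
    return reduce(lambda acc, x: x * acc, lst, 1)

def count(lst):
    """Number of elements: add 1 per element and ignore its value."""
    return reduce(lambda acc, _x: 1 + acc, lst, 0)

def foo(lst):
    return total(lst) + product(lst) + count(lst)

example = foo([1, 2, 3, 4])   # 10 + 24 + 4
```

Each function differs only in the combining step and the initial accumulator, which is the point of the exercise.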
Google MapReduce
Input Handling
Map Function
Partition Function
Compare Function
Reduce Function
Output Writer
Input Handling
Divides up data into bite-size chunks
Starts up tasks
Assigns tasks to idle workers
Map
Input: Key, Value pair
Output: Key, Value pairs
Example: Annual Rainfall Per City
Map (Example)
Example: Annual Rainfall Per City
map(String key, String value):
  // key: date
  // value: weather info
  foreach (City c in value)
    EmitIntermediate(c, c.rainfall)
Partition Function
Allocates map output to particular reduce tasks
Input: key, number of reduce tasks
Output: Index of the desired reduce task
Typical: hash(key) % numberOfReduces
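A minimal sketch of the typical hash-mod partitioner. It uses `zlib.crc32` as the hash, which is an assumption of this example: Python's built-in `hash()` for strings is randomized per process, so it would not send the same key to the same reduce task across machines. The function name and city keys are illustrative.

```python
import zlib

def partition(key: str, num_reduces: int) -> int:
    """Map a key to a reduce task index in the range [0, num_reduces)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces

cities = ["Seattle", "Phoenix", "Boston", "Miami"]
assignments = {city: partition(city, 4) for city in cities}
```

The property that matters is that every occurrence of a given key, from any map task, lands on the same reduce task.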
Comparison
Sorts input for each reduce
Example: Annual rainfall per city
Sorts rainfall data for each city
Seattle: {0, 0, 0, 1, 4, 7, 10, …}
Reduce
Input: Key, Sorted list of values
Output: Single value
Example: Annual rainfall per city
Reduce (Example)
Example: Annual rainfall per city
reduce(String key, Iterator values):
  // key: city
  // values: rainfall
  sum = 0, count = 0
  for each (v in values)
    sum += v
    count = count + 1
  Emit(sum / count)
Output
Writes the output to storage (GFS, etc.)
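The stages described so far can be strung together in a single-process Python sketch. This is a toy simulation under stated assumptions, not Google's implementation: there are no workers, no failures, and no GFS; the rainfall readings are made-up sample data; and `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce` are names invented for this example. The reduce step averages the values, as the reduce pseudocode does.

```python
import zlib
from collections import defaultdict

def map_fn(date, readings):
    """Map: emit one (city, rainfall) pair per reading for this date."""
    for city, rainfall in readings:
        yield city, rainfall

def partition(key, num_reduces):
    """Partition: choose the reduce task for a key with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces

def reduce_fn(city, sorted_values):
    """Reduce: collapse the sorted values for one city into a single value."""
    return city, sum(sorted_values) / len(sorted_values)

def run_mapreduce(records, num_reduces=2):
    # Map phase: route each intermediate pair to its partition, grouped by key.
    partitions = [defaultdict(list) for _ in range(num_reduces)]
    for date, readings in records:
        for key, value in map_fn(date, readings):
            partitions[partition(key, num_reduces)][key].append(value)

    # Barrier: every map call has finished before any reduce call starts.
    output = {}
    for part in partitions:
        for key in sorted(part):
            values = sorted(part[key])            # comparison step
            out_key, out_value = reduce_fn(key, values)
            output[out_key] = out_value           # stand-in for the output writer
    return output

# Made-up sample input: (date, [(city, rainfall), ...])
records = [
    ("2008-01-01", [("Seattle", 4), ("Phoenix", 0)]),
    ("2008-01-02", [("Seattle", 10), ("Phoenix", 0)]),
    ("2008-01-03", [("Seattle", 1), ("Phoenix", 3)]),
]
result = run_mapreduce(records)
```

Changing `num_reduces` changes which partition handles each city but not the final result, since all values for a key always meet in the same partition.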
[Figure: MapReduce data flow. Input key*value pairs from data stores 1 through n feed the map tasks, each of which emits (key, values...) groups for keys 1, 2, 3, ... A barrier aggregates intermediate values by output key. Each reduce task receives one key with its intermediate values and produces the final values for that key.]
MapReduce for Google Local
Intersections
Rendering Tiles
Finding nearest gas stations