Map-Reduce and Its Children
Distributed File Systems
Map-Reduce and Hadoop
Dataflow Systems
Extensions for Recursion
Distributed File Systems
Files are very large and are read or appended, not overwritten. They are divided into chunks, typically 64MB each. Chunks are replicated at several compute nodes. A master (possibly replicated) keeps track of all locations of all chunks.
Compute Nodes
Compute nodes are organized into racks. Intra-rack connection is typically gigabit speed; inter-rack connection is faster, but only by a small factor.
Implementations
GFS (Google File System) – proprietary.
HDFS (Hadoop Distributed File System) – open source.
CloudStore (formerly the Kosmos File System) – open source.
The New Stack
Distributed File System
Map-Reduce, e.g., Hadoop
Object Store (key-value store), e.g., BigTable, HBase, Cassandra
SQL Implementations, e.g., PIG (relational algebra), HIVE
Map-Reduce Systems
Map-reduce (Google) and its open-source (Apache) equivalent, Hadoop. An important specialized parallel-computing tool. Copes with compute-node failures while avoiding a restart of the entire job.
Key-Value Stores
BigTable (Google), HBase, Cassandra (Apache), Dynamo (Amazon). Each row is a key plus values over a flexible set of columns. Each column component can itself be a set of values.
SQL-Like Systems
PIG – Yahoo! implementation of relational algebra. Translates to a sequence of map-reduce operations, using Hadoop.
Hive – open-source (Apache) implementation of a restricted SQL, called QL, over Hadoop.
SQL-Like Systems – (2)
Sawzall – Google implementation of parallel select + aggregation.
Scope – Microsoft implementation of restricted SQL.
Map-Reduce
You write two functions, Map and Reduce, each with a special form to be explained. The system (e.g., Hadoop) creates a large number of tasks for each function, and work is divided among the tasks in a precise way.
Map-Reduce Algorithms
Map tasks convert inputs to key-value pairs; "keys" are not necessarily unique.
Outputs of Map tasks are sorted by key, and each key is assigned to one Reduce task.
Reduce tasks combine the values associated with each key.
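The Map/group-by-key/Reduce pattern above can be sketched as a single-process simulation (the function names here are hypothetical; a real system like Hadoop runs many Map and Reduce tasks in parallel):

```python
from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word; keys need not be unique.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Combine all values associated with one key.
    return (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for chunk in inputs:
        for key, value in map_fn(chunk):   # Map phase
            groups[key].append(value)      # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # Reduce phase

counts = map_reduce(["a b a", "b c"], map_fn, reduce_fn)
# counts == {"a": 2, "b": 2, "c": 1}
```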
Coping With Failures
Map-reduce is designed to deal with compute nodes failing to execute a task. It re-executes failed tasks, not whole jobs. Failure modes:
1. Compute-node failure (e.g., disk crash).
2. Rack communication failure.
3. Software failure, e.g., a task requires Java n; the node has Java n-1.
Things Map-Reduce is Good At
1. Matrix-matrix and matrix-vector multiplication. One step of the PageRank iteration was the original application.
2. Relational-algebra operations. We'll do an example of the join.
3. Many other "embarrassingly parallel" operations.
Joining by Map-Reduce
Suppose we want to compute R(A,B) JOIN S(B,C) using k Reduce tasks, i.e., find tuples with matching B-values. R and S are each stored in a chunked file.
Joining by Map-Reduce – (2)
Use a hash function h from B-values to k buckets; bucket = Reduce task.
The Map tasks take chunks from R and S, and send:
Tuple R(a,b) to Reduce task h(b). Key = b; value = R(a,b).
Tuple S(b,c) to Reduce task h(b). Key = b; value = S(b,c).
Joining by Map-Reduce – (3)
[Diagram: Map tasks send R(a,b) to Reduce task i if h(b) = i, and send S(b,c) to Reduce task i if h(b) = i. Reduce task i then produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.]
Joining by Map-Reduce – (4)
Key point: If R(a,b) joins with S(b,c), then both tuples are sent to Reduce task h(b).
Thus, their join (a,b,c) will be produced there and shipped to the output file.
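A toy single-process sketch of this hash-join (the number of Reduce tasks and the tagging of tuples by relation are assumptions of this illustration, not part of any particular system's API):

```python
from collections import defaultdict

K = 4  # number of Reduce tasks (assumed)

def h(b):
    # Hash function from B-values to k buckets; bucket = Reduce task.
    return hash(b) % K

def map_R(tuples_R):
    # Tuple R(a,b) goes to Reduce task h(b), with key b.
    return [(h(b), (b, ("R", a))) for a, b in tuples_R]

def map_S(tuples_S):
    # Tuple S(b,c) goes to Reduce task h(b), with key b.
    return [(h(b), (b, ("S", c))) for b, c in tuples_S]

def reduce_join(pairs):
    # One Reduce task: join R- and S-tuples sharing the same B-value.
    by_key = defaultdict(lambda: ([], []))
    for b, (rel, x) in pairs:
        by_key[b][0 if rel == "R" else 1].append(x)
    return [(a, b, c) for b, (As, Cs) in by_key.items() for a in As for c in Cs]

tasks = defaultdict(list)
for i, kv in map_R([(1, "x"), (2, "y")]) + map_S([("x", 10), ("x", 20)]):
    tasks[i].append(kv)
result = sorted(t for i in tasks for t in reduce_join(tasks[i]))
# result == [(1, "x", 10), (1, "x", 20)]
```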
Dataflow Systems
Arbitrary Acyclic Flow Among Tasks
Preserving Fault Tolerance
The Blocking Property
Generalization of Map-Reduce
Map-reduce uses only two functions (Map and Reduce). Each is implemented by a rank of
tasks. Data flows from Map tasks to Reduce
tasks only.
Generalization – (2)
The natural generalization is to allow any number of functions, connected in an acyclic network.
Each function is implemented by tasks that feed the tasks of its successor function(s).
Key fault-tolerance (blocking) property: tasks produce all their output at the end.
Many Implementations
1. Clustera – University of Wisconsin.
2. Hyracks – Univ. of California/Irvine.
3. Dryad/DryadLINQ – Microsoft.
4. Nephele/PACT – T. U. Berlin.
5. BOOM – Berkeley.
6. epiC – N. U. Singapore.
Example: Join + Aggregation
Relations D(emp, dept) and S(emp, sal). Compute the sum of the salaries for each department.
D JOIN S is computed by map-reduce. But each Reduce task can also group its emp-dept-sal tuples by dept and sum the salaries.
Example: Continued
A third function is needed to take the dept-SUM(sal) pairs from each Reduce task, organize them by dept, and compute the final sum for each department.
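A minimal sketch of this three-stage dataflow (the sample relations, sharding by hash of emp, and function names are all assumptions of the illustration): Reduce tasks join and pre-aggregate, and a third function merges the partial sums.

```python
from collections import defaultdict

# Hypothetical data: D(emp, dept) and S(emp, sal).
D = [("ann", "toys"), ("bob", "toys"), ("eve", "food")]
S = [("ann", 10), ("bob", 20), ("eve", 30)]

K = 2  # number of Reduce tasks (assumed)

def partial_sums(task_emps):
    # One Reduce task: join its emp tuples on emp, then group by dept.
    dept = {e: d for e, d in D if e in task_emps}
    sums = defaultdict(int)
    for e, sal in S:
        if e in task_emps:
            sums[dept[e]] += sal
    return dict(sums)

# Shard employees across Reduce tasks by hash(emp) % K.
shards = [{e for e, _ in D if hash(e) % K == i} for i in range(K)]
partials = [partial_sums(s) for s in shards]

# Third function: merge the per-task (dept, partial sum) pairs.
totals = defaultdict(int)
for p in partials:
    for dname, s in p.items():
        totals[dname] += s
# totals == {"toys": 30, "food": 30}
```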
Recursion
Transitive-Closure Example
Fault-Tolerance Problem
Endgame Problem
Some Systems and Approaches
Recent research ideas contributed by F. Afrati, V. Borkar, M. Carey, N. Polyzotis
Applications Requiring Recursion
1. PageRank, the original map-reduce application, is really a recursion implemented by many rounds of map-reduce.
2. Analysis of Web structure.
3. Analysis of social networks.
4. PDE's.
Transitive Closure
Many recursive applications involving large data are similar to transitive closure:

Nonlinear:
Path(X,Y) :- Arc(X,Y)
Path(X,Y) :- Path(X,Z) & Path(Z,Y)

(Right) Linear:
Path(X,Y) :- Arc(X,Y)
Path(X,Y) :- Arc(X,Z) & Path(Z,Y)
Implementing TC on a Cluster
Use k tasks. A hash function h sends each node of the graph to one of the k tasks.
Task i receives and stores Path(a,b) if either h(a) = i or h(b) = i, or both.
Task i must join Path(a,c) with Path(c,b) if h(c) = i.
TC on a Cluster – Basis
Data is stored as the relation Arc(a,b).
Map tasks read chunks of the Arc relation and send each tuple Arc(a,b) to recursive tasks h(a) and h(b), where it is treated as if it were the tuple Path(a,b).
If h(a) = h(b), only one task receives it.
TC on a Cluster – Recursive Tasks
[Diagram: Task i receives Path(a,b). It stores Path(a,b) if new; otherwise it ignores the fact. It looks up stored Path(b,c) and/or Path(d,a), for any c and d, then sends Path(a,c) to tasks h(a) and h(c), and sends Path(d,b) to tasks h(d) and h(b).]
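The store-if-new/join/send loop of the recursive tasks can be sketched in one process with a worklist standing in for the messages between tasks (a simplification: the hash-partitioning across k tasks is omitted):

```python
from collections import deque

def transitive_closure(arcs):
    paths = set()
    queue = deque(arcs)           # basis: every Arc(a,b) arrives as Path(a,b)
    while queue:
        a, b = queue.popleft()
        if (a, b) in paths:
            continue              # not new: ignore
        paths.add((a, b))         # store Path(a,b) if new
        for c, d in list(paths):  # look up Path(b,*) and Path(*,a)
            if c == b:
                queue.append((a, d))   # "send" the joined Path(a,d)
            if d == a:
                queue.append((c, b))   # "send" the joined Path(c,b)
    return paths

tc = transitive_closure([(1, 2), (2, 3), (3, 4)])
# tc == {(1,2), (2,3), (3,4), (1,3), (2,4), (1,4)}
```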
Big Problem: Managing Failure
Map-reduce depends on the blocking property.
Only then can you restart a failed task without restarting the whole job.
But any recursive task has to deliver some output and later get more input.
HaLoop (U. Washington)
Iterates Hadoop, once for each round of the recursion, like iterative PageRank.
Similar idea: Twister (U. Indiana).
The clever piece is the way HaLoop tries to run each task in round i at a compute node where it can find its needed output from round i – 1.
Pregel (Google)
Views all computation as a recursion on some graph.
Nodes send messages to one another; messages are bunched into supersteps.
All compute nodes are checkpointed after some fixed number of supersteps.
On failure, all tasks are rolled back to the previous checkpoint.
Example: Shortest Paths Via Pregel
[Diagram: Node N keeps a table of shortest paths to N. On receiving the message "I found a path from node M to you of length L," it asks: is this the shortest path from M I know about? If so, it updates its table and sends, along its out-edges of lengths 5, 3, and 6, the messages "I found a path from node M to you of length L+5 / L+3 / L+6."]
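The superstep loop just described can be sketched as follows (a single-process simulation in the style of Pregel; the function name and graph are hypothetical, and checkpointing is omitted):

```python
import math
from collections import defaultdict

def pregel_sssp(edges, source):
    graph = defaultdict(list)
    for u, v, w in edges:
        graph[u].append((v, w))
    dist = defaultdict(lambda: math.inf, {source: 0})  # table of shortest paths
    messages = {source: 0}
    while messages:                       # one superstep per iteration
        next_messages = defaultdict(lambda: math.inf)
        for node, d in messages.items():
            for nbr, w in graph[node]:    # "path of length d + w to you"
                next_messages[nbr] = min(next_messages[nbr], d + w)
        # A node acts on a message only if it beats its table entry.
        messages = {n: d for n, d in next_messages.items() if d < dist[n]}
        dist.update(messages)
    return dict(dist)

d = pregel_sssp([("s", "a", 5), ("s", "b", 3), ("b", "a", 1)], "s")
# d == {"s": 0, "a": 4, "b": 3}
```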
Using Idempotence
Some recursive applications allow restart of tasks even if they have produced some output.
Example: TC is idempotent; you can send a task a duplicate Path fact without altering the result. But if you were counting paths, the answer would be wrong.
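A two-line illustration of the contrast (the data here is made up): delivering a duplicate fact after a restart is harmless when state is a set, but corrupts a count.

```python
# TC-style state: set union is idempotent.
reached = set()
for fact in [("a", "b"), ("a", "b")]:   # duplicate delivery after a restart
    reached.add(fact)
assert len(reached) == 1                 # duplicate changed nothing

# Counting paths is not idempotent.
path_count = {}
for fact in [("a", "b"), ("a", "b")]:
    path_count[fact] = path_count.get(fact, 0) + 1
# path_count[("a", "b")] == 2, though only one path was discovered
```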
Big Problem: The Endgame
Some recursions, like TC, take a large number of rounds, but the number of new discoveries in later rounds drops.
T. Vassilakis (Google): searches forward on the Web graph can take hundreds of rounds.
Problem: in a cluster, transmitting small files carries much overhead.
Approach: Merge Tasks
Decide when to migrate tasks to fewer compute nodes.
Data for several tasks at the same node are combined into a single file and distributed at the receiving end.
Downside: old tasks have a lot of state to move.
Example: “paths seen so far.”
Approach: Modify Algorithms
Nonlinear recursions can terminate in many fewer rounds than equivalent linear recursions.
Example: TC takes O(n) rounds on an n-node graph for the linear recursion, but O(log n) rounds for the nonlinear recursion.
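The round-count gap can be checked on a toy graph (a single-process fixed-point computation; the function name and the path-shaped test graph are assumptions of this sketch): linear TC extends paths by one arc per round, while nonlinear TC roughly doubles path lengths per round.

```python
def tc_rounds(n, nonlinear):
    # Arc relation: a simple path 0 -> 1 -> ... -> n-1.
    arcs = {(i, i + 1) for i in range(n - 1)}
    paths = set(arcs)
    rounds = 0
    while True:
        # Nonlinear rule joins Path with Path; linear joins Arc with Path.
        left = paths if nonlinear else arcs
        new = {(a, d) for a, b in left for c, d in paths if b == c}
        if new <= paths:
            return rounds
        paths |= new
        rounds += 1

lin, non = tc_rounds(16, False), tc_rounds(16, True)
# lin == 14 (one new length per round), non == 4 (lengths double per round)
```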
Advantage of Linear TC
The data-volume cost (= sum of the input sizes of all tasks) for executing linear TC is generally lower than that for nonlinear TC.
Why? Each path is discovered only once. Note: distinct paths between the same endpoints may each be discovered.
Smart TC
(Valduriez-Boral, Ioannidis) Construct a path from two paths:
1. The first has a length that is a power of 2.
2. The second is no longer than the first.
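A sketch of the decomposition in terms of lengths alone (the helper name is hypothetical): a path of length L ≥ 2 splits into a first piece whose length is the largest power of 2 strictly below L, and a second piece that is then no longer than the first, so each path is built exactly one way.

```python
def smart_split(L):
    # Split length L >= 2 as (power of 2, remainder no longer than it).
    assert L >= 2
    p = 1
    while p * 2 < L:      # largest power of 2 strictly below L
        p *= 2
    return p, L - p       # since 2*p >= L, we have L - p <= p

# smart_split(5) == (4, 1); smart_split(8) == (4, 4); smart_split(2) == (1, 1)
```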
Other Nonlinear TC Algorithms
You can have the unique-decomposition property with many variants of nonlinear TC.
Example: Balance constructs paths from two equal-length paths, favoring the first path when the length is odd.
Incomparability of TC Algorithms
On different graphs, any of the unique-decomposition algorithms – left-linear, right-linear, smart, balanced – could have the lowest data-volume cost.
Other unique-decomposition algorithms are possible and also could win.
Extension Beyond TC
Can you convert any linear recursion into an equivalent nonlinear recursion that requires logarithmic rounds?
Answer: Not always, without increasing arity and data size.
Positive Points
1. (Agarwal, Jagadish, Ness) All linear Datalog recursions reduce to TC.
2. Right-linear chain-rule Datalog programs can be replaced by nonlinear recursions with the same arity, logarithmic rounds, and the unique-decomposition property.
Example: Alternating-Color Paths
P(X,Y) :- Blue(X,Y)
P(X,Y) :- Blue(X,Z) & Q(Z,Y)
Q(X,Y) :- Red(X,Z) & P(Z,Y)
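This mutually recursive program can be evaluated bottom-up to a fixed point; here is a naive single-process sketch (the evaluator and the Blue/Red arc sets are hypothetical, chosen only to exercise all three rules):

```python
def alternating_paths(blue, red):
    P, Q = set(blue), set()                      # rule 1: P(X,Y) :- Blue(X,Y)
    changed = True
    while changed:
        # Rule 2: P(X,Y) :- Blue(X,Z) & Q(Z,Y)
        newP = P | {(x, y) for x, z in blue for z2, y in Q if z == z2}
        # Rule 3: Q(X,Y) :- Red(X,Z) & P(Z,Y)
        newQ = Q | {(x, y) for x, z in red for z2, y in newP if z == z2}
        changed = (newP, newQ) != (P, Q)
        P, Q = newP, newQ
    return P, Q

P, Q = alternating_paths(blue={(1, 2), (3, 4)}, red={(2, 3)})
# (1,4) is in P via Blue(1,2), Red(2,3), Blue(3,4); (2,4) is in Q
```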
The Case of Reachability
Reach(X) :- Source(X)
Reach(X) :- Reach(Y) & Arc(Y,X)
Takes a linear number of rounds as stated. You can compute nonlinear TC to get Reach in O(log n) rounds. But then you compute O(n²) facts instead of O(n) facts on an n-node graph.
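The linear recursion corresponds to an ordinary frontier expansion, one round per step, computing only O(n) Reach facts (a minimal sketch; the function name and test graph are assumptions):

```python
def reach(source, arcs):
    reached = {source}            # Reach(X) :- Source(X)
    frontier = {source}
    while frontier:               # one round per frontier expansion
        # Reach(X) :- Reach(Y) & Arc(Y,X)
        frontier = {x for y, x in arcs if y in frontier} - reached
        reached |= frontier
    return reached

r = reach(1, [(1, 2), (2, 3), (4, 5)])
# r == {1, 2, 3}: node 4's component is unreachable from the source
```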
Reachability – (2)
Theorem: If you compute Reach using only unary recursive predicates, then it must take Ω(n) rounds on a graph of n nodes.
The proof uses ideas of Afrati, Cosmadakis, and Yannakakis from a generation ago.
Summary: Recursion
Key problems are “endgame” and nonblocking nature of recursive tasks.
In some applications, endgame problem can be handled by using a nonlinear recursion that requires O(log n) rounds and has the unique-decomposition property.
Summary: Research Questions
1. How do you best support fault tolerance when tasks are nonblocking?
2. How do you manage tasks when the endgame problem cannot be avoided?
3. When can you replace linear recursion with nonlinear recursion requiring many fewer rounds and (roughly) the same data-volume cost?