Date posted: 18-Jul-2015
Category: Data & Analytics
Uploaded by: shantanu-sharma
Bounds for Overlapping Interval Join on MapReduce
Foto N. Afrati1, Shlomi Dolev2, Shantanu Sharma2, and Jeffrey D. Ullman3
1 National Technical University of Athens, Greece
2 Ben-Gurion University of the Negev, Israel
3 Stanford University, USA
2nd Algorithms and Systems for MapReduce and Beyond (BeyondMR), Brussels, Belgium (27 March 2015)
Outline
• Introduction
• Goal of Mapping Schema and Our Contribution
• Unit-Length and Equally-Spaced Intervals
• Variable-Length and Equally-Spaced Intervals
• Conclusion
Outline
• Introduction
– Interval and Overlapping Intervals
– Interval Join
– Reducer capacity and Mapping Schema
• Goal of Mapping Schema and Our Contribution
• Unit-Length and Equally-Spaced Intervals
• Variable-Length and Equally-Spaced Intervals
• Conclusion
Introduction
• Interval
– A pair [starting time, ending time]
– A (time) interval i is represented by a pair of times $[T_i^s, T_i^e]$, $T_i^s < T_i^e$, where $T_i^s$ and $T_i^e$ are the starting point and the ending point of interval i, respectively
– Example:
• My talk
• A phase of a project, a class of a professor
Figure: the talk drawn as an interval with $T_i^s$ = 10am and $T_i^e$ = 10:30am
Introduction
• Overlapping Intervals
– Two intervals, say interval i and interval j, are called overlapping intervals if their intersection is nonempty
Figure: a pair of non-overlapping intervals and a pair of overlapping intervals i and j; as an example of overlap, a talk (10am to 10:35am) and a coffee break (10:30am to 11am)
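The definition above can be sketched as a small predicate. This is a minimal illustration, assuming closed intervals (so two intervals that only touch at an endpoint count as overlapping) and times stored as minutes since midnight; the `Interval` type, the `overlaps` function, and the `lunch` interval are hypothetical names introduced here, not part of the slides.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # starting point, in minutes since midnight (assumed unit)
    end: int    # ending point, in minutes since midnight


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the closed intervals a and b share at least one point."""
    return a.start <= b.end and b.start <= a.end


talk = Interval(10 * 60, 10 * 60 + 35)        # 10:00 to 10:35
coffee_break = Interval(10 * 60 + 30, 11 * 60)  # 10:30 to 11:00
lunch = Interval(12 * 60, 13 * 60)            # hypothetical, not from the slides

print(overlaps(talk, coffee_break))  # the talk runs into the coffee break
print(overlaps(talk, lunch))         # no common point of time
```

The test is symmetric in its two arguments, so the same function serves for X-to-Y and Y-to-X comparisons.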
Introduction
• Overlapping Interval Join: an example

Employee
EmpID   Name   Duration
𝑒1      U      1-Apr – 1-June
𝑒2      V      1-May – 1-July
𝑒3      W      1-Apr – 1-July
𝑒4      X      1-Mar – 1-June
𝑒5      Y      1-Mar – 1-Aug

Project
Phase                        Duration
Requirement Analysis (RA)    1-Mar – 1-May
Design (D)                   1-Apr – 1-June
Coding (C)                   1-May – 1-Aug

Figure: a timeline from 1-Mar to 1-Aug showing the project phases RA, D, and C and the employees 𝑒1 to 𝑒5 as intervals

Find all the employees that are involved in the RA phase of the project
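The query in this example can be sketched in a few lines. This is a hedged illustration rather than the authors' code: dates are reduced to month numbers (1-Mar becomes 3, and so on), and the `closed` flag is an assumption introduced here because the slides do not say whether an employee who joins on 1-May, the day the RA phase ends, counts as involved.

```python
# Employee and Project relations from the example, with each date
# replaced by its month number (an assumed encoding).
EMPLOYEES = {
    "e1": ("U", 4, 6),
    "e2": ("V", 5, 7),
    "e3": ("W", 4, 7),
    "e4": ("X", 3, 6),
    "e5": ("Y", 3, 8),
}
PHASES = {"RA": (3, 5), "D": (4, 6), "C": (5, 8)}


def overlaps(s1, e1, s2, e2, closed=True):
    """Overlap test; with closed=True, touching endpoints count as overlap."""
    if closed:
        return s1 <= e2 and s2 <= e1
    return s1 < e2 and s2 < e1


def employees_in_phase(phase, closed=True):
    """Return the IDs of employees whose duration overlaps the given phase."""
    p_start, p_end = PHASES[phase]
    return [
        emp_id
        for emp_id, (_name, start, end) in EMPLOYEES.items()
        if overlaps(start, end, p_start, p_end, closed)
    ]


print(employees_in_phase("RA"))                # closed intervals
print(employees_in_phase("RA", closed=False))  # endpoints excluded
```

Under the closed-interval reading every employee overlaps RA; excluding shared endpoints drops 𝑒2, whose duration starts exactly when RA ends.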
Introduction
• Reducer capacity
– An upper bound on the total number of intervals that are assigned to a reducer
– Example
• The reducer capacity may be the size of the main memory of the processors on which reducers run
• Communication cost
– Total amount of data to be transferred from the map phase to the reduce phase
– There is a tradeoff between the reducer capacity and the communication cost
Introduction
Mapping schema for interval join
An assignment of the set of intervals to some given reducers, such that
– Respect the reducer capacity
• The total number of intervals assigned to a reducer must be less than or equal to the reducer capacity
– Assignment of inputs
• For every output, the two corresponding overlapping intervals must be assigned to at least one reducer in common
Figure: intervals I1, I2, I3 assigned to a single reducer, and the same intervals distributed across three reducers
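The two conditions of a mapping schema can be checked mechanically. The sketch below is an illustration under stated assumptions: intervals are closed `(start, end)` tuples keyed by hypothetical IDs, a reducer is modeled as a set of IDs, and `replication_rate` uses the usual definition (total number of interval copies sent to reducers divided by the number of input intervals), which this slide does not spell out.

```python
def overlaps(a, b):
    """Closed-interval overlap test on (start, end) tuples."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_valid_mapping_schema(x, y, reducers, q):
    """Check both mapping-schema conditions.

    x, y: dicts mapping an interval ID to a (start, end) tuple.
    reducers: list of sets of interval IDs.
    q: reducer capacity (maximum number of intervals per reducer).
    """
    # Condition 1: no reducer exceeds the capacity q.
    if any(len(r) > q for r in reducers):
        return False
    # Condition 2: every overlapping (x, y) pair meets at some reducer.
    for x_id, x_iv in x.items():
        for y_id, y_iv in y.items():
            if overlaps(x_iv, y_iv) and not any(
                x_id in r and y_id in r for r in reducers
            ):
                return False
    return True


def replication_rate(reducers, num_inputs):
    """Average number of reducers each input interval is sent to."""
    return sum(len(r) for r in reducers) / num_inputs


X = {"x1": (0, 1), "x2": (2, 3)}
Y = {"y1": (0.5, 1.5), "y2": (2.5, 3.5)}
good = [{"x1", "y1"}, {"x2", "y2"}]
bad = [{"x1", "y2"}, {"x2", "y1"}]

print(is_valid_mapping_schema(X, Y, good, q=2))
print(is_valid_mapping_schema(X, Y, bad, q=2))
print(replication_rate(good, num_inputs=4))
```

The brute-force pair check is quadratic and only meant for validating small examples, not for use inside a MapReduce job.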
State-of-the-Art
• B. Chawda, H. Gupta, S. Negi, T.A. Faruquie, L.V. Subramaniam, and M.K. Mohania, “Processing Interval Joins On Map-Reduce,” EDBT, 2014.
• MapReduce-based 2-way and multiway interval join algorithms for overlapping intervals
• Does not take the reducer capacity into account
• No analysis of a lower bound on the replication of individual intervals
• No analysis of the replication rate of the algorithms offered therein
Goal of Mapping Schema
• Interval join problem
– Assign all the intervals that share at least one common point of time to at least one reducer in common for finding outputs
Our Contribution
• An algorithm for variable-length intervals that can start at any time
– Before this, we consider two simpler cases and provide an algorithm for each
• Unit-length and equally-spaced intervals
• Variable-length and equally-spaced intervals
• All the algorithms achieve an upper bound on the replication rate that almost matches the lower bound
Unit-Length and Equally-Spaced Intervals
• Relations X and Y of n intervals each
• No interval starts before 0 or after k
• Hence, the spacing between the starting points of two successive intervals is $k/n < 1$
Figure: relations X and Y on a time axis from 0 to 2.25 in steps of 0.25; n = 9 and k = 2.25, so spacing = 0.25
Unit-Length and Equally-Spaced Intervals: Algorithm
• Divide the time range from 0 to k into equal-sized partitions of length w (say P partitions are created)
• Arrange P reducers
• Assign all intervals of X that exist in a partition $p_i$ to the ith reducer
• Assign all intervals of Y that have their starting point or ending point in partition $p_i$ to the ith reducer
Figure: relations X and Y on the time axis from 0 to 2.25 (n = 9 and k = 2.25), divided into partitions 1 to 5
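The assignment steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes closed unit-length intervals and half-open partitions $[iw, (i+1)w)$ indexed from 0, and it uses `fractions.Fraction` to avoid rounding at partition boundaries. One further assumption is that partitions are indexed by `floor(t / w)`, so ending points that fall after k still map to a reducer. The function name `assign_to_reducers` is hypothetical.

```python
from fractions import Fraction
from math import floor


def assign_to_reducers(x_starts, y_starts, w, length=1):
    """Map each reducer index to the set of intervals assigned to it.

    An interval is identified by ("X" or "Y", starting point).
    """
    reducers = {}
    # X: send the interval to every partition it intersects.
    for s in x_starts:
        first, last = floor(s / w), floor((s + length) / w)
        for i in range(first, last + 1):
            reducers.setdefault(i, set()).add(("X", s))
    # Y: send the interval only to the partitions that contain its
    # starting point and its ending point (at most two reducers).
    for s in y_starts:
        for i in {floor(s / w), floor((s + length) / w)}:
            reducers.setdefault(i, set()).add(("Y", s))
    return reducers


# Setting of the figure: n = 9 intervals per relation, spacing 0.25.
starts = [Fraction(i, 4) for i in range(9)]
w = Fraction(1, 2)  # partition length, chosen here for illustration
reducers = assign_to_reducers(starts, starts, w)

for index in sorted(reducers):
    print(index, len(reducers[index]))
```

For unit-length intervals this assignment brings every overlapping pair together: either the starting point or the ending point of the Y interval lies inside the X interval, and X is present in every partition it intersects.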
Unit-Length and Equally-Spaced Intervals
• Does the algorithm work?
• Consider $q = \frac{3wn}{k} + \frac{n}{k} + 2$
– q: the reducer capacity
– w: the length of a partition
– n: the total number of intervals in a relation
– k: the last starting point of an interval
• Count how many intervals lie in a partition; if there are at most q of them, then we have a solution and the algorithm works.
Unit-Length and Equally-Spaced Intervals
• Does the algorithm work?
– Count 1: How many intervals of Y overlap with an interval of X in a partition of length w?
• The spacing is $k/n$, so $n/k$ intervals start per unit of time, and at most $2wn/k$ intervals of Y can overlap with an interval of X
– Count 2: How many intervals of X can have starting points after the starting point of $x_i$ and before the ending point of $x_i$?
• Intervals of X after the starting point of $x_i$ = $wn/k$
• Intervals of X before the starting point of $x_i$ = $n/k$
– Count 3: Do not forget to count $x_i$ itself and an identical interval of Y, i.e., $y_i$.
Figure: relations X and Y on the time axis from 0 to 2.25 (n = 9 and k = 2.25), divided into partitions 1 to 5
Unit-Length and Equally-Spaced Intervals
• Does the algorithm work?
– Total number of intervals in a partition
– Count 1 + Count 2 + Count 3 = $\frac{2wn}{k} + \frac{wn}{k} + \frac{n}{k} + 2 = q$
– OK. The algorithm works
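The three counts can be added up numerically as a sanity check. The sketch below is only an illustration: it plugs the figure's values n = 9 and k = 2.25 into the counts, with a partition length w = 0.5 chosen here as an assumption, and the function name is hypothetical.

```python
from fractions import Fraction


def unit_length_counts(n, k, w):
    """Return (count1, count2, count3) for the unit-length case."""
    density = Fraction(n) / Fraction(k)  # intervals starting per unit of time
    count1 = 2 * w * density             # intervals of Y overlapping an x_i
    count2 = w * density + density       # intervals of X around x_i
    count3 = 2                           # x_i itself and the identical y_i
    return count1, count2, count3


n, k, w = 9, Fraction(9, 4), Fraction(1, 2)
counts = unit_length_counts(n, k, w)
q = 3 * w * Fraction(n) / k + Fraction(n) / k + 2  # capacity from the slide

print(counts, sum(counts), q)
```

With these values the counts are 4, 6, and 2, which sum to q = 12.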
Variable-Length and Equally-Spaced Intervals
• Two types of intervals
– Big and small intervals
– Different length intervals
Variable-Length and Equally-Spaced Intervals
• Big and small intervals
– All the intervals of X are of length $l_{min}$
– All the intervals of Y are of length $l_{max}$
– The previous algorithm will work here too
– Note that an interval of X will be replicated to several reducers, while an interval of Y will be replicated to at most two reducers
Figure: relations X and Y on a time axis from 0 to 4.2 in steps of 0.7; n = 6 and spacing = 0.7
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
– All the restrictions regarding the length of an interval and the spacing between two intervals are removed
– Intervals can begin at some time greater than or equal to 0 and end by time T
– S: the total length of intervals in one relation
Figure: relations X and Y on a time axis marked 0, s, s+1, s+2, s+3, T
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
– Algorithm
• Divide the time range into $T/w$ equal-sized partitions
• Arrange $T/w$ reducers
• Follow the same procedure as in the previous algorithm, i.e., assign all the intervals of X that belong to the ith partition to the ith reducer, and assign all the intervals of Y to the reducers corresponding to their starting and ending points (at most two reducers)
Figure: relations X and Y on a time axis marked 0, s, s+1, s+2, s+3, T
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
– Does the algorithm work?
– Consider $q = \frac{3nw + S}{T}$
– Count the average number of intervals of X and Y sent to a reducer; if it is less than or equal to the reducer capacity, then the algorithm will work
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
– Count 1: Average number of intervals of Y received by a reducer
• $\frac{\text{Replication} \times \text{Total number of inputs}}{\text{Total number of reducers}}$
– An interval of Y is sent to at most 2 reducers (Replication)
– There are $T/w$ reducers and n intervals in Y
• Average number of intervals of Y received by a reducer $= \frac{2n}{T/w}$
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
– Count 2: Average number of intervals of X received by a reducer
• $\frac{\text{Replication} \times \text{Total number of inputs}}{\text{Total number of reducers}}$
– Average length of intervals is $S/n$
– An interval of X is sent to at most $1 + \frac{S}{nw}$ reducers (the term $\frac{S}{nw}$ is the average length divided by how much length a reducer can hold)
– There are $T/w$ reducers and n intervals in X
• Average number of intervals of X received by a reducer $= \frac{(1 + S/(nw))\,n}{T/w}$
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
– Does the algorithm work?
– Total number of intervals that a reducer receives = Count 1 + Count 2 = $\frac{2nw}{T} + \frac{(1 + S/(nw))\,wn}{T} = \frac{3wn + S}{T} = q$
The algorithm works
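The averaging argument for the general case can be checked with a few lines of arithmetic. This sketch is illustrative only: the input values are made up for the example, and `partition_length_for_capacity` simply solves $q = \frac{3nw + S}{T}$ for w, a rearrangement that does not appear on the slides. Both function names are hypothetical.

```python
from fractions import Fraction


def average_load(n, w, S, T):
    """Average number of intervals a reducer receives (Count 1 + Count 2)."""
    num_reducers = Fraction(T) / w
    count1 = 2 * n / num_reducers                          # Y: at most 2 copies each
    count2 = (1 + Fraction(S) / (n * w)) * n / num_reducers  # X: 1 + S/(nw) copies each
    return count1 + count2


def partition_length_for_capacity(q, n, S, T):
    """Partition length w obtained by solving q = (3nw + S) / T for w."""
    return (Fraction(q) * T - S) / (3 * n)


# Hypothetical instance: 6 intervals per relation, total length 12, range 10.
n, S, T = 6, 12, 10
w = Fraction(1)

load = average_load(n, w, S, T)
print(load, Fraction(3 * n * w + S, T))
print(partition_length_for_capacity(load, n, S, T))
```

For this instance the average load is 3, which equals $(3nw + S)/T$, and solving back from q = 3 recovers w = 1.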
Conclusion
• An investigation of good MapReduce algorithms for the problem of finding pairs of overlapping intervals
• Algorithms for:
– Unit-sized and equally-spaced intervals
• Lower bounds on the replication rate = 2 or 2qnk
• Upper bounds on the replication rate =3
qT−SS2
– Big-small and equally-spaced intervals
• Lower bounds on the replication rate = 2 or 2q
lmins
• Upper bounds on the replication rate =3
qT−SS2
– A general case for variable length intervals
• Upper bounds on the replication rate =
3qT−S
S2
Proofs of lower and upper bounds on the replication rate are given in the paper
Foto Afrati1, Shlomi Dolev2, Shantanu Sharma2, and Jeffrey D. Ullman3
1 School of Electrical and Computing Engineering, National Technical University of Athens, Greece [email protected]
2 Department of Computer Science, Ben-Gurion University of the Negev, Israel {dolev,sharmas}@cs.bgu.ac.il
3 Department of Computer Science, Stanford University, USA [email protected]
Presentation is available at http://www.cs.bgu.ac.il/~sharmas/publication.html