
Bounds for overlapping interval join on MapReduce

Date post: 18-Jul-2015

Bounds for Overlapping Interval Join on MapReduce

Foto N. Afrati¹, Shlomi Dolev², Shantanu Sharma², and Jeffrey D. Ullman³

¹ National Technical University of Athens, Greece

² Ben-Gurion University of the Negev, Israel

³ Stanford University, USA

2nd Algorithms and Systems for MapReduce and Beyond (BeyondMR), Brussels, Belgium (27 March 2015)

Outline

• Introduction
  – Interval and Overlapping Intervals
  – Interval Join
  – Reducer capacity and Mapping Schema

• Goal of Mapping Schema and Our Contribution

• Unit-Length and Equally-Spaced Intervals

• Variable-Length and Equally-Spaced Intervals

• Conclusion

Introduction

• Interval: a pair [starting time, ending time]
  – A (time) interval $i$ is represented by a pair of times $[T_s^i, T_e^i]$, $T_s^i < T_e^i$, where $T_s^i$ and $T_e^i$ are the starting point and the ending point of interval $i$, respectively
  – Examples:
    • My talk: $T_s^i$ = 10am, $T_e^i$ = 10:30am
    • A phase of a project, a class of a professor

Introduction

• Overlapping intervals
  – Two intervals, say interval $i$ and interval $j$, are called overlapping intervals if their intersection is nonempty
  – Example: Talk (10am to 10:35am) and Coffee break (10:30am to 11am) are overlapping intervals

[Figure: a pair of non-overlapping intervals and pairs of overlapping intervals i and j]
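The definition above can be sketched as a small predicate. This is a minimal illustration, assuming closed intervals given as (start, end) pairs with times encoded as minutes since midnight; the function name `overlaps` and the second, non-overlapping pair are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a = (start, end) and b = (start, end)
    share at least one point, i.e. their intersection is nonempty."""
    return max(a[0], b[0]) <= min(a[1], b[1])

# Times in minutes since midnight (an assumed encoding).
talk = (10 * 60, 10 * 60 + 35)          # 10am to 10:35am
coffee_break = (10 * 60 + 30, 11 * 60)  # 10:30am to 11am
later_session = (11 * 60 + 15, 12 * 60) # hypothetical interval, not from the slides

talk_vs_break = overlaps(talk, coffee_break)
talk_vs_later = overlaps(talk, later_session)
```

Because the intervals are treated as closed, two intervals that only touch at an endpoint also count as overlapping under this sketch.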

Introduction

Employee

EmpID   Name   Duration
e1      U      1-Apr – 1-June
e2      V      1-May – 1-July
e3      W      1-Apr – 1-July
e4      X      1-Mar – 1-June
e5      Y      1-Mar – 1-Aug

Project

Phase                       Duration
Requirement Analysis (RA)   1-Mar – 1-May
Design (D)                  1-Apr – 1-June
Coding (C)                  1-May – 1-Aug

[Figure: timeline from 1-Mar to 1-Aug showing the Project phases (RA, D, C) and the Employee intervals (e1 to e5)]

• Overlapping interval join: an example
  – Find all the employees that are involved in the RA phase of the project

Introduction

• Reducer capacity
  – An upper bound on the total number of intervals that are assigned to a reducer
  – Example: the reducer capacity may be the size of the main memory of the processors on which the reducers run

• Communication cost
  – The total amount of data transferred from the map phase to the reduce phase
  – There is a tradeoff between the reducer capacity and the communication cost

Introduction

Mapping schema for interval join

An assignment of the set of intervals to some given reducers, such that

– The reducer capacity is respected
  • The total number of intervals assigned to a reducer must be less than or equal to the reducer capacity

– Inputs are assigned so that every output can be produced
  • For every output, the two corresponding overlapping intervals must be assigned to at least one reducer in common

[Figure: intervals I1, I2, I3 assigned to reducers]
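The two conditions of a mapping schema can be checked mechanically. The sketch below is a minimal validator under stated assumptions: intervals are closed (start, end) pairs keyed by name, an assignment maps a reducer id to a set of interval names, and every pair of overlapping intervals is treated as an output. The names `is_valid_schema`, `intervals`, and `assignment`, and the sample interval endpoints, are hypothetical.

```python
from itertools import combinations

def overlaps(a, b):
    """Closed intervals overlap if their intersection is nonempty."""
    return max(a[0], b[0]) <= min(a[1], b[1])

def is_valid_schema(intervals, assignment, q):
    """Check both mapping-schema conditions.

    intervals:  {name: (start, end)}
    assignment: {reducer_id: set of interval names}
    q:          reducer capacity (maximum number of intervals per reducer)
    """
    # Condition 1: no reducer receives more than q intervals.
    if any(len(names) > q for names in assignment.values()):
        return False
    # Condition 2: every overlapping pair meets at some common reducer.
    for a, b in combinations(intervals, 2):
        if overlaps(intervals[a], intervals[b]):
            if not any(a in names and b in names for names in assignment.values()):
                return False
    return True

# Hypothetical endpoints: I1 and I2 overlap, I3 overlaps neither.
intervals = {"I1": (0, 2), "I2": (1, 3), "I3": (4, 5)}
good = {0: {"I1", "I2"}, 1: {"I3"}}
bad = {0: {"I1"}, 1: {"I2", "I3"}}

good_ok = is_valid_schema(intervals, good, q=2)
bad_ok = is_valid_schema(intervals, bad, q=2)
too_small = is_valid_schema(intervals, good, q=1)
```

The second assignment fails because I1 and I2 never share a reducer; the first fails only when the capacity drops below the size of its largest reducer.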

State-of-the-Art

• B. Chawda, H. Gupta, S. Negi, T.A. Faruquie, L.V. Subramaniam, and M.K. Mohania, “Processing Interval Joins On Map-Reduce,” EDBT, 2014.

• MapReduce-based 2-way and multiway interval join algorithms for overlapping intervals

• The reducer capacity is not considered

• No analysis of a lower bound on the replication of individual intervals

• No analysis of the replication rate of the algorithms offered therein


Goal of Mapping Schema

• Interval join problem
  – Assign all the intervals that share at least one common point of time to at least one reducer in common, so that the outputs can be found

Our Contribution

• An algorithm for variable-length intervals that can start at any time
  – Before this, we consider two simpler cases and provide an algorithm for each:
    • Unit-length and equally-spaced intervals
    • Variable-length and equally-spaced intervals

• For all the algorithms, the upper bound on the replication rate almost matches the lower bound


Unit-Length and Equally-Spaced Intervals

• Relations X and Y of n intervals each

• No interval begins before 0 or after k

• Hence, the spacing between the starting points of two successive intervals is $\frac{k}{n} < 1$

• Example: n = 9 and k = 2.25, so spacing = 0.25

[Figure: the intervals of X and Y on a time axis from 0 to 2.25, with ticks every 0.25]

Unit-Length and Equally-Spaced Intervals - Algorithm

• Divide the time range from 0 to k into equal-sized partitions of length w (say P partitions are created)

• Arrange P reducers

• Assign all intervals of X that exist in a partition $p_i$ to the $i$th reducer

• Assign all intervals of Y that have their starting point or ending point in partition $p_i$ to the $i$th reducer

[Figure: the intervals of X and Y for n = 9 and k = 2.25 on a time axis from 0 to 2.25, divided into 5 partitions]
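The assignment rule above can be sketched in a few lines and checked on the n = 9, k = 2.25 example. This is a minimal single-machine sketch, not the authors' implementation. It assumes partition i covers the half-open range [i*w, (i+1)*w), that points beyond the last partition are clamped into it, and that w = 0.5 (chosen so that 5 partitions result); the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def partition_index(t, w, num_partitions):
    """Index of the partition [i*w, (i+1)*w) containing time t, clamped to the last partition."""
    return min(math.floor(t / w), num_partitions - 1)

def assign(xs, ys, w, k):
    """Return {reducer index: (X intervals, Y intervals)} following the rule above."""
    num_partitions = math.ceil(k / w)
    reducers = defaultdict(lambda: ([], []))
    for x in xs:
        # An X interval goes to every partition it exists in.
        first = partition_index(x[0], w, num_partitions)
        last = partition_index(x[1], w, num_partitions)
        for i in range(first, last + 1):
            reducers[i][0].append(x)
    for y in ys:
        # A Y interval goes only to the partitions of its starting and ending points.
        for i in {partition_index(y[0], w, num_partitions),
                  partition_index(y[1], w, num_partitions)}:
            reducers[i][1].append(y)
    return dict(reducers)

def overlaps(a, b):
    return max(a[0], b[0]) <= min(a[1], b[1])

n, k, w = 9, 2.25, 0.5            # w is an assumed partition length
spacing = k / n                   # 0.25
X = [(i * spacing, i * spacing + 1.0) for i in range(n)]
Y = [(i * spacing, i * spacing + 1.0) for i in range(n)]

reducers = assign(X, Y, w, k)
# Pairs found by joining locally at each reducer.
joined = {(x, y) for xs, ys in reducers.values() for x in xs for y in ys if overlaps(x, y)}
# Pairs found by comparing every X interval with every Y interval.
expected = {(x, y) for x, y in product(X, Y) if overlaps(x, y)}
```

Comparing `joined` with `expected` confirms that, for this unit-length instance, every overlapping pair meets at some reducer.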

Unit-Length and Equally-Spaced Intervals

• Does the algorithm work?

• Consider $q = \frac{3wn}{k} + \frac{n}{k} + 2$
  – q: the reducer capacity
  – w: the length of a partition
  – n: the total number of intervals in a relation
  – k: the last starting point of an interval

• Count how many intervals lie in a partition; if that number is less than or equal to q, then we have a solution and the algorithm works

Unit-Length and Equally-Spaced Intervals

• Does the algorithm work?
  – Count 1: How many intervals of Y overlap with an interval of X in a partition of length w?
    • The spacing is $\frac{k}{n}$, i.e., $\frac{n}{k}$ intervals start per unit of time, so at most $\frac{2wn}{k}$ intervals of Y can overlap with an interval of X
  – Count 2: How many intervals can have starting points after the start of $x_i$ and starting points before the end of $x_i$?
    • Intervals of X after the starting point of $x_i$ = $\frac{wn}{k}$
    • Intervals of X before the starting point of $x_i$ = $\frac{n}{k}$
  – Count 3: Do not forget to count $x_i$ itself and an identical interval of Y, i.e., $y_i$

[Figure: the intervals of X and Y for n = 9 and k = 2.25 on a time axis from 0 to 2.25, divided into 5 partitions]

Unit-Length and Equally-Spaced Intervals

• Does the algorithm work?
  – Total number of intervals in a partition = Count 1 + Count 2 + Count 3
  – $\frac{2wn}{k} + \frac{wn}{k} + \frac{n}{k} + 2 = \frac{3wn}{k} + \frac{n}{k} + 2 = q$
  – OK. The algorithm works
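As a quick numeric check of the sum above, the three counts can be evaluated for the running example. The sketch assumes n = 9 and k = 2.25 from the slides and a partition length w = 0.5, which is a hypothetical choice; the function name is also hypothetical.

```python
import math

def unit_length_capacity(n, k, w):
    """Sum Count 1 + Count 2 + Count 3 for the unit-length, equally-spaced case."""
    density = n / k                      # intervals starting per unit of time
    count1 = 2 * w * density             # Y intervals overlapping an X interval
    count2 = w * density + density       # X intervals starting after / before x_i
    count3 = 2                           # x_i itself and the identical y_i
    return count1 + count2 + count3

n, k, w = 9, 2.25, 0.5                   # w is an assumed value
q = unit_length_capacity(n, k, w)
closed_form = 3 * w * n / k + n / k + 2  # q as stated in the slides
```

For these values the three counts are 4, 6, and 2, so q = 12 and the sum agrees with the closed form.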


Variable-Length and Equally-Spaced Intervals

• Two types of intervals
  – Big and small intervals
  – Different length intervals

Variable-Length and Equally-Spaced Intervals

• Big and small intervals
  – All the intervals of X are of length $l_{min}$
  – All the intervals of Y are of length $l_{max}$
  – The previous algorithm will work here too
  – Note that an interval of X will be replicated to several reducers, while an interval of Y will be replicated to at most two reducers
  – Example: n = 6 and spacing = 0.7

[Figure: the intervals of X and Y on a time axis from 0 to 4.2, with ticks every 0.7]

Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
  – All the restrictions on the length of an interval and on the spacing between two intervals are removed
  – Intervals can begin at any time greater than or equal to 0 and must end by time T
  – S: the total length of the intervals in one relation

[Figure: intervals of X and Y of different lengths on a time axis marked 0, s, s+1, s+2, s+3, T]

Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
  – Algorithm
    • Divide the time range into $\frac{T}{w}$ equal-sized partitions
    • Arrange $\frac{T}{w}$ reducers
    • Follow the same procedure as in the previous algorithm, i.e., assign all the intervals of X that belong to the $i$th partition to the $i$th reducer, and assign all the intervals of Y to the reducers corresponding to their starting and ending points (at most two reducers)

[Figure: intervals of X and Y of different lengths on a time axis marked 0, s, s+1, s+2, s+3, T]

Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
  – Does the algorithm work?
  – Consider $q = \frac{3nw + S}{T}$
  – Count the average number of intervals of X and Y sent to a reducer; if it is less than or equal to the reducer capacity, then the algorithm will work

Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
  – Count 1: Average number of intervals of Y received by a reducer
    • $\frac{\text{Replication} \times \text{total number of inputs}}{\text{total number of reducers}}$
  – An interval of Y is sent to at most 2 reducers (replication)
  – There are $\frac{T}{w}$ reducers and n intervals in Y
    • Average number of intervals of Y received by a reducer = $\frac{2n}{T/w}$

Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
  – Count 2: Average number of intervals of X received by a reducer
    • $\frac{\text{Replication} \times \text{total number of inputs}}{\text{total number of reducers}}$
  – The average length of an interval is $\frac{S}{n}$
  – An interval of X is sent to at most $1 + \frac{S}{nw}$ reducers, where $\frac{S}{nw}$ is the average length divided by how much length a reducer can hold
  – There are $\frac{T}{w}$ reducers and n intervals in X
    • Average number of intervals of X received by a reducer = $\frac{(1 + S/(nw))\, n}{T/w}$

Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
  – Does the algorithm work?
  – Total number of intervals that a reducer receives = Count 1 + Count 2
  – $\frac{2nw}{T} + \frac{(1 + S/(nw))\, wn}{T} = \frac{3wn + S}{T} = q$
  – The algorithm works
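The averaging argument can be checked numerically. The sketch below evaluates Count 1 and Count 2 exactly as stated in the slides and compares their sum with the closed form for q; the sample values of n, w, S, and T are hypothetical, as is the function name.

```python
import math

def average_load(n, w, S, T):
    """Average number of intervals a reducer receives: Count 1 (Y) + Count 2 (X)."""
    num_reducers = T / w
    count_y = 2 * n / num_reducers                    # each Y interval goes to at most 2 reducers
    count_x = (1 + S / (n * w)) * n / num_reducers    # each X interval goes to at most 1 + S/(nw) reducers
    return count_y + count_x

# Hypothetical instance: 6 intervals per relation, total length 9, time range 12, partition length 1.
n, w, S, T = 6, 1.0, 9.0, 12.0
load = average_load(n, w, S, T)
q = (3 * n * w + S) / T           # closed form from the slides
```

With these values Count 1 is 1.0 and Count 2 is 1.25, so the average load is 2.25, which equals q.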


Conclusion

• An investigation of good MapReduce algorithms for the problem of finding pairs of overlapping intervals

• Algorithms for:
  – Unit-sized and equally-spaced intervals
    • Lower bounds on the replication rate = 2 or 2qnk
    • Upper bounds on the replication rate = 3 qT−SS2
  – Big-small and equally-spaced intervals
    • Lower bounds on the replication rate = 2 or 2q lmins
    • Upper bounds on the replication rate = 3 qT−SS2
  – A general case for variable-length intervals
    • Upper bounds on the replication rate = 3qT−S S2

• Proofs of the lower and upper bounds on the replication rate are given in the paper

Foto Afrati¹, Shlomi Dolev², Shantanu Sharma², and Jeffrey D. Ullman³

¹ School of Electrical and Computing Engineering, National Technical University of Athens, Greece, [email protected]

² Department of Computer Science, Ben-Gurion University of the Negev, Israel, {dolev,sharmas}@cs.bgu.ac.il

³ Department of Computer Science, Stanford University, USA, [email protected]

Presentation is available at http://www.cs.bgu.ac.il/~sharmas/publication.html

