
Theta join (M-bucket-I algorithm explained)

Date post: 23-Feb-2017
Upload: minsub-yim
Processing Theta Joins using MapReduce by Minsub Yim
Transcript
Page 1: Theta join (M-bucket-I algorithm explained)

Processing Theta Joins using MapReduce

by Minsub Yim

Page 2: Theta join (M-bucket-I algorithm explained)

Processing pipeline at a reducer

Goal: We want to minimize job completion time. Since completion time depends on both the input size and the output size of each reducer, we need a way to model both the input and the output of a reducer.

Mapper Output -> Reducer -> Join Output

time = f(input size):
1. Receive mapper output
2. Sort input by key
3. Read input

time = f(output size):
4. Run join algorithm
5. Send join output

Page 4: Theta join (M-bucket-I algorithm explained)

Theta Join Model

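Dataset S

S_id  Value
1     5
2     6
3     6
4     8
5     8
6     10

Dataset T

T_id  Value
1     5
2     5
3     6
4     8
5     8
6     10

Assuming join condition: S.value = T.value

[ Join Matrix M ] (rows: S values, columns: T values, x: tuple pair satisfying the join condition)

S \ T    5   5   6   8   8  10
  5      x   x   .   .   .   .
  6      .   .   x   .   .   .
  6      .   .   x   .   .   .
  8      .   .   .   x   x   .
  8      .   .   .   x   x   .
 10      .   .   .   .   .   x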

Page 5: Theta join (M-bucket-I algorithm explained)

Theta Join Model (Examples)

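[Figure: three join matrices over the same datasets (rows: S values 5, 6, 6, 8, 8, 10; columns: T values 5, 5, 6, 8, 8, 10), one per join condition, with the cells satisfying the condition marked]

• Join condition: S.value <= T.value

• Join condition: abs(S.value - T.value) < 2

• Join condition: S.value = T.value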
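The three matrices can be recomputed directly from the data. The following is a minimal sketch, assuming the S and T values listed on the earlier slide; the names `join_matrix`, `count_true` and `CONDITIONS` are illustrative and not part of the original deck.

```python
def join_matrix(s_values, t_values, theta):
    """Cell [i][j] is True when the pair (s_values[i], t_values[j]) satisfies theta."""
    return [[theta(s, t) for t in t_values] for s in s_values]


def count_true(matrix):
    """Number of cells that satisfy the join condition (the join output size)."""
    return sum(cell for row in matrix for cell in row)


# Values of datasets S and T from the slides.
S = [5, 6, 6, 8, 8, 10]
T = [5, 5, 6, 8, 8, 10]

# Hypothetical lookup table of the three example conditions.
CONDITIONS = {
    "S.value <= T.value": lambda s, t: s <= t,
    "abs(S.value - T.value) < 2": lambda s, t: abs(s - t) < 2,
    "S.value = T.value": lambda s, t: s == t,
}

for label, theta in CONDITIONS.items():
    matrix = join_matrix(S, T, theta)
    print(f"{label}: {count_true(matrix)} true cells")
    for row in matrix:
        print(" ".join("x" if cell else "." for cell in row))
```

Any predicate over a pair of values can be passed as `theta`, which is what makes the matrix view independent of the join type.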

Page 9: Theta join (M-bucket-I algorithm explained)

Goal Revisited

• We want to minimize job completion time

• We need to assign every true cell to exactly one reducer. (find a mapping from M to R)

• Goal: Find a mapping from the join matrix M to reducers that minimizes job completion time

Page 10: Theta join (M-bucket-I algorithm explained)

Mappings from join matrix to reducers

Join condition: S.value = T.value

[Figure: two copies of the join matrix; the numbers (1) to (4) mark the cells assigned to each reducer]

Standard equi-join algorithm

[R1] Input: S1, T1, T2          Output: 2 tuples
[R2] Input: S2, S3, T3          Output: 2 tuples
[R3] Input: S4, S5, T4, T5      Output: 4 tuples
[R4] Input: S6, T6              Output: 1 tuple
Max-Reducer-Input (MRI): 4      Max-Reducer-Output (MRO): 4

Random

[R1] Input: S1, S4, S5, T1, T4, T5    Output: 3 tuples
[R2] Input: S2, S4, T3, T5            Output: 2 tuples
[R3] Input: S1, S5, T2, T4            Output: 2 tuples
[R4] Input: S3, S6, T3, T6            Output: 2 tuples
MRI: 6      MRO: 3
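The MRI and MRO figures for the standard equi-join mapping can be checked with a few lines. This is a sketch under the assumption that the standard algorithm sends all tuples with the same join value to one reducer; the function names are hypothetical.

```python
from collections import defaultdict

# Tuple name -> join value, as on the slides.
S = {"S1": 5, "S2": 6, "S3": 6, "S4": 8, "S5": 8, "S6": 10}
T = {"T1": 5, "T2": 5, "T3": 6, "T4": 8, "T5": 8, "T6": 10}


def equi_join_partition(s, t):
    """Group tuples by join value: one reducer per distinct value (assumption)."""
    reducers = defaultdict(lambda: {"S": [], "T": []})
    for name, value in s.items():
        reducers[value]["S"].append(name)
    for name, value in t.items():
        reducers[value]["T"].append(name)
    return reducers


def mri_mro(reducers):
    """Max-reducer-input and max-reducer-output of an equi-join partition."""
    inputs = [len(r["S"]) + len(r["T"]) for r in reducers.values()]
    # All tuples on one reducer share the join value, so every S/T pair matches.
    outputs = [len(r["S"]) * len(r["T"]) for r in reducers.values()]
    return max(inputs), max(outputs)


print(mri_mro(equi_join_partition(S, T)))
```

For this data the result agrees with the slide: the reducer holding value 8 receives 4 tuples and produces 4 output pairs.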

Page 14: Theta join (M-bucket-I algorithm explained)

Mappings from join matrix to reducers

Join condition: S.value = T.value

[Figure: join matrix with regions (1) to (3)]

[R1] Input: S1, S2, T1, T2            Output: 2 tuples
[R2] Input: S3, S4, T1, T2, T3        Output: 2 tuples
[R3] Input: S4, S5, S6, T4, T5, T6    Output: 5 tuples
Max-Reducer-Input: 6      Max-Reducer-Output: 5

Page 16: Theta join (M-bucket-I algorithm explained)

Mappings from join matrix to reducers

• There are many possible mappings from the join matrix to reducers.

• We will look at different cases and ask which mapping is (close to) optimal, and which algorithms compute such a mapping.

Page 17: Theta join (M-bucket-I algorithm explained)

Lemma

We will use the following lemma repeatedly to show how (close to) optimal each mapping is.

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.

[ Proof ] Consider a reducer r that receives m records from T and n records from S. These records cover at most mn cells, so $mn \ge c$ and therefore $2\sqrt{mn} \ge 2\sqrt{c}$. Since $m + n \ge 2\sqrt{mn}$ (AM-GM inequality), we get

$$m + n \ge 2\sqrt{mn} \ge 2\sqrt{c}$$
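Lemma 1 can be sanity-checked numerically. The sketch below brute-forces the smallest possible input m + n for a given number of cells c and compares it with 2√c; `min_input_for_cells` is an illustrative helper, not something from the slides.

```python
import math


def min_input_for_cells(c):
    """Smallest m + n over positive integers m, n with m * n >= c (brute force)."""
    best = None
    for m in range(1, c + 1):
        n = -(-c // m)  # integer ceiling of c / m: fewest records giving m * n >= c
        if best is None or m + n < best:
            best = m + n
    return best


for c in (1, 4, 9, 10, 100):
    print(c, min_input_for_cells(c), 2 * math.sqrt(c))
```

The bound is tight exactly when c is a perfect square and the region is a square, which is why the cross-product analysis aims for square regions.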

Page 19: Theta join (M-bucket-I algorithm explained)

Cross Product

• We first consider the cross product, where all pairs of tuples from the two datasets satisfy the join condition. The join matrix would look like the following:

[Figure: join matrix for S × T with every cell marked]

Join condition: S × T

Page 21: Theta join (M-bucket-I algorithm explained)

Cross Product

• Since all entries of the join matrix are true, we can see that the maximum-reducer-output satisfies $\mathrm{MRO} \ge |S||T|/r$. (Otherwise, there would be tuples not mapped to a reducer.)

• Along with Lemma 1, we have a lower bound for the maximum-reducer-input (MRI):

$$\mathrm{MRI} \ge 2\sqrt{\frac{|S||T|}{r}}$$

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.

Page 23: Theta join (M-bucket-I algorithm explained)

Cross Product

• We will revisit these two properties frequently to judge the quality of join mappings:

$$\mathrm{MRO} \ge \frac{|S||T|}{r} \quad \text{and} \quad \mathrm{MRI} \ge 2\sqrt{\frac{|S||T|}{r}}$$

Page 24: Theta join (M-bucket-I algorithm explained)

Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

Case 1: Suppose |S| and |T| are multiples of $\sqrt{|S||T|/r}$. Namely, $|S| = c_S\sqrt{|S||T|/r}$ and $|T| = c_T\sqrt{|S||T|/r}$.

Then, partitioning the join matrix with squares of size $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ is an optimal mapping.

Proof: trivial. Each region mapped to a reducer has output size $|S||T|/r$ and input size $2\sqrt{|S||T|/r}$.

Page 29: Theta join (M-bucket-I algorithm explained)

Cross Product

Suppose |S| = |T| = 6 and r = 9.

[Figure: 6 × 6 join matrix for S × T partitioned into square regions, one per reducer]

$\mathrm{MRO} = 4 = |S||T|/r$

$\mathrm{MRI} = 4 = 2\sqrt{|S||T|/r}$

Page 31: Theta join (M-bucket-I algorithm explained)

Cross Product

Case 2: Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume $|S| < |T|/r$). Then the rectangle cover with regions of size $|S| \times |T|/r$ is the optimal mapping.

(e.g., |S| = 3, |T| = 20, r = 5)

Page 32: Theta join (M-bucket-I algorithm explained)

Cross Product

Case 3: The remaining case, where $|T|/r \le |S| \le |T|$.

Let $c_S = \left\lfloor |S| / \sqrt{|S||T|/r} \right\rfloor$ and $c_T = \left\lfloor |T| / \sqrt{|S||T|/r} \right\rfloor$.

Then, covering M with $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ squares is a mapping worse than an optimal mapping by a factor no greater than 4.

Page 33: Theta join (M-bucket-I algorithm explained)

Cross Product

If |S| and/or |T| is not a multiple of $\sqrt{|S||T|/r}$, scale each side by $\left(1 + \frac{1}{c_S}\right)$ and/or $\left(1 + \frac{1}{c_T}\right)$ respectively to cover M. Given $|T|/r \le |S| \le |T|$, we have $c_S \ge 1$, so we see that

$$\left(1 + \frac{1}{c_S}\right)\sqrt{\frac{|S||T|}{r}} \le 2\sqrt{\frac{|S||T|}{r}}$$

Page 34: Theta join (M-bucket-I algorithm explained)

Cross Product

Hence, $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$ and $\mathrm{MRO} \le 4|S||T|/r$.

Comparing these with the lower bounds given above, we see that the MRO and MRI produced by this mapping are at most 4 times (twice for MRI) the lower bounds.
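The bounds and the three-way case split for the cross product can be put into a small helper. This is a sketch only: `cross_product_bounds` and `which_case` are hypothetical names, and the multiple-of test uses a floating-point tolerance rather than exact arithmetic.

```python
import math


def cross_product_bounds(s, t, r):
    """Bounds for a cross product of |S| = s and |T| = t tuples on r reducers."""
    side = math.sqrt(s * t / r)          # side length of the ideal square region
    return {
        "side": side,
        "mro_lower": s * t / r,          # MRO >= |S||T|/r
        "mri_lower": 2 * side,           # MRI >= 2 sqrt(|S||T|/r)
        "mro_upper_1bt": 4 * s * t / r,  # upper bounds quoted for the square cover
        "mri_upper_1bt": 4 * side,
    }


def which_case(s, t, r):
    """Return 1, 2 or 3 for the three cases above (S is taken as the smaller side)."""
    s, t = min(s, t), max(s, t)
    side = math.sqrt(s * t / r)
    if s < t / r:
        return 2

    def is_multiple(x):
        return math.isclose(x / side, round(x / side))

    return 1 if is_multiple(s) and is_multiple(t) else 3


print(cross_product_bounds(6, 6, 9))
print(which_case(6, 6, 9), which_case(3, 20, 5), which_case(6, 8, 4))
```

With |S| = |T| = 6 and r = 9 the side is 2 and both lower bounds equal 4, matching the worked example; |S| = 3, |T| = 20, r = 5 falls under Case 2.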

Page 35: Theta join (M-bucket-I algorithm explained)

Implementation

• Now we know how to (nearly) optimally partition the join matrix. So let's run it!

• However, when a mapper is given a record (from either S or T), it does NOT know where exactly in the dataset (in which row/column of the join matrix) the record belongs.

• We could run another pre-processing pass to get that information, but it can be avoided by using a randomized algorithm!

Page 39: Theta join (M-bucket-I algorithm explained)

Mapping & Randomized Algorithm

Algorithm 1: Map (Theta-Join)
Input: input tuple x ∈ S ∪ T

if x ∈ S then
    matrixRow = random(1, |S|)
    for all regionID in lookup.getRegions(matrixRow) do
        Output (regionID, (x, "S"))
else
    matrixCol = random(1, |T|)
    for all regionID in lookup.getRegions(matrixCol) do
        Output (regionID, (x, "T"))

1. Given a record x ∈ S (WLOG)
2. Pick a row uniformly at random
3. Get all the regions intersecting that row and output (regID, (x, "S"))

Page 40: Theta join (M-bucket-I algorithm explained)

Mapping & Randomized Algorithm

Join condition: S.value = T.value

[Figure: join matrix (rows: S values 5, 7, 7, 8, 9, 9; columns: T values 5, 7, 7, 7, 8, 9) covered by regions (1), (2) and (3)]

Map

Input tuple   Value   Random row/col   Output
S1            5       3                (1,S1) (2,S1)
S2            7       5                (3,S2)
S3            7       1                (1,S3) (2,S3)
S4            8       5                (3,S4)
S5            9       1                (1,S5) (2,S5)
S6            9       2                (1,S6) (2,S6)
T1            5       6                (2,T1) (3,T1)
T2            7       2                (1,T2) (3,T2)
T3            7       2                (1,T3) (3,T3)
T4            7       3                (1,T4) (3,T4)
T5            8       6                (2,T5) (3,T5)
T6            9       4                (2,T6) (3,T6)

Reduce

Reducer 1: key 1 (regID)
Input: S1, S3, S5, S6, T2, T3, T4
Output: (S3,T2) (S3,T3) (S3,T4)

Reducer 2: key 2 (regID)
Input: S1, S3, S5, S6, T1, T5, T6
Output: (S1,T1) (S5,T6) (S6,T6)

Reducer 3: key 3 (regID)
Input: S2, S4, T1, T2, T3, T4, T5, T6
Output: (S2,T2) (S2,T3) (S2,T4) (S4,T5)
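The map and reduce steps of this example can be replayed in plain Python. This is a minimal sketch, not a MapReduce job: the region layout in `REGIONS` is inferred from the map output shown on the slide (the figure itself is not available), and `SLIDE_PICKS` fixes the random rows and columns to the values from the example so the run is reproducible. All function names are illustrative.

```python
import random
from collections import defaultdict

# Regions as (first_row, last_row, first_col, last_col), 1-based and inclusive.
# Assumption: layout inferred from the slide's map output.
REGIONS = {1: (1, 4, 1, 3), 2: (1, 4, 4, 6), 3: (5, 6, 1, 6)}

S = {"S1": 5, "S2": 7, "S3": 7, "S4": 8, "S5": 9, "S6": 9}
T = {"T1": 5, "T2": 7, "T3": 7, "T4": 7, "T5": 8, "T6": 9}

# Random row (for S) or column (for T) drawn for each tuple in the example.
SLIDE_PICKS = {"S1": 3, "S2": 5, "S3": 1, "S4": 5, "S5": 1, "S6": 2,
               "T1": 6, "T2": 2, "T3": 2, "T4": 3, "T5": 6, "T6": 4}


def regions_for(index, axis):
    """Region IDs that intersect the given matrix row (axis 0) or column (axis 1)."""
    lo, hi = (0, 1) if axis == 0 else (2, 3)
    return [rid for rid, box in REGIONS.items() if box[lo] <= index <= box[hi]]


def map_phase(picks=None, rng=random):
    """Send every tuple to all regions crossing a random row (S) or column (T)."""
    picks = picks or {}
    shuffled = defaultdict(list)
    for relation, axis in ((S, 0), (T, 1)):
        for name in relation:
            index = picks[name] if name in picks else rng.randint(1, len(relation))
            for rid in regions_for(index, axis):
                shuffled[rid].append(name)
    return shuffled


def reduce_phase(shuffled, theta):
    """Each reducer joins the S tuples and T tuples it received."""
    return {
        rid: [(s, t) for s in names if s in S
              for t in names if t in T and theta(S[s], T[t])]
        for rid, names in shuffled.items()
    }


result = reduce_phase(map_phase(SLIDE_PICKS), lambda a, b: a == b)
for rid in sorted(result):
    print(rid, result[rid])
```

Because the regions tile the whole matrix, every (row, column) pair lands in exactly one region, so the union of the reducer outputs is the complete join whatever rows and columns are drawn.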

Page 41: Theta join (M-bucket-I algorithm explained)

Cross Product… NOT!

• We have verified that the 1-Bucket-Theta algorithm is close to optimal when the join is a cross product.

• How does 1-Bucket-Theta perform when the join condition is NOT a cross product?

• We will compare the quality of 1-Bucket-Theta against that of any join algorithm.

Page 44: Theta join (M-bucket-I algorithm explained)

1BT vs ANY join algorithm

Let $1 \ge x > 0$. Any matrix-to-reducer mapping that has to cover at least $x|S||T|$ of the $|S||T|$ cells of the join matrix has, by Lemma 1,

$$\mathrm{MRI} \ge 2\sqrt{x|S||T|/r}$$

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.

As we have seen, 1BT guarantees that $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$. Hence,

$$\frac{\mathrm{MRI}_{\mathrm{1BT}}}{\mathrm{MRI}_{\mathrm{AnyJoinAlg}}} \le \frac{4\sqrt{|S||T|/r}}{2\sqrt{x|S||T|/r}} = \frac{2}{\sqrt{x}}$$

Page 46: Theta join (M-bucket-I algorithm explained)

1BT vs ANY join algorithm

When $x = 0.5$, the ratio is $2/\sqrt{0.5} \approx 2.83 < 3$. Hence, compared to ANY join algorithm that has to assign more than 50% of the join matrix cells to reducers, the MRI for 1BT is at most 3 times the MRI of that algorithm.
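The ratio bound 2/√x is easy to tabulate for other coverage fractions. A short sketch, with `mri_ratio_bound` as an illustrative name:

```python
import math


def mri_ratio_bound(x):
    """Upper bound 2 / sqrt(x) on MRI(1BT) / MRI(any algorithm covering a fraction x)."""
    if not 0 < x <= 1:
        raise ValueError("x must satisfy 0 < x <= 1")
    return 2 / math.sqrt(x)


for x in (1.0, 0.5, 0.25, 0.01):
    print(f"x = {x}: ratio <= {mri_ratio_bound(x):.2f}")
```

The bound grows as x shrinks, which is the motivation for using statistics to avoid covering cells that cannot produce output.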

Page 48: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

• In the previous slide, we saw that mapping smaller regions, instead of covering the entire matrix, would yield a better MRI.

• Ideally, we only want to map the cells that satisfy the join condition, but this cannot be done without knowing input statistics and/or the join condition.

• M-Bucket-I exploits statistics to improve over the 1-Bucket-Theta join algorithm.

Page 51: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

1) Sample each record of S with probability n/|S|, which yields approximately n records
2) Build k-quantiles (k buckets) from the sample, where k < n
3) Iterate through S and count the number of records in each bucket
4) Do the same for T and build the join matrix accordingly

Page 56: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

Dataset S            Dataset T

S_id  Value          T_id  Value
1     7              1     5
2     2              2     5
3     4              3     6
4     2              4     8
5     1              5     8
6     9              6     10
7     10             7     2
8     2              8     4
9     5              9     1
10    3              10    3

Samples

Sample S: 7, 2, 2, 9, 2, 3
Sample T: 5, 6, 8, 2, 1, 3

Buckets

S buckets: (0, 2], (2, 3], (3, 9], (9, ...)    record counts: 4, 1, 4, 1
T buckets: (0, 1], (1, 5], (5, 8], (8, ...)    record counts: 1, 5, 3, 1
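The sampling, quantile and counting steps can be sketched as follows. The helper names are hypothetical, and `quantile_boundaries` uses one of several possible quantile conventions, so it is not guaranteed to reproduce the boundaries on the slide; the counting step is run with the slide's boundaries (2, 3, 9 for S and 1, 5, 8 for T), treating each boundary as the inclusive upper end of its bucket.

```python
import random
from bisect import bisect_left

S = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]
T = [5, 5, 6, 8, 8, 10, 2, 4, 1, 3]


def sample(records, n, rng=random):
    """Keep each record independently with probability n / len(records)."""
    p = n / len(records)
    return [v for v in records if rng.random() < p]


def quantile_boundaries(sample_values, k):
    """k - 1 boundaries splitting the sorted sample into k roughly equal parts."""
    ordered = sorted(sample_values)
    last = len(ordered) - 1
    return [ordered[min(last, (i * len(ordered)) // k)] for i in range(1, k)]


def bucket_counts(records, boundaries):
    """Count records per bucket; bucket i holds values v with b[i-1] < v <= b[i]."""
    counts = [0] * (len(boundaries) + 1)
    for v in records:
        counts[bisect_left(boundaries, v)] += 1
    return counts


print(bucket_counts(S, [2, 3, 9]))
print(bucket_counts(T, [1, 5, 8]))
```

Only the sample has to be sorted; the full datasets are scanned once for counting, which keeps the statistics pass cheap relative to the join itself.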

Page 59: Theta join (M-bucket-I algorithm explained)

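M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

[Figure: join matrix with the ten S tuples along one axis and the ten T tuples along the other, divided by the bucket boundaries 2, 3, 9 (S) and 1, 5, 8 (T), with the candidate cells marked]

Join condition: S.value = T.value

We now have candidate cells. How do we map these cells to reducers?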
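For an equi-join, a histogram cell can only contain matching pairs if the value ranges of its S bucket and T bucket overlap. The sketch below derives candidate cells from the bucket boundaries under that assumption; the helper names and the (lo, hi] interval convention are illustrative, not taken from the slides.

```python
import math


def to_intervals(boundaries):
    """Turn upper boundaries [b1, ..., bk] into k + 1 value ranges (lo, hi]."""
    edges = [-math.inf] + list(boundaries) + [math.inf]
    return list(zip(edges[:-1], edges[1:]))


def is_candidate_equi(s_range, t_range):
    """Two (lo, hi] ranges can hold equal values only if they overlap."""
    return max(s_range[0], t_range[0]) < min(s_range[1], t_range[1])


def candidate_cells(s_bounds, t_bounds, is_candidate=is_candidate_equi):
    """(S bucket, T bucket) index pairs that may contain join results."""
    s_ranges, t_ranges = to_intervals(s_bounds), to_intervals(t_bounds)
    return [(i, j)
            for i, s_range in enumerate(s_ranges)
            for j, t_range in enumerate(t_ranges)
            if is_candidate(s_range, t_range)]


print(candidate_cells([2, 3, 9], [1, 5, 8]))
```

Other join conditions need a different `is_candidate` test; for S.value <= T.value, for instance, a cell is a candidate whenever the S bucket's lower end is below the T bucket's upper end.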

Page 60: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm: M-Bucket-I
Input: maxInput, r, M

row = 0
while row < M.noOfRows do
    (row, r) = CoverSubMatrix(row, maxInput, r, M)
    if r < 0 then
        return false
return true

Page 61: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm: CoverSubMatrix
Input: row_s, maxInput, r, M

maxScore = -1, rUsed = 0
for i = 1 to maxInput - 1 do
    R_i = CoverRows(row_s, row_s + i, maxInput, M)
    area = totalCandidateArea(row_s, row_s + i, M)
    score = area / R_i.size
    if score >= maxScore then
        maxScore = score
        bestRow = row_s + i
        rUsed = R_i.size
r = r - rUsed
return (bestRow + 1, r)

Page 62: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm: CoverRows
Input: row_f, row_l, maxInput, M

Regions = ∅; r = newRegion()
for all c_i in M.getColumns do
    if r.cap < c_i.candidateInputCosts then
        Regions = Regions ∪ r
        r = newRegion()
    r.Cells = r.Cells ∪ c_i.candidateCells
return Regions
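The three routines can be combined into a simplified, runnable sketch. It makes several assumptions that are not in the pseudocode: the matrix is a list of boolean rows with one tuple per row and column (rather than histogram buckets with counts), a region's input is its number of rows plus its number of columns, columns without candidate cells are skipped, and the function returns the list of regions (or `None`) instead of a boolean.

```python
def cover_rows(M, row_f, row_l, max_input):
    """Pack the columns with candidate cells in rows row_f..row_l (inclusive)
    into regions whose input, rows + columns, stays within max_input."""
    n_rows = row_l - row_f + 1
    regions, current = [], []
    for col in range(len(M[0])):
        if not any(M[row][col] for row in range(row_f, row_l + 1)):
            continue  # no candidate cell in this column of the block
        if current and n_rows + len(current) + 1 > max_input:
            regions.append((row_f, row_l, current))  # region is full, start a new one
            current = []
        current.append(col)
    if current:
        regions.append((row_f, row_l, current))
    return regions


def cover_sub_matrix(M, row_s, max_input):
    """Try every block height and keep the one with the most candidate cells per region."""
    best_score, best = -1.0, None
    max_height = min(max_input - 1, len(M) - row_s)
    for height in range(1, max_height + 1):
        row_l = row_s + height - 1
        regions = cover_rows(M, row_s, row_l, max_input)
        area = sum(M[row][col] for row in range(row_s, row_l + 1)
                   for col in range(len(M[0])))
        score = area / len(regions) if regions else 0.0
        if score >= best_score:
            best_score, best = score, (row_l + 1, regions)
    return best


def m_bucket_i(M, max_input, r):
    """Regions covering all candidate cells if at most r are needed, else None."""
    if max_input < 2:
        raise ValueError("max_input must be at least 2")
    row, mapping = 0, []
    while row < len(M):
        row, regions = cover_sub_matrix(M, row, max_input)
        mapping.extend(regions)
        if len(mapping) > r:
            return None
    return mapping


# Toy candidate matrix: a 4 x 4 diagonal, as for an equi-join on distinct values.
DIAGONAL = [[i == j for j in range(4)] for i in range(4)]
print(m_bucket_i(DIAGONAL, max_input=4, r=2))
```

Each region is reported as (first row, last row, list of columns). On the toy matrix, blocks of two rows score best because they cover two candidate cells with a single region of input 4.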

Page 68: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

Run the algorithm with r = 6, maxInput = 5

[ Step 2 ] M-Bucket-I Algorithm

row: 0   score: 4
row: 1   score: 13/3 = 4.33
row: 2   score: 22/4 = 5.5
row: 3   score: 31/7 = 4.428..

We choose the mapping with the highest score!

[Figure: join matrix with the chosen block of rows covered by regions (1), (2), (3) and (4)]

Page 69: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

Run the algorithm with r = 6, maxInput = 5

[ Step 2 ] M-Bucket-I Algorithm

row: 3   score: 3

[Figure: join matrix with regions (1), (2), (3) and (4)]

So on and so forth…

Page 70: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

Run the algorithm with r = 6, maxInput = 5

[ Step 2 ] M-Bucket-I Algorithm

Final mapping!

[Figure: join matrix with all candidate cells covered by regions (1) to (13)]

Page 71: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

Run the algorithm with r = 6, maxInput = 5

[ Step 2 ] M-Bucket-I Algorithm

[Figure: join matrix covered by regions (1) to (13)]

However, we have mapped the candidate cells to more than r reducers (13 regions for r = 6). We do a binary search over maxInput until we find a mapping to <= r reducers.

Page 74: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 3 ] Binary Search

MaxInput            Num. Reducers
|S| + |T| = 20      1
5                   13
12                  3
8                   5

Since 7 reducers are required when MaxInput = 7, which is more than r = 6, we stop the binary search here and output the mapping with MRI = 8.
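The search over MaxInput can be sketched as a standard binary search. This assumes that the number of regions never increases as MaxInput grows, which may not hold exactly for a heuristic cover; `regions_needed` is a hypothetical callback that would run the covering step, and `toy_regions_needed` is a made-up stand-in for it, not the data from the slides.

```python
import math


def smallest_feasible_max_input(regions_needed, r, lo, hi):
    """Smallest maxInput in [lo, hi] with regions_needed(maxInput) <= r, or None.

    Assumes regions_needed(m) does not increase when m grows.
    """
    if regions_needed(hi) > r:
        return None  # even the largest allowed input needs too many reducers
    while lo < hi:
        mid = (lo + hi) // 2
        if regions_needed(mid) <= r:
            hi = mid       # feasible: try a smaller maxInput
        else:
            lo = mid + 1   # infeasible: maxInput must grow
    return lo


def toy_regions_needed(max_input):
    """Made-up monotone cost model used only to exercise the search."""
    return math.ceil(40 / max_input)


print(smallest_feasible_max_input(toy_regions_needed, r=6, lo=1, hi=20))
```

In a real run the callback would return the number of regions produced by the covering algorithm for that MaxInput, and the mapping found at the returned value would be the one handed to the reducers.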

Page 75: Theta join (M-bucket-I algorithm explained)

Performance

Skew Resistance of 1-Bucket-Theta

                                       1-Bucket-Theta                  Standard Equi-Join
Data set      Output size (billion)    Output Imbalance  Runtime (s)   Output Imbalance  Runtime (s)
Synth - 0     25.00                    1.0030            657           1.0124            701
Synth - 0.4   24.99                    1.0023            650           1.2541            722
Synth - 0.6   24.98                    1.0033            676           1.7780            923
Synth - 0.8   24.95                    1.0068            678           3.0103            1482
Synth - 1     24.91                    1.0089            667           5.3124            2489

The data sets become more skewed from top to bottom.

Where Output Imbalance = MRI / Ave. RI

Page 77: Theta join (M-bucket-I algorithm explained)

Performance

M-Bucket-I cost details (seconds), by number of buckets

Step        1           10          100        1000     10,000   100,000   1,000,000
Quantiles   0           115         120        117      122      124       122
Histogram   0           140         145        147      157      167       604
Heuristic   74.01       9.21        0.84       1.50     16.67    118.03    111.27
Join        49384       10905       1157       595      548      540       536
Total       49,458.01   11,169.21   1,422.84   860.5    843.67   949.03    1,373.27

