EFFICIENT PROCESSING OF RANK-AWARE
QUERIES IN MAP/REDUCE
OIKONOMAKIS SPYRIDON
SOFTWARE ENGINEER AT PEOPLEPERHOUR
Need for a new model
Exponential data growth
Need for analysis, utilization and scalability of more and more data
Need for parallel processing
Need to reduce data reading and retrieval time
Need for programmer convenience
Cost
What is Map/Reduce?
A programming model and runtime environment for
distributed data processing that runs in parallel
on large clusters of machines
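A minimal in-memory sketch of the model, for illustration only: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function folds each group. The names `run_map_reduce`, `count_words_map` and `count_words_reduce` are hypothetical, and nothing here is distributed; a real runtime spreads the same three phases over many machines.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle (group by key) and reduce phases in one process."""
    groups = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Shuffle phase: values that share a key end up in the same group.
            groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_words_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def count_words_reduce(word, counts):
    # Sum the partial counts collected for one word.
    return sum(counts)

result = run_map_reduce(["a b a", "b c"], count_words_map, count_words_reduce)
print(result)
```

Because each key is reduced independently, the reduce phase can be split across machines; this is also why a skewed key distribution leads to uneven Reducer workloads.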
Is the Map/Reduce model reliable?
Map/Reduce
Weaknesses in Top-K Join Queries
What is a Top-K Join?
Weaknesses
All the data is read in order to retrieve only K results
Uneven distribution of workload across Reducers
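To make the first weakness concrete, here is a small sketch of a naive Top-K join. It assumes each table is a list of (join value, score) pairs, that the score of a join result is the sum of the two scores, and that higher scores rank first; these assumptions and the name `naive_top_k_join` are illustrative and not taken from the slides.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(table_a, table_b, k):
    """Join both tables completely, then keep the k best-scoring results."""
    # Index table B by join value (every record of B is read).
    scores_by_key = defaultdict(list)
    for key, score in table_b:
        scores_by_key[key].append(score)
    # Produce every join result (every record of A is read as well).
    joined = (
        (score_a + score_b, key)
        for key, score_a in table_a
        for score_b in scores_by_key.get(key, ())
    )
    # Only k results survive, although the whole input was processed.
    return heapq.nlargest(k, joined)

table_a = [(1, 9), (2, 7), (1, 3)]
table_b = [(1, 8), (2, 6), (3, 10)]
top_2 = naive_top_k_join(table_a, table_b, 2)
print(top_2)
```

The cost of this version depends on the size of the tables rather than on K, which is what Early Termination tries to avoid.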
Goals of the experiment
Efficient implementation of Top-K Join queries in the
Map/Reduce model
Addressing the Map/Reduce weaknesses shown above with:
Early Termination
Load Balancing
Design
Comparison of three algorithms (1 default and 2 new):
Naive
EarlyTermination (using bounds)
EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
Pre-processing:
Generation of two data tables with join attributes
Statistics for the data in the form of histograms
Processing:
Calculation of bounds from the histograms of each table
Run Map/Reduce
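One way bounds could be derived from histograms is sketched below. The slides do not give the exact procedure, so everything here is an assumption: each histogram is a list of bins ordered from highest to lowest score, each bin is a pair (lowest score in the bin, count of records per join value), both histograms have the same number of bins, and a join result is scored by the sum of the two scores. The name `early_termination_bounds` is hypothetical.

```python
from collections import Counter

def early_termination_bounds(hist_a, hist_b, k, max_a, max_b):
    """Return (bound_a, bound_b): records scoring below them cannot be in the top k."""
    seen_a, seen_b = Counter(), Counter()
    # Take one more bin from each table per step, best scores first.
    for (low_a, counts_a), (low_b, counts_b) in zip(hist_a, hist_b):
        seen_a.update(counts_a)
        seen_b.update(counts_b)
        # Number of join results formed by the records in the bins taken so far.
        join_results = sum(n * seen_b[value] for value, n in seen_a.items())
        if join_results >= k:
            # At least k results score low_a + low_b or more.
            threshold = low_a + low_b
            # A record can only reach the threshold if its own score plus the
            # best possible partner score is at least the threshold.
            return threshold - max_b, threshold - max_a
    # Fewer than k join results exist: everything has to be read.
    return float("-inf"), float("-inf")

hist_a = [(90, {1: 2, 2: 1}), (80, {3: 4})]
hist_b = [(90, {3: 1}), (80, {1: 1, 2: 2})]
bounds = early_termination_bounds(hist_a, hist_b, k=3, max_a=100, max_b=100)
print(bounds)
```

In this toy input the top bins alone produce no join results, so a second bin is needed; the resulting bound of 60 on each table still lets every lower-scoring record be skipped.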
Design(2)
Early Termination
[Diagram: HDFS (generated sorted data, histograms) -> EarlyTermInputFormat -> EarlyTermRecordReader (check bounds) -> send data -> Mapper -> send data -> Reducers (process)]
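The bound check in the record reader can be pictured with the following sketch. It is not the EarlyTermRecordReader itself: it assumes records are (join value, score) pairs already sorted by descending score, and the name `read_until_bound` is hypothetical.

```python
def read_until_bound(sorted_records, lower_bound):
    """Yield records in order and stop at the first score below the bound."""
    for key, score in sorted_records:
        if score < lower_bound:
            # The input is sorted by score, so no later record can qualify.
            break
        yield key, score

records = [(1, 95), (2, 80), (3, 59), (4, 10)]
passed_on = list(read_until_bound(records, lower_bound=60))
print(passed_on)
```

Stopping at the first failing record is only valid because the data is stored sorted by score; on unsorted input the whole file would still have to be scanned.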
Early Termination & Load Balancing
[Diagram: HDFS (generated sorted data, histograms) -> EarlyTermInputFormat -> EarlyTermRecordReader (check bounds) -> send data -> Mapper -> CustomPartitioner -> send data -> Reducers]
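The Longest Processing Time (LPT) rule behind the partitioning step can be sketched as follows: sort the join values by estimated load, largest first, and give each one to the currently least loaded Reducer. How the load per join value is estimated is an assumption here (for example from the histogram counts), and `lpt_assign` is a hypothetical name, not the CustomPartitioner from the diagram.

```python
import heapq

def lpt_assign(load_per_key, num_reducers):
    """Map each join value to a reducer index using the LPT greedy rule."""
    # Min-heap of (total load so far, reducer index).
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Heaviest join values first; ties are broken by the join value itself.
    for key, load in sorted(load_per_key.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)  # least loaded reducer
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment

estimated_load = {1: 7, 2: 5, 3: 4, 4: 3, 5: 1}
assignment = lpt_assign(estimated_load, num_reducers=2)
print(assignment)
```

With hash partitioning a single heavy join value can overload one Reducer; here both Reducers end up with a load of 10.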
Experiment (1)
Parameter: Value
Data distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
K (number of results): 10
Data skew: 0, 0.5, 1
Number of joining attributes: 10
Maximum data value: 10,000
Sorting: by score
Histograms: 10 bins
Cluster: 8 machines
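A table with these parameters could be generated roughly as sketched below. The slides do not say which attribute follows the Zipfian distribution or how scores are drawn, so this sketch assumes that join values are drawn with probability proportional to 1 / rank^skew (skew 0 is uniform) and that scores are uniform integers up to the maximum value. The names `zipf_weights` and `generate_table` are hypothetical.

```python
import random

def zipf_weights(num_values, skew):
    """Unnormalized Zipfian weights: the value of rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]

def generate_table(num_records, num_join_values, max_score, skew, seed=0):
    """Return (join value, score) records sorted by descending score."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    weights = zipf_weights(num_join_values, skew)
    keys = rng.choices(range(num_join_values), weights=weights, k=num_records)
    records = [(key, rng.randint(0, max_score)) for key in keys]
    records.sort(key=lambda record: record[1], reverse=True)
    return records

# A small sample; the experiment itself uses 1,000,000 records per table.
table = generate_table(num_records=1000, num_join_values=10, max_score=10000, skew=1.0)
print(table[:3])
```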
Experiment Part – Comparison of algorithms (2)
[Chart: running time (h:mm:ss, axis from 0:00:00 to 0:50:24) against skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & LoadBalancing; REDUCERS = 10]
Experiment Part – Comparison of algorithms (3)
[Chart: number of records (axis from 0 to 2,500,000) against skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing; REDUCERS = 10]
Experiment Part – Comparison of algorithms (4)
[Chart: running time (h:mm:ss, axis from 0:00:00 to 0:17:17) against number of Reducers (6, 10) for Early Termination and Early Termination & Load Balancing; label: REDUCERS = 6]
Conclusion
Using the proposed techniques:
Early Termination
Load Balancing
it is possible to implement rank-aware (Top-K) queries in
Map/Reduce efficiently and to overcome the weaknesses of
the Map/Reduce model
Questions
????
Thank you.