
HOME: HiveQL Optimization in Multi-Session Environment

Marwah N. Abdullah Computer Science Department

Cairo University EGYPT

[email protected]

Mohamed H. Khafagy Computer Science Department

Fayoum University EGYPT

[email protected]

Fatma A. Omara Computer Science Department

Cairo University EGYPT

[email protected]

Abstract: Analyzing big data has emerged as a significant activity for many organizations. This analysis is simplified by the MapReduce framework and execution environments such as Hadoop, and by parallel systems such as Hive. On the other hand, most MapReduce users have complex analysis queries that are expressed as individual MapReduce jobs. By using high-level query languages such as Pig, Hive, and Jaql, a user's complex query is expressed as a workflow of MapReduce jobs. The work in this paper concerns how to reuse the previous results in the Hive output file, in the same or in different sessions, to improve Hive performance. This has been done by introducing an algorithm called HOME (HiveQL Optimization in Multi-Session Environment). To evaluate the developed HOME algorithm, it has been implemented using 19 different SQL statements to reduce I/O in MapReduce jobs. Building on the HOME algorithm, a new HiveQL execution architecture based on materializing previous results has been proposed. The framework implementation has been built on top of the Hive dataflow system without any change in Hive. To evaluate the performance of the proposed HiveQL architecture, the Star Schema Benchmark (SSB) has been used. According to the experimental results, the developed HOME algorithm outperforms Hive by an estimated 67% on average.

Key Words: MapReduce, materialization, optimization, multi-session, complex query, SQL, HiveQL

1. Introduction

MapReduce is a framework for processing problems in a parallel manner over big datasets using a cluster of computers [1]. Apache Hadoop, on the other hand, is an open-source implementation of the MapReduce programming model that provides a new way of storing and processing data. In Apache Hadoop, a user's complex query is divided into workflows, where each job in these workflows reads its input data from a distributed file system [2,3,4,5]. Apache Hadoop uses the Hadoop Distributed File System (HDFS) and produces output that is stored in HDFS. These output results are consumed as input by the next job in the workflow [6,7]. The main drawback of this behavior of Apache Hadoop is that the intermediate results in HDFS are deleted at the end of the session. The work in this paper overcomes this drawback. On the other hand, Hive uses the high-level query language HiveQL, which expresses a user's complex query as a workflow of MapReduce jobs. Most data warehouse applications are implemented using HiveQL on Hadoop, because HiveQL is

mostly similar to SQL with some new features [8]. The rest of this paper is organized as follows: related work is discussed in Section 2, our proposed system is introduced in Section 3, Section 4 presents the environment setup, and Section 5 discusses the SSB benchmark used. Section 6 presents the experimental results and system evaluation. The conclusion and future work are given in Section 7.

2. Related Work

Shared-scan scheduling studies how jobs submitted to a MapReduce system can be scheduled by sharing scans over the same files [9]. It introduces new policies for scheduling MapReduce jobs with the goal of maximizing the expected amount of scan sharing. This work differs from HOME in that it exploits one particular type of sharing opportunity, namely the one that arises between concurrently running MapReduce jobs, whereas HOME can exploit different types of sharing opportunities. Moreover, HOME enables sharing between Hive jobs executed at different times by storing and reusing Hive job outputs.


Another work that tries to share computation in MapReduce is MRShare [10]. The main principle of MRShare is to identify sharing opportunities among queries presented in the same batch to the MapReduce data analysis platform. The primary goal of MRShare is to avoid redundant execution by merging the execution of operators from different queries, and a cost model is used to find the best plan for integrating a group of queries appearing in the same batch. This work differs from HOME in that HOME focuses on reducing the work of individual Hive jobs by reusing previously stored outputs.

ReStore [11] is a system that reuses the intermediate outputs of MapReduce jobs in a workflow to speed up workflows executed later in the system. In addition to reusing the results of whole MapReduce jobs, ReStore creates more reuse opportunities by storing the output of some physical query operators. ReStore is implemented as an extension to the Pig dataflow system [12]. This work differs from HOME in that ReStore reuses the intermediate data of MapReduce jobs, whereas HOME reuses the intermediate results of Hive.

m2r2 [13] proposes an extensible, language-independent framework for results materialization. Its implementation is built on top of the Pig dataflow system and handles automatic results caching, common sub-query matching and rewriting, as well as garbage collection.

This work differs from HOME in that m2r2 provides its enhancement within a single session using Pig, whereas HOME works in both single- and multi-session environments using Hive.

Generally, the main drawback of all these techniques is that they deal with a single session, whereas HOME improves performance by storing the output and sharing it across different sessions.

3. HiveQL Optimization in Multi-Session Environment (HOME)

The main feature of the HOME system is to reuse intermediate results within one session and across multiple sessions, improving Hadoop performance by considering the following seven cases of HiveQL clauses without any change to the Hive code:

Case 1: exists query
Case 2: subset of columns
Case 3: Order By clauses

Case 4: Group By clauses
Case 5: Having clauses
Case 6: Where clauses
Case 7: Join statements

3.1 Parser

The General SQL Parser (GSP) is used because of its ability to add powerful SQL functionality to user applications. By using the General SQL Parser library, application development time can be saved and performance improved [14]. The General SQL Parser reads the input SQL text of the query and extracts all the information HOME needs to determine whether the new query is a subset of a previous query, such as the column names, table name, ordered column, group-by column, having condition, where condition, and join condition.

3.2 Results File

The Results file is a text file that stores the name of the output file and the SQL statement from previous executions. It also contains the column names, table name, ordered column, group-by column, having condition, where condition, join condition and all the parser results [15,16,17].

3.3 HOME System Architecture

Fig.1 illustrates how HOME works. First, after receiving a new query, HOME checks whether the query already exists among the previous results stored in the Results file. In this case, it returns the same result without re-executing the statement.
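
As a rough illustration of Sections 3.1 and 3.2, the following Python sketch shows the kind of per-statement record the Results file could hold. The regular expressions are only a stand-in for the General SQL Parser, and every identifier here is our own assumption rather than the paper's code.

import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ResultEntry:
    sql_text: str                  # previously executed HiveQL statement
    output_file: str               # name of the file/table holding its stored output
    table_name: str
    columns: List[str]
    order_by: Optional[str] = None
    group_by: Optional[str] = None
    having: Optional[str] = None
    where: Optional[str] = None
    join_on: Optional[str] = None

def parse_statement(sql: str, output_file: str) -> ResultEntry:
    """Extract the fields HOME needs; the real system would use GSP, not regexes."""
    def grab(pattern):
        m = re.search(pattern, sql, re.IGNORECASE)
        return m.group(1).strip() if m else None
    cols = grab(r"select\s+(.*?)\s+from") or ""
    return ResultEntry(
        sql_text=" ".join(sql.split()),
        output_file=output_file,
        table_name=grab(r"from\s+([\w\.]+)") or "",
        columns=[c.strip() for c in cols.split(",") if c.strip()],
        order_by=grab(r"order\s+by\s+([\w\.]+(?:\s*,\s*[\w\.]+)*)"),
        group_by=grab(r"group\s+by\s+([\w\.]+(?:\s*,\s*[\w\.]+)*)"),
        having=grab(r"having\s+(.+?)(?:\s+order\s+by|$)"),
        where=grab(r"where\s+(.+?)(?:\s+group\s+by|\s+order\s+by|\s+having|$)"),
        join_on=grab(r"join\s+.+?\bon\s*\((.+?)\)"),
    )

# Example: the stored statement of Fig.2, whose output file is named R1 in the paper.
entry = parse_statement(
    "SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1", "R1")
print(entry.table_name, entry.columns)   # lineorder1 ['lo_extendedprice', ...]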

Second, if the query does not exist in the Results file, HOME sends the statement to the parser to obtain more information about the statement, as described in the previous section.

Third, HOME checks whether the statement uses the same table as a previous statement and a subset of its columns, and then determines which clause case applies.

The following subsections illustrate how HOME handles the seven cases. Otherwise, HOME runs the statement in HiveQL, saves the statement's information in the Results file, and also stores the statement's output.
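
This flow (shown in Fig.1) can be summarized in the sketch below; every helper function is a trivial placeholder that we introduce for illustration, not a name from the paper's implementation.

from typing import Optional

def _norm(sql: str) -> str:
    return " ".join(sql.lower().split())

def lookup_exact(entries: list, query: str) -> Optional[dict]:
    return next((e for e in entries if _norm(e["sql"]) == _norm(query)), None)

def parse_stub(query: str) -> dict:          # stand-in for the GSP-based parser
    return {"sql": query}

def find_shareable(entries: list, parsed: dict) -> Optional[dict]:
    return None                              # same-table / column-subset matching elided

def rewrite_for_reuse(parsed: dict, old: dict) -> str:
    return parsed["sql"]                     # Cases 2-7 rewrites elided here

def run_in_hive(sql: str) -> str:
    return f"<output of: {sql}>"             # would submit the MapReduce job(s)

def home_execute(query: str, results_file: list) -> str:
    entry = lookup_exact(results_file, query)
    if entry:                                # Case 1: print the stored output directly
        return entry["output"]
    parsed = parse_stub(query)
    old = find_shareable(results_file, parsed)
    if old:                                  # Cases 2-7: read the materialized old result
        return run_in_hive(rewrite_for_reuse(parsed, old))
    output = run_in_hive(query)              # no sharing: execute, then remember it
    results_file.append({"sql": query, "output": output})
    return output

# A query repeated in a later session hits Case 1 and skips execution entirely.
results = []
home_execute("SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1", results)
print(home_execute("SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1", results))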


Case 1: Existed Query

After receiving the input HiveQL, HOME checks whether the statement exists in the Results file (i.e., whether the old output can be reused from the saved results). If the statement does not exist, HOME uses the parser to obtain more information about the statement and to determine whether it shares information with a previous statement stored in the Results file. For example, suppose query Q1 of the Star Schema Benchmark (SSB) runs [18]: (lo_extendedprice, lo_orderdate, lo_discount) are column names from the lineorder1 table and R1 is the name of the file that contains the stored results. Fig.2 explains how HOME works in this case (i.e., Existed Query).

Fig.2 Existed Query

In the case of a table and columns shared between an old statement and a new statement, each statement is processed as follows:

Case 2: Same Table – Subset of Columns

If the new input HiveQL uses the same table as a statement in the Results file and the new column set Col1 is a subset of the old statement's column set Col2 (Col1 ⊆ Col2), then HOME rewrites the input query to select the new columns from the old result table recorded in the Results file. Suppose the output of the old statement R1 for the HiveQL example running Q2 of the SSB: (lo_extendedprice, lo_orderdate, lo_discount) are column names from the lineorder1 table, r1 is the new table created from the result data, and R1 is the name of the file that contains the stored results. Fig.3 explains how HOME works in this case (i.e., Same Table – Subset of Columns).
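
A minimal sketch of this Case 2 rewrite, using the query strings of Fig.3; the helper name and its signature are our own assumptions.

from typing import List, Optional

def rewrite_subset_columns(new_cols: List[str], new_table: str,
                           old_cols: List[str], old_table: str,
                           result_table: str) -> Optional[str]:
    """If the new query reads the same table and a subset of the old columns,
    redirect it to the materialized result table; otherwise return None."""
    if new_table == old_table and set(new_cols) <= set(old_cols):
        return f"SELECT {', '.join(new_cols)} FROM {result_table}"
    return None

# Fig.3 example: stored statement R1 selected (lo_extendedprice, lo_orderdate,
# lo_discount) from lineorder1 and its output is materialized as table r1.
print(rewrite_subset_columns(
    ["lo_extendedprice", "lo_orderdate"], "lineorder1",
    ["lo_extendedprice", "lo_orderdate", "lo_discount"], "lineorder1",
    "r1"))
# -> SELECT lo_extendedprice, lo_orderdate FROM r1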

Fig.3 Same Table –Subset Column

[Fig.2 contents: stored statement R1 "SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1"; the input HiveQL in session 0..n is identical, so there is no need to rewrite the query and HOME just prints the stored results.]

[Fig.3 contents: stored statement R1 "SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1"; input HiveQL "SELECT lo_extendedprice, lo_orderdate FROM lineorder1"; updated HiveQL "SELECT lo_extendedprice, lo_orderdate FROM r1".]

Fig.1 HOME System Diagram


Case 3: Order By Clauses

If the new input HiveQL uses the same table as a statement in the Results file, the new column set Col1 is a subset of the old statement's column set Col2 (Col1 ⊆ Col2), and the new ordering column is the same as the old ordering column, then HOME rewrites the input query to select the new columns from the old result table in the Results file without re-ordering the columns in the updated statement. Consider the output of the old statement R2 for the HiveQL example running Q3 of the SSB: (p_name, p_type, p_brand1) are column names from the part1 table, r2 is the new table created from the result data, and R2 is the name of the file that contains the stored results. Fig.4 explains how HOME works in this case (i.e., Order By).

Fig.4 Order by

Case 4: Group By Clauses

If the new input HiveQL uses the same table as a statement in the Results file, the new column set Col1 is a subset of the old statement's column set Col2 (Col1 ⊆ Col2), and the new group-by column is the same as the old group-by column, then HOME rewrites the input query to select the new columns from the old result table in the Results file without re-grouping the columns in the updated statement. Consider the output of the old statement R3 for the HiveQL example running Q4 of the SSB: (p_name, p_partkey, p_size) are column names from the part1 table, r3 is the new table created from the result data, and R3 is the name of the file that contains the stored results. Fig.5 explains how HOME works in this case (i.e., Group By).

Fig.5 Group by

Case 5: Having Clauses

If the new input HiveQL uses the same table as a statement in the Results file, the new column set Col1 is a subset of the old statement's column set Col2 (Col1 ⊆ Col2), the new group-by column is the same as the old group-by column, and the having condition is the same in the new and old queries, then HOME rewrites the input query to select the new columns from the old result table in the Results file, with no need to group the columns or apply the having condition again in the updated statement. Consider the output of the old statement R4 for the HiveQL example running query Q5 of the SSB: (p_name, p_partkey, p_size) are column names from the part1 table, r4 is the new table created from the result data, and R4 is the name of the file that contains the stored results. Fig.6 explains how HOME works in this case (i.e., Having).
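
Cases 3-5 follow the same pattern: when the ordering, grouping, or having clause of the new query matches the stored statement's clause, that clause is simply dropped from the rewritten query. A hedged sketch, assuming the column-subset check of Case 2 has already passed and treating clause equality as normalized string equality:

from typing import List, Optional

def _norm_clause(clause: Optional[str]) -> Optional[str]:
    return " ".join(clause.lower().split()) if clause else None

def rewrite_matching_clauses(new_cols: List[str], result_table: str,
                             new_clauses: dict, old_clauses: dict) -> Optional[str]:
    """Read the materialized result without re-ordering, re-grouping or
    re-filtering, provided every clause matches the stored statement."""
    for key in ("order_by", "group_by", "having"):
        if _norm_clause(new_clauses.get(key)) != _norm_clause(old_clauses.get(key)):
            return None
    return f"SELECT {', '.join(new_cols)} FROM {result_table}"

# Fig.6 example: stored statement R4 grouped part1 by p_name with
# Having sum(p_size) > 20, and its output is materialized as table r4.
print(rewrite_matching_clauses(
    ["p_name", "max(p_partkey)"], "r4",
    {"group_by": "p_name", "having": "sum(p_size) > 20"},
    {"group_by": "p_name", "having": "sum(p_size) > 20"}))
# -> SELECT p_name, max(p_partkey) FROM r4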

Fig.6 Having clauses

[Fig.4 contents: stored statement R2 "SELECT p_name, p_type, p_brand1 FROM part1 Order By p_name"; input HiveQL "SELECT p_name, p_type FROM part1 Order By p_name"; updated HiveQL "SELECT p_name, p_type FROM r2".]

[Fig.5 contents: stored statement R3 "SELECT p_name, max(p_partkey), sum(p_size) FROM part1 Group By p_name"; input HiveQL "SELECT p_name, sum(p_size) FROM part1 Group By p_name"; updated HiveQL "SELECT p_name, sum(p_size) FROM r3".]

[Fig.6 contents: stored statement R4 "SELECT p_name, max(p_partkey), sum(p_size) FROM part1 Group By p_name Having sum(p_size) > 20"; input HiveQL "SELECT p_name, max(p_partkey) FROM part1 Group By p_name Having sum(p_size) > 20"; updated HiveQL "SELECT p_name, max(p_partkey) FROM r4".]


Case 6: Where Clauses

This case consists of three sub-cases:

(1) Added Condition: if the new input HiveQL uses the same table as a statement in the Results file, the new column set Col1 is a subset of (or the same as) the old statement's column set Col2 (Col1 ⊆ Col2), and the new query adds a where condition, then HOME rewrites the input query to select the new columns from the old result table in the Results file and inserts the where condition into the updated statement. Consider the output of the old statement R1 for the HiveQL example running query Q6 of the SSB: (lo_extendedprice, lo_orderdate, lo_discount) are column names from the lineorder1 table, r1 is the new table created from the result data, and R1 is the name of the file that contains the stored results. Fig.7 explains how HOME works in this case (i.e., Where (1)).

Fig.7 Where clauses (1)

(2) Same Condition: if the new input HiveQL uses the same table as a statement in the Results file, the new column set Col1 is a subset of the old statement's column set Col2 (Col1 ⊆ Col2), and the where condition is the same in both queries (old and new), then HOME rewrites the input query to select the new columns from the old result table in the Results file without writing the where condition again in the updated statement. Consider the output of the old statement R5 for the HiveQL example running query Q7 of the SSB: (lo_orderdate, lo_discount, lo_quantity) are column names from the lineorder1 table, r5 is the new table created from the result data, and R5 is the name of the file that contains the stored results. Fig.8 explains how HOME works in this case (i.e., Where (2)).

Fig.8 Where clauses (2)

(3) Different Condition: if the new input HiveQL uses the same table as a statement in the Results file, the new column set Col1 is a subset of the old statement's column set Col2 (Col1 ⊆ Col2), and the old where condition is a subset of the new where condition, then HOME rewrites the input query to select the new columns from the old result table in the Results file with the remaining where condition in the updated statement. Consider the output of the old statement R5 for the HiveQL example running query Q8 of the SSB: (lo_partkey, lo_extendedprice, lo_quantity, lo_discount) are column names from the lineorder1 table, r5 is the new table created from the result data, and R5 is the name of the file that contains the stored results. Fig.9 explains how HOME works in this case (i.e., Where (3)).
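
A sketch of how the three WHERE sub-cases could be decided, treating each WHERE clause as a conjunction of conditions (our assumption; the columns referenced by any remaining condition must of course be present in the stored result):

from typing import List, Optional

def _conds(where: Optional[str]) -> set:
    if not where:
        return set()
    return {" ".join(c.split()) for c in where.lower().split(" and ")}

def rewrite_where(new_cols: List[str], result_table: str,
                  new_where: Optional[str], old_where: Optional[str]) -> Optional[str]:
    new_set, old_set = _conds(new_where), _conds(old_where)
    if not old_set.issubset(new_set):
        return None                   # stored result already dropped rows the new query needs
    residual = new_set - old_set      # (1) all of it, (2) empty, (3) only the extra part
    sql = f"SELECT {', '.join(new_cols)} FROM {result_table}"
    return sql + (f" WHERE {' AND '.join(sorted(residual))}" if residual else "")

# Fig.9 example: R5 stored "... FROM lineorder1 Where lo_discount < 7" as table r5;
# the new query adds "and lo_quantity < 35", so only the extra condition remains.
print(rewrite_where(["lo_partkey", "lo_extendedprice"], "r5",
                    "lo_discount < 7 and lo_quantity < 35", "lo_discount < 7"))
# -> SELECT lo_partkey, lo_extendedprice FROM r5 WHERE lo_quantity < 35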

Fig.9 Where clauses (3)

[Fig.7 contents: stored statement R1 "SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1"; input HiveQL "SELECT lo_extendedprice, lo_orderdate FROM lineorder1 Where lo_discount < 7"; updated HiveQL "SELECT lo_extendedprice, lo_orderdate FROM r1 Where lo_discount < 7".]

[Fig.8 contents: stored statement R5 "SELECT lo_orderdate, lo_discount, lo_quantity FROM lineorder1 Where lo_discount < 7"; input HiveQL "SELECT lo_discount, lo_quantity FROM lineorder1 Where lo_discount < 7"; updated HiveQL "SELECT lo_discount, lo_quantity FROM r5".]

[Fig.9 contents: stored statement R5 "SELECT lo_partkey, lo_extendedprice, lo_quantity, lo_discount FROM lineorder1 Where lo_discount < 7"; input HiveQL "SELECT lo_partkey, lo_extendedprice FROM lineorder1 Where lo_discount < 7 and lo_quantity < 35"; updated HiveQL "SELECT lo_partkey, lo_extendedprice FROM r5 Where lo_quantity < 35".]


Case 7: Join Clauses

This case consists of three sub-cases:

(1) Added Condition: if the new input HiveQL uses the same tables as a statement in the Results file, the new column set Col1 is a subset of (or the same as) the old statement's column set Col2 (Col1 ⊆ Col2), the join condition is the same in the new and old queries, and the new query adds a where condition, then HOME rewrites the input query to select the new columns from the old result table in the Results file with the new where condition in the updated statement, without needing to join the tables again. Consider the output of the old statement R6 for the HiveQL example running query Q9 of the SSB: (lo_custkey, l.lo_discount, l.lo_revenue) are column names from the lineorder1 (l) table, (c_custkey, c.c_city, c.c_nation) are column names from the customer1 (c) table, r6 is the new table created from the result data, and R6 is the name of the file that contains the stored results. Fig.10 explains how HOME works in this case (i.e., Join (1)).

Fig.10 Join clauses (1)

(2) Same Condition: if the new input HiveQL uses the same tables as a statement in the Results file, the new column set Col1 is a subset of (or the same as) the old statement's column set Col2 (Col1 ⊆ Col2), the join condition is the same in the new and old queries, and the new where condition is the same as the old where condition, then HOME rewrites the input query to select the new columns from the old result table in the Results file without needing to apply the where condition or join the tables again in the updated statement. Consider the output of the old statement R6 for the HiveQL example running query Q10 of the SSB: (lo_custkey, l.lo_discount, l.lo_revenue) are column names from the lineorder1 (l) table, (c_custkey, c.c_city, c.c_nation) are column names from the customer1 (c) table, r6 is the new table created from the result data, and R6 is the name of the file that contains the stored results. Fig.11 explains how HOME works in this case (i.e., Join (2)).

Fig.11 Join clauses (2)

(3) Different Condition: if the new input HiveQL uses the same tables as a statement in the Results file, the new column set Col1 is a subset of (or the same as) the old statement's column set Col2 (Col1 ⊆ Col2), the join condition is the same in the new and old queries, and the old where condition is a subset of the new where condition, then HOME rewrites the input query to select the new columns from the old result table in the Results file with the remaining where condition in the updated statement, without needing to join the tables again. Consider the output of the old statement R7 for the HiveQL example running query Q11 of the SSB: (lo_custkey, l.lo_discount, l.lo_revenue) are column names from the lineorder1 (l) table, (c_custkey, c.c_city, c.c_region) are column names from the customer1 (c) table, r7 is the new table created from the result data, and R7 is the name of the file that contains the stored results. Fig.12 explains how HOME works in this case (i.e., Join (3)).
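
The join cases follow the same idea: if the join condition matches, the materialized join output is read directly and only the residual WHERE handling of Case 6 remains. A minimal sketch using the Fig.12 strings (the function name and simple string matching are our assumptions):

from typing import List, Optional

def _norm(s: Optional[str]) -> str:
    return " ".join(s.lower().split()) if s else ""

def rewrite_join(new_cols: List[str], result_table: str,
                 new_join_on: str, old_join_on: str,
                 residual_where: Optional[str] = None) -> Optional[str]:
    if _norm(new_join_on) != _norm(old_join_on):
        return None                            # different join: cannot reuse the result
    sql = f"SELECT {', '.join(new_cols)} FROM {result_table}"
    return sql + (f" WHERE {residual_where}" if residual_where else "")

# Fig.12 example: R7 stored the customer1/lineorder1 join filtered on
# c.c_region = 'ASIA' as table r7; the new query only adds l.lo_discount < 7.
print(rewrite_join(["c.c_region", "l.lo_revenue"], "r7",
                   "l.lo_custkey = c.c_custkey", "l.lo_custkey = c.c_custkey",
                   residual_where="l.lo_discount < 7"))
# -> SELECT c.c_region, l.lo_revenue FROM r7 WHERE l.lo_discount < 7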

[Fig.10 contents: stored statement R6 "SELECT c.c_city, c.c_nation, l.lo_discount, l.lo_revenue FROM customer1 c Join lineorder1 l on (l.lo_custkey = c.c_custkey)"; input HiveQL "SELECT c.c_city, c.c_nation, l.lo_discount FROM customer1 c Join lineorder1 l on (l.lo_custkey = c.c_custkey) Where c.c_nation = 'UNITED STATES'"; updated HiveQL "SELECT c.c_city, c.c_nation, l.lo_discount FROM r6 Where c.c_nation = 'UNITED STATES'".]

[Fig.11 contents: stored statement R6 (the same join as above); input HiveQL "SELECT c.c_city, c.c_nation, l.lo_discount FROM customer1 c Join lineorder1 l on (l.lo_custkey = c.c_custkey)"; updated HiveQL "SELECT c.c_city, c.c_nation, l.lo_discount FROM r6".]


Fig.12 Join clauses (3)

[Fig.12 contents: stored statement R7 "SELECT c.c_city, c.c_region, l.lo_revenue, lo_discount FROM customer1 c Join lineorder1 l on (l.lo_custkey = c.c_custkey) Where c.c_region = 'ASIA'"; input HiveQL "SELECT c.c_region, l.lo_revenue FROM customer1 c Join lineorder1 l on (l.lo_custkey = c.c_custkey) Where c.c_region = 'ASIA' and l.lo_discount < 7"; updated HiveQL "SELECT c.c_region, l.lo_revenue FROM r7 Where l.lo_discount < 7".]

4. Environment Setup

A Hadoop cluster is used, consisting of Ubuntu 9.0.3 virtual machines, each running the Java(TM) SE Runtime Environment with the NetBeans IDE. Hadoop version 1.2.1 is installed and configured with one Namenode and 2 Datanodes. The Namenode and Datanodes each have 20 GB of RAM, 4 cores and a 40 GB disk. Hive 0.11.0 is installed on the Hadoop Namenode and Datanodes.

5. Used Benchmark

The Star Schema Benchmark (SSB) is used. The star schema is based on the TPC-H benchmark and is designed to measure the performance of database products that support data warehouse applications. Our developed HOME uses three tables from SSB: lineorder1, parts1 and customer1. Some modifications have been made to the SSB queries because Hive does not fully support the relational model (i.e., HiveQL is not able to select from more than one table in a select statement; a join statement is used to connect tables) [15].

Three tables are used: lineorder1 with 524288 records, parts1 with 100000 records and customer1 with 15000 records, and 19 select statements are generated for evaluation. To create reuse opportunities, 8 new queries are used as input, 10 queries are created or modified so that they have a reuse opportunity, and one query is an exact repeat (exists query).

6. Experimental Results

18 SSB queries (8 original and 10 with modified substitution parameters) are executed. Each statement is run twice in different sessions, and the performance is measured on the second run.
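
A minimal sketch of how such a two-run measurement could be scripted, assuming the Hive CLI is available on the PATH; the query list and the idea of driving Hive through "hive -e" are our own illustration, not the paper's actual harness.

import subprocess, time

queries = [
    "SELECT lo_extendedprice, lo_orderdate, lo_discount FROM lineorder1",
    # ... the remaining SSB statements would go here ...
]

def run_once(sql: str) -> float:
    """Execute one statement in a fresh Hive CLI session and return its wall-clock time."""
    start = time.perf_counter()
    subprocess.run(["hive", "-e", sql], check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

for sql in queries:
    run_once(sql)                 # first run
    second = run_once(sql)        # second run: the one reported in the experiments
    print(f"{second:6.2f} s  {sql}")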

A comparative study between HOME and default Hive is performed to evaluate the performance.

We note that the execution time in Hive is almost fixed in every case because Hive does not support multi-session reuse.

Case 1:

Executing the exists query, HOME has the best performance with respect to execution time, especially in the second run, as shown in Fig.13. In this case, HOME achieves a 100% reduction in execution time relative to Hive.

Fig.13 Exists query

Case 2:

Executing the input query for the Same Table – Subset of Columns case illustrated in Fig.3, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.14. In this case, HOME achieves a 32% reduction in execution time relative to Hive.

Fig.14 Same Table-Subset Column

Case 3:

Executing the input query for the Order By case illustrated in Fig.4, HOME has the best performance with respect to execution time, especially in the second run, as shown in Fig.15.



In this case, HOME achieves a 70% reduction in execution time relative to Hive.

Fig.15 Order by

Case 4:

Executing the input query for the Group By case illustrated in Fig.5, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.16. In this case, HOME achieves an 89% reduction in execution time relative to Hive.

Fig.16 Group by

Case 5:

Executing the input query for the Having case illustrated in Fig.6, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.17. In this case, HOME achieves an 89% reduction in execution time relative to Hive.

Fig.17 Having clauses

Case 6:

(1) Executing the input query for Where clauses (1) illustrated in Fig.7, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.18. In this case, HOME achieves a 38% reduction in execution time relative to Hive.

Fig.18 Where clauses (1)

(2) Executing the input query for Where clauses (2) illustrated in Fig.8, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.19. In this case, HOME achieves a 36% reduction in execution time relative to Hive.

Fig.19 Where clauses (2)


(3) Executing the input query for Where clauses (3) illustrated in Fig.9, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.20. In this case, HOME achieves a 27% reduction in execution time relative to Hive.

Fig.20 Where clauses (3)

Case 7:

(1) Executing the input query for Join clauses (1) illustrated in Fig.10, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.21. In this case, HOME achieves an 81% reduction in execution time relative to Hive.

Fig.21 Join clauses (1)

(2) Executing the input query for Join clauses (2) illustrated in Fig.11, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.22. In this case, HOME achieves an 89% reduction in execution time relative to Hive.

Fig.22 Join clauses (2)

(3) Executing the input query for Join clauses (3) illustrated in Fig.12, HOME outperforms Hive with respect to execution time, especially in the second run, as shown in Fig.23. In this case, HOME achieves an 88% reduction in execution time relative to Hive.

Fig.23 Join clauses (3)

7. Conclusion and Future Work

In this work, HOME examines the possibility of a result-materialization and reuse mechanism in big data environments: previous results are materialized and then reused.

An initial prototype framework on top of Hive/Hadoop has been implemented, and the SSB benchmark is used to evaluate our work. The results show that when a sharing opportunity exists, especially when using multiple sessions, query execution time can be greatly reduced by reusing previous results. Seven cases (exists query, subset of columns, Order By clauses, Group By clauses, Having clauses, Where clauses and Join statements) have been implemented using Hive and our HOME system, and it is found that HOME


outperforms Hive with respect to execution time. In HOME, the execution time reduction is around 27%-100% relative to Hive.

In future work, our developed HOME will be examined using different workload characteristics, data distributions and cluster sizes. We intend to minimize the imposed overhead by reducing the size of the stored data. Finally, the problem of storage capacity needs to be investigated [19].

ACKNOWLEDGMENT:

The authors would like to thank Fawzya Ramadan, Hussien Shehata and Radhya Sahal for their technical support.

References:

[1] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Proc. OSDI, 2004, pp. 137–150.

[2] M.H. Khafagy and H.T.A. Feel, Distributed Ontology Cloud Storage System, IEEE Proceedings of the 2012 Second Symposium on Network Cloud Computing and Applications, pp. 48-52.

[3] H.T. Feel and M.H. Khafagy, OCSS: Ontology Cloud Storage System, IEEE First International Symposium on Network Cloud Computing and Applications (NCCA), 2011, pp. 9-13.

[4] Haytham Al Feel and Mohamed Khafagy, Search content via Cloud Storage System, International Journal of Computer Science Issues (IJCSI), Volume 8, Issue 6, 2011.

[5] Borthakur, Dhruba , The hadoop distributed file system: Architecture and design, Hadoop Project Website 11: 21, 2007.

[6] Apache Hadoop. Available at: http://hadoop.apache.org/.

[7] A. Alexandrov, D. Battré, D. Warneke, E. Nijkamp, F. Hueske, M. Heimel, O. Kao, S. Ewen and V. Markl, Massively Parallel Data Analysis with PACTs on Nephele, PVLDB, 3(2), 2010, pp. 1625–1628.

[8] E. Capriolo, D. Wampler and J. Rutherglen, Programming Hive, O'Reilly Media, Inc., 2012.

[9] P. Agrawal, D. Kifer, C. Olston, Scheduling shared scans of large data files, Proc. VLDB Endow (PVLDB), 1(1) , 2008, pp.958–969.

[10] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, N. Koudas, MRShare: sharing across multiple queries in MapReduce, Proc. VLDB Endow. (PVLDB) 3(1-2) , 2010, pp.494–505.

[11] I. Elghandour , A. Aboulnaga, ReStore: Reusing results of MapReduce jobs in Pig, (Demo). In Proc ACM SIGMOD, 2012.

[12] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins, Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD Conference, 2008, pp. 1099–1110.

[13] V. Kalavri, H. Shang and V. Vlassov, m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data, IEEE 16th International Conference on Computational Science and Engineering (CSE), 2013.

[14] GSP Parser URL: http://www.sqlparser.com/.

[15] Szénási, S., Distributed Region Growing Algorithm for Medical Image Segmentation, International Journal of Circuits, Systems and Signal Processing, 2014, Vol. 8, No. 1, pp.173-181, ISSN 1998-4464.

[16] Z. Kerekes, Z. Toth, S. Szenasi, Z. Vamossy, Sz. Sergyan, Colon Cancer Diagnosis on Digital Tissue Images, Proceedings of IEEE 9th International Conference on Computational Cybernetics. Tihany, 2013, pp. 159-163.

[17] Szénási, S., Distributed Implementations of Cell Nuclei Detection Algorithm, Recent Advances in Image, Audio and Signal Processing, WSEAS Press, Budapest, 2013, pp. 105-109

[18] P.E. O'Neil, E.J. O'Neil and X. Chen, The Star Schema Benchmark (SSB), 2007.

[19] Ebada Sarhan, Atif Ghalwash, Mohamed Khafagy, Specification and implementation of dynamic web site benchmark in telecommunication area, Proceedings of the 12th WSEAS international conference on Computers 2008, Pages 863-86.


