

WORKLOAD ANALYSIS, SECURITY ASPECTS AND OPTIMIZATION OF WORKLOAD IN HADOOP CLUSTERS

Atul U. Patil(1), T.I. Bagban(2), B.S. Patil(3), R.U. Patil(4), S.A. Gondil(5)

(1) M.E. CSE, ADCET / Shivaji University, Kolhapur, India
(2) Associate Professor, DKTE, Ichalkaranji / Shivaji University, Kolhapur, India
(3) Associate Professor, PVPIT, Budhgaon / Shivaji University, Kolhapur, India
(4) Assistant Professor, BVCOE, Kolhapur / Shivaji University, Kolhapur, India
(5) Assistant Professor, Bharati Vidyapeeth, Palus, Pune, India

ABSTRACT

This paper discusses a proposed cloud system that combines on-demand allocation of resources, with improved utilization, and opportunistic provisioning of cycles from idle cloud nodes to other processes. Providing all demanded services to cloud customers is extremely difficult, and meeting cloud consumers' needs is a significant issue. Hence, an on-demand cloud infrastructure based on a Hadoop configuration with improved CPU utilization and improved storage utilization is proposed, using the Fair4S job scheduling algorithm. All cloud nodes that would otherwise remain idle are put to use; the design also addresses security challenges, achieves load balancing, and processes large data quickly, handling all kinds of jobs whether large or small. We compare the GFS read/write algorithm and the Fair4S job scheduling algorithm for file uploading and downloading, and improve CPU utilization and storage utilization. Cloud computing moves application software and databases to large data centres, where the management of the data and services may not be fully trustworthy. This security problem is addressed by encrypting the data using an encryption/decryption algorithm, together with the Fair4S job scheduling algorithm, which solves the problem of utilizing all idle cloud nodes for larger data.

Keywords: CPU Utilization, Encryption/Decryption Algorithm, Fair4S Job Scheduling Algorithm, GFS, Storage Utilization.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 6, Issue 3, March 2015, pp. 12-23
© IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2015): 8.9958 (calculated by GISI), www.jifactor.com


Such analyses are helpful for helping Hadoop operators identify system bottlenecks and work out solutions for optimizing performance. Several previous efforts have been made in various areas, including network systems [6] and a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs) [21]. A model for securing Map/Reduce computation in the cloud uses a language-based security approach to enforce information-flow policies that vary dynamically because of a restricted, revocable delegation of access rights between principals; the decentralized label model (DLM) is employed to express these policies [18]. A new security architecture, Split Clouds, protects the data held in a cloud while letting every organization keep direct security controls over its data, rather than leaving them to the cloud providers. The main elements of the model are real-time data summaries, an in-line security gateway, and a third-party auditor. By combining the three solutions, the architecture can prevent malicious activities performed even by the security administrators of the cloud providers [20]. Several studies [19], [20], [21] have been conducted on workload analysis in grid environments and parallel computer systems.

They proposed various methods for analysing and modelling workload traces. However, the job characteristics and scheduling policies in a grid are much different from those in a Hadoop system.

    III.  THE PROPOSED SYSTEM

    Fig.1 System Architecture


Cloud computing has become a viable, mainstream solution for processing, storage and distribution, but moving massive amounts of data into and out of the cloud has presented an enormous challenge [4]. Cloud computing is a highly successful paradigm of service-oriented computing and has revolutionized the way computing infrastructure is abstracted and used. The three most popular cloud paradigms are:

1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS)
3. Software as a Service (SaaS)

The concept can also be extended to Database as a Service or Storage as a Service. Scalable database management systems (DBMS), both for update-intensive application workloads and for decision-support systems, are an important part of the cloud infrastructure. Initial designs include distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Changes in the data-access patterns of applications, and the need to scale out to thousands of commodity machines, led to the birth of a new category of systems referred to as key-value stores [11]. In the domain of data analysis, we adopt the MapReduce paradigm and its open-source implementation Hadoop, in terms of usability and performance.

The system has six modules:

1. Hadoop Configuration (Cloud Server Setup)
2. Login & Registration
3. Cloud Service Provider (CSP)
4. Fair4S Job Scheduling Algorithm
5. Encryption/Decryption Module
6. Administration of Client Files (Third Party Auditor)

3.1 Hadoop Configuration (Cloud Server Setup)

Apache Hadoop is a framework that permits the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to many thousands of nodes, providing massive computation and storage capacity. Rather than relying on the underlying hardware to provide high availability, the infrastructure itself is designed to handle failures at the application layer, thus delivering a highly available service on top of a cluster of nodes, each of which may be prone to failures [6]. Hadoop implements MapReduce using HDFS. The Hadoop Distributed File System gives users a single available namespace, spread across many hundreds or thousands of servers, forming one massive file system. Hadoop has been demonstrated on clusters with more than two thousand nodes; the current design target is ten-thousand-node clusters.

Hadoop was inspired by MapReduce, a framework in which an application is decomposed into numerous small parts. Any of these parts (also referred to as fragments or blocks) can be run on any node in the cluster. The present Hadoop system consists of the Hadoop architecture, MapReduce, and the Hadoop Distributed File System (HDFS).
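The paper does not include client code; purely as an illustrative sketch, the following Java fragment shows how an application might write to and read back from the single HDFS namespace through Hadoop's FileSystem API. The NameNode address and path are hypothetical values, not taken from the paper's cluster setup.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file into the shared namespace.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("stored once, replicated across DataNodes".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client never needs to know which nodes hold the blocks.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}
```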


Fig.2 Architecture of Hadoop

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. Every slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs scheduling and submits jobs' tasks to the TaskTrackers [9].

A TaskTracker is a slave-node daemon in the cluster that accepts tasks (map, reduce and shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node, and it runs in its own JVM process. Each TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a task instance); this ensures that a process failure does not take down the TaskTracker itself [10].

The NameNode stores the entire file system namespace. Information such as last-modified time, creation time, file size, owner and permissions is stored in the NameNode [10]. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS)

HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond [8].

    3.2 Login and Registration

    It offer Interface to Login. Client will upload the file and download file from cloud and

    obtain the detailed summery of his account. During this means security is provided to the consumer

    by providing consumer user name and password and stores it in info at the most server that ensures

    the safety. Any information uploaded and downloaded, log record has every activity which may be

    used for more audit trails. With this facility, it ensures enough security to consumer and informationhold on at the cloud servers solely may be changed by the consumer.
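The paper does not specify how credentials are stored beyond a database on the main server; a minimal sketch, assuming a salted PBKDF2 hash (our choice of algorithm and parameters, not the paper's), of how such a password could be protected before storage:

```java
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class CredentialStore {
    // Assumed parameters; the paper does not specify a hashing scheme.
    private static final int ITERATIONS = 10_000;
    private static final int KEY_LENGTH = 256;

    /** Returns "salt:hash", both Base64 encoded, for storage in the user table. */
    static String hashPassword(char[] password) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);               // per-user random salt
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_LENGTH);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                                      .generateSecret(spec).getEncoded();
        return Base64.getEncoder().encodeToString(salt) + ":"
             + Base64.getEncoder().encodeToString(hash);
    }
}
```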


3.3 Cloud Service Provider (Administrator)

This module administers users and data. The cloud service provider has the authority to add and remove clients, and it ensures adequate security for clients' data stored on the cloud servers. Only registered and authorized clients, whose activities are captured in log records, can access the services; these per-client log records help improve security.

3.4 Job Scheduling Algorithm

MapReduce is a distributed processing model and implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Clients specify the workload computation in terms of a map and a reduce function: users write a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This enables programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [7].
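As a concrete illustration of this programming model, the standard word-count example in Hadoop's Java MapReduce API is sketched below (this is the textbook example, not code from the paper): the map function emits a (word, 1) pair for every token, and the reduce function sums the counts that share the same intermediate key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    /** Map: for each input line, emit (word, 1) intermediate pairs. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce: sum all counts associated with the same intermediate key (word). */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```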

Our implementation of the Fair4S job scheduling algorithm runs on a large cluster of commodity machines and is highly scalable. MapReduce was popularized by the open-source Hadoop project. Our Fair4S job scheduling algorithm processes large files by dividing them into a number of chunks and assigning the tasks to the cluster nodes in a Hadoop multi-node configuration. In this way the proposed Fair4S job scheduling algorithm improves the utilization of the cluster nodes with respect to parameters such as time, CPU and storage.

3.4.1 Features of Fair4S

The extended functionalities available in the Fair4S scheduling algorithm, listed below, make it more workload-efficient than the GFS read/write algorithm; they allow the algorithm to deliver efficient performance when processing heavy workloads from different clients (a sketch of the quota check appears after the list).

1. Setting Slot Quotas for Pools. All jobs are divided into several pools, and every job belongs to one of them. In Fair4S, every pool is configured with a maximum slot occupancy. All jobs belonging to the same pool share the slot quota, and the number of slots used by these jobs at any time is restricted to the maximum slot occupancy of their pool. The upper limit on slot occupancy per user group makes slot assignment more flexible and adjustable, and ensures slot-occupancy isolation across different user groups. Even if some slots are occupied by large jobs, the influence is restricted to the local pool.

2. Setting Slot Quotas for Individual Users. In Fair4S, every user is configured with a maximum slot occupancy. For a given user, no matter how many jobs he or she submits, the total number of occupied slots will not exceed the quota. This per-user constraint prevents a single user from submitting too many jobs and having those jobs occupy too many slots.

3. Assigning Slots Based on Pool Weight. In Fair4S, every pool is configured with a weight. All pools waiting for more slots form a queue of pools. For a given pool, the number of times it occurs in the queue is linear in the weight of the pool; therefore, a pool with a high weight is allocated more slots. Because the pool weight is configurable, the weight-based slot assignment policy effectively decreases small jobs' waiting time for slots.

4. Extending Job Priorities. Fair4S introduces a detailed and quantified priority for every job. The job priority is described by an integer ranging from zero to one thousand. Generally, within a pool, a job with a higher priority can preempt the slots used by another job


with a lower priority. A quantified job priority helps differentiate the priorities of small jobs in different user groups.
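The Fair4S source is not given in the paper; as a minimal sketch under that caveat, the following Java fragment shows one way the pool and per-user slot quotas described in features 1 and 2 could be represented and enforced. All class and field names are ours.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, illustrative view of Fair4S-style slot quotas. */
class SlotQuotas {
    // Maximum slots each pool may occupy at a time (feature 1).
    private final Map<String, Integer> poolQuota = new HashMap<>();
    // Maximum slots each individual user may occupy (feature 2).
    private final Map<String, Integer> userQuota = new HashMap<>();
    // Currently occupied slots, per pool and per user.
    private final Map<String, Integer> poolUsed = new HashMap<>();
    private final Map<String, Integer> userUsed = new HashMap<>();

    /** A request for extra slots is admitted only if neither quota would be exceeded. */
    boolean mayAllocate(String pool, String user, int requested) {
        int pUsed = poolUsed.getOrDefault(pool, 0);
        int uUsed = userUsed.getOrDefault(user, 0);
        return pUsed + requested <= poolQuota.getOrDefault(pool, 0)
            && uUsed + requested <= userQuota.getOrDefault(user, 0);
    }
}
```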

    3.4.2 Fair4s Job Scheduling Algorithm

    A job scheduling algorithm, Fair4S, which is modeled to be biased for small jobs. In varietyof workloads Small jobs account for the majority of the workload, and lots of them require instant

    responses, which is an important factor at production Hadoop systems. The inefficiency of Hadoop

    fair scheduler and GFS read write algorithm for handling small jobs motivates us to use and analyze

    Fair4S, which introduces pool weights and extends job priorities to guarantee the rapid responses for

    small jobs [1] In this scenario clients is going to upload or download file from the main server where

    the Fair4s Job Scheduling Algorithm going to execute. On main server the mapper function will

    provide the list of available cluster I/P addresses to which tasks are get assigned so that the task of

    files splitting get assigned to each live clusters. Fair4s Job Scheduling Algorithm splits file according

    to size and the available cluster nodes.
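A minimal sketch of the splitting step just described: the file is divided into fixed-size chunks and the chunks are assigned round-robin to the live cluster nodes reported by the main server. The chunk size and the round-robin policy are assumptions made for illustration; the paper only states that files are split by size across the available nodes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative chunk assignment: split a file by size across available nodes. */
class ChunkAssigner {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // assumed 64 MB chunks

    /** Returns, for each chunk index, the IP address of the node it is assigned to. */
    static List<String> assign(long fileSizeBytes, List<String> liveNodeIps) {
        List<String> assignment = new ArrayList<>();
        long chunks = (fileSizeBytes + CHUNK_SIZE - 1) / CHUNK_SIZE; // ceiling division
        for (long i = 0; i < chunks; i++) {
            // Round-robin over the nodes that reported themselves alive.
            assignment.add(liveNodeIps.get((int) (i % liveNodeIps.size())));
        }
        return assignment;
    }
}
```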

3.4.3 Procedure of Slot Allocation

1. The first step is to allocate slots to job pools. Every job pool is configured with two parameters: a maximum slot quota and a pool weight. In no case will the count of slots allocated to a job pool exceed its maximum slot quota. If the slot demand of a job pool varies, the maximum slot quota is adjusted manually by Hadoop operators. When a job pool requests additional slots, the scheduler first judges whether the slot occupancy of the pool would exceed the quota. If not, the pool is appended to the queue to wait for slot allocation. The scheduler allocates the slots using a round-robin algorithm; probabilistically, a pool with a high allocation weight is more likely to be allocated slots.

2. The second step is to allocate slots to individual jobs. Every job is configured with a job priority parameter, a value between zero and one thousand. The job priority and deficit are combined into a weight for the job. Within a job pool, idle slots are allocated to the jobs with the highest weight (see the sketch below).
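A minimal sketch of this two-step allocation, under the same caveat that the Fair4S source is not given in the paper: pools wait in a queue in which a pool with weight w appears w times, so weighted round-robin hands it proportionally more slots, and inside a pool an idle slot goes to the job with the highest priority-derived weight. The deficit term is omitted for brevity, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative two-step Fair4S-style slot allocation. */
class SlotAllocator {

    record Pool(String name, int weight, int maxSlots, int usedSlots, List<Job> waitingJobs) {}
    record Job(String id, int priority /* 0..1000 */) {}

    /** Step 1: build the round-robin queue; a pool occurs as many times as its weight. */
    static List<Pool> buildQueue(List<Pool> poolsWantingSlots) {
        List<Pool> queue = new ArrayList<>();
        for (Pool p : poolsWantingSlots) {
            if (p.usedSlots() < p.maxSlots()) {        // never exceed the pool's slot quota
                for (int i = 0; i < p.weight(); i++) {
                    queue.add(p);
                }
            }
        }
        return queue;
    }

    /** Step 2: inside the chosen pool, give the idle slot to the highest-weight job.
     *  Here the weight is simply the job priority; Fair4S also mixes in a deficit term. */
    static Job pickJob(Pool pool) {
        return pool.waitingJobs().stream()
                   .max(Comparator.comparingInt(Job::priority))
                   .orElse(null);
    }
}
```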

    3.5 Encryption/decryption

    In this, file get encrypted/decrypted by exploitation the RSA encryption/decryption algorithm

    encryption/decryption algorithm uses public key & private key for the encryption and

    decipherment of data. Consumer transfer the file in conjunction with some secrete/public key so

    private key's generated & file get encrypted. At the reverse method by using the public

    key/private key pair file get decrypted and downloaded. Like client upload the file with the public

    key and also the file name that is used to come up with the distinctive private key's used for

    encrypting the file. During this approach uploaded file get encrypted and store at main servers and sothis file get splitted by using the Fair4s Scheduling algorithm that provides distinctive security

    feature for cloud data. In an exceedingly reverse method of downloading the data from cloud servers,

    file name and public key wont to generate secrete and combines The all parts of file so data get

    decrypted and downloaded that ensures the tremendous quantity of security to cloud information.
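A minimal sketch of the RSA step using the standard Java cryptography API is given below. The key size, padding choice, and the idea of encrypting the payload directly with RSA are simplifications for illustration; in practice RSA is limited to small payloads (about 245 bytes with a 2048-bit key and PKCS#1 padding) and would normally wrap a symmetric key for large files.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;

public class RsaExample {
    public static void main(String[] args) throws Exception {
        // Generate the public/private key pair used for upload and download.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();

        byte[] plain = "file contents to protect".getBytes(StandardCharsets.UTF_8);

        // Upload path: encrypt with the public key before the file is split and stored.
        Cipher enc = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        enc.init(Cipher.ENCRYPT_MODE, pair.getPublic());
        byte[] cipherText = enc.doFinal(plain);

        // Download path: recombine the parts, then decrypt with the private key.
        Cipher dec = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        dec.init(Cipher.DECRYPT_MODE, pair.getPrivate());
        byte[] recovered = dec.doFinal(cipherText);

        System.out.println(new String(recovered, StandardCharsets.UTF_8));
    }
}
```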


    Fig.3 RSA encryption/decryption

3.6 Administration of Client Files (Third Party Auditor)

This module provides a facility for auditing all client files, covering the various activities performed by clients. Log records are created and stored on the main server: for every registered client, a log record is created that captures the operations (upload/download) performed by the client, along with the time and date at which each activity was carried out. The log records help with the safety and security of client data and with auditing. A log-record facility is also provided for the administrator, recording the log information of all registered clients, so that the administrator has control over all the data stored on the cloud servers. The administrator can view client-wise log records, which helps detect fraudulent data access if a fake user attempts to access the data stored on the cloud servers.

Registered client log records:

    Fig.4 List of Log records of clients.
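The exact schema of these records is not given in the paper; as a minimal sketch, each audit-trail entry could carry the client, the operation, the file, and a timestamp. The field names below are illustrative only.

```java
import java.time.Instant;

/** One audit-trail entry recorded on the main server for every client operation. */
record LogRecord(String clientId,
                 String operation,   // "UPLOAD" or "DOWNLOAD"
                 String fileName,
                 Instant timestamp) {

    static LogRecord upload(String clientId, String fileName) {
        return new LogRecord(clientId, "UPLOAD", fileName, Instant.now());
    }

    static LogRecord download(String clientId, String fileName) {
        return new LogRecord(clientId, "DOWNLOAD", fileName, Instant.now());
    }
}
```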


IV. RESULTS

The results are based on experiments with a number of clients, one main server, and three to five secondary servers, evaluated on three parameters:

1) Time
2) CPU utilization
3) Storage utilization

Our evaluation examines the improved utilization of the cluster nodes (i.e. the secondary servers) when uploading and downloading files with the Fair4S scheduling algorithm versus the GFS read/write algorithm from three perspectives: improved time utilization, improved CPU utilization, and a substantial improvement in storage utilization.

4.1 Results for Time Utilization

Fig.5 Time Utilization Graph for Uploading Files

Fig. 5 shows the time utilization of the GFS and Fair4S algorithms for uploading files. The measured values are:

Uploading File Size (KB)   Time (ms) for GFS   Time (ms) for Fair4S
1742936                    1720                107
4734113                    928                 170
6938669                    1473                117
11527296                   1857                704
3057917                    253                 38
17385800                   1859                839


Fig.6 Time Utilization Graph for Downloading Files

Fig. 6 shows the time utilization of GFS and Fair4S for downloading files. The measured values are:

Number of Files   Time (ms) for GFS   Time (ms) for Fair4S
5                 840                 795
7                 1937                1852
9                 4814                3698
11                5143                4111

4.2 Results for CPU Utilization

Fig.7 CPU Utilization Graph for GFS Files

Fig. 7 shows the CPU utilization for GFS files across the cluster nodes.


Fig.8 CPU utilization graph for the Fair4S algorithm across the cluster nodes in Hadoop.

V. CONCLUSION

We have proposed an improved cloud architecture that combines on-demand scheduling of infrastructure resources and optimized utilization with opportunistic provisioning of cycles from idle nodes to other processes. A cloud infrastructure using a Hadoop configuration with improved processor utilization and storage-space utilization is proposed using the Fair4S job scheduling algorithm. All nodes that would otherwise remain idle are utilized, security problems are largely mitigated, load balancing is achieved, and large data is processed quickly. We compare the GFS read/write algorithm and the Fair4S MapReduce algorithm for file uploading and downloading, and optimize processor utilization and storage-space use. In this paper we also outline some of the techniques that are implemented to protect data and propose an architecture to protect data in the cloud. This model stores data in the cloud in encrypted form using the RSA technique, which relies on encryption and decryption of the data. Until now, many proposed works have offered a Hadoop configuration for cloud infrastructure, yet the cloud nodes still remain idle, and there has been no comparable work on CPU utilization and storage utilization for the GFS read/write algorithm versus the Fair4S scheduling algorithm.

We provide a solution to the backfill problem using an on-demand user workload on a cloud structure based on Hadoop. We contribute an increase in processor utilization and time utilization between GFS and Fair4S. In our work, all cloud nodes are fully utilized and none remains idle; files are processed at a faster rate, so tasks are completed in less time, which is a significant advantage and improves utilization. We also implement the RSA algorithm to secure the data, thereby improving security.

VI. REFERENCES

1. Zujie Ren and Jian Wan, "Workload Analysis, Implications, and Optimization on a Production Hadoop Cluster: A Case Study on Taobao," IEEE Transactions on Services Computing, vol. 7, no. 2, April-June 2014.

2. M. Zaharia, D. Borthakur, J.S. Sarma, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters," Univ. of California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-55, Apr. 2009.


3. Y. Chen, S. Alspaugh, and R.H. Katz, "Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads," Proc. VLDB Endowment, vol. 5, no. 12, Aug. 2012.

4. Divyakant Agrawal et al., "Big Data and Cloud Computing: Current State and Future Opportunities," EDBT, pp. 22-24, March 2011.

5. Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, "Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao," in Proc. IEEE IISWC, 2012, pp. 3-13.

6. Jeffrey Dean et al., "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, January 2008.

7. Y. Chen, S. Alspaugh, D. Borthakur, and R.H. Katz, "Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis," in Proc. EuroSys, 2012, pp. 43-56.

8. Stack Overflow (2014-07-14), "Hadoop Architecture Internals: use of job and task trackers" [English]. Available: http://stackoverflow.com/questions/11263187/hadoop-architecture-internals-use-of-job-and-task-trackers

9. S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An Analysis of Traces from a Production MapReduce Cluster," in Proc. CCGRID, 2010, pp. 94-103.

10. J. Dean et al., "MapReduce: A Flexible Data Processing Tool," CACM, Jan. 2010.

11. M. Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?" CACM, Jan. 2010.

12. X. Liu, J. Han, Y. Zhong, C. Han, and X. He, "Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS," in Proc. CLUSTER, 2009, pp. 1-8.

13. A. Abouzeid et al., "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," in VLDB, 2009.

14. S. Das et al., "Ricardo: Integrating R and Hadoop," in SIGMOD, 2010.

15. J. Cohen et al., "MAD Skills: New Analysis Practices for Big Data," in VLDB, 2009.

16. Gaizhen Yang et al., "The Application of SaaS-Based Cloud Computing in the University Research and Teaching Platform," ISIE, pp. 210-213, 2011.

17. Paul Marshall et al., "Improving Utilization of Infrastructure Clouds," IEEE/ACM International Symposium, pp. 205-214, 2011.

18. F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E. Long, and T.T. McLarty, "File System Workload Analysis for Large Scale Scientific Computing Applications," in Proc. MSST, 2004, pp. 139-152.

19. M. Zaharia, D. Borthakur, J.S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," in Proc. EuroSys, 2010, pp. 265-278.

20. E. Medernach, "Workload Analysis of a Cluster in a Grid Environment," in Proc. Job Scheduling Strategies for Parallel Processing, 2005, pp. 36-61.

21. K. Christodoulopoulos, V. Gkamas, and E.A. Varvarigos, "Statistical Analysis and Modeling of Jobs in a Grid Environment," J. Grid Computing, vol. 6, no. 1, 2008.

22. Gandhali Upadhye and Trupti Dange, "Nephele: Efficient Data Processing Using Hadoop," International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 7, 2014, pp. 11-16, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

23. Suhas V. Ambade and Priya Deshpande, "Hadoop Block Placement Policy for Different File Formats," International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 12, 2014, pp. 249-256, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

