

WORKLOAD ANALYSIS, SECURITY ASPECTS AND OPTIMIZATION OF WORKLOAD IN HADOOP CLUSTERS

Atul U. Patil(1), T.I. Bagban(2), B.S. Patil(3), R.U. Patil(4), S.A. Gondil(5)

(1) M.E. CSE, ADCET / Shivaji University, Kolhapur, India
(2) Associate Professor, DKTE, Ichalkaranji / Shivaji University, Kolhapur, India
(3) Associate Professor, PVPIT, Budhgaon / Shivaji University, Kolhapur, India
(4) Assistant Professor, BVCOE, Kolhapur / Shivaji University, Kolhapur, India
(5) Assistant Professor, Bharati Vidyapeeth, Palus, Pune, India

ABSTRACT

This paper discusses a proposed cloud system that combines on-demand allocation of resources, with improved utilization, and opportunistic provisioning of cycles from idle cloud nodes to other processes. Providing all demanded services to cloud customers is extremely difficult, and meeting cloud consumers' needs is a significant issue. Hence, an on-demand cloud infrastructure based on a Hadoop configuration with improved CPU utilization and improved storage utilization is proposed, using the Fair4S job scheduling algorithm. All cloud nodes that would otherwise remain idle are put to use; the design also addresses security challenges, achieves load balancing, and processes large data quickly, handling all kinds of jobs whether large or small. We compare the GFS read/write algorithm and the Fair4S job scheduling algorithm for file uploading and downloading, and improve CPU utilization and storage utilization. Cloud computing moves application software and databases to large data centres, where the management of the data and services may not be fully trustworthy. This security problem is addressed by encrypting the data using an encryption/decryption algorithm, together with the Fair4S job scheduling algorithm, which solves the problem of utilizing all idle cloud nodes for larger data.

Keywords: CPU Utilization, Encryption/Decryption Algorithm, Fair4S Job Scheduling Algorithm, GFS, Storage Utilization.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 6, Issue 3, March 2015, pp. 12-23
© IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2015): 8.9958 (calculated by GISI), www.jifactor.com


Such analyses are helpful for helping Hadoop operators identify system bottlenecks and work out solutions for optimizing performance. Several previous efforts have been made in various areas, including network systems [6] and a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs) [21]. A model for securing Map/Reduce computation in the cloud uses a language-based security approach to enforce information-flow policies that vary dynamically because of a restricted, revocable delegation of access rights between principals; the decentralized label model (DLM) is employed to express these policies [18]. A new security architecture, Split Clouds, protects the data held in a cloud while letting every organization keep direct security controls over its data, rather than leaving them to the cloud providers. The main elements of the model are real-time data summaries, an in-line security gateway, and a third-party auditor. By combining the three solutions, the architecture can prevent malicious activities performed even by the security administrators of the cloud providers [20]. Several studies [19], [20], [21] have been conducted on workload analysis in grid environments and parallel computer systems.

They proposed various methods for analysing and modelling workload traces. However, the job characteristics and scheduling policies in a grid are much different from those in a Hadoop system.

    III.  THE PROPOSED SYSTEM

    Fig.1 System Architecture


Cloud computing has become a viable, mainstream solution for processing, storage and distribution, but moving massive amounts of data into and out of the cloud has presented an enormous challenge [4]. Cloud computing is a highly successful paradigm of service-oriented computing and has revolutionized the way computing infrastructure is abstracted and used. The three most popular cloud paradigms are:

1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS)
3. Software as a Service (SaaS)

The concept can also be extended to Database as a Service or Storage as a Service. Scalable database management systems (DBMS), both for update-intensive application workloads and for decision-support systems, are an important part of the cloud infrastructure. Initial designs include distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Changes in the data-access patterns of applications, and the need to scale out to thousands of commodity machines, led to the birth of a new category of systems referred to as key-value stores [11]. In the domain of data analysis, we adopt the MapReduce paradigm and its open-source implementation Hadoop, in terms of usability and performance.

The system has six modules:

1. Hadoop Configuration (Cloud Server Setup)
2. Login & Registration
3. Cloud Service Provider (CSP)
4. Fair4S Job Scheduling Algorithm
5. Encryption/Decryption Module
6. Administration of Client Files (Third Party Auditor)

3.1 Hadoop Configuration (Cloud Server Setup)

Apache Hadoop is a framework that permits the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to many thousands of nodes, providing massive computation and storage capacity. Rather than relying on the underlying hardware to provide high availability, the infrastructure itself is designed to handle failures at the application layer, thus delivering a highly available service on top of a cluster of nodes, each of which may be prone to failures [6]. Hadoop implements MapReduce using HDFS. The Hadoop Distributed File System gives users a single available namespace, spread across many hundreds or thousands of servers, forming one massive file system. Hadoop has been demonstrated on clusters with more than two thousand nodes; the current design target is ten-thousand-node clusters.

Hadoop was inspired by MapReduce, a framework in which an application is decomposed into numerous small parts. Any of these parts (also referred to as fragments or blocks) can be run on any node in the cluster. The present Hadoop system consists of the Hadoop architecture, MapReduce, and the Hadoop Distributed File System (HDFS).
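The paper does not include client code; purely as an illustrative sketch, the following Java fragment shows how an application might write to and read back from the single HDFS namespace through Hadoop's FileSystem API. The NameNode address and path are hypothetical values, not taken from the paper's cluster setup.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file into the shared namespace.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("stored once, replicated across DataNodes".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client never needs to know which nodes hold the blocks.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}
```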


Fig.2 Architecture of Hadoop

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. Every slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs scheduling and submits jobs' tasks to the TaskTrackers [9].

A TaskTracker is a slave-node daemon in the cluster that accepts tasks (map, reduce and shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node, and it runs in its own JVM process. Each TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a task instance); this ensures that a process failure does not take down the TaskTracker itself [10].

The NameNode stores the entire file system namespace. Information such as last-modified time, creation time, file size, owner and permissions is stored in the NameNode [10]. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS)

HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond [8].

    3.2 Login and Registration

    It offer Interface to Login. Client will upload the file and download file from cloud and

    obtain the detailed summery of his account. During this means security is provided to the consumer

    by providing consumer user name and password and stores it in info at the most server that ensures

    the safety. Any information uploaded and downloaded, log record has every activity which may be

    used for more audit trails. With this facility, it ensures enough security to consumer and informationhold on at the cloud servers solely may be changed by the consumer.
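The paper does not specify how credentials are stored beyond a database on the main server; a minimal sketch, assuming a salted PBKDF2 hash (our choice of algorithm and parameters, not the paper's), of how such a password could be protected before storage:

```java
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class CredentialStore {
    // Assumed parameters; the paper does not specify a hashing scheme.
    private static final int ITERATIONS = 10_000;
    private static final int KEY_LENGTH = 256;

    /** Returns "salt:hash", both Base64 encoded, for storage in the user table. */
    static String hashPassword(char[] password) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);               // per-user random salt
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_LENGTH);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                                      .generateSecret(spec).getEncoded();
        return Base64.getEncoder().encodeToString(salt) + ":"
             + Base64.getEncoder().encodeToString(hash);
    }
}
```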


3.3 Cloud Service Provider (Administrator)

This module administers users and data. The cloud service provider has the authority to add and remove clients, and it ensures adequate security for clients' data stored on the cloud servers. Only registered and authorized clients, whose activities are captured in log records, can access the services; these per-client log records help improve security.

3.4 Job Scheduling Algorithm

MapReduce is a distributed processing model and implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Clients specify the workload computation in terms of a map and a reduce function: users write a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This enables programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [7].
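As a concrete illustration of this programming model, the standard word-count example in Hadoop's Java MapReduce API is sketched below (this is the textbook example, not code from the paper): the map function emits a (word, 1) pair for every token, and the reduce function sums the counts that share the same intermediate key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    /** Map: for each input line, emit (word, 1) intermediate pairs. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce: sum all counts associated with the same intermediate key (word). */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```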

Our implementation of the Fair4S job scheduling algorithm runs on a large cluster of commodity machines and is highly scalable. MapReduce was popularized by the open-source Hadoop project. Our Fair4S job scheduling algorithm processes large files by dividing them into a number of chunks and assigning the tasks to the cluster nodes in a Hadoop multi-node configuration. In this way the proposed Fair4S job scheduling algorithm improves the utilization of the cluster nodes with respect to parameters such as time, CPU and storage.

3.4.1 Features of Fair4S

The extended functionalities available in the Fair4S scheduling algorithm, listed below, make it more workload-efficient than the GFS read/write algorithm; they allow the algorithm to deliver efficient performance when processing heavy workloads from different clients (a sketch of the quota check appears after the list).

1. Setting Slot Quotas for Pools. All jobs are divided into several pools, and every job belongs to one of them. In Fair4S, every pool is configured with a maximum slot occupancy. All jobs belonging to the same pool share the slot quota, and the number of slots used by these jobs at any time is restricted to the maximum slot occupancy of their pool. The upper limit on slot occupancy per user group makes slot assignment more flexible and adjustable, and ensures slot-occupancy isolation across different user groups. Even if some slots are occupied by large jobs, the influence is restricted to the local pool.

2. Setting Slot Quotas for Individual Users. In Fair4S, every user is configured with a maximum slot occupancy. For a given user, no matter how many jobs he or she submits, the total number of occupied slots will not exceed the quota. This per-user constraint prevents a single user from submitting too many jobs and having those jobs occupy too many slots.

3. Assigning Slots Based on Pool Weight. In Fair4S, every pool is configured with a weight. All pools waiting for more slots form a queue of pools. For a given pool, the number of times it occurs in the queue is linear in the weight of the pool; therefore, a pool with a high weight is allocated more slots. Because the pool weight is configurable, the weight-based slot assignment policy effectively decreases small jobs' waiting time for slots.

4. Extending Job Priorities. Fair4S introduces a detailed and quantified priority for every job. The job priority is described by an integer ranging from zero to one thousand. Generally, within a pool, a job with a higher priority can preempt the slots used by another job


with a lower priority. A quantified job priority helps differentiate the priorities of small jobs in different user groups.
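The Fair4S source is not given in the paper; as a minimal sketch under that caveat, the following Java fragment shows one way the pool and per-user slot quotas described in features 1 and 2 could be represented and enforced. All class and field names are ours.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, illustrative view of Fair4S-style slot quotas. */
class SlotQuotas {
    // Maximum slots each pool may occupy at a time (feature 1).
    private final Map<String, Integer> poolQuota = new HashMap<>();
    // Maximum slots each individual user may occupy (feature 2).
    private final Map<String, Integer> userQuota = new HashMap<>();
    // Currently occupied slots, per pool and per user.
    private final Map<String, Integer> poolUsed = new HashMap<>();
    private final Map<String, Integer> userUsed = new HashMap<>();

    /** A request for extra slots is admitted only if neither quota would be exceeded. */
    boolean mayAllocate(String pool, String user, int requested) {
        int pUsed = poolUsed.getOrDefault(pool, 0);
        int uUsed = userUsed.getOrDefault(user, 0);
        return pUsed + requested <= poolQuota.getOrDefault(pool, 0)
            && uUsed + requested <= userQuota.getOrDefault(user, 0);
    }
}
```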

    3.4.2 Fair4s Job Scheduling Algorithm

    A job scheduling algorithm, Fair4S, which is modeled to be biased for small jobs. In varietyof workloads Small jobs account for the majority of the workload, and lots of them require instant

    responses, which is an important factor at production Hadoop systems. The inefficiency of Hadoop

    fair scheduler and GFS read write algorithm for handling small jobs motivates us to use and analyze

    Fair4S, which introduces pool weights and extends job priorities to guarantee the rapid responses for

    small jobs [1] In this scenario clients is going to upload or download file from the main server where

    the Fair4s Job Scheduling Algorithm going to execute. On main server the mapper function will

    provide the list of available cluster I/P addresses to which tasks are get assigned so that the task of

    files splitting get assigned to each live clusters. Fair4s Job Scheduling Algorithm splits file according

    to size and the available cluster nodes.
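A minimal sketch of the splitting step just described: the file is divided into fixed-size chunks and the chunks are assigned round-robin to the live cluster nodes reported by the main server. The chunk size and the round-robin policy are assumptions made for illustration; the paper only states that files are split by size across the available nodes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative chunk assignment: split a file by size across available nodes. */
class ChunkAssigner {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // assumed 64 MB chunks

    /** Returns, for each chunk index, the IP address of the node it is assigned to. */
    static List<String> assign(long fileSizeBytes, List<String> liveNodeIps) {
        List<String> assignment = new ArrayList<>();
        long chunks = (fileSizeBytes + CHUNK_SIZE - 1) / CHUNK_SIZE; // ceiling division
        for (long i = 0; i < chunks; i++) {
            // Round-robin over the nodes that reported themselves alive.
            assignment.add(liveNodeIps.get((int) (i % liveNodeIps.size())));
        }
        return assignment;
    }
}
```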

3.4.3 Procedure of Slot Allocation

1. The first step is to allocate slots to job pools. Every job pool is configured with two parameters: a maximum slot quota and a pool weight. In no case will the count of slots allocated to a job pool exceed its maximum slot quota. If the slot demand of a job pool varies, the maximum slot quota is adjusted manually by Hadoop operators. When a job pool requests additional slots, the scheduler first judges whether the slot occupancy of the pool would exceed the quota. If not, the pool is appended to the queue to wait for slot allocation. The scheduler allocates the slots using a round-robin algorithm; probabilistically, a pool with a high allocation weight is more likely to be allocated slots.

2. The second step is to allocate slots to individual jobs. Every job is configured with a job priority parameter, a value between zero and one thousand. The job priority and deficit are combined into a weight for the job. Within a job pool, idle slots are allocated to the jobs with the highest weight (see the sketch below).
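A minimal sketch of this two-step allocation, under the same caveat that the Fair4S source is not given in the paper: pools wait in a queue in which a pool with weight w appears w times, so weighted round-robin hands it proportionally more slots, and inside a pool an idle slot goes to the job with the highest priority-derived weight. The deficit term is omitted for brevity, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative two-step Fair4S-style slot allocation. */
class SlotAllocator {

    record Pool(String name, int weight, int maxSlots, int usedSlots, List<Job> waitingJobs) {}
    record Job(String id, int priority /* 0..1000 */) {}

    /** Step 1: build the round-robin queue; a pool occurs as many times as its weight. */
    static List<Pool> buildQueue(List<Pool> poolsWantingSlots) {
        List<Pool> queue = new ArrayList<>();
        for (Pool p : poolsWantingSlots) {
            if (p.usedSlots() < p.maxSlots()) {        // never exceed the pool's slot quota
                for (int i = 0; i < p.weight(); i++) {
                    queue.add(p);
                }
            }
        }
        return queue;
    }

    /** Step 2: inside the chosen pool, give the idle slot to the highest-weight job.
     *  Here the weight is simply the job priority; Fair4S also mixes in a deficit term. */
    static Job pickJob(Pool pool) {
        return pool.waitingJobs().stream()
                   .max(Comparator.comparingInt(Job::priority))
                   .orElse(null);
    }
}
```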

    3.5 Encryption/decryption

    In this, file get encrypted/decrypted by exploitation the RSA encryption/decryption algorithm

    encryption/decryption algorithm uses public key & private key for the encryption and

    decipherment of data. Consumer transfer the file in conjunction with some secrete/public key so

    private key's generated & file get encrypted. At the reverse method by using the public

    key/private key pair file get decrypted and downloaded. Like client upload the file with the public

    key and also the file name that is used to come up with the distinctive private key's used for

    encrypting the file. During this approach uploaded file get encrypted and store at main servers and sothis file get splitted by using the Fair4s Scheduling algorithm that provides distinctive security

    feature for cloud data. In an exceedingly reverse method of downloading the data from cloud servers,

    file name and public key wont to generate secrete and combines The all parts of file so data get

    decrypted and downloaded that ensures the tremendous quantity of security to cloud information.
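A minimal sketch of the RSA step using the standard Java cryptography API is given below. The key size, padding choice, and the idea of encrypting the payload directly with RSA are simplifications for illustration; in practice RSA is limited to small payloads (about 245 bytes with a 2048-bit key and PKCS#1 padding) and would normally wrap a symmetric key for large files.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;

public class RsaExample {
    public static void main(String[] args) throws Exception {
        // Generate the public/private key pair used for upload and download.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();

        byte[] plain = "file contents to protect".getBytes(StandardCharsets.UTF_8);

        // Upload path: encrypt with the public key before the file is split and stored.
        Cipher enc = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        enc.init(Cipher.ENCRYPT_MODE, pair.getPublic());
        byte[] cipherText = enc.doFinal(plain);

        // Download path: recombine the parts, then decrypt with the private key.
        Cipher dec = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        dec.init(Cipher.DECRYPT_MODE, pair.getPrivate());
        byte[] recovered = dec.doFinal(cipherText);

        System.out.println(new String(recovered, StandardCharsets.UTF_8));
    }
}
```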


    Fig.3 RSA encryption/decryption

3.6 Administration of Client Files (Third Party Auditor)

This module provides a facility for auditing all client files, covering the various activities performed by clients. Log records are created and stored on the main server: for every registered client, a log record is created that captures the operations (upload/download) performed by the client, along with the time and date at which each activity was carried out. The log records help with the safety and security of client data and with auditing. A log-record facility is also provided for the administrator, recording the log information of all registered clients, so that the administrator has control over all the data stored on the cloud servers. The administrator can view client-wise log records, which helps detect fraudulent data access if a fake user attempts to access the data stored on the cloud servers.

Registered client log records:

    Fig.4 List of Log records of clients.
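The exact schema of these records is not given in the paper; as a minimal sketch, each audit-trail entry could carry the client, the operation, the file, and a timestamp. The field names below are illustrative only.

```java
import java.time.Instant;

/** One audit-trail entry recorded on the main server for every client operation. */
record LogRecord(String clientId,
                 String operation,   // "UPLOAD" or "DOWNLOAD"
                 String fileName,
                 Instant timestamp) {

    static LogRecord upload(String clientId, String fileName) {
        return new LogRecord(clientId, "UPLOAD", fileName, Instant.now());
    }

    static LogRecord download(String clientId, String fileName) {
        return new LogRecord(clientId, "DOWNLOAD", fileName, Instant.now());
    }
}
```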


IV. RESULTS

The results are based on experiments with a number of clients, one main server, and three to five secondary servers, evaluated on three parameters:

1) Time
2) CPU utilization
3) Storage utilization

Our evaluation examines the improved utilization of the cluster nodes (i.e. the secondary servers) when uploading and downloading files with the Fair4S scheduling algorithm versus the GFS read/write algorithm from three perspectives: improved time utilization, improved CPU utilization, and a substantial improvement in storage utilization.

4.1 Results for Time Utilization

Fig.5 Time Utilization Graph for Uploading Files

Fig. 5 shows the time utilization of the GFS and Fair4S algorithms for uploading files. The measured values are:

Uploading File Size (KB)   Time (ms) for GFS   Time (ms) for Fair4S
1742936                    1720                107
4734113                    928                 170
6938669                    1473                117
11527296                   1857                704
3057917                    253                 38
17385800                   1859                839


Fig.6 Time Utilization Graph for Downloading Files

Fig. 6 shows the time utilization of GFS and Fair4S for downloading files. The measured values are:

Number of Files   Time (ms) for GFS   Time (ms) for Fair4S
5                 840                 795
7                 1937                1852
9                 4814                3698
11                5143                4111

4.2 Results for CPU Utilization

Fig.7 CPU Utilization Graph for GFS Files

Fig. 7 shows the CPU utilization for GFS files across the cluster nodes.


Fig.8 CPU utilization graph for the Fair4S algorithm across the cluster nodes in Hadoop.

V. CONCLUSION

We have proposed an improved cloud architecture that combines on-demand scheduling of infrastructure resources and optimized utilization with opportunistic provisioning of cycles from idle nodes to other processes. A cloud infrastructure using a Hadoop configuration with improved processor utilization and storage-space utilization is proposed using the Fair4S job scheduling algorithm. All nodes that would otherwise remain idle are utilized, security problems are largely mitigated, load balancing is achieved, and large data is processed quickly. We compare the GFS read/write algorithm and the Fair4S MapReduce algorithm for file uploading and downloading, and optimize processor utilization and storage-space use. In this paper we also outline some of the techniques that are implemented to protect data and propose an architecture to protect data in the cloud. This model stores data in the cloud in encrypted form using the RSA technique, which relies on encryption and decryption of the data. Until now, many proposed works have offered a Hadoop configuration for cloud infrastructure, yet the cloud nodes still remain idle, and there has been no comparable work on CPU utilization and storage utilization for the GFS read/write algorithm versus the Fair4S scheduling algorithm.

We provide a solution to the backfill problem using an on-demand user workload on a cloud structure based on Hadoop. We contribute an increase in processor utilization and time utilization between GFS and Fair4S. In our work, all cloud nodes are fully utilized and none remains idle; files are processed at a faster rate, so tasks are completed in less time, which is a significant advantage and improves utilization. We also implement the RSA algorithm to secure the data, thereby improving security.

VI. REFERENCES

1. Zujie Ren and Jian Wan, "Workload Analysis, Implications, and Optimization on a Production Hadoop Cluster: A Case Study on Taobao," IEEE Transactions on Services Computing, vol. 7, no. 2, April-June 2014.

2. M. Zaharia, D. Borthakur, J.S. Sarma, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters," Univ. of California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-55, Apr. 2009.


3. Y. Chen, S. Alspaugh, and R.H. Katz, "Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads," Proc. VLDB Endowment, vol. 5, no. 12, Aug. 2012.

4. Divyakant Agrawal et al., "Big Data and Cloud Computing: Current State and Future Opportunities," EDBT, pp. 22-24, March 2011.

5. Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, "Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao," in Proc. IEEE IISWC, 2012, pp. 3-13.

6. Jeffrey Dean et al., "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, January 2008.

7. Y. Chen, S. Alspaugh, D. Borthakur, and R.H. Katz, "Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis," in Proc. EuroSys, 2012, pp. 43-56.

8. Stack Overflow (2014-07-14), "Hadoop Architecture Internals: use of job and task trackers" [English]. Available: http://stackoverflow.com/questions/11263187/hadoop-architecture-internals-use-of-job-and-task-trackers

9. S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An Analysis of Traces from a Production MapReduce Cluster," in Proc. CCGRID, 2010, pp. 94-103.

10. J. Dean et al., "MapReduce: A Flexible Data Processing Tool," CACM, Jan. 2010.

11. M. Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?" CACM, Jan. 2010.

12. X. Liu, J. Han, Y. Zhong, C. Han, and X. He, "Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS," in Proc. CLUSTER, 2009, pp. 1-8.

13. A. Abouzeid et al., "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," in VLDB, 2009.

14. S. Das et al., "Ricardo: Integrating R and Hadoop," in SIGMOD, 2010.

15. J. Cohen et al., "MAD Skills: New Analysis Practices for Big Data," in VLDB, 2009.

16. Gaizhen Yang et al., "The Application of SaaS-Based Cloud Computing in the University Research and Teaching Platform," ISIE, pp. 210-213, 2011.

17. Paul Marshall et al., "Improving Utilization of Infrastructure Clouds," IEEE/ACM International Symposium, pp. 205-214, 2011.

18. F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E. Long, and T.T. McLarty, "File System Workload Analysis for Large Scale Scientific Computing Applications," in Proc. MSST, 2004, pp. 139-152.

19. M. Zaharia, D. Borthakur, J.S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," in Proc. EuroSys, 2010, pp. 265-278.

20. E. Medernach, "Workload Analysis of a Cluster in a Grid Environment," in Proc. Job Scheduling Strategies for Parallel Processing, 2005, pp. 36-61.

21. K. Christodoulopoulos, V. Gkamas, and E.A. Varvarigos, "Statistical Analysis and Modeling of Jobs in a Grid Environment," J. Grid Computing, vol. 6, no. 1, 2008.

22. Gandhali Upadhye and Trupti Dange, "Nephele: Efficient Data Processing Using Hadoop," International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 7, 2014, pp. 11-16, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

23. Suhas V. Ambade and Priya Deshpande, "Hadoop Block Placement Policy for Different File Formats," International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 12, 2014, pp. 249-256, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

