
J Grid Computing (2013) 11:281–310
DOI 10.1007/s10723-013-9255-6

Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Jawwad Shamsi · Muhammad Ali Khojaye · Mohammad Ali Qasmi

Received: 6 February 2012 / Accepted: 28 March 2013 / Published online: 21 April 2013
© Springer Science+Business Media Dordrecht 2013

Abstract Data-intensive systems encompass terabytes to petabytes of data. Such systems require massive storage and intensive computational power in order to execute complex queries and generate timely results. Further, the rate at which this data is being generated induces extensive challenges of data storage, linking, and processing. A data-intensive cloud provides an abstraction of high availability, usability, and efficiency to users. However, underlying this abstraction, there are stringent requirements and challenges to facilitate scalable and resourceful services through effective physical infrastructure, smart networking solutions, intelligent software tools, and useful software approaches. This paper analyzes the extensive requirements which exist in data-intensive clouds, describes various challenges related to the paradigm, and assesses numerous solutions in meeting these requirements and challenges. It provides a detailed study of the solutions and analyzes their capabilities in meeting emerging needs of widespread applications.

J. Shamsi (B) · M. A. Khojaye · M. A. Qasmi
Systems Research Laboratory, FAST-National University of Computer and Emerging Sciences, Karachi, Pakistan
e-mail: [email protected]

Keywords Data-intensive cloud computing · Scalability · Fault tolerance · Heterogeneity · Large scale data management · Cloud data storage

1 Introduction

Massive popularity and wide-scale deployment of the Internet have enormously increased the rate of data generation and computation [45, 67]. This huge growth has also highlighted immense potential for utilization and analysis of data across a wide set of users and applications. Consequently, unprecedented data-related challenges have emerged.

Consider an example of a simple Internet search engine that ranks documents on the basis of the relative frequency of search terms in its data collection. The search engine could be enhanced if it includes consideration of user clicks while obtaining popular results. Similarly, the geographical location of users could be incorporated to increase relevancy. The two enhancements mentioned here may seem plausible; however, considering the massive dataset of Internet documents and the diverse geo-location of Internet users, they require comprehensive collection, efficient storage and retrieval, extensive linkage, meticulous investigation, and methodical analysis; most importantly, in a precise and timely manner. Further, extensive requirements of meeting availability, scalability, and high performance also exist.

The extensive challenges mentioned above are not restricted to search engines. With the emergence of clouds, the notion of computing has incorporated new requirements of providing efficient user access and storage [80]. Further, availability and scalability are inherent requirements of cloud systems. In addition, as a multi-user system, a cloud needs to fulfill the requirements of privacy and access control.

In the data-intensive world we live in, requirements and challenges also vary with applications. For example, an iterative application such as a page-rank computation algorithm requires iterative computation until a point of convergence is reached. In comparison, a streaming application would prefer processing a stream of events in order to provide timely results.

In this research, we are motivated by the huge growth at which data is being generated, the massive contribution it has made to different applications, and the enormous potential it possesses in improving the performance of computing systems. These considerations necessitate the following questions: 1) what are the challenges and requirements associated with different data-intensive applications? and 2) are the computing platforms capable of providing efficient solutions in this paradigm?

Through this paper, we investigate these questions. We provide an extensive survey for the academic, research, developer, and industrial communities by exploring the requirements of data-intensive clouds, investigating the existing challenges, and studying the available solutions. In analyzing these issues, we consider a wide range of scenarios, including infrastructure-related problems and platform-related matters. Realizing the significance and wide-scale deployment of Hadoop and MapReduce, the paper also describes various extensions of the Hadoop framework which have been proposed to enhance performance.

In a recent study [104], Sakr et al. surveyed large scale data management approaches in clouds. Our work differs substantially from that survey. First and foremost, we adopt a challenge-centric approach: we conducted an extensive search of the literature and identified requirements and challenges related to data-intensive computing. Specific to each challenge, we describe solutions and analyze their strengths and weaknesses. Second, our approach is more extensive in that we consider many issues related to the physical and infrastructural requirements of data-intensive clouds. These include network constraints, resource sharing considerations, billing issues, the effect of hardware advancements, data placement matters, and many other related considerations. Third, we also discuss application-specific challenges, such as capabilities for iterative algorithms and suitability for join operations. Considering the large scale expansion of data-intensive cloud computing, in which many applications and platforms have been utilized, we dedicate a separate section to application-specific enhancements in order to provide an extensive view of application-specific challenges and solutions.

Consequently, our work is significant with multiple benefits. For researchers, it provides a comprehensive analysis of the existing work and identifies challenges; whereas for academicians, it offers a thorough study of the subject. Our work is also useful for the developer community in understanding strengths and weaknesses of different solutions. The industrial community could also find our work useful in understanding the requirements and assessing capabilities of these solutions. The remainder of this paper is organized as follows: Section 2 explains different concepts about data-intensive computing and describes various requirements in the field. Section 3 elaborates on challenges and solutions, while Section 4 mentions application-specific solutions for data-intensive systems. Section 5 concludes the paper with analysis and future directions of research.

2 Data-Intensive Clouds

This section explains background concepts about data-intensive computing. The section begins by explaining data-intensive computing and cloud computing. It builds upon this discussion to define data-intensive cloud computing and continues upon this definition to mention various requirements and issues associated with the domain.

2.1 Background Information

Data-intensive computing refers to computing over large scale data. Gorton et al. describe types of applications and research issues for data-intensive systems [45]. Such systems may either be pure data-intensive systems or data/compute-intensive systems. The former type of system devotes most of its time to data manipulation or data I/O, whereas in the latter type data computation is dominant. Normally, parallelization techniques and high performance computing [99] are adopted to address the challenges related to data/compute-intensive systems.

With the growth of data-intensive computing, traditional differences between data/compute-intensive systems and pure data-intensive systems have started to merge, and both are collectively referred to as data-intensive systems. Major research issues for data-intensive systems include management, handling, fusion, and analysis of data. Often, time-sensitive applications are also deployed on data-intensive systems.

The Pacific Northwest National Laboratory has proposed a comprehensive definition: Data-intensive computing is managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies [67].

A wide set of requirements and issues arise when data-intensive applications are deployed on clouds. The cloud must be scalable and available. It should also facilitate huge data analysis and massive input/output operations. Considering the administrative challenges and the development requirements a cloud should offer, we propose the following definition for data-intensive cloud computing:

Data-intensive cloud computing involves the study of both programming techniques and platforms to solve data-intensive tasks, and the management and administration of hardware and software which can facilitate these solutions.

Depending upon its usage, a data-intensive cloud could either be deployed as a private cloud, supporting users of a specific organization, or as a public cloud, providing shared resources to a number of users. A data-intensive cloud entails many challenges and issues. These include data-centric issues, such as implementing efficient algorithms and techniques to store, manage, retrieve, and analyze the data, and communication-centric issues, such as dissemination of information, placement of replicas, data locality, and retrieval of data. Note that issues in the two categories may be interrelated. For instance, data locality often leads to faster execution of tasks.

Grossman and Gu [47] discussed varieties of cloud infrastructures for data-intensive computing. Figure 1 illustrates the two architectural models for such a system: a cloud could provide EC2-like instances for data-intensive computing, or it could offer computing platforms (such as MapReduce) to its users. In the former case, a user is required to select tools and a platform for computing, and the cloud provider is responsible for storage and computing power. The provider is also responsible for replication, fault tolerance, and consistency. In comparison, for platform-based cloud computing, application-specific solutions [20, 132] exist which provide enhanced performance.

Fig. 1 Architecture model of data-intensive cloud computing


In this paper, we mainly resort to the latter category (data-intensive computing platforms), as they specifically address challenges and solutions to data-intensive computing. However, throughout the paper, we discuss a few infrastructure-related issues, such as effective network utilization and resource sharing, which may well be applied to both types.

2.2 Suitable Systems for Data-Intensive Cloud Computing

In order to comprehend the challenges of data-intensive clouds, it is pertinent to understand the related types of systems which can utilize data-intensive clouds.

In a research study, Abadi [1] discusses the types of data-intensive systems which can be deployed on a cloud. The author compares the requirements for transactional systems and analytical systems. Transactional systems rely on ACID (Atomicity, Consistency, Isolation, and Durability) guarantees provided by databases. The author mentions that such systems are unlikely to be deployed on a cloud because of the difficulties in facilitating locks, commits, and data partitioning in a shared-nothing architecture.

Further, ACID guarantees are difficult to maintain over a cloud system, which is replicated and distributed across multiple geographical locations [19]. Such systems have strong requirements of privacy and trust. For these systems, fault tolerance is generally denoted as the capability of the system to ensure ACID guarantees in case of a fault.

In comparison, analytical systems mostly have a write-once, read-many architecture. For such systems, requirements for distributed locking and commit are relaxed. They are more suitable for a shared-nothing architecture as the query load can be divided across multiple hosts. For such systems, fault tolerance is the ability of the system to provide uninterrupted execution of queries. Such systems are therefore more likely to avail the benefits of cloud systems.

Considering the appropriateness of analytical systems for data-intensive applications, two types of software platforms can be used to build data-intensive clouds: (1) parallel databases with a shared-nothing architecture, and (2) NOSQL systems, which are distributed and non-relational data storage systems.

In databases relying on a shared-nothing architecture (such as Teradata [113] and Gamma [36]), a table is horizontally divided across multiple nodes. The division can be implemented either in a round robin manner or through hashing of indexes [110]. Distributing indexes gives the advantages of distributed query processing and heavy storage. Results of queries from individual nodes are merged and shuffled to produce final results. These databases have fast retrieval capabilities, which are aided through advancements in indexing such as B-trees.

In comparison, NOSQL systems such as MapReduce (Hadoop) [33], MongoDB [12], and Cassandra [72] do not support a descriptive SQL language for query processing. Storage is normally provided through a distributed store, which spans a large number of machines.

The lack of strong consistency in NOSQL (also referred to as MR-like) systems has been debated in the research community. NOSQL systems appear to be inspired by the CAP theorem [19], which states that out of the three characteristics of consistency, availability, and partition tolerance, only two can be achieved at a time by a distributed system. However, in a blog, Abadi [2] highlighted some potential problems in the CAP theorem. Abadi argued that it is not necessary that consistency be compromised only to achieve availability. Instead, consistency may also be compromised for latency. For instance, Yahoo's PNUTS [28] relaxes consistency (by implementing eventual consistency) and availability in order to achieve low latency. Similarly, in case of a network partition, Dynamo DB from Amazon [35] relaxes consistency to achieve availability; whereas, under the normal scenario, it gives up consistency in order to decrease latency.

NOSQL systems such as MapReduce have also been compared with parallel databases. In a blog [37], David DeWitt and Michael Stonebraker mentioned the lack of schema as a major limitation of MR-like systems. This implies that retrieval of documents would be slower due to the lack of indexed data.


Later, in a research study [91], Pavlo et al. compared the two models for data-intensive tasks. The authors argued that the flexible SQL environment and high speed of execution are compelling advantages of parallel databases. In comparison, MapReduce offers ease of installation.

In response to the arguments from these two sources, Dean and Ghemawat, the two proponents of the MapReduce system, highlighted heterogeneity and fault tolerance as the two major strengths of the MapReduce system [34]. The authors also mentioned that MapReduce is powerful enough to compute several complex tasks, such as computing in-links and out-links for page ranking.

Some researchers have proposed the use of MapReduce in conjunction with databases [110]. The authors argued that MR systems are useful for ETL (Extract, Transform, Load) capabilities, whereas the database system could be used for efficient query processing. Other major advantages of MapReduce (Hadoop) over parallel databases are its open source architecture and ease of installation. MapReduce is also cost effective compared to parallel databases.

The ability to read encrypted and compressed data has also been considered a major requirement for both architectures [1, 110]. Although the original version of MapReduce does not provide these features, possibilities exist in which encrypted data can be read [123].

While scalability, cost effectiveness, heterogeneity, and fault tolerance have been characteristics of MapReduce-style frameworks [34, 61, 72], speed of execution and ease of development have been the propelling reasons for shared-nothing databases. Figure 2 illustrates the comparison between the two platforms.

Fig. 2 Types of data-intensive systems

Considering this comparison, many extensions and application-specific solutions have been proposed for MR-like systems. For instance, MR-like systems were initially argued to be restricted to batch processing. However, many solutions have been proposed to introduce real-time processing [14] or stream processing [131] in the cloud. Wide usage and open source architecture have yielded many application-specific solutions. These solutions demonstrate a variety of uses such as indexing [75], joins [74], faster execution [55], transactional systems [4], and streaming [59]. A detailed description of these solutions is provided in Section 4.

We now explain the requirements and expectations for data-intensive clouds.

2.3 Requirements and Expectations of Data-Intensive Clouds

A data-intensive cloud system entails several requirements related to scalability, availability, and elasticity [30]. Further, issues such as infrastructure support, hardware issues, and software platforms are also important. Depending upon the scope of an application and the type of services a cloud provides, these requirements may vary for each application.

Note that a data-intensive cloud is different from a traditional cloud in that the former is capable of processing and managing massive amounts of data. However, in addition to the challenges related to data processing and management, a data-intensive system should also meet the requirements of a traditional cloud system [9], such as scalability, fault tolerance, and availability. We now describe significant requirements for data-intensive clouds. We have framed these requirements with respect to data-intensive computing.

1) Scalability

A data-intensive cloud should be able to support a large number of users without any noticeable performance degradation. Large-scale growth may be achieved through the addition of commodity hardware.

2) Availability and Fault Tolerance

The strict requirement of availability is tied with the ability of the system to tolerate faults. Faults could occur at the infrastructure/physical layer or they could arise at the platform (or application) layer. As mentioned, in analytical systems, fault tolerance denotes the capability of the system to facilitate query execution with little interruption. Comparatively, in transactional systems, ACID guarantees must be ensured [1]. Overall, the system should have the ability to sustain both transient failures (such as network congestion, bandwidth limitation, and CPU availability) and persistent failures (such as network outages, power faults, and disk failures).

3) Flexibility and Efficient User Access Mechanism

A data-intensive cloud should facilitate a flexible development environment in which desired tasks and queries can be easily implemented. A significant requirement is to facilitate an efficient mechanism for data access. For intensive tasks, the framework should also support parallel and high performance access and computing methods.

4) Elasticity

Elasticity refers to the capability of the cloud to utilize system resources as per needs and usage. This implies that more capacity can be added to an existing system [30]. The resources may shrink or grow according to the current state of the cloud.

5) Sharing—Effective Resource Utilization

Many applications share clouds for their computation. This is specifically true for a private cloud. For instance, in [57], the authors mentioned that data is shared between multiple applications at Facebook. Sharing reduces the overhead of data duplication and yields better resource utilization. Efficient and effective mechanisms are needed to facilitate this sharing requirement.

6) Heterogeneous Environment

The cloud system should support heterogeneous infrastructure. A homogeneous configuration is not always possible for data-intensive systems [127]. In such an environment, issues such as differing computation power across cloud machines, varying disk speeds [69], and networking hardware with dissimilar capacity are not infrequent. Consequently, a cloud may have to encounter varying delays.

7) Data Placement and Data Locality

Big data systems have complex requirements of data placement [56]. Issues to be considered include data locality, fast data loading and query processing, efficient storage space utilization, reduced network overhead, the ability to support various work patterns, and low power consumption. Multiple copies of data sets may be maintained to achieve fault tolerance, load balancing, availability, and data locality [97]. Consistency requirements vary with the type of application being hosted on the cloud. It has also been suggested that data-intensive applications with strong consistency requirements are less likely to be deployed on clouds [1].

8) Effective Data Handling

Fault tolerance should be aided by effective data handling. For instance, many tasks in data-intensive computing are multi-stage. Handling of intermediate data is important for such tasks. A failure in intermediate steps of the workflow should not drastically affect system execution.

9) Effective Storage Mechanism

The storage mechanism should facilitate fast and efficient retrieval of documents. Since data is distributed, effective utilization of the disk is important.

10) Support For Large Data Sets

In a cloud environment, data-intensive systems should provide scalable support for huge datasets [66]. A cloud should be able to execute a large number of queries with only a small latency [29]. Considering the varieties of data-intensive computing, characteristics such as support for huge files and for a large number of small files in a directory are also beneficial.

11) Privacy and Access Control

In cloud computing, data is outsourced and stored on cloud servers. With this requirement, issues of data protection and data privacy arise. Although encryption may be used to protect sensitive data, it induces the additional cost of encryption and decryption in the system.

12) Billing

For a public cloud, an efficient billing mechanism is needed as it covers the cost of cloud operations. A user may be charged on the basis of three components: (i) data storage, (ii) data access, and (iii) data computation. The inclusion of these components in the billing may vary depending upon the type of service a provider offers to its customers.
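
As a simple illustration of these three components, the following Python sketch computes an itemized bill. The unit prices, usage figures, and function name are hypothetical and do not correspond to any particular provider's pricing model.

    # Hypothetical usage-based bill built from the three components above.
    # All unit prices and usage figures are illustrative only.
    def compute_bill(storage_gb_month, requests, compute_hours,
                     price_storage=0.10, price_request=0.000001, price_compute=0.08):
        """Return an itemized bill: data storage, data access, and data computation."""
        items = {
            "storage": storage_gb_month * price_storage,    # (i) data storage
            "access": requests * price_request,             # (ii) data access
            "computation": compute_hours * price_compute,   # (iii) data computation
        }
        items["total"] = sum(items.values())
        return items

    if __name__ == "__main__":
        print(compute_bill(storage_gb_month=500, requests=2000000, compute_hours=120))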

13) Power Efficiency

Data-intensive clusters consume enormous amounts of electrical power. Low-power solutions save infrastructure cost and ease cooling requirements. In a power-constrained environment, such solutions could also lead to enhanced capacity and increased computational power.

14) Efficient Network Setup

Cloud providers use over-provisioning for profit maximization. In a multi-user cloud environment, with many users accessing intensive data, network problems such as congestion, bandwidth limitation, and excessive network delays could be induced. Problems such as high packet loss and TCP Incast [118] could also arise. A data-intensive cloud should be able to address these challenges. Effective bandwidth utilization, efficient downloading and uploading, and low latency data access are critical requirements.

15) Efficiency

A data-intensive computing system must be efficient in fulfilling its core tasks. Intensive tasks require multi-stage pipeline execution, intelligent workflows, and effective distribution and retrieval capabilities. These requirements collectively determine the efficiency of the system. With the diversity in data-intensive computing, algorithms and techniques also vary for each application. For instance, some algorithms (such as page-rank or N-body computation) require optimization for iterative computation.

The above set of requirements provides a comprehensive view of the needs and objectives of a data-intensive system. Meeting these requirements is essential for improving the efficacy and applicability of a system.

3 Challenges and Solutions

This section discusses and elaborates on various challenges and solutions related to data-intensive cloud computing. The discussion is motivated by the requirements and expectations mentioned in the previous section. However, in mentioning these challenges and their solutions, we remain focused on the data-intensive paradigm. For instance, we do not elaborate on issues such as backup power to promote availability, as they are considered outside the scope of the paper. Table 1 presents a summary of the challenges and related solutions for data-intensive computing.

3.1 Scalability

Challenges A cloud should be well-equipped to provide scalability. While adding physical resources contributes toward increasing scalability, effective management and utilization of resources and an appropriate mechanism of task mapping are critical to sustaining it.

Solutions Scalability is the core requirement for data-intensive clouds. In order to support this requirement, there are numerous considerations related to file systems, programming platforms, storage and database systems, and data warehouse systems.

We now describe scalable solutions which exist at the file system, platform, and database and storage layers (Fig. 3). Note that data warehousing-based solutions are described in Section 3.4 under flexibility and efficient user access.

1) File System

Scalability at the file system layer determines the capability of the file system to store and process big data.


Table 1 Challenges and solutions of data-intensive cloud computing

1. Scalability: MapReduce [33], Hadoop [51], Data warehousing and analytics infrastructure at Facebook [115], Hive [116], Cassandra [72], BigTable [24], GFS [43], MongoDB [12], Dynamo DB [35], HBase [53], Pig [94]
2. Availability, fault detection, and fault tolerance: Globally Distributed Storage Systems [34], MapReduce [112], HiTune [31], CloudSense [71], Riak [102]
3. Flexibility and efficient user access: Dryad [61], DryadLINQ [126], Pig Latin [88], All-Pairs [82], Sawzall [95]
4. Elasticity: ElasTras [32], Zephyr [40]
5. Sharing of a cluster for multiple platforms: Otus [101], Mesos [57], Google compute clusters [108], ARIA [119], delay scheduling [128], disk head scheduling [120, 121]
6. Heterogeneous system: LATE [127], MR-Predict [116], heterogeneity-aware task distribution [124]
7. Data handling, locality, and placement: Volley [6], RCFile [56], automatic replication of intermediate data [65]
8. Effective storage mechanism: pWalrus [3], Megastore [10], DiskReduce [41]
9. Privacy and access control: Delegation of RESTful Resources in storage cloud [11], Airavat [103]
10. Billing: Exertion-based billing for cloud [122]
11. Power efficiency: FAWN [8], MAR [107], BEEMR [25], CS and AIS [73]
12. Network problems: TCP RTO for Incast [118], WhyHigh [70]
13. Efficiency: Section 4

Distributed file systems such as the Google File System (GFS) [43] and the Hadoop Distributed File System (HDFS) [17] have provided significant solutions for data-intensive systems. Both GFS and HDFS have many similarities. They store data in large blocks of 64 MB, which allows low seek time and increased efficiency.

In HDFS, specific nodes (called data nodes) store all the data. Meta information is stored on the name node, which provides lookup services. Each block is by default replicated three times on data nodes. High replication improves availability and data locality, a concept in which a task is preferably executed close to the location of its data. This reduces network bottlenecks. Since HDFS has been inspired by GFS, the two systems share much of their design; however, GFS uses a different set of naming conventions.
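
As an illustration of the storage implications of these defaults, the following Python sketch estimates the block count and raw storage footprint of a file under a 64 MB block size and three-way replication. The calculation is an illustrative upper bound, since in practice the last block of a file occupies only as much space as it needs.

    import math

    BLOCK_SIZE_MB = 64        # default block size discussed above
    REPLICATION_FACTOR = 3    # default HDFS replication

    def storage_footprint(file_size_mb):
        """Estimate block count and raw storage consumed by one file (upper bound)."""
        blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
        raw_mb = blocks * BLOCK_SIZE_MB * REPLICATION_FACTOR
        return blocks, raw_mb

    if __name__ == "__main__":
        blocks, raw = storage_footprint(1000)   # a 1 GB file
        print(blocks, "blocks, about", raw, "MB of raw storage across data nodes")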

Fig. 3 Scalability at different layers

2) Programming Platforms

MapReduce has been the most popular programming platform for data-intensive computing. It was initially proposed by Google [33]. Later, it was adopted by the open source community as the Hadoop [51] project.

Hadoop provides execution of MapReduce tasks over a cluster of machines based on commodity hardware. The core functionality of the framework is provided through its two phases, Map and Reduce. In both phases, a notion of <key, value> pairs is used for input and output operations.

In the Map phase, an intensive task is divided into a large number of smaller, independent, and identical map tasks. Each map task is executed independently on one of the available nodes in the cluster. While scheduling map tasks on a cluster, the MapReduce framework strives to ensure data locality. The Reduce phase involves aggregation of <key, value> pairs from all the map tasks over the network. In that, all the <key, value> pairs that are emitted in the Map phase are merged and delivered to the node which executes the reducer.
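
The following minimal Python sketch illustrates this <key, value> flow using the classic word-count example. It simulates the shuffle step in memory and is a conceptual illustration only, not the Hadoop API.

    from collections import defaultdict

    def map_phase(doc_id, text):
        """Map task: emit a <word, 1> pair for every word in one input split."""
        for word in text.split():
            yield word.lower(), 1

    def shuffle(pairs):
        """Group emitted pairs by key, as the framework does between the two phases."""
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        """Reduce task: aggregate all values emitted for one key."""
        return key, sum(values)

    if __name__ == "__main__":
        splits = {"d1": "big data in the cloud", "d2": "data intensive cloud computing"}
        emitted = [pair for doc, text in splits.items() for pair in map_phase(doc, text)]
        counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
        print(counts)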

Many organizations such as Facebook [114], Yahoo, and Adobe have implemented Hadoop-based solutions. Google has implemented a proprietary version of MapReduce, where it is used for the generation of data for the production web search service, sorting, data mining, machine learning, and many other systems [33].

Sector/Sphere [48] are solutions for distributed data-intensive computing. In that, Sector is a file-based distributed file system, similar to GFS/HDFS, whereas Sphere is a distributed programming platform that utilizes Sector. Sphere is tightly coupled with Sector; Sphere applications can provide feedback to Sector to improve data locality.

Sphere provides greater flexibility by allowing arbitrary UDFs (User Defined Functions). Sector can also support data processing at various levels of granularity. The authors of Sector reported 2–4 times faster execution as compared to Hadoop. However, the scalability of Sector/Sphere has not been discussed.

3) Distributed Storage and Database Systems

BigTable [20] is a scalable data storage platform from Google which stores data for many Google applications. It spans thousands of commodity servers, with sizes in petabytes. Data is stored in the form of a sparse, multi-dimensional, sorted map, in which the data is indexed by a row key, a column key, and a timestamp. BigTable is built upon GFS, which is used to store data on the servers.

HBase [53] is an open source version of the BigTable distributed storage system. It runs on top of HDFS and provides BigTable-like capabilities for the management of large volumes of structured data. HBase is written in Java to achieve platform independence. Because of its portability and its capability to scale to a very large size, it is being used in many data-center applications, including Facebook and Twitter. Usually, HBase and HDFS are deployed in the same cluster to improve data locality [60].

HBase consists of three major components: HBaseMaster, HRegionServer, and HBaseClient [58]. The master is responsible for assigning regions to region servers and for recovery. In addition, the master is responsible for administrative tasks such as resizing of regions and replication of data among different region servers. The client is responsible for finding the region servers to which it should send read/write requests. The region server is responsible for managing client read and write requests. It communicates with the master to get a list of regions to serve and to tell the master that it is alive. HBase is planned to be supported by a query language, HBQL [54].

Cassandra [72] is a distributed data management system implemented by Facebook. It allows users to manage very large datasets, distributed over a number of commodity machines, with no single point of failure.

Cassandra is structured as a key-value store. It adopts a column-oriented approach in which data is stored as sections of the database such that keys are mapped to various values grouped by column families. Although the number of column families is defined during the creation of the database, columns can be dynamically added to or removed from a family. Cassandra also incorporates a row-oriented approach in that the values from a column family for each key are stored together. The combination of column-oriented and row-oriented approaches leads Cassandra to a hybrid model for data storage and management.
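
The following Python fragment gives a purely illustrative, in-memory view of this hybrid model: a row key maps to column families, and columns within a family can be added dynamically. The names and values are hypothetical; this is not the Cassandra client API.

    # Illustrative view of the data model: row key -> column family -> columns.
    keyspace = {
        "user:1001": {                      # row key
            "profile": {                    # column family (declared at schema creation)
                "name": "alice",
                "city": "karachi",
            },
            "activity": {
                "last_login": "2013-01-15",
            },
        },
    }

    # Columns can be added to an existing family without a schema change.
    keyspace["user:1001"]["profile"]["email"] = "alice@example.org"

    # The values of one column family for a key are stored together (row-oriented aspect).
    print(keyspace["user:1001"]["profile"])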

Cassandra uses a peer-to-peer model, which means that there is no single master; all nodes work as masters. This results in high scalability in both read and write operations.

MongoDB [12] is an open source, document-based database management system designed for storing, retrieving, and managing document-oriented or semi-structured data. It provides support for many features such as dynamic queries, secondary indexes, fast atomic updates, and replication with automatic failover.

Replication is provided via a topology known as a replica set. Replica sets distribute data for redundancy and automate failover in the event of outages. Most replica sets consist of one primary node and one or more secondary nodes. Clients direct all writes to the primary node, while the secondary nodes are read-only and replicate from the primary asynchronously. If the primary node fails, the cluster picks a secondary node and automatically promotes it to primary in order to support automated failover. However, when the former primary comes back online, it works as a secondary.

MongoDB also has the ability to scale horizontally via a range-based partitioning mechanism known as auto-sharding, through which data is automatically distributed and managed across nodes.

Amazon Dynamo DB [35] is another highly available and scalable distributed data store, built for Amazon's AWS cloud platform. Dynamo is a key-value storage system which has been successful in managing server failures, data center failures, and network partitions. It provides the desired level of availability and performance to the user. One of its key features is incremental scalability, which allows service owners to scale up and down depending upon their current request load.

To achieve high availability, Dynamo sacrifices consistency. It uses consistent hashing [63] to distribute load among multiple storage hosts. Object versioning is used in order to facilitate multiple versions of an object: management is initiated with a generic object, and subsequent versions are used to reflect changes. The Dynamo system is completely decentralized, in which adding or removing storage nodes does not require any manual partitioning or redistribution. Dynamo has been able to provide scalable storage services for S3 and other related services of AWS.
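
The following Python sketch shows a minimal consistent-hashing ring in the spirit of the scheme Dynamo employs: adding a node remaps only the keys that fall on that node's arc of the ring, which is what makes incremental scaling cheap. The hash function, node names, and the omission of virtual nodes and replication are simplifying assumptions, not Dynamo's actual implementation.

    import bisect
    import hashlib

    def _point(value):
        """Map a string to a point on the ring (hash choice is illustrative)."""
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        """Minimal consistent-hashing ring without virtual nodes or replication."""

        def __init__(self, nodes=()):
            self._ring = []                          # sorted list of (point, node)
            for node in nodes:
                self.add_node(node)

        def add_node(self, node):
            bisect.insort(self._ring, (_point(node), node))

        def remove_node(self, node):
            self._ring.remove((_point(node), node))

        def node_for(self, key):
            """Walk clockwise from the key's point to the first node on the ring."""
            points = [p for p, _ in self._ring]
            index = bisect.bisect(points, _point(key)) % len(self._ring)
            return self._ring[index][1]

    if __name__ == "__main__":
        ring = ConsistentHashRing(["storage-a", "storage-b", "storage-c"])
        print(ring.node_for("user:1001"))
        ring.add_node("storage-d")                   # most keys keep their old node
        print(ring.node_for("user:1001"))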

Riak [102] is a distributed and scalable NOSQL database. It utilizes MapReduce as a platform. Its main strength is its distributed architecture, which avoids a single point of failure. It stores data in buckets, which are similar to the concept of tables in a relational database. Riak is hosted on a cluster of machines, in which the data is distributed among nodes. Each node hosts a set of virtual nodes (vnodes), where each virtual node is responsible for some of the data storage. In Riak, the storage location of data is computed using a 160-bit binary hash of the bucket and key pair.

Riak incorporates eventual consistency and high fault tolerance. Each key has designated primary and secondary vnodes. Riak also provides replication of data, where the frequency of replication is managed by the user.

3.2 Availability, Fault Detection, and Fault Tolerance

Challenges For big data clouds, faults are the norm, and failures and crashes could occur at any time. Although in a distributed environment MapReduce (and similar systems) provide high fault tolerance and high availability, in which only the tasks which do not respond in a reasonable amount of time are restarted through speculative execution, fault detection and determining the reason for failures remain an issue due to the large size of the cluster. In [112], the authors argue that time-based detection of faults for MapReduce systems is difficult, as the execution time depends upon a number of factors, including the size of the cluster, the size of the input, and the type of the job.

Solutions In [42], the authors presented a fault and availability analysis of the Google cloud storage system. The system consists of three layers: BigTable, GFS, and the Linux file system. Availability and fault tolerance at any of the layers is significant in promoting availability of the cloud. The authors mentioned that a node could become unavailable for a large number of reasons, including overloading of the node or a network failure, such as when responses to heartbeat messages are not received by the monitoring system. However, only 10 % of the failures lasted more than 15 minutes. The authors also argue that transient failures do not have a drastic impact on the availability of the cloud due to high replication. In addition, the authors observed that many of the non-transient failures happen in bursts, which occur due to rack failures. Analysis of past failures has aided the authors in developing analytical models for future availability and in choosing data placement and replication strategies.

Kahuna [112] is a fault detection tool based on detecting performance problems. The idea is that under normal scenarios, MapReduce nodes tend to perform symmetrically, and a node which performs differently is the likely source of performance issues. Similarity is detected by observing different characteristics such as CPU usage, network traffic, and completion times of Map tasks.
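
A toy Python sketch of this peer-comparison idea is shown below: a node whose metric deviates far from the cluster median is flagged as suspect. The metric, threshold, and statistic are illustrative and far simpler than those used by Kahuna.

    import statistics

    def flag_outlier_nodes(metric_by_node, tolerance=0.5):
        """Flag nodes whose metric deviates from the cluster median by more than
        the given fraction; peer symmetry under normal load is the assumption."""
        median = statistics.median(metric_by_node.values())
        return [node for node, value in metric_by_node.items()
                if median and abs(value - median) / median > tolerance]

    if __name__ == "__main__":
        map_task_seconds = {"node1": 42, "node2": 45, "node3": 44, "node4": 120}
        print(flag_outlier_nodes(map_task_seconds))   # ['node4']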

While the underlying theme of Kahuna seems to be justified, its applicability in a heterogeneous environment remains to be seen.

HiTune [31] is a dataflow-based performance analysis tool from Intel which is focused on analyzing cloud run-time behavior. The idea is that run-time analysis can be used to detect failures and improve system performance. It uses trackers, which are installed on each node of the cloud. Each tracker monitors its corresponding node and sends characteristics (such as CPU cycles, disk bandwidth, etc.) to the aggregation engine. The engine links the information with the help of an analysis engine. As such, a flow of the execution plan is constructed which can help diagnose performance issues and provide system improvements. The authors describe three test cases from Intel clusters for performance tuning on Hadoop. Through this, problems related to Hadoop scheduling, application hotspots, and slow disks were detected and rectified.

HiTune has been analyzed extensively. For instance, processor microarchitecture events and power state behaviors of Hadoop jobs have been analyzed using the dataflow model. Moreover, it has also been applied to Hive by extending the original Hadoop dataflow model to include additional phases and stages.

3.3 Flexibility and Efficient User Access

Challenges Although MapReduce has been extensively used for data-intensive applications, it offers a rigid programming environment in which all tasks need to be converted to map and reduce tasks.

Solutions Dryad [61] is a distributed data processing platform from Microsoft which supports large-scale data operations over thousands of nodes. It provides improved flexibility as compared to the MapReduce system. The Dryad framework implements a Directed Acyclic Graph (DAG) for each job, in which the nodes represent programs or tasks and the edges correspond to communication between them. Incorporating considerations of data locality and availability of resources, the graph is automatically mapped onto physical resources by the Dryad execution framework. Dryad is supported by DryadLINQ [126], a procedural language to specify tasks.

Much of the simplicity of Dryad (its scheduler and fault tolerance model) stems from the assumption that vertices are deterministic. In the case of non-deterministic vertices in an application, it must be guaranteed that every terminating execution produces an output that a failure-free execution could have generated. In the general case, where vertices can produce side effects, this might be very difficult to ensure.

Pig [94] and Hive [58, 115] are data warehousing systems built on top of Hadoop. They allow queries and analysis on large data sets stored on a Hadoop-compatible file system. Pig uses a scripting language, Pig Latin [88], which frees a programmer from writing MapReduce tasks; these tasks are instead generated by the system from the script. In comparison, Hive uses a SQL-like declarative language known as HiveQL, which also allows a user to plug in custom MapReduce functions where applicable. HiveQL is compiled into MapReduce jobs which are executed in parallel on Hadoop. Data is stored in tables, where each table consists of a number of rows and a fixed number of columns. A type is associated with each column, which can either be primitive (like integer, float, or string) or complex (like arrays, lists, or structs).

Similar efforts have been made by Moretti et al. [82]. In that work, the authors provide a high-level programming abstraction of the All-Pairs problem. The idea is to free a developer from issues such as resource sharing, management, and parallelism, and to provide an efficient solution for a popular data-intensive problem.

Sawzall [95] is a high performance computing system from Google. It is motivated by providing an easy interface on a distributed cluster environment such as MapReduce. For a very large data set, which is distributed over hundreds or thousands of machines, a relational database approach for querying and analysis is not feasible. Sawzall exploits the inherent parallelism of such systems by using a procedural language. Computation is performed in two phases: in the filtering phase, each host executes the query on the portion of the dataset stored on it; the results from each host are then collected in the aggregator phase, which combines the results and stores them in a file.

Although Sawzall is useful only for tasks which are associative and commutative, it provides a simple and powerful platform by masking the complexities of parallel programming.
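
The following Python sketch mimics this two-phase model: a filtering phase runs on each host over its local portion of the data set, and an aggregation phase combines the partial results. It works only because the aggregation (here, addition) is associative and commutative; the query and record layout are illustrative and unrelated to the Sawzall language itself.

    from collections import Counter

    def filtering_phase(local_records):
        """Runs on each host: emit partial counts of requests per country."""
        partial = Counter()
        for record in local_records:
            partial[record["country"]] += 1
        return partial

    def aggregator_phase(partial_results):
        """Combine partial results from all hosts into the final answer."""
        total = Counter()
        for partial in partial_results:
            total.update(partial)
        return total

    if __name__ == "__main__":
        host_a = [{"country": "PK"}, {"country": "US"}]
        host_b = [{"country": "PK"}, {"country": "DE"}]
        print(aggregator_phase([filtering_phase(host_a), filtering_phase(host_b)]))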

3.4 Elasticity

Challenges Data-intensive clouds should be capable of scaling according to the state of the system. That is, during moments of high demand and flash crowds, the cloud should scale to meet the needs. Similarly, during periods of low usage, the cloud should shrink. These adjustments in the cloud are supported through virtualization, where Virtual Machines (VMs) are migrated from one physical machine to another in order to support resource provisioning and elastic load balancing.

While these solutions are established at the infrastructure layer, at the application layer (or the data storage layer) challenges arise due to the possibility of service interruption caused by live migration. Further, scaling out implies partitioning of a database. In addition, query processing during the process of migration is also likely to be affected. The intensity of the challenge is likely to increase with multi-tenancy, which is a promising feature provided by data-intensive clouds.

Solutions ElasTras [32] is a transactional distributed data store for cloud systems. It provides elasticity through transaction managers which are capable of allocating and de-allocating resources on demand. Zephyr [40] adds to the capabilities of ElasTras by incorporating live migration. It minimizes service interruption by allowing transactions both at the source and the destination simultaneously. The process of migration involves transfer of metadata to the destination. Once the transfer of metadata is completed, new transactions are initiated at the destination, while existing transactions are completed at the source.

The work proposed in Zephyr is important, as live migration is significant for providing elasticity. However, techniques for load balancing and affinity-aware destination selection also need to be incorporated.

Note that infrastructure-layer solutions are not discussed, as they are considered outside the scope of the paper.

3.5 Sharing of Clusters for Multiple Platforms

Sharing of clusters induces multiple challenges of sharing data, hardware resources, and network resources [109]. The severity of the challenges increases if resource constraints and timing requirements need to be satisfied.

1) Understanding Resource Requirements

Challenges For data-intensive systems such as Hadoop and Dryad, understanding resource requirements and usage is important. This is specifically true when batch jobs are continuously being executed for different datasets. Understanding resource usage and attribution for such systems would provide detailed information about the requirements of a cluster and performance monitoring for the applications being executed.

Solution Otus [101] is a resource attribution tool which monitors the behavior of data-intensive applications in a cluster. It observes events from different software stacks, such as the operating system and MapReduce, to infer resource utilization and relate it to the performance of the service components.

2) Data and Resource Sharing

Challenges While multiple platforms (such as Dryad, Hadoop, MPI, etc.) exist for data-intensive computing, each has its own significance, and there is no platform which is efficient and optimal for all data-intensive applications. In [57], the authors reported that scenarios exist for Facebook and Yahoo users in which they may like to build multiple clusters for their usage. A simple approach is to set up a separate cluster for each application and transfer data into each. However, this technique is inefficient as it requires data duplication.


Solution In [57], the authors proposed Mesos, which provides sharing capability for multiple frameworks on the same cluster. Mesos acts as an intermediary between the cluster and the frameworks by offering resources to each framework. The user accepts the resources as per her choice, and the framework is responsible for scheduling tasks on these resources. The architecture of Mesos is simple; it requires a single master for the whole cluster and a slave on each node. Slave nodes communicate with the master and offer resources to multiple frameworks, whereas the master does the coordination and resource scheduling.

A major limitation of Mesos is that it requires porting of the different frameworks. Further, the centralized master could become a single point of failure. Under high scalability requirements, this may lead to poor performance.

3) Meeting Resource and Timing Constraints

Challenges In a shared environment, users may compete for resources to meet their timing deadlines. In such a scenario, resource scheduling plays a critical part in meeting an application's expectations. Consider Hadoop: it has a fair scheduler [129], which ensures that each user gets a minimum number of resources for task execution. However, it does not provide any assurance that a user's requirements for task execution are met in a shared environment. In addition, there is always a conflict between fairness in scheduling and data locality [128]. In large clusters, tasks complete at such a high rate that resources can be reassigned to new jobs on a timescale much smaller than job durations. However, a strict implementation of fair sharing compromises locality, because the job to be scheduled next according to fairness might not have data on the nodes that are currently free.

Similarly, for intensive tasks, resource assignment such that task placement constraints are satisfied is important. In [108], Sharma et al. argue that for long running jobs, reducing the delays in task assignment is significant. The authors performed a study on Google clusters. They identified two major types of constraints, hardware architecture and kernel version, and observed that constraints could increase task assignment delays by two to six times.

Solutions In [119], the authors are motivated to solve this problem by providing a scheduler which can estimate the number of required map and reduce tasks in order to meet soft guarantees for a user. The proposed framework, ARIA (Automatic Resource Inference and Allocation) [119], builds a profile for a new job by analyzing its different phases (such as map, reduce, shuffle, and sort). Based on this profile and the user's service level expectations, task execution parameters are estimated for subsequent jobs. The framework also incorporates a scheduler which determines the order of jobs and the resources required to execute them.

The proposed model has been designed for scenarios without node failures. The model needs to be extended and evaluated for cases that incorporate failures.

In [128], the authors proposed a delay scheduling algorithm which improves locality at a small expense of fairness. The idea is that if a local task cannot be executed, then another task may be executed for a small duration. The algorithm temporarily relaxes fairness to improve locality by asking jobs to wait for a scheduling opportunity on a node with local data. Experiments reveal that a very small amount of waiting is enough to bring locality close to 100 %. Delay scheduling performs well in typical Hadoop workloads because Hadoop tasks are short relative to jobs, and because there are multiple locations where a task can run to access each data block.
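
A minimal Python sketch of the core rule follows: a job at the head of the fairness order that has no local data on the free node is skipped for a bounded number of scheduling opportunities before locality is given up. The data structures and skip limit are illustrative assumptions, not the published algorithm's parameters.

    def pick_task(free_node, jobs, max_skips=3):
        """Delay-scheduling sketch. `jobs` is a fairness-ordered list of dicts with
        'name', 'input_nodes' (nodes holding the job's data), and a 'skips' counter."""
        for job in jobs:
            if free_node in job["input_nodes"]:
                job["skips"] = 0
                return job["name"], "local"
            if job["skips"] >= max_skips:
                job["skips"] = 0
                return job["name"], "non-local"    # fairness wins after waiting long enough
            job["skips"] += 1                      # briefly relax fairness, wait for locality
        return None, "idle"

    if __name__ == "__main__":
        jobs = [{"name": "A", "input_nodes": {"n2"}, "skips": 0},
                {"name": "B", "input_nodes": {"n1"}, "skips": 0}]
        print(pick_task("n1", jobs))               # job A is skipped once; B runs locally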

The scheme performs well in environments where most tasks are short and multiple locations are available in which a task can run to read a given data block. However, its effectiveness needs to be evaluated for different behaviors, such as longer tasks or fewer data blocks.

4) Disk Head Scheduling

Challenges In a shared environment, read requests for multiple workloads may be issued. Under such a scenario, disk scheduling could play a significant role in the access latencies of users. For instance, interdependence between different datasets or interference between different workloads could reduce the speed of disk I/O.

Challenges also arise if the data is striped across multiple servers. If disk heads are not co-scheduled, then a client may have to wait until it is able to read the data from all the servers. The problem becomes severe considering that access patterns are not pre-defined and multiple users access the cluster at the same time.

Solution In [120, 121], the authors proposed a disk-scheduling scheme which co-schedules data access across all servers in the cluster. The scheme also provides performance insulation such that the performance of individual workloads does not degrade when they share a cluster. The scheme minimizes interactions between datasets by time-slicing disk heads and through slack assignment.

3.6 Heterogeneous Environment

1) Timely Completion of Slow Tasks

Challenges In heterogeneous systems, the execution speed of tasks varies because hardware resources such as CPU processing speed, access speed and bandwidth of disks [69], and networking equipment vary throughout the cluster. In such a scenario, Hadoop's strategy of initiating a redundant task in response to slow nodes (called speculative execution) is ineffective [127], as it is based on a heuristic in which slow tasks are detected by comparing a task's completion status with the average task execution in the system.

Solution For such systems, Zaharia et al. [127] proposed a scheduling algorithm, LATE (Longest Approximate Time to End), which identifies slow tasks and prioritizes them according to their expected completion time. Slow tasks are executed on fast available nodes to prevent thrashing and promote timely completion of jobs.
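
The following Python sketch illustrates the heuristic under the simplifying assumption that a task's observed progress rate stays constant: the estimated time to end is derived from progress and elapsed time, and the task expected to finish last is chosen as the candidate for a speculative backup copy. This is an illustration of the idea, not LATE's exact formulation.

    def estimated_time_to_end(progress, elapsed_seconds):
        """Remaining time, assuming the progress rate observed so far stays constant."""
        if progress <= 0:
            return float("inf")
        progress_rate = progress / elapsed_seconds
        return (1.0 - progress) / progress_rate

    def pick_speculative_candidate(running_tasks):
        """Choose the running task expected to finish last; a backup copy of it
        would then be launched on a fast, lightly loaded node."""
        return max(running_tasks,
                   key=lambda t: estimated_time_to_end(t["progress"], t["elapsed"]))

    if __name__ == "__main__":
        tasks = [{"id": "t1", "progress": 0.9, "elapsed": 90},
                 {"id": "t2", "progress": 0.2, "elapsed": 80},   # straggler
                 {"id": "t3", "progress": 0.5, "elapsed": 60}]
        print(pick_speculative_candidate(tasks)["id"])           # 't2'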

The authors show that LATE can improve the response time of MapReduce jobs by a factor of two in large clusters on EC2. The authors also evaluated the algorithm by running different jobs such as Sort, Grep, and a WordCount application. LATE is more effective on Grep and Sort than on WordCount, because the reducer has to perform more work in the Grep and Sort applications. In jobs where reducers do more work and maps are a smaller fraction of the total time, LATE works more efficiently compared to Hadoop's scheduler.

2) Scheduling Workloads

Challenges In a heterogeneous cluster, tasks are executed at differing speeds. I/O-bound jobs spend more time reading and writing data, whereas CPU-bound jobs rely on the CPU for completion. In such a scenario, it is important that workloads are distributed according to the processing capabilities of the nodes.

Solutions MR-Predict [116] is focused on solving the above mentioned challenge. It divides MapReduce workloads into three categories based on their I/O and CPU load. For any new task, the workload type is predicted by the MR-Predict framework and the task is then handled accordingly; separate queues are maintained for each category of workload.
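The sketch below illustrates the general shape of such a classifier; the ratio metric, the thresholds, and the profile fields are assumptions made for illustration and are not the model used in [116].

```python
def classify_workload(profile, io_ratio_hi=5.0, io_ratio_lo=0.5):
    """Coarse three-way classification in the spirit of MR-Predict.
    `profile` holds bytes read/written and CPU seconds observed for a job;
    the ratio thresholds are illustrative values only."""
    mb_moved = (profile["bytes_read"] + profile["bytes_written"]) / 1e6
    ratio = mb_moved / max(profile["cpu_seconds"], 1e-9)
    if ratio >= io_ratio_hi:
        return "io-bound"
    if ratio <= io_ratio_lo:
        return "cpu-bound"
    return "mixed"

# Separate queues per category, so each can be scheduled on suitable nodes.
queues = {"io-bound": [], "cpu-bound": [], "mixed": []}
job = {"bytes_read": 8e9, "bytes_written": 1e9, "cpu_seconds": 400}
queues[classify_workload(job)].append(job)
```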

3) Heterogeneity-Aware Task Distribution

Challenges In a heterogeneous environment, where processing speeds are likely to vary among nodes, high-processing nodes are expected to complete more tasks. In such a scenario, the concept of equally distributing data in order to provide data locality is likely to create a network bottleneck. For heterogeneous systems, effective data placement strategies are required in order to ensure efficient task scheduling.

Solution Xie [124] proposed that in order to distribute tasks according to the capabilities of nodes in a heterogeneous cluster, data should be available to high-processing nodes so that tasks can be readily assigned to them and the time to transfer data can be reduced. As a solution, data locality is tied to the processing capability of each node, such that more data is stored at nodes with high processing speed. Thus, when high-processing nodes complete their tasks, additional tasks can be assigned to them without incurring further delay.
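The following sketch shows one simple way to realize speed-proportional placement; the speed weights and rounding policy are illustrative assumptions rather than the exact algorithm of [124].

```python
def place_blocks(block_ids, nodes):
    """Distribute blocks in proportion to each node's measured processing speed,
    so faster nodes hold (and can locally process) more data."""
    total_speed = sum(speed for _, speed in nodes)
    placement = {name: [] for name, _ in nodes}
    cursor = 0
    for i, (name, speed) in enumerate(nodes):
        if i == len(nodes) - 1:
            share = len(block_ids) - cursor            # last node takes the remainder
        else:
            share = round(len(block_ids) * speed / total_speed)
        placement[name] = block_ids[cursor:cursor + share]
        cursor += share
    return placement

print(place_blocks(list(range(10)), [("fast", 3.0), ("medium", 2.0), ("slow", 1.0)]))
# {'fast': [0, 1, 2, 3, 4], 'medium': [5, 6, 7], 'slow': [8, 9]}
```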

A potential problem with this technique is the assumption that high-processing nodes have sufficient storage available to store more data. In addition, processing speeds in the cluster may vary due to multi-tenancy.

3.7 Data Handling, Locality, and Placement

Data placement refers to the decision of determining the location of data for storage. It is often tied with data locality, the concept of locating data in close proximity to task execution. This reduces network latency and decreases the amount of network traffic. The strategy has been well adopted in MapReduce in order to reduce network bottlenecks.

A naive approach to locate data in close proximity to the user is to replicate copies near the user's location. However, several considerations exist which require an improved strategy [6]. These include:

1) User's Mobility For users with very high mobility, it is difficult to determine a default location.

2) Data Sharing and Dependence In a cloud environment, data is often dependent or shared. Migrating one dataset could provoke undesired operations on other related datasets. Migration may also bring consistency issues in data centers.

3) Bandwidth Limitations Migrating a dataset could be costly if bandwidth is limited. Further, migrating a dataset to the nearest location may also affect the available bandwidth for other users.

4) Resource Constraints and Over-Provisioning In a cloud environment, data centers are over-provisioned for profitability. Migration of data may not only disturb the cost model of the cloud but may also affect the availability of resources in data centers.

1) Data Placement

Challenges For data-intensive applications, the placement of data-analysis servers is also critical [22]. If data analysis is performed on servers dedicated to the application, it may affect application performance. Conversely, if separate infrastructure is deployed, it induces the additional cost of physical hardware and brings up the challenges of replication and consistency. Additionally, network latency issues should also be considered.

Solution Volley is an automatic data placement tool from Microsoft [6]. Cloud administrators specify the locations of data centers and the cost and billing model as input to the system. Administrators also specify the replication model, which describes the number of replicas at a specific location. As a third criterion, the administrators specify their preference between migration cost and better performance, where performance is measured through user-perceived latencies and inter-data center latency. Volley takes user access logs as input. Access logs contain user IP addresses, descriptions of the data items accessed, and the structure of requests. Considering all the above mentioned points, Volley makes decisions about the migration of datasets.

The Volley migration model covers many aspects. However, it would be important to observe its performance in a dynamic system where the relationships between datasets and the dependencies among them are not static.

2) Fast Data Loading and Query Processing

Challenges Data placement decisions are also motivated by the need for fast data loading and fast query processing. For big data systems, row-based structures are not efficient, as undesired columns have to be read. The issue could become severe if a row spans multiple systems. Similarly, column-based structures can cause high network traffic. Therefore, storing data effectively for big data systems remains a challenge.

Solution RCFile [56] is a data placement structure built on top of Hadoop. The authors mention four important requirements for a data placement structure: 1) fast data loading, 2) fast query processing, 3) highly efficient storage utilization, and 4) strong adaptivity to dynamic workload patterns.

RCFile is motivated to solve these problems. In an RCFile system, a table is partitioned into row groups, and within each row group each column is stored separately. Within each row group, data is stored in compressed form to reduce the cost of network transfer. A flexible row group size is allowed in order to meet the challenge of efficient storage utilization. The scheme has been incorporated by the two data warehouse solutions Pig and Hive.
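The toy layout below captures the basic idea of row groups with per-column storage and compression; zlib and the plain string serialization stand in for RCFile's actual codecs and binary format, so this is a sketch of the structure rather than the real implementation.

```python
import zlib

def write_row_groups(rows, columns, rows_per_group):
    """Toy RCFile-style layout: rows are split into groups, and within each
    group every column is serialized contiguously and compressed on its own."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        groups.append({
            col: zlib.compress("\n".join(str(r[col]) for r in group).encode())
            for col in columns
        })
    return groups

def read_column(groups, col):
    """A query touching one column only decompresses that column's bytes."""
    out = []
    for g in groups:
        out.extend(zlib.decompress(g[col]).decode().split("\n"))
    return out

table = [{"id": i, "url": f"http://example.com/{i}", "clicks": i * 3} for i in range(1000)]
groups = write_row_groups(table, ["id", "url", "clicks"], rows_per_group=250)
print(len(groups), read_column(groups, "clicks")[:3])   # 4 ['0', '3', '6']
```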

3) Replicating Intermediate Data

Challenges For many data-intensive applications, the flow of intermediate data is significant [65]. Many frameworks, including MapReduce, Dryad, and Pig, generate intermediate data. The data is generated in one stage of the task and is used by another stage. The data is normally temporary and serves as a bottleneck which could affect performance, either due to bandwidth constraints or disk failures. Further, loss of this data implies that the task generating the intermediate data must be re-executed and all other tasks relying on such data are halted.

Solution In [65], the authors suggested that automatic replication of the intermediate data could reduce the effect of these anticipated failures. Furthermore, replication can minimize cascaded re-execution. For efficiency, the authors suggested the use of background jobs for data replication. However, the replication technique mentioned here is likely to increase the cost of operation.

3.8 Effective Storage Mechanism

1) Parallel Access Mechanism

Challenges A major issue in data-intensive computing is to provide a storage mechanism which can facilitate high performance computing and parallel access. Parallel file systems such as PVFS [50] and GPFS [106] meet these requirements by providing a high degree of concurrency and ease of access in writes and reads. However, the user interface mechanism of these systems is restricted and requires additional administrative work for ease of access.

Solution pWalrus [3] is a system which is motivated to fill this gap. The authors argue that for cloud-based data-intensive computing, a storage system is needed which can both facilitate the access mechanism of the cloud and also offer services for data-intensive computing by allowing random-access reads and writes. A pWalrus system consists of a number of Amazon S3 servers, all of which share the same parallel file system. All the servers have the same view of the S3 storage, allowing users to connect to any of them. In addition to accessing the data through S3 objects, users can also access the data by directly accessing the parallel file system. A mapping of S3 objects to files is also available to users. To facilitate access between files and objects, the pWalrus system stores additional configuration files.

A major limitation of pWalrus is that the simultaneous use of public and private storage services is not possible. Additionally, issues such as facilitating atomic writes to S3 objects and files at the same time are still under consideration.

2) Low Latency for Storage Systems

Challenges While data-intensive applications need to be scalable, interactive systems must also offer low latency while meeting the requirements of scalability. At the same time, an interactive system should also be consistent and highly available.

Solution Megastore [10] is motivated to meet the above mentioned requirements for interactive systems. It combines the ACID properties of relational databases and the scalability semantics of NoSQL databases. In Megastore, data is partitioned such that ACID semantics are ensured within a partition, while consistency remains limited across partitions. Megastore has been deployed for a variety of Google applications.

3) Saving Disk Space

Challenges Since data-intensive systems involve massive data, effective utilization of disk space is significant in improving resource utilization. However, a high replication factor leads to high disk usage. For instance, in HDFS, each block is replicated three times for fault tolerance, which leads to 200 % extra utilization of disk space.

Solutions DiskReduce [41] is focused on reducing this overhead. The authors proposed and implemented a RAID-based replication mechanism which reduces the extra disk usage to between 10 and 25 %. In DiskReduce, a background process replaces copies of blocks with a lower-overhead RAID encoding. For each encoded block, the corresponding copy of the HDFS block is removed. The current implementation supports RAID5 and RAID6 encodings.
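The space saving rests on standard parity mathematics; the snippet below shows the RAID-5 building block (byte-wise XOR parity and single-block recovery) only, not DiskReduce's actual block-group management.

```python
def xor_parity(blocks):
    """RAID-5 building block: parity is the byte-wise XOR of equal-sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover_lost_block(surviving_blocks, parity):
    """Any single lost block equals the XOR of the parity and the survivors."""
    return xor_parity(list(surviving_blocks) + [parity])

b0, b1, b2 = b"hadoopbl", b"ockdata1", b"ockdata2"
p = xor_parity([b0, b1, b2])
assert recover_lost_block([b0, b2], p) == b1   # b1 reconstructed after a loss
```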

While the effort is substantial in reducing the overhead, for time-sensitive applications the process of encoding and decoding could add extra time and lower performance.

In another work, the authors proposed data compression for increased I/O performance of Hadoop [21]. Performing compression in a MapReduce job improves both time and energy efficiency [23]. Compression also effectively improves the efficiency of network bandwidth and disk space. The authors analyze how compression can improve performance and energy efficiency for MapReduce workloads. For read-heavy text data, compression provides 35–60 % energy savings; for highly compressible data, the savings are even higher.

3.9 Privacy and Access Control

1) Access Control

Challenges For a cloud storage system providing access to multiple users, delegation of rights between the users remains an issue. This is because it is highly likely that some resources need to be accessed only by a limited number of users.

Large storage systems require interactions between multiple users. This requirement introduces additional challenges in the procurement and management of access controls among users. In such systems, chained servicing of ACLs becomes ineffective due to the involvement of multiple users [26]. In addition, large storage systems may require dynamic creation of objects and interaction between multiple users and objects [52].

With data outsourcing and replication, additional requirements of maintaining data sovereignty exist [93]. Sovereignty implies that data and its replicated copies are stored at locations which do not violate a specific policy; that is, data is stored only at places where it is allowed to be stored. The problem is challenging, as cloud providers may intentionally or unintentionally replicate copies at locations which are financially or administratively feasible for them.

Solution In [52], the authors proposed a model which provides dynamic delegation of rights with capabilities for accounting and access confinement. However, the model still needs to be evaluated for functionality and scalability.

Balraj et al. [11] proposed a REST-based approach in which the query string is passed through a URI. Chained delegation can be supported by using user agents for the delegation of rights.

2) Privacy

Challenges For a cloud application which analyzes data, protecting the privacy of the data (and of the provider of the data) is important. The authors of Airavat [103] focus on this goal. They argue that anonymization is not sufficient, as anonymized data has been used to access confidential information in the past [85]. Mandatory access control (MAC) has been effective, but it cannot prevent privacy leaks from the processed data, which could occur through malicious software.

Solution Considering these issues, the authors proposed Airavat, a MapReduce-based solution for protecting the privacy of users. It employs differential privacy [38], which adds minor noise to the output data without much effect on output quality. Differential privacy provides privacy protection to the data provider. In addition, Airavat utilizes mandatory access controls on top of MapReduce to prevent information leakage through system resources.
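The noise-addition step can be illustrated with the classic Laplace mechanism, which differential privacy builds on; the epsilon and sensitivity values below are arbitrary examples, and Airavat's enforcement machinery is of course much broader than this single function.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) via the inverse CDF."""
    u = random.random()
    while u == 0.0:                       # avoid log(0)
        u = random.random()
    return scale * math.log(2 * u) if u < 0.5 else -scale * math.log(2 * (1 - u))

def private_count(true_count, sensitivity=1.0, epsilon=0.5):
    """Release a count with epsilon-differential privacy by adding Laplace
    noise of scale sensitivity/epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon)

print(private_count(10_000))   # e.g. 10001.7 -- close to the true value
```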

3.10 Billing

Challenges Incorporating an accurate billing mechanism is significant and challenging for data-intensive cloud systems. With massive requirements to access and compute huge amounts of data, appropriate methods are needed to compute billing.


Solution For a cloud, it is pertinent to have an efficient billing system. A fair billing system for data-intensive computing entails three components [12]. These include:

• Cost of Data Storage
• Cost of Data Access
• Cost of Computation

Of these three components, the cost related to computation is normally billed in CPU hours, while the costs for storage and access are charged in terms of bytes. In [122], the authors argue that charging data access in terms of bytes is inefficient. Storage access depends upon a number of factors, including data locality, workload characteristics, inter-workload relationships, transfer size, and bandwidth limitations. Billing by the number of bytes is unfair, as it does not account for these factors.

The authors suggested that for storage access, users should be billed according to an exertion-based system, such as charging a user according to disk time. Further, inter-workload dependencies should also be minimized.
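A small sketch makes the contrast with byte-based billing concrete; the per-request timing fields and the price are illustrative assumptions, not the accounting model of [122].

```python
def exertion_bill(requests, price_per_disk_second):
    """Charge for the disk time a tenant's requests actually consumed
    (seek + rotation + transfer) rather than for the bytes they moved."""
    disk_seconds = sum(r["seek_s"] + r["rotational_s"] + r["transfer_s"]
                       for r in requests)
    return disk_seconds * price_per_disk_second

# Two workloads moving comparable bytes can incur very different bills:
sequential = [{"seek_s": 0.005, "rotational_s": 0.003, "transfer_s": 0.50}]
random_io = [{"seek_s": 0.005, "rotational_s": 0.003, "transfer_s": 0.001}] * 500
print(exertion_bill(sequential, 0.10), exertion_bill(random_io, 0.10))
```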

It is important to note that in a multi-user system, access to storage may be delayed due to scheduling. In such a case, the above mentioned billing scheme would be unfair, as the time for disk access may not be deterministic.

3.11 Power Efficiency

Challenges Improving energy efficiency is a major concern for cloud providers [14]. In data-intensive clouds, the complexity of this challenge increases due to several special considerations. For instance, putting unused resources into idle mode is a popular method in traditional clouds; however, data-intensive clouds have latency-sensitive requirements of online analysis [79] and random access [8], for which idle mode is not useful. Further, data-intensive systems have strong scalability requirements, and for scalable systems power requirements are likely to increase as resources are added [13]. Multi-core systems are also being utilized for clouds, and reducing power usage for such systems is also desirable [107].

Solutions

1) Random Access

FAWN [8] is a flash-memory based system designed to provide low power for data-intensive applications requiring random access. The focus is on large key-value systems, where data is stored in small objects such as images or tweets. Such systems are I/O-intensive, requiring a large number of random data access requests. For such systems, disk-based storage provides poor seek performance and requires high power; alternatively, DRAM-based storage systems are expensive and also require high power. FAWN is motivated to provide low power for these applications. A FAWN cluster consists of embedded CPUs coupled with flash-based storage. The use of embedded CPUs reduces the power requirement, whereas the flash-based storage is suitable for random access.

Similarly, Meisner et al. [70] analyzed the latency and power relationship for OLDI (OnLine Data Intensive) workloads on Google servers. Examples of such systems include search products, online advertising, and machine translation. For such systems, idle mode is not suitable, as it leads to very high latency. Instead, acceptable query latency can be obtained by using a coordinated, full-system active low-power mode. Coordination balances the workload among servers while maintaining power efficiency.

2) Multi-Core Technology

Advancements in multi-core technology have led to its utilization in cloud systems. Shang and Wang [107] have proposed power saving strategies for multi-core data-intensive systems. Such systems have variable CPU and I/O workloads. For non-uniform loads, the existing method of switching based on the busy/idle ratio is not very useful. These systems may also have I/O wait operations which may affect job completion time. The authors suggested that during I/O wait phases, CPU frequency can be scaled down without any effect on job completion time.

MAR (modeless, adaptive, and rule-based) is a power management scheme based on scaling down frequencies during I/O wait operations. Using feedback about I/O wait, the CPU frequency can be controlled. While MAR has shown improvement in power savings for I/O wait operations, its effect on data-intensive systems which do not have long I/O wait phases is likely to be reduced.

3) MapReduce-Based Systems

Energy conservation has also been a major focus for MapReduce-based systems. In [25], Chen et al. discuss the possibilities of reducing power consumption for MapReduce with Interactive Analysis (MIA) systems. For interactive systems, the conventional strategy of increasing hardware utilization is not sufficient. For such systems, the authors proposed an energy-efficient MapReduce called BEEMR (Berkeley Energy Efficient MapReduce). Like the conventional MapReduce framework, the BEEMR framework is capable of holding large volumes of data. Interactive jobs are executed on a small pool of dedicated machines which operate at full capacity with their associated storage, whereas less time-sensitive jobs run on the rest of the machines. The BEEMR framework is aided by a workload manager which provides energy-efficient workload management.

Similarly, Lang and Patel [73] evaluated power saving strategies for MapReduce-based cloud systems. The focus is on two categories of techniques for power conservation. In the first approach, CS (Covering Set), a small number of nodes (known as CS nodes) are selected with a high replication factor, such that at least one copy of each unique block is stored on the CS nodes. During periods of low utilization, some or all of the non-CS nodes are powered off in order to conserve power.

The second approach, called the All-In Strategy (AIS), differs from the CS technique in that all the nodes are operated at full speed in order to complete the task. The nodes are switched to idle (low power) only during periods of no utilization. Evaluations reveal that the effectiveness of the two techniques depends on workload complexity and on the time of transition to and from low-power modes. The CS approach is better only for linear workloads and large transition times, whereas the AIS approach is useful in all cases.

3.12 Network Problems

1) TCP Incast

Challenges The use of commodity hardware has been flourishing in data centers. In such an arrangement, low-cost switches with 48 ports and 1 Gbps bandwidth are used at the top of the rack [7]. With low-cost switches and a top-of-the-rack setup, TCP Incast [118] may occur: when multiple senders communicate with a single receiver over a short period of time and packets from these flows converge on a switch, the buffer of the switch may become full and packet loss may result. Such instances are possible in data-intensive cloud computing, as an application may issue time-sensitive queries to servers that are all connected to one switch. TCP Incast can result in low throughput, excessive delay, and poor user experience.

In addition, many data-intensive applications exhibit the characteristics of barrier-synchronized workloads. That is, a client (or an application) queries a number of servers and waits to receive responses; the client cannot proceed until it receives responses from all the servers. Barrier-synchronized scenarios can encounter the TCP Incast problem, due to which long delays might occur. For many data-intensive applications, such as search engines and recommendation systems, this setup could induce long delays.

Diversity in data-intensive cloud computing implies that traffic requirements are multi-modal. Studies suggest that data center traffic has requirements of low latency, high burst tolerance, and high utilization [7]. Data centers must be equipped to handle all such scenarios.

Solution To address the TCP Incast problem, Vasudevan et al. [118] proposed that the TCP Retransmission Time Out (RTO) be reduced. Through real experiments, the authors observed that microsecond timeouts allowed servers to scale up to 47 in a barrier-synchronized communication environment.

2) Incorrect Network Configuration

Challenges In many data-intensive cloud systems, Content Distribution Networks (CDNs) are used to reduce client latencies by redirecting clients to the nearest server. In [70], the authors examined the CDN network for Google and observed that redirection does not always provide optimal latency. The authors utilized ping and traceroute utilities for anomaly detection and observed that incorrect routing configuration, lack of peering, and traffic engineering are the main causes of latency inflation.

Solution The authors conclude that improving CDN performance does not always require adding new nodes; it is equally important to effectively use and configure the existing nodes. Their solution, WhyHigh [70], has been in use at Google to improve the performance of the Google CDN.

The above mentioned contributions emphasize the widespread applicability of data-intensive systems. Further, they assert that, considering this wide-scale applicability, application-specific enhancements are pertinent. In the next section, we describe application-level enhancements for data-intensive systems.

4 Application-Specific Solutions for Data-Intensive Systems

In the previous section, we discussed general challenges and solutions related to data-intensive cloud systems, which are applicable to a wide variety of applications. In data-intensive computing, challenges and solutions also vary with respect to applications, and there are scenarios where application-specific solutions are developed in order to achieve higher efficiency. For instance, facilitating shared memory could be useful for computing page rank. Similarly, efficient utilization of disks could be useful for high-speed sorting. The purpose of this section is to elaborate on solutions which have been proposed to enhance the efficiency of data-intensive systems with respect to specific applications.

Since the cloud has been expanded to incorporate hardware enhancements such as GPUs and single-chip many-core processors [105], this section also elaborates on hardware enhancements which can be exploited to achieve enhanced performance. Table 2 presents a summarized view of these enhancements.

1) Processing of Incremental Updates

Consider the example of computing web indexes. In such a case, the dataset is continuously changing and receiving incremental updates. Under such a scenario, the data-intensive task of computing indexes should only be executed on the modified portion of the dataset in order to compute the updated index.

In [92], the authors are motivated by this need and proposed the Percolator system.

Table 2 Application-specific solutions for data-intensive systems

S. no.  Issue                                                      Solutions
1       Processing of incremental updates                          Percolator [92], Incoop [15], CBP [76]
2       Stream processing and real-time computation                S4 [86], Storm [111], Hadoop Online [27], D-Stream [131], meeting user deadlines [64], Facebook messaging [18]
3       Iterative algorithms                                       Twister [39], HaLoop [20], Spark [130], iMapReduce [133]
4       Join operations                                            Multi-way joins [62, 74]
5       Dynamic tasks                                              CIEL [84]
6       Shared memory for page ranking                             Piccolo [96]
7       Data sampling                                              Lazy MapReduce [83]
8       Searching over encrypted data                              Rank-based keyword search [123]
9       Sorting                                                    TritonSort [100], GPUTeraSort [46]
10      Support for large number of files                          GIGA+ [90]
11      Incorporating hardware enhancements                        SCC [105], Mars [55], DisMaRC [81], MR-J [68], GPU and FPGA [44], Phoenix plus [125]
12      Enhanced scalability for Hadoop                            Hadoop NextGen [87]
13      Hybrid approach for transactional and analytical systems   HadoopDB [4]
14      MapReduce on different platforms                           MapReduce on mobile [78], MapReduce on Azure [49]


Percolator provides processing of incremental updates for Google's web search engine. The system relies on Google BigTable [24] for determining changed web pages and computes the index only for the updated pages. It utilizes a chunk server for storing metadata information and a Percolator worker which scans the system for corresponding changes. The update mechanism is triggered through user-defined upcalls. The authors report that by using Percolator, the time to compute the complete index is drastically reduced.

CBP (Continuous Bulk Processing) [76] is a similar framework from Yahoo. It is motivated by retaining state information, i.e., utilizing prior computations to obtain updated results. The state is integrated into the parallel execution framework in order to reduce the overhead for the developer. CBP is a generalized architecture based on a custom execution plan. The framework has also been applied to develop an efficient log-processing application on MapReduce [77].

Incoop [15] is motivated to solve similar problems. However, instead of requiring the developer to implement procedures and algorithms for updates, it relies on system-level mechanisms to process updates. Advantages of such an approach are efficient processing, use of conventional MapReduce-based development, and ease of development. The Incoop system utilizes Incremental HDFS to detect similarities between consecutive jobs. It also uses a contraction phase, in which large tasks are divided into smaller subtasks in order to promote task reuse. Further, the system incorporates a memoization-based scheduler to reduce data movement [16]. The system has been built on top of MapReduce to provide further flexibility.
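The core intuition, recomputing only over changed inputs and memoizing the rest, can be sketched as follows. The content fingerprinting and the word-count map function are illustrative stand-ins; Incoop's actual mechanisms (Incremental HDFS, contraction phase) operate inside the framework rather than at this application level.

```python
import hashlib

def incremental_map(splits, memo, map_fn):
    """Re-run map_fn only on input splits whose content changed since the
    last run, reusing memoized results for unchanged splits."""
    results = {}
    for split_name, data in splits.items():
        fingerprint = hashlib.sha1(data).hexdigest()
        if fingerprint not in memo:                 # changed or new split
            memo[fingerprint] = map_fn(data)
        results[split_name] = memo[fingerprint]
    return results

memo = {}
count_words = lambda data: len(data.split())
run1 = incremental_map({"a": b"old page text", "b": b"another page"}, memo, count_words)
run2 = incremental_map({"a": b"old page text", "b": b"another page, updated"}, memo, count_words)
# Only split "b" is reprocessed in the second run.
```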

2) Stream Processing and Real-Time Computation

While Hadoop has been widely accepted for batch processing, its use has remained limited for stream processing. Batch processing jobs usually work on static data. In contrast, streaming jobs, such as complex event processing, real-time searching, and advertisement personalization, require a stream of events that flows into the system at a given data rate.

In the Hadoop Online Prototype (HOP) [27], the authors are motivated to address the limitations of batch processing. The output of the map tasks is sent directly to the reducers through sockets. This allows faster execution, as processing can start while the data is still being received.

Storm [111] is a platform-independent, distributed, real-time computation system. It provides a set of primitives for stream processing and continuous computation. A Storm cluster executes a set of topologies (analogous to MapReduce jobs in a Hadoop cluster). A master node provides job tracking for worker nodes. Streams are processed through topologies, which are created by the user and contain sequences of logic and job flows. A topology is continuously executed in order to facilitate stream processing. Storm provides strong guarantees against data loss.

Similarly, S4 (Simple Scalable Streaming System) is a distributed stream processing engine inspired by the MapReduce model [86]. It provides a platform which is scalable, partially fault tolerant, and pluggable, and with which developers can easily build applications that require continuous processing of unbounded streams of data. As a practical demonstration, the authors also presented the design and implementation of automatic tuning of one or more parameters of a search advertisement system using live traffic.

The S4 and Storm frameworks process streams one record at a time, which raises concerns for fault tolerance. The D-Streams [131] framework is motivated to provide high fault tolerance and strong consistency. It treats stream computation as a series of deterministic batch jobs on small intervals (e.g., 1 s). To support consistency and streaming, each record is processed atomically within the interval in which it arrives. The D-Stream framework utilizes memory to keep a record of intermediate states and reduce delay.
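The micro-batch idea itself is simple to sketch, as shown below; this is a single-machine illustration of treating a stream as short deterministic batches, and it omits the distributed execution, lineage tracking, and recovery that D-Streams actually provide.

```python
import time
from collections import deque

def micro_batch_loop(source, process, interval_s=1.0):
    """Treat the stream as a series of small, deterministic batch jobs:
    records are buffered for `interval_s` seconds and then handed to an
    ordinary batch computation."""
    buffer, deadline = deque(), time.time() + interval_s
    for record in source:                      # `source` is any iterable of records
        buffer.append(record)
        if time.time() >= deadline:
            process(list(buffer))              # deterministic batch over one interval
            buffer.clear()
            deadline = time.time() + interval_s
    if buffer:
        process(list(buffer))                  # flush the final partial interval

# micro_batch_loop(some_event_source, lambda batch: print(len(batch), "records"))
```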

Efforts have also been made to provide soft real-time assurances for data-intensive tasks. In [64], Kc and Anyanwu describe such an effort for the Hadoop platform. The proposed framework is based on estimating job completion time and scheduling according to user preferences. The duration of job completion depends upon various factors, including execution time and input processing time for the map and reduce phases. The core functionality is achieved through a scheduler which takes the deadline from the user, considers the parameters for job completion time, and prioritizes jobs accordingly. The scheduler may refuse to accept a job if it estimates that meeting the user deadline is not possible.
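An admission decision of this kind can be sketched as follows; the one-wave completion-time model and the field names are deliberate simplifications for illustration, not the estimation model of [64].

```python
def admit_job(job, cluster, now):
    """Admission-control sketch: accept a job only if its estimated finish time
    meets the user's deadline."""
    est_map_s = job["n_map"] * job["avg_map_s"] / cluster["free_map_slots"]
    est_reduce_s = job["n_reduce"] * job["avg_reduce_s"] / cluster["free_reduce_slots"]
    return now + est_map_s + est_reduce_s <= job["deadline"]

cluster = {"free_map_slots": 40, "free_reduce_slots": 10}
job = {"n_map": 400, "avg_map_s": 20, "n_reduce": 20, "avg_reduce_s": 60,
       "deadline": 1_000}
print(admit_job(job, cluster, now=0))   # 200 + 120 = 320 s <= 1000 s -> True
```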

In addition, many enhancements have been made to the Hadoop architecture in order to reduce latency and facilitate latency-sensitive applications [18]. For instance, to provide latency-sensitive messaging at Facebook, HDFS was modified to incorporate two Avatar nodes: a master node and a slave node. The master Avatar node logs its transactions to a network file system, which can be read by the secondary Avatar node. The secondary node keeps itself updated by constantly reading the logs, and the data nodes communicate with both Avatar nodes. In case of a failure, the secondary node can be utilized for recovery and checkpointing. Similarly, at Facebook [18], the RPC timeout mechanism was modified to detect network outages, with the motivation of reducing the delay in failure detection.

3) Iterative Algorithms

Certain applications that require iterative computation, such as page rank, are not well suited for Hadoop. Iterative algorithms require that data be computed iteratively until a convergence point is reached. Under the default Hadoop system, this raises the following concerns:

• Although the execution flow of each iteration is the same, the MapReduce system initiates a new job for each iteration, which induces high overhead.

• There is no distinction between static data (data that remains the same) and dynamic data. During each iteration, data is loaded into the system through MapReduce jobs. This decreases efficiency and leads to avoidable loss of network resources and computational power.

• Detecting convergence requires an additional MapReduce job to determine whether the convergence point has been reached.

These issues have motivated many researchers to propose improvements to the default execution strategies of MapReduce and Hadoop. Twister [39], HaLoop [20], Spark [130], and iMapReduce [133] are efforts in this direction. HaLoop is built on top of Hadoop. The system caches the static data after the completion of the first iteration; during subsequent cycles, the data is loaded from the cache. To detect the convergence point, the system caches the output of the reducer and compares it at the end of each iteration.
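The driver-loop structure these systems aim to support can be sketched as follows; the toy averaging step and convergence metric are illustrative assumptions, and in HaLoop or iMapReduce the caching and the per-pass computation happen inside the distributed framework rather than in a local loop.

```python
def iterate_until_converged(static_data, state, step, tol=1e-6, max_iters=100):
    """Load the static input once and reuse it, recompute only the evolving
    state each pass, and stop when successive states differ by less than `tol`."""
    for _ in range(max_iters):
        new_state = step(static_data, state)        # one MapReduce-like pass
        delta = sum(abs(new_state[k] - state.get(k, 0.0)) for k in new_state)
        state = new_state
        if delta < tol:                             # convergence check on cached output
            break
    return state

# Toy example: damped averaging over a fixed (static) neighbour graph.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
ranks = {n: 1.0 for n in graph}
def step(g, r):
    return {n: 0.5 + 0.5 * sum(r[m] for m in g[n]) / len(g[n]) for n in g}
print(iterate_until_converged(graph, ranks, step))
```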

Twister extends the MapReduce programming model with support for broadcast and scatter operations. It uses publish/subscribe methods for message communication. The requirements of iterative algorithms are handled through programming extensions; for instance, static data for each iteration is loaded through a newly introduced configure phase. Twister increases execution speed by executing the Map and Reduce phases in an in-memory cache; however, this limits the scalability of the system.

In order to reduce the overhead of MapReduce for iterative tasks, the iMapReduce system introduces the concept of persistent Map and Reduce tasks. During each cycle, the Map task obtains its input from the output directory of the Reduce task. The termination condition is determined by a master which merges all the reduce outputs.

In addition to page rank, many other applications, such as k-means clustering and neural and social network analysis, are likely to benefit from these enhancements.

4) Join Operations

Since MapReduce processes data sequentially, it is also considered inefficient for performing multi-way joins [62]. In order to solve this problem, the authors in [62] proposed a filtering-join-aggregation programming model as an extension of the MapReduce filtering-aggregation model. A new join function is introduced in the framework, which automatically joins multiple datasets according to user-defined criteria. For improved performance, a one-to-many shuffling strategy is also introduced, which shuffles all intermediate key/value pairs in one go.


Strategies for one-to-one and one-to-many joins using the map and reduce phases have also been discussed by Lin [74]. The author also explains the use of MapReduce for computing inverted indexes for search engines. Lin's approach increases the scalability of index operations by using composite keys during the map phase.
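The basic repartition (reduce-side) join that these strategies build on is shown below in a local, single-process form; the in-memory sort stands in for MapReduce's shuffle, and the data is purely illustrative.

```python
from itertools import groupby

def reduce_side_join(left, right):
    """Tag each record with its source relation, shuffle on the join key,
    and pair records that share a key in the reducer."""
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    tagged.sort(key=lambda kv: kv[0])                 # stands in for the shuffle/sort
    joined = []
    for key, group in groupby(tagged, key=lambda kv: kv[0]):
        records = [tv for _, tv in group]
        lefts = [v for tag, v in records if tag == "L"]
        rights = [v for tag, v in records if tag == "R"]
        joined.extend((key, lv, rv) for lv in lefts for rv in rights)
    return joined

users = [(1, "alice"), (2, "bob")]
clicks = [(1, "/home"), (1, "/search"), (2, "/cart")]
print(reduce_side_join(users, clicks))
# [(1, 'alice', '/home'), (1, 'alice', '/search'), (2, 'bob', '/cart')]
```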

5) Dynamic Tasks

In some scenarios, tasks are created dynamically and are dependent on each other. In such cases, Dryad's approach of specifying a directed acyclic graph (DAG) at the time of job submission is restrictive.

CIEL [84] supports dynamic creation of tasks, in which a task can either produce an object (output) or spawn another task. Parallelism is achieved by initiating independent tasks in parallel. Developers are not required to create dynamic task graphs themselves; instead, the graphs are generated using a dedicated scripting language called Skywriting.

Architecturally, a CIEL system is not much different from a Hadoop system. A master node coordinates all the jobs, and worker nodes perform execution and send heartbeat messages to the master. Like Hadoop and Dryad, CIEL supports data locality and transparently provides fault tolerance and scalability. The authors evaluated CIEL by porting many features of Hadoop and setting up a cluster on Amazon EC2.

6) Shared Memory for Page Ranking

For many intensive tasks, such as page rank, k-means computation, and n-body simulation, sharing of state is needed among the computing nodes. For instance, the page-rank algorithm requires access to neighbors' page ranks. However, if a neighbor's page rank is being computed on a different node, then providing access to the shared state becomes challenging [96].

Piccolo [96] is a data-centric tool for intensive calculations which provides shared memory across the computing nodes. This feature is implemented through a kernel which is launched as multiple instances executing concurrently on many computing nodes. Distributed memory state is shared through a set of in-memory tables. Besides providing efficient implementations of many algorithms which require shared memory, Piccolo can also be beneficial for applications which require immediate notification of modifications to shared state.

7) Data Sampling

Smart processing techniques are needed when only a snapshot of the output is required. If the complete dataset is analyzed, then the query takes a long time to yield the result. The problem becomes severe if processing a small subset of the data does not yield the desired result. In [83], the authors focus on this problem. They suggested guidelines through which a user can decide whether Lazy MapReduce (processing on a small subset) would be beneficial and what granularity is needed. The authors suggested that Lazy MapReduce cannot be applied arbitrarily; the benefit of the lazy scheme is high when the processing cost is high.

The work is preliminary and needs to be evaluated further for more extensive results.

8) Searching Over Encrypted Data

Sensitive data may be stored in an encrypted format on a cloud. Instances may arise when a small portion of this data needs to be accessed, for instance, when searching is required over the encrypted data. In such a scenario, decrypting the data before performing the search is possible but expensive.

Wang et al. [123] proposed a rank-based keyword search scheme for this problem. Encrypted files are outsourced to a cloud. Users submit search queries to the cloud, which retrieves ranked results, where ranks are computed according to keyword relevance.

9) Sorting

For fast sorting over a widespread cluster, efficient utilization of disks is important. TritonSort [100] is a dedicated framework for high-speed sorting. It has been applied to sort 100 terabytes of data spanning 832 disks and 52 nodes. In order to maximize disk usage, data is localized, incorporating a higher number of disks per node. In addition, application-specific in-memory buffers are used to minimize disk seeks.


Similarly, GPUTeraSort [46] is a sorting framework from Microsoft Research. It exploits instruction-level parallelism and high-bandwidth GPU memory to sort billions of records. It utilizes I/O operation support and resource management from the CPU to achieve peak I/O performance.

10) Support for Large Number of Files

Some data-intensive applications, such as log processing, require storing a large number of files in a single directory. This feature enables applications to properly isolate data while maintaining the desired storage criteria. However, this requirement is difficult to implement due to lack of support from the underlying file system.

GIGA+ [90] is an effort in this direction. The system distributes a directory over a number of machines in a cluster. The index is partitioned among all servers in the cluster. Caching of the directory is permitted in order to allow faster access. Clients cache the directory index, which may contain stale pointers; an outdated entry is corrected when a client contacts an incorrect server.
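A very small sketch of the client-side lookup idea is shown below; the hash-to-partition mapping and the data structures are illustrative assumptions, and GIGA+ itself splits partitions incrementally and corrects stale client views through server responses.

```python
import hashlib

def partition_of(filename, num_partitions):
    """Map a file name to a directory partition by hashing."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def locate_server(filename, client_view):
    """Client-side lookup with a possibly stale view of how far the directory
    has been split; a wrong guess simply reaches a server that can correct
    the client's cached mapping."""
    p = partition_of(filename, client_view["num_partitions"])
    return client_view["partition_to_server"][p]

client_view = {"num_partitions": 4,
               "partition_to_server": {0: "srv0", 1: "srv1", 2: "srv2", 3: "srv3"}}
print(locate_server("app-2013-04-21.log", client_view))
```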

11) Incorporating Hardware Enhancements

Advancements in hardware technologies provide various opportunities to exploit parallelism and enhance execution speed.

1) Single-Chip Cloud Computer

The Single-chip Cloud Computer (SCC) [105] is an initiative from Intel, in which a large number of cores (48) are embedded on a single chip in order to enhance performance. Each core is capable of behaving like a single compute node and can run a separate OS and software stack. Socket communication (message passing) is used to support communication among the nodes. The SCC platform is supported by fine-grained power management.

In [89], the authors highlight the scalability bottlenecks of MapReduce related to data partitioning and data sorting. They provide a scalable implementation that effectively utilizes the SCC interconnection network and on-chip shared communication buffers. The implementation provides linear or super-linear scaling of applications with realistic datasets on a single SCC node.

Although the performance of the implementation seems promising for the given realistic datasets, the implementation of dynamic task scheduling, the design choices for implementing the full MapReduce execution path (including I/O), and further analysis of applications remain open questions.

2) Incorporating GPUs for Data-Intensive Cloud Computing

GPUs (Graphical Processing Units) have been exploited for data-intensive tasks. Mars [55] is a MapReduce framework motivated by this characteristic. It exploits the GPU's inherent parallelism by mapping independent Map tasks onto GPU threads. DisMaRC [81] is a similar effort, which implements MapReduce using the CUDA programming abstraction for GPUs.

MR-J [68] is a Java-based MapReduce framework which exploits the parallelism of the underlying multi-core processors. A job is recursively divided across multiple cores through a divide-and-conquer strategy.

All of these systems show noticeable improvements over conventional MapReduce systems.

For some data-intensive applications, advancements in hardware technologies such as branch prediction do not offer much performance enhancement [44]. Motivated by this consideration, Gokhale et al. [44] evaluated the performance of GPU- and FPGA-based hardware for data-intensive tasks. The authors observed orders-of-magnitude improvements in computing cycles and speedup for data-intensive tasks. They also utilized NAND flash and I/O memory to stream data to the coprocessors. Although speedup was observed, they noted that the bandwidth limitation of CPU memory limits the increase in speed.

3) Shared Memory Access

Phoenix is a modified version of MapReduce which is focused on shared memory access. Instead of utilizing a distributed file system for communication, the system utilizes shared memory among multi-core systems in order to achieve enhanced performance [125]. While the performance of Phoenix is good for small-scale systems, for large-scale systems shared memory becomes a bottleneck due to non-uniform memory access. Yoo et al. proposed an enhanced version of Phoenix which provides faster execution for up to 256 threads.

Advancements in hardware technologies are likely to provide performance increments for many data-intensive applications; however, the scale of the speedup is likely to vary across applications.

12) Enhanced Scalability for Hadoop

Considering the widespread usage of Hadoop and the growing number of applications benefiting from it, the current model of Hadoop is considered limited in scalability. It has been reported that the framework hits a scalability limit at around 4,000 nodes [5, 87]. The argument is that in order to accommodate the extensive growth rate of data, a new Hadoop-like framework is needed which should be able to support up to 10,000 nodes with 200,000 cores. In the current implementation, a major limitation is the MapReduce JobTracker.

The Hadoop NextGen framework is aimed at providing improved scalability, availability, reliability, backward compatibility, and cluster management. It is also motivated to provide predictable latency to users. The architecture of the next generation of MapReduce is inspired by the Falkon [98] framework, in that resource management and job scheduling are handled by two separate components. For each application, an application master manages the application's scheduling, whereas global resource management is handled by a resource manager.

The framework has been released as Hadoop 0.23.

13) Hybrid Approach for Transactional and Analytical Systems

One of the main arguments against NoSQL systems is the lack of a proper schema and their limited capability to process structured data. In HadoopDB [4], the authors are motivated by this challenge. They proposed a hybrid approach in which a cluster of independent databases is formed. MapReduce is used to implement message communication and achieve scalability, whereas PostgreSQL is used as the database residing on each host. The authors argue that the hybrid approach provides the benefits of both NoSQL systems and parallel databases (see Fig. 2 and Section 2.2).

While the approach used in HadoopDB is promising, its scalability needs to be explored for large data-intensive systems.

14) MapReduce on Different Platforms

The original version of MapReduce was proposed for the Linux platform. It has also been extended to Azure [49] and Android [78].

Although Mobile Cloud Computing (MCC) is emerging, it is not yet ready to handle data-intensive tasks. On the other hand, Azure is a utility-based computing platform and may be utilized for data-intensive tasks.

The enhancements described in this section demonstrate the wide applicability and usage of data-intensive applications. They also present new challenges of heterogeneity, applicability, and extensive evaluation to the research community.

5 Discussions and Research Directions

Data-intensive computing induces enormous challenges of processing data at high speed. High performance computing techniques and advancements in hardware technologies provide extensive solutions for efficient processing in this emerging area. As the rate of data generation climbs and the number of applications that require processing increases, data-intensive computations are moving to clouds. With this move, severe challenges of availability, scalability, fault tolerance, and resource sharing exist. In addition, issues such as data placement and content distribution, privacy, network setup, and billing are also important. Challenges also vary across applications, as they have differing requirements of consistency, usability, flexibility, and data flow.

With the growth in applications and the increase in demand, many new challenges are likely to emerge. First of all, the scalability of data-intensive systems is likely to be challenged. While distributed solutions could provide improved scalability, they incur a higher cost of consistency and latency. Effective utilization of resources and incorporating elasticity in order to meet the needs of data-intensive computing are also important.

Another important area of consideration is the ability to provide real-time processing or streaming for data-intensive systems. While streaming-based solutions exist, they suffer from low scalability (due to timing constraints) and degraded consistency.

As data-intensive applications continue to emerge, requirements for resource optimization and efficient utilization become significant. One major area in this direction is effective utilization of power, or green computing. The idea of nano data centers [117] has been proposed in order to reduce the total power consumption of a cloud. However, with massive storage and processing requirements, nano data centers are unlikely to be adopted for data-intensive computing. Similarly, the efficacy of other power optimization solutions that rely on effective routing and resource provisioning needs to be explored.

A major goal of data-intensive computing platforms is to provide data locality, or to reduce bottlenecks in data transmission. With the increase in size and scale of data-intensive computing, providing locality could become an issue. Further, requirements for consistency should also be explored. In a multi-user environment, fair scheduling could also violate data locality and affect performance.

It has been suggested and observed that transactional systems are unlikely to be deployed on clouds due to their strong requirements of locking, transaction commitment, and consistency. However, recent advancements in data storage have introduced the possibility of using clouds for transactional systems as well [21]. It remains to be seen how such solutions will be adopted by the community. This also raises the question of whether it is feasible to use a MapReduce-like system for transaction processing.

Another important domain of research is the possibility of using more than one data center for a cloud, in which multiple data centers are used beyond the aims of fault tolerance and geo-location. In such cases, hardware solutions, middleware approaches, and software tools need to be effective [47]. In addition, distributed clouds are likely to further elevate issues of consistency.

However, they may be helpful in reducing delay, increasing scalability, and enhancing availability. With this, the possibilities and challenges of providing interoperability between two different clouds managed by two different organizations also need to be investigated.

References

1. Abadi, D.: Data management in the Cloud: limitationsand opportunities. In: IEEE Data Engineering (2009)

2. Abadi, D.: Problems with CAP and Yahoo’s littleknown NOSQL System. Available. http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html. Last accessed 4 Oct 2012

3. Abe, Y., Gibson, G.: pWalrus: Towards better inte-gration of parallel file systems into cloud storage. In:Workshop on Interfaces and Abstractions for Scien-tific Data Storage (IASDS10), co-located with IEEEInt. Conference on Cluster Computing 2010 (Clus-ter10), Heraklion, Greece (2010)

4. Abouzeid, A., Bajda-Pawlikowskim, K., Abadi, D.,Silberschatzm, A., Rasin, A.: HadoopDB: An archi-tectural hybrid of MapReduce and DBMS technolo-gies for analytical workloads. In: VLDB (2009)

5. Agrawal, S.: Hadoop NextGen. Hadoop India Sum-mit (2011)

6. Agrawal, S., Dunagan, J., Jain, N., Saroiu, S., Wolman,A., Bhogan, H.: Volley: Automated data placementfor geo-distributed cloud services. In: Usenix NSDI(2010)

7. Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J.,Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.:DCTCP: efficient packet transport for the commodi-tized data center. In: ACM SIGCOMM (2010)

8. Andersen, D., Franklin, J., Kaminsky, M.,Phanishayee, A., Tan, L., Vasudevan, V.: FAWN: afast array of wimpy nodes. In: Communications of theACM (2011)

9. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz,R., Konwinski, A., Lee, G., Patterson, D., Rabkin,A., Stoica, I., Zaharia, M.: Above the Clouds: ABerkeley View of Cloud Computing. UCB/EECS-2009-28, EECS Department, University of California,Berkeley (2009)

10. Baker, J., Bond, C., Corbett, J., Furman, J., Khorlin,A., Larson, J., Leon, J., Li, Y., Lloyd, A., Vadim,Y.: Megastore: providing scalable, highly availablestorage for interactive services. In: Proceedings ofthe Conference on Innovative Data system Research(CIDR), pp. 223–234 (2011)

11. Balraj, K., Gunabalan, S.: An approach to achievedelegation of sensitive. RESTful resources on storagecloud. In: 2nd Workshop on Software Services: CloudComputing and Applications based on Software Ser-vices. Timisoara (2011)

Page 27: Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Data-Intensive Cloud Computing 307

12. Banker, K.: MongoDB in Action. Manning Publica-tions (2012)

13. Barroso, L.A.: Warehouse-scale computing: enteringthe teenage decade. In: ISCA (2011)

14. Belady, C.: In the data center, power and cooling costsmore than IT equipment it supports. In: ElectronicsCooling Magazine (2007)

15. Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U.,Pasquimi, R.: Incoop: MapReduce for IncrementalComputations. Max Planck Institute. Technical Re-port: MPI-SWS-2011-003 (2011)

16. Bhatotia, P., Wieder, A., Akkus, I., Rodrigues, R.,Acar, U.: Large-scale Incremental Data Processingwith Change Propagation. Usenix Hotcloud (2011)

17. Borthakur D.: HDFS Architecture Guide. ApacheFoundation (2008)

18. Borhtakur, D., Sarma, J., Gray, J.: Apache Hadoopgoes realtime at Facebook. In: ACM SIGMOD,Athens, Greece (2011)

19. Brewer, E.: Towards robust distributed systems. In:ACM Symposium on the Principles of DistributedComputing. Portland, OR, USA (2000)

20. Bu, Y., Howe B., Balazinska, M., Ernst, M.: HaLoop:efficient iterative data processing on large clusters.J. Proceedings VLDB Endowment 3(1–2), 285–296(2010)

21. Cao, Y., Chun Chen, C., Guo, F., Jiang, D., Lin, Y.,Ooi, B., Vo, H., Wu, S., Xu, Q.: ES2: A cloud datastorage system for supporting both OLTP and OLAP.In: IEEE ICDE (2011)

22. Chambliss, D.: An architecture for storage-hosted ap-plication extensions. IBM J. Res. Develop. (0018-8646) 52(4), 427 (2008)

23. Chang, F., Ganapathi, A., Katz, R.: To compressor not to compress—compute vs. IO tradeoffsfor MapReduce energy efficiency. University ofCalifornia–Berkeley. Technical Report (2010)

24. Chen, Y., Dean, J., Ghemawat, S., Hsieh, W., Wallach,D., Burrows, M., Chandra, T., Fikes, A., Gruber, R.:Bigtable: a distributed storage system for structureddata. ACM Trans. Comput. Syst. 26(2), 1–26 (2008)

25. Chen, Y., Alspaugh, S., Borthakur, D., Katz, R.: En-ergy efficiency for large-scale MapReduce workloadswith significant interactive analysis. In: ACM Eu-roSys, Article 4, pp. 1–26 (2012)

26. Close, T.: ACL’s Don’t. Technical Report HP Labo-ratories (2009)

27. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.:MapReduce Online. Usenix NSDI (2010)

28. Cooper, B., Ramakrishnan, R., Srivastava, U.,Silberstein, A., Bohannon, Jacobson, A., Puz, N.,Weaver, D., Yernani, R.: PNUTS: Yahoo!’s hosteddata serving platform. In: VLDB (2008)

29. Cooper, B., Baldeschwieler, E., Fonseca, R., James,J., Kistler, J., Narayan, P., Neerdaels, C., Negrin, T.,Ramakrishnan, R., Silberstein, A., Srivastava, U.,Stata, R.: Building a Cloud for Yahoo!. In: IEEE DataEngineering (2009)

30. Cooper, B., Silberstein, A., Tam, E., Ramakrishnan,R., Sears, R.: Benchmarking Cloud Serving SystemsYCSB. In: SOCC (2010)

31. Dai, J., Huang, J., Huang, S., Bo Huang, B., Liu, Y.:HiTune: dataflow-based performance analysis for bigdata cloud. In: Usenix HotCloud (2011)

32. Das, S., Agrawal, D., Abbadi, A.: ElasTras: an elastictransactional data store in the cloud. In: Usenix Hot-clud (2009)

33. Dean, J., Ghemawat, S.: MapReduce: Simplified dataprocessing on large clusters. In: OSDI (2004)

34. Dean, J., Ghemawat, S.: MapReduce: a flexible dataprocessing tool. Commun. ACM 53(1), 72–77 (2010)

35. DeCandia, G., Hastorun, D., Jampani, M.,Kakulapati, G., Lakshman, A., Pilchin, A.,Sivasubramanian, S., Vosshall, P., Vogels, W.:Dynamo: Amazon’s highly available key-value store.In: Proc. SOSP (2007)

36. DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens,M.L., Kumar, K.B., Muralikrishna, M.: GAMMA:A high-performance dataflow database machine. In:VLDB, pp. 228–237 (1986)

37. DeWitt, D., Stonebraker, M.: MapReduce: A majorstep backwards. Database Column Blog (2008).http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html

38. Dwork, C.: Differential privacy. In: ICALP (2006)39. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae,

S., Qiu, J.: Geoffrey Fox. Twister: a runtime for iter-ative MapReduce. In: Proceedings of the 19th ACMInternational Symposium on High Performance Dis-tributed Computing. HPDC (2010)

40. Elmore, A., Das, S., Agrawal, D., Abbadi, A.: Zephyr:live migration in shared nothing databases for elasticcloud platforms. In: ACM SIGMOD (2011)

41. Fan, B., Tantisiriroj, W., Xiao, L., Gibson, G.:DiskReduce: RAID for data-intensive scalable com-puting. In: PDSW Super Computing (2009)

42. Ford, D., Labelle, F., Popovici, F., Stokely, M.,Truong, V., Barroso, L., Grimes, C., Quinlan, S.:Availability in globally distributed storage systems.In: OSDI (2010)

43. Ghemawat, S., Gobio, H., Leung, T.: The Google filesystem. ACM SIGOPS Oper. Syst. Rev. 7(5), 29–43(2003)

44. Gokhale, M., Cohen, J., Yoo, A., Marcus Miller, M.,Jacob, A., Ulmer, C., Pearce, R.: Hardware technolo-gies for high-performance data-intensive computing.IEEE Computer 41(4), 60–68 (2008)

45. Gorton, I., Greenfield, P. Szalay, A., Williams, R.:Data-intensive computing in the 21st century. IEEEComputer 41(4), 30–32 (2008)

46. Govindaraju, N., Gray, J., Kumar, R., Manocha, D.: GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. MSR Technical Report, December (2005)

47. Grossman, R., Gu, Y.: On the varieties of clouds for data intensive computing. In: IEEE Data Engineering (2009)

48. Gu, Y., Grossman, R.: Towards Efficient and Simplified Distributed Data Intensive Computing. IEEE Trans. Parallel Distrib. Syst. 22(6), 974–984 (2010)

49. Gunarathne, T., Wu, T., Qiu, J., Fox, G.: MapReduce in the clouds for science. In: IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom) (2010)

50. Haddad, I.: PVFS: a parallel virtual file system for Linux clusters. Linux J. 2000(80) (2000)

51. Hadoop. The Apache Hadoop Project. http://hadoop.apache.org/

52. Harnik, D., Kolodner, E., Ronen, S., Satran, J., Shulman-Peleg, A., Tal, S.: Secure access mechanisms for cloud storage. In: 2nd Workshop on Software Services: Cloud Computing and Applications based on Software Services (2011)

53. HBase: The Apache HBase Project. http://hbase.apache.org/

54. HBql Homepage—http://www.hbql.com/. Last accessed 10 Oct 2012

55. He, B., Fang, W., Govindaraju, N., Luo, Q., Wang, T.: Mars: a MapReduce framework on graphics processors. In: PACT (2008)

56. He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: IEEE ICDE (2011)

57. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Usenix NSDI (2011)

58. Hive HBase Integration. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

59. HStreaming Project. http://www.hstreaming.com/. Last accessed 7 Oct 2012

60. Huang, J., Ouyang, X., Jose, J., Wasi-ur-Rahman, M., Wang, H., Luo, M., Subramoni, H., Murthy, C., Panda, D.: High-performance design of HBase with RDMA over InfiniBand. In: IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS) (2012)

61. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS/EuroSys (2007)

62. Jiang, D., Tung, A.K.H., Chen, G.: Map-Join-Reduce: Towards Scalable and Efficient Data Analysis on Large Clusters. IEEE (2010)

63. Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, 4–6 May 1997). STOC '97. ACM Press, New York, pp. 654–663 (1997)

64. Kc, K., Anyanwu, K.: Scheduling Hadoop jobs to meet deadlines. In: IEEE CloudCom (2010)

65. Ko, S., Hoque, I., Cho, B., Gupta, I.: On availability of intermediate data in cloud computations. In: Usenix HotOS (2009)

66. Kolodner, E.: Data-intensive storage services on clouds: limitations, challenges, and enablers. In: 2nd Workshop on Software Services: Cloud Computing and Applications based on Software Services (2011)

67. Kouzes, R., Anderson, G., Elbert, S., Gorton, I., Gracio, D.: The changing paradigm of data-intensive computing. IEEE Computer 42(1), 26–34 (2009)

68. Kovoor, G., Singer, J., Luján, M.: Building a Java MapReduce framework for multi-core architectures. In: Third Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG) (2010)

69. Krevat, E., Tucek, J., Ganger, G.: Disks are like snowflakes: no two are alike. In: HotOS (2011)

70. Krishnan, R., Madhyastha, H., Jain, S., Srinivasan, S., Krishnamurthy, A., Anderson, T., Gao, J.: Moving beyond end-to-end path information to optimize CDN performance. In: Internet Measurement Conference (IMC), pp. 190–201 (2009)

71. Kung, H., Lin, C.-K., Vlah, D.: CloudSense: Continuous fine-grain cloud monitoring with compressive sensing. In: Usenix HotCloud (2011)

72. Lakshman, A., Malik, P.: Cassandra—a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010)

73. Lang, W., Patel, J.: Energy management for MapReduce clusters. In: VLDB '10 (2010)

74. Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers (2010)

75. Lin, J., Ryaboy, D., Weil, K.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce (2011)

76. Logothetis, D., Olston, C., Reed, B., Webb, K., Yocum, K.: Stateful bulk processing for incremental analytics. In: Proc. ACM Symposium on Cloud Computing, SoCC '10 (2010)

77. Logothetis, D., Trezzo, C., Webb, K., Yocum, K.: In-situ MapReduce for log processing. In: Usenix HotCloud (2011)

78. Marinelli, E.: Hyrax: Cloud computing on mobile devices using MapReduce. MS thesis, CMU (2009)

79. Meisner, D., Sadler, C., Barroso, L., Weber, W., Wenisch, T.: Power management of online data-intensive services. In: ISCA '11 (2011)

80. Miceli, C., Miceli, M., Jha, S., Kaiser, H., Merzky, A.: Programming abstractions for data intensive computing on clouds and Grids. In: 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid) (2009)

81. Mooley, A., Murthy, K., Singh, H.: DisMaRC: A Distributed MapReduce framework on CUDA. UT Austin Tech Report (2009)

82. Moretti, C., Bulosan, J., Thain, D., Flynn, P.: All-Pairs: an abstraction for data-intensive cloud computing. IEEE Trans. Parallel Distrib. Syst. 21(1), 33–46 (2010)

83. Morton, K., Balazinska, M., Grossman, D., Olston, C.: The case for being lazy: How to leverage lazy evaluation in MapReduce. In: Proceedings of the 2nd International Workshop on Scientific Cloud Computing, ScienceCloud (2011)

84. Murray, D., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S.: CIEL: a universal execution engine for distributed data-flow computing. In: NSDI (2011)

85. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: S&P (2008)


86. Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: distributed stream computing platform. In: Data Mining Workshops (ICDMW) (2010)

87. Next Generation of Hadoop. Blog. http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

88. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: ACM SIGMOD (2008)

89. Papagiannis, A., Nikolopoulos, D.: Scalable runtime support for data-intensive applications on the single-chip cloud computer. In: 3rd Many-core Applications Research Community (MARC) Symposium (2011)

90. Patil, S., Gibson, G.: Scale and concurrency of GIGA+: file system directories with millions of files. In: Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11), San Jose, CA (2011)

91. Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD '09. ACM (2009)

92. Peng, D., Dabek, F.: Large-scale incremental processing using distributed transactions and notifications. In: OSDI (2010)

93. Peterson, Z., Gondree, M., Beverly, R.: A position paper on data sovereignty: the importance of geolocating data in the cloud. In: Usenix HotCloud (2011)

94. Pig. http://pig.apache.org/

95. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Sci. Program. J. (Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure) 13(4), 227–298 (2005)

96. Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: Usenix OSDI (2010)

97. Qiao, L.: Integration of server, storage and database stack: moving processing towards data. In: 2008 IEEE 24th International Conference on Data Engineering, p. 1200 (2008)

98. Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., Wilde, M.: FALKON: a Fast and Light-weight task executiON framework. In: ACM SC (2007)

99. Raicu, I., Foster, I., Zhao, Y., Szalay, A., Little, P., Moretti, C., Chaudhary, A., Thain, D.: Towards data intensive many-task computing. In: Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management (2012)

100. Rasmussen, A., Porter, G., Conley, M., Madhyastha, H.V., Mysore, R.N., Pucher, A., Vahdat, A.: TritonSort: a balanced large-scale sorting system. In: Usenix NSDI (2011)

101. Ren, K., López, J., Gibson, G.: Otus: resource attribution in data-intensive clusters. In: MapReduce (2011)

102. Riak. https://wiki.basho.com/display/RIAK/Riak (2011)

103. Roy, I., Setty, S., Kilzer, A., Shmatikov, V., Witchel, E.: Airavat: security and privacy for MapReduce. In: Usenix NSDI (2010)

104. Sakr, S., Liu, A., Batista, M., Alomari, M.: A survey of large scale data management approaches in cloud environments. IEEE Commun. Surv. Tutor. 13(3), 311–336 (2011)

105. SCC. Single-chip Cloud Computer Project. http://www.intel.com/content/www/us/en/research/intel-labs-single-chip-cloud-computer.html. Last accessed 6 Oct 2012

106. Schmuck, F., Haskin, R.: GPFS: a shared-disk file system for large computing clusters. In: FAST '02: Proceedings of the 1st USENIX Conference on File and Storage Technologies. USENIX Association, Berkeley, CA (2002)

107. Shang, P., Wang, J.: A novel power management for CMP systems in data-intensive environment. In: Parallel & Distributed Processing Symposium (IPDPS) (2011)

108. Sharma, B., Chudnovsky, V., Hellerstein, J., Rifaat, R., Das, C.: Characterizing logical constraints in Google compute clusters. In: Symposium on Cloud Computing (2011)

109. Shieh, A., Kandula, S., Greenberg, A., Kim, C., Saha, B.: Sharing the Data Center Network. In: NSDI (2011)

110. Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes. Commun. ACM 53(1), 65–71 (2010)

111. Storm. https://github.com/nathanmarz/storm/wiki. Last accessed 7 Oct 2012

112. Tan, J., Pan, X., Kavulya, S., Marinelli, E., Gandhi, R., Narasimhan, P.: Kahuna: Problem diagnosis for MapReduce-based cloud computing environments. In: 12th IEEE/IFIP NOMS (2010)

113. Teradata Corp. Database Computer System Manual, Release 1.3. Los Angeles, CA (1985)

114. Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sarma, J., Murthy, R., Liu, H.: Data warehousing and analytics infrastructure at Facebook. In: ACM SIGMOD (2010)

115. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive—a petabyte scale data warehouse using Hadoop. In: ICDE (2010)

116. Tian, C., Zhou, H., He, Y., Zha, L.: A Dynamic MapReduce Scheduler for Heterogeneous Workloads. In: IEEE GCC (2009)

117. Valancius, V., Laoutaris, N., Massoulié, L., Diot, C., Rodriguez, P.: Greening the internet with nano data centers. In: ACM CoNEXT (2009)

118. Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E., Andersen, D., Ganger, G., Gibson, G., Mueller, B.: Safe and effective fine-grained TCP retransmissions for datacenter communication. In: ACM SIGCOMM (2009)

119. Verma, A., Cherkasova, L., Campbell, R.: ARIA: automatic resource inference and allocation for MapReduce environments. In: Autonomic Computing Conference ICAC (2011)


120. Wachs, M., Ganger, G.: Co-Scheduling of disk head time in cluster-based storage. In: IEEE SRDS (2009)

121. Wachs, M., Ganger, G.: Improving storage bandwidth guarantees with performance insulation. Technical Report, Parallel Data Laboratory, Carnegie Mellon University (2010)

122. Wachs, M., Xu, L., Kanevsky, A., Ganger, G.: Exertion-based billing for cloud storage access. In: Usenix HotCloud (2011)

123. Wang, C., Cao, N., Li, J., Ren, K., Lou, W.: Secure Ranked Keyword Search Over Encrypted Cloud Data. IEEE Computer Society (2010)

124. Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., Qin, X.: Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In: IEEE IPDPSW (2010)

125. Yoo, R., Romano, A., Kozyrakis, C.: Phoenix rebirth: scalable MapReduce on a large-scale shared-memory system. In: IISWC (2009)

126. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P., Currey, J.: DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI (2008)

127. Zaharia, M., Konwinski, A., Joseph, A., Katz, R., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Usenix OSDI (2008)

128. Zaharia, M., Borthakur, D., Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.: Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. University of California–Berkeley, Technical Report (2009)

129. Zaharia, M., Borthakur, D., Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.: Job Scheduling for Multi-User MapReduce Clusters. University of California–Berkeley, Technical Report (2009)

130. Zaharia, M., Chowdhury, M., Franklin, M., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10) (2010)

131. Zaharia, M., Das, T., Li, H., Shenker, S., Stoica, I.: Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Usenix HotCloud (2012)

132. Zhang, B., Ruan, Y., Wu, T., Qiu, J., Hughes, A., Fox, G.: Applying Twister to scientific applications. In: 2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2010) (2010)

133. Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. In: Proceedings of DataCloud 2011: The First International Workshop on Data Intensive Computing in the Clouds (2011)

