
    Cloud Computing in Distributed Systems

    Seminar Report

    Submitted in partial fulfillment of the requirements

    for the degree of

    Bachelor of Technology

    by

    Gurmeet Singh

    Roll No. 07010150

    Under the supervision of

    Dr. Diganta Goswami

    Associate Professor

    Department of Computer Science and Engineering

    Indian Institute of Technology, Guwahati

    India

    April 2010


    Acknowledgements

    I sincerely thank my supervisor Dr. Diganta Goswami, whose guidance helped me to do a literature review on the topic Cloud Computing in Distributed Systems. His suggestions to critically analyze the documents and focus on the issues addressed in the existing work with an innovative viewpoint helped me develop a research outlook.

    I would extend my thanks to Mr. Karthik R, M.Tech second year student at IITG, for explaining to me his work with Dr. Goswami on An Open Cloud Architecture for Provision of IaaS during the initial stages of the research.

    I also thank Dr. Saikat Guha, researcher at Microsoft Research Bangalore, for suggesting that I read the work done on the Google File System and Amazon's highly available key-value store during my search for work implemented on a large scale.


    Abstract

    Cloud computing, an Internet-based model in which resources, software and information are provided to computers on demand like a public utility, is emerging as a platform for sharing resources such as infrastructure, software and various applications. The majority of cloud computing infrastructure consists of reliable services delivered through data centers and built on servers. Clouds often appear as single points of access for all of a consumer's computing needs. Commercial offerings of the cloud are expected to meet quality-of-service guarantees for customer satisfaction and typically offer service level agreements. The deployment of cloud computing can be easily observed while working on the Internet, be it Google Docs or Google Apps, YouTube video sharing or Picasa image sharing, Amazon's shopping cart or eBay's PayPal; the examples are numerous. This report does a literature survey on some of the prominent applications of cloud computing, and on how they meet the requirements of reliability, availability of data, scalability of software and hardware systems, and overall customer satisfaction.


    Contents

    1 Introduction

    2 Eucalyptus
    2.1 Design
    2.2 Associated Issues

    3 Service Orientation
    3.1 Cloud Computing Open Architecture
    3.2 Performance Model Driven QoS Guarantees and Optimization

    4 File System
    4.1 Design Of Google File System
    4.2 Issues Addressed by the Implementation

    5 Data Processing
    5.1 Implementation of MapReduce
    5.1.1 Using Map and Reduce
    5.1.2 Execution
    5.2 Issues Addressed

    6 Data Availability
    6.1 Design of Dynamo
    6.1.1 Observations
    6.1.2 Architecture
    6.2 Issues Addressed

    7 Conclusion


    Chapter 1

    Introduction

    Cloud computing includes hosting several services over the Internet, divided into three categories: Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS).

    SaaS is a model of software deployment where a provider licenses an application to users for use as a service on demand. The vendors may host the application on web servers, or download the application to the consumer's device and disable it after the on-demand contract expires. Google Apps, Google Docs [7], Acrobat.com and Salesforce.com are major SaaS providers.

    In PaaS, deployment of applications is provided without the cost and complexity of buying and managing the underlying hardware and software layers, supplying all of the facilities required to support the complete life cycle of web applications. Amazon Web Services [2], Azure Services Platform, Rackspace Cloud and Google App Engine are some examples of this category.

    IaaS is the delivery of computer infrastructure, usually a virtualized platform: instead of purchasing servers, software, data center space or network equipment, clients buy these resources as a fully outsourced service. IaaS offerings like Amazon Web Services provide virtual server instances with unique IP addresses and blocks of storage on demand. Amazon Elastic Compute Cloud [2] and Eucalyptus [1] are prominent examples of IaaS.

    This report looks into the design considerations and system architecture of some of the well-known applications of cloud computing. The use of core distributed systems techniques is highlighted in these renderings. Further addressed are the issues faced by the developers before, during and after the design implementation. A comprehensive report of diverse cloud renderings, having different requirements for their systems due to varying expectations of the customers, is presented.

    Following this is a chapter on Eucalyptus, an IaaS rendering developed for research purposes and research environments. Next is a review of design suggestions focused on customer satisfaction, using Service Oriented Architecture and Quality of Service guarantees to the customers while optimizing the profit of the cloud vendors. The Google File System, discussed next, focuses on requirements specific to Google, such as the dominance of append operations in contrast to random writes. The chapter on MapReduce that follows aims to improve performance while keeping the design as simple as using just the Map and Reduce functions of functional programming. Lastly, Amazon's Dynamo looks into the issues of high reliability and availability while trading off consistency to achieve them.


    Chapter 2

    Eucalyptus

    EUCALYPTUS [1] is an open-source cloud-computing framework for research purposes that uses storage and computational infrastructure. It is composed of several hierarchical components, viz. the Cloud Controller, Cluster Controller and Node Controller, which interact with each other while supplying facilities to the cloud client.

    Cloud computing systems delivering Infrastructure as a Service dynamically provision Virtual Machine instances to the client for hosting software services. Scheduling of the VM instances is one of the crucial questions in cloud computing. Eucalyptus attempts to solve the issues of VM scheduling, storage of data, networking between the nodes of the cloud and the definition of user interfaces.

    2.1 Design

    The four components of Eucalyptus, each of which has its own Web-service interface for communication with the other components, are described as follows:

    Node Controller: Every node that runs Virtual Machine instances executes a Node Controller, NC. An NC is expected to reply to the describeResource and describeInstance queries from the Cluster Controller (CC) about the node's number of cores, memory size or available disk space, and to handle its subsequent control requests: runInstance, by creating a virtual network endpoint and instructing the hypervisor to run the instance, and terminateInstance, by instructing the hypervisor to end the VM, tearing down the network endpoint and cleaning the local data.

    Cluster Controller: A CC is the head of many NCs forming a cluster. It has the job of connecting the Cloud Controller, CLC, to the NCs. It distributes the general requests of the CLC to all nodes in the cluster and also trickles down the specific requests of the CLC to a set of nodes in the cluster.

    Cloud Controller: The CLC issues runInstances, describeInstances, terminateInstances and describeResources commands to a CC or a set of CCs. It manages all this information and, being the only entry point to the cloud, it schedules the VM instances. The CLC also gives users a visible interface to the cloud, for them to sign up and query the system, as well as a cloud administrator interface for inspecting the availability of system components.

    Walrus: Walrus is a data storage service which streams data in and out of the cloud, and also stores the VM images that are uploaded to it and accessed from the nodes. It supports concurrent and serial data transfer.

    Apart from these high-level components, an essential part of Eucalyptus is the Virtual Overlay Network, a VLAN implementation running over the top of the Virtual Machines. Users attach a VM instance to a network at boot time. There is a unique VLAN tag for each such network, which helps connect VMs to the public Internet and at the same time separates VMs belonging to different cloud allocations.
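    The shape of a Node Controller's Web-service interface can be pictured with a short sketch. This is a hypothetical Python illustration, not Eucalyptus code: only the operation names (describeResource, runInstance, terminateInstance) come from the description above, while the hypervisor wrapper and its method names are assumptions.

        # Hypothetical sketch of a Node Controller's handlers. Only the
        # operation names come from Eucalyptus; the hypervisor API used
        # here is invented for illustration.
        class NodeController:
            def __init__(self, hypervisor):
                self.hypervisor = hypervisor   # assumed hypervisor wrapper
                self.instances = {}            # instance id -> VM handle

            def describe_resource(self):
                # Report cores, memory and disk to the Cluster Controller.
                return {"cores": self.hypervisor.free_cores(),
                        "memory_mb": self.hypervisor.free_memory(),
                        "disk_gb": self.hypervisor.free_disk()}

            def run_instance(self, instance_id, image, vlan_tag):
                # Create the virtual network endpoint, then instruct the
                # hypervisor to boot the VM attached to it.
                endpoint = self.hypervisor.create_endpoint(vlan_tag)
                self.instances[instance_id] = self.hypervisor.start_vm(image, endpoint)

            def terminate_instance(self, instance_id):
                # Instruct the hypervisor to end the VM, tear down its
                # network endpoint and clean the local data.
                vm = self.instances.pop(instance_id)
                self.hypervisor.stop_vm(vm)
                self.hypervisor.destroy_endpoint(vm.endpoint)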


    2.2 Associated Issues

    Designed for academic and research purposes, Eucalyptus deploys an infrastructure for VM creation controlled by the user. During the design, the main issue was the use of resources found within research environments. Hence the design of Eucalyptus uses hardware commonly found in existing laboratories, including Linux clusters and server farms.

    The networking used is simple, flat virtual networking, which addresses three issues:

    Connectivity: The virtual overlay network provides connectivity of nodes to the public Internet and to other nodes running VM instances scheduled by the same cloud allocation. Connectivity can also be partial, so that at least one VM instance from a set of instances has connectivity to the Internet, through which the user can log in and access all the instances.

    Isolation: The overlay network isolates the network of the nodes of one cloud allocation from that of the nodes of another cloud allocation for security reasons. This prevents a VM instance of one cloud allocation from acquiring the MAC address of a physical resource and interfering with VM instances of other cloud allocations on the same resource.

    Performance: Owing to the reduced performance overheads of virtual networking in recent years, the use of such a network design is favoured.

    Research is further facilitated by the modular nature of the design, helping researchers replace one component for enhancement without the need to interfere with the others.

    Eucalyptus's simple design is such that it offers just the basic requirement of provisioning of services [4]. It suffers from huge internal network traffic due to frequent access to the data centers by the nodes. Cloud systems are configured with the peak traffic taken into consideration, so most of the nodes, and hence the resources, are left idle most of the time [5].

    In AOCAPI [5], when the CC gives a removeInstance command to the NC, the NC will neither remove the disk image from the machine nor disturb the file system. It will just mark the instance as disabled, which is treated the same as removing the image. Hence it can again be marked as enabled and run when the image has to be reloaded. This would eliminate the overhead of fetching the disk image from the data center, hence reducing network overheads.

    For this purpose of smart scheduling, an address controller is used, which decides the address to which each user request must be forwarded. The address controller consults the usage register and the recent index. The usage register monitors the usage of all nodes by recording information like the CPU load exerted on a node by each virtual machine instance running on it. The recent index records the recent set of nodes used by each user: it stores the address of the virtual machine instance to which a request was last sent for a user, along with a time stamp to find the most recent one.
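    A minimal sketch of this scheduling decision, with the usage register and recent index as plain dictionaries; the load threshold and function names are assumptions for illustration, not taken from [5]:

        import time

        usage_register = {}   # node address -> current CPU load (0.0 .. 1.0)
        recent_index = {}     # user -> (address of last-used instance, timestamp)

        LOAD_THRESHOLD = 0.8  # assumed cutoff above which a node counts as busy

        def forward_address(user):
            # Prefer the node the user used most recently, if not overloaded,
            # so its still-resident (disabled) disk image can be re-enabled
            # without fetching it from the data center again.
            recent = recent_index.get(user)
            if recent and usage_register.get(recent[0], 1.0) < LOAD_THRESHOLD:
                node = recent[0]
            else:
                # Otherwise fall back to the least-loaded registered node.
                node = min(usage_register, key=usage_register.get)
            recent_index[user] = (node, time.time())
            return node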


    Chapter 3

    Service Orientation

    The aim of a cloud computing platform is to deliver services to the cloud clients, yet most of the platforms have not yet adopted a service oriented architecture (SOA) to provide Quality of Service guarantees to the clients. At the same time, there should be run-time optimization of the cloud so as to attain maximum profit in the cloud, constrained by those QoS guarantees and the Service Level Agreements, SLAs, between the cloud vendor and the clients.

    3.1 Cloud Computing Open Architecture

    Figure 3.1: CCOA Overview. Figure taken from [6].

    The Cloud Computing Open Architecture, CCOA, presented in [6] amalgamates service oriented architecture with virtualization techniques through its seven architectural principles and ten interconnected modules. This architecture meets the end objectives of creating a scalable provisioning platform for cloud computing which can be configured based on the customer requirements, proposing shared services to provide cloud offerings to business consumers in a unified way, and maximizing business value through the monetization of computing.

    The first of the seven principles, illustrated along with the ten modules in Fig. 3.1 (also from [6]), Integrated Cloud Ecosystem Management, includes four modules which are interdependent and give and take services from one another. The Cloud Vendor Dashboard is used for managing the internal operations of the cloud; the Cloud Partner Dashboard serves cloud partners who collaborate with cloud vendors to leverage services to the cloud client, while also providing them components through its interface to the rest of the cloud. The Cloud Client Dashboard is the centre of the unified framework clients use to access services, like Web portals, a program-based channel for business and enterprise users, or a phone-based customer representative channel for individual customers. Cloud vendors and cloud clients also interface with the Cloud Ecosystem Management module, which supervises cloud activities while managing memberships.

    The second principle, Virtualization of Infrastructure, is met by using hardware components in plug-and-play mode for hardware virtualization, and by managing software images, code, sharing and so on for software virtualization. The module used is the Core Infrastructure of the Cloud.

    The third principle is Service Orientation, which is provided by the Cloud Horizontal Business submodule, comprising the platform services shared by a range of customers, and the Cloud Vertical Business submodule, comprising services that are more domain or industry specific.

    The fourth principle, Extensible Provisioning and Subscription for the cloud, segregates Cloud Provisioning Services from Cloud Subscription Services, which share a role-defining framework and a notification framework but operate the provisioning process and the subscription process separately.

    The fifth, Configurable Cloud Offerings, covers the cloud business solutions in the form of IaaS, like the storage cloud of Google Docs; SaaS, like the software leveraged by PayPal for customers; Application as a Service, like web-based development tools; and Business Process as a Service, like software testing platforms.

    The next module, the Cloud Information Architecture, is responsible for the effective communication of the various modules with each other and helps meet the sixth principle of Unified Information Exchange.

    Lastly, the Cloud Quality Governance module identifies quality indicators and governs their state, using the Quality of Service parameters to define reliability, response time and security. With this module we attain the most important principle, Cloud Quality and Governance.

    This architecture successfully amalgamates the power of service oriented architecture, missing in many cloud offerings, with the existing use of virtualization technology, missing in pure service oriented architectures.

    3.2 Performance Model Driven QoS Guarantees and Optimization

    While the previous work focuses on how to provide services to the customers in a unified way while making the best of the resources, the Performance Model Driven Cloud in [4] monitors the performance delivered by the cloud, ensures QoS guarantees to customers, and optimizes the profits in the cloud constrained by these QoS guarantees and SLAs.

    A performance model predicts and makes optimal decisions about many decision variables, foreseeing the interactions among these decisions and hence optimizing decisions in autonomic control. A performance model like the LQM (layered queueing model) corresponds well to layered resource behaviour. The performance parameters of the LQM are external services, the CPU demands of entries, and the requests within entries. The LQM is an extended queueing network model which predicts throughputs, queueing delays, service delays and the utilization of resources.

    Quality of Service is a goal of cloud management which is treated as a constraint on the resource optimization, that is, seeking maximum profit out of a minimum number of resources. For a service of class c, the associated price the customer pays to the application is Pc, and the response time of the service is assumed to be the measure of its QoS.

    The workload given to a class of service c describes the intensity of the stream of user requests for the service, in terms either of a throughput fc for user class c, or of the number Nc of users that are interacting and their think time Zc, which represents a user's mean delay between receiving a response and issuing the next request. For each service class c there is a required throughput fc,min or a required user response time Rc,max. Rc,max can be expressed as a minimum user throughput requirement using Little's result:

    fc ≥ fc,min = Nc / (Rc,max + Zc)
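    As a quick numeric check of this conversion (the figures below are invented for illustration, not taken from [4]):

        # Little's-law conversion of a response-time requirement into a
        # minimum throughput requirement. All numbers are illustrative.
        N_c = 100      # interacting users of class c
        R_c_max = 2.0  # required response time, seconds
        Z_c = 8.0      # think time, seconds

        f_c_min = N_c / (R_c_max + Z_c)
        print(f_c_min)  # 10.0 requests per second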


    Now the original delay requirements are changed to throughput requirements, and the optimization will consider only the throughput.

    A Network Flow Model, NFM, depicting the flow of execution demands at the processors, is used for the purpose of optimization. The nodes of the NFM are the entities, and the arcs with their weights represent the flow of demand in CPU-sec of execution per second.

    Figure 3.2: Network Flow Model. Figure taken from [4].

    Each host h has a price of CPU execution Ch per CPU-sec, including unused CPU-sec allocated in order to reduce contention delays. In the NFM results, each task t has a reservation δht in CPU-sec per sec on some host h. With the sums below ranging over the user classes c and the (host, task) pairs (h, t) involved with an application App, the profit from App is:

    PROFITApp = Σc∈App Pc fc − Σ(h,t)∈App Ch δht

    The cloud optimization is to maximize the total profits:

    TOTAL = ΣApp PROFITApp
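    A toy evaluation of this objective, with invented prices, throughputs and reservations, might look as follows:

        # Toy evaluation of the profit objective above; every value is invented.
        P = {"gold": 0.05, "silver": 0.02}    # price paid per request, by class
        f = {"gold": 40.0, "silver": 120.0}   # achieved throughput, requests/sec
        C = {"h1": 0.01, "h2": 0.008}         # CPU price per CPU-sec, by host
        delta = {("h1", "web"): 1.5,          # reserved CPU-sec/sec per (host, task)
                 ("h2", "db"): 0.9}

        revenue = sum(P[c] * f[c] for c in P)
        cost = sum(C[h] * delta[(h, t)] for (h, t) in delta)
        print(revenue - cost)                 # PROFITApp for this application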

    This approach of optimization for profit maximization is effective as well as scalable to meet new challenges of cloud computing. The scalability for very large clouds shall come from scaling the performance model calculations by partitioning them over subsets of processors. These subsets can be very large, though, as observed during the implementations of [4], so as to accommodate many applications. Further work is being done to account for VM overhead costs, memory allocation, communication delays and the licensing costs of software replicas.


    Chapter 4

    File System

    Google is one of the most prominent online examples of the cloud computing paradigm. It has a host of offerings, ranging from the storage cloud of Google Documents to the application cloud of Google Apps. For this purpose, the underlying filesystem must be adapted to the needs of the customers, keeping in mind the nature of operations that most of the clients perform.

    The Google File System [7] is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

    Several observations led to the development of the filesystem:

    Since component failures are the norm for such a large system, the system must have constant monitoring, error detection, fault tolerance and automatic recovery as integral components.

    The block sizes and I/O operation parameters, among other design assumptions, must keep in mind that file sizes are huge; multi-GB files are quite common.

    Appending new data at the end of files is more common than overwriting existing data. Moreover, once written, the files are only read, and often only sequentially; random writes are practically non-existent. Hence performance optimization and atomicity guarantees for the append operation became the focus of the design, while caching data blocks at the client lost its appeal.

    Lastly, co-designing the applications and the filesystem's application programming interface hugely benefits the overall system by increasing flexibility; for instance, relaxing GFS's consistency model simplified the file system without burdening the applications.

    4.1 Design Of Google File System

    A Google File System cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. GFS client code associated with each application implements the file system API and communicates with the master and chunkservers to exchange data on behalf of the application.

    Chunkservers and the Chunks: Files are divided into chunks of fixed size, discussed in detail in Section 4.2, and each chunk is assigned a unique chunk handle by the master at its creation. The chunks are replicated on multiple chunkservers for the purpose of reliability; by default three replicas are stored, though the user is given full facility to increase or decrease the level of replication for particular regions of the file namespace.


    A chunkserver stores the information about which chunks it has. The master does not persistently store a record of which chunkservers have a replica of a given chunk. But since all client queries regarding chunk locations come to the master, the master maintains this database by polling the chunkservers at start-up and later exchanging HeartBeat messages with them.

    Master

    The GFS master, like the chunkservers and the clients, is a Linux machine that maintains all the file system metadata, including namespaces, access control information, file-to-chunk mappings and chunk locations. Yet the master does not become a bottleneck for the system: clients never read or write file data through the master.

    If a client wants to read data from a file, it finds the chunk index within the file using the chunk size and the byte offset specified by the application. It then asks the master which chunkserver it should request for that file name and index, and temporarily caches the obtained chunk handle and replica locations using the file name and index as the key. Further interaction is with the chunkserver only, and the master is not disturbed while the file is read or written. The master is next requested when the cached information expires or the file is reopened.
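    The client-side translation just described is tiny; a sketch follows, with the cache layout assumed for illustration and the chunk size taken from Section 4.2:

        CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, see Section 4.2

        location_cache = {}  # (file name, chunk index) -> (chunk handle, replicas)

        def chunk_lookup(file_name, byte_offset, ask_master):
            chunk_index = byte_offset // CHUNK_SIZE  # which chunk holds the offset
            key = (file_name, chunk_index)
            if key not in location_cache:
                # Only on a cache miss does the client contact the master;
                # all reads and writes then go directly to chunkservers.
                location_cache[key] = ask_master(file_name, chunk_index)
            return location_cache[key]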

    The master stores three kinds of metadata in its memory: file and chunk namespaces, mappings from files to chunks, and the locations of each chunk's replicas. Since the metadata is stored in memory, master operations are fast. The first two kinds of metadata are, for the sake of reliability, kept persistently by logging operations on the master's local disk and also replicated on remote machines, while the last kind, chunk replica locations, is not stored persistently, as discussed above.

    As introduced above, the master maintains operation logs that contain records of metadata changes and a logical timeline of operations, for reliability. The logical timeline helps to identify files and chunks uniquely and to define the order of concurrent operations. The log size should be small to keep start-up fast, hence the master checkpoints its state whenever the log grows beyond a certain size; the master then uses another file for logging while the checkpoint is created in a separate thread. Recovery needs only the latest checkpoint and the subsequent log files.

    The master is also involved in many other processes, such as atomic record appends, write operations, snapshot operations, namespace management and locking, replica placement, garbage collection and stale replica detection. Implementation details of some of these are given in the following Section 4.2; please refer to [7] for further details.

    4.2 Issues Addressed by the Implementation

    A large number of issues are addressed by the file system, the details of which are obscured by the implementation. These have been studied and are listed below:

    The new file system obviates the requirement of caching file data at both the clients and the chunkservers. Clients need not cache, as the files and working sets are huge and most references to data are sequential, so caching would bring little benefit but would invite cache coherence issues and complicate the client. Chunkservers also need not cache file data, as chunks are stored in the form of local files at the chunkservers, which can utilize Linux's built-in buffer cache.

    Deciding the chunk size is one of the key design parameters. A large chunk size reduces the clients' need to send the master many file-name-and-index requests to get chunk handles and locations; reduces network overhead, since more of a client's operations are then with the same chunkserver and it can maintain a persistent TCP connection to that chunkserver for a longer duration; and obviously reduces the metadata stored at the master.

    Conversely, for a small file with a small number of chunks, perhaps just one, the few chunkservers holding it can become hot spots for access. Practically, applications using GFS read files spread over many chunks sequentially, so hot spots are not a major issue. The chunk size is accordingly chosen to be 64 MB, which is quite large compared to typical filesystem block sizes.


    If the master persistently stored the chunk locations, then the problem of keeping the master and chunkservers in sync would crop up. Hence GFS polls the chunkservers at start-up and uses HeartBeat messages to monitor chunkserver status. This makes life easy as chunkservers in a large cluster join and leave the cluster, change names, restart, fail and so on.

    GFS decouples the data flow and the control flow entirely. Control flows from the client to the Primary Replica, which holds the lease on the chunk given by the master, and then from the Primary Replica to the other, Secondary, Replicas; data can flow in any order, so the client sends it to the nearest chunkserver, be it primary or not, and similarly every chunkserver forwards data to the nearest chunkserver that has not yet received it. Serial data transfer fully utilizes each machine's network bandwidth, compared to sending data in a topology such as a tree; forwarding data to the nearest chunkserver avoids network bottlenecks and high-latency links; and latency is minimized by pipelining, each chunkserver forwarding the data onward as soon as it starts receiving it.

    Figure 4.1: Write Control Flow and Data Flow. Figure taken from [7].

    From the observation of the kinds of data requests made by clients, it follows that append operations are the most common ones, so GFS provides an atomic append operation called Record Append, which is heavily used by multiple clients on different machines to append to the same file concurrently. GFS appends the given data to the file at least once atomically (i.e., as one continuous sequence of bytes), at an offset of its own choosing, and returns that offset.

    Since the chunk size is fixed, the amount of data that can be appended to a chunk in one operation must also be restricted, else chunks would soon overflow. For this, when the client sends an append request (control flow) to a Primary Replica while pushing the data to all replicas serially, the Primary checks whether adding the data would cause an overflow. If so, it pads the chunk to its maximum size, tells the Secondaries to do the same, and then replies to the client that it should retry on another chunk. Fixing the record append data to at most one fourth of the chunk size ensures at least four appends on a chunk.
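    A sketch of the primary's overflow check under the rules just stated; the bytearray stands in for the primary's chunk contents, and the return convention is an assumption for illustration:

        CHUNK_SIZE = 64 * 1024 * 1024
        MAX_RECORD = CHUNK_SIZE // 4       # guarantees at least four appends per chunk

        def record_append(chunk: bytearray, record: bytes):
            # `chunk` stands in for the primary replica's current chunk contents.
            assert len(record) <= MAX_RECORD
            if len(chunk) + len(record) > CHUNK_SIZE:
                # Pad to the maximum size; secondaries are told to do the same.
                chunk.extend(b"\0" * (CHUNK_SIZE - len(chunk)))
                return None                # client must retry on the next chunk
            offset = len(chunk)            # offset chosen by GFS, same on all replicas
            chunk.extend(record)
            return offset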

    If the append fails at any of the replicas, the client retries it, which can lead to different replicas of the same chunk having different data, possibly with partial or complete duplicates of the same record. Such replicas are allowed in GFS, since record append only requires that the record be written on all replicas at the same offset; what lies at other offsets is immaterial, as those offsets are not given to the application, so it won't fetch them in the normal case.

    The Snapshot operation makes an instantaneous copy of a file or a directory tree without much interruption to ongoing mutations. It is used by users to quickly create branch copies of huge data sets, or to checkpoint the current state before making changes that might require an easy rollback.

    The Snapshot operation is mostly handled by the master, which logs the operation immediately on arrival of a request. It then revokes the leases it gave to the chunkservers holding chunks of the affected files, so that when a client later wants to write to a chunk, the write is not allowed due to the absence of a lease at the expected chunkserver, and the client contacts the master to find the lease holder; the master will at that point create a copy of the chunk. This shifts the huge overhead of the snapshot operation, from the time the snapshot is being taken, to a small overhead as the snapshotted file's chunks are written.

    After the master revokes the leases on the file's chunks, it creates a copy of the file's metadata which points to the same chunks as the source file. These chunks then have a reference count of more than one (at least two), which is observed when a write request for such a chunk arrives; the master then decides to first create a copy of the chunk on the same chunkserver (hence reducing network overhead) before replying to the client.

    Replica Placement: The chunks of GFS are replicated at chunkservers distributed across multiple machine racks, to ensure reliability in situations such as the failure of an entire rack or a network switch temporarily disconnecting a rack from the system. This also helps to utilize the network bandwidth of multiple racks, especially for chunk read requests, since reads of a chunk are likewise distributed over chunkservers (and hence across racks).

    Garbage Collection: The lazily done garbage collection by the GFS master has several notable points in its mechanism, which enhance the performance of the system, listed below:

    Safety against accidental deletes: When a file is deleted by an application, the master logs the operation, marks the file deleted and renames it to a hidden name. The file can still be seen by the application under the hidden name, as well as restored to its original name, hence providing safety against accidental deletes.

    Scan of the file system namespace: The master does a regular scan of the file system namespace, during which it removes hidden files that have existed for more than three days (a configurable interval).

    Scan of the chunk namespace: The master's similar regular scan of the chunk namespace identifies chunks not reachable from any file and erases the metadata of those chunks. During the HeartBeat messages with the chunkservers, each chunkserver reports to the master a subset of its chunks, and the master tells it which chunks are not present in the master's metadata and hence are free to be deleted by the chunkserver.

    Simple and Reliable: The above method is simple and reliable for a large-scale distributed system like GFS, where component failure is the norm. In situations like chunk creation succeeding on some chunkservers and failing on others, the master may not even be aware of the existence of some chunks, which are then removed during garbage collection.

    The Major Advantage: The primary advantage of the garbage collection mechanism is that it is mostly done along with background activities of the master, like the HeartBeat messages and the chunk and file system namespace scans. The master does garbage collection when it is relatively free, so as to give timely service to the client requests.

    The Main Disadvantage: The only disadvantage is encountered when applications create and delete temporary files repeatedly, leading to tight storage and preventing instant re-usability of storage. This is addressed by accelerating storage re-usability when a deleted file is deleted again, just like deleting a file from the trash or recycle bin on our personal systems. Also, users can apply different replication and reclamation policies to files in different parts of the namespace, and configure the time after which a hidden file is removed from the namespace and the in-memory metadata.

    Availability Issues: Component failures in the system can lead to unavailability of data or, in the worst case, corrupted data. Here we discuss how GFS ensures availability; the following item on Data Integrity explains how corruption of data is handled. Availability is ensured at three levels:

    Fast Recovery: All the components of the system, the chunkservers and the master, are designed to restart within seconds from a shutdown as well as from a failure, and to restore their states in no time. In fact, there is no difference between a normal termination and a component failure.

    Chunk Replication: As discussed earlier, a chunk is replicated on several chunkservers, usually three, across different machine racks, to ensure reliability and hence availability.

    Master Replication: The master state, operation logs and checkpoints are all replicated on multiple machines for reliability. A change to the master's state is considered committed only after it has been successfully written to the local disk as well as to all replicas.

    When the master fails, it can restart in no time. But when it cannot, perhaps due to a disk or machine failure, a Monitoring Infrastructure outside GFS starts A New Master on a different machine with the replicated log and a canonical name, say gfs-test, which is nothing but a DNS alias of the master.

    Also provided are Shadow Masters, which lag the primary master by a fraction of a second, to enhance scalability and availability for read operations. Due to this negligible lag, applications can end up reading stale file metadata, like directory contents, but not stale file data itself, as data is read from the chunkservers. Shadow masters also provide read-only access to the filesystem when the master is down.


    Data Integrity: It is not feasible to check whether a replica holds uncorrupted data by comparing data across several chunkservers, both because of the network and performance overheads and because GFS does not guarantee that legal replicas of the same chunk hold identical data. Consider, for example, the record append operation, which can leave chunks with different data (as discussed in detail above) at offsets not used by the application; the mere existence of such unused data could lead to the false interpretation that the data on a chunk is illegal.

    So GFS verifies the correctness of the data on a chunk (ensuring reliability against component failures corrupting data) by simply keeping a 32-bit checksum for each 64 KB block of a 64 MB chunk, storing it with the other chunk metadata and persisting it with logging. Some notable points about the checksum mechanism, sketched in code after the list below, are:

    Read Operation: When a client or another chunkserver requests data from a chunkserver, the chunkserver verifies the checksums for the range of blocks requested and only then returns the data. If there is some corrupted data, the chunkserver returns an error to the requester and reports the mismatch to the master. While the requester asks the master for the location of some other replica of the chunk and repeats the request, the master copies a valid replica from another chunkserver to a new chunkserver; on completion, the master instructs the chunkserver with the corrupted data to delete its replica. GFS client code reduces the performance overhead by trying to align reads at block boundaries.

    Append Operation: Being the dominant operation for GFS, checksum calculation for appends is heavily optimized, by incrementally updating the checksum for the last partially written block of a chunk and computing new checksums for the successive blocks. If that partially check-summed block already held corrupted data, we don't detect it now; it is detected during a later read operation, eliminating overhead from the append operation.

    Write Operation: If a write operation's data traverses blocks such that the operation partially overwrites the first and/or the last block of the range, then we need to check whether the data not being overwritten in those blocks is correct. So the write operation does the check and then continues to write. A mismatch is handled in a way similar to the read operation.
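    A minimal sketch of the per-block checksumming described in this list; GFS's actual checksum function is not specified in the source, so CRC-32 stands in here:

        import zlib

        BLOCK = 64 * 1024  # 64 KB checksum blocks within a 64 MB chunk

        def block_checksums(chunk_bytes):
            # One 32-bit checksum per 64 KB block of the chunk.
            return [zlib.crc32(chunk_bytes[i:i + BLOCK])
                    for i in range(0, len(chunk_bytes), BLOCK)]

        def verify_range(chunk_bytes, checksums, start, end):
            # Verify every block overlapping [start, end) before serving a read;
            # a mismatch is reported to the master and served from another replica.
            for b in range(start // BLOCK, (end - 1) // BLOCK + 1):
                block = chunk_bytes[b * BLOCK:(b + 1) * BLOCK]
                if zlib.crc32(block) != checksums[b]:
                    return False
            return True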

    Chunkservers can, during their idle times, scan and verify the contents of inactive chunks which have not been checked by any read/write operation for a long time; for every illegal chunk detected at the scanning chunkserver, the master places a valid replica elsewhere. This prevents the master from falsely believing that it has enough legal copies of inactive chunks while most of the copies might have been corrupted.

    Hence GFS, with its centralized master approach, meets the requirements of simplicity, flexibility, reliability, fault tolerance and high performance. It is widely used within Google as a storage platform as well as for research purposes, for instance in deploying MapReduce, explained in the next chapter.


    Chapter 5

    Data Processing

    For data processing and computations on large data sets, cloud computing needs a simplified programming model that is parallelizable and fault tolerant, and that handles data distribution and load balancing over a large distributed system. On top of that, it should be easy to use. To address these needs, Google came up with MapReduce [3], which delivers all these requirements along with ease of use, as the messy details of parallelization and the like are hidden in a library. The implementation and issues addressed, discussed in the subsequent sections, explain the working of this simple approach to data processing.

    5.1 Implementation of MapReduce

    MapReduce is based on the Map and Reduce primitives of functional programming languages like Lisp. This is because most of the computations at Google involve a Map function to map a set of input values to a set of intermediate (key, value) pairs, then combining the intermediates with the same key, and finally Reducing the combined values for each key to get the final output.

    5.1.1 Using Map and Reduce

    There is a MapReduce library which a user of MapReduce uses to express the computation to be done in terms of Map and Reduce functions; only these two functions have to be written by the user.

    Map: This function has to be written by the user to produce, from a given input (key, value) pair, a set of intermediate (key, value) pairs, in such a way that all the values for the same intermediate key, say I, are grouped together by the underlying MapReduce library and given as input to the Reduce function.

    This can be summarized as: map (k1, v1) → list(k2, v2)

    The Map function, as in normal functional programming, can change the domain of the input key and values, so that (k1, v1) is transformed into the (k2, v2) domain.

    Reduce: This function, also written by the user, accepts an intermediate key I from the MapReduce library and a set of intermediate values for that key. It merges or combines the data of these intermediate values into a smaller set of values; mostly zero or one output values are generated. To handle the situation of a very large set of intermediate values, an iterator is used to deliver the intermediate values to the Reduce function.

    This can be summarized as: reduce (k2, list(v2)) → list(v2)

    The Reduce function, as expected, does not change the domains of its input keys and values.

    Exemplification: For instance, in the simple case of finding the lines of a file that contain occurrences of a set of patterns, a distributed system would prefer a Distributed Grep over a conventional one. The Map function has to be written such that it sends a line to the output if it contains the pattern (the key), whereas Reduce is an identity function that copies its input to the output file.

    Similarly, to count the number of times each word occurs in a large collection of files, Map takes (file name, file contents) as input and returns (word, number of occurrences) pairs as the intermediate output. The Reduce function, continuously getting the list of count values for each word from its iterator as input, keeps incrementing the count for each word and outputs the final count per word.
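    A compact sketch of this word-count example in the style above. The real MapReduce library is a distributed system; this single-process driver only illustrates the (key, value) flow between the user's two functions, and the function names are chosen for illustration:

        from collections import defaultdict

        def map_fn(file_name, contents):
            # Emit (word, 1) for every word occurrence in the file.
            for word in contents.split():
                yield (word, 1)

        def reduce_fn(word, counts):
            # Sum all partial counts for one word.
            yield sum(counts)

        def run(inputs):
            grouped = defaultdict(list)  # the library's grouping step
            for name, contents in inputs.items():
                for key, value in map_fn(name, contents):
                    grouped[key].append(value)
            return {k: next(reduce_fn(k, v)) for k, v in grouped.items()}

        print(run({"f1": "the cat sat", "f2": "the cat ran"}))
        # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}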

    5.1.2 Execution

    Figure 5.1: Execution of Map and Reduce. Figure taken from [3].

    The figure above shows the flow of a MapReduce operation. The input data is partitioned into a set of M splits, processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R partitions, using a user-defined partitioning function, say hash(key) mod R; note that R is also defined by the user.

    1. The MapReduce library splits the input files into M pieces of typically 16 MB to 64 MB per piece and forks many copies of the program on a cluster of machines.

    2. One of the copies of the program is the Master, which assigns the M Map tasks and R Reduce tasks to the rest of the workers. The master picks idle workers and assigns each one a map task or a reduce task. The master is also responsible for storing, for each task, its state (idle, in-progress or completed) and the identity of the worker machine handling each non-idle task.

    3. A worker assigned a map task reads the contents of the corresponding input split, parses (key, value) pairs out of the input data, passes each pair to the Map function, and buffers the intermediate (key, value) pairs in memory.

    4. The buffered pairs are periodically written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who forwards these locations to the reduce workers.


    5. A reduce worker, on being notified by the master about these locations, uses remote procedure calls to read the data from the local disks of the map workers and sorts the data by the intermediate keys, possibly using an external sort if the data is large.

    6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of intermediate values to the user's Reduce function, whose output is appended to a final output file for this reduce partition.

    7. When all map tasks and reduce tasks have been completed, the master wakes up the user program, and the MapReduce call in the user program returns to the user code.

    5.2 Issues Addressed

    Fault Tolerance: To hide failures from users, the MapReduce library must handle fault tolerance. The master finds out whether a worker has failed through its periodic ping messages to the worker; failure to reply makes the master mark the worker as failed.

    On failure of a Map Worker, whether its task is in progress or the worker has already completed its map task, the task has to be re-executed on some other worker with its state reset to idle, because the failed worker wrote its output to its local disk, which is now inaccessible to the Reduce Workers. The master ignores a completion message for an already completed task; if the completion message is for a map task never completed earlier, the master records the names of the R files where the Map Worker wrote its output, and all Reduce Workers subsequently access this new worker for the intermediate output.

    On failure of a Reduce Worker while its task is in progress, the task is reset to the idle state and rescheduled on some other worker. An already completed reduce task need not be re-executed, since Reduce Workers write their output to the global filesystem. If the same reduce task is executed on multiple machines, there will be multiple rename operations for the same final output file on completion; this is handled by the atomic rename operation of the underlying filesystem, in this case GFS (Chapter 4).

    Failure of the Master could be handled by making the master write periodic checkpoints and, on a failure, recovering from the previous checkpoint. But since there is only one master, the probability of its failure is negligible, so the current implementation of MapReduce aborts the computation and restarts the entire process if the client wants.

    Network Bandwidth: Google observed that network bandwidth is a scarce resource, and MapReduce therefore conserves it by preferring to schedule a map task on the same machine that contains a replica of its input data (refer to the GFS chapter). Failing that, it tries to schedule the task on a machine on the same network switch as one containing a replica. Further, writing a single copy of the intermediate data to the Map Worker's local disk saves network bandwidth.

    Load Balancing: For best performance, the values of M and R should be much larger than the number of available workers: firstly to improve dynamic load balancing, by spreading tasks over many workers and assigning them new tasks as and when they complete; and secondly to speed up recovery when a worker fails, since its many completed map tasks can then be re-executed spread across many other machines.

    But M is limited, since for the underlying file system a data chunk can be read in full only if it is less than 64 MB. R, on the other hand, is often limited by the user, as increasing R increases the number of output files. So M is chosen so that each task handles roughly 16 MB to 64 MB of input, and R is a small multiple of the number of worker machines. Typically, with 2,000 worker machines, a MapReduce computation uses M = 200,000 and R = 5,000.

    Partitioning Function: MapReduce leaves it to the user to specify the partitioning function that divides the intermediate data into R partitions, since the user may want the output arranged in a particular fashion. For instance, if the output keys are URLs, the user can specify a partitioning function such that all entries for the same host fall into the same output file, using something like hash(Hostname(urlkey)) mod R.
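    A sketch of that host-based partitioning function; urlparse stands in for the Hostname helper named above, and a stable CRC-32 replaces an unspecified hash so that every worker computes the same partition:

        import zlib
        from urllib.parse import urlparse

        def partition(url_key, R):
            # A stable hash of the hostname sends all URLs from one host
            # to the same reduce partition, hence the same output file.
            host = urlparse(url_key).hostname or ""
            return zlib.crc32(host.encode()) % R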

    Combiner Function: There can be a lot of repetition in the intermediate keys produced by map tasks; for instance, in the case of the word counting example, where word frequencies follow a Zipf distribution, each map task will produce many records of the form (the, 1), all of which would have to be sent over the network to a single reduce task. To reduce this overhead, MapReduce allows the user to specify an optional Combiner function, executed on each machine that performs a map task, that does partial merging of this data before it is sent over the network. This function speeds up certain classes of MapReduce operations.
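    A sketch of such a combiner for the word-count example above; it collapses the many (word, 1) records a map task emits into one (word, n) record per word before anything crosses the network:

        from collections import defaultdict

        def combine(pairs):
            partial = defaultdict(int)
            for word, count in pairs:    # e.g. [("the", 1), ("the", 1), ("cat", 1)]
                partial[word] += count
            return list(partial.items()) # [("the", 2), ("cat", 1)]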

    The implementation of MapReduce scales to large clusters comprising thousands of machines, making efficient use of the machine resources and hence being suitable for many of the large computational problems encountered at Google. It has been widely deployed for large-scale machine learning problems, clustering problems for the Google News and Froogle products, extraction of data used to produce reports of popular queries, extraction of properties of web pages for new experiments and products, and large-scale graph computations.

    The use of functional programming with user-specified Map and Reduce functions leads to easily parallelized large computations. Fault tolerance is handled by re-execution, data distribution by the Master, and load balancing by techniques as simple as carefully choosing the numbers of Map and Reduce tasks. All this is provided while hiding the complexities from the user, making MapReduce good for simplified large-scale data processing.


    Chapter 6

    Data Availability

    Amazon Web Services is a collection of remote computing services (also called web services) that together make up a cloud computing platform, offered over the Internet by Amazon.com. In August 2006, Amazon introduced the Amazon Elastic Compute Cloud (Amazon EC2), a virtual site farm allowing users to use the Amazon infrastructure, with its high reliability, to run diverse applications ranging from running simulations to web hosting.

    The biggest challenge Amazon.com, one of the largest e-commerce platforms in the world, faces is reliability at a big scale, as even the slightest breakdown can lead to significant financial consequences and shake customer trust. Dynamo, used by some of Amazon's core services, is a highly available key-value storage system used to provide an always-on experience. To achieve high availability, Dynamo sacrifices consistency under certain failure scenarios.

    6.1 Design of Dynamo

    Since strong consistency and high data availability cannot be achieved simultaneously when dealing with the possibility of network failures, Dynamo is designed to be an eventually consistent data store: all updates reach all replicas eventually, changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated, in contrast to the Google File System discussed earlier.

    6.1.1 Observations

    Some interesting observations about Amazon, apart from its applications requiring high availability, that influenced Dynamo's design are listed below:

    A large part of Amazon's services can work with a simple query model having simple read and write operations, and do not need any relational schema. Dynamo targets applications that need to store objects that are relatively small (usually less than 1 MB), as opposed to the files of terabyte order handled by GFS.

    Since data stores with strong consistency guarantees tend to have poor availability, Dynamo targets applications that can operate with weaker consistency (the C of ACID) if this results in high availability. Dynamo does not provide any isolation guarantees and permits only single-key updates.

    On Amazon's platform, services have stringent latency requirements, so services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements. As a result, the trade-offs are in performance, cost efficiency, availability and durability guarantees.

    A crucial requirement for many Amazon applications is an always-writable data store, where no updates are rejected due to failures or concurrent writes. Dynamo targets such applications.


    6.1.2 Architecture

    Dynamo stores objects through two operations: get() and put(). The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object, or a list of objects with conflicting versions, along with a context. The put(key, context, object) operation finds where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context includes information such as the version of the object.

    The core distributed systems techniques used in Dynamo are partitioning, replication, versioning, membership, failure handling and scaling, each of which is explained below.

    Partitioning: To make Dynamo scale incrementally, it uses a mechanism that dynamically partitions the data over the set of storage hosts, called nodes. Dynamo uses consistent hashing to distribute the load across multiple storage hosts. Consistent hashing treats the output range of a hash function as a fixed circular ring. Each node is assigned a random value within this space, which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. Each node is thus responsible for the region of the ring between itself and its predecessor node, and the joining or leaving of a node affects only its neighbours; other nodes remain unaffected.

    Since the random position assignment of each node on the ring leads to non-uniform data and load distribution, Dynamo uses a variant of consistent hashing: instead of mapping a node to a single point on the ring, each node is assigned to multiple points. A virtual node looks like a single node in the system, but when a new node is added, it is assigned multiple positions, called tokens, on the ring.

    Using virtual nodes has the following advantages (a sketch of the ring follows the list):

    When a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes.

    When a node becomes available again, or a new node is introduced to the system, the newly available node accepts a roughly equal amount of load from each of the other available nodes.

    Heterogeneity in the physical infrastructure is taken into consideration by assigning the number of virtual nodes a node is responsible for based on its capacity.
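    The sketch promised above: consistent hashing with virtual nodes in a few lines. MD5 stands in for Dynamo's (unspecified here) hash function, and the token count per node is an illustrative parameter:

        import bisect
        import hashlib

        def ring_pos(s):
            # Position on the fixed circular ring of hash outputs.
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        class Ring:
            def __init__(self, nodes, tokens=8):
                # Each physical node owns `tokens` positions (virtual nodes).
                self.ring = sorted((ring_pos(f"{n}#{t}"), n)
                                   for n in nodes for t in range(tokens))

            def node_for(self, key):
                # Walk clockwise from the key's position to the next node position.
                i = bisect.bisect(self.ring, (ring_pos(key),)) % len(self.ring)
                return self.ring[i][1]

        ring = Ring(["A", "B", "C", "D"])
        print(ring.node_for("shopping-cart:12345"))  # coordinator node for this key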

    Figure 6.1: Nodes and Keys in the ring. Figure taken from [2].

    Replication: To achieve high availability and durability, Dynamo replicates its data on multiple hosts: each data item is replicated at N hosts. Each key, k, is assigned to a coordinator node, which is in charge of the replication of the data items that fall within its range. Each node is responsible for the region of the ring between it and its Nth predecessor. In Figure 6.1, node B replicates the key k at nodes C and D in addition to storing it locally; node D will store the keys that fall in the ranges (A, B], (B, C] and (C, D]. The list of nodes responsible for storing a particular key is called the preference list. The preference list for a key is constructed by skipping positions on the ring to ensure that it contains only distinct physical nodes, as a node may hold more than one virtual position on the ring.

    Version Handling: Dynamo uses vector clocks in order to find conflicts between different versions of the same object. A vector clock, associated with every version of every object, is a list of (node, counter) pairs. If the counters on the first object's clock are less than or equal to all of those on the second object's clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation.
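    A sketch of that ancestor test, with clocks as dictionaries mapping node to counter (a missing node counts as zero); the example versions mirror the D1/D3/D4 scenario of Figure 6.2:

        def descends(a, b):
            # True if clock b is an ancestor of (or equal to) clock a.
            return all(a.get(node, 0) >= counter for node, counter in b.items())

        d1 = {"Sx": 2}             # object written twice at node Sx
        d3 = {"Sx": 2, "Sy": 1}    # later update handled by node Sy
        d4 = {"Sx": 2, "Sz": 1}    # concurrent update handled by node Sz

        print(descends(d3, d1))                      # True: d1 can be forgotten
        print(descends(d3, d4) or descends(d4, d3))  # False: conflict, reconcile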

    Figure 6.2: Version Handling. Figure taken from [2].

In Dynamo, when a client wishes to update an object, it must specify which version (found in the context obtained from an earlier read operation) it is updating. On receiving a read request, if Dynamo has multiple branches (as shown in Figure 6.2 above) that cannot be syntactically reconciled, it will return all the objects at the leaves, with the version information in the context. An update using this context is considered to have reconciled the divergent versions, and the branches are collapsed into a single new version; in the figure, D3 and D4 are reconciled to D5 by the node Sx, which merges the vector clocks of the two versions.
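Continuing the dict-based sketch above, the reconciliation of Figure 6.2 can be traced as follows (the clock values are taken from the figure as given in [2]):

```python
# D3 and D4 both descend from D2 = {"Sx": 2}, but neither descends
# from the other, so a read returns both leaves for reconciliation.
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}

# The reconciled version D5 merges the two clocks element-wise (max)
# and is then bumped by the coordinating node Sx handling the write.
d5 = {n: max(d3.get(n, 0), d4.get(n, 0)) for n in d3.keys() | d4.keys()}
d5["Sx"] += 1   # resulting counters: Sx=3, Sy=1, Sz=1
```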

Ring Membership: The administrator uses a command line tool or a browser to connect to a Dynamo node and issue a membership change to join a node to a ring or remove a node from a ring. A gossip-based protocol, in which each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories, propagates the membership changes. This helps to maintain an eventually consistent view of membership. When a node starts for the first time, it chooses its set of tokens and maps nodes to their respective token sets.

Failure Detection: Since nodes are informed of permanent node membership changes by the explicit node join and leave methods, temporary node failures can be detected by the individual nodes when they fail to communicate with others while forwarding requests. Node A may consider node B failed if node B does not respond to node A's messages, even if B is responsive to node C's messages. Node A then uses alternate nodes to service requests that map to B's partitions, and periodically retries B to check for its recovery.

Scalability: When a new node X is added to the system, say between nodes A and B in Figure 6.1, it becomes responsible for storing keys in the ranges (F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no longer need to store the keys in these respective ranges, and they will offer to transfer the appropriate set of keys to X. When a node is removed from the system, the reallocation of keys happens in the reverse process.
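A sketch of this handoff with single-token nodes at illustrative integer positions and N = 3 (the positions, names and helper functions are assumptions for illustration): it reports exactly the ranges that B, C and D give up when X joins between A and B.

```python
def stored_range(positions: list, i: int, n: int = 3):
    # With N-way replication, the node at positions[i] stores the keys
    # in the ring interval (positions[i - n], positions[i]]; indices wrap.
    return (positions[(i - n) % len(positions)], positions[i])

def ranges(ring: dict, n: int = 3) -> dict:
    pos = sorted(ring.values())
    name = {v: k for k, v in ring.items()}
    return {name[p]: stored_range(pos, i, n) for i, p in enumerate(pos)}

ring = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60, "G": 70}
before = ranges(ring)
ring["X"] = 15                    # X joins between A (10) and B (20)
after = ranges(ring)
for node in before:
    if before[node] != after[node]:
        # Prints B, C and D: each shrinks its stored range, and the
        # dropped keys, (F, G], (G, A] and (A, X], move to X.
        print(node, before[node], "->", after[node])
```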

    6.2 Issues Addressed

Conflict Resolution: For many Amazon services, rejecting customer updates would result in a poor customer experience: the shopping cart service, for example, must allow customers to add and remove items from their cart even amidst network and server failures. Hence the complexity of conflict resolution is pushed to reads, in order to ensure that writes are never rejected.

Secondly, while the data store can only use simple policies such as "last write wins" for conflict resolution, the application is more aware of the data schema and can choose a resolution mechanism that best satisfies the client; the shopping cart, for example, should merge two conflicting versions into a unified shopping cart.

Decentralization: In the past, a centralized approach resulted in several outages; to avoid this, a decentralized approach is used, which leads to a simpler, more scalable and more available system, as well as maintaining symmetry among the nodes.

Latency Sensitivity: Dynamo is built for applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these stringent latency requirements, each node maintains enough routing information locally to route a request to the appropriate node directly.

Vector Clock Size: Multiple server failures can lead to an object being written by nodes that are not in the top N nodes of the preference list, causing the vector clock to grow. This is handled by extending each clock entry to a (node, counter, timestamp) triplet, where the timestamp records the last time that node updated the object. When the number of triplets in an object's clock reaches a maximum allowable value, the triplet with the oldest timestamp is removed.
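A sketch of this truncation, extending the earlier dict representation so that each node maps to a (counter, timestamp) pair; the threshold of 10 entries is an illustrative assumption.

```python
import time

MAX_ENTRIES = 10   # illustrative threshold, not Dynamo's actual value

def record_write(clock: dict, node: str) -> dict:
    # clock: node -> (counter, timestamp of that node's last update)
    counter, _ = clock.get(node, (0, 0.0))
    clock[node] = (counter + 1, time.time())
    if len(clock) > MAX_ENTRIES:
        # Evict the entry of the node that updated the object least
        # recently, bounding the size of the clock.
        oldest = min(clock, key=lambda n: clock[n][1])
        del clock[oldest]
    return clock
```

As noted in [2], truncation means ancestor relationships can no longer be derived exactly, so a reconciliation may occasionally be forced where none was needed; in practice this problem had not surfaced in production.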

Ring Membership: In Amazon's environment, node outages are often transient and may last for extended intervals, but rarely imply a permanent failure; therefore they should not result in re-balancing of the partition assignment or repair of the unreachable replicas. Hence Dynamo uses an explicit mechanism to initiate the addition and removal of nodes from a Dynamo ring.

However, the ring membership protocol discussed earlier can result in a logically partitioned Dynamo ring. For instance, suppose the administrator adds node A to the ring and later adds node B: nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other, leading to a logical partition of the key space.

To prevent logical partitions, some Dynamo nodes play the role of seeds, which are nodes discovered via an external mechanism and known to all nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely.

Performance and Durability: A few customer-facing services required higher levels of performance than the 99.9th percentile guarantee provides. For these, each write operation is stored in an in-memory buffer and periodically written to storage by a writer thread. Read operations first check whether the requested key is present in the buffer; if so, the object is served from the buffer, avoiding storage-engine overheads.

This scheme trades durability for performance, as a server crash can result in the loss of writes that were queued up in the buffer. To reduce the durability risk, Dynamo has one of the N replicas perform a normal durable write for each write operation. Since the coordinator waits only for W responses, the performance of the write operation is not gated by the single durable write among the otherwise non-durable ones.
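A much-simplified, single-process sketch of this buffered-write path (the class and parameter names are assumptions; real Dynamo replicas are separate servers):

```python
import threading
import time

class BufferedStore:
    def __init__(self, flush_interval: float = 1.0):
        self.buffer = {}    # key -> object, not yet durable
        self.storage = {}   # stand-in for the durable storage engine
        self.lock = threading.Lock()
        threading.Thread(target=self._writer, args=(flush_interval,),
                         daemon=True).start()

    def _writer(self, interval: float):
        # Writer thread: periodically drains the buffer to storage. A
        # crash before a flush loses the buffered writes, which is the
        # durability/performance trade-off described above.
        while True:
            time.sleep(interval)
            with self.lock:
                self.storage.update(self.buffer)
                self.buffer.clear()

    def put(self, key, value, durable: bool = False):
        # One replica per write is asked for a durable write
        # (durable=True); the rest only buffer, so the coordinator's W
        # responses are not delayed by disk latency.
        with self.lock:
            if durable:
                self.storage[key] = value
            else:
                self.buffer[key] = value

    def get(self, key):
        # Reads consult the buffer first so recent writes are visible.
        with self.lock:
            return self.buffer.get(key, self.storage.get(key))
```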


    Chapter 7

    Conclusion

The designs of several diverse platforms deploying cloud computing have been studied in detail, and their issues have been highlighted so that similar solutions can be deployed easily. For instance, research environments can simply use Linux clusters and server farms with the easy and modular design of Eucalyptus. Applications that need to focus on delivering the best services to customers while maximizing profit have the aforementioned architectures to deliver services through service orientation, or to guarantee QoS while optimizing the vendor's profit. Applications requiring high availability can use a design similar to Dynamo's to trade off between availability and consistency, while those focusing on performance can deploy Google's simple MapReduce architecture. The Google File System, apart from detailing a system that is scalable, fault tolerant and delivers high performance, also teaches us to observe customer behavior and requirements closely and to optimize the system so as to deliver the best service possible.


    Bibliography

[1] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-source Cloud-computing System. In Proceedings of Cloud Computing and Its Applications [online], October 2008.

[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-value Store. In SOSP '07, October 14–17, 2007, Stevenson, Washington, USA.

    [3] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

[4] Jim (Zhanwen) Li, John Chinneck, Murray Woodside, Marin Litoiu, and Gabriel Iszlai. Performance Model Driven QoS Guarantees and Optimization. In CLOUD '09: Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, pages 15–22.

    [5] Karthik R, Diganta Goswami. An Open Cloud Architecture for Provision of IaaS. [Accepted]

[6] Liang-Jie Zhang and Qun Zhou. CCOA: Cloud Computing Open Architecture. In IEEE International Conference on Web Services, 2009.

[7] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In SOSP '03, October 19–22, 2003, Bolton Landing, New York, USA.